More robust `None` handling #3195
I also created a PR regarding […]
Looking good !
I just have a question about padding when concatenating datasets or adding a column, and some nitpicks:
@lhoestq I addressed your comments, added tests, did some refactoring to make the implementation cleaner and added support for […]. My only concern is that during decoding […]
Cool ! :D
Yes, that makes sense to only fill with nan if the type is compatible
After some more experimenting, I think we can keep auto-cast to float because PyArrow also does it:

```python
import numpy as np
import pyarrow as pa

# None present: int32 -> float64
arr = pa.array([1, 2, 3, 4, None], type=pa.int32()).to_numpy(zero_copy_only=False)
assert arr.dtype == np.float64
```

Additional changes:
Very nice ! My final comments:
LGTM ! Thank you :)
The CI failure on Windows is unrelated to this PR, merging
PyArrow has explicit support for `null` values, so it makes sense to support Nones on our side as well.

Colab Notebook with examples

Changes:
- […] (`ClassLabel`, `TranslationVariableLanguages`, `Value`, `_ArrayXD`)
- […] `class_encode_column` (also there is an option to stringify Nones and treat them as a class)
- […] `sort` (use pandas for that)
- […] `None` to `np.nan` to align the behavior with PyArrow)
- […] `pa.concat_tables(table_list, promote=True)`) and `null` row/column broadcasting similar to pandas

Additional notes:
- […] `null` instead of `none` for function arguments, for consistency with the existing `disable_nullable`
- […] `update_metadata_with_features` call in `Dataset.rename_columns`

TODO:
- […] `concatenate_datasets`/`add_item`

TODOs for subsequent PRs:
- add `drop_null`/`fill_null` to `Dataset`/`DatasetDict`

Fix #3181 #3253