I also created a PR regarding |
lhoestq
left a comment
Looking good !
I just have a question about padding when concatenating datasets or adding a column, and some nitpicks:
|
@lhoestq I addressed your comments, added tests, did some refactoring to make the implementation cleaner and added support for My only concern is that during decoding |
|
Cool ! :D
Yes, it makes sense to fill with NaN only if the type is compatible.
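To make the "only if the type is compatible" idea concrete, here is a minimal sketch using plain NumPy. The helper name `fill_nulls` and its `promote_ints` flag are hypothetical and are not part of this PR or of the `datasets` API; the sketch only assumes that float dtypes can hold NaN while integer dtypes cannot.

```python
import numpy as np


def fill_nulls(values, dtype, promote_ints=False):
    """Hypothetical helper: turn None into NaN only when the dtype can hold it."""
    dtype = np.dtype(dtype)
    if not any(v is None for v in values):
        # Nothing to fill, keep the requested dtype.
        return np.array(values, dtype=dtype)
    filled = [np.nan if v is None else v for v in values]
    if np.issubdtype(dtype, np.floating):
        # Float dtypes are compatible with NaN, so the dtype is preserved.
        return np.array(filled, dtype=dtype)
    if promote_ints and np.issubdtype(dtype, np.integer):
        # Integers cannot represent NaN; optionally cast up to float64.
        return np.array(filled, dtype=np.float64)
    # Incompatible type: keep the None values in an object array.
    return np.array(values, dtype=object)


floats = fill_nulls([1.5, None], "float32")
ints_kept = fill_nulls([1, None], "int32")
ints_cast = fill_nulls([1, None], "int32", promote_ints=True)
```

The `promote_ints` switch is there only to show the trade-off: without it, integer columns keep their `None` values in an object array; with it, they are cast to a float type so that NaN can be stored.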
|
After some more experimenting, I think we can keep the auto-cast to float because PyArrow also does it:

    import numpy as np
    import pyarrow as pa

    arr = pa.array([1, 2, 3, 4, None], type=pa.int32()).to_numpy(zero_copy_only=False)  # None present - int32 -> float64
    assert arr.dtype == np.float64

Additional changes:
|
lhoestq
left a comment
Very nice ! My final comments:
|
The CI failure on Windows is unrelated to this PR, merging.
PyArrow has explicit support for null values, so it makes sense to support Nones on our side as well.
Colab Notebook with examples
Changes:
ClassLabel, TranslationVariableLanguages, Value, _ArrayXD)
class_encode_column (also there is an option to stringify Nones and treat them as a class)
sort (use pandas for that)
None to np.nan to align the behavior with PyArrow)
pa.concat_tables(table_list, promote=True)) and null row/column broadcasting similar to pandas
Additional notes:
null instead of none for function arguments for consistency with existing disable_nullable
update_metadata_with_features call in Dataset.rename_columns
TODO:
concatenate_datasets/add_item
TODOs for subsequent PRs:
drop_null/fill_null to Dataset/DatasetDict
Fix #3181 #3253