Implement capability to restore non-nullability in Features #7482
BramVanroy wants to merge 2 commits into huggingface:main from BramVanroy:schema
Conversation
Interestingly, this does not close #7479. The `Features` are not correctly maintained when calling …
Unfortunately this PR does not fix the reported issue. After more digging:
Interestingly, passing custom `Features` does not immediately load the underlying data with the right `arrow_schema`. Instead, the workflow is like this:
So I figured, since many/all of the pyarrow tables come from this line (`src/datasets/arrow_dataset.py`, line 940 at 5f8d2ad), we could change it to include the `arrow_schema`, if provided:

```python
pa_table = InMemoryTable.from_pydict(mapping=mapping, schema=features.arrow_schema if features is not None else None)
```

But that leads to an error, and I am not familiar enough with pyarrow to solve it. So ultimately I'm a bit at a loss here. I think, if we'd want to do this right, the automatic casting in `__init__` should be removed in favor of handling the logic inside …
It's indeed a bit more work to support nullable since, in addition to your comments, there is unclear behavior when it comes to concatenating nullable with non-nullable data, and maybe in how to handle non-nullable lists and nested data. But yup, I agree having the …

Just one comment about this error: this happens because …
This PR attempts to keep track of non-nullable pyarrow fields when converting a `pa.Schema` to `Features`. At the same time, when outputting the `arrow_schema`, the original non-nullable fields are restored. This allows for more consistent behavior and avoids the breakage illustrated in #7479.

I am by no means a pyarrow expert, so some logic in `find_non_nullable_fields` may not be perfect. Not sure if more logic (type checks) is needed for deep-checking a given schema. Maybe there are other pyarrow structures that need to be covered?

Tests are added, but again, these may not have sufficient coverage in terms of pyarrow structure types.
closes #7479