Open
Description
Describe the bug
I came across a perhaps niche bug: Features does not (and cannot) account for the nullable=False option on pyarrow fields. Interestingly, for regular "flat" fields this does not necessarily lead to conflicts, but when a non-nullable field sits inside a struct, an incompatibility arises.
It's not easy to explain in words, so I hope the minimal example below helps.
Note that I suggest a solution in the code comments: let Dataset.to_parquet accept a schema argument which, when provided, overrides the default ds.features.arrow_schema.
Steps to reproduce the bug
import os
from datasets import Dataset, Features
import pyarrow as pa
import pyarrow.parquet as pq
# HF datasets is destructive when you call Features.from_arrow_schema(schema) on a schema
# because it will not account for nullable and non-nullable fields in structs (it will always allow nullable)
# Reloading the same dataset with the original schema will raise an error because the schema is not the same anymore
non_nullable_schema = pa.schema(
    [
        pa.field("text", pa.string(), nullable=False),
        pa.field(
            "meta",
            pa.struct(
                [
                    pa.field("date", pa.list_(pa.string()), nullable=False),
                ],
            ),
        ),
    ]
)
print("ORIGINAL SCHEMA")
print(non_nullable_schema)
print()
feats = Features.from_arrow_schema(non_nullable_schema)
print("FEATUR-IZED SCHEMA (nullable-restrictions are gone)")
print(feats.arrow_schema)
print()
ds = Dataset.from_dict(
    {
        "text": ["a", "b", "c"],
        "meta": [{"date": ["2021-01-01"]}, {"date": ["2021-01-02"]}, {"date": ["2021-01-03"]}],
    },
    features=feats,
)
fname = "tmp.parquet"
# This is not possible: TypeError: pyarrow.parquet.core.ParquetWriter() got multiple values for keyword argument 'schema'
# Though I believe this would be the easiest fix: allow schema to be passed to to_parquet and overwrite the schema in the dataset
# ds.to_parquet(fname, schema=non_nullable_schema)
ds.to_parquet(fname)
try:
    _ = pq.read_table(fname, schema=non_nullable_schema)
finally:
    os.unlink(fname)
Expected behavior
- Non-destructive behavior when converting an arrow schema to Features; or
- the ability to override the default arrow schema with a custom one
Environment info
- datasets version: 3.2.0
- Platform: Linux-5.14.0-427.20.1.el9_4.x86_64-x86_64-with-glibc2.34
- Python version: 3.11.10
- huggingface_hub version: 0.27.1
- PyArrow version: 18.1.0
- Pandas version: 2.2.3
- fsspec version: 2024.9.0