Features.from_arrow_schema is destructive #7479

### Describe the bug

I came across a (perhaps niche) bug: `Features` does not, and currently cannot, account for pyarrow's `nullable=False` option on fields. For regular "flat" (top-level) fields this does not necessarily lead to conflicts, but when a non-nullable field sits inside a struct, the resulting schema is incompatible with the original.

This is hard to explain in words, so I hope the minimal example below helps.

Note that I suggest a solution in the code comments: let `Dataset.to_parquet` accept a `schema` argument which, when provided, overrides the default `ds.features.arrow_schema`.

### Steps to reproduce the bug

```python
import os
from datasets import Dataset, Features

import pyarrow as pa
import pyarrow.parquet as pq

# Features.from_arrow_schema(schema) is destructive: it does not preserve the
# nullability of fields, including fields nested in structs (everything becomes nullable).
# Reading the written dataset back with the original schema then raises an error,
# because the two schemas are no longer the same.
non_nullable_schema = pa.schema(
    [
        pa.field("text", pa.string(), nullable=False),
        pa.field("meta",
            pa.struct(
                [
                    pa.field("date", pa.list_(pa.string()), nullable=False),
                ],
            ),
        ),

    ]
)
print("ORIGINAL SCHEMA")
print(non_nullable_schema)
print()

feats = Features.from_arrow_schema(non_nullable_schema)

print("FEATUR-IZED SCHEMA (nullable-restrictions are gone)")
print(feats.arrow_schema)
print()

ds = Dataset.from_dict(
    {
        "text": ["a", "b", "c"],
        "meta": [{"date": ["2021-01-01"]}, {"date": ["2021-01-02"]}, {"date": ["2021-01-03"]}],
    },
    features=feats,
)

fname = "tmp.parquet"

# This is not possible: TypeError: pyarrow.parquet.core.ParquetWriter() got multiple values for keyword argument 'schema'
# Though I believe this would be the easiest fix: allow schema to be passed to to_parquet and overwrite the schema in the dataset
# ds.to_parquet(fname, schema=non_nullable_schema)

ds.to_parquet(fname)

try:
    _ = pq.read_table(fname, schema=non_nullable_schema)
finally:
    os.unlink(fname)
```


### Expected behavior

- Non-destructive behavior when converting an arrow schema to Features; or
- the ability to override the default arrow schema with a custom one

### Environment info

- `datasets` version: 3.2.0
- Platform: Linux-5.14.0-427.20.1.el9_4.x86_64-x86_64-with-glibc2.34
- Python version: 3.11.10
- `huggingface_hub` version: 0.27.1
- PyArrow version: 18.1.0
- Pandas version: 2.2.3
- `fsspec` version: 2024.9.0
