Problems with local dataset after upgrade from 3.3.2 to 3.4.0 #7455

### Describe the bug

After upgrading `datasets` from 3.3.2 to 3.4.0 yesterday, I can no longer open a locally saved dataset that was created with an older `datasets` version.

The traceback is:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/arrow/arrow.py", line 67, in _generate_tables
    batches = pa.ipc.open_stream(f)
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/ipc.py", line 190, in open_stream
    return RecordBatchStreamReader(source, options=options,
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/ipc.py", line 52, in __init__
    self._open(source, options=options, memory_pool=memory_pool)
  File "pyarrow/ipc.pxi", line 1006, in pyarrow.lib._RecordBatchStreamReader._open
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected to read 538970747 metadata bytes, but only read 2126

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1855, in _prepare_split_single
    for _, table in generator:
  File "/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/arrow/arrow.py", line 69, in _generate_tables
    reader = pa.ipc.open_file(f)
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/ipc.py", line 234, in open_file
    return RecordBatchFileReader(
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/ipc.py", line 110, in __init__
    self._open(source, footer_offset=footer_offset,
  File "pyarrow/ipc.pxi", line 1090, in pyarrow.lib._RecordBatchFileReader._open
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Not an Arrow file
```
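A note on the first error: the size in `Expected to read 538970747 metadata bytes` can be decoded back into the bytes it came from. The sketch below assumes the stream reader interpreted the first four bytes of the file as a little-endian 32-bit length prefix, which is an assumption about the reader and not something the traceback states.

```python
import struct

# Size reported in "Expected to read 538970747 metadata bytes".
reported_size = 538970747

# Assumption: the reader took the first four bytes of the file as a
# little-endian 32-bit length prefix. Packing the number the same way
# recovers those four bytes.
first_bytes = struct.pack("<i", reported_size)

print(first_bytes)  # b'{\n  '
```

Under that assumption, the file being opened starts with `{`, a newline, and two spaces, which looks like the start of an indented JSON file and not Arrow data. This would be consistent with a small non-Arrow file in the folder being passed to the Arrow reader, but the traceback alone does not confirm which file it was.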

### Steps to reproduce the bug

Load a dataset from a local folder with:

```
dataset = load_dataset(
    args.train_data_dir,
    cache_dir=args.cache_dir,
)
```
as is done, for example, in the training script for SD3 ControlNet.

This is a minimal script to reproduce it:

```
from datasets import load_dataset

def main():
    # "local_dataset" is the path to the locally saved dataset folder.
    dataset = load_dataset("local_dataset")
    print(dataset)
    print("Sample data:", dataset["train"][0])

if __name__ == "__main__":
    main()
```
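To narrow down which file triggers the error, it may help to look at the first bytes of every file in the dataset folder. The following is a minimal sketch using only the standard library. `sniff_file` and `sniff_folder` are hypothetical helpers, not part of `datasets` or `pyarrow`, and the classification is a rough heuristic: the Arrow IPC file format starts with the magic `ARROW1`, current IPC stream messages start with the continuation marker `0xFFFFFFFF`, and streams written by old Arrow versions without that marker would be reported as `unknown`.

```python
import os

ARROW_FILE_MAGIC = b"ARROW1"                # start of an Arrow IPC *file*
STREAM_CONTINUATION = b"\xff\xff\xff\xff"   # start of a current Arrow IPC *stream* message


def sniff_file(path):
    """Guess the kind of file at `path` from its first bytes (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(ARROW_FILE_MAGIC):
        return "arrow-file"
    if head.startswith(STREAM_CONTINUATION):
        return "arrow-stream"
    if head.lstrip().startswith((b"{", b"[")):
        return "json-like"
    return "unknown"


def sniff_folder(folder):
    """Map every file under `folder` to its guessed kind."""
    result = {}
    for root, _dirs, files in os.walk(folder):
        for name in sorted(files):
            path = os.path.join(root, name)
            result[path] = sniff_file(path)
    return result


if __name__ == "__main__":
    # "local_dataset" is the folder name used in the minimal script above.
    for path, kind in sniff_folder("local_dataset").items():
        print(f"{kind:12s} {path}")
```

Any file reported as `json-like` or `unknown` is one that neither `pa.ipc.open_stream` nor `pa.ipc.open_file` would be expected to read, so it is a candidate for the file the loader is failing on.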

### Expected behavior

Loading the dataset should work in 3.4.0 as it did in 3.3.2.

### Environment info

- `datasets` version: 3.4.0
- Platform: Linux-5.15.0-75-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- `huggingface_hub` version: 0.29.3
- PyArrow version: 19.0.1
- Pandas version: 2.2.3
- `fsspec` version: 2024.12.0
