Description
Describe the bug
After upgrading from datasets 3.3.2 to 3.4.0 yesterday, I can no longer open a locally saved dataset that was created with an older datasets version.
The traceback is:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/arrow/arrow.py", line 67, in _generate_tables
    batches = pa.ipc.open_stream(f)
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/ipc.py", line 190, in open_stream
    return RecordBatchStreamReader(source, options=options,
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/ipc.py", line 52, in __init__
    self._open(source, options=options, memory_pool=memory_pool)
  File "pyarrow/ipc.pxi", line 1006, in pyarrow.lib._RecordBatchStreamReader._open
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected to read 538970747 metadata bytes, but only read 2126

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1855, in _prepare_split_single
    for _, table in generator:
  File "/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/arrow/arrow.py", line 69, in _generate_tables
    reader = pa.ipc.open_file(f)
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/ipc.py", line 234, in open_file
    return RecordBatchFileReader(
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/ipc.py", line 110, in __init__
    self._open(source, footer_offset=footer_offset,
  File "pyarrow/ipc.pxi", line 1090, in pyarrow.lib._RecordBatchFileReader._open
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Not an Arrow file
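
One possible reading of the first error, offered as a sketch and not a confirmed diagnosis: when a stream does not begin with the IPC continuation marker, the Arrow stream reader appears to interpret the first four bytes as a little-endian metadata length. Under that assumption, decoding the reported number back into bytes shows what the file actually starts with. The helper name `leading_bytes` is hypothetical.

```python
def leading_bytes(reported_length: int) -> bytes:
    # Assumption: the reader took the first 4 bytes of the file and
    # interpreted them as a little-endian 32-bit length.
    return reported_length.to_bytes(4, "little")


# 538970747 is the "metadata bytes" value from the traceback above.
print(leading_bytes(538970747))  # b'{\n  '
```

The result is an opening brace followed by a newline and two spaces, which is consistent with a small, indented JSON file being handed to the Arrow reader, not with a corrupted Arrow file. The traceback alone does not say which file in the folder that is.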
Steps to reproduce the bug
Load a dataset from a local folder with
dataset = load_dataset(
    args.train_data_dir,
    cache_dir=args.cache_dir,
)
as is done, for example, in the training script for SD3 ControlNet.
A minimal script to reproduce it:
from datasets import load_dataset


def main():
    dataset = load_dataset(
        "local_dataset",
    )
    print(dataset)
    print("Sample data:", dataset["train"][0])


if __name__ == "__main__":
    main()
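
To narrow down which file in the local folder the Arrow reader rejects, the first bytes of each file can be inspected without involving `datasets` at all. The sketch below relies on two markers from the Arrow IPC format: the file format begins with the magic string `ARROW1`, and current streams begin with the `0xFFFFFFFF` continuation marker. Older streams may lack that marker and would show up as "unknown" here. The names `classify` and `scan` are hypothetical helpers, not part of any library.

```python
from pathlib import Path

ARROW_FILE_MAGIC = b"ARROW1"                # start of the Arrow IPC file format
STREAM_CONTINUATION = b"\xff\xff\xff\xff"   # start of current Arrow IPC streams


def classify(path):
    """Guess what kind of file `path` is from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(ARROW_FILE_MAGIC):
        return "arrow-file"
    if head.startswith(STREAM_CONTINUATION):
        return "arrow-stream"
    # Heuristic only: JSON documents usually open with '{' or '['.
    if head.lstrip()[:1] in (b"{", b"["):
        return "json-like"
    return "unknown"


def scan(folder):
    """Map every regular file under `folder` to its guessed kind."""
    files = sorted(p for p in Path(folder).rglob("*") if p.is_file())
    return {str(p): classify(p) for p in files}
```

Calling `scan("local_dataset")` and printing the result should show whether the folder mixes Arrow data with JSON metadata files, which is one way the loader could end up passing a non-Arrow file to `pa.ipc.open_stream`.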
Expected behavior
Loading the dataset should work in 3.4.0 as it did in 3.3.2.
Environment info
- datasets version: 3.4.0
- Platform: Linux-5.15.0-75-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- huggingface_hub version: 0.29.3
- PyArrow version: 19.0.1
- Pandas version: 2.2.3
- fsspec version: 2024.12.0