Problems with local dataset after upgrade from 3.3.2 to 3.4.0 #7455

@andjoer

Describe the bug

After upgrading from datasets 3.3.2 to 3.4.0 yesterday, I can no longer open a locally saved dataset that was created with an older datasets version.

The traceback is:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/arrow/arrow.py", line 67, in _generate_tables
    batches = pa.ipc.open_stream(f)
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/ipc.py", line 190, in open_stream
    return RecordBatchStreamReader(source, options=options,
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/ipc.py", line 52, in __init__
    self._open(source, options=options, memory_pool=memory_pool)
  File "pyarrow/ipc.pxi", line 1006, in pyarrow.lib._RecordBatchStreamReader._open
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected to read 538970747 metadata bytes, but only read 2126

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1855, in _prepare_split_single
    for _, table in generator:
  File "/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/arrow/arrow.py", line 69, in _generate_tables
    reader = pa.ipc.open_file(f)
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/ipc.py", line 234, in open_file
    return RecordBatchFileReader(
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/ipc.py", line 110, in __init__
    self._open(source, footer_offset=footer_offset,
  File "pyarrow/ipc.pxi", line 1090, in pyarrow.lib._RecordBatchFileReader._open
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Not an Arrow file
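One detail in the first traceback may help narrow this down. The "metadata bytes" count is a length that the stream reader took from the start of the file, so the number can be turned back into the bytes it came from. The sketch below assumes the length was read as a little-endian 32-bit integer from the first four bytes of the file; that assumption is mine and is not stated in the traceback.

```python
import struct

# Metadata length reported by pyarrow in the first traceback.
reported_length = 538970747

# Assumption: the reader took this value from the first four bytes of the
# file, interpreted as a little-endian signed 32-bit integer.
leading_bytes = struct.pack("<i", reported_length)

print(leading_bytes)  # b'{\n  '
```

Under that assumption, the file began with an opening brace, a newline, and two spaces, which looks like the start of an indented JSON document rather than Arrow data. This would suggest that a small JSON file in the dataset folder was handed to the Arrow reader, but that is a guess from the numbers, not something the traceback confirms.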

Steps to reproduce the bug

Load a dataset from a local folder with

dataset = load_dataset(
    args.train_data_dir,
    cache_dir=args.cache_dir,
)

as is done, for example, in the training script for SD3 ControlNet.

This is the minimal script to test it:

from datasets import load_dataset

def main():
    dataset = load_dataset(
        "local_dataset",
    )
    print(dataset)
    print("Sample data:", dataset["train"][0])

if __name__ == "__main__":
    main()
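To see which files in the folder the loader might be picking up, a quick check of each file's first bytes can help. This is a minimal sketch using only the standard library. The helper names `sniff` and `report` are hypothetical, and the classification is a heuristic: it looks for the `ARROW1` magic of the Arrow IPC file format, the `0xFFFFFFFF` continuation marker that opens messages in the current IPC stream format, and a leading `{` or `[` for JSON-like text. Streams written in the older format without the continuation marker would show up as "unknown".

```python
from pathlib import Path

# Magic bytes at the start of the Arrow IPC file format.
ARROW_FILE_MAGIC = b"ARROW1"
# Continuation marker that opens messages in the current IPC stream format.
STREAM_CONTINUATION = b"\xff\xff\xff\xff"


def sniff(path):
    """Guess a file's kind from its first bytes (hypothetical helper, heuristic only)."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(ARROW_FILE_MAGIC):
        return "arrow-file"
    if head.startswith(STREAM_CONTINUATION):
        return "arrow-stream"
    if head.lstrip()[:1] in (b"{", b"["):
        return "json-like"
    return "unknown"


def report(folder):
    """Map each regular file under `folder` to its guessed kind."""
    return {str(p): sniff(p) for p in sorted(Path(folder).rglob("*")) if p.is_file()}


if __name__ == "__main__":
    # "local_dataset" is the folder name used in the minimal script.
    for name, kind in report("local_dataset").items():
        print(f"{kind:12}  {name}")
```

If the folder contains JSON files next to the `.arrow` files, comparing this listing with the file named in the failing call could show whether a non-Arrow file is being read as Arrow data.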

Expected behavior

The dataset should load in 3.4.0 just as it did in 3.3.2.

Environment info

  • datasets version: 3.4.0
  • Platform: Linux-5.15.0-75-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • huggingface_hub version: 0.29.3
  • PyArrow version: 19.0.1
  • Pandas version: 2.2.3
  • fsspec version: 2024.12.0
