Invalid Arrow data from JSONL #5531

lhoestq opened this issue Feb 14, 2023 · 0 comments
lhoestq commented Feb 14, 2023

This code fails:

```python
from datasets import Dataset

ds = Dataset.from_json(path_to_file)
ds.data.validate()
```

raises

```
ArrowInvalid: Column 2: In chunk 1: Invalid: Struct child array #3 invalid: Invalid: Length spanned by list offsets (4064) larger than values array (length 4063)
```

This causes many issues for @TevenLeScao:

  • map fails because it cannot concatenate the invalid Arrow arrays:

    ```
    ~/Desktop/hf/datasets/src/datasets/arrow_writer.py in write_examples_on_file(self)
        438             if all(isinstance(row[0][col], (pa.Array, pa.ChunkedArray)) for row in self.current_examples):
        439                 arrays = [row[0][col] for row in self.current_examples]
    --> 440                 batch_examples[col] = array_concat(arrays)
        441             else:
        442                 batch_examples[col] = [

    ~/Desktop/hf/datasets/src/datasets/table.py in array_concat(arrays)
       1885
       1886     if not _is_extension_type(array_type):
    -> 1887         return pa.concat_arrays(arrays)
       1888
       1889     def _offsets_concat(offsets):

    ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.concat_arrays()

    ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

    ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

    ArrowIndexError: array slice would exceed array length
    ```
  • to_dict() segfaults ⚠️

    ```
    /Users/runner/work/crossbow/crossbow/arrow/cpp/src/arrow/array/data.cc:99:  Check failed: (off) <= (length) Slice offset greater than array length
    ```

To reproduce: unzip the archive and run the above code using sanity_oscar_en.jsonl
sanity_oscar_en.jsonl.zip

PS: reading the file with pandas and then converting to Arrow works, though (note that the dataset lives entirely in RAM in that case):

```python
import pandas as pd

from datasets import Dataset

ds = Dataset.from_pandas(pd.read_json(path_to_file, lines=True))
ds.data.validate()
```
lhoestq added the bug label on Feb 14, 2023
shuheikurita added a commit to llm-jp/llm-jp-corpus that referenced this issue Sep 9, 2023
ybracke pushed a commit to ybracke/transnormer-data that referenced this issue Dec 13, 2023
* New function `util.load_dataset_via_pandas` to be used instead of Dataset.load_dataset. Using a pandas df as an intermediate format somehow prevents weird `datasets.builder.DatasetGenerationError`s that occurred while processing some files before. This was presented as a solution for a related problem here: huggingface/datasets#5531
* Use the new function in CLI script modify_dataset.py
* Remove the temporary file used for finding the problem