Invalid Arrow data from JSONL #5531

lhoestq opened this issue Feb 14, 2023 · 0 comments
lhoestq commented Feb 14, 2023

This code fails:

```python
from datasets import Dataset

ds = Dataset.from_json(path_to_file)
ds.data.validate()
```

raises

```
ArrowInvalid: Column 2: In chunk 1: Invalid: Struct child array #3 invalid: Invalid: Length spanned by list offsets (4064) larger than values array (length 4063)
```

This causes many issues for @TevenLeScao:

  • map fails because it cannot concatenate the invalid Arrow arrays:

    ```
    ~/Desktop/hf/datasets/src/datasets/arrow_writer.py in write_examples_on_file(self)
        438             if all(isinstance(row[0][col], (pa.Array, pa.ChunkedArray)) for row in self.current_examples):
        439                 arrays = [row[0][col] for row in self.current_examples]
    --> 440                 batch_examples[col] = array_concat(arrays)
        441             else:
        442                 batch_examples[col] = [

    ~/Desktop/hf/datasets/src/datasets/table.py in array_concat(arrays)
       1885
       1886     if not _is_extension_type(array_type):
    -> 1887         return pa.concat_arrays(arrays)
       1888
       1889     def _offsets_concat(offsets):

    ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.concat_arrays()

    ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

    ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

    ArrowIndexError: array slice would exceed array length
    ```
  • to_dict() segfaults ⚠️

    ```
    /Users/runner/work/crossbow/crossbow/arrow/cpp/src/arrow/array/data.cc:99:  Check failed: (off) <= (length) Slice offset greater than array length
    ```

To reproduce: unzip the archive and run the above code using sanity_oscar_en.jsonl
sanity_oscar_en.jsonl.zip

PS: reading the file with pandas and then converting to Arrow works, though (note that the dataset lives entirely in RAM in that case):

```python
import pandas as pd

from datasets import Dataset

ds = Dataset.from_pandas(pd.read_json(path_to_file, lines=True))
ds.data.validate()
```
lhoestq added the bug label on Feb 14, 2023
shuheikurita added a commit to llm-jp/llm-jp-corpus that referenced this issue Sep 9, 2023
ybracke pushed a commit to ybracke/transnormer-data that referenced this issue Dec 13, 2023
* New function `util.load_dataset_via_pandas` to be used instead of Dataset.load_dataset. Using a pandas df as an intermediate format somehow prevents weird `datasets.builder.DatasetGenerationError`s that occurred while processing some files before. This was presented as a solution for a related problem here: huggingface/datasets#5531
* Use the new function in CLI script modify_dataset.py
* Remove the temporary file used for finding the problem