Describe the bug
When a JSON file contains a text field that is larger than the block_size, the JSON dataset builder fails.
Steps to reproduce the bug
Create a folder that contains the following:
.
├── testdata
│ └── mydata.json
└── test.py
Please download this file as mydata.json. (The error does not occur in JSON files with shorter text, but it is reproducible when the text is long as in the file I provide)
❗ ❗ GitHub doesn't allow me to upload JSON so this file is a TXT, and you should rename it to .json!
Using custom data configuration testdata-d490389b8ab4fd82
Downloading and preparing dataset json/testdata to /home/junshern.chan/.cache/huggingface/datasets/json/testdata-d490389b8ab4fd82/0.0.0/3333a8af0db9764dfcff43a42ff26228f0f2e267f0d8a0a294452d188beadb34...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2264.74it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 447.01it/s]
Failed to read file '/home/junshern.chan/hf-json-bug/testdata/mydata.json' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Missing a name for object member. in row 0
Traceback (most recent call last):
  File "test.py", line 28, in <module>
    my_dataset = load_dataset("testdata")
  File "/home/junshern.chan/.casio/miniconda/envs/hf-json-bug/lib/python3.8/site-packages/datasets/load.py", line 1632, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/junshern.chan/.casio/miniconda/envs/hf-json-bug/lib/python3.8/site-packages/datasets/builder.py", line 607, in download_and_prepare
    self._download_and_prepare(
  File "/home/junshern.chan/.casio/miniconda/envs/hf-json-bug/lib/python3.8/site-packages/datasets/builder.py", line 697, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/junshern.chan/.casio/miniconda/envs/hf-json-bug/lib/python3.8/site-packages/datasets/builder.py", line 1156, in _prepare_split
    for key, table in utils.tqdm(
  File "/home/junshern.chan/.casio/miniconda/envs/hf-json-bug/lib/python3.8/site-packages/tqdm/std.py", line 1168, in __iter__
    for obj in iterable:
  File "/home/junshern.chan/.casio/miniconda/envs/hf-json-bug/lib/python3.8/site-packages/datasets/packaged_modules/json/json.py", line 146, in _generate_tables
    raise ValueError(
ValueError: Not able to read records in the JSON file at /home/junshern.chan/hf-json-bug/testdata/mydata.json. You should probably indicate the field of the JSON file containing your records. This JSON file contain the following fields: ['text']. Select the correct one and provide it as `field='XXX'` to the dataset loading method.
I have additionally identified the source of the error: this condition in the file python3.8/site-packages/datasets/packaged_modules/json/json.py does not behave as intended:
if (
    isinstance(e, pa.ArrowInvalid)
    and "straddling" not in str(e)
    or block_size > len(batch)
):
From what I can tell, in my case the block_size simply needs to be increased, but the error message does not contain "straddling", so the condition evaluates to true and we never reach the line that increases block_size.
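To make the evaluation concrete, here is a minimal sketch of how Python parses that condition. Because `and` binds tighter than `or`, it reads as `(A and B) or C`. The `ArrowInvalid` class below is a stand-in so the snippet runs without pyarrow, and the batch size, block size, and the "straddling" message are hypothetical values chosen for illustration; only the first error message is taken from the log above.

```python
class ArrowInvalid(Exception):
    """Stand-in for pyarrow.lib.ArrowInvalid (assumption: lets the sketch run without pyarrow)."""


def original_condition(e, block_size, batch):
    # Same expression as in the issue; Python groups it as (A and B) or C.
    return (
        isinstance(e, ArrowInvalid)
        and "straddling" not in str(e)
        or block_size > len(batch)
    )


batch = b"x" * 4096   # hypothetical batch, larger than the block size
block_size = 1024     # hypothetical block size

# Message copied from the log in this report: it does not mention "straddling".
reported = ArrowInvalid("JSON parse error: Missing a name for object member. in row 0")
# Hypothetical message that does mention "straddling".
straddling = ArrowInvalid("hypothetical message mentioning a straddling object")

reported_result = original_condition(reported, block_size, batch)
straddling_result = original_condition(straddling, block_size, batch)
print(reported_result, straddling_result)
```

With the reported message the expression is true even though block_size is still smaller than the batch, which matches the behaviour described above.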
Changing the condition above to simply
if (
    block_size > len(batch)
):
fixes the error for me. I'm happy to create a PR containing this fix if the developers deem the other conditions unnecessary.
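As a rough illustration of what the simplified condition would allow, the sketch below simulates a retry loop that doubles block_size until the record fits. This is not the code from json.py: `fake_parse`, `parse_with_growing_block`, and the doubling step are assumptions made for this example, and the parser is a toy that only checks record length.

```python
from typing import Tuple


def fake_parse(batch: bytes, block_size: int) -> int:
    """Hypothetical block-based reader: fails if any record is longer than block_size."""
    lines = batch.splitlines()
    if max(len(line) for line in lines) > block_size:
        raise ValueError("record does not fit in one block")
    return len(lines)


def parse_with_growing_block(batch: bytes, block_size: int) -> Tuple[int, int]:
    """Retry with a doubled block size; give up only once block_size exceeds the batch."""
    while True:
        try:
            return fake_parse(batch, block_size), block_size
        except ValueError:
            if block_size > len(batch):
                # Simplified condition from the issue: a bigger block cannot help any more.
                raise
            block_size *= 2  # assumption: grow by doubling


# One record with a long text field, as in the report.
long_record = b'{"text": "' + b"a" * 5000 + b'"}\n'
rows, final_block_size = parse_with_growing_block(long_record, 1024)
print(rows, final_block_size)
```

In this toy setting the loop keeps retrying while block_size is smaller than the batch, so a single long text field is eventually parsed instead of failing on the first attempt.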
test.py is the script that calls load_dataset("testdata"), as the traceback above shows. To reproduce the error, simply run test.py.
Expected results
The data should load correctly without error.
Actual results
The dataset builder fails with the ValueError shown in the traceback above.
Environment info
datasets version: 1.15.1