Error in Json(datasets.ArrowBasedBuilder) class #3227

Closed
JunShern opened this issue Nov 7, 2021 · 3 comments


JunShern commented Nov 7, 2021

Describe the bug

When a JSON file contains a text field that is larger than block_size, the JSON dataset builder fails.

Steps to reproduce the bug

Create a folder that contains the following:

.
├── testdata
│   └── mydata.json
└── test.py

Please download this file as mydata.json. (The error does not occur in JSON files with shorter text, but it is reproducible when the text is long as in the file I provide)
❗ ❗ GitHub doesn't allow me to upload JSON so this file is a TXT, and you should rename it to .json!

test.py simply contains:

from datasets import load_dataset
my_dataset = load_dataset("testdata")

To reproduce the error, simply run

python test.py

Expected results

The data should load correctly without error.

Actual results

The dataset builder fails with:

Using custom data configuration testdata-d490389b8ab4fd82
Downloading and preparing dataset json/testdata to /home/junshern.chan/.cache/huggingface/datasets/json/testdata-d490389b8ab4fd82/0.0.0/3333a8af0db9764dfcff43a42ff26228f0f2e267f0d8a0a294452d188beadb34...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2264.74it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 447.01it/s]
Failed to read file '/home/junshern.chan/hf-json-bug/testdata/mydata.json' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Missing a name for object member. in row 0
Traceback (most recent call last):
  File "test.py", line 28, in <module>
    my_dataset = load_dataset("testdata")
  File "/home/junshern.chan/.casio/miniconda/envs/hf-json-bug/lib/python3.8/site-packages/datasets/load.py", line 1632, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/junshern.chan/.casio/miniconda/envs/hf-json-bug/lib/python3.8/site-packages/datasets/builder.py", line 607, in download_and_prepare
    self._download_and_prepare(
  File "/home/junshern.chan/.casio/miniconda/envs/hf-json-bug/lib/python3.8/site-packages/datasets/builder.py", line 697, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/junshern.chan/.casio/miniconda/envs/hf-json-bug/lib/python3.8/site-packages/datasets/builder.py", line 1156, in _prepare_split
    for key, table in utils.tqdm(
  File "/home/junshern.chan/.casio/miniconda/envs/hf-json-bug/lib/python3.8/site-packages/tqdm/std.py", line 1168, in __iter__
    for obj in iterable:
  File "/home/junshern.chan/.casio/miniconda/envs/hf-json-bug/lib/python3.8/site-packages/datasets/packaged_modules/json/json.py", line 146, in _generate_tables
    raise ValueError(
ValueError: Not able to read records in the JSON file at /home/junshern.chan/hf-json-bug/testdata/mydata.json. You should probably indicate the field of the JSON file containing your records. This JSON file contain the following fields: ['text']. Select the correct one and provide it as `field='XXX'` to the dataset loading method. 

Environment info

  • datasets version: 1.15.1
  • Platform: Linux-5.8.0-63-generic-x86_64-with-glibc2.17
  • Python version: 3.8.12
  • PyArrow version: 6.0.0
@JunShern JunShern added the bug Something isn't working label Nov 7, 2021
Author

JunShern commented Nov 7, 2021

I have also tracked down what I believe is the source of the error. This condition in the file
python3.8/site-packages/datasets/packaged_modules/json/json.py is not being evaluated as intended:

    if (
        isinstance(e, pa.ArrowInvalid)
        and "straddling" not in str(e)
        or block_size > len(batch)
    ):

From what I can tell, in my case block_size simply needs to be increased. However, the error message does not contain "straddling", so the condition evaluates to true and we never reach the line that increases block_size.
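
A minimal sketch of how that condition evaluates, given that `and` binds tighter than `or` in Python, so the expression is read as (A and B) or C. `FakeArrowInvalid` and `should_reraise` are hypothetical stand-ins so the snippet runs without pyarrow, and it assumes a true result means the error is re-raised instead of retrying with a larger block_size. The "straddling" message wording is also an assumption for illustration.

```python
class FakeArrowInvalid(Exception):
    """Hypothetical stand-in for pa.ArrowInvalid so no pyarrow install is needed."""


def should_reraise(e, block_size, batch_len):
    # Same shape as the condition quoted above; Python parses it as (A and B) or C.
    return (
        isinstance(e, FakeArrowInvalid)
        and "straddling" not in str(e)
        or block_size > batch_len
    )


# The message from the traceback in this issue does not mention "straddling".
err = FakeArrowInvalid("JSON parse error: Missing a name for object member. in row 0")
reraise_parse_error = should_reraise(err, block_size=16, batch_len=1024)

# An error whose message does contain "straddling" (wording assumed for illustration).
straddle = FakeArrowInvalid("straddling object straddles two block boundaries")
reraise_straddling = should_reraise(straddle, block_size=16, batch_len=1024)

print(reraise_parse_error, reraise_straddling)
```

Under these assumptions, a parse error without "straddling" in its message is re-raised regardless of block_size, which matches the behavior described above.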

Changing the condition above to simply

    if (
        block_size > len(batch)
    ):

fixes the error for me. I'm happy to open a PR with this fix if the maintainers consider the other conditions unnecessary.

Member

lhoestq commented Nov 8, 2021

Hi! I think the issue is that your JSON file is not a valid JSON Lines file: each example should be on a single line.

Can you try fixing the format to have one line per example and try again?
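
A rough way to check the "one example per line" requirement, using only the standard library. The helper name `non_jsonl_lines` and the sample strings are made up for illustration and are not part of `datasets`.

```python
import json


def non_jsonl_lines(text):
    """Return 1-based numbers of non-empty lines that are not a complete JSON object."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
            continue
        if not isinstance(value, dict):
            bad.append(number)
    return bad


# A pretty-printed object spans several lines, so no single line is a full record.
pretty = '{\n  "text": "a long document"\n}\n'
# JSON Lines: every line is a complete object.
one_per_line = '{"text": "first"}\n{"text": "second"}\n'

print(non_jsonl_lines(pretty))
print(non_jsonl_lines(one_per_line))
```

An empty list means every non-empty line parsed as a standalone JSON object.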

Author

JunShern commented Nov 9, 2021

😮 you're right, that did it! I just put everything on a single line (my file only has a single example) and that fixed the error. Thank you so much!
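
For anyone hitting the same thing, a small sketch of doing that conversion with the standard library instead of by hand. The function name `to_json_lines` and the sample input are hypothetical, and it assumes the file holds one JSON object or a list of objects.

```python
import json


def to_json_lines(pretty_text):
    """Re-serialize one JSON document (an object or a list of objects) as JSON Lines."""
    data = json.loads(pretty_text)
    records = data if isinstance(data, list) else [data]
    # json.dumps without indent emits no raw newlines; newlines inside strings stay escaped.
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)


# A pretty-printed object whose text value itself contains an escaped newline.
pretty = '{\n  "text": "line one\\nline two"\n}\n'
converted = to_json_lines(pretty)
print(converted)
```

Writing `converted` back to mydata.json gives one record per line, which is the layout suggested above.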

@JunShern JunShern closed this as completed Nov 9, 2021