
Loading big dataset raises pyarrow.lib.ArrowNotImplementedError #5695

Closed
amariucaitheodor opened this issue Apr 2, 2023 · 5 comments

@amariucaitheodor

Describe the bug

Calling datasets.load_dataset to load the (publicly available) dataset theodor1289/wit fails with pyarrow.lib.ArrowNotImplementedError.

Steps to reproduce the bug

Steps to reproduce this behavior:

  1. !pip install datasets
  2. !huggingface-cli login
  3. This step throws the error (it might take a while, as the dataset is ~170GB):

     from datasets import load_dataset
     dataset = load_dataset("theodor1289/wit", "train", use_auth_token=True)

Stack trace:

(torch-multimodal) bash-4.2$ python test.py 
Downloading and preparing dataset None/None to /cluster/work/cotterell/tamariucai/HuggingfaceDatasets/theodor1289___parquet/theodor1289--wit-7a3e984414a86a0f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 491.68it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 16.93it/s]
Traceback (most recent call last):                                                                                                                                                                                                    
  File "/cluster/home/tamariucai/.local/lib/python3.10/site-packages/datasets/builder.py", line 1860, in _prepare_split_single
    for _, table in generator:
  File "/cluster/home/tamariucai/.local/lib/python3.10/site-packages/datasets/packaged_modules/parquet/parquet.py", line 69, in _generate_tables
    for batch_idx, record_batch in enumerate(
  File "pyarrow/_parquet.pyx", line 1323, in iter_batches
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/cluster/work/cotterell/tamariucai/multimodal-mirror/examples/test.py", line 2, in <module>
    dataset = load_dataset("theodor1289/wit", "train", use_auth_token=True)
  File "/cluster/home/tamariucai/.local/lib/python3.10/site-packages/datasets/load.py", line 1791, in load_dataset
    builder_instance.download_and_prepare(
  File "/cluster/home/tamariucai/.local/lib/python3.10/site-packages/datasets/builder.py", line 891, in download_and_prepare
    self._download_and_prepare(
  File "/cluster/home/tamariucai/.local/lib/python3.10/site-packages/datasets/builder.py", line 986, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/cluster/home/tamariucai/.local/lib/python3.10/site-packages/datasets/builder.py", line 1748, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/cluster/home/tamariucai/.local/lib/python3.10/site-packages/datasets/builder.py", line 1893, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

Expected behavior

The dataset is loaded in variable dataset.

Environment info

  • datasets version: 2.11.0
  • Platform: Linux-3.10.0-1160.80.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.10.4
  • Huggingface_hub version: 0.13.3
  • PyArrow version: 11.0.0
  • Pandas version: 1.5.3
@lhoestq (Member) commented Apr 4, 2023

Hi ! It looks like an issue with PyArrow: https://issues.apache.org/jira/browse/ARROW-5030

It appears this can happen when Parquet files contain row groups larger than 2GB.
I can see that your Parquet files are around 10GB. It is usually advised to keep the shard size around the default of 500MB to avoid these issues.

Note that the row group size is currently defined simply by the number of rows, datasets.config.DEFAULT_MAX_BATCH_SIZE, so reducing this value could let you keep Parquet files bigger than 2GB while keeping each row group below 2GB.
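
A minimal sketch of that tweak (the attribute lives in datasets.config; the value below is only illustrative, not a recommendation from this thread):

    import datasets

    # Fewer rows per written batch -> smaller Parquet row groups, so each
    # row group stays well under 2GB even if the shard files themselves are large.
    datasets.config.DEFAULT_MAX_BATCH_SIZE = 1_000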

Would it be possible for you to re-upload the dataset with the default shard size of 500MB?
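
For reference, a sketch of what that re-upload could look like, reusing the push_to_hub call that appears later in this thread (the local path is hypothetical):

    from datasets import load_from_disk

    dataset = load_from_disk("/path/to/local/wit")  # hypothetical local copy
    dataset.push_to_hub("theodor1289/wit", max_shard_size="500MB")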

@amariucaitheodor (Author)

Hey, thanks for the reply! I've since switched to working with the locally saved dataset (which works).
Maybe it makes sense to show a warning for uploads with large shard sizes, since loading then completely breaks due to the PyArrow bug?
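
For the record, the local workflow amounts to something like this (paths hypothetical; save_to_disk/load_from_disk are the standard datasets APIs):

    from datasets import load_from_disk

    # After building the dataset once: dataset.save_to_disk("/path/to/wit_local")
    # Loading from disk reads the Arrow files directly and skips the Parquet
    # conversion path that raises ArrowNotImplementedError.
    dataset = load_from_disk("/path/to/wit_local")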

@amariucaitheodor (Author)

I just tried uploading the same dataset with 500MB shards; I got an error 4 hours in:

Pushing dataset shards to the dataset hub:  25%|██▍       | 358/1453 [4:40:31<14:18:00, 47.01s/it]
Traceback (most recent call last):
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/_commit_api.py", line 344, in _inner_upload_lfs_object
    return _upload_lfs_object(operation=operation, lfs_batch_action=batch_action, token=token)
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/_commit_api.py", line 391, in _upload_lfs_object
    lfs_upload(
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/lfs.py", line 254, in lfs_upload
    _upload_multi_part(
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/lfs.py", line 374, in _upload_multi_part
    hf_raise_for_status(part_upload_res)
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 301, in hf_raise_for_status
    raise HfHubHTTPError(str(e), response=response) from e
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 46, in __init__
    server_data = response.json()
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/requests/models.py", line 899, in json
    return complexjson.loads(
  File "/cluster/work/cotterell/tamariucai/miniconda3/envs/torch-multimodal/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/cluster/work/cotterell/tamariucai/miniconda3/envs/torch-multimodal/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/cluster/work/cotterell/tamariucai/miniconda3/envs/torch-multimodal/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "process_wit.py", line 146, in <module>
    dataset.push_to_hub(FINAL_PATH, max_shard_size="500MB", private=False)
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/datasets/dataset_dict.py", line 1534, in push_to_hub
    repo_id, split, uploaded_size, dataset_nbytes, _, _ = self[split]._push_parquet_shards_to_hub(
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 4804, in _push_parquet_shards_to_hub
    _retry(
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 281, in _retry
    return func(*func_args, **func_kwargs)
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/hf_api.py", line 2593, in upload_file
    commit_info = self.create_commit(
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/hf_api.py", line 2411, in create_commit
    upload_lfs_files(
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/_commit_api.py", line 351, in upload_lfs_files
    thread_map(
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/cluster/work/cotterell/tamariucai/miniconda3/envs/torch-multimodal/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/cluster/work/cotterell/tamariucai/miniconda3/envs/torch-multimodal/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/cluster/work/cotterell/tamariucai/miniconda3/envs/torch-multimodal/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/cluster/work/cotterell/tamariucai/miniconda3/envs/torch-multimodal/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/_commit_api.py", line 346, in _inner_upload_lfs_object
    raise RuntimeError(f"Error while uploading '{operation.path_in_repo}' to the Hub.") from exc
RuntimeError: Error while uploading 'data/train-00358-of-01453-22a5cc8b3eb12be3.parquet' to the Hub.

Local saves do work, however.

@lhoestq (Member) commented Apr 8, 2023

Hmmm, that was probably an intermittent bug. You can resume the upload by re-running push_to_hub.
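
A rough sketch of that resume pattern (the retry loop is illustrative only; FINAL_PATH and the push_to_hub arguments are taken from the traceback above):

    # Re-running push_to_hub with the same arguments continues the upload;
    # shards that already reached the Hub should not need to be re-sent.
    for attempt in range(5):
        try:
            dataset.push_to_hub(FINAL_PATH, max_shard_size="500MB", private=False)
            break
        except Exception as err:  # e.g. a transient HfHubHTTPError
            print(f"Upload failed ({err!r}), retrying ({attempt + 1}/5)...")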

@amariucaitheodor (Author)

Leaving this other error here for the record; it occurs when I load the 700+GB dataset from the Hub with 500MB shards:

 Traceback (most recent call last):                                                        
  File "/cluster/home/tamariucai/.local/lib/python3.10/site-packages/datasets/builder.py", line 1860, in _prepare_split_single
    for _, table in generator:
  File "/cluster/home/tamariucai/.local/lib/python3.10/site-packages/datasets/packaged_modules/parquet/parquet.py", line 69, in _generate_tables
    for batch_idx, record_batch in enumerate(
  File "pyarrow/_parquet.pyx", line 1323, in iter_batches
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Corrupt snappy compressed data.

I will probably switch back to the local big dataset or shrink it.
