
Loading big dataset raises pyarrow.lib.ArrowNotImplementedError #5695

Closed
amariucaitheodor opened this issue Apr 2, 2023 · 5 comments

@amariucaitheodor

Describe the bug

Calling datasets.load_dataset to load the (publicly available) dataset theodor1289/wit fails with pyarrow.lib.ArrowNotImplementedError.

Steps to reproduce the bug

Steps to reproduce this behavior:

  1. !pip install datasets
  2. !huggingface-cli login
  3. This step throws the error (it might take a while, as the dataset is ~170GB):

     from datasets import load_dataset
     dataset = load_dataset("theodor1289/wit", "train", use_auth_token=True)

Stack trace:

(torch-multimodal) bash-4.2$ python test.py 
Downloading and preparing dataset None/None to /cluster/work/cotterell/tamariucai/HuggingfaceDatasets/theodor1289___parquet/theodor1289--wit-7a3e984414a86a0f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 491.68it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 16.93it/s]
Traceback (most recent call last):                                                                                                                                                                                                    
  File "/cluster/home/tamariucai/.local/lib/python3.10/site-packages/datasets/builder.py", line 1860, in _prepare_split_single
    for _, table in generator:
  File "/cluster/home/tamariucai/.local/lib/python3.10/site-packages/datasets/packaged_modules/parquet/parquet.py", line 69, in _generate_tables
    for batch_idx, record_batch in enumerate(
  File "pyarrow/_parquet.pyx", line 1323, in iter_batches
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/cluster/work/cotterell/tamariucai/multimodal-mirror/examples/test.py", line 2, in <module>
    dataset = load_dataset("theodor1289/wit", "train", use_auth_token=True)
  File "/cluster/home/tamariucai/.local/lib/python3.10/site-packages/datasets/load.py", line 1791, in load_dataset
    builder_instance.download_and_prepare(
  File "/cluster/home/tamariucai/.local/lib/python3.10/site-packages/datasets/builder.py", line 891, in download_and_prepare
    self._download_and_prepare(
  File "/cluster/home/tamariucai/.local/lib/python3.10/site-packages/datasets/builder.py", line 986, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/cluster/home/tamariucai/.local/lib/python3.10/site-packages/datasets/builder.py", line 1748, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/cluster/home/tamariucai/.local/lib/python3.10/site-packages/datasets/builder.py", line 1893, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

Expected behavior

The dataset is loaded in variable dataset.

Environment info

  • datasets version: 2.11.0
  • Platform: Linux-3.10.0-1160.80.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.10.4
  • Huggingface_hub version: 0.13.3
  • PyArrow version: 11.0.0
  • Pandas version: 1.5.3
@lhoestq (Member) commented Apr 4, 2023

Hi ! It looks like an issue with PyArrow: https://issues.apache.org/jira/browse/ARROW-5030

It appears this can happen when Parquet files contain row groups larger than 2GB.
I can see that your Parquet files are around 10GB. It is usually advised to keep the shard size around the default of 500MB to avoid these issues.

Note that the row group size is currently defined simply by the number of rows, datasets.config.DEFAULT_MAX_BATCH_SIZE, so reducing this value could let you keep Parquet files bigger than 2GB while keeping each row group below 2GB.
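
A minimal sketch of that tweak (the attribute lives in datasets.config; the value below is only illustrative, not a recommendation from this thread):

    import datasets

    # Fewer rows per written batch -> smaller Parquet row groups, so each
    # row group stays well under 2GB even if the shard files themselves are large.
    datasets.config.DEFAULT_MAX_BATCH_SIZE = 1_000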

Would it be possible for you to re-upload the dataset with the default shard size of 500MB?
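
For reference, a sketch of what that re-upload could look like, reusing the push_to_hub call that appears later in this thread (the local path is hypothetical):

    from datasets import load_from_disk

    dataset = load_from_disk("/path/to/local/wit")  # hypothetical local copy
    dataset.push_to_hub("theodor1289/wit", max_shard_size="500MB")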

@amariucaitheodor (Author)

Hey, thanks for the reply! I've since switched to working with the locally saved dataset (which works).
Maybe it makes sense to show a warning for uploads with large shard sizes, since loading then completely breaks due to the PyArrow bug?
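
For the record, the local workflow amounts to something like this (paths hypothetical; save_to_disk/load_from_disk are the standard datasets APIs):

    from datasets import load_from_disk

    # After building the dataset once: dataset.save_to_disk("/path/to/wit_local")
    # Loading from disk reads the Arrow files directly and skips the Parquet
    # conversion path that raises ArrowNotImplementedError.
    dataset = load_from_disk("/path/to/wit_local")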

@amariucaitheodor (Author)

I just tried uploading the same dataset with 500MB shards; I got an error 4 hours in:

Pushing dataset shards to the dataset hub:  25%|██▍       | 358/1453 [4:40:31<14:18:00, 47.01s/it]
Traceback (most recent call last):
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/_commit_api.py", line 344, in _inner_upload_lfs_object
    return _upload_lfs_object(operation=operation, lfs_batch_action=batch_action, token=token)
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/_commit_api.py", line 391, in _upload_lfs_object
    lfs_upload(
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/lfs.py", line 254, in lfs_upload
    _upload_multi_part(
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/lfs.py", line 374, in _upload_multi_part
    hf_raise_for_status(part_upload_res)
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 301, in hf_raise_for_status
    raise HfHubHTTPError(str(e), response=response) from e
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 46, in __init__
    server_data = response.json()
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/requests/models.py", line 899, in json
    return complexjson.loads(
  File "/cluster/work/cotterell/tamariucai/miniconda3/envs/torch-multimodal/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/cluster/work/cotterell/tamariucai/miniconda3/envs/torch-multimodal/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/cluster/work/cotterell/tamariucai/miniconda3/envs/torch-multimodal/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "process_wit.py", line 146, in <module>
    dataset.push_to_hub(FINAL_PATH, max_shard_size="500MB", private=False)
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/datasets/dataset_dict.py", line 1534, in push_to_hub
    repo_id, split, uploaded_size, dataset_nbytes, _, _ = self[split]._push_parquet_shards_to_hub(
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 4804, in _push_parquet_shards_to_hub
    _retry(
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 281, in _retry
    return func(*func_args, **func_kwargs)
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/hf_api.py", line 2593, in upload_file
    commit_info = self.create_commit(
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/hf_api.py", line 2411, in create_commit
    upload_lfs_files(
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/_commit_api.py", line 351, in upload_lfs_files
    thread_map(
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/cluster/work/cotterell/tamariucai/miniconda3/envs/torch-multimodal/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/cluster/work/cotterell/tamariucai/miniconda3/envs/torch-multimodal/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/cluster/work/cotterell/tamariucai/miniconda3/envs/torch-multimodal/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/cluster/work/cotterell/tamariucai/miniconda3/envs/torch-multimodal/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/cluster/home/tamariucai/.local/lib/python3.8/site-packages/huggingface_hub/_commit_api.py", line 346, in _inner_upload_lfs_object
    raise RuntimeError(f"Error while uploading '{operation.path_in_repo}' to the Hub.") from exc
RuntimeError: Error while uploading 'data/train-00358-of-01453-22a5cc8b3eb12be3.parquet' to the Hub.

Local saves do work, however.

@lhoestq (Member) commented Apr 8, 2023

Hmmm, that was probably an intermittent bug. You can resume the upload by re-running push_to_hub.
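
A rough sketch of that resume pattern (the retry loop is illustrative only; FINAL_PATH and the push_to_hub arguments are taken from the traceback above):

    # Re-running push_to_hub with the same arguments continues the upload;
    # shards that already reached the Hub should not need to be re-sent.
    for attempt in range(5):
        try:
            dataset.push_to_hub(FINAL_PATH, max_shard_size="500MB", private=False)
            break
        except Exception as err:  # e.g. a transient HfHubHTTPError
            print(f"Upload failed ({err!r}), retrying ({attempt + 1}/5)...")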

@amariucaitheodor (Author)

Leaving this other error here for the record; it occurs when I load the 700+GB dataset from the Hub with 500MB shards:

 Traceback (most recent call last):                                                        
  File "/cluster/home/tamariucai/.local/lib/python3.10/site-packages/datasets/builder.py", line 1860, in _prepare_split_single
    for _, table in generator:
  File "/cluster/home/tamariucai/.local/lib/python3.10/site-packages/datasets/packaged_modules/parquet/parquet.py", line 69, in _generate_tables
    for batch_idx, record_batch in enumerate(
  File "pyarrow/_parquet.pyx", line 1323, in iter_batches
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Corrupt snappy compressed data.

I will probably switch back to the local big dataset or shrink it.
