Loading big dataset raises pyarrow.lib.ArrowNotImplementedError #5695
Comments
Hi! It looks like an issue with PyArrow: https://issues.apache.org/jira/browse/ARROW-5030. It can happen when you have parquet files with row groups larger than 2GB. Note that currently the row group size is simply defined by the number of rows. Would it be possible for you to re-upload the dataset with the default shard size of 500MB?
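As a rough sketch of the suggested re-upload: `push_to_hub` accepts a `max_shard_size` argument that caps each parquet shard (and hence its row groups) well under the 2GB limit mentioned above. The `reupload` helper and the repo name are hypothetical; only the 500MB figure comes from this thread.

```python
def shards_needed(total_bytes: int, max_shard_bytes: int = 500 * 1024**2) -> int:
    """Roughly how many shards a dataset splits into at a given shard size."""
    return -(-total_bytes // max_shard_bytes)  # ceiling division

def reupload(dataset, repo_id: str):
    # Hypothetical wrapper around the re-upload the maintainer suggests;
    # requires `pip install datasets` and `huggingface-cli login`.
    dataset.push_to_hub(repo_id, max_shard_size="500MB")

# A ~700GB dataset would split into about 1434 shards of 500MB each:
print(shards_needed(700 * 1024**3))
```

At 500MB per shard, none of the resulting parquet files can contain a row group anywhere near the 2GB threshold, which is why the smaller default shard size sidesteps the `ArrowNotImplementedError`.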
Hey, thanks for the reply! I've since switched to working with the locally-saved dataset (which works).
Just tried uploading the same dataset with 500MB shards, and I get an error 4 hours in:
Local saves do work, however.
Hmmm, that was probably an intermittent bug; you can resume the upload by re-running `push_to_hub`.
Leaving this other error here for the record, which occurs when I load the 700+GB dataset from the hub with shard sizes of 500MB:
I will probably switch back to the local big dataset or shrink it.
Describe the bug

Calling `datasets.load_dataset` to load the (publicly available) dataset `theodor1289/wit` fails with `pyarrow.lib.ArrowNotImplementedError`.

Steps to reproduce the bug
Steps to reproduce this behavior:
!pip install datasets
!huggingface-cli login
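The loading call itself is not shown above; a minimal reproduction, assuming the dataset name from the bug description, would presumably look like the following. The call is wrapped in `main()` rather than run on import, since it needs `datasets` installed, a completed `huggingface-cli login`, and a network connection.

```python
def main():
    from datasets import load_dataset

    # This call reportedly raised pyarrow.lib.ArrowNotImplementedError
    # for the large parquet shards described in this issue.
    dataset = load_dataset("theodor1289/wit")
    print(dataset)

if __name__ == "__main__":
    main()
```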
Stack trace:
Expected behavior

The dataset is loaded in the variable `dataset`.

Environment info

- `datasets` version: 2.11.0