* New function `util.load_dataset_via_pandas` to be used instead of `Dataset.load_dataset`. Using a pandas DataFrame as an intermediate format somehow prevents the `datasets.builder.DatasetGenerationError`s that previously occurred while processing some files. This was presented as a solution for a related problem in huggingface/datasets#5531.
* Use the new function in CLI script modify_dataset.py
* Remove the temporary file used for finding the problem
This code fails:
raises
This causes many issues for @TevenLeScao:
* `map` fails because it fails to rewrite invalid Arrow arrays
* `to_dict()` segfaults

To reproduce: unzip the archive and run the above code using `sanity_oscar_en.jsonl`
sanity_oscar_en.jsonl.zip
PS: reading using pandas and converting to Arrow works though (note that the dataset lives in RAM in this case):