
Bugs: dataset.map() is frozen on ELI5 #482

Closed · ratthachat opened this issue Aug 7, 2020 · 8 comments · Fixed by #496

ratthachat commented Aug 7, 2020

Hi Huggingface Team!

Thank you guys once again for this amazing repo.

I have tried to prepare ELI5 for training with T5, based on this wonderful notebook by Suraj Patil.

However, when I run dataset.map() on ELI5 to prepare input_text and target_text, dataset.map() freezes within the first few hundred examples. In contrast, it works totally fine on SQuAD (80,000 examples). Both nlp versions 0.3.0 and 0.4.0 produce the frozen process, and trying various pyarrow versions (0.16.0 / 0.17.0 / 1.0.0) gives the same frozen process as well.

Reproducible code can be found in this colab notebook, where I also show that the same mapping function works fine on SQuAD, so the problem is likely due to ELI5 somehow.


More info: instead of map, if I run a for loop and apply the function myself, there is no error and it finishes within 10 seconds. However, an nlp dataset is immutable (I couldn't manually assign a new key-value pair to the dataset object).

I also notice that the SQuAD texts are quite clean while the ELI5 texts contain many special characters; I'm not sure if this is the cause?
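For reference, here is a minimal sketch of the kind of mapping function involved. The actual make_input_target lives in the linked notebook, so the field choices below are assumptions based on the ELI5 schema in nlp:

```python
# Hedged sketch, not the notebook's exact code: build T5-style
# input_text/target_text fields from an ELI5 example.
def make_input_target(example):
    # ELI5 examples carry the question in "title" and a dict of
    # answers with a "text" list; here we take the top answer.
    example["input_text"] = "question: %s" % example["title"]
    example["target_text"] = example["answers"]["text"][0]
    return example

valid_dataset = valid_dataset.map(make_input_target)
```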

lhoestq (Member) commented Aug 11, 2020

This comes from an overflow in pyarrow's arrays.
The process is stuck inside the loop that reduces the batch size to avoid the overflow.
I'll take a look.
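To make the failure mode concrete, here is a purely illustrative sketch (my own, not the actual nlp source) of how a shrink-the-batch loop can spin forever once the overflow check mishandles an empty list:

```python
# Illustrative only: a check that wrongly treats an empty list as an
# overflow can never be satisfied by halving the batch size.
def looks_like_overflow(rows):
    # Buggy check: an empty list is misclassified as overflowing.
    return len(rows) == 0 or sum(len(str(r)) for r in rows) > 2**31

def shrink_until_ok(batch):
    size = len(batch)
    while looks_like_overflow(batch[:size]):
        size //= 2  # size reaches 0, and batch[:0] == [] still "overflows"
    return batch[:size]
```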

lhoestq (Member) commented Aug 11, 2020

I created a PR to fix the issue.
It was due to an overflow check that handled an empty list badly.

You can try the changes by using

```
!pip install git+https://github.com/huggingface/nlp.git@fix-bad-type-in-overflow-check
```

Also, I noticed that the first 1000 examples have an empty list in the title_urls field. Because of that, the feature type inference in .map will consider the field null, and it will crash when it encounters the next example with a non-empty title_urls.
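As a minimal pyarrow illustration of that inference problem (my own example, not from the notebook):

```python
import pyarrow as pa

# A column whose first batch contains only empty lists is inferred
# as list<item: null> ...
first_batch = pa.array([[], [], []])
print(first_batch.type)   # list<item: null>

# ... which conflicts with a later batch that holds actual strings.
later_batch = pa.array([["Thepiratebay.vg"]])
print(later_batch.type)   # list<item: string>
```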

Therefore, to fix that for now, you can increase the writer batch size so that the feature inference takes into account at least one example with a non-empty title_urls:

```python
# the default batch size is 1_000 and it's not enough for feature type
# inference because of the empty lists
valid_dataset = valid_dataset.map(make_input_target, writer_batch_size=3_000)
```

I was able to run the frozen cell with these changes.

ratthachat (Author) commented

@lhoestq Perfect and thank you very much!!
Closing the issue.

ratthachat (Author) commented Aug 12, 2020

@lhoestq mapping the function make_input_target now passes with your fix.

However, there is another error in the final step, valid_dataset.map(convert_to_features, batched=True):

```
ArrowInvalid: Could not convert Thepiratebay.vg with type str: converting to null type
```

(The same colab notebook as above, with the new error message.)

Do you have any ideas? (I am really sorry I could not debug this myself since I have never used pyarrow before.)
Note that train_dataset.map(convert_to_features, batched=True) runs successfully even though train_dataset is 27x bigger than valid_dataset, so I believe the problem lies in some field of valid_dataset again.

lhoestq (Member) commented Aug 12, 2020

I got this issue too and fixed it by specifying writer_batch_size=3_000 in .map.
This is because Arrow didn't expect Thepiratebay.vg in title_urls, as all the previous examples have empty lists in title_urls.
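As a sketch, the same workaround applied to the failing step (convert_to_features is the notebook's function; the parameter value comes from the comments above):

```python
# Raise writer_batch_size so the first batch written to Arrow contains
# at least one non-empty title_urls, letting the column be inferred as
# list<string> instead of list<null>.
valid_dataset = valid_dataset.map(
    convert_to_features,
    batched=True,
    writer_batch_size=3_000,
)
```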

ratthachat (Author) commented

It's all clear now. Thanks so much again, Quentin!

lxe commented Apr 5, 2023

I'm getting a hanging dataset.map() when running a gradio app launched with gradio (for auto-reloading) instead of python.

lhoestq (Member) commented Apr 6, 2023

Maybe this is an issue with gradio; could you open an issue on their repo? Dataset.map simply uses multiprocess.Pool for multiprocessing.

If you interrupt the program, maybe the stack trace will give some information about where it was hanging in the code (a lock somewhere?).
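One way to capture that stack trace without killing the process is Python's faulthandler (my suggestion, not from the thread; faulthandler.register is not available on Windows):

```python
import faulthandler
import signal

# Dump every thread's traceback when the process receives SIGUSR1,
# so a hang inside dataset.map can be located with: kill -USR1 <pid>
faulthandler.register(signal.SIGUSR1, all_threads=True)
```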
