Bug: dataset.map() is frozen on ELI5 #482
Comments
This comes from an overflow in pyarrow's array.

I created a PR to fix the issue; you can try out the changes from that branch.

Also, I noticed that the first 1000 examples have an empty list in one of the fields. To work around that for now, you can increase the writer batch size so that the feature type inference takes into account at least one example with a non-empty list:

```python
# default batch size is 1_000 and it's not enough for feature type inference because of empty lists
valid_dataset = valid_dataset.map(make_input_target, writer_batch_size=3_000)
```

I was able to run the frozen cell with these changes.
@lhoestq Perfect, and thank you very much!!
@lhoestq Mapping the function works now. However, there is another error in the final step.

Do you have some ideas? (I am really sorry I could not debug it by myself since I have never used it before.)
I got this issue too and fixed it by specifying `writer_batch_size`.
I am clear now. Thanks so much again, Quentin!
I'm getting a hanging `dataset.map()` as well.
Maybe this is an issue with gradio; could you open an issue on their repo? If you interrupt the program, maybe the stack trace would give some information about where it was hanging in the code (maybe a lock somewhere?).
Hi Huggingface Team!
Thank you guys once again for this amazing repo.
I have tried to prepare ELI5 to train with T5, based on this wonderful notebook of Suraj Patil.

However, when I run `dataset.map()` on ELI5 to prepare `input_text` and `target_text`, `dataset.map` freezes within the first hundreds of examples. On the contrary, this works totally fine on SQuAD (80,000 examples). Both `nlp` versions 0.3.0 and 0.4.0 cause the frozen process, and various `pyarrow` versions (0.16.0 / 0.17.0 / 1.0.0) also show the same frozen process. Reproducible code can be found in this Colab notebook, where I also show that the same mapping function works fine on SQuAD, so the problem is likely due to ELI5 somehow.

More info: instead of `map`, if I run a `for` loop and apply the function myself, there is no error and it finishes within 10 seconds. However, an `nlp` dataset is immutable (I couldn't manually assign a new key-value pair to the `dataset` object).

I also notice that SQuAD texts are quite clean while ELI5 texts contain many special characters; not sure if this is the cause?