
Bugs: dataset.map() is frozen on ELI5 #482

Closed · ratthachat opened this issue Aug 7, 2020 · 8 comments · Fixed by #496

ratthachat commented Aug 7, 2020

Hi Huggingface Team!

Thank you guys once again for this amazing repo.

I have tried to prepare ELI5 for training with T5, based on this wonderful notebook by Suraj Patil.

However, when I run dataset.map() on ELI5 to prepare input_text and target_text, dataset.map() freezes within the first few hundred examples. In contrast, it works totally fine on SQuAD (80,000 examples). Both nlp versions 0.3.0 and 0.4.0 produce the frozen process, and trying various pyarrow versions (0.16.0 / 0.17.0 / 1.0.0) gives the same frozen process as well.

Reproducible code can be found in this colab notebook, where I also show that the same mapping function works fine on SQuAD, so the problem is likely due to ELI5 somehow.


More info: instead of map, if I run a for loop and apply the function myself, there is no error and it finishes within 10 seconds. However, an nlp dataset is immutable (I couldn't manually assign a new key-value pair to the dataset object).

I also notice that the SQuAD texts are quite clean while the ELI5 texts contain many special characters; I'm not sure if this is the cause?
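For reference, here is a minimal sketch of the kind of mapping function involved. The actual make_input_target lives in the linked notebook, so the field choices below are assumptions based on the ELI5 schema in nlp:

```python
# Hedged sketch, not the notebook's exact code: build T5-style
# input_text/target_text fields from an ELI5 example.
def make_input_target(example):
    # ELI5 examples carry the question in "title" and a dict of
    # answers with a "text" list; here we take the top answer.
    example["input_text"] = "question: %s" % example["title"]
    example["target_text"] = example["answers"]["text"][0]
    return example

valid_dataset = valid_dataset.map(make_input_target)
```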

lhoestq (Member) commented Aug 11, 2020

This comes from an overflow in pyarrow's arrays.
The process is stuck inside the loop that reduces the batch size to avoid the overflow.
I'll take a look.
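To make the failure mode concrete, here is a purely illustrative sketch (my own, not the actual nlp source) of how a shrink-the-batch loop can spin forever once the overflow check mishandles an empty list:

```python
# Illustrative only: a check that wrongly treats an empty list as an
# overflow can never be satisfied by halving the batch size.
def looks_like_overflow(rows):
    # Buggy check: an empty list is misclassified as overflowing.
    return len(rows) == 0 or sum(len(str(r)) for r in rows) > 2**31

def shrink_until_ok(batch):
    size = len(batch)
    while looks_like_overflow(batch[:size]):
        size //= 2  # size reaches 0, and batch[:0] == [] still "overflows"
    return batch[:size]
```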

lhoestq (Member) commented Aug 11, 2020

I created a PR to fix the issue.
It was due to an overflow check that handled an empty list badly.

You can try the changes by using

```
!pip install git+https://github.com/huggingface/nlp.git@fix-bad-type-in-overflow-check
```

Also, I noticed that the first 1000 examples have an empty list in the title_urls field. Because of that, the feature type inference in .map will consider the field null, and it will crash when it encounters the next example with a non-empty title_urls.
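As a minimal pyarrow illustration of that inference problem (my own example, not from the notebook):

```python
import pyarrow as pa

# A column whose first batch contains only empty lists is inferred
# as list<item: null> ...
first_batch = pa.array([[], [], []])
print(first_batch.type)   # list<item: null>

# ... which conflicts with a later batch that holds actual strings.
later_batch = pa.array([["Thepiratebay.vg"]])
print(later_batch.type)   # list<item: string>
```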

Therefore, to fix that for now, you can increase the writer batch size so that the feature inference takes into account at least one example with a non-empty title_urls:

```python
# the default batch size is 1_000 and it's not enough for feature type
# inference because of the empty lists
valid_dataset = valid_dataset.map(make_input_target, writer_batch_size=3_000)
```

I was able to run the frozen cell with these changes.

ratthachat (Author) commented

@lhoestq Perfect and thank you very much!!
Closing the issue.

ratthachat (Author) commented Aug 12, 2020

@lhoestq mapping the function make_input_target now passes with your fix.

However, there is another error in the final step, valid_dataset.map(convert_to_features, batched=True):

```
ArrowInvalid: Could not convert Thepiratebay.vg with type str: converting to null type
```

(The same colab notebook as above, with the new error message.)

Do you have any ideas? (I am really sorry I could not debug this myself since I have never used pyarrow before.)
Note that train_dataset.map(convert_to_features, batched=True) runs successfully even though train_dataset is 27x bigger than valid_dataset, so I believe the problem lies in some field of valid_dataset again.

lhoestq (Member) commented Aug 12, 2020

I got this issue too and fixed it by specifying writer_batch_size=3_000 in .map.
This is because Arrow didn't expect Thepiratebay.vg in title_urls, as all the previous examples have empty lists in title_urls.
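As a sketch, the same workaround applied to the failing step (convert_to_features is the notebook's function; the parameter value comes from the comments above):

```python
# Raise writer_batch_size so the first batch written to Arrow contains
# at least one non-empty title_urls, letting the column be inferred as
# list<string> instead of list<null>.
valid_dataset = valid_dataset.map(
    convert_to_features,
    batched=True,
    writer_batch_size=3_000,
)
```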

ratthachat (Author) commented

It's all clear now. Thanks so much again, Quentin!

lxe commented Apr 5, 2023

I'm getting a hanging dataset.map() when running a gradio app launched with gradio (for auto-reloading) instead of python.

lhoestq (Member) commented Apr 6, 2023

Maybe this is an issue with gradio; could you open an issue on their repo? Dataset.map simply uses multiprocess.Pool for multiprocessing.

If you interrupt the program, maybe the stack trace will give some information about where it was hanging in the code (a lock somewhere?).
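One way to capture that stack trace without killing the process is Python's faulthandler (my suggestion, not from the thread; faulthandler.register is not available on Windows):

```python
import faulthandler
import signal

# Dump every thread's traceback when the process receives SIGUSR1,
# so a hang inside dataset.map can be located with: kill -USR1 <pid>
faulthandler.register(signal.SIGUSR1, all_threads=True)
```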
