dataloading slow when using HUGE dataset #2210

hwijeen · 2021-04-12T08:33:02Z

Hi,

When I use datasets with 600GB of data, dataloading becomes significantly slower.
I am experimenting with two datasets, one about 60GB and the other about 600GB.
Simply put, my code calls datasets.set_format("torch") and lets pytorch-lightning handle DDP training.
Looking at the pytorch-lightning profiler output for the two runs, I see that fetching a batch (get_train_batch) takes an unreasonable amount of time when the data is large. What could be the cause?

60GB data

Action                             	|  Mean duration (s)	|Num calls      	|  Total time (s) 	|  Percentage %   	|
------------------------------------------------------------------------------------------------------------------------------------
Total                              	|  -              	|_              	|  200.33         	|  100 %          	|
------------------------------------------------------------------------------------------------------------------------------------
run_training_epoch                 	|  71.994         	|1              	|  71.994         	|  35.937         	|
run_training_batch                 	|  0.64373        	|100            	|  64.373         	|  32.133         	|
optimizer_step_and_closure_0       	|  0.64322        	|100            	|  64.322         	|  32.108         	|
training_step_and_backward         	|  0.61004        	|100            	|  61.004         	|  30.452         	|
model_backward                     	|  0.37552        	|100            	|  37.552         	|  18.745         	|
model_forward                      	|  0.22813        	|100            	|  22.813         	|  11.387         	|
training_step                      	|  0.22759        	|100            	|  22.759         	|  11.361         	|
get_train_batch                    	|  0.066385       	|100            	|  6.6385         	|  3.3138         	|

600GB data

Action                             	|  Mean duration (s)	|Num calls      	|  Total time (s) 	|  Percentage %   	|
------------------------------------------------------------------------------------------------------------------------------------
Total                              	|  -              	|_              	|  3285.6         	|  100 %          	|
------------------------------------------------------------------------------------------------------------------------------------
run_training_epoch                 	|  1397.9         	|1              	|  1397.9         	|  42.546         	|
run_training_batch                 	|  7.2596         	|100            	|  725.96         	|  22.095         	|
optimizer_step_and_closure_0       	|  7.2589         	|100            	|  725.89         	|  22.093         	|
training_step_and_backward         	|  7.223          	|100            	|  722.3          	|  21.984         	|
model_backward                     	|  6.9662         	|100            	|  696.62         	|  21.202         	|
get_train_batch                    	|  6.322          	|100            	|  632.2          	|  19.241         	|
model_forward                      	|  0.24902        	|100            	|  24.902         	|  0.75789        	|
training_step                      	|  0.2485         	|100            	|  24.85          	|  0.75633        	|
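
One way to check whether the get_train_batch numbers above come from batch fetching itself, rather than from the training loop, would be to time the iterator directly, outside pytorch-lightning. The following is only a minimal sketch: `time_batch_fetch` is a hypothetical helper (not part of datasets or pytorch-lightning), and it accepts any iterable, for example a PyTorch DataLoader built on a dataset formatted with set_format("torch").

```python
import time


def time_batch_fetch(batches, num_batches=100):
    """Time how long it takes to pull up to `num_batches` items from `batches`.

    `batches` can be any iterable (a DataLoader, a generator, a list).
    Returns (mean_seconds, total_seconds, batches_fetched).
    Hypothetical helper for illustration only.
    """
    iterator = iter(batches)
    durations = []
    for _ in range(num_batches):
        start = time.perf_counter()
        try:
            next(iterator)  # only the fetch is timed, no forward/backward pass
        except StopIteration:
            break  # the iterable had fewer items than requested
        durations.append(time.perf_counter() - start)
    total = sum(durations)
    mean = total / len(durations) if durations else 0.0
    return mean, total, len(durations)


if __name__ == "__main__":
    # Stand-in iterable; in practice this would be the training DataLoader.
    mean, total, fetched = time_batch_fetch(range(1000), num_batches=100)
    print(f"fetched={fetched} mean={mean:.6f}s total={total:.6f}s")
```

Running this against both the 60GB and the 600GB dataset with the same batch size would show whether the per-batch fetch time alone differs between them.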

lhoestq · 2021-04-12T20:29:55Z

Hi! Yes, this is an issue with datasets<=1.5.0.
It has been fixed by #2122, and we'll do a new release soon :)
For now you can test it on the master branch.
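
To see whether an environment falls in the affected range mentioned here (datasets<=1.5.0), a quick version check can help. This is a rough sketch under stated assumptions: `is_affected` and `installed_datasets_version` are hypothetical helpers, the parsing only looks at the leading numeric X.Y.Z part, and development builds installed from the master branch may not be classified reliably.

```python
from importlib import metadata


def is_affected(version, last_affected=(1, 5, 0)):
    """Return True if `version` (e.g. "1.5.0") is <= `last_affected`.

    Only the leading digits of the first three dot-separated parts are used;
    suffixes such as "dev0" or "rc1" are ignored (assumption for simplicity).
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)  # treat "2.0" as "2.0.0"
    return tuple(parts) <= last_affected


def installed_datasets_version():
    """Return the installed datasets version string, or None if not installed."""
    try:
        return metadata.version("datasets")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_datasets_version()
    if found is None:
        print("datasets is not installed")
    else:
        print(found, "affected" if is_affected(found) else "not in the affected range")
```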

hwijeen · 2021-04-13T02:03:05Z

Hi, thank you for your answer. I did not realize that my issue stems from the same problem.

hwijeen closed this as completed Apr 13, 2021

hwijeen mentioned this issue Apr 23, 2021

Slow dataloading with big datasets issue persists #2252

Closed

finiteautomata mentioned this issue Jul 31, 2021

Workaround for training models with really big text files huggingface/transformers#12966

Closed
