
dataloading slow when using HUGE dataset #2210

Closed
hwijeen opened this issue Apr 12, 2021 · 2 comments


hwijeen commented Apr 12, 2021

Hi,

When I use datasets with 600GB of data, data loading becomes significantly slower.
I am experimenting with two datasets: one is about 60GB and the other about 600GB.
In short, my code calls the datasets.set_format("torch") function and lets pytorch-lightning handle DDP training.
Looking at the pytorch-lightning profiler output for the two runs, I see that fetching a batch (get_train_batch) takes an unreasonable amount of time when the data is large: about 6.3 s per call on the 600GB run versus about 0.066 s on the 60GB run. What could be the cause?

  • 60GB data
Action                             	|  Mean duration (s)	|Num calls      	|  Total time (s) 	|  Percentage %   	|
------------------------------------------------------------------------------------------------------------------------------------
Total                              	|  -              	|_              	|  200.33         	|  100 %          	|
------------------------------------------------------------------------------------------------------------------------------------
run_training_epoch                 	|  71.994         	|1              	|  71.994         	|  35.937         	|
run_training_batch                 	|  0.64373        	|100            	|  64.373         	|  32.133         	|
optimizer_step_and_closure_0       	|  0.64322        	|100            	|  64.322         	|  32.108         	|
training_step_and_backward         	|  0.61004        	|100            	|  61.004         	|  30.452         	|
model_backward                     	|  0.37552        	|100            	|  37.552         	|  18.745         	|
model_forward                      	|  0.22813        	|100            	|  22.813         	|  11.387         	|
training_step                      	|  0.22759        	|100            	|  22.759         	|  11.361         	|
get_train_batch                    	|  0.066385       	|100            	|  6.6385         	|  3.3138         	|
  • 600GB data
Action                             	|  Mean duration (s)	|Num calls      	|  Total time (s) 	|  Percentage %   	|
------------------------------------------------------------------------------------------------------------------------------------
Total                              	|  -              	|_              	|  3285.6         	|  100 %          	|
------------------------------------------------------------------------------------------------------------------------------------
run_training_epoch                 	|  1397.9         	|1              	|  1397.9         	|  42.546         	|
run_training_batch                 	|  7.2596         	|100            	|  725.96         	|  22.095         	|
optimizer_step_and_closure_0       	|  7.2589         	|100            	|  725.89         	|  22.093         	|
training_step_and_backward         	|  7.223          	|100            	|  722.3          	|  21.984         	|
model_backward                     	|  6.9662         	|100            	|  696.62         	|  21.202         	|
get_train_batch                    	|  6.322          	|100            	|  632.2          	|  19.241         	|
model_forward                      	|  0.24902        	|100            	|  24.902         	|  0.75789        	|
training_step                      	|  0.2485         	|100            	|  24.85          	|  0.75633        	|
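
To check whether the slowdown comes from batch fetching itself rather than from the training loop, the wait per batch can be timed outside of pytorch-lightning. The following is a minimal sketch, not the code used in this report: `time_batch_fetches` and `mean_seconds` are hypothetical helper names, and the function accepts any iterable, so a PyTorch `DataLoader` built on the formatted dataset could be passed in place of the stand-in used here.

```python
import time
from typing import Iterable, List


def time_batch_fetches(loader: Iterable, num_batches: int = 100) -> List[float]:
    """Return the wall-clock seconds spent waiting for each fetched batch."""
    durations = []
    iterator = iter(loader)  # iterator creation is not included in the timings
    for _ in range(num_batches):
        start = time.perf_counter()
        try:
            next(iterator)
        except StopIteration:
            break  # the loader had fewer than num_batches batches
        durations.append(time.perf_counter() - start)
    return durations


def mean_seconds(durations: List[float]) -> float:
    """Mean of the measured durations, or 0.0 if nothing was measured."""
    return sum(durations) / len(durations) if durations else 0.0


if __name__ == "__main__":
    # Stand-in iterable; replace with the real training dataloader.
    fake_loader = ([0] * 8 for _ in range(100))
    timings = time_batch_fetches(fake_loader, num_batches=100)
    print(f"fetched {len(timings)} batches, mean {mean_seconds(timings):.6f} s")
```

Running this against both the 60GB and the 600GB dataset with the same batch size would show whether the per-batch wait alone reproduces the gap seen in get_train_batch.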
Member

lhoestq commented Apr 12, 2021

Hi! Yes, this is an issue with datasets<=1.5.0.
It has been fixed by #2122, and we'll do a new release soon :)
For now you can test it on the master branch.
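
A quick way to see whether an environment falls in the affected range is to compare the installed version against 1.5.0. This is a rough sketch under stated assumptions: `parse_release` and `is_affected` are hypothetical helpers that compare only the leading numeric release components, so development or pre-release suffixes (as a source install from master may report) are ignored and cannot be classified reliably.

```python
from importlib.metadata import PackageNotFoundError, version

# From the reply above: datasets<=1.5.0 is affected.
LAST_AFFECTED = (1, 5, 0)


def parse_release(version_string):
    """Parse leading numeric components, e.g. '1.5.0.dev0' gives (1, 5, 0)."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def is_affected(version_string):
    """True if the numeric release is at or below the last affected release."""
    release = (parse_release(version_string) + (0, 0, 0))[:3]  # pad short versions
    return release <= LAST_AFFECTED


if __name__ == "__main__":
    try:
        installed = version("datasets")
    except PackageNotFoundError:
        print("datasets is not installed")
    else:
        status = "in the affected range" if is_affected(installed) else "newer than 1.5.0"
        print(f"datasets {installed}: {status}")
```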

Author

hwijeen commented Apr 13, 2021

Hi, thank you for your answer. I did not realize that my issue stems from the same problem.
