Benchmark the new PyT data loader (with sparse tensors support) scalability with multi-GPU and larger datasets #28
Gabriel and I did a debugging session and found that the problem with distributed model training happens between 50-70% of the way through training on the first parquet file (day) when these arguments are set in our dataloader.
When these arguments are disabled, we can train on two GPUs (but both using the same dataset). So most likely the issue is caused by our NVT PyT dataloader. We can reproduce it quickly with ecom_small.
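The symptom above (both GPUs training on the same dataset) typically means the input files are not being partitioned across ranks. A minimal sketch of round-robin file sharding, so each GPU reads a disjoint slice of the day's parquet parts (the file names and the `shard` helper are illustrative, not the library's actual API):

```python
# Assumption: each day's data is stored as many parquet part files.
files = [f"day_0/part_{i}.parquet" for i in range(10)]

def shard(files, rank, world_size):
    # Round-robin: rank r takes files r, r + world_size, r + 2*world_size, ...
    return files[rank::world_size]

shard0 = shard(files, rank=0, world_size=2)
shard1 = shard(files, rank=1, world_size=2)

# The two shards cover everything and never overlap, so each GPU
# sees different data instead of both reading the full dataset.
assert set(shard0) | set(shard1) == set(files)  # nothing dropped
assert set(shard0) & set(shard1) == set()       # no duplication
```

With uneven file sizes, sharding by file can still leave ranks with unequal step counts per epoch, which ties into the worker-synchronization issue discussed below.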
Gabriel, Julio and I did another debugging session, and it looks like one of our workers is not waiting for the other worker, which creates a bottleneck. Options/guidance to explore:
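To illustrate the "one worker not waiting for the other" failure mode: in data-parallel training, ranks must reach the same synchronization point each step, or the fast rank races ahead (or hangs in a collective). A minimal sketch using plain Python threads as stand-ins for the two GPU workers (`threading.Barrier` here plays the role that a collective sync like an all-reduce or `torch.distributed.barrier()` would play in real distributed training):

```python
import threading
import time

NUM_WORKERS = 2
barrier = threading.Barrier(NUM_WORKERS)  # stand-in for a distributed sync
log = []

def worker(rank, step_time):
    for step in range(3):
        time.sleep(step_time)  # simulate uneven per-step work across ranks
        barrier.wait()         # neither rank proceeds until both arrive
        log.append((step, rank))

threads = [
    threading.Thread(target=worker, args=(rank, 0.01 * (rank + 1)))
    for rank in range(NUM_WORKERS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Thanks to the barrier, both ranks finish step k before either starts
# step k + 1, so the logged step numbers are non-decreasing.
steps = [step for step, _ in log]
assert steps == sorted(steps)
```

If the two ranks see different numbers of batches per epoch (e.g. from uneven shards), one rank reaches the sync point an extra time and the job deadlocks, which is consistent with the stall seen partway through the first parquet file.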
@rnyak Does the library currently support training in a multi-GPU configuration? Even though I have multiple GPUs, training happens on only one of them rather than being parallelized across both. Is there a way to add this to our trainer?
@Ahanmr we are currently working on supporting multi-GPU training.
@rnyak after preparing a custom dataset, I have 1321 folders, one for each day.
Any help would be great.
Benchmark the new PyT data loader with the REES46 ecommerce dataset, using multiple GPUs
Train set: all train.parquet files for the 31 days (1 parquet file per week). P.S. set the row group size accordingly
Eval set: All valid.parquet files concatenated