Benchmark the new PyT data loader (with sparse tensors support) scalability with multi-GPU and larger datasets #28

Open · 3 tasks
gabrielspmoreira opened this issue Jun 8, 2021 · 5 comments

gabrielspmoreira (Member) commented Jun 8, 2021

Benchmark the new PyT data loader with the REES46 ecommerce dataset, using multiple GPUs

Train set: All train.parquet files for 31 days (1 parquet file per week). P.S. Set the row group size accordingly (see the sketch below).
Eval set: All valid.parquet files concatenated
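
A minimal sketch of how the row group size could be controlled when (re)writing the weekly parquet files, assuming pyarrow is used (file names and the row count are illustrative, not from the repo):

    import pyarrow.parquet as pq

    # Rewrite one weekly file with smaller row groups so that the data loader
    # can shard rows evenly across multiple GPU workers.
    table = pq.read_table("train.parquet")  # one weekly train file
    pq.write_table(table, "train_rg.parquet", row_group_size=100_000)  # rows per row group (tunable)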

  • Create a recsys_main.py variation for non-incremental training
  • Train on 3 weeks and evaluate on the last week
  • Run experiments varying the number of GPUs: single GPU, multi-GPU DataParallel, and multi-GPU DistributedDataParallel (a minimal setup sketch follows this list)
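
A minimal setup sketch of the three configurations in plain PyTorch (the model and launch details are placeholders, not the actual recsys_main.py code):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DataParallel, DistributedDataParallel

    model = torch.nn.Linear(128, 1).cuda()  # placeholder for the session-based model

    # 1) Single GPU: no wrapper, the model lives on one device.
    single_gpu_model = model

    # 2) Multi-GPU DataParallel: one process, each batch is split across all visible GPUs.
    dp_model = DataParallel(model)

    # 3) Multi-GPU DistributedDataParallel: one process per GPU, launched e.g. with
    #    `torchrun --nproc_per_node=<num_gpus> recsys_main.py` (which sets LOCAL_RANK).
    if "LOCAL_RANK" in os.environ:
        local_rank = int(os.environ["LOCAL_RANK"])
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(local_rank)
        ddp_model = DistributedDataParallel(model.to(f"cuda:{local_rank}"), device_ids=[local_rank])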
gabrielspmoreira added this to To do in v0.1 via automation on Jun 8, 2021
gabrielspmoreira moved this from To do to P1 in v0.1 on Jun 8, 2021
rnyak (Contributor) commented Aug 12, 2021

Gabriel and I did a debugging session and found that the distributed-training problem happens between 50% and 70% of the way through training on the first parquet file (day) when these arguments are set in our data loader:

NVTDataLoader(
        global_size=global_size,
        global_rank=global_rank,
        # ... other arguments unchanged
)

When these arguments are disabled, we can train on two GPUs (but both use the same dataset). So most likely the issue is in our NVT PyT data loader.

We can reproduce it quickly with ecom_small
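
For context, a sketch of how the global_size / global_rank values are typically derived (variable names follow the snippet above; the exact wiring in recsys_main.py may differ):

    import torch.distributed as dist

    if dist.is_available() and dist.is_initialized():
        global_size = dist.get_world_size()  # total number of DDP workers
        global_rank = dist.get_rank()        # this worker's shard index
    else:
        global_size, global_rank = 1, 0      # single-process fallback

    # Passing these to the data loader is meant to give each rank a different shard
    # of the parquet file; dropping them (as in the test above) makes every GPU read
    # the full dataset, which avoids the hang but duplicates work.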

rnyak (Contributor) commented Aug 31, 2021

Gabriel, Julio and I did another debugging session, and it looks like one of our workers is not waiting for the other worker, which creates a bottleneck. Options/guidance to explore:
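
One generic thing to try (a sketch of a possible workaround, not one of the options discussed in the session) is an explicit barrier between per-day files, so that no rank starts the next parquet file before the others have finished the current one:

    import torch.distributed as dist

    for day_path in train_parquet_files:  # hypothetical list of per-day parquet files
        train_on_file(model, day_path)    # hypothetical per-day training step
        if dist.is_initialized():
            dist.barrier()                # all ranks sync up before moving to the next day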

Ahanmr commented Mar 3, 2022

@rnyak Does the library currently support training in a multi-GPU configuration? Even though I have multiple GPUs, the training happens on only one of them and is not parallelized across both. Is there a way to add this to our trainer?

rnyak (Contributor) commented Oct 3, 2022

@Ahanmr we are currently working on supporting multi-GPU training.

alan-ai-learner commented Feb 22, 2023

@rnyak after preparing a custom dataset, I have 1321 folders, one for each day.
So I'm training it as described in the yoochoose dataset example. I have a few questions about that:

  1. Currently I'm training with a train batch size of 32 and an eval batch size of 16, since I have 16 GB of GPU memory. I'm not sure what values would be better for my resources, so any suggestions would be helpful.
  2. After training on 500 days of data, the loss is 0. I'm not sure whether this is overfitting or expected. Do I need to stop, or is there a better way to do the training?
  3. Also, the per-day evaluation scores are very low, so how does one get a final evaluation score?

Any help would be great.
Thanks!
