Benchmark the new PyT data loader (with sparse tensors support) scalability with multi-GPU and larger datasets #28

Open · 3 tasks
gabrielspmoreira opened this issue Jun 8, 2021 · 5 comments

gabrielspmoreira (Member) commented Jun 8, 2021

Benchmark the new PyT data loader with the REES46 ecommerce dataset, using multiple GPUs

Train set: All train.parquet files for 31 days (1 parquet file per week). P.S. Set the row group size accordingly (see the sketch below).
Eval set: All valid.parquet files concatenated
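
A minimal sketch of how the row group size could be controlled when (re)writing the weekly parquet files, assuming pyarrow is used (file names and the row count are illustrative, not from the repo):

    import pyarrow.parquet as pq

    # Rewrite one weekly file with smaller row groups so that the data loader
    # can shard rows evenly across multiple GPU workers.
    table = pq.read_table("train.parquet")  # one weekly train file
    pq.write_table(table, "train_rg.parquet", row_group_size=100_000)  # rows per row group (tunable)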

  • Create a recsys_main.py variation for non-incremental training
  • Train on 3 weeks and evaluate on the last week
  • Run experiments varying the number of GPUs: single GPU, multi-GPU DataParallel, and multi-GPU DistributedDataParallel (a minimal setup sketch follows this list)
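
A minimal setup sketch of the three configurations in plain PyTorch (the model and launch details are placeholders, not the actual recsys_main.py code):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DataParallel, DistributedDataParallel

    model = torch.nn.Linear(128, 1).cuda()  # placeholder for the session-based model

    # 1) Single GPU: no wrapper, the model lives on one device.
    single_gpu_model = model

    # 2) Multi-GPU DataParallel: one process, each batch is split across all visible GPUs.
    dp_model = DataParallel(model)

    # 3) Multi-GPU DistributedDataParallel: one process per GPU, launched e.g. with
    #    `torchrun --nproc_per_node=<num_gpus> recsys_main.py` (which sets LOCAL_RANK).
    if "LOCAL_RANK" in os.environ:
        local_rank = int(os.environ["LOCAL_RANK"])
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(local_rank)
        ddp_model = DistributedDataParallel(model.to(f"cuda:{local_rank}"), device_ids=[local_rank])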
gabrielspmoreira added this to To do in v0.1 via automation on Jun 8, 2021
gabrielspmoreira moved this from To do to P1 in v0.1 on Jun 8, 2021
rnyak (Contributor) commented Aug 12, 2021

Gabriel and I did a debugging session and found that the distributed-training problem happens between 50% and 70% of the way through training on the first parquet file (day) when these arguments are set in our data loader:

NVTDataLoader(
        global_size=global_size,
        global_rank=global_rank,
        # ... other arguments unchanged
)

When these arguments are disabled, we can train on two GPUs (but both use the same dataset). So most likely the issue is in our NVT PyT data loader.

We can reproduce it quickly with ecom_small
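
For context, a sketch of how the global_size / global_rank values are typically derived (variable names follow the snippet above; the exact wiring in recsys_main.py may differ):

    import torch.distributed as dist

    if dist.is_available() and dist.is_initialized():
        global_size = dist.get_world_size()  # total number of DDP workers
        global_rank = dist.get_rank()        # this worker's shard index
    else:
        global_size, global_rank = 1, 0      # single-process fallback

    # Passing these to the data loader is meant to give each rank a different shard
    # of the parquet file; dropping them (as in the test above) makes every GPU read
    # the full dataset, which avoids the hang but duplicates work.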

rnyak (Contributor) commented Aug 31, 2021

Gabriel, Julio and I did another debugging session, and it looks like one of our workers is not waiting for the other worker, which creates a bottleneck. Options/guidance to explore:
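
One generic thing to try (a sketch of a possible workaround, not one of the options discussed in the session) is an explicit barrier between per-day files, so that no rank starts the next parquet file before the others have finished the current one:

    import torch.distributed as dist

    for day_path in train_parquet_files:  # hypothetical list of per-day parquet files
        train_on_file(model, day_path)    # hypothetical per-day training step
        if dist.is_initialized():
            dist.barrier()                # all ranks sync up before moving to the next day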

Ahanmr commented Mar 3, 2022

@rnyak Does the library currently support training in a multi-GPU configuration? Even though I have multiple GPUs, the training happens on only one of them and is not parallelized across both. Is there a way to add this to our trainer?

rnyak (Contributor) commented Oct 3, 2022

@Ahanmr we are currently working on supporting multi-GPU training.

alan-ai-learner commented Feb 22, 2023

@rnyak after preparing a custom dataset, I have 1321 folders, one for each day.
So I'm training it as described in the yoochoose dataset example. I have a few questions about that:

  1. Currently I'm training with a train batch size of 32 and an eval batch size of 16, since I have 16 GB of GPU memory. I'm not sure what values would be better for my resources, so any suggestions would be helpful.
  2. After training on 500 days of data, the loss is 0. I'm not sure whether this is overfitting or expected. Do I need to stop, or is there a better way to do the training?
  3. Also, the per-day evaluation scores are very low, so how does one get a final evaluation score?

Any help would be great.
Thanks!
