System Info

I am using Trainer with PyTorch DDP on a single node with multiple GPUs. torch.distributed.init_process_group() is set up correctly. It seems that Trainer._get_train_sampler() does not use DistributedSampler but rather RandomSampler? Or could this be another issue I am missing? Any input appreciated! Thanks!
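For context, this is the classic manual DDP data-loading pattern I expected the Trainer to follow internally (a minimal sketch with a toy dataset standing in for the real one; run under torchrun):

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")

# Toy dataset standing in for the real one.
train_dataset = TensorDataset(torch.arange(100).float())

# DistributedSampler splits the indices across ranks so each GPU
# sees a disjoint shard of the data.
sampler = DistributedSampler(train_dataset)
loader = DataLoader(train_dataset, batch_size=8, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle differently each epoch
    for (batch,) in loader:
        pass  # training step goes here
```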
Who can help?
No response
Information
The official example scripts
My own modified scripts
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
The error originates in the data collator. The same code works on a single GPU; a generic sketch of the setup is below.
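For reference, here is a generic sketch of the kind of setup involved (the model, dataset, and collator below are stand-ins, not my actual code), launched with torchrun --nproc_per_node=2 repro.py:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("glue", "sst2", split="train[:1%]")
dataset = dataset.map(lambda ex: tokenizer(ex["sentence"], truncation=True), batched=True)

trainer = Trainer(
    model=AutoModelForSequenceClassification.from_pretrained("bert-base-uncased"),
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=DataCollatorWithPadding(tokenizer),  # error surfaces here under DDP
)
trainer.train()
```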
Expected behavior
I expect training to proceed in a distributed fashion across multiple GPUs using the Trainer API.
I'm also curious why DistributedSampler was removed from _get_train_sampler(); I remember older versions implemented it for the multi-GPU training case.
@yuyemin The Trainer now relies on Accelerate to handle data sampling, since it has a complete Accelerate integration. @gtanya89 can you post the error with the full traceback and a reproducer?
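Roughly, the Trainer now does the equivalent of the following (a simplified sketch of the Accelerate integration, not the exact internals; the toy dataset is a placeholder):

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

accelerator = Accelerator()  # picks up rank/world size from the launcher env

dataset = TensorDataset(torch.arange(100).float())

# A plain RandomSampler is enough: accelerator.prepare() wraps the
# DataLoader so each process transparently receives its own shard.
loader = DataLoader(dataset, batch_size=8, sampler=RandomSampler(dataset))
loader = accelerator.prepare(loader)

for (batch,) in loader:
    pass  # each rank sees a distinct slice of the data
```

So seeing RandomSampler in _get_train_sampler() is expected; the sharding happens when the dataloader is prepared by Accelerate rather than at the sampler level.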