Replies: 2 comments
-
@tjruwase can you help me with this?
-
Historically,
-
Hi, I am new to distributed training and am using Hugging Face to train large models. I see many options to run distributed training. Can I know what the difference is between the following options:
python train.py .....<ARGS>
python -m torch.distributed.launch <ARGS>
deepspeed train.py <ARGS>
I did not expect option 1 to use distributed training. But it even seems to use some sort of torch distributed training? In that case, what's the difference between option 1 and option 2?
Does deepspeed use torch.distributed in the background?
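For context on what I mean, here is a minimal sketch (illustrative only, not from my actual script; the helper name `init_distributed` is made up) of how I understand a training script picks up the environment that options 2 and 3 set up, while option 1 would run as a single process:
import os
import torch
import torch.distributed as dist

def init_distributed():
    # Both torch.distributed.launch and the deepspeed launcher spawn one
    # process per GPU and export RANK, WORLD_SIZE, and LOCAL_RANK for each.
    if "WORLD_SIZE" in os.environ and int(os.environ["WORLD_SIZE"]) > 1:
        # init_process_group defaults to the env:// init method, so it reads
        # MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the environment.
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
        return True
    # Plain `python train.py` launches a single process, so no process group
    # is created and training falls back to one device.
    return False
Is this roughly what happens under the hood, and does deepspeed ultimately call into torch.distributed the same way?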