Some notes on the examples [torchrun] #2095

TJ-Solergibert · 2023-10-27T11:13:48Z

Looking at the documentation in the examples directory, I noticed that the commands for launching jobs with torchrun differ considerably from what the PyTorch documentation itself indicates, and even the commands in the multi-node training section are incorrect. As the PyTorch documentation shows (even for older versions), the correct flags for launching the distributed script are --rdzv-id, --rdzv_backend, and --rdzv-endpoint.
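For reference, here is a minimal sketch of how a multi-node launch command using those rendezvous flags could be assembled. The helper `build_torchrun_command`, the script name, the job id, and the endpoint host and port are all hypothetical placeholders, not values taken from the examples directory. The sketch assumes a recent torchrun release, which accepts both the hyphenated and the underscored spelling of these flags, and assumes the `c10d` rendezvous backend.

```python
import shlex


def build_torchrun_command(script, nnodes, nproc_per_node, rdzv_id,
                           rdzv_endpoint, rdzv_backend="c10d", script_args=()):
    """Return the argv list for a multi-node torchrun launch.

    Hypothetical helper for illustration only. Every node is expected to run
    the same command, with the same rdzv_id and rdzv_endpoint.
    """
    return [
        "torchrun",
        f"--nnodes={nnodes}",                  # total number of nodes
        f"--nproc-per-node={nproc_per_node}",  # worker processes per node
        f"--rdzv-id={rdzv_id}",                # job id shared by all nodes
        f"--rdzv-backend={rdzv_backend}",      # rendezvous backend (assumed c10d)
        f"--rdzv-endpoint={rdzv_endpoint}",    # host:port of the rendezvous
        script,
        *script_args,
    ]


# Placeholder values: adjust the script, job id, and endpoint to your setup.
command = build_torchrun_command(
    script="train.py",
    nnodes=2,
    nproc_per_node=8,
    rdzv_id="job-42",
    rdzv_endpoint="node0.example.com:29400",
)
print(shlex.join(command))
```

Printing the joined command gives a single line that can be pasted into a shell on each node; passing the list to `subprocess.run` would work as well.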

Personally, I have been running the scripts and they work correctly; it would only be necessary to clarify the instructions. I could do it myself without any problem, or someone else could take a look at it!

muellerzr · 2023-10-27T11:26:52Z

A PR would be welcome with any fixes you have! Else I’ll try and get to it soon :)

muellerzr added the documentation (Improvements or additions to documentation) and good first issue (Good for newcomers) labels on Oct 27, 2023

TJ-Solergibert mentioned this issue on Oct 27, 2023 in "Updated torchrun instructions" #2096 (merged)

muellerzr closed this as completed in #2096 Nov 20, 2023
