Some notes on the examples [torchrun] #2095

Closed
TJ-Solergibert opened this issue Oct 27, 2023 · 1 comment · Fixed by #2096
Labels
documentation Improvements or additions to documentation good first issue Good for newcomers

Comments

@TJ-Solergibert
Contributor

Looking at the documentation in the examples directory, I noticed that the torchrun launch commands differ from what the PyTorch documentation itself specifies, and the commands in the multi-node training section are incorrect. As you can see here (even in older versions), the correct flags for launching the distributed script are --rdzv-id, --rdzv-backend, and --rdzv-endpoint.
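For illustration, here is a minimal sketch of how a multi-node launch command using those rendezvous flags could be assembled. It only builds and prints the command line; it does not run torchrun. The helper name build_torchrun_command, the script name train.py, the job id, and the host and port are all hypothetical placeholders, and the c10d backend is assumed because it is the one the PyTorch documentation uses in its examples. The hyphenated spelling follows the current PyTorch documentation; older versions documented the same flags with underscores.

```python
import shlex


def build_torchrun_command(script, nnodes, nproc_per_node, rdzv_id,
                           rdzv_endpoint, rdzv_backend="c10d", script_args=()):
    """Assemble (but do not execute) a multi-node torchrun command line.

    All argument values are supplied by the caller; nothing here is taken
    from the examples directory discussed in this issue.
    """
    return [
        "torchrun",
        f"--nnodes={nnodes}",                  # number of machines in the job
        f"--nproc-per-node={nproc_per_node}",  # worker processes per machine
        f"--rdzv-id={rdzv_id}",                # job id shared by every node
        f"--rdzv-backend={rdzv_backend}",      # rendezvous backend (assumed c10d)
        f"--rdzv-endpoint={rdzv_endpoint}",    # host:port of the rendezvous node
        script,
        *script_args,
    ]


# Placeholder values, for illustration only.
cmd = build_torchrun_command(
    script="train.py",
    nnodes=2,
    nproc_per_node=8,
    rdzv_id="job-42",
    rdzv_endpoint="node0.example.com:29500",
)
print(shlex.join(cmd))
```

With the c10d backend, the printed command would be run unchanged on every participating node.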

I have run the scripts myself and they work correctly, so only the launch instructions need correcting. I could do it myself without any problem, or someone else could take a look at it!

@muellerzr
Collaborator

A PR would be welcome with any fixes you have! Else I’ll try and get to it soon :)

@muellerzr muellerzr added documentation Improvements or additions to documentation good first issue Good for newcomers labels Oct 27, 2023