While reading the documentation in the examples directory, I noticed that the commands for launching jobs with torchrun differ from what the PyTorch documentation itself specifies, and that the commands in the multi-node training section are incorrect. As you can see here (even in older versions), the correct flags to launch the distributed script are --rdzv-id, --rdzv-backend, and --rdzv-endpoint.
I have been running the scripts myself and they work correctly, so only the instructions need to be corrected. I am happy to submit the fix myself, or someone else could take a look at it!
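
For reference, here is a minimal sketch of what a corrected multi-node launch line could look like. It only assembles and prints the command; it does not run it. The script name `train.py`, the job id `job-42`, the endpoint `node0.example.com:29400`, and the helper `build_torchrun_cmd` are all hypothetical placeholders, and the `c10d` backend is an assumption about the setup. The flags are written in the hyphenated form used by recent PyTorch documentation; older documentation spells them with underscores (for example `--rdzv_backend`).

```python
import shlex


def build_torchrun_cmd(script, nnodes, nproc_per_node, rdzv_id,
                       rdzv_endpoint, rdzv_backend="c10d", script_args=()):
    """Assemble a torchrun argv list for a multi-node launch (hypothetical helper)."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",                  # number of participating nodes
        f"--nproc-per-node={nproc_per_node}",  # worker processes per node
        f"--rdzv-id={rdzv_id}",                # job id shared by every node
        f"--rdzv-backend={rdzv_backend}",      # rendezvous backend (assumed c10d)
        f"--rdzv-endpoint={rdzv_endpoint}",    # host:port of the rendezvous server
        script,
        *script_args,
    ]


# Placeholder values; every node in the job would run the same command.
cmd = build_torchrun_cmd(
    script="train.py",
    nnodes=2,
    nproc_per_node=8,
    rdzv_id="job-42",
    rdzv_endpoint="node0.example.com:29400",
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell on each node. The instructions in the examples directory could follow the same shape, with the repository's actual script and arguments substituted.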