[BUG] CUDA error in pipeline parallel #5536
Hi @sunkun1997 - can you please share more information on your setup, ds_config, ds_report, and sample repro script?
repro script:
I modified the main of train.py, then ran run.sh on each node.

By the way, if I modify the start of train.py with
Describe the bug
When I trained the model with two nodes on a pipeline-parallel task, each node had eight GPUs, so the incoming LOCAL_RANK values on node one were 8-15. Line 201 of deepspeed/runtime/pipe/module.py is

self.to(get_accelerator().device_name(self.local_rank))

Here local_rank does not match the GPU indices 0-7 on the node, so a CUDA error is raised.
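The mismatch described above can be illustrated with a minimal sketch. This is an assumption about the launcher behavior, not DeepSpeed's actual code: if a launcher passes the global rank where a per-node device index is expected, taking it modulo the number of GPUs per node recovers a valid device index.

```python
# Hypothetical illustration of the rank mismatch (not DeepSpeed's code).
# With 2 nodes x 8 GPUs, global ranks are 0-15, but each node only has
# CUDA devices 0-7, so a global rank of 8-15 is an invalid device index.
def device_index_for(global_rank: int, gpus_per_node: int = 8) -> int:
    """Map a global rank to a local device index in [0, gpus_per_node)."""
    return global_rank % gpus_per_node

# Global ranks 8-15 (the second node) map back onto devices 0-7.
print([device_index_for(r) for r in range(8, 16)])
```

Passing the true per-node local rank (e.g. the `LOCAL_RANK` the launcher sets per process on each node) instead of the global rank avoids the invalid device index.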