How to make it work with torch DDP #813
I completely agree that debugging code for parallel training can be
incredibly challenging, and the idea of being able to work on it in a
notebook is certainly appealing. Thank you for bringing up this important
issue.
On Thu, Jun 22, 2023 at 6:43 PM Xie Zejian wrote:
Debugging code for parallel training is very painful, and it would be very
appealing to be able to do it in a notebook.
Is there a task here? I'm not sure why this has been opened as an issue on this repo. You can use IPython Parallel for a certain kind of debugging in parallel (it's not a debugger and certainly not a parallel debugger, which is very challenging), but it can be used to get an interactive interpreter in each of your worker processes for poking around.
Having an interactive interpreter during parallel training is already quite appealing, especially now that deep learning increasingly relies on a significant amount of resources. This could be a scenario where ipyparallel can shine. This issue primarily serves as a feature request or question because initiating torch DDP requires some additional setup, which remains challenging for regular users like myself.
Got it. I don't think there's any feature request here, but perhaps an example or some docs. I don't know much about torch or what ddp is, but if you can provide some examples of how workers are started with it, I may be able to give a hint on how to get them into IPython Parallel. Usually the easiest way is to start IPython Parallel and then run whatever startup code the workers use in the IPython session. The second way is to use some code injection to launch an IPython interpreter in each worker after they are launched in whatever way your tool usually does. That's all the information I can give without knowing anything about the situation.
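For illustration, the first approach (start IPython Parallel, then run the workers' startup code inside the engines) might look roughly like the sketch below. Nobody in this thread posted it, so treat it as an untested outline built on several assumptions: a single machine, the CPU-only `gloo` backend, and the `env://` rendezvous that `torch.distributed` reads from `MASTER_ADDR`, `MASTER_PORT`, `RANK` and `WORLD_SIZE`. The helper names `ddp_env`, `init_ddp` and `main`, and the default address and port, are made up for this example.

```python
"""Sketch: bring up torch.distributed inside IPython Parallel engines."""


def ddp_env(rank, world_size, master_addr="127.0.0.1", master_port=29500):
    """Build the variables that the env:// rendezvous reads (hypothetical helper)."""
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }


def init_ddp(env, backend="gloo"):
    """Runs on an engine. Imports stay inside so the function ships cleanly."""
    import os

    import torch.distributed as dist

    os.environ.update(env)
    # With init_method="env://", rank and world size are taken from os.environ.
    dist.init_process_group(backend=backend, init_method="env://")
    return dist.get_rank(), dist.get_world_size()


def main(n_engines=2):
    """Start a local cluster and initialise one process group rank per engine."""
    import ipyparallel as ipp

    with ipp.Cluster(n=n_engines) as rc:
        # Use apply_async: init_process_group blocks until every rank has
        # joined, so a synchronous call on the first engine would never return.
        pending = [
            rc[engine_id].apply_async(init_ddp, ddp_env(rank, n_engines))
            for rank, engine_id in enumerate(rc.ids)
        ]
        print([ar.get() for ar in pending])
        # From here, rc[:] can run further code on all ranks interactively.
```

Calling `main()` from a script or notebook cell starts the engines and initialises the process group. In a notebook you would probably keep the cluster open instead of using the `with` block, so that later cells can keep sending code to the same ranks.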
Debugging code for parallel training is very painful, and it would be very appealing to be able to do it in a notebook.
The most relevant thing I found is https://github.com/philtrade/Ddip and https://github.com/Bluefog-Lib/bluefog, but it seems that they are no longer being maintained.