remote restart ipengine #94
The plan is to put a nanny process next to each Engine, which would enable remote signalling, restarting, etc. This is a general plan for Jupyter kernels that will be extended to IPython parallel.
The tricky bit for IPython parallel is to not ruin cases like MPI, where the engine itself must be the main process; it cannot be started by the nanny. This means that either the engine starts the nanny the first time, or we special-case MPI somehow.
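A minimal sketch of the engine-starts-the-nanny pattern described above (all names here are invented for illustration; this is not the actual ipyparallel implementation): the mpiexec'd process remains the engine itself, and it spawns a small watcher child that could later be extended to deliver signals or report liveness.

```python
import os
import subprocess
import sys


def start_nanny(engine_pid):
    """Hypothetical helper: the engine spawns its own nanny as a child,
    so the process launched by mpiexec stays the engine, not the nanny."""
    nanny_program = (
        "import os, sys, time\n"
        "pid = int(sys.argv[1])\n"
        "while True:\n"
        "    try:\n"
        "        os.kill(pid, 0)  # signal 0: probe, raises if engine is gone\n"
        "    except OSError:\n"
        "        sys.exit(0)      # engine exited; nanny shuts down too\n"
        "    time.sleep(0.1)\n"
    )
    return subprocess.Popen(
        [sys.executable, "-c", nanny_program, str(engine_pid)]
    )


if __name__ == "__main__":
    # The "engine" (this process) starts its nanny on launch.
    nanny = start_nanny(os.getpid())
    # ... the engine's main loop would run here ...
    nanny.terminate()
    nanny.wait()
```

A real nanny would also listen on a socket for interrupt/restart requests; this sketch only shows the parent/child direction that keeps MPI happy.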
Right. MPI. Well, at the moment, I can't see reliably handling restarts for MPI with any fewer than 3 processes. Kinda dumb, but here's the picture I have in mind... We have to allow that there may be at least three computers involved in an MPI situation.
We want to be able to send keyboard-interrupt signals to the engine; ergo, the nanny needs to be on the same node as the engine (correct me if I'm wrong). So at the very least, we would need a setup like this.
Now let's say we get a restart signal. We need to kill the engine and launch a new engine that is also part of the same MPI universe. We can do this with, e.g., dynamic process spawning. To deal with this situation, we actually need a third process. The setup now looks like this:
Now if the nanny is told to keyboard-interrupt, it talks to the micronanny, which actually sends the SIGINT. If the nanny is told to restart, it creates an entirely new (micronanny, engine) pair, which might be on ClusterNodeA and it might be on ClusterNodeB.

Remarks:
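The interrupt path above, where a process co-located with the engine delivers the SIGINT, can be sketched with stand-ins (names and structure are invented for illustration; POSIX only, not the ipyparallel code):

```python
import multiprocessing
import os
import signal
import time


def engine(stop_event):
    """Stand-in engine: runs until it receives SIGINT."""
    def on_sigint(signum, frame):
        stop_event.set()  # a real engine would abort the running task

    signal.signal(signal.SIGINT, on_sigint)
    while not stop_event.is_set():
        time.sleep(0.05)


def micronanny_interrupt(pid):
    """Hypothetical micronanny action: deliver SIGINT to the
    engine process on the same node."""
    os.kill(pid, signal.SIGINT)


if __name__ == "__main__":
    stop = multiprocessing.Event()
    p = multiprocessing.Process(target=engine, args=(stop,))
    p.start()
    time.sleep(0.2)  # let the engine install its SIGINT handler
    micronanny_interrupt(p.pid)
    p.join(timeout=2)
    print(p.exitcode)  # 0: the engine shut down cleanly on SIGINT
```

The point of the micronanny is only this same-node `os.kill`; the restart logic (spawning a fresh pair, possibly on another node) would live in the nanny.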
...in conclusion, I hope MPI doesn't block progress on this. When they introduced "dynamic process management" in MPI 2.0, I don't believe they were thinking of a scenario where a single worker could restart. At a bare minimum, every time an MPI process restarts we will need to destroy the old intracommunicator and inject a new one into the ipengine namespace. If users have data structures referencing the old intracommunicator, those references will become invalid, which could be a bit confusing for users :). But perhaps somebody with more MPI-foo can come along and prove me wrong! In other news, let me know if there's a useful way I could contribute to the kernel-nanny architecture for Jupyter.
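The stale-reference hazard described above is ordinary Python rebinding, independent of MPI: injecting a new communicator under the old name does not update user objects that captured the old one. A minimal stand-in (using a plain class in place of an mpi4py communicator; `FakeComm` and its methods are invented for illustration):

```python
class FakeComm:
    """Stand-in for an MPI intracommunicator (hypothetical)."""

    def __init__(self, generation):
        self.generation = generation
        self.alive = True

    def free(self):
        self.alive = False  # like MPI_Comm_free: this handle is now invalid


# The engine's namespace, as the nanny would manipulate it.
namespace = {"comm": FakeComm(generation=1)}

# A user data structure holds a reference to the communicator.
user_data = {"my_comm": namespace["comm"]}

# Restart: destroy the old intracommunicator, inject a new one.
namespace["comm"].free()
namespace["comm"] = FakeComm(generation=2)

print(namespace["comm"].generation)  # 2: the injected replacement
print(user_data["my_comm"].alive)    # False: the user's reference is now dead
```

So even with a clean destroy-and-inject cycle, any object that stored the old communicator keeps pointing at the freed handle, which is exactly the confusion for users noted above.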
I'm 100% okay with MPI engines not being allowed to be restarted; that's not a problem. It's just the parent/child relationship that's an issue, because the mpiexec'd process must be the actual kernel, not the nanny.
Cool. That makes sense.
The engine restart feature would be really useful. E.g. I use Theano on a cluster. Once I import theano, a GPU is assigned to the importing process. The only way (that I know of) to "free" the GPU again is to terminate/restart the process.
Any news here in the last two years?
Any news in the last 4 years?
#463 lays the groundwork for this to be possible
I think this is actually a pretty old idea, but I was wondering if there has been any movement on it.
In the same way you can restart a kernel in a notebook, it would be awesome if you could restart an ipengine. My understanding is that this would require a nontrivial rewrite of the engine, involving an entire extra monitoring process that just isn't there right now.
Does this seem like something likely to be implemented? If I took a crack at it, would that be helpful? Or is there another plan...