Cluster Python API #463
Conversation
Force-pushed from 8e06543 to 0973e5b
@sahil1105 it's not quite ready, but the basics are here if you want to start reviewing. In particular, you can check out the example notebook walking through:
Force-pushed from e65a940 to a15fd86
@sahil1105 I think this is reasonably complete for a first draft. I'm still debugging little discrepancies between my environment and CI, but everything's working for me locally.
Force-pushed from 5e1cedb to 501d105
@sahil1105 I think this is about ready to go for a first draft if you'd like to review. It's working and relatively complete. I've opened issues for the follow-up tasks to do after landing this. I'm still ironing out the kinks of testing MPI on CI (everything passes for me locally).
Force-pushed from 6cda89d to 849915a
All tests are passing now.
Will do. Thanks!
easiest way to grant clients access to restart/add_engines APIs, especially since the context manager yields the client, not the cluster itself
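The design point about the context manager yielding the client rather than the cluster can be sketched with a minimal, hypothetical `Cluster`/`Client` pair (the names and structure here are assumed for illustration, not ipyparallel's actual classes):

```python
class Client:
    """Hypothetical stand-in for the client connected to a cluster."""

    def __init__(self, cluster):
        # Keep a handle back to the cluster so restart/add_engines-style
        # APIs can be exposed through the client.
        self.cluster = cluster


class Cluster:
    """Hypothetical cluster whose context manager yields the Client."""

    def __init__(self, n=2):
        self.n = n
        self.running = False

    def start(self):
        self.running = True
        return Client(self)

    def stop(self):
        self.running = False

    def __enter__(self):
        # Yield the connected client, not the cluster itself -- which is
        # why cluster operations need to be reachable from the client.
        self._client = self.start()
        return self._client

    def __exit__(self, *exc):
        self.stop()
```

Used as `with Cluster(n=2) as client: ...`, the `with` target is the client, and the cluster stays reachable via `client.cluster`.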
reasonably complete now; fix various little things along the way
easier to clean them up
and add Cluster to docs
fixes capture of subprocess output
improves response time of stop/restart events
CI VMs are slow
replace Mixin with Launcher, since it's a public config API and it's not really a Mixin; it must be Configurable for its traits to be configurable...
@minrk Sorry about the delay, looks great to me, thanks!
A couple of questions:
- Is there a way/API to send a signal to `mpiexec` itself if needed? I know we're handling SIGINT, etc. using a signal handler on each process without involving `mpiexec`, which makes sense, but we might for instance send a `SIGUSR1` to `mpiexec` which the application could receive and handle.
- When we restart the NB kernel (w/o calling `stop_cluster()`), are all controller/engine resources cleaned up or left dangling (and if so, what's the best way to clean them)?
- For certain implementations, like MPI, when the engines are started using a Launcher, do we need heartbeats between engine and controller? Also, if the `mpiexec` process dies while running user code, will the user (client) get notified immediately (since the launcher has a handle on the `mpiexec` process), or will it still need to wait until enough heartbeats have been missed?
Yes,
They should be, but it cannot be assumed because cleanup code cannot be guaranteed to run (e.g. killed with SIGKILL, no cleanup events can be assumed to have run). I can work on cleaning up as often as possible (e.g. with atexit, which I started working on already, but haven't pushed it yet). Ultimately, we need #480 to interact with clusters not started by the current process. Actions would include:
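The atexit approach mentioned above can be sketched with a hypothetical `Cluster` class (a best-effort pattern only: as noted, SIGKILL bypasses atexit entirely, so nothing here is guaranteed to run):

```python
import atexit


class Cluster:
    """Hypothetical sketch of best-effort cleanup via atexit."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True
        # Register cleanup so a normal interpreter exit stops the cluster.
        # This cannot help if the process is killed with SIGKILL.
        atexit.register(self.stop)

    def stop(self):
        self.running = False
        # Unregister so an explicit stop() doesn't run the handler
        # a second time at interpreter exit.
        atexit.unregister(self.stop)
```

Because `atexit.unregister` compares bound methods by equality, registering and unregistering `self.stop` on the same instance pairs up correctly.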
That's a tricky one. If a launcher can be trusted to be notified, yes. mpiexec is not such a launcher in general, because a process can begin exiting, at which point it becomes unresponsive and should be considered 'dead'. The heartbeat mechanism should catch this, but mpiexec will not, because the process still exists and/or has begun 'cleanly' shutting down via MPI_Finalize. mpiexec only notifies about "abnormal" exit, i.e. exit without finalize. So relying on mpiexec can mean engines are less likely to be notified of shutdown than the current mechanism. We can, however, at least send the same notification about engine shutdown in all the cases where mpiexec does exit, which should have most of the benefit (improved responsiveness) without the cost of removing it (missed shutdown events). I've opened #491 to track this.
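The heartbeat mechanism discussed here can be illustrated with a small, hypothetical monitor that declares an engine dead after a fixed number of missed intervals (a sketch of the general idea, not ipyparallel's actual implementation):

```python
import time


class HeartbeatMonitor:
    """Mark an engine dead after max_missed heartbeat intervals (sketch)."""

    def __init__(self, interval=1.0, max_missed=3):
        self.interval = interval
        self.max_missed = max_missed
        self.last_seen = {}  # engine id -> timestamp of last heartbeat

    def beat(self, engine_id, now=None):
        """Record a heartbeat from an engine."""
        self.last_seen[engine_id] = time.monotonic() if now is None else now

    def dead_engines(self, now=None):
        """Engines that have missed more than max_missed intervals."""
        now = time.monotonic() if now is None else now
        cutoff = self.interval * self.max_missed
        return [eid for eid, t in self.last_seen.items() if now - t > cutoff]
```

The `now` parameter is only there to make the logic easy to exercise with injected timestamps; real callers would rely on `time.monotonic()`.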
Oh ok, makes sense. Thanks.
Got it, makes sense. Yes, I did notice that sometimes there are lingering processes, but a simple bash script to terminate all ipcontroller processes could take care of that for now.
Ah ok, understood.
Agreed. Sounds good, thanks!
Thanks for the review, @sahil1105!
DeprecationWarning: ipyparallel.apps.launcher is deprecated in ipyparallel 7. Use ipyparallel.cluster.launcher. See: ipython/ipyparallel#463 Found in: https://github.com/quaquel/EMAworkbench/runs/6491767224?check_suite_focus=true#step:6:158
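The warning above is raised when code imports the old module path. A typical shim for a moved module warns at import time and then re-exports the new module's names; a hypothetical helper for the warning half might look like this (an illustration of the pattern, not ipyparallel's exact shim):

```python
import warnings


def deprecated_import(old_name, new_name):
    """Warn that old_name has moved to new_name (module-shim sketch)."""
    warnings.warn(
        f"{old_name} is deprecated. Use {new_name}.",
        DeprecationWarning,
        stacklevel=3,
    )

# A module like ipyparallel/apps/launcher.py would call this at import
# time, then re-export everything from the new location to stay
# backward compatible, e.g.:
#   deprecated_import("ipyparallel.apps.launcher",
#                     "ipyparallel.cluster.launcher")
#   from ipyparallel.cluster.launcher import *  # noqa
```

Downstream projects like the one linked above then see the `DeprecationWarning` in their test logs until they switch to the new import path.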
Cluster is a Python API for starting/stopping/signaling controllers and engines. It is basically rewriting IPClusterApp as a simpler object with a reusable API. It's natively a fully asyncio API, with automatic `_sync` methods that provide synchronous versions of everything, which should work both inside and outside asyncio.

TODO:
- `ipcluster stop/list` (Serialize Cluster objects #480)

closes #22
closes #216
closes #243
closes #241
closes #460
closes #94
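The automatic `_sync` methods described in the summary can be sketched as a wrapper that generates a blocking twin for each coroutine method. This is a simplified illustration: it assumes `asyncio.run` is sufficient for the plain synchronous-caller case, whereas the real version must also work from inside a running event loop (e.g. by dispatching to a background loop thread).

```python
import asyncio
import functools


def make_sync(async_method):
    """Generate a blocking twin of a coroutine method (simplified sketch)."""

    @functools.wraps(async_method)
    def sync_method(self, *args, **kwargs):
        # Run the coroutine to completion on a fresh event loop.
        # Only valid when no loop is already running in this thread.
        return asyncio.run(async_method(self, *args, **kwargs))

    return sync_method


class Cluster:
    """Hypothetical async-first Cluster with a generated _sync twin."""

    async def start_cluster(self, n=2):
        await asyncio.sleep(0)  # stand-in for actually launching processes
        self.n = n
        return n

    start_cluster_sync = make_sync(start_cluster)
```

With this pattern, async callers `await cluster.start_cluster(n=4)` while synchronous callers use `cluster.start_cluster_sync(n=4)`, and only the async version has to be written by hand.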