
Cluster Python API #463

Merged: minrk merged 31 commits into ipython:main from cluster-manager on Jun 21, 2021
Conversation

minrk (Member) commented Jun 3, 2021

Cluster is a Python API for starting, stopping, and signaling controllers and engines. It essentially rewrites IPClusterApp as a simpler object with a reusable API. The API is natively asyncio, with automatically generated _sync methods that provide synchronous versions of everything and should work both inside and outside a running asyncio event loop.
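For example, the basic synchronous usage looks roughly like this (a sketch only: the engine count is arbitrary, and the context manager yields a connected Client rather than the Cluster itself):

```python
import os

import ipyparallel as ipp

# sketch: start a small cluster, run something on every engine, then clean up.
# the engine count (n=4) is arbitrary.
with ipp.Cluster(n=4) as rc:
    # entering the block starts the controller and engines and yields a connected Client
    pids = rc[:].apply_sync(os.getpid)  # run os.getpid on every engine
    print(pids)
# leaving the block stops the engines and the controller
```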

TODO:

closes #22
closes #216
closes #243
closes #241
closes #460
closes #94

minrk (Member, Author) commented Jun 9, 2021

@sahil1105 it's not quite ready, but the basics are here if you want to start reviewing. In particular, you can check out the example notebook walking through:

  • starting/stopping clusters
  • using them as context managers
  • starting engines with MPI
  • interrupting and restarting engines
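For reference, roughly how the MPI case fits together through the async API (a sketch only, not the notebook itself: the engine count, the "mpi" launcher abbreviation, and the assumption that wait_for_engines blocks until the engines have registered are all illustrative):

```python
import asyncio

import ipyparallel as ipp


async def main():
    # sketch: start a controller and two MPI engines, connect a client, then stop
    cluster = ipp.Cluster(engine_launcher_class="mpi", n=2)
    await cluster.start_cluster()        # controller first, then engines via mpiexec
    rc = await cluster.connect_client()  # Client connected to this cluster
    rc.wait_for_engines(2)               # wait until both engines have registered
    print("engine ids:", rc.ids)
    await cluster.stop_cluster()         # shut down engines and controller


asyncio.run(main())
```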

minrk force-pushed the cluster-manager branch 2 times, most recently from e65a940 to a15fd86 on June 10, 2021 09:30
minrk (Member, Author) commented Jun 10, 2021

@sahil1105 I think this is reasonably complete for a first draft. I'm still debugging little discrepancies between my environment and CI, but everything's working for me locally.

minrk force-pushed the cluster-manager branch 3 times, most recently from 5e1cedb to 501d105 on June 10, 2021 10:07
minrk (Member, Author) commented Jun 10, 2021

@sahil1105 I think this is about ready to go for a first draft if you'd like to review. It's working and relatively complete. I've opened issues for the follow-up tasks to do after landing this.

I'm still ironing out the kinks of testing MPI on CI (everything passes for me locally).

minrk force-pushed the cluster-manager branch 2 times, most recently from 6cda89d to 849915a on June 10, 2021 10:47
minrk (Member, Author) commented Jun 10, 2021

All tests are passing now

sahil1105 (Collaborator) commented:

@sahil1105 it's not quite ready, but the basics are here if you want to start reviewing. In particular, you can check out the example notebook walking through:

  • starting/stopping clusters
  • using them as context managers
  • starting engines with MPI
  • interrupting and restarting engines

Will do. Thanks!

minrk added 14 commits (June 15, 2021). Commit messages include:

  • easiest way to grant clients access to restart/add_engines APIs, especially since the context manager yields the client, not the cluster itself
  • reasonably complete now: fix various little things along the way; fixes capture of subprocess output; improves response time of stop/restart events
  • replace Mixin with Launcher, since it's a public config API (it's not really a Mixin; it must be Configurable for its traits to be configurable)
sahil1105 (Collaborator) left a comment:

@minrk Sorry about the delay, looks great to me, thanks!
Couple of questions:

  1. Is there a way/API to send a signal to mpiexec itself if needed? I know we're handling SIGINT, etc. using a signal handler on each process without involving mpiexec, which makes sense, but we might, for instance, want to send a SIGUSR1 to mpiexec, which the application could receive and handle.
  2. When we restart the NB kernel (without calling stop_cluster()), are all controller/engine resources cleaned up or left dangling (and if so, what's the best way to clean them up)?
  3. For certain implementations, like MPI, when the engines are started using a Launcher, do we need heartbeats between the engines and the controller? Also, if the mpiexec process dies while running user code, will the user (client) get notified immediately (since the launcher has a handle on the mpiexec process), or will it still need to wait until enough heartbeats have been missed?

minrk (Member, Author) commented Jun 18, 2021

Is there a way/API to send a signal to mpiexec itself if needed?

Yes, signal_engines does exactly this. I still think we should implement sending signals to engines directly for a host of reasons (mpiexec's own poor signal handling among them). That would become a client method instead of a cluster method, though, so we wouldn't lose the ability to signal mpiexec from the Launcher. That approach is not yet implemented, though (#475).
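For reference, a rough sketch of signaling through the Launcher with the generated _sync wrappers (the method names assume the automatic _sync naming described in the PR; the engine count and the sleeping task are only for illustration):

```python
import signal
import time

import ipyparallel as ipp

# sketch: start a small cluster, kick off a long-running task,
# then deliver SIGINT through the Launcher (for MPI engines, via mpiexec)
cluster = ipp.Cluster(n=2)
cluster.start_cluster_sync()
rc = cluster.connect_client_sync()
rc.wait_for_engines(2)

ar = rc[:].apply_async(time.sleep, 60)      # something to interrupt
cluster.signal_engines_sync(signal.SIGINT)  # engines handle SIGINT themselves (see above)

try:
    ar.get(timeout=10)                      # the interrupted task typically surfaces an error here
except Exception as e:
    print("task interrupted:", e)

cluster.stop_cluster_sync()
```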

When we restart the NB kernel (without calling stop_cluster()), are all controller/engine resources cleaned up or left dangling (and if so, what's the best way to clean them up)?

They should be, but this can't be assumed, because cleanup code is not guaranteed to run (e.g. if the process is killed with SIGKILL, no cleanup hooks get a chance to run). I can work on cleaning up as reliably as possible (e.g. with atexit, which I've already started working on but haven't pushed yet). Ultimately, we need #480 to interact with clusters not started by the current process. Actions would include:

  • listing running clusters
  • connecting to a cluster started previously
  • shutting down clusters not started by this process

For certain implementations, like MPI, when the engines are started using a Launcher, do we need heartbeats between the engines and the controller?

That's a tricky one. If a launcher can be trusted to be notified, yes. mpiexec is not such a launcher in general, because a process can begin exiting, at which point it becomes unresponsive and should be considered 'dead'. The heartbeat mechanism should catch this, but mpiexec will not, because the process still exists and/or has begun 'cleanly' shutting down via MPI_Finalize. mpiexec only notifies about "abnormal" exit, i.e. exit without finalize.

So relying on mpiexec can mean engine shutdowns are less likely to be noticed than with the current mechanism.

We can, however, at least send the same notification about engine shutdown in all the cases where mpiexec does exit, which should give most of the benefit (improved responsiveness) without the cost of removing it (missed shutdown events). I've opened #491 to track this.

sahil1105 (Collaborator) commented:

Is there a way/API to send a signal to mpiexec itself if needed?

Yes, signal_engines does exactly this. I still think we should implement sending signals to engines directly for a host of reasons (mpiexec's own poor signal handling among them). That would become a client method instead of a cluster method, though, so we wouldn't lose the ability to signal mpiexec from the Launcher. That approach is not yet implemented, though (#475).

Oh ok, makes sense. Thanks.

When we restart the NB kernel (without calling stop_cluster()), are all controller/engine resources cleaned up or left dangling (and if so, what's the best way to clean them up)?

They should be, but this can't be assumed, because cleanup code is not guaranteed to run (e.g. if the process is killed with SIGKILL, no cleanup hooks get a chance to run). I can work on cleaning up as reliably as possible (e.g. with atexit, which I've already started working on but haven't pushed yet). Ultimately, we need #480 to interact with clusters not started by the current process. Actions would include:

  • listing running clusters
  • connecting to a cluster started previously
  • shutting down clusters not started by this process

Got it, makes sense. Yes, I did notice that there are sometimes lingering processes, but a simple bash script to terminate all ipcontroller processes can take care of that for now.

For certain implementations, like MPI, when the engines are started using a Launcher, do we need heartbeats between the engines and the controller?

That's a tricky one. If a launcher can be trusted to be notified, yes. mpiexec is not such a launcher in general, because a process can begin exiting, at which point it becomes unresponsive and should be considered 'dead'. The heartbeat mechanism should catch this, but mpiexec will not, because the process still exists and/or has begun 'cleanly' shutting down via MPI_Finalize. mpiexec only notifies about "abnormal" exit, i.e. exit without finalize.

Ah, ok, understood.
Yes, I was interested in the abnormal-exit case, where one (or more) processes exit and MPI shuts down all of them and exits with a code like 6 or 9.

So relying on mpiexec can mean engine shutdowns are less likely to be noticed than with the current mechanism.

We can, however, at least send the same notification about engine shutdown in all the cases where mpiexec does exit, which should give most of the benefit (improved responsiveness) without the cost of removing it (missed shutdown events). I've opened #491 to track this.

Agreed. Sounds good, thanks!

minrk (Member, Author) commented Jun 21, 2021

Thanks for the review, @sahil1105!

minrk merged commit 508af10 into ipython:main on Jun 21, 2021
minrk deleted the cluster-manager branch on June 21, 2021
minrk mentioned this pull request on Jul 2, 2021
EwoutH added a commit to EwoutH/EMAworkbench that referenced this pull request on May 18, 2022:

DeprecationWarning: ipyparallel.apps.launcher is deprecated in ipyparallel 7. Use ipyparallel.cluster.launcher.

See: ipython/ipyparallel#463

Found in: https://github.com/quaquel/EMAworkbench/runs/6491767224?check_suite_focus=true#step:6:158

EwoutH added commits to quaquel/EMAworkbench that referenced this pull request on Oct 23 and Oct 25, 2022, with the same message:

DeprecationWarning: ipyparallel.apps.launcher is deprecated in ipyparallel 7. Use ipyparallel.cluster.launcher.

See: ipython/ipyparallel#463