
Cluster Python API #463

Merged: minrk merged 31 commits into ipython:main from cluster-manager on Jun 21, 2021
Conversation

minrk (Member) commented Jun 3, 2021

Cluster is a Python API for starting, stopping, and signaling controllers and engines. It essentially rewrites IPClusterApp as a simpler object with a reusable API. The API is natively asyncio, with automatically generated _sync methods that provide synchronous versions of everything and should work both inside and outside a running asyncio event loop.
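For example, the basic synchronous usage looks roughly like this (a sketch only: the engine count is arbitrary, and the context manager yields a connected Client rather than the Cluster itself):

```python
import os

import ipyparallel as ipp

# sketch: start a small cluster, run something on every engine, then clean up.
# the engine count (n=4) is arbitrary.
with ipp.Cluster(n=4) as rc:
    # entering the block starts the controller and engines and yields a connected Client
    pids = rc[:].apply_sync(os.getpid)  # run os.getpid on every engine
    print(pids)
# leaving the block stops the engines and the controller
```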

TODO:

closes #22
closes #216
closes #243
closes #241
closes #460
closes #94

minrk (Member, Author) commented Jun 9, 2021

@sahil1105 it's not quite ready, but the basics are here if you want to start reviewing. In particular, you can check out the example notebook walking through:

  • starting/stopping clusters
  • using them as context managers
  • starting engines with MPI
  • interrupting and restarting engines
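For reference, roughly how the MPI case fits together through the async API (a sketch only, not the notebook itself: the engine count, the "mpi" launcher abbreviation, and the assumption that wait_for_engines blocks until the engines have registered are all illustrative):

```python
import asyncio

import ipyparallel as ipp


async def main():
    # sketch: start a controller and two MPI engines, connect a client, then stop
    cluster = ipp.Cluster(engine_launcher_class="mpi", n=2)
    await cluster.start_cluster()        # controller first, then engines via mpiexec
    rc = await cluster.connect_client()  # Client connected to this cluster
    rc.wait_for_engines(2)               # wait until both engines have registered
    print("engine ids:", rc.ids)
    await cluster.stop_cluster()         # shut down engines and controller


asyncio.run(main())
```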

minrk force-pushed the cluster-manager branch 2 times, most recently from e65a940 to a15fd86 on June 10, 2021 09:30
minrk (Member, Author) commented Jun 10, 2021

@sahil1105 I think this is reasonably complete for a first draft. I'm still debugging little discrepancies between my environment and CI, but everything's working for me locally.

minrk force-pushed the cluster-manager branch 3 times, most recently from 5e1cedb to 501d105 on June 10, 2021 10:07
minrk (Member, Author) commented Jun 10, 2021

@sahil1105 I think this is about ready to go for a first draft if you'd like to review. It's working and relatively complete. I've opened issues for the follow-up tasks to do after landing this.

I'm still ironing out the kinks of testing MPI on CI (everything passes for me locally).

minrk force-pushed the cluster-manager branch 2 times, most recently from 6cda89d to 849915a on June 10, 2021 10:47
minrk (Member, Author) commented Jun 10, 2021

All tests are passing now

sahil1105 (Collaborator) commented:

@sahil1105 it's not quite ready, but the basics are here if you want to start reviewing. In particular, you can check out the example notebook walking through:

  • starting/stopping clusters
  • using them as context managers
  • starting engines with MPI
  • interrupting and restarting engines

Will do. Thanks!

minrk added 14 commits (June 15, 2021). Commit messages include:

  • easiest way to grant clients access to restart/add_engines APIs, especially since the context manager yields the client, not the cluster itself
  • reasonably complete now: fix various little things along the way; fixes capture of subprocess output; improves response time of stop/restart events
  • replace Mixin with Launcher, since it's a public config API (it's not really a Mixin; it must be Configurable for its traits to be configurable)
sahil1105 (Collaborator) left a comment:

@minrk Sorry about the delay, looks great to me, thanks!
Couple of questions:

  1. Is there a way/API to send a signal to mpiexec itself if needed? I know we're handling SIGINT, etc. using a signal handler on each process without involving mpiexec, which makes sense, but we might, for instance, want to send a SIGUSR1 to mpiexec, which the application could receive and handle.
  2. When we restart the NB kernel (without calling stop_cluster()), are all controller/engine resources cleaned up or left dangling (and if so, what's the best way to clean them up)?
  3. For certain implementations, like MPI, when the engines are started using a Launcher, do we need heartbeats between the engines and the controller? Also, if the mpiexec process dies while running user code, will the user (client) get notified immediately (since the launcher has a handle on the mpiexec process), or will it still need to wait until enough heartbeats have been missed?

minrk (Member, Author) commented Jun 18, 2021

Is there a way/API to send a signal to mpiexec itself if needed?

Yes, signal_engines does exactly this. I still think we should implement sending signals to engines directly for a host of reasons (mpiexec's own poor signal handling among them). That would become a client method instead of a cluster method, though, so we wouldn't lose the ability to signal mpiexec from the Launcher. That approach is not yet implemented, though (#475).
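For reference, a rough sketch of signaling through the Launcher with the generated _sync wrappers (the method names assume the automatic _sync naming described in the PR; the engine count and the sleeping task are only for illustration):

```python
import signal
import time

import ipyparallel as ipp

# sketch: start a small cluster, kick off a long-running task,
# then deliver SIGINT through the Launcher (for MPI engines, via mpiexec)
cluster = ipp.Cluster(n=2)
cluster.start_cluster_sync()
rc = cluster.connect_client_sync()
rc.wait_for_engines(2)

ar = rc[:].apply_async(time.sleep, 60)      # something to interrupt
cluster.signal_engines_sync(signal.SIGINT)  # engines handle SIGINT themselves (see above)

try:
    ar.get(timeout=10)                      # the interrupted task typically surfaces an error here
except Exception as e:
    print("task interrupted:", e)

cluster.stop_cluster_sync()
```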

When we restart the NB kernel (without calling stop_cluster()), are all controller/engine resources cleaned up or left dangling (and if so, what's the best way to clean them up)?

They should be, but this can't be assumed, because cleanup code is not guaranteed to run (e.g. if the process is killed with SIGKILL, no cleanup hooks get a chance to run). I can work on cleaning up as reliably as possible (e.g. with atexit, which I've already started working on but haven't pushed yet). Ultimately, we need #480 to interact with clusters not started by the current process. Actions would include:

  • listing running clusters
  • connecting to a cluster started previously
  • shutting down clusters not started by this process

For certain implementations, like MPI, when the engines are started using a Launcher, do we need heartbeats between the engines and the controller?

That's a tricky one. If a launcher can be trusted to be notified, yes. mpiexec is not such a launcher in general, because a process can begin exiting, at which point it becomes unresponsive and should be considered 'dead'. The heartbeat mechanism should catch this, but mpiexec will not, because the process still exists and/or has begun 'cleanly' shutting down via MPI_Finalize. mpiexec only notifies about "abnormal" exit, i.e. exit without finalize.

So relying on mpiexec can mean engine shutdowns are less likely to be noticed than with the current mechanism.

We can, however, at least send the same notification about engine shutdown in all the cases where mpiexec does exit, which should give most of the benefit (improved responsiveness) without the cost of removing it (missed shutdown events). I've opened #491 to track this.

sahil1105 (Collaborator) commented:

Is there a way/API to send a signal to mpiexec itself if needed?

Yes, signal_engines does exactly this. I still think we should implement sending signals to engines directly for a host of reasons (mpiexec's own poor signal handling among them). That would become a client method instead of a cluster method, though, so we wouldn't lose the ability to signal mpiexec from the Launcher. That approach is not yet implemented, though (#475).

Oh ok, makes sense. Thanks.

When we restart the NB kernel (without calling stop_cluster()), are all controller/engine resources cleaned up or left dangling (and if so, what's the best way to clean them up)?

They should be, but this can't be assumed, because cleanup code is not guaranteed to run (e.g. if the process is killed with SIGKILL, no cleanup hooks get a chance to run). I can work on cleaning up as reliably as possible (e.g. with atexit, which I've already started working on but haven't pushed yet). Ultimately, we need #480 to interact with clusters not started by the current process. Actions would include:

  • listing running clusters
  • connecting to a cluster started previously
  • shutting down clusters not started by this process

Got it, makes sense. Yes, I did notice that there are sometimes lingering processes, but a simple bash script to terminate all ipcontroller processes can take care of that for now.

For certain implementations, like MPI, when the engines are started using a Launcher, do we need heartbeats between the engines and the controller?

That's a tricky one. If a launcher can be trusted to be notified, yes. mpiexec is not such a launcher in general, because a process can begin exiting, at which point it becomes unresponsive and should be considered 'dead'. The heartbeat mechanism should catch this, but mpiexec will not, because the process still exists and/or has begun 'cleanly' shutting down via MPI_Finalize. mpiexec only notifies about "abnormal" exit, i.e. exit without finalize.

Ah, ok, understood.
Yes, I was interested in the abnormal-exit case, where one (or more) processes exit and MPI shuts down all of them and exits with a code like 6 or 9.

So relying on mpiexec can mean engine shutdowns are less likely to be noticed than with the current mechanism.

We can, however, at least send the same notification about engine shutdown in all the cases where mpiexec does exit, which should give most of the benefit (improved responsiveness) without the cost of removing it (missed shutdown events). I've opened #491 to track this.

Agreed. Sounds good, thanks!

minrk (Member, Author) commented Jun 21, 2021

Thanks for the review, @sahil1105!

minrk merged commit 508af10 into ipython:main on Jun 21, 2021
minrk deleted the cluster-manager branch on June 21, 2021
minrk mentioned this pull request on Jul 2, 2021
EwoutH added a commit to EwoutH/EMAworkbench that referenced this pull request on May 18, 2022:

DeprecationWarning: ipyparallel.apps.launcher is deprecated in ipyparallel 7. Use ipyparallel.cluster.launcher.

See: ipython/ipyparallel#463

Found in: https://github.com/quaquel/EMAworkbench/runs/6491767224?check_suite_focus=true#step:6:158

EwoutH added commits to quaquel/EMAworkbench that referenced this pull request on Oct 23 and Oct 25, 2022, with the same message:

DeprecationWarning: ipyparallel.apps.launcher is deprecated in ipyparallel 7. Use ipyparallel.cluster.launcher.

See: ipython/ipyparallel#463