simplify IPython.parallel connections and enable Controller Resume #1471

Merged
merged 6 commits into from Jul 21, 2012

2 participants

@minrk
IPython member

Rolls back two-stage connection info, putting more information into the connection files. This makes it easier to use hand-crafted ssh tunnels, as all ports are read from the file, rather than from the reply to registration/connection requests.

It is no longer possible to connect to the Controller without a connection file.

Adding the serialization method to the connection file also makes it harder for custom serialization to result in a mismatch in configuration between the various objects.

@minrk
IPython member

Sample connection file before change:

{
  "url": "tcp://127.0.0.1:56224", 
  "exec_key": "e257a5bb-154f-41ee-a8aa-c51cea42b54a", 
  "ssh": "", 
  "location": "10.71.0.113"
}

and after:

{
  "control": 62418, 
  "task": 62424, 
  "hb_ping": 62421, 
  "exec_key": "fde6ebb4-8310-4fa1-b97a-dfacbbdaa2e7", 
  "mux": 62420, 
  "hb_pong": 62422, 
  "ssh": "", 
  "registration": 62415, 
  "interface": "tcp://127.0.0.1", 
  "iopub": 62426, 
  "pack": "msgpack.packb", 
  "unpack": "msgpack.unpackb", 
  "location": "192.168.1.188"
}
@minrk
IPython member

This PR now contains a simple first implementation of controller-resume. Engine state is stored to JSON on register/unregister, and loaded on startup if the --restore flag is given.

This should behave as expected as long as the cluster is idle, but is untested for resuming the Controller during active execution.

@minrk
IPython member

This enables resume of controller, but also removes explicit identities from the Client.

We need to make a decision on the explicit client identities, because there is a reason to go either way:

  1. in libzmq-2.1, setting an explicit identity results in a 'durable socket', which means that the Schedulers act on the assumption that clients never go away, and queue up replies to disconnected clients in memory. For users doing 'fire&forget' task submission, this will result in an apparent memory leak in the Schedulers.
  2. The durable socket side-effect of explicit identities is removed from libzmq-3.1.
  3. removing explicit identities means that client sockets will get new identities when using Controller resume introduced here, breaking the propagation of results of tasks submitted prior to controller crash.

So we have two principal choices:

  1. Continue to use explicit identities, and just state that the apparent memory leak is fixed by upgrading to libzmq-3.1.
  2. Remove explicit identities, and break client-replies across controller restart.

I think 2. makes the most sense, because fire&forget seems more important than preserving interactivity across controller crashes. The code here reflects this.

@minrk
IPython member

I should note that the changes here are highly experimental, and shouldn't go into 0.13.

@fperez
IPython member

@minrk, do you still think we should hold off on this one for after 0.13? I understand if they aren't robust yet, but do they introduce any potential problems? Because if it's only a matter of robustness, we can merge it and flag this explicitly as an experimental feature; that at least will give it some field-testing...

@minrk
IPython member

This would break connection-file compatibility between 0.13 and 0.12 clients/clusters, which I know there are people using right now. But the main reason is just that this is a big enough and tricky enough change that I want it to bake in master for longer than we have between now and 0.13. Especially the controller-resume part, which is far from complete, though it certainly works in some simple cases.

@fperez
IPython member

Understood; I'll update the description at the top with a more visible note about this.

@minrk
IPython member

Test results for commit 7d7a06b merged into master
Platform: darwin

  • python2.6: Failed, log at https://gist.github.com/2839954 (libraries not available: matplotlib pygments pymongo qt tornado wx wx.aui)
  • python2.7: OK (libraries not available: wx wx.aui zmq)
  • python3.2: OK (libraries not available: matplotlib pymongo qt wx wx.aui)

Not available for testing:

@minrk
IPython member

As we continue to move the goalposts on 0.13, I'm starting to feel like I want to split this one, and merge the new connection files into 0.13, and only leave controller resume until after release. Thoughts on that @fperez?

@fperez
IPython member

I was just thinking about pinging the list for 0.13 plans... We have so much new stuff that it's high time we start settling it down with eye towards a release... Do you think we'd be able to pull off an 0.13 release before we head out to SciPy?

@minrk
IPython member

Not sure about a Final, but at least a beta or RC. We have 70 issues marked for 0.13 and 35 PRs, but I think very few of them actually need to get in to 0.13 (0 of mine, for instance, though I would like the file/shell cell magics). So if we switch into stabilization-only mode this week, I think we can make it.

It's a bit of an unfortunate time as we are just now exploring what we can/should be doing with Cell Magics.

We still haven't done the cell-metadata bit, which I think is a blocker so we reduce the need to increment the nbformat in the future.

@fperez
IPython member

That sounds like a good plan. I'm going to try to finish this lovely report now, and will then ping the list.

minrk added some commits Mar 4, 2012
@minrk minrk Don't set explicit IDENTITY for clients
explicit IDENTITIES imply a potentially undesirable side-effect called 'durable sockets', which should be avoided.
7ea775c
@minrk minrk simplify IPython.parallel connections
Rolls back two-stage connection, putting complete connection info
into the connection files.  This makes it easier to use hand-crafted
ssh tunnels, as all ports are read from the file, rather than
from the reply to registration/connection requests.

It is no longer possible to connect to the Controller without a connection file.

Adding the serialization methods to the connection file also makes it
harder for custom serialization to result in a mismatch in configuration
between the various objects.
a70d89c
@minrk minrk use individual ports, rather than full urls in connection files be9601b
@minrk minrk enables resume of ipcontroller
cleans up EngineConnector and registration messages further, only storing single UUID for each engine (all the stored UUIDs happened to be the same already).

Message spec docs updated to reflect changes to message formats.
456307a
@minrk minrk remove a few dangling asbytes from rebase 5acc971
@minrk minrk IPython.parallel py3compat 2888383
@minrk
IPython member

Rebased. I don't know why GitHub thinks it cannot be automatically merged, because it certainly can. Presumably this will resolve itself before long.

@fperez
IPython member

We should probably discuss this one at the sprint...

@minrk
IPython member

Test results for commit 2888383 merged into master
Platform: darwin

  • python2.6: OK (libraries not available: cython matplotlib oct2py pygments pymongo qt rpy2 tornado wx wx.aui)
  • python2.7: OK (libraries not available: oct2py wx wx.aui)
  • python3.2: OK (libraries not available: cython matplotlib oct2py pymongo qt rpy2 wx wx.aui)

Not available for testing: python3.1

@fperez
IPython member

This is large and complex but the code and functionality looks very clean; as discussed during sprint this is best merged early in this cycle. Running testpr on linux just to be safe, will merge shortly if report is OK.

@fperez
IPython member

Test results for commit 2888383 merged into master
Platform: linux2

  • python2.7: OK
  • python3.2: OK (libraries not available: cython matplotlib oct2py pygments pymongo rpy2 wx wx.aui)

Not available for testing: python2.6, python3.1

@fperez fperez merged commit bff463b into ipython:master Jul 21, 2012
@minrk
IPython member

I have a feeling this will conflict with Jason's, and I just asked him to rebase already and don't want to do it twice. So I will get that one in once it is rebased, then make sure this still applies, and merge.

@minrk
IPython member

oops, nevermind. I'll take care of the metadata rebase if it needs one.

@minrk
IPython member

metadata doesn't need a rebase, so all is well.

@fperez
IPython member

ok, glad to hear that.

@fperez
IPython member

ok, glad to hear that.

@minrk minrk deleted the minrk:connection branch Mar 31, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment