start improving serialization in parallel code #2069

Merged
merged 12 commits into from Jul 26, 2012

Projects

None yet

2 participants

@minrk
IPython member
  • newserialized is now totally unused
  • never double-pickle anything
  • canning in pickleutil is extensible via can_map dict
  • Session has configurable zero-copy threshold
@minrk
IPython member

Test results for commit c857ed9 merged into master
Platform: darwin

  • python2.6: OK (libraries not available: cython matplotlib oct2py pygments pymongo qt rpy2 tornado wx wx.aui)
  • python2.7: OK (libraries not available: oct2py wx wx.aui)
  • python3.2: OK (libraries not available: cython matplotlib oct2py pymongo qt rpy2 wx wx.aui)

Not available for testing: python3.1

minrk added some commits Mar 25, 2012
@minrk minrk better serialization for parallel code
* no longer use newserialized
* no longer double-pickle anything
* can/pickleutil handles arrays
* pickleutil has [un]can_map dicts for extensibility
* don't can each element of long sequences
6e68a6c
@minrk minrk add copy_threshold to limit use of zero-copy to sufficiently large me…
…ssages

threshold is triggered on the largest part of the message, and copy is applied to all parts, to avoid split-sends.
20dd4df
@minrk minrk remove unused newserialized imports 7cc3323
@minrk minrk move zero-copy threshold to 64kB ec3e866
@minrk minrk use brief dtype, rather than full object 3cbec13
@minrk minrk update serialize tests 72f19a7
@minrk minrk fix inverted copy-threshold 2aee9e3
@minrk minrk tweak non-tracking session test 2a27e9b
@minrk minrk catch NotImplementedError when using isnan in test
old numpy raises this rather than TypeError on inappropriate dtypes
a7460c2
@minrk
IPython member

Test results for commit a7460c2 merged into master
Platform: darwin

  • python2.6: OK (libraries not available: cython matplotlib oct2py pygments pymongo qt rpy2 tornado wx wx.aui)
  • python2.7: OK (libraries not available: oct2py wx wx.aui)
  • python3.2: OK (libraries not available: cython matplotlib oct2py pymongo qt rpy2 wx wx.aui)

Not available for testing:

@minrk
IPython member

(retested after rebase)

@minrk minrk referenced this pull request Jul 21, 2012
Merged

can func_defaults #1990

@fperez
IPython member

Test results for commit a7460c2 merged into master
Platform: linux2

  • python2.7: OK
  • python3.2: OK (libraries not available: cython matplotlib oct2py pymongo rpy2 wx wx.aui)

Not available for testing: python2.6

@fperez fperez and 1 other commented on an outdated diff Jul 24, 2012
IPython/utils/pickleutil.py
-def canDict(obj):
+def can(obj):
+ """prepare an object for pickling"""
+ for cls,canner in can_map.iteritems():
+ if isinstance(cls, basestring):
+ try:
+ cls = import_item(cls)
+ except Exception:
+ # not importable
+ print "not importable: %r" % cls
@fperez
fperez Jul 24, 2012

Do you want just a bare print here? We've been moving towards more structured logging in general, so I was a bit surprised to see just a simple print for this failure...

@minrk
minrk Jul 25, 2012

uses logging now.

@fperez fperez and 1 other commented on an outdated diff Jul 24, 2012
IPython/utils/pickleutil.py
@@ -147,5 +206,27 @@ def uncanSequence(obj, g=None):
return obj
-def rebindFunctionGlobals(f, glbls):
- return FunctionType(f.func_code, glbls)
+#-------------------------------------------------------------------------------
+# API dictionary
+#-------------------------------------------------------------------------------
+
+# These dicts can be extended for custom serialization of new objects
+
+can_map = {
+ 'IPython.parallel.dependent' : lambda obj: CannedObject(obj, keys=('f','df')),
+ 'numpy.ndarray' : CannedArray,
+ FunctionType : CannedFunction,
+ bytes : CannedBytes,
+ buffer : CannedBuffer,
+ # dict : can_dict,
@fperez
fperez Jul 24, 2012

Why are these guys commented out?

@minrk
minrk Jul 25, 2012

One of the goals of this is to avoid the unlimited iteration through containers of containers. There are two approaches to this:

  1. only iterate through top-level containers (e.g. return (5, np.array(...)))
  2. allow some configurable depth, which can limit this.

I went with 1., which means that containers should not automatically trigger nested canning (now there's a mixed metaphor).

What this means is that it is now easy to manufacture a case that this new serialization cannot handle:

In [17]: view.apply_sync(lambda x: x, lambda y:y)
Out[17]: <function <lambda>>

In [18]: view.apply_sync(lambda x: x, [lambda y:y])
---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
...
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed

The first worked, because can_sequence was called on:

args = (lambda x:x,)

and the lambda was canned. The second failed, because it was called on:

args = ( [ lambda x:x ], )

and the inner list was left untouched.

If we want to allow this kind of thing, then we need to expose a limited depth somehow, and I would probably set the default depth limit to 2, where the current behavior would be equivalent to setting it to 1.

Since that is more complex, I would rather have this limited case first, and only add the complexity of limited depth if people are encountering problems in real use cases, rather than contrived ones.

@fperez
fperez Jul 26, 2012

Fair enough, I agree with your reasoning. But it's good to have this discussion of fairly fine points documented here.

@fperez

mmh, I'm wondering if we should make this logic for getting the logger object correctly available in a more centralized location. I bet you someone is going to one day write the wrong logic missing the Application object.

Note that this isn't an objection on this PR in particular, rather that I saw this as an opportunity to improve this particular aspect in general. But if you're OK with that, it can be done in a separate PR later, along with a short section in our dev guide titled Proper use of logging in IPython or somesuch. Thoughts?

IPython member
from IPython import get_logger

?

There would really be no problem if someone skipped the Application logic, the log would just be handled by the root logger instead of the App logger. Rarely does this cause anything other than an aesthetic issue.

IPython member
IPython member

But that's not usually a 'problem'. tornado uses the root logger, and this would just go in the general log if called from a tornado app.

IPython member

Right, it's not a huge issue. But it may make debugging and other analysis easier if all ipython logging goes into our logs, no? For example, the tornado general logger could well be configured with a very different default format, so you could imagine a grep-based script someone writes to scan through our logs not finding things in there.

Again, I agree it's not a big deal, I'm just taking the opportunity to think here about what the best practice should be in the long run.

IPython member

Sure. Shall I add get_logger to IPython.__init__ as a part of this (or somewhere else)?

IPython member

Totally up to you, here is fine if you want to keep our overhead low... I only offered for it to be done elsewhere if you wanted to delay that for another time.

IPython member
IPython member
@fperez
IPython member

Since the above is mostly a question thinking about a future PR I'm OK merging this now as-is, unless you prefer to modify this one while you're at it to include that. Totally up to you.

@minrk
IPython member

I'll make sure to do a bit more testing this evening, since I haven't tested since some recent changes, then it can go in.

@fperez
IPython member
@minrk
IPython member

Test results for commit 3abd3ab merged into master
Platform: darwin

  • python2.6: OK (libraries not available: cython matplotlib oct2py pygments pymongo qt rpy2 tornado wx wx.aui)
  • python2.7: OK (libraries not available: oct2py pymongo rpy2 wx wx.aui)
  • python3.2: Failed, log at https://gist.github.com/3180303 (libraries not available: matplotlib oct2py pygments pymongo qt rpy2 wx wx.aui)

Not available for testing:

@minrk
IPython member

faiure is related to Cython+py3.2 on my recently updated OS, and not IPython's problem. So this is good to go, for me.

@fperez
IPython member

OK, then let's land this puppy so this code sees more testing... Merging now.

@fperez fperez merged commit 8796581 into ipython:master Jul 26, 2012
@minrk
IPython member

Thanks! And #1990 should be able to go in as well now that this is (running test_pr there after rebase).

@minrk minrk deleted the minrk:betterserial branch Mar 31, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment