
ENH: core: make unpickling with encoding='bytes' work #4888

Merged

merged 2 commits into numpy:master on Jul 22, 2014

Conversation

@pv (Member) commented Jul 18, 2014

When loading Py2-generated pickles on Py3, it can sometimes be useful to use pickle.load(..., encoding='bytes') instead of encoding='latin1'.

This is currently blocked by dtype.__setstate__ not accepting the endian as a byte string.
There's no good reason not to support this, so we could change it, as below.

The routine also seemed to have an old bug with a missing error return, so fix that too.
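
For reference, a minimal sketch of the usage this change is meant to enable (the file name and array contents are made up for illustration):

    # On Python 2 (illustrative), write a pickle of a NumPy array:
    #     import cPickle, numpy
    #     cPickle.dump(numpy.arange(3), open('data.pkl', 'wb'), protocol=2)

    # On Python 3, load it back with the bytes encoding instead of latin1:
    import pickle

    with open('data.pkl', 'rb') as f:
        arr = pickle.load(f, encoding='bytes')
    print(arr)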

@charris (Member) commented Jul 18, 2014

Interesting test failure pattern: 3.2 and 3.3 fail, 3.4 works.

@pv (Member, Author) commented Jul 18, 2014

The bytes encoding option in pickle seems to have been added in Python 3.4 (it's not in the Py3.3 docs but is in Py3.4). Fixed.

@charris (Member) commented Jul 18, 2014

Want to try Julian's rebase trick? Probably need HEAD^^ instead of HEAD^.

@njsmith (Member) commented Jul 18, 2014

I think we need some docs somewhere explaining how to load py2 pickles into py3?

@juliantaylor (Contributor) commented:

Hm, will this fix gh-4798?
Edit: yes it does, if one changes the pickle call to use the bytes encoding.

@pv (Member, Author) commented Jul 18, 2014

@charris: I think this is already based on the merge base.

@njsmith: maybe a separate PR; I don't immediately see where it should go.

@juliantaylor: yes, if you also add encoding='bytes' to numpy/lib/format.py:560. (encoding='latin1' does not work because Py2 datetime objects are not unpicklable on Py3 with it.) However, the resulting dtype objects end up with byte strings as field names, which is wrong; I'll try to fix that here in a moment.

The right thing to do with np.load might be to add the same encoding and related compatibility arguments that pickle.load has. (I'm a bit unhappy that the NPY file format supports pickles at all; it's a nice security loophole for executing arbitrary malicious code in an unexpected place.)
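
(For illustration, a rough sketch of what such compatibility arguments could look like; the allow_pickle and encoding parameters on np.load here are assumptions about a later/extended interface rather than something this PR adds, and py2_objects.npy is a made-up file name:)

    import numpy as np

    # 'py2_objects.npy' is a hypothetical .npy file written on Python 2 that
    # contains object arrays, and therefore embeds a pickle stream.
    arr = np.load('py2_objects.npy', allow_pickle=True, encoding='bytes')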

@juliantaylor (Contributor) commented:

Hm, I guess my commit proposal is flawed; merging this onto maintenance conflicts for no good reason :/

@juliantaylor (Contributor) commented:

Oh, it does work, but we need the recursive merge strategy (git merge -s recursive).
However, it's not rebased onto the merge base properly, possibly because the wrong ancestor was used; this should work:

git rebase --onto $(git merge-base master maintenance/1.9.x) $(git merge-base master HEAD)

@pv (Member, Author) commented Jul 18, 2014

@juliantaylor: the parent commit is 88cf0e4, which AFAIK is the current merge base, so I don't see what to fix. (Note also that your merge-base command assumes the local clone also has those branches and that they are up to date, which is not usually the case. Moreover, some people might use origin/ and others upstream/ as the upstream repo prefix, so one would have to write the instructions carefully.)

@juliantaylor (Contributor) commented:

Oh yes, right, the branch is fine; I screwed up a command and used a merge into my master instead of the original branch to test.

@pv (Member, Author) commented Jul 18, 2014

Added the dtype field name conversions.

The review comments below refer to this hunk in dtype.__setstate__:

#undef _ARGSTR_
PyErr_Clear();
#endif
if (!PyArg_ParseTuple(args, "(icOOOiii)", &version, &endian_char,
Contributor (review comment):

Would it be simpler to check the type of args[2] and change the format string accordingly, instead of parsing twice?
PyArg_ParseTuple is a very slow function; it could slow down loading large pickles.

Contributor (review comment):

how about something like this: juliantaylor@bfd96a9

Contributor (review comment):
Actually, shouldn't we just convert the endian object to ASCII? Anything else is not allowed as an endian character anyway.

pv added 2 commits July 22, 2014 23:42
Make dtype.__setstate__ accept endian either as a byte string or unicode.

Also fix a missing error return introduced in c355157, apparently by mistake.
… + coerce on Py3

That 'names' is a tuple and 'fields' a dict (when present) is assumed in
most of the code base, so check them during unpickling.

Also add coercion from bytes using the ASCII codec on Python 3. This is
never triggered in the "usual" case of loading Py3-generated pickles on
Py3, but can occur when loading Py2-generated pickles with
pickle.load(f, encoding='bytes'), which sometimes is the only working
option.

The ASCII codec is probably the safest choice and likely covers most use
cases.
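
A rough illustration of the behaviour the second commit aims for; the Py2 half is shown as comments since the two snippets run under different interpreters, and the file name and dtype are made up:

    # Python 2 side (illustrative):
    #     import cPickle, numpy
    #     a = numpy.zeros(2, dtype=[('a', 'i4'), ('b', 'f8')])
    #     cPickle.dump(a, open('rec.pkl', 'wb'), protocol=2)

    # Python 3 side:
    import pickle

    with open('rec.pkl', 'rb') as f:
        a = pickle.load(f, encoding='bytes')

    # With the coercion in place, the field names should come back as str
    # rather than bytes, even though the pickle was decoded with encoding='bytes'.
    print(a.dtype.names)   # expected: ('a', 'b')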
@pv (Member, Author) commented Jul 22, 2014

Rewrote it in a saner way.

@juliantaylor (Contributor) commented:
looks good, thanks

juliantaylor added a commit that referenced this pull request Jul 22, 2014
ENH: core: make unpickling with encoding='bytes' work
juliantaylor merged commit 809938d into numpy:master on Jul 22, 2014
juliantaylor added a commit that referenced this pull request Jul 22, 2014
ENH: core: make unpickling with encoding='bytes' work