
ENH: core: make unpickling with encoding='bytes' work #4888

Merged

merged 2 commits into numpy:master on Jul 22, 2014

Conversation

@pv (Member) commented Jul 18, 2014

When loading Py2-generated pickles on Py3, it can sometimes be useful to use pickle.load(..., encoding='bytes') instead of encoding='latin1'.

This is currently blocked by dtype.__setstate__ not accepting the endian as a byte string.
There's no good reason not to support this, so we could change it, as below.

The routine also seemed to have an old bug with a missing error return, so fix that too.
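
For reference, a minimal sketch of the usage this change is meant to enable (the file name and array contents are made up for illustration):

    # On Python 2 (illustrative), write a pickle of a NumPy array:
    #     import cPickle, numpy
    #     cPickle.dump(numpy.arange(3), open('data.pkl', 'wb'), protocol=2)

    # On Python 3, load it back with the bytes encoding instead of latin1:
    import pickle

    with open('data.pkl', 'rb') as f:
        arr = pickle.load(f, encoding='bytes')
    print(arr)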

@charris (Member) commented Jul 18, 2014

Interesting test failure pattern: 3.2 and 3.3 fail, 3.4 works.

@pv (Member, Author) commented Jul 18, 2014

The bytes encoding option in pickle seems to have been added in Python 3.4 (it's not in the Py3.3 docs but is in Py3.4). Fixed.

@charris (Member) commented Jul 18, 2014

Want to try Julian's rebase trick? Probably need HEAD^^ instead of HEAD^.

@njsmith (Member) commented Jul 18, 2014

I think we need some docs somewhere explaining how to load py2 pickles into py3?

@juliantaylor (Contributor) commented:

Hm, will this fix gh-4798?
Edit: yes it does, if one changes the pickle call to use the bytes encoding.

@pv (Member, Author) commented Jul 18, 2014

@charris: I think this is already based on the merge base.

@njsmith: maybe a separate PR; I don't immediately see where it should go.

@juliantaylor: yes, if you also add encoding='bytes' to numpy/lib/format.py:560. (encoding='latin1' does not work because Py2 datetime objects are not unpicklable on Py3 with it.) However, the resulting dtype objects end up with byte strings as field names, which is wrong; I'll try to fix that here in a moment.

The right thing to do with np.load might be to add the same encoding and related compatibility arguments that pickle.load has. (I'm a bit unhappy that the NPY file format supports pickles at all; it's a nice security loophole for executing arbitrary malicious code in an unexpected place.)
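
(For illustration, a rough sketch of what such compatibility arguments could look like; the allow_pickle and encoding parameters on np.load here are assumptions about a later/extended interface rather than something this PR adds, and py2_objects.npy is a made-up file name:)

    import numpy as np

    # 'py2_objects.npy' is a hypothetical .npy file written on Python 2 that
    # contains object arrays, and therefore embeds a pickle stream.
    arr = np.load('py2_objects.npy', allow_pickle=True, encoding='bytes')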

@juliantaylor (Contributor) commented:

Hm, I guess my commit proposal is flawed; merging this onto maintenance conflicts for no good reason :/

@juliantaylor (Contributor) commented:

Oh, it does work, but we need the recursive merge strategy (git merge -s recursive).
However, it's not rebased onto the merge base properly, possibly because the wrong ancestor was used; this should work:

git rebase --onto $(git merge-base master maintenance/1.9.x) $(git merge-base master HEAD)

@pv (Member, Author) commented Jul 18, 2014

@juliantaylor: the parent commit is 88cf0e4, which AFAIK is the current merge base, so I don't see what to fix. (Note also that your merge-base command assumes the local clone also has those branches and that they are up to date, which is not usually the case. Moreover, some people might use origin/ and others upstream/ as the upstream repo prefix, so one would have to write the instructions carefully.)

@juliantaylor (Contributor) commented:

Oh yes, right, the branch is fine; I screwed up a command and used a merge into my master instead of the original branch to test.

@pv (Member, Author) commented Jul 18, 2014

Added the dtype field name conversions.

The review comments below refer to this hunk in dtype.__setstate__:

#undef _ARGSTR_
PyErr_Clear();
#endif
if (!PyArg_ParseTuple(args, "(icOOOiii)", &version, &endian_char,
Contributor (review comment):

Would it be simpler to check the type of args[2] and change the format string accordingly, instead of parsing twice?
PyArg_ParseTuple is a very slow function; it could slow down loading large pickles.

Contributor (review comment):

how about something like this: juliantaylor@bfd96a9

Contributor (review comment):
Actually, shouldn't we just convert the endian object to ASCII? Anything else is not allowed as an endian character anyway.

pv added 2 commits July 22, 2014 23:42
Make dtype.__setstate__ accept endian either as a byte string or unicode.

Also fix a missing error return introduced in c355157, apparently by mistake.
… + coerce on Py3

That 'names' is a tuple and 'fields' a dict (when present) is assumed in
most of the code base, so check them during unpickling.

Also add coercion from bytes using the ASCII codec on Python 3. This is
never triggered in the "usual" case of loading Py3-generated pickles on
Py3, but can occur when loading Py2-generated pickles with
pickle.load(f, encoding='bytes'), which sometimes is the only working
option.

The ASCII codec is probably the safest choice and likely covers most use
cases.
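
A rough illustration of the behaviour the second commit aims for; the Py2 half is shown as comments since the two snippets run under different interpreters, and the file name and dtype are made up:

    # Python 2 side (illustrative):
    #     import cPickle, numpy
    #     a = numpy.zeros(2, dtype=[('a', 'i4'), ('b', 'f8')])
    #     cPickle.dump(a, open('rec.pkl', 'wb'), protocol=2)

    # Python 3 side:
    import pickle

    with open('rec.pkl', 'rb') as f:
        a = pickle.load(f, encoding='bytes')

    # With the coercion in place, the field names should come back as str
    # rather than bytes, even though the pickle was decoded with encoding='bytes'.
    print(a.dtype.names)   # expected: ('a', 'b')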
@pv (Member, Author) commented Jul 22, 2014

Rewrote it in a saner way.

@juliantaylor (Contributor) commented:
looks good, thanks

juliantaylor added a commit that referenced this pull request Jul 22, 2014
ENH: core: make unpickling with encoding='bytes' work
juliantaylor merged commit 809938d into numpy:master on Jul 22, 2014
juliantaylor added a commit that referenced this pull request Jul 22, 2014
ENH: core: make unpickling with encoding='bytes' work