
Error unpickling unicode in Python 3 #4879

Closed

tomgoddard opened this issue Jul 17, 2014 · 41 comments

@tomgoddard

Numpy 1.8.1 used with Python 3 gives an error when unpickling a numpy unicode object which was pickled with Python 2.

The bug is in the numpy.core.multiarray.scalar(dtype,string) routine which is used to unpickle this type of numpy object. In Python 3 passing the second argument of scalar() as a string causes an error ("TypeError: initializing object must be a string"). The scalar() call works in Python 3 only if the second argument is a byte array. In Python 2 the scalar() routine works with a string as the second argument (and also with a byte array as second argument). The error is in file

numpy/core/src/multiarray/multiarraymodule.c

in the array_scalar() routine at line 1874, where it calls PyString_Check(obj) on the second argument and then raises the error "initializing object must be a string". In Python 3 this check accepts only a byte array, while in Python 2 it accepts a string. Checking for a string in Python 3 requires PyUnicode_Check(), I believe, while checking for a byte array uses PyBytes_Check(); PyString_Check() from Python 2 has been eliminated. I'm not clear on how the numpy code compiles under Python 3 with that PyString_Check().

Here is a test case that demonstrates the bug. It uses Python 2 to create the pickle string and Python 3 to unpickle it. It is necessary to create the pickle string with Python 2 because numpy's handling of unicode changes in Python 3. Pickle is documented as being backwards compatible between all Python versions. This error causes PyTables, an HDF5 file interface, to load unicode string data incorrectly (returning a raw pickle string instead of a unicode string due to the error reported here).

Test case.

$ python2.7.5

>>> import numpy
>>> numpy.__version__
'1.6.2'
>>> import pickle
>>> p = pickle.dumps(numpy.unicode0('abc'))
>>> pickle.loads(p)
u'abc'
>>> p
"cnumpy.core.multiarray\nscalar\np0\n(cnumpy\ndtype\np1\n(S'U3'\np2\nI0\nI1\ntp3\nRp4\n(I3\nS'<'\np5\nNNNI12\nI4\nI0\ntp6\nbS'a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00'\np7\ntp8\nRp9\n."

$ python3.4

>>> import numpy, pickle
>>> numpy.__version__
'1.8.1rc1'
>>> p = b"cnumpy.core.multiarray\nscalar\np0\n(cnumpy\ndtype\np1\n(S'U3'\np2\nI0\nI1\ntp3\nRp4\n(I3\nS'<'\np5\nNNNI12\nI4\nI0\ntp6\nbS'a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00'\np7\ntp8\nRp9\n."
>>> pickle.loads(p)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: initializing object must be a string

The unpickle operation (pickle.loads(p)) results in the following numpy call, which is what raises the error:

>>> numpy.core.multiarray.scalar(numpy.dtype('U3'), 'a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: initializing object must be a string

If we change the second argument of the scalar() call to a byte array, it works correctly:

>>> numpy.core.multiarray.scalar(numpy.dtype('U3'), b'a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00')
'abc'

But the second argument is encoded in the pickle string as a string, not a byte array, so the numpy scalar() routine must accept a string as the second argument.

Here is the associated PyTables bug report:

PyTables/PyTables#368

@nouiz
Contributor

nouiz commented Jul 17, 2014

Just to note that we also have this problem in the Theano tests. Our fix is to pickle two files, one in Python 2 and one in Python 3, and use the right one. But a real fix in numpy would be great!

@njsmith
Member

njsmith commented Jul 17, 2014

@tomgoddard: thanks for diagnosing all that! Want to take the final step
and submit a pull request with the fix? :-)

@tomgoddard
Author

PyTables is used to archive data in HDF5 file format. Thousands of such files have been created with the software I develop for electron microscopy data. I can’t go back and fix every user’s archived files, so the numpy bug really needs to be fixed so old pickled data can be restored. The only way I could work around this without a numpy fix would be to monkey patch numpy to substitute in a working multiarray.scalar() routine, or to inspect and change pickle byte-code strings. Both work-arounds are very ugly. The numpy fix should be quite easy, although it will require someone knowledgeable about how numpy's Python 2 / 3 compatibility is handled in the C code.
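
For what it's worth, here is a minimal sketch of what such a monkey patch could look like under Python 3. It assumes pickle hands scalar() a latin1-decodable str; the wrapper itself is hypothetical, not numpy code:

import numpy.core.multiarray as ma

_orig_scalar = ma.scalar

def _patched_scalar(dtype, obj=None):
    # pickle looks up numpy.core.multiarray.scalar by attribute at load
    # time, so replacing the module attribute takes effect for unpickling
    if isinstance(obj, str):
        obj = obj.encode('latin1')  # recover the raw byte buffer the C routine expects
    return _orig_scalar(dtype, obj)

ma.scalar = _patched_scalar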

@tomgoddard
Author

I don’t think I’m the best person to fix the code because I am not familiar with how numpy C code handles Python 2 / 3 compatibility issues.
It should be very easy for someone who understands or worked on the Python 2 to 3 port.

@njsmith
Member

njsmith commented Jul 17, 2014

I don't think there's much to say about py2 vs py3 - the unpickling code
just needs to be taught to handle the object that it's being handed (since,
as you note, it's too late to change the files).

@tomgoddard
Author

The numpy unpickling C code array_scalar() needs to work in both Python 2 and Python 3, and the Python C APIs for strings changed from Python 2 to 3. The fixed numpy C code has to compile with both Python 2 and Python 3, and probably has to be different code for those two Python versions. But maybe numpy uses some Python 2/3 compatibility layer; I don’t know how numpy handles the differences between the Python 2 and 3 C APIs regarding strings.

@juliantaylor
Contributor

hm, is this right?

In [9]: pickle.dumps(numpy.unicode(u'abüä'))
Out[9]: 'Vab\xfc\xe4\np0\n.'

it loads back as u'ab\xfc\xe4' ...

seems our unicode pickling is quite borked like all our string handling

@juliantaylor
Contributor

fwiw this hack does allow loading this back, but decoding to ascii feels very wrong:
https://github.com/juliantaylor/numpy/tree/unicode-unpickle

@njsmith
Member

njsmith commented Jul 17, 2014

Your example looks correct to me. At least after staring at it in confusion for a while :-) It seems that u"\xff" is the same as u"\u00ff", and that unicode.__repr__ prefers to encode stuff outside of ascii instead of printing it directly. Try printing the string instead of just looking at the repr...

@scopatz

scopatz commented Jul 17, 2014

Just to second that we would really appreciate an upstream fix in numpy, from the PyTables perspective. Thanks a ton!

@juliantaylor
Contributor

oh right, forgot about __repr__ on python2.
as expected my branch fails for non-ascii data, so it's not very useful as is:

numpy.core.multiarray.scalar(numpy.dtype('U4'), 'a\x00\x00\x00b\x00\x00\x00\xfc\x00\x00\x00\xe4\x00\x00\x00')

@juliantaylor juliantaylor added this to the 1.9 blockers milestone Jul 17, 2014
@juliantaylor
Contributor

is there a way to get the pickle protocol number from the scalar function? I think we need that to determine the encoding.

Also I don't understand this:

In [2]: pickle.dumps(u"#äüä#", protocol=0)
Out[2]: 'V#\xe4\xfc\xe4#\np0\n.'

this is latin1 encoding; where does that come from? sys.getdefaultencoding() is ascii, as is protocol 0, though I can't find its specification :/

@tomgoddard
Author

Hi Julian,

I don’t understand your questions. You know pickle is producing Python byte codes, so it is not directly interpretable, and I don’t think the numpy code needs to know the protocol of the encoding, because the unpickling will execute the byte codes; all you need to know is that it will call the numpy scalar() routine with a string as the second argument. If you want a better handle on what is in the pickle string to help debug this, there is the standard pickletools module:

https://docs.python.org/3.4/library/pickletools.html

and here is an example disassembling the pickle string that you used in your test case:

$ python
Python 2.7.5 (default, Mar 9 2014, 22:15:05)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin

>>> import pickle
>>> pickle.dumps(u"#äüä#", protocol=0)
'V#\xe4\xfc\xe4#\np0\n.'
>>> p = pickle.dumps(u"#äüä#", protocol=0)
>>> import pickletools as pt
>>> pt.dis(p)
    0: V    UNICODE    u'#\xe4\xfc\xe4#'
    7: p    PUT        0
   10: .    STOP
highest protocol among opcodes = 0

Tom

@juliantaylor
Contributor

I guess I don't understand how the python string representation works for unicode.

0: V    UNICODE    u'#\xe4\xfc\xe4#'

this is u'#\xe4\xfc\xe4#', but the bytes in this (python) string are still latin1; how does python know how to decode this?

it's not PyUnicode_AsUTF32String: e.g. 'a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00', which is utf32 for `abc`, put through that ends up with every zero byte encoded as a 4-byte utf32 zero:

b'\xff\xfe\x00\x00a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00c\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

@juliantaylor
Contributor

I guess using PyUnicode_FromKindAndData to force it to utf32 (the only representation numpy currently understands in the scalar path) if it's not already, and then just using PyUnicode_DATA, might work.

@njsmith
Member

njsmith commented Jul 18, 2014

From a bit of fiddling, it looks like protocol 0 replaces backslashes and
all characters whose codepoints are >255 with \u escapes, and then the
result is encoded with latin1. So decoding (on py2) would be
foo.decode("latin1").decode("unicode-escape"). Higher protocols appear to
just use utf8.
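
(For reference, CPython 2's pickle encodes the protocol-0 UNICODE payload with the raw-unicode-escape codec, which behaves as described above. A minimal Python 2 check, with a test string of my own:

import pickle
p = pickle.dumps(u"#\xe4\u0415#", protocol=0)
payload = p.split('\n')[0][1:]  # the bytes between the 'V' opcode and the newline
assert payload.decode('raw-unicode-escape') == u"#\xe4\u0415#"
)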

I'm not sure what this has to do with the bug, though, since dealing with
these encodings is the pickle module's problem not ours, right?

@pv
Member

pv commented Jul 18, 2014

That the example above works at all is actually a coincidence. This fails at an earlier stage:

$ python2
>>> import numpy, pickle
>>> pickle.dumps(numpy.unicode0(u'åäö'))
"cnumpy.core.multiarray\nscalar\np0\n(cnumpy\ndtype\np1\n(S'U3'\np2\nI0\nI1\ntp3\nRp4\n(I3\nS'<'\np5\nNNNI12\nI4\nI0\ntp6\nbS'\\xe5\\x00\\x00\\x00\\xe4\\x00\\x00\\x00\\xf6\\x00\\x00\\x00'\np7\ntp8\nRp9\n."
$ python3
>>> import numpy, pickle
>>> p = b"cnumpy.core.multiarray\nscalar\np0\n(cnumpy\ndtype\np1\n(S'U3'\np2\nI0\nI1\ntp3\nRp4\n(I3\nS'<'\np5\nNNNI12\nI4\nI0\ntp6\nbS'\\xe5\\x00\\x00\\x00\\xe4\\x00\\x00\\x00\\xf6\\x00\\x00\\x00'\np7\ntp8\nRp9\n."
>>> pickle.loads(p)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

The relevant code in CPython seems to be http://hg.python.org/cpython/file/45e8eb53edbc/Modules/_pickle.c#l4712. The encoding used is ASCII by default (in the example above, the string 'a\x00\x00...' happens to be ascii-compatible).

However, the encoding can be specified by the user, via pickle.loads(p, encoding='utf-32'), for example. It is not possible for Numpy to know what encoding the user specified, I think.
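
To see why the original 'abc' test case decoded at all while the u'åäö' one does not, compare the two raw buffers under the default encoding (a quick check of the point above, not from the thread):

b'a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00'.decode('ascii')           # works: the UTF-32-LE buffer is pure ascii
b'\xe5\x00\x00\x00\xe4\x00\x00\x00\xf6\x00\x00\x00'.decode('ascii')  # raises UnicodeDecodeError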

@pv
Member

pv commented Jul 18, 2014

That is, pickled python2 strings are loaded by python3 as unicode strings using some arbitrary user-specified encoding, which we do not know. The conclusion seems to be that pickle is simply not backward compatible between py2 and py3, and there is nothing Numpy can do to fix it.

@njsmith
Member

njsmith commented Jul 18, 2014

Docs for encoding= are here:
https://docs.python.org/3/library/pickle.html#pickle.load

I would have thought that encoding="bytes" or encoding="latin1" might have
worked, but they both give me weird errors:

python 3.4.1

>>> open("/tmp/foo", "rb").read()
b"cnumpy.core.multiarray\nscalar\np0\n(cnumpy\ndtype\np1\n(S'U4'\np2\nI0\nI1\ntp3\nRp4\n(I3\nS'<'\np5\nNNNI16\nI4\nI0\ntp6\nbS'#\x00\x00\x00\xfc\x00\x00\x00\xe4\x00\x00\x00#\x00\x00\x00'\np7\ntp8\nRp9\n."

>>> pickle.loads(open("/tmp/foo", "rb").read())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 4: ordinal not in range(128)

>>> pickle.loads(open("/tmp/foo", "rb").read(), encoding="bytes")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be a unicode character, not bytes

>>> pickle.loads(open("/tmp/foo", "rb").read(), encoding="latin1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: initializing object must be a string

@pv
Member

pv commented Jul 18, 2014

The best workaround is probably to do as @juliantaylor above suggests, and accept unicode data as-is for dtype='U' after coercing it to utf-32. The correct operation then relies on the user providing a correct encoding= value to pickle.
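
Sketched in pure Python, the idea is roughly this (assuming the user loaded with encoding='latin1', so the str round-trips to the original py2 bytes; illustrative only, not the actual C implementation):

s = 'a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00'  # what scalar() receives on py3
buf = s.encode('latin1')                       # recover the py2 byte buffer
print(buf.decode('utf-32-le'))                 # 'abc', numpy's '<U3' layout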

Pickles containing both numpy unicode and numpy byte strings probably remain unloadable:

python2
>>> import numpy, pickle
>>> pickle.dumps((numpy.unicode_(u"åäö"), numpy.bytes_("\xff\xfa\xfe")))

@njsmith
Member

njsmith commented Jul 18, 2014

Heh, this is also fun:

python 3:

>>> pickle.dumps((numpy.unicode_(u"åäö"), numpy.bytes_("\xff\xfa\xfe")))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

@pv
Member

pv commented Jul 18, 2014

@njsmith: that error comes from writing numpy.bytes_("\xff\xfa\xfe") rather than numpy.bytes_(b"\xff\xfa\xfe"); given a str, the constructor accepts only ascii.

@njsmith
Member

njsmith commented Jul 18, 2014

Ah, thanks for the catch.

I don't understand what you mean about "correct encoding" though. AFAICT
numpy on py2 is generating some arbitrary bytestring which it knows how to
decode. We don't want to tell Python about this encoding, we just want
Python to give us the bytes so we can decode them ourselves. Consider also:

pickletools.dis(pickle.dumps(np.int16(255)))

Notice that this pickle also contains a string "\xff\x00". (And it also
blows up if you try to read it on py3.) There's no way that the correct
solution here is to use encoding="int16" or something, though.

I think this means we can and should make either encoding="bytes" or
encoding="latin1" work. (The former might be simpler, the latter is more
compatible for complex pickles that also contain non-numpy objects.)
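
The property being relied on here is easy to check: latin1 decodes every byte value, and re-encoding restores the exact bytes (a quick sanity check, not from the thread):

data = bytes(range(256))
assert data.decode('latin1').encode('latin1') == data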

@njsmith
Member

njsmith commented Jul 18, 2014

In fact, this is pretty terrible. AFAICT all our py2<->py3 pickle stuff is broken; it has nothing to do with unicode scalars in particular:

py2

In [42]: pickle.dump(np.arange(255), open("/tmp/foo", "w"))

py3

>>> pickle.load(open("/tmp/foo", "rb"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 1024: ordinal not in range(128)

@pv
Member

pv commented Jul 18, 2014

Ah yes, you're right that using encoding="utf-32" breaks everything. This then probably leaves encoding="bytes" or encoding="latin1". Those then have the potential to break anything else in the pickle.

The reason why encoding="bytes" does not work is that arraydescr_setstate uses PyArg_ParseTuple to parse the state argument, and does not accept bytes objects on Py3. This should be easy to fix.

@pv
Member

pv commented Jul 18, 2014

The solution then seems to be that we fix the code paths for encoding='bytes', and tell users to load their py2 pickles using that? This probably does not involve ugly hacks where we need to guess what encoding some unicode strings originally were in.
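
The usage this proposal implies would be something like the following (a sketch; the file name is hypothetical, and encoding='bytes' requires Python >= 3.4):

import pickle
with open('legacy_py2.pkl', 'rb') as f:
    obj = pickle.load(f, encoding='bytes')  # py2 str opcodes come through as bytes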

@njsmith
Member

njsmith commented Jul 18, 2014

I guess latin1 is theoretically better if we can make it work, because it's less disruptive when loading complex pickles that include both numpy objects and regular strings. But yes, this only works if we have some way to distinguish unicode objects that should have been bytes but got passed through .decode("latin1"), versus unicode objects that are actually supposed to be unicode.

If running under py2, are there any circumstances where that argument can
legitimately be a unicode object?

@pv
Member

pv commented Jul 18, 2014

As far as I see, numpy.core.multiarray.scalar is only referenced via the scalar __reduce__ method, which always returns the data as a python byte string. On Py2, the second argument will never be unicode.

Interpreting unicode data in numpy.core.multiarray.scalar assuming the original encoding was latin1 is OK only if the user specified encoding='latin1', but otherwise it can silently produce invalid results, e.g. if the user happens to be a non-European person whose application did not use unicode strings on Py2 and stored stuff as plain strings in some other encoding.

@njsmith
Member

njsmith commented Jul 18, 2014

numpy.core.multiarray._reconstruct also appears to be affected.

You're right about there being a risk when using latin1, but I think it's
pretty small. encoding= only applies to the interpretation of py2 'str'
objects, not py2 'unicode'. So getting a silent invalid result requires:
(a) people are storing multibyte text data in py2 'str' objects using a
non-latin1 encoding, (b) are doing so using an encoding which will silently
accept the random junk that is in numpy's str objects (e.g. utf8 will
usually refuse to decode these things), (c) the decoded strings nonetheless
turn out to contain only characters with codepoints <= 255. (If there are
any codepoints >255, then our attempt to convert back to a bytes object
will fail.)

And in cases where people are storing multibyte text in py2 strs, there's
not a lot we can do -- the only possible solutions are to use
encoding="bytes" or encoding="latin1", encoding="some-weird-thing" will
never work. And there's not much to distinguish the bytes versus latin1
options here -- neither loses information, and both will require some
manual processing to recover the real multibyte text. I guess this might be
an argument for supporting both encoding="bytes" and encoding="latin1", but
I don't think it should override the preference for encoding="latin1". It's
pretty much as good in the rare multibyte encoded strs case, and in the
common case of having ascii str's it's much better.

@pv
Member

pv commented Jul 18, 2014

The necessary properties are also shared by other 8-bit codecs, not only by latin1. This is pretty much an arbitrary choice on our part, and a west european language centric choice. (Also, west european Windows users might also find "Windows-1252" preferable to "latin1" and so it goes...)

Which precise 8-bit codec is chosen, however, is probably not very important. I'd expect the most common case to be that applications are sensible and store non-ascii strings as unicode.

@njsmith
Member

njsmith commented Jul 18, 2014

"necessary properties" meaning, ascii-compatible and invertible? I guess
that's true, but if we have to pick such a codec arbitrarily then latin1 is
the one to pick -- it's conventionally used, documented in py2->py3 guides,
etc., for exactly this purpose. Plus it's the unique codec for which the
unicode codepoint values match the input bytes, and it's the unique codec
that results in 1-byte-per-char internal storage for py3.3+ strings. (This
is particularly relevant if we want to keep the unpickling hack where we
steal the string's memory...)
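
That codepoint property is easy to verify directly (a quick check, not from the thread):

s = b'\x00\x7f\x80\xff'.decode('latin1')
assert [ord(c) for c in s] == [0x00, 0x7f, 0x80, 0xff]  # codepoints equal byte values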

@pv
Member

pv commented Jul 18, 2014

We seem to already assume latin1 in array_setstate, so it seems I missed adding a similar hack in scalar unpickling in the Py3 port.

@pv
Member

pv commented Jul 18, 2014

I also missed adding some checks to that hack, as it dumps core for some non-latin1 encodings. That needs to be fixed, too.

@pv
Member

pv commented Jul 18, 2014

The same hack as Julian's but using latin1 instead of ascii in gh-4883.
The encoding='bytes' bugs should probably also be ironed out.

pv added a commit to pv/numpy that referenced this issue Jul 18, 2014
…der encoding='latin1'

There is a similar hack in place for arrays, but scalar unpickling was not covered.

Provides a workaround for numpygh-4879
@pv
Member

pv commented Jul 18, 2014

encoding='bytes' fix is in gh-4888 --- however that pickle encoding is available only in Python >= 3.4.

Using encoding='latin1' probably works in many cases, but e.g. Python 2's datetime objects are not pickleable on Py3 with it.
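
So the recommended way to load an old Python 2 pickle on Python 3 ends up being (a sketch; the file name is hypothetical):

import pickle
with open('legacy_py2.pkl', 'rb') as f:
    obj = pickle.load(f, encoding='latin1')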

@juliantaylor
Contributor

I think we have now solved this as best we can. Please try out master or maintenance/1.9.x and see if it works to your satisfaction. If not, and there is something we can improve, please reopen the issue.

@tomgoddard
Author

Thanks for the quick work. I will try it and report back, probably tomorrow. Tom

@tomgoddard
Author

Ok, I tested with numpy 1.9 source code from yesterday (obtained with git clone). PyTables in Python 3 now correctly reads unicode strings that were saved in HDF5 files by PyTables in Python 2 using a pickled numpy unicode string.

Also directly testing pickling and unpickling of numpy strings within Python3 worked in a simple case:

$ python3
Python 3.4.0 (default, May 8 2014, 10:47:10)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin

>>> import numpy as n, pickle as p
>>> s = n.str0('abc')
>>> type(s)
<class 'numpy.str_'>
>>> s
'abc'
>>> ps = p.dumps(s)
>>> ps
b'\x80\x03cnumpy.core.multiarray\nscalar\nq\x00cnumpy\ndtype\nq\x01X\x02\x00\x00\x00U3q\x02K\x00K\x01\x87q\x03Rq\x04(K\x03X\x01\x00\x00\x00<q\x05NNNK\x0cK\x04K\x00tq\x06bC\x0ca\x00\x00\x00b\x00\x00\x00c\x00\x00\x00q\x07\x86q\x08Rq\t.'
>>> up = p.loads(ps)
>>> up
'abc'
>>> type(up)
<class 'numpy.str_'>

Thanks for the fix!

@vikrantsanghvi

vikrantsanghvi commented Apr 21, 2016

hi,
I am new to the python world and am having the same trouble: I have a pickled file from Python 2.7 with a datetime object in it. When I try to read it in Python 3.5 by unpickling it, I get a similar error because of the datetime object. I went through the threads above; it seems this has to do with the encoding used for unpickling. Which encoding works for datetime objects? @pv @tomgoddard @juliantaylor
Also, this pickle file contains a large dictionary.
Thanks in advance.

UPDATE:
Apparently latin1 encoding worked. Thank you for the info

@harshaks23

There is hickle, which is faster than pickle and easier. I tried to save and read my data with a pickle dump, but while reading there were a lot of problems; I wasted an hour and still didn't find a solution, though I was working on my own data to create a chat bot.

vec_x and vec_y are numpy arrays:

import hickle as hkl

data = [vec_x, vec_y]
hkl.dump(data, 'new_data_file.hkl')

Then you just read it back and perform the operations:

data2 = hkl.load('new_data_file.hkl')

@mattip
Member

mattip commented Jul 13, 2018

Please open a new issue; this one was closed long ago. It is not clear whether you are confirming that you found a solution to your problem or reporting that you have discovered something new. If you do open a new issue, please supply self-contained code to reproduce the problem, and specify which OS, Python, and numpy versions you are using. If it is an issue with hickle, you may want to reach out to them.
