Error unpickling unicode in Python 3 #4879
Just to note that we also have this problem in the Theano tests. The fix we …
@tomgoddard: thanks for diagnosing all that! Want to take the final step?
PyTables is used to archive data in the HDF5 file format; thousands of such files have been created.
I don’t think I’m the best person to fix the code, because I am not familiar with how the numpy C code handles Python 2 / 3 compatibility issues.
I don't think there's much to say about py2 vs py3 - the unpickling code …
The numpy unpickling C code array_scalar() needs to work in both Python 2 and Python 3.
Hm, is this right? It loads back as … Seems our unicode pickling is quite borked, like all our string handling.
FWIW this hack does allow loading this back, but decoding to ascii feels very wrong: …
Your example looks correct to me, at least after staring at it in confusion.
Just to second that, from the PyTables perspective, we would really appreciate an upstream fix in numpy. Thanks a ton!
Oh right, forgot about …
Is there a way to get the pickle protocol number from the scalar function? I think we need that to determine the encoding. Also, I don't understand this: … this is latin1 encoding - where does that come from? sys.getdefaultencoding() is ascii, as is protocol 0, though I can't find its specification :/
Hi Julian, I don’t understand your questions. Pickle produces Python byte codes, so it is not directly interpretable, and I don’t think the numpy code needs to know the protocol of the encoding: the unpickling will execute the byte codes, and all you need to know is that it will call the numpy scalar() routine with a string as the second argument. If you want a better handle on what is in the pickle string, to help debug this, there is the standard pickletools module (https://docs.python.org/3.4/library/pickletools.html); here is an example disassembling the pickle string that you used in your test case: …
Tom
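As a concrete illustration of the pickletools suggestion, here is a hypothetical stand-in: disassembling a protocol-0 pickle of a plain Python string rather than the numpy scalar pickle from the original test case.

```python
import io
import pickle
import pickletools

# Disassemble a protocol-0 pickle to see the opcode stream that the
# unpickler will execute; a plain string with an embedded NUL byte
# stands in for the numpy scalar pickle discussed in the thread.
buf = io.StringIO()
pickletools.dis(pickle.dumps("a\x00b", protocol=0), out=buf)
print(buf.getvalue())
```

On Python 3 this shows a UNICODE opcode; a Python 2 pickle of a `str` would contain a STRING opcode instead, which is exactly where the decoding question below arises.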
I guess I don't understand how the Python string representation works for unicode - this is not what I expected.
I guess using …
From a bit of fiddling, it looks like protocol 0 replaces backslashes and non-printable bytes with escape sequences. I'm not sure what this has to do with the bug, though, since dealing with those escapes happens inside the unpickler.
That the example above works at all is actually a coincidence. This fails at an earlier stage: …
The relevant code in CPython seems to be http://hg.python.org/cpython/file/45e8eb53edbc/Modules/_pickle.c#l4712 . The encoding used is ASCII by default (in the example above, the string 'a\x00\x00…' happens to be ascii-compatible). However, the encoding can be specified by the user, via the encoding= argument to pickle.loads.
That is, pickled python2 strings are loaded by python3 as unicode strings using some arbitrary user-specified encoding, which we do not know. The conclusion seems to be that pickle is simply not backward compatible between py2 and py3, and there is nothing numpy can do to fix it.
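A sketch of the behaviour described here, using a hand-built protocol-0 pickle of the Python 2 str '\xff' (the pickle bytes are written out by hand, since Python 2 is not available to generate them):

```python
import pickle

# What py2 pickle.dumps('\xff') emits: STRING opcode, memo PUT, STOP.
py2_pickle = b"S'\\xff'\np0\n."

# The default encoding='ASCII' cannot decode the 0xff byte:
try:
    pickle.loads(py2_pickle)
except UnicodeDecodeError:
    print("ASCII decoding fails")

# The user can override the guess, as described above:
print(pickle.loads(py2_pickle, encoding="latin1"))  # the one-char str '\xff'
print(pickle.loads(py2_pickle, encoding="bytes"))   # the bytes b'\xff'
```

This makes the problem concrete: the same pickle yields three different results depending on an argument numpy never sees.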
Docs for encoding= are here: … I would have thought that encoding="bytes" or encoding="latin1" might have worked, but (python 3.4.1): …
The best workaround is probably to do as @juliantaylor suggests above, and accept unicode data as-is for dtype='U' after coercing it to utf-32. The correct operation then relies on the user providing a correct encoding. Pickles containing both numpy unicode strings and numpy byte strings probably remain unloadable:
Heh, this is also fun on python 3: …
@njsmith: that error comes from writing …
Ah, thanks for the catch. I don't understand what you mean about "correct encoding", though. AFAICT pickletools.dis(pickle.dumps(np.int16(255))) shows that this pickle also contains a raw string "\xff\x00" (the little-endian bytes of the scalar). I think this means we can and should make either encoding="bytes" or encoding="latin1" work.
In fact, this is pretty terrible. AFAICT all our py2<->py3 pickle handling is affected. py2: In [42]: pickle.dump(np.arange(255), open("/tmp/foo", "w")); py3: …
Ah yes, you're right that using … The reason why …
The solution then seems to be that we fix the code paths for …
I guess latin1 is theoretically better if we can make it work, because it's byte-invertible. If running under py2, are there any circumstances where that argument can …?
As far as I can see, interpreting unicode data in …
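What the latin1-based repair amounts to, in pure Python (assuming, hypothetically, that the pickled buffer held the UTF-32-LE data of a one-word 'U' scalar):

```python
# Because latin1 maps byte 0xNN to code point U+00NN, encoding the
# (mis)decoded string back to latin1 recovers the raw buffer exactly.
original = "ab".encode("utf-32-le")        # b'a\x00\x00\x00b\x00\x00\x00'
mis_decoded = original.decode("latin1")    # what pickle.loads(..., encoding='latin1') yields
recovered = mis_decoded.encode("latin1")
assert recovered == original
assert recovered.decode("utf-32-le") == "ab"
```

The scalar-unpickling code can therefore undo the unpickler's decoding step and reinterpret the bytes with the dtype's own layout.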
numpy.core.multiarray._reconstruct also appears to be affected. You're right about there being a risk when using latin1, but I think it's small. And in cases where people are storing multibyte text in py2 strs, there's not much we could do anyway.
The necessary properties are also shared by other 8-bit codecs, not only by latin1. This is pretty much an arbitrary choice on our part, and a west-european-language-centric one. (Also, west european Windows users might find "Windows-1252" preferable to "latin1", and so it goes...) Which 8-bit codec is chosen is, however, probably not very important: I'd expect the most common case to be that applications are sensible and store non-ascii strings as unicode.
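The two properties in question can be checked directly for latin1:

```python
# latin1 round-trips every byte value losslessly, and it agrees with
# ASCII on bytes 0..127 - the two properties the fix relies on.
data = bytes(range(256))
assert data.decode("latin1").encode("latin1") == data
assert data[:128].decode("latin1") == data[:128].decode("ascii")
print("latin1 is byte-invertible and ascii-compatible")
```

Any other single-byte codec covering all 256 byte values (e.g. Windows-1252 does not: it leaves some bytes undefined) would need the same two checks.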
"Necessary properties" meaning ascii-compatible and invertible? I guess so.
We seem to already assume latin1 in …
I also missed adding some checks to that hack; it dumps core for some non-latin1 encodings. Needs to be fixed, too.
The same hack as Julian's, but using latin1 instead of ascii, is in gh-4883.
…der encoding='latin1'. There is a similar hack in place for arrays, but scalar unpickling was not covered. Provides a workaround for numpy gh-4879.
Using …
I think we have now solved this as best we can. Please try out master or maintenance/1.9.x and see if it works to your satisfaction.
Thanks for the quick work. I will try it and report back, probably tomorrow.
Tom
Ok, I tested with the numpy 1.9 source code from yesterday (obtained with git clone). PyTables on Python 3 now correctly reads unicode strings that were saved in HDF5 files by PyTables on Python 2 as pickled numpy unicode strings. Directly pickling and unpickling numpy strings within Python 3 also worked in a simple case: …
Thanks for the fix!
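A minimal sketch of the within-Python-3 round trip Tom describes (the original transcript is not preserved in this thread, so this is a reconstruction, not his exact test):

```python
import pickle

import numpy as np

# Pickle and unpickle a numpy unicode scalar entirely within Python 3;
# this exercises the same multiarray.scalar path as the bug report.
s = np.str_("\u00e9tude")
restored = pickle.loads(pickle.dumps(s))
assert restored == s
assert isinstance(restored, np.str_)
print(repr(restored))
```

The cross-version py2-to-py3 case cannot be reproduced in a single interpreter, but a same-version round trip is a useful sanity check that the latin1 workaround did not break ordinary unpickling.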
Hi. UPDATE: …
unpickling fails on Python 3 with numpy < 1.9 (see numpy/numpy#4879)
There is hickle, which is faster than pickle and easier. If vec_x and vec_y are numpy arrays: data = [vec_x, vec_y], … Then you just read it back and perform the operations: data2 = hkl.load('new_data_file.hkl')
Please open a new issue; this one was closed long ago. It is not clear whether you are saying that you found a solution to your problem or that you have discovered something new. If you do open a new issue, please supply self-contained code to reproduce it, and specify which OS, Python, and numpy versions you are using. If it is an issue with hickle, you may want to reach out to them.
Numpy 1.8.1 used with Python 3 gives an error when unpickling a numpy unicode object which was pickled with Python 2.
The bug is in the numpy.core.multiarray.scalar(dtype, string) routine, which is used to unpickle this type of numpy object. In Python 3, passing the second argument of scalar() as a string causes an error ("TypeError: initializing object must be a string"). The scalar() call works in Python 3 only if the second argument is a byte array. In Python 2, the scalar() routine works with a string as the second argument (and also with a byte array). The error is in the file
numpy/core/src/multiarray/multiarraymodule.c
in the array_scalar() routine at line 1874, where it uses PyString_Check(obj) on the second argument and raises "initializing object must be a string" when the check fails. In Python 3 this accepts only a byte array, while in Python 2 it accepts a string. Checking for a string in Python 3 requires PyUnicode_Check(), while checking for a byte array uses PyBytes_Check(); PyString_Check() from Python 2 has been eliminated. I'm not clear on how the numpy code compiles under Python 3 with that PyString_Check().
Here is a test case that demonstrates the bug. It uses Python 2 to create the pickle string and Python 3 to unpickle it. It is necessary to create the pickle string with Python 2 because numpy's handling of unicode changes in Python 3. Pickle is documented as being backwards compatible between all Python versions. This error causes PyTables, an HDF5 file interface, to load unicode string data incorrectly (returning a pickle string instead of a unicode string, due to the error reported here).
Test case.
$ python2.7.5
$ python3.4
The unpickle operation (pickle.loads(p)) results in the following numpy call, which is causing the error:
If we change the second argument of the scalar() call to a byte array, it works correctly:
But the second argument is encoded in the pickle string as a string, not a byte array, so the numpy scalar() routine must accept a string as the second argument.
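To see why raw bytes are the natural payload for scalar()'s second argument, note that a numpy scalar can be reconstructed directly from its raw memory. np.frombuffer is used here purely as an illustration, not as the actual unpickling code path:

```python
import numpy as np

# The buffer that multiarray.scalar receives is the scalar's raw memory.
# Reconstructing an int16 from the two bytes b'\xff\x00' (255 in
# little-endian order) shows why a bytes object, not a decoded str,
# is the natural second argument.
buf = b"\xff\x00"
val = np.frombuffer(buf, dtype="<i2")[0]
assert val == 255
print(val)
```

A Python 2 str carries such bytes unchanged, while a Python 3 str does not, which is the core of this bug.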
Here is the associated PyTables bug report:
PyTables/PyTables#368