encoding not respected on read_msgpack #10581

Closed
ruidc opened this Issue Jul 15, 2015 · 1 comment

Contributor

ruidc commented Jul 15, 2015

As discussed on https://groups.google.com/forum/#!topic/pydata/ngROaML_hLI
the encoding does not seem to be respected when reading a msgpack. Below, I am writing with encoding='utf8' and expect
to get back exactly what I put in:

In [17]: s
Out[17]: u'\u2019'

In [18]: s = pd.Series({'a' : u"\u2019" })

In [19]: s.values[0]
Out[19]: u'\u2019'

In [20]: pd.read_msgpack(s.to_msgpack(encoding='utf8')).values[0]
Out[20]: u'\xe2\x80\x99'

Stepping through, part of the problem seems to be that the call to unpack at https://github.com/pydata/pandas/blob/master/pandas/io/packers.py#L134 is not passed an encoding argument, so it defaults to latin1 at https://github.com/pydata/pandas/blob/master/pandas/io/packers.py#L558
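
To illustrate the symptom, here is a minimal sketch in plain Python (no pandas or msgpack involved, and the variable names are only for illustration) of what happens when UTF-8 bytes are decoded as latin1. It gives the same three-character string as Out[20] above:

```python
# -*- coding: utf-8 -*-
# Illustration only: reproduces the mojibake without pandas or msgpack.
original = u"\u2019"                 # RIGHT SINGLE QUOTATION MARK

raw = original.encode("utf8")        # three bytes: 0xE2 0x80 0x99

# Decoding with the wrong codec maps each byte to its own code point.
wrong = raw.decode("latin1")

# Decoding with the codec used for writing restores the original string.
right = raw.decode("utf8")

print(repr(wrong))
print(right == original)
```

So the data itself is written correctly as UTF-8; it is only the decode step on read that picks the wrong codec.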

Changing L134 to:

l = list(unpack(fh, **kwargs))

and passing the encoding like:

pandas.read_msgpack(m, encoding='utf8') 

makes it work for me. However, I don't have an environment set up to submit this as a pull request via GH, and we're still using 0.14.1 due to compatibility issues.
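
The idea behind the change can be sketched with stand-in functions. Note that `unpack_stub`, `read_without_forwarding` and `read_with_forwarding` are hypothetical names made up for this example, not the actual pandas internals; the only assumption carried over from above is that the unpack step defaults to latin1 when no encoding reaches it:

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-ins, NOT the real pandas/io/packers.py code.

def unpack_stub(raw, encoding="latin1"):
    """Stand-in for the unpack step: decode bytes using `encoding`."""
    return raw.decode(encoding)


def read_without_forwarding(raw, **kwargs):
    # kwargs are accepted but never passed on, so the latin1 default applies.
    return unpack_stub(raw)


def read_with_forwarding(raw, **kwargs):
    # kwargs (including encoding) are forwarded to the unpack step.
    return unpack_stub(raw, **kwargs)


packed = u"\u2019".encode("utf8")

lost = read_without_forwarding(packed, encoding="utf8")
kept = read_with_forwarding(packed, encoding="utf8")

print(repr(lost))
print(kept == u"\u2019")
```

In this sketch the caller asks for utf8 in both cases, but only the forwarding variant honours it, which is the behaviour I would expect from read_msgpack.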

jreback added this to the 0.17.0 milestone Jul 15, 2015

Contributor

ruidc commented Jul 15, 2015

On Py2.7, even after making this change, the following call surprisingly raises a UnicodeDecodeError in msgpack.cpp:

pandas.read_msgpack(pandas.DataFrame([[401L, u'a']], index=[0], columns=['k', 'v']).to_msgpack(encoding='utf8'), encoding='utf8')

I'm having trouble stepping through it, as my PyCharm environment crashes when inspecting.
Curiously, some small changes, like changing the 401L to 40L, make the error go away.

jreback closed this in #10686 Aug 18, 2015
