DataFrame.to_msgpack unexpectedly defaults to latin-1 encoding #12170

Closed
rspeer opened this issue Jan 28, 2016 · 4 comments
Labels
API Design Unicode Unicode strings
Milestone

Comments

@rspeer

rspeer commented Jan 28, 2016

I am using Python 3.

I tried saving a DataFrame with Unicode labels using the .to_msgpack method. I didn't specify an encoding, because I assumed it would use UTF-8, which is the default encoding for Python in my locale (en_US.UTF-8) as well as just a sensible encoding to use in general.

Instead, it tried to encode labels in Latin-1, which failed. Latin-1 seems like a strangely antiquated default to use in modern code.

I can work around it by passing the encoding='utf-8' option, but it would be helpful if UTF-8 were the default, as it is in other Python I/O.
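
The failure described above can be reproduced without pandas, since it comes down to which characters each codec can represent. A minimal sketch using only Python's built-in `str.encode`; the label value and the `try_encode` helper are made up for illustration and are not taken from the original DataFrame or from pandas:

```python
# Hypothetical label with characters outside Latin-1 (illustrative only).
label = "日本語"

def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# Latin-1 only covers code points U+0000 through U+00FF, so this fails.
latin1_result = try_encode(label, "latin-1")
# UTF-8 can encode every Unicode code point, so this succeeds.
utf8_result = try_encode(label, "utf-8")

print("latin-1:", latin1_result)
print("utf-8:  ", utf8_result)
```

Any label containing a character above U+00FF would hit the same error under a Latin-1 default.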

Here's my version information:

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-51-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.1
nose: 1.3.7
pip: 1.5.4
setuptools: 2.2
Cython: 0.23.3
numpy: 1.10.4
scipy: 0.16.1
statsmodels: 0.6.1
IPython: 4.0.0
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.7.3
lxml: None
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
Jinja2: None
@jreback
Contributor

jreback commented Jan 28, 2016

Hmm, we have latin1 as the default. @kawochen do you know where that came from? I'm guessing it came from the prior version of the code (before your upgrade).

jreback added the Unicode and Msgpack labels on Jan 28, 2016
jreback added this to the 0.18.0 milestone on Jan 28, 2016
@kawochen
Contributor

I don't know why either. The new spec says UTF-8 for strings (but I think everyone is keeping the encoding/decoding option for compatibility).

@jreback
Contributor

jreback commented Jan 28, 2016

Hmm, also, we should have encoding as a top-level option for .to_msgpack and .read_msgpack.
It's passed through now, but it should at least be part of the docstring.
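
One reason the encoding matters on both the write side and the read side: bytes written with one codec and read back with another can silently turn into different text. A small sketch with plain Python strings (the sample text is an assumption for illustration, and no pandas or msgpack call is involved):

```python
# Sample text chosen for illustration; "é" is a single Latin-1 character
# but takes two bytes in UTF-8.
text = "café"

utf8_bytes = text.encode("utf-8")

# Reading with the same codec recovers the original text.
roundtrip = utf8_bytes.decode("utf-8")

# Reading UTF-8 bytes as Latin-1 does not raise, because every byte value
# is valid Latin-1, but each byte becomes its own character.
misread = utf8_bytes.decode("latin-1")

print(repr(roundtrip))
print(repr(misread))
```

Because the mismatch does not raise an error here, a documented, explicit encoding argument on both functions makes it easier to keep writer and reader in agreement.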

@jreback
Contributor

jreback commented Feb 9, 2016

@kawochen can you do a PR for this?

cldy pushed a commit to cldy/pandas that referenced this issue Feb 11, 2016
closes pandas-dev#12170

Author: Ka Wo Chen <kawoc@tepper.cmu.edu>

Closes pandas-dev#12277 from kawochen/API-12170 and squashes the following commits:

5adcf3b [Ka Wo Chen] API: to_msgpack and read_msgpack encoding defaults to utf-8