Python 3 writing to_csv file ignores encoding argument. #13068

Closed
graingert opened this issue May 3, 2016 · 32 comments · Fixed by #35129
Labels
Bug Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv Unicode Unicode strings
Milestone

Comments

@graingert
Contributor

graingert commented May 3, 2016

# is missing the UTF8 BOM (encoded with default encoding UTF8)
with open('path_to_f', 'w') as f:
    df.to_csv(f, encoding='utf-8-sig')

# is not missing the UTF8 BOM (encoded with passed encoding utf-8-sig)
df.to_csv('path_to_f', encoding='utf-8-sig')

I expect:

with open('path_to_f', 'w') as f:
    df.to_csv(f, encoding='utf-8-sig')

To crash with TypeError: write() argument must be str, not bytes

and I expect:

with open('path_to_f', 'wb') as f:
    df.to_csv(f, encoding='utf-8-sig')

To write the file correctly.

Copy pasta

#!/usr/bin/env python3
import pandas as pd
df = pd.DataFrame()
with open('file_one', 'w') as f:
    df.to_csv(f, encoding='utf-8-sig')

assert open('file_one', 'rb').read() == b'""\n'

# is not missing the UTF8 BOM (encoded with passed encoding utf-8-sig)
df.to_csv('file_two', encoding='utf-8-sig')
assert open('file_two', 'rb').read() == b'\xef\xbb\xbf""\n'
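
For illustration only, here is a standard-library sketch (no pandas involved) of why the first file above ends up without a BOM: in Python 3 the encoding belongs to the text handle and is fixed when the handle is opened, so anything writing `str` through it cannot change it afterwards. The helper name `written_bytes` is hypothetical.

```python
import os
import tempfile


def written_bytes(encoding):
    """Write '""\\n' through a text handle opened with `encoding`; return the raw bytes."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        # newline="" avoids platform-specific newline translation.
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write('""\n')
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


plain = written_bytes("utf-8")         # no BOM
with_bom = written_bytes("utf-8-sig")  # BOM written by the handle's codec
```

The BOM appears only when the handle itself was opened with `utf-8-sig`, which matches the behaviour reported for `file_one` versus `file_two`.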
@jreback
Contributor

jreback commented May 3, 2016

you would have to show a reproducible example. what does this have to do with excel? you are reporting a csv issue, no? excel being able to read something doesn't prove (or disprove) anything.

@jreback
Contributor

jreback commented May 3, 2016

furthermore show pd.show_versions(), a sample of the frame, df.info() as well.

@graingert
Contributor Author

@jreback updated issue to remove Excel problem

@jreback
Contributor

jreback commented May 3, 2016

so will still need a copy-pastable example.

@jreback jreback added Unicode Unicode strings IO CSV read_csv, to_csv labels May 3, 2016
@graingert
Contributor Author

>>> pandas.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-21-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.18.0
nose: None
pip: 8.1.1
setuptools: 21.0.0
Cython: None
numpy: 1.11.0
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None

@graingert
Contributor Author

@jreback updated with copy pastable

@graingert
Contributor Author

This looks to be a design flaw in all "io" outputs that take encodings and file objects on Python 3.

@jreback
Contributor

jreback commented May 3, 2016

Python 3.5.1 |Continuum Analytics, Inc.| (default, Dec  7 2015, 11:24:55) 

In [1]: 

In [1]: df = pd.DataFrame()

In [2]: with open('file_one', 'w') as f:
   ...:         df.to_csv(f, encoding='utf-8-sig')
   ...:     

In [3]: assert open('file_one', 'rb').read() == b'""\n'

In [4]: 

In [4]: # is not missing the UTF8 BOM (encoded with passed encoding utf-8-sig)

In [5]: df.to_csv('file_two', encoding='utf-8-sig')

In [6]: assert open('file_two', 'rb').read() == b'\xef\xbb\xbf""\n'

In [7]: pd.__version__
Out[7]: '0.18.1'

what's the problem?

@jreback
Contributor

jreback commented May 3, 2016

works on 0.18.0 as well.

@graingert
Contributor Author

The first call ignores the encoding... The first assert should fail

@jreback
Contributor

jreback commented May 3, 2016

hmm, you are opening it in text mode. Not really sure whether a stream indicates if it's text or binary. I don't know that this is a bug on the pandas side. Can you repro without pandas?
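
On the question of whether a stream reveals its mode: a minimal sketch, assuming the handle comes from the standard `io` hierarchy, is to check its base class. `stream_kind` is a hypothetical helper, not a pandas function, and third-party file-like objects may not fit these checks.

```python
import io


def stream_kind(handle):
    """Classify a file-like object as 'text', 'binary', or 'unknown'."""
    if isinstance(handle, io.TextIOBase):
        return "text"
    if isinstance(handle, (io.RawIOBase, io.BufferedIOBase)):
        return "binary"
    # Fall back to the mode string that some file objects expose.
    mode = getattr(handle, "mode", "")
    if isinstance(mode, str) and "b" in mode:
        return "binary"
    return "unknown"


text_kind = stream_kind(io.StringIO())
binary_kind = stream_kind(io.BytesIO())

# A text wrapper also records the encoding it was created with.
wrapped = io.TextIOWrapper(io.BytesIO(), encoding="utf-8-sig")
wrapped_encoding = wrapped.encoding
```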

@graingert
Contributor Author

If I open the file in binary mode, pandas tries to write str to the file and crashes

@jreback
Contributor

jreback commented May 3, 2016

can you show what happens? e.g. that would be the test

@graingert
Contributor Author

graingert commented May 3, 2016

#!/usr/bin/env python3
import pandas as pd
df = pd.DataFrame()
# Binary mode: pandas tries to write str to the file
with open('file_one', 'wb') as f:
    df.to_csv(f, encoding='utf-8-sig')
# *** crash ***


@jreback
Contributor

jreback commented May 3, 2016

ahh I see now. ok, it prob needs to be opened with a codec, so when the stream is created it should be inserted there. since you are familiar, want to do a PR?
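
One way to read "opened with a codec" is to wrap the binary handle in a text layer that carries the requested encoding. The sketch below uses only the standard library; `write_encoded` is a hypothetical helper, and this is not a claim about how pandas implemented the eventual fix.

```python
import io


def write_encoded(binary_handle, text, encoding):
    """Write `text` to a binary handle using `encoding`, leaving the handle open."""
    wrapper = io.TextIOWrapper(binary_handle, encoding=encoding, newline="")
    try:
        wrapper.write(text)
        wrapper.flush()
    finally:
        # Detach so that discarding the wrapper does not close the caller's handle.
        wrapper.detach()


latin1_buf = io.BytesIO()
write_encoded(latin1_buf, "a,é\n", "latin1")

bom_buf = io.BytesIO()
write_encoded(bom_buf, '""\n', "utf-8-sig")
```

With this approach the caller passes a binary handle plus an encoding, and the text layer does the encoding, including the BOM for `utf-8-sig`.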

@jreback
Contributor

jreback commented May 3, 2016

looks really similar to #9712

@graingert
Contributor Author

graingert commented May 3, 2016

@jaidevd what's the desired behaviour? Crash on passing a Unicode writer, or deprecate the encoding keyword argument in favour of passing Unicode writers only?

@jreback
Contributor

jreback commented May 3, 2016

no i think I would raise a more informative message. If a user wants to pass a non-compat stream (and we can't do anything with it), then must raise. most usage does not pass a stream when writing.

@graingert
Contributor Author

@jreback so crash if a unicode accepting stream is passed, and raise an informative error.

@jreback
Contributor

jreback commented May 3, 2016

well if passed a non unicode accepting stream when an encoding is passed I guess. I don't think there is a way to fix it? raising an exception that is helpful is just fine.

@jreback jreback added the Error Reporting Incorrect or improved errors from pandas label May 3, 2016
@graingert
Contributor Author

@jreback we need the to_csv and related functions to support: either binary file objects and the encoding argument; or unicode objects without the encoding argument.

@jreback
Contributor

jreback commented May 3, 2016

I thought that's what I said.

@graingert
Contributor Author

@jreback I thought you meant the status quo: supporting neither binary file objects with the encoding argument nor unicode objects without the encoding argument, but with better exceptions.

@jreback
Contributor

jreback commented May 3, 2016

oh you are saying 2 issues. I didn't really look too closely. I am all for writing things correctly, or raising if it's incorrect. As I said I suspect we have very little testing on writing unicode with streams now (maybe no tests), esp with alternate encodings. This is quite uncommon.

Would be ok with complete tests and write if possible, raising if not.
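
As a rough sketch of the "raise if not possible" half of this, the check below rejects an `encoding` argument that conflicts with an already-open text handle. `validate_encoding` and its error message are hypothetical, and the behaviour pandas finally adopted may differ.

```python
import codecs
import io


def validate_encoding(handle, encoding):
    """Raise ValueError if `encoding` conflicts with a text handle's own encoding."""
    if encoding is None or not isinstance(handle, io.TextIOBase):
        return  # binary handles can be encoded by the writer
    handle_encoding = getattr(handle, "encoding", None)
    if handle_encoding is None:
        return  # e.g. io.StringIO has no byte encoding
    # Normalise aliases such as "UTF8" and "utf-8" before comparing.
    if codecs.lookup(encoding).name != codecs.lookup(handle_encoding).name:
        raise ValueError(
            f"encoding={encoding!r} cannot be applied: the stream is a text "
            f"handle already using {handle_encoding!r}. Open it in binary "
            "mode or with the desired encoding."
        )


handle = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
validate_encoding(handle, "UTF8")          # same codec, accepted
validate_encoding(io.BytesIO(), "latin1")  # binary handle, accepted

try:
    validate_encoding(handle, "utf-8-sig")
    conflict_detected = False
except ValueError:
    conflict_detected = True
```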

@graingert
Contributor Author

so always write bytes, regardless of Python version. With nice exceptions when writing to unicode streams.

@jreback
Contributor

jreback commented May 3, 2016

no, I believe the existing impl in py2 is correct. Write out tests for all cases, test them under both versions and you will have the answer.

@JulieRossi

Hi!
I am also having trouble with Python 3 to_csv ignoring the encoding argument when writing a file.

To be more specific, the problem comes from the following code (modified to focus on the problem and be copy pastable):

df = pd.DataFrame([['a', 'é']])
with open('df_to_csv_utf8.csv', 'w') as f:
    df.to_csv(f, index=False, encoding='utf-8')
with open('df_to_csv_latin1.csv', 'w') as f:
    df.to_csv(f, index=False, encoding='latin1')

If run with python2, I actually get two files with different encodings:

>>> magic.from_file('df_to_csv_utf8.csv')
UTF-8 Unicode text
>>> magic.from_file('df_to_csv_latin1.csv')
ISO-8859 text

But with python3, they both are utf-8 encoded:

>>> magic.from_file('df_to_csv_utf8.csv')
UTF-8 Unicode text
>>> magic.from_file('df_to_csv_latin1.csv')
UTF-8 Unicode text

I know magic only guesses the encoding, but this seemed a clear way of showing the difference.

A better proof is to try to decode the text written in the files. Using python codecs module, you get:

python2:

>>> with codecs.open('df_to_csv_latin1.csv', encoding='latin1') as f:
...     print f.read()
0,1
a,é

python3:

>>> with codecs.open('df_to_csv_latin1.csv', encoding='latin1') as f:
...     print(f.read())
0,1
a,Ã©

For the record, using LibreOffice calc to try to open both files gives the same result: the file written with python3 using latin1 encoding cannot be opened properly when you specify latin1 encoding, it must be opened with utf-8 encoding to be displayed correctly.
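
The decoding check above can be reproduced without any files. This small sketch shows what happens when UTF-8 bytes are read back as latin1, which is the situation described for the Python 3 output.

```python
text = "a,é\n"

utf8_bytes = text.encode("utf-8")      # what Python 3 actually wrote here
latin1_bytes = text.encode("latin1")   # what encoding='latin1' was meant to produce

# Reading UTF-8 bytes as latin1 turns the two-byte 'é' into two characters.
misread = utf8_bytes.decode("latin1")
roundtrip = latin1_bytes.decode("latin1")
```

Here `misread` contains `Ã©` in place of `é`, while `roundtrip` matches the original text.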

@graingert @jreback have you progressed on this subject ?

For information, I use:

python2                         python3
commit: None                    commit: None
python: 2.7.13.final.0          python: 3.5.3.final.0
python-bits: 64                 python-bits: 64
OS: Linux                       OS: Linux
OS-release: 4.10.0-33-generic   OS-release: 4.10.0-33-generic
machine: x86_64                 machine: x86_64
processor: x86_64               processor: x86_64
byteorder: little               byteorder: little
LC_ALL: None                    LC_ALL: None
LANG: en_US.UTF-8               LANG: en_US.UTF-8
LOCALE: None.None               LOCALE: en_US.UTF-8
pandas: 0.20.3                  pandas: 0.20.3
pytest: None                    pytest: None
pip: 9.0.1                      pip: 9.0.1
setuptools: 36.4.0              setuptools: 36.4.0
Cython: None                    Cython: None
numpy: 1.13.1                   numpy: 1.13.1
scipy: None                     scipy: None
xarray: None                    xarray: None
IPython: 5.4.1                  IPython: 6.1.0
sphinx: None                    sphinx: None
patsy: None                     patsy: None
dateutil: 2.6.1                 dateutil: 2.6.1
pytz: 2017.2                    pytz: 2017.2
blosc: None                     blosc: None
bottleneck: None                bottleneck: None
tables: None                    tables: None
numexpr: None                   numexpr: None
feather: None                   feather: None
matplotlib: None                matplotlib: None
openpyxl: None                  openpyxl: None
xlrd: None                      xlrd: None
xlwt: None                      xlwt: None
xlsxwriter: None                xlsxwriter: None
lxml: None                      lxml: None
bs4: None                       bs4: None
html5lib: 0.999999999           html5lib: 0.999999999
sqlalchemy: None                sqlalchemy: None
pymysql: None                   pymysql: None
psycopg2: None                  psycopg2: None
jinja2: 2.9.6                   jinja2: 2.9.6
s3fs: None                      s3fs: None
pandas_gbq: None                pandas_gbq: None
pandas_datareader: None         pandas_datareader: None

@watercrossing
Contributor

I am confused by this. You are doing
with open('df_to_csv_latin1.csv', 'w') as f:
which sets the encoding of f to your system's default encoding, which on your machine is likely to be UTF-8 (check with locale.getpreferredencoding()).

If you want to write in latin1, why don't you just open the file in latin1?

df = pd.DataFrame([['a', 'é']])
with open('df_to_csv_utf8.csv', 'w', encoding='utf-8') as f:
    df.to_csv(f, index=False)
with open('df_to_csv_latin1.csv', 'w', encoding='latin1') as f:
    df.to_csv(f, index=False)

with open('df_to_csv_latin1.csv', encoding='latin1') as f:
    print(f.read())

prints

0,1
a,é

as expected.
(Hint: if you are working with Excel, you may want to write as utf-8-sig to write the UTF-8 BOM so that Excel actually knows it's UTF-8.)

@JulieRossi

Hi @watercrossing
I cannot specify the encoding when opening the file because I receive an already opened file descriptor.
Otherwise, this would indeed be the solution.

My point is that you can specify an encoding in to_csv method but it is not taken into account (which was not the case with python 2).

There may be no solution but it is confusing: you can set an option that is (quietly) not used.
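
If the handle you receive is an `io.TextIOWrapper` (which is what `open(..., 'w')` returns), one possible workaround, assuming Python 3.7 or later, is to change its encoding in place with `reconfigure` before writing. This is only a sketch: `reencode_text_handle` is a hypothetical helper, and it will not work for handles such as `io.StringIO` that have no underlying byte stream.

```python
import io


def reencode_text_handle(text_handle, encoding):
    """Switch an already-open TextIOWrapper to `encoding` (Python 3.7+)."""
    text_handle.flush()  # push out anything written under the old encoding
    text_handle.reconfigure(encoding=encoding)
    return text_handle


# Stand-in for a handle that someone else opened with the default encoding.
raw = io.BytesIO()
handle = io.TextIOWrapper(raw, encoding="utf-8", newline="")

reencode_text_handle(handle, "latin1")
handle.write("a,é\n")
handle.flush()
```

After the call, text written through the handle is encoded as latin1, so the bytes on disk no longer depend on how the handle was originally opened.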

@graingert
Contributor Author

@anuraagmadhavRacherla please don't hijack unrelated issues. This question would be better suited for Stack Overflow. Also you've stripped all the useful debugging information from your exception (the stack trace).

@shm007g

shm007g commented Jun 26, 2018

data_frame.to_excel will need 'wb' and to_csv will need 'w' for my part.

pandas==0.22.0

@Enteee

Enteee commented May 13, 2020

Hi folks, I wrote an article on my blog on how to Support Binary File Objects with pandas.DataFrame.to_csv. At the end of the article I added a monkey patch I think can also be used as a work around for this problem. Hope this helps until this is resolved in pandas.

@jreback jreback added this to the 1.2 milestone Aug 7, 2020
7 participants