to_csv failing with encoding='utf-16' #21118

Closed
lgonzalezsa opened this Issue May 18, 2018 · 3 comments

lgonzalezsa commented May 18, 2018

Code Sample:

df.to_csv('test.gz', sep='~', header=False, index=False, compression='gzip', line_terminator='\r\n', encoding='utf-16', na_rep='')

/opt/anaconda/lib/python3.6/encodings/ascii.py in decode(self, input, final)
24 class IncrementalDecoder(codecs.IncrementalDecoder):
25 def decode(self, input, final=False):
---> 26 return codecs.ascii_decode(input, self.errors)[0]
27
28 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
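
The byte 0xff at position 0 is what one would expect if UTF-16 output were being decoded as ASCII: on a little-endian machine, Python's utf-16 codec starts its output with the byte order mark FF FE. The minimal sketch below reproduces the same kind of message using only the standard library. The sample string is made up for illustration, and the sketch only shows the symptom; it makes no claim about where inside pandas the decode happens.

```python
# Encode some text as UTF-16; Python prepends a byte order mark (BOM).
payload = "1.0~2.0\r\n".encode("utf-16")

# The first two bytes are the BOM (FF FE on a little-endian machine).
bom = payload[:2]

# Decoding those bytes as ASCII fails on the very first byte.
try:
    payload.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(bom.hex())
print(message)
```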

Problem description

First of all, a big thank you for supporting pandas; my life is easier and more fun with pandas in the toolkit.
In the previous version, 0.22, we were able to call to_csv with encoding='utf-16' to handle Japanese and Chinese content, among others, properly. We need the UTF-16 encoding for later steps such as uploading the data to MSSQL Server in bulk mode.

I would like to know whether there is a workaround that lets us keep using utf-16.

Any other suggestions are welcome.
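
One possible workaround, sketched under assumptions rather than confirmed by the maintainers in this thread: have pandas produce the CSV as an in-memory string, then do the UTF-16 encoding and gzip compression with the standard library. The helper name `write_utf16_gzip` and the sample text are hypothetical and not part of pandas.

```python
import gzip
import os
import tempfile


def write_utf16_gzip(csv_text, path):
    """Encode CSV text as UTF-16 (with BOM) and write it gzip-compressed."""
    # Normalise line endings to CRLF. Assumption: no field contains an
    # embedded newline, since those would be rewritten as well.
    normalised = csv_text.replace("\r\n", "\n").replace("\n", "\r\n")
    with gzip.open(path, "wb") as handle:
        handle.write(normalised.encode("utf-16"))


# Stand-in for the string pandas would return (hypothetical sample data).
sample = "5.1~3.5~日本語\n4.9~3.0~中文\n"
out_path = os.path.join(tempfile.mkdtemp(), "test.gz")
write_utf16_gzip(sample, out_path)

# Read the file back to confirm the round trip.
with gzip.open(out_path, "rb") as handle:
    round_trip = handle.read().decode("utf-16")
```

With a DataFrame, the text could come from `df.to_csv(sep='~', header=False, index=False, na_rep='')`, which returns a string when no path is given.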

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.114-42-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: POSIX
LOCALE: None.None

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.2
scipy: 1.1.0
pyarrow: 0.9.0
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Member

WillAyd commented May 18, 2018

Can you please post a reproducible example? I tried locally with the code below on master and it worked fine:

import io
import pandas as pd

buf = io.BytesIO(b'\xff\x34')
df = pd.read_csv(buf, encoding='utf16')

outbuf = io.StringIO()
df.to_csv(outbuf, encoding='utf-16')

Author

lgonzalezsa commented May 18, 2018

Note that my code is not failing because of the data content.
Here is an example using the sklearn datasets:

import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Combine the features and the target into a single DataFrame
data1 = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                     columns=iris['feature_names'] + ['target'])

# Raises UnicodeDecodeError when compression is set
data1.to_csv('test.gz',
             sep='~',
             header=False, index=False,
             compression='gzip',
             line_terminator='\r\n',
             encoding='utf-16',
             na_rep='')

If I pass the compression parameter, I get the UnicodeDecodeError; without it, the call runs properly.

WillAyd added the Regression label May 19, 2018

WillAyd added this to the 0.23.1 milestone May 19, 2018

WillAyd removed the Needs Info label May 20, 2018

minggli referenced this issue Jun 3, 2018 in a merged pull request: BUG: encoding error in to_csv compression #21300 (4 of 4 tasks complete)

Contributor

minggli commented Jun 4, 2018

I tested your iris dataset example; this problem should go away with this PR.
