
csv_read() fails on properly decoding latin-1(i.e. non utf8) encoded file from URL #10424

Closed
BotoKopo opened this issue Jun 24, 2015 · 4 comments · Fixed by #35742
Labels
Bug · IO CSV (read_csv, to_csv) · IO Network (Local or Cloud (AWS, GCS, etc.) IO issues) · Unicode (Unicode strings)
Comments

@BotoKopo

Problem

Here is a problem a colleague and I ran into, working on data available on an FTP (or HTTP) server (internal network; we're sorry we can't point to a proper example file).

Reading a CSV file (with read_csv) encoded in something other than UTF-8 (such as latin-1), with a special character in the header, fails to properly decode the header when the file is accessed through a URL (http or ftp), but not when the file is local, nor when it is a UTF-8 file (local or remote). The result looks like the file was decoded twice.

An example should make this clearer.

Let's say we have 2 CSV files (on a distant server), data.latin1.csv and data.utf8.csv, encoded in latin-1 and utf-8, and both containing :

a,b°
1.1,2.2

Then the following code:

import pandas as pd

path = "ftp://sorry/I/cant/supply/such/a/path/for/the/example/data.encoding.csv"

for enc in ('latin1', 'utf8'):
    f = path.replace('encoding', enc)
    data = pd.read_csv(f, encoding=enc)
    print("encoding {0} : non-ascii={1} , length={2}".format(
        enc, data.columns[1].encode('utf8'), len(data.columns[1])))

will give:

encoding latin1 : non-ascii=bÂ° , length=3
encoding utf8 : non-ascii=b° , length=2

This was tested with Python 2.7.6 + pandas 0.13.1 and Python 3.4.0 + pandas 0.15.2, with the same result.

The same action on local files gives the appropriate result, i.e. matching the 'utf8' output above (this REALLY IS a matter of URL + latin1, or anything but utf-8). It looks like the data was decoded twice: the latin-1 byte for '°' is treated as a "normal" character and converted to UTF-8, which is why the reported length grows.
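To make the arithmetic concrete, here is a minimal sketch (plain Python, no pandas) of the suspected double decoding, using the header byte from the example above:

```python
# The non-ascii header cell as it sits on the server, encoded in latin-1.
raw = "b°".encode("latin1")                  # b'b\xb0'

# Decoding once, as requested, gives the correct 2-character string.
once = raw.decode("latin1")
print(once, len(once))                        # b° 2

# Decoding once, re-encoding as UTF-8, then decoding again as latin-1
# reproduces the corrupted 3-character header seen in the output above.
twice = raw.decode("latin1").encode("utf8").decode("latin1")
print(twice, len(twice))                      # bÂ° 3
```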

This test raises an error ("UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position 3: ordinal not in range(128)") when the python engine is used for read_csv().

In pandas' code

Now, having a look at pandas' code, I would focus on two points in pandas.io.parsers:

  • when the file is a URL, the data is opened through urllib (or urllib2), then read, decoded (according to the requested encoding), and the result is fed into a StringIO stream (cf. pandas.io.common.maybe_read_encoded_stream()),
  • as far as I could trace it, the file also seems to be decoded later, especially for the 'c' engine in the pandas.io.parsers.CParserWrapper.read() method (in fact by _parser.read() at the end, which is the C parser).

This would explain the double decoding when the file is a URL, and the normal single decoding when the file is local.

Furthermore, in pandas.io.common, when replacing (in the maybe_read_encoded_stream() function):

from pandas.compat import StringIO
...
reader = StringIO(reader.read().decode(encoding, errors))

by :

from pandas.compat import StringIO, BytesIO
...
reader = BytesIO(reader.read())

this problem seems to be solved (which is logical when we look at what StringIO/BytesIO point to depending on the Python version, and which data they handle).
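A rough sketch of why the swap matters — this only illustrates the decode-twice effect, it is not pandas' actual parser internals:

```python
import io

encoding = "latin1"
raw = "a,b°\n1.1,2.2\n".encode(encoding)   # the bytes as served over ftp/http

# Old behaviour: maybe_read_encoded_stream() decodes up front into a StringIO.
# If the parser then applies the encoding again downstream, the header breaks.
text = io.StringIO(raw.decode(encoding)).read()
header_twice = text.splitlines()[0].encode("utf8").decode(encoding)
print(header_twice)   # a,bÂ°  -- decoded twice

# Proposed fix: pass the raw bytes through in a BytesIO, decode exactly once.
data = io.BytesIO(raw).read()
header_once = data.decode(encoding).splitlines()[0]
print(header_once)    # a,b°
```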

So it seems to me that the problem is located at that point, and that this is a bug.
However, it could be a feature ;-) as I don't know whether there could be side effects in cases other than the one discussed here, especially if StringIO was intentionally used for a purpose I can't figure out.

@shoyer
Member

shoyer commented Jun 24, 2015

Why don't you give this change a try and see if it breaks anything else? See here for instructions: http://pandas.pydata.org/pandas-docs/stable/contributing.html

@BotoKopo
Author

Well, I'm not quite used to contributing here -- in fact, it's the first time on a project outside my job's context -- so I'm not sure about the contribution process.
So, just to be sure I understand correctly: you mean I should commit this change and open a pull request (sorry for the naive question)?

@shoyer
Member

shoyer commented Jun 25, 2015

If you install the pandas from git, you can run tests locally.

Another option is to commit the changes and issue a pull request. Then the tests will be run automatically by our continuous integration system (Travis CI).

BotoKopo added a commit to BotoKopo/pandas that referenced this issue Jul 8, 2015
@jreback jreback added Bug IO Data IO issues that don't fit into a more specific label Unicode Unicode strings labels Jul 8, 2015
@jreback jreback added this to the Next Major Release milestone Jul 8, 2015
@mjpieters

Another test case, and a workaround: use urllib.request to load the data yourself instead of leaving it to pandas:

import pandas as pd
import urllib.request

# Data encoded with CP1252, just one non-ASCII byte 0x92 == U+2019 RIGHT SINGLE QUOTATION MARK
url = "https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv"

df1 = pd.read_csv(url, sep=';', encoding='cp1252')

print(df1[' '][102])  # Korea, Dem. Peopleâ€™s Rep.  (mojibake)
print(df1[' '][102].encode('cp1252').decode('utf8'))  # Korea, Dem. People’s Rep.

with urllib.request.urlopen(url) as resp:
    df2 = pd.read_csv(resp, sep=";", encoding='cp1252')
print(df2[' '][102])  # Korea, Dem. People’s Rep.

Pandas seems to have decoded as CP1252 twice, with an intermediate UTF-8 encoding applied in between. You get the same garbled data when you manually decode as CP1252, re-encode as UTF-8, then decode once more as CP1252:

>>> b'Korea, Dem. People\x92s Rep.'.decode('cp1252')
'Korea, Dem. People’s Rep.'
>>> b'Korea, Dem. People\x92s Rep.'.decode('cp1252').encode('utf8').decode('cp1252')
'Korea, Dem. Peopleâ€™s Rep.'

Passing in a file-like object from urllib.request neatly sidesteps the issue. Perhaps pandas is getting confused by the text/plain; charset=utf-8 Content-Type header?
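If a layer underneath read_csv honoured the response's declared charset instead of the caller's encoding argument, that would supply the intermediate UTF-8 step. As a hypothetical illustration (the header value is assumed, modelled on what raw.githubusercontent.com serves), this is how such a Content-Type header parses with the standard library:

```python
from email.message import Message

# Hypothetical Content-Type header, as a static site might serve for any
# text file regardless of its real encoding.
msg = Message()
msg["Content-Type"] = "text/plain; charset=utf-8"
print(msg.get_content_type())     # text/plain
print(msg.get_content_charset())  # utf-8 -- wrong for a CP1252-encoded file
```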

@jbrockmendel jbrockmendel added the IO CSV read_csv, to_csv label Jul 25, 2018
@jbrockmendel jbrockmendel removed the IO Data IO issues that don't fit into a more specific label label Dec 1, 2019
@mroeschke mroeschke added the IO Network Local or Cloud (AWS, GCS, etc.) IO Issues label Apr 14, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.2 Aug 17, 2020