
csv_read() fails on properly decoding latin-1(i.e. non utf8) encoded file from URL #10424

Closed
BotoKopo opened this issue Jun 24, 2015 · 4 comments · Fixed by #35742
Labels
Bug · IO CSV (read_csv, to_csv) · IO Network (Local or Cloud (AWS, GCS, etc.) IO issues) · Unicode (Unicode strings)
Comments

@BotoKopo

Problem

Here is a problem a colleague and I ran into, working on data available on an FTP (or HTTP) server (internal network; we're sorry we can't point to a proper example file).

Reading a CSV file (with read_csv) encoded in something other than UTF-8 (such as latin-1), with a special character in the header, fails to properly decode the header when the file is accessed through a URL (http or ftp), but not when the file is local, nor when it is a UTF-8 file (local or remote). The result looks like the file was decoded twice.

An example should make this clearer.

Let's say we have 2 CSV files (on a distant server), data.latin1.csv and data.utf8.csv, encoded in latin-1 and utf-8, and both containing :

a,b°
1.1,2.2

Then the following code:

import pandas as pd

path = "ftp://sorry/I/cant/supply/such/a/path/for/the/example/data.encoding.csv"

for enc in ('latin1', 'utf8'):
    f = path.replace('encoding', enc)
    data = pd.read_csv(f, encoding=enc)
    print("encoding {0} : non-ascii={1} , length={2}".format(
        enc, data.columns[1].encode('utf8'), len(data.columns[1])))

will give:

encoding latin1 : non-ascii=bÂ° , length=3
encoding utf8 : non-ascii=b° , length=2

This was tested with Python 2.7.6 + pandas 0.13.1 and Python 3.4.0 + pandas 0.15.2, with the same result.

The same action on local files gives the appropriate result, i.e. matching the 'utf8' output above (this REALLY IS a matter of URL + latin1, or anything but utf-8). It looks like the data was decoded twice: the latin-1 byte for '°' is treated as a "normal" character and converted to UTF-8, which is why the reported length grows.
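To make the arithmetic concrete, here is a minimal sketch (plain Python, no pandas) of the suspected double decoding, using the header byte from the example above:

```python
# The non-ascii header cell as it sits on the server, encoded in latin-1.
raw = "b°".encode("latin1")                  # b'b\xb0'

# Decoding once, as requested, gives the correct 2-character string.
once = raw.decode("latin1")
print(once, len(once))                        # b° 2

# Decoding once, re-encoding as UTF-8, then decoding again as latin-1
# reproduces the corrupted 3-character header seen in the output above.
twice = raw.decode("latin1").encode("utf8").decode("latin1")
print(twice, len(twice))                      # bÂ° 3
```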

This test raises an error ("UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position 3: ordinal not in range(128)") when the python engine is used for read_csv().

In pandas' code

Now, having a look at pandas' code, I would focus on two points in pandas.io.parsers:

  • when the file is a URL, the data is opened through urllib (or urllib2), then read, decoded (according to the requested encoding), and the result is fed into a StringIO stream (cf. pandas.io.common.maybe_read_encoded_stream()),
  • as far as I could trace it, the file also seems to be decoded later, especially for the 'c' engine in the pandas.io.parsers.CParserWrapper.read() method (in fact by _parser.read() at the end, which is the C parser).

This would explain the double decoding when the file is a URL, and the normal single decoding when the file is local.

Furthermore, in pandas.io.common, when replacing (in the maybe_read_encoded_stream() function):

from pandas.compat import StringIO
...
reader = StringIO(reader.read().decode(encoding, errors))

by :

from pandas.compat import StringIO, BytesIO
...
reader = BytesIO(reader.read())

this problem seems to be solved (which is logical when we look at what StringIO/BytesIO point to depending on the Python version, and which data they handle).
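A rough sketch of why the swap matters — this only illustrates the decode-twice effect, it is not pandas' actual parser internals:

```python
import io

encoding = "latin1"
raw = "a,b°\n1.1,2.2\n".encode(encoding)   # the bytes as served over ftp/http

# Old behaviour: maybe_read_encoded_stream() decodes up front into a StringIO.
# If the parser then applies the encoding again downstream, the header breaks.
text = io.StringIO(raw.decode(encoding)).read()
header_twice = text.splitlines()[0].encode("utf8").decode(encoding)
print(header_twice)   # a,bÂ°  -- decoded twice

# Proposed fix: pass the raw bytes through in a BytesIO, decode exactly once.
data = io.BytesIO(raw).read()
header_once = data.decode(encoding).splitlines()[0]
print(header_once)    # a,b°
```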

So it seems to me that the problem is located at that point, and that this is a bug.
However, it could be a feature ;-) as I don't know whether there could be side effects in cases other than the one discussed here, especially if StringIO was intentionally used for a purpose I can't figure out.

@shoyer
Member

shoyer commented Jun 24, 2015

Why don't you give this change a try and see if it breaks anything else? See here for instructions: http://pandas.pydata.org/pandas-docs/stable/contributing.html

@BotoKopo
Author

Well, I'm not quite used to contributing here -- in fact, it's the first time on a project outside my job's context -- so I'm not sure about the contribution process.
So, just to be sure I understand correctly: you mean I should commit this change and open a pull request (sorry for the naive question)?

@shoyer
Member

shoyer commented Jun 25, 2015

If you install the pandas from git, you can run tests locally.

Another option is to commit the changes and issue a pull request. Then the tests will be run automatically by our continuous integration system (Travis CI).

BotoKopo added a commit to BotoKopo/pandas that referenced this issue Jul 8, 2015
@jreback jreback added Bug IO Data IO issues that don't fit into a more specific label Unicode Unicode strings labels Jul 8, 2015
@jreback jreback added this to the Next Major Release milestone Jul 8, 2015
@mjpieters

Another test case, and a workaround: use urllib.request to load the data yourself instead of leaving it to pandas:

import pandas as pd
import urllib.request

# Data encoded with CP1252, just one non-ASCII byte 0x92 == U+2019 RIGHT SINGLE QUOTATION MARK
url = "https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv"

df1 = pd.read_csv(url, sep=';', encoding='cp1252')

print(df1[' '][102])  # Korea, Dem. Peopleâ€™s Rep.  (mojibake)
print(df1[' '][102].encode('cp1252').decode('utf8'))  # Korea, Dem. People’s Rep.

with urllib.request.urlopen(url) as resp:
    df2 = pd.read_csv(resp, sep=";", encoding='cp1252')
print(df2[' '][102])  # Korea, Dem. People’s Rep.

Pandas seems to have decoded as CP1252 twice, with an intermediate UTF-8 encoding applied in between. You get the same garbled data when you manually decode as CP1252, re-encode as UTF-8, then decode once more as CP1252:

>>> b'Korea, Dem. People\x92s Rep.'.decode('cp1252')
'Korea, Dem. People’s Rep.'
>>> b'Korea, Dem. People\x92s Rep.'.decode('cp1252').encode('utf8').decode('cp1252')
'Korea, Dem. Peopleâ€™s Rep.'

Passing in a file-like object from urllib.request neatly sidesteps the issue. Perhaps pandas is getting confused by the text/plain; charset=utf-8 Content-Type header?
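If a layer underneath read_csv honoured the response's declared charset instead of the caller's encoding argument, that would supply the intermediate UTF-8 step. As a hypothetical illustration (the header value is assumed, modelled on what raw.githubusercontent.com serves), this is how such a Content-Type header parses with the standard library:

```python
from email.message import Message

# Hypothetical Content-Type header, as a static site might serve for any
# text file regardless of its real encoding.
msg = Message()
msg["Content-Type"] = "text/plain; charset=utf-8"
print(msg.get_content_type())     # text/plain
print(msg.get_content_charset())  # utf-8 -- wrong for a CP1252-encoded file
```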

@jbrockmendel jbrockmendel added the IO CSV read_csv, to_csv label Jul 25, 2018
@jbrockmendel jbrockmendel removed the IO Data IO issues that don't fit into a more specific label label Dec 1, 2019
@mroeschke mroeschke added the IO Network Local or Cloud (AWS, GCS, etc.) IO Issues label Apr 14, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.2 Aug 17, 2020