pandas 1.0.1 read_csv() is broken for some file-like objects #31819

sasanquaneuf · 2020-02-09T11:33:52Z

Code Sample

import os
import pandas
import tempfile
import traceback

# pandas.show_versions()

fname = ''
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write('てすと\nこむ'.encode('shift-jis'))
    f.seek(0)
    fname = f.name

    try:
        result = pandas.read_csv(f, encoding='shift-jis')
        print('read shift-jis')
        print(result)

    except Exception as e:
        print(e)
        print(traceback.format_exc())

os.unlink(fname)

Problem description

With pandas 1.0.1, this sample does not work. With pandas 0.25.3, it works fine.
As stated in issue #31575, the encoding is ignored for a file-like object when it is not an instance of io.BufferedIOBase or io.RawIOBase.
However, some file-like objects do NOT inherit from either of them, although the "actual" inner object does.
In this code sample, according to the CPython implementation, the wrapper keeps the real file as an attribute (self.file = file), and its __getattr__() returns the file's attributes as its own.
So the code does not work. The same problem occurs with other file-like objects, for example the tempfile.*File classes, werkzeug's FileStorage class, and so on.
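
The mismatch described above can be checked with the standard library alone. This is a minimal sketch, assuming the default binary mode of NamedTemporaryFile; the variable names are only illustrative.

```python
import io
import tempfile

with tempfile.NamedTemporaryFile() as f:
    # The object returned by NamedTemporaryFile is a thin wrapper, so an
    # isinstance() check against the io base classes does not match it.
    wrapper_is_buffered = isinstance(f, io.BufferedIOBase)

    # The real binary file lives in the `file` attribute of the wrapper.
    inner_is_buffered = isinstance(f.file, io.BufferedIOBase)

    # read() is still reachable on the wrapper through attribute delegation.
    wrapper_has_read = hasattr(f, "read")

print(wrapper_is_buffered, inner_is_buffered, wrapper_has_read)
```

An isinstance() test on the wrapper therefore misses an object that behaves like a binary file in every other respect.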

Note: I first noticed this problem when using pandas with a file posted through flask. The file-like object is an instance of werkzeug's FileStorage. I worked around the problem with the following code:

pandas.read_csv(request.files['file'].stream._file, encoding='shift-jis')

Expected Output

read shift-jis
  てすと
0  こむ

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.6.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.138-89.102.amzn1.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : ja_JP.UTF-8
LOCALE : ja_JP.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 9.0.3
setuptools : 36.2.7
Cython : None
pytest : 3.6.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.0.5
lxml.etree : 4.2.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : None
pandas_datareader: None
bs4 : 4.6.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.2.1
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 3.6.2
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.4
tables : None
tabulate : None
xarray : None
xlrd : 1.1.0
xlwt : None
xlsxwriter : 1.0.5
numba : None

jorisvandenbossche · 2020-02-10T20:31:05Z

Thanks for the report!

cc @paihu @gfyoung

gfyoung · 2020-02-10T21:05:36Z

I don't mind expanding support (since we used to have it before), but is there a generic way to check all of these? It seems they have varying interfaces.

Would our existing file-like check suffice?

paihu · 2020-02-11T05:27:00Z

I don't think it's enough to make sure it's a file-like object.
The main point is whether the file-like object's read() returns binary or str (or some other type?).
If it returns binary, the returned values have to be decoded with the given encoding. Currently, we use TextIOWrapper for that.
But I don't know how to check the return type of read() without calling read(). (In some cases, the file-like object is not seekable.)

just an idea.

if is_file_like(buffer):
    t = buffer.read(1)
    u = makebuffer(t, buffer)  # combine t & buffer, or seek buffer back
    if is_binary(t):
        r = io.TextIOWrapper(u)
    else:
        r = u
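
The read-then-recombine idea above could look roughly like this in runnable form. It is only a sketch: to_text_stream is a hypothetical helper, not a pandas function, and it reads the whole stream into memory to avoid needing seek(), which would not suit large inputs.

```python
import io


def to_text_stream(buffer, encoding):
    """Return a text stream, decoding only if buffer.read() yields bytes.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    first = buffer.read(1)   # probe the return type of read()
    rest = buffer.read()     # consume the remainder (no seek() required)
    if isinstance(first, (bytes, bytearray)):
        # Binary source: recombine the bytes and decode them lazily.
        return io.TextIOWrapper(io.BytesIO(first + rest), encoding=encoding)
    # Text source: already decoded, so the encoding is not applied.
    return io.StringIO(first + rest)


binary_source = io.BytesIO("てすと\nこむ".encode("shift-jis"))
text_source = io.StringIO("てすと\nこむ")

from_binary = to_text_stream(binary_source, "shift-jis").read()
from_text = to_text_stream(text_source, "shift-jis").read()
print(from_binary == from_text)
```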

Or consider a file-like object that returns binary if the encoding is set?

gfyoung · 2020-02-11T05:40:56Z

Alternatively, we could just drop the check for it being an instance of io.BufferedIOBase or RawIOBase, no? Perhaps wrap the TextIOWrapper instantiations with a try-except to provide an informative error message?

paihu · 2020-02-11T06:37:30Z

Sadly, TextIOWrapper instantiation does not raise an error if the file-like object's read() returns str.
If the read() of a file-like object returns a value other than binary, calling read() on the TextIOWrapper instance will raise an error (e.g. TypeError: a bytes-like object is required, not 'str').
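
A small sketch of this behavior, using io.StringIO as a stand-in for a file-like object whose read() returns str. The exact wording of the error message may differ between Python versions, so only the exception type is captured here.

```python
import io

# A source whose read() returns str rather than bytes.
str_source = io.StringIO("a,b\n1,2\n")

# Constructing the wrapper does not inspect what read() returns,
# so no error is raised at this point.
wrapped = io.TextIOWrapper(str_source, encoding="utf-8")

try:
    wrapped.read()
    read_error = None
except TypeError as exc:
    # The failure only shows up once data is actually read.
    read_error = exc

print(type(read_error).__name__)
```

This is why a try-except around the instantiation alone would not catch such objects.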

gfyoung · 2020-02-11T06:47:55Z

I see. Given how many different file-like objects we have to support, I think this is the best way to go for the time being. It's better that we restore existing file-likes that worked before and then re-evaluate our logic later to catch "invalid file-likes" (for lack of a better term).

sasanquaneuf · 2020-02-11T07:50:05Z

Just an idea: we might classify the object by whether it has an encoding attribute or not.

import os
import pandas
import tempfile

fname = ""
with tempfile.NamedTemporaryFile(delete=False, mode="w+", encoding="shift-jis") as f:
    f.write("てすと\nbar")
    fname = f.name
    print(hasattr(f, 'encoding'))  # True
print(fname)

try:
    with open(fname,mode="r", encoding="shift-jis") as f:
        print(hasattr(f, 'encoding'))  # True
        result = pandas.read_csv(f)
        print("read shift-jis")
        print(result)

    with open(fname,mode="r", encoding="shift-jis") as f:
        print(hasattr(f, 'encoding'))  # True
        result = pandas.read_csv(f,encoding="utf-8")
        print("open shift-jis file and read_csv with encoding: utf-8")
        print(result)

    with open(fname,mode="rb") as f:
        print(hasattr(f, 'encoding'))  # False
        result = pandas.read_csv(f,encoding="shift-jis")
        print("open binary with buffering and read_csv with encoding: shift-jis")
        print(result)

    with open(fname,mode="rb",buffering=0) as f:
        print(hasattr(f, 'encoding'))  # False
        result = pandas.read_csv(f,encoding="shift-jis")
        print("open binary without buffering and read_csv with encoding: shift-jis")
        print(result)
except Exception as e:
    print(e)

os.unlink(fname)

Note that BufferedIOBase does not have an encoding attribute.

Addendum:
If we cannot use hasattr() directly, we could use getattr() with try-except.

gfyoung · 2020-02-11T08:15:14Z

Just an idea: we might classify the object by whether it has an encoding attribute or not.

Not sure I follow you here. How would this help determine which files to pass through or not?

sasanquaneuf · 2020-02-11T08:47:11Z

Sorry, I was thinking about the following comment from @paihu:

The main point is whether the file-like object's read() returns binary or str (or some other type?).

TextIOBase has an encoding attribute.
https://docs.python.org/3/library/io.html#io.TextIOBase

RawIOBase and BufferedIOBase do not have an encoding attribute, because they handle binary data. So I think we can check the type of the return value without calling read().
Just an idea: change the condition

if self.encoding and isinstance(source, (io.BufferedIOBase, io.RawIOBase)):

to

if self.encoding and hasattr(source, 'read') and not hasattr(source, 'encoding'):

(or use getattr(source, 'read') and getattr(source, 'encoding') with try-except)
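
To illustrate how the proposed condition classifies a few common objects, here is a minimal sketch. should_wrap_for_decoding is a hypothetical name used only for this example and does not claim to be the check pandas ended up with.

```python
import io
import tempfile


def should_wrap_for_decoding(source, encoding):
    """Hypothetical stand-in for the proposed condition.

    True when an encoding was requested and the object looks like a
    binary file: it has read() but no `encoding` attribute.
    """
    return (
        bool(encoding)
        and hasattr(source, "read")
        and not hasattr(source, "encoding")
    )


# Binary temporary file: the wrapper delegates to a buffered binary file,
# which has no `encoding` attribute.
with tempfile.NamedTemporaryFile() as binary_tmp:
    wrap_binary_tmp = should_wrap_for_decoding(binary_tmp, "shift-jis")

# Text-mode temporary file: the inner file is a text stream with `encoding`.
with tempfile.NamedTemporaryFile(mode="w+", encoding="shift-jis") as text_tmp:
    wrap_text_tmp = should_wrap_for_decoding(text_tmp, "shift-jis")

wrap_bytesio = should_wrap_for_decoding(io.BytesIO(b""), "shift-jis")
wrap_stringio = should_wrap_for_decoding(io.StringIO(""), "shift-jis")

print(wrap_binary_tmp, wrap_text_tmp, wrap_bytesio, wrap_stringio)
```

One caveat of this approach: a custom binary file-like object that happens to expose an encoding attribute would be treated as text.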

gfyoung · 2020-02-11T09:21:42Z

@sasanquaneuf : You are more than welcome to give your suggestion a try!

EgorBEremeev · 2020-03-02T09:54:11Z

I'm not exactly sure, but please take a look at #32392. It looks like a similar and/or connected issue and may affect the possible solution.

TomAugspurger · 2020-03-06T17:51:03Z

We're hoping to release 1.0.2 soon. Have we reached an agreed behavior, and is anyone working on this?

gfyoung · 2020-03-06T18:51:29Z

@TomAugspurger : We haven't settled on an agreed-upon fix yet. I was hoping to get some more community input on this, but if not, I'll see what I can change here to fix the regression.

TomAugspurger · 2020-03-09T17:30:52Z

I'm roughly targeting Wednesday for the release so if you're able to get something together quickly it'd be welcome.

gfyoung · 2020-03-09T17:38:12Z

Sounds good. Let me see what I can put together. It may not be optimal, but if we can at least restore functionality, that would be good for this release.

Restores behavior down to the fact that the Python engine cannot handle NamedTemporaryFile. Closes pandas-dev#31819

@sasanquaneuf