Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pandas 1.0.1 read_csv() is broken for some file-like objects #31819

Closed
sasanquaneuf opened this issue Feb 9, 2020 · 22 comments · Fixed by #32577
Closed

pandas 1.0.1 read_csv() is broken for some file-like objects #31819

sasanquaneuf opened this issue Feb 9, 2020 · 22 comments · Fixed by #32577
Labels
IO CSV read_csv, to_csv Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@sasanquaneuf
Copy link

sasanquaneuf commented Feb 9, 2020

Code Sample

import os
import pandas
import tempfile
import traceback

# pandas.show_versions()

fname = ''
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write('てすと\nこむ'.encode('shift-jis'))
    f.seek(0)
    fname = f.name

    try:
        result = pandas.read_csv(f, encoding='shift-jis')
        print('read shift-jis')
        print(result)

    except Exception as e:
        print(e)
        print(traceback.format_exc())

os.unlink(fname)

Problem description

Pandas 1.0.1, this sample does not work. But pandas 0.25.3, this sample works fine.
As stated in issue #31575, the encode of file-like object is ignored when its class is not io.BufferedIOBase neither RawIOBase.
However, some file-like objects are NOT inherited one of them, although the "actual" inner object is one of them.
In this code sample case, according to the cpython implementation, they has file as their attribute self.file = file, and __getattr__() returns the file's attribute as their attribute.
So the code is not work. The identic problems are in other file-like objects, for example, tempfile.*File class, werkzeug's FileStorage class, and so on.

Note: I first recognized this problem with using pandas via flask's posted file. The file-like object is an instance of werkzeug's FileStorage. I avoided this problem with following code:

pandas.read_csv(request.files['file'].stream._file, encoding='shift-jis')

Expected Output

read shift-jis
  てすと
0  こむ

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.138-89.102.amzn1.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : ja_JP.UTF-8
LOCALE : ja_JP.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 9.0.3
setuptools : 36.2.7
Cython : None
pytest : 3.6.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.0.5
lxml.etree : 4.2.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : None
pandas_datareader: None
bs4 : 4.6.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.2.1
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 3.6.2
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.4
tables : None
tabulate : None
xarray : None
xlrd : 1.1.0
xlwt : None
xlsxwriter : 1.0.5
numba : None

@jorisvandenbossche jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Feb 10, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.0.2 milestone Feb 10, 2020
@jorisvandenbossche
Copy link
Member

Thanks for the report!

cc @paihu @gfyoung

@gfyoung
Copy link
Member

gfyoung commented Feb 10, 2020

I don't mind expanding support (since we used to have it before), but Is there a generic way to check all of these? It seems they have varying interfaces.

Would our existing file-like check suffice?

@paihu
Copy link
Contributor

paihu commented Feb 11, 2020

I don't think it's enough to make sure it's file-like object.
This main point is file-like object's read() is return binary or str (or other type?).
If it return binary, decode return values as encoding. Currently, we use TextIOWrapper.
But I don't know how to check the return value of read () without using read().(some case, file-like object is not seekable)

just an idea.

if is_file_like(buffer):
    t = buffer.read(1)
    u = makebuffer(t,buffer) # combine t & buffer  or  seek buffer
    is_binary(t):
        r = io.TextIOWrapper(u)
    else:
        r = u

Or consider a file-like object that returns binary if the encoding is set?

@gfyoung
Copy link
Member

gfyoung commented Feb 11, 2020

Alternatively, we could just drop the check for it being an instance of io.BufferedIOBase or RawIOBase, no? Perhaps wrap the TextIOWrapper instantiations with a try-except to provide an informative error message?

@paihu
Copy link
Contributor

paihu commented Feb 11, 2020

sadly TextIOWarapper instantiation does not return an error, if file-like object's read() return str.
If the read() of a file-like object returns a value other than binary, using read() on a TextIOWrapper instance will raise an error. (e.g. TypeError: a bytes-like object is required, not 'str')

@gfyoung
Copy link
Member

gfyoung commented Feb 11, 2020

I see. Given how many different file-like objects we have to support, I think this is the best way to go for the time being. It's better that we restore existing file-likes that worked before and then re-evaluate our logic later to catch "invalid file-likes" (for lack of a better term).

@sasanquaneuf
Copy link
Author

sasanquaneuf commented Feb 11, 2020

Just idea, we might classify the object by it has encoding attribute or not.

import os
import pandas
import tempfile

fname = ""
with tempfile.NamedTemporaryFile(delete=False, mode="w+", encoding="shift-jis") as f:
    f.write("てすと\nbar")
    fname = f.name
    print(hasattr(f, 'encoding'))  # True
print(fname)

try:
    with open(fname,mode="r", encoding="shift-jis") as f:
        print(hasattr(f, 'encoding'))  # True
        result = pandas.read_csv(f)
        print("read shift-jis")
        print(result)

    with open(fname,mode="r", encoding="shift-jis") as f:
        print(hasattr(f, 'encoding'))  # True
        result = pandas.read_csv(f,encoding="utf-8")
        print("open shift-jis file and read_csv with encoding: utf-8")
        print(result)

    with open(fname,mode="rb") as f:
        print(hasattr(f, 'encoding'))  # False
        result = pandas.read_csv(f,encoding="shift-jis")
        print("open binary with buffered and read_csv with encoding: shift-jis")
        print(result)

    with open(fname,mode="rb",buffering=0) as f:
        print(hasattr(f, 'encoding'))  # False
        result = pandas.read_csv(f,encoding="shift-jis")
        print("open binary without burrered and read_csv with encoding: shift-jis")
        print(result)
except Exception as e:
    print(e)

os.unlink(fname)

and BufferedIOBase does not have encoding attribute.

Add:
If we could not use hasattr() directly, we colud use getattr() with try-except.

@gfyoung
Copy link
Member

gfyoung commented Feb 11, 2020

Just idea, we might classify the object by it has encoding attribute or not.

Not sure I follow you here. How would this help determine which files to pass through or not?

@sasanquaneuf
Copy link
Author

sasanquaneuf commented Feb 11, 2020

Sorry, I think about @paihu 's following comment:

This main point is file-like object's read() is return binary or str (or other type?).

TextIOBase has encoding attribute.
https://docs.python.org/3/library/io.html#io.TextIOBase

RawIOBase and BufferedIOBase does not have encoding attribute, because they treats binary data. So I think that we can check the type of return value without use read().
Just idea, change the condition

if self.encoding and isinstance(source, (io.BufferedIOBase, io.RawIOBase)):

to

if self.encoding and hasattr(source, 'read') and not hasattr(source, 'encoding'):

(or use getattr(source, 'read') and getattr(source, 'encoding') with try-except)

@simonjayhawkins simonjayhawkins added the IO CSV read_csv, to_csv label Feb 11, 2020
@gfyoung
Copy link
Member

gfyoung commented Feb 11, 2020

@sasanquaneuf : You are more than welcome to give your suggestion a try!

@EgorBEremeev
Copy link

Not exactly sure, take a look on the #32392, please. It looks similar or\and connected issue and may affects possible solution.

@TomAugspurger
Copy link
Contributor

We're hoping to release 1.0.2 soon. Have we reached an agreed behavior, and is anyone working on this?

@gfyoung
Copy link
Member

gfyoung commented Mar 6, 2020

@TomAugspurger : We haven't come upon an agreed-upon fix yet. I was hoping to get some more community input on this, but if not, I'll see what I can change here to fix the regression.

@TomAugspurger
Copy link
Contributor

I'm roughly targeting Wednesday for the release so if you're able to get something together quickly it'd be welcome.

@gfyoung
Copy link
Member

gfyoung commented Mar 9, 2020

Sounds good. Let me see what I can put together. It may not be optimal, but if we can at least restore functionality, that would be good for this release.

gfyoung added a commit to forking-repos/pandas that referenced this issue Mar 10, 2020
Restores behavior down to the fact that the
Python engine cannot handle NamedTemporaryFile.

Closes pandas-dev#31819
gfyoung added a commit to forking-repos/pandas that referenced this issue Mar 10, 2020
Restores behavior down to the fact that the
Python engine cannot handle NamedTemporaryFile.

Closes pandas-dev#31819

Credit to @sasanquaneuf for originating idea.
gfyoung added a commit to forking-repos/pandas that referenced this issue Mar 10, 2020
Restores behavior down to the fact that the
Python engine cannot handle NamedTemporaryFile.

Closes pandas-dev#31819

Credit to @sasanquaneuf for originating idea.
@gfyoung
Copy link
Member

gfyoung commented Mar 10, 2020

PR is up and ready for review

gfyoung added a commit to forking-repos/pandas that referenced this issue Mar 10, 2020
Restores behavior down to the fact that the
Python engine cannot handle NamedTemporaryFile.

Closes pandas-dev#31819

Credit to @sasanquaneuf for originating idea.
gfyoung added a commit to forking-repos/pandas that referenced this issue Mar 10, 2020
Restores behavior down to the fact that the
Python engine cannot handle NamedTemporaryFile.

Closes pandas-dev#31819

Credit to @sasanquaneuf for originating idea.
gfyoung added a commit that referenced this issue Mar 11, 2020
Restores behavior down to the fact that the
Python engine cannot handle NamedTemporaryFile.

Closes #31819

Credit to @sasanquaneuf for originating idea.
SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this issue Mar 22, 2020
Restores behavior down to the fact that the
Python engine cannot handle NamedTemporaryFile.

Closes pandas-dev#31819

Credit to @sasanquaneuf for originating idea.
@Colin-b
Copy link

Colin-b commented Nov 4, 2020

Using pandas==1.1.4, I am still unable to use pandas.read_csv with a FileStorage instance (Werkzeug==1.0.1):

    pandas.read_csv(
  File "lib\site-packages\pandas\io\parsers.py", line 688, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "lib\site-packages\pandas\io\parsers.py", line 454, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "lib\site-packages\pandas\io\parsers.py", line 948, in __init__
    self._make_engine(self.engine)
  File "lib\site-packages\pandas\io\parsers.py", line 1180, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "lib\site-packages\pandas\io\parsers.py", line 2010, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas\_libs\parsers.pyx", line 537, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas\_libs\parsers.pyx", line 711, in pandas._libs.parsers.TextReader._get_header
  File "pandas\_libs\parsers.pyx", line 905, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 2031, in pandas._libs.parsers.raise_parser_error
io.UnsupportedOperation: not readable

Either it's a bug in the code or an issue in the documentation, as documentation is stating that By file-like object, we refer to objects with a read() method, and FileStorage.read do exists and work as expected.

@jorisvandenbossche
Copy link
Member

@Colin-b can you open a new issue?

@Colin-b
Copy link

Colin-b commented Nov 5, 2020

Hi @jorisvandenbossche ,

While trying to reproduce the issue with the smallest possible sample, I figured out that the issue might not be on pandas side.

The tempfile.SpooledTemporaryFile used by wekzeug.datastructures.FileStorage stream has it's mode set to "wb+" (because the underlying BytesIO does not contains "mode" attribute).

And I guess pandas uses file.mode to know if it can read a file or not, thus throwing a "not readable" even if the read method would work.

Do you still consider this as an issue and want me to open one?

In the meantime I will see if it can be reported to werkzeug as I would expect FileStorage to expose read only mode.

@jorisvandenbossche
Copy link
Member

Sorry, I don't know enough about this part of pandas to answer your question. (@gfyoung ?)
(but so if you think there is an issue in pandas, opening a new issue might get more attention to the question)

@Colin-b
Copy link

Colin-b commented Nov 5, 2020

I will thanks 👍

@davidism
Copy link

davidism commented Jan 15, 2021

I'm not sure why @Colin-b didn't follow up here, but I think this indicates an issue with Pandas as well. Pandas should accept w+ as readable. I don't know enough about Pandas to say though, so I'm not opening a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants