pandas read_csv no longer supports file-like objects from tarfile (pandas 0.20.1) #16530

gjanvier opened this Issue May 29, 2017 · 17 comments

Code Sample, a copy-pastable example if possible

import pandas as pd
import tarfile

tar = tarfile.open(name="xxx.tar.bz2", mode='r')
myfile = tar.extractfile('yyy.csv') # file-like object with a read() method
data = pd.read_csv(myfile, sep=r'\s+')

This code generates this error:

  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 392, in _read
    filepath_or_buffer, encoding, compression)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/common.py", line 210, in get_filepath_or_buffer
    raise ValueError(msg.format(_type=type(filepath_or_buffer)))
ValueError: Invalid file path or buffer object type: <class 'tarfile.ExFileObject'>

Problem description

This code works with pandas 0.19.2 but fails with 0.20.1.

According to pandas doc for read_csv:

filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or any object with a read() method (such as a file handle or StringIO)

I guess the new validations are too restrictive?
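A possible workaround on 0.20.1, as a minimal sketch: read the archive member fully into an in-memory `io.BytesIO` buffer and pass that buffer to the parser instead. This assumes the member fits in memory. The archive is built in memory only to keep the sketch self-contained, and the `pd.read_csv` call is left as a comment so the snippet runs with the standard library alone.

```python
import io
import tarfile

# Build a small tar archive in memory so the sketch is self-contained.
payload = b"col1\tcol2\n1\t2\n3\t4\n"
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w") as tar:
    info = tarfile.TarInfo(name="mydata.csv")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
archive.seek(0)

# Copy the member into a plain BytesIO buffer.
with tarfile.open(fileobj=archive, mode="r") as tar:
    member = tar.extractfile("mydata.csv")
    buffer = io.BytesIO(member.read())

# The buffer can now be handed to the parser, for example:
#     data = pd.read_csv(buffer, sep=r"\s+")
first_line = buffer.readline()
buffer.seek(0)  # rewind so a later reader starts at the beginning
```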

Contributor

bashtage commented May 29, 2017

Works as expected on windows with 0.20.1.

pd.read_csv(myfile, sep='\s+')
Out[25]: 
  Col1,COl2
0       a,1
1       b,2
2       c,3
3       d,4

Which Python? You should include the show_versions() output in the details area of the template.

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-78-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: None
pip: 8.1.1
setuptools: 20.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

gjanvier closed this May 29, 2017

Contributor

bashtage commented May 29, 2017

Did you mean to close? If it isn't working for you, please reopen. It could be a combination of Python version and OS.

gjanvier reopened this May 29, 2017

Oops, sorry. Reopened.

Contributor

TomAugspurger commented May 29, 2017

Working for me as well with python3 and master.

@gjanvier can you make a fully reproducible example (including writing the csv and adding it to the tar)?

Sure, here is a test case.

import tarfile
import pandas as pd

pd.show_versions()

data = pd.DataFrame(
    data=[[1,2], [3,4]],
    columns=['col1', 'col2']
)

print "data"
print data
print ""

data.to_csv('mydata.csv', sep="\t", index=False)

tar = tarfile.open('test.tar', 'w')
tar.add('mydata.csv')
tar.close()

tar = tarfile.open('test.tar', 'r')
myfile = tar.extractfile('mydata.csv')
data2 = pd.read_csv(myfile, sep=r'\s+')

print "data2"
print data2
print ""

FYI, I run my tests in a docker container...

Result with pandas 0.19.2

root@0a35b054b4da:xxxx# pip install pandas==0.19.2
Collecting pandas==0.19.2
  Downloading pandas-0.19.2-cp27-cp27mu-manylinux1_x86_64.whl (17.2MB)
    100% |################################| 17.2MB 25kB/s 
Requirement already satisfied (use --upgrade to upgrade): pytz>=2011k in /usr/local/lib/python2.7/dist-packages (from pandas==0.19.2)
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in /usr/local/lib/python2.7/dist-packages (from pandas==0.19.2)
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.7.0 in /usr/local/lib/python2.7/dist-packages (from pandas==0.19.2)
Requirement already satisfied (use --upgrade to upgrade): six>=1.5 in /usr/lib/python2.7/dist-packages (from python-dateutil->pandas==0.19.2)
Installing collected packages: pandas
  Found existing installation: pandas 0.20.1
    Uninstalling pandas-0.20.1:
      Successfully uninstalled pandas-0.20.1
Successfully installed pandas-0.19.2
You are using pip version 8.1.1, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
root@0a35b054b4da:xxxx# python test_tar_pd.py 

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-78-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: None
pip: 8.1.1
setuptools: 20.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
statsmodels: 0.8.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: 0.10.3
apiclient: 1.6.2
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
data
   col1  col2
0     1     2
1     3     4

data2
   col1  col2
0     1     2
1     3     4

Result with pandas 0.20.1

root@0a35b054b4da:xxx# pip install pandas==0.20.1
Collecting pandas==0.20.1
  Downloading pandas-0.20.1-cp27-cp27mu-manylinux1_x86_64.whl (22.3MB)
    100% |################################| 22.3MB 19kB/s 
Requirement already satisfied (use --upgrade to upgrade): pytz>=2011k in /usr/local/lib/python2.7/dist-packages (from pandas==0.20.1)
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in /usr/local/lib/python2.7/dist-packages (from pandas==0.20.1)
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.7.0 in /usr/local/lib/python2.7/dist-packages (from pandas==0.20.1)
Requirement already satisfied (use --upgrade to upgrade): six>=1.5 in /usr/lib/python2.7/dist-packages (from python-dateutil->pandas==0.20.1)
Installing collected packages: pandas
  Found existing installation: pandas 0.19.2
    Uninstalling pandas-0.19.2:
      Successfully uninstalled pandas-0.19.2
Successfully installed pandas-0.20.1
You are using pip version 8.1.1, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
root@0a35b054b4da:xxx# python test_tar_pd.py 

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-78-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: None
pip: 8.1.1
setuptools: 20.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None
data
   col1  col2
0     1     2
1     3     4

Traceback (most recent call last):
  File "test_tar_pd.py", line 23, in <module>
    data2 = pd.read_csv(myfile, sep=r'\s+')
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 392, in _read
    filepath_or_buffer, encoding, compression)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/common.py", line 210, in get_filepath_or_buffer
    raise ValueError(msg.format(_type=type(filepath_or_buffer)))
ValueError: Invalid file path or buffer object type: <class 'tarfile.ExFileObject'>
Contributor

jreback commented May 29, 2017

IIRC @gfyoung fixed this by patching is_file_like, but I can't find the issue ATM

Member

gfyoung commented May 29, 2017

@jreback : The relevant PR is #16150.

Contributor

jreback commented May 29, 2017 edited

hmm, so maybe this IS an issue on py2.7? maybe tarfile is not a proper iterator? (or doesn't have read?)

Member

gfyoung commented May 29, 2017 edited

This is indeed a compatibility issue. Turns out tarfile.ExFileObject is not a proper iterator object in our eyes under the Python 2.x implementation (it has no next or __next__ attribute, but Python 3.x tarfile.ExFileObject has the __next__ attribute).

So it seems we only need to check for the __iter__ attribute in is_file_like?
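To illustrate the difference, here is a rough sketch of a strict check versus a relaxed one. The names `is_file_like_strict`, `is_file_like_relaxed` and `ReadOnlyIterable` are made up for this example and are not the actual pandas code; `ReadOnlyIterable` only imitates the relevant property of the Python 2.x `tarfile.ExFileObject` (it has `read` and `__iter__`, but no `next` or `__next__`).

```python
import io


class ReadOnlyIterable(object):
    """Hypothetical stand-in: has read() and __iter__, but no next/__next__."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)

    def __iter__(self):
        # Hand back a separate iterator instead of returning self.
        return iter(self._buf.readlines())


def is_file_like_strict(obj):
    """Assumed strict variant: requires read/write and a full iterator."""
    if not (hasattr(obj, "read") or hasattr(obj, "write")):
        return False
    has_next = hasattr(obj, "next") or hasattr(obj, "__next__")
    return hasattr(obj, "__iter__") and has_next


def is_file_like_relaxed(obj):
    """Assumed relaxed variant: requires read/write and only __iter__."""
    if not (hasattr(obj, "read") or hasattr(obj, "write")):
        return False
    return hasattr(obj, "__iter__")


fp = ReadOnlyIterable(b"col1\tcol2\n1\t2\n")
strict_result = is_file_like_strict(fp)    # rejected by the strict check
relaxed_result = is_file_like_relaxed(fp)  # accepted by the relaxed check
```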

Contributor

jreback commented May 29, 2017

yeah maybe relax in the is_file_like only

@gfyoung gfyoung added a commit to gfyoung/pandas that referenced this issue May 29, 2017

@gfyoung gfyoung COMPAT: Consider Python 2.x tarfiles file-like
Tarfile.ExFileObject has no "next" method in
Python 2.x, making it an invalid file-like
object in read_csv. However, they can be
read in just fine, meaning our check is too
strict for file-like. This commit relaxes
the check to just look for "__iter__".

Closes gh-16530.
957c67e

Member

gfyoung commented May 29, 2017 edited

@jreback : So the C engine doesn't require that the file-like have a next method, but the Python engine does (we explicitly call next(self.data)). This presents a slight dilemma then: how do we check that a file-like has next and that the engine specified is Python? If possible, I would want to switch to the C engine.

Member

gfyoung commented May 29, 2017

Also, I should add that reading tarfile objects isn't actually feasible with Python's csv library, so this only matters for the C engine in Python 2.x.

jreback added this to the 0.20.2 milestone May 30, 2017

Contributor

jtratner commented May 31, 2017

@gfyoung - I hit the same issue when trying to pass Luigi's ReadableS3File to read_csv in pandas 0.20.1.

My understanding is that, while this is not required to work for a file-like:

next(fp)
iter(fp)

What is required is that it produce an iterator with a next method.

it = iter(fp)
next(it)

So if pandas wants to use next(self.data), perhaps it just needs to call iter on it first and work from there?
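A small sketch of that distinction, using a made-up `LineSource` class (not a real library type) that is iterable but is not itself an iterator:

```python
class LineSource(object):
    """Hypothetical file-like: defines __iter__ but no next/__next__."""

    def __init__(self, lines):
        self._lines = list(lines)

    def read(self):
        return "".join(self._lines)

    def __iter__(self):
        # Return a fresh iterator over the stored lines.
        return iter(self._lines)


fp = LineSource(["a,b\n", "1,2\n"])

# Calling next() directly on the object fails: it is not an iterator.
try:
    next(fp)
    direct_next_ok = True
except TypeError:
    direct_next_ok = False

# Calling iter() first yields an object that next() accepts.
it = iter(fp)
first_line = next(it)
second_line = next(it)
```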

Contributor

jtratner commented May 31, 2017

But I'm not sure this is explicitly defined anywhere, so much as it looks like an in-practice kind of thing.

Member

gfyoung commented May 31, 2017

@jtratner : For an object to be file-like, I am proposing that the object just have an __iter__ method (it need not have next or __next__). What you propose might work but would require a little more beefing up, as iter objects are not file-like by themselves. If we combine attributes from iter and the object we are wrapping, we could get a valid file-like.

Interesting idea. Worth pursuing once this issue gets resolved.
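One way the wrapping idea could look, purely as a sketch: `IterableFileWrapper` and `LineSource` are hypothetical names invented here, not anything in pandas. The wrapper takes its iteration behaviour from `iter(obj)` and delegates every other attribute (such as `read`) to the wrapped object.

```python
import io


class LineSource(object):
    """Hypothetical file-like with read() and __iter__ but no next/__next__."""

    def __init__(self, text):
        self._buf = io.StringIO(text)

    def read(self, size=-1):
        return self._buf.read(size)

    def __iter__(self):
        # Call readline() repeatedly until it returns the empty string.
        return iter(self._buf.readline, "")


class IterableFileWrapper(object):
    """Combine the iterator from iter(obj) with the attributes of obj."""

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self._iterator = iter(wrapped)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._iterator)

    next = __next__  # Python 2 spelling of the same method

    def __getattr__(self, name):
        # Only reached when normal lookup fails; guard against recursion
        # if the instance attributes have not been set yet.
        if name in ("_wrapped", "_iterator"):
            raise AttributeError(name)
        return getattr(self._wrapped, name)


wrapped = IterableFileWrapper(LineSource(u"a,b\n1,2\n"))
header = next(wrapped)   # works: the wrapper is a real iterator
rest = wrapped.read()    # delegated to the wrapped object
```

Delegating through `__getattr__` keeps the wrapper small, at the cost of hiding which attributes the wrapped object really provides.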

Contributor

jtratner commented May 31, 2017

@gfyoung - cool, I agree with your definition :) I was just reinforcing that the definition of file-like as "has iter" is upheld in many places, while having a next() method is not.

jreback closed this in #16533 Jun 1, 2017
