Unhelpful error message when loading a single column with `read_csv` and `usecols` #20529

Closed
mattmotoki opened this Issue Mar 29, 2018 · 10 comments

mattmotoki commented Mar 29, 2018

Code

>>> import pandas as pd
>>> df = pd.DataFrame({'x': [0,1], 'x1': [2,3]})
>>> df.to_csv('tmp.csv', index=False)
>>> pd.read_csv('tmp.csv', usecols='x')
   x
0  0
1  1
>>> pd.read_csv('tmp.csv', usecols=['x1'])
   x1
0   2
1   3
>>> pd.read_csv('tmp.csv', usecols='x1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/matt/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/matt/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 449, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/matt/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 818, in __init__
    self._make_engine(self.engine)
  File "/home/matt/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1049, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/matt/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1740, in __init__
    raise ValueError("Usecols do not match names.")
ValueError: Usecols do not match names.

Problem description

When using `usecols` to load a single column, one needs to have either a single-character column name or provide an array-like object. In the example above, `pd.read_csv('tmp.csv', usecols='x')` and `pd.read_csv('tmp.csv', usecols=['x1'])` work as expected; however, things break down for `pd.read_csv('tmp.csv', usecols='x1')`. The corresponding error message, `ValueError: Usecols do not match names.`, is not very helpful either.
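
Until the parser rejects or handles bare strings itself, a caller-side workaround is to always hand `usecols` a list. The helper below is a minimal sketch; `as_column_list` is a hypothetical name that is not part of pandas, and it assumes `usecols` is either one column name or an iterable of names (it does not handle a callable `usecols`).

```python
def as_column_list(usecols):
    """Hypothetical caller-side helper: wrap a lone column name in a list.

    Assumes `usecols` is a single name (str) or an iterable of names.
    """
    if isinstance(usecols, str):
        # A bare string would otherwise be iterated character by character.
        return [usecols]
    return list(usecols)
```

With this, `pd.read_csv('tmp.csv', usecols=as_column_list('x1'))` passes `['x1']`, which is the form that already works in the example above.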

Expected Output

It would be nice if there were some type checking done on `usecols` so that things don't break in the example above. At the least, the error message should be a bit more helpful; e.g., `ValueError: Usecols should be array-like.`

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.1
patsy: None
dateutil: 2.7.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger

Contributor

TomAugspurger commented Mar 29, 2018

The error message is improved on master:

ValueError: Usecols do not match columns, columns expected but not found: ['1']

I'm trying to figure out what's going on, I think this is just buggy.

> It would be nice if there were some type checking done on `usecols` so that things don't break in the example above.

Yeah, we should ensure that usecols is not a string, right?

cc @gfyoung do you have thoughts here?

@gfyoung

Member

gfyoung commented Mar 29, 2018

> Yeah, we should ensure that usecols is not a string, right?

Agreed. This "array-like" should have been disallowed.
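
The check being agreed on here could look roughly like the sketch below. It is only an illustration of the idea: `check_usecols` and its error wording are hypothetical and are not taken from the pandas code base or from any pull request.

```python
def check_usecols(usecols):
    """Hypothetical validation sketch: refuse a bare string for usecols.

    A str is iterable, so without this check 'x1' would be treated as
    the two column names 'x' and '1'.
    """
    if isinstance(usecols, str):
        raise ValueError(
            "usecols should be array-like, not a string; "
            f"did you mean [{usecols!r}]?"
        )
    return usecols
```

Under this sketch, `check_usecols('x1')` raises immediately with a message that names the likely fix, while `check_usecols(['x1'])` is returned unchanged.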

gfyoung added Bug and removed Data IO labels Mar 29, 2018

@mattmotoki

mattmotoki commented Mar 30, 2018

> I'm trying to figure out what's going on, I think this is just buggy.

@TomAugspurger It looks like pandas is iterating through the characters in the string. In particular, if `usecols='x1'`, then it first looks for the column `'x'` and then for the column `'1'`.

>>> import pandas as pd
>>> df = pd.DataFrame({'x': [0,1], '1':[2,3], 'x1': [4,5]})
>>> df.to_csv('tmp.csv', index=False)
>>> pd.read_csv('tmp.csv', usecols='x1')
   1  x
0  2  0
1  3  1
>>> 
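
The character-by-character behavior described in this comment can be reproduced without pandas. The function below is a plain-Python stand-in, not pandas' actual parser code; `match_usecols` is a hypothetical name, and the error text only borrows the wording quoted from master earlier in the thread.

```python
def match_usecols(usecols, header):
    """Hypothetical stand-in for the column-matching step (not pandas code)."""
    # Iterating a str yields its characters, so set('x1') == {'x', '1'}.
    requested = set(usecols)
    missing = sorted(requested - set(header))
    if missing:
        raise ValueError(f"columns expected but not found: {missing}")
    # Keep the header's own order for the columns that were requested.
    return [name for name in header if name in requested]
```

With a header of `['x', '1', 'x1']`, passing the string `'x1'` selects the columns `'x'` and `'1'` rather than `'x1'`; with a header of `['x', 'x1']`, it fails because `'1'` is missing, while `['x1']` and `'x'` both succeed.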
@minggli

Contributor

minggli commented Mar 30, 2018

happy to work on this issue, if @mattmotoki is happy to let me :)

@mattmotoki

mattmotoki commented Mar 30, 2018

@minggli I'd be more than happy if you worked on this, but I don't think I have any authority on that.

@gfyoung

Member

gfyoung commented Mar 30, 2018

@mattmotoki : Actually, you do, since it was your issue 😄 . @minggli go for it!

jreback added this to the Next Major Release milestone Mar 30, 2018

jreback modified the milestones: Next Major Release, 0.23.0 Mar 30, 2018

@mattmotoki

mattmotoki commented Mar 30, 2018

@gfyoung @minggli Oops, sorry about that; I'm new to the open source community.
Is there anything else that I need to do to keep this process running smoothly?

@gfyoung

Member

gfyoung commented Mar 30, 2018

Nope, you're good! Though feel free to check out #20558, which will fix your issue.

@minggli

Contributor

minggli commented Mar 30, 2018

@mattmotoki welcome. :}

@mattmotoki

mattmotoki commented Apr 1, 2018

@gfyoung @minggli Great work guys! I checked out #20558 and I'd be happy to close this issue whenever it's okay to do so.
