
read_csv with usecols and chunksize fails if first row of chunk has fewer columns #21211

Closed · MikeBeller opened this issue May 25, 2018 · 1 comment · Fixed by #44644

Labels: Bug, IO CSV (read_csv, to_csv)

@MikeBeller

Code Sample, a copy-pastable example if possible

import io
import pandas as pd

input = """23 45 32 17
18 19 23 20
17 4 9
"""

f = io.StringIO(input)

df_chunks = pd.read_csv(f, sep=" ", names=["A", "B"], 
          chunksize=2, usecols=[0,1], header=None,)

for i,df in enumerate(df_chunks):
    print(i, df)

Actual output

ubuntu@ip-172-31-31-255:~/git/sparkdaas$ python csv_bug.py 
0     A   B
0  23  45
1  18  19
Traceback (most recent call last):
  File "csv_bug.py", line 19, in <module>
    for i,df in enumerate(df_chunks):
  File "/ebs/home/ubuntu/venv3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "/ebs/home/ubuntu/venv3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "/ebs/home/ubuntu/venv3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/ebs/home/ubuntu/venv3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1028, in pandas._libs.parsers.TextReader._convert_column_data
pandas.errors.ParserError: Too many columns specified: expected 4 and found 3

Problem description

Often I use CSV files where the first n columns are the same in each row, but additional columns may be present. I do not need those additional columns for my analysis, so I'd like read_csv to load a dataframe with only the n columns common to every row, using header=None, names=[...], and usecols=[0, 1, ..., n]. This works fine when read_csv reads the whole file in one call, but if the file is very big and I add chunksize, and the first row of some chunk happens to have fewer columns than the first row of the whole file, the error above appears.
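
For contrast, here is a minimal sketch of the whole-file read described above, using the same data as the code sample; without chunksize the read succeeds:

import io
import pandas as pd

f = io.StringIO("23 45 32 17\n18 19 23 20\n17 4 9\n")
# Without chunksize the parser sees the 4-column first row up front,
# and usecols=[0, 1] keeps only the two named columns.
print(pd.read_csv(f, sep=" ", names=["A", "B"], usecols=[0, 1], header=None))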

I believe this represents a bug, and the bug may be that the code here: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/parsers.pyx#L1027 needs to account for the case where usecols is set, or maybe it's the code here: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/parsers.pyx#L539. Perhaps someone who knows the parser well can figure out the right course of action (or explain why this is not a bug but expected behavior?).
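
Consistent with the title's diagnosis, the failure seems to depend on the short row landing at the start of a chunk. A sketch, assuming the same data as above: with chunksize=3 all three rows fall in one chunk, so the parser has already seen the four-column width when it reaches the short row, and the read should go through just like the whole-file read:

import io
import pandas as pd

f = io.StringIO("23 45 32 17\n18 19 23 20\n17 4 9\n")
# chunksize=3 puts all three rows in one chunk, so the short row is no
# longer the first row of a chunk.
for i, df in enumerate(pd.read_csv(f, sep=" ", names=["A", "B"],
                                   chunksize=3, usecols=[0, 1], header=None)):
    print(i, df)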

Expected Output

0    A   B
0    23  45
1    18  19
1    A   B
2    17  4

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-1060-aws
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.3
scipy: None
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd added the IO CSV (read_csv, to_csv) label on May 25, 2018
@MikeBeller (Author)

A bit more info -- the problem does not manifest if I add engine='python' to the call to read_csv (though of course the read takes much longer). This reinforces that the behavior above is a bug.
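
For reference, a sketch of that workaround: the original call with only engine='python' added.

import io
import pandas as pd

f = io.StringIO("23 45 32 17\n18 19 23 20\n17 4 9\n")
# Same arguments as the failing example, plus engine="python"; the Python
# engine tolerates the short row at the start of the second chunk.
df_chunks = pd.read_csv(f, sep=" ", names=["A", "B"], chunksize=2,
                        usecols=[0, 1], header=None, engine="python")
for i, df in enumerate(df_chunks):
    print(i, df)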
