
read_csv with usecols and chunksize fails if first row of chunk has fewer columns #21211

Closed · MikeBeller opened this issue May 25, 2018 · 1 comment · Fixed by #44644

Labels: Bug, IO CSV (read_csv, to_csv)

@MikeBeller

Code Sample, a copy-pastable example if possible

import io
import pandas as pd

input = """23 45 32 17
18 19 23 20
17 4 9
"""

f = io.StringIO(input)

df_chunks = pd.read_csv(f, sep=" ", names=["A", "B"], 
          chunksize=2, usecols=[0,1], header=None,)

for i,df in enumerate(df_chunks):
    print(i, df)

Actual output

ubuntu@ip-172-31-31-255:~/git/sparkdaas$ python csv_bug.py 
0     A   B
0  23  45
1  18  19
Traceback (most recent call last):
  File "csv_bug.py", line 19, in <module>
    for i,df in enumerate(df_chunks):
  File "/ebs/home/ubuntu/venv3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "/ebs/home/ubuntu/venv3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "/ebs/home/ubuntu/venv3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/ebs/home/ubuntu/venv3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1028, in pandas._libs.parsers.TextReader._convert_column_data
pandas.errors.ParserError: Too many columns specified: expected 4 and found 3

Problem description

Often I use CSV files where the first n columns are the same in each row, but additional columns may be present. I do not need those additional columns for my analysis, so I'd like read_csv to load a dataframe with only the n columns common to every row, using header=None, names=[...], and usecols=[0, 1, ..., n]. This works fine when read_csv reads the whole file in one call, but if the file is very big and I add chunksize, and the first row of some chunk happens to have fewer columns than the first row of the whole file, the error above appears.
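
For contrast, here is a minimal sketch of the whole-file read described above, using the same data as the code sample; without chunksize the read succeeds:

import io
import pandas as pd

f = io.StringIO("23 45 32 17\n18 19 23 20\n17 4 9\n")
# Without chunksize the parser sees the 4-column first row up front,
# and usecols=[0, 1] keeps only the two named columns.
print(pd.read_csv(f, sep=" ", names=["A", "B"], usecols=[0, 1], header=None))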

I believe this represents a bug, and the bug may be that the code here: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/parsers.pyx#L1027 needs to account for the case where usecols is set, or maybe it's the code here: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/parsers.pyx#L539. Perhaps someone who knows the parser well can figure out the right course of action (or explain why this is not a bug but expected behavior?).
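
Consistent with the title's diagnosis, the failure seems to depend on the short row landing at the start of a chunk. A sketch, assuming the same data as above: with chunksize=3 all three rows fall in one chunk, so the parser has already seen the four-column width when it reaches the short row, and the read should go through just like the whole-file read:

import io
import pandas as pd

f = io.StringIO("23 45 32 17\n18 19 23 20\n17 4 9\n")
# chunksize=3 puts all three rows in one chunk, so the short row is no
# longer the first row of a chunk.
for i, df in enumerate(pd.read_csv(f, sep=" ", names=["A", "B"],
                                   chunksize=3, usecols=[0, 1], header=None)):
    print(i, df)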

Expected Output

0    A   B
0    23  45
1    18  19
1    A   B
2    17  4

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-1060-aws
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.3
scipy: None
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd added the IO CSV (read_csv, to_csv) label on May 25, 2018
@MikeBeller (Author)

A bit more info -- the problem does not manifest if I add engine='python' to the call to read_csv (though of course the read takes much longer). This reinforces that the behavior above is a bug.
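
For reference, a sketch of that workaround: the original call with only engine='python' added.

import io
import pandas as pd

f = io.StringIO("23 45 32 17\n18 19 23 20\n17 4 9\n")
# Same arguments as the failing example, plus engine="python"; the Python
# engine tolerates the short row at the start of the second chunk.
df_chunks = pd.read_csv(f, sep=" ", names=["A", "B"], chunksize=2,
                        usecols=[0, 1], header=None, engine="python")
for i, df in enumerate(df_chunks):
    print(i, df)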
