Pandas 0.19 read_csv with header=[0, 1] on an empty df throws error #14515

Closed
kaloramik opened this Issue Oct 27, 2016 · 7 comments

kaloramik commented Oct 27, 2016

Pandas 0.19 incorrectly handles empty dataframe files with multi index columns

import pandas as pd
import tempfile

df = pd.DataFrame.from_records([], columns=['col_1', 'col_2'])
joined_df_in = pd.concat([df, df], keys=['a', 'b'], axis=1)
joined_df_in.reset_index(drop=True, inplace=True)

with tempfile.NamedTemporaryFile(delete=False) as f:
    joined_df_in.to_csv(f.name, index=False)

What the file looks like

a,a,b,b
col_1,col_2,col_1,col_2

Expected Output

# in pandas 0.18.1
pd.read_csv(f.name, header=[0,1])

yields what we expect, an empty MultiIndex data frame

(a, col_1)  (a, col_2)  (b, col_1)  (b, col_2)

# in pandas 0.19
pd.read_csv(f.name, header=[0,1])

Throws

---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-10-1051c5f9aa58> in <module>()
----> 1 pd.read_csv(f.name, header=[0,1])

/Users/mik-OD/anaconda/envs/signals/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    643                     skip_blank_lines=skip_blank_lines)
    644 
--> 645         return _read(filepath_or_buffer, kwds)
    646 
    647     parser_f.__name__ = name

/Users/mik-OD/anaconda/envs/signals/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    386 
    387     # Create the parser.
--> 388     parser = TextFileReader(filepath_or_buffer, **kwds)
    389 
    390     if (nrows is not None) and (chunksize is not None):

/Users/mik-OD/anaconda/envs/signals/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    727             self.options['has_index_names'] = kwds['has_index_names']
    728 
--> 729         self._make_engine(self.engine)
    730 
    731     def close(self):

/Users/mik-OD/anaconda/envs/signals/lib/python3.5/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    920     def _make_engine(self, engine='c'):
    921         if engine == 'c':
--> 922             self._engine = CParserWrapper(self.f, **self.options)
    923         else:
    924             if engine == 'python':

/Users/mik-OD/anaconda/envs/signals/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1387         kwds['allow_leading_cols'] = self.index_col is not False
   1388 
-> 1389         self._reader = _parser.TextReader(src, **kwds)
   1390 
   1391         # XXX

pandas/parser.pyx in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5811)()

pandas/parser.pyx in pandas.parser.TextReader._get_header (pandas/parser.c:8615)()

CParserError: Passed header=[0,1], len of 2, but only 2 lines in file

Output of pd.show_versions()

For pandas 0.18.1

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: None
numpy: 1.11.1
scipy: 0.17.1
statsmodels: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.1
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext)
jinja2: 2.7.3
boto: 2.40.0
pandas_datareader: None

For pandas 0.19


INSTALLED VERSIONS
------------------
commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: None
numpy: 1.11.1
scipy: 0.17.1
statsmodels: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.1
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext)
jinja2: 2.7.3
boto: 2.40.0
pandas_datareader: None
Member

jorisvandenbossche commented Oct 27, 2016

@kaloramik So the change is not in read_csv (because the example you give raises for me for both 0.19.0 and 0.18.1, and also 0.16), but in the output that to_csv is generating.

In versions < 0.19.0, the file looks like:

a,a,b,b
col_1,col_2,col_1,col_2
,,,

while in 0.19.0 it looks like (what you showed above):

a,a,b,b
col_1,col_2,col_1,col_2

So previously there was an extra line with empty values. Reading this in with 0.19.0 still gives your desired result of an empty frame:

from io import StringIO

s = """a,a,b,b
col_1,col_2,col_1,col_2
,,,"""

In [89]: pd.read_csv(StringIO(s), header=[0,1])
Out[89]: 
Empty DataFrame
Columns: [(a, col_1), (a, col_2), (b, col_1), (b, col_2)]
Index: []

In [90]: pd.__version__
Out[90]: '0.19.0'

(although one could argue that this should actually give you one row of NaNs)

So the change is in to_csv. In 0.19.0, the extra line is not added

In [94]: df = pd.DataFrame(columns=pd.MultiIndex.from_product([('a', 'b'), ('col_1', 'col_2')]))

In [96]: print(df.to_csv())
,a,a,b,b
,col_1,col_2,col_1,col_2

while in 0.18.0 there was an extra line with commas:

In [32]: df = pd.DataFrame(columns=pd.MultiIndex.from_product([('a', 'b'), ('col_1', 'col_2')]))

In [34]: print(df.to_csv())
,a,a,b,b
,col_1,col_2,col_1,col_2
,,,,

This was a bug (since you don't have any data, there should not be a line of missing values), and this bug was fixed in 0.19.0, see #6618
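To check which behavior a given installation has, a minimal sketch like the following may help. It assumes pandas 0.19.0 or later (where the extra line is no longer written) and passes `index=False` so the output matches the file in the original report. The variable names are illustrative only.

```python
import pandas as pd

# Empty frame with two-level columns, as in the report.
columns = pd.MultiIndex.from_product([("a", "b"), ("col_1", "col_2")])
df = pd.DataFrame(columns=columns)

# With index=False the CSV text should contain only the header rows.
csv_text = df.to_csv(index=False)
csv_lines = csv_text.splitlines()

for line in csv_lines:
    print(line)
```

On 0.19.0 or later this should print just the two header lines. On older versions a third line of commas is expected, per the outputs quoted above.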

jorisvandenbossche added this to the No action milestone Oct 27, 2016

kaloramik commented Oct 27, 2016

@jorisvandenbossche hmm really? That's not what I'm seeing at all. Is it possible I have a package that's screwing something up? Can you post your pd.show_versions()?

But looking at the behavior, shouldn't the expected behavior be what I posted? That is, if you read a file with 2 lines and the header takes up both of them, it should return an empty df with those columns. I believe the same behavior applies for a single header.

The error message doesn't seem to make sense

Passed header=[0,1], len of 2, but only 2 lines in file

it DOES have 2 lines in the file, so it should be able to construct the header. In addition, the source code has the following comment
https://github.com/pandas-dev/pandas/blob/6130e77fb7c9d44fde5d98f9719bd67bb9ec2ade/pandas/parser.pyx

                # e.g., if header=3 and file only has 2 lines
                elif self.parser.lines < hr + 1:
                    msg = self.orig_header
                    if isinstance(msg, list):
                        msg = "[%s], len of %d," % (
                            ','.join([ str(m) for m in msg ]), len(msg))
                    raise CParserError(
                        'Passed header=%s but only %d lines in file'
                        % (msg, self.parser.lines))

According to the comment, the function should fail if the file has fewer lines than the header needs, implying that it should succeed when the number of lines equals len(header). Does that sound right?
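One way to read that check, as a plain-Python sketch rather than the actual Cython code: the error fires when the file has fewer lines than the last wanted header row requires. The helper name `header_fits` and the `extra_rows` parameter are hypothetical. `extra_rows=1` models the assumption that a list-valued header makes the reader look for one more line after the listed rows, which would explain why two header rows fail on a two-line file.

```python
def header_fits(n_lines, header_rows, extra_rows=0):
    """Return True if a file with n_lines lines can supply every header row.

    Mirrors the quoted condition `lines < hr + 1`, applied to the last row
    the reader wants. extra_rows is a hypothetical knob for rows the reader
    expects beyond the ones the caller listed.
    """
    last_row = max(header_rows) + extra_rows
    return not (n_lines < last_row + 1)


# header=[0, 1] on a two-line file, taken literally.
print(header_fits(2, [0, 1]))
# Same file, assuming the reader also wants one line after the header.
print(header_fits(2, [0, 1], extra_rows=1))
```

Under that assumption, the message "but only 2 lines in file" is consistent with the reader wanting a third line, even though the caller only asked for two header rows.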

kaloramik commented Oct 27, 2016

Oh actually, scratch that, you are right about 0.18.1 writing an extra line of commas (and so the read_csv succeeds, I guess).

But this breaks behavior now: in my data pipelines, I can no longer write and then read back empty dataframes as before. I think the behavior I described above is still the desired one? Unless you have better workarounds? (I don't think replicating the old behavior by forcibly adding a row of commas would be a good idea.)

Member

jorisvandenbossche commented Oct 27, 2016

> But looking at the behavior, shouldn't the expected behavior be what I posted?

Possibly. But I am just pointing out that it is not a change in read_csv. The code you link to hasn't changed in 2 years (I tested back to 0.16, and it has raised this error consistently).

Apart from that, it is worth discussing if we should allow this. IMO returning an empty frame is indeed more logical to do.

Member

jorisvandenbossche commented Oct 27, 2016

The bug fix in to_csv was in any case a good one, so we can only fix it in read_csv. Personally I am in favor of returning an empty frame instead of erroring.
As you point out, this is more in line with a single header line:

s = """a,b
"""

In [14]: pd.read_csv(StringIO(s))
Out[14]: 
Empty DataFrame
Columns: [a, b]
Index: []

Note that also for a single header, once you pass the header kwarg, it raises:

In [105]: pd.read_csv(StringIO(s), header=[0])
...
CParserError: Passed header=[0], len of 1, but only 1 lines in file

cc @gfyoung @chris-b1

kaloramik commented Oct 27, 2016

Got it. Thanks for the clarification! Actually as a temporary workaround I guess forcing a write of an empty row on empty data frames should be ok.

Do you know if there are any other workarounds, perhaps from the read side?

Member

jorisvandenbossche commented Oct 27, 2016

Hmm, I don't directly see a workaround on the read side. If you want to end up with the multi-index, I don't think there is an easy solution. Probably easier to temporarily fix on the write side as you point out.
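The write-side workaround discussed here could be sketched as below. The helper name `to_csv_text_with_placeholder` is hypothetical, and the sketch only covers writing: whether `read_csv` then treats the comma-only line as "no data" or as a row of NaNs may depend on the pandas version, so that should be checked separately.

```python
import pandas as pd


def to_csv_text_with_placeholder(df: pd.DataFrame) -> str:
    """Write df as CSV text, adding a comma-only line if df has no rows.

    This imitates the pre-0.19.0 output for empty frames.
    """
    text = df.to_csv(index=False)
    if df.empty:
        # One fewer comma than columns yields one empty field per column.
        text += "," * (df.shape[1] - 1) + "\n"
    return text


columns = pd.MultiIndex.from_product([("a", "b"), ("col_1", "col_2")])
placeholder_text = to_csv_text_with_placeholder(pd.DataFrame(columns=columns))
print(placeholder_text)
```

Frames that contain data pass through unchanged, so the helper can wrap every write in a pipeline.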

bkandel referenced this issue Nov 6, 2016

Closed

Fix parse empty df #14596

4 of 4 tasks complete

jreback modified the milestones: 0.19.2, No action Nov 22, 2016

jreback closed this in f862b52 Nov 22, 2016

amolkahat added a commit to amolkahat/pandas that referenced this issue Nov 26, 2016

BUG: Fix parse empty df
closes #14515

This commit fixes a bug where `read_csv` failed when given a file with
a multiindex header and empty content. Because pandas reads index
names as a separate line following the header lines, the reader looks
for the line with index names in it. If the content of the dataframe
is empty, the reader will choke. This bug surfaced after
pandas-dev#6618 stopped writing an
extra line after multiindex columns, which led to a situation where
pandas could write CSVs that it couldn't then read. This commit
changes that behavior by explicitly checking if the index name row
exists, and processing it correctly if it doesn't.

Author: Ben Kandel <ben.kandel@gmail.com>

Closes #14596 from bkandel/fix-parse-empty-df and squashes the following commits:

32e3b0a [Ben Kandel] lint
e6b1237 [Ben Kandel] lint
fedfff8 [Ben Kandel] fix multiindex column parsing
518982d [Ben Kandel] move to 0.19.2
fc23e5c [Ben Kandel] fix errant this_columns
3d9bbdd [Ben Kandel] whatsnew
68eadf3 [Ben Kandel] Modify test.
17e44dd [Ben Kandel] fix python parser too
72adaf2 [Ben Kandel] remove unnecessary test
bfe0423 [Ben Kandel] typo
2f64d57 [Ben Kandel] pep8
b8200e4 [Ben Kandel] BUG: read_csv with empty df
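
A round-trip check along these lines can confirm whether an installation includes the fix. It is a sketch, assuming a pandas release that contains this commit (0.19.2 or later, going by the milestone above). On 0.19.0 the `read_csv` call is expected to raise the `CParserError` from the report instead.

```python
from io import StringIO

import pandas as pd

columns = pd.MultiIndex.from_product([("a", "b"), ("col_1", "col_2")])
empty = pd.DataFrame(columns=columns)

# Write only the two header rows, then read them back as a two-level header.
csv_text = empty.to_csv(index=False)
round_trip = pd.read_csv(StringIO(csv_text), header=[0, 1])

print(round_trip.empty)
print(list(round_trip.columns))
```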

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue Dec 14, 2016

[Backport #14596] BUG: Fix parse empty df
closes #14515

(cherry picked from commit f862b52)