Fix parse empty df #14596

Closed

Contributor

@bkandel bkandel commented Nov 6, 2016

This commit fixes a bug where read_csv failed when given a file with a multiindex header and empty content. Because pandas reads index names as a separate line following the header lines, the reader looks for the line with index names in it. If the content of the dataframe is empty, that line does not exist and the reader fails. This bug surfaced after #6618 stopped writing an extra line after multiindex columns, which led to a situation where pandas could write CSVs that it couldn't then read.

This commit changes that behavior by explicitly checking if the index name row exists, and processing it correctly if it doesn't.
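
For illustration, the write-then-read round trip described above can be sketched as follows. This is a minimal sketch, assuming a pandas version that already includes this fix; the variable names are illustrative and are not taken from the PR's tests.

```python
from io import StringIO

import pandas as pd

# An empty frame whose columns form a two-level MultiIndex.
columns = pd.MultiIndex.from_tuples([("a", "c"), ("b", "d")])
empty = pd.DataFrame(columns=columns)

# Write only the header rows (no index column), then read them back.
csv_text = empty.to_csv(index=False)
result = pd.read_csv(StringIO(csv_text), header=[0, 1])

print(csv_text.splitlines())    # the two header rows that were written
print(result.columns.tolist())  # column tuples recovered by the reader
print(result.empty)             # whether the frame has no rows
```

Here the file consists of header rows only, so there is no index-name line for the reader to find, which is the case the description says used to fail.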

@@ -57,5 +57,6 @@ Bug Fixes
- Bug in ``DataFrame.to_json`` where ``lines=True`` and a value contained a ``}`` character (:issue:`14391`)
- Bug in ``df.groupby`` causing an ``AttributeError`` when grouping a single index frame by a column and the index level (:issue:`14327`)
- Bug in ``df.groupby`` where ``TypeError`` raised when ``pd.Grouper(key=...)`` is passed in a list (:issue:`14334`)
- Bug in ``pd.read_csv`` where reading files fails if the number of headers is equal to the number of lines in the file (:issue:`14515`)
Contributor

0.20

"""
df = self.read_csv(StringIO(data), header=[0])
expected = DataFrame(columns=[('a'), ('b')])
tm.assert_frame_equal(df, expected)
Contributor

use the same styling as above
e.g. expected

@@ -714,7 +714,9 @@ cdef class TextReader:
start = self.parser.line_start[0]

# e.g., if header=3 and file only has 2 lines
elif self.parser.lines < hr + 1:
if (self.parser.lines < hr + 1
and not isinstance(self.orig_header, list)) or (
Contributor

This is really odd. What are you trying to do?

Contributor Author

@bkandel bkandel Nov 6, 2016

This counteracts the header extension added in parser.pyx (https://github.com/pandas-dev/pandas/blob/master/pandas/parser.pyx#L519). When the header is passed in as a list, it is extended so that the index name line can be read (I think that's what it's for, if I interpreted that comment correctly). But the extended header may end up longer than the file itself, and the check here for a file that is too short doesn't take into account whether the header has been artificially extended. So this checks whether the header was artificially extended, and disables the complaint about the file being too short when the extension is what pushed the header beyond the length of the file.
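
The reasoning in this reply can be sketched in plain Python. This is not the code in parser.pyx: the helper names (`rows_required`, `rows_scanned`, `is_too_short`) are hypothetical, and the sketch assumes a non-empty list header that is extended by exactly one row to probe for the index-name line.

```python
from typing import List, Union

Header = Union[int, List[int]]


def rows_required(header: Header) -> int:
    """Rows the file must contain for the header the caller asked for."""
    last_row = max(header) if isinstance(header, list) else header
    return last_row + 1  # header rows are 0-based


def rows_scanned(header: Header) -> int:
    """Rows the reader looks at (assumption: list headers get one extra row)."""
    extra = 1 if isinstance(header, list) else 0
    return rows_required(header) + extra


def is_too_short(n_lines: int, header: Header) -> bool:
    """Complain only if the requested header itself does not fit in the file."""
    return n_lines < rows_required(header)


# A two-line file with header=[0, 1]: the reader would scan three rows,
# but the requested header fits, so no error is warranted.
print(rows_scanned([0, 1]), is_too_short(2, [0, 1]))

# The case from the code comment: header=3 and a file with only 2 lines.
print(is_too_short(2, 3))
```

The design point is that the length check is made against the header the caller requested, not against the internally extended one.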

@sinhrks added the IO CSV (read_csv, to_csv), MultiIndex, and Bug labels Nov 7, 2016
@jorisvandenbossche
Member

@bkandel You have several tests that are failing now (I think also the one you added).

@bkandel
Contributor Author

bkandel commented Nov 11, 2016

@jorisvandenbossche Sorry for the delay in fixing this -- there was more complexity here than I realized. I think it's getting there.

@@ -80,3 +80,4 @@ Performance Improvements

Bug Fixes
~~~~~~~~~
- Bug in ``pd.read_csv`` where reading files fails if the number of headers is equal to the number of lines in the file (:issue:`14515`)
Contributor

can move to 0.19.2

data = """a,b
"""
df = self.read_csv(StringIO(data), header=[0])
expected = DataFrame(columns=[('a',), ('b',)])
Contributor

I would expect these to resolve to an Index actually (we don't have a single level MI)

In [6]: pd.MultiIndex.from_tuples([('a',), ('b',)])
Out[6]: Index(['a', 'b'], dtype='object')

Contributor Author

That's what I thought too, but I got dtype matching errors when I did that. I'll try to figure out what's going on.

Contributor Author

In [29]: data = 'a,b\n'

In [30]: df = pd.read_csv(StringIO(data), header=[0])

In [31]: df.columns
Out[31]: Index([(u'a',), (u'b',)], dtype='object')

In [32]: df.columns == Index(['a', 'b'])
Out[32]: array([False, False], dtype=bool)

In [33]: df.columns == Index(['a', 'b'], dtype='object')
Out[33]: array([False, False], dtype=bool)

Looks like it's being parsed as a list of tuples, not a MultiIndex, but that's probably incorrect. I'll see if I can fix that.
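
The mismatch in the session above can be reproduced without the parser. This is a minimal sketch: the index of one-element tuples is built by hand to stand in for the intermediate output, and `unwrap_singletons` is a hypothetical helper, not the change that was eventually made in this PR.

```python
import pandas as pd

# Column labels stored as one-element tuples, kept as plain objects
# (tupleize_cols=False stops pandas from building a MultiIndex).
tuple_labels = pd.Index([("a",), ("b",)], tupleize_cols=False)
plain_labels = pd.Index(["a", "b"])

# Element-wise comparison: the tuple ('a',) is not equal to the string 'a'.
mismatch = tuple_labels == plain_labels


def unwrap_singletons(labels: pd.Index) -> pd.Index:
    """Hypothetical normalization: turn 1-tuples into their single element."""
    if all(isinstance(label, tuple) and len(label) == 1 for label in labels):
        return pd.Index([label[0] for label in labels])
    return labels


flat = unwrap_singletons(tuple_labels)
print(mismatch.tolist())
print(flat.equals(plain_labels))
```

Once the labels are unwrapped, the columns compare equal to a flat Index, which is what the reviewer expected for a single header row.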

"""
df2 = self.read_csv(StringIO(data_multiline), header=[0, 1])
expected2 = DataFrame(columns=[('a', 'c'), ('b', 'd')])
tm.assert_frame_equal(df2, expected2)
Contributor

then this works nicely

In [7]: pd.MultiIndex.from_tuples([('a','c'), ('b', 'd')])
Out[7]: 
MultiIndex(levels=[['a', 'b'], ['c', 'd']],
           labels=[[0, 1], [0, 1]])

@codecov-io

codecov-io commented Nov 13, 2016

Current coverage is 85.20% (diff: 100%)

Merging #14596 into master will increase coverage by <.01%

@@             master     #14596   diff @@
==========================================
  Files           143        143          
  Lines         50787      50793     +6   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43273      43280     +7   
+ Misses         7514       7513     -1   
  Partials          0          0          

Powered by Codecov. Last update f26b049...32e3b0a

@jreback
Contributor

jreback commented Nov 15, 2016

@bkandel looks pretty good. ping when all green.

@bkandel
Contributor Author

bkandel commented Nov 15, 2016

@jreback all green. Should be ready for final review now.

Ben Kandel added 12 commits November 21, 2016 20:48
read_csv would fail on files if the number of header lines passed in includes
all the lines in the files. This commit fixes that bug.
A test in test_to_csv checked for the presence of exactly the behavior we're
fixing here: A file with 5 lines that asks for a header of length 5 should work
and return an empty dataframe, not error.
@bkandel
Contributor Author

bkandel commented Nov 22, 2016

@jreback just rebased and should be good to go now.

@jreback jreback added this to the 0.19.2 milestone Nov 22, 2016
@jreback
Contributor

jreback commented Nov 22, 2016

thanks!

@jreback jreback closed this in f862b52 Nov 22, 2016
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Dec 14, 2016
closes pandas-dev#14515

This commit fixes a bug where `read_csv` failed when given a file with
a multiindex header and empty content. Because pandas reads index
names as a separate line following the header lines, the reader looks
for the line with index names in it. If the content of the dataframe
is empty, the reader will choke. This bug surfaced after
pandas-dev#6618 stopped writing an
extra line after multiindex columns, which led to a situation where
pandas could write CSVs that it couldn't then read. This commit
changes that behavior by explicitly checking if the index name row
exists, and processing it correctly if it doesn't.

Author: Ben Kandel <ben.kandel@gmail.com>

Closes pandas-dev#14596 from bkandel/fix-parse-empty-df and squashes the following commits:

32e3b0a [Ben Kandel] lint
e6b1237 [Ben Kandel] lint
fedfff8 [Ben Kandel] fix multiindex column parsing
518982d [Ben Kandel] move to 0.19.2
fc23e5c [Ben Kandel] fix errant this_columns
3d9bbdd [Ben Kandel] whatsnew
68eadf3 [Ben Kandel] Modify test.
17e44dd [Ben Kandel] fix python parser too
72adaf2 [Ben Kandel] remove unnecessary test
bfe0423 [Ben Kandel] typo
2f64d57 [Ben Kandel] pep8
b8200e4 [Ben Kandel] BUG: read_csv with empty df

(cherry picked from commit f862b52)

Successfully merging this pull request may close these issues.

Pandas 0.19 read_csv with header=[0, 1] on an empty df throws error
5 participants