Pandas 0.19 read_csv with header=[0, 1] on an empty df throws error #14515
Pandas 0.19 incorrectly handles empty dataframe files with multi index columns
import pandas as pd
import tempfile

# Build an empty frame, then concatenate it with itself to get MultiIndex columns.
df = pd.DataFrame.from_records([], columns=['col_1', 'col_2'])
joined_df_in = pd.concat([df, df], keys=['a', 'b'], axis=1)
joined_df_in.reset_index(drop=True, inplace=True)

# Write the empty frame to a temporary CSV file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    joined_df_in.to_csv(f.name, index=False)
What the file looks like
# in pandas 0.18.1
pd.read_csv(f.name, header=[0,1])
yields what we expect, an empty MultiIndex data frame
# in pandas 0.19
pd.read_csv(f.name, header=[0,1])
@kaloramik So the change is not in
In versions < 0.19.0, the file looks like:
while in 0.19.0 it looks like (what you showed above):
So previously there was an extra line with empty values. Reading this in with 0.19.0 still gives your desired result of an empty frame:
(however, one could argue that this should actually give you one row of NaNs)
So the change is in
while in 0.18.0 there was an extra line with commas:
This was a bug (since you don't have any data, there should not be a line of missing values), and this bug was fixed in 0.19.0, see #6618
@jorisvandenbossche hmm really? That's not what I'm seeing at all. Is it possible I have a package that's screwing something up? Can you post your pd.show_versions()?
But looking at the behavior, shouldn't the expected behavior be what I posted? As in, if you read in a file of length 2, and your headers take up 2 lines, then it should return an empty df with those columns. I believe the same behavior applies for a single header.
The error message doesn't seem to make sense
it DOES have 2 lines in the file, so it should be able to construct the header. In addition, the source code has the following comment
According to the comment, the function should fail if the file has fewer than len(header) lines, implying that the function should succeed if len(header) == len(lines). Does that sound right?
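To make that reading of the comment concrete, here is a minimal pure-Python sketch of the check as I understand it. It is not the pandas parser code; `split_header` is a hypothetical helper name, and the sample text is what I assume the two-line file contains.

```python
import csv
import io


def split_header(text, n_header):
    """Hypothetical helper: split CSV text into header rows and data rows.

    Fails only when there are fewer lines than header rows, so a file with
    exactly n_header lines yields a header and no data.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < n_header:
        raise ValueError(
            "expected %d header lines, found %d" % (n_header, len(rows))
        )
    return rows[:n_header], rows[n_header:]


# Assumed content of the file written in 0.19: two header lines, no data.
header_rows, data_rows = split_header("a,a,b,b\ncol_1,col_2,col_1,col_2\n", 2)
```

Under this interpretation, len(header) == len(lines) gives an empty list of data rows instead of an error.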
Oh actually, scratch that, you are right about 0.18.1 returning an extra line of commas (And so the read_csv succeeds I guess)
But this breaks behavior now, as in my data pipelines I am unable to write and then read empty dataframes as before. I think the behavior I described above is still the desired one? Unless you have better workarounds? (I don't think replicating the old behavior by forcibly adding a row of commas would be a good idea.)
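One possible workaround, sketched here under assumptions and not taken from pandas itself: check whether the file contains only header lines and, if so, build the empty MultiIndex frame by hand instead of calling read_csv. `read_multiheader_csv` is a hypothetical name, and the sketch takes CSV text rather than a path to stay self-contained.

```python
import csv
import io

import pandas as pd


def read_multiheader_csv(text, n_header=2):
    """Hypothetical wrapper: read CSV text whose first n_header lines are column levels."""
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < n_header:
        raise ValueError("file has fewer lines than header rows")
    if len(rows) == n_header:
        # Header lines only: construct the empty frame directly.
        columns = pd.MultiIndex.from_arrays(rows)
        return pd.DataFrame(columns=columns)
    # At least one data row: let read_csv handle it as usual.
    return pd.read_csv(io.StringIO(text), header=list(range(n_header)))


empty = read_multiheader_csv("a,a,b,b\ncol_1,col_2,col_1,col_2\n")
filled = read_multiheader_csv("a,a,b,b\ncol_1,col_2,col_1,col_2\n1,2,3,4\n")
```

This avoids writing a fake row of commas, at the cost of parsing the file twice when it does contain data.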
Possibly. But I am just pointing out that it is not a change in
Apart from that, it is worth discussing if we should allow this. IMO returning an empty frame is indeed more logical to do.
The bug fix in
Note that also for a single header, once you pass the