Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Parse two date columns broken in read_csv with multiple headers #15376

Closed
stephenrauch opened this issue Feb 12, 2017 · 3 comments
Closed
Labels
Bug IO CSV read_csv, to_csv
Milestone

Comments

@stephenrauch
Copy link
Contributor

Problem description

As can be seen here, a KeyError is being thrown when when trying to merge two date columns from a csv with multiple header lines.

So this csv file:

X,,Y,,Z,
Date,Time,A,B,A,B
2017-01-21,01:57:49.390,0,1,2,3
2017-01-21,01:57:50.400,4,5,7,9
2017-01-21,01:57:51.410,3,2,4,1

Causes this line:

pandas.read_csv('foo.csv', header=[0,1], parse_dates={'datetime': [0,1]})

to raise a KeyError

NOTE: I will be submitting a PR shortly.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.5.1.final.0 python-bits: 32 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel byteorder: little

pandas: 0.19.2
nose: 1.3.7
pip: 8.1.2
setuptools: 18.2
Cython: None
numpy: 1.12.0
dateutil: 2.6.0
pytz: 2016.6.1

@jreback
Copy link
Contributor

jreback commented Feb 12, 2017

so your csv is invalid as far as multi-line parsing goes. It 'works' but is not very useful.

A multi-line csv header needs non-sparsity (this is in fact how '.to_csv' writes it).

In [16]: data
Out[16]: 'X,X,Y,Y,Z,Z\nDate,Time,A,B,A,B\n2017-01-21,01:57:49.390,0,1,2,3\n2017-01-21,01:57:50.400,4,5,7,9\n2017-01-21,01:57:51.410,3,2,4,1\n'

In [17]: pandas.read_csv(StringIO(data), header=[0,1], parse_dates={'datetime' : [('X', 'Date'), ('X', 'Time')]})
Out[17]: 
                 datetime  (Y, A)  (Y, B)  (Z, A)  (Z, B)
0 2017-01-21 01:57:49.390       0       1       2       3
1 2017-01-21 01:57:50.400       4       5       7       9
2 2017-01-21 01:57:51.410       3       2       4       1

I think we have a an issue to parse it with the sparsity but not sure.

So it parses correctly when the names are given (as tuples). I suppose its a bug that parse_dates doesn't handle the column numbers.

And so your original example parses when the columns are fully declared. (though again not very useful).

In [22]: pandas.read_csv(StringIO(data), header=[0,1], parse_dates={'datetime' : [('X', 'Date'), ('Unnamed: 1_level_0', 'Time')]})
Out[22]: 
                 datetime  (Y, A)  (Unnamed: 3_level_0, B)  (Z, A)  (Unnamed: 5_level_0, B)
0 2017-01-21 01:57:49.390       0                        1       2                        3
1 2017-01-21 01:57:50.400       4                        5       7                        9
2 2017-01-21 01:57:51.410       3                        2       4                        1

Further make a single level is just not useful in a multi-level frame.

I would prob do this:

In [25]: pandas.read_csv(StringIO(data), header=0, skiprows=1, parse_dates={'datetime':[0,1]})
Out[25]: 
                 datetime  A  B  A.1  B.1
0 2017-01-21 01:57:49.390  0  1    2    3
1 2017-01-21 01:57:50.400  4  5    7    9
2 2017-01-21 01:57:51.410  3  2    4    1

So in summary I'll mark this as a bug, but you have other issues.
cc @gfyoung

@jreback jreback added this to the Next Major Release milestone Feb 12, 2017
@stephenrauch
Copy link
Contributor Author

Wow, thanks for nice response. Do you mind if I cross post this response on Stack Overflow?

Also, the problem exists with/without the sparsity. It is the tuple causing the problem. A proposed fix is running in the CI now, I will submit a PR shortly.

Thanks again for your quick and very detailed attention.

@jreback
Copy link
Contributor

jreback commented Feb 12, 2017

sure. Its not the tuple, rather the positional indexers which look up the tuple.

@jreback jreback modified the milestones: 0.20.0, Next Major Release Feb 16, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
In `io/parsers/_try_convert_dates()` when selecting
columns based on a  column index from a set of columns with multi-
level names, the column  `name` was converted to a string.  This
appears to be a bug since the  `name` was a tuple before the
conversion.  This causes problems  downstream when there is an attempt
to use this name to lookup a  column, and that lookup fails because
the desired column is keyed from  the tuple, not its string
representation

closes pandas-dev#15376

Author: Stephen Rauch <stephen.rauch+github@gmail.com>

Closes pandas-dev#15378 from stephenrauch/fix_read_csv_merge_datetime and squashes the following commits:

030f5ec [Stephen Rauch] BUG: Parse two date columns broken in read_csv with multiple headers
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants