BUG: Parse two date columns broken in read_csv with multiple headers #15376

stephenrauch · 2017-02-12T15:42:48Z

Problem description

As can be seen here, a KeyError is being thrown when when trying to merge two date columns from a csv with multiple header lines.

So this csv file:

X,,Y,,Z,
Date,Time,A,B,A,B
2017-01-21,01:57:49.390,0,1,2,3
2017-01-21,01:57:50.400,4,5,7,9
2017-01-21,01:57:51.410,3,2,4,1

Causes this line:

pandas.read_csv('foo.csv', header=[0,1], parse_dates={'datetime': [0,1]})

to raise a KeyError

NOTE: I will be submitting a PR shortly.

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 3.5.1.final.0 python-bits: 32 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel byteorder: little

pandas: 0.19.2
nose: 1.3.7
pip: 8.1.2
setuptools: 18.2
Cython: None
numpy: 1.12.0
dateutil: 2.6.0
pytz: 2016.6.1

The text was updated successfully, but these errors were encountered:

jreback · 2017-02-12T16:01:20Z

so your csv is invalid as far as multi-line parsing goes. It 'works' but is not very useful.

A multi-line csv header needs non-sparsity (this is in fact how '.to_csv' writes it).

In [16]: data
Out[16]: 'X,X,Y,Y,Z,Z\nDate,Time,A,B,A,B\n2017-01-21,01:57:49.390,0,1,2,3\n2017-01-21,01:57:50.400,4,5,7,9\n2017-01-21,01:57:51.410,3,2,4,1\n'

In [17]: pandas.read_csv(StringIO(data), header=[0,1], parse_dates={'datetime' : [('X', 'Date'), ('X', 'Time')]})
Out[17]: 
                 datetime  (Y, A)  (Y, B)  (Z, A)  (Z, B)
0 2017-01-21 01:57:49.390       0       1       2       3
1 2017-01-21 01:57:50.400       4       5       7       9
2 2017-01-21 01:57:51.410       3       2       4       1

I think we have a an issue to parse it with the sparsity but not sure.

So it parses correctly when the names are given (as tuples). I suppose its a bug that parse_dates doesn't handle the column numbers.

And so your original example parses when the columns are fully declared. (though again not very useful).

In [22]: pandas.read_csv(StringIO(data), header=[0,1], parse_dates={'datetime' : [('X', 'Date'), ('Unnamed: 1_level_0', 'Time')]})
Out[22]: 
                 datetime  (Y, A)  (Unnamed: 3_level_0, B)  (Z, A)  (Unnamed: 5_level_0, B)
0 2017-01-21 01:57:49.390       0                        1       2                        3
1 2017-01-21 01:57:50.400       4                        5       7                        9
2 2017-01-21 01:57:51.410       3                        2       4                        1

Further make a single level is just not useful in a multi-level frame.

I would prob do this:

In [25]: pandas.read_csv(StringIO(data), header=0, skiprows=1, parse_dates={'datetime':[0,1]})
Out[25]: 
                 datetime  A  B  A.1  B.1
0 2017-01-21 01:57:49.390  0  1    2    3
1 2017-01-21 01:57:50.400  4  5    7    9
2 2017-01-21 01:57:51.410  3  2    4    1

So in summary I'll mark this as a bug, but you have other issues.
cc @gfyoung

stephenrauch · 2017-02-12T16:38:23Z

Wow, thanks for nice response. Do you mind if I cross post this response on Stack Overflow?

Also, the problem exists with/without the sparsity. It is the tuple causing the problem. A proposed fix is running in the CI now, I will submit a PR shortly.

Thanks again for your quick and very detailed attention.

jreback · 2017-02-12T16:52:56Z

sure. Its not the tuple, rather the positional indexers which look up the tuple.

In `io/parsers/_try_convert_dates()` when selecting columns based on a column index from a set of columns with multi- level names, the column `name` was converted to a string. This appears to be a bug since the `name` was a tuple before the conversion. This causes problems downstream when there is an attempt to use this name to lookup a column, and that lookup fails because the desired column is keyed from the tuple, not its string representation closes pandas-dev#15376 Author: Stephen Rauch <stephen.rauch+github@gmail.com> Closes pandas-dev#15378 from stephenrauch/fix_read_csv_merge_datetime and squashes the following commits: 030f5ec [Stephen Rauch] BUG: Parse two date columns broken in read_csv with multiple headers

jreback added Bug Difficulty Novice IO CSV read_csv, to_csv labels Feb 12, 2017

jreback added this to the Next Major Release milestone Feb 12, 2017

stephenrauch mentioned this issue Feb 12, 2017

BUG: Parse two date columns broken in read_csv with multiple headers #15378

Closed

4 tasks

jreback modified the milestones: 0.20.0, Next Major Release Feb 16, 2017

jreback closed this as completed in fb7dc7d Feb 27, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Parse two date columns broken in read_csv with multiple headers #15376

BUG: Parse two date columns broken in read_csv with multiple headers #15376

stephenrauch commented Feb 12, 2017

jreback commented Feb 12, 2017 •

edited

Loading

stephenrauch commented Feb 12, 2017

jreback commented Feb 12, 2017

BUG: Parse two date columns broken in read_csv with multiple headers #15376

BUG: Parse two date columns broken in read_csv with multiple headers #15376

Comments

stephenrauch commented Feb 12, 2017

Problem description

Output of pd.show_versions()

jreback commented Feb 12, 2017 • edited Loading

stephenrauch commented Feb 12, 2017

jreback commented Feb 12, 2017

Output of `pd.show_versions()`

jreback commented Feb 12, 2017 •

edited

Loading