ENH: Excel - allow for multiple rows to be treated as hierarchical columns #4679

jtratner · 2013-08-26T21:52:25Z

related #4468

Add keyword argument to ExcelFile parser to take an integer / list for the rows to interpret as header rows. If more than one, interpret as hierarchical columns / MultiIndex.

Presumably this would also allow you to round-trip Data Frame with hierarchical columns.

Basically, if you have something spanning two columns, just converts to two cells with the data in the original cell, ending up just like what you need for csv reader.

jreback · 2013-08-26T22:08:30Z

http://pandas.pydata.org/pandas-docs/dev/io.html#reading-columns-with-a-multiindex

(but for csv), and ths might/probably needs special handling for excel

jtratner · 2013-08-26T22:15:26Z

only special handling would be converting merged cells into repeated entries like csv, so this is relatively minor.

I.e.

____________________
| bar    |   baz    |
| A | B  | C | D | E|
|___________________|

just needs to change to something like

[['bar', 'bar', 'baz', 'baz', 'baz'], ['A', 'B', 'C', 'D', 'E']]

under the hood

jtratner · 2013-08-26T22:16:15Z

so, really, a function that takes in merged cell and splits it into individual cells all with the same value would be sufficient to take advantage of csv's existing behavior.

jreback · 2013-08-26T22:18:33Z

@jtratner I think that is right, in your example header=[0,1] if it produces the output you put then should parse to a mi

related is the reverse (in to_excel); again exists in to_csv, but would need porting (then could round-trip)

jtratner · 2013-08-26T22:18:43Z

@cancan101 interested in implementing this? just a minor modification of your get_effective_cell function.

cancan101 · 2013-08-26T22:25:51Z

I can take a look at this. I am equally interested in solving this for HTML files, for example: http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm#CCSCI

jtratner · 2013-08-26T22:39:14Z

Yeah, that's basically the same thing, you just want to end up with the following arrays.

>>> data = [
    ['Three months ended April 30','Three months ended April 30',
     'Six months ended April 30', 'Six months ended April 30'],
    ['2013', '2012', '2013', '2012']
]
>>> MultiIndex.from_arrays(data)
MultiIndex
[(u'Three months ended April 30', u'2013'), (u'Three months ended April 30', u'2012'), (u'Six months ended April 30', u'2013'), (u'Six months ended April 30', u'2012')]

So if you have something like:

<td colspan=2>Span2</td>

You want to convert that into 2 cells with text 'Span2'

cancan101 · 2013-08-26T22:42:28Z

Exactly. I am going to create a similar issues to this one for HTML. FWIW, It would be great to merge the IO backends so that functionality like this can be shared. See: #4682. Shoot closing the other issue.

jtratner · 2013-08-26T22:46:00Z

@cancan101 well, I believe they mostly are, they just pass to a TextReader which does the majority of the work. (so, for example, the ExcelFile reader has to do some magic to convert all the values to a list of lists that can be passed to text reader). I think you could do both of these in the same issue and then refactor the multiindex creation methods from read_csv out for something they can all use - check out code around here for how it works under the hood (I think):

https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L703

cancan101 · 2013-08-26T22:47:50Z

I believe that I looked at the parsers backends and that some, but not all, of the parsers use TextReader. I believe that HTML parser does not use TextReader

cancan101 · 2013-08-26T22:58:43Z

Now comes the other reason for improving the ExcelParser and/or the HTML parser: parsing hierarchical row indexes. A good example of this would be (different link from above): http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm#CCSE.

This table has 4 major sections (they can be identified as lines with no other data):

Net revenue:
Costs and expenses:
Net earnings per share:
Weighted-average shares used to compute net earnings per share:

Within the first of those sections are a number of lines items and a section total.

A good feature of the parser would be to extract this structure from the table. Obviously this is non-trivial.

jtratner · 2013-08-26T23:31:42Z

@cancan101 hierarchical row indexes are almost the same thing as hierarchical columns - the goal is to turn the input into something similar to what csv takes for hierarchical indexes. You just need to convert rowspan/horizontally-merged cells into repeated horizontal cells.

cancan101 · 2013-08-26T23:34:26Z

@jtratner That is not actually the issue here. In this case there is no explicit indication in the table itself that the rows should be grouped in a certain manner. For example there is no rowspan / merged cell. In this case the user would need to supply some additional information: for example to treat the heading in an empty row as a new level in the hierarchy,

jtratner · 2013-08-27T00:01:20Z

@cancan101 that's probably outside the scope of what the pandas parser could do [if it's that complicated, user probably should handle the cleaning after that] - I'd suggest going for a very simple implementation to start out with (i.e., take colspan or merge length --> convert it to individual cells --> use read_csv's existing functionality to get a multiindex going) then after you have that working, you can consider what else makes sense to add.

cancan101 · 2013-08-27T00:04:41Z

@jtratner The above is unrelated to the hierarchical columns. That would be another issues. The simplest option might be split the table on those empty lines. For example have some options for the above link to split out 3 tables rather than one. Alternatively the user could attempt to detect this himself this given that that row will have a lot of nans for values.

jtratner · 2013-08-27T00:31:49Z

@cancan101 can you try passing a list to header and see if it converts to mi? it may work already.

cancan101 · 2013-08-27T16:22:09Z

@jratner I will take a look. A pre-requisite of dealing with merged cells in Excel is my PR #4439. It likely needs cleaning up.

cpcloud · 2013-08-27T20:35:03Z

@cancan101 read_html does parse header=[a list of integers] into a MultiIndex:

In [16]: url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'

In [17]: df = read_html(url,header=[0,1])[0]

In [18]: type(df.columns)
Out[18]: pandas.core.index.MultiIndex

In [21]: df
Out[21]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 510 entries, 2 to 511
Data columns (total 7 columns):
(Sunrise Bank of Arizona, Community South Bank)                 510  non-null values
(Phoenix, Parsons)                                              510  non-null values
(AZ, TN)                                                        510  non-null values
(34707, 19849)                                                  510  non-null values
(First Fidelity Bank, National Association, CB&S Bank, Inc.)    510  non-null values
(2013-08-23 00:00:00, 2013-08-23 00:00:00)                      510  non-null values
(2013-08-27 00:00:00, 2013-08-27 00:00:00)                      510  non-null values
dtypes: datetime64[ns](2), int64(1), object(4)

If there's a more specific issue that you've come across, please create a new one just for that. I (and I think the other devs) would prefer to keep the IO backends' issues separate from each other.

Here are the issues:

Refactor read_html to use TextParser
Excel merging cells etc.
Support the use of colspan to create a MultiIndex for read_html.

(1 should be done before 3)

I'll create 1, 3 is #4683, and I think you should create 2 since you've got the clearest idea of what the issue is 😄.

cancan101 · 2013-08-28T03:28:15Z

@cpcloud That page is an interesting example. Why is this not the first row of the header:

Bank Name   City    ST  CERT    Acquiring Institution   Closing Date    Updated Date

cpcloud · 2013-08-28T12:38:58Z

@cancan101 That is the first row of the header if you don't pass in the header argument. The header is not counted as part of the data, currently. This is a bug, or at least an inconsistency. This will be fixed in the refactor of read_html to use TextParser:

In [9]: url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'

In [10]: df = read_html(url, header=[0, 1])[0]

In [11]: df
Out[11]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 510 entries, 0 to 509
Data columns (total 7 columns):
(Bank Name, Sunrise Bank of Arizona)                                  510  non-null values
(City, Phoenix)                                                       510  non-null values
(ST, AZ)                                                              510  non-null values
(CERT, 34707)                                                         510  non-null values
(Acquiring Institution, First Fidelity Bank, National Association)    510  non-null values
(Closing Date, August 23, 2013)                                       510  non-null values
(Updated Date, August 27, 2013)                                       510  non-null values
dtypes: datetime64[ns](2), int64(1), object(4)

In [12]: df = read_html(url)[0]

In [13]: df
Out[13]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 512 entries, 0 to 511
Data columns (total 7 columns):
Bank Name                512  non-null values
City                     512  non-null values
ST                       512  non-null values
CERT                     512  non-null values
Acquiring Institution    512  non-null values
Closing Date             512  non-null values
Updated Date             512  non-null values
dtypes: datetime64[ns](2), int64(1), object(4)

cancan101 · 2013-08-28T12:48:36Z

@cpcloud I am not sure if this feature make sense: auto detecting the number of header rows based upon the table's use of hr or thead. For example passing: header="auto". While not all tables use these tags, it would be helpful to use them when they exist.

cpcloud · 2013-08-28T12:58:41Z

@cancan101 Right. There's a couple of things you should know, just so that you understand why/how the header parsing is happening. There's no dependence on thead elements. The parse is structured so that a tuple of lists of elements is returned from all the HTML parsers, i.e.,

header, body, footer = data

Since I have control over where I place the header list, it can be the first row of the DataFrame instance, which conveniently is the default header keyword argument to TextParser and I believe all of the other IO functions with a header parameter.

You're right that not all tables use the thead element, however I've found that many do; much more so than, say, the caption element, and especially in cases where tables are used as data tables (as opposed to being used to style the page).

So, it's not that the number of header rows is detected. One thing that there isn't a test case for is multiple header rows, which is partially what some of these issues are about.

cancan101 · 2013-08-28T13:07:06Z

@cpcloud I also believe that some tables use th without using thead. I am not sure if the converse is true: using tds withing a thead.

cpcloud · 2013-08-28T13:16:20Z

@cancan101 Yes, that case is covered. th and thead are both parsed as headers cells. td within a thead/th should be okay. If it isn't please open an issue and I'll take a look.

hayd · 2014-06-03T04:59:31Z

This came up here: http://stackoverflow.com/q/23703638/1240268 (with a strange html table example)

ENH: read_excel MultiIndex #4679

cancan101 mentioned this issue Aug 26, 2013

ENH: HTML - parse multiple rows in header as MultiIndex columns #4682

Closed

jtratner mentioned this issue Aug 26, 2013

get_effective_cell for getting the contents of Excel cell when the cell is merged #4673

Closed

This was referenced Aug 26, 2013

ENH Factored out excel_value_to_python_value from ExcelFile::_parse_exce... #4590

Merged

Support for Merging Columns in HTML Parser #4683

Closed

cpcloud mentioned this issue Aug 27, 2013

Refactor read_html to use TextParser #4697

Closed

cpcloud mentioned this issue Aug 28, 2013

inconsistency of header argument of read_html with other read_* functions #4700

Closed

cancan101 mentioned this issue Oct 17, 2013

ENH: Better handling of MultiIndex with Excel #5254

Closed

jreback closed this as completed Jan 3, 2014

jreback reopened this Jan 3, 2014

jreback modified the milestones: 0.15.0, 0.14.0 Feb 26, 2014

jtratner mentioned this issue Sep 19, 2014

Bloomberg Hackathon #8323

Closed

jreback added the Good as first PR label Sep 21, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

gdementen mentioned this issue Mar 12, 2015

fixed header=list (to create a MultiIndex) for read_excel #9637

Closed

jreback mentioned this issue Apr 2, 2015

ERR: raise on mutli-index columns as its not implemented in to_excel #9794

Closed

jreback mentioned this issue Jun 23, 2015

Writing as Excel with a MultiIndex is not yet implemented. #10410

Closed

atheyjohnc mentioned this issue Jul 8, 2015

Reading from Excel mangles columns #10523

Closed

chris-b1 mentioned this issue Sep 2, 2015

ENH: read_excel MultiIndex #4679 #10967

Merged

jreback modified the milestones: 0.17.0, Next Major Release Sep 3, 2015

chris-b1 added a commit to chris-b1/pandas that referenced this issue Sep 5, 2015

ENH: read_excel MultiIndex pandas-dev#4679

1121fd5

chris-b1 added a commit to chris-b1/pandas that referenced this issue Sep 9, 2015

ENH: read_excel MultiIndex pandas-dev#4679

98405f0

jreback closed this as completed in #10967 Sep 9, 2015

jreback added a commit that referenced this issue Sep 9, 2015

Merge pull request #10967 from chris-b1/excel-read-multiindex

0e56279

ENH: read_excel MultiIndex #4679

nickeubank pushed a commit to nickeubank/pandas that referenced this issue Sep 29, 2015

ENH: read_excel MultiIndex pandas-dev#4679

e18288e

jowens mentioned this issue Jul 22, 2017

ENH: read_html to handle rowspan, colspan #17054

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Excel - allow for multiple rows to be treated as hierarchical columns #4679

ENH: Excel - allow for multiple rows to be treated as hierarchical columns #4679

jtratner commented Aug 26, 2013

jreback commented Aug 26, 2013

jtratner commented Aug 26, 2013

jtratner commented Aug 26, 2013

jreback commented Aug 26, 2013

jtratner commented Aug 26, 2013

cancan101 commented Aug 26, 2013

jtratner commented Aug 26, 2013

cancan101 commented Aug 26, 2013

jtratner commented Aug 26, 2013

cancan101 commented Aug 26, 2013

cancan101 commented Aug 26, 2013

jtratner commented Aug 26, 2013

cancan101 commented Aug 26, 2013

jtratner commented Aug 27, 2013

cancan101 commented Aug 27, 2013

jtratner commented Aug 27, 2013

cancan101 commented Aug 27, 2013

cpcloud commented Aug 27, 2013

cancan101 commented Aug 28, 2013

cpcloud commented Aug 28, 2013

cancan101 commented Aug 28, 2013

cpcloud commented Aug 28, 2013

cancan101 commented Aug 28, 2013

cpcloud commented Aug 28, 2013

hayd commented Jun 3, 2014

ENH: Excel - allow for multiple rows to be treated as hierarchical columns #4679

ENH: Excel - allow for multiple rows to be treated as hierarchical columns #4679

Comments

jtratner commented Aug 26, 2013

jreback commented Aug 26, 2013

jtratner commented Aug 26, 2013

jtratner commented Aug 26, 2013

jreback commented Aug 26, 2013

jtratner commented Aug 26, 2013

cancan101 commented Aug 26, 2013

jtratner commented Aug 26, 2013

cancan101 commented Aug 26, 2013

jtratner commented Aug 26, 2013

cancan101 commented Aug 26, 2013

cancan101 commented Aug 26, 2013

jtratner commented Aug 26, 2013

cancan101 commented Aug 26, 2013

jtratner commented Aug 27, 2013

cancan101 commented Aug 27, 2013

jtratner commented Aug 27, 2013

cancan101 commented Aug 27, 2013

cpcloud commented Aug 27, 2013

cancan101 commented Aug 28, 2013

cpcloud commented Aug 28, 2013

cancan101 commented Aug 28, 2013

cpcloud commented Aug 28, 2013

cancan101 commented Aug 28, 2013

cpcloud commented Aug 28, 2013

hayd commented Jun 3, 2014