Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Excel - allow for multiple rows to be treated as hierarchical columns #4679

Closed
jtratner opened this issue Aug 26, 2013 · 52 comments · Fixed by #10967
Closed

ENH: Excel - allow for multiple rows to be treated as hierarchical columns #4679

jtratner opened this issue Aug 26, 2013 · 52 comments · Fixed by #10967
Labels
API Design Enhancement IO Data IO issues that don't fit into a more specific label IO Excel read_excel, to_excel
Milestone

Comments

@jtratner
Copy link
Contributor

related #4468

Add keyword argument to ExcelFile parser to take an integer / list for the rows to interpret as header rows. If more than one, interpret as hierarchical columns / MultiIndex.

Presumably this would also allow you to round-trip Data Frame with hierarchical columns.

Basically, if you have something spanning two columns, just converts to two cells with the data in the original cell, ending up just like what you need for csv reader.

@jreback
Copy link
Contributor

jreback commented Aug 26, 2013

http://pandas.pydata.org/pandas-docs/dev/io.html#reading-columns-with-a-multiindex

(but for csv), and ths might/probably needs special handling for excel

@jtratner
Copy link
Contributor Author

only special handling would be converting merged cells into repeated entries like csv, so this is relatively minor.

I.e.

____________________
| bar    |   baz    |
| A | B  | C | D | E|
|___________________|

just needs to change to something like

[['bar', 'bar', 'baz', 'baz', 'baz'], ['A', 'B', 'C', 'D', 'E']]

under the hood

@jtratner
Copy link
Contributor Author

so, really, a function that takes in merged cell and splits it into individual cells all with the same value would be sufficient to take advantage of csv's existing behavior.

@jreback
Copy link
Contributor

jreback commented Aug 26, 2013

@jtratner I think that is right, in your example header=[0,1] if it produces the output you put then should parse to a mi

related is the reverse (in to_excel); again exists in to_csv, but would need porting (then could round-trip)

@jtratner
Copy link
Contributor Author

@cancan101 interested in implementing this? just a minor modification of your get_effective_cell function.

@cancan101
Copy link
Contributor

I can take a look at this. I am equally interested in solving this for HTML files, for example: http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm#CCSCI

@jtratner
Copy link
Contributor Author

Yeah, that's basically the same thing, you just want to end up with the following arrays.

>>> data = [
    ['Three months ended April 30','Three months ended April 30',
     'Six months ended April 30', 'Six months ended April 30'],
    ['2013', '2012', '2013', '2012']
]
>>> MultiIndex.from_arrays(data)
MultiIndex
[(u'Three months ended April 30', u'2013'), (u'Three months ended April 30', u'2012'), (u'Six months ended April 30', u'2013'), (u'Six months ended April 30', u'2012')]

So if you have something like:

<td colspan=2>Span2</td>

You want to convert that into 2 cells with text 'Span2'

@cancan101
Copy link
Contributor

Exactly. I am going to create a similar issues to this one for HTML. FWIW, It would be great to merge the IO backends so that functionality like this can be shared. See: #4682. Shoot closing the other issue.

@jtratner
Copy link
Contributor Author

@cancan101 well, I believe they mostly are, they just pass to a TextReader which does the majority of the work. (so, for example, the ExcelFile reader has to do some magic to convert all the values to a list of lists that can be passed to text reader). I think you could do both of these in the same issue and then refactor the multiindex creation methods from read_csv out for something they can all use - check out code around here for how it works under the hood (I think):

https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L703

@cancan101
Copy link
Contributor

I believe that I looked at the parsers backends and that some, but not all, of the parsers use TextReader. I believe that HTML parser does not use TextReader

@cancan101
Copy link
Contributor

Now comes the other reason for improving the ExcelParser and/or the HTML parser: parsing hierarchical row indexes. A good example of this would be (different link from above): http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm#CCSE.

This table has 4 major sections (they can be identified as lines with no other data):

  1. Net revenue:
  2. Costs and expenses:
  3. Net earnings per share:
  4. Weighted-average shares used to compute net earnings per share:

Within the first of those sections are a number of lines items and a section total.

A good feature of the parser would be to extract this structure from the table. Obviously this is non-trivial.

@jtratner
Copy link
Contributor Author

@cancan101 hierarchical row indexes are almost the same thing as hierarchical columns - the goal is to turn the input into something similar to what csv takes for hierarchical indexes. You just need to convert rowspan/horizontally-merged cells into repeated horizontal cells.

@cancan101
Copy link
Contributor

@jtratner That is not actually the issue here. In this case there is no explicit indication in the table itself that the rows should be grouped in a certain manner. For example there is no rowspan / merged cell. In this case the user would need to supply some additional information: for example to treat the heading in an empty row as a new level in the hierarchy,

@jtratner
Copy link
Contributor Author

@cancan101 that's probably outside the scope of what the pandas parser could do [if it's that complicated, user probably should handle the cleaning after that] - I'd suggest going for a very simple implementation to start out with (i.e., take colspan or merge length --> convert it to individual cells --> use read_csv's existing functionality to get a multiindex going) then after you have that working, you can consider what else makes sense to add.

@cancan101
Copy link
Contributor

@jtratner The above is unrelated to the hierarchical columns. That would be another issues. The simplest option might be split the table on those empty lines. For example have some options for the above link to split out 3 tables rather than one. Alternatively the user could attempt to detect this himself this given that that row will have a lot of nans for values.

@jtratner
Copy link
Contributor Author

@cancan101 can you try passing a list to header and see if it converts to mi? it may work already.

@cancan101
Copy link
Contributor

@jratner I will take a look. A pre-requisite of dealing with merged cells in Excel is my PR #4439. It likely needs cleaning up.

@cpcloud
Copy link
Member

cpcloud commented Aug 27, 2013

@cancan101 read_html does parse header=[a list of integers] into a MultiIndex:

In [16]: url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'

In [17]: df = read_html(url,header=[0,1])[0]

In [18]: type(df.columns)
Out[18]: pandas.core.index.MultiIndex

In [21]: df
Out[21]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 510 entries, 2 to 511
Data columns (total 7 columns):
(Sunrise Bank of Arizona, Community South Bank)                 510  non-null values
(Phoenix, Parsons)                                              510  non-null values
(AZ, TN)                                                        510  non-null values
(34707, 19849)                                                  510  non-null values
(First Fidelity Bank, National Association, CB&S Bank, Inc.)    510  non-null values
(2013-08-23 00:00:00, 2013-08-23 00:00:00)                      510  non-null values
(2013-08-27 00:00:00, 2013-08-27 00:00:00)                      510  non-null values
dtypes: datetime64[ns](2), int64(1), object(4)

If there's a more specific issue that you've come across, please create a new one just for that. I (and I think the other devs) would prefer to keep the IO backends' issues separate from each other.

Here are the issues:

  1. Refactor read_html to use TextParser
  2. Excel merging cells etc.
  3. Support the use of colspan to create a MultiIndex for read_html.

(1 should be done before 3)

I'll create 1, 3 is #4683, and I think you should create 2 since you've got the clearest idea of what the issue is 😄.

@cancan101
Copy link
Contributor

@cpcloud That page is an interesting example. Why is this not the first row of the header:

Bank Name   City    ST  CERT    Acquiring Institution   Closing Date    Updated Date

@cpcloud
Copy link
Member

cpcloud commented Aug 28, 2013

@cancan101 That is the first row of the header if you don't pass in the header argument. The header is not counted as part of the data, currently. This is a bug, or at least an inconsistency. This will be fixed in the refactor of read_html to use TextParser:

In [9]: url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'

In [10]: df = read_html(url, header=[0, 1])[0]

In [11]: df
Out[11]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 510 entries, 0 to 509
Data columns (total 7 columns):
(Bank Name, Sunrise Bank of Arizona)                                  510  non-null values
(City, Phoenix)                                                       510  non-null values
(ST, AZ)                                                              510  non-null values
(CERT, 34707)                                                         510  non-null values
(Acquiring Institution, First Fidelity Bank, National Association)    510  non-null values
(Closing Date, August 23, 2013)                                       510  non-null values
(Updated Date, August 27, 2013)                                       510  non-null values
dtypes: datetime64[ns](2), int64(1), object(4)

In [12]: df = read_html(url)[0]

In [13]: df
Out[13]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 512 entries, 0 to 511
Data columns (total 7 columns):
Bank Name                512  non-null values
City                     512  non-null values
ST                       512  non-null values
CERT                     512  non-null values
Acquiring Institution    512  non-null values
Closing Date             512  non-null values
Updated Date             512  non-null values
dtypes: datetime64[ns](2), int64(1), object(4)

@cancan101
Copy link
Contributor

@cpcloud I am not sure if this feature make sense: auto detecting the number of header rows based upon the table's use of hr or thead. For example passing: header="auto". While not all tables use these tags, it would be helpful to use them when they exist.

@cpcloud
Copy link
Member

cpcloud commented Aug 28, 2013

@cancan101 Right. There's a couple of things you should know, just so that you understand why/how the header parsing is happening. There's no dependence on thead elements. The parse is structured so that a tuple of lists of elements is returned from all the HTML parsers, i.e.,

header, body, footer = data

Since I have control over where I place the header list, it can be the first row of the DataFrame instance, which conveniently is the default header keyword argument to TextParser and I believe all of the other IO functions with a header parameter.

You're right that not all tables use the thead element, however I've found that many do; much more so than, say, the caption element, and especially in cases where tables are used as data tables (as opposed to being used to style the page).

So, it's not that the number of header rows is detected. One thing that there isn't a test case for is multiple header rows, which is partially what some of these issues are about.

@cancan101
Copy link
Contributor

@cpcloud I also believe that some tables use th without using thead. I am not sure if the converse is true: using tds withing a thead.

@cpcloud
Copy link
Member

cpcloud commented Aug 28, 2013

@cancan101 Yes, that case is covered. th and thead are both parsed as headers cells. td within a thead/th should be okay. If it isn't please open an issue and I'll take a look.

@hayd
Copy link
Contributor

hayd commented Jun 3, 2014

This came up here: http://stackoverflow.com/q/23703638/1240268 (with a strange html table example)

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jreback jreback modified the milestones: 0.17.0, Next Major Release Sep 3, 2015
chris-b1 added a commit to chris-b1/pandas that referenced this issue Sep 5, 2015
chris-b1 added a commit to chris-b1/pandas that referenced this issue Sep 9, 2015
jreback added a commit that referenced this issue Sep 9, 2015
nickeubank pushed a commit to nickeubank/pandas that referenced this issue Sep 29, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Enhancement IO Data IO issues that don't fit into a more specific label IO Excel read_excel, to_excel
Projects
None yet
6 participants