read_html() doesn't handle tables with multiple header rows #13434

Closed
tdszyman opened this Issue Jun 13, 2016 · 2 comments

tdszyman commented Jun 13, 2016 edited

The read_html() function seems to treat every <th> in a table as a column, even if they occur in separate <tr>s. This means that it breaks even on simple tables generated by pandas' to_html() function.

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame(
    columns=["Name", "Age", "Party"],
    data=[("Hillary", 68, "D"), ("Bernie", 74, "D"), ("Donald", 69, "R")])
df = df.set_index("Name")
html = df.to_html()  # emits two header rows: column names, then the index name
df2 = pd.read_html(html)[0]
print(df2)

This is the value of html, generated by the to_html() function on the original data frame:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Age</th>
      <th>Party</th>
    </tr>
    <tr>
      <th>Name</th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Hillary</th>
      <td>68</td>
      <td>D</td>
    </tr>
...
  </tbody>
</table>

And this is the printed output of the newly-parsed dataframe df2:

  Unnamed: 0  Age Party  Name  Unnamed: 4  Unnamed: 5
0    Hillary   68     D   NaN         NaN         NaN
1     Bernie   74     D   NaN         NaN         NaN
2     Donald   69     R   NaN         NaN         NaN

What happens is that to_html() produces an HTML table with two header rows: one for the column names and one for the index name. However, the read_html() parser interprets each individual <th> cell in the header as a separate column, resulting in twice the expected number of columns. Even worse, this produces a column with the same name as the original index but without any data.
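To make the mismatch concrete, here is a minimal sketch that uses only Python's standard-library html.parser. It is not the code path read_html() takes; CellCounter is a hypothetical helper written for illustration, and HTML is an abbreviated copy of the table above with a single body row. It compares the number of <th> cells in the <thead> with the number of cells in each <tr>.

```python
from html.parser import HTMLParser

# Abbreviated copy of the to_html() output shown above (one body row kept).
HTML = """
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Age</th>
      <th>Party</th>
    </tr>
    <tr>
      <th>Name</th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Hillary</th>
      <td>68</td>
      <td>D</td>
    </tr>
  </tbody>
</table>
"""


class CellCounter(HTMLParser):
    """Hypothetical helper: counts header <th> cells and cells per <tr>."""

    def __init__(self):
        super().__init__()
        self.in_thead = False
        self.thead_th_total = 0  # every <th> inside <thead>, flattened
        self.cells_per_row = []  # one entry per <tr>, header and body alike

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self.in_thead = True
        elif tag == "tr":
            self.cells_per_row.append(0)
        elif tag in ("th", "td"):
            self.cells_per_row[-1] += 1
            if tag == "th" and self.in_thead:
                self.thead_th_total += 1

    def handle_endtag(self, tag):
        if tag == "thead":
            self.in_thead = False


counter = CellCounter()
counter.feed(HTML)
print(counter.thead_th_total)  # 6
print(counter.cells_per_row)   # [3, 3, 3]
```

Flattening the header yields six cells, which matches the six columns in df2, while every row on its own has only three.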

Expected Output

The read_html parser could either treat the multi-row header fully correctly:

         Age Party
Name              
Hillary   68     D
Bernie    74     D
Donald    69     R

Or it could just ignore any rows after the first one:

  Unnamed: 0  Age Party
0    Hillary   68     D
1     Bernie   74     D
2     Donald   69     R

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_IE.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.1
setuptools: 21.0.0
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.1
sphinx: 1.3.5
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.1.2
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.3.5
bs4: 4.4.1
html5lib: 1.0b3
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

tdszyman changed the title from from_html() doesn't handle tables with multiple header rows to read_html() doesn't handle tables with multiple header rows Jun 13, 2016

Contributor

TomAugspurger commented Jun 14, 2016

Correct me if I'm wrong here... would you be able to differentiate between HTML where the first row really is two blank strings, and a table with a header spanning multiple rows? My view of read_html is that the user should expect to have a bit of cleanup work to do. But if the change to handle this case doesn't break anything and isn't too complicated, I'd say it'd be a good addition.

TomAugspurger added this to the 0.19.0 milestone Jun 14, 2016

jreback added the Enhancement label Jun 14, 2016

jreback modified the milestone: Next Major Release, 0.19.0 Jun 14, 2016

@TomAugspurger the case I'm thinking of is where the first two rows are in the <thead> part of the <table>, and the other rows are in the <tbody> part. So yes, they can clearly be distinguished from a row that is simply empty. Also, in the example I gave, every single <tr> element contains the same number of cells/columns (whether they are <th> or <td>), so there is no reason to generate a data frame with a different number of columns.
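As a rough illustration of that distinction, the sketch below groups rows by the section they sit in and then collapses the header rows into one label per column position. It relies only on the standard-library html.parser and makes no claim about how read_html() is or should be implemented. SectionedTable and collapse_header are hypothetical names, and the HTML string is a compact copy of the example table with one body row.

```python
from html.parser import HTMLParser

# Compact copy of the example table (one body row kept).
HTML = (
    '<table><thead>'
    '<tr><th></th><th>Age</th><th>Party</th></tr>'
    '<tr><th>Name</th><th></th><th></th></tr>'
    '</thead><tbody>'
    '<tr><th>Hillary</th><td>68</td><td>D</td></tr>'
    '</tbody></table>'
)


class SectionedTable(HTMLParser):
    """Hypothetical helper: collects cell text per row, split by <thead>/<tbody>."""

    def __init__(self):
        super().__init__()
        self.rows = {"thead": [], "tbody": []}
        self.section = None  # "thead", "tbody", or None when outside both
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in self.rows:
            self.section = tag
        elif tag == "tr" and self.section:
            self.rows[self.section].append([])
        elif tag in ("th", "td") and self.section:
            self.rows[self.section][-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in self.rows:
            self.section = None
        elif tag in ("th", "td"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[self.section][-1][-1] += data.strip()


def collapse_header(header_rows):
    """Per column position, take the first non-empty label across header rows."""
    return [next((label for label in column if label), "")
            for column in zip(*header_rows)]


table = SectionedTable()
table.feed(HTML)
print(table.rows["thead"])                   # [['', 'Age', 'Party'], ['Name', '', '']]
print(table.rows["tbody"])                   # [['Hillary', '68', 'D']]
print(collapse_header(table.rows["thead"]))  # ['Name', 'Age', 'Party']
```

Because the header rows arrive separately from the body rows, a blank header cell is never confused with an empty data row, and the column count stays at three.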

jreback modified the milestone: 0.20.0, Next Major Release Mar 29, 2017

jreback closed this in 0ab0813 Mar 29, 2017

mattip added a commit to mattip/pandas that referenced this issue Apr 3, 2017

brianhuey + mattip ENH: read_html() handles tables with multiple header rows #13434
closes #13434

Author: Brian <sbhuey@gmail.com>
Author: S. Brian Huey <brianhuey@users.noreply.github.com>

Closes #15242 from brianhuey/thead-improvement and squashes the following commits:

fc1c80e [S. Brian Huey] Merge branch 'master' into thead-improvement
b54aa0c [Brian] removed duplicate test case
6ae2860 [Brian] updated docstring and io.rst
41fe8cd [Brian] review changes
873ea58 [Brian] switched from range to lrange
cd70225 [Brian] ENH:read_html() handles tables with multiple header rows #13434
3ab2fe4

linebp added a commit to linebp/pandas that referenced this issue Apr 17, 2017

brianhuey + linebp ENH: read_html() handles tables with multiple header rows #13434
1331d11