read_html() doesn't handle tables with multiple header rows #13434
Comments
tdszyman
changed the title from
from_html() doesn't handle tables with multiple header rows to read_html() doesn't handle tables with multiple header rows
Jun 13, 2016
Correct me if I'm wrong here... would you be able to differentiate between HTML where the first row really is two blank strings, and a table with a header spanning multiple rows? My thoughts are |
TomAugspurger
added the
IO HTML
label
Jun 14, 2016
TomAugspurger
added this to the
0.19.0
milestone
Jun 14, 2016
TomAugspurger
added Difficulty Intermediate Effort Medium
labels
Jun 14, 2016
jreback
added the
Enhancement
label
Jun 14, 2016
jreback
modified the milestone: Next Major Release, 0.19.0
Jun 14, 2016
tdszyman
commented
Jun 14, 2016
@TomAugspurger the case I'm thinking of is where the first two rows are in the |
viswaraavi
referenced
this issue
Jun 19, 2016
Closed
ENH:read_html() handles tables with multiple header rows #13434 #13485
viswaraavi
added a commit
to viswaraavi/pandas
that referenced
this issue
Sep 10, 2016
35582e0 (viswaraavi)
brianhuey
added a commit
to brianhuey/pandas
that referenced
this issue
Jan 27, 2017
47ece9d (brianhuey)
brianhuey
referenced
this issue
Jan 27, 2017
Closed
ENH:read_html() handles tables with multiple header rows #13434 #15242
brianhuey
added a commit
to brianhuey/pandas
that referenced
this issue
Feb 1, 2017
5a6f43f (brianhuey)
brianhuey
added a commit
to brianhuey/pandas
that referenced
this issue
Feb 2, 2017
82c96b6 (brianhuey)
brianhuey
added a commit
to brianhuey/pandas
that referenced
this issue
Feb 16, 2017
cd70225 (brianhuey)
jreback
modified the milestone: 0.20.0, Next Major Release
Mar 29, 2017
jreback
closed this
in 0ab0813
Mar 29, 2017
mattip
added a commit
to mattip/pandas
that referenced
this issue
Apr 3, 2017
3ab2fe4 (brianhuey + mattip)
linebp
added a commit
to linebp/pandas
that referenced
this issue
Apr 17, 2017
1331d11 (brianhuey + linebp)
tdszyman commented Jun 13, 2016 (edited)
The read_html() function seems to treat every <th> in a table as a column, even if they occur in separate <tr>s. This means that it breaks even on simple tables generated by pandas' to_html() function.

Code Sample, a copy-pastable example if possible

This is the value of html, generated by the to_html() function on the original data frame:

And this is the printed output of the newly-parsed dataframe df2:

What happens is that the to_html() function produces an html table with two header rows, one for the column names and one with the index name. However the read_html() parser interprets each individual th cell as an expected column, resulting in twice the number of columns. Even worse, this produces a column with the same name as the original index but without any data.

Expected Output

The read_html parser could either treat the multi-row header fully correctly:

Or it could just ignore any rows after the first one:

output of pd.show_versions()

INSTALLED VERSIONS
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_IE.UTF-8
pandas: 0.18.1
nose: 1.3.7
pip: 8.1.1
setuptools: 21.0.0
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.1
sphinx: 1.3.5
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.1.2
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.3.5
bs4: 4.4.1
html5lib: 1.0b3
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None
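
The header-row behaviour described in this issue can be illustrated without pandas. The sketch below is not the read_html() implementation; it uses only the standard-library html.parser, and the SAMPLE_HTML string, the HeaderRowCollector class and the header_rows() helper are hypothetical names made up for this illustration. The sample table is hand-written to resemble a two-row header (column names in the first row, index name in the second), not the actual output from the report. It contrasts counting every th in the header as a column with reading only the first header row.

```python
from html.parser import HTMLParser

# Hand-written sample (an assumption for illustration): a table whose <thead>
# has two rows, one with the column names and one with the index name.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th></th><th>a</th><th>b</th></tr>
    <tr><th>idx</th><th></th><th></th></tr>
  </thead>
  <tbody>
    <tr><th>x</th><td>1</td><td>2</td></tr>
  </tbody>
</table>
"""


class HeaderRowCollector(HTMLParser):
    """Collect the text of <th> cells, grouped by <tr>, inside <thead> only."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell texts per header row
        self._in_thead = False
        self._cell = None       # text fragments of the <th> being read

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self._in_thead = True
        elif self._in_thead and tag == "tr":
            self.rows.append([])
        elif self._in_thead and tag == "th":
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "thead":
            self._in_thead = False
        elif tag == "th" and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def header_rows(html):
    """Return the header cells of the table as a list of rows."""
    parser = HeaderRowCollector()
    parser.feed(html)
    return parser.rows


rows = header_rows(SAMPLE_HTML)

# Treating every header <th> as its own column doubles the column count.
flattened = [cell for row in rows for cell in row]

# Reading only the first header row keeps the real column count.
first_row_only = rows[0]

print(len(flattened), flattened)
print(len(first_row_only), first_row_only)
```

Grouping the cells by tr keeps the information needed for either outcome listed under Expected Output: the second row can be interpreted as the index name, or simply dropped.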