ENH: handle multiple tbody in read_html() #20690

Closed
jstray opened this Issue Apr 13, 2018 · 6 comments

jstray commented Apr 13, 2018

Code Sample

import pandas

url = 'https://www.wunderground.com/history/airport/KORD/2018/3/21/CustomHistory.html?dayend=10&monthend=4&yearend=2018'
df = pandas.read_html(url)[1]  # second table on the page

Problem description

Expected: table of weather information at page bottom.

Actual:
0.21.1 - first two rows
0.22.0 - exception
current master - first two rows

For reference, Google Sheets IMPORTHTML loads this table correctly.

Member

WillAyd commented Apr 13, 2018

It looks like the second table has multiple tbody tags but the parser only looks at the first:

res = self._parse_tr(tbody[0])

PRs welcome
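To illustrate the difference between reading only the first tbody and reading all of them, here is a minimal sketch using the standard library's `html.parser`. It is not the pandas parser; the `TbodyRows` class and the sample table are made up for illustration.

```python
from html.parser import HTMLParser

# Made-up table with three <tbody> sections, similar in shape to the report.
HTML = """
<table>
  <thead><tr><th>day</th><th>temp</th></tr></thead>
  <tbody><tr><td>1</td><td>40</td></tr></tbody>
  <tbody><tr><td>2</td><td>43</td></tr></tbody>
  <tbody><tr><td>3</td><td>38</td></tr></tbody>
</table>
"""


class TbodyRows(HTMLParser):
    """Hypothetical helper: collect <td> text per row, grouped by <tbody>."""

    def __init__(self):
        super().__init__()
        self.bodies = []        # one list of rows per <tbody>
        self._in_tbody = False
        self._row = None        # cells of the <tr> currently being read
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.bodies.append([])
            self._in_tbody = True
        elif tag == "tr" and self._in_tbody:
            self._row = []
        elif tag == "td" and self._row is not None:
            self._row.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tbody":
            self._in_tbody = False
        elif tag == "tr" and self._row is not None:
            self.bodies[-1].append(self._row)
            self._row = None
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


parser = TbodyRows()
parser.feed(HTML)

# Equivalent of indexing tbody[0]: only the first section's rows survive.
first_only = parser.bodies[0]

# Iterating over every <tbody> keeps all rows in document order.
all_rows = [row for body in parser.bodies for row in body]
```

With this input, `first_only` holds a single row while `all_rows` holds three, which matches the "first two rows" style of truncation described above.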

jstray commented Apr 14, 2018

Thanks. I'll probably have to fix this.

@gfyoung gfyoung added the IO HTML label Apr 15, 2018

Member

gfyoung commented Apr 15, 2018

We could always enhance read_html to behave like read_excel where we can read in multiple tbody elements. That doesn't seem too unreasonable for the moment.

@gfyoung gfyoung added the Enhancement label Apr 15, 2018

jstray commented Apr 15, 2018

How is read_excel similar to this? There’s no tbody in that case.

Member

WillAyd commented Apr 15, 2018

@gfyoung are you just referring to the ability to return multiple tables? read_html technically already does that as it returns a list instead of just a DataFrame, but chime in if I misunderstand.

@jstray one design consideration to think about - should multiple tbody elements return a MultiIndexed DataFrame? I suppose having those multiple tbody tags in the first place is indicative of different groupings within the table, so maybe it makes sense for each of them to be a unique value within the first level of a MultiIndex?

Not saying that needs to happen at the outset, as simply being able to parse them would certainly be an improvement over what we have today. Just throwing that out there as food for thought as you try a patch.
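As a rough sketch of what the MultiIndex idea could look like, assuming each tbody has already been parsed into its own DataFrame: `pd.concat` with `keys=` adds an outer index level per section. The frames, column names, and values here are made up, and this is not what read_html does.

```python
import pandas as pd

# Hypothetical per-<tbody> frames with made-up values.
body_0 = pd.DataFrame({"day": [21, 22], "high_f": [40, 43]})
body_1 = pd.DataFrame({"day": [1], "high_f": [51]})

# keys= labels each frame; names= names the two resulting index levels.
combined = pd.concat([body_0, body_1], keys=[0, 1], names=["tbody", "row"])

# Rows from one section can then be selected by the outer level.
second_section = combined.loc[1]
```

The simpler alternative is to concatenate the rows into a flat index, which loses the grouping but keeps the result shaped like any other table.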

Member

gfyoung commented Apr 16, 2018

@WillAyd : Sorry, misspoke there. Please ignore that comment. 😄

@chris-b1 chris-b1 changed the title from read_html() failing on certain pages to ENH: handle multiple tbody in read_html() Apr 16, 2018

@chris-b1 chris-b1 added this to the Next Major Release milestone Apr 16, 2018

adamhooper added a commit to adamhooper/pandas that referenced this issue May 1, 2018

@adamhooper adamhooper referenced this issue May 1, 2018

Merged

Read from multiple <tbody> within a <table> #20891

4 of 4 tasks complete

@jreback jreback modified the milestones: Next Major Release, 0.23.0 May 1, 2018

TomAugspurger added a commit that referenced this issue May 1, 2018

Read from multiple <tbody> within a <table> (#20891)
* Read from multiple <tbody> within a <table>

refs #20690