read_html repeats table (regression in 0.23) #22135

Closed
dhimmel opened this issue Jul 30, 2018 · 2 comments
dhimmel commented Jul 30, 2018

I'm running into an issue extracting a dataframe from an HTML page when using pandas 0.23. Before my environment was updated, I was using pandas 0.22 and the issue did not occur. If I install a new environment with 0.22, the issue does not occur, so I think this is probably a regression. I've cached the relevant HTML at trailheads.html.txt.

Here is the code:

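import requests
import pandas

# Cached copy of the page containing the trailheads table
url = 'https://github.com/pandas-dev/pandas/files/2242597/trailheads.html.txt'
response = requests.get(url)

# The trailing comma unpacks the result, so this fails unless exactly one table matches
wide_df, = pandas.read_html(
    response.text,
    header=1,
    attrs={'id': 'cs_idLayout2'},
    flavor='html5lib',
    parse_dates=['Date'],
)
# Keep only the first six columns
wide_df = wide_df.iloc[:, :6]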

Below I've included a subset of the resulting rows to demonstrate what I mean by repeated. The table appears to have been repeated once, giving double the number of rows plus one (since the header is repeated as a row). For brevity, I've excluded all rows except the top and bottom 2.

| Date | Happy Isles->Little Yosemite Valley | Happy Isles->Sunrise/Merced Lake (pass through) | Glacier Point->Little Yosemite Valley | Sunrise Lakes | Lyell Canyon |
| --- | --- | --- | --- | --- | --- |
| 7/28/2018 | 0 | 0 | 0 | 0 | 0 |
| 7/29/2018 | 0 | 0 | 0 | 0 | 0 |
| 12/17/2018 | 18 | 6 | 6 | 9 | 15 |
| 12/18/2018 | 18 | 6 | 6 | 9 | 15 |
| Date | Happy Isles->Little Yosemite Valley | Happy Isles->Sunrise/Merced Lake (pass through) | Glacier Point->Little Yosemite Valley | Sunrise Lakes | Lyell Canyon |
| 7/28/2018 | 0 | 0 | 0 | 0 | 0 |
| 7/29/2018 | 0 | 0 | 0 | 0 | 0 |
| 12/17/2018 | 18 | 6 | 6 | 9 | 15 |
| 12/18/2018 | 18 | 6 | 6 | 9 | 15 |
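
On an affected version, one possible workaround is to cut the frame at the first row that repeats the column labels. The sketch below assumes the duplicate always shows up as a header-like row followed by a second copy of the data, as in the excerpt above. `drop_repeated_table` is a hypothetical helper name, not part of pandas, and the small demo frame is made up for illustration.

```python
import pandas as pd


def drop_repeated_table(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the rows above the first row that repeats the column labels.

    Hypothetical helper: assumes the repeat looks like the excerpt above.
    """
    header = [str(col) for col in df.columns]
    # True for each row whose cells, compared as strings, equal the column labels
    is_header_row = df.astype(str).apply(
        lambda row: row.tolist() == header, axis=1
    )
    if not is_header_row.any():
        return df  # nothing that looks like a repeated header
    first_repeat = int(is_header_row.to_numpy().argmax())
    return df.iloc[:first_repeat]


# Made-up demo frame: two data rows, a repeated header row, then the copy
demo = pd.DataFrame(
    [
        ["7/28/2018", "0"],
        ["7/29/2018", "0"],
        ["Date", "Lyell Canyon"],
        ["7/28/2018", "0"],
        ["7/29/2018", "0"],
    ],
    columns=["Date", "Lyell Canyon"],
)
trimmed = drop_repeated_table(demo)
```

In the reproduction above, this would probably be applied after the `iloc[:, :6]` slice, since any unnamed trailing columns may not match their own labels in the repeated row.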

Liam3851 commented Jul 30, 2018

I can reproduce this in 0.23.3 but not on master. Can you verify? I don't see a PR that says it was specifically targeting this behavior, but it appears to already be handled.


dhimmel commented Jul 31, 2018

Thanks @Liam3851 for the help.

> I can reproduce this in 0.23.3 but not on master. Can you verify?

I get 144 rows when running with the codebase from current master, but 289 in 0.23.3. So I agree, the issue must have been fixed. Will close.
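
The two counts fit the original description of a table repeated once, plus one extra row for the header read back in as data. A quick arithmetic check, using only the numbers quoted in this thread:

```python
rows_on_master = 144   # row count reported on current master
rows_on_0_23_3 = 289   # row count reported on pandas 0.23.3

# One repeat of the table, plus the header re-read as a data row
expected_if_repeated_once = 2 * rows_on_master + 1
print(expected_if_repeated_once == rows_on_0_23_3)  # prints True
```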

> I don't see a PR that says it was specifically targeting this behavior, but it appears to already be handled.

Perhaps one of the recent commits to pandas/io/html.py fixed it.

dhimmel closed this as completed Jul 31, 2018
dhimmel added a commit to dhimmel/hackjohn that referenced this issue Dec 9, 2018
Detects problematic pandas version that would trigger
pandas-dev/pandas#22135
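
A downstream project could guard against this with a version check along the following lines. This is only a sketch, not the code from the referenced commit: `is_affected_pandas` is a hypothetical name, and treating the whole 0.23 series as affected is an assumption, since the thread only confirms 0.23.3 as affected and 0.22 and master as unaffected.

```python
import warnings


def is_affected_pandas(version: str) -> bool:
    """Return True for 0.23.x version strings (assumed affected range)."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) == (0, 23)


def warn_if_affected(version: str) -> None:
    """Emit a warning when the given pandas version may repeat tables."""
    if is_affected_pandas(version):
        warnings.warn(
            f"pandas {version} may repeat tables in read_html "
            "(pandas-dev/pandas#22135)"
        )


# Typical use would pass pandas.__version__; a literal is used here
# so the sketch runs without pandas installed.
warn_if_affected("0.22.0")  # no warning
```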