read_html providing title from a attribute as well as the text - in effect duplicating output #20027

Closed
MikeWoodward opened this Issue Mar 7, 2018 · 5 comments

MikeWoodward commented Mar 7, 2018

Code Sample

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_winners_of_the_Boston_Marathon"
# header=0 uses the first row of each table as the column names
tables = pd.read_html(url, header=0)
print(tables[0].head())

Problem description

The above code should just extract the displayed text in the HTML table; what's in the dataframe should be what's displayed on screen. This isn't what happens. If the HTML contains a hyperlink with a title attribute, the title is picked up and added to the dataframe, duplicating the data.

Expected Output

   Year                   Athlete  \
0  1897          John J. McDermott   
1  1898         Ronald J. MacDonald   
2  1899          Lawrence Brignolia   
3  1900         John "Jack" Caffery   
4  1901         John "Jack" Caffery   

                      Country/State     Time        Notes  
0                United States (NY)  2:55:10          NaN  
1                            Canada  2:42:00          NaN  
2                United States (MA)  2:54:38          NaN  
3                            Canada  2:39:44          NaN  
4                            Canada  2:29:23  2nd victory 

Output

Here's the actual output; the duplication is in the Athlete and Country/State columns.

   Year                                  Athlete  
0  1897      McDermott, John J.John J. McDermott   
1  1898  MacDonald, Ronald J.Ronald J. MacDonald   
2  1899    Brignolia, LawrenceLawrence Brignolia   
3  1900         Caffery, JohnJohn "Jack" Caffery   
4  1901         Caffery, JohnJohn "Jack" Caffery   

                      Country/State     Time        Notes  
0  United States United States (NY)  2:55:10          NaN  
1                     Canada Canada  2:42:00          NaN  
2  United States United States (MA)  2:54:38          NaN  
3                     Canada Canada  2:39:44          NaN  
4                     Canada Canada  2:29:23  2nd victory 

WillAyd commented Mar 7, 2018

Are you seeing the issue on any other sites? Just glancing at the source, I would think this has more to do with the span elements with display:none on the Wikipedia page than with the links (see screenshot below), but I'm curious whether there's another source that led you to think the link is responsible.

[Screenshot: screen shot 2018-03-07 at 11 35 18 am]
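To illustrate the point about hidden spans, here is a minimal standard-library sketch of a text extractor that ignores CSS, which is roughly what produces the doubled names. The cell markup in `CELL` is a hypothetical approximation of a Wikipedia sort-key cell, not a copy of the page source, and `NaiveText` and `naive_cell_text` are illustrative names, not pandas internals.

```python
from html.parser import HTMLParser

# Hypothetical markup: a hidden sort key followed by a visible link.
CELL = (
    '<td><span style="display:none">McDermott, John J.</span>'
    '<a href="#" title="John J. McDermott">John J. McDermott</a></td>'
)


class NaiveText(HTMLParser):
    """Collect every text node, ignoring CSS and all attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Attribute values such as title never arrive here; only text nodes do.
        self.parts.append(data)


def naive_cell_text(html):
    parser = NaiveText()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


print(naive_cell_text(CELL))
# McDermott, John J.John J. McDermott
```

The title attribute plays no part in the result: changing its value leaves the output unchanged, while the text inside the hidden span is what gets prepended.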


MikeWoodward commented Mar 9, 2018

I think you're right: this is an error on my part, and it's the display:none setting that's doing it. I need a bit more time to investigate, and I'll post again when I have some results.


WillAyd commented Mar 9, 2018

OK thanks Mike. For what it's worth I already put what I believe to be the fix here in #20047 - might want to give that a look


MikeWoodward commented Mar 9, 2018

Confirmed, this is the display:none doing this. Here's some example HTML that shows the issue.

<title>Example</title>

This is a H1

This is a paragraph

Column1 | Column2 | Column3
Span text display attribute:noneJohn J. McDermottText not in span or a | Plain text - no elements | Plain text - no elements
Span text no display attributeJohn J. McDermottText not in span or a | Plain text - no elements | Plain text - no elements

Some text
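The tags of the example above did not survive rendering, so the following is only a sketch under assumptions: `HIDDEN_CELL` and `PLAIN_CELL` are hypothetical reconstructions of the first cell in each row, and `VisibleText` is an illustrative standard-library filter that skips text inside elements whose inline style sets display:none. It is not the code from #20047, and it does not look at stylesheets or class-based rules.

```python
import re
from html.parser import HTMLParser

# Hypothetical reconstructions of the two first-column cells.
HIDDEN_CELL = (
    '<td><span style="display:none">Span text display attribute:none</span>'
    '<a href="#" title="John J. McDermott">John J. McDermott</a>'
    'Text not in span or a</td>'
)
PLAIN_CELL = (
    '<td><span>Span text no display attribute</span>'
    '<a href="#" title="John J. McDermott">John J. McDermott</a>'
    'Text not in span or a</td>'
)

HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)
# Void elements have no end tag, so they must not affect the depth counter.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class VisibleText(HTMLParser):
    """Collect text nodes that are not inside an inline display:none element."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.hidden_depth = 0  # > 0 while inside a hidden element

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        style = dict(attrs).get("style") or ""
        # Once hidden, every nested element is hidden too.
        if self.hidden_depth or HIDDEN_STYLE.search(style):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.parts.append(data)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


print(visible_text(HIDDEN_CELL))  # hidden span text is dropped
print(visible_text(PLAIN_CELL))   # span without display:none is kept
```

With these assumed cells, only the span carrying display:none is dropped; the link text and the trailing plain text are kept in both rows.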


WillAyd commented Mar 9, 2018

Great thanks. Feel free to mess around with the fix I put in the above PR. Targeting v0.23 if all works out

@TomAugspurger TomAugspurger reopened this Mar 9, 2018

@jreback jreback added this to the 0.23.0 milestone Mar 10, 2018