read_html providing title from a attribute as well as the text - in effect duplicating output #20027
url = """https://en.wikipedia.org/wiki/List_of_winners_of_the_Boston_Marathon"""
The above code should extract only the displayed text in the HTML table; the dataframe should contain what is displayed on screen. This isn't what happens. If the HTML contains a hyperlink with a title attribute, the title is picked up and added to the dataframe, duplicating the data.
Here's the actual output; the duplication is in the Athlete and Country/State columns.
Are you seeing the issue on any other sites? Just glancing at the source I would think this has to do more with the span elements with
Confirmed, this is the display:none doing this. Here's some example HTML that shows the issue.
<title>Example</title>
This is a H1
This is a paragraph
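To illustrate the diagnosis above, here is a minimal sketch using only the standard library, not pandas itself. It assumes a made-up table cell in which a hidden `display:none` span precedes the visible link text; `SAMPLE_HTML`, `CellText`, and `cell_texts` are hypothetical names for this illustration. Collecting all text in the cell reproduces the duplication, while skipping hidden elements yields only what is shown on screen.

```python
from html.parser import HTMLParser

# Hypothetical sample: a hidden sort-key span followed by the visible link.
SAMPLE_HTML = (
    "<table>"
    "<tr><th>Athlete</th></tr>"
    '<tr><td><span style="display:none">Doe, Jane</span>'
    '<a href="#" title="Jane Doe">Jane Doe</a></td></tr>'
    "</table>"
)

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class CellText(HTMLParser):
    """Collect the text of each <td>/<th>, optionally skipping hidden elements."""

    def __init__(self, skip_hidden):
        super().__init__()
        self.skip_hidden = skip_hidden
        self.hidden_depth = 0  # > 0 while inside a display:none element
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth:
            self.hidden_depth += 1  # nested inside a hidden element
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        if self.skip_hidden and "display:none" in style:
            self.hidden_depth = 1
            return
        if tag in ("td", "th"):
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth:
            self.hidden_depth -= 1
            return
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and not self.hidden_depth:
            self.cells[-1] += data


def cell_texts(html, skip_hidden):
    parser = CellText(skip_hidden)
    parser.feed(html)
    parser.close()
    return parser.cells


print(cell_texts(SAMPLE_HTML, skip_hidden=False))  # hidden text is included
print(cell_texts(SAMPLE_HTML, skip_hidden=True))   # only displayed text
```

Note that the link's title attribute never enters the output in either mode; in this sketch the extra text comes entirely from the hidden span.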