This tutorial was written by Katherine Walden and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
This lab provides an introduction and overview to parsing HTML in Python using BeautifulSoup and pandas.
Panopto videos: Lab overview; Lecture/live coding playlist (ND users only).
The author consulted the following materials when building this lab:
- Lisa Tagliaferri, "How to Scrape Web Pages With Beautiful Soup and Python 3" Digital Ocean (20 March 2019). https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3
- Jeri Wieringa, "Intro to Beautiful Soup," The Programming Historian 1 (2012), https://doi.org/10.46430/phen0008.
- Martin Breuss, "Beautiful Soup: Build a Web Scraper With Python" Real Python. https://realpython.com/beautiful-soup-web-scraper-python/
- DataQuest, "Tutorial: Web Scraping With Python Using Beautiful Soup" DataQuest (2 September 2020). https://www.dataquest.io/blog/web-scraping-tutorial-python/
The introductory sections of this lab are based on and adapted from:
- Lisa Tagliaferri, "How to Scrape Web Pages With Beautiful Soup and Python 3" Digital Ocean (20 March 2019). https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3
- Lecture & Live Coding
- Lab Notebook Template
- Beautiful Soup Overview
- Beautiful Soup & Structured Data
- An Alternate Approach:
pandas.read_html()
- Beautiful Soup & Unstructured Text
- Why did we do this?
- How to Submit This Lab (and show your work)
- Lab Notebook Questions
Link to lab procedure as a Jupyter Notebook
Throughout this lab, you will see a Panopto icon at the start of select sections.
This icon indicates there is lecture/live coding asynchronous content that accompanies this section of the lab.
You can click the link in the figure caption to access these materials (ND users only).
Example:
Panopto videos: Lab overview; Lecture/live coding playlist.
Click here to access the lab notebook template as a Jupyter Notebook (Google Colab, ND Users).
Moving forward, we're going to be submitting lab notebooks using Jupyter Notebook templates.
- If working in JupyterLab (or another desktop IDE), download the .ipynb file to your local computer (File -> Download -> Download as .ipynb)
- If working in Google Colaboratory, MAKE SURE you save a copy to your own Google Drive (File -> Save a copy in Drive). Otherwise your changes will not be saved.
The lab notebook template includes all of the questions as well as pre-created markdown cells for narrative text answers and pre-created code cells for any programs you may need to create.
- Double click on these cells to edit and add your own content
- If questions do not require a code component, you can ignore those cells
- If questions do not require a narrative component, you can ignore those cells
If working in JupyterLab or another desktop IDE, upload the lab notebook template .ipynb
file to Canvas as your lab submission.
If working in Google Colaboratory, submit the link to your notebook (check the sharing permissions, as you would with a Google Doc).
## Getting Started With Beautiful Soup
Table of contents for this section:
- Introduction to Beautiful Soup
- Installing Beautiful Soup
- Loading URLs in Python
- Creating a Beautiful Soup Object
What is Beautiful Soup? "Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work" (Beautiful Soup Documentation). The library name is an allusion to the song the Mock Turtle sings in Chapter 10 of Lewis Carroll's Alice's Adventures in Wonderland.
Beautiful Soup uses the structure of markup language documents (i.e. HTML, XML, etc.) to extract information we can work with in a Python programming environment. In this lab, we will be focusing on using Beautiful Soup to parse and scrape web pages built using HTML and CSS.
We'll need to install a couple of libraries to be able to parse HTML with BeautifulSoup:
- BeautifulSoup: pip install beautifulsoup4
- lxml: pip install lxml
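With those installed, here is a minimal standalone sketch of the core idea (the HTML snippet and variable names here are made up for illustration): parse a small HTML string and pull out a single tag's text.
# import beautifulsoup
from bs4 import BeautifulSoup
# a small, made-up HTML snippet
html = "<html><body><h1>Hello</h1><p class='intro'>A tiny example page.</p></body></html>"
# parse the snippet into a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')
# isolate the paragraph tag and show its text
print(soup.find('p').text)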
We will be using the requests library to load URLs in Python. requests is not part of the Python standard library, but it comes pre-installed in most of the environments we're working in (Anaconda, Google Colaboratory); if needed, it can be installed with pip install requests. Core syntax for using the requests library to load a URL:
# import requests
import requests
# load url using the requests.get() method
page = requests.get('URL GOES HERE')
Once we have a URL loaded in Python, we want to be able to use BeautifulSoup
to parse the web page's HTML components. We can do that by creating a BeautifulSoup object. Core syntax:
# import beautifulsoup
from bs4 import BeautifulSoup
# create beautifulsoup object
soup = BeautifulSoup(page.text, 'html.parser')
Putting that all together, the core syntax for loading a URL and creating a BeautifulSoup object:
# import requests
import requests
# import beautifulsoup
from bs4 import BeautifulSoup
# load url using the requests.get() method
page = requests.get('URL GOES HERE')
# create beautiful soup object
soup = BeautifulSoup(page.text, 'html.parser')
Now, we are ready to parse the web page's HTML using BeautifulSoup.
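Before drilling into specific tags, a few quick checks can confirm the soup object is working as expected (a sketch; the exact output depends on the page you loaded):
# show the page's title tag
print(soup.title)
# show just the title text
print(soup.title.string)
# count how many link (a) tags appear on the page
print(len(soup.find_all('a')))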
## HTML & File Methods Refresh
When using BeautifulSoup, we can isolate information on the web page by interacting with the HTML syntax. HTML uses a few core tags for web pages that include tables:
- table (marks the start and end of a table)
- tbody (marks the start and end of the table body)
- tr (marks the start and end of each table row)
- th (marks the start and end of each column in the first row of the table)
- td (marks the start and end of each column after the first row of the table)
How we might see those tags combined in a table structure:
<table>
<tr>
<th>First Column Header</th>
<th>Second Column Header</th>
<th>Third Column Header</th>
</tr>
<tr>
<td>Data in first column/first row</td>
<td>Data in second column/first row</td>
<td>Data in third column/first row</td>
</tr>
<tr>
<td>Data in first column/second row</td>
<td>Data in second column/second row</td>
<td>Data in third column/second row</td>
</tr>
</table>
The output of that HTML would look like:
| First Column Header | Second Column Header | Third Column Header |
| --- | --- | --- |
| Data in first column/first row | Data in second column/first row | Data in third column/first row |
| Data in first column/second row | Data in second column/second row | Data in third column/second row |
Additional attributes like align, style, etc. can be used with many of these tags. Some other tags that we might encounter when trying to parse HTML using BeautifulSoup:

| Name | Syntax | Description |
| --- | --- | --- |
| Heading | h1, h2, h3, etc. | HTML headings |
| Division | div | Defines sections or divisions in an HTML document |
| Paragraph | p | Defines paragraphs in an HTML document |
| Span | span | Defines a section in an HTML document |
| Unordered list | ul | Defines the start of an unordered list (followed by li tags) |
| Hyperlink | a | Defines a hyperlink in an HTML document (followed by href="URL") |
For more detail on HTML and CSS:
As part of our work in this lab, we'll be saving the content we have scraped from a web page to a plain-text file (either .txt or .csv), focusing on the open() and write() file methods (a quick refresher sketch follows below).
Click here to review file I/O and access modes in Python.
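As that refresher, here is a minimal sketch of the open() and write() workflow we'll be drawing on (the file name here is just an example):
# open (create) a file in write mode
f = open('example.txt', 'w')
# write two lines of text, including explicit newline characters
f.write("first line\n")
f.write("second line\n")
# close the file so the contents are saved
f.close()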
## BeautifulSoup and a Single Web Page
Now, we're going to see HTML syntax and file methods at work as we develop a web scraping program that takes tabular data on a web page and writes it to a CSV file.
Head to https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table in a web browser. There are a number of tables on this page, but we're going to focus on the Unranked medal table (sortable) section of the page.
Take a look at this table on the public web page, thinking about the rows, columns, and data values we might want to map onto a tabular (table, spreadsheet) data structure.
Then, right click on the page (Control-click
on a Mac) and select the View Page Source
option.
- The specific label for this option may differ across browsers and operating systems.
There's a lot going on here: we're looking at the back-end HTML for the Wikipedia page with the table we want to work with. We want to figure out what HTML elements are adjacent to the section of the page we want to isolate. Scrolling through the HTML until you see things that look promising is one place to start. You can also type Control-F (Command-F on a Mac) to search for particular words/phrases in the HTML. We know one of the country names in the table is Afghanistan, so let's search the HTML to see where that term appears.
<th style="width:2em;">
<p><span title="Combined total number of all Medals won by each country" class="rt-commentedText" style="border-bottom:1px dotted">Total</span>
</p>
</th></tr>
<tr>
<td align="left"><span id="AFG"><img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Flag_of_Afghanistan_%282013%E2%80%932021%29.svg/22px-Flag_of_Afghanistan_%282013%E2%80%932021%29.svg.png" decoding="async" width="22" height="15" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Flag_of_Afghanistan_%282013%E2%80%932021%29.svg/33px-Flag_of_Afghanistan_%282013%E2%80%932021%29.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Flag_of_Afghanistan_%282013%E2%80%932021%29.svg/44px-Flag_of_Afghanistan_%282013%E2%80%932021%29.svg.png 2x" data-file-width="900" data-file-height="600" /> <a href="/wiki/Afghanistan_at_the_Olympics" title="Afghanistan at the Olympics">Afghanistan</a> <span style="font-size:90%;">(AFG)</span></span>
</td>
<td style="background:#f2f2ce;">15</td>
<td style="background:#f2f2ce;">0</td>
<td style="background:#f2f2ce;">0</td>
<td style="background:#f2f2ce;">2</td>
<td style="background:#f2f2ce;">2
</td>
<td style="background:#cedff2;">0</td>
<td style="background:#cedff2;">0</td>
<td style="background:#cedff2;">0</td>
<td style="background:#cedff2;">0</td>
<td style="background:#cedff2;">0
</td>
<td>15</td>
<td>0</td>
<td>0</td>
<td>2</td>
<td>2
</td></tr>
<tr>
Again, there's still a lot going on, but we can see the table row <tr>
tag around a section of HTML that includes multiple <td>
tags with information about the country name (Afghanistan
) and number values that correspond to the medal counts from that row in the table. Now, we need to figure out where this table "starts," or what HTML elements mark the beginning/end of the table. If we scroll up from the Afghanistan
search results, we can see the section of HTML that marks the start of this table.
<h2><span id="Unranked_medal_table_.28sortable.29"></span><span class="mw-headline" id="Unranked_medal_table_(sortable)">Unranked medal table (sortable)</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=All-time_Olympic_Games_medal_table&action=edit&section=1" title="Edit section: Unranked medal table (sortable)">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<p>The table is pre-sorted by the name of each Olympic Committee, but can be displayed as sorted by any other column, such as the total number of <a href="/wiki/Gold_medal" title="Gold medal">gold medals</a> or total number of overall medals. To sort by gold, silver, and then bronze, sort first by the bronze column, then the silver, and then the gold. The table does not include revoked medals (e.g., due to <a href="/wiki/Doping_in_sport" title="Doping in sport">doping</a>).
</p><p>Medal totals in this table are current through the <a href="/wiki/2020_Summer_Olympics" title="2020 Summer Olympics">2020 Summer Olympics</a> in <a href="/wiki/Tokyo" title="Tokyo">Tokyo</a>, Japan, and all <a href="/wiki/List_of_stripped_Olympic_medals" title="List of stripped Olympic medals">changes in medal standings</a> due to doping cases and medal redistributions up to 8 August 2021 are taken into account.
</p>
<style data-mw-deduplicate="TemplateStyles:r981673959">.mw-parser-output .legend{page-break-inside:avoid;break-inside:avoid-column}.mw-parser-output .legend-color{display:inline-block;min-width:1.25em;height:1.25em;line-height:1.25;margin:1px 0;text-align:center;border:1px solid black;background-color:transparent;color:black}.mw-parser-output .legend-text{}</style><div class="legend"><span class="legend-color" style="background-color:lightblue; color:black;"> </span> Special delegation, not an actual nation</div>
<table class="wikitable sortable" style="margin-top:0; text-align:center; font-size:90%;">
<tbody><tr>
<th>Team
</th>
<th align="center" style="background-color:#f2f2ce;" colspan="5"><a href="/wiki/Summer_Olympic_Games" title="Summer Olympic Games">Summer Olympic Games</a>
</th>
<th align="center" style="background-color:#cedff2;" colspan="5"><a href="/wiki/Winter_Olympic_Games" title="Winter Olympic Games">Winter Olympic Games</a>
</th>
<th align="center" colspan="5"><a href="/wiki/Olympic_Games" title="Olympic Games">Combined total</a>
</th></tr>
<tr>
<th><span title="International Olympic Committee country code" class="rt-commentedText" style="border-bottom:1px dotted">Team (IOC code)</span>
</th>
Since we're looking for a table, identifying a <table>
tag is a good place to start. We can see the line <table class="wikitable sortable" style="margin-top:0; text-align:center; font-size:90%;">
marks the start of the medal count table. This gives us the critical pieces of information we need when working in Python with BeautifulSoup
to isolate this section of HTML.
The first step in our program is to import the libraries we'll need, load the web page using requests
and create the BeautifulSoup
object.
# import statements
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
# get page
page = requests.get("https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table")
# create BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')
# show BeautifulSoup object
soup
At this point we have loaded the web page into Python as a BeautifulSoup
object.
The next step is to use BeautifulSoup to identify and isolate the section of the web page we want to focus on. We can do this using the .find() or .find_all() methods with our BeautifulSoup object.
- .find() isolates a single instance of an HTML tag or class
- .find_all() creates a list-like object with each instance of a specific tag or class
# basic syntax for find
soup.find("HTML TAG GOES HERE")
# a specific example with the p tag
soup.find("p")
# basic syntax for find_all, this time using a class instead of a tag
soup.find_all(class_ = "CLASS NAME GOES HERE")
# a specific example using the wikitable sortable class
soup.find_all(class_ = 'wikitable sortable')
Again, we know the line of HTML <table class="wikitable sortable" style="margin-top:0; text-align:center; font-size:90%;">
marks the start of the medal count table. But we don't know how many instances of the <table>
tag or wikitable sortable
class appear in this web page. Some more work with Command-F
/Control-F
shows us there are 30 instances of the <table>
tag and 9 instances of the wikitable sortable
class in this page. So let's use .find_all()
with the wikitable sortable
class and then figure out which of those 9 instances is the table we want to work with.
# isolate all HTML with wikitable sortable class
tables = soup.find_all(class_ = 'wikitable sortable')
Then, we can use index operators to go through the items in tables
to see which one has the HTML we want to work with.
# show first table in tables
tables[0]
The first table in the list is the medal count, so we can assign that to a variable.
# isolate first table in tables
table = tables[0]
Now that we have just the HTML for the table we want to work with, we want to create a list-like object for each row in the table. So, we can use .find()
or .find_all()
again to create a list-like object with all elements that match a particular class or tag. In this case, we want the HTML for each row in the table to be a value or element in the list, so we're going to be looking for the table row <tr>
tag. And since this table has multiple rows, we're going to use .find_all("tr")
to find all instances of the <tr>
tag and map those onto a list-like structure.
# get all table rows from medal_table
rows = table.find_all("tr")
# show table rows
rows
So now we have the HTML for each row in the table as a list value in our rows list. To recap:
- Load URL using requests.get()
- Create BeautifulSoup object using BeautifulSoup()
- Isolate the section of the page with the table using .find_all() (with the wikitable sortable class) and index values
- Create a list of rows using .find_all() (with the tr tag)
We'll eventually want to figure out how to iterate over all the rows in the table, but for now, let's isolate a single row so we can test our program on it. Since the first two rows in the table have column header information and do not match the structure/format of the country rows, we'll use the third row (the first country row) for our single-row testing.
# isolate first country row in rows
row = rows[2]
# show first country row
row
Now, we have a row
object that includes the entire HTML associated with the <tr>
tag for Afghanistan
. As before, we want to create a list-like object for each cell or column in the table. Since there are multiple cells in each row, we'll use .find_all()
to look for the table cell <td>
tag and map those values onto a list-like structure.
# get all cells in the row
cells = row.find_all("td")
# show cells
cells
Now we have a list of cells (<td>
tags) for a single row of HTML in the original table. The next step is to figure out how we work within that list of <td>
tag contents to isolate the specific values we want to eventually write to each row in our .csv
file. At this point, before we start writing code to extract these values, it's useful to stop and think about the endpoint for the data scraping process. Specifically, thinking about what columns we have in mind for the ultimate data structure helps us keep the end goal in mind as we start working through each aspect of the program.
For this table, we might want a data structure with the following columns:
- Country Name
- Country URL
- Number of Summer Olympic Appearances
- Number of Summer Olympic Gold Medals
- Number of Summer Olympic Silver Medals
- Number of Summer Olympic Bronze Medals
- Total Number of Summer Olympic Medals
- Number of Winter Olympic Appearances
- Number of Winter Olympic Gold Medals
- Number of Winter Olympic Silver Medals
- Number of Winter Olympic Bronze Medals
- Total Number of Winter Olympic Medals
- Combined Number of Olympic Appearances
- Combined Number of Gold Medals
- Combined Number of Silver Medals
- Combined Number of Bronze Medals
- Total Combined Number of Medals
So let's figure out how we can isolate each of those values, with the ultimate goal of developing a program that can iterate over all the similarly-structured rows in our table.
To get the country name, we can isolate the first cell (<td>
tag) in the row
list of cells.
# isolate first cell in the row
country_name = cells[0]
# show first cell
country_name
Then, we can use .contents[]
(part of BeautifulSoup
) to get the contents of that first cell.
# isolate country name using contents and index
name = cells[0].contents[0]
# show country_name HTML
name
So now we have the HTML associated with the first <td>
tag, but there's still a lot of extraneous markup we want to remove to get just the country name.
<span id="AFG"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Flag_of_Afghanistan_%282013%E2%80%932021%29.svg/22px-Flag_of_Afghanistan_%282013%E2%80%932021%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Flag_of_Afghanistan_%282013%E2%80%932021%29.svg/33px-Flag_of_Afghanistan_%282013%E2%80%932021%29.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Flag_of_Afghanistan_%282013%E2%80%932021%29.svg/44px-Flag_of_Afghanistan_%282013%E2%80%932021%29.svg.png 2x" width="22"/> <a href="/wiki/Afghanistan_at_the_Olympics" title="Afghanistan at the Olympics">Afghanistan</a> <span style="font-size:90%;">(AFG)</span></span>
If we drill down into this HTML, we can see the <a href...>
tag is the HTML element closest to the country name we want to extract. So let's use .find()
to get just the a
tag HTML.
# get HTML with a tag
name = name.find('a')
# show name
name
Now with just the a
tag (<a href="/wiki/Afghanistan_at_the_Olympics" title="Afghanistan at the Olympics">Afghanistan</a>
), we can use .contents[]
again to get the tag contents. Remember an index value goes in the brackets for .contents[]
.
# get contents of a href tag
name = name.contents[0]
# show country
name
Voila! The country name. Putting that all together:
# combined program for country name
name = cells[0].contents[0].find('a').contents[0]
<span id="AFG"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Flag_of_Afghanistan_%282013%E2%80%932021%29.svg/22px-Flag_of_Afghanistan_%282013%E2%80%932021%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Flag_of_Afghanistan_%282013%E2%80%932021%29.svg/33px-Flag_of_Afghanistan_%282013%E2%80%932021%29.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Flag_of_Afghanistan_%282013%E2%80%932021%29.svg/44px-Flag_of_Afghanistan_%282013%E2%80%932021%29.svg.png 2x" width="22"/> <a href="/wiki/Afghanistan_at_the_Olympics" title="Afghanistan at the Olympics">Afghanistan</a> <span style="font-size:90%;">(AFG)</span></span>
If we look at the HTML associated with country_name
, we can see that the country's Wikipedia URL is also in this section of HTML, as part of the <a href...>
tag. So we can again use .find()
to isolate the <a>
tag's contents.
# get first/only instance of a link tag in HTML for single country
link = cells[0].find('a')
# show link
link
Now, looking at the link piece of HTML (<a href="/wiki/Afghanistan_at_the_Olympics" title="Afghanistan at the Olympics">Afghanistan</a>), we want to extract the URL associated with the href attribute. We were able to use .contents[] to get the contents of this HTML tag, but we'll need a different approach for getting the href attribute value. BeautifulSoup includes a .get() method that lets us isolate the value associated with an HTML attribute. So we can use .get('href') to isolate the URL.
# get the contents of the url from the href attribute
link = link.get('href')
# show href contents
link
But wait! /wiki/Afghanistan_at_the_Olympics
doesn't look like a full url...That's because it isn't. Wikipedia is using an internal link with the <a href...>
tag to connect to another Wikipedia page, which means they don't have to use the full URL for the link to work. But, we can use concatenation to recreate the full link, since all English-language Wikipedia pages start with the same root URL: https://en.wikipedia.org
# use concatenation to get full url
link = "https://en.wikipedia.org" + link
# show full link
link
Great! Now we have the full link.
# combined program for link
link = cells[0].find('a').get('href')
link = "https://en.wikipedia.org" + link
print(link)
If we take a look at the original Wikipedia table, we notice there are three column "groupings":
- Summer Olympic Games
- Winter Olympic Games
- Combined total
We can extract medal counts within each of these column groups by selecting the cell and then getting the value of that cell using contents[]
and index values.
# number of summer olympic appearances
summer = cells[1].contents[0]
# number of summer gold medals
summer_gold = cells[2].contents[0]
# number of summer silver medals
summer_silver = cells[3].contents[0]
# number of summer bronze medals
summer_bronze = cells[4].contents[0]
# total number of summer olympics medals, using strip() to remove line break
summer_medals = cells[5].contents[0].strip()
# total number of winter olympic appearances
winter = cells[6].contents[0]
# number of winter gold medals
winter_gold = cells[7].contents[0]
# number of winter silver
winter_silver = cells[8].contents[0]
# number of winter bronze
winter_bronze = cells[9].contents[0]
# total number of winter medals
winter_medals = cells[10].contents[0].strip()
# total number combined olympic appearances
combined_olympics = cells[11].contents[0].strip()
# total number gold medals
gold_total = cells[12].contents[0]
# total number silver medals
silver_total = cells[13].contents[0]
# total number bronze medals
bronze_total = cells[14].contents[0]
# total number of medals, combined
medals_total = cells[15].contents[0].strip()
# create list with all of these values
medals = [summer, summer_gold, summer_silver, summer_bronze, summer_medals, winter, winter_gold, winter_silver, winter_bronze, winter_medals, combined_olympics, gold_total, silver_total, bronze_total, medals_total]
# show list
print(medals)
We could also create a list with the medal data and the country names + links we extracted in the previous section of the lab.
# add country name and URL to medals list
medals.insert(0, name)
medals.insert(1, link)
# show updated list
print(medals)
Now we have working code that extracts each piece of data from a row in the table.
To recap our workflow so far:
- Load URL using requests.get()
- Create BeautifulSoup object using BeautifulSoup()
- Isolate the section of the page with the table using .find_all() (with the wikitable sortable class) and index values
- Create a list of rows using .find_all() (with the tr tag)
- Isolate a single row using index values
- Develop a program that works on a single row using a combination of index values, .get(), .find(), and .contents[]
- Create a list from the extracted values
Now that we have working code that will extract each piece of data as a variable, we want to run these lines of code on each country in the table to get the data in each column. We can do this using a for
loop that iterates over each row in the table.
But before we throw iteration into the mix, we might want to remove or exclude rows in the table that don't follow the pattern or structure of the rows with data. Specifically for this table, the first two rows have the column header information and don't match the format/structure of the rest of the table. One option for removing these rows would be del
statements.
# remove second row
del rows[1]
# remove first row
del rows[0]
# show updated list of rows
print(rows)
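An alternative to the del statements above (a judgment call; both approaches work here) is list slicing, which keeps everything from the third row onward in a single step:
# keep only the data rows, skipping the two header rows
rows = rows[2:]
# show updated list of rows
print(rows)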
Now we're ready to iterate over each country (table row) using the code we tested on a single country. We can create a list with each row's values and append that list as a sublist or nested list to an empty list.
table_rows = [] # empty list for data

for row in rows: # for loop that iterates over rows
    tags = [] # create empty list for td tags in single row
    cells = row.find_all('td') # isolate cells, td elements
    tags.append(cells) # append cells to tag list
    for cells in tags: # for loop that iterates over cells in row
        try:
            name = cells[0].contents[0].find('a').contents[0] # combined program for country name
            link = cells[0].find('a').get('href') # relative link from the a tag
            link = "https://en.wikipedia.org" + link
            summer = cells[1].contents[0] # number of summer olympic appearances
            summer_gold = cells[2].contents[0] # number of summer gold medals
            summer_silver = cells[3].contents[0] # number of summer silver medals
            summer_bronze = cells[4].contents[0] # number of summer bronze medals
            summer_medals = cells[5].contents[0].strip() # total number of summer olympic medals, using strip() to remove line break
            winter = cells[6].contents[0] # number of winter olympic appearances
            winter_gold = cells[7].contents[0] # number of winter gold medals
            winter_silver = cells[8].contents[0] # number of winter silver medals
            winter_bronze = cells[9].contents[0] # number of winter bronze medals
            winter_medals = cells[10].contents[0].strip() # total number of winter medals
            combined_olympics = cells[11].contents[0].strip() # total number of combined olympic appearances
            gold_total = cells[12].contents[0] # total number of gold medals
            silver_total = cells[13].contents[0] # total number of silver medals
            bronze_total = cells[14].contents[0] # total number of bronze medals
            medals_total = cells[15].contents[0].strip() # total number of medals, combined
            # create list of values from each data element
            row_data = [name, link, summer, summer_gold, summer_silver, summer_bronze, summer_medals, winter, winter_gold, winter_silver, winter_bronze, winter_medals, combined_olympics, gold_total, silver_total, bronze_total, medals_total]
            # append row_data to table_rows
            table_rows.append(row_data)
        except:
            continue
# show list
table_rows
Each row of data in the table is now a list value (a sublist or nested list) in table_rows. In this program, try and except instruct Python to try to run the lines nested under try, but if it runs into an error (except), continue iterating over the remaining rows. Using try and except can be dangerous if you don't know whether or how a program is working; in this case, we've tested the program and gotten it working on a single row that matches the pattern for all the rows we want to scrape data from.
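If try and except are new to you, here is a standalone sketch (the values are made up and unrelated to the scraping program) showing how except combined with continue skips problem items instead of stopping the whole loop:
# a made-up list where one value will cause an error
values = ['10', '15', 'not a number', '20']

for value in values: # iterate over values
    try:
        print(int(value)) # try to convert the value to an integer
    except:
        continue # if the conversion fails, skip this value and keep going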
Now that we know the iteration is working, we can take table_rows
and write each item in the list (which itself is a row of data) to a row in a .csv
file.
# create new csv file (newline='' avoids blank rows between lines on some systems)
f = csv.writer(open('medals.csv', 'w', newline=''))
# list of strings with headers
headers = ["country_name", "link", "no_summer_olympics", "no_summer_gold", "no_summer_silver", "no_summer_bronze", "total_no_summer_medals", "no_winter_olympics", "no_winter_gold", "no_winter_silver", "no_winter_bronze", "total_no_winter_medals", "no_combined_olympics", "no_combined_gold", "no_combined_silver", "no_combined_bronze", "total_no_combined_medals"]
# write header row or first row for CSV file
f.writerow(headers)
# for loop that writes each element in table_rows to a new row
for row in table_rows:
    f.writerow(row)
We can also use table_rows
and headers
to create a pandas
DataFrame
. pandas
will map each value in the list (nested list or sublist with single row of data) to a row in the DataFrame
. And the headers
list we created for writing the first row of the medals
CSV file can be used to set the DataFrame
column names.
# import pandas
import pandas as pd
# list of strings with headers
headers = ["country_name", "link", "no_summer_olympics", "no_summer_gold", "no_summer_silver", "no_summer_bronze", "total_no_summer_medals", "no_winter_olympics", "no_winter_gold", "no_winter_silver", "no_winter_bronze", "total_no_winter_medals", "no_combined_olympics", "no_combined_gold", "no_combined_silver", "no_combined_bronze", "total_no_combined_medals"]
# create data frame
df = pd.DataFrame(table_rows, columns=headers)
# show df
df
Then we could use the DataFrame's .to_csv() method to write the DataFrame to a CSV file.
# write df to csv file
df.to_csv("output.csv", index=False)
## Lab Notebook Question 1
Q1A: Select another Wikipedia page that includes a table. From looking at the public web page, what data do you want to scrape from this web page (i.e. specific table, multiple tables, etc.)? What do you want the resulting data structure to look like (columns, rows, etc)?
Q1B: Take a look at the HTML for this page. What tags or other HTML components do you see around the section of the page you want to work with? For this question, we're thinking about how we will end up writing a program with BeautifulSoup
to isolate a section of the web page.
Q1C: Develop an outline for a Python program that scrapes data from the web page you selected.
A preliminary workflow:
- Load URL and create BeautifulSoup object
- Isolate section of HTML with your table (either directly or extract from list)
- Isolate table row elements (create list where each element is a table row)
- Extract contents from row (isolate the pieces of information from each row)
- Create Pandas DataFrame
- Write extracted row contents to CSV file
NOTE: This program will have a lot of moving parts! Starting with pseudocode to think about the underlying logic can help you organize the components you're building. As you start to build out your own program, think about where you can draw on sample code from the lab (or follow the same conceptual workflow).
Q1D: What challenges or roadblocks did you face working on Q1C? How did you start to tackle or solve them?
## pandas.read_html()
BeautifulSoup
is an incredibly powerful package that lets us drill down into the structure of an HTML document to isolate the pieces of information we need or want to work with. But if we're dealing with an HTML table that has straightforward formatting, an alternate approach is to use the .read_html()
function that is part of the pandas
library. pd.read_html()
looks for any <table>
tag on a web page and reads each <table>
tag's HTML into Python as a list of DataFrame
objects. Let's take a look at how we could use pd.read_html()
for a College Football Reference web page: the Notre Dame 1940 football schedule.
# import pandas
import pandas as pd
# load url
url = "https://www.sports-reference.com/cfb/schools/notre-dame/1940-schedule.html"
# create list of dataframes from url using .read_html()
dfs = pd.read_html(url)
# show dfs list
dfs
We could also pass the URL directly to pd.read_html():
# import pandas
import pandas as pd
# create list of dataframes from url using .read_html()
dfs = pd.read_html("https://www.sports-reference.com/cfb/schools/notre-dame/1940-schedule.html")
# show dfs list
dfs
At first glance, dfs
doesn't seem to have clear data tables, but remember dfs
is a list of DataFrames
created from any HTML table
content on the page.
# number of elements in dfs list
len(dfs)
Since there are only two elements in the dfs
list, let's isolate each one.
# show first dataframe in dfs
dfs[0]
These values correspond to the AP Poll Summary
table on the page.
# show second dataframe in dfs
dfs[1]
These values correspond to the 9 Games
schedule table on the page. We could create a single DataFrame
by selecting the second element in dfs
.
# select single dataframe
df = dfs[1]
# show new dataframe
df
College Football Reference schedule pages for pre-1936 Notre Dame do not have the first AP Poll
table. We can use the 1935 schedule page (https://www.sports-reference.com/cfb/schools/notre-dame/1935-schedule.html) as an example.
# load url
url = "https://www.sports-reference.com/cfb/schools/notre-dame/1935-schedule.html"
# create list of dataframes from url using .read_html()
dfs = pd.read_html(url)
# show number of tables in dfs
len(dfs)
Only one table here, so we can isolate that table as a DataFrame
.
# select single dataframe
df = dfs[0]
# show new dataframe
df
If we were ultimately going to do further work with analyzing/visualizing this data, we would probably want to spend some more time standardizing column names and values. And if we were going to scrape this table across multiple pages, we would need to find some way to get the year or season represented in the DataFrame
. But for now, if we wanted to write the DataFrame
we created using pd.read_html()
to a CSV
...
df.to_csv("output.csv", index=False)
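As a rough sketch of what scraping multiple seasons might look like (the year range, the season column name, and the assumption that the schedule is the last table on each page are all illustrative choices, not part of the lab's required workflow):
# import pandas and time
import pandas as pd
import time

# empty list to hold one DataFrame per season
season_dfs = []

for year in range(1936, 1941): # loop over a sample range of seasons
    url = "https://www.sports-reference.com/cfb/schools/notre-dame/" + str(year) + "-schedule.html"
    tables = pd.read_html(url) # list of DataFrames for this page
    schedule = tables[-1] # assume the schedule is the last table on the page
    schedule['season'] = year # add a column recording the season
    season_dfs.append(schedule)
    time.sleep(1) # pause between requests to be polite to the server

# combine all seasons into a single DataFrame
combined = pd.concat(season_dfs, ignore_index=True)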
## Lab Notebook Question 2
Q2A: Develop an outline for a Python program that uses pd.read_html()
to scrape data from a web page of your choosing.
- Wikipedia tables are fair game, but folks interested in sport data may want to head over to a Sports Reference site.
A preliminary workflow:
- Use
pd.read_html()
to create a list of DataFrame objects - Identify which DataFrame object in the list is the table you want to work with
- Isolate the list element to create a new DataFrame
- Write the new DataFrame to a CSV file
NOTE: Even though pd.read_html()
has an easier learning curve than BeautifulSoup
, this program will have a few moving parts! Starting with pseudocode to think about the underlying logic can help you organize the components you're building. As you start to build out your own program, think about where you can draw on sample code from the lab (or follow the same conceptual workflow).
ANOTHER NOTE: For many Sports Reference pages, tables further down the page are buried in HTML comments. These tables will not show up when you use pd.read_html()
. We can come back to these "hidden tables" in the final project, but for now, focus on the tables that do show up when you use pd.read_html()
.
Q2B: What challenges or roadblocks did you face working on Q2A? How did you start to tackle or solve them?
## Unstructured Text
So far, this lab has focused on approaches for scraping structured data from the web using Python. But there are other contexts and use cases when we might want to be able to scrape unstructured text from the web and be able to work with that "data" in Python (or write it to a plain-text .txt
file).
For example, take a look at U.S. President John F. Kennedy's 1961 inaugural address: https://en.wikisource.org/wiki/John_F._Kennedy%27s_Inaugural_Address
Imagine we wanted to use Python to analyze the words or terms used in this text. The first step would be to get a plain-text .txt
version of the article text. We would start with the same BeautifulSoup
workflow we've used previously:
- Load URL
- Create BeautifulSoup object
- Isolate section of HTML with speech text
# get page
page = requests.get("https://en.wikisource.org/wiki/John_F._Kennedy%27s_Inaugural_Address")
# create BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')
# show BeautifulSoup object
soup
Next, we want to isolate the section of the page that includes the article text.
<span class="__watch ws-noexport noprint"><figure class="mw-default-size mw-halign-right" typeof="mw:File/Thumb"><span><video id="mwe_player_1" poster="//upload.wikimedia.org/wikipedia/commons/thumb/7/74/John_F._Kennedy_Inauguration_Speech.ogv/220px--John_F._Kennedy_Inauguration_Speech.ogv.jpg" controls="" preload="none" class="mw-file-element" width="220" height="165" data-durationhint="931" data-mwtitle="John_F._Kennedy_Inauguration_Speech.ogv" data-mwprovider="wikimediacommons" resource="/wiki/File:John_F._Kennedy_Inauguration_Speech.ogv"><source src="//upload.wikimedia.org/wikipedia/commons/transcoded/7/74/John_F._Kennedy_Inauguration_Speech.ogv/John_F._Kennedy_Inauguration_Speech.ogv.480p.vp9.webm" type="video/webm; codecs="vp9, opus"" data-transcodekey="480p.vp9.webm" data-width="640" data-height="480" /><source src="//upload.wikimedia.org/wikipedia/commons/7/74/John_F._Kennedy_Inauguration_Speech.ogv" type="video/ogg; codecs="theora, vorbis"" data-width="640" data-height="480" /><source src="//upload.wikimedia.org/wikipedia/commons/transcoded/7/74/John_F._Kennedy_Inauguration_Speech.ogv/John_F._Kennedy_Inauguration_Speech.ogv.240p.vp9.webm" type="video/webm; codecs="vp9, opus"" data-transcodekey="240p.vp9.webm" data-width="320" data-height="240" /><source src="//upload.wikimedia.org/wikipedia/commons/transcoded/7/74/John_F._Kennedy_Inauguration_Speech.ogv/John_F._Kennedy_Inauguration_Speech.ogv.360p.vp9.webm" type="video/webm; codecs="vp9, opus"" data-transcodekey="360p.vp9.webm" data-width="480" data-height="360" /><source src="//upload.wikimedia.org/wikipedia/commons/transcoded/7/74/John_F._Kennedy_Inauguration_Speech.ogv/John_F._Kennedy_Inauguration_Speech.ogv.360p.webm" type="video/webm; codecs="vp8, vorbis"" data-transcodekey="360p.webm" data-width="480" data-height="360" /><track src="https://commons.wikimedia.org/w/api.php?action=timedtext&title=File%3AJohn_F._Kennedy_Inauguration_Speech.ogv&lang=en&trackformat=vtt&origin=%2A" kind="subtitles" type="text/vtt" srclang="en" label="English (en)" data-dir="ltr" /></video></span><figcaption>John F. Kennedy delivering his inaugural address<br />(<small>241.44 MB, 15m 30s, <a href="https://commons.wikimedia.org/wiki/Project:Media_help" class="extiw" title="commons:Project:Media help">help</a>, <a href="https://commons.wikimedia.org/wiki/File:John_F._Kennedy_Inauguration_Speech.ogv" class="extiw" title="commons:File:John F. Kennedy Inauguration Speech.ogv">file info or download</a></small>)</figcaption></figure></span>
<p>Vice President Johnson, Mr. Speaker, Mr. Chief Justice, President Eisenhower, Vice President Nixon, President Truman, Reverend Clergy, fellow citizens:
</p><p>We observe today not a victory of party but a celebration of freedom, symbolizing an end as well as a beginning, signifying renewal as well as change. For I have sworn before you and Almighty God the same solemn oath our forbears prescribed nearly a century and three-quarters ago.
</p><p>The world is very different now. For man holds in his mortal hands the power to abolish all forms of human poverty and all forms of human life. And yet the same revolutionary beliefs for which our forebears fought are still at issue around the globe–the belief that the rights of man come not from the generosity of the state but from the hand of God.
</p><p>We dare not forget today that we are the heirs of that first revolution. Let the word go forth from this time and place, to friend and foe alike, that the torch has been passed to a new generation of Americans, born in this century, tempered by war, disciplined by a hard and bitter peace, proud of our ancient heritage, and unwilling to witness or permit the slow undoing of those human rights to which this nation has always been committed, and to which we are committed today at home and around the world.
</p><p>Let every nation know, whether it wishes us well or ill, that we shall pay any price, bear any burden, meet any hardship, support any friend, oppose any foe to assure the survival and the success of liberty.
</p><p>This much we pledge–and more.
</p><p>To those old allies whose cultural and spiritual origins we share, we pledge the loyalty of faithful friends. United there is little we cannot do in a host of cooperative ventures. Divided there is little we can do; for we dare not meet a powerful challenge at odds and split asunder.
</p><p>To those new states whom we welcome to the ranks of the free, we pledge our word that one form of colonial control shall not have passed away merely to be replaced by a far more iron tyranny. We shall not always expect to find them supporting our view. But we shall always hope to find them strongly supporting their own freedom; and to remember that, in the past, those who foolishly sought power by riding the back of the tiger ended up inside.
</p><p>To those people in the huts and villages of half the globe struggling to break the bonds of mass misery, we pledge our best efforts to help them help themselves, for whatever period is required–not because the communists may be doing it, not because we seek their votes, but because it is right. If a free society cannot help the many who are poor, it cannot save the few who are rich.
</p><p>To our sister republics south of our border, we offer a special pledge: to convert our good words into good deeds in a new alliance for progress; to assist free men and free governments in casting off the chains of poverty. But this peaceful revolution of hope cannot become the prey of hostile powers. Let all our neighbors know that we shall join with them to oppose aggression or subversion anywhere in the Americas. And let every other power know that this Hemisphere intends to remain the master of its own house.
</p><p>To that world assembly of sovereign states, the United Nations, our last best hope in an age where the instruments of war have far outpaced the instruments of peace, we renew our pledge of support, to prevent it from becoming merely a forum for invective, to strengthen its shield of the new and the weak, and to enlarge the area in which its writ may run.
</p><p>Finally, to those nations who would make themselves our adversary, we offer not a pledge but a request: that both sides begin anew the quest for peace, before the dark powers of destruction unleashed by science engulf all humanity in planned or accidental self-destruction.
</p><p>We dare not tempt them with weakness. For only when our arms are sufficient beyond doubt can we be certain beyond doubt that they will never be employed.
</p><p>But neither can two great and powerful groups of nations take comfort from our present course–both sides overburdened by the cost of modern weapons, both rightly alarmed by the steady spread of the deadly atom, yet both racing to alter that uncertain balance of terror that stays the hand of mankind's final war.
</p><p>So let us begin anew–remembering on both sides that civility is not a sign of weakness, and sincerity is always subject to proof. Let us never negotiate out of fear. But let us never fear to negotiate.
</p><p>Let both sides explore what problems unite us instead of belaboring those problems which divide us.
</p><p>Let both sides, for the first time, formulate serious and precise proposals for the inspection and control of arms–and bring the absolute power to destroy other nations under the absolute control of all nations.
</p><p>Let both sides seek to invoke the wonders of science instead of its terrors. Together let us explore the stars, conquer the deserts, eradicate disease, tap the ocean depths and encourage the arts and commerce.
</p><p>Let both sides unite to heed in all corners of the Earth the command of Isaiah to "<a href="/wiki/Bible_(King_James)/Isaiah#Chapter_58" title="Bible (King James)/Isaiah">undo the heavy burdens ... (and) let the oppressed go free.</a>"
</p><p>And if a beachhead of cooperation may push back the jungle of suspicion, let both sides join in creating a new endeavor, not a new balance of power, but a new world of law, where the strong are just and the weak secure and the peace preserved.
</p><p>All this will not be finished in the first one hundred days. Nor will it be finished in the first one thousand days, nor in the life of this Administration, nor even perhaps in our lifetime on this planet. But let us begin.
</p><p>In your hands, my fellow citizens, more than mine, will rest the final success or failure of our course. Since this country was founded, each generation of Americans has been summoned to give testimony to its national loyalty. The graves of young Americans who answered the call to service surround the globe.
</p><p>Now the trumpet summons us again; not as a call to bear arms, though arms we need; not as a call to battle, though embattled we are; but a call to bear the burden of a long twilight struggle, year in and year out, "<a href="/wiki/Bible_(King_James)/Romans#Chapter_12" title="Bible (King James)/Romans">rejoicing in hope, patient in tribulation</a>"–a struggle against the common enemies of man: tyranny, poverty, disease and war itself.
</p><p>Can we forge against these enemies a grand and global alliance, North and South, East and West, that can assure a more fruitful life for all mankind? Will you join in that historic effort?
</p><p>In the long history of the world, only a few generations have been granted the role of defending freedom in its hour of maximum danger. I do not shrink from this responsibility; I welcome it. I do not believe that any of us would exchange places with any other people or any other generation. The energy, the faith, the devotion which we bring to this endeavor will light our country and all who serve it–and the glow from that fire can truly light the world.
</p><p>And so, my fellow Americans: ask not what your country can do for you–ask what you can do for your country.
</p><p>My fellow citizens of the world: ask not what America will do for you, but what together we can do for the freedom of man.
</p><p>Finally, whether you are citizens of America or citizens of the world, ask of us here the same high standards of strength and sacrifice which we ask of you. With a good conscience our only sure reward, with history the final judge of our deeds, let us go forth to lead the land we love, asking His blessing and His help, but knowing that here on Earth God's work must truly be our own.
</p><p><br />
</p>
From taking a look at the page source, we can see <p> (paragraph) tags mark each paragraph. And we can use a Control-F/Command-F search to see that most of the <p> tags appear around the text of the speech.
We can use .find_all("p")
to create a list-like object with each <p>
tag's content as a list item.
# isolate paragraph tags
paragraphs = soup.find_all("p")
# show number of paragraphs
len(paragraphs)
# show first paragraph
paragraphs[0]
# show last paragraph
paragraphs[-1]
We'll need to subset the paragraphs
list to get just the <p>
tags that have the speech text. We can do this using list indexing.
subset = paragraphs[1:29] # subset paragraph tags
subset[0].contents[0].strip() # isolate single paragraph; strip() removes the "\n" newline character
So now we have just the <p> tags connected to the speech. The last step is to use a for loop to iterate over each paragraph in subset, use .contents[] to extract the paragraph contents, and append them to a list. Since the output of BeautifulSoup's .contents[] is still a BeautifulSoup object type, we can use the str() function to convert the value to a string before appending it to a list.
for p in subset: # iterate over paragraphs
    print(p.contents[0].strip()) # show text

lines = [] # empty list for text
for p in subset: # iterate over paragraphs
    lines.append(str(p.contents[0].strip())) # isolate p tag contents and append to list

lines[0] # show first line
Now that we have a working program, we could write each paragraph to a newly-created .txt
file.
with open("output.txt", "a") as f: # new file
    for p in subset: # iterate over paragraphs
        f.write(str(p.contents[0].strip())) # write paragraph to file
        f.write("\n") # preserve line breaks
Putting that all together:
# import statements
import requests, csv, pandas as pd
from bs4 import BeautifulSoup
# load page
page = requests.get("https://en.wikisource.org/wiki/John_F._Kennedy%27s_Inaugural_Address")
# beautifulsoup object
soup = BeautifulSoup(page.text, 'html.parser')
# isolate paragraph tags
paragraphs = soup.find_all('p')
# subset paragraph tags
subset = paragraphs[1:29]
# write output to text file
with open("output.txt", "a") as f: # new file
    for p in subset: # iterate over paragraphs
        f.write(str(p.contents[0].strip())) # write paragraph to file
        f.write("\n") # preserve line breaks
        print(p.contents[0].strip()) # show paragraph
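One caveat worth noting: .contents[0] only grabs the first child of each paragraph, so a paragraph that contains an inline link (like the quoted Bible passages in this speech) will be cut short at the link. If you need the full text of each paragraph, BeautifulSoup's .get_text() method returns all of a tag's text, including text inside nested tags. A sketch of that variation:
# write output using get_text() to keep text inside nested tags
with open("output.txt", "w") as f: # write mode starts a fresh file
    for p in subset: # iterate over paragraphs
        f.write(p.get_text().strip()) # full paragraph text, nested tags included
        f.write("\n") # preserve line breaks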
## Lab Notebook Question 3
Q3A: Select another web page that includes unstructured text. From looking at the public web page, what text do you want to scrape from this web page (i.e. specific sections, multiple paragraphs, etc.)?
A few places to start for unstructured text:
- WikiSource (a library of texts that are not covered by copyright)
- Project Gutenberg (a library of literary works or texts that are not covered by copyright)
Q3B: Take a look at the HTML for this page. What tags or other HTML components do you see around the section of the page you want to work with? For this question, we're thinking about how we will end up writing a program with BeautifulSoup
to isolate a section of the web page.
Q3C: Develop an outline for a Python program that scrapes unstructured text from the web page you selected.
A preliminary workflow:
- Load URL and create BeautifulSoup object
- Isolate section of HTML with your text (either directly or extract from list)
- IF NEEDED: Isolate text elements (create list where each element is a section of text)
- IF NEEDED: Extract text contents (isolate text from each section/paragraph)
- Write text to TXT file
NOTE: This program will have a lot of moving parts! Starting with pseudocode to think about the underlying logic can help you organize the components you're building. As you start to build out your own program, think about where you can draw on sample code from the lab (or follow the same conceptual workflow).
Q3D: What challenges or roadblocks did you face working on Q3C? How did you start to tackle or solve them?
## Oh the Places You Could Go
Once we have the isolated text, we could use Python to generate a list of words used in the speech.
lines = [] # empty list for text
for p in subset: # iterate over paragraphs
    lines.append(str(p.contents[0].strip())) # isolate p tag contents and append to list

lines[0] # show first line
speech = "\n".join(lines) # use list method to concatenate list items
print(speech) # show speech text
wordlist = speech.split() # list of words
print((wordlist[0:120])) # show first 120 words in wordlist
Then, we could use the list of words wordlist
to count term frequency.
# create empty list for word frequency
wordfreq = []
# append number of word appearances to wordfreq list
for word in wordlist:
    wordfreq.append(wordlist.count(word))
# create nested list of lists with sublists that include word and number of appearances
str(list(zip(wordlist, wordfreq)))
Or, we could connect the terms and frequency count using a dictionary's key-value pairs.
# convert lists to dictionary using dictionary comprehension
word_count = {wordlist[i]: wordfreq[i] for i in range(len(wordlist))}
# show dictionary
word_count
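For what it's worth, the standard library's collections.Counter offers a more concise way to build the same kind of word counts; a quick sketch:
# import Counter from the standard library
from collections import Counter

# count how many times each word appears
word_count = Counter(wordlist)

# show the ten most common words
print(word_count.most_common(10))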
We could also put that data in a Pandas DataFrame.
# import pandas
import pandas as pd
# create dataframe
df = pd.DataFrame(list(word_count.items()), columns=['term', 'count'])
# show df with count column sorted
df.sort_values('count')
And once it's in a DataFrame, we could filter or subset the data, by...
Filtering out commonly-used words:
# stopwords
stopWords = ['a', 'an', 'and', 'the', 'has', 'to', 'of', 'that', 'in', 'for', 'this']
# filter out rows that match list value
subset = df[~df['term'].isin(stopWords)]
# show updated df
subset.sort_values('count')
Filtering out rows with terms that only appear 1-2 times.
# filter out terms that only appear 1-2 times
subset = subset[subset['count'] > 2]
# show updated df
subset.sort_values('count')
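Note that the counts above treat capitalized and punctuated forms (for example "The" versus "the") as different terms. A quick sketch of basic normalization that could be applied to wordlist before counting (string.punctuation comes from the standard library):
# import string for its list of punctuation characters
import string

# lowercase each word and strip leading/trailing punctuation
normalized = [word.lower().strip(string.punctuation) for word in wordlist]

# remove any empty strings left over after stripping
normalized = [word for word in normalized if word]

# show the first 20 normalized words
print(normalized[0:20])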
With normalized textual data, we could use Python to do things like:
- count word frequency
- William J. Turkel and Adam Crymble, "Counting Word Frequencies with Python," The Programming Historian 1 (2012), https://programminghistorian.org/en/lessons/counting-frequencies.
- analyze keywords in context using n-grams
- William J. Turkel and Adam Crymble, "Keywords in Context (Using n-grams) with Python," The Programming Historian 1 (2012), https://programminghistorian.org/en/lessons/keywords-in-context-using-n-grams.
Why go to all this trouble? Computational text analysis or other kinds of natural language processing allow us to see patterns and detect change over time in large bodies of textual materials.
A few projects that do this with speeches given by U.S. presidents:
- Google News Lab's Inaugurate project that looks at the subjects of inauguration speeches
- "The Language of the State of the Union," an interactive project from the Atlantic's Ben Schmidt and Mitch Fraas that "reveals how the words presidents use reflect the twists and turns of American history"
- "Mapping the State of the Union,", also from Schmidt and Fraas, that "shows the 1,410 different spots on the globe presidents have referenced in 224 speeches"
## Why Did We Do This?
At this point, your brain probably hurts. Mine does. Why go to all the trouble of building a Python program when we could just copy and paste the text from an Observer article into a .txt
file, or copy and paste data from Sports Reference into an Excel file or Google Sheet document? Why write a program when we could just remove the markup ourselves through manual data manipulation?
That's a fair question.
Scientific disciplines value research that can be reproduced and replicated. A 2013 editorial in PLOS Computational Biology outlines the value of reproducible computational research:
"The importance of replication and reproducibility has recently been exemplified through studies showing that scientific papers commonly leave out experimental details essential for reproduction, studies showing difficulties with replicating published experimental results, an increase in retracted papers, and through a high number of failing clinical trials.
"We want to emphasize that reproducibility is not only a moral responsibility with respect to the scientific field, but that a lack of reproducibility can also be a burden for you as an individual researcher. As an example, a good practice of reproducibility is necessary in order to allow previously developed methodology to be effectively applied on new data, or to allow reuse of code and results for new projects. In other words, good habits of reproducibility may actually turn out to be a time-saver in the longer run.
"We further note that reproducibility is just as much about the habits that ensure reproducible research as the technologies that can make these processes efficient and realistic."
Sandve GK, Nekrutenko A, Taylor J, Hovig E (2013) Ten Simple Rules for Reproducible Computational Research. PLOS Computational Biology 9(10): e1003285. https://doi.org/10.1371/journal.pcbi.1003285.
Manually wrangling or manipulating data makes it difficult for colleagues (and future you) to understand how you got from point A to point B with your data. It also increases the likelihood of human error. In the PLOS Computational Biology article, one of the ten rules the authors outline is "Avoid Manual Data Manipulation Steps." The Python workflows we're covering in this lab move in the direction of automating the data scraping/manipulation process, creating a template or workflow others could implement or adapt.
## How to Submit This Lab (and show your work)
Moving forward, we're going to be submitting lab notebooks using Jupyter Notebook templates.
- If working in JupyterLab (or another desktop IDE), download the .ipynb file to your local computer (File -> Download -> Download as .ipynb)
- If working in Google Colaboratory, MAKE SURE you save a copy to your own Google Drive (File -> Save a copy in Drive). Otherwise your changes will not be saved.
The lab notebook template includes all of the questions as well as pre-created markdown cells for narrative text answers and pre-created code cells for any programs you may need to create.
- Double click on these cells to edit and add your own content
- If questions do not require a code component, you can ignore those cells
- If questions do not require a narrative component, you can ignore those cells
If working in JupyterLab or another desktop IDE, upload the lab notebook template .ipynb
file to Canvas as your lab submission.
If working in Google Colaboratory, submit the link to your notebook (check the sharing permissions, as you would with a Google Doc).
Click here to access the lab notebook template as a Jupyter Notebook (Google Colab, ND Users).
## Lab Notebook Questions
Q1A: Select another Wikipedia page that includes a table. From looking at the public web page, what data do you want to scrape from this web page (i.e. specific table, multiple tables, etc.)? What do you want the resulting data structure to look like (columns, rows, etc)?
Q1B: Take a look at the HTML for this page. What tags or other HTML components do you see around the section of the page you want to work with? For this question, we're thinking about how we will end up writing a program with BeautifulSoup
to isolate a section of the web page.
Q1C: Develop an outline for a Python program that scrapes data from the web page you selected.
A preliminary workflow:
- Load URL and create BeautifulSoup object
- Isolate section of HTML with your table (either directly or extract from list)
- Isolate table row elements (create list where each element is a table row)
- Extract contents from row (isolate the pieces of information from each row)
- Create Pandas DataFrame
- Write extracted row contents to CSV file
NOTE: This program will have a lot of moving parts! Starting with pseudocode to think about the underlying logic can help you organize the components you're building. As you start to build out your own program, think about where you can draw on sample code from the lab (or follow the same conceptual workflow).
Q1D: What challenges or roadblocks did you face working on Q1C? How did you start to tackle or solve them?
Q2A: Develop an outline for a Python program that uses pd.read_html() to scrape data from a web page of your choosing.
- Wikipedia tables are fair game, but folks interested in sport data may want to head over to a Sports Reference site.
A preliminary workflow:
- Use
pd.read_html()
to create a list of DataFrame objects - Identify which DataFrame object in the list is the table you want to work with
- Isolate the list element to create a new DataFrame
- Write the new DataFrame to a CSV file
NOTE: Even though pd.read_html()
has an easier learning curve than BeautifulSoup
, this program will have a few moving parts! Starting with pseudocode to think about the underlying logic can help you organize the components you're building. As you start to build out your own program, think about where you can draw on sample code from the lab (or follow the same conceptual workflow).
ANOTHER NOTE: For many Sports Reference pages, tables further down the page are buried in HTML comments. These tables will not show up when you use pd.read_html()
. We can come back to these "hidden tables" in the final project, but for now, focus on the tables that do show up when you use pd.read_html()
.
Q2B: What challenges or roadblocks did you face working on Q2A? How did you start to tackle or solve them?
Q3A: Select another web page that includes unstructured text. From looking at the public web page, what text do you want to scrape from this web page (i.e. specific sections, multiple paragraphs, etc.)?
A few places to start for unstructured text:
- WikiSource (a library of texts that are not covered by copyright)
- Project Gutenberg (a library of literary works or texts that are not covered by copyright)
Q3B: Take a look at the HTML for this page. What tags or other HTML components do you see around the section of the page you want to work with? For this question, we're thinking about how we will end up writing a program with BeautifulSoup
to isolate a section of the web page.
Q3C: Develop an outline for a Python program that scrapes unstructured text from the web page you selected.
A preliminary workflow:
- Load URL and create BeautifulSoup object
- Isolate section of HTML with your text (either directly or extract from list)
- IF NEEDED: Isolate text elements (create list where each element is a section of text)
- IF NEEDED: Extract text contents (isolate text from each section/paragraph)
- Write text to TXT file
NOTE: This program will have a lot of moving parts! Starting with pseudocode to think about the underlying logic can help you organize the components you're building. As you start to build out your own program, think about where you can draw on sample code from the lab (or follow the same conceptual workflow).
Q3D: What challenges or roadblocks did you face working on Q3C? How did you start to tackle or solve them?