# Introduction to BeautifulSoup
 
`BeautifulSoup` is a Python library for parsing HTML and XML documents, which makes it possible to extract data from web pages.

We will start by importing BeautifulSoup and other libraries. We are going to extract information from the following example webpage:
[https://guyfrancis.github.io/Example.html](https://guyfrancis.github.io/Example.html)

We will use the `requests` library to get the webpage.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests as r

# Set up URL and HTTP headers
url = "https://guyfrancis.github.io/Example.html"
headers = { "User-agent": "Jupyter Hub / Python 3.x Data Science Class" }
print(url)

https://guyfrancis.github.io/Example.html


In [2]:
# Make HTTP request and check response code - should be 200
response = r.get(url, headers=headers)
response.status_code

200

Now we will print the entire webpage. This should look like a raw HTML document.

In [3]:
print(response.text)

<!DOCTYPE html>
<html>
	<head>
		<title>
			Example Webpage
		</title>
		<style>
			table, th, td {
  				border: 2px solid black;
			}
		body {
			background-color:powderblue;
		</style>
	</head>
	<body>
		<h1>This is a Heading</h1>
		<p id='p1'>This is a paragraph.</p>
		<p id='p2'>This is another paragraph.</p>
		<a href='https://bbc.com/sport/football'>A link</a><br>
		<a href='https://restcountries.com/v3.1/name/germany?fields=capital'>Another link</a>
		<br>
		<br>
		<h2>Solar System Planets by Mass</h2>
		<table id='table1'>
			<tr>
				<th>Planet</th>
		        <th>Mass (kg)</th>	
			</tr>
		    <tr>
				<td>Jupiter</td>
				<td>1.90 x 10^27</td>
			</tr>
		    <tr>
				<td>Saturn</td>
				<td>5.68 x 10^26</td>
			</tr>
		    <tr>
				<td>Neptune</td>
				<td>1.02 x 10^26</td>
			</tr>
		    <tr>
				<td>Uranus</td>
				<td>8.68 x 10^25</td>
			</tr>
		    <tr>
				<td>Earth</td>
				<td>5.97 x 10^24</td>
			</tr>
			<tr>
				<td>Venus</td>
				<td>4.87 x 10^24</td>
			</tr>
		

## Using Soup

`BeautifulSoup` *parses* the document into its individual elements so we can look at the contents piece by piece. The following line of code will create a `soup` object which we can examine programmatically. We'll then run some code to examine parts of the document.

In [4]:
# convert the HTML document into a soup object
soup = BeautifulSoup(response.text, 'html.parser')

In [5]:
# print the title - including tags
print(soup.title)

# print just the title text
print(soup.title.text)

<title>
			Example Webpage
		</title>

			Example Webpage
		


In [6]:
# Find all the paragraph elements (that is, elements with p tags)
paras = soup.find_all("p")

# Print the contents of each of the paragraph elements
for para in paras:
    print(para.text)

This is a paragraph.
This is another paragraph.


In [7]:
# Get all the text inside all text elements 
# this includes headings, paragraphs, table cells and links
page_text = soup.body.text.strip()
print(page_text)

This is a Heading
This is a paragraph.
This is another paragraph.
A link
Another link


Solar System Planets by Mass


Planet
Mass (kg)


Jupiter
1.90 x 10^27


Saturn
5.68 x 10^26


Neptune
1.02 x 10^26


Uranus
8.68 x 10^25


Earth
5.97 x 10^24


Venus
4.87 x 10^24


Mars
6.42 x 10^23


Mercury
3.3 x 10^23


In [8]:
# Get the links in the 'a' tags
links = soup.find_all("a")

for link in links:
    url = link.get('href')
    print(url)

https://bbc.com/sport/football
https://restcountries.com/v3.1/name/germany?fields=capital


## Exercise 1

Practice retrieving elements from the webpage by using the `soup.find()` and `soup.find_all()` methods.

1. Get the first `h1` heading and print it, including tags and text.
2. Get the first `h2` heading and print the heading text, excluding tags.
3. Get all the `td` elements. These are table data elements, which hold the contents of the table's cells. Use a for loop to print out the contents of each `td` element.
4. Get the element whose `id` is `table1`. Print it out.
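
The cells above only use `find_all()`, so here is a minimal sketch of how `find()` differs and how an element can be looked up by its `id`. The HTML snippet and the names `demo_soup`, `first_h3`, `items` and `fruit_list` are invented for illustration and are not part of the example webpage.

```python
from bs4 import BeautifulSoup

# A small made-up document, used only for illustration
html = """
<html><body>
  <h3>First subheading</h3>
  <h3>Second subheading</h3>
  <ul id="fruits">
    <li>Apple</li>
    <li>Banana</li>
  </ul>
</body></html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching element (or None if nothing matches)
first_h3 = demo_soup.find("h3")
print(first_h3)        # tag and text
print(first_h3.text)   # text only

# find_all() returns a list-like collection of every matching element
items = demo_soup.find_all("li")
for item in items:
    print(item.text)

# find() can also match on an attribute such as id
fruit_list = demo_soup.find(id="fruits")
print(fruit_list.name)
```

The same pattern should carry over to the `soup` object created earlier, with the tag names and `id` swapped for the ones in each task.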

In [None]:
# Type your code here


## Exercise 2

Retrieve the Wikipedia page: https://en.wikipedia.org/wiki/Weeke

Convert it into a BeautifulSoup object and use it to print out all the URLs of the links on the page.

In [None]:
# Type your code here


## Getting Data From Tables

There are several ways to get data from a table in an HTML page. Here are a couple:

1. Use `BeautifulSoup` to iterate through the table elements, grabbing the contents of each table cell as needed.
2. Use `pandas` to load an entire table into a dataframe.

Method 2 is typically much more straightforward. Method 1, however, gives you the most flexibility over what you retrieve.
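
As a rough illustration of method 1, the sketch below walks through the rows of a small made-up table and collects the text of each cell. The `moons` table and its contents are invented for this example and are not taken from the example webpage.

```python
from bs4 import BeautifulSoup

# A made-up two-column table, used only for illustration
html = """
<table id="moons">
  <tr><th>Moon</th><th>Planet</th></tr>
  <tr><td>Titan</td><td>Saturn</td></tr>
  <tr><td>Europa</td><td>Jupiter</td></tr>
</table>
"""

table_soup = BeautifulSoup(html, "html.parser")
table = table_soup.find("table", id="moons")

rows = []
for tr in table.find_all("tr"):
    # Header cells are th elements and data cells are td elements
    cells = tr.find_all(["th", "td"])
    rows.append([cell.text.strip() for cell in cells])

# The first row holds the column names, the rest hold the data
header, data = rows[0], rows[1:]
print(header)
print(data)
```

Because each cell is handled individually, this approach makes it possible to skip rows, clean values, or pull out attributes as you go, at the cost of more code.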

We'll use method 2 to get the table from our example webpage: [https://guyfrancis.github.io/Example.html](https://guyfrancis.github.io/Example.html)

In [9]:
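# Get the data from our Example webpage and load it into pandas.
from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd
import requests as r

# Set up URL and HTTP headers
url = "https://guyfrancis.github.io/Example.html"
headers = { "User-agent": "Jupyter Hub / Python 3.x Data Science Class" }
response = r.get(url, headers=headers)

# read_html returns a list of dataframes, one per table found on the page.
# The HTML text is wrapped in StringIO because newer pandas versions
# deprecate passing raw HTML directly.
planet_data = pd.read_html(StringIO(response.text))[0]
planet_data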

|   | Planet | Mass (kg) |
|---|---|---|
| 0 | Jupiter | 1.90 x 10^27 |
| 1 | Saturn | 5.68 x 10^26 |
| 2 | Neptune | 1.02 x 10^26 |
| 3 | Uranus | 8.68 x 10^25 |
| 4 | Earth | 5.97 x 10^24 |
| 5 | Venus | 4.87 x 10^24 |
| 6 | Mars | 6.42 x 10^23 |
| 7 | Mercury | 3.3 x 10^23 |


## Exercise 3

Load the Super Bowl Championships Wikipedia page into a `BeautifulSoup` object.

[https://en.wikipedia.org/wiki/List_of_Super_Bowl_champions](https://en.wikipedia.org/wiki/List_of_Super_Bowl_champions)

Print out the text of all the second-level headings `'h2'`.

In [None]:
# Type your code here


## Exercise 4

Load the Super Bowl Championships table from the following Wikipedia page into a `pandas` dataframe.

[https://en.wikipedia.org/wiki/List_of_Super_Bowl_champions](https://en.wikipedia.org/wiki/List_of_Super_Bowl_champions)

Use the dataframe to work out the average attendance across all the Super Bowl games.

In [None]:
# Type your code here


## Extension Activities

There are 3 extension activities to do if you complete Exercises 1-4. You can do any of these and you can work with a class neighbour on them. You don't have to do them in the order shown.

## Extension 1

Pick any website that you are familiar with. Before using `BeautifulSoup`, look at the `robots.txt` file, which will typically be at the root level of the domain:
```
www.example.com/robots.txt
```

Then, depending on permissions in robots.txt, see if you can use BeautifulSoup to extract some data from one or more webpages.
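
One way to check these permissions programmatically is Python's standard-library `urllib.robotparser` module. The sketch below is a minimal example under stated assumptions: the rules, the URLs and the `user_agent` name are all made up, and the rules are parsed from a list of strings so the example runs without a network connection.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules, used only for illustration
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

user_agent = "Data Science Class"  # hypothetical user agent name

# can_fetch() reports whether the rules allow this user agent to request the URL
allowed = parser.can_fetch(user_agent, "https://www.example.com/index.html")
blocked = parser.can_fetch(user_agent, "https://www.example.com/private/data.html")
print(allowed, blocked)
```

For a real site, you would instead call `parser.set_url(...)` with the address of the site's `robots.txt` file, followed by `parser.read()`, before calling `can_fetch()`.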

In [None]:
# Type your code here


## Extension 2

This and the next extension problem are data processing challenges. We will look more at these types of challenge when we cover 'advanced data wrangling'.

First make sure you have loaded the contents of the [https://guyfrancis.github.io/Example.html](https://guyfrancis.github.io/Example.html) table into a dataframe called `planet_data`.

The `planet_data` masses are stored as strings, not numbers. Your task is to convert them to numbers, specifically, `float`.

Create two new columns in the dataframe:
1. A column that represents the `str` values as `float`, with the string value converted to a numerical `float` value.
2. A column that represents the numerical values, but as multiples of Earth's mass. 

Use your answers to work out the total mass of all the planets in the solar system, measured in 'Earth masses'.

Hints: You will need to convert each `str` value to a `float`, and you will need the `str.split()` method to split each string into pieces that can be converted to numbers. You will also need to iterate through the relevant rows and columns to make the changes. The easiest way to do this last part is with the dataframe's `.loc` indexer.
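
To make the splitting hint concrete, here is a minimal sketch for a single string. The value `"2.5 x 10^3"` is made up, and the variable names are only illustrative. Applying the idea to every row of `planet_data` is left as the exercise.

```python
# A made-up value in the same "a x 10^b" format, used only for illustration
text = "2.5 x 10^3"

# Split on " x " to separate the coefficient from the power of ten
coefficient_text, power_text = text.split(" x ")

# Split "10^3" on "^" to get the base and the exponent
base_text, exponent_text = power_text.split("^")

# Convert each piece to a number and combine them
value = float(coefficient_text) * float(base_text) ** int(exponent_text)
print(value)
```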

In [None]:
# Type your code here


## Extension 3

Use the data from the Super Bowl champions Wikipedia page to find the average winning score and the average losing score across all the Super Bowls. This will require some data processing first.

In [None]:
# Type your code here
