# Introduction to BeautifulSoup
 
Beautiful Soup is a Python library for extracting information from HTML documents such as web pages.

We will start by importing BeautifulSoup and other libraries. We are going to extract information from the following webpage:
```
    https://guyfrancis.github.io/Example.html
```

We will use the `requests` library to get the webpage.

In [45]:
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import requests as r
import re

# Set up URL and HTTP headers
url = "https://guyfrancis.github.io/Example.html"
headers = { "User-agent": "Data Science Class" }

In [46]:
# Make HTTP request and check response code - should be 200
response = r.get(url, headers=headers)
response.status_code

200
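
Checking the status code by eye works in a notebook, but in a script the same check can be automated. The following is a minimal sketch: the helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, user_agent="Data Science Class", timeout=10):
    """Return the body of `url` as text, raising on HTTP or network errors.

    `fetch_html` is a hypothetical helper written for this example.
    """
    response = requests.get(
        url,
        headers={"User-Agent": user_agent},
        timeout=timeout,  # give up instead of waiting indefinitely
    )
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text


# Example call (needs network access):
# html = fetch_html("https://guyfrancis.github.io/Example.html")
```

With this approach a failed request stops the program with an exception rather than passing an error page on to the parser.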

Now we will print the entire response body. This is the raw, unprocessed HTML of the page.

In [47]:
print(response.text)

<!DOCTYPE html>
<html>
	<head>
		<title>
			Example Webpage
		</title>
		<style>
			table, th, td {
  				border: 2px solid black;
			}
		body {
			background-color:powderblue;
		</style>
	</head>
	<body>
		<h1>This is a Heading</h1>
		<p id='p1'>This is a paragraph.</p>
		<p id='p2'>This is another paragraph.</p>
		<a href='https://bbc.com/sport/football'>A link</a><br>
		<a href='https://restcountries.com/v3.1/name/germany?fields=capital'>Another link</a>
		<br>
		<br>
		<h2>Solar System Planets by Mass</h2>
		<table id='table1'>
			<tr>
				<th>Planet</th>
		        <th>Mass (kg)</th>	
			</tr>
		    <tr>
				<td>Jupiter</td>
				<td>1.90 x 10^27</td>
			</tr>
		    <tr>
				<td>Saturn</td>
				<td>5.68 x 10^26</td>
			</tr>
		    <tr>
				<td>Neptune</td>
				<td>1.02 x 10^26</td>
			</tr>
		    <tr>
				<td>Uranus</td>
				<td>8.68 x 10^25</td>
			</tr>
		    <tr>
				<td>Earth</td>
				<td>5.97 x 10^24</td>
			</tr>
			<tr>
				<td>Venus</td>
				<td>4.87 x 10^24</td>
			</tr>
		

## Using Soup

BeautifulSoup 'parses' the document into its individual components so we can look at the contents piece by piece. The following line of code will create a `soup` object which we can examine programmatically. We'll then run some code to examine parts of the document.

In [48]:
# convert the HTML document into a soup object
soup = BeautifulSoup(response.text, 'html.parser')

In [49]:
# print the title 
print(soup.title.text)


			Example Webpage
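
The title printed above keeps the newlines and tabs that surround it in the HTML source. As a small sketch, using an inline HTML string as a stand-in for the page, `get_text(strip=True)` returns the same text with that whitespace trimmed:

```python
from bs4 import BeautifulSoup

# Stand-in for the <head> of the example page (an assumption for this sketch)
html = """
<html>
	<head>
		<title>
			Example Webpage
		</title>
	</head>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

raw_title = soup.title.text                    # keeps the newlines and tabs
clean_title = soup.title.get_text(strip=True)  # surrounding whitespace removed

print(repr(raw_title))
print(repr(clean_title))
```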
		


In [50]:
# Find all the paragraph elements (that is elements with p tags)
paras = soup.find_all("p")

# Print the contents of each of the paragraph elements
for para in paras:
    print(para.text)

This is a paragraph.
This is another paragraph.
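
`find_all` returns every match. When only one element is wanted, it can be looked up by its `id` attribute instead. The sketch below uses an inline copy of the two paragraphs as stand-in HTML and shows two equivalent lookups: `find` with an `id` keyword and `select_one` with a CSS selector.

```python
from bs4 import BeautifulSoup

# Stand-in for the paragraphs on the example page
html = """
<body>
    <p id='p1'>This is a paragraph.</p>
    <p id='p2'>This is another paragraph.</p>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# Look up a single element by tag name and id attribute
first = soup.find("p", id="p1")

# The same kind of lookup written as a CSS selector
second = soup.select_one("p#p2")

print(first.text)
print(second.text)
```

Both calls return `None` when nothing matches, so it is worth checking the result before reading `.text` on pages whose structure is not known in advance.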


In [51]:
# Get all the text inside all text elements - this includes headings, paragraphs, table cells and links
page_text = soup.body.text.strip()
print(page_text)

This is a Heading
This is a paragraph.
This is another paragraph.
A link
Another link


Solar System Planets by Mass


Planet
Mass (kg)


Jupiter
1.90 x 10^27


Saturn
5.68 x 10^26


Neptune
1.02 x 10^26


Uranus
8.68 x 10^25


Earth
5.97 x 10^24


Venus
4.87 x 10^24


Mars
6.42 x 10^23


Mercury
3.3 x 10^23
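
The blank lines in the output above come from the whitespace between tags in the HTML source. As a sketch on a shortened stand-in for the page body, `stripped_strings` yields only the non-empty pieces of text, and `get_text` with a separator and `strip=True` joins those same pieces into one string:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the body of the example page
html = """
<body>
    <h2>Solar System Planets by Mass</h2>
    <table>
        <tr><th>Planet</th><th>Mass (kg)</th></tr>
        <tr><td>Jupiter</td><td>1.90 x 10^27</td></tr>
    </table>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# One entry per piece of text, with whitespace-only strings skipped
lines = list(soup.body.stripped_strings)

# The same pieces joined into a single newline-separated string
compact_text = soup.body.get_text("\n", strip=True)

print(lines)
print(compact_text)
```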


In [52]:
# Get the links in the 'a' tags
links = soup.find_all("a")

for link in links:
    # Use a separate name so the page `url` defined earlier is not overwritten
    href = link.get('href')
    print(href)

https://bbc.com/sport/football
https://restcountries.com/v3.1/name/germany?fields=capital
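
Both links on this page are absolute URLs, but many pages also contain relative links and `a` tags with no `href` at all. The following sketch handles both cases. The relative link `planets.html` and the `name='top'` anchor are made up for illustration and do not appear on the example page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://guyfrancis.github.io/Example.html"

# Stand-in HTML: the second and third tags are hypothetical
html = """
<a href='https://bbc.com/sport/football'>A link</a>
<a href='planets.html'>A relative link</a>
<a name='top'>An anchor without href</a>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the <a> tags that actually have an href attribute
absolute_links = [
    urljoin(base_url, link["href"])  # absolute URLs pass through unchanged
    for link in soup.find_all("a", href=True)
]

for href in absolute_links:
    print(href)
```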


## Getting Data From Tables

There are several ways to get data from a table in an HTML page. Here are a couple:

1. Use soup to iterate through the table elements, grabbing the contents of each table cell as needed.
2. Use pandas to load an entire table into a dataframe.

Method 2 is more straightforward; however, Method 1 is shown below for reference.

In [54]:
# Method 1: walk through the table rows with soup
planets = []
masses = []
pmass_dict = {}
table = soup.find('table')

for row in table.find_all('tr'):
    cells = row.find_all('td')
    # The header row holds <th> cells rather than <td>, so it is skipped here
    if len(cells) == 2:
        planets.append(cells[0].text)
        masses.append(cells[1].text)
        pmass_dict[cells[0].text] = cells[1].text
print(planets, masses)
print(pmass_dict)

['Jupiter', 'Saturn', 'Neptune', 'Uranus', 'Earth', 'Venus', 'Mars', 'Mercury'] ['1.90 x 10^27', '5.68 x 10^26', '1.02 x 10^26', '8.68 x 10^25', '5.97 x 10^24', '4.87 x 10^24', '6.42 x 10^23', '3.3 x 10^23']
{'Jupiter': '1.90 x 10^27', 'Saturn': '5.68 x 10^26', 'Neptune': '1.02 x 10^26', 'Uranus': '8.68 x 10^25', 'Earth': '5.97 x 10^24', 'Venus': '4.87 x 10^24', 'Mars': '6.42 x 10^23', 'Mercury': '3.3 x 10^23'}
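
The loop above hard-codes the assumption that each data row has exactly two cells. A slightly more general sketch, shown here on a shortened stand-in for the planet table, reads the column names from the `th` cells and builds one dictionary per data row:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the planet table on the example page
html = """
<table id='table1'>
    <tr><th>Planet</th><th>Mass (kg)</th></tr>
    <tr><td>Jupiter</td><td>1.90 x 10^27</td></tr>
    <tr><td>Saturn</td><td>5.68 x 10^26</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="table1")
rows = table.find_all("tr")

# Column names come from the <th> cells of the first row
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

# Each remaining row becomes a {column name: cell text} dictionary
records = [
    dict(zip(headers, (td.get_text(strip=True) for td in row.find_all("td"))))
    for row in rows[1:]
]

print(records)
```

This sketch assumes the header is the first row of the table, which holds for the example page but not for every table.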


In [55]:
# Now we need to convert the masses into numbers
for key, value in pmass_dict.items():
    mantissa = float(value.split('x')[0])
    power = int(value.split('x')[1].split('^')[1])
    pmass_dict[key]=mantissa*(10**power)
print(pmass_dict)

{'Jupiter': 1.8999999999999998e+27, 'Saturn': 5.68e+26, 'Neptune': 1.02e+26, 'Uranus': 8.68e+25, 'Earth': 5.969999999999999e+24, 'Venus': 4.87e+24, 'Mars': 6.419999999999999e+23, 'Mercury': 3.2999999999999996e+23}
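
Values such as `1.8999999999999998e+27` appear because `10**27` cannot be stored exactly as a float, so the multiplication adds a small rounding error. One alternative, sketched below, is to rewrite the text as a single scientific-notation literal and let `float()` parse it in one step. The helper `parse_mass` and its regular expression are assumptions written for this example, and they only accept the `a x 10^b` form used on this page.

```python
import re

# Matches text such as "1.90 x 10^27": a mantissa, " x 10^", then an exponent
MASS_PATTERN = re.compile(r"^\s*([\d.]+)\s*x\s*10\^(-?\d+)\s*$")


def parse_mass(text):
    """Convert a string like '1.90 x 10^27' into a float."""
    match = MASS_PATTERN.match(text)
    if match is None:
        raise ValueError(f"Unrecognised mass format: {text!r}")
    mantissa, power = match.groups()
    # Build "1.90e27" and parse it once, avoiding a separate multiplication
    return float(f"{mantissa}e{power}")


print(parse_mass("1.90 x 10^27"))
print(parse_mass("3.3 x 10^23"))
```

Raising `ValueError` on unexpected input makes a change in the page's format visible immediately instead of producing silently wrong numbers.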


In [56]:
# Now we will convert into 'earth masses'
earth_mass = pmass_dict['Earth']
for key, value in pmass_dict.items():
    pmass_dict[key] = value / earth_mass
print(pmass_dict)

{'Jupiter': 318.2579564489112, 'Saturn': 95.142378559464, 'Neptune': 17.085427135678394, 'Uranus': 14.539363484087104, 'Earth': 1.0, 'Venus': 0.8157453936348409, 'Mars': 0.10753768844221105, 'Mercury': 0.05527638190954774}


In [57]:
# Let's print out what we've worked out, nicely formatted
for planet, mass in pmass_dict.items():
    print(f"The mass of {planet} is {mass:.2f} earth masses.")

The mass of Jupiter is 318.26 earth masses.
The mass of Saturn is 95.14 earth masses.
The mass of Neptune is 17.09 earth masses.
The mass of Uranus is 14.54 earth masses.
The mass of Earth is 1.00 earth masses.
The mass of Venus is 0.82 earth masses.
The mass of Mars is 0.11 earth masses.
The mass of Mercury is 0.06 earth masses.


In [62]:
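# Method 2: pandas parses every <table> on the page into a list of dataframes; [0] selects the first
table_data = pd.read_html(response.content)[0]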
table_data

    Planet     Mass (kg)
0  Jupiter  1.90 x 10^27
1   Saturn  5.68 x 10^26
2  Neptune  1.02 x 10^26
3   Uranus  8.68 x 10^25
4    Earth  5.97 x 10^24
5    Venus  4.87 x 10^24
6     Mars  6.42 x 10^23
7  Mercury   3.3 x 10^23


In [60]:
print(table_data)

    Planet     Mass (kg)
0  Jupiter  1.90 x 10^27
1   Saturn  5.68 x 10^26
2  Neptune  1.02 x 10^26
3   Uranus  8.68 x 10^25
4    Earth  5.97 x 10^24
5    Venus  4.87 x 10^24
6     Mars  6.42 x 10^23
7  Mercury   3.3 x 10^23
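
The masses in the dataframe are still strings, just as they were in Method 1. The sketch below shows one way to convert the column to numbers and add a column of Earth masses. It builds a three-row dataframe by hand as a stand-in for `table_data`, and the column name `Earth masses` is an assumption made for this example.

```python
import pandas as pd

# Hand-built stand-in for the dataframe returned by pd.read_html
table_data = pd.DataFrame({
    "Planet": ["Jupiter", "Earth", "Mercury"],
    "Mass (kg)": ["1.90 x 10^27", "5.97 x 10^24", "3.3 x 10^23"],
})

# Rewrite "1.90 x 10^27" as "1.90e27", then convert each string to a float
table_data["Mass (kg)"] = (
    table_data["Mass (kg)"]
    .str.replace(" x 10^", "e", regex=False)
    .map(float)
)

# Divide every mass by the mass of Earth
earth_mass = table_data.loc[table_data["Planet"] == "Earth", "Mass (kg)"].iloc[0]
table_data["Earth masses"] = table_data["Mass (kg)"] / earth_mass

print(table_data)
```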
