# Data Extraction from HTML/Web

This project is a practice exercise developed as part of the IBM Data Science Certificate Program on Coursera. It demonstrates the use of Python for:
* HTML/Web scraping
* Using the BeautifulSoup Python library
* Extracting data and HTML elements

## Project 1 Scenario

You are asked to extract data such as HTML elements, tags, attributes, etc. from the given hypothetical webpages and tables.

## Steps

### 1. Define HTML

In [1]:
html = "<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3> \
<b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p> \
<h3>Stephen Curry</h3><p> Salary: $85,000,000</p> \
<h3>Kevin Durant</h3><p> Salary: $73,200,000</p></body></html>"

### 2. Import Required Libraries

In [2]:
# Import BeautifulSoup & requests
from bs4 import BeautifulSoup
import requests

### 3. Extract Data

#### Display HTML in nested structure

In [3]:
# Pass the HTML into BeautifulSoup constructor
soup = BeautifulSoup(html, 'html5lib')

# Print nested structure
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Lebron James
   </b>
  </h3>
  <p>
   Salary: $ 92,000,000
  </p>
  <h3>
   Stephen Curry
  </h3>
  <p>
   Salary: $85,000,000
  </p>
  <h3>
   Kevin Durant
  </h3>
  <p>
   Salary: $73,200,000
  </p>
 </body>
</html>



#### Extract tags information

In [4]:
# Title
tag_object = soup.title
print("tag object: ", tag_object) 
print("tag object type: ", type(tag_object))

tag object:  <title>Page Title</title>
tag object type:  <class 'bs4.element.Tag'>


In [5]:
# H3
tag_object = soup.h3
print("tag object: ", tag_object) 
print("tag object type: ", type(tag_object))

tag object:  <h3> <b id="boldest">Lebron James</b></h3>
tag object type:  <class 'bs4.element.Tag'>


In [6]:
# Child of H3
tag_child = tag_object.b
print("tag child: ", tag_child) 
print("tag child type: ", type(tag_child))

tag child:  <b id="boldest">Lebron James</b>
tag child type:  <class 'bs4.element.Tag'>


In [7]:
# Parent of H3
tag_parent = tag_object.parent
print("tag parent: ", tag_parent) 
print("tag parent type: ", type(tag_parent))

tag parent:  <body><h3> <b id="boldest">Lebron James</b></h3><p> Salary: $ 92,000,000 </p> <h3>Stephen Curry</h3><p> Salary: $85,000,000</p> <h3>Kevin Durant</h3><p> Salary: $73,200,000</p></body>
tag parent type:  <class 'bs4.element.Tag'>


In [8]:
# Siblings of H3
sibling_1 = tag_object.next_sibling
print("sibling 1: ", sibling_1)

sibling_2 = sibling_1.next_sibling
print("sibling 2: ", sibling_2)

sibling_3 = sibling_2.next_sibling
print("sibling 3: ", sibling_3)

sibling_4 = sibling_3.next_sibling
print("sibling 4: ", sibling_4)

sibling 1:  <p> Salary: $ 92,000,000 </p>
sibling 2:   
sibling 3:  <h3>Stephen Curry</h3>
sibling 4:  <p> Salary: $85,000,000</p>
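
The whitespace between tags counts as a sibling of its own, which is why sibling 2 above prints as blank. The following is a minimal sketch of `find_next_sibling()`, which skips text nodes and returns the next matching tag. The `sample` markup and the variable names are illustrative assumptions modelled on the `html` string defined earlier, and `html.parser` is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, modelled on the html string above)
sample = (
    "<body><h3>Lebron James</h3> <p>Salary: $92,000,000</p> "
    "<h3>Stephen Curry</h3></body>"
)
sample_soup = BeautifulSoup(sample, "html.parser")

first_h3 = sample_soup.h3

# next_sibling returns whatever node comes next, here a whitespace text node
raw_sibling = first_h3.next_sibling

# find_next_sibling() skips text nodes and returns the next matching tag
salary_tag = first_h3.find_next_sibling("p")
next_player = first_h3.find_next_sibling("h3")

print(repr(raw_sibling))
print(salary_tag.get_text(strip=True))
print(next_player.get_text(strip=True))
```

Using `find_next_sibling()` avoids chaining several `next_sibling` calls just to step over whitespace.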


#### Extracting HTML Attributes

In [9]:
# Extract id
tag_child['id']

'boldest'

In [10]:
# Alternatively, use the get() method
tag_child.get('id')

'boldest'

In [11]:
# Access all attributes as a dictionary
tag_child.attrs

{'id': 'boldest'}
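
The two access styles above behave differently when an attribute is missing. The sketch below, which uses a one-tag sample string chosen only for illustration, shows that `get()` returns `None` (or a supplied default) while square-bracket access raises `KeyError`.

```python
from bs4 import BeautifulSoup

# Illustrative one-tag document (an assumption for this sketch)
sample_soup = BeautifulSoup("<b id='boldest'>Lebron James</b>", "html.parser")
bold_tag = sample_soup.b

# get() returns None, or the supplied default, when the attribute is absent
missing = bold_tag.get("class")
fallback = bold_tag.get("class", "no-class")

# Square-bracket access raises KeyError for a missing attribute
try:
    bold_tag["class"]
    raised = False
except KeyError:
    raised = True

# has_attr() checks for an attribute without reading its value
has_id = bold_tag.has_attr("id")

print(missing, fallback, raised, has_id)
```

When scraping pages whose markup is not fully known, `get()` is usually the safer choice.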

#### Extract Strings

In [12]:
# Display string
tag_string = tag_child.string
print("string: ", tag_string)
print("type: ", type(tag_string))

string:  Lebron James
type:  <class 'bs4.element.NavigableString'>


In [13]:
# Convert to Python string object
unicode_string = str(tag_string)
unicode_string

'Lebron James'
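
`.string` only works when a tag has a single child. The sketch below uses an illustrative sample (the `<i>` element is an assumption added for the demonstration) to compare `.string` with `get_text()`, which concatenates all text inside a tag.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the h3 has several children (b, whitespace, i)
sample = "<h3><b>Lebron James</b> <i>(forward)</i></h3>"
sample_soup = BeautifulSoup(sample, "html.parser")

# A tag with exactly one text child exposes it through .string
single_child = sample_soup.b.string

# A tag with several children has no single string, so .string is None
multi_child = sample_soup.h3.string

# get_text() joins the text of all descendants
full_text = sample_soup.h3.get_text()

print(single_child, multi_child, full_text)
```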

### 4. Filtering Data

In [14]:
# Define table
table = "<table><tr><td id='flight'>Flight No</td><td>Launch site</td> \
<td>Payload mass</td></tr><tr> <td>1</td> \
<td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a></td> \
<td>300 kg</td></tr><tr><td>2</td> \
<td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td> \
<td>94 kg</td></tr><tr><td>3</td> \
<td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td> \
<td>80 kg</td></tr></table>"

# Display table
table_bs = BeautifulSoup(table, 'html5lib')
table_bs

<html><head></head><body><table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td> <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td> <td>300 kg</td></tr><tr><td>2</td> <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td> <td>94 kg</td></tr><tr><td>3</td> <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td> <td>80 kg</td></tr></tbody></table></body></html>

In [15]:
# Locate table rows
table_rows = table_bs.find_all('tr')
table_rows

[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <tr> <td>1</td> <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td> <td>300 kg</td></tr>,
 <tr><td>2</td> <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td> <td>94 kg</td></tr>,
 <tr><td>3</td> <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td> <td>80 kg</td></tr>]

In [16]:
# Locate first row
first_row = table_rows[0]
first_row

<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>

In [17]:
# Type of first row
print(type(first_row))

<class 'bs4.element.Tag'>


In [18]:
# Print first cell of first row
first_row.td

<td id="flight">Flight No</td>

In [19]:
# Print elements in each row
for i, row in enumerate(table_rows):
    print("row", i, "is", row)

row 0 is <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>
row 1 is <tr> <td>1</td> <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td> <td>300 kg</td></tr>
row 2 is <tr><td>2</td> <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td> <td>94 kg</td></tr>
row 3 is <tr><td>3</td> <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td> <td>80 kg</td></tr>


In [20]:
# Print row, column and data
for i, row in enumerate(table_rows):
    # Print row number
    print("row", i)
    cells = row.find_all('td')
    for j, cell in enumerate(cells):
        # Print column number and cell content
        print('column', j, "cell", cell)

row 0
column 0 cell <td id="flight">Flight No</td>
column 1 cell <td>Launch site</td>
column 2 cell <td>Payload mass</td>
row 1
column 0 cell <td>1</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td>
column 2 cell <td>300 kg</td>
row 2
column 0 cell <td>2</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
column 2 cell <td>94 kg</td>
row 3
column 0 cell <td>3</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td>
column 2 cell <td>80 kg</td>
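
The loop above prints whole `<td>` tags. If only the cell text is needed, a nested list comprehension is one possible approach. The sketch below assumes a shortened, well-formed version of the table, and `sample_table` and the other names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened, well-formed version of the table above (an assumption)
sample_table = (
    "<table>"
    "<tr><td>Flight No</td><td>Launch site</td><td>Payload mass</td></tr>"
    "<tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>"
    "<td>300 kg</td></tr>"
    "<tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>"
    "<td>94 kg</td></tr>"
    "</table>"
)
sample_bs = BeautifulSoup(sample_table, "html.parser")

# One inner list per row, holding the stripped text of each cell
rows_as_text = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in sample_bs.find_all("tr")
]

# Split the header row from the data rows
header, *records = rows_as_text

print(header)
print(records)
```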


In [21]:
# Pass a list of tag names to match both rows and cells
list_input = table_bs.find_all(name=['tr', 'td'])
list_input

[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <td id="flight">Flight No</td>,
 <td>Launch site</td>,
 <td>Payload mass</td>,
 <tr> <td>1</td> <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td> <td>300 kg</td></tr>,
 <td>1</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td>,
 <td>300 kg</td>,
 <tr><td>2</td> <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td> <td>94 kg</td></tr>,
 <td>2</td>,
 <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>,
 <td>94 kg</td>,
 <tr><td>3</td> <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td> <td>80 kg</td></tr>,
 <td>3</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td>,
 <td>80 kg</td>]

In [22]:
# Display only elements with id 'flight'
table_bs.find_all(id = 'flight')

[<td id="flight">Flight No</td>]

In [23]:
# Locate elements linking to the Florida Wikipedia page
list_input = table_bs.find_all(href='https://en.wikipedia.org/wiki/Florida')
list_input

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]

In [24]:
# Locate all anchors with links
table_bs.find_all('a', href=True)

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]

In [25]:
# Locate all anchors without links
table_bs.find_all('a', href=False)

[<a></a>, <a> </a>]

In [26]:
# Find string 'Florida'
table_bs.find_all(string = 'Florida')

['Florida', 'Florida']
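
Passing a plain string to `string=` matches only text nodes that are exactly equal to it, so a cell such as `Florida ` with a trailing space would be missed. The sketch below passes a compiled regular expression instead, using a small illustrative table (an assumption).

```python
import re

from bs4 import BeautifulSoup

# Illustrative row: the second cell has a trailing space
sample = "<table><tr><td>Florida</td><td>Florida </td><td>Texas</td></tr></table>"
sample_bs = BeautifulSoup(sample, "html.parser")

# Exact match: only the text node equal to 'Florida'
exact = sample_bs.find_all(string="Florida")

# Pattern match: any text node containing 'Florida'
pattern = sample_bs.find_all(string=re.compile("Florida"))

print(len(exact), len(pattern))
```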

### 5. Find Specific Elements

In [27]:
# Define table
two_tables="<h3>Rocket Launch </h3> \
<p><table class='rocket'> \
<tr><td>Flight No</td><td>Launch site</td><td>Payload mass</td></tr> \
<tr><td>1</td><td>Florida</td><td>300 kg</td></tr> \
<tr><td>2</td><td>Texas</td><td>94 kg</td></tr> \
<tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p>\
<p><h3>Pizza Party</h3> \
<table class='pizza'> \
<tr><td>Pizza Place</td><td>Orders</td><td>Slices </td></tr> \
<tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr> \
<tr><td>Little Caesars</td><td>12</td><td >144 </td></tr> \
<tr><td>Papa John's</td><td>15 </td><td>165</td></tr></table></p>"

In [28]:
# Create object using BeautifulSoup
two_tables_bs = BeautifulSoup(two_tables, 'html.parser')
two_tables_bs

<h3>Rocket Launch </h3> <p><table class="rocket"> <tr><td>Flight No</td><td>Launch site</td><td>Payload mass</td></tr> <tr><td>1</td><td>Florida</td><td>300 kg</td></tr> <tr><td>2</td><td>Texas</td><td>94 kg</td></tr> <tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party</h3> <table class="pizza"> <tr><td>Pizza Place</td><td>Orders</td><td>Slices </td></tr> <tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr> <tr><td>Little Caesars</td><td>12</td><td>144 </td></tr> <tr><td>Papa John's</td><td>15 </td><td>165</td></tr></table></p>

In [29]:
# Find the first table
two_tables_bs.find('table')

<table class="rocket"> <tr><td>Flight No</td><td>Launch site</td><td>Payload mass</td></tr> <tr><td>1</td><td>Florida</td><td>300 kg</td></tr> <tr><td>2</td><td>Texas</td><td>94 kg</td></tr> <tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>

In [30]:
# Find the second table using its class
two_tables_bs.find('table', class_= 'pizza')

<table class="pizza"> <tr><td>Pizza Place</td><td>Orders</td><td>Slices </td></tr> <tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr> <tr><td>Little Caesars</td><td>12</td><td>144 </td></tr> <tr><td>Papa John's</td><td>15 </td><td>165</td></tr></table>
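
As an alternative to `find()` with `class_`, BeautifulSoup also accepts CSS selectors through `select()` and `select_one()`. The sketch below uses a reduced two-table sample (an assumption) and is meant as an illustration, not a replacement for the approach above.

```python
from bs4 import BeautifulSoup

# Reduced version of the two tables above (an assumption)
sample = (
    "<table class='rocket'><tr><td>Florida</td></tr></table>"
    "<table class='pizza'><tr><td>Papa John's</td></tr></table>"
)
sample_bs = BeautifulSoup(sample, "html.parser")

# Keyword-argument lookup, as used above
pizza_by_find = sample_bs.find("table", class_="pizza")

# Equivalent CSS selector lookup
pizza_by_css = sample_bs.select_one("table.pizza")

# Selectors can also reach straight into the cells of one table
pizza_cells = [td.get_text(strip=True) for td in sample_bs.select("table.pizza td")]

print(pizza_by_find == pizza_by_css)
print(pizza_cells)
```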

## Project 2 Scenario

You are asked to download a web page and scrape its contents, including links, images and HTML tables.

#### Web Page

In [31]:
# Define URL
url = 'http://www.ibm.com'

# Get request as text
data = requests.get(url).text

# Create object using BeautifulSoup
soup = BeautifulSoup(data, 'html5lib')

In [32]:
# Scrape all links
for link in soup.find_all('a', href = True):
    print(link.get('href'))

https://www.ibm.com/hybrid-cloud?lnk=hpUSbt1


In [33]:
# Scrape all images
for link in soup.find_all('img'):
    print(link.get('src'))
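
Scraped `href` and `src` values are often relative paths. The sketch below shows one way to resolve them against the page URL with `urllib.parse.urljoin`. The `base_url` and the `sample_page` markup are placeholders invented for the illustration, so no network request is made.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Placeholder page URL and markup (assumptions, not taken from a real site)
base_url = "https://www.example.com/products/"
sample_page = (
    "<a href='/about'>About</a>"
    "<a href='pricing.html'>Pricing</a>"
    "<a href='https://www.ibm.com/cloud'>Cloud</a>"
    "<img src='../img/logo.png'>"
)
page_soup = BeautifulSoup(sample_page, "html.parser")

# Resolve every link and image path against the page URL
absolute_links = [
    urljoin(base_url, a["href"]) for a in page_soup.find_all("a", href=True)
]
absolute_images = [
    urljoin(base_url, img["src"]) for img in page_soup.find_all("img", src=True)
]

print(absolute_links)
print(absolute_images)
```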

#### HTML Table

In [34]:
# Define table URL
url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html'

In [35]:
## Examine contents of webpage
# Get request in text
data = requests.get(url).text

# Create object
soup = BeautifulSoup(data, 'html5lib')

In [36]:
# Find html table
table = soup.find('table')

In [37]:
# Find all rows from the table
for row in table.find_all('tr'):
    cols = row.find_all('td')
    color_name = cols[2].string
    color_code = cols[3].text
    print('{}--->{}'.format(color_name, color_code))

Color Name--->Hex Code#RRGGBB
lightsalmon--->#FFA07A
salmon--->#FA8072
darksalmon--->#E9967A
lightcoral--->#F08080
coral--->#FF7F50
tomato--->#FF6347
orangered--->#FF4500
gold--->#FFD700
orange--->#FFA500
darkorange--->#FF8C00
lightyellow--->#FFFFE0
lemonchiffon--->#FFFACD
papayawhip--->#FFEFD5
moccasin--->#FFE4B5
peachpuff--->#FFDAB9
palegoldenrod--->#EEE8AA
khaki--->#F0E68C
darkkhaki--->#BDB76B
yellow--->#FFFF00
lawngreen--->#7CFC00
chartreuse--->#7FFF00
limegreen--->#32CD32
lime--->#00FF00
forestgreen--->#228B22
green--->#008000
powderblue--->#B0E0E6
lightblue--->#ADD8E6
lightskyblue--->#87CEFA
skyblue--->#87CEEB
deepskyblue--->#00BFFF
lightsteelblue--->#B0C4DE
dodgerblue--->#1E90FF
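
The loop above prints the header row along with the data and assumes every row has at least four cells. The sketch below is one possible variant that skips the header, guards against short rows, and stores the result in a dictionary. The `sample_colors` table, including its header labels and the incomplete row, is a hypothetical stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the colour table (header labels are assumptions)
sample_colors = (
    "<table>"
    "<tr><td>Number</td><td>Color</td><td>Color Name</td><td>Hex Code</td></tr>"
    "<tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>"
    "<tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>"
    "<tr><td>incomplete row</td></tr>"
    "</table>"
)
color_table = BeautifulSoup(sample_colors, "html.parser").find("table")

color_codes = {}
# Skip the first row, which holds the column headers
for row in color_table.find_all("tr")[1:]:
    cols = row.find_all("td")
    # Ignore rows that do not have the expected four cells
    if len(cols) < 4:
        continue
    color_codes[cols[2].get_text(strip=True)] = cols[3].get_text(strip=True)

print(color_codes)
```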


#### HTML Table: Using BeautifulSoup and Pandas

In [38]:
# Import libraries
import pandas as pd
import requests

In [39]:
# Define URL
url = 'https://en.wikipedia.org/wiki/World_population'

In [40]:
# Get the data
data = requests.get(url).text

In [41]:
# Parse data into BeautifulSoup object
soup = BeautifulSoup(data, 'html.parser')

In [42]:
# Locate all HTML tables in the webpage
tables = soup.find_all('table')

In [43]:
# Number of tables found
len(tables)

30

In [44]:
# Find table with description '10 most densely populated countries'
for index, table in enumerate(tables):
    if ('10 most densely populated countries' in str(table)):
        table_index = index
print(table_index)

7
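
If no table contains the phrase, the loop above leaves `table_index` undefined. The sketch below shows a variant that returns `None` instead. The helper `find_table_index` and the two-table `sample_page` are hypothetical and exist only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two captioned tables
sample_page = (
    "<table><caption>Largest cities</caption><tr><td>Tokyo</td></tr></table>"
    "<table><caption>10 most densely populated countries</caption>"
    "<tr><td>Singapore</td></tr></table>"
)
sample_tables = BeautifulSoup(sample_page, "html.parser").find_all("table")


def find_table_index(tables, phrase):
    """Return the index of the first table whose markup contains phrase, or None."""
    return next((i for i, t in enumerate(tables) if phrase in str(t)), None)


found = find_table_index(sample_tables, "10 most densely populated countries")
not_found = find_table_index(sample_tables, "no such caption")

print(found, not_found)
```

Note that this returns the first match, whereas the loop above keeps the last one; for a phrase that appears in only one table the two agree.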


In [46]:
## Scrape the data into dataframe

# Collect one dictionary per table row
data_list = []

# Scrape the data
for row in tables[table_index].tbody.find_all('tr'):
    col = row.find_all('td')
    
    if len(col) >= 5: # Skip header rows and rows with too few cells
        rank = col[0].text.strip()
        country = col[1].text.strip()
        population = col[2].text.strip()
        area = col[3].text.strip()
        density = col[4].text.strip()
        
        # Create a dictionary for the current row and add it to the list
        row_dict = {'Rank': rank, 'Country': country, 'Population': population, 'Area': area, 'Density': density}
        data_list.append(row_dict)
        
# Convert the list of dictionaries into a dataframe
population_data = pd.DataFrame(data_list, columns = ['Rank', 'Country', 'Population', 'Area', 'Density'])

# Print the dataframe
population_data


In [66]:
# Print the located table
#print(tables[table_index].prettify())
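
Values scraped with `.text` are strings, and figures such as `17,400,824` keep their thousands separators. The sketch below shows one way to convert such columns to integers with pandas. The one-row `scraped` dataframe reuses the Netherlands values shown in this document, and the variable names are illustrative.

```python
import pandas as pd

# One scraped row, with every value still stored as text
scraped = pd.DataFrame(
    [
        {
            "Rank": "10",
            "Country": "Netherlands",
            "Population": "17,400,824",
            "Area": "41,543",
            "Density": "419",
        }
    ]
)

# Remove thousands separators and convert the numeric columns to integers
numeric_cols = ["Rank", "Population", "Area", "Density"]
for col in numeric_cols:
    scraped[col] = scraped[col].str.replace(",", "", regex=False).astype(int)

print(scraped.dtypes)
```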

#### HTML Table: Using BeautifulSoup and read_html

In [58]:
# Read HTML
tables = pd.read_html(url, flavor = 'bs4')

In [59]:
# List of dataframes
len(tables)

27

In [63]:
tables[6]

Unnamed: 0,Rank,Country,Population,Area (km2),Density (pop/km2)
0,1,Singapore,5921231,719,8235
1,2,Bangladesh,165650475,148460,1116
2,3,Palestine[note 3][102],5223000,6025,867
3,4,Taiwan[note 4],23580712,35980,655
4,5,South Korea,51844834,99720,520
5,6,Lebanon,5296814,10400,509
6,7,Rwanda,13173730,26338,500
7,8,Burundi,12696478,27830,456
8,9,India,1389637446,3287263,423
9,10,Netherlands,17400824,41543,419


In [65]:
# Use match parameter to select specific table
pd.read_html(url, match="10 most densely populated countries", flavor='bs4')[0]

Unnamed: 0,Rank,Country,Population,Area (km2),Density (pop/km2)
0,1,Singapore,5921231,719,8235
1,2,Bangladesh,165650475,148460,1116
2,3,Palestine[note 3][102],5223000,6025,867
3,4,Taiwan[note 4],23580712,35980,655
4,5,South Korea,51844834,99720,520
5,6,Lebanon,5296814,10400,509
6,7,Rwanda,13173730,26338,500
7,8,Burundi,12696478,27830,456
8,9,India,1389637446,3287263,423
9,10,Netherlands,17400824,41543,419
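
Some country names in the result carry Wikipedia footnote markers, for example `Palestine[note 3][102]`. The sketch below shows one possible clean-up with a regular expression. The three-row `densest` dataframe is a hand-built excerpt of the table above, not a fresh download.

```python
import pandas as pd

# Hand-built excerpt of the table above (an assumption for this sketch)
densest = pd.DataFrame(
    {
        "Country": ["Singapore", "Palestine[note 3][102]", "Taiwan[note 4]"],
        "Density (pop/km2)": [8235, 867, 655],
    }
)

# Drop every bracketed footnote marker, then trim leftover whitespace
densest["Country"] = (
    densest["Country"].str.replace(r"\[[^\]]*\]", "", regex=True).str.strip()
)

print(densest)
```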
