In [16]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.





# Step 1: Import Required Libraries
These are essential Python libraries for web scraping and data handling: 
* requests: Used to send an HTTP request to the website and get the HTML content. 
* BeautifulSoup: Helps parse the HTML content and extract useful information. 
* pandas: Used to create and manage data structures like DataFrames, and later save them as CSV files.

In [17]:
'''
  Title: Web Scraping Project
  Name: Jude Oluya
  Date: 13 May 2025
  Description: Scrapes NHL team statistics from scrapethissite.com and saves them to a CSV file
'''
from bs4 import BeautifulSoup
import requests
import pandas as pd

# Step 2: Send HTTP Request to Target URL

We specify the target URL and use requests.get() to fetch the HTML content of the page. This content will later be parsed and analyzed. 
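
The request can also be wrapped a little more defensively. The following is a minimal sketch, not part of the original notebook: the helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration. It stops early on HTTP errors instead of parsing an error page.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as text (hypothetical helper)."""
    # The timeout keeps the call from hanging on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://www.scrapethissite.com/pages/forms/')` would then return the same text as `page.text` in the cell that follows.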

In [18]:
# Store the target URL in a variable
url = 'https://www.scrapethissite.com/pages/forms/'
page = requests.get(url)
print(page)

<Response [200]>


# Step 3: Parse HTML Content Using BeautifulSoup
The raw HTML returned from the website is converted into a BeautifulSoup object, making it easier to 
search for specific HTML elements (like tables, rows, and data cells).
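
To see what parsing provides on a small scale, here is a sketch that uses an inline HTML snippet instead of the live page. The markup in `sample_html` is a made-up stand-in that only imitates the structure of the real page, and `'html.parser'` is the parser bundled with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for page.text, not the real page.
sample_html = """
<html>
  <head><title>Hockey Teams</title></head>
  <body>
    <table class="table">
      <tr><th> Team Name </th><th> Year </th></tr>
      <tr class="team"><td class="name"> Boston Bruins </td><td class="year"> 1990 </td></tr>
    </table>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Elements can now be reached by tag name instead of string searching.
page_title = sample_soup.title.string
first_cell = sample_soup.find('td').get_text(strip=True)
print(page_title)
print(first_cell)
```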

In [19]:
# Parse the HTML content with Python's built-in parser
soup = BeautifulSoup(page.text, 'html.parser')
print(soup)

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping</title>
<link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components." name="description"/>
<link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css"/>
<link href="/static/css/styles.css" rel="stylesheet" type="text/css"/>
<meta content="noindex" name="robo

# Step 4: Locate the Table Containing Hockey Data 

The find_all() method returns every table element with class table as a list-like ResultSet. The first match holds the hockey team statistics, which are the main source of data on the page.
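
The difference between `find()` and `find_all()` matters for how the result is used later. A minimal sketch on a made-up one-table snippet (`sample_html` is an assumption, not the real page):

```python
from bs4 import BeautifulSoup

sample_html = '<table class="table"><tr><th>Team Name</th></tr></table>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns the first matching Tag, or None when nothing matches.
single_table = sample_soup.find('table', class_='table')

# find_all() returns a ResultSet (list-like) of every match, possibly empty.
all_tables = sample_soup.find_all('table', class_='table')

print(type(single_table).__name__)                   # Tag
print(len(all_tables))                               # 1
print(sample_soup.find('table', class_='missing'))   # None
```

With `find_all()`, the table itself has to be taken out of the ResultSet by index (for example `all_tables[0]`) before calling further search methods on it.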

In [20]:
# Find every table with class "table" (returns a ResultSet)
hockey_table = soup.find_all('table', class_='table')
print(hockey_table)

[<table class="table">
<tr>
<th>
                            Team Name
                        </th>
<th>
                            Year
                        </th>
<th>
                            Wins
                        </th>
<th>
                            Losses
                        </th>
<th>
                            OT Losses
                        </th>
<th>
                            Win %
                        </th>
<th>
                            Goals For (GF)
                        </th>
<th>
                            Goals Against (GA)
                        </th>
<th>
                            + / -
                        </th>
</tr>
<tr class="team">
<td class="name">
                            Boston Bruins
                        </td>
<td class="year">
                            1990
                        </td>
<td class="wins">
                            44
                        </td>
<td class="losses">
                            

# Step 5: Extract Column Headers

In [21]:
# Print each column heading
table_titles = soup.find_all('th')

for title in table_titles:
    print(title.text.strip())



Team Name
Year
Wins
Losses
OT Losses
Win %
Goals For (GF)
Goals Against (GA)
+ / -


Column headers (<th> tags) are extracted and cleaned using .strip() to remove surrounding spaces and newline characters. The resulting list will serve as the column names for our DataFrame.
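
Note that `soup.find_all('th')` searches the whole page. If a page had header cells outside the statistics table, the search could be limited to that table. The sketch below assumes a made-up snippet with a second, unrelated table and uses `get_text(strip=True)` as an equivalent of `.text.strip()` for cells that contain a single piece of text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an unrelated table precedes the statistics table.
sample_html = """
<table class="other"><tr><th>Unrelated</th></tr></table>
<table class="table">
  <tr><th>
        Team Name
      </th><th>
        Year
      </th></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Searching inside the table ignores <th> cells elsewhere on the page.
stats_table = sample_soup.find('table', class_='table')
headers = [th.get_text(strip=True) for th in stats_table.find_all('th')]
print(headers)
```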

In [22]:
# Collect the column headings into a list
table_titles = soup.find_all('th')
hockey_table_titles = [title.text.strip() for title in table_titles]
print(hockey_table_titles)


['Team Name', 'Year', 'Wins', 'Losses', 'OT Losses', 'Win %', 'Goals For (GF)', 'Goals Against (GA)', '+ / -']


# Step 6: Create an Empty DataFrame

A blank pandas DataFrame is initialized with the extracted column names. This structure will store the 
team data row by row.

In [23]:
# Create an empty DataFrame that uses the column headings
df = pd.DataFrame(columns=hockey_table_titles)
df

Unnamed: 0,Team Name,Year,Wins,Losses,OT Losses,Win %,Goals For (GF),Goals Against (GA),+ / -


# Step 7: Extract Data Rows and Populate the DataFrame 

* Loop through all table rows (<tr> tags), skipping the header. 
* Inside each row, extract all data cells (<td> tags). 
* Clean each cell’s text and store in a list. 
* Append the list as a new row to the DataFrame using df.loc[len(df)].
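
As an alternative to appending one row at a time with `df.loc[len(df)]`, the rows can be collected in a plain list first and turned into a DataFrame in one call, which tends to scale better on large tables. This is a sketch of that swapped-in approach, not the notebook's method; `sample_html`, `records`, and `sample_df` are illustrative names, and the values are taken from the first two rows of the real table.

```python
from bs4 import BeautifulSoup
import pandas as pd

sample_html = """
<table class="table">
  <tr><th>Team Name</th><th>Year</th><th>Wins</th></tr>
  <tr class="team"><td>Boston Bruins</td><td>1990</td><td>44</td></tr>
  <tr class="team"><td>Buffalo Sabres</td><td>1990</td><td>31</td></tr>
</table>
"""
table = BeautifulSoup(sample_html, 'html.parser').find('table', class_='table')

rows = table.find_all('tr')
headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]

# Collect every data row as a plain list first ...
records = [
    [td.get_text(strip=True) for td in row.find_all('td')]
    for row in rows[1:]  # rows[0] is the header row
]

# ... then build the DataFrame in a single call.
sample_df = pd.DataFrame(records, columns=headers)
print(sample_df.shape)
```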

In [24]:
# Collect all rows of the first matching table, then strip each cell
# and append the row to the DataFrame.
if hockey_table:  # the ResultSet is not empty
    main_hockey_table = hockey_table[0]
    table_data = main_hockey_table.find_all('tr')
    for row in table_data[1:]:  # skip the header row
        raw_data = row.find_all('td')
        each_raw_data = [data.text.strip() for data in raw_data]
        print(each_raw_data)
        df.loc[len(df)] = each_raw_data
else:
    print("No table with class 'table' found.")

['Boston Bruins', '1990', '44', '24', '', '0.55', '299', '264', '35']
['Buffalo Sabres', '1990', '31', '30', '', '0.388', '292', '278', '14']
['Calgary Flames', '1990', '46', '26', '', '0.575', '344', '263', '81']
['Chicago Blackhawks', '1990', '49', '23', '', '0.613', '284', '211', '73']
['Detroit Red Wings', '1990', '34', '38', '', '0.425', '273', '298', '-25']
['Edmonton Oilers', '1990', '37', '37', '', '0.463', '272', '272', '0']
['Hartford Whalers', '1990', '31', '38', '', '0.388', '238', '276', '-38']
['Los Angeles Kings', '1990', '46', '24', '', '0.575', '340', '254', '86']
['Minnesota North Stars', '1990', '27', '39', '', '0.338', '256', '266', '-10']
['Montreal Canadiens', '1990', '39', '30', '', '0.487', '273', '249', '24']
['New Jersey Devils', '1990', '32', '33', '', '0.4', '272', '264', '8']
['New York Islanders', '1990', '25', '45', '', '0.312', '223', '290', '-67']
['New York Rangers', '1990', '36', '31', '', '0.45', '297', '265', '32']
['Philadelphia Flyers', '1990', '3

# Step 8: Display the DataFrame

This line shows the complete DataFrame in the notebook to verify that all rows and columns were 
correctly extracted. 
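
A visual check does not reveal that every scraped value is still a string. The following sketch, which goes beyond the original notebook, converts the numeric columns with `pd.to_numeric`. The two rows are copied from the scraped output, and the list `numeric_columns` is an assumption about which columns should hold numbers.

```python
import pandas as pd

# Two rows copied from the scraped output; every value is a string.
sample_df = pd.DataFrame(
    [['Boston Bruins', '1990', '44', '24', '', '0.55'],
     ['Buffalo Sabres', '1990', '31', '30', '', '0.388']],
    columns=['Team Name', 'Year', 'Wins', 'Losses', 'OT Losses', 'Win %'],
)

numeric_columns = ['Year', 'Wins', 'Losses', 'OT Losses', 'Win %']
for column in numeric_columns:
    # With errors='coerce', values that cannot be parsed as numbers
    # become NaN; the empty 'OT Losses' cells end up as NaN.
    sample_df[column] = pd.to_numeric(sample_df[column], errors='coerce')

print(sample_df.dtypes)
print(sample_df['OT Losses'].isna().all())
```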

In [25]:
#Inspect the resulting DataFrame
df

Unnamed: 0,Team Name,Year,Wins,Losses,OT Losses,Win %,Goals For (GF),Goals Against (GA),+ / -
0,Boston Bruins,1990,44,24,,0.55,299,264,35
1,Buffalo Sabres,1990,31,30,,0.388,292,278,14
2,Calgary Flames,1990,46,26,,0.575,344,263,81
3,Chicago Blackhawks,1990,49,23,,0.613,284,211,73
4,Detroit Red Wings,1990,34,38,,0.425,273,298,-25
5,Edmonton Oilers,1990,37,37,,0.463,272,272,0
6,Hartford Whalers,1990,31,38,,0.388,238,276,-38
7,Los Angeles Kings,1990,46,24,,0.575,340,254,86
8,Minnesota North Stars,1990,27,39,,0.338,256,266,-10
9,Montreal Canadiens,1990,39,30,,0.487,273,249,24


# Step 9: Export Data to CSV 

The final dataset is saved to a .csv file named Hockey.csv. This makes it easy to reuse the data for 
further analysis or visualization.
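
By default, `to_csv` also writes the row index as an unnamed first column. The sketch below shows one way to leave it out with `index=False` and to confirm that the file reads back with the same columns. It writes to a temporary folder, and `sample_df` is a small illustrative DataFrame rather than the full scraped data.

```python
import os
import tempfile

import pandas as pd

sample_df = pd.DataFrame(
    [['Boston Bruins', 1990, 44], ['Buffalo Sabres', 1990, 31]],
    columns=['Team Name', 'Year', 'Wins'],
)

# Write to a temporary folder so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, 'Hockey.csv')
    # index=False omits the 0, 1, 2, ... row labels from the file.
    sample_df.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(reloaded.columns.tolist())
```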

In [26]:
# Save to a .csv file in the current folder, without the row index column
df.to_csv('Hockey.csv', index=False)