## INFT 2067 - Data Acquisition and Wrangling

Assessment: **CA 1.1: Web Scraping**<br>
Due Date: Monday, 29 January 2024<br>
<br>
Student: Michael Hudson<br>
Python version: 3.11.2<br>
Operating System: Windows 11 Pro (version 22H2, OS build 22621.1992)

---

First install the required libraries. For example, in the terminal:<br>
pip install requests<br>
pip install beautifulsoup4<br>
Then import the required modules:

In [1]:
# Built-in Modules
import csv
import re

# External Modules
import requests
from bs4 import BeautifulSoup


Use requests to download each page and BeautifulSoup to parse its HTML.

In [2]:
# The website URLs. These are also contained in the accompanying text file.
url_m = "https://en.wikipedia.org/wiki/2023_AFL_season"
url_w = "https://en.wikipedia.org/wiki/2023_AFL_Women%27s_season"

response_m = requests.get(url_m, timeout=10)
html_m = response_m.text
soup_m = BeautifulSoup(html_m, "html.parser")
response_m.close()

response_w = requests.get(url_w, timeout=10)
html_w = response_w.text
soup_w = BeautifulSoup(html_w, "html.parser")
response_w.close()


Create regular expressions that capture the round number and the game information.

In [3]:
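# Raw strings pass regex escapes such as \d, \n and \xa0 to the re module unchanged,
# which avoids Python's invalid-escape-sequence warnings.
# [ap]m matches "am" or "pm", and \. matches the literal dot in a goals.behinds score.
PATTERN_game_info = r"^\n(.+day), (\d+ .+) \((\d+:\d+)\xa0([ap]m)\)\n\n(.*?) \d+\.\d+ \((\d+)\)\n\n(.+?)\n\n(.*?) \d+\.\d+ \((\d+)\)\n\n(.*?) \(crowd:\xa0(\d*,*\d+)\)\n\nReportStats\n$"

PATTERN_round_number = r"^\n\n\nRound (\d+)"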
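To see what the capture groups pick up, here is a minimal sketch that runs patterns modelled on the two above against hand-written text. The row text is made up to mimic the layout the patterns expect (team names, scores, venue and crowd are placeholders, not real results), and the names `GAME_INFO`, `ROUND_NUMBER` and `sample_row` are only for this illustration.

```python
import re

# Raw-string patterns modelled on PATTERN_game_info and PATTERN_round_number.
GAME_INFO = re.compile(
    r"^\n(.+day), (\d+ .+) \((\d+:\d+)\xa0([ap]m)\)\n\n"
    r"(.*?) \d+\.\d+ \((\d+)\)\n\n(.+?)\n\n(.*?) \d+\.\d+ \((\d+)\)\n\n"
    r"(.*?) \(crowd:\xa0(\d*,*\d+)\)\n\nReportStats\n$"
)
ROUND_NUMBER = re.compile(r"^\n\n\nRound (\d+)")

# Made-up row text in the assumed layout; \xa0 is a non-breaking space.
sample_row = (
    "\nThursday, 16 March (7:20\xa0pm)\n\n"
    "Team A 8.10 (58)\n\n"
    "def. by\n\n"
    "Team B 17.14 (116)\n\n"
    "Example Oval (crowd:\xa012,345)\n\n"
    "ReportStats\n"
)

match = GAME_INFO.search(sample_row)
fields = match.groups()  # the eleven captured strings, in pattern order

# The crowd figure is captured with its thousands separator, so strip it before converting.
attendance = int(match.group(11).replace(",", ""))

# A made-up round heading row.
round_match = ROUND_NUMBER.search("\n\n\nRound 7\n")

print(fields)
print(attendance, round_match.group(1))
```

Only the total score in brackets is captured for each team; the goals.behinds figure before it is matched but not kept.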


Collect the tables on each page that cover the regular home-and-away season (each table holds one round of games), then collect the rows of each table. The result is a list of lists: the outer list holds the rounds of the season, and each inner list holds the rows for one round, which contain the round number and the game information.

In [4]:
# Get the rows for the men's competition
allTables_m = soup_m.find_all("table")
allRoundRows_m = []

# From inspection of the AFL web page we know the index of the first table we want
# and that there are 24 tables of interest.
for i in range(4, 28):
    RoundRows = allTables_m[i].find_all("tr")
    allRoundRows_m.append(RoundRows)

# Get the rows for the women's competition
allTables_w = soup_w.find_all("table")
allRoundRows_w = []

# From inspection of the AFLW web page we know the index of the first table we want
# and that there are 10 tables of interest.
for i in range(4, 14):
    RoundRows = allTables_w[i].find_all("tr")
    allRoundRows_w.append(RoundRows)
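The index-based selection can be tried out on a small page before pointing it at the real ones. The sketch below uses a made-up HTML snippet (one infobox table followed by two round tables), so the constants `FIRST_ROUND_INDEX` and `NUM_ROUNDS` are assumptions that fit this sample only, not the Wikipedia pages.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: one infobox table, then one table per round.
SAMPLE_HTML = """
<table class="infobox"><tr><td>Season summary</td></tr></table>
<table><tr><th>Round 1</th></tr><tr><td>Game A</td></tr><tr><td>Game B</td></tr></table>
<table><tr><th>Round 2</th></tr><tr><td>Game C</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
all_tables = soup.find_all("table")

FIRST_ROUND_INDEX = 1  # assumed position of the first round table in this sample
NUM_ROUNDS = 2         # assumed number of round tables in this sample

# Slice out the round tables and collect the rows of each one.
all_round_rows = [
    table.find_all("tr")
    for table in all_tables[FIRST_ROUND_INDEX:FIRST_ROUND_INDEX + NUM_ROUNDS]
]

for rows in all_round_rows:
    print([row.get_text(strip=True) for row in rows])
```

Hard-coded indices depend on the page layout at the time of scraping, so it is worth re-checking them if the page is edited.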


We are now ready to pull the data we want out of the rows. Define a function that creates an empty list as a container for the scraped information and fills it from the rows; using one function for both leagues reduces code duplication.

In [7]:
# Set the valid options for the arguments of the function.
VALID_LEAGUES = {"men", "women"}

def GetData(men_or_women):
    # Create a list container
    data = []
    
    # Validate the argument before using it
    if men_or_women not in VALID_LEAGUES:
        raise ValueError("Argument must be one of %r." % VALID_LEAGUES)
    if men_or_women == "men":
        num_rounds = 24
        league = "AFL-M"
        allRoundRows = allRoundRows_m
    else:
        num_rounds = 10
        league = "AFL-W"
        allRoundRows = allRoundRows_w

    # Iterate through the list of lists and extract the game information
    for i in range(num_rounds):
        for row in allRoundRows[i]:
            rowText = row.text
            # A round heading row sets the round number used for the games that follow
            m2 = re.search(PATTERN_round_number, rowText)
            if m2:
                roundno = m2.group(1)
            # A game row yields one record
            m = re.search(PATTERN_game_info, rowText)
            if m:
                day = m.group(1)
                date = m.group(2)
                time = m.group(3) + " " + m.group(4)
                team1name = m.group(5)
                team1score = m.group(6)
                team1_win_status = m.group(7)
                team2name = m.group(8)
                team2score = m.group(9)
                location = m.group(10)
                attendance = int(m.group(11).replace(",", ""))
                row_list = [league, roundno, day, date, time, team1name, team1score, team1_win_status, team2name, team2score, location, attendance]
                data.append(row_list)
    
    # Return the list of game information, ready for saving as a csv file.
    return data
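The key idea in the loop is that a round heading row sets a value that every following game row reuses until the next heading appears. Here is a stripped-down sketch of that idea on plain strings; the patterns, the function name `extract_games` and the row texts are simplified stand-ins invented for illustration, not the ones used above.

```python
import re

# Simplified, hypothetical patterns for this sketch only.
ROUND_HEADING = re.compile(r"^Round (\d+)")
GAME_LINE = re.compile(r"^(.+?) (\d+) v (.+?) (\d+)$")

# Made-up row texts standing in for row.text values; not real results.
row_texts = [
    "Round 1",
    "Team A 80 v Team B 75",
    "Team C 60 v Team D 90",
    "Round 2",
    "Team A 55 v Team C 70",
]

def extract_games(texts, league):
    """Attach the most recent round number to each game row."""
    games = []
    current_round = None
    for text in texts:
        heading = ROUND_HEADING.search(text)
        if heading:
            # Remember the round until the next heading replaces it.
            current_round = heading.group(1)
            continue
        game = GAME_LINE.search(text)
        if game and current_round is not None:
            team1, score1, team2, score2 = game.groups()
            games.append([league, current_round, team1, int(score1), team2, int(score2)])
    return games

games = extract_games(row_texts, "AFL-M")
for game in games:
    print(game)
```

Starting `current_round` at `None` and skipping game rows until a heading has been seen avoids attaching an undefined round number to a game.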
                

Use the function to prepare the data for writing to a csv file, and create a header row for that file.

In [8]:
data_m = GetData("men")
data_w = GetData("women")

header = [
    "League",
    "Round",
    "Day",
    "Date",
    "Time",
    "First_Team_Name",
    "First_Team_Score",
    "First_Team_Win_Status",
    "Second_Team_Name",
    "Second_Team_Score",
    "Location",
    "Crowd_Attendance"
]

Write the data to a csv file.

In [9]:
with open("AFL_Leagues_2023.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(header)
    writer.writerows(data_m)
    writer.writerows(data_w)
    
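One way to check the output format without touching the disk is to write to an in-memory buffer and read it back. This is a minimal sketch with a shortened, made-up header and two placeholder rows; `buffer` and `records` are names used only here.

```python
import csv
import io

# Shortened header and placeholder rows, for illustration only.
header = ["League", "Round", "First_Team_Name", "Crowd_Attendance"]
rows = [
    ["AFL-M", "1", "Team A", 12345],
    ["AFL-W", "1", "Team C", 2345],
]

# newline="" leaves line endings to the csv module, as with open(..., newline="").
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

# Rewind and read the rows back as dictionaries keyed by the header.
buffer.seek(0)
records = list(csv.DictReader(buffer))

for record in records:
    print(record)
```

Note that every value comes back as a string, so a numeric column such as the attendance needs converting again after reading.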


For interest, and as a check, run %whos to list the variables used in the code above.

In [10]:
%whos


Variable               Type             Data/Info
-------------------------------------------------
BeautifulSoup          type             <class 'bs4.BeautifulSoup'>
GetData                function         <function GetData at 0x000001CE613F1760>
PATTERN_game_info      str              ^\n(.+day), (\d+ .+) \((\<...>*\d+)\)\n\nReportStats\n$
PATTERN_round_number   str              ^\n\n\nRound (\d+)
RoundRows              ResultSet        [<tr style="background-co<...>p></li></ul>\n</td></tr>]
VALID_LEAGUES          set              {'men', 'women'}
allRoundRows_m         list             n=24
allRoundRows_w         list             n=10
allTables_m            ResultSet        [<table class="infobox vc<...>td></tr></tbody></table>]
allTables_w            ResultSet        [<table class="infobox vc<...>td></tr></tbody></table>]
csv                    module           <module 'csv' from 'C:\\U<...>\Python311\\Lib\\csv.py'>
data_m                 list             n=207
data_w             