# FIT5196 Assessment 2

Student Name: Juan Pablo Grimaldi
Student ID: 32980523

Date: 31 August 2022


Environment: Python 3.9

Libraries used:
* pandas (for data manipulation)
* numpy (for conditional replacement of values)
* matplotlib (for visualisation)
* missingno (for visualising missing data)
* datetime (included in the Python standard library)

-------------------------------------

# Table of Contents


[1. Introduction](#Intro) <br>
[2. Environment Preparation](#libs) <br>
[3. Load and Examine Datasets](#examine) <br>
[4. Summary](#summary) <br>
[5. References](#Ref) <br>

-------------------------------------

## Introduction  <a class="anchor" name="Intro"></a>

In the real world, we rarely have the good fortune of working with perfectly clean data.
More often than not, data scientists run into data sets that require some attention before
the analysis can even begin. For example, in domains like Finance it is common to observe outliers
within stock returns that would bias the estimates of any model fitted to them.

George Box famously said:

> “All models are wrong, but some are useful.”

A model can only be useful if the data behind it is reliable, which makes data cleaning a key
part of data analysis. This assessment covers common data quality issues and ways to resolve them.

### Assessment Specifications

For this project, we will use and examine three data sets where each one of them contains a
problem to be resolved:

* `32980523_dirty_data.csv`: detect and fix errors.
* `32980523_outlier_data.csv`: detect and remove outlier rows (to be found w.r.t delivery_fee
attribute).
* `32980523_missing_data.csv`: impute the missing values.

In addition, there are three auxiliary files to be used during parts of the data cleaning process: `branches.csv`, `edges.csv` and `nodes.csv`.
<br>
The data sets contain food delivery data from a restaurant in Melbourne, Australia. The restaurant has three branches around the CBD area. All three branches share the same menu, but they have different management, so they operate differently.
<br>
Below are the variables contained in the data sets:
<br>
* `order_id`: A unique id for each order
* `date`: The date the order was made, given in YYYY-MM-DD format
* `time`: The time the order was made, given in hh:mm:ss format
* `order_type`: A categorical attribute representing the different types of orders namely: Breakfast, Lunch or Dinner
* `branch_code`: A categorical attribute representing the branch code in which the order was made. Branch information is given in the branches.csv file.
* `order_items`: A list of tuples representing the order items: first element of the tuple is the
item ordered, and the second element is the quantity ordered for that item.
* `order_price`: A float value representing the order total price.
* `customer_lat`: Latitude of the customer coming from the nodes.csv file.
* `customer_lon`: Longitude of the customer coming from the nodes.csv file.
* `customerHasloyalty?`: A logical variable denoting whether the customer has a loyalty card with
the restaurant (1 if the customer has loyalty and 0 otherwise).
* `distance_to_customer_KM`: A float representing the shortest distance, in kilometers, between the branch and the customer nodes with respect to the nodes.csv and the edges.csv files.
* `delivery_fee`: A float representing the delivery fee of the order.
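
The `distance_to_customer_KM` definition relies on a shortest-path computation over the graph described by `nodes.csv` and `edges.csv`. As a minimal sketch of that idea, the block below implements Dijkstra's algorithm with the standard library on a tiny made-up graph. The node labels, the edge lengths, the `(u, v, distance)` tuple layout and the choice to treat edges as undirected are all assumptions, since the real column layout of `edges.csv` is not shown here.

```python
import heapq


def shortest_distance(edges, source, target):
    """Return the length of the shortest path from source to target."""
    # Build an undirected adjacency list from (u, v, distance) tuples (assumed layout)
    graph = {}
    for u, v, dist in edges:
        graph.setdefault(u, []).append((v, dist))
        graph.setdefault(v, []).append((u, dist))

    best = {source: 0.0}      # best known distance to each visited node
    queue = [(0.0, source)]   # priority queue ordered by distance so far
    while queue:
        d, node = heapq.heappop(queue)
        if node == target:
            return d
        if d > best.get(node, float("inf")):
            continue  # stale entry, a shorter route was already found
        for neighbour, dist in graph.get(node, []):
            candidate = d + dist
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return float("inf")  # target is not reachable from source


# Hypothetical graph: A-B-C-D is shorter than the direct A-C edge followed by C-D
toy_edges = [("A", "B", 1.0), ("B", "C", 2.0), ("A", "C", 5.0), ("C", "D", 1.0)]
print(shortest_distance(toy_edges, "A", "D"))  # 4.0
```

With the real files, the branch node and the customer node would take the place of `"A"` and `"D"`, and the recomputed value could be compared against the recorded `distance_to_customer_KM`.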


### Additional Information

The following helpful details are also provided in the brief:

* There are three types of meals:
    * Breakfast - served during morning (8am - 12pm),
    * Lunch - served during afternoon (12:00:01pm - 4pm)
    * Dinner - served during evening (4:00:01pm - 8pm)

* Each meal has a distinct set of items in the menu (ex: breakfast items can't be served during lunch or dinner and so on).

* A useful python package to solve a linear system of equations is numpy.linalg

* Delivery fee is calculated using a different method for each branch. The fee depends linearly (but in different ways for each branch) on:
    1. weekend or weekday (1 or 0) - as a binary variable
    2. time of the day (morning 0, afternoon 1, evening 2) - as a categorical variable
    3. distance between branch and customer

* **If a customer has loyalty, they get a 50% discount on delivery fee**

* The restaurant uses Dijkstra's algorithm to calculate the shortest distance between customer and restaurant. (explore the networkx python package for this or alternatively find a way to implement the algorithm yourself)

* We know that the below columns are error-free:
    * order_id
    * time
    * the numeric quantity in order_items
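
The delivery fee rules above can be turned into a small linear system. The sketch below is illustrative only: the orders and coefficients are made up, the intercept term is an assumption, and the time of day is entered as a single numeric code (0, 1, 2) even though the brief describes it as categorical. It simulates fees for one hypothetical branch, undoes the 50% loyalty discount, and recovers the coefficients with `numpy.linalg.lstsq`.

```python
from datetime import date, time

import numpy as np


def time_of_day_code(t: time) -> int:
    """Code the time of day as morning 0, afternoon 1, evening 2."""
    if t <= time(12, 0, 0):
        return 0
    if t <= time(16, 0, 0):
        return 1
    return 2


def design_row(order_date: date, order_time: time, distance_km: float) -> list:
    """Build [intercept, weekend flag, time-of-day code, distance] for one order."""
    weekend = 1 if order_date.weekday() >= 5 else 0  # Saturday=5, Sunday=6
    return [1.0, weekend, time_of_day_code(order_time), distance_km]


# Made-up orders for one hypothetical branch: (date, time, distance, loyalty)
orders = [
    (date(2021, 11, 2), time(18, 8, 27), 10.082, 0),
    (date(2021, 8, 14), time(15, 26, 11), 8.676, 0),
    (date(2021, 5, 5), time(12, 3, 22), 6.714, 1),
    (date(2021, 3, 24), time(10, 42, 15), 8.581, 0),
    (date(2021, 10, 3), time(17, 7, 36), 9.956, 1),
    (date(2021, 6, 16), time(13, 24, 30), 6.525, 0),
]

# Made-up coefficients, used only to simulate fees for this example
true_coef = np.array([2.0, 1.5, 0.8, 1.1])

X = np.array([design_row(d, t, km) for d, t, km, _ in orders], dtype=float)
discount = np.array([0.5 if loyal else 1.0 for *_, loyal in orders])
observed_fee = (X @ true_coef) * discount  # loyalty customers pay half

# Undo the loyalty discount, then solve the least-squares problem
full_fee = observed_fee / discount
coef, *_ = np.linalg.lstsq(X, full_fee, rcond=None)
print(np.round(coef, 3))
```

Because the simulated fees are exactly linear, the fit recovers the made-up coefficients. On the real data, one such fit per branch would be needed, since each branch uses a different rule.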

-------------------------------------

## Environment Preparation  <a class="anchor" name="libs"></a>

The packages used in this assessment are imported below. They fulfil the following tasks:

* **pandas:** to structure the results into a tabular format, as per the assessment requirements.
* **matplotlib:** for visual exploratory data analysis.
* **missingno:** to process and visualise missing data.

In [82]:
import datetime

# import libraries to be used in the assignment
#Basic scientific python libs
import pandas as pd
# Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import missingno as msno
# Configure visualisations
%matplotlib inline
# to make things pretty
mpl.style.use( 'ggplot' )

-------------------------------------

## Load and Examine Datasets <a class="anchor" name="examine"></a>

<br>Most of the exploratory data analysis will be performed with methods from the `pandas` library.
To begin the process, each of the three data sets is read into its own dataframe:


In [83]:
#load datasets
dirty_data = pd.read_csv("data/32980523_dirty_data.csv")
missing_data = pd.read_csv("data/32980523_missing_data.csv")
outlier_data = pd.read_csv("data/32980523_outlier_data.csv")

The following sections review and correct each of the data sets in turn.

### Dirty Data

First, we will use the `info()` method to review the variable types within the data set and print
the data set size.

In [84]:
# review dirty data variable types
print("---- Dirty Data: data types ----")
dirty_data.info()
print("---- Dirty Data: data frame size ----")
dirty_data.shape

---- Dirty Data: data types ----
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   order_id                 500 non-null    object 
 1   date                     500 non-null    object 
 2   time                     500 non-null    object 
 3   order_type               500 non-null    object 
 4   branch_code              500 non-null    object 
 5   order_items              500 non-null    object 
 6   order_price              500 non-null    float64
 7   customer_lat             500 non-null    float64
 8   customer_lon             500 non-null    float64
 9   customerHasloyalty?      500 non-null    int64  
 10  distance_to_customer_KM  500 non-null    float64
 11  delivery_fee             500 non-null    float64
dtypes: float64(5), int64(1), object(6)
memory usage: 47.0+ KB
---- Dirty Data: data frame size ----


(500, 12)

Next, the head and tail methods from pandas will allow a quick overview of the dataframe structure.

In [85]:
#inspect dirty data head
dirty_data.head()

Unnamed: 0,order_id,date,time,order_type,branch_code,order_items,order_price,customer_lat,customer_lon,customerHasloyalty?,distance_to_customer_KM,delivery_fee
0,ORDJ06243,2021-11-02,18:08:27,Dinner,TP,"[('Salmon', 4), ('Pasta', 2), ('Fish&Chips', 1...",308.0,-37.806521,144.944874,0,10.082,13.802598
1,ORDA10907,2021-11-03,18:08:27,Dinner,BK,"[('Salmon', 5), ('Pasta', 5)]",342.5,-37.810712,144.946133,0,9.145,16.15099
2,ORDA06776,2021-08-14,15:26:11,Lunch,BK,"[('Fries', 1), ('Salad', 7), ('Chicken', 2), (...",320.4,-37.819004,144.954318,0,8.676,16.680944
3,ORDY05744,2021-10-26,17:48:10,Dinner,TP,"[('Salmon', 7), ('Pasta', 10), ('Shrimp', 5)]",832.0,-37.817244,144.967764,0,11.792,11.549074
4,ORDX00833,2021-05-05,12:03:22,Lunch,BK,"[('Salad', 2), ('Steak', 10), ('Chicken', 2), ...",892.4,-37.809557,144.972643,0,6.714,12.512411


In [86]:
#inspect dirty data tail
dirty_data.tail()

Unnamed: 0,order_id,date,time,order_type,branch_code,order_items,order_price,customer_lat,customer_lon,customerHasloyalty?,distance_to_customer_KM,delivery_fee
495,ORDX04741,2021-03-24,10:42:15,Breakfast,BK,"[('Coffee', 9), ('Pancake', 9)]",249.75,-37.806837,144.95138,0,8.581,13.10343
496,ORDJ05472,2021-10-03,17:07:36,Dinner,TP,"[('Pasta', 4), ('Fish&Chips', 7), ('Shrimp', 2)]",463.0,-37.807966,144.945429,0,9.956,15.517818
497,ORDK01012,2021-06-16,13:24:30,Lunch,BK,"[('Burger', 5), ('Chicken', 5), ('Fries', 10),...",600.4,-37.807797,144.973202,0,6.525,11.83828
498,ORDK04997,2021-03-16,10:21:58,Breakfast,BK,"[('Eggs', 8), ('Coffee', 2), ('Cereal', 3), ('...",283.5,-37.799207,144.961314,0,8.333,13.031216
499,ORDA02222,2021-04-27,11:53:14,Breakfast,BK,"[('Pancake', 7), ('Coffee', 1), ('Cereal', 8)]",321.25,-37.803561,144.918101,0,11.587,16.161815


In [87]:
# summarise numerical variables
dirty_data.describe()

Unnamed: 0,order_price,customer_lat,customer_lon,customerHasloyalty?,distance_to_customer_KM,delivery_fee
count,500.0,500.0,500.0,500.0,500.0,500.0
mean,481.5167,-35.379832,143.138468,0.066,8.613976,13.669932
std,265.064722,19.355386,18.203815,0.248531,1.643464,2.39298
min,41.0,-37.833333,-37.816142,0.0,3.657,5.218812
25%,276.8125,-37.819053,144.951689,0.0,7.63075,12.589348
50%,436.6,-37.812472,144.963557,0.0,8.7625,13.921834
75%,656.075,-37.805754,144.977142,0.0,9.631,15.094976
max,1432.0,145.000032,145.017716,1.0,16.698,21.566636


#### Individual Variable Review

This section will cover the review of each one of the variables in the `dirty_data` dataframe.

#### Date

The following line of code should parse the `date` column into a proper pandas datetime format. However, it raises an error, which indicates that each component of the date needs a closer review:

``` python
dirty_date = pd.to_datetime(dirty_data['date'], format='%Y-%m-%d')
# ValueError: time data 2021-30-08 doesn't match format specified
```
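
To list every value that fails the expected format at once, rather than meeting one error at a time, one option is to coerce failures to `NaT` and filter on them. The sketch below uses a small hypothetical sample in place of `dirty_data['date']`; the sample values and variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for dirty_data['date']
sample_dates = pd.Series(
    ["2021-11-02", "2021-30-08", "Tue Jun  1 00:00:00 2021", "2021-Aug-03"]
)

# errors="coerce" turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(sample_dates, format="%Y-%m-%d", errors="coerce")

# Keep only the original strings that could not be parsed
invalid_dates = sample_dates[parsed.isna()]
print(invalid_dates)
```

The index of `invalid_dates` points straight at the rows that need attention.
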
We will use some basic regular expressions to extract the days, months and years from the `date` series. pandas exposes the common regular expression methods through the `.str` accessor.

In [106]:
# pattern for YYYY-MM-DD (raw string avoids invalid escape warnings)
date_pattern = r'(\d{4})-(\d{2})-(\d{2})'
# keep the (year, month, day) tuple of every row that matches; other rows are skipped
matches = [itm[0] for itm in
  dirty_data['date'].str.findall(date_pattern)
  if len(itm)>0]
# one integer column per date component
date_concat = pd.DataFrame(matches, columns=['year','month','day']).apply(pd.to_numeric)

In [89]:
# check how many rows were parsed and the type of each component
date_concat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 498 entries, 0 to 497
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   year    498 non-null    int64
 1   month   498 non-null    int64
 2   day     498 non-null    int64
dtypes: int64(3)
memory usage: 11.8 KB


In [90]:
date_concat.describe()

Unnamed: 0,year,month,day
count,498.0,498.0,498.0
mean,2021.0,6.670683,15.829317
std,0.0,3.825676,8.729156
min,2021.0,1.0,1.0
25%,2021.0,4.0,8.0
50%,2021.0,6.0,16.5
75%,2021.0,10.0,23.0
max,2021.0,30.0,31.0


We can immediately observe two issues:

1. Two rows did not match the regex (only 498 of the 500 rows were parsed), so they should be
inspected in further detail.
2. Some months fall outside the permitted range for a valid date (i.e. they are larger
than 12).

In [91]:
#check rows where the month is larger than 12.
date_concat[date_concat['month'] > 12]

Unnamed: 0,year,month,day
113,2021,30,8
170,2021,19,10
263,2021,30,6
416,2021,22,8


In [92]:
for index, each in enumerate(dirty_data['date']):
  parts = each.split("-")
  #check for years outside 2021
  if parts[0] != "2021":
    print(f"index:{index}, value:{each}")
  #check for non-numeric month
  elif not parts[1].isnumeric():
    print(f"index:{index}, value:{each}")

index:400, value:Tue Jun  1 00:00:00 2021
index:418, value:2021-Aug-03


In [93]:
# changing formats
# observation 400
dirty_data.at[400,'date'] = "2021-06-01"
# observation 418
dirty_data.at[418,'date'] = "2021-08-03"

In [94]:
# re-run the date extraction now that rows 400 and 418 are fixed
date_pattern = r'(\d{4})-(\d{2})-(\d{2})'
# keep the (year, month, day) tuple of every row that matches
matches = [itm[0] for itm in
           dirty_data['date'].str.findall(date_pattern)
           if len(itm)>0]
# one integer column per date component
date_concat = pd.DataFrame(matches, columns=['year','month','day']).apply(pd.to_numeric)

In [95]:
date_concat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   year    500 non-null    int64
 1   month   500 non-null    int64
 2   day     500 non-null    int64
dtypes: int64(3)
memory usage: 11.8 KB


In [96]:
#check rows where the month is larger than 12.
date_concat[date_concat['month'] > 12]

Unnamed: 0,year,month,day
113,2021,30,8
170,2021,19,10
263,2021,30,6
417,2021,22,8


We will replace the invalid months (those larger than 12) with the most common month:

In [97]:
# most common month
date_concat['month'].mode()

0    6
Name: month, dtype: int64

In [98]:
# replace months larger than 12 with the most common month (6)
import numpy as np
date_concat['month'] = np.where(date_concat['month'] > 12, 6, date_concat['month'])
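
Replacing every invalid month with the mode discards the recorded value. An alternative worth considering, assuming these rows were recorded as YYYY-DD-MM (an assumption the brief does not confirm), is to swap the month and day whenever the month exceeds 12 and the day is itself a valid month. The sketch below rebuilds the four flagged rows as a standalone frame; the name `swapped` is hypothetical.

```python
import pandas as pd

# The four rows flagged above, rebuilt here as a standalone frame
swapped = pd.DataFrame(
    {
        "year": [2021, 2021, 2021, 2021],
        "month": [30, 19, 30, 22],
        "day": [8, 10, 6, 8],
    },
    index=[113, 170, 263, 417],
)

# Assumption: month > 12 together with day <= 12 means the two fields were swapped
needs_swap = (swapped["month"] > 12) & (swapped["day"] <= 12)

# .to_numpy() prevents pandas from re-aligning the columns by name during assignment
swapped.loc[needs_swap, ["month", "day"]] = (
    swapped.loc[needs_swap, ["day", "month"]].to_numpy()
)
print(swapped)
```

Each swapped row (for example 2021-08-30) is a valid calendar date, which makes the swapped reading at least plausible.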


In [100]:
# add leading zeroes
date_concat['month'] = date_concat['month'].apply(lambda x: '{0:0>2}'.format(x))

In [102]:
date_concat['day'] = date_concat['day'].apply(lambda x: '{0:0>2}'.format(x))

In [103]:
# put year, month and day back together into a single date string
dirty_data['date'] = date_concat['year'].astype(str) + "-"+ date_concat['month'].astype(str) + \
                     "-" + date_concat['day'].astype(str)


In [104]:
dirty_data.head()

Unnamed: 0,order_id,date,time,order_type,branch_code,order_items,order_price,customer_lat,customer_lon,customerHasloyalty?,distance_to_customer_KM,delivery_fee
0,ORDJ06243,2021-11-02,18:08:27,Dinner,TP,"[('Salmon', 4), ('Pasta', 2), ('Fish&Chips', 1...",308.0,-37.806521,144.944874,0,10.082,13.802598
1,ORDA10907,2021-11-03,18:08:27,Dinner,BK,"[('Salmon', 5), ('Pasta', 5)]",342.5,-37.810712,144.946133,0,9.145,16.15099
2,ORDA06776,2021-08-14,15:26:11,Lunch,BK,"[('Fries', 1), ('Salad', 7), ('Chicken', 2), (...",320.4,-37.819004,144.954318,0,8.676,16.680944
3,ORDY05744,2021-10-26,17:48:10,Dinner,TP,"[('Salmon', 7), ('Pasta', 10), ('Shrimp', 5)]",832.0,-37.817244,144.967764,0,11.792,11.549074
4,ORDX00833,2021-05-05,12:03:22,Lunch,BK,"[('Salad', 2), ('Steak', 10), ('Chicken', 2), ...",892.4,-37.809557,144.972643,0,6.714,12.512411


#### Observations

The following observations can be made about the `dirty_data` data frame:

* There are 500 observations in total, each corresponding to a delivery order.
* The `date` and `time` variables are stored as strings (`object` dtype) and need to be converted to proper datetime types.


### Missing Data

In [20]:
missing_data.describe()

Unnamed: 0,order_price,customer_lat,customer_lon,customerHasloyalty?,distance_to_customer_KM,delivery_fee
count,500.0,500.0,500.0,500.0,500.0,400.0
mean,478.684,-37.812103,144.967062,0.054,8.693266,13.811205
std,272.722121,0.007481,0.021305,0.226244,1.663536,2.413113
min,33.25,-37.828542,144.916772,0.0,3.478,5.793977
25%,272.75,-37.818302,144.952119,0.0,7.734,12.732104
50%,417.5,-37.812203,144.964018,0.0,8.7045,13.808147
75%,665.0,-37.805941,144.9816,0.0,9.7475,15.173717
max,1335.5,-37.79645,145.01837,1.0,16.645,22.28196


In [21]:
outlier_data.describe()

Unnamed: 0,order_price,customer_lat,customer_lon,customerHasloyalty?,distance_to_customer_KM,delivery_fee
count,500.0,500.0,500.0,500.0,500.0,500.0
mean,498.5734,-37.812952,144.967694,0.052,8.758112,13.690094
std,264.370055,0.00763,0.02131,0.222249,1.651498,3.138408
min,36.5,-37.833333,144.921857,0.0,3.605,4.189152
25%,287.0,-37.819211,144.95223,0.0,7.7985,12.429743
50%,464.9,-37.813074,144.96464,0.0,8.8645,13.807846
75%,677.975,-37.806776,144.982712,0.0,9.72,15.268315
max,1493.0,-37.798083,145.01959,1.0,16.676,28.822662


Next, we count the missing values in each column of `missing_data`.

In [22]:
missing_data.isna().sum()

order_id                     0
date                         0
time                         0
order_type                   0
branch_code                 50
order_items                  0
order_price                  0
customer_lat                 0
customer_lon                 0
customerHasloyalty?          0
distance_to_customer_KM      0
delivery_fee               100
dtype: int64

In [23]:
missing_data.isnull().sum()

order_id                     0
date                         0
time                         0
order_type                   0
branch_code                 50
order_items                  0
order_price                  0
customer_lat                 0
customer_lon                 0
customerHasloyalty?          0
distance_to_customer_KM      0
delivery_fee               100
dtype: int64

In [24]:
missing_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   order_id                 500 non-null    object 
 1   date                     500 non-null    object 
 2   time                     500 non-null    object 
 3   order_type               500 non-null    object 
 4   branch_code              450 non-null    object 
 5   order_items              500 non-null    object 
 6   order_price              500 non-null    float64
 7   customer_lat             500 non-null    float64
 8   customer_lon             500 non-null    float64
 9   customerHasloyalty?      500 non-null    int64  
 10  distance_to_customer_KM  500 non-null    float64
 11  delivery_fee             400 non-null    float64
dtypes: float64(5), int64(1), object(6)
memory usage: 47.0+ KB


-------------------------------------

## Summary <a class="anchor" name="summary"></a>

In conclusion, this exercise has covered the first steps required to load, inspect and clean the food delivery data sets.
<br>
The key to the process lies in understanding the structure of each variable before changing it. In this case, summary methods such as `info()` and `describe()` exposed the problems in the `date` column, and simple regular expressions were enough to isolate the invalid values.
<br>
The pandas library was extremely helpful throughout, both for correcting the `date` column of `dirty_data` and for counting the missing values in `missing_data` (50 in `branch_code` and 100 in `delivery_fee`).

-------------------------------------

## References <a class="anchor" name="Ref"></a>




[1]<a class="anchor" name="ref-2"></a> Why do I need to add DOTALL to python regular expression to match new line in raw string, https://stackoverflow.com/questions/22610247, Accessed 30/08/2022.

....


--------------------------------------------------------------------------------------------------------------------------