<h1 style = "text-align: center; ">Descriptive Title</h1>
<h2 style = "text-align: center; ">ST445 - Managing and Visualizing Data</h2>
<h3 style = "text-align: center; ">Candidate IDs: 38682, XXXXX, YYYYY</h3>


### I. Notebook preparation (maybe this section is not needed)

Perhaps we include something similar to this example from "Example 2"

[[Before running this notebook, please make sure you have all necessary modules installed in your environment. Potentially less common modules used include:

- `google.cloud`
- `dotenv`
- `networkx`
- `geopandas`
- `praw`
- `transformers`
- `plotly.graph_objects`
- `ipywidgets`
- `folium`

As usual, they can be installed by running the command `pip install [module]` in the terminal.

Furthermore, please make sure your Python version is compatible with all the modules. While writing this, it became apparent there might be some compatibility issues with newer Python versions (especially 3.11 and newer). In case you run into any issues, it might be worth trying to run the code with an older version such as Python 3.9.]]
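A quick way to see which of the modules listed above are missing is to look each one up without importing it. The sketch below is only one possible check: the helper name `find_missing_modules` is hypothetical, and the names in `REQUIRED_MODULES` are assumed to be the import names from the list above.

```python
import importlib.util
import sys

# Import names taken from the list above (assumed to match the installed packages)
REQUIRED_MODULES = [
    "google.cloud", "dotenv", "networkx", "geopandas", "praw",
    "transformers", "plotly.graph_objects", "ipywidgets", "folium",
]

def find_missing_modules(names):
    """Return the names in `names` that cannot be located in this environment."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised, for example, when the parent package of a dotted name is absent
            spec = None
        if spec is None:
            missing.append(name)
    return missing

print("Python version:", sys.version.split()[0])
print("Missing modules:", find_missing_modules(REQUIRED_MODULES))
```

An empty list means every module was found; anything printed in the list still needs to be installed.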

Our complete GitHub repository can be found at the following location: https://github.com/lse-st445/2024-project-data-knows-ball [[Should we put this in the title of our paper??]]

In [1]:
# Import relevant packages
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Install lxml (e.g. conda install anaconda::lxml) so that BeautifulSoup can parse HTML and XML


### II. Introduction and data description

[[Describe our data sets and pose our research question]]

[[Maybe include data dictionaries of some sort similar to Table 1.3.1 and Table 1.3.2 in "Example 2"]]

### III. Data acquisition

#### III.i. Marketcheck UK API

In [5]:
# API 



#### III.ii. Web scraping the UK Office for National Statistics (ONS)

In [17]:
# Function for web scraping a time series table from the UK Office for National Statistics
def webscrape_ONS(url):
    '''
    Scrape the first two columns (period and value) of an ONS time series page
    and split the rows by periodicity, based on the length of the period label.

    Returns a tuple (ons_year_df, ons_quarter_df, ons_month_df).
    '''

    page = requests.get(url)
    page.raise_for_status() # Stop early if the request failed
    soup = BeautifulSoup(page.content, "lxml")

    table_headers = soup.find_all("th")
    table_headers = table_headers[0:2] # We only need the first two columns of data from the ONS
    table_headers = [t.text.strip() for t in table_headers] # Strip whitespace so the header reads "Period"

    ons_data = []

    for i, row in enumerate(soup.find_all("tr")[2:]): # The first two rows of ONS tables are headers
        try:
            period, value = row.find_all("td")[0:2] # We only need the first two columns of data from the ONS
            ons_data.append([period.text, value.text])
        except ValueError: # Raised when a row has fewer than two data cells
            print("Error parsing row #{}".format(i))

    ons_df = pd.DataFrame(ons_data, columns = table_headers)

    # Period labels have 4 characters for years, 7 for quarters and 8 for months
    ons_year_df = ons_df[ons_df["Period"].str.len() == 4]
    ons_quarter_df = ons_df[ons_df["Period"].str.len() == 7]
    ons_month_df = ons_df[ons_df["Period"].str.len() == 8]

    split_df_len = [len(ons_year_df), len(ons_quarter_df), len(ons_month_df)]
    total_split_df_len = sum(split_df_len)
    orig_df_len = len(ons_data)

    assert total_split_df_len == orig_df_len, "ERROR: Not all rows from original ONS table are split into corresponding year/quarter/month dataframes"

    return ons_year_df, ons_quarter_df, ons_month_df


In [19]:
url_uk_unemp = "https://www.ons.gov.uk/employmentandlabourmarket/peoplenotinwork/unemployment/timeseries/mgsx/lms"
url_uk_cpih = "https://www.ons.gov.uk/economy/inflationandpriceindices/timeseries/l55o/mm23"

webscrape_ONS(url_uk_cpih)

[35, 143, 431]
609
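The row-parsing step inside `webscrape_ONS` can be tried without a network request by running it on a small HTML string. The sketch below assumes the table layout the code above relies on (two header rows, then one row per period); the HTML is hand-written to imitate that layout and is not copied from the ONS site. The helper name `parse_two_column_rows` is hypothetical, and the built-in `"html.parser"` is used here only so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hand-written HTML imitating the assumed layout: two header rows, then data rows.
# The second header row and the incomplete row are made up for illustration.
SAMPLE_HTML = """
<table>
  <tr><th>Period</th><th>Value</th></tr>
  <tr><th colspan="2">(second header row)</th></tr>
  <tr><td>1989</td><td>5.7</td></tr>
  <tr><td>incomplete row</td></tr>
  <tr><td>2024 JUL</td><td>3.1</td></tr>
</table>
"""

def parse_two_column_rows(html, n_header_rows=2):
    """Return (rows, skipped): the parsed [period, value] pairs and the
    indices of data rows that had fewer than two cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows, skipped = [], []
    for i, row in enumerate(soup.find_all("tr")[n_header_rows:]):
        cells = row.find_all("td")
        if len(cells) < 2:
            skipped.append(i)  # Record the row instead of failing silently
            continue
        rows.append([cells[0].text.strip(), cells[1].text.strip()])
    return rows, skipped

rows, skipped = parse_two_column_rows(SAMPLE_HTML)
print(rows)
print(skipped)
```

Checking the number of cells explicitly, rather than catching an exception, makes it easy to report exactly which rows were dropped.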


In [2]:
# Web scraping UK unemployment data
url_uk_unemp = "https://www.ons.gov.uk/employmentandlabourmarket/peoplenotinwork/unemployment/timeseries/mgsx/lms"
page = requests.get(url_uk_unemp)
soup = BeautifulSoup(page.content, "lxml")


In [6]:
table_headers = soup.find_all("th")
table_headers = table_headers[0:2] # We only need the first two columns of data
table_headers = [t.text for t in table_headers]
table_headers


['Period', 'Value']

In [7]:
unemp_data = []

for i, row in enumerate(soup.find_all("tr")[2:]): # The first two rows are header rows
    try:
        period, unemp_pct = row.find_all("td")[0:2]
        period = period.text
        unemp_pct = unemp_pct.text
        unemp_data.append([period, unemp_pct])
    except ValueError: # Raised when a row has fewer than two data cells
        print("Error parsing row #{}".format(i))

unemp_df = pd.DataFrame(unemp_data, columns = table_headers)


In [9]:
unemp_year_df = unemp_df[unemp_df["Period"].str.len() == 4]
# display(unemp_year_df)
unemp_quarter_df = unemp_df[unemp_df["Period"].str.len() == 7]
# display(unemp_quarter_df)
unemp_month_df = unemp_df[unemp_df["Period"].str.len() == 8]
# display(unemp_month_df)
print([len(unemp_year_df), len(unemp_quarter_df), len(unemp_month_df)])
test = [len(unemp_year_df), len(unemp_quarter_df), len(unemp_month_df)]
print(sum(test))
print(len(unemp_df))

### Add an assert that the 3 split DFs together have the same number of rows as the entire orig df (assert is a built-in statement, so no package is needed)
### Make this scraping & splitting by periodicity into a function to call for both unemp_pct & CPIH
    ### Add a print command stating that the data has been split into 3 DFs, and output the names of the DFs
### Perhaps have a min/max value command to describe the data (i.e. unemp goes from 1971-2023, while CPIH goes from 1989-2023)


[53, 215, 644]
912
912


| | Period | Value |
|---|---|---|
| 0 | 1971 | 4.1 |
| 1 | 1972 | 4.3 |
| 2 | 1973 | 3.7 |
| 3 | 1974 | 3.7 |
| 4 | 1975 | 4.5 |
| 5 | 1976 | 5.4 |
| 6 | 1977 | 5.6 |
| 7 | 1978 | 5.5 |
| 8 | 1979 | 5.4 |
| 9 | 1980 | 6.8 |
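The notes in the cell above ask for an assert on the split, a reusable function, and a min/max summary of the periods covered. A minimal sketch of those three pieces on an in-memory DataFrame follows. It assumes period labels look like `1971` (4 characters), `2024 Q1` (7 characters) and `2024 JUL` (8 characters); the quarterly format is inferred from the length filter and does not appear in the output shown. The names `split_by_periodicity` and `describe_range` are hypothetical, and the sample rows hold placeholder values, not ONS figures.

```python
import pandas as pd

# Assumed mapping from the length of the period label to its periodicity
PERIOD_LENGTHS = {"year": 4, "quarter": 7, "month": 8}

def split_by_periodicity(df, period_col="Period"):
    """Split df into one DataFrame per periodicity and check no row is lost."""
    lengths = df[period_col].str.len()
    parts = {
        name: df[lengths == n].reset_index(drop=True)
        for name, n in PERIOD_LENGTHS.items()
    }
    n_split = sum(len(part) for part in parts.values())
    assert n_split == len(df), "Some rows did not match any periodicity"
    return parts

def describe_range(df, period_col="Period"):
    """Return the first and last period label (rows assumed chronological)."""
    return df[period_col].iloc[0], df[period_col].iloc[-1]

# Placeholder rows for illustration only
sample = pd.DataFrame({
    "Period": ["1971", "1972", "2024 Q1", "2024 JUL", "2024 AUG"],
    "Value": ["1.0", "2.0", "3.0", "4.0", "5.0"],
})

parts = split_by_periodicity(sample)
for name, part in parts.items():
    print(name, len(part), describe_range(part))
```

Returning a dictionary keyed by periodicity keeps the same function usable for both the unemployment and the CPIH series.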


In [3]:
# Web scraping UK CPIH data
url_uk_cpih = "https://www.ons.gov.uk/economy/inflationandpriceindices/timeseries/l55o/mm23"
page = requests.get(url_uk_cpih)
soup = BeautifulSoup(page.content, "lxml")


In [4]:
table_headers = soup.find_all("th")
table_headers = table_headers[0:2] # We only need the first two columns of data
table_headers = [t.text.replace("\n", "") for t in table_headers]
table_headers


['Period', 'Value']

In [5]:
cpih_data = []

for i, row in enumerate(soup.find_all("tr")[2:]): # The first two rows are header rows
    try:
        period, cpih_rate = row.find_all("td")[0:2]
        period = period.text
        cpih_rate = cpih_rate.text
        cpih_data.append([period, cpih_rate])
    except ValueError: # Raised when a row has fewer than two data cells
        print("Error parsing row #{}".format(i))

cpih_df = pd.DataFrame(cpih_data, columns = table_headers)


In [6]:
display(cpih_df)

| | Period | Value |
|---|---|---|
| 0 | 1989 | 5.7 |
| 1 | 1990 | 8.0 |
| 2 | 1991 | 7.5 |
| 3 | 1992 | 4.6 |
| 4 | 1993 | 2.6 |
| ... | ... | ... |
| 604 | 2024 JUL | 3.1 |
| 605 | 2024 AUG | 3.1 |
| 606 | 2024 SEP | 2.6 |
| 607 | 2024 OCT | 3.2 |


In [7]:
cpih_year_df = cpih_df[cpih_df["Period"].str.len() == 4]
# display(cpih_year_df)
cpih_quarter_df = cpih_df[cpih_df["Period"].str.len() == 7]
# display(cpih_quarter_df)
cpih_month_df = cpih_df[cpih_df["Period"].str.len() == 8]
# display(cpih_month_df)
print([len(cpih_year_df), len(cpih_quarter_df), len(cpih_month_df)])
test = [len(cpih_year_df), len(cpih_quarter_df), len(cpih_month_df)]
print(sum(test))
print(len(cpih_df))


[35, 143, 431]
609
609


### IV. Data preparation

In [7]:
# Clean and merge datasets



### V. Visualizations

[[Description of what visualizations we decided to include and why]]

In [8]:
# Code for visualizations



[[Explanation/interpretation of what the visualizations are depicting]]

### VI. Data modeling

### VII. Conclusion