# Data Loading

This notebook demonstrates the finalized, repeatable workflow for collecting and preprocessing the data.

Everything is built around the `DataCollector` and `Preprocessor` classes, which bundle the helper methods for collecting every dataset and preprocessing the combined result for feature selection. The following datasets are collected:

  - rivm: COVID-19 infections and deaths in the Netherlands, from RIVM (the Dutch National Institute for Public Health and the Environment)
  - stocks: stock market data for the AEX constituents
  - jhu: COVID-19 vaccination data from Johns Hopkins University
  - weather: weather observations from KNMI, the Dutch national weather service
  - measures: Oxford dataset of Dutch government measures, including the stringency index

In [None]:
from data_collection import DataCollector, Preprocessor
from datetime import datetime, timedelta
import yfinance as yf
from bs4 import BeautifulSoup as bs
from requests import get
import re
import pandas as pd

In [None]:
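# Scrape the AEX constituent list from Wikipedia as a proxy for the Dutch economy.
# section=7 and the table layout are hardcoded, so this breaks if the page is restructured.

WIKI_API_URL = "https://en.wikipedia.org/w/api.php?action=parse&page=AEX_index&section=7&contentmodel=wikitext&prop=wikitext&format=json"
wikisection = get(WIKI_API_URL).json()["parse"]["wikitext"]["*"]
# Keep the table body: everything between "cellspacing=2" and the last closing brace
wikitable = re.search(r"cellspacing=2((.|\s)+)\}", wikisection).group(1)
# Drop the row separators, skip the first two lines, then split rows on "| [" and cells on "||"
compacttable = wikitable.replace("|-\n","")
flat = "".join(compacttable.split("\n")[2:])
rows = [f"[{chunk}".split("||") for chunk in flat.split("| [") if chunk != ""]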
cols = [
    "company",
    "sector",
    "ticker",
    "weighting"
]
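# Collect the cells column by column
aex_dict = {c: [] for c in cols}
for row in rows:
    for i, cell in enumerate(row):
        aex_dict[cols[i]].append(cell.strip(" "))
# Index by company name and keep the remaining columns as data
aex_df = pd.DataFrame({k: v for k, v in aex_dict.items() if k != "company"}, index=aex_dict["company"])
# The ticker cell is pipe-delimited; the symbol is its second field
aex_df["ticker"] = aex_df["ticker"].apply(lambda t: t.split("|")[1])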
aex_df["weighting"] = aex_df["weighting"].str.replace("|","", regex=False).astype("float")
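
The regex-and-split approach above depends on the exact whitespace of the table. For comparison, a line-based parser could look like the sketch below. This is not the notebook's method: `parse_constituents` and the `SAMPLE` excerpt are hypothetical, the sample rows are made up, and the sketch assumes one table row per line with cells separated by `||`, which may not match the live page.

```python
import re

import pandas as pd

# Made-up excerpt in the assumed layout (one row per line, cells separated by "||").
SAMPLE = """{| class="wikitable sortable" cellspacing=2
! Company !! ICB Sector !! Ticker symbol !! Index weighting (%)
|-
| [[Example Corp]] || software || {{Euronext|EXA|XX0000000001|XAMS}} || 60.5
|-
| [[Sample Holding|Sample]] || banks || {{Euronext|SMP|XX0000000002|XAMS}} || 39.5
|}"""


def parse_constituents(wikitext: str) -> pd.DataFrame:
    """Parse the data rows of a wikitable into a DataFrame indexed by company (hypothetical helper)."""
    records = []
    for line in wikitext.splitlines():
        # Data rows start with "| "; this skips "{|", "!" headers, "|-" separators and "|}".
        if not line.startswith("| "):
            continue
        cells = [cell.strip() for cell in line[1:].split("||")]
        if len(cells) != 4:
            continue  # not a constituent row in the assumed four-column layout
        company, sector, ticker, weighting = cells
        records.append(
            {
                # [[Target|Label]] -> Label, [[Target]] -> Target
                "company": re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", company),
                "sector": sector,
                # {{Euronext|EXA|...}} -> EXA (second pipe-delimited field)
                "ticker": ticker.strip("{}").split("|")[1],
                "weighting": float(weighting),
            }
        )
    return pd.DataFrame(records).set_index("company")


sample_df = parse_constituents(SAMPLE)
```

Parsing row by row makes a layout change show up as a dropped or failing row, whereas a single greedy regex over the whole section tends to fail in less obvious ways.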

In [None]:
ROOT_DIR = "../data"
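LOG_DIR = "../logs"
START_DATE = "2019-01-01"
# End the collection window one week before today
END_DATE = (datetime.now() - timedelta(days=7)).strftime("%Y-%m-%d")
# Yahoo Finance uses the ".AS" suffix for Euronext Amsterdam listings
TICKERS = [t + ".AS" for t in aex_df.ticker.to_list()]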

dc = DataCollector(root_dir=ROOT_DIR, log_dir=LOG_DIR, start_date=START_DATE, end_date=END_DATE)
datasets = [
    "rivm",
    "stocks",
    "jhu",
    "weather",
    "measures"
]
dfs = {ds: dc.get(dataset=ds) if ds != "stocks" else dc.get(dataset=ds, tickers=TICKERS) for ds in datasets}
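
Because the comprehension above calls every source in one expression, a single failing source aborts the whole collection. One possible way to isolate failures is sketched below. `collect_all` and `fake_get` are hypothetical names, the stub stands in for `dc.get` so the example runs offline, and the sketch assumes that `dc.get` raises an exception on failure, which the notebook does not confirm.

```python
from typing import Any, Callable, Dict, Iterable, Optional, Tuple


def collect_all(
    fetch: Callable[..., Any],
    datasets: Iterable[str],
    extra_kwargs: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Call fetch(dataset=name, **kwargs) per dataset and return (results, errors)."""
    extra_kwargs = extra_kwargs or {}
    results: Dict[str, Any] = {}
    errors: Dict[str, Exception] = {}
    for name in datasets:
        try:
            results[name] = fetch(dataset=name, **extra_kwargs.get(name, {}))
        except Exception as exc:  # keep going so one failing source does not abort the run
            errors[name] = exc
    return results, errors


# Stand-in for dc.get so the sketch runs without network access (hypothetical).
def fake_get(dataset: str, tickers=None):
    if dataset == "weather":
        raise ConnectionError("source unavailable")
    return {"dataset": dataset, "tickers": tickers}


results, errors = collect_all(
    fake_get,
    ["rivm", "stocks", "weather"],
    extra_kwargs={"stocks": {"tickers": ["EXA.AS"]}},
)
print(sorted(results), sorted(errors))  # ['rivm', 'stocks'] ['weather']
```

With the notebook's own objects the call would presumably be `collect_all(dc.get, datasets, {"stocks": {"tickers": TICKERS}})`, with the `errors` dictionary inspected before moving on to preprocessing.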

In [None]:
print("Collected datasets:")
for ds, df in dfs.items():
    print(f"  - {ds}, size: {df.shape}")

In [None]:
display(dfs["rivm"])

In [None]:
display(dfs["stocks"])

In [None]:
display(dfs["jhu"])

In [None]:
display(dfs["weather"])

In [None]:
display(dfs["measures"])

In [None]:
preprocessor = Preprocessor(root_dir=ROOT_DIR, log_dir=LOG_DIR, datasets=dfs, start_date=START_DATE, end_date=END_DATE)
df = preprocessor.preprocess_and_combine()
display(df)
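
A quick check on the combined frame is whether every day in the collection window is present. The sketch below assumes the combined frame has one row per day and a datetime index, which the notebook does not confirm. `missing_days` is a hypothetical helper and the toy frame holds made-up values.

```python
import pandas as pd


def missing_days(frame: pd.DataFrame, start: str, end: str) -> pd.DatetimeIndex:
    """Return the calendar days in [start, end] that are absent from the frame's index."""
    expected = pd.date_range(start=start, end=end, freq="D")
    return expected.difference(pd.DatetimeIndex(frame.index))


# Toy frame with one day missing (made-up values).
toy = pd.DataFrame(
    {"value": [1.0, 2.0, 4.0]},
    index=pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-04"]),
)
gaps = missing_days(toy, "2019-01-01", "2019-01-04")
```

Under the same assumption, `missing_days(df, START_DATE, END_DATE)` would list the dates that dropped out during preprocessing.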