# 0: Section Overview

In this section, we will implement a baseline model: a simple model against which more complex models can be compared. This comparison helps us gauge the effectiveness of enhancements or modifications to the model. It also sets a standard for evaluating whether more sophisticated models actually improve predictive performance (in our case, measured by the mean squared error, MSE).
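
As a reminder of what the metric measures, here is a minimal sketch of the MSE computed by hand and checked against `mean_squared_error`. The arrays `y_true` and `y_hat` are made-up values for illustration only and are not taken from the project data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted values (illustration only)
y_true = np.array([1.0, 1.2, 0.9, 1.1])
y_hat = np.array([1.1, 1.0, 0.9, 1.3])

# MSE is the mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# The scikit-learn helper should give the same number
mse_sklearn = mean_squared_error(y_true, y_hat)

print(f"manual: {mse_manual:.4f}, sklearn: {mse_sklearn:.4f}")
```

A lower MSE means predictions are closer to the observed values on average, with large errors penalised more heavily than small ones.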

# 1: Necessary Imports

In [20]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 2: Data Access

We will access the data in the same way we accessed the data in section 03-EDA.

In [11]:
test_filepath = r"C:\Users\markm\OneDrive\Documents\University\Year 4\dst\GitHub\Assessment_1\Train_and_Test_data\test.xlsx" # Enter your file path for the test data
train_filepath = r"C:\Users\markm\OneDrive\Documents\University\Year 4\dst\GitHub\Assessment_1\Train_and_Test_data\train.xlsx" # Enter your file path for the train data

test_data = pd.read_excel(test_filepath)
train_data = pd.read_excel(train_filepath)

            date iso_code continent      location  Stringency Index  CH Index  \
0     2020-01-05      AND    Europe       Andorra               0.0       0.0   
1     2020-01-06      AND    Europe       Andorra               0.0       0.0   
2     2020-01-07      AND    Europe       Andorra               0.0       0.0   
3     2020-01-08      AND    Europe       Andorra               0.0       0.0   
4     2020-01-09      AND    Europe       Andorra               0.0       0.0   
...          ...      ...       ...           ...               ...       ...   
52257 2024-07-31      ZAF    Africa  South Africa               NaN       NaN   
52258 2024-08-01      ZAF    Africa  South Africa               NaN       NaN   
52259 2024-08-02      ZAF    Africa  South Africa               NaN       NaN   
52260 2024-08-03      ZAF    Africa  South Africa               NaN       NaN   
52261 2024-08-04      ZAF    Africa  South Africa               NaN       NaN   

       Gov Resp Index  Econ

In [None]:
print(test_data.head())

# 3: Implementing the baseline model: Linear Regression

Initially, let us train the model on the entire training data. We don't require the first 4 columns (date, iso_code, continent and location), so the first thing we will do is remove them.

In [16]:
train_data_string_cols_rem = train_data.iloc[:, 4:]
test_data_string_cols_rem = test_data.iloc[:, 4:]

print(train_data_string_cols_rem.columns)

Index(['Stringency Index', 'CH Index', 'Gov Resp Index', 'Econ Sup Index',
       'total_cases', 'new_cases', 'new_cases_smoothed', 'total_deaths',
       'new_deaths', 'new_deaths_smoothed', 'total_cases_per_million',
       'new_cases_per_million', 'new_cases_smoothed_per_million',
       'total_deaths_per_million', 'new_deaths_per_million',
       'new_deaths_smoothed_per_million', 'reproduction_rate', 'icu_patients',
       'icu_patients_per_million', 'hosp_patients',
       'hosp_patients_per_million', 'weekly_icu_admissions',
       'weekly_icu_admissions_per_million', 'weekly_hosp_admissions',
       'weekly_hosp_admissions_per_million', 'total_tests', 'new_tests',
       'total_tests_per_thousand', 'new_tests_per_thousand',
       'new_tests_smoothed', 'new_tests_smoothed_per_thousand',
       'positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations',
       'people_vaccinated', 'people_fully_vaccinated', 'total_boosters',
       'new_vaccinations', 'new_vaccinatio
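
Slicing with `iloc[:, 4:]` relies on the identifier columns always being the first four. As a hedged alternative, the sketch below drops the same columns by name, which does not depend on column order. The frame `demo` is a tiny made-up stand-in for the real data, and `id_cols` is a name introduced here for illustration.

```python
import pandas as pd

# Tiny stand-in frame with made-up values (not the project data)
demo = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-05", "2020-01-06"]),
    "iso_code": ["AND", "AND"],
    "continent": ["Europe", "Europe"],
    "location": ["Andorra", "Andorra"],
    "Stringency Index": [0.0, 0.0],
    "reproduction_rate": [1.1, 1.0],
})

# Identifier columns that the model does not use
id_cols = ["date", "iso_code", "continent", "location"]

# Drop by name; this keeps working if the column order changes
by_name = demo.drop(columns=id_cols)

# Positional slicing, as used in this notebook, for comparison
by_position = demo.iloc[:, 4:]

print(by_name.columns.tolist())
```

When the identifier columns do come first, both approaches return the same frame.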

### Training Data

We select the columns relevant to our model: the dependent variable, 'reproduction_rate', and the covariates 'Stringency Index', 'CH Index', 'Gov Resp Index', 'Econ Sup Index' and 'days_since'. We also remove any rows with missing values. Later in the project we will evaluate how to make use of the rows with missing information.

In [23]:
train_data_filtered = train_data_string_cols_rem[['reproduction_rate','Stringency Index','CH Index', 'Gov Resp Index', 'Econ Sup Index','days_since']].dropna()

X_train = train_data_filtered[['Stringency Index','CH Index', 'Gov Resp Index', 'Econ Sup Index', 'days_since']]
y_train = train_data_filtered[['reproduction_rate']]

print()
print(X_train.shape)
print(len(y_train))


(139717, 5)
139717
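
Because `dropna()` silently discards rows, it can be useful to check how many rows are lost and which columns are responsible. The following is a minimal sketch on a made-up frame `demo` with a reduced column list `model_cols`; both names and all values are assumptions for illustration, not results from the project data.

```python
import numpy as np
import pandas as pd

# Reduced, hypothetical column list for illustration
model_cols = ["reproduction_rate", "Stringency Index", "days_since"]

# Made-up frame with a few missing entries
demo = pd.DataFrame({
    "reproduction_rate": [1.1, np.nan, 0.9, 1.0],
    "Stringency Index": [0.0, 10.0, np.nan, 25.0],
    "days_since": [0, 1, 2, 3],
})

# Count missing values per column before filtering
missing_per_col = demo[model_cols].isna().sum()

# Compare row counts before and after dropping incomplete rows
rows_before = len(demo)
rows_after = len(demo[model_cols].dropna())

print(missing_per_col)
print(f"kept {rows_after} of {rows_before} rows")
```

The same check applied to `train_data_string_cols_rem` would show how much of the training data the complete-case filter keeps.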


### Testing Data

We implement the same filtering as we did with the training data.

In [24]:
test_data_filtered = test_data_string_cols_rem[['reproduction_rate','Stringency Index','CH Index', 'Gov Resp Index', 'Econ Sup Index','days_since']].dropna()

X_test = test_data_filtered[['Stringency Index','CH Index', 'Gov Resp Index', 'Econ Sup Index', 'days_since']]
y_test = test_data_filtered[['reproduction_rate']]

print()
print(X_test.shape)
print(len(y_test))


(29044, 5)
29044


In [25]:
# Initialize the linear regression model
linear_model = LinearRegression()

# Fit the model on the training data
linear_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = linear_model.predict(X_test)

# Calculate Mean Squared Error (MSE) on the test data
mse = mean_squared_error(y_test, y_pred)

# Print the MSE result
print(f'Mean Squared Error on the test data: {mse:.4f}')



Mean Squared Error on the test data: 0.1233
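
A test MSE is easier to interpret next to an even simpler reference, such as a model that always predicts the training mean. The sketch below illustrates that comparison on synthetic data using scikit-learn's `DummyRegressor`. The arrays, the coefficients in `true_coef`, and the noise level are all assumptions made up for this example, so the printed numbers say nothing about the project data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: 5 covariates with a linear signal plus noise (assumed)
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(200, 5))
X_te = rng.normal(size=(50, 5))
true_coef = np.array([0.5, -0.2, 0.0, 0.1, 0.3])
y_tr = X_tr @ true_coef + rng.normal(scale=0.1, size=200)
y_te = X_te @ true_coef + rng.normal(scale=0.1, size=50)

# Reference model: always predicts the mean of the training target
mean_model = DummyRegressor(strategy="mean").fit(X_tr, y_tr)

# Linear regression, fitted in the same way as the baseline in this section
lin_model = LinearRegression().fit(X_tr, y_tr)

mse_mean = mean_squared_error(y_te, mean_model.predict(X_te))
mse_lin = mean_squared_error(y_te, lin_model.predict(X_te))

print(f"mean-only MSE: {mse_mean:.4f}")
print(f"linear MSE:    {mse_lin:.4f}")
```

Fitting `DummyRegressor` on `X_train` and `y_train` and scoring it on `X_test` and `y_test` would give the corresponding reference value for the 0.1233 reported above.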


# 4: Cross-validation

# 5: Visualising our performance metric for the baseline model

# 6: Evaluating the model