# **Advanced Model**
- Best model: all P-values **< 0.05**, a **low** condition number, and **no sign of overfitting** (train and test errors are nearly identical)
- Features included:
    * **`Year`**
    * **`Infant_deaths`**
    * **`Under_five_deaths`**
    * **`Adult_mortality`**
    * **`BMI`**
    * **`Incidents_HIV`**
    * **`GDP_per_capita`**
    * **`Schooling`**
    * **`Economy_status_Developed`**
    * 8 **`Region`** dummy columns (one-hot-encoded, with the first region dropped as the baseline)

In [8]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
import statsmodels.api as sm
import statsmodels.tools.eval_measures  # Explicit submodule import for the rmse helper used below
import joblib
from sklearn import metrics

## **Train-Test-Split**
- `Life Expectancy Data.csv` was read into a pandas dataframe.
- The features and target for modelling and predicting were separated.
- The dataset was split **80-20** into training and test sets using a `random_state` of 23.
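
The split described above can be tried out on a small stand-in frame. This is a minimal sketch, assuming a made-up `toy` DataFrame (its values are not from the WHO data); it only illustrates that `test_size=0.2` holds out 20% of the rows and that a fixed `random_state` reproduces the same partition.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the WHO data (values are made up)
toy = pd.DataFrame({
    "Schooling": np.arange(10, dtype=float),
    "Life_expectancy": np.arange(60, 70, dtype=float),
})

X_toy = toy.drop(columns="Life_expectancy")  # features
y_toy = toy["Life_expectancy"]               # target

# 80-20 split; fixing random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=23)
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=23)

print(len(X_tr), len(X_te))            # 8 2
print(X_te.index.equals(X_te2.index))  # True: same rows held out on both runs
print(X_tr.index.equals(y_tr.index))   # True: features and target stay row-aligned
```

Changing `random_state` would select different rows, so metrics reported further down are tied to the value 23.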

In [10]:
WHO = pd.read_csv("Life Expectancy Data.csv")   # Reading the csv

In [11]:
# Preparing for Train-test splitting

feature_cols = list(WHO.columns)        # Extracts the columns and creates a list out of it
feature_cols.remove('Life_expectancy')  # Removes the 'life expectancy' column to make it the features for the X_train and the X_test

X = WHO[feature_cols]       # Creates the features to learn from
y = WHO['Life_expectancy']  # Separates the 'life_expectancy' as the target

In [12]:
feature_cols # Check that the 'life_expectancy' column has been removed

['Country',
 'Region',
 'Year',
 'Infant_deaths',
 'Under_five_deaths',
 'Adult_mortality',
 'Alcohol_consumption',
 'Hepatitis_B',
 'Measles',
 'BMI',
 'Polio',
 'Diphtheria',
 'Incidents_HIV',
 'GDP_per_capita',
 'Population_mln',
 'Thinness_ten_nineteen_years',
 'Thinness_five_nine_years',
 'Schooling',
 'Economy_status_Developed',
 'Economy_status_Developing']

In [13]:
# TRAIN-TEST SPLIT

X_train, X_test, y_train, y_test = train_test_split(X,                  # The features
                                                    y,                  # The target
                                                    test_size = 0.2,    # Allocated 20% of the data to test
                                                    random_state = 23)  # Add a random state
# 80-20 train-test split with a random state value of 23

## **Feature Engineering**
- A **`feature_eng`** function was created that takes 4 arguments:
    - **`train_df`** represents the training dataset
    - **`test_df`** represents the test dataset or user input
    - **`save_metadata`** represents whether to save the scaler and feature columns. **Only** set as **`True`** during training
    - **`include_regions`** represents whether to include the regions or not. **Only** include when using the **Advanced Model**
- The function returns:
    - **`train_df`** represents the feature-engineered train dataset
    - **`test_df`** represents the feature-engineered test dataset
- During training, when **`save_metadata`** is `True`, the scaler and feature columns from the training phase are saved using the [joblib module](https://joblib.readthedocs.io/en/latest/generated/joblib.dump.html) and loaded again in the prediction phase. This ensures the test dataset is scaled with the statistics fitted on the training dataset, avoiding **data leakage**.
- **`feature_columns`** is used to align the columns of the test/input dataset with those of the training dataset.
- The saved **`scaler`** is used to scale the test/input dataset.
- A **constant** (intercept) column is added at the end.
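
The save-then-reload pattern described above can be illustrated in isolation. The following is a minimal sketch, not the notebook's function: the toy frames, the temporary folder and the variable names are assumptions made for the example. It fits a `StandardScaler` on training rows only, saves it with `joblib.dump`, then reloads it to align and scale a single new row.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; column names mirror the notebook, values are made up
train_toy = pd.DataFrame({"Region": ["Asia", "Africa", "Asia", "Oceania"],
                          "BMI": [20.0, 22.0, 24.0, 26.0]})
new_toy = pd.DataFrame({"Region": ["Asia"], "BMI": [23.0]})  # e.g. one user input

workdir = tempfile.mkdtemp()                      # throwaway folder for the saved files
scaler_path = os.path.join(workdir, "scaler")
columns_path = os.path.join(workdir, "feature_columns")

# Training phase: encode, fit the scaler on the training rows only, then save both
train_enc = pd.get_dummies(train_toy, columns=["Region"], drop_first=True, dtype=int)
scaler = StandardScaler()
train_enc[["BMI"]] = scaler.fit_transform(train_enc[["BMI"]])
joblib.dump(scaler, scaler_path)
joblib.dump(list(train_enc.columns), columns_path)

# Prediction phase: reload, align to the training columns, scale with training statistics
loaded_scaler = joblib.load(scaler_path)
loaded_columns = joblib.load(columns_path)
new_enc = pd.get_dummies(new_toy, columns=["Region"], drop_first=False, dtype=int)
new_enc = new_enc.reindex(columns=loaded_columns, fill_value=0)  # missing dummies become 0
new_enc[["BMI"]] = loaded_scaler.transform(new_enc[["BMI"]])

print(list(new_enc.columns))  # ['BMI', 'Region_Asia', 'Region_Oceania']
print(new_enc.iloc[0].to_dict())
```

Because the new row's BMI of 23 equals the training mean, it scales to 0; the scaler never sees the new row during fitting, which is the point of the pattern.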

In [15]:
# FEATURE ENGINEERING FUNCTION

def feature_eng(train_df, test_df, save_metadata=True, include_regions=True):
    """
    Feature engineering function with joblib for saving/loading scalers and feature columns.
    
    Args:
        train_df (pd.DataFrame): Training dataset.
        test_df (pd.DataFrame): Test dataset or user input.
        save_metadata (bool): Whether to save the scaler and feature columns. Only set to True during training.
        include_regions (bool): Whether to include the regions or not. Only include if using the advanced model
    Returns:
        train_df (pd.DataFrame): training dataset.
        test_df (pd.DataFrame): test dataset.
    """
    train_df = train_df.copy()  # Copy the training dataset
    test_df = test_df.copy()  # Copy the test dataset

    # All columns that need to be scaled
    scale_columns = ['Year', 'Infant_deaths', 'Under_five_deaths', 'Adult_mortality',
                     'Alcohol_consumption', 'Hepatitis_B', 'Measles', 'BMI',
                     'Polio', 'Diphtheria', 'Incidents_HIV', 'GDP_per_capita',
                     'Population_mln', 'Thinness_ten_nineteen_years',
                     'Thinness_five_nine_years', 'Schooling']

    # Training phase:
    if save_metadata:
        # One-hot encode the regions (the first region is dropped as the baseline)
        train_df = pd.get_dummies(train_df, columns=['Region'], drop_first=True, prefix='Region', dtype=int) # OHE the regions

        # Fit scaler and extract the train columns
        scaler = StandardScaler()                                                 # Call the scaling method
        train_df[scale_columns] = scaler.fit_transform(train_df[scale_columns])   # Fit scaler and transform the training dataset
        feature_columns = train_df.columns                                        # Extract the training columns

        # Save scaler and feature columns
        joblib.dump(scaler, 'scaler')                   # Save the training scaler
        joblib.dump(feature_columns, 'feature_columns') # Save the feature columns

    # Prediction phase:
    else:
        # Load scaler and feature columns
        scaler = joblib.load('scaler')                    # Load the training scaler
        feature_columns = joblib.load('feature_columns')  # Load the feature columns

        # If Region is present in test dataset:
        if include_regions and 'Region' in test_df.columns:
            test_df = pd.get_dummies(test_df, columns=['Region'], drop_first=False, prefix='Region', dtype=int) # OHE the regions and keep the first column

        # If region is not present in the test dataset
        else:
            test_df.drop(columns=['Region'], errors='ignore', inplace=True) # Drop the regions columns

        # Align test_df with train_df before scaling
        test_df = test_df.reindex(columns=feature_columns, fill_value=0)          # aligns the test and train dataset together and fills the missing values with 0
        common_columns = [col for col in scale_columns if col in test_df.columns] # Find the common columns from the scale_columns and the test dataset
        test_df[common_columns] = scaler.transform(test_df[common_columns])       # Scales the test dataset on the common columns using the training scaler

    # Add Constant
    train_df = sm.add_constant(train_df, has_constant='add') # Add constant to the training dataset
    test_df = sm.add_constant(test_df, has_constant='add')   # Add constant to the testing dataset

    return train_df, test_df

In [16]:
X_train_fe, X_test_fe = feature_eng(X_train, X_test, save_metadata=True) # Training phase: engineer the training set and save the scaler and feature columns (the test set is transformed later, in the prediction phase)

In [17]:
X_train_fe # Check of feature engineered dataset

| | const | Country | Year | Infant_deaths | Under_five_deaths | Adult_mortality | Alcohol_consumption | Hepatitis_B | Measles | BMI | ... | Economy_status_Developed | Economy_status_Developing | Region_Asia | Region_Central America and Caribbean | Region_European Union | Region_Middle East | Region_North America | Region_Oceania | Region_Rest of Europe | Region_South America |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2676 | 1.0 | Singapore | 0.952270 | -1.028415 | -0.907783 | -1.196121 | -0.723955 | 0.799286 | 0.950472 | -0.644257 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 369 | 1.0 | Yemen, Rep. | 0.515765 | 0.541674 | 0.352960 | 0.370161 | -1.200723 | -0.529570 | 0.950472 | -0.780855 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 466 | 1.0 | Austria | -1.012001 | -0.952267 | -0.853751 | -0.913923 | 1.905876 | -0.086618 | -1.656268 | 0.038736 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1739 | 1.0 | Lesotho | -1.448505 | 1.448192 | 1.478623 | 3.597649 | -0.493179 | -0.339734 | -1.656268 | -0.416593 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 649 | 1.0 | South Africa | -0.793748 | 0.683091 | 0.778461 | 2.690922 | 0.609981 | -0.529570 | -1.390274 | 0.676195 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1512 | 1.0 | Cambodia | 0.952270 | 0.063033 | -0.122070 | -0.052038 | -0.576867 | 0.103218 | -1.922262 | -1.463847 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1993 | 1.0 | Norway | -0.138992 | -0.999406 | -0.887521 | -1.093126 | 0.457821 | 0.229776 | 0.844075 | 0.539597 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1064 | 1.0 | Czechia | -1.012001 | -0.966771 | -0.862756 | -0.630140 | 2.301492 | 0.482891 | 1.110069 | 0.721728 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 742 | 1.0 | Hungary | 0.079261 | -0.908754 | -0.824484 | -0.225231 | 1.763860 | 0.229776 | 1.163268 | 0.721728 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |


## **Modelling**
- Several columns are removed before the model is trained, including:
    - **`Economy_status_Developing`** which had high correlation to **`Economy_status_Developed`**.
    - **`Country`** which would introduce a large amount of bias to the model.
- A Linear Regression model was created using the **Ordinary Least Squares** method.
- The model was fitted.
- The model results are saved using the statsmodels [.save()](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLSResults.save.html) method.
- Summary was generated giving:
    - **`R_squared`** value of **0.984**
    - **`Cond No`** value of **27.7**
    - **`P-Value`** of all features **< 0.05**

In [27]:
# REMOVING COLUMNS BEFORE MODELLING

# Define a set of columns to be removed
cols_to_remove = {'Country', 'Alcohol_consumption', 'Economy_status_Developing',
                  'Polio', 'Diphtheria', 'Population_mln', 'Thinness_five_nine_years',
                  'Thinness_ten_nineteen_years', 'Measles', 'Hepatitis_B'}

# Create a list of feature columns by excluding the ones listed in cols_to_remove from the columns in X_train_fe
feature_cols = [col for col in list(X_train_fe.columns) if col not in cols_to_remove]

In [28]:
lin_reg = sm.OLS(y_train, X_train_fe[feature_cols]) # Create a linear regression model using the Ordinary Least Squares method.
results = lin_reg.fit()                             # Fit the OLS model to the training data
results.save('advanced_model')                      # Save the trained model to a file ('advanced_model')
results.summary()                                   # Generate and display a summary of the fitted model

| | | | |
|---|---|---|---|
| Dep. Variable: | Life_expectancy | R-squared: | 0.984 |
| Model: | OLS | Adj. R-squared: | 0.984 |
| Method: | Least Squares | F-statistic: | 8096.0 |
| Date: | Mon, 09 Dec 2024 | Prob (F-statistic): | 0.0 |
| Time: | 15:49:36 | Log-Likelihood: | -3663.1 |
| No. Observations: | 2291 | AIC: | 7362.0 |
| Df Residuals: | 2273 | BIC: | 7465.0 |
| Df Model: | 17 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 67.9636 | 0.074 | 913.119 | 0.000 | 67.818 | 68.110 |
| Year | 0.1449 | 0.027 | 5.435 | 0.000 | 0.093 | 0.197 |
| Infant_deaths | -1.5101 | 0.172 | -8.770 | 0.000 | -1.848 | -1.172 |
| Under_five_deaths | -2.1271 | 0.174 | -12.200 | 0.000 | -2.469 | -1.785 |
| Adult_mortality | -5.2894 | 0.070 | -75.636 | 0.000 | -5.427 | -5.152 |
| BMI | -0.2774 | 0.047 | -5.898 | 0.000 | -0.370 | -0.185 |
| Incidents_HIV | 0.1604 | 0.043 | 3.711 | 0.000 | 0.076 | 0.245 |
| GDP_per_capita | 0.3755 | 0.041 | 9.251 | 0.000 | 0.296 | 0.455 |
| Schooling | 0.3703 | 0.056 | 6.618 | 0.000 | 0.261 | 0.480 |

| | | | |
|---|---|---|---|
| Omnibus: | 7.211 | Durbin-Watson: | 2.048 |
| Prob(Omnibus): | 0.027 | Jarque-Bera (JB): | 7.381 |
| Skew: | 0.108 | Prob(JB): | 0.025 |
| Kurtosis: | 3.175 | Cond. No. | 27.7 |


## **Test & Metrics**
- **RMSE** was calculated on the training predictions, giving a value of **1.197**, i.e. the predictions are typically about 1.197 years off the true life expectancy.
- **MAPE** was also calculated on the training predictions to express the average error as a percentage of the actual life expectancy, giving a value of **1.45%**.
- The **`feature_eng`** function was run again in prediction mode (**`save_metadata`** set to `False`) so that the test/input dataset is aligned and scaled using the saved feature columns and scaler.
- **RMSE** was calculated on the test predictions, giving a value of **1.205**, i.e. the predictions are typically about 1.205 years off the true value.
- **MAPE** was also calculated on the test predictions, giving a value of **1.476%**.
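
The two metrics above can be worked through by hand on a tiny example. This sketch uses three made-up life expectancies (not values from the dataset) and plain NumPy, so each step of the arithmetic is visible.

```python
import numpy as np

# Made-up life expectancies in years, purely illustrative
y_true = np.array([60.0, 70.0, 80.0])
y_hat = np.array([61.0, 68.0, 80.0])

errors = y_true - y_hat                               # [-1, 2, 0]
rmse_toy = np.sqrt(np.mean(errors ** 2))              # sqrt((1 + 4 + 0) / 3)
mae_toy = np.mean(np.abs(errors))                     # (1 + 2 + 0) / 3
mape_toy = np.mean(np.abs(errors) / np.abs(y_true))   # mean(1/60, 2/70, 0/80)

print(round(rmse_toy, 3))        # 1.291
print(mae_toy)                   # 1.0
print(round(mape_toy * 100, 3))  # 1.508
```

RMSE squares the errors before averaging, so it weights large misses more heavily and is never smaller than the mean absolute error (1.291 versus 1.0 here). It is best read as a typical error size in years, not as a plain average.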

In [32]:
## Let's check the performance of our model

y_pred = results.predict(X_train_fe[feature_cols])           # Use the trained regression model to make predictions on the training dataset.
rmse = statsmodels.tools.eval_measures.rmse(y_train, y_pred) # Calculate the Root Mean Squared Error (RMSE) between the actual target values (y_train) and the predicted values (y_pred)
print(rmse)                                                  # Prints the value

1.1971726790615589


In [34]:
mape = metrics.mean_absolute_percentage_error(y_train,y_pred) # Calculate the MAPE
print(f"{mape*100}%")                                         # Print the value

1.4510375442285888%


In [36]:
# FEATURE ENGINEERING FUNCTION (redefined with save_metadata=False as the default, for the prediction phase)

def feature_eng(train_df, test_df, save_metadata=False, include_regions=True):
    """
    Feature engineering function with joblib for saving/loading scalers and feature columns.
    
    Args:
        train_df (pd.DataFrame): Training dataset.
        test_df (pd.DataFrame): Test dataset or user input.
        save_metadata (bool): Whether to save the scaler and feature columns. Only set to True during training.
        include_regions (bool): Whether to include the regions or not. Only include if using the advanced model
    Returns:
        train_df (pd.DataFrame): training dataset.
        test_df (pd.DataFrame): test dataset.
    """
    train_df = train_df.copy()  # Copy the training dataset
    test_df = test_df.copy()  # Copy the test dataset

    # All columns that need to be scaled
    scale_columns = ['Year', 'Infant_deaths', 'Under_five_deaths', 'Adult_mortality',
                     'Alcohol_consumption', 'Hepatitis_B', 'Measles', 'BMI',
                     'Polio', 'Diphtheria', 'Incidents_HIV', 'GDP_per_capita',
                     'Population_mln', 'Thinness_ten_nineteen_years',
                     'Thinness_five_nine_years', 'Schooling']

    # Training phase:
    if save_metadata:
        # One-hot encode the regions (the first region is dropped as the baseline)
        train_df = pd.get_dummies(train_df, columns=['Region'], drop_first=True, prefix='Region', dtype=int) # OHE the regions

        # Fit scaler and extract the train columns
        scaler = StandardScaler()                                                 # Call the scaling method
        train_df[scale_columns] = scaler.fit_transform(train_df[scale_columns])   # Fit scaler and transform the training dataset
        feature_columns = train_df.columns                                        # Extract the training columns

        # Save scaler and feature columns
        joblib.dump(scaler, 'scaler')                   # Save the training scaler
        joblib.dump(feature_columns, 'feature_columns') # Save the feature columns

    # Prediction phase:
    else:
        # Load scaler and feature columns
        scaler = joblib.load('scaler')                    # Load the training scaler
        feature_columns = joblib.load('feature_columns')  # Load the feature columns

        # If Region is present in test dataset:
        if include_regions and 'Region' in test_df.columns:
            test_df = pd.get_dummies(test_df, columns=['Region'], drop_first=False, prefix='Region', dtype=int) # OHE the regions and keep the first column

        # If region is not present in the test dataset
        else:
            test_df.drop(columns=['Region'], errors='ignore', inplace=True) # Drop the regions columns

        # Align test_df with train_df before scaling
        test_df = test_df.reindex(columns=feature_columns, fill_value=0)          # aligns the test and train dataset together and fills the missing values with 0
        common_columns = [col for col in scale_columns if col in test_df.columns] # Find the common columns from the scale_columns and the test dataset
        test_df[common_columns] = scaler.transform(test_df[common_columns])       # Scales the test dataset on the common columns using the training scaler

    # Add Constant
    train_df = sm.add_constant(train_df, has_constant='add') # Add constant to the training dataset
    test_df = sm.add_constant(test_df, has_constant='add')   # Add constant to the testing dataset

    return train_df, test_df

In [38]:
_, X_test_fe = feature_eng(X_train, X_test, save_metadata=False) # Prediction phase: feature engineer the test dataset using the saved scaler and feature columns

In [40]:
# Check the performance of the model on the test dataset
y_test_pred = results.predict(X_test_fe[feature_cols])           # Use the trained regression model to make predictions on the test dataset.
rmse = statsmodels.tools.eval_measures.rmse(y_test, y_test_pred) # Calculate the Root Mean Squared Error (RMSE) between the actual target values (y_test) and the predicted values (y_test_pred)
print(rmse)                                                      # Print the value

1.2054852329139532


In [42]:
mape = metrics.mean_absolute_percentage_error(y_test,y_test_pred) # Calculate the MAPE
print(f"{mape*100}%")                                             # Print the value

1.47585825973383%
