# Overview

### 1. Dataset & Introduction

This dataset was collected from the Global Health Observatory (GHO) data repository of the World Health Organization (WHO) and covers the years 2000-2015. More information about the dataset can be found at: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who

Our aim is to predict life expectancy (a continuous variable) by training a decision tree regressor on several health-related factors (mortality factors, socioeconomic factors, etc.).

We will use nested cross-validation for hyperparameter tuning and model evaluation.

***Example schematic view of nested cross-validation paradigm*** (Adopted from https://sebastianraschka.com/blog/2018/model-evaluation-selection-part4.html)

![image.png](attachment:image.png)

### 2. Main Steps We Will Follow

* Initialize regressors
* Set up the parameter grids
* Set up GridSearchCV for the hyperparameter tuning process. This takes place in the inner loop.
* Split the data into outer CV folds and iterate through each fold. Within each fold, run the grid search on the outer training set, refit the best hyperparameter setting on that whole training set, and evaluate the refit model on the outer validation set.

The inner loop selects the best hyperparameter setting. That setting is then scored in two ways: by its average error across the inner test folds, and by its error on the corresponding held-out fold of the outer loop.
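As a compact illustration of the two loops described above, the sketch below wraps a `GridSearchCV` (inner loop) inside `cross_val_score` (outer loop). It runs on synthetic data from `make_regression`, not on the WHO dataset, and the small grid and plain `KFold` splitters are assumptions chosen only to keep the example quick. It is one possible way to express nested cross-validation, not the exact procedure used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: NOT the life expectancy dataset)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Inner loop: a 3-fold grid search selects the hyperparameters
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": [3, 5], "min_samples_leaf": [5, 10]},  # illustrative grid
    scoring="neg_mean_squared_error",
    cv=inner_cv,
)

# Outer loop: for each of the 5 folds, the whole search is rerun on the
# training part and the refit best model is scored on the held-out part
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_scores = cross_val_score(search, X, y,
                               scoring="neg_mean_squared_error", cv=outer_cv)

outer_mse = -outer_scores  # scores are negated MSE, so flip the sign
print("Outer-fold MSE: %.2f +/- %.2f" % (outer_mse.mean(), outer_mse.std()))
```

Writing the outer loop by hand, as done later in this notebook, gives access to the best parameters and several metrics per fold, which `cross_val_score` does not report.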

## Setting Up Our Environment

In [1]:
# Global Imports
import numpy as np   
import pandas as pd
import sklearn
import matplotlib.pyplot as plt   # plotting
from sklearn.model_selection import train_test_split  # ML data splits
from sklearn.preprocessing import MinMaxScaler # ML preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold


#Set seed
np.random.seed(1234) 

# Import model
from sklearn.tree import DecisionTreeRegressor

# Import metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

## The data set

In [2]:
#Import dataset
df = pd.read_csv('C:/Users/kimng/Desktop/EDPY506/Code Presentation/Life Expectancy Data.csv')
df.describe()

Unnamed: 0,Year,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
count,2938.0,2928.0,2928.0,2938.0,2744.0,2938.0,2385.0,2938.0,2904.0,2938.0,2919.0,2712.0,2919.0,2938.0,2490.0,2286.0,2904.0,2904.0,2771.0,2775.0
mean,2007.51872,69.224932,164.796448,30.303948,4.602861,738.251295,80.940461,2419.59224,38.321247,42.035739,82.550188,5.93819,82.324084,1.742103,7483.158469,12753380.0,4.839704,4.870317,0.627551,11.992793
std,4.613841,9.523867,124.292079,117.926501,4.052413,1987.914858,25.070016,11467.272489,20.044034,160.445548,23.428046,2.49832,23.716912,5.077785,14270.169342,61012100.0,4.420195,4.508882,0.210904,3.35892
min,2000.0,36.3,1.0,0.0,0.01,0.0,1.0,0.0,1.0,0.0,3.0,0.37,2.0,0.1,1.68135,34.0,0.1,0.1,0.0,0.0
25%,2004.0,63.1,74.0,0.0,0.8775,4.685343,77.0,0.0,19.3,0.0,78.0,4.26,78.0,0.1,463.935626,195793.2,1.6,1.5,0.493,10.1
50%,2008.0,72.1,144.0,3.0,3.755,64.912906,92.0,17.0,43.5,4.0,93.0,5.755,93.0,0.1,1766.947595,1386542.0,3.3,3.3,0.677,12.3
75%,2012.0,75.7,228.0,22.0,7.7025,441.534144,97.0,360.25,56.2,28.0,97.0,7.4925,97.0,0.8,5910.806335,7420359.0,7.2,7.2,0.779,14.3
max,2015.0,89.0,723.0,1800.0,17.87,19479.91161,99.0,212183.0,87.3,2500.0,99.0,17.6,99.0,50.6,119172.7418,1293859000.0,27.7,28.6,0.948,20.7


In [3]:
#check if there were any missing data
df.isnull().sum()

Country                              0
Year                                 0
Status                               0
Life expectancy                     10
Adult Mortality                     10
infant deaths                        0
Alcohol                            194
percentage expenditure               0
Hepatitis B                        553
Measles                              0
 BMI                                34
under-five deaths                    0
Polio                               19
Total expenditure                  226
Diphtheria                          19
 HIV/AIDS                            0
GDP                                448
Population                         652
 thinness  1-19 years               34
 thinness 5-9 years                 34
Income composition of resources    167
Schooling                          163
dtype: int64
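
Several column names in this file carry stray spaces (for example ` BMI` and ` thinness  1-19 years` in the output above). The sketch below shows one optional way to normalize such names on a toy frame; the frame `toy` and its values are made up for illustration. The later cells in this notebook index columns by their original padded names, so they would need to be updated if this normalization were applied to `df`.

```python
import pandas as pd

# Toy frame reusing a few padded column names (values are made up)
toy = pd.DataFrame({"Life expectancy ": [65.0],
                    " BMI ": [19.1],
                    " thinness  1-19 years": [17.2]})

# Strip outer whitespace, then collapse internal runs of spaces to one
toy.columns = toy.columns.str.strip().str.replace(r"\s+", " ", regex=True)

print(list(toy.columns))
# ['Life expectancy', 'BMI', 'thinness 1-19 years']
```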

In [4]:
# drop missing data list-wise
df.dropna(inplace=True)

#reset index -- this is to replace old data index with index based on current data.
df = df.reset_index(drop=True)

# get new data dimension
df.shape 

(1649, 22)

After listwise deletion of missing data, the final dataset has 1649 instances and 22 columns. For simplicity, we will predict life expectancy from all columns except 'Country' (number of predictors = 20). Country 'Status' (developing/developed) will be recoded into a binary value (0 = developing; 1 = developed).

In [5]:
df.loc[df['Status'] == "Developing", 'Status'] = 0
df.loc[df['Status'] == "Developed", 'Status'] = 1
df['Status'] = df['Status'].apply(np.int64)
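
An equivalent recoding can be written with a dictionary and `Series.map`, which also makes it easy to check for unexpected labels. This is a sketch on a toy frame; `toy` and `status_codes` are illustrative names that do not appear elsewhere in the notebook.

```python
import pandas as pd

# Toy column standing in for df['Status']
toy = pd.DataFrame({"Status": ["Developing", "Developed", "Developing"]})

# 0 = developing, 1 = developed, as in the cell above
status_codes = {"Developing": 0, "Developed": 1}
toy["Status"] = toy["Status"].map(status_codes)

# map() returns NaN for labels missing from the dictionary, so verify first
assert toy["Status"].notna().all(), "unexpected Status label"
toy["Status"] = toy["Status"].astype("int64")

print(toy["Status"].tolist())
# [0, 1, 0]
```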

## Initialize Regressor

In [6]:
model = DecisionTreeRegressor(random_state=42)

## Set up the parameter grids

In [10]:
param_grid = {'max_depth': [5, 6, 7, 8, 9, 10],
               'min_samples_leaf': [5, 10, 15, 20]
               }
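
To get a feel for the cost of this grid, the sketch below counts the candidate settings with `ParameterGrid` and estimates the total number of model fits. The fold counts (3 inner, 5 outer) are the ones used in the later cells; the extra fit per outer fold assumes `refit=True`.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'max_depth': [5, 6, 7, 8, 9, 10],
              'min_samples_leaf': [5, 10, 15, 20]}

# 6 depths x 4 leaf sizes
n_candidates = len(ParameterGrid(param_grid))

inner_folds, outer_folds = 3, 5  # values used later in this notebook

# Per outer fold: every candidate is fit on every inner fold,
# plus one refit of the winning candidate on the full outer training set
total_fits = outer_folds * (n_candidates * inner_folds + 1)

print(n_candidates, total_fits)
# 24 365
```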


## Set up GridSearchCV

In [11]:
search = GridSearchCV(estimator=model, 
                      param_grid = param_grid,
                      scoring = 'neg_mean_squared_error', #use MSE for model selection (larger neg-MSE = better model)
                      cv = 3, 
                      n_jobs = 1, 
                      refit = True)
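
Because scikit-learn scorers follow a "higher is better" convention, `neg_mean_squared_error` reports the MSE with its sign flipped, so `best_score_` is negative or zero. The sketch below shows this on synthetic data and converts the score back to MSE and RMSE. The data, the small grid, and the name `demo_search` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: NOT the life expectancy dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

demo_search = GridSearchCV(estimator=DecisionTreeRegressor(random_state=42),
                           param_grid={"max_depth": [2, 4]},  # illustrative grid
                           scoring="neg_mean_squared_error",
                           cv=3,
                           refit=True)
demo_search.fit(X, y)

# best_score_ is the mean negated MSE of the best candidate over the inner test folds
best_mse = -demo_search.best_score_
best_rmse = np.sqrt(best_mse)

print("best_score_ (negated MSE):", demo_search.best_score_)
print("MSE:", best_mse, "RMSE:", best_rmse)
```

Taking `abs(search.best_score_)`, as done in the loop below, gives the same value as negating it.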

## Nested Cross Validation procedure

In [16]:
# Empty list to store evaluation metrics
outer_scores_mae = []
outer_scores_mse = []
outer_scores_rmse = []
outer_scores_r2 = []

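# Configure the outer cross-validation procedure.
# Folds are stratified on the binary 'Status' column (see the split call below),
# because StratifiedKFold cannot stratify on a continuous target.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0) 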

# Loop through each outer CV fold
for train_index_outer, val_index_outer in outer_cv.split(df, df.Status):
    train_set = df.loc[train_index_outer,:]
    val_set = df.loc[val_index_outer,:]

    feature_names = ['Year', 'Status', 'Adult Mortality',
                     'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
                     'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
                     'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population',
                     ' thinness  1-19 years', ' thinness 5-9 years',
                     'Income composition of resources', 'Schooling']
    target = ['Life expectancy ']

    X_train = train_set[feature_names]
    y_train = train_set[target]
    X_val = val_set[feature_names]
    y_val = val_set[target]
            
    #Apply grid search with CV=3 on outer train_set (this is hyperparameter tuning process within the inner loop)
    search.fit(X = X_train, y = y_train) # run inner loop hyperparam tuning
    print('\n        Best MSE (inner test folds):', abs(search.best_score_))
    print('        Best parameters:', search.best_params_)

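    # Training error of the refit best model. Note: best_estimator_ was refit on the
    # whole outer training set, so this is the MSE on the outer training set,
    # even though the printed label says 'inner train fold'.
    y_train_hat = search.best_estimator_.predict(X_train)
    mse_train = mean_squared_error(y_train, y_train_hat)
    print('        MSE (on inner train fold)', (mse_train))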
    
    # Calculate evaluation metrics using best-tuned model on the outer val_set  
    #MSE
    y_val_hat = search.best_estimator_.predict(X_val)
    mse_val = mean_squared_error(y_val, y_val_hat)
    outer_scores_mse.append(mse_val)
    print('        MSE (on outer validation fold)', (outer_scores_mse[-1]))
    
    #RMSE
    rmse_val = np.sqrt(mse_val)
    outer_scores_rmse.append(rmse_val)
    print('        RMSE (on outer validation fold)', (outer_scores_rmse[-1]))

    # MAE
    mae_val = mean_absolute_error(y_val, y_val_hat)
    outer_scores_mae.append(mae_val)          
    print('        MAE (on outer validation fold)', (outer_scores_mae[-1]))
            
    # R2
    r2_val = r2_score(y_val, y_val_hat)
    outer_scores_r2.append(r2_val)    
    print('        R2 (on outer validation fold)', (outer_scores_r2[-1]))

# Print evaluation metrics across all outer loop folds
print('\n    Average performance on Outer Loop:')
print('        MSE %.2f +/- %.2f'% (np.mean(outer_scores_mse), np.std(outer_scores_mse)))
print('        RMSE %.2f +/- %.2f'% (np.mean(outer_scores_rmse), np.std(outer_scores_rmse)))
print('        MAE %.2f +/- %.2f'% (np.mean(outer_scores_mae), np.std(outer_scores_mae)))
print('        R2 %.2f +/- %.2f'% (np.mean(outer_scores_r2), np.std(outer_scores_r2)))


        Best MSE (inner test folds): 11.031560299770236
        Best parameters: {'max_depth': 6, 'min_samples_leaf': 10}
        MSE (on inner train fold) 5.0747233057579
        MSE (on outer validation fold) 7.177791241413135
        RMSE (on outer validation fold) 2.679140019001085
        MAE (on outer validation fold) 1.9182962093879403
        R2 (on outer validation fold) 0.8937706037188775

        Best MSE (inner test folds): 12.250788663505418
        Best parameters: {'max_depth': 9, 'min_samples_leaf': 5}
        MSE (on inner train fold) 2.2468033209273472
        MSE (on outer validation fold) 7.197880011022941
        RMSE (on outer validation fold) 2.682886507294511
        MAE (on outer validation fold) 1.8171189170198603
        R2 (on outer validation fold) 0.9071553383814229

        Best MSE (inner test folds): 10.60236550669668
        Best parameters: {'max_depth': 6, 'min_samples_leaf': 10}
        MSE (on inner train fold) 4.765307133986369
        MSE (on ou