## Overall Workflow

1. Initialize regressors 
2. Build the pipelines
3. Set up the parameter grids
4. Set up one GridSearchCV per algorithm for hyperparameter tuning. This tuning takes place in the inner loop.
5. Define the inner and outer loops
* Use StratifiedKFold to create the outer loop folds.
* The training folds created in the outer loop are used in the inner loop for hyperparameter tuning.
* The inner loop selects the best hyperparameter setting. That setting is then scored twice: as the average across the inner test folds, and on the *one* corresponding test fold of the outer loop.
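
The workflow above can be sketched compactly with `cross_val_score` wrapped around a `GridSearchCV`. This is a minimal illustration only, under stated assumptions: it uses synthetic data from `make_regression` instead of the CVLT dataset, plain `KFold` instead of `StratifiedKFold`, and a single Ridge regressor. The names `X_demo`, `y_demo`, `demo_inner_cv`, `demo_outer_cv`, and `tuner` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in data (assumption: not the CVLT dataset)
X_demo, y_demo = make_regression(n_samples=120, n_features=10,
                                 noise=10.0, random_state=0)

demo_inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
demo_outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Inner loop: tune alpha using only the outer training fold it is given
tuner = GridSearchCV(Ridge(), {'alpha': [0.01, 0.1, 1, 10]},
                     scoring='neg_mean_squared_error', cv=demo_inner_cv)

# Outer loop: the tuned model is scored once on each held-out outer fold
nested_scores = cross_val_score(tuner, X_demo, y_demo, cv=demo_outer_cv,
                                scoring='neg_mean_squared_error')
nested_mse = -nested_scores  # flip the sign back to a plain MSE
print('Nested CV MSE: %.2f +/- %.2f' % (nested_mse.mean(), nested_mse.std()))
```

The explicit loops used later in this notebook do the same job, but they also expose the best parameters chosen in each outer fold, which `cross_val_score` does not report.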

In [1]:
%matplotlib notebook
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold, StratifiedGroupKFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

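#Set random seed to current date and time to get a pseudorandom state
#(random.seed() accepts an int, float, str, or bytes, so pass the timestamp rather than the datetime object)
import random
from datetime import datetime
random.seed(datetime.now().timestamp())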

#This is to round the output to 3 decimal places when printing outputs
np.set_printoptions(precision=3)
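
Note that scikit-learn splitters and estimators do not draw from Python's `random` module. They use their `random_state` argument, or NumPy's global generator when it is `None`. If a run should be different each time but still repeatable afterwards, one option is to draw a seed, print it, and pass it on explicitly. The sketch below assumes hypothetical names `run_seed`, `cv_a`, and `cv_b`.

```python
import random

import numpy as np
from sklearn.model_selection import KFold

# Draw a fresh seed for this run and record it so the run can be repeated
run_seed = random.SystemRandom().randrange(2**32)
print('Seed for this run:', run_seed)

# Two splitters built with the same seed produce identical folds
cv_a = KFold(n_splits=5, shuffle=True, random_state=run_seed)
cv_b = KFold(n_splits=5, shuffle=True, random_state=run_seed)

rows = np.arange(20)
folds_a = [test.tolist() for _, test in cv_a.split(rows)]
folds_b = [test.tolist() for _, test in cv_b.split(rows)]
print(folds_a == folds_b)
```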


In [2]:

#Import dataset
data = pd.read_csv("C:/Users/kimng/Desktop/ML - Age, Hipp, CVLT/CVLTHippocampus.csv")
data = data.reset_index(drop=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140 entries, 0 to 139
Columns: 206 entries, ID_code to RecallEfficiency
dtypes: float64(181), int64(22), object(3)
memory usage: 225.4+ KB


In [3]:
data.columns

Index(['ID_code', 'Scan_code', 'Sex', 'Age', 'Age50', 'AgeCategory', 'ICV',
       'Handedness', 'Education', 'EduYears',
       ...
       'CVLT_CorrectRate_B', 'CVLT_CorrectRate_SDFR', 'CVLT_CorrectRate_SDCR',
       'CVLT_CorrectRate_LDFR', 'CVLT_CorrectRate_LDCR', 'GenVerbLearn',
       'ResponseError', 'OrgStrat', 'SerialEffect', 'RecallEfficiency'],
      dtype='object', length=206)

In [4]:
columns = ['CVLT_Imm_Total', 'CVLT_DelR_SD_Free', 'CVLT_DelR_LD_Free',
            'Age','Sex', 'EduYears', 'Smoker', 'High_BP', 'COMT', 'BDNF2', 'ApoE4',
           'L_HH_Total', 'R_HH_Total', 'L_HB_Total', 'R_HB_Total', 'L_HT_Total', 'R_HT_Total',
           'L_DG_Total', 'R_DG_Total',
           'L_CA_Total', 'R_CA_Total',
           'L_Sub_Total', 'R_Sub_Total',
           'L_HH_CA', 'R_HH_CA', 'L_HB_CA', 'R_HB_CA', 'L_HT_CA', 'R_HT_CA', 
           'L_HH_DG', 'R_HH_DG', 'L_HB_DG', 'R_HB_DG', 'L_HT_DG', 'R_HT_DG',
           'L_HH_Sub', 'R_HH_Sub', 'L_HB_Sub', 'R_HB_Sub', 'L_HT_Sub', 'R_HT_Sub']


In [5]:
df = data[columns]

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140 entries, 0 to 139
Data columns (total 41 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CVLT_Imm_Total     137 non-null    float64
 1   CVLT_DelR_SD_Free  137 non-null    float64
 2   CVLT_DelR_LD_Free  137 non-null    float64
 3   Age                140 non-null    int64  
 4   Sex                140 non-null    int64  
 5   EduYears           140 non-null    int64  
 6   Smoker             140 non-null    int64  
 7   High_BP            140 non-null    int64  
 8   COMT               140 non-null    int64  
 9   BDNF2              140 non-null    int64  
 10  ApoE4              140 non-null    int64  
 11  L_HH_Total         129 non-null    float64
 12  R_HH_Total         129 non-null    float64
 13  L_HB_Total         129 non-null    float64
 14  R_HB_Total         129 non-null    float64
 15  L_HT_Total         129 non-null    float64
 16  R_HT_Total         129 non

In [7]:
#Reassign instead of using inplace=True, which raises SettingWithCopyWarning because df is a slice of data
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 129 entries, 0 to 139
Data columns (total 41 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CVLT_Imm_Total     129 non-null    float64
 1   CVLT_DelR_SD_Free  129 non-null    float64
 2   CVLT_DelR_LD_Free  129 non-null    float64
 3   Age                129 non-null    int64  
 4   Sex                129 non-null    int64  
 5   EduYears           129 non-null    int64  
 6   Smoker             129 non-null    int64  
 7   High_BP            129 non-null    int64  
 8   COMT               129 non-null    int64  
 9   BDNF2              129 non-null    int64  
 10  ApoE4              129 non-null    int64  
 11  L_HH_Total         129 non-null    float64
 12  R_HH_Total         129 non-null    float64
 13  L_HB_Total         129 non-null    float64
 14  R_HB_Total         129 non-null    float64
 15  L_HT_Total         129 non-null    float64
 16  R_HT_Total         129 non


In [10]:
#df.Sex = df.Sex-1

In [9]:
df = df.reset_index()

In [11]:
df.loc[df.isnull().any(axis=1)]

Unnamed: 0,index,CVLT_Imm_Total,CVLT_DelR_SD_Free,CVLT_DelR_LD_Free,Age,Sex,EduYears,Smoker,High_BP,COMT,...,L_HB_DG,R_HB_DG,L_HT_DG,R_HT_DG,L_HH_Sub,R_HH_Sub,L_HB_Sub,R_HB_Sub,L_HT_Sub,R_HT_Sub


### Create category variable for stratification

In [12]:
#Bin Age into groups
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90]
labels = [1,2,3,4,5,6,7,8]
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels)
df = df.reset_index(drop=True)

#Function to categorise each row by AgeGroup and Sex
def categorise(row):
    #Labels run 1-16: (AgeGroup 1, Sex 0) -> 1, (AgeGroup 1, Sex 1) -> 2, ..., (AgeGroup 8, Sex 1) -> 16
    if pd.isna(row['AgeGroup']) or row['Sex'] not in (0, 1):
        return None
    return 2 * (int(row['AgeGroup']) - 1) + int(row['Sex']) + 1


#Apply categories to dataframe
df['grp'] = df.apply(categorise, axis=1)
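
With 16 possible Age/Sex groups, some groups may hold only a few participants. `StratifiedKFold` warns when the smallest class has fewer members than `n_splits`, and this notebook suppresses warnings later on, so it may be worth checking group sizes explicitly. The sketch below uses a made-up toy frame (`toy`) rather than `df`, and a vectorised form of the same labelling as `categorise()`. All values are illustrative assumptions.

```python
import pandas as pd

# Toy frame standing in for df (values are made up for illustration)
toy = pd.DataFrame({'Age': [19, 25, 31, 31, 50, 67, 67, 72],
                    'Sex': [1, 0, 1, 1, 0, 1, 1, 0]})
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90]
toy['AgeGroup'] = pd.cut(toy['Age'], bins=bins, labels=[1, 2, 3, 4, 5, 6, 7, 8])

# Same labelling as categorise(): odd labels for Sex == 0, even for Sex == 1
toy['grp'] = 2 * (toy['AgeGroup'].astype(int) - 1) + toy['Sex'] + 1

# List the groups that have fewer members than the planned number of folds
n_splits = 5
group_sizes = toy['grp'].value_counts().sort_index()
small_groups = group_sizes[group_sizes < n_splits]
print(small_groups)
```

On the real data, any group listed in `small_groups` cannot be represented in every fold, which is one reason to consider coarser age bins.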

In [13]:
df.head()

Unnamed: 0,index,CVLT_Imm_Total,CVLT_DelR_SD_Free,CVLT_DelR_LD_Free,Age,Sex,EduYears,Smoker,High_BP,COMT,...,L_HT_DG,R_HT_DG,L_HH_Sub,R_HH_Sub,L_HB_Sub,R_HB_Sub,L_HT_Sub,R_HT_Sub,AgeGroup,grp
0,0,63.0,15.0,16.0,31,1,16,1,1,1,...,299.47821,285.051172,285.855522,307.093904,208.841138,263.11585,34.753172,38.936464,3,6
1,1,52.0,12.0,12.0,50,0,18,1,1,3,...,216.331552,194.072584,287.146142,331.563567,189.98378,215.776155,18.675363,31.495736,4,7
2,2,60.0,15.0,15.0,30,1,16,1,1,3,...,296.479599,249.55392,327.550589,353.394837,229.576124,298.222287,43.905748,38.330404,2,4
3,3,54.0,12.0,11.0,67,1,13,1,1,1,...,171.081315,161.05481,249.491163,361.781214,238.156421,215.735688,24.102164,30.143226,6,12
4,4,67.0,15.0,14.0,19,1,14,1,1,2,...,252.891192,267.005147,230.117166,296.707669,232.6189,173.66208,30.216082,29.959485,1,2


### Initialize Regressors

In [14]:
reg1 = Ridge()
reg2 = Lasso()
reg3 = ElasticNet()
reg4 = KNeighborsRegressor()
reg5 = SVR()
reg6 = DecisionTreeRegressor()
reg7 = RandomForestRegressor()

### Build the pipelines
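Pipelines are intended for Ridge, Lasso, ElasticNet, K-Neighbors, and SVR, because these algorithms require feature scaling. In this version of the notebook the pipelines are commented out, and scaling is applied once to the whole dataset in the Nested CV cell instead.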

In [57]:
#pipe1 = Pipeline([
#                 #('scaler', MinMaxScaler()),
#                  ('reg1', reg1)])
#pipe2 = Pipeline([
#                 #('scaler', MinMaxScaler()),
#                  ('reg2', reg2)])
#pipe3 = Pipeline([
#                 #('scaler', MinMaxScaler()),
#                  ('reg3', reg3)])
#pipe4 = Pipeline([
#                 #('scaler', MinMaxScaler()),
#                  ('reg4', reg4)])
#pipe5 = Pipeline([
#                 #('scaler', MinMaxScaler()),
#                  ('reg5', reg5)])
#pipe6 = Pipeline([
#                 #('scaler', MinMaxScaler()),
#                  ('reg6', reg6)])
#pipe7 = Pipeline([
#                 #('scaler', MinMaxScaler()),
#                  ('reg7', reg7)])
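
For reference, here is a minimal sketch of what an active pipeline could look like, assuming synthetic data and the hypothetical names `pipe_ridge`, `pipe_grid`, and `pipe_search`. When the scaler sits inside the pipeline, `GridSearchCV` refits it on each training split, so the held-out split does not influence the scaling. Parameter names in the grid then need the step prefix, e.g. `reg1__alpha`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: not the CVLT dataset)
X_demo, y_demo = make_regression(n_samples=90, n_features=6,
                                 noise=5.0, random_state=0)

# Scaler and regressor are fitted together on each training split
pipe_ridge = Pipeline([('scaler', MinMaxScaler()),
                       ('reg1', Ridge())])

# Grid keys use '<step name>__<parameter name>'
pipe_grid = {'reg1__alpha': [0.01, 0.1, 1, 10]}

pipe_search = GridSearchCV(pipe_ridge, pipe_grid,
                           scoring='neg_mean_squared_error',
                           cv=KFold(n_splits=3, shuffle=True, random_state=0))
pipe_search.fit(X_demo, y_demo)
print(pipe_search.best_params_)
```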


### Setting up the parameter grids

In [20]:
param_grid1 = [{'alpha': [.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 15, 20]}]

param_grid2 = [{'alpha': [.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 15, 20]}]

param_grid3 = [{'alpha': [.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 15, 20]}]

param_grid4 = [{'n_neighbors': list(range(1, 10))}]

param_grid5 = [{'C': [0.01, 0.1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50], 
                'gamma': ['scale', 'auto'], 
                'kernel': ['rbf', 'linear', 'poly'], 
                'degree': [1,2,3,4,5,6]}]

param_grid6 = [{'criterion': ['absolute_error'],
                'max_depth': [1, 2, 3, 4, 5],
                'min_samples_leaf': [1, 2, 3, 4, 5]
               }]

param_grid7 = [{'criterion': ['absolute_error'],
                'max_depth': [1, 2, 3, 4, 5],
                'min_samples_leaf': [1, 2, 3, 4, 5],
                'n_estimators': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]
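
In `SVR`, `degree` is only used by the `poly` kernel and `gamma` is not used by the `linear` kernel, so the single dictionary in `param_grid5` fits many combinations that differ only in ignored parameters. A possible alternative, sketched here with the hypothetical names `full_grid` and `split_grid`, is a list of one dictionary per kernel. `ParameterGrid` is used only to count the candidates.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.01, 0.1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50]
degrees = [1, 2, 3, 4, 5, 6]

# Same content as param_grid5: every kernel is crossed with gamma and degree
full_grid = {'C': C_values, 'gamma': ['scale', 'auto'],
             'kernel': ['rbf', 'linear', 'poly'], 'degree': degrees}

# One dictionary per kernel, listing only the parameters that kernel uses
split_grid = [
    {'kernel': ['linear'], 'C': C_values},
    {'kernel': ['rbf'], 'C': C_values, 'gamma': ['scale', 'auto']},
    {'kernel': ['poly'], 'C': C_values, 'gamma': ['scale', 'auto'],
     'degree': degrees},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)  # 612 255
```

`GridSearchCV` accepts the list form directly as `param_grid`, so the search covers the same distinct models with fewer fits.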

### Setting up Grid Search for inner loops

In [21]:

gridcvs = {}

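#Note: inside GridSearchCV, StratifiedKFold stratifies on the y passed to fit(),
#here the CVLT_Imm_Total scores themselves rather than the Age/Sex groups
inner_cv = StratifiedKFold(n_splits=3, shuffle=True)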

for pgrid, est, name in zip((param_grid1, param_grid2, param_grid3, param_grid4, param_grid5, param_grid6, param_grid7),
                            (reg1, reg2, reg3, reg4, reg5, reg6, reg7),
                            ('Ridge', 'Lasso', 'ElasticNet', 'KNN', 'SVR', 'DTree', 'RForest')):
    gcv = GridSearchCV(estimator=est,
                       param_grid=pgrid,
                       scoring='neg_mean_squared_error',
                       n_jobs=-1,
                       cv=inner_cv,                       
                       verbose=0,
                       refit=True
                      )
    gridcvs[name] = gcv
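
Because scikit-learn scorers follow a "higher is better" convention, `scoring='neg_mean_squared_error'` makes `best_score_` the negated MSE. The sketch below, on synthetic data with the hypothetical names `demo_search`, `best_inner_mse`, and `best_inner_rmse`, shows how to convert it back to a conventional MSE or RMSE for reporting.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (assumption: not the CVLT dataset)
X_demo, y_demo = make_regression(n_samples=90, n_features=6,
                                 noise=5.0, random_state=1)

demo_search = GridSearchCV(Lasso(), {'alpha': [0.01, 0.1, 1]},
                           scoring='neg_mean_squared_error',
                           cv=KFold(n_splits=3, shuffle=True, random_state=1))
demo_search.fit(X_demo, y_demo)

# best_score_ is <= 0; negate it to report an ordinary MSE
best_inner_mse = -demo_search.best_score_
best_inner_rmse = best_inner_mse ** 0.5
print('Inner MSE: %.2f, inner RMSE: %.2f' % (best_inner_mse, best_inner_rmse))
```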

In [22]:
gridcvs

{'Ridge': GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=True),
              estimator=Ridge(), n_jobs=-1,
              param_grid=[{'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 1, 5,
                                     10, 15, 20]}],
              scoring='neg_mean_squared_error'),
 'Lasso': GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=True),
              estimator=Lasso(), n_jobs=-1,
              param_grid=[{'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 1, 5,
                                     10, 15, 20]}],
              scoring='neg_mean_squared_error'),
 'ElasticNet': GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=True),
              estimator=ElasticNet(), n_jobs=-1,
              param_grid=[{'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 1, 5,
                                     10, 15, 20]}],
              scoring='neg_mean_squared_error'),
 'KNN': GridSearchCV(cv=StratifiedKFold(n_splits=3,

### Nested CV

In [23]:
import warnings
warnings.filterwarnings("ignore")

bins = [10, 20, 30, 40, 50, 60, 70, 80, 90]
labels = [1,2,3,4,5,6,7,8]
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels)

stratifyCategory = ['Sex', 'AgeGroup']

#from sklearn.model_selection import StratifiedShuffleSplit
#split = StratifiedShuffleSplit(test_size = 24, random_state=42, n_splits = 5)


feature_names = ['Age','Sex', 'EduYears', 'Smoker', 'High_BP', 'COMT', 'BDNF2', 'ApoE4',
                 'L_HH_Total', 'R_HH_Total', 'L_HB_Total', 'R_HB_Total', 'L_HT_Total', 'R_HT_Total',
                 'L_DG_Total', 'R_DG_Total',
                 'L_CA_Total', 'R_CA_Total',
                 'L_Sub_Total', 'R_Sub_Total',
                 'L_HH_CA', 'R_HH_CA', 'L_HB_CA', 'R_HB_CA', 'L_HT_CA', 'R_HT_CA', 
                 'L_HH_DG', 'R_HH_DG', 'L_HB_DG', 'R_HB_DG', 'L_HT_DG', 'R_HT_DG',
                 'L_HH_Sub', 'R_HH_Sub', 'L_HB_Sub', 'R_HB_Sub', 'L_HT_Sub', 'R_HT_Sub']


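# Feature Scaling
# Caution: the scaler is fit on all rows before the CV split, so the min/max of
# validation-fold rows influence the scaled training data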
scaler = MinMaxScaler()
df[feature_names] = scaler.fit_transform(df[feature_names])

for name, est in gridcvs.items():

        print(50 * '-', '\n')
        print('Algorithm:', name)
        print('    Inner loop:')

        outer_scores_mae = []
        outer_scores_mse = []
        outer_scores_r2 = []
        outer_cv = StratifiedKFold(n_splits=5, shuffle=True) 

        for train_index_outer, val_index_outer in outer_cv.split(df, df.grp):
            train_set = df.loc[train_index_outer,:]
            val_set = df.loc[val_index_outer,:]

            X = ['Age','Sex', 'EduYears', 
                     'L_HH_Total', 'R_HH_Total', 'L_HB_Total', 'R_HB_Total', 'L_HT_Total', 'R_HT_Total',
                     'L_DG_Total', 'R_DG_Total',
                     'L_CA_Total', 'R_CA_Total',
                     'L_Sub_Total', 'R_Sub_Total',
                     'L_HH_CA', 'R_HH_CA', 'L_HB_CA', 'R_HB_CA', 'L_HT_CA', 'R_HT_CA', 
                     'L_HH_DG', 'R_HH_DG', 'L_HB_DG', 'R_HB_DG', 'L_HT_DG', 'R_HT_DG',
                     'L_HH_Sub', 'R_HH_Sub', 'L_HB_Sub', 'R_HB_Sub', 'L_HT_Sub', 'R_HT_Sub']
            y = ['CVLT_Imm_Total']

            X_train = train_set[X]
            y_train = train_set[y]
            X_val = val_set[X]
            y_val = val_set[y]
            
            #Apply grid search with CV=3 on outer train_set

            grid = gridcvs[name]
            grid.fit(X = X_train,
                     y = y_train) # run inner loop hyperparam tuning
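            # best_score_ is the negated MSE (scoring='neg_mean_squared_error'), so the printed value is negative
            print('\n        Best MSE (inner test folds):', (grid.best_score_))
            print('        Best parameters:', grid.best_params_)

            # Calculate evaluation metrics using best-tuned model on outer val_set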
            #MSE
            y_val_hat = grid.best_estimator_.predict(X_val)
            mse_val = mean_squared_error(y_val, y_val_hat)
            outer_scores_mse.append(mse_val)

            print('        MSE (on outer validation fold)', (outer_scores_mse[-1]))

            # MAE
            mae_val = mean_absolute_error(y_val, y_val_hat)
            outer_scores_mae.append(mae_val)
            
            print('        MAE (on outer validation fold)', (outer_scores_mae[-1]))
            
            # R2
            r2_val = r2_score(y_val, y_val_hat)
            outer_scores_r2.append(r2_val)
            
            print('        R2 (on outer validation fold)', (outer_scores_r2[-1]))
            
        print('\n    Outer Loop:')
        print('        MSE %.2f +/- %.2f'%
                  (np.mean(outer_scores_mse), np.std(outer_scores_mse)))
        print('        MAE %.2f +/- %.2f'%
                  (np.mean(outer_scores_mae), np.std(outer_scores_mae)))
        print('        R2 %.2f +/- %.2f'%
                  (np.mean(outer_scores_r2), np.std(outer_scores_r2)))

-------------------------------------------------- 

Algorithm: Ridge
    Inner loop:

        Best MSE (inner test folds): -76.46413371705692
        Best parameters: {'alpha': 5}
        MSE (on outer validation fold) 41.0194927435878
        MAE (on outer validation fold) 5.386127357939007
        R2 (on outer validation fold) 0.34614876336000966

        Best MSE (inner test folds): -66.3258011536952
        Best parameters: {'alpha': 5}
        MSE (on outer validation fold) 83.03568267393825
        MAE (on outer validation fold) 7.476823386030388
        R2 (on outer validation fold) -0.13450939805530382

        Best MSE (inner test folds): -63.27652919178123
        Best parameters: {'alpha': 5}
        MSE (on outer validation fold) 78.22374619434288
        MAE (on outer validation fold) 8.018418168877565
        R2 (on outer validation fold) 0.2573660216645489

        Best MSE (inner test folds): -70.21993528443862
        Best parameters: {'alpha': 5}
        MSE (on oute

        MSE (on outer validation fold) 77.70192307692308
        MAE (on outer validation fold) 7.4423076923076925
        R2 (on outer validation fold) 0.0867497739759372

        Best MSE (inner test folds): -89.57275910364145
        Best parameters: {'criterion': 'absolute_error', 'max_depth': 1, 'min_samples_leaf': 1}
        MSE (on outer validation fold) 97.125
        MAE (on outer validation fold) 7.903846153846154
        R2 (on outer validation fold) -0.25757053381600903

        Best MSE (inner test folds): -73.75658263305321
        Best parameters: {'criterion': 'absolute_error', 'max_depth': 5, 'min_samples_leaf': 2}
        MSE (on outer validation fold) 142.04807692307693
        MAE (on outer validation fold) 9.942307692307692
        R2 (on outer validation fold) -0.3698803087150664

        Best MSE (inner test folds): -89.84397759103642
        Best parameters: {'criterion': 'absolute_error', 'max_depth': 2, 'min_samples_leaf': 4}
        MSE (on outer validation f