This notebook performs a grid search to tune the hyperparameters of our model. We split the dataset into train and tune subsets, train each candidate configuration numIts times, and average the results, looking for the highest average number of true positives.

# Load Libraries

In the following block of code we import the required libraries and set the seed for the NumPy RandomState.

In [6]:
import pickle
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from tqdm import tqdm


rng = np.random.RandomState(20210414)
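
Because the same `RandomState` instance is reused throughout the notebook, each call that consumes it advances its internal state. A minimal sketch of that behaviour, using only NumPy (the names `rng_a`, `rng_b`, and the draw sizes are illustrative and not part of the notebook):

```python
import numpy as np

# Two generators created with the same seed start from identical states.
rng_a = np.random.RandomState(20210414)
rng_b = np.random.RandomState(20210414)

first_a = rng_a.randint(0, 1000, size=5)
first_b = rng_b.randint(0, 1000, size=5)

# A second draw from rng_a continues the stream instead of restarting it.
second_a = rng_a.randint(0, 1000, size=5)

same_seed_same_draw = np.array_equal(first_a, first_b)
stream_advances = not np.array_equal(first_a, second_a)
print(same_seed_same_draw, stream_advances)
```

In practice this means the cells should be run in order from a fresh `rng` to reproduce a given split, and it is presumably why repeating a fit numIts times with `random_state = rng` yields a different forest on each repetition.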

# Load Data Set and Create Train/Tune

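In the following block of code we load SWC and then split it into train/tune for the grid search. We first sample 80% of the data as train; 80% of train then becomes subTrain, and the remaining 20% of train becomes tune.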

In [7]:
with open("../FeatureExtraction/DataSets/SWCFeatures/SWCFeat.p", "rb") as f:
    SWCAll = pickle.load(f)

train = SWCAll.sample(frac=0.80, random_state=rng)
subTrain = train.sample(frac=0.80, random_state=rng)
tuneMask = pd.Series(True, index=train.index)
tuneMask[subTrain.index] = False
tune = train[tuneMask].copy()

feat = subTrain.columns.tolist()
feat.remove('class')
featCols = feat
outCol = 'class'

subTrainX = subTrain[featCols]
subTrainY = subTrain[outCol]

tuneX = tune[featCols]
tuneY = tune[outCol]
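
To make the resulting proportions concrete, here is a small sketch of the same nested sampling on a toy DataFrame. The toy data, the seed, and the name `heldOut` (for the 20% that the notebook leaves unnamed) are assumptions made for illustration; `train.drop(subTrain.index)` is used in place of the boolean mask and should select the same rows when the index is unique.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # arbitrary seed for the illustration

# Toy stand-in for SWCAll: 1000 rows, one feature, a binary 'class' column.
toy = pd.DataFrame({
    "f0": rng.normal(size=1000),
    "class": rng.randint(0, 2, size=1000),
})

train = toy.sample(frac=0.80, random_state=rng)       # 80% of all rows
subTrain = train.sample(frac=0.80, random_state=rng)  # 80% of train (64% overall)
tune = train.drop(subTrain.index)                     # rest of train (16% overall)
heldOut = toy.drop(train.index)                       # 20% never touched during tuning

print(len(subTrain), len(tune), len(heldOut))
```

The three subsets are disjoint, so models fitted on subTrain are scored on rows they have not seen.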

# Perform Gridsearch

In the following block of code we perform the grid search, storing the parameters of the model with the highest average true positive count (TP) in bestParam. If two models tie on TP, we compare their average true negative counts (TN) and keep the parameters of the one with the higher TN. The separation into two major loops distinguishes between class_weight 'balanced' (first loop) and the default of None (second loop); we encountered difficulty using classWeightParam to pass these values, so they are hard-coded.

In [3]:
bestTP = 0
bestTN = 0
numIts = 5

numEstimatorParam = [50, 100, 150, 200, 250, 300, 350, 400, 450, 500]
bootStrapParam = [True, False]
criterionParam = ['gini', 'entropy']
classWeightParam = [None, 'balanced']  # only used to size the progress bar; class_weight is hard-coded in the loops below
scalers = [StandardScaler(), '']

bestParam = {
  "numEstimators": 0,
  "bootStrap": False,
  "criterion": "",
  "classWeight" : "",
  "scaler" : ""
}

length = (len(numEstimatorParam)*len(bootStrapParam)*len(criterionParam)*len(classWeightParam)*len(scalers))

with tqdm(total = length) as pbar:
    for numE in numEstimatorParam:
        for bS in bootStrapParam:
            for c in criterionParam:
                for scale in scalers:
                    
                    tpAvg = 0
                    tnAvg = 0
                    
                    for x in range(numIts):
                       
                        if scale:
                            tunePipe = Pipeline([
                                ('standardize', scale),
                                ('classify', RandomForestClassifier(criterion = c, n_estimators=numE, bootstrap = bS, class_weight = 'balanced', n_jobs=-1, random_state = rng ))
                            ])
                        else:
                            tunePipe = Pipeline([
                                ('classify', RandomForestClassifier(criterion = c, n_estimators=numE, bootstrap = bS, class_weight = 'balanced', n_jobs=-1, random_state = rng )) 
                            ])
                        
                        tunePipe.fit(subTrainX, subTrainY)

                        # Predict once and reuse the result for both metrics
                        tunePred = tunePipe.predict(tuneX)
                        tuneAcc = accuracy_score(tuneY, tunePred)
                        tn, fp, fn, tp = confusion_matrix(tuneY, tunePred).ravel()
                        tpAvg += tp
                        tnAvg += tn
                        
                    tpAvg = tpAvg/numIts
                    tnAvg = tnAvg/numIts

                    
                    if tpAvg > bestTP:
                        
                        bestTP = tpAvg
                        bestTN = tnAvg
                        bestParam["numEstimators"] = numE
                        bestParam["bootStrap"] = bS
                        bestParam["criterion"] = c
                        bestParam["classWeight"] = 'balanced'
                        bestParam["scaler"] = scale
                        print(tpAvg)
                        print(bestParam)
                    
                    if tpAvg == bestTP:
                        
                        if tnAvg >bestTN:
                            
                            bestTP = tpAvg
                            bestTN = tnAvg
                            bestParam["numEstimators"] = numE
                            bestParam["bootStrap"] = bS
                            bestParam["criterion"] = c
                            bestParam["classWeight"] = 'balanced'
                            bestParam["scaler"] = scale

                    pbar.update()
                    
    for numE in numEstimatorParam:
        for bS in bootStrapParam:
            for c in criterionParam:
                for scale in scalers:
                    
                    tpAvg = 0
                    tnAvg = 0
                    
                    for x in range(numIts):
                        
                        if scale:
                            tunePipe = Pipeline([
                                ('standardize', scale),
                                ('classify', RandomForestClassifier(criterion = c, n_estimators=numE, bootstrap = bS, n_jobs=-1, random_state = rng))
                            ])
                        else:
                            tunePipe = Pipeline([
                                 ('classify', RandomForestClassifier(criterion = c, n_estimators=numE, bootstrap = bS, n_jobs=-1, random_state = rng))
                            ])
                        tunePipe.fit(subTrainX, subTrainY)

                        # Predict once and reuse the result for both metrics
                        tunePred = tunePipe.predict(tuneX)
                        tuneAcc = accuracy_score(tuneY, tunePred)
                        tn, fp, fn, tp = confusion_matrix(tuneY, tunePred).ravel()
                        tpAvg += tp
                        tnAvg += tn
                        
                    tpAvg = tpAvg/numIts
                    tnAvg = tnAvg/numIts

                    if tpAvg > bestTP:
                        bestTP = tpAvg
                        bestTN = tnAvg
                        bestParam["numEstimators"] = numE
                        bestParam["bootStrap"] = bS
                        bestParam["criterion"] = c
                        bestParam["classWeight"] = ''
                        bestParam["scaler"] = scale
                        print(tpAvg)
                        print(bestParam)

                    if tpAvg == bestTP:
                        
                        if tnAvg >bestTN:
                            
                            bestTP = tpAvg
                            bestTN = tnAvg
                            bestParam["numEstimators"] = numE
                            bestParam["bootStrap"] = bS
                            bestParam["criterion"] = c
                            bestParam["classWeight"] = ''
                            bestParam["scaler"] = scale

                    pbar.update()

  1%|          | 1/160 [00:05<14:41,  5.54s/it]

922.2
{'numEstimators': 50, 'bootStrap': True, 'criterion': 'gini', 'classWeight': 'balanced', 'scaler': StandardScaler()}


  1%|▏         | 2/160 [00:11<15:06,  5.74s/it]

924.6
{'numEstimators': 50, 'bootStrap': True, 'criterion': 'gini', 'classWeight': 'balanced', 'scaler': ''}


  2%|▏         | 3/160 [00:19<16:39,  6.37s/it]

933.6
{'numEstimators': 50, 'bootStrap': True, 'criterion': 'entropy', 'classWeight': 'balanced', 'scaler': StandardScaler()}


  3%|▎         | 5/160 [00:40<21:56,  8.49s/it]

941.4
{'numEstimators': 50, 'bootStrap': False, 'criterion': 'gini', 'classWeight': 'balanced', 'scaler': StandardScaler()}


  4%|▍         | 7/160 [01:02<24:58,  9.79s/it]

951.8
{'numEstimators': 50, 'bootStrap': False, 'criterion': 'entropy', 'classWeight': 'balanced', 'scaler': StandardScaler()}


  5%|▌         | 8/160 [01:14<26:20, 10.40s/it]

961.2
{'numEstimators': 50, 'bootStrap': False, 'criterion': 'entropy', 'classWeight': 'balanced', 'scaler': ''}


 15%|█▌        | 24/160 [09:13<1:25:45, 37.83s/it]

963.6
{'numEstimators': 150, 'bootStrap': False, 'criterion': 'entropy', 'classWeight': 'balanced', 'scaler': ''}


 85%|████████▌ | 136/160 [2:54:40<57:23, 143.49s/it]  

965.2
{'numEstimators': 350, 'bootStrap': False, 'criterion': 'entropy', 'classWeight': '', 'scaler': ''}


100%|██████████| 160/160 [3:50:54<00:00, 86.59s/it] 
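
The two hard-coded loops above could probably be collapsed into one. A likely cause of the difficulty with classWeightParam is that it held the string `'None'` rather than the object `None`; scikit-learn accepts `None` or `'balanced'` for `class_weight`. The sketch below is not the notebook's code: it runs on synthetic data from `make_classification`, uses a shortened grid so it finishes quickly, and introduces the names `grid`, `num_its`, `best_score`, and `use_scaler` for illustration. It iterates over every combination with `itertools.product` and compares `(mean TP, mean TN)` tuples, which reproduces the "highest TP, ties broken by TN" rule.

```python
from itertools import product

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)  # arbitrary seed for the illustration

# Synthetic stand-in for subTrainX/subTrainY and tuneX/tuneY.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_fit, X_tune, y_fit, y_tune = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

grid = {
    "n_estimators": [10, 20],            # shortened so the sketch runs quickly
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
    "class_weight": [None, "balanced"],  # the object None, not the string 'None'
    "use_scaler": [True, False],
}
num_its = 2

best_score = (-1.0, -1.0)  # (mean TP, mean TN); tuples compare element by element
best_param = None

for n_est, boot, crit, cw, use_scaler in product(*grid.values()):
    tp_total = tn_total = 0
    for _ in range(num_its):
        steps = [("standardize", StandardScaler())] if use_scaler else []
        steps.append(("classify", RandomForestClassifier(
            n_estimators=n_est, bootstrap=boot, criterion=crit,
            class_weight=cw, random_state=rng)))
        pipe = Pipeline(steps).fit(X_fit, y_fit)
        # For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
        tn, fp, fn, tp = confusion_matrix(
            y_tune, pipe.predict(X_tune), labels=[0, 1]).ravel()
        tp_total += tp
        tn_total += tn
    score = (tp_total / num_its, tn_total / num_its)
    if score > best_score:  # higher TP wins; equal TP falls back to TN
        best_score = score
        best_param = {"n_estimators": n_est, "bootstrap": boot, "criterion": crit,
                      "class_weight": cw, "use_scaler": use_scaler}

print(best_score, best_param)
```

Selecting on raw TP counts, as the notebook does, is only meaningful because every configuration is scored on the same tune set; a rate such as recall would be needed to compare across differently sized sets.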


# Remove Tune From Data Sets

In the following block of code we remove the entries used for hyperparameter tuning from SWC, both from the original data set and from the data set merged with the aggregated features, as both will be used in our experiments.

In [8]:
with open("../Data/DataSets/SWC/SWC.p", "rb") as f:
    SWC = pickle.load(f)
toRemove = tune.index.tolist()
SWCFeatNoTune = SWCAll.drop(tune.index)
SWCNoTune = SWC[~SWC['sID'].isin(toRemove)]
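
The two removals work differently: the feature table is filtered by index label, while the original table is filtered by matching its sID column against those labels. The toy example below illustrates both, assuming (as the code above implies) that the index of the feature table holds the same identifiers as sID. All data and names other than `sID` are made up for the illustration.

```python
import pandas as pd

# Toy feature table indexed by a sample ID (standing in for SWCAll).
features = pd.DataFrame({"f0": [0.1, 0.2, 0.3, 0.4]}, index=[10, 11, 12, 13])

# Toy raw table with several rows per sample ID (standing in for SWC).
raw = pd.DataFrame({"sID": [10, 10, 11, 12, 13, 13],
                    "value": [1, 2, 3, 4, 5, 6]})

tune_ids = [11, 13]  # pretend these index labels were used for tuning

features_no_tune = features.drop(tune_ids)     # drop rows by index label
raw_no_tune = raw[~raw["sID"].isin(tune_ids)]  # drop every row whose sID matches

print(features_no_tune.index.tolist())
print(raw_no_tune["sID"].tolist())
```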

# Return Data Sets with Tune Removed

The following block of code saves the data sets modified in the previous block, along with the best parameters found by the grid search.

In [9]:
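# Context managers ensure each file is flushed and closed after writing
with open("Pickles/SWCFeatNoTune.p", "wb") as f:
    pickle.dump(SWCFeatNoTune, f)
with open("Pickles/SWCNoTune.p", "wb") as f:
    pickle.dump(SWCNoTune, f)
with open("Pickles/BestParam.p", "wb") as f:
    pickle.dump(bestParam, f)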
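
Since bestParam stores an empty string for "no class weight" and "no scaler", downstream code has to translate those values before building a model. One possible way to do that is sketched below. `build_pipeline` is a hypothetical helper, not part of the notebook, and the dictionary is the last one printed by the grid search above; the pickled bestParam may differ if a later tie on TP was resolved by TN without being printed.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline


def build_pipeline(best_param, random_state=None):
    """Hypothetical helper: turn the saved dictionary back into a Pipeline."""
    steps = []
    if best_param["scaler"]:  # '' means no scaler was used
        steps.append(("standardize", best_param["scaler"]))
    steps.append(("classify", RandomForestClassifier(
        n_estimators=best_param["numEstimators"],
        bootstrap=best_param["bootStrap"],
        criterion=best_param["criterion"],
        class_weight=best_param["classWeight"] or None,  # '' maps to None
        random_state=random_state,
    )))
    return Pipeline(steps)


# Last dictionary printed by the grid search output above.
best_param = {"numEstimators": 350, "bootStrap": False, "criterion": "entropy",
              "classWeight": "", "scaler": ""}
pipe = build_pipeline(best_param)
print(pipe.named_steps["classify"].get_params()["class_weight"])
```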