# Machine Learning pipeline

In this notebook, we go through the machine learning pipeline to reproduce Lydia Chougar's paper. The following sections will be covered:

1 - Convert CSV to DataFrame

2 - Normalize

3 - Train and predict models

4 - Cross Validation

5 - Results 

### Imports

In [1]:
import pandas as pd
import numpy as np
import glob, utils, sys
from bs4 import BeautifulSoup as bs

# 1. Convert CSV to DataFrame

Reads the volumes CSV into a DataFrame, keeps only the requested regions of interest, and optionally applies a heuristic selected by key:
- "combine": sums each region's Left and Right volumes into one column

In [2]:
def get_data(csvFileName: str, ROI: list, heuristic=None, getDf=False):
    '''
    Sanitize the data and build X (the ROI volumes) and y (the class, NC or PD).
    @csvFileName: input volumes csv
    @ROI: regions of interest to keep
    @heuristic: function key (e.g. "combine")
    @getDf: if True, return the DataFrame (including subjectId) instead of X, y
    '''
    df = pd.read_csv(csvFileName)
    df = utils.remove_unwanted_columns(df, ROI)

    if heuristic == "combine":
        df = utils.combine_left_right_vol(df)

    if getDf:
        return df

    # Use the keyword form: passing the axis positionally is deprecated in pandas
    df = df.drop(columns="subjectId")

    # The last column holds the class label, the others hold the volumes
    arr = df.values
    X = arr[:, :-1]
    y = utils.convert_Y(arr[:, -1])

    return X, y

Test *get_data()* function

In [3]:
ROI = [
      "subjectId", "class",
      "Left-Putamen", "Right-Putamen", 
      "Right-Caudate", "Left-Caudate", 
      "Right-Thalamus-Proper", "Left-Thalamus-Proper", 
      "Left-Pallidum", "Right-Pallidum", 
      "Left-Cerebellum-Cortex", "Right-Cerebellum-Cortex", "lhCortexVol", "rhCortexVol", "CortexVol",
      "Left-Cerebellum-White-Matter", "Right-Cerebellum-White-Matter",
      "CerebralWhiteMatterVol", 
      "3rd-Ventricle", "4th-Ventricle"
   ]
X, y = get_data("volumes.csv", ROI, "combine")
df = get_data("volumes.csv", ROI, "combine", getDf=True)
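The `utils` module is not shown in this notebook. As a minimal sketch of what the "combine" heuristic could look like, assuming columns are named `Left-<region>` and `Right-<region>` as in the ROI list, the hypothetical helper below sums each pair into a single column. The name `combine_left_right_sketch` and the toy values are illustrative assumptions, not the actual `utils.combine_left_right_vol`, and columns such as `lhCortexVol` and `rhCortexVol` would need their own rule.

```python
import pandas as pd

def combine_left_right_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: sum each Left-/Right- column pair into one column."""
    out = df.copy()
    for left in [c for c in df.columns if c.startswith("Left-")]:
        region = left[len("Left-"):]
        right = f"Right-{region}"
        if right not in out.columns:
            continue  # no matching right-hand column, leave it untouched
        # Store the sum in the left column, then drop the right one and rename
        out[left] = out[left] + out[right]
        out = out.drop(columns=right).rename(columns={left: region})
    return out

# Toy values, assumed for illustration only
toy = pd.DataFrame({
    "subjectId": [1, 2],
    "Left-Putamen": [4.0, 5.0],
    "Right-Putamen": [4.5, 5.5],
    "class": ["NC", "PD"],
})
combined = combine_left_right_sketch(toy)
print(combined)
```

Writing the sum into the position of the left column keeps `class` as the last column, which `get_data()` relies on when it slices `arr[:, -1]`.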


# 2. [Normalize](#normal)

In this section, we implement the "Normalization 1" and "Normalization 2" techniques.

Normalization 1:

$$\dfrac{\text{Variable} - \text{mean of PD and NC in the training cohort}}{\sigma \text{ of PD and NC in the training cohort}}$$

Normalization 2:

$$\dfrac{\text{Variable} - \text{mean of controls scanned using the same scanner}}{\sigma \text{ of controls scanned using the same scanner}}$$
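To make Normalization 2 concrete, here is a minimal sketch on toy data, assuming a `scannerType` column and a `class` column with the labels NC and PD. The values and the single `Putamen` feature are made up for illustration; the implementation in this notebook relies on `utils` helpers that are not shown, so this is one possible reading of the formula rather than the code used for the results.

```python
import pandas as pd

# Toy cohort: two scanners, two controls (NC) and one patient (PD) each
toy = pd.DataFrame({
    "scannerType": ["A", "A", "A", "B", "B", "B"],
    "class": ["NC", "NC", "PD", "NC", "NC", "PD"],
    "Putamen": [10.0, 12.0, 9.0, 20.0, 24.0, 18.0],
})
features = ["Putamen"]

# Mean and standard deviation of the controls, computed per scanner
controls = toy[toy["class"] == "NC"]
ctrl_mean = controls.groupby("scannerType")[features].mean()
ctrl_std = controls.groupby("scannerType")[features].std()

# Look up, for every subject, the statistics of its own scanner
mean_rows = ctrl_mean.loc[toy["scannerType"]].to_numpy()
std_rows = ctrl_std.loc[toy["scannerType"]].to_numpy()

normalized = (toy[features] - mean_rows) / std_rows
print(normalized)
```

Both PD subjects end up about 1.41 control standard deviations below the control mean of their own scanner, even though their raw volumes (9 and 18) are on very different scales.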


In [4]:
def normalize1(data, mean, std):
    '''
    Normalization 1: z-score the data with the training cohort statistics.
    Pass mean=None and std=None to compute the statistics from `data` (training set);
    pass the training statistics to normalize a test set.
    '''
    df = pd.DataFrame(data=data)

    if mean is None and std is None:
        mean = df.mean(axis=0)
        # Use the data passed in, not the global testDf
        std = df.std(axis=0)
        normalizedDf = (df - mean)/std
        return normalizedDf, mean, std

    normalizedDf = (df - mean)/std
    return normalizedDf

In [5]:
# Testing normalization1
trainDf = pd.DataFrame(np.array([[1, 2, 3], [3, 4, 7]]),columns=['a', 'b', 'c'])
testDf = pd.DataFrame(np.array([[2, 6, 4], [3, 7, 9]]),columns=['a', 'b', 'c'])
normTrainDf, trainMean, trainStd = normalize1(trainDf, None, None)
normTestDf = normalize1(testDf, trainMean, trainStd)
normTestDf.values

array([[ 0.        ,  2.12132034, -0.35355339],
       [ 0.70710678,  2.82842712,  1.41421356]])

In [6]:
def normalize2(data, ROI):
    '''
    Normalization 2: z-score each subject with the statistics of its scanner.
    '''
    df = pd.DataFrame(data=data)
    df_with_subjects = get_data("volumes.csv", ROI, "combine", getDf=True)
    metadata_df = utils.parse_metadata()
    # pd.merge performs an inner join by default, so merged_df can have fewer
    # rows than df, leaving trailing rows of df unnormalized
    merged_df = pd.merge(df_with_subjects, metadata_df, on=["subjectId"])

    # Compute the statistics once per scanner type
    stats = {}
    for scanner in merged_df["scannerType"].unique():
        mean, std = utils.get_mean_and_stats(merged_df.drop(columns="subjectId"), scanner)
        stats[scanner] = {
            "mean": mean.to_dict(),
            "std": std.to_dict()
        }

    # Normalize each row with the statistics of the scanner it was acquired on
    for index in merged_df.index:
        rowInfo = merged_df.iloc[index]
        scanner = rowInfo["scannerType"]
        mean = list(stats[scanner]["mean"].values())
        std = list(stats[scanner]["std"].values())
        df.iloc[index] = (df.iloc[index]-mean)/std

    return df

normalize2DF = normalize2(X, ROI)
print(normalize2DF)

           0         1         2         3         4         5         6   \
0   -1.254627  0.027858 -0.489725 -0.341841 -0.731373 -1.033358 -0.373735   
1    0.301883  2.070251  1.624695  0.607634 -0.196809   0.36249  1.738754   
2    0.735996  0.456097  0.450533  0.244247  0.843845  1.022511  0.947571   
3     0.00614  1.704153  0.818538 -0.533545 -0.242789 -0.858736 -0.721767   
4   -0.562185  0.003687 -1.088977 -0.984206 -0.765684 -0.124842  0.707845   
..        ...       ...       ...       ...       ...       ...       ...   
210  2.284984  0.610996 -0.301629  0.020009  2.710298 -0.453807 -0.530935   
211   0.95038  1.679982  0.072571  0.006177  0.756519 -0.563408  0.042006   
212  0.874365  -0.25656 -0.246639  -0.31144 -0.006132  0.662485  1.041801   
213  0.600595  1.061291  2.132112  1.529245  -0.15697 -0.271943  0.639098   
214    4655.8    8916.2    6840.7  102689.1   31238.7    1666.8    1258.1   

                7              8             9         10  
0        -0.591



# 3. [Train and predict models](#predict)

In this section, we define four models: logistic regression, SVM with a linear kernel, SVM with a radial kernel, and random forest. As per the paper:

_Using the scikit-learn package, four supervised
machine learning algorithms were used: logistic regression, support vector machine (SVM) with a linear kernel, SVM with a radial basis function kernel, and
random forest_ (Chougar et al.)

Additionally, we will implement a stratified cross validation loop for hyperparameter tuning. As per the paper:

_The cross-validation procedure on the training cohort included two nested loops: an outer loop with repeated stratified random splits with 50 repetitions evaluating the classification performances and an inner loop with 5 fold cross-validation used to optimize the hyperparameters of the algorithms_ (Chougar et al.)
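The two nested loops described in the quote can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's code: the 70/30 split, the reduced number of outer repetitions (5 instead of 50, to keep the example quick), the linear SVM and the grid of `C` values are all assumptions. `StandardScaler` stands in for Normalization 1; note that it divides by the population standard deviation, whereas `DataFrame.std()` uses the sample standard deviation by default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the volumes data
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)

# Outer loop: repeated stratified random splits (the paper uses 50 repetitions)
outer = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=0)

# Scaling lives inside the pipeline so it is refit on each training fold
pipeline = make_pipeline(StandardScaler(), SVC(kernel="linear"))
grid = {"svc__C": [1.0, 10.0, 100.0]}

outer_scores = []
for train_idx, test_idx in outer.split(X_demo, y_demo):
    # Inner loop: 5-fold cross-validation to pick the hyperparameters
    search = GridSearchCV(pipeline, grid, cv=5)
    search.fit(X_demo[train_idx], y_demo[train_idx])
    # Evaluate the tuned model on the held-out part of the outer split
    outer_scores.append(search.score(X_demo[test_idx], y_demo[test_idx]))

print(f"accuracy: {np.mean(outer_scores):.3f} +/- {np.std(outer_scores):.3f}")
```

Keeping the scaler inside the pipeline means the inner validation folds never contribute to the normalization statistics, which avoids an optimistic bias in the tuning step.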

### Imports

In [26]:
# Models
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Utils
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import PredefinedSplit
from sklearn.model_selection import RepeatedStratifiedKFold

## Utilities


In [27]:
def split_data(X, y, training_split):
    '''
    Split the data into training and testing sets.
    @training_split: fraction of the data, between 0 and 1, used for training
    '''
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = training_split, random_state = 42)
    return X_train, X_test, y_train, y_test

def get_model_score(model, X_train, y_train, X_test, y_test):
    '''Print and return the training and testing accuracy of a fitted model.'''
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f'training score: {round(train_acc, 3)}')
    print(f'testing score: {round(test_acc, 3)}')
    return train_acc, test_acc

def model(X, y, modelType, dataSplit, normalize, paramGrid):
    # Define training and test sets
    X_train, X_test, y_train, y_test = split_data(X, y, dataSplit)

    # Setup CV: 5 stratified folds, repeated 50 times
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=50)

    # Define model type
    if modelType == "SVM":
        clf = GridSearchCV(SVC(random_state=0), paramGrid, cv=cv)
    elif modelType == "RF":
        clf = GridSearchCV(RandomForestClassifier(random_state=0), paramGrid, cv=cv)
    elif modelType == "LR":
        clf = GridSearchCV(LogisticRegression(random_state=0), paramGrid, cv=cv)
    else:
        raise ValueError(f'Unknown model type: {modelType}')

    # Normalize model data with the training set statistics
    if normalize.__name__ == "normalize1":
        X_grid_normalized, mean_train, std_train = normalize(X_train, None, None)
        X_test_normalized = normalize(X_test, mean_train, std_train)
    else:
        raise NotImplementedError(f'{normalize.__name__} is not supported here')

    # Fit and score (the fitted search is named `fitted` to avoid shadowing model())
    fitted = clf.fit(X_grid_normalized, y_train)
    train_acc, test_acc = get_model_score(fitted, X_grid_normalized, y_train, X_test_normalized, y_test)
    print(f'Best model params: {fitted.best_params_}')
    return fitted, train_acc, test_acc
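The paper describes stratified random splits, while `train_test_split` only preserves the class proportions when `stratify` is passed. The sketch below shows the effect on a toy label vector with an assumed 80/20 class imbalance; the data is made up and only serves to illustrate the parameter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 80 subjects of class 0 and 20 of class 1 (assumed for illustration)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 80/20 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, stratify=y_toy, random_state=42
)

print(f"class 1 share in train: {y_tr.mean():.2f}")
print(f"class 1 share in test:  {y_te.mean():.2f}")
```

Without `stratify`, the share of each class in the test set depends on the random draw, which matters most for small cohorts.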

### SVM

In [28]:
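param_grid = {
    'C': [1.0, 10.0, 100.0, 1000.0],
    'gamma': [0.01, 0.10, 1.00, 10.00]  # ignored by the linear kernel
}
for kernelType in ["linear", "rbf"]:
    # GridSearchCV expects every grid value to be a list, not a bare string
    param_grid["kernel"] = [kernelType]
#     model(X, y, "SVM", 0.7, normalize1, param_grid)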
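Since `gamma` has no effect on a linear kernel, searching over it there only repeats identical fits. One alternative, sketched here as a suggestion rather than as the notebook's method, is to pass `GridSearchCV` a list of grids, one per kernel. `ParameterGrid` is used below only to count the resulting candidates; the name `svm_grid` is illustrative.

```python
from sklearn.model_selection import ParameterGrid

# One grid per kernel: gamma is only searched for the rbf kernel
svm_grid = [
    {"kernel": ["linear"], "C": [1.0, 10.0, 100.0, 1000.0]},
    {"kernel": ["rbf"], "C": [1.0, 10.0, 100.0, 1000.0],
     "gamma": [0.01, 0.10, 1.00, 10.00]},
]

n_candidates = len(ParameterGrid(svm_grid))
print(n_candidates)  # 4 linear + 16 rbf = 20
```

A single search over this list compares both kernels at once, whereas the loop above tunes each kernel separately, which is what is needed when the two SVMs are reported as distinct models.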

### Logistic Regression

In [29]:
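# The default lbfgs solver does not support the l1 or elasticnet penalties; saga supports all three.
# elasticnet also requires l1_ratio (0.5 is an assumed placeholder, not a value from the paper).
param_grid = [
    {'penalty': ["l1", "l2"], 'solver': ["saga"], 'C': [1.0, 10.0, 100.0, 1000.0]},
    {'penalty': ["elasticnet"], 'solver': ["saga"], 'l1_ratio': [0.5], 'C': [1.0, 10.0, 100.0, 1000.0]}
]
# model(X, y, "LR", 0.7, normalize1, param_grid)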

### Random forest

In [30]:
param_grid = {
    'n_estimators': [100, 500, 1000],
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [2, 4, 5, 10, 13],
    'min_samples_leaf': [1, 2, 5, 8, 13]
}
# model(X, y, "RF", 0.7, normalize1, param_grid)
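Before running a search, it can help to estimate its cost. The sketch below counts the fits that `GridSearchCV` would perform for the random forest grid with the cross-validation setup used in `model()`; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid, RepeatedStratifiedKFold

rf_grid = {
    'n_estimators': [100, 500, 1000],
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [2, 4, 5, 10, 13],
    'min_samples_leaf': [1, 2, 5, 8, 13],
}
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=50)

n_candidates = len(ParameterGrid(rf_grid))  # 3 * 2 * 5 * 5
n_splits = cv.get_n_splits()                # 5 folds * 50 repeats
total_fits = n_candidates * n_splits

print(n_candidates, n_splits, total_fits)   # 150 250 37500
```

That is 37,500 forest fits, plus one final refit on the full training set, so reducing `n_repeats` or the grid is worth considering when iterating.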