# Machine Learning Pipeline
In this notebook, we preprocess the data and train four machine learning models, following Lydia Chougar's pipeline.

[1. Convert CSV to DataFrame](#data)

[2. Normalize data](#normalize)

[3. Define models](#models)

[4. Training models](#training)

[5. Results](#results)

### Imports

In [1]:
import pandas as pd
import numpy as np
import glob, sys, os, json, utils

In [2]:
def get_data(csvFileName: str, ROI: list):
    '''
    Sanitize the volume data and return a DataFrame with "subjectId" first,
    the ROI volumes as features (X), and the class [NC, PD] as label (y).
    @csvFileName: input volumes csv
    @ROI: regions of interest to keep
    '''
    df = pd.read_csv(csvFileName)
    df = utils.remove_unwanted_columns(df, ROI)
    df = utils.combine_left_right_vol(df)

    # Move subjectId to the first column
    cols = list(df.columns.values)
    cols.pop(cols.index("subjectId"))
    df = df[["subjectId"]+cols]

    return df
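
The sanitized table has a single column per region (for example `Putamen`) instead of separate `Left-`/`Right-` columns. The body of `utils.combine_left_right_vol` is not shown in this notebook, so the following is only a minimal sketch under the assumption that the helper sums each left/right pair. The function name `combine_left_right` and the toy values are hypothetical.

```python
import pandas as pd

def combine_left_right(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for utils.combine_left_right_vol (assumes a sum)."""
    out = df.copy()
    left_cols = [c for c in out.columns if c.startswith("Left-")]
    for left in left_cols:
        region = left[len("Left-"):]
        right = f"Right-{region}"
        if right in out.columns:
            # Assumption: the combined volume is the sum of both hemispheres
            out[region] = out[left] + out[right]
            out = out.drop(columns=[left, right])
    return out

# Toy data, not taken from the real dataset
toy = pd.DataFrame({
    "subjectId": [1, 2],
    "Left-Putamen": [4000.0, 4200.0],
    "Right-Putamen": [3900.0, 4100.0],
    "Pons": [15000.0, 16000.0],
    "group": [0, 1],
})
combined = combine_left_right(toy)

# Move subjectId to the front, as get_data does
cols = [c for c in combined.columns if c != "subjectId"]
combined = combined[["subjectId"] + cols]
print(combined)
```

Regions without a left/right pair, such as `Pons` here, pass through unchanged.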

In [3]:
ROI = [
      "subjectId",
      "Left-Putamen", "Right-Putamen", 
      "Right-Caudate", "Left-Caudate", 
      "Right-Thalamus-Proper", "Left-Thalamus-Proper", 
      "Left-Pallidum", "Right-Pallidum", 
      "Left-Cerebellum-White-Matter", "Right-Cerebellum-White-Matter", 
      "Left-Cerebellum-Cortex", "Right-Cerebellum-Cortex",
      "3rd-Ventricle", 
      "4th-Ventricle",
      "Pons",
      "SCP",
      "Midbrain",
      "Insula",
      "Precentral Cortex",
      "group"
]
df = get_data("../data/volume-data/freeSurferVolumes.csv", ROI)
df.to_csv("../data/volume-data/sanitizedVolumes.csv")

In [4]:
df

Unnamed: 0,subjectId,Pallidum,Putamen,Caudate,Thalamus-Proper,Cerebellum-Cortex,Cerebellum-White-Matter,3rd-Ventricle,4th-Ventricle,Pons,SCP,Midbrain,Insula,Precentral Cortex,group
0,3653,3371.2,7921.0,5937.7,11535.6,90705.2,24389.7,1359.7,1454.6,14782.351255,223.956388,5507.418376,11515,19803,0
1,3808,4107.5,8435.7,5795.8,14379.8,122310.6,34621.1,1988.2,2078.1,19281.923730,414.410048,7309.588981,15123,26678,0
2,4077,3492.6,9680.6,6077.7,13397.6,96912.0,25768.1,862.6,2061.4,15605.023062,224.569924,6060.309340,12291,27036,0
3,3838,3886.5,12452.7,7232.7,17360.1,98418.5,27164.3,1330.8,1764.2,15790.319743,309.250752,7002.026970,17047,35783,0
4,3068,3530.0,8921.6,8129.6,13607.1,109813.5,24579.1,2866.7,1977.7,16211.922345,284.042073,7016.588678,14146,24621,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
139,3557,4509.9,9951.4,7989.0,15730.2,113746.5,32995.4,1552.7,2013.0,17491.608770,291.027110,7011.172709,16289,25717,1
140,3167,3692.6,8999.2,6961.5,14749.3,110731.9,31584.1,790.2,1776.1,15433.882176,354.128580,5957.002062,12428,26644,1
141,115448,3641.8,9452.2,6374.1,13039.1,106682.4,24593.9,1161.7,1548.9,12596.867924,301.529898,5377.383783,15479,25223,1
142,3352,4571.1,9293.2,6747.1,16004.5,115332.4,33812.4,1093.2,1585.3,18207.561580,320.325328,7014.792288,13351,26733,1


# Normalize

In this section, the "Normalization 1" and "Normalization 2" techniques are implemented.

Normalization 1:

$$\dfrac{\text{Variable} - \text{mean of PD and NC in the training cohort}}{\sigma \text{ of PD and NC in the training cohort}}$$

Normalization 2:

$$\dfrac{\text{Variable} - \text{mean of stable+progr scanned using the same scanner}}{\sigma \text{ of stable+progr scanned using the same scanner}}$$
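
As a minimal sketch of Normalization 1, the mean and standard deviation are computed on the training fold only and then reused on the test fold. The data and column values below are synthetic and purely illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
columns = ["Putamen", "Caudate", "Pons"]

# Synthetic volumes standing in for a training and a test fold
train = pd.DataFrame(rng.normal(5000, 500, size=(20, 3)), columns=columns)
test = pd.DataFrame(rng.normal(5000, 500, size=(5, 3)), columns=columns)

# Statistics come from the training fold only
mean_train = train.mean(axis=0)
std_train = train.std(axis=0)  # pandas default: sample standard deviation (ddof=1)

train_z = (train - mean_train) / std_train
test_z = (test - mean_train) / std_train  # reuse training statistics

print(train_z.mean().round(6))
print(train_z.std().round(6))
```

After this step each training column has zero mean and unit standard deviation. The test columns generally do not, because they are scaled with the training statistics.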


In [12]:
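def normalize1(df, mean, std):
    '''
    Normalization 1: z-score df column-wise.
    Pass mean=None and std=None to compute the statistics from df (training fold);
    pass the training statistics to normalize a test fold.
    '''
    if mean is None and std is None:
        # Training fold: compute the statistics and return them for reuse
        mean = df.mean(axis=0)
        std = df.std(axis=0)
        normalizedDf = (df - mean)/std
        return normalizedDf.values, mean, std

    # Test fold: reuse the statistics computed on the training fold
    normalizedDf = (df - mean)/std
    return normalizedDf.values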

In [65]:
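def normalize2(df):
    '''
    Normalization 2: z-score each subject with the statistics of its scanner.
    Returns the feature columns only (no "subjectId" or "group").
    '''
    df_no_id = df.drop(columns=["subjectId", "group"])
    # Attach the scanner metadata to each subject
    metadata_df = utils.parse_metadata()
    merged_df = pd.merge(df, metadata_df, on=["subjectId"], how="left")

    # Per-scanner mean and standard deviation of the feature columns
    stats = {}
    for scanner in merged_df["scannerType"].dropna().unique():
        mean, std = utils.get_mean_and_stats(merged_df.drop(columns="subjectId"), scanner, df_no_id.shape[1])
        stats[scanner] = {
            "mean": mean.to_dict(),
            "std": std.to_dict()
        }

    # Normalize each row with its scanner's statistics.
    # Assumes df has a default RangeIndex and every subject has a scannerType.
    for index in merged_df.index:
        rowInfo = merged_df.iloc[index]
        scanner = rowInfo["scannerType"]
        mean = list(stats[scanner]["mean"].values())
        std = list(stats[scanner]["std"].values())
        df_no_id.iloc[index] = (df_no_id.iloc[index]-mean)/std

    return df_no_id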
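
The per-scanner idea behind Normalization 2 can also be sketched with a pandas `groupby`. This is an illustration only: it computes the statistics over all rows of each scanner, whereas the notebook delegates that choice to `utils.get_mean_and_stats`, whose body is not shown. The toy values are hypothetical.

```python
import pandas as pd

# Toy data: two scanners, three subjects each
toy = pd.DataFrame({
    "scannerType": ["A", "A", "A", "B", "B", "B"],
    "Putamen": [8000.0, 9000.0, 10000.0, 7000.0, 7500.0, 8000.0],
    "Pons": [15000.0, 16000.0, 17000.0, 12000.0, 14000.0, 16000.0],
})
features = ["Putamen", "Pons"]

# transform() returns per-row group statistics aligned with the original index
grouped = toy.groupby("scannerType")[features]
z = (toy[features] - grouped.transform("mean")) / grouped.transform("std")
print(z)
```

Each value is now expressed relative to the other subjects scanned on the same machine, which removes scanner-specific offsets and scale differences.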

In [5]:
df

Unnamed: 0,subjectId,Pallidum,Putamen,Caudate,Thalamus-Proper,Cerebellum-Cortex,Cerebellum-White-Matter,3rd-Ventricle,4th-Ventricle,Pons,SCP,Midbrain,Insula,Precentral Cortex,group
0,3653,3371.2,7921.0,5937.7,11535.6,90705.2,24389.7,1359.7,1454.6,14782.351255,223.956388,5507.418376,11515,19803,0
1,3808,4107.5,8435.7,5795.8,14379.8,122310.6,34621.1,1988.2,2078.1,19281.923730,414.410048,7309.588981,15123,26678,0
2,4077,3492.6,9680.6,6077.7,13397.6,96912.0,25768.1,862.6,2061.4,15605.023062,224.569924,6060.309340,12291,27036,0
3,3838,3886.5,12452.7,7232.7,17360.1,98418.5,27164.3,1330.8,1764.2,15790.319743,309.250752,7002.026970,17047,35783,0
4,3068,3530.0,8921.6,8129.6,13607.1,109813.5,24579.1,2866.7,1977.7,16211.922345,284.042073,7016.588678,14146,24621,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
139,3557,4509.9,9951.4,7989.0,15730.2,113746.5,32995.4,1552.7,2013.0,17491.608770,291.027110,7011.172709,16289,25717,1
140,3167,3692.6,8999.2,6961.5,14749.3,110731.9,31584.1,790.2,1776.1,15433.882176,354.128580,5957.002062,12428,26644,1
141,115448,3641.8,9452.2,6374.1,13039.1,106682.4,24593.9,1161.7,1548.9,12596.867924,301.529898,5377.383783,15479,25223,1
142,3352,4571.1,9293.2,6747.1,16004.5,115332.4,33812.4,1093.2,1585.3,18207.561580,320.325328,7014.792288,13351,26733,1


# Models

In this section, we define four models: logistic regression, SVM with a linear kernel, SVM with a radial basis function kernel, and a random forest. As per the paper:

_Using the scikit-learn package, four supervised machine learning algorithms were used: logistic regression, support vector machine (SVM) with a linear kernel, SVM with a radial basis function kernel, and random forest_ (Chougar et al.)

Additionally, we implement a nested, stratified cross-validation procedure: the outer loop evaluates classification performance and the inner loop tunes the hyperparameters. As per the paper:

_The cross-validation procedure on the training cohort included two nested loops: an outer loop with repeated stratified random splits with 50 repetitions evaluating the classification performances and an inner loop with 5 fold cross-validation used to optimize the hyperparameters of the algorithms_ (Chougar et al.)
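
The nested procedure described in the quote can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's training code: the 20% test size, the parameter grid, and the balanced-accuracy metric are assumptions, and only 5 outer repetitions are used (the paper uses 50) to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Synthetic binary classification data
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

# Outer loop: repeated stratified random splits
outer = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
param_grid = {"C": [0.01, 0.1, 1, 10]}  # hypothetical grid

scores = []
for train_idx, test_idx in outer.split(X, y):
    # Inner loop: 5-fold cross-validation to pick the hyperparameters
    inner = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
    inner.fit(X[train_idx], y[train_idx])
    # Evaluate the refit best model on the held-out outer split
    y_pred = inner.predict(X[test_idx])
    scores.append(balanced_accuracy_score(y[test_idx], y_pred))

print(f"Mean balanced accuracy: {sum(scores) / len(scores):.3f}")
```

Because the hyperparameters are chosen using only the outer training split, the outer test score is not biased by the tuning step.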

### Imports

In [14]:
# Models
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Utils
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import train_test_split

# Parallel job
from joblib import Parallel, delayed

# Plotting
import matplotlib.pyplot as plt


### Parallel model

Since our cross-validation loop produces 250 folds per model, training takes a long time. The training loop is therefore written so that the folds run in parallel.

It is recommended that you run the following code from your terminal:
```
conda activate research # Check the README for the correct conda environment
cd ml/
python run.py
```
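
The fold-level parallelism can be sketched with `joblib` as follows. This is an illustration on synthetic data: `fit_fold` is a hypothetical stand-in for the notebook's `train` function, and the thread-based backend with `n_jobs=2` is an assumption made to keep the sketch portable (the notebook uses joblib's default backend with `n_jobs=-1`).

```python
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

def fit_fold(iteration, train_idx, test_idx, X, y):
    """Fit one fold and return its test accuracy (hypothetical helper)."""
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    return {"iteration": iteration, "accuracy": model.score(X[test_idx], y[test_idx])}

X, y = make_classification(n_samples=100, n_features=6, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

# Each fold is an independent task; results come back in submission order
results = Parallel(n_jobs=2, prefer="threads")(
    delayed(fit_fold)(i, train_idx, test_idx, X, y)
    for i, (train_idx, test_idx) in enumerate(cv.split(X, y))
)
print(len(results))
```

Folds share no state, so they can be dispatched to workers independently and collected into a single list afterwards.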

In [8]:
def train(clf, train_index, test_index, X, y, normalize, columns, modelType, reportKey, iteration):
    print(f"=================Iteration #{iteration}=================")
    performanceDict = {}
        
    # Get fold data train/test sets
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    print(f'Shape of train set: {X_train.shape}')
    print(f'Shape of test set: {X_test.shape}')
    
    # Normalize model data
    print("Normalizing data...")
    if normalize.__name__ == "normalize1":
        # The label column is named "group" in the sanitized data
        trainDf = pd.DataFrame(X_train, columns=columns).drop(columns=["subjectId", "group"])
        testDf = pd.DataFrame(X_test, columns=columns).drop(columns=["subjectId", "group"])
        X_train_normalized, mean_train, std_train = normalize(trainDf, None, None)
        X_test_normalized = normalize(testDf, mean_train, std_train)

    elif normalize.__name__ == "normalize2":
        trainDf = pd.DataFrame(X_train, columns=columns)
        testDf = pd.DataFrame(X_test, columns=columns)
        X_train_normalized = normalize(trainDf)
        X_test_normalized = normalize(testDf)

    else:
        raise ValueError(f"Unknown normalization: {normalize.__name__}")
        
    print("Done normalizing data")
        
    print(f"Fitting {modelType} model #{iteration}...")
    model = clf.fit(X_train_normalized, y_train)
    print("Done fitting model")
    
    print(f"Computing results metrics for {modelType} model #{iteration}...")
    performanceDict = utils.performance_report(model, modelType, reportKey, iteration, X_train_normalized, X_test_normalized, y_train, y_test)
    print("Done computing results metrics\n")

    return performanceDict

def parallel_model(df, modelType, reportKey, normalize, paramGrid, dataFile, ROI, heuristic=None):
    print(f"\n======================Running {modelType} with the following parameters======================\nNormalization: {normalize.__name__}\nParam Grid: {paramGrid}\nData: {dataFile}\nROI: {ROI}")

    performance = []
    os.makedirs(modelType, exist_ok=True)

    X = df.values
    y = utils.convert_Y(X[:, -1])
    columns = df.columns
    
    # Setup CV (as configured here: 2 splits x 3 repeats = 6 outer folds)
    cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=3, random_state=42)

    # Define model type
    # GridSearchCV uses 5-fold cross-validation by default (the inner loop)
    if modelType == "SVM":
        clf = GridSearchCV(SVC(random_state=0), paramGrid)
    elif modelType == "RF":
        clf = GridSearchCV(RandomForestClassifier(random_state=0, n_jobs = -1), paramGrid)
    elif modelType == "LR":
        clf = GridSearchCV(LogisticRegression(random_state=0), paramGrid)
    else:
        raise ValueError(f"Unknown model type: {modelType}")
    
    # Run every outer fold as an independent parallel task
    output = Parallel(n_jobs=-1)(
        delayed(train)(clf, train_index, test_index, X, y, normalize, columns, modelType, reportKey, iteration)
        for iteration, (train_index, test_index) in enumerate(cv.split(X, y))
    )

    performance.append(output)

    with open(f"{modelType}/{reportKey}_report.json", 'w', encoding='utf-8') as f:
        json.dump(performance, f, ensure_ascii=False, indent=4)
        
    return performance