# Machine Learning Pipeline
In this notebook, we will develop the machine learning models by preprocessing and training the data in four models as per Lydia Chougar's pipeline.

[1. Convert CSV to DataFrame](#data)

[2. Normalize data](#normalize)

[3. Define models](#models)

[4. Training models](#training)

[5. Results](#results)

### Imports

In [1]:
import pandas as pd
import numpy as np
import glob, sys, os, json, utils

In [2]:
def get_data(csvFileName: str, ROI: []):
    '''
    The following function will sanitize data and build a numpy array with X ROI's volumes and y being the class [NC, PD]
    @csvFileName: input volumes csv
    @ROI: regions of interests desired
    '''
    df = pd.read_csv(csvFileName)
    df = utils.remove_unwanted_columns(df, ROI)
    df = utils.combine_left_right_vol(df)
        
    cols = list(df.columns.values)
    cols.pop(cols.index("subjectId"))
    df = df[["subjectId"]+cols]
    
    return df

In [3]:
ROI = [
      "subjectId", "stage",
      "Left-Putamen", "Right-Putamen", 
      "Right-Caudate", "Left-Caudate", 
      "Right-Thalamus-Proper", "Left-Thalamus-Proper", 
      "Left-Pallidum", "Right-Pallidum", 
      "Left-Cerebellum-White-Matter", "Right-Cerebellum-White-Matter", 
      "Left-Cerebellum-Cortex", "Right-Cerebellum-Cortex",
      "3rd-Ventricle", 
      "4th-Ventricle"
      "Pons",
      "SCP",
      "Midbrain"
]
df = get_data("../data/volumes.csv", ROI)
df

Unnamed: 0,subjectId,Pallidum,Putamen,Caudate,Thalamus-Proper,Cerebellum-Cortex,Cerebellum-White-Matter,3rd-Ventricle,SCP,Midbrain,stage
0,4037,4261.3,10223.0,7586.8,17351.7,116856.4,38230.0,934.2,307.724746,6989.344230,1.0
1,3168,3776.7,8200.9,5738.2,13200.4,91395.5,31986.1,1089.4,281.394581,5453.716726,2.0
2,3131,4523.6,9383.2,8577.0,16020.2,118487.3,34742.2,1719.7,348.730585,8004.224865,2.0
3,4024,3444.1,8405.3,5940.0,12945.9,93723.6,22075.2,1587.0,297.837617,5969.685193,2.0
4,4001,4174.6,11058.9,7890.2,15731.7,126094.9,29284.0,1650.6,262.838901,7330.256842,2.0
...,...,...,...,...,...,...,...,...,...,...,...
146,3753,3443.9,9007.8,6511.1,14335.3,103145.8,25126.6,1366.5,241.183927,5918.611366,2.0
147,3372,4797.0,10114.8,9268.2,15093.5,112521.2,36136.9,3301.4,318.438874,6918.687371,1.0
148,3589,3067.4,7619.4,6386.1,13104.2,92495.3,29665.6,1319.4,283.262188,5467.899590,2.0
149,3586,3709.8,8082.5,5973.9,13831.9,109396.4,29020.2,1464.9,342.644554,6638.502535,1.0


# Normalize

In this section, normalization of the data using "Normalization 1" and "Normaliztion 2" techniques are implemented. 

Normalization 1:

$$\dfrac{Variable – mean \; of \;PD \;and \;NC \;in \;the \;training \;cohort}{\sigma \;of \;PD \;and \;NC \;in \;the \;training \;cohort}$$

Normalization 2:

$$\dfrac{Variable – mean \; of \;controls \;scanned \;using \;the \;same \;scanner}{\sigma \;of \;controls \;scanned \;using \;the \;same \;scanner}$$


In [4]:
def normalize1(df, mean, std):
    if mean is None and std is None:
        mean = df.mean(axis=0)
        std = df.std(axis=0)
        normalizedDf = (df - mean)/std
        return normalizedDf.values, mean, std

    normalizedDf = (df - mean)/std
    return normalizedDf.values

In [5]:
def normalize2(df):
    df_no_id = df.drop(columns=["subjectId", "stage"])
    metadata_df = utils.parse_metadata()
    merged_df = pd.merge(df, metadata_df, on=["subjectId"], how="left")
   
    stats = {}
    for scanner in merged_df["scannerType"].dropna().unique():
        mean, std = utils.get_mean_and_stats(merged_df.drop(columns="subjectId"), scanner, df_no_id.shape[1])
        stats[scanner] = {
            "mean": mean.to_dict(),
            "std": std.to_dict()
        }

    for index in merged_df.index:
        rowInfo = merged_df.iloc[index]
        scanner = rowInfo["scannerType"]
        mean = list(stats[scanner]["mean"].values())
        std = list(stats[scanner]["std"].values())
        df_no_id.iloc[index] = (df_no_id.iloc[index]-mean)/std
        
    return df_no_id

# Models

In this section, we define four models being logisitc regression, SVM with linear and radial kernel and a random forest. As per the paper:

_Using the scikit-learn package, four supervised
machine learning algorithms were used: logistic regression, support vector machine (SVM) with a linear kernel, SVM with a radial basis function kernel, and
random forest_ (Chougar et al.)

Additionally, we will implement a stratified cross validation loop for hyperparameter tuning. As per the paper:

_The cross-validation procedure on the training cohort included two nested loops: an outer loop with repeated stratified random splits with 50 repetitions evaluating the classification performances and an inner loop with 5 fold cross-validation used to optimize the hyperparameters of the algorithms_ (Chougar et al.)

### Imports

In [7]:
# Models
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Utils
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import train_test_split

# Parallel job
from joblib import Parallel, delayed
import matplotlib.pyplot as plt


### Parallel model

Since our cross validation loop produces 250 folds per model, it is bound to take a long time to run. Therefore, a refined version of the code above is re-written in parallel. 

It is recommended that you run the following code from your terminal:
```
conda activate research # Check README to get corect CONDA environemnt
cd ml/
python run.py
```

In [8]:
def train(clf, train_index, test_index, X, y, normalize, columns, modelType, reportKey, iteration):
    print(f"=================Iteration #{iteration}=================")
    performanceDict = {}
        
    # Get fold data train/test sets
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    print(f'Shape of train set: {X_train.shape}')
    print(f'Shape of test set: {X_test.shape}')
    
    # Normalize model data
    print("Normalizing data...")
    if normalize.__name__ == "normalize1":
        trainDf = pd.DataFrame(X_train, columns=columns).drop(columns=["subjectId", "class"])
        testDf = pd.DataFrame(X_test, columns=columns).drop(columns=["subjectId", "class"])
        X_train_normalized, mean_train, std_train = normalize(trainDf, None, None)
        X_test_normalized = normalize(testDf, mean_train, std_train)

    elif normalize.__name__ == "normalize2":
        trainDf = pd.DataFrame(X_train, columns=columns)
        testDf = pd.DataFrame(X_test, columns=columns)
        X_train_normalized = normalize2(trainDf)
        X_test_normalized = normalize2(testDf)
        
    print("Done normalizing data")
        
    print(f"Fitting {modelType} model #{iteration}...")
    model = clf.fit(X_train_normalized, y_train)
    print("Done fitting model")
    
    print(f"Computing results metrics for {modelType} model #{iteration}...")
    performanceDict = utils.performance_report(model, modelType, reportKey, iteration, X_train_normalized, X_test_normalized, y_train, y_test)
    print("Done computing results metrics\n")

    return performanceDict

def parallel_model(df, modelType, reportKey, normalize, paramGrid, dataFile, ROI, heuristic=None):
    print(f"\n======================Running {modelType} with the following parameters======================\nNormalization: {normalize.__name__}\nParam Grid: {paramGrid}\nData: {dataFile}\nROI: {ROI}")

    performance = []
    if not os.path.isdir(modelType):
        os.mkdir(modelType)

    X = df.values
    y = utils.convert_Y(X[:, -1])
    columns = df.columns
    
    # Setup CV
    cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=3, random_state=42)

    # Define model type
    if modelType == "SVM":
        clf = GridSearchCV(SVC(random_state=0), paramGrid)
    elif modelType == "RF":
        clf = GridSearchCV(RandomForestClassifier(random_state=0, n_jobs = -1), paramGrid)
    elif modelType == "LR":
        clf = GridSearchCV(LogisticRegression(random_state=0), paramGrid)
    
    output = Parallel(n_jobs=-1)(delayed(train)(clf, train_index, test_index, X, y, normalize, columns, modelType, reportKey, iteration) for iteration, (train_index, test_index) in enumerate(cv.split(X, y)))

    performance.append(output)

    with open(f"{modelType}/{reportKey}_report.json", 'w', encoding='utf-8') as f:
        json.dump(performance, f, ensure_ascii=False, indent=4)
        
    return performance