# Machine Learning pipeline

In this notebook, we go through the machine learning pipeline to reproduce Lydia Chougar's paper. The following sections will be covered:

1 - Convert CSV to DataFrame

2 - Normalize

3 - Train and predict models

4 - Cross Validation

5 - Results 

### Imports

In [136]:
import pandas as pd
import numpy as np
import glob, utils, sys

# Convert CSV to DataFrame

Converts data from CSV to DataFrame and applies any function. 
- "combine": sums all Left and Right regions into one column

In [137]:
def get_data(csvFileName: str, ROI: [], heuristic = None):
    '''
    The following function will sanitize data and build a numpy array with X ROI's volumes and y being the class [NC, PD]
    @csvFileName: input volumes csv
    @ROI: regions of interests desired
    @heuristic: function key
    '''
    df = pd.read_csv(csvFileName)
    df = utils.remove_unwanted_columns(df, ROI)
    
    if heuristic == "combine":
        df = utils.combine_left_right_vol(df)
        
    arr = df.values
    X = arr[:, :-1]
    y = utils.convert_Y(arr[:, -1])
    return X,y

Test *get_data()* function

In [138]:
ROI = [
      "class",
      "Left-Putamen", "Right-Putamen", 
      "Right-Caudate", "Left-Caudate", 
      "Right-Thalamus-Proper", "Left-Thalamus-Proper", 
      "Left-Pallidum", "Right-Pallidum", 
      "Left-Cerebellum-Cortex", "Right-Cerebellum-Cortex", "lhCortexVol", "rhCortexVol", "CortexVol",
      "Left-Cerebellum-White-Matter", "Right-Cerebellum-White-Matter",
      "CerebralWhiteMatterVol", 
      "3rd-Ventricle", "4th-Ventricle"
   ]
X, y = get_data("volumes.csv", ROI, "combine")
X

  df = df.drop(column, 1)


array([[4805.9, 10689.0, 8204.3, ..., 260137.320394, 521837.72897,
        541416.0],
       [4025.5, 9543.6, 6856.3, ..., 227892.928374, 453029.527784,
        459966.0],
       [4416.1, 9640.2, 6508.2, ..., 226647.385438, 452146.451741,
        467340.0],
       ...,
       [3714.0, 9247.0, 6964.6, ..., 242313.526846, 483537.306049,
        427627.0],
       [4367.5, 10956.0, 6407.2, ..., 251437.424595, 503486.026837,
        453074.0],
       [4183.0, 9857.7, 6906.6, ..., 256543.358332, 508441.662564,
        508866.0]], dtype=object)

# 2. [Normalize](#normal)

In this section, normalization of the data using "Normalization 1" and "Normaliztion 2" techniques are implemented. 

Normalization 1:

$$\dfrac{Variable – mean \; of \;PD \;and \;NC \;in \;the \;training \;cohort}{\sigma \;of \;PD \;and \;NC \;in \;the \;training \;cohort}$$

Normalization 2:

$$\dfrac{Variable – mean \; of \;controls \;scanned \;using \;the \;same \;scanner}{\sigma \;of \;controls \;scanned \;using \;the \;same \;scanner}$$


In [139]:
## TODO: remove loop

def normalize1(data, mean, std):
    df = pd.DataFrame(data=data)

    if mean is None and std is None:
        mean = df.mean(axis=0)
        std = df.std(axis=0)
        print(mean)
        print(std)

    for i in range(df.shape[1]):
        df[i] = df[i].apply(lambda x: (x-mean[i])/std[i])

    return df.values, mean, std

In [140]:
# Testing normalization1
testDf = pd.DataFrame(np.array([[1, 2, 3], [3, 4, 7]]),columns=['a', 'b', 'c'])
print(testDf)
normalizedDfTest, mean, std = normalize1(testDf.values, None, None)
print(normalizedDfTest)

   a  b  c
0  1  2  3
1  3  4  7
0    2.0
1    3.0
2    5.0
dtype: float64
0    1.414214
1    1.414214
2    2.828427
dtype: float64
[[-0.70710678 -0.70710678 -0.70710678]
 [ 0.70710678  0.70710678  0.70710678]]


### TODO: Fetch metadata for every patient

In [44]:
def normalize2():
    print("TODO - Unimplemented")

# 3. [Train and predict models](#predict)

In this section, we define four models being logisitc regression, SVM with linear and radial kernel and a random forest. As per the paper:

_Using the scikit-learn package, four supervised
machine learning algorithms were used: logistic regression, support vector machine (SVM) with a linear kernel, SVM with a radial basis function kernel, and
random forest_ (Chougar et al.)

### Imports

In [141]:
# Models
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Utils
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import PredefinedSplit

## Utilities


In [150]:
def split_data(X, y, training_split):
    '''
    The following function splits the training and testing data sets
    according to a split [0 - 1] passed.
    '''
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = training_split, random_state = 42)
    return X_train, X_test, y_train, y_test

def get_model_score(model, X_train, y_train, X_test, y_test):
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f'training score: {round(train_acc, 3)}')
    print(f'testing score: {round(test_acc, 3)}')
    return train_acc, test_acc

def model(X, y, modelType, dataSplit, normalize, paramGrid):
    # Define training, validation and test sets
    X_train, X_test, y_train, y_test = split_data(X, y, dataSplit)
    
    # Setup CV
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=50)
    
    # Define model type
    if modelType == "SVM":
        clf = GridSearchCV(SVC(random_state=0), paramGrid, cv=cv)
    elif modelType == "RF":
        clf = GridSearchCV(RandomForestClassifier(random_state=0), paramGrid, cv=cv)
    elif modelType == "LR":
        clf = GridSearchCV(LogisticRegression(random_state=0), paramGrid, cv=cv)
        
    # Normalize model data
    if normalize.__name__ == "normalize1":
        X_grid_normalized, mean_train, std_train = normalize(X_grid, None, None)
        X_test_normalized, mean_test, std_test = normalize(X_test, mean_train, std_train)
        
    # Fit and predict
    model = clf.fit(X_grid_normalized, y_grid)
    train_acc, test_acc = get_model_score(X_grid_normalized, y_grid, X_test_normalized, y_test)
    print(f'Best model params: {model.best_params_}')

### SVM

In [88]:
param_grid = {
    'C': [1.0, 10.0, 100.0, 1000.0],
    'gamma': [0.01, 0.10, 1.00, 10.00]
}
for kernelType in ["linear", "rbf"]:
    param_grid["kernel"] = kernelType
    model(X, y, "SVM", 0.7, normalize1, param_grid)

{'C': [1.0, 10.0, 100.0, 1000.0], 'gamma': [0.01, 0.1, 1.0, 10.0], 'kernel': 'linear'}
{'C': [1.0, 10.0, 100.0, 1000.0], 'gamma': [0.01, 0.1, 1.0, 10.0], 'kernel': 'rbf'}


### Logistic Regression

In [89]:
param_grid = {
    'penalty': ["l1", "l2", "elasticnet"],
    'C': [1.0, 10.0, 100.0, 1000.0]
}
model(X, y, "LR", 0.7, normalize1, param_grid)

### Random forest

In [90]:
param_grid = {
    'n_estimators': [100, 500, 1000],
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [2, 4, 5, 10, 13],
    'min_samples_leaf': [1, 2, 5, 8, 13]
}
model(X, y, "RF", 0.7, normalize1, param_grid)

# 4. [Cross Validation](#crossval)

In this section, we will implement the cross validation loop used in the paper. As per the paper:

_The cross-validation procedure on the training cohort included two nested loops: an outer loop with repeated stratified random splits with 50 repetitions evaluating the classification performances and an inner loop with 5 fold cross-validation used to optimize the hyperparameters of the algorithms_ (Chougar et al.)

### Imports

In [148]:
from sklearn.model_selection import RepeatedStratifiedKFold

In [149]:
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=50)