# Machine Learning pipeline

In this notebook, we walk through the machine learning pipeline used to reproduce Lydia Chougar's paper. It covers the following sections:

1 - Convert CSV to DataFrame

2 - Normalize

3 - Train and predict models

4 - Cross Validation

5 - Results 

### Imports

In [16]:
import pandas as pd
import numpy as np
import glob, utils, sys

# 1. Convert CSV to DataFrame

Reads the volumes CSV into a DataFrame, keeps only the requested regions of interest, and optionally applies a heuristic selected by key:
- "combine": sums each Left/Right pair of regions into a single column
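To make the "combine" heuristic concrete, here is a minimal sketch of how Left/Right columns could be summed with pandas. The function `combine_left_right` is a hypothetical stand-in, not the actual `utils.combine_left_right_vol`, whose implementation and output column order are not shown in this notebook; the toy values are made up for illustration.

```python
import pandas as pd

def combine_left_right(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for utils.combine_left_right_vol (assumption)."""
    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        if col.startswith("Left-"):
            region = col[len("Left-"):]
            right = "Right-" + region
            if right in df.columns:
                # Sum the pair into one column named after the region.
                out[region] = df[col] + df[right]
                continue
        elif col.startswith("Right-") and "Left-" + col[len("Right-"):] in df.columns:
            # Handled when the matching Left- column is visited.
            continue
        # Unpaired columns (e.g. "3rd-Ventricle", "class") pass through unchanged.
        out[col] = df[col]
    return out

# Toy data, not taken from the study.
demo = pd.DataFrame({
    "Left-Putamen": [1.0, 2.0],
    "Right-Putamen": [3.0, 4.0],
    "3rd-Ventricle": [5.0, 6.0],
    "class": ["NC", "PD"],
})
combined = combine_left_right(demo)
print(combined)
```

Columns without a Left/Right partner are left untouched, which matches the 12 feature columns returned for the 18 requested volumes in the test cell of this section.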

In [5]:
def get_data(csvFileName: str, ROI: list, heuristic: str = None):
    '''
    Sanitize the data and build NumPy arrays: X holds the ROI volumes and y the class labels [NC, PD].
    @csvFileName: path to the input volumes CSV
    @ROI: names of the columns (regions of interest) to keep, including "class"
    @heuristic: key of the optional transformation to apply (currently only "combine")
    '''
    df = pd.read_csv(csvFileName)
    df = utils.remove_unwanted_columns(df, ROI)

    if heuristic == "combine":
        df = utils.combine_left_right_vol(df)

    # The class label is read from the last column; every other column is a feature.
    arr = df.values
    X = arr[:, :-1]
    y = utils.convert_Y(arr[:, -1])
    return X, y

Test *get_data()* function

In [11]:
ROI = [
      "class",
      "Left-Putamen", "Right-Putamen", 
      "Right-Caudate", "Left-Caudate", 
      "Right-Thalamus-Proper", "Left-Thalamus-Proper", 
      "Left-Pallidum", "Right-Pallidum", 
      "Left-Cerebellum-Cortex", "Right-Cerebellum-Cortex", "lhCortexVol", "rhCortexVol", "CortexVol",
      "Left-Cerebellum-White-Matter", "Right-Cerebellum-White-Matter",
      "CerebralWhiteMatterVol", 
      "3rd-Ventricle", "4th-Ventricle"
   ]
X, y = get_data("test_volumes.csv", ROI, "combine")
X,y

(array([[3960.4, 8604.5, 6236.9, 12916.8, 104394.3, 27950.199999999997,
         1987.7, 2656.4, 198898.000288, 200628.568064, 399526.568352,
         402347.086652],
        [3962.5, 9066.5, 6554.4, 15101.8, 84171.1, 31016.5, 1366.0,
         1327.3, 213315.819602, 211745.893941, 425061.713542,
         452280.825563],
        [4608.3, 8298.6, 7397.2, 13759.7, 96178.0, 31029.0, 2021.7,
         1337.2, 224162.804546, 230368.263921, 454531.068467,
         443310.044391],
        [3814.2, 9217.3, 6923.0, 12966.0, 101741.2, 28785.2, 786.0,
         1407.4, 225090.539205, 228753.382537, 453843.921742,
         405300.824822],
        [3381.7, 8170.700000000001, 5873.1, 13135.5, 111484.70000000001,
         33025.7, 2095.3, 1511.8, 223562.648811, 228791.484653,
         452354.133464, 382072.850614]], dtype=object),
 array([0, 0, 0, 0, 0]))

# 2. [Normalize](#normal)

This section covers two normalization techniques, "Normalization 1" and "Normalization 2". Only the first is implemented so far.

Normalization 1:

$$\dfrac{\text{Variable} - \text{mean of PD and NC in the training cohort}}{\sigma \text{ of PD and NC in the training cohort}}$$

Normalization 2:

$$\dfrac{\text{Variable} - \text{mean of controls scanned using the same scanner}}{\sigma \text{ of controls scanned using the same scanner}}$$

In [40]:
def normalize1(data):
    # Z-score every value: (value - column mean) / column standard deviation
    normalizedX = []
    
    for row in data:  # iterate over the argument, not the global X
        normalizedRow = []
        for columnIndex, variable in enumerate(row):
            mean = np.mean(data[:, columnIndex])
            std = np.std(data[:, columnIndex])
            normalizedValue = (variable - mean)/std
            normalizedRow.append(normalizedValue)        
        normalizedX.append(normalizedRow)
        
    return np.array(normalizedX)
            
normalize1(X)

array([[ 0.03805108, -0.16271336, -0.67986075, -0.80379434,  0.5246605 ,
        -1.33655941,  0.66540423,  1.98313006, -1.80883225, -1.63646943,
        -1.72283184, -0.55551598],
       [ 0.04338534,  0.95894545, -0.08029465,  1.86064318, -1.68561903,
         0.36318682, -0.5644739 , -0.63074384, -0.36861406, -0.70007485,
        -0.55084518,  1.32953587],
       [ 1.68379842, -0.90538745,  1.51124649,  0.22405664, -0.37333387,
         0.37011596,  0.73266473, -0.61127401,  0.71490798,  0.86845759,
         0.80170993,  0.99087932],
       [-0.33331527,  1.32506265,  0.61576855, -0.74379877,  0.23469192,
        -0.87369273, -1.711859  , -0.47321521,  0.80758083,  0.73243871,
         0.77017195, -0.44400922],
       [-1.43191957, -1.21590728, -1.36685964, -0.53710671,  1.29960047,
         1.47694937,  0.87826395, -0.267897  ,  0.6549575 ,  0.73564799,
         0.70179513, -1.32088999]])
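The Normalization 1 formula uses the mean and standard deviation of the training cohort, whereas `normalize1` computes them from whatever array it is given. A possible vectorized variant that fits the statistics on a training set and reuses them on other data is sketched below. The names `fit_normalize1` and `apply_normalize1` are hypothetical, and the toy arrays are made up for illustration.

```python
import numpy as np

def fit_normalize1(train):
    """Return per-column mean and standard deviation of the training data."""
    train = np.asarray(train, dtype=float)  # cast away dtype=object
    return train.mean(axis=0), train.std(axis=0)

def apply_normalize1(data, mean, std):
    """Z-score data with statistics computed on the training cohort."""
    return (np.asarray(data, dtype=float) - mean) / std

# Toy data, not taken from the study.
train = np.array([[1.0, 10.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

mean, std = fit_normalize1(train)
train_norm = apply_normalize1(train, mean, std)
test_norm = apply_normalize1(test, mean, std)
print(train_norm)
print(test_norm)
```

Like `normalize1`, this uses NumPy's default population standard deviation (`ddof=0`). Keeping the fit and apply steps separate should make it easier to reuse the training statistics inside each cross-validation fold.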

### TODO: Fetch metadata for every patient

In [44]:
def normalize2():
    print("TODO - Unimplemented")
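Once per-patient metadata is available, Normalization 2 could look roughly like the sketch below. It is only an illustration under several assumptions: `normalize2_sketch` is a hypothetical name, the `scanner` array (one scanner label per patient) is assumed to come from the metadata that still has to be fetched, and controls are assumed to be encoded as `0` in `y`, which this notebook does not confirm. The toy values are made up.

```python
import numpy as np

def normalize2_sketch(X, y, scanner, control_label=0):
    """Z-score each patient against the controls scanned on the same scanner.

    Assumptions: `scanner` holds one label per row of X, and
    `control_label` marks the control (NC) class in y.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    scanner = np.asarray(scanner)
    out = np.empty_like(X)
    for s in np.unique(scanner):
        on_scanner = scanner == s
        controls = on_scanner & (y == control_label)
        if not controls.any():
            raise ValueError(f"no controls available for scanner {s!r}")
        mean = X[controls].mean(axis=0)
        std = X[controls].std(axis=0)
        out[on_scanner] = (X[on_scanner] - mean) / std
    return out

# Toy data, not taken from the study.
X_demo = np.array([[1.0], [3.0], [5.0], [10.0], [20.0], [40.0]])
y_demo = np.array([0, 0, 1, 0, 0, 1])
scanner_demo = np.array(["A", "A", "A", "B", "B", "B"])
X_norm2 = normalize2_sketch(X_demo, y_demo, scanner_demo)
print(X_norm2)
```

A scanner with a single control, or with a constant column among its controls, would give a standard deviation of zero, so a real implementation would need to decide how to handle that case.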

# 3. [Train and predict models](#predict)