<img src="images/mind_tree.jpg" align="center"/>

# A Mind Without Time: Classifying the Conversion to Alzheimer's Disease

This project attempts to identify cognitively normal individuals and persons with Mild Cognitive Impairment (MCI) who will convert to a diagnosis of Alzheimer's Disease (AD). Alzheimer's Disease is one of the most prevalent neurodegenerative disorders in North America. In Canada alone, there are 564,000 people diagnosed with dementia, a number that is expected to increase to nearly a million by 2031. Aside from its impact on the individual, dementia places a large burden on the healthcare system and on the people who care for an affected individual. Dementia is currently estimated to cost Canada 10.4 billion dollars per year.

Early diagnosis of AD is associated with a higher quality of life and lower costs for the healthcare system. However, detecting AD early in the disease progression is difficult because neurodegeneration affects the brain, cognitive processing, and behavior in many different ways. Clinical evaluation relies on a large battery of cognitive tests and biomarkers whose signals are not always identifiable in patients with MCI, a precursor to AD.

The multifaceted impact of cognitive impairment and neurodegeneration in MCI and AD suggests that machine learning algorithms may be beneficial in identifying and predicting disease progression. However, current studies typically incorporate only one form of data, often relying solely on features extracted from structural magnetic resonance imaging (MRI) scans. Other forms of data that show promise for classification with machine learning algorithms include cognitive assessments and the connectivity patterns of resting-state functional networks. This is because spatial and episodic memory, the cognitive processes that are typically affected first in MCI and AD, rely on complex, dynamic interactions of distributed neural networks and are therefore susceptible to the impact of neurodegeneration. Critically, there has yet to be an assessment of how machine learning algorithms perform when features extracted from structural and functional MRI data are combined with cognitive assessments. This project aims to remedy this.

**Target audience and use cases:**

Healthcare providers. Structural and resting-state functional MRI scans are among the easiest and fastest methods of brain imaging. Using them to classify persons at risk of AD would assist in providing targeted treatments.

**Notebook overview**

This notebook explores different machine learning approaches to classifying persons at risk of AD onset. It also covers model optimization.

## 3. Building a core model 

The first goal is to build a classification model that distinguishes individuals who convert from those who do not. We will use a reduced feature set in order to maximize the number of data points being classified. These features will be based on demographic information, cognitive assessments, genetic information, and structural MRI data, as these are the variables recorded for the majority of patients. Later, we will build reduced models that include other feature types and evaluate how they perform relative to our core model.

For our initial model, we will look at predicting AD onset from the values at baseline alone. This provides, on average, the maximum amount of time to target patients who will convert with early interventions.

In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as npr

from fastai import *          # Quick access to most common functionality
from fastai.tabular import *  # Quick access to tabular functionality


# remove future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings(module='sklearn*', action='ignore', category=DeprecationWarning)

import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

# Set plot style
plt.style.use('fivethirtyeight')

# set figure properties
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 100



In [2]:
df = pd.read_csv('df_final.csv', index_col=0)

In [3]:
df.columns

Index(['rid', 'viscode', 'd1', 'd2', 'dx_bl', 'dxchange', 'age', 'ptgender',
       'pteducat', 'ptethcat', 'ptmarry', 'cdrsb_bl', 'cdrsb', 'adas13_bl',
       'adas13', 'mmse_bl', 'mmse', 'moca', 'ecogptmem', 'ecogptvisspat',
       'ventricles_bl', 'wholebrain_bl', 'icv_bl', 'l_hippocampus_l',
       'l_hippocampus_r', 'x_hippocampus_l', 'x_hippocampus_r',
       'l_entorhinal_l', 'l_entorhinal_r', 'l_entorhinal_l_thick',
       'l_entorhinal_r_thick', 'x_entorhinal_l', 'x_entorhinal_r',
       'x_entorhinal_l_thick', 'x_entorhinal_r_thick', 'fdg', 'fdg_bl', 'pib',
       'av45', 'av45_bl', 'md_hippocampus_r', 'apoe4', 'converted', 'diff_1',
       'diff_2', 'x_hippocampus_l_bl', 'x_hippocampus_r_bl',
       'x_entorhinal_l_bl', 'x_entorhinal_r_bl', 'x_entorhinal_l_thick_bl',
       'x_entorhinal_r_thick_bl'],
      dtype='object')

In [4]:
# drop patients who already have an AD diagnosis at baseline
df = df.loc[df.dx_bl != 'AD']

In [5]:
# extract non converter datapoints
df_non = df.loc[df.converted==0]
# extract all converter datapoints
df_converted = df.loc[df.converted==1]
# concatenate
df_core = pd.concat([df_non, df_converted])

In [6]:
df_core['converted'] = df_core['converted'].astype('category')

In [7]:
df_core.shape

(9027, 51)

In [8]:
df_core.converted.value_counts()

0    6677
1    2350
Name: converted, dtype: int64
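
The class counts above show that roughly a quarter of the observations belong to converters, so accuracy alone can look respectable even for a trivial model. As a minimal sketch (the variable names are illustrative and not part of the notebook), the majority-class baseline that any classifier should beat can be computed directly from those counts:

```python
# Class counts copied from the value_counts() output above
counts = {0: 6677, 1: 2350}

total = sum(counts.values())
# Share of observations labelled as converters
converter_share = counts[1] / total
# Accuracy of a trivial model that always predicts the majority class
majority_baseline = max(counts.values()) / total

print(f"total observations: {total}")
print(f"converter share: {converter_share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

This is one reason to report ROC AUC and average precision alongside accuracy when evaluating models on this data.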

In [9]:
df_core.columns

Index(['rid', 'viscode', 'd1', 'd2', 'dx_bl', 'dxchange', 'age', 'ptgender',
       'pteducat', 'ptethcat', 'ptmarry', 'cdrsb_bl', 'cdrsb', 'adas13_bl',
       'adas13', 'mmse_bl', 'mmse', 'moca', 'ecogptmem', 'ecogptvisspat',
       'ventricles_bl', 'wholebrain_bl', 'icv_bl', 'l_hippocampus_l',
       'l_hippocampus_r', 'x_hippocampus_l', 'x_hippocampus_r',
       'l_entorhinal_l', 'l_entorhinal_r', 'l_entorhinal_l_thick',
       'l_entorhinal_r_thick', 'x_entorhinal_l', 'x_entorhinal_r',
       'x_entorhinal_l_thick', 'x_entorhinal_r_thick', 'fdg', 'fdg_bl', 'pib',
       'av45', 'av45_bl', 'md_hippocampus_r', 'apoe4', 'converted', 'diff_1',
       'diff_2', 'x_hippocampus_l_bl', 'x_hippocampus_r_bl',
       'x_entorhinal_l_bl', 'x_entorhinal_r_bl', 'x_entorhinal_l_thick_bl',
       'x_entorhinal_r_thick_bl'],
      dtype='object')

In [10]:
df_core.head()

| | rid | viscode | d1 | d2 | dx_bl | dxchange | age | ptgender | pteducat | ptethcat | ... | apoe4 | converted | diff_1 | diff_2 | x_hippocampus_l_bl | x_hippocampus_r_bl | x_entorhinal_l_bl | x_entorhinal_r_bl | x_entorhinal_l_thick_bl | x_entorhinal_r_thick_bl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 1 | CN | 1.0 | 74.3 | Male | 16 | Not Hisp/Latino | ... | 0.0 | 0 | -18.67 | NaN | 4117.0 | 4219.0 | 2241.0 | 1936.0 | 3.254 | 3.61 |
| 1 | 2 | 6 | 1 | 1 | CN | 1.0 | 74.3 | Male | 16 | Not Hisp/Latino | ... | 0.0 | 0 | -18.67 | NaN | 4117.0 | 4219.0 | 2241.0 | 1936.0 | 3.254 | 3.61 |
| 2 | 2 | 36 | 1 | 1 | CN | 1.0 | 74.3 | Male | 16 | Not Hisp/Latino | ... | 0.0 | 0 | -18.67 | NaN | 4117.0 | 4219.0 | 2241.0 | 1936.0 | 3.254 | 3.61 |
| 3 | 2 | 60 | 1 | 1 | CN | 1.0 | 74.3 | Male | 16 | Not Hisp/Latino | ... | 0.0 | 0 | -18.67 | NaN | 4117.0 | 4219.0 | 2241.0 | 1936.0 | 3.254 | 3.61 |
| 4 | 2 | 72 | 1 | 1 | CN | 1.0 | 74.3 | Male | 16 | Not Hisp/Latino | ... | 0.0 | 0 | -18.67 | NaN | 4117.0 | 4219.0 | 2241.0 | 1936.0 | 3.254 | 3.61 |


In [11]:
from sklearn.metrics import (accuracy_score, average_precision_score,
                             precision_recall_curve, roc_auc_score, roc_curve)


def model_eval(fitted_model, scaled=True, cv=False):
    """Evaluate a fitted classification model.

    Prints train/test accuracy, ROC AUC, and average precision, and plots the
    ROC and precision-recall curves. Relies on the global train/test sets
    (X_train, X_test, X_train_trans, X_test_trans, y_train, y_test)."""

    # set training/test sets
    if scaled:
        X_train_set = X_train_trans
        X_test_set = X_test_trans
    else:
        X_train_set = X_train
        X_test_set = X_test

    # calculate predicted probabilities for the positive class
    y_pred_prob = fitted_model.predict_proba(X_test_set)[:, 1]

    # generate ROC curve values: fpr, tpr, thresholds
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

    # generate precision-recall values from the probabilities
    # (hard class predictions would collapse the curve to a single threshold)
    precision, recall, pr_thresholds = precision_recall_curve(y_test, y_pred_prob)
    # calculate average precision
    average_precision = average_precision_score(y_test, y_pred_prob)

    # figure settings
    f, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 5))

    # plot ROC curve
    ax1.plot([0, 1], [0, 1], 'k--')
    ax1.plot(fpr, tpr)
    ax1.set_xlabel('False Positive Rate')
    ax1.set_ylabel('True Positive Rate')
    ax1.set_title('ROC Curve')

    # plot precision-recall curve (recall on x, precision on y)
    ax2.step(recall, precision, where='post')
    ax2.set_xlabel('Recall')
    ax2.set_ylabel('Precision')
    ax2.set_title('Precision-Recall Curve')
    plt.show()

    # display accuracy on the training data
    print(f'Accuracy on the training data is {round(accuracy_score(y_train, fitted_model.predict(X_train_set)), 3)}.')

    # display accuracy on the test data
    print(f'Accuracy on the test data is {round(accuracy_score(y_test, fitted_model.predict(X_test_set)), 3)}.')

    # display AUC score
    print(f'The model AUC for ROC curve of the test data is {round(roc_auc_score(y_test, y_pred_prob), 3)}')

    # display average precision
    print(f'Average precision is {round(average_precision, 3)}.')

    # display the best hyperparameters when a cross-validated search was fitted
    if cv:
        print(f'The best parameters are {fitted_model.best_params_}.')

def model_performance(classifier, scaled=True, cv=False):
    """Fit and evaluate a model. `classifier` is an unfitted classifier instance.
    `scaled` indicates whether to use the scaled train/test sets. `cv` indicates
    whether the classifier is a cross-validated search whose best parameters should be printed."""

    # set training data
    if scaled:
        X_train_set = X_train_trans
    else:
        X_train_set = X_train

    # initiate classifier
    clf = classifier
    # fit model on the selected (scaled or unscaled) training set
    clf = clf.fit(X_train_set, y_train)
    # evaluate model
    model_eval(fitted_model=clf, scaled=scaled, cv=cv)
    # return fitted model
    return clf

def nested_split(split_on, dataframe, test_prop):
    """Extract a proportion of observations from each patient for the train/test split.
    Returns a list of test-set index values, stratified by class and spread across patients."""
    # seed NumPy's global generator (call the function; assigning to it has no effect)
    npr.seed(42)
    # find the label classes
    classes = dataframe[split_on].unique()
    # lists of test indices for each class
    test_non = []
    test_conv = []
    for n in classes:
        # list of patient IDs
        patients = dataframe[dataframe[split_on] == n].rid.unique()
        # number of required samples for the test set
        n_samples = len(dataframe[dataframe[split_on] == n]) * test_prop
        # samples per patient, rounded down
        proportion = int(np.floor(n_samples / len(patients)))
        # remainder to randomly distribute across patients
        remainder = int(np.floor(n_samples - proportion * len(patients)))
        p_to_resample = [i for i in patients if len(dataframe[dataframe.rid == i]) > 1]
        # create list of patient IDs that contribute one extra sample
        resample = np.random.choice(p_to_resample, remainder, replace=False)
        # randomly pick indices for each patient
        for p in patients:
            if p in resample:
                sample = proportion + 1
            else:
                sample = proportion
            if n == 0:
                test_non = test_non + np.random.choice(dataframe[dataframe.rid==p].index, sample, replace=False).tolist()
            else:
                test_conv = test_conv + np.random.choice(dataframe[dataframe.rid==p].index, sample, replace=False).tolist()
    return test_non+test_conv
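
The quota arithmetic inside `nested_split` can be checked in isolation. The sketch below uses a hypothetical helper, `per_patient_quota`, and made-up counts (100 rows, 8 patients); neither comes from the notebook. It rounds down with `math.floor`, because Python 3's `round()` rounds halves to the nearest even integer and is therefore not a dependable floor when the value is a whole number.

```python
import math


def per_patient_quota(n_rows, n_patients, test_prop):
    """Hypothetical helper mirroring the quota arithmetic in nested_split."""
    # total number of test samples wanted for this class
    n_samples = n_rows * test_prop
    # samples taken from every patient, rounded down
    base = math.floor(n_samples / n_patients)
    # samples left over, handed out one each to randomly chosen patients
    extra = math.floor(n_samples - base * n_patients)
    return base, extra


base, extra = per_patient_quota(n_rows=100, n_patients=8, test_prop=0.2)
print(base, extra)

# Subtracting 0.5 and calling round() undershoots for a whole number such as 3.0
print(round(3.0 - 0.5), math.floor(3.0))
```

With these toy numbers every patient contributes `base` rows and `extra` patients contribute one more, so the class still supplies the requested share of test rows.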

In [12]:
# get indices for test set
test_ind = nested_split('converted', df_core, 0.2)

In [13]:
len(test_ind)

1805

In [14]:
# get indices for training set (set membership avoids a linear scan per index)
test_set = set(test_ind)
train_ind = [i for i in df_core.index if i not in test_set]

In [15]:
len(train_ind)

7222
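
The two lengths reported above add up to the 9,027 rows of `df_core` (1805 + 7222), which is what a clean partition should give. A minimal sketch of the same sanity check on a toy index follows; `split_summary` and the toy values are hypothetical and only illustrate the idea.

```python
def split_summary(all_indices, test_indices):
    """Hypothetical check: derive train indices and report the test fraction."""
    test_set = set(test_indices)
    # every test index must be unique and must exist in the full index
    assert len(test_set) == len(test_indices), "duplicate test indices"
    assert test_set <= set(all_indices), "test index not found in the data"
    # everything not held out for testing goes to training
    train = [i for i in all_indices if i not in test_set]
    return train, len(test_indices) / len(all_indices)


# Toy index standing in for df_core.index
toy_index = list(range(10))
toy_test = [1, 4]
toy_train, test_fraction = split_summary(toy_index, toy_test)
print(toy_train, test_fraction)
```

Applied to the real data, the same check would take `df_core.index` and `test_ind` as inputs.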

In [16]:
# columns excluded from the core model
cols_to_drop = [
    'd1', 'd2', 'dxchange', 'rid', 'viscode', 'cdrsb_bl', 'ptethcat',
    'adas13_bl', 'mmse_bl', 'ventricles_bl', 'wholebrain_bl', 'icv_bl',
    'fdg_bl', 'av45_bl',
    'x_hippocampus_l_bl', 'x_hippocampus_r_bl',
    'x_entorhinal_l_bl', 'x_entorhinal_r_bl',
    'x_entorhinal_l_thick_bl', 'x_entorhinal_r_thick_bl',
]
train_df = df_core.loc[train_ind, :].drop(cols_to_drop, axis=1)
test_df = df_core.loc[test_ind, :].drop(cols_to_drop, axis=1)

In [17]:
dep_var = 'converted'
cat_names = ['ptgender', 'ptmarry', 'dx_bl', 'apoe4']

In [18]:
df_nn = pd.concat([train_df, test_df])
df_nn.to_csv('df_nn.csv')

In [21]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7222 entries, 0 to 10100
Data columns (total 54 columns):
dx_bl                      7222 non-null category
age                        7222 non-null float64
ptgender                   7222 non-null category
pteducat                   7222 non-null int64
ptmarry                    7222 non-null category
cdrsb                      7222 non-null float64
adas13                     7222 non-null float64
mmse                       7222 non-null float64
moca                       7222 non-null float64
ecogptmem                  7222 non-null float64
ecogptvisspat              7222 non-null float64
l_hippocampus_l            7222 non-null float64
l_hippocampus_r            7222 non-null float64
x_hippocampus_l            7222 non-null float64
x_hippocampus_r            7222 non-null float64
l_entorhinal_l             7222 non-null float64
l_entorhinal_r             7222 non-null float64
l_entorhinal_l_thick       7222 non-null float64
l_entorhi

In [20]:
data = TabularDataBunch.from_df('nn', train_df, test_df, dep_var, tfms=[Categorify, FillMissing], cat_names=cat_names)

AttributeError: Can only use .cat accessor with a 'category' dtype
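
pandas raises this `AttributeError` whenever the `.cat` accessor is used on a column whose dtype is not `category`. Whether an explicit cast resolves the error inside `TabularDataBunch.from_df` depends on the installed fastai and pandas versions, which the notebook does not state. The following is therefore only a sketch of the pandas behaviour on a toy frame; the values are made up and the column names are borrowed from `cat_names`.

```python
import pandas as pd

# Toy frame with two of the columns listed in cat_names (values are made up)
toy = pd.DataFrame({
    "ptgender": ["Male", "Female", "Male"],
    "apoe4": [0.0, 1.0, 0.0],
})
cat_names = ["ptgender", "apoe4"]


def cat_accessor_works(series):
    """Return True if the .cat accessor can be used on this Series."""
    try:
        series.cat.categories
        return True
    except AttributeError:
        return False


# Object and float columns do not expose .cat
before = {name: cat_accessor_works(toy[name]) for name in cat_names}

# Cast each categorical feature explicitly
for name in cat_names:
    toy[name] = toy[name].astype("category")

after = {name: cat_accessor_works(toy[name]) for name in cat_names}
print(before)
print(after)
```

If the same cast is tried on the real data, it would need to be applied to the columns in `cat_names` of both `train_df` and `test_df` before the data bunch is built.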