<img src="images/mind_tree.jpg" align="center"/>

# A Mind Without Time: Forecasting the Conversion to Alzheimer's Disease

This project attempts to forecast the conversion of cognitively normal persons and persons with Mild Cognitive Impairment (MCI) to a diagnosis of Alzheimer's Disease (AD). Alzheimer's Disease is one of the most prevalent neurodegenerative disorders in North America. In Canada alone, there are 564,000 people diagnosed with dementia, a number that is expected to increase to nearly a million by 2031. Aside from the impact on an individual, dementia places a large burden on the healthcare system and on the people who care for an affected individual. Dementia is currently estimated to cost 10.4 billion dollars per year in Canada.

Early diagnosis of AD is associated with a higher quality of life and lower costs for the healthcare system. However, detecting AD early in the disease progression is difficult due to the multifaceted nature of how neurodegeneration affects the brain, cognitive processing, and behavior. Clinical evaluation relies on a myriad of cognitive tests and biomarkers whose results are not always clearly abnormal in patients with MCI, a precursor to AD.

The multifaceted impact of cognitive impairment and neurodegeneration in MCI and AD suggests that machine learning algorithms such as neural networks may be beneficial in identifying and predicting disease progression. Current studies typically only incorporate one form of data, however, often relying solely on features extracted from structural magnetic resonance imaging (MRI) scans. Other forms of data that show promise in classification with machine learning algorithms include cognitive assessments and the connectivity patterns of resting-state functional networks. This is because spatial and episodic memory, cognitive processes that are typically the first affected in MCI and AD, rely on complex, dynamic interactions of distributed neural networks and are therefore susceptible to the impact of neurodegeneration. Critically, there has yet to be an assessment of how machine learning algorithms perform using features extracted from structural and functional MRI data, as well as cognitive assessments. This project aims to remedy this.

**Target audience and use cases:**

Healthcare providers. Structural and resting-state functional MRIs are among the easiest and fastest methods of brain imaging. Using them to identify persons at risk of, or with, AD would assist in providing targeted treatments.

In [179]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
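import plotly
# py.iplot is called for every figure below; offline mode is assumed here
# because init_notebook_mode is used
import plotly.offline as py
import plotly.graph_objs as go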

# python file with metadata
import vector_dict

plotly.offline.init_notebook_mode(connected=True)
%matplotlib inline

In [252]:
# load cleaned dataframe
df = pd.read_csv('df_cleaned.csv', index_col=0)
# make headers lower case
df.columns = df.columns.str.lower()

## 2. Exploratory Data Analysis

### 2.1 Overview of Patient Characteristics

In this first section, we'll look at some of the demographic information available in the dataset and how it relates to the different levels of cognitive function.

In [254]:
# extract entries from baseline
df_bl = df[df.viscode == 0]
# extract gender series
# build the masks from df_bl (not df) so the boolean index aligns with df_bl
x0 = df_bl.age[df_bl.ptgender == 'Male']
x1 = df_bl.age[df_bl.ptgender == 'Female']

# plot with plotly
trace_m = go.Histogram(
    x = x0,
    name = 'Male',
    marker=dict(
                color='rgb(116,159,199)'),
    opacity = 0.75)

trace_f = go.Histogram(
    x = x1,
    name = 'Female',
    marker=dict(
                color='rgb(245,169,114)'),
    opacity = 0.75)

data = [trace_m, trace_f]

layout = go.Layout(
    title='Distribution of Age',
    xaxis=dict(
        title='Age'
    ),
    yaxis=dict(
        title='Count'
    ),
    bargap=0.2,
    bargroupgap=0.1
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='mri')

This distribution shows us that the average age of baseline measurements is around 74 years old. There are also more men than women in the dataset, particularly for patients that are older than the mean age. Importantly though, the distribution looks approximately normal.
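The "approximately normal" reading of the histogram can be backed up with two summary statistics. The following is a minimal sketch, not part of the original analysis: it uses a synthetic, seeded stand-in for `df_bl.age` (the values are hypothetical) and reports skewness and excess kurtosis, both of which are close to zero for a normal distribution.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_bl.age (hypothetical values, not the study data)
rng = np.random.default_rng(0)
age = pd.Series(rng.normal(loc=74, scale=7, size=1000), name="age")

# Skewness and excess kurtosis are both near 0 for a normal distribution
summary = {
    "mean": age.mean(),
    "skew": age.skew(),
    "excess_kurtosis": age.kurt(),
}
print(summary)
```

On the real data, replacing `age` with `df_bl.age` would give the same summary; values far from zero would suggest that the visual impression of normality should be treated with caution.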

In [255]:
# calculate mean age across sample and by gender
mean_age = df_bl.age.mean()
age_by_gender = df_bl.groupby("ptgender").age.mean()
# index by label rather than position so the values cannot be swapped
f_age = age_by_gender['Female']
m_age = age_by_gender['Male']

# calculate percent male/female
n = len(df_bl)
count_by_gender = df_bl.groupby("ptgender").age.count()
f_per = count_by_gender['Female']/n*100
m_per = count_by_gender['Male']/n*100

In [334]:
# plot with plotly
trace1 = go.Bar(
                y=['Male', 'Group', 'Female'],
                x=[m_age.round(1), mean_age.round(1), f_age.round(1)],
                orientation='h',
                hoverinfo='x',
                marker=dict(
                    color=['rgb(116,159,199)', 'rgb(204,204,204)',
                    'rgb(245,169,114)']),
                opacity = 0.75
                )

trace2 = go.Bar(
                y=['Male', 'Female'],
                x=[m_per, f_per],
                orientation='h',
                hoverinfo='x',
                marker=dict(
                    color=['rgb(116,159,199)', 'rgb(245,169,114)']),
                opacity = 0.75
                )

# make subplots
fig = plotly.tools.make_subplots(cols=2, subplot_titles=('Mean Age', 'Percent of Sample by Gender'), print_grid=False)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)

# update layout
fig['layout'].update(showlegend=False)
fig['layout']['xaxis1'].update(range=[70,75])
py.iplot(fig, filename='mri')

Next, let's break down the sample by diagnosis at baseline, then explore some of the relationships between our predictor and demographic variables.

In [257]:
df_bl.dx_bl.unique()

array(['CN', 'AD', 'LMCI', 'EMCI', 'SMC'], dtype=object)

The diagnoses at baseline are coded above. Here is what they represent:

* CN = cognitively normal
* AD = Alzheimer's Disease
* LMCI = late or amnestic mild cognitive impairment
* EMCI = early mild cognitive impairment
* SMC = significant memory concern

A full breakdown of the inclusion criteria for the different diagnoses can be found [here](https://clinicaltrials.gov/ct2/show/NCT01231971). Briefly, patients with a significant memory concern are self-reporters who report trouble with their memory but do not show memory impairments on clinical assessments. They have to be otherwise cognitively normal. Early MCI is characterized as the mildest symptomatic form of MCI and AD, while late MCI is characterized as having more pronounced amnesia. Let's now look at some of the characteristics of patients with the different baseline diagnoses.
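Before comparing the groups, it helps to know how many patients fall into each baseline diagnosis. The sketch below is illustrative only: `toy_bl` is a hypothetical frame that reuses the `dx_bl` column name from this notebook, and the counts are made up.

```python
import pandas as pd

# Toy baseline frame; 'dx_bl' mirrors the column used in this notebook
toy_bl = pd.DataFrame(
    {"dx_bl": ["CN", "CN", "AD", "LMCI", "EMCI", "EMCI", "SMC", "CN"]}
)

# Absolute counts and percentage of the sample per diagnosis
counts = toy_bl["dx_bl"].value_counts()
percent = (toy_bl["dx_bl"].value_counts(normalize=True) * 100).round(1)
print(pd.DataFrame({"n": counts, "percent": percent}))
```

Applied to `df_bl`, the same two calls would show whether any diagnosis group is small enough to make its box plots or correlations unreliable.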

In [258]:
# subset dataframes for plotting
df_cn = df_bl[df_bl['dx_bl'] == 'CN']
df_ad = df_bl[df_bl['dx_bl'] == 'AD']
df_lm = df_bl[df_bl['dx_bl'] == 'LMCI']
df_em = df_bl[df_bl['dx_bl'] == 'EMCI']
df_sm = df_bl[df_bl['dx_bl'] == 'SMC']

In [269]:
def plot_dx(v):
    """Take a column name as input and generate one plotly box plot
       per baseline diagnosis category for that column."""
    
    trace_cn = go.Box(
                      y=df_cn[v],
                      name='CN')

    trace_ad = go.Box(
                      y=df_ad[v],
                      name='AD')

    trace_lm = go.Box(
                      y=df_lm[v],
                      name='LMCI')

    trace_em = go.Box(
                      y=df_em[v],
                      name='EMCI')

    trace_sm = go.Box(
                      y=df_sm[v],
                      name='SMC')
    
    layout= go.Layout(
                      title = '{} by Baseline Diagnosis'.format(v.capitalize()))
    # link traces
    data = [trace_cn, trace_ad, trace_lm, trace_em, trace_sm]
    # link data and layout
    fig = go.Figure(data=data, layout=layout)
    # generate plot
    return py.iplot(fig, filename='mri')

In [270]:
# plot age by diagnosis category
plot_dx('age')

Overall, the range of ages between the different diagnosis types looks similar. Importantly, the variance of cognitively normal individuals appears to match that of the other categories. Next, let's look at how the different groups perform on the cognitive tests.
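The visual comparison of spread can be checked numerically with a grouped summary. This is a small sketch on hypothetical data (`toy_bl` and its values are invented for illustration); only the column names `dx_bl` and `age` come from the notebook.

```python
import pandas as pd

# Hypothetical ages for two diagnosis groups
toy_bl = pd.DataFrame({
    "dx_bl": ["CN", "CN", "CN", "AD", "AD", "AD"],
    "age": [70.0, 74.0, 78.0, 71.0, 75.0, 79.0],
})

# Sample size, mean, and standard deviation of age per diagnosis
age_stats = toy_bl.groupby("dx_bl")["age"].agg(["count", "mean", "std"])
print(age_stats)
```

Similar standard deviations across groups would support the observation that the variance of the CN group matches that of the other categories.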

### 2.2 Performance on Cognitive Assessments

In [323]:
# create list of baseline cognitive assessments
tests = [t.lower() for t in vector_dict.metadata['cognitive_tests'] if 'bl' in t]
# display list
print(f'The cognitive assessments that will be looked at are {tests}')

The cognitive assessments that will be looked at are ['cdrsb_bl', 'adas13_bl', 'mmse_bl', 'moca_bl']


In [271]:
plot_dx(tests[0])

As we would expect, the scores on the CDRSB are almost zero for the CN and SMC groups. There are much higher ratings for AD, and interestingly there is only a small difference between early and late MCI.

In [272]:
# plot ADAS13
plot_dx(tests[1])

We can see a similar trend with the ADAS13, although the differences between early and late MCI appear more pronounced.

In [273]:
# plot MMSE
plot_dx(tests[2])

Here, the scale is inverted, with lower scores representing more cognitive impairment. With this in mind, the overall trend between groups looks similar, although there are more outliers in the CN and SMC groups, indicating that the MMSE may not be as good a metric for classifying AD.
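If it is more convenient for all assessments to point in the same direction, the MMSE can be flipped so that higher values mean more impairment. The sketch below is one possible approach rather than a step taken in this notebook; the column name `mmse_bl_flipped` is hypothetical, and the maximum MMSE score of 30 is used as the pivot.

```python
import pandas as pd

MMSE_MAX = 30  # maximum possible MMSE score

# Hypothetical MMSE scores
toy = pd.DataFrame({"mmse_bl": [30, 27, 21]})

# After flipping, higher values mean more impairment, as with CDRSB and ADAS13
toy["mmse_bl_flipped"] = MMSE_MAX - toy["mmse_bl"]
print(toy)
```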

In [274]:
# plot MOCA
plot_dx(tests[3])

Again, we can see the same overall trend. However, given that the MOCA was only conducted for ~50% of the sample, it may be worth building a model without it since it doesn't appear to differentiate the groups any more than the other assessments.
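The share of missing values per assessment can be computed directly, which makes the "~50%" figure for the MOCA easy to verify. This is a minimal sketch with invented scores; only the column names are taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; NaN marks an assessment that was not administered
toy_bl = pd.DataFrame({
    "mmse_bl": [29, 28, 24, 20],
    "moca_bl": [26, np.nan, np.nan, 17],
})

# Percentage of missing values per column
pct_missing = toy_bl.isna().mean() * 100
print(pct_missing)
```

Running the same expression on `df_bl[tests]` would show the missingness for every cognitive assessment at once.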

Next, let's look at the correlations between performance on the different cognitive assessments. Highly correlated measures carry redundant information and can contribute to overfitting, so it is important to avoid using them together as features. The correlation matrices will also be useful when we do some feature engineering.

In [275]:
from IPython.display import display_html
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)

In [324]:
# add age to test list for correlations
tests.insert(0, 'age')
# display correlation matrices for sample, CN, and AD
display_side_by_side(df_bl[tests].corr(), 
                     df_cn[tests].corr(),
                     df_ad[tests].corr())

Whole baseline sample (`df_bl`):

| | age | cdrsb_bl | adas13_bl | mmse_bl | moca_bl |
|---|---|---|---|---|---|
| age | 1.0 | 0.059861 | 0.139857 | -0.132319 | -0.187296 |
| cdrsb_bl | 0.059861 | 1.0 | 0.743807 | -0.71902 | -0.673245 |
| adas13_bl | 0.139857 | 0.743807 | 1.0 | -0.741212 | -0.784323 |
| mmse_bl | -0.132319 | -0.71902 | -0.741212 | 1.0 | 0.707536 |
| moca_bl | -0.187296 | -0.673245 | -0.784323 | 0.707536 | 1.0 |

CN (`df_cn`):

| | age | cdrsb_bl | adas13_bl | mmse_bl | moca_bl |
|---|---|---|---|---|---|
| age | 1.0 | 0.117062 | 0.229192 | -0.069383 | -0.343349 |
| cdrsb_bl | 0.117062 | 1.0 | 0.083382 | 0.019856 | -0.034113 |
| adas13_bl | 0.229192 | 0.083382 | 1.0 | -0.170663 | -0.430711 |
| mmse_bl | -0.069383 | 0.019856 | -0.170663 | 1.0 | 0.226774 |
| moca_bl | -0.343349 | -0.034113 | -0.430711 | 0.226774 | 1.0 |

AD (`df_ad`):

| | age | cdrsb_bl | adas13_bl | mmse_bl | moca_bl |
|---|---|---|---|---|---|
| age | 1.0 | 0.081323 | -0.018548 | -0.050178 | -0.030545 |
| cdrsb_bl | 0.081323 | 1.0 | 0.449607 | -0.29825 | -0.396443 |
| adas13_bl | -0.018548 | 0.449607 | 1.0 | -0.446169 | -0.726418 |
| mmse_bl | -0.050178 | -0.29825 | -0.446169 | 1.0 | 0.541724 |
| moca_bl | -0.030545 | -0.396443 | -0.726418 | 0.541724 | 1.0 |
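To make the "avoid highly correlated features" rule concrete, the whole-sample matrix above can be screened for pairs above a cutoff. The sketch below re-enters the four assessment rows of that matrix by hand; the cutoff of 0.7 is an assumption chosen for illustration, not a value used elsewhere in this notebook.

```python
import pandas as pd

cols = ["cdrsb_bl", "adas13_bl", "mmse_bl", "moca_bl"]

# Whole-sample baseline correlations, copied from the output above
corr = pd.DataFrame(
    [
        [1.0, 0.743807, -0.719020, -0.673245],
        [0.743807, 1.0, -0.741212, -0.784323],
        [-0.719020, -0.741212, 1.0, 0.707536],
        [-0.673245, -0.784323, 0.707536, 1.0],
    ],
    index=cols,
    columns=cols,
)

THRESHOLD = 0.7  # assumed cutoff for "highly correlated"

# Walk the upper triangle so each pair is reported once
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > THRESHOLD
]

for a, b, r in high_pairs:
    print(f"{a} vs {b}: r = {r:.2f}")
```

With the real data, `corr` would simply be `df_bl[tests].corr()`. Note that the pairs flagged in the pooled sample need not be flagged within a single diagnosis group, since pooling groups with very different score levels inflates the correlations.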


In [325]:
# display correlation matrices for SMC, EMCI, and LMCI
display_side_by_side(df_sm[tests].corr(), 
                     df_em[tests].corr(),
                     df_lm[tests].corr())

SMC (`df_sm`):

| | age | cdrsb_bl | adas13_bl | mmse_bl | moca_bl |
|---|---|---|---|---|---|
| age | 1.0 | 0.120102 | 0.232009 | -0.092579 | -0.179436 |
| cdrsb_bl | 0.120102 | 1.0 | -0.022905 | -0.098121 | -0.224284 |
| adas13_bl | 0.232009 | -0.022905 | 1.0 | -0.123475 | -0.423262 |
| mmse_bl | -0.092579 | -0.098121 | -0.123475 | 1.0 | 0.329693 |
| moca_bl | -0.179436 | -0.224284 | -0.423262 | 0.329693 | 1.0 |

EMCI (`df_em`):

| | age | cdrsb_bl | adas13_bl | mmse_bl | moca_bl |
|---|---|---|---|---|---|
| age | 1.0 | 0.028404 | 0.355035 | -0.272018 | -0.281299 |
| cdrsb_bl | 0.028404 | 1.0 | 0.154609 | -0.111594 | -0.212626 |
| adas13_bl | 0.355035 | 0.154609 | 1.0 | -0.362637 | -0.472468 |
| mmse_bl | -0.272018 | -0.111594 | -0.362637 | 1.0 | 0.319844 |
| moca_bl | -0.281299 | -0.212626 | -0.472468 | 0.319844 | 1.0 |

LMCI (`df_lm`):

| | age | cdrsb_bl | adas13_bl | mmse_bl | moca_bl |
|---|---|---|---|---|---|
| age | 1.0 | -0.019251 | 0.102015 | -0.115877 | -0.140551 |
| cdrsb_bl | -0.019251 | 1.0 | 0.278052 | -0.177689 | -0.21004 |
| adas13_bl | 0.102015 | 0.278052 | 1.0 | -0.390636 | -0.581511 |
| mmse_bl | -0.115877 | -0.177689 | -0.390636 | 1.0 | 0.518924 |
| moca_bl | -0.140551 | -0.21004 | -0.581511 | 0.518924 | 1.0 |


In [333]:
trace_cn = go.Heatmap(
                      z=df_cn[tests].corr().values.tolist()[::-1],
                      x=tests,
                      y=tests[::-1],
                      hoverinfo='x+y+z')

trace_bl = go.Heatmap(
                      z=df_bl[tests].corr().values.tolist()[::-1],
                      x=tests,
                      y=tests[::-1],
                      showscale=False,
                      hoverinfo='x+y+z')

trace_ad = go.Heatmap(
                      z=df_ad[tests].corr().values.tolist()[::-1],
                      x=tests,
                      y=tests[::-1],
                      showscale=False,
                      hoverinfo='x+y+z')

trace_sm = go.Heatmap(
                      z=df_sm[tests].corr().values.tolist()[::-1],
                      x=tests,
                      y=tests[::-1],
                      showscale=False,
                      hoverinfo='x+y+z')

trace_em = go.Heatmap(
                      z=df_em[tests].corr().values.tolist()[::-1],
                      x=tests,
                      y=tests[::-1],
                      showscale=False,
                      hoverinfo='x+y+z')

trace_lm = go.Heatmap(
                      z=df_lm[tests].corr().values.tolist()[::-1],
                      x=tests,
                      y=tests[::-1],
                      showscale=False,
                      hoverinfo='x+y+z')

# make subplots
fig = plotly.tools.make_subplots(rows=2, cols=3, subplot_titles=('Baseline', 'CN', 'AD', 'SMC', 'EMCI', 'LMCI'), print_grid=False)
fig.append_trace(trace_bl, 1, 1)
fig.append_trace(trace_cn, 1, 2)
fig.append_trace(trace_ad, 1, 3)
fig.append_trace(trace_sm, 2, 1)
fig.append_trace(trace_em, 2, 2)
fig.append_trace(trace_lm, 2, 3)

fig['layout'].update(showlegend=False)
py.iplot(fig, filename='mri')

There are two primary points to note in these heatmaps:

1. Within the individual diagnosis groups, the measures are mostly weakly correlated; the exceptions are AD and late MCI. In AD, the CDRSB and the ADAS13 are clearly more correlated (r ≈ 0.45) than in the CN group (r ≈ 0.08). This is interesting and may be useful in modeling in that **we may consider using the correlation values between these two tests as a feature itself**. This correlation is present in late MCI as well, although it is weaker (r ≈ 0.28).

2. The MOCA has a fairly strong correlation with the MMSE and a negative correlation with the ADAS13. Since we will run an initial model without the MOCA, we can leave this for now.
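One way the correlation between the CDRSB and the ADAS13 could become a feature is to compute it per patient across that patient's visits. The following is a hedged sketch of that idea only: the frame `visits`, the patient identifier `rid`, the unsuffixed column names `cdrsb` and `adas13`, and all values are assumptions made for illustration, and the notebook has not yet committed to this design.

```python
import pandas as pd

# Hypothetical longitudinal data: three visits for each of two patients
visits = pd.DataFrame({
    "rid": [1, 1, 1, 2, 2, 2],
    "cdrsb": [0.0, 0.5, 1.0, 4.0, 5.0, 6.5],
    "adas13": [8.0, 10.0, 12.0, 30.0, 28.0, 27.0],
})


def cdrsb_adas13_corr(group):
    """Pearson correlation between the two tests within one patient."""
    return group["cdrsb"].corr(group["adas13"])


# One correlation value per patient, usable as a model feature
corr_feature = (
    visits.groupby("rid")[["cdrsb", "adas13"]]
    .apply(cdrsb_adas13_corr)
    .rename("cdrsb_adas13_r")
)
print(corr_feature)
```

A per-patient correlation needs several visits to be meaningful, and it is undefined when either score never changes, so patients with few visits or constant scores would need separate handling.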

### 2.3 Volumetric Brain Measures

In this section, we'll look at how volumetric brain measurements from the structural MRIs differ (or not) between the diagnosis categories.

Next, let's look at the relationship between age and (1) performance on the different cognitive tests and (2) volumetric estimates of different brain regions.

### 2.4 Characteristics of Diagnostic Change

The target variable we're interested in forecasting is the conversion to AD, which is coded by the diagnostic change (DXCHANGE) column in the dataframe. The different categories for this variable are:

In [34]:
for k,v in vector_dict.dx_change_ids.items():
    print(f'Value {k} represents {v}')

Value 1 represents Stable:NL to NL
Value 2 represents Stable:MCI to MCI
Value 3 represents Stable:AD to AD
Value 4 represents Conv:NL to MCI
Value 5 represents Conv:MCI to AD
Value 6 represents Conv:NL to AD
Value 7 represents Rev:MCI to NL
Value 8 represents Rev:AD to MCI
Value 9 represents Rev:AD to NL
Value -1 represents Not available


The two values that we're interested in are 5, the conversion from MCI to AD, and 6, the conversion of cognitively normal to AD. Let's explore some of the characteristics in the dataset associated with these two diagnostic changes.
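These two codes can be collapsed into a single binary target. The sketch below is one possible encoding, not the notebook's final choice: the lowercase column name `dxchange` assumes the header lowercasing done at load time, `converted_to_ad` is a hypothetical column name, and rows coded -1 (not available) are dropped rather than counted as non-converters.

```python
import pandas as pd

# 5 = Conv:MCI to AD, 6 = Conv:NL to AD
CONVERSION_CODES = [5, 6]

# Hypothetical DXCHANGE values, including one unavailable entry (-1)
toy = pd.DataFrame({"dxchange": [1, 2, 5, 6, 7, -1]})

# Drop unavailable diagnoses, then flag conversions to AD
known = toy[toy["dxchange"] != -1].copy()
known["converted_to_ad"] = known["dxchange"].isin(CONVERSION_CODES).astype(int)
print(known)
```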