<img src="images/mind_tree.jpg" align="center"/>

# A Mind Without Time: Forecasting the Conversion to Alzheimer's Disease

This project attempts to forecast the conversion of cognitively normal persons and persons with Mild Cognitive Impairment (MCI) to a diagnosis of Alzheimer's Disease (AD). Alzheimer's Disease is one of the most prevalent neurodegenerative disorders in North America. In Canada alone, there are 564,000 people diagnosed with dementia, a number that is expected to increase to nearly a million by 2031. Aside from the impact on an individual, dementia places a large burden on the healthcare system and on the people who care for an affected individual. Dementia is currently estimated to cost 10.4 billion dollars per year in Canada.

Early diagnosis of AD is associated with a higher quality of life and a reduced cost on a healthcare system. However, detecting AD early in the disease progression is difficult due to the multifaceted nature of how neurodegeneration affects the brain, cognitive processing, and behavior. Clinical evaluation relies on assessment of a myriad of cognitive tests and biomarkers that are not always identifiable in patients with MCI, a precursor to AD. 

The multifaceted impact of cognitive impairment and neurodegeneration in MCI and AD suggests that machine learning algorithms such as neural networks may be beneficial in identifying and predicting disease progression. Current studies typically only incorporate one form of data, however, often relying solely on features extracted from structural magnetic resonance imaging (MRI) scans. Other forms of data that show promise in classification with machine learning algorithms include cognitive assessments and the connectivity patterns of resting-state functional networks. This is because spatial and episodic memory, cognitive processes that are typically the first affected in MCI and AD, rely on complex, dynamic interactions of distributed neural networks and are therefore susceptible to the impact of neurodegeneration. Critically, there has yet to be an assessment of how machine learning algorithms perform using features extracted from structural and functional MRI data, as well as cognitive assessments. This project aims to remedy this.

**Target audience and use cases:**

Healthcare providers. Structural and resting-state functional MRI are among the easiest and fastest methods of brain imaging. Using them to classify persons at risk of, or living with, AD would assist in providing targeted treatments.

In [24]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
# note: plotly >= 4 moved this module to the separate chart-studio package (chart_studio.plotly)
import plotly.plotly as py
import plotly.graph_objs as go

# python file with metadata
import vector_dict

%matplotlib inline

In [25]:
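def plot_dx_multibox(list_of_measures, rows, cols):
    """Draw one box-plot panel per measure, split by baseline diagnosis.

    Relies on the global df_cn, df_ad, df_lm, df_em, and df_sm dataframes
    and fills the subplot grid column by column."""
    # make subplots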
    fig = plotly.tools.make_subplots(rows=rows, cols=cols, subplot_titles=tuple(list_of_measures), print_grid=False)
    row = 1
    col = 1
    for m in list_of_measures:
        trace_cn = go.Box(
                          y=df_cn[m],
                          name='CN',
                          marker = dict(
                          color = 'rgb(57, 118, 175)'))

        trace_ad = go.Box(
                          y=df_ad[m],
                          name='AD',
                          marker = dict(
                          color = 'rgb(240, 133, 54)'))

        trace_lm = go.Box(
                          y=df_lm[m],
                          name='LMCI',
                          marker = dict(
                          color = 'rgb(80, 157, 62)'))

        trace_em = go.Box(
                          y=df_em[m],
                          name='EMCI',
                          marker = dict(
                          color = 'rgb(198, 58, 50)'))

        trace_sm = go.Box(
                          y=df_sm[m],
                          name='SMC',
                          marker = dict(
                          color = 'rgb(142, 106, 184)'))
                              
        fig.append_trace(trace_cn, row, col)
        fig.append_trace(trace_ad, row, col)
        fig.append_trace(trace_lm, row, col)
        fig.append_trace(trace_em, row, col)
        fig.append_trace(trace_sm, row, col)
        row += 1
        if row > rows and col <= cols:
            row = 1
            col += 1
    fig['layout'].update(showlegend=False)
    return py.iplot(fig, filename='mri')

In [224]:
# list of measures to drop from analysis based on inspection of the dataset
measures_to_drop = []
# load cleaned dataframe
df = pd.read_csv('df_cleaned.csv', index_col=0)

## 2. Exploratory Data Analysis

### 2.1 Overview of Patient Characteristics

In this first section, we'll look at some of the demographic information available in the dataset and how it relates to the different levels of cognitive function.

### 2.1.1 Age and gender 

In [27]:
# extract entries from baseline
df_bl = df[df.viscode == 0].copy()
# extract gender series
x0 = df_bl.age[df_bl.ptgender == 'Male']
x1 = df_bl.age[df_bl.ptgender == 'Female']

# plot with plotly
trace_m = go.Histogram(
    x = x0,
    name = 'Male',
    marker=dict(
                color='rgb(116,159,199)'),
    opacity = 0.75)

trace_f = go.Histogram(
    x = x1,
    name = 'Female',
    marker=dict(
                color='rgb(245,169,114)'),
    opacity = 0.75)

data = [trace_m, trace_f]

layout = go.Layout(
    title='Distribution of Age',
    xaxis=dict(
        title='Age'
    ),
    yaxis=dict(
        title='Count'
    ),
    bargap=0.2,
    bargroupgap=0.1
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='age')

This distribution shows us that the average age of baseline measurements is around 74 years old. There are also more men than women in the dataset, particularly for patients that are older than the mean age. Importantly though, the distribution looks approximately normal.
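The visual impression of approximate normality can be backed up with summary statistics. The sketch below is only illustrative: it uses a small, made-up list of ages (not the study data) as a stand-in for `df_bl.age`, and relies on pandas' `Series.skew()` and `Series.kurt()`. Values near zero for both are consistent with a roughly normal shape.

```python
import pandas as pd

# hypothetical ages, symmetric around 74; a stand-in for df_bl.age
ages = pd.Series([60, 67, 70, 74, 74, 78, 81, 88], name="age")

mean_age = ages.mean()
skewness = ages.skew()         # close to 0 for a symmetric sample
excess_kurtosis = ages.kurt()  # close to 0 for a normal distribution

print(f"mean={mean_age:.1f}, skew={skewness:.3f}, excess kurtosis={excess_kurtosis:.3f}")
```

On real data, a skewness far from zero would suggest that one tail of the age distribution is heavier than the histogram makes it look.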

In [28]:
# calculate mean age across sample and gender
mean_age = df_bl.age.mean()
f_age = df_bl.groupby("ptgender").age.mean()["Female"]
m_age = df_bl.groupby("ptgender").age.mean()["Male"]

# calculate percent male/female
n = len(df_bl)
f_per = df_bl.groupby("ptgender").age.count()["Female"]/n*100
m_per = df_bl.groupby("ptgender").age.count()["Male"]/n*100

In [29]:
# plot with plotly
trace1 = go.Bar(
                y=['Male', 'Group', 'Female'],
                x=[m_age.round(1), mean_age.round(1), f_age.round(1)],
                orientation='h',
                hoverinfo='x',
                marker=dict(
                    color=['rgb(116,159,199)', 'rgb(204,204,204)',
                    'rgb(245,169,114)']),
                opacity = 0.75
                )

trace2 = go.Bar(
                y=['Male', 'Female'],
                x=[m_per, f_per],
                orientation='h',
                hoverinfo='x',
                marker=dict(
                    color=['rgb(116,159,199)', 'rgb(245,169,114)']),
                opacity = 0.75
                )

# make subplots
fig = plotly.tools.make_subplots(cols=2, subplot_titles=('Mean Age', 'Percent of Sample by Gender'), print_grid=False)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)

# update layout
fig['layout'].update(showlegend=False)
fig['layout']['xaxis1'].update(range=[70,75])
py.iplot(fig, filename='gender')

Next, let's break down the sample by diagnosis at baseline, then explore some of the relationships between our predictor and demographic variables.

In [30]:
df_bl.dx_bl.unique()

array(['CN', 'AD', 'LMCI', 'EMCI', 'SMC'], dtype=object)

The diagnoses at baseline are coded above. Here are what they represent:

* CN = cognitively normal
* AD = Alzheimer's Disease
* LMCI = late or amnestic mild cognitive impairment
* EMCI = early mild cognitive impairment
* SMC = significant memory concern

A full breakdown of the inclusion criteria for the different diagnoses can be found [here](https://clinicaltrials.gov/ct2/show/NCT01231971). Briefly, significant memory concern patients are self reporters who are not showing memory impairments on clinical assessments but who report trouble with their memory. They have to be otherwise cognitively normal. Early MCI is characterized as the mildest symptomatic form of MCI and AD, while late MCI is characterized as having more pronounced amnesia. Let's now look at some of the characteristics of patients with the different baseline diagnoses.
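Because the diagnosis labels are plain strings, pandas sorts them alphabetically by default. One possible way to keep plots and tables in order of impairment is an ordered categorical. The ordering used below (CN, SMC, EMCI, LMCI, AD) is an assumption based on the descriptions above, and the toy series is made up.

```python
import pandas as pd

# assumed ordering from least to most impaired (an interpretation of the text above)
severity_order = ["CN", "SMC", "EMCI", "LMCI", "AD"]

# toy baseline diagnoses standing in for df_bl.dx_bl
dx_bl = pd.Series(["AD", "CN", "LMCI", "SMC", "EMCI", "CN"])
dx_ordered = pd.Categorical(dx_bl, categories=severity_order, ordered=True)

# sorting now follows the assumed severity order instead of alphabetical order
sorted_dx = pd.Series(dx_ordered).sort_values().tolist()
print(sorted_dx)
```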

In [31]:
# subset dataframes for plotting
df_cn = df_bl[df_bl['dx_bl'] == 'CN']
df_ad = df_bl[df_bl['dx_bl'] == 'AD']
df_lm = df_bl[df_bl['dx_bl'] == 'LMCI']
df_em = df_bl[df_bl['dx_bl'] == 'EMCI']
df_sm = df_bl[df_bl['dx_bl'] == 'SMC']

In [32]:
def plot_dx(v):
    """Take a variable name as input and generate a box plot for each
       baseline diagnosis category using plotly."""
    
    trace_cn = go.Box(
                      y=df_cn[v],
                      name='CN')

    trace_ad = go.Box(
                      y=df_ad[v],
                      name='AD')

    trace_lm = go.Box(
                      y=df_lm[v],
                      name='LMCI')

    trace_em = go.Box(
                      y=df_em[v],
                      name='EMCI')

    trace_sm = go.Box(
                      y=df_sm[v],
                      name='SMC')
    
    layout= go.Layout(
                      title = '{} by Baseline Diagnosis'.format(v.capitalize()))
    # link traces
    data = [trace_cn, trace_ad, trace_lm, trace_em, trace_sm]
    # link data and layout
    fig = go.Figure(data=data, layout=layout)
    # generate plot
    return py.iplot(fig, filename=v)

In [33]:
# plot age by diagnosis category
plot_dx('age')

Overall, the range of ages between the different diagnosis types looks similar. Importantly, the variance of cognitively normal individuals appears to match that of the other categories. Next, let's look at how the different groups perform on the cognitive tests.

### 2.1.2 Performance on Cognitive Assessments

In [34]:
# create list of baseline cognitive assessments
tests = [t.lower() for t in vector_dict.metadata['cognitive_tests'] if 'bl' in t]
# display list
print(f'The cognitive assessments that will be looked at are {tests}')

The cognitive assessments that will be looked at are ['cdrsb_bl', 'adas13_bl', 'mmse_bl', 'moca_bl']


In [35]:
plot_dx(tests[0])

As we would expect, the scores on the CDRSB are almost zero for the CN and SMC groups. There are much higher ratings for AD, and interestingly there is only a small difference between early and late MCI.

In [36]:
# plot ADAS13
plot_dx(tests[1])

We can see a similar trend with the ADAS13, although the differences between early and late MCI appear more pronounced.

In [37]:
# plot MMSE
plot_dx(tests[2])

Here, the scale is inverted, with lower scores representing more cognitive impairment. With that in mind, the overall trend between groups looks similar, although there are more outliers in the CN and SMC groups, indicating that the MMSE may not be as good a metric for classifying AD.

In [38]:
# plot MOCA
plot_dx(tests[3])

Again, we can see the same overall trend. However, given that the MOCA was only conducted for ~50% of the sample, it may be worth building a model without it since it doesn't appear to differentiate the groups any more than the other assessments.
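The share of patients with a recorded score can be checked directly before deciding to drop a measure. The following is a minimal sketch on made-up scores; the 60% cut-off is an arbitrary assumption used only for illustration.

```python
import numpy as np
import pandas as pd

# toy baseline scores; NaN marks an assessment that was not administered
toy_bl = pd.DataFrame({
    "mmse_bl": [29, 27, 30, 22, 25, 28],
    "moca_bl": [26, np.nan, 27, np.nan, np.nan, 24],
})

# fraction of patients with a recorded score for each assessment
coverage = toy_bl.notna().mean()
print(coverage)

# columns below the assumed 60% coverage threshold are candidates to drop
low_coverage = coverage[coverage < 0.6].index.tolist()
print(low_coverage)
```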

In [39]:
measures_to_drop.append('moca_bl')

Next, let's look at the correlations between performance on the different cognitive assessments. Highly correlated measures carry largely redundant information and can contribute to overfitting, so it is important to avoid using them together as features. The correlation matrices will also be useful when we do some feature engineering.

In [40]:
from IPython.display import display_html
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)

In [41]:
# add age to test list for correlations
tests.insert(0, 'age')
# display correlation matrices for sample, CN, and AD
display_side_by_side(df_bl[tests].corr(), 
                     df_cn[tests].corr(),
                     df_ad[tests].corr())

Whole sample:

| | age | cdrsb_bl | adas13_bl | mmse_bl | moca_bl |
|---|---|---|---|---|---|
| age | 1.0 | 0.059861 | 0.139857 | -0.132319 | -0.187296 |
| cdrsb_bl | 0.059861 | 1.0 | 0.743807 | -0.71902 | -0.673245 |
| adas13_bl | 0.139857 | 0.743807 | 1.0 | -0.741212 | -0.784323 |
| mmse_bl | -0.132319 | -0.71902 | -0.741212 | 1.0 | 0.707536 |
| moca_bl | -0.187296 | -0.673245 | -0.784323 | 0.707536 | 1.0 |

CN:

| | age | cdrsb_bl | adas13_bl | mmse_bl | moca_bl |
|---|---|---|---|---|---|
| age | 1.0 | 0.117062 | 0.229192 | -0.069383 | -0.343349 |
| cdrsb_bl | 0.117062 | 1.0 | 0.083382 | 0.019856 | -0.034113 |
| adas13_bl | 0.229192 | 0.083382 | 1.0 | -0.170663 | -0.430711 |
| mmse_bl | -0.069383 | 0.019856 | -0.170663 | 1.0 | 0.226774 |
| moca_bl | -0.343349 | -0.034113 | -0.430711 | 0.226774 | 1.0 |

AD:

| | age | cdrsb_bl | adas13_bl | mmse_bl | moca_bl |
|---|---|---|---|---|---|
| age | 1.0 | 0.081323 | -0.018548 | -0.050178 | -0.030545 |
| cdrsb_bl | 0.081323 | 1.0 | 0.449607 | -0.29825 | -0.396443 |
| adas13_bl | -0.018548 | 0.449607 | 1.0 | -0.446169 | -0.726418 |
| mmse_bl | -0.050178 | -0.29825 | -0.446169 | 1.0 | 0.541724 |
| moca_bl | -0.030545 | -0.396443 | -0.726418 | 0.541724 | 1.0 |


In [42]:
# display correlation matrices for SMC, EMCI, and LMCI
display_side_by_side(df_sm[tests].corr(), 
                     df_em[tests].corr(),
                     df_lm[tests].corr())

SMC:

| | age | cdrsb_bl | adas13_bl | mmse_bl | moca_bl |
|---|---|---|---|---|---|
| age | 1.0 | 0.120102 | 0.232009 | -0.092579 | -0.179436 |
| cdrsb_bl | 0.120102 | 1.0 | -0.022905 | -0.098121 | -0.224284 |
| adas13_bl | 0.232009 | -0.022905 | 1.0 | -0.123475 | -0.423262 |
| mmse_bl | -0.092579 | -0.098121 | -0.123475 | 1.0 | 0.329693 |
| moca_bl | -0.179436 | -0.224284 | -0.423262 | 0.329693 | 1.0 |

EMCI:

| | age | cdrsb_bl | adas13_bl | mmse_bl | moca_bl |
|---|---|---|---|---|---|
| age | 1.0 | 0.028404 | 0.355035 | -0.272018 | -0.281299 |
| cdrsb_bl | 0.028404 | 1.0 | 0.154609 | -0.111594 | -0.212626 |
| adas13_bl | 0.355035 | 0.154609 | 1.0 | -0.362637 | -0.472468 |
| mmse_bl | -0.272018 | -0.111594 | -0.362637 | 1.0 | 0.319844 |
| moca_bl | -0.281299 | -0.212626 | -0.472468 | 0.319844 | 1.0 |

LMCI:

| | age | cdrsb_bl | adas13_bl | mmse_bl | moca_bl |
|---|---|---|---|---|---|
| age | 1.0 | -0.019251 | 0.102015 | -0.115877 | -0.140551 |
| cdrsb_bl | -0.019251 | 1.0 | 0.278052 | -0.177689 | -0.21004 |
| adas13_bl | 0.102015 | 0.278052 | 1.0 | -0.390636 | -0.581511 |
| mmse_bl | -0.115877 | -0.177689 | -0.390636 | 1.0 | 0.518924 |
| moca_bl | -0.140551 | -0.21004 | -0.581511 | 0.518924 | 1.0 |


In [43]:
trace_cn = go.Heatmap(
                      z=df_cn[tests].corr().values.tolist()[::-1],
                      x=tests,
                      y=tests[::-1],
                      hoverinfo='x+y+z')

trace_bl = go.Heatmap(
                      z=df_bl[tests].corr().values.tolist()[::-1],
                      x=tests,
                      y=tests[::-1],
                      showscale=False,
                      hoverinfo='x+y+z')

trace_ad = go.Heatmap(
                      z=df_ad[tests].corr().values.tolist()[::-1],
                      x=tests,
                      y=tests[::-1],
                      showscale=False,
                      hoverinfo='x+y+z')

trace_sm = go.Heatmap(
                      z=df_sm[tests].corr().values.tolist()[::-1],
                      x=tests,
                      y=tests[::-1],
                      showscale=False,
                      hoverinfo='x+y+z')

trace_em = go.Heatmap(
                      z=df_em[tests].corr().values.tolist()[::-1],
                      x=tests,
                      y=tests[::-1],
                      showscale=False,
                      hoverinfo='x+y+z')

trace_lm = go.Heatmap(
                      z=df_lm[tests].corr().values.tolist()[::-1],
                      x=tests,
                      y=tests[::-1],
                      showscale=False,
                      hoverinfo='x+y+z')

# make subplots
fig = plotly.tools.make_subplots(rows=2, cols=3, subplot_titles=('Whole Group', 'CN', 'AD', 'SMC', 'EMCI', 'LMCI'), print_grid=False)
fig.append_trace(trace_bl, 1, 1)
fig.append_trace(trace_cn, 1, 2)
fig.append_trace(trace_ad, 1, 3)
fig.append_trace(trace_sm, 2, 1)
fig.append_trace(trace_em, 2, 2)
fig.append_trace(trace_lm, 2, 3)

fig['layout'].update(showlegend=False)
py.iplot(fig, filename='correlation')

There are two primary points to note in these heatmaps:

1. Within each diagnostic group, most pairs of measures are only weakly correlated, with the partial exception of AD and late MCI. In AD, the CDRSB and the ADAS13 become moderately correlated (r ≈ 0.45). This is interesting and may be useful in modeling, in that we may consider using the correlation between these two tests as a feature in itself. The correlation is present in late MCI as well, although it is weaker (r ≈ 0.28). The much stronger correlations in the whole group (|r| between roughly 0.67 and 0.78 among the cognitive assessments) likely reflect differences between the diagnostic groups rather than relationships within them.

2. The MOCA has a fairly strong positive correlation with the MMSE and a negative correlation with the ADAS13. Since we will run an initial model without the MOCA, we can leave this for now.
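To make the redundancy between assessments explicit, the pairs whose absolute correlation exceeds a cut-off can be listed programmatically. The sketch below reuses the whole-sample correlation values printed earlier; the threshold of 0.7 is an assumption chosen only for illustration.

```python
from itertools import combinations
import pandas as pd

cols = ["age", "cdrsb_bl", "adas13_bl", "mmse_bl", "moca_bl"]
# whole-sample correlation matrix copied from the output above
corr = pd.DataFrame(
    [[1.0, 0.059861, 0.139857, -0.132319, -0.187296],
     [0.059861, 1.0, 0.743807, -0.71902, -0.673245],
     [0.139857, 0.743807, 1.0, -0.741212, -0.784323],
     [-0.132319, -0.71902, -0.741212, 1.0, 0.707536],
     [-0.187296, -0.673245, -0.784323, 0.707536, 1.0]],
    index=cols, columns=cols)

threshold = 0.7  # assumed cut-off, for illustration only

# walk the upper triangle once and keep pairs whose |r| exceeds the threshold
high_pairs = [(a, b, corr.loc[a, b])
              for a, b in combinations(cols, 2)
              if abs(corr.loc[a, b]) > threshold]

for a, b, r in high_pairs:
    print(f"{a} vs {b}: r = {r:.2f}")
```

The same loop could be run on each per-diagnosis matrix to see which pairs stay correlated within a group.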

### 2.1.3 Volumetric Brain Measures

In this section, we'll look at how volumetric brain measurements from the structural MRIs differ (or not) between the diagnosis categories. The measurements in the dataset are either cross sectional (denoted with an x) or longitudinal (denoted with an l). Cross sectional data points are generated by segmenting a patient's scan using a common template. This provides a volumetric estimate of brain regions that can then be compared between groups, as the segmentation is relative to a constant across all patients. The longitudinal data is processed within subject, meaning that a template is generated for each patient and changes in brain morphology are calculated relative to that template. This provides a more reliable estimate of how an individual's brain changes over time, but is more time intensive to generate, which is why there are fewer of these in the dataset.

Also, it is generally thought that the larger a brain structure is, the better it is able to perform the cognitive functions associated with it. There are caveats to this: the volume of a brain region needs to be normalized by the volume of the whole brain (some people just have bigger brains), and there are upper bounds to the relationship between volume and ability. For the hippocampus, however, there are studies showing that larger hippocampi are associated with better memory performance. For example, see the following publication.

Bohbot, V. D., Lerch, J., Thorndycraft, B., Iaria, G., & Zijdenbos, A. P. (2007). Gray Matter Differences Correlate with Spontaneous Strategies in a Human Virtual Navigation Task. Journal of Neuroscience, 27(38), 10078–10083. http://doi.org/10.1523/JNEUROSCI.1763-07.2007
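The normalization caveat mentioned above can be sketched in a few lines. The column names follow the dataset, but the volumes are made up, both columns are assumed to share the same units, and the derived column name `hippocampus_l_pct` is hypothetical.

```python
import pandas as pd

# hypothetical volumes in mm^3; values are invented for illustration
toy = pd.DataFrame({
    "l_hippocampus_l": [3500.0, 3000.0],
    "wholebrain_bl": [1000000.0, 1200000.0],
})

# express the regional volume as a percentage of whole-brain volume
toy["hippocampus_l_pct"] = toy["l_hippocampus_l"] / toy["wholebrain_bl"] * 100
print(toy)
```

Dividing by intracranial volume (`icv_bl`) instead would be an equally plausible choice; which denominator is more appropriate is not settled by this notebook.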

For now, we'll look at the baseline averages for the ventricles, the whole brain, and intracranial volume (ICV), as well as the longitudinal data for the hippocampus and entorhinal cortex.

In [44]:
vector_dict.metadata['MRI']

['Ventricles_bl',
 'WholeBrain_bl',
 'ICV_bl',
 'l_hippocampus_l',
 'l_hippocampus_r',
 'x_hippocampus_l',
 'x_hippocampus_r',
 'l_entorhinal_l',
 'l_entorhinal_r',
 'l_entorhinal_l_thick',
 'l_entorhinal_r_thick',
 'x_entorhinal_l',
 'x_entorhinal_r',
 'x_entorhinal_l_thick',
 'x_entorhinal_r_thick']

In [45]:
# create list of brain measures
s_mri = [m.lower() for m in vector_dict.metadata['MRI'] if 'x_' not in m]
# display  list
print(s_mri)

['ventricles_bl', 'wholebrain_bl', 'icv_bl', 'l_hippocampus_l', 'l_hippocampus_r', 'l_entorhinal_l', 'l_entorhinal_r', 'l_entorhinal_l_thick', 'l_entorhinal_r_thick']


In [46]:
# plot ventricles
plot_dx('ventricles_bl')

Overall, the volume of the ventricles appears to increase with cognitive impairment. The CN and SMC groups largely have the same range of volumes, and volumes increase with more severe impairment.

In [47]:
# plot whole brain
plot_dx('wholebrain_bl')

The data here are not as straightforward. This may be because some structures, such as the ventricles, increase in volume in AD while other areas decrease. Let's continue looking at the volumetric estimates and make a note that this measure may not be useful as a feature in the model.

In [48]:
# plot intracranial volume
plot_dx('icv_bl')

Similar to the whole brain, intracranial volume doesn't appear to successfully differentiate the groups.

In [49]:
# plot hippocampus
plot_dx_multibox(s_mri[3:5], 1, 2)

The overall pattern here is what we would expect. AD is neurodegenerative, in that there is atrophy in brain regions, especially those like the hippocampus that are critical for memory. Lower volumetric estimates indicate the brain region is smaller. In both the left and right hippocampus, we can see that there is a decrease of volume with increasing cognitive impairment. Interestingly though, the SMC group has larger hippocampi than the CN group, although the difference likely isn't statistically significant.

In [50]:
# plot entorhinal cortex
plot_dx_multibox(s_mri[5:7], 1, 2)

Similar to the hippocampus, we can see that the volume of the entorhinal cortex appears to relate to cognitive impairment. Lastly, let's look at the cortical thickness of the entorhinal cortex.

In [51]:
# plot entorhinal cortex thickness
plot_dx_multibox(s_mri[7:], 1, 2)

These estimates resemble the volumetric ones, except that there appear to be more outliers with lower thickness values in the CN and EMCI groups. It may be that lower thickness values in the entorhinal cortex are associated with a higher rate of conversion to AD. We'll evaluate this in section 3, where we look at the characteristics of patients that have a diagnostic change.

### 2.1.4 Ecog Measures

Now we'll look at some of the measures that we'll use in reduced models. These will be reduced models since the measures were only collected for a smaller subset of the total sample. The first ones we'll look at come from the Everyday Cognition (ECog) questionnaire, in which patients rate their own everyday functioning. The two baseline scores used here cover memory and visuospatial abilities.

In [52]:
# create list of baseline ECog measures
ecog = [m.lower() for m in vector_dict.metadata['ECog'] if 'bl' in m]
# display  list
print(ecog)

['ecogptmem_bl', 'ecogptvisspat_bl']


In [53]:
# plot memory task
plot_dx(ecog[0])

The ECog memory scores look like they are useful in distinguishing CN from AD, but not AD from MCI.

In [54]:
measures_to_drop.append(ecog[0])

In [55]:
# plot visual spatial task
plot_dx(ecog[1])

The visuospatial scores look similar to the memory scores, but with more variance in each of the groups. They do not appear to be useful in distinguishing levels of cognitive impairment.

In [56]:
measures_to_drop.append(ecog[1])

### 2.1.5 fMRI Functional Connectivity

In [57]:
# create list of brain measures
fmri = [m.lower() for m in vector_dict.metadata['fMRI']]
# display  list
print(fmri)

['admnrv', 'pdmnrv', 'dmnrvr']


In [58]:
# plot anterior default mode network values
plot_dx(fmri[0])

There is more variance in the AD group, but overall this measure does not appear to be indicative of diagnosis type. We can revisit the variance issue as a potential feature when we look at the characteristics of diagnostic change in section 3.

In [59]:
# plot posterior default mode network
plot_dx(fmri[1])

Here, there is a large variance associated with the CN group. As with the anterior DMN measure, the variance may be useful in modeling. We'll explore this in section 3.

### 2.1.6 DTI Measures

Diffusion tensor imaging (DTI) can produce a number of metrics. The two of interest in the dataset are fractional anisotropy (FA) and mean diffusivity (MD). DTI characterizes the diffusion of water molecules in the brain. The core idea for the current study is that diffusion is non-random where there are neuronal axons, as the axonal membranes and myelin sheaths of white matter constrain water to diffuse mainly along the direction of the fibers. FA represents the deviation from random (isotropic) diffusion of water molecules in a structure and is taken as a measure of the structural integrity of a brain region. Higher values of FA indicate more integrity. FA values in the hippocampus have been shown to correlate with spatial memory performance. See the publication:

Iaria, G., Lanyon, L. J., Fox, C. J., Giaschi, D., & Barton, J. J. S. (2008). Navigational skills correlate with hippocampal fractional anisotropy in humans. Hippocampus, 18(4), 335–339. http://doi.org/10.1002/hipo.20400

MD is the average of the three eigenvalues of the diffusion tensor, which describe the diffusivity of water along three orthogonal directions in a brain region. Higher MD values represent more structural damage to a particular region.
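For readers who want to see how the two metrics relate, the standard definitions can be written out from the three diffusion-tensor eigenvalues. This is a sketch for intuition only: the function names and the example eigenvalues are illustrative, and the dataset already provides precomputed FA and MD values.

```python
import math

def mean_diffusivity(l1, l2, l3):
    """Mean of the three diffusion-tensor eigenvalues."""
    return (l1 + l2 + l3) / 3.0

def fractional_anisotropy(l1, l2, l3):
    """FA in [0, 1]: 0 for isotropic diffusion, 1 for diffusion along one axis."""
    md = mean_diffusivity(l1, l2, l3)
    num = (l1 - md) ** 2 + (l2 - md) ** 2 + (l3 - md) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    if den == 0:
        return 0.0  # no diffusion at all; treat as isotropic
    return math.sqrt(1.5 * num / den)

# illustrative eigenvalues, in units of 1e-3 mm^2/s
fa_isotropic = fractional_anisotropy(1.0, 1.0, 1.0)  # equal in all directions
fa_fiber = fractional_anisotropy(1.7, 0.3, 0.2)      # elongated, fiber-like
print(fa_isotropic, fa_fiber, mean_diffusivity(1.7, 0.3, 0.2))
```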

In [60]:
# create list of brain measures
dti = [m.lower() for m in vector_dict.metadata['DTI']]
# display  list
print(dti)

['fa_hippocampus_l', 'fa_hippocampus_r', 'md_hippocampus_l', 'md_hippocampus_r']


In [61]:
# plot fractional anisotropy in hippocampus
plot_dx_multibox(dti[:2], 1, 2)

These measures appear to have utility in differentiating CN and AD, but not the other diagnosis types.

In [62]:
# plot mean diffusivity in hippocampus
plot_dx_multibox(dti[2:], 1, 2)

MD, particularly in the left hippocampus, seems to be well differentiated between diagnosis types. Next, let's check to see if the DTI measures are correlated.

In [63]:
df_bl['avg_md'] = (df_bl['md_hippocampus_l'] + df_bl['md_hippocampus_r'])/2

In [64]:
df_bl['avg_fa'] = (df_bl['fa_hippocampus_l'] + df_bl['fa_hippocampus_r'])/2

In [65]:
df_bl[['avg_md', 'avg_fa']].corr()

| | avg_md | avg_fa |
|---|---|---|
| avg_md | 1.0 | -0.610639 |
| avg_fa | -0.610639 | 1.0 |


The two measures have a fairly strong negative correlation (r ≈ -0.61). Because MD looks more sensitive to diagnosis type, let's remove FA from the dataset prior to modeling.

In [66]:
measures_to_drop.append('fa_hippocampus_l')
measures_to_drop.append('fa_hippocampus_r')

### 2.1.7 Genetic Measure

The final measure to visualize is the number of copies of the APOE4 allele that a patient carries.

In [67]:
# diagnosis labels, listed in the same order as the per-group dataframes below
dx = ['CN', 'AD', 'LMCI', 'EMCI', 'SMC']
groups = [df_cn, df_ad, df_lm, df_em, df_sm]

# no copies of the APOE4 allele
y0 = [(g.apoe4 == 0).sum() for g in groups]

# one copy of the APOE4 allele
y1 = [(g.apoe4 == 1).sum() for g in groups]

# two copies of the APOE4 allele
y2 = [(g.apoe4 == 2).sum() for g in groups]

# generate traces
trace0 = go.Bar(
                x=dx,
                y=y0,
                name='Not Present',
                opacity=0.75)

trace1 = go.Bar(
                x=dx,
                y=y1,
                name='One Copy',
                opacity=0.75)

trace2 = go.Bar(
                x=dx,
                y=y2,
                name='Two Copies',
                opacity=0.75)

# link and plot
data = [trace0, trace1, trace2]
layout = go.Layout(
                   title='Presence of APOE4 Allele',
                   xaxis=dict(title='Diagnosis Type'),
                   yaxis=dict(title='Count'))
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='apoe4')

We can see that in AD and late MCI, there is a high proportion of patients with at least one copy of the APOE4 allele. In section 3, we'll explore the degree to which persons with the APOE4 allele convert to AD compared to those without it. Before we move on to looking at diagnostic change though, let's summarize what we've found so far and characterize the typical person with AD in the dataset.
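Since the bar chart shows raw counts and the groups differ in size, proportions are easier to compare across diagnoses. A minimal sketch with made-up rows, using `pd.crosstab` with `normalize="index"` so that each diagnosis row sums to 1:

```python
import pandas as pd

# toy baseline rows; apoe4 is the number of APOE4 allele copies (0, 1, or 2)
toy_bl = pd.DataFrame({
    "dx_bl": ["CN", "CN", "CN", "CN", "AD", "AD", "AD", "AD"],
    "apoe4": [0, 0, 0, 1, 0, 1, 1, 2],
})

# row-normalized table: share of each diagnosis group with 0, 1, or 2 copies
apoe_share = pd.crosstab(toy_bl["dx_bl"], toy_bl["apoe4"], normalize="index")

# share of each group carrying at least one copy
carrier_share = (toy_bl["apoe4"] > 0).groupby(toy_bl["dx_bl"]).mean()

print(apoe_share)
print(carrier_share)
```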

### 2.1.8 A Portrait of Alzheimer's Disease

Below, we'll use the demographic information to reveal what traits characterize the average person with AD in the dataset.

In [68]:
# age trace
trace_age = go.Bar(
                x=['AD', 'CN', 'SMC', 'Early MCI', 'Late MCI'],
                y=[df_ad.age.mean(), df_cn.age.mean(), df_sm.age.mean(), df_em.age.mean(), df_lm.age.mean()],
                hoverinfo='y',
                marker=dict(
                            color=['rgba(222,45,38,0.8)', 'rgba(204,204,204,1)',
                                   'rgba(204,204,204,1)', 'rgba(204,204,204,1)',
                                   'rgba(204,204,204,1)']))
# gender trace
trace_gender = go.Bar(
                      y=['Female', 'Male'],
                      x=[df_ad.ptgender[df_ad.ptgender == 'Female'].count()/len(df_ad.ptgender)*100, 
                         df_ad.ptgender[df_ad.ptgender == 'Male'].count()/len(df_ad.ptgender)*100],
                      orientation='h',
                      hoverinfo='x',
                      marker=dict(
                                  color=['rgb(245,169,114)', 'rgb(116,159,199)']),  # Female orange, Male blue, as in the earlier plots
                      opacity = 0.75)
# education trace
trace_edu = go.Bar(
                   x=['AD', 'CN', 'SMC', 'Early MCI', 'Late MCI'],
                   y=[df_ad.pteducat.mean(), df_cn.pteducat.mean(), df_sm.pteducat.mean(), df_em.pteducat.mean(), df_lm.pteducat.mean()],
                   hoverinfo='y',
                   marker=dict(
                            color=['rgba(222,45,38,0.8)', 'rgba(204,204,204,1)',
                                   'rgba(204,204,204,1)', 'rgba(204,204,204,1)',
                                   'rgba(204,204,204,1)']))


# marital status trace (labels are taken from the counts so the two stay aligned)
marital_counts = df_ad.ptmarry.value_counts()
trace_marital = go.Bar(
                       x=list(marital_counts.index),
                       y=list(marital_counts.values),
                       hoverinfo='y',
                       opacity=0.75)
    
# make subplots
fig = plotly.tools.make_subplots(rows=2, cols=2, subplot_titles=('Mean Age', 'Percent of Gender', 'Mean Education', 'Marital Status'), print_grid=False)
fig.append_trace(trace_age, 1, 1)
fig.append_trace(trace_gender, 1, 2)
fig.append_trace(trace_edu, 2, 1)
fig.append_trace(trace_marital, 2, 2)

# update layout
fig['layout']['yaxis1'].update(title='Age', range=[70,75])
fig['layout']['yaxis3'].update(title='Years of Education', range=[10,20])
fig['layout'].update(showlegend=False)
py.iplot(fig, filename='ad_portrait')

The average person with AD tends to be male and older than the other patients in the study. They are less educated and are currently married.

## 2.2 Diagnostic change

This section will explore our target variable, which is the conversion to AD. This is coded by the diagnostic change (DXCHANGE) column in the dataframe. The different categories for this variable are:

In [70]:
for k,v in vector_dict.dx_change_ids.items():
    print(f'Value {k} represents {v}')

Value 1 represents Stable:NL to NL
Value 2 represents Stable:MCI to MCI
Value 3 represents Stable:AD to AD
Value 4 represents Conv:NL to MCI
Value 5 represents Conv:MCI to AD
Value 6 represents Conv:NL to AD
Value 7 represents Rev:MCI to NL
Value 8 represents Rev:AD to MCI
Value 9 represents Rev:AD to NL
Value -1 represents Not available


The two values that we're interested in are 5, the conversion from MCI to AD, and 6, the conversion of cognitively normal to AD. Let's explore some of the characteristics in the dataset associated with these two diagnostic changes.
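One way to turn these codes into a modeling target is a boolean flag per visit, collapsed to one label per patient. The sketch below uses made-up visit rows; the `converted` column and the `CONVERSION_CODES` constant are hypothetical names introduced for illustration.

```python
import pandas as pd

CONVERSION_CODES = [5, 6]  # Conv:MCI to AD and Conv:NL to AD, from the listing above

# toy visit-level rows; rid is the patient ID, values are invented
visits = pd.DataFrame({
    "rid":      [1, 1, 1, 2, 2, 3, 3],
    "dxchange": [2, 2, 5, 1, 1, 2, -1],
})

# flag visits at which a conversion to AD was recorded
visits["converted"] = visits["dxchange"].isin(CONVERSION_CODES)

# one label per patient: was a conversion recorded at any visit?
ever_converted = visits.groupby("rid")["converted"].any()
print(ever_converted)
```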

### 2.2.1 Diagnostic change data cleaning

First, we need to explore the dataset to ensure that we have reliable data for the diagnostic change values. As we've already dealt with missing values, we need to inspect the patients that convert and ensure that their diagnostic change timeline makes sense.

In [100]:
# patient ID for every visit at which a conversion to AD (5 or 6) is recorded
converts = df.loc[df.dxchange.isin([5, 6]), 'rid'].tolist()
# display how many visits record a conversion (a patient can appear more than once)
print(len(converts))

373


In [96]:
# extract baseline info of patients that convert
dx_bl_df = df_bl[df_bl.rid.isin(converts)]

In [98]:
dx_bl_df.shape

(345, 63)

There is a mismatch between the patient IDs that converted and those extracted from the baseline dataframe. Let's check why this might be.

In [164]:
print(f'There are a total of {len(set(converts))} unique patient IDs.')

There are a total of 340 unique patient IDs.


In [166]:
# list to contain non-duplicate patient IDs
converts_u = []
# set to contain patient IDs with more than one occurance
converts_multi = set()
# loop through patient IDs
for patient in converts:
    # add patient ID if not a duplicate
    if patient not in converts_u and patient not in converts_multi:
        converts_u.append(patient)
    # add duplicates
    else:
        converts_multi.add(patient)
        try:
            converts_u.remove(patient)
        except ValueError:
            # patient was already removed on an earlier repeat
            pass

In [167]:
print(f'There are a total of {len(converts_u)} patients that convert only once and {len(converts_multi)} patients with multiple conversions.')

There are a total of 311 patients that convert only once and 29 patients with multiple conversions.
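The split into single and repeated conversions can also be sketched with `value_counts`. This is an alternative to the loop above, not the notebook's method, and the patient IDs below are invented.

```python
import pandas as pd

# Hypothetical list of patient IDs, one entry per converting visit
converts_toy = [10, 11, 11, 12, 13, 13, 13]

# Number of converting visits recorded for each patient ID
counts = pd.Series(converts_toy).value_counts()

# Patients with exactly one converting visit
converts_once = sorted(counts[counts == 1].index.tolist())
# Patients with more than one converting visit
converts_repeat = sorted(counts[counts > 1].index.tolist())

print(f'{len(converts_once)} convert once, {len(converts_repeat)} have repeats')
```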


Although possible, it is unlikely that people are converting in and out of AD. Let's inspect these patients closer and see if we can figure out what might be going on.

In [225]:
dx_multi = df[df.rid.isin(converts_multi)]

In [226]:
dx_multi.dxchange.value_counts()

2.0    104
5.0     62
3.0     37
1.0     22
4.0      2
8.0      1
Name: dxchange, dtype: int64

There is only one instance of a patient reverting from AD to MCI (value = 8). There are also no occurrences of a CN patient converting to AD (value = 6).

In [241]:
dx_multi.head(20)

Unnamed: 0,rid,viscode,d1,d2,dx_bl,dxchange,age,ptgender,pteducat,ptethcat,...,av1451_entorhinal_l,av1451_entorhinal_r,fa_hippocampus_l,fa_hippocampus_r,md_hippocampus_l,md_hippocampus_r,apoe4,admnrv,pdmnrv,dmnrvr
272,61,0,1,1,CN,1.0,77.0,Female,15,Not Hisp/Latino,...,,,,,,,0.0,,,
273,61,6,1,1,CN,1.0,77.0,Female,15,Not Hisp/Latino,...,,,,,,,0.0,,,
274,61,12,1,1,CN,1.0,77.0,Female,15,Not Hisp/Latino,...,,,,,,,0.0,,,
275,61,18,1,1,CN,1.0,77.0,Female,15,Not Hisp/Latino,...,,,,,,,0.0,,,
276,61,24,1,1,CN,1.0,77.0,Female,15,Not Hisp/Latino,...,,,,,,,0.0,,,
277,61,36,1,1,CN,1.0,77.0,Female,15,Not Hisp/Latino,...,,,,,,,0.0,,,
278,61,48,1,1,CN,1.0,77.0,Female,15,Not Hisp/Latino,...,,,,,,,0.0,,,
279,61,60,1,1,CN,1.0,77.0,Female,15,Not Hisp/Latino,...,,,,,,,0.0,,,
280,61,72,1,1,CN,1.0,77.0,Female,15,Not Hisp/Latino,...,,,,,,,0.0,,,
281,61,84,1,1,CN,4.0,77.0,Female,15,Not Hisp/Latino,...,,,,,,,0.0,,,


Looking at patients 61 and 101, it seems that when a patient converts, the conversion is recorded at multiple visits. Let's fix this by checking whether a patient has two or more consecutive occurrences of conversion to AD.

In [228]:
# list to store index values of visits that have repeated conversion diagnostic changes
dx_multi_index = []
for i,r in dx_multi.iterrows():
    # skip first row
    if i == 281:
        pass
    # check to see if current and previous diagnostic change are 5
    elif dx_multi.loc[i, 'dxchange'] == 5 and dx_multi.loc[i-1, 'dxchange'] == 5:
        # check to see if same patient
        if dx_multi.loc[i, 'rid'] == dx_multi.loc[i-1, 'rid']:
            # record index of visit
            dx_multi_index.append(i)

In [231]:
# recode dxchange 5 to 3 for repeated entries
for i in dx_multi_index:
    df.loc[i, 'dxchange'] = 3
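The check for consecutive conversion codes can also be written with a per-patient `shift`. The sketch below is one possible vectorized version, not the notebook's code. It assumes rows are sorted by visit within each patient and uses an invented frame in place of `dx_multi`.

```python
import pandas as pd

# Toy stand-in for `dx_multi`; values are invented for illustration.
toy = pd.DataFrame({
    'rid':      [1, 1, 1, 1, 2, 2, 2],
    'viscode':  [0, 6, 12, 18, 0, 6, 12],
    'dxchange': [2, 5, 5, 5, 5, 5, 3],
})

# Previous visit's code within the same patient (NaN on each patient's first row)
prev_code = toy.groupby('rid')['dxchange'].shift(1)

# A visit is a repeat when it and the patient's previous visit are both coded 5
is_repeat = (toy['dxchange'] == 5) & (prev_code == 5)

# Recode repeats as stable AD (3); the first conversion of each run is kept
toy.loc[is_repeat, 'dxchange'] = 3

print(toy['dxchange'].tolist())
```

Grouping by `rid` before shifting means the first row of each patient is never compared against the last row of the previous patient.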

Now let's check to see that we've cleaned the dataset properly and have no patients with repeating conversions to AD.

In [232]:
# list for patient IDs of converts
converts = []
for i,r in df.iterrows():
    if r.dxchange == 5 or r.dxchange == 6:
        converts.append(r.rid)
# display how many patients convert
print(len(converts))

351


In [233]:
len(set(converts))

340

There are still problems with 11 patients. Let's pull the patient IDs for these and look at them.

In [234]:
# list to contain non-duplicate patient IDs
converts_u = []
# set to contain patient IDs with more than one occurrence
converts_multi = set()
# loop through patient IDs
for patient in converts:
    # add patient ID if not a duplicate
    if patient not in converts_u and patient not in converts_multi:
        converts_u.append(patient)
    # add duplicates
    else:
        converts_multi.add(patient)
        try:
            converts_u.remove(patient)
        except ValueError:
            # patient was already removed on an earlier repeat
            pass

In [239]:
df[df.rid.isin(converts_multi)]

Unnamed: 0,rid,viscode,d1,d2,dx_bl,dxchange,age,ptgender,pteducat,ptethcat,...,av1451_entorhinal_l,av1451_entorhinal_r,fa_hippocampus_l,fa_hippocampus_r,md_hippocampus_l,md_hippocampus_r,apoe4,admnrv,pdmnrv,dmnrvr
767,166,0,1,1,CN,1.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
768,166,6,1,1,CN,1.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
769,166,12,1,1,CN,1.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
770,166,18,1,1,CN,1.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
771,166,24,1,1,CN,1.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
772,166,36,1,1,CN,1.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
773,166,60,1,1,CN,5.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
774,166,72,1,1,CN,2.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
775,166,84,1,1,CN,5.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
776,166,96,1,1,CN,3.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,


In some cases, such as patient 166, a conversion to MCI appears to be coded as a conversion to AD. This is evident because the patient enters the study as CN, is coded as converting to AD, is then recorded as stable MCI, and later converts to AD. In other cases, such as patient 4947, there appear to be incorrect entries (index 9750) in the dxchange column. Because the patients who convert to AD are critical to our model, we'll go through each of these and fix them one by one.

In [244]:
# change set to list
converts_multi = list(converts_multi)
# display first patient
df[df.rid == converts_multi[0]]

Unnamed: 0,rid,viscode,d1,d2,dx_bl,dxchange,age,ptgender,pteducat,ptethcat,...,av1451_entorhinal_l,av1451_entorhinal_r,fa_hippocampus_l,fa_hippocampus_r,md_hippocampus_l,md_hippocampus_r,apoe4,admnrv,pdmnrv,dmnrvr
7894,4414,0,1,1,LMCI,2.0,60.8,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
7895,4414,6,1,1,LMCI,5.0,60.8,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
7896,4414,12,1,1,LMCI,3.0,60.8,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
7897,4414,24,1,1,LMCI,5.0,60.8,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
7898,4414,36,1,1,LMCI,3.0,60.8,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
7899,4414,48,1,1,LMCI,3.0,60.8,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,


In [245]:
# AD stable
df.loc[7897, 'dxchange'] = 3

In [246]:
# display next patient
df[df.rid == converts_multi[1]]

Unnamed: 0,rid,viscode,d1,d2,dx_bl,dxchange,age,ptgender,pteducat,ptethcat,...,av1451_entorhinal_l,av1451_entorhinal_r,fa_hippocampus_l,fa_hippocampus_r,md_hippocampus_l,md_hippocampus_r,apoe4,admnrv,pdmnrv,dmnrvr
8409,4542,0,1,1,LMCI,2.0,79.3,Female,16,Not Hisp/Latino,...,,,,,,,2.0,0.83879,0.66878,1.2542
8410,4542,3,1,1,LMCI,2.0,79.3,Female,16,Not Hisp/Latino,...,,,,,,,2.0,0.61526,0.88235,0.69729
8411,4542,6,1,1,LMCI,5.0,79.3,Female,16,Not Hisp/Latino,...,,,,,,,2.0,0.62512,0.66809,0.93569
8412,4542,12,1,1,LMCI,3.0,79.3,Female,16,Not Hisp/Latino,...,,,,,,,2.0,0.86926,0.72541,1.1983
8413,4542,24,1,1,LMCI,5.0,79.3,Female,16,Not Hisp/Latino,...,,,,,,,2.0,0.7393,0.68384,1.0811
8414,4542,36,1,1,LMCI,3.0,79.3,Female,16,Not Hisp/Latino,...,,,,,,,2.0,,,


In [247]:
# AD stable
df.loc[8413, 'dxchange'] = 3

In [248]:
# display next patient
df[df.rid == converts_multi[2]]

Unnamed: 0,rid,viscode,d1,d2,dx_bl,dxchange,age,ptgender,pteducat,ptethcat,...,av1451_entorhinal_l,av1451_entorhinal_r,fa_hippocampus_l,fa_hippocampus_r,md_hippocampus_l,md_hippocampus_r,apoe4,admnrv,pdmnrv,dmnrvr
7390,4293,0,1,1,LMCI,2.0,69.7,Male,12,Not Hisp/Latino,...,,,,,,,0.0,0.56881,0.65893,0.86323
7391,4293,3,1,1,LMCI,2.0,69.7,Male,12,Not Hisp/Latino,...,,,,,,,0.0,0.63867,0.6489,0.98423
7392,4293,6,1,1,LMCI,2.0,69.7,Male,12,Not Hisp/Latino,...,,,,,,,0.0,0.56812,0.62121,0.91454
7393,4293,12,1,1,LMCI,2.0,69.7,Male,12,Not Hisp/Latino,...,,,,,,,0.0,0.67492,0.58268,1.1583
7394,4293,24,1,1,LMCI,5.0,69.7,Male,12,Not Hisp/Latino,...,,,,,,,0.0,0.6142,0.58311,1.0533
7395,4293,36,1,1,LMCI,2.0,69.7,Male,12,Not Hisp/Latino,...,,,,,,,0.0,,,
7396,4293,48,1,1,LMCI,5.0,69.7,Male,12,Not Hisp/Latino,...,,,,,,,0.0,0.84043,0.58839,1.4284


In [249]:
# MCI stable
df.loc[7394, 'dxchange'] = 2

In [252]:
# display next patient
df[df.rid == converts_multi[3]]

Unnamed: 0,rid,viscode,d1,d2,dx_bl,dxchange,age,ptgender,pteducat,ptethcat,...,av1451_entorhinal_l,av1451_entorhinal_r,fa_hippocampus_l,fa_hippocampus_r,md_hippocampus_l,md_hippocampus_r,apoe4,admnrv,pdmnrv,dmnrvr
767,166,0,1,1,CN,1.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
768,166,6,1,1,CN,1.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
769,166,12,1,1,CN,1.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
770,166,18,1,1,CN,1.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
771,166,24,1,1,CN,1.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
772,166,36,1,1,CN,1.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
773,166,60,1,1,CN,5.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
774,166,72,1,1,CN,2.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
775,166,84,1,1,CN,5.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
776,166,96,1,1,CN,3.0,72.5,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,


In [253]:
# convert to MCI
df.loc[773, 'dxchange'] = 4

In [254]:
# display next patient
df[df.rid == converts_multi[4]]

Unnamed: 0,rid,viscode,d1,d2,dx_bl,dxchange,age,ptgender,pteducat,ptethcat,...,av1451_entorhinal_l,av1451_entorhinal_r,fa_hippocampus_l,fa_hippocampus_r,md_hippocampus_l,md_hippocampus_r,apoe4,admnrv,pdmnrv,dmnrvr
9106,4741,0,1,1,LMCI,2.0,61.6,Male,16,Not Hisp/Latino,...,,,,,,,0.0,,,
9107,4741,3,1,1,LMCI,2.0,61.6,Male,16,Not Hisp/Latino,...,,,,,,,0.0,,,
9108,4741,6,1,1,LMCI,5.0,61.6,Male,16,Not Hisp/Latino,...,,,,,,,0.0,,,
9109,4741,12,1,1,LMCI,2.0,61.6,Male,16,Not Hisp/Latino,...,,,,,,,0.0,,,
9110,4741,24,1,1,LMCI,5.0,61.6,Male,16,Not Hisp/Latino,...,,,,,,,0.0,,,
9111,4741,36,1,1,LMCI,3.0,61.6,Male,16,Not Hisp/Latino,...,,,,,,,0.0,,,
9112,4741,48,1,1,LMCI,3.0,61.6,Male,16,Not Hisp/Latino,...,,,,,,,0.0,,,


In [255]:
# MCI stable
df.loc[9108, 'dxchange'] = 2

In [256]:
# display next patient
df[df.rid == converts_multi[5]]

Unnamed: 0,rid,viscode,d1,d2,dx_bl,dxchange,age,ptgender,pteducat,ptethcat,...,av1451_entorhinal_l,av1451_entorhinal_r,fa_hippocampus_l,fa_hippocampus_r,md_hippocampus_l,md_hippocampus_r,apoe4,admnrv,pdmnrv,dmnrvr
4171,1097,0,1,0,LMCI,2.0,73.0,Male,18,Not Hisp/Latino,...,,,,,,,0.0,,,
4172,1097,6,1,0,LMCI,2.0,73.0,Male,18,Not Hisp/Latino,...,,,,,,,0.0,,,
4173,1097,12,1,0,LMCI,2.0,73.0,Male,18,Not Hisp/Latino,...,,,,,,,0.0,,,
4174,1097,18,1,0,LMCI,2.0,73.0,Male,18,Not Hisp/Latino,...,,,,,,,0.0,,,
4175,1097,24,1,0,LMCI,2.0,73.0,Male,18,Not Hisp/Latino,...,,,,,,,0.0,,,
4176,1097,36,1,0,LMCI,2.0,73.0,Male,18,Not Hisp/Latino,...,,,,,,,0.0,,,
4177,1097,48,1,0,LMCI,5.0,73.0,Male,18,Not Hisp/Latino,...,,,,,,,0.0,,,
4178,1097,60,1,0,LMCI,3.0,73.0,Male,18,Not Hisp/Latino,...,,,,,,,0.0,,,
4179,1097,72,1,0,LMCI,5.0,73.0,Male,18,Not Hisp/Latino,...,,,,,,,0.0,,,


In [257]:
# AD stable
df.loc[4179, 'dxchange'] = 3

In [258]:
# display next patient
df[df.rid == converts_multi[6]]

Unnamed: 0,rid,viscode,d1,d2,dx_bl,dxchange,age,ptgender,pteducat,ptethcat,...,av1451_entorhinal_l,av1451_entorhinal_r,fa_hippocampus_l,fa_hippocampus_r,md_hippocampus_l,md_hippocampus_r,apoe4,admnrv,pdmnrv,dmnrvr
7971,4430,0,1,1,LMCI,2.0,80.0,Male,15,Not Hisp/Latino,...,,,,,,,0.0,,,
7972,4430,3,1,1,LMCI,2.0,80.0,Male,15,Not Hisp/Latino,...,,,,,,,0.0,,,
7973,4430,6,1,1,LMCI,5.0,80.0,Male,15,Not Hisp/Latino,...,,,,,,,0.0,,,
7974,4430,12,1,1,LMCI,2.0,80.0,Male,15,Not Hisp/Latino,...,,,,,,,0.0,,,
7975,4430,24,1,1,LMCI,3.0,80.0,Male,15,Not Hisp/Latino,...,,,,,,,0.0,,,
7976,4430,36,1,1,LMCI,5.0,80.0,Male,15,Not Hisp/Latino,...,,,,,,,0.0,,,
7977,4430,48,1,1,LMCI,3.0,80.0,Male,15,Not Hisp/Latino,...,,,,,,,0.0,,,


This patient is more complicated. Either they are stable MCI at month 6, or they convert to AD at month 6 and months 12 and 24 are recorded incorrectly. Because we cannot tell which is the case, this patient will be dropped.

In [259]:
patients_to_drop = [4430]

In [260]:
# display next patient
df[df.rid == converts_multi[7]]

Unnamed: 0,rid,viscode,d1,d2,dx_bl,dxchange,age,ptgender,pteducat,ptethcat,...,av1451_entorhinal_l,av1451_entorhinal_r,fa_hippocampus_l,fa_hippocampus_r,md_hippocampus_l,md_hippocampus_r,apoe4,admnrv,pdmnrv,dmnrvr
5140,1394,0,1,0,LMCI,2.0,77.1,Male,18,Not Hisp/Latino,...,,,,,,,1.0,,,
5141,1394,6,1,0,LMCI,2.0,77.1,Male,18,Not Hisp/Latino,...,,,,,,,1.0,,,
5142,1394,12,1,0,LMCI,5.0,77.1,Male,18,Not Hisp/Latino,...,,,,,,,1.0,,,
5143,1394,18,1,0,LMCI,3.0,77.1,Male,18,Not Hisp/Latino,...,,,,,,,1.0,,,
5144,1394,24,1,0,LMCI,5.0,77.1,Male,18,Not Hisp/Latino,...,,,,,,,1.0,,,
5145,1394,36,1,0,LMCI,3.0,77.1,Male,18,Not Hisp/Latino,...,,,,,,,1.0,,,
5146,1394,48,1,0,LMCI,3.0,77.1,Male,18,Not Hisp/Latino,...,,,,,,,1.0,,,


In [261]:
# stable AD
df.loc[5144, 'dxchange'] = 3

In [262]:
# display next patient
df[df.rid == converts_multi[8]]

Unnamed: 0,rid,viscode,d1,d2,dx_bl,dxchange,age,ptgender,pteducat,ptethcat,...,av1451_entorhinal_l,av1451_entorhinal_r,fa_hippocampus_l,fa_hippocampus_r,md_hippocampus_l,md_hippocampus_r,apoe4,admnrv,pdmnrv,dmnrvr
9748,4947,0,1,0,EMCI,2.0,74.5,Male,14,Not Hisp/Latino,...,,,,,,,1.0,0.59061,0.55416,1.0658
9749,4947,3,1,0,EMCI,2.0,74.5,Male,14,Not Hisp/Latino,...,,,,,,,1.0,0.71663,0.66484,1.0779
9750,4947,6,1,0,EMCI,5.0,74.5,Male,14,Not Hisp/Latino,...,,,,,,,1.0,0.57304,0.60773,0.94292
9751,4947,12,1,0,EMCI,2.0,74.5,Male,14,Not Hisp/Latino,...,,,,,,,1.0,0.64834,0.69708,0.93009
9752,4947,24,1,0,EMCI,5.0,74.5,Male,14,Not Hisp/Latino,...,,,,,,,1.0,0.59951,0.63133,0.94959


In [263]:
# MCI stable
df.loc[9750, 'dxchange'] = 2

In [264]:
# display next patient
df[df.rid == converts_multi[9]]

Unnamed: 0,rid,viscode,d1,d2,dx_bl,dxchange,age,ptgender,pteducat,ptethcat,...,av1451_entorhinal_l,av1451_entorhinal_r,fa_hippocampus_l,fa_hippocampus_r,md_hippocampus_l,md_hippocampus_r,apoe4,admnrv,pdmnrv,dmnrvr
2276,566,0,1,1,LMCI,2.0,78.8,Male,20,Not Hisp/Latino,...,,,,,,,1.0,,,
2277,566,6,1,1,LMCI,2.0,78.8,Male,20,Not Hisp/Latino,...,,,,,,,1.0,,,
2278,566,12,1,1,LMCI,2.0,78.8,Male,20,Not Hisp/Latino,...,,,,,,,1.0,,,
2279,566,18,1,1,LMCI,2.0,78.8,Male,20,Not Hisp/Latino,...,,,,,,,1.0,,,
2280,566,24,1,1,LMCI,2.0,78.8,Male,20,Not Hisp/Latino,...,,,,,,,1.0,,,
2281,566,36,1,1,LMCI,5.0,78.8,Male,20,Not Hisp/Latino,...,,,,,,,1.0,,,
2282,566,60,1,1,LMCI,2.0,78.8,Male,20,Not Hisp/Latino,...,,,,,,,1.0,,,
2283,566,72,1,1,LMCI,5.0,78.8,Male,20,Not Hisp/Latino,...,,,,,,,1.0,,,
2284,566,84,1,1,LMCI,3.0,78.8,Male,20,Not Hisp/Latino,...,,,,,,,1.0,,,
2285,566,96,1,1,LMCI,3.0,78.8,Male,20,Not Hisp/Latino,...,,,,,,,1.0,,,


In [265]:
# MCI stable
df.loc[2281, 'dxchange'] = 2

In [266]:
# display next patient
df[df.rid == converts_multi[10]]

Unnamed: 0,rid,viscode,d1,d2,dx_bl,dxchange,age,ptgender,pteducat,ptethcat,...,av1451_entorhinal_l,av1451_entorhinal_r,fa_hippocampus_l,fa_hippocampus_r,md_hippocampus_l,md_hippocampus_r,apoe4,admnrv,pdmnrv,dmnrvr
2749,702,0,1,0,LMCI,2.0,85.0,Male,16,Not Hisp/Latino,...,,,,,,,1.0,,,
2750,702,6,1,0,LMCI,2.0,85.0,Male,16,Not Hisp/Latino,...,,,,,,,1.0,,,
2751,702,12,1,0,LMCI,2.0,85.0,Male,16,Not Hisp/Latino,...,,,,,,,1.0,,,
2752,702,18,1,0,LMCI,2.0,85.0,Male,16,Not Hisp/Latino,...,,,,,,,1.0,,,
2753,702,24,1,0,LMCI,5.0,85.0,Male,16,Not Hisp/Latino,...,,,,,,,1.0,,,
2754,702,36,1,0,LMCI,8.0,85.0,Male,16,Not Hisp/Latino,...,,,,,,,1.0,,,
2755,702,48,1,0,LMCI,5.0,85.0,Male,16,Not Hisp/Latino,...,,,,,,,1.0,,,
2756,702,60,1,0,LMCI,3.0,85.0,Male,16,Not Hisp/Latino,...,,,,,,,1.0,,,
2757,702,72,1,0,LMCI,3.0,85.0,Male,16,Not Hisp/Latino,...,,,,,,,1.0,,,


This patient appears correct: they converted to AD, reverted to MCI, and then converted to AD again. We'll leave it as is. Now let's remove the one problematic patient and extract the baseline variables for patients who convert in order to characterize them.

In [275]:
# drop problematic patient from dataframe
df.drop(df[df.rid == patients_to_drop[0]].index, inplace=True)
# drop problematic patient from convert list
converts_multi.remove(patients_to_drop[0])
# initiate final convert list
converts = converts_u + converts_multi

In [276]:
len(converts)

339
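The row-by-row corrections made above could also be gathered into a single mapping so they can be reviewed in one place. The sketch below shows the idea using three of the index and code pairs from the cells above. The frame `toy` is an invented stand-in for `df`, and the `corrections` name is hypothetical.

```python
import pandas as pd

# Toy stand-in for `df`, keeping only three of the rows corrected above
toy = pd.DataFrame(
    {'rid': [4414, 4293, 166], 'dxchange': [5.0, 5.0, 5.0]},
    index=[7897, 7394, 773],
)

# index label -> corrected dxchange code (3 = stable AD, 2 = stable MCI, 4 = NL to MCI)
corrections = {7897: 3, 7394: 2, 773: 4}

# Apply each correction by index label
for idx, code in corrections.items():
    toy.loc[idx, 'dxchange'] = code

print(toy)
```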

In [277]:
# baseline dataframe of patients that convert
dx_df = df[df.rid.isin(converts)]

In [279]:
# extract baseline rows
dx_bl_df = dx_df[dx_df.viscode == 0]

In [288]:
len(dx_bl_df)

344

We can see here that there is still a mismatch between the number of patients that convert and the entries in the dataframe we will analyze. This needs to be cleaned up before we do any further analysis.

In [287]:
dx_bl_df.rid.value_counts()[:5]

4515    3
4250    2
4346    2
4363    2
511     1
Name: rid, dtype: int64

It appears that there are multiple baseline entries for four different patients.

In [304]:
# list to store indexes for visits to remove
visits_to_remove = []
for i,r in df.iterrows():
    try:
        # check to see if duplicate visit code for each patient
        if df.loc[i, 'viscode'] == df.loc[i-1, 'viscode'] and df.loc[i, 'rid'] == df.loc[i-1, 'rid']:
            # record duplicate index
            visits_to_remove.append(i)
    except KeyError:
        # previous index label is missing (first row or a dropped row)
        pass

In [305]:
len(visits_to_remove)

55

In [306]:
# drop duplicate rows
df.drop(df.loc[visits_to_remove].index, inplace=True)
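For comparison, pandas can flag repeated patient and visit pairs directly with `DataFrame.duplicated`. The sketch below uses an invented frame in place of `df`. Unlike the loop above, which only compares each row with the one before it, this flags any repeat of a `(rid, viscode)` pair, so the two approaches agree only when duplicates are adjacent.

```python
import pandas as pd

# Toy stand-in for `df`; values are invented for illustration.
toy = pd.DataFrame({
    'rid':     [1, 1, 1, 2, 2, 2],
    'viscode': [0, 0, 6, 0, 6, 6],
})

# True for every row that repeats an earlier (rid, viscode) pair
is_dup = toy.duplicated(subset=['rid', 'viscode'], keep='first')

# Index labels of the duplicate visits
visits_to_remove_toy = toy.index[is_dup].tolist()

# Drop the duplicates, keeping the first occurrence of each visit
cleaned = toy.drop(index=visits_to_remove_toy)

print(len(visits_to_remove_toy), len(cleaned))
```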

In [307]:
# baseline dataframe of patients that convert
dx_df = df[df.rid.isin(converts)]

In [308]:
# extract baseline rows
dx_bl_df = dx_df[dx_df.viscode == 0]

In [309]:
len(dx_bl_df)

339

Now we have a properly cleaned dataframe that we can do some inspection and analysis on.

### 2.2.2 Characteristics of AD converts

First, let's visualize some of the demographic information of patients that convert to AD.

In [315]:
dx_bl_df.dx_bl.value_counts()

LMCI    270
EMCI     39
CN       23
AD        6
SMC       1
Name: dx_bl, dtype: int64

Interestingly, there are 6 patients who are reported as AD at baseline and who convert to AD.

In [316]:
dx_bl_df[dx_bl_df.dx_bl == 'AD']

Unnamed: 0,rid,viscode,d1,d2,dx_bl,dxchange,age,ptgender,pteducat,ptethcat,...,av1451_entorhinal_l,av1451_entorhinal_r,fa_hippocampus_l,fa_hippocampus_r,md_hippocampus_l,md_hippocampus_r,apoe4,admnrv,pdmnrv,dmnrvr
352,78,0,1,0,AD,5.0,76.0,Female,18,Not Hisp/Latino,...,,,,,,,1.0,,,
763,162,0,1,0,AD,3.0,71.8,Male,18,Not Hisp/Latino,...,,,,,,,1.0,,,
890,190,0,1,0,AD,5.0,78.8,Male,18,Not Hisp/Latino,...,,,,,,,1.0,,,
3794,995,0,1,0,AD,5.0,78.5,Female,12,Not Hisp/Latino,...,,,,,,,1.0,,,
4358,1154,0,1,0,AD,5.0,76.6,Male,18,Not Hisp/Latino,...,,,,,,,1.0,,,
4610,1226,0,1,0,AD,5.0,82.6,Male,16,Not Hisp/Latino,...,,,,,,,0.0,,,


Everyone except patient 162 appears to have been coded as converting from MCI to AD at the baseline assessment. Let's look at patient 162.

In [317]:
df[df.rid == 162]

Unnamed: 0,rid,viscode,d1,d2,dx_bl,dxchange,age,ptgender,pteducat,ptethcat,...,av1451_entorhinal_l,av1451_entorhinal_r,fa_hippocampus_l,fa_hippocampus_r,md_hippocampus_l,md_hippocampus_r,apoe4,admnrv,pdmnrv,dmnrvr
763,162,0,1,0,AD,3.0,71.8,Male,18,Not Hisp/Latino,...,,,,,,,1.0,,,
764,162,6,1,0,AD,8.0,71.8,Male,18,Not Hisp/Latino,...,,,,,,,1.0,,,
765,162,12,1,0,AD,2.0,71.8,Male,18,Not Hisp/Latino,...,,,,,,,1.0,,,
766,162,24,1,0,AD,5.0,71.8,Male,18,Not Hisp/Latino,...,,,,,,,1.0,,,


The data for patient 162 appears ok. They enter the study with AD, change back to MCI, then at 24 months convert back to AD.

In [354]:
# plot with plotly
trace1 = go.Bar(
                x=list(dx_bl_df.dx_bl.value_counts().index),
                y=list(dx_bl_df.dx_bl.value_counts().values),
                hoverinfo='x+y',
                opacity = 0.75
                )

trace2 = go.Bar(
                y=list(dx_bl_df.ptgender.value_counts().index),
                x=list(dx_bl_df.ptgender.value_counts().values/len(dx_bl_df.ptgender)*100),
                orientation='h',
                hoverinfo='x',
                marker=dict(
                    color=['rgb(116,159,199)', 'rgb(245,169,114)']),
                opacity = 0.75
                )

trace3 = go.Histogram(
                x=list(dx_bl_df.age),
                hoverinfo='x+y',
                opacity = 0.75
                )

trace4 = go.Histogram(
                x=list(dx_bl_df.pteducat),
                hoverinfo='x+y',
                opacity = 0.75
                )

trace5 = go.Bar(
                x=list(dx_bl_df.ptmarry.value_counts().index),
                hoverinfo='x+y',
                y=list(dx_bl_df.ptmarry.value_counts().values),
                opacity = 0.75
                )

trace6 = go.Bar(
                x=list(dx_bl_df.ptethcat.value_counts().index),
                hoverinfo='x+y',
                y=list(dx_bl_df.ptethcat.value_counts().values),
                opacity = 0.75
                )

# make subplots
fig = plotly.tools.make_subplots(rows=3, cols=2, print_grid=False, 
                                 subplot_titles=('Frequency of Diagnosis Category', 'Percent of Converts by Gender',
                                                 'Age', 'Frequency of Education Amount', 'Frequency of Marital Category', 'Frequency of Ethnicity Group'))
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 2, 1)
fig.append_trace(trace4, 2, 2)
fig.append_trace(trace5, 3, 1)
fig.append_trace(trace6, 3, 2)

# update layout
fig['layout'].update(height = 1200, showlegend=False)
fig['layout']['xaxis1'].update(title='Baseline Diagnosis')
fig['layout']['yaxis1'].update(title='Count')
fig['layout']['xaxis2'].update(title='Percent')
fig['layout']['xaxis3'].update(title='Years')
fig['layout']['yaxis3'].update(title='Count')
fig['layout']['xaxis4'].update(title='Years')
fig['layout']['yaxis4'].update(title='Count')
fig['layout']['xaxis5'].update(title='Marital Status')
fig['layout']['yaxis5'].update(title='Count')
fig['layout']['xaxis6'].update(title='Ethnicity')
fig['layout']['yaxis6'].update(title='Count')
py.iplot(fig, filename='dx_demo')

The above visuals give an overview of demographic information as it relates to AD conversion. The typical patient who converts is male, enters the study with a diagnosis of late MCI, has a university degree, is married, and is not Hispanic/Latino.
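A description of the typical patient can be cross-checked numerically by taking the most frequent category of each demographic column and the median of the numeric ones. The sketch below is only an illustration on invented data. `toy_bl` stands in for `dx_bl_df`, and `typical` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for `dx_bl_df`; values are invented for illustration.
toy_bl = pd.DataFrame({
    'dx_bl':    ['LMCI', 'LMCI', 'EMCI', 'CN'],
    'ptgender': ['Male', 'Male', 'Female', 'Male'],
    'pteducat': [16, 18, 12, 16],
})

# Most frequent category for each categorical column
typical = {col: toy_bl[col].mode().iloc[0] for col in ['dx_bl', 'ptgender']}

# Median years of education
typical['pteducat'] = float(toy_bl['pteducat'].median())

print(typical)
```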

In [None]:
# baseline diagnosis dataframes for plotting
df_cn = dx_bl_df[dx_bl_df.dx_bl == 'CN']
df_ad = dx_bl_df[dx_bl_df.dx_bl == 'AD']
df_lm = dx_bl_df[dx_bl_df.dx_bl == 'LMCI']
df_em = dx_bl_df[dx_bl_df.dx_bl == 'EMCI']
df_sm = dx_bl_df[dx_bl_df.dx_bl == 'SMC']

In [357]:
# create list of baseline cognitive assessments
tests = [t.lower() for t in vector_dict.metadata['cognitive_tests'] if 'bl' in t]
# display list
print(f'The cognitive assessments that will be looked at are {tests}')

The cognitive assessments that will be looked at are ['cdrsb_bl', 'adas13_bl', 'mmse_bl', 'moca_bl']


In [360]:
plot_dx(tests[0])

In [361]:
plot_dx(tests[1])

In [362]:
plot_dx(tests[2])

In [363]:
plot_dx(tests[3])

In [None]:
# compare converters to group data