# Diagnostic Response Variable EDA

(Zach Werkhoven)

## Import dependencies

In [2]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import statsmodels.api as sm
from statsmodels.regression.linear_model import OLS
%matplotlib inline

# import custom dependencies
from ADNI_utilities import define_terms, describe_meta_data, append_meta_cols

## Diagnostic Summary Analysis

ADNI uses diagnostic codes to record each patient's Alzheimer's disease diagnosis at each visit. Common to all ADNI phases are the following:

- Normal or `NL` or `1`
- Mild Cognitive Impairment or `MCI` or `2`
- Alzheimer's Disease or `AD` or `3`

However, some ADNI phases record additional diagnoses (e.g. `EMCI` and `LMCI` for early and late mild cognitive impairment), and others record only the change in diagnosis relative to the last visit. The purpose of this EDA is to try methods for constructing a single response variable with the format above for each visit.

In [3]:
# read in data
dx_df = pd.read_csv("../data/Diagnosis/DXSUM_PDXCONV_ADNIALL.csv")

# read in the ADNI dictionary and get summary of terms
adni_dict_df = pd.read_csv("../data/study info/DATADIC.csv")
dx_terms = define_terms(dx_df, adni_dict_df)

In [4]:
# print diagnosis dataframe overview
describe_meta_data(dx_df)

Phases:	 ['ADNI1' 'ADNIGO' 'ADNI2' 'ADNI3']
Num Entries: 11264
Num Columns: 53
Num Patients: 2516
Records per Patient: 1-15
Phases spanned per patient: 1-4
Patients w/ Duplicates: 2024


In [5]:
# print the summary of terms
dx_terms

Unnamed: 0,FLDNAME,TYPE,TEXT,CODE
0,,,,
1,ID,N,Record ID,"""crfname"","""",""indexes"",""adni_aal_idx=TBLID,FLD..."
2,RID,N,Participant roster ID,
3,SITEID,N,Site ID,
4,VISCODE,T,Visit code,
5,VISCODE2,-4,Translated visit code,-4
6,USERDATE,S,Date record created,
7,USERDATE2,S,Date record last updated,
8,EXAMDATE,D,Examination Date,
9,DXCHANGE,N,1. Which best describes the participant's cha...,1=Stable: NL to NL; 2=Stable: MCI to MCI; 3=St...


We can see from the summary above that many diagnostic fields are either comments on earlier fields or are only filled in conditionally on earlier answers. These will have to be removed.

In [6]:
# specify non-Alzheimer's diagnosis columns that may be of interest
dx_cols = ["DXNORM","DXNODEP","DXMPTR1","DXMPTR2","DXMPTR3","DXMPTR4","DXMPTR5","DXPARK","DXDEP","DXOTHDEM"]

There seem to be three primary metrics that record Alzheimer's diagnosis in the data set: `DXCURREN`, `DXCHANGE`, and `DIAGNOSIS`. To determine how to combine them, let's look at the coded values for each metric.

In [7]:
# print the values for diagnosis metrics
print("DXCURREN values:\n")
print(dx_terms.loc[dx_terms.FLDNAME=="DXCURREN"].CODE.values)
print("\nDXCHANGE values:\n")
print(dx_terms.loc[dx_terms.FLDNAME=="DXCHANGE"].CODE.values)
print("\nDIAGNOSIS values:\n")
print(dx_terms.loc[dx_terms.FLDNAME=="DIAGNOSIS"].CODE.values)

DXCURREN values:

['1=NL;2=MCI;3=AD']

DXCHANGE values:

['1=Stable: NL to NL; 2=Stable: MCI to MCI; 3=Stable: Dementia to Dementia; 4=Conversion: NL to MCI; 5=Conversion: MCI to Dementia; 6=Conversion: NL to Dementia; 7=Reversion: MCI to NL; 8=Reversion: Dementia to MCI; 9=Reversion: Dementia to NL']

DIAGNOSIS values:

["1=Cognitively Normal; 5=Significant Memory Concern;2=Early MCI; 3=Late MCI; 4=Alzheimer's Disease"]


By combining information from the metrics above, we should be able to get a measure for each visit that falls into one of the three categories defined earlier: `NL`, `MCI`, and `AD`.

Let's see which metrics were recorded during each ADNI phase.

In [8]:
# look at number of diagnostic values for each phase
by_phase = dx_df.groupby("Phase")
n_per_phase = by_phase[["DXCURREN","DXCHANGE","DIAGNOSIS"]].count()  # non-missing entries per metric
n_per_phase

Phase,DXCURREN,DXCHANGE,DIAGNOSIS
ADNI1,3868,0,0
ADNI2,0,5638,0
ADNI3,0,0,1281
ADNIGO,0,475,0


The table above shows that each ADNI phase records exactly one of the three metrics: `DXCURREN` in ADNI1, `DXCHANGE` in ADNIGO and ADNI2, and `DIAGNOSIS` in ADNI3. We can therefore build a more or less complete list of diagnoses by combining the three metrics into one.
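Before merging, it may be worth confirming at the row level that no visit carries more than one of the three metrics, since the per-phase counts alone cannot rule that out. The following is a minimal sketch on a toy frame; `toy_df` and its values are illustrative stand-ins for `dx_df`, not ADNI data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for dx_df; the values are illustrative, not ADNI data
toy_df = pd.DataFrame({
    "Phase": ["ADNI1", "ADNI2", "ADNI3", "ADNIGO"],
    "DXCURREN": [1, np.nan, np.nan, np.nan],
    "DXCHANGE": [np.nan, 5, np.nan, 2],
    "DIAGNOSIS": [np.nan, np.nan, 3, np.nan],
})

metric_cols = ["DXCURREN", "DXCHANGE", "DIAGNOSIS"]

# Number of diagnosis metrics recorded on each row
n_recorded = toy_df[metric_cols].notna().sum(axis=1)

# Rows with more than one metric would need an explicit precedence rule
conflicts = toy_df[n_recorded > 1]
print(f"rows with more than one metric: {len(conflicts)}")
```

If `conflicts` were non-empty on the real data, the order in which the metrics are merged would start to matter.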

For the `DIAGNOSIS` metric, we'll use the following conversion rule:

- Normal = NL `1` or SMC (Significant Memory Concern) `5`
- MCI = EMCI `2` or LMCI `3`
- AD = AD `4`
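The `DIAGNOSIS` rule above can also be written as a lookup table and applied with `Series.map`. This is a sketch on a toy series; the names `DIAGNOSIS_TO_DX` and `toy_diagnosis` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Rule from the list above: 1 and 5 -> NL (1), 2 and 3 -> MCI (2), 4 -> AD (3)
DIAGNOSIS_TO_DX = {1: 1, 5: 1, 2: 2, 3: 2, 4: 3}

# Toy values, including a missing entry as found in phases without DIAGNOSIS
toy_diagnosis = pd.Series([1, 5, 2, 3, 4, np.nan])

# Codes absent from the dictionary (including NaN) map to NaN
toy_dx = toy_diagnosis.map(DIAGNOSIS_TO_DX)
print(toy_dx.tolist())
```

A dictionary keeps the rule in one place, which makes it easier to audit against the data dictionary than a series of boolean masks.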

The `DXCHANGE` codes take the form `from _ to _`. We'll use a rule that records only the current (`to`) diagnosis, as follows:

- Normal = NL to NL `1`, MCI to NL `7`, or AD to NL `9`
- MCI = MCI to MCI `2`, NL to MCI `4`, or AD to MCI `8`
- AD = AD to AD `3`, MCI to AD `5`, or NL to AD `6`
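Because the grouping is easy to mistype by hand, one option is to derive it from the code string in the data dictionary, where AD appears as "Dementia". The sketch below parses the `DXCHANGE` string printed earlier and keeps the state after the final "to"; `parse_dxchange_codes` and `STATE_TO_DX` are hypothetical helpers, not part of `ADNI_utilities`.

```python
# Code string as printed from the data dictionary earlier in this notebook
DXCHANGE_CODES = (
    "1=Stable: NL to NL; 2=Stable: MCI to MCI; 3=Stable: Dementia to Dementia; "
    "4=Conversion: NL to MCI; 5=Conversion: MCI to Dementia; "
    "6=Conversion: NL to Dementia; 7=Reversion: MCI to NL; "
    "8=Reversion: Dementia to MCI; 9=Reversion: Dementia to NL"
)

# Target encoding used throughout: NL=1, MCI=2, AD (Dementia)=3
STATE_TO_DX = {"NL": 1, "MCI": 2, "Dementia": 3}


def parse_dxchange_codes(code_string):
    """Map each DXCHANGE code to the diagnosis the participant ends up with."""
    mapping = {}
    for entry in code_string.split(";"):
        code, description = entry.strip().split("=", 1)
        # The current diagnosis is whatever follows the final " to "
        current_state = description.rsplit(" to ", 1)[1].strip()
        mapping[int(code)] = STATE_TO_DX[current_state]
    return mapping


dxchange_to_dx = parse_dxchange_codes(DXCHANGE_CODES)
print(dxchange_to_dx)
```

The parsed mapping groups the codes as NL = (1, 7, 9), MCI = (2, 4, 8), and AD = (3, 5, 6).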

In [9]:
# convert DXCHANGE to DX, NL=(1,7,9), MCI=(2,4,8), AD=(3,5,6)
def combine_dx_measures(dxchange, dxcurr, diagnosis):
    
    # ensure arrays have proper dimensions
    dxchange = dxchange.reshape(-1,1)
    diagnosis = diagnosis.reshape(-1,1)
    
    # adjust DXCHANGE to NL, MCI, and AD
    NL = np.array([1,7,9]).reshape(1,3)
    MCI = np.array([2,4,8]).reshape(1,3)
    AD = np.array([3,5,6]).reshape(1,3)
    is_normal = (dxchange == NL).any(axis=1)
    is_mildcog = (dxchange == MCI).any(axis=1)
    is_alzh = (dxchange == AD).any(axis=1)
    
    # insert into dx summary
    dx_sum = np.full(dxchange.shape,np.nan)
    dx_sum[is_normal]=1
    dx_sum[is_mildcog]=2
    dx_sum[is_alzh]=3
    
    # adjust DIAGNOSIS to NL, MCI, and AD
    NL = np.array([1,5]).reshape(1,2)
    MCI = np.array([2,3]).reshape(1,2)
    is_normal = (diagnosis == NL).any(axis=1)
    is_mildcog = (diagnosis == MCI).any(axis=1)
    is_alzh = diagnosis == 4
    
    # insert into dx summary
    dx_sum[is_normal]=1
    dx_sum[is_mildcog]=2
    dx_sum[is_alzh]=3
    
    # add in dxcurr 
    dx_sum[np.isnan(dx_sum)] = dxcurr[np.isnan(dx_sum).flatten()]
    
    return dx_sum
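The function above tests membership by broadcasting an `(n, 1)` column against a `(1, 3)` row of codes. As a sketch of an alternative, `np.isin` expresses the same test without reshaping; the toy array below is illustrative only.

```python
import numpy as np

# Toy DXCHANGE values, including a missing entry
toy_dxchange = np.array([1, 4, 6, 9, np.nan, 2])
NL_CODES = [1, 7, 9]

# Broadcasting approach: compare an (n, 1) column against a (1, 3) row
by_broadcast = (
    toy_dxchange.reshape(-1, 1) == np.array(NL_CODES).reshape(1, 3)
).any(axis=1)

# Equivalent membership test with np.isin; NaN never matches a code
by_isin = np.isin(toy_dxchange, NL_CODES)

print(by_isin)
```

Both masks are one-dimensional with one entry per record, so either can index the rows of the `(n, 1)` summary array.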

In [10]:
# combine diagnostic values across ADNI phases and add to df
dx_comb = combine_dx_measures(dx_df.DXCHANGE.values, dx_df.DXCURREN.values, dx_df.DIAGNOSIS.values)
dx_df["DXCOMB"] = dx_comb

# append our new category to dx column list
dx_cols.append("DXCOMB")

Check the unique values in the new combined diagnostic metric `DXCOMB`. A `nan` here marks a record for which none of the three source metrics yields a diagnosis.

In [11]:
# print unique vals
dx_df.DXCOMB.unique()

array([ 1.,  3.,  2., nan])

Looking at the same table of the number of entries per phase, we can see that every recorded diagnosis has been captured by the new metric: the `DXCOMB` count matches the source count in each phase.

In [12]:
# look at number of diagnostic values for each phase
by_phase = dx_df.groupby("Phase")
n_per_phase = by_phase[["DXCURREN","DXCHANGE","DIAGNOSIS","DXCOMB"]].count()  # non-missing entries per metric
n_per_phase

Phase,DXCURREN,DXCHANGE,DIAGNOSIS,DXCOMB
ADNI1,3868,0,0,3868
ADNI2,0,5638,0,5638
ADNI3,0,0,1281,1281
ADNIGO,0,475,0,475
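Matching counts show that nothing was dropped, but not how the three classes are distributed. As a complementary check, a cross-tabulation of phase against the combined label gives the class balance per phase. The sketch below uses a toy frame, and the `labels` dictionary is an illustrative decoding of the `1`/`2`/`3` scheme.

```python
import numpy as np
import pandas as pd

# Toy stand-in for dx_df after DXCOMB has been added; values are illustrative
toy_df = pd.DataFrame({
    "Phase": ["ADNI1", "ADNI1", "ADNI2", "ADNI2", "ADNI3", "ADNI3"],
    "DXCOMB": [1, 3, 2, np.nan, 1, 2],
})

# Decode the numeric classes and keep missing values visible
labels = {1: "NL", 2: "MCI", 3: "AD"}
dx_label = toy_df["DXCOMB"].map(labels).fillna("missing")

# Rows are phases, columns are diagnosis labels, cells are visit counts
class_counts = pd.crosstab(toy_df["Phase"], dx_label)
print(class_counts)
```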


In [13]:
# check the data types of the categorical columns (we want int)
dx_df[dx_cols].dtypes.unique()

array([dtype('float64')], dtype=object)

In [14]:
# ensure standardized missing value is compatible with int
dx_df.replace({np.nan:-1, -4:-1}, inplace=True)

# convert to int dtype
dx_df[dx_cols] = dx_df[dx_cols].astype(int)
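The cell above uses `-1` as a sentinel so that the columns can be stored as plain `int`. An alternative, sketched here on a toy series, is pandas' nullable `Int64` dtype, which holds integers alongside a true missing value. This assumes, as the replacement above suggests, that `-4` is a missing-value placeholder.

```python
import numpy as np
import pandas as pd

# Toy column mixing valid classes, NaN, and the -4 placeholder
toy_dx = pd.Series([1.0, 3.0, np.nan, -4.0, 2.0], name="DXCOMB")

# Treat -4 as missing, then convert to the nullable integer dtype
toy_dx_int = toy_dx.replace(-4, np.nan).astype("Int64")
print(toy_dx_int)
```

With this dtype, missing entries show up as `<NA>` and are skipped by default in aggregations, whereas a `-1` sentinel has to be filtered out explicitly downstream.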

## Output data to file

With the columns selected and the response variable constructed, we can output the data to file for later use.



In [15]:
# ensure RID, VISCODE is in column list
dx_cols = append_meta_cols(dx_df.columns, dx_cols)
    
# write data to file
to_file_df = dx_df[dx_cols]
to_file_df.to_csv("../data/Cleaned/diagnosis_clean.csv")
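Note that `to_csv` writes the row index by default, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below shows the difference using an in-memory buffer rather than the real output path; `toy_out` is an illustrative frame.

```python
import io
import pandas as pd

# Toy stand-in for to_file_df; values are illustrative
toy_out = pd.DataFrame(
    {"RID": [2, 3], "VISCODE": ["bl", "m06"], "DXCOMB": [1, 3]}
)

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy_out.to_csv(with_index)
with_index.seek(0)
reread_with_index = pd.read_csv(with_index)
print(reread_with_index.columns.tolist())

# index=False keeps only the data columns
without_index = io.StringIO()
toy_out.to_csv(without_index, index=False)
without_index.seek(0)
roundtrip = pd.read_csv(without_index)
print(roundtrip.columns.tolist())
```

Whether to keep the index is a design choice; dropping it is usually simpler when `RID` and `VISCODE` already identify each row.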