## Can Alzheimer's disease be predicted?

### 1. Sourcing and Loading 


#### 1.1. Importing Libraries

In [71]:
# Import pandas and NumPy as pd and np, respectively.
import pandas as pd
import numpy as np

# Load the pyplot collection of functions from matplotlib, as plt 
import matplotlib.pyplot as plt

#### 1.2. Loading the data
These MRI data sets come from the Open Access Series of Imaging Studies (OASIS),
a project aimed at making MRI data sets of the brain freely available to the
scientific community. OASIS is made available by the Washington University Alzheimer’s
Disease Research Center, Dr. Randy Buckner at the Howard Hughes Medical Institute (HHMI)
at Harvard University, the Neuroinformatics Research Group (NRG) at Washington University
School of Medicine, and the Biomedical Informatics Research Network (BIRN).

In [72]:
# First, load the cross-sectional collection (416 subjects) into Cross_sec
Cross_sec = pd.read_csv('Data/oasis_cross-sectional.csv', index_col=None)
# Second, load the longitudinal collection (150 subjects) into Long_sec
Long_sec = pd.read_csv('Data/oasis_longitudinal.csv', index_col=None)

Patients diagnosed with dementia are staged with a global rating scale, the Clinical Dementia Rating (**CDR**). The CDR evaluates cognitive, behavioral, and functional aspects of Alzheimer's disease and other dementias. The features used for machine learning from these two data sets are age, education, gender, socioeconomic status (**SES**), the Mini-Mental State Exam (**MMSE**), which is a test of cognitive function, the estimated Total Intracranial Volume (**eTIV**; the sum of brain, ventricular, and extraventricular CSF volumes), the normalized whole-brain volume (**nWBV**), and the Atlas Scaling Factor (**ASF**), which is the volume-scaling factor required to match each individual to the atlas target.

### 2. Cleaning, transforming, and visualizing

#### 2.1. Exploring the data


In [73]:
Cross_sec.head()

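| | ID | M/F | Hand | Age | Educ | SES | MMSE | CDR | eTIV | nWBV | ASF | Delay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OAS1_0001_MR1 | F | R | 74 | 2.0 | 3.0 | 29.0 | 0.0 | 1344 | 0.743 | 1.306 | NaN |
| 1 | OAS1_0002_MR1 | F | R | 55 | 4.0 | 1.0 | 29.0 | 0.0 | 1147 | 0.81 | 1.531 | NaN |
| 2 | OAS1_0003_MR1 | F | R | 73 | 4.0 | 3.0 | 27.0 | 0.5 | 1454 | 0.708 | 1.207 | NaN |
| 3 | OAS1_0004_MR1 | M | R | 28 | NaN | NaN | NaN | NaN | 1588 | 0.803 | 1.105 | NaN |
| 4 | OAS1_0005_MR1 | M | R | 18 | NaN | NaN | NaN | NaN | 1737 | 0.848 | 1.01 | NaN |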


In [74]:
Long_sec.head()

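| | Subject ID | MRI ID | Group | Visit | MR Delay | M/F | Hand | Age | EDUC | SES | MMSE | CDR | eTIV | nWBV | ASF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OAS2_0001 | OAS2_0001_MR1 | Nondemented | 1 | 0 | M | R | 87 | 14 | 2.0 | 27.0 | 0.0 | 1987 | 0.696 | 0.883 |
| 1 | OAS2_0001 | OAS2_0001_MR2 | Nondemented | 2 | 457 | M | R | 88 | 14 | 2.0 | 30.0 | 0.0 | 2004 | 0.681 | 0.876 |
| 2 | OAS2_0002 | OAS2_0002_MR1 | Demented | 1 | 0 | M | R | 75 | 12 | NaN | 23.0 | 0.5 | 1678 | 0.736 | 1.046 |
| 3 | OAS2_0002 | OAS2_0002_MR2 | Demented | 2 | 560 | M | R | 76 | 12 | NaN | 28.0 | 0.5 | 1738 | 0.713 | 1.01 |
| 4 | OAS2_0002 | OAS2_0002_MR3 | Demented | 3 | 1895 | M | R | 80 | 12 | NaN | 22.0 | 0.5 | 1698 | 0.701 | 1.034 |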


#### 2.2. Cleaning the data

In [75]:
Cross_sec.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 436 entries, 0 to 435
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      436 non-null    object 
 1   M/F     436 non-null    object 
 2   Hand    436 non-null    object 
 3   Age     436 non-null    int64  
 4   Educ    235 non-null    float64
 5   SES     216 non-null    float64
 6   MMSE    235 non-null    float64
 7   CDR     235 non-null    float64
 8   eTIV    436 non-null    int64  
 9   nWBV    436 non-null    float64
 10  ASF     436 non-null    float64
 11  Delay   20 non-null     float64
dtypes: float64(7), int64(2), object(3)
memory usage: 41.0+ KB


In [76]:
Long_sec.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 373 entries, 0 to 372
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Subject ID  373 non-null    object 
 1   MRI ID      373 non-null    object 
 2   Group       373 non-null    object 
 3   Visit       373 non-null    int64  
 4   MR Delay    373 non-null    int64  
 5   M/F         373 non-null    object 
 6   Hand        373 non-null    object 
 7   Age         373 non-null    int64  
 8   EDUC        373 non-null    int64  
 9   SES         354 non-null    float64
 10  MMSE        371 non-null    float64
 11  CDR         373 non-null    float64
 12  eTIV        373 non-null    int64  
 13  nWBV        373 non-null    float64
 14  ASF         373 non-null    float64
dtypes: float64(5), int64(5), object(5)
memory usage: 43.8+ KB


#### Checking whether the subjects are right-handed, left-handed, or both

In [77]:
Cross_sec['Hand'].unique()

array(['R'], dtype=object)

In [78]:
Long_sec['Hand'].unique()

array(['R'], dtype=object)

All subjects are right-handed, so the `Hand` column carries no information and does not need to be kept.
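
The same check can be generalized to every column at once. The following is a minimal sketch on a made-up toy frame (the name `toy` and its values are assumptions for illustration, not the OASIS data): it lists every column that holds a single distinct value and drops them.

```python
import pandas as pd

# Toy frame standing in for the OASIS tables (values are made up).
toy = pd.DataFrame({
    "Hand": ["R", "R", "R"],
    "M/F": ["F", "M", "F"],
    "Age": [74, 55, 73],
})

# A column with a single distinct value cannot help a model tell rows apart.
n_unique = toy.nunique(dropna=False)
constant_cols = n_unique[n_unique == 1].index.tolist()

# Drop every constant column in one step.
toy_reduced = toy.drop(columns=constant_cols)
print(constant_cols)
```

Counting with `dropna=False` treats missing values as a distinct value, so a column that mixes one real value with NaN is not reported as constant.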

In [79]:
Cross_sec = Cross_sec.drop(['Hand'], axis = 1)

In [80]:
Long_sec = Long_sec.drop(['Hand'], axis = 1)

In [81]:
Long_sec['M/F'].value_counts()

F    213
M    160
Name: M/F, dtype: int64

#### Counting the number of scans (rows) in the longitudinal and cross-sectional data

In [82]:
Cross_sec['ID'].value_counts().sum()

436

In [83]:
Long_sec['Subject ID'].value_counts().sum()

373
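
Note that `value_counts().sum()` adds the per-ID counts back together, so it returns the number of rows rather than the number of distinct subjects. A small sketch with made-up IDs (the frame `visits` is hypothetical) shows the difference from `nunique()`:

```python
import pandas as pd

# Made-up visits: subject S1 was scanned twice, S2 once.
visits = pd.DataFrame({
    "Subject ID": ["S1", "S1", "S2"],
    "Visit": [1, 2, 1],
})

# Summing the per-ID counts gives the number of rows (scans).
n_rows = visits["Subject ID"].value_counts().sum()

# nunique() gives the number of distinct subjects.
n_subjects = visits["Subject ID"].nunique()

print(n_rows, n_subjects)
```

In the longitudinal table each subject appears once per visit, so the two numbers are expected to differ there.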

#### Renaming equivalent columns in both groups to the same name

In [84]:
Cross_sec.columns

Index(['ID', 'M/F', 'Age', 'Educ', 'SES', 'MMSE', 'CDR', 'eTIV', 'nWBV', 'ASF',
       'Delay'],
      dtype='object')

In [85]:
Long_sec.columns

Index(['Subject ID', 'MRI ID', 'Group', 'Visit', 'MR Delay', 'M/F', 'Age',
       'EDUC', 'SES', 'MMSE', 'CDR', 'eTIV', 'nWBV', 'ASF'],
      dtype='object')

In [86]:
Cross_sec = Cross_sec.rename(columns={'ID':'Subject ID'})

In [87]:
Long_sec = Long_sec.rename(columns={'EDUC':'Educ'})
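
After renaming, it can be useful to confirm which columns the two tables now share. The sketch below uses empty placeholder frames built from the column names printed above (the names `cross` and `long_` are hypothetical stand-ins, not the notebook's variables):

```python
import pandas as pd

# Column names copied from the two outputs above; the frames hold no data.
cross_cols = ['ID', 'M/F', 'Age', 'Educ', 'SES', 'MMSE', 'CDR',
              'eTIV', 'nWBV', 'ASF', 'Delay']
long_cols = ['Subject ID', 'MRI ID', 'Group', 'Visit', 'MR Delay', 'M/F',
             'Age', 'EDUC', 'SES', 'MMSE', 'CDR', 'eTIV', 'nWBV', 'ASF']

cross = pd.DataFrame(columns=cross_cols).rename(columns={'ID': 'Subject ID'})
long_ = pd.DataFrame(columns=long_cols).rename(columns={'EDUC': 'Educ'})

# Columns present in both tables, and those unique to each one.
shared = [c for c in cross.columns if c in long_.columns]
only_cross = [c for c in cross.columns if c not in long_.columns]
only_long = [c for c in long_.columns if c not in cross.columns]

print(shared)
print(only_cross, only_long)
```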

#### Finding the number of null values for different features

In [88]:
Cross_sec.isna().sum()

Subject ID      0
M/F             0
Age             0
Educ          201
SES           220
MMSE          201
CDR           201
eTIV            0
nWBV            0
ASF             0
Delay         416
dtype: int64

In [89]:
Long_sec.isna().sum()

Subject ID     0
MRI ID         0
Group          0
Visit          0
MR Delay       0
M/F            0
Age            0
Educ           0
SES           19
MMSE           2
CDR            0
eTIV           0
nWBV           0
ASF            0
dtype: int64

In [90]:
Cross_sec = Cross_sec.drop(['Delay'], axis = 1)

In [91]:
Cross_sec[Cross_sec['Educ'].isna()]['MMSE'].isna().sum()

201

In [92]:
Cross_sec[Cross_sec['Educ'].isna()]['CDR'].isna().sum()

201

In [93]:
Cross_sec[Cross_sec['SES'].isna()]['CDR'].isna().sum()

201

In [94]:
Cross_sec[Cross_sec['SES'].isna()]['MMSE'].isna().sum()

201

In [95]:
Cross_sec[Cross_sec['SES'].isna()]['Educ'].isna().sum()

201

#### The same 201 observations have NaN values for MMSE, Educ, CDR, and SES (SES is missing in 19 further rows)
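
The pairwise checks above can also be collapsed into a single row-wise test. This is a minimal sketch on made-up values (the frame `toy` is an assumption for illustration): it counts the rows in which all four clinical fields are missing and the rows in which at least one is missing.

```python
import numpy as np
import pandas as pd

# Made-up rows: rows 1 and 3 miss every clinical field, row 2 misses only SES.
toy = pd.DataFrame({
    "Educ": [2.0, np.nan, 4.0, np.nan],
    "SES":  [3.0, np.nan, np.nan, np.nan],
    "MMSE": [29.0, np.nan, 27.0, np.nan],
    "CDR":  [0.0, np.nan, 0.5, np.nan],
})

clinical = ["Educ", "SES", "MMSE", "CDR"]
missing = toy[clinical].isna()

all_missing = int(missing.all(axis=1).sum())  # rows missing every field
any_missing = int(missing.any(axis=1).sum())  # rows missing at least one field

print(all_missing, any_missing)
```

If the two counts differ, the difference is the number of rows that are only partially missing, which are the candidates for imputation rather than removal.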

In [96]:
Cross_sec_cl = Cross_sec[Cross_sec['CDR'].notna()]

In [97]:
Cross_sec_cl.isna().sum()

Subject ID     0
M/F            0
Age            0
Educ           0
SES           19
MMSE           0
CDR            0
eTIV           0
nWBV           0
ASF            0
dtype: int64

Filling the NaN values in the socioeconomic status (SES) column with the median

In [98]:
# Return a new frame instead of filling a slice in place,
# which avoids pandas' SettingWithCopyWarning
Cross_sec_cl = Cross_sec_cl.fillna({'SES': Cross_sec_cl['SES'].median()})
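
Filtering a DataFrame and then modifying the result in place can trigger pandas' `SettingWithCopyWarning`. One common way to avoid it, sketched here on made-up values (the names `toy` and `subset` are hypothetical), is to take an explicit copy of the filtered rows and assign the imputed column back:

```python
import numpy as np
import pandas as pd

# Made-up data: one row lacks CDR, one row lacks SES.
toy = pd.DataFrame({
    "CDR": [0.0, 0.5, np.nan, 1.0],
    "SES": [1.0, np.nan, 2.0, 4.0],
})

# .copy() makes the filtered frame independent of `toy`.
subset = toy[toy["CDR"].notna()].copy()

# Median of the remaining SES values, then fill the gaps with it.
ses_median = subset["SES"].median()
subset["SES"] = subset["SES"].fillna(ses_median)

print(subset)
```

The original frame is left untouched, so the imputation can be repeated or changed without reloading the data.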


In [125]:
# Work on an explicit copy so that Long_sec itself is not modified
Long_sec_cl = Long_sec.copy()
Long_sec_cl['SES'] = Long_sec_cl['SES'].fillna(Long_sec_cl['SES'].median())

Rows with a missing MMSE value are removed rather than imputed, because MMSE is a cognitive test score and only two observations are affected.

In [126]:
Long_sec_cl = Long_sec_cl[Long_sec_cl['MMSE'].notna()]

In [127]:
Cross_sec_cl.isna().sum()

Subject ID    0
M/F           0
Age           0
Educ          0
SES           0
MMSE          0
CDR           0
eTIV          0
nWBV          0
ASF           0
dtype: int64

In [128]:
Long_sec_cl.isna().sum()

Subject ID    0
MRI ID        0
Group         0
Visit         0
MR Delay      0
M/F           0
Age           0
Educ          0
SES           0
MMSE          0
CDR           0
eTIV          0
nWBV          0
ASF           0
dtype: int64

In [129]:
Cross_sec_cl.head()

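| | Subject ID | M/F | Age | Educ | SES | MMSE | CDR | eTIV | nWBV | ASF |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OAS1_0001_MR1 | F | 74 | 2.0 | 3.0 | 29.0 | 0.0 | 1344 | 0.743 | 1.306 |
| 1 | OAS1_0002_MR1 | F | 55 | 4.0 | 1.0 | 29.0 | 0.0 | 1147 | 0.81 | 1.531 |
| 2 | OAS1_0003_MR1 | F | 73 | 4.0 | 3.0 | 27.0 | 0.5 | 1454 | 0.708 | 1.207 |
| 8 | OAS1_0010_MR1 | M | 74 | 5.0 | 2.0 | 30.0 | 0.0 | 1636 | 0.689 | 1.073 |
| 9 | OAS1_0011_MR1 | F | 52 | 3.0 | 2.0 | 30.0 | 0.0 | 1321 | 0.827 | 1.329 |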


The CDR is based on a scale of 0–3: no dementia (CDR = 0), questionable dementia (CDR = 0.5), mild cognitive impairment (CDR = 1), moderate cognitive impairment (CDR = 2), and severe cognitive impairment (CDR = 3).
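
Because the cells that follow count observations with CDR > 0, one natural way to turn the scale into a prediction target is a binary flag. This is a sketch under that assumption, on made-up scores (the names `cdr` and `demented` are hypothetical, and the threshold is an illustrative choice rather than something the data dictates):

```python
import pandas as pd

# Made-up CDR scores covering several levels of the scale.
cdr = pd.Series([0.0, 0.5, 1.0, 0.0, 2.0], name="CDR")

# 1 for any CDR above zero, 0 for CDR = 0.
demented = (cdr > 0).astype(int)

# Class balance of the resulting target.
counts = demented.value_counts().to_dict()
print(counts)
```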

In [130]:
Cross_sec_cl[Cross_sec_cl['CDR'] >0].count()

Subject ID    100
M/F           100
Age           100
Educ          100
SES           100
MMSE          100
CDR           100
eTIV          100
nWBV          100
ASF           100
dtype: int64

So the cross-sectional group contains 100 observations with CDR > 0, that is, with some degree of dementia.

In [131]:
Long_sec_cl[Long_sec_cl['CDR'] >0].count()

Subject ID    165
MRI ID        165
Group         165
Visit         165
MR Delay      165
M/F           165
Age           165
Educ          165
SES           165
MMSE          165
CDR           165
eTIV          165
nWBV          165
ASF           165
dtype: int64

There are 165 observations with CDR > 0 in the longitudinal study,
and some of them were obtained from the same subject (same Subject ID) at different visits.
I treat each of these repeated measurements as a new, independent observation.

I therefore removed the columns MR Delay, Group, Visit, Subject ID, and MRI ID
from the longitudinal data.

In [132]:
# Reassign instead of dropping in place on a slice,
# which avoids pandas' SettingWithCopyWarning
Long_sec_cl = Long_sec_cl.drop(columns=['MR Delay', 'Group', 'Visit', 'Subject ID', 'MRI ID'])


In [134]:
Cross_sec_cl = Cross_sec_cl.drop(columns=['Subject ID'])

In [136]:
Cross_Long = pd.concat([Long_sec_cl,Cross_sec_cl])
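
By default `pd.concat` keeps each frame's original row labels, so the combined index can contain duplicates. A minimal sketch on two made-up frames (`a` and `b` are hypothetical) shows the default behaviour next to `ignore_index=True`, which renumbers the rows:

```python
import pandas as pd

# Two tiny made-up frames that both start their index at 0.
a = pd.DataFrame({"Age": [87, 75]})
b = pd.DataFrame({"Age": [74, 55]})

# Default: row labels are carried over and repeat.
stacked = pd.concat([a, b])

# ignore_index=True assigns a fresh 0..n-1 index.
reindexed = pd.concat([a, b], ignore_index=True)

print(stacked.index.tolist(), reindexed.index.tolist())
```

Duplicate labels do not affect `corr()`, but they can make label-based selection with `.loc` return more than one row.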

In [138]:
Cross_Long.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 606 entries, 0 to 415
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   M/F     606 non-null    object 
 1   Age     606 non-null    int64  
 2   Educ    606 non-null    float64
 3   SES     606 non-null    float64
 4   MMSE    606 non-null    float64
 5   CDR     606 non-null    float64
 6   eTIV    606 non-null    int64  
 7   nWBV    606 non-null    float64
 8   ASF     606 non-null    float64
dtypes: float64(6), int64(2), object(1)
memory usage: 47.3+ KB


In [141]:
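# M/F is a text column, so restrict the correlation to the numeric columns
Corr = Cross_Long.select_dtypes('number').corr()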
Corr.style.background_gradient(cmap='coolwarm')

| | Age | Educ | SES | MMSE | CDR | eTIV | nWBV | ASF |
|---|---|---|---|---|---|---|---|---|
| Age | 1.0 | 0.184351 | 0.049669 | -0.084787 | 0.133527 | 0.058236 | -0.647796 | -0.051177 |
| Educ | 0.184351 | 1.0 | -0.266295 | 0.115209 | -0.061148 | 0.166613 | -0.190119 | -0.155344 |
| SES | 0.049669 | -0.266295 | 1.0 | -0.162225 | 0.097854 | -0.226946 | 0.007856 | 0.21632 |
| MMSE | -0.084787 | 0.115209 | -0.162225 | 1.0 | -0.711017 | -0.019059 | 0.377745 | 0.027136 |
| CDR | 0.133527 | -0.061148 | 0.097854 | -0.711017 | 1.0 | 0.065785 | -0.406617 | -0.076771 |
| eTIV | 0.058236 | 0.166613 | -0.226946 | -0.019059 | 0.065785 | 1.0 | -0.224172 | -0.98928 |
| nWBV | -0.647796 | -0.190119 | 0.007856 | 0.377745 | -0.406617 | -0.224172 | 1.0 | 0.226093 |
| ASF | -0.051177 | -0.155344 | 0.21632 | 0.027136 | -0.076771 | -0.98928 | 0.226093 | 1.0 |


CDR has a strong negative correlation with MMSE (r ≈ -0.71) and a moderate negative correlation with nWBV (r ≈ -0.41), which is measured using MRI.
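
To read such relationships off a correlation matrix without scanning the whole table, the CDR column can be extracted and ordered by absolute value. The following is a sketch on made-up numbers (the frame `toy` and its values are assumptions, not the OASIS data):

```python
import pandas as pd

# Made-up observations; M/F is kept to show that text columns are excluded.
toy = pd.DataFrame({
    "M/F": ["F", "M", "F", "M", "F"],
    "MMSE": [30, 29, 27, 22, 18],
    "nWBV": [0.81, 0.74, 0.75, 0.70, 0.69],
    "CDR": [0.0, 0.0, 0.5, 1.0, 2.0],
})

# Pearson correlation over the numeric columns only.
corr = toy.select_dtypes("number").corr()

# Correlation of every other feature with CDR, strongest first.
with_cdr = corr["CDR"].drop("CDR")
order = with_cdr.abs().sort_values(ascending=False).index
ranked = with_cdr.reindex(order)

print(ranked)
```

Sorting by absolute value keeps strong negative correlations at the top, while the sign in `ranked` still shows the direction.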

In [142]:
Cross_Long_M = Cross_Long[Cross_Long['M/F']=='M']
Cross_Long_F = Cross_Long[Cross_Long['M/F']=='F']

Investigating how the correlations depend on gender

In [143]:
# Restrict the correlation to the numeric columns (M/F is text)
Corr = Cross_Long_F.select_dtypes('number').corr()
Corr.style.background_gradient(cmap='coolwarm')

| | Age | Educ | SES | MMSE | CDR | eTIV | nWBV | ASF |
|---|---|---|---|---|---|---|---|---|
| Age | 1.0 | 0.207966 | 0.036444 | -0.138146 | 0.151954 | 0.107384 | -0.665176 | -0.09002 |
| Educ | 0.207966 | 1.0 | -0.238449 | 0.193985 | -0.143319 | 0.045181 | -0.150699 | -0.039979 |
| SES | 0.036444 | -0.238449 | 1.0 | -0.236581 | 0.168027 | -0.088488 | -0.025015 | 0.085819 |
| MMSE | -0.138146 | 0.193985 | -0.236581 | 1.0 | -0.761731 | -0.049977 | 0.406895 | 0.040636 |
| CDR | 0.151954 | -0.143319 | 0.168027 | -0.761731 | 1.0 | 0.066495 | -0.421479 | -0.067626 |
| eTIV | 0.107384 | 0.045181 | -0.088488 | -0.049977 | 0.066495 | 1.0 | -0.188568 | -0.991764 |
| nWBV | -0.665176 | -0.150699 | -0.025015 | 0.406895 | -0.421479 | -0.188568 | 1.0 | 0.179053 |
| ASF | -0.09002 | -0.039979 | 0.085819 | 0.040636 | -0.067626 | -0.991764 | 0.179053 | 1.0 |


In [144]:
# Restrict the correlation to the numeric columns (M/F is text)
Corr = Cross_Long_M.select_dtypes('number').corr()
Corr.style.background_gradient(cmap='coolwarm')

| | Age | Educ | SES | MMSE | CDR | eTIV | nWBV | ASF |
|---|---|---|---|---|---|---|---|---|
| Age | 1.0 | 0.152042 | 0.068891 | -0.01562 | 0.112484 | 0.040397 | -0.66109 | -0.030281 |
| Educ | 0.152042 | 1.0 | -0.296349 | 0.046711 | 0.009363 | 0.212806 | -0.202092 | -0.211049 |
| SES | 0.068891 | -0.296349 | 1.0 | -0.088646 | 0.025285 | -0.402331 | 0.028695 | 0.417087 |
| MMSE | -0.01562 | 0.046711 | -0.088646 | 1.0 | -0.641434 | 0.149977 | 0.314283 | -0.14101 |
| CDR | 0.112484 | 0.009363 | 0.025285 | -0.641434 | 1.0 | -0.152676 | -0.330841 | 0.148594 |
| eTIV | 0.040397 | 0.212806 | -0.402331 | 0.149977 | -0.152676 | 1.0 | -0.072013 | -0.993635 |
| nWBV | -0.66109 | -0.202092 | 0.028695 | 0.314283 | -0.330841 | -0.072013 | 1.0 | 0.073801 |
| ASF | -0.030281 | -0.211049 | 0.417087 | -0.14101 | 0.148594 | -0.993635 | 0.073801 | 1.0 |
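
The per-gender matrices can also be produced in one step with `groupby`, which avoids building a separate frame for each group. This is a minimal sketch on made-up values (the frame `toy` and the column subset `num_cols` are assumptions for illustration):

```python
import pandas as pd

# Made-up observations for two groups.
toy = pd.DataFrame({
    "M/F": ["F", "F", "F", "M", "M", "M"],
    "MMSE": [30, 27, 20, 29, 25, 21],
    "CDR": [0.0, 0.5, 1.0, 0.0, 0.5, 1.0],
})

num_cols = ["MMSE", "CDR"]

# One correlation matrix per group, stacked under a (group, column) index.
by_sex = toy.groupby("M/F")[num_cols].corr()

# Pull out the MMSE-CDR correlation for each group.
mmse_cdr = by_sex.xs("CDR", level=1)["MMSE"]

print(mmse_cdr)
```

Having both groups in one Series makes it easier to compare a single pair of features, such as MMSE and CDR, across genders.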
