# Preliminary Exploration


### Findings

- Dataset contains information about 19892 patients, described by 402 variables.
- Of these variables, 393 are `int64` types, seven are `float64` types and two are `object` types.
- The two `object` variables (`NACCID`, `CVDIMAGX`) and two zero-variance variables (`NACCTMCI`, `IMPNOMCI`) were dropped.
- Missing entries are encoded as `-4`, and some variables also use their own placeholder codes (e.g. missing BMI entries are encoded as `888.8`). These placeholders need to be replaced with `NaN`.
- The `outcome` variable is the target. It is encoded as `2` if the patient has Alzheimer's Disease (AD), `1` otherwise.
- The dataset is highly imbalanced: only 8.5% of patients (1683 of 19892) have AD. This makes it an imbalanced binary classification problem on `outcome`.

### Future Tasks

- Replace the encoded missing values with `NaN`. Passing `na_values=-4` to `pd.read_csv` handles the global code but not the variable-specific codes; refer to the NACC Researcher's Data Dictionary for those.
- Clean the dataset.
- Carry out in-depth exploratory data analysis.
- Look for highly correlated variables.
- Pick out the top 30-50 variables that, according to a literature review, may be linked to Alzheimer's disease and explore them further.
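
The search for highly correlated variables listed above could start from something like the sketch below. It is only an illustration on a toy frame: the helper name `correlated_pairs`, the 0.9 threshold and the toy columns are assumptions, not part of the analysis so far, and Pearson correlation may not be the best choice for the many categorical codes in this dataset.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return variable pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is listed once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data (not from the NACC dataset): b is an exact multiple of a, c is unrelated
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 1, 4, 2, 3],
})
high = correlated_pairs(toy, threshold=0.9)
print(high)
```

Placeholder codes such as `-4` would distort the correlations, so this step probably belongs after the missing values have been replaced.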


**Ismail Dawoodjee 2:45 AM 14-Oct-2020**

In [92]:
import numpy as np
import pandas as pd

In [3]:
df = pd.read_csv('MASTEROUTCOME.csv')

In [4]:
df.shape

(19892, 402)

Dataset has 19892 rows and 402 columns.

In [26]:
dtype_count = pd.DataFrame(df.dtypes.value_counts(), columns=['count'])
dtype_count

| | count |
|---|---|
| int64 | 393 |
| float64 | 7 |
| object | 2 |


In [49]:
dtype_df = pd.DataFrame(df.dtypes, columns=['dtype'])
dtype_df

| | dtype |
|---|---|
| NACCID | object |
| SEX | int64 |
| RACE | int64 |
| EDUC | int64 |
| MARISTAT | int64 |
| ... | ... |
| NACCVASD | int64 |
| NACCBMI | float64 |
| NACCUDSD | int64 |
| NACCDIED | int64 |


Almost all variables (393) are integer types, with 7 variables being float types and 2 variables being object types.

In [109]:
floats = list(df.select_dtypes(include='float64').columns)
floats

['HEIGHT', 'MEMORY', 'ORIENT', 'JUDGMENT', 'COMMUN', 'HOMEHOBB', 'NACCBMI']

The above 7 variables are floats and the below 2 variables are objects:

In [107]:
objects = list(df.dtypes[df.dtypes == 'object'].index)
objects

['NACCID', 'CVDIMAGX']

In [108]:
df[objects]

Unnamed: 0,NACCID,CVDIMAGX
0,NACC000144,
1,NACC000184,
2,NACC000385,
3,NACC000403,
4,NACC000546,
...,...,...
19887,NACC999729,
19888,NACC999839,
19889,NACC999853,
19890,NACC999854,


In [85]:
df.query('~CVDIMAGX.isnull()')['CVDIMAGX'].head(10)

62                                     LOCALPITAL INFARCT
425                    MILD PEVIVENTVIULER PARTALE GLOSIS
584                                   ANOXIC BRAIN INJURY
961                                                     s
1132                              ANEURYSM DIPPING + COIL
1208                                              FEW WMH
1237            microhemorrhage in the left parietal lobe
1341                                           EC Inforct
1737    LACUNAR INFARCT IN BASAL GANGLIA  THALAMIC BIL...
1773                                    Subdural Hematoma
Name: CVDIMAGX, dtype: object

In [86]:
df.query('~CVDIMAGX.isnull()')['CVDIMAGX'].count()

46

`NACCID` is a patient identifier and `CVDIMAGX` contains only 46 scattered free-text comments. Neither is useful as a predictor, so both should be removed.

In [105]:
# numeric_only skips the two object columns, which have no variance defined
variance = df.var(numeric_only=True)
zero_var = list(variance[variance == 0].index)
zero_var

['NACCTMCI', 'IMPNOMCI']

In [106]:
df[zero_var].head()

| | NACCTMCI | IMPNOMCI |
|---|---|---|
| 0 | 8 | 0 |
| 1 | 8 | 0 |
| 2 | 8 | 0 |
| 3 | 8 | 0 |
| 4 | 8 | 0 |


These variables have 0 variance and should also be removed.
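
As a cross-check on the variance test, constant columns can also be found by counting distinct values, which works for non-numeric columns too. This is a minimal sketch on a toy frame; the helper name `constant_columns` and the `NOTE` column are hypothetical.

```python
import pandas as pd

def constant_columns(frame: pd.DataFrame) -> list:
    """List columns that hold a single distinct value (NaN counts as a value)."""
    n_unique = frame.nunique(dropna=False)
    return list(n_unique[n_unique <= 1].index)

# Toy frame mirroring the two constant variables found above
toy = pd.DataFrame({
    'NACCTMCI': [8, 8, 8, 8],
    'IMPNOMCI': [0, 0, 0, 0],
    'SEX': [1, 2, 1, 1],
    'NOTE': ['x', 'x', 'x', 'x'],  # hypothetical constant text column
})
const_cols = constant_columns(toy)
print(const_cols)
```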

In [116]:
df.drop(columns=objects + zero_var, inplace=True)

Once the 4 irrelevant variables are dropped, check for missing data. 

In [121]:
df.isnull().sum().sum()

0

The dataset appears to have no missing data, but only because missing entries are numerically encoded as `-4` or as a variable-specific placeholder, neither of which `isnull()` detects.
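
One way to expose the hidden missing values is to replace the placeholder codes with `NaN` before counting. The sketch below runs on a toy frame rather than the real data. The helper `mask_placeholders` and the `COLUMN_MISSING` mapping are assumptions for illustration; only the `-4` and `888.8` codes come from these notes, and the full list of variable-specific codes would have to be taken from the NACC Researcher's Data Dictionary.

```python
import numpy as np
import pandas as pd

GLOBAL_MISSING = -4
# Hypothetical per-variable mapping; only the NACCBMI code is taken from the notes
COLUMN_MISSING = {'NACCBMI': [888.8]}

def mask_placeholders(frame, global_code, column_codes):
    """Return a copy of frame with placeholder codes replaced by NaN."""
    out = frame.replace(global_code, np.nan)
    for col, codes in column_codes.items():
        if col in out.columns:
            out[col] = out[col].replace(codes, np.nan)
    return out

# Toy frame (values invented for illustration)
toy = pd.DataFrame({
    'NACCAM': [0, -4, 9, -4],
    'NACCBMI': [24.1, 888.8, 30.2, -4.0],
})
cleaned = mask_placeholders(toy, GLOBAL_MISSING, COLUMN_MISSING)
print(cleaned.isnull().sum())
```

Integer columns become `float64` once they contain `NaN`, which is worth keeping in mind when the dtype counts are revisited.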

In [123]:
df.iloc[:10,:10]

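| | SEX | RACE | EDUC | MARISTAT | NACCLIVS | INDEPEND | RESIDENC | NACCFAM | NACCAM | NACCFM |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 18 | 1 | 2 | 1 | 1 | 1 | 0 | 0 |
| 1 | 2 | 1 | 16 | 1 | 2 | 2 | 1 | 1 | -4 | -4 |
| 2 | 1 | 1 | 16 | 1 | 2 | 1 | 1 | 1 | -4 | -4 |
| 3 | 1 | 1 | 16 | 1 | 2 | 1 | 1 | 0 | -4 | -4 |
| 4 | 1 | 1 | 12 | 2 | 3 | 2 | 1 | 9 | -4 | -4 |
| 5 | 1 | 50 | 4 | 1 | 2 | 2 | 1 | 9 | -4 | -4 |
| 6 | 2 | 1 | 14 | 1 | 2 | 1 | 1 | 9 | 9 | 9 |
| 7 | 2 | 1 | 14 | 1 | 2 | 1 | 1 | 0 | 0 | 0 |
| 8 | 2 | 1 | 14 | 3 | 1 | 1 | 1 | 1 | 9 | 9 |
| 9 | 2 | 2 | 12 | 2 | 1 | 1 | 1 | 0 | -4 | -4 |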


In [125]:
df['outcome'].value_counts()

1    18209
2     1683
Name: outcome, dtype: int64

In [127]:
df['outcome'].value_counts(normalize=True)*100

1    91.539312
2     8.460688
Name: outcome, dtype: float64

The target variable `outcome` is highly imbalanced: only 8.5% of patients have Alzheimer's disease. This is therefore an imbalanced binary classification problem.
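
To give a feel for how strong the imbalance is, the sketch below derives per-class weights from the counts reported above, using the common "balanced" heuristic `n_samples / (n_classes * class_count)`. It also recodes the `{1, 2}` labels to `{0, 1}`, which many classifiers expect. Both steps are illustrative assumptions about later modelling, not decisions made in this notebook.

```python
import pandas as pd

# Class counts reported by value_counts() above: 1 = no AD, 2 = AD
counts = pd.Series({1: 18209, 2: 1683})

n_samples = counts.sum()
n_classes = len(counts)
# "Balanced" weighting: the rarer class receives the larger weight
class_weight = n_samples / (n_classes * counts)
print(class_weight.round(3))

# Recode the target so that 1 marks AD and 0 marks everything else
toy_outcome = pd.Series([1, 1, 2, 1, 2])  # toy labels, not real patients
recoded = toy_outcome.map({1: 0, 2: 1})
print(recoded.tolist())
```

With these counts the AD class is weighted roughly eleven times more heavily than the non-AD class, which mirrors the ratio 18209 / 1683.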