This notebook gives a basic overview of the dataset and the features it contains.

## **Data Source**

- **Primary Dataset:** [**Kaggle Alzheimer’s Disease Risk Prediction Dataset**](https://www.kaggle.com/competitions/alzheimers-disease-risk-prediction-eu-business/data). This dataset includes health-related features for predicting an Alzheimer’s Disease (AD) diagnosis, which is coded as either 0 (No Alzheimer’s Disease) or 1 (Alzheimer’s Disease).


## **Data Description**

The [**Kaggle Alzheimer’s Disease Risk Prediction Dataset**](https://www.kaggle.com/competitions/alzheimers-disease-risk-prediction-eu-business/data) contains extensive health information for 2,149 patients, each uniquely identified. The data is ideal for researchers and data scientists looking to explore factors associated with Alzheimer's, develop predictive models, and conduct statistical analyses.
It includes demographic details, lifestyle factors, medical history, clinical measurements, cognitive and functional assessments, symptoms, and a diagnosis of Alzheimer's Disease. The dataset is divided into training and testing subsets with an 80-20 split:

- Training Data: 1,719 patient records (80%)
- Testing Data: 430 patient records (20%)
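
The two subset sizes follow from applying the 80-20 ratio to 2,149 records. As a minimal sketch, assuming a simple random split (how the competition actually produced its split is not stated here), the counts can be reproduced on a placeholder frame. The helper `split_80_20` and the seed are illustrative names, not part of the dataset or its tooling.

```python
import pandas as pd


def split_80_20(df: pd.DataFrame, seed: int = 0):
    """Randomly split df into 80% train and 20% test (illustrative helper)."""
    train = df.sample(frac=0.8, random_state=seed)  # draws round(0.8 * len(df)) rows
    test = df.drop(train.index)  # every row not drawn for training
    return train, test


# Placeholder frame with one row per patient; the real columns are not needed here.
patients = pd.DataFrame({"PatientID": range(2149)})
train, test = split_80_20(patients)
print(len(train), len(test))
```

For a classification target such as `Diagnosis`, a stratified split is often preferred so that both subsets keep a similar class balance; the sketch above does not do that.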

There are 34 features, including both categorical and numerical variables. The variables are described below:

- **Demographic Details**
    - `Age`: The age of the patients ranges from 60 to 90 years.
    - `Gender`: Gender of the patients, where 0 represents Male and 1 represents Female.
    - `Ethnicity`: The ethnicity of the patients, coded as follows:
        - 0: Caucasian
        - 1: African American
        - 2: Asian
        - 3: Other
    - `EducationLevel`: The education level of the patients, coded as an integer from 0 to 3.
- **Lifestyle Factors**
    - `BMI`: Body Mass Index of the patients, ranging from 15 to 40.
    - `Smoking`: Smoking status, where 0 indicates No and 1 indicates Yes.
    - `AlcoholConsumption`: Weekly alcohol consumption in units, ranging from 0 to 20.
    - `PhysicalActivity`: Weekly physical activity in hours, ranging from 0 to 10.
    - `DietQuality`: Diet quality score, ranging from 0 to 10.
    - `SleepQuality`: Sleep quality score, ranging from 4 to 10.
- **Medical History**
    - `FamilyHistoryAlzheimers`: Family history of Alzheimer's Disease, where 0 indicates No and 1 indicates Yes.
    - `CardiovascularDisease`: Presence of cardiovascular disease, where 0 indicates No and 1 indicates Yes.
    - `Diabetes`: Presence of diabetes, where 0 indicates No and 1 indicates Yes.
    - `Depression`: Presence of depression, where 0 indicates No and 1 indicates Yes.
    - `HeadInjury`: History of head injury, where 0 indicates No and 1 indicates Yes.
    - `Hypertension`: Presence of hypertension, where 0 indicates No and 1 indicates Yes.
- **Clinical Measurements**
    - `SystolicBP`: Systolic blood pressure, ranging from 90 to 180 mmHg.
    - `DiastolicBP`: Diastolic blood pressure, ranging from 60 to 120 mmHg.
    - `CholesterolTotal`: Total cholesterol levels, ranging from 150 to 300 mg/dL.
    - `CholesterolLDL`: Low-density lipoprotein cholesterol levels, ranging from 50 to 200 mg/dL.
    - `CholesterolHDL`: High-density lipoprotein cholesterol levels, ranging from 20 to 100 mg/dL.
    - `CholesterolTriglycerides`: Triglycerides levels, ranging from 50 to 400 mg/dL.
- **Cognitive and Functional Assessments**
    - `MMSE`: Mini-Mental State Examination score, ranging from 0 to 30. Lower scores indicate cognitive impairment.
    - `FunctionalAssessment`: Functional assessment score, ranging from 0 to 10. Lower scores indicate greater impairment.
    - `MemoryComplaints`: Presence of memory complaints, where 0 indicates No and 1 indicates Yes.
    - `BehavioralProblems`: Presence of behavioral problems, where 0 indicates No and 1 indicates Yes.
    - `ADL`: Activities of Daily Living score, ranging from 0 to 10. Lower scores indicate greater impairment.
- **Symptoms**
    - `Confusion`: Presence of confusion, where 0 indicates No and 1 indicates Yes.
    - `Disorientation`: Presence of disorientation, where 0 indicates No and 1 indicates Yes.
    - `PersonalityChanges`: Presence of personality changes, where 0 indicates No and 1 indicates Yes.
    - `DifficultyCompletingTasks`: Presence of difficulty completing tasks, where 0 indicates No and 1 indicates Yes.
    - `Forgetfulness`: Presence of forgetfulness, where 0 indicates No and 1 indicates Yes.
- **Diagnosis Information**
    - `Diagnosis`: Diagnosis status for Alzheimer's Disease, where 0 indicates No and 1 indicates Yes.
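
The codes and ranges listed above can double as a lightweight data dictionary. The following is a sketch only: it maps the documented `Gender` and `Ethnicity` codes to labels and counts values that fall outside a few of the documented ranges. The helper names, the added `GenderLabel` and `EthnicityLabel` columns, and the sample rows are all hypothetical.

```python
import pandas as pd

# Code-to-label mappings copied from the variable descriptions above.
GENDER_LABELS = {0: "Male", 1: "Female"}
ETHNICITY_LABELS = {0: "Caucasian", 1: "African American", 2: "Asian", 3: "Other"}

# Documented (min, max) ranges for a few of the numeric variables.
EXPECTED_RANGES = {"Age": (60, 90), "BMI": (15, 40), "MMSE": (0, 30)}


def decode_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with readable label columns added (hypothetical helper)."""
    out = df.copy()
    out["GenderLabel"] = out["Gender"].map(GENDER_LABELS)
    out["EthnicityLabel"] = out["Ethnicity"].map(ETHNICITY_LABELS)
    return out


def out_of_range_counts(df: pd.DataFrame) -> dict:
    """Count values outside the documented (inclusive) range of each checked column."""
    return {
        col: int((~df[col].between(lo, hi)).sum())
        for col, (lo, hi) in EXPECTED_RANGES.items()
    }


# Tiny made-up sample, not real patient records; the second Age is deliberately too high.
sample = pd.DataFrame(
    {
        "Age": [67, 95],
        "Gender": [0, 1],
        "Ethnicity": [1, 3],
        "BMI": [29.7, 22.9],
        "MMSE": [21.5, 30.0],
    }
)
decoded = decode_categories(sample)
violations = out_of_range_counts(sample)
```

Keeping the mappings in plain dictionaries means the same definitions can be reused for plot labels and for sanity checks on new data.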

In [1]:
# Import packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


In [None]:
# Load the pre-split training set
df_train = pd.read_csv("../../data/train_set.csv")

In [3]:
df_train.head()

Unnamed: 0,PatientID,Age,Gender,Ethnicity,EducationLevel,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,...,MemoryComplaints,BehavioralProblems,ADL,Confusion,Disorientation,PersonalityChanges,DifficultyCompletingTasks,Forgetfulness,Diagnosis,DoctorInCharge
0,5631,67,0,1,2,29.726811,1,15.642599,4.605988,7.871526,...,1,0,9.105791,0,0,0,0,0,0,XXXConfid
1,6545,76,1,1,2,36.169103,1,11.030414,9.534553,7.25409,...,1,0,4.725688,0,0,0,0,1,1,XXXConfid
2,5015,81,0,1,0,22.923111,0,9.314832,8.917378,3.807813,...,0,1,8.681801,0,0,0,0,0,1,XXXConfid
3,5800,90,0,1,3,31.430904,0,0.996496,7.108725,5.32861,...,0,0,7.80541,0,0,0,0,0,0,XXXConfid
4,5688,89,0,1,3,39.570099,0,1.5767,5.712014,1.026138,...,0,0,1.307295,0,0,0,0,1,1,XXXConfid
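
In the preview above, `DoctorInCharge` holds the same placeholder value, `XXXConfid`, in every displayed row, and `PatientID` is an identifier rather than a predictor. The sketch below shows one way such columns could be detected and set aside before modeling. The helper name and the decision to drop these columns are assumptions, and since only five rows are visible, the check should be rerun on the full data.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) == 1]


# Three columns copied from the first rows of the preview above.
preview = pd.DataFrame(
    {
        "PatientID": [5631, 6545, 5015],
        "Age": [67, 76, 81],
        "DoctorInCharge": ["XXXConfid", "XXXConfid", "XXXConfid"],
    }
)
to_drop = constant_columns(preview) + ["PatientID"]  # constant columns plus the identifier
features = preview.drop(columns=to_drop)
```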


In [4]:
df_train.describe()

Unnamed: 0,PatientID,Age,Gender,Ethnicity,EducationLevel,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,...,FunctionalAssessment,MemoryComplaints,BehavioralProblems,ADL,Confusion,Disorientation,PersonalityChanges,DifficultyCompletingTasks,Forgetfulness,Diagnosis
count,1375.0,1375.0,1375.0,1375.0,1375.0,1375.0,1375.0,1375.0,1375.0,1375.0,...,1375.0,1375.0,1375.0,1375.0,1375.0,1375.0,1375.0,1375.0,1375.0,1375.0
mean,5824.682909,74.854545,0.499636,0.685818,1.296727,27.651994,0.288727,9.9974,4.921046,4.976219,...,5.066274,0.206545,0.155636,4.943081,0.197091,0.157818,0.149091,0.171636,0.298909,0.353455
std,621.959288,9.093678,0.500182,0.983219,0.906444,7.257941,0.453336,5.731307,2.841409,2.898691,...,2.893335,0.404974,0.362642,2.94778,0.397946,0.364703,0.356308,0.377201,0.457946,0.478216
min,4751.0,60.0,0.0,0.0,0.0,15.008851,0.0,0.002003,0.003616,0.009385,...,0.00046,0.0,0.0,0.001288,0.0,0.0,0.0,0.0,0.0,0.0
25%,5295.5,67.0,0.0,0.0,1.0,21.349474,0.0,5.220899,2.552605,2.436207,...,2.560246,0.0,0.0,2.286136,0.0,0.0,0.0,0.0,0.0,0.0
50%,5819.0,75.0,0.0,0.0,1.0,27.805269,0.0,9.844113,4.856851,5.056105,...,5.092445,0.0,0.0,4.998029,0.0,0.0,0.0,0.0,0.0,0.0
75%,6357.5,83.0,1.0,1.0,2.0,33.856798,1.0,15.237754,7.319822,7.472882,...,7.496824,0.0,0.0,7.558082,0.0,0.0,0.0,0.0,1.0,1.0
max,6899.0,90.0,1.0,3.0,3.0,39.988513,1.0,19.985622,9.987429,9.997203,...,9.996467,1.0,1.0,9.999747,1.0,1.0,1.0,1.0,1.0,1.0
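
Because `Diagnosis` is coded 0/1, its mean in the summary (0.353455) is the share of positive cases, roughly 35% of the 1,375 records. The short sketch below illustrates that relationship. The class counts of 486 positive and 889 negative records are back-calculated from the mean and count shown above, not read from the file.

```python
import pandas as pd

# Rebuild a 0/1 column with the class counts implied by the summary table
# (assumed: 486 positives and 889 negatives, 1,375 records in total).
diagnosis = pd.Series([1] * 486 + [0] * 889, name="Diagnosis")

positive_share = diagnosis.mean()  # mean of a 0/1 column equals the fraction of 1s
class_shares = diagnosis.value_counts(normalize=True)  # share of each class

print(round(positive_share, 6))
print(class_shares)
```

With about one third of records in the positive class, the target is moderately imbalanced, which is worth keeping in mind when choosing evaluation metrics.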


In [5]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1375 entries, 0 to 1374
Data columns (total 35 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   PatientID                  1375 non-null   int64  
 1   Age                        1375 non-null   int64  
 2   Gender                     1375 non-null   int64  
 3   Ethnicity                  1375 non-null   int64  
 4   EducationLevel             1375 non-null   int64  
 5   BMI                        1375 non-null   float64
 6   Smoking                    1375 non-null   int64  
 7   AlcoholConsumption         1375 non-null   float64
 8   PhysicalActivity           1375 non-null   float64
 9   DietQuality                1375 non-null   float64
 10  SleepQuality               1375 non-null   float64
 11  FamilyHistoryAlzheimers    1375 non-null   int64  
 12  CardiovascularDisease      1375 non-null   int64  
 13  Diabetes                   1375 non-null   int64

**The training data has no missing values: each field shown above is populated for all 1,375 records in `df_train`, so little data cleaning should be needed when working with this dataset alone.**
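
The missing-value check can also be made explicit instead of being read off the `info()` listing. This is a minimal sketch on a made-up frame; `missing_summary` is a hypothetical helper name.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, largest first."""
    return df.isna().sum().sort_values(ascending=False)


# Small made-up frame with one gap, to show what a column with missing data looks like.
toy = pd.DataFrame({"Age": [67, 76, 81], "BMI": [29.7, None, 22.9]})
summary = missing_summary(toy)
is_clean = bool((summary == 0).all())  # True only if no column has missing values
```

Applied to the real data, `missing_summary(df_train)` should return zero for every column if the statement above holds.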