## CS439: Final Project
### Detecting Alzheimer's Disease from Handwriting

In [92]:
# Importing libraries
import pandas as pd
import numpy as np

**Step 1**: Load the dataset into a pandas DataFrame.

In [93]:
# Load dataset
og_df = pd.read_csv("data.csv")
og_df.head()

Unnamed: 0,ID,air_time1,disp_index1,gmrt_in_air1,gmrt_on_paper1,max_x_extension1,max_y_extension1,mean_acc_in_air1,mean_acc_on_paper1,mean_gmrt1,...,mean_jerk_in_air25,mean_jerk_on_paper25,mean_speed_in_air25,mean_speed_on_paper25,num_of_pendown25,paper_time25,pressure_mean25,pressure_var25,total_time25,class
0,id_1,5160,1.3e-05,120.804174,86.853334,957,6601,0.3618,0.217459,103.828754,...,0.141434,0.024471,5.596487,3.184589,71,40120,1749.278166,296102.7676,144605,P
1,id_2,51980,1.6e-05,115.318238,83.448681,1694,6998,0.272513,0.14488,99.383459,...,0.049663,0.018368,1.665973,0.950249,129,126700,1504.768272,278744.285,298640,P
2,id_3,2600,1e-05,229.933997,172.761858,2333,5802,0.38702,0.181342,201.347928,...,0.178194,0.017174,4.000781,2.392521,74,45480,1431.443492,144411.7055,79025,P
3,id_4,2130,1e-05,369.403342,183.193104,1756,8159,0.556879,0.164502,276.298223,...,0.113905,0.01986,4.206746,1.613522,123,67945,1465.843329,230184.7154,181220,P
4,id_5,2310,7e-06,257.997131,111.275889,987,4732,0.266077,0.145104,184.63651,...,0.121782,0.020872,3.319036,1.680629,92,37285,1841.702561,158290.0255,72575,P


In [94]:
# Get the shape, dtypes, and summary info for reference
og_df.shape

(174, 452)

In [95]:
og_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174 entries, 0 to 173
Columns: 452 entries, ID to class
dtypes: float64(300), int64(150), object(2)
memory usage: 614.6+ KB


In [96]:
og_df.dtypes

ID                  object
air_time1            int64
disp_index1        float64
gmrt_in_air1       float64
gmrt_on_paper1     float64
                    ...   
paper_time25         int64
pressure_mean25    float64
pressure_var25     float64
total_time25         int64
class               object
Length: 452, dtype: object

In [97]:
# Check for any missing values
og_df.isnull().sum()

ID                 0
air_time1          0
disp_index1        0
gmrt_in_air1       0
gmrt_on_paper1     0
                  ..
paper_time25       0
pressure_mean25    0
pressure_var25     0
total_time25       0
class              0
Length: 452, dtype: int64

**Step 2**: Select the columns needed for the detection models.
We use trials 1, 2, and 6, since each is a different type of writing test (M, G, C) that participants completed.
Within each trial, we focus on six features that we believe matter most for the model's predictions:
1. Mean jerk on paper (MJP)
2. Mean jerk in air (MJA)
3. Mean speed on paper (MSP)
4. Mean speed in air (MSA)
5. Total time (TT)
6. Pressure mean (PM)

We also need the `class` column, which labels each person as either a patient (P) or healthy (H).

In [98]:
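# Encode the 'class' labels as integers: patient (P) -> 0, healthy (H) -> 1
# Note: re-running this cell maps the already-encoded values to NaN, so run it once per load
og_df['class'] = og_df['class'].map({'P' : 0, 'H' : 1})
# Pairwise correlations between all numeric columns (including the encoded class)
corr_matrix = np.array(og_df.corr(numeric_only=True))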

In [99]:
# Select relevant features: MJP, MJA, MSP, MSA, TT, PM, plus Class (label: patient (0) or healthy (1))
trial1 = pd.DataFrame(og_df[['mean_jerk_on_paper1', 'mean_jerk_in_air1', 'mean_speed_on_paper1',
                             'mean_speed_in_air1', 'total_time1', 'pressure_mean1', 'class']])

trial2 = pd.DataFrame(og_df[['mean_jerk_on_paper2', 'mean_jerk_in_air2', 'mean_speed_on_paper2',
                             'mean_speed_in_air2', 'total_time2', 'pressure_mean2', 'class']])

trial6 = pd.DataFrame(og_df[['mean_jerk_on_paper6', 'mean_jerk_in_air6', 'mean_speed_on_paper6',
                             'mean_speed_in_air6', 'total_time6', 'pressure_mean6', 'class']])

# Rename all column headers to the short feature names
trial1.columns = trial2.columns = trial6.columns = ['MJP', 'MJA', 'MSP', 'MSA', 'TT', 'PM', 'Class']
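
The three selections above follow the same pattern, so they could also be generated from the trial number. The following is a minimal sketch, assuming the `<feature><trial>` column naming seen in the dataset; `FEATURES`, `select_trial`, and the tiny `demo_df` (a made-up stand-in for `og_df`) are hypothetical names used only for illustration.

```python
import pandas as pd

# Map each full feature-name prefix to the short header used in this notebook.
FEATURES = {
    "mean_jerk_on_paper": "MJP",
    "mean_jerk_in_air": "MJA",
    "mean_speed_on_paper": "MSP",
    "mean_speed_in_air": "MSA",
    "total_time": "TT",
    "pressure_mean": "PM",
}

def select_trial(df, trial):
    """Return the six selected features of one trial plus the label column."""
    cols = [f"{prefix}{trial}" for prefix in FEATURES]
    out = df[cols + ["class"]].copy()
    out.columns = list(FEATURES.values()) + ["Class"]
    return out

# Tiny stand-in for og_df (made-up values), with columns for trials 1 and 2 only.
demo_df = pd.DataFrame(
    {f"{prefix}{t}": [0.1 * t, 0.2 * t] for t in (1, 2) for prefix in FEATURES}
)
demo_df["class"] = [0, 1]

demo_trial2 = select_trial(demo_df, 2)
print(demo_trial2.columns.tolist())
```

With the real data, `select_trial(og_df, 6)` would be expected to produce the same frame as `trial6` above, provided the `class` column has already been encoded.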


**Step 3**: Modeling
We implement two models:
1. Naive Bayes
2. Random Forest

In [None]:
# building models

from sklearn.model_selection import train_test_split

# TRIAL 1

# Split data into features & target
X = trial1.drop('Class', axis=1)
y = trial1['Class']

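# Make training & test sets: 30% of the data goes to the test set, 70% to training.
# random_state fixes the shuffle so the split is reproducible (any integer works);
# stratify=y keeps the patient/healthy ratio the same in both sets.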
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Train a Gaussian Naive Bayes classifier (assumes each feature is normally distributed within each class)
from sklearn.naive_bayes import GaussianNB

model1 = GaussianNB()
model1.fit(X_train, y_train)

# Make prediction
y_pred = model1.predict(X_test)

# analysis
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Classification Report (precision, recall, f1-score)
print("Classification Report:\n", classification_report(y_test, y_pred))

- The model correctly classifies a participant as healthy or patient about 62% of the time (33 of the 53 test samples).
- In the confusion matrix, with healthy (class 1) as the positive class, there were 16 TP (true positives: healthy people correctly classified as healthy), 10 FN (false negatives: healthy people predicted as patients), 10 FP (false positives: patients misclassified as healthy), and 17 TN (true negatives: patients correctly classified as patients).
- In the classification report:
    - Precision: 62% of the people predicted healthy were actually healthy, and 63% of the people predicted to be patients were actually patients.
    - Recall: the model correctly identified 62% of the actual healthy people and 63% of the actual patients.
    - f1-score: 62-63% for both H and P, so performance is balanced across the two classes.
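
The percentages in the report can be reproduced by hand from the four confusion-matrix counts. The sketch below does that arithmetic in plain Python, using the counts quoted above and treating healthy (class 1) as the positive class; the variable names and the `f1` helper are illustrative only.

```python
# Counts from the confusion matrix above, with healthy (class 1) as the positive class.
tp, fn, fp, tn = 16, 10, 10, 17

total = tp + fn + fp + tn
accuracy = (tp + tn) / total

# Healthy (class 1): precision looks at predicted-healthy, recall at actually-healthy.
precision_h = tp / (tp + fp)
recall_h = tp / (tp + fn)

# Patient (class 0): the roles of the counts swap.
precision_p = tn / (tn + fn)
recall_p = tn / (tn + fp)

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(f"test samples: {total}")
print(f"accuracy:     {accuracy:.2f}")
print(f"healthy:      precision {precision_h:.2f}, recall {recall_h:.2f}, f1 {f1(precision_h, recall_h):.2f}")
print(f"patient:      precision {precision_p:.2f}, recall {recall_p:.2f}, f1 {f1(precision_p, recall_p):.2f}")
```

Because FP and FN happen to be equal here (10 each), precision and recall coincide within each class, which is why the f1-scores match them as well.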

**Step 4**: Comparing the Two Models
Given the results of the two models, we now compare their accuracies and look at where their predictions differ. For the visualizations, we use matplotlib and seaborn.
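
One way to set up such a comparison is to fit both classifiers on the same stratified split and collect their test accuracies side by side. The sketch below is only an illustration: it assumes the second model is scikit-learn's `RandomForestClassifier` with default-style settings, and it uses synthetic data from `make_classification` (174 rows, 6 features) in place of the trial data, so the printed numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for one trial's data: 174 rows, 6 features, binary label.
X_demo, y_demo = make_classification(
    n_samples=174, n_features=6, n_informative=4, n_redundant=0, random_state=42
)

# Same split settings as the Naive Bayes cell: 70/30, stratified, fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Hypothetical model settings; tune these for the real data.
models = {
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model on the same training set and score it on the same test set.
accuracies = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    accuracies[name] = accuracy_score(y_te, model.predict(X_te))

for name, acc in accuracies.items():
    print(f"{name}: {acc:.3f}")
```

Reusing one split for both models keeps the comparison fair, since any accuracy difference then comes from the models rather than from which rows landed in the test set.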

In [100]:
# Import libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns
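
As an example of the kind of plot this step could produce, the sketch below draws the trial 1 Naive Bayes confusion matrix as a heatmap. It uses the counts reported in Step 3 and plain matplotlib to keep dependencies minimal; the layout, labels, and colormap are assumptions rather than the project's actual figure.

```python
import matplotlib.pyplot as plt
import numpy as np

# Rows are true labels, columns are predicted labels, ordered 0 = patient, 1 = healthy.
cm = np.array([[17, 10],
               [10, 16]])
labels = ["Patient (0)", "Healthy (1)"]

fig, ax = plt.subplots(figsize=(4, 4))
image = ax.imshow(cm, cmap="Blues")

# Label both axes with the class names.
ax.set_xticks([0, 1])
ax.set_xticklabels(labels)
ax.set_yticks([0, 1])
ax.set_yticklabels(labels)
ax.set_xlabel("Predicted")
ax.set_ylabel("True")
ax.set_title("Naive Bayes, trial 1")

# Write each count inside its cell.
for i in range(2):
    for j in range(2):
        ax.text(j, i, str(cm[i, j]), ha="center", va="center")

fig.colorbar(image, ax=ax)
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell. With seaborn, `sns.heatmap(cm, annot=True, fmt="d")` would give a similar annotated grid in a single call.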