# Computer Science Capstone
## By: Igor Jancan

### The purpose of this project is to demonstrate competency in using a dataset together with machine learning to perform data analytics and predictions. I have chosen a dataset of about 2000 patients, some of whom have been diagnosed with Alzheimer's and some of whom have not. The dataset includes various demographic information, as well as the results of medical tests. We will use these demographics and test results as training data for a machine learning model that predicts whether or not a patient can be diagnosed with Alzheimer's (i.e. the non-descriptive method).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
import seaborn as sns
%matplotlib inline

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score

## Loading, Exploring and Cleaning Data

In [2]:
from IPython.display import display
out=widgets.Output()
data = pd.read_csv("alzheimers_disease_data.csv")

rawData = widgets.Button(
    description='Show Data',
    disabled=False,
    button_style='info',
    tooltip='Show Data'
)
hideData = widgets.Button(
    description = 'Hide Data',
    button_style = 'info',
    tooltip='Hide Data')

display(rawData,hideData,out)
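def showData(b):
    # Re-read the CSV so the button always shows the raw, unmodified data
    raw = pd.read_csv("alzheimers_disease_data.csv")
    # Clear first so repeated clicks do not stack copies of the table
    out.clear_output()
    with out:
        display(raw)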
def clearData(b):
    out.clear_output()
rawData.on_click(showData)
hideData.on_click(clearData)

    


### We can do some preliminary data exploration by using a scatter plot to compare age and MMSE scores in both Alzheimer's positive and negative patients. As the plot below shows, Alzheimer's negative patients tend to score higher on the MMSE exam than patients with a positive diagnosis. Age alone does not seem to affect MMSE scores in either group.

In [3]:
outScatter=widgets.Output()


showScatter = widgets.Button(
    description='Show Plot',
    disabled=False,
    button_style='info',
    tooltip='Show Plot'
)
hideScatter = widgets.Button(
    description = 'Hide Plot',
    button_style = 'info',
    tooltip='Hide Plot')

display(showScatter,hideScatter,outScatter)
def showScatterPlot(b):
    outScatter.clear_output()
    with outScatter:
        plt.figure(figsize=(8,6))
        plt.scatter(data.Age[data.Diagnosis==1],data.MMSE[data.Diagnosis==1], c="red")
        plt.scatter(data.Age[data.Diagnosis==0],data.MMSE[data.Diagnosis==0], c="blue")
        plt.title("MMSE scores by age in Alzheimer's positive and negative groups")
        plt.xlabel("Age")
        plt.ylabel("MMSE Score")
        plt.legend(["Alzheimer's Positive", "Alzheimer's Negative"])
        plt.show()
        
def clearScatterPlot(b):
    outScatter.clear_output()
showScatter.on_click(showScatterPlot)
hideScatter.on_click(clearScatterPlot)
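
The visual impression from the scatter plot can also be checked numerically with a grouped summary. The sketch below is only an illustration: it uses a small made-up DataFrame (the values are not taken from the dataset) that reuses the `Age`, `MMSE` and `Diagnosis` column names, and `summarize_mmse` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def summarize_mmse(df):
    """Summarize MMSE per diagnosis group and add the within-group Age/MMSE correlation."""
    summary = df.groupby("Diagnosis")["MMSE"].agg(["count", "mean", "median"])
    # Pearson correlation between Age and MMSE, computed separately for each group
    corr = {label: group["Age"].corr(group["MMSE"]) for label, group in df.groupby("Diagnosis")}
    summary["age_mmse_corr"] = pd.Series(corr)
    return summary

# Made-up illustrative values, NOT rows from alzheimers_disease_data.csv
toy = pd.DataFrame({
    "Age": [60, 65, 70, 75, 80, 85],
    "MMSE": [28, 12, 26, 10, 27, 9],
    "Diagnosis": [0, 1, 0, 1, 0, 1],
})

summary = summarize_mmse(toy)
print(summary)
```

On the real data, the same call would be `summarize_mmse(data)`. A clearly higher mean for `Diagnosis == 0` and correlations near zero would support the reading of the plot.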



### Because we do not want to include PatientID and DoctorInCharge columns in our modeling, we drop them.

In [4]:
data = data.drop(['PatientID', 'DoctorInCharge'], axis=1)

### We will also make a correlation matrix to see how our variables correlate with each other. In the matrix below, the standout variables for a positive diagnosis are Memory Complaints and Behavioral Problems.

In [5]:
outCorr=widgets.Output()


showCorr = widgets.Button(
    description='Show Matrix',
    disabled=False,
    button_style='info',
    tooltip='Show Matrix'
)
hideCorr = widgets.Button(
    description = 'Hide Matrix',
    button_style = 'info',
    tooltip='Hide Matrix')

display(showCorr,hideCorr)
display(outCorr)
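def showMatrix(b):
    outCorr.clear_output()
    correlation_matrix = data.corr()
    figure, ax = plt.subplots(figsize=(20,15))
    # Draw the heatmap on the axes we just created
    sns.heatmap(correlation_matrix, annot=True, linewidths=0.3, fmt=".2f", cmap="rocket_r", ax=ax)
    # Close the figure so it is only rendered inside the output widget and is not kept open
    plt.close(figure)
    with outCorr:
        display(figure)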
        
def clearMatrix(b):
    outCorr.clear_output()
showCorr.on_click(showMatrix)
hideCorr.on_click(clearMatrix)
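
A large heatmap can be hard to scan, so it may help to pull out just the column of correlations with the target and sort it. The following is a minimal sketch under stated assumptions: the toy values are made up for illustration, and `rank_by_target_correlation` is a hypothetical helper rather than anything from the original notebook.

```python
import pandas as pd

def rank_by_target_correlation(df, target="Diagnosis"):
    """Return each feature's Pearson correlation with the target, strongest first."""
    correlations = df.corr()[target].drop(target)
    # Order by absolute value so strong negative correlations are not buried
    order = correlations.abs().sort_values(ascending=False).index
    return correlations.reindex(order)

# Made-up illustrative values, NOT rows from alzheimers_disease_data.csv
toy = pd.DataFrame({
    "MemoryComplaints": [0, 1, 0, 1, 0, 1, 0, 1],
    "Age": [61, 70, 88, 65, 74, 90, 66, 79],
    "Diagnosis": [0, 1, 0, 1, 0, 1, 0, 0],
})

ranked = rank_by_target_correlation(toy)
print(ranked)
```

Applied to the notebook's `data`, the top entries of the returned Series would show which variables stand out for a positive diagnosis without reading the full matrix.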



### We're going to take one more step to clean the data. Some of our variables (namely, Ethnicity and EducationLevel) are categorical and non-binary, but are stored as numbers in our data. We do not want our model to treat those categorical variables as numeric, because this could lead to improper fitting of the model. We use the OneHotEncoder class of Scikit-learn to transform each of these variables into a set of binary indicator columns. We will also prepare our data for modeling by splitting it into training and test sets.

In [6]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
X = data.drop("Diagnosis", axis = 1)
y = data["Diagnosis"]
categories = ["Ethnicity", "EducationLevel"]
encoder = OneHotEncoder()
transformer = ColumnTransformer([("encoder",encoder,categories)],remainder="passthrough")
transformed_data = transformer.fit_transform(X)
X  = pd.DataFrame(transformed_data)


## Modeling


### Finally, we're going to fit a model to our cleaned data. We have chosen a Random Forest classifier because our data contains both numeric and categorical features.

In [7]:

np.random.seed(5)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
model = RandomForestClassifier()
model.fit(X_train, y_train)
model_score = model.score(X_test, y_test)
model_score

0.9395348837209302

### As we can see, our model receives an accuracy score of ~0.9395 on the test set. This could indicate that there is a very strong relationship between our features and the diagnosis, or that our model is overfitting. Next, we will evaluate our model.
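
One common way to tell these two explanations apart is to compare training accuracy with test accuracy and to check how stable the score is under cross-validation. The sketch below is a hedged illustration only: it runs on synthetic data from `make_classification`, not on the Alzheimer's dataset, and the variable names (`X_demo`, `clf`, and so on) are placeholders chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data; replace with the notebook's X and y to run the real check
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=5
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=5
)

clf = RandomForestClassifier(n_estimators=100, random_state=5)
clf.fit(X_tr, y_tr)

# A large gap between these two numbers is a typical sign of overfitting
train_acc = clf.score(X_tr, y_tr)
test_acc = clf.score(X_te, y_te)

# 5-fold cross-validation shows how much the score depends on the particular split
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=5), X_demo, y_demo, cv=5
)

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"cv accuracy:    {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

If the cross-validated mean stays close to the single test score and the spread across folds is small, the high accuracy is less likely to be an artifact of one favorable split.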