## Section 1: Import Data
This lab uses the [Stroke Prediction Dataset](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset) from [Kaggle](https://www.kaggle.com).  

To work with the data in Python, you first need to import the CSV file into a pandas DataFrame.

In [1]:
#Import pandas package
import pandas as pd

#Read in stroke data
stroke_data = pd.read_csv('healthcare-dataset-stroke-data.csv')

#Display first 10 records of the data
stroke_data.head(10)

,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
5,56669,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1
6,53882,Male,74.0,1,1,Yes,Private,Rural,70.09,27.4,never smoked,1
7,10434,Female,69.0,0,0,No,Private,Urban,94.39,22.8,never smoked,1
8,27419,Female,59.0,0,0,Yes,Private,Rural,76.15,,Unknown,1
9,60491,Female,78.0,0,0,Yes,Private,Urban,58.57,24.2,Unknown,1


Below are some summary statistics about the **stroke** column, which is what you will be trying to predict.

In [2]:
#Count the total records in the dataset
total_records = stroke_data.shape[0]  # Get the number of rows in the DataFrame
print('There are {:,} records in the stroke dataset.'.format(total_records)) 

# Create a summary DataFrame by grouping the data based on the 'stroke' column
summary = pd.DataFrame(stroke_data.groupby('stroke').size()).rename(columns={0:'Count'})
# Calculate the percentage of each group by dividing the count of each group by the total number of records
summary['Percent'] = summary['Count'] / total_records
summary

There are 5,110 records in the stroke dataset.


stroke,Count,Percent
0,4861,0.951272
1,249,0.048728


Only about 4.9% of the records are stroke cases, so the dataset is heavily unbalanced. There are a number of techniques to deal with [unbalanced datasets](https://medium.com/strands-tech-corner/unbalanced-datasets-what-to-do-144e0552d9cd).  

We will use the **true positive rate** to assess the performance of your predictions.
<center>
    $\text{True Positive Rate} = \frac{TP}{TP+FN}$
</center>  

where 
* $TP$ is the number of true positive predictions (actual value = stroke; predicted value = stroke)
* $FN$ is the number of false negative predictions (actual value = stroke; predicted value = no stroke)

This value, also called [sensitivity](https://en.wikipedia.org/wiki/Sensitivity_and_specificity) or [recall](https://en.wikipedia.org/wiki/Precision_and_recall), measures how well a model captures actual stroke cases.  
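To see why accuracy alone would be misleading on this dataset, here is a small sketch that scores a hypothetical "model" that predicts "no stroke" for every patient. The class counts come from the summary table above; the always-negative model is an assumption made purely for illustration.

```python
# Class counts taken from the summary table above
no_stroke, stroke = 4861, 249

# A hypothetical "model" that predicts "no stroke" for every patient
true_positives = 0          # it never predicts a stroke
false_negatives = stroke    # so every actual stroke is missed
true_negatives = no_stroke  # every non-stroke patient is labelled correctly

accuracy = (true_positives + true_negatives) / (no_stroke + stroke)
true_positive_rate = true_positives / (true_positives + false_negatives)

print('Accuracy: {:.1%}'.format(accuracy))
print('True positive rate: {:.1%}'.format(true_positive_rate))
```

The accuracy is about 95.1% even though the model never identifies a single stroke, while the true positive rate of 0.0% exposes the problem.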

Assuming medical interventions are relatively cheap (e.g., recommending weight loss or exercise to a patient in danger of a stroke), it is better to have the occasional false positive than to miss patients at high risk of a stroke.

The goal is to build a model that identifies factors indicating a high likelihood of having a stroke, so interventions can hopefully prevent the stroke *before* it happens.  

Because the goal is prediction, we need to evaluate the model on data that it has never seen before.  This can be done by splitting our data into a training dataset (used to train and evaluate the model as it is being built) and a test dataset (held out until the final evaluation).

### Data Cleanup
Categorical variables that will be used to predict strokes need to be converted to [dummy variables](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)).  

We also need to replace missing data in the BMI column.  
We will simply use the average BMI to replace the missing values.
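Before applying these two steps to the stroke data, here is a minimal sketch of both on a made-up DataFrame. The `toy` data and the `smoker` and `smoker_yes` column names are hypothetical and not part of the stroke dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the stroke dataset
toy = pd.DataFrame({
    'smoker': ['yes', 'no', 'no', 'yes'],
    'bmi': [30.0, None, 22.0, 26.0],
})

# One 0/1 column per category; drop_first=True drops the first category
# ('no'), since it is implied whenever the 'yes' column is 0
dummies = pd.get_dummies(toy['smoker'], drop_first=True, dtype=int)
toy['smoker_yes'] = dummies['yes']

# Replace the missing BMI with the mean of the observed values
toy['bmi'] = toy['bmi'].fillna(toy['bmi'].mean())

print(toy)
```

The missing BMI becomes 26.0, the mean of the three observed values, and `smoker_yes` is 1 only for the rows where `smoker` is 'yes'.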

In [5]:
#Gender
#First, to make things easier, remove the one record with gender 'Other'.
#.copy() avoids a SettingWithCopyWarning when new columns are added below.
stroke_data = stroke_data[stroke_data['gender'] != 'Other'].copy()
#Add new column 'male': 1 = male; 0 = female
stroke_data['male'] = pd.get_dummies(stroke_data['gender'], drop_first=True, dtype=int)['Male']

#Residence Type
#Add new column 'urban': 1 = urban; 0 = rural
stroke_data['urban'] = pd.get_dummies(stroke_data['Residence_type'], drop_first=True, dtype=int)['Urban']

#Married
#Add new column 'married': 1 = ever married; 0 = never married
stroke_data['married'] = pd.get_dummies(stroke_data['ever_married'], drop_first=True, dtype=int)['Yes']

#Smoking Status
#Add one 0/1 column per smoking status; drop_first=True drops the first category ('Unknown')
smoking_dummies = pd.get_dummies(stroke_data['smoking_status'], drop_first=True, dtype=int)
stroke_data = pd.concat([stroke_data, smoking_dummies], axis=1)

In [6]:
#Replace Missing BMI with average BMI
bmi_average = stroke_data['bmi'].mean()
stroke_data['bmi'] = stroke_data['bmi'].fillna(bmi_average)

## Section 2: Create a Test Dataset
Now, let's split your dataset into two datasets:
* Training Data: Used to train your model to identify important predictors of stroke
* Test Data: Reserved to evaluate the model on new, unseen data

We will use the [scikit-learn](https://scikit-learn.org/stable/index.html) package in Python, since it has many tools for machine learning, including data preparation tools. 

Let's take a look at the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function.  

Inputs to *train_test_split*:
 * **arrays**: One or more arrays or DataFrames to split.  You can pass the entire dataset (predictors and output together) or two separate arrays: X (predictors) and y (output).  If you pass two arrays, they must have the same number of rows, and each row of X must correspond to the same row of y.
 * **test_size**: Value between 0 and 1 that indicates the proportion of data to be reserved for the test dataset (defaults to 0.25 if train_size is None).
 * **train_size**: Value between 0 and 1 that indicates the proportion of data to be used for the training dataset (complement of test_size if test_size is set and this value is None).
 * **random_state**: Seed value for randomizing the data split.  Set it to an integer to make the split reproducible.
 * **shuffle**: Whether to shuffle the data before splitting (defaults to True).
 * **stratify**: Class labels to use for [stratified sampling](https://en.wikipedia.org/wiki/Stratified_sampling), so that each split keeps the same class proportions (defaults to None).
 
Since our dataset has unbalanced output classes, you want to be sure to use the **stratify** option.
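As a rough illustration of what **stratify** does, the sketch below splits a made-up dataset in which 10% of the rows belong to the positive class. The `toy` DataFrame, its column names, and the `random_state` value are assumptions chosen for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 10% of them in the positive class
toy = pd.DataFrame({'x': range(100), 'label': [1] * 10 + [0] * 90})

# Stratify on the label so both splits keep the 10% positive rate
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy['label'], random_state=0
)

print('Positive rate in training split: {:.1%}'.format(toy_train['label'].mean()))
print('Positive rate in test split: {:.1%}'.format(toy_test['label'].mean()))
```

Both splits keep the 10% positive rate. Without stratification, a small test split could end up with too few positive records, or none at all.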

In [7]:
#Import train_test_split function from scikit-learn package
from sklearn.model_selection import train_test_split

#Use 80% of the data for training; stratify on 'stroke' so both splits keep the same stroke rate
train, test = train_test_split(stroke_data, train_size=0.8, stratify=stroke_data['stroke'])

In [8]:
#Count the total records in the training dataset
training_records = train.shape[0]
print('There are {:,} records in the training dataset.'.format(training_records))

train_summary = pd.DataFrame(train.groupby('stroke').size()).rename(columns={0:'Count'})
train_summary['Percent'] = train_summary['Count'] / training_records
train_summary

There are 4,087 records in the training dataset.


stroke,Count,Percent
0,3888,0.951309
1,199,0.048691


In [9]:
#Count the total records in the test dataset
test_records = test.shape[0]
print('There are {:,} records in the test dataset.'.format(test_records))

test_summary = pd.DataFrame(test.groupby('stroke').size()).rename(columns={0:'Count'})
test_summary['Percent'] = test_summary['Count'] / test_records
test_summary

There are 1,022 records in the test dataset.


stroke,Count,Percent
0,972,0.951076
1,50,0.048924


In [10]:
#Confirm the split did what you expected!
#Use the current row count, since one record was removed after total_records was calculated
print('There are {:.1%} of all records in training dataset.'.format(training_records/len(stroke_data)))

There are 80.0% of all records in training dataset.


## Section 3: Tune Models Using Validation Data
A validation dataset lets you compare algorithms and settings and choose the best ones for your dataset without touching the test data.  
This is *another* split of the data, this time using the training dataset.

Using the training dataset, you can again use the *train_test_split* function to create two new datasets:
 * train_final: The final dataset used to train your models
 * validation: The dataset used to evaluate and tune your models

In [11]:
#Split the training data into training/validation data using a 75%/25% split
#Be sure to use stratified sampling!
train_final, validation = train_test_split(train, train_size=0.75, stratify=train['stroke'])
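
Combined with the earlier 80%/20% split, the 75%/25% split above leaves roughly 60% of all records for final training, 20% for validation, and 20% for testing. The arithmetic can be checked with a short sketch (the variable names are illustrative only):

```python
# Shares of the full dataset after the two successive splits
train_share = 0.8                        # first split: training vs. test
test_share = 1 - train_share
train_final_share = train_share * 0.75   # second split: 75% of training data
validation_share = train_share * 0.25    # second split: 25% of training data

print('train_final: {:.0%}'.format(train_final_share))
print('validation:  {:.0%}'.format(validation_share))
print('test:        {:.0%}'.format(test_share))
```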

Now we can finally build a binary classification model to predict strokes.  

You will be using the [K-Nearest Neighbors](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) algorithm to build your model.

In [12]:
#Import KNN model function from scikit-learn
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5)

#Remove 'id', the original text (categorical) columns, and the 'stroke' output from the features (predictors)
features = ~train_final.columns.isin(['id','gender','ever_married','work_type','Residence_type','smoking_status','stroke'])
feature_columns = train_final.columns[features]
model.fit(train_final[feature_columns],train_final['stroke'])


#Import true positive rate (recall) function
from sklearn.metrics import recall_score

#Predict output for training dataset
train_predict = model.predict(train_final[feature_columns])

tpr_train = recall_score(train_final['stroke'],train_predict)
print('The true positive rate for the training dataset is {:.3%}.'.format(tpr_train))

The true positive rate for the training dataset is 15.436%.


In [13]:
#Predict output for validation dataset
validation_predict = model.predict(validation[feature_columns])

tpr_validation = recall_score(validation['stroke'],validation_predict)
print('The true positive rate for the validation dataset is {:.3%}.'.format(tpr_validation))

The true positive rate for the validation dataset is 0.000%.


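The model performs poorly on both datasets: it captures only about 15% of the stroke cases in the training dataset and none of them in the validation dataset.  With only about 50 stroke cases in the validation dataset, a single split also gives a noisy estimate of performance.  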

### Cross-Validation
Instead of using a single static validation dataset, you can try cross-validation.  [Cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) splits the training dataset into *k* folds, builds a model on a temporary training dataset consisting of *k-1* folds, and evaluates that model on the remaining fold.  The process is repeated *k* times, holding a different fold of the training data out each time.  Then the validation metric is averaged across all *k* models.
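
The procedure can be sketched by hand on synthetic data. The example below is only an illustration: the data come from scikit-learn's `make_classification`, not from the stroke dataset, and the choice of 5 folds and 5 neighbors is an assumption. It computes recall on each held-out fold, averages the results, and compares the average with scikit-learn's `cross_val_score`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, unbalanced data (hypothetical; not the stroke dataset)
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

folds = StratifiedKFold(n_splits=5)
fold_recalls = []
for train_idx, valid_idx in folds.split(X, y):
    # Train on k-1 folds, then score on the held-out fold
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[valid_idx])
    fold_recalls.append(recall_score(y[valid_idx], predictions))

manual_mean = np.mean(fold_recalls)

# The same calculation using scikit-learn's helper function
library_mean = cross_val_score(
    KNeighborsClassifier(n_neighbors=5), X, y, cv=folds, scoring='recall'
).mean()

print('Manual mean recall:  {:.3%}'.format(manual_mean))
print('Library mean recall: {:.3%}'.format(library_mean))
```

The two averages should agree, because both use the same folds and the same metric.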

In [14]:
from sklearn.model_selection import GridSearchCV

#KNN parameters to test
parameters = {'n_neighbors': [1,2,5,10,15,100], 'weights': ['uniform','distance']}

#Initialize model
knn = KNeighborsClassifier()
#Set up a grid search for the best hyperparameters using 5-fold cross-validation
grid_search = GridSearchCV(knn, parameters, cv=5, scoring='recall')
#Fit the grid search using the full training dataset
grid_search.fit(train[feature_columns],train['stroke'])

In [15]:
#Show the best true positive rate results of the grid search
print('The best CV true positive rate from the grid search was {:.3%}.'.format(grid_search.best_score_))

The best CV true positive rate from the grid search was 15.590%.


In [16]:
#Show the best model parameters
grid_search.best_params_

{'n_neighbors': 1, 'weights': 'uniform'}
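
Besides `best_score_` and `best_params_`, a fitted grid search keeps the scores for every parameter combination in its `cv_results_` attribute. The following sketch, run on synthetic data rather than the stroke dataset and with a smaller, illustrative parameter grid, shows one way to view them as a table.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, unbalanced data (hypothetical; not the stroke dataset)
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    {'n_neighbors': [1, 5, 15], 'weights': ['uniform', 'distance']},
    cv=5,
    scoring='recall',
)
search.fit(X, y)

# One row per parameter combination, best-ranked first
columns = ['param_n_neighbors', 'param_weights', 'mean_test_score', 'rank_test_score']
results = pd.DataFrame(search.cv_results_)[columns]
print(results.sort_values('rank_test_score'))
```

Looking at the full table shows how close the runner-up combinations are to the winner, which a single best score cannot tell you.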

## Section 4: Evaluate the Best Model on Unseen Data
Finally, you can see how well the "best" model predicts strokes for the test dataset.  This simulates how well the model will do in the real world against totally new, unseen data.

In [17]:
#Build your "best" model using the best parameters from the grid search
knn_best = KNeighborsClassifier(n_neighbors=1, weights='uniform')
knn_best.fit(train[feature_columns],train['stroke'])

#Calculate training true positive rate
training_predict = knn_best.predict(train[feature_columns])
training_tpr = recall_score(train['stroke'], training_predict)

print('The true positive rate for the training dataset is {:.1%}.'.format(training_tpr))

The true positive rate for the training dataset is 100.0%.


So this model performs perfectly on the training dataset!  Whenever you see a perfect training score, you should be skeptical.  It is very likely that you are dealing with [overfitting](https://en.wikipedia.org/wiki/Overfitting), where the model has learned the training dataset *too* well.  This generally means that the model will not generalize well to real-world data.
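
With `n_neighbors=1`, a perfect training score is expected rather than impressive: when the model predicts a record it was trained on, that record's nearest neighbor is the record itself. The sketch below illustrates this with made-up data in which the labels are pure random noise (the data, seed, and sizes are assumptions for the example), so there is nothing real to learn.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: random features and random labels with no relationship
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)

# Fit a 1-nearest-neighbor model and score it on its own training data
one_nn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
training_recall = recall_score(y, one_nn.predict(X))

print('Training true positive rate: {:.1%}'.format(training_recall))
```

The training true positive rate is 100% even though the labels are random, so this number says nothing about how the model will perform on new data.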

So now you can use this same model to predict strokes using the test dataset and see how the true positive rate compares to the training set. 

In [18]:
#Calculate test true positive rate
test_predict = knn_best.predict(test[feature_columns])
test_tpr = recall_score(test['stroke'], test_predict)

print('The true positive rate for the test dataset is {:.1%}.'.format(test_tpr))

The true positive rate for the test dataset is 16.0%.
