# **Heart Attack: K-Nearest Neighbors**

## **Intro** 

### K-Nearest Neighbors (KNN) is a supervised learning classification algorithm. It is supervised because we start with data that is already labeled and learn from that data. We begin by choosing a number, $k$. "Training" the model then amounts to storing the labeled training data, which we can picture as "plotting" each training point and marking it with its true label. That's it. Very simple and easy.

### To classify a new point, we plot the new, unlabeled point and look up its $k$ nearest neighbors among the training points we plotted earlier. The labels of those neighbors then "vote" on what the label of the new point should be, and the majority wins.

![knn.png](attachment:knn.png)
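### The voting idea can be sketched from scratch in a few lines. This is only an illustration, not how scikit-learn implements it: the helper name `knn_predict` and the toy points are made up for this example, and Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Each neighbor "votes" with its label; the most common label wins
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data: two small groups in 2-D (made-up values)
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(X_toy, y_toy, np.array([1.8, 1.8]), k=3)
pred_high = knn_predict(X_toy, y_toy, np.array([6.2, 6.4]), k=3)
print(pred_low, pred_high)
```

### The first query point sits inside the group labeled 0 and the second inside the group labeled 1, so each one takes the label of the group around it.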

### Something to keep in mind: the size of $k$ matters for accuracy and bias. The smaller $k$ is, the more likely a single outlier is to affect the classification of a new data point. The larger $k$ is, the more likely the neighborhood is to span multiple groups, which makes the model less powerful as a classification tool.
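### A small, made-up example can make this trade-off concrete. The points below are invented for illustration: four points labeled 0 near the origin, five points labeled 1 far away, and one outlier labeled 1 sitting inside the 0 group.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up toy data (not the heart dataset)
X_toy = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],              # label 0
    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0], [6.0, 6.0], [5.5, 5.5],  # label 1
    [0.4, 0.4],                                                  # outlier, label 1
])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# A new point that clearly sits inside the 0 group
x_new = np.array([[0.5, 0.5]])

preds = {}
for k in (1, 3, 10):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_toy, y_toy)
    preds[k] = int(model.predict(x_new)[0])
    print(f"k={k}: predicted label {preds[k]}")
```

### With $k=1$ the single nearest neighbor is the outlier, so the new point is labeled 1. With $k=3$ the two nearby 0s outvote the outlier. With $k=10$ every training point votes, so the larger class wins no matter where the new point is.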

### Import Libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import seaborn as sns



### Import the data and preprocess it. Make sure you run this cell; it is the same code as in the data exploration notebook. We are just getting the data into the same form so we can continue to work with it.

In [3]:
# Import the dataset
df = pd.read_csv('heart.csv')

# Display the first few rows of the dataset
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


## **Splitting the Data**

### Mention how this came up in module 2, but go into a little more detail here as it is a key aspect of machine learning going forward

### For machine learning, we need to split our data into train and test subsets. We first use the train subset to teach the model which trends to look for. Then we use the test subset, which the model has not seen, to verify that it learned the correct trends.

### Discussion on why we do 80/20 split

### The first thing we have to do is separate out the labels we count as "truth", which in our case is the target column. All of the other columns are our "features": the inputs the machine learning model takes and tries to learn from.

In [4]:
# Define the target variable (heart attack)
y = df["target"]

# Define the predictor variables (features) we will train with
basic = df[["age", "chol", "thalach"]]

# Split the data into train and test sets using scikit-learn's built-in helper
X_train, X_test, y_train, y_test = train_test_split(basic, y, test_size=0.20, random_state=0)

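### Let's check whether the two classes are reasonably balanced in the training set. If one class is significantly over-represented, we can re-split the data with a different `random_state` (re-running the cell with the same `random_state=0` reproduces the same split), or pass `stratify=y` to `train_test_split` to preserve the class proportions.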

In [5]:
print(y_train.value_counts())

target
1    131
0    111
Name: count, dtype: int64
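### One way to avoid relying on chance for a balanced split is the `stratify` argument of `train_test_split`. The sketch below uses made-up stand-in data (`X_demo`, `y_demo`, with 60% positive labels), not the heart dataset, to show that both subsets keep the same class proportions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in data: 100 rows, 60 labeled 1 and 40 labeled 0
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.normal(size=100)})
y_demo = pd.Series([1] * 60 + [0] * 40, name="target")

# stratify=y_demo asks for the same class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo
)

train_share = y_tr.mean()  # fraction of 1s in the training subset
test_share = y_te.mean()   # fraction of 1s in the test subset
print(f"share of 1s in train: {train_share:.2f}, in test: {test_share:.2f}")
```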


## **Creating a KNN Model**

### What are we doing here 

### Let's build the model using 10 neighbors!

In [17]:
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
knn_predicted = knn.predict(X_test)
knn_conf_matrix = confusion_matrix(y_test, knn_predicted)


### Introduce AUROC HERE

print("Confusion Matrix: \n", knn_conf_matrix, '\n')

print("Classification Report: \n", classification_report(y_test,knn_predicted))

Confusion Matrix: 
 [[24  3]
 [ 4 30]] 

Classification Report: 
               precision    recall  f1-score   support

           0       0.86      0.89      0.87        27
           1       0.91      0.88      0.90        34

    accuracy                           0.89        61
   macro avg       0.88      0.89      0.88        61
weighted avg       0.89      0.89      0.89        61
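### AUROC, mentioned above, is the area under the ROC curve. It measures how well the model's predicted probabilities rank true 1s above true 0s: 0.5 corresponds to random ranking and 1.0 to a perfect ranking. The sketch below is one possible way to compute it with `roc_auc_score`, using synthetic stand-in data from `make_classification` rather than the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

model = KNeighborsClassifier(n_neighbors=10).fit(X_tr, y_tr)

# For KNN, predict_proba gives the share of neighbors voting for each class;
# column 1 holds the share voting for the positive class
proba_pos = model.predict_proba(X_te)[:, 1]

auroc = roc_auc_score(y_te, proba_pos)
print(f"AUROC: {auroc:.2f}")
```

### Note that AUROC is computed from the predicted probabilities, not from the hard 0/1 predictions used for the confusion matrix.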



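### Poor results with KNN can come from unscaled features. KNN decides which points are neighbors by distance, so a feature with a large numeric range (such as `chol`) can drown out one with a small range. This concept comes up again and again with distance-based methods (k-means, SVM), which we will see later, so try to understand it early!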

### After scaling, every feature has mean 0 and standard deviation 1 on the training set, so each feature contributes on a comparable scale to the distance calculation.
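### A quick worked example shows why scaling matters for distances. The four "patients" below are made-up values with two columns, age (years) and chol (mg/dl). Patient 1 differs from patient 0 mostly in age, and patient 2 differs mostly in chol.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up patients: columns are age (years) and chol (mg/dl)
X_demo = np.array([[40.0, 200.0],
                   [70.0, 205.0],   # 30 years older, similar chol
                   [41.0, 260.0],   # similar age, chol 60 higher
                   [55.0, 230.0]])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Distances from patient 0 in the raw units
raw_to_1 = dist(X_demo[0], X_demo[1])
raw_to_2 = dist(X_demo[0], X_demo[2])

# Distances from patient 0 after standardizing each column
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_to_1 = dist(X_scaled[0], X_scaled[1])
scaled_to_2 = dist(X_scaled[0], X_scaled[2])

print(f"raw:    to patient 1 = {raw_to_1:.1f}, to patient 2 = {raw_to_2:.1f}")
print(f"scaled: to patient 1 = {scaled_to_1:.2f}, to patient 2 = {scaled_to_2:.2f}")
```

### In raw units the 60-point chol gap makes patient 2 look about twice as far away as patient 1, even though a 30-year age gap is arguably just as large a difference. After scaling, the two distances are nearly equal.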

In [7]:
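# Instantiate a StandardScaler object
scaler = StandardScaler()

# Fit the scaler on the training data only, then apply the same
# transformation to the test data so no test information leaks into training
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)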

### Discussion of workflow of scaling when using training and testing sets to avoid data leakage
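### One way to keep this workflow leak-free without tracking the scaler by hand is to bundle the scaler and the classifier in a scikit-learn pipeline. The sketch below uses synthetic stand-in data from `make_classification`, not the heart dataset; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# fit() learns the scaling statistics from the training data only;
# score() and predict() reuse those same statistics on new data
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
scaler_means = model.named_steps["standardscaler"].mean_
print(f"Test accuracy: {test_accuracy:.2f}")
```

### The means stored in the pipeline's scaler are those of the training subset, which confirms that the test rows played no part in fitting it.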

### Let's now fit the model on the scaled data.

In [8]:
# Instantiate a knn classifier using 10 neighbors
knn = KNeighborsClassifier(n_neighbors=10)

# Train the model using our scaled training data
knn.fit(X_train, y_train)

# Make predictions on the scaled test data with the trained model
knn_predicted = knn.predict(X_test)

# Calculate the confusion matrix and the accuracy
knn_conf_matrix = confusion_matrix(y_test, knn_predicted)
knn_acc_score = accuracy_score(y_test, knn_predicted)

print("Confusion matrix")
print(knn_conf_matrix)
print("\n")
print("Accuracy of K-NeighborsClassifier:", knn_acc_score * 100, "\n")
print(classification_report(y_test, knn_predicted))

Confusion matrix
[[21  6]
 [ 9 25]]


Accuracy of K-NeighborsClassifier: 75.40983606557377 

              precision    recall  f1-score   support

           0       0.70      0.78      0.74        27
           1       0.81      0.74      0.77        34

    accuracy                           0.75        61
   macro avg       0.75      0.76      0.75        61
weighted avg       0.76      0.75      0.75        61



### Now let's create another model, but this time using the entire set of features.

In [9]:
y = df["target"]
feats = df.drop('target',axis=1)
X_train, X_test, y_train, y_test = train_test_split(feats, y, test_size=0.20, random_state=0)

In [11]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [12]:
# Instantiate a knn classifier using 10 neighbors
knn = KNeighborsClassifier(n_neighbors=10)

# Train the model using our scaled training data
knn.fit(X_train, y_train)

# Make predictions on the scaled test data with the trained model
knn_predicted = knn.predict(X_test)

# Calculate the confusion matrix and the accuracy
knn_conf_matrix = confusion_matrix(y_test, knn_predicted)
knn_acc_score = accuracy_score(y_test, knn_predicted)

print("Confusion matrix")
print(knn_conf_matrix)
print("\n")
print("Accuracy of K-NeighborsClassifier:", knn_acc_score * 100, "\n")
print(classification_report(y_test, knn_predicted))

Confusion matrix
[[24  3]
 [ 4 30]]


Accuracy of K-NeighborsClassifier: 88.52459016393442 

              precision    recall  f1-score   support

           0       0.86      0.89      0.87        27
           1       0.91      0.88      0.90        34

    accuracy                           0.89        61
   macro avg       0.88      0.89      0.88        61
weighted avg       0.89      0.89      0.89        61
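### The numbers in the classification report can be recomputed by hand from the confusion matrix, which helps when reading them. The sketch below copies the matrix from the output above; in scikit-learn's convention the rows are the true labels and the columns are the predicted labels.

```python
import numpy as np

# Confusion matrix copied from the output above
# rows: true label (0, 1); columns: predicted label (0, 1)
cm = np.array([[24, 3],
               [4, 30]])

# For a 2x2 matrix, ravel() yields the cells in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that were correct
precision_1 = tp / (tp + fp)      # of the predicted 1s, how many were truly 1
recall_1 = tp / (tp + fn)         # of the true 1s, how many were found
precision_0 = tn / (tn + fn)      # of the predicted 0s, how many were truly 0
recall_0 = tn / (tn + fp)         # of the true 0s, how many were found

print(f"accuracy:  {accuracy:.2f}")
print(f"class 0:   precision {precision_0:.2f}, recall {recall_0:.2f}")
print(f"class 1:   precision {precision_1:.2f}, recall {recall_1:.2f}")
```

### These values agree with the report: 54 of the 61 test cases are classified correctly, and the 4 false negatives are true heart attack cases that the model missed.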



### Discussion of these results in detail 