
# Week 6 Lab: use kNN to predict heart disease
In this lab, we work with a dataset containing records of patients with and without heart disease: `target = 1` means the patient has heart disease, and `target = 0` means the patient does not. Our goals are the following:
- Load the dataset, check for notable data issues (e.g. missing data), and preprocess the data as needed
- Split the dataset 80/20, then train a kNN model with the default settings in scikit-learn
- Use cross validation instead of the simple train-test split, and report a variety of performance metrics
- Tune kNN to improve performance: distance function, combination function, feature scaling, choice of k

In [None]:
# These lines are for suppressing the warnings from Python
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)

### Load dataset
The dataset is stored in a csv file. We first load the dataset, then create X, y to store features and labels.

In [None]:
import pandas as pd

#read in the data using pandas (the path assumes Google Drive is mounted, see the mount cell below)
df = pd.read_csv('/content/drive/MyDrive/heart.csv')

#check data has been read in properly
df.head()

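| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |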


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#check number of rows and columns in dataset
df.shape

(303, 14)

In [None]:
#check for missing data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
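
The `df.info()` output reports 303 non-null entries for every column, so nothing is missing here. A more direct check is to count missing values per column. The sketch below illustrates this on a small hypothetical frame (the values and the planted gaps are made up for illustration; it is not the heart dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the heart dataset) with two gaps planted on purpose
toy = pd.DataFrame({
    "age": [63, 37, np.nan, 56],
    "chol": [233, 250, 204, np.nan],
    "target": [1, 1, 0, 0],
})

# Count missing entries per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Total number of missing cells in the frame
total_missing = int(missing_per_column.sum())
print("Total missing values:", total_missing)
```

Applied to `df`, the same `isna().sum()` call should report zero for every column, consistent with the `info()` output.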


We next create separate dataframes to store features (predictor variables) and label (target variable to classify).

In [None]:
#create a dataframe with all training data except the target column
X = df.drop(columns=['target'])

#check that the target variable has been removed
X.head()
X.shape

(303, 13)

In [None]:
#separate target values
y = df['target'].values
y.shape

#view target values
y[:5]

array([1, 1, 1, 1, 1])

### Fit a kNN with default setting
We start with the default settings and train a kNN model as a benchmark.

`class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)`

As we can tell from the default parameters, this kNN model is set up as follows:
- k is equal to 5
- Distance function is Euclidean distance
- Votes from 5 neighbors are combined with unweighted voting
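
To make these three defaults concrete, here is a minimal from-scratch sketch of a kNN prediction with Euclidean distance and unweighted majority voting. The function `knn_predict` and the toy arrays are hypothetical names introduced for illustration; this is not how scikit-learn implements the classifier internally (for example, tie-breaking and the neighbor search differ).

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Predict the label of x_query by unweighted majority vote of its k nearest neighbors."""
    # Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Each neighbor casts one vote of equal weight
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.1, 0.1]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([5.0, 5.1]), k=3)
print(pred_a, pred_b)
```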

In [None]:
from sklearn.model_selection import train_test_split

#split dataset into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=123)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Create KNN classifier
knn = KNeighborsClassifier()

# Fit the classifier to the data
knn.fit(X_train,y_train)

#### Evaluate training and testing accuracy
A common performance metric is the classification accuracy on the testing dataset. On the fitted kNN model, the `score()` method returns this accuracy.

Alternatively, the accuracy can be computed with the `accuracy_score` function in the `sklearn.metrics` module, which offers a collection of common performance metrics for ML models.

In [None]:
#check accuracy of our model on the test data
knn.score(X_test, y_test)

0.639344262295082

In [None]:
# alternatively, we can compute accuracy as follows
from sklearn.metrics import accuracy_score
y_pred_test = knn.predict(X_test)
print(accuracy_score(y_test, y_pred_test))

0.639344262295082


In [None]:
print ("Training accuracy of the 5-NN model is: ", knn.score(X_train, y_train))
print ("Testing accuracy of the 5-NN model is: ", knn.score(X_test, y_test))

Training accuracy of the 5-NN model is:  0.768595041322314
Testing accuracy of the 5-NN model is:  0.639344262295082


#### Evaluate other performance metrics on training and testing datasets
We next utilize a few functions from the `sklearn.metrics` module to evaluate additional performance metrics.

- Confusion matrix: `sklearn.metrics.confusion_matrix(y_true, y_pred, *, labels=None, sample_weight=None, normalize=None)`
- Precision: `sklearn.metrics.precision_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')`
- Recall: `sklearn.metrics.recall_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')`

In [None]:
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_pred_train = knn.predict(X_train)

print ("Confusion matrix on the training dataset is: ", confusion_matrix(y_train, y_pred_train))
print ("Precision on the training dataset is: ", precision_score(y_train, y_pred_train))
print ("Recall on the training dataset is: ", recall_score(y_train, y_pred_train))

Confusion matrix on the training dataset is:  [[ 78  30]
 [ 26 108]]
Precision on the training dataset is:  0.782608695652174
Recall on the training dataset is:  0.8059701492537313


In [None]:
# Reminder: y_pred_test contains the predicted labels from the kNN model
print ("Confusion matrix on the testing dataset is: ", confusion_matrix(y_test, y_pred_test))
print ("Precision on the testing dataset is: ", precision_score(y_test, y_pred_test))
print ("Recall on the testing dataset is: ", recall_score(y_test, y_pred_test))

Confusion matrix on the testing dataset is:  [[20 10]
 [12 19]]
Precision on the testing dataset is:  0.6551724137931034
Recall on the testing dataset is:  0.6129032258064516
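
The testing metrics above can be recomputed by hand from the confusion matrix, which helps clarify what each number means. In scikit-learn's convention, rows are true labels and columns are predicted labels, so for a binary problem the flattened matrix reads TN, FP, FN, TP. The sketch below uses the testing confusion matrix printed above.

```python
import numpy as np

# Testing confusion matrix reported above; rows are true labels, columns are predicted labels
cm = np.array([[20, 10],
               [12, 19]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are truly positive
recall = tp / (tp + fn)           # share of actual positives that the model finds

print(f"accuracy={accuracy:.4f}, precision={precision:.4f}, recall={recall:.4f}")
```

The three values agree with the testing accuracy, precision, and recall reported by `score()`, `precision_score`, and `recall_score`.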


### Cross validation
The train-test-split method is often called ‘holdout’. Cross-validation is generally more reliable than a single holdout split because the holdout score depends on how the data happens to be split into train and test sets. Cross-validation evaluates the model on multiple splits, so we get a better idea of how the model will perform on unseen data.

In the following, we will use the same testing dataset created in previous steps for testing, and use 5-fold cross validation with the other 80 percent.

Question: in each iteration of 5-fold cross validation, what percentage of the entire dataset is used for training? For validation?

We will use the function `sklearn.model_selection.cross_val_score(estimator, X, y=None, *, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)` to report accuracy (default option), precision (specify with `scoring = 'precision'`) and recall (specify with `scoring = 'recall'`).
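
To see what `cross_val_score` does under the hood, the sketch below writes the fold loop out by hand and compares it with the library call. It assumes synthetic data from `make_classification` as a stand-in (not the heart dataset). For a classifier and an integer `cv`, scikit-learn splits with `StratifiedKFold`, so the manual loop uses the same splitter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the heart dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

model = KNeighborsClassifier()
folds = StratifiedKFold(n_splits=5)  # splitter used for classifiers when cv=5

manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    # Fit on four folds, score accuracy on the held-out fold
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(KNeighborsClassifier(), X_demo, y_demo, cv=5)

print(manual_scores)
print(library_scores)
```

Both arrays should contain the same five accuracy values, one per held-out fold.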

In [None]:
from sklearn.model_selection import cross_val_score
import numpy as np

cv_scores = cross_val_score(knn,X_train,y_train,cv=5)

#print accuracy from all 5 iterations
print(cv_scores)
print('cv_scores mean:{}'.format(np.mean(cv_scores)))

[0.65306122 0.71428571 0.72916667 0.52083333 0.70833333]
cv_scores mean:0.6651360544217687


In [None]:
# obtain and print recall values from all 5 iterations
cv_recall = cross_val_score(knn,X_train,y_train,cv=5, scoring = 'recall')
print (cv_recall)

[0.72       0.56       0.58333333 0.79166667 0.75       0.70833333
 0.5        0.54166667 0.70833333 0.70833333]


In [None]:
# obtain and print precision values from all 5 iterations
cv_precision = cross_val_score(knn,X_train,y_train,cv=5, scoring = 'precision')
print (cv_precision)

[0.65625    0.74074074 0.70588235 0.5625     0.7       ]


We observe from these evaluation metrics that the 5-NN model is likely underfitting. We next use cross validation to tune kNN in order to improve performance.

### Use cross validation to tune kNN: k, distance function, combination function
Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes.

We next apply one approach for parameter search implemented in scikit-learn: `sklearn.model_selection.GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)`.

We specify the parameters to search over as a grid of parameter values; `GridSearchCV` then exhaustively evaluates every combination in the grid and selects the best one with respect to the chosen scoring rule, e.g. classification accuracy.

To apply the function, we need to create an ML model (kNN in our case) to pass as the `estimator` and a dictionary of parameter choices to pass as the `param_grid`. We also need to specify the desired evaluation metric with the `scoring` option, e.g. `scoring = 'accuracy'`.
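
Because the search is exhaustive, its cost grows with the product of the list lengths in the grid. The sketch below counts the candidates for the grid used in this lab with `ParameterGrid`, the helper scikit-learn provides for enumerating a grid. The fit count assumes 5-fold cross validation and ignores the final refit on the full training data.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# The grid used for tuning in this lab
param_grid = {'n_neighbors': np.arange(1, 30),
              'weights': ['uniform', 'distance'],
              'metric': ['minkowski', 'manhattan']}

n_candidates = len(ParameterGrid(param_grid))  # 29 * 2 * 2 combinations
n_fits = n_candidates * 5                      # each candidate is fitted once per fold with cv=5

print(n_candidates, n_fits)
```

For kNN on a few hundred rows this is cheap, but for slower models the same grid size can become expensive.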

In [None]:
from sklearn.model_selection import GridSearchCV

#create a new knn model
knn_tune = KNeighborsClassifier()

#create a dictionary of parameters to test
param_grid = {'n_neighbors': np.arange(1, 30),
             'weights':['uniform','distance'],
             'metric':['minkowski','manhattan']}

#use gridsearch to test all values for the parameters
knn_gscv = GridSearchCV(knn_tune, param_grid, cv=5,scoring='accuracy')
# other options: scoring='precision', scoring='recall', etc.

#fit model to training and validation data
knn_gscv.fit(X_train, y_train)

In [None]:
#check top performing parameter values
knn_gscv.best_params_

{'metric': 'manhattan', 'n_neighbors': 24, 'weights': 'uniform'}

In [None]:
#check mean score for the top performing parameter values
knn_gscv.best_score_

0.7355442176870748

We create a new kNN model with the optimal parameter setting obtained above. Note that we use the entire training portion (80 percent of the original dataset) to train this model.

In [None]:
knn2 = KNeighborsClassifier(n_neighbors = 24, metric = 'manhattan', weights = 'uniform')
knn2.fit(X_train, y_train)
print ("Training accuracy of the 24-NN model is: ", knn2.score(X_train, y_train))
print ("Testing accuracy of the 24-NN model is: ", knn2.score(X_test, y_test))

Training accuracy of the 24-NN model is:  0.731404958677686
Testing accuracy of the 24-NN model is:  0.6557377049180327


We observe that the new 24-NN model is overfitting. Here are some strategies we may try to mitigate this issue:
- Instead of 5-fold CV for parameter tuning, we can use k-fold CV with a smaller k.
- Remove the less relevant features from the predictor variables. This requires either domain knowledge about the dataset or a feature selection method. A useful starting point is [this article](https://scikit-learn.org/stable/modules/feature_selection.html).
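
As one possible illustration of the second strategy, the sketch below applies univariate feature selection with `SelectKBest` and the ANOVA F-score (`f_classif`). It assumes synthetic data from `make_classification` (not the heart dataset), and the choice of `k=5` is arbitrary. This is only one of several methods described in the linked article, not necessarily the best one for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data (an assumption): 13 features, only 4 of them informative
X_demo, y_demo = make_classification(n_samples=300, n_features=13, n_informative=4,
                                     n_redundant=0, random_state=0)

# Keep the 5 features with the highest ANOVA F-score against the label;
# k=5 is an arbitrary illustrative choice
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X_demo, y_demo)

print("Shape before:", X_demo.shape)
print("Shape after: ", X_selected.shape)
print("Indices of kept features:", selector.get_support(indices=True))
```

To avoid leaking information from the test set, the selector should be fit on the training portion only and then applied to the test portion with `transform`.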

### Toy-example of stretching the axes: scale variables
We look into the features to understand whether some scaling might be needed.

In [None]:
# get summary statistics about the dataset
df.describe()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.39934 | 0.729373 | 2.313531 | 0.544554 |
| std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.52586 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


We observe that the feature variables are on quite different scales. Because kNN relies on distances, features with large ranges (e.g. `chol`) can dominate features with small ranges (e.g. `oldpeak`), so rescaling the features to comparable ranges may help performance. The `sklearn.preprocessing` module implements several scalers, including:
- `StandardScaler`: z-score standardization
- `MinMaxScaler`: min-max normalization
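
The two transformations are simple formulas: z-score standardization computes `(x - mean) / std`, and min-max normalization computes `(x - min) / (max - min)`. The sketch below checks both against scikit-learn on a hypothetical three-value feature. Note that `StandardScaler` uses the population standard deviation, which matches NumPy's default `std()`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with three values
x = np.array([[0.0], [5.0], [10.0]])

# z-score: subtract the mean, divide by the (population) standard deviation
z_manual = (x - x.mean()) / x.std()
z_sklearn = StandardScaler().fit_transform(x)

# min-max: map the minimum to 0 and the maximum to 1
mm_manual = (x - x.min()) / (x.max() - x.min())
mm_sklearn = MinMaxScaler().fit_transform(x)

print(z_sklearn.ravel())
print(mm_sklearn.ravel())
```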

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

#### Apply z-score standardization to selected columns.

In [None]:
#Create a copy of the dataset.
df_model = df.copy()
#Rescale features age, trestbps, chol, thalach, oldpeak with z-score standardization.
scaler = StandardScaler()
features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
df_model[features] = scaler.fit_transform(df_model[features])

#Create new feature arrays to store scaled values
X_zs = df_model.drop(columns=['target'])
#Create new training and testing datasets
X_train_zs, X_test_zs, y_train_zs, y_test_zs = train_test_split(X_zs, y, test_size = 0.2, random_state=123)

In [None]:
#Repeat the kNN tuning steps on the scaled data, then fit and test a new kNN with the identified parameters

#create a kNN model
knn3 = KNeighborsClassifier()

#create a dictionary of parameters to test
param_grid = {'n_neighbors': np.arange(1, 30),
             'weights':['uniform','distance'],
             'metric':['minkowski','manhattan']}

#use gridsearch to test all values for the parameters
knn_gscv = GridSearchCV(knn3, param_grid, cv=5, scoring='accuracy')

#fit new training data
knn_gscv.fit(X_train_zs, y_train_zs)
#check top performing parameter values
knn_gscv.best_params_

{'metric': 'manhattan', 'n_neighbors': 12, 'weights': 'distance'}

In [None]:
#Fit and test a new kNN with the optimal parameter setup
knn3 = KNeighborsClassifier(n_neighbors = 12, metric = 'manhattan', weights = 'distance')
knn3.fit(X_train_zs, y_train_zs)
print ("Training accuracy of the kNN model is: ", knn3.score(X_train_zs, y_train_zs))
print ("Testing accuracy of the kNN model is: ", knn3.score(X_test_zs, y_test_zs))

Training accuracy of the kNN model is:  1.0
Testing accuracy of the kNN model is:  0.6885245901639344


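The kNN model is still overfitting, but the testing accuracy is slightly better after scaling the selected variables with z-score standardization. Note that a training accuracy of 1.0 is expected whenever `weights = 'distance'` is used: each training point is its own nearest neighbor at distance zero, so its own label decides the vote. The training accuracy is therefore not informative for this configuration.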
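
A quick way to see how `weights = 'distance'` affects training accuracy is to compare it with uniform weighting on the same data. The sketch below assumes synthetic data from `make_classification` with some label noise (not the heart dataset) and reuses k = 12 purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption), with label noise added via flip_y
X_demo, y_demo = make_classification(n_samples=200, n_features=5, flip_y=0.2, random_state=0)

uniform_knn = KNeighborsClassifier(n_neighbors=12, weights='uniform').fit(X_demo, y_demo)
distance_knn = KNeighborsClassifier(n_neighbors=12, weights='distance').fit(X_demo, y_demo)

# With distance weighting, each training point is its own zero-distance neighbor,
# so its own label decides the vote when the model is scored on the training data
train_acc_uniform = uniform_knn.score(X_demo, y_demo)
train_acc_distance = distance_knn.score(X_demo, y_demo)

print("uniform: ", train_acc_uniform)
print("distance:", train_acc_distance)
```

The distance-weighted model scores 1.0 on its own training data even with noisy labels, which is why testing or cross-validated accuracy is the number to compare.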

#### Apply Min-Max normalization to selected columns

In [None]:
#Create a copy of the dataset.
df_model = df.copy()
#Rescale features age, trestbps, chol, thalach, oldpeak with min-max normalization.
scaler = MinMaxScaler()
features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
df_model[features] = scaler.fit_transform(df_model[features])

#Create new feature arrays to store scaled values
X_mm = df_model.drop(columns=['target'])
#Create new training and testing datasets
X_train_mm, X_test_mm, y_train_mm, y_test_mm = train_test_split(X_mm, y, test_size = 0.2, random_state=123)


In [None]:
#Repeat the kNN tuning steps on the scaled data, then fit and test a new kNN with the identified parameters

#create a kNN model
knn4 = KNeighborsClassifier()

#create a dictionary of parameters to test
param_grid = {'n_neighbors': np.arange(1, 30),
             'weights':['uniform','distance'],
             'metric':['minkowski','manhattan']}

#use gridsearch to test all values for the parameters
knn_gscv = GridSearchCV(knn4, param_grid, cv=5, scoring='accuracy')

#fit new training data
knn_gscv.fit(X_train_mm, y_train_mm)
#check top performing parameter values
knn_gscv.best_params_


In [None]:
#Fit and test a new kNN with the optimal parameter setup
knn4 = KNeighborsClassifier(n_neighbors = 16, metric = 'manhattan', weights = 'uniform')
knn4.fit(X_train_mm, y_train_mm)
print ("Training accuracy of the kNN model is: ", knn4.score(X_train_mm, y_train_mm))
print ("Testing accuracy of the kNN model is: ", knn4.score(X_test_mm, y_test_mm))

Training accuracy of the kNN model is:  0.8801652892561983
Testing accuracy of the kNN model is:  0.7377049180327869


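We observe that min-max normalization appears to be more helpful in mitigating overfitting: the gap between training and testing accuracy is smaller (0.88 vs. 0.74) than with z-score standardization (1.0 vs. 0.69), and the testing accuracy is higher.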
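
One caveat about the scaling cells above: the scaler is fit on all 303 rows before the train-test split, so statistics from the test rows influence the training features. A common way to avoid this is to put the scaler and the classifier in a `Pipeline`, so that the scaler is re-fit on the training part of every cross-validation fold. The sketch below assumes synthetic data (not the heart dataset) and a deliberately small grid; the step names `scale` and `knn` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption; not the heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

# The scaler is a pipeline step, so it is re-fit on the training part of every CV fold
pipe = Pipeline([('scale', MinMaxScaler()),
                 ('knn', KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter name>;
# the small grid here is only for illustration
grid = {'knn__n_neighbors': [3, 5, 7, 9, 11],
        'knn__weights': ['uniform', 'distance']}

search = GridSearchCV(pipe, grid, cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)
print(search.best_params_)
print("Test accuracy:", test_accuracy)
```

To scale only selected columns, as done in this lab, the scaler step could be wrapped in a `ColumnTransformer`.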