# K-nearest neighbor
In this notebook we want to train a K-nearest-neighbor classifier that should predict whether a patient has diabetes or not.

In [1]:
#imports
from preprocessing.preprocessing_label_encoding import *
from preprocessing.preprocessing_one_hot_encoding import *

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import *

import warnings
warnings.filterwarnings('ignore')

#used to store the results and compare at the end
approach_list = []
acc_list = []
cr_list = []
f1_list = []



# Hyperparameter tuning and preprocessing
We will evaluate different parameters for the classifier (hyperparameter tuning) as well as different preprocessing steps based on the accuracy and F1-Score.

In detail, we will vary regarding parameters:<br>

| **Parameter** |                    **range Values**                    |
|:-------------:|:------------------------------------------------------:|
|    metric     | 'euclidean',<br>'cosine',<br>'manhattan',<br>'jaccard' |
|  K-neighbors  |                          5-15                          |

And we will vary for preprocessing:


|              **Preprocessing**              |                           **Description**                           |
|:-------------------------------------------:|:-------------------------------------------------------------------:|
|               Label Encoding                |                           Label encoding                            |
|     Label Encoding<br>+<br>Oversampling     |                   Label encoding and oversampling                   |
|    Label Encoding<br>+<br>Undersampling     |                  Label encoding and undersampling                   |
|            One Hot Encoding (1)             |           One hot encoding for all columns except yes/no            |
|            One Hot Encoding (2)             |          One hot encoding for all columns including yes/no          |
|  One Hot Encoding (1)<br>+<br>Oversampling  |   One hot encoding for all columns except yes/no and oversampling   |
| One Hot Encoding (2) <br>+<br>Oversampling  | One hot encoding for all columns including yes/no and oversampling  |
| One Hot Encoding (1)<br>+<br>Undersampling  |  One hot encoding for all columns except yes/no and undersampling   |
| One Hot Encoding (2) <br>+<br>Undersampling | One hot encoding for all columns including yes/no and undersampling |

The below defined function is used to store all the accuracy and f1-scores for a better comparison and evaluation capability at the end. It also returns the Accuracy score and F1-Score to see the performance directly under each method.

In [2]:
def evaluation(target_validation, diabetes_test_prediction, k, metric, preprocessing):
    approach = "preprocessing: {} k= {} metric: {}".format(preprocessing, k, metric)
    approach_list.append(approach)
    acc = accuracy_score(target_validation, diabetes_test_prediction)
    acc_list.append(acc)
    cr = classification_report(target_validation, diabetes_test_prediction)
    cr_list.append(cr)
    f1 = f1_score(target_validation, diabetes_test_prediction)
    f1_list.append(f1)
    return "{}\n acc = {}\n f1-score = {}".format(approach, acc, f1)

We start with applying a K-nearest neighbor classifier to the train data and test against the validation data on how it performs by using the accuracy and f1 score.

We will do so for each combination that is listed above by using two for loops. The following estimators are structured by the different style of preprocessing.

At the end we test the best approach against the actual test data.

# Label Encoding

In [3]:
preprocessing = "Label Encoding"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_label_encoded_train_test_split()

#metrics that should be used
params = ('euclidean', 'cosine', 'manhattan', 'jaccard')

for k_neighbors in range(5,16):
    for metric in params:
        knn_estimator = KNeighborsClassifier(n_neighbors=k_neighbors, metric=metric)
        knn_estimator.fit(data_train, target_train.values.ravel())
        diabetes_test_prediction = knn_estimator.predict(data_validation)
        print(evaluation(target_validation, diabetes_test_prediction, k_neighbors, metric, preprocessing))

k= 5 metric: cosine:
 acc = 0.8348629810814321
 f1-score = 0.8086342517615546
k= 5 metric: manhattan:
 acc = 0.8379150612580631
 f1-score = 0.8107654592583807
k= 6 metric: cosine:
 acc = 0.8322438200787913
 f1-score = 0.8117186129597088
k= 6 metric: manhattan:
 acc = 0.8342136023204467
 f1-score = 0.8125962193311277


# Label Encoding + Oversampling

In [None]:
preprocessing = "Label Encoding + Oversampling"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_label_encoded_train_test_split_oversampled()

#metrics that should be used
params = ('euclidean', 'cosine', 'manhattan', 'jaccard')

for k_neighbors in range(5,7):
    for metric in params:
        knn_estimator = KNeighborsClassifier(n_neighbors=k_neighbors, metric=metric)
        knn_estimator.fit(data_train, target_train.values.ravel())
        diabetes_test_prediction = knn_estimator.predict(data_validation)
        print(evaluation(target_validation, diabetes_test_prediction, k_neighbors, metric, preprocessing))

# Label Encoding + Undersampling

In [None]:
preprocessing = "Label Encoding + Undersampling"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_label_encoded_train_test_split_undersampled()

#metrics that should be used
params = ('euclidean', 'cosine', 'manhattan', 'jaccard')

for k_neighbors in range(5,7):
    for metric in params:
        knn_estimator = KNeighborsClassifier(n_neighbors=k_neighbors, metric=metric)
        knn_estimator.fit(data_train, target_train.values.ravel())
        diabetes_test_prediction = knn_estimator.predict(data_validation)
        print(evaluation(target_validation, diabetes_test_prediction, k_neighbors, metric, preprocessing))

# One Hot Encoding (1)
yes/no values not one hot encoded

In [None]:
preprocessing = "One Hot Encoding (1)"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_one_hot_encoded()

#metrics that should be used
params = ('euclidean', 'cosine', 'manhattan', 'jaccard')

for k_neighbors in range(5,7):
    for metric in params:
        knn_estimator = KNeighborsClassifier(n_neighbors=k_neighbors, metric=metric)
        knn_estimator.fit(data_train, target_train.values.ravel())
        diabetes_test_prediction = knn_estimator.predict(data_validation)
        print(evaluation(target_validation, diabetes_test_prediction, k_neighbors, metric, preprocessing))

# One Hot Encoding (2)
all columns one hot encoded

In [None]:
preprocessing = "One Hot Encoding (2)"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_one_hot_encoded_all_columns()

#metrics that should be used
params = ('euclidean', 'cosine', 'manhattan', 'jaccard')

for k_neighbors in range(5,7):
    for metric in params:
        knn_estimator = KNeighborsClassifier(n_neighbors=k_neighbors, metric=metric)
        knn_estimator.fit(data_train, target_train.values.ravel())
        diabetes_test_prediction = knn_estimator.predict(data_validation)
        print(evaluation(target_validation, diabetes_test_prediction, k_neighbors, metric, preprocessing))

# One Hot Encoding (1) + Oversampling

In [None]:
preprocessing = "One Hot Encoding (1) + Oversampling"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_one_hot_encoded_train_test_split_oversampled()

#metrics that should be used
params = ('euclidean', 'cosine', 'manhattan', 'jaccard')

for k_neighbors in range(5,7):
    for metric in params:
        knn_estimator = KNeighborsClassifier(n_neighbors=k_neighbors, metric=metric)
        knn_estimator.fit(data_train, target_train.values.ravel())
        diabetes_test_prediction = knn_estimator.predict(data_validation)
        print(evaluation(target_validation, diabetes_test_prediction, k_neighbors, metric, preprocessing))

# One Hot Encoding (2) + Oversampling

In [None]:
preprocessing = "One Hot Encoding (2) + Oversampling"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_one_hot_encoded_all_columns_train_test_split_oversampled()

#metrics that should be used
params = ('euclidean', 'cosine', 'manhattan', 'jaccard')

for k_neighbors in range(5,7):
    for metric in params:
        knn_estimator = KNeighborsClassifier(n_neighbors=k_neighbors, metric=metric)
        knn_estimator.fit(data_train, target_train.values.ravel())
        diabetes_test_prediction = knn_estimator.predict(data_validation)
        print(evaluation(target_validation, diabetes_test_prediction, k_neighbors, metric, preprocessing))

# One Hot Encoding (1) + Undersampling

In [None]:
preprocessing = "One Hot Encoding (1) + Undersampling"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_one_hot_encoded_train_test_split_undersampled()

#metrics that should be used
params = ('euclidean', 'cosine', 'manhattan', 'jaccard')

for k_neighbors in range(5,7):
    for metric in params:
        knn_estimator = KNeighborsClassifier(n_neighbors=k_neighbors, metric=metric)
        knn_estimator.fit(data_train, target_train.values.ravel())
        diabetes_test_prediction = knn_estimator.predict(data_validation)
        print(evaluation(target_validation, diabetes_test_prediction, k_neighbors, metric, preprocessing))

# One Hot Encoding (2) + Undersampling

In [None]:
preprocessing = "One Hot Encoding (2) + Undersampling"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_one_hot_encoded_all_columns_train_test_split_undersampled()

#metrics that should be used
params = ('euclidean', 'cosine', 'manhattan', 'jaccard')

for k_neighbors in range(5,7):
    for metric in params:
        knn_estimator = KNeighborsClassifier(n_neighbors=k_neighbors, metric=metric)
        knn_estimator.fit(data_train, target_train.values.ravel())
        diabetes_test_prediction = knn_estimator.predict(data_validation)
        print(evaluation(target_validation, diabetes_test_prediction, k_neighbors, metric, preprocessing))

# Evaluation of best approach
We use the following loop to print all F1- and accuracy-scores and thereby also analyze which approach performed best.

In [7]:
highest_acc = [0.0, None, None]
highest_f1 = [0.0, None, None]

for i in range(0, len(approach_list)):
    print("Nr.{}) {}".format(i, approach_list[i]))
    print ("Weighted F1-Score = {}".format(f1_list[i]))
    print ("Accuracy-Score = {}\n".format(acc_list[i]))
    if highest_f1[0] < float(f1_list[i]):
        highest_f1[0] = f1_list[i]
        highest_f1[1] = i
        highest_f1[2] = approach_list[i]
    if highest_acc[0] < float(acc_list[i]):
        highest_acc[0] = acc_list[i]
        highest_acc[1] = i
        highest_acc[2] = approach_list[i]

print("--------- Best Approaches ---------")
print("Best Approach regarding F1-Score:\nNr.{}) {} with f1-score = {}\n".format(highest_f1[1], highest_f1[2], highest_f1[0]))
print("Best Approach regarding Accuracy:\nNr.{}) {} with acc = {}".format(highest_acc[1], highest_acc[2], highest_acc[0]))

Nr.0) k= 5 metric: cosine
Weighted F1-Score = 0.8086342517615546
Accuracy-Score = 0.8348629810814321

Nr.1) k= 5 metric: manhattan
Weighted F1-Score = 0.8107654592583807
Accuracy-Score = 0.8379150612580631

Nr.2) k= 6 metric: cosine
Weighted F1-Score = 0.8117186129597088
Accuracy-Score = 0.8322438200787913

Nr.3) k= 6 metric: manhattan
Weighted F1-Score = 0.8125962193311277
Accuracy-Score = 0.8342136023204467

--------- Best Approaches ---------
Best Approach regarding F1-Score:
Nr.3) Parameter: (k= 6 metric: manhattan) with f1-score = 0.8125962193311277

Best Approach regarding Accuracy:
Nr.1) Parameter: (k= 5 metric: manhattan) with acc = 0.8379150612580631


We can see that approach _tbd_ with k = _tbd_ and metric = _tbd_ performed best.
We test this approach now finally against the test data that we separated at the beginning:

In [None]:
#tbd