# K-nearest neighbor
In this notebook we train a K-nearest-neighbor (KNN) classifier to predict whether a patient has diabetes.

In [1]:
#imports
from preprocessing.preprocessing_label_encoding import *
from preprocessing.preprocessing_one_hot_encoding import *

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, fbeta_score

import warnings
warnings.filterwarnings('ignore')

#used to store the results and compare at the end
approach_list = []
acc_list = []
cr_list = []
f2_list = []

# Hyperparameter tuning and preprocessing
We evaluate different classifier parameters (hyperparameter tuning) as well as different preprocessing steps, and compare them by accuracy and F2-score.

The following classifier parameters are varied:

| **Parameter** |                       **Values**                       |
|:-------------:|:------------------------------------------------------:|
|    metric     | 'euclidean',<br>'cosine',<br>'manhattan',<br>'jaccard' |
|  K-neighbors  |                          1-15                          |
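
To make the four metrics in the table concrete, here is a minimal pure-Python sketch of each distance for two small feature vectors. The function names and the example vectors are ours and purely illustrative. The Jaccard variant assumes the boolean interpretation, in which any nonzero entry counts as True; under that assumption, label-encoded values are reduced to zero/nonzero before the distance is computed.

```python
import math


def euclidean(a, b):
    # Square root of the summed squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    # Sum of the absolute differences
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine(a, b):
    # 1 - cosine similarity; assumes neither vector is all zeros
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def jaccard(a, b):
    # Assumption: entries are treated as booleans (nonzero = True)
    a_bool = [x != 0 for x in a]
    b_bool = [y != 0 for y in b]
    either = sum(1 for x, y in zip(a_bool, b_bool) if x or y)
    differ = sum(1 for x, y in zip(a_bool, b_bool) if x != y)
    return differ / either if either else 0.0


# Illustrative vectors, not taken from the dataset
a = [1, 0, 2, 3]
b = [1, 1, 0, 3]

print(euclidean(a, b))
print(manhattan(a, b))  # 3
print(cosine(a, b))
print(jaccard(a, b))    # 0.5
```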

The following preprocessing variants are compared:


|              **Preprocessing**              |                           **Description**                           |
|:-------------------------------------------:|:-------------------------------------------------------------------:|
|               Label Encoding                |                           Label encoding                            |
|     Label Encoding<br>+<br>Oversampling     |                   Label encoding and oversampling                   |
|    Label Encoding<br>+<br>Undersampling     |                  Label encoding and undersampling                   |
|            One Hot Encoding (1)             |           One hot encoding for all columns except yes/no            |
|            One Hot Encoding (2)             |          One hot encoding for all columns including yes/no          |
|  One Hot Encoding (1)<br>+<br>Oversampling  |   One hot encoding for all columns except yes/no and oversampling   |
| One Hot Encoding (2) <br>+<br>Oversampling  | One hot encoding for all columns including yes/no and oversampling  |
| One Hot Encoding (1)<br>+<br>Undersampling  |  One hot encoding for all columns except yes/no and undersampling   |
| One Hot Encoding (2) <br>+<br>Undersampling | One hot encoding for all columns including yes/no and undersampling |
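
The oversampling and undersampling variants in the table are produced by the imported preprocessing helpers, whose implementation is not shown in this notebook. As a rough illustration of the idea only, the sketch below balances classes by random resampling. The function names `random_oversample` and `random_undersample` and the toy data are hypothetical, and the real helpers may use a different technique.

```python
import random
from collections import Counter


def _group_by_label(rows, labels):
    # Collect the rows of each class in insertion order
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_oversample(rows, labels, seed=0):
    # Duplicate random minority rows until every class matches the largest one
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    # Keep a random subset of each class, sized to the smallest class
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        out_rows.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data: 8 negatives, 2 positives
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2

_, over_labels = random_oversample(rows, labels)
_, under_labels = random_undersample(rows, labels)
print(Counter(over_labels))
print(Counter(under_labels))
```

Resampling of this kind would normally be applied to the training split only, so that the validation data keeps the original class distribution.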

In [2]:
#metrics that should be used
metrics = ('euclidean', 'cosine', 'manhattan', 'jaccard')
#range() excludes the end value, so k runs from 1 to 15
n_neighbors_start = 1
n_neighbors_end = 16

The function defined below stores the accuracy, classification report, and F2-score of every run so that all approaches can be compared at the end. It also returns a summary string with the accuracy and F2-score, so the performance is visible directly under each run.

In [3]:
def evaluation(target_validation, diabetes_test_prediction, k, metric, preprocessing):
    """Record the scores of one run in the global result lists and return a summary string."""
    #label that identifies this combination of preprocessing, k and metric
    approach = "preprocessing: {} k= {} metric: {}".format(preprocessing, k, metric)
    approach_list.append(approach)
    acc = accuracy_score(target_validation, diabetes_test_prediction)
    acc_list.append(acc)
    cr = classification_report(target_validation, diabetes_test_prediction)
    cr_list.append(cr)
    #beta=2 weights recall higher than precision
    f2 = fbeta_score(target_validation, diabetes_test_prediction, beta=2)
    f2_list.append(f2)
    return "{}\n acc = {}\n f2-score = {}".format(approach, acc, f2)
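
To show what the F2-score measures, the following sketch computes the F-beta score directly from confusion-matrix counts using the standard formula F_beta = (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP). The helper name `fbeta_from_counts` and the counts are made up for illustration and are not results from this notebook.

```python
def fbeta_from_counts(tp, fp, fn, beta=2.0):
    # F-beta from true positives, false positives and false negatives
    b2 = beta ** 2
    denominator = (1 + b2) * tp + b2 * fn + fp
    # Return 0.0 when there are no positive labels or predictions at all
    return (1 + b2) * tp / denominator if denominator else 0.0


# Made-up counts for an imbalanced validation set of 1000 patients
tp, fp, fn, tn = 30, 40, 70, 860

accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = fbeta_from_counts(tp, fp, fn, beta=1)
f2 = fbeta_from_counts(tp, fp, fn, beta=2)

print(round(accuracy, 4))  # 0.89
print(round(f1, 4))        # 0.3529
print(round(f2, 4))        # 0.3191
```

With beta = 2, missed diabetes cases (false negatives) count four times as much as false alarms, which is why a classifier can reach a high accuracy on imbalanced data and still get a low F2-score.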

We start by fitting a K-nearest-neighbor classifier on the training data and checking how it performs on the validation data, measured by accuracy and F2-score.

We do this for every combination listed above, using two nested for loops. The following sections are grouped by preprocessing variant.

At the end, we test the best approach against the held-out test data.
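
The nested loop over k and metric is repeated in every section below. As a sketch of how it could be wrapped in one reusable function, the example runs the same grid on a small synthetic dataset. The function name `run_knn_grid`, the synthetic data from `make_classification`, and the reduced grid are assumptions for illustration; the notebook itself uses the BRFSS data returned by the preprocessing helpers.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, fbeta_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def run_knn_grid(X_train, y_train, X_val, y_val, ks, metrics):
    # Fit one classifier per (k, metric) pair and score it on the validation data
    results = []
    for k, metric in product(ks, metrics):
        model = KNeighborsClassifier(n_neighbors=k, metric=metric)
        model.fit(X_train, y_train)
        prediction = model.predict(X_val)
        results.append({
            "k": k,
            "metric": metric,
            "acc": accuracy_score(y_val, prediction),
            "f2": fbeta_score(y_val, prediction, beta=2),
        })
    return results


# Synthetic, imbalanced stand-in for the real dataset
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

results = run_knn_grid(X_train, y_train, X_val, y_val,
                       ks=range(1, 4),
                       metrics=("euclidean", "manhattan", "cosine"))

# Pick the combination with the highest F2-score
best = max(results, key=lambda r: r["f2"])
print(best)
```

`product(ks, metrics)` visits the combinations in the same order as the two nested loops, with k in the outer position.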

# Label Encoding

In [6]:
preprocessing = "Label Encoding"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_label_encoded_train_test_split()

for n_neighbors in range(1,5):
    for metric in metrics:
        knn_estimator = KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)
        knn_estimator.fit(data_train, target_train.values.ravel())
        diabetes_test_prediction = knn_estimator.predict(data_validation)
        print(evaluation(target_validation, diabetes_test_prediction, n_neighbors, metric, preprocessing))

preprocessing: Label Encoding k= 1 metric: euclidean
 acc = 0.8216150921945501
 f2-score = 0.2991179498594841
preprocessing: Label Encoding k= 1 metric: cosine
 acc = 0.8197854122997122
 f2-score = 0.30350420726306465
preprocessing: Label Encoding k= 1 metric: manhattan
 acc = 0.8230717305574307
 f2-score = 0.30445551442963287
preprocessing: Label Encoding k= 1 metric: jaccard
 acc = 0.8529328169964827
 f2-score = 0.0535538814281035
preprocessing: Label Encoding k= 2 metric: euclidean
 acc = 0.8652431875510712
 f2-score = 0.13700404858299592
preprocessing: Label Encoding k= 2 metric: cosine
 acc = 0.8652076597861229
 f2-score = 0.13989972505256346
preprocessing: Label Encoding k= 2 metric: manhattan
 acc = 0.8662379649696238
 f2-score = 0.1413856768049752
preprocessing: Label Encoding k= 2 metric: jaccard
 acc = 0.8721178100685686
 f2-score = 0.0
preprocessing: Label Encoding k= 3 metric: euclidean
 acc = 0.8516005258109213
 f2-score = 0.2606394114304513
preprocessing: Label Encoding k

# Label Encoding + Oversampling

In [4]:
preprocessing = "Label Encoding + Oversampling"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_label_encoded_train_test_split_oversampled()

for n_neighbors in range(5,16):
    for metric in metrics:
        knn_estimator = KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)
        knn_estimator.fit(data_train, target_train.values.ravel())
        diabetes_test_prediction = knn_estimator.predict(data_validation)
        print(evaluation(target_validation, diabetes_test_prediction, n_neighbors, metric, preprocessing))

preprocessing: Label Encoding + Oversampling k= 5 metric: euclidean
 acc = 0.7431342594237397
 f2-score = 0.474826447395026
preprocessing: Label Encoding + Oversampling k= 5 metric: cosine
 acc = 0.7372544143247949
 f2-score = 0.47358103893789544
preprocessing: Label Encoding + Oversampling k= 5 metric: manhattan
 acc = 0.7453192169680606
 f2-score = 0.48150811126589665
preprocessing: Label Encoding + Oversampling k= 5 metric: jaccard
 acc = 0.5244608661669095
 f2-score = 0.40419447092469013
preprocessing: Label Encoding + Oversampling k= 6 metric: euclidean
 acc = 0.7478239243969161
 f2-score = 0.47096437722500617
preprocessing: Label Encoding + Oversampling k= 6 metric: cosine
 acc = 0.7422815930649803
 f2-score = 0.4692105203950772
preprocessing: Label Encoding + Oversampling k= 6 metric: manhattan
 acc = 0.750008881941237
 f2-score = 0.47706546889887025
preprocessing: Label Encoding + Oversampling k= 6 metric: jaccard
 acc = 0.6273492734572068
 f2-score = 0.4067926606226278
preproc

# Label Encoding + Undersampling

In [5]:
preprocessing = "Label Encoding + Undersampling"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_label_encoded_train_test_split_undersampled()

for n_neighbors in range(n_neighbors_start, n_neighbors_end):
    for metric in metrics:
        knn_estimator = KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)
        knn_estimator.fit(data_train, target_train.values.ravel())
        diabetes_test_prediction = knn_estimator.predict(data_validation)
        print(evaluation(target_validation, diabetes_test_prediction, n_neighbors, metric, preprocessing))

preprocessing: Label Encoding + Undersampling k= 1 metric: euclidean
 acc = 0.6729136320034107
 f2-score = 0.48936640668971626
preprocessing: Label Encoding + Undersampling k= 1 metric: cosine
 acc = 0.6658258428962234
 f2-score = 0.4914007356596083
preprocessing: Label Encoding + Undersampling k= 1 metric: manhattan
 acc = 0.6762354780260774
 f2-score = 0.4929946577965931
preprocessing: Label Encoding + Undersampling k= 1 metric: jaccard
 acc = 0.8720289906561978
 f2-score = 0.000520706772659423
preprocessing: Label Encoding + Undersampling k= 2 metric: euclidean
 acc = 0.7795324546132802
 f2-score = 0.4364077669902912
preprocessing: Label Encoding + Undersampling k= 2 metric: cosine
 acc = 0.7737769566916546
 f2-score = 0.4364812232533538
preprocessing: Label Encoding + Undersampling k= 2 metric: manhattan
 acc = 0.7817351760400754
 f2-score = 0.43892944038929443
preprocessing: Label Encoding + Undersampling k= 2 metric: jaccard
 acc = 0.8720289906561978
 f2-score = 0.000520706772659

# One Hot Encoding (1)
Yes/no columns are not one-hot encoded.

In [6]:
preprocessing = "One Hot Encoding (1)"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_one_hot_encoded_train_test_split()

for n_neighbors in range(n_neighbors_start, n_neighbors_end):
    for metric in metrics:
        knn_estimator = KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)
        knn_estimator.fit(data_train, target_train.values.ravel())
        diabetes_test_prediction = knn_estimator.predict(data_validation)
        print(evaluation(target_validation, diabetes_test_prediction, n_neighbors, metric, preprocessing))

preprocessing: One Hot Encoding (1) k= 1 metric: euclidean
 acc = 0.8200518705368245
 f2-score = 0.3002190975400061
preprocessing: One Hot Encoding (1) k= 1 metric: cosine
 acc = 0.8217039116069208
 f2-score = 0.2938797143494756
preprocessing: One Hot Encoding (1) k= 1 metric: manhattan
 acc = 0.8205847870110491
 f2-score = 0.30083534537784806
preprocessing: One Hot Encoding (1) k= 1 metric: jaccard
 acc = 0.8191459125306427
 f2-score = 0.30210811708072816
preprocessing: One Hot Encoding (1) k= 2 metric: euclidean
 acc = 0.8653852986108644
 f2-score = 0.13673330525224378
preprocessing: One Hot Encoding (1) k= 2 metric: cosine
 acc = 0.8655096457881835
 f2-score = 0.13139089196387968
preprocessing: One Hot Encoding (1) k= 2 metric: manhattan
 acc = 0.8649944931964331
 f2-score = 0.13724792024083127
preprocessing: One Hot Encoding (1) k= 2 metric: jaccard
 acc = 0.8639819518954063
 f2-score = 0.13836863859467835
preprocessing: One Hot Encoding (1) k= 3 metric: euclidean
 acc = 0.85062351

# One Hot Encoding (2)
All columns, including yes/no, are one-hot encoded.

In [None]:
preprocessing = "One Hot Encoding (2)"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_one_hot_encoded_all_columns_train_test_split()

for n_neighbors in range(n_neighbors_start, n_neighbors_end):
    for metric in metrics:
        knn_estimator = KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)
        knn_estimator.fit(data_train, target_train.values.ravel())
        diabetes_test_prediction = knn_estimator.predict(data_validation)
        print(evaluation(target_validation, diabetes_test_prediction, n_neighbors, metric, preprocessing))

# One Hot Encoding (1) + Oversampling

In [None]:
preprocessing = "One Hot Encoding (1) + Oversampling"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_one_hot_encoded_train_test_split_oversampled()

for n_neighbors in range(n_neighbors_start, n_neighbors_end):
    for metric in metrics:
        knn_estimator = KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)
        knn_estimator.fit(data_train, target_train.values.ravel())
        diabetes_test_prediction = knn_estimator.predict(data_validation)
        print(evaluation(target_validation, diabetes_test_prediction, n_neighbors, metric, preprocessing))

# One Hot Encoding (2) + Oversampling

In [None]:
preprocessing = "One Hot Encoding (2) + Oversampling"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_one_hot_encoded_all_columns_train_test_split_oversampled()

for n_neighbors in range(n_neighbors_start, n_neighbors_end):
    for metric in metrics:
        knn_estimator = KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)
        knn_estimator.fit(data_train, target_train.values.ravel())
        diabetes_test_prediction = knn_estimator.predict(data_validation)
        print(evaluation(target_validation, diabetes_test_prediction, n_neighbors, metric, preprocessing))

# One Hot Encoding (1) + Undersampling

In [None]:
preprocessing = "One Hot Encoding (1) + Undersampling"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_one_hot_encoded_train_test_split_undersampled()

for n_neighbors in range(n_neighbors_start, n_neighbors_end):
    for metric in metrics:
        knn_estimator = KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)
        knn_estimator.fit(data_train, target_train.values.ravel())
        diabetes_test_prediction = knn_estimator.predict(data_validation)
        print(evaluation(target_validation, diabetes_test_prediction, n_neighbors, metric, preprocessing))

# One Hot Encoding (2) + Undersampling

In [None]:
preprocessing = "One Hot Encoding (2) + Undersampling"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_one_hot_encoded_all_columns_train_test_split_undersampled()

for n_neighbors in range(n_neighbors_start, n_neighbors_end):
    for metric in metrics:
        knn_estimator = KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)
        knn_estimator.fit(data_train, target_train.values.ravel())
        diabetes_test_prediction = knn_estimator.predict(data_validation)
        print(evaluation(target_validation, diabetes_test_prediction, n_neighbors, metric, preprocessing))

# Evaluation of best approach
We use the following loop to print the F2-score and accuracy of every run and to determine which approach performed best.

In [7]:
highest_acc = [0.0, None, None]
highest_f2 = [0.0, None, None]

for i in range(len(approach_list)):
    print("Nr.{}) {}".format(i, approach_list[i]))
    #fbeta_score in evaluation() uses the default binary averaging, so this is not a weighted score
    print("F2-Score = {}".format(f2_list[i]))
    print("Accuracy-Score = {}\n".format(acc_list[i]))
    if highest_f2[0] < float(f2_list[i]):
        highest_f2[0] = f2_list[i]
        highest_f2[1] = i
        highest_f2[2] = approach_list[i]
    if highest_acc[0] < float(acc_list[i]):
        highest_acc[0] = acc_list[i]
        highest_acc[1] = i
        highest_acc[2] = approach_list[i]

print("--------- Best Approaches ---------")
print("Best Approach regarding F2-Score:\nNr.{}) {} with f2-score = {}\n".format(highest_f2[1], highest_f2[2], highest_f2[0]))
print("Best Approach regarding Accuracy:\nNr.{}) {} with acc = {}".format(highest_acc[1], highest_acc[2], highest_acc[0]))

Nr.0) k= 5 metric: cosine
Weighted F1-Score = 0.8086342517615546
Accuracy-Score = 0.8348629810814321

Nr.1) k= 5 metric: manhattan
Weighted F1-Score = 0.8107654592583807
Accuracy-Score = 0.8379150612580631

Nr.2) k= 6 metric: cosine
Weighted F1-Score = 0.8117186129597088
Accuracy-Score = 0.8322438200787913

Nr.3) k= 6 metric: manhattan
Weighted F1-Score = 0.8125962193311277
Accuracy-Score = 0.8342136023204467

--------- Best Approaches ---------
Best Approach regarding F1-Score:
Nr.3) Parameter: (k= 6 metric: manhattan) with f1-score = 0.8125962193311277

Best Approach regarding Accuracy:
Nr.1) Parameter: (k= 5 metric: manhattan) with acc = 0.8379150612580631
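
The search for the best run can also be written with `max()` instead of manual bookkeeping. The sketch below is one possible way to do it; the helper name `best_index` is ours, and the scores are the k = 5 results from the Label Encoding + Oversampling section above, rounded to four decimals.

```python
# Rounded k = 5 results from the Label Encoding + Oversampling run
approaches = [
    "preprocessing: Label Encoding + Oversampling k= 5 metric: euclidean",
    "preprocessing: Label Encoding + Oversampling k= 5 metric: cosine",
    "preprocessing: Label Encoding + Oversampling k= 5 metric: manhattan",
    "preprocessing: Label Encoding + Oversampling k= 5 metric: jaccard",
]
acc_scores = [0.7431, 0.7373, 0.7453, 0.5245]
f2_scores = [0.4748, 0.4736, 0.4815, 0.4042]


def best_index(scores):
    # Index of the highest score; on ties the first occurrence wins
    return max(range(len(scores)), key=scores.__getitem__)


best_f2 = best_index(f2_scores)
best_acc = best_index(acc_scores)

print("Best F2-score: Nr.{}) {} with f2-score = {}".format(
    best_f2, approaches[best_f2], f2_scores[best_f2]))
print("Best accuracy: Nr.{}) {} with acc = {}".format(
    best_acc, approaches[best_acc], acc_scores[best_acc]))
```

For these four runs, the manhattan metric has both the highest F2-score and the highest accuracy.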


We can see that approach _tbd_ with k = _tbd_ and metric = _tbd_ performed best.
Finally, we test this approach against the test data that we set aside at the beginning:

In [None]:
#tbd