# Nearest centroid
In this notebook we train a nearest centroid classifier (scikit-learn's `NearestCentroid`) to predict whether a patient has diabetes.

In [1]:
#imports
from preprocessing.preprocessing import *
from preprocessing.preprocessing_label_encoding import *
from preprocessing.preprocessing_one_hot_encoding import *

import pandas as pd

from sklearn.neighbors import NearestCentroid
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV



# Hyperparameter tuning and preprocessing
We will evaluate different parameters for the classifier (hyperparameter tuning) as well as different preprocessing steps.

For the classifier, we vary the following parameter:


| **Parameter** |                           **Values**                           |
|:-------------:|:--------------------------------------------------------------:|
| metric        | 'euclidean',<br>'minkowski',<br>'cosine',<br>'sqeuclidean',<br>'manhattan' |

For preprocessing, we compare the following variants:

|              **Preprocessing**              |                           **Description**                           | **Name train data** | **Name train target** | **Name test data** | **Name test target** |
|:-------------------------------------------:|:-------------------------------------------------------------------:|-----------------|-------------------|--------------------|----------------------|
|               Label Encoding                |                           Label encoding                            | data_train      | target_train      | data_test          | target_test          |
|     Label Encoding<br>+<br>Oversampling     |                   Label encoding and oversampling                   | data_train_os   | target_train_os   | data_test_os       | target_test_os       |
|    Label Encoding<br>+<br>Undersampling     |                  Label encoding and undersampling                   | data_train_us   | target_train_us   | data_test_us       | target_test_us       |
|            One Hot Encoding (1)             |           One hot encoding for all columns except yes/no            | data_train_oh   | target_train_oh   | data_test_oh       | target_test_oh       |
|            One Hot Encoding (2)             |          One hot encoding for all columns including yes/no          | data_train_a_oh | target_train_a_oh | data_test_a_oh     | target_test_a_oh     |
|  One Hot Encoding (1)<br>+<br>Oversampling  |   One hot encoding for all columns except yes/no and oversampling   | data_train_oh_os | target_train_oh_os | data_test_oh_os    | target_test_oh_os    |
| One Hot Encoding (2) <br>+<br>Oversampling  | One hot encoding for all columns including yes/no and oversampling  | data_train_a_oh_os | target_train_a_oh_os | data_test_a_oh_os  | target_test_a_oh_os  |
| One Hot Encoding (1)<br>+<br>Undersampling  |  One hot encoding for all columns except yes/no and undersampling   | data_train_oh_us | target_train_oh_us | data_test_oh_us    | target_test_oh_us    |
| One Hot Encoding (2) <br>+<br>Undersampling | One hot encoding for all columns including yes/no and undersampling | data_train_a_oh_us | target_train_a_oh_us | data_test_a_oh_us  | target_test_a_oh_us  |

We start by loading the data:

In [None]:
#label encoding
data_train, data_test, target_train, target_test = get_preprocessed_brfss_dataset_label_encoded_train_test_split()

#label encoding oversampling
data_train_os, data_test_os, target_train_os, target_test_os = get_preprocessed_brfss_dataset_label_encoded_train_test_split_oversampled()

#label encoding undersampling
data_train_us, data_test_us, target_train_us, target_test_us = get_preprocessed_brfss_dataset_label_encoded_train_test_split_undersampled()

#one hot encoding (1) - target not one hot encoded
data_train_oh, data_test_oh, target_train_oh, target_test_oh = get_preprocessed_brfss_dataset_one_hot_encoded_train_test_split()

#one hot encoding (2) - target not one hot encoded
data_train_a_oh, data_test_a_oh, target_train_a_oh, target_test_a_oh = get_preprocessed_brfss_dataset_one_hot_encoded_all_columns_train_test_split()

#one hot encoding (1) - target not one hot encoded + oversampling
data_train_oh_os, data_test_oh_os, target_train_oh_os, target_test_oh_os = get_preprocessed_brfss_dataset_one_hot_encoded_train_test_split_oversampled()

#one hot encoding (2) - target not one hot encoded + oversampling
data_train_a_oh_os, data_test_a_oh_os, target_train_a_oh_os, target_test_a_oh_os = get_preprocessed_brfss_dataset_one_hot_encoded_all_columns_train_test_split_oversampled()

#one hot encoding (1) - target not one hot encoded + undersampling
data_train_oh_us, data_test_oh_us, target_train_oh_us, target_test_oh_us = get_preprocessed_brfss_dataset_one_hot_encoded_train_test_split_undersampled()

#one hot encoding (2) - target not one hot encoded + undersampling
data_train_a_oh_us, data_test_a_oh_us, target_train_a_oh_us, target_test_a_oh_us = get_preprocessed_brfss_dataset_one_hot_encoded_all_columns_train_test_split_undersampled()
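
The loader functions above hide what each preprocessing variant does. The following sketch shows one plausible reading of the variants on a toy frame. It is not the code inside the `preprocessing` package: the column names, the values, and the use of plain pandas for random over- and undersampling are assumptions made for illustration.

```python
import pandas as pd

# Toy frame; column names and values are invented for illustration only.
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "health": ["good", "poor", "fair", "good", "good", "fair"],
    "diabetes": [1, 0, 0, 0, 0, 1],
})

# Label encoding: every category is replaced by one integer code per column.
label_encoded = toy.copy()
for column in ["smoker", "health"]:
    label_encoded[column] = label_encoded[column].astype("category").cat.codes

# One hot encoding (1): yes/no columns stay a single 0/1 column,
# only the multi-valued column is expanded into indicator columns.
one_hot_partial = pd.get_dummies(toy, columns=["health"])
one_hot_partial["smoker"] = (one_hot_partial["smoker"] == "yes").astype(int)

# One hot encoding (2): all categorical columns are expanded, including yes/no.
one_hot_all = pd.get_dummies(toy, columns=["smoker", "health"])

# Split the rows by class to resample them.
counts = toy["diabetes"].value_counts()
minority_rows = toy[toy["diabetes"] == counts.idxmin()]
majority_rows = toy[toy["diabetes"] == counts.idxmax()]

# Random oversampling: draw minority rows with replacement until classes match.
oversampled = pd.concat([
    majority_rows,
    minority_rows.sample(n=len(majority_rows), replace=True, random_state=0),
])

# Random undersampling: keep only as many majority rows as there are minority rows.
undersampled = pd.concat([
    majority_rows.sample(n=len(minority_rows), random_state=0),
    minority_rows,
])

print(label_encoded)
print(one_hot_partial.columns.tolist())
print(one_hot_all.columns.tolist())
print(len(oversampled), len(undersampled))
```

One hot encoding (2) adds one extra column per yes/no feature, which changes the feature space in which the class centroids are computed.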

Now we can fit a nearest centroid classifier on the train data and measure how it performs on the test data using the accuracy score.

We do so for each metric listed above. The following sections are organized by preprocessing variant.
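
To make the role of the metric concrete, the following NumPy sketch assigns each sample to the class whose centroid is closest. It is not scikit-learn's implementation: the helper names `fit_centroids`, `pairwise_distance`, and `predict` and the toy data are assumptions, and the centroid is always the class mean here, whereas scikit-learn documents using the median for 'manhattan'. Because 'sqeuclidean' is a monotonic transform of 'euclidean', and 'minkowski' with its default p=2 coincides with 'euclidean', these three metrics select the same centroid for every sample.

```python
import numpy as np


def fit_centroids(X, y):
    """Return the class labels and the mean feature vector of each class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def pairwise_distance(X, centroids, metric):
    """Distance from every row of X to every centroid, for a few metrics."""
    diff = X[:, None, :] - centroids[None, :, :]
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "sqeuclidean":
        return (diff ** 2).sum(axis=2)
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)
    if metric == "cosine":
        # Cosine distance ignores vector length and compares directions only.
        norms = (np.linalg.norm(X, axis=1)[:, None]
                 * np.linalg.norm(centroids, axis=1)[None, :])
        return 1.0 - (X @ centroids.T) / norms
    raise ValueError("unsupported metric: {}".format(metric))


def predict(X, classes, centroids, metric="euclidean"):
    """Assign each row of X to the class of its nearest centroid."""
    return classes[pairwise_distance(X, centroids, metric).argmin(axis=1)]


# Toy data: two well separated classes in two dimensions.
X_toy = np.array([[1.0, 1.0], [1.0, 2.0], [5.0, 5.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 1, 1])
queries = np.array([[2.0, 1.0], [5.0, 6.0]])

classes, centroids = fit_centroids(X_toy, y_toy)
for metric in ("euclidean", "sqeuclidean", "manhattan", "cosine"):
    print("metric: {} -> {}".format(metric, predict(queries, classes, centroids, metric)))
```

On this toy data the cosine metric disagrees with the other three, because it compares only the direction of each sample with the direction of each centroid.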

# Label Encoding

In [None]:
#metrics that should be used
params = ('euclidean', 'minkowski', 'cosine', 'sqeuclidean', 'manhattan')

for metric in params:
    nearest_centroid = NearestCentroid(metric=metric)
    nearest_centroid.fit(data_train, target_train.values.ravel())
    diabetes_test_prediction = nearest_centroid.predict(data_test)
    print("metric: {} -> acc: {}".format(metric, accuracy_score(target_test, diabetes_test_prediction)))
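
The same loop is repeated for every preprocessing variant below, so it could be wrapped in a helper. The sketch below is one way to do that. The function name `evaluate_metrics` and the synthetic data from `make_classification` are assumptions, not part of the notebook. Metrics are skipped when the estimator rejects them, since recent scikit-learn releases accept only 'euclidean' and 'manhattan' for `NearestCentroid`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid


def evaluate_metrics(X_train, y_train, X_test, y_test, metrics):
    """Fit one NearestCentroid per metric and return {metric: test accuracy}."""
    scores = {}
    for metric in metrics:
        try:
            model = NearestCentroid(metric=metric).fit(X_train, y_train)
            predictions = model.predict(X_test)
        except (ValueError, TypeError) as error:
            # Raised when the installed scikit-learn does not accept this metric.
            print("metric: {} -> skipped ({})".format(metric, type(error).__name__))
            continue
        scores[metric] = accuracy_score(y_test, predictions)
    return scores


# Synthetic stand-in for the BRFSS data, used only to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

metrics = ("euclidean", "minkowski", "cosine", "sqeuclidean", "manhattan")
scores = evaluate_metrics(X_train, y_train, X_test, y_test, metrics)
for metric, acc in scores.items():
    print("metric: {} -> acc: {}".format(metric, acc))
```

With such a helper, each section would reduce to one call, for example `evaluate_metrics(data_train_os, target_train_os.values.ravel(), data_test_os, target_test_os, params)`.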

# Label Encoding + Oversampling

In [None]:
#metrics that should be used
params = ('euclidean', 'minkowski', 'cosine', 'sqeuclidean', 'manhattan')

for metric in params:
    nearest_centroid = NearestCentroid(metric=metric)
    nearest_centroid.fit(data_train_os, target_train_os.values.ravel())
    diabetes_test_prediction = nearest_centroid.predict(data_test_os)
    print("metric: {} -> acc: {}".format(metric, accuracy_score(target_test_os, diabetes_test_prediction)))

# Label Encoding + Undersampling

In [None]:
#metrics that should be used
params = ('euclidean', 'minkowski', 'cosine', 'sqeuclidean', 'manhattan')

for metric in params:
    nearest_centroid = NearestCentroid(metric=metric)
    nearest_centroid.fit(data_train_us, target_train_us.values.ravel())
    diabetes_test_prediction = nearest_centroid.predict(data_test_us)
    print("metric: {} -> acc: {}".format(metric, accuracy_score(target_test_us, diabetes_test_prediction)))

# One Hot Encoding (1)
Yes/no columns are not one hot encoded.

In [None]:
#metrics that should be used
params = ('euclidean', 'minkowski', 'cosine', 'sqeuclidean', 'manhattan')

for metric in params:
    nearest_centroid = NearestCentroid(metric=metric)
    nearest_centroid.fit(data_train_oh, target_train_oh.values.ravel())
    diabetes_test_prediction = nearest_centroid.predict(data_test_oh)
    print("metric: {} -> acc: {}".format(metric, accuracy_score(target_test_oh, diabetes_test_prediction)))

# One Hot Encoding (2)
All columns are one hot encoded, including yes/no columns.

In [None]:
#metrics that should be used
params = ('euclidean', 'minkowski', 'cosine', 'sqeuclidean', 'manhattan')

for metric in params:
    nearest_centroid = NearestCentroid(metric=metric)
    nearest_centroid.fit(data_train_a_oh, target_train_a_oh.values.ravel())
    diabetes_test_prediction = nearest_centroid.predict(data_test_a_oh)
    print("metric: {} -> acc: {}".format(metric, accuracy_score(target_test_a_oh, diabetes_test_prediction)))

# One Hot Encoding (1) + Oversampling

In [None]:
#metrics that should be used
params = ('euclidean', 'minkowski', 'cosine', 'sqeuclidean', 'manhattan')

for metric in params:
    nearest_centroid = NearestCentroid(metric=metric)
    nearest_centroid.fit(data_train_oh_os, target_train_oh_os.values.ravel())
    diabetes_test_prediction = nearest_centroid.predict(data_test_oh_os)
    print("metric: {} -> acc: {}".format(metric, accuracy_score(target_test_oh_os, diabetes_test_prediction)))

# One Hot Encoding (2) + Oversampling

In [None]:
#metrics that should be used
params = ('euclidean', 'minkowski', 'cosine', 'sqeuclidean', 'manhattan')

for metric in params:
    nearest_centroid = NearestCentroid(metric=metric)
    nearest_centroid.fit(data_train_a_oh_os, target_train_a_oh_os.values.ravel())
    diabetes_test_prediction = nearest_centroid.predict(data_test_a_oh_os)
    print("metric: {} -> acc: {}".format(metric, accuracy_score(target_test_a_oh_os, diabetes_test_prediction)))

# One Hot Encoding (1) + Undersampling

In [None]:
#metrics that should be used
params = ('euclidean', 'minkowski', 'cosine', 'sqeuclidean', 'manhattan')

for metric in params:
    nearest_centroid = NearestCentroid(metric=metric)
    nearest_centroid.fit(data_train_oh_us, target_train_oh_us.values.ravel())
    diabetes_test_prediction = nearest_centroid.predict(data_test_oh_us)
    print("metric: {} -> acc: {}".format(metric, accuracy_score(target_test_oh_us, diabetes_test_prediction)))

# One Hot Encoding (2) + Undersampling

In [None]:
#metrics that should be used
params = ('euclidean', 'minkowski', 'cosine', 'sqeuclidean', 'manhattan')

for metric in params:
    nearest_centroid = NearestCentroid(metric=metric)
    nearest_centroid.fit(data_train_a_oh_us, target_train_a_oh_us.values.ravel())
    diabetes_test_prediction = nearest_centroid.predict(data_test_a_oh_us)
    print("metric: {} -> acc: {}".format(metric, accuracy_score(target_test_a_oh_us, diabetes_test_prediction)))
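
All sections above compare variants by accuracy only, although `confusion_matrix` is imported at the top of the notebook. On imbalanced data, which is the situation oversampling and undersampling are meant to address, accuracy can look good while the minority class is missed entirely. The sketch below illustrates this with invented labels and a trivial prediction that always outputs the majority class. The arrays are assumptions and not results from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Invented labels: 90 negative and 10 positive samples.
y_true = np.array([0] * 90 + [1] * 10)

# A trivial "classifier" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
balanced = balanced_accuracy_score(y_true, y_pred)
matrix = confusion_matrix(y_true, y_pred)

print("accuracy: {}".format(acc))
print("balanced accuracy: {}".format(balanced))
# Rows are true classes, columns are predicted classes.
print(matrix)
```

The second row of the matrix shows that no positive sample is predicted correctly, which the accuracy score alone does not reveal.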

# Archived-Code - can be deleted later

In [4]:
# nearest_centroid = NearestCentroid()
# nearest_centroid.fit(data_train, target_train.values.ravel())
# predictions = nearest_centroid.predict(data_test)
# print("nearest_centroid: acc: {}".format(accuracy_score(target_test, predictions)))
#
# nearest_centroid.get_params()
#
# params = {
#     'metric': ('euclidean', 'minkowski', 'cosine', 'sqeuclidean', 'manhattan')
# }
#
# grid_search_estimator = GridSearchCV(nearest_centroid, params, scoring='accuracy', cv=5, return_train_score=False)
# grid_search_estimator.fit(data_train,target_train.values.ravel())
#
# results = pd.DataFrame(grid_search_estimator.cv_results_)
# display(results)
#
#
# print("best score is {} with params {}".format(grid_search_estimator.best_score_, grid_search_estimator.best_params_))



|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_metric | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 0 | 0.058831 | 0.00498 | 0.011605 | 0.000648 | euclidean | {'metric': 'euclidean'} | 0.146888 | 0.14921 | 0.147766 | 0.146449 | 0.149495 | 0.147962 | 0.001216 | 2 |
| 1 | 0.052294 | 0.00392 | 0.014411 | 0.000776 | minkowski | {'metric': 'minkowski'} | 0.146888 | 0.14921 | 0.147766 | 0.146449 | 0.149495 | 0.147962 | 0.001216 | 2 |
| 2 | 0.050737 | 0.001637 | 0.01191 | 0.00049 | cosine | {'metric': 'cosine'} | 0.370823 | 0.362782 | 0.36768 | 0.372253 | 0.366213 | 0.36795 | 0.003364 | 1 |
| 3 | 0.050477 | 0.00097 | 0.013757 | 0.000381 | sqeuclidean | {'metric': 'sqeuclidean'} | 0.146888 | 0.14921 | 0.147766 | 0.146449 | 0.149495 | 0.147962 | 0.001216 | 2 |
| 4 | 0.18025 | 0.005372 | 0.011906 | 0.000155 | manhattan | {'metric': 'manhattan'} | 0.132178 | 0.108633 | 0.109384 | 0.111847 | 0.109754 | 0.114359 | 0.008973 | 5 |


best score is 0.3679504083507075 with params {'metric': 'cosine'}
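
For reference, a runnable variant of the archived grid search could look like the sketch below. It uses synthetic data from `make_classification` instead of the BRFSS data, and the grid is limited to 'euclidean' and 'manhattan' on the assumption that a recent scikit-learn release is installed, where the other metrics are no longer accepted by `NearestCentroid`. The scores it prints are unrelated to the results shown above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import NearestCentroid

# Synthetic stand-in for the training data, used only to make the sketch runnable.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate values for the metric hyperparameter.
param_grid = {"metric": ["euclidean", "manhattan"]}

# 5-fold cross-validated search, scored by accuracy.
search = GridSearchCV(NearestCentroid(), param_grid, scoring="accuracy", cv=5)
search.fit(X, y)

# Keep only the columns needed to compare the candidates.
columns = ["param_metric", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns]
print(results)

print("best score is {} with params {}".format(search.best_score_, search.best_params_))
```

Unlike the per-metric loops in the sections above, the grid search scores each candidate by cross-validation on the training data only, so the test set stays untouched until a final evaluation.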
