# Nearest Centroids
In this notebook we train a Nearest Centroid classifier to predict whether a patient has diabetes.

In [1]:
#imports
from preprocessing.preprocessing import *
from preprocessing.preprocessing_label_encoding import *
from preprocessing.preprocessing_one_hot_encoding import *

import pandas as pd

from sklearn.neighbors import NearestCentroid
from sklearn.metrics import accuracy_score, classification_report, fbeta_score

import warnings
warnings.filterwarnings('ignore')

#used to store the results and compare at the end
approach_list = []
acc_list = []
cr_list = []
f2_list = []



# Hyperparameter tuning and preprocessing
We evaluate different parameters for the classifier (hyperparameter tuning) as well as different preprocessing steps.

For the classifier, we vary the following parameter:


| **Parameter** |                      **Values**                       |
|:-------------:|:-----------------------------------------------------:|
| metric        | 'euclidean',<br>'cosine',<br>'manhattan',<br>'jaccard' |


For preprocessing, we vary the following steps:

|              **Preprocessing**              |                           **Description**                           |
|:-------------------------------------------:|:-------------------------------------------------------------------:|
|               Label Encoding                |                           Label encoding                            |
|     Label Encoding<br>+<br>Oversampling     |                   Label encoding and oversampling                   |
|    Label Encoding<br>+<br>Undersampling     |                  Label encoding and undersampling                   |
|            One Hot Encoding (1)             |           One hot encoding for all columns except yes/no            |
|            One Hot Encoding (2)             |          One hot encoding for all columns including yes/no          |
|  One Hot Encoding (1)<br>+<br>Oversampling  |   One hot encoding for all columns except yes/no and oversampling   |
| One Hot Encoding (2) <br>+<br>Oversampling  | One hot encoding for all columns including yes/no and oversampling  |
| One Hot Encoding (1)<br>+<br>Undersampling  |  One hot encoding for all columns except yes/no and undersampling   |
| One Hot Encoding (2) <br>+<br>Undersampling | One hot encoding for all columns including yes/no and undersampling |

In [2]:
#metrics that should be used
metrics = ('euclidean', 'cosine', 'manhattan', 'jaccard')
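
To make the role of the `metric` parameter concrete, here is a minimal pure-Python sketch of the nearest-centroid idea: compute one centroid per class, then assign each sample to the class of its closest centroid. It is an illustration under simplifying assumptions, not scikit-learn's implementation. The helper names (`fit_centroids`, `distance`, `predict`) and the toy data are hypothetical. The sketch follows the behaviour described in the scikit-learn documentation, where the centroid is the per-feature median for `'manhattan'` and the mean otherwise. Only `'euclidean'` and `'manhattan'` are covered here. As far as we know, recent scikit-learn releases (1.5 and later) accept only these two metrics for `NearestCentroid`, so the `'cosine'` and `'jaccard'` runs in this notebook may require an older version.

```python
from statistics import mean, median


def fit_centroids(rows, labels, metric="euclidean"):
    """Return {label: centroid}. Median per feature for manhattan, mean otherwise."""
    center = median if metric == "manhattan" else mean
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    return {
        label: [center(column) for column in zip(*members)]
        for label, members in by_class.items()
    }


def distance(a, b, metric="euclidean"):
    """Distance between two feature vectors for the two metrics sketched here."""
    if metric == "manhattan":
        return sum(abs(x - y) for x, y in zip(a, b))
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def predict(centroids, row, metric="euclidean"):
    """Return the label whose centroid is closest to the given row."""
    return min(centroids, key=lambda label: distance(centroids[label], row, metric))


# Hypothetical toy data: two well-separated classes in two dimensions.
rows = [[1.0, 1.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]
labels = [0, 0, 1, 1]

centroids = fit_centroids(rows, labels)
prediction = predict(centroids, [1.5, 2.0])
```

Because the model keeps only one point per class, training is cheap, but the classifier cannot represent classes that are spread over several clusters.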

The function defined below stores the accuracy, the classification report, and the F2 score of every run so that all approaches can be compared at the end. It also returns the accuracy and F2 score as a string, so the performance can be printed directly under each run.


In [2]:
def evaluation(target_validation, diabetes_test_prediction, metric, preprocessing):
    approach = "preprocessing: {} metric: {}".format(preprocessing, metric)
    approach_list.append(approach)
    acc = accuracy_score(target_validation, diabetes_test_prediction)
    acc_list.append(acc)
    cr = classification_report(target_validation, diabetes_test_prediction)
    cr_list.append(cr)
    f2 = fbeta_score(target_validation, diabetes_test_prediction, beta=2)
    f2_list.append(f2)
    return "{}:\n acc = {}\n f2-score = {}".format(approach, acc, f2)
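
For readers unfamiliar with the F2 score: `fbeta_score` with `beta=2` weights recall more heavily than precision, and with its default settings it scores the positive class only. The following sketch computes the same quantity by hand from confusion-matrix counts. The function name `fbeta_from_counts` and the counts are hypothetical and chosen only for illustration.

```python
def fbeta_from_counts(tp, fp, fn, beta=2.0):
    """F-beta score of the positive class from confusion-matrix counts."""
    if tp == 0:
        # No true positives: the score is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: precision = 0.6, recall = 0.75.
tp, fp, fn = 60, 40, 20

f2 = fbeta_from_counts(tp, fp, fn)            # recall counts four times as much
f1 = fbeta_from_counts(tp, fp, fn, beta=1.0)  # precision and recall weighted equally
```

With these counts recall is higher than precision, so the F2 score comes out above the F1 score. For a screening task such as diabetes detection, this emphasis on recall penalizes missed patients more than false alarms.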

We start by fitting a Nearest Centroid classifier to the training data and measuring its accuracy and F2 score on the validation data.

We do so for each combination listed above by using a for loop. The following sections are organized by preprocessing variant.

At the end, we test the best approach against the actual test data.
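
The cells below repeat the same loop once per preprocessing variant. As a possible alternative, the whole grid could be driven by a single helper that takes a mapping from preprocessing name to loader function. This is only a sketch: `run_grid`, `toy_loader`, `majority_fit_predict`, and `accuracy` are hypothetical stand-ins, so the example runs without the notebook's preprocessing module or scikit-learn.

```python
def run_grid(loaders, metrics, fit_predict, score):
    """Evaluate every (preprocessing, metric) pair and return a list of result dicts."""
    results = []
    for name, load in loaders.items():
        # Load each preprocessing variant once and reuse it for every metric.
        x_train, x_val, y_train, y_val = load()
        for metric in metrics:
            predictions = fit_predict(x_train, y_train, x_val, metric)
            results.append(
                {"preprocessing": name, "metric": metric, "score": score(y_val, predictions)}
            )
    return results


def toy_loader():
    """Hypothetical loader returning (x_train, x_val, y_train, y_val)."""
    return [[0], [1], [2]], [[0], [3]], [0, 0, 1], [0, 1]


def majority_fit_predict(x_train, y_train, x_val, metric):
    """Stand-in model: always predicts the most frequent training label."""
    majority = max(set(y_train), key=y_train.count)
    return [majority] * len(x_val)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


results = run_grid(
    {"toy A": toy_loader, "toy B": toy_loader},
    ("euclidean", "manhattan"),
    majority_fit_predict,
    accuracy,
)
```

In the notebook, the loaders would be the `get_preprocessed_brfss_dataset_...` functions and `fit_predict` would wrap `NearestCentroid`. The explicit per-section cells are kept here because they show the results directly under each variant.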

# Label Encoding

In [4]:
preprocessing = "Label Encoding"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_label_encoded_train_test_split()

for metric in metrics:
    nearest_centroid = NearestCentroid(metric=metric)
    nearest_centroid.fit(data_train, target_train.values.ravel())
    diabetes_test_prediction = nearest_centroid.predict(data_validation)
    print(evaluation(target_validation, diabetes_test_prediction, metric, preprocessing))

preprocessing: Label Encoding metric: euclidean:
 acc = 0.6673002451415782
 f2-score = 0.53095703125
preprocessing: Label Encoding metric: cosine:
 acc = 0.6812093651188403
 f2-score = 0.5305309293754358
preprocessing: Label Encoding metric: manhattan:
 acc = 0.7256901268341208
 f2-score = 0.5538975641537588
preprocessing: Label Encoding metric: jaccard:
 acc = 0.8721178100685686
 f2-score = 0.0


# Label Encoding + Oversampling

In [5]:
preprocessing = "Label Encoding + Oversampling"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_label_encoded_train_test_split_oversampled()

for metric in metrics:
    nearest_centroid = NearestCentroid(metric=metric)
    nearest_centroid.fit(data_train, target_train.values.ravel())
    diabetes_test_prediction = nearest_centroid.predict(data_validation)
    print(evaluation(target_validation, diabetes_test_prediction, metric, preprocessing))

preprocessing: Label Encoding + Oversampling metric: euclidean:
 acc = 0.6669804952570434
 f2-score = 0.5313854081274154
preprocessing: Label Encoding + Oversampling metric: cosine:
 acc = 0.6808718513518315
 f2-score = 0.5307223638101306
preprocessing: Label Encoding + Oversampling metric: manhattan:
 acc = 0.71101715991047
 f2-score = 0.558961635465152
preprocessing: Label Encoding + Oversampling metric: jaccard:
 acc = 0.8721178100685686
 f2-score = 0.0


# Label Encoding + Undersampling

In [6]:
preprocessing = "Label Encoding + Undersampling"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_label_encoded_train_test_split_undersampled()

for metric in metrics:
    nearest_centroid = NearestCentroid(metric=metric)
    nearest_centroid.fit(data_train, target_train.values.ravel())
    diabetes_test_prediction = nearest_centroid.predict(data_validation)
    print(evaluation(target_validation, diabetes_test_prediction, metric, preprocessing))

preprocessing: Label Encoding + Undersampling metric: euclidean:
 acc = 0.6677621060859061
 f2-score = 0.5309959349593495
preprocessing: Label Encoding + Undersampling metric: cosine:
 acc = 0.6813514761786336
 f2-score = 0.5306154980373403
preprocessing: Label Encoding + Undersampling metric: manhattan:
 acc = 0.7256901268341208
 f2-score = 0.5538975641537588
preprocessing: Label Encoding + Undersampling metric: jaccard:
 acc = 0.8721178100685686
 f2-score = 0.0


# One Hot Encoding (1)
Yes/no columns are not one-hot encoded.

In [7]:
preprocessing = "One Hot Encoding (1)"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_one_hot_encoded_train_test_split()

for metric in metrics:
    nearest_centroid = NearestCentroid(metric=metric)
    nearest_centroid.fit(data_train, target_train.values.ravel())
    diabetes_test_prediction = nearest_centroid.predict(data_validation)
    print(evaluation(target_validation, diabetes_test_prediction, metric, preprocessing))

preprocessing: One Hot Encoding (1) metric: euclidean:
 acc = 0.7202366149145557
 f2-score = 0.5659361146787091
preprocessing: One Hot Encoding (1) metric: cosine:
 acc = 0.7183181156073472
 f2-score = 0.5645928069450186
preprocessing: One Hot Encoding (1) metric: manhattan:
 acc = 0.7154403666465343
 f2-score = 0.5276018289357775
preprocessing: One Hot Encoding (1) metric: jaccard:
 acc = 0.1278821899314314
 f2-score = 0.42302268186625924
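
The jaccard scores deserve a closer look. The accuracy reported here (about 0.128) and the accuracy reported for jaccard on the label-encoded data (about 0.872) add up to 1, which is consistent with a classifier that predicts the same class for every patient. This is an inference from the printed scores, not something the notebook checks directly. The arithmetic below assumes that about 12.8% of the validation samples are positive and reproduces both score pairs under that assumption.

```python
# Assumed share of positive samples in the validation set, taken from the
# accuracy printed above for the jaccard metric.
positive_rate = 0.1278821899314314

# Case 1: every patient is predicted negative.
# All negatives are correct, and there are no true positives.
acc_all_negative = 1.0 - positive_rate
f2_all_negative = 0.0

# Case 2: every patient is predicted positive.
# Only the positives are correct: precision equals the positive rate, recall is 1.
acc_all_positive = positive_rate
precision, recall = positive_rate, 1.0
f2_all_positive = 5 * precision * recall / (4 * precision + recall)
```

Case 1 matches the pair acc = 0.872, f2 = 0.0, and case 2 matches acc = 0.128, f2 = 0.423. If this reading is right, the high accuracy of some jaccard runs only reflects the class imbalance, which is one reason to rank approaches by F2 score rather than by accuracy.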


# One Hot Encoding (2)
All columns are one-hot encoded, including yes/no columns.

In [8]:
preprocessing = "One Hot Encoding (2)"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_one_hot_encoded_all_columns_train_test_split()

for metric in metrics:
    nearest_centroid = NearestCentroid(metric=metric)
    nearest_centroid.fit(data_train, target_train.values.ravel())
    diabetes_test_prediction = nearest_centroid.predict(data_validation)
    print(evaluation(target_validation, diabetes_test_prediction, metric, preprocessing))

preprocessing: One Hot Encoding (2) metric: euclidean:
 acc = 0.7302909723949267
 f2-score = 0.5737807943690297
preprocessing: One Hot Encoding (2) metric: cosine:
 acc = 0.7304153195722457
 f2-score = 0.5741070493348696
preprocessing: One Hot Encoding (2) metric: manhattan:
 acc = 0.7216399616300139
 f2-score = 0.5213753265232441
preprocessing: One Hot Encoding (2) metric: jaccard:
 acc = 0.1278821899314314
 f2-score = 0.42302268186625924


# One Hot Encoding (1) + Oversampling

In [9]:
preprocessing = "One Hot Encoding (1) + Oversampling"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_one_hot_encoded_train_test_split_oversampled()

for metric in metrics:
    nearest_centroid = NearestCentroid(metric=metric)
    nearest_centroid.fit(data_train, target_train.values.ravel())
    diabetes_test_prediction = nearest_centroid.predict(data_validation)
    print(evaluation(target_validation, diabetes_test_prediction, metric, preprocessing))

preprocessing: One Hot Encoding (1) + Oversampling metric: euclidean:
 acc = 0.7202366149145557
 f2-score = 0.5658559827643355
preprocessing: One Hot Encoding (1) + Oversampling metric: cosine:
 acc = 0.7183536433722955
 f2-score = 0.5645361336199768
preprocessing: One Hot Encoding (1) + Oversampling metric: manhattan:
 acc = 0.7154403666465343
 f2-score = 0.5276018289357775
preprocessing: One Hot Encoding (1) + Oversampling metric: jaccard:
 acc = 0.1278821899314314
 f2-score = 0.42302268186625924


# One Hot Encoding (2) + Oversampling

In [10]:
preprocessing = "One Hot Encoding (2) + Oversampling"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_one_hot_encoded_all_columns_train_test_split_oversampled()

for metric in metrics:
    nearest_centroid = NearestCentroid(metric=metric)
    nearest_centroid.fit(data_train, target_train.values.ravel())
    diabetes_test_prediction = nearest_centroid.predict(data_validation)
    print(evaluation(target_validation, diabetes_test_prediction, metric, preprocessing))

preprocessing: One Hot Encoding (2) + Oversampling metric: euclidean:
 acc = 0.7302376807475042
 f2-score = 0.5732604186309637
preprocessing: One Hot Encoding (2) + Oversampling metric: cosine:
 acc = 0.7305574306320389
 f2-score = 0.5736382118081026
preprocessing: One Hot Encoding (2) + Oversampling metric: manhattan:
 acc = 0.7216399616300139
 f2-score = 0.5213753265232441
preprocessing: One Hot Encoding (2) + Oversampling metric: jaccard:
 acc = 0.1278821899314314
 f2-score = 0.42302268186625924


# One Hot Encoding (1) + Undersampling

In [11]:
preprocessing = "One Hot Encoding (1) + Undersampling"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_one_hot_encoded_train_test_split_undersampled()

for metric in metrics:
    nearest_centroid = NearestCentroid(metric=metric)
    nearest_centroid.fit(data_train, target_train.values.ravel())
    diabetes_test_prediction = nearest_centroid.predict(data_validation)
    print(evaluation(target_validation, diabetes_test_prediction, metric, preprocessing))

preprocessing: One Hot Encoding (1) + Undersampling metric: euclidean:
 acc = 0.7213202117454791
 f2-score = 0.5657695420306696
preprocessing: One Hot Encoding (1) + Undersampling metric: cosine:
 acc = 0.7196504067929087
 f2-score = 0.5653092068487195
preprocessing: One Hot Encoding (1) + Undersampling metric: manhattan:
 acc = 0.7154403666465343
 f2-score = 0.5276018289357775
preprocessing: One Hot Encoding (1) + Undersampling metric: jaccard:
 acc = 0.8721178100685686
 f2-score = 0.0


# One Hot Encoding (2) + Undersampling

In [12]:
preprocessing = "One Hot Encoding (2) + Undersampling"

#load data
data_train, data_validation, target_train, target_validation = get_preprocessed_brfss_dataset_one_hot_encoded_all_columns_train_test_split_undersampled()

for metric in metrics:
    nearest_centroid = NearestCentroid(metric=metric)
    nearest_centroid.fit(data_train, target_train.values.ravel())
    diabetes_test_prediction = nearest_centroid.predict(data_validation)
    print(evaluation(target_validation, diabetes_test_prediction, metric, preprocessing))

preprocessing: One Hot Encoding (2) + Undersampling metric: euclidean:
 acc = 0.7302376807475042
 f2-score = 0.574390116218197
preprocessing: One Hot Encoding (2) + Undersampling metric: cosine:
 acc = 0.7301488613351335
 f2-score = 0.5742493404246409
preprocessing: One Hot Encoding (2) + Undersampling metric: manhattan:
 acc = 0.7216399616300139
 f2-score = 0.5213753265232441
preprocessing: One Hot Encoding (2) + Undersampling metric: jaccard:
 acc = 0.8721178100685686
 f2-score = 0.0


# Evaluation of best approach
We use the following loop to print all F2 and accuracy scores and to determine which approach performed best.

In [13]:
highest_acc = [0.0, None, None]
highest_f2 = [0.0, None, None]

for i in range(0, len(approach_list)):
    print("Nr.{}) {}".format(i, approach_list[i]))
    print ("Weighted F2-Score = {}".format(f2_list[i]))
    print ("Accuracy-Score = {}\n".format(acc_list[i]))
    if highest_f2[0] < float(f2_list[i]):
        highest_f2[0] = f2_list[i]
        highest_f2[1] = i
        highest_f2[2] = approach_list[i]
    if highest_acc[0] < float(acc_list[i]):
        highest_acc[0] = acc_list[i]
        highest_acc[1] = i
        highest_acc[2] = approach_list[i]

print("--------- Best Approaches ---------")
print("Best Approach regarding F2-Score:\nNr.{}) {} with f2-score = {}\n".format(highest_f2[1], highest_f2[2], highest_f2[0]))
print("Best Approach regarding Accuracy:\nNr.{}) {} with acc = {}".format(highest_acc[1], highest_acc[2], highest_acc[0]))

Nr.0) preprocessing: Label Encoding metric: euclidean
Weighted F2-Score = 0.53095703125
Accuracy-Score = 0.6673002451415782

Nr.1) preprocessing: Label Encoding metric: cosine
Weighted F2-Score = 0.5305309293754358
Accuracy-Score = 0.6812093651188403

Nr.2) preprocessing: Label Encoding metric: manhattan
Weighted F2-Score = 0.5538975641537588
Accuracy-Score = 0.7256901268341208

Nr.3) preprocessing: Label Encoding metric: jaccard
Weighted F2-Score = 0.0
Accuracy-Score = 0.8721178100685686

Nr.4) preprocessing: Label Encoding + Oversampling metric: euclidean
Weighted F2-Score = 0.5313854081274154
Accuracy-Score = 0.6669804952570434

Nr.5) preprocessing: Label Encoding + Oversampling metric: cosine
Weighted F2-Score = 0.5307223638101306
Accuracy-Score = 0.6808718513518315

Nr.6) preprocessing: Label Encoding + Oversampling metric: manhattan
Weighted F2-Score = 0.558961635465152
Accuracy-Score = 0.71101715991047

Nr.7) preprocessing: Label Encoding + Oversampling metric: jaccard
Weighted 
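
The selection logic of the loop above can also be written more compactly with `max` over the list indices. The sketch below uses only the four label-encoding results printed above. The shortened approach names and the variable names are chosen for illustration.

```python
approaches = [
    "Label Encoding / euclidean",
    "Label Encoding / cosine",
    "Label Encoding / manhattan",
    "Label Encoding / jaccard",
]
f2_scores = [0.53095703125, 0.5305309293754358, 0.5538975641537588, 0.0]
acc_scores = [0.6673002451415782, 0.6812093651188403, 0.7256901268341208, 0.8721178100685686]

# max returns the first index with the highest score, like the strict "<" in the loop.
best_f2_index = max(range(len(f2_scores)), key=f2_scores.__getitem__)
best_acc_index = max(range(len(acc_scores)), key=acc_scores.__getitem__)

best_by_f2 = approaches[best_f2_index]
best_by_acc = approaches[best_acc_index]
```

On this subset the two criteria disagree: accuracy selects the jaccard run, whose F2 score is 0, while the F2 score selects the manhattan run.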

We can see that approach 32 (preprocessing: One Hot Encoding (2) + Undersampling, metric: euclidean) performed best, with an F2 score of 0.574.
Finally, we test this approach against the test data that we separated at the beginning:

In [3]:
preprocessing = "One Hot Encoding (2) + Undersampling"

#load data
data_train, data_validation, data_test, target_train, target_validation, target_test = get_preprocessed_brfss_dataset_one_hot_encoded_all_columns_train_test_split_undersampled(include_test_data=True)

nearest_centroid = NearestCentroid(metric='euclidean')
nearest_centroid.fit(data_train, target_train.values.ravel())
diabetes_test_prediction = nearest_centroid.predict(data_test)
print(evaluation(target_test, diabetes_test_prediction, "euclidean", preprocessing))

preprocessing: One Hot Encoding (2) + Undersampling metric: euclidean:
 acc = 0.7322852828848033
 f2-score = 0.5745641671917665
