# KNN Metrics

### Article source: [Some study notes on machine learning algorithms — The series](https://medium.com/comunidadeds/some-study-notes-on-machine-learning-algorithms-the-series-cd7549746f86)

## Imports

In [9]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from scipy.spatial import distance

import warnings
warnings.filterwarnings("ignore")

## Data

In [35]:
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

## Verifying KNN Distance Metrics

In [108]:
from sklearn.neighbors import VALID_METRICS

In [19]:
VALID_METRICS.keys()

dict_keys(['ball_tree', 'kd_tree', 'brute'])

In [30]:
print(f"scikit-learn's brute-force KNN supports {len(VALID_METRICS['brute'])} distance metrics.")

scikit-learn's brute-force KNN supports 27 distance metrics.


In [36]:
METRICS = VALID_METRICS['brute']

In [50]:
VALID_METRICS['brute']

['braycurtis',
 'canberra',
 'chebyshev',
 'cityblock',
 'correlation',
 'cosine',
 'dice',
 'euclidean',
 'hamming',
 'haversine',
 'jaccard',
 'kulsinski',
 'l1',
 'l2',
 'mahalanobis',
 'manhattan',
 'matching',
 'minkowski',
 'nan_euclidean',
 'precomputed',
 'rogerstanimoto',
 'russellrao',
 'seuclidean',
 'sokalmichener',
 'sokalsneath',
 'sqeuclidean',
 'yule']
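
To make a few of the names in this list concrete, here is a minimal pure-Python sketch of four of the metrics. The function names and the sample vectors `u` and `v` are illustrative assumptions, not scikit-learn code; the library's own implementations are vectorized and cover more edge cases.

```python
import math


def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def chebyshev(u, v):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(u, v))


def canberra(u, v):
    """Sum of |a - b| / (|a| + |b|); a 0/0 term is counted as 0."""
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


# Hypothetical vectors with very different feature scales
u = [1.0, 200.0, 0.0]
v = [2.0, 100.0, 0.0]

for name, fn in [("manhattan", manhattan), ("euclidean", euclidean),
                 ("chebyshev", chebyshev), ("canberra", canberra)]:
    print(f"{name:>10}: {fn(u, v):.4f}")
```

In scikit-learn, `'cityblock'`, `'l1'` and `'manhattan'` name the same distance, as do `'euclidean'` and `'l2'`. On the sample vectors, the first three distances are dominated by the large second coordinate, while Canberra weights each coordinate by its own magnitude and stays below 1.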

## CFG

In [85]:
METRICS = ['cityblock',
           'cosine',
           'euclidean',
#            'haversine', 'seuclidean',  # yield NaN scores
#            'mahalanobis',  # yields NaN test scores
           'minkowski',   # good training score
#            'precomputed',  # raises an error here
           'sqeuclidean', # good test score
           'l1', 'l2', 'manhattan',
           'nan_euclidean', 'hamming', 'braycurtis',
           'canberra', 'chebyshev', 'correlation',
           'dice', 'jaccard', 'kulsinski',
           'yule', 'matching', 'rogerstanimoto',
           'russellrao', 'sokalmichener', 'sokalsneath']

K = 5
FOLDS = 5

In [86]:
METRICS.sort()

In [87]:
METRICS

['braycurtis',
 'canberra',
 'chebyshev',
 'cityblock',
 'correlation',
 'cosine',
 'dice',
 'euclidean',
 'hamming',
 'jaccard',
 'kulsinski',
 'l1',
 'l2',
 'manhattan',
 'matching',
 'minkowski',
 'nan_euclidean',
 'rogerstanimoto',
 'russellrao',
 'sokalmichener',
 'sokalsneath',
 'sqeuclidean',
 'yule']

## Pipeline

In [88]:
# Define the pipeline steps: a single KNN classifier with K neighbors
steps = [("knn", KNeighborsClassifier(n_neighbors=K))]

# Instantiate the pipeline
pipe = Pipeline(steps=steps)
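
The pipeline's only step is a KNN classifier, so the distance metric fully determines which training points count as neighbors. The following is a minimal sketch of that idea, assuming a hypothetical `knn_predict` helper and toy 2-D data; it is not how scikit-learn implements `KNeighborsClassifier`, only an illustration of majority voting over the `k` nearest points under a pluggable metric.

```python
from collections import Counter


def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def chebyshev(u, v):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(u, v))


def knn_predict(X_train, y_train, x, k, metric):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Rank training points by their distance to the query under the chosen metric
    ranked = sorted(zip(X_train, y_train), key=lambda pair: metric(pair[0], x))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): the two classes lie in different directions from the query
X_train = [(3, 3), (0, 4), (5, 5), (0, 6)]
y_train = [0, 1, 0, 1]
query = (0, 0)

pred_manhattan = knn_predict(X_train, y_train, query, k=3, metric=manhattan)
pred_chebyshev = knn_predict(X_train, y_train, query, k=3, metric=chebyshev)
print("manhattan ->", pred_manhattan)
print("chebyshev ->", pred_chebyshev)
```

With the same data and the same `k`, the two metrics rank the neighbors differently and the predicted class changes, which is why the metric is worth treating as a hyperparameter.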

## GridSearchCV

In [91]:
# Parameter grid for the cross-validated search: one candidate per distance metric
params = {}
params['knn__metric'] = METRICS

grid = GridSearchCV(pipe, params, scoring='f1', cv=FOLDS, return_train_score=True)

grid.fit(X,y)

print(grid.best_estimator_)

results = (
    pd.DataFrame(grid.cv_results_)[['param_knn__metric', 'mean_train_score', 'mean_test_score']]
    .sort_values('param_knn__metric')
    .rename(columns={'param_knn__metric': 'distance'})
)
results

Pipeline(steps=[('knn', KNeighborsClassifier(metric='canberra'))])


|   | distance | mean_train_score | mean_test_score |
|---|----------|------------------|-----------------|
| 0 | braycurtis | 0.963162 | 0.94676 |
| 1 | canberra | 0.98057 | 0.962562 |
| 2 | chebyshev | 0.952179 | 0.940097 |
| 3 | cityblock | 0.963521 | 0.94676 |
| 4 | correlation | 0.958079 | 0.939395 |
| 5 | cosine | 0.954581 | 0.939479 |
| 6 | dice | 0.771058 | 0.771052 |
| 7 | euclidean | 0.958639 | 0.943823 |
| 8 | hamming | 0.862099 | 0.787085 |
| 9 | jaccard | 0.771058 | 0.771052 |


In [106]:
print(f"Best distance metric for this data is '{results.loc[results['mean_test_score'].idxmax(), 'distance']}', with a mean cross-validated F1 score of {results['mean_test_score'].max():.3f}.")

Best distance metric for this data is 'canberra', with a mean cross-validated F1 score of 0.963.
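
Because the grid search is run with `scoring='f1'`, the scores reported here are F1 scores, which can differ from plain accuracy. The sketch below illustrates the difference on hypothetical labels; the helper names `accuracy` and `f1_binary` and the toy predictions are assumptions made for illustration, not values from this notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 4 positives and 6 negatives, with two mistakes
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

acc = accuracy(y_true, y_pred)
f1 = f1_binary(y_true, y_pred)
print(f"accuracy = {acc:.2f}, F1 = {f1:.2f}")
```

On these toy labels the accuracy is 0.80 while the F1 score is 0.75, because F1 ignores true negatives and looks only at how well the positive class is recovered.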
