<h1>K-Nearest Neighbors</h1>
<p>
    K-NearestNeighbors is a supervised machine learning algorithm that uses density to predict labels for tagets.</p>
    

In [94]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import random

A dataset on winequality will be used. Several features can be used to predict the quality label. We will try out some different distance metrics.

Data: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009

In [95]:
def run():
    '''
    greedy appraoch:
    1) For each metric, do the following.
    2) Shuffle data columns (predictors)
    3) Choose first 2 features as predictors. Do knn, fit, score.
    4) Repeat step 3 with 1 additional feature. Then again until all features are used.
    5) Append results.
    6) After last metric is used and step 5 is done, return results.
    '''
    
    path = 'data/winequality-red.csv'
    
    
    df = pd.read_csv(path)
    df['id']=[j for j in range(0,len(df.iloc[:,0]))]
    
    metrics = ['euclidean','minkowski','cosine','chebyshev','manhattan']
    T = len(metrics) #
    
    X = df.drop(columns=['quality','id'])
    y = df['quality']
                
    max_num_features = len(X.columns)
    
                
    knns = []
    train_accs = []
    test_accs = []
    features = []
    
    data = {}
                
    c = 0
    for i in range(0,T):
                metric = metrics[i]
                perm_features = list(X.columns)
                random.shuffle(perm_features)
                for num_features in range(2,max_num_features):
                    X_i = X.iloc[:,: num_features]
                    X_train, X_test, y_train, y_test = train_test_split(X_i, y, test_size = 0.20, 
                                                    random_state=1, stratify=y)
                    for neighbors in range(1,10):
                        knn=KNeighborsClassifier(n_neighbors=neighbors, metric=metric)
                        knn.fit(X_train, y_train)
                        train_accs.append(knn.score(X_train,y_train))
                        test_accs.append(knn.score(X_test,y_test))
                        knns.append(knn)
                        data[c] = [knn, perm_features[: num_features], metric, neighbors, num_features,
                                     knn.score(X_train,y_train),knn.score(X_test,y_test)]
                        c=c+1
    return data

In [96]:
data = run()

In [97]:
results_df = pd.DataFrame(data).T
results_df.columns = ['KNN', 'Predictors', 'Metric','Number Neighbors', 'Number Predictors', 'Train Accuracy', 'Test Accuracy']
results_df.head()

Unnamed: 0,KNN,Predictors,Metric,Number Neighbors,Number Predictors,Train Accuracy,Test Accuracy
0,"KNeighborsClassifier(metric='euclidean', n_nei...","[sulphates, volatile acidity]",euclidean,1,2,0.927287,0.528125
1,"KNeighborsClassifier(metric='euclidean', n_nei...","[sulphates, volatile acidity]",euclidean,2,2,0.750586,0.515625
2,"KNeighborsClassifier(metric='euclidean', n_nei...","[sulphates, volatile acidity]",euclidean,3,2,0.713839,0.5125
3,"KNeighborsClassifier(metric='euclidean', n_nei...","[sulphates, volatile acidity]",euclidean,4,2,0.670837,0.50625
4,KNeighborsClassifier(metric='euclidean'),"[sulphates, volatile acidity]",euclidean,5,2,0.652072,0.50625


In [98]:
somewhat_successful = results_df[results_df['Test Accuracy']>.6]
somewhat_successful

Unnamed: 0,KNN,Predictors,Metric,Number Neighbors,Number Predictors,Train Accuracy,Test Accuracy
18,"KNeighborsClassifier(metric='euclidean', n_nei...","[sulphates, volatile acidity, density, alcohol]",euclidean,1,4,1.0,0.60625
99,KNeighborsClassifier(n_neighbors=1),"[citric acid, density, pH, total sulfur dioxide]",minkowski,1,4,1.0,0.60625
207,"KNeighborsClassifier(metric='cosine', n_neighb...","[alcohol, residual sugar, pH, sulphates, fixed...",cosine,1,7,1.0,0.603125
216,"KNeighborsClassifier(metric='cosine', n_neighb...","[alcohol, residual sugar, pH, sulphates, fixed...",cosine,1,8,1.0,0.615625
225,"KNeighborsClassifier(metric='cosine', n_neighb...","[alcohol, residual sugar, pH, sulphates, fixed...",cosine,1,9,1.0,0.61875
234,"KNeighborsClassifier(metric='cosine', n_neighb...","[alcohol, residual sugar, pH, sulphates, fixed...",cosine,1,10,1.0,0.625
351,"KNeighborsClassifier(metric='manhattan', n_nei...","[chlorides, volatile acidity, sulphates, pH, f...",manhattan,1,5,1.0,0.60625


The cosine metric has performed slightly better (more consistent and higher maximum accuracy scores) than the other metrics with this data. The number of neighbors that perform best consistently is 1, with the number of predictors varying. A maxium of 62.5% test-accuracy was achieved a few times with the consine metric, 1 neighbor, and all 10 predictors.