# kNN

We will continue with instance-based methods. These algorithms do NOT compute a
probability distribution.

kNN works as follows:

Compute the distance between the given instance and all the known instances.
Sort by distance and assign the most frequent class among the k nearest neighbors.
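
The steps above can be sketched on a toy dataset as follows. This is a minimal, standard-library-only illustration; the function name `knn_classify` and the toy points are assumptions made for the example and are not part of the notebook's own implementation.

```python
import math
from collections import Counter

def knn_classify(query, points, labels, k):
    """Classify `query` by majority vote among its k nearest points (hypothetical helper)."""
    # Step 1: distance from the query to every known instance
    dists = [math.dist(query, p) for p in points]
    # Step 2: indices of the k smallest distances
    nearest = sorted(range(len(points)), key=lambda i: dists[i])[:k]
    # Step 3: most frequent class among those neighbors
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two small clusters
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_classify((0.5, 0.5), points, labels, k=3))
print(knn_classify((5.5, 5.5), points, labels, k=3))
```

With two classes, an odd k avoids tied votes.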

In [32]:
import math
import pandas as pd
from sklearn.model_selection import train_test_split

In [18]:
def euclidean_distance(a, b):
    d = [(x - y)**2 for x, y in zip(a, b)]
    d = math.sqrt(sum(d))
    return d

Let us write another function that iterates over the test set and,
for each observation in the test set, over the training set.

In [19]:
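def knn_predict2(test_data, train_data, k_value, labelcol):
    # Split the training set into labels and features
    eu_char = train_data[labelcol]
    b = train_data.drop(columns=[labelcol])
    # Drop the label from the test set so only features enter the distance
    test_features = test_data.drop(columns=[labelcol], errors="ignore")
    pred = []
    for i in range(len(test_features)):
        a = test_features.iloc[i]
        # Distance from this test observation to every training observation
        eu_dist = b.apply(lambda x: euclidean_distance(a, x), axis=1)
        # Keep the labels of the k nearest training observations
        eu = pd.DataFrame({"eu_char": eu_char, "eu_dist": eu_dist})
        eu = eu.sort_values("eu_dist").head(k_value)
        # Majority vote: the most frequent label among the k neighbors
        prediction = eu["eu_char"].value_counts().idxmax()
        pred.append(prediction)

    return pred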

Let us also write a utility function for accuracy.

In [87]:
def accuracy(test_data, labelcol, predcol):
    correct = 0
    test_data = test_data[[labelcol]].reset_index(drop=True)
    for i, row in test_data.iterrows():
        if row[labelcol] == predcol[i]:
            correct += 1
    accu = (correct / len(test_data.index)) * 100  
    return(accu)

# Here's a faster and, I find, clearer implementation
def accuracy2(y, yhat):
    is_correct = y == yhat
    n_correct = len(y[is_correct].index)
    accu = (n_correct / len(y.index)) * 100  
    return(accu)

Now let us load the heart dataset and prepare the training/test sets.

We will then call the predict function we wrote above.

In [88]:
#load data
heart = pd.read_csv("heart.csv")
heart.head()

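| Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |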


In [89]:
knn_df = heart[["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal", "target"]]

<span style="color:red">**NOTE: The RMD file used set.seed and a selection process to choose the training/test split. I will show how this could be done in Python but will ultimately use a list of the same observations so that the model results are consistent.**</span>

In [90]:
# Here's how to split the data in Python (shown for reference; not used below)
X = knn_df.loc[:, knn_df.columns != "target"]
Y = knn_df["target"]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.30, random_state = 43)

# Here are the rows selected by the process in R (1-based row numbers)
r_tstidx = [44, 296, 196, 149, 66, 261, 216, 7, 167, 130, 270, 225, 180, 147, 173, 22, 78, 228, 259, 110, 161, 142, 297, 72, 301, 39, 249, 274, 81, 121, 286, 98, 55, 230, 263, 120, 102, 291, 49, 245, 128, 166, 40, 154, 89, 218, 43, 165, 85, 184, 158, 303, 150, 169, 170, 67, 99, 9, 21, 205, 123, 209, 300, 234, 281, 5, 69, 268, 74, 93, 236, 30, 254, 279, 45, 176, 90, 233, 37, 103, 38, 191, 220, 194, 285, 217, 80, 115, 88, 299, 127, 227, 293, 145, 248, 298, 104, 179, 117, 160, 61, 174, 70, 221, 182, 140, 256, 146, 253, 136, 290, 222, 24, 125, 143, 212, 292, 276, 239, 34, 244, 219, 157, 162, 294, 168, 151, 114, 52, 210, 11, 42, 97, 163, 266, 269, 264, 108, 208, 172, 175, 107, 64, 27, 201, 211, 8, 18, 116, 59, 273, 251, 159, 181, 229, 223, 10, 1, 153, 189, 129, 156, 25, 283, 284, 235, 101, 12, 33, 92, 91, 152, 71, 133, 141, 77, 131, 250, 135, 195, 207, 242, 54, 109, 185, 126, 198, 111, 271, 76, 86, 105, 203, 164, 206, 199, 275, 202, 272, 118, 31, 139, 200, 14, 257, 2, 37, 138, 19, 94, 302, 3]
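# R indices are 1-based; shift them to the 0-based pandas index
python_tstidx = [x - 1 for x in r_tstidx]
# Boolean mask: True for rows in the R-selected list, which form the training set
trdidx = knn_df.index.isin(python_tstidx)
train_df = knn_df[trdidx]
test_df = knn_df[~trdidx]
K = 3
predictions = knn_predict2(test_df, train_df, K, "target")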

Computing ROC curves and AUC is somewhat non-trivial because kNN does not compute
class probabilities, so the resulting ROC plots are unreliable.
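
To illustrate why, one common workaround (an assumption here, not something this notebook does) is to use the fraction of the k neighbors that vote for the positive class as a score. The sketch below, with a hypothetical helper `knn_vote_fraction` and made-up one-dimensional data, suggests that such a score can take at most k + 1 distinct values, so an ROC curve built from it has only a handful of thresholds.

```python
def knn_vote_fraction(query, xs, ys, k):
    """Fraction of the k nearest neighbors labeled 1 (hypothetical helper)."""
    # Sort training indices by absolute distance to the query (1-D data)
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - query))
    return sum(ys[i] for i in order[:k]) / k

# Made-up 1-D training data with binary labels
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

k = 3
# Score 71 query points spread evenly over [0, 7]
scores = [knn_vote_fraction(q / 10, xs, ys, k) for q in range(0, 71)]
# However many queries are scored, only multiples of 1/k can appear
print(sorted(set(scores)))
```

With k = 3 the scores are limited to 0, 1/3, 2/3 and 1, which is why the curve looks like a few coarse steps rather than a smooth trade-off.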

In [92]:
print('accuracy=' + str(accuracy(test_df, "target", predictions)))
# or
# print('accuracy=' + str(accuracy2(test_df["target"], predictions)))

accuracy=85.86956521739131


