## k-nearest neighbors (*kNN*) classifier implemented from scratch

---

#### note:

- not as efficient as scikit-learn's implementation (training + validation on ~4000 data points takes about 7 minutes)
- simply an exercise to build an intuitive understanding of the algorithm

---

import packages for data manipulation and matrix operations

---

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
import time

### data read-in

- data covers loan applications to a bank
- each row is a customer
- features are demographic and banking info on a customer
- target is whether the loan is approved (Personal Loan = 1) or not (Personal Loan = 0)

---

In [2]:
knn_data = pd.read_excel('knndata.xlsx')

print('')
print('Dataset has {} attributes on {} banking customers applying for loans'.format(knn_data.shape[1], knn_data.shape[0]))
print('')


Dataset has 14 attributes on 5000 banking customers applying for loans



In [3]:
knn_data.head()

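| Unnamed: 0 | Age | Experience | Income | Family | CCAvg | UG | GR | ADV | Mortgage | Securities Account | CD Account | Online | CreditCard | Personal Loan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 4 | 1.6 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 45 | 19 | 34 | 3 | 1.5 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 39 | 15 | 11 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9 | 100 | 1 | 2.7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 8 | 45 | 4 | 1.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |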


### data preparation for modeling

randomly split the data into training (50%), validation (37.5%), and test (12.5%) subsets

---

In [42]:
train, testing = train_test_split(knn_data, test_size=0.5)
validate, test = train_test_split(testing)

print('Training with {} records'.format(train.shape[0]))
print('Validating with {} records'.format(validate.shape[0]))
print('Testing with {} records'.format(test.shape[0]))

Training with 2500 records
Validating with 1875 records
Testing with 625 records
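
The split above is not seeded, so the three subsets change on every run. A minimal sketch of a reproducible version follows; the stand-in frame `toy` and the seed value 42 are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical stand-in for knn_data: 5000 rows, one feature, binary target
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Income': rng.integers(8, 225, size=5000),
    'Personal Loan': rng.integers(0, 2, size=5000),
})

# 50% for training, then split the remainder 75/25 into validation and test
train, rest = train_test_split(toy, test_size=0.5, random_state=42)
validate, test = train_test_split(rest, test_size=0.25, random_state=42)

split_sizes = (train.shape[0], validate.shape[0], test.shape[0])
print(split_sizes)
```

Fixing `random_state` makes the error rates comparable between runs, which helps when tuning *k*.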


### important preprocessing step

- because of how the distance metric is computed, the magnitude of a feature (large numbers vs small numbers) affects how close or far apart observations are
- to reduce this effect, we fit the preprocessing step on the training set only, then apply the same transformation to the validation and test sets
- transformation options include normalizing, standardizing, etc.; the cell below uses scikit-learn's `Normalizer`, which rescales each row (customer) to unit length

In [43]:
# separate the features from the target
train_feats = train.drop('Personal Loan', axis=1)
validate_feats = validate.drop('Personal Loan', axis=1)
test_feats = test.drop('Personal Loan', axis=1)

from sklearn.preprocessing import Normalizer

# fit on the training features, then apply the same transform to validation and test
norm = Normalizer()
train_feats = norm.fit_transform(train_feats)
validate_feats = norm.transform(validate_feats)
test_feats = norm.transform(test_feats)

# reset the row index so the labels line up with the transformed arrays
train = train.reset_index(drop=True)
validate = validate.reset_index(drop=True)
test = test.reset_index(drop=True)

# rebuild dataframes: transformed features plus the original target
knn_train = pd.DataFrame(data=train_feats, columns=knn_data.columns[:-1])
knn_train['Personal Loan'] = train['Personal Loan']

knn_validate = pd.DataFrame(data=validate_feats, columns=knn_data.columns[:-1])
knn_validate['Personal Loan'] = validate['Personal Loan']

knn_test = pd.DataFrame(data=test_feats, columns=knn_data.columns[:-1])
knn_test['Personal Loan'] = test['Personal Loan']
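
To make the scaling point concrete, here is a small sketch on made-up numbers (the two-column arrays are hypothetical). It contrasts `Normalizer`, which rescales each row to unit length, with `StandardScaler`, which rescales each column. Which one suits a given distance metric is a judgment call, so treat this as an illustration rather than a recommendation.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# hypothetical customers: [income in $1000s, family size]
X_train = np.array([[49.0, 4.0], [34.0, 3.0], [100.0, 1.0], [45.0, 4.0]])
X_new = np.array([[60.0, 2.0]])

# Normalizer works row by row: every customer vector ends up with L2 norm 1
row_scaled = Normalizer().fit_transform(X_train)
row_norms = np.linalg.norm(row_scaled, axis=1)

# StandardScaler works column by column: fit on the training data only,
# then reuse the learned mean and standard deviation on new data
scaler = StandardScaler().fit(X_train)
col_scaled = scaler.transform(X_train)
new_scaled = scaler.transform(X_new)

print(row_norms)
print(col_scaled.mean(axis=0), col_scaled.std(axis=0))
```

After column-wise standardization, income and family size contribute on a comparable scale. After row-wise normalization, the larger feature still dominates the direction of each vector.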

### searching for the best *k* neighbors

- the kNN classifier is an intuitive model that is simple to explain
- for a given customer, look at the *k* most similar records (customers in this case); the most common loan decision among those neighbors is the predicted loan decision for that customer
- "similar" records are found using a distance or similarity metric (cosine similarity in this case)
- but how do we know what *k* to use?
- run a grid search over the number of neighbors and select the value that performs best on the validation set
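
The voting idea above can be sketched on a handful of made-up points. The helper `knn_predict` and the toy arrays are hypothetical, and the sketch compares feature vectors only.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def knn_predict(X_train, y_train, x_new, k):
    """Predict the majority label among the k training rows most similar to x_new."""
    # similarity of the new row to every training row (features only)
    sims = cosine_similarity(x_new.reshape(1, -1), X_train)[0]
    # indices of the k largest similarities
    nearest = np.argsort(sims)[-k:]
    # majority vote; bincount + argmax breaks ties toward the smaller label
    return int(np.bincount(y_train[nearest]).argmax())

# toy training set: three points near the x-axis (label 0), two near the y-axis (label 1)
X_train = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([1.0, 0.05]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([0.05, 1.0]), k=3)
print(pred_a, pred_b)
```

The first query points in nearly the same direction as the label-0 rows, so all three of its neighbors vote 0. The second has two label-1 neighbors and one label-0 neighbor, so the vote is 1.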

----------------------------------

In [45]:
neys = range(2, 5)

# initialize best params + metrics
best_train_k = None
best_val_k = None
best_train_misclassification = np.inf
best_val_misclassification = np.inf

### building the *kNN* classifier model 

- for the training predictions, we compute the similarity between all pairs of training records; each record's top match (the record itself) is skipped
- for the validation predictions, we use ***all training*** records and ***one record at a time*** from the validation set
- so take a customer in the validation set, get the similarity with all customers in the training set, and pick the *k* most similar; then repeat for the next customer in the validation set, and so on
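
The validation step described above can also be written without a per-record loop, by comparing every validation row against every training row in a single call. This is a sketch under stated assumptions, not the notebook's method: `misclassification_by_k` and `make_toy` are hypothetical helpers, the data is synthetic, labels are assumed to be binary 0/1, and only feature columns enter the similarity.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def misclassification_by_k(X_train, y_train, X_val, y_val, ks):
    """Return {k: validation misclassification rate} for binary 0/1 labels."""
    # one (n_val, n_train) matrix: row i compares validation record i
    # with every training record
    sims = cosine_similarity(X_val, X_train)
    # per validation record, training indices sorted from least to most similar
    order = np.argsort(sims, axis=1)
    rates = {}
    for k in ks:
        neighbor_labels = y_train[order[:, -k:]]  # shape (n_val, k)
        # majority vote; an exact tie goes to label 0
        preds = (neighbor_labels.mean(axis=1) > 0.5).astype(int)
        rates[k] = float(np.mean(preds != y_val))
    return rates

def make_toy(n_per_class, rng):
    """Hypothetical data: class 0 scattered around [5, 1], class 1 around [1, 5]."""
    X0 = np.array([5.0, 1.0]) + rng.normal(scale=0.3, size=(n_per_class, 2))
    X1 = np.array([1.0, 5.0]) + rng.normal(scale=0.3, size=(n_per_class, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = make_toy(50, rng)
X_val, y_val = make_toy(20, rng)

rates = misclassification_by_k(X_train, y_train, X_val, y_val, ks=range(2, 5))
best_k = min(rates, key=rates.get)  # first k with the lowest rate
print(rates, best_k)
```

Because the similarity matrix is computed once and reused for every *k*, the cost of the grid search is dominated by a single `cosine_similarity` call and one sort.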

-----------------------------------

In [46]:
# track how long it takes for the modeling 
start = time.time()

for k in neys:
    
    # training predictions
    preds = []
    actual = []
    i = 0

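    # get similarity matrix of all docs
    # note: knn_train includes the 'Personal Loan' column, so the label itself
    # enters the similarity and the training error is likely optimistic
    knn_sim = cosine_similarity(knn_train)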

    # for each doc in training df
    for vec in knn_sim:
        # get indices of "k" most-similar docs
        knn_indices = np.argsort(vec)[-k-1:-1]
        # get prediction from mode of labels of "k" most-similar docs (ties go to the smaller label)
        pred = knn_train.iloc[knn_indices]['Personal Loan'].mode().iloc[0]
        # append prediction and actual labels to list
        preds.append(pred)
        actual.append(int(knn_train.iloc[i]['Personal Loan']))
        i += 1

    # compare training predictions to actual labels and get error metric
    misclassification_rate = np.abs((np.array(actual) - np.array(preds))).sum() / len(knn_sim)
    
    if misclassification_rate < best_train_misclassification:
        best_train_misclassification = misclassification_rate
        best_train_k = k

    # validation predictions
    preds = []
    actual = []

    # for each new doc
    for i in range(knn_validate.shape[0]):
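        # stack the new doc under the training docs
        # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
        knn_validation = pd.concat([knn_train, knn_validate.iloc[[i]]])
        knn_validation.reset_index(inplace=True)
        # get similarity matrix of new doc w/ training docs
        # note: this frame carries the label and the old index as columns, so both enter the similarity
        knn_sim = cosine_similarity(knn_validation)
        # get indices of "k" most-similar docs (the last row is the new doc; skip its match with itself)
        knn_indices = np.argsort(knn_sim[-1])[-k-1:-1]
        # get prediction from mode of labels of "k" most-similar docs (ties go to the smaller label)
        pred = knn_validation.iloc[knn_indices]['Personal Loan'].mode().iloc[0]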
        # append prediction and actual labels to list
        preds.append(pred)
        actual.append(int(knn_validate.iloc[i]['Personal Loan']))

    # compare validation predictions to actual labels and get error metric
    misclassification_rate = np.abs((np.array(actual) - np.array(preds))).sum() / knn_validate.shape[0]
    
    if misclassification_rate < best_val_misclassification:
        best_val_misclassification = misclassification_rate
        best_val_k = k
    
end = time.time()    

print('Modeling Time : {:.1f} minutes'.format((end - start) / 60))
print('')
print('Best-k Training : {}'.format(best_train_k))
print('Best-k Training Misclassification Rate : {:.1f}%'.format(best_train_misclassification * 100))
print('')
print('Best-k Validation : {}'.format(best_val_k ))
print('Best-k Validation Misclassification Rate : {:.1f}%'.format(best_val_misclassification * 100))

Modeling Time : 5.4 minutes

Best-k Training : 2
Best-k Training Misclassification Rate : 0.0%

Best-k Validation : 2
Best-k Validation Misclassification Rate : 0.1%


### making predictions on new loan applications

- same logic as in the validation step explained earlier
- the value of *k* used for predictions is the number of neighbors with the lowest misclassification rate on the validation set
- the test set misclassification rate shows how the model performs on data it has not seen
- if this is significantly higher than the validation misclassification rate, the model is likely overfitting (try increasing the number of neighbors)
- in actual model deployment, only the predictions are available at the time the model is used
- it is good practice to evaluate the model once the actual labels are obtained, then iterate and improve it
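
One hedged way to sanity-check from-scratch numbers is to run scikit-learn's `KNeighborsClassifier` with the cosine metric and compare the validation and test error rates. Everything below is illustrative: the data is synthetic, `make_toy` is a hypothetical helper, and `best_k = 3` is an arbitrary stand-in for the value chosen on the validation set.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def make_toy(n_per_class, rng):
    """Hypothetical data: class 0 scattered around [5, 1], class 1 around [1, 5]."""
    X0 = np.array([5.0, 1.0]) + rng.normal(scale=0.3, size=(n_per_class, 2))
    X1 = np.array([1.0, 5.0]) + rng.normal(scale=0.3, size=(n_per_class, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

rng = np.random.default_rng(1)
X_train, y_train = make_toy(50, rng)
X_val, y_val = make_toy(20, rng)
X_test, y_test = make_toy(20, rng)

best_k = 3  # stand-in for the k selected on the validation set

# brute-force neighbor search with cosine distance (1 - cosine similarity)
model = KNeighborsClassifier(n_neighbors=best_k, metric='cosine', algorithm='brute')
model.fit(X_train, y_train)

# score() returns accuracy, so the misclassification rate is 1 - accuracy
val_error = 1 - model.score(X_val, y_val)
test_error = 1 - model.score(X_test, y_test)
gap = test_error - val_error
print('validation: {:.1%}, test: {:.1%}, gap: {:.1%}'.format(val_error, test_error, gap))
```

A test error well above the validation error is the overfitting signal described in the list above.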

---------------------------

In [47]:
# testing predictions
preds = []
actual = []

start = time.time()
# for each new doc
for i in range(knn_test.shape[0]):
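    # stack the new doc under the training docs
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    knn_testing = pd.concat([knn_train, knn_test.iloc[[i]]])
    knn_testing.reset_index(inplace=True)
    # get similarity matrix of new doc w/ training docs
    # note: this frame carries the label and the old index as columns, so both enter the similarity
    knn_sim = cosine_similarity(knn_testing)
    # get indices of "k" most-similar docs (the last row is the new doc; skip its match with itself)
    knn_indices = np.argsort(knn_sim[-1])[-best_val_k-1:-1]
    # get prediction from mode of labels of "k" most-similar docs (ties go to the smaller label)
    pred = knn_testing.iloc[knn_indices]['Personal Loan'].mode().iloc[0]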
    # append prediction and actual labels to list
    preds.append(pred)
    actual.append(int(knn_test.iloc[i]['Personal Loan']))
end = time.time()

# compare testing predictions to actual labels and get error metric
test_misclassification_rate = np.abs((np.array(actual) - np.array(preds))).sum() / knn_test.shape[0]

print('Testing Time : {:.1f} minutes'.format((end - start) / 60))
print('')
print('Test Misclassification Rate : {:.1f}%'.format(test_misclassification_rate * 100))

Testing Time : 0.6 minutes

Test Misclassification Rate : 0.0%


### other stuff

if you are reading this from a different source, check out my other micro-projects on my [GitHub](https://www.github.com/olaadapo), and/or articles on [Medium](https://medium.com/@adeoyewole)