# K-Nearest Neighbors

- K-nearest neighbors (K-NN) is a classification (or regression) algorithm that uses the K nearest training points to determine the class (or value) of a new data point.

- K-Nearest Neighbors (KNN) is a simple yet effective machine learning algorithm that can be used for both classification and regression problems. It is a non-parametric, instance-based method: no explicit model is built, and the training data points are simply stored in memory. A prediction is made by finding the "K" training points nearest to a given test point. For classification, the prediction is the majority label among those "K" neighbors; for regression, it is typically the average of their values. The value of "K" is a hyperparameter that can be selected using techniques such as cross-validation.

- A real-world example of using KNN is a recommendation system. For example, a movie recommendation system could use the KNN algorithm to recommend movies to users based on their past viewing history. The system would store the viewing history of all users and the ratings they gave to each movie. When a user wants a recommendation, the system would find the "K" users with the most similar viewing history and recommend the movies that those "K" nearest users rated highly.

- The key mathematical component of the KNN algorithm is the distance metric used to determine the proximity of two data points. The most commonly used distance metric is the Euclidean distance, which is calculated as the square root of the sum of the squared differences between the corresponding features of two data points.
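- As a formula, the Euclidean distance between two points $x$ and $y$ with $n$ features is $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$. The snippet below is a minimal sketch of that formula; the function name `euclidean_distance` and the points `p` and `q` are made up for illustration.

```python
import numpy as np

def euclidean_distance(a, b):
    """Square root of the sum of squared feature differences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

# illustrative points: differences are 3 and 4, so the distance is 5
p = [1.0, 2.0]
q = [4.0, 6.0]
d = euclidean_distance(p, q)
print(f"DISTANCE: {d}")  # DISTANCE: 5.0

# np.linalg.norm of the difference vector gives the same value
print(np.isclose(d, np.linalg.norm(np.array(p) - np.array(q))))  # True
```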


# Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

# K-Nearest - Code

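- In `_KNN` below, the list of `(distance, label)` tuples is sorted with `key=lambda x: x[0]`. The syntax of a lambda function is `lambda arguments: expression`. In this case, the argument `x` is a tuple, and the expression `x[0]` returns its first element (the distance), so the list is sorted from the closest training point to the farthest.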

In [None]:
class KNEARESTCODE(object):
    def __init__(self):
        self.__xtrain = np.random.randn(100,2)*0.5+[2,2]
        self.__ytrain = np.zeros(100)
        self.__xtrain = np.concatenate((self.__xtrain,
                                       np.random.randn(100,2)*0.5+[-2,-2]))
        self.__ytrain = np.concatenate((self.__ytrain,
                                       np.ones(100)))
        self.__xtest = np.random.randn(20,2)*0.5+[0,0]
        self.__knear = 3
    def __str__(self):
        return "K-NEAREST CODE - View"
    def __call__(self):
        return None
    def __getstate__(self):
        raise TypeError("[DENIED]")
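    def __repr__(self):
        # the class has no docstring, so returning __doc__ would give None
        # and make repr() fail; return a descriptive string instead
        return f"KNEARESTCODE(k={self.__knear})"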
    def _EUCLIDEAN_DISTANCE(self,x1,x2):
        return np.sqrt(np.sum((x1-x2)**2))
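    def _KNN(self,xtrain,ytrain,xtest,k):
        # collect a (distance, label) pair for every training point
        distances = []
        for i in range(xtrain.shape[0]):
            dis = self._EUCLIDEAN_DISTANCE(xtest,xtrain[i,:])
            distances.append((dis,ytrain[i]))
        # sort by distance (first tuple element), closest first
        distances = sorted(distances,key=lambda x: x[0])
        # keep the labels of the k closest training points
        neighbors = np.asarray(distances)[:k,1]
        # majority vote: the most frequent label wins
        (val,cnt) = np.unique(neighbors,return_counts=True)
        index = np.argmax(cnt)
        return val[index]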
    def _PLOT(self):
        plt.figure(figsize=(15,8))
        plt.scatter(self.__xtrain[:100,0],self.__xtrain[:100,1],color="blue")
        plt.scatter(self.__xtrain[100:,0],self.__xtrain[100:,1],color="red")
        plt.scatter(self.__xtest[:,0],self.__xtest[:,1],color="green",marker="x")
        for ix in range(self.__xtest.shape[0]):
            ind = self._KNN(self.__xtrain,self.__ytrain,self.__xtest[ix,:],self.__knear)
            clr = "purple" if ind == 0.0 else "black"
            plt.scatter(self.__xtest[ix,0],self.__xtest[ix,1],color=clr,marker="o",s=100)
        plt.show()

In [None]:
KNEARESTCODE()._PLOT()
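
- The loop over training points in `_KNN` can also be written with NumPy broadcasting. The snippet below is a sketch of that alternative, not a replacement for the class above; the function name `knn_predict` and the seeded generator `rng` are assumptions made for this example. On a tie, `np.unique` returns labels in sorted order and `np.argmax` picks the first maximum, so the smallest label wins.

```python
import numpy as np

def knn_predict(xtrain, ytrain, xquery, k=3):
    # distances from the query point to every training point at once
    distances = np.sqrt(np.sum((xtrain - xquery) ** 2, axis=1))
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # majority vote among the labels of those points
    values, counts = np.unique(ytrain[nearest], return_counts=True)
    return values[np.argmax(counts)]

# two well-separated clusters, laid out like the training data in the class above
rng = np.random.default_rng(0)
xtrain = np.concatenate((rng.normal(2, 0.5, (100, 2)),
                         rng.normal(-2, 0.5, (100, 2))))
ytrain = np.concatenate((np.zeros(100), np.ones(100)))

print(knn_predict(xtrain, ytrain, np.array([2.0, 2.0])))    # 0.0
print(knn_predict(xtrain, ytrain, np.array([-2.0, -2.0])))  # 1.0
```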

# K-Nearest - Topic Example

In [None]:
x,y = load_iris(return_X_y=True)
xtrain,xtest,ytrain,ytest = train_test_split(x,y,test_size=0.3,random_state=0)

In [None]:
clf = KNeighborsClassifier()
# fit on the training split only; fitting on x,y would leak the test samples
clf.fit(xtrain,ytrain)

In [None]:
ypred = clf.predict(xtest)

In [None]:
print(f"BASIC ACCURACY: {100*(ypred==ytest).sum()/ytest.shape[0]}")
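
- The classifier above uses the default number of neighbors (`n_neighbors=5`). As noted in the introduction, "K" can be chosen with cross-validation. The snippet below is one possible sketch using `cross_val_score` on the training split; the candidate list `candidate_k` and the 5-fold setting are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, random_state=0)

# hypothetical candidate values for K
candidate_k = [1, 3, 5, 7, 9, 11]
mean_scores = {}
for k in candidate_k:
    # 5-fold cross-validated accuracy, computed on the training split only
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), xtrain, ytrain, cv=5)
    mean_scores[k] = scores.mean()

# K with the highest mean cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(f"CV ACCURACY PER K: {mean_scores}")
print(f"BEST K: {best_k}")
```

- The test split is not touched during the search, so it can still be used for a final accuracy check with the selected "K".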

# K-Nearest - Real Example

- The "bank marketing" dataset is a real-world dataset from a Portuguese banking institution. The data was collected during a telemarketing campaign conducted to sell bank term deposits. The prediction task is to determine whether a client will subscribe to a term deposit, based on features such as age, job, marital status, education, etc.

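- The full dataset consists of 41,188 instances, each representing a client contact, and 20 input features such as age, job, marital status, education, contact duration, etc. The file name used below, bank-additional.csv, normally denotes the 10% random sample of that data (4,119 instances); the data shape printed further down shows which version was actually loaded. The target variable, y, is binary and indicates whether or not the client subscribed to a term deposit. The dataset is widely used in machine learning research for benchmarking and testing algorithms for binary classification problems.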

In [None]:
dataurl = "/kaggle/input/aidl-education-set/bank-additional.csv"

In [None]:
dataset = pd.read_csv(dataurl,sep=";")

In [None]:
dataset.head()

In [None]:
print(f"NULL CONTROL:\n\n{dataset.isnull().sum()}")

In [None]:
x = dataset[["age",
             "duration",
             "campaign",
             "pdays",
             "previous",
             "emp.var.rate",
             "cons.price.idx",
             "cons.conf.idx",
             "euribor3m",
             "nr.employed"]]
y = dataset["y"]

In [None]:
print(f"DATA SHAPE: {x.shape}")
print(f"TARGET SHAPE: {y.shape}")
print(f"CLASSES: {np.unique(y)}")

In [None]:
xtrain,xtest,ytrain,ytest = train_test_split(x,y,test_size=0.3,random_state=0)

In [None]:
le = LabelEncoder()
ytrain = le.fit_transform(ytrain)
ytest = le.transform(ytest)

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(xtrain,ytrain)

In [None]:
ypred = knn.predict(xtest)

In [None]:
print(f"PREDICTION: {ypred}")

In [None]:
accuracy = 100*(ypred==ytest).sum()/ytest.shape[0]
print(f"BASIC ACCURACY: {accuracy}")
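
- Because KNN relies on distances, features with large numeric ranges (here, for example, `duration` or `nr.employed` compared with `emp.var.rate`) can dominate the result. A common remedy is to standardize the features before fitting. The snippet below is a sketch of that idea with `StandardScaler` in a pipeline; it uses scikit-learn's built-in wine data as a stand-in, since the bank CSV path depends on the environment. Whether scaling helps on a given dataset should be checked rather than assumed.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# stand-in data whose features live on very different scales
x, y = load_wine(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, random_state=0)

# KNN on the raw features
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(xtrain, ytrain)

# KNN on standardized features; the scaler is fitted on the training split only
scaled_knn = make_pipeline(StandardScaler(),
                           KNeighborsClassifier(n_neighbors=5)).fit(xtrain, ytrain)

raw_accuracy = raw_knn.score(xtest, ytest)
scaled_accuracy = scaled_knn.score(xtest, ytest)
print(f"RAW ACCURACY: {100*raw_accuracy}")
print(f"SCALED ACCURACY: {100*scaled_accuracy}")
```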

In [None]:
test_confusion_matrix = confusion_matrix(ytest,ypred)
plt.figure(figsize=(15,8))
plt.imshow(test_confusion_matrix,cmap="Blues",interpolation="nearest")
plt.title("CONFUSION MATRIX FOR TEST")
plt.xticks([0,1],["No","Yes"])
plt.yticks([0,1],["No","Yes"])
plt.xlabel("Prediction")
plt.ylabel("True Label")
plt.colorbar()
for i in range(test_confusion_matrix.shape[0]):
    for j in range(test_confusion_matrix.shape[1]):
        plt.text(j,i,test_confusion_matrix[i,j],
                ha="center",
                va="center",
                color="black")
plt.tight_layout()
plt.show()
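
- If one class is much rarer than the other, as is common in marketing response data, accuracy alone can look good even when few positive cases are found. Precision and recall can be read directly from the confusion matrix. The snippet below is a small sketch on made-up labels (`ytrue_demo` and `ypred_demo` are invented for this example, not taken from the bank data).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# made-up labels: 8 negatives and 2 positives
ytrue_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
ypred_demo = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# rows are true labels, columns are predictions
cm = confusion_matrix(ytrue_demo, ypred_demo)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # share of predicted "Yes" that are truly "Yes"
recall = tp / (tp + fn)     # share of true "Yes" that were found

print(f"ACCURACY: {accuracy}")    # ACCURACY: 0.8
print(f"PRECISION: {precision}")  # PRECISION: 0.5
print(f"RECALL: {recall}")        # RECALL: 0.5
```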

In [None]:
plt.figure(figsize=(15,8))
plt.scatter(range(len(ytest)),ytest,color="blue",label="ACTUAL")
plt.scatter(range(len(ytest)),ypred,color="red",label="PREDICTED")
plt.xlabel("INDEX")
plt.ylabel("TEST-PREDICTION")
plt.title("KNN ON TEST SET")
plt.legend()
plt.tight_layout()
plt.show()

# PARAMETERS EXAMPLE

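- `np.asarray` converts its input (for example a list, or a list of tuples) to a NumPy array. In `_KNN` it turns the sorted list of `(distance, label)` tuples into a 2-D array so that the labels can be selected with array slicing.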

In [None]:
exp_list = [1,2,3,4,5]
print(f"ARRAY OUTPUT: {np.asarray(exp_list)}")
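
- The same call applied to a list of tuples, as in `_KNN`, gives a 2-D array whose columns can be sliced. The values in `exp_distances` below are made up for illustration.

```python
import numpy as np

# made-up (distance, label) pairs
exp_distances = [(0.5, 1.0), (0.2, 0.0), (0.9, 1.0), (0.1, 0.0)]

# sort by distance, closest first
exp_distances = sorted(exp_distances, key=lambda x: x[0])

# rows :3 are the 3 closest points, column 1 holds their labels
exp_neighbors = np.asarray(exp_distances)[:3, 1]
print(f"NEIGHBOR LABELS: {exp_neighbors}")  # NEIGHBOR LABELS: [0. 0. 1.]
```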
