# k-Nearest Neighbors

The k-Nearest Neighbors (k-NN) algorithm is a simple and popular supervised machine learning algorithm. It is a form of instance-based (or lazy) learning: it does not build a model during training but instead stores the entire training dataset in memory. To predict for a new, unseen data point, k-NN finds the k training points nearest to it and derives the prediction from their labels. The procedure has three steps:

1. Data Preparation:
- Gather a labeled dataset containing input data points and their corresponding class labels (for classification) or target values (for regression).
- Define a distance metric (e.g., Euclidean distance, Manhattan distance) to measure how far apart two data points are. Euclidean distance is a common default and works well when the features are on comparable scales.
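
As a minimal sketch of the two metrics named above, the functions below compute Euclidean and Manhattan distance with NumPy. The function names are illustrative only and are not part of scikit-learn.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two feature vectors."""
    # Cast to float so unsigned integer inputs (such as uint8 pixels)
    # do not wrap around on subtraction.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b)))

p = [0.0, 0.0]
q = [3.0, 4.0]
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7.0
```

The two metrics can rank neighbors differently, so the choice of metric is part of the model and not just an implementation detail.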

2. Choosing a Value for 'k':
- Select a positive integer value for 'k', which represents the number of nearest neighbors to consider when making predictions.
- The choice of 'k' can significantly affect the algorithm's performance. Smaller 'k' values make predictions more sensitive to individual data points, including noisy or mislabeled ones, while larger 'k' values make predictions more stable but can blur the boundaries between classes and reduce accuracy. A common approach is to compare several candidate values on a held-out validation set.
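
One way to pick 'k' is to try several candidates and keep the one with the best accuracy on held-out data. The sketch below does this with a small from-scratch k-NN on synthetic two-class data. The data, the helper `knn_predict`, and the candidate list are all assumptions made for illustration; this is not the scikit-learn implementation and does not use MNIST.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Label each query row by majority vote among its k nearest training rows."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k smallest distances in each row
    nearest = np.argsort(dists, axis=1)[:, :k]
    neighbor_labels = y_train[nearest]
    # Majority vote per row; argmax breaks ties toward the smaller label
    return np.array([np.bincount(row).argmax() for row in neighbor_labels])

# Synthetic data: two overlapping Gaussian blobs (illustrative assumption)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Shuffle, then hold out the last 50 points for validation
order = rng.permutation(len(X))
X, y = X[order], y[order]
X_tr, y_tr = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

# Validation accuracy for each candidate k
val_accuracy = {}
for k in (1, 3, 5, 7, 9, 15):
    preds = knn_predict(X_tr, y_tr, X_val, k)
    val_accuracy[k] = float(np.mean(preds == y_val))

best_k = max(val_accuracy, key=val_accuracy.get)
print(val_accuracy)
print("best k:", best_k)
```

The test set is deliberately left out of this loop, so that the final accuracy estimate is not biased by the choice of 'k'.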

3. Prediction for Classification:
- To classify a new data point, calculate the distance between that point and all data points in the training set.
- Identify the 'k' nearest neighbors with the smallest distances to the new data point.
- Count the frequency of each class among these 'k' neighbors.
- Assign the new data point the majority class among its 'k' nearest neighbors. If two or more classes tie, apply a tie-breaking rule, such as choosing the tied class that contains the single nearest neighbor. For two-class problems, an odd 'k' avoids ties altogether.
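
The four classification steps above can be sketched from scratch as follows. This is a minimal illustration under stated assumptions: the function name `knn_classify`, the toy data, and the tie-breaking rule (prefer the tied class whose member is closest) are choices made for this example, not the behavior of any particular library.

```python
from collections import Counter

import numpy as np

def knn_classify(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Step 1: Euclidean distance from x_new to every training point
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))

    # Step 2: indices of the k smallest distances, nearest first
    nearest = np.argsort(distances, kind="stable")[:k]

    # Step 3: count how often each class appears among the neighbors
    votes = Counter(y_train[i] for i in nearest)

    # Step 4: majority class; on a tie, the tied class that appears
    # first in nearest-first order wins (an assumed tie-breaking rule)
    top = max(votes.values())
    for i in nearest:
        if votes[y_train[i]] == top:
            return y_train[i]

# Toy data: class "a" near (1, 1), class "b" near (6, 6)
X_toy = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]]
y_toy = ["a", "a", "a", "b", "b", "b"]

print(knn_classify(X_toy, y_toy, [1.5, 1.5], k=3))  # a
print(knn_classify(X_toy, y_toy, [6.5, 6.5], k=3))  # b
print(knn_classify(X_toy, y_toy, [4, 4], k=2))      # b (tie broken by the nearest point)
```

Because every prediction computes a distance to every training point, prediction cost grows with the size of the training set, which is the price of skipping a training phase.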

In [18]:
# import the libraries
import numpy as np

from keras.datasets import mnist  # used only to download the MNIST dataset
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

## Libraries to import

### sklearn
- scikit-learn (sklearn) is a popular open-source Python library for machine learning that offers a wide range of tools and algorithms for building, evaluating, and deploying machine learning models. 
- It provides a consistent and user-friendly interface for tasks such as classification, regression, clustering, and data preprocessing.

### keras
- Keras is an open-source, high-level neural network API written in Python that makes it easy to design, train, and deploy deep learning models.
- It provides a user-friendly and modular interface on top of backends such as TensorFlow (older releases also supported Theano), simplifying the development of artificial neural networks for tasks such as image recognition and natural language processing. In this notebook it is used only to load the MNIST dataset.

### numpy
- NumPy, short for "Numerical Python," is a fundamental Python library for numerical and mathematical operations. 
- It provides support for arrays and matrices, along with a wide range of mathematical functions, making it a powerful tool for scientific computing and data analysis in Python.

In [7]:
# download the MNIST dataset (60,000 training and 10,000 test images of 28x28 pixels)
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# hold out 20% of the training images as a validation set
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# checking shapes or dimensionalities
print("X train shape:", X_train.shape[0:])
print("y train shape:", y_train.shape[0:])
print("X val shape:", X_validation.shape[0:])
print("y val shape:", y_validation.shape[0:])
print("X test shape:", X_test.shape[0:])
print("y test shape:", y_test.shape[0:])

X train shape: (48000, 28, 28)
y train shape: (48000,)
X val shape: (12000, 28, 28)
y val shape: (12000,)
X test shape: (10000, 28, 28)
y test shape: (10000,)


In [8]:
# flatten each 28x28 image into a 784-dimensional vector
X_train = X_train.reshape((-1, 28*28))
X_validation = X_validation.reshape((-1, 28*28))
X_test = X_test.reshape((-1, 28*28))

# re-checking shapes or dimensionalities
print("X train shape:", X_train.shape[0:])
print("y train shape:", y_train.shape[0:])
print("X val shape:", X_validation.shape[0:])
print("y val shape:", y_validation.shape[0:])
print("X test shape:", X_test.shape[0:])
print("y test shape:", y_test.shape[0:])

X train shape: (48000, 784)
y train shape: (48000,)
X val shape: (12000, 784)
y val shape: (12000,)
X test shape: (10000, 784)
y test shape: (10000,)
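
To see what the flattening step does, here is a small sketch on a stand-in array instead of MNIST. The array `images` is made up for illustration. The `-1` tells NumPy to infer the number of rows, and each image's pixels are laid out row by row in the flattened vector.

```python
import numpy as np

# Stand-in batch of 5 "images" of 28x28 pixels, filled with distinct values
images = np.arange(5 * 28 * 28).reshape(5, 28, 28)

# -1 lets NumPy infer the first dimension (the number of images)
flat = images.reshape((-1, 28 * 28))

print(images.shape)  # (5, 28, 28)
print(flat.shape)    # (5, 784)

# Pixel (row 3, column 4) of image 2 lands at position 3*28 + 4 in the flat vector
print(flat[2, 3 * 28 + 4] == images[2, 3, 4])  # True
```

Each array must be reshaped from itself: the number of flattened rows has to match the number of labels, so the validation features keep the same first dimension as the validation labels.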


In [11]:
k = 3

# initialize the model using the library in sklearn
model = KNeighborsClassifier(n_neighbors=k, metric='euclidean')

# "train" the model: k-NN learns no parameters, so fit only stores (and may index) the training data
model.fit(X_train, y_train)

In [15]:
# get the label for the first item in the testing set
print(model.predict([X_test[0]]))

[7]


In [16]:
y_pred = model.predict(X_test)
print(y_pred.shape)

(10000,)


In [19]:
# check the accuracy score of our k-NN model
print(accuracy_score(y_true=y_test, y_pred=y_pred))

0.9681
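
For reference, classification accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand with NumPy on a handful of made-up labels (not the notebook's actual predictions).

```python
import numpy as np

# Made-up true labels and predictions for eight samples
y_true_demo = np.array([7, 2, 1, 0, 4, 1, 4, 9])
y_pred_demo = np.array([7, 2, 1, 0, 4, 1, 9, 9])

# Fraction of positions where prediction and truth agree
accuracy = float(np.mean(y_true_demo == y_pred_demo))
print(accuracy)  # 0.875

# Indices of the misclassified samples, useful for inspecting errors
print(np.flatnonzero(y_true_demo != y_pred_demo))  # [6]
```

Looking at the misclassified indices is often more informative than the single accuracy number, since it shows which digits the model confuses.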
