# K-Nearest Neighbor (KNN) Algorithm in Machine Learning

The K-Nearest Neighbors (KNN) algorithm is a simple supervised machine learning algorithm that can be used for both classification and regression tasks. It is non-parametric: it makes no assumptions about the form of the underlying data distribution. It is also a lazy learning algorithm because it has no dedicated training phase. Instead, it stores the training data and defers all computation to prediction time.

## How KNN Works

The KNN algorithm works by finding the K most similar data points in the training data to a new data point, and then predicting the label of the new data point based on the labels of its K-nearest neighbors.

### Steps of the KNN Algorithm
1. **Load the data:** Load the training data and the test data.
2. **Choose the value of K:** Choose the number of nearest neighbors to consider.
3. **For each data point in the test data:**
    a. **Calculate the distance:** Calculate the distance between the test data point and all the training data points. The most common distance metric is Euclidean distance, but other metrics such as Manhattan distance and Minkowski distance can also be used.
    b. **Find the K-nearest neighbors:** Find the K training data points that are closest to the test data point.
    c. **Predict the label:**
        - For classification tasks, predict the label of the test data point to be the most common label among its K-nearest neighbors.
        - For regression tasks, predict the value of the test data point as the average of the target values of its K-nearest neighbors.
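
The steps above can be sketched from scratch with NumPy. This is a minimal illustration rather than a reference implementation: the function names `minkowski_distances` and `knn_predict`, the `task` argument, and the toy data are all assumptions made up for this example. The Minkowski distance with `p=1` reduces to Manhattan distance and with `p=2` to Euclidean distance.

```python
from collections import Counter

import numpy as np


def minkowski_distances(X_train, x, p=2):
    """Distance from one query point x to every row of X_train.

    p=1 gives Manhattan distance, p=2 gives Euclidean distance.
    """
    return np.sum(np.abs(X_train - x) ** p, axis=1) ** (1.0 / p)


def knn_predict(X_train, y_train, x, k=3, p=2, task="classification"):
    """Predict the label of a single query point x from its k nearest neighbors."""
    distances = minkowski_distances(X_train, x, p=p)
    # Indices of the k smallest distances (stable sort keeps ties in data order).
    nearest = np.argsort(distances, kind="stable")[:k]
    neighbor_labels = y_train[nearest]
    if task == "classification":
        # Majority vote; on a tied count, the label seen first (nearest) wins.
        return Counter(neighbor_labels.tolist()).most_common(1)[0][0]
    # Regression: average the neighbors' target values.
    return float(np.mean(neighbor_labels))


# Toy data (made up for illustration): two clusters in 2-D.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 1.2, 1.1, 5.0, 5.5, 5.2])

query = np.array([2.0, 2.0])
predicted_class = knn_predict(X_train, y_class, query, k=3)
predicted_value = knn_predict(X_train, y_value, query, k=3, task="regression")
print(predicted_class, predicted_value)
```

Here the three nearest neighbors of the query all belong to the first cluster, so the classifier returns class 0 and the regressor returns the mean of their three target values. Sorting all distances is the simplest approach; library implementations typically use tree-based indexes or partial sorts to avoid this cost.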

### How to Choose the Value of K

The value of K is a hyperparameter that you need to choose. A small K makes the model sensitive to noise (high variance), while a large K smooths the decision boundary and makes the model more biased, since distant and less relevant points start to influence each prediction. A common way to choose K is cross-validation. For binary classification, an odd K is usually preferred because it avoids tied votes; with more than two classes, ties can still occur.
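
One way to apply cross-validation here is sketched below with scikit-learn's `cross_val_score`. The candidate range of K values, the 5-fold setting, and the use of the Iris dataset are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate values of K (odd values only, an arbitrary choice for this sketch).
candidate_ks = list(range(1, 22, 2))

mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated accuracy for this K.
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[k] = scores.mean()

# Pick the K with the highest mean cross-validated accuracy.
best_k = max(mean_scores, key=mean_scores.get)
print(f"Best K: {best_k} (mean CV accuracy {mean_scores[best_k]:.3f})")
```

On a small dataset several values of K often score almost identically, so it is worth inspecting all of `mean_scores` rather than relying only on the single best value.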

## Implementing KNN Algorithm Using Scikit-Learn

### Example 1: KNN on the Iris Dataset

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the data
iris = load_iris()
# Keep only the first two features (sepal length and sepal width)
# so the decision boundary can be plotted in 2-D
X = iris.data[:, :2]
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the KNN model
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model (for KNN this stores the training data)
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Plot the decision boundary
def plot_decision_boundary(X, y, model):
    # Build a grid that covers the data with a margin of 1 on each side
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                         np.arange(y_min, y_max, 0.01))

    # Predict the class of every grid point and reshape to the grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.Paired)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', cmap=plt.cm.Paired)
    plt.title("KNN with Iris Dataset")
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[1])
    plt.show()

plot_decision_boundary(X, y, knn)
```

### Example 2: KNN on the Moons Dataset

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Create and split the dataset
X, y = make_moons(n_samples=500, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the KNN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Plot the decision boundary
def plot_decision_boundary(X, y, model):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                         np.arange(y_min, y_max, 0.01))

    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.Paired)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', cmap=plt.cm.Paired)
    plt.title("Non-linear KNN with Moons Dataset")
    plt.show()

plot_decision_boundary(X, y, knn)
```

## Advantages of KNN

* Simple to understand and implement.
* No assumptions about the data distribution.
* Can be used for both classification and regression tasks.
* Fast to train: as a lazy learner, it only needs to store the training data.
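
Both examples in this article are classifiers, so as a brief illustration of the regression case, the sketch below uses scikit-learn's `KNeighborsRegressor`. The synthetic noisy sine data, the random seed, and the query points are assumptions made up for this example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic 1-D data (made up for illustration): noisy samples of sin(x).
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# Each prediction is the mean target of the 5 nearest training points.
reg = KNeighborsRegressor(n_neighbors=5)
reg.fit(X, y)

# Query near the peak and the trough of the sine curve.
X_query = np.array([[np.pi / 2], [3 * np.pi / 2]])
y_query = reg.predict(X_query)
print(y_query)  # expected to be roughly sin(x) at the query points
```

Because each prediction averages only a handful of nearby targets, the fitted curve is piecewise constant and tracks the noise more closely as `n_neighbors` decreases.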

## Disadvantages of KNN

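* Can be slow and memory-intensive on large datasets: it must store all the training data and, in a brute-force search, compute the distance from each query to every training point.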
* Sensitive to the choice of K.
* Sensitive to the scale of the data and irrelevant features.
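
The sensitivity to feature scale can be checked empirically. The sketch below compares KNN with and without standardization using scikit-learn's `StandardScaler` in a pipeline. The Wine dataset is an assumption chosen for illustration because its features have very different ranges; the size of the gap will vary with the dataset.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# KNN on the raw features: features with large ranges dominate the distance.
unscaled_model = KNeighborsClassifier(n_neighbors=5)
unscaled_score = cross_val_score(unscaled_model, X, y, cv=5).mean()

# Same model, but each feature is standardized inside the pipeline,
# so the scaler is fit only on the training folds.
scaled_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_score = cross_val_score(scaled_model, X, y, cv=5).mean()

print(f"Unscaled: {unscaled_score:.3f}, scaled: {scaled_score:.3f}")
```

Putting the scaler inside the pipeline, rather than scaling the full dataset up front, keeps information from the validation folds out of the preprocessing step.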