## K Nearest Neighbour with Python
KNN (K Nearest Neighbours) is a simple, easy-to-understand, and versatile machine learning algorithm, and one of the most widely used. It can be applied to both classification and regression problems, although in industry it is more commonly used for classification.
## How Does the KNN Algorithm Work?
* In KNN, K is the number of nearest neighbours considered when making a prediction.
* The number of neighbours is the core deciding factor.
* K is generally chosen to be an odd number when there are 2 classes, so that a majority vote cannot end in a tie. When K = 1, the algorithm is known as the nearest neighbour algorithm.

**KNN needs a lot of storage, and prediction is also quite slow**, because the model keeps the whole training set and searches it for every prediction.

K = 1 is the simplest case. Suppose P1 is the point whose label needs to be predicted. First, you find the single closest point to P1, and then the label of that nearest point is assigned to P1.

![image.png](attachment:image.png)

**K** decides how many neighbours you consider. For example, with **K = 1** there is only one neighbour. As in the picture above:
* K = 3: there are three neighbours in the circle.
* K = 7: there are seven neighbours in the circle.
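
To make the role of K concrete, here is a minimal sketch of a majority vote among the K closest points. The points, labels, query point `p1`, and the helper `predict_label` are all made up for illustration and are not taken from the picture above.

```python
from collections import Counter

import numpy as np

# Hypothetical 2-D training points and their labels (illustration only).
points = np.array([
    [1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
    [3.5, 5.0], [4.5, 5.0], [3.5, 4.5],
])
labels = np.array(["A", "A", "B", "B", "B", "B", "A"])

# The point whose label we want to predict.
p1 = np.array([3.2, 4.4])


def predict_label(query, k):
    """Return the most common label among the k points closest to query."""
    distances = np.linalg.norm(points - query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                 # indices of the k closest points
    votes = Counter(labels[nearest].tolist())
    return votes.most_common(1)[0][0]


for k in (1, 3, 7):
    print(f"K={k}: predicted label {predict_label(p1, k)}")
```

With this toy data the single nearest point has label A, but the majority among the 3 nearest and among all 7 points is B, so the prediction changes as K grows.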

To find the closest (most similar) points, you compute the distance between points using a distance measure such as:
* Euclidean Distance
    * Euclidean distance is the most popular distance measure.
    * It is the straight-line (shortest) distance between two points.
* Manhattan Distance
    * Manhattan distance is the distance between two points measured along the axes of a grid.
    * It is the sum of the absolute differences between the coordinates of the two points.
* Minkowski Distance
    * Minkowski distance is the generalized form of Euclidean and Manhattan distance.
    * It raises the absolute coordinate differences to the power p, sums them, and takes the p-th root of the sum, where p ≥ 1 is called the order of the norm. p = 1 gives Manhattan distance and p = 2 gives Euclidean distance.
* Hamming Distance
    * Hamming distance is the distance between two equal-length sequences of symbols, such as binary strings.
    * It counts the number of positions at which the corresponding symbols differ.
    * It is used in DNA sequencing, spell checking, error correction, and data compression.
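
The four measures can be sketched in a few lines of NumPy. The function names and sample vectors below are chosen for illustration; in practice you would likely rely on the metrics built into your library rather than hand-written versions.

```python
import numpy as np


def euclidean(a, b):
    """Straight-line distance: square root of the sum of squared differences."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


def manhattan(a, b):
    """Grid distance: sum of absolute differences."""
    return float(np.sum(np.abs(a - b)))


def minkowski(a, b, p):
    """Generalized distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return int(np.sum(a != b))


# Illustrative sample vectors.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

print(euclidean(a, b))     # 5.0
print(manhattan(a, b))     # 7.0
print(minkowski(a, b, 1))  # 7.0, same as Manhattan
print(minkowski(a, b, 2))  # 5.0, same as Euclidean

bits_a = np.array([1, 0, 1, 1])
bits_b = np.array([1, 1, 0, 1])
print(hamming(bits_a, bits_b))  # 2
```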

![image-2.png](attachment:image-2.png)

## How KNN Works
* Load the data
* Initialize K to your chosen number of neighbours
* For each example in the data
    * Calculate the distance between the query example and the current example from the data
    * Add the distance and the index of the example to an ordered collection
* Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by distance
* Pick the first K entries from the sorted collection
* Get the labels of the selected K entries
* If regression, return the mean of the K labels
* If classification, return the mode of the K labels
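
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not an optimized implementation: the function name `knn_predict`, its `mode` parameter, and the toy datasets are assumptions made for this example, and Euclidean distance is used via `math.dist` (Python 3.8+).

```python
import math
from collections import Counter


def knn_predict(data, labels, query, k, mode="classification"):
    """Predict the label of query from its k nearest examples in data."""
    distances_and_indices = []
    for index, example in enumerate(data):
        # Distance between the query example and the current example
        distance = math.dist(example, query)
        distances_and_indices.append((distance, index))

    # Sort by distance, smallest first, and keep the first k entries
    distances_and_indices.sort()
    k_nearest = distances_and_indices[:k]

    # Labels of the selected k entries
    k_labels = [labels[index] for _, index in k_nearest]

    if mode == "regression":
        return sum(k_labels) / len(k_labels)      # mean of the k labels
    return Counter(k_labels).most_common(1)[0][0]  # mode of the k labels


# Toy classification data: two well-separated groups (illustration only).
clf_data = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
clf_labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(clf_data, clf_labels, (2, 2), k=3))  # 0

# Toy regression data: one feature, numeric targets (illustration only).
reg_data = [(1,), (2,), (3,), (10,)]
reg_targets = [1.0, 2.0, 3.0, 10.0]
print(knn_predict(reg_data, reg_targets, (2.5,), k=2, mode="regression"))  # 2.5
```

Every prediction loops over the whole dataset, which is why prediction gets slow as the training set grows.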

![image.png](attachment:image.png)

#### How do you decide the number of neighbours in KNN?
* The number of neighbours is the core deciding factor.
* K is generally an odd number if the number of classes is 2, which avoids tied votes. When K = 1, the algorithm is known as the nearest neighbour algorithm.
* When K is small, the model is more prone to noise. Each prediction depends on only a few training points, so the model is more sensitive to the training data and may overfit.
* When K is large, the model is less prone to noise. It is less sensitive to individual variations in the training data, but it may become too generalized and underfit, failing to capture important patterns in the data.
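
One common way to balance these effects is to compare several candidate values of K with cross-validation and keep the one with the best average score. The sketch below assumes scikit-learn and its bundled Iris dataset; the candidate list and the 5 folds are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate values of K to compare (an arbitrary illustrative range).
candidate_ks = [1, 3, 5, 7, 9, 11, 13, 15]

mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: mean accuracy over 5 train/validation splits
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[k] = scores.mean()
    print(f"K={k}: mean CV accuracy = {mean_scores[k]:.3f}")

best_k = max(mean_scores, key=mean_scores.get)
print("Best K by cross-validation:", best_k)
```

Averaging over several folds gives a steadier estimate than a single train/test split, where the best K can change with the random split.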

## Code Example:

```python
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Step 1: Load the Iris dataset
iris = load_iris()

# Convert to DataFrame for easier understanding
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = iris.target

# Display first 5 rows of the dataset
print("First 5 rows of the Iris dataset:\n", df.head())

# Step 2: Split the data into training and testing sets
X = iris.data  # Features (sepal length, sepal width, etc.)
y = iris.target  # Target (species)

# Split the data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("\nTraining set size:", X_train.shape)
print("Testing set size:", X_test.shape)

# Step 3: Apply the KNN algorithm with K=3
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f'\nAccuracy with K=3: {accuracy * 100:.2f}%')

# Step 4: Try different values of K and see the effect on accuracy
print("\nAccuracy with different values of K:")
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f'Accuracy with K={k}: {accuracy * 100:.2f}%')
```


```
First 5 rows of the Iris dataset:
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

   species
0        0
1        0
2        0
3        0
4        0

Training set size: (120, 4)
Testing set size: (30, 4)

Accuracy with K=3: 100.00%

Accuracy with different values of K:
Accuracy with K=1: 100.00%
Accuracy with K=3: 100.00%
Accuracy with K=5: 100.00%
Accuracy with K=7: 96.67%
Accuracy with K=9: 100.00%
```
