# k-Nearest Neighbors (k-NN) Algorithm Implementation with the Iris Dataset

## Overview

This notebook provides a step-by-step guide to implementing the **k-Nearest Neighbors (k-NN)** algorithm using Python and the popular **Iris dataset**. The k-NN algorithm is a foundational classification technique in machine learning, which classifies data points based on the majority class of their nearest neighbors.
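
To make the majority-vote idea concrete, here is a minimal from-scratch sketch, separate from the scikit-learn classifier used later in this notebook. The helper name `knn_predict`, the toy arrays, and the choice of Euclidean distance are assumptions made for illustration only.

```python
from collections import Counter

import numpy as np


def knn_predict(train_points, train_labels, query, k=3):
    """Classify one query point by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_points - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those k neighbors
    votes = Counter(train_labels[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters in 2-D
toy_x = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(toy_x, toy_y, np.array([1.1, 1.0]), k=3))  # 0
print(knn_predict(toy_x, toy_y, np.array([4.9, 5.0]), k=3))  # 1
```

Each query lands inside one cluster, so all three of its nearest neighbors share a label and the vote is unanimous.
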

## Objectives

- **Load and Explore**: Understand and prepare the Iris dataset, a classic dataset in machine learning.
- **Feature and Target Separation**: Organize data for training by separating features and target labels.
- **Data Splitting**: Create training and testing sets to evaluate model performance.
- **k-NN Implementation**: Implement the k-NN classifier, experimenting with different values of `K`.
- **Model Evaluation**: Measure the accuracy of the k-NN model to assess its performance.


In [1]:
import pandas as pd
import numpy as np
from sklearn import datasets

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score



## Loading the Iris Dataset

We’ll use the **Iris dataset**, a classic benchmark for classification tasks. It contains 150 samples from three species of Iris flowers (setosa, versicolor, and virginica), 50 per species, each described by four features: sepal length, sepal width, petal length, and petal width, all measured in centimeters.

The code below loads the Iris dataset using `datasets.load_iris()` from `sklearn`, then creates a **DataFrame** using `pandas`. This DataFrame combines the features and target labels into a structured format for easy exploration.

- `np.c_` is used to concatenate arrays along the columns (second axis). Here, it combines the feature data (`iris['data']`) with the target labels (`iris['target']`), arranging them side by side to form a single array.
- This combined array is then passed to `pd.DataFrame`, which organizes it into a tabular format with appropriate column names (`iris['feature_names'] + ['target']`).
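
As a small illustration of what `np.c_` does, the sketch below joins a 2-D feature array with a 1-D label array. The array names and values are made up for the example. Because the result is a single NumPy array with one dtype, integer labels are converted to floats, which is why the `target` column in this notebook displays as `0.0`, `1.0`, and `2.0`.

```python
import numpy as np

features = np.array([[5.1, 3.5], [4.9, 3.0], [4.7, 3.2]])  # shape (3, 2), floats
labels = np.array([0, 0, 1])                                # shape (3,), integers

# The 1-D labels are treated as a column and appended on the right
combined = np.c_[features, labels]

print(combined.shape)  # (3, 3)
print(combined.dtype)  # float64
```
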



In [2]:
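# Load the dataset; the result exposes 'data', 'target', and 'feature_names'
iris = datasets.load_iris()
# Stack features and labels column-wise, then wrap the result in a DataFrame
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])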

In [3]:
iris_df

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0.0
1,4.9,3.0,1.4,0.2,0.0
2,4.7,3.2,1.3,0.2,0.0
3,4.6,3.1,1.5,0.2,0.0
4,5.0,3.6,1.4,0.2,0.0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2.0
146,6.3,2.5,5.0,1.9,2.0
147,6.5,3.0,5.2,2.0,2.0
148,6.2,3.4,5.4,2.3,2.0


## Separating Features and Target Variable

Separate the **features** and the **target variable** from the dataset.

- `x`: This variable contains all the feature columns (sepal length, sepal width, petal length, and petal width), which will be used as input for the k-NN algorithm.
- `y`: This variable holds the target column, which indicates the species of each flower.

The code below uses `iloc` to select:
- `x`: All rows and all columns except the last one.
- `y`: All rows and only the last column (target).
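
The same slicing pattern can be tried on a tiny DataFrame first. The frame `toy_df` and its column names are hypothetical; the point is that `iloc[:, :-1]` returns a DataFrame while `iloc[:, -1]` returns a Series.

```python
import pandas as pd

toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "target": [0, 1, 0]})

toy_features = toy_df.iloc[:, :-1]  # every column except the last
toy_target = toy_df.iloc[:, -1]     # only the last column

print(list(toy_features.columns))                              # ['a', 'b']
print(toy_target.name)                                         # target
print(type(toy_features).__name__, type(toy_target).__name__)  # DataFrame Series
```
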



In [4]:
# Features: every column except the last
x = iris_df.iloc[:, :-1]
# Target: the last column only
y = iris_df.iloc[:, -1]

In [5]:
x

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [6]:
y

0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
      ... 
145    2.0
146    2.0
147    2.0
148    2.0
149    2.0
Name: target, Length: 150, dtype: float64

## Splitting the Dataset into Training and Testing Sets

To evaluate the k-NN classifier fairly, split the dataset into **training** and **testing** sets. The model is fit on one portion of the data and scored on another portion it has never seen, which gives a more honest estimate of how it will perform on new samples.

The `train_test_split` function from `sklearn.model_selection` is used here, with the following parameters:
- `test_size=0.2`: Reserves 20% of the data for testing and 80% for training.
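- `shuffle=True`: Shuffles the samples before splitting (this is the default). It matters here because the Iris rows are ordered by species, so an unshuffled split would put only one species in the test set.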
- `random_state=0`: Sets a random seed for reproducibility, so the split remains consistent each time.
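
The effect of these parameters can be checked with a short, self-contained sketch. It reloads the data under its own variable names (`x_demo`, `y_demo`, `split_a`, `split_b` are chosen for this example and are not part of the notebook's code) and confirms the split sizes and the reproducibility that `random_state` provides.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris_demo = datasets.load_iris()
x_demo, y_demo = iris_demo["data"], iris_demo["target"]

# Two calls with the same seed should produce identical splits
split_a = train_test_split(x_demo, y_demo, test_size=0.2, shuffle=True, random_state=0)
split_b = train_test_split(x_demo, y_demo, test_size=0.2, shuffle=True, random_state=0)

x_tr_demo, x_te_demo, y_tr_demo, y_te_demo = split_a
print(x_tr_demo.shape, x_te_demo.shape)  # (120, 4) (30, 4)
print((split_a[1] == split_b[1]).all())  # True
```
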



In [10]:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, shuffle=True, random_state=0)

## Implementing the k-NN Classifier

Now that we’ve prepared our training and testing data, we can implement the **k-Nearest Neighbors (k-NN) classifier**. The `KNeighborsClassifier` from `sklearn.neighbors` is used here, where we specify the number of neighbors, `K`, to use for classification.

- **Choosing K**: `K` is the number of neighbors consulted when classifying a new data point. A small `K` makes predictions sensitive to noisy individual samples, while a large `K` smooths the decision boundary and can blur distinctions between classes, so it is good practice to try several values and compare their accuracy.

1. **Define `K`**: Here, `K=3` is selected initially. You can experiment with other values to see their impact.
2. **Initialize Classifier**: `KNeighborsClassifier(K)` initializes the k-NN classifier, passing `K` as its `n_neighbors` argument.
3. **Train the Model**: The `.fit()` method trains the model using the training features (`x_train`) and labels (`y_train`).
4. **Predict**: The `.predict()` method generates predictions for the test set (`x_test`), which we store in `y_pred_sklearn`.
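
One way to experiment with different values of `K` is to repeat the four steps above in a loop. The sketch below is self-contained and uses its own variable names (`x_tr`, `x_te`, `k_values`, `accuracies`); the particular list of `K` values is an arbitrary choice for illustration.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_data = datasets.load_iris()
x_tr, x_te, y_tr, y_te = train_test_split(
    iris_data["data"], iris_data["target"],
    test_size=0.2, shuffle=True, random_state=0)

k_values = [1, 3, 5, 7, 9]  # arbitrary candidates, odd to reduce tied votes
accuracies = {}
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(x_tr, y_tr)
    accuracies[k] = accuracy_score(y_te, model.predict(x_te))

for k, acc in accuracies.items():
    print(f"K={k}: accuracy={acc:.3f}")
```

Comparing values of `K` on the test set is fine for exploration, but the best score found this way tends to be optimistic. Cross-validation on the training data (for example with `cross_val_score` from `sklearn.model_selection`) is a common alternative.
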



In [None]:
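K = 3
# K is passed positionally as n_neighbors
knn = KNeighborsClassifier(K)
# Fit on the training split
knn.fit(x_train, y_train)
# Predict a species label for every sample in the test split
y_pred_sklearn = knn.predict(x_test)
print(y_pred_sklearn)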

[2. 1. 0. 2. 0. 2. 0. 1. 1. 1. 2. 1. 1. 1. 2. 0. 1. 1. 0. 0. 2. 1. 0. 0.
 2. 0. 0. 1. 1. 0.]


## Evaluating the Model's Accuracy

To assess the performance of the k-NN classifier, calculate the **accuracy score**, which represents the proportion of correct predictions out of the total predictions made.

The `accuracy_score` function from `sklearn.metrics` compares the predicted labels (`y_pred_sklearn`) with the actual labels (`y_test`) from the test set. The score ranges from 0 to 1, with 1 meaning every test sample was classified correctly. With 30 test samples, the value of 0.9667 shown below corresponds to 29 correct predictions.
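
Accuracy is simply the fraction of matching labels, which the following sketch verifies on four made-up labels (`y_true_demo` and `y_hat_demo` are hypothetical arrays, not the notebook's results).

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true_demo = np.array([0, 1, 2, 2])
y_hat_demo = np.array([0, 1, 1, 2])  # one mistake out of four

# Fraction of positions where prediction and truth agree
manual_accuracy = (y_true_demo == y_hat_demo).mean()

print(manual_accuracy)                          # 0.75
print(accuracy_score(y_true_demo, y_hat_demo))  # 0.75
```
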



In [12]:
accuracy_score(y_test, y_pred_sklearn)

0.9666666666666667