![image.png](attachment:image.png)

### **K-Nearest Neighbors (KNN) Classifier – Easy Explanation**  
K-Nearest Neighbors (KNN) is a simple and effective **supervised learning algorithm** used for classification and regression. For classification, it assigns a data point the class chosen by the **majority vote of its k nearest neighbors** in the training data; for regression, it averages the neighbors' target values.

#### **How KNN Works (Step-by-Step)**  
1. **Choose a value for K** (number of neighbors).  
2. **Calculate the distance** between the new data point and all existing points in the dataset (commonly using Euclidean distance).  
3. **Find the K nearest neighbors** (the K points closest to the new data point).  
4. **Assign the class** based on the majority class of these K neighbors.  
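
The four steps above can be sketched from scratch in a few lines. This is a minimal illustration, not how scikit-learn implements it; the function name `knn_predict` and the toy data are made up for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 4: majority class among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters in 2-D
X_toy = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]]
y_toy = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_toy, y_toy, [1.8, 1.2], k=3))  # point near the first cluster
print(knn_predict(X_toy, y_toy, [6.2, 6.4], k=3))  # point near the second cluster
```

Step 1 (choosing K) is the `k` argument here; an odd value avoids tied votes in a two-class problem.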

---

### **Advantages of KNN**  
✅ **Simple and Intuitive** – Easy to understand and implement.  
✅ **No Training Phase** – KNN is a **lazy learner**: it builds no explicit model and simply stores the training data, deferring all computation to prediction time.  
✅ **Works Well with Small Data** – Effective for small datasets with clear patterns.  
✅ **Can Be Used for Classification & Regression** – Flexible and widely applicable.  
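
As a small illustration of the regression use case, the sketch below uses scikit-learn's `KNeighborsRegressor`, which predicts the mean target of the K nearest neighbors (with the default uniform weights). The toy arrays `X_toy` and `y_toy` are made up for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical 1-D data: the target is roughly 10 times the feature
X_toy = np.array([[1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([10.0, 20.0, 30.0, 100.0])

reg = KNeighborsRegressor(n_neighbors=3).fit(X_toy, y_toy)

# The three nearest neighbors of 2.1 are 2.0, 3.0 and 1.0,
# so the prediction is the mean of their targets (20, 30, 10)
pred = reg.predict([[2.1]])[0]
print(pred)
```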

---

### **Disadvantages of KNN**  
❌ **Slow for Large Datasets** – Since KNN stores all data points and calculates distances for each prediction, it becomes slow with large datasets.  
❌ **Sensitive to Irrelevant Features** – Unimportant features can affect distance calculations, leading to poor accuracy.  
❌ **Memory Intensive** – Requires storing the entire dataset, which can be costly for large datasets.  
❌ **Bad for Imbalanced Data** – If one class dominates the dataset, KNN can be biased toward that class.  
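
The sensitivity to irrelevant features can be seen on a tiny example. In this sketch (toy numbers chosen purely for illustration), adding one uninformative column with a large spread changes which training point is nearest, and the prediction flips.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: column 0 is informative, the extra column is irrelevant noise
X_informative = np.array([[1.0], [5.0]])
noise = np.array([[10.0], [0.0]])
X_with_noise = np.hstack([X_informative, noise])
y = np.array([0, 1])

query = np.array([[1.5]])
query_with_noise = np.array([[1.5, 0.5]])

clean = KNeighborsClassifier(n_neighbors=1).fit(X_informative, y)
noisy = KNeighborsClassifier(n_neighbors=1).fit(X_with_noise, y)

pred_clean = clean.predict(query)[0]             # nearest point is [1.0] -> class 0
pred_noisy = noisy.predict(query_with_noise)[0]  # noise column now dominates the distance
print(pred_clean, pred_noisy)
```

With the noise column included, the distance to the class-0 point is about 9.5 while the distance to the class-1 point is about 3.5, so the irrelevant feature decides the outcome.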

---

### **When to Use KNN?**  
✅ When you have **small to medium-sized datasets**.  
✅ When the data is **not too noisy or imbalanced**.  
✅ When **real-time predictions** are not required (since KNN can be slow).  

---

### **Real-Life Applications of KNN**  
📍 **Medical Diagnosis** – Used to classify diseases based on symptoms and past patient data.  
📍 **Recommendation Systems** – Suggests products based on users with similar preferences.

---
Reference: https://www.geeksforgeeks.org/k-nearest-neighbours/

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv(r"Social_Network_Ads.csv")
df

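|     | Age | EstimatedSalary | Purchased |
|-----|-----|-----------------|-----------|
| 0   | 19  | 19000           | 0         |
| 1   | 35  | 20000           | 0         |
| 2   | 26  | 43000           | 0         |
| 3   | 27  | 57000           | 0         |
| 4   | 19  | 76000           | 0         |
| ... | ... | ...             | ...       |
| 395 | 46  | 41000           | 1         |
| 396 | 51  | 23000           | 1         |
| 397 | 50  | 20000           | 1         |
| 398 | 36  | 33000           | 0         |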


In [3]:
df.isnull().sum()

Age                0
EstimatedSalary    0
Purchased          0
dtype: int64

In [4]:
X = df.drop(columns='Purchased')
y = df['Purchased']

In [5]:
X.head()

|     | Age | EstimatedSalary |
|-----|-----|-----------------|
| 0   | 19  | 19000           |
| 1   | 35  | 20000           |
| 2   | 26  | 43000           |
| 3   | 27  | 57000           |
| 4   | 19  | 76000           |


In [6]:
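from sklearn.preprocessing import StandardScaler
# KNN is distance-based, so features are standardized to comparable scales.
# Note: fitting the scaler on the full dataset before splitting lets test-set
# statistics leak into training; fitting on the training split only is safer.
sc = StandardScaler()
X = sc.fit_transform(X)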
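
To make the scaling step concrete, the sketch below reproduces by hand what `StandardScaler` computes per column, namely z = (x - mean) / std with the population standard deviation. It uses only the five rows displayed by `X.head()`; the array name `sample` is introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five (Age, EstimatedSalary) rows shown by X.head()
sample = np.array(
    [[19, 19000], [35, 20000], [26, 43000], [27, 57000], [19, 76000]],
    dtype=float,
)

# Manual standardization: subtract the column mean, divide by the column std (ddof=0)
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

# The same transformation via scikit-learn
scaled = StandardScaler().fit_transform(sample)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, each column has mean 0 and standard deviation 1, so Age and EstimatedSalary contribute on comparable scales to the Euclidean distance.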

# Train-Test Split

In [7]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)

In [8]:
X_train

array([[ 1.94321462,  2.18056084],
       [ 2.03872775,  0.38930459],
       [-1.30423178, -0.4329114 ],
       [-1.11320552, -1.02020853],
       [ 1.94321462, -0.93211396],
       [ 0.41500455,  0.30121002],
       [ 0.22397829,  0.15438573],
       [ 2.03872775,  1.76945285],
       [ 0.79705706, -0.84401939],
       [ 0.31949142, -0.28608712],
       [ 0.41500455, -0.16862769],
       [-0.0625611 ,  2.23929055],
       [-1.39974491, -0.63846539],
       [-1.20871865, -1.07893824],
       [-1.30423178,  0.41866944],
       [-1.01769239,  0.77104772],
       [-1.39974491, -0.19799255],
       [ 0.98808332, -1.07893824],
       [ 0.98808332,  0.59485858],
       [ 0.41500455,  1.00596657],
       [ 0.60603081, -0.9027491 ],
       [-0.54012675,  1.47580428],
       [ 0.03295203, -0.57973568],
       [-0.54012675,  1.91627713],
       [ 1.37013584, -1.43131652],
       [ 1.46564897,  1.00596657],
       [ 0.12846516, -0.81465453],
       [ 0.03295203, -0.25672226],
       [-0.15807423,

In [9]:
y_train

336    1
64     0
55     0
106    0
300    1
      ..
323    1
192    0
117    0
47     0
172    0
Name: Purchased, Length: 320, dtype: int64

In [10]:
X_test

array([[-0.73115301,  0.50676401],
       [ 0.03295203, -0.57973568],
       [-0.25358736,  0.15438573],
       [-0.73115301,  0.27184516],
       [-0.25358736, -0.57973568],
       [-1.01769239, -1.46068138],
       [-0.63563988, -1.60750566],
       [-0.15807423,  2.18056084],
       [-1.87731056, -0.05116826],
       [ 0.89257019, -0.78528968],
       [-0.73115301, -0.60910054],
       [-0.92217926, -0.4329114 ],
       [-0.0625611 , -0.4329114 ],
       [ 0.12846516,  0.21311545],
       [-1.6862843 ,  0.47739916],
       [-0.54012675,  1.38770971],
       [-0.0625611 ,  0.21311545],
       [-1.78179743,  0.4480343 ],
       [ 1.65667523,  1.76945285],
       [-0.25358736, -1.40195167],
       [-0.25358736, -0.66783025],
       [ 0.89257019,  2.18056084],
       [ 0.31949142, -0.55037082],
       [ 0.89257019,  1.03533143],
       [-1.39974491, -1.22576253],
       [ 1.08359645,  2.09246627],
       [-0.92217926,  0.50676401],
       [-0.82666613,  0.30121002],
       [-0.0625611 ,

In [11]:
y_test

132    0
309    0
341    0
196    0
246    0
      ..
14     0
363    0
304    0
361    1
329    1
Name: Purchased, Length: 80, dtype: int64

# K-Nearest Neighbors Classifier

In [12]:
from sklearn.neighbors import KNeighborsClassifier
nc = KNeighborsClassifier(n_neighbors=3)
nc.fit(X_train,y_train)

| Parameter     | Value       |
|---------------|-------------|
| n_neighbors   | 3           |
| weights       | 'uniform'   |
| algorithm     | 'auto'      |
| leaf_size     | 30          |
| p             | 2           |
| metric        | 'minkowski' |
| metric_params | None        |
| n_jobs        | None        |
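
The value `n_neighbors=3` above is a manual choice. One common way to pick K is cross-validation over a few candidates; the sketch below shows the idea on synthetic stand-in data from `make_classification` (an assumption, since the CSV is not loaded in this block). The names `candidate_ks`, `mean_scores` and `best_k` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data standing in for the real dataset (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

candidate_ks = [1, 3, 5, 7, 9, 11]
mean_scores = {}
for k in candidate_ks:
    # Scaling lives inside the pipeline so each fold is scaled on its own training part
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best k:", best_k)
```

Putting the scaler inside the pipeline also avoids fitting it on data that is later used for evaluation.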


# Prediction

In [13]:
y_pred = nc.predict(X_test)
y_pred

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1])

In [14]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
# scikit-learn expects (y_true, y_pred): rows are true classes, columns are predicted classes
confusion_matrix(y_test,y_pred)

array([[55,  3],
       [ 1, 21]])

![image.png](attachment:image.png)

In [15]:
accuracy_score(y_test,y_pred)

0.95

In [16]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.98      0.95      0.96        58
           1       0.88      0.95      0.91        22

    accuracy                           0.95        80
   macro avg       0.93      0.95      0.94        80
weighted avg       0.95      0.95      0.95        80
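
The figures in this report can be reproduced by hand from four counts. The counts used below (55 true negatives, 3 false positives, 1 false negative, 21 true positives, with class 1 as the positive class) are the ones consistent with the supports, precisions and recalls in the report; the variable names are illustrative.

```python
# Counts consistent with the classification report (class 1 = positive)
tn, fp, fn, tp = 55, 3, 1, 21

precision_1 = tp / (tp + fp)   # of all predicted 1s, how many were truly 1
recall_1 = tp / (tp + fn)      # of all true 1s, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision_1:.3f} recall={recall_1:.3f} "
      f"f1={f1_1:.3f} accuracy={accuracy:.3f}")
```

The support values follow from the same counts: 55 + 3 = 58 true class-0 samples and 1 + 21 = 22 true class-1 samples.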



In [None]:
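import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Reload the data so this cell also runs in a fresh kernel
df = pd.read_csv(r"Social_Network_Ads.csv")

if 'Age' in df.columns and 'EstimatedSalary' in df.columns and 'Purchased' in df.columns:
    X = df[['Age', 'EstimatedSalary']].values  # Features
    y = df['Purchased'].values  # Target
    # Splitting data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Standardizing the features (fit on the training split only, then apply to the test split)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    # Train KNN model
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)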

In [None]:
import numpy as np
import matplotlib.pyplot as plt
# Create mesh grid for decision boundary in the scaled feature space.
# X_train and X_test are already standardized; building the grid from the raw X
# would span the full salary range at step 0.1 and exhaust memory.
h = 0.1  # Step size
X_all = np.vstack([X_train, X_test])
x_min, x_max = X_all[:, 0].min() - 1, X_all[:, 0].max() + 1
y_min, y_max = X_all[:, 1].min() - 1, X_all[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Predict on mesh grid (grid points are already in scaled units, so no extra transform)
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot decision boundary
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.Paired)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k', marker='o', label="Train Data")
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolors='k', marker='s', label="Test Data")
plt.xlabel("Age (scaled)")
plt.ylabel("Estimated Salary (scaled)")
plt.title("K-Nearest Neighbors (KNN) Decision Boundary")
plt.legend()
plt.show()