# KNN Implementation from Scratch




**Step 1: Loading the Dataset**


In [26]:
import pandas as pd
import numpy as np

!gdown 1ZdhRqYv-JizWV6DxO6C4R_k1kxPhmlF2 -O groceries.csv

df = pd.read_csv("groceries.csv")
print("First 5 rows:\n", df.head())
print("\nShape:", df.shape)
print("\nColumns:", df.columns)


Downloading...
From: https://drive.google.com/uc?id=1ZdhRqYv-JizWV6DxO6C4R_k1kxPhmlF2
To: /content/groceries.csv
First 5 rows:
    Region  Fresh  Milk  Grocery  Frozen  Detergents_Paper  Delicassen  class
0       3  12669  9656     7561     214              2674        1338      2
1       3   7057  9810     9568    1762              3293        1776      2
2       3   6353  8808     7684    2405              3516        7844      2
3       3  13265  1196     4221    6404               507        1788      1
4       3  22615  5410     7198    3915              1777        5185      1

Shape: (440, 8)

Columns: Index(['Region', 'Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper',
       'Delicassen', 'class'],
      dtype='object')


**Step 2: Data Pre-Processing**



In [27]:
# Standardize column names
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Check Nulls
print("\nMissing values:\n", df.isnull().sum())

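# Fill numeric missing values with the mean, categorical ones with the mode.
# Assign the result back instead of calling fillna(..., inplace=True) on df[col],
# which acts on a temporary copy and triggers a pandas FutureWarning.
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].mean())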

# Encode categorical variables
from sklearn.preprocessing import LabelEncoder, StandardScaler

label_encoders = {}
for col in df.select_dtypes(include=["object"]).columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

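# Standardize numeric features
# Note: every column is numeric here, so this also standardizes the `class` target;
# Step 3 re-encodes it to integer labels before training
scaler = StandardScaler()
num_cols = df.select_dtypes(include=np.number).columns
df[num_cols] = scaler.fit_transform(df[num_cols])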

print("\nProcessed Data Sample:\n", df.head())




Missing values:
 region              0
fresh               0
milk                0
grocery             0
frozen              0
detergents_paper    0
delicassen          0
class               0
dtype: int64

Processed Data Sample:
      region     fresh      milk   grocery    frozen  detergents_paper  \
0  0.590668  0.052933  0.523568 -0.041115 -0.589367         -0.043569   
1  0.590668 -0.391302  0.544458  0.170318 -0.270136          0.086407   
2  0.590668 -0.447029  0.408538 -0.028157 -0.137536          0.133232   
3  0.590668  0.100111 -0.624020 -0.392977  0.687144         -0.498588   
4  0.590668  0.840239 -0.052396 -0.079356  0.173859         -0.231918   

   delicassen     class  
0   -0.066339 -0.262905  
1    0.089151 -0.262905  
2    2.243293 -0.262905  
3    0.093411 -1.607999  
4    1.299347 -1.607999  


Note: pandas 2.x emits a FutureWarning for the chained in-place pattern `df[col].fillna(value, inplace=True)`. The intermediate object `df[col]` always behaves as a copy, so this form will never work once the behavior changes in pandas 3.0.

To perform the operation on the original object, use `df[col] = df[col].fillna(value)` or `df.fillna({col: value}, inplace=True)` instead.
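Step 2 standardizes every numeric column on the full dataset before the split, so the `class` target is scaled too and the test rows contribute to the scaling statistics. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training features only. This is not what the notebook does; the names `X_demo`, `y_demo`, and `scaler_demo` and the generated values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the groceries dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=20.0, size=(50, 3))  # features only
y_demo = rng.integers(0, 3, size=50)                      # labels stay unscaled

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # mean/std come from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # test rows reuse the training statistics

print("Train means:", X_tr_scaled.mean(axis=0).round(6))
print("Test shape:", X_te_scaled.shape)
```

With this ordering the target needs no re-encoding afterwards, because it never passes through the scaler.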



**Step 3: Splitting into Train and Test**

In [31]:

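from sklearn.preprocessing import LabelEncoder

# Target column name (lower-cased in Step 2); assumed to be "class",
# since no earlier cell shown here defines target_col
target_col = "class"

X = df.drop(target_col, axis=1).values
y = df[target_col].values

# The target was standardized along with the features in Step 2,
# so map its float values back to integer class labels (0, 1, ...)
le = LabelEncoder()
y = le.fit_transform(y)

# Train-test split (80/20) using the re-encoded y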
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("\nTrain shape:", X_train.shape, " Test shape:", X_test.shape)


Train shape: (352, 7)  Test shape: (88, 7)
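The `LabelEncoder` call in this step matters because the `class` column holds standardized floats after Step 2. The sketch below shows how the encoder maps each distinct float back to an integer label. The first two distinct values are taken from the processed sample above; the third (`1.082189`) and the names `y_scaled` and `le_demo` are hypothetical, added only to show a multi-class case.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Standardized target values; 1.082189 is a hypothetical extra class value
y_scaled = np.array([-0.262905, -0.262905, -1.607999, 1.082189, -1.607999])

le_demo = LabelEncoder()
y_encoded = le_demo.fit_transform(y_scaled)

# Distinct values are sorted, then numbered 0, 1, 2, ...
print("Encoded labels:", y_encoded)
print("Original values per label:", le_demo.classes_)
```

`le_demo.inverse_transform` would return the scaled floats, not the original class ids, so recovering the raw labels would also need the scaler's inverse transform.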


**Step 4: KNN Scratch Implementation**

In [33]:
from collections import Counter

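def euclidean_distance(x1, x2):
    # Straight-line (L2) distance between two feature vectors
    return np.sqrt(np.sum((x1 - x2) ** 2))

class KNN_Scratch:
    def __init__(self, k=3):
        self.k = k  # number of neighbours that vote

    def fit(self, X, y):
        # KNN is a lazy learner: "training" only stores the data
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        # Predict a label for each query row
        return np.array([self._predict(x) for x in X])

    def _predict(self, x):
        # Distance from the query point to every training point
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        # Indices of the k closest training points
        k_indices = np.argsort(distances)[:self.k]
        k_labels = [self.y_train[i] for i in k_indices]
        # Majority vote among the k neighbours
        most_common = Counter(k_labels).most_common(1)
        return most_common[0][0]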

# Train & Evaluate Scratch KNN
knn_scratch = KNN_Scratch(k=5)
knn_scratch.fit(X_train, y_train)
y_pred_scratch = knn_scratch.predict(X_test)

acc_scratch = np.mean(y_pred_scratch == y_test)
print("\nScratch KNN Accuracy:", acc_scratch)


Scratch KNN Accuracy: 0.7613636363636364
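The scratch class computes one distance at a time in a Python loop. As a minimal sketch of the same idea, the function below computes all query-to-train distances in one NumPy broadcasting step. The function name `knn_predict_vectorized` and the toy arrays are assumptions for illustration, not part of the notebook.

```python
from collections import Counter

import numpy as np

def knn_predict_vectorized(X_train, y_train, X_query, k=3):
    """Majority-vote KNN using a full pairwise distance matrix."""
    # Broadcasting gives differences of shape (n_query, n_train, n_features)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    # Squared Euclidean distances; the square root does not change the ranking
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k nearest training points for each query row
    nearest = np.argsort(sq_dists, axis=1, kind="stable")[:, :k]
    # Majority vote per query row
    return np.array(
        [Counter(y_train[row].tolist()).most_common(1)[0][0] for row in nearest]
    )

# Toy data: two well-separated clusters (hypothetical values)
X_toy_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [0.2, 0.1]])
y_toy_train = np.array([0, 0, 1, 1, 0])
X_toy_query = np.array([[0.05, 0.05], [5.1, 5.0]])

pred_toy = knn_predict_vectorized(X_toy_train, y_toy_train, X_toy_query, k=3)
print("Predictions:", pred_toy)
```

The distance matrix needs memory proportional to `n_query * n_train * n_features`, which is fine at this dataset's size (88 by 352 by 7) but would need chunking on much larger data.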


**Step 5: KNN using sklearn**

In [32]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

knn_sk = KNeighborsClassifier(n_neighbors=5)
knn_sk.fit(X_train, y_train)

y_pred_sk = knn_sk.predict(X_test)

print("Sklearn KNN Accuracy:", accuracy_score(y_test, y_pred_sk))
# print("\nClassification Report:\n", classification_report(y_test, y_pred_sk))


Sklearn KNN Accuracy: 0.7386363636363636
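Both models use k=5 with Euclidean distance, yet the accuracies differ (0.7614 from scratch, 0.7386 with sklearn). One possible explanation, not verified on this dataset, is tie handling: with more than two classes, five neighbours can split 2-2-1, and the two implementations may break that tie differently. The sketch below builds such a tie on toy data; `X_toy`, `y_toy`, and `query` are hypothetical.

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data: ordered by distance from the query at 0,
# the neighbour labels are 2, 2, 0, 0, 1 (classes 2 and 0 tie with two votes each)
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([2, 2, 0, 0, 1])
query = np.array([[0.0]])

# Scratch-style vote: Counter returns the tied label it encountered first,
# i.e. the label of the nearest neighbour among the tied classes
order = np.argsort(np.abs(X_toy[:, 0] - query[0, 0]))
counter_vote = Counter(y_toy[order].tolist()).most_common(1)[0][0]

# sklearn vote on the same data with the same k
sk_vote = KNeighborsClassifier(n_neighbors=5).fit(X_toy, y_toy).predict(query)[0]

print("Counter vote:", counter_vote)
print("sklearn vote:", sk_vote)
```

If the two printed votes differ, the same kind of disagreement could account for a few of the 88 test predictions; comparing `y_pred_scratch != y_pred_sk` on the real split would confirm or rule this out.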
