# Task: KNN from Scratch (Diamonds Price Prediction)

**Goal:** Predict diamond `price` using KNN Regression implemented from scratch.

**Dataset:** diamonds.csv  
**Target column:** `price`  
**Feature columns:** `carat, cut, color, clarity, depth, table, x, y, z`

**Plan**
1. Load data
2. Split into train/test (75/25)
3. Preprocess:
   - One-hot encode the categorical features: `cut`, `color`, `clarity`
   - Standardize the numeric features
4. Implement KNN Regression from scratch and predict test set
5. Evaluate (MAE, RMSE, R²)
6. Train sklearn KNN Regressor and compare results


In [3]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'



import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

plt.style.use('bmh')
sns.set_style('darkgrid')

#### Step 1: Load data

In [4]:
df = pd.read_csv("diamonds.csv")
print(df.shape)
df.head()


(53940, 10)


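|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |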


#### Step 2: Identify input (X) and output (y)

- **y (target)** = `price`
- **X (features)** = all columns except `price`
- Categorical features: `cut`, `color`, `clarity`
- Numeric features: `carat`, `depth`, `table`, `x`, `y`, `z`


In [9]:
X = df.drop(columns=["price"])
y = df["price"]


#### Step 3: Train-test split 75/25

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)


(40455, 9) (13485, 9)
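
As a quick sanity check on the shapes printed above, the split sizes can be reproduced by hand. This is a minimal sketch that assumes `train_test_split` rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the training set; `n_rows` and `test_fraction` are illustrative names, not variables from the notebook.

```python
import math

n_rows = 53940          # rows in diamonds.csv, taken from df.shape
test_fraction = 0.25    # the test_size passed to train_test_split

# Assumption: the test count is the fraction rounded up; the rest is train.
n_test = math.ceil(test_fraction * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```

Both counts agree with the `(40455, 9) (13485, 9)` output, so no rows were lost in the split.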


#### Step 4: Preprocess X_train (fit on train)


In [14]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

cat_cols = ["cut", "color", "clarity"]
num_cols = [c for c in X_train.columns if c not in cat_cols]


In [15]:
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)
    ]
)
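
To make the `"cat"` branch less of a black box, here is a rough sketch of what one-hot encoding with `handle_unknown="ignore"` does conceptually: categories are learned from the training column, and a value not seen during fitting becomes an all-zero row. The helper `one_hot` and the toy lists are hypothetical and only illustrate the idea; this is not sklearn's implementation.

```python
import numpy as np

def one_hot(values, categories):
    """Encode each value as a 0/1 row; values not in `categories` become all zeros."""
    encoded = np.zeros((len(values), len(categories)))
    for row, value in enumerate(values):
        if value in categories:
            encoded[row, categories.index(value)] = 1.0
    return encoded

# Categories are learned from the training column only (sorted for a stable order).
train_cut = ["Ideal", "Premium", "Good", "Ideal"]
categories = sorted(set(train_cut))    # ['Good', 'Ideal', 'Premium']

test_cut = ["Premium", "Fair"]         # "Fair" never appeared in train_cut
encoded_test = one_hot(test_cut, categories)
print(encoded_test)
```

The second test row is all zeros, so an unseen category contributes nothing to the distance on those columns instead of raising an error.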


In [16]:
# Fit ONLY on train, then transform train
X_train_p = preprocess.fit_transform(X_train)

#### Step 5: Preprocess X_test (transform test)


In [17]:
X_test_p = preprocess.transform(X_test)
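
The reason for calling `transform` (not `fit_transform`) on the test set can be seen in a toy standardization done by hand. The sketch below uses made-up one-column arrays; `train`, `test`, `mu`, and `sigma` are illustrative names. The scaling statistics are computed from the training rows only and then reused on the test rows.

```python
import numpy as np

train = np.array([[1.0], [2.0], [3.0]])   # toy training column
test = np.array([[4.0]])                  # toy test column

# Statistics come from the training data only.
mu = train.mean(axis=0)
sigma = train.std(axis=0)    # population std (ddof=0), which StandardScaler also uses

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma         # reuses the train statistics, no refitting

print(train_scaled.ravel(), test_scaled.ravel())
```

The scaled training column has mean 0 and standard deviation 1, while the test value lands wherever the training statistics put it. Refitting on the test set would leak information about the test distribution into preprocessing.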


#### Step 6: KNN Regression from scratch

In [18]:
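# Convert to NumPy arrays.
# This relies on `preprocess` returning a dense array; a SciPy sparse
# result would need .toarray() instead of np.array().

X_train_arr = np.array(X_train_p)
X_test_arr  = np.array(X_test_p)
y_train_arr = np.array(y_train)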


In [19]:
# Distance from one test row to every training row

def distance_from_all_train(X_train_arr, one_test_row):
    # Euclidean distance
    return np.sqrt(np.sum((X_train_arr - one_test_row)**2, axis=1))
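
A small worked example may help confirm the broadcasting in the distance function. The function body is repeated so the snippet runs on its own; `toy_train` and `toy_row` are made-up values chosen so the answers are easy to check by hand (3-4-5 triangles).

```python
import numpy as np

def distance_from_all_train(X_train_arr, one_test_row):
    # Broadcasting subtracts the single test row from every training row.
    return np.sqrt(np.sum((X_train_arr - one_test_row) ** 2, axis=1))

toy_train = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
toy_row = np.array([0.0, 0.0])

toy_distances = distance_from_all_train(toy_train, toy_row)
print(toy_distances)   # one distance per training row: 0, 5 and 10
```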


In [20]:
# Predict one test row: average the targets of its k nearest training rows

def predict_one(X_train_arr, y_train_arr, one_test_row, k):
    d = distance_from_all_train(X_train_arr, one_test_row)
    nearest_k_index = np.argsort(d)[:k]
    return np.mean(y_train_arr[nearest_k_index])
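
The three steps in `predict_one` (distances, `argsort`, mean) can be traced on a one-feature toy problem. All values below are invented for illustration and do not come from the diamonds data.

```python
import numpy as np

toy_X = np.array([[0.0], [1.0], [2.0], [10.0]])
toy_y = np.array([0.0, 10.0, 20.0, 100.0])
query = np.array([1.2])
k = 3

distances = np.sqrt(np.sum((toy_X - query) ** 2, axis=1))  # approx. [1.2, 0.2, 0.8, 8.8]
nearest = np.argsort(distances)[:k]                         # indices 1, 2, 0 (nearest first)
prediction = np.mean(toy_y[nearest])                        # (10 + 20 + 0) / 3

print(nearest, prediction)
```

The far-away point at `x = 10` is excluded, so its target of 100 has no influence on the prediction.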


In [21]:
# Predict all test rows

def predict_all(X_train_arr, y_train_arr, X_test_arr, k):
    preds = []
    for i in range(len(X_test_arr)):
        preds.append(predict_one(X_train_arr, y_train_arr, X_test_arr[i], k))
    return np.array(preds)
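
The row-by-row loop is easy to read but slow on large test sets. One possible alternative, sketched here and not part of the original notebook, computes squared distances for a batch of test rows at once using the identity ‖a − b‖² = ‖a‖² + ‖b‖² − 2a·b. The function name `predict_all_vectorized` and the `batch_size` parameter are hypothetical. Because the square root is monotonic, ranking by squared distance selects the same neighbors, although floating-point rounding could change the choice when two distances are nearly equal.

```python
import numpy as np

def predict_all_vectorized(X_train_arr, y_train_arr, X_test_arr, k, batch_size=256):
    """Hypothetical batched variant of predict_all (uniform-weight KNN regression)."""
    preds = np.empty(len(X_test_arr))
    train_sq = np.sum(X_train_arr ** 2, axis=1)          # ||a||^2 per training row
    for start in range(0, len(X_test_arr), batch_size):
        batch = X_test_arr[start:start + batch_size]
        batch_sq = np.sum(batch ** 2, axis=1)            # ||b||^2 per test row
        # Squared distances, shape (rows in batch, number of training rows)
        d2 = batch_sq[:, None] + train_sq[None, :] - 2.0 * (batch @ X_train_arr.T)
        # Indices of the k smallest squared distances per row (in no particular order)
        nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
        preds[start:start + batch_size] = y_train_arr[nearest].mean(axis=1)
    return preds

toy_X = np.array([[0.0], [1.0], [2.0], [10.0]])
toy_y = np.array([0.0, 10.0, 20.0, 100.0])
toy_queries = np.array([[1.2], [9.0]])
toy_preds = predict_all_vectorized(toy_X, toy_y, toy_queries, k=3)
print(toy_preds)
```

Batching keeps the distance matrix small enough to fit in memory. The order of neighbors does not matter here because the prediction is a plain mean.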


In [24]:
# Work on a subset: the first 5,000 training rows and the first 500 test rows
X_train_small = X_train_arr[:5000]
y_train_small = y_train_arr[:5000]
X_test_small  = X_test_arr[:500]

k = 5
y_pred_scratch = predict_all(X_train_small, y_train_small, X_test_small, k)
y_pred_scratch

array([  765.2,  2199. ,  1173.2,  1210.8,  7065.2,  3782.4,  1421.2,
        1962.4,  2019.6,  6316.4,   955.6, 10967.6,  2080.2,  1951.8,
        1208.2,  9819.4,  3289.4,  1173.2, 10937.6,   545.4, 12967.8,
         790.8,   669.4,   653.4,  3743.2,  1654.6,  1044.6, 15259.2,
        9055.2,   652. ,  4330. ,  1813.6, 11125.4,  1633.6,  1291. ,
         561.6,   623.8,  2179.8,  1657.2, 13031. ,  2219. ,  2649.8,
         697.2, 15356.8,  2136. ,   848.2, 11975.4, 10346.8, 14734.2,
        7525.8,   989.6, 10534.4,  2698.2,  5845.2,  4252. ,  5703.2,
        5256.2,   565.8,  4195.2,  1676. ,  4213. ,   622. , 15134.8,
        3838.4,   863.4,   702.2,  2879.8,  1726.6,  1725.2,  3201.4,
        4100. ,   547.8,  4595.8, 17032.6,  1468.2,  1113.2, 15183.8,
        2690.6, 11329.6,  7984.6,   997.8,   875.2,  1799. ,   789. ,
        3900.8,   993. ,   911.2,  3974.4, 10120.4,  2516.4,  5414. ,
        2544.4,  1250.4,  3190.4,  4790.6,   510.6,   768.8,   947.4,
        4084.2,  166...   (output truncated here)

In [29]:
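# Find the indices of the k nearest training rows for test row 0.
# Note: this cell searches the full training set (X_train_arr), not the 5,000-row subset,
# and np.argpartition returns the k smallest distances in no particular order.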

k = 5
d = np.sqrt(np.sum((X_train_arr - X_test_arr[0])**2, axis=1))  # distances for test row 0
nearest_k_index = np.argpartition(d, k-1)[:k]
nearest_k_index


array([28832,  6114, 37452,  2585, 20347])
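
The cell above uses `np.argpartition`, while `predict_one` uses `np.argsort`. A short comparison on made-up distances (`toy_d` is illustrative) shows that both select the same k indices, but only `argsort` guarantees nearest-first order.

```python
import numpy as np

toy_d = np.array([0.9, 0.1, 0.5, 0.3, 0.7])
k = 3

by_sort = np.argsort(toy_d)[:k]                    # ordered nearest first: 1, 3, 2
by_partition = np.argpartition(toy_d, k - 1)[:k]   # same three indices, order not guaranteed

print(sorted(by_sort.tolist()), sorted(by_partition.tolist()))
```

For a uniform-weight average the order is irrelevant, so `argpartition` is a reasonable choice when only the set of neighbors is needed.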

#### Step 7: Evaluate your model

In [39]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


In [41]:
# Take the first 500 test targets so they line up with the 500 predictions

y_test_small = y_test.iloc[:500]   # same rows, in the same order, as X_test_small
y_test_small


1388       559
50052     2201
41645     1238
42377     1304
17244     6901
         ...  
51121     2339
22581    10633
7133      4174
34702      874
14951     6021
Name: price, Length: 500, dtype: int64

In [42]:
mae = mean_absolute_error(y_test_small, y_pred_scratch)

mse = mean_squared_error(y_test_small, y_pred_scratch)
rmse = np.sqrt(mse)

r2 = r2_score(y_test_small, y_pred_scratch)

print("Scratch KNN Evaluation (500 test rows)")
print("MAE  :", mae)
print("RMSE :", rmse)
print("R2   :", r2)


Scratch KNN Evaluation (500 test rows)
MAE  : 602.1132
RMSE : 1106.789440029132
R2   : 0.9342795928875387
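
In keeping with the from-scratch theme, the three metrics can also be computed directly with NumPy. This is a minimal sketch using the standard definitions (MAE as the mean absolute error, RMSE as the square root of the mean squared error, R² as one minus the ratio of residual to total sum of squares). The helper `regression_metrics` and the toy values are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    ss_res = np.sum(errors ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Errors are 1, 0 and -2, so MAE = 1, MSE = 5/3 and R^2 = 1 - 5/8.
toy_mae, toy_rmse, toy_r2 = regression_metrics([3, 5, 7], [2, 5, 9])
print(toy_mae, toy_rmse, toy_r2)
```

In the notebook, calling such a helper as `regression_metrics(y_test_small, y_pred_scratch)` should agree with the sklearn metric functions up to floating-point rounding.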


#### Step 8: Train sklearn KNN + compare

In [43]:
from sklearn.neighbors import KNeighborsRegressor

k = 5
knn_model = KNeighborsRegressor(n_neighbors=k)   # default distance = Euclidean

knn_model.fit(X_train_small, y_train_small)


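`KNeighborsRegressor` parameters:

| Parameter | Value |
|---|---|
| n_neighbors | 5 |
| weights | 'uniform' |
| algorithm | 'auto' |
| leaf_size | 30 |
| p | 2 |
| metric | 'minkowski' |
| metric_params | None |
| n_jobs | None |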


In [44]:
# Predict on the same small test

y_pred_sklearn = knn_model.predict(X_test_small)
y_pred_sklearn[:10]


array([ 765.2, 2199. , 1173.2, 1210.8, 7065.2, 3782.4, 1421.2, 1962.4,
       2019.6, 6316.4])
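
Matching summary metrics do not strictly prove that every prediction matches, so an element-wise check is a useful complement. The sketch below uses the first ten values printed in the two outputs above; the array names are illustrative. In the notebook itself, the equivalent check would be `np.allclose(y_pred_scratch, y_pred_sklearn)`.

```python
import numpy as np

# First ten predictions as printed in the notebook outputs.
scratch_first10 = np.array([765.2, 2199.0, 1173.2, 1210.8, 7065.2,
                            3782.4, 1421.2, 1962.4, 2019.6, 6316.4])
sklearn_first10 = np.array([765.2, 2199.0, 1173.2, 1210.8, 7065.2,
                            3782.4, 1421.2, 1962.4, 2019.6, 6316.4])

max_abs_diff = np.max(np.abs(scratch_first10 - sklearn_first10))
all_match = np.allclose(scratch_first10, sklearn_first10)
print(max_abs_diff, all_match)
```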

In [45]:
# Evaluate sklearn KNN on the same 500 test rows
# (the metric functions and NumPy are already imported above)

mae_sk = mean_absolute_error(y_test_small, y_pred_sklearn)
mse_sk = mean_squared_error(y_test_small, y_pred_sklearn)
rmse_sk = np.sqrt(mse_sk)
r2_sk = r2_score(y_test_small, y_pred_sklearn)

print("Sklearn KNN Evaluation (500 test rows)")
print("MAE  :", mae_sk)
print("RMSE :", rmse_sk)
print("R2   :", r2_sk)


Sklearn KNN Evaluation (500 test rows)
MAE  : 602.1132
RMSE : 1106.789440029132
R2   : 0.9342795928875387


In [46]:
# Compare scratch vs sklearn

print("---- Comparison (Scratch vs Sklearn) ----")
print("MAE  :", mae, "vs", mae_sk)
print("RMSE :", rmse, "vs", rmse_sk)
print("R2   :", r2, "vs", r2_sk)


---- Comparison (Scratch vs Sklearn) ----
MAE  : 602.1132 vs 602.1132
RMSE : 1106.789440029132 vs 1106.789440029132
R2   : 0.9342795928875387 vs 0.9342795928875387


### Step 8 Observation

- Scratch KNN and sklearn `KNeighborsRegressor` produced **exactly the same** MAE, RMSE, and R² for `k = 5` on the same 5,000 training rows and 500 test rows, and the first ten predictions match value for value. This indicates that the scratch implementation reproduces sklearn's behavior on this subset.
- This is expected: with the default `weights='uniform'` and Euclidean distance (`metric='minkowski'`, `p=2`), sklearn predicts the mean of the target values of the k nearest neighbors, which is the same logic used in the scratch code.
- Differences would typically come from changing the distance metric or the weighting (for example `weights='distance'`), from using different preprocessing, or from ties in distance being broken differently.
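
To illustrate how a different weighting would change the predictions, here is a rough sketch of inverse-distance weighting, the idea behind `weights='distance'`. The function `predict_one_weighted` is hypothetical and not sklearn's code, and its handling of exact matches (averaging the zero-distance neighbors) is an assumption of this sketch. The toy data is invented.

```python
import numpy as np

def predict_one_weighted(X_train_arr, y_train_arr, one_test_row, k):
    """Hypothetical inverse-distance-weighted variant of predict_one."""
    d = np.sqrt(np.sum((X_train_arr - one_test_row) ** 2, axis=1))
    nearest = np.argsort(d)[:k]
    d_k, y_k = d[nearest], y_train_arr[nearest]
    if np.any(d_k == 0):
        # An exact match would get infinite weight; average the exact matches instead.
        return np.mean(y_k[d_k == 0])
    weights = 1.0 / d_k                      # closer neighbors count more
    return np.sum(weights * y_k) / np.sum(weights)

toy_X = np.array([[0.0], [1.0], [2.0], [10.0]])
toy_y = np.array([0.0, 10.0, 20.0, 100.0])
query = np.array([1.2])

uniform_pred = np.mean(toy_y[np.argsort(np.abs(toy_X[:, 0] - query[0]))[:3]])
weighted_pred = predict_one_weighted(toy_X, toy_y, query, k=3)
print(uniform_pred, weighted_pred)
```

The same three neighbors are selected in both cases, but the weighted prediction shifts toward the target of the closest point, so the two models would no longer agree exactly.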
