# Module 17 Practice Notebook: K-Nearest Neighbour (KNN)

**Dataset:** Breast Cancer Wisconsin (Diagnostic)  
**Type:** Binary classification, fully tabular, numeric features only

### What you will practice
- Proper KNN training and prediction workflow
- Using pipelines with feature scaling
- Model evaluation using classification metrics
- Choosing the value of K
- Comparing distance metrics and voting weights

✅ This is a **practice notebook**: fill in the TODOs


## 0) Setup

Run this cell first.


In [2]:
#Importing Necessary Libraries [Done for you]
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

np.random.seed(42)


## 1) Load the Dataset (Tabular)

Breast Cancer dataset:
- 30 numeric features
- Binary target: malignant vs benign
- Very common in real ML pipelines

### Task
Load the dataset and inspect:
- Shape of X
- Target labels


In [8]:
# TODO 1: Load the dataset
# Hint: data = load_breast_cancer()


# TODO 2: Set X and y
# Hint: X = data.data, y = data.target


# TODO 3: Print:
# - X shape
# - unique class labels
# - class names

# YOUR CODE HERE

data = load_breast_cancer()
X = data.data
y = data.target

## 2) Convert to DataFrame (Optional but Recommended)

Working with DataFrames helps interpretation and debugging.


In [5]:
# TODO: Convert X to a pandas DataFrame with feature names
# Hint: pd.DataFrame(X, columns=data.feature_names)

# YOUR CODE HERE
df = pd.DataFrame(X, columns= data.feature_names)
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


## 3) Train-Test Split

### Task
Split the dataset:
- test_size = 0.25
- random_state = 42
- stratify = y


In [9]:
# TODO: Create X_train, X_test, y_train, y_test using train_test_split

# YOUR CODE HERE
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.20, random_state=42, stratify=y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((455, 30), (114, 30), (455,), (114,))

## 4) Baseline KNN Model (With Scaling)

KNN relies on distance, so scaling is mandatory.

### Task
- Build a pipeline: StandardScaler → KNN
- Start with K = 5
- Fit the model
- Predict on test data
- Compute accuracy


In [31]:
# TODO: Create pipeline model
# Hint:
# model = Pipeline([
#   ("scaler", StandardScaler()),
#   ("knn", KNeighborsClassifier(n_neighbors=5))
# ])

# TODO: Fit model
# TODO: Predict on X_test
# TODO: Print accuracy

# YOUR CODE HERE
model = Pipeline([
    ('scaling', StandardScaler()),
    ('knd', KNeighborsClassifier(n_neighbors=11, metric='minkowski', p=1))
])

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print('accuracy_score', accuracy_score(y_test, y_pred))

accuracy_score 0.9736842105263158


## 5) Model Evaluation

### Task
Evaluate your model using:
- Confusion matrix
- Classification report

Think:
- Which class is harder to predict?
- Is false negative more dangerous here?


In [32]:
# TODO: Compute confusion matrix
# TODO: Print classification report

# Hints:
# confusion_matrix(y_test, y_pred)
# classification_report(y_test, y_pred)

# YOUR CODE HERE
print('Confusion Matrix: ')
print(confusion_matrix(y_test, y_pred))
print('Classification report: ')
print(classification_report(y_test, y_pred))

Confusion Matrix: 
[[39  3]
 [ 0 72]]
Classification report: 
              precision    recall  f1-score   support

           0       1.00      0.93      0.96        42
           1       0.96      1.00      0.98        72

    accuracy                           0.97       114
   macro avg       0.98      0.96      0.97       114
weighted avg       0.97      0.97      0.97       114



## 6) Optimization: Choosing the Best K

### Task
Try K values from 1 to 30.

Steps:
1. Loop over K
2. Train a pipeline for each K
3. Store accuracy
4. Plot accuracy vs K
5. Print best K and accuracy


In [46]:
# TODO: Sweep K values and store accuracy
# Hints:
# k_values = range(1, 31)
# accs = []
# for k in k_values:
#     model_k = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=k))])
#     model_k.fit(X_train, y_train)
#     pred_k = model_k.predict(X_test)
#     accs.append(accuracy_score(y_test, pred_k))

# TODO: Plot accuracy vs K
# TODO: Print best K and best accuracy

# YOUR CODE HERE
k_values = range(1, 31)
accs = []
for k in k_values:
    model_k = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=k))
    ])
    model_k.fit(X_train, y_train)
    pred_k = model_k.predict(X_test)
    # accs.append(accuracy_score(y_test, pred_k))
    accs.append(f'{k} : {accuracy_score(y_test, pred_k)}')

print(sorted(accs))


['1 : 0.9385964912280702', '10 : 0.9649122807017544', '11 : 0.9736842105263158', '12 : 0.9736842105263158', '13 : 0.9736842105263158', '14 : 0.9649122807017544', '15 : 0.9736842105263158', '16 : 0.9736842105263158', '17 : 0.9824561403508771', '18 : 0.9824561403508771', '19 : 0.9736842105263158', '2 : 0.9298245614035088', '20 : 0.9736842105263158', '21 : 0.9649122807017544', '22 : 0.9649122807017544', '23 : 0.956140350877193', '24 : 0.9649122807017544', '25 : 0.9649122807017544', '26 : 0.956140350877193', '27 : 0.9473684210526315', '28 : 0.956140350877193', '29 : 0.9473684210526315', '3 : 0.9824561403508771', '30 : 0.9473684210526315', '4 : 0.9473684210526315', '5 : 0.956140350877193', '6 : 0.956140350877193', '7 : 0.9736842105263158', '8 : 0.9736842105263158', '9 : 0.9736842105263158']


## 7) Try Different Distance Metrics and Weights

### Task
Using your best K:
Compare the following settings:
1. Euclidean distance (p=2), uniform weights
2. Manhattan distance (p=1), uniform weights
3. Euclidean distance (p=2), distance weights

Store results in a DataFrame.


In [47]:
# TODO: Compare different KNN settings
# Hints:
# settings = [
#   ("Euclidean uniform", KNeighborsClassifier(n_neighbors=best_k, metric="minkowski", p=2, weights="uniform")),
#   ("Manhattan uniform", KNeighborsClassifier(n_neighbors=best_k, metric="minkowski", p=1, weights="uniform")),
#   ("Euclidean distance-weighted", KNeighborsClassifier(n_neighbors=best_k, metric="minkowski", p=2, weights="distance"))
# ]

# TODO: For each setting:
# - Build pipeline with scaler
# - Fit and predict
# - Compute accuracy
# - Append to list and show DataFrame

# YOUR CODE HERE

settings = [
  ("Euclidean uniform", KNeighborsClassifier(n_neighbors=9, metric="minkowski", p=2, weights="uniform")),
  ("Manhattan uniform", KNeighborsClassifier(n_neighbors=9, metric="minkowski", p=1, weights="uniform")),
  ("Euclidean distance-weighted", KNeighborsClassifier(n_neighbors=9, metric="minkowski", p=2, weights="distance"))
]

rows = []
for name,knn in settings:
  model = Pipeline([("scaler", StandardScaler()), ("knn",knn)])
  model.fit(X_train, y_train)
  pred = model.predict(X_test)
  rows.append([name, accuracy_score(y_test,pred)])

pd.DataFrame(rows, columns=["Setting", "Accuracy"]).sort_values("Accuracy", ascending=False)



Unnamed: 0,Setting,Accuracy
0,Euclidean uniform,0.973684
1,Manhattan uniform,0.973684
2,Euclidean distance-weighted,0.973684


## 8) Scaling Reality Check (Critical Lesson)

### Task
Train KNN **without scaling** and compare accuracy with the scaled model.

This usually exposes why KNN without scaling is unreliable.


In [52]:
# TODO: Train KNN without scaling
# Hint:
# knn_raw = KNeighborsClassifier(n_neighbors=best_k)
# knn_raw.fit(X_train, y_train)
# pred_raw = knn_raw.predict(X_test)
# acc_raw = accuracy_score(y_test, pred_raw)

# TODO: Compare with scaled accuracy

# YOUR CODE HERE
knn_raw = KNeighborsClassifier(n_neighbors=9)
knn_raw.fit(X_train, y_train)
pred_raw = knn_raw.predict(X_test)
acc_raw = accuracy_score(y_test, pred_raw)
print(acc_raw)
print(accuracy_score(y_test, y_pred))

0.9385964912280702
0.9736842105263158


## 9) Reflection Questions

Write short answers.

1. Which K worked best and why?
2. Did distance-weighted voting help?
3. How much did scaling change the result?
4. Would you trust KNN for medical diagnosis? Why or why not?


**Your answers here:**

1.
2.
3.
4.
