# John's Story — Learning with KNN 📘

John has just moved to a new, high-performing school.  
He’s worried about whether he’ll pass his exams.  

To understand his chances, John looks at other students with similar study and sleep habits.  
We’ll use the **k-Nearest Neighbors (KNN)** algorithm to see how he compares.


## The Classmates

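We have toy data on four students. Each has a Pass/Fail label and a grade percentage; these are separate targets (one for classification, one for regression) and, being toy values, do not always agree (Dan is labeled Fail despite 90%):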

- **Alice**: studied 2 hours, slept 9 → Fail (55%)
- **Ben**: studied 4 hours, slept 8 → Pass (70%)
- **Cara**: studied 6 hours, slept 5 → Pass (80%)
- **Dan**: studied 8 hours, slept 3 → Fail (90%)

John: studied 5 hours, slept 7 → **???**

Our goal: predict John’s outcome using KNN.


In [5]:
# Features: [hours_studied, hours_slept]
X = [
    [2, 9],  # Alice
    [4, 8],  # Ben
    [6, 5],  # Cara
    [8, 3],  # Dan
]

students = ["Alice", "Ben", "Cara", "Dan"]

# Classification labels (Pass=1, Fail=0)
y_class = [0, 1, 1, 0]  # (toy labels)

# Regression labels (grades %)
y_reg = [55, 70, 80, 90]

# John (new student)
john = [5, 7]

X, y_class, y_reg, john

([[2, 9], [4, 8], [6, 5], [8, 3]], [0, 1, 1, 0], [55, 70, 80, 90], [5, 7])

## Step 1: Distances

John learns that to compare himself to classmates,  
he needs to calculate the **Euclidean distance** — the straight-line distance between his habits and theirs.

This is where the vector math he studied earlier pays off:
- Subtract coordinates → get a difference vector
- Find its magnitude → the distance
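
As a worked instance of the two steps above, here is the distance from John (5, 7) to Ben (4, 8), using the values from the dataset. The symbols $\mathbf{j}$ and $\mathbf{b}$ are notation introduced here for illustration only:

```latex
\mathbf{j} - \mathbf{b} = (5 - 4,\; 7 - 8) = (1,\; -1)
\qquad
d(\mathbf{j}, \mathbf{b}) = \lVert \mathbf{j} - \mathbf{b} \rVert
  = \sqrt{1^2 + (-1)^2} = \sqrt{2} \approx 1.4142
```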

In [6]:
import math

for name, p in zip(students, X):
    dx = john[0] - p[0]
    dy = john[1] - p[1]
    d  = math.sqrt(dx*dx + dy*dy)
    print(f"John{john} -> {name}{p}: dx={dx}, dy={dy}, distance={d:.4f}")


John[5, 7] -> Alice[2, 9]: dx=3, dy=-2, distance=3.6056
John[5, 7] -> Ben[4, 8]: dx=1, dy=-1, distance=1.4142
John[5, 7] -> Cara[6, 5]: dx=-1, dy=2, distance=2.2361
John[5, 7] -> Dan[8, 3]: dx=-3, dy=4, distance=5.0000


## Step 2: Pass or Fail?

John uses **k=3 neighbors**.  
He looks at his 3 closest classmates:
- Ben → Pass
- Cara → Pass
- Alice → Fail

2 out of 3 are Pass → John is predicted to **Pass** ✅


In [7]:
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors a and b."""
    return math.sqrt(sum((ai - bi)**2 for ai, bi in zip(a, b)))

def k_nearest_neighbors(X, x_new, k=3):
    """Return list of (distance, index) for the k closest points in X to x_new."""
    dists = [(euclidean(x, x_new), i) for i, x in enumerate(X)]
    dists.sort(key=lambda t: t[0])
    return dists[:k]

def knn_classify(X, y_class, x_new, k=3):
    """Majority vote among k nearest neighbors. Tie-break by the single closest neighbor."""
    neigh = k_nearest_neighbors(X, x_new, k)
    labels = [y_class[i] for (_, i) in neigh]
    counts = Counter(labels).most_common()
    # tie → pick label of the closest neighbor
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return y_class[neigh[0][1]]
    return counts[0][0]

def knn_regress_mean(X, y_reg, x_new, k=3):
    """Unweighted mean of neighbor targets."""
    neigh = k_nearest_neighbors(X, x_new, k)
    vals = [y_reg[i] for (_, i) in neigh]
    return sum(vals) / len(vals)

def knn_regress_weighted(X, y_reg, x_new, k=3):
    """Distance-weighted average with weights = 1/d. Exact match returns that label."""
    neigh = k_nearest_neighbors(X, x_new, k)
    num, den = 0.0, 0.0
    for (d, i) in neigh:
        if d == 0:
            return y_reg[i]
        w = 1.0 / d
        num += w * y_reg[i]
        den += w
    return num / den


In [8]:
k = 3

# Find k nearest neighbors to John
neigh = k_nearest_neighbors(X, john, k=k)

# Pretty print neighbor details
print("k =", k)
print("Neighbors (name, features, class, distance):")
for (d, i) in neigh:
    name = students[i]
    feats = X[i]
    label = y_class[i]  # 1=Pass, 0=Fail
    print(f"  {name:>5}  {feats}  class={label}  dist={d:.4f}")

# Predict class by majority vote
pred_class = knn_classify(X, y_class, john, k=k)
print("\nPrediction (Pass=1, Fail=0):", pred_class)


k = 3
Neighbors (name, features, class, distance):
    Ben  [4, 8]  class=1  dist=1.4142
   Cara  [6, 5]  class=1  dist=2.2361
  Alice  [2, 9]  class=0  dist=3.6056

Prediction (Pass=1, Fail=0): 1
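
To see how the vote depends on `k`, the sketch below tallies the neighbor labels for k = 1 to 4 on the same toy data. The helper `votes_for_k` and the `tallies` dictionary are hypothetical names introduced for this illustration; they are not part of the notebook's own functions.

```python
import math
from collections import Counter

# Toy data copied from the cells above: [hours_studied, hours_slept]
X = [[2, 9], [4, 8], [6, 5], [8, 3]]
y_class = [0, 1, 1, 0]  # Pass=1, Fail=0
john = [5, 7]

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def votes_for_k(X, y, x_new, k):
    """Tally the labels of the k nearest neighbors (hypothetical helper)."""
    order = sorted(range(len(X)), key=lambda i: euclidean(X[i], x_new))
    return Counter(y[i] for i in order[:k])

# One tally per k; a Counter returns 0 for labels that received no votes
tallies = {k: votes_for_k(X, y_class, john, k) for k in range(1, 5)}
for k, tally in tallies.items():
    print(f"k={k}: Pass votes={tally[1]}, Fail votes={tally[0]}")
```

With k=4 the vote is tied 2 to 2, which is the case where `knn_classify` above falls back to the label of the single closest neighbor (Ben, Pass).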


## Step 3: What Grade Might John Get?

Instead of just Pass/Fail, we can predict a grade from the same three neighbors’ percentages:

- Ben: 70
- Cara: 80
- Alice: 55

**Simple average:** (70 + 80 + 55) / 3 ≈ 68.3%  
**Weighted by distance (weight = 1/distance):** closer classmates count more → ≈ 70.2%

John feels reassured: he’s likely to pass comfortably, with room to improve.
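
The distance-weighted figure can be reproduced by hand. With weights $w_i = 1/d_i$ and the distances from Step 1 (Ben 1.4142, Cara 2.2361, Alice 3.6056), the arithmetic is roughly as follows; $\hat{y}$, $w_i$, and $y_i$ are notation introduced here:

```latex
\hat{y}
= \frac{\sum_i w_i\, y_i}{\sum_i w_i}
= \frac{\tfrac{70}{1.4142} + \tfrac{80}{2.2361} + \tfrac{55}{3.6056}}
       {\tfrac{1}{1.4142} + \tfrac{1}{2.2361} + \tfrac{1}{3.6056}}
\approx \frac{49.50 + 35.78 + 15.25}{0.7071 + 0.4472 + 0.2774}
\approx \frac{100.53}{1.4317}
\approx 70.2
```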


In [9]:
# Predict John's grade with KNN regression
pred_reg_mean = knn_regress_mean(X, y_reg, john, k=3)
pred_reg_weighted = knn_regress_weighted(X, y_reg, john, k=3)

# Show neighbor details for regression
neigh_reg = k_nearest_neighbors(X, john, k=3)
print("Neighbors (name, grade, distance):")
for (d, i) in neigh_reg:
    print(f"  {students[i]:>5}  grade={y_reg[i]}  dist={d:.4f}")

print(f"\nUnweighted mean prediction: {pred_reg_mean:.2f}%")
print(f"Distance-weighted prediction: {pred_reg_weighted:.2f}%")


Neighbors (name, grade, distance):
    Ben  grade=70  dist=1.4142
   Cara  grade=80  dist=2.2361
  Alice  grade=55  dist=3.6056

Unweighted mean prediction: 68.33%
Distance-weighted prediction: 70.22%


## Step 4: Beyond Study and Sleep

John notices classmates who exercise, manage stress, and eat well often do even better.  
So he extends his dataset with new features:
- Hours of exercise per week
- Stress score (0–10)
- Nutrition quality (0–10)

Now each student is a point in a five-dimensional feature space.  
KNN still works the same way: the distance simply sums squared differences over more features.
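
For reference, the distance formula generalizes to any number of features by adding one squared difference per feature. The worked case uses the toy extended values from the next cell for John (5, 7, 2, 5, 5) and Ben (4, 8, 3, 5, 6); the symbols $\mathbf{a}$, $\mathbf{b}$, and $n$ are notation introduced here:

```latex
d(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
\qquad
d(\text{John}, \text{Ben}) = \sqrt{1^2 + (-1)^2 + (-1)^2 + 0^2 + (-1)^2} = \sqrt{4} = 2
```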


In [10]:
# Extended features: [study, sleep, exercise_hours, stress_0to10, nutrition_0to10]
# (Toy values — tweak as you like)
X_ext = [
    [2, 9, 1, 7, 4],  # Alice: low exercise, high stress, modest nutrition
    [4, 8, 3, 5, 6],  # Ben
    [6, 5, 5, 4, 7],  # Cara
    [8, 3, 4, 6, 5],  # Dan
]

# Keep the same grade labels for simplicity
y_reg_ext = y_reg[:]  # [55, 70, 80, 90]

# John's extended features (edit as you like)
john_ext = [5, 7, 2, 5, 5]

# Use k=3 neighbors
k = 3

# Show nearest neighbors in the extended space
neigh_ext = k_nearest_neighbors(X_ext, john_ext, k=k)
print("Extended neighbors (distance, name, features, grade):")
for (d, i) in neigh_ext:
    print(f"  dist={d:.4f}  {students[i]:>5}  {X_ext[i]}  grade={y_reg_ext[i]}")

# Predict grade with mean and distance-weighted averages in higher-D
pred_ext_mean = knn_regress_mean(X_ext, y_reg_ext, john_ext, k=k)
pred_ext_weighted = knn_regress_weighted(X_ext, y_reg_ext, john_ext, k=k)

print(f"\nExtended regression (mean):     {pred_ext_mean:.2f}%")
print(f"Extended regression (weighted): {pred_ext_weighted:.2f}%")


Extended neighbors (distance, name, features, grade):
  dist=2.0000    Ben  [4, 8, 3, 5, 6]  grade=70
  dist=4.3589  Alice  [2, 9, 1, 7, 4]  grade=55
  dist=4.3589   Cara  [6, 5, 5, 4, 7]  grade=80

Extended regression (mean):     68.33%
Extended regression (weighted): 68.80%


## Step 5: Scaling Matters

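John realizes that if one feature is measured on a much larger scale  
(e.g., exam prep time in minutes vs. a stress score of 0–10),  
its differences dominate the Euclidean distance and the other features barely count.

That’s why we use **normalization**: rescaling each feature to a comparable range before measuring distances.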
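
A minimal sketch of the effect, using hypothetical values that are not part of John's dataset: three reference points with a stress score and a prep time in minutes, plus one query. The helper names (`nearest_index`, `minmax_fit`, `minmax_apply`) are made up for this example, and the scaling bounds are learned from the reference points only, which is one common choice rather than the only one.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nearest_index(points, query):
    """Index of the point closest to query."""
    return min(range(len(points)), key=lambda i: euclidean(points[i], query))

def minmax_fit(points):
    """Per-column (min, max) learned from the reference points only."""
    return [(min(col), max(col)) for col in zip(*points)]

def minmax_apply(row, bounds):
    """Scale one row with previously learned bounds; constant columns map to 0.0."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0
            for v, (lo, hi) in zip(row, bounds)]

# Hypothetical values: [stress_0to10, prep_minutes]
points = [[2, 300], [9, 320], [3, 600]]
query = [3, 350]

# Raw distances are dominated by the minutes column
nearest_raw = nearest_index(points, query)

# After min-max scaling, both columns span [0, 1] on the reference points
bounds = minmax_fit(points)
points_scaled = [minmax_apply(p, bounds) for p in points]
query_scaled = minmax_apply(query, bounds)
nearest_scaled = nearest_index(points_scaled, query_scaled)

print("Nearest before scaling:", nearest_raw)
print("Nearest after scaling: ", nearest_scaled)
```

Before scaling, a 30-minute gap in prep time outweighs a 6-point gap in stress, so the high-stress point (index 1) looks closest. After scaling, the point with similar stress and prep time (index 0) is the nearest.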


In [11]:
def minmax_scale_columns(M):
    """Column-wise min-max scaling to [0,1]."""
    cols = list(zip(*M))
    scaled_cols = []
    for col in cols:
        cmin, cmax = min(col), max(col)
        if cmax == cmin:
            scaled_cols.append([0.0]*len(col))
        else:
            scaled_cols.append([(v - cmin)/(cmax - cmin) for v in col])
    return [list(row) for row in zip(*scaled_cols)]

# Example: exaggerate one feature (add exam prep minutes, huge numbers)
X_ext_unscaled = [row[:] for row in X_ext]
john_ext_unscaled = john_ext[:]

# Add a large-scale feature (toy values). Every student gets the same value,
# so before scaling it inflates all distances to John equally and the
# neighbor ranking stays the same as in the previous cell.
for row in X_ext_unscaled:
    row.append(5000)
john_ext_unscaled = john_ext_unscaled + [4500]

# Neighbors BEFORE scaling
neigh_before = k_nearest_neighbors(X_ext_unscaled, john_ext_unscaled, k=3)

# Apply min-max scaling (fit on the students and John together, for simplicity)
X_scaled = minmax_scale_columns(X_ext_unscaled + [john_ext_unscaled])
X_ext_scaled, john_ext_scaled = X_scaled[:-1], X_scaled[-1]

# Neighbors AFTER scaling
neigh_after = k_nearest_neighbors(X_ext_scaled, john_ext_scaled, k=3)

print("Neighbors BEFORE scaling (distance, index):", [(round(d,4), i) for (d,i) in neigh_before])
print("Neighbors AFTER  scaling (distance, index):", [(round(d,4), i) for (d,i) in neigh_after])


Neighbors BEFORE scaling (distance, index): [(500.004, 1), (500.019, 0), (500.019, 2)]
Neighbors AFTER  scaling (distance, index): [(1.1087, 1), (1.4068, 0), (1.4337, 3)]


## Conclusion

Through KNN, John saw:
- How his study/sleep compared to classmates
- That he’s on track to pass (~70%)
- That adding lifestyle factors could refine predictions

Most importantly, he discovered that math he’d learned earlier — vectors, magnitudes, distances —  
was the key to making sense of it all.

As a cross-check, the cells below reproduce the 2D results with scikit-learn, then rerun the extended regression with standardized features.


In [12]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Convert to NumPy
X_np   = np.array(X)        # [[2,9],[4,8],[6,5],[8,3]]
yC_np  = np.array(y_class)  # [0,1,1,0]
yR_np  = np.array(y_reg)    # [55,70,80,90]
john_np = np.array(john).reshape(1, -1)  # [[5,7]]

k = 3

# --- Classification ---
clf = KNeighborsClassifier(n_neighbors=k)
clf.fit(X_np, yC_np)
pred_class = clf.predict(john_np)[0]
proba = clf.predict_proba(john_np)[0]  # [P(Fail), P(Pass)]
dists, idxs = clf.kneighbors(john_np, n_neighbors=k, return_distance=True)

print("=== KNN Classification (2D) ===")
print("Prediction (Pass=1, Fail=0):", pred_class)
print("Class probabilities [Fail, Pass]:", proba)
print("Neighbors (distance, name, features, class):")
for d, i in zip(dists[0], idxs[0]):
    print(f"  {d:.4f}  {students[i]:>5}  {X[i]}  class={y_class[i]}")

# --- Regression ---
reg = KNeighborsRegressor(n_neighbors=k)
reg.fit(X_np, yR_np)
pred_grade = reg.predict(john_np)[0]
dists_r, idxs_r = reg.kneighbors(john_np, n_neighbors=k, return_distance=True)

print("\n=== KNN Regression (2D) ===")
print(f"Predicted grade (%): {pred_grade:.2f}")
print("Neighbors (distance, name, features, grade):")
for d, i in zip(dists_r[0], idxs_r[0]):
    print(f"  {d:.4f}  {students[i]:>5}  {X[i]}  grade={y_reg[i]}")


=== KNN Classification (2D) ===
Prediction (Pass=1, Fail=0): 1
Class probabilities [Fail, Pass]: [0.33333333 0.66666667]
Neighbors (distance, name, features, class):
  1.4142    Ben  [4, 8]  class=1
  2.2361   Cara  [6, 5]  class=1
  3.6056  Alice  [2, 9]  class=0

=== KNN Regression (2D) ===
Predicted grade (%): 68.33
Neighbors (distance, name, features, grade):
  1.4142    Ben  [4, 8]  grade=70
  2.2361   Cara  [6, 5]  grade=80
  3.6056  Alice  [2, 9]  grade=55


In [13]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Use your extended feature matrix X_ext and john_ext from earlier
X_ext_np   = np.array(X_ext)
yR_ext_np  = np.array(y_reg)  # same grades for simplicity
john_ext_np = np.array(john_ext).reshape(1, -1)

k = 3

# Pipeline: Standardize features -> KNN
reg_ext = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsRegressor(n_neighbors=k))
])
reg_ext.fit(X_ext_np, yR_ext_np)

pred_ext = reg_ext.predict(john_ext_np)[0]

# To see neighbors/distances after scaling, access the fitted KNN and scaled arrays
X_ext_scaled = reg_ext.named_steps["scale"].transform(X_ext_np)
john_ext_scaled = reg_ext.named_steps["scale"].transform(john_ext_np)
knn_model = reg_ext.named_steps["knn"]
dists_ext, idxs_ext = knn_model.kneighbors(john_ext_scaled, n_neighbors=k, return_distance=True)

print("=== KNN Regression (Extended + Scaled) ===")
print(f"Predicted grade (%): {pred_ext:.2f}")
print("Neighbors in scaled space (distance, name, features, grade):")
for d, i in zip(dists_ext[0], idxs_ext[0]):
    print(f"  {d:.4f}  {students[i]:>5}  {X_ext[i]}  grade={y_reg[i]}")


=== KNN Regression (Extended + Scaled) ===
Predicted grade (%): 71.67
Neighbors in scaled space (distance, name, features, grade):
  1.2779    Ben  [4, 8, 3, 5, 6]  grade=70
  2.6383  Alice  [2, 9, 1, 7, 4]  grade=55
  2.6911    Dan  [8, 3, 4, 6, 5]  grade=90
