<a href="https://colab.research.google.com/github/lubaochuan/ml_python/blob/main/ISLP_chapter4_LogReg_vs_kNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# ISLP Chapter 4 Contrast Exercise: Logistic Regression vs k-NN (Classification)

**Goal:** Develop intuition for *when* logistic regression or k-NN is a better choice.

You will compare models on three scenarios:
1. Mostly linear boundary (logistic regression should shine)
2. Nonlinear boundary (k-NN can shine)
3. High-dimensional noise features (k-NN can struggle)

For each scenario, you will:
- Split the data into training and test sets
- Report test accuracy and the confusion matrix
- Visualize the decision boundary (2D scenarios only)


In [None]:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

np.random.seed(2)


In [None]:

def plot_boundary(model, X, y, title):
    """Shade a fitted model's predicted class regions behind the 2D points in X."""
    # Build a 300x300 grid covering the data range, padded by 1 on each side.
    x_min, x_max = X[:,0].min()-1, X[:,0].max()+1
    y_min, y_max = X[:,1].min()-1, X[:,1].max()+1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                         np.linspace(y_min, y_max, 300))
    # Predict a class for every grid point, then reshape back to the grid.
    grid = np.c_[xx.ravel(), yy.ravel()]
    zz = model.predict(grid).reshape(xx.shape)
    plt.contourf(xx, yy, zz, alpha=0.25)
    plt.scatter(X[:,0], X[:,1], c=y, edgecolor="k", alpha=0.8)
    plt.title(title)
    plt.xlabel("x1"); plt.ylabel("x2")
    plt.show()

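def eval_model(name, model, X_train, X_test, y_train, y_test):
    """Fit model on the training set; print and return test accuracy and confusion matrix."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    acc = accuracy_score(y_test, pred)
    # Rows are true classes, columns are predicted classes.
    cm = confusion_matrix(y_test, pred)
    print(f"{name} accuracy: {acc:.3f}")
    print("Confusion matrix:\n", cm)
    return acc, cm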


## Scenario 1: Mostly linear decision boundary

In [None]:

n = 500
X = np.random.normal(0, 1, (n, 2))
# Linear-ish rule + noise
y = ((1.2*X[:,0] + 0.9*X[:,1] + np.random.normal(0, 0.6, n)) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

logreg = LogisticRegression()
knn5 = KNeighborsClassifier(5)

print("Scenario 1")
eval_model("Logistic Regression", logreg, X_train, X_test, y_train, y_test)
eval_model("k-NN (k=5)", knn5, X_train, X_test, y_train, y_test)

# eval_model already fitted both models in place, so no refit is needed.
plot_boundary(logreg, X_train, y_train, "LogReg boundary (Scenario 1)")
plot_boundary(knn5, X_train, y_train, "k-NN boundary (Scenario 1)")



### Reflection
1. Which model performed better? Why might that be?
2. Which boundary looks simpler? Which looks more flexible?


## Scenario 2: Nonlinear decision boundary (ring vs center)

In [None]:

n = 600
X = np.random.normal(0, 1.2, (n, 2))
r = np.sqrt(X[:,0]**2 + X[:,1]**2)
# Ring classification with noise
y = ((r + np.random.normal(0, 0.15, n)) > 1.2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

logreg = LogisticRegression()
knn15 = KNeighborsClassifier(15)

print("Scenario 2")
eval_model("Logistic Regression", logreg, X_train, X_test, y_train, y_test)
eval_model("k-NN (k=15)", knn15, X_train, X_test, y_train, y_test)

# eval_model already fitted both models in place, so no refit is needed.
plot_boundary(logreg, X_train, y_train, "LogReg boundary (Scenario 2)")
plot_boundary(knn15, X_train, y_train, "k-NN boundary (Scenario 2)")



### Reflection
1. Why does logistic regression struggle here?
2. How is k-NN able to adapt?
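
One way to probe the first question is to give logistic regression squared terms, so that a boundary that is linear in the expanded features can be circular in the original two. This is a minimal sketch, not part of the original exercise: the degree-2 expansion and the separate random generator are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Assumption: a local generator regenerates Scenario 2-style data.
rng = np.random.default_rng(2)
n = 600
X = rng.normal(0, 1.2, (n, 2))
r = np.sqrt(X[:, 0] ** 2 + X[:, 1] ** 2)
y = ((r + rng.normal(0, 0.15, n)) > 1.2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Plain logistic regression: the boundary is a straight line in (x1, x2).
linear_lr = LogisticRegression().fit(X_train, y_train)

# Same model on x1, x2, x1^2, x1*x2, x2^2: still linear in the expanded
# features, but the boundary can now be a circle or ellipse in (x1, x2).
quad_lr = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(max_iter=2000),
).fit(X_train, y_train)

linear_acc = linear_lr.score(X_test, y_test)
quad_acc = quad_lr.score(X_test, y_test)
print(f"Linear features    test acc: {linear_acc:.3f}")
print(f"Quadratic features test acc: {quad_acc:.3f}")
```

The comparison suggests that the difficulty lies in the shape of the boundary the model can express, not in logistic regression itself. k-NN reaches a similar boundary without being told which transformed features to use.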


## Scenario 3: Add many irrelevant features (curse of dimensionality intuition)

In [None]:

n = 800
# Start with 2 informative features
X2 = np.random.normal(0, 1, (n, 2))
y = ((1.0*X2[:,0] - 1.2*X2[:,1] + np.random.normal(0, 0.6, n)) > 0).astype(int)

# Add 20 irrelevant noise features
noise = np.random.normal(0, 1, (n, 20))
X = np.hstack([X2, noise])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

logreg = LogisticRegression(max_iter=2000)
knn5 = KNeighborsClassifier(5)

print("Scenario 3 (22 features: 2 signal + 20 noise)")
eval_model("Logistic Regression", logreg, X_train, X_test, y_train, y_test)
eval_model("k-NN (k=5)", knn5, X_train, X_test, y_train, y_test)



### Reflection
1. Why might k-NN degrade when we add many irrelevant features?
2. What preprocessing might help k-NN?
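
One possible answer to the second question is to standardize the features and keep only those most associated with the label before computing distances. The sketch below assumes univariate selection with `SelectKBest` and `f_classif`, and it assumes we already know to keep 2 features; in practice that number would be tuned, for example by cross-validation. The data come from a separate random generator, so results will differ from the cell above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a local generator regenerates Scenario 3-style data.
rng = np.random.default_rng(2)
n = 800
signal = rng.normal(0, 1, (n, 2))
y = ((1.0 * signal[:, 0] - 1.2 * signal[:, 1] + rng.normal(0, 0.6, n)) > 0).astype(int)
X = np.hstack([signal, rng.normal(0, 1, (n, 20))])  # columns 0-1 signal, 2-21 noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Baseline: distances are computed over all 22 features.
plain_knn = KNeighborsClassifier(5).fit(X_train, y_train)

# Scale, keep the 2 features with the largest ANOVA F-statistic, then run k-NN.
# Selection is fitted on the training split only, inside the pipeline.
selected_knn = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=2),  # k=2 is an assumption for illustration
    KNeighborsClassifier(5),
).fit(X_train, y_train)

plain_acc = plain_knn.score(X_test, y_test)
selected_acc = selected_knn.score(X_test, y_test)
kept = selected_knn.named_steps["selectkbest"].get_support(indices=True)

print(f"k-NN on all 22 features  test acc: {plain_acc:.3f}")
print(f"k-NN on selected features test acc: {selected_acc:.3f}")
print("Selected column indices:", kept)
```

Putting the selector inside the pipeline keeps the test set out of the selection step, which avoids an optimistic accuracy estimate.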



# Questions

1. When would you choose logistic regression over k-NN?
2. When would you choose k-NN over logistic regression?
3. How does model choice relate to interpretability, data size, and decision boundary shape?
4. If false negatives are extremely costly, how would you adjust the classification threshold?


<details>
<summary>Answer Key</summary>

1. Choose logistic regression for interpretability, stability, probability outputs, small datasets, or when the boundary is roughly linear.
2. Choose k-NN when the boundary is complex or nonlinear and you have enough data, but be cautious with high-dimensional features.
3. LR = global linear boundary + interpretable coefficients; k-NN = local voting, flexible boundary, sensitive to scaling/dimensionality.
4. Lower the threshold to catch more positives (increase recall), accepting more false positives.
</details>
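
The threshold idea in answer 4 can be tried directly with `predict_proba`. This is a minimal sketch on Scenario 1-style data drawn from a separate random generator; the lowered threshold of 0.2 is an arbitrary illustrative value, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Assumption: a local generator regenerates Scenario 1-style data.
rng = np.random.default_rng(2)
n = 500
X = rng.normal(0, 1, (n, 2))
y = ((1.2 * X[:, 0] + 0.9 * X[:, 1] + rng.normal(0, 0.6, n)) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = LogisticRegression().fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # estimated P(y = 1 | x)

summary = {}
for threshold in (0.5, 0.2):  # 0.5 is the usual cutoff; 0.2 is illustrative
    pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    summary[threshold] = {"fn": int(fn), "fp": int(fp), "recall": tp / (tp + fn)}
    print(f"threshold={threshold}: FN={fn}, FP={fp}, recall={tp / (tp + fn):.3f}")
```

Lowering the threshold can only move predictions from class 0 to class 1, so false negatives cannot increase and false positives cannot decrease. How far to lower it depends on the relative cost of the two error types.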