<a href="https://colab.research.google.com/github/lubaochuan/ml_python/blob/main/ISLP_chapter4_guided_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# ISLP Chapter 4 – Guided Exercise: Classification, Decision Boundaries, and Probability Heatmaps

This lab focuses on **intuition**, not math. You will:
- Train **logistic regression** and **k-NN** classifiers
- Visualize **decision boundaries**
- Visualize **probability heatmaps** (confidence maps)
- Compare models using **accuracy** and **confusion matrices**
- Explore how **thresholds** change errors (false positives vs false negatives)

**Tip:** Run cells top to bottom. Stop at reflection prompts and discuss with a partner.


## Step 0: Setup

In [None]:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

np.random.seed(1)


## Step 1: Create a 2D dataset (two overlapping classes)

In [None]:

n = 400
X0 = np.random.multivariate_normal([0,0], [[1.3,0.5],[0.5,1.1]], n//2)
X1 = np.random.multivariate_normal([2.2,2.0], [[1.3,-0.4],[-0.4,1.0]], n//2)

X = np.vstack([X0, X1])
y = np.array([0]*(n//2) + [1]*(n//2))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

plt.scatter(X_train[:,0], X_train[:,1], c=y_train, alpha=0.8)
plt.title("Training data (2 classes)")
plt.xlabel("x1"); plt.ylabel("x2")
plt.show()



### Reflection
1. Do you expect *perfect* accuracy on test data? Why or why not?
2. What does overlap between classes mean for the “best possible” classifier?
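
Because this dataset is simulated from two known Gaussians, the "best possible" classifier in question 2 can be approximated directly. The sketch below is one way to do that, not part of the original lab. It assumes equal class priors (true here, since each class has `n//2` points). The helper name `log_gauss`, the separate random generator, and the sample size `m` are illustrative choices.

```python
import numpy as np

# Same class-conditional distributions as in Step 1.
mu0, cov0 = np.array([0.0, 0.0]), np.array([[1.3, 0.5], [0.5, 1.1]])
mu1, cov1 = np.array([2.2, 2.0]), np.array([[1.3, -0.4], [-0.4, 1.0]])

def log_gauss(X, mu, cov):
    """Log density of a 2D normal, up to the constant shared by both classes."""
    diff = X - mu
    inv = np.linalg.inv(cov)
    maha = np.einsum("ij,jk,ik->i", diff, inv, diff)  # squared Mahalanobis distance
    return -0.5 * maha - 0.5 * np.log(np.linalg.det(cov))

rng = np.random.default_rng(0)  # separate generator; leaves np.random.seed(1) untouched
m = 20000                       # hypothetical Monte Carlo sample size per class
S = np.vstack([rng.multivariate_normal(mu0, cov0, m),
               rng.multivariate_normal(mu1, cov1, m)])
labels = np.array([0] * m + [1] * m)

# With equal priors, the optimal rule picks the class with the larger density.
bayes_pred = (log_gauss(S, mu1, cov1) > log_gauss(S, mu0, cov0)).astype(int)
bayes_acc = (bayes_pred == labels).mean()
print("Approximate best-possible accuracy:", round(bayes_acc, 3))
```

Even this rule, which knows the true distributions, misclassifies points that fall in the overlap region, so no model trained on the data should be expected to reach 100% test accuracy.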


## Helper functions: decision boundary + probability heatmap

In [None]:

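def plot_decision_boundary(model, X, y, title, proba=False):
    """Plot a fitted classifier's predictions over a 2D grid, with the data on top.

    With proba=True (and a model that has predict_proba), shade by P(class=1)
    and draw the 0.5 contour; otherwise shade by the predicted class label.
    """
    # Pad the plotting window by 1 unit around the data.
    x_min, x_max = X[:,0].min()-1, X[:,0].max()+1
    y_min, y_max = X[:,1].min()-1, X[:,1].max()+1

    # Build a 300x300 grid and flatten it to one (x1, x2) row per grid point.
    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, 300),
        np.linspace(y_min, y_max, 300)
    )
    grid = np.c_[xx.ravel(), yy.ravel()]

    if proba and hasattr(model, "predict_proba"):
        # Column 1 of predict_proba is P(class=1).
        zz = model.predict_proba(grid)[:,1].reshape(xx.shape)
        plt.contourf(xx, yy, zz, alpha=0.35)
        plt.colorbar(label="P(class=1)")
        plt.contour(xx, yy, zz, levels=[0.5], colors="black", linewidths=1)
    else:
        # Hard labels: each grid point gets its predicted class.
        zz = model.predict(grid).reshape(xx.shape)
        plt.contourf(xx, yy, zz, alpha=0.25)

    # Overlay the actual data points, colored by their true class.
    plt.scatter(X[:,0], X[:,1], c=y, edgecolor="k", alpha=0.8)
    plt.title(title)
    plt.xlabel("x1"); plt.ylabel("x2")
    plt.show()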
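
The helper uses `np.meshgrid` and `np.c_` to turn a rectangle into a list of points the model can score. The toy example below uses a tiny 3-by-2 grid to show the shapes involved. The grid sizes and the `values` array are made up for illustration.

```python
import numpy as np

# A tiny 3-by-2 grid instead of the 300-by-300 grid used in the helper.
xx, yy = np.meshgrid(np.linspace(0, 1, 3), np.linspace(10, 11, 2))
grid = np.c_[xx.ravel(), yy.ravel()]  # one row per grid point: (x1, x2)

print(xx.shape)    # (2, 3): rows follow yy, columns follow xx
print(grid.shape)  # (6, 2)
print(grid)

# Any per-point value (a label or a probability) can be reshaped back onto the grid.
values = grid.sum(axis=1)        # stand-in for model.predict(grid)
zz = values.reshape(xx.shape)    # same shape as xx and yy, ready for contourf
```

Reshaping the predictions back to `xx.shape` is what lets `plt.contourf` draw them as a filled map.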


## Step 2: Logistic Regression (parametric, linear boundary)

In [None]:

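# Note: scikit-learn's LogisticRegression applies L2 regularization by default (C=1.0).
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

pred_lr = logreg.predict(X_test)
acc_lr = accuracy_score(y_test, pred_lr)
# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm_lr = confusion_matrix(y_test, pred_lr)

print("Logistic Regression accuracy:", round(acc_lr, 3))
print("Confusion matrix:\n", cm_lr)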

### Step 2a: Logistic regression probability heatmap

In [None]:

plot_decision_boundary(logreg, X_train, y_train,
                       title="Logistic Regression: Probability Heatmap + 0.5 Boundary",
                       proba=True)



### Reflection
1. Why is the boundary roughly a straight line?
2. What does a “lighter” vs “darker” area mean on the heatmap?
3. Where is the model least confident?
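
Question 1 can be checked numerically. A fitted logistic regression predicts P(class=1) = 0.5 exactly where its linear score `b + w1*x1 + w2*x2` equals zero, and that equation describes a straight line. The sketch below uses a small stand-in dataset (`X_demo`, `y_demo` are hypothetical), but the lab's `X_train` and `y_train` would work the same way.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small stand-in dataset (hypothetical): two Gaussian blobs.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)

model = LogisticRegression().fit(X_demo, y_demo)
(w1, w2), b = model.coef_[0], model.intercept_[0]

# Solve b + w1*x1 + w2*x2 = 0 for x2 to get points on the boundary.
x1 = np.linspace(-2, 4, 5)
x2 = -(b + w1 * x1) / w2
on_line = np.c_[x1, x2]

print("Boundary: x2 = %.2f * x1 + %.2f" % (-w1 / w2, -b / w2))
print(model.predict_proba(on_line)[:, 1])  # all approximately 0.5
```

Points far from this line have scores far from zero, so their probabilities are pushed toward 0 or 1. The model is least confident near the line itself.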


## Step 3: k-NN Classification (non-parametric, flexible boundary)

In [None]:

knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
knn_5 = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
knn_25 = KNeighborsClassifier(n_neighbors=25).fit(X_train, y_train)

models = {
    "k-NN (k=1)": knn_1,
    "k-NN (k=5)": knn_5,
    "k-NN (k=25)": knn_25,
}
for name, m in models.items():
    pred = m.predict(X_test)
    print(name, "accuracy:", round(accuracy_score(y_test, pred), 3))


### Step 3a: Decision boundaries for k-NN (compare k values)

In [None]:

for name, m in models.items():
    plot_decision_boundary(m, X_train, y_train, title=f"{name}: Decision Boundary", proba=False)


### Step 3b: k-NN probability heatmaps (optional but enlightening)

In [None]:

# k-NN probabilities are local vote fractions; heatmaps show how “confident” local neighborhoods are.
for name, m in {"k-NN (k=5)": knn_5, "k-NN (k=25)": knn_25}.items():
    plot_decision_boundary(m, X_train, y_train, title=f"{name}: Probability Heatmap", proba=True)
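
To see where k-NN probabilities come from, the sketch below recomputes them by hand. It assumes the default uniform weighting of neighbors. The dataset and the two query points are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_demo, y_demo)

query = np.array([[1.0, 1.0], [3.0, 3.0]])  # two hypothetical query points
_, idx = knn.kneighbors(query)               # indices of the 5 nearest training points
vote_fraction = y_demo[idx].mean(axis=1)     # share of those neighbors labeled 1

print("Manual vote fraction:", vote_fraction)
print("predict_proba[:, 1]: ", knn.predict_proba(query)[:, 1])
```

With k neighbors, the probability can only take the values 0, 1/k, 2/k, ..., 1. This is why a k=5 heatmap shows a few flat bands while a k=25 heatmap looks more gradual.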



### Reflection
1. Which k produces the most jagged boundary? Why?
2. Which k produces the smoothest boundary? Why?
3. Relate k to bias–variance (small k vs large k).
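
One way to explore question 3 is to compare training and test accuracy across several values of k. The sketch below does this on a hypothetical stand-in dataset. The values of k and the data are illustrative and are not the lab's own results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(1.5, 1, (200, 2))])
y_demo = np.array([0] * 200 + [1] * 200)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

results = {}  # k -> (training accuracy, test accuracy)
for k in [1, 5, 25, 101]:
    m = KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr)
    results[k] = (m.score(Xtr, ytr), m.score(Xte, yte))
    print(f"k={k:3d}  train={results[k][0]:.3f}  test={results[k][1]:.3f}")
```

With k=1 each training point is its own nearest neighbor, so training accuracy is perfect. A large gap between training and test accuracy is a sign of high variance.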


## Step 4: Thresholding (Why probabilities matter)

In [None]:

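# predict() is equivalent to thresholding P(class=1) at 0.5; here we choose the threshold ourselves.
probs_test = logreg.predict_proba(X_test)[:,1]

for t in [0.3, 0.5, 0.7]:
    # Predict class 1 only when the model's probability reaches the threshold t.
    preds = (probs_test >= t).astype(int)
    print("\nThreshold =", t)
    print("Accuracy:", round(accuracy_score(y_test, preds), 3))
    print("Confusion matrix:\n", confusion_matrix(y_test, preds))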



### Reflection
1. As the threshold increases, do false positives increase or decrease?
2. As the threshold increases, do false negatives increase or decrease?
3. Give one real-world scenario where you would use a high threshold, and one where you would use a low threshold.
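
A finer sweep makes the trade-off in questions 1 and 2 easier to see. The sketch below counts false positives and false negatives at nine thresholds. It uses a hypothetical stand-in dataset and, for brevity, scores the same data the model was fit on, which is enough to show the direction of the effect.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(1.5, 1, (150, 2))])
y_demo = np.array([0] * 150 + [1] * 150)

probs = LogisticRegression().fit(X_demo, y_demo).predict_proba(X_demo)[:, 1]

thresholds = np.linspace(0.1, 0.9, 9)
fp_counts, fn_counts = [], []
for t in thresholds:
    preds = (probs >= t).astype(int)
    # labels=[0, 1] keeps the matrix 2x2 even if only one class is predicted.
    tn, fp, fn, tp = confusion_matrix(y_demo, preds, labels=[0, 1]).ravel()
    fp_counts.append(fp)
    fn_counts.append(fn)
    print(f"t={t:.1f}  FP={fp:3d}  FN={fn:3d}")
```

Raising the threshold can only shrink the set of points predicted as class 1, so the false positive count never rises and the false negative count never falls.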



# Review Questions

1. Why is logistic regression called a **classification** method?
2. What is a **decision boundary**?
3. What does a probability heatmap show that a hard boundary does not?
4. Why can **accuracy** be misleading?
5. How does changing **k** affect k-NN behavior?
6. How does changing the **threshold** affect error types?
7. When might you choose logistic regression over k-NN (and vice versa)?


<details>
<summary>Answer Key</summary>

1. It outputs P(class=1|X) and then applies a threshold to decide class labels.
2. The curve separating regions that the model assigns to different classes. For logistic regression on the raw features x1 and x2 it is a straight line; for k-NN it can be irregular.
3. Confidence/uncertainty: which areas are “close calls” vs clearly one class.
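4. It hides *which* mistakes happen (false positives vs false negatives), and under class imbalance a model can score high accuracy by always predicting the majority class.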
5. Small k → high variance (jagged boundary); large k → higher bias (smooth boundary).
6. Higher threshold → fewer false positives but more false negatives (typically); lower threshold does the opposite.
7. Choose logistic regression for interpretability, stability, small data, or when a linear boundary is reasonable; choose k-NN when you have lots of data and expect complex boundaries.
</details>
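
Review question 4 can be illustrated with a made-up imbalanced example. The labels below are hypothetical, and the "classifier" always predicts class 0.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical imbalanced labels: 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)  # always predict class 0

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])  # [[TN, FP], [FN, TP]]
print("Accuracy:", acc)
print("Confusion matrix:\n", cm)
```

The accuracy is 95%, yet the confusion matrix shows that all 5 positives are missed. That is why the lab reports confusion matrices alongside accuracy.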