# Workshop: Logistic Regression (Super Simple)

A gentle, hands‑on intro to binary classification that mirrors the class slides and prepares you for the assignment.

**You will:**
1. Make a tiny **binary classification** dataset (2 features → easy to plot).
2. Do a **train/test split** and keep the **test set** out of model selection.
3. Build a scikit‑learn **Pipeline**: `StandardScaler` → `LogisticRegression`.
4. Use **cross‑validation on the training set** to pick `C` (regularization).
5. Evaluate **once** on the **held‑out test set** (generalization).
6. Visualize a **decision boundary** and a **ROC curve**.

**Template to reuse:**  
`split → pipeline (scale + logistic) → CV on train → final test → report metrics/plots`


## 0) Setup

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    RocCurveDisplay, roc_auc_score, precision_score, recall_score
)

np.random.seed(42)
print("Libraries imported.")

## 1) Make a tiny dataset (no files needed)

We create a 2‑feature dataset so we can **see** the classes and the learned boundary.

In [None]:
X, y = make_classification(
    n_samples=400, n_features=2, n_redundant=0, n_informative=2,
    n_clusters_per_class=1, class_sep=1.6, random_state=42
)

plt.figure()
plt.scatter(X[:,0], X[:,1], c=y)
plt.xlabel("x1"); plt.ylabel("x2")
plt.title("Toy dataset (2 features)")
plt.show()

## 2) Train/test split (hold out a final test set)

We split once. **All** model selection is done using **only training data** with cross‑validation.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print("Train size:", X_train.shape, " Test size:", X_test.shape)
print("Train class balance:", np.mean(y_train).round(3),
      "  Test class balance:", np.mean(y_test).round(3))

## 3) Pipeline + cross‑validation on the **training** set

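- `StandardScaler` puts features on comparable scales (zero mean, unit variance). Because it sits inside the pipeline, it is re‑fit on the training part of each CV fold, so no information leaks from the validation fold.  
- `C` in `LogisticRegression` is the inverse regularization strength (smaller `C` ⇒ stronger regularization).  
- Pick `C` with **5‑fold CV** inside the training set.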
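To make the role of `C` concrete, here is a minimal sketch that fits the same kind of pipeline at three values of `C` and prints the size of the learned coefficient vector. The names `X_demo`, `y_demo`, `demo_pipe`, and `coef_norms`, and the `C` values themselves, are illustrative assumptions and not part of the workshop code; the data is regenerated here so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data (assumption): same generator settings as the toy dataset.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=2, n_redundant=0, n_informative=2,
    n_clusters_per_class=1, class_sep=1.6, random_state=42
)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    demo_pipe = Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(C=C, max_iter=1000)),
    ])
    demo_pipe.fit(X_demo, y_demo)
    # Length of the weight vector: a stronger penalty (small C) shrinks it.
    coef_norms[C] = float(np.linalg.norm(demo_pipe.named_steps["clf"].coef_))
    print(f"C={C:<6}  ||w|| = {coef_norms[C]:.3f}")
```

Smaller weights give a flatter sigmoid, so the predicted probabilities stay closer to 0.5; larger `C` lets the model commit more strongly to the training data.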

In [None]:
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42))
])

param_grid = {"clf__C": [0.1, 1.0, 10.0]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    pipe, param_grid=param_grid, cv=cv, scoring="accuracy", refit=True
)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("CV best accuracy (mean over folds):", grid.best_score_.round(3))
best_model = grid.best_estimator_
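For intuition about what the grid search does, the sketch below spells out roughly the same procedure by hand: for each candidate `C`, run 5‑fold cross‑validation on the training data only and keep the value with the highest mean accuracy. It is a simplified illustration, not a replacement for `GridSearchCV` (which also refits the winner on the full training set). The names `X_tr`, `y_tr`, `cv_demo`, `mean_scores`, and `best_C` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data and split (assumption): same settings as the workshop cells.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=2, n_redundant=0, n_informative=2,
    n_clusters_per_class=1, class_sep=1.6, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
mean_scores = {}
for C in [0.1, 1.0, 10.0]:
    candidate = Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(C=C, max_iter=1000)),
    ])
    # Each fold fits scaler + classifier on 4/5 of the training data
    # and scores on the remaining 1/5; the test set is never touched.
    fold_scores = cross_val_score(candidate, X_tr, y_tr, cv=cv_demo, scoring="accuracy")
    mean_scores[C] = float(fold_scores.mean())
    print(f"C={C:<5} mean CV accuracy = {mean_scores[C]:.3f}")

best_C = max(mean_scores, key=mean_scores.get)
print("Selected C:", best_C)
```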

## 4) Final evaluation on the **held‑out test set** (generalization)

In [None]:
y_pred = best_model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Test accuracy:", round(acc, 3))
print("\nConfusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred, digits=3))

## 5) Visualize the learned decision boundary

In [None]:
# Mesh for background probabilities
x1_min, x1_max = X[:,0].min()-1, X[:,0].max()+1
x2_min, x2_max = X[:,1].min()-1, X[:,1].max()+1
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, 300),
                       np.linspace(x2_min, x2_max, 300))
# Use a new name here so the fitted GridSearchCV object `grid` is not overwritten
mesh_points = np.c_[xx1.ravel(), xx2.ravel()]

proba_grid = best_model.predict_proba(mesh_points)[:,1].reshape(xx1.shape)

plt.figure()
plt.contourf(xx1, xx2, proba_grid, levels=20, alpha=0.35)
plt.scatter(X_train[:,0], X_train[:,1], c=y_train, label="train")
plt.scatter(X_test[:,0], X_test[:,1], c=y_test, marker="x", label="test")
plt.xlabel("x1"); plt.ylabel("x2")
plt.title("Decision boundary (probability background)")
plt.legend()
plt.show()

## 6) Probability outputs and ROC curve

In [None]:
y_proba = best_model.predict_proba(X_test)[:,1]
auc = roc_auc_score(y_test, y_proba)
print("Test ROC AUC:", round(auc, 3))

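# Pass an explicit Axes; otherwise from_predictions creates its own figure
# and a preceding plt.figure() would show up as an empty plot.
fig, ax = plt.subplots()
RocCurveDisplay.from_predictions(y_test, y_proba, ax=ax)
ax.set_title("ROC curve (test set)")
plt.show()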

## 7) Try different decision thresholds

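By default, `predict` uses a probability threshold of 0.5. Lowering the threshold labels more samples as positive, which raises recall (or leaves it unchanged) and usually lowers precision; raising it does the opposite. Try 0.3 and 0.7 to see the precision/recall trade‑off. The comparison here is for illustration only: a threshold you intend to keep should be chosen on training data, not on the test set.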

In [None]:
def evaluate_threshold(proba, y_true, t=0.5):
    """Print accuracy, precision, and recall when class 1 is predicted for proba >= t."""
    y_hat = (proba >= t).astype(int)  # probabilities -> hard 0/1 labels
    acc = accuracy_score(y_true, y_hat)
    prec = precision_score(y_true, y_hat)
    rec = recall_score(y_true, y_hat)
    print(f"threshold={t:0.2f}  acc={acc:0.3f}  precision={prec:0.3f}  recall={rec:0.3f}")

for t in [0.3, 0.5, 0.7]:
    evaluate_threshold(y_proba, y_test, t)
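The three thresholds above are spot checks. As a hedged extension, scikit‑learn's `precision_recall_curve` computes precision and recall at every threshold that changes the predictions, which makes the trade‑off visible across the full range. The sketch rebuilds a small model so it runs on its own; `demo_model`, `demo_proba`, and the other names are assumptions for this example, not objects from the cells above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data and split (assumption): same settings as the workshop cells.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=2, n_redundant=0, n_informative=2,
    n_clusters_per_class=1, class_sep=1.6, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)
demo_model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_tr, y_tr)
demo_proba = demo_model.predict_proba(X_te)[:, 1]

# precision and recall each have one more entry than thresholds:
# the final pair (precision=1, recall=0) has no threshold attached.
precision, recall, thresholds = precision_recall_curve(y_te, demo_proba)

for t in [0.3, 0.5, 0.7]:
    i = int(np.argmin(np.abs(thresholds - t)))  # nearest available threshold
    print(f"threshold~{thresholds[i]:.2f}  precision={precision[i]:.3f}  recall={recall[i]:.3f}")
```

Recall can only fall (or stay flat) as the threshold rises; precision usually rises but is not guaranteed to do so at every step.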

## Wrap‑up

- Logistic regression outputs probabilities via the sigmoid; a **threshold** turns those into class labels.  
- Use **Pipeline** + **CV on the training set**; evaluate **once** on the **test set** for generalization.  
- You saw accuracy, the confusion matrix, the classification report, ROC/AUC, and threshold effects.
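The first wrap‑up point can be checked directly. The sketch below, under the assumption of a plain binary `LogisticRegression` fitted on regenerated toy data, applies the sigmoid to the model's linear score (`decision_function`) and compares the result with `predict_proba`, then thresholds at 0.5 and compares with `predict`. The names `clf`, `z`, `p_manual`, and `labels_manual` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative data (assumption): same generator settings as the toy dataset.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=2, n_redundant=0, n_informative=2,
    n_clusters_per_class=1, class_sep=1.6, random_state=42
)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

z = clf.decision_function(X_demo)        # linear score w·x + b
p_manual = 1.0 / (1.0 + np.exp(-z))      # sigmoid maps scores to (0, 1)
p_sklearn = clf.predict_proba(X_demo)[:, 1]
print("sigmoid(score) matches predict_proba:", np.allclose(p_manual, p_sklearn))

labels_manual = (p_manual >= 0.5).astype(int)   # threshold -> class labels
print("thresholding at 0.5 matches predict:",
      np.array_equal(labels_manual, clf.predict(X_demo)))
```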
