<a href="https://colab.research.google.com/github/juliabui/csc408-411/blob/main/Mod3_Logistic_Regression_Log_Loss%20%23411.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This code:**

* makes a synthetic binary dataset

* fits LogisticRegression inside a Pipeline(StandardScaler → LogisticRegression)

* tunes for log loss with cross-validated GridSearchCV

* reports test log loss (and a few extra metrics for context).

**Notes**

* Why scoring='neg_log_loss'? GridSearchCV always maximizes its score, but log loss is lower-is-better, so scikit-learn exposes it negated: maximizing neg_log_loss is the same as minimizing log loss.

* Regularization: C is the inverse of regularization strength (smaller C → stronger regularization). Searching L1 vs L2 and class_weight can materially change log loss, especially with imbalance.

* Protocol: tune via CV on the train split only, then report metrics once on the test split for an unbiased estimate.
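To make the sign convention concrete, here is a minimal sketch that writes binary log loss out by hand and compares it with scikit-learn's metric and scorer. The dataset sizes and the plain LogisticRegression settings are arbitrary choices for illustration, not the notebook's configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer, log_loss

# Small illustrative dataset (sizes are arbitrary choices for this sketch)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Predicted probability of class 1, clipped to guard against log(0)
p = np.clip(model.predict_proba(X)[:, 1], 1e-15, 1 - 1e-15)

# Binary log loss written out: -mean(y*log(p) + (1-y)*log(1-p))
manual = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# The scorer flips the sign so that "greater is better" holds
neg_ll = get_scorer("neg_log_loss")(model, X, y)

print(f"manual log loss:    {manual:.6f}")
print(f"sklearn log_loss:   {log_loss(y, p):.6f}")
print(f"neg_log_loss score: {neg_ll:.6f}")
```

The first two numbers should agree, and the third should be the same value with the opposite sign.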

**Step 3 is where we tune**

* Try different values in param_grid, the parameter grid passed to GridSearchCV (around line 47 of the code cell)

* Alternatives to "neg_log_loss" for the scoring argument (line 56 of the code cell)

  * "roc_auc" - ROC AUC (binary)

  * "roc_auc_ovr", "roc_auc_ovo" - multiclass variants

  * "average_precision" - PR-AUC (great for rare positives)

**Rule of thumb**

* Want well-calibrated probabilities → neg_log_loss, neg_brier_score.

* Care about ranking under imbalance → average_precision (or roc_auc).

* Reporting hard labels at a fixed threshold → f1/precision/recall (pick what matches your objective).
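One way to see how these scoring options differ is to evaluate the same model under several of them at once. The sketch below uses cross_validate with a list of scorer names; the dataset and the untuned pipeline are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative imbalanced dataset (~20% positives); sizes are arbitrary
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Probability-quality, ranking, and hard-label scorers side by side
scorers = ["neg_log_loss", "neg_brier_score", "roc_auc", "average_precision", "f1"]
results = cross_validate(model, X, y, cv=5, scoring=scorers)

for name in scorers:
    scores = results[f"test_{name}"]
    print(f"{name:>18}: {scores.mean():.4f} ± {scores.std():.4f}")
```

The neg_ scorers come out negative because they are negated losses; the others are higher-is-better as they stand.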

In [None]:
# Logistic regression tuned for log loss on a synthetic dataset
# -------------------------------------------------------------
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score, accuracy_score, brier_score_loss

RNG = 42

# 1) Synthetic data (slightly imbalanced; a bit of label noise)
X, y = make_classification(
    n_samples=4000,
    n_features=20,
    n_informative=6,
    n_redundant=4,
    class_sep=1.2,
    flip_y=0.03,
    weights=[0.7, 0.3],   # ~30% positive class
    random_state=RNG
)

# Hold out a test set once; do all tuning by CV on the train split only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=RNG
)

# 2) Baseline (predict the base rate for everyone) — yardstick for log loss
p_base = np.full_like(y_test, fill_value=y_train.mean(), dtype=float)
print(f"Baseline (always predict prevalence={y_train.mean():.3f}) "
      f"| Test log loss: {log_loss(y_test, p_base):.4f}")

# 3) Pipeline + grid search (smaller C = stronger L1/L2 regularization)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(
        solver="liblinear",    # supports L1 and L2 for binary problems
        max_iter=1000,
        random_state=RNG
    ))
])
# Tune C via the "logreg__C" key (step name + double underscore + parameter name)
param_grid = {
    "logreg__penalty": ["l2", "l1"],
    "logreg__C": np.logspace(-3, 3, 13),  # ← Grid of C values to try
    "logreg__class_weight": [None, "balanced"]  # try with/without imbalance handling
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RNG)

gs = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring="neg_log_loss",    # <-- tune for log loss
    cv=cv,
    n_jobs=-1,
    refit=True,                # refit on full training set using best params
    verbose=0
)

gs.fit(X_train, y_train)

# Note: best_score_ is the mean over folds only; the std is in gs.cv_results_["std_test_score"]
print("\nBest by CV (mean ± std over folds):")
print(f"  neg_log_loss: {gs.best_score_:.4f}  "
      f"(C={gs.best_params_['logreg__C']}, penalty={gs.best_params_['logreg__penalty']}, "
      f"class_weight={gs.best_params_['logreg__class_weight']})")

# 4) One-time evaluation on the untouched test set
proba_test = gs.predict_proba(X_test)[:, 1]
logloss_test = log_loss(y_test, proba_test)
rocauc_test  = roc_auc_score(y_test, proba_test)
brier_test   = brier_score_loss(y_test, proba_test)
acc_test     = accuracy_score(y_test, proba_test >= 0.5)  # threshold shown just for reference

print("\nTest set performance (unbiased):")
print(f"  Log loss:   {logloss_test:.4f}  (lower is better)")
print(f"  ROC AUC:    {rocauc_test:.4f}")
print(f"  Brier score:{brier_test:.4f}")
print(f"  Accuracy@:0.5 threshold: {acc_test:.4f}")

# Optional: show the learned coefficients (after scaling) for interpretability
clf = gs.best_estimator_.named_steps["logreg"]
print("\nTop coefficients (absolute value):")
coef = clf.coef_.ravel()
top = np.argsort(np.abs(coef))[::-1][:10]
for j in top:
    print(f"  x{j:02d}: coef={coef[j]: .4f}")


Baseline (always predict prevalence=0.306) | Test log loss: 0.6157

Best by CV (mean ± std over folds):
  neg_log_loss: -0.2333  (C=0.31622776601683794, penalty=l1, class_weight=None)

Test set performance (unbiased):
  Log loss:   0.2361  (lower is better)
  ROC AUC:    0.9537
  Brier score:0.0625
  Accuracy@:0.5 threshold: 0.9233

Top coefficients (absolute value):
  x02: coef= 3.0580
  x15: coef= 1.3475
  x05: coef=-0.6580
  x17: coef= 0.2861
  x12: coef=-0.2172
  x00: coef= 0.1677
  x16: coef=-0.1109
  x19: coef= 0.0806
  x03: coef= 0.0554
  x13: coef=-0.0532


### How to interpret the results

1) Baseline

    * What it is: log loss if you predict the training prevalence for everyone (no features).

    * How to use it: your model's test log loss should be clearly lower than the baseline (here 0.2361 vs. 0.6157). If it is close to or worse than the baseline, your features/model are adding little or no value.
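The baseline number can be checked by hand. A constant prediction equal to the prevalence has a log loss equal to the binary entropy of that prevalence. The sketch below assumes the test prevalence matches the rounded training prevalence of 0.306 from the output above, so it only approximates the reported 0.6157; the helper name constant_log_loss is made up for this example.

```python
import math

def constant_log_loss(p_pred: float, prevalence: float) -> float:
    """Log loss when every sample gets probability p_pred and a fraction
    `prevalence` of the samples is actually positive."""
    return -(prevalence * math.log(p_pred)
             + (1 - prevalence) * math.log(1 - p_pred))

prevalence = 0.306  # rounded value taken from the printed output (assumption)

baseline = constant_log_loss(prevalence, prevalence)
coin_flip = constant_log_loss(0.5, prevalence)

print(f"predict prevalence for everyone: {baseline:.4f}")
print(f"predict 0.5 for everyone:        {coin_flip:.4f}")
```

Predicting 0.5 for everyone always costs ln 2 ≈ 0.6931, so the prevalence baseline is the stricter yardstick.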

2) “Best by CV …”

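    * neg_log_loss: GridSearchCV maximizes this, so the CV log loss = -neg_log_loss.

        * Example: neg_log_loss = -0.2333 ⇒ CV log loss = 0.2333.

    * mean ± std: mean over folds and its variability. Note that best_score_ is the mean only; the printout above does not show the std, which is stored in gs.cv_results_["std_test_score"] at index gs.best_index_.

        * Small std → stable across splits. Large std → results depend on the split (try more data/features or stronger regularization).

    * Best params (C, penalty, class_weight):

        * Smaller C = stronger regularization.

        * penalty='l1' often zeros many coefficients (feature selection); 'l2' shrinks them smoothly.

        * class_weight='balanced' upweights the rare class, which can help recall when positives are rare, but it shifts predicted probabilities upward and so tends to worsen log loss (the search above selected class_weight=None).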
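The fold-to-fold spread for the winning configuration can be read from cv_results_. This is a minimal sketch on a small stand-in dataset with a reduced grid (both are assumptions for illustration); the same attribute lookups apply to the gs object in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Small stand-in dataset and grid, chosen only to keep the sketch fast
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_log_loss",
    cv=cv,
)
search.fit(X, y)

# best_index_ points at the row of cv_results_ for the best parameter setting
i = search.best_index_
mean_score = search.cv_results_["mean_test_score"][i]
std_score = search.cv_results_["std_test_score"][i]

print(f"CV log loss: {-mean_score:.4f} ± {std_score:.4f} "
      f"(C={search.best_params_['C']})")
```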

3) Test set performance (unbiased)

Read these on the held-out test only after tuning:

  * Log loss (lower is better): measures probability quality.

    * Compare to baseline log loss and to CV log loss (here 0.2361 on test vs. 0.2333 in CV).

    * Close to CV → good generalization. Much worse → overfitting or leakage in CV.

  * ROC AUC (higher is better): ranking ability; 0.5 ≈ random, 1.0 = perfect. Values ≥0.8 are often read as strong, but the bar depends on the problem.

  * Brier score (lower is better): mean squared error of the predicted probabilities; reflects calibration and complements log loss.

  * Accuracy@0.5: accuracy after converting probabilities to labels at threshold 0.5.

    * Useful for a quick read but can mislead under class imbalance. If positives are rare, also check precision/recall or PR-AUC.

  * Sanity check: If test log loss ≫ CV log loss, or AUC drops a lot, revisit splits, leakage, or regularization strength.
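A toy case shows why accuracy at a fixed threshold can mislead under imbalance. The labels and the constant prediction below are invented for illustration: a "model" that gives everyone a 5% probability never predicts a positive, yet still scores 95% accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, recall_score

# Invented labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A useless "model": the same probability for every sample
p_const = np.full(100, 0.05)
y_pred = (p_const >= 0.5).astype(int)  # nobody crosses the 0.5 threshold

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
ap = average_precision_score(y_true, p_const)

print(f"accuracy@0.5: {acc:.2f}")  # high, despite no positives found
print(f"recall:       {rec:.2f}")  # no positives recovered
print(f"PR-AUC:       {ap:.2f}")   # equals the prevalence for a constant score
```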

4) Top coefficients

  * Sign = direction (positive increases odds of class 1; negative decreases).

  * Magnitude = change in log-odds per one standard deviation of the feature (the pipeline standardizes features), so magnitudes are comparable across features.

  * With L1, expect many near-zero weights (implicit feature selection).

  * Use these for interpretability, not as proof of causation.
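Coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below uses three coefficients copied from the printed output above. Because the features are standardized, each ratio is per one standard deviation increase, holding the other features fixed.

```python
import math

# Coefficients copied from the printed output above (standardized features)
coefs = {"x02": 3.0580, "x15": 1.3475, "x05": -0.6580}

# exp(coef) = multiplicative change in the odds of class 1 per +1 SD
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: coef={coefs[name]:+.4f} -> odds x {ratio:.2f} per +1 SD")
```

A ratio above 1 raises the odds of class 1 and a ratio below 1 lowers them. For x02, for example, the odds are multiplied by roughly 21 per standard deviation.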

5) What to do next (based on what you see)

  * Beating baseline but high log loss: add/engineer features; try Elastic Net; check calibration (isotonic/Platt) if probabilities seem off.

  * High variance across folds: increase regularization (smaller C), simplify features, or get more data.

  * Good AUC but mediocre log loss/Brier: ranking is fine, probabilities may be miscalibrated → consider calibration.

  * Accuracy looks great, AUC/PR-AUC mediocre: accuracy at 0.5 may be flattered by class imbalance. Choose the threshold on validation data for the metric that matches your costs.
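As a rough sketch of that last step, the threshold can be chosen on a validation split by scanning the precision-recall curve. F1 is used here only as a stand-in for "the metric that matches your costs", and the dataset, split, and untuned model are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Illustrative imbalanced data (~10% positives)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
p_val = model.predict_proba(X_val)[:, 1]

# precision/recall have one more entry than thresholds; drop the final pair
precision, recall, thresholds = precision_recall_curve(y_val, p_val)
prec, rec = precision[:-1], recall[:-1]
denom = prec + rec
f1 = np.where(denom > 0, 2 * prec * rec / np.maximum(denom, 1e-12), 0.0)

best_threshold = thresholds[int(np.argmax(f1))]

f1_default = f1_score(y_val, (p_val >= 0.5).astype(int))
f1_tuned = f1_score(y_val, (p_val >= best_threshold).astype(int))

print(f"F1 at 0.5:            {f1_default:.4f}")
print(f"F1 at {best_threshold:.3f} (tuned): {f1_tuned:.4f}")
```

The threshold should be picked on validation data and then applied unchanged to the test set, so that the test metrics stay unbiased.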