[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jkitchin/s26-06642/blob/main/dsmles/07-classification/classification.ipynb)

In [None]:
! pip install -q pycse
from pycse.colab import pdf

```{index} classification
```


# Classification

## Learning Objectives

By the end of this lecture, you will be able to:

1. Understand the difference between regression and classification
2. Apply logistic regression for binary and multi-class problems
3. Evaluate classifiers using appropriate metrics (accuracy, precision, recall, F1)
4. Interpret confusion matrices and ROC curves
5. Handle imbalanced classes
6. Choose the right metric for your problem

## From Regression to Classification

So far, we've predicted continuous values: reaction rates, yields, material properties. But many engineering problems are **classification** tasks:

- **Quality control**: Does this batch pass or fail specifications?
- **Fault detection**: Is the reactor operating normally or abnormally?
- **Material classification**: Is this sample crystalline or amorphous?
- **Process monitoring**: Which of 5 operating regimes are we in?

The key difference:
- **Regression**: Predict a continuous number (y ∈ ℝ)
- **Classification**: Predict a category (y ∈ {A, B, C, ...})

### Why Not Just Use Linear Regression?

You might try encoding classes as numbers (fail=0, pass=1) and using linear regression. This has problems:

1. **Predictions outside [0,1]**: Linear regression can predict 1.3 or -0.2—what do those mean?
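2. **Non-constant variance**: A binary outcome with probability p has variance p(1 − p), so the error variance is largest near 0.5 and shrinks toward 0 and 1, violating the constant-variance assumption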
3. **Wrong assumptions**: Linear regression assumes normal errors; binary outcomes are Bernoulli distributed

We need a model designed for classification: **logistic regression**.
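To make the first problem concrete, here is a minimal sketch that fits an ordinary least-squares line to 0/1 labels and evaluates it near the edges of the data. The toy values (`x_toy`, `y_toy`, `x_new`) are made up for illustration and are not part of the lecture dataset.

```python
import numpy as np

# Hypothetical toy data: one feature, labels encoded as fail=0, pass=1
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Ordinary least-squares line through the 0/1 labels
slope, intercept = np.polyfit(x_toy, y_toy, deg=1)

# Evaluate the line at and beyond the ends of the data
x_new = np.array([0.0, 1.0, 8.0, 10.0])
y_line = slope * x_new + intercept
for xi, yi in zip(x_new, y_line):
    print(f"x = {xi:4.1f} -> linear 'probability' = {yi:+.2f}")
```

Even inside the range of the training data, the fitted line dips below 0 and rises above 1, so its output cannot be read as a probability.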

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, ConfusionMatrixDisplay,
                             classification_report, roc_curve, roc_auc_score)
from sklearn.datasets import make_classification

np.random.seed(42)

```{index} logistic regression, sigmoid function
```


## Logistic Regression

Despite its name, logistic regression is a **classification** algorithm. It models the probability of belonging to a class.

### The Logistic (Sigmoid) Function

The key idea: transform the linear combination of features through a sigmoid function:

$$P(y=1|x) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$

The sigmoid maps any real number to (0, 1)—perfect for probabilities!

In [None]:
# Visualize the sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-6, 6, 100)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Sigmoid function
axes[0].plot(z, sigmoid(z), 'b-', linewidth=2)
axes[0].axhline(0.5, color='gray', linestyle='--', alpha=0.5)
axes[0].axvline(0, color='gray', linestyle='--', alpha=0.5)
axes[0].set_xlabel('z = w·x + b')
axes[0].set_ylabel('P(y=1)')
axes[0].set_title('Sigmoid Function')
axes[0].set_ylim(-0.1, 1.1)

# Linear regression vs logistic regression
x = np.linspace(0, 10, 50)
y_linear = 0.15 * x - 0.25
y_logistic = sigmoid(1.5 * x - 7)

axes[1].plot(x, y_linear, 'r--', label='Linear regression', linewidth=2)
axes[1].plot(x, y_logistic, 'b-', label='Logistic regression', linewidth=2)
axes[1].axhline(0, color='gray', alpha=0.3)
axes[1].axhline(1, color='gray', alpha=0.3)
axes[1].fill_between(x, 0, 1, alpha=0.1, color='green')
axes[1].set_xlabel('Feature value')
axes[1].set_ylabel('Predicted probability')
axes[1].set_title('Why Logistic Regression?')
axes[1].legend()
axes[1].set_ylim(-0.3, 1.3)
axes[1].annotate('Valid probability\nrange [0, 1]', xy=(8, 0.5), fontsize=10)

plt.tight_layout()
plt.show()

### Example: Quality Control Classification

Let's classify chemical batches as pass/fail based on process measurements.

In [None]:
# Simulate quality control data
# Features: temperature deviation, pressure deviation, impurity level
np.random.seed(42)
n_samples = 300

# Generate features
temp_dev = np.random.normal(0, 2, n_samples)  # Temperature deviation from setpoint
pressure_dev = np.random.normal(0, 1.5, n_samples)  # Pressure deviation
impurity = np.random.exponential(0.5, n_samples)  # Impurity level

# Quality depends on all three (with some randomness)
quality_score = -0.3*np.abs(temp_dev) - 0.4*np.abs(pressure_dev) - 1.5*impurity + 1
quality_prob = sigmoid(quality_score * 2)
quality = (np.random.random(n_samples) < quality_prob).astype(int)

df = pd.DataFrame({
    'temp_deviation': temp_dev,
    'pressure_deviation': pressure_dev,
    'impurity_level': impurity,
    'quality': quality  # 1 = pass, 0 = fail
})

print(f"Dataset shape: {df.shape}")
print(f"\nClass distribution:")
print(df['quality'].value_counts())
print(f"\nPass rate: {df['quality'].mean():.1%}")

We've created a simulated quality control dataset where:
- The class balance is given by the pass rate printed above (quality=1 is a pass, quality=0 is a fail)
- Quality depends on operating conditions in physically sensible ways:
  - Larger temperature deviations → worse quality
  - Larger pressure deviations → worse quality
  - Higher impurity levels → worse quality

Both classes are well represented here, so neither is rare. Later we'll see what happens with strongly imbalanced data, where the minority class is much harder to predict.

In [None]:
# Prepare data
X = df[['temp_deviation', 'pressure_deviation', 'impurity_level']]
y = df['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features (LogisticRegression is L2-regularized by default, so feature scale affects the fit)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train logistic regression
clf = LogisticRegression()
clf.fit(X_train_scaled, y_train)

# Make predictions
y_pred = clf.predict(X_test_scaled)
y_prob = clf.predict_proba(X_test_scaled)[:, 1]  # Probability of class 1

print("Model coefficients (standardized):")
for name, coef in zip(X.columns, clf.coef_[0]):
    print(f"  {name}: {coef:.3f}")
print(f"  intercept: {clf.intercept_[0]:.3f}")

**Interpreting the standardized coefficients:**

All three coefficients are **negative**, which makes physical sense:
- **Impurity level** (-1.3): Strongest negative effect—impurities kill quality
- **Pressure deviation** (-0.5): Moderate effect—process variations hurt
- **Temperature deviation** (-0.3): Weakest effect in this data

The signs tell us that increasing any of these variables decreases the log-odds of passing, which matches our physical intuition. Unlike linear regression, where coefficients carry the units of the response, logistic regression coefficients are in **log-odds**. Because the features were standardized, one unit is one standard deviation: a one-standard-deviation increase in impurity decreases the log-odds of passing by about 1.3.

The model learned the correct ranking of importance and direction of effects from data alone!
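Log-odds are easier to read once converted to odds ratios, since exponentiating a coefficient gives the factor by which the odds change. The sketch below assumes the rounded coefficient values quoted in the interpretation above; your fitted values may differ slightly.

```python
import numpy as np

# Assumed coefficient values, copied from the interpretation above
coefs = {"impurity_level": -1.3, "pressure_deviation": -0.5, "temp_deviation": -0.3}

# exp(coefficient) is the factor by which the odds of passing change
# for a one-standard-deviation increase in that (standardized) feature
odds_ratios = {name: float(np.exp(c)) for name, c in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds of passing multiplied by {ratio:.2f}")
```

An odds ratio below 1 means the odds of passing shrink as the feature increases; the further below 1, the stronger the effect.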

## Classification Metrics: Beyond Accuracy

Accuracy seems like a natural metric: what fraction did we get right?

$$\text{Accuracy} = \frac{\text{Correct predictions}}{\text{Total predictions}}$$

But accuracy can be **misleading**, especially with imbalanced classes.

### The Imbalanced Class Problem

Imagine a fault detection system where faults occur 1% of the time. A model that always predicts "no fault" achieves 99% accuracy—but catches zero faults!

We need metrics that consider the **types of errors**.
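The 1% fault scenario can be checked directly. This is a small sketch with made-up labels (`y_true`) and a trivial "model" that never predicts a fault.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 1000 samples, 1% faults (1 = fault, 0 = normal)
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A "model" that always predicts no fault
y_always_normal = np.zeros_like(y_true)

acc_trivial = accuracy_score(y_true, y_always_normal)
rec_trivial = recall_score(y_true, y_always_normal, zero_division=0)
print(f"Accuracy: {acc_trivial:.2%}")          # 990 of 1000 correct
print(f"Recall on faults: {rec_trivial:.2%}")  # 0 of 10 faults caught
```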

```{index} confusion matrix
```


### The Confusion Matrix

A confusion matrix breaks down predictions by actual vs predicted class:

|  | Predicted Negative | Predicted Positive |
|--|-------------------|--------------------|
| Actual Negative | True Negative (TN) | False Positive (FP) |
| Actual Positive | False Negative (FN) | True Positive (TP) |

- **True Positive (TP)**: Correctly predicted positive
- **True Negative (TN)**: Correctly predicted negative
- **False Positive (FP)**: Predicted positive, actually negative (Type I error)
- **False Negative (FN)**: Predicted negative, actually positive (Type II error)
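As a sanity check on these definitions, the four counts can be computed by hand with boolean masks and compared with scikit-learn. The labels below are hypothetical toy values.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels (1 = positive, 0 = negative)
y_true_toy = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred_toy = np.array([1, 0, 1, 0, 1, 0, 0, 1])

tp_toy = int(np.sum((y_pred_toy == 1) & (y_true_toy == 1)))
tn_toy = int(np.sum((y_pred_toy == 0) & (y_true_toy == 0)))
fp_toy = int(np.sum((y_pred_toy == 1) & (y_true_toy == 0)))
fn_toy = int(np.sum((y_pred_toy == 0) & (y_true_toy == 1)))

# scikit-learn puts actual classes in rows and predicted classes in columns,
# so ravel() yields TN, FP, FN, TP for labels {0, 1}
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
print("Manual  (TN, FP, FN, TP):", (tn_toy, fp_toy, fn_toy, tp_toy))
print("sklearn (TN, FP, FN, TP):", tuple(int(v) for v in cm_toy.ravel()))
```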

In [None]:
# Compute and display confusion matrix
cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(6, 5))
disp = ConfusionMatrixDisplay(cm, display_labels=['Fail', 'Pass'])
disp.plot(ax=ax, cmap='Blues', values_format='d')
ax.set_title('Confusion Matrix for Quality Control')
plt.tight_layout()
plt.show()

tn, fp, fn, tp = cm.ravel()
print(f"True Negatives (correctly predicted fails): {tn}")
print(f"False Positives (fails predicted as pass): {fp}")
print(f"False Negatives (passes predicted as fail): {fn}")
print(f"True Positives (correctly predicted passes): {tp}")

**Reading the confusion matrix:**

The matrix shows all four outcomes:
- **True Negatives (top-left)**: Correctly predicted failures
- **False Positives (top-right)**: We said "pass" but it actually failed—**escaped defects!**
- **False Negatives (bottom-left)**: We said "fail" but it actually passed—**wasted good product**
- **True Positives (bottom-right)**: Correctly predicted passes

**In quality control, FP and FN have different costs:**
- **False Positives** (predicting pass when it's fail): Bad product ships to customer! Warranty claims, reputation damage.
- **False Negatives** (predicting fail when it's pass): Good product gets scrapped. Lost revenue, but no customer impact.

Most companies would rather have more FN than FP—it's better to be overly cautious. This asymmetry is why accuracy alone isn't enough to evaluate a classifier.

```{index} precision, recall
```


### Precision and Recall

From the confusion matrix, we derive more informative metrics:

**Precision**: Of all predicted positives, how many were actually positive?
$$\text{Precision} = \frac{TP}{TP + FP}$$

**Recall (Sensitivity)**: Of all actual positives, how many did we catch?
$$\text{Recall} = \frac{TP}{TP + FN}$$
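A short worked example of both formulas, using hypothetical confusion-matrix counts chosen only for illustration:

```python
# Hypothetical confusion-matrix counts, chosen only for illustration
tp_ex, fp_ex, fn_ex, tn_ex = 40, 10, 20, 130

precision_ex = tp_ex / (tp_ex + fp_ex)  # 40 / 50
recall_ex = tp_ex / (tp_ex + fn_ex)     # 40 / 60
accuracy_ex = (tp_ex + tn_ex) / (tp_ex + fp_ex + fn_ex + tn_ex)  # 170 / 200

print(f"Precision: {precision_ex:.3f}")
print(f"Recall:    {recall_ex:.3f}")
print(f"Accuracy:  {accuracy_ex:.3f}")
```

With these counts, accuracy looks better than recall because the 130 true negatives dominate the total.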

### The Precision-Recall Tradeoff

You usually can't maximize both:
- High precision (few false positives) often means low recall (miss some positives)
- High recall (catch most positives) often means low precision (more false alarms)

**Which matters more depends on your problem:**

| Scenario | Prioritize | Why |
|----------|------------|-----|
| Cancer screening | Recall | Don't miss any cancers, even at the cost of some false positives |
| Spam filter | Precision | Don't lose important emails to the spam folder |
| Fault detection | Recall | Catch all faults, even at the cost of some false alarms |
| Expensive inspection | Precision | Only trigger when likely true to save costs |

In [None]:
# Calculate all metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Classification Metrics:")
print(f"  Accuracy:  {accuracy:.3f}")
print(f"  Precision: {precision:.3f}")
print(f"  Recall:    {recall:.3f}")
print(f"  F1 Score:  {f1:.3f}")

```{index} F1 score
```


### F1 Score: Balancing Precision and Recall

The F1 score is the harmonic mean of precision and recall:

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Why harmonic mean? It penalizes extreme imbalances. If precision = 0.9 and recall = 0.1:
- Arithmetic mean: (0.9 + 0.1) / 2 = 0.5
- Harmonic mean: 2 × 0.9 × 0.1 / (0.9 + 0.1) = 0.18

F1 is only high when **both** precision and recall are reasonable.
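The comparison above can be reproduced for a few precision/recall pairs. The helper name `f1_from` is introduced here for illustration only.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

for p, r in [(0.9, 0.1), (0.5, 0.5), (0.9, 0.9)]:
    arithmetic = (p + r) / 2
    print(f"P={p}, R={r}: arithmetic={arithmetic:.2f}, F1={f1_from(p, r):.2f}")
```

The first two pairs have the same arithmetic mean, but only the balanced pair keeps a comparable F1.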

In [None]:
# Complete classification report
print("\nComplete Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Fail', 'Pass']))

```{index} ROC curve, AUC
```


## ROC Curves and AUC

So far, we've used a threshold of 0.5: predict class 1 if P(y=1) > 0.5.

But this threshold is adjustable! Lowering it can only raise recall (or leave it unchanged), and it usually lowers precision because more negatives get flagged as positive.

The **ROC curve** (Receiver Operating Characteristic) shows the tradeoff across all thresholds:
- X-axis: False Positive Rate = FP / (FP + TN)
- Y-axis: True Positive Rate = TP / (TP + FN) = Recall

**AUC** (Area Under the Curve) summarizes model performance:
- AUC = 1.0: Perfect classifier
- AUC = 0.5: Random guessing (diagonal line)
- AUC > 0.8: Generally considered good
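One way to build intuition for AUC: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below checks this on made-up scores (`y_demo`, `scores_demo`) and also shows that a strictly increasing transform of the scores leaves AUC unchanged, since only the ordering matters.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores for illustration
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
# Positives get scores shifted upward, so the classes partially overlap
scores_demo = rng.normal(loc=y_demo * 1.0, scale=1.0)

pos = scores_demo[y_demo == 1]
neg = scores_demo[y_demo == 0]

# Fraction of (positive, negative) pairs where the positive scores higher;
# ties count as half
diff = pos[:, None] - neg[None, :]
auc_pairwise = (diff > 0).mean() + 0.5 * (diff == 0).mean()

auc_sklearn = roc_auc_score(y_demo, scores_demo)
auc_transformed = roc_auc_score(y_demo, np.exp(scores_demo))  # same ordering

print(f"Pairwise AUC:        {auc_pairwise:.4f}")
print(f"sklearn AUC:         {auc_sklearn:.4f}")
print(f"AUC after exp(score): {auc_transformed:.4f}")
```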

In [None]:
# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# ROC curve
axes[0].plot(fpr, tpr, 'b-', linewidth=2, label=f'Logistic Regression (AUC = {auc:.3f})')
axes[0].plot([0, 1], [0, 1], 'k--', label='Random Classifier (AUC = 0.5)')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate (Recall)')
axes[0].set_title('ROC Curve')
axes[0].legend(loc='lower right')
axes[0].set_xlim(-0.02, 1.02)
axes[0].set_ylim(-0.02, 1.02)

# Threshold selection
axes[1].plot(thresholds, fpr, 'r-', label='False Positive Rate', linewidth=2)
axes[1].plot(thresholds, tpr, 'b-', label='True Positive Rate', linewidth=2)
axes[1].axvline(0.5, color='gray', linestyle='--', alpha=0.5, label='Default threshold')
axes[1].set_xlabel('Threshold')
axes[1].set_ylabel('Rate')
axes[1].set_title('Effect of Threshold on Rates')
axes[1].legend()
axes[1].set_xlim(0, 1)

plt.tight_layout()
plt.show()

### Choosing a Threshold

The default 0.5 threshold isn't always optimal:

- **Safety-critical applications**: Lower threshold (catch more positives, accept more false alarms)
- **Cost-sensitive applications**: Adjust based on the cost of each error type

Example: If a false negative (missing a fault) costs \$100,000 and a false positive (false alarm) costs \$1,000, you want to lower the threshold to catch more faults.
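The cost argument can be turned into a threshold search. This is a rough sketch under stated assumptions: the costs come from the example above, while the labels and predicted probabilities (`y_fault`, `p_fault`) are synthetic and the helper `total_cost` is hypothetical.

```python
import numpy as np

# Costs taken from the example above
COST_FN = 100_000  # missed fault
COST_FP = 1_000    # false alarm

# Hypothetical data: about 5% faults, with made-up predicted fault probabilities
rng = np.random.default_rng(1)
y_fault = (rng.random(2000) < 0.05).astype(int)
# Faults tend to receive higher probabilities than normal samples
p_fault = np.clip(rng.normal(0.25 + 0.4 * y_fault, 0.15), 0, 1)

def total_cost(threshold):
    """Total cost of the errors made at a given decision threshold."""
    pred = (p_fault >= threshold).astype(int)
    n_fn = np.sum((pred == 0) & (y_fault == 1))
    n_fp = np.sum((pred == 1) & (y_fault == 0))
    return int(n_fn * COST_FN + n_fp * COST_FP)

candidates = np.linspace(0.05, 0.95, 19)
costs = np.array([total_cost(t) for t in candidates])
best = float(candidates[np.argmin(costs)])
min_cost = int(costs.min())

print(f"Cost at 0.5 threshold: {total_cost(0.5):,}")
print(f"Lowest-cost threshold: {best:.2f} (cost {min_cost:,})")
```

Because a missed fault is 100 times more expensive than a false alarm, the lowest-cost threshold sits below 0.5. In practice the search should run on validation data, not on the test set used for final reporting.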

In [None]:
# Compare different thresholds
thresholds_to_try = [0.3, 0.5, 0.7]

print("Effect of threshold on metrics:")
print(f"{'Threshold':<12} {'Accuracy':<10} {'Precision':<10} {'Recall':<10} {'F1':<10}")
print("-" * 52)

for thresh in thresholds_to_try:
    y_pred_thresh = (y_prob >= thresh).astype(int)
    acc = accuracy_score(y_test, y_pred_thresh)
    prec = precision_score(y_test, y_pred_thresh, zero_division=0)
    rec = recall_score(y_test, y_pred_thresh)
    f1_thresh = f1_score(y_test, y_pred_thresh)
    print(f"{thresh:<12.1f} {acc:<10.3f} {prec:<10.3f} {rec:<10.3f} {f1_thresh:<10.3f}")

## Multi-Class Classification

Many problems have more than two classes:
- Classify material phase: solid, liquid, gas
- Identify catalyst type: Pt, Pd, Ni, Cu
- Determine operating regime: startup, steady-state, shutdown, fault

Logistic regression extends to multi-class via:
- **One-vs-Rest (OvR)**: Train K binary classifiers, each separating one class from the rest
- **Multinomial**: Directly model probabilities for all K classes (softmax)
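For the multinomial approach, the softmax function plays the role that the sigmoid plays in the binary case. A minimal sketch, with hypothetical class scores standing in for the K linear functions of the features:

```python
import numpy as np

def softmax(scores):
    """Convert K real-valued class scores into K probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical linear scores (one per class) for K = 4 classes
class_scores = np.array([2.0, 1.0, 0.1, -1.0])
probs = softmax(class_scores)
print(np.round(probs, 3))
print("Sum:", probs.sum())
print("Predicted class:", int(np.argmax(probs)))
```

With two classes and scores `[z, 0]`, the first softmax probability reduces to the sigmoid of `z`, which is why binary logistic regression is a special case.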

In [None]:
# Create multi-class dataset: classifying reactor operating regimes
# 0 = Normal, 1 = High Temperature, 2 = High Pressure, 3 = Fault

X_multi, y_multi = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    n_classes=4, n_clusters_per_class=1, random_state=42
)

regime_names = ['Normal', 'High Temp', 'High Pressure', 'Fault']
print("Class distribution:")
unique, counts = np.unique(y_multi, return_counts=True)
for cls, cnt in zip(unique, counts):
    print(f"  {regime_names[cls]}: {cnt}")

In [None]:
# Train multi-class logistic regression
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(
    X_multi, y_multi, test_size=0.2, random_state=42
)

scaler_m = StandardScaler()
X_train_m_scaled = scaler_m.fit_transform(X_train_m)
X_test_m_scaled = scaler_m.transform(X_test_m)

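# The default lbfgs solver fits a multinomial (softmax) model for multi-class targets.
# The multi_class argument is deprecated in recent scikit-learn releases, so it is omitted.
clf_multi = LogisticRegression(max_iter=1000)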
clf_multi.fit(X_train_m_scaled, y_train_m)

y_pred_m = clf_multi.predict(X_test_m_scaled)

print("Multi-class Classification Report:")
print(classification_report(y_test_m, y_pred_m, target_names=regime_names))

In [None]:
# Multi-class confusion matrix
cm_multi = confusion_matrix(y_test_m, y_pred_m)

fig, ax = plt.subplots(figsize=(8, 6))
disp = ConfusionMatrixDisplay(cm_multi, display_labels=regime_names)
disp.plot(ax=ax, cmap='Blues', values_format='d')
ax.set_title('Multi-class Confusion Matrix: Reactor Operating Regimes')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

```{index} imbalanced classes, class weight
```


## Handling Imbalanced Classes

In many real-world problems, classes are imbalanced:
- Fault detection: 99% normal, 1% fault
- Quality control: 95% pass, 5% fail
- Rare event prediction: 99.9% non-events

Standard classifiers tend to ignore the minority class. Solutions:

| Approach | Description | When to Use |
|----------|-------------|-------------|
| Class weights | Penalize errors on minority class more | Simple, often effective |
| Oversampling | Duplicate minority samples (random oversampling) or synthesize new ones (e.g., SMOTE) | Small datasets |
| Undersampling | Remove majority samples | Very large datasets |
| Threshold adjustment | Lower threshold for minority class | When you control deployment |

In [None]:
# Create imbalanced dataset (90% normal, 10% fault)
X_imb, y_imb = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    n_classes=2, weights=[0.9, 0.1], random_state=42
)

print("Imbalanced class distribution:")
print(f"  Normal: {(y_imb == 0).sum()}")
print(f"  Fault:  {(y_imb == 1).sum()}")
print(f"  Fault rate: {y_imb.mean():.1%}")

In [None]:
# Compare balanced vs unbalanced classifiers
X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42
)

# Standard classifier
clf_standard = LogisticRegression()
clf_standard.fit(X_train_i, y_train_i)
y_pred_standard = clf_standard.predict(X_test_i)

# Balanced classifier (class_weight='balanced')
clf_balanced = LogisticRegression(class_weight='balanced')
clf_balanced.fit(X_train_i, y_train_i)
y_pred_balanced = clf_balanced.predict(X_test_i)

print("Standard Logistic Regression:")
print(f"  Accuracy: {accuracy_score(y_test_i, y_pred_standard):.3f}")
print(f"  Recall (Fault): {recall_score(y_test_i, y_pred_standard):.3f}")
print(f"  F1 (Fault): {f1_score(y_test_i, y_pred_standard):.3f}")

print("\nBalanced Logistic Regression:")
print(f"  Accuracy: {accuracy_score(y_test_i, y_pred_balanced):.3f}")
print(f"  Recall (Fault): {recall_score(y_test_i, y_pred_balanced):.3f}")
print(f"  F1 (Fault): {f1_score(y_test_i, y_pred_balanced):.3f}")

**The balanced classifier tradeoff is clear:**

| Metric | Standard | Balanced |
|--------|----------|----------|
| Accuracy | Higher | Lower |
| Recall (Faults) | ~50% | ~80% |
| F1 (Faults) | Lower | Higher |

The standard classifier minimizes the average log-loss over all samples, so the abundant normal class dominates the fit and the rare fault class is largely ignored. The balanced classifier sacrifices some overall accuracy to catch more faults.

**What "balanced" does**: It weights each class inversely to its frequency, so an error on a minority-class sample costs more than an error on a majority-class sample. The effect is similar to replicating minority samples until the classes are equal. The result: the model pays more attention to the rare class.
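The weights behind `class_weight='balanced'` follow the formula given in the scikit-learn documentation, `n_samples / (n_classes * np.bincount(y))`. The sketch below computes them by hand for hypothetical labels (`y_w`) and compares the result with scikit-learn's helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 900 normal (0) and 100 fault (1)
y_w = np.array([0] * 900 + [1] * 100)

# Formula documented by scikit-learn for class_weight='balanced'
n_samples_w, n_classes_w = len(y_w), 2
manual_weights = n_samples_w / (n_classes_w * np.bincount(y_w))

sk_weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_w
)
print("Manual: ", manual_weights)
print("sklearn:", sk_weights)
```

With a 9:1 class ratio, each fault sample is weighted nine times as heavily as each normal sample.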

**The 90% vs 95% accuracy paradox**: In imbalanced data, a model with 90% accuracy that catches 80% of faults is often *more useful* than a model with 95% accuracy that catches only 50% of faults. Always look beyond accuracy!

### Key Insight

The balanced classifier has lower accuracy but **much higher recall** for faults. In fault detection, catching faults is more important than overall accuracy!

Always ask: "What is the cost of each type of error?" Let that guide your metric choice and class weighting.

## Choosing the Right Metric

| Scenario | Recommended Metric | Reasoning |
|----------|-------------------|------------|
| Balanced classes, equal error costs | Accuracy or F1 | Both give a reasonable picture |
| Imbalanced classes | F1 or AUC | Accuracy misleading |
| Missing positives is costly | Recall | Prioritize catching all positives |
| False alarms are costly | Precision | Prioritize being right when positive |
| Ranking matters (not just yes/no) | AUC | Measures how well scores rank positives above negatives (not probability calibration) |

### A Decision Framework

1. **Understand the costs**: What happens if you miss a positive? What happens if you raise a false alarm?
2. **Check class balance**: If imbalanced, avoid accuracy as the primary metric
3. **Consider the use case**: Are you making binary decisions or ranking candidates?
4. **Report multiple metrics**: No single metric tells the whole story

## Classification vs Regression: How to Choose

Sometimes the boundary is fuzzy:

| Problem | Natural Framing | Alternative |
|---------|-----------------|-------------|
| Product quality | Classification (pass/fail) | Regression (quality score) |
| Equipment failure | Classification (fail/ok) | Regression (time to failure) |
| Customer churn | Classification (leave/stay) | Regression (probability of leaving) |

**Guidelines**:
- If the outcome is naturally categorical, use classification
- If you need probabilities as well as class labels, logistic regression gives you both
- If there's a natural ordering or you care about degree, consider regression
- You can always discretize regression outputs if needed

## Quiz

Test your understanding of classification concepts.

In [None]:
! pip install -q jupyterquiz
from jupyterquiz import display_quiz

display_quiz("https://raw.githubusercontent.com/jkitchin/s26-06642/main/dsmles/07-classification/quizzes/classification-quiz.json")

## Recommended Reading

These resources explore classification methods and evaluation metrics:

1. **[Scikit-learn Classification Guide](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)** - Official documentation on logistic regression including multi-class strategies, regularization options, and solver selection.

2. **[An Introduction to Statistical Learning, Chapter 4](https://www.statlearning.com/)** - Covers logistic regression, LDA, and classification concepts. Clear explanations of the math behind classification.

3. **[The Precision-Recall Tradeoff (Google Developers)](https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall)** - Interactive tutorial on classification metrics with visualizations of how threshold changes affect precision and recall.

4. **[ROC Curves and AUC Explained](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)** - Clear explanation of ROC curves with interactive examples. Helps build intuition for what AUC actually measures.

5. **[Learning from Imbalanced Data (He & Garcia, IEEE TKDE 2009)](https://ieeexplore.ieee.org/document/5128907)** - Survey paper on handling class imbalance. Covers sampling methods, cost-sensitive learning, and evaluation strategies for imbalanced datasets.

## Summary

### Key Takeaways

1. **Classification predicts categories, not numbers**: Use when outcomes are discrete classes

2. **Logistic regression is the workhorse**: Simple, interpretable, gives probabilities

3. **Accuracy can be misleading**: Especially with imbalanced classes

4. **Know your metrics**:
   - Precision: How many predicted positives are correct?
   - Recall: How many actual positives did we find?
   - F1: Harmonic mean of precision and recall
   - AUC: Overall ranking quality across all thresholds

5. **The confusion matrix is your friend**: Visualizes all error types

6. **Handle imbalanced classes**: Use class weights or adjust thresholds

7. **Choose metrics based on costs**: What's the cost of missing a positive vs false alarm?

### What's Next

In the next module, we'll explore **regularization**—techniques to prevent overfitting by constraining model complexity. These apply to both regression and classification.

---

## The Catalyst Crisis: Chapter 7 - "Accuracy Isn't Everything"

*A story about classification metrics and real-world tradeoffs*

---

"Ninety-four percent accuracy," Sam announced proudly. "Our classifier can predict batch failures before they happen."

The team had pivoted from regression to classification—instead of predicting exact yield, they were now predicting pass/fail. ChemCorp could use this to catch bad batches early, maybe even prevent them.

Frank Morrison was on the video call, skeptical as always. "Ninety-four percent sounds good. What's the catch?"

Alex had been digging through the confusion matrix. She found the catch.

"We're catching 62% of the failures," she said quietly.

Frank frowned. "Sixty-two? You said ninety-four."

"Ninety-four percent overall accuracy. But that's because most batches pass. When we predict 'pass,' we're usually right. But when a batch is actually going to fail, we only catch it 62% of the time."

Maya pulled up the numbers. "So about 40% of the bad batches slip through."

"That's not acceptable." Frank's voice was hard. "A bad batch that ships costs us \$200,000. I don't care about overall accuracy. I care about catching failures."

Sam looked deflated. "So our model is useless?"

"No," Alex said. "It's optimizing for the wrong thing." She turned to the screen. "We can adjust the threshold. Accept more false alarms in exchange for catching more real failures. It's a trade-off."

She adjusted the classification threshold, watching the metrics shift. Recall, the percentage of actual failures caught, climbed to 89%. But precision dropped. More false alarms.

"So now we're stopping good batches unnecessarily?" Frank asked.

"Some. But we're catching almost all the bad ones." Alex pulled up a cost analysis. "False alarms cost you the time to investigate, maybe \$5,000 per batch. Missed failures cost you \$200,000. What's the right trade-off?"

The room was quiet. This was the reality of applied ML—not just building models, but making decisions about what errors you could live with.

"Give me the sensitive version," Frank said finally. "I'd rather investigate ten batches than ship one bad one."

After the call, Jordan found Alex at the mystery board. "That was uncomfortable."

"Real decisions usually are." She added a note: **Precision vs. recall trade-off. ChemCorp values catching failures over avoiding false alarms.**

"You handled Frank well."

Alex shrugged. "Seven years of dealing with operations managers. They don't want perfect—they want useful."

---

*Continue to the next lecture to learn about regularization and model selection...*