[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jkitchin/s26-06642/blob/main/dsmles/participation/participation-08-regularization.ipynb)

# Module 08: Regularization & Model Selection - Participation Exercises

## Exercise Types

| Type | Icon | Description | Time |
|------|------|-------------|------|
| **Reflection** | ü§î | Personal reflection on concepts and connections | 3-5 min |
| **Mini-Exercise** | üîß | Hands-on coding or problem solving | 5-10 min |
| **Discussion** | üí¨ | Pair or group discussion with neighbors | 5-7 min |
| **Prediction** | üîÆ | Make a prediction before seeing results | 2-3 min |
| **Critique** | üîç | Analyze code, results, or approaches | 5-7 min |

## Exercise 8.1: Discussion - The Bias-Variance Tradeoff

**Type:** üí¨ Discussion (5 min)

Explain the bias-variance tradeoff to a partner as if they've never heard of it:

1. What is bias? What is variance?
2. Why can't we minimize both at the same time?
3. Give a real-world analogy (not from machine learning)

**Challenge:** Can you explain it in under 30 seconds?
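
If a numeric anchor helps the discussion, the sketch below estimates squared bias and variance for polynomial fits of increasing degree. Everything in it is an assumption made for illustration and is not part of the exercise: the sine target `true_f`, the noise level, the fixed training grid, and the helper name `bias_variance`. Only the noise is resampled between repeats, so the spread of the predictions reflects sensitivity to noise alone.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical ground-truth function, chosen only for illustration
    return np.sin(2 * np.pi * x)

# Fixed training inputs; only the noise changes between repeats
x_train = np.linspace(0, 1, 30)
x_test = np.linspace(0.05, 0.95, 50)
n_repeats, noise_sd = 200, 0.3

def bias_variance(degree):
    """Estimate squared bias and variance of a polynomial fit of the given degree."""
    preds = np.empty((n_repeats, x_test.size))
    for i in range(n_repeats):
        # Draw a fresh noisy training set and refit
        y_train = true_f(x_train) + rng.normal(0, noise_sd, x_train.size)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[i] = np.polyval(coefs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_f(x_test)) ** 2)  # systematic error
    variance = np.mean(preds.var(axis=0))                 # sensitivity to noise
    return bias_sq, variance

for degree in (1, 3, 9):
    b, v = bias_variance(degree)
    print(f"degree={degree}: bias^2={b:.4f}, variance={v:.4f}")
```

Under these assumptions, the straight line should show the largest squared bias and the degree-9 polynomial the largest variance. Changing `noise_sd` or the number of training points is a quick way to see how the balance shifts.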

*Your explanation:*



## Exercise 8.2: Mini-Exercise - Ridge vs Lasso

**Type:** üîß Mini-Exercise (7 min)

Observe how Ridge and Lasso affect coefficients differently.

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler

# Create data with some useless features
np.random.seed(42)
n = 100
X = np.random.randn(n, 5)
# Only first 2 features matter, rest are noise
y = 3*X[:, 0] + 2*X[:, 1] + np.random.randn(n)*0.5

feature_names = ['important_1', 'important_2', 'noise_1', 'noise_2', 'noise_3']

# Scale features
X_scaled = StandardScaler().fit_transform(X)

# Fit models with different regularization
models = {
    'OLS': LinearRegression(),
    'Ridge (alpha=1)': Ridge(alpha=1),
    'Lasso (alpha=0.1)': Lasso(alpha=0.1)
}

results = {}
for name, model in models.items():
    model.fit(X_scaled, y)
    results[name] = model.coef_

coef_df = pd.DataFrame(results, index=feature_names)
print("Coefficients:")
print(coef_df.round(3))

# TASK:
# 1. Which model most clearly separates the noise features from the important ones?
#    (Compare the noise-feature coefficients across the three columns.)
# 2. When would you prefer Ridge over Lasso?

*Your observations:*
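
As an optional follow-up once you have written your observations, the sketch below sweeps the regularization strength and counts how many coefficients each model sets to exactly zero. It rebuilds a similar synthetic dataset so that it runs on its own. The `alphas` grid and the `zero_counts` dictionary are illustrative choices, not part of the exercise, and the random draws differ from the cell above because it uses a different generator.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Same structure as the exercise data: 2 informative features, 3 noise features
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.5 * rng.standard_normal(100)
X_scaled = StandardScaler().fit_transform(X)

alphas = [0.01, 0.1, 1.0, 10.0]  # illustrative grid, not a recommendation
zero_counts = {}
for alpha in alphas:
    lasso = Lasso(alpha=alpha).fit(X_scaled, y)
    ridge = Ridge(alpha=alpha).fit(X_scaled, y)
    # Count coefficients that are exactly zero in each model
    n_zero_lasso = int(np.sum(lasso.coef_ == 0))
    n_zero_ridge = int(np.sum(ridge.coef_ == 0))
    zero_counts[alpha] = (n_zero_lasso, n_zero_ridge)
    print(f"alpha={alpha}: Lasso zeros={n_zero_lasso}, Ridge zeros={n_zero_ridge}")
```

Note that `alpha` is scaled differently in the `Ridge` and `Lasso` objectives in scikit-learn, so the same numeric value does not mean the same penalty strength in both models.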



## Exercise 8.3: Critique - Cross-Validation Mistakes

**Type:** üîç Critique (5 min)

Find the data leakage in this cross-validation workflow:

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

# BUGGY cross-validation workflow
# X, y = load_data()

# Step 1: Scale all data
# scaler = StandardScaler()
# X_scaled = scaler.fit_transform(X)  # <-- PROBLEM HERE!

# Step 2: Cross-validate
# scores = cross_val_score(Ridge(), X_scaled, y, cv=5)

# TASK: What's wrong with this workflow?
# How should it be done correctly?

*The problem:*

*The fix:*
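
To check your answer, here is a minimal sketch of one common leak-free pattern: wrap the scaler and the model in a scikit-learn pipeline so that the scaler is refit on the training portion of each fold. The synthetic `X` and `y` are stand-ins for `load_data()`, which the exercise leaves undefined, and `alpha=1.0` is simply the `Ridge` default written out.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for load_data(); features deliberately have different scales
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=[1, 10, 100], size=(120, 3))
y = X @ np.array([1.0, 0.1, 0.01]) + rng.normal(0, 0.5, 120)

# The pipeline is cloned and fit per fold, so the scaler's mean and standard
# deviation are computed from the training rows of that fold only
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(pipe, X, y, cv=5)

print("R^2 per fold:", scores.round(3))
print("Mean R^2:", scores.mean().round(3))
```

The same idea applies to any fitted preprocessing step, such as imputation or feature selection: if it learns from the data, it belongs inside the pipeline that is cross-validated.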
