[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jkitchin/s26-06642/blob/main/dsmles/participation/participation-10-ensemble-methods.ipynb)

# Module 10: Ensemble Methods - Participation Exercises

## Exercise Types

| Type | Icon | Description | Time |
|------|------|-------------|------|
| **Reflection** | 🤔 | Personal reflection on concepts and connections | 3-5 min |
| **Mini-Exercise** | 🔧 | Hands-on coding or problem solving | 5-10 min |
| **Discussion** | 💬 | Pair or group discussion with neighbors | 5-7 min |
| **Prediction** | 🔮 | Make a prediction before seeing results | 2-3 min |
| **Critique** | 🔍 | Analyze code, results, or approaches | 5-7 min |

## Exercise 10.1: Discussion - Wisdom of Crowds

**Type:** 💬 Discussion (5 min)

The "wisdom of crowds" principle holds that the average of many independent estimates is often more accurate than the estimate of any single expert.

**Discuss:**
1. How does this relate to ensemble methods in ML?
2. What's the key word in "independent estimates"? Why does it matter?
3. How do Random Forests encourage independence (low correlation) between trees?
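
The role of independence in question 2 can be explored with a small simulation. The sketch below is illustrative only: the true value, noise level, number of estimators, and correlation `rho` are all made-up assumptions, not values from the course material. It compares the error of a single noisy estimate with the error of an average over estimates whose errors are independent, and over estimates whose errors share a common component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: every "expert" estimates the same true value with noise
true_value = 100.0
sigma = 10.0            # standard deviation of each expert's error
n_trials = 5000         # number of simulated crowds
n_estimators = 25       # experts per crowd

# Case 1: errors are independent across experts
independent = true_value + rng.normal(0, sigma, size=(n_trials, n_estimators))

# Case 2: errors share a common component (assumed correlation rho),
# scaled so each expert still has total error variance sigma**2
rho = 0.8
shared = rng.normal(0, sigma, size=(n_trials, 1))
individual = rng.normal(0, sigma, size=(n_trials, n_estimators))
correlated = true_value + np.sqrt(rho) * shared + np.sqrt(1 - rho) * individual


def rmse(estimates):
    """Root-mean-square error of estimates relative to the true value."""
    return np.sqrt(np.mean((estimates - true_value) ** 2))


rmse_single = rmse(independent[:, 0])
rmse_independent_avg = rmse(independent.mean(axis=1))
rmse_correlated_avg = rmse(correlated.mean(axis=1))

print(f"Single expert:               {rmse_single:.2f}")
print(f"Average, independent errors: {rmse_independent_avg:.2f}")
print(f"Average, correlated errors:  {rmse_correlated_avg:.2f}")
```

For independent errors, the standard deviation of the average shrinks as `sigma / sqrt(n_estimators)`. When errors are correlated, the shared component does not average out, so adding more estimators helps much less. Try changing `rho` and `n_estimators` before discussing.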

*Discussion notes:*



## Exercise 10.2: Mini-Exercise - Feature Importance

**Type:** 🔧 Mini-Exercise (7 min)

Compare the feature importances from a Random Forest with the standardized coefficients from a Linear Regression fit to the same data.

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Create data with a nonlinear relationship
np.random.seed(42)
n = 200

X1 = np.random.uniform(0, 10, n)  # Important, nonlinear effect
X2 = np.random.uniform(0, 10, n)  # Important, linear effect
X3 = np.random.uniform(0, 10, n)  # Noise

# y has nonlinear dependence on X1, linear on X2
y = np.sin(X1) + 0.5*X2 + np.random.randn(n)*0.3

X = np.column_stack([X1, X2, X3])
feature_names = ['nonlinear_feature', 'linear_feature', 'noise']

# Fit both models
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

lr = LinearRegression()
lr.fit(StandardScaler().fit_transform(X), y)

print("Random Forest Feature Importance:")
for name, imp in zip(feature_names, rf.feature_importances_):
    print(f"  {name}: {imp:.3f}")

print("\nLinear Regression Coefficients (standardized):")
for name, coef in zip(feature_names, np.abs(lr.coef_)):
    print(f"  {name}: {coef:.3f}")

# TASK: Why do the rankings differ?
# Which method gives a more accurate picture of importance here?
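
As an optional cross-check on the two rankings above, permutation importance measures how much a fitted model's score drops when one feature's values are shuffled. The sketch below assumes scikit-learn's `sklearn.inspection.permutation_importance`. It recreates the exercise data so it runs on its own, and for brevity it scores on the training data, where a held-out set would usually be preferable. The choice of `n_repeats=10` is arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Recreate the exercise data so this cell is self-contained
np.random.seed(42)
n = 200
X1 = np.random.uniform(0, 10, n)  # nonlinear effect
X2 = np.random.uniform(0, 10, n)  # linear effect
X3 = np.random.uniform(0, 10, n)  # noise
y = np.sin(X1) + 0.5*X2 + np.random.randn(n)*0.3

X = np.column_stack([X1, X2, X3])
feature_names = ['nonlinear_feature', 'linear_feature', 'noise']

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

# Shuffle each feature n_repeats times and record the drop in R^2
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)

print("Permutation Importance (mean drop in R^2 +/- std):")
for name, mean, std in zip(feature_names, perm.importances_mean, perm.importances_std):
    print(f"  {name}: {mean:.3f} +/- {std:.3f}")
```

Because permutation importance depends only on the model's predictions, the same call can be applied to the fitted `LinearRegression` (with the scaled inputs), which gives a like-for-like comparison between the two models.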

*Your analysis:*



## Exercise 10.3: Reflection - When to Use Ensembles

**Type:** 🤔 Reflection (3 min)

Ensemble methods often win ML competitions, but they are not always the right choice.

**Reflect on scenarios where you might NOT use an ensemble:**
- When interpretability is critical?
- When computational resources are limited?
- When you need fast predictions in real-time?

*Your reflection:*

