# In-Class Participation Exercises

This notebook contains participation activities for each lecture module. Exercises are designed to be completed during class time and encourage active engagement with the material.

## Exercise Types

| Type | Icon | Description | Time |
|------|------|-------------|------|
| **Reflection** | :thinking: | Personal reflection on concepts and connections | 3-5 min |
| **Mini-Exercise** | :wrench: | Hands-on coding or problem solving | 5-10 min |
| **Discussion** | :speech_balloon: | Pair or group discussion with neighbors | 5-7 min |
| **Prediction** | :crystal_ball: | Make a prediction before seeing results | 2-3 min |
| **Critique** | :mag: | Analyze code, results, or approaches | 5-7 min |

---

# Module 00: Introduction

## Exercise 0.1: Reflection - Your Data Story

**Type:** :thinking: Reflection (3 min writing, 2 min sharing)

Think about a time you worked with data (in a class, research, internship, or personal project).

1. What was the data about?
2. What question were you trying to answer?
3. What was the most challenging part?

**Share:** Turn to a neighbor and share one insight from your experience.

*Your response:*



## Exercise 0.2: Discussion - ML in Your Field

**Type:** :speech_balloon: Discussion (5 min)

With 2-3 neighbors, discuss:

1. What's one problem in chemical engineering where you think machine learning could help?
2. What data would you need to solve it?
3. What would success look like?

Be prepared to share your group's best idea with the class.

*Group notes:*



## Exercise 0.3: Prediction - Model Performance

**Type:** :crystal_ball: Prediction (2 min)

Before we run the introductory example, predict:

- If we train a model to predict reaction yield from temperature, pressure, and catalyst loading, what R^2 do you expect?
  - [ ] R^2 < 0.5 (poor fit)
  - [ ] R^2 = 0.5-0.7 (moderate fit)
  - [ ] R^2 = 0.7-0.9 (good fit)
  - [ ] R^2 > 0.9 (excellent fit)

**Why?** Write one sentence explaining your prediction.
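
For reference while making the prediction: R^2 compares the model's squared error with the error of always predicting the mean, R^2 = 1 - SS_res / SS_tot. The sketch below uses made-up yield values (hypothetical numbers, not taken from the introductory example) to show the calculation.

```python
import numpy as np

# Hypothetical measured and predicted yields (%), for illustration only
y_true = np.array([60.0, 70.0, 80.0, 90.0])
y_pred = np.array([62.0, 69.0, 78.0, 93.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot

print(f"R^2 = {r2:.3f}")
```

A model that is no better than predicting the mean gives R^2 near 0, and a perfect fit gives exactly 1.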

*Your prediction and reasoning:*



---

# Module 01: NumPy Fundamentals

## Exercise 1.1: Mini-Exercise - Vectorization Challenge

**Type:** :wrench: Mini-Exercise (7 min)

Convert this loop-based code to vectorized NumPy operations. Time both versions.
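
Before working on the Antoine loop in the exercise cell that follows, here is one possible way to time two versions of a calculation, using the standard-library `timeit` module. The unit conversion and the function names `to_kelvin_loop` and `to_kelvin_vectorized` are hypothetical and chosen so that the exercise itself is left for you.

```python
import timeit

import numpy as np

# A larger array makes the timing difference easier to see
temps_C = np.linspace(0.0, 200.0, 10_000)

def to_kelvin_loop(values):
    """Convert Celsius to Kelvin one element at a time."""
    out = []
    for v in values:
        out.append(v + 273.15)
    return np.array(out)

def to_kelvin_vectorized(values):
    """Convert Celsius to Kelvin with a single array operation."""
    return values + 273.15

# Run each version 20 times and report the total elapsed time
loop_time = timeit.timeit(lambda: to_kelvin_loop(temps_C), number=20)
vec_time = timeit.timeit(lambda: to_kelvin_vectorized(temps_C), number=20)

# Always confirm that both versions agree before comparing speed
same = np.allclose(to_kelvin_loop(temps_C), to_kelvin_vectorized(temps_C))

print(f"loop: {loop_time:.4f} s, vectorized: {vec_time:.4f} s, same result: {same}")
```

In a notebook, the `%timeit` magic is a common alternative to calling `timeit.timeit` directly.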

In [None]:
import numpy as np

# Given: temperatures in Celsius
temps_C = np.array([25, 50, 75, 100, 125, 150, 175, 200])

# Loop version (slow) - calculate vapor pressure using Antoine equation
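# For water: log10(P) = A - B/(C + T), with T in degrees C and P in mmHg
# The constants A=8.07, B=1730.63, C=233.43 are typically quoted for about 1-100 degrees C,
# so the values above 100 degrees C are an extrapolation used here only for practice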
A, B, C = 8.07, 1730.63, 233.43

pressures_loop = []
for T in temps_C:
    log_P = A - B / (C + T)
    P = 10 ** log_P
    pressures_loop.append(P)
pressures_loop = np.array(pressures_loop)

print("Loop result:", pressures_loop)

# YOUR TASK: Write the vectorized version below
# pressures_vectorized = ???


## Exercise 1.2: Discussion - When Loops Are Okay

**Type:** :speech_balloon: Discussion (5 min)

We learned that vectorized NumPy operations are usually faster than Python loops. But are there situations where loops are better or necessary?

With a partner, come up with 2-3 scenarios where you might still use a loop in scientific Python code.

**Hint:** Think about dependencies between iterations, readability, or operations that can't be vectorized.
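
To make the "dependencies between iterations" hint concrete, here is a minimal sketch of a calculation where each step needs the previous result. The tank, its capacity, and the inflow values are hypothetical.

```python
import numpy as np

# Net inflow per time step (positive fills the tank, negative drains it)
inflow = np.array([2.0, 3.0, -1.0, 4.0, -6.0, 1.0])
capacity = 5.0  # the tank overflows above this level and cannot go below zero

level = np.zeros(len(inflow) + 1)
for k, q in enumerate(inflow):
    # The new level depends on the clipped level from the previous step
    level[k + 1] = min(max(level[k] + q, 0.0), capacity)

print("Tank level:", level)
print("Plain cumulative sum:", np.cumsum(inflow))
```

Because the clipping is applied at every step, a plain `np.cumsum` gives a different answer, so this is one case where a loop is a reasonable choice.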

*Scenarios where loops might be appropriate:*

1. 
2. 
3. 

## Exercise 1.3: Reflection - Broadcasting Intuition

**Type:** :thinking: Reflection (3 min)

Broadcasting is one of NumPy's most powerful features, but it can also cause subtle bugs.

Look at this code and predict the output shape **without running it**:

In [None]:
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])  # Shape: (2, 3)

B = np.array([10, 20, 30])  # Shape: (3,)

C = np.array([[100],
              [200]])  # Shape: (2, 1)

# Predict shapes before running:
# A + B = shape ???
# A + C = shape ???
# A + B + C = shape ???

*Your predictions:*

- A + B = 
- A + C = 
- A + B + C = 
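
After writing down your predictions, the sketch below shows one kind of subtle bug that broadcasting can cause. It uses different arrays from the exercise, and the names `measured` and `predicted` are hypothetical.

```python
import numpy as np

measured = np.array([1.0, 2.0, 3.0])             # shape (3,)
predicted = np.array([[1.5], [2.5], [3.5]])      # shape (3, 1), e.g. a column vector

# Intended: three residuals. Actual: broadcasting silently builds a 3x3 table.
residuals = measured - predicted
print("Unexpected shape:", residuals.shape)

# Flattening the column first gives the element-wise result that was intended
fixed = measured - predicted.ravel()
print("Intended shape:", fixed.shape)
```

No error is raised in the first case, which is why checking `.shape` after arithmetic on arrays from different sources is a useful habit.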

---

# Module 02: Pandas Introduction

## Exercise 2.1: Prediction - Data Types

**Type:** :crystal_ball: Prediction (3 min)

You receive a CSV file with experimental data. Before loading it, predict what data types pandas will assign:

In [None]:
# Imagine this CSV content:
csv_content = """
experiment_id,temperature,pressure,catalyst,yield,date,notes
EXP001,350.5,1.2,Pt/Al2O3,78.3,2024-01-15,Good run
EXP002,375.0,1.5,Pd/C,82.1,2024-01-16,
EXP003,400.0,1.8,Pt/Al2O3,NaN,2024-01-17,Equipment failure
"""

# Predict the dtype for each column:
# experiment_id: ???
# temperature: ???
# pressure: ???
# catalyst: ???
# yield: ???
# date: ???
# notes: ???

*Your dtype predictions:*

| Column | Predicted dtype | Reasoning |
|--------|-----------------|--------|
| experiment_id | | |
| temperature | | |
| pressure | | |
| catalyst | | |
| yield | | |
| date | | |
| notes | | |
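
Once the table is filled in, predictions like these can be checked by reading CSV text from memory. The sketch below uses a small hypothetical CSV (deliberately not the exercise data) and prints the inferred dtypes. The exact dtype reported for text columns can differ between pandas versions.

```python
import io

import pandas as pd

# Hypothetical CSV text, deliberately different from the exercise data
csv_text = """run,flow_rate,operator,started
1,2.5,Ana,2024-03-01
2,,Ben,2024-03-02
3,3.1,Ana,2024-03-03
"""

# Default inference: pandas guesses a dtype for each column
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)

# Date columns are only converted to datetimes when this is requested
df_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["started"])
print(df_dates.dtypes)
```

Note how the empty `flow_rate` entry becomes a missing value rather than an empty string.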

## Exercise 2.2: Mini-Exercise - Data Exploration Race

**Type:** :wrench: Mini-Exercise (7 min)

Load the dataset below and answer all questions as quickly as possible. First person done raises their hand!

In [None]:
import pandas as pd
import numpy as np

# Create sample reactor data
np.random.seed(42)
n = 100
df = pd.DataFrame({
    'reactor': np.random.choice(['R1', 'R2', 'R3'], n),
    'temperature': np.random.uniform(300, 500, n),
    'pressure': np.random.uniform(1, 10, n),
    'conversion': np.random.uniform(0.3, 0.95, n),
    'shift': np.random.choice(['Day', 'Night'], n)
})
df.loc[np.random.choice(n, 5, replace=False), 'conversion'] = np.nan  # Add exactly 5 missing values (no repeated rows)

# QUESTIONS - Answer using pandas operations:
# 1. How many rows and columns?
# 2. How many missing values in 'conversion'?
# 3. What is the mean temperature for reactor R2?
# 4. How many experiments were run on the Night shift?
# 5. What is the maximum conversion?

# Your code here:


## Exercise 2.3: Discussion - Missing Data Strategies

**Type:** :speech_balloon: Discussion (5 min)

You have sensor data from a reactor with 10% missing values. Discuss with a partner:

1. What are three different ways to handle the missing data?
2. What are the pros and cons of each approach?
3. Does it matter *why* the data is missing?

**Scenario:** The missing values occur mostly during shift changes. How does this affect your strategy?

*Discussion notes:*

| Strategy | Pros | Cons |
|----------|------|------|
| | | |
| | | |
| | | |
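
After your discussion, three common options can be compared side by side on a short series. This is a minimal sketch with hypothetical sensor readings. It does not address the shift-change scenario, where the reason for the gaps matters.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with two gaps
readings = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

dropped = readings.dropna()                     # remove rows with missing values
mean_filled = readings.fillna(readings.mean())  # replace gaps with the overall mean
interpolated = readings.interpolate()           # linear interpolation between neighbors

print("Dropped:     ", dropped.tolist())
print("Mean-filled: ", mean_filled.tolist())
print("Interpolated:", interpolated.tolist())
```

All three calls return new objects and leave `readings` unchanged.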

---

# Module 03: Intermediate Pandas

## Exercise 3.1: Mini-Exercise - GroupBy Challenge

**Type:** :wrench: Mini-Exercise (8 min)

Use groupby operations to answer questions about catalyst performance.
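
As a warm-up, the sketch below shows the general shape of a `groupby` aggregation and a pivot table on a tiny hypothetical dataset. The column names `solvent`, `temp`, and `rate` are illustrative and are not part of the exercise data.

```python
import pandas as pd

# Tiny hypothetical dataset
df_demo = pd.DataFrame({
    "solvent": ["water", "water", "ethanol", "ethanol"],
    "temp": [25, 50, 25, 50],
    "rate": [1.0, 2.0, 3.0, 5.0],
})

# Select the column first, then aggregate within each group
mean_by_solvent = df_demo.groupby("solvent")["rate"].mean()
print(mean_by_solvent)

# Pivot table: one row per solvent, one column per temperature
table = df_demo.pivot_table(index="solvent", columns="temp", values="rate", aggfunc="mean")
print(table)
```

Selecting the column before calling `.mean()` avoids aggregating columns that are not needed.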

In [None]:
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'catalyst': np.random.choice(['Pt/Al2O3', 'Pd/C', 'Ni/SiO2'], 150),
    'temperature': np.random.choice([350, 400, 450], 150),
    'yield': np.random.uniform(50, 95, 150),
    'selectivity': np.random.uniform(0.7, 0.99, 150)
})

# TASKS:
# 1. Find the mean yield for each catalyst
# 2. Find the mean yield for each catalyst-temperature combination
# 3. Which catalyst has the highest average selectivity?
# 4. Create a pivot table: rows=catalyst, columns=temperature, values=mean yield

# Your code here:


## Exercise 3.2: Reflection - Data Wrangling Frustrations

**Type:** :thinking: Reflection (3 min)

Data wrangling is often said to take as much as 80% of the time in a data science project.

1. What's the messiest dataset you've encountered? What made it difficult?
2. What would have made your life easier (better data collection, documentation, etc.)?

**Connection:** How might you design experiments or data collection to make future analysis easier?

*Your reflection:*



## Exercise 3.3: Critique - Spot the Bug

**Type:** :mag: Critique (5 min)

The following code attempts to analyze experimental data but contains several bugs or poor practices. Find and fix them.

In [None]:
import pandas as pd
import numpy as np

# Sample data
df = pd.DataFrame({
    'temp': [350, 400, np.nan, 450, 400],
    'yield': [75, 82, 78, np.nan, 85],
    'catalyst': ['Pt', 'Pt', 'Pd', 'Pd', 'Pt']
})

# BUGGY CODE - Find the problems:

# Bug 1: Calculate mean yield per catalyst
mean_yields = df.groupby('catalyst').mean()['yield']

# Bug 2: Filter to high-yield experiments
high_yield = df[df.yield > 80]

# Bug 3: Fill missing temperatures with the mean
df.temp.fillna(df.temp.mean())

# Bug 4: Check if there are any missing values left
print("Missing values:", df.isnull().sum())

# What's wrong with each line? How would you fix it?

*Bugs identified:*

1. Bug 1: 
2. Bug 2: 
3. Bug 3: 
4. Bug 4: 

---

# Module 04: Feature Engineering

## Exercise 4.1: Discussion - Domain Knowledge Features

**Type:** :speech_balloon: Discussion (7 min)

You're building a model to predict reaction yield. You have:
- Temperature (K)
- Pressure (atm)
- Reactant concentrations (mol/L)
- Catalyst loading (wt%)
- Residence time (min)

**Task:** With 2-3 neighbors, brainstorm engineered features that might improve predictions. Think about:
- Physical/chemical relationships (Arrhenius, ideal gas law, etc.)
- Ratios and interactions
- Domain-specific transformations

List at least 5 potential engineered features with brief justifications.

*Engineered features:*

| Feature | Formula/Description | Justification |
|---------|--------------------|--------------|
| 1. | | |
| 2. | | |
| 3. | | |
| 4. | | |
| 5. | | |
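
Once you have a list, features like these are usually a line or two of pandas each. The sketch below shows three possibilities on hypothetical data. The column names and values are illustrative, and whether any of these features helps would have to be tested on real data.

```python
import pandas as pd

# Hypothetical process data; column names are illustrative
df_feat = pd.DataFrame({
    "temperature_K": [350.0, 400.0, 450.0],
    "conc_A_mol_L": [1.0, 0.8, 0.5],
    "conc_B_mol_L": [0.5, 0.8, 1.0],
    "residence_time_min": [10.0, 20.0, 30.0],
})

# Arrhenius form: ln(k) is linear in 1/T, so inverse temperature is a natural feature
df_feat["inv_temperature"] = 1.0 / df_feat["temperature_K"]

# Ratio of reactant concentrations
df_feat["A_to_B_ratio"] = df_feat["conc_A_mol_L"] / df_feat["conc_B_mol_L"]

# Interaction between concentration and residence time
df_feat["conc_A_x_time"] = df_feat["conc_A_mol_L"] * df_feat["residence_time_min"]

print(df_feat.round(4))
```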

## Exercise 4.2: Mini-Exercise - Scaling Matters

**Type:** :wrench: Mini-Exercise (6 min)

Demonstrate why feature scaling matters for distance-based algorithms.

In [None]:
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two data points with different feature scales
# Point A: Temperature=400K, Pressure=5atm
# Point B: Temperature=401K, Pressure=6atm

A = np.array([400, 5])
B = np.array([401, 6])

# Task 1: Calculate Euclidean distance without scaling
# dist_unscaled = ???

# Task 2: Now consider that temperature range is 300-500K, pressure range is 1-10 atm
# Scale both points and calculate distance again
# (Hint: what's the relative change in each feature?)

# Task 3: Which distance is more "fair"? Why?


## Exercise 4.3: Reflection - Feature Engineering Philosophy

**Type:** :thinking: Reflection (3 min)

"Feature engineering is where domain expertise meets machine learning."

Reflect on:
1. Why might a chemical engineer create better features than a generic data scientist?
2. Can we automate feature engineering? What are the limits?
3. When might *too many* features be a problem?

*Your reflection:*



---

# Module 05: Dimensionality Reduction

## Exercise 5.1: Prediction - Variance Explained

**Type:** :crystal_ball: Prediction (3 min)

You have a dataset with 10 features describing chemical compounds. You apply PCA.

**Predict:** How much variance do you think PC1 will explain?

- [ ] < 20% (features are independent)
- [ ] 20-40% (some correlation)
- [ ] 40-60% (moderate correlation)
- [ ] > 60% (highly correlated features)

**Consider:** What does your answer imply about the "true" dimensionality of the data?

*Your prediction and reasoning:*



## Exercise 5.2: Mini-Exercise - Interpret PCA Loadings

**Type:** :wrench: Mini-Exercise (7 min)

Analyze PCA loadings to understand what each component represents.

In [None]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Polymer property data (synthetic)
np.random.seed(42)
n = 100

# Create correlated features
strength = np.random.normal(50, 10, n)
hardness = strength * 0.8 + np.random.normal(0, 5, n)  # Correlated with strength
flexibility = 100 - strength + np.random.normal(0, 8, n)  # Anti-correlated
density = np.random.normal(1.2, 0.2, n)  # Independent
cost = np.random.normal(10, 3, n)  # Independent

df = pd.DataFrame({
    'strength': strength,
    'hardness': hardness,
    'flexibility': flexibility,
    'density': density,
    'cost': cost
})

# Apply PCA
X_scaled = StandardScaler().fit_transform(df)
pca = PCA()
pca.fit(X_scaled)

# Look at loadings
loadings = pd.DataFrame(
    pca.components_.T,
    columns=[f'PC{i+1}' for i in range(5)],
    index=df.columns
)
print("PCA Loadings:")
print(loadings.round(3))

print("\nVariance explained:", pca.explained_variance_ratio_.round(3))

# TASK: Interpret what PC1 and PC2 represent based on the loadings
# What physical meaning can you assign to each component?

*Your interpretation:*

- PC1 represents:
- PC2 represents:

## Exercise 5.3: Discussion - PCA vs t-SNE

**Type:** :speech_balloon: Discussion (5 min)

You need to visualize a high-dimensional dataset. When would you choose:

1. **PCA** over t-SNE?
2. **t-SNE** over PCA?
3. **Both** (for different purposes)?

Consider: interpretability, reproducibility, global vs local structure, computational cost.
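
The reproducibility point can be explored with a small experiment. The sketch below uses random 8-dimensional data (hypothetical, with no real structure) and compares repeated PCA runs with t-SNE runs that use different random seeds. The perplexity value is an arbitrary choice for this small sample.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 8))  # hypothetical 8-dimensional data

# PCA on the same data gives the same projection every time
pca_a = PCA(n_components=2).fit_transform(X_demo)
pca_b = PCA(n_components=2).fit_transform(X_demo)

# t-SNE starts from a random layout, so the seed changes the embedding
tsne_a = TSNE(n_components=2, perplexity=10, init="random", random_state=0).fit_transform(X_demo)
tsne_b = TSNE(n_components=2, perplexity=10, init="random", random_state=1).fit_transform(X_demo)

print("PCA identical across runs:", np.allclose(pca_a, pca_b))
print("t-SNE identical across seeds:", np.allclose(tsne_a, tsne_b))
```

Fixing `random_state` makes a single t-SNE result repeatable, but the layout still depends on that choice.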

*Discussion notes:*



---

# Module 06: Linear Regression

## Exercise 6.1: Critique - Coefficient Interpretation

**Type:** :mag: Critique (5 min)

A colleague shows you their regression model and says: "Temperature has a coefficient of 0.002, and pressure has a coefficient of 5.3. Clearly pressure is way more important!"

**Task:** Write 2-3 sentences explaining why this interpretation might be wrong and what they should do instead.

*Your critique:*



## Exercise 6.2: Mini-Exercise - Diagnose the Model

**Type:** :wrench: Mini-Exercise (7 min)

Look at these residual plots and diagnose what's wrong with each model.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
n = 100
y_pred = np.linspace(0, 100, n)

# Three different residual patterns
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Pattern A: Curved residuals
residuals_A = (y_pred - 50)**2 / 100 + np.random.normal(0, 2, n)
axes[0].scatter(y_pred, residuals_A, alpha=0.6)
axes[0].axhline(0, color='r', linestyle='--')
axes[0].set_title('Model A')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Residual')

# Pattern B: Fan shape
residuals_B = np.random.normal(0, y_pred/10 + 0.5, n)
axes[1].scatter(y_pred, residuals_B, alpha=0.6)
axes[1].axhline(0, color='r', linestyle='--')
axes[1].set_title('Model B')
axes[1].set_xlabel('Predicted')

# Pattern C: Good residuals
residuals_C = np.random.normal(0, 3, n)
axes[2].scatter(y_pred, residuals_C, alpha=0.6)
axes[2].axhline(0, color='r', linestyle='--')
axes[2].set_title('Model C')
axes[2].set_xlabel('Predicted')

plt.tight_layout()
plt.show()

# TASK: For each model, identify:
# 1. What pattern do you see?
# 2. What does it indicate?
# 3. How would you fix it?

*Your diagnosis:*

**Model A:**
- Pattern:
- Problem:
- Fix:

**Model B:**
- Pattern:
- Problem:
- Fix:

**Model C:**
- Pattern:
- Problem:
- Fix:

## Exercise 6.3: Reflection - Causation vs Correlation

**Type:** :thinking: Reflection (3 min)

Your regression model shows that "catalyst age" has a negative coefficient for yield. A manager suggests: "Let's always use fresh catalyst!"

**Reflect:**
1. Does the coefficient prove that old catalyst *causes* lower yields?
2. What else might explain the relationship?
3. What would you need to establish causation?

*Your reflection:*



---

# Module 07: Classification

## Exercise 7.1: Discussion - Choosing Metrics

**Type:** :speech_balloon: Discussion (5 min)

You're building a classifier to detect faulty batches in a chemical plant. Only 2% of batches are faulty.

**Discuss with a partner:**
1. If you predict "good" for every batch, what's your accuracy?
2. Why is accuracy a bad metric here?
3. Which metric would you use instead: precision, recall, or F1? Why?

*Discussion notes:*



## Exercise 7.2: Mini-Exercise - Confusion Matrix Interpretation

**Type:** :wrench: Mini-Exercise (6 min)

Calculate metrics from a confusion matrix.
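
After doing the calculation by hand, scikit-learn's metric functions offer one way to check the method. The sketch below uses a small hypothetical set of labels (1 = failure, 0 = normal), not the matrix from the exercise.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = failure, 0 = normal
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```

scikit-learn puts actual classes on the rows and predicted classes on the columns, which matches the layout used in this exercise.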

In [None]:
# A classifier for detecting equipment failures
# Confusion matrix:
#                          Predicted
#                  | Normal | Failure |
# Actual Normal    |   85   |   10    |
# Actual Failure   |    3   |    2    |

# True Positives (TP) = correctly predicted failures = 2
# True Negatives (TN) = correctly predicted normal = 85
# False Positives (FP) = normal predicted as failure = 10
# False Negatives (FN) = failure predicted as normal = 3

TP, TN, FP, FN = 2, 85, 10, 3

# TASK: Calculate these metrics
# accuracy = ???
# precision = ???  (of predicted failures, how many were real?)
# recall = ???  (of actual failures, how many did we catch?)
# f1 = ???

# Which metric is most concerning? Why?

*Your calculations and interpretation:*



## Exercise 7.3: Prediction - ROC Curves

**Type:** :crystal_ball: Prediction (3 min)

You have two classifiers:
- **Model A**: High precision (0.9), low recall (0.3)
- **Model B**: Low precision (0.5), high recall (0.9)

**Predict:** Sketch (mentally or on paper) where each model's operating point would be on an ROC curve. Which model has higher AUC, or is a single operating point not enough to tell?

*Your prediction:*



---

# Module 08: Regularization & Model Selection

## Exercise 8.1: Discussion - The Bias-Variance Tradeoff

**Type:** :speech_balloon: Discussion (5 min)

Explain the bias-variance tradeoff to a partner as if they've never heard of it:

1. What is bias? What is variance?
2. Why can't we minimize both at the same time?
3. Give a real-world analogy (not from machine learning)

**Challenge:** Can you explain it in under 30 seconds?

*Your explanation:*



## Exercise 8.2: Mini-Exercise - Ridge vs Lasso

**Type:** :wrench: Mini-Exercise (7 min)

Observe how Ridge and Lasso affect coefficients differently.

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler

# Create data with some useless features
np.random.seed(42)
n = 100
X = np.random.randn(n, 5)
# Only first 2 features matter, rest are noise
y = 3*X[:, 0] + 2*X[:, 1] + np.random.randn(n)*0.5

feature_names = ['important_1', 'important_2', 'noise_1', 'noise_2', 'noise_3']

# Scale features
X_scaled = StandardScaler().fit_transform(X)

# Fit models with different regularization
models = {
    'OLS': LinearRegression(),
    'Ridge (alpha=1)': Ridge(alpha=1),
    'Lasso (alpha=0.1)': Lasso(alpha=0.1)
}

results = {}
for name, model in models.items():
    model.fit(X_scaled, y)
    results[name] = model.coef_

coef_df = pd.DataFrame(results, index=feature_names)
print("Coefficients:")
print(coef_df.round(3))

# TASK: 
# 1. Which model correctly identifies the noise features?
# 2. When would you prefer Ridge over Lasso?

*Your observations:*



## Exercise 8.3: Critique - Cross-Validation Mistakes

**Type:** :mag: Critique (5 min)

Find the data leakage in this cross-validation workflow:

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

# BUGGY cross-validation workflow
# X, y = load_data()

# Step 1: Scale all data
# scaler = StandardScaler()
# X_scaled = scaler.fit_transform(X)  # <-- PROBLEM HERE!

# Step 2: Cross-validate
# scores = cross_val_score(Ridge(), X_scaled, y, cv=5)

# TASK: What's wrong with this workflow?
# How should it be done correctly?

*The problem:*

*The fix:*


---

# Module 09: Nonlinear Methods

## Exercise 9.1: Prediction - Method Selection

**Type:** :crystal_ball: Prediction (3 min)

For each scenario, predict which method would work best: **Linear Regression**, **Polynomial Regression**, **Decision Tree**, or **k-Nearest Neighbors**.

| Scenario | Your Choice | Reasoning |
|----------|-------------|----------|
| Predicting yield from temperature (Arrhenius-like) | | |
| Classifying materials into 5 categories based on properties | | |
| Predicting property with many step-changes/thresholds | | |
| Predicting output from 100 features, most are noise | | |

*Fill in the table above*

## Exercise 9.2: Mini-Exercise - Overfitting Visualization

**Type:** :wrench: Mini-Exercise (7 min)

Visualize overfitting with polynomial regression.
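
Alongside visual inspection, cross-validation is one common way to compare model complexity numerically. The sketch below applies it to hypothetical cubic data (not the exercise data). The degrees tried and the fold settings are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noisy cubic data
rng = np.random.default_rng(0)
x_demo = np.linspace(-2, 2, 60).reshape(-1, 1)
y_demo = x_demo.ravel() ** 3 - x_demo.ravel() + rng.normal(0, 0.05, 60)

# Shuffle the folds so each one covers the whole x range
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = {}
for degree in [1, 3, 10]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # Mean R^2 on held-out folds, not on the training data
    cv_scores[degree] = cross_val_score(model, x_demo, y_demo, cv=cv, scoring="r2").mean()

for degree, score in cv_scores.items():
    print(f"degree {degree}: mean CV R^2 = {score:.4f}")
```

Held-out scores penalize both a model that is too simple and one that chases noise, which training error alone does not.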

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Generate noisy data from a simple quadratic
np.random.seed(42)
X = np.linspace(0, 1, 15).reshape(-1, 1)
y = 2*X.ravel()**2 - X.ravel() + 0.5 + np.random.randn(15)*0.1

X_plot = np.linspace(-0.1, 1.1, 100).reshape(-1, 1)

plt.figure(figsize=(15, 4))

# TASK: Try degrees 1, 2, and 15
# For each, plot the fit and observe what happens
for i, degree in enumerate([1, 2, 15]):
    plt.subplot(1, 3, i+1)
    
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    
    plt.scatter(X, y, color='blue', label='Data')
    plt.plot(X_plot, model.predict(X_plot), color='red', label=f'Degree {degree}')
    plt.ylim(-0.5, 2)
    plt.title(f'Degree {degree}')
    plt.legend()

plt.tight_layout()
plt.show()

# QUESTION: Which degree is best? How do you know?

*Your observation:*



## Exercise 9.3: Discussion - Interpretability vs Performance

**Type:** :speech_balloon: Discussion (5 min)

You're presenting model results to plant operators who need to understand *why* the model makes certain predictions.

**Discuss:**
1. Rank these models from most to least interpretable: Linear Regression, Random Forest, Neural Network, Decision Tree
2. When might you sacrifice interpretability for performance?
3. How could you make a black-box model more interpretable?

*Discussion notes:*



---

# Module 10: Ensemble Methods

## Exercise 10.1: Discussion - Wisdom of Crowds

**Type:** :speech_balloon: Discussion (5 min)

The "wisdom of crowds" says that averaging many independent estimates is often more accurate than any single expert.

**Discuss:**
1. How does this relate to ensemble methods in ML?
2. What's the key word in "independent estimates"? Why does it matter?
3. How do Random Forests reduce the correlation between trees? Are the trees ever truly independent?

*Discussion notes:*



## Exercise 10.2: Mini-Exercise - Feature Importance

**Type:** :wrench: Mini-Exercise (7 min)

Compare feature importance from Random Forest vs coefficients from Linear Regression.

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Create data with a nonlinear relationship
np.random.seed(42)
n = 200

X1 = np.random.uniform(0, 10, n)  # Important, nonlinear effect
X2 = np.random.uniform(0, 10, n)  # Important, linear effect
X3 = np.random.uniform(0, 10, n)  # Noise

# y has nonlinear dependence on X1, linear on X2
y = np.sin(X1) + 0.5*X2 + np.random.randn(n)*0.3

X = np.column_stack([X1, X2, X3])
feature_names = ['nonlinear_feature', 'linear_feature', 'noise']

# Fit both models
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

lr = LinearRegression()
lr.fit(StandardScaler().fit_transform(X), y)

print("Random Forest Feature Importance:")
for name, imp in zip(feature_names, rf.feature_importances_):
    print(f"  {name}: {imp:.3f}")

print("\nLinear Regression absolute coefficients (standardized features):")
for name, coef in zip(feature_names, np.abs(lr.coef_)):
    print(f"  {name}: {coef:.3f}")

# TASK: Why do the rankings differ?
# Which method gives a more accurate picture of importance here?

*Your analysis:*



## Exercise 10.3: Reflection - When to Use Ensembles

**Type:** :thinking: Reflection (3 min)

Ensemble methods often win ML competitions. But they're not always the right choice.

**Reflect on scenarios where you might NOT use an ensemble:**
- When interpretability is critical?
- When computational resources are limited?
- When you need fast predictions in real-time?

*Your reflection:*



---

# Module 11: Clustering

## Exercise 11.1: Prediction - How Many Clusters?

**Type:** :crystal_ball: Prediction (3 min)

You have process data from a reactor that operates in three known modes: startup, steady-state, and shutdown. You apply k-means.

**Predict:** Will k-means with k=3 find clusters that match the three operating modes?

- [ ] Yes, definitely
- [ ] Probably, if the modes are well-separated
- [ ] Probably not, clusters aren't always meaningful
- [ ] No, k-means can't find interpretable clusters

**Explain your reasoning.**

*Your prediction and reasoning:*



## Exercise 11.2: Mini-Exercise - Scaling Impact

**Type:** :wrench: Mini-Exercise (6 min)

See how scaling affects k-means clustering.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two features with very different scales
np.random.seed(42)
n = 100

# Temperature (300-500 K) and pressure (1-10 atm)
# Two natural clusters based on pressure
cluster1 = np.column_stack([
    np.random.normal(400, 30, n//2),  # temperature
    np.random.normal(3, 0.5, n//2)    # pressure
])
cluster2 = np.column_stack([
    np.random.normal(400, 30, n//2),  # temperature
    np.random.normal(7, 0.5, n//2)    # pressure
])
X = np.vstack([cluster1, cluster2])

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Without scaling
kmeans_unscaled = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_unscaled = kmeans_unscaled.fit_predict(X)
axes[0].scatter(X[:, 0], X[:, 1], c=labels_unscaled, cmap='viridis')
axes[0].set_xlabel('Temperature (K)')
axes[0].set_ylabel('Pressure (atm)')
axes[0].set_title('K-means WITHOUT scaling')

# With scaling
X_scaled = StandardScaler().fit_transform(X)
kmeans_scaled = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_scaled = kmeans_scaled.fit_predict(X_scaled)
axes[1].scatter(X[:, 0], X[:, 1], c=labels_scaled, cmap='viridis')
axes[1].set_xlabel('Temperature (K)')
axes[1].set_ylabel('Pressure (atm)')
axes[1].set_title('K-means WITH scaling')

plt.tight_layout()
plt.show()

# TASK: Explain the difference. Which clustering is "correct"?

*Your explanation:*



## Exercise 11.3: Discussion - Clustering Without Ground Truth

**Type:** :speech_balloon: Discussion (5 min)

Clustering is unsupervised - there are no labels to tell us if we're right.

**Discuss:**
1. How do you validate clusters without ground truth?
2. When might silhouette score be misleading?
3. What role does domain expertise play in evaluating clusters?
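
As a starting point for the discussion, the sketch below computes the silhouette score for several values of k. The data are synthetic, well-separated blobs with hand-picked centers, which is an easy case. Real process data will rarely be this clean.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three compact, well-separated synthetic clusters (centers chosen by hand)
X_blobs, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [5, 5], [0, 5]],
    cluster_std=0.5,
    random_state=0,
)

sil_scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_blobs)
    sil_scores[k] = silhouette_score(X_blobs, labels)
    print(f"k={k}: silhouette = {sil_scores[k]:.3f}")

best_k = max(sil_scores, key=sil_scores.get)
print("Highest silhouette at k =", best_k)
```

The score rewards compact, roughly spherical clusters, so it is worth asking what it would report for elongated or nested groups.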

*Discussion notes:*



---

# Module 12: Uncertainty Quantification

## Exercise 12.1: Reflection - Communicating Uncertainty

**Type:** :thinking: Reflection (3 min)

Your model predicts a reactor yield of 75% with a 95% confidence interval of [65%, 85%].

**Reflect:**
1. How would you explain this to a plant manager who asks "So what's the yield going to be?"
2. Why might the manager find uncertainty uncomfortable?
3. Why is communicating uncertainty important for decision-making?

*Your reflection:*



## Exercise 12.2: Mini-Exercise - Bootstrap Confidence Intervals

**Type:** :wrench: Mini-Exercise (8 min)

Calculate a bootstrap confidence interval for a regression coefficient.
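
As a warm-up before the regression version, here is a minimal sketch of the same resampling idea applied to a simpler statistic: a 90% percentile interval for a sample mean. The measurements are hypothetical, and the interval level differs from the exercise on purpose.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=40)  # hypothetical measurements

n_resamples = 2000
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    # Draw a sample of the same size, with replacement, and recompute the statistic
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[i] = resample.mean()

# A 90% percentile interval leaves 5% in each tail
lower, upper = np.percentile(boot_means, [5, 95])
print(f"Sample mean: {sample.mean():.3f}")
print(f"90% bootstrap interval: [{lower:.3f}, {upper:.3f}]")
```

For the regression exercise, the same row indices must be used for both X and y, so that each resampled point keeps its own response value.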

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple linear data
np.random.seed(42)
n = 50
X = np.random.uniform(0, 10, n).reshape(-1, 1)
y = 2.5 * X.ravel() + 5 + np.random.randn(n) * 2  # True slope = 2.5

# Single model fit
model = LinearRegression().fit(X, y)
print(f"Point estimate for slope: {model.coef_[0]:.3f}")

# TASK: Implement bootstrap to get confidence interval
n_bootstrap = 1000
bootstrap_slopes = []

for i in range(n_bootstrap):
    # 1. Sample with replacement from the data
    # indices = ???
    
    # 2. Fit model on bootstrap sample
    # X_boot = ???
    # y_boot = ???
    # model_boot = ???
    
    # 3. Store the coefficient
    # bootstrap_slopes.append(???)
    pass

# 4. Calculate 95% CI
# ci_lower = np.percentile(bootstrap_slopes, ???)
# ci_upper = np.percentile(bootstrap_slopes, ???)
# print(f"95% CI for slope: [{ci_lower:.3f}, {ci_upper:.3f}]")

## Exercise 12.3: Discussion - Sources of Uncertainty

**Type:** :speech_balloon: Discussion (5 min)

A machine learning model has multiple sources of uncertainty.

**Categorize these sources as "aleatory" (irreducible) or "epistemic" (reducible):**

1. Measurement noise in sensors
2. Model doesn't include an important variable
3. Inherent randomness in a chemical process
4. Not enough training data
5. Wrong model architecture

**Which can we reduce by collecting more data?**

*Classification:*

| Source | Aleatory or Epistemic? | Can more data help? |
|--------|----------------------|--------------------|
| 1. Measurement noise | | |
| 2. Missing variable | | |
| 3. Process randomness | | |
| 4. Limited training data | | |
| 5. Wrong architecture | | |

---

# Module 13: Model Interpretability

## Exercise 13.1: Discussion - Explaining to Stakeholders

**Type:** :speech_balloon: Discussion (5 min)

Your black-box model recommends changing a reactor setpoint. The operator asks: "Why?"

**Discuss:**
1. What would be a satisfying answer?
2. How might SHAP values help?
3. When might "trust the model" be an acceptable answer? When is it not?

*Discussion notes:*



## Exercise 13.2: Mini-Exercise - SHAP Interpretation

**Type:** :wrench: Mini-Exercise (7 min)

Interpret a SHAP summary plot.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# Create interpretable data
np.random.seed(42)
n = 200

temperature = np.random.uniform(300, 500, n)
pressure = np.random.uniform(1, 10, n)
catalyst = np.random.uniform(1, 5, n)

# yield increases with temp and pressure, and peaks at an intermediate catalyst loading
y = (0.1 * temperature +
     5 * pressure +
     6 * catalyst - catalyst**2 +  # Optimum at catalyst = 3, inside the sampled 1-5 range
     np.random.randn(n) * 3)

X = np.column_stack([temperature, pressure, catalyst])
feature_names = ['temperature', 'pressure', 'catalyst_loading']

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# If SHAP is available:
try:
    import shap
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X[:50])  # Sample for speed
    
    plt.figure(figsize=(10, 6))
    shap.summary_plot(shap_values, X[:50], feature_names=feature_names, show=False)
    plt.tight_layout()
    plt.show()
except ImportError:
    print("SHAP not available - interpret feature importance instead:")
    for name, imp in zip(feature_names, model.feature_importances_):
        print(f"  {name}: {imp:.3f}")

# TASK: Based on the plot (or importances):
# 1. Which feature has the largest impact?
# 2. Is the effect of temperature positive or negative?
# 3. What's unusual about catalyst_loading?

*Your interpretation:*



## Exercise 13.3: Reflection - The Right to Explanation

**Type:** :thinking: Reflection (3 min)

In many contexts (medical, legal, financial), there's growing demand for "explainable AI."

**Reflect:**
1. In chemical engineering applications, when is model explainability critical?
2. When might an unexplainable model be acceptable?
3. How does interpretability relate to trust and safety?

*Your reflection:*



---

# End-of-Course Reflection

## Final Reflection: Your Data Science Journey

**Type:** :thinking: Reflection (5 min)

1. **Most surprising thing you learned:**

2. **Concept you found most challenging:**

3. **How you plan to use these skills:**

4. **One question you still have:**

5. **Advice you'd give to future students:**

*Your final reflection:*

