[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jkitchin/s26-06642/blob/main/dsmles/assignments/hw10-ensemble-methods.ipynb)

# Homework 7: Ensemble Methods

**Due:** One week after Lecture 8

**Points:** 10

Apply Random Forests and Gradient Boosting to a chemical engineering problem.

In [None]:
! pip install -q pycse
from pycse.colab import pdf

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import xgboost as xgb

## Dataset: Polymer Properties

Predict polymer tensile strength from synthesis conditions.

In [None]:
np.random.seed(42)
n = 300

data = pd.DataFrame({
    'temperature': np.random.uniform(150, 250, n),
    'pressure': np.random.uniform(1, 50, n),
    'catalyst_conc': np.random.uniform(0.1, 2, n),
    'monomer_ratio': np.random.uniform(0.5, 2, n),
    'reaction_time': np.random.uniform(30, 180, n),
    'molecular_weight': np.random.uniform(10000, 100000, n)
})

# Complex nonlinear relationship
data['tensile_strength'] = (
    20 * np.log(data['molecular_weight'] / 10000) +
    0.1 * data['temperature'] * np.exp(-0.01 * data['pressure']) +
    5 * np.tanh(data['catalyst_conc']) +
    10 / (1 + np.abs(data['monomer_ratio'] - 1)) +
    np.random.normal(0, 3, n)
)

feature_names = ['temperature', 'pressure', 'catalyst_conc', 'monomer_ratio', 
                 'reaction_time', 'molecular_weight']
X = data[feature_names].values
y = data['tensile_strength'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

## Problem 1: Random Forest (3 points)

**1a.** Train a Random Forest with 100 trees and max_depth=10. Report train and test R².

In [None]:
# Your code here


**1b.** Plot feature importances. Which features matter most?
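*Hint:* a fitted scikit-learn forest exposes impurity-based importances through `feature_importances_`. The sketch below ranks them on a toy dataset (`make_friedman1`, an assumption made for illustration, not the polymer data); the names `X_toy`, `y_toy`, and `toy_names` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor

# Toy data (assumption): features x0..x4 drive the target, x5 is pure noise.
X_toy, y_toy = make_friedman1(n_samples=200, n_features=6, noise=1.0, random_state=0)
toy_names = [f"x{i}" for i in range(X_toy.shape[1])]  # hypothetical feature names

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_toy, y_toy)

# Impurity-based importances are non-negative and sum to 1.
importances = rf.feature_importances_
order = np.argsort(importances)[::-1]  # indices from most to least important

for idx in order:
    print(f"{toy_names[idx]}: {importances[idx]:.3f}")
```

Passing the sorted values to `plt.barh` gives the usual importance plot. Keep in mind that these importances indicate how much the model relies on a feature, not the direction of its effect.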

In [None]:
# Your code here


**1c.** How does performance change with the number of trees? Plot test R² vs n_estimators for [10, 25, 50, 100, 200].
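*Hint:* one way to organize this kind of sweep is a loop that refits the model for each setting and stores the test score. The sketch below uses a toy dataset and an illustrative list of tree counts (both assumptions, chosen so that it differs from the assignment).

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Toy data (assumption): a synthetic benchmark, not the polymer dataset.
X_toy, y_toy = make_friedman1(n_samples=200, noise=1.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

tree_counts = [5, 20, 80]  # illustrative values only
test_scores = []
for n_trees in tree_counts:
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    rf.fit(Xtr, ytr)
    # For regressors, .score() returns the coefficient of determination R².
    test_scores.append(rf.score(Xte, yte))

for n_trees, score in zip(tree_counts, test_scores):
    print(f"n_estimators={n_trees:3d}  test R² = {score:.3f}")
```

Fixing `random_state` keeps the comparison between settings from being blurred by run-to-run randomness; `plt.plot(tree_counts, test_scores, 'o-')` would then draw the curve.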

In [None]:
# Your code here


## Problem 2: Gradient Boosting (3 points)

**2a.** Train an XGBoost model with default parameters. Compare its train and test R² with the Random Forest from 1a.

In [None]:
# Your code here


**2b.** Use GridSearchCV to tune max_depth (3, 5, 7) and learning_rate (0.01, 0.1, 0.2). What are the best parameters?
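*Hint:* the general `GridSearchCV` pattern is sketched below. To keep the example dependent on scikit-learn only, it swaps in `GradientBoostingRegressor` for XGBoost and uses a toy dataset with illustrative grid values (all assumptions). `xgb.XGBRegressor` follows the scikit-learn estimator interface, so it should be usable in the same position.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Toy data (assumption): a synthetic benchmark, not the polymer dataset.
X_toy, y_toy = make_friedman1(n_samples=150, noise=1.0, random_state=0)

# Illustrative grid; every combination (2 x 2 = 4) is cross-validated.
param_grid = {"max_depth": [2, 4], "learning_rate": [0.05, 0.2]}

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_grid,
    cv=3,          # 3-fold cross-validation per combination
    scoring="r2",  # higher is better
)
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print(f"Best mean CV R²: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds a model refit on all the data passed to `fit` using the best parameters, and `search.cv_results_` contains the scores for every combination.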

In [None]:
# Your code here


**2c.** Plot the learning curve showing how test error changes with the number of boosting rounds.
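*Hint:* the idea is to evaluate the ensemble after each boosting round rather than only at the end. The sketch below does this with scikit-learn's `GradientBoostingRegressor.staged_predict` on a toy dataset (a stand-in chosen for illustration, not XGBoost and not the polymer data). With XGBoost, a comparable route is to pass `eval_set` to `fit` and read the per-round metric from `evals_result()`.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy data (assumption): a synthetic benchmark, not the polymer dataset.
X_toy, y_toy = make_friedman1(n_samples=200, noise=1.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=0)
gbr.fit(Xtr, ytr)

# staged_predict yields the test predictions after 1, 2, ..., n_estimators rounds.
test_rmse = [
    np.sqrt(mean_squared_error(yte, y_pred)) for y_pred in gbr.staged_predict(Xte)
]
rounds = np.arange(1, len(test_rmse) + 1)
best_round = int(rounds[np.argmin(test_rmse)])

print(f"Test RMSE after round 1:   {test_rmse[0]:.3f}")
print(f"Test RMSE after round {len(test_rmse)}: {test_rmse[-1]:.3f}")
print(f"Lowest test RMSE at round {best_round}")
```

Plotting `test_rmse` against `rounds` gives the learning curve. The error does not have to decrease monotonically; a curve that flattens or turns upward suggests that additional rounds are no longer helping.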

In [None]:
# Your code here


## Problem 3: Model Comparison (4 points)

**3a.** Compare Random Forest and XGBoost using 5-fold cross-validation. Which performs better?
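*Hint:* for a fair comparison, both models should be scored on the same folds. The sketch below shares one `KFold` splitter between two models on a toy dataset; `GradientBoostingRegressor` stands in for XGBoost, and the dataset and settings are assumptions for illustration.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Toy data (assumption): a synthetic benchmark, not the polymer dataset.
X_toy, y_toy = make_friedman1(n_samples=150, noise=1.0, random_state=0)

# A single splitter with a fixed seed gives every model identical folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

cv_summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="r2")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: R² = {scores.mean():.3f} ± {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean helps judge whether a difference between models is larger than the fold-to-fold variation.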

In [None]:
# Your code here


**3b.** Create a predicted vs actual plot for your best model.

In [None]:
# Your code here


**3c.** A colleague suggests increasing reaction_time to improve tensile strength. Based on your model's feature importances, is this a good strategy? What would you recommend instead?

*Your answer here:*

