[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jkitchin/s26-06642/blob/main/dsmles/assignments/hw09-nonlinear-methods.ipynb)

# Homework 9: Nonlinear Methods

**Due:** One week after Lecture 9

**Points:** 10

Apply nonlinear regression methods to chemical engineering data.

In [None]:
! pip install -q pycse
from pycse.colab import pdf

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

## Problem 1: K-Nearest Neighbors Regression (3 points)

Predict reaction rate from operating conditions.

In [None]:
np.random.seed(42)
n = 150

T = np.random.uniform(300, 500, n)
C = np.random.uniform(0.5, 2.0, n)

# Nonlinear rate law: Arrhenius temperature dependence, saturating in concentration, plus Gaussian noise
rate = 1e5 * np.exp(-4000/T) * C / (1 + 0.5*C) + np.random.normal(0, 0.2, n)

rate_data = pd.DataFrame({'temperature': T, 'concentration': C, 'rate': rate})
rate_data.head()

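**1a.** Split the data into training and test sets, scale the features (fit the scaler on the training set only), and fit KNN regressors with k = 1, 5, 15, and 30. Report the test R² for each k.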
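As a hedged illustration of the workflow (not the homework solution), the sketch below wraps `StandardScaler` and `KNeighborsRegressor` in a pipeline so the scaler is fit on the training rows only. The toy data and the names `X_toy`, `y_toy`, and `toy_scores` are assumptions made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (not the homework data): two features on very different scales
rng = np.random.default_rng(0)
X_toy = np.column_stack([rng.uniform(0, 6, 200), rng.uniform(0, 1000, 200)])
y_toy = np.sin(X_toy[:, 0]) + 0.001 * X_toy[:, 1] + rng.normal(0, 0.1, 200)

Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

toy_scores = {}
for k in (1, 5, 15, 30):
    # The pipeline fits the scaler on the training data, then applies it to the test data
    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
    model.fit(Xtr, ytr)
    toy_scores[k] = model.score(Xte, yte)  # R^2 on the held-out set

for k, r2 in toy_scores.items():
    print(f"k={k:2d}  test R^2 = {r2:.3f}")
```

Using a pipeline is one design choice among several; calling `fit_transform` on the training set and `transform` on the test set by hand gives the same result.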

In [None]:
# Your code here


**1b.** How do bias and variance change as k increases? Which k performs best on the test set?

*Your answer here:*



**1c.** Create a predicted vs actual plot for your best KNN model.
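A predicted vs actual (parity) plot could look roughly like the sketch below. The arrays `y_actual` and `y_predicted` are hypothetical stand-ins for your test targets and model predictions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical values standing in for the test targets and the model's predictions
rng = np.random.default_rng(1)
y_actual = rng.uniform(0, 10, 50)
y_predicted = y_actual + rng.normal(0, 0.5, 50)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_actual, y_predicted, alpha=0.7)

# Diagonal reference line: points on it are predicted exactly
lo = min(y_actual.min(), y_predicted.min())
hi = max(y_actual.max(), y_predicted.max())
ax.plot([lo, hi], [lo, hi], "k--", label="perfect prediction")

ax.set_xlabel("Actual")
ax.set_ylabel("Predicted")
ax.legend()
```

Points scattered tightly around the dashed line indicate good predictions; systematic curvature away from it suggests bias.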

In [None]:
# Your code here


## Problem 2: Decision Trees (4 points)

Predict process yield from operating conditions using a decision tree regressor.

In [None]:
np.random.seed(42)
n = 200

process_data = pd.DataFrame({
    'temp': np.random.uniform(300, 500, n),
    'pressure': np.random.uniform(1, 10, n),
    'flow': np.random.uniform(10, 100, n),
    'catalyst_age': np.random.uniform(0, 100, n)
})

# Nonlinear yield model: tanh in temp, log in pressure, and a flow effect that applies only when catalyst_age < 50
process_data['yield'] = (
    50 + 
    20 * np.tanh((process_data['temp'] - 400) / 50) +
    5 * np.log(process_data['pressure']) +
    0.1 * process_data['flow'] * (process_data['catalyst_age'] < 50) +
    np.random.normal(0, 2, n)
).clip(0, 100)

X = process_data[['temp', 'pressure', 'flow', 'catalyst_age']]
y = process_data['yield']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**2a.** Fit decision trees with max_depth = 2, 5, 10, and None (unlimited). Compare train and test R².
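One way to organize a depth comparison is sketched below on made-up data. The names `X_toy`, `y_toy`, and `depth_results` are assumptions for this example only, and the toy target is not the process yield model.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: a step in feature 0, a quadratic in feature 1, feature 2 unused
rng = np.random.default_rng(2)
X_toy = rng.uniform(0, 1, size=(300, 3))
y_toy = (
    np.where(X_toy[:, 0] > 0.5, 2.0, 0.0)
    + X_toy[:, 1] ** 2
    + rng.normal(0, 0.2, 300)
)

Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

depth_results = {}
for depth in (2, 5, 10, None):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(Xtr, ytr)
    r2_train = tree.score(Xtr, ytr)
    r2_test = tree.score(Xte, yte)
    depth_results[depth] = (r2_train, r2_test)
    print(f"max_depth={str(depth):>4}  train R^2 = {r2_train:.3f}  test R^2 = {r2_test:.3f}")
```

A growing gap between the train and test columns as depth increases is the usual signature of overfitting.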

In [None]:
# Your code here


**2b.** Which max_depth gives the best test performance? Is there evidence of overfitting?

*Your answer here:*



**2c.** Plot feature importances for your best tree. Which feature is most important?
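Fitted scikit-learn trees expose impurity-based importances through the `feature_importances_` attribute. The sketch below shows one possible way to plot them; the DataFrame `toy` and its columns `a`, `b`, and `c` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: "a" matters most, "b" a little, "c" is pure noise
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    "a": rng.uniform(0, 1, 300),
    "b": rng.uniform(0, 1, 300),
    "c": rng.uniform(0, 1, 300),
})
target = 5 * toy["a"] + 0.5 * toy["b"] + rng.normal(0, 0.1, 300)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(toy, target)

# Importances are normalized to sum to 1; sort so the largest bar is on top
importances = pd.Series(tree.feature_importances_, index=toy.columns).sort_values()
ax = importances.plot.barh()
ax.set_xlabel("Impurity-based importance")
```

Keep in mind that these importances describe how the tree used each feature to reduce training error, which is not always the same as the feature's physical significance.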

In [None]:
# Your code here


**2d.** What are the advantages and disadvantages of decision trees compared to linear regression?

*Your answer here:*



## Problem 3: Model Comparison (3 points)

**3a.** Compare Linear Regression, KNN (k=5), and Decision Tree (max_depth=5) using 5-fold cross-validation on the process data from Problem 2. Report the mean cross-validated R² for each model. Which performs best?
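A minimal sketch of a cross-validated comparison follows, again on made-up data rather than the process data. The names `X_toy`, `y_toy`, `models`, and `cv_summary` are assumptions; wrapping KNN in a pipeline with `StandardScaler` is a design choice here so that scaling is refit inside each fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: quadratic in feature 0, linear in feature 1
rng = np.random.default_rng(4)
X_toy = rng.uniform(-2, 2, size=(200, 2))
y_toy = X_toy[:, 0] ** 2 + X_toy[:, 1] + rng.normal(0, 0.2, 200)

models = {
    "linear": LinearRegression(),
    "knn (k=5)": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "tree (depth=5)": DecisionTreeRegressor(max_depth=5, random_state=0),
}

# Shuffled folds with a fixed seed so every model sees the same splits
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="r2")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name:15s} R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean helps judge whether a difference between models is larger than the fold-to-fold variation.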

In [None]:
# Your code here


**3b.** For the rate data in Problem 1, would you expect linear regression to perform well? Why or why not?

*Your answer here:*



**3c.** When might you prefer a simpler model (like linear regression) even if a complex model (like KNN) gives slightly better accuracy?

*Your answer here:*

