[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jkitchin/s26-06642/blob/main/dsmles/assignments/hw04-feature-engineering.ipynb)

# Homework 4: Feature Engineering

**Due:** One week after Lecture 4

**Points:** 10

Practice creating and transforming features for machine learning.

In [None]:
! pip install -q pycse
from pycse.colab import pdf

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

## Problem 1: Scaling and Normalization (3 points)

Process data often has features on very different scales. Here, pressure spans about 1e5 to 1e6 Pa, while concentration spans 0.001 to 0.01 M.
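As a minimal sketch of what standardization does, the toy example below scales two made-up columns (`big` and `small`, not the homework data) with very different magnitudes. It assumes scikit-learn's `StandardScaler`, which divides by the population standard deviation (`ddof=0`). pandas' default `.std()` uses `ddof=1`, so it reports values slightly different from 1 on the same scaled data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'big': rng.uniform(1e5, 1e6, 50),
    'small': rng.uniform(0.001, 0.01, 50),
})

# Fit the scaler and keep the column names for readability
scaler = StandardScaler()
toy_scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# Each column now has mean ~0 and population std (ddof=0) ~1
print(toy_scaled.mean())
print(toy_scaled.std(ddof=0))
```

In a train/test workflow the scaler would normally be fit on the training split only and then applied to the test split with `transform`.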

In [None]:
np.random.seed(42)
n = 100

process_data = pd.DataFrame({
    'temperature_K': np.random.uniform(300, 600, n),
    'pressure_Pa': np.random.uniform(100000, 1000000, n),
    'flow_rate_mol_s': np.random.uniform(0.01, 0.1, n),
    'concentration_M': np.random.uniform(0.001, 0.01, n)
})

# Target variable
process_data['conversion'] = (
    0.3 + 
    0.001 * (process_data['temperature_K'] - 400) +
    1e-7 * process_data['pressure_Pa'] +
    np.random.normal(0, 0.05, n)
).clip(0, 1)

process_data.describe()

**1a.** Apply `StandardScaler` to the features. Show the mean and standard deviation of the scaled features.

In [None]:
# Your code here


**1b.** Compare linear regression R² with and without scaling. Does scaling change the predictions?

In [None]:
# Your code here


**1c.** Compare the coefficients of the model fit on raw features with those of the model fit on scaled features. Which feature is most important according to the scaled model?

In [None]:
# Your code here


## Problem 2: Polynomial Features (4 points)

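Relationships between features and the target are often nonlinear. Here, the reaction rate depends exponentially on temperature and follows a power law in concentration.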

In [None]:
np.random.seed(42)
n = 80

# Reaction rate data
T = np.random.uniform(300, 500, n)
C = np.random.uniform(0.1, 2.0, n)

# Rate follows Arrhenius with concentration dependence
rate = 1e6 * np.exp(-5000/T) * C**1.5 + np.random.normal(0, 0.5, n)

rate_data = pd.DataFrame({'temperature': T, 'concentration': C, 'rate': rate})
rate_data.head()

**2a.** Fit a linear model using just temperature and concentration. Report R².

In [None]:
# Your code here


**2b.** Create polynomial features of degree 2 (include interaction terms). How many features do you have now?

In [None]:
# Your code here


**2c.** Fit a linear model with polynomial features. How much does R² improve?

In [None]:
# Your code here


**2d.** Create a physically motivated feature: 1/T (for Arrhenius). Does this improve the model?

In [None]:
# Your code here


## Problem 3: Categorical Encoding (3 points)

Linear models need numeric inputs, so non-numeric features such as catalyst and support must be encoded first.
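The sketch below shows one-hot encoding on a small made-up column (`solvent` is a hypothetical name, unrelated to the homework data). It also illustrates, under the assumption of a linear model with an intercept, why keeping every dummy column makes the design matrix rank deficient: the dummies in each row sum to 1, which duplicates the intercept column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, unrelated to the homework data set
toy = pd.DataFrame({'solvent': ['water', 'ethanol', 'water', 'acetone']})

# Encode with all levels kept, and with the first level dropped
full = pd.get_dummies(toy, columns=['solvent'], dtype=int)
reduced = pd.get_dummies(toy, columns=['solvent'], drop_first=True, dtype=int)

# Add an explicit intercept column to mimic a linear model's design matrix
ones = np.ones((len(toy), 1))
X_full = np.hstack([ones, full.to_numpy()])
X_reduced = np.hstack([ones, reduced.to_numpy()])

# Rank below the column count means the columns are linearly dependent
print(X_full.shape[1], np.linalg.matrix_rank(X_full))
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))
print(reduced.columns.tolist())
```

With the first level dropped, each remaining coefficient is read as an offset relative to the dropped (reference) level.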

In [None]:
np.random.seed(42)
n = 120

catalyst_data = pd.DataFrame({
    'catalyst': np.random.choice(['Pt', 'Pd', 'Rh', 'Ni'], n),
    'support': np.random.choice(['Al2O3', 'SiO2', 'TiO2'], n),
    'temperature': np.random.uniform(400, 600, n),
    'activity': np.random.uniform(50, 100, n)
})

catalyst_data.head()

**3a.** One-hot encode the catalyst and support columns using `pd.get_dummies()`.

In [None]:
# Your code here


**3b.** Why should you use `drop_first=True` when one-hot encoding features for a linear model? Demonstrate with an example.

In [None]:
# Your code here


**3c.** Fit a linear regression that predicts activity from the encoded features and temperature. Which catalyst has the highest predicted activity?

In [None]:
# Your code here
