[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jkitchin/s26-06642/blob/main/dsmles/04-feature-engineering/feature-engineering.ipynb)

# Feature Engineering

## Learning Objectives

By the end of this lecture, you will be able to:

1. Understand why feature engineering often matters more than algorithm choice
2. Create domain-informed features from raw measurements
3. Apply common transformations (log, polynomial, interactions)
4. Handle categorical variables appropriately
5. Scale and normalize features for different algorithms
6. Use feature selection to reduce dimensionality

## Why Feature Engineering Matters

Here's a truth that surprises many beginners: **the features you provide often matter more than the algorithm you choose**.

Consider this scenario: you have temperature and pressure data from a reactor, and you want to predict reaction rate. A linear model on raw T and P might give R² = 0.6. But if you know the Arrhenius equation suggests rate depends on exp(-Ea/RT), creating that feature could push R² to 0.95—even with the same simple linear model.

### The Feature Engineering Mindset

Feature engineering is where **domain knowledge meets machine learning**. It's the process of:

1. **Transforming** raw measurements into forms that better capture underlying relationships
2. **Creating** new features that encode domain knowledge
3. **Selecting** the most informative features and removing noise
4. **Encoding** categorical and text data for numerical algorithms

### Why Algorithms Can't Do This Automatically

You might wonder: if deep learning can learn features automatically, why bother?

- **Small data**: Most chemical engineering datasets have hundreds, not millions, of samples. Hand-crafted features help models learn from limited data.
- **Interpretability**: Features like "activation energy" are meaningful; learned features like "hidden unit 47" are not.
- **Physical constraints**: You can encode conservation laws, bounds, and symmetries that algorithms would have to learn from scratch.
- **Efficiency**: A good feature reduces the hypothesis space, making training faster and more reliable.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import train_test_split, cross_val_score

np.random.seed(42)

## Domain-Informed Feature Creation

The most powerful features come from understanding your system. Let's see this with a chemical kinetics example.

### Example: Reaction Rate Prediction

We have experimental data: temperature (K), concentration (mol/L), and observed reaction rate. A naive approach treats T and C as independent features. But we know from chemical kinetics:

$$r = k \cdot C^n = A \cdot e^{-E_a/RT} \cdot C^n$$

This suggests features like:
- **1/T**: Appears in Arrhenius equation
- **log(r)** as the target: log(r) should be linear in 1/T if Arrhenius holds
- **log(C)**: If reaction order n ≠ 1, log-transform helps

In [None]:
# Simulate reaction rate data with Arrhenius kinetics
# r = A * exp(-Ea/RT) * C^n
A = 1e8  # pre-exponential factor
Ea = 50000  # activation energy, J/mol
R = 8.314  # gas constant
n = 1.5  # reaction order

# Generate experimental conditions
T = np.random.uniform(300, 400, 100)  # Temperature in K
C = np.random.uniform(0.1, 2.0, 100)  # Concentration in mol/L

# True rate with some measurement noise
k = A * np.exp(-Ea / (R * T))
rate_true = k * C**n
rate_observed = rate_true * np.exp(np.random.normal(0, 0.1, 100))  # Log-normal noise

df = pd.DataFrame({
    'T': T,
    'C': C,
    'rate': rate_observed
})
df.head()

In [None]:
# Naive approach: use raw T and C
X_naive = df[['T', 'C']]
y = df['rate']

X_train, X_test, y_train, y_test = train_test_split(X_naive, y, test_size=0.2, random_state=42)

model_naive = LinearRegression()
model_naive.fit(X_train, y_train)

print(f"Naive model R² (train): {model_naive.score(X_train, y_train):.3f}")
print(f"Naive model R² (test):  {model_naive.score(X_test, y_test):.3f}")

The naive model gives R² ≈ 0.79—not terrible, but it's missing something. The linear model is trying to fit what is fundamentally an *exponential* relationship with a straight line. It does okay because over a limited range, most curves can be approximated linearly. But we can do better by thinking about the physics.

In [None]:
# Domain-informed approach: use Arrhenius-inspired features
# If r = A * exp(-Ea/RT) * C^n, then:
# log(r) = log(A) - Ea/(RT) + n*log(C)
# This is LINEAR in 1/T and log(C)!

df['inv_T'] = 1 / df['T']
df['log_C'] = np.log(df['C'])
df['log_rate'] = np.log(df['rate'])

X_engineered = df[['inv_T', 'log_C']]
y_log = df['log_rate']

X_train_e, X_test_e, y_train_e, y_test_e = train_test_split(
    X_engineered, y_log, test_size=0.2, random_state=42
)

model_engineered = LinearRegression()
model_engineered.fit(X_train_e, y_train_e)

print(f"Engineered model R² (train): {model_engineered.score(X_train_e, y_train_e):.3f}")
print(f"Engineered model R² (test):  {model_engineered.score(X_test_e, y_test_e):.3f}")

# Extract physical parameters!
print(f"\nExtracted parameters:")
print(f"  Reaction order n = {model_engineered.coef_[1]:.2f} (true: {n})")
print(f"  Ea/R = {-model_engineered.coef_[0]:.0f} K (true: {Ea/R:.0f} K)")

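**Dramatic improvement!** R² jumped from 0.79 to 0.99 with the same linear regression algorithm. The only difference: we transformed the features (and the target) based on physical knowledge. Strictly speaking, the two R² values are computed on different targets (rate versus log(rate)), so they are not directly comparable; a like-for-like check back-transforms the predictions to the rate scale before scoring.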

**What happened?**
- The Arrhenius equation says: r = A·exp(-Ea/RT)·C^n
- Taking the log: log(r) = log(A) - Ea/RT + n·log(C)
- This is **linear** in 1/T and log(C)!

By transforming to the right feature space, we converted a nonlinear problem into a linear one. The coefficients now have direct physical meaning:
- The coefficient on log(C) ≈ 1.5 gives us the reaction order (true value: 1.5) ✓
- The coefficient on 1/T is −Ea/R ≈ −6000 K, so Ea ≈ 50 kJ/mol (true value: 50 kJ/mol) ✓

**The lesson**: Before reaching for complex nonlinear models, ask "Is there a transformation that makes this linear?"
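
The same fit also yields the pre-exponential factor (from the intercept), and its predictions can be scored on the original rate scale. The following is a minimal sketch under stated assumptions: it regenerates a dataset like the one above with its own random generator (so the numbers will differ from the cells above), and the names `Ea_est`, `n_est`, `A_est`, and `r2_rate_scale` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

R = 8.314  # gas constant, J/(mol*K)

# Re-create a dataset like the one above (local generator, so the
# values will not match the notebook's cells exactly)
rng = np.random.default_rng(0)
T = rng.uniform(300, 400, 100)
C = rng.uniform(0.1, 2.0, 100)
rate = 1e8 * np.exp(-50000 / (R * T)) * C**1.5 * np.exp(rng.normal(0, 0.1, 100))

# Fit log(r) = log(A) - (Ea/R) * (1/T) + n * log(C)
X = np.column_stack([1 / T, np.log(C)])
model = LinearRegression().fit(X, np.log(rate))

Ea_est = -model.coef_[0] * R      # slope on 1/T is -Ea/R
n_est = model.coef_[1]            # slope on log(C) is the reaction order
A_est = np.exp(model.intercept_)  # intercept is log(A)

# Back-transform predictions so R² is measured on the rate itself
rate_pred = np.exp(model.predict(X))
r2_rate_scale = r2_score(rate, rate_pred)

print(f"Ea ≈ {Ea_est / 1000:.1f} kJ/mol, n ≈ {n_est:.2f}, A ≈ {A_est:.2e}")
print(f"R² on the original rate scale: {r2_rate_scale:.3f}")
```

The intercept estimate is usually the least precise of the three, because 1/T = 0 lies far outside the measured temperature range and the intercept is an extrapolation to that point.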

### Key Insight

The same linear regression algorithm went from mediocre to excellent performance, not by tuning hyperparameters but by **encoding domain knowledge into features**.

Better yet, the coefficients now have physical meaning! This is feature engineering at its best: improving both accuracy and interpretability.

### Common Domain Transformations in Chemical Engineering

| Physical Law | Suggested Features |
|--------------|--------------------|
| Arrhenius kinetics | 1/T, exp(-1/T) |
| Power laws | log(x), x^n |
| Ideal gas | P*V, P/T |
| Mass transfer | Re, Sc, Sh (dimensionless groups) |
| Heat transfer | Nu, Pr, Gr |
| Reaction equilibrium | Products/Reactants ratios |
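
As an illustration of the dimensionless-group rows in the table, the sketch below builds Reynolds and Schmidt numbers from raw measurements. The DataFrame, its column names, and the numerical values are hypothetical and chosen only to show the pattern.

```python
import pandas as pd

# Hypothetical raw measurements for flow in a pipe (SI units)
flow = pd.DataFrame({
    "density": [1000.0, 1000.0, 1.2],         # kg/m^3
    "velocity": [1.0, 0.1, 5.0],              # m/s
    "diameter": [0.05, 0.05, 0.10],           # m
    "viscosity": [1.0e-3, 1.0e-3, 1.8e-5],    # Pa*s
    "diffusivity": [1.0e-9, 1.0e-9, 2.0e-5],  # m^2/s
})

# Reynolds number: ratio of inertial to viscous forces
flow["Re"] = flow["density"] * flow["velocity"] * flow["diameter"] / flow["viscosity"]

# Schmidt number: ratio of momentum diffusivity to mass diffusivity
flow["Sc"] = flow["viscosity"] / (flow["density"] * flow["diffusivity"])

print(flow[["Re", "Sc"]])
```

Four or five raw columns collapse into two features that already encode how the variables combine physically, which is often easier for a simple model to use.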

## Feature Scaling

Many algorithms are sensitive to feature scales. Consider predicting material properties from:
- Temperature: 300-400 K
- Pressure: 1-100 atm  
- Concentration: 0.001-0.1 mol/L

Without scaling, distance-based and regularized algorithms can treat temperature and pressure as "more important" than concentration simply because their values span a much larger numeric range.

### When Scaling Matters

| Algorithm | Needs Scaling? | Why |
|-----------|---------------|-----|
| Linear/Logistic Regression | Yes, for regularization | L1/L2 penalties affected by scale |
| SVM, SVR | Yes | Distance-based kernel computations |
| K-Means, KNN | Yes | Distance-based |
| PCA | Yes | Variance-based |
| Decision Trees | No | Split thresholds adapt to scale |
| Random Forest, XGBoost | No | Tree-based, scale-invariant |
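
To make the "distance-based" entries concrete, here is a small sketch that measures how much each feature contributes to the squared Euclidean distance between two operating points, before and after standardization. The helper `squared_distance_shares` is a hypothetical name introduced for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: temperature (K), pressure (atm), concentration (mol/L)
X = np.array([
    [300.0,   1.0, 0.001],
    [350.0,  25.0, 0.025],
    [400.0,  50.0, 0.050],
    [450.0,  75.0, 0.075],
    [500.0, 100.0, 0.100],
])

def squared_distance_shares(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

# Compare the first and last operating points
raw_shares = squared_distance_shares(X[0], X[-1])

X_scaled = StandardScaler().fit_transform(X)
scaled_shares = squared_distance_shares(X_scaled[0], X_scaled[-1])

print("Raw shares:   ", raw_shares.round(6))
print("Scaled shares:", scaled_shares.round(3))
```

On the raw data the concentration column contributes almost nothing to the distance, so KNN or K-Means would effectively ignore it; after scaling, the three features contribute about equally.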

### Two Common Approaches

In [None]:
# Sample data with different scales
data = pd.DataFrame({
    'temperature_K': [300, 350, 400, 450, 500],
    'pressure_atm': [1, 25, 50, 75, 100],
    'conc_mol_L': [0.001, 0.025, 0.050, 0.075, 0.100]
})

print("Original data:")
print(data)
print(f"\nStandard deviations: {data.std().values}")

In [None]:
# StandardScaler: zero mean, unit variance
# z = (x - mean) / std
# Good for: most scale-sensitive algorithms (regularized regression, SVM, PCA);
# works best when features are roughly symmetric without extreme outliers

scaler_standard = StandardScaler()
data_standard = pd.DataFrame(
    scaler_standard.fit_transform(data),
    columns=data.columns
)

print("StandardScaler (z-score normalization):")
print(data_standard.round(3))
print(f"\nMeans: {data_standard.mean().values.round(10)}")
print(f"Stds:  {data_standard.std(ddof=0).values.round(3)}")  # ddof=0 matches StandardScaler (pandas defaults to ddof=1)

In [None]:
# MinMaxScaler: scales to [0, 1] range
# x_scaled = (x - min) / (max - min)
# Good for: bounded features, neural networks with sigmoid activations

scaler_minmax = MinMaxScaler()
data_minmax = pd.DataFrame(
    scaler_minmax.fit_transform(data),
    columns=data.columns
)

print("MinMaxScaler (0-1 normalization):")
print(data_minmax.round(3))
print(f"\nMin: {data_minmax.min().values}")
print(f"Max: {data_minmax.max().values}")

### Critical Warning: Fit on Training Data Only!

A common mistake is fitting the scaler on all data, then splitting. This causes **data leakage**—information from the test set contaminates training.

```python
# WRONG
X_scaled = scaler.fit_transform(X)  # Sees all data!
X_train, X_test = train_test_split(X_scaled)

# CORRECT
X_train, X_test = train_test_split(X)
scaler.fit(X_train)  # Learn parameters from training only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Apply same transformation
```

Use scikit-learn Pipelines to avoid this mistake (covered in the regularization module).
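
As a brief preview, here is a minimal sketch of the pipeline pattern. The synthetic data, the choice of `Lasso`, and the `alpha` value are assumptions made for illustration, not a recommended configuration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with very different feature scales (illustrative only)
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(300, 400, 200),    # temperature-like
    rng.uniform(1, 100, 200),      # pressure-like
    rng.uniform(0.001, 0.1, 200),  # concentration-like
])
y = 0.05 * X[:, 0] + 0.02 * X[:, 1] + 30 * X[:, 2] + rng.normal(0, 0.2, 200)

# The scaler is re-fit on the training part of every fold, so the
# held-out part never influences the scaling parameters
pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.01))
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")

print(f"CV R²: {scores.mean():.3f} ± {scores.std():.3f}")
```

Because the scaler and the model travel together as one estimator, there is no separate `fit_transform` call that could accidentally see the test data.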

## Polynomial and Interaction Features

When you don't have domain knowledge suggesting specific transformations, polynomial features provide a general way to capture nonlinearity.

### The Tradeoff

For features [x₁, x₂] with degree=2:
- **Output**: [1, x₁, x₂, x₁², x₁x₂, x₂²]
- You get interactions (x₁x₂) and nonlinear terms (x₁²)

**The danger**: Feature explosion! With d features and degree p:
- Number of terms = C(d+p, p) = (d+p)! / (d! × p!)
- 10 features, degree 3 → 286 terms
- 50 features, degree 3 → 23,426 terms

With more features than samples, an unregularized linear model can fit the training data exactly, so overfitting is almost certain.

In [None]:
# Demonstrate polynomial features
X_simple = np.array([[1, 2], [3, 4], [5, 6]])
print("Original features (x1, x2):")
print(X_simple)

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_simple)

print(f"\nPolynomial features (degree=2):")
print(f"Feature names: {poly.get_feature_names_out()}")
print(X_poly)

In [None]:
# Feature explosion demonstration
from math import comb

print("Number of polynomial features:")
print(f"{'Features':<10} {'Degree 2':<12} {'Degree 3':<12} {'Degree 4':<12}")
print("-" * 46)
for d in [5, 10, 20, 50]:
    n2 = comb(d + 2, 2) - 1  # exclude bias
    n3 = comb(d + 3, 3) - 1
    n4 = comb(d + 4, 4) - 1
    print(f"{d:<10} {n2:<12} {n3:<12} {n4:<12}")

### Interaction-Only Features

Sometimes you want interactions (x₁×x₂) but not higher powers (x₁²). This is common when you believe features interact but each has a linear effect individually.

In [None]:
# Interaction features only (no x^2 terms)
poly_interact = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interact = poly_interact.fit_transform(X_simple)

print("Interaction-only features:")
print(f"Feature names: {poly_interact.get_feature_names_out()}")
print(X_interact)
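
To see why an interaction term can matter, the sketch below fits a plain linear model and an interaction-augmented one to a synthetic target driven by the product of two features. The data are invented for illustration, and the scores are in-sample, so they show representational capacity rather than generalization.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic target driven by a product of two features (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] * X[:, 1] + rng.normal(0, 0.05, 200)

# Plain linear model: cannot represent the x0*x1 term
r2_linear = LinearRegression().fit(X, y).score(X, y)

# Same algorithm on [x0, x1, x0*x1]
interact = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_int = interact.fit_transform(X)
r2_interact = LinearRegression().fit(X_int, y).score(X_int, y)

print(f"In-sample R² without interaction: {r2_linear:.3f}")
print(f"In-sample R² with interaction:    {r2_interact:.3f}")
```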

## Handling Categorical Variables

Many chemical engineering datasets include categorical variables:
- Catalyst type (Pt, Pd, Ni, ...)
- Solvent (water, ethanol, acetone, ...)
- Reactor type (batch, CSTR, PFR)
- Material phase (solid, liquid, gas)

ML algorithms need numbers, not strings. How you encode categories affects model performance.

### Label Encoding vs One-Hot Encoding

| Method | Encoding | When to Use |
|--------|----------|-------------|
| Label | A→0, B→1, C→2 | Ordinal categories (low/medium/high) |
| One-Hot | A→[1,0,0], B→[0,1,0], C→[0,0,1] | Nominal categories (no ordering) |

**Warning**: Label encoding implies ordering! If you encode catalysts as Ni=0, Pd=1, Pt=2 (the alphabetical order that `LabelEncoder` produces), the model treats Pd as "between" Ni and Pt, which is chemically meaningless.
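
For categories that really are ordered, the order should be stated explicitly rather than left to alphabetical sorting. A minimal sketch using scikit-learn's `OrdinalEncoder`; the `loading` column and its levels are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered category: catalyst loading level
df_ord = pd.DataFrame({"loading": ["medium", "low", "high", "low"]})

# Pass the order explicitly; the default would sort alphabetically
# (high < low < medium), which is not the intended ranking
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
df_ord["loading_code"] = encoder.fit_transform(df_ord[["loading"]]).ravel()

print(df_ord)
```

Even with the right order, integer codes assume equal spacing between levels; if that is doubtful, one-hot encoding remains the safer choice.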

In [None]:
# Catalyst screening dataset
catalyst_data = pd.DataFrame({
    'catalyst': ['Pt', 'Pd', 'Ni', 'Pt', 'Pd', 'Ni', 'Pt', 'Pd'],
    'temperature': [350, 350, 350, 400, 400, 400, 450, 450],
    'conversion': [0.65, 0.58, 0.42, 0.82, 0.75, 0.61, 0.91, 0.88]
})

print("Original data:")
print(catalyst_data)

In [None]:
# Label encoding - WRONG for nominal categories
le = LabelEncoder()
catalyst_data['catalyst_label'] = le.fit_transform(catalyst_data['catalyst'])

print("Label encoding (problematic for nominal categories):")
print(f"Classes: {le.classes_}")
print(catalyst_data[['catalyst', 'catalyst_label']])

In [None]:
# One-hot encoding - CORRECT for nominal categories
ohe = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' avoids multicollinearity
catalyst_encoded = ohe.fit_transform(catalyst_data[['catalyst']])

print("One-hot encoding (correct for nominal categories):")
print(f"Feature names: {ohe.get_feature_names_out()}")
print(catalyst_encoded)

### Why drop='first'?

With 3 catalysts, we only need 2 indicator columns:
- Pd=1, Pt=0 → It's Pd
- Pd=0, Pt=1 → It's Pt  
- Pd=0, Pt=0 → It must be Ni (the dropped category)

This avoids **multicollinearity**: if all three columns existed, their sum would always equal 1, which duplicates the model's intercept column and creates an exact linear dependency. Ordinary least squares then has no unique solution for the coefficients.
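
The dependency can be checked numerically by comparing the rank of the design matrix with its number of columns. A minimal sketch, using the same catalyst labels as above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

catalysts = pd.Series(["Pt", "Pd", "Ni", "Pt", "Pd", "Ni", "Pt", "Pd"], name="catalyst")

# All three indicator columns vs. the drop-first version
full = pd.get_dummies(catalysts).to_numpy(dtype=float)
reduced = pd.get_dummies(catalysts, drop_first=True).to_numpy(dtype=float)

# Add the intercept column that a regression model would include
intercept = np.ones((len(catalysts), 1))
X_full = np.hstack([intercept, full])        # 4 columns
X_reduced = np.hstack([intercept, reduced])  # 3 columns

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(f"Full one-hot: {X_full.shape[1]} columns, rank {rank_full}")
print(f"Drop first:   {X_reduced.shape[1]} columns, rank {rank_reduced}")
```

A rank lower than the column count means one column is redundant; dropping one indicator restores full column rank.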

In [None]:
# Pandas alternative: pd.get_dummies()
catalyst_dummies = pd.get_dummies(catalyst_data, columns=['catalyst'], drop_first=True)
print("Using pd.get_dummies():")
print(catalyst_dummies)

## Feature Selection

More features isn't always better. Problems with too many features:

1. **Overfitting**: Model memorizes noise instead of learning patterns
2. **Curse of dimensionality**: Data becomes sparse in high dimensions
3. **Computational cost**: Training and prediction slow down
4. **Interpretability**: Hard to explain a model with 1000 features

### Three Approaches to Feature Selection

| Approach | Method | Pros | Cons |
|----------|--------|------|------|
| Filter | Statistical tests (correlation, mutual info) | Fast, model-agnostic | Ignores feature interactions |
| Wrapper | Train models with feature subsets | Considers interactions | Slow, overfitting risk |
| Embedded | L1 regularization (Lasso) | Efficient, considers interactions | Model-specific |

In [None]:
# Create dataset with informative and noise features
np.random.seed(42)
n_samples = 200

# Informative features
X1 = np.random.uniform(0, 10, n_samples)  # Strong predictor
X2 = np.random.uniform(0, 5, n_samples)   # Moderate predictor
X3 = np.random.uniform(0, 1, n_samples)   # Weak predictor

# Noise features (no relationship with y)
X_noise = np.random.randn(n_samples, 7)

# Target: depends on X1, X2, X3 with different strengths
y = 3*X1 + 1.5*X2 + 0.5*X3 + np.random.randn(n_samples)*2

# Combine into DataFrame
X = np.column_stack([X1, X2, X3, X_noise])
feature_names = ['X1_strong', 'X2_moderate', 'X3_weak'] + [f'noise_{i}' for i in range(7)]
df_select = pd.DataFrame(X, columns=feature_names)

print(f"Dataset shape: {df_select.shape}")
print(f"True informative features: X1_strong, X2_moderate, X3_weak")

In [None]:
# Filter method: SelectKBest with f_regression
selector = SelectKBest(score_func=f_regression, k=5)
selector.fit(X, y)

# Get scores for each feature
scores = pd.DataFrame({
    'feature': feature_names,
    'f_score': selector.scores_,
    'selected': selector.get_support()
}).sort_values('f_score', ascending=False)

print("Filter method (F-statistic):")
print(scores)

The F-scores clearly distinguish informative from noise features:

- **X1_strong**: Highest score by a wide margin, strongest relationship with y
- **X2_moderate** (F≈25): Second highest, moderate relationship  
- **X3_weak** (F≈2): Low score—the true weak predictor is hard to distinguish from noise
- **Noise features** (F≈0-1): As expected, no significant relationship

Notice that X3_weak barely beats some noise features! With only a coefficient of 0.5 in the true model, this feature contributes little to y. In a real analysis, you might reasonably exclude it—sometimes weak features add more noise than signal.

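The filter method selected the top 5 features, which includes X1, X2, X3, and unfortunately 2 noise features. With k=5 and only three informative features, at least two noise features are selected by construction, so the choice of k matters as much as the scoring function. More generally, with noisy data and weak true effects, perfect selection is impossible.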
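
The F-statistic only measures linear association. As a hedged illustration of how the choice of score function changes the ranking, the sketch below compares `f_regression` with `mutual_info_regression` on synthetic data where one feature affects the target only through its square. The data and variable names are invented for this example.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
x_linear = rng.uniform(-1, 1, 500)     # linear effect on y
x_quadratic = rng.uniform(-1, 1, 500)  # purely quadratic effect on y
x_noise = rng.uniform(-1, 1, 500)      # no effect on y
X = np.column_stack([x_linear, x_quadratic, x_noise])
y = x_linear + 3 * x_quadratic**2 + rng.normal(0, 0.1, 500)

f_scores, _ = f_regression(X, y)
mi_scores = mutual_info_regression(X, y, random_state=0)

for name, f, mi in zip(["x_linear", "x_quadratic", "x_noise"], f_scores, mi_scores):
    print(f"{name:<12} F = {f:8.2f}   MI = {mi:.3f}")
```

The quadratic feature has almost no linear correlation with y, so its F-score looks like noise, while mutual information still picks it up. Mutual information is slower and noisier on small samples, so it complements the F-test rather than replacing it.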

In [None]:
# Embedded method: Lasso regression
# L1 regularization drives coefficients to exactly zero

# Scale features first (important for Lasso)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

lasso = Lasso(alpha=0.5)
lasso.fit(X_scaled, y)

lasso_coefs = pd.DataFrame({
    'feature': feature_names,
    'coefficient': lasso.coef_,
    'selected': lasso.coef_ != 0
}).sort_values('coefficient', key=abs, ascending=False)

print("Embedded method (Lasso):")
print(lasso_coefs)

### Comparing Selection Methods

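Both methods correctly identified X1 and X2 as the most important features. The filter method is faster but scores each feature independently. Lasso considers features jointly, so it can drop a feature that is redundant given the others; note, however, that among strongly correlated features it tends to keep one somewhat arbitrarily.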

**Practical advice**: Start with filter methods for quick exploration, then use Lasso or tree-based importance for final selection.
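
The wrapper approach from the table is not demonstrated above. A minimal sketch using recursive feature elimination (`RFE`) follows. It regenerates a dataset shaped like the one above with its own random generator, so values will differ from the earlier cells, and the choice of two features to keep is an assumption made for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Re-create a dataset shaped like the one above (local generator)
rng = np.random.default_rng(42)
n_samples = 200
X_info = np.column_stack([
    rng.uniform(0, 10, n_samples),  # strong predictor
    rng.uniform(0, 5, n_samples),   # moderate predictor
    rng.uniform(0, 1, n_samples),   # weak predictor
])
X_all = np.column_stack([X_info, rng.standard_normal((n_samples, 7))])
y = (3 * X_info[:, 0] + 1.5 * X_info[:, 1] + 0.5 * X_info[:, 2]
     + 2 * rng.standard_normal(n_samples))
names = ["X1_strong", "X2_moderate", "X3_weak"] + [f"noise_{i}" for i in range(7)]

# Scale first so coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X_all)

# Recursive feature elimination: fit, drop the smallest coefficient, repeat
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2, step=1)
rfe.fit(X_scaled, y)

selected = [name for name, keep in zip(names, rfe.support_) if keep]
print("Selected by RFE:", selected)
print("Elimination ranking (1 = kept):", dict(zip(names, rfe.ranking_)))
```

Each elimination step refits the model, which is why wrapper methods cost more than filters; in practice the number of features to keep would itself be chosen by cross-validation.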

## Handling Missing Data

Real datasets often have missing values. Before feature engineering, you need a strategy:

| Strategy | When to Use | Implementation |
|----------|-------------|----------------|
| Drop rows | Few missing, random pattern | `df.dropna()` |
| Drop columns | >50% missing in a feature | `df.drop(columns=[...])` |
| Mean/median imputation | Numerical, random missing | `SimpleImputer(strategy='mean')` |
| Mode imputation | Categorical | `SimpleImputer(strategy='most_frequent')` |
| Flag + impute | Missingness is informative | Create `is_missing` column |

**Warning**: Mean imputation reduces variance and can bias coefficients. For serious analysis, consider multiple imputation or model-based approaches.

In [None]:
from sklearn.impute import SimpleImputer

# Data with missing values
df_missing = pd.DataFrame({
    'temperature': [300, 350, np.nan, 400, 450],
    'pressure': [1, np.nan, 50, 75, 100],
    'yield': [0.65, 0.72, 0.78, np.nan, 0.91]
})

print("Data with missing values:")
print(df_missing)
print(f"\nMissing counts: {df_missing.isna().sum().to_dict()}")

In [None]:
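# Mean imputation (applied to every column to show the mechanics)
# Note: if 'yield' is the prediction target, rows with a missing target
# are usually dropped rather than imputed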
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(
    imputer.fit_transform(df_missing),
    columns=df_missing.columns
)

print("After mean imputation:")
print(df_imputed)
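
The "flag + impute" row of the table can be sketched with `SimpleImputer(add_indicator=True)`. The `_was_missing` column names are hypothetical, and the manual naming below assumes every input column contained at least one missing value when the imputer was fit.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X_missing = pd.DataFrame({
    "temperature": [300, 350, np.nan, 400, 450],
    "pressure": [1, np.nan, 50, 75, 100],
})

# add_indicator=True appends one 0/1 column per feature that had missing values
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_filled = imputer.fit_transform(X_missing)

# Manual naming: valid here because both columns had missing values
columns = list(X_missing.columns) + [f"{c}_was_missing" for c in X_missing.columns]
df_flagged = pd.DataFrame(X_filled, columns=columns)

print(df_flagged)
```

The indicator columns let a model learn whether missingness itself carries information, for example a sensor that drops out under particular operating conditions.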

## Feature Engineering Workflow

A typical workflow:

1. **Understand your data**: Check types, distributions, missing values
2. **Handle missing data**: Impute or drop based on pattern and amount
3. **Encode categoricals**: One-hot for nominal, ordinal encoding if ordered
4. **Create domain features**: Use physical/chemical knowledge
5. **Add polynomial/interaction terms**: If needed and data is sufficient
6. **Scale features**: Required for many algorithms
7. **Select features**: Remove noise, improve interpretability

**Important**: Steps 5-7, along with any imputation from step 2, should be done within cross-validation (fit on the training folds only) to avoid data leakage!
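
One way to keep the whole workflow inside cross-validation is to express it as a single scikit-learn pipeline. The sketch below is illustrative only: the synthetic data, the column names, the helper `add_inverse_temperature`, and the choice of `k=3` are all assumptions, not a prescribed recipe.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Synthetic data: the target depends on 1/T and catalyst type, not on pressure
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "temperature": rng.uniform(350, 450, n),
    "pressure": rng.uniform(1, 10, n),
    "catalyst": rng.choice(["Pt", "Ni"], n),
})
offset = df["catalyst"].map({"Pt": 0.3, "Ni": 0.0})
y = 5.0 - 1500.0 / df["temperature"] + offset + rng.normal(0, 0.05, n)
df.loc[rng.choice(n, 10, replace=False), "pressure"] = np.nan  # inject missing values

def add_inverse_temperature(X):
    """Domain feature: append 1/T to the numeric block (column 0 is temperature)."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([X, 1.0 / X[:, 0]])

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),              # step 2
    ("domain", FunctionTransformer(add_inverse_temperature)),  # step 4
    ("scale", StandardScaler()),                               # step 6
])
preprocess = ColumnTransformer([
    ("num", numeric, ["temperature", "pressure"]),
    ("cat", OneHotEncoder(drop="first"), ["catalyst"]),        # step 3
])
model = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectKBest(f_regression, k=3)),                # step 7
    ("regress", LinearRegression()),
])

# Every data-dependent step is re-fit on the training folds only
scores = cross_val_score(model, df, y, cv=5, scoring="r2")
print(f"CV R²: {scores.mean():.3f} ± {scores.std():.3f}")
```

Because imputation, scaling, and selection all live inside the estimator, `cross_val_score` cannot leak held-out information into any of them.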

## Summary

### Key Takeaways

1. **Domain knowledge is your superpower**: Physics-informed features often beat complex algorithms on simple features

2. **Scaling matters for many algorithms**: Use StandardScaler or MinMaxScaler, but fit only on training data

3. **Polynomial features explode quickly**: Be cautious with high degrees or many input features

4. **Encode categories correctly**: One-hot for nominal; integer (ordinal) codes only for genuinely ordered categories

5. **Feature selection improves models**: Filter methods are fast; Lasso is powerful

### When in Doubt

- Start simple: raw features, linear model
- Add complexity only when simple fails
- Always validate on held-out data
- If you can derive a feature from physical laws, try it!

### What's Next

In the next module, we'll explore **dimensionality reduction**—another way to handle high-dimensional feature spaces by finding lower-dimensional representations that preserve important structure.

---

## The Catalyst Crisis: "Speaking the Right Language"

*Continued from Intermediate Pandas...*

---

"R-squared of 0.52," Alex muttered, staring at her model results. "That's... not great."

She'd been trying to predict batch quality from the sensor data for three days. Linear regression on temperature, pressure, flow rates—all the obvious inputs. The model captured half the variance and missed the other half completely.

Professor Pipeline appeared at her desk, coffee in hand. "Stuck?"

"The model doesn't work. I'm using all the right features—"

"Are you?" He sat down beside her. "What do you know about reaction kinetics?"

Alex almost laughed. "I did seven years of reaction engineering. I know Arrhenius backwards."

"Then why are you feeding your model raw temperature?"

She blinked. "Because... that's the measurement?"

"But the reaction rate doesn't depend on temperature linearly, does it?" He took a sip of coffee. "If you know the physics, encode it. Transform temperature to one-over-T. The Arrhenius relationship becomes linear."

Alex felt something click—the same feeling she got when a P&ID suddenly made sense. Of course. She knew this. She'd just forgotten to use what she knew.

She rebuilt the model with transformed features: 1/T for the Arrhenius relationship, log of concentration for reaction order effects, dimensionless groups where they made sense.

R-squared: 0.91.

Same data. Same algorithm. Different features. The model went from useless to useful because she'd finally asked the question in a language the physics understood.

"Domain knowledge isn't optional," Professor Pipeline said, watching her results. "All those young coders who can implement any algorithm—they still need someone who knows what the numbers mean."

Later, Maya found Alex in the lab, still refining her feature transformations. "How'd you know to use one-over-T?"

"Arrhenius equation. The rate constant depends exponentially on the inverse of temperature. If you log both sides—"

Maya held up her hands. "Okay, I believe you. I just wouldn't have thought of it."

"And I wouldn't have thought to vectorize my loops. That's why we're on the same team."

The transformed features revealed something else: a strong interaction between temperature and catalyst age. The relationship wasn't just additive—it was multiplicative. Old catalyst at high temperature performed dramatically worse than either factor alone would predict.

Alex added to the mystery board: **Temperature × catalyst age interaction. The combination matters more than either alone.**

*To be continued...*