[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jkitchin/s26-06642/blob/main/dsmles/06-linear-regression/linear-regression.ipynb)

# Module 05: Linear Regression

Building predictive models with linear regression.

## Learning Objectives

1. Understand the linear regression model
2. Fit models using scikit-learn
3. Interpret regression coefficients
4. Evaluate model performance with metrics
5. Use train/test splits for validation

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler

## The Linear Regression Model: Simple Yet Powerful

Linear regression is often the first modeling tool we reach for, and for good reason. Despite its simplicity, it's:

- **Interpretable**: Coefficients have clear physical meaning
- **Fast**: The least-squares solution has a closed form, so no iterative optimization is needed
- **Well understood**: Its statistical properties are thoroughly characterized
- **A baseline**: If linear works, why use something complex?

### The Mathematical Form

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon$$

Each coefficient $\beta_i$ tells you: *"When $x_i$ increases by 1 unit (holding other features constant), y changes by $\beta_i$ units."*
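
As a minimal sketch of what "fitting" the equation above means, the coefficients can be recovered with a direct least-squares solve in NumPy and compared with scikit-learn. The data, the "true" coefficients, and all variable names are made up for illustration; `np.linalg.lstsq` is one standard way to solve the problem and is not claimed to be what scikit-learn does internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: y = 2.0 + 0.5*x1 - 1.5*x2 + noise
X = rng.uniform(0, 10, size=(200, 2))
true_beta = np.array([2.0, 0.5, -1.5])  # [beta_0, beta_1, beta_2]
y = true_beta[0] + X @ true_beta[1:] + rng.normal(0, 0.1, size=200)

# Prepend a column of ones so the intercept beta_0 is estimated too
A = np.column_stack([np.ones(len(X)), X])

# Solve min ||A beta - y||^2 directly (no iterative optimizer)
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# scikit-learn should agree to numerical precision
model = LinearRegression().fit(X, y)
sk_beta = np.concatenate([[model.intercept_], model.coef_])

print("lstsq  :", np.round(beta_hat, 3))
print("sklearn:", np.round(sk_beta, 3))

# Interpretation of beta_1: raising x1 by one unit, with x2 held fixed,
# changes the prediction by beta_1
x_a = np.array([[3.0, 4.0]])
x_b = np.array([[4.0, 4.0]])
delta = model.predict(x_b)[0] - model.predict(x_a)[0]
print("change in prediction:", round(delta, 3))
```

The last three lines check the "holding other features constant" reading of a coefficient: only `x1` differs between the two inputs, so the difference in predictions equals `model.coef_[0]`.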

### When Does Linear Regression Work?

Linear regression assumes the relationship between features and target is **approximately linear**. This works well when:
- Effects are additive (no strong interactions)
- Relationships are monotonic (increasing or decreasing)
- You're working in a limited range where curves look straight

### When It Fails

Linear regression will struggle with:
- Saturation effects (Michaelis-Menten kinetics, Langmuir isotherms)
- Exponential relationships (Arrhenius law without log-transform)
- Strong interactions (catalyst × temperature effects)
- Threshold effects (phase transitions)

We'll learn nonlinear methods later. For now, remember: **start with linear, add complexity only when needed.**
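
The Arrhenius case in the list above shows how a transform can rescue a linear model. The sketch below uses noise-free synthetic data with a made-up pre-exponential factor and activation energy (both are assumptions, not values from this module) and compares a straight-line fit of $k$ against $T$ with a fit of $\ln k$ against $1/T$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

R = 8.314        # gas constant, J/(mol K)
A = 1.0e7        # hypothetical pre-exponential factor, 1/s
Ea = 50_000.0    # hypothetical activation energy, J/mol

T = np.linspace(300, 500, 50)          # K
k = A * np.exp(-Ea / (R * T))          # Arrhenius law, noise-free

# Naive fit: k against T (the relationship is exponential, not linear)
raw = LinearRegression().fit(T.reshape(-1, 1), k)
r2_raw = raw.score(T.reshape(-1, 1), k)

# Transformed fit: ln(k) = ln(A) - (Ea/R) * (1/T) is linear in 1/T
X_inv = (1.0 / T).reshape(-1, 1)
log_fit = LinearRegression().fit(X_inv, np.log(k))
r2_log = log_fit.score(X_inv, np.log(k))

Ea_est = -log_fit.coef_[0] * R          # slope = -Ea/R
A_est = np.exp(log_fit.intercept_)      # intercept = ln(A)

print(f"R² (k vs T):        {r2_raw:.4f}")
print(f"R² (ln k vs 1/T):   {r2_log:.4f}")
print(f"Recovered Ea:       {Ea_est:.0f} J/mol")
```

Because the data are noise-free, the transformed fit recovers `Ea` and `A` essentially exactly. With real measurements the log-transform also changes how noise is weighted, so the recovered parameters would only be estimates.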

In [None]:
# Create reactor experiment dataset
np.random.seed(42)
n_samples = 100

# Features: temperature, pressure, catalyst loading, residence time
temperature = np.random.uniform(300, 500, n_samples)  # K
pressure = np.random.uniform(1, 10, n_samples)  # atm
catalyst_loading = np.random.uniform(0.5, 5, n_samples)  # wt%
residence_time = np.random.uniform(1, 30, n_samples)  # min

# True relationship (with some noise)
# Conversion increases with temp, pressure, catalyst, and time
conversion = (
    0.1 * (temperature - 300) / 200 +  # Temperature effect
    0.05 * pressure +                   # Pressure effect
    0.08 * catalyst_loading +           # Catalyst effect
    0.01 * residence_time +             # Time effect
    np.random.normal(0, 0.05, n_samples)  # Noise
)
conversion = np.clip(conversion, 0, 1)  # Keep between 0 and 1

# Create DataFrame
df = pd.DataFrame({
    'temperature': temperature,
    'pressure': pressure,
    'catalyst_loading': catalyst_loading,
    'residence_time': residence_time,
    'conversion': conversion
})

print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
# Quick look at the data
df.describe()

## Simple Linear Regression

Let's start with a single feature to understand the basics.

In [None]:
# Simple linear regression: conversion vs temperature
X_simple = df[['temperature']].values
y = df['conversion'].values

# Fit the model
model_simple = LinearRegression()
model_simple.fit(X_simple, y)

print(f"Intercept: {model_simple.intercept_:.4f}")
print(f"Coefficient (temperature): {model_simple.coef_[0]:.6f}")
print(f"R² score: {model_simple.score(X_simple, y):.4f}")

In [None]:
# Visualize the fit
plt.figure(figsize=(10, 6))

plt.scatter(df['temperature'], df['conversion'], alpha=0.6, label='Data')

# Regression line
temp_range = np.linspace(300, 500, 100).reshape(-1, 1)
plt.plot(temp_range, model_simple.predict(temp_range), 'r-', 
         linewidth=2, label='Linear fit')

plt.xlabel('Temperature (K)')
plt.ylabel('Conversion')
plt.title('Simple Linear Regression: Conversion vs Temperature')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## Multiple Linear Regression

Now let's use all features.

In [None]:
# Multiple linear regression with all features
feature_names = ['temperature', 'pressure', 'catalyst_loading', 'residence_time']
X = df[feature_names].values
y = df['conversion'].values

# Fit the model
model = LinearRegression()
model.fit(X, y)

print(f"Intercept: {model.intercept_:.4f}")
print("\nCoefficients:")
for name, coef in zip(feature_names, model.coef_):
    print(f"  {name}: {coef:.6f}")

print(f"\nR² score: {model.score(X, y):.4f}")

## Interpreting Coefficients: The Hidden Complexity

Coefficient interpretation seems straightforward, but there are important subtleties:

### The Scale Problem

Raw coefficients depend on feature scales. Here temperature is in kelvin (300-500) and pressure is in atm (1-10), so their coefficients aren't directly comparable. A coefficient of 0.001 for temperature might matter more than it looks next to 0.1 for pressure, because temperature varies over hundreds of units while pressure varies over fewer than ten.

**Solution**: Standardize the features first; the coefficients are then comparable.

### Standardized Coefficients

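After standardizing the features (mean=0, std=1), each coefficient answers: "When this feature increases by 1 standard deviation, by how much does the target change, in the target's own units?" (The target would also have to be standardized for the answer to be in standard deviations of y; the code below scales only the features.)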

This lets you rank feature importance fairly.
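
One way to see what standardization does to the coefficients: each standardized coefficient is the raw coefficient multiplied by that feature's standard deviation. The sketch below checks this on synthetic data; the feature ranges mirror this module's, but the coefficients and noise level are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical features on very different scales
temperature = rng.uniform(300, 500, 200)   # K
pressure = rng.uniform(1, 10, 200)         # atm
X = np.column_stack([temperature, pressure])
y = 0.001 * temperature + 0.01 * pressure + rng.normal(0, 0.01, 200)

raw = LinearRegression().fit(X, y)

scaler = StandardScaler()
X_std = scaler.fit_transform(X)
scaled = LinearRegression().fit(X_std, y)

# Each standardized coefficient equals raw coefficient * feature std
# (scaler.scale_ holds the per-feature standard deviations)
print("raw coefficients:         ", raw.coef_)
print("standardized coefficients:", scaled.coef_)
print("raw * feature std:        ", raw.coef_ * scaler.scale_)
```

In this made-up example the raw pressure coefficient is about ten times the temperature coefficient, yet temperature has the larger standardized coefficient because it varies over a much wider range.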

In [None]:
# Standardized coefficients for comparison
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model_scaled = LinearRegression()
model_scaled.fit(X_scaled, y)

print("Standardized coefficients (relative importance):")
for name, coef in sorted(zip(feature_names, model_scaled.coef_), 
                         key=lambda x: abs(x[1]), reverse=True):
    print(f"  {name}: {coef:.4f}")

In [None]:
# Visualize coefficient importance
plt.figure(figsize=(10, 6))

coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': model_scaled.coef_
})
coef_df = coef_df.sort_values('Coefficient')

colors = ['red' if x < 0 else 'green' for x in coef_df['Coefficient']]
plt.barh(coef_df['Feature'], coef_df['Coefficient'], color=colors, edgecolor='black')
plt.xlabel('Standardized Coefficient')
plt.title('Feature Importance (Standardized Coefficients)')
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
plt.grid(True, alpha=0.3, axis='x')
plt.show()

## Train/Test Split: The Most Important Concept in ML

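Hold-out evaluation is perhaps the most important habit in machine learning: **never judge a model's predictive performance on the data it was trained on.** The R² values printed so far were computed on the training data, so they describe fit, not predictive ability.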

### Why?

The goal isn't to explain the data you have; it's to predict data you haven't seen. A model that memorizes training data (overfitting) looks great on training metrics but fails on new data.

### The Simple Solution: Hold-Out Validation

1. **Split** data into training (typically 70-80%) and test (20-30%) sets
2. **Train** only on training data
3. **Evaluate** on test data

The test set simulates "new, unseen data." If training and test performance are similar, your model generalizes well.

### What the Gap Tells You

- **Train R² ≈ Test R²**: Good! Model generalizes
- **Train R² >> Test R²**: Overfitting! Model is too complex
- **Train R² << Test R²**: Unusual (possibly data leakage or very small test set)
- **Both R² are low**: Underfitting! Model is too simple or features aren't predictive
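
To make the "Train R² >> Test R²" case concrete, here is a deliberately extreme sketch: the target is pure noise and the model has almost as many features as training samples. The sizes and seeds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical worst case: 25 random features, target is pure noise,
# so there is nothing real to learn
X = rng.normal(size=(60, 25))
y = rng.normal(size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print(f"Train R²: {train_r2:.3f}")
print(f"Test R²:  {test_r2:.3f}")
print(f"Gap:      {train_r2 - test_r2:.3f}")
```

With 30 training rows and 25 features, the fit can reproduce most of the training noise, so the training score looks respectable while the test score collapses. Only the held-out score reveals that nothing was learned.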

In [None]:
# Split data: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

In [None]:
# Fit on training data only
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on both sets
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print(f"Training R²: {train_score:.4f}")
print(f"Test R²: {test_score:.4f}")

## Model Evaluation Metrics: Choosing the Right One

Different metrics answer different questions. Choosing the right one depends on what you care about.

### R² (Coefficient of Determination)

$$R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

**Interpretation**: Proportion of variance explained by the model
- R² = 1: Perfect predictions
- R² = 0: Model does no better than always predicting the mean
- R² < 0: Model is worse than predicting the mean (possible with test data!)

**When to use**: General model comparison, communicating performance

### RMSE (Root Mean Squared Error)

$$RMSE = \sqrt{\frac{1}{n}\sum(y_i - \hat{y}_i)^2}$$

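**Interpretation**: Typical size of a prediction error, in the same units as y, with large errors weighted more heavily than small ones
- If y is a conversion in percent, RMSE = 5 means predictions are typically off by about 5 percentage points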

**When to use**: When you need error in original units, when large errors are especially bad

### MAE (Mean Absolute Error)

$$MAE = \frac{1}{n}\sum|y_i - \hat{y}_i|$$

**Interpretation**: Average absolute error
- Less sensitive to outliers than RMSE

**When to use**: When outliers shouldn't dominate the metric
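
The three formulas above can be checked by hand on a tiny example. The four values below are invented for the purpose; the comparison with scikit-learn happens in the final three print statements.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.6])

errors = y_true - y_pred

ss_res = np.sum(errors**2)                      # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean())**2)    # total sum of squares
r2 = 1 - ss_res / ss_tot
rmse = np.sqrt(np.mean(errors**2))
mae = np.mean(np.abs(errors))

print(f"R²:   {r2:.3f}")    # 0.956
print(f"RMSE: {rmse:.3f}")  # 0.235
print(f"MAE:  {mae:.3f}")   # 0.200

# The library functions implement the same formulas
print(np.isclose(r2, r2_score(y_true, y_pred)))
print(np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred))))
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
```

RMSE (0.235) exceeds MAE (0.200) here because the single error of 0.4 is squared before averaging, which is the sensitivity to large errors described above.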

In [None]:
# Make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate metrics
print("Training Metrics:")
print(f"  R²:   {r2_score(y_train, y_train_pred):.4f}")
print(f"  MSE:  {mean_squared_error(y_train, y_train_pred):.6f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_train, y_train_pred)):.4f}")
print(f"  MAE:  {mean_absolute_error(y_train, y_train_pred):.4f}")

print("\nTest Metrics:")
print(f"  R²:   {r2_score(y_test, y_test_pred):.4f}")
print(f"  MSE:  {mean_squared_error(y_test, y_test_pred):.6f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, y_test_pred)):.4f}")
print(f"  MAE:  {mean_absolute_error(y_test, y_test_pred):.4f}")

In [None]:
# Predicted vs Actual plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training set
axes[0].scatter(y_train, y_train_pred, alpha=0.6)
axes[0].plot([y.min(), y.max()], [y.min(), y.max()], 'r--', linewidth=2)
axes[0].set_xlabel('Actual Conversion')
axes[0].set_ylabel('Predicted Conversion')
axes[0].set_title(f'Training Set (R² = {r2_score(y_train, y_train_pred):.3f})')
axes[0].grid(True, alpha=0.3)

# Test set
axes[1].scatter(y_test, y_test_pred, alpha=0.6, color='orange')
axes[1].plot([y.min(), y.max()], [y.min(), y.max()], 'r--', linewidth=2)
axes[1].set_xlabel('Actual Conversion')
axes[1].set_ylabel('Predicted Conversion')
axes[1].set_title(f'Test Set (R² = {r2_score(y_test, y_test_pred):.3f})')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Residual Analysis: Diagnosing Model Problems

Residuals (actual - predicted) are your model's mistakes. Analyzing them reveals problems that R² alone can't detect.

### What Good Residuals Look Like

- **Random scatter** around zero: No systematic patterns
- **Constant variance**: Spread doesn't change with predicted value
- **Normally distributed**: Bell-shaped histogram

### Red Flags to Watch For

| Pattern | What It Means | Possible Fix |
|---------|---------------|--------------|
| Curved pattern | Nonlinear relationship | Add polynomial terms or use nonlinear model |
| Fan shape (increasing spread) | Heteroscedasticity | Log-transform y, use weighted regression |
| Clusters | Distinct groups in data | Add categorical features, consider separate models |
| Outliers | Unusual data points | Investigate, possibly remove or use robust methods |

### The Residual Plot Recipe

1. Plot residuals vs predicted values (should be random cloud)
2. Plot histogram of residuals (should be roughly normal)
3. Plot residuals vs each feature (look for patterns)
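
As a numeric stand-in for step 3 of the recipe, the sketch below fits a straight line to data that contain a quadratic term and then inspects the residuals. The data-generating function is made up for illustration. In a plot of residuals against `x` the same problem would appear as a U-shaped pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data with curvature that a straight line cannot capture
x = rng.uniform(-3, 3, 200)
y = 1.0 + 0.5 * x + x**2 + rng.normal(0, 0.2, 200)
X = x.reshape(-1, 1)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)
residuals = y - y_pred

# With an intercept, least-squares training residuals average to zero and
# are uncorrelated with x, so these two checks cannot reveal the curvature
mean_resid = residuals.mean()
corr_linear = np.corrcoef(x, residuals)[0, 1]

# A curved residual pattern shows up as correlation with a nonlinear term
corr_quadratic = np.corrcoef(x**2, residuals)[0, 1]

print(f"Mean residual:       {mean_resid:.2e}")
print(f"corr(residual, x):   {corr_linear:.2e}")
print(f"corr(residual, x²):  {corr_quadratic:.3f}")
```

A near-zero mean residual is therefore not evidence of a good model on training data. The informative check is whether the residuals still carry structure, which is what the plots are for.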

In [None]:
# Residual plots
residuals = y_test - y_test_pred

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Residuals vs predicted
axes[0].scatter(y_test_pred, residuals, alpha=0.6)
axes[0].axhline(y=0, color='r', linestyle='--')
axes[0].set_xlabel('Predicted Conversion')
axes[0].set_ylabel('Residual')
axes[0].set_title('Residuals vs Predicted Values')
axes[0].grid(True, alpha=0.3)

# Histogram of residuals
axes[1].hist(residuals, bins=15, edgecolor='black', alpha=0.7)
axes[1].axvline(x=0, color='r', linestyle='--')
axes[1].set_xlabel('Residual')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Residuals')

plt.tight_layout()
plt.show()

print(f"Mean residual: {residuals.mean():.6f}")
print(f"Std of residuals: {residuals.std():.4f}")

## Making Predictions

Once trained, use the model to predict outcomes for new conditions.

In [None]:
# Predict conversion for new conditions
new_conditions = pd.DataFrame({
    'temperature': [350, 400, 450],
    'pressure': [5, 5, 5],
    'catalyst_loading': [2.5, 2.5, 2.5],
    'residence_time': [15, 15, 15]
})

predictions = model.predict(new_conditions[feature_names].values)

new_conditions['predicted_conversion'] = predictions
print("Predictions for new conditions:")
new_conditions

## Polynomial Features

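If relationships are nonlinear, we can add polynomial terms (squares and pairwise products of the features). The model stays linear in its coefficients, so `LinearRegression` still applies.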

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features (degree 2)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(f"Original features: {X.shape[1]}")
print(f"Polynomial features: {X_poly.shape[1]}")
print(f"\nFeature names: {poly.get_feature_names_out(feature_names)}")

In [None]:
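# Compare linear vs polynomial
# Same test_size and random_state as the earlier split, so the same rows land
# in train and test, and y_train/y_test are unchanged by this reassignment
X_train_poly, X_test_poly, y_train, y_test = train_test_split(
    X_poly, y, test_size=0.2, random_state=42
)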

model_poly = LinearRegression()
model_poly.fit(X_train_poly, y_train)

print(f"Linear model test R²: {model.score(X_test, y_test):.4f}")
print(f"Polynomial model test R²: {model_poly.score(X_test_poly, y_test):.4f}")

## Common Pitfalls

1. **Overfitting**: Model fits training data too well, poor generalization
   - Solution: Use train/test split, regularization

2. **Multicollinearity**: Features are highly correlated
   - Can make coefficients unstable
   - Solution: Remove redundant features, use PCA, or regularization

3. **Extrapolation**: Predicting outside the range of training data
   - Linear models may give unrealistic predictions
   - Be cautious about predictions far from training data
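
The extrapolation pitfall is easy to reproduce. The sketch below assumes first-order kinetics with a made-up rate constant, fits a line on a short time window where the curve looks nearly straight, and then predicts far outside that window.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical first-order kinetics: conversion = 1 - exp(-k t), never above 1
k = 0.5                                   # 1/min, made-up rate constant
t_train = np.linspace(0, 2, 20)           # training range: 0 to 2 min
conv_train = 1 - np.exp(-k * t_train)

model = LinearRegression().fit(t_train.reshape(-1, 1), conv_train)

t_new = np.array([[1.0], [20.0]])         # inside and far outside the range
pred = model.predict(t_new)
truth = 1 - np.exp(-k * t_new.ravel())

for t, p, a in zip(t_new.ravel(), pred, truth):
    print(f"t = {t:4.1f} min  predicted = {p:.3f}  actual = {a:.3f}")
```

Inside the training range the line is a reasonable approximation. At 20 minutes it predicts a conversion well above 1, which is physically impossible, while the true curve has saturated.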

In [None]:
# Check for multicollinearity
correlation_matrix = df[feature_names].corr()

plt.figure(figsize=(8, 6))
plt.imshow(correlation_matrix, cmap='coolwarm', aspect='auto', vmin=-1, vmax=1)
plt.colorbar(label='Correlation')
plt.xticks(range(len(feature_names)), feature_names, rotation=45, ha='right')
plt.yticks(range(len(feature_names)), feature_names)
plt.title('Feature Correlation Matrix')

# Add correlation values
for i in range(len(feature_names)):
    for j in range(len(feature_names)):
        plt.text(j, i, f'{correlation_matrix.iloc[i, j]:.2f}', 
                 ha='center', va='center', fontsize=10)

plt.tight_layout()
plt.show()

## Summary: The Linear Regression Workflow

Linear regression is your starting point for predictive modeling. Here's the workflow:

### The Process

1. **Explore data** → Scatter plots, correlations
2. **Split data** → Train/test (never skip this!)
3. **Fit model** → `model.fit(X_train, y_train)`
4. **Evaluate** → R², RMSE on test set
5. **Diagnose** → Check residuals for problems
6. **Interpret** → What do coefficients mean?

### Key Decisions

| Decision | Guidance |
|----------|----------|
| Feature scaling? | Yes, if you want to compare coefficient importance |
| Test set size? | 20-30% is typical; smaller datasets might use cross-validation |
| Which metric? | R² for overall fit, RMSE for error in original units |
| Add polynomial terms? | Only if residuals show curvature |

### Common Mistakes to Avoid

1. **Evaluating on training data**: Always use a held-out test set
2. **Ignoring multicollinearity**: Correlated features make coefficients unstable
3. **Extrapolating**: Be cautious predicting outside the training data range
4. **Confusing correlation with causation**: Coefficients show association, not causation

### When to Move Beyond Linear Regression

- Residuals show clear nonlinear patterns
- R² is too low despite good features
- Domain knowledge suggests nonlinear relationships
- You need feature selection (→ Lasso, next module)

## Next Steps

In the next module, we'll learn about regularization (Ridge, Lasso) to prevent overfitting and cross-validation for more reliable model selection.