# Overfitting Analysis - Python Implementation

**Following the exact specifications from the class example**

This notebook demonstrates overfitting in high-dimensional linear models using:
- Data generating process: y = f_X + e, with f_X = exp(4 * X) - 1 and standard normal noise e (X is stored as `W` in the code)
- 1000 observations with up to 999 features
- Exact R-squared and Adjusted R-squared calculations as specified

## Import Required Libraries

Using the exact libraries as specified:

In [None]:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

## Data Generation - Following Class Example

Creating W and e exactly as shown in the class:

In [None]:
# Create W exactly as in the class
np.random.seed(42)  # Seed before the first draw so W and e are reproducible
W = np.random.uniform(0, 1, 1000)
W.sort()
W = W.reshape(-1, 1)
print(W)

In [None]:
# Create e exactly as in the class
e = np.random.normal(0,1,1000)
e = e.reshape(-1,1)
print(e)
e.shape

## Data Generating Process

Using the specified function: **f_X = exp(4 * X) - 1**

In [None]:
# Apply the data generating process
f_X = np.exp(4 * W) - 1
y = f_X + e

print("Data generation complete:")
print(f"W shape: {W.shape}")
print(f"f_X range: [{f_X.min():.4f}, {f_X.max():.4f}]")
print(f"y range: [{y.min():.4f}, {y.max():.4f}]")
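
Because `e` has variance 1, part of the variation in `y` is pure noise that no predictor can explain. The sketch below estimates the resulting ceiling on R-squared, Var(f_X) / (Var(f_X) + 1), once from a sample and once from the closed-form variance of exp(4W) for W uniform on [0, 1]. The seed and the names `rng`, `W_demo`, and `r2_ceiling` are assumptions for illustration and are not part of the class example.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for illustration only

# Same data generating process as above, on a fresh illustrative sample
W_demo = rng.uniform(0, 1, 1000)
f_demo = np.exp(4 * W_demo) - 1
noise_var = 1.0  # variance of e ~ N(0, 1)

# Share of the variance of y that the true function accounts for
signal_var = f_demo.var()
r2_ceiling = signal_var / (signal_var + noise_var)

# Closed-form variance of exp(4W) - 1 for W ~ Uniform(0, 1):
# E[exp(8W)] - (E[exp(4W)])^2
signal_var_exact = (np.exp(8) - 1) / 8 - ((np.exp(4) - 1) / 4) ** 2
r2_ceiling_exact = signal_var_exact / (signal_var_exact + noise_var)

print(f"Sample R-squared ceiling:     {r2_ceiling:.4f}")
print(f"Population R-squared ceiling: {r2_ceiling_exact:.4f}")
```

An out-of-sample R-squared well below this ceiling points to a model that is too simple; a training R-squared above it suggests the model is fitting noise.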

## Train-Test Split

Split into training and test sets:

In [None]:
# Split the data using train_test_split
W_train, W_test, y_train, y_test = train_test_split(W, y, test_size=0.25, random_state=42)

print(f"Training set: {len(W_train)} observations")
print(f"Test set: {len(W_test)} observations")

## Helper Function for Polynomial Features

In [None]:
def create_polynomial_features(X, degree):
    """
    Create polynomial features: X, X^2, X^3, ..., X^degree
    """
    n_samples = X.shape[0]
    X_poly = np.zeros((n_samples, degree))
    
    for i in range(degree):
        X_poly[:, i] = X.ravel() ** (i + 1)
    
    return X_poly
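
As a sanity check, the helper can be compared against two library routines that should produce the same columns: scikit-learn's `PolynomialFeatures` with `include_bias=False`, and `np.vander` with the constant column dropped. This is a minimal sketch on a small made-up input (`X_demo` is hypothetical); the helper is repeated only so the block runs on its own.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def create_polynomial_features(X, degree):
    """Columns X, X^2, ..., X^degree (same logic as the helper above)."""
    n_samples = X.shape[0]
    X_poly = np.zeros((n_samples, degree))
    for i in range(degree):
        X_poly[:, i] = X.ravel() ** (i + 1)
    return X_poly

X_demo = np.array([[0.5], [1.0], [2.0]])  # illustrative input
manual = create_polynomial_features(X_demo, 3)

# include_bias=False drops the constant column so the layouts match
sk_poly = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X_demo)

# increasing=True yields 1, x, x^2, x^3; drop the leading column of ones
vander = np.vander(X_demo.ravel(), N=4, increasing=True)[:, 1:]

print(np.allclose(manual, sk_poly), np.allclose(manual, vander))
```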

## Fit Models with 1, 2, and 3 Features

Following the exact example format provided:

In [None]:
# Create polynomial features for predictors 1, 2, and 3
W_train_1 = create_polynomial_features(W_train, 1)  # Just W
W_train_2 = create_polynomial_features(W_train, 2)  # W, W^2
W_train_3 = create_polynomial_features(W_train, 3)  # W, W^2, W^3

W_test_1 = create_polynomial_features(W_test, 1)
W_test_2 = create_polynomial_features(W_test, 2)
W_test_3 = create_polynomial_features(W_test, 3)

print("Feature matrices created:")
print(f"Predictor 1 - Train: {W_train_1.shape}, Test: {W_test_1.shape}")
print(f"Predictor 2 - Train: {W_train_2.shape}, Test: {W_test_2.shape}")
print(f"Predictor 3 - Train: {W_train_3.shape}, Test: {W_test_3.shape}")

In [None]:
# Fit the three models
model_1 = LinearRegression(fit_intercept=False)
model_2 = LinearRegression(fit_intercept=False)
model_3 = LinearRegression(fit_intercept=False)

model_1.fit(W_train_1, y_train.ravel())
model_2.fit(W_train_2, y_train.ravel())
model_3.fit(W_train_3, y_train.ravel())

print("Models fitted successfully")

In [None]:
# Make predictions on test set
y_1_test = model_1.predict(W_test_1)
y_2_test = model_2.predict(W_test_2)
y_3_test = model_3.predict(W_test_3)

print("Predictions made on test set")
print(f"y_1_test shape: {y_1_test.shape}")
print(f"y_2_test shape: {y_2_test.shape}")
print(f"y_3_test shape: {y_3_test.shape}")

In [None]:
# Calculate MSE for each model
mse_1_test = np.mean((y_test.ravel() - y_1_test) ** 2)
mse_2_test = np.mean((y_test.ravel() - y_2_test) ** 2)
mse_3_test = np.mean((y_test.ravel() - y_3_test) ** 2)

print(f"MSE 1: {mse_1_test:.6f}")
print(f"MSE 2: {mse_2_test:.6f}")
print(f"MSE 3: {mse_3_test:.6f}")
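
The three fits above can also be written as a loop over the polynomial degree, which makes it easy to track training and test MSE side by side. The sketch below regenerates comparable data with an assumed seed and extends the range to degree 6; the helper `poly` and the `_demo` names are illustrative, not part of the class example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)  # assumed seed
W_demo = np.sort(rng.uniform(0, 1, 1000)).reshape(-1, 1)
y_demo = np.exp(4 * W_demo) - 1 + rng.normal(0, 1, (1000, 1))

W_tr, W_te, y_tr, y_te = train_test_split(
    W_demo, y_demo, test_size=0.25, random_state=42
)

def poly(X, degree):
    """Columns X, X^2, ..., X^degree."""
    return np.hstack([X ** k for k in range(1, degree + 1)])

train_mse, test_mse = [], []
for degree in range(1, 7):
    model = LinearRegression(fit_intercept=False)
    model.fit(poly(W_tr, degree), y_tr.ravel())
    train_mse.append(np.mean((y_tr.ravel() - model.predict(poly(W_tr, degree))) ** 2))
    test_mse.append(np.mean((y_te.ravel() - model.predict(poly(W_te, degree))) ** 2))
    print(f"degree {degree}: train MSE {train_mse[-1]:.4f}, test MSE {test_mse[-1]:.4f}")
```

Because each model nests the previous one, training MSE cannot increase with the degree; only the test MSE shows whether the extra terms help out of sample.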

## R-squared Calculations

Using the exact formula provided in the specifications:

In [None]:
# R-squared calculations using the exact formula provided
R_sq_1_test = 1 - mse_1_test / np.mean((y_test.ravel()-np.mean(y_test.ravel()))** 2)
R_sq_2_test = 1 - mse_2_test / np.mean((y_test.ravel()-np.mean(y_test.ravel()))** 2)
R_sq_3_test = 1 - mse_3_test / np.mean((y_test.ravel()-np.mean(y_test.ravel()))** 2)

print(f'R-squared out of sample of predictor 1 is {R_sq_1_test}')
print(f'R-squared out of sample of predictor 2 is {R_sq_2_test}')
print(f'R-squared out of sample of predictor 3 is {R_sq_3_test}')
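
The formula used above, 1 - MSE / mean((y - mean(y))^2), can be wrapped in a small function and checked against scikit-learn's `r2_score`, which computes the same quantity. This is a sketch on made-up numbers; `r2_out_of_sample` and the demo arrays are hypothetical names.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_out_of_sample(y_true, y_pred):
    """1 - MSE / mean squared deviation of y_true from its own mean."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    mse = np.mean((y_true - y_pred) ** 2)
    return 1 - mse / np.mean((y_true - y_true.mean()) ** 2)

y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.1, 1.9, 3.2, 3.8])

manual_r2 = r2_out_of_sample(y_true_demo, y_pred_demo)
sklearn_r2 = r2_score(y_true_demo, y_pred_demo)
print(manual_r2, sklearn_r2)
```

Here the MSE is 0.025 and the mean squared deviation is 1.25, so both routes give 1 - 0.02 = 0.98.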

## Adjusted R-squared Calculations

Using the exact formula provided in the specifications:

In [None]:
# Adjusted R-squared calculations using the exact formula provided
adj_mse_1_test = 250 / (250-1) * mse_1_test  # 250 is the test-set size (test_size=0.25 of 1000 observations)
adjR_sq_1_test = 1 - adj_mse_1_test / np.mean((y_test.ravel()-np.mean(y_test.ravel()))** 2)

adj_mse_2_test = 250 / (250-2) * mse_2_test
adjR_sq_2_test = 1 - adj_mse_2_test / np.mean((y_test.ravel()-np.mean(y_test.ravel()))** 2)

adj_mse_3_test = 250 / (250-3) * mse_3_test
adjR_sq_3_test = 1 - adj_mse_3_test / np.mean((y_test.ravel()-np.mean(y_test.ravel()))** 2)

print(f'Adjusted R-squared out of sample of predictor 1 is {adjR_sq_1_test}')
print(f'Adjusted R-squared out of sample of predictor 2 is {adjR_sq_2_test}')
print(f'Adjusted R-squared out of sample of predictor 3 is {adjR_sq_3_test}')
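
The adjustment above scales the MSE by n / (n - p), where n is the number of observations and p the number of features. A possible way to avoid hard-coding 250 is a helper that reads n from the data and refuses to compute the factor when n <= p, where it is undefined. The function name `adjusted_r2` and the demo values are assumptions for illustration.

```python
import numpy as np

def adjusted_r2(y_true, y_pred, n_features):
    """1 - [n / (n - p)] * MSE / Var(y), the adjustment used above.

    Returns NaN when n <= p, where the factor n / (n - p) is undefined.
    """
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    n = y_true.size
    if n <= n_features:
        return float("nan")
    mse = np.mean((y_true - y_pred) ** 2)
    total_var = np.mean((y_true - y_true.mean()) ** 2)
    return 1 - (n / (n - n_features)) * mse / total_var

y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.1, 1.9, 3.2, 3.8])

for p in (1, 2, 4):
    print(p, adjusted_r2(y_true_demo, y_pred_demo, p))
```

With n = 4 and an unadjusted R-squared of 0.98, two features double the MSE term (factor 4 / 2), giving 0.96, while four features leave no degrees of freedom and return NaN.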

## High-Dimensional Analysis: 999 Features for 1000 Observations

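Demonstrating overfitting with a matrix of 999 features. After the 75/25 split only 750 observations remain for training, so the features outnumber the training rows: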

In [None]:
# Generate a matrix of 999 features for 1000 observations
np.random.seed(42)  # For reproducibility

# Create high-dimensional feature matrix
n_features_high = 999
n_observations = 1000

# Use polynomial features of W for the first 50 features, then random features
W_high_dim = np.zeros((n_observations, n_features_high))

# First 50 features are polynomial features of W
W_poly_50 = create_polynomial_features(W, 50)
W_high_dim[:, :50] = W_poly_50

# Remaining features are random
W_high_dim[:, 50:] = np.random.normal(0, 1, (n_observations, n_features_high - 50))

print(f"High-dimensional feature matrix shape: {W_high_dim.shape}")
print(f"Number of features: {n_features_high}")
print(f"Number of observations: {n_observations}")
print(f"Ratio of features to observations: {n_features_high/n_observations:.3f}")

In [None]:
# Split high-dimensional data
W_train_hd, W_test_hd, y_train_hd, y_test_hd = train_test_split(
    W_high_dim, y, test_size=0.25, random_state=42
)

print(f"High-dim training set: {W_train_hd.shape}")
print(f"High-dim test set: {W_test_hd.shape}")
print(f"Features to training observations ratio: {W_train_hd.shape[1]/W_train_hd.shape[0]:.3f}")

In [None]:
# Fit high-dimensional model
try:
    model_hd = LinearRegression(fit_intercept=False)
    model_hd.fit(W_train_hd, y_train_hd.ravel())
    
    # Predictions
    y_train_pred_hd = model_hd.predict(W_train_hd)
    y_test_pred_hd = model_hd.predict(W_test_hd)
    
    # Calculate MSE
    mse_train_hd = np.mean((y_train_hd.ravel() - y_train_pred_hd) ** 2)
    mse_test_hd = np.mean((y_test_hd.ravel() - y_test_pred_hd) ** 2)
    
    # Calculate R-squared
    R_sq_train_hd = 1 - mse_train_hd / np.mean((y_train_hd.ravel() - np.mean(y_train_hd.ravel())) ** 2)
    R_sq_test_hd = 1 - mse_test_hd / np.mean((y_test_hd.ravel() - np.mean(y_test_hd.ravel())) ** 2)
    
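    # Calculate Adjusted R-squared for test set.
    # The factor n / (n - p) is only defined for n > p. Here n_test = 250 and
    # p = 999, so the adjustment is reported as NaN rather than computed
    # with a negative factor.
    n_test = len(y_test_hd)
    n_features = W_test_hd.shape[1]
    if n_test > n_features:
        adj_mse_test_hd = n_test / (n_test - n_features) * mse_test_hd
        adjR_sq_test_hd = 1 - adj_mse_test_hd / np.mean((y_test_hd.ravel() - np.mean(y_test_hd.ravel())) ** 2)
    else:
        adjR_sq_test_hd = np.nan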
    
    print("High-Dimensional Model Results (999 features):")
    print("=" * 50)
    print(f"Training R-squared: {R_sq_train_hd:.6f}")
    print(f"Test R-squared: {R_sq_test_hd:.6f}")
    print(f"Test Adjusted R-squared: {adjR_sq_test_hd:.6f}")
    print(f"Training MSE: {mse_train_hd:.6f}")
    print(f"Test MSE: {mse_test_hd:.6f}")
    print(f"Overfitting Gap (Train - Test R²): {R_sq_train_hd - R_sq_test_hd:.6f}")
    
except Exception as err:  # 'err' avoids clobbering the noise array e
    print(f"Error with high-dimensional model: {err}")
    print("This demonstrates the challenge of fitting models with too many features.")
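
Why the training fit becomes perfect can be seen in a smaller setting: once the features outnumber the training rows, least squares can reproduce any target exactly, even one that is pure noise. The sketch below uses hypothetical sizes (50 training rows, 60 features) and an assumed seed, with features and target drawn independently so there is nothing real to learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # assumed seed
n_train, n_test, p = 50, 200, 60  # more features than training rows

# Features and target are independent noise: there is nothing to learn
X_train_demo = rng.normal(size=(n_train, p))
X_test_demo = rng.normal(size=(n_test, p))
y_train_demo = rng.normal(size=n_train)
y_test_demo = rng.normal(size=n_test)

model_demo = LinearRegression(fit_intercept=False)
model_demo.fit(X_train_demo, y_train_demo)

def r2(y_true, y_pred):
    """1 - MSE / mean squared deviation of y_true from its mean."""
    return 1 - np.mean((y_true - y_pred) ** 2) / np.mean((y_true - y_true.mean()) ** 2)

train_r2 = r2(y_train_demo, model_demo.predict(X_train_demo))
test_r2 = r2(y_test_demo, model_demo.predict(X_test_demo))
print(f"Train R-squared: {train_r2:.6f}")
print(f"Test R-squared:  {test_r2:.6f}")
```

The training R-squared is essentially 1 by construction, so it carries no information about predictive quality; only the test figure does.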

## Visualization of Results

In [None]:
# Plot comparison of different models
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Plot 1: True function vs data
axes[0].scatter(W, y, alpha=0.5, s=20, c='lightblue', label='Observed data')
W_sorted = np.sort(W, axis=0)
f_sorted = np.exp(4 * W_sorted) - 1
axes[0].plot(W_sorted, f_sorted, 'r-', linewidth=2, label='True function: exp(4W) - 1')
axes[0].set_xlabel('W')
axes[0].set_ylabel('y')
axes[0].set_title('Data Generating Process')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: R-squared comparison
models = ['1 feature', '2 features', '3 features']
r2_values = [R_sq_1_test, R_sq_2_test, R_sq_3_test]
adj_r2_values = [adjR_sq_1_test, adjR_sq_2_test, adjR_sq_3_test]

x_pos = np.arange(len(models))
width = 0.35

axes[1].bar(x_pos - width/2, r2_values, width, label='R-squared', alpha=0.8)
axes[1].bar(x_pos + width/2, adj_r2_values, width, label='Adjusted R-squared', alpha=0.8)
axes[1].set_xlabel('Model Complexity')
axes[1].set_ylabel('R-squared Value')
axes[1].set_title('R-squared vs Adjusted R-squared')
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(models)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Summary of Overfitting Analysis

### Key Findings:

1. **Polynomial Features**: As we increase from 1 to 3 polynomial features, both R-squared and Adjusted R-squared improve, showing that the polynomial features help capture the exponential relationship.

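2. **High-Dimensional Problem**: With 999 features for 1000 observations (750 of them used for training, so features outnumber training rows), we demonstrate extreme overfitting where:
   - Training R-squared approaches 1.0 (perfect fit)
   - Test R-squared is much lower (poor generalization)
   - Adjusted R-squared is no longer interpretable, because the factor n / (n - p) is only defined when the number of observations n exceeds the number of features p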

3. **Overfitting Indicators**:
   - Large gap between training and test performance
   - Perfect or near-perfect training fit with poor test performance
   - Negative or degraded adjusted R-squared values

### Practical Implications:

- **Model Selection**: Always validate models on out-of-sample data
- **Feature Selection**: More features aren't always better
- **Regularization**: Consider Ridge/Lasso regression for high-dimensional problems
- **Cross-Validation**: Use proper validation techniques to assess generalization
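
The regularization and cross-validation points can be illustrated together. The sketch below compares an unpenalized least-squares fit with `RidgeCV`, which chooses its penalty by cross-validation, on synthetic data with more features than training rows. The sizes, seed, coefficient vector, and alpha grid are all assumptions for illustration and do not come from the class example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV

rng = np.random.default_rng(0)  # assumed seed
n_train, n_test, p = 100, 1000, 110  # hypothetical sizes with p > n_train

# Only the first five features carry signal; the rest are noise
beta = np.zeros(p)
beta[:5] = 1.0
X_train_demo = rng.normal(size=(n_train, p))
X_test_demo = rng.normal(size=(n_test, p))
y_train_demo = X_train_demo @ beta + rng.normal(size=n_train)
y_test_demo = X_test_demo @ beta + rng.normal(size=n_test)

# Unpenalized fit: interpolates the training data when p > n_train
ols = LinearRegression(fit_intercept=False).fit(X_train_demo, y_train_demo)

# RidgeCV selects alpha from the grid by leave-one-out cross-validation
alphas = np.logspace(-2, 3, 20)
ridge = RidgeCV(alphas=alphas, fit_intercept=False).fit(X_train_demo, y_train_demo)

ols_mse = np.mean((y_test_demo - ols.predict(X_test_demo)) ** 2)
ridge_mse = np.mean((y_test_demo - ridge.predict(X_test_demo)) ** 2)
print(f"OLS test MSE:   {ols_mse:.3f}")
print(f"Ridge test MSE: {ridge_mse:.3f} (alpha = {ridge.alpha_:.3g})")
```

The penalty is selected using the training data only, so the test set stays untouched until the final comparison.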

The exponential data generating process **f_X = exp(4*X) - 1** creates a challenging nonlinear relationship that polynomial approximations try to capture, but adding too many features leads to overfitting rather than better approximation.