# Overfitting Analysis - High Dimensional Linear Models

This notebook demonstrates overfitting in high-dimensional linear models on simulated data.

**Data Generating Process:** y = f_X + e, with f_X = exp(4 * X) - 1, where X (named W in the code) is uniform on [0, 1] and e is standard normal noise.

We analyze how adding polynomial features leads to overfitting by examining R-squared and Adjusted R-squared on a held-out test set.

## Import Libraries

The notebook uses NumPy, Matplotlib, and scikit-learn.

In [None]:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries imported successfully")

## Data Generation

Following the specifications from the class:
- Generate 1000 draws of W from the uniform distribution on [0, 1] and sort them
- Generate the error term e from the normal distribution N(0, 1)
- Build the outcome as y = f_X + e, with f_X = exp(4 * W) - 1
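Because the noise has variance 1, no model can explain all of the variance in y. As a rough sanity check (a sketch that is not part of the original exercise), the population variance of f_X for W uniform on [0, 1] can be computed in closed form, which gives an approximate ceiling on the achievable R-squared. The names `var_f` and `r2_ceiling` are illustrative.

```python
import numpy as np

# For W ~ Uniform(0, 1) and f(W) = exp(4 W) - 1:
#   E[exp(4 W)] = (e^4 - 1) / 4
#   E[exp(8 W)] = (e^8 - 1) / 8
mean_exp = (np.exp(4) - 1) / 4
mean_exp_sq = (np.exp(8) - 1) / 8

# Var(f) = Var(exp(4 W)); subtracting the constant 1 does not change the variance
var_f = mean_exp_sq - mean_exp ** 2

# With independent noise of variance 1, Var(y) = Var(f) + 1
noise_var = 1.0
r2_ceiling = var_f / (var_f + noise_var)

print(f"Var(f_X) = {var_f:.2f}")
print(f"Approximate R-squared ceiling = {r2_ceiling:.4f}")
```

The signal variance is far larger than the noise variance, so test R-squared values close to 1 are plausible for a model that approximates f_X well.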

In [None]:
# Generate W as specified
W = np.random.uniform(0, 1, 1000)
W.sort()
W = W.reshape(-1, 1)
print("W shape:", W.shape)
print("W range: [{:.4f}, {:.4f}]".format(W.min(), W.max()))
print("W:")
print(W[:10])  # Print first 10 values

In [None]:
# Generate error term e as specified
e = np.random.normal(0, 1, 1000)
e = e.reshape(-1, 1)
print("e shape:", e.shape)
print("e:")
print(e[:10])  # Print first 10 values

In [None]:
# Generate y using the specified function: f_X = exp(4 * X) - 1
f_X = np.exp(4 * W) - 1
y = f_X + e

print("f_X shape:", f_X.shape)
print("y shape:", y.shape)
print("f_X range: [{:.4f}, {:.4f}]".format(f_X.min(), f_X.max()))
print("y range: [{:.4f}, {:.4f}]".format(y.min(), y.max()))

## Visualize the Data Generating Process

In [None]:
# Plot the true relationship and data
plt.figure(figsize=(12, 5))

# Plot 1: True function
plt.subplot(1, 2, 1)
plt.plot(W, f_X, 'b-', linewidth=2, label='f_X = exp(4*X) - 1')
plt.xlabel('X (W)')
plt.ylabel('f_X')
plt.title('True Data Generating Function')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 2: Data with noise
plt.subplot(1, 2, 2)
plt.scatter(W, y, alpha=0.6, s=10, c='red', label='y = f_X + e')
plt.plot(W, f_X, 'b-', linewidth=2, label='True function')
plt.xlabel('X (W)')
plt.ylabel('y')
plt.title('Observed Data with Noise')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Train-Test Split

Split the data into training (750 observations) and test (250 observations) sets.
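W was sorted before the split. `train_test_split` shuffles rows by default, so both subsets still span the whole [0, 1] range, and the test set is not simply the 250 largest values. A minimal sketch on a sorted stand-in array (the names `x`, `x_train`, and `x_test` are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Sorted stand-in for W: 0, 1, ..., 999
x = np.arange(1000).reshape(-1, 1)

# Same settings as the notebook: 25% test, fixed seed, shuffling on by default
x_train, x_test = train_test_split(x, test_size=0.25, random_state=42)

# After shuffling, the training rows are no longer in increasing order
train_is_sorted = bool(np.all(np.diff(x_train.ravel()) > 0))

print(f"Train size: {len(x_train)}, test size: {len(x_test)}")
print(f"Training rows still sorted: {train_is_sorted}")
print(f"Test range: [{x_test.min()}, {x_test.max()}]")
```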

In [None]:
# Split data into train and test sets
W_train, W_test, y_train, y_test = train_test_split(
    W, y, test_size=0.25, random_state=42
)

print(f"Training set size: {len(W_train)}")
print(f"Test set size: {len(W_test)}")
print(f"Training W range: [{W_train.min():.4f}, {W_train.max():.4f}]")
print(f"Test W range: [{W_test.min():.4f}, {W_test.max():.4f}]")

## Helper Functions for Polynomial Features and Metrics

In [None]:
def create_polynomial_features(X, n_features):
    """
    Create polynomial features X, X^2, X^3, ..., X^n_features
    """
    n_samples = X.shape[0]
    X_poly = np.zeros((n_samples, n_features))
    
    for i in range(n_features):
        X_poly[:, i] = (X.ravel()) ** (i + 1)
    
    return X_poly

def calculate_mse(y_true, y_pred):
    """
    Calculate Mean Squared Error
    """
    return np.mean((y_true - y_pred) ** 2)

print("Helper functions defined")

## Overfitting Analysis

We'll fit linear models with an increasing number of polynomial features (1, 2, 3, 5, 10, 20, 50, 100, 200, 500) to demonstrate overfitting. The 999-feature case is treated separately at the end of the notebook.

For each model, we'll calculate:
1. R-squared on test data
2. Adjusted R-squared on test data
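
Written out, the quantities computed in the code are as follows. The notation is introduced here for illustration: $n$ is the number of test observations, $p$ the number of features, $\hat{y}_i$ the prediction, and $\bar{y}$ the test-set mean.

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
R^2 = 1 - \frac{\mathrm{MSE}}{\frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad
R^2_{\mathrm{adj}} = 1 - \frac{\frac{n}{n-p}\,\mathrm{MSE}}{\frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

The factor $n/(n-p)$ is positive only for $p < n$, so the adjusted version is meaningful only while the number of features stays below the number of test observations.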

Using the exact formulas provided.
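
One practical caveat, offered as a side note rather than part of the original exercise: powers of a variable on [0, 1] are strongly correlated, so the feature matrix becomes ill-conditioned as the degree grows and high-degree coefficient estimates are numerically fragile. A small sketch follows; the helper `monomial_features` is hypothetical and mirrors `create_polynomial_features`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))

def monomial_features(x, degree):
    """Columns x, x^2, ..., x^degree (no constant column)."""
    return np.column_stack([x ** k for k in range(1, degree + 1)])

# Condition number of the feature matrix for a few degrees
cond_numbers = {d: np.linalg.cond(monomial_features(x, d)) for d in (3, 10, 20)}

for d, c in cond_numbers.items():
    print(f"degree {d:2d}: condition number ~ {c:.2e}")
```

A rapidly growing condition number means that small changes in the data can move the fitted coefficients a lot, which is one reason test performance becomes erratic at high degrees.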

In [None]:
# Define the number of features to test
# We'll test models with 1, 2, 3, 5, 10, 20, 50, 100, 200, 500 features
# (avoiding 999 initially to prevent computational issues)
feature_counts = [1, 2, 3, 5, 10, 20, 50, 100, 200, 500]

# Storage for results
results = []

print("Starting overfitting analysis...")
print("Features | R² (test) | Adj R² (test) | MSE (test)")
print("-" * 50)

for n_features in feature_counts:
    # Skip if we don't have enough training samples
    if n_features >= len(W_train):
        print(f"{n_features:8d} | Skipped (too many features for training set)")
        continue
    
    try:
        # Create polynomial features for training and test sets
        W_train_poly = create_polynomial_features(W_train, n_features)
        W_test_poly = create_polynomial_features(W_test, n_features)
        
        # Fit linear regression model (no intercept as specified)
        model = LinearRegression(fit_intercept=False)
        model.fit(W_train_poly, y_train.ravel())
        
        # Make predictions on test set
        y_test_pred = model.predict(W_test_poly)
        
        # Calculate MSE on test set
        mse_test = calculate_mse(y_test.ravel(), y_test_pred)
        
        # Calculate R-squared on test set using the specified formula
        R_sq_test = 1 - mse_test / np.mean((y_test.ravel() - np.mean(y_test.ravel())) ** 2)
        
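        # Calculate Adjusted R-squared on test set using the specified formula.
        # The n / (n - p) correction is only defined for p < n; with p >= n the
        # denominator is zero or negative, so report NaN instead.
        if n_features < len(W_test):
            adj_mse_test = len(W_test) / (len(W_test) - n_features) * mse_test
            adjR_sq_test = 1 - adj_mse_test / np.mean((y_test.ravel() - np.mean(y_test.ravel())) ** 2)
        else:
            adjR_sq_test = np.nan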
        
        # Store results
        results.append({
            'n_features': n_features,
            'r2_test': R_sq_test,
            'adj_r2_test': adjR_sq_test,
            'mse_test': mse_test
        })
        
        print(f"{n_features:8d} | {R_sq_test:9.4f} | {adjR_sq_test:10.4f} | {mse_test:10.4f}")
        
    except Exception as exc:  # use `exc` so the noise array `e` is not rebound and deleted
        print(f"{n_features:8d} | Error: {str(exc)[:30]}...")

print(f"\nCompleted analysis for {len(results)} models")

## Example with First Three Predictors

Let's demonstrate the exact calculation method for the first three predictors as shown in the example.

In [None]:
# Fit models with 1, 2, and 3 features specifically
models_123 = {}
predictions_123 = {}
mse_123 = {}

for i in [1, 2, 3]:
    # Create polynomial features
    W_train_poly = create_polynomial_features(W_train, i)
    W_test_poly = create_polynomial_features(W_test, i)
    
    # Fit model
    model = LinearRegression(fit_intercept=False)
    model.fit(W_train_poly, y_train.ravel())
    
    # Predict on test set
    y_pred = model.predict(W_test_poly)
    
    # Calculate MSE
    mse = calculate_mse(y_test.ravel(), y_pred)
    
    # Store results
    models_123[i] = model
    predictions_123[i] = y_pred
    mse_123[i] = mse

# Extract MSE values
mse_1_test = mse_123[1]
mse_2_test = mse_123[2]
mse_3_test = mse_123[3]

# Extract predictions
y_1_test = predictions_123[1]
y_2_test = predictions_123[2]
y_3_test = predictions_123[3]

print(f"MSE for 1 feature: {mse_1_test:.6f}")
print(f"MSE for 2 features: {mse_2_test:.6f}")
print(f"MSE for 3 features: {mse_3_test:.6f}")

## R-squared Calculations Using Exact Formula

Using the exact formulas provided in the problem statement.

In [None]:
# R-squared calculations using the exact formula provided
R_sq_1_test = 1 - mse_1_test / np.mean((y_test.ravel() - np.mean(y_test.ravel()))** 2)
R_sq_2_test = 1 - mse_2_test / np.mean((y_test.ravel() - np.mean(y_test.ravel()))** 2)
R_sq_3_test = 1 - mse_3_test / np.mean((y_test.ravel() - np.mean(y_test.ravel()))** 2)

print(f'R-squared out of sample of predictor 1 is {R_sq_1_test:.6f}')
print(f'R-squared out of sample of predictor 2 is {R_sq_2_test:.6f}')
print(f'R-squared out of sample of predictor 3 is {R_sq_3_test:.6f}')

## Adjusted R-squared Calculations Using Exact Formula

Using the exact formulas provided in the problem statement.
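
The three computations below repeat the same pattern, so they could also be wrapped in one helper. This is a minimal sketch under the formulas used in this notebook; the function name `r2_and_adjusted_r2` is hypothetical, and returning NaN when the number of features reaches the sample size is an assumption made here, not something the problem statement specifies.

```python
import numpy as np

def r2_and_adjusted_r2(y_true, y_pred, n_features):
    """Return (R-squared, adjusted R-squared) using the MSE-based formulas."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    n = y_true.size

    mse = np.mean((y_true - y_pred) ** 2)
    var = np.mean((y_true - y_true.mean()) ** 2)
    r2 = 1 - mse / var

    # The n / (n - p) correction is undefined for p >= n (assumption: return NaN)
    if n_features >= n:
        return r2, float("nan")

    adj_r2 = 1 - (n / (n - n_features)) * mse / var
    return r2, adj_r2

# Tiny worked example: one prediction is off by 1
r2_demo, adj_r2_demo = r2_and_adjusted_r2([1, 2, 3, 4], [1, 2, 3, 5], n_features=2)
print(f"R-squared: {r2_demo:.2f}, adjusted R-squared: {adj_r2_demo:.2f}")
```

With the notebook's variables, a call such as `r2_and_adjusted_r2(y_test, y_1_test, 1)` would reproduce the first pair of values.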

In [None]:
# Adjusted R-squared calculations using the exact formula provided
# Note: Using 250 as the test set size as shown in the example
test_size = len(y_test)
print(f"Test set size: {test_size}")

adj_mse_1_test = test_size / (test_size - 1) * mse_1_test  # 1 feature
adjR_sq_1_test = 1 - adj_mse_1_test / np.mean((y_test.ravel() - np.mean(y_test.ravel()))** 2)

adj_mse_2_test = test_size / (test_size - 2) * mse_2_test  # 2 features
adjR_sq_2_test = 1 - adj_mse_2_test / np.mean((y_test.ravel() - np.mean(y_test.ravel()))** 2)

adj_mse_3_test = test_size / (test_size - 3) * mse_3_test  # 3 features
adjR_sq_3_test = 1 - adj_mse_3_test / np.mean((y_test.ravel() - np.mean(y_test.ravel()))** 2)

print(f'Adjusted R-squared out of sample of predictor 1 is {adjR_sq_1_test:.6f}')
print(f'Adjusted R-squared out of sample of predictor 2 is {adjR_sq_2_test:.6f}')
print(f'Adjusted R-squared out of sample of predictor 3 is {adjR_sq_3_test:.6f}')

## Visualization of Overfitting

In [None]:
# Extract data for plotting
if len(results) > 0:
    features = [r['n_features'] for r in results]
    r2_values = [r['r2_test'] for r in results]
    adj_r2_values = [r['adj_r2_test'] for r in results]
    
    # Create plots
    plt.figure(figsize=(15, 5))
    
    # Plot 1: R-squared vs number of features
    plt.subplot(1, 3, 1)
    plt.plot(features, r2_values, 'bo-', linewidth=2, markersize=6)
    plt.xlabel('Number of Features')
    plt.ylabel('R-squared (Test Set)')
    plt.title('Test Set R-squared vs Number of Features')
    plt.grid(True, alpha=0.3)
    plt.xscale('log')
    
    # Plot 2: Adjusted R-squared vs number of features
    plt.subplot(1, 3, 2)
    plt.plot(features, adj_r2_values, 'ro-', linewidth=2, markersize=6)
    plt.xlabel('Number of Features')
    plt.ylabel('Adjusted R-squared (Test Set)')
    plt.title('Test Set Adjusted R-squared vs Number of Features')
    plt.grid(True, alpha=0.3)
    plt.xscale('log')
    
    # Plot 3: Both metrics together
    plt.subplot(1, 3, 3)
    plt.plot(features, r2_values, 'bo-', linewidth=2, markersize=6, label='R-squared')
    plt.plot(features, adj_r2_values, 'ro-', linewidth=2, markersize=6, label='Adjusted R-squared')
    plt.xlabel('Number of Features')
    plt.ylabel('Test Set Performance')
    plt.title('Comparison of R-squared Metrics')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.xscale('log')
    
    plt.tight_layout()
    plt.show()
else:
    print("No results to plot")

## Analysis: Demonstrating Overfitting with High-Dimensional Features

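We build a matrix with 999 features for 1000 observations to demonstrate extreme overfitting: the first 50 columns are polynomial features of W and the remaining 949 are pure-noise columns unrelated to y. After the train-test split only 750 observations are left for fitting, so the model has more features than training observations.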
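
Why this is extreme can be seen on a smaller scale. When a least-squares model has at least as many (linearly independent) features as training observations, it can reproduce the training targets exactly, even if the targets are pure noise. The sketch below uses made-up sizes (50 observations, 80 features) and `np.linalg.lstsq` in place of `LinearRegression`; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, p = 50, 50, 80      # more features than training observations

X_train = rng.normal(size=(n_train, p))
X_test = rng.normal(size=(n_test, p))
y_train = rng.normal(size=n_train)   # target is pure noise, unrelated to X
y_test = rng.normal(size=n_test)

# Minimum-norm least-squares solution (no intercept)
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def r2(y_true, y_pred):
    """MSE-based R-squared, as used elsewhere in the notebook."""
    return 1 - np.mean((y_true - y_pred) ** 2) / np.mean((y_true - y_true.mean()) ** 2)

train_mse = np.mean((y_train - X_train @ beta) ** 2)
train_r2 = r2(y_train, X_train @ beta)
test_r2 = r2(y_test, X_test @ beta)

print(f"Training MSE: {train_mse:.2e}")
print(f"Training R-squared: {train_r2:.4f}")
print(f"Test R-squared: {test_r2:.4f}")
```

The training fit is essentially perfect although there is nothing to learn, which is why training R-squared alone says little in this regime.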

In [None]:
# Create a high-dimensional feature matrix (999 features for 1000 observations)
print("Creating high-dimensional feature matrix (999 features)...")

# Generate 999 random features for demonstration
np.random.seed(42)
n_obs = 1000
n_features_high = 999

# Create a matrix of random features
X_high_dim = np.random.normal(0, 1, (n_obs, n_features_high))

# Also include our original W as the first feature, and polynomial features of W
X_high_dim[:, :50] = create_polynomial_features(W, 50)  # First 50 features are polynomial features of W

print(f"High-dimensional feature matrix shape: {X_high_dim.shape}")
print(f"Ratio of features to observations: {n_features_high/n_obs:.3f}")

# Use our y variable (still based on f_X = exp(4*X) - 1)
y_high_dim = y.ravel()

print(f"Target variable shape: {y_high_dim.shape}")

In [None]:
# Split the high-dimensional data
X_train_hd, X_test_hd, y_train_hd, y_test_hd = train_test_split(
    X_high_dim, y_high_dim, test_size=0.25, random_state=42
)

print(f"High-dim training set: {X_train_hd.shape}")
print(f"High-dim test set: {X_test_hd.shape}")
print(f"Training observations: {len(y_train_hd)}")
print(f"Test observations: {len(y_test_hd)}")
print(f"Features: {X_train_hd.shape[1]}")
print(f"Ratio of features to training observations: {X_train_hd.shape[1]/len(y_train_hd):.3f}")

In [None]:
# Fit linear regression with 999 features
print("Fitting high-dimensional linear regression...")

try:
    # Fit the model (this will likely overfit severely)
    model_hd = LinearRegression(fit_intercept=False)
    model_hd.fit(X_train_hd, y_train_hd)
    
    # Make predictions
    y_train_pred_hd = model_hd.predict(X_train_hd)
    y_test_pred_hd = model_hd.predict(X_test_hd)
    
    # Calculate MSE
    mse_train_hd = calculate_mse(y_train_hd, y_train_pred_hd)
    mse_test_hd = calculate_mse(y_test_hd, y_test_pred_hd)
    
    # Calculate R-squared
    r2_train_hd = 1 - mse_train_hd / np.mean((y_train_hd - np.mean(y_train_hd)) ** 2)
    r2_test_hd = 1 - mse_test_hd / np.mean((y_test_hd - np.mean(y_test_hd)) ** 2)
    
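    # Calculate Adjusted R-squared
    n_train = len(y_train_hd)
    n_test = len(y_test_hd)
    n_features = X_train_hd.shape[1]
    
    # The n / (n - p) correction requires p < n. Here p = 999 exceeds n_test = 250,
    # so the formula would flip sign and report a value above 1; use NaN instead.
    if n_features < n_test:
        adj_mse_test_hd = n_test / (n_test - n_features) * mse_test_hd
        adj_r2_test_hd = 1 - adj_mse_test_hd / np.mean((y_test_hd - np.mean(y_test_hd)) ** 2)
    else:
        adj_r2_test_hd = np.nan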
    
    print("\nHigh-Dimensional Model Results (999 features):")
    print("=" * 50)
    print(f"Training R-squared: {r2_train_hd:.6f}")
    print(f"Test R-squared: {r2_test_hd:.6f}")
    print(f"Test Adjusted R-squared: {adj_r2_test_hd:.6f}")
    print(f"Training MSE: {mse_train_hd:.6f}")
    print(f"Test MSE: {mse_test_hd:.6f}")
    print(f"Overfitting Gap (Train R² - Test R²): {r2_train_hd - r2_test_hd:.6f}")
    
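except Exception as exc:  # use `exc` so the noise array `e` is not rebound and deleted
    print(f"Error fitting high-dimensional model: {exc}")
    print("LinearRegression normally still returns a solution when features outnumber observations, so this is unexpected.")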

## Summary and Conclusions

This analysis demonstrates the overfitting phenomenon in high-dimensional linear models:

### Key Findings:

1. **Training vs. Test Performance**: Adding features never increases the training error of a least-squares fit, so training R-squared keeps rising, while test performance eventually stalls or deteriorates.

2. **Adjusted R-squared**: Adjusted R-squared scales the test MSE by n / (n - p), so it is always at or below the plain test R-squared. Both metrics can become negative out of sample, which indicates that the model predicts worse than the test-set mean. The correction is undefined once the number of features p reaches the number of test observations n.

3. **High-Dimensional Problem**: With 999 features and 1000 observations (750 after the train-test split), the model has more parameters than training observations. It can therefore fit the training data essentially perfectly, which leads to severe overfitting.

### Overfitting Indicators:
- Large gap between training and test R-squared
- Declining test performance as model complexity increases
- Negative test R-squared or adjusted R-squared values

### Practical Implications:
- Always use proper validation techniques
- Consider regularization methods (Ridge, Lasso, Elastic Net)
- Monitor both training and validation performance
- Be cautious of models with many features relative to observations
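
As an illustration of the regularization point, ridge regression adds a penalty on the squared coefficient size, which tames the large coefficients that high-degree polynomial fits tend to produce. The sketch below uses the closed-form ridge solution in NumPy on freshly simulated data from the same data generating process. The sample size, degree, and penalty `lam` are arbitrary choices for illustration, not tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, degree, lam = 200, 10, 1e-3

# Same data generating process as the notebook
x = rng.uniform(0, 1, n)
y = np.exp(4 * x) - 1 + rng.normal(0, 1, n)
X = np.column_stack([x ** k for k in range(1, degree + 1)])

# Ordinary least squares (no intercept, matching the notebook)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge: solve (X'X + lam * I) beta = X'y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(degree), X.T @ y)

print(f"OLS coefficient norm:   {np.linalg.norm(beta_ols):.2f}")
print(f"Ridge coefficient norm: {np.linalg.norm(beta_ridge):.2f}")
```

In practice the penalty would be chosen by validation, for example with scikit-learn's `Ridge` or `RidgeCV`.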

The exponential data generating process (f_X = exp(4*X) - 1) creates a challenging nonlinear relationship that polynomial features attempt to approximate, but adding too many features leads to overfitting rather than better approximation.