# High-Dimensional Linear Models: Overfitting Simulation (Python)

This notebook demonstrates the overfitting phenomenon in high-dimensional linear models.
We'll generate data with a nonlinear relationship and fit linear models with increasing numbers of polynomial features.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

## Data Generating Process

We use the specified data generating process:
- X ~ Uniform(-0.5, 0.5)
- f_X = exp(4 * X) - 1
- y = f_X + ε, where ε ~ N(0, σ²) with σ = 0.5
- n = 1000 observations
- Intercept = 0, so the models below are fit without a constant term
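
As a brief aside on the zero intercept: expanding the exponential around X = 0 gives a power series with no constant term, which is consistent with fitting polynomials in X without a constant column. The expansion below is a standard Taylor series and is not part of the original specification.

$$
f(X) = e^{4X} - 1 = \sum_{k=1}^{\infty} \frac{(4X)^k}{k!} = 4X + 8X^2 + \frac{32}{3}X^3 + \cdots
$$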

In [None]:
# Parameters
n = 1000
noise_std = 0.5  # Standard deviation of the error term

# Generate X from uniform distribution
X = np.random.uniform(-0.5, 0.5, n)

# Data generating process: f_X = exp(4 * X) - 1
f_X = np.exp(4 * X) - 1

# Add noise to get Y
epsilon = np.random.normal(0, noise_std, n)
Y = f_X + epsilon

print(f"Generated {n} observations")
print(f"X range: [{X.min():.3f}, {X.max():.3f}]")
print(f"Y range: [{Y.min():.3f}, {Y.max():.3f}]")
print(f"True function range: [{f_X.min():.3f}, {f_X.max():.3f}]")

## Visualization of the True Relationship

In [None]:
# Plot the true relationship
plt.figure(figsize=(10, 6))
plt.scatter(X, Y, alpha=0.5, s=20, label='Observed data')
X_sorted = np.sort(X)
f_sorted = np.exp(4 * X_sorted) - 1
plt.plot(X_sorted, f_sorted, 'r-', linewidth=2, label='True function: exp(4X) - 1')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Data Generating Process')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## Function to Calculate R-squared Metrics

In [None]:
def calculate_adjusted_r2(r2, n, p):
    """
    Calculate adjusted R-squared
    
    Parameters:
    r2: R-squared value
    n: number of observations
    p: number of predictors (excluding intercept)
    """
    if n <= p + 1:
        return np.nan
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def fit_and_evaluate(X_train, y_train, X_test, y_test, n_features):
    """
    Fit polynomial regression model and calculate metrics
    
    Parameters:
    X_train, y_train: training data
    X_test, y_test: test data  
    n_features: polynomial degree; with one input and no bias column this equals the number of features
    
    Returns:
    dict with R-squared, adjusted R-squared, and out-of-sample R-squared
    """
    # Create polynomial features
    poly = PolynomialFeatures(degree=n_features, include_bias=False)
    X_train_poly = poly.fit_transform(X_train.reshape(-1, 1))
    X_test_poly = poly.transform(X_test.reshape(-1, 1))
    
    # Fit linear regression (without intercept as specified)
    model = LinearRegression(fit_intercept=False)
    model.fit(X_train_poly, y_train)
    
    # Predictions
    y_train_pred = model.predict(X_train_poly)
    y_test_pred = model.predict(X_test_poly)
    
    # Calculate metrics
    r2_train = r2_score(y_train, y_train_pred)
    r2_test = r2_score(y_test, y_test_pred)
    adj_r2 = calculate_adjusted_r2(r2_train, len(y_train), X_train_poly.shape[1])
    
    return {
        'r2': r2_train,
        'adj_r2': adj_r2,
        'r2_oos': r2_test,
        'n_params': X_train_poly.shape[1]
    }
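
To make the adjustment concrete, here is a small worked example of the same formula. The inputs (a training R² of 0.90 on 750 observations, with 10 and then 500 predictors) are hypothetical values chosen for illustration, not results from this simulation.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (intercept excluded)."""
    if n <= p + 1:
        return float("nan")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical inputs: identical training R-squared, very different model sizes
few = adjusted_r2(0.90, n=750, p=10)
many = adjusted_r2(0.90, n=750, p=500)

print(f"p = 10:  adjusted R² = {few:.4f}")
print(f"p = 500: adjusted R² = {many:.4f}")
```

With the fit held fixed, the penalty factor (n - 1) / (n - p - 1) grows from about 1.01 to about 3.01, so the adjusted value drops from roughly 0.899 to roughly 0.699.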

## Main Simulation Loop

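We'll test models with different numbers of polynomial features: 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000. Any count that reaches or exceeds the training-set size (750 observations after the 75/25 split) is skipped, so the 1000-feature model is not fit.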

In [None]:
# Split data into train (75%) and test (25%)
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.25, random_state=42
)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

# Number of features to test
feature_counts = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]

# Storage for results
results = []

print("\nRunning simulation...")
print("Features | R² (train) | Adj R² | R² (test) | Parameters")
print("-" * 60)

for n_features in feature_counts:
    # Skip models with at least as many features as training samples
    if n_features >= len(X_train):
        print(f"{n_features:8d} | Skipped (insufficient training data)")
        continue
        
    try:
        metrics = fit_and_evaluate(X_train, y_train, X_test, y_test, n_features)
        
        results.append({
            'n_features': n_features,
            'r2': metrics['r2'],
            'adj_r2': metrics['adj_r2'],
            'r2_oos': metrics['r2_oos'],
            'n_params': metrics['n_params']
        })
        
        print(f"{n_features:8d} | {metrics['r2']:9.4f} | {metrics['adj_r2']:6.4f} | {metrics['r2_oos']:8.4f} | {metrics['n_params']:9d}")
        
    except Exception as e:
        print(f"{n_features:8d} | Error: {str(e)[:30]}...")

# Convert to DataFrame for easier manipulation
results_df = pd.DataFrame(results)
print(f"\nCompleted simulation with {len(results_df)} successful models")

## Results Visualization

We create three side-by-side panels showing how the different R-squared metrics change with the number of features. The x-axis uses a log scale because the feature counts span several orders of magnitude.

In [None]:
# Create the three plots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot 1: R-squared (training)
axes[0].plot(results_df['n_features'], results_df['r2'], 'bo-', linewidth=2, markersize=6)
axes[0].set_xlabel('Number of Features')
axes[0].set_ylabel('R-squared (Training)')
axes[0].set_title('Training R-squared vs Number of Features')
axes[0].grid(True, alpha=0.3)
axes[0].set_xscale('log')
axes[0].set_ylim(0, 1)

# Plot 2: Adjusted R-squared
valid_adj = results_df.dropna(subset=['adj_r2'])
axes[1].plot(valid_adj['n_features'], valid_adj['adj_r2'], 'go-', linewidth=2, markersize=6)
axes[1].set_xlabel('Number of Features')
axes[1].set_ylabel('Adjusted R-squared')
axes[1].set_title('Adjusted R-squared vs Number of Features')
axes[1].grid(True, alpha=0.3)
axes[1].set_xscale('log')

# Plot 3: Out-of-sample R-squared
axes[2].plot(results_df['n_features'], results_df['r2_oos'], 'ro-', linewidth=2, markersize=6)
axes[2].set_xlabel('Number of Features')
axes[2].set_ylabel('Out-of-sample R-squared')
axes[2].set_title('Out-of-sample R-squared vs Number of Features')
axes[2].grid(True, alpha=0.3)
axes[2].set_xscale('log')

plt.tight_layout()
plt.show()

## Summary and Analysis

Let's examine the results and discuss the overfitting phenomenon.

In [None]:
# Display summary statistics
print("Summary of Results:")
print("=" * 80)
print(results_df.to_string(index=False, float_format='%.4f'))

# Find optimal number of features based on out-of-sample R²
best_oos_idx = results_df['r2_oos'].idxmax()
best_n_features = results_df.loc[best_oos_idx, 'n_features']
best_oos_r2 = results_df.loc[best_oos_idx, 'r2_oos']

print(f"\nOptimal number of features (based on out-of-sample R²): {best_n_features}")
print(f"Best out-of-sample R²: {best_oos_r2:.4f}")

# Calculate the difference between training and test R² to show overfitting
results_df['overfitting'] = results_df['r2'] - results_df['r2_oos']
print("\nOverfitting Analysis (Training R² - Test R²):")
print(results_df[['n_features', 'overfitting']].to_string(index=False, float_format='%.4f'))

## Interpretation and Conclusions

### Overfitting Demonstration

This simulation clearly demonstrates the overfitting phenomenon in high-dimensional linear models:

1. **Training R-squared** never decreases as we add more polynomial features, at least in exact arithmetic. Each larger model nests the smaller ones, so least squares can always reproduce the previous fit, and the training residual sum of squares cannot grow.

2. **Adjusted R-squared** initially increases but then starts to decrease as the penalty for additional parameters outweighs the improvement in fit. This metric tries to balance model fit with model complexity.

3. **Out-of-sample R-squared** typically increases initially as we capture more of the true nonlinear relationship, but then decreases as the model becomes too complex and starts fitting noise rather than signal.
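
The first point can be checked on a small scale. The sketch below is a simplified, NumPy-only version of the setup, assuming a smaller sample (200 observations), low degrees (1 to 6), and an arbitrary seed. It is not the notebook's pipeline, only an illustration of why nested least-squares fits cannot lose training R².

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
n = 200
x = rng.uniform(-0.5, 0.5, n)
y = np.exp(4 * x) - 1 + rng.normal(0, 0.5, n)

def train_r2(degree):
    """Training R-squared of a no-intercept polynomial fit of the given degree."""
    # Columns x, x^2, ..., x^degree (no constant column, matching the notebook)
    design = np.column_stack([x ** k for k in range(1, degree + 1)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

r2_by_degree = [train_r2(d) for d in range(1, 7)]
for d, r2 in enumerate(r2_by_degree, start=1):
    print(f"degree {d}: training R² = {r2:.4f}")

# Each model nests the previous one, so the fit should not get worse
# (a tiny tolerance absorbs floating-point round-off)
is_nondecreasing = all(
    b >= a - 1e-10 for a, b in zip(r2_by_degree, r2_by_degree[1:])
)
print("nondecreasing:", is_nondecreasing)
```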

### Key Insights:

- **Bias-Variance Tradeoff**: Simple models (few features) have high bias but low variance. Complex models (many features) have low bias but high variance.
- **Optimal Complexity**: There's an optimal number of features that maximizes out-of-sample performance.
- **Generalization**: Models that perform well on training data don't necessarily generalize well to new data.

### Practical Implications:

- Always use cross-validation or hold-out samples to evaluate model performance
- Consider regularization techniques (Ridge, Lasso) for high-dimensional problems
- Be cautious of models with a very high training R² but poor test performance
- The true data generating process is nonlinear (exponential), but we're using polynomial approximations
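
As one possible way to combine the first two recommendations, the sketch below picks a ridge penalty by 5-fold cross-validation. It is a hand-rolled NumPy illustration, not the notebook's method: the sample size (300), the degree (15), the penalty grid, and the helper names `poly_design`, `ridge_fit`, and `cv_mse` are all assumptions made for this example. In practice, scikit-learn's `Ridge` and `cross_val_score` could fill the same role.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
n = 300
x = rng.uniform(-0.5, 0.5, n)
y = np.exp(4 * x) - 1 + rng.normal(0, 0.5, n)

def poly_design(x, degree):
    """Columns (2x), (2x)^2, ..., (2x)^degree, each bounded by 1 in absolute value."""
    return np.column_stack([(2 * x) ** k for k in range(1, degree + 1)])

def ridge_fit(design, y, alpha):
    """Closed-form ridge solution (X'X + alpha * I)^(-1) X'y, without intercept."""
    p = design.shape[1]
    return np.linalg.solve(design.T @ design + alpha * np.eye(p), design.T @ y)

# Fix the folds once so every penalty is scored on the same splits
folds = np.array_split(rng.permutation(n), 5)

def cv_mse(degree, alpha):
    """Mean squared prediction error averaged over the held-out folds."""
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        coef = ridge_fit(poly_design(x[train], degree), y[train], alpha)
        pred = poly_design(x[fold], degree) @ coef
        errors.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errors))

alphas = [1e-3, 1e-1, 1.0, 10.0]  # hypothetical penalty grid
scores = {alpha: cv_mse(degree=15, alpha=alpha) for alpha in alphas}
best_alpha = min(scores, key=scores.get)

for alpha, mse in scores.items():
    print(f"alpha = {alpha:g}: CV MSE = {mse:.4f}")
print("selected alpha:", best_alpha)
```

Scoring every penalty on the same folds keeps the comparison fair, since differences in cross-validated error then reflect the penalty rather than the split.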