# High-Dimensional Linear Models: Overfitting Simulation (R)

This notebook demonstrates the overfitting phenomenon in high-dimensional linear models using R.
We'll generate data with a nonlinear relationship and fit linear models with increasing numbers of polynomial features.

In [None]:
# Load required libraries
library(ggplot2)
library(dplyr)
library(tidyr)
library(gridExtra)

# Set random seed for reproducibility
set.seed(42)
cat("Libraries loaded successfully\n")

## Data Generating Process

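We use the following data generating process:
- X ~ Uniform(-0.5, 0.5)
- f_X = exp(4 * X) - 1
- y = f_X + ε, where ε ~ N(0, σ²) with σ = 0.5
- n = 1000 observations
- Intercept = 0: f_X equals 0 at X = 0, so every model is fit without an intercept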
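
The zero intercept follows from the functional form. Expanding the exponential shows that the true function has no constant term and that its polynomial terms shrink quickly on the sampled range. The bound assumes |X| ≤ 0.5, as in the data-generation cell below.

$$
f(X) = e^{4X} - 1 = \sum_{k=1}^{\infty} \frac{(4X)^k}{k!},
\qquad
\left| \frac{(4X)^k}{k!} \right| \le \frac{2^k}{k!} \quad \text{for } |X| \le 0.5 .
$$

For example, the degree-10 term is at most 2^10 / 10! ≈ 2.8 × 10^-4, far below the noise standard deviation of 0.5. This suggests that features beyond roughly that degree have little remaining signal to capture.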

In [None]:
# Parameters
n <- 1000
noise_std <- 0.5  # Standard deviation of the error term

# Generate X from uniform distribution
X <- runif(n, min = -0.5, max = 0.5)

# Data generating process: f_X = exp(4 * X) - 1
f_X <- exp(4 * X) - 1

# Add noise to get Y
epsilon <- rnorm(n, mean = 0, sd = noise_std)
Y <- f_X + epsilon

cat("Generated", n, "observations\n")
cat("X range: [", min(X), ",", max(X), "]\n")
cat("Y range: [", min(Y), ",", max(Y), "]\n")
cat("True function range: [", min(f_X), ",", max(f_X), "]\n")
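
It can help to know how much of the variance in Y is explainable at all. The sketch below computes the population R-squared of the true function, which is the most any model can reach in expectation. It assumes X ~ Uniform(-0.5, 0.5) and a noise standard deviation of 0.5, matching the cell above. The names `sigma`, `signal_var`, and `r2_ceiling` are illustrative and not part of the notebook.

```r
# Sketch: theoretical ceiling on R-squared for this data generating process.
# Assumes X ~ Uniform(-0.5, 0.5) and noise sd 0.5 (illustrative names throughout).
sigma <- 0.5

# Moments of exp(4X) under Uniform(-0.5, 0.5), by numerical integration.
# The uniform density equals 1 on this interval, so no extra weight is needed.
m1 <- integrate(function(x) exp(4 * x), lower = -0.5, upper = 0.5)$value
m2 <- integrate(function(x) exp(8 * x), lower = -0.5, upper = 0.5)$value

# Var(f(X)) = Var(exp(4X)) because subtracting 1 does not change the variance
signal_var <- m2 - m1^2

# Share of Var(Y) explained by the true function
r2_ceiling <- signal_var / (signal_var + sigma^2)

cat("Signal variance:", round(signal_var, 3), "\n")
cat("Population R-squared of the true function:", round(r2_ceiling, 3), "\n")
```

Under these assumptions the ceiling is roughly 0.93. A test-set R-squared noticeably above that level would mostly reflect sampling variation.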

## Visualization of the True Relationship

In [None]:
# Create data frame for plotting
data_df <- data.frame(X = X, Y = Y, f_X = f_X)

# Plot the true relationship
p1 <- ggplot(data_df, aes(x = X)) +
  geom_point(aes(y = Y), alpha = 0.5, size = 1, color = "blue") +
  geom_line(aes(y = f_X), color = "red", linewidth = 1.5) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  labs(x = "X", y = "Y", title = "Data Generating Process") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

print(p1)

## Functions for Model Fitting and Evaluation

In [None]:
# Function to create polynomial features (columns x^1, x^2, ..., x^degree)
create_polynomial_features <- function(x, degree) {
  # outer() builds the n x degree matrix of powers in one vectorized call;
  # seq_len() avoids the 1:0 pitfall if degree is ever 0
  X_poly <- outer(x, seq_len(degree), `^`)
  
  return(X_poly)
}

# Function to calculate R-squared
calculate_r2 <- function(y_true, y_pred) {
  ss_res <- sum((y_true - y_pred)^2)
  ss_tot <- sum((y_true - mean(y_true))^2)
  return(1 - ss_res / ss_tot)
}

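# Function to calculate adjusted R-squared
# Note: the (n - 1) / (n - p - 1) penalty is the textbook form for models with
# an intercept; the fits below omit the intercept, so treat it as approximate.
# p is the nominal feature count, which can exceed the rank lm() actually fits.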
calculate_adjusted_r2 <- function(r2, n, p) {
  if (n <= p + 1) {
    return(NA)
  }
  return(1 - (1 - r2) * (n - 1) / (n - p - 1))
}

# Function to fit and evaluate models
fit_and_evaluate <- function(X_train, y_train, X_test, y_test, n_features) {
  tryCatch({
    # Create polynomial features
    X_train_poly <- create_polynomial_features(X_train, n_features)
    X_test_poly <- create_polynomial_features(X_test, n_features)
    
    # Fit linear regression without intercept using lm
    # Create data frame for lm function
    train_df <- data.frame(y = y_train, X_train_poly)
    colnames(train_df) <- c("y", paste0("X", 1:n_features))
    
    # Fit model without intercept
    formula_str <- paste("y ~", paste(paste0("X", 1:n_features), collapse = " + "), "- 1")
    model <- lm(as.formula(formula_str), data = train_df)
    
    # Predictions
    y_train_pred <- predict(model)
    
    # For test predictions, create test data frame
    test_df <- data.frame(X_test_poly)
    colnames(test_df) <- paste0("X", 1:n_features)
    y_test_pred <- predict(model, newdata = test_df)
    
    # Calculate metrics
    r2_train <- calculate_r2(y_train, y_train_pred)
    r2_test <- calculate_r2(y_test, y_test_pred)
    adj_r2 <- calculate_adjusted_r2(r2_train, length(y_train), n_features)
    
    return(list(
      r2 = r2_train,
      adj_r2 = adj_r2,
      r2_oos = r2_test,
      n_params = n_features,
      success = TRUE
    ))
  }, error = function(e) {
    return(list(
      r2 = NA,
      adj_r2 = NA,
      r2_oos = NA,
      n_params = n_features,
      success = FALSE,
      error = as.character(e)
    ))
  })
}

cat("Functions defined successfully\n")
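
With X confined to [-0.5, 0.5], high powers of X are tiny and nearly collinear with the lower powers. `lm()` detects this through its pivoted QR decomposition and drops the aliased columns, reporting `NA` for their coefficients. The effective number of parameters can therefore be smaller than the requested feature count. The sketch below checks this on freshly simulated data; `x_demo`, `raw_powers`, and `fit_demo` are illustrative names, not notebook objects.

```r
# Sketch: how many raw polynomial columns does lm() actually keep?
# All object names here are illustrative; the data are simulated for the demo.
set.seed(1)
x_demo <- runif(750, min = -0.5, max = 0.5)
y_demo <- exp(4 * x_demo) - 1 + rnorm(750, sd = 0.5)

degree_demo <- 50
raw_powers <- outer(x_demo, seq_len(degree_demo), `^`)
colnames(raw_powers) <- paste0("X", seq_len(degree_demo))

# Numerical rank under qr()'s default tolerance (1e-07)
rank_raw <- qr(raw_powers)$rank

# lm() drops aliased columns and reports NA for their coefficients
fit_demo <- lm(y_demo ~ raw_powers - 1)
n_dropped <- sum(is.na(coef(fit_demo)))

cat("Requested columns:", degree_demo, "\n")
cat("Numerical rank:   ", rank_raw, "\n")
cat("Coefficients set to NA by lm():", n_dropped, "\n")
```

A better-conditioned basis, such as the orthogonal polynomials from `poly()`, is a common way to reduce this problem. That would change the notebook's setup, so it is mentioned here only as an option.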

## Main Simulation Loop

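We'll test models with 1, 2, 5, 10, 20, 50, 100, 200, 500, and 1000 polynomial features. The loop skips any count that is not smaller than the training set size (750 observations), so the 1000-feature model is not fit.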

In [None]:
# Split data into train (75%) and test (25%)
train_size <- floor(0.75 * n)
indices <- sample(1:n, n, replace = FALSE)
train_idx <- indices[1:train_size]
test_idx <- indices[(train_size + 1):n]

X_train <- X[train_idx]
y_train <- Y[train_idx]
X_test <- X[test_idx]
y_test <- Y[test_idx]

cat("Training set size:", length(X_train), "\n")
cat("Test set size:", length(X_test), "\n")

# Number of features to test
feature_counts <- c(1, 2, 5, 10, 20, 50, 100, 200, 500, 1000)

# Storage for results
results <- data.frame(
  n_features = integer(),
  r2 = numeric(),
  adj_r2 = numeric(),
  r2_oos = numeric(),
  n_params = integer(),
  stringsAsFactors = FALSE
)

cat("\nRunning simulation...\n")
cat("Features | R² (train) | Adj R² | R² (test) | Parameters\n")
cat(strrep("-", 60), "\n")

for (n_features in feature_counts) {
  # Skip if we don't have enough training samples
  if (n_features >= length(X_train)) {
    cat(sprintf("%8d | Skipped (insufficient training data)\n", n_features))
    next
  }
  
  metrics <- fit_and_evaluate(X_train, y_train, X_test, y_test, n_features)
  
  if (metrics$success) {
    results <- rbind(results, data.frame(
      n_features = n_features,
      r2 = metrics$r2,
      adj_r2 = metrics$adj_r2,
      r2_oos = metrics$r2_oos,
      n_params = metrics$n_params
    ))
    
    cat(sprintf("%8d | %9.4f | %6.4f | %8.4f | %9d\n", 
               n_features, metrics$r2, metrics$adj_r2, metrics$r2_oos, metrics$n_params))
  } else {
    error_msg <- substr(metrics$error, 1, 30)
    cat(sprintf("%8d | Error: %s...\n", n_features, error_msg))
  }
}

cat("\nCompleted simulation with", nrow(results), "successful models\n")

## Results Visualization

We create three separate plots showing how the different R-squared metrics change with the number of features.

In [None]:
# Create the three plots
# Plot 1: R-squared (training)
p1 <- ggplot(results, aes(x = n_features, y = r2)) +
  geom_line(color = "blue", linewidth = 1) +
  geom_point(color = "blue", size = 2) +
  scale_x_log10() +
  ylim(0, 1) +
  labs(x = "Number of Features", y = "R-squared (Training)",
       title = "Training R-squared vs Number of Features") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

# Plot 2: Adjusted R-squared (filter out NA values)
valid_adj <- results[!is.na(results$adj_r2), ]
p2 <- ggplot(valid_adj, aes(x = n_features, y = adj_r2)) +
  geom_line(color = "green", linewidth = 1) +
  geom_point(color = "green", size = 2) +
  scale_x_log10() +
  labs(x = "Number of Features", y = "Adjusted R-squared",
       title = "Adjusted R-squared vs Number of Features") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

# Plot 3: Out-of-sample R-squared
p3 <- ggplot(results, aes(x = n_features, y = r2_oos)) +
  geom_line(color = "red", linewidth = 1) +
  geom_point(color = "red", size = 2) +
  scale_x_log10() +
  labs(x = "Number of Features", y = "Out-of-sample R-squared",
       title = "Out-of-sample R-squared vs Number of Features") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

# Combine plots
grid.arrange(p1, p2, p3, ncol = 3)

## Summary and Analysis

Let's examine the results and discuss the overfitting phenomenon.

In [None]:
# Display summary statistics
cat("Summary of Results:\n")
cat(strrep("=", 80), "\n")
print(results)

# Find optimal number of features based on out-of-sample R²
best_idx <- which.max(results$r2_oos)
best_n_features <- results$n_features[best_idx]
best_oos_r2 <- results$r2_oos[best_idx]

cat("\nOptimal number of features (based on out-of-sample R²):", best_n_features, "\n")
cat("Best out-of-sample R²:", round(best_oos_r2, 4), "\n")

# Calculate the difference between training and test R² to show overfitting
results$overfitting <- results$r2 - results$r2_oos
cat("\nOverfitting Analysis (Training R² - Test R²):\n")
overfitting_df <- results[, c("n_features", "overfitting")]
print(overfitting_df)

## Additional Visualization: Combined Plot

In [None]:
# Create a combined plot showing all three metrics
results_long <- results %>%
  select(n_features, r2, adj_r2, r2_oos) %>%
  pivot_longer(cols = c(r2, adj_r2, r2_oos), 
               names_to = "metric", 
               values_to = "value") %>%
  filter(!is.na(value))

# Rename metrics for better labels
results_long$metric <- factor(results_long$metric, 
                             levels = c("r2", "adj_r2", "r2_oos"),
                             labels = c("Training R²", "Adjusted R²", "Out-of-sample R²"))

p_combined <- ggplot(results_long, aes(x = n_features, y = value, color = metric)) +
  geom_line(linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_point(size = 2) +
  scale_x_log10() +
  scale_color_manual(values = c("Training R²" = "blue", 
                               "Adjusted R²" = "green", 
                               "Out-of-sample R²" = "red")) +
  labs(x = "Number of Features", y = "R-squared",
       title = "Comparison of R-squared Metrics vs Number of Features",
       color = "Metric") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "bottom")

print(p_combined)

## Interpretation and Conclusions

### Overfitting Demonstration

This simulation illustrates overfitting in high-dimensional linear models:

1. **Training R-squared** is non-decreasing in the number of polynomial features. Each model nests the previous one, so least squares can always reproduce the smaller fit, and any further gain may come from fitting noise rather than signal.

2. **Adjusted R-squared** typically increases at first and then decreases once the penalty for additional parameters outweighs the improvement in fit. This metric tries to balance model fit against model complexity.

3. **Out-of-sample R-squared** typically increases initially as the model captures more of the true nonlinear relationship, then levels off or decreases as the model becomes too complex and starts fitting noise rather than signal.
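
To make the penalty in item 2 concrete, the sketch below holds the training R-squared fixed at a hypothetical 0.95 and varies only the parameter count. It restates the notebook's adjusted R-squared formula so that it runs on its own; `adjusted_r2`, `r2_fixed`, and `penalty_table` are illustrative names.

```r
# Sketch: the adjusted R-squared penalty for a fixed training R-squared.
# Same formula as calculate_adjusted_r2(), restated so the block is self-contained.
adjusted_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n_train <- 750            # 75% of the 1000 observations
r2_fixed <- 0.95          # hypothetical training R-squared, chosen for illustration
p_values <- c(1, 10, 100, 500)

penalty_table <- data.frame(
  p = p_values,
  adj_r2 = adjusted_r2(r2_fixed, n_train, p_values)
)
print(penalty_table)
```

With R-squared held at 0.95, moving from 10 to 500 parameters lowers the adjusted value from about 0.949 to about 0.850. Extra features need to raise the raw R-squared by more than this penalty for the adjusted value to improve.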

### Key Insights:

- **Bias-Variance Tradeoff**: Simple models (few features) have high bias but low variance. Complex models (many features) have low bias but high variance.
- **Optimal Complexity**: There's an optimal number of features that maximizes out-of-sample performance.
- **Generalization**: Models that perform well on training data don't necessarily generalize well to new data.
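
The tradeoff in the first bullet corresponds to the standard decomposition of expected squared prediction error at a point x. The notation is an assumption here: \hat f is the fitted model, f the true function, and the expectation is taken over training samples and noise.

$$
\mathbb{E}\big[(y - \hat f(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\operatorname{Var}\big(\hat f(x)\big)}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

With the noise standard deviation of 0.5 used in this notebook, the irreducible term is σ² = 0.25. Adding features can lower the bias term but cannot reduce the error below that floor.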

### R-Specific Observations:

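- R's built-in `lm()` function makes linear regression straightforward, but it drops numerically collinear (aliased) columns and reports their coefficients as `NA`, so high-degree fits may use fewer parameters than requested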
- The `ggplot2` package provides excellent visualization capabilities
- Data manipulation with `dplyr` and `tidyr` makes data processing clean and readable
- R's statistical functions make it easy to calculate various metrics

### Practical Implications:

- Always use cross-validation or hold-out samples to evaluate model performance
- Consider regularization techniques (Ridge, Lasso) for high-dimensional problems
- Be cautious of models with a very high training R-squared but poor test performance
- The true data generating process is nonlinear (exponential), but we're using polynomial approximations
- Adjusted R-squared is a better in-sample measure than regular R-squared when comparing models with different numbers of parameters, but it does not replace out-of-sample evaluation
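
The first recommendation can be sketched with a 5-fold cross-validation over polynomial degree, using base R only. The block regenerates data from the same process so that it runs on its own. The fold count, the candidate degrees, and names such as `cv_mse` and `folds` are assumptions made for illustration, not choices from the notebook.

```r
# Sketch: 5-fold cross-validation over polynomial degree, base R only.
# Data are regenerated here; all object names are illustrative.
set.seed(123)
n_cv <- 750
x_cv <- runif(n_cv, min = -0.5, max = 0.5)
y_cv <- exp(4 * x_cv) - 1 + rnorm(n_cv, sd = 0.5)

k <- 5
folds <- sample(rep(seq_len(k), length.out = n_cv))  # random fold labels
degrees <- c(1, 2, 3, 5, 8, 12)

cv_mse <- sapply(degrees, function(d) {
  fold_mse <- sapply(seq_len(k), function(j) {
    train <- data.frame(x = x_cv[folds != j], y = y_cv[folds != j])
    test  <- data.frame(x = x_cv[folds == j], y = y_cv[folds == j])
    # Raw powers without an intercept, mirroring the notebook's model
    fit <- lm(y ~ poly(x, degree = d, raw = TRUE) - 1, data = train)
    mean((test$y - predict(fit, newdata = test))^2)
  })
  mean(fold_mse)  # average held-out error across the k folds
})

cv_results <- data.frame(degree = degrees, cv_mse = cv_mse)
print(cv_results)
cat("Degree with lowest CV error:", degrees[which.min(cv_mse)], "\n")
```

Compared with a single 75/25 split, this averages the held-out error over several splits, so the chosen degree depends less on one particular test set.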