# Assignment 1 - Part 2: Overfitting Analysis
## Overfitting (8 points)

This notebook simulates a data generating process and analyzes overfitting by estimating linear models with increasing numbers of polynomial features using Julia.

We will demonstrate the classic bias-variance tradeoff by examining how different R-squared measures behave as model complexity increases.

Julia's performance advantages make it particularly well-suited for this type of computational analysis involving large matrix operations.

## Load Required Packages

In [None]:
using LinearAlgebra
using Random
using Printf
using Plots
using DataFrames
using CSV
using Statistics

# Set plotting backend
gr()

## Data Generation

We generate data from an exponential relationship, y = exp(4X) + e. X is drawn from a uniform distribution on [0, 1] and then sorted; e is standard normal noise drawn independently of X.
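As a side note on why polynomial features are a natural choice for this process, the exponential has a well-known Taylor expansion. The display below is standard calculus rather than part of the assignment text:

$$
y_i = e^{4x_i} + e_i, \qquad e^{4x} = \sum_{k=0}^{\infty} \frac{(4x)^k}{k!} = 1 + 4x + 8x^2 + \frac{32}{3}x^3 + \cdots
$$

A low-order polynomial in x can therefore already track the signal closely on [0, 1]. Note that the models fitted later omit the intercept, so the constant term 1 has no matching regressor and must be absorbed by the other terms.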

In [None]:
"""
    generate_data(n=1000; seed=42)

Simulate `n` observations from the data generating process (DGP) y = exp(4*X) + e.

# Arguments
- `n::Int`: sample size (default: 1000)
- `seed::Int`: random seed for reproducibility

# Returns
- `X`: n x 1 matrix of sorted uniform draws on [0, 1]
- `y`: length-n vector of outcomes
- `e`: n x 1 matrix of standard normal errors
"""
function generate_data(n=1000; seed=42)
    Random.seed!(seed)
    
    # Generate X using uniform distribution [0,1], sorted
    X = rand(n)
    X = sort(X)
    X = reshape(X, n, 1)
    
    # Generate error term e using normal distribution
    e = randn(n)
    e = reshape(e, n, 1)
    
    # Generate y from the DGP: y = exp(4*X) + e
    y = exp.(4 * X[:, 1]) + e[:, 1]
    
    return X, y, e
end

# Generate the data
X, y, e = generate_data(1000, seed=42)

@printf("Generated data with n=%d observations\n", length(y))
println("True relationship: y = exp(4*X) + e")
@printf("X shape: (%d, %d)\n", size(X)...)
@printf("y length: %d\n", length(y))
@printf("X mean: %.4f, X std: %.4f\n", mean(X), std(X))
@printf("y mean: %.4f, y std: %.4f\n", mean(y), std(y))
println("e sample (first 5):")
for i in 1:5
    @printf("%.4f ", e[i, 1])
end
println()

# Display first few observations
println("\nFirst 10 observations:")
for i in 1:10
    @printf("X[%d] = %7.4f, y[%d] = %7.4f\n", i, X[i, 1], i, y[i])
end

## Helper Functions

Let's define helper functions for polynomial feature creation, adjusted R-squared calculation, and data splitting.
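For reference, the two fit measures implemented in the helpers can be written out as follows. The notation ($\hat{y}_i$ for fitted values, $\bar{y}$ for the sample mean, $k$ for the number of features) is introduced here for illustration and matches the formula quoted in the `calculate_adjusted_r2` docstring:

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}
$$

As a worked check with $R^2 = 0.8$, $n = 1000$ and $k = 5$: $\bar{R}^2 = 1 - 0.2 \cdot 999 / 994 \approx 0.7990$. The adjustment is small when $k$ is much smaller than $n$ and grows rapidly as $k$ approaches $n - 1$.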

In [None]:
"""
    create_polynomial_features(X, n_features)

Build the polynomial feature matrix with columns x^1, x^2, ..., x^n_features.

# Arguments
- `X::Matrix`: original feature matrix (n x 1)
- `n_features::Int`: number of polynomial features to create

# Returns
- `X_poly::Matrix`: n x n_features matrix of polynomial features (no intercept column)
"""
function create_polynomial_features(X, n_features)
    n_samples = size(X, 1)
    X_poly = zeros(n_samples, n_features)
    
    for i in 1:n_features
        X_poly[:, i] = X[:, 1] .^ i  # x^1, x^2, x^3, etc.
    end
    
    return X_poly
end

"""
    calculate_adjusted_r2(r2, n, k)

Compute adjusted R-squared:

    Adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1)

Returns `NaN` when `n - k - 1 <= 0`, where the adjustment is undefined.

# Arguments
- `r2::Float64`: R-squared value
- `n::Int`: sample size
- `k::Int`: number of features (excluding intercept)
"""
function calculate_adjusted_r2(r2, n, k)
    if n - k - 1 <= 0
        return NaN
    end
    
    adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - k - 1))
    return adj_r2
end

"""Compute R-squared as 1 - SS_res / SS_tot."""
function r2_score(y_true, y_pred)
    ss_res = sum((y_true - y_pred).^2)
    ss_tot = sum((y_true .- mean(y_true)).^2)
    return 1 - (ss_res / ss_tot)
end

"""Randomly split `X` and `y` into training and test sets; `test_size` is the test fraction."""
function train_test_split(X, y; test_size=0.25, random_state=42)
    Random.seed!(random_state)
    n = length(y)
    n_test = round(Int, n * test_size)
    indices = randperm(n)
    
    test_indices = indices[1:n_test]
    train_indices = indices[n_test+1:end]
    
    return X[train_indices, :], X[test_indices, :], y[train_indices], y[test_indices]
end

In [None]:
# Example: create polynomial features
X_poly_example = create_polynomial_features(X, 5)
@printf("Example: Original X shape: (%d, %d)\n", size(X)...)
@printf("Example: Polynomial features (5 features) shape: (%d, %d)\n", size(X_poly_example)...)
println("First 5 rows of polynomial features:")
display(X_poly_example[1:5, :])

# Example: adjusted R-squared calculation
example_r2 = 0.8
example_n = 1000
example_k = 5
example_adj_r2 = calculate_adjusted_r2(example_r2, example_n, example_k)
@printf("\nExample: R² = %.1f, n = %d, k = %d\n", example_r2, example_n, example_k)
@printf("Adjusted R² = %.4f\n", example_adj_r2)

## Overfitting Analysis

Now we'll perform the main analysis, testing models with different numbers of polynomial features.
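One practical caveat for this step is that powers of a variable on [0, 1] are nearly collinear, so the design matrix becomes badly conditioned as the degree grows. The sketch below is an illustration under stated assumptions, not the notebook's method: `design` is a hypothetical helper, the data are regenerated locally, and the fit uses `\` on the rectangular matrix (which Julia solves by pivoted QR) instead of the normal equations.

```julia
using LinearAlgebra, Random

Random.seed!(42)
x = sort(rand(1000))
y = exp.(4 .* x) .+ randn(1000)

# Hypothetical helper: monomial design matrix with columns x, x^2, ..., x^deg.
design(x, deg) = [xi^p for xi in x, p in 1:deg]

# The condition number grows quickly with the polynomial degree.
for deg in (2, 5, 10, 20)
    println("degree ", deg, ": cond(X) = ", cond(design(x, deg)))
end

# Least squares via `\` on the rectangular matrix, without forming X'X.
X10 = design(x, 10)
beta_qr = X10 \ y
println("residual norm at degree 10: ", norm(y - X10 * beta_qr))
```

Forming `X' * X` roughly squares the condition number, which is one reason the largest feature counts in the analysis may raise an error or return unreliable coefficients.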

In [None]:
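"""
Run the overfitting analysis on the global `X` and `y`: for each feature count, fit a
no-intercept polynomial regression and record full-sample R², adjusted R², and
out-of-sample R².
"""
function overfitting_analysis()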
    println("=== OVERFITTING ANALYSIS ===\n")
    
    # Number of features to test
    n_features_list = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]
    
    # Storage for results
    results = DataFrame(
        n_features = Int[],
        r2_full = Float64[],
        adj_r2_full = Float64[],
        r2_out_of_sample = Float64[]
    )
    
    println("Analyzing overfitting for different numbers of features...")
    println("Features | R² (full) | Adj R² (full) | R² (out-of-sample)")
    println("-"^60)
    
    for n_feat in n_features_list
        try
            # Create polynomial features
            X_poly = create_polynomial_features(X, n_feat)
            
            # Split data into train/test (75%/25%)
            X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.25, random_state=42)
            
            # Fit model on full sample (no intercept as requested)
            beta_full = (X_poly' * X_poly) \ (X_poly' * y)
            y_pred_full = X_poly * beta_full
            r2_full = r2_score(y, y_pred_full)
            
            # Calculate adjusted R²
            adj_r2_full = calculate_adjusted_r2(r2_full, length(y), n_feat)
            
            # Fit model on training data and predict on test data
            beta_train = (X_train' * X_train) \ (X_train' * y_train)
            y_pred_test = X_test * beta_train
            r2_out_of_sample = r2_score(y_test, y_pred_test)
            
            # Store results
            push!(results, (n_feat, r2_full, adj_r2_full, r2_out_of_sample))
            
            @printf("%8d | %9.4f | %12.4f | %17.4f\n", n_feat, r2_full, adj_r2_full, r2_out_of_sample)
            
        catch e
            println("Error with $n_feat features: $e")
            # Still append to maintain list length
            push!(results, (n_feat, NaN, NaN, NaN))
        end
    end
    
    println()
    return results
end

# Run the analysis
results_df = overfitting_analysis()

## Visualization

Let's create plots to visualize the different R-squared measures as a function of model complexity using Julia's Plots.jl.

In [None]:
"""
    create_plots(df_results)

Create three separate plots for the R-squared analysis.

# Arguments
- `df_results::DataFrame`: results from the overfitting analysis
"""
function create_plots(df_results)
    println("Creating plots...")
    
    # Plot 1: R-squared (full sample)
    p1 = plot(df_results.n_features, df_results.r2_full,
              marker=:circle, linewidth=2, markersize=6, color=:blue,
              title="R-squared on Full Sample vs Number of Features",
              xlabel="Number of Features", ylabel="R-squared",
              xscale=:log10, ylims=(0, 1), grid=true,
              titlefontsize=12, labelfontsize=10,
              legend=false)
    
    display(p1)
    
    # Plot 2: Adjusted R-squared (full sample)  
    p2 = plot(df_results.n_features, df_results.adj_r2_full,
              marker=:square, linewidth=2, markersize=6, color=:green,
              title="Adjusted R-squared on Full Sample vs Number of Features",
              xlabel="Number of Features", ylabel="Adjusted R-squared",
              xscale=:log10, grid=true,
              titlefontsize=12, labelfontsize=10,
              legend=false)
    
    display(p2)
    
    # Plot 3: Out-of-sample R-squared
    p3 = plot(df_results.n_features, df_results.r2_out_of_sample,
              marker=:utriangle, linewidth=2, markersize=6, color=:red,
              title="Out-of-Sample R-squared vs Number of Features",
              xlabel="Number of Features", ylabel="Out-of-Sample R-squared",
              xscale=:log10, grid=true,
              titlefontsize=12, labelfontsize=10,
              legend=false)
    
    display(p3)
    
    println("Plots created successfully!")
    
    return p1, p2, p3
end

# Create the plots
p1, p2, p3 = create_plots(results_df);

## Combined Visualization

Let's create a combined plot showing all three R-squared measures for easy comparison.

In [None]:
# Create a combined plot
p_combined = plot()

# Add each series
plot!(p_combined, results_df.n_features, results_df.r2_full,
      marker=:circle, linewidth=2, markersize=4, color=:blue,
      label="R² (Full Sample)", xscale=:log10)

plot!(p_combined, results_df.n_features, results_df.adj_r2_full,
      marker=:square, linewidth=2, markersize=4, color=:green,
      label="Adjusted R² (Full Sample)")

plot!(p_combined, results_df.n_features, results_df.r2_out_of_sample,
      marker=:utriangle, linewidth=2, markersize=4, color=:red,
      label="R² (Out-of-Sample)")

plot!(p_combined, title="Comparison of R-squared Measures vs Number of Features",
      xlabel="Number of Features (log scale)", ylabel="R-squared Value",
      grid=true, titlefontsize=14, labelfontsize=12,
      legendfontsize=10, legend=:right)

display(p_combined)

println("\nCombined plot shows all three R-squared measures for easy comparison.")
println("Notice how they diverge as model complexity increases!")

## Results Interpretation

Let's analyze the patterns we observe and understand the economic intuition behind them.

In [None]:
"""
    interpret_results(df_results)

Print an interpretation of the results and return the key summary values as a `Dict`.

# Arguments
- `df_results::DataFrame`: results from the overfitting analysis
"""
function interpret_results(df_results)
    println("\n=== RESULTS INTERPRETATION ===\n")
    
    println("Key Observations:")
    println("================")
    
    # R-squared observations
    max_r2_full = maximum(df_results.r2_full[.!isnan.(df_results.r2_full)])
    max_r2_idx = findfirst(x -> x == max_r2_full, df_results.r2_full)
    max_r2_features = df_results.n_features[max_r2_idx]
    
    @printf("1. R-squared (Full Sample):\n")
    @printf("   - Starts at %.4f with 1 feature\n", df_results.r2_full[1])
    @printf("   - Reaches maximum of %.4f with %d features\n", max_r2_full, max_r2_features)
    @printf("   - Should be non-decreasing in the number of features, since the models are nested\n")
    println()
    
    # Adjusted R-squared observations
    valid_adj_r2 = df_results.adj_r2_full[.!isnan.(df_results.adj_r2_full)]
    interpretation_results = Dict()
    
    if !isempty(valid_adj_r2)
        max_adj_r2 = maximum(valid_adj_r2)
        max_adj_r2_idx = findfirst(x -> x == max_adj_r2, df_results.adj_r2_full)
        max_adj_r2_features = df_results.n_features[max_adj_r2_idx]
        
        @printf("2. Adjusted R-squared (Full Sample):\n")
        @printf("   - Peaks at %.4f with %d features\n", max_adj_r2, max_adj_r2_features)
        @printf("   - Declines once the penalty for additional features outweighs the gain in fit\n")
        @printf("   - Can become negative when the model is severely overfitted\n")
        println()
        
        interpretation_results["max_adj_r2"] = max_adj_r2
        interpretation_results["optimal_features_adj_r2"] = max_adj_r2_features
    end
    
    # Out-of-sample observations
    valid_oos_r2 = df_results.r2_out_of_sample[.!isnan.(df_results.r2_out_of_sample)]
    if !isempty(valid_oos_r2)
        max_oos_r2 = maximum(valid_oos_r2)
        max_oos_r2_idx = findfirst(x -> x == max_oos_r2, df_results.r2_out_of_sample)
        max_oos_r2_features = df_results.n_features[max_oos_r2_idx]
        min_oos_r2 = minimum(valid_oos_r2)
        
        @printf("3. Out-of-Sample R-squared:\n")
        @printf("   - Peaks at %.4f with %d features\n", max_oos_r2, max_oos_r2_features)
        @printf("   - Drops dramatically to %.4f as overfitting increases\n", min_oos_r2)
        @printf("   - Can become negative when predictions are worse than using the mean\n")
        println()
        
        interpretation_results["max_oos_r2"] = max_oos_r2
        interpretation_results["optimal_features_oos_r2"] = max_oos_r2_features
    end
    
    interpretation_results["max_r2_full"] = max_r2_full
    
    return interpretation_results
end

# Interpret the results
interpretation = interpret_results(results_df);

## Economic Intuition

Let's discuss the economic and statistical theory behind these patterns.
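The tradeoff discussed in this section is usually stated through the standard decomposition of expected squared prediction error at a new point $x_0$. The symbols $f$ and $\hat{f}$ are introduced here for illustration: $f(x) = e^{4x}$ is the true regression function and $\hat{f}$ is the fitted polynomial.

$$
\mathbb{E}\left[(y_0 - \hat{f}(x_0))^2\right]
= \left(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\right)^2
+ \mathrm{Var}\left(\hat{f}(x_0)\right)
+ \sigma^2
$$

The first term is the squared bias, the second is the variance of the fit, and $\sigma^2$ is the irreducible noise. In this simulation $\sigma^2 = 1$, because the errors are standard normal.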

In [None]:
println("Economic Intuition:")
println("==================")
println()
println("1. **Bias-Variance Tradeoff**: As we add more features (higher-order polynomials),")
println("   we reduce bias but increase variance. Initially, bias reduction dominates,")
println("   improving out-of-sample performance. Eventually, variance dominates.")
println()
println("2. **In-Sample vs Out-of-Sample**: In-sample R² never decreases as features are added,")
println("   because each larger model nests the smaller one and can fit the sample at least as well.")
println("   However, this doesn't translate to better prediction on new data.")
println()
println("3. **Adjusted R-squared as a Model Selection Tool**: Adjusted R² penalizes model")
println("   complexity and provides a better guide for model selection than raw R².")
println()
println("4. **The Curse of Dimensionality**: With 1000 observations and up to 1000 features,")
println("   the largest model has as many parameters as observations (and more than the 750")
println("   training observations), so it can fit the sample exactly yet predict poorly.")
println()
println("5. **Practical Implications**: This demonstrates why regularization techniques")
println("   (Ridge, Lasso, Elastic Net) are crucial in high-dimensional settings to")
println("   prevent overfitting and improve generalization.")

## Performance Analysis

Let's analyze the computational performance for different model complexities using Julia's `@time` macro.
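When reading such timings, keep in mind that the first call to a Julia function also includes just-in-time compilation. A minimal sketch of one way to separate the two, assuming a hypothetical helper `fit_poly` (not part of the notebook) and locally regenerated data, is to run a warm-up call first and then time later calls with `@elapsed`:

```julia
using Random

Random.seed!(42)
x = sort(rand(1000))
y = exp.(4 .* x) .+ randn(1000)

# Hypothetical helper: fit a no-intercept polynomial of degree `deg` by least squares.
function fit_poly(x, y, deg)
    X = [xi^p for xi in x, p in 1:deg]
    return X \ y
end

fit_poly(x, y, 2)   # warm-up call, so compilation is not included in the timings below

for deg in (10, 100)
    t = @elapsed fit_poly(x, y, deg)
    println("degree $deg: $(round(t; digits=4)) s")
end
```

For more careful measurements, a dedicated benchmarking package is the usual choice; the sketch above sticks to Base so that it stays self-contained.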

In [None]:
println("=== PERFORMANCE ANALYSIS ===")
println()

# Test performance for different feature counts
test_features = [10, 100, 500, 1000]

for n_feat in test_features
    println("Performance for $n_feat features:")
    
    @time begin
        X_poly_perf = create_polynomial_features(X, n_feat)
        beta_perf = (X_poly_perf' * X_poly_perf) \ (X_poly_perf' * y)
        y_pred_perf = X_poly_perf * beta_perf
        r2_perf = r2_score(y, y_pred_perf)
    end
    
    println()
end

println("Note: the first @time measurement also includes JIT compilation time.")

## Summary Table

Let's create a final summary of our key findings.

In [None]:
# Display the complete results table
println("\n=== COMPLETE RESULTS TABLE ===\n")
display(results_df)

# Create summary statistics
println("\n=== KEY FINDINGS SUMMARY ===\n")

if !isempty(interpretation)
    summary_data = [
        ["R² (Full Sample)", interpretation["max_r2_full"], results_df.n_features[findfirst(==(interpretation["max_r2_full"]), results_df.r2_full)]],
        ["Adjusted R² (Full)", get(interpretation, "max_adj_r2", NaN), get(interpretation, "optimal_features_adj_r2", NaN)],
        ["R² (Out-of-Sample)", get(interpretation, "max_oos_r2", NaN), get(interpretation, "optimal_features_oos_r2", NaN)]
    ]
    
    summary_df = DataFrame(
        Metric = [row[1] for row in summary_data],
        Maximum_Value = [row[2] for row in summary_data],
        Optimal_Features = [row[3] for row in summary_data]
    )
    
    display(summary_df)
    
    println("\n✅ Overfitting analysis complete!")
    
    if haskey(interpretation, "optimal_features_adj_r2")
        @printf("\nOptimal model complexity (by Adjusted R²): %d features\n", interpretation["optimal_features_adj_r2"])
    end
    if haskey(interpretation, "optimal_features_oos_r2")
        @printf("Optimal model complexity (by Out-of-Sample R²): %d features\n", interpretation["optimal_features_oos_r2"])
    end
end

# Display coefficient information for the optimal models
println("\n=== MODEL COMPLEXITY INSIGHTS ===\n")
println("True model: y = exp(4*X) + e (exponential in X, with standard normal noise)\n")
println("Low-order polynomial terms approximate the exponential well; additional high-order terms mainly fit noise.")
println("The optimal complexity balances fit and generalization.")

## Save Results

Finally, let's save our results for future reference.

In [None]:
# Create output directory and save results
output_dir = "/home/runner/work/High_Dimensional_Linear_Models/High_Dimensional_Linear_Models/Julia/output"
mkpath(output_dir)

# Save main results
CSV.write(joinpath(output_dir, "overfitting_results.csv"), results_df)
println("Results saved to $(output_dir)/overfitting_results.csv")

# Save summary statistics if available
if @isdefined(summary_df)
    CSV.write(joinpath(output_dir, "overfitting_summary.csv"), summary_df)
    println("Summary statistics saved to $(output_dir)/overfitting_summary.csv")
end

println("\n📁 All results saved successfully!")

## Conclusion

This analysis has successfully demonstrated:

1. **The bias-variance tradeoff**: As model complexity increases, we observe the classic pattern where out-of-sample performance first improves then deteriorates

2. **The importance of proper model selection**: Adjusted R² provides a better guide than raw R² for choosing model complexity

3. **The dangers of overfitting**: High-dimensional models can achieve perfect in-sample fit while performing terribly on new data

4. **Practical implications**: This motivates the use of regularization techniques and cross-validation in machine learning
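To make the regularization point concrete, here is a minimal ridge regression sketch. It is an illustration under stated assumptions, not part of the assignment: the helper `ridge`, the penalty values, the sample size of 200, and the degree of 20 are all chosen here for demonstration.

```julia
using LinearAlgebra, Random

Random.seed!(42)
x = sort(rand(200))
y = exp.(4 .* x) .+ randn(200)
X = [xi^p for xi in x, p in 1:20]   # degree-20 monomial features, no intercept

# Hypothetical helper: ridge estimate minimizing ||y - X*b||^2 + lambda * ||b||^2.
ridge(X, y, lambda) = (X' * X + lambda * I) \ (X' * y)

b_small = ridge(X, y, 1e-6)   # nearly unpenalized
b_large = ridge(X, y, 1.0)    # heavier shrinkage

println("coefficient norm, lambda = 1e-6: ", norm(b_small))
println("coefficient norm, lambda = 1.0:  ", norm(b_large))
```

A larger penalty shrinks the coefficient vector toward zero, trading a little bias for lower variance. In practice the penalty would be chosen by cross-validation rather than fixed in advance.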

### Julia-Specific Implementation Highlights:
- **Performance**: Julia compiles code just in time, and dense matrix operations are dispatched to optimized BLAS/LAPACK routines
- **Syntax**: Clean, mathematical notation that closely matches theoretical formulations
- **Memory Efficiency**: Efficient handling of large matrices without significant overhead
- **Type System**: Dynamic typing with optional type annotations; type inference and multiple dispatch let the compiler generate specialized code
- **Ecosystem**: Rich packages like DataFrames.jl, Plots.jl, and CSV.jl for data analysis

### Statistical Insights:
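- **R² (Full Sample)**: Non-decreasing in the number of features; in exact arithmetic it reaches 1.0 once the number of linearly independent features equals the number of observations (1000 here)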
- **Adjusted R²**: Peaks early and then declines, incorporating complexity penalty
- **Out-of-Sample R²**: Shows inverted U-shape, the gold standard for model selection
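A single 75/25 split gives only one noisy estimate of out-of-sample fit. One common refinement, sketched below as an assumption-laden illustration rather than the notebook's procedure, is k-fold cross-validation. The helper `kfold_r2`, the choice of 5 folds, and the fixed random generator are all hypothetical.

```julia
using Random, Statistics

Random.seed!(42)
x = sort(rand(1000))
y = exp.(4 .* x) .+ randn(1000)

# Hypothetical helper: mean out-of-sample R² over k folds for a
# no-intercept polynomial of degree `deg`.
function kfold_r2(x, y, deg; k=5, rng=MersenneTwister(0))
    n = length(y)
    idx = randperm(rng, n)
    folds = [idx[i:k:n] for i in 1:k]          # k disjoint index sets
    scores = Float64[]
    for test in folds
        train = setdiff(idx, test)
        Xtr = [x[i]^p for i in train, p in 1:deg]
        Xte = [x[i]^p for i in test, p in 1:deg]
        beta = Xtr \ y[train]                  # least squares on the training folds
        resid = y[test] .- Xte * beta
        ss_tot = sum(abs2, y[test] .- mean(y[test]))
        push!(scores, 1 - sum(abs2, resid) / ss_tot)
    end
    return mean(scores)
end

for deg in (1, 3, 5)
    println("degree $deg: mean 5-fold R² = ", kfold_r2(x, y, deg))
end
```

Averaging over folds uses every observation for testing exactly once, which makes the comparison across degrees less sensitive to one particular split.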

### Computational Advantages:
Julia's performance characteristics make it particularly well-suited for:
- Large-scale matrix computations
- Iterative algorithms requiring many matrix operations
- Statistical simulations and bootstrap procedures
- High-dimensional data analysis

The results clearly show why understanding overfitting is crucial for building models that generalize well to new data, particularly in high-dimensional settings common in modern econometrics and machine learning. Julia's combination of performance and expressiveness makes it an excellent choice for implementing and exploring these concepts.

**This completes Part 2 of Assignment 1 in Julia.**