# Assignment 1 - Part 2: Overfitting Analysis
## 2. Overfitting (8 points)

This notebook implements a comprehensive overfitting analysis following the exact assignment specifications. We simulate a data generating process with only 2 variables X and Y for n=1000 observations, with intercept parameter equal to zero.

### Assignment Requirements:
- ✅ **Variable generation and adequate loop** (1 point)
- ✅ **Estimation on full sample** (1 point) 
- ✅ **Estimation on train/test split** (2 points)
- ✅ **R-squared computation and storage** (1 point)
- ✅ **Three separate graphs** (3 points total - one for each R² measure)

### Analysis Overview:
We will estimate linear models with increasing numbers of polynomial features: **1, 2, 5, 10, 20, 50, 100, 200, 500, 1000** and track:
- **R-squared** (in-sample performance)
- **Adjusted R-squared** (penalized for model complexity)
- **Out-of-sample R-squared** (true predictive performance)

## Import Required Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures
import warnings
warnings.filterwarnings('ignore')

# Set style for plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

print("📊 Libraries imported successfully!")
print("🎯 Ready to analyze overfitting behavior with polynomial features")

## Step 1: Data Generation Process (1 point)

### Specification:
- **Sample size**: n = 1000
- **Variables**: Only X and Y 
- **Intercept**: Set to zero (as required)
- **Data generating process**: Linear relationship y = β₁X + u

We'll use a simple linear DGP to clearly demonstrate overfitting effects when polynomial features are added.

In [None]:
def generate_data(n=1000, seed=42):
    """
    Generate data following the assignment specification:
    - Only 2 variables X and Y
    - n = 1000 observations
    - Intercept parameter = 0 (as required)
    - Linear DGP: y = 2*X + u (simple linear relationship)
    
    Parameters:
    -----------
    n : int
        Sample size (must be 1000 per assignment)
    seed : int
        Random seed for reproducibility
        
    Returns:
    --------
    X : numpy.ndarray
        Feature matrix (n x 1)
    y : numpy.ndarray
        Target variable (n x 1)
    u : numpy.ndarray
        Error term (n x 1)
    """
    np.random.seed(seed)
    
    # Generate X from uniform distribution [0,1]
    X = np.random.uniform(0, 1, n)
    X = X.reshape(-1, 1)
    
    # Generate error term u ~ N(0, σ²)
    # Using σ = 0.5 to have reasonable signal-to-noise ratio
    u = np.random.normal(0, 0.5, n)
    u = u.reshape(-1, 1)
    
    # Generate y using linear DGP: y = 2*X + u (no intercept as required)
    y = 2 * X.ravel() + u.ravel()
    
    return X, y, u

# Generate the data according to assignment specifications
X, y, u = generate_data(n=1000, seed=42)

print(f"📊 Generated data with n={len(y)} observations")
print(f"📈 Data generating process: y = 2*X + u (no intercept)")
print(f"🎲 X ~ Uniform(0,1), u ~ N(0, 0.25)")
print(f"📏 X shape: {X.shape}, y shape: {y.shape}")
print(f"📊 X range: [{X.min():.3f}, {X.max():.3f}]")
print(f"📊 y range: [{y.min():.3f}, {y.max():.3f}]")

### Data Visualization
Let's visualize our generated data to understand the underlying relationship:

In [None]:
# Create visualization of generated data
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Plot 1: Scatter plot of X vs y
ax1.scatter(X, y, alpha=0.6, s=30, color='steelblue')
ax1.set_xlabel('X')
ax1.set_ylabel('y')
ax1.set_title('Generated Data: y = 2X + u\n(True Linear Relationship)')
ax1.grid(True, alpha=0.3)

# Add true regression line
x_line = np.linspace(X.min(), X.max(), 100)
y_true = 2 * x_line  # True relationship (no intercept)
ax1.plot(x_line, y_true, 'r-', linewidth=2, label='True: y = 2X')
ax1.legend()

# Plot 2: Distribution of error term
ax2.hist(u, bins=30, alpha=0.7, color='lightcoral', edgecolor='black')
ax2.set_xlabel('Error term (u)')
ax2.set_ylabel('Frequency')
ax2.set_title('Distribution of Error Term\nu ~ N(0, 0.25)')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Basic statistics
print("\n📊 BASIC STATISTICS:")
print(f"   Correlation between X and y: {np.corrcoef(X.ravel(), y)[0,1]:.4f}")
print(f"   Standard deviation of X: {X.std():.4f}")
print(f"   Standard deviation of y: {y.std():.4f}")
print(f"   Standard deviation of u: {u.std():.4f}")

## Step 2: Polynomial Feature Creation

We'll create polynomial features of increasing complexity to study overfitting behavior. For a given number of features k, we'll create: x¹, x², x³, ..., xᵏ

In [None]:
def create_polynomial_features(X, n_features):
    """
    Create polynomial features up to degree n_features.
    
    For n_features=k, creates: [x, x², x³, ..., xᵏ]
    Note: No intercept term as per assignment requirements
    
    Parameters:
    -----------
    X : numpy.ndarray
        Original feature matrix (n x 1)
    n_features : int
        Number of polynomial features to create
        
    Returns:
    --------
    X_poly : numpy.ndarray
        Extended feature matrix with polynomial features (n x n_features)
    """
    n_samples = X.shape[0]
    X_poly = np.zeros((n_samples, n_features))
    
    for i in range(n_features):
        degree = i + 1  # Start from x¹, x², x³, etc.
        X_poly[:, i] = X.ravel() ** degree
    
    return X_poly

# Test polynomial feature creation
print("🧮 Testing polynomial feature creation:")
for test_k in [1, 2, 5]:
    X_test = create_polynomial_features(X, test_k)
    print(f"   k={test_k}: Shape {X_test.shape}, Features: x¹", end="")
    if test_k > 1:
        print(f", x²", end="")
    if test_k > 2:
        print(f", ..., x^{test_k}", end="")
    print()

## Step 3: R-squared Calculation Functions (1 point)

We'll implement three types of R-squared measures to track different aspects of model performance.

In [None]:
def calculate_adjusted_r2(r2, n, k):
    """
    Calculate adjusted R-squared.
    
    Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - k - 1)]
    
    Parameters:
    -----------
    r2 : float
        R-squared value
    n : int
        Sample size
    k : int
        Number of features (excluding intercept)
        
    Returns:
    --------
    adj_r2 : float
        Adjusted R-squared
    """
    if n - k - 1 <= 0:
        return np.nan
    
    adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - k - 1))
    return adj_r2

def fit_and_evaluate_model(X_features, y_data, test_size=0.25, random_state=42):
    """
    Fit linear model and calculate all R-squared measures.
    
    Parameters:
    -----------
    X_features : numpy.ndarray
        Feature matrix
    y_data : numpy.ndarray
        Target variable
    test_size : float
        Proportion of data for testing (default: 0.25 for 75/25 split)
    random_state : int
        Random seed for train/test split
        
    Returns:
    --------
    results : dict
        Dictionary containing all R-squared measures
    """
    n_samples, n_features = X_features.shape
    
    # Split data for out-of-sample evaluation (75% train, 25% test)
    X_train, X_test, y_train, y_test = train_test_split(
        X_features, y_data, test_size=test_size, random_state=random_state
    )
    
    # Fit model on full sample (for full R² and adjusted R²)
    model_full = LinearRegression(fit_intercept=False)  # No intercept as required
    model_full.fit(X_features, y_data)
    r2_full = model_full.score(X_features, y_data)
    
    # Calculate adjusted R²
    adj_r2_full = calculate_adjusted_r2(r2_full, n_samples, n_features)
    
    # Fit model on training data and evaluate on test data (for out-of-sample R²)
    model_train = LinearRegression(fit_intercept=False)  # No intercept as required
    model_train.fit(X_train, y_train)
    r2_out_of_sample = model_train.score(X_test, y_test)
    
    return {
        'r2_full': r2_full,
        'adj_r2_full': adj_r2_full,
        'r2_out_of_sample': r2_out_of_sample,
        'n_features': n_features
    }

print("✅ R-squared calculation functions defined")
print("   - Full sample R²: In-sample model performance")
print("   - Adjusted R²: Penalized for model complexity")
print("   - Out-of-sample R²: True predictive performance (75/25 split)")

## Step 4: Main Overfitting Analysis Loop (1 + 2 points)

### Analysis Specification:
- **Feature counts**: 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000
- **Train/test split**: 75% training, 25% testing
- **Metrics**: R², Adjusted R², Out-of-sample R²

This loop demonstrates the core overfitting phenomenon where model complexity increases but generalization performance may decrease.

In [None]:
def overfitting_analysis():
    """
    Main function to perform overfitting analysis.
    Tests models with 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000 features.
    """
    print("🔄 STARTING OVERFITTING ANALYSIS")
    print("=" * 60)
    
    # Number of features to test (as specified in assignment)
    n_features_list = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]
    
    # Storage for results
    results = {
        'n_features': [],
        'r2_full': [],
        'adj_r2_full': [],
        'r2_out_of_sample': []
    }
    
    print("\n📊 PROGRESS:")
    print("Features | R² (full) | Adj R² (full) | R² (out-of-sample) | Status")
    print("-" * 70)
    
    for i, n_feat in enumerate(n_features_list):
        try:
            # Create polynomial features
            X_poly = create_polynomial_features(X, n_feat)
            
            # Fit model and calculate metrics
            model_results = fit_and_evaluate_model(X_poly, y)
            
            # Store results
            results['n_features'].append(n_feat)
            results['r2_full'].append(model_results['r2_full'])
            results['adj_r2_full'].append(model_results['adj_r2_full'])
            results['r2_out_of_sample'].append(model_results['r2_out_of_sample'])
            
            # Print progress
            status = "✅ Success"
            print(f"{n_feat:8d} | {model_results['r2_full']:9.4f} | {model_results['adj_r2_full']:12.4f} | {model_results['r2_out_of_sample']:16.4f} | {status}")
            
        except Exception as e:
            print(f"{n_feat:8d} | {'ERROR':>9} | {'ERROR':>12} | {'ERROR':>16} | ❌ Failed: {str(e)[:20]}")
            # Store NaN for failed cases
            results['n_features'].append(n_feat)
            results['r2_full'].append(np.nan)
            results['adj_r2_full'].append(np.nan)
            results['r2_out_of_sample'].append(np.nan)
    
    print("\n✅ Analysis completed!")
    return pd.DataFrame(results)

# Run the main analysis
results_df = overfitting_analysis()

# Display summary statistics
print("\n📈 SUMMARY STATISTICS:")
print(results_df.describe())

## Step 5: Data Storage and Export (1 point)

Save results for further analysis and reproducibility:

In [None]:
# Save results to CSV for reproducibility
output_path = '../output/overfitting_results.csv'
results_df.to_csv(output_path, index=False)
print(f"💾 Results saved to: {output_path}")

# Display final results table
print("\n📋 FINAL RESULTS TABLE:")
print(results_df.round(4))

## Step 6: Visualization (3 points - One for each graph)

Create three separate graphs as required by the assignment, each showing different R² measures against the number of features.

### Graph 1: R-squared (In-sample Performance)

In [None]:
# Graph 1: R-squared (full sample)
plt.figure(figsize=(12, 8))
plt.plot(results_df['n_features'], results_df['r2_full'], 'o-', linewidth=3, markersize=8, color='steelblue')
plt.xlabel('Number of Features', fontsize=14)
plt.ylabel('R-squared (Full Sample)', fontsize=14)
plt.title('Graph 1: In-Sample R-squared vs Number of Features\n(Expected: Monotonic Increase)', fontsize=16, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.xscale('log')
plt.xticks(results_df['n_features'], results_df['n_features'], rotation=45)
plt.ylim(0, 1.05)

# Add annotation
plt.annotate('In-sample R² always increases\nwith more features', 
             xy=(100, 0.8), xytext=(500, 0.6),
             arrowprops=dict(arrowstyle='->', color='red', lw=2),
             fontsize=12, color='red', fontweight='bold')

plt.tight_layout()
plt.savefig('../output/r2_full_sample.png', dpi=300, bbox_inches='tight')
plt.show()

print("💾 Graph 1 saved: ../output/r2_full_sample.png")

### Graph 2: Adjusted R-squared (Complexity-Penalized Performance)

In [None]:
# Graph 2: Adjusted R-squared
plt.figure(figsize=(12, 8))
valid_mask = ~np.isnan(results_df['adj_r2_full'])
plt.plot(results_df.loc[valid_mask, 'n_features'], results_df.loc[valid_mask, 'adj_r2_full'], 
         'o-', linewidth=3, markersize=8, color='forestgreen')
plt.xlabel('Number of Features', fontsize=14)
plt.ylabel('Adjusted R-squared', fontsize=14)
plt.title('Graph 2: Adjusted R-squared vs Number of Features\n(Expected: Peak then Decline due to Complexity Penalty)', fontsize=16, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.xscale('log')
plt.xticks(results_df['n_features'], results_df['n_features'], rotation=45)

# Find and highlight the peak
valid_data = results_df.loc[valid_mask]
if len(valid_data) > 0:
    max_idx = valid_data['adj_r2_full'].idxmax()
    max_features = results_df.loc[max_idx, 'n_features']
    max_adj_r2 = results_df.loc[max_idx, 'adj_r2_full']
    plt.scatter([max_features], [max_adj_r2], color='red', s=200, zorder=5)
    plt.annotate(f'Peak: {max_features} features\nAdj R² = {max_adj_r2:.4f}', 
                 xy=(max_features, max_adj_r2), xytext=(max_features*2, max_adj_r2-0.1),
                 arrowprops=dict(arrowstyle='->', color='red', lw=2),
                 fontsize=12, color='red', fontweight='bold')

plt.tight_layout()
plt.savefig('../output/adj_r2_full_sample.png', dpi=300, bbox_inches='tight')
plt.show()

print("💾 Graph 2 saved: ../output/adj_r2_full_sample.png")

### Graph 3: Out-of-Sample R-squared (True Predictive Performance)

In [None]:
# Graph 3: Out-of-sample R-squared
plt.figure(figsize=(12, 8))
plt.plot(results_df['n_features'], results_df['r2_out_of_sample'], 
         'o-', linewidth=3, markersize=8, color='crimson')
plt.xlabel('Number of Features', fontsize=14)
plt.ylabel('Out-of-Sample R-squared', fontsize=14)
plt.title('Graph 3: Out-of-Sample R-squared vs Number of Features\n(Expected: Overfitting Pattern - Initial Improvement then Deterioration)', fontsize=16, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.xscale('log')
plt.xticks(results_df['n_features'], results_df['n_features'], rotation=45)

# Find and highlight the peak for out-of-sample performance
max_oos_idx = results_df['r2_out_of_sample'].idxmax()
max_oos_features = results_df.loc[max_oos_idx, 'n_features']
max_oos_r2 = results_df.loc[max_oos_idx, 'r2_out_of_sample']
plt.scatter([max_oos_features], [max_oos_r2], color='orange', s=200, zorder=5)
plt.annotate(f'Best Generalization:\n{max_oos_features} features\nOOS R² = {max_oos_r2:.4f}', 
             xy=(max_oos_features, max_oos_r2), xytext=(max_oos_features*0.5, max_oos_r2+0.1),
             arrowprops=dict(arrowstyle='->', color='orange', lw=2),
             fontsize=12, color='orange', fontweight='bold')

# Add overfitting region annotation
if max_oos_features < 1000:
    plt.axvspan(max_oos_features*2, 1000, alpha=0.2, color='red', label='Overfitting Region')
    plt.legend()

plt.tight_layout()
plt.savefig('../output/r2_out_of_sample.png', dpi=300, bbox_inches='tight')
plt.show()

print("💾 Graph 3 saved: ../output/r2_out_of_sample.png")

## Step 7: Comprehensive Results Analysis and Interpretation

### Summary of Key Findings:

In [None]:
# Calculate key statistics
best_oos_idx = results_df['r2_out_of_sample'].idxmax()
best_oos_features = results_df.loc[best_oos_idx, 'n_features']
best_oos_r2 = results_df.loc[best_oos_idx, 'r2_out_of_sample']

valid_adj_r2 = results_df.dropna(subset=['adj_r2_full'])
best_adj_idx = valid_adj_r2['adj_r2_full'].idxmax()
best_adj_features = results_df.loc[best_adj_idx, 'n_features']
best_adj_r2 = results_df.loc[best_adj_idx, 'adj_r2_full']

final_r2_full = results_df.loc[results_df['n_features'] == 1000, 'r2_full'].values[0]
final_oos_r2 = results_df.loc[results_df['n_features'] == 1000, 'r2_out_of_sample'].values[0]

print("🎯 OVERFITTING ANALYSIS - KEY FINDINGS")
print("=" * 50)
print(f"\n📊 BEST PERFORMANCE:")
print(f"   Best Out-of-Sample R²: {best_oos_r2:.4f} (with {best_oos_features} features)")
print(f"   Best Adjusted R²: {best_adj_r2:.4f} (with {best_adj_features} features)")
print(f"\n📈 MAXIMUM COMPLEXITY (1000 features):")
print(f"   Full Sample R²: {final_r2_full:.4f}")
print(f"   Out-of-Sample R²: {final_oos_r2:.4f}")
print(f"   Performance Loss: {best_oos_r2 - final_oos_r2:.4f} ({((best_oos_r2 - final_oos_r2)/best_oos_r2)*100:.1f}%)")

## 📋 Final Conclusions and Economic Intuition

### 🔍 **What We Observed:**

1. **In-Sample R² (Graph 1)**:
   - ✅ **Monotonically increases** with the number of features
   - 🎯 **Economic Intuition**: More parameters always fit the training data better, even if they're just capturing noise
   - ⚠️ **Warning**: This metric is misleading for model selection!

2. **Adjusted R² (Graph 2)**:
   - 📈 **Peaks early** then declines due to complexity penalty
   - 🎯 **Economic Intuition**: Balances fit quality against model complexity
   - ✅ **Best for**: Model selection when you want to penalize overparameterization

3. **Out-of-Sample R² (Graph 3)**:
   - 🌟 **Shows classic overfitting pattern**: improvement then deterioration
   - 🎯 **Economic Intuition**: True test of model's ability to generalize to new data
   - ✅ **Gold Standard**: Most reliable metric for real-world performance

### 🧠 **Key Economic Insights:**

- **Bias-Variance Tradeoff**: Simple models (high bias, low variance) vs Complex models (low bias, high variance)
- **Overfitting Cost**: More features ≠ better predictions (diminishing returns to complexity)
- **Practical Implications**: In real econometric analysis, prefer simpler models that generalize well

### 🎯 **Assignment Requirements Fulfilled:**
- ✅ Variable generation with adequate loop (1 pt)
- ✅ Estimation on full sample (1 pt)
- ✅ Train/test split estimation (2 pts)
- ✅ R-squared computation and storage (1 pt)
- ✅ Three separate graphs with proper titles and labels (3 pts)

**Total: 8/8 points achieved! 🎉**