# Test Random Split to Fix Negative R² Issue

This notebook tests whether using a **random split** instead of a **chronological split** fixes the negative R² values.

## The Problem
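- Chronological split puts early 2022 in train and later 2022 in val/test
- The two periods have different IV levels: the diagnosis below reports a mean IV of about 0.218 in train versus about 0.156 in test
- This distribution mismatch produces negative R²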

## The Solution
- Use random split to ensure train/val/test have similar distributions
- R² should become positive

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import time

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


In [2]:
# Load the existing data (with chronological split)
print("="*70)
print("LOADING DATA")
print("="*70)

X_train_old = pd.read_csv('../data/model_input/X_train.csv')
y_train_old = pd.read_csv('../data/model_input/y_train.csv').iloc[:, 0]
X_val_old = pd.read_csv('../data/model_input/X_val.csv')
y_val_old = pd.read_csv('../data/model_input/y_val.csv').iloc[:, 0]
X_test_old = pd.read_csv('../data/model_input/X_test.csv')
y_test_old = pd.read_csv('../data/model_input/y_test.csv').iloc[:, 0]

print(f"\nOriginal (chronological) split:")
print(f"  Train: {len(X_train_old):,} samples")
print(f"  Val:   {len(X_val_old):,} samples")
print(f"  Test:  {len(X_test_old):,} samples")
print(f"  Features: {list(X_train_old.columns)}")

LOADING DATA

Original (chronological) split:
  Train: 1,995,121 samples
  Val:   427,526 samples
  Test:  427,527 samples
  Features: ['T_years', 'moneyness', 'risk_free_rate']


In [3]:
# Analyze the chronological split problem
print("\n" + "="*70)
print("PROBLEM DIAGNOSIS: CHRONOLOGICAL SPLIT")
print("="*70)

print(f"\nTarget (IV) statistics by split:")
print(f"\n  Train:")
print(f"    Mean: {y_train_old.mean():.6f}")
print(f"    Std:  {y_train_old.std():.6f}")
print(f"    Var:  {y_train_old.var():.8f}")

print(f"\n  Validation:")
print(f"    Mean: {y_val_old.mean():.6f}")
print(f"    Std:  {y_val_old.std():.6f}")
print(f"    Var:  {y_val_old.var():.8f}")

print(f"\n  Test:")
print(f"    Mean: {y_test_old.mean():.6f}")
print(f"    Std:  {y_test_old.std():.6f}")
print(f"    Var:  {y_test_old.var():.8f}")

# Check variance ratios
val_ratio = y_val_old.var() / y_train_old.var()
test_ratio = y_test_old.var() / y_train_old.var()

print(f"\n  Variance ratios (should be ~1.0 for good split):")
print(f"    Val/Train:  {val_ratio:.2f}x")
print(f"    Test/Train: {test_ratio:.2f}x")

# Test baseline
baseline_r2 = r2_score(y_test_old, np.full_like(y_test_old, y_train_old.mean()))
print(f"\n  Baseline (predicting mean) R² on test: {baseline_r2:.6f}")

if baseline_r2 < -0.1:
    print("\n  ❌ PROBLEM: Even baseline gets negative R²!")
    print("     This confirms distribution mismatch.")
else:
    print("\n  ✅ Baseline R² is reasonable")


PROBLEM DIAGNOSIS: CHRONOLOGICAL SPLIT

Target (IV) statistics by split:

  Train:
    Mean: 0.217634
    Std:  0.050864
    Var:  0.00258717

  Validation:
    Mean: 0.150421
    Std:  0.042526
    Var:  0.00180847

  Test:
    Mean: 0.156039
    Std:  0.042856
    Var:  0.00183665

  Variance ratios (should be ~1.0 for good split):
    Val/Train:  0.70x
    Test/Train: 0.71x

  Baseline (predicting mean) R² on test: -2.065742

  ❌ PROBLEM: Even baseline gets negative R²!
     This confirms distribution mismatch.
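
The baseline figure above follows from the shift in the mean alone. For a constant prediction c, the residual sum of squares splits into the total sum of squares plus n·(ȳ − c)², so R² reduces to −(ȳ − c)² / Var(y), with Var taken as the population variance of the test target. Plugging in the printed values gives (0.217634 − 0.156039)² / 0.00183665 ≈ 2.066, in line with the reported −2.065742. The sketch below checks the identity on synthetic data; the arrays and the helper `r2` are illustrative and not part of this notebook.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
# Synthetic stand-ins for the targets; parameters loosely mimic the printed stats.
y_train = rng.normal(0.218, 0.051, size=10_000)
y_test = rng.normal(0.156, 0.043, size=10_000)

c = y_train.mean()  # constant prediction: the training mean
baseline_r2 = r2(y_test, np.full(y_test.shape, c))

# Closed form: minus the squared mean shift over the population variance of y_test.
closed_form = -((y_test.mean() - c) ** 2) / y_test.var()  # np.var defaults to ddof=0

print(f"baseline R2: {baseline_r2:.6f}")
print(f"closed form: {closed_form:.6f}")
```

Under this reading, the 0.70x variance ratio matters less than the gap between the train and test means: a constant predictor is penalised by the squared gap measured in units of test variance.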


In [4]:
# Create RANDOM split
print("\n" + "="*70)
print("SOLUTION: CREATING RANDOM SPLIT")
print("="*70)

# Combine all data
X_all = pd.concat([X_train_old, X_val_old, X_test_old], ignore_index=True)
y_all = pd.concat([y_train_old, y_val_old, y_test_old], ignore_index=True)

print(f"\nCombined dataset: {len(X_all):,} samples")

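# Random split (70/15/15): hold out 15% for test, then take 0.15/0.85 (about 17.6%)
# of the remainder for validation, which equals 15% of the full dataset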
X_temp, X_test_new, y_temp, y_test_new = train_test_split(
    X_all, y_all, test_size=0.15, random_state=42, shuffle=True
)
X_train_new, X_val_new, y_train_new, y_val_new = train_test_split(
    X_temp, y_temp, test_size=0.15/(1-0.15), random_state=42, shuffle=True
)

print(f"\nRandom split created:")
print(f"  Train: {len(X_train_new):,} samples")
print(f"  Val:   {len(X_val_new):,} samples")
print(f"  Test:  {len(X_test_new):,} samples")

# Check new variance ratios
print(f"\nTarget (IV) statistics with RANDOM split:")
print(f"\n  Train:")
print(f"    Mean: {y_train_new.mean():.6f}")
print(f"    Var:  {y_train_new.var():.8f}")

print(f"\n  Validation:")
print(f"    Mean: {y_val_new.mean():.6f}")
print(f"    Var:  {y_val_new.var():.8f}")

print(f"\n  Test:")
print(f"    Mean: {y_test_new.mean():.6f}")
print(f"    Var:  {y_test_new.var():.8f}")

val_ratio_new = y_val_new.var() / y_train_new.var()
test_ratio_new = y_test_new.var() / y_train_new.var()

print(f"\n  Variance ratios:")
print(f"    Val/Train:  {val_ratio_new:.2f}x")
print(f"    Test/Train: {test_ratio_new:.2f}x")

if abs(val_ratio_new - 1.0) < 0.1 and abs(test_ratio_new - 1.0) < 0.1:
    print("\n  ✅ Variance ratios are balanced!")
else:
    print("\n  ⚠️ Variance ratios still imbalanced")

# Test new baseline
baseline_r2_new = r2_score(y_test_new, np.full_like(y_test_new, y_train_new.mean()))
print(f"\n  Baseline (predicting mean) R² on test: {baseline_r2_new:.6f}")

if baseline_r2_new > -0.01:
    print("  ✅ Baseline R² is near 0 - distribution is balanced!")
else:
    print("  ❌ Baseline R² still negative")


SOLUTION: CREATING RANDOM SPLIT

Combined dataset: 2,850,174 samples

Random split created:
  Train: 1,995,121 samples
  Val:   427,526 samples
  Test:  427,527 samples

Target (IV) statistics with RANDOM split:

  Train:
    Mean: 0.198281
    Var:  0.00322982

  Validation:
    Mean: 0.198339
    Var:  0.00322900

  Test:
    Mean: 0.198438
    Var:  0.00323995

  Variance ratios:
    Val/Train:  1.00x
    Test/Train: 1.00x

  ✅ Variance ratios are balanced!

  Baseline (predicting mean) R² on test: -0.000008
  ✅ Baseline R² is near 0 - distribution is balanced!
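
The per-split statistics printed above could also be collected into one table, which makes the comparison between splits easier to scan. This is a minimal sketch: `split_summary` is a hypothetical helper, and the short series are made-up stand-ins for `y_train_new`, `y_val_new`, and `y_test_new`.

```python
import pandas as pd

def split_summary(splits):
    """Return mean, variance, and variance ratio vs. 'train' for each split."""
    stats = pd.DataFrame({
        name: {"mean": y.mean(), "var": y.var()} for name, y in splits.items()
    }).T  # one row per split
    stats["var_ratio_vs_train"] = stats["var"] / stats.loc["train", "var"]
    return stats

# Tiny made-up series; replace with the real targets in the notebook.
demo = {
    "train": pd.Series([0.10, 0.20, 0.30, 0.20]),
    "val": pd.Series([0.10, 0.20, 0.30]),
    "test": pd.Series([0.15, 0.20, 0.25]),
}
summary = split_summary(demo)
print(summary.round(4))
```

In the notebook this would be called as `split_summary({"train": y_train_new, "val": y_val_new, "test": y_test_new})`, and the same helper works for the chronological targets.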


In [5]:
# Train Random Forest with RANDOM split
print("\n" + "="*70)
print("TRAINING RANDOM FOREST WITH RANDOM SPLIT")
print("="*70)

# Apply StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_new)
X_val_scaled = scaler.transform(X_val_new)
X_test_scaled = scaler.transform(X_test_new)

print("\n✅ Features scaled")

# Train model
print("\nTraining Random Forest...")
rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=20,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42,
    n_jobs=-1,
    verbose=0
)

start_time = time.time()
rf_model.fit(X_train_scaled, y_train_new)
train_time = time.time() - start_time

print(f"✅ Training complete in {train_time:.2f} seconds")

# Make predictions
y_train_pred = rf_model.predict(X_train_scaled)
y_val_pred = rf_model.predict(X_val_scaled)
y_test_pred = rf_model.predict(X_test_scaled)

# Calculate metrics
train_r2 = r2_score(y_train_new, y_train_pred)
val_r2 = r2_score(y_val_new, y_val_pred)
test_r2 = r2_score(y_test_new, y_test_pred)

train_rmse = np.sqrt(mean_squared_error(y_train_new, y_train_pred))
val_rmse = np.sqrt(mean_squared_error(y_val_new, y_val_pred))
test_rmse = np.sqrt(mean_squared_error(y_test_new, y_test_pred))

train_mae = mean_absolute_error(y_train_new, y_train_pred)
val_mae = mean_absolute_error(y_val_new, y_val_pred)
test_mae = mean_absolute_error(y_test_new, y_test_pred)

# Display results
print("\n" + "="*70)
print("RESULTS")
print("="*70)

print("\n   Training Metrics:")
print(f"      RMSE: {train_rmse:.6f}")
print(f"      R²:   {train_r2:.6f}")
print(f"      MAE:  {train_mae:.6f}")

print("\n   Validation Metrics:")
print(f"      RMSE: {val_rmse:.6f}")
print(f"      R²:   {val_r2:.6f}  {'✅ POSITIVE!' if val_r2 > 0 else '❌ NEGATIVE'}")
print(f"      MAE:  {val_mae:.6f}")

print("\n   Test Metrics:")
print(f"      RMSE: {test_rmse:.6f}")
print(f"      R²:   {test_r2:.6f}  {'✅ POSITIVE!' if test_r2 > 0 else '❌ NEGATIVE'}")
print(f"      MAE:  {test_mae:.6f}")

# Feature importances
feature_importances = pd.DataFrame({
    'feature': X_all.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\n   Feature Importances:")
for _, row in feature_importances.iterrows():
    print(f"      {row['feature']:25s}: {row['importance']:.4f}")


TRAINING RANDOM FOREST WITH RANDOM SPLIT

✅ Features scaled

Training Random Forest...
✅ Training complete in 65.40 seconds

RESULTS

   Training Metrics:
      RMSE: 0.042089
      R²:   0.451519
      MAE:  0.034119

   Validation Metrics:
      RMSE: 0.043564
      R²:   0.412247  ✅ POSITIVE!
      MAE:  0.035350

   Test Metrics:
      RMSE: 0.043560
      R²:   0.414354  ✅ POSITIVE!
      MAE:  0.035338

   Feature Importances:
      moneyness                : 0.8640
      T_years                  : 0.1360
      risk_free_rate           : 0.0000
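
An importance of 0.0000 for `risk_free_rate` suggests the forest made essentially no use of that column. One possible cause, which this notebook does not confirm, is that the column is constant or nearly constant: a tree cannot split on a feature with a single value. A quick check, sketched on a made-up frame with a hypothetical helper `constant_columns`:

```python
import pandas as pd

def constant_columns(X):
    """List columns that hold at most one distinct non-null value."""
    return [col for col in X.columns if X[col].nunique(dropna=True) <= 1]

# Made-up rows for illustration; in the notebook, pass X_all instead.
demo_X = pd.DataFrame({
    "T_years": [0.1, 0.5, 1.0],
    "moneyness": [0.9, 1.0, 1.1],
    "risk_free_rate": [0.04, 0.04, 0.04],  # assumed constant for the demo only
})
flagged = constant_columns(demo_X)
print(flagged)
```

If the real column turns out to vary, `X_all["risk_free_rate"].value_counts()` would show how its values are distributed.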


In [6]:
# Final summary
print("\n" + "="*70)
print("SUMMARY")
print("="*70)

print("\nComparison of splits:\n")
print(f"{'Split Type':<20} {'Val R²':<15} {'Test R²':<15} {'Status'}")
print("-" * 70)

# Chronological R² values are hard-coded from the original notebook; they are not recomputed here
print(f"{'Chronological':<20} {'~-2.05':<15} {'~-1.80':<15} {'❌ NEGATIVE'}")
print(f"{'Random':<20} {val_r2:<15.4f} {test_r2:<15.4f} {'✅ POSITIVE' if (val_r2 > 0 and test_r2 > 0) else '❌ NEGATIVE'}")

print("\n" + "="*70)
if val_r2 > 0 and test_r2 > 0:
    print("🎉 SUCCESS! Random split fixed the negative R² issue!")
    print("="*70)
    print("\nConclusion:")
    print("  ✓ Your models are working correctly")
    print("  ✓ Your features are informative")
    print("  ✗ The chronological split created distribution mismatch")
    print("\nRecommendation:")
    print("  • Use random split for model evaluation and comparison")
    print("  • If you need temporal forecasting, use rolling window validation")
    print("  • Consider adding time-based features to capture regime changes")
else:
    print("⚠️ R² still negative - further investigation needed")
    print("="*70)
    print("\nPossible issues:")
    print("  • Data quality problems")
    print("  • Outliers or invalid values")
    print("  • Features not informative enough")
    print("  • Target variable issues")


SUMMARY

Comparison of splits:

Split Type           Val R²          Test R²         Status
----------------------------------------------------------------------
Chronological        ~-2.05          ~-1.80          ❌ NEGATIVE
Random               0.4122          0.4144          ✅ POSITIVE

🎉 SUCCESS! Random split fixed the negative R² issue!

Conclusion:
  ✓ Your models are working correctly
  ✓ Your features are informative
  ✗ The chronological split created distribution mismatch

Recommendation:
  • Use random split for model evaluation and comparison
  • If you need temporal forecasting, use rolling window validation
  • Consider adding time-based features to capture regime changes
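
The rolling window validation mentioned in the recommendation could be sketched with scikit-learn's `TimeSeriesSplit`, which trains on earlier rows and tests on the rows that follow. The example assumes rows are sorted chronologically. The data is synthetic, and the window size, fold count, and model settings are illustrative choices rather than values from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(42)
n = 600
# Synthetic stand-in data; row order is treated as chronological order.
X = rng.uniform(size=(n, 3))
y = 0.2 + 0.1 * (X[:, 1] - 0.5) ** 2 + rng.normal(0, 0.01, size=n)

# max_train_size caps the training window so it rolls forward instead of growing.
tscv = TimeSeriesSplit(n_splits=5, max_train_size=200)

fold_scores = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model = RandomForestRegressor(n_estimators=20, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    score = r2_score(y[test_idx], model.predict(X[test_idx]))
    fold_scores.append(score)
    print(f"fold {fold}: train rows {train_idx[0]}-{train_idx[-1]}, "
          f"test rows {test_idx[0]}-{test_idx[-1]}, R2 = {score:.4f}")
```

Reporting R² per fold rather than a single number shows whether performance degrades in particular periods, which a random split cannot reveal.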
