# ICOE Scenarios Demo: Real-World Datasets

This notebook demonstrates ICOE using **Standard Benchmark Datasets** from `sklearn` and `statsmodels`.

**Scenarios & Data:**
1.  **Cross-Sectional Regression**: `California Housing` (sklearn) - Predict House Price.
2.  **Cross-Sectional Classification**: `Breast Cancer` (sklearn) - Predict Benign/Malignant.
3.  **Panel Regression**: `Grunfeld` (statsmodels) - Predict Corporate Investment.
4.  **Panel Classification**: `Grunfeld` (statsmodels) - Predict High/Low Investment.
5.  **Time Series Regression**: `US Macro` (statsmodels) - Predict Real GDP Growth.
6.  **Time Series Classification**: `US Macro` (statsmodels) - Predict Expansion/Recession.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from icoe.estimator import ICOERegressor, ICOEClassifier
from icoe.plotting import plot_optimization_history
from sklearn.metrics import mean_squared_error, roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing, load_breast_cancer
import statsmodels.api as sm

## 1. Cross-Sectional Regression: California Housing
**Scenario**: Predict median house value from block-group demographic and location features. Rows are treated as IID, so a random split is appropriate.

In [None]:
# Load Data
data = fetch_california_housing(as_frame=True)
X = data.data
y = data.target

# 1. Production Validation Split (Random)
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Train ICOE
model = ICOERegressor(n_trials=10, n_phases=1, verbose=1, splitting_strategy='random')
model.fit(X_train, y_train)

# 3. Check Generalization
val_score = model.best_global_score_
holdout_pred = model.predict(X_holdout)
holdout_rmse = np.sqrt(mean_squared_error(y_holdout, holdout_pred))

print(f"Results: Internal RMSE: {val_score:.4f}, Holdout RMSE: {holdout_rmse:.4f}")
plot_optimization_history(model)
plt.show()
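
A holdout RMSE is easier to judge next to a naive baseline that always predicts the training mean. The sketch below illustrates the comparison on small synthetic arrays (not the California Housing data); the helper name `rmse` and the toy values are assumptions made for illustration only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic stand-in targets (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
y_train_toy = rng.normal(loc=2.0, scale=1.0, size=200)
y_holdout_toy = rng.normal(loc=2.0, scale=1.0, size=50)

# Naive baseline: predict the training mean for every holdout row
baseline_pred = np.full_like(y_holdout_toy, y_train_toy.mean())
baseline_rmse = rmse(y_holdout_toy, baseline_pred)

# A model is only useful if its holdout RMSE is clearly below this value
print(f"Baseline RMSE: {baseline_rmse:.4f}")
```

In the notebook itself, the same check would be `rmse(y_holdout, np.full(len(y_holdout), y_train.mean()))`, compared against `holdout_rmse`.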

## 2. Cross-Sectional Classification: Breast Cancer
**Scenario**: Benign (1) vs Malignant (0) classification.

In [None]:
# Load Data
data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target

# 1. Validation Split
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Train ICOE
model = ICOEClassifier(n_trials=10, n_phases=1, verbose=1, splitting_strategy='random', metric='auc')
model.fit(X_train, y_train)

# 3. Check Generalization
val_score = model.best_global_score_
holdout_probs = model.predict_proba(X_holdout)[:, 1]
holdout_auc = roc_auc_score(y_holdout, holdout_probs)

print(f"Results: Internal AUC: {val_score:.4f}, Holdout AUC: {holdout_auc:.4f}")
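
AUC is computed from predicted scores rather than hard class labels, which is why the cell above passes `predict_proba(...)[:, 1]` to `roc_auc_score`. As a rough illustration, AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below implements that definition in NumPy on toy values; `pairwise_auc` is a hypothetical helper, not part of sklearn or ICOE.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum()
    ties = (diff == 0).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))

# Toy labels and scores (made up for illustration)
y_toy = [0, 0, 1, 1]
probs_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = pairwise_auc(y_toy, probs_toy)
print(auc_toy)  # 3 of the 4 pairs are ranked correctly -> 0.75
```

The pairwise comparison is quadratic in the number of rows, so it is meant for intuition on small samples rather than as a replacement for `roc_auc_score`.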

## 3. Panel Regression: Grunfeld Investment Data
**Scenario**: 11 large US firms over 20 years. Predict `invest` based on `value` and `capital`.
**Evaluation**: Leave-Groups-Out (Train on 9 firms, Validate on 2 unseen firms).

In [None]:
# Load Data
grunfeld = sm.datasets.grunfeld.load_pandas().data
print(grunfeld.head())  # Preview the first rows (a bare .head() mid-cell would display nothing)

target_col = 'invest'
group_col = 'firm'
X = grunfeld[['value', 'capital', 'year', 'firm']]
y = grunfeld[target_col]

# 1. Holdout Split (Unseen Firms)
unique_firms = X[group_col].unique()
# Hold out the last two firms in the dataset's order (their names are printed below)
holdout_firms = unique_firms[-2:]
train_firms = unique_firms[:-2]

mask_holdout = X[group_col].isin(holdout_firms)
X_train = X[~mask_holdout].copy()
y_train = y[~mask_holdout]
X_holdout = X[mask_holdout].copy()
y_holdout = y[mask_holdout]

print(f"Training Firms: {train_firms}")
print(f"Holdout Firms: {holdout_firms}")

# 2. Train ICOE (Panel Strategy)
model = ICOERegressor(n_trials=10, n_phases=1, verbose=1, splitting_strategy='panel')
model.fit(X_train, y_train, group_column='firm')

# 3. Validate
# ICOE handles the group column internally during fit, but the predict input must
# match the training features, so drop 'firm' before calling predict.
# Firm identity is not used as a feature here (remaining columns: value, capital, year);
# a fixed-effects variant would encode it as dummy variables instead.
val_score = model.best_global_score_
holdout_pred = model.predict(X_holdout.drop(columns=['firm']))
holdout_rmse = np.sqrt(mean_squared_error(y_holdout, holdout_pred))

print(f"Results: Internal RMSE: {val_score:.4f}, Holdout RMSE: {holdout_rmse:.4f}")
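
The unseen-firms holdout above generalizes to any panel: whole groups are assigned to the holdout so that no entity appears on both sides of the split. A minimal pandas sketch on a synthetic panel follows. `split_by_group` and the toy values are hypothetical, and this does not describe how ICOE's `'panel'` strategy splits data internally.

```python
import pandas as pd

def split_by_group(df, group_col, holdout_groups):
    """Return (train, holdout) so that each group lands entirely on one side."""
    mask = df[group_col].isin(holdout_groups)
    return df[~mask].copy(), df[mask].copy()

# Synthetic panel: 3 firms x 3 years (made-up values)
panel = pd.DataFrame({
    "firm": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "year": [1935, 1936, 1937] * 3,
    "value": [10.0, 11.0, 12.0, 20.0, 21.0, 22.0, 30.0, 31.0, 32.0],
})

train_df, holdout_df = split_by_group(panel, "firm", ["C"])

# Sanity check: no firm is shared between the two sets
overlap = set(train_df["firm"]) & set(holdout_df["firm"])
assert not overlap
print(len(train_df), len(holdout_df))
```

Checking for an empty overlap is a cheap guard against group leakage whenever the holdout list is built by hand.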

## 5. Time Series Regression: US Macroeconomic Data
**Scenario**: Predict quarter-over-quarter real GDP growth (percent change in `realgdp`).
**Evaluation**: Out-of-Time (quarters from 2000 onward, roughly the last 20%).
**Feature**: Uses `embargo=90` (about one quarter, expressed in days) to prevent look-ahead bias.
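
An embargo leaves a gap between the end of the training window and the start of the validation window, so that observations near the boundary cannot inform both sides. The sketch below shows one way to build such a split in pandas on a toy daily series. `split_with_embargo` is a hypothetical helper and is not claimed to reproduce ICOE's internal `'timeseries'` logic.

```python
import pandas as pd

def split_with_embargo(df, time_col, split_date, embargo_days):
    """Train on rows before (split_date - embargo); validate on rows from split_date on."""
    split_ts = pd.Timestamp(split_date)
    train_end = split_ts - pd.Timedelta(days=embargo_days)
    train = df[df[time_col] < train_end]
    valid = df[df[time_col] >= split_ts]
    return train, valid

# Toy daily series (made-up data)
toy = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "x": range(10),
})

train_toy, valid_toy = split_with_embargo(toy, "date", "2020-01-08", embargo_days=3)

# The three days before the split date belong to neither set
gap_days = (valid_toy["date"].min() - train_toy["date"].max()).days
print(len(train_toy), len(valid_toy), gap_days)
```

With quarterly data, an `embargo_days` of about 90 would drop roughly one quarter before the split date.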

In [None]:
# Load Data
macro = sm.datasets.macrodata.load_pandas().data

# Create a quarterly datetime column (quarter-end dates starting in 1959Q1)
macro['date'] = pd.date_range(start='1959-01-01', periods=len(macro), freq='Q')

# Target: Quarter-over-Quarter Real GDP Growth
macro['gdp_growth'] = macro['realgdp'].pct_change()

# Features: lagged macro indicators (inflation, interest rates, etc.)
# To avoid leakage, predict Growth(t) using only Data(t-1):
# features are shifted by one quarter, the target is left unshifted.

features = ['realcons', 'realinv', 'realgovt', 'realdpi', 'cpi', 'm1', 'tbilrate', 'unemp', 'pop', 'infl', 'realint']
X = macro[features].shift(1)
y = macro['gdp_growth']
X['date'] = macro['date'] # Keep date aligned

# Drop NA from lagging
mask = ~X[features[0]].isna() & ~y.isna()
X = X[mask]
y = y[mask]

# 1. Out-of-Time Split
split_date = '2000-01-01'
mask_train = X['date'] < split_date
mask_holdout = X['date'] >= split_date

X_train = X[mask_train]
y_train = y[mask_train]
X_holdout = X[mask_holdout]
y_holdout = y[mask_holdout]

print(f"Train End: {X_train['date'].max()}, Holdout Start: {X_holdout['date'].min()}")

# 2. Train ICOE (Time Series)
# History tuning: does a 10-year training window beat a longer one?
# Important: the logic assumes days when the time column is datetime, so the
# candidate windows (40, 80, 200 quarters) are converted at ~90 days per quarter.
search_space = {'train_history_days': [40 * 90, 80 * 90, 200 * 90]}

model = ICOERegressor(n_trials=10, n_phases=1, verbose=1, splitting_strategy='timeseries', embargo=90, search_space=search_space)
model.fit(X_train, y_train, time_column='date')

# 3. Validate
val_score = model.best_global_score_
holdout_pred = model.predict(X_holdout.drop(columns=['date']))
holdout_rmse = np.sqrt(mean_squared_error(y_holdout, holdout_pred))

print(f"Results: Internal RMSE: {val_score:.4f}, Holdout RMSE: {holdout_rmse:.4f}")
plot_optimization_history(model)
plt.show()
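
The lagging step in the cell above (`shift(1)` on the features, no shift on the target) pairs each quarter's growth with the previous quarter's indicators. A toy frame makes the alignment visible; the column names and values below are made up for illustration.

```python
import pandas as pd

# Toy quarterly data (hypothetical values)
toy = pd.DataFrame({
    "quarter": ["Q1", "Q2", "Q3", "Q4"],
    "rate": [1.0, 1.5, 2.0, 2.5],
    "gdp": [100.0, 102.0, 101.0, 104.0],
})

# Target at time t: growth from t-1 to t
toy["growth"] = toy["gdp"].pct_change()

# Feature at time t: the indicator observed at t-1
toy["rate_lag1"] = toy["rate"].shift(1)

# The first row has no previous quarter, so both columns are NaN there
aligned = toy.dropna(subset=["growth", "rate_lag1"])
print(aligned[["quarter", "rate_lag1", "growth"]])
```

Each remaining row now uses only information available before the quarter being predicted.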