# Market Prediction - Enhanced v1 (1105)

**Date**: 2025-11-05  
**Improvements from 1104_v1**:
- 🚀 Advanced feature engineering (100+ features)
- 🤖 XGBoost + LightGBM ensemble
- 📊 Continuous allocation strategy with volatility scaling
- 📤 Kaggle submission API implementation

**Target**: Kaggle-ready submission with improved performance

## 1. Setup & Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import ParameterGrid
import lightgbm as lgb
import xgboost as xgb
import warnings
import pickle
import os
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print('🚀 Libraries loaded!')
print(f'LightGBM version: {lgb.__version__}')
print(f'XGBoost version: {xgb.__version__}')

In [None]:
# Load data
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

print(f'Train shape: {train.shape}')
print(f'Test shape: {test.shape}')

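# Keep only mostly complete rows (EDA suggested the data is complete from roughly date_id >= 5540);
# the filter below works on the per-row missing count rather than on date_id directly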
feature_cols = [c for c in train.columns if c not in ['date_id', 'forward_returns', 'risk_free_rate', 'market_forward_excess_returns']]
train['n_missing'] = train[feature_cols].isnull().sum(axis=1)

# Keep data with <5% missing
threshold_missing = len(feature_cols) * 0.05
df = train[train['n_missing'] < threshold_missing].copy()

print(f'\nUsing complete data: {len(df)} days')
print(f'Date range: {df["date_id"].min()} to {df["date_id"].max()}')

## 2. Advanced Feature Engineering

### Strategy:
1. **Lag features**: 1, 5, 10, 20, 60 days
2. **Rolling statistics**: mean, std (5, 20, 60 windows)
3. **Momentum indicators**: 5, 20, 60 days
4. **Volatility features**: Historical volatility, volatility of volatility
5. **Market regime features**: Bull/bear, high/low volatility
6. **Cross-sectional features**: Category means, dispersions
7. **Feature interactions**: Top feature combinations
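
The lag, rolling, and momentum steps in the list above can be illustrated on a toy column. This is a minimal sketch with made-up values and a hypothetical column name `x`, not the notebook's feature set. The key point is that anything derived from the target is shifted by one row before rolling, so row t only sees rows before t.

```python
import numpy as np
import pandas as pd

# Toy daily series standing in for one column (hypothetical values)
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# Lag feature: the value observed one row earlier
toy['x_lag1'] = toy['x'].shift(1)

# Rolling statistics over a trailing window that includes the current row
toy['x_mean3'] = toy['x'].rolling(3).mean()
toy['x_std3'] = toy['x'].rolling(3).std()

# Momentum-style sum: shift first, then roll, so row t uses rows t-3..t-1 only
toy['x_mom3'] = toy['x'].shift(1).rolling(3).sum()

print(toy)
```

The first rows of each derived column are NaN until the window fills, which is why the notebook drops incomplete rows after feature engineering.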

In [None]:
def create_advanced_features(data, feature_cols):
    """
    Create advanced feature engineering
    """
    df = data.copy()
    
    print('Starting feature engineering...')
    
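    # Select top correlated features for lag/rolling (reduce computation).
    # Note: the correlation is computed on every row passed in, validation rows included,
    # so the selection sees a little future information.
    if 'forward_returns' in df.columns:
        correlations = df[feature_cols + ['forward_returns']].corr()['forward_returns'].drop('forward_returns')
        top_features = correlations.abs().nlargest(15).index.tolist()
    else:
        # For the test set, use a predefined list. It must match the features selected
        # on the training data, otherwise the lag/rolling column names will not line up.
        top_features = ['P8', 'P10', 'S5', 'E3', 'M3', 'V13', 'P11', 'E2', 'P12', 'M1', 'S7', 'E5', 'M9', 'V7', 'I5']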
    
    print(f'Using {len(top_features)} top features for intensive engineering')
    
    # 1. Lag features (1, 5, 10, 20, 60)
    print('Creating lag features...')
    for feat in top_features:
        for lag in [1, 5, 10, 20, 60]:
            df[f'{feat}_lag{lag}'] = df[feat].shift(lag)
    
    # 2. Rolling statistics
    print('Creating rolling statistics...')
    for feat in top_features:
        for window in [5, 20, 60]:
            df[f'{feat}_mean{window}'] = df[feat].rolling(window).mean()
            df[f'{feat}_std{window}'] = df[feat].rolling(window).std()
    
    # 3. Returns-based features
    if 'forward_returns' in df.columns:
        print('Creating returns-based features...')
        # Historical returns (shifted to avoid leakage)
        df['returns_lag1'] = df['forward_returns'].shift(1)
        df['returns_lag5'] = df['forward_returns'].shift(5)
        
        # Momentum
        df['momentum_5'] = df['forward_returns'].shift(1).rolling(5).sum()
        df['momentum_20'] = df['forward_returns'].shift(1).rolling(20).sum()
        df['momentum_60'] = df['forward_returns'].shift(1).rolling(60).sum()
        
        # Rolling returns statistics
        df['returns_mean_20'] = df['forward_returns'].shift(1).rolling(20).mean()
        df['returns_std_20'] = df['forward_returns'].shift(1).rolling(20).std()
        df['returns_mean_60'] = df['forward_returns'].shift(1).rolling(60).mean()
        df['returns_std_60'] = df['forward_returns'].shift(1).rolling(60).std()
        
        # Volatility
        df['volatility_5'] = df['forward_returns'].shift(1).rolling(5).std()
        df['volatility_20'] = df['forward_returns'].shift(1).rolling(20).std()
        df['volatility_60'] = df['forward_returns'].shift(1).rolling(60).std()
        
        # Volatility of volatility
        df['vol_of_vol_20'] = df['volatility_20'].rolling(20).std()
        
        # Volatility regime
        vol_percentile = df['volatility_20'].rolling(252).rank(pct=True)
        df['vol_regime_low'] = (vol_percentile < 0.33).astype(int)
        df['vol_regime_high'] = (vol_percentile > 0.67).astype(int)
        
        # Returns regime
        df['bull_regime'] = (df['returns_mean_60'] > 0).astype(int)
        
        # Risk-adjusted momentum
        df['risk_adj_momentum_20'] = df['momentum_20'] / (df['volatility_20'] + 1e-8)
        df['risk_adj_momentum_60'] = df['momentum_60'] / (df['volatility_60'] + 1e-8)
    
    # 4. Cross-sectional features (category aggregations)
    print('Creating cross-sectional features...')
    feature_categories = {
        'M': [c for c in feature_cols if c.startswith('M')],
        'E': [c for c in feature_cols if c.startswith('E')],
        'I': [c for c in feature_cols if c.startswith('I')],
        'P': [c for c in feature_cols if c.startswith('P')],
        'V': [c for c in feature_cols if c.startswith('V')],
        'S': [c for c in feature_cols if c.startswith('S')],
    }
    
    for cat, feats in feature_categories.items():
        if feats:
            df[f'{cat}_mean'] = df[feats].mean(axis=1)
            df[f'{cat}_std'] = df[feats].std(axis=1)
            df[f'{cat}_min'] = df[feats].min(axis=1)
            df[f'{cat}_max'] = df[feats].max(axis=1)
    
    # 5. Feature interactions (top pairs)
    print('Creating feature interactions...')
    if len(top_features) >= 2:
        # Create interactions for top 5 features
        for i in range(min(5, len(top_features))):
            for j in range(i+1, min(5, len(top_features))):
                feat1, feat2 = top_features[i], top_features[j]
                df[f'{feat1}_x_{feat2}'] = df[feat1] * df[feat2]
    
    # 6. Missing value indicator
    df['n_missing_orig'] = df['n_missing']
    
    print(f'Feature engineering complete! Shape: {df.shape}')
    return df

# Apply feature engineering
df_engineered = create_advanced_features(df, feature_cols)

# Drop rows with NaN from rolling/lag features
df_clean = df_engineered.dropna()
print(f'\nFinal data shape after dropping NaN: {df_clean.shape}')

# Identify feature list (exclude targets and meta)
exclude_cols = ['date_id', 'forward_returns', 'risk_free_rate', 'market_forward_excess_returns', 'n_missing']
feature_list = [c for c in df_clean.columns if c not in exclude_cols]
print(f'Total features: {len(feature_list)}')

## 3. Competition Metric & Validation Setup
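
As a reading aid, the score computed by `calculate_competition_score` in the next cell can be written out as follows. This is transcribed from that function rather than from an official specification. The symbols are introduced here for illustration: $w_t$ is the position, $r_t$ the forward return, $r^f_t$ the risk-free rate, $T$ the number of days, and $\sigma(\cdot)$ the sample standard deviation.

$$
r^{s}_t = (1 - w_t)\, r^{f}_t + w_t\, r_t, \qquad
\bar{e}^{s} = \Big(\prod_{t=1}^{T} \big(1 + r^{s}_t - r^{f}_t\big)\Big)^{1/T} - 1, \qquad
\bar{e}^{m} = \Big(\prod_{t=1}^{T} \big(1 + r_t - r^{f}_t\big)\Big)^{1/T} - 1
$$

$$
\text{sharpe} = \frac{\bar{e}^{s}}{\sigma(r^{s})}\sqrt{252}, \qquad
\text{vol\_penalty} = 1 + \max\!\Big(0,\ \frac{\sigma(r^{s})}{\sigma(r)} - 1.2\Big)
$$

$$
\text{return\_penalty} = 1 + \frac{\max\!\big(0,\ (\bar{e}^{m} - \bar{e}^{s}) \cdot 100 \cdot 252\big)^{2}}{100}, \qquad
\text{score} = \frac{\text{sharpe}}{\text{vol\_penalty} \cdot \text{return\_penalty}}
$$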

In [None]:
def calculate_competition_score(returns, risk_free_rate, position):
    """
    Calculate competition metric (volatility-adjusted Sharpe ratio)
    """
    MIN_INVESTMENT = 0
    MAX_INVESTMENT = 2
    
    if isinstance(position, (int, float)):
        position = pd.Series([position] * len(returns), index=returns.index)
    
    # Validate
    if position.max() > MAX_INVESTMENT or position.min() < MIN_INVESTMENT:
        return {'error': 'Position out of range'}
    
    # Strategy returns
    strategy_returns = risk_free_rate * (1 - position) + position * returns
    
    # Strategy metrics
    strategy_excess_returns = strategy_returns - risk_free_rate
    strategy_excess_cumulative = (1 + strategy_excess_returns).prod()
    strategy_mean_excess_return = (strategy_excess_cumulative) ** (1 / len(returns)) - 1
    strategy_std = strategy_returns.std()
    
    trading_days_per_yr = 252
    if strategy_std == 0:
        return {'error': 'Strategy std is zero'}
    
    sharpe = strategy_mean_excess_return / strategy_std * np.sqrt(trading_days_per_yr)
    strategy_volatility = float(strategy_std * np.sqrt(trading_days_per_yr) * 100)
    
    # Market metrics
    market_excess_returns = returns - risk_free_rate
    market_excess_cumulative = (1 + market_excess_returns).prod()
    market_mean_excess_return = (market_excess_cumulative) ** (1 / len(returns)) - 1
    market_std = returns.std()
    market_volatility = float(market_std * np.sqrt(trading_days_per_yr) * 100)
    
    if market_volatility == 0:
        return {'error': 'Market std is zero'}
    
    # Penalties
    excess_vol = max(0, strategy_volatility / market_volatility - 1.2)
    vol_penalty = 1 + excess_vol
    
    return_gap = max(0, (market_mean_excess_return - strategy_mean_excess_return) * 100 * trading_days_per_yr)
    return_penalty = 1 + (return_gap**2) / 100
    
    # Adjusted Sharpe
    adjusted_sharpe = sharpe / (vol_penalty * return_penalty)
    
    return {
        'score': min(float(adjusted_sharpe), 1_000_000),
        'sharpe': sharpe,
        'strategy_volatility': strategy_volatility,
        'market_volatility': market_volatility,
        'vol_ratio': strategy_volatility / market_volatility,
        'vol_penalty': vol_penalty,
        'return_penalty': return_penalty,
    }

def walk_forward_split(df, n_splits=3):
    """
    Time-based walk-forward split
    """
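    total_len = len(df)
    # Two blocks seed the first training window and each fold validates on one more block,
    # so n_splits folds need n_splits + 2 equal blocks (dividing by n_splits + 1 drops the last fold)
    val_size = total_len // (n_splits + 2)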
    
    splits = []
    for i in range(n_splits):
        train_end = val_size * (i + 2)
        val_start = train_end
        val_end = train_end + val_size
        
        if val_end > total_len:
            break
        
        train_idx = df.index[:train_end]
        val_idx = df.index[val_start:val_end]
        
        if len(val_idx) > 0:
            splits.append((train_idx, val_idx))
    
    return splits

# Create splits
splits = walk_forward_split(df_clean, n_splits=3)
print(f'Created {len(splits)} walk-forward splits:')
for i, (train_idx, val_idx) in enumerate(splits):
    print(f'  Split {i+1}: Train={len(train_idx)}, Val={len(val_idx)}')
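
For comparison, scikit-learn's `TimeSeriesSplit` produces a similar expanding-window scheme. The sketch below runs it on 20 toy rows to show the property a walk-forward split should have: every training index precedes every validation index. It is an illustration only and is not the splitter used in this notebook.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 toy rows standing in for chronologically ordered days
rows = np.arange(20)

tscv = TimeSeriesSplit(n_splits=3)
folds = [(train_idx, val_idx) for train_idx, val_idx in tscv.split(rows)]

for i, (train_idx, val_idx) in enumerate(folds, start=1):
    # Training always ends before validation starts, so no future rows leak in
    print(f'Fold {i}: train=[{train_idx.min()}..{train_idx.max()}], '
          f'val=[{val_idx.min()}..{val_idx.max()}]')
```

The training window grows with each fold while the validation block stays the same size, which mirrors how the model would be retrained as new data arrives.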

## 4. Model Training - LightGBM & XGBoost

In [None]:
def train_lgb_model(train_data, val_data, features, params=None):
    """
    Train LightGBM model
    """
    if params is None:
        params = {
            'objective': 'regression',
            'metric': 'rmse',
            'num_leaves': 31,
            'learning_rate': 0.05,
            'feature_fraction': 0.7,
            'max_depth': 5,
            'verbose': -1,
            'seed': 42,
        }
    
    X_train = train_data[features]
    y_train = train_data['forward_returns']
    X_val = val_data[features]
    y_val = val_data['forward_returns']
    
    train_set = lgb.Dataset(X_train, y_train)
    val_set = lgb.Dataset(X_val, y_val, reference=train_set)
    
    model = lgb.train(
        params,
        train_set,
        num_boost_round=200,
        valid_sets=[val_set],
        callbacks=[lgb.early_stopping(stopping_rounds=30, verbose=False)]
    )
    
    return model

def train_xgb_model(train_data, val_data, features, params=None):
    """
    Train XGBoost model
    """
    if params is None:
        params = {
            'objective': 'reg:squarederror',
            'max_depth': 5,
            'learning_rate': 0.05,
            'subsample': 0.8,
            'colsample_bytree': 0.7,
            'seed': 42,
        }
    
    X_train = train_data[features]
    y_train = train_data['forward_returns']
    X_val = val_data[features]
    y_val = val_data['forward_returns']
    
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval = xgb.DMatrix(X_val, label=y_val)
    
    evals = [(dval, 'validation')]
    model = xgb.train(
        params,
        dtrain,
        num_boost_round=200,
        evals=evals,
        early_stopping_rounds=30,
        verbose_eval=False
    )
    
    return model

print('Model training functions ready!')

## 5. Continuous Allocation Strategy

### Approach:
- Use model prediction as base signal
- Apply a tanh transformation for continuous allocation (0-2 range)
- Dynamic volatility scaling to respect 120% constraint
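
To get a feel for the mapping, the sketch below applies the same `base + tanh(prediction * sensitivity)` rule to a few hypothetical predicted daily returns. The helper name `toy_allocation` and the sample predictions are illustrative only.

```python
import numpy as np

def toy_allocation(pred, base=1.0, sensitivity=100):
    # tanh squashes pred * sensitivity into (-1, 1); adding base centres the result at 1.0
    return np.clip(base + np.tanh(pred * sensitivity), 0.0, 2.0)

# Hypothetical predicted daily returns: -2%, -0.5%, 0%, +0.5%, +2%
preds = np.array([-0.02, -0.005, 0.0, 0.005, 0.02])
alloc = toy_allocation(preds)

for p, a in zip(preds, alloc):
    print(f'predicted return {p:+.3f} -> allocation {a:.3f}')
```

With `sensitivity=100`, a predicted daily return of about ±2% already pushes the allocation close to the bounds of 0 and 2, so the sensitivity value largely determines how often the strategy sits at the extremes.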

In [None]:
def predict_allocation(predicted_returns, recent_volatility=None, market_vol=None, 
                       base_allocation=1.0, sensitivity=100):
    """
    Convert predicted returns to continuous allocation (0-2)
    
    Args:
        predicted_returns: Model predictions
        recent_volatility: Recent strategy volatility (optional)
        market_vol: Market volatility (optional)
        base_allocation: Center point (default 1.0)
        sensitivity: How sensitive allocation is to predictions (higher = more extreme)
    
    Returns:
        allocation: Continuous values in [0, 2]
    """
    # tanh transformation: maps predictions to (base_allocation - 1, base_allocation + 1),
    # i.e. (0, 2) for the default base_allocation of 1.0
    # predicted_returns near 0 -> allocation near 1.0
    # positive predictions -> higher allocation
    # negative predictions -> lower allocation
    
    allocation = base_allocation + np.tanh(predicted_returns * sensitivity)
    
    # Volatility-based scaling
    if recent_volatility is not None and market_vol is not None and market_vol > 0:
        vol_ratio = recent_volatility / market_vol
        
        # If approaching 120% threshold, scale down
        if vol_ratio > 1.1:  # Start scaling at 110%
            scaling_factor = 1.1 / vol_ratio
            allocation = allocation * scaling_factor
    
    # Clip to valid range
    allocation = np.clip(allocation, 0, 2)
    
    return allocation

print('Allocation strategy function ready!')

## 6. Train Ensemble Models

In [None]:
# Parameters used for cross-validation and the final models
# (currently identical to the defaults in the training functions; can be tuned further)
lgb_params = {
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.7,
    'max_depth': 5,
    'verbose': -1,
    'seed': 42,
}

xgb_params = {
    'objective': 'reg:squarederror',
    'max_depth': 5,
    'learning_rate': 0.05,
    'subsample': 0.8,
    'colsample_bytree': 0.7,
    'seed': 42,
}

# Cross-validation with ensemble
results = []

for fold_idx, (train_idx, val_idx) in enumerate(splits):
    print(f'\n{"="*80}')
    print(f'FOLD {fold_idx + 1}/{len(splits)}')
    print(f'{"="*80}')
    
    train_data = df_clean.loc[train_idx]
    val_data = df_clean.loc[val_idx].copy()
    
    # Train LightGBM
    print('Training LightGBM...')
    lgb_model = train_lgb_model(train_data, val_data, feature_list, lgb_params)
    val_data['pred_lgb'] = lgb_model.predict(val_data[feature_list])
    
    # Train XGBoost
    print('Training XGBoost...')
    xgb_model = train_xgb_model(train_data, val_data, feature_list, xgb_params)
    val_data['pred_xgb'] = xgb_model.predict(xgb.DMatrix(val_data[feature_list]))
    
    # Ensemble prediction (weighted average)
    val_data['pred_ensemble'] = 0.5 * val_data['pred_lgb'] + 0.5 * val_data['pred_xgb']
    
    # Calculate allocation
    val_data['allocation'] = predict_allocation(
        val_data['pred_ensemble'],
        sensitivity=100,
        base_allocation=1.0
    )
    
    # Calculate score
    score_result = calculate_competition_score(
        val_data['forward_returns'],
        val_data['risk_free_rate'],
        val_data['allocation']
    )
    
    if 'error' not in score_result:
        print(f'\nFold {fold_idx + 1} Results:')
        print(f'  Score: {score_result["score"]:.4f}')
        print(f'  Sharpe: {score_result["sharpe"]:.4f}')
        print(f'  Vol Ratio: {score_result["vol_ratio"]:.4f}')
        print(f'  Vol Penalty: {score_result["vol_penalty"]:.4f}')
        print(f'  Return Penalty: {score_result["return_penalty"]:.4f}')
        
        results.append({
            'fold': fold_idx + 1,
            'score': score_result['score'],
            'sharpe': score_result['sharpe'],
            'vol_ratio': score_result['vol_ratio'],
        })
    else:
        print(f'Error in fold {fold_idx + 1}: {score_result["error"]}')

# Summary
print(f'\n{"="*80}')
print('CROSS-VALIDATION SUMMARY')
print(f'{"="*80}')
results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
print(f'\nMean Score: {results_df["score"].mean():.4f} (+/- {results_df["score"].std():.4f})')
print(f'Mean Sharpe: {results_df["sharpe"].mean():.4f} (+/- {results_df["sharpe"].std():.4f})')
print(f'Mean Vol Ratio: {results_df["vol_ratio"].mean():.4f}')

## 7. Train Final Models (80/20 Chronological Split)

In [None]:
# Use the first 80% (chronologically) for training and the last 20% for final validation
split_point = int(len(df_clean) * 0.8)
final_train = df_clean.iloc[:split_point]
final_val = df_clean.iloc[split_point:].copy()

print(f'Final training set: {len(final_train)} days')
print(f'Final validation set: {len(final_val)} days')

# Train final LightGBM
print('\nTraining final LightGBM model...')
final_lgb = train_lgb_model(final_train, final_val, feature_list, lgb_params)

# Train final XGBoost
print('Training final XGBoost model...')
final_xgb = train_xgb_model(final_train, final_val, feature_list, xgb_params)

print('\n✅ Final models trained!')

## 8. Final Validation Performance

In [None]:
# Predict on final validation
final_val['pred_lgb'] = final_lgb.predict(final_val[feature_list])
final_val['pred_xgb'] = final_xgb.predict(xgb.DMatrix(final_val[feature_list]))
final_val['pred_ensemble'] = 0.5 * final_val['pred_lgb'] + 0.5 * final_val['pred_xgb']

# Calculate allocation
final_val['allocation'] = predict_allocation(
    final_val['pred_ensemble'],
    sensitivity=100,
    base_allocation=1.0
)

# Calculate scores
ensemble_score = calculate_competition_score(
    final_val['forward_returns'],
    final_val['risk_free_rate'],
    final_val['allocation']
)

baseline_score = calculate_competition_score(
    final_val['forward_returns'],
    final_val['risk_free_rate'],
    1.0
)

print('='*80)
print('FINAL VALIDATION RESULTS')
print('='*80)

print('\nEnsemble Model:')
for key, value in ensemble_score.items():
    if isinstance(value, float):
        print(f'  {key:25s}: {value:.4f}')

print('\nBaseline (100% invested):')
for key, value in baseline_score.items():
    if isinstance(value, float):
        print(f'  {key:25s}: {value:.4f}')

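if 'error' not in ensemble_score and 'error' not in baseline_score:
    print(f'\n{"="*80}')
    if baseline_score['score'] > 0:
        # A relative improvement is only meaningful when the baseline score is positive
        improvement = (ensemble_score['score'] / baseline_score['score'] - 1) * 100
        print(f'IMPROVEMENT OVER BASELINE: {improvement:+.2f}%')
    else:
        difference = ensemble_score['score'] - baseline_score['score']
        print(f'SCORE DIFFERENCE VS BASELINE: {difference:+.4f}')
    print(f'{"="*80}')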

## 9. Visualization

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# 1. Allocation distribution
axes[0, 0].hist(final_val['allocation'], bins=50, edgecolor='black', alpha=0.7, color='green')
axes[0, 0].axvline(final_val['allocation'].mean(), color='red', linestyle='--', linewidth=2, 
                   label=f'Mean: {final_val["allocation"].mean():.2f}')
axes[0, 0].set_xlabel('Allocation', fontsize=11)
axes[0, 0].set_ylabel('Frequency', fontsize=11)
axes[0, 0].set_title('Allocation Distribution', fontsize=12, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Allocation over time
axes[0, 1].plot(range(len(final_val)), final_val['allocation'], linewidth=1, alpha=0.7)
axes[0, 1].axhline(1.0, color='red', linestyle='--', linewidth=1, label='Baseline (1.0)')
axes[0, 1].set_xlabel('Time', fontsize=11)
axes[0, 1].set_ylabel('Allocation', fontsize=11)
axes[0, 1].set_title('Allocation Over Time', fontsize=12, fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 3. Cumulative returns
strategy_returns = final_val['risk_free_rate'] * (1 - final_val['allocation']) + final_val['allocation'] * final_val['forward_returns']
strategy_cumulative = (1 + strategy_returns).cumprod()
baseline_cumulative = (1 + final_val['forward_returns']).cumprod()

axes[1, 0].plot(range(len(final_val)), strategy_cumulative, linewidth=2, label='Our Strategy', color='green')
axes[1, 0].plot(range(len(final_val)), baseline_cumulative, linewidth=2, label='Baseline', color='blue', linestyle='--')
axes[1, 0].set_xlabel('Time', fontsize=11)
axes[1, 0].set_ylabel('Cumulative Return', fontsize=11)
axes[1, 0].set_title('Cumulative Returns Comparison', fontsize=12, fontweight='bold')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# 4. Prediction vs actual
axes[1, 1].scatter(final_val['pred_ensemble'], final_val['forward_returns'], alpha=0.5, s=20)
axes[1, 1].plot([final_val['pred_ensemble'].min(), final_val['pred_ensemble'].max()],
                [final_val['pred_ensemble'].min(), final_val['pred_ensemble'].max()],
                'r--', linewidth=2)
axes[1, 1].set_xlabel('Predicted Returns', fontsize=11)
axes[1, 1].set_ylabel('Actual Returns', fontsize=11)
axes[1, 1].set_title('Predicted vs Actual', fontsize=12, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Correlation
from scipy.stats import pearsonr, spearmanr
pearson_r, _ = pearsonr(final_val['pred_ensemble'], final_val['forward_returns'])
spearman_r, _ = spearmanr(final_val['pred_ensemble'], final_val['forward_returns'])
print(f'\nPrediction Quality:')
print(f'  Pearson correlation: {pearson_r:.4f}')
print(f'  Spearman correlation: {spearman_r:.4f}')

## 10. Save Models

In [None]:
# Create models directory
os.makedirs('models', exist_ok=True)

# Save models
final_lgb.save_model('models/lgb_final.txt')
final_xgb.save_model('models/xgb_final.json')

# Save feature list and metadata
metadata = {
    'feature_list': feature_list,
    'n_features': len(feature_list),
    'cv_mean_score': results_df['score'].mean(),
    'cv_std_score': results_df['score'].std(),
    'final_val_score': ensemble_score['score'],
    'lgb_params': lgb_params,
    'xgb_params': xgb_params,
}

with open('models/metadata.pkl', 'wb') as f:
    pickle.dump(metadata, f)

print('✅ Models saved to models/ directory!')
print(f'   - lgb_final.txt')
print(f'   - xgb_final.json')
print(f'   - metadata.pkl')

## 11. Kaggle Submission Code

This section creates the submission file that works with Kaggle's evaluation API.
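
Because the evaluation API serves test data batch by batch, lag and rolling features need a buffer of recent rows that is extended after every call. The sketch below shows that pattern on a toy column. The names `features_for_batch`, `x`, and the sample values are hypothetical. The new rows are selected by position because index labels can repeat between the buffer and successive batches.

```python
import pandas as pd

HISTORY_ROWS = 300  # buffer length, matching the 300 days kept in the submission code

def features_for_batch(history, batch, window=3):
    """Compute a rolling mean over history + batch and return only the batch rows."""
    combined = pd.concat([history, batch], ignore_index=True)
    combined['x_mean'] = combined['x'].rolling(window).mean()
    # Select the new rows by position rather than by index label
    return combined.iloc[-len(batch):]

# Seed the buffer with toy "training" rows
history = pd.DataFrame({'date_id': [1, 2, 3], 'x': [1.0, 2.0, 3.0]})

# Simulate two single-row batches arriving one after the other
for date_id, x in [(4, 4.0), (5, 5.0)]:
    batch = pd.DataFrame({'date_id': [date_id], 'x': [x]})
    out = features_for_batch(history, batch)
    print(out[['date_id', 'x_mean']].to_dict('records'))
    # Extend the buffer and trim it to the most recent rows
    history = pd.concat([history, batch], ignore_index=True).tail(HISTORY_ROWS)
```

The buffer has to be at least as long as the longest lookback used by any feature, otherwise those features are NaN at inference time.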

In [None]:
submission_code = '''
"""
Hull Tactical Market Prediction - Kaggle Submission
Enhanced v1 with LightGBM + XGBoost Ensemble
"""

import os
import pandas as pd
import numpy as np
import lightgbm as lgb
import xgboost as xgb
import pickle

import kaggle_evaluation.default_inference_server

# Load models and metadata
lgb_model = lgb.Booster(model_file='models/lgb_final.txt')
xgb_model = xgb.Booster()
xgb_model.load_model('models/xgb_final.json')

with open('models/metadata.pkl', 'rb') as f:
    metadata = pickle.load(f)

feature_list = metadata['feature_list']

# Feature engineering function
def create_features_for_prediction(test_data, historical_data=None):
    """
    Create features for prediction (must match training features)
    
    Args:
        test_data: Current test data
        historical_data: Historical data for lag/rolling features
    """
    # Combine with historical data for lag/rolling calculations
    if historical_data is not None:
        combined = pd.concat([historical_data, test_data], ignore_index=False)
    else:
        combined = test_data.copy()
    
    df = combined.copy()
    
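    # Original features: exclude ids and targets (as in training) plus the helper columns
    # served at inference. The is_scored and lagged_* names are assumed from the
    # competition's test schema; names that are absent are simply ignored.
    non_feature_cols = ['date_id', 'forward_returns', 'risk_free_rate', 'market_forward_excess_returns',
                        'is_scored', 'lagged_forward_returns', 'lagged_risk_free_rate',
                        'lagged_market_forward_excess_returns']
    feature_cols = [c for c in df.columns if c not in non_feature_cols]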
    
    # Top features for lag/rolling (from training)
    top_features = ['P8', 'P10', 'S5', 'E3', 'M3', 'V13', 'P11', 'E2', 'P12', 'M1', 'S7', 'E5', 'M9', 'V7', 'I5']
    
    # Lag features
    for feat in top_features:
        if feat in df.columns:
            for lag in [1, 5, 10, 20, 60]:
                df[f'{feat}_lag{lag}'] = df[feat].shift(lag)
    
    # Rolling statistics
    for feat in top_features:
        if feat in df.columns:
            for window in [5, 20, 60]:
                df[f'{feat}_mean{window}'] = df[feat].rolling(window).mean()
                df[f'{feat}_std{window}'] = df[feat].rolling(window).std()
    
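    # Returns-based features (using lagged_forward_returns if available).
    # Assuming lagged_forward_returns is forward_returns delayed by one day, it already
    # plays the role of forward_returns.shift(1) in training, so lag 1 needs no extra shift.
    if 'lagged_forward_returns' in df.columns:
        returns_col = 'lagged_forward_returns'
        
        df['returns_lag1'] = df[returns_col]
        df['returns_lag5'] = df[returns_col].shift(4)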
        
        df['momentum_5'] = df[returns_col].rolling(5).sum()
        df['momentum_20'] = df[returns_col].rolling(20).sum()
        df['momentum_60'] = df[returns_col].rolling(60).sum()
        
        df['returns_mean_20'] = df[returns_col].rolling(20).mean()
        df['returns_std_20'] = df[returns_col].rolling(20).std()
        df['returns_mean_60'] = df[returns_col].rolling(60).mean()
        df['returns_std_60'] = df[returns_col].rolling(60).std()
        
        df['volatility_5'] = df[returns_col].rolling(5).std()
        df['volatility_20'] = df[returns_col].rolling(20).std()
        df['volatility_60'] = df[returns_col].rolling(60).std()
        
        df['vol_of_vol_20'] = df['volatility_20'].rolling(20).std()
        
        vol_percentile = df['volatility_20'].rolling(252).rank(pct=True)
        df['vol_regime_low'] = (vol_percentile < 0.33).astype(int)
        df['vol_regime_high'] = (vol_percentile > 0.67).astype(int)
        
        df['bull_regime'] = (df['returns_mean_60'] > 0).astype(int)
        
        df['risk_adj_momentum_20'] = df['momentum_20'] / (df['volatility_20'] + 1e-8)
        df['risk_adj_momentum_60'] = df['momentum_60'] / (df['volatility_60'] + 1e-8)
    
    # Cross-sectional features
    feature_categories = {
        'M': [c for c in feature_cols if c.startswith('M')],
        'E': [c for c in feature_cols if c.startswith('E')],
        'I': [c for c in feature_cols if c.startswith('I')],
        'P': [c for c in feature_cols if c.startswith('P')],
        'V': [c for c in feature_cols if c.startswith('V')],
        'S': [c for c in feature_cols if c.startswith('S')],
    }
    
    for cat, feats in feature_categories.items():
        if feats:
            df[f'{cat}_mean'] = df[feats].mean(axis=1)
            df[f'{cat}_std'] = df[feats].std(axis=1)
            df[f'{cat}_min'] = df[feats].min(axis=1)
            df[f'{cat}_max'] = df[feats].max(axis=1)
    
    # Feature interactions
    if len(top_features) >= 2:
        for i in range(min(5, len(top_features))):
            for j in range(i+1, min(5, len(top_features))):
                feat1, feat2 = top_features[i], top_features[j]
                if feat1 in df.columns and feat2 in df.columns:
                    df[f'{feat1}_x_{feat2}'] = df[feat1] * df[feat2]
    
    # Missing value indicator
    df['n_missing_orig'] = df[feature_cols].isnull().sum(axis=1)
    
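    # Return only the test data rows. Select by position: index labels can repeat
    # between the history buffer and successive test batches, so .loc may return extra rows.
    return df.iloc[-len(test_data):]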

# Allocation function
def predict_allocation(predicted_returns, base_allocation=1.0, sensitivity=100):
    """
    Convert predicted returns to allocation (0-2)
    """
    allocation = base_allocation + np.tanh(predicted_returns * sensitivity)
    return np.clip(allocation, 0, 2)

# Global variables for maintaining history
historical_data = None
first_call = True

def predict(test: pd.DataFrame) -> pd.DataFrame:
    """
    Main prediction function called by Kaggle evaluation API
    
    Args:
        test: Test data for current batch
    
    Returns:
        DataFrame with date_id and prediction columns
    """
    global historical_data, first_call
    
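    # On first call, load training data for historical features
    if first_call:
        train_data = pd.read_csv('data/train.csv')
        # Training rows carry forward_returns, not lagged_forward_returns; derive the
        # lagged column so returns-based features are defined across the history buffer
        if 'lagged_forward_returns' not in train_data.columns:
            train_data['lagged_forward_returns'] = train_data['forward_returns'].shift(1)
        # Keep last 300 days for lag/rolling features
        historical_data = train_data.tail(300).copy()
        first_call = False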
    
    # Create features
    test_with_features = create_features_for_prediction(test, historical_data)
    
    # Fill missing features with 0 (or use more sophisticated imputation)
    for feat in feature_list:
        if feat not in test_with_features.columns:
            test_with_features[feat] = 0
    
    test_with_features = test_with_features[feature_list].fillna(0)
    
    # Predict with ensemble
    pred_lgb = lgb_model.predict(test_with_features)
    pred_xgb = xgb_model.predict(xgb.DMatrix(test_with_features))
    pred_ensemble = 0.5 * pred_lgb + 0.5 * pred_xgb
    
    # Convert to allocation
    allocations = predict_allocation(pred_ensemble, sensitivity=100)
    
    # Update historical data (keep last 300 days)
    historical_data = pd.concat([historical_data, test], ignore_index=False).tail(300)
    
    # Return predictions
    result = pd.DataFrame({
        'date_id': test['date_id'],
        'prediction': allocations
    })
    
    return result

# Initialize inference server
inference_server = kaggle_evaluation.default_inference_server.DefaultInferenceServer(predict)

if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    inference_server.serve()
else:
    # Local testing
    inference_server.run_local_gateway(('/kaggle/input/hull-tactical-market-prediction/',))
'''

# Save submission code
with open('submission.py', 'w') as f:
    f.write(submission_code)

print('✅ Kaggle submission code saved to submission.py')
print('\nTo submit to Kaggle:')
print('1. Copy submission.py to your Kaggle notebook')
print('2. Copy models/ directory to your Kaggle notebook')
print('3. Run the submission.py script')
print('4. Kaggle will evaluate using their API')

## 12. Summary & Next Steps

### What We Built:
- ✅ Advanced feature engineering (100+ features)
- ✅ LightGBM + XGBoost ensemble
- ✅ Continuous allocation strategy
- ✅ Walk-forward cross-validation
- ✅ Kaggle submission code

### Performance Summary:
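- CV mean score and standard deviation: see the cross-validation summary in Section 6
- Final validation score: see the results in Section 8
- Comparison with the baseline (100% invested): see the improvement figure in Section 8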

### Further Improvements:
1. **Hyperparameter tuning**: Use Optuna for more extensive search
2. **More models**: Add CatBoost, Neural Networks
3. **Feature selection**: Remove redundant features
4. **Allocation optimization**: Optimize sensitivity parameter
5. **Regime-based models**: Train separate models for different market regimes
6. **Stacking**: Use meta-model on top of base models