# Example 26: Commodity Forecasting with WorkflowSet and Genetic Algorithm Ensemble Mode

This notebook demonstrates a systematic comparison of feature selection strategies using WorkflowSet, with emphasis on **genetic algorithm ensemble mode**.

**Dataset**: Commodity futures prices (crude oil, natural gas, gold, etc.)

**Key Techniques**:
- Multiple preprocessing strategies with genetic algorithm variations
- WorkflowSet for systematic workflow comparison
- Time series cross-validation
- Fixed GA settings (population size, generations, internal CV folds) shared across all GA variants
- **Enhancement demonstrated**: Ensemble mode for robust feature selection

**Ensemble Mode Benefits**:
- Reduces sensitivity to random initialization
- Identifies consistently important features
- More stable feature selection across different data splits
- Enables comparison of aggregation strategies
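
One way to put a number on the stability claims above is the mean pairwise Jaccard similarity between the feature sets returned by different runs. The sketch below is illustrative only: the helper names and the three feature sets are made up, and nothing here is part of `py_recipes`.

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Overlap between two feature sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_pairwise_jaccard(runs: list) -> float:
    """Average Jaccard similarity over every pair of runs."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical feature sets from three GA runs with different seeds
runs = [
    {"brent_crude", "heating_oil", "gold"},
    {"brent_crude", "heating_oil", "copper"},
    {"brent_crude", "heating_oil", "gold"},
]
stability = mean_pairwise_jaccard(runs)
print(f"Mean pairwise Jaccard: {stability:.2f}")
```

A value near 1 means the runs agree on almost every feature; a value near 0 means the selection depends heavily on the seed.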

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
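from py_recipes import recipe
# Selector used by every recipe below; this import path is an assumption
from py_recipes.selectors import all_numeric_predictors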
from py_recipes.steps import (
    step_select_genetic_algorithm,
    step_normalize,
    step_mutate,
    step_select_corr,
    step_pca
)
from py_workflows import workflow
from py_parsnip import linear_reg, rand_forest
from py_yardstick import rmse, mae, r_squared, metric_set
from py_rsample import time_series_cv, initial_time_split
from py_workflowsets import WorkflowSet
from py_tune import tune, tune_grid, grid_regular, finalize_workflow

# Set random seed for reproducibility
np.random.seed(42)

## 1. Data Loading and Exploration

In [None]:
# Load commodity futures data
data = pd.read_csv('../_md/__data/all_commodities_futures_collection.csv')
data['date'] = pd.to_datetime(data['date'])

print(f"Data shape: {data.shape}")
print(f"\nColumns: {list(data.columns)}")
print(f"\nDate range: {data['date'].min()} to {data['date'].max()}")
print(f"\nMissing values:\n{data.isnull().sum()}")
print(f"\nFirst few rows:")
data.head()

## 2. Feature Engineering

Create lagged features, moving averages, momentum (percentage change), and WTI-Brent spread features.
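
To make the lag and rolling operations concrete, here is a small pure-Python sketch of what they compute on a toy price list. The helper names `lag` and `trailing_mean` are hypothetical and only mimic the idea behind pandas `shift` and `rolling(...).mean()`. When the series being averaged is also the prediction target, a window that includes the current row contains the value being predicted; `include_current=False` shows the alternative.

```python
def lag(values, k):
    """Value from k steps earlier; None where no history exists."""
    n = len(values)
    return [None] * min(k, n) + list(values[: max(n - k, 0)])

def trailing_mean(values, window, include_current=True):
    """Mean of up to `window` most recent values, optionally excluding the current one."""
    out = []
    for i in range(len(values)):
        end = i + 1 if include_current else i
        chunk = values[max(0, end - window):end]
        out.append(sum(chunk) / len(chunk) if chunk else None)
    return out

prices = [70.0, 72.0, 71.0, 74.0, 76.0]  # made-up prices
lag_1 = lag(prices, 1)
ma_incl = trailing_mean(prices, 3)                         # window ends at the current row
ma_excl = trailing_mean(prices, 3, include_current=False)  # window ends one row earlier

print(lag_1)
print(ma_incl)
print(ma_excl)
```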

In [None]:
# Focus on crude oil (WTI) as target with other commodities as predictors
# Keep only columns we need
keep_cols = ['date', 'wti_crude', 'brent_crude', 'natural_gas', 'heating_oil', 
             'rbob_gasoline', 'gold', 'silver', 'copper']
data = data[[col for col in keep_cols if col in data.columns]].copy()

# Drop rows with missing values
data = data.dropna()

# Sort by date
data = data.sort_values('date').reset_index(drop=True)

print(f"After cleaning: {data.shape}")
print(f"Columns: {list(data.columns)}")

In [None]:
# Create lagged features (1, 3, 7 days)
for col in data.columns:
    if col != 'date':
        for lag in [1, 3, 7]:
            data[f'{col}_lag_{lag}'] = data[col].shift(lag)

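# Create moving averages (7, 14, 30 days)
# shift(1) drops the current day from each window; without it, the wti_crude
# averages would contain the value being predicted
for col in ['wti_crude', 'brent_crude', 'natural_gas']:
    for window in [7, 14, 30]:
        data[f'{col}_ma_{window}'] = data[col].shift(1).rolling(window=window, min_periods=1).mean()

# Create price momentum (% change), lagged one day for the same reason
for col in ['wti_crude', 'brent_crude', 'natural_gas']:
    data[f'{col}_pct_change_1d'] = data[col].pct_change(1).shift(1)
    data[f'{col}_pct_change_7d'] = data[col].pct_change(7).shift(1)

# Create spread features from the previous day's prices; a same-day spread plus
# same-day brent_crude would reconstruct wti_crude exactly
if 'brent_crude' in data.columns:
    data['wti_brent_spread'] = (data['wti_crude'] - data['brent_crude']).shift(1)
    data['wti_brent_ratio'] = (data['wti_crude'] / (data['brent_crude'] + 1e-6)).shift(1)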

# Drop rows with NaN from lagging/rolling
data = data.dropna()

print(f"After feature engineering: {data.shape}")
print(f"Total features: {len([col for col in data.columns if col not in ['date', 'wti_crude']])}")

## 3. Train/Test Split

Use an 80/20 temporal split with no shuffling, so every test row is later than every training row.

In [None]:
# Split data (80/20)
split_idx = int(len(data) * 0.8)
train_data = data.iloc[:split_idx].copy()
test_data = data.iloc[split_idx:].copy()

print(f"Train: {train_data.shape[0]} rows, {train_data['date'].min()} to {train_data['date'].max()}")
print(f"Test: {test_data.shape[0]} rows, {test_data['date'].min()} to {test_data['date'].max()}")

# Remove date column for modeling
train = train_data.drop('date', axis=1).copy()
test = test_data.drop('date', axis=1).copy()

print(f"\nTraining features: {train.shape[1] - 1} (target: wti_crude)")

## 4. Define Multiple Preprocessing Strategies

Create recipes with different feature selection approaches:
1. **All features** (baseline)
2. **Correlation-based** selection
3. **PCA** dimensionality reduction
4. **Genetic Algorithm** - single run
5. **Genetic Algorithm Ensemble** - voting (60%)
6. **Genetic Algorithm Ensemble** - voting (80%)
7. **Genetic Algorithm Ensemble** - union
8. **Genetic Algorithm Ensemble** - intersection
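
The three ensemble strategies differ only in how they combine the per-run feature sets. The sketch below shows one plausible reading of `voting`, `union`, and `intersection`. It is an assumption about the semantics, not the implementation inside `step_select_genetic_algorithm`, and the five runs are made up.

```python
from collections import Counter

def aggregate(runs, strategy="voting", threshold=0.6):
    """Combine per-run feature lists into one selection."""
    counts = Counter(f for run in runs for f in set(run))
    n = len(runs)
    if strategy == "union":
        keep = set(counts)                                   # selected in any run
    elif strategy == "intersection":
        keep = {f for f, c in counts.items() if c == n}      # selected in every run
    elif strategy == "voting":
        keep = {f for f, c in counts.items() if c / n >= threshold}
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(keep)

# Hypothetical results of five GA runs
runs = [
    ["brent_crude", "heating_oil", "gold"],
    ["brent_crude", "heating_oil", "copper"],
    ["brent_crude", "heating_oil", "gold"],
    ["brent_crude", "gold", "silver"],
    ["brent_crude", "heating_oil", "natural_gas"],
]

vote60 = aggregate(runs, "voting", 0.6)
vote80 = aggregate(runs, "voting", 0.8)
union = aggregate(runs, "union")
inter = aggregate(runs, "intersection")
print(vote60, vote80, union, inter, sep="\n")
```

With five runs, a 60% threshold keeps features chosen at least three times and an 80% threshold keeps those chosen at least four times, which is why recipes 5 and 6 can select different features from the same runs.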

In [None]:
# Recipe 1: All features (baseline)
rec_all = recipe(train).step_normalize(all_numeric_predictors())

# Recipe 2: Correlation-based selection (keep top 10 features)
rec_corr = (recipe(train)
    .step_normalize(all_numeric_predictors())
    .step_select_corr(
        outcome='wti_crude',
        method='correlation',
        threshold=0.7,
        top_n=10
    )
)

# Recipe 3: PCA (10 components)
rec_pca = (recipe(train)
    .step_normalize(all_numeric_predictors())
    .step_pca(all_numeric_predictors(), num_comp=10)
)

# Recipe 4: Genetic Algorithm - single run
rec_ga_single = (recipe(train)
    .step_normalize(all_numeric_predictors())
    .step_select_genetic_algorithm(
        outcome='wti_crude',
        model=linear_reg(),
        metric='rmse',
        top_n=10,
        population_size=30,
        generations=20,
        cv_folds=3,
        random_state=42
    )
)

# Recipe 5: GA Ensemble - voting 60%
rec_ga_ens_v60 = (recipe(train)
    .step_normalize(all_numeric_predictors())
    .step_select_genetic_algorithm(
        outcome='wti_crude',
        model=linear_reg(),
        metric='rmse',
        top_n=10,
        n_ensemble=5,
        ensemble_strategy='voting',
        ensemble_threshold=0.6,
        population_size=30,
        generations=20,
        cv_folds=3,
        random_state=42
    )
)

# Recipe 6: GA Ensemble - voting 80%
rec_ga_ens_v80 = (recipe(train)
    .step_normalize(all_numeric_predictors())
    .step_select_genetic_algorithm(
        outcome='wti_crude',
        model=linear_reg(),
        metric='rmse',
        top_n=10,
        n_ensemble=5,
        ensemble_strategy='voting',
        ensemble_threshold=0.8,
        population_size=30,
        generations=20,
        cv_folds=3,
        random_state=42
    )
)

# Recipe 7: GA Ensemble - union
rec_ga_ens_union = (recipe(train)
    .step_normalize(all_numeric_predictors())
    .step_select_genetic_algorithm(
        outcome='wti_crude',
        model=linear_reg(),
        metric='rmse',
        top_n=10,
        n_ensemble=5,
        ensemble_strategy='union',
        population_size=30,
        generations=20,
        cv_folds=3,
        random_state=42
    )
)

# Recipe 8: GA Ensemble - intersection
rec_ga_ens_inter = (recipe(train)
    .step_normalize(all_numeric_predictors())
    .step_select_genetic_algorithm(
        outcome='wti_crude',
        model=linear_reg(),
        metric='rmse',
        top_n=10,
        n_ensemble=5,
        ensemble_strategy='intersection',
        population_size=30,
        generations=20,
        cv_folds=3,
        random_state=42
    )
)

recipes = {
    'all_features': rec_all,
    'correlation': rec_corr,
    'pca': rec_pca,
    'ga_single': rec_ga_single,
    'ga_ens_vote60': rec_ga_ens_v60,
    'ga_ens_vote80': rec_ga_ens_v80,
    'ga_ens_union': rec_ga_ens_union,
    'ga_ens_intersect': rec_ga_ens_inter
}

print(f"Defined {len(recipes)} preprocessing strategies")

## 5. Define Models

Compare linear regression and random forest.

In [None]:
models = {
    'linear_reg': linear_reg(),
    'rand_forest': rand_forest(trees=100, min_n=5).set_mode('regression')
}

print(f"Defined {len(models)} models")

## 6. Create WorkflowSet

Systematically combine all recipes and models.
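
Conceptually, the cross is a Cartesian product of recipe names and model names. The sketch below reproduces only the naming and the count. The `{recipe}_{model}` ID format is inferred from the IDs this notebook filters on later (for example `all_features_linear_reg`), not from the `WorkflowSet` source.

```python
from itertools import product

recipe_ids = [
    "all_features", "correlation", "pca", "ga_single",
    "ga_ens_vote60", "ga_ens_vote80", "ga_ens_union", "ga_ens_intersect",
]
model_ids = ["linear_reg", "rand_forest"]

# One workflow per (recipe, model) pair
workflow_ids = [f"{r}_{m}" for r, m in product(recipe_ids, model_ids)]

print(len(workflow_ids))
print(workflow_ids[:4])
```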

In [None]:
# Create WorkflowSet
wf_set = WorkflowSet.from_cross(
    preproc=recipes,
    models=models,
    ids=list(recipes.keys())
)

print(f"\nCreated WorkflowSet with {len(wf_set.workflows)} workflows")
print(f"\nWorkflow IDs:")
for wf_id in wf_set.workflows.keys():
    print(f"  - {wf_id}")

## 7. Time Series Cross-Validation

Create expanding window CV splits.
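
An expanding window keeps the training start fixed at row 0 and moves the training end forward, with each assessment block following directly after. The generator below is a minimal sketch of that idea. Its `step` parameter is a hypothetical name, and `time_series_cv` may interpret `skip` differently, so the fold count printed by the notebook is the authoritative one.

```python
def expanding_window_splits(n_rows, initial, assess, step):
    """Return (train_range, assess_range) pairs with a growing training window."""
    splits = []
    train_end = initial
    while train_end + assess <= n_rows:
        splits.append((range(0, train_end), range(train_end, train_end + assess)))
        train_end += step  # advance the origin; training always starts at row 0
    return splits

# Same proportions as the notebook: 50% initial, 10% assess, non-overlapping blocks
splits = expanding_window_splits(n_rows=1000, initial=500, assess=100, step=100)

for i, (tr, te) in enumerate(splits, start=1):
    print(f"Fold {i}: train rows {tr.start}-{tr.stop - 1}, assess rows {te.start}-{te.stop - 1}")
```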

In [None]:
# Create expanding-window CV splits; the resulting fold count is printed below
# Use smaller initial window to get more folds
n_train = len(train)
initial_size = int(n_train * 0.5)  # Start with 50% of data
assess_size = int(n_train * 0.1)   # Assess on 10%

cv_folds = time_series_cv(
    train_data,
    date_column='date',
    initial=initial_size,
    assess=assess_size,
    skip=assess_size,  # Non-overlapping assessment periods
    cumulative=True     # Expanding window
)

print(f"Created {cv_folds.n_splits} CV folds")
print(f"\nFold details:")
for i, (train_idx, test_idx) in enumerate(cv_folds.splits):
    train_dates = train_data.iloc[train_idx]['date']
    test_dates = train_data.iloc[test_idx]['date']
    print(f"  Fold {i+1}: Train {len(train_idx)} ({train_dates.min()} to {train_dates.max()}), "
          f"Test {len(test_idx)} ({test_dates.min()} to {test_dates.max()})")

## 8. Evaluate All Workflows with CV

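**Note**: This will take several minutes. There are 16 workflows (8 recipes × 2 models), so 3 folds would mean 48 workflow fits; the exact total is printed below from `cv_folds.n_splits`. Each GA recipe also runs its own internal search every time it is fitted, which accounts for most of the runtime.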
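
The cost can be estimated with simple arithmetic. The outer count follows directly from the notebook. The inner GA count is a rough upper bound that assumes every individual is scored in every generation with no caching or early stopping, which may not match the library's behaviour.

```python
# Outer loop: one fit per workflow per CV fold
n_recipes, n_models = 8, 2
n_workflows = n_recipes * n_models

def outer_fits(n_workflows: int, n_folds: int) -> int:
    """Workflow fits performed by resampling."""
    return n_workflows * n_folds

# Inner loop (assumption): population x generations x internal CV folds per GA run
population_size, generations, inner_cv_folds, n_ensemble = 30, 20, 3, 5
single_run_fits = population_size * generations * inner_cv_folds
ensemble_fits = single_run_fits * n_ensemble

print(f"Workflows: {n_workflows}")
print(f"Outer fits with 3 folds: {outer_fits(n_workflows, 3)}")
print(f"Model fits inside one GA run (upper bound): {single_run_fits}")
print(f"Model fits inside one 5-run GA ensemble (upper bound): {ensemble_fits}")
```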

In [None]:
# Define metrics
metrics = metric_set(rmse, mae, r_squared)

# Fit resamples (this will take a while)
print("Fitting all workflows across CV folds...")
print(f"Total fits: {len(wf_set.workflows)} workflows × {cv_folds.n_splits} folds = {len(wf_set.workflows) * cv_folds.n_splits}")
print("\nThis may take 5-10 minutes...\n")

wf_results = wf_set.fit_resamples(
    resamples=cv_folds,
    metrics=metrics
)

print("\n✓ CV evaluation complete!")

## 9. Collect and Analyze Results
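
For reference, the three metrics collected below can be written out by hand. This is a sketch using the textbook definitions on made-up numbers; `py_yardstick` may differ in details (for example, some libraries define R² as a squared correlation), and the `toy_` names are chosen to avoid shadowing the imported metric functions.

```python
import math

def toy_rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def toy_mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def toy_r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [70.0, 72.0, 74.0, 76.0]  # made-up prices
y_pred = [71.0, 71.0, 75.0, 75.0]  # made-up predictions

rmse_val = toy_rmse(y_true, y_pred)
mae_val = toy_mae(y_true, y_pred)
r2_val = toy_r_squared(y_true, y_pred)
print(rmse_val, mae_val, r2_val)
```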

In [None]:
# Collect metrics
cv_metrics = wf_results.collect_metrics()

print("\n=== Cross-Validation Results (All Workflows) ===")
print(cv_metrics.to_string(index=False))

# Rank by RMSE
ranked = wf_results.rank_results('rmse', n=10)
print("\n=== Top 10 Workflows (by RMSE) ===")
print(ranked[['wflow_id', 'rmse', 'mae', 'rsq', 'rank']].to_string(index=False))

## 10. Compare Ensemble Strategies

Focus on genetic algorithm variants.

In [None]:
# Filter for GA-related workflows with linear regression
ga_workflows = cv_metrics[
    cv_metrics['wflow_id'].str.contains('ga_') & 
    cv_metrics['wflow_id'].str.contains('linear_reg')
].copy()

# Also include baseline for comparison
baseline = cv_metrics[
    (cv_metrics['wflow_id'] == 'all_features_linear_reg') |
    (cv_metrics['wflow_id'] == 'correlation_linear_reg') |
    (cv_metrics['wflow_id'] == 'pca_linear_reg')
]

comparison = pd.concat([baseline, ga_workflows]).sort_values('rmse')

print("\n=== Feature Selection Method Comparison ===")
print(comparison[['wflow_id', 'rmse', 'mae', 'rsq']].to_string(index=False))

## 11. Visualization

In [None]:
# Use autoplot for quick visualization
fig = wf_results.autoplot('rmse')
fig.update_layout(height=600, title='Workflow Comparison: RMSE across CV Folds')
fig.show()

In [None]:
# Custom plot for GA comparison
fig, ax = plt.subplots(figsize=(12, 6))

# Sort by RMSE
comparison_sorted = comparison.sort_values('rmse')

# Create labels
labels = comparison_sorted['wflow_id'].str.replace('_linear_reg', '').str.replace('_', ' ').str.title()

# Colors: baseline in gray, GA variants in different colors
colors = []
for wf_id in comparison_sorted['wflow_id']:
    if 'ga_ens' in wf_id:
        colors.append('steelblue')
    elif 'ga_single' in wf_id:
        colors.append('orange')
    else:
        colors.append('lightgray')

bars = ax.barh(range(len(labels)), comparison_sorted['rmse'], color=colors, alpha=0.8)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
ax.set_xlabel('RMSE (lower is better)', fontsize=11)
ax.set_title('Feature Selection Strategy Comparison: RMSE', fontsize=13, fontweight='bold')
ax.grid(axis='x', alpha=0.3)

# Add value labels
for i, (bar, val) in enumerate(zip(bars, comparison_sorted['rmse'])):
    ax.text(val, bar.get_y() + bar.get_height()/2, f'{val:.2f}', 
            ha='left', va='center', fontsize=9, fontweight='bold')

# Legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='lightgray', alpha=0.8, label='Baseline Methods'),
    Patch(facecolor='orange', alpha=0.8, label='GA Single Run'),
    Patch(facecolor='steelblue', alpha=0.8, label='GA Ensemble')
]
ax.legend(handles=legend_elements, loc='lower right')

plt.tight_layout()
plt.savefig('commodity_ga_ensemble_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nPlot saved as: commodity_ga_ensemble_comparison.png")

## 12. Select and Finalize Best Workflow

In [None]:
# Get best workflow ID
best_wf_id = ranked.iloc[0]['wflow_id']
print(f"Best workflow: {best_wf_id}")
print(f"CV RMSE: {ranked.iloc[0]['rmse']:.2f}")

# Extract and fit best workflow on full training data
best_wf = wf_set[best_wf_id]
best_fit = best_wf.fit(train)

# Evaluate on test set
test_eval = best_fit.evaluate(test)
outputs, coeffs, stats = test_eval.extract_outputs()

print(f"\n=== Test Set Performance ===")
test_stats = stats[stats['split'] == 'test']
print(f"RMSE: {test_stats['rmse'].values[0]:.2f}")
print(f"MAE: {test_stats['mae'].values[0]:.2f}")
print(f"R²: {test_stats['rsq'].values[0]:.4f}")

## 13. Inspect Selected Features (If GA-based)

If the best workflow uses a genetic algorithm, inspect which features it selected.

In [None]:
if 'ga_' in best_wf_id:
    # Get the prepared recipe
    prepped_recipe = best_fit.extract_preprocessor()
    
    # Find the genetic algorithm step
    ga_step = None
    for step in prepped_recipe.prepared_steps:
        if hasattr(step, '_selected_features'):
            ga_step = step
            break
    
    if ga_step is not None:
        print(f"\n=== Selected Features by Best Workflow ===")
        selected_features = ga_step._selected_features
        print(f"Number of features selected: {len(selected_features)}")
        print(f"\nSelected features:")
        for feat in selected_features:
            print(f"  - {feat}")
        
        # If ensemble mode, show feature frequencies
        if hasattr(ga_step, '_ensemble_results') and len(ga_step._ensemble_results) > 0:
            print(f"\n=== Ensemble Details ===")
            print(f"Number of ensemble runs: {len(ga_step._ensemble_results)}")
            
            print(f"\nFeature frequency across ensemble runs:")
            freq = ga_step._feature_frequencies
            for feat, count in sorted(freq.items(), key=lambda x: x[1], reverse=True):
                pct = count / len(ga_step._ensemble_results) * 100
                print(f"  {feat}: {count}/{len(ga_step._ensemble_results)} runs ({pct:.0f}%)")
            
            print(f"\nPer-run details:")
            for result in ga_step._ensemble_results:
                print(f"  Run {result['run_idx']+1} (seed={result['seed']}): "
                      f"{len(result['features'])} features, fitness={result['fitness']:.2f}")
else:
    print(f"Best workflow ({best_wf_id}) does not use genetic algorithm.")

## 14. Key Takeaways

### Ensemble Mode Benefits:
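1. **Stability**: Aggregating several GA runs with different seeds reduces run-to-run variance in the selected feature set
2. **Confidence**: Features selected in at least 80% of runs (4 of the 5 runs used here) are the least dependent on the random seed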
3. **Strategy Matters**: 
   - `voting` (60-80%): Balanced selection
   - `union`: Most inclusive (may include noise)
   - `intersection`: Most conservative (only unanimous features)

### WorkflowSet Advantages:
1. **Systematic Comparison**: Automatically evaluates all preprocessing × model combinations
2. **Cross-Validation**: Robust performance estimation via time series CV
3. **Easy Ranking**: Quick identification of best-performing strategies
4. **Reproducibility**: Structured approach ensures consistent methodology

### Genetic Algorithm vs Other Methods:
- GA scores whole feature subsets, so it can find combinations that a per-feature correlation filter misses; whether that helps on this dataset is shown by the CV ranking above
- Ensemble GA gives a more stable feature set than a single GA run
- Computational cost is higher but may be worth it for improved stability
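
For readers who have not seen a GA applied to feature selection, the toy below shows the basic loop: boolean masks as individuals, tournament selection, single-point crossover, bit-flip mutation, and elitism. It is a generic textbook sketch with a made-up fitness function, not the algorithm inside `step_select_genetic_algorithm`. In this notebook, the fitness would correspond to cross-validated RMSE of the model, where lower is better.

```python
import random

def tournament(population, fitness, rng):
    """Return the fitter of two randomly drawn individuals."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def ga_select(n_features, fitness, population_size=30, generations=20,
              mutation_rate=0.05, seed=42):
    """Toy GA over boolean feature masks; higher fitness is better."""
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(population_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        children = [best[:]]  # elitism: keep the best mask unchanged
        while len(children) < population_size:
            p1 = tournament(population, fitness, rng)
            p2 = tournament(population, fitness, rng)
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best

# Hypothetical fitness: reward three "useful" columns, penalise the rest
USEFUL = {0, 2, 5}

def toy_fitness(mask):
    hits = sum(1 for i, on in enumerate(mask) if on and i in USEFUL)
    noise = sum(1 for i, on in enumerate(mask) if on and i not in USEFUL)
    return hits - 0.5 * noise

best_mask = ga_select(n_features=8, fitness=toy_fitness)
selected = [i for i, on in enumerate(best_mask) if on]
print("Selected feature indices:", selected)
```

Because the search is stochastic, a different `seed` can return a different mask. That seed dependence is what ensemble mode averages over.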

### Production Recommendations:
1. Use ensemble mode with voting (60-70% threshold) for production systems
2. Monitor feature frequencies to identify consistently important predictors
3. Retrain ensemble periodically as new data arrives
4. Consider computational budget when choosing ensemble size (5-10 runs typical)
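
When picking a threshold and an ensemble size together, it helps to translate percentages into vote counts. The helper below is a hypothetical sketch that assumes a feature is kept when its share of runs is at least the threshold; the library may round differently.

```python
def votes_needed(threshold_pct: int, n_runs: int) -> int:
    """Smallest number of runs whose share of n_runs is at least threshold_pct percent."""
    return -(-threshold_pct * n_runs // 100)  # integer ceiling division

for n_runs in (5, 7, 10):
    row = ", ".join(f"{pct}% -> {votes_needed(pct, n_runs)}" for pct in (60, 70, 80))
    print(f"{n_runs} runs: {row}")
```

With only 5 runs, 70% and 80% both require 4 votes, so finer threshold choices only become meaningful with larger ensembles.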