# Example 28: Refinery Production Forecasting with WorkflowSet and Diversity Maintenance

This notebook demonstrates **diversity maintenance** in genetic algorithms to prevent premature convergence and explore alternative feature subsets.

**Dataset**: JODI refinery production data (60+ countries, 2002-2024)

**Key Techniques**:
- Diversity measurement via Hamming distance
- Fitness sharing to maintain population diversity
- Comparison of different diversity thresholds
- WorkflowSet for systematic threshold comparison
- Cross-validation with diversity-aware GA
- **Enhancement demonstrated**: Diversity maintenance for robust feature selection

**Diversity Maintenance Benefits**:
- Prevents premature convergence to local optima
- Explores broader solution space
- Finds alternative feature subsets
- More robust to initialization
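
The two core ideas listed above, Hamming-distance diversity and fitness sharing, can be sketched in a few lines of NumPy. This is a minimal illustration, not the internals of `step_select_genetic_algorithm`: the helper names `population_diversity` and `shared_fitness`, the `sigma_share` niche radius, and the triangular sharing function are all assumptions made for the example. It also assumes a fitness that is maximized; for an error metric such as RMSE the score would first have to be turned into a quantity where larger is better.

```python
import numpy as np

def population_diversity(population):
    """Mean pairwise Hamming distance, normalized by chromosome length (0 to 1)."""
    pop = np.asarray(population, dtype=int)
    n, n_genes = pop.shape
    if n < 2:
        return 0.0
    # Pairwise Hamming distances via broadcasting: result is an (n, n) matrix
    dist = (pop[:, None, :] != pop[None, :, :]).sum(axis=2)
    # Average over distinct pairs only (upper triangle, diagonal excluded)
    upper = np.triu_indices(n, k=1)
    return float(dist[upper].mean() / n_genes)

def shared_fitness(fitness, population, sigma_share=0.3):
    """Divide each (maximized) fitness by its niche count (hypothetical helper)."""
    pop = np.asarray(population, dtype=int)
    n_genes = pop.shape[1]
    dist = (pop[:, None, :] != pop[None, :, :]).sum(axis=2) / n_genes
    # Triangular sharing function: 1 at distance 0, falling to 0 at sigma_share
    sharing = np.where(dist < sigma_share, 1.0 - dist / sigma_share, 0.0)
    niche_count = sharing.sum(axis=1)  # includes the individual itself, so >= 1
    return np.asarray(fitness, dtype=float) / niche_count

# Each row is one individual; 1 means the feature is selected
population = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
])
print("diversity:", population_diversity(population))
print("shared fitness:", shared_fitness([1.0, 1.0, 1.0], population))
```

The two identical individuals split their fitness between them, while the isolated third individual keeps its full score, which is what gives under-explored feature subsets a better chance of surviving selection.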

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from py_recipes import recipe
from py_recipes.steps import (
    step_select_genetic_algorithm,
    step_normalize,
    step_mutate
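)
# all_numeric_predictors() is used in the recipes below; this import path is assumed
from py_recipes.selectors import all_numeric_predictors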
from py_workflows import workflow
from py_parsnip import linear_reg, rand_forest
from py_yardstick import rmse, mae, r_squared, metric_set
from py_rsample import time_series_cv
from py_workflowsets import WorkflowSet

# Set random seed for reproducibility
np.random.seed(42)

## 1. Data Loading and Exploration

In [None]:
# Load refinery production data
data = pd.read_csv('../_md/__data/jodi_refinery_production_data.csv')
data['date'] = pd.to_datetime(data['date'])

print(f"Data shape: {data.shape}")
print(f"\nColumns: {list(data.columns)}")
print(f"\nCountries: {len(data['country'].unique())} countries")
print(f"Date range: {data['date'].min()} to {data['date'].max()}")
print(f"\nMissing values:\n{data.isnull().sum()}")
print(f"\nFirst few rows:")
data.head()

## 2. Feature Engineering

Create time series features: lags, rolling statistics, temporal features.

In [None]:
# Select a single country for analysis
# Available countries include 'United States of America' (not 'USA')
country = 'United States of America'
country_data = data[data['country'] == country].copy()
country_data = country_data.sort_values('date').reset_index(drop=True)

print(f"Country: {country}")
print(f"Data points: {len(country_data)}")
print(f"Date range: {country_data['date'].min()} to {country_data['date'].max()}")

# Use 'value' column as target (refinery production)
target_col = 'value'
print(f"\nTarget variable: {target_col}")
print(f"Available columns: {list(country_data.columns)}")

In [None]:
# Create lag features (1, 3, 6, 12 months)
for lag in [1, 3, 6, 12]:
    country_data[f'{target_col}_lag_{lag}'] = country_data[target_col].shift(lag)

# Create rolling statistics (3, 6, 12 months) on the series shifted by one month,
# so each row only uses values that precede it (avoids leaking the current target)
past_values = country_data[target_col].shift(1)
for window in [3, 6, 12]:
    country_data[f'{target_col}_ma_{window}'] = past_values.rolling(
        window=window, min_periods=1).mean()
    country_data[f'{target_col}_std_{window}'] = past_values.rolling(
        window=window, min_periods=1).std()

# Create momentum features (% change), also computed from past values only
for period in [1, 3, 6, 12]:
    country_data[f'{target_col}_pct_change_{period}'] = past_values.pct_change(period)

# Temporal features
country_data['month'] = country_data['date'].dt.month
country_data['quarter'] = country_data['date'].dt.quarter
country_data['year'] = country_data['date'].dt.year
country_data['month_sin'] = np.sin(2 * np.pi * country_data['month'] / 12)
country_data['month_cos'] = np.cos(2 * np.pi * country_data['month'] / 12)

# Drop rows with NaN
country_data = country_data.dropna()

# Build the feature list once and reuse it
feature_cols = [c for c in country_data.columns if c not in ['date', 'country', target_col]]
print(f"\nAfter feature engineering: {country_data.shape}")
print(f"Features: {feature_cols}")
print(f"Total features: {len(feature_cols)}")

## 3. Train/Test Split

In [None]:
# Split data (80/20)
split_idx = int(len(country_data) * 0.8)
train_data = country_data.iloc[:split_idx].copy()
test_data = country_data.iloc[split_idx:].copy()

print(f"Train: {train_data.shape[0]} rows, {train_data['date'].min()} to {train_data['date'].max()}")
print(f"Test: {test_data.shape[0]} rows, {test_data['date'].min()} to {test_data['date'].max()}")

# Remove date and country columns
train = train_data.drop(['date', 'country'], axis=1).copy()
test = test_data.drop(['date', 'country'], axis=1).copy()

print(f"\nTraining features: {train.shape[1] - 1} (target: {target_col})")

## 4. Define Multiple Diversity Configurations

Compare:
1. **Standard GA** (no diversity maintenance)
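2. **Diversity threshold = 0.2** (low threshold: diversity must fall far before maintenance triggers, so it triggers rarely)
3. **Diversity threshold = 0.3** (moderate)
4. **Diversity threshold = 0.4** (moderate-high)
5. **Diversity threshold = 0.5** (high threshold: maintenance triggers as soon as diversity dips below 0.5, so it triggers often)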

In [None]:
# Baseline: All features
rec_all = recipe(train).step_normalize(all_numeric_predictors())

# Standard GA (no diversity maintenance)
rec_ga_standard = (recipe(train)
    .step_normalize(all_numeric_predictors())
    .step_select_genetic_algorithm(
        outcome=target_col,
        model=linear_reg(),
        metric='rmse',
        top_n=10,
        population_size=30,
        generations=30,
        cv_folds=3,
        random_state=42
    )
)

# Diversity maintenance with different thresholds
diversity_thresholds = [0.2, 0.3, 0.4, 0.5]
diversity_recipes = {}

for threshold in diversity_thresholds:
    rec = (recipe(train)
        .step_normalize(all_numeric_predictors())
        .step_select_genetic_algorithm(
            outcome=target_col,
            model=linear_reg(),
            metric='rmse',
            top_n=10,
            maintain_diversity=True,
            diversity_threshold=threshold,
            population_size=30,
            generations=30,
            cv_folds=3,
            random_state=42
        )
    )
    # round() guards against float artifacts before truncating to an integer label
    diversity_recipes[f'diversity_{int(round(threshold * 10))}'] = rec

# Combine all recipes
recipes = {
    'all_features': rec_all,
    'ga_standard': rec_ga_standard,
    **diversity_recipes
}

print(f"Defined {len(recipes)} preprocessing strategies")
for name in recipes.keys():
    print(f"  - {name}")

## 5. Create WorkflowSet

In [None]:
# Use linear regression
models = {'linear_reg': linear_reg()}

# Create WorkflowSet
wf_set = WorkflowSet.from_cross(
    preproc=recipes,
    models=models,
    ids=list(recipes.keys())
)

print(f"\nCreated WorkflowSet with {len(wf_set.workflows)} workflows")

## 6. Time Series Cross-Validation
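
To make the expanding-window scheme concrete, the sketch below generates the row indices for each fold by hand. It is an illustration only: `expanding_window_splits` and its `step` argument are hypothetical names, and the exact semantics of `initial`, `assess`, `skip`, and `cumulative` in `time_series_cv` may differ from what is assumed here (a training window that always starts at row 0 and grows by `step` rows per fold).

```python
def expanding_window_splits(n_rows, initial, assess, step):
    """Return (train_indices, assess_indices) pairs for an expanding window."""
    splits = []
    train_end = initial
    # Stop once the assessment window would run past the end of the data
    while train_end + assess <= n_rows:
        train_idx = list(range(0, train_end))                    # grows each fold
        assess_idx = list(range(train_end, train_end + assess))  # fixed length
        splits.append((train_idx, assess_idx))
        train_end += step
    return splits

# Example: 100 rows, 60% initial window, 10% assessment window, 10% step
splits = expanding_window_splits(n_rows=100, initial=60, assess=10, step=10)
for i, (train_idx, assess_idx) in enumerate(splits, start=1):
    print(f"fold {i}: train rows 0-{train_idx[-1]}, "
          f"assess rows {assess_idx[0]}-{assess_idx[-1]}")
```

Every assessment window lies strictly after its training window, which is the property that keeps the evaluation free of look-ahead.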

In [None]:
# Create CV splits
n_train = len(train)
initial_size = int(n_train * 0.6)
assess_size = int(n_train * 0.1)

cv_folds = time_series_cv(
    train_data,
    date_column='date',
    initial=initial_size,
    assess=assess_size,
    skip=assess_size,
    cumulative=True
)

print(f"Created {cv_folds.n_splits} CV folds")

## 7. Evaluate All Workflows

**Note**: This step may take 10-15 minutes, because every GA-based workflow runs a 30-generation search in each CV fold.
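
A rough back-of-the-envelope count shows where the runtime comes from. The sketch below is an upper bound under stated assumptions: every individual is scored in every generation with no caching or early stopping, and the number of outer CV folds (`outer_folds=4`) is an assumed value rather than one taken from the run. The helper `estimate_model_fits` is hypothetical.

```python
def estimate_model_fits(population_size, generations, inner_cv_folds,
                        n_ga_workflows, outer_folds):
    """Upper bound on model fits performed inside the GA searches."""
    # One GA run scores each individual once per generation, on each inner fold
    fits_per_ga_run = population_size * generations * inner_cv_folds
    # Each GA workflow repeats that search once per outer CV fold
    total_fits = fits_per_ga_run * n_ga_workflows * outer_folds
    return fits_per_ga_run, total_fits

fits_per_run, total_fits = estimate_model_fits(
    population_size=30,   # matches the recipes above
    generations=30,       # matches the recipes above
    inner_cv_folds=3,     # cv_folds=3 inside the GA step
    n_ga_workflows=5,     # standard GA + four diversity thresholds
    outer_folds=4,        # assumed value for illustration
)
print(f"Fits per GA run (upper bound): {fits_per_run}")
print(f"Total fits across workflows and folds (upper bound): {total_fits}")
```

Reducing `population_size` or `generations` shrinks this count linearly, which is the simplest lever when the notebook is run interactively.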

In [None]:
# Define metrics
metrics = metric_set(rmse, mae, r_squared)

# Fit resamples
print("Fitting all workflows across CV folds...")
print(f"Total fits: {len(wf_set.workflows)} workflows × {cv_folds.n_splits} folds = {len(wf_set.workflows) * cv_folds.n_splits}")
print("\nThis may take 10-15 minutes...\n")

wf_results = wf_set.fit_resamples(
    resamples=cv_folds,
    metrics=metrics
)

print("\n✓ CV evaluation complete!")

## 8. Collect and Analyze Results

In [None]:
# Collect metrics
cv_metrics = wf_results.collect_metrics()

print("\n=== Cross-Validation Results ===")
print(cv_metrics.to_string(index=False))

# Rank by RMSE
ranked = wf_results.rank_results('rmse', n=10)
print("\n=== Ranked Workflows (by RMSE) ===")
print(ranked[['wflow_id', 'rmse', 'mae', 'rsq', 'rank']].to_string(index=False))

## 9. Visualization: Diversity Threshold Comparison

In [None]:
# Use autoplot
fig = wf_results.autoplot('rmse')
fig.update_layout(height=600, title='Diversity Threshold Comparison: RMSE')
fig.show()

In [None]:
# Custom bar plot
fig, ax = plt.subplots(figsize=(12, 6))

sorted_metrics = cv_metrics.sort_values('rmse')
labels = sorted_metrics['wflow_id'].str.replace('_linear_reg', '').str.replace('_', ' ').str.title()

# Color coding
colors = []
for wf_id in sorted_metrics['wflow_id']:
    if 'diversity' in wf_id:
        colors.append('forestgreen')
    elif 'ga_standard' in wf_id:
        colors.append('orange')
    else:
        colors.append('lightgray')

bars = ax.barh(range(len(labels)), sorted_metrics['rmse'], color=colors, alpha=0.8)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
ax.set_xlabel('RMSE (lower is better)', fontsize=11)
ax.set_title('Feature Selection Strategy Comparison: RMSE', fontsize=13, fontweight='bold')
ax.grid(axis='x', alpha=0.3)

# Add value labels
for i, (bar, val) in enumerate(zip(bars, sorted_metrics['rmse'])):
    ax.text(val, bar.get_y() + bar.get_height()/2, f'{val:.2f}', 
            ha='left', va='center', fontsize=9, fontweight='bold')

# Legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='lightgray', alpha=0.8, label='Baseline (All Features)'),
    Patch(facecolor='orange', alpha=0.8, label='Standard GA (No Diversity)'),
    Patch(facecolor='forestgreen', alpha=0.8, label='GA with Diversity Maintenance')
]
ax.legend(handles=legend_elements, loc='lower right')

plt.tight_layout()
plt.savefig('production_diversity_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nPlot saved as: production_diversity_comparison.png")

## 10. Inspect Diversity History

Fit all GA variants on full training data to examine diversity evolution.

In [None]:
# Fit all GA-based workflows
diversity_analysis = {}

for wf_id in wf_set.workflows.keys():
    if 'ga_' in wf_id or 'diversity' in wf_id:
        print(f"\nFitting: {wf_id}")
        
        wf = wf_set[wf_id]
        fit = wf.fit(train)
        
        # Get prepared recipe
        prepped = fit.extract_preprocessor()
        
        # Find GA step
        ga_step = None
        for step in prepped.prepared_steps:
            if hasattr(step, '_selected_features'):
                ga_step = step
                break
        
        if ga_step is not None:
            diversity_analysis[wf_id] = {
                'selected_features': ga_step._selected_features,
                'n_features': len(ga_step._selected_features),
                'converged': getattr(ga_step, '_converged', None),
                'n_generations': getattr(ga_step, '_n_generations', None),
                'diversity_history': getattr(ga_step, '_diversity_history', None)
            }

print("\n✓ Diversity analysis complete")

## 11. Visualize Diversity Evolution

In [None]:
# Plot diversity over generations
fig, ax = plt.subplots(figsize=(12, 6))

for wf_id, analysis in diversity_analysis.items():
    if analysis['diversity_history'] is not None:
        # 'diversity_3' becomes 'thresh=0.3' so the legend shows the actual threshold
        label = wf_id.replace('_linear_reg', '').replace('_', ' ').replace('diversity ', 'thresh=0.')
        ax.plot(analysis['diversity_history'], marker='o', label=label, linewidth=2)

ax.set_xlabel('Generation', fontsize=11)
ax.set_ylabel('Population Diversity', fontsize=11)
ax.set_title('Diversity Evolution Over Generations', fontsize=13, fontweight='bold')
ax.legend(loc='best')
ax.grid(alpha=0.3)

# Add threshold lines
for threshold in diversity_thresholds:
    ax.axhline(y=threshold, color='red', linestyle='--', alpha=0.3, linewidth=1)
    ax.text(0, threshold, f'  {threshold}', color='red', fontsize=9)

plt.tight_layout()
plt.savefig('production_diversity_evolution.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nPlot saved as: production_diversity_evolution.png")

## 12. Feature Stability Analysis

Compare which features were selected by each configuration.
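
Alongside the selection matrix, a pairwise overlap score gives a single number for how similar two configurations' feature subsets are. The sketch below uses the Jaccard index (intersection size divided by union size). The `selections` dictionary holds made-up feature lists for illustration only; in the notebook the inputs would come from the fitted GA steps.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two feature collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty selections are treated as identical
    return len(a & b) / len(union) if union else 1.0

# Hypothetical selections, NOT results from the actual run
selections = {
    'ga_standard': ['value_lag_1', 'value_lag_12', 'month_sin'],
    'diversity_3': ['value_lag_1', 'value_lag_12', 'value_ma_3'],
    'diversity_5': ['value_lag_1', 'value_ma_6', 'month_cos'],
}

overlaps = {
    (left, right): jaccard(selections[left], selections[right])
    for left, right in combinations(selections, 2)
}
for (left, right), score in overlaps.items():
    print(f"{left} vs {right}: Jaccard = {score:.2f}")
```

Values near 1 mean two configurations agree on almost every feature; values near 0 mean the GA found genuinely different subsets.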

In [None]:
# Create feature selection matrix
all_features = set()
for analysis in diversity_analysis.values():
    all_features.update(analysis['selected_features'])

all_features = sorted(all_features)

feature_matrix = pd.DataFrame(0, index=all_features, 
                               columns=[wf_id.replace('_linear_reg', '') 
                                        for wf_id in diversity_analysis.keys()])

for wf_id, analysis in diversity_analysis.items():
    col_name = wf_id.replace('_linear_reg', '')
    for feat in analysis['selected_features']:
        feature_matrix.loc[feat, col_name] = 1

# Calculate feature selection frequency
feature_matrix['total'] = feature_matrix.sum(axis=1)
feature_matrix = feature_matrix.sort_values('total', ascending=False)

print("\n=== Feature Selection Matrix ===")
print("(1 = selected, 0 = not selected)")
print(feature_matrix)

# Identify stable features (selected by all or most configurations)
n_configs = len(diversity_analysis)
stable_features = feature_matrix[feature_matrix['total'] >= n_configs * 0.8].index.tolist()

print(f"\n=== Stable Features (selected by ≥80% of configurations) ===")
for feat in stable_features:
    print(f"  - {feat}")

## 13. Select Best Workflow

In [None]:
# Get best workflow
best_wf_id = ranked.iloc[0]['wflow_id']
print(f"Best workflow: {best_wf_id}")
print(f"CV RMSE: {ranked.iloc[0]['rmse']:.2f}")

# Fit on full training data
best_wf = wf_set[best_wf_id]
best_fit = best_wf.fit(train)

# Evaluate on test set
test_eval = best_fit.evaluate(test)
outputs, coeffs, stats = test_eval.extract_outputs()

print(f"\n=== Test Set Performance ===")
test_stats = stats[stats['split'] == 'test']
print(f"RMSE: {test_stats['rmse'].values[0]:.2f}")
print(f"MAE: {test_stats['mae'].values[0]:.2f}")
print(f"R²: {test_stats['rsq'].values[0]:.4f}")

# Show selected features
if best_wf_id in diversity_analysis:
    print(f"\n=== Selected Features ===")
    analysis = diversity_analysis[best_wf_id]
    print(f"Number of features: {analysis['n_features']}")
    print(f"\nFeatures:")
    for feat in analysis['selected_features']:
        stable_marker = '★' if feat in stable_features else ' '
        print(f"  {stable_marker} {feat}")
    print("\n★ = stable feature (selected by ≥80% of configurations)")
    
    if analysis['diversity_history'] is not None:
        print(f"\n=== Diversity Statistics ===")
        div_hist = analysis['diversity_history']
        print(f"Initial diversity: {div_hist[0]:.3f}")
        print(f"Final diversity: {div_hist[-1]:.3f}")
        print(f"Mean diversity: {np.mean(div_hist):.3f}")
        print(f"Min diversity: {np.min(div_hist):.3f}")
        print(f"Max diversity: {np.max(div_hist):.3f}")

## 14. Comprehensive Comparison Table

In [None]:
# Create comprehensive comparison
comparison_data = []

for wf_id in wf_set.workflows.keys():
    # Get CV performance
    cv_perf = cv_metrics[cv_metrics['wflow_id'] == wf_id].iloc[0]
    
    # Get test performance
    wf = wf_set[wf_id]
    fit = wf.fit(train)
    test_eval = fit.evaluate(test)
    _, _, test_stats = test_eval.extract_outputs()
    test_perf = test_stats[test_stats['split'] == 'test'].iloc[0]
    
    # Get feature count
    if wf_id in diversity_analysis:
        n_features = diversity_analysis[wf_id]['n_features']
        n_stable = sum(1 for f in diversity_analysis[wf_id]['selected_features'] if f in stable_features)
        
        # Get diversity stats
        if diversity_analysis[wf_id]['diversity_history'] is not None:
            div_mean = np.mean(diversity_analysis[wf_id]['diversity_history'])
            div_final = diversity_analysis[wf_id]['diversity_history'][-1]
        else:
            div_mean, div_final = None, None
    else:
        n_features = len([c for c in train.columns if c != target_col])
        n_stable = None
        div_mean, div_final = None, None
    
    comparison_data.append({
        'Configuration': wf_id.replace('_linear_reg', ''),
        'N Features': n_features,
        'N Stable': n_stable,
        'CV RMSE': cv_perf['rmse'],
        'Test RMSE': test_perf['rmse'],
        'Test R²': test_perf['rsq'],
        'Mean Diversity': div_mean,
        'Final Diversity': div_final
    })

comparison_df = pd.DataFrame(comparison_data).sort_values('Test RMSE')
print("\n=== Comprehensive Comparison ===")
print(comparison_df.to_string(index=False))

## 15. Key Takeaways

### Diversity Maintenance Benefits:
1. **Prevents Premature Convergence**: Maintains exploration throughout evolution
2. **Alternative Solutions**: Finds different feature subsets with similar performance
3. **Robustness**: Less sensitive to initialization and local optima
4. **Feature Stability**: Helps identify truly important features vs. noise

### Threshold Selection:
- **Low (0.2-0.3)**: Less intervention, closer to standard GA, faster convergence
- **Moderate (0.3-0.4)**: Balanced exploration/exploitation (recommended)
- **High (0.4-0.5)**: Frequent intervention, maintains high diversity, slower convergence, explores more

### Diversity Evolution Patterns:
1. Diversity naturally decreases as GA converges
2. Fitness sharing activates when diversity drops below threshold
3. Multiple activation cycles indicate thorough search
4. Final diversity > 0.2 suggests healthy population variation
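
The activation pattern described in this list can be read off a diversity history directly. The sketch below assumes the rule stated in item 2, that intervention is active in any generation whose diversity is below the threshold. The `history` values are invented for illustration and `count_activations` is a hypothetical helper, not part of the library.

```python
import numpy as np

def count_activations(diversity_history, threshold):
    """Return (generations below threshold, number of separate activation cycles)."""
    below = np.asarray(diversity_history) < threshold
    # A cycle starts at a below-threshold generation whose predecessor was not below
    previous_below = np.concatenate(([False], below[:-1]))
    cycle_starts = below & ~previous_below
    return int(below.sum()), int(cycle_starts.sum())

# Invented diversity history, one value per generation
history = [0.48, 0.41, 0.33, 0.27, 0.31, 0.24, 0.19, 0.26, 0.22]

for threshold in [0.2, 0.3, 0.4, 0.5]:
    n_below, n_cycles = count_activations(history, threshold)
    print(f"threshold={threshold}: {n_below} generations below, {n_cycles} cycle(s)")
```

On the same history, a higher threshold is crossed in more generations, so it intervenes more often, while a low threshold only reacts once the population has nearly collapsed.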

### Feature Stability Insights:
- Features selected consistently across thresholds are most reliable
- Stable features should be prioritized in production models
- Unstable features may be noise or marginally useful

### Comparison with Standard GA:
- Standard GA may converge faster but risk local optima
- Diversity-maintained GA explores more thoroughly
- If the diversity-maintained GA beats the standard GA, even by a small margin, the standard GA may have stalled in a local optimum

### WorkflowSet Advantages:
- Systematic threshold comparison
- Robust performance estimation via CV
- Easy identification of optimal diversity settings

### Production Recommendations:
1. Use diversity maintenance for complex problems with many features
2. Start with threshold 0.3-0.4 for most applications
3. Monitor diversity history to verify adequate exploration
4. Focus on stable features for production deployment
5. Run multiple diversity thresholds to identify feature stability
6. Use longer runs (40+ generations) for complex fitness landscapes