# Example 27: Gas Demand Forecasting with WorkflowSet and NSGA-II Multi-Objective Optimization

This notebook demonstrates **NSGA-II (Non-dominated Sorting Genetic Algorithm II)** for multi-objective feature selection using WorkflowSet for systematic comparison.

**Dataset**: European gas demand with weather data

**Key Techniques**:
- Multi-objective optimization (performance + sparsity + cost)
- Pareto front exploration
- Different solution selection methods
- WorkflowSet for comparing objective combinations
- Cross-validation with multi-objective workflows
- **Enhancement demonstrated**: NSGA-II multi-objective feature selection

**NSGA-II Benefits**:
- Optimize multiple conflicting objectives simultaneously
- Explore trade-offs between performance, sparsity, and cost
- Find Pareto-optimal solutions (non-dominated set)
- Flexible solution selection strategies
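
The idea of a non-dominated set can be made concrete with a small sketch. It assumes every objective is minimized, and the candidate values are invented for illustration; `pareto_mask` is a hypothetical helper, not part of any library used in this notebook.

```python
import numpy as np

def pareto_mask(objectives: np.ndarray) -> np.ndarray:
    """Return a boolean mask marking non-dominated rows (all objectives minimized)."""
    n = objectives.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        # Another row dominates row i if it is no worse on every objective
        # and strictly better on at least one.
        no_worse = np.all(objectives <= objectives[i], axis=1)
        better = np.any(objectives < objectives[i], axis=1)
        if np.any(no_worse & better):
            mask[i] = False
    return mask

# Hypothetical (RMSE, n_features) pairs, invented for illustration only.
candidates = np.array([
    [10.0, 12],
    [11.0, 6],
    [12.0, 8],   # dominated by [11.0, 6]
    [15.0, 3],
    [15.5, 3],   # dominated by [15.0, 3]
])
front = candidates[pareto_mask(candidates)]
print(front)
```

The three surviving rows form the Pareto front: none of them can be improved on one objective without getting worse on the other.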

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from py_recipes import recipe, all_numeric_predictors
from py_workflows import workflow
from py_parsnip import linear_reg, rand_forest
from py_yardstick import rmse, mae, r_squared, metric_set
from py_rsample import time_series_cv
from py_workflowsets import WorkflowSet

# Set random seed for reproducibility
np.random.seed(42)

## 1. Data Loading and Preparation

In [None]:
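# Load gas demand data
data = pd.read_csv('../_md/__data/european_gas_demand_weather_data.csv')
data['date'] = pd.to_datetime(data['date'])

# Sort chronologically within each country so that shift(), rolling(),
# and the positional train/test split later on all respect time order
data = data.sort_values(['country', 'date']).reset_index(drop=True)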

print(f"Data shape: {data.shape}")
print(f"\nColumns: {list(data.columns)}")
print(f"\nCountries: {sorted(data['country'].unique())}")
print(f"\nDate range: {data['date'].min()} to {data['date'].max()}")
data.head()

## 2. Feature Engineering

In [None]:
# Add temporal and interaction features
data['month'] = data['date'].dt.month
data['day_of_week'] = data['date'].dt.dayofweek
data['day_of_year'] = data['date'].dt.dayofyear
data['temp_squared'] = data['temperature'] ** 2
data['wind_squared'] = data['wind_speed'] ** 2
data['temp_wind_interaction'] = data['temperature'] * data['wind_speed']

# Temperature bins (cold, mild, warm)
data['temp_cold'] = (data['temperature'] < 5).astype(int)
data['temp_warm'] = (data['temperature'] > 15).astype(int)

# Wind bins
data['wind_high'] = (data['wind_speed'] > data['wind_speed'].median()).astype(int)

# Lag features
data['gas_demand_lag1'] = data.groupby('country')['gas_demand'].shift(1)
data['gas_demand_lag7'] = data.groupby('country')['gas_demand'].shift(7)
data['temp_lag1'] = data.groupby('country')['temperature'].shift(1)
data['temp_lag3'] = data.groupby('country')['temperature'].shift(3)

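# Moving averages
# gas_demand is the target, so shift by one day before averaging: the window
# then covers only past values and does not leak the value being predicted
data['gas_demand_ma7'] = data.groupby('country')['gas_demand'].transform(lambda x: x.shift(1).rolling(7, min_periods=1).mean())
data['temp_ma7'] = data.groupby('country')['temperature'].transform(lambda x: x.rolling(7, min_periods=1).mean())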

# Drop rows with NaN from lagging
data = data.dropna()

print(f"After feature engineering: {data.shape}")
print(f"Features: {[col for col in data.columns if col not in ['date', 'country', 'gas_demand']]}")
print(f"Total features: {len([col for col in data.columns if col not in ['date', 'country', 'gas_demand']])}")
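
One point worth checking for any rolling feature built from the target: a window that ends at the current row includes the current value. The toy comparison below (values invented for illustration) shows how shifting by one step before rolling keeps the window on strictly earlier rows.

```python
import pandas as pd

# Toy demand series, invented for illustration only.
demand = pd.Series([10.0, 20.0, 30.0, 40.0])

# Window ending at the current row: the value being predicted is inside the average.
ma_including_today = demand.rolling(2, min_periods=1).mean()

# Shift by one step first so the window only covers strictly earlier rows.
ma_past_only = demand.shift(1).rolling(2, min_periods=1).mean()

print(pd.DataFrame({
    "demand": demand,
    "ma_including_today": ma_including_today,
    "ma_past_only": ma_past_only,
}))
```

The first row of `ma_past_only` is NaN because no earlier observation exists; a later `dropna()` removes it together with the rows lost to lagging.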

## 3. Define Feature Costs

Assign hypothetical costs to features based on data collection/computation complexity.

In [None]:
# Define feature costs (hypothetical)
feature_costs = {
    # Raw weather features (easy to collect)
    'temperature': 1.0,
    'wind_speed': 1.0,
    
    # Temporal features (near-free - derived from the date)
    'month': 0.1,
    'day_of_week': 0.1,
    'day_of_year': 0.1,
    
    # Derived features (cheap computation)
    'temp_squared': 0.5,
    'wind_squared': 0.5,
    'temp_wind_interaction': 0.5,
    'temp_cold': 0.5,
    'temp_warm': 0.5,
    'wind_high': 0.5,
    
    # Lag features (moderate - require historical data storage)
    'gas_demand_lag1': 2.0,
    'gas_demand_lag7': 2.0,
    'temp_lag1': 1.5,
    'temp_lag3': 1.5,
    
    # Moving averages (expensive - require rolling computation)
    'gas_demand_ma7': 3.0,
    'temp_ma7': 2.5
}

print("Feature costs (hypothetical):")
for feat, cost in sorted(feature_costs.items(), key=lambda x: x[1], reverse=True):
    print(f"  {feat}: ${cost:.1f}")

print(f"\nTotal cost if using all features: ${sum(feature_costs.values()):.1f}")
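
As a rough sketch of why cost is treated as its own objective: two subsets with the same number of features can differ widely in total cost. The helper `subset_objectives` is hypothetical, and it assumes the cost of a subset is simply the sum of its per-feature costs; how the library computes its cost objective internally is not shown in this notebook.

```python
def subset_objectives(selected, costs):
    """Return (n_features, total_cost) for a list of selected feature names."""
    n_features = len(selected)
    # Assumption: subset cost is the plain sum of per-feature costs.
    total_cost = sum(costs.get(name, 0.0) for name in selected)
    return n_features, total_cost

# A small cost table mirroring a few entries from the dictionary above.
example_costs = {
    "temperature": 1.0,
    "month": 0.1,
    "gas_demand_lag1": 2.0,
    "gas_demand_ma7": 3.0,
}

cheap = subset_objectives(["temperature", "month"], example_costs)
pricey = subset_objectives(["gas_demand_lag1", "gas_demand_ma7"], example_costs)
print(cheap, pricey)
```

Both subsets have two features, so a sparsity objective alone cannot tell them apart, while the cost objective can.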

## 4. Select Single Country for Analysis

In [None]:
# Focus on Germany for demonstration
country = 'Germany'
country_data = data[data['country'] == country].copy()

# Train/test split (80/20)
split_idx = int(len(country_data) * 0.8)
train_data = country_data.iloc[:split_idx].copy()
test_data = country_data.iloc[split_idx:].copy()

print(f"Country: {country}")
print(f"Train: {train_data.shape[0]} rows, {train_data['date'].min()} to {train_data['date'].max()}")
print(f"Test: {test_data.shape[0]} rows, {test_data['date'].min()} to {test_data['date'].max()}")

# Remove date and country columns
train = train_data.drop(['date', 'country'], axis=1).copy()
test = test_data.drop(['date', 'country'], axis=1).copy()

print(f"\nFeatures: {train.shape[1] - 1} (target: gas_demand)")

## 5. Define Multiple NSGA-II Configurations

Create recipes with different objective combinations and selection methods.

In [None]:
# Baseline: All features
rec_all = recipe(train).step_normalize(all_numeric_predictors())

# Standard GA (single objective - performance only)
rec_ga_standard = (
    recipe(train)
    .step_normalize(all_numeric_predictors())
    .step_select_genetic_algorithm(
        outcome='gas_demand',
        model=linear_reg(),
        metric='rmse',
        top_n=10,
        population_size=30,
        generations=20,
        cv_folds=3,
        random_state=42
    )
)

# NSGA-II: 2 objectives (performance + sparsity) - knee point
rec_nsga2_perf_sparse_knee = (
    recipe(train)
    .step_normalize(all_numeric_predictors())
    .step_select_genetic_algorithm(
        outcome='gas_demand',
        model=linear_reg(),
        metric='rmse',
        use_nsga2=True,
        nsga2_objectives=['performance', 'sparsity'],
        nsga2_selection_method='knee_point',
        population_size=30,
        generations=20,
        cv_folds=3,
        random_state=42
    )
)

# NSGA-II: 2 objectives - min features
rec_nsga2_perf_sparse_minf = (
    recipe(train)
    .step_normalize(all_numeric_predictors())
    .step_select_genetic_algorithm(
        outcome='gas_demand',
        model=linear_reg(),
        metric='rmse',
        use_nsga2=True,
        nsga2_objectives=['performance', 'sparsity'],
        nsga2_selection_method='min_features',
        population_size=30,
        generations=20,
        cv_folds=3,
        random_state=42
    )
)

# NSGA-II: 2 objectives - best performance
rec_nsga2_perf_sparse_bestp = (
    recipe(train)
    .step_normalize(all_numeric_predictors())
    .step_select_genetic_algorithm(
        outcome='gas_demand',
        model=linear_reg(),
        metric='rmse',
        use_nsga2=True,
        nsga2_objectives=['performance', 'sparsity'],
        nsga2_selection_method='best_performance',
        population_size=30,
        generations=20,
        cv_folds=3,
        random_state=42
    )
)

# NSGA-II: 3 objectives (performance + sparsity + cost) - knee point
rec_nsga2_3obj_knee = (
    recipe(train)
    .step_normalize(all_numeric_predictors())
    .step_select_genetic_algorithm(
        outcome='gas_demand',
        model=linear_reg(),
        metric='rmse',
        use_nsga2=True,
        nsga2_objectives=['performance', 'sparsity', 'cost'],
        feature_costs=feature_costs,
        nsga2_selection_method='knee_point',
        population_size=30,
        generations=20,
        cv_folds=3,
        random_state=42
    )
)

# NSGA-II: 3 objectives - min features
rec_nsga2_3obj_minf = (
    recipe(train)
    .step_normalize(all_numeric_predictors())
    .step_select_genetic_algorithm(
        outcome='gas_demand',
        model=linear_reg(),
        metric='rmse',
        use_nsga2=True,
        nsga2_objectives=['performance', 'sparsity', 'cost'],
        feature_costs=feature_costs,
        nsga2_selection_method='min_features',
        population_size=30,
        generations=20,
        cv_folds=3,
        random_state=42
    )
)

recipes = {
    'all_features': rec_all,
    'ga_standard': rec_ga_standard,
    'nsga2_2obj_knee': rec_nsga2_perf_sparse_knee,
    'nsga2_2obj_minf': rec_nsga2_perf_sparse_minf,
    'nsga2_2obj_bestp': rec_nsga2_perf_sparse_bestp,
    'nsga2_3obj_knee': rec_nsga2_3obj_knee,
    'nsga2_3obj_minf': rec_nsga2_3obj_minf
}

print(f"Defined {len(recipes)} preprocessing strategies")


## 6. Create WorkflowSet

In [None]:
# Use linear regression for all workflows
models = {'linear_reg': linear_reg()}

# Create WorkflowSet
wf_set = WorkflowSet.from_cross(
    preproc=recipes,
    models=models,
    ids=list(recipes.keys())
)

print(f"\nCreated WorkflowSet with {len(wf_set.workflows)} workflows")
print(f"\nWorkflow IDs:")
for wf_id in wf_set.workflows.keys():
    print(f"  - {wf_id}")

## 7. Time Series Cross-Validation

In [None]:
# Create CV splits
n_train = len(train)
initial_size = int(n_train * 0.5)
assess_size = int(n_train * 0.1)

cv_folds = time_series_cv(
    train_data,
    date_column='date',
    initial=initial_size,
    assess=assess_size,
    skip=assess_size,
    cumulative=True
)

print(f"Created {cv_folds.n_splits} CV folds")
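
For intuition about the fold layout, here is a minimal sketch of an expanding-window scheme. The function and its `step` parameter are hypothetical; how `time_series_cv` interprets `skip` is not confirmed here, so the fold count it reports may differ from this sketch.

```python
def expanding_window_splits(n_rows, initial, assess, step):
    """Return a list of (train_range, assess_range) pairs for an expanding window."""
    splits = []
    train_end = initial
    # Keep adding folds while a full assessment window still fits in the data.
    while train_end + assess <= n_rows:
        splits.append((range(0, train_end), range(train_end, train_end + assess)))
        train_end += step
    return splits

# Illustrative sizes: 50% initial window, 10% assessment window, 10% step.
splits = expanding_window_splits(n_rows=1000, initial=500, assess=100, step=100)
for i, (tr, te) in enumerate(splits, start=1):
    print(f"fold {i}: train rows 0-{tr.stop - 1}, assess rows {te.start}-{te.stop - 1}")
```

Each fold trains on everything before its assessment window, so later folds see more history than earlier ones.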

## 8. Evaluate All Workflows

**Note**: This will take several minutes.

In [None]:
# Define metrics
metrics = metric_set(rmse, mae, r_squared)

# Fit resamples
print("Fitting all workflows across CV folds...")
print(f"Total fits: {len(wf_set.workflows)} workflows × {cv_folds.n_splits} folds = {len(wf_set.workflows) * cv_folds.n_splits}")
print("\nThis may take 5-10 minutes...\n")

wf_results = wf_set.fit_resamples(
    resamples=cv_folds,
    metrics=metrics
)

print("\n✓ CV evaluation complete!")

## 9. Collect and Analyze Results

In [None]:
# Collect metrics
cv_metrics = wf_results.collect_metrics()

print("\n=== Cross-Validation Results ===")
print(cv_metrics.to_string(index=False))

# Rank by RMSE
ranked = wf_results.rank_results('rmse', n=10)
print("\n=== Ranked Workflows (by RMSE) ===")
print(ranked[['wflow_id', 'rmse_mean', 'mae_mean', 'r_squared_mean', 'rank']].to_string(index=False))

## 10. Visualization: Method Comparison

In [None]:
# Use autoplot
fig = wf_results.autoplot('rmse')
fig.update_layout(height=600, title='NSGA-II Configuration Comparison: RMSE')
fig.show()

In [None]:
# Custom bar plot
fig, ax = plt.subplots(figsize=(12, 6))

# Get RMSE values from cv_metrics (long format - need to filter for rmse metric)
rmse_metrics = cv_metrics[cv_metrics['metric'] == 'rmse'].copy()
sorted_metrics = rmse_metrics.sort_values('mean')
labels = sorted_metrics['wflow_id'].str.replace('_linear_reg', '').str.replace('_', ' ').str.title()

# Color coding
colors = []
for wf_id in sorted_metrics['wflow_id']:
    if 'nsga2_3obj' in wf_id:
        colors.append('darkgreen')
    elif 'nsga2_2obj' in wf_id:
        colors.append('steelblue')
    elif 'ga_standard' in wf_id:
        colors.append('orange')
    else:
        colors.append('lightgray')

bars = ax.barh(range(len(labels)), sorted_metrics['mean'], color=colors, alpha=0.8)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
ax.set_xlabel('RMSE (lower is better)', fontsize=11)
ax.set_title('Feature Selection Strategy Comparison: RMSE', fontsize=13, fontweight='bold')
ax.grid(axis='x', alpha=0.3)

# Add value labels
for bar, val in zip(bars, sorted_metrics['mean']):
    ax.text(val, bar.get_y() + bar.get_height()/2, f'{val:.2f}', 
            ha='left', va='center', fontsize=9, fontweight='bold')

# Legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='lightgray', alpha=0.8, label='Baseline (All Features)'),
    Patch(facecolor='orange', alpha=0.8, label='Standard GA'),
    Patch(facecolor='steelblue', alpha=0.8, label='NSGA-II (2 Objectives)'),
    Patch(facecolor='darkgreen', alpha=0.8, label='NSGA-II (3 Objectives)')
]
ax.legend(handles=legend_elements, loc='lower right')

plt.tight_layout()
plt.savefig('gas_demand_nsga2_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nPlot saved as: gas_demand_nsga2_comparison.png")

## 11. Select and Inspect Best Workflow

In [None]:
# Get best workflow
best_wf_id = ranked.iloc[0]['wflow_id']
print(f"Best workflow: {best_wf_id}")
print(f"CV RMSE: {ranked.iloc[0]['rmse_mean']:.2f}")

# Fit on full training data
best_wf = wf_set[best_wf_id]
best_fit = best_wf.fit(train)

# Evaluate on test set
test_eval = best_fit.evaluate(test)
outputs, coeffs, stats = test_eval.extract_outputs()

print(f"\n=== Test Set Performance ===")
test_stats = stats[stats['split'] == 'test']
print(f"RMSE: {test_stats['rmse'].values[0]:.2f}")
print(f"MAE: {test_stats['mae'].values[0]:.2f}")
print(f"R²: {test_stats['rsq'].values[0]:.4f}")

## 12. Inspect Selected Features and Pareto Front

In [None]:
if 'nsga2' in best_wf_id or 'ga_' in best_wf_id:
    # Get the prepared recipe
    prepped_recipe = best_fit.extract_preprocessor()
    
    # Find the genetic algorithm step
    ga_step = None
    for step in prepped_recipe.prepared_steps:
        if hasattr(step, '_selected_features'):
            ga_step = step
            break
    
    if ga_step is not None:
        print(f"\n=== Selected Features ===")
        selected_features = ga_step._selected_features
        print(f"Number of features: {len(selected_features)}")
        print(f"\nFeatures:")
        for feat in selected_features:
            cost = feature_costs.get(feat, 0)
            print(f"  - {feat} (cost: ${cost:.1f})")
        
        total_cost = sum(feature_costs.get(f, 0) for f in selected_features)
        max_cost = sum(feature_costs.values())
        print(f"\nTotal cost: ${total_cost:.1f} (vs ${max_cost:.1f} for all features)")
        print(f"Cost savings: ${max_cost - total_cost:.1f} ({(1 - total_cost/max_cost)*100:.1f}%)")
        
        # If NSGA-II, show Pareto front
        if hasattr(ga_step, '_pareto_front'):
            print(f"\n=== Pareto Front Details ===")
            print(f"Pareto front size: {len(ga_step._pareto_front)}")
            print(f"\nObjective values for Pareto solutions:")
            print(f"(Each row is a solution on the Pareto front)")
            print(ga_step._pareto_objectives)

## 13. Visualize Pareto Front (If NSGA-II)

Show the trade-off between objectives.

In [None]:
if 'nsga2' in best_wf_id and ga_step is not None and hasattr(ga_step, '_pareto_objectives'):
    pareto_objs = ga_step._pareto_objectives
    
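    # Recompute the knee point locally. This is assumed to mirror the library's
    # 'knee_point' selection; for 'min_features' or 'best_performance'
    # configurations the highlighted point may differ from the solution actually used.
    def find_knee_point(objectives):
        """Find the knee: the point farthest from the line joining the two extremes."""
        if len(objectives) == 1:
            return 0
        
        # Extremes of the front: best performance and fewest features (both minimized)
        perf_idx = np.argmin(objectives[:, 0])
        sparse_idx = np.argmin(objectives[:, 1])
        
        if perf_idx == sparse_idx:
            return perf_idx
        
        p1, p2 = objectives[perf_idx], objectives[sparse_idx]
        line = p2 - p1
        rel = objectives - p1
        
        # Perpendicular distance via the 2D cross product, written out explicitly
        # because np.cross on 2-element vectors is deprecated in NumPy 2.0
        dist = np.abs(line[0] * rel[:, 1] - line[1] * rel[:, 0]) / np.linalg.norm(line)
        
        return int(np.argmax(dist))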
    
    if pareto_objs.shape[1] == 2:
        # 2D Pareto front
        selected_idx = find_knee_point(pareto_objs)
        
        fig, ax = plt.subplots(figsize=(10, 6))
        
        # Plot all Pareto solutions
        ax.scatter(pareto_objs[:, 0], pareto_objs[:, 1], 
                   s=80, c='steelblue', alpha=0.6, label='Pareto Front')
        
        # Highlight selected solution
        ax.scatter(pareto_objs[selected_idx, 0], pareto_objs[selected_idx, 1],
                   s=200, c='red', marker='*', edgecolors='black', 
                   linewidths=2, label='Selected Solution', zorder=5)
        
        ax.set_xlabel('Performance (RMSE)', fontsize=11)
        ax.set_ylabel('Sparsity (Number of Features)', fontsize=11)
        ax.set_title('Pareto Front: Performance vs Sparsity Trade-off', 
                     fontsize=13, fontweight='bold')
        ax.legend()
        ax.grid(alpha=0.3)
        
        plt.tight_layout()
        plt.savefig('gas_demand_pareto_front_2d.png', dpi=150, bbox_inches='tight')
        plt.show()
        
        print(f"\nSelected solution (knee point): index {selected_idx}")
        print(f"  Performance (RMSE): {pareto_objs[selected_idx, 0]:.2f}")
        print(f"  Sparsity (# features): {int(pareto_objs[selected_idx, 1])}")
        print("\nPlot saved as: gas_demand_pareto_front_2d.png")
        
    elif pareto_objs.shape[1] == 3:
        # 3D Pareto front
        selected_idx = find_knee_point(pareto_objs[:, :2])  # Use first 2 objectives for knee
        
        fig = plt.figure(figsize=(14, 5))
        
        # 3D plot
        ax1 = fig.add_subplot(131, projection='3d')
        ax1.scatter(pareto_objs[:, 0], pareto_objs[:, 1], pareto_objs[:, 2],
                    c='steelblue', s=50, alpha=0.6)
        ax1.scatter(pareto_objs[selected_idx, 0], pareto_objs[selected_idx, 1], 
                    pareto_objs[selected_idx, 2],
                    c='red', s=200, marker='*', edgecolors='black', linewidths=2)
        ax1.set_xlabel('Performance (RMSE)')
        ax1.set_ylabel('Sparsity (# Features)')
        ax1.set_zlabel('Cost ($)')
        ax1.set_title('3D Pareto Front', fontweight='bold')
        
        # 2D projection: Performance vs Sparsity
        ax2 = fig.add_subplot(132)
        ax2.scatter(pareto_objs[:, 0], pareto_objs[:, 1], c='steelblue', s=50, alpha=0.6)
        ax2.scatter(pareto_objs[selected_idx, 0], pareto_objs[selected_idx, 1],
                    c='red', s=200, marker='*', edgecolors='black', linewidths=2)
        ax2.set_xlabel('Performance (RMSE)')
        ax2.set_ylabel('Sparsity (# Features)')
        ax2.set_title('Performance vs Sparsity', fontweight='bold')
        ax2.grid(alpha=0.3)
        
        # 2D projection: Performance vs Cost
        ax3 = fig.add_subplot(133)
        ax3.scatter(pareto_objs[:, 0], pareto_objs[:, 2], c='steelblue', s=50, alpha=0.6)
        ax3.scatter(pareto_objs[selected_idx, 0], pareto_objs[selected_idx, 2],
                    c='red', s=200, marker='*', edgecolors='black', linewidths=2)
        ax3.set_xlabel('Performance (RMSE)')
        ax3.set_ylabel('Cost ($)')
        ax3.set_title('Performance vs Cost', fontweight='bold')
        ax3.grid(alpha=0.3)
        
        plt.tight_layout()
        plt.savefig('gas_demand_pareto_front_3d.png', dpi=150, bbox_inches='tight')
        plt.show()
        
        print(f"\nSelected solution (knee point): index {selected_idx}")
        print(f"  Performance (RMSE): {pareto_objs[selected_idx, 0]:.2f}")
        print(f"  Sparsity (# features): {int(pareto_objs[selected_idx, 1])}")
        print(f"  Cost ($): {pareto_objs[selected_idx, 2]:.2f}")
        print("\nPlot saved as: gas_demand_pareto_front_3d.png")

## 14. Compare All NSGA-II Configurations

In [None]:
# Refit every workflow (baseline, standard GA, NSGA-II) on the full training data to inspect feature selections
nsga2_comparison = []

for wf_id in wf_set.workflows.keys():
    if 'nsga2' in wf_id or 'ga_standard' in wf_id or 'all_features' in wf_id:
        print(f"\nProcessing: {wf_id}")
        
        wf = wf_set[wf_id]
        fit = wf.fit(train)
        
        # Get selected features
        n_features = None
        total_cost = None
        
        if 'all_features' not in wf_id:
            prepped = fit.extract_preprocessor()
            for step in prepped.prepared_steps:
                if hasattr(step, '_selected_features'):
                    selected = step._selected_features
                    n_features = len(selected)
                    total_cost = sum(feature_costs.get(f, 0) for f in selected)
                    break
        else:
            n_features = len([c for c in train.columns if c != 'gas_demand'])
            total_cost = sum(feature_costs.values())
        
        # Get test performance
        test_eval = fit.evaluate(test)
        _, _, stats = test_eval.extract_outputs()
        test_stats = stats[stats['split'] == 'test']
        
        nsga2_comparison.append({
            'Configuration': wf_id.replace('_linear_reg', ''),
            'N Features': n_features,
            'Total Cost ($)': total_cost,
            'Test RMSE': test_stats['rmse'].values[0],
            'Test MAE': test_stats['mae'].values[0],
            'Test R²': test_stats['rsq'].values[0]
        })

nsga2_df = pd.DataFrame(nsga2_comparison).sort_values('Test RMSE')
print("\n=== NSGA-II Configuration Comparison (Test Set) ===")
print(nsga2_df.to_string(index=False))

## 15. Key Takeaways

### NSGA-II Multi-Objective Optimization:
1. **Pareto Front**: Provides a set of non-dominated solutions, each representing an optimal trade-off between the objectives
2. **Multiple Objectives**: Simultaneously optimize performance, sparsity, and cost
3. **Selection Methods**:
   - `knee_point`: Balanced trade-off (recommended)
   - `min_features`: Prioritizes sparsity
   - `best_performance`: Prioritizes accuracy
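
A minimal sketch of what the three selection methods could look like on a two-objective front. This is one plausible reading of the method names, not the library's implementation: `select_solution` is hypothetical, both objectives are assumed to be minimized, and the knee point is taken as the point farthest from the line joining the two extremes after rescaling each objective to [0, 1].

```python
import numpy as np

def select_solution(objectives, method):
    """Pick a row index from a Pareto set of (error, n_features) pairs, both minimized."""
    if method == "best_performance":
        return int(np.argmin(objectives[:, 0]))
    if method == "min_features":
        return int(np.argmin(objectives[:, 1]))
    if method == "knee_point":
        # Rescale each objective to [0, 1] so neither axis dominates the distance.
        low = objectives.min(axis=0)
        span = objectives.max(axis=0) - low
        span[span == 0] = 1.0
        scaled = (objectives - low) / span
        p1 = scaled[np.argmin(scaled[:, 0])]   # best-performance extreme
        p2 = scaled[np.argmin(scaled[:, 1])]   # fewest-features extreme
        line = p2 - p1
        norm = np.linalg.norm(line)
        if norm == 0:
            return int(np.argmin(objectives[:, 0]))
        rel = scaled - p1
        # Perpendicular distance of every point from the line p1 -> p2.
        dist = np.abs(line[0] * rel[:, 1] - line[1] * rel[:, 0]) / norm
        return int(np.argmax(dist))
    raise ValueError(f"unknown method: {method}")

# Hypothetical front of (RMSE, n_features) pairs, invented for illustration.
front = np.array([[10.0, 12.0], [10.5, 5.0], [14.0, 3.0]])
for method in ("best_performance", "min_features", "knee_point"):
    print(method, "->", front[select_solution(front, method)])
```

On this toy front the knee point is the middle solution: it gives up a little accuracy in exchange for dropping more than half of the features.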

### 2 vs 3 Objectives:
- **2 objectives** (performance + sparsity): A simpler search space that typically converges faster
- **3 objectives** (+ cost): More realistic for production with economic constraints
- Cost considerations can significantly change feature selection

### Comparison with Standard GA:
- Standard GA optimizes a single objective (performance only)
- NSGA-II explores trade-off space, giving multiple solution options
- Better for scenarios with conflicting goals

### WorkflowSet Benefits:
- Systematic evaluation of different NSGA-II configurations
- Easy comparison of selection methods and objective combinations
- Robust performance estimation via CV

### Production Recommendations:
1. Use 3-objective NSGA-II when feature costs are important
2. Knee point selection provides good balance in most cases
3. Inspect full Pareto front to understand available trade-offs
4. Consider business constraints when selecting from Pareto front
5. Use a larger population (50+) and more generations (30+) than the 30 and 20 used here for better convergence
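
The cost of a larger search can be estimated with simple arithmetic. The sketch below gives an upper bound on model fits per GA run, assuming every individual is re-evaluated in every generation with no caching; the library may do less work than this, so treat the numbers as a rough guide only. `ga_fit_budget` is a hypothetical helper.

```python
def ga_fit_budget(population_size, generations, cv_folds):
    """Upper bound on model fits for one GA run (no caching assumed)."""
    return population_size * generations * cv_folds

current = ga_fit_budget(population_size=30, generations=20, cv_folds=3)
larger = ga_fit_budget(population_size=50, generations=30, cv_folds=3)
print(f"current settings: up to {current} fits")
print(f"larger settings:  up to {larger} fits ({larger / current:.1f}x)")
```

Under these assumptions the larger settings cost about 2.5 times as many fits per run, and that factor applies again for every outer CV fold in which the recipe is re-prepared.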