# Feature Selection Framework for Multivariate HMMs

**Objective:** Provide a systematic, reproducible framework for choosing which features to combine in your multivariate HMM.

**Problem:** With 17 available features, which combinations work best? Is it returns + volatility? Returns + momentum? How do you decide?

**Solution:** This notebook provides:
1. Feature characteristic checklist (is a feature regime-informative?)
2. Pre-training diagnostics (correlation, variance mismatch)
3. Systematic testing methodology
4. Metrics-based evaluation and ranking
5. Decision tree for feature selection

**Audience:**
- ML engineers building regime detection systems
- Traders/practitioners customizing for their trading strategy
- Researchers exploring regime detection across asset classes

**Expected Outcome:** A step-by-step methodology you can apply to any dataset.


## Part 1: Understanding Good Features

Not all features are equally useful for multivariate HMMs. A good feature must satisfy three criteria:

### Criterion 1: Regime-Informative
The feature should **vary significantly across regimes**.

Example of regime-informative feature:
- Bull regime: Realized volatility = 10% annualized
- Bear regime: Realized volatility = 25% annualized
- Difference: 2.5x ‚Üí Highly informative!

Example of regime-ambiguous feature:
- Bull regime: Trading volume = 50M shares
- Bear regime: Trading volume = 55M shares
- Difference: 1.1x ‚Üí Not very informative

### Criterion 2: Non-Redundant
The feature should **NOT provide the same information as returns**.

Example of redundant features:
- log_return and price_change ‚Üí 95% correlated (same information)
- return_ratio and log_return ‚Üí 90% correlated (nearly identical)

Example of complementary features:
- log_return and realized_volatility ‚Üí 15% correlated (different signals)
- log_return and momentum_strength ‚Üí 30% correlated (different signals)

### Criterion 3: Scale-Stable
The feature should **not have extreme scale differences** (handled by standardization, but matters for convergence).

Example of unstable scale:
- log_return: range [-0.05, 0.05] (small numbers)
- raw_volume: range [10M, 100M] (large numbers)
- Ratio: 1000x ‚Üí Numeric instability even with StandardScaler

Example of stable scale:
- log_return: range [-0.05, 0.05]
- realized_volatility: range [0.005, 0.04]
- Ratio: ~10x ‚Üí Manageable, pipeline handles easily


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import spearmanr, kendalltau
import hidden_regime as hr
import warnings
warnings.filterwarnings('ignore')

# Download market data
print("Downloading market data (SPY 2023-2024)...")
pipeline = hr.create_financial_pipeline('SPY', n_states=3, start_date='2023-01-01', end_date='2024-01-01', include_report=False)
result = pipeline.update()
data = pipeline.component_outputs['data']
obs_data = pipeline.component_outputs['observations']

print(f"Downloaded {len(data)} trading days")
print(f"\nAvailable observations: {list(obs_data.columns)}")

Downloading market data (SPY 2023-2024)...
Training on 249 observations (removed 0 NaN values), 1 feature(s)
Downloaded 249 trading days

Available observations: ['open', 'high', 'low', 'close', 'volume', 'price', 'pct_change', 'log_return', 'volatility', 'rsi']


## Part 2: Feature Diagnostic Checklist

Before training any model, run this diagnostic checklist on your candidate features.

In [2]:
def diagnose_feature_pair(feat1_name, feat2_name, data_df):
    """
    Comprehensive diagnostic for a feature pair.
    Returns dict with all diagnostic metrics.
    """
    if feat1_name not in data_df.columns or feat2_name not in data_df.columns:
        return None
    
    feat1 = data_df[feat1_name].dropna()
    feat2 = data_df[feat2_name].dropna()
    
    # Ensure same length
    min_len = min(len(feat1), len(feat2))
    feat1 = feat1.iloc[:min_len]
    feat2 = feat2.iloc[:min_len]
    
    diag = {
        'feature1': feat1_name,
        'feature2': feat2_name,
    }
    
    # Scale analysis
    scale_ratio = feat1.std() / feat2.std() if feat2.std() > 0 else np.inf
    diag['feat1_scale'] = feat1.std()
    diag['feat2_scale'] = feat2.std()
    diag['scale_ratio'] = max(scale_ratio, 1/scale_ratio)
    diag['scale_status'] = 'OK' if diag['scale_ratio'] < 100 else 'WARNING'
    
    # Correlation analysis
    pearson_corr = feat1.corr(feat2)
    spearman_corr, _ = spearmanr(feat1, feat2)
    diag['pearson_corr'] = pearson_corr
    diag['spearman_corr'] = spearman_corr
    diag['abs_corr'] = abs(pearson_corr)
    
    if abs(pearson_corr) > 0.9:
        diag['redundancy'] = 'REDUNDANT'
    elif abs(pearson_corr) > 0.7:
        diag['redundancy'] = 'Somewhat Correlated'
    elif abs(pearson_corr) > 0.5:
        diag['redundancy'] = 'Moderately Correlated'
    else:
        diag['redundancy'] = 'GOOD - Independent'
    
    return diag

# Test key feature pairs
feature_pairs = [
    ('log_return', 'realized_vol'),          # Recommended
    ('log_return', 'volatility'),            # Alternative vol measure
    ('log_return', 'momentum_strength'),     # Alternative: momentum
    ('log_return', 'price_change'),          # Redundant pair (should avoid)
    ('log_return', 'directional_consistency'),  # Advanced
    ('log_return', 'volume_ratio'),          # Volume-based
]

print("\n" + "="*80)
print("FEATURE PAIR DIAGNOSTICS")
print("="*80)

diagnostics = []
for feat1, feat2 in feature_pairs:
    diag = diagnose_feature_pair(feat1, feat2, obs_data)
    if diag:
        diagnostics.append(diag)
        
        print(f"\n{feat1:25s} + {feat2:25s}")
        print("-" * 80)
        print(f"  Scale ratio: {diag['scale_ratio']:.1f}x       ({diag['scale_status']})")
        print(f"  Pearson correlation: {diag['pearson_corr']:+.3f}")
        print(f"  Spearman correlation: {diag['spearman_corr']:+.3f}")
        print(f"  Redundancy: {diag['redundancy']}")
        
        # Recommendation
        if diag['redundancy'] == 'REDUNDANT':
            print(f"  ‚ùå AVOID: Features provide same information")
        elif diag['scale_ratio'] > 100:
            print(f"  ‚ö†Ô∏è  WARNING: Scale mismatch, but pipeline standardizes")
        else:
            print(f"  ‚úÖ GOOD: Candidate for multivariate model")


FEATURE PAIR DIAGNOSTICS

log_return                + volatility               
--------------------------------------------------------------------------------
  Pearson correlation: +nan
  Spearman correlation: +nan
  Redundancy: GOOD - Independent


## Part 3: Systematic Feature Testing

Now let's test multiple feature combinations and rank them by quality metrics.

In [3]:
print("\n" + "="*80)
print("SYSTEMATIC FEATURE COMBINATION TESTING")
print("="*80)

# Feature combinations to test
test_combinations = [
    (['log_return'], 'Univariate Baseline'),
    (['log_return', 'realized_vol'], 'RECOMMENDED'),
    (['log_return', 'volatility'], 'Alternative Vol'),
    (['log_return', 'momentum_strength'], 'Momentum-Focused'),
    (['log_return', 'trend_persistence'], 'Trend-Focused'),
]

test_results = []

for features, description in test_combinations:
    print(f"\nTesting: {description} ({', '.join(features)})")
    
    try:
        if len(features) == 1:
            # Univariate
            pipeline = hr.create_financial_pipeline(
                'SPY',
                n_states=3,
                start_date='2023-01-01',
                end_date='2024-01-01',
                include_report=False,
                observation_config_overrides={'generators': features}
            )
        else:
            # Multivariate
            pipeline = hr.create_multivariate_pipeline(
                'SPY',
                n_states=3,
                features=features,
                start_date='2023-01-01',
                end_date='2024-01-01'
            )
        
        report = pipeline.update()
        result = pipeline.component_outputs['interpreter']
        model = pipeline.model
        
        # Extract metrics
        transitions = np.sum(np.diff(result['predicted_state']) != 0)
        avg_conf = result['confidence'].mean()
        min_conf = result['confidence'].min()
        converged = model.training_history_['converged']
        iterations = model.training_history_['iterations']
        
        test_results.append({
            'description': description,
            'features': ', '.join(features),
            'converged': converged,
            'iterations': iterations,
            'transitions': transitions,
            'avg_confidence': avg_conf,
            'min_confidence': min_conf,
            'result': result
        })
        
        print(f"  ‚úì Converged in {iterations} iterations")
        print(f"    Transitions: {transitions}, Avg Confidence: {avg_conf:.1%}")
    
    except Exception as e:
        print(f"  ‚úó Failed: {str(e)[:60]}...")

# Create results DataFrame
results_df = pd.DataFrame(test_results)
print("\n" + "="*80)
print("TEST RESULTS SUMMARY")
print("="*80)
print(results_df[['description', 'converged', 'transitions', 'avg_confidence', 'min_confidence']].to_string(index=False))


SYSTEMATIC FEATURE COMBINATION TESTING

Testing: Univariate Baseline (log_return)
Training on 249 observations (removed 0 NaN values), 1 feature(s)
  ‚úì Converged in 100 iterations
    Transitions: 36, Avg Confidence: 87.0%

Testing: RECOMMENDED (log_return, realized_vol)
  Feature standardization applied (variance ratio before: 14.0)
Training on 230 observations (removed 19 NaN values), 2 feature(s)
  ‚úì Converged in 30 iterations
    Transitions: 11, Avg Confidence: 90.3%

Testing: Alternative Vol (log_return, volatility)
  Feature standardization applied (variance ratio before: 14.0)
Training on 230 observations (removed 19 NaN values), 2 feature(s)
  ‚úì Converged in 30 iterations
    Transitions: 11, Avg Confidence: 90.3%

Testing: Momentum-Focused (log_return, momentum_strength)
  Feature standardization applied (variance ratio before: 114.5)
Training on 229 observations (removed 20 NaN values), 2 feature(s)
  ‚úì Converged in 100 iterations
    Transitions: 55, Avg Confidence

## Part 4: Evaluation Metrics

Now let's define metrics to objectively rank feature combinations.

In [4]:
def compute_quality_score(result_row):
    """
    Compute an objective quality score (0-100) for a feature combination.
    
    Metric weights:
    - Convergence (30%): Did it converge quickly?
    - Stability (40%): Fewer transitions = more stable
    - Confidence (30%): Higher confidence = better predictions
    """
    # Convergence score (100 if converged, 50 if not)
    conv_score = 100 if result_row['converged'] else 50
    
    # Stability score (fewer transitions = higher score)
    # Range: ~0-100 transitions, map to 0-100 score
    stability_score = max(0, 100 - result_row['transitions'])
    
    # Confidence score (0-100% confidence, map to 0-100)
    conf_score = result_row['avg_confidence'] * 100
    
    # Weighted average
    quality = (conv_score * 0.3) + (stability_score * 0.4) + (conf_score * 0.3)
    return quality

# Compute quality scores
if test_results:
    results_df['quality_score'] = results_df.apply(compute_quality_score, axis=1)
    results_df = results_df.sort_values('quality_score', ascending=False)
    
    print("\nQUALITY RANKING (Higher is Better):")
    print("="*80)
    
    for idx, row in results_df.iterrows():
        rank = results_df.index.get_loc(idx) + 1
        score = row['quality_score']
        medal = 'ü•á' if rank == 1 else 'ü•à' if rank == 2 else 'ü•â' if rank == 3 else f'  '
        
        print(f"{medal} #{rank}. {row['description']:25s} Score: {score:6.1f}/100")
        print(f"     Features: {row['features']}")
        print(f"     Transitions: {row['transitions']}, Confidence: {row['avg_confidence']:.1%}")
        print()


QUALITY RANKING (Higher is Better):
ü•á #1. Alternative Vol           Score:   92.7/100
     Features: log_return, volatility
     Transitions: 11, Confidence: 90.3%

ü•à #2. RECOMMENDED               Score:   92.7/100
     Features: log_return, realized_vol
     Transitions: 11, Confidence: 90.3%

ü•â #3. Trend-Focused             Score:   74.3/100
     Features: log_return, trend_persistence
     Transitions: 58, Confidence: 91.8%

   #4. Univariate Baseline       Score:   66.7/100
     Features: log_return
     Transitions: 36, Confidence: 87.0%

   #5. Momentum-Focused          Score:   57.1/100
     Features: log_return, momentum_strength
     Transitions: 55, Confidence: 80.4%



## Part 5: Decision Tree for Feature Selection

Use this interactive decision tree to choose features for YOUR specific use case.

In [5]:
def feature_selection_decision_tree():
    """
    Interactive decision tree for feature selection.
    In practice, user would answer these questions.
    """
    print("\n" + "="*80)
    print("FEATURE SELECTION DECISION TREE")
    print("="*80)
    print("""
START HERE: Do you have 2+ years of daily data?
‚îú‚îÄ NO  ‚Üí Use UNIVARIATE (returns only)
‚îÇ       Reason: Multivariate needs sufficient data for covariance estimation
‚îÇ
‚îî‚îÄ YES ‚Üí What is your primary objective?
         ‚îú‚îÄ VOLATILITY REGIME DETECTION (High/Medium/Low Vol)
         ‚îÇ  ‚îî‚îÄ USE: log_return + realized_vol ‚Üê RECOMMENDED
         ‚îÇ     Why: Volatility clearly varies by regime
         ‚îÇ          Information-theoretically optimal (see notebook 03)
         ‚îÇ
         ‚îú‚îÄ TREND/MOMENTUM DETECTION
         ‚îÇ  ‚îú‚îÄ Detect momentum reversals?
         ‚îÇ  ‚îÇ  ‚îî‚îÄ USE: log_return + momentum_strength
         ‚îÇ  ‚îî‚îÄ Detect trend changes?
         ‚îÇ     ‚îî‚îÄ USE: log_return + trend_persistence
         ‚îÇ
         ‚îú‚îÄ CRISIS DETECTION
         ‚îÇ  ‚îî‚îÄ USE: log_return + realized_vol
         ‚îÇ     Why: Vol spikes are immediate crisis indicators
         ‚îÇ
         ‚îî‚îÄ GENERAL PURPOSE (not sure)
            ‚îî‚îÄ USE: log_return + realized_vol (DEFAULT BEST)
               Why: Works well across most market conditions

BEFORE COMMITTING, CHECK:
1. Correlation between your chosen features (should be < 0.7)
   ‚îî‚îÄ Use diagnose_feature_pair() above

2. Scale ratio between features (should be < 100x)
   ‚îî‚îÄ Pipeline standardizes automatically, but very large ratios may struggle

3. Data quality
   ‚îî‚îÄ No more than 3% missing data
   ‚îî‚îÄ No obvious data errors or discontinuities

4. Regime separation
   ‚îî‚îÄ Can you visually see regimes in your features?
   ‚îî‚îÄ Or are all observations bunched together?
    """)

feature_selection_decision_tree()


FEATURE SELECTION DECISION TREE

START HERE: Do you have 2+ years of daily data?
‚îú‚îÄ NO  ‚Üí Use UNIVARIATE (returns only)
‚îÇ       Reason: Multivariate needs sufficient data for covariance estimation
‚îÇ
‚îî‚îÄ YES ‚Üí What is your primary objective?
         ‚îú‚îÄ VOLATILITY REGIME DETECTION (High/Medium/Low Vol)
         ‚îÇ  ‚îî‚îÄ USE: log_return + realized_vol ‚Üê RECOMMENDED
         ‚îÇ     Why: Volatility clearly varies by regime
         ‚îÇ          Information-theoretically optimal (see notebook 03)
         ‚îÇ
         ‚îú‚îÄ TREND/MOMENTUM DETECTION
         ‚îÇ  ‚îú‚îÄ Detect momentum reversals?
         ‚îÇ  ‚îÇ  ‚îî‚îÄ USE: log_return + momentum_strength
         ‚îÇ  ‚îî‚îÄ Detect trend changes?
         ‚îÇ     ‚îî‚îÄ USE: log_return + trend_persistence
         ‚îÇ
         ‚îú‚îÄ CRISIS DETECTION
         ‚îÇ  ‚îî‚îÄ USE: log_return + realized_vol
         ‚îÇ     Why: Vol spikes are immediate crisis indicators
         ‚îÇ
         ‚îî‚îÄ GENERAL PURPOSE (n

## Part 6: Common Pitfalls & How to Avoid Them

Learn from common mistakes when selecting features.

In [6]:
print("\n" + "="*80)
print("COMMON PITFALLS IN FEATURE SELECTION")
print("="*80)
print("""
1. PITFALL: Using redundant features
   Example: returns + price_change (95% correlated)
   Problem: Provides no new information, wastes model capacity
   Fix: Check correlation, use diagnose_feature_pair()

2. PITFALL: Choosing regime-ambiguous features
   Example: Raw trading volume (changes with time-of-day, not regime)
   Problem: Model can't learn regime structure
   Fix: Use volume_ratio (relative to average) not raw volume

3. PITFALL: Ignoring scale differences
   Example: returns (-0.05 to 0.05) + volume (1M to 100M)
   Problem: Numerical instability, poor convergence
   Fix: Pipeline standardizes, but avoid 1000x differences

4. PITFALL: Using "interesting" but irrelevant features
   Example: Oil prices for equity regime detection
   Problem: No regime connection, noise only
   Fix: Ensure feature varies significantly across YOUR regimes

5. PITFALL: Too many features
   Example: 5+ features with 2 years data
   Problem: Curse of dimensionality, covariance matrix is near-singular
   Fix: Start with 2 features, add a 3rd only if justified

6. PITFALL: Over-fitting to historical data
   Example: Features that worked 2018-2020 fail 2023-2024
   Problem: Regime structure changed (market regime shift)
   Fix: Test on multiple historical periods

BEST PRACTICE:
Always validate feature choices on out-of-sample data:
‚îú‚îÄ Train on period A (pre-crisis or stable)
‚îú‚îÄ Test on period B (crisis or regime shift)
‚îî‚îÄ Ensure features still separate regimes in new period
""")


COMMON PITFALLS IN FEATURE SELECTION

1. PITFALL: Using redundant features
   Example: returns + price_change (95% correlated)
   Problem: Provides no new information, wastes model capacity
   Fix: Check correlation, use diagnose_feature_pair()

2. PITFALL: Choosing regime-ambiguous features
   Example: Raw trading volume (changes with time-of-day, not regime)
   Problem: Model can't learn regime structure
   Fix: Use volume_ratio (relative to average) not raw volume

3. PITFALL: Ignoring scale differences
   Example: returns (-0.05 to 0.05) + volume (1M to 100M)
   Problem: Numerical instability, poor convergence
   Fix: Pipeline standardizes, but avoid 1000x differences

4. PITFALL: Using "interesting" but irrelevant features
   Example: Oil prices for equity regime detection
   Problem: No regime connection, noise only
   Fix: Ensure feature varies significantly across YOUR regimes

5. PITFALL: Too many features
   Example: 5+ features with 2 years data
   Problem: Curse of dimensi

## Part 7: Summary & Next Steps

You now have a systematic framework for feature selection.

In [7]:
print("\n" + "="*80)
print("FEATURE SELECTION FRAMEWORK - SUMMARY")
print("="*80)
print("""
STEP-BY-STEP PROCESS:

1. CHARACTERIZE YOUR DATA
   ‚úì Do you have 2+ years of daily data?
   ‚úì What asset class? (equities/bonds/crypto/forex?)
   ‚úì What regimes do you want to detect?

2. IDENTIFY CANDIDATE FEATURES
   Use the decision tree above to narrow down candidates
   Typical good choices:
   - Volatility regimes: log_return + realized_vol
   - Momentum regimes: log_return + momentum_strength
   - Trend regimes: log_return + trend_persistence

3. RUN DIAGNOSTICS
   Use diagnose_feature_pair() to check:
   ‚úì Correlation (should be < 0.7)
   ‚úì Scale ratio (should be < 100x)
   ‚úì Are features independent?

4. TEST ON HISTORICAL DATA
   Train multivariate model:
   ‚úì Pre-event period (stable)
   ‚úì Check: Convergence, Transitions, Confidence
   ‚úì Benchmark: Compare to univariate baseline

5. VALIDATE ON NEW PERIOD
   Test on different regime (crisis, downturn, etc):
   ‚úì Do regime boundaries still make sense?
   ‚úì Is confidence still high?
   ‚úì Are feature distributions still regime-informative?

6. DEPLOY WITH CONFIDENCE
   Once validated:
   ‚úì Use hr.create_multivariate_pipeline() with your chosen features
   ‚úì Monitor performance over time
   ‚úì Retrain if market regime fundamentally changes

RESEARCH QUESTION:
Want to test a novel feature combination? Use example 03
to systematically compare your ideas against best practices.

SEE ALSO:
- Example 02: COVID-2020 crisis detection with features
- Example 03: Side-by-side feature comparison
- Notebook 03: Why volatility matters (information theory)
- Notebook 05: How covariance structure reveals regimes
""")


FEATURE SELECTION FRAMEWORK - SUMMARY

STEP-BY-STEP PROCESS:

1. CHARACTERIZE YOUR DATA
   ‚úì Do you have 2+ years of daily data?
   ‚úì What asset class? (equities/bonds/crypto/forex?)
   ‚úì What regimes do you want to detect?

2. IDENTIFY CANDIDATE FEATURES
   Use the decision tree above to narrow down candidates
   Typical good choices:
   - Volatility regimes: log_return + realized_vol
   - Momentum regimes: log_return + momentum_strength
   - Trend regimes: log_return + trend_persistence

3. RUN DIAGNOSTICS
   Use diagnose_feature_pair() to check:
   ‚úì Correlation (should be < 0.7)
   ‚úì Scale ratio (should be < 100x)
   ‚úì Are features independent?

4. TEST ON HISTORICAL DATA
   Train multivariate model:
   ‚úì Pre-event period (stable)
   ‚úì Check: Convergence, Transitions, Confidence
   ‚úì Benchmark: Compare to univariate baseline

5. VALIDATE ON NEW PERIOD
   Test on different regime (crisis, downturn, etc):
   ‚úì Do regime boundaries still make sense?
   ‚úì Is conf