# Stress Testing & Failure Modes - Understanding System Limits

## What This Notebook Covers

This notebook systematically tests multivariate HMM regime detection under stress conditions:

- **Part 1**: What can go wrong? (Failure taxonomy)
- **Part 2**: Stress testing methodology
- **Part 3**: Case studies of failure modes
- **Part 4**: Detection and recovery strategies
- **Part 5**: Resilience recommendations
- **Part 6**: Production deployment checklist

**Estimated time**: 45-60 minutes with all tests.

## Key Question

**When and why does multivariate regime detection fail?**

Understanding failure modes is critical for:
- Setting appropriate thresholds and alerts
- Knowing when to trust regime signals
- Designing fallback strategies
- Production risk management

## Part 1: Failure Taxonomy

### Type 1: Data Quality Failures

**Cause**: Problems with input data

1. **Insufficient data**
   - Effect: Poor parameter estimation, overfitting
   - Threshold: < 100 observations
   - Detection: Monitor convergence iterations
   - Recovery: Fall back to a univariate model or use transfer learning

2. **Missing/gap data**
   - Effect: Discontinuous feature values
   - Threshold: > 5% missing
   - Detection: Check for NaN values
   - Recovery: Forward-fill gaps < 3 days

3. **Extreme outliers**
   - Effect: Biased parameter estimates
   - Threshold: > 5 standard deviations
   - Detection: Check z-scores or tail quantiles of each feature
   - Recovery: Winsorize or remove if data error
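
The gap and outlier recoveries listed above can be sketched with pandas. This is a minimal illustration rather than the notebook's own implementation: `clean_features`, `max_gap`, and `z_limit` are hypothetical names, and clipping at the mean plus or minus 5 standard deviations is only one simple way to winsorize.

```python
import numpy as np
import pandas as pd

def clean_features(df, max_gap=2, z_limit=5.0):
    """Forward-fill short gaps, then clip extreme values (illustrative helper)."""
    # limit caps the number of consecutive fills: gaps of up to max_gap rows are
    # closed completely, longer gaps are filled only for their first max_gap rows.
    filled = df.ffill(limit=max_gap)
    # Winsorize by clipping each column at mean +/- z_limit standard deviations.
    mu, sigma = filled.mean(), filled.std()
    return filled.clip(lower=mu - z_limit * sigma, upper=mu + z_limit * sigma, axis=1)

rng = np.random.default_rng(0)
demo = pd.DataFrame({"feature1": rng.normal(0.0, 1.0, 200)})
demo.loc[10, "feature1"] = np.nan      # one-row gap: filled
demo.loc[50:54, "feature1"] = np.nan   # five-row gap: only partly filled
demo.loc[100, "feature1"] = 100.0      # gross outlier: clipped
cleaned = clean_features(demo)
print(cleaned["feature1"].isna().sum(), cleaned["feature1"].max())
```

Because the mean and standard deviation are themselves pulled by the outliers they are meant to catch, a robust center and scale (for example median and MAD) may be a better choice in practice.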

### Type 2: Model Estimation Failures

**Cause**: Problems during training

1. **Non-convergence**
   - Effect: Suboptimal parameters, poor inference
   - Threshold: > 100 iterations without improvement
   - Detection: Check training_history['converged']
   - Recovery: Loosen the convergence tolerance, increase max_iterations, or restart from a different initialization

2. **Singular covariance matrix**
   - Effect: Cannot invert Sigma, invalid inference
   - Threshold: Determinant ≈ 0 or negative eigenvalue
   - Detection: Check np.linalg.eigvalsh() results for eigenvalues at or below zero
   - Recovery: Add regularization or remove redundant features

3. **Ill-conditioned matrix**
   - Effect: Numerical instability in likelihood computation
   - Threshold: Condition number > 1e6
   - Detection: Monitor the ratio of the largest to the smallest eigenvalue
   - Recovery: Scale features or regularize
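
The singular and ill-conditioned cases can be checked with one eigenvalue decomposition. The sketch below is an assumption-laden illustration: `covariance_health` and `regularize` are hypothetical helpers, the 1e6 limit is taken from the threshold above, and adding a small multiple of the identity is just one common form of regularization.

```python
import numpy as np

def covariance_health(cov, max_condition=1e6):
    """Classify a covariance matrix as OK, ILL_CONDITIONED, or SINGULAR."""
    cov = np.asarray(cov, dtype=float)
    eigenvalues = np.linalg.eigvalsh(cov)  # ascending order; assumes a symmetric matrix
    smallest, largest = eigenvalues[0], eigenvalues[-1]
    # Treat eigenvalues at or below floating-point noise as zero.
    tol = np.finfo(float).eps * cov.shape[0] * max(abs(largest), 1.0)
    if smallest <= tol:
        return {"status": "SINGULAR", "condition": np.inf}
    condition = largest / smallest
    status = "ILL_CONDITIONED" if condition > max_condition else "OK"
    return {"status": status, "condition": float(condition)}

def regularize(cov, ridge=1e-6):
    """Add ridge * I to the diagonal so every eigenvalue rises by `ridge`."""
    cov = np.asarray(cov, dtype=float)
    return cov + ridge * np.eye(cov.shape[0])

duplicate = np.array([[1.0, 1.0], [1.0, 1.0]])  # two identical features
print(covariance_health(duplicate)["status"])
print(covariance_health(regularize(duplicate, ridge=1e-3))["status"])
```

The ridge value should be chosen relative to the feature scale; a fixed 1e-6 is negligible for standardized features but large for raw returns with variances near 1e-4.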

### Type 3: Inference Failures

**Cause**: Problems during state sequence prediction

1. **Low confidence**
   - Effect: Model uncertainty, unreliable predictions
   - Threshold: avg confidence < 0.5
   - Detection: Monitor confidence.mean()
   - Recovery: Increase look-ahead window or blend with other signals

2. **High transition frequency**
   - Effect: Noisy regime signals, false positives
   - Threshold: > 1 transition per 20 days
   - Detection: Count the entries where np.diff(states) != 0
   - Recovery: Use multi-timeframe filtering, increase smoothing

3. **Regime drift**
   - Effect: Model trained on old data doesn't match current market
   - Threshold: Confidence declines > 10% over time
   - Detection: Monitor confidence trend
   - Recovery: Retrain on recent data, implement concept drift detection
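
Two of these inference checks reduce to a few lines of NumPy. The sketch below assumes daily data with 252 trading days per year, under which one transition per 20 days corresponds to 252 / 20 = 12.6 transitions per year. The helper names and the 60-observation comparison window are hypothetical choices, not values from this notebook.

```python
import numpy as np

def transitions_per_year(states, periods_per_year=252):
    """Annualized count of regime changes in a state sequence."""
    states = np.asarray(states)
    if states.size < 2:
        return 0.0
    n_transitions = int(np.count_nonzero(np.diff(states)))
    return n_transitions / states.size * periods_per_year

def confidence_decline(confidence, window=60):
    """Relative drop in mean confidence from the first to the last `window` points."""
    conf = np.asarray(confidence, dtype=float)
    if conf.size < 2 * window:
        raise ValueError("need at least 2 * window observations")
    early = conf[:window].mean()
    late = conf[-window:].mean()
    return (early - late) / early

states = np.repeat([0, 1, 0], 84)        # 252 days with 2 regime changes
confidence = np.linspace(0.8, 0.6, 252)  # steadily eroding confidence
print(transitions_per_year(states), confidence_decline(confidence))
```

A decline above 0.10 from `confidence_decline` would correspond to the regime-drift threshold listed above, subject to the choice of window.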

### Type 4: Feature Failures

**Cause**: Problems with feature selection or relationships

1. **Redundant features**
   - Effect: Wasted parameters, numerical instability
   - Threshold: Correlation > 0.95
   - Detection: Compute Pearson correlation
   - Recovery: Remove redundant feature

2. **Non-informative features**
   - Effect: Model uses less information than available
   - Threshold: Mutual information I(X; Z) between feature X and regime Z falls below a chosen cutoff
   - Detection: Check if feature varies across regimes
   - Recovery: Replace with regime-informative feature

3. **Unstable relationship**
   - Effect: Feature relationship changes across market regimes
   - Threshold: Covariance sign flips across regimes
   - Detection: Check cov[0,1] signs across states
   - Recovery: Use adaptive feature weighting or regime-specific models
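
The sign-flip detection for unstable relationships can be sketched as follows. `covariance_sign_flips` is a hypothetical helper; it estimates the covariance of two feature columns separately within each decoded state and reports whether the off-diagonal sign differs between states.

```python
import numpy as np

def covariance_sign_flips(features, states, i=0, j=1):
    """Return per-state sign of cov[i, j] and whether the sign differs across states."""
    features = np.asarray(features, dtype=float)
    states = np.asarray(states)
    signs = {}
    for state in np.unique(states):
        block = features[states == state]
        if len(block) < 2:
            continue  # a covariance needs at least two observations
        cov = np.cov(block, rowvar=False)
        signs[int(state)] = int(np.sign(cov[i, j]))
    nonzero_signs = {s for s in signs.values() if s != 0}
    return signs, len(nonzero_signs) > 1

x = np.arange(10.0)
features = np.vstack([np.column_stack([x, x]),    # state 0: positive relationship
                      np.column_stack([x, -x])])  # state 1: negative relationship
states = np.repeat([0, 1], 10)
signs, flipped = covariance_sign_flips(features, states)
print(signs, flipped)
```

With small per-state samples the estimated sign is noisy, so a flip is better treated as a prompt for inspection than as proof of instability.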

## Part 2: Stress Testing Methodology

### Framework

For each failure mode, we:

1. **Create synthetic test case** (controlled problem)
2. **Measure impact** (how badly does it fail?)
3. **Detect automatically** (can we catch this?)
4. **Implement recovery** (can we fix this?)
5. **Document threshold** (when to trigger recovery)

### Metrics for Success

- **Convergence**: Does training converge? (Y/N)
- **Likelihood**: Does likelihood increase monotonically?
- **Confidence**: What's the average prediction confidence?
- **Stability**: How stable are parameters across runs?
- **Accuracy**: How well do regimes match market truth?
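
The likelihood metric can be checked mechanically. EM-style training should not decrease the log-likelihood between iterations in exact arithmetic, so a drop larger than a small tolerance usually points to numerical trouble. The helper below is a hypothetical sketch that assumes the training loop records one log-likelihood value per iteration.

```python
import numpy as np

def check_log_likelihood(history, tol=1e-8):
    """Report whether a log-likelihood trace is non-decreasing up to `tol`."""
    ll = np.asarray(history, dtype=float)
    steps = np.diff(ll)
    drops = np.flatnonzero(steps < -tol)
    return {
        "monotone": drops.size == 0,
        # Position in `history` of the first value that fell below its predecessor
        "first_drop_iteration": int(drops[0]) + 1 if drops.size else None,
        "last_improvement": float(steps[-1]) if steps.size else None,
    }

print(check_log_likelihood([-520.0, -480.5, -471.2, -470.9]))
print(check_log_likelihood([-520.0, -480.5, -483.0, -470.9]))
```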

### Risk Matrix

```
               Detection Easy   Detection Hard
Recovery Easy      GREEN           YELLOW
Recovery Hard      YELLOW          RED
```

We want failures to be:
1. Easy to detect (automated monitoring)
2. Easy to recover from (automatic fallback)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Test utilities
class StressTestSuite:
    """Comprehensive stress testing for multivariate HMM."""
    
    def __init__(self, n_states=3):
        self.n_states = n_states
        self.results = {}
    
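    def generate_normal_data(self, n_obs=500, n_features=2, regime_change_points=None):
        """Generate well-behaved synthetic data with two features."""
        if n_features != 2:
            raise ValueError("Only n_features=2 is supported by the hard-coded regime parameters")
        np.random.seed(42)
        
        # Default: two equal-length regimes, so that n_obs is actually respected
        if regime_change_points is None:
            regime_change_points = [0, n_obs // 2, n_obs]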
        
        data = []
        
        for regime in range(len(regime_change_points) - 1):
            start = regime_change_points[regime]
            end = regime_change_points[regime + 1]
            n = end - start
            
            # Regime-specific parameters
            mean = [0.001 * (1 - regime * 0.5), 0.01 + regime * 0.01]
            cov = [[0.0001 + regime * 0.0002, 0.00001 * (regime + 1)],
                    [0.00001 * (regime + 1), 0.0001 + regime * 0.0003]]
            
            regime_data = np.random.multivariate_normal(mean, cov, n)
            data.append(regime_data)
        
        data = np.vstack(data)
        df = pd.DataFrame(data, columns=['feature1', 'feature2'])
        return df
    
    def test_insufficient_data(self):
        """Tabulate degrees of freedom for very small samples."""
        results = {}
        
        for n_obs in [20, 50, 100, 200, 500]:
            # Pass the change points explicitly so the sample really has n_obs rows
            data = self.generate_normal_data(
                n_obs=n_obs, regime_change_points=[0, n_obs // 2, n_obs])
            
            try:
                # Standardize as a fit would; no HMM is fitted here, only parameters are counted
                scaler = StandardScaler()
                X = scaler.fit_transform(data.values)
                
                n_features = data.shape[1]
                # Emission parameters per state: mean vector + covariance lower triangle
                # (transition and initial-state probabilities are not counted)
                params_per_state = n_features + n_features * (n_features + 1) // 2
                total_params = params_per_state * self.n_states
                
                results[n_obs] = {
                    'n_obs': n_obs,
                    'n_features': n_features,
                    'params_per_state': params_per_state,
                    'degrees_of_freedom': n_obs - total_params,
                    'status': 'OK' if n_obs >= 100 else 'RISKY'
                }
            except Exception as e:
                results[n_obs] = {'error': str(e), 'status': 'FAIL'}
        
        return results

suite = StressTestSuite(n_states=3)
print("Stress Test Suite Initialized")

Stress Test Suite Initialized


## Part 3: Case Studies of Failure Modes

In [2]:
# Test: Insufficient Data
print("\n" + "="*70)
print("FAILURE MODE 1: INSUFFICIENT DATA")
print("="*70)

insufficient_results = suite.test_insufficient_data()

print("\nParameters per state: mean (2) + covariance lower triangle (3) = 5 total")
print("Total parameters: 5 * n_states = 15 for 3-state model\n")

for n_obs, result in insufficient_results.items():
    if 'error' not in result:
        print(f"n_obs={n_obs:3d}: DOF={result['degrees_of_freedom']:3d}, Status={result['status']:6s}", end="")
        if result['degrees_of_freedom'] < 30:
            print(" <- INSUFFICIENT DEGREES OF FREEDOM")
        else:
            print()

print("\nRecommendation: Use minimum 100 observations (200+ preferred)")


FAILURE MODE 1: INSUFFICIENT DATA

Parameters per state: mean (2) + covariance lower triangle (3) = 5 total
Total parameters: 5 * n_states = 15 for 3-state model

n_obs= 20: DOF=  5, Status=RISKY  <- INSUFFICIENT DEGREES OF FREEDOM
n_obs= 50: DOF= 35, Status=RISKY 
n_obs=100: DOF= 85, Status=OK    
n_obs=200: DOF=185, Status=OK    
n_obs=500: DOF=485, Status=OK    

Recommendation: Use minimum 100 observations (200+ preferred)
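
The parameter count printed above covers emission parameters only. As a sketch of how the count scales, the hypothetical helper below tallies the free parameters of a Gaussian HMM with full covariance matrices, also including the transition matrix and initial-state probabilities that a complete model has to estimate.

```python
def count_hmm_parameters(n_features, n_states):
    """Free parameters of a Gaussian HMM with full covariance matrices."""
    mean_params = n_features
    cov_params = n_features * (n_features + 1) // 2  # symmetric matrix: lower triangle
    emission = n_states * (mean_params + cov_params)
    transition = n_states * (n_states - 1)           # each row sums to 1
    initial = n_states - 1                           # start probabilities sum to 1
    return {
        "emission": emission,
        "transition": transition,
        "initial": initial,
        "total": emission + transition + initial,
    }

for d in (2, 3, 5):
    print(d, count_hmm_parameters(d, n_states=3))
```

The covariance term grows quadratically in the number of features, which is why adding features raises the minimum sample size much faster than adding observations lowers the risk.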


In [3]:
# Test: Feature Correlation
print("\n" + "="*70)
print("FAILURE MODE 2: REDUNDANT FEATURES (High Correlation)")
print("="*70)

# Generate correlated features
np.random.seed(42)
n = 500
feature1 = np.random.normal(0, 1, n)
correlations = [0.1, 0.5, 0.8, 0.95, 0.99]

print("\nCorrelation Analysis:")
for corr_target in correlations:
    # Create feature2 with target correlation
    noise = np.random.normal(0, 1, n)
    feature2 = corr_target * feature1 + np.sqrt(1 - corr_target**2) * noise
    
    actual_corr = np.corrcoef(feature1, feature2)[0, 1]
    
    # Check covariance matrix conditioning
    data = np.column_stack([feature1, feature2])
    cov = np.cov(data.T)
    eigenvalues = np.linalg.eigvalsh(cov)
    condition = eigenvalues[-1] / (eigenvalues[0] + 1e-10)
    
    # Use the absolute value so strong negative correlation is flagged as well
    status = 'OK'
    if abs(actual_corr) > 0.95:
        status = 'REDUNDANT'
    elif abs(actual_corr) > 0.7:
        status = 'CORRELATED'
    
    print(f"Target={corr_target:.2f}, Actual={actual_corr:.3f}, Condition={condition:8.2f}, Status={status}")

print("\nRecommendation: Correlation < 0.7 preferred, remove if > 0.95")


FAILURE MODE 2: REDUNDANT FEATURES (High Correlation)

Correlation Analysis:
Target=0.10, Actual=0.025, Condition=    1.06, Status=OK
Target=0.50, Actual=0.450, Condition=    2.64, Status=OK
Target=0.80, Actual=0.813, Condition=    9.71, Status=CORRELATED
Target=0.95, Actual=0.952, Condition=   40.49, Status=REDUNDANT
Target=0.99, Actual=0.990, Condition=  190.83, Status=REDUNDANT

Recommendation: Correlation < 0.7 preferred, remove if > 0.95
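
The condition numbers in this output follow closely from a closed form. As a sketch, assume both features have the same variance $\sigma^2$ (approximately true here, since feature2 is constructed to have unit variance) and correlation $\rho$. Then:

$$
\Sigma = \sigma^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix},
\qquad
\lambda_{\pm} = \sigma^2 \,(1 \pm |\rho|),
\qquad
\kappa(\Sigma) = \frac{\lambda_{+}}{\lambda_{-}} = \frac{1 + |\rho|}{1 - |\rho|}
$$

For $\rho = 0.95$ this gives $\kappa = 39$ and for $\rho = 0.99$ it gives $\kappa = 199$, in line with the measured values. Under the same assumption, a condition number of 1e6 would require $|\rho|$ above roughly 0.999998, so in the two-feature case the 0.95 correlation check triggers long before a 1e6 conditioning check would.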


## Part 4: Detection and Recovery Strategies

### Automated Detection Framework

In [4]:
class FailureDetector:
    """Detect regime model failures automatically."""
    
    def __init__(self, thresholds=None):
        # Default thresholds; user-supplied values override individual keys
        defaults = {
            'min_observations': 100,
            'max_missing_pct': 0.05,
            'min_confidence': 0.5,
            'max_transitions_per_year': 50,
            'eigenvalue_ratio': 100,
            'condition_number': 1e6,
            'correlation_threshold': 0.95
        }
        self.thresholds = {**defaults, **(thresholds or {})}
    
    def check_data_quality(self, data):
        """Check input data quality."""
        issues = []
        
        # Check size
        if len(data) < self.thresholds['min_observations']:
            issues.append(f"Insufficient observations: {len(data)} < {self.thresholds['min_observations']}")
        
        # Check missing
        missing_pct = data.isna().sum().sum() / (len(data) * len(data.columns))
        if missing_pct > self.thresholds['max_missing_pct']:
            issues.append(f"Too much missing data: {missing_pct:.1%} > {self.thresholds['max_missing_pct']:.1%}")
        
        # Check correlation
        corr = data.corr()
        for i in range(len(corr.columns)):
            for j in range(i+1, len(corr.columns)):
                if abs(corr.iloc[i, j]) > self.thresholds['correlation_threshold']:
                    issues.append(f"High correlation: {corr.columns[i]} <-> {corr.columns[j]} ({corr.iloc[i,j]:.3f})")
        
        return issues
    
    def check_model_health(self, model_results):
        """Check model training and inference health."""
        issues = []
        
        # Check convergence
        if not model_results.get('converged', False):
            issues.append("Model did not converge (exceeded max iterations)")
        
        # Check confidence
        if 'confidence' in model_results:
            # np.mean handles lists, arrays, and pandas Series alike
            avg_conf = float(np.mean(model_results['confidence']))
            if avg_conf < self.thresholds['min_confidence']:
                issues.append(f"Low average confidence: {avg_conf:.1%} < {self.thresholds['min_confidence']:.1%}")
        
        # Check transitions
        if 'predicted_state' in model_results:
            # Convert to a plain array so Series and lists are treated the same way
            states = np.asarray(model_results['predicted_state'])
            transitions = np.sum(np.diff(states) != 0)
            # Annualize assuming daily observations and 252 trading days per year
            transitions_per_year = (transitions / len(states)) * 252
            if transitions_per_year > self.thresholds['max_transitions_per_year']:
                issues.append(f"Too many transitions: {transitions_per_year:.0f}/year > {self.thresholds['max_transitions_per_year']:.0f}")
        
        return issues

# Test the detector
detector = FailureDetector()
print("Failure Detector Initialized")

Failure Detector Initialized


In [5]:
# Test detection on different data scenarios
print("\n" + "="*70)
print("FAILURE DETECTION IN ACTION")
print("="*70)

# Scenario 1: Good data
good_data = suite.generate_normal_data(n_obs=500)
print("\nScenario 1: Good Data (500 observations)")
issues = detector.check_data_quality(good_data)
if issues:
    for issue in issues:
        print(f"  - {issue}")
else:
    print("  ✓ All checks passed")

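# Scenario 2: Too few observations
# Pass the change points explicitly so the sample really has 30 rows
small_data = suite.generate_normal_data(n_obs=30, regime_change_points=[0, 15, 30])
print("\nScenario 2: Small Data (30 observations)")
issues = detector.check_data_quality(small_data)
if issues:
    for issue in issues:
        print(f"  - {issue}")
else:
    print("  ✓ All checks passed")

# Scenario 3: High correlation
corr_data = suite.generate_normal_data(n_obs=500)
# The noise must be small relative to feature1's scale (std of roughly 0.014);
# with noise std 0.01 the near-copy correlates at only about 0.8 and is not flagged
corr_data['feature2_dup'] = corr_data['feature1'] + np.random.normal(0, 1e-5, len(corr_data))
print("\nScenario 3: Highly Correlated Features")
issues = detector.check_data_quality(corr_data)
if issues:
    for issue in issues:
        print(f"  - {issue}")
else:
    print("  ✓ All checks passed")


FAILURE DETECTION IN ACTION

Scenario 1: Good Data (500 observations)
  ✓ All checks passed

Scenario 2: Small Data (30 observations)
  - Insufficient observations: 30 < 100

Scenario 3: Highly Correlated Features
  - High correlation: feature1 <-> feature2_dup (1.000)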

## Part 5: Resilience Recommendations

### Layered Defense Strategy

**Layer 1: Prevention**
- Data validation before training
- Feature correlation checking
- Minimum sample size enforcement

**Layer 2: Detection**
- Monitor convergence during training
- Track confidence during inference
- Watch for regime instability

**Layer 3: Recovery**
- Fallback to univariate model
- Use cached previous parameters
- Reduce model complexity

**Layer 4: Monitoring**
- Daily model performance review
- Quarterly retraining on fresh data
- Automatic alerts on threshold violations
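
The recovery layer can be expressed as an ordered chain of strategies. The sketch below is hypothetical throughout: `run_with_fallbacks` and the two stand-in functions are illustrative names, and the order in which fallbacks are tried is an assumption that each deployment would need to decide for itself.

```python
def run_with_fallbacks(strategies):
    """Try (name, zero-argument callable) pairs in order; return the first success."""
    errors = []
    for name, strategy in strategies:
        try:
            return {"used": name, "result": strategy(), "errors": errors}
        except Exception as exc:  # broad on purpose: any failure moves to the next layer
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all strategies failed: {errors}")

def fit_multivariate():
    # Stand-in for a multivariate fit that fails
    raise ValueError("singular covariance matrix")

def load_cached_parameters():
    # Stand-in for loading the last known-good parameters
    return {"source": "cache"}

outcome = run_with_fallbacks([
    ("multivariate", fit_multivariate),
    ("cached_parameters", load_cached_parameters),
])
print(outcome["used"], outcome["errors"])
```

Recording which layer produced the result, together with the errors from earlier layers, gives the monitoring layer something concrete to alert on.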

## Part 6: Production Deployment Checklist

### Pre-Deployment

- [ ] Data quality validator implemented
- [ ] Feature selection methodology documented
- [ ] Thresholds set and validated
- [ ] Fallback strategies coded
- [ ] Unit tests pass (>60% coverage)
- [ ] Backtest shows consistent performance
- [ ] Stress tests pass without data snooping
- [ ] Edge cases documented and handled

### Post-Deployment

- [ ] Daily monitoring dashboard active
- [ ] Automated alerts configured
- [ ] Logs capture all decisions and errors
- [ ] Weekly performance review process
- [ ] Quarterly data recalibration schedule
- [ ] Runbook for emergency retraining
- [ ] Model versioning for rollback capability

### Monitoring Metrics

| Metric | Normal | Alert | Critical |
|--------|--------|-------|----------|
| Average Confidence | >0.7 | 0.5-0.7 | <0.5 |
| Transitions/Year | 20-40 | 40-60 | >60 |
| Eigenvalue Ratio | 1-5 | 5-15 | >15 |
| Condition Number | <1e4 | 1e4-1e6 | >1e6 |
| Model Age | <30 days | 30-60 days | >60 days |
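
The bands in the table above can be turned into an automated status check. The sketch below transcribes the table's bounds; the metric keys and `classify_metric` are hypothetical names. Two assumptions go beyond the table: values exactly on a shared boundary are assigned to the more severe band, and values below the lower end of a Normal range (for example fewer than 20 transitions per year) are treated as normal.

```python
# (alert_bound, critical_bound, higher_is_worse), transcribed from the table
BANDS = {
    "average_confidence": (0.7, 0.5, False),
    "transitions_per_year": (40, 60, True),
    "eigenvalue_ratio": (5, 15, True),
    "condition_number": (1e4, 1e6, True),
    "model_age_days": (30, 60, True),
}

def classify_metric(name, value):
    """Map a metric value to 'normal', 'alert', or 'critical'."""
    alert, critical, higher_is_worse = BANDS[name]
    if not higher_is_worse:
        # Flip signs so that "larger is worse" holds for every metric
        value, alert, critical = -value, -alert, -critical
    if value > critical:
        return "critical"
    if value >= alert:
        return "alert"
    return "normal"

snapshot = {
    "average_confidence": 0.62,
    "transitions_per_year": 35,
    "condition_number": 2.5e6,
    "model_age_days": 12,
}
for name, value in snapshot.items():
    print(f"{name}: {classify_metric(name, value)}")
```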

## Key Takeaways

1. **Failure modes are predictable and detectable**
   - Monitor key metrics before they become problems
   - Use automated detection framework

2. **Resilience requires a layered approach**
   - Prevention > Detection > Recovery
   - Build fallback strategies into design

3. **Documentation is critical**
   - Know what can go wrong in your use case
   - Document thresholds and recovery procedures
   - Train team on playbooks

4. **Continuous monitoring is essential**
   - Models degrade over time (concept drift)
   - Quarterly retraining on fresh data
   - Watch for regime changes in the regime detector itself

5. **Plan for failure**
   - What happens if model fails completely?
   - How long can system run on cached parameters?
   - What's the graceful degradation path?