# HMM Basics: Understanding Regime Detection

**From Log Returns to Market Regimes Using Hidden Markov Models**

In the previous notebook, we showed why log returns are a suitable transformation for financial time series.

Now we'll use those log returns to detect market regimes using **Hidden Markov Models (HMMs)**.

## What You'll Learn

1. What HMMs are and how they work
2. What parameters HMMs learn from data
3. How to interpret HMM states (WITHOUT forcing Bull/Bear labels)
4. How to choose the right number of states
5. How to visualize regime sequences

## Important Note

**This notebook uses minimal code** to understand HMM mechanics.

**Next notebook** shows the full library pipeline with analysis tools and best practices.

---

In [None]:
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Add parent to path
sys.path.insert(0, str(Path().absolute().parent))

# Hidden Regime imports - minimal
from hidden_regime.data import FinancialDataLoader
from hidden_regime.observations import FinancialObservationGenerator
from hidden_regime.models import HiddenMarkovModel
from hidden_regime.config import FinancialDataConfig, FinancialObservationConfig, HMMConfig
from hidden_regime.utils import log_return_to_percent_change

# Plotting
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

print("Imports complete")

## 1. What is a Hidden Markov Model?

### The Core Idea

An HMM assumes the market operates in **hidden states** (regimes) that we cannot directly observe. Each regime has:
- **Different return characteristics** (mean and volatility)
- **Persistence** (tendency to stay in the same regime)
- **Transition probabilities** (likelihood of switching to other regimes)

### The Three Components

1. **Hidden States** (`S₁, S₂, ..., Sₜ`): The unobserved regime sequence
2. **Observations** (`O₁, O₂, ..., Oₜ`): The log returns we can see
3. **Parameters**:
   - **Transition Matrix** `A`: `A[i, j] = P(S_{t+1} = j | S_t = i)`, so each row sums to 1
   - **Emission Means** `μ`: Average log return for each regime
   - **Emission Stds** `σ`: Volatility for each regime
   - **Initial Probs** `π`: Starting regime distribution
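
To make the three components concrete, here is a minimal sketch of the generative process an HMM assumes: a hidden state path is drawn from `π` and `A`, and each observation is drawn from the Gaussian of the current state. The function name `sample_gaussian_hmm` and all parameter values are made up for illustration; they are not part of `hidden_regime` and are not fitted to any data.

```python
import numpy as np

def sample_gaussian_hmm(pi, A, mu, sigma, n_steps, seed=0):
    """Draw a hidden state path and Gaussian observations from an HMM."""
    rng = np.random.default_rng(seed)
    n_states = len(pi)
    states = np.empty(n_steps, dtype=int)
    obs = np.empty(n_steps)
    # The first state comes from the initial distribution pi
    states[0] = rng.choice(n_states, p=pi)
    obs[0] = rng.normal(mu[states[0]], sigma[states[0]])
    for t in range(1, n_steps):
        # The next state depends only on the current state (one row of A)
        states[t] = rng.choice(n_states, p=A[states[t - 1]])
        # The observation depends only on the current state
        obs[t] = rng.normal(mu[states[t]], sigma[states[t]])
    return states, obs

# Illustrative (made-up) parameters for a two-regime market
pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05],
              [0.10, 0.90]])
mu = np.array([0.0005, -0.0010])    # mean daily log return per regime
sigma = np.array([0.007, 0.020])    # daily volatility per regime

sim_states, sim_returns = sample_gaussian_hmm(pi, A, mu, sigma, n_steps=500)
print(sim_states[:20])
print(sim_returns[:5])
```

Fitting an HMM is the reverse problem: only the observations are visible, and the states and parameters have to be inferred.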

### What HMMs Do

Given observed log returns, fitting and then decoding an HMM lets us:
- Learn the transition probabilities (how regimes switch)
- Learn the emission parameters (return characteristics of each regime)
- Infer the most likely regime sequence (states) under those learned parameters
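
Learning parameters requires a way to score how well a candidate set of parameters explains the observed returns. The sketch below shows one standard way to do that, the scaled forward algorithm, for a Gaussian HMM. The helper names and numbers are illustrative assumptions; this is not the `hidden_regime` implementation.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Normal density of x under each state's mean and standard deviation."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def forward_log_likelihood(obs, pi, A, mu, sigma):
    """Scaled forward algorithm: returns log P(obs | pi, A, mu, sigma)."""
    log_lik = 0.0
    alpha = None
    for t, x in enumerate(obs):
        b = gaussian_pdf(x, mu, sigma)
        # Propagate state probabilities one step, then weight by the emission
        alpha = pi * b if t == 0 else (alpha @ A) * b
        # Rescale to avoid underflow; the log of the scales sums to the log-likelihood
        scale = alpha.sum()
        log_lik += np.log(scale)
        alpha = alpha / scale
    return log_lik

# Made-up observations and parameters, for illustration only
obs = np.array([0.001, -0.002, 0.015, -0.020])
pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05],
              [0.10, 0.90]])
mu = np.array([0.0005, -0.0010])
sigma = np.array([0.007, 0.020])

ll = forward_log_likelihood(obs, pi, A, mu, sigma)
print(ll)
```

Training repeatedly adjusts the parameters so that this log-likelihood increases.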

---

## 2. Load Data and Create Observations

We'll use SPY (S&P 500 ETF) with 2 years of data.

In [None]:
# Configure data loader
data_config = FinancialDataConfig(
    ticker='SPY',
    end_date=datetime.now(),
    num_samples=504  # ~2 years of trading days
)

# Load data
loader = FinancialDataLoader(data_config)
df = loader.load_data()

print(f"Loaded {len(df)} days of SPY data")
print(f"Date range: {df.index[0].date()} to {df.index[-1].date()}")
print(f"\nData columns: {list(df.columns)}")
print(f"\nFirst few rows:")
print(df.head())

In [None]:
# Create observation generator
obs_config = FinancialObservationConfig(
    generators=["log_return"],  # Just log returns
    price_column="close",
    normalize_features=False  # Keep raw log returns
)

obs_generator = FinancialObservationGenerator(obs_config)
observations_df = obs_generator.update(df)

# For HMM training, we need just the log_return column as a DataFrame
observations = observations_df[['log_return']]

print(f"Generated {len(observations)} observations")
print(f"Observation shape: {observations.shape}")
print(f"\nObservations are log returns:")
print(f"  Mean: {observations.values.mean():.6f}")
print(f"  Std:  {observations.values.std():.6f}")
print(f"  Min:  {observations.values.min():.6f}")
print(f"  Max:  {observations.values.max():.6f}")

# Convert a few examples to percentage for interpretation
print(f"\nExample log returns converted to percentages:")
for i in range(min(5, len(observations))):
    log_ret = observations.iloc[i, 0]
    pct = log_return_to_percent_change(log_ret) * 100
    print(f"  Day {i+1}: log_return={log_ret:.6f} → {pct:+.2f}%")

## 3. Train the HMM

Now we'll train a 3-state HMM on these log returns. The model will learn:
- Which regime the market was in each day
- The transition probabilities between regimes
- The mean return and volatility for each regime

In [None]:
# Configure HMM with 3 states
hmm_config = HMMConfig(
    n_states=3,
    max_iterations=100,  # Maximum training iterations
    tolerance=1e-4,      # Convergence tolerance
    random_seed=42
)

# Create and train model
print("Training 3-state HMM...")
hmm = HiddenMarkovModel(hmm_config)
hmm.fit(observations)

print(f"Training complete!")
print(f"  Converged: {hmm.training_history_['converged']}")
print(f"  Iterations: {hmm.training_history_['iterations']}")
print(f"  Final log-likelihood: {hmm.score(observations):.2f}")

## 4. Inspect Learned Parameters

Let's see what the HMM learned about the three regimes.
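
Two quantities help when reading a transition matrix: the expected duration of each regime and the long-run share of time spent in it. The sketch below computes both with NumPy for a made-up 3-state matrix; the numbers are illustrative and are not output from the model trained above.

```python
import numpy as np

# Illustrative (made-up) 3-state transition matrix; each row sums to 1
A = np.array([[0.90, 0.08, 0.02],
              [0.05, 0.90, 0.05],
              [0.02, 0.08, 0.90]])

# Expected number of consecutive days spent in state i: 1 / (1 - A[i, i])
expected_duration = 1.0 / (1.0 - np.diag(A))

# Stationary distribution: left eigenvector of A for the eigenvalue 1
eigvals, eigvecs = np.linalg.eig(A.T)
stationary = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
stationary = stationary / stationary.sum()

print(expected_duration)
print(stationary)
```

A diagonal entry of 0.90 corresponds to an expected stay of about 10 days, which is what "persistence" means in practice.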

In [None]:
# Extract learned parameters
transition_matrix = hmm.transition_matrix_
emission_means = hmm.emission_means_
emission_stds = hmm.emission_stds_
start_probs = hmm.initial_probs_

print("=" * 80)
print("LEARNED PARAMETERS")
print("=" * 80)

print("\n1. EMISSION PARAMETERS (Return Characteristics)")
print("-" * 80)
print(f"{'State':<10} {'Mean (log)':<15} {'Mean (%/day)':<15} {'Std (log)':<15} {'Vol (%/day)':<15}")
print("-" * 80)

# Sort by mean for easier interpretation
sorted_idx = np.argsort(emission_means)

for idx in sorted_idx:
    log_mean = emission_means[idx]
    pct_mean = log_return_to_percent_change(log_mean) * 100
    log_std = emission_stds[idx]
    pct_std = log_std * 100  # Approximation valid for small returns
    
    print(f"State {idx:<4} {log_mean:>14.6f} {pct_mean:>14.2f}% {log_std:>14.6f} {pct_std:>14.2f}%")

print("\n2. TRANSITION MATRIX (Regime Switching Probabilities)")
print("-" * 80)
print("         To State 0  To State 1  To State 2")
print("-" * 80)
for i in range(3):
    probs = " ".join([f"{transition_matrix[i,j]:>11.3f}" for j in range(3)])
    print(f"From {i}:  {probs}")

print("\n3. STARTING PROBABILITIES")
print("-" * 80)
for i in range(3):
    print(f"State {i}: {start_probs[i]:.3f}")

print("\n" + "=" * 80)

## 5. Predict Regime Sequence

Now let's use the Viterbi algorithm to find the most likely sequence of regimes.
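
For intuition, here is a minimal NumPy sketch of Viterbi decoding for a Gaussian HMM, computed in log space. It assumes all transition and initial probabilities are strictly positive. The function `viterbi` and the example values are illustrative assumptions and may differ from what `hmm.predict` does internally.

```python
import numpy as np

def viterbi(obs, pi, A, mu, sigma):
    """Most likely state path for a Gaussian HMM, computed in log space."""
    obs = np.asarray(obs, dtype=float)
    n_steps, n_states = len(obs), len(pi)
    # Log emission density of every observation under every state: shape (T, N)
    log_b = (-0.5 * ((obs[:, None] - mu) / sigma) ** 2
             - np.log(sigma * np.sqrt(2 * np.pi)))
    log_A = np.log(A)
    # delta[j]: best log-probability of any path that ends in state j
    delta = np.log(pi) + log_b[0]
    backptr = np.zeros((n_steps, n_states), dtype=int)
    for t in range(1, n_steps):
        scores = delta[:, None] + log_A      # scores[i, j]: come from i, move to j
        backptr[t] = scores.argmax(axis=0)   # best predecessor for each state j
        delta = scores.max(axis=0) + log_b[t]
    # Trace the best path backwards from the best final state
    path = np.empty(n_steps, dtype=int)
    path[-1] = delta.argmax()
    for t in range(n_steps - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path

# Made-up example: three calm days followed by four volatile days
pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05],
              [0.05, 0.95]])
mu = np.array([0.001, -0.001])
sigma = np.array([0.005, 0.030])   # state 0 is low volatility, state 1 is high
obs = np.array([0.002, 0.001, -0.001, 0.060, -0.050, 0.045, -0.070])

path = viterbi(obs, pi, A, mu, sigma)
print(path)
```

Because the path is scored as a whole, a single unusual day does not necessarily trigger a regime switch; the transition probabilities penalize frequent switching.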

In [None]:
# Predict states using Viterbi algorithm
predictions_df = hmm.predict(observations)
states = predictions_df['predicted_state'].values

print(f"Predicted {len(states)} regime states")
print(f"\nState distribution:")
for state in range(3):
    count = (states == state).sum()
    pct = count / len(states) * 100
    mean_ret = emission_means[state]
    mean_ret_pct = log_return_to_percent_change(mean_ret) * 100
    print(f"  State {state}: {count:>4} days ({pct:>5.1f}%) - Avg return: {mean_ret_pct:+.2f}%/day")

# Calculate regime durations
def get_regime_durations(states):
    """Calculate how long each regime lasted"""
    durations = []
    current_state = states[0]
    current_duration = 1
    
    for i in range(1, len(states)):
        if states[i] == current_state:
            current_duration += 1
        else:
            durations.append((current_state, current_duration))
            current_state = states[i]
            current_duration = 1
    
    durations.append((current_state, current_duration))
    return durations

durations = get_regime_durations(states)
print(f"\nRegime periods: {len(durations)} total ({len(durations) - 1} regime switches)")
print(f"\nAverage duration by state:")
for state in range(3):
    state_durations = [d for s, d in durations if s == state]
    if state_durations:
        avg_dur = np.mean(state_durations)
        print(f"  State {state}: {avg_dur:.1f} days")

## 6. Visualize Regime Sequence

Let's see how the detected regimes align with price movements and returns.
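
One way to prepare such a plot is to collapse the state sequence into contiguous runs first, so that each regime period can be shaded with a single span instead of one span per day. A minimal sketch, where `regime_spans` is a hypothetical helper (not part of `hidden_regime`) and the input is assumed to be non-empty:

```python
import numpy as np

def regime_spans(states):
    """Collapse a state sequence into (state, start, end) runs; end is exclusive."""
    states = np.asarray(states)
    # Positions where the state differs from the previous observation
    change_points = np.flatnonzero(states[1:] != states[:-1]) + 1
    starts = np.concatenate(([0], change_points))
    ends = np.concatenate((change_points, [len(states)]))
    return [(int(states[s]), int(s), int(e)) for s, e in zip(starts, ends)]

spans = regime_spans([0, 0, 1, 1, 1, 2, 0, 0])
print(spans)
```

Each tuple can then be turned into one shaded region by looking up the dates at `start` and `end - 1`.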

In [None]:
# Create visualization
fig, axes = plt.subplots(3, 1, figsize=(16, 12), sharex=True)

# Define colors for each state (sorted by mean return)
state_colors = {sorted_idx[0]: 'red', sorted_idx[1]: 'gray', sorted_idx[2]: 'green'}
state_names = {sorted_idx[0]: 'Negative', sorted_idx[1]: 'Neutral', sorted_idx[2]: 'Positive'}

# 1. Price with regime shading
ax = axes[0]
prices = df['close'].values
dates = df.index

ax.plot(dates, prices, linewidth=1.5, color='black', zorder=2)
ax.set_ylabel('Price ($)', fontsize=12)
ax.set_title('SPY Price with Detected Regimes', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)

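# Shade background by regime. States align with the last len(states) dates,
# which matters if the first price row was dropped when computing log returns.
regime_dates = dates[len(dates) - len(states):]
for i, state in enumerate(states):
    ax.axvspan(regime_dates[i], regime_dates[min(i+1, len(regime_dates)-1)], 
               alpha=0.2, color=state_colors[state], zorder=1)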

# Add legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=state_colors[s], alpha=0.2, 
                         label=f'State {s} ({state_names[s]})') 
                   for s in sorted_idx]
ax.legend(handles=legend_elements, loc='upper left')

# 2. Log returns with regime shading
ax = axes[1]
log_returns = observations['log_return'].to_numpy()  # DataFrames have no .flatten()
ax.plot(dates[1:], log_returns, linewidth=0.8, color='blue', alpha=0.7)
ax.axhline(y=0, color='black', linestyle='--', linewidth=1)
ax.set_ylabel('Log Return', fontsize=12)
ax.set_title('Log Returns with Detected Regimes', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)

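# Shade background, aligning states with the last len(states) dates
regime_dates = dates[len(dates) - len(states):]
for i, state in enumerate(states):
    ax.axvspan(regime_dates[i], regime_dates[min(i+1, len(regime_dates)-1)], 
               alpha=0.2, color=state_colors[state], zorder=1)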

# 3. Regime sequence as bar plot
ax = axes[2]
regime_bars = ax.bar(dates[1:], np.ones(len(states)), width=1, 
                      color=[state_colors[s] for s in states], 
                      edgecolor='none', alpha=0.6)
ax.set_ylabel('Regime', fontsize=12)
ax.set_xlabel('Date', fontsize=12)
ax.set_title('Regime Sequence', fontsize=14, fontweight='bold')
ax.set_ylim(0, 1.5)
ax.set_yticks([])
ax.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\nNote: regimes group together periods of similar return behavior")

## 7. Choosing the Number of States

How do we decide between 2, 3, 4, or more states? Let's compare different models.
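
Both criteria trade log-likelihood against the number of free parameters. The sketch below shows the mechanics for a univariate Gaussian HMM, counting only free parameters (each row of the transition matrix and the initial distribution must sum to 1). The function names and the log-likelihood values are made up for illustration.

```python
import math

def gaussian_hmm_free_params(n_states):
    """Number of free parameters in a univariate Gaussian HMM."""
    transitions = n_states * (n_states - 1)   # each row of A sums to 1
    initial = n_states - 1                    # initial probabilities sum to 1
    emissions = 2 * n_states                  # one mean and one std per state
    return transitions + initial + emissions

def aic(log_likelihood, n_params):
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_obs):
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# Made-up log-likelihoods, only to show how the penalties grow with n_states
for n_states, log_lik in [(2, 1650.0), (3, 1662.0), (4, 1666.0)]:
    k = gaussian_hmm_free_params(n_states)
    print(n_states, k, round(aic(log_lik, k), 2), round(bic(log_lik, k, 503), 2))
```

The BIC penalty per parameter is `log(n_obs)`, which exceeds the AIC penalty of 2 once there are more than about 7 observations, so BIC leans toward fewer states on typical datasets.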

In [None]:
# Train models with different numbers of states
print("Comparing models with different numbers of states...")
print("=" * 80)

results = []

for n_states in [2, 3, 4, 5]:
    config = HMMConfig(n_states=n_states, max_iterations=100, random_seed=42)
    model = HiddenMarkovModel(config)
    model.fit(observations)
    
    # Calculate metrics
    log_likelihood = model.score(observations)
    n_params = n_states * (n_states - 1) + (n_states - 1) + 2 * n_states  # Free transitions + free initial probs + means + stds
    aic = -2 * log_likelihood + 2 * n_params
    bic = -2 * log_likelihood + n_params * np.log(len(observations))
    
    results.append({
        'n_states': n_states,
        'log_likelihood': log_likelihood,
        'aic': aic,
        'bic': bic,
        'n_params': n_params
    })
    
    print(f"\n{n_states} states:")
    print(f"  Log-likelihood: {log_likelihood:>10.2f}")
    print(f"  AIC:            {aic:>10.2f} (lower is better)")
    print(f"  BIC:            {bic:>10.2f} (lower is better)")
    print(f"  Parameters:     {n_params:>10}")

print("\n" + "=" * 80)

# Find best models
results_df = pd.DataFrame(results)
best_aic = results_df.loc[results_df['aic'].idxmin()]
best_bic = results_df.loc[results_df['bic'].idxmin()]

print(f"\nBest by AIC: {int(best_aic['n_states'])} states (AIC={best_aic['aic']:.2f})")
print(f"Best by BIC: {int(best_bic['n_states'])} states (BIC={best_bic['bic']:.2f})")

print("\nNote: BIC penalizes extra parameters more heavily, so it often prefers fewer states")
print("   AIC may prefer more complex models that fit better")
print("   For trading, 3-4 states are usually easier to interpret")

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Plot 1: Log-likelihood
ax = axes[0]
ax.plot(results_df['n_states'], results_df['log_likelihood'], marker='o', linewidth=2, markersize=8)
ax.set_xlabel('Number of States', fontsize=12)
ax.set_ylabel('Log-Likelihood', fontsize=12)
ax.set_title('Model Fit (Higher is Better)', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.set_xticks([2, 3, 4, 5])

# Plot 2: AIC
ax = axes[1]
ax.plot(results_df['n_states'], results_df['aic'], marker='o', linewidth=2, markersize=8, color='orange')
best_aic_idx = results_df['aic'].idxmin()
ax.scatter(results_df.loc[best_aic_idx, 'n_states'], 
           results_df.loc[best_aic_idx, 'aic'], 
           color='red', s=150, zorder=5, label='Best')
ax.set_xlabel('Number of States', fontsize=12)
ax.set_ylabel('AIC', fontsize=12)
ax.set_title('AIC (Lower is Better)', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend()
ax.set_xticks([2, 3, 4, 5])

# Plot 3: BIC
ax = axes[2]
ax.plot(results_df['n_states'], results_df['bic'], marker='o', linewidth=2, markersize=8, color='green')
best_bic_idx = results_df['bic'].idxmin()
ax.scatter(results_df.loc[best_bic_idx, 'n_states'], 
           results_df.loc[best_bic_idx, 'bic'], 
           color='red', s=150, zorder=5, label='Best')
ax.set_xlabel('Number of States', fontsize=12)
ax.set_ylabel('BIC', fontsize=12)
ax.set_title('BIC (Lower is Better)', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend()
ax.set_xticks([2, 3, 4, 5])

plt.tight_layout()
plt.show()

## 8. Key Takeaways

### What We Learned

1. **HMM Components**:
   - Hidden states represent market regimes
   - Transition matrix captures regime switching behavior
   - Emission parameters (μ, σ) define return characteristics

2. **Training and Decoding**:
   - Baum-Welch algorithm learns parameters from data
   - Viterbi algorithm finds the most likely state sequence given those parameters
   - Training converges to a local optimum, so results may vary with initialization and random seed

3. **Interpretation**:
   - States are numbered arbitrarily (0, 1, 2)
   - Must examine emission means to understand regime types
   - DO NOT force "Bear/Sideways/Bull" labels prematurely

4. **Model Selection**:
   - Use AIC/BIC to compare different numbers of states
   - Balance fit quality vs. model complexity
   - 3-4 states are often a practical choice for interpretability
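
The interpretation point can be illustrated with a small sketch that ranks states by their learned mean return instead of relying on the state index. The helper `rank_states_by_mean` and the example means are hypothetical and are not taken from the fitted model.

```python
import numpy as np

def rank_states_by_mean(emission_means):
    """Map each arbitrary state id to its rank by mean return (0 = lowest)."""
    order = np.argsort(emission_means)
    return {int(state): rank for rank, state in enumerate(order)}

# Made-up means where state 0 happens to be the highest-return state
means = np.array([0.0012, -0.0020, 0.0001])
ranks = rank_states_by_mean(means)
print(ranks)
```

Even a ranking like this is only a starting point: volatility and persistence should be examined as well before attaching any descriptive label to a state.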

### Important Warnings

⚠️ **DO NOT**:
- Sort states by index and call them Bear/Sideways/Bull
- Assume state 0 is always negative returns
- Use HMMs on non-stationary data (prices)

✓ **DO**:
- Use log returns (stationary observations)
- Examine learned parameters before labeling
- Compare multiple model complexities
- Validate regime assignments make sense

---

### What's Next?

**Notebook 3** will show the full pipeline:
- Regime labeling and analysis tools
- Advanced diagnostics and quality metrics
- Portfolio applications
- Best practices for production use

---

**Key Point**: HMMs are a **data-driven** approach. Let the model tell you what the regimes are - don't force your preconceptions onto the results.