# Feature Engineering for Regime Identification

This notebook builds standardized economic state features from macroeconomic variables: each variable is converted to a 12-month change, normalized with a rolling 10-year z-score, and winsorized at ±3. These engineered features will form the input for PCA, clustering (K-Means/HMM), and similarity-based regime identification.

In [2]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Load the macroeconomic data
DATA_PATH = '../data/processed/macro_data_1962on (cleaned).csv'
df = pd.read_csv(DATA_PATH, parse_dates=['date'])
df.set_index('date', inplace=True)
df = df.sort_index()

print(f"Data shape: {df.shape}")
print(f"Date range: {df.index.min()} to {df.index.max()}")
print(f"\nColumns: {df.columns.tolist()}")
df.head()

Data shape: (766, 7)
Date range: 1962-01-31 00:00:00 to 2025-10-31 00:00:00

Columns: ['market', 'yield_curve', 'oil ($/bbl)', 'copper ($/metric ton)', 'monetary_policy', 'volatility', 'stock_bond_corr']


| date | market | yield_curve | oil ($/bbl) | copper ($/metric ton) | monetary_policy | volatility | stock_bond_corr |
|------|--------|-------------|-------------|-----------------------|-----------------|------------|-----------------|
| 1962-01-31 | 68.839996 | -2.32 | 1.52 | 635.37 | 2.73 | 0.093924 | NaN |
| 1962-02-28 | 69.959999 | -2.31 | 1.52 | 647.94 | 2.71 | 0.06457 | NaN |
| 1962-03-31 | 69.550003 | -2.364 | 1.52 | 647.5 | 2.75 | 0.057723 | NaN |
| 1962-04-30 | 65.239998 | -2.354 | 1.52 | 645.95 | 2.74 | 0.098421 | NaN |
| 1962-05-31 | 59.630001 | -2.31 | 1.52 | 645.73 | 2.7 | 0.347108 | NaN |


## Transformation Methodology

Following the economic state variable transformation approach:

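1. **12-month change**: Compute the year-over-year difference for each variable (a level difference, `x[t] - x[t-12]`, not a percentage change)
2. **Rolling Z-score normalization**: Normalize each change by the mean and standard deviation of the changes over a **rolling 10-year (120 months)** window ending at the current month
3. **Winsorization at ±3**: Cap values to [-3, 3] to limit the influence of outliers

This creates transformed economic state variables that share a common scale, so they are comparable across different macroeconomic indicators. Differencing and rolling normalization also remove most level and scale drift, although stationarity is not guaranteed.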
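
The three steps can also be written compactly. The symbols below are notation introduced here for illustration (they are not taken from the design document): $x_t$ is the raw monthly value, $\mu_t$ and $\sigma_t$ are the mean and sample standard deviation (pandas default, `ddof=1`) of $\Delta_{t-119}, \dots, \Delta_t$.

$$
\Delta_t = x_t - x_{t-12}, \qquad
z_t = \frac{\Delta_t - \mu_t}{\sigma_t}, \qquad
\tilde{z}_t = \min\bigl(\max(z_t,\,-3),\; 3\bigr)
$$

Because the window ends at $t$, the current observation contributes to its own $\mu_t$ and $\sigma_t$.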

In [3]:
def transform_economic_variable(series: pd.Series, 
                                  change_window: int = 12,
                                  zscore_window: int = 120,
                                  winsorize_bound: float = 3.0) -> pd.Series:
    """
    Transform an economic variable following the methodology:
    1. Compute 12-month change (year-over-year difference)
    2. Normalize using rolling z-score over 10 years (120 months)
    3. Winsorize at ±3 to remove outliers
    
    Parameters:
    -----------
    series : pd.Series
        Input time series of the raw macroeconomic variable
    change_window : int
        Window for computing year-over-year change (default: 12 months)
    zscore_window : int
        Rolling window for z-score normalization (default: 120 months = 10 years)
    winsorize_bound : float
        Symmetric bound for winsorization (default: 3.0, clips to [-3, 3])
    
    Returns:
    --------
    pd.Series
        Transformed economic state variable
    """
    # Step 1: Compute 12-month (year-over-year) change
    yoy_change = series.diff(periods=change_window)
    
    # Step 2: Compute rolling z-score over 10-year window
    # Rolling mean and std of the 12-month changes
    rolling_mean = yoy_change.rolling(window=zscore_window, min_periods=zscore_window).mean()
    rolling_std = yoy_change.rolling(window=zscore_window, min_periods=zscore_window).std()
    
    # Z-score normalization
    zscore = (yoy_change - rolling_mean) / rolling_std
    
    # Handle infinities
    zscore = zscore.replace([np.inf, -np.inf], np.nan)
    
    # Step 3: Winsorize at ±3 to cap outliers
    transformed = zscore.clip(lower=-winsorize_bound, upper=winsorize_bound)
    
    return transformed
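
As a quick sanity check of the winsorization step, the sketch below applies the same three steps inline to synthetic data. The random walk and the artificial level shift at row 200 are assumptions made purely for illustration; they are not part of the macro dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: a random walk with one artificial level shift.
rng = np.random.default_rng(0)
values = rng.normal(size=300).cumsum()
values[200:] += 100.0  # the 12-month change spikes for rows 200-211
series = pd.Series(values)

# Same three steps as transform_economic_variable, restated inline
# so that this block runs on its own.
change = series.diff(periods=12)
roll = change.rolling(window=120, min_periods=120)
zscore = (change - roll.mean()) / roll.std()
clipped = zscore.clip(lower=-3.0, upper=3.0)

print(f"largest raw z-score:     {zscore.max():.2f}")
print(f"largest clipped z-score: {clipped.max():.2f}")
```

The raw z-score at the shift is well above 3, while the clipped series tops out at exactly 3.0, which is the cap the function applies.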

## Build Feature Matrix

Apply the transformation to all macroeconomic variables and combine into a single feature matrix.

In [4]:
def build_feature_matrix(df: pd.DataFrame, 
                         change_window: int = 12,
                         zscore_window: int = 120,
                         winsorize_bound: float = 3.0,
                         exclude_cols: list = None) -> pd.DataFrame:
    """
    Build a feature matrix by applying the transformation to all variables.
    
    Transformation steps for each variable:
    1. 12-month change (year-over-year difference)
    2. Rolling z-score over 10 years (120 months)
    3. Winsorization at ±3
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input dataframe with macro variables (DatetimeIndex)
    change_window : int
        Window for year-over-year change (default: 12 months)
    zscore_window : int
        Rolling window for z-score normalization (default: 120 months = 10 years)
    winsorize_bound : float
        Symmetric bound for winsorization (default: 3.0)
    exclude_cols : list
        Columns to exclude from feature engineering
    
    Returns:
    --------
    pd.DataFrame
        Feature matrix with transformed economic state variables
    """
    if exclude_cols is None:
        exclude_cols = []
    
    feature_dict = {}
    
    for col in df.columns:
        if col in exclude_cols:
            continue
            
        series = df[col]
        
        # Need enough data for change window + zscore window
        min_required = change_window + zscore_window
        non_null_count = series.dropna().shape[0]
        
        if non_null_count < min_required:
            print(f"Skipping {col}: insufficient data ({non_null_count} < {min_required})")
            continue
        
        # Apply transformation
        transformed = transform_economic_variable(
            series, 
            change_window=change_window,
            zscore_window=zscore_window,
            winsorize_bound=winsorize_bound
        )
        
        feature_dict[f'{col}_transformed'] = transformed
        print(f"Processed: {col}")
    
    # Create feature matrix
    feature_matrix = pd.DataFrame(feature_dict)
    
    return feature_matrix

## Apply Feature Engineering to Macro Data

Generate the complete feature matrix from the macroeconomic variables.

In [5]:
# Build feature matrix with the new transformation methodology
# 12-month change → rolling 10-year z-score → winsorize at ±3
feature_matrix = build_feature_matrix(
    df, 
    change_window=12,      # 12-month (1-year) change
    zscore_window=120,     # Rolling 10-year window for z-score
    winsorize_bound=3.0    # Cap at ±3
)

print(f"\nFeature matrix shape: {feature_matrix.shape}")
print(f"Feature columns: {feature_matrix.columns.tolist()}")
print(f"\nValid data starts from: {feature_matrix.dropna().index.min()}")

Processed: market
Processed: yield_curve
Processed: oil ($/bbl)
Processed: copper ($/metric ton)
Processed: monetary_policy
Processed: volatility
Processed: stock_bond_corr

Feature matrix shape: (766, 7)
Feature columns: ['market_transformed', 'yield_curve_transformed', 'oil ($/bbl)_transformed', 'copper ($/metric ton)_transformed', 'monetary_policy_transformed', 'volatility_transformed', 'stock_bond_corr_transformed']

Valid data starts from: 1973-12-31 00:00:00


In [6]:
# Preview the feature matrix
feature_matrix.head(30)

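| date | market_transformed | yield_curve_transformed | oil ($/bbl)_transformed | copper ($/metric ton)_transformed | monetary_policy_transformed | volatility_transformed | stock_bond_corr_transformed |
|------|------|------|------|------|------|------|------|
| 1962-01-31 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1962-02-28 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1962-03-31 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1962-04-30 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1962-05-31 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1962-06-30 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1962-07-31 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1962-08-31 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1962-09-30 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1962-10-31 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |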


## Feature Summary Statistics

In [7]:
# Summary statistics for features
print("Feature Summary Statistics (after burn-in period):\n")
feature_summary = feature_matrix.dropna().describe()
feature_summary

Feature Summary Statistics (after burn-in period):



| | market_transformed | yield_curve_transformed | oil ($/bbl)_transformed | copper ($/metric ton)_transformed | monetary_policy_transformed | volatility_transformed | stock_bond_corr_transformed |
|------|------|------|------|------|------|------|------|
| count | 623.0 | 623.0 | 623.0 | 623.0 | 623.0 | 623.0 | 623.0 |
| mean | 0.364943 | -0.039018 | 0.133972 | 0.096289 | 0.042927 | 0.001351 | -0.014871 |
| std | 1.36912 | 1.11629 | 1.231218 | 1.126721 | 1.117863 | 1.0729 | 1.021737 |
| min | -3.0 | -3.0 | -3.0 | -3.0 | -3.0 | -3.0 | -2.948614 |
| 25% | -0.357841 | -0.714499 | -0.544237 | -0.598063 | -0.516519 | -0.526132 | -0.668616 |
| 50% | 0.52394 | -0.051838 | -0.021259 | -0.007206 | 0.062866 | -0.074919 | 0.043381 |
| 75% | 1.237453 | 0.508403 | 0.728048 | 0.734794 | 0.716348 | 0.499765 | 0.685941 |
| max | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |


In [8]:
# Check missing values
print("Missing values per feature:\n")
missing_counts = feature_matrix.isna().sum()
print(missing_counts)

# Valid data coverage
valid_rows = feature_matrix.dropna().shape[0]
total_rows = feature_matrix.shape[0]
print(f"\nValid rows (no missing): {valid_rows} / {total_rows} ({100*valid_rows/total_rows:.1f}%)")
print(f"Date range with complete features: {feature_matrix.dropna().index.min()} to {feature_matrix.dropna().index.max()}")

Missing values per feature:

market_transformed                   131
yield_curve_transformed              131
oil ($/bbl)_transformed              131
copper ($/metric ton)_transformed    131
monetary_policy_transformed          131
volatility_transformed               131
stock_bond_corr_transformed          143
dtype: int64

Valid rows (no missing): 623 / 766 (81.3%)
Date range with complete features: 1973-12-31 00:00:00 to 2025-10-31 00:00:00


### Why are there missing values?

The missing values come from the **burn-in period** required for window-based calculations:

| Step | Source of NaNs | Count |
|------|---------------|-------|
| **12-month change** | `diff(periods=12)` → the first 12 observations have no value 12 months earlier | **12** |
| **Rolling z-score** | `rolling(window=120, min_periods=120)` on the changes → the first 119 changes lack a full window | **119** |
| **Winsorization** | `clip(-3, 3)` → adds no NaNs | **0** |
| **Total** | | **131** |

**Note**: `stock_bond_corr` has 12 additional missing values (143 in total) because the raw data itself is missing for the first 12 months (1962).
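
The burn-in arithmetic can be checked on synthetic data. The sketch below is illustrative only: `count_burn_in` is a hypothetical helper (not part of the notebook), and the random walks stand in for the real series, one with full history and one missing its first 12 months like `stock_bond_corr`.

```python
import numpy as np
import pandas as pd


def count_burn_in(series: pd.Series,
                  change_window: int = 12,
                  zscore_window: int = 120) -> int:
    """Hypothetical helper: count NaNs left by the change + rolling z-score steps."""
    change = series.diff(periods=change_window)
    roll = change.rolling(window=zscore_window, min_periods=zscore_window)
    zscore = (change - roll.mean()) / roll.std()
    return int(zscore.isna().sum())


# Synthetic stand-ins with the same length as the macro data (766 months).
rng = np.random.default_rng(0)
full_history = pd.Series(rng.normal(size=766).cumsum())

# Mimic a series whose raw data is missing for the first 12 months.
late_start = full_history.copy()
late_start.iloc[:12] = np.nan

print(count_burn_in(full_history))  # 12 + 119 = 131
print(count_burn_in(late_start))    # 131 + 12 = 143
```

These counts match the per-feature missing values reported above: 131 for most features and 143 for `stock_bond_corr`.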

In [10]:
# Inspect the burn-in period
print("First valid values for each transformed feature:\n")
first_valid = feature_matrix.apply(lambda x: x.first_valid_index())
print(first_valid)

print("\n\nBreakdown of missing value sources:")
print("=" * 50)
print("12-month change: 12 NaN (need 12 months of history)")
print("Rolling 10-year z-score: 119 more NaN (window needs 120 valid changes)")
print("Total burn-in: 131 months (most features start in Dec 1972)")
print("\nstock_bond_corr has +12 extra NaN because raw data starts in Jan 1963,")
print("so rows with complete features start in Dec 1973")

First valid values for each transformed feature:

market_transformed                  1972-12-31
yield_curve_transformed             1972-12-31
oil ($/bbl)_transformed             1972-12-31
copper ($/metric ton)_transformed   1972-12-31
monetary_policy_transformed         1972-12-31
volatility_transformed              1972-12-31
stock_bond_corr_transformed         1973-12-31
dtype: datetime64[ns]


Breakdown of missing value sources:
==================================================
12-month change: 12 NaN (need 12 months of history)
Rolling 10-year z-score: 119 more NaN (window needs 120 valid changes)
Total burn-in: 131 months (most features start in Dec 1972)

stock_bond_corr has +12 extra NaN because raw data starts in Jan 1963,
so rows with complete features start in Dec 1973


## Save Feature Matrix

Export the engineered features for use in regime modeling.

In [11]:
# Save the complete feature matrix
output_path = '../data/processed/feature_matrix.csv'
feature_matrix.to_csv(output_path)
print(f"Feature matrix saved to: {output_path}")

# Also save a version with only complete rows (no NaN)
feature_matrix_clean = feature_matrix.dropna()
output_path_clean = '../data/processed/feature_matrix_clean.csv'
feature_matrix_clean.to_csv(output_path_clean)
print(f"Clean feature matrix saved to: {output_path_clean}")
print(f"Clean matrix shape: {feature_matrix_clean.shape}")

Feature matrix saved to: ../data/processed/feature_matrix.csv
Clean feature matrix saved to: ../data/processed/feature_matrix_clean.csv
Clean matrix shape: (623, 7)
