# 🔧 Feature Engineering: Building Market Behavior Indicators

This notebook transforms raw cryptocurrency price data into features that capture different aspects of market behavior. These engineered features are the foundation for PCA and regime detection.

## 🎯 Objective
Create a comprehensive feature set that captures:
- **Momentum & Trend**: Price direction and strength
- **Volatility & Risk**: Market uncertainty and risk levels  
- **Technical Signals**: Overbought/oversold conditions and trend reversals
- **Relative Performance**: Cross-asset dynamics

## 📊 Feature Categories
We engineer 7 types of features across 15 cryptocurrencies = **105 total features**:

1. **Log Returns**: Daily price changes (momentum proxy)
2. **30-day Volatility**: Rolling standard deviation (risk measure)  
3. **90-day Momentum**: Longer-term trend strength
4. **RSI (14-day)**: Overbought/oversold indicator
5. **%B Bollinger Bands**: Position within trading bands
6. **MACD Histogram**: Trend change momentum
7. **30-day Sharpe Ratio**: Risk-adjusted returns

These features capture the full spectrum of market dynamics needed for regime analysis.

In [None]:
import numpy as np
import pandas as pd

# Load the daily price data
df = pd.read_csv("../data/crypto_prices.csv", index_col=0, parse_dates=True)

# Ensure columns are sorted alphabetically for consistency
df = df.sort_index(axis=1)

# === CORE FEATURES ===

# Log returns
log_returns = np.log(df / df.shift(1)).add_suffix("_log_return")

# 30-day volatility (standard deviation of log returns)
# Note: columns inherit the earlier suffix, e.g. "<coin>_log_return_30d_vol"
volatility = log_returns.rolling(window=30).std().add_suffix("_30d_vol")

# 90-day momentum (% price change)
momentum = df.pct_change(periods=90, fill_method=None).add_suffix("_90d_momentum")

# === TECHNICAL INDICATORS ===

# RSI (14-day), using simple rolling means of gains and losses rather than Wilder's smoothing
def compute_rsi(series, window=14):
    delta = series.diff()
    gain = (delta.where(delta > 0, 0)).rolling(window).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))

rsi = pd.concat([compute_rsi(df[col]).rename(f"{col}_rsi14") for col in df.columns], axis=1)

# %B Bollinger Band
def compute_percent_b(series, window=20):
    ma = series.rolling(window).mean()
    std = series.rolling(window).std()
    upper = ma + 2 * std
    lower = ma - 2 * std
    return ((series - lower) / (upper - lower)).rename(f"{series.name}_pctB")

pct_b = pd.concat([compute_percent_b(df[col]) for col in df.columns], axis=1)

# MACD Histogram
def compute_macd_hist(series, fast=12, slow=26, signal=9):
    ema_fast = series.ewm(span=fast).mean()
    ema_slow = series.ewm(span=slow).mean()
    macd = ema_fast - ema_slow
    signal_line = macd.ewm(span=signal).mean()
    return (macd - signal_line).rename(f"{series.name}_macd_hist")

macd_hist = pd.concat([compute_macd_hist(df[col]) for col in df.columns], axis=1)

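# Rolling Sharpe Ratio (30-day), no risk-free rate, not annualized
# Note: columns inherit the earlier suffix, e.g. "<coin>_log_return_sharpe30"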
rolling_sharpe = (
    log_returns.rolling(30).mean() / log_returns.rolling(30).std()
).add_suffix("_sharpe30")

# === COMBINE ALL FEATURES ===
features = pd.concat([log_returns, volatility, momentum, rsi, pct_b, macd_hist, rolling_sharpe], axis=1)
features = features.dropna()

# Save to CSV
features.to_csv("../data/crypto_features.csv")
print("✅ Feature engineering complete. Saved to '../data/crypto_features.csv'.")

## 📊 Detailed Feature Explanations

Here's what each feature type captures in market behavior:

### **1. Log Returns (`_log_return`)**
- **Formula**: ln(price_t / price_t-1)  
- **Captures**: Daily momentum and directional bias
- **Trading Insight**: Positive = upward pressure, negative = downward pressure
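
As a small illustration of why log returns are convenient, the sketch below uses made-up prices (not the notebook's data) to check two properties: daily log returns sum to the log of the overall price ratio, and `exp(r) - 1` converts a log return back to a simple return. The series name `demo` and the price values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, for illustration only
prices = pd.Series([100.0, 110.0, 99.0, 105.0], name="demo")

log_ret = np.log(prices / prices.shift(1))
simple_ret = prices.pct_change(fill_method=None)

# Daily log returns add up to the log of the overall price ratio
total_log_return = log_ret.sum()
overall_ratio_log = np.log(prices.iloc[-1] / prices.iloc[0])
print(np.isclose(total_log_return, overall_ratio_log))

# exp(log return) - 1 recovers the simple return
print(np.allclose(np.expm1(log_ret.dropna()), simple_ret.dropna()))
```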

### **2. 30-Day Volatility (`_30d_vol`)**
- **Formula**: Rolling standard deviation of log returns (30-day window)
- **Captures**: Market uncertainty and risk levels
- **Trading Insight**: High volatility often accompanies regime changes; low volatility tends to accompany stable trends
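
The rolling window has a warm-up period that later interacts with `dropna()`. The sketch below runs on synthetic returns (an assumption, not real market data). It shows that a 30-day window leaves the first 29 rows of a complete series as NaN, and that pandas computes the sample standard deviation (`ddof=1`) by default.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic daily log returns (hypothetical, not real market data)
log_ret = pd.Series(rng.normal(loc=0.0, scale=0.03, size=120))

vol_30d = log_ret.rolling(window=30).std()

# Rows with fewer than 30 observations in the window are NaN
warmup_rows = int(vol_30d.isna().sum())

# pandas uses the sample standard deviation (ddof=1) by default
first_valid = vol_30d.iloc[29]
manual = np.std(log_ret.iloc[:30].to_numpy(), ddof=1)
print(warmup_rows, np.isclose(first_valid, manual))
```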

### **3. 90-Day Momentum (`_90d_momentum`)**
- **Formula**: (price_t / price_t-90) - 1
- **Captures**: Medium-term trend strength 
- **Trading Insight**: Persistent trends vs mean-reverting behavior
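
The momentum formula can be cross-checked against the log-return feature. On a synthetic price path (an assumption for illustration), the sketch below confirms that `pct_change(periods=90)` matches `price_t / price_t-90 - 1`, and that it equals `exp` of the summed last 90 log returns, minus 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Synthetic price path (hypothetical)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.02, 200))))

momentum_90d = prices.pct_change(periods=90, fill_method=None)
manual = prices / prices.shift(90) - 1

# Summing 90 daily log returns gives the log of the 90-day price ratio
log_ret = np.log(prices / prices.shift(1))
from_log_returns = np.expm1(log_ret.rolling(90).sum())

print(np.allclose(momentum_90d.dropna(), manual.dropna()))
print(np.allclose(momentum_90d.dropna(), from_log_returns.dropna()))
```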

### **4. RSI 14-Day (`_rsi14`)**
- **Formula**: 100 - 100/(1 + RS), where RS = avg_gain/avg_loss
- **Captures**: Overbought (>70) and oversold (<30) conditions
- **Trading Insight**: Extreme values signal potential reversals
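
The edge cases of this formula are worth seeing on toy data. The sketch below re-declares the notebook's `compute_rsi` and applies it to three hypothetical series: one that only rises, one that only falls, and one that never moves. With no losses the ratio RS is infinite, and with no price change at all it is 0/0, which yields NaN.

```python
import numpy as np
import pandas as pd

def compute_rsi(series, window=14):
    # Same simple-moving-average formulation as the notebook's feature cell
    delta = series.diff()
    gain = delta.where(delta > 0, 0).rolling(window).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))

# Hypothetical toy series, 30 observations each
rising = pd.Series(np.arange(100.0, 130.0))         # only gains
falling = pd.Series(np.arange(130.0, 100.0, -1.0))  # only losses
flat = pd.Series(np.full(30, 100.0))                # no change at all

rsi_rising = compute_rsi(rising).iloc[-1]
rsi_falling = compute_rsi(falling).iloc[-1]
rsi_flat = compute_rsi(flat).iloc[-1]
print(rsi_rising, rsi_falling, rsi_flat)
```

A NaN RSI, as in the flat case, means that row is removed by the later `features.dropna()` call.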

### **5. Bollinger %B (`_pctB`)**
- **Formula**: (price - lower_band) / (upper_band - lower_band)
- **Captures**: Position within Bollinger Bands (0 = lower band, 1 = upper band)
- **Trading Insight**: Values >1 or <0 indicate breakouts from normal range
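
One way to read %B is as the distance from the 20-day mean, measured in band widths. Because the bands sit 2 standard deviations either side of the mean, the formula simplifies to `0.5 + (price - mean) / (4 * std)`. The sketch below checks that identity on a synthetic price path (an assumption for illustration).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic price path (hypothetical)
price = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.02, 60))), name="demo")

ma = price.rolling(20).mean()
std = price.rolling(20).std()
upper = ma + 2 * std
lower = ma - 2 * std
pct_b = (price - lower) / (upper - lower)

# Equivalent form: 0.5 plus the offset from the mean in units of band width (4 std)
pct_b_alt = 0.5 + (price - ma) / (4 * std)
print(np.allclose(pct_b.dropna(), pct_b_alt.dropna()))
```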

### **6. MACD Histogram (`_macd_hist`)**
- **Formula**: MACD_line - Signal_line
- **Captures**: Momentum changes and trend acceleration/deceleration
- **Trading Insight**: Positive = MACD above its signal line (upward momentum building), negative = MACD below it (downward momentum building)
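
To see how the histogram reacts to a change in trend, the sketch below applies the notebook's MACD logic to a hypothetical series that is flat at 100 and then jumps once to 110. The price values and the jump day are assumptions chosen to make the response easy to inspect.

```python
import numpy as np
import pandas as pd

def compute_macd_hist(series, fast=12, slow=26, signal=9):
    # Same spans and pandas defaults as the notebook's feature cell
    ema_fast = series.ewm(span=fast).mean()
    ema_slow = series.ewm(span=slow).mean()
    macd = ema_fast - ema_slow
    signal_line = macd.ewm(span=signal).mean()
    return macd - signal_line

# Hypothetical series: flat at 100, then a single jump to 110 on day 50
price = pd.Series(np.r_[np.full(50, 100.0), np.full(50, 110.0)])
hist = compute_macd_hist(price)

print(hist.iloc[:50].abs().max())  # flat segment: histogram is numerically zero
print(hist.iloc[50])               # jump day: histogram turns positive
```

On the jump day the fast EMA reacts more strongly than the slow EMA, so the MACD line rises above its lagging signal line.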

### **7. 30-Day Sharpe Ratio (`_sharpe30`)**
- **Formula**: Rolling mean of log returns / rolling standard deviation of log returns (30-day), with no risk-free rate and no annualization
- **Captures**: Risk-adjusted performance
- **Trading Insight**: Higher values = better risk-adjusted returns
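
The notebook keeps the ratio in daily units. If an annualized figure is wanted for reporting, a common convention is to multiply by the square root of the number of periods per year. The sketch below assumes 365 periods (crypto trades every calendar day) and independent daily returns; both are assumptions, and the returns are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Synthetic daily log returns (hypothetical)
log_ret = pd.Series(rng.normal(0.001, 0.03, 120))

sharpe_30d = log_ret.rolling(30).mean() / log_ret.rolling(30).std()

# Assumption: 365 trading periods per year
PERIODS_PER_YEAR = 365
sharpe_30d_annualized = sharpe_30d * np.sqrt(PERIODS_PER_YEAR)

print(sharpe_30d_annualized.dropna().head())
```

Because annualizing multiplies by a constant, it would not change the z-scored feature that feeds PCA.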

---
**Total**: 7 feature types × 15 coins = **105 features** that comprehensively describe crypto market behavior across different timeframes and market aspects.
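
A quick sanity check on the 7 × 15 count is to tally columns by suffix. The sketch below builds placeholder column names (the `coin00`, `coin01`, ... names are hypothetical) and counts how many end with each suffix.

```python
# Hypothetical placeholder coin names; the real ones come from the price CSV
coins = [f"coin{i:02d}" for i in range(15)]
suffixes = [
    "_log_return", "_30d_vol", "_90d_momentum",
    "_rsi14", "_pctB", "_macd_hist", "_sharpe30",
]

columns = [f"{coin}{suffix}" for suffix in suffixes for coin in coins]

# Count columns per feature type by matching the trailing suffix
counts = {s: sum(col.endswith(s) for col in columns) for s in suffixes}
total = len(columns)
print(counts, total)
```

The same tally can be run on `features.columns` after loading the saved CSV.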

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the engineered features
features = pd.read_csv("../data/crypto_features.csv", index_col=0, parse_dates=True)

# Normalize (Z-score standardization)
scaler = StandardScaler()
features_normalized = pd.DataFrame(
    scaler.fit_transform(features),
    index=features.index,
    columns=features.columns
)

# Save the normalized version
features_normalized.to_csv("../data/crypto_features_normalized.csv")
print("✅ Normalized features saved to '../data/crypto_features_normalized.csv'.")

## 📏 Feature Normalization

**Why normalize?**
- Features have different scales (RSI: 0-100, daily log returns: typically within about -0.1 to 0.1, etc.)
- PCA is sensitive to scale differences: features with larger variances dominate the leading components
- StandardScaler ensures each feature has mean=0, std=1

**Method**: Z-score standardization
- `(value - mean) / standard_deviation`
- Preserves relationships while equalizing scales
- Critical for meaningful PCA
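
The z-score formula can be reproduced by hand to confirm what the scaler does. The sketch below uses plain pandas on two synthetic columns with very different scales (the column names and values are assumptions). It uses `ddof=0` because `StandardScaler` divides by the population standard deviation, whereas pandas defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic columns on very different scales (hypothetical)
features = pd.DataFrame({
    "demo_rsi14": rng.uniform(0, 100, 500),
    "demo_log_return": rng.normal(0, 0.03, 500),
})

# (value - mean) / standard_deviation, with the population std (ddof=0)
z = (features - features.mean()) / features.std(ddof=0)

print(z.mean().round(6).tolist())
print(z.std(ddof=0).round(6).tolist())
```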

### ⚙️ Rolling Feature Parameters

The window sizes for feature calculations are chosen to balance:
- **Short-term sensitivity** (14-day RSI, 20-day Bollinger Bands, 12/26/9-span MACD).
- **Medium-term regime detection** (30-day volatility and Sharpe ratio, 90-day momentum).

### 📌 Output Summary

The final dataset includes:
- A **date** index, with the **coin name** carried as the prefix of each feature column
- All computed features aligned per date, with incomplete warm-up rows dropped
- A normalized copy ready for dimensionality reduction (PCA) in the next step