# Phase 2: Feature Analysis

This notebook focuses on analyzing the engineered features extracted from the raw OHLCV data. 
The goal is to validate the features, understand their statistical properties, and ensure they are suitable for the Regime Detection model (HMM).

**Key Objectives:**
1. **Data Integrity:** Load and verify the raw data.
2. **Feature Construction:** Generate the full set of technical indicators and statistical features.
3. **Correlation Analysis:** Identify redundant features to avoid multicollinearity issues.
4. **Distribution Analysis:** Examine feature distributions for skewness, kurtosis, and outliers.

In [None]:
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Add project root to path
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

from src.data.loader import load_ohlcv_csv
from src.features.builder import build_features

# Plot settings
%matplotlib inline
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

## 1. Load Data
We load the 4-hour Bitcoin OHLCV data. This dataset serves as the foundation for our feature engineering.

In [None]:
data_path = project_root / "data/raw/btc_4h.csv"

try:
    df = load_ohlcv_csv(data_path)
    print(f"Successfully loaded {len(df)} candles.")
    print(f"Date Range: {df.index.min()} to {df.index.max()}")
except FileNotFoundError:
    print(f"Error: Data file not found at {data_path}")
    print("Please ensure the data file exists or run the data fetcher script.")
    # Fallback for demonstration purposes if file is missing (Optional)
    # dates = pd.date_range(start='2020-01-01', periods=1000, freq='4h')
    # df = pd.DataFrame(index=dates, data={'close': 10000 + np.random.randn(1000).cumsum()})
    # df['open'] = df['close'] + np.random.randn(1000)
    # df['high'] = df[['open', 'close']].max(axis=1) + 10
    # df['low'] = df[['open', 'close']].min(axis=1) - 10
    # df['volume'] = np.abs(np.random.randn(1000) * 100)
    # Without the fallback, `df` is undefined and later cells would fail with a
    # NameError, so stop here. Remove this `raise` if you enable the fallback.
    raise

df.head()
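
The first objective also calls for verifying the raw data, not just loading it. The following is a minimal sketch of such checks, assuming the frame has a `DatetimeIndex` and lowercase `open`, `high`, `low`, `close`, `volume` columns (as in the commented fallback above). `check_ohlcv_integrity` is a hypothetical helper, not part of `src.data.loader`.

```python
import pandas as pd


def check_ohlcv_integrity(df: pd.DataFrame, step: pd.Timedelta = pd.Timedelta(hours=4)) -> dict:
    """Return simple integrity diagnostics for an OHLCV frame (hypothetical helper)."""
    # Every timestamp we would expect on a regular grid between first and last candle
    expected = pd.date_range(df.index.min(), df.index.max(), freq=step)
    body_high = df[["open", "close"]].max(axis=1)
    body_low = df[["open", "close"]].min(axis=1)
    return {
        "duplicate_timestamps": int(df.index.duplicated().sum()),
        "is_sorted": bool(df.index.is_monotonic_increasing),
        "missing_candles": int(len(expected.difference(df.index))),
        "nan_cells": int(df.isna().sum().sum()),
        # high must not be below, and low must not be above, the candle body
        "bad_high_low": int(((df["high"] < body_high) | (df["low"] > body_low)).sum()),
        "negative_volume": int((df["volume"] < 0).sum()),
    }


# Small synthetic demo: six 4-hour candles with one removed to simulate a gap
idx = pd.date_range("2020-01-01", periods=6, freq=pd.Timedelta(hours=4))
demo = pd.DataFrame(
    {"open": 100.0, "high": 101.0, "low": 99.0, "close": 100.5, "volume": 10.0},
    index=idx,
)
demo = demo.drop(idx[2])
report = check_ohlcv_integrity(demo)
print(report)
```

On the real data, `check_ohlcv_integrity(df)` would be called right after loading; any non-zero count (or `is_sorted` being `False`) is worth investigating before building features.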

## 2. Feature Construction
We use the `build_features` function from `src.features.builder` to generate the feature set. This includes:
- **Returns:** Log returns, rolling means, and standard deviations.
- **Volatility:** Realized volatility over different windows.
- **Trend:** Moving average convergence divergence (MACD), RSI, etc.
- **Distribution:** Skewness and Kurtosis to capture higher moments.

In [None]:
features = build_features(df)
# Drop the warm-up rows where rolling windows are not yet filled
features = features.dropna()

print(f"Generated {features.shape[1]} features.")
print(f"Valid data points: {len(features)}")
features.head()
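
To make the feature families listed above concrete, here is a rough sketch of how return-based features of this kind could be computed with pandas. It is an illustration only and not the actual `build_features` implementation: the function name `sketch_return_features`, the output column names other than `log_return`, and the window length are all assumptions.

```python
import numpy as np
import pandas as pd


def sketch_return_features(close: pd.Series, window: int = 30) -> pd.DataFrame:
    """Illustrative return, volatility, and higher-moment features (hypothetical)."""
    log_ret = np.log(close).diff()  # log return between consecutive candles
    roll = log_ret.rolling(window)
    return pd.DataFrame(
        {
            "log_return": log_ret,
            "rolling_mean": roll.mean(),
            "rolling_std": roll.std(),    # a simple realized-volatility proxy
            "rolling_skew": roll.skew(),  # third moment
            "rolling_kurt": roll.kurt(),  # excess kurtosis
        }
    )


# Synthetic price path, used only to show the call
rng = np.random.default_rng(0)
close = pd.Series(10000 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))
feats = sketch_return_features(close, window=30).dropna()
print(feats.shape)
```

The first `window` rows are lost to the rolling warm-up, which is why a `dropna` step follows feature construction.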

## 3. Correlation Analysis
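Highly correlated features carry largely redundant information. In an HMM with Gaussian emissions, for example, near-collinear features push the per-state covariance matrices toward singularity, which can lead to unstable parameter estimates. We visualize the correlation matrix to identify redundant features.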

In [None]:
plt.figure(figsize=(20, 16))
corr = features.corr()

sns.heatmap(
    corr, 
    annot=False, 
    cmap='coolwarm', 
    vmin=-1, 
    vmax=1, 
    center=0,
    square=True,
    linewidths=.5,
    cbar_kws={"shrink": .5}
)
plt.title("Feature Correlation Matrix", fontsize=16)
plt.show()
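
With many features the heatmap is hard to read by eye, so it can help to list the strongly correlated pairs explicitly. The sketch below does this; `correlated_pairs` is a hypothetical helper and the 0.9 threshold is an arbitrary assumption, not a value taken from this project.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """List feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = frame.corr().abs()
    cols = corr.columns
    # Strict upper triangle: each unordered pair once, no self-correlations
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.DataFrame(
        {
            "feature_a": cols[i],
            "feature_b": cols[j],
            "abs_corr": corr.to_numpy()[i, j],
        }
    )
    pairs = pairs[pairs["abs_corr"] > threshold]
    return pairs.sort_values("abs_corr", ascending=False).reset_index(drop=True)


# Synthetic demo: "b" is almost a rescaled copy of "a", "c" is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size=500)
demo = pd.DataFrame(
    {"a": x, "b": 2 * x + 0.01 * rng.normal(size=500), "c": rng.normal(size=500)}
)
pairs = correlated_pairs(demo, threshold=0.9)
print(pairs)
```

Applied as `correlated_pairs(features)`, each returned row is a candidate for dropping one of the two features.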

## 4. Distribution Analysis
We examine the distributions of key features. Non-Gaussian distributions (fat tails, skew) are common in financial data and are exactly what we hope the HMM regimes will help characterize (e.g., a high-variance regime vs. a low-variance regime).

In [None]:
# Select a few representative features to analyze
key_features = [
    'log_return', 
    'rolling_std_medium', 
    'trend_rsi_14', 
    'dist_skew_medium'
]

plt.figure(figsize=(15, 10))
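for i, feature in enumerate(key_features):
    if feature not in features.columns:
        # Report missing names instead of silently leaving an empty panel
        print(f"Skipping '{feature}': not found in features.columns")
        continue
    plt.subplot(2, 2, i+1)
    sns.histplot(features[feature], kde=True, bins=50)
    plt.title(f"Distribution of {feature}")
    plt.xlabel(feature)
    plt.ylabel("Count")  # histplot's default stat is "count"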

plt.tight_layout()
plt.show()
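
The histograms can be complemented with numbers. A minimal sketch, assuming nothing beyond pandas: `moment_summary` is a hypothetical helper that tabulates sample skewness and excess kurtosis per column. Note that pandas' `kurt` uses Fisher's definition, so a normal distribution scores about 0.

```python
import numpy as np
import pandas as pd


def moment_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column skewness and excess kurtosis, fattest tails first."""
    summary = pd.DataFrame(
        {
            "skew": frame.skew(),
            "excess_kurtosis": frame.kurt(),  # Fisher definition: normal is ~0
        }
    )
    return summary.sort_values("excess_kurtosis", ascending=False)


# Synthetic demo: Gaussian noise versus a fat-tailed Student-t sample
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "gaussian": rng.normal(size=5000),
        "fat_tailed": rng.standard_t(3, size=5000),
    }
)
summary = moment_summary(demo)
print(summary)
```

Running `moment_summary(features)` gives a quick ranking of which features depart most from Gaussian behavior.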

### Box Plots for Outlier Detection
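Box plots show each feature's spread and draw points beyond the whiskers (1.5 × IQR from the quartiles, seaborn's default) as individual outliers. We standardize the features first so they can share one axis.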

In [None]:
plt.figure(figsize=(15, 6))
# Standardize (z-score) each feature so the boxplots share a comparable scale
features_norm = (features - features.mean()) / features.std()

# Plot only a subset to avoid overcrowding
subset_cols = features.columns[:10] 
sns.boxplot(data=features_norm[subset_cols])
plt.xticks(rotation=45)
plt.title("Feature Boxplots (Standardized)")
plt.show()
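
The same whisker rule can be applied numerically to rank features by how many observations fall outside the whiskers. The sketch below is one way to do it; `iqr_outlier_share` is a hypothetical helper, and `whis=1.5` mirrors seaborn's default rather than any project setting.

```python
import numpy as np
import pandas as pd


def iqr_outlier_share(frame: pd.DataFrame, whis: float = 1.5) -> pd.Series:
    """Fraction of rows per column lying beyond `whis` * IQR from the quartiles."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparisons broadcast the per-column bounds across all rows
    outside = (frame < q1 - whis * iqr) | (frame > q3 + whis * iqr)
    return outside.mean().sort_values(ascending=False)


# Synthetic demo: "spiky" equals "clean" except for one extreme value
clean = np.arange(100, dtype=float)
spiky = clean.copy()
spiky[-1] = 1000.0
shares = iqr_outlier_share(pd.DataFrame({"clean": clean, "spiky": spiky}))
print(shares)
```

Because the rule is based on quartiles, it gives the same answer on raw and standardized features, so `iqr_outlier_share(features)` can be run directly.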