# 06 - Anomaly Detection and Feature Engineering

**Author:** Lucas Little  
**Course:** CSCA 5522: Data Mining Project  
**University:** University of Colorado - Boulder

This notebook applies STL decomposition and z-score analysis to detect sentiment anomalies, then builds the final feature set for each sample.

## Objectives
1. Load the aligned price and sentiment data for each sample
2. Decompose each sample's sentiment signal with STL
3. Flag sentiment anomalies from z-scores of the STL residuals
4. Engineer the final modeling features
5. Save the enhanced feature sets

In [1]:
# Core imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
from pathlib import Path
import os

# Time series analysis imports
from statsmodels.tsa.seasonal import STL
from scipy import stats

warnings.filterwarnings('ignore')
plt.style.use('default')
sns.set_palette("husl")

print("Environment setup complete!")

Environment setup complete!


## 1. Process Sampled Data

In [2]:
data_dir = Path('../data')
processed_data_dir = data_dir / 'processed'
sampled_dir = processed_data_dir / 'sampled'

def detect_sentiment_anomalies(df, sentiment_col='sentiment_mean', z_threshold=2.0):
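    """
    Detect anomalies in a sentiment time series using STL decomposition and z-scores.

    The series is decomposed with STL, and points whose absolute residual z-score
    exceeds `z_threshold` are flagged.

    Returns a tuple (df_result, stl_result). df_result is a copy of `df` with three
    added columns: `<col>_stl_residual`, `<col>_z_score` (absolute value), and
    `<col>_anomaly` (0/1). If the column is missing or entirely NaN, the input
    frame is returned unchanged together with None.
    """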
    if sentiment_col not in df.columns or df[sentiment_col].isnull().all():
        print(f"Sentiment column '{sentiment_col}' not found or is all NaN.")
        return df, None
    
    # STL needs an explicit period. With 15-minute bars, one day is 24 * 4 = 96 observations.
    # Missing sentiment values are filled with 0 before decomposition.
    stl = STL(df[sentiment_col].fillna(0), period=96, robust=True)
    result = stl.fit()
    
    # Get the residual component
    residuals = result.resid
    
    # Calculate Z-scores of the residuals
    z_scores = np.abs(stats.zscore(residuals))
    
    # Add results to dataframe
    df_result = df.copy()
    df_result[f'{sentiment_col}_stl_residual'] = residuals
    df_result[f'{sentiment_col}_z_score'] = z_scores
    df_result[f'{sentiment_col}_anomaly'] = (z_scores > z_threshold).astype(int)
    
    return df_result, result

for i in range(1, 6):
    print(f"\n--- Processing Sample {i} ---")
    aligned_features_path = sampled_dir / f'aligned_features_sample_{i}.csv'
    
    if not aligned_features_path.exists():
        print(f"⚠️ Aligned features for sample {i} not found. Skipping.")
        continue
        
    df = pd.read_csv(aligned_features_path)
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df.set_index('timestamp', inplace=True)
    
    print("Detecting sentiment anomalies...")
    df, stl_result = detect_sentiment_anomalies(df, 'sentiment_mean', z_threshold=2.5)
    
    if stl_result is not None:
        anomaly_count = df['sentiment_mean_anomaly'].sum()
        anomaly_pct = (anomaly_count / len(df)) * 100
        print(f"Detected {anomaly_count} sentiment anomalies ({anomaly_pct:.2f}%)")
    
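    # Create target variable: high volatility 2 hours ahead (8 bars * 15 min = 120 min).
    # The threshold is the 75th percentile of volatility within this sample.
    df['future_volatility'] = df['volatility'].shift(-8)
    # Note: the last 8 rows have no look-ahead value; NaN > threshold evaluates to False,
    # so those rows are labeled 0 rather than left as NaN.
    df['high_volatility_target'] = (df['future_volatility'] > df['volatility'].quantile(0.75)).astype(int)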
    
    # Feature columns
    feature_cols = [
        'returns',
        'volatility',
        'rsi',
        'macd',
        'volume_ratio',
        'sentiment_mean',
        'sentiment_var',
        'sentiment_count',
        'sentiment_momentum',
        'sentiment_mean_anomaly'
    ]
    
    # Keep only the feature columns that are present in this sample
    final_features = [col for col in feature_cols if col in df.columns]
    
    print(f"Using {len(final_features)} features for modeling.")
    
    # Drop rows with NaN in any feature (the 0/1 target itself is never NaN)
    df_model = df[final_features + ['high_volatility_target']].dropna()
    
    print(f"Shape of the final modeling dataset for sample {i}: {df_model.shape}")
    
    # Save the enhanced dataset
    output_path = sampled_dir / f'features_sample_{i}.csv'
    df_model.to_csv(output_path, index=True)
    print(f"Enhanced dataset for sample {i} saved to: {output_path}")


--- Processing Sample 1 ---
Detecting sentiment anomalies...
Detected 28 sentiment anomalies (4.16%)
Using 10 features for modeling.
Shape of the final modeling dataset for sample 1: (672, 11)
Enhanced dataset for sample 1 saved to: ../data/processed/sampled/features_sample_1.csv

--- Processing Sample 2 ---
Detecting sentiment anomalies...
Detected 25 sentiment anomalies (3.71%)
Using 10 features for modeling.
Shape of the final modeling dataset for sample 2: (404, 11)
Enhanced dataset for sample 2 saved to: ../data/processed/sampled/features_sample_2.csv

--- Processing Sample 3 ---
Detecting sentiment anomalies...
Detected 26 sentiment anomalies (3.86%)
Using 10 features for modeling.
Shape of the final modeling dataset for sample 3: (672, 11)
Enhanced dataset for sample 3 saved to: ../data/processed/sampled/features_sample_3.csv

--- Processing Sample 4 ---
Detecting sentiment anomalies...
Detected 32 sentiment anomalies (4.75%)
Using 10 features for modeling.
Shape of the final modeling dataset for sample 4: (672, 11)
Enhanced dataset for sample 4 saved to: ../data/processed/sampled/features_sample_4.csv

--- Processing Sample 5 ---
Detecting sentiment anomalies...
Detected 30 sentiment anomalies (4.46%)
Using 10 features for modeling.
Shape of the final modeling dataset for sample 5: (672, 11)
Enhanced dataset for sample 5 saved to: ../data/processed/sampled/features_sample_5.csv
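
The target in the loop above is built by shifting `volatility` by 8 bars and comparing it with a quantile. One detail worth checking is what happens to the final 8 rows, which have no look-ahead value. The sketch below uses a small made-up series (the values and the names `horizon`, `naive_target`, and `masked_target` are assumptions for illustration) to compare the plain comparison with a variant that keeps those rows as NaN so that `dropna()` removes them. It is one possible alternative, not the method used for the saved files.

```python
import numpy as np
import pandas as pd

# Hypothetical volatility series on a 15-minute grid (values are made up)
index = pd.date_range("2024-01-01", periods=20, freq="15min")
df = pd.DataFrame({"volatility": np.linspace(0.01, 0.20, 20)}, index=index)

horizon = 8  # 8 * 15 min = 2 hours ahead
df["future_volatility"] = df["volatility"].shift(-horizon)
threshold = df["volatility"].quantile(0.75)

# Plain comparison: NaN > threshold is False, so the last rows become 0
naive_target = (df["future_volatility"] > threshold).astype(int)

# Masked variant: keep NaN wherever the look-ahead value is unknown
masked_target = (df["future_volatility"] > threshold).astype(float)
masked_target = masked_target.where(df["future_volatility"].notna())

df["high_volatility_target"] = masked_target
df_model = df[["volatility", "high_volatility_target"]].dropna()

print("Rows before:", len(df), "rows after dropna:", len(df_model))
```

With the masked variant, rows that cannot be labeled are excluded from the modeling set instead of being counted as low-volatility examples.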
