# 03 - Anomaly Detection

This notebook demonstrates **Anomaly Detection** - finding unusual patterns in market-event relationships.

---

## What is Anomaly Detection?

Anomaly detection (also called **outlier detection**) identifies data points that deviate significantly from the expected pattern. In finance, this is crucial for:

- **Risk management**: Detecting unusual market moves before they escalate
- **Alpha generation**: Finding mispricings or overreactions
- **Fraud detection**: Identifying suspicious trading patterns
- **Market surveillance**: Regulators use it to spot manipulation

---

## Two Types of Anomalies We Detect

| Type | Description | Example | Possible Cause |
|------|-------------|---------|----------------|
| **Unexplained Move** | Big market move, no major event | Oil jumps 5%, no news | Insider trading, unreported event |
| **Muted Response** | Major event, small market reaction | War breaks out, oil flat | Already priced in, market skepticism |

---

## Methods Used

1. **Z-score analysis** (statistical threshold) - Simple, interpretable
2. **Isolation Forest** (machine learning) - Robust, handles multivariate data
3. **Event-return mismatch** (domain knowledge) - Finance-specific logic

---

We'll compare our **learning version** (Z-score based) with the **production version** (Isolation Forest combined with Z-score and event-mismatch checks).

In [None]:
# ═══════════════════════════════════════════════════════════════════════════════
# IMPORTS
# ═══════════════════════════════════════════════════════════════════════════════
#
# Standard imports for data analysis notebooks:
#   - sys/pathlib: For path manipulation and project imports
#   - datetime: For date handling
#   - pandas: Data manipulation
#   - numpy: Numerical operations
#   - matplotlib/seaborn: Visualization
#
# ═══════════════════════════════════════════════════════════════════════════════

import sys
from pathlib import Path
from datetime import date, timedelta

# Add project root to Python path for imports
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Use a clean, professional style
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

print("Imports successful!")

## 1. Understanding Anomalies

### Visual Intuition

Before diving into algorithms, let's build intuition for what anomalies look like:

1. **Unexplained Move**: A return that's way outside the normal distribution
2. **Muted Response**: An event severity that doesn't match the return magnitude
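
The two definitions above boil down to a small decision rule. Here is a minimal sketch of it, assuming the default thresholds used later in this notebook (2.0 for the Z-score, 5.0 for the Goldstein score). The function name `classify_day` and its arguments are hypothetical, chosen for illustration; this is not the project's `AnomalyDetector` code.

```python
from typing import Optional


def classify_day(
    z_score: float,
    goldstein: float,
    z_threshold: float = 2.0,          # assumed default, as configured later in the notebook
    goldstein_threshold: float = 5.0,  # assumed default, as configured later in the notebook
) -> Optional[str]:
    """Label one day from its return Z-score and its most extreme event score."""
    big_move = abs(z_score) > z_threshold
    big_event = abs(goldstein) > goldstein_threshold

    if big_move and not big_event:
        return "unexplained_move"   # market moved, nothing explains it
    if big_event and not big_move:
        return "muted_response"     # major event, market shrugged
    return None                     # quiet day, or a big move with a matching event


print(classify_day(z_score=4.0, goldstein=0.0))    # the day-15 example
print(classify_day(z_score=0.1, goldstein=-8.0))   # the day-3 example
print(classify_day(z_score=3.0, goldstein=-8.0))   # big move AND big event
```

Note that a large move on a day with a large event returns `None`: the event explains the move, so neither anomaly type applies.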

In [None]:
# ═══════════════════════════════════════════════════════════════════════════════
# VISUAL EXPLANATION OF ANOMALY TYPES
# ═══════════════════════════════════════════════════════════════════════════════
#
# LEFT PLOT - Unexplained Move:
#   - Shows daily returns as a bar chart
#   - Most returns cluster around 0 (normal)
#   - One return at day 15 is a HUGE outlier (4 std devs!)
#   - Red bars = outside 2 standard deviations
#   - Question: Why did the market move so much? No event explains it!
#
# RIGHT PLOT - Muted Response:
#   - Shows event severity (Goldstein scale) vs market returns
#   - Day 3 has a MAJOR negative event (Goldstein = -8)
#   - But the market barely moved (return ≈ 0.1%)
#   - This is suspicious - major events should move markets!
#
# ═══════════════════════════════════════════════════════════════════════════════

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ─── LEFT: Unexplained Move ───
np.random.seed(42)  # For reproducibility
returns = np.random.normal(0, 1, 30)  # Normal returns
returns[15] = 4  # Inject an outlier at day 15

ax = axes[0]
# Color bars based on whether they exceed 2 standard deviations
colors = ['red' if abs(r) > 2 else 'steelblue' for r in returns]
ax.bar(range(30), returns, color=colors, alpha=0.7)

# Add threshold lines
ax.axhline(y=2, color='red', linestyle='--', label='2 std devs (95% threshold)')
ax.axhline(y=-2, color='red', linestyle='--')
ax.axhline(y=0, color='black', linestyle='-')

# Annotate the anomaly
ax.annotate('UNEXPLAINED\nMOVE!', xy=(15, 4), xytext=(20, 3.5),
            arrowprops=dict(arrowstyle='->', color='red'),
            fontsize=10, color='red', fontweight='bold')

ax.set_xlabel('Day')
ax.set_ylabel('Return (Z-score normalized)')
ax.set_title('Unexplained Move\n(Big return with no corresponding event)')
ax.legend()

# ─── RIGHT: Muted Response ───
ax = axes[1]

# Simulated data: Major negative event on day 3, but tiny market reaction
event_severity = [0, 0, 0, -8, 0, 0, 0]  # Goldstein scale: -10 to +10
market_returns = [0.5, -0.3, 0.2, 0.1, -0.4, 0.3, -0.2]  # Daily returns (%)

x = np.arange(7)
width = 0.35

# Plot both series as grouped bars
bars1 = ax.bar(x - width/2, event_severity, width, 
               label='Event Severity (Goldstein)', color='red', alpha=0.7)
bars2 = ax.bar(x + width/2, [r * 10 for r in market_returns], width, 
               label='Market Return (scaled ×10)', color='green', alpha=0.7)

ax.axhline(y=0, color='black', linestyle='-')

# Annotate the mismatch
ax.annotate('Major event\nbut tiny\nmarket move!', xy=(3, -8), xytext=(5, -6),
            arrowprops=dict(arrowstyle='->', color='red'),
            fontsize=10, color='red', fontweight='bold')

ax.set_xlabel('Day')
ax.set_ylabel('Magnitude')
ax.set_title('Muted Response\n(Big event but small market reaction)')
ax.legend()
ax.set_xticks(x)
ax.set_xticklabels([f'Day {i}' for i in x])

plt.tight_layout()
plt.show()

print("\nKey Insight:")
print("─" * 50)
print("• Unexplained Move: |return| > 2σ with no significant event")
print("• Muted Response: |event| > threshold but |return| < threshold")
print("\nBoth suggest something unusual is happening!")

## 2. Learning Version: Z-Score Based Detection

### The Z-Score Approach

The simplest anomaly detection uses **Z-scores** (standard scores):

$$Z = \frac{X - \mu}{\sigma}$$

Where:
- X = observed value
- μ = mean (from rolling window)
- σ = standard deviation (from rolling window)

**Interpretation:**
- |Z| > 2 means the value is outside ~95% of normal observations
- |Z| > 3 means the value is outside ~99.7% (extremely rare)
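
The 95% and 99.7% figures come from the normal distribution, and we can check them directly. This short sketch uses only the standard library; the helper name `two_sided_tail` is ours. It relies on the identity P(|Z| > z) = erfc(z / √2) for a standard normal variable.

```python
import math


def two_sided_tail(z: float) -> float:
    """P(|Z| > z) for a standard normal variable."""
    return math.erfc(z / math.sqrt(2))


for z in (2.0, 3.0):
    p = two_sided_tail(z)
    print(f"|Z| > {z}: {p:.2%} of observations, roughly 1 day in {1 / p:.0f}")
```

Under normality, |Z| > 2 happens on about 4.6% of days and |Z| > 3 on about 0.27%. Real returns have fatter tails, so expect these thresholds to fire more often in practice.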

### Why Use a Rolling Window?

Markets change over time - volatility in 2020 was very different from 2019. Using a **rolling window** (e.g., last 30 days) adapts to current market conditions.
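
To make the rolling idea concrete, here is a minimal standard-library sketch on synthetic returns. It is not the project's detector: `rolling_zscores` is a hypothetical helper, the data is simulated, and we assume the window includes the current day (the same convention as a pandas `rolling(30)` call, with the sample standard deviation).

```python
import random
import statistics


def rolling_zscores(returns, window=30):
    """Z-score of each return against the trailing `window` observations.

    The window includes the current observation. Positions without a full
    window of history get None.
    """
    scores = []
    for i in range(len(returns)):
        if i + 1 < window:
            scores.append(None)  # not enough history yet
            continue
        recent = returns[i + 1 - window : i + 1]
        mu = statistics.mean(recent)
        sigma = statistics.stdev(recent)  # sample std (n - 1 denominator)
        scores.append((returns[i] - mu) / sigma if sigma > 0 else None)
    return scores


# Synthetic daily returns (~1% volatility) with one injected 8% jump
rng = random.Random(0)
returns = [rng.gauss(0, 0.01) for _ in range(120)]
returns[100] = 0.08

z = rolling_zscores(returns, window=30)
flagged = [i for i, s in enumerate(z) if s is not None and abs(s) > 2]
print(f"Z-score on the injected day: {z[100]:.2f}")
print(f"Days with |Z| > 2: {flagged}")
```

One design detail worth noticing: because the jump sits inside its own window, it inflates the very standard deviation it is measured against. Excluding the current day from the window would give a larger Z-score for the same move.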

In [None]:
# ═══════════════════════════════════════════════════════════════════════════════
# LEARNING VERSION: AnomalyDetector
# ═══════════════════════════════════════════════════════════════════════════════
#
# Our educational implementation uses three parameters:
#
# 1. zscore_threshold (default 2.0):
#    - Returns with |Z| > threshold are flagged as anomalies
#    - 2.0 = ~5% of data (1 in 20 days would be "anomalous" by chance)
#    - 3.0 = ~0.3% of data (more conservative)
#
# 2. goldstein_threshold (default 5.0):
#    - Events with |Goldstein| > threshold are considered "significant"
#    - Goldstein scale: -10 (extreme conflict) to +10 (extreme cooperation)
#    - |5| captures major events like conflicts, treaties, sanctions
#
# 3. lookback_days (default 30):
#    - Rolling window for calculating mean and std
#    - 30 rows of daily data ≈ six trading weeks (a calendar month has ~21 trading days)
#    - Shorter = more responsive, noisier
#    - Longer = more stable, slower to adapt
#
# ═══════════════════════════════════════════════════════════════════════════════

from src.analysis.anomaly_detection import AnomalyDetector, explain_anomaly

# Create detector with explicit parameters
detector = AnomalyDetector(
    zscore_threshold=2.0,      # Flag returns > 2 standard deviations
    goldstein_threshold=5.0,   # Major events have |Goldstein| > 5
    lookback_days=30,          # Use 30-day rolling window
)

print("Anomaly Detector Configuration")
print("=" * 50)
print(f"  Z-score threshold: {detector.zscore_threshold}")
print(f"  Goldstein threshold: {detector.goldstein_threshold}")
print(f"  Lookback window: {detector.lookback_days} days")
print()
print("What this means:")
print(f"  - 'Unexplained Move': Return where |Z| > {detector.zscore_threshold} but no event > {detector.goldstein_threshold}")
print(f"  - 'Muted Response': Event |Goldstein| > {detector.goldstein_threshold} but |Z| < {detector.zscore_threshold}")

In [None]:
# ═══════════════════════════════════════════════════════════════════════════════
# DETECTING UNEXPLAINED MOVES
# ═══════════════════════════════════════════════════════════════════════════════
#
# The detect_unexplained_moves() method:
#   1. Fetches market data for the symbol and date range
#   2. Calculates rolling Z-scores for each day's return
#   3. For days where |Z| > threshold, checks if a significant event occurred
#   4. Returns anomalies where the market moved but no event explains it
#
# These are interesting because they suggest:
#   - Unreported news (insider information?)
#   - Technical trading (momentum, stop-losses cascading)
#   - Sentiment shift not captured by GDELT
#
# ═══════════════════════════════════════════════════════════════════════════════

# Search last 60 days for anomalies
end_date = date.today()
start_date = end_date - timedelta(days=60)

print("Searching for unexplained moves in CL=F (Crude Oil)")
print(f"Date range: {start_date} to {end_date}")
print("=" * 60)

unexplained = detector.detect_unexplained_moves('CL=F', start_date, end_date)

if unexplained:
    print(f"\nFound {len(unexplained)} unexplained moves:\n")
    print(f"{'Date':<12} {'Return':>10} {'Z-score':>10} {'Interpretation'}")
    print("─" * 60)
    
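    # Show the 5 largest moves by |Z| (sort explicitly so "top 5" holds regardless of list order)
    for anomaly in sorted(unexplained, key=lambda a: abs(a.z_score), reverse=True)[:5]: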
        direction = '↑' if anomaly.actual_return > 0 else '↓'
        severity = 'EXTREME' if abs(anomaly.z_score) > 3 else 'Notable'
        print(f"{anomaly.date}   {anomaly.actual_return*100:>+8.2f}%   {anomaly.z_score:>+8.2f}   {direction} {severity}")
else:
    print("\nNo unexplained moves detected.")
    print("This could mean:")
    print("  - All large moves had corresponding events (good!)")
    print("  - Insufficient data (check ingestion)")
    print("  - Market was calm during this period")

In [None]:
# ═══════════════════════════════════════════════════════════════════════════════
# DETAILED ANOMALY EXPLANATION
# ═══════════════════════════════════════════════════════════════════════════════
#
# The explain_anomaly() function provides a human-readable summary:
#   - The date and return
#   - The Z-score and what it means
#   - What events (if any) occurred that day
#   - Why it's classified as an anomaly
#
# This is useful for:
#   - Portfolio managers investigating unusual P&L
#   - Risk managers understanding exposure
#   - Analysts researching market behavior
#
# ═══════════════════════════════════════════════════════════════════════════════

if unexplained:
    print("Detailed Explanation of First Anomaly")
    print("=" * 50)
    print(explain_anomaly(unexplained[0]))
else:
    print("No anomalies to explain.")

## 3. Production Version: Isolation Forest

### Why Machine Learning?

Z-score detection has limitations:
- Only looks at one feature (return magnitude)
- Assumes normal distribution (returns have fat tails)
- Can't capture complex patterns
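
The fat-tails point is easy to see in a simulation. The sketch below compares how often |Z| > 3 occurs in normal data versus a fat-tailed Student-t sample with 3 degrees of freedom. Everything here is illustrative: the data is simulated (not market data), and `exceed_rate` and `student_t` are hypothetical helpers.

```python
import random
import statistics

rng = random.Random(1)


def exceed_rate(samples, k=3.0):
    """Fraction of samples more than k sample standard deviations from the mean."""
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    return sum(abs((x - mu) / sigma) > k for x in samples) / len(samples)


def student_t(df):
    """One Student-t draw, built as a normal divided by sqrt(chi-square / df)."""
    z = rng.gauss(0, 1)
    chi2 = sum(rng.gauss(0, 1) ** 2 for _ in range(df))
    return z / (chi2 / df) ** 0.5


n = 10_000
normal_sample = [rng.gauss(0, 1) for _ in range(n)]
fat_tailed_sample = [student_t(3) for _ in range(n)]

print(f"Normal data:     {exceed_rate(normal_sample):.2%} of points beyond 3 sigma")
print(f"Fat-tailed data: {exceed_rate(fat_tailed_sample):.2%} of points beyond 3 sigma")
```

The fat-tailed sample produces noticeably more 3-sigma events than the ~0.27% a normal model predicts, which is why a fixed Z-score threshold tends to over-flag on real returns.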

**Isolation Forest** is an unsupervised ML algorithm designed for anomaly detection:

### How Isolation Forest Works

1. **Random partitioning**: Randomly select a feature and a split point within that feature's range
2. **Recursive splitting**: Keep splitting until each point is isolated (or a depth limit is reached)
3. **Path length**: Count how many splits it took to isolate each point
4. **Anomaly score**: Repeat over many random trees and average the path lengths; points isolated in FEWER splits on average are anomalies

**Intuition**: Anomalies are "few and different" - they're easier to isolate from the crowd.
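
How does a path length become a score? The original Isolation Forest formulation normalizes the average path length $E[h(x)]$ by $c(n)$, the expected path length for $n$ points, and maps it to $s = 2^{-E[h(x)]/c(n)}$. Here is a small sketch of that mapping. The function names are ours, and this is the textbook formula rather than a copy of any library's internals.

```python
import math

EULER_GAMMA = 0.5772156649


def average_path_length(n: int) -> float:
    """c(n): expected path length of an unsuccessful search in a binary tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of the harmonic number H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_depth: float, n: int) -> float:
    """Score in (0, 1]: near 1 = anomalous, around 0.5 = average, lower = normal."""
    return 2.0 ** (-mean_depth / average_path_length(n))


n = 256  # points per tree
for depth in (2.0, 6.0, average_path_length(n), 14.0):
    print(f"mean depth {depth:5.2f} -> score {anomaly_score(depth, n):.3f}")
```

A point whose average depth equals $c(n)$ scores exactly 0.5. Be aware that scikit-learn's `decision_function` flips and shifts this scale, so there *lower* values mean more anomalous.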

### Key Parameter: Contamination

The `contamination` parameter tells the algorithm what fraction of data is expected to be anomalous:
- `0.05` = expect 5% of data to be anomalies
- Lower = more conservative (fewer anomalies flagged)
- Higher = more aggressive (more anomalies flagged)
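
It helps to see that contamination does not change the scores themselves, only where the cutoff falls. The sketch below is a simplified, rank-based illustration with made-up scores. `flag_by_contamination` is a hypothetical helper; scikit-learn derives its cutoff from a percentile of the training scores rather than this exact rule.

```python
def flag_by_contamination(scores, contamination):
    """Flag the lowest-scoring fraction of points (lower score = more anomalous)."""
    n_flag = int(round(contamination * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # most anomalous first
    flagged = set(order[:n_flag])
    return [i in flagged for i in range(len(scores))]


# 18 ordinary scores plus two low (suspicious) ones at positions 18 and 19
scores = [0.10 + 0.01 * i for i in range(18)] + [-0.20, -0.05]

for contamination in (0.05, 0.10, 0.25):
    flags = flag_by_contamination(scores, contamination)
    hits = [i for i, is_flagged in enumerate(flags) if is_flagged]
    print(f"contamination={contamination:.2f} -> flagged positions {hits}")
```

Raising contamination past the number of genuinely odd points starts flagging ordinary ones, which is where false positives come from.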

In [None]:
# ═══════════════════════════════════════════════════════════════════════════════
# PRODUCTION VERSION: Quick Analysis
# ═══════════════════════════════════════════════════════════════════════════════
#
# The run_quick_anomaly_detection() function is a convenience wrapper:
#   1. Creates a ProductionAnomalyDetector with default settings
#   2. Runs all detection methods (Isolation Forest + Z-score + event mismatch)
#   3. Returns a formatted summary report
#
# This is the "just give me the answer" approach for quick checks.
#
# ═══════════════════════════════════════════════════════════════════════════════

from src.analysis.production_anomaly import ProductionAnomalyDetector, run_quick_anomaly_detection

# Quick one-liner for rapid analysis
report = run_quick_anomaly_detection('CL=F', days=60)

if report:
    print("Quick Anomaly Detection Report")
    print("=" * 50)
    print(report)
else:
    print("Insufficient data for anomaly detection.")
    print("Ensure the database is populated with market data.")

In [None]:
# ═══════════════════════════════════════════════════════════════════════════════
# PRODUCTION VERSION: Full Analysis
# ═══════════════════════════════════════════════════════════════════════════════
#
# For more control, instantiate ProductionAnomalyDetector directly.
#
# Key parameters:
#   - contamination: Expected fraction of anomalies (default 0.05 = 5%)
#   - zscore_threshold: For Z-score based detection (default 2.0)
#   - random_state: For reproducibility
#
# The detect_all() method combines three detection approaches:
#   1. Isolation Forest (ML-based, multivariate)
#   2. Z-score (statistical, univariate)
#   3. Event mismatch (domain-specific)
#
# Each anomaly is tagged with which method(s) detected it.
#
# ═══════════════════════════════════════════════════════════════════════════════

prod_detector = ProductionAnomalyDetector(
    contamination=0.05,    # Expect ~5% of data to be anomalous
    zscore_threshold=2.0,  # Also flag Z-score outliers
)

# Run full detection
anomalies = prod_detector.detect_all('CL=F', start_date, end_date)
report = prod_detector.get_anomaly_report(anomalies, 'CL=F', start_date, end_date)

print("Production Detector Results")
print("=" * 50)
print(f"  Total anomalies detected: {report.anomaly_count}")
print(f"  Anomaly rate: {report.anomaly_rate*100:.1f}%")
print()
print("Breakdown by type:")
print(f"  Unexplained moves: {report.unexplained_moves}")
print(f"  Muted responses: {report.muted_responses}")
print(f"  Statistical outliers: {report.statistical_outliers}")
print()
print("Interpretation:")
print(f"  - {report.anomaly_rate*100:.1f}% of days were anomalous")
print(f"  - {'Higher' if report.anomaly_rate > 0.05 else 'Lower'} than expected 5%")

In [None]:
# ═══════════════════════════════════════════════════════════════════════════════
# DETECTION METHOD BREAKDOWN
# ═══════════════════════════════════════════════════════════════════════════════
#
# Each anomaly stores which detection method(s) flagged it:
#   - 'isolation_forest': ML algorithm found it unusual
#   - 'zscore': Statistical outlier (|Z| > threshold)
#   - 'event_mismatch': Domain logic (event-return mismatch)
#
# Anomalies flagged by MULTIPLE methods are more reliable.
#
# The anomaly_probability field (0-1) indicates confidence:
#   - Higher = more likely to be a true anomaly
#   - Based on Isolation Forest's decision function
#
# ═══════════════════════════════════════════════════════════════════════════════

if anomalies:
    print("Anomaly Detection Methods")
    print("=" * 60)
    print(f"{'Date':<12} {'Type':<20} {'Return':>10} {'Prob':>8} {'Methods'}")
    print("─" * 60)
    
    for a in anomalies[:10]:  # Show first 10
        methods = ', '.join(a.detected_by)
        print(f"{a.date}   {a.anomaly_type:<20} {a.actual_return*100:>+8.2f}%   {a.anomaly_probability:>6.2f}   {methods}")
    
    print("─" * 60)
    print("\nNote: Anomalies flagged by multiple methods are more reliable.")
else:
    print("No anomalies detected.")

## 4. Visualizing Anomalies

Visualization is crucial for:
- Validating that detections make sense
- Communicating findings to stakeholders
- Building intuition about market behavior

In [None]:
# ═══════════════════════════════════════════════════════════════════════════════
# FETCH MARKET DATA FOR VISUALIZATION
# ═══════════════════════════════════════════════════════════════════════════════
#
# We need the full time series to visualize anomalies in context.
#
# The visualization will show:
#   - Price chart with anomaly points highlighted
#   - Return chart with Z-score bands
#
# ═══════════════════════════════════════════════════════════════════════════════

from src.db.queries import get_market_data
from src.db.connection import get_session

with get_session() as session:
    data = get_market_data(session, 'CL=F', start_date, end_date)
    
    if data:
        # Convert to DataFrame
        market_df = pd.DataFrame([
            {
                'date': d.date,
                'close': float(d.close),
                'return': d.log_return,
            }
            for d in data
        ]).dropna()
        
        # Calculate Z-scores using a 30-row rolling window
        # This matches what the detector does internally
        # Note: the first 29 rows get NaN because the window is not yet full
        market_df['z_score'] = (
            (market_df['return'] - market_df['return'].rolling(30).mean()) /
            market_df['return'].rolling(30).std()
        )
        
        print(f"Loaded {len(market_df)} days of data for CL=F")
        print(f"Date range: {market_df['date'].min()} to {market_df['date'].max()}")
        print(f"Price range: ${market_df['close'].min():.2f} - ${market_df['close'].max():.2f}")
    else:
        print("No data available. Please run the ingestion scripts.")
        market_df = None

In [None]:
# ═══════════════════════════════════════════════════════════════════════════════
# PRICE AND RETURN CHART WITH ANOMALIES
# ═══════════════════════════════════════════════════════════════════════════════
#
# TOP PANEL - Price Chart:
#   - Shows the oil price over time
#   - Red dots mark days that were flagged as anomalies
#   - Helps you see: "What happened to price on anomaly days?"
#
# BOTTOM PANEL - Return Chart:
#   - Bar chart of daily returns
#   - Red bars = anomaly days
#   - Dashed lines show ±2 standard deviation bands
#   - Returns outside the bands are statistical outliers
#
# ═══════════════════════════════════════════════════════════════════════════════

if market_df is not None and anomalies:
    fig, axes = plt.subplots(2, 1, figsize=(14, 10), sharex=True)
    
    # Get set of anomaly dates for quick lookup
    anomaly_dates = set(a.date for a in anomalies)
    
    # ─── TOP: Price Chart ───
    ax = axes[0]
    ax.plot(market_df['date'], market_df['close'], linewidth=2, color='steelblue', label='Price')
    
    # Mark anomalies on price chart
    for _, row in market_df.iterrows():
        if row['date'] in anomaly_dates:
            ax.scatter(row['date'], row['close'], color='red', s=100, zorder=5, label='_nolegend_')
    
    # Add a single legend entry for anomalies
    ax.scatter([], [], color='red', s=100, label='Anomaly')
    
    ax.set_ylabel('Price ($)', fontsize=12)
    ax.set_title('Oil (CL=F) Price with Anomalies Highlighted', fontsize=14)
    ax.legend()
    
    # ─── BOTTOM: Return Chart ───
    ax = axes[1]
    
    # Color bars by anomaly status
    colors = ['red' if d in anomaly_dates else 'steelblue' for d in market_df['date']]
    ax.bar(market_df['date'], market_df['return'] * 100, color=colors, alpha=0.7)
    
    # Add Z-score threshold lines (approximate using overall std)
    std = market_df['return'].std() * 100
    ax.axhline(y=2*std, color='red', linestyle='--', alpha=0.5, label='±2σ threshold')
    ax.axhline(y=-2*std, color='red', linestyle='--', alpha=0.5)
    ax.axhline(y=0, color='black', linestyle='-')
    
    ax.set_xlabel('Date', fontsize=12)
    ax.set_ylabel('Daily Return (%)', fontsize=12)
    ax.set_title('Daily Returns (Red Bars = Anomaly Detected)', fontsize=14)
    ax.legend()
    
    plt.tight_layout()
    plt.show()
    
    print("\nChart Interpretation:")
    print("─" * 50)
    print("• Red dots/bars show days flagged as anomalies")
    print("• Returns outside dashed lines are >2σ from normal")
    print("• Look for clusters - multiple anomalies may indicate regime change")
else:
    print("Insufficient data for visualization.")

In [None]:
# ═══════════════════════════════════════════════════════════════════════════════
# ANOMALY STATISTICS
# ═══════════════════════════════════════════════════════════════════════════════
#
# LEFT PLOT - Probability Distribution:
#   - Histogram of anomaly probabilities
#   - Higher probability = more confident the point is anomalous
#   - Mass near the low end = most flagged points are borderline
#   - Mass near the high end = the detector is finding clear-cut outliers
#
# RIGHT PLOT - Anomaly Types:
#   - Pie chart showing breakdown by type
#   - Helps understand WHAT kinds of anomalies are occurring
#   - Useful for prioritizing investigation
#
# ═══════════════════════════════════════════════════════════════════════════════

if anomalies:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # ─── LEFT: Probability Distribution ───
    probs = [a.anomaly_probability for a in anomalies]
    
    axes[0].hist(probs, bins=20, edgecolor='black', alpha=0.7, color='red')
    axes[0].axvline(x=np.mean(probs), color='blue', linestyle='--', 
                   label=f'Mean: {np.mean(probs):.2f}')
    axes[0].set_xlabel('Anomaly Probability', fontsize=12)
    axes[0].set_ylabel('Count', fontsize=12)
    axes[0].set_title('Distribution of Anomaly Probabilities\n(Higher = More Confident)', fontsize=12)
    axes[0].legend()
    
    # ─── RIGHT: By Type ───
    type_counts = {}
    for a in anomalies:
        type_counts[a.anomaly_type] = type_counts.get(a.anomaly_type, 0) + 1
    
    colors_pie = ['#ff6b6b', '#ffa502', '#ffd93d']  # Red, Orange, Yellow
    axes[1].pie(
        list(type_counts.values()),       # pie() needs a sequence, not a dict view
        labels=list(type_counts.keys()),
        autopct='%1.1f%%',
        colors=colors_pie[:len(type_counts)],
        explode=[0.05] * len(type_counts),
        shadow=True,
    )
    axes[1].set_title('Anomalies by Type', fontsize=12)
    
    plt.tight_layout()
    plt.show()
    
    print("\nType Breakdown:")
    print("─" * 40)
    for atype, count in sorted(type_counts.items(), key=lambda x: -x[1]):
        pct = count / len(anomalies) * 100
        print(f"  {atype}: {count} ({pct:.1f}%)")
else:
    print("No anomalies to visualize.")

## 5. Cross-Market Comparison

Different markets have different anomaly patterns:
- **Commodities** (oil, gold): Sensitive to geopolitical events
- **Equities** (SPY): Broader economic factors
- **VIX**: Tends to move inversely to equity prices (spikes during fear)
- **FX** (EURUSD): Policy and interest rate driven

Comparing anomaly rates across markets can reveal:
- Which markets are "noisier"
- Whether events affect markets differently
- Potential diversification opportunities

In [None]:
# ═══════════════════════════════════════════════════════════════════════════════
# CROSS-MARKET ANOMALY COMPARISON
# ═══════════════════════════════════════════════════════════════════════════════
#
# The compare_symbols() method:
#   1. Runs anomaly detection on each symbol
#   2. Returns a DataFrame with counts by type
#   3. Enables direct comparison across markets
#
# Symbols analyzed:
#   - CL=F: Crude Oil Futures
#   - GC=F: Gold Futures (safe haven)
#   - SPY: S&P 500 ETF (broad US equities)
#   - ^VIX: Volatility Index (fear gauge)
#   - EURUSD=X: Euro/USD exchange rate
#
# ═══════════════════════════════════════════════════════════════════════════════

symbols = ['CL=F', 'GC=F', 'SPY', '^VIX', 'EURUSD=X']

print(f"Comparing anomaly rates across {len(symbols)} markets")
print(f"Date range: {start_date} to {end_date}")
print("=" * 60)

comparison = prod_detector.compare_symbols(symbols, start_date, end_date)

if not comparison.empty:
    print("\nAnomaly Comparison Across Markets")
    print("─" * 60)
    display(comparison)
else:
    print("Insufficient data for comparison.")
    print("Ensure data is ingested for all symbols.")

In [None]:
# ═══════════════════════════════════════════════════════════════════════════════
# VISUALIZATION: GROUPED BAR CHART
# ═══════════════════════════════════════════════════════════════════════════════
#
# This chart shows anomaly counts by type for each market.
#
# Visual encoding:
#   - X-axis: Different markets
#   - Y-axis: Number of anomalies
#   - Colors: Different anomaly types
#
# What to look for:
#   - Which markets have the most anomalies?
#   - Which types dominate for each market?
#   - Are there patterns (e.g., VIX has more unexplained moves)?
#
# ═══════════════════════════════════════════════════════════════════════════════

if not comparison.empty:
    fig, ax = plt.subplots(figsize=(12, 6))
    
    x = np.arange(len(comparison))
    width = 0.25
    
    # Create grouped bars
    bars1 = ax.bar(x - width, comparison['unexplained_moves'], width, 
                   label='Unexplained Moves', color='#ff6b6b', alpha=0.8)
    bars2 = ax.bar(x, comparison['muted_responses'], width, 
                   label='Muted Responses', color='#ffa502', alpha=0.8)
    bars3 = ax.bar(x + width, comparison['statistical_outliers'], width, 
                   label='Statistical Outliers', color='#ffd93d', alpha=0.8)
    
    ax.set_xlabel('Symbol', fontsize=12)
    ax.set_ylabel('Count', fontsize=12)
    ax.set_title('Anomaly Types by Market', fontsize=14)
    ax.set_xticks(x)
    ax.set_xticklabels(comparison['symbol'])
    ax.legend()
    
    # Add value labels on bars
    for bars in [bars1, bars2, bars3]:
        for bar in bars:
            height = bar.get_height()
            if height > 0:
                ax.annotate(f'{int(height)}',
                           xy=(bar.get_x() + bar.get_width() / 2, height),
                           ha='center', va='bottom', fontsize=9)
    
    plt.tight_layout()
    plt.show()
    
    print("\nInsights:")
    print("─" * 50)
    most_anomalies = comparison.loc[comparison['anomaly_count'].idxmax()]
    least_anomalies = comparison.loc[comparison['anomaly_count'].idxmin()]
    print(f"• Most anomalous market: {most_anomalies['symbol']} ({most_anomalies['anomaly_count']} anomalies)")
    print(f"• Least anomalous market: {least_anomalies['symbol']} ({least_anomalies['anomaly_count']} anomalies)")

## 6. How Isolation Forest Works

Let's visualize the Isolation Forest algorithm to build intuition.

### The Key Insight

**Anomalies are "few and different"** - they're easier to isolate from the crowd.

Imagine a crowd of people standing together, with a few standing far away:
- Normal points: Need many random cuts to isolate each person
- Anomalies: Just a few cuts separate the loners
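
We can act out the crowd analogy in one dimension. The sketch below counts how many random cuts it takes to isolate a "loner" versus a typical member of the crowd. It is a toy illustration using only the standard library, with simulated data and hypothetical helper names; a real Isolation Forest works on several features and on subsamples of the data.

```python
import random


def isolation_depth(point, values, rng, max_depth=64):
    """Number of random cuts needed until `point` is alone on its side."""
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:
            break  # identical values cannot be separated
        split = rng.uniform(lo, hi)
        # Keep only the side of the cut that still contains our point
        if point < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth


def mean_depth(point, values, n_trials=200, seed=0):
    """Average isolation depth over many independent random 'trees'."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, values, rng) for _ in range(n_trials)) / n_trials


data_rng = random.Random(42)
crowd = [data_rng.gauss(0, 1) for _ in range(100)]  # people standing together
loner = 8.0                                         # someone standing far away
values = crowd + [loner]
typical = sorted(crowd)[len(crowd) // 2]            # a person in the middle of the crowd

print(f"Average cuts to isolate the loner:          {mean_depth(loner, values):.1f}")
print(f"Average cuts to isolate a typical person:   {mean_depth(typical, values):.1f}")
```

The loner is usually separated within a couple of cuts, because most of the range between the crowd and the loner is empty space where any cut isolates them immediately.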

In [None]:
# ═══════════════════════════════════════════════════════════════════════════════
# ISOLATION FOREST VISUALIZATION
# ═══════════════════════════════════════════════════════════════════════════════
#
# We create synthetic data to demonstrate how Isolation Forest works:
#   - 100 normal points clustered around the origin
#   - 3 anomaly points far from the cluster
#
# LEFT PLOT - Predictions:
#   - Blue points = normal (inliers)
#   - Red points = anomalies (outliers)
#   - With contamination=0.03, the three distant points should be the ones flagged
#
# RIGHT PLOT - Anomaly Scores:
#   - Colormap shows the decision function
#   - Lower scores (red) = more anomalous
#   - Higher scores (green) = more normal
#
# ═══════════════════════════════════════════════════════════════════════════════

from sklearn.ensemble import IsolationForest

# Generate synthetic data
np.random.seed(42)

# Normal data: cluster around origin
normal_data = np.random.randn(100, 2)

# Anomalies: far from the cluster
anomalies_data = np.array([
    [4, 4],    # Top-right corner
    [-4, -4],  # Bottom-left corner  
    [4, -4]    # Bottom-right corner
])

# Combine
data = np.vstack([normal_data, anomalies_data])

# Fit Isolation Forest
iso_forest = IsolationForest(
    contamination=0.03,  # Expect ~3% anomalies (3 out of 103)
    random_state=42
)
predictions = iso_forest.fit_predict(data)
scores = iso_forest.decision_function(data)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ─── LEFT: Predictions ───
ax = axes[0]
colors = ['red' if p == -1 else 'steelblue' for p in predictions]
ax.scatter(data[:, 0], data[:, 1], c=colors, alpha=0.6, s=50)

# Label the groups
ax.scatter([], [], c='steelblue', label='Normal (predicted)', s=50)
ax.scatter([], [], c='red', label='Anomaly (predicted)', s=50)

ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_title('Isolation Forest: Predictions\n(Red = Detected as Anomaly)')
ax.legend()
ax.set_xlim(-6, 6)
ax.set_ylim(-6, 6)

# ─── RIGHT: Decision Function (Anomaly Scores) ───
ax = axes[1]
scatter = ax.scatter(data[:, 0], data[:, 1], c=scores, cmap='RdYlGn', alpha=0.6, s=50)
plt.colorbar(scatter, ax=ax, label='Anomaly Score\n(lower = more anomalous)')

ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_title('Isolation Forest: Anomaly Scores\n(Red = Low Score = Anomalous)')
ax.set_xlim(-6, 6)
ax.set_ylim(-6, 6)

plt.tight_layout()
plt.show()

# Explain the algorithm
print("\nHow Isolation Forest Works:")
print("═" * 60)
print("")
print("1. RANDOM PARTITIONING:")
print("   → Randomly select a feature (e.g., Feature 1)")
print("   → Randomly select a split point (e.g., x = 2.5)")
print("   → Split data into left and right partitions")
print("")
print("2. RECURSIVE SPLITTING:")
print("   → Keep splitting until each point is isolated")
print("   → This creates a 'path' for each point")
print("")
print("3. PATH LENGTH → ANOMALY SCORE (averaged over many random trees):")
print("   → Anomalies are isolated in FEWER splits (short path)")
print("   → Normal points need MORE splits (long path)")
print("")
print("4. INTUITION:")
print("   → Anomalies are 'few and different'")
print("   → They stand out, so random cuts find them quickly!")

## Summary

**Anomaly Detection** finds unusual market-event patterns that warrant investigation.

---

### Anomaly Types

| Type | Definition | Possible Cause | Action |
|------|------------|----------------|--------|
| **Unexplained Move** | Large return, no event | Unreported news, technical trading | Investigate news sources |
| **Muted Response** | Large event, small return | Already priced in, market skepticism | Review positioning |
| **Statistical Outlier** | Detected by ML | General unusual behavior | Monitor closely |

---

### Detection Methods

| Method | Pros | Cons |
|--------|------|------|
| **Z-score** | Simple, interpretable | Assumes normality, univariate |
| **Isolation Forest** | Handles multivariate, robust | Black box, needs tuning |
| **Event Mismatch** | Domain-specific, meaningful | Requires event data |

---

### Two Implementations

| Version | Class | Use Case |
|---------|-------|----------|
| **Learning** | `AnomalyDetector` | Learning how Z-scores work, interview prep |
| **Production** | `ProductionAnomalyDetector` | Real work, combines methods |

---

### Key Takeaways

1. **Anomalies are rare by definition** - don't expect many
2. **Multiple methods > single method** - combine for robustness
3. **Context matters** - always investigate before acting
4. **Tune contamination carefully** - too high = false positives, too low = missed anomalies
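
Takeaway 2 can be put into practice by ranking days by how many methods agree. Here is a minimal sketch, assuming each method hands back a set of flagged dates. The helper `rank_by_agreement` and the dates are invented for illustration; the method names mirror the `detected_by` labels used in this notebook.

```python
from collections import defaultdict
from datetime import date


def rank_by_agreement(flags_by_method):
    """Return (day, methods) pairs, days flagged by the most methods first."""
    methods_by_day = defaultdict(list)
    for method, days in flags_by_method.items():
        for day in days:
            methods_by_day[day].append(method)
    # Sort by number of agreeing methods (descending), then by date
    return sorted(methods_by_day.items(), key=lambda item: (-len(item[1]), item[0]))


# Illustrative flags only, not real detections
flags = {
    "zscore": {date(2024, 1, 3), date(2024, 1, 10)},
    "isolation_forest": {date(2024, 1, 3), date(2024, 1, 17)},
    "event_mismatch": {date(2024, 1, 3), date(2024, 1, 10)},
}

ranked = rank_by_agreement(flags)
for day, methods in ranked:
    print(f"{day}  flagged by {len(methods)}: {', '.join(methods)}")
```

Days at the top of this list are the ones worth investigating first, since independent methods rarely agree by accident.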

---

**Next:** See `04_classification_demo.ipynb` to predict market direction from events.