# Notebook 2: Robust Anomaly Detection on Climate Data

## Introduction

This notebook demonstrates the difference between **standard z-score** and **robust z-score** anomaly detection.

**Key question**: When sensor data contains occasional **spikes** (sensor malfunctions, transmission errors, extreme events), which anomaly detection method is more reliable?

We'll use hourly PM2.5 air pollution data from an urban monitoring station that occasionally reports erroneous readings.

---

## Background: Anomaly Detection Methods

### Standard Z-Score Method

Flags values that are more than `threshold` standard deviations from the mean:

```
z = (x - mean) / std
anomaly if |z| > threshold
```

**Problem**: The mean and std are themselves **heavily influenced by outliers**. A few extreme spikes inflate the std, which shrinks every z-score and can hide (mask) more moderate anomalies.
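
The masking effect can be reproduced with a few lines of plain NumPy. This is a minimal sketch on invented data, not the HPCSeries implementation: `zscore_flags` is a name made up for this example, the series is synthetic, and NumPy's default population std (`ddof=0`) is an assumption that may differ from what `hpcs` uses.

```python
import numpy as np

def zscore_flags(x, threshold=3.0):
    """Flag values whose standard z-score exceeds the threshold in magnitude."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    if std == 0:
        return np.zeros(x.shape, dtype=bool)  # constant series: nothing to flag
    return np.abs((x - x.mean()) / std) > threshold

# Invented series: a clean baseline near 30, one moderate anomaly, three huge spikes.
rng = np.random.default_rng(0)
values = rng.normal(30.0, 5.0, size=288)
values[100] = 60.0                  # moderate anomaly (6 sigma of the clean baseline)
values[[50, 140, 230]] = 400.0      # extreme spikes

flags = zscore_flags(values)
print("std with spikes:", round(float(values.std()), 1))
print("moderate anomaly flagged:", bool(flags[100]))
print("spikes flagged:", flags[[50, 140, 230]].tolist())
```

The three spikes are still flagged, but they inflate the std so much that the reading of 60 no longer stands out.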

---

### Robust Z-Score Method (Modified Z-Score)

Uses **median** and **MAD** (Median Absolute Deviation) instead:

```
MAD = median(|x - median(x)|)
robust_z = 0.6745 * (x - median) / MAD
anomaly if |robust_z| > threshold
```

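**Advantage**: Median and MAD are **resistant to outliers**, so a small number of spikes barely changes them. The factor 0.6745 (the 0.75 quantile of the standard normal distribution) rescales MAD so that, for normally distributed data, the robust z-score is on the same scale as the standard z-score.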
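
The formula above translates directly into NumPy. The sketch below is a reference version for illustration only: `robust_zscore` is a hypothetical helper, the data is synthetic, and returning NaN when MAD is zero is a choice made here, not documented HPCSeries behaviour.

```python
import numpy as np

def robust_zscore(x):
    """Modified z-score: 0.6745 * (x - median) / MAD."""
    x = np.asarray(x, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    if mad == 0:
        # More than half the values equal the median; the score is undefined.
        return np.full(x.shape, np.nan)
    return 0.6745 * (x - median) / mad

# Invented series: a clean baseline near 30, one moderate anomaly, three huge spikes.
rng = np.random.default_rng(0)
values = rng.normal(30.0, 5.0, size=288)
values[100] = 60.0
values[[50, 140, 230]] = 400.0

rz = robust_zscore(values)
flags = np.abs(rz) > 3.0
print("moderate anomaly score:", round(float(rz[100]), 1))
print("moderate anomaly flagged:", bool(flags[100]))
print("spikes flagged:", flags[[50, 140, 230]].tolist())
```

Because the median and MAD are computed from the bulk of the data, the reading of 60 is flagged alongside the spikes.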

---

## Setup

Import HPCSeries and the supporting libraries, then print the library version and the detected SIMD instruction set.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import hpcs

# Display library info
print(f"HPCSeries version: {hpcs.__version__}")
print(f"SIMD ISA: {hpcs.simd_info()['isa']}")
print()

## Load Data

Load 12 days of hourly PM2.5 readings (288 hours). This dataset contains:
- Normal diurnal patterns (low at night, high during day)
- **3 sensor error spikes** with extreme values
- Realistic air quality variation
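
If the CSV file is not at hand, a stand-in with the same shape can be generated. The sketch below is a substitute built on assumptions, not the notebook's dataset: the diurnal amplitude, noise level, start date, spike positions and spike values are all invented, and `make_synthetic_pm25` is a hypothetical helper. Only the column names `timestamp` and `pm25_ugm3` and the 288-hour length come from this notebook.

```python
import numpy as np
import pandas as pd

def make_synthetic_pm25(n_hours=288, seed=0):
    """Build an invented hourly PM2.5 series with a diurnal cycle and 3 spikes."""
    rng = np.random.default_rng(seed)
    hours = np.arange(n_hours)
    # Diurnal cycle: lowest around 03:00, highest around 15:00 (illustrative values).
    diurnal = 30.0 + 10.0 * np.sin(2 * np.pi * (hours - 9) / 24)
    noise = rng.normal(0.0, 3.0, size=n_hours)
    pm25 = np.clip(diurnal + noise, 0.0, None)  # concentrations cannot be negative
    # Inject three invented sensor-error spikes at arbitrary positions.
    spike_idx = np.array([50, 140, 230])
    pm25[spike_idx] = [310.0, 280.0, 350.0]
    timestamps = pd.Timestamp("2024-01-01") + pd.to_timedelta(hours, unit="h")
    frame = pd.DataFrame({"timestamp": timestamps, "pm25_ugm3": pm25})
    return frame, spike_idx

synthetic_df, synthetic_spikes = make_synthetic_pm25()
print(synthetic_df.shape)
print(synthetic_df.iloc[synthetic_spikes])
```

Knowing the spike positions in advance makes it possible to check each detector against ground truth, which the real sensor data does not offer.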

In [None]:
# Load PM2.5 data
df = pd.read_csv('data/climate_pm25_timeseries.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Extract PM2.5 concentration as NumPy array
pm25 = df['pm25_ugm3'].values

print(f"Dataset: {len(pm25)} hourly PM2.5 readings")
print(f"PM2.5 range: {pm25.min():.1f} - {pm25.max():.1f} µg/m³")
print(f"Mean PM2.5: {pm25.mean():.1f} µg/m³")
print(f"Median PM2.5: {hpcs.median(pm25):.1f} µg/m³")
print()

# Show first few rows
df.head(10)

## Visualize Raw Data

Plot the raw PM2.5 time-series. Notice the **extreme spikes** that are clearly sensor errors.

In [None]:
plt.figure(figsize=(14, 5))
plt.plot(df['timestamp'], pm25, alpha=0.7, linewidth=1.5, color='steelblue', marker='o', markersize=3)
plt.xlabel('Time', fontsize=12)
plt.ylabel('PM2.5 (µg/m³)', fontsize=12)
plt.title('Hourly PM2.5 Air Pollution — Raw Sensor Data', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Notice the extreme spikes above 150 µg/m³; these are sensor errors.")

## Method 1: Standard Z-Score Anomaly Detection

Detect anomalies using the **standard z-score** method with `hpcs.detect_anomalies()`.

Threshold: 3.0 (values more than 3 standard deviations from the mean)
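
To put the threshold in context: for clean, normally distributed data about 0.27% of values fall more than 3 standard deviations from the mean, so a flag or two can appear by chance alone on a long series. A minimal sketch of that arithmetic follows, assuming independent Gaussian readings (real PM2.5 data is neither exactly Gaussian nor independent, so treat the result as a rough baseline). `expected_false_alarms` is a name made up for this example.

```python
from statistics import NormalDist

def expected_false_alarms(n_samples, threshold):
    """Expected count of |z| > threshold among n_samples i.i.d. Gaussian values."""
    tail_prob = 2.0 * (1.0 - NormalDist().cdf(threshold))  # two-sided tail
    return n_samples * tail_prob

print(f"Expected chance flags in 288 readings: {expected_false_alarms(288, 3.0):.2f}")
print(f"Expected chance flags in 1,000,000 readings: {expected_false_alarms(1_000_000, 3.0):.0f}")
```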

In [None]:
threshold = 3.0

# Standard z-score anomaly detection
anomalies_standard = hpcs.detect_anomalies(pm25, threshold)

num_anomalies = np.sum(anomalies_standard)
print(f"Standard Z-Score Method:")
print(f"  Detected {num_anomalies} anomalies out of {len(pm25)} readings")
print(f"  Anomaly rate: {num_anomalies / len(pm25) * 100:.2f}%")
print()

# Show detected anomaly values
if num_anomalies > 0:
    anomaly_values = pm25[anomalies_standard]
    print(f"Detected anomaly values: {anomaly_values}")
    print()

## Method 2: Robust Z-Score Anomaly Detection

Detect anomalies using the **robust z-score** method with `hpcs.detect_anomalies_robust()`.

This uses median and MAD instead of mean and std, making it **resistant to outliers**.

In [None]:
# Robust z-score anomaly detection
anomalies_robust = hpcs.detect_anomalies_robust(pm25, threshold)

num_anomalies_robust = np.sum(anomalies_robust)
print(f"Robust Z-Score Method:")
print(f"  Detected {num_anomalies_robust} anomalies out of {len(pm25)} readings")
print(f"  Anomaly rate: {num_anomalies_robust / len(pm25) * 100:.2f}%")
print()

# Show detected anomaly values
if num_anomalies_robust > 0:
    anomaly_values_robust = pm25[anomalies_robust]
    print(f"Detected anomaly values: {anomaly_values_robust}")
    print()

## Comparison: Standard vs Robust

Let's visualize both detection methods side-by-side.

In [None]:
fig, axes = plt.subplots(2, 1, figsize=(14, 9))

# Plot 1: Standard Z-Score
axes[0].plot(df['timestamp'], pm25, alpha=0.5, linewidth=1.5, color='steelblue', label='PM2.5 Data')
axes[0].scatter(df['timestamp'][anomalies_standard], pm25[anomalies_standard], 
                color='red', s=80, marker='X', label=f'Anomalies (n={num_anomalies})', zorder=5)
axes[0].set_xlabel('Time', fontsize=11)
axes[0].set_ylabel('PM2.5 (µg/m³)', fontsize=11)
axes[0].set_title('Standard Z-Score Anomaly Detection', fontsize=13, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Plot 2: Robust Z-Score
axes[1].plot(df['timestamp'], pm25, alpha=0.5, linewidth=1.5, color='steelblue', label='PM2.5 Data')
axes[1].scatter(df['timestamp'][anomalies_robust], pm25[anomalies_robust], 
                color='darkred', s=80, marker='X', label=f'Anomalies (n={num_anomalies_robust})', zorder=5)
axes[1].set_xlabel('Time', fontsize=11)
axes[1].set_ylabel('PM2.5 (µg/m³)', fontsize=11)
axes[1].set_title('Robust Z-Score Anomaly Detection (Median + MAD)', fontsize=13, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Analysis: Which Method is Better?

Let's compute the statistics that drive each method to understand the difference.

In [None]:
# Standard statistics (affected by outliers)
mean_val = hpcs.mean(pm25)
std_val = hpcs.std(pm25)

# Robust statistics (resistant to outliers)
median_val = hpcs.median(pm25)
mad_val = hpcs.mad(pm25)

print("Statistical Comparison:")
print("=" * 50)
print(f"Standard Method:")
print(f"  Mean:  {mean_val:.2f} µg/m³")
print(f"  Std:   {std_val:.2f} µg/m³")
print(f"  Threshold range: [{mean_val - threshold * std_val:.2f}, {mean_val + threshold * std_val:.2f}]")
print()
print(f"Robust Method:")
print(f"  Median: {median_val:.2f} µg/m³")
print(f"  MAD:    {mad_val:.2f} µg/m³")
print(f"  Threshold range: [{median_val - threshold * mad_val * 1.4826:.2f}, {median_val + threshold * mad_val * 1.4826:.2f}]")
print()
print("Key Insight:")
print("  The extreme spikes inflate the mean and std, which widens the standard method's")
print("  threshold range and makes it less sensitive. The median and MAD barely move, so")
print("  the robust method keeps a tight range and still flags the spikes.")

## Detection Overlap: Comparing Methods

Let's see which anomalies each method detected and where the two methods agree or disagree.

In [None]:
# Comparison
both = anomalies_standard & anomalies_robust
only_standard = anomalies_standard & ~anomalies_robust
only_robust = ~anomalies_standard & anomalies_robust

print("Detection Comparison:")
print("=" * 50)
print(f"Detected by both methods:     {np.sum(both)}")
print(f"Detected only by standard:    {np.sum(only_standard)}")
print(f"Detected only by robust:      {np.sum(only_robust)}")
print()

if np.sum(only_robust) > 0:
    print("Values detected ONLY by robust method:")
    print(pm25[only_robust])
    print("\nThese are likely anomalies that the standard method missed because the spikes")
    print("inflated its mean and std (the largest spikes are usually caught by both methods).")
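
The same agreement counts can be laid out as a 2x2 table. The sketch below uses `pandas.crosstab` on tiny hand-made masks. `overlap_table` is a hypothetical helper, and the table only shows agreement between two detectors; without ground-truth labels it says nothing about which flags are correct.

```python
import numpy as np
import pandas as pd

def overlap_table(flags_a, flags_b, name_a="standard", name_b="robust"):
    """Cross-tabulate two boolean anomaly masks of equal length into a 2x2 table."""
    a = pd.Series(np.asarray(flags_a, dtype=bool), name=name_a)
    b = pd.Series(np.asarray(flags_b, dtype=bool), name=name_b)
    table = pd.crosstab(a, b)
    # Keep the full 2x2 layout even if one mask never fires (or always fires).
    return table.reindex(index=[False, True], columns=[False, True], fill_value=0)

# Tiny hand-made masks, purely for illustration.
standard_flags = [False, True, False, False, True, False]
robust_flags = [False, True, True, False, True, True]

table = overlap_table(standard_flags, robust_flags)
print(table)
```

Rows correspond to the first mask and columns to the second, so the top-right cell counts points flagged only by the robust method.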

## Rolling Robust Z-Score

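For time-series data, we can use a **rolling robust z-score** to detect anomalies within a moving window.

This is useful when the "normal" baseline changes over time (e.g., pollution levels that shift from one day to the next). Note that a 24-hour window spans a full day/night cycle, so tracking the day/night difference itself would need a shorter window.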
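
For intuition, a rolling robust z-score can be written with pandas rolling windows. This is a slow reference sketch under stated assumptions, not the HPCSeries kernel: it uses a trailing window, returns NaN until a full window is available, and returns NaN where the window's MAD is zero. How `hpcs.rolling_robust_zscore()` aligns its window and handles edges is not specified in this notebook, so its output may differ. `rolling_robust_zscore_reference` is a name made up for this example.

```python
import numpy as np
import pandas as pd

def rolling_robust_zscore_reference(x, window):
    """Trailing-window modified z-score; NaN until a full window is available."""
    s = pd.Series(np.asarray(x, dtype=float))
    roll = s.rolling(window)  # trailing window; min_periods defaults to `window`
    med = roll.median()
    # MAD of each window: median absolute deviation from that window's own median.
    mad = roll.apply(lambda w: np.median(np.abs(w - np.median(w))), raw=True)
    mad = mad.where(mad > 0)  # zero MAD becomes NaN instead of dividing by zero
    return (0.6745 * (s - med) / mad).to_numpy()

# Invented series: four days with a diurnal cycle and one injected spike.
rng = np.random.default_rng(1)
hours = np.arange(96)
series = 30 + 10 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, size=96)
series[60] = 300.0

z = rolling_robust_zscore_reference(series, window=24)
print("first valid index:", int(np.flatnonzero(~np.isnan(z))[0]))
print("spike flagged:", bool(abs(z[60]) > 3.0))
```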

In [None]:
window = 24  # 24-hour rolling window

# Compute rolling robust z-score
rolling_robust_z = hpcs.rolling_robust_zscore(pm25, window)

# Flag anomalies (|z| > threshold)
rolling_anomalies = np.abs(rolling_robust_z) > threshold

print(f"Rolling Robust Z-Score (window={window} hours):")
print(f"  Detected {np.sum(rolling_anomalies)} anomalies")
print(f"  Anomaly rate: {np.sum(rolling_anomalies) / len(pm25) * 100:.2f}%")
print()

## Visualize Rolling Robust Z-Score

In [None]:
fig, axes = plt.subplots(2, 1, figsize=(14, 9), sharex=True)

# Plot 1: Original data with rolling anomalies
axes[0].plot(df['timestamp'], pm25, alpha=0.5, linewidth=1.5, color='steelblue', label='PM2.5 Data')
axes[0].scatter(df['timestamp'][rolling_anomalies], pm25[rolling_anomalies], 
                color='darkred', s=80, marker='X', label=f'Anomalies (n={np.sum(rolling_anomalies)})', zorder=5)
axes[0].set_ylabel('PM2.5 (µg/m³)', fontsize=11)
axes[0].set_title(f'Rolling Robust Anomaly Detection (window={window}h)', fontsize=13, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Plot 2: Rolling robust z-score
axes[1].plot(df['timestamp'], rolling_robust_z, linewidth=1.5, color='darkgreen', label='Rolling Robust Z-Score')
axes[1].axhline(y=threshold, color='red', linestyle='--', linewidth=1.5, label=f'Threshold = ±{threshold}')
axes[1].axhline(y=-threshold, color='red', linestyle='--', linewidth=1.5)
axes[1].axhline(y=0, color='black', linestyle='-', linewidth=0.5, alpha=0.5)
axes[1].fill_between(df['timestamp'], -threshold, threshold, alpha=0.2, color='green', label='Normal Range')
axes[1].set_xlabel('Time', fontsize=11)
axes[1].set_ylabel('Robust Z-Score', fontsize=11)
axes[1].set_title('Rolling Robust Z-Score Values', fontsize=13, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Performance Benchmark

HPCSeries implements fast anomaly detection using optimized C/C++ kernels.

Let's benchmark on a larger dataset.

In [None]:
import time

# Create large test array (1 million hourly readings = ~114 years)
n = 1_000_000
rng = np.random.default_rng(42)  # fixed seed so the benchmark input is reproducible
large_data = rng.normal(loc=30, scale=15, size=n)  # Simulated PM2.5 data
# Add some extreme outliers
spike_indices = rng.choice(n, size=100, replace=False)
large_data[spike_indices] = rng.uniform(200, 500, size=100)

print(f"Benchmark: {n:,} element array\n")

# Benchmark standard anomaly detection
start = time.perf_counter()
_ = hpcs.detect_anomalies(large_data, threshold=3.0)
elapsed_standard = time.perf_counter() - start
print(f"Standard Anomaly Detection:   {elapsed_standard*1000:.2f} ms  ({n/elapsed_standard/1e6:.1f} M values/sec)")

# Benchmark robust anomaly detection
start = time.perf_counter()
_ = hpcs.detect_anomalies_robust(large_data, threshold=3.0)
elapsed_robust = time.perf_counter() - start
print(f"Robust Anomaly Detection:     {elapsed_robust*1000:.2f} ms  ({n/elapsed_robust/1e6:.1f} M values/sec)")

# Benchmark rolling robust z-score
start = time.perf_counter()
_ = hpcs.rolling_robust_zscore(large_data, 24)
elapsed_rolling = time.perf_counter() - start
print(f"Rolling Robust Z-Score (w=24): {elapsed_rolling*1000:.2f} ms  ({n/elapsed_rolling/1e6:.1f} M values/sec)")

print(f"\nRobust / standard runtime ratio: {elapsed_robust/elapsed_standard:.1f}x (the robust method needs median computations)")
print(f"The robust method still processes {n/elapsed_robust/1e3:.0f}k values per second.")
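
A single timed call can be distorted by warm-up effects and background load, so repeating the measurement and keeping the fastest run usually gives a steadier number. The sketch below shows one way to do that with the standard `timeit` module. `best_time_ms` and `numpy_robust_flags` are hypothetical helpers, and the timing target is a plain-NumPy stand-in so the example runs without HPCSeries; any callable, including the `hpcs` functions, could be passed instead.

```python
import timeit
import numpy as np

def best_time_ms(func, *args, repeats=5):
    """Return the fastest of `repeats` single calls, in milliseconds."""
    times = timeit.repeat(lambda: func(*args), number=1, repeat=repeats)
    return min(times) * 1000.0

def numpy_robust_flags(x, threshold):
    """Plain-NumPy stand-in for a robust detector, used only as a timing target."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    # Equivalent to |0.6745 * (x - med) / mad| > threshold, without dividing by mad.
    return np.abs(0.6745 * (x - med)) > threshold * mad

rng = np.random.default_rng(42)
sample = rng.normal(30.0, 15.0, size=200_000)

elapsed = best_time_ms(numpy_robust_flags, sample, 3.0)
print(f"NumPy reference, best of 5: {elapsed:.2f} ms")
```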

## What We Learned

### Key Takeaways:

1. **Standard z-score anomaly detection** is fast but **sensitive to outliers**.
   - Extreme spikes inflate the mean and std, making detection less reliable.

2. **Robust z-score (MAD-based)** is **resistant to outliers**.
   - Uses median and MAD, which are barely affected by a small fraction of extreme values.
   - More reliable for noisy sensor data.

3. **Rolling robust z-score** adapts to time-varying baselines.
   - Useful for data with diurnal patterns or seasonal trends.

4. **HPCSeries provides all three methods** with high performance:
   - `hpcs.detect_anomalies()` — Standard z-score
   - `hpcs.detect_anomalies_robust()` — Robust z-score (MAD-based)
   - `hpcs.rolling_robust_zscore()` — Rolling robust z-score

### When to Use Each:

| Use Case | Recommended Method |
|----------|-------------------|
| Clean data with Gaussian distribution | `detect_anomalies()` (faster) |
| Sensor data with occasional spikes | `detect_anomalies_robust()` |
| Environmental monitoring | `detect_anomalies_robust()` |
| Time-series with changing baseline | `rolling_robust_zscore()` |
| Financial fraud detection | `detect_anomalies_robust()` |
| Real-time stream processing | `rolling_robust_zscore()` |

---

## Next Steps

See the next notebook:
- **Notebook 3**: Batched rolling analytics for IoT sensors using `hpcs.rolling_mean_batched()`
