# Machine Learning-Based Anomaly Detectors

This notebook demonstrates the machine learning-based anomaly detectors available in Anomsmith:

1. **IsolationForestDetector**: Isolation Forest algorithm for detecting outliers
2. **LOFDetector**: Local Outlier Factor for density-based anomaly detection
3. **RobustCovarianceDetector**: Robust covariance (elliptic envelope) for multivariate anomaly detection

These detectors are most useful for complex or high-dimensional data, where simple statistical methods may miss anomalies.


In [None]:
import numpy as np
import pandas as pd
from plotsmith import plot_timeseries
import matplotlib.pyplot as plt

from anomsmith import detect_anomalies, ThresholdRule
from anomsmith.primitives.detectors.ml import (
    IsolationForestDetector,
    LOFDetector,
    RobustCovarianceDetector
)

np.random.seed(42)


## Creating Complex Test Data

We'll create a series with trend, seasonality, and noise, then inject a mix of spike and contextual anomalies, the kind of pattern that benefits from ML-based detection.


In [None]:
def create_complex_data(n: int = 300, contamination: float = 0.1, seed: int = 42) -> tuple[pd.Series, np.ndarray]:
    """Create a complex time series with various anomaly patterns.

    Args:
        n: Length of series
        contamination: Proportion of anomalies
        seed: Random seed

    Returns:
        The series (daily DatetimeIndex) and the integer positions of the injected anomalies.
    """
    np.random.seed(seed)
    
    # Base series with trend and seasonality
    t = np.arange(n)
    trend = 0.01 * t
    seasonal = 2 * np.sin(2 * np.pi * t / 50)
    noise = np.random.randn(n) * 0.5
    y = trend + seasonal + noise
    
    # Inject different types of anomalies
    n_anomalies = int(n * contamination)
    anomaly_indices = np.random.choice(n, n_anomalies, replace=False)
    
    # Mix of spike and contextual anomalies
    for idx in anomaly_indices:
        if np.random.rand() < 0.5:
            # Spike anomaly
            y[idx] += np.random.choice([-1, 1]) * np.random.uniform(3, 6)
        else:
            # Contextual anomaly (smaller but in wrong context)
            y[idx] += np.random.choice([-1, 1]) * np.random.uniform(1.5, 3)
    
    index = pd.date_range("2020-01-01", periods=n, freq="D")
    return pd.Series(y, index=index), anomaly_indices

# Create test data
y, true_anomaly_indices = create_complex_data(n=300, contamination=0.1)
print(f"Created time series with {len(y)} points")
print(f"True anomalies: {len(true_anomaly_indices)}")
print("\nData statistics:")
print(y.describe())


In [None]:
# Visualize the complex data
fig, ax = plot_timeseries(
    y,
    title='Complex Time Series with Various Anomaly Patterns',
    xlabel='Date',
    ylabel='Value'
)
ax.scatter(y.index[true_anomaly_indices], y.values[true_anomaly_indices], 
          color='red', s=100, marker='x', linewidths=2, 
          label=f'True Anomalies ({len(true_anomaly_indices)})', zorder=5)
ax.legend()
plt.show()


## Isolation Forest Detector

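Isolation Forest is an ensemble of random trees that isolates anomalies directly instead of profiling normal points: anomalous points tend to be separated from the rest of the data in fewer random splits. It's efficient and works well with high-dimensional data.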
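
The isolation idea can be illustrated with a minimal NumPy sketch. This is not Anomsmith's implementation; the helper names `isolation_depth` and `mean_isolation_depth` are hypothetical, and the sketch works on 1-D data without the subsampling and random feature selection that a full Isolation Forest uses. It only shows that an outlier needs fewer random splits to be isolated than a typical point.

```python
import numpy as np

def isolation_depth(values: np.ndarray, target: float, rng: np.random.Generator) -> int:
    """Count the random splits needed to isolate `target` from the other values (1-D)."""
    current = values
    depth = 0
    while len(current) > 1:
        lo, hi = current.min(), current.max()
        if lo == hi:
            break  # remaining values are duplicates and cannot be separated
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target
        current = current[current < split] if target < split else current[current >= split]
        depth += 1
    return depth

def mean_isolation_depth(values: np.ndarray, target: float, n_trees: int = 200, seed: int = 0) -> float:
    """Average the isolation depth over many independent random split sequences."""
    rng = np.random.default_rng(seed)
    return float(np.mean([isolation_depth(values, target, rng) for _ in range(n_trees)]))

rng = np.random.default_rng(42)
data = np.append(rng.normal(0.0, 1.0, size=255), 8.0)  # one obvious outlier at 8.0

outlier_depth = mean_isolation_depth(data, target=8.0)
inlier_depth = mean_isolation_depth(data, target=float(np.median(data[:-1])))
print(f"outlier: {outlier_depth:.1f} splits, inlier: {inlier_depth:.1f} splits")
```

A shorter average depth corresponds to a higher anomaly score in an Isolation Forest.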


In [None]:
# Initialize Isolation Forest detector
iso_forest = IsolationForestDetector(contamination=0.1, random_state=42)
iso_forest.fit(y.values)

# Note: Isolation Forest is a detector, so it has its own threshold.
# Here we pair it with an explicit quantile rule (0.9), which is reused for the other detectors below.
threshold_rule = ThresholdRule(method="quantile", value=0.9, quantile=0.9)
result_iso = detect_anomalies(y, iso_forest, threshold_rule)

print("Isolation Forest Results:")
print(f"Anomalies detected: {result_iso['flag'].sum()}")
print(f"Anomaly rate: {result_iso['flag'].mean():.2%}")
print("\nScore statistics:")
print(result_iso['score'].describe())
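
What a 0.9 quantile rule does to a set of scores can be sketched with plain NumPy. The sketch below uses stand-in scores and assumes that higher scores mean "more anomalous" and that a point is flagged when its score is strictly above the quantile; how `ThresholdRule` treats ties or boundaries is not shown in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in anomaly scores (hypothetical), higher = more anomalous
scores = rng.exponential(scale=1.0, size=300)

# The 0.9 quantile of the scores becomes the decision threshold
threshold = np.quantile(scores, 0.9)
flags = (scores > threshold).astype(int)

print(f"threshold={threshold:.3f}, flagged={flags.sum()} of {len(scores)}")
```

Under these assumptions the rule always flags about 10% of the points, whatever the detector, so the number of anomalies detected mostly reflects the rule. Which points get flagged is what distinguishes one detector from another.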


## Local Outlier Factor (LOF) Detector

LOF compares the local density around a data point with the density around its nearest neighbors; a point in a much sparser neighborhood than its neighbors scores as an outlier. It's good for detecting anomalies in regions of varying density.
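
The LOF computation can be written out in a few lines of NumPy. This is a minimal sketch for 1-D data, not the code behind `LOFDetector`; the function name `lof_scores` is hypothetical, and distance ties are ignored for simplicity.

```python
import numpy as np

def lof_scores(x: np.ndarray, k: int) -> np.ndarray:
    """Local Outlier Factor for 1-D data; values well above 1 suggest outliers."""
    x = np.asarray(x, dtype=float)
    dist = np.abs(x[:, None] - x[None, :])        # pairwise distances
    np.fill_diagonal(dist, np.inf)                # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest neighbors
    d_to_neighbors = np.take_along_axis(dist, neighbors, axis=1)
    k_dist = d_to_neighbors[:, -1]                # distance to the k-th neighbor

    # reach_dist(p, o) = max(k_dist(o), d(p, o)) for each neighbor o of p
    reach = np.maximum(k_dist[neighbors], d_to_neighbors)

    lrd = 1.0 / (reach.mean(axis=1) + 1e-12)      # local reachability density
    return lrd[neighbors].mean(axis=1) / lrd      # neighbors' density relative to own

rng = np.random.default_rng(42)
x = np.append(rng.normal(0.0, 1.0, size=100), 10.0)  # one isolated point at 10.0
scores = lof_scores(x, k=10)
print(f"LOF of isolated point: {scores[-1]:.1f}, median LOF: {np.median(scores):.2f}")
```

Points inside a cluster have LOF close to 1, because their density matches that of their neighbors. The isolated point scores much higher.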


In [None]:
# Initialize LOF detector
lof = LOFDetector(n_neighbors=20, contamination=0.1)
lof.fit(y.values)

result_lof = detect_anomalies(y, lof, threshold_rule)

print("LOF Results:")
print(f"Anomalies detected: {result_lof['flag'].sum()}")
print(f"Anomaly rate: {result_lof['flag'].mean():.2%}")
print("\nScore statistics:")
print(result_lof['score'].describe())


## Robust Covariance Detector

Robust Covariance (Elliptic Envelope) fits a robust estimate of the location and covariance of the data, assuming the inliers are roughly Gaussian, and treats points far from that fit as anomalies. It's most useful for multivariate data.
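
The distance behind an elliptic envelope is the Mahalanobis distance, sketched below in NumPy on synthetic 2-D data. For brevity the sketch uses the classical mean and covariance rather than a robust estimate, so it is a simplified stand-in and not what `RobustCovarianceDetector` does internally. The helper name `mahalanobis_sq` is hypothetical.

```python
import numpy as np

def mahalanobis_sq(X: np.ndarray, location: np.ndarray, covariance: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance of each row of X from `location`."""
    centered = X - location
    precision = np.linalg.inv(covariance)
    return np.einsum("ij,jk,ik->i", centered, precision, centered)

rng = np.random.default_rng(42)
cov_true = np.array([[1.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_true, size=300)
X[:5] = [[3.0, -3.0]] * 5  # planted points that violate the positive correlation

# Classical estimates, used only to keep the sketch short (a robust fit would resist the planted points)
d2 = mahalanobis_sq(X, X.mean(axis=0), np.cov(X, rowvar=False))
threshold = np.quantile(d2, 0.9)
flags = d2 > threshold

print(f"planted points flagged: {flags[:5].sum()} of 5, total flagged: {flags.sum()}")
```

Each planted coordinate is only about three standard deviations from the mean, but together they contradict the correlation structure, which is what the Mahalanobis distance picks up. The planted points also inflate the classical covariance estimate, which is the motivation for fitting a robust one.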


In [None]:
# Initialize Robust Covariance detector
robust_cov = RobustCovarianceDetector(contamination=0.1, random_state=42)
robust_cov.fit(y.values)

result_robust = detect_anomalies(y, robust_cov, threshold_rule)

print("Robust Covariance Results:")
print(f"Anomalies detected: {result_robust['flag'].sum()}")
print(f"Anomaly rate: {result_robust['flag'].mean():.2%}")
print("\nScore statistics:")
print(result_robust['score'].describe())


## Comparing All ML Detectors

Let's compare the output of all three ML detectors side by side: how many points each one flags and how its scores are distributed.


In [None]:
# Compare results
comparison = pd.DataFrame({
    'Isolation Forest': [
        result_iso['flag'].sum(),
        result_iso['flag'].mean(),
        result_iso['score'].mean(),
        result_iso['score'].std()
    ],
    'LOF': [
        result_lof['flag'].sum(),
        result_lof['flag'].mean(),
        result_lof['score'].mean(),
        result_lof['score'].std()
    ],
    'Robust Covariance': [
        result_robust['flag'].sum(),
        result_robust['flag'].mean(),
        result_robust['score'].mean(),
        result_robust['score'].std()
    ]
}, index=['Anomalies Detected', 'Anomaly Rate', 'Mean Score', 'Std Score'])

print("ML Detector Comparison:")
print(comparison.round(4))
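
The table above summarizes counts and scores but does not use the known anomaly positions. Because the synthetic data comes with ground truth, precision and recall can also be computed. The helper below is a minimal sketch; `precision_recall` is a hypothetical name, and it assumes flags are 0/1 values aligned with the series positions.

```python
import numpy as np

def precision_recall(flags: np.ndarray, true_indices: np.ndarray) -> tuple[float, float]:
    """Precision and recall of 0/1 flags against known anomaly positions."""
    flags = np.asarray(flags).astype(bool)
    truth = np.zeros(len(flags), dtype=bool)
    truth[true_indices] = True
    tp = int(np.sum(flags & truth))                 # flagged and truly anomalous
    precision = tp / max(int(flags.sum()), 1)       # share of flags that are correct
    recall = tp / max(int(truth.sum()), 1)          # share of true anomalies found
    return precision, recall

# Toy example: 10 points, true anomalies at positions 2 and 7, flags at positions 2 and 5
flags = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0, 0])
p, r = precision_recall(flags, np.array([2, 7]))
print(p, r)  # 0.5 0.5
```

In this notebook the call would look like `precision_recall(result_iso['flag'].to_numpy(), true_anomaly_indices)`, assuming `result_iso['flag']` is a 0/1 pandas column as the plotting cells suggest.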


In [None]:
# Visualize detection results
detectors = [
    ('Isolation Forest', result_iso, 'blue'),
    ('LOF', result_lof, 'green'),
    ('Robust Covariance', result_robust, 'orange')
]

for name, result, color in detectors:
    anomaly_mask = result['flag'] == 1
    fig, ax = plot_timeseries(
        y,
        title=f'{name} Detection Results',
        xlabel='Date',
        ylabel='Value'
    )
    # True anomalies
    ax.scatter(y.index[true_anomaly_indices], y.values[true_anomaly_indices], 
              color='gray', s=80, marker='o', alpha=0.5, 
              label='True Anomalies', zorder=3)
    # Detected anomalies
    ax.scatter(y.index[anomaly_mask], y.values[anomaly_mask], 
              color='red', s=100, marker='x', linewidths=2, 
              label=f'Detected ({anomaly_mask.sum()})', zorder=5)
    ax.legend()
    plt.show()


In [None]:
# Visualize scores
for name, result, color in detectors:
    anomaly_mask = result['flag'] == 1
    fig, ax = plot_timeseries(
        pd.Series(result['score'], index=y.index),
        title=f'{name} Anomaly Scores',
        xlabel='Date',
        ylabel='Score'
    )
    threshold_value = np.quantile(result['score'], 0.9)
    ax.axhline(threshold_value, color='r', linestyle='--', linewidth=2, 
              label=f'Threshold ({threshold_value:.2f})')
    ax.scatter(y.index[anomaly_mask], result['score'][anomaly_mask], 
              color='red', s=50, marker='x', linewidths=1.5, zorder=5)
    ax.legend()
    plt.show()


## When to Use Each Detector

### Isolation Forest
- **Best for**: High-dimensional data, large datasets
- **Strengths**: Fast, handles high-dimensional data well, makes no assumption that the data is normally distributed
- **Weaknesses**: May struggle with local anomalies in dense regions

### Local Outlier Factor (LOF)
- **Best for**: Data with varying density, local anomalies
- **Strengths**: Detects local anomalies well, adapts to local density
- **Weaknesses**: Sensitive to the number of neighbors (`n_neighbors`), can be slow for large datasets

### Robust Covariance
- **Best for**: Multivariate Gaussian data, when you need robust statistics
- **Strengths**: Robust to outliers in the training data, good for multivariate data
- **Weaknesses**: Assumes Gaussian distribution, may not work well for non-Gaussian data
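
The examples in this notebook fit each detector on the raw values of a single series, so every point is judged without its temporal context. One common way to turn a univariate series into multivariate input is a lag (sliding-window) embedding, sketched below with NumPy. The helper `lag_embed` is hypothetical, and whether Anomsmith's detectors accept 2-D input is an assumption not confirmed by this notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def lag_embed(values: np.ndarray, window: int) -> np.ndarray:
    """Return an array of shape (n - window + 1, window); row i holds values[i : i + window]."""
    return sliding_window_view(np.asarray(values, dtype=float), window_shape=window)

series = np.arange(6, dtype=float)
X = lag_embed(series, window=3)
print(X.shape)  # (4, 3)
print(X[-1])    # [3. 4. 5.]
```

Each row can then be treated as one multivariate observation, so a value that is ordinary overall but unusual given its recent neighbors has a chance of standing out.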


## Summary

In this notebook, we've explored:
1. **IsolationForestDetector**: Efficient ensemble method for anomaly detection
2. **LOFDetector**: Density-based local outlier detection
3. **RobustCovarianceDetector**: Robust statistical method for multivariate data

Key takeaways:
- ML detectors are powerful for complex, high-dimensional data
- Each detector has different strengths and use cases
- The choice depends on your data characteristics and requirements
