# Anomaly Detection for Infrastructure Monitoring

## Context
In SRE/DevOps, traditional threshold-based alerting (e.g., "Alert if CPU > 90%") is often noisy and can miss subtle issues. Machine-learning-based **Anomaly Detection** learns the "normal" behavior of a system from the data and flags points that deviate significantly from it.

This is an **Unsupervised Learning** task. We don't need historical data labeled "anomalous" vs. "normal"; the algorithm isolates the anomalies based on data distribution.

## Objectives
- Generate a synthetic time-series dataset of **Requests Per Minute (RPM)** and **Average Response Time (ms)**.
- Inject realistic anomalies (e.g., sudden traffic spikes, performance degradations).
- Train an **Isolation Forest** model to detect these anomalies without manual thresholding.
- Visualize the detected anomalies to see how they align with infrastructure events.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

plt.style.use('ggplot')

### 1. Generating Infrastructure Telemetry
Let's simulate 1,000 minutes of telemetry for an API gateway.
- **Normal State:** ~5000 RPM, ~45ms Response Time.
- **Anomalies Injected:** 
  - A sudden spike in response time (e.g., database lock).
  - A massive drop in traffic (e.g., upstream load balancer failure).
  - A DDoS attack (massive traffic spike + higher response time).

In [None]:
np.random.seed(42)
n_samples = 1000

# 1. Normal regular traffic
rpm_normal = np.random.normal(loc=5000, scale=300, size=n_samples)
resp_normal = np.random.normal(loc=45, scale=5, size=n_samples)

data = pd.DataFrame({'RPM': rpm_normal, 'Response_Time_ms': resp_normal})
# 'min' is the minute alias; the older 'T' alias is deprecated in recent pandas
data['Timestamp'] = pd.date_range(start='2023-10-01', periods=n_samples, freq='min')

# 2. Inject Anomalies
# Anomaly 1: DB Lock at minute 300 (Massive response time spike)
data.loc[300:305, 'Response_Time_ms'] = [250, 300, 310, 280, 200, 150]

# Anomaly 2: Upstream failure at minute 600 (Traffic drops to near 0)
data.loc[600:610, 'RPM'] = np.random.normal(loc=100, scale=50, size=11)

# Anomaly 3: DDoS attack at minute 850 (Huge traffic, high latency)
data.loc[850:860, 'RPM'] = np.random.normal(loc=15000, scale=1000, size=11)
data.loc[850:860, 'Response_Time_ms'] = np.random.normal(loc=120, scale=20, size=11)

# Visualize the Raw Telemetry
plt.figure(figsize=(12, 6))
plt.plot(data['Timestamp'], data['RPM'], label='Requests Per Minute (RPM)', color='blue', alpha=0.6)
plt.plot(data['Timestamp'], data['Response_Time_ms'] * 10, label='Response Time (ms * 10)', color='red', alpha=0.6)
plt.title("API Gateway Telemetry (Raw)")
plt.xlabel("Time")
plt.legend()
plt.show()

# Notice the 3 distinct abnormal events visually!

### 2. Feature Scaling
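Isolation Forest splits on one feature at a time, so unlike distance-based detectors it is largely insensitive to feature scale. We standardize the features anyway: RPM values are in the thousands while Response Time is in the tens to hundreds, and scaling keeps the pipeline ready for scale-sensitive detectors (e.g., Local Outlier Factor or One-Class SVM) if we swap the model later.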
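
As a quick check of what standard scaling does, the sketch below builds two synthetic columns on roughly the same scales as the telemetry and confirms that each scaled column ends up with mean 0 and unit standard deviation. The names `demo`, `demo_scaler`, `demo_scaled`, and `demo_restored` are illustrative and separate from the notebook's own variables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Illustrative stand-ins for RPM (thousands) and response time (tens of ms)
demo = np.column_stack([
    rng.normal(5000, 300, size=500),
    rng.normal(45, 5, size=500),
])

demo_scaler = StandardScaler()
demo_scaled = demo_scaler.fit_transform(demo)

# After scaling, every column has mean ~0 and standard deviation ~1
print("means:", demo_scaled.mean(axis=0).round(6))
print("stds: ", demo_scaled.std(axis=0).round(6))

# inverse_transform maps scaled values back to the original units,
# which is handy when reporting anomalies in RPM or milliseconds
demo_restored = demo_scaler.inverse_transform(demo_scaled)
```

Keeping the fitted scaler around matters in practice: new telemetry should be transformed with the same fitted object rather than refitted.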

In [None]:
features = ['RPM', 'Response_Time_ms']
X = data[features]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


### 3. Isolation Forest
**Isolation Forest** "isolates" observations by randomly selecting a feature and then randomly selecting a split value between that feature's minimum and maximum.
Anomalies tend to require *fewer* random splits to be isolated from the rest of the data because they lie far from the dense, normal cluster, so a short average path length across the trees signals an anomaly.
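
To make the "fewer splits" intuition concrete, here is a simplified one-dimensional sketch. It is not scikit-learn's implementation: the helper `isolation_depth` is a hypothetical function written for this illustration, and the data are synthetic. It repeatedly picks a random split between the current minimum and maximum, keeps the side containing the target point, and counts the splits until that point is alone.

```python
import numpy as np

def isolation_depth(values, target_idx, rng):
    """Count random splits until values[target_idx] sits alone in its partition."""
    idx = np.arange(len(values))
    depth = 0
    while len(idx) > 1:
        lo, hi = values[idx].min(), values[idx].max()
        if lo == hi:  # remaining points are identical and cannot be separated
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target point
        if values[target_idx] < split:
            idx = idx[values[idx] < split]
        else:
            idx = idx[values[idx] >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)

# 255 "normal" RPM readings plus one injected spike at the last position
values = np.append(rng.normal(5000, 300, size=255), 15000.0)
outlier_idx = len(values) - 1
inlier_idx = int(np.argmin(np.abs(values - 5000)))  # reading closest to the center

n_trials = 200
outlier_depth = np.mean([isolation_depth(values, outlier_idx, rng) for _ in range(n_trials)])
inlier_depth = np.mean([isolation_depth(values, inlier_idx, rng) for _ in range(n_trials)])

print(f"mean splits to isolate the spike:  {outlier_depth:.1f}")
print(f"mean splits to isolate an inlier:  {inlier_depth:.1f}")
```

The spike is usually cut off within the first split or two, while a point in the middle of the cluster needs many more. Isolation Forest averages this path length over many random trees and turns it into an anomaly score.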

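The `contamination` parameter tells the model what fraction of the data we expect to be anomalous, which sets the score threshold used to assign labels. We injected 28 anomalous points out of 1,000 (2.8%), so we will set it to `0.03` (3%).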
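
The sketch below shows how `contamination` becomes a threshold, using synthetic Gaussian data rather than the notebook's telemetry (the names `X_demo` and `model` are illustrative). In scikit-learn, `score_samples` returns a raw score where lower means more abnormal, `offset_` is the cut-off derived from `contamination`, and `decision_function` is the score minus that offset, so negative values are labeled `-1`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 2))  # synthetic data, for illustration only

model = IsolationForest(n_estimators=100, contamination=0.03, random_state=42)
model.fit(X_demo)

scores = model.score_samples(X_demo)         # lower score = more abnormal
decision = model.decision_function(X_demo)   # scores shifted by the learned offset
labels = model.predict(X_demo)               # -1 where decision < 0, else 1

flagged_fraction = np.mean(labels == -1)
print(f"offset_: {model.offset_:.4f}")
print(f"fraction flagged on the training data: {flagged_fraction:.3f}")
```

Because the threshold is a percentile of the training scores, the model flags roughly the requested fraction whether or not that many true anomalies exist. If the real anomaly rate is unknown, ranking points by `score_samples` can be more informative than the hard labels.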

In [None]:
# Initialize and train the Isolation Forest
iso_forest = IsolationForest(n_estimators=100, contamination=0.03, random_state=42)
data['Anomaly_Label'] = iso_forest.fit_predict(X_scaled)

# The model outputs 1 for normal data, and -1 for anomalies.
print("Anomaly Counts:")
print(data['Anomaly_Label'].value_counts())

### 4. Visualizing the Detected Anomalies
Now we can plot the original data as a scatter plot and highlight the points the model flagged as anomalies (`-1`).

In [None]:
# Map the numeric labels to readable names so the legend entries
# are guaranteed to match the plotted colors
data['Status'] = data['Anomaly_Label'].map({1: 'Normal (1)', -1: 'Anomaly (-1)'})

plt.figure(figsize=(10, 6))
sns.scatterplot(
    x='RPM',
    y='Response_Time_ms',
    hue='Status',
    data=data,
    palette={'Normal (1)': 'green', 'Anomaly (-1)': 'red'},
    s=50,
    alpha=0.8
)
plt.title("Isolation Forest: Normal Traffic vs Anomalies")
plt.xlabel("Requests Per Minute (RPM)")
plt.ylabel("Response Time (ms)")
plt.legend(title="Status")
plt.show()

# SRE Insight: The dense green cluster represents healthy operation.
# The isolated red points represent the DB Lock (High Latency),
# Upstream Failure (Low RPM), and DDoS (High RPM, Elevated Latency).

### 5. Time-Series Overlay of Anomalies
Scatter plots are great for seeing data shapes, but as SREs, we view data over time on dashboards (like Grafana). Let's overlay the anomalies back onto our timeline.

In [None]:
anomalies = data[data['Anomaly_Label'] == -1]

plt.figure(figsize=(14, 7))

# Plot Normal Data Lines
plt.plot(data['Timestamp'], data['RPM'], label='RPM', color='blue', alpha=0.3)
plt.plot(data['Timestamp'], data['Response_Time_ms'] * 10, label='Response Time (ms * 10)', color='orange', alpha=0.3)

# Overlay Anomalies as scatter points
plt.scatter(anomalies['Timestamp'], anomalies['RPM'], color='red', label='Anomaly Alert (RPM)', s=50)
plt.scatter(anomalies['Timestamp'], anomalies['Response_Time_ms'] * 10, color='darkred', label='Anomaly Alert (Latency)', s=50)

plt.title("Simulated Grafana Dashboard: Automated Anomaly Alerts")
plt.xlabel("Time")
plt.ylabel("Metrics Scale")
plt.legend()
plt.tight_layout()
plt.show()

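# Conclusion: the flagged points should cluster around the timestamps of the
# 3 injected incidents, all without us having to write hardcoded threshold
# rules (like 'alert if RPM < 500'). Note that contamination=0.03 flags about
# 30 points while only 28 were injected, so a few flags will be false positives.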
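
Since the anomalies here were injected at known positions, the visual impression can be checked numerically. The sketch below is self-contained: it rebuilds the dataset with the same seed and injection ranges as the generation cell, fits the same model, and compares the flags against the injected indices. The names `df`, `injected`, and `flagged` are illustrative, and in production such ground truth is rarely available.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n_samples = 1000

# Rebuild the telemetry with the same draws and injections as above
df = pd.DataFrame({
    'RPM': np.random.normal(loc=5000, scale=300, size=n_samples),
    'Response_Time_ms': np.random.normal(loc=45, scale=5, size=n_samples),
})
df.loc[300:305, 'Response_Time_ms'] = [250, 300, 310, 280, 200, 150]
df.loc[600:610, 'RPM'] = np.random.normal(loc=100, scale=50, size=11)
df.loc[850:860, 'RPM'] = np.random.normal(loc=15000, scale=1000, size=11)
df.loc[850:860, 'Response_Time_ms'] = np.random.normal(loc=120, scale=20, size=11)

# Ground truth: the inclusive index ranges that were overwritten
injected = np.zeros(n_samples, dtype=bool)
for start, stop in [(300, 305), (600, 610), (850, 860)]:
    injected[start:stop + 1] = True

X_eval = StandardScaler().fit_transform(df[['RPM', 'Response_Time_ms']])
model = IsolationForest(n_estimators=100, contamination=0.03, random_state=42)
flagged = model.fit_predict(X_eval) == -1

true_pos = int(np.sum(flagged & injected))
precision = true_pos / max(int(flagged.sum()), 1)
recall = true_pos / int(injected.sum())

print(f"injected: {int(injected.sum())}, flagged: {int(flagged.sum())}")
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

Recall shows how many injected points were caught. Precision shows how many alerts were real, which is the number that drives alert fatigue on call.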