# Clustering for Incident and Workload Analysis

## Context
In SRE and DevOps, we often deal with massive amounts of unlabeled data, such as API request logs, server telemetry, or user behavior metrics. Clustering is an **Unsupervised Learning** technique that helps us discover inherent groupings in this data without needing pre-labeled examples.

For instance, we can use clustering to:
- Group similar API endpoints based on their performance profiles (Latency vs. Error Rate).
- Discover distinct workload patterns across a fleet of microservices.
- Detect anomalies (data points that do not fit into any normal cluster).

## Objectives
- Generate synthetic API telemetry data representing different endpoint behaviors.
- Use **K-Means Clustering** to group healthy, slow, and failing endpoints.
- Use **DBSCAN** to identify isolated anomalies (outliers) in the telemetry data.
- Implement **Hierarchical Clustering** to visualize endpoint relationships in a dendrogram.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

plt.style.use('ggplot')

### 1. Generating API Telemetry Data
We will synthesize a dataset of 300 APIs with two primary metrics:
- **Latency (ms)**
- **Error Rate (%)**

We expect the APIs to naturally form 3 clusters:
1. **Healthy:** Low latency (e.g., ~50ms), Low errors (e.g., ~0.1%)
2. **Slow / Performance Issues:** High latency (e.g., ~500ms), Low errors
3. **Failing / Reliability Issues:** Low latency, High errors (e.g., ~5%)

We will also add 20 "noise" points (anomalous endpoints) drawn uniformly from 100-800 ms latency and 1-8% error rate.

In [None]:
np.random.seed(42)

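# Create 3 distinct clusters.
# make_blobs expects cluster_std as a scalar or one value per cluster (not per feature),
# so draw unit-variance blobs around the origin, then rescale and shift them per feature.
blob_centers = np.array([[50, 0.1], [500, 0.2], [40, 5.0]])  # Healthy, Slow, Failing
feature_std = np.array([10, 0.05])  # Standard deviation for Latency and Error Rate
unit_blobs, blob_ids = make_blobs(
    n_samples=300,
    centers=np.zeros((3, 2)),
    cluster_std=1.0,
    random_state=42
)
X_clusters = blob_centers[blob_ids] + unit_blobs * feature_std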

# Add some random anomalies (noise)
noise = np.random.uniform(low=[100, 1.0], high=[800, 8.0], size=(20, 2))
X = np.vstack([X_clusters, noise])

# Put into DataFrame for easier handling
df = pd.DataFrame(X, columns=['Latency_ms', 'Error_Rate_pct'])

# Scale the features before clustering. K-Means, DBSCAN, and Ward linkage all rely on
# Euclidean distance, so unscaled Latency (tens to hundreds of ms) would dominate Error Rate (0-8%).
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# Visualize the unclustered data
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Latency_ms', y='Error_Rate_pct', data=df, s=50, alpha=0.7)
plt.title("Raw API Telemetry (Unlabeled)")
plt.xlabel("Average Latency (ms)")
plt.ylabel("Error Rate (%)")
plt.show()
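
To make the scaling argument concrete, the following sketch compares distances before and after standardization. The four-row `toy` array is hypothetical (it is not the dataset generated above), and the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy telemetry: [latency in ms, error rate in percent].
toy = np.array([
    [50.0, 0.1],   # healthy
    [500.0, 0.2],  # slow
    [40.0, 5.0],   # failing
    [60.0, 0.1],   # healthy
])

# Unscaled Euclidean distances from the first healthy endpoint.
raw_failing = np.linalg.norm(toy[0] - toy[2])  # healthy vs. failing
raw_healthy = np.linalg.norm(toy[0] - toy[3])  # healthy vs. healthy

# StandardScaler applies z = (x - mean) / std per column (population std).
toy_scaled = StandardScaler().fit_transform(toy)
manual_scaled = (toy - toy.mean(axis=0)) / toy.std(axis=0)

scaled_failing = np.linalg.norm(toy_scaled[0] - toy_scaled[2])
scaled_healthy = np.linalg.norm(toy_scaled[0] - toy_scaled[3])

print(f"Raw distance ratio (failing / healthy):    {raw_failing / raw_healthy:.2f}")
print(f"Scaled distance ratio (failing / healthy): {scaled_failing / scaled_healthy:.2f}")
```

Without scaling, the failing endpoint looks almost as close to a healthy one as another healthy endpoint does, because a 10 ms latency gap outweighs a gap of nearly 5 percentage points in error rate. After scaling, the error-rate difference carries comparable weight.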

### 2. K-Means Clustering
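K-Means partitions the dataset into exactly `K` clusters, aiming to minimize the within-cluster sum of squared distances. It alternates between assigning each point to the nearest centroid and moving each centroid to the mean of its assigned points, until the assignments stop changing.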
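
As a rough illustration of what happens inside K-Means, the sketch below implements the classic Lloyd iteration in plain NumPy. The function `lloyd_kmeans` and the toy data are hypothetical; the caller supplies the initial centroids, which is a simplification (scikit-learn uses k-means++ initialization by default).

```python
import numpy as np

def lloyd_kmeans(X, init_centroids, n_iter=100):
    """Minimal Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    k = len(centroids)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # Centroids stopped moving, so assignments are stable.
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
X_toy = np.vstack([blob_a, blob_b])

# Start from one point in each blob (first and last row).
labels, centroids = lloyd_kmeans(X_toy, X_toy[[0, -1]])
print(centroids.round(2))
```

On data this cleanly separated the loop settles after a couple of iterations, with each centroid sitting at the mean of one blob.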

Here, we'll tell K-Means to look for $K=3$ clusters.

In [None]:
# Apply K-Means (n_init='auto' requires scikit-learn >= 1.2; use an integer such as 10 on older versions)
kmeans = KMeans(n_clusters=3, random_state=42, n_init='auto')
kmeans_labels = kmeans.fit_predict(X_scaled)
df['KMeans_Cluster'] = kmeans_labels

# To plot centroids on original scale, inverse transform them
centers = scaler.inverse_transform(kmeans.cluster_centers_)

plt.figure(figsize=(8, 6))
sns.scatterplot(x='Latency_ms', y='Error_Rate_pct', hue='KMeans_Cluster', data=df, palette='Set1', s=60, alpha=0.8)
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, marker='X', label='Centroids')

plt.title("K-Means Clustering of APIs (K=3)")
plt.xlabel("Average Latency (ms)")
plt.ylabel("Error Rate (%)")
plt.legend()
plt.show()

# Insight: K-Means should separate the Healthy, Slow, and Failing groups.
# Notice that it has no notion of noise: every anomaly is forced into one of the 3 clusters,
# which also pulls the centroids slightly away from the dense groups.

#### Choosing the right 'K' (The Elbow Method)
If we didn't know in advance that there are 3 intrinsic groups, how would we pick `K`? 
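We can calculate the **Inertia** (the sum of squared distances from each point to its assigned centroid) for a range of `K` values. Inertia generally decreases as `K` grows, so we look for the "elbow": the point beyond which adding clusters yields only small improvements. This is a heuristic rather than a guarantee of the optimal `K`.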
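
To show what inertia measures, here is a small sketch that recomputes it by hand and compares it with scikit-learn's `inertia_` attribute. The toy blobs and the helper `manual_inertia` are hypothetical and independent of the telemetry dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical toy data: three tight 2-D blobs of 40 points each.
toy_centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X_toy = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in toy_centers])

def manual_inertia(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    return float(((X - centroids[labels]) ** 2).sum())

manual, reported = {}, {}
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    manual[k] = manual_inertia(X_toy, km.labels_, km.cluster_centers_)
    reported[k] = km.inertia_

# Inertia gained by each extra cluster; a large drop followed by small ones marks the elbow.
drops = {k: manual[k - 1] - manual[k] for k in range(2, 6)}
for k in range(1, 6):
    print(f"K={k}: manual={manual[k]:.2f}  sklearn={reported[k]:.2f}")
print({k: round(v, 2) for k, v in drops.items()})
```

With three well-separated blobs, the drop from `K=2` to `K=3` should be far larger than the drop from `K=3` to `K=4`, which is the pattern the elbow plot makes visible.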

In [None]:
inertia = []
K_range = range(1, 8)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init='auto')
    km.fit(X_scaled)
    inertia.append(km.inertia_)
    
plt.figure(figsize=(7, 4))
plt.plot(K_range, inertia, marker='o', linestyle='--')
plt.title('Elbow Method For Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.axvline(x=3, color='blue', linestyle=':')
plt.show()
# A pronounced bend at K=3 suggests that 3 is the natural number of groupings here.

### 3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Unlike K-Means, DBSCAN does not require specifying the number of clusters in advance. It groups points that are densely packed together, and marks points in low-density regions as **outliers (noise)**.

This often suits SRE use cases well, because incident detection is largely a matter of separating anomalies (noise) from regular workload patterns (dense clusters).

**Key Parameters:**
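- **`eps`**: The maximum distance between two points for them to be considered neighbors. Because we cluster the scaled data, `eps` is expressed in standard deviations, not in ms or %.
- **`min_samples`**: The minimum number of points (including the point itself) that must lie within `eps` of a point for it to count as a core point of a dense region.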
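
The interaction between the two parameters can be sketched with a nearest-neighbor query: a point is a core point exactly when its `min_samples`-th nearest neighbor (counting itself) lies within `eps`. The toy data below is hypothetical, and the use of `NearestNeighbors` is an illustration, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Hypothetical toy data: one dense blob plus a few widely scattered points.
dense = rng.normal(loc=[0.0, 0.0], scale=0.2, size=(60, 2))
sparse = rng.uniform(low=3.0, high=10.0, size=(6, 2))
X_toy = np.vstack([dense, sparse])

eps, min_samples = 0.5, 5

# Distance from each point to its min_samples-th nearest neighbor, counting
# the point itself (querying the training data returns self at distance 0).
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_toy)
distances, _ = nn.kneighbors(X_toy)
kth_dist = distances[:, -1]

# Core points: at least min_samples points (itself included) within eps.
core_by_hand = np.flatnonzero(kth_dist <= eps)

db = DBSCAN(eps=eps, min_samples=min_samples).fit(X_toy)
print("Core points found by hand:", len(core_by_hand))
print("Core points from DBSCAN:  ", len(db.core_sample_indices_))
print("Points labeled noise (-1):", int((db.labels_ == -1).sum()))

# Largest k-th neighbor distances; a sharp jump hints at where eps could sit.
print(np.sort(kth_dist)[-8:].round(2))
```

Sorting `kth_dist` and looking for the jump between dense and sparse points is a common heuristic for picking `eps` once `min_samples` is fixed.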

In [None]:
# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)
df['DBSCAN_Cluster'] = dbscan_labels

plt.figure(figsize=(8, 6))
sns.scatterplot(x='Latency_ms', y='Error_Rate_pct', hue='DBSCAN_Cluster', data=df, palette='Dark2', s=60, alpha=0.9)

plt.title("DBSCAN Clustering (Cluster -1 = Anomalies)")
plt.xlabel("Average Latency (ms)")
plt.ylabel("Error Rate (%)")
plt.legend(title='DBSCAN Label')
plt.show()

# Insight: DBSCAN should find 3 main clusters (0, 1, 2) and label most of the scattered
# noise points as anomalies (-1). A noise point that happens to land within eps of a dense
# cluster can still be absorbed into it, so check the label counts rather than assuming.
# The '-1' APIs are candidates for alerting or deep-dive investigation.

### 4. Hierarchical Clustering
Hierarchical clustering builds a "tree" of clusters (a dendrogram), showing how individual data points merge together into the final clusters. This is helpful for understanding the hierarchical taxonomy of your services. 
*(We will use a small subset of the data so the dendrogram is readable)*

In [None]:
# Take a small random sample (30 points) for visualization
df_sample = df.sample(n=30, random_state=42)
X_subset_scaled = scaler.transform(df_sample[['Latency_ms', 'Error_Rate_pct']])

# Calculate the linkage matrix using Ward's method, which at each step merges the
# two clusters whose union gives the smallest increase in within-cluster variance
Z = linkage(X_subset_scaled, method='ward')

plt.figure(figsize=(10, 6))
dendrogram(
    Z, 
    labels=df_sample.index.to_numpy(), 
    leaf_rotation=90.,
    color_threshold=3.5 # Threshold to color the distinct branches
)
plt.title("Hierarchical Clustering Dendrogram (Subset)")
plt.xlabel("API Data Point Index")
plt.ylabel("Distance (Ward)")
plt.show()

# Insight: At a high level the tree should split into 3 primary branches,
# mirroring our 3 main groups (Healthy, Slow, Failing). Any noise points that
# landed in the sample typically join a branch late, at a large merge distance.
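
A dendrogram can also be cut into flat cluster labels, which is handy when you want assignments rather than a picture. The sketch below uses SciPy's `fcluster` on hypothetical toy blobs (not the sampled telemetry) and asks for at most 3 clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Hypothetical toy data: three tight 2-D blobs of 10 points each.
toy_centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X_toy = np.vstack([rng.normal(c, 0.3, size=(10, 2)) for c in toy_centers])

# Build the Ward linkage matrix; each row records one merge as
# [cluster_a, cluster_b, merge_distance, size_of_new_cluster].
Z_toy = linkage(X_toy, method="ward")

# Cut the tree so that at most 3 flat clusters remain (labels start at 1).
flat_labels = fcluster(Z_toy, t=3, criterion="maxclust")
print(flat_labels)

# The last two merge distances are the ones that join the three blobs together.
print(Z_toy[-2:, 2].round(2))
```

The same call could be applied to the notebook's `Z` matrix; using `criterion="distance"` with a threshold instead would cut the tree at a fixed height, similar in spirit to `color_threshold` in the dendrogram.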