# Statistical Observability with Seaborn

## Context
While Matplotlib is excellent for building fundamental charts (lines, bars, histograms), SRE and DevOps work often requires analyzing complex, multi-dimensional relationships. During an incident, you often need to ask:
- "Is there a correlation between network traffic, CPU utilization, and error rates?"
- "How is the response time distributed across our 5 Kubernetes clusters?"
- "Which microservices are experiencing the most latency right now?"

Seaborn is a statistical plotting library built on top of Matplotlib that makes answering these questions visually intuitive.

## Objectives
- Use **Heatmaps** to quickly spot correlations between operational metrics or identify overloaded servers.
- Use **Violin Plots** to understand the true distribution of latency across different infrastructure components.
- Use **Pairplots** to scan every pair of system metrics for correlated bottlenecks in a single call.

## Expected Outcome
- You will be able to generate advanced statistical visualizations that provide deeper insights than standard dashboards.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Set a dark grid theme which looks great for dashboards
sns.set_theme(style="darkgrid")

### 1. The Correlation Heatmap (Finding the Bottleneck)
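During a complex outage, multiple alarms fire at once. A correlation heatmap helps you determine whether metrics are moving together (e.g., as Network In goes up, does CPU go up? Are Error Rates correlated with Memory?). Note that `DataFrame.corr()` computes the Pearson correlation by default, so it measures linear association only and does not by itself establish cause.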

We will simulate metrics for a web server during a load test.

In [None]:
# Simulate 100 minutes of system metrics
np.random.seed(42)
time = np.arange(100)
network_in = np.random.normal(500, 100, 100) # Mbps
cpu_util = network_in * 0.1 + np.random.normal(5, 2, 100) # CPU scales with network
memory_util = np.random.normal(60, 5, 100) # Memory is relatively stable
disk_io = np.random.normal(200, 50, 100) # Random spikes
error_rate = cpu_util * 0.05 + np.random.normal(0, 0.5, 100) # Errors scale slightly with CPU
error_rate = np.clip(error_rate, 0, None) # No negative errors

df_metrics = pd.DataFrame({
    'Network In (Mbps)': network_in,
    'CPU (%)': cpu_util,
    'Memory (%)': memory_util,
    'Disk IOPS': disk_io,
    'Error Rate (%)': error_rate
})

# Calculate the correlation matrix
corr_matrix = df_metrics.corr()

plt.figure(figsize=(8, 6))
# annot=True shows the actual numbers. cmap specifies the color scale.
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, fmt=".2f")
plt.title('System Metrics Correlation Matrix (Incident #5912)')
plt.show()

# INSIGHT: Notice the strong positive correlation between 'Network In' and 'CPU', 
# and a moderate correlation between 'CPU' and 'Error Rate'.
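
With more than a handful of metrics, reading the strongest relationships off the heatmap by eye gets slow. As a rough sketch, the same correlation matrix can also be ranked programmatically. The helper name `top_correlated_pairs` and the demo column names are illustrative assumptions, not part of the notebook above; in practice you would pass `df_metrics`.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=3):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once and the
    # diagonal (always 1.0) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by magnitude so strong negative correlations rank high as well.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Hypothetical demo data: CPU follows network traffic, memory is independent.
rng = np.random.default_rng(0)
net_in = rng.normal(500, 100, 200)
demo = pd.DataFrame({
    "net_in": net_in,
    "cpu": net_in * 0.1 + rng.normal(5, 2, 200),
    "mem": rng.normal(60, 5, 200),
})

top = top_correlated_pairs(demo, n=1)
print(top)
```

The result is a Series indexed by `(metric_a, metric_b)` tuples, which is convenient to log or attach to an incident ticket alongside the heatmap image.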

### 2. Violin Plots (Latency Distribution across Clusters)
Box plots summarize a distribution with a handful of quantiles. Violin plots add a kernel density estimate (a smoothed approximation, not the exact density), so they show the shape of the data at each value. If you have a "bimodal" distribution (e.g., some requests are very fast, some are hitting a slow cache), a violin plot will reveal it clearly, while a box plot will hide it.
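
The claim about bimodal data can be checked numerically before drawing anything. The sketch below uses hypothetical cache hit and miss latencies; the variable names and parameters are assumptions chosen for illustration. It computes the three quartiles a box plot would draw, then measures how much of the data actually sits near the median.

```python
import numpy as np

rng = np.random.default_rng(10)
fast = rng.normal(30, 5, 250)     # e.g. cache hits (assumed parameters)
slow = rng.normal(250, 30, 250)   # e.g. cache misses (assumed parameters)
latency = np.concatenate([fast, slow])

# These three numbers are what the box in a box plot is built from.
q1, median, q3 = np.percentile(latency, [25, 50, 75])

# Fraction of requests within 25 ms of the median.
near_median = np.mean(np.abs(latency - median) < 25)

print(f"Q1={q1:.0f} ms, median={median:.0f} ms, Q3={q3:.0f} ms")
print(f"share of requests within 25 ms of the median: {near_median:.1%}")
```

The box spans the gap between the two modes, and the median line is drawn where almost no requests actually land. A density-based plot makes that gap visible.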

In [None]:
# Simulate API response times across three different Kubernetes clusters
np.random.seed(10)

# Cluster A is healthy (normal distribution)
cluster_a = np.random.normal(100, 20, 500)

# Cluster B is struggling with a long tail (exponential distribution)
cluster_b = np.random.exponential(150, 500) + 50 

# Cluster C has a bimodal distribution (e.g., cache hits vs cache misses)
cache_hits = np.random.normal(30, 5, 250)
cache_misses = np.random.normal(250, 30, 250)
cluster_c = np.concatenate([cache_hits, cache_misses])

# Build DataFrame
df_latency = pd.DataFrame({
    'Latency (ms)': np.concatenate([cluster_a, cluster_b, cluster_c]),
    'Cluster': ['us-east-1a']*500 + ['us-east-1b']*500 + ['us-west-2a']*500
})

plt.figure(figsize=(10, 6))
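# Assign x to hue as well: seaborn 0.13+ deprecates passing palette without hue.
sns.violinplot(data=df_latency, x='Cluster', y='Latency (ms)', hue='Cluster', palette='Set2', legend=False)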
plt.title('API Latency Distribution by K8s Cluster')
plt.axhline(200, color='red', linestyle='--', label='SLA Limit (200ms)')
plt.legend()
plt.show()

# INSIGHT: 
# - us-east-1a is stable.
# - us-east-1b has severe long-tail latency pushing the p99 far above the SLA line.
# - us-west-2a clearly shows two distinct humps (bimodal), indicating two very different code paths (like cache hit/miss).
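
The visual reading of the tail can be backed with numbers. The following is a minimal sketch, assuming a 200 ms SLA as in the plot above; the column names, cluster labels, and helper functions (`p99`, `sla_breach_rate`) are hypothetical. In the notebook you would group `df_latency` by `'Cluster'` instead.

```python
import numpy as np
import pandas as pd

SLA_MS = 200  # assumed SLA threshold, matching the dashed line in the plot

rng = np.random.default_rng(10)
df = pd.DataFrame({
    "latency_ms": np.concatenate([
        rng.normal(100, 20, 500),         # healthy cluster
        rng.exponential(150, 500) + 50,   # long-tailed cluster
    ]),
    "cluster": ["healthy"] * 500 + ["long-tail"] * 500,
})

def p99(s):
    """99th percentile of a latency series."""
    return s.quantile(0.99)

def sla_breach_rate(s):
    """Fraction of requests slower than the SLA threshold."""
    return (s > SLA_MS).mean()

# One row per cluster; column names come from the function names.
summary = df.groupby("cluster")["latency_ms"].agg(["median", p99, sla_breach_rate])
print(summary.round(3))
```

Comparing the median with the p99 per cluster shows how far the tail stretches, which the median alone would not reveal.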

### 3. Pairplot (Hunting for Unknown Relationships)
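When investigating a completely unknown issue in a new service, you might not know what to look for. `sns.pairplot()` plots every numeric column against every other numeric column in a single call. It is the ultimate "scattershot" exploratory tool, though the number of panels grows quadratically with the number of columns, so it works best on a handful of metrics.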
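
To get a feel for how large a pairplot grid becomes, the panel count can be worked out with simple arithmetic. The helper `pairplot_panels` below is hypothetical and does not call seaborn; it only counts one diagonal panel per column plus one panel per column pair, doubled when the mirrored upper triangle is drawn too.

```python
from itertools import combinations

def pairplot_panels(columns, corner=True):
    """Count the panels a pairplot-style grid draws for the given columns."""
    n_diagonal = len(columns)
    n_pairs = len(list(combinations(columns, 2)))
    # A full grid draws each pair twice (mirrored); corner=True keeps one copy.
    return n_diagonal + (n_pairs if corner else 2 * n_pairs)

metrics = ["Network In (Mbps)", "CPU (%)", "Memory (%)", "Disk IOPS", "Error Rate (%)"]
print(pairplot_panels(metrics, corner=True))    # lower triangle plus diagonal
print(pairplot_panels(metrics, corner=False))   # full square grid
print(pairplot_panels([f"m{i}" for i in range(30)]))  # a wide metrics table
```

For wide tables it is usually worth narrowing the columns first, for example with the `vars` argument of `sns.pairplot()`.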

In [None]:
# Let's use our df_metrics from the heatmap example
sns.pairplot(df_metrics, diag_kind='kde', corner=True, plot_kws={'alpha': 0.6})
plt.suptitle('Pairwise Relationships of System Metrics', y=1.02)
plt.show()

# Diagonal: shows the distribution (kernel density estimate) of each single metric.
# Scatter plots: show how two metrics interact (corner=True draws only the lower triangle).
# The linear relationship between Network In and CPU is obvious here.

### 4. Load Distribution Heatmap
Another common SRE use case for heatmaps: visualizing traffic distribution across a fleet of servers over time. This helps spot imbalanced load balancers or noisy neighbors.

In [None]:
# Simulate 24 hours of connection counts for 10 backend servers
hours = [f"{h:02d}:00" for h in range(24)]
servers = [f"web-node-{i:02d}" for i in range(1, 11)]

# Generate normal traffic pattern
base_traffic = np.sin(np.linspace(0, np.pi, 24)) * 1000 + 500
traffic_matrix = np.array([base_traffic + np.random.normal(0, 50, 24) for _ in range(10)])

# Simulate a "noisy neighbor" or bad routing on web-node-04 where it takes 3x traffic briefly
traffic_matrix[3, 14:18] *= 3

df_traffic = pd.DataFrame(traffic_matrix, index=servers, columns=hours)

plt.figure(figsize=(14, 6))
sns.heatmap(df_traffic, cmap='YlOrRd', linewidths=.5)
plt.title('Active Connections per Node over 24 Hours')
plt.xlabel('Hour of Day')
plt.ylabel('Server Name')
plt.show()

# INSIGHT: The heatmap makes the traffic spike on web-node-04 (about 3x its peers, from 14:00 through the 17:00 bucket) immediately obvious.
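
The same outlier can be flagged without looking at the picture, which is useful for alerting. This is a sketch under assumptions: the 2x threshold and the variable names are arbitrary choices for illustration, and the data is regenerated here so the block runs on its own. Each cell is compared with the fleet median for that hour, which a single hot node barely shifts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
hours = [f"{h:02d}:00" for h in range(24)]
servers = [f"web-node-{i:02d}" for i in range(1, 11)]

base = np.sin(np.linspace(0, np.pi, 24)) * 1000 + 500
traffic = np.array([base + rng.normal(0, 50, 24) for _ in servers])
traffic[3, 14:18] *= 3  # same injected anomaly as above: web-node-04, 14:00-17:00
df = pd.DataFrame(traffic, index=servers, columns=hours)

# Divide each hourly column by the fleet median for that hour.
ratio = df / df.median(axis=0)

# Keep only cells carrying more than twice the typical load (assumed threshold).
hot = ratio[ratio > 2.0].stack().dropna()
print(hot.round(2))
```

The result is indexed by `(server, hour)`, so it can be turned directly into an alert message or used to annotate the heatmap.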