# Visualizing Infrastructure Metrics with Matplotlib

## Context
As an SRE, you constantly look at Grafana dashboards, Datadog charts, and AWS CloudWatch metrics. But what happens when you need to generate a custom report for a post-mortem, or visualize a highly specific metric that isn't supported by out-of-the-box tools? You write it yourself using Matplotlib.

## Objectives
- Learn how to generate the three fundamental chart types used in SRE work: line plots (time series), bar charts (categorical distribution), and histograms (latency buckets), plus a stacked subplot layout for correlating two metrics.
- Understand how to add context (threshold lines, annotations) to make charts readable during high-stress incidents.

## Expected Outcome
- You will be able to take raw metric data and turn it into clear, professional visualizations ready for a Root Cause Analysis (RCA) document.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

### 1. The Line Plot (Time-Series Metrics)
Line plots are the core of observability. You use them for CPU, memory, network traffic, and error rates over time. Here we simulate a memory leak over a 24-hour period.

In [None]:
# Simulate 24 hours of memory utilization (%), one reading per hour
np.random.seed(42)                   # Fixed seed so the chart is reproducible
hours = np.arange(0, 24)
base_memory = 40
leak = np.linspace(0, 50, 24)        # Memory slowly creeps up
noise = np.random.normal(0, 3, 24)   # Add realistic fluctuation
memory_utilization = base_memory + leak + noise

plt.figure(figsize=(10, 5))

# Create the main plot
plt.plot(hours, memory_utilization, marker='o', linestyle='-', color='b', label='app-server-1 Memory %')

# Adding context (CRITICAL for RCA documents)
plt.axhline(y=80, color='r', linestyle='--', label='80% Alert Threshold')
plt.axvline(x=18, color='orange', linestyle=':', label='OOM Kill Event')

plt.title('Memory Utilization Over 24h (Suspected Memory Leak)')
plt.xlabel('Hour of Day')
plt.ylabel('Memory Utilization (%)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
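
The threshold and event lines above are placed by hand. As a rough sketch of automating that step, the snippet below finds the first reading at or above the alert threshold and labels it with `ax.annotate`. The helper names `first_breach_index` and `plot_with_breach`, the sample readings, and the 80% threshold are illustrative assumptions, not part of the original exercise.

```python
import numpy as np
import matplotlib.pyplot as plt


def first_breach_index(values, threshold):
    """Return the index of the first value >= threshold, or None if never breached.

    Hypothetical helper, for illustration only.
    """
    above = np.asarray(values) >= threshold
    if not above.any():
        return None
    return int(np.argmax(above))  # argmax on booleans gives the first True position


def plot_with_breach(hours, memory, threshold=80):
    """Line plot with the alert threshold and an arrow at the first breach."""
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(hours, memory, marker='o', label='Memory %')
    ax.axhline(y=threshold, color='r', linestyle='--',
               label=f'{threshold}% Alert Threshold')

    idx = first_breach_index(memory, threshold)
    if idx is not None:
        ax.annotate(
            f'First breach: hour {hours[idx]}',
            xy=(hours[idx], memory[idx]),              # point the arrow targets
            xytext=(hours[idx] - 6, memory[idx] + 5),  # where the label text sits
            arrowprops=dict(arrowstyle='->'),
        )
    ax.set_xlabel('Hour of Day')
    ax.set_ylabel('Memory Utilization (%)')
    ax.legend()
    return fig, idx


# Made-up readings (assumption): a steady climb from 40% to 90% over 24 hours
sample_hours = np.arange(24)
sample_memory = np.linspace(40, 90, 24)
fig, breach = plot_with_breach(sample_hours, sample_memory)
print(f'First breach at index: {breach}')
```

To embed the result in an RCA document, `fig.savefig('memory_leak.png', dpi=150, bbox_inches='tight')` writes it to disk; the filename here is only a placeholder.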

### 2. The Bar Chart (Categorical Distribution)
Bar charts are excellent for comparing discrete categories, such as error counts per microservice or cost breakdown per AWS resource.

In [None]:
# Simulating HTTP 500 error counts across different microservices during an outage
services = ['auth-api', 'payment-gw', 'search-svc', 'inventory-db']
error_counts = [150, 4200, 12, 0]

plt.figure(figsize=(8, 5))

# Using colors to highlight the problem area
colors = ['gray', 'red', 'gray', 'gray'] 

bars = plt.bar(services, error_counts, color=colors)

plt.title('HTTP 500 Errors by Service (Outage Incident 409)')
plt.xlabel('Microservice')
plt.ylabel('Error Count')
# Symlog scale: log-like spacing for counts that differ by orders of magnitude,
# but unlike plain 'log' it can still place the zero-error service on the axis
plt.yscale('symlog')

# Add exact numbers on top of the bars
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval * 1.1, str(int(yval)), ha='center', va='bottom')

plt.show()
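
The counts above are typed in by hand; in practice they usually come from raw request logs. Here is a minimal sketch of that aggregation step, assuming each log record has already been parsed into a `(service, status_code)` tuple. The record shape, the sample data, and the function name `count_5xx_by_service` are assumptions for illustration, and the sketch counts every 5xx status rather than only HTTP 500.

```python
from collections import Counter


def count_5xx_by_service(records, services):
    """Count HTTP 5xx responses per service.

    records:  iterable of (service, status_code) tuples (assumed log shape)
    services: every service to report, so healthy ones still show up as 0
    """
    counts = Counter(
        service for service, status in records if 500 <= status <= 599
    )
    # Counter returns 0 for keys it has never seen, which keeps healthy services
    return [counts[name] for name in services]


# Made-up sample records (assumption)
sample_records = [
    ('auth-api', 200), ('auth-api', 500),
    ('payment-gw', 500), ('payment-gw', 503), ('payment-gw', 500),
    ('search-svc', 200), ('inventory-db', 200),
]
sample_services = ['auth-api', 'payment-gw', 'search-svc', 'inventory-db']
sample_counts = count_5xx_by_service(sample_records, sample_services)
print(dict(zip(sample_services, sample_counts)))
```

The two lists can then be passed straight to `plt.bar(sample_services, sample_counts)`.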

### 3. The Histogram (Latency Buckets)
When looking at latency (response times), averages (means) lie to you. A p99 latency spike can be hidden by thousands of fast requests. Histograms group data into "buckets" so you can see the true distribution.

In [None]:
# Simulate 1000 API request latencies in milliseconds
# Most are fast (around 50ms), but some are very slow (long tail)
np.random.seed(42)
fast_requests = np.random.normal(loc=50, scale=10, size=900)
slow_requests = np.random.normal(loc=300, scale=50, size=100) # The "long tail"

latencies = np.concatenate([fast_requests, slow_requests])

plt.figure(figsize=(10, 5))

# bins=50 splits the latency range into 50 equal-width buckets
plt.hist(latencies, bins=50, color='purple', edgecolor='black', alpha=0.7)

plt.title('API Response Time Distribution (Notice the long tail)')
plt.xlabel('Latency (ms)')
plt.ylabel('Number of Requests')

# Add lines for percentiles
p95 = np.percentile(latencies, 95)
p99 = np.percentile(latencies, 99)

plt.axvline(p95, color='orange', linestyle='dashed', linewidth=2, label=f'p95: {p95:.1f}ms')
plt.axvline(p99, color='red', linestyle='dashed', linewidth=2, label=f'p99: {p99:.1f}ms')

plt.legend()
plt.show()
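
To put numbers on the claim that averages hide tail latency, the sketch below rebuilds a comparable fast/slow mixture and compares the mean with the median, p95, and p99. It uses NumPy's `default_rng` rather than the global seed used above, so the exact values will differ from the histogram. The helper name `latency_summary` is an assumption for illustration.

```python
import numpy as np


def latency_summary(latencies):
    """Return the mean and key percentiles of a latency sample, in ms."""
    lat = np.asarray(latencies, dtype=float)
    return {
        'mean': float(lat.mean()),
        'p50': float(np.percentile(lat, 50)),
        'p95': float(np.percentile(lat, 95)),
        'p99': float(np.percentile(lat, 99)),
    }


# Same mixture as the histogram: 900 fast requests, 100 slow ones
rng = np.random.default_rng(42)  # local generator (assumption), not the global seed
fast = rng.normal(loc=50, scale=10, size=900)
slow = rng.normal(loc=300, scale=50, size=100)

summary = latency_summary(np.concatenate([fast, slow]))
for name, value in summary.items():
    print(f'{name}: {value:.1f} ms')
```

With this mixture the mean should land well above the median and far below p95, so it describes neither the typical request nor the slow ones.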

### 4. Subplots (Creating a Mini-Dashboard)
Often, you need to correlate two metrics on a shared time axis (e.g., CPU spikes causing latency spikes). We use `plt.subplots()` to stack the two plots vertically.

In [None]:
# Simulate data
time = np.arange(0, 60) # 60 minutes
cpu = np.random.normal(40, 5, 60)
cpu[30:40] += 50 # Massive CPU spike during minutes 30-39

latency = np.random.normal(20, 2, 60)
latency[30:40] += 80 # Latency follows the CPU spike

# Create a 2-row, 1-column layout
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8), sharex=True)

# Top plot: CPU
ax1.plot(time, cpu, color='darkblue')
ax1.set_title('CPU Utilization %')
ax1.set_ylabel('CPU %')
ax1.grid(True, alpha=0.3)
ax1.axvspan(30, 40, color='red', alpha=0.2, label='Incident Window')
ax1.legend()

# Bottom plot: Latency
ax2.plot(time, latency, color='darkred')
ax2.set_title('API Latency (ms)')
ax2.set_xlabel('Time (Minutes)')
ax2.set_ylabel('Latency (ms)')
ax2.grid(True, alpha=0.3)
ax2.axvspan(30, 40, color='red', alpha=0.2)

plt.tight_layout()
plt.show()
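
The stacked plots make the relationship visible; a correlation coefficient puts a rough number on it. Below is a minimal sketch using `np.corrcoef` on a comparable simulated series. The helper name `pearson_r` and the `default_rng` seed are assumptions for illustration. A high coefficient shows that the two metrics move together, not which one causes the other.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    # corrcoef returns a 2x2 matrix; the off-diagonal entry is r(x, y)
    return float(np.corrcoef(x, y)[0, 1])


# Rebuild a comparable simulation with a local generator (assumption)
rng = np.random.default_rng(0)
cpu = rng.normal(40, 5, 60)
cpu[30:40] += 50          # CPU spike during minutes 30-39
latency = rng.normal(20, 2, 60)
latency[30:40] += 80      # latency spike over the same window

r = pearson_r(cpu, latency)
print(f'Pearson r between CPU and latency: {r:.2f}')
```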