# Model Evaluation: Metrics for SRE

## Context
In observability and DevOps, **Accuracy**, the default Machine Learning metric, is often a trap because the events we care about (outages, attacks) are rare.

Consider an alerting system that predicts server outages. If your servers are healthy 99% of the time, a "dumb" model that simply hardcodes `return "Healthy"` every single time will achieve **99% Accuracy**. It will also miss every outage in the remaining 1%, which makes it useless as an alert.
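The arithmetic behind that claim can be checked in a few lines. This is a minimal sketch on hypothetical toy labels (990 healthy samples and 10 outages, chosen only to match the 99%/1% split above), not data from this notebook.

```python
# Hypothetical toy labels: 0 = healthy, 1 = outage (99% / 1% split).
y_true = [0] * 990 + [1] * 10

# The "dumb" model: always predicts healthy.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: fraction of real outages that the model flagged.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print(f"Accuracy: {accuracy:.2%}")
print(f"Recall:   {recall:.2%}")
```

Accuracy rewards the model for the 990 easy cases, while Recall exposes that none of the 10 outages were caught.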

To build effective ML-driven alerting, we need to read a **Confusion Matrix** and understand what **Precision**, **Recall**, **F1-Score**, and **ROC-AUC** each measure.

## Objectives
- Synthesize a highly imbalanced SRE dataset (DDoS/scraping detection).
- Build a basic classification model.
- Understand the implications of False Positives (Alert Fatigue) vs False Negatives (Missed Outages).
- Calculate and interpret Precision, Recall, F1, and the ROC-AUC score.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

### 1. Generating Imbalanced Observability Data
Let's simulate 10,000 API requests, of which 2% (200) are malicious (DDoS/Scraping).

In [None]:
np.random.seed(42)
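n_samples = 10000
n_malicious = 200                   # 2% of all requests
n_normal = n_samples - n_malicious  # remaining 98%

# 98% Normal traffic
normal_requests = pd.DataFrame({
    'Request_Rate': np.random.normal(50, 10, n_normal),
    'Payload_Size': np.random.normal(1024, 200, n_normal),
    'Status': 0
})

# 2% Malicious traffic
malicious_requests = pd.DataFrame({
    'Request_Rate': np.random.normal(300, 50, n_malicious),
    'Payload_Size': np.random.normal(5000, 1000, n_malicious),
    'Status': 1
})

# ignore_index avoids duplicate row labels after stacking the two frames
df = pd.concat([normal_requests, malicious_requests], ignore_index=True)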

X = df[['Request_Rate', 'Payload_Size']]
y = df['Status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

print("Percentage of Malicious Requests: {:.2f}%".format(y.mean() * 100))

### 2. Training the Detector

In [None]:
model = RandomForestClassifier(random_state=42, class_weight='balanced')
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
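The detector above is trained with `class_weight='balanced'`. scikit-learn documents that mode as weighting each class by `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that formula in plain Python so the effect on a 98/2 split is visible. The helper name `balanced_class_weights` is hypothetical, and the label counts assume the stratified 70% training split contains 6860 normal and 140 malicious rows.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * count_of_class)}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Assumed training labels: 6860 normal (0) and 140 malicious (1).
y_train_example = [0] * 6860 + [1] * 140

weights = balanced_class_weights(y_train_example)
for cls, weight in sorted(weights.items()):
    print(f"class {cls}: weight {weight:.4f}")
```

Under these assumptions a malicious sample counts roughly 49 times as much as a normal one during training, which pushes the model away from the "always Normal" shortcut.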

### 3. The Confusion Matrix
A confusion matrix tells us exactly the types of mistakes our model made.

- **True Negatives (TN):** Model says Normal, Traffic is Normal. (Good)
- **True Positives (TP):** Model says Attack, Traffic is Attack. (Good)
- **False Positives (FP) [Type I Error]:** Model says Attack, Traffic is Normal. (**Paging engineer at 3 AM for nothing! Alert Fatigue!**)
- **False Negatives (FN) [Type II Error]:** Model says Normal, Traffic is Attack. (**System goes down unmonitored!**)

In [None]:
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=['Normal (0)', 'Attack (1)'], yticklabels=['Normal (0)', 'Attack (1)'])
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.title('SRE Alerting Confusion Matrix')
plt.show()
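To make the four cells concrete, here is a small sketch that counts them by hand. The function `confusion_counts` and the ten labels are hypothetical and only illustrate the definitions. For binary 0/1 labels, scikit-learn's `confusion_matrix` arranges the same counts as `[[TN, FP], [FN, TP]]` (rows are actual, columns are predicted), so `cm.ravel()` returns them in the order used here.

```python
def confusion_counts(y_true, y_pred):
    """Count (tn, fp, fn, tp) for binary labels where 1 = attack, 0 = normal."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1   # attack correctly flagged
        elif actual == 0 and predicted == 1:
            fp += 1   # false alarm (alert fatigue)
        elif actual == 1 and predicted == 0:
            fn += 1   # missed attack
        else:
            tn += 1   # normal traffic left alone
    return tn, fp, fn, tp

# Hypothetical labels for ten requests.
actual    = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 0, 1, 0, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(actual, predicted)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```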

### 4. Advanced Metrics

#### **Precision: Out of all the alerts you fired, how many were real?**
Formula: `TP / (TP + FP)`
*Why it matters in SRE:* Low precision means high False Positives. This causes **Alert Fatigue**, where engineers start ignoring PagerDuty.

#### **Recall (Sensitivity): Out of all the real attacks, how many did you catch?**
Formula: `TP / (TP + FN)`
*Why it matters in SRE:* Low recall means high False Negatives. This causes **Missed Incidents** and downtime.

#### **F1-Score: The harmonic mean of Precision and Recall**
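Formula: `2 * Precision * Recall / (Precision + Recall)`
Used when you need a single number that balances the two. It is high only when both Precision and Recall are high.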
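A worked example may help tie the three formulas together. The counts below are hypothetical (45 real alerts, 5 false alarms, 15 missed attacks) and are not output from this notebook's model.

```python
# Hypothetical confusion-matrix counts for one week of alerts.
tp = 45  # attacks that triggered an alert
fp = 5   # alerts that fired on normal traffic
fn = 15  # attacks that never triggered an alert

precision = tp / (tp + fp)   # of all alerts fired, how many were real
recall = tp / (tp + fn)      # of all real attacks, how many were caught

# Harmonic mean: dominated by the weaker of the two numbers.
f1 = 2 * precision * recall / (precision + recall)

# Arithmetic mean shown only for comparison.
arithmetic_mean = (precision + recall) / 2

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
print(f"Arithmetic mean: {arithmetic_mean:.4f}")
```

The F1-Score lands below the arithmetic mean because the harmonic mean penalizes the gap between Precision and Recall.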

In [None]:
# Compute Precision and Recall once, then reuse them in the messages below
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision:.4f} (When I page you, I am correct {precision:.0%} of the time)")
print(f"Recall:    {recall:.4f} (Out of all attacks, I caught {recall:.0%} of them)")
print(f"F1-Score:  {f1_score(y_test, y_pred):.4f}")

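# On imbalanced data, Accuracy stays high even when a detector misses attacks or pages falsely,
# so Precision/Recall paint the real picture of the alerting system.
# The synthetic classes here are widely separated, so all four numbers may come out close to 1.0.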

### 5. ROC-AUC Score
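The Area Under the Receiver Operating Characteristic Curve (ROC-AUC) summarizes how well the model separates the two classes across all possible decision thresholds, rather than at a single cutoff. Equivalently, it is the probability that a randomly chosen attack receives a higher score than a randomly chosen normal request.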
- `1.0` is perfect.
- `0.5` is random guessing.
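One way to build intuition for these values is the ranking interpretation: ROC-AUC equals the fraction of (attack, normal) pairs in which the attack gets the higher score, with ties counted as half. The sketch below implements that pairwise definition directly. The function name `roc_auc_pairwise` and the six scores are hypothetical, and the quadratic loop is meant for illustration rather than large datasets.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # attack scored above normal request
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

# Hypothetical scores: four normal requests, two attacks.
labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]

auc = roc_auc_pairwise(labels, scores)
print(f"ROC-AUC: {auc:.3f}")
```

Here one normal request (score 0.8) outranks one attack (score 0.7), so 7 of the 8 pairs are ordered correctly.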

In [None]:
# ROC-AUC should be computed from predicted probabilities (scores), not hard 0/1 labels
y_prob = model.predict_proba(X_test)[:, 1]

roc_auc = roc_auc_score(y_test, y_prob)
print(f"ROC-AUC Score: {roc_auc:.4f}")

### Summary
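As an SRE/Data Scientist, you must tune your model's decision threshold depending on the business context (for a binary classifier, `predict` effectively applies a 0.5 cutoff to the predicted probability):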
- If missing an attack is catastrophic (e.g., banking), you optimize for **Recall** (even if it means PagerDuty goes off more often).
- If alerts are just warnings for a non-critical background job, you optimize for **Precision** to protect your engineers' sleep.
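The trade-off can be sketched by sweeping the threshold over predicted probabilities. The helper `precision_recall_at` and the toy scores below are hypothetical. In this notebook, the same idea would apply to `y_test` and the probabilities from `model.predict_proba`. Returning 0.0 when no alerts fire is a convention chosen for this sketch.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when scores >= threshold are flagged as attacks."""
    tp = fp = fn = 0
    for actual, prob in zip(y_true, y_prob):
        flagged = prob >= threshold
        if flagged and actual == 1:
            tp += 1
        elif flagged and actual == 0:
            fp += 1
        elif not flagged and actual == 1:
            fn += 1
    # Convention for this sketch: report 0.0 when the denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and predicted attack probabilities.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_prob_toy = [0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.40, 0.55, 0.80, 0.95]

results = {t: precision_recall_at(y_true_toy, y_prob_toy, t) for t in (0.3, 0.5, 0.7)}
for t, (p, r) in results.items():
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the threshold to 0.3 catches every attack at the cost of more false pages, while raising it to 0.7 removes the false pages but misses half of the attacks.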