# Model Evaluation: Cross-Validation

## Context
In observability and ML, relying on a single train/test split can be dangerously misleading. A model predicting server outages might get "lucky" on one specific test set that happens to be easy, giving you false confidence before deploying it to production.

**K-Fold Cross-Validation** addresses this by splitting the data into `K` non-overlapping chunks (folds). The model is trained `K` times, each time using a different fold as the test set and the remaining `K-1` folds as training data. This gives us a more robust **average accuracy** and a measure of **spread** (how much the performance fluctuates from fold to fold), reported below as the standard deviation.
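
The procedure described above can be written out by hand, which makes it easier to see what a helper such as `cross_val_score` does internally. This is a minimal sketch, not the notebook's own code: the toy data from `make_classification`, the `LogisticRegression` model, and the names `X_demo`, `y_demo`, and `fold_scores` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Toy data standing in for a real dataset (assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

K = 5
kfold = KFold(n_splits=K, shuffle=True, random_state=0)
base_model = LogisticRegression(max_iter=1000)

fold_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    # Start from a fresh, unfitted copy so no fold sees another fold's fit
    fold_model = clone(base_model)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Accuracy on the held-out fold only
    fold_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

fold_scores = np.array(fold_scores)
print("Per-fold accuracy:", np.round(fold_scores, 3))
print("Mean: {:.3f}, Std: {:.3f}".format(fold_scores.mean(), fold_scores.std()))
```

Every sample lands in the test set exactly once, so the mean summarizes performance over the whole dataset rather than over one arbitrary 20% slice.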

## Objectives
- Generate a synthetic SRE dataset predicting "System Outage".
- Train a model on a simple Train/Test split.
- Use `cross_val_score` to perform 5-Fold Cross-Validation.
- Visualize the stability of the model across different folds.
- Use `StratifiedKFold` to keep the class ratio consistent across folds.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import warnings
# Silence library warnings to keep the notebook output short (avoid this in production code)
warnings.filterwarnings('ignore')

### 1. Generate SRE Outage Data
We will synthesize metrics (CPU, Memory, Disk Queue Length) to predict an `Outage_Flag`.

In [None]:
np.random.seed(42)
n_samples = 400

X = pd.DataFrame({
    'CPU_Load': np.random.normal(50, 20, n_samples),
    'Mem_Usage': np.random.normal(60, 15, n_samples),
    'Disk_Queue': np.random.poisson(2, n_samples)
})

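# Outage occurs if CPU and memory are both saturated, or if the disk queue backs up.
# The extra parentheses make the precedence explicit: & binds tighter than |.
y = (
    ((X['CPU_Load'] > 80) & (X['Mem_Usage'] > 80)) | (X['Disk_Queue'] >= 6)
).astype(int).rename('Outage_Flag')

# Flip 30 randomly chosen labels (7.5% of the samples) to simulate label noise
noise = np.random.choice(n_samples, size=30, replace=False)
y.iloc[noise] = 1 - y.iloc[noise].to_numpy()

print("Outage Distribution:\n", y.value_counts())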

### 2. The Danger of a Single Split
Let's see the accuracy with a single 80/20 train/test split.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)
model.fit(X_train, y_train)

single_split_acc = accuracy_score(y_test, model.predict(X_test))
print("Single Split Accuracy: {:.2f}%".format(single_split_acc * 100))
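
To see how much a single-split score can depend on the split itself, one option is to repeat the same 80/20 split with different seeds and compare the results. The sketch below is illustrative only: it uses its own toy data from `make_classification` (an assumed roughly 90/10 class balance) rather than the notebook's `X` and `y`, and the names `X_demo`, `y_demo`, and `split_accs` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative imbalanced data (assumption: roughly 90% / 10% classes, 5% label noise)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], flip_y=0.05, random_state=0,
)

split_accs = []
for seed in range(10):
    # Only the split seed changes; the model configuration stays fixed
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    clf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)
    clf.fit(X_tr, y_tr)
    split_accs.append(accuracy_score(y_te, clf.predict(X_te)))

split_accs = np.array(split_accs)
print("Accuracy per split seed:", np.round(split_accs * 100, 2))
print("Min: {:.2f}%  Max: {:.2f}%".format(split_accs.min() * 100, split_accs.max() * 100))
```

The gap between the minimum and maximum is the uncertainty that a single reported number hides.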

### 3. K-Fold Cross-Validation
Now, let's run 5-Fold Cross-Validation on the entire dataset. `cross_val_score` clones the model and refits it 5 separate times, each time holding out a different, non-overlapping fifth of the data as the test set and training on the remaining four fifths.

In [None]:
# cv=5 requests 5 folds; for a classifier, an integer cv makes scikit-learn use StratifiedKFold (no shuffling)
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

print("Accuracy for each fold:", np.round(cv_scores * 100, 2))
print("\nAverage Accuracy: {:.2f}%".format(cv_scores.mean() * 100))
print("Standard Deviation: {:.2f}% (How much the performance fluctuates)".format(cv_scores.std() * 100))
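
One detail worth checking is what an integer `cv` actually does. In scikit-learn, when the estimator is a classifier and the target is binary or multiclass, `cv=5` is documented to use `StratifiedKFold` with 5 unshuffled splits. The sketch below compares the two calls on toy data; the data, the `DecisionTreeClassifier`, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy imbalanced data (assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# A fixed random_state keeps every fit deterministic, so the scores are comparable
clf = DecisionTreeClassifier(max_depth=3, random_state=0)

scores_int = cross_val_score(clf, X_demo, y_demo, cv=5, scoring='accuracy')
scores_explicit = cross_val_score(
    clf, X_demo, y_demo, cv=StratifiedKFold(n_splits=5), scoring='accuracy'
)

print("cv=5:                ", np.round(scores_int, 4))
print("StratifiedKFold(5):  ", np.round(scores_explicit, 4))
print("Identical fold scores:", np.allclose(scores_int, scores_explicit))
```

If the two score arrays match, the integer shortcut and the explicit unshuffled splitter are producing the same folds.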

### 4. Visualizing Model Stability
A boxplot easily shows us if the ML model's accuracy is stable or highly dependent on how the data was split.

In [None]:
plt.figure(figsize=(6, 4))
plt.boxplot(cv_scores * 100)
plt.title("Cross-Validation Accuracy Spread")
plt.ylabel("Accuracy (%)")
plt.xticks([1], ["Random Forest (5 Folds)"])
plt.axhline(single_split_acc * 100, color='red', linestyle='--', label='Original Single Split')
plt.legend()
plt.show()

# Interpretation: a tall box (wide interquartile range) means accuracy varies a lot from fold to fold,
# so the model is volatile and might not be trustworthy in production alerting systems.

### 5. Stratified K-Fold (Important for SRE)
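In infrastructure, outages are rare: a dataset might be, for example, 95% "Healthy" and 5% "Failure".
With a plain, unstratified K-Fold split, one fold might end up with very few failures, or even 0!

**StratifiedKFold** keeps the ratio of Healthy vs Failure approximately the same in every fold. For classifiers, `cross_val_score` with an integer `cv` already stratifies (without shuffling); passing an explicit `StratifiedKFold` additionally lets us shuffle the rows and fix the random seed.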

In [None]:
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
stratified_cv_scores = cross_val_score(model, X, y, cv=stratified_kfold, scoring='accuracy')

print("Stratified Average Accuracy: {:.2f}%".format(stratified_cv_scores.mean() * 100))

# Always default to Stratified K-Fold when dealing with heavily imbalanced SRE logs!
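
The difference between plain and stratified splitting is easiest to see by counting the failures that land in each test fold. This sketch uses hypothetical labels (95 healthy rows followed by 5 failures, as time-ordered logs might look) and an unshuffled `KFold` so the result is deterministic; `positives_per_fold` is a helper defined here for illustration, not a scikit-learn function.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels: 95 healthy (0) followed by 5 failures (1)
y_demo = np.array([0] * 95 + [1] * 5)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect how rows are split

def positives_per_fold(splitter):
    """Count the failure labels that fall into each test fold."""
    return [int(y_demo[test_idx].sum()) for _, test_idx in splitter.split(X_demo, y_demo)]

plain = positives_per_fold(KFold(n_splits=5))
strat = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=42))

print("Failures per test fold, KFold:          ", plain)
print("Failures per test fold, StratifiedKFold:", strat)
```

With unshuffled `KFold`, all five failures fall into the last fold and the other four test folds contain none, while `StratifiedKFold` places one failure in each fold.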