# Predicting Server Health State with Classification

## Context
As an SRE, you want to automatically categorize the health state of your infrastructure (e.g., *Healthy*, *Warning*, *Critical*) rather than relying purely on static, single-metric thresholds. By feeding historical telemetry data and their known incident states into a classification model, you can predict when a server is entering a problematic state before it fails entirely.

## Objectives
- Generate synthetic operational telemetry data mapped to server health states.
- Train classification models: Logistic Regression, Random Forest Classifier, and Support Vector Classifier (SVC).
- Evaluate model accuracy and analyze the classification report (Precision, Recall, F1-Score).
- Understand feature importance to see which metrics drive state changes.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

plt.style.use('ggplot')

### 1. Generating Synthetic Telemetry Data
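We will generate data representing 1,500 server snapshots and label each as 0 (Healthy), 1 (Warning), or 2 (Critical) using threshold rules on CPU, memory, error rate, and network latency. Disk I/O does not enter the rules, so it serves as an uninformative feature that the models should learn to ignore.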

In [None]:
np.random.seed(42)
n_samples = 1500

# Telemetry features
cpu_usage = np.random.uniform(10, 100, n_samples)
memory_usage = np.random.uniform(20, 100, n_samples)
disk_io = np.clip(np.random.normal(500, 200, n_samples), 0, None)  # clip at 0: I/O cannot be negative
network_latency = np.random.gamma(2, 10, n_samples)
error_rate = np.random.exponential(1.5, n_samples)

def determine_state(cpu, mem, error, latency):
    # Combinations of high metrics lead to worse states
    if error > 5 or (cpu > 90 and mem > 90):
        return 2  # Critical
    elif error > 2 or cpu > 75 or mem > 80 or latency > 40:
        return 1  # Warning
    else:
        return 0  # Healthy

# Apply logic to map features to a target state
states = [determine_state(cpu_usage[i], memory_usage[i], error_rate[i], network_latency[i]) for i in range(n_samples)]

df = pd.DataFrame({
    'cpu_usage': cpu_usage,
    'memory_usage': memory_usage,
    'disk_io': disk_io,
    'network_latency': network_latency,
    'error_rate': error_rate,
    'state': states
})

# Let's map integer states to string labels for clarity later
state_map = {0: 'Healthy', 1: 'Warning', 2: 'Critical'}
df['state_label'] = df['state'].map(state_map)

df.head()
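
The labeling rules above can also be written in vectorized form, which makes it easy to check how many snapshots land in each class. The following is a minimal sketch, not the notebook's own method: it regenerates similar features with `np.random.default_rng` (an assumption, so the exact values differ from the cell above) and uses `np.select`, which applies the first matching condition, mirroring the `if`/`elif` order.

```python
import numpy as np

# Hypothetical re-creation of the telemetry features; a different random
# stream from np.random.seed(42), so the values will not match the notebook.
rng = np.random.default_rng(42)
n = 1500
cpu = rng.uniform(10, 100, n)
mem = rng.uniform(20, 100, n)
latency = rng.gamma(2, 10, n)
error = rng.exponential(1.5, n)

def determine_state(cpu, mem, error, latency):
    # Same rules as the notebook's function
    if error > 5 or (cpu > 90 and mem > 90):
        return 2
    elif error > 2 or cpu > 75 or mem > 80 or latency > 40:
        return 1
    return 0

# Vectorized version: np.select picks the first condition that is True
critical = (error > 5) | ((cpu > 90) & (mem > 90))
warning = (error > 2) | (cpu > 75) | (mem > 80) | (latency > 40)
states_vec = np.select([critical, warning], [2, 1], default=0)

# Reference labels from the row-by-row function
states_loop = np.array(
    [determine_state(c, m, e, l) for c, m, e, l in zip(cpu, mem, error, latency)]
)

# Class counts in the order Healthy, Warning, Critical
counts = np.bincount(states_vec, minlength=3)
print(dict(zip(['Healthy', 'Warning', 'Critical'], counts.tolist())))
```

If the Critical class turns out to be small, overall accuracy will say little about it, which is one reason the per-class precision and recall are reported later.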

### 2. Exploratory Data Analysis
How does error rate distinguish between states?

In [None]:
plt.figure(figsize=(8, 5))
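# seaborn >= 0.13 warns when palette is passed without hue, so hue is set explicitly
sns.boxplot(x='state_label', y='error_rate', hue='state_label', data=df, order=['Healthy', 'Warning', 'Critical'], palette='viridis', legend=False)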
plt.title('Error Rate by Server State')
plt.ylabel('Error Rate (%)')
plt.xlabel('')
plt.show()
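
A numeric summary can back up what the boxplot shows. The sketch below uses a tiny hand-made frame (`toy`, hypothetical values) in place of `df`, and reports the count, median, and maximum error rate per state.

```python
import pandas as pd

# Hypothetical stand-in for df with two snapshots per state
toy = pd.DataFrame({
    'state_label': ['Healthy', 'Healthy', 'Warning', 'Warning', 'Critical', 'Critical'],
    'error_rate': [0.4, 1.0, 2.5, 3.1, 5.5, 7.2],
})

order = ['Healthy', 'Warning', 'Critical']

# Aggregate per state, then reorder rows from least to most severe
summary = (
    toy.groupby('state_label')['error_rate']
       .agg(['count', 'median', 'max'])
       .reindex(order)
)
print(summary)
```

In the notebook itself, the same call chain should work with `df` in place of `toy`.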

### 3. Data Preparation (Scaling and Splitting)
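Models that rely on distances or gradient-based optimization, such as SVC and Logistic Regression, are sensitive to feature scale: here Disk I/O is in the hundreds, while CPU and memory are percentages. We use `StandardScaler` to standardize each feature to zero mean and unit variance. The scaler is fit on the training split only and then applied to the test split, so no test-set statistics leak into training. Tree-based models such as Random Forest do not need scaling, but it does them no harm.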

In [None]:
X = df[['cpu_usage', 'memory_usage', 'disk_io', 'network_latency', 'error_rate']]
y = df['state']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
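
To make the transformation concrete, the sketch below reproduces `StandardScaler` by hand on toy values (hypothetical numbers, not the notebook's data). Each test value is shifted by the training mean and divided by the training standard deviation; `StandardScaler` uses the population standard deviation, which is NumPy's default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values (hypothetical): column 0 mimics disk I/O, column 1 mimics memory %
X_train_toy = np.array([[300.0, 40.0], [500.0, 60.0], [700.0, 80.0]])
X_test_toy = np.array([[900.0, 100.0]])

# Fit on the training rows only
scaler_toy = StandardScaler().fit(X_train_toy)

# Manual z-score using training statistics
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)  # ddof=0, the population standard deviation
manual_test = (X_test_toy - mu) / sigma

print(scaler_toy.transform(X_test_toy))
print(manual_test)
```

Both printed arrays should agree, and both columns end up on the same scale even though the raw units differ by an order of magnitude.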

### 4. Training Classification Models

#### **Model A: Logistic Regression**
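A foundational linear classifier, used here as a baseline. It works well when the classes can be separated by roughly linear boundaries. Our labels come from thresholds combined with AND/OR logic, so it may misclassify snapshots near those thresholds.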

In [None]:
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)
y_pred_log_reg = log_reg.predict(X_test_scaled)

print(f"Logistic Regression Accuracy: {accuracy_score(y_test, y_pred_log_reg):.4f}\n")
print(classification_report(y_test, y_pred_log_reg, target_names=['Healthy', 'Warning', 'Critical']))
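
One advantage of a linear model is that its coefficients can be read directly. For a multiclass problem, scikit-learn's `LogisticRegression` stores one row of coefficients per class in `coef_`. The sketch below is an illustration under assumptions: it builds a smaller, simplified dataset (three features, a reduced rule set, a hypothetical seed) rather than reusing the notebook's `df`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Simplified, hypothetical data: three features and a reduced rule set
rng = np.random.default_rng(0)
n = 600
cpu = rng.uniform(10, 100, n)
mem = rng.uniform(20, 100, n)
error = rng.exponential(1.5, n)
y_demo = np.select(
    [error > 5, (error > 2) | (cpu > 75) | (mem > 80)],
    [2, 1],
    default=0,
)
X_demo = pd.DataFrame({'cpu_usage': cpu, 'memory_usage': mem, 'error_rate': error})

# Standardize so coefficient magnitudes are comparable across features
X_demo_scaled = StandardScaler().fit_transform(X_demo)
model = LogisticRegression(max_iter=1000).fit(X_demo_scaled, y_demo)

# One row per class, one column per feature
coef_table = pd.DataFrame(
    model.coef_,
    index=['Healthy', 'Warning', 'Critical'],
    columns=X_demo.columns,
)
print(coef_table.round(2))
```

A positive entry means that a higher (standardized) value of the feature pushes the prediction toward that class, relative to the others.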

#### **Model B: Random Forest Classifier**
An ensemble of decision trees, each trained on a bootstrap sample of the data, whose predictions are combined. Because each tree splits on feature thresholds, the model handles non-linear, rule-like relationships well (like our `cpu > 90 and mem > 90` rule).

In [None]:
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train_scaled, y_train)
y_pred_rf = rf_clf.predict(X_test_scaled)

print(f"Random Forest Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}\n")
print(classification_report(y_test, y_pred_rf, target_names=['Healthy', 'Warning', 'Critical']))

#### **Model C: Support Vector Classifier (SVC)**
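Finds the maximum-margin boundary between classes. With the RBF kernel used here, that boundary is a hyperplane in an implicit feature space, which corresponds to a non-linear decision boundary in the original features. SVMs are also known to cope well with high-dimensional data, although that matters little with only five features.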

In [None]:
svc_clf = SVC(kernel='rbf', random_state=42)
svc_clf.fit(X_train_scaled, y_train)
y_pred_svc = svc_clf.predict(X_test_scaled)

print(f"SVC Accuracy: {accuracy_score(y_test, y_pred_svc):.4f}\n")
print(classification_report(y_test, y_pred_svc, target_names=['Healthy', 'Warning', 'Critical']))
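
A single train/test split can make one model look better than another by chance. A common complement is stratified k-fold cross-validation with the scaler wrapped in a pipeline, so that it is refit on each training fold. The sketch below is one way to set that up, under assumptions: the data is a simplified re-creation of the labeling rules (hypothetical seed, three features), and macro-averaged F1 is chosen as the score so the small Critical class counts as much as the others.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Simplified, hypothetical data following the same style of rules
rng = np.random.default_rng(1)
n = 600
cpu = rng.uniform(10, 100, n)
mem = rng.uniform(20, 100, n)
error = rng.exponential(1.5, n)
X_cv = np.column_stack([cpu, mem, error])
y_cv = np.select(
    [(error > 5) | ((cpu > 90) & (mem > 90)), (error > 2) | (cpu > 75) | (mem > 80)],
    [2, 1],
    default=0,
)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVC (RBF)': SVC(kernel='rbf', random_state=42),
}

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for name, estimator in candidates.items():
    # The pipeline refits the scaler inside each training fold
    pipe = make_pipeline(StandardScaler(), estimator)
    cv_scores[name] = cross_val_score(pipe, X_cv, y_cv, cv=cv, scoring='f1_macro')
    print(f"{name}: mean macro-F1 = {cv_scores[name].mean():.3f} (+/- {cv_scores[name].std():.3f})")
```

The spread across folds gives a rough sense of how much of the gap between two models is noise.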

### 5. Evaluation: Confusion Matrix
Let's look at the Random Forest confusion matrix to see specifically *where* the model makes mistakes. Are we predicting "Healthy" when it's actually "Critical"? (This would be a dangerous false negative).

In [None]:
cm = confusion_matrix(y_test, y_pred_rf)

plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Healthy', 'Warning', 'Critical'],
            yticklabels=['Healthy', 'Warning', 'Critical'])
plt.xlabel('Predicted State')
plt.ylabel('Actual State')
plt.title('Confusion Matrix: Random Forest Classifier')
plt.show()
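
The same matrix can be read programmatically. In scikit-learn's `confusion_matrix`, rows are actual classes and columns are predicted classes, so the "actually Critical, predicted Healthy" count sits at row 2, column 0. The sketch below uses a small hand-made set of labels (hypothetical values) rather than the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = Healthy, 1 = Warning, 2 = Critical
y_true_toy = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred_toy = np.array([0, 0, 1, 1, 1, 0, 1, 2, 1, 0])

cm_toy = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2])

# Rows are actual states, columns are predicted states
critical_as_healthy = cm_toy[2, 0]

# Per-class recall: correct predictions divided by actual count per row
recall_per_class = cm_toy.diagonal() / cm_toy.sum(axis=1)

print(cm_toy)
print("Critical predicted as Healthy:", critical_as_healthy)
print("Recall (Healthy, Warning, Critical):", recall_per_class.round(2))
```

In the notebook, `cm[2, 0]` gives the corresponding count for the Random Forest predictions.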

### 6. Feature Importance
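Random Forest models expose impurity-based feature importances: a measure of how much each feature reduced impurity across the splits in all trees. This shows which telemetry metrics the model relies on most to separate the states. Since `disk_io` plays no part in the labeling rules, it should rank near the bottom. Treat the ranking as a description of the model, not as proof that a metric is a leading indicator of server failure.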

In [None]:
importances = rf_clf.feature_importances_
features = X.columns

indices = np.argsort(importances)

plt.figure(figsize=(8, 5))
plt.barh(range(len(indices)), importances[indices], color='teal', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.title('Feature Importances in Server State Prediction')
plt.xlabel('Relative Importance')
plt.show()
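
Impurity-based importances are computed from the training data and can overstate noisy continuous features. Permutation importance on held-out data is a common cross-check: shuffle one feature at a time and measure how much the score drops. The sketch below is a minimal illustration under assumptions: the data is a simplified two-class re-creation (hypothetical seed and rule) in which `disk_io` is pure noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Simplified, hypothetical data: disk_io never influences the label
rng = np.random.default_rng(7)
n = 800
error = rng.exponential(1.5, n)
cpu = rng.uniform(10, 100, n)
disk = rng.normal(500, 200, n)
X_pi = np.column_stack([error, cpu, disk])
names = ['error_rate', 'cpu_usage', 'disk_io']
y_pi = ((error > 2) | (cpu > 75)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_pi, y_pi, test_size=0.25, random_state=42, stratify=y_pi
)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the test split and average the accuracy drop
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)

for name, perm, mdi in zip(names, result.importances_mean, forest.feature_importances_):
    print(f"{name:12s} permutation={perm:.3f}  impurity={mdi:.3f}")
```

When the two rankings agree, that adds some confidence in the chart above; when they disagree, the permutation result on held-out data is usually the safer one to trust.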