# Machine Learning Pipelines in SRE

## Context
In a production observability pipeline, raw telemetry data is almost never clean. You might have missing Prometheus scrapes, metrics on vastly different scales (percentage vs. bytes), or categorical tags that need encoding.

If you build a Machine Learning model to predict Server Health, you must apply the exact same preprocessing steps to real-time production data as you did to your historical training data. If you don't, the model may raise errors on input it cannot handle (for example, unexpected `NaN` values) or silently produce unreliable predictions.
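
To make the risk concrete, here is a minimal sketch of what happens when a scaler is refit on live data instead of reusing the statistics learned during training. The CPU readings and the names `historical_cpu` and `live_cpu` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical CPU readings: history is centred near 60%, the live batch near 90%.
historical_cpu = rng.normal(60, 20, size=(500, 1))
live_cpu = rng.normal(90, 5, size=(50, 1))

# Consistent: fit once on historical data, then only transform live data.
scaler = StandardScaler().fit(historical_cpu)
consistent = scaler.transform(live_cpu)

# Inconsistent: refitting on the live batch re-centres it at zero,
# which hides the very shift the model would need to see.
refit = StandardScaler().fit_transform(live_cpu)

print("Mean using training statistics:", round(float(consistent.mean()), 2))
print("Mean after refitting on live data:", round(float(refit.mean()), 2))
```

With the training statistics, the elevated CPU load stays visible as a clearly positive scaled value. After refitting, the same batch looks perfectly average.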

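Scikit-Learn **Pipelines** solve this by bundling preprocessing and modeling into a single estimator. Every step is fitted only on the data passed to `.fit()`, which keeps training and serving consistent and, when the pipeline is used inside cross-validation, prevents data leakage.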

## Objectives
- Build a synthetic SRE dataset representing server health and metrics.
- Create a single `Pipeline` that automatically:
  1. Imputes missing telemetry data.
  2. Scales the metrics.
  3. Trains a `RandomForestClassifier` to predict server failure.
- Perform hyperparameter tuning safely using `GridSearchCV` on the entire pipeline.

In [None]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
import warnings
warnings.filterwarnings('ignore')

### 1. Generating Infrastructure Data
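Let's create a dataset where we predict whether a server is `Healthy (0)` or `Failing (1)` based on CPU usage and network traffic. We will intentionally include missing (`NaN`) values.

Note that the labels here are drawn at random, independently of the metrics, so the model has no real signal to learn. With roughly 80% healthy servers, an accuracy near 80% is what always predicting `Healthy` would achieve. The dataset exists to exercise the pipeline mechanics, not to demonstrate predictive power.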

In [None]:
np.random.seed(42)
n_samples = 200

data = pd.DataFrame({
    'CPU_Usage': np.random.normal(60, 20, n_samples),
    'Network_Bytes': np.random.normal(1_000_000, 500_000, n_samples),
    'Failure': np.random.choice([0, 1], size=n_samples, p=[0.8, 0.2])
})

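# Introduce missing values to simulate dropped telemetry.
# replace=False ensures exactly 10 and 15 distinct rows are blanked
# (the default samples with replacement and can pick the same row twice).
data.loc[np.random.choice(data.index, 10, replace=False), 'CPU_Usage'] = np.nan
data.loc[np.random.choice(data.index, 15, replace=False), 'Network_Bytes'] = np.nan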

X = data[['CPU_Usage', 'Network_Bytes']]
y = data['Failure']

# stratify=y keeps the roughly 80/20 class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

X_train.head()

### 2. Building the Pipeline

Instead of manually calling `.fit_transform()` on an imputer, then `.fit_transform()` on a scaler, and finally `.fit()` on a model, we combine them into one object.
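
For comparison, the manual version can be sketched next to its pipeline equivalent. This is an illustrative example on a small random dataset (`X_demo` and `y_demo` are hypothetical stand-ins, not the SRE dataset above); it shows that both routes learn the same imputation values and produce the same predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small hypothetical dataset: two metrics with a few missing values.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 2))
X_demo[rng.choice(120, size=8, replace=False), 0] = np.nan
y_demo = rng.integers(0, 2, size=120)
X_tr, X_te, y_tr = X_demo[:90], X_demo[90:], y_demo[:90]

# Manual version: every fitted object must be kept and reapplied in order.
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
model = RandomForestClassifier(random_state=42)
X_tr_prepared = scaler.fit_transform(imputer.fit_transform(X_tr))
model.fit(X_tr_prepared, y_tr)
# Test data must go through transform(), never fit_transform().
manual_pred = model.predict(scaler.transform(imputer.transform(X_te)))

# Pipeline version: the same three steps behind one fit/predict interface.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

print("Identical predictions:", np.array_equal(manual_pred, pipe_pred))
print("Medians learned by the pipeline:", pipe.named_steps["imputer"].statistics_)
```

The manual route works, but it leaves three fitted objects to track, and a single accidental `fit_transform()` on test data would go unnoticed.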

In [None]:
# Define the steps of the pipeline as a list of tuples: ('name', Transformer/Model object)
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')), # Step 1: Fill missing metrics with median
    ('scaler', StandardScaler()),                  # Step 2: Standardize scales (tree models don't need this; kept to show chaining)
    ('classifier', RandomForestClassifier(random_state=42)) # Step 3: Train model
])

# Execute the entire pipeline on the training data
pipeline.fit(X_train, y_train)

# Evaluate on the test data
# The test data is automatically imputed and scaled using the parameters learned from X_train!
predictions = pipeline.predict(X_test)

print("Sample Predictions (0=Healthy, 1=Failing):", predictions[:10])
print("Pipeline Test Accuracy: {:.2f}%".format(pipeline.score(X_test, y_test) * 100))
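
The pipeline above treats every column as numeric. For the categorical tags mentioned in the Context section, one possible extension is a `ColumnTransformer` that routes numeric columns through imputation and scaling and one-hot encodes the tags. The sketch below is an assumption about how that could look: the `Region` column, its values, and the tiny dataset are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical telemetry with one categorical tag; values are illustrative only.
telemetry = pd.DataFrame({
    "CPU_Usage": [12.0, np.nan, 87.5, 45.0, 93.1, 30.2],
    "Network_Bytes": [1.2e6, 3.4e5, np.nan, 9.8e5, 2.1e6, 4.0e5],
    "Region": ["us-east", "eu-west", "us-east", "ap-south", "eu-west", "us-east"],
})
labels = [0, 0, 1, 0, 1, 0]

numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("numeric", numeric_steps, ["CPU_Usage", "Network_Bytes"]),
    # handle_unknown="ignore" encodes unseen tags as all zeros instead of raising.
    ("tags", OneHotEncoder(handle_unknown="ignore"), ["Region"]),
])

full_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=42)),
])
full_pipeline.fit(telemetry, labels)

# A new server with a missing CPU reading and a region never seen in training.
new_server = pd.DataFrame({
    "CPU_Usage": [np.nan],
    "Network_Bytes": [1.5e6],
    "Region": ["sa-east"],
})
print(full_pipeline.predict(new_server))
```

The whole object still exposes a single `.fit()` and `.predict()`, so it can be tuned and deployed the same way as the numeric-only pipeline.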

### 3. Hyperparameter Tuning with GridSearchCV

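If we want to find the best imputation strategy (`'mean'` vs. `'median'`) and the best Random Forest settings (number of trees and maximum depth), we can tune the *entire pipeline* at once. The grid below has 2 × 3 × 3 = 18 combinations, each evaluated with 3-fold cross-validation, for 54 fits in total.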

Notice the syntax: `stepname__parametername` (two underscores).
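
If you are unsure which names are valid, `get_params()` lists them. The sketch below rebuilds an equivalent three-step pipeline (named `demo_pipeline` here purely for illustration) and shows that the same double-underscore names also work with `set_params()`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Every nested parameter follows the <step>__<parameter> pattern.
tunable = sorted(name for name in demo_pipeline.get_params() if "__" in name)
print([name for name in tunable if name.startswith("imputer__")])

# The same names can be set directly, one candidate at a time.
demo_pipeline.set_params(imputer__strategy="mean", classifier__max_depth=5)
print(demo_pipeline.named_steps["imputer"].strategy)
print(demo_pipeline.named_steps["classifier"].max_depth)
```

A misspelled step or parameter name raises an error, so printing the valid names first can save a failed grid search.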

In [None]:
# Define the grid of hyperparameters to search
param_grid = {
    'imputer__strategy': ['mean', 'median'],          # Tune step 1
    'classifier__n_estimators': [50, 100, 200],      # Tune step 3
    'classifier__max_depth': [None, 5, 10]           # Tune step 3
}

grid_search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=-1, verbose=1)

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

print("\nBest parameters found:")
print(grid_search.best_params_)

print("\nBest Cross-Validation Accuracy: {:.2f}%".format(grid_search.best_score_ * 100))

# We can now use grid_search.best_estimator_ as our final production pipeline.
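
One common way to hand the tuned pipeline to a serving process is to persist it with `joblib`. The following is a sketch under stated assumptions, not a deployment recipe: the stand-in data, the deliberately tiny grid, and the file name `server_health_pipeline.joblib` are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Stand-in data and a tiny grid so the sketch runs quickly.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "CPU_Usage": rng.normal(60, 20, 90),
    "Network_Bytes": rng.normal(1_000_000, 500_000, 90),
})
y_demo = rng.choice([0, 1], size=90, p=[0.8, 0.2])

search = GridSearchCV(
    Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("classifier", RandomForestClassifier(n_estimators=20, random_state=42)),
    ]),
    {"classifier__max_depth": [3, 5]},
    cv=3,
)
search.fit(X_demo, y_demo)

# best_estimator_ is the whole pipeline, refit on all of X_demo
# (refit=True is the GridSearchCV default).
model_path = os.path.join(tempfile.mkdtemp(), "server_health_pipeline.joblib")  # hypothetical path
joblib.dump(search.best_estimator_, model_path)

# Later, in the serving process: load once, then predict on raw telemetry.
production_pipeline = joblib.load(model_path)
live_sample = pd.DataFrame({"CPU_Usage": [np.nan], "Network_Bytes": [2_300_000]})
print(production_pipeline.predict(live_sample))
```

Because imputation travels inside the saved object, the serving code can pass a sample with a missing CPU reading directly. Load the file with the same scikit-learn version that wrote it, and only load files from sources you trust.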

### Summary

Pipelines are essential for deploying ML in DevOps environments because:
- They ensure **Consistency**: Production streaming data gets exactly the same preprocessing as historical Data Lake data.
- They prevent **Data Leakage**: During cross-validation, preprocessing state (like scaler means) is learned from each training fold only, never from the corresponding validation fold.
- They improve **Code readability**: Multiple workflow steps are reduced to a single `.fit()` call.
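
The leakage point can be checked directly. The sketch below, on hypothetical stand-in data, uses `cross_validate(..., return_estimator=True)` to keep the pipeline fitted on each training fold and prints the scaler mean each one learned.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(60, 20, size=(150, 1))  # stand-in for a CPU_Usage column
y_demo = rng.choice([0, 1], size=150, p=[0.8, 0.2])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=42)),
])

# return_estimator=True keeps the pipeline that was fitted on each training fold.
results = cross_validate(pipe, X_demo, y_demo, cv=3, return_estimator=True)

fold_means = [est.named_steps["scaler"].mean_[0] for est in results["estimator"]]
print("Scaler mean per training fold:", np.round(fold_means, 2))
print("Mean of the full dataset:", round(float(X_demo.mean()), 2))
```

Each fold reports a slightly different mean, and none is computed from the rows it is later scored on. Scaling the full dataset before splitting would instead give every fold the same statistics, including information from its own validation rows.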