# Data Imputation for Missing Telemetry

## Context
In infrastructure monitoring, missing data points are common. A telemetry agent (like Prometheus Node Exporter) might crash, a network partition might drop packets, or an API might temporarily fail to log its metrics.

Most machine learning algorithms cannot handle missing values (usually represented as `NaN` or `null`). Before training models or running further analysis, we therefore **impute** (fill in) these missing values.

## Objectives
- Simulate a realistic SRE dataset where an agent intermittently drops CPU, memory, and disk I/O metrics.
- Explore different imputation strategies using Scikit-Learn's `SimpleImputer`:
  - Mean Imputation
  - Median Imputation (robust to spikes)
  - Constant Imputation (filling with 0s or custom values)
  - Most Frequent Imputation (for categorical server states)

In [None]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split


### 1. Generate Synthetic Server Telemetry with Missing Data
We will generate data representing server CPU usage, memory usage, and disk I/O, then artificially drop ~10% of the values to simulate agent failures. Server status, a categorical field, is handled separately in section 3.

In [None]:
np.random.seed(42)
n_samples = 100

# Base telemetry
cpu_usage = np.random.normal(loc=45, scale=15, size=n_samples)
mem_usage = np.random.normal(loc=60, scale=10, size=n_samples)
disk_io = np.random.normal(loc=200, scale=50, size=n_samples)

df = pd.DataFrame({
    'CPU_Usage_pct': cpu_usage,
    'Memory_Usage_pct': mem_usage,
    'Disk_IO_ops': disk_io
})

# Introduce missing values randomly (approx 10% chance per cell)
mask = np.random.choice([True, False], size=df.shape, p=[0.1, 0.9])
df_with_missing = df.mask(mask)

print("Missing values per column:")
print(df_with_missing.isna().sum())

print("\nFirst few rows with missing data (NaN):")
df_with_missing.head(10)
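
The mask above drops independent cells, whereas an agent crash usually blanks every metric for a run of consecutive samples. A minimal sketch of that pattern follows; the outage count, the outage length, and the names `telemetry` and `telemetry_outage` are illustrative assumptions, not part of the dataset above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_samples = 100

# Hypothetical telemetry frame mirroring two of the columns used above
telemetry = pd.DataFrame({
    'CPU_Usage_pct': rng.normal(45, 15, n_samples),
    'Memory_Usage_pct': rng.normal(60, 10, n_samples),
})

# Assumed outage model: 3 crashes, each blanking 4 consecutive samples
outage_len = 4
starts = rng.choice(n_samples - outage_len, size=3, replace=False)

telemetry_outage = telemetry.copy()
for start in starts:
    # A crashed agent reports nothing, so every column is missing for these rows
    telemetry_outage.iloc[start:start + outage_len] = np.nan

print(telemetry_outage.isna().sum())
```

Outage windows may overlap, so each column ends up with between 4 and 12 missing samples.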

### 2. Imputation Strategies

#### **Mean Imputation**
Replaces missing values with the column mean. This is `SimpleImputer`'s default strategy, but the mean is sensitive to outliers: a few large anomalies (e.g., sudden 100% CPU spikes) pull the fill value away from typical behavior.

In [None]:
# Create an imputer that uses the 'mean' strategy
mean_imputer = SimpleImputer(strategy='mean')

# Fit and transform the data
df_mean = pd.DataFrame(
    mean_imputer.fit_transform(df_with_missing), 
    columns=df_with_missing.columns
)

df_mean.head(10)

#### **Median Imputation**
Replaces missing values with the column median. This is usually a better default for infrastructure metrics, because they often contain extreme outliers (e.g., occasional 10,000ms latency spikes) that drag the mean upward while leaving the median almost unchanged.

In [None]:
median_imputer = SimpleImputer(strategy='median')

df_median = pd.DataFrame(
    median_imputer.fit_transform(df_with_missing), 
    columns=df_with_missing.columns
)

df_median.head(10)
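
To make the difference between the two strategies concrete, here is a small sketch using a made-up latency series with a single 10,000ms spike. The `latency_ms` values are hypothetical; `statistics_` is the fitted attribute where `SimpleImputer` stores the per-column fill value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical latency samples in ms: steady traffic, one spike, one gap
latency_ms = np.array([[20.0], [22.0], [21.0], [19.0], [10000.0], [np.nan]])

# Fit each imputer and read back the value it would use for the gap
mean_fill = SimpleImputer(strategy='mean').fit(latency_ms).statistics_[0]
median_fill = SimpleImputer(strategy='median').fit(latency_ms).statistics_[0]

print(f"mean fill:   {mean_fill:.1f} ms")    # 2016.4 ms
print(f"median fill: {median_fill:.1f} ms")  # 21.0 ms
```

A single spike moves the mean fill to roughly 100 times the typical latency, while the median fill stays at a representative value.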

#### **Constant Imputation (e.g., Zero)**
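Sometimes a missing metric really does mean `0` (e.g., a "5xx Error Count" metric that isn't emitted when there are no errors). In that case, we fill missing values with a constant. Reserve this for metrics where absence has that meaning: filling a CPU or memory gap with 0 would invent an idle server.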

In [None]:
constant_imputer = SimpleImputer(strategy='constant', fill_value=0)

df_constant = pd.DataFrame(
    constant_imputer.fit_transform(df_with_missing), 
    columns=df_with_missing.columns
)

df_constant.head(10)
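
In practice, zero-filling usually applies to only some columns. One way to mix strategies per column is scikit-learn's `ColumnTransformer`. The sketch below is illustrative only: the `metrics` frame and the `Error_5xx_count` column name are hypothetical and do not appear in the dataset above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical frame: 'Error_5xx_count' is an illustrative column name
metrics = pd.DataFrame({
    'CPU_Usage_pct': [40.0, np.nan, 55.0, 50.0],
    'Error_5xx_count': [3.0, np.nan, np.nan, 1.0],
})

# Median for the gauge, zero for the counter that is silent when nothing fails
per_column_imputer = ColumnTransformer([
    ('median', SimpleImputer(strategy='median'), ['CPU_Usage_pct']),
    ('zero', SimpleImputer(strategy='constant', fill_value=0), ['Error_5xx_count']),
])

# Output columns follow the order of the transformers listed above
metrics_imputed = pd.DataFrame(
    per_column_imputer.fit_transform(metrics),
    columns=['CPU_Usage_pct', 'Error_5xx_count'],
)
print(metrics_imputed)
```

Here the missing CPU reading becomes 50.0 (the median of 40, 50, and 55), and the missing error counts become 0.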

### 3. Handling Categorical Data (Server States)
If we have missing categorical data (e.g., a server's reported health state), we cannot calculate a mean or median. Instead, we use the **most frequent** value (the mode).

In [None]:
cat_data = pd.DataFrame({
    'Server_ID': ['srv-1', 'srv-2', 'srv-3', 'srv-4', 'srv-5', 'srv-6'],
    'Status': ['Healthy', 'Healthy', np.nan, 'Warning', 'Healthy', np.nan]
})

cat_imputer = SimpleImputer(strategy='most_frequent')

# fit_transform returns a 2D array; flatten it so it can be assigned as one column
cat_data['Status_Imputed'] = cat_imputer.fit_transform(cat_data[['Status']]).ravel()
cat_data
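
Filling a missing health state with the mode labels a silent server as `Healthy`, which may hide the very failure that caused the gap. A possible alternative, sketched below under that assumption, is to treat "missing" as its own category with `strategy='constant'`. The `'Unknown'` label is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

status = pd.DataFrame({
    'Status': ['Healthy', 'Healthy', np.nan, 'Warning', 'Healthy', np.nan]
})

# 'Unknown' is an assumed placeholder label, not a value from the original data
unknown_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
status['Status_Imputed'] = unknown_imputer.fit_transform(status[['Status']]).ravel()
print(status)
```

Which option is appropriate depends on why the state is missing. If the gap itself is informative, a separate category preserves that signal for downstream models.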

### 4. Important: Imputing Train/Test Splits Properly
A critical rule in ML is preventing **data leakage**. If you fill missing values *before* splitting your dataset, the means/medians used to fill the training set are computed partly from test rows, so information from the test set leaks into training and evaluation scores become optimistic.

**Correct Workflow:**
1. Split the data.
2. `fit()` the imputer **only on the training set**.
3. `transform()` both the training set and the test set using that fitted imputer.

In [None]:
# Create a target variable to simulate a supervised learning split.
# It is derived from the complete data (df), so the labels contain no NaN.
y = (df['CPU_Usage_pct'] > 50).astype(int)

X_train, X_test, y_train, y_test = train_test_split(df_with_missing, y, test_size=0.3, random_state=42)

# Create Imputer
imputer = SimpleImputer(strategy='median')

# FIT ONLY ON X_TRAIN, then transform
X_train_imputed = imputer.fit_transform(X_train)

# TRANSFORM X_TEST using the median calculated from X_TRAIN
X_test_imputed = imputer.transform(X_test)

# Data is now safely preprocessed without leakage and ready for ML algorithms.
print("Train Set Shape:", X_train_imputed.shape)
print("Test Set Shape:", X_test_imputed.shape)
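
The fit-on-train, transform-both workflow above can also be delegated to a scikit-learn `Pipeline`, which fits the imputer only on whatever data is passed to `fit()`. The following is a minimal sketch on a hypothetical two-feature dataset; the choice of `LogisticRegression` and the synthetic `X` and `y` are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)

# Hypothetical two-feature dataset (CPU-like and memory-like columns)
X = rng.normal(loc=[45, 60], scale=[15, 10], size=(200, 2))
y = (X[:, 0] > 50).astype(int)          # label computed before values are dropped
X[rng.random(X.shape) < 0.1] = np.nan   # drop ~10% of cells

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# fit() learns the medians from X_train only; score() reuses them on X_test
model = make_pipeline(SimpleImputer(strategy='median'), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print("Learned medians:", model.named_steps['simpleimputer'].statistics_)
print("Test accuracy:", model.score(X_test, y_test))
```

Because the imputer lives inside the pipeline, the same object can be passed to `cross_val_score` or `GridSearchCV`, and the imputer is refit on each training fold.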