# Predicting Application Latency with Regression

## Context
As a Site Reliability Engineer (SRE), you want to understand what drives application latency. You know that CPU utilization, memory usage, the number of concurrent connections, and the database Queries Per Second (QPS) all play a role. By training a regression model on historical data, you can predict what latency will look like under future load scenarios, allowing you to proactively scale resources before latency breaches your SLAs.

## Objectives
- Generate a synthetic operational dataset mimicking application performance.
- Train basic and regularized regression models (Linear, Ridge).
- Train a non-linear model (Random Forest Regressor).
- Compare model performance using Mean Squared Error (MSE) and R2 Score.
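
As a quick reference for the last objective, both metrics can be computed by hand. The sketch below uses made-up values (`y_true` and `y_pred` are illustrative and are not taken from the dataset generated later in this notebook); `mean_squared_error` and `r2_score` from scikit-learn should return the same numbers for the same inputs.

```python
import numpy as np

# Illustrative values only: five observed latencies (ms) and a model's predictions.
y_true = np.array([300.0, 320.0, 310.0, 450.0, 700.0])
y_pred = np.array([305.0, 315.0, 330.0, 400.0, 600.0])

# MSE: the average squared prediction error (units: ms^2).
mse = np.mean((y_true - y_pred) ** 2)

# R2: 1 minus the ratio of the model's squared error to the squared error
# of a baseline that always predicts the mean of y_true.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MSE: {mse:.2f}")
print(f"R2 Score: {r2:.4f}")
```

An R2 of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values mean worse than that baseline. Because MSE is in squared milliseconds, its square root (RMSE) is often easier to read against an SLA.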

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor

# Set style
plt.style.use('ggplot')

### 1. Generating Synthetic Operational Data
We will simulate 1,000 minutes of telemetry data for our application. Latency will be a function of the other metrics plus some random noise.

In [None]:
np.random.seed(42)
n_samples = 1000

# Features
cpu_utilization = np.random.uniform(10, 95, n_samples)  # 10% to 95%
memory_usage = np.random.uniform(20, 80, n_samples)     # 20% to 80%
concurrent_connections = np.random.poisson(500, n_samples)
db_qps = np.random.normal(2000, 500, n_samples)

# Formulate latency (ms): linear in connections, DB QPS, and memory; piecewise in CPU.
# Up to 80% CPU the CPU term grows linearly; above 80% it grows quadratically.
# The 80 offset keeps the CPU term continuous at the threshold.
cpu_term = np.where(cpu_utilization > 80, 80 + 2 * (cpu_utilization - 80)**2, cpu_utilization)
latency = (0.5 * concurrent_connections) + (0.01 * db_qps) + (0.2 * memory_usage) + \
          cpu_term + \
          np.random.normal(0, 15, n_samples)  # Server noise

# Create DataFrame
df = pd.DataFrame({
    'cpu_percent': cpu_utilization,
    'memory_percent': memory_usage,
    'connections': concurrent_connections,
    'db_qps': db_qps,
    'latency_ms': latency
})

df.head()

### 2. Visualize the Target Variable vs Features
Let's plot latency against CPU utilization. Latency rises gently up to the 80% threshold (the dashed red line) and climbs steeply beyond it, giving a hockey-stick shape.

In [None]:
plt.figure(figsize=(8, 5))
plt.scatter(df['cpu_percent'], df['latency_ms'], alpha=0.5, color='royalblue')
plt.title('Application Latency vs CPU Utilization')
plt.xlabel('CPU Utilization (%)')
plt.ylabel('Latency (ms)')
plt.axvline(x=80, color='r', linestyle='--', label='CPU Threshold')
plt.legend()
plt.show()

### 3. Splitting the Data
We separate our target variable (`latency_ms`) from our features and split them into a training set (80%) and a testing set (20%).

In [None]:
X = df[['cpu_percent', 'memory_percent', 'connections', 'db_qps']]
y = df['latency_ms']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### 4. Training Models
We will evaluate three models to see how they capture the relationships, especially the non-linear CPU degradation.

#### **Model A: Linear Regression**
The simplest approach. It models latency as a weighted sum of the features, so each feature can only have a straight-line effect on the prediction.

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred_lin = lin_reg.predict(X_test)

print("--- Linear Regression ---")
print(f"MSE: {mean_squared_error(y_test, y_pred_lin):.2f}")
print(f"R2 Score: {r2_score(y_test, y_pred_lin):.4f}")
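
One benefit of a linear model is that its coefficients are directly readable as "milliseconds per unit of the feature". The sketch below illustrates this on a small, purely linear dataset generated just for this example (the weights 0.5 and 0.01 mirror the latency formula above, but `rng`, `model`, and the two-feature setup are assumptions of the sketch). On the notebook's own model, the equivalent values live in `lin_reg.coef_` and `lin_reg.intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
connections = rng.poisson(500, n)
db_qps = rng.normal(2000, 500, n)

# Purely linear target, so the fitted weights should land close to 0.5 and 0.01.
latency = 0.5 * connections + 0.01 * db_qps + rng.normal(0, 15, n)

X = np.column_stack([connections, db_qps])
model = LinearRegression().fit(X, latency)

# Each coefficient is the predicted change in latency (ms) per one-unit change in the feature.
for name, coef in zip(["connections", "db_qps"], model.coef_):
    print(f"{name}: {coef:.3f} ms per unit")
print(f"intercept: {model.intercept_:.2f} ms")
```

When the true relationship is not linear, as with the CPU term in this notebook, the fitted coefficients are a compromise across the whole range and should be read with more caution.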

#### **Model B: Ridge Regression**
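Linear regression with L2 regularization, which penalizes the squared size of the coefficients and shrinks them toward zero; `alpha` sets the penalty strength. The penalty is sensitive to feature scale, and the features here are not standardized, so expect results close to plain Linear Regression.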

In [None]:
ridge_reg = Ridge(alpha=10.0)
ridge_reg.fit(X_train, y_train)
y_pred_ridge = ridge_reg.predict(X_test)

print("--- Ridge Regression ---")
print(f"MSE: {mean_squared_error(y_test, y_pred_ridge):.2f}")
print(f"R2 Score: {r2_score(y_test, y_pred_ridge):.4f}")
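
Because the L2 penalty acts on coefficient size, features measured on different scales (CPU in tens, DB QPS in thousands) are penalized unevenly. One common approach, which this notebook does not use, is to standardize the features inside a scikit-learn pipeline before fitting Ridge. The sketch below is a minimal illustration on stand-in data; the names `cpu`, `qps`, and `scaled_ridge` are assumptions of the sketch.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# Two features on very different scales, loosely modeled on cpu_percent and db_qps.
cpu = rng.uniform(10, 95, n)
qps = rng.normal(2000, 500, n)
X = np.column_stack([cpu, qps])
y = 1.0 * cpu + 0.01 * qps + rng.normal(0, 15, n)

# Standardize each feature to zero mean and unit variance, then apply the L2 penalty.
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=10.0))
scaled_ridge.fit(X, y)

# Coefficients are now expressed per standard deviation of each feature,
# which makes their magnitudes comparable.
coefs = scaled_ridge.named_steps["ridge"].coef_
for name, coef in zip(["cpu", "qps"], coefs):
    print(f"{name}: {coef:.2f} ms per standard deviation")
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so calling `predict` on a test set applies the same transformation without leaking test statistics.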

#### **Model C: Random Forest Regressor**
An ensemble of decision trees. It is capable of modeling complex, non-linear relationships (like our CPU threshold spike).

In [None]:
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)

print("--- Random Forest Regressor ---")
print(f"MSE: {mean_squared_error(y_test, y_pred_rf):.2f}")
print(f"R2 Score: {r2_score(y_test, y_pred_rf):.4f}")
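
A fitted forest also reports which features drove its splits. The sketch below shows how `feature_importances_` can be read on a small stand-in dataset with a step at 80% CPU (the data, the 400/100 ms levels, and the names `forest` and `importances` are assumptions of the sketch, not the notebook's dataset). On the notebook's model, the same attribute is available as `rf_reg.feature_importances_`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "cpu_percent": rng.uniform(10, 95, n),
    "memory_percent": rng.uniform(20, 80, n),
})
# Illustrative target: latency jumps once CPU exceeds 80%; memory has a small linear effect.
y = (np.where(X["cpu_percent"] > 80, 400.0, 100.0)
     + 0.2 * X["memory_percent"]
     + rng.normal(0, 15, n))

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances: non-negative and normalized to sum to 1.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so `sklearn.inspection.permutation_importance` on held-out data is a useful cross-check.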

### 5. Model Comparison Analysis
Because the synthetic latency grows non-linearly once CPU exceeds 80%, the Linear and Ridge regression models, which can fit only one slope per feature, struggle to predict high-latency events accurately.

Let's plot actual vs. predicted values to see this in action.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 6), sharey=True)

# Plot Linear Regression
axes[0].scatter(y_test, y_pred_lin, alpha=0.5, color='orange')
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
axes[0].set_title("Linear Regression: Actual vs Predicted")
axes[0].set_xlabel("Actual Latency (ms)")
axes[0].set_ylabel("Predicted Latency (ms)")

# Plot Random Forest
axes[1].scatter(y_test, y_pred_rf, alpha=0.5, color='green')
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
axes[1].set_title("Random Forest: Actual vs Predicted")
axes[1].set_xlabel("Actual Latency (ms)")

plt.tight_layout()
plt.show()

# INSIGHT: The Random Forest points should hug the black dashed line (perfect prediction)
# much more tightly, especially for higher latency values, which suggests the model
# captured the non-linear CPU degradation that the linear model cannot represent.
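
The visual impression can also be checked numerically by scoring each model separately below and above the 80% CPU threshold. The sketch below is self-contained and uses a simplified stand-in for the notebook's data (two features only; the `cpu_term` formula, the `rmse` dictionary, and the band labels are assumptions of the sketch), so its numbers will differ from the models trained above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
cpu = rng.uniform(10, 95, n)
connections = rng.poisson(500, n)

# Simplified stand-in for the notebook's data: linear in connections,
# linear in CPU up to 80%, then growing quadratically above it.
cpu_term = np.where(cpu > 80, 80 + 2 * (cpu - 80) ** 2, cpu)
latency = 0.5 * connections + cpu_term + rng.normal(0, 15, n)

X = pd.DataFrame({"cpu_percent": cpu, "connections": connections})
X_train, X_test, y_train, y_test = train_test_split(
    X, latency, test_size=0.2, random_state=42
)

models = {
    "linear": LinearRegression().fit(X_train, y_train),
    "forest": RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train),
}

# Split the test set at the 80% CPU threshold and score each band separately.
high = (X_test["cpu_percent"] > 80).to_numpy()
rmse = {}
for name, model in models.items():
    pred = model.predict(X_test)
    for band, mask in (("cpu<=80", ~high), ("cpu>80", high)):
        rmse[(name, band)] = np.sqrt(mean_squared_error(y_test[mask], pred[mask]))
        print(f"{name:6s} {band:8s} RMSE: {rmse[(name, band)]:.1f} ms")
```

Reporting error per band matters for capacity planning: a single overall score can hide the fact that a model is least accurate exactly where latency is most likely to breach an SLA. Note also that a Random Forest cannot extrapolate beyond the range it was trained on, so predictions for CPU levels above those seen in training should be treated with caution.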