# Dimensionality Reduction for Infrastructure Telemetry

## Context
Modern infrastructure systems emit dozens or even hundreds of telemetry metrics per server (e.g., CPU, Memory, Disk I/O, Network Latency, Cache misses, Thread counts). This high-dimensional data is difficult to visualize and can slow down machine learning models. 

As SREs, we can use dimensionality reduction techniques such as Principal Component Analysis (PCA) and t-SNE to compress these metrics into 2 or 3 components. This lets us visually spot anomalies or clusters of misbehaving servers that would be very hard to see in a raw 50-dimensional dataset.

## Objectives
- Generate a dataset with multiple operational metrics across different server roles (Web, DB, Cache).
- Use PCA (Principal Component Analysis) to reduce the data to 2 dimensions for visualization.
- Understand Explained Variance.
- Compare PCA with t-SNE (t-Distributed Stochastic Neighbor Embedding).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

plt.style.use('ggplot')

### 1. Generating High-Dimensional Telemetry
We will simulate 6 different metrics across 3 types of servers: Web Servers, Database (DB) Servers, and Cache Servers. Each role has a distinct profile of resource usage.

In [None]:
np.random.seed(42)

# 200 servers of each role
n_servers = 200

# Web Servers: High Network, Mod. CPU, Low Disk
web_data = pd.DataFrame({
    'cpu_usage': np.random.normal(60, 10, n_servers),
    'memory_usage': np.random.normal(40, 5, n_servers),
    'disk_io': np.random.normal(50, 10, n_servers),
    'network_in': np.random.normal(800, 100, n_servers),
    'network_out': np.random.normal(900, 120, n_servers),
    'active_connections': np.random.normal(1500, 300, n_servers),
    'role': 'Web'
})

# DB Servers: High Disk, High Memory, Low Network In, Mod Network Out
db_data = pd.DataFrame({
    'cpu_usage': np.random.normal(40, 8, n_servers),
    'memory_usage': np.random.normal(85, 5, n_servers),
    'disk_io': np.random.normal(800, 150, n_servers),
    'network_in': np.random.normal(200, 30, n_servers),
    'network_out': np.random.normal(500, 80, n_servers),
    'active_connections': np.random.normal(100, 20, n_servers),
    'role': 'DB'
})

# Cache Servers: High Memory, Low Disk, Mod Network
cache_data = pd.DataFrame({
    'cpu_usage': np.random.normal(25, 5, n_servers),
    'memory_usage': np.random.normal(90, 3, n_servers),
    'disk_io': np.random.normal(10, 2, n_servers),
    'network_in': np.random.normal(400, 50, n_servers),
    'network_out': np.random.normal(450, 60, n_servers),
    'active_connections': np.random.normal(500, 100, n_servers),
    'role': 'Cache'
})

df = pd.concat([web_data, db_data, cache_data], ignore_index=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # Shuffle reproducibly

# Separate features and labels
X = df.drop('role', axis=1)
y = df['role']

# Standardize the features (CRITICAL for PCA because metrics have wildly different scales)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

df.head()
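
To see why standardization matters, here is a minimal, self-contained sketch. It assumes two independent, made-up metrics (`cpu_usage` and `active_connections`) that are not taken from the dataset above, and compares PCA on raw versus standardized values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two independent, hypothetical metrics on very different scales.
cpu_usage = rng.normal(50, 10, 500)               # standard deviation around 10
active_connections = rng.normal(1500, 300, 500)   # standard deviation around 300
X_demo = np.column_stack([cpu_usage, active_connections])

# PCA on raw values: the large-scale metric dominates the first component.
raw_pca = PCA(n_components=2).fit(X_demo)

# PCA on standardized values: both metrics contribute on equal footing.
scaled_pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X_demo))

print("Raw explained variance ratio:   ", raw_pca.explained_variance_ratio_.round(4))
print("Scaled explained variance ratio:", scaled_pca.explained_variance_ratio_.round(4))
print("Raw PC1 loadings (cpu, connections):", raw_pca.components_[0].round(3))
```

Without scaling, the first component is almost entirely `active_connections`, simply because its numbers are larger. After scaling, the variance is split roughly evenly between the two independent metrics.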

### 2. Principal Component Analysis (PCA)
PCA takes our 6 metrics and creates new "principal components". These components are uncorrelated linear combinations of the original (standardized) metrics, ordered by how much of the data's *variance* each one captures. Variance is used here as a proxy for information.
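
As a rough illustration of what happens under the hood, the sketch below reproduces PCA by hand with an eigen-decomposition of the covariance matrix and checks it against scikit-learn. The data (`X_demo`) is randomly generated for this example only, and scikit-learn itself may use a different solver internally.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical data: 300 samples of 4 correlated features.
latent = rng.normal(size=(300, 4))
mixing = rng.normal(size=(4, 4))
X_demo = latent @ mixing

# 1. Center the data (PCA works on mean-centered features).
X_centered = X_demo - X_demo.mean(axis=0)

# 2. Covariance matrix of the features (4 x 4).
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen-decomposition; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the top 2 eigenvectors.
Z_manual = X_centered @ eigvecs[:, :2]

# Compare with scikit-learn (component signs may differ, so compare magnitudes).
pca_demo = PCA(n_components=2).fit(X_demo)
Z_sklearn = pca_demo.transform(X_demo)
print("Eigenvalues match:", np.allclose(eigvals[:2], pca_demo.explained_variance_))
print("Projections match up to sign:", np.allclose(np.abs(Z_manual), np.abs(Z_sklearn)))
```

Each eigenvector is one principal component (a set of weights over the original features), and its eigenvalue is the variance captured along that direction.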

In [None]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("PCA Transformed shape:", X_pca.shape)
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
print(f"Total Variance Explained by 2 components: {sum(pca.explained_variance_ratio_) * 100:.2f}%")
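
Explained variance can also guide how many components to keep. The following sketch assumes a made-up dataset of 10 metrics driven by 3 hidden factors and a 95% target, both chosen purely for illustration, and finds the smallest number of components that reaches the target.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Hypothetical telemetry: 10 metrics driven by 3 underlying factors plus noise.
factors = rng.normal(size=(500, 3))
weights = rng.normal(size=(3, 10))
X_demo = factors @ weights + 0.1 * rng.normal(size=(500, 10))
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components so the full variance spectrum is available.
pca_full = PCA().fit(X_demo_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the target.
target = 0.95  # assumed threshold, tune to your needs
n_needed = int(np.argmax(cumulative >= target) + 1)

print("Cumulative explained variance:", cumulative.round(3))
print(f"Components needed for {target:.0%} variance: {n_needed}")
```

scikit-learn can make a similar selection itself when `n_components` is given as a float between 0 and 1, but computing the cumulative sum explicitly makes the trade-off easier to inspect.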

#### **Visualizing PCA**
With just two dimensions, we can plot our servers on a 2D graph. If the dimensionality reduction worked, servers of the same role should cluster together based on their underlying telemetry.

In [None]:
plt.figure(figsize=(10, 7))
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=y, palette='Set1', s=60, alpha=0.8)
plt.title('PCA Projection of Server Telemetry (6D -> 2D)')
plt.xlabel(f'Principal Component 1 ({pca.explained_variance_ratio_[0]*100:.1f}%)')
plt.ylabel(f'Principal Component 2 ({pca.explained_variance_ratio_[1]*100:.1f}%)')
plt.legend(title='Server Role')
plt.show()

# SRE INSIGHT: You can clearly see that Web, DB, and Cache servers occupy distinct 
# regions of the component space. If a "Web" server suddenly drifted into the "DB" 
# cluster, you would know immediately it is exhibiting abnormal telemetry!
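
The drift idea in the comment above could be automated along these lines. This is only a sketch: the three-metric fleet, the `nearest_role` helper, and the nearest-centroid rule are assumptions made for illustration, not part of the notebook's dataset or a standard API.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Hypothetical 3-metric telemetry (cpu, disk_io, network_in) for two roles.
web = rng.normal([60, 50, 800], [10, 10, 100], size=(200, 3))
db = rng.normal([40, 800, 200], [8, 150, 30], size=(200, 3))
X_demo = np.vstack([web, db])
roles = np.array(["Web"] * 200 + ["DB"] * 200)

# Fit the scaler and PCA once on the baseline fleet.
scaler_demo = StandardScaler().fit(X_demo)
pca_demo = PCA(n_components=2).fit(scaler_demo.transform(X_demo))
Z = pca_demo.transform(scaler_demo.transform(X_demo))

# Centroid of each role in component space.
centroids = {role: Z[roles == role].mean(axis=0) for role in np.unique(roles)}

def nearest_role(sample):
    """Return the role whose centroid is closest to a raw telemetry sample."""
    z = pca_demo.transform(scaler_demo.transform(np.atleast_2d(sample)))[0]
    return min(centroids, key=lambda role: np.linalg.norm(z - centroids[role]))

# A server labeled "Web" that reports DB-like telemetry.
suspect = np.array([42.0, 780.0, 210.0])
expected_role = "Web"
observed_role = nearest_role(suspect)
if observed_role != expected_role:
    print(f"Drift: expected {expected_role}, telemetry looks like {observed_role}")
```

Because the scaler and PCA are fitted once and then reused through `transform`, new samples land in the same component space as the baseline fleet.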

### 3. Understanding Components
What do these components actually represent? We can look at the PCA `components_` attribute to see how much each original metric contributes to PC1 and PC2.

In [None]:
component_df = pd.DataFrame(pca.components_, columns=X.columns, index=['PC1', 'PC2'])

plt.figure(figsize=(10, 4))
sns.heatmap(component_df, cmap='coolwarm', annot=True, center=0)
plt.title('Feature Contributions to Principal Components')
plt.show()
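
Besides a heatmap, the loadings can be ranked to name the metrics that drive each component. The sketch below uses a small made-up dataset in which `network_in` and `network_out` share a common driver; the column names are borrowed from this notebook, but the values are generated just for this example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)

# Hypothetical data: network_in and network_out share a common driver,
# while cpu_usage varies independently.
traffic = rng.normal(size=400)
demo = pd.DataFrame({
    "network_in": traffic + 0.1 * rng.normal(size=400),
    "network_out": traffic + 0.1 * rng.normal(size=400),
    "cpu_usage": rng.normal(size=400),
})

pca_demo = PCA(n_components=2).fit(StandardScaler().fit_transform(demo))
loadings = pd.DataFrame(pca_demo.components_, columns=demo.columns, index=["PC1", "PC2"])

# Rank metrics by absolute loading: the sign only gives direction.
for pc in loadings.index:
    ranked = loadings.loc[pc].abs().sort_values(ascending=False)
    print(pc, "->", ", ".join(f"{name} ({value:.2f})" for name, value in ranked.items()))

top_two_pc1 = set(loadings.loc["PC1"].abs().nlargest(2).index)
```

Here PC1 should be dominated by the two correlated network metrics, while PC2 mostly reflects `cpu_usage`. The same ranking applied to real telemetry suggests which metrics to inspect first when a server moves along a given component.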

### 4. t-Distributed Stochastic Neighbor Embedding (t-SNE)
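While PCA preserves global structure and is computationally fast, t-SNE is a non-linear technique that focuses on preserving local neighborhoods, so similar data points end up grouped tightly together. It is widely used for visualizing complex datasets. Keep in mind that the size of a cluster and the distance between clusters in a t-SNE plot are not reliable measures, and that the layout depends on the `perplexity` setting.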

In [None]:
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

plt.figure(figsize=(10, 7))
sns.scatterplot(x=X_tsne[:, 0], y=X_tsne[:, 1], hue=y, palette='Set1', s=60, alpha=0.8)  # Same categorical palette as the PCA plot
plt.title('t-SNE Projection of Server Telemetry')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.legend(title='Server Role')
plt.show()

# Note how t-SNE creates very distinct, tight "islands" for each role.
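
Two practical differences from PCA are worth checking before relying on t-SNE operationally. The sketch below uses randomly generated groups (not the telemetry above) to show that the embedding changes with `perplexity`, and that scikit-learn's `TSNE` has no `transform` method for projecting new points.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(5)

# Hypothetical data: three groups of 50 points in 6 dimensions.
centers = rng.normal(scale=10, size=(3, 6))
X_demo = np.vstack([c + rng.normal(size=(50, 6)) for c in centers])

# The same data embedded with two perplexity values; layouts can differ noticeably.
embeddings = {}
for perplexity in (5, 30):
    tsne_demo = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=42)
    embeddings[perplexity] = tsne_demo.fit_transform(X_demo)
    print(f"perplexity={perplexity}: shape={embeddings[perplexity].shape}, "
          f"KL divergence={tsne_demo.kl_divergence_:.3f}")

# PCA learns a reusable projection; scikit-learn's TSNE only offers fit_transform.
print("PCA has transform: ", hasattr(PCA, "transform"))
print("TSNE has transform:", hasattr(TSNE, "transform"))
```

In practice this means a fitted PCA can place a newly observed server into an existing plot, whereas t-SNE has to be refitted on the full dataset, which makes it better suited to exploration than to ongoing monitoring.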