# Intelligent Log Grouping and Clustering

## Objectives
- Transform unstructured system logs (text) into numerical features using **TF-IDF**.
- Apply **K-Means Clustering** to automatically group similar error messages, ignoring dynamic variables like timestamps, IPs, or process IDs.

## Dataset
- A curated list of raw server log lines (synthetic but realistic) simulating a fast-moving production environment.

## Expected Outcome
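- Given a large batch of messy log lines (the target is around 10,000; this notebook uses 100 synthetic ones), the model groups them into a handful of "root cause" categories automatically (5 in the sample data), saving hours of manual `grep` work.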

## Challenge
- Can you apply an "Elbow Method" plot to automatically determine the optimal number of clusters (`k`) for your logs?

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re

pd.set_option('display.max_colwidth', None)

### 1. Generating Raw Log Data
In production, logs contain dynamic data (like IDs or IPs) that confuse simple text-matching. We need to group log lines that are *structurally* similar.

In [None]:
raw_logs = [
    "[2025-02-25 10:14:22] ERROR: Connection timeout calling backend service at 10.0.4.55:8080",
    "[2025-02-25 10:14:23] INFO: User 8943 logged in successfully from 192.168.1.1",
    "[2025-02-25 10:14:25] ERROR: Connection timeout calling backend service at 10.0.4.56:8080",
    "[2025-02-25 10:14:28] WARN: High memory usage detected on node worker-3 (88%)",
    "[2025-02-25 10:14:30] ERROR: OutOfMemoryError: Java heap space thread id 553",
    "[2025-02-25 10:14:31] INFO: User 1221 logged in successfully from 10.0.0.5",
    "[2025-02-25 10:14:32] ERROR: Connection timeout calling backend service at 10.0.4.55:8080",
    "[2025-02-25 10:14:35] ERROR: OutOfMemoryError: Java heap space thread id 992",
    "[2025-02-25 10:14:38] WARN: High memory usage detected on node worker-1 (92%)",
    "[2025-02-25 10:14:40] ERROR: Database connection failed (FATAL: remaining connection slots are reserved for non-replication superuser connections)"
] * 10  # Repeat the 10 sample lines to simulate 100 logs

df = pd.DataFrame({'raw_log': raw_logs})
print(f"Total logs to process: {len(df)}")

### 2. Preprocessing / Log Parsing
Before clustering, it's best practice to drop timestamps and mask numbers and IP addresses with placeholders (`<NUM>`, `<IP>`) so the algorithm focuses on the "skeleton" of the message.

In [None]:
def clean_log(log_line):
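    # Drop the leading bracketed timestamp, e.g. [2025-02-25 10:14:22]
    log_line = re.sub(r'^\[.*?\]', '', log_line)
    # Replace IP addresses (with an optional :port) with a placeholder
    log_line = re.sub(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(:\d+)?\b', '<IP>', log_line)
    # Replace standalone numbers (user IDs, thread IDs, percentages) with a placeholder.
    # The optional % comes after \b so that "88%" is consumed whole.
    log_line = re.sub(r'\b\d+\b%?', '<NUM>', log_line)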
    
    return log_line.strip()

df['cleaned_log'] = df['raw_log'].apply(clean_log)
df[['raw_log', 'cleaned_log']].head(3)
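
A quick way to check that the cleaning step does its job is to count how many distinct templates remain afterwards. The sketch below is self-contained and uses a compact variant of the cleaning rules; the helper name `to_template` and the five sample lines are illustrative assumptions, not part of the notebook's pipeline.

```python
import re
from collections import Counter

def to_template(line):
    """Reduce a raw log line to its structural skeleton (illustrative helper)."""
    line = re.sub(r'^\[.*?\]\s*', '', line)                                  # drop timestamp
    line = re.sub(r'\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b', '<IP>', line)     # mask IP[:port]
    line = re.sub(r'\b\d+\b%?', '<NUM>', line)                               # mask numbers
    return line.strip()

sample = [
    "[2025-02-25 10:14:22] ERROR: Connection timeout calling backend service at 10.0.4.55:8080",
    "[2025-02-25 10:14:25] ERROR: Connection timeout calling backend service at 10.0.4.56:8080",
    "[2025-02-25 10:14:23] INFO: User 8943 logged in successfully from 192.168.1.1",
    "[2025-02-25 10:14:31] INFO: User 1221 logged in successfully from 10.0.0.5",
    "[2025-02-25 10:14:28] WARN: High memory usage detected on node worker-3 (88%)",
]

# Count how often each skeleton occurs after masking the dynamic fields
template_counts = Counter(to_template(line) for line in sample)
for template, count in template_counts.most_common():
    print(count, template)
```

The five raw lines are all different as strings, yet they collapse to three templates once timestamps, IPs, and numbers are masked. Exact string matching on the raw lines would have treated them as five separate messages.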

### 3. Feature Extraction (TF-IDF)
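TF-IDF translates our text into a matrix of numbers, weighting rare words (like "OutOfMemoryError") higher than words that appear in most log lines (like "ERROR"). Very common English words (like "at" or "from") are dropped entirely by `stop_words='english'`, and `max_features=50` caps the vocabulary at the 50 most frequent remaining terms.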

In [None]:
vectorizer = TfidfVectorizer(max_features=50, stop_words='english')
X = vectorizer.fit_transform(df['cleaned_log'])
print(f"TF-IDF Matrix shape: {X.shape}")

### 4. K-Means Clustering
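We group the logs into `k=5` clusters, one for each distinct message template in the sample data. On real logs the right `k` is not known in advance and has to be estimated.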

In [None]:
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
df['cluster'] = kmeans.fit_predict(X)

# Look at the clusters!
grouped = df.groupby('cluster')

for name, group in grouped:
    print(f"\n--- Cluster {name} ({len(group)} logs) ---")
    # Show the first cleaned log in this cluster as an example
    print(group['cleaned_log'].iloc[0])
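
For the Challenge, one possible starting point is to fit K-Means for a range of `k` values and record the inertia (within-cluster sum of squared distances) of each fit. The sketch below is self-contained and makes several assumptions: the five templates are hypothetical stand-ins for the cleaned logs, the range of `k` is arbitrary, and the values are printed rather than plotted.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned templates, repeated to mimic a larger log sample
templates = [
    "ERROR: Connection timeout calling backend service at <IP>",
    "INFO: User <NUM> logged in successfully from <IP>",
    "WARN: High memory usage detected on node worker-<NUM> (<NUM>)",
    "ERROR: OutOfMemoryError: Java heap space thread id <NUM>",
    "ERROR: Database connection failed",
] * 20

X = TfidfVectorizer(stop_words='english').fit_transform(templates)

# Fit one model per candidate k and record its inertia
inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.4f}")
```

Inertia drops as `k` grows and here reaches roughly zero at `k=5`, because the toy data contains exactly five distinct templates. On real logs the curve flattens more gradually, and the "elbow" is the point after which adding clusters yields only small improvements. Passing the keys and values of `inertias` to a line plot (for example with matplotlib) gives the usual elbow chart.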