# Text Embeddings - Semantic Log Analysis

Extract semantic features from log messages to detect:
- **New/unusual error patterns** (messages never seen before)
- **Message clustering** (similar errors grouped together)
- **Semantic anomalies** (messages that don't fit normal patterns)

## Approaches:
1. **TF-IDF**: Traditional bag-of-words (fast, interpretable)
2. **Sentence Embeddings**: Semantic understanding (better for new patterns; not implemented in this notebook)
3. **Message clustering**: Group similar messages, detect outliers

**Why This Matters:**
- Volume features detect **how many** errors
- Text features detect **what kind** of errors
- New error types are often the first sign of serious issues

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.cluster import KMeans
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import re

## Load Data

In [3]:
# Load training data.
df = pd.read_parquet('../data/training_logs.parquet')
df['timestamp'] = pd.to_datetime(df['timestamp'])

print(f"Total logs: {len(df):,}")
print(f"Date range: {df['timestamp'].min()} to {df['timestamp'].max()}")
print(f"\nSample messages:")
df['message'].head(10)

Total logs: 131,812
Date range: 2025-12-15 22:54:14.499491+00:00 to 2025-12-24 22:54:09.185593+00:00

Sample messages:


0                         Cache miss for key: cache_70
1                Slow query detected - duration: 184ms
2    Database query completed - rows: 297, duration...
3                         Cache miss for key: cache_89
4       User session created - session_id: sess_665158
5    API request received - endpoint: /api/payments...
6                 Rate limit approaching for user 9201
7                         Cache miss for key: cache_94
8                          Cache hit for key: cache_58
9       Email notification sent to user249@example.com
Name: message, dtype: object

## Text Preprocessing

In [None]:
def preprocess_log_message(message: str) -> str:
    """
    Normalize a log message so that messages sharing a template map to the same string.

    :param message: Raw log message.
    :return: Lowercased message with variable parts replaced by placeholders.
    """
    message = message.lower()
    # Replace every run of digits.
    # NOTE: because this runs first, the digit-based patterns below (IDs,
    # durations, IP addresses) find nothing left to match, which is why the
    # outputs show `<NUM>` everywhere (e.g. `cache_<NUM>`, `<NUM>ms`).
    message = re.sub(r'\d+', '<NUM>', message)
    
    # Replace common variable patterns.
    message = re.sub(r'user_\d+', 'user_<ID>', message)
    message = re.sub(r'session_\d+', 'session_<ID>', message)
    message = re.sub(r'txn_\d+', 'txn_<ID>', message)
    message = re.sub(r'ord_\d+', 'ord_<ID>', message)
    message = re.sub(r'item_\d+', 'item_<ID>', message)
    message = re.sub(r'cache_\d+', 'cache_<ID>', message)
    
    # Replace timestamps and durations.
    message = re.sub(r'\d+ms', '<DURATION>ms', message)
    message = re.sub(r'\d+s', '<DURATION>s', message)
    # Replace IP addresses and file paths.
    message = re.sub(r'\d+\.\d+\.\d+\.\d+', '<IP>', message)
    message = re.sub(r'/[\w/]+', '<PATH>', message)
    
    return message.strip()


print("Preprocessing log messages...")
df['message_processed'] = df['message'].apply(preprocess_log_message)

print("\nExample preprocessing:")
for i in range(5):
    print(f"\nOriginal: {df['message'].iloc[i]}")
    print(f"Processed: {df['message_processed'].iloc[i]}")

Preprocessing log messages...

Example preprocessing:

Original: Cache miss for key: cache_70
Processed: cache miss for key: cache_<NUM>

Original: Slow query detected - duration: 184ms
Processed: slow query detected - duration: <NUM>ms

Original: Database query completed - rows: 297, duration: 370ms
Processed: database query completed - rows: <NUM>, duration: <NUM>ms

Original: Cache miss for key: cache_89
Processed: cache miss for key: cache_<NUM>

Original: User session created - session_id: sess_665158
Processed: user session created - session_id: sess_<NUM>
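
The order of the substitutions matters: a generic digit rule applied first consumes the digits that the more specific ID, duration, and IP rules rely on. As an illustration only, here is a minimal sketch that applies the specific patterns before the generic one. `normalize_message` and `_RULES` are hypothetical names, and the prefix list is an assumption based on the sample messages, not the notebook's function.

```python
import re

# Hypothetical ordered rules: most specific first, generic digits last.
_RULES = [
    (re.compile(r'\b\d{1,3}(?:\.\d{1,3}){3}\b'), '<IP>'),                  # IPv4-like addresses
    (re.compile(r'\b(user|sess|txn|ord|item|cache)_\d+'), r'\1_<ID>'),     # prefixed identifiers
    (re.compile(r'\b\d+ms\b'), '<DURATION>ms'),                            # millisecond durations
    (re.compile(r'\d+'), '<NUM>'),                                         # any remaining digits
]


def normalize_message(message: str) -> str:
    """Lowercase the message and apply the rules in order."""
    message = message.lower()
    for pattern, replacement in _RULES:
        message = pattern.sub(replacement, message)
    return message.strip()


for raw in [
    "Cache miss for key: cache_70",
    "Slow query detected - duration: 184ms",
    "Rate limit approaching for user 9201",
]:
    print(normalize_message(raw))
```

Switching to this ordering would change the template strings (for example `cache_<ID>` instead of `cache_<NUM>`), so the downstream vocabulary and counts would need to be regenerated.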


## 1. TF-IDF Features


In [5]:
# Create TF-IDF vectorizer.
tfidf = TfidfVectorizer(
    max_features=100,      # Top 100 most important terms.
    min_df=5,              # Term must appear in at least 5 documents.
    max_df=0.8,            # Ignore terms appearing in > 80% of documents.
    ngram_range=(1, 2),    # Unigrams and bigrams.
    stop_words='english'   # Remove common English words.
)

print("Fitting TF-IDF vectorizer...")
tfidf_matrix = tfidf.fit_transform(df['message_processed'])

print(f"\n✓ TF-IDF matrix created")
print(f"  Shape: {tfidf_matrix.shape}")
print(f"  Features: {len(tfidf.get_feature_names_out())}")
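# Note: get_feature_names_out() is sorted alphabetically, so this prints the
# first 20 vocabulary terms, not the 20 highest-weighted ones.
print(f"\nTop 20 terms:")
print(tfidf.get_feature_names_out()[:20])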

Fitting TF-IDF vectorizer...

✓ TF-IDF matrix created
  Shape: (131812, 100)
  Features: 100

Top 20 terms:
['api' 'api request' 'authentication' 'authentication successful' 'cache'
 'cache hit' 'cache_' 'cache_ num' 'com' 'completed' 'completed rows'
 'connection' 'created' 'created session_id' 'created successfully'
 'database' 'database query' 'duration' 'duration num' 'email']


In [7]:
# Reduce dimensions for clustering and visualization.

# Why TruncatedSVD: it operates directly on sparse matrices, and a TF-IDF
# matrix is mostly zeros, so no dense copy is needed.
# This makes it a common choice for text data.

print("\nReducing dimensions with TruncatedSVD...")
svd = TruncatedSVD(n_components=15, random_state=42)
tfidf_reduced = svd.fit_transform(tfidf_matrix)

print(f"✓ Reduced to {tfidf_reduced.shape[1]} dimensions")
print(f"  Explained variance: {svd.explained_variance_ratio_.sum():.2%}")


Reducing dimensions with TruncatedSVD...
✓ Reduced to 15 dimensions
  Explained variance: 90.87%
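
One way to choose `n_components` is to look at the cumulative explained variance and pick the smallest number of components that reaches a target. The sketch below does this on a random sparse matrix; the data, the 0.50 target, and the variable names are assumptions for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix standing in for the TF-IDF matrix.
X = sparse.random(500, 50, density=0.05, format='csr', random_state=0)

svd_demo = TruncatedSVD(n_components=30, random_state=0)
svd_demo.fit(X)

# Cumulative share of variance captured by the first k components.
cumulative = np.cumsum(svd_demo.explained_variance_ratio_)

target_variance = 0.50  # Illustrative target, not taken from the notebook.
# Smallest k whose cumulative variance reaches the target (capped at the fitted size).
k = min(int(np.searchsorted(cumulative, target_variance)) + 1, len(cumulative))

print(f"Components needed for {target_variance:.0%}: {k}")
print(f"Variance captured with all {len(cumulative)} components: {cumulative[-1]:.2%}")
```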


## 2. Message Template Extraction

Identify unique message templates (patterns).

In [None]:
template_counts = df['message_processed'].value_counts()

print(f"Unique message templates: {len(template_counts):,}")
print(f"\nTop 10 most common templates:")
print(template_counts.head(10))

# Calculate message rarity (inverse frequency).
df['message_rarity'] = df['message_processed'].map(
    lambda x: 1.0 / template_counts.get(x, 1)
)

print(f"\n✓ Message rarity calculated")
print(f"  Rare messages (rarity > 0.01): {(df['message_rarity'] > 0.01).sum():,}")

Unique message templates: 75

Top 10 most common templates:
message_processed
database query completed - rows: <NUM>, duration: <NUM>ms              8678
payment processed - transaction_id: txn_<NUM>, amount: $<NUM>.<NUM>    8549
inventory updated - item_id: item_<NUM>, quantity: <NUM>               8549
email notification sent to user<NUM>@example.com                       8525
cache hit for key: cache_<NUM>                                         8515
request processed successfully - user_id: <NUM>, duration: <NUM>ms     8497
order created successfully - order_id: ord_<NUM>                       8477
authentication successful for user <NUM>                               8420
user session created - session_id: sess_<NUM>                          8384
query plan: index_scan                                                 5284
Name: count, dtype: int64

✓ Message rarity calculated
  Rare messages (rarity > 0.01): 758
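
Rarity is `1 / count`, so a threshold of 0.01 selects templates seen fewer than 100 times, and a template that was never seen gets the maximum rarity of 1.0 through the `get(x, 1)` default. The sketch below shows the same lookup on toy data; `rarity_of` is a hypothetical helper.

```python
import pandas as pd

# Toy training templates standing in for df['message_processed'].
train_templates = pd.Series([
    "cache miss for key: cache_<NUM>",
    "cache miss for key: cache_<NUM>",
    "cache miss for key: cache_<NUM>",
    "cache miss for key: cache_<NUM>",
    "potential sql injection attempt in query",
])

counts = train_templates.value_counts()
rarity_lookup = (1.0 / counts).to_dict()


def rarity_of(template: str, lookup: dict, unseen: float = 1.0) -> float:
    """Return 1/count for known templates and `unseen` for templates never observed."""
    return lookup.get(template, unseen)


print(rarity_of("cache miss for key: cache_<NUM>", rarity_lookup))
print(rarity_of("potential sql injection attempt in query", rarity_lookup))
print(rarity_of("disk failure on volume <NUM>", rarity_lookup))
```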


## 3. Find Optimal Number of Clusters - Elbow Method



In [None]:
# There are 75 unique message templates (see the template count above).
# Beyond k=20 the silhouette score changes little,
# so k values of 20-25 give a reasonable cluster representation.
# Larger k values mostly overfit the data.

# Set number of clusters based on analysis.
n_clusters = 20  # Optimal based on elbow method and silhouette score.

print(f"Clustering messages into {n_clusters} groups...")
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
df['message_cluster'] = kmeans.fit_predict(tfidf_reduced)

print(f"\n✓ Messages clustered")
print(f"\nCluster distribution:")
print(df['message_cluster'].value_counts().sort_index())

from sklearn.metrics import silhouette_score

silhouette = silhouette_score(tfidf_reduced, df['message_cluster'], sample_size=10000)

print(f"\nClustering Quality Metrics:")
print(f"  Silhouette Score: {silhouette:.4f} (higher is better, range: -1 to 1)")


Clustering messages into 20 groups...

✓ Messages clustered

Cluster distribution:
message_cluster
0      8678
1     13590
2      2171
3      8561
4      8420
5      8641
6      8477
7      8384
8      8525
9      5225
10     8549
11     2274
12     5156
13     5284
14     5240
15     8497
16     4378
17     5255
18     4305
19     2202
Name: count, dtype: int64

Clustering Quality Metrics:
  Silhouette Score: 0.9514 (higher is better, range: -1 to 1)
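
The sweep that led to `n_clusters = 20` is not shown in the notebook. A minimal sketch of such a sweep, assuming synthetic blob data in place of `tfidf_reduced` and an illustrative range of k, could record inertia (for the elbow plot) and the silhouette score for each candidate:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data standing in for tfidf_reduced.
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.5, random_state=42)

results = []  # (k, inertia, silhouette)
for k in range(2, 9):
    model = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = model.fit_predict(X)
    results.append((k, model.inertia_, silhouette_score(X, labels)))

for k, inertia, sil in results:
    print(f"k={k}  inertia={inertia:10.2f}  silhouette={sil:.4f}")

# Candidate k with the highest silhouette score.
best_k = max(results, key=lambda r: r[2])[0]
print(f"Best k by silhouette: {best_k}")
```

On the full dataset, passing `sample_size` to `silhouette_score` (as the cell above does) keeps the sweep affordable. Inertia always tends to fall as k grows, so the elbow is read from where the decrease flattens rather than from the minimum.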


In [None]:
# Examine clusters.
print("\nSample messages from each cluster:")
for cluster_id in range(min(5, n_clusters)):  # Show first 5 clusters.
    print(f"\n{'='*60}")
    print(f"Cluster {cluster_id} ({(df['message_cluster'] == cluster_id).sum()} messages):")
    print('='*60)
    cluster_samples = df[df['message_cluster'] == cluster_id]['message'].head(3)
    for i, msg in enumerate(cluster_samples, 1):
        print(f"  {i}. {msg}")


Sample messages from each cluster:

Cluster 0 (8678 messages):
  1. Database query completed - rows: 297, duration: 370ms
  2. Database query completed - rows: 854, duration: 460ms
  3. Database query completed - rows: 813, duration: 236ms

Cluster 1 (13590 messages):
  1. Cache miss for key: cache_70
  2. Cache miss for key: cache_89
  3. Cache miss for key: cache_94

Cluster 2 (2171 messages):
  1. Invalid request - malformed JSON in request body
  2. Unhandled exception in request handler - ValueError
  3. Invalid request - malformed JSON in request body

Cluster 3 (8561 messages):
  1. Payment processed - transaction_id: txn_757937, amount: $649.88
  2. Payment processed - transaction_id: txn_789275, amount: $406.78
  3. Payment processed - transaction_id: txn_404309, amount: $945.93

Cluster 4 (8420 messages):
  1. Authentication successful for user 4872
  2. Authentication successful for user 4571
  3. Authentication successful for user 3522


## 4. Detect New/Unusual Messages

Calculate distance to cluster centroid to find anomalous messages.

In [21]:
# Calculate distance to nearest cluster centroid.
from sklearn.metrics import pairwise_distances

print("Calculating distances to cluster centroids...")
distances = pairwise_distances(
    tfidf_reduced,
    kmeans.cluster_centers_,
    metric='euclidean'
)

print(distances)  # Distance from each data point to every cluster centroid.

# Row indices (first array) paired with each row's assigned cluster, which selects the column.
print(np.arange(len(df)), df['message_cluster'])
# Distance to assigned cluster.
df['message_distance'] = distances[np.arange(len(df)), df['message_cluster']]

# Normalize distances (0-1 scale).
df['message_anomaly_score'] = (
    (df['message_distance'] - df['message_distance'].min()) /
    (df['message_distance'].max() - df['message_distance'].min())
)

print(f"\n✓ Anomaly scores calculated")
print(f"\nAnomaly score distribution:")
print(df['message_anomaly_score'].describe())

Calculating distances to cluster centroids...
[[1.33163816 0.05078051 0.97560232 ... 1.31107108 0.8980274  1.17177735]
 [0.50693147 1.18326762 0.70250584 ... 1.21254417 0.72602642 1.05658301]
 [0.         1.36346655 1.09605447 ... 1.38894149 0.9880653  1.24604773]
 ...
 [1.34554386 1.37051914 1.07841254 ... 1.39586417 1.00487265 1.25564818]
 [1.39658715 1.36000995 1.05908108 ... 1.38015281 0.99127902 0.56440214]
 [1.40432665 1.37041784 0.88862382 ... 1.3957661  1.00148709 1.25549781]]
[     0      1      2 ... 131809 131810 131811] 0          1
1         16
2          0
3          1
4          7
          ..
131807     6
131808    17
131809    13
131810     9
131811     5
Name: message_cluster, Length: 131812, dtype: int32

✓ Anomaly scores calculated

Anomaly score distribution:
count    1.318120e+05
mean     5.513946e-02
std      1.379494e-01
min      0.000000e+00
25%      0.000000e+00
50%      2.318704e-08
75%      1.585359e-02
max      1.000000e+00
Name: message_anomaly_score, dtyp
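
The indexing step `distances[np.arange(n), labels]` picks one entry per row: the distance to that row's own cluster. A small worked example with made-up numbers, followed by the same min-max scaling:

```python
import numpy as np

# Toy distance matrix: 4 points (rows) x 3 centroids (columns).
distances = np.array([
    [0.10, 0.90, 0.50],
    [0.80, 0.20, 0.70],
    [0.60, 0.40, 0.00],
    [0.30, 0.95, 0.90],
])
labels = np.array([0, 1, 2, 0])  # Assigned cluster per point.

# Pair row i with column labels[i] to get the distance to the assigned centroid.
assigned = distances[np.arange(len(labels)), labels]

# Min-max scale to [0, 1]; guard against a zero range.
span = assigned.max() - assigned.min()
scores = (assigned - assigned.min()) / span if span > 0 else np.zeros_like(assigned)

print(assigned)
print(scores)
```

Because the scaling uses the minimum and maximum of the data being scored, scores from different datasets are only comparable if the same minimum and maximum are reused.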

In [None]:
print("\nTop 10 most anomalous messages (unusual patterns):")
print("="*60)
anomalous = df.nlargest(10, 'message_anomaly_score')[['message', 'message_anomaly_score', 'level']]
for idx, row in anomalous.iterrows():
    print(f"\nScore: {row['message_anomaly_score']:.4f} | Level: {row['level']}")
    print(f"Message: {row['message']}")


Top 10 most anomalous messages (unusual patterns):

Score: 1.0000 | Level: WARN
Message: Potential SQL injection attempt in query

Score: 1.0000 | Level: WARN
Message: Potential SQL injection attempt in query

Score: 1.0000 | Level: WARN
Message: Potential SQL injection attempt in query

Score: 1.0000 | Level: WARN
Message: Potential SQL injection attempt in query

Score: 1.0000 | Level: WARN
Message: Potential SQL injection attempt in query

Score: 1.0000 | Level: ERROR
Message: Potential SQL injection attempt in query

Score: 1.0000 | Level: WARN
Message: Potential SQL injection attempt in query

Score: 1.0000 | Level: WARN
Message: Potential SQL injection attempt in query

Score: 0.9202 | Level: WARN
Message: Thread pool exhausted - queue size: 42534

Score: 0.9202 | Level: WARN
Message: Thread pool exhausted - queue size: 19151
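
The top 10 is dominated by repeats of the same template. To see more distinct patterns, one option is to keep only the highest-scoring row per template before ranking. A sketch on toy rows (the column names and the 0.0 score are illustrative):

```python
import pandas as pd

# Toy rows; 'template' plays the role of message_processed.
scored = pd.DataFrame({
    'message': [
        "Potential SQL injection attempt in query",
        "Potential SQL injection attempt in query",
        "Thread pool exhausted - queue size: 42534",
        "Thread pool exhausted - queue size: 19151",
        "Cache miss for key: cache_70",
    ],
    'template': [
        "potential sql injection attempt in query",
        "potential sql injection attempt in query",
        "thread pool exhausted - queue size: <NUM>",
        "thread pool exhausted - queue size: <NUM>",
        "cache miss for key: cache_<NUM>",
    ],
    'score': [1.0, 1.0, 0.9202, 0.9202, 0.0],
})

# Sort by score, then keep the first (highest-scoring) row of each template.
top_unique = (
    scored.sort_values('score', ascending=False)
          .drop_duplicates(subset='template')
          .head(3)
)
print(top_unique[['template', 'score']])
```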


## 5. Visualize Message Embeddings

In [None]:
# Further reduce to 2D so we can see them.
from sklearn.manifold import TSNE

print("Reducing to 2D with t-SNE (this may take a minute)...")

sample_size = min(5000, len(df))
rng = np.random.default_rng(42)  # Seeded so the sampled subset is reproducible.
sample_idx = rng.choice(len(df), sample_size, replace=False)

tsne = TSNE(n_components=2, random_state=42, perplexity=30)
embeddings_2d = tsne.fit_transform(tfidf_reduced[sample_idx])

print(f"✓ 2D embeddings created for {sample_size} samples")

Reducing to 2D with t-SNE (this may take a minute)...
✓ 2D embeddings created for 5000 samples


In [None]:
sample_df = df.iloc[sample_idx].copy()
sample_df['x'] = embeddings_2d[:, 0]
sample_df['y'] = embeddings_2d[:, 1]

fig = go.Figure()

for cluster_id in sample_df['message_cluster'].unique():
    cluster_data = sample_df[sample_df['message_cluster'] == cluster_id]
    fig.add_trace(
        go.Scatter(
            x=cluster_data['x'],
            y=cluster_data['y'],
            mode='markers',
            name=f'Cluster {cluster_id}',
            marker=dict(size=5, opacity=0.6),
            text=cluster_data['message'].str[:50],  # Show first 50 chars on hover.
            hovertemplate='%{text}<extra></extra>'
        )
    )

fig.update_layout(
    title='Message Embeddings - t-SNE Visualization (Colored by Cluster)',
    xaxis_title='t-SNE Dimension 1',
    yaxis_title='t-SNE Dimension 2',
    height=700,
    showlegend=True
)
fig.show()

In [None]:
fig = go.Figure()

level_colors = {
    'DEBUG': 'lightblue',
    'INFO': 'green',
    'WARN': 'orange',
    'ERROR': 'red',
    'FATAL': 'darkred'
}

for level in sample_df['level'].unique():
    level_data = sample_df[sample_df['level'] == level]
    fig.add_trace(
        go.Scatter(
            x=level_data['x'],
            y=level_data['y'],
            mode='markers',
            name=level,
            marker=dict(
                size=5,
                color=level_colors.get(level, 'gray'),
                opacity=0.6
            ),
            text=level_data['message'].str[:50],
            hovertemplate='%{text}<extra></extra>'
        )
    )

fig.update_layout(
    title='Message Embeddings - Colored by Log Level',
    xaxis_title='t-SNE Dimension 1',
    yaxis_title='t-SNE Dimension 2',
    height=700
)
fig.show()

print("\nInterpretation:")
print("  - Tight clusters = similar message patterns")
print("  - Outliers = unusual/new message patterns")
print("  - ERROR/FATAL in separate regions = distinct error patterns")


Interpretation:
  - Tight clusters = similar message patterns
  - Outliers = unusual/new message patterns
  - ERROR/FATAL in separate regions = distinct error patterns


## 6. Time-Based Message Analysis

In [None]:
df_sorted = df.sort_values('timestamp')
df_sorted['is_new_template'] = ~df_sorted['message_processed'].duplicated()

# Aggregate by time window.
time_agg = df_sorted.set_index('timestamp').resample('1h').agg({
    'is_new_template': 'sum',
    'message_anomaly_score': 'mean',
    'message_rarity': 'mean'
}).reset_index()

time_agg.columns = ['timestamp', 'new_templates', 'avg_anomaly_score', 'avg_rarity']

print("Time-based message metrics calculated")
time_agg.head(10)

Time-based message metrics calculated


                   timestamp  new_templates  avg_anomaly_score  avg_rarity
0  2025-12-15 22:00:00+00:00             26           0.045817    0.000313
1  2025-12-15 23:00:00+00:00             11           0.051718     0.00031
2  2025-12-16 00:00:00+00:00              7           0.070665    0.003853
3  2025-12-16 01:00:00+00:00              0           0.046908    0.000287
4  2025-12-16 02:00:00+00:00              1           0.056516    0.001325
5  2025-12-16 03:00:00+00:00             14           0.101209    0.004405
6  2025-12-16 04:00:00+00:00              0           0.049766    0.000263
7  2025-12-16 05:00:00+00:00              0           0.059963    0.000316
8  2025-12-16 06:00:00+00:00              0           0.060595    0.000353
9  2025-12-16 07:00:00+00:00              0           0.048302    0.000299
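
The `is_new_template` flag marks the first occurrence of each template in time order, which is why the earliest hours show the largest counts: every template is "new" the first time it appears. A small worked example with toy timestamps and single-letter template names:

```python
import pandas as pd

# Toy log with three templates (A, B, C) over three hours.
logs = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2025-12-15 10:05', '2025-12-15 10:20', '2025-12-15 10:40',
        '2025-12-15 11:10', '2025-12-15 11:30', '2025-12-15 12:15',
    ], utc=True),
    'template': ['A', 'B', 'A', 'C', 'B', 'A'],
}).sort_values('timestamp')

# True only for the first time each template is seen.
logs['is_new_template'] = ~logs['template'].duplicated()

# Count first occurrences per hour.
hourly_new = logs.set_index('timestamp')['is_new_template'].resample('1h').sum()
print(hourly_new)
```

Here A and B first appear in the 10:00 hour, C in the 11:00 hour, and nothing new appears at 12:00.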


In [None]:
fig = make_subplots(
    rows=2, cols=1,
    subplot_titles=(
        'New Message Templates per Hour',
        'Average Message Anomaly Score per Hour'
    ),
    vertical_spacing=0.15
)

fig.add_trace(
    go.Scatter(
        x=time_agg['timestamp'],
        y=time_agg['new_templates'],
        mode='lines+markers',
        name='New Templates',
        line=dict(color='blue', width=2),
        marker=dict(size=4)
    ),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(
        x=time_agg['timestamp'],
        y=time_agg['avg_anomaly_score'],
        mode='lines+markers',
        name='Avg Anomaly Score',
        line=dict(color='red', width=2),
        marker=dict(size=4)
    ),
    row=2, col=1
)

fig.update_xaxes(title_text='Timestamp', row=2, col=1)
fig.update_yaxes(title_text='New Templates', row=1, col=1)
fig.update_yaxes(title_text='Anomaly Score', row=2, col=1)
fig.update_layout(height=700, showlegend=False)
fig.show()

print("\nInterpretation:")
print("  - Spike in new templates = new error types appearing")
print("  - High avg anomaly score = messages becoming more unusual")
print("  - Both spiking = strong indicator of new anomaly patterns")


Interpretation:
  - Spike in new templates = new error types appearing
  - High avg anomaly score = messages becoming more unusual
  - Both spiking = strong indicator of new anomaly patterns


## 7. Aggregate Text Features per Window

In [None]:
# Create window-level text features (30s windows to match other features).
text_features = df.set_index('timestamp').resample('30s').agg({
    'message_rarity': 'mean',
    'message_anomaly_score': 'mean',
    'message_cluster': lambda x: x.mode()[0] if len(x) > 0 else -1  # Most common cluster.
}).reset_index()

# Add TF-IDF features (average per window).
tfidf_df = pd.DataFrame(
    tfidf_reduced,
    columns=[f'tfidf_{i}' for i in range(tfidf_reduced.shape[1])]
)
tfidf_df['timestamp'] = df['timestamp'].values

tfidf_window = tfidf_df.set_index('timestamp').resample('30s').mean().reset_index()

text_features['timestamp'] = pd.to_datetime(text_features['timestamp']).dt.tz_localize(None)
tfidf_window['timestamp'] = pd.to_datetime(tfidf_window['timestamp']).dt.tz_localize(None)
text_features = text_features.merge(tfidf_window, on='timestamp', how='left')

print(f"✓ Window-level text features created")
print(f"  Shape: {text_features.shape}")
print(f"  Features: {text_features.columns.tolist()}")
text_features.head()

✓ Window-level text features created
  Shape: (25921, 19)
  Features: ['timestamp', 'message_rarity', 'message_anomaly_score', 'message_cluster', 'tfidf_0', 'tfidf_1', 'tfidf_2', 'tfidf_3', 'tfidf_4', 'tfidf_5', 'tfidf_6', 'tfidf_7', 'tfidf_8', 'tfidf_9', 'tfidf_10', 'tfidf_11', 'tfidf_12', 'tfidf_13', 'tfidf_14']


Unnamed: 0,timestamp,message_rarity,message_anomaly_score,message_cluster,tfidf_0,tfidf_1,tfidf_2,tfidf_3,tfidf_4,tfidf_5,tfidf_6,tfidf_7,tfidf_8,tfidf_9,tfidf_10,tfidf_11,tfidf_12,tfidf_13,tfidf_14
0,2025-12-15 22:54:00,0.00026,0.134753,0,0.491954,0.296805,-0.015575,-0.100311,-0.068479,-0.075225,0.004172,0.052505,-0.014076,0.185758,-0.019513,-0.026326,-4.196363e-11,-0.059728,0.000366
1,2025-12-15 22:54:30,0.000264,0.094893,1,0.029483,0.356172,0.211111,0.028178,0.113689,0.17619,-0.004748,0.149032,-0.03007,0.019358,-0.014833,-0.001962,-9.281976e-10,-0.005273,0.129338
2,2025-12-15 22:55:00,0.000272,0.0111,8,0.005052,0.160604,0.334082,-0.022194,-0.085024,0.015996,0.02443,-0.121851,-0.119422,0.002459,-0.011482,0.002484,0.3333333,-0.000151,-0.066615
3,2025-12-15 22:55:30,0.000268,0.117717,16,0.420281,0.010516,0.002337,-0.051306,0.104903,-0.077468,-0.029707,-0.068632,0.015042,0.139515,0.188312,-0.041648,-1.941217e-11,-0.050769,0.001923
4,2025-12-15 22:56:00,0.000264,0.048931,0,0.183787,0.061061,0.070379,-0.036278,0.07167,-0.054243,-0.028855,0.168491,-0.056872,0.165562,0.12426,-0.051269,-9.018995e-11,0.131728,-0.012412


## 8. Save Text Features

In [30]:
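import os

# Save text features (create the output directory if it does not exist yet).
output_path = '../data/features/text_features.parquet'
os.makedirs(os.path.dirname(output_path), exist_ok=True)
text_features.to_parquet(output_path, index=False)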

print(f"✓ Text features saved: {output_path}")
print(f"  Shape: {text_features.shape}")

# Also save per-log text features for later analysis.
log_text_features = df[[
    'timestamp', 'message', 'message_processed', 'message_rarity',
    'message_anomaly_score', 'message_cluster'
]]
log_output = '../data/features/log_text_features.parquet'
log_text_features.to_parquet(log_output, index=False)

print(f"✓ Per-log text features saved: {log_output}")
print(f"  Shape: {log_text_features.shape}")

✓ Text features saved: ../data/features/text_features.parquet
  Shape: (25921, 19)
✓ Per-log text features saved: ../data/features/log_text_features.parquet
  Shape: (131812, 6)


## Summary

### Text Features Calculated:

**1. Message Templates:**
- Extracted unique message patterns
- Calculated message rarity (inverse frequency)
- Tracked new templates over time

**2. TF-IDF Embeddings:**
- 100-dimensional TF-IDF vectors
- Reduced to 15 dimensions with TruncatedSVD
- Captures important terms in messages

**3. Message Clustering:**
- Grouped similar messages into 20 clusters
- Calculated distance to cluster centroid
- Anomaly score = min-max normalized distance to the assigned cluster centroid

**4. Window-Level Features:**
- `message_rarity`: Average rarity of messages in window
- `message_anomaly_score`: Average semantic anomaly score
- `message_cluster`: Most common cluster in window
- `tfidf_0...tfidf_14`: Average SVD-reduced TF-IDF components

### Why These Matter:

**1. New Error Detection:**
- High `message_rarity` = unusual error messages
- Spike in new templates = new failure modes

**2. Semantic Anomalies:**
- High `message_anomaly_score` = messages don't match learned patterns
- Detects errors with unusual wording/structure

**3. Error Pattern Shifts:**
- Changes in cluster distribution = different error types
- TF-IDF changes = different terminology appearing

**4. Complements Volume Features:**
- Volume features: **how many** errors
- Text features: **what kind** of errors
- Together: broader anomaly coverage than either feature set alone

### Example Use Cases:

**Case 1: New Database Error**
```
Volume: normal (only 5 errors)
Text: HIGH anomaly score (never seen this error before)
→ Anomaly detected! New database connection issue.
```

**Case 2: Known Error Spike**
```
Volume: HIGH (1000 errors)
Text: low anomaly score (familiar "insufficient funds" errors)
→ Anomaly detected! Spike in known error type.
```

**Case 3: Silent Failure**
```
Volume: normal
Text: HIGH rarity, many new templates
→ Anomaly detected! System producing unusual messages.
```

## 9. Export Text Processing Artifacts for Production Pipeline

Export the trained TF-IDF vectorizer, SVD reducer, KMeans clusterer, and message rarity lookup for the production inference pipeline.

In [32]:
import pickle
from pathlib import Path

# Create models directory.
model_dir = Path('../models')
model_dir.mkdir(parents=True, exist_ok=True)

print("Exporting text processing artifacts for production pipeline...")
print("="*60)

# Export TF-IDF vectorizer.
with open(model_dir / 'tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf, f)
print(f"✓ TF-IDF vectorizer saved to {model_dir / 'tfidf_vectorizer.pkl'}")

# Export TruncatedSVD reducer.
with open(model_dir / 'svd_reducer.pkl', 'wb') as f:
    pickle.dump(svd, f)
print(f"✓ SVD reducer saved to {model_dir / 'svd_reducer.pkl'}")

# Export KMeans clusterer.
with open(model_dir / 'message_clusters.pkl', 'wb') as f:
    pickle.dump(kmeans, f)
print(f"✓ KMeans clusterer saved to {model_dir / 'message_clusters.pkl'}")

# Export message rarity lookup (template -> 1 / count).
message_rarity_lookup = {
    template: 1.0 / count for template, count in template_counts.items()
}

with open(model_dir / 'message_rarity.pkl', 'wb') as f:
    pickle.dump(message_rarity_lookup, f)
print(f"✓ Message rarity lookup saved to {model_dir / 'message_rarity.pkl'}")



Exporting text processing artifacts for production pipeline...
✓ TF-IDF vectorizer saved to ../models/tfidf_vectorizer.pkl
✓ SVD reducer saved to ../models/svd_reducer.pkl
✓ KMeans clusterer saved to ../models/message_clusters.pkl
✓ Message rarity lookup saved to ../models/message_rarity.pkl
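
On the inference side, the exported objects would be loaded and applied in the same order as in training: vectorize, reduce, then assign a cluster and measure the distance to its centroid. The sketch below is an assumption about how that could look, not the production pipeline itself. It fits tiny stand-in models on a toy corpus, round-trips them through pickle in a temporary directory, and scores one message with a hypothetical `score_message` helper.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of already-preprocessed messages.
corpus = [
    "cache miss for key cache_<NUM>",
    "cache hit for key cache_<NUM>",
    "cache miss for key cache_<NUM>",
    "payment processed amount <NUM>",
    "payment processed amount <NUM>",
    "payment failed amount <NUM>",
]

# Stand-ins for the trained tfidf, svd and kmeans objects.
tfidf_demo = TfidfVectorizer().fit(corpus)
svd_demo = TruncatedSVD(n_components=3, random_state=42).fit(tfidf_demo.transform(corpus))
kmeans_demo = KMeans(n_clusters=2, random_state=42, n_init=10).fit(
    svd_demo.transform(tfidf_demo.transform(corpus))
)


def score_message(message, vectorizer, reducer, clusterer):
    """Return (cluster id, Euclidean distance to that cluster's centroid)."""
    vec = reducer.transform(vectorizer.transform([message]))
    cluster = int(clusterer.predict(vec)[0])
    distance = float(np.linalg.norm(vec[0] - clusterer.cluster_centers_[cluster]))
    return cluster, distance


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Save and reload each artifact, mirroring the export step.
    loaded = {}
    for name, obj in [('tfidf', tfidf_demo), ('svd', svd_demo), ('kmeans', kmeans_demo)]:
        with open(tmp_dir / f'{name}.pkl', 'wb') as f:
            pickle.dump(obj, f)
        with open(tmp_dir / f'{name}.pkl', 'rb') as f:
            loaded[name] = pickle.load(f)

cluster_id, distance = score_message(
    "cache miss for key cache_<NUM>", loaded['tfidf'], loaded['svd'], loaded['kmeans']
)
print(f"cluster={cluster_id}, distance={distance:.4f}")
```

New messages need the same preprocessing as the training data before scoring. Pickled scikit-learn objects are generally only safe to load with the same library version that produced them, and only from trusted sources.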
