# Text Embeddings - Semantic Log Analysis

Extract semantic features from log messages to detect:
- **New/unusual error patterns** (messages never seen before)
- **Message clustering** (similar errors grouped together)
- **Semantic anomalies** (messages that don't fit normal patterns)

## Approaches:
1. **TF-IDF**: Traditional bag-of-words (fast, interpretable)
2. **Sentence Embeddings**: Semantic understanding (better for new patterns)
3. **Message clustering**: Group similar messages, detect outliers

**Why This Matters:**
- Volume features detect **how many** errors
- Text features detect **what kind** of errors
- New error types = often the first sign of serious issues

In [199]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.cluster import KMeans
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import re

## Load Data

In [200]:
# Load training data.
df = pd.read_parquet('../data/training_logs.parquet')
df['timestamp'] = pd.to_datetime(df['timestamp'])

print(f"Total logs: {len(df):,}")
print(f"Date range: {df['timestamp'].min()} to {df['timestamp'].max()}")
print(f"\nSample messages:")
df['message'].head(10)

Total logs: 131,935
Date range: 2025-12-17 23:44:21.030517+00:00 to 2025-12-26 23:44:14.821532+00:00

Sample messages:


0                         Cache miss for key: cache_70
1                Slow query detected - duration: 184ms
2    Database query completed - rows: 297, duration...
3                         Cache miss for key: cache_89
4       User session created - session_id: sess_665158
5    API request received - endpoint: /api/payments...
6                 Rate limit approaching for user 9201
7                         Cache miss for key: cache_94
8                          Cache hit for key: cache_58
9       Email notification sent to user249@example.com
Name: message, dtype: object

## Text Preprocessing

In [201]:
def preprocess_log_message(message: str) -> str:
    """
    Normalize a raw log message before vectorization.

    Lowercases the text and masks variable parts (numbers, paths) so that
    messages produced by the same template map to the same string.

    :param message: Raw log message.
    :return: Normalized message.
    """
    message = message.lower()
    # NOTE: this generic rule runs first and consumes every digit run, so the
    # ID, duration, and IP rules below never match (hence `cache_<NUM>` and
    # `<NUM>ms` in the output). To get <ID>/<DURATION>/<IP> placeholders,
    # this rule would have to run last.
    message = re.sub(r'\d+', '<NUM>', message)

    # Replace common variable patterns.
    message = re.sub(r'user_\d+', 'user_<ID>', message)
    message = re.sub(r'session_\d+', 'session_<ID>', message)
    message = re.sub(r'txn_\d+', 'txn_<ID>', message)
    message = re.sub(r'ord_\d+', 'ord_<ID>', message)
    message = re.sub(r'item_\d+', 'item_<ID>', message)
    message = re.sub(r'cache_\d+', 'cache_<ID>', message)

    # Replace durations.
    message = re.sub(r'\d+ms', '<DURATION>ms', message)
    message = re.sub(r'\d+s', '<DURATION>s', message)
    # Replace IP addresses and file paths.
    message = re.sub(r'\d+\.\d+\.\d+\.\d+', '<IP>', message)
    message = re.sub(r'/[\w/]+', '<PATH>', message)

    return message.strip()


print("Preprocessing log messages...")
df['message_processed'] = df['message'].apply(preprocess_log_message)

print("\nExample preprocessing:")
for i in range(5):
    print(f"\nOriginal: {df['message'].iloc[i]}")
    print(f"Processed: {df['message_processed'].iloc[i]}")

Preprocessing log messages...

Example preprocessing:

Original: Cache miss for key: cache_70
Processed: cache miss for key: cache_<NUM>

Original: Slow query detected - duration: 184ms
Processed: slow query detected - duration: <NUM>ms

Original: Database query completed - rows: 297, duration: 370ms
Processed: database query completed - rows: <NUM>, duration: <NUM>ms

Original: Cache miss for key: cache_89
Processed: cache miss for key: cache_<NUM>

Original: User session created - session_id: sess_665158
Processed: user session created - session_id: sess_<NUM>
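
The processed examples above show `cache_<NUM>` and `<NUM>ms`, which suggests the generic digit rule fires before the more specific ID and duration rules. A minimal sketch of an order-aware variant follows, assuming a hypothetical `MASK_RULES` table and `mask_message` helper (neither is part of this notebook). It applies the specific patterns first and the generic number rule last. Switching to it would change the template strings and every count derived from them, so treat it as an illustration only.

```python
import re

# Hypothetical rule table: most specific patterns first, generic digits last.
MASK_RULES = [
    (re.compile(r'\d+\.\d+\.\d+\.\d+'), '<IP>'),
    (re.compile(r'\b(cache|item|ord|txn|sess)_\d+'), r'\1_<ID>'),
    (re.compile(r'\d+ms\b'), '<DURATION>ms'),
    (re.compile(r'\d+'), '<NUM>'),
]


def mask_message(message: str) -> str:
    """Lowercase the message, then apply each masking rule in order."""
    message = message.lower()
    for pattern, replacement in MASK_RULES:
        message = pattern.sub(replacement, message)
    return message.strip()


print(mask_message("Cache miss for key: cache_70"))
print(mask_message("Slow query detected - duration: 184ms"))
print(mask_message("Connection from 10.0.0.1 failed"))
```

Because each rule only sees what earlier rules left behind, the order of the table is the design decision that matters here.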


## 1. TF-IDF Features


In [202]:
# Create TF-IDF vectorizer.
tfidf = TfidfVectorizer(
    max_features=100,      # Keep the 100 most frequent terms in the corpus.
    min_df=5,              # Term must appear in at least 5 documents.
    max_df=0.8,            # Ignore terms appearing in > 80% of documents.
    ngram_range=(1, 2),    # Unigrams and bigrams.
    stop_words='english'   # Remove common English words.
)

print("Fitting TF-IDF vectorizer...")
tfidf_matrix = tfidf.fit_transform(df['message_processed'])

print(f"\n✓ TF-IDF matrix created")
print(f"  Shape: {tfidf_matrix.shape}")
print(f"  Features: {len(tfidf.get_feature_names_out())}")
print(f"\nFirst 20 terms (alphabetical):")
print(tfidf.get_feature_names_out()[:20])

Fitting TF-IDF vectorizer...

✓ TF-IDF matrix created
  Shape: (131935, 100)
  Features: 100

First 20 terms (alphabetical):
['api' 'api request' 'authentication' 'authentication successful' 'cache'
 'cache hit' 'cache_' 'cache_ num' 'com' 'completed' 'completed rows'
 'connection' 'created' 'created session_id' 'created successfully'
 'database' 'database query' 'duration' 'duration num' 'email']
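
To make the weighting concrete, here is a small pure-Python sketch of the smoothed IDF and L2 normalization that scikit-learn's `TfidfVectorizer` uses by default. It assumes plain whitespace tokenization with no stop words or bigrams, and the three toy messages are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical messages, already preprocessed).
docs = [
    "cache miss for key",
    "cache hit for key",
    "connection timeout to database",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def smoothed_idf(term: str) -> float:
    """idf = ln((1 + n) / (1 + df)) + 1, the smoothed form."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0


def tfidf_vector(tokens):
    """Raw term count times IDF, then scaled to unit L2 norm."""
    term_freq = Counter(tokens)
    raw = {term: count * smoothed_idf(term) for term, count in term_freq.items()}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda item: -item[1]):
    print(f"{term:8s} {weight:.3f}")
```

In the first message, `miss` appears in only one document and so gets a larger weight than `cache`, `for`, or `key`, which each appear in two.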


In [203]:
# Reduce dimensions for clustering and visualization.

# Why TruncatedSVD: it works directly on sparse matrices, and the TF-IDF
# matrix is mostly zeros, which makes it the usual choice for text data.

print("\nReducing dimensions with TruncatedSVD...")
svd = TruncatedSVD(n_components=15, random_state=42)
tfidf_reduced = svd.fit_transform(tfidf_matrix)

print(f"✓ Reduced to {tfidf_reduced.shape[1]} dimensions")
print(f"  Explained variance: {svd.explained_variance_ratio_.sum():.2%}")


Reducing dimensions with TruncatedSVD...
✓ Reduced to 15 dimensions
  Explained variance: 91.07%
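
What the reduction does can be sketched with plain NumPy on a small dense stand-in matrix (the random data and `k = 2` are assumptions for illustration). The projection uses the top-k right singular vectors without centering the data first. The explained-variance ratio is computed as the variance of the projected columns over the total variance of the original columns, which is how `explained_variance_ratio_` is understood to be reported.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 5))  # Stand-in for the TF-IDF matrix (dense here).

# Thin SVD: X = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
# Project each row onto the top-k right singular vectors.
X_reduced = X @ Vt[:k].T

# Share of the total column variance kept by the k components.
explained = X_reduced.var(axis=0).sum() / X.var(axis=0).sum()

print(X_reduced.shape)
print(f"Explained variance: {explained:.2%}")
```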


## 2. Message Template Extraction

Identify unique message templates (patterns).

In [204]:
template_counts = df['message_processed'].value_counts()

print(f"Unique message templates: {len(template_counts):,}")
print(f"\nTop 10 most common templates:")
print(template_counts.head(10))

# Calculate message rarity (inverse frequency).
df['message_rarity'] = df['message_processed'].map(
    lambda x: 1.0 / template_counts.get(x, 1)
)

print(f"\n✓ Message rarity calculated")
print(f"  Rare messages (rarity > 0.01): {(df['message_rarity'] > 0.01).sum():,}")

Unique message templates: 83

Top 10 most common templates:
message_processed
database query completed - rows: <NUM>, duration: <NUM>ms              8626
payment processed - transaction_id: txn_<NUM>, amount: $<NUM>.<NUM>    8598
inventory updated - item_id: item_<NUM>, quantity: <NUM>               8558
cache hit for key: cache_<NUM>                                         8551
order created successfully - order_id: ord_<NUM>                       8540
request processed successfully - user_id: <NUM>, duration: <NUM>ms     8490
email notification sent to user<NUM>@example.com                       8458
authentication successful for user <NUM>                               8438
user session created - session_id: sess_<NUM>                          8414
query plan: index_scan                                                 5253
Name: count, dtype: int64

✓ Message rarity calculated
  Rare messages (rarity > 0.01): 629
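
A rarity of more than 0.01 means the template occurs fewer than 100 times in the training data. The lookup also defines what happens to a template that was never seen: with a default count of 1 it gets the maximum rarity of 1.0. A minimal sketch with the standard library, using a made-up `toy_counts` table:

```python
from collections import Counter

# Hypothetical training templates (already preprocessed).
training_templates = [
    "cache miss for key: cache_<NUM>",
    "cache miss for key: cache_<NUM>",
    "cache miss for key: cache_<NUM>",
    "cache hit for key: cache_<NUM>",
]
toy_counts = Counter(training_templates)


def message_rarity(template: str) -> float:
    """Inverse template frequency; unseen templates default to count 1."""
    return 1.0 / toy_counts.get(template, 1)


print(message_rarity("cache miss for key: cache_<NUM>"))
print(message_rarity("connection timeout to db-<NUM>.example.com"))
```

Note that an unseen template and a template seen exactly once both score 1.0, so rarity alone cannot tell them apart.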


## 3. Find Optimal Number of Clusters - Elbow Method



In [205]:
# The data contains 83 unique message templates (see section 2).
# Beyond k=20 the silhouette score changes little, so a k between
# 20 and 25 gives a good cluster representation.
# Larger values just overfit the data.

# Set number of clusters based on analysis.
n_clusters = 20  # Chosen from the elbow and silhouette analysis.

print(f"Clustering messages into {n_clusters} groups...")
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
df['message_cluster'] = kmeans.fit_predict(tfidf_reduced)

print(f"\n✓ Messages clustered")
print(f"\nCluster distribution:")
print(df['message_cluster'].value_counts().sort_index())

from sklearn.metrics import silhouette_score

silhouette = silhouette_score(tfidf_reduced, df['message_cluster'], sample_size=10000)

print(f"\nClustering Quality Metrics:")
print(f"  Silhouette Score: {silhouette:.4f} (higher is better, range: -1 to 1)")


Clustering messages into 20 groups...

✓ Messages clustered

Cluster distribution:
message_cluster
0      8540
1      8438
2      4360
3      8558
4      8414
5      8685
6      6470
7      5178
8     13679
9      5253
10     8610
11     8502
12     5230
13     8490
14     2251
15     5207
16     8626
17     3009
18     2216
19     2219
Name: count, dtype: int64

Clustering Quality Metrics:
  Silhouette Score: 0.9532 (higher is better, range: -1 to 1)
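
The sweep behind the choice of `n_clusters` is not shown in this notebook. A sketch of how such an elbow and silhouette comparison could look is given here on synthetic data: the three blobs from `make_blobs` stand in for `tfidf_reduced`, and the range of k values is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the reduced TF-IDF matrix: three well-separated blobs.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

results = []
for k in range(2, 7):
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    score = silhouette_score(X, model.labels_)
    results.append((k, model.inertia_, score))
    print(f"k={k}  inertia={model.inertia_:10.1f}  silhouette={score:.3f}")

# Elbow: look for the k after which inertia stops dropping sharply.
# Silhouette: pick the k with the highest score.
best_k = max(results, key=lambda row: row[2])[0]
print(f"Best k by silhouette: {best_k}")
```

On the real data, the same loop would run over `tfidf_reduced`, with `sample_size` passed to `silhouette_score` to keep the cost manageable.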


In [206]:
# Examine clusters.
print("\nSample messages from each cluster:")
for cluster_id in range(min(5, n_clusters)):  # Show first 5 clusters.
    print(f"\n{'='*60}")
    print(f"Cluster {cluster_id} ({(df['message_cluster'] == cluster_id).sum()} messages):")
    print('='*60)
    cluster_samples = df[df['message_cluster'] == cluster_id]['message'].head(3)
    for i, msg in enumerate(cluster_samples, 1):
        print(f"  {i}. {msg}")


Sample messages from each cluster:

Cluster 0 (8540 messages):
  1. Order created successfully - order_id: ord_376807
  2. Order created successfully - order_id: ord_345884
  3. Order created successfully - order_id: ord_576086

Cluster 1 (8438 messages):
  1. Authentication successful for user 4872
  2. Authentication successful for user 4571
  3. Authentication successful for user 3522

Cluster 2 (4360 messages):
  1. Slow query detected - duration: 184ms
  2. Service latency above threshold: 289ms
  3. Slow query detected - duration: 347ms

Cluster 3 (8558 messages):
  1. Inventory updated - item_id: item_139, quantity: 40
  2. Inventory updated - item_id: item_339, quantity: 11
  3. Inventory updated - item_id: item_407, quantity: 80

Cluster 4 (8414 messages):
  1. User session created - session_id: sess_665158
  2. User session created - session_id: sess_618392
  3. User session created - session_id: sess_364659


## 4. Detect New/Unusual Messages

Calculate distance to cluster centroid to find anomalous messages.

In [207]:
# Calculate distance to nearest cluster centroid.
from sklearn.metrics import pairwise_distances

print("Calculating distances to cluster centroids...")
distances = pairwise_distances(
    tfidf_reduced,
    kmeans.cluster_centers_,
    metric='euclidean'
)

print(distances)  # Distance from each data point to every cluster centroid.

# Row indices (0..n-1) paired with each row's assigned cluster, which selects the column.
print(np.arange(len(df)), df['message_cluster'])
# Distance to assigned cluster.
df['message_distance'] = distances[np.arange(len(df)), df['message_cluster']]

# Normalize distances (0-1 scale).
df['message_anomaly_score'] = (
    (df['message_distance'] - df['message_distance'].min()) /
    (df['message_distance'].max() - df['message_distance'].min())
)

print(f"\n✓ Anomaly scores calculated")
print(f"\nAnomaly score distribution:")
print(df['message_anomaly_score'].describe())

Calculating distances to cluster centroids...
[[1.33890983 1.33979716 1.08415146 ... 1.30611428 1.16886513 1.0775324 ]
 [1.23712162 1.23164636 0.13129025 ... 1.21620367 1.05345302 0.94574771]
 [1.40578479 1.40521778 0.56240574 ... 1.39199358 1.24355637 1.16160424]
 ...
 [1.40578479 1.40521778 0.56240574 ... 1.39199358 1.24355637 1.16160424]
 [1.36156991 1.36699006 1.17488253 ... 1.33085118 1.2541999  0.99328566]
 [1.41382329 1.3152481  1.17537757 ... 1.25296502 1.25025726 0.95736211]]
[     0      1      2 ... 131932 131933 131934] 0          8
1          2
2         16
3          8
4          4
          ..
131930     0
131931     5
131932    16
131933     4
131934    11
Name: message_cluster, Length: 131935, dtype: int32

✓ Anomaly scores calculated

Anomaly score distribution:
count    1.319350e+05
mean     4.557948e-02
std      1.228326e-01
min      0.000000e+00
25%      0.000000e+00
50%      3.084876e-08
75%      1.170732e-02
max      1.000000e+00
Name: message_anomaly_score, dtyp
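
The indexing step is easier to follow on a tiny example. The sketch below uses a made-up 3x3 distance matrix and label vector, picks each row's distance to its own centroid with paired row and column indices, and then applies the same min-max scaling. The zero-span guard is an addition that the notebook cell does not have.

```python
import numpy as np

# Rows = messages, columns = centroids (hypothetical distances).
distances = np.array([
    [0.1, 1.2, 0.9],
    [1.0, 0.3, 1.1],
    [0.8, 0.7, 2.0],
])
labels = np.array([0, 1, 2])  # Assigned cluster per message.

# Pair row i with column labels[i]: picks 0.1, 0.3 and 2.0.
assigned = distances[np.arange(len(labels)), labels]

# Min-max scale to [0, 1]; guard against all distances being equal.
span = assigned.max() - assigned.min()
scores = (assigned - assigned.min()) / span if span > 0 else np.zeros_like(assigned)

print(assigned)
print(scores)
```

Because the scale is anchored to the single largest distance, one extreme message compresses every other score toward zero, which matches the very small median in the distribution above.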

In [208]:
print("\nTop 10 most anomalous messages (unusual patterns):")
print("="*60)
anomalous = df.nlargest(10, 'message_anomaly_score')[['message', 'message_anomaly_score', 'level']]
for idx, row in anomalous.iterrows():
    print(f"\nScore: {row['message_anomaly_score']:.4f} | Level: {row['level']}")
    print(f"Message: {row['message']}")


Top 10 most anomalous messages (unusual patterns):

Score: 1.0000 | Level: FATAL
Message: Connection timeout to db-5.example.com - Network unreachable

Score: 1.0000 | Level: WARN
Message: Connection timeout to db-3.example.com - Timeout

Score: 1.0000 | Level: FATAL
Message: Connection timeout to db-4.example.com - Network unreachable

Score: 1.0000 | Level: FATAL
Message: Connection timeout to db-2.example.com - Network unreachable

Score: 1.0000 | Level: WARN
Message: Connection timeout to db-3.example.com - Network unreachable

Score: 1.0000 | Level: FATAL
Message: Connection timeout to db-2.example.com - Network unreachable

Score: 1.0000 | Level: ERROR
Message: Connection timeout to db-5.example.com - Network unreachable

Score: 1.0000 | Level: WARN
Message: Connection timeout to db-2.example.com - Timeout

Score: 1.0000 | Level: ERROR
Message: Connection timeout to db-4.example.com - Timeout

Score: 1.0000 | Level: ERROR
Message: Connection timeout to db-4.example.com - Network

## 5. Visualize Message Embeddings

In [209]:
# Reduce further to 2D for plotting.
from sklearn.manifold import TSNE

print("Reducing to 2D with t-SNE (this may take a minute)...")

sample_size = min(5000, len(df))
# Seeded generator so the sample (and the resulting plot) is reproducible.
sample_idx = np.random.default_rng(42).choice(len(df), sample_size, replace=False)

tsne = TSNE(n_components=2, random_state=42, perplexity=30)
embeddings_2d = tsne.fit_transform(tfidf_reduced[sample_idx])

print(f"✓ 2D embeddings created for {sample_size} samples")

Reducing to 2D with t-SNE (this may take a minute)...
✓ 2D embeddings created for 5000 samples


In [210]:
sample_df = df.iloc[sample_idx].copy()
sample_df['x'] = embeddings_2d[:, 0]
sample_df['y'] = embeddings_2d[:, 1]

fig = go.Figure()

for cluster_id in sample_df['message_cluster'].unique():
    cluster_data = sample_df[sample_df['message_cluster'] == cluster_id]
    fig.add_trace(
        go.Scatter(
            x=cluster_data['x'],
            y=cluster_data['y'],
            mode='markers',
            name=f'Cluster {cluster_id}',
            marker=dict(size=5, opacity=0.6),
            text=cluster_data['message'].str[:50],  # Show first 50 chars on hover.
            hovertemplate='%{text}<extra></extra>'
        )
    )

fig.update_layout(
    title='Message Embeddings - t-SNE Visualization (Colored by Cluster)',
    xaxis_title='t-SNE Dimension 1',
    yaxis_title='t-SNE Dimension 2',
    height=700,
    showlegend=True
)
fig.show()

In [211]:
fig = go.Figure()

level_colors = {
    'DEBUG': 'lightblue',
    'INFO': 'green',
    'WARN': 'orange',
    'ERROR': 'red',
    'FATAL': 'darkred'
}

for level in sample_df['level'].unique():
    level_data = sample_df[sample_df['level'] == level]
    fig.add_trace(
        go.Scatter(
            x=level_data['x'],
            y=level_data['y'],
            mode='markers',
            name=level,
            marker=dict(
                size=5,
                color=level_colors.get(level, 'gray'),
                opacity=0.6
            ),
            text=level_data['message'].str[:50],
            hovertemplate='%{text}<extra></extra>'
        )
    )

fig.update_layout(
    title='Message Embeddings - Colored by Log Level',
    xaxis_title='t-SNE Dimension 1',
    yaxis_title='t-SNE Dimension 2',
    height=700
)
fig.show()

print("\nInterpretation:")
print("  - Tight clusters = similar message patterns")
print("  - Outliers = unusual/new message patterns")
print("  - ERROR/FATAL in separate regions = distinct error patterns")


Interpretation:
  - Tight clusters = similar message patterns
  - Outliers = unusual/new message patterns
  - ERROR/FATAL in separate regions = distinct error patterns


## 6. Time-Based Message Analysis

In [212]:
df_sorted = df.sort_values('timestamp')
df_sorted['is_new_template'] = ~df_sorted['message_processed'].duplicated()

# Aggregate by time window.
time_agg = df_sorted.set_index('timestamp').resample('1h').agg({
    'is_new_template': 'sum',
    'message_anomaly_score': 'mean',
    'message_rarity': 'mean'
}).reset_index()

time_agg.columns = ['timestamp', 'new_templates', 'avg_anomaly_score', 'avg_rarity']

print("Time-based message metrics calculated")
time_agg.head(10)

Time-based message metrics calculated


| | timestamp | new_templates | avg_anomaly_score | avg_rarity |
|---|---|---|---|---|
| 0 | 2025-12-17 23:00:00+00:00 | 31 | 0.046036 | 0.000309 |
| 1 | 2025-12-18 00:00:00+00:00 | 6 | 0.04608 | 0.000312 |
| 2 | 2025-12-18 01:00:00+00:00 | 2 | 0.040874 | 0.000316 |
| 3 | 2025-12-18 02:00:00+00:00 | 0 | 0.036612 | 0.000273 |
| 4 | 2025-12-18 03:00:00+00:00 | 1 | 0.040795 | 0.001073 |
| 5 | 2025-12-18 04:00:00+00:00 | 1 | 0.043607 | 0.000845 |
| 6 | 2025-12-18 05:00:00+00:00 | 0 | 0.036402 | 0.000261 |
| 7 | 2025-12-18 06:00:00+00:00 | 0 | 0.051408 | 0.000316 |
| 8 | 2025-12-18 07:00:00+00:00 | 0 | 0.037499 | 0.000297 |
| 9 | 2025-12-18 08:00:00+00:00 | 0 | 0.041754 | 0.00029 |
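
The `is_new_template` flag marks the first time each template appears in time order, which is why the first hour shows 31 new templates. The same logic can be written as a streaming check with a `seen` set. The helper name and sample stream below are hypothetical.

```python
def flag_new_templates(templates):
    """Return True for the first occurrence of each template, in order."""
    seen = set()
    flags = []
    for template in templates:
        flags.append(template not in seen)
        seen.add(template)
    return flags


stream = [
    "cache miss for key: cache_<NUM>",
    "cache hit for key: cache_<NUM>",
    "cache miss for key: cache_<NUM>",
    "connection timeout to db-<NUM>.example.com",
]
print(flag_new_templates(stream))
```

In a live pipeline the `seen` set would be seeded with the training templates, so that only templates absent from training count as new.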


In [213]:
fig = make_subplots(
    rows=2, cols=1,
    subplot_titles=(
        'New Message Templates per Hour',
        'Average Message Anomaly Score per Hour'
    ),
    vertical_spacing=0.15
)

fig.add_trace(
    go.Scatter(
        x=time_agg['timestamp'],
        y=time_agg['new_templates'],
        mode='lines+markers',
        name='New Templates',
        line=dict(color='blue', width=2),
        marker=dict(size=4)
    ),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(
        x=time_agg['timestamp'],
        y=time_agg['avg_anomaly_score'],
        mode='lines+markers',
        name='Avg Anomaly Score',
        line=dict(color='red', width=2),
        marker=dict(size=4)
    ),
    row=2, col=1
)

fig.update_xaxes(title_text='Timestamp', row=2, col=1)
fig.update_yaxes(title_text='New Templates', row=1, col=1)
fig.update_yaxes(title_text='Anomaly Score', row=2, col=1)
fig.update_layout(height=700, showlegend=False)
fig.show()

print("\nInterpretation:")
print("  - Spike in new templates = new error types appearing")
print("  - High avg anomaly score = messages becoming more unusual")
print("  - Both spiking = strong indicator of new anomaly patterns")


Interpretation:
  - Spike in new templates = new error types appearing
  - High avg anomaly score = messages becoming more unusual
  - Both spiking = strong indicator of new anomaly patterns


## 7. Aggregate Text Features per Window

In [214]:
# Create window-level text features (30s windows to match other features).
text_features = df.set_index('timestamp').resample('30s').agg({
    'message_rarity': 'mean',
    'message_anomaly_score': 'mean',
    'message_cluster': lambda x: x.mode()[0] if len(x) > 0 else -1  # Most common cluster.
}).reset_index()

# Add TF-IDF features (average per window).
tfidf_df = pd.DataFrame(
    tfidf_reduced,
    columns=[f'tfidf_{i}' for i in range(tfidf_reduced.shape[1])]
)
tfidf_df['timestamp'] = df['timestamp'].values

tfidf_window = tfidf_df.set_index('timestamp').resample('30s').mean().reset_index()

text_features['timestamp'] = pd.to_datetime(text_features['timestamp']).dt.tz_localize(None)
tfidf_window['timestamp'] = pd.to_datetime(tfidf_window['timestamp']).dt.tz_localize(None)
text_features = text_features.merge(tfidf_window, on='timestamp', how='left')

print(f"✓ Window-level text features created")
print(f"  Shape: {text_features.shape}")
print(f"  Features: {text_features.columns.tolist()}")
text_features.head()

✓ Window-level text features created
  Shape: (25921, 19)
  Features: ['timestamp', 'message_rarity', 'message_anomaly_score', 'message_cluster', 'tfidf_0', 'tfidf_1', 'tfidf_2', 'tfidf_3', 'tfidf_4', 'tfidf_5', 'tfidf_6', 'tfidf_7', 'tfidf_8', 'tfidf_9', 'tfidf_10', 'tfidf_11', 'tfidf_12', 'tfidf_13', 'tfidf_14']


| | timestamp | message_rarity | message_anomaly_score | message_cluster | tfidf_0 | tfidf_1 | tfidf_2 | tfidf_3 | tfidf_4 | tfidf_5 | tfidf_6 | tfidf_7 | tfidf_8 | tfidf_9 | tfidf_10 | tfidf_11 | tfidf_12 | tfidf_13 | tfidf_14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-12-17 23:44:00 | 0.000328 | 0.158502 | 2 | 0.345245 | 0.444357 | -0.023103 | -0.056377 | -0.041825 | -0.043055 | 0.003759 | 0.025471 | -0.006324 | -0.011123 | 0.051857 | -0.003688 | -6.076544e-11 | 0.033043 | 0.004458 |
| 1 | 2025-12-17 23:44:30 | 0.000247 | 0.022993 | 4 | 0.185212 | 0.187288 | 0.168995 | -0.015208 | 0.082408 | 0.148762 | -0.008498 | 0.171122 | -0.029606 | 0.042911 | 0.109247 | -0.018461 | -1.222732e-10 | -0.053951 | 0.158027 |
| 2 | 2025-12-17 23:45:00 | 0.000228 | 0.03146 | 8 | 0.003005 | 0.377499 | 0.263705 | -0.016669 | -0.076322 | 0.014556 | 0.018079 | -0.075764 | -0.01274 | -0.035067 | 0.006068 | 0.001815 | 0.2 | -0.000121 | -0.049878 |
| 3 | 2025-12-17 23:45:30 | 0.000262 | 0.039334 | 0 | 0.235995 | 0.010131 | 0.117226 | -0.035805 | 0.065099 | -0.045749 | -0.01177 | -0.121533 | -0.095636 | 0.165227 | 0.080053 | -0.018654 | 0.1666667 | -0.052942 | -0.055879 |
| 4 | 2025-12-17 23:46:00 | 0.000266 | 0.044385 | 2 | 0.321071 | 0.049798 | 0.08792 | -0.063883 | 0.043684 | -0.070967 | -0.02042 | 0.180758 | -0.042263 | 0.171011 | 0.166883 | -0.039529 | 1.393721e-10 | 0.145659 | 0.071765 |


## 8. Save Text Features

In [215]:
import os

# Save text features.
output_path = '../data/features/text_features.parquet'
os.makedirs(os.path.dirname(output_path), exist_ok=True)  # Ensure the output directory exists.
text_features.to_parquet(output_path, index=False)

print(f"✓ Text features saved: {output_path}")
print(f"  Shape: {text_features.shape}")

# Also save per-log text features for later analysis.
log_text_features = df[[
    'timestamp', 'message', 'message_processed', 'message_rarity',
    'message_anomaly_score', 'message_cluster'
]]
log_output = '../data/features/log_text_features.parquet'
log_text_features.to_parquet(log_output, index=False)

print(f"✓ Per-log text features saved: {log_output}")
print(f"  Shape: {log_text_features.shape}")

✓ Text features saved: ../data/features/text_features.parquet
  Shape: (25921, 19)
✓ Per-log text features saved: ../data/features/log_text_features.parquet
  Shape: (131935, 6)


## Summary

### Text Features Calculated:

**1. Message Templates:**
- Extracted unique message patterns
- Calculated message rarity (inverse frequency)
- Tracked new templates over time

**2. TF-IDF Embeddings:**
- 100-dimensional TF-IDF vectors
- Reduced to 15 dimensions with SVD
- Captures important terms in messages

**3. Message Clustering:**
- Grouped similar messages into 20 clusters
- Calculated distance to cluster centroid
- Anomaly score = min-max normalized distance to the assigned cluster centroid

**4. Window-Level Features:**
- `message_rarity`: Average rarity of messages in window
- `message_anomaly_score`: Average semantic anomaly score
- `message_cluster`: Most common cluster in window
- `tfidf_0...tfidf_14`: Average SVD-reduced TF-IDF features

### Why These Matter:

**1. New Error Detection:**
- High `message_rarity` = unusual error messages
- Spike in new templates = new failure modes

**2. Semantic Anomalies:**
- High `message_anomaly_score` = messages don't match learned patterns
- Detects errors with unusual wording/structure

**3. Error Pattern Shifts:**
- Changes in cluster distribution = different error types
- TF-IDF changes = different terminology appearing

**4. Complements Volume Features:**
- Volume features: **how many** errors
- Text features: **what kind** of errors
- Together: broader anomaly coverage than either gives alone

### Example Use Cases:

**Case 1: New Database Error**
```
Volume: normal (only 5 errors)
Text: HIGH anomaly score (never seen this error before)
→ Anomaly detected! New database connection issue.
```

**Case 2: Known Error Spike**
```
Volume: HIGH (1000 errors)
Text: low anomaly score (familiar "insufficient funds" errors)
→ Anomaly detected! Spike in known error type.
```

**Case 3: Silent Failure**
```
Volume: normal
Text: HIGH rarity, many new templates
→ Anomaly detected! System producing unusual messages.
```
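
The three cases can be read as a small decision rule over one volume signal and two text signals. The sketch below is illustrative only: the function name, the z-score input, and both thresholds are assumptions and are not tuned on this data.

```python
def describe_window(volume_zscore, avg_anomaly_score, new_templates,
                    volume_threshold=3.0, anomaly_threshold=0.5):
    """Combine volume and text signals into a coarse label (illustrative thresholds)."""
    volume_high = volume_zscore > volume_threshold
    text_high = avg_anomaly_score > anomaly_threshold or new_templates > 0
    if volume_high and text_high:
        return "spike of unfamiliar messages"
    if volume_high:
        return "spike in known message type"
    if text_high:
        return "unfamiliar messages at normal volume"
    return "normal"


print(describe_window(0.2, 0.9, 0))   # Case 1: few errors, never seen before
print(describe_window(5.0, 0.05, 0))  # Case 2: many familiar errors
print(describe_window(0.1, 0.1, 4))   # Case 3: normal volume, new templates
```

In practice a downstream model would learn these interactions from the window-level features rather than rely on fixed thresholds.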

## 9. Export Text Processing Artifacts for Production Pipeline

Export the trained TF-IDF vectorizer, SVD reducer, KMeans clusterer, and message rarity lookup for the production inference pipeline.

In [216]:
import pickle
from pathlib import Path

# Create models directory.
model_dir = Path('../models')
model_dir.mkdir(exist_ok=True)

print("Exporting text processing artifacts for production pipeline...")
print("="*60)

# Export TF-IDF vectorizer.
with open(model_dir / 'tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf, f)
print(f"✓ TF-IDF vectorizer saved to {model_dir / 'tfidf_vectorizer.pkl'}")

# Export TruncatedSVD reducer.
with open(model_dir / 'svd_reducer.pkl', 'wb') as f:
    pickle.dump(svd, f)
print(f"✓ SVD reducer saved to {model_dir / 'svd_reducer.pkl'}")

# Export KMeans clusterer.
with open(model_dir / 'message_clusters.pkl', 'wb') as f:
    pickle.dump(kmeans, f)
print(f"✓ KMeans clusterer saved to {model_dir / 'message_clusters.pkl'}")

# Export message rarity lookup (template_counts).
message_rarity_lookup = {}
for template, count in template_counts.items():
    message_rarity_lookup[template] = 1.0 / count

with open(model_dir / 'message_rarity.pkl', 'wb') as f:
    pickle.dump(message_rarity_lookup, f)
print(f"✓ Message rarity lookup saved to {model_dir / 'message_rarity.pkl'}")



Exporting text processing artifacts for production pipeline...
✓ TF-IDF vectorizer saved to ../models/tfidf_vectorizer.pkl
✓ SVD reducer saved to ../models/svd_reducer.pkl
✓ KMeans clusterer saved to ../models/message_clusters.pkl
✓ Message rarity lookup saved to ../models/message_rarity.pkl
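
On the inference side, the exported objects would be loaded and applied in the same order as in training: preprocess, vectorize, reduce, assign a cluster, look up rarity. The sketch below fits tiny stand-in models and round-trips them through `pickle` in memory so that it runs on its own. `normalize` and `score_message` are hypothetical helpers, and the corpus is made up. It returns the raw centroid distance, because the training minimum and maximum used for the 0-1 anomaly score are not among the exported artifacts.

```python
import pickle
import re
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def normalize(message):
    """Stand-in for the notebook's preprocessing: lowercase and mask digits."""
    return re.sub(r'\d+', '<NUM>', message.lower()).strip()


def score_message(message, tfidf, svd, kmeans, rarity_lookup):
    """Apply the fitted artifacts to one raw message."""
    template = normalize(message)
    reduced = svd.transform(tfidf.transform([template]))
    cluster = int(kmeans.predict(reduced)[0])
    # KMeans.transform returns the distance to every centroid.
    distance = float(kmeans.transform(reduced)[0, cluster])
    # Unseen templates get the maximum rarity of 1.0.
    rarity = rarity_lookup.get(template, 1.0)
    return {"cluster": cluster, "distance": distance, "rarity": rarity}


# Tiny stand-in corpus so the example is self-contained.
train = [
    "Cache miss for key: cache_70",
    "Cache hit for key: cache_58",
    "Cache miss for key: cache_89",
    "Order created successfully - order_id: ord_376807",
    "Order created successfully - order_id: ord_345884",
    "Authentication successful for user 4872",
]
templates = [normalize(m) for m in train]

tfidf = TfidfVectorizer().fit(templates)
matrix = tfidf.transform(templates)
svd = TruncatedSVD(n_components=2, random_state=42).fit(matrix)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(svd.transform(matrix))
rarity_lookup = {t: 1.0 / c for t, c in Counter(templates).items()}

# Round-trip through pickle, as the exported .pkl files would be.
artifacts = pickle.loads(pickle.dumps((tfidf, svd, kmeans, rarity_lookup)))

known = score_message("Cache miss for key: cache_12", *artifacts)
unseen = score_message("Connection timeout to db-5.example.com", *artifacts)
print(known)
print(unseen)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, so the training and serving environments should be pinned together.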
