# Finding topics for a large number of documents (messages)

We have a huge number of documents (say, over 1 billion messages) and want to find the broad topics these documents belong to.

This notebook demonstrates how it can be done, using 10M documents as a scaled-down example.

In [7]:
import faiss
import numpy as np
import os

## Generating synthetic data

In [6]:
# Assume embeddings are already saved for all messages (here: synthetic embeddings for 10M messages).
# At 128 float32 values per message, the full array takes about 5.1 GB in memory and on disk.
num_entries = 10_000_000
embedding_dim = 128

embeddings = np.random.rand(num_entries, embedding_dim).astype('float32')
np.save("message_embeddings.npy", embeddings)
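At larger scales the full array may not fit in memory before it is saved. A minimal sketch of writing the same kind of file chunk by chunk, assuming NumPy's `np.lib.format.open_memmap`; the helper name `write_synthetic_embeddings`, the `chunk_size` parameter, and the demo sizes are hypothetical:

```python
import os
import tempfile

import numpy as np


def write_synthetic_embeddings(path, num_entries, embedding_dim, chunk_size=100_000, seed=0):
    """Write random float32 embeddings to a .npy file one chunk at a time."""
    rng = np.random.default_rng(seed)
    # open_memmap creates a regular .npy file backed by disk, so only one
    # chunk has to be held in memory at any time.
    out = np.lib.format.open_memmap(
        path, mode="w+", dtype=np.float32, shape=(num_entries, embedding_dim)
    )
    for start in range(0, num_entries, chunk_size):
        end = min(start + chunk_size, num_entries)
        out[start:end] = rng.random((end - start, embedding_dim), dtype=np.float32)
    out.flush()
    del out  # release the memory map
    return path


# Small demo (hypothetical sizes) so the sketch runs quickly.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_embeddings.npy")
write_synthetic_embeddings(demo_path, num_entries=1_000, embedding_dim=8, chunk_size=256)
demo = np.load(demo_path, mmap_mode="r")
print(demo.shape, demo.dtype)
```

The resulting file can be read with `np.load(..., mmap_mode='r')` in the same way as a file written by `np.save`.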

## Training k-means on the synthetic data, batch by batch

In [8]:
def load_embeddings(file_path, batch_size):
    """
    Stream embeddings from a large .npy file in batches.
    
    Parameters:
    - file_path (str): Path to the stored embeddings file.
    - batch_size (int): Number of embeddings to load per batch.
    
    Yields:
    - np.ndarray: Batch of embeddings.
    """
    # Memory-map the large file to avoid loading all at once
    embeddings = np.load(file_path, mmap_mode='r')
    num_entries = embeddings.shape[0]
    
    # Yield embeddings in chunks
    for start_idx in range(0, num_entries, batch_size):
        end_idx = min(start_idx + batch_size, num_entries)
        yield embeddings[start_idx:end_idx]


In [11]:
# Configuration
embedding_dim = 128  # Dimension of embeddings
batch_size = 1000000   # Size of each batch
n_clusters = 100     # Number of clusters
file_path = "message_embeddings.npy"

# Step 1: Initialize FAISS KMeans
kmeans = faiss.Kmeans(d=embedding_dim, k=n_clusters, niter=20, verbose=True)

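# Step 2: Train KMeans batch by batch
# Note: each call to kmeans.train() starts a new clustering run, so without a warm start
# only the last batch would determine the final centroids. Passing the previous centroids
# through init_centroids (assumed available in the installed FAISS version) carries them forward.
print("Step 2: Training clusters incrementally...")
centroids = None
for batch_idx, embeddings in enumerate(load_embeddings(file_path, batch_size)):
    print(f"Processing batch {batch_idx + 1}")
    kmeans.train(np.ascontiguousarray(embeddings), init_centroids=centroids)
    centroids = kmeans.centroids.copy()  # starting point for the next batch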

Step 2: Training clusters incrementally...
Processing batch 1
Sampling a subset of 25600 / 1000000 for training
Clustering 25600 points in 128D to 100 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.14 s
  [...]96 s, search 0.78 s): objective=255091 imbalance=1.007 nsplit=0
Processing batch 2
Sampling a subset of 25600 / 1000000 for training
Clustering 25600 points in 128D to 100 clusters, redo 1 times, 20 iterations
  Preprocessing in 1.82 s
  [...]85 s, search 0.73 s): objective=255448 imbalance=1.009 nsplit=0
  Iteration 19 (0.91 s, search 0.78 s): objective=255431 imbalance=1.009 nsplit=0
Processing batch 3
Sampling a subset of 25600 / 1000000 for training
Clustering 25600 points in 128D to 100 clusters, redo 1 times, 20 iterations
  Preprocessing in 2.43 s
  [...]10 s, search 0.94 s): objective=255189 imbalance=1.005 nsplit=0
Processing batch 4
Sampling a subset of 25600 / 1000000 for training
Clustering 25600 points in 128D to 100 clusters, redo 1 times, 20 itera

In [13]:
kmeans.centroids.shape

(100, 128)
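The training log shows FAISS sampling 25600 of the 1000000 points in each batch (256 per centroid for 100 centroids), so most of each batch is never used for training. One alternative, sketched here under that observation, is to draw a single uniform sample across the whole memory-mapped file and train once on it. The helper name `sample_rows` and the demo sizes are hypothetical:

```python
import numpy as np


def sample_rows(embeddings, n_samples, seed=0):
    """Draw a uniform random sample of rows from an array or np.memmap."""
    rng = np.random.default_rng(seed)
    n_total = embeddings.shape[0]
    n_samples = min(n_samples, n_total)
    # Sorted indices keep reads from a memory-mapped file mostly sequential.
    idx = np.sort(rng.choice(n_total, size=n_samples, replace=False))
    # Fancy indexing copies only the selected rows into memory.
    return np.ascontiguousarray(embeddings[idx], dtype=np.float32)


# Small in-memory demo (hypothetical sizes); with the notebook's data this would be
# the array returned by np.load("message_embeddings.npy", mmap_mode="r").
demo_embeddings = np.random.default_rng(1).random((10_000, 16), dtype=np.float32)
training_sample = sample_rows(demo_embeddings, n_samples=2_560)
print(training_sample.shape)
```

The sample could then be passed to a single `kmeans.train(training_sample)` call, so the centroids reflect the whole file instead of whichever batch was processed last.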

## Assign each message embedding to its nearest cluster centroid and save

In [15]:
# Step 3: Create an Index for clustering assignments
index = faiss.IndexFlatL2(embedding_dim)
index.add(kmeans.centroids)  # Add cluster centroids to the index

# Step 4: Assign clusters to full dataset incrementally
cluster_assignments = []

print("Step 4: Assigning clusters to all documents incrementally...")
for batch_idx, embeddings in enumerate(load_embeddings(file_path, batch_size)):
    print(f"Processing batch {batch_idx + 1}")
    _, batch_assignments = index.search(embeddings, 1)  # Get nearest centroid for each embedding
    cluster_assignments.append(batch_assignments)

# Combine all assignments into one array
cluster_assignments = np.concatenate(cluster_assignments).flatten()

Step 4: Assigning clusters to all documents incrementally...
Processing batch 1
Processing batch 2
Processing batch 3
Processing batch 4
Processing batch 5
Processing batch 6
Processing batch 7
Processing batch 8
Processing batch 9
Processing batch 10
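To make the assignment step concrete: searching a flat L2 index for the single nearest neighbour amounts to picking, for every embedding, the centroid with the smallest squared Euclidean distance. A NumPy-only sketch of that computation, checked against a brute-force reference; the function name `assign_to_centroids` and the toy sizes are hypothetical, and this is not how FAISS implements the search internally:

```python
import numpy as np


def assign_to_centroids(batch, centroids):
    """Return the index of the nearest centroid (squared L2) for each row of batch."""
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2. The ||x||^2 term is the same for
    # every centroid, so it can be dropped without changing the argmin.
    scores = -2.0 * (batch @ centroids.T) + np.sum(centroids ** 2, axis=1)
    return np.argmin(scores, axis=1)


rng = np.random.default_rng(0)
demo_centroids = rng.random((5, 8))
demo_batch = rng.random((20, 8))

fast = assign_to_centroids(demo_batch, demo_centroids)

# Brute-force reference: explicit pairwise squared distances.
diffs = demo_batch[:, None, :] - demo_centroids[None, :, :]
reference = np.argmin(np.sum(diffs ** 2, axis=2), axis=1)

print(np.array_equal(fast, reference))
```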


In [16]:
# Save the cluster assignments to a file (optional)
np.save("cluster_assignments.npy", cluster_assignments)

# Analyze results
print(f"Number of documents clustered: {len(cluster_assignments)}")
print(f"Cluster distribution: {np.bincount(cluster_assignments)}")

Number of documents clustered: 10000000
Cluster distribution: [100651  97296 107087  94618 107199  94960 109830  97594 104802 109454
 106202 104366 106969 101504 101069  92804 104876  99182 100138  95491
 102317  87704  99995  56437  99547 101298 103724 111784 104271  93560
 102506 104545  98106 107408 100971  99000 105377 101251  80115  98058
 101601 103384  99340 104280  94726 102630  98192 105902  96861  89314
  99186 101826 106090  92036 104614  93061  98762  97327 101876 101358
 102184 102623 103350 104055 107690  97801 106645  99997  99244 105845
 104599 106911  99740  93440  92541  92483 100642 106052 100756 108349
  92424 103153  98404 103952  97227 102327  98601  93737 100425 101367
 107152  98198  96579 101784 103939  96267  97935  98483  87883  98784]
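The printed distribution averages 100,000 documents per cluster, and the smallest cluster (index 23) holds 56,437. A small sketch for summarizing cluster sizes and flagging unusually small clusters; the function name `summarize_cluster_sizes` and the `low_ratio` threshold are hypothetical choices, not part of the original analysis:

```python
import numpy as np


def summarize_cluster_sizes(assignments, n_clusters, low_ratio=0.7):
    """Summarize cluster sizes and list clusters well below the mean size."""
    # minlength keeps empty trailing clusters in the count vector.
    counts = np.bincount(assignments, minlength=n_clusters)
    mean_size = counts.mean()
    small = np.flatnonzero(counts < low_ratio * mean_size)
    return {
        "min": int(counts.min()),
        "max": int(counts.max()),
        "mean": float(mean_size),
        "small_clusters": small.tolist(),
    }


# Toy assignments (hypothetical): cluster sizes are 4, 3 and 1.
toy_assignments = np.array([0, 0, 0, 0, 1, 1, 1, 2])
summary = summarize_cluster_sizes(toy_assignments, n_clusters=3)
print(summary)
```

With the notebook's data, the call would be `summarize_cluster_sizes(cluster_assignments, n_clusters)`.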


## Next Steps and additional ideas

- Take N random samples from each cluster and send them through a prompt to get a topic label for that cluster.
- The number of clusters may need some experimentation. It is a good idea to keep the number as high as practical, so that similar topics can be normalized later and their clusters merged.
- Going a step further, use the topic label to retrieve the most similar messages within that cluster, and prompt for a different topic for the least similar messages.
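The first idea above, taking N random samples per cluster, can be sketched as follows. The function name `sample_per_cluster` and the toy data are hypothetical; the returned indices would be used to look up the original message texts before they are sent to a prompt:

```python
import numpy as np


def sample_per_cluster(assignments, n_per_cluster, seed=0):
    """Return a dict mapping cluster id to up to n_per_cluster random document indices."""
    rng = np.random.default_rng(seed)
    # Sort document indices by cluster id so each cluster forms one contiguous run.
    order = np.argsort(assignments, kind="stable")
    labels, starts = np.unique(assignments[order], return_index=True)
    ends = np.append(starts[1:], len(order))
    samples = {}
    for label, start, end in zip(labels, starts, ends):
        members = order[start:end]
        take = min(n_per_cluster, len(members))
        samples[int(label)] = rng.choice(members, size=take, replace=False)
    return samples


# Toy assignments (hypothetical): 9 documents in 3 clusters.
toy_assignments = np.array([0, 1, 0, 2, 1, 0, 2, 2, 2])
samples = sample_per_cluster(toy_assignments, n_per_cluster=2)
for cluster_id, doc_indices in samples.items():
    print(cluster_id, doc_indices)
```

Sorting once and slicing avoids scanning the full assignment array separately for every cluster, which matters when there are many clusters and many documents.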