# Topic Embedding

To ground documents in an interpretable space, we map document embeddings to a shared topic space. Because the topic space reflects the underlying cluster structure of the documents, it may provide a more interpretable embedding than the raw vectors.

## Inputs

- `data/embeddings/abstracts2023_*.safetensors` contain the document embeddings (safetensors format).

## Outputs

- `data/embeddings/abstracts2023_topic-embeddings.nc` contains cluster assignments and membership weights of the documents in the topic space.


In [2]:
%reload_ext autoreload
%autoreload 3

from pathlib import Path

import torch
from safetensors.torch import load_file
from hdbscan import HDBSCAN  # `memory`, `core_dist_n_jobs`, `prediction_data` are hdbscan-package arguments
from sklearn.pipeline import Pipeline

from umap import UMAP
import joblib

ModuleNotFoundError: No module named 'torch'

First, make sure all the required embeddings and models are available.

In [None]:
doc_embeddings = load_file('data/embeddings/abstracts2023_jina-embeddings-v2-small-en.safetensors')

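# load_file returns a dict of tensors; stack them into one (n_documents, dim) matrix.
# Assumes each key holds a single document vector.
X = torch.stack(list(doc_embeddings.values())).numpy()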
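
`load_file` returns a plain dictionary that maps tensor names to tensors, so the embeddings have to be combined into a single matrix before they reach UMAP. A minimal sketch of that step, using NumPy arrays as stand-ins and assuming (the file itself does not confirm this) that each key is a document id holding one vector. `to_matrix` is a hypothetical helper, not part of safetensors.

```python
import numpy as np


def to_matrix(embeddings):
    """Stack a {doc_id: vector} mapping into (ids, matrix).

    Row i of the matrix corresponds to ids[i], so cluster labels can later
    be mapped back to document ids.
    """
    ids = list(embeddings)  # dicts preserve insertion order
    matrix = np.stack([np.asarray(embeddings[i]) for i in ids])
    return ids, matrix


# toy stand-in for the dictionary returned by load_file (assumed layout)
toy = {
    "doc-a": np.array([0.1, 0.2, 0.3]),
    "doc-b": np.array([0.4, 0.5, 0.6]),
}
ids, X_toy = to_matrix(toy)
print(ids, X_toy.shape)
```

Keeping the id list next to the matrix makes it possible to attach topic assignments to the right documents after clustering.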

In [None]:
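# project document embeddings to a shared topic space
# NOTE: `memory`, `core_dist_n_jobs` and `prediction_data` are arguments of the
# hdbscan package, not of sklearn.cluster.HDBSCAN, so import the estimator from there
from hdbscan import HDBSCAN, all_points_membership_vectors

pipe = Pipeline([
    ('reduce', UMAP(n_components=5, n_neighbors=15, min_dist=0.0, n_jobs=-1, metric='euclidean')),
    ('cluster', HDBSCAN(min_cluster_size=100, min_samples=1, metric='euclidean',
                        core_dist_n_jobs=-1, memory=joblib.Memory(location='tmp/hdbscan'), prediction_data=True))
])

pipe.fit(X)
clusterer = pipe.named_steps['cluster']
clusters = clusterer.labels_  # hard topic assignment per document, -1 marks noise
# soft membership of every document in every topic, shape (n_documents, n_topics);
# assumed to be the "membership weights" described under Outputs
weights = all_points_membership_vectors(clusterer)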

In [6]:
# report
unassigned_documents = (clusters == -1).sum()

print(f'Projected {X.shape[0]} unique documents to a '  # doc_embeddings is a dict and has no .shape
      f'{weights.shape[1]}-dimensional topic space, '
      f'while discarding {unassigned_documents} noise documents (unassigned to any of the topics).')

Projected 382855 unique documents to a 515-dimensional topic space, while discarding 172399 noise documents (unassigned to any of the topics).
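
The counts in the report imply that roughly 45% of the documents were labelled as noise. A short sketch of how such a summary could be derived from a label array follows; `summarize_labels` is a hypothetical helper, and the toy labels are illustrative rather than taken from the data.

```python
import numpy as np


def summarize_labels(labels):
    """Summarize cluster labels where -1 marks noise documents."""
    labels = np.asarray(labels)
    n_noise = int((labels == -1).sum())
    return {
        "n_documents": int(labels.size),
        "n_noise": n_noise,
        "noise_share": n_noise / labels.size if labels.size else 0.0,
        # number of documents per topic, indexed by topic id
        "topic_sizes": np.bincount(labels[labels >= 0]),
    }


summary = summarize_labels([0, 0, 1, -1, 1, 1, -1, 2])
print(summary)

# noise share for the counts printed in the report above
reported_share = 172399 / 382855
print(f"{reported_share:.1%}")
```

Tracking the noise share alongside the topic sizes helps when tuning `min_cluster_size` and `min_samples`, since both influence how many documents end up unassigned.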
