# Topic Embedding
This notebook maps document embeddings to a shared topic space. The topic space reflects the underlying semantic structure of the documents and may therefore provide a more interpretable embedding.

## Inputs

- `models/gpt3/abstracts_gpt3ada.npz` contains GPT-3 embeddings, one document per row.
- `models/gpt3/abstracts_gpt3ada_pmids.csv` contains the PMIDs of the documents in the above embedding dataset.

## Outputs

- `models/gpt3/abstracts_gpt3ada_clusters.csv.gz` contains cluster assignments for each document.
- `models/gpt3/abstracts_gpt3ada_weights.npz` contains the membership weights of the documents in the topic space. Each row is a document, and each column holds the membership value for the corresponding topic.
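
As a rough illustration of the weights layout described above, the following sketch saves a small matrix the same way this notebook does (a positional array passed to `np.savez`, which NumPy stores under the key `arr_0`), reads it back, and picks the topic with the largest membership value for each document. The matrix values, the file name `toy_weights.npz`, and the variable names are hypothetical stand-ins, not the real output. Whether the real membership rows sum to one is not stated here, so the sketch does not rely on it.

```python
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical stand-in for the real weights: 4 documents x 3 topics.
toy_weights = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.4, 0.3],
    [0.0, 0.9, 0.1],
])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_weights.npz"
    # A positional array passed to np.savez is stored under the key 'arr_0'.
    np.savez(path, toy_weights)
    with np.load(path) as archive:
        loaded = archive["arr_0"]

# 0-indexed column of the largest membership value in each row.
dominant_topic = loaded.argmax(axis=1)

print(loaded.shape)
print(dominant_topic)
```

The column index returned by `argmax` is 0-indexed. How columns map onto the 1-indexed cluster labels in the clusters file is not specified in this notebook, so treat any such mapping as an assumption to verify.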


In [68]:
# Setup and imports

%reload_ext autoreload
%reload_ext watermark

%autoreload 2


from pathlib import Path
import numpy as np
import pandas as pd
from python.cogtext.datasets.pubmed import PubMedDataLoader
from python.cogtext.topic_model import TopicModel

%watermark
%watermark -iv -p umap,hdbscan,joblib,numpy,numba,pytorch,tensorflow,python.cogtext

First, make sure all the required embeddings and models are available.

In [67]:
# PUBMED = PubMedDataLoader(preprocessed=False, drop_low_occurred_labels=False).load()

# for sbert all-MiniLM-L6-v2
# doc_embeddings = np.load('models/sbert/abstracts_all-MiniLM-L6-v2.npz')['arr_0']
# documents = pd.read_csv('models/sbert/abstracts_clusters.csv.gz', index_col=0)
# umap_embeddings = np.load('models/sbert/abstracts_UMAP5d.npz')['arr_0']

# GPT-3
doc_embeddings = np.load('models/gpt3/abstracts_gpt3ada.npz')['arr_0']
documents = pd.read_csv('models/gpt3/abstracts_gpt3ada_pmids.csv', index_col=0)
umap_embeddings = None

In [69]:
# Now run topic modeling on the embeddings. This takes a while (~80 min on GPT-3 Ada embeddings).

model = TopicModel(parametric_umap=False, verbose=True)
clusters, weights = model.fit_transform(doc_embeddings, umap_embeddings=umap_embeddings)

UMAP(min_dist=0.0, n_components=5, verbose=True)


OMP: Info #271: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.


Fri Jan 21 13:48:53 2022 Construct fuzzy simplicial set
Fri Jan 21 13:48:55 2022 Finding Nearest Neighbors
Fri Jan 21 13:48:55 2022 Building RP forest with 36 trees
Fri Jan 21 13:50:11 2022 NN descent for 19 iterations
	 1  /  19
	 2  /  19
	 3  /  19
	 4  /  19
	 5  /  19
	 6  /  19
	Stopping threshold met -- exiting after 6 iterations
Fri Jan 21 13:51:54 2022 Finished Nearest Neighbor Search
Fri Jan 21 13:52:04 2022 Construct embedding


Epochs completed: 100%| ██████████ 200/200 [06:45]


Fri Jan 21 14:02:12 2022 Finished embedding
[TopicModel] Reduced embedding dimensions. Now clustering...
________________________________________________________________________________
[Memory] Calling hdbscan.hdbscan_._hdbscan_boruvka_kdtree...
_hdbscan_boruvka_kdtree(array([[11.990156, ...,  4.206806],
       ...,
       [10.079483, ...,  5.457878]], dtype=float32), 
1, 1.0, 'euclidean', None, 40, True, False, -1)
__________________________________________hdbscan_boruvka_kdtree - 21.5s, 0.4min
[TopicModel] Clustered embeddings. Now computing weights...
[TopicModel] Done!


In [87]:

# drop cluster "-1" (noise) and make the remaining cluster labels 1-indexed;
# this must run before counting noise, because it creates the 'cluster' column
documents['cluster'] = np.where(clusters >= 0, clusters + 1, np.nan)

# report the number of noise documents
n_noise_documents = documents['cluster'].isna().sum()

print(f'Mapped {doc_embeddings.shape[0]} documents to a {weights.shape[1]}-dimensional topic space, '
      f'while marking {n_noise_documents} documents as noise.')

# store the cluster assignments and the membership weights
documents.to_csv('models/gpt3/abstracts_gpt3ada_clusters.csv.gz', index=True)
np.savez('models/gpt3/abstracts_gpt3ada_weights.npz', weights)

# DEBUG: report cluster frequencies
# documents['cluster'].value_counts()


Mapped 382855 documents to a 509-dimensional topic space, while marking 172957 documents as noise.
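
The relabeling step in the cell above can be checked on a tiny example. This is a minimal sketch, assuming the cluster labels follow the HDBSCAN convention in which `-1` marks noise. The array `toy_clusters` and the other names are hypothetical and only illustrate the `np.where` expression used in the notebook.

```python
import numpy as np

# Hypothetical cluster labels; -1 marks noise documents.
toy_clusters = np.array([0, -1, 2, 1, -1, 0])

# Shift valid labels so they start at 1 and replace noise with NaN.
# The result is a float array, because NaN has no integer representation.
toy_labels = np.where(toy_clusters >= 0, toy_clusters + 1, np.nan)

# Count noise documents and find the largest 1-indexed label.
n_noise = int(np.isnan(toy_labels).sum())
max_label = int(np.nanmax(toy_labels))

print(toy_labels)
print(n_noise, max_label)
```

Because the labels become floats, they appear as values such as `1.0` in the saved CSV, and the noise rows are written as empty fields.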
