# Topic Modeling
This notebook maps vectorized documents to a topic space. The topic space reflects the underlying structure of the documents and hence provides a more `interpretable` embedding.

In [68]:
%reload_ext autoreload
%autoreload 2

from pathlib import Path
import numpy as np
import pandas as pd
from python.cogtext.datasets.pubmed import PubMedDataLoader
from python.cogtext.topic_model import TopicModel

First, make sure all the required embeddings and models are available.
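
One way to perform this check before loading anything is sketched below. It is a minimal sketch, not part of the project code: the helper `find_missing` and the list `REQUIRED_FILES` are hypothetical names introduced here, and the paths are simply the GPT-3 files that the next cell loads.

```python
from pathlib import Path

# Hypothetical list of required inputs; these are the GPT-3 paths used in the next cell.
REQUIRED_FILES = [
    Path('models/gpt3/abstracts_gpt3ada.npz'),
    Path('models/gpt3/abstracts_gpt3ada_pmids.csv'),
]


def find_missing(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [Path(p) for p in paths if not Path(p).exists()]


missing = find_missing(REQUIRED_FILES)
if missing:
    print('Missing files:', ', '.join(str(p) for p in missing))
else:
    print('All required files are available.')
```

If anything is reported as missing, the corresponding embedding step presumably needs to be run first; which notebook or script produces those files is not stated here.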

In [67]:
# PUBMED = PubMedDataLoader(preprocessed=False, drop_low_occurred_labels=False).load()

# for all-MiniLM-L6-v2
# doc_embeddings = np.load('models/sbert/abstracts_all-MiniLM-L6-v2.npz')['arr_0']
# documents = pd.read_csv('models/sbert/abstracts_clusters.csv.gz', index_col=0)
# umap_embeddings = np.load('models/sbert/abstracts_UMAP5d.npz')['arr_0']

# GPT-3
doc_embeddings = np.load('models/gpt3/abstracts_gpt3ada.npz')['arr_0']
documents = pd.read_csv('models/gpt3/abstracts_gpt3ada_pmids.csv', index_col=0)
umap_embeddings = None

In [69]:
# Now run topic modeling on the embeddings. This will take a while.

# UMAP/GPT-3 ~14min
# UMAP/HDBSCAN ~??min

model = TopicModel(parametric_umap=False, verbose=True)
clusters, weights = model.fit_transform(doc_embeddings, umap_embeddings=umap_embeddings)

UMAP(min_dist=0.0, n_components=5, verbose=True)


OMP: Info #271: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.


Fri Jan 21 13:48:53 2022 Construct fuzzy simplicial set
Fri Jan 21 13:48:55 2022 Finding Nearest Neighbors
Fri Jan 21 13:48:55 2022 Building RP forest with 36 trees
Fri Jan 21 13:50:11 2022 NN descent for 19 iterations
	 1  /  19
	 2  /  19
	 3  /  19
	 4  /  19
	 5  /  19
	 6  /  19
	Stopping threshold met -- exiting after 6 iterations
Fri Jan 21 13:51:54 2022 Finished Nearest Neighbor Search
Fri Jan 21 13:52:04 2022 Construct embedding


Epochs completed: 100%| ██████████ 200/200 [06:45]


Fri Jan 21 14:02:12 2022 Finished embedding
[TopicModel] Reduced embedding dimensions. Now clustering...
________________________________________________________________________________
[Memory] Calling hdbscan.hdbscan_._hdbscan_boruvka_kdtree...
_hdbscan_boruvka_kdtree(array([[11.990156, ...,  4.206806],
       ...,
       [10.079483, ...,  5.457878]], dtype=float32), 
1, 1.0, 'euclidean', None, 40, True, False, -1)
__________________________________________hdbscan_boruvka_kdtree - 21.5s, 0.4min
[TopicModel] Clustered embeddings. Now computing weights...


In [None]:
print(clusters.shape, weights.shape)

# mark noise points (cluster -1) as NaN and make the remaining clusters 1-indexed
documents['cluster'] = np.where(clusters >= 0, clusters + 1, np.nan)

# store
documents.to_csv('models/gpt3/abstracts_gpt3ada_clusters.csv.gz', index=True)
np.savez('models/gpt3/abstracts_gpt3ada_weights.npz', weights)

# report cluster frequencies (value_counts skips NaN, i.e. noise points, by default)
documents['cluster'].value_counts()
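
The relabeling step in the cell above can be illustrated on a toy label array. This is only a sketch: the array `toy` and the helper `relabel_clusters` are made up for illustration. It relies on the HDBSCAN convention that noise points carry the label -1. Mixing integer labels with `np.nan` in `np.where` promotes the result to floating point.

```python
import numpy as np


def relabel_clusters(labels):
    """Shift non-negative cluster labels to be 1-indexed; map noise (-1) to NaN."""
    labels = np.asarray(labels)
    return np.where(labels >= 0, labels + 1, np.nan)


# Hypothetical labels: two noise points and three clusters (0, 1, 2).
toy = np.array([-1, 0, 2, 0, -1, 1])
relabeled = relabel_clusters(toy)
print(relabeled)

# Count cluster sizes while ignoring the NaN (noise) entries.
values, counts = np.unique(relabeled[~np.isnan(relabeled)], return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```

Because the noise points become NaN, the `cluster` column is stored as floats rather than integers in the saved CSV.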

In [93]:
%reload_ext watermark
%watermark
%watermark -iv -p umap,hdbscan,joblib,numpy,numba,pytorch,tensorflow,python.cogtext

Last updated: 2021-12-05T12:41:20.252167+01:00

Python implementation: CPython
Python version       : 3.9.7
IPython version      : 7.28.0

Compiler    : Clang 11.1.0 
OS          : Darwin
Release     : 21.1.0
Machine     : x86_64
Processor   : i386
CPU cores   : 12
Architecture: 64bit

umap          : 0.5.2
hdbscan       : 0.8.27
joblib        : 1.1.0
numpy         : 1.20.3
numba         : 0.54.1
pytorch       : not installed
tensorflow    : 2.7.0
python.cogtext: 0.1.2021120512

sys   : 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:23:19) 
[Clang 11.1.0 ]
pandas: 1.3.4
numpy : 1.20.3

