# Topic Embedding

To ground documents in an interpretable space, we map their embeddings to a shared topic space. Because the topic space reflects the underlying semantic structure of the documents, it may provide a more interpretable representation than the raw embeddings.

## Inputs

- `models/gpt3/abstracts_gpt3ada.nc`: NetCDF4 file containing the GPT-3 embeddings.

## Outputs

- `models/gpt3/abstracts_gpt3ada.nc`: the same file, extended with the UMAP embeddings, cluster assignments, and membership weights of the documents in the topic space.

## Requirements

```bash
mamba activate cogtext
mamba install hdbscan umap-learn # additional packages compared to the NB3 notebook
```

In [5]:
# Setup and imports

%reload_ext autoreload
%reload_ext watermark

%autoreload 2

import xarray as xr

from python.cogtext.topic_model import TopicModel

%watermark
%watermark -iv -p umap,hdbscan,joblib,numpy,numba,pytorch,tensorflow,python.cogtext

Last updated: 2022-05-06T15:10:14.284443+02:00

Python implementation: CPython
Python version       : 3.10.4
IPython version      : 8.3.0

Compiler    : GCC 10.3.0
OS          : Linux
Release     : 5.15.0-27-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 8
Architecture: 64bit

umap          : 0.5.3
hdbscan       : 0.8.28
joblib        : 1.1.0
numpy         : 1.21.6
numba         : 0.55.1
pytorch       : not installed
tensorflow    : not installed
python.cogtext: 0.1.2022050615

pandas: 1.4.2
xarray: 2022.3.0
numpy : 1.21.6
sys   : 3.10.4 | packaged by conda-forge | (main, Mar 24 2022, 17:39:04) [GCC 10.3.0]



First, load the document embeddings and, if a previous run stored them, the UMAP embeddings.

In [7]:
# load embeddings dataset
DATASET = xr.load_dataset('models/gpt3/abstracts_gpt3ada.nc')

# extract the GPT-3 document embeddings as a NumPy array
doc_embeddings = DATASET['gpt3_embeddings'].values
# reuse UMAP embeddings stored by a previous run, if any (None otherwise)
umap_embeddings = DATASET.get('umap_embeddings', None)
DATASET

In [8]:
# project document embeddings to a shared topic space

model = TopicModel(parametric_umap=False, verbose=True)
clusters, weights = model.fit_transform(doc_embeddings, umap_embeddings=umap_embeddings)

UMAP(min_dist=0.0, n_components=5, verbose=True)
Fri May  6 15:12:11 2022 Construct fuzzy simplicial set
Fri May  6 15:12:12 2022 Finding Nearest Neighbors
Fri May  6 15:12:12 2022 Building RP forest with 36 trees
Fri May  6 15:12:51 2022 NN descent for 19 iterations
	 1  /  19
	 2  /  19
	 3  /  19
	 4  /  19
	 5  /  19
	 6  /  19
	Stopping threshold met -- exiting after 6 iterations
Fri May  6 15:13:21 2022 Finished Nearest Neighbor Search
Fri May  6 15:13:25 2022 Construct embedding


Fri May  6 15:16:53 2022 Finished embedding
[TopicModel] Reduced embeddings dimension. Now clustering...
________________________________________________________________________________
[Memory] Calling hdbscan.hdbscan_._hdbscan_boruvka_kdtree...
_hdbscan_boruvka_kdtree(array([[ 9.93065 , ...,  4.336648],
       ...,
       [10.208136, ...,  2.411834]], dtype=float32), 
1, 1.0, 'euclidean', None, 40, True, False, -1)
__________________________________________hdbscan_boruvka_kdtree - 19.3s, 0.3min
[TopicModel] Clustered embeddings. Now computing membership weights...
[TopicModel] Done!
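
The log shows dimensionality reduction, clustering, and a final membership-weight step, but not how `TopicModel` computes the weights. To illustrate what an `N_docs x N_topics` membership matrix looks like, the sketch below uses a deliberately simpler stand-in technique: a softmax over negative Euclidean distances to cluster centroids. The function name, the `temperature` parameter, and the toy data are all hypothetical and are not part of `python.cogtext`.

```python
import numpy as np


def centroid_soft_membership(points, labels, temperature=1.0):
    """Soft membership of every point in every non-noise cluster.

    Illustrative stand-in only: a softmax over negative distances to
    cluster centroids, NOT the method used by TopicModel or HDBSCAN.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # sorted cluster ids, with the noise label (-1) excluded
    topic_ids = np.unique(labels[labels != -1])
    centroids = np.stack([points[labels == t].mean(axis=0) for t in topic_ids])
    # pairwise Euclidean distances, shape (n_points, n_topics)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
    logits = -dists / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


# toy data: two tight clusters and one noise point between them
toy_points = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0], [2.5, 2.5]])
toy_labels = np.array([0, 0, 1, 1, -1])
toy_weights = centroid_soft_membership(toy_points, toy_labels)
print(toy_weights.shape)  # (5, 2): one row per point, one column per topic
```

In this sketch, a noise point still receives a weight for every topic even though its hard label is `-1`, which is why the hard assignments and the weights are kept as separate arrays.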


In [34]:
# report
unassigned_documents = (clusters == -1).sum()

print(f'Projected {doc_embeddings.shape[0]} unique documents to a '
      f'{weights.shape[1]}-dimensional topic space, '
      f'while discarding {unassigned_documents} noise documents (unassigned to any of the topics).')

Projected 382855 unique documents to a 473-dimensional topic space, while discarding 166342 noise documents (unassigned to any of the topics).
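
The share of noise documents is easier to judge as a fraction than as a raw count. The helper below is a minimal sketch that summarizes a label array in which `-1` marks noise, as in `clusters` above; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def summarize_assignments(labels, noise_label=-1):
    """Count documents, noise documents, and distinct topics in a label array."""
    counts = Counter(int(label) for label in labels)
    n_docs = sum(counts.values())
    n_noise = counts.get(noise_label, 0)
    # every distinct label except the noise label is a topic
    n_topics = len(counts) - (1 if noise_label in counts else 0)
    return {
        "n_docs": n_docs,
        "n_noise": n_noise,
        "n_topics": n_topics,
        "noise_fraction": n_noise / n_docs if n_docs else 0.0,
    }


toy_labels = [0, 0, 1, -1, 2, -1, 1, 0]
summary = summarize_assignments(toy_labels)
print(summary)
# {'n_docs': 8, 'n_noise': 2, 'n_topics': 3, 'noise_fraction': 0.25}
```

For the run reported above, 166342 of 382855 documents are noise, which is roughly 43% of the corpus.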


In [27]:
# add the results to the dataset

DATASET['umap_embeddings'] = xr.DataArray(model.umap_embeddings_, dims=['pmid', 'umap_dim'])
DATASET['topics'] = xr.DataArray(clusters, dims=['pmid'])
DATASET['topic_weights'] = xr.DataArray(weights, dims=['pmid', 'topic'])

# documentation
DATASET['umap_embeddings'].attrs['description'] = 'Document embeddings projected to a 5-dimensional space using UMAP.'
DATASET['topics'].attrs['description'] = 'Topic assigned to each document. -1 indicates unassigned noise documents.'
DATASET['topic_weights'].attrs['description'] = 'Membership weights of each document to each topic. The size of the array is N_docs x N_topics.'

# write back to the input file (safe because xr.load_dataset read it fully into memory and closed it)
DATASET.to_netcdf('models/gpt3/abstracts_gpt3ada.nc')

DATASET
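
Downstream analyses will often want to exclude the noise documents. The sketch below shows one way to do that on plain NumPy arrays, assuming the `-1` convention documented in the `topics` attribute; `drop_noise` and the toy arrays are hypothetical names used only for illustration.

```python
import numpy as np


def drop_noise(topics, topic_weights, noise_label=-1):
    """Return the topics and weight rows of documents assigned to a topic."""
    topics = np.asarray(topics)
    topic_weights = np.asarray(topic_weights)
    keep = topics != noise_label  # boolean mask over documents
    return topics[keep], topic_weights[keep]


# toy data: four documents, two topics, the second document is noise
toy_topics = np.array([0, -1, 1, 1])
toy_topic_weights = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.1, 0.9]])
kept_topics, kept_weights = drop_noise(toy_topics, toy_topic_weights)
print(kept_topics)         # [0 1 1]
print(kept_weights.shape)  # (3, 2)
```

With the stored dataset, the same call would be `drop_noise(DATASET['topics'].values, DATASET['topic_weights'].values)`.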