# Topic Embedding

To ground documents in an interpretable space, we map the document embeddings to a shared topic space. Because the topic space reflects the underlying semantic structure of the documents, it may provide a more interpretable embedding than the raw embeddings.

## Inputs

- `models/gpt3/abstracts_gpt3ada.nc`: NetCDF4 file containing GPT-3 embeddings.

## Outputs

- `models/gpt3/abstracts_gpt3ada.nc`: the input file, updated with the UMAP embeddings, cluster assignments, and membership weights of the documents in the topic space.

## Requirements

```bash
mamba activate cogtext
mamba install hdbscan umap-learn # additional packages compared to the NB3 notebook
```

In [4]:
# Setup and imports

%reload_ext autoreload
%autoreload 2

import xarray as xr
from pathlib import Path

from python.cogtext.topic_model import TopicModel

In [7]:
DATA_DIR = Path('../cogtext_data/')

First, make sure all the required embeddings and models are available.

In [8]:
# load embeddings dataset
DATASET = xr.load_dataset(DATA_DIR / 'gpt3' / 'abstracts_gpt3ada.nc')

# load embeddings from the dataset
doc_embeddings = DATASET['gpt3_embeddings'].values
umap_embeddings = DATASET.get('umap_embeddings', None)
DATASET

In [3]:
# project document embeddings to a shared topic space

model = TopicModel(parametric_umap=False, verbose=True)
clusters, weights = model.fit_transform(doc_embeddings, umap_embeddings=umap_embeddings)

UMAP(min_dist=0.0, n_components=5, verbose=True)
Fri May 13 18:57:29 2022 Construct fuzzy simplicial set
Fri May 13 18:57:32 2022 Finding Nearest Neighbors
Fri May 13 18:57:32 2022 Building RP forest with 36 trees
Fri May 13 18:58:34 2022 NN descent for 19 iterations
	 1  /  19
	 2  /  19
	 3  /  19
	 4  /  19
	 5  /  19
	 6  /  19
	Stopping threshold met -- exiting after 6 iterations
Fri May 13 18:59:02 2022 Finished Nearest Neighbor Search
Fri May 13 18:59:14 2022 Construct embedding


Fri May 13 19:04:11 2022 Finished embedding
[TopicModel] Reduced embeddings dimension. Now clustering...
________________________________________________________________________________
[Memory] Calling hdbscan.hdbscan_._hdbscan_boruvka_kdtree...
_hdbscan_boruvka_kdtree(array([[11.132565, ...,  3.557436],
       ...,
       [10.107224, ...,  1.624576]], dtype=float32), 
1, 1.0, 'euclidean', None, 40, True, False, -1)
__________________________________________hdbscan_boruvka_kdtree - 21.7s, 0.4min
[TopicModel] Clustered embeddings. Now computing membership weights...
[TopicModel] Done!
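
The log above shows three stages: dimensionality reduction, clustering, and computation of membership weights. To give an intuition for the last stage, here is a minimal NumPy sketch of one possible soft-membership scheme: a softmax over negative distances to cluster centroids. This is an illustrative stand-in, not how `TopicModel` computes its weights; the helper `soft_membership` and its `temperature` parameter are hypothetical.

```python
import numpy as np


def soft_membership(points, labels, temperature=1.0):
    """Illustrative soft membership: softmax over negative centroid distances.

    Hypothetical helper for intuition only; not the TopicModel implementation.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Topic ids are the non-negative labels; -1 marks noise and gets no centroid.
    topic_ids = np.unique(labels[labels >= 0])
    # One centroid per topic, computed from its assigned points only.
    centroids = np.stack([points[labels == t].mean(axis=0) for t in topic_ids])
    # Euclidean distances between every point and every centroid,
    # shape (n_points, n_topics).
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
    logits = -dists / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


# Toy data: two tight groups plus one noise point (label -1).
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [2.5, 2.5]])
labels = np.array([0, 0, 1, 1, -1])
weights = soft_membership(points, labels)
```

In this toy scheme every document, including a noise document, receives a weight for every topic, so the weight matrix has one row per document and one column per topic even though the hard labels contain `-1`.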


In [6]:
# report
unassigned_documents = (clusters == -1).sum()

print(f'Projected {doc_embeddings.shape[0]} unique documents to a '
      f'{weights.shape[1]}-dimensional topic space, '
      f'while discarding {unassigned_documents} noise documents (unassigned to any of the topics).')

Projected 382855 unique documents to a 515-dimensional topic space, while discarding 172399 noise documents (unassigned to any of the topics).
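
To put the reported counts in perspective, the share of noise documents can be worked out directly from the two numbers in the output above. The variable names below are illustrative.

```python
# Counts copied from the report output above.
n_documents = 382855
n_noise = 172399

# Documents that received a topic label (anything other than -1).
n_assigned = n_documents - n_noise
noise_fraction = n_noise / n_documents

print(f'{n_assigned} documents assigned to a topic; '
      f'{noise_fraction:.1%} labeled as noise.')
```

Roughly 45% of the documents are therefore not assigned to any topic by the hard clustering.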


In [7]:
# add the results to the dataset

DATASET['umap_embeddings'] = xr.DataArray(model.umap_embeddings_, dims=['pmid', 'umap_dim'])
DATASET['topics'] = xr.DataArray(clusters, dims=['pmid'])
DATASET['topic_weights'] = xr.DataArray(weights, dims=['pmid', 'topic'])

# document the new variables
DATASET['umap_embeddings'].attrs['description'] = 'Document embeddings projected to a 5-dimensional space using UMAP.'
DATASET['topics'].attrs['description'] = 'Topic assigned to each document. -1 indicates unassigned noise documents.'
DATASET['topic_weights'].attrs['description'] = 'Membership weights of each document for each topic. The size of the array is N_docs x N_topics.'

# write the dataset to disk, compressing the large arrays
DATASET.to_netcdf('models/gpt3/abstracts_gpt3ada.nc',
                  encoding={'gpt3_embeddings': {'zlib': True, 'complevel': 5},
                            'umap_embeddings': {'zlib': True, 'complevel': 5},
                            'topic_weights': {'zlib': True, 'complevel': 5}})

DATASET
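
A consistency check on the `clusters` and `weights` arrays can catch shape mismatches before they end up in the stored file. The sketch below is a hypothetical helper, not part of `cogtext`. It assumes that topic labels are integers from `0` to `N_topics - 1`, with `-1` for noise, and that `weights` has one column per topic.

```python
import numpy as np


def check_topic_outputs(clusters, weights):
    """Validate hard labels against a weight matrix (hypothetical helper).

    Assumes labels are integers in [-1, n_topics - 1], with -1 meaning noise.
    """
    clusters = np.asarray(clusters)
    weights = np.asarray(weights)
    if clusters.ndim != 1 or weights.ndim != 2:
        raise ValueError('expected 1-D clusters and 2-D weights')
    if weights.shape[0] != clusters.shape[0]:
        raise ValueError('clusters and weights disagree on the number of documents')
    if clusters.min() < -1 or clusters.max() >= weights.shape[1]:
        raise ValueError('cluster labels fall outside the topic range')
    if not np.isfinite(weights).all():
        raise ValueError('weights contain NaN or infinite values')
    return {
        'n_docs': clusters.shape[0],
        'n_topics': weights.shape[1],
        'n_noise': int((clusters == -1).sum()),
    }


# Toy example: four documents, two topics, one noise document.
summary = check_topic_outputs(
    np.array([0, 1, -1, 1]),
    np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.1, 0.9]]),
)
```

Applied to the arrays in this notebook, the call would be `check_topic_outputs(clusters, weights)`.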

In [9]:
%reload_ext watermark

%watermark
%watermark -iv -p umap,hdbscan,joblib,numpy,numba,pytorch,tensorflow,python.cogtext

Last updated: 2022-12-21T17:21:10.393141+01:00

Python implementation: CPython
Python version       : 3.10.8
IPython version      : 8.7.0

Compiler    : Clang 14.0.6 
OS          : Darwin
Release     : 22.2.0
Machine     : x86_64
Processor   : i386
CPU cores   : 12
Architecture: 64bit

umap          : not installed
hdbscan       : 0.8.29
joblib        : 1.2.0
numpy         : 1.24.0
numba         : not installed
pytorch       : not installed
tensorflow    : not installed
python.cogtext: 0.1.2022122117

sys              : 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:27:35) [Clang 14.0.6 ]
xarray           : 2022.12.0
matplotlib_inline: 0.1.6

