# Topic Embedding

To ground documents in an interpretable space, we map the document embeddings to a shared topic space. Because the topic space reflects the underlying semantic structure of the documents, it may provide a more interpretable embedding.

## Inputs

- `models/gpt3/abstracts_gpt3ada.nc` is a NetCDF4 file containing the GPT-3 embeddings.

## Outputs

- `models/gpt3/abstracts_gpt3ada_clusters.csv.gz` contains cluster assignments for each document.
- `models/gpt3/abstracts_gpt3ada_weights.npz` contains the membership weights of the documents in the topic space. Each row is a document, and each column holds that document's membership value for the corresponding topic.

- TODO: combine both outputs into a single NetCDF4 file: `models/gpt3/abstracts_gpt3ada_topic-embeddings.nc`
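
To make the layout of the weights file concrete, here is a minimal sketch that writes and reads a toy `.npz` file the same way this notebook saves the real one. The toy matrix and the temporary file name are illustrative only. Because `np.savez` receives the array positionally, NumPy stores it under the default key `arr_0`.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for the real weights matrix: 3 documents x 2 topics.
toy_weights = np.array([[0.80, 0.10],
                        [0.30, 0.60],
                        [0.05, 0.02]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_weights.npz")  # hypothetical file name
    np.savez(path, toy_weights)  # positional array is stored as "arr_0"
    with np.load(path) as archive:
        loaded = archive["arr_0"]

# Rows are documents, columns are topics: pick the strongest topic per document.
dominant_topic = loaded.argmax(axis=1)
print(loaded.shape, dominant_topic)
```

The same `archive["arr_0"]` lookup should apply to `abstracts_gpt3ada_weights.npz`, as long as the file is written with a positional `np.savez` call.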

## Requirements

```bash
mamba activate cogtext
mamba install hdbscan  # additional package, not required by the NB3 notebook
```

In [8]:
# Setup and imports

%reload_ext autoreload
%reload_ext watermark

%autoreload 2

import numpy as np
import pandas as pd
import xarray as xr

from python.cogtext.topic_model import TopicModel

%watermark
%watermark -iv -p umap,hdbscan,joblib,numpy,numba,pytorch,tensorflow,python.cogtext

Last updated: 2022-05-06T00:13:45.973971+02:00

Python implementation: CPython
Python version       : 3.9.0
IPython version      : 8.3.0

Compiler    : Clang 11.0.0 
OS          : Darwin
Release     : 21.4.0
Machine     : x86_64
Processor   : i386
CPU cores   : 12
Architecture: 64bit

umap          : 0.5.3
hdbscan       : 0.8.28
joblib        : 1.1.0
numpy         : 1.22.3
numba         : 0.53.1
pytorch       : not installed
tensorflow    : 2.7.0
python.cogtext: 0.1.2022050600

numpy : 1.22.3
pandas: 1.4.2
xarray: 2022.3.0
sys   : 3.9.0 | packaged by conda-forge | (default, Nov 26 2020, 07:54:06) 
[Clang 11.0.0 ]



First, make sure all the required embeddings and models are available.

In [9]:
# load embeddings dataset
DATASET = xr.load_dataset('models/gpt3/abstracts_gpt3ada.nc')

# load embeddings from the dataset
doc_embeddings = DATASET['gpt3_embeddings'].values
umap_embeddings = DATASET.get('umap_embeddings', None)
DATASET

In [10]:
# project document embeddings to a shared topic space

model = TopicModel(parametric_umap=False, verbose=True)
clusters, weights = model.fit_transform(doc_embeddings, umap_embeddings=umap_embeddings)

UMAP(min_dist=0.0, n_components=5, verbose=True)
Fri May  6 00:13:59 2022 Construct fuzzy simplicial set
Fri May  6 00:14:01 2022 Finding Nearest Neighbors
Fri May  6 00:14:02 2022 Building RP forest with 36 trees


KeyboardInterrupt: 

In [None]:
DATASET['umap_embeddings'] = xr.DataArray(model.umap_embeddings_, dims=['pmid', 'umap_dim'])
DATASET['topics'] = xr.DataArray(clusters, dims=['pmid'])
DATASET['topic_weights'] = xr.DataArray(weights, dims=['pmid', 'topic'])

# store the updated dataset back to the same NetCDF4 file
DATASET.to_netcdf('models/gpt3/abstracts_gpt3ada.nc')
DATASET
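
The cell above attaches the new arrays to the dataset by dimension name. As a rough sketch of how that works, the toy example below builds a small dataset with a `pmid` coordinate and adds `topics` and `topic_weights` to it. The PMIDs, values, and the `gpt3_dim` dimension name are made up for illustration and are not taken from the real dataset.

```python
import numpy as np
import xarray as xr

# Toy dataset: 3 documents with 4-dimensional embeddings (hypothetical values).
pmids = np.array([101, 102, 103])
ds = xr.Dataset(
    {"gpt3_embeddings": (("pmid", "gpt3_dim"), np.zeros((3, 4)))},
    coords={"pmid": pmids},
)

# Toy clustering output: one label per document (-1 marks noise)
# and one membership weight per document and topic.
clusters = np.array([0, -1, 1])
weights = np.array([[0.90, 0.10],
                    [0.20, 0.30],
                    [0.05, 0.95]])

# Arrays without coordinates are matched to the dataset by dimension name and size.
ds["topics"] = xr.DataArray(clusters, dims=["pmid"])
ds["topic_weights"] = xr.DataArray(weights, dims=["pmid", "topic"])

# The new variables inherit the `pmid` coordinate, so label-based lookup works.
topic_of_102 = int(ds["topics"].sel(pmid=102))
print(topic_of_102)
```

Since the new arrays carry no coordinates of their own, they only line up correctly if their rows are in the same order as the dataset's `pmid` dimension.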

In [None]:
# TODO Store umap_embeddings, weights, and clusters

In [None]:

# NOTE: `documents` is not defined earlier in this notebook; as an assumption,
# build it here as a DataFrame indexed by the dataset's `pmid` dimension
documents = pd.DataFrame(index=pd.Index(DATASET['pmid'].values, name='pmid'))

# drop cluster "-1" (noise) and make the rest 1-indexed
documents['cluster'] = np.where(clusters >= 0, clusters + 1, np.nan)

# report number of noise documents (requires the `cluster` column assigned above)
n_noise_documents = documents['cluster'].isna().sum()

print(f'Projected {doc_embeddings.shape[0]} documents to a '
      f'{weights.shape[1]}-dimensional topic space, '
      f'while discarding {n_noise_documents} noise documents.')

# store
documents.to_csv('models/gpt3/abstracts_gpt3ada_clusters.csv.gz', index=True)
np.savez('models/gpt3/abstracts_gpt3ada_weights.npz', weights)

# DEBUG: report cluster frequencies
# documents['cluster'].value_counts()
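
The recoding step in the cell above can be checked on a small example. The sketch below uses made-up labels, with `-1` standing for noise (the HDBSCAN convention), and applies the same `np.where` expression.

```python
import numpy as np

# Toy cluster labels: -1 marks noise documents, clusters are 0-indexed.
clusters = np.array([0, 2, -1, 1, -1, 0])

# Shift valid labels to be 1-indexed and replace noise with NaN.
# The result is a float array because NaN has no integer representation.
recoded = np.where(clusters >= 0, clusters + 1, np.nan)

# Count the noise documents after recoding.
n_noise = int(np.isnan(recoded).sum())

print(recoded)
print(n_noise)
```

The counting step only makes sense after the recoding, since the NaN values it counts do not exist before that step.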
