# 3 Topic Embedding

This notebook produces a topic embedding for cognitive tasks and constructs. A topic embedding is the vector of probabilities with which each topic is assigned to the corpus of a given task or construct. For example, with three topics, task A could be assigned the embedding `[1., .5, .1]`, where each entry is the probability of observing the corresponding topic in the corpus of task A.
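
As a minimal sketch of the idea, assume that a task's embedding is the mean of the per-document topic probabilities over that task's corpus. The aggregation rule, the array `doc_topic_probs`, the labels, and the helper `topic_embedding` are all hypothetical; the notebook does not specify them.

```python
import numpy as np

# Hypothetical document-topic probabilities: 4 documents x 3 topics.
doc_topic_probs = np.array([
    [0.9, 0.4, 0.1],
    [0.7, 0.6, 0.1],
    [0.1, 0.2, 0.8],
    [0.3, 0.2, 0.6],
])
# Hypothetical task label for each document.
doc_labels = np.array(["task_A", "task_A", "task_B", "task_B"])


def topic_embedding(probs, labels, label):
    """Average the topic probabilities of all documents tagged with `label`."""
    return probs[labels == label].mean(axis=0)


embedding_a = topic_embedding(doc_topic_probs, doc_labels, "task_A")
embedding_b = topic_embedding(doc_topic_probs, doc_labels, "task_B")
print(embedding_a)
print(embedding_b)
```

Each resulting vector has one entry per topic, so tasks and constructs can be compared directly in topic space.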

## Data

**Input**: the `PUBMED` dataset, which contains the article abstracts.

**Output**: `topic_embeddings` is a table in which each row denotes an article and contains the following columns:

- a list of topics
- the embedding for each task/construct, i.e., the probabilities of being assigned to each topic.

In [1]:
import pandas as pd
import numpy as np

from IPython.display import display

from bertopic import BERTopic

In [50]:
dataset = 'pubmed5pct'
version = '2021092511'
model = BERTopic.load(f'outputs/models/{dataset}_bertopic_v{version}.model')

probs_fpath = f'outputs/models/{dataset}_bertopic_v{version}.train_probs.npz'

# Load the training probabilities (stored under NumPy's default key 'arr_0')
with np.load(probs_fpath) as probs_f:
    probs = probs_f['arr_0']
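
The key `arr_0` is the default name that `np.savez` gives to an array passed positionally. The sketch below shows that round trip with a temporary file and a made-up probability matrix (`probs_demo` and `demo_path` are illustrative names, not the real model output); how the original `.npz` file was written is an assumption.

```python
import os
import tempfile

import numpy as np

# Made-up stand-in for the document-topic probability matrix.
probs_demo = np.random.default_rng(0).random((5, 3))

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.train_probs.npz")
    np.savez(demo_path, probs_demo)  # positional array is stored as 'arr_0'
    with np.load(demo_path) as probs_f:
        keys = list(probs_f.files)   # names of the arrays in the archive
        loaded = probs_f["arr_0"]    # read the array into memory

print(keys, loaded.shape)
```

Passing the array as a keyword instead (for example `np.savez(path, probs=probs_demo)`) would store it under that name rather than `arr_0`.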

In [63]:
# with pd.option_context('display.max_rows', 10000):
#   display(model.get_topic_info())

# model.visualize_topics()

# model.visualize_barchart()
# model.visualize_distribution(probabilities=probs[0])

model.get_params()


{'calculate_probabilities': True,
 'embedding_model': <bertopic.backend._sentencetransformers.SentenceTransformerBackend at 0x7f9d839d29d0>,
 'hdbscan_model': HDBSCAN(min_cluster_size=10, prediction_data=True),
 'language': 'english',
 'low_memory': False,
 'min_topic_size': 10,
 'n_gram_range': (1, 1),
 'nr_topics': None,
 'seed_topic_list': None,
 'top_n_words': 10,
 'umap_model': UMAP(angular_rp_forest=True, dens_frac=0.0, dens_lambda=0.0, low_memory=False,
      metric='cosine', min_dist=0.0, n_components=5),
 'vectorizer_model': CountVectorizer(),
 'verbose': True}

In [None]:
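# RSA (representational similarity analysis)
# NOTE: `result` is not defined in this notebook; it must come from an earlier analysis step.
from scipy.stats import spearmanr
from sklearn.metrics.pairwise import cosine_similarity

sim_train = cosine_similarity(result.Z)
sim_test = cosine_similarity(result.H_test)
# For 2-D inputs, spearmanr treats columns as variables and returns the joint correlation matrix.
rho = spearmanr(sim_train, sim_test)
print(f'[RSA] mean test/train correlation: {rho[0].mean():.2f}')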
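
A common way to run this kind of comparison is to correlate only the upper triangles of two similarity matrices computed over the same items. The sketch below illustrates that variant on synthetic data. The arrays `emb_a`, `emb_b`, and `emb_c` and the helpers `cosine_sim` and `rsa_score` are hypothetical stand-ins, not the `result.Z` and `result.H_test` used above, and this is not necessarily the computation the cell above is meant to perform.

```python
import numpy as np
from scipy.stats import spearmanr


def cosine_sim(x):
    """Pairwise cosine similarity between the rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T


def rsa_score(emb_a, emb_b):
    """Spearman correlation between the upper triangles of two similarity matrices."""
    sim_a, sim_b = cosine_sim(emb_a), cosine_sim(emb_b)
    iu = np.triu_indices_from(sim_a, k=1)  # skip the diagonal and mirrored entries
    rho, _ = spearmanr(sim_a[iu], sim_b[iu])
    return rho


rng = np.random.default_rng(0)
emb_a = rng.normal(size=(20, 8))                          # 20 items, 8 dimensions
emb_b = emb_a + rng.normal(scale=0.01, size=emb_a.shape)  # slightly noisy copy of emb_a
emb_c = rng.normal(size=(20, 8))                          # unrelated embedding

print(f"noisy copy: {rsa_score(emb_a, emb_b):.2f}")
print(f"unrelated:  {rsa_score(emb_a, emb_c):.2f}")
```

Restricting the comparison to the upper triangle avoids counting each pair twice and excludes the diagonal, which is always 1 for cosine similarity. Both embeddings must describe the same items in the same order.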
