## Single-Topic Embeddings

Assuming that each document belongs to exactly one topic simplifies the model without loss of generality.

After documents are assigned to topics, the *semantic embeddings* of the documents can be computed from the matrix $H$, where element $H_{ij}$ is the probability of assigning document $i$ to topic $j$. For brevity, this embedding assumes that each document is assigned to only one topic:

$$\overrightarrow Z_{i} = \operatorname{argmax}(\overrightarrow H_{i})$$

$$|\overrightarrow{Z}| = |\overrightarrow{H}| = N_{\text{topics}}$$

where $Z_i$ is a one-hot-style vector whose only non-zero value, at index $j$, is the probability of assigning document $i$ to topic $j$. $H_{i}$ is the topic membership vector of document $i$, as generated by HDBSCAN during topic modeling, and $\sum_{j} H_{ij} = 1$ holds.
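
The single-topic assumption can be illustrated with a minimal NumPy sketch that keeps only the largest entry in each row of a membership matrix. The function name `single_topic_embedding` and the toy values of `H` are assumptions made for illustration; they are not part of the notebook's pipeline.

```python
import numpy as np


def single_topic_embedding(H):
    """Keep only the most probable topic of each document (row of H)."""
    H = np.asarray(H, dtype=float)
    Z = np.zeros_like(H)
    rows = np.arange(H.shape[0])
    best = H.argmax(axis=1)        # index of the most probable topic per document
    Z[rows, best] = H[rows, best]  # copy that single probability, leave the rest at zero
    return Z


# Toy membership matrix (hypothetical values): 3 documents, 4 topics, rows sum to 1
H = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.1, 0.1, 0.2, 0.6]])
Z = single_topic_embedding(H)
```

Each row of `Z` has the same length as the corresponding row of `H` and exactly one non-zero entry, located at the document's most probable topic.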

Consequently, the topic-space embedding of a document set $L_j$ can be calculated by sum-pooling the embeddings of its member documents:

$$\overrightarrow Z_{L_j} = \sum_{i \in L_j} \overrightarrow  Z_{i}$$

$$|\overrightarrow{Z}_{L_j}| = N_{\text{topics}}$$

where $L_j$ is a list of document indices, and $Z_{L_j}$ is the embedding for the label $j$ in the topics space.
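
The pooling step can be sketched on toy data as follows. The array `Z`, the label names, and the `label_docs` mapping are hypothetical values chosen for illustration, not data from the notebook.

```python
import numpy as np

# Hypothetical single-topic embeddings: 4 documents over 3 topics
Z = np.array([[0.9, 0.0, 0.0],
              [0.0, 0.6, 0.0],
              [0.8, 0.0, 0.0],
              [0.0, 0.0, 0.5]])

# Hypothetical mapping from each label j to its document indices L_j
label_docs = {"label_a": [0, 1], "label_b": [2, 3]}

# Sum-pool the embeddings of the documents that belong to each label
Z_labels = {label: Z[idx].sum(axis=0) for label, idx in label_docs.items()}
```

Every pooled vector keeps the dimensionality of the topic space, so labels can be compared directly in that space.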

In [None]:
# Install requirements
%pip install -Uq bertopic matplotlib seaborn xmltodict
%pip install -Uq git+https://github.com/scikit-learn-contrib/hdbscan

# Creating a new conda env is highly recommended because of the conflicting packages.
# %conda activate bertopic

In [None]:
# Setup and imports

%reload_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_theme()  # noqa

from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

from python.cogtext.datasets.pubmed import PubMedDataLoader

In [None]:
data_loader = PubMedDataLoader()
data = data_loader()

In [None]:

train_data, test_data = train_test_split(data,
                                         test_size=0.5,
                                         stratify=data['label'],
                                         random_state=42)

X_train = train_data['abstract'].values
y_train = train_data['label'].astype('category').cat.codes

X_test = test_data['abstract'].values
y_test = test_data['label'].astype('category').cat.codes

In [None]:

# pretrained document embeddings
embeddings_file = 'models/universal-sentence-encoder-v4/abstracts_embeddings.npz'
# embeddings_file = 'models/all-MiniLM-L6-v2/abstracts_embeddings.npz'

doc_embedding_model = SentenceTransformer('all-distilroberta-v1')

doc_embeddings = np.load(embeddings_file)['arr_0']
train_doc_embeddings = doc_embeddings[train_data.index]
test_doc_embeddings = doc_embeddings[test_data.index]

# OR recompute the document embeddings with the sentence-transformer model
# doc_embeddings = doc_embedding_model.encode(X_train, show_progress_bar=True)

In [None]:

# UMAP
umap_model = UMAP(n_neighbors=15,
                  n_components=5,
                  min_dist=0.0,
                  metric='cosine',
                  low_memory=False)

# HDBSCAN
hdbscan_model = HDBSCAN(min_cluster_size=5,
                        metric='euclidean',
                        prediction_data=True)

vectorizer_model = CountVectorizer(ngram_range=(1, 4),
                                   stop_words='english',
                                  #  max_features=20000,
                                   max_df=int(X_train.shape[0] * 1.0),
                                   min_df=5)

# BERTopic
model = BERTopic(calculate_probabilities=False,
                 nr_topics='auto',
                 embedding_model=doc_embedding_model,
                 umap_model=umap_model,
                 hdbscan_model=hdbscan_model,
                 vectorizer_model=vectorizer_model,
                 verbose=True)

# fit the topic model
train_topics, train_scores = model.fit_transform(documents=X_train, y=y_train,
                                                 embeddings=train_doc_embeddings)

test_topics, test_scores = model.transform(documents=X_test,
                                           embeddings=test_doc_embeddings)

model.get_topic_info()

In [None]:
# model.visualize_hierarchy()
# model.visualize_term_rank()
# model.visualize_topics()
# model.visualize_barchart()

sns.displot(train_scores)
plt.show()

In [None]:
# train_data['topic'] = train_topics
# train_data['topic_score'] = train_scores
# train_data['topic'].replace({-1:np.nan}, inplace=True)
# train_data.query('topic.isna()')['topic_score'] = 0.0
# Z = train_data[['topic','topic_score','label']]

# Z.groupby('label').mean()

# pd.get_dummies(train_data['topic']).iloc[10].sum()

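# concatenate explicitly: `+` would add element-wise if the topics are numpy arrays
all_topics = np.concatenate([np.asarray(train_topics), np.asarray(test_topics)])
n_topics = int(all_topics.max()) + 1  # topics are numbered 0..K-1; -1 marks outliers
Z_train = np.zeros((train_data.shape[0], n_topics))
Z_test = np.zeros((test_data.shape[0], n_topics))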

for i, (topic, score) in enumerate(zip(train_topics, train_scores)):
  if topic != -1:
    Z_train[i,topic] = score

for i, topic in enumerate(test_topics):
  if topic != -1:
    Z_test[i,topic] = 1.0

# confirming that non-zero element is valid and that one-hot encoding works as expected
assert all(Z_train.sum(axis=1) == np.array(train_scores))
assert all((Z_train.argmax(axis=1) == train_topics) | (np.array(train_topics) == -1))

# assert all(Z_test.sum(axis=1) == np.array(test_scores))
assert all((Z_test.argmax(axis=1) == test_topics) | (np.array(test_topics) == -1))

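Z_train = pd.DataFrame(Z_train)
Z_train['label'] = train_data['label'].values  # .values avoids misalignment with the split index
Z_train = Z_train.groupby('label').sum()

Z_test = pd.DataFrame(Z_test)
Z_test = Z_test.iloc[:,:100]  # keep only the first 100 topics
Z_test['label'] = test_data['label'].values  # labels of the test documents, not the train ones
Z_test = Z_test.groupby('label').sum()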

# Z_docs.corr('pearson')
# sns.clustermap(Z_test, standard_scale=1)
# Z_test.corr('pearson')

from sklearn.metrics.pairwise import cosine_similarity

Z_sim = pd.DataFrame(cosine_similarity(Z_test), index=Z_test.index, columns=Z_test.index)

categories = Z_sim.index.to_series().apply(lambda x: test_data.query('label == @x').iloc[0]['category'])
palette = ['darkgreen', 'gold']  # one color per category; assumes exactly two categories
colors = [palette[c] for c in categories.astype('category').cat.codes.to_list()]

sns.clustermap(Z_sim, figsize=(25,25), cmap='RdBu', row_colors=colors)
plt.show()

# Z_test_normalized = Z_test.div(Z_test.sum(axis=1), axis=0).fillna(0.0)
# sns.clustermap(Z_test_normalized, col_cluster=False, figsize=(25,25))
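
The similarity matrix above compares labels by the angle between their topic-space embeddings. As a rough sketch of what that computation amounts to, the NumPy snippet below normalizes each row to unit length and takes pairwise dot products. The values in `Z_labels` are made up for illustration, and the zero-norm guard is an assumption about how all-zero rows should be treated.

```python
import numpy as np

# Made-up label embeddings in a 3-topic space (one row per label)
Z_labels = np.array([[2.0, 0.0, 0.0],
                     [1.0, 1.0, 0.0],
                     [0.0, 0.0, 3.0]])

# Scale each row to unit length; rows with zero norm are left as zeros
norms = np.linalg.norm(Z_labels, axis=1, keepdims=True)
unit = Z_labels / np.where(norms == 0, 1.0, norms)

# Pairwise cosine similarity is the dot product of the unit vectors
sim = unit @ unit.T
```

Labels that load on disjoint topics get a similarity of zero, while labels that share topics get a value between zero and one, because the embeddings are non-negative.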