### 🤖 Topic modeling with Latent Dirichlet Allocation 🤖

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import altair as alt

from scipy import stats
from tqdm import tqdm

from dap_taltech.utils.data_getters import DataGetter

In [None]:
getter = DataGetter(local=True)
data = getter.get_oa_articles(institution="TT_p")


Topic Modeling with Latent Dirichlet Allocation (LDA) is a widely-used method in text analysis. The foundational assumption in LDA is that documents are composed of a mixture of topics, and these topics can be expressed as distributions of words.

The goal of LDA topic modeling is to discover the underlying topics within a corpus and understand how they are expressed within the documents. To this end, LDA uses Bayesian inference techniques to estimate:

1. Topic-Document Distribution ($\theta$): The probability of each topic being expressed in each document.
2. Word-Topic Distribution ($\beta$): The probability of each word being assigned to each topic.

These two distributions are inferred based on the words observed in the corpus and their co-occurrence across documents.

Dirichlet priors are assigned to these distributions, encapsulating our prior belief that each document contains only a few topics and each topic is represented by a subset of the entire vocabulary.

In plate notation, the graphical model of LDA is a three-level generative model that looks like this:

![plate](https://scikit-learn.org/stable/_images/lda_model_graph.png)

The parameters above can be described as:
- $D$ denotes the number of documents in our corpus.
- $K$ denotes the number of topics we exogenously chose for our data.
- $V$ denotes the size of the vocabulary of our corpus.
- $N_d$ is the number of words in a given document $d$.
- $w_{d,n}$ is the $n$-th word in document $d$, with $n = 1, \dots, N_d$.
- $z_{d,n}$ is the topic assignment for word $n$ in document $d$, which is drawn from a multinomial distribution.
- $\theta_d$ is the topic distribution for document $d$, a probability vector over topics.
- $\beta_k$ is the word distribution for topic $k$, a probability vector over words.


The model assumes the following generative process for a corpus with $D$ documents and $K$ topics.

1. For each topic $k = 1, \dots, K$, draw $\beta_k \sim \operatorname{Dirichlet}(\eta)$. This provides a distribution over the words.

2. For each document $d = 1, \dots, D$, draw the topic proportions $\theta_d \sim \operatorname{Dirichlet}(\alpha)$.

3. For each word $n = 1, \dots, N_d$ in document $d$:

    1. Draw the topic assignment $z_{d,n} \sim \operatorname{Multinomial}(\theta_d)$.

    2. Draw the observed word $w_{d,n} \sim \operatorname{Multinomial}(\beta_{z_{d,n}})$.
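
The generative steps above can be simulated directly. The following is a minimal sketch with NumPy, not part of the original notebook: the function name `simulate_lda_corpus`, the toy sizes and the symmetric prior values are all assumptions chosen for illustration.

```python
import numpy as np


def simulate_lda_corpus(n_docs=5, n_topics=3, vocab_size=10, doc_len=20,
                        alpha=0.5, eta=0.5, seed=0):
    """Simulate a toy corpus from the LDA generative process (illustrative only)."""
    rng = np.random.default_rng(seed)

    # Step 1: one word distribution per topic, beta_k ~ Dirichlet(eta)
    beta = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)

    # Step 2: one topic distribution per document, theta_d ~ Dirichlet(alpha)
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)

    topic_assignments, word_ids = [], []
    for d in range(n_docs):
        # Step 3.1: draw a topic for every word position in document d
        z = rng.choice(n_topics, size=doc_len, p=theta[d])
        # Step 3.2: draw each word from the word distribution of its topic
        w = np.array([rng.choice(vocab_size, p=beta[k]) for k in z])
        topic_assignments.append(z)
        word_ids.append(w)

    return beta, theta, topic_assignments, word_ids


beta_sim, theta_sim, z_sim, w_sim = simulate_lda_corpus()
print(theta_sim.round(2))  # each row is one document's topic mixture
```

Inference in LDA runs this process in reverse: only the word ids are observed, and the topic mixtures, word distributions and assignments have to be estimated.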
    
Bayesian inference derives the posterior probability from two elements: a _prior probability_ and a _likelihood function_ derived from a statistical model of the observed data (here, words and documents). A key concern is, however, how to compute the posterior distribution of all these hidden variables given documents. 
    
A stylized posterior distribution reads: 
$$
p(z, \theta, \beta \mid w, \alpha, \eta)=\frac{p(z, \theta, \beta, w \mid \alpha, \eta)}{p(w \mid \alpha, \eta)}
$$

As in many other complex statistical models with latent variables, this posterior distribution is intractable in practice (because of the coupling between $\theta$ and $\beta$ in the summation over latent topics), so approximate inference is required.


#### 📂 Unpacking $\theta$ and $\beta$
$\theta$ is a matrix where the cell at the intersection of row $d$ and column $j$ holds the probability of document $d$ being associated with topic $j$. Each row $\theta_d$ is therefore a probability vector that parameterizes the multinomial distribution from which topic assignments are drawn, and the Dirichlet prior placed on $\theta_d$ can be seen as a distribution over these distributions.

The properties of Dirichlet distributions, being the conjugate priors for multinomial distributions, make them an ideal choice for modeling $\theta$ and $\beta$. This means that the prior and posterior distributions of multinomial parameters are both Dirichlet, simplifying the computation of the posterior.
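
To make the conjugacy concrete, consider a single document $d$ with known topic assignments. Writing $n_{d,k}$ for the number of words in $d$ assigned to topic $k$ (notation introduced here for illustration), the posterior of $\theta_d$ simply adds these counts to the prior concentration parameters:

$$
\theta_d \mid z_d, \alpha \sim \operatorname{Dirichlet}\left(\alpha_1 + n_{d,1}, \dots, \alpha_K + n_{d,K}\right)
$$

The same update holds for $\beta_k$, with word counts per topic added to $\eta$.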

#### 🕵 Understanding Dirichlet Distributions
A Dirichlet distribution is the conjugate prior for the multinomial probabilities $p_1, \dots, p_k$ and is governed by concentration parameters $\alpha_1, \dots, \alpha_k$. The $\alpha$ values shape the distribution: a relatively high $\alpha_i$ shifts probability mass towards component $p_i$, while values $\alpha_i<1$ push the draws towards the corners of the simplex, where a few components take most of the mass.

In [None]:
fig = plt.figure(figsize=(12,18))
alphas = [[1, 3, 4], [10, 0.2, 0.2], [15,15,15], [1, 1, 1], [0.1, 0.1, 0.1], [0.01, 0.01, 0.01]]
for i, tripl in enumerate(alphas):
    theta = stats.dirichlet(tripl).rvs(500)
    
    ax = fig.add_subplot(3, 2, i+1, projection='3d')
    plt.title(r'$\alpha$ = {}'.format(tripl))
    ax.scatter(theta[:, 0], theta[:, 1], theta[:, 2])
    ax.view_init(azim=30)
    ax.set_xlabel(r'$\theta_1$')
    ax.set_ylabel(r'$\theta_2$')
    ax.set_zlabel(r'$\theta_3$')

#### 🔑 Insights on $\beta$, $\theta$, and Sparsity in LDA
A central aspect of LDA topic modeling lies in the sparse structure induced by modeling $\beta$ and $\theta$ as random variables drawn from Dirichlet distributions with small $\alpha$ and $\eta$ parameters.

The low values of $\alpha$ and $\eta$ contribute to a tighter clustering of documents to topics via the $\theta$ matrix and words to topics through the $\beta$ matrix. This sparsity encourages a model where documents are associated with a small set of prominent topics and each topic is represented by a relatively limited set of distinct words.

This structure aligns with our intuitive understanding of how documents and topics relate. It is common for a document to revolve around a few primary themes, rather than a broad range of unrelated topics. Similarly, while a topic can include many words, only a subset of those words are often key to defining the core idea of the topic.

This sparsity also serves a penalization function. It discourages models where a document is associated with a large number of topics, or where a topic is represented by an excessively large set of words. This penalty helps the model to maintain a clear and succinct representation of the themes present in the corpus.

Therefore, the choice of $\alpha$ and $\eta$ in the Dirichlet distributions for $\theta$ and $\beta$ plays a pivotal role in shaping the thematic structure and clarity of the resulting LDA topic model.
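
The sparsity effect described above can be checked numerically. The sketch below is an illustration rather than part of the original analysis: the helper `mean_active_topics`, the 0.05 threshold and the two concentration values are assumptions. It counts how many topics receive a non-negligible share in draws from a symmetric Dirichlet.

```python
import numpy as np


def mean_active_topics(concentration, n_topics=20, n_draws=2000,
                       threshold=0.05, seed=0):
    """Average number of topics with a share above `threshold` per draw."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated document-topic vector theta_d
    draws = rng.dirichlet(np.full(n_topics, concentration), size=n_draws)
    return (draws > threshold).sum(axis=1).mean()


sparse_count = mean_active_topics(0.1)   # small alpha: few dominant topics
dense_count = mean_active_topics(10.0)   # large alpha: mass spread evenly
print(f"alpha=0.1 -> {sparse_count:.1f} active topics on average")
print(f"alpha=10  -> {dense_count:.1f} active topics on average")
```

With a small concentration parameter, a typical draw puts most of its mass on a handful of topics, which is the behaviour the prior is meant to encourage.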

### Approximate inference

Multiple methods of approximating the posterior exist in Bayesian statistics, with **Gibbs Sampling** and **variational inference** being the most commonly applied. In what follows we mostly focus on the former, but present the latter briefly since in practice it's more easily implemented. For a complete review of variational inference methods, check [this paper](https://arxiv.org/pdf/1601.00670.pdf) by Blei and co-authors. For a non-LDA related introduction to Gibbs sampling, take a look [here](https://drum.lib.umd.edu/bitstream/handle/1903/10058/gsfu.pdf?sequence=3&isAllowed=y).

#### Variational inference

Variational inference seeks to approximate the intractable posterior with some well-known and well-behaved probability distribution that closely matches the true posterior. Using the Kullback-Leibler divergence as a measure of closeness, the algorithm optimizes over a family of distributions and picks the member that best mimics the exact posterior, i.e.

$$
\gamma^{\star}, \phi^{\star}, \lambda^{\star}=\operatorname{argmin}_{(\gamma, \phi, \lambda)} D\left(q(\theta, \mathbf{z}, \beta \mid \gamma, \phi, \lambda) \,\|\, p(\theta, \mathbf{z}, \beta \mid \mathcal{D} ; \alpha, \eta)\right)
$$

where $\gamma, \phi$ and $\lambda$ represent variational parameters used to approximate $\theta, z$ and $\beta$, respectively. The $D()$ function represents the KL divergence between a member distribution $q$ and the true posterior $p$. This method provides a locally-optimal exact-analytical solution to an approximation of the posterior distribution.
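
For a sense of how this looks in practice, scikit-learn's `LatentDirichletAllocation` fits LDA with variational Bayes. The sketch below is an illustration on a made-up four-document corpus; the toy sentences, the prior values and the variable names are assumptions, not part of this tutorial's data.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus (hypothetical, not the TalTech data)
toy_docs = [
    "cats and dogs are popular pets",
    "dogs chase cats in the garden",
    "stocks and bonds are financial assets",
    "investors trade stocks and bonds",
]
toy_counts = CountVectorizer(stop_words="english").fit_transform(toy_docs)

vb_lda = LatentDirichletAllocation(
    n_components=2,          # K
    doc_topic_prior=0.5,     # alpha
    topic_word_prior=0.5,    # eta
    learning_method="batch",
    random_state=0,
)

# Normalized document-topic estimates (one row per document)
toy_doc_topics = vb_lda.fit_transform(toy_counts)

# components_ holds the variational parameters for the topic-word
# distributions; normalizing each row gives a point estimate of beta
toy_topic_words = vb_lda.components_ / vb_lda.components_.sum(axis=1, keepdims=True)

print(toy_doc_topics.round(2))
```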

#### Gibbs Sampling

Gibbs sampling is an alternative, sampling-based method of approximating the posterior. It is a Markov chain Monte Carlo (MCMC) method: the basic idea is to iteratively sample each latent variable from its conditional distribution, holding all other latent variables fixed and treating them as known. Importantly, the conditioning variables are always set to their most recently sampled values, so that the chain gradually moves towards the joint posterior distribution.
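
Before turning to LDA, the alternating-conditionals idea can be seen on a much simpler target. The sketch below is a standard textbook illustration, not something taken from this tutorial: it samples from a bivariate standard normal with correlation `rho`, using the fact that each coordinate given the other is normal with mean `rho * other` and variance `1 - rho**2`. The function name and settings are assumptions.

```python
import numpy as np


def gibbs_bivariate_normal(rho, n_samples=5000, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    cond_sd = np.sqrt(1.0 - rho ** 2)
    samples = np.empty((n_samples, 2))
    for t in range(n_samples + burn_in):
        # Sample x given the most recent y, then y given the new x
        x = rng.normal(rho * y, cond_sd)
        y = rng.normal(rho * x, cond_sd)
        if t >= burn_in:
            samples[t - burn_in] = (x, y)
    return samples


gibbs_draws = gibbs_bivariate_normal(rho=0.8)
print(np.corrcoef(gibbs_draws.T)[0, 1])  # should be close to 0.8
```

Although no draw is ever taken from the joint distribution directly, the empirical correlation of the chain approaches the target value.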

In more practical terms, we iterate over words in a document and we estimate the conditional probability distribution of the word's specific topic assignment given all other topic assignments. Mathematically, we will find the conditional probability distribution of a single word topic assignment conditioned on the rest of the model:

$$
p\left(z_{d, n}=k \mid \vec{z}_{-d, n}, \vec{w}, \alpha, \eta\right)=\frac{p\left(z_{d, n}=k, \vec{z}_{-d, n} \mid \vec{w}, \alpha, \eta\right)}{p\left(\vec{z}_{-d, n} \mid \vec{w}, \alpha, \eta\right)}
$$

We won't delve into the math, but note that due to the special structure of the LDA model we are able to integrate out both $\theta$ and $\beta$ in the equation above (in jargon, we are able to marginalize the target posterior over $\beta$ and $\theta$). This dramatically reduces the space explored by the Gibbs sampler, which is convenient since it will converge to a stationary posterior at a faster rate. The algorithm for this marginalized posterior is known as the **collapsed Gibbs Sampler**. After integrating out, the conditional probability distribution reads:

$$
p\left(z_{d, n}=k \mid \vec{z}_{-d, n}, \vec{w}, \alpha, \eta\right)=\frac{n_{d, k}+\alpha_{k}}{\sum_{i=1}^{K} \left(n_{d, i}+\alpha_{i}\right)} \frac{v_{k, w_{d, n}}+\eta_{w_{d, n}}}{\sum_{i=1}^{V} \left(v_{k, i}+\eta_{i}\right)}
$$

There are two parts to the equation above. The first part tells us how much each topic is present in a document, while the second part tells us a topic's affinity towards a word. Since this is a probability distribution over topics, for each word we get a vector of $K$ probabilities. Once these probabilities are computed, we sample a new $z$ assignment for the word. In pseudo-code, this reads:

  1. Decrement $n_{d, z_{old}}$ and $v_{z_{old}, w_{d,n}}$
  2. Sample $z_{new}=k$ with probability proportional to $p\left(z_{d, n}=k \mid \vec{z}_{-d, n}, \vec{w}, \alpha, \eta\right)$
  3. Increment $n_{d, z_{new}}$ and $v_{z_{new}, w_{d,n}}$
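
The three steps above can be written as a small function acting on the count arrays. This is a minimal sketch under symmetric priors; the name `resample_topic`, the toy document and the prior values are assumptions made for illustration. The first denominator of the conditional is constant across topics, so it is dropped before normalizing.

```python
import numpy as np


def resample_topic(d, w, z_old, n_dk, v_kw, v_k, alpha, eta, rng):
    """One collapsed Gibbs update for a single word (counts edited in place)."""
    n_topics, vocab_size = v_kw.shape

    # 1. Remove the word's current assignment from all counts
    n_dk[d, z_old] -= 1
    v_kw[z_old, w] -= 1
    v_k[z_old] -= 1

    # 2. Conditional probability of each topic, up to a constant
    p = (n_dk[d] + alpha) * (v_kw[:, w] + eta) / (v_k + vocab_size * eta)
    p /= p.sum()
    z_new = int(rng.choice(n_topics, p=p))

    # 3. Add the word back under its new assignment
    n_dk[d, z_new] += 1
    v_kw[z_new, w] += 1
    v_k[z_new] += 1
    return z_new


# Toy state: one document with four tokens, two topics, three vocabulary terms
toy_tokens = [0, 1, 2, 0]
toy_z = [0, 1, 0, 1]
toy_n_dk, toy_v_kw, toy_v_k = np.zeros((1, 2)), np.zeros((2, 3)), np.zeros(2)
for w, z in zip(toy_tokens, toy_z):
    toy_n_dk[0, z] += 1
    toy_v_kw[z, w] += 1
    toy_v_k[z] += 1

toy_rng = np.random.default_rng(0)
toy_z[0] = resample_topic(0, toy_tokens[0], toy_z[0],
                          toy_n_dk, toy_v_kw, toy_v_k, 0.2, 0.1, toy_rng)
```

The total counts are unchanged by an update; only their allocation across topics moves.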


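After sampling **$z \mid w$** with Gibbs sampling, one can recover $\theta$ and $\beta$ with

$$
\begin{aligned}
&\hat{\beta}_{k, w}=\frac{v_{k, w}+\eta}{v_{k}+V \eta} \\
&\hat{\theta}_{d, k}=\frac{n_{d, k}+\alpha}{n_{d}+K \alpha}
\end{aligned}
$$

where $v_{k}=\sum_{w} v_{k, w}$ is the number of words assigned to topic $k$, $n_{d}=\sum_{k} n_{d, k}$ is the number of words in document $d$, and the priors are assumed symmetric. These mirror the second and first terms of the equation above, respectively.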
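
These point estimates are simple to compute from the count arrays. The following is a sketch assuming symmetric scalar priors; the function name `estimate_theta_beta` and the toy counts are illustrative assumptions.

```python
import numpy as np


def estimate_theta_beta(doc_topic_counts, topic_word_counts, alpha, eta):
    """Smoothed estimates of theta (D x K) and beta (K x V) from Gibbs counts."""
    n_dk = np.asarray(doc_topic_counts, dtype=float)
    v_kw = np.asarray(topic_word_counts, dtype=float)
    n_topics = n_dk.shape[1]
    vocab_size = v_kw.shape[1]

    # theta_hat[d, k] = (n_dk + alpha) / (n_d + K * alpha)
    theta_hat = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + n_topics * alpha)
    # beta_hat[k, w] = (v_kw + eta) / (v_k + V * eta)
    beta_hat = (v_kw + eta) / (v_kw.sum(axis=1, keepdims=True) + vocab_size * eta)
    return theta_hat, beta_hat


# Toy counts: 2 documents, 2 topics, 3 vocabulary terms
theta_hat, beta_hat = estimate_theta_beta(
    [[3, 1], [0, 4]],
    [[2, 1, 0], [0, 1, 4]],
    alpha=0.2,
    eta=0.1,
)
print(theta_hat.round(3))
```

Because of the smoothing terms, every topic and word keeps a small positive probability even when its count is zero.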

#### 🐣 Gibbs sampler example
It may be instructive to first present a quick example of a manually-built topic model where we use a Gibbs sampler to sequentially update topic assignments to words. Other parts of this script will refrain from using this code (which is slightly simplified for speed following [this paper](https://www.ics.uci.edu/~asuncion/pubs/KDD_08.pdf)), and instead we will use off-the-shelf modules from well-known Python libraries.

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
n_ = 1000

# Fetch training data from sklearn
data20, _ = fetch_20newsgroups(
    shuffle=True,
    random_state=42,
    remove=('headers', 'footers', 'quotes'),
    return_X_y=True,
)

data_sample = data20[:n_]

# Define matrix
tf_vect = CountVectorizer(
    max_df=0.8, 
    min_df=2,
    max_features=10_000,
    stop_words='english'
)
tf = tf_vect.fit_transform(data_sample)

# Get vocabulary
voc = tf_vect.vocabulary_

idx_voc = tf_vect.get_feature_names_out()

Here we convert our term frequency matrix to a list of documents where each document is represented as a list of word indices (with frequency).

In [None]:
# Initialize an empty list to store the document indices
docs = []

# Loop over each document (represented as a row in the term frequency matrix)
for row in tf.toarray():
    # Identify the indices of words that appear at least once in the document
    words = np.where(row != 0)[0].tolist()
    
    # Initialize an empty list to store the word indices, accounting for word frequency
    word_counts = []
    
    # Loop over each word index
    for word_idx in words:
        # For each occurrence of the word in the document, add the word's index to word_counts
        for i in range(row[word_idx]):
            word_counts.append(word_idx)
    
    # Append the list of word indices (with frequency) for this document to the overall docs list
    docs.append(word_counts)
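
The same conversion can be done without densifying the whole matrix. The sketch below is an alternative under stated assumptions: the helper name `counts_to_token_lists` is hypothetical, and it assumes a matrix of non-negative integer counts such as the one `CountVectorizer` returns.

```python
import numpy as np
from scipy import sparse


def counts_to_token_lists(count_matrix):
    """Expand a document-term count matrix into per-document lists of word ids."""
    csr = sparse.csr_matrix(count_matrix)
    csr.sort_indices()  # keep word ids in ascending order within each row
    token_lists = []
    for i in range(csr.shape[0]):
        row = csr.getrow(i)
        # Repeat each word id as many times as it occurs in the document
        token_lists.append(np.repeat(row.indices, row.data).tolist())
    return token_lists


toy_token_lists = counts_to_token_lists(np.array([[2, 0, 1], [0, 3, 0]]))
print(toy_token_lists)
```

Since LDA treats documents as bags of words, the order of ids within a document does not matter for the sampler.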

We initialize our parameters and hyperparameters for the Gibbs sampler. 'K' represents the number of topics we wish to learn.

In [None]:
# number of documents
D = len(docs)

# size of the vocabulary 
Voc = len(voc)  

# number of topics
K = 10   

# Dirichlet prior on per-document topic distribution
alpha = 0.2

# Dirichlet prior on per-topic word distribution
eta = 1 / K

Initialize count vectors and other necessary parameters

In [None]:
z_d_n = [[0 for _ in range(len(d))] for d in docs]  # Topic assigned to word n in document d
n_d_k = np.zeros((D, K)) # Count vector in document d for topic k 
v_k_w = np.zeros((K, Voc)) # Count vector of topic k for term w
N_d = np.zeros((D)) # Count of words in document d
V_k = np.zeros((K)) # Count of terms in topic k

Let's track a specific sample to see the effects of MCMCs.

In [None]:
test = 1

data_sample[test]

Initialization of the Gibbs sampler. We randomly assign each word in each document to one of the 'K' topics.

In [None]:
import random
from utils import parse_text
## Initialize parameters with random counts

# d → doc id
for d, doc in enumerate(docs):  
    # n → id of word inside document
    # w → id of the word in the global vocabulary
    
    for n, w in enumerate(doc):
        # Assign a t=0 topic to each word
        z_d_n[d][n] = random.randrange(K)
        # Retrieve it and assign it to the count vectors
        z = z_d_n[d][n]
        n_d_k[d][z] += 1
        N_d[d] += 1
        v_k_w[z, w] += 1
        V_k[z] += 1

This is the main Gibbs sampling loop. For each iteration, we loop over each word in each document, and resample its topic assignment conditioned on all other words' assignments.

In [None]:
for iter_ in tqdm(range(10)):
    parse_text(docs, data_sample, test, idx_voc, z_d_n)
    print('\n')
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            # Fetch previously-assigned topic for word w_d,n
            z = z_d_n[d][n]

            # Decrement counts
            n_d_k[d][z] -= 1
            v_k_w[z, w] -= 1
            V_k[z] -= 1

            ## Sample a new topic assignment from a multinomial
            # How much a document likes a particular topic
            p_d_t = (n_d_k[d] + alpha) / (N_d[d] - 1 + K * alpha)
            
            # How much a topic likes a particular word
            p_t_w = (v_k_w[:, w] + eta) / (V_k + Voc * eta)
            
            p_z = p_d_t * p_t_w
            p_z /= np.sum(p_z)
            
            # Draw from a multinomial pmf the new topic assignment
            new_z = np.random.multinomial(1, p_z).argmax()

            # Update and increment counts according to new assignment
            z_d_n[d][n] = new_z
            n_d_k[d][new_z] += 1
            v_k_w[new_z, w] += 1
            V_k[new_z] += 1

After the Gibbs sampling procedure, we can inspect the learned topic-word distribution and document-topic distribution. Here we visualize the document-topic distribution for one document.


In [None]:
plt.plot(n_d_k[test] / n_d_k[test].sum())
# Raw f-string so that \theta is passed to matplotlib's mathtext unescaped
plt.title(rf'Topic distribution $\theta_d$ for document {test}')

We can also inspect the top words in each topic to understand what the topics represent. We do this by looking at the words with the highest probability in each topic.


In [None]:
inv_vocabulary = {v: k for k, v in voc.items()}
n_top_words = 15

for topic_idx, topic in enumerate(v_k_w):
    message = f'Topic #{topic_idx}: '
    message += " ".join([inv_vocabulary[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
    print(message)

#### Topic modelling implementations using 🍅[tomotopy](https://bab2min.github.io/tomotopy)🍅

Let's explore more efficient implementations of topic modelling. There exist plenty of Python libraries that implement specific flavours of topic modelling. In this section, we explore one of the more popular ones, tomotopy.  

In [None]:
import tomotopy as tp

We initialize a Corpus object which will contain our text data. We also specify a simple tokenizer and pass a lambda function in the stop-word slot, which here we use to remove very short tokens (two characters or fewer).

In [None]:
corpus = tp.utils.Corpus(tokenizer=tp.utils.SimpleTokenizer(), stopwords=lambda x: len(x) <= 2)

corpus.process(data["preprocessed_abstract"].tolist())

print(corpus[0])

We're looking for unigrams, bigrams and trigrams in our text data. The way this is implemented in tomotopy requires one to first create the tokens (which we did in the cell above) and then extract n-grams from the corpus, which then replace their constituent unigrams.

In [None]:
# We have added all unigrams, let's also consider bigrams and trigrams
cands = corpus.extract_ngrams(min_cf=10, min_df=5, max_len=3, max_cand=2_500, normalized=True)
corpus.concat_ngrams(cands)

print(corpus[0])

##### Latent Dirichlet Allocation topic modelling 🍅

The cell below trains the model. Note that tomotopy implements Gibbs sampling, a Markov chain Monte Carlo method that repeatedly samples from the model's conditional distributions. We omit a discussion of stopping criteria, instead running the model for long enough to assume some degree of stationarity.

In [None]:
# make LDA model and train
mdl = tp.LDAModel(k=20, min_cf=10, min_df=5, corpus=corpus)
mdl.train(0)
print(f'Num docs:{len(mdl.docs)}, Vocab size: {len(mdl.used_vocabs)}, Num words: {mdl.num_words}')
print(f'Removed top words: {mdl.removed_top_words}')
for i in range(0, 2000, 10):
    mdl.train(10, workers=0)
    # i + 10 iterations have been completed at this point
    print('Iteration: {}\tLog-likelihood: {}'.format(i + 10, mdl.ll_per_word))

# mdl.save('path/to/save/model.lda.bin', True)

mdl.summary()

And that's our model trained! Let's have a look at some of the top words for each topic, where we assumed 20 topics.

In [None]:
for k in range(mdl.k):
    print(f'Top 10 words of topic #{k}')
    print(mdl.get_topic_words(k, top_n=10))

Given the unsupervised nature of the method, our topics are unlabelled. We can remedy that by extracting candidate labels with pointwise mutual information (PMI) scores, which favour n-grams whose words co-occur more often than chance would suggest. Intuitively, a topic is then best labelled using the candidates that appear disproportionately in that topic, relative to how they appear in other topics.

In [None]:
# extract candidates using Pointwise Mutual Information (PMI)
extractor = tp.label.PMIExtractor(min_cf=10, min_df=5, max_len=5, max_cand=10000)
cands = extractor.extract(mdl)
cands[:10]
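
To build intuition for what a PMI score measures, here is a small stand-alone sketch for bigrams. It is not tomotopy's implementation; the function `bigram_pmi` and the toy documents are assumptions. It uses $\mathrm{PMI}(a, b) = \log \frac{p(a, b)}{p(a)\,p(b)}$, with probabilities estimated from raw counts.

```python
import math
from collections import Counter


def bigram_pmi(token_docs):
    """PMI of every adjacent word pair, estimated from raw counts."""
    unigrams, bigrams = Counter(), Counter()
    for doc in token_docs:
        unigrams.update(doc)
        bigrams.update(zip(doc, doc[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    return {
        (a, b): math.log((count / n_bi)
                         / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        for (a, b), count in bigrams.items()
    }


toy_pmi = bigram_pmi([
    ["topic", "model", "works"],
    ["topic", "model", "fits"],
    ["model", "works", "well"],
])
# Highest-scoring pairs first
for pair, score in sorted(toy_pmi.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

A positive score means the two words appear together more often than their individual frequencies would predict, which is what makes a phrase a plausible label candidate.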

To select the top label candidates, we use `FoRelevance`. The first-order relevance approach to topic labeling automatically generates labels for multinomial topic models by scoring each of the candidates extracted above against a topic's word distribution, which amounts to minimizing Kullback-Leibler divergence and maximizing mutual information between label and topic.

In [None]:
# Topic labeling using candidates and the First-Order Relevance implementation from Mei, Q., Shen, X., & Zhai, C. (2007)
labeler = tp.label.FoRelevance(mdl, cands, min_df=5, smoothing=1e-2, mu=0.25)
for k in range(mdl.k):
    print(f"== Topic #{k} ==")
    print("Labels:", ', '.join(label for label, score in labeler.get_topic_labels(k, top_n=5)))
    for word, prob in mdl.get_topic_words(k, top_n=5):
        print(f'{word:<{16}} {round(prob,4)}', sep='\t')
    print("\n")

Let's collect some of the relevant information that results from running the LDA routine.

In [None]:
topic_term_dists = np.stack([mdl.get_topic_word_dist(k) for k in range(mdl.k)])
doc_topic_dists = np.stack([doc.get_topic_dist() for doc in mdl.docs])
doc_topic_dists /= doc_topic_dists.sum(axis=1, keepdims=True)
doc_lengths = np.array([len(doc.words) for doc in mdl.docs])
vocab = list(mdl.used_vocabs)
term_frequency = mdl.used_vocab_freq

As a start, let's look at the document topic distributions.

In [None]:
pd.DataFrame(doc_topic_dists).head(10)

Let's briefly look at how a given document is made up of token words, which have topic assignments. In this illustrative example, we assign a color to a token given its highest topic probability.  

In [None]:
topicterm_df = pd.DataFrame(topic_term_dists, columns=vocab)
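# for each word in corpus[0], find the column in topicterm_df with the corresponding name
# and record the topic with the highest value together with its share of the word's total mass
tuples = []
for x in corpus[0]:
    try:
        tuples.append((x, topicterm_df[x].idxmax(), topicterm_df[x].max()/topicterm_df[x].sum()))
    except KeyError:
        # tokens pruned from the model vocabulary have no column
        pass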

# color the words in the text
from utils import color_tomotopy_string
color_tomotopy_string(corpus[0], tuples)


Let's use pyLDAvis to visualise the topics, the relevance of the constituent tokens, and their representation in our corpus of TalTech publications.

In [None]:
# import tmplot as tmp
import pyLDAvis
import pyLDAvis.lda_model
pyLDAvis.enable_notebook()

prepared_data = pyLDAvis.prepare(
    topic_term_dists, 
    doc_topic_dists, 
    doc_lengths, 
    vocab, 
    term_frequency,
    start_index=0, # tomotopy starts topic ids with 0, pyLDAvis with 1
    sort_topics=False # IMPORTANT: otherwise the topic_ids between pyLDAvis and tomotopy are not matching!
)

In [None]:
prepared_data

🛸 **TASK**: How do hyperparameters like the number of topics, alpha, and eta affect the results of LDA?

In [None]:
# Prompt: Experiment with different hyperparameters using tomotopy's LDA implementation. 
# Observe how the topics change and discuss the impact of these hyperparameters. 
# Refer to the tomotopy documentation to understand what each hyperparameter controls.

We can also investigate the evolution of topic shares over time at TalTech.

In [None]:
column_names = [f"Topic {x}: " + ', '.join(label for label, score in labeler.get_topic_labels(x, top_n=5)) for x in range(mdl.k)]

# generate plot data
data_lda = (
    data
    .copy()
    .merge(pd.DataFrame(doc_topic_dists, columns=column_names), left_index=True, right_index=True)
)

# compute quarterly measures
data_lda = (
    data_lda
    .groupby(pd.Grouper(key='publication_date', freq='Q'))[[col for col in data_lda if "Topic" in col]].mean()
    .iloc[100:, :]
    .reset_index(drop=False)
    .rename(columns={"publication_date": "Q-year"})
    .melt(id_vars="Q-year", var_name="topic", value_name="proportion")
)

# plot
alt.Chart(data_lda).mark_bar().encode(
    alt.X(
        'Q-year:T', 
        axis=alt.Axis(format='%b-%Y', title='Year', tickCount='year'),
        scale=alt.Scale(nice='month')
    ),
    alt.Y('proportion:Q', title='Topic Proportion', stack='normalize'),
    alt.Color('topic:N', title='Topic label', legend=None),
    tooltip=['topic', 'proportion']
).properties(
    width=800,
    height=400
)

An unanswered question is how we choose the number of topics. There is typically no free lunch: methods that automatically choose the number of topics will have some other hyperparameter that regulates this and needs your attention.

Some traditional alternatives that work across approaches include:

1. Perplexity: Perplexity is a measure of how well a probability model predicts a sample. You can train multiple models with different values of K and choose the one with the lowest perplexity on a held-out test set.

2. Coherence: Coherence measures the semantic similarity between high-scoring words in a topic. You can train multiple models with different values of K and choose the one with the highest coherence score.

3. Human judgment: You can train multiple models with different values of K and manually evaluate the interpretability and coherence of the resulting topics. Choose the value of K that produces topics that are meaningful and easy to interpret.
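
For reference, the perplexity mentioned in the first option is usually defined as the exponentiated negative average log-likelihood per word on the held-out documents. Using the notation from earlier in this notebook, with $\mathbf{w}_d$ denoting the words of held-out document $d$:

$$
\operatorname{perplexity}(\mathbf{w}) = \exp\left(-\frac{\sum_{d=1}^{D} \log p(\mathbf{w}_d)}{\sum_{d=1}^{D} N_d}\right)
$$

A model that assigns higher probability to unseen text therefore has lower perplexity.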

🛸 **TASK**: Implement measures of perplexity ([see](https://bab2min.github.io/tomotopy/v/en/#tomotopy.LDAModel.perplexity) the native implementation) and coherence ([see](https://github.com/bab2min/tomotopy/blob/main/examples/coherence.py) an example from tomotopy's codebase) for your topic model.

##### Other traditional topic modelling approaches 🍅
- **[Correlated Topic Model](https://bab2min.github.io/tomotopy/v/en/#tomotopy.CTModel)**
- **[Dynamic Topic Model](https://bab2min.github.io/tomotopy/v/en/#tomotopy.DTModel)**
- **[Hierarchical LDA](https://bab2min.github.io/tomotopy/v/en/#tomotopy.HLDAModel)**
- **[Partially Labelled LDA](https://bab2min.github.io/tomotopy/v/en/#tomotopy.PLDAModel)** (beyond 🍅, see [COREX](https://pypi.org/project/corextopic/))

#### [BERTopic](https://maartengr.github.io/BERTopic/index.html) with [Huggingface](https://huggingface.co/) 🤗

BERTopic is a Python library for topic modeling and visualization that combines transformer models with c-TF-IDF to create clear topic representations. It uses transformer models like BERT to convert documents into dense vectors. It then reduces the dimensionality of these vectors with UMAP (Uniform Manifold Approximation and Projection) and clusters them with HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise). Finally, a class-based variant of TF-IDF (c-TF-IDF) extracts the terms that best characterize each cluster, producing meaningful and interpretable topics.
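
The class-based TF-IDF idea can be sketched in a few lines. The code below is an illustration, not BERTopic's own implementation: it assumes the weighting takes the form $W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\left(1 + A / f_t\right)$, where $\mathrm{tf}_{t,c}$ is the normalized frequency of term $t$ in class (cluster) $c$, $f_t$ is the frequency of $t$ across all classes and $A$ is the average number of words per class. The function name and toy counts are hypothetical.

```python
import numpy as np


def c_tf_idf(class_term_counts):
    """Class-based TF-IDF weights from a (n_classes x n_terms) count matrix."""
    counts = np.asarray(class_term_counts, dtype=float)
    # Term frequency within each class, normalized to sum to one per class
    tf = counts / counts.sum(axis=1, keepdims=True)
    # Average number of words per class and total frequency of each term
    avg_words_per_class = counts.sum() / counts.shape[0]
    term_freq = counts.sum(axis=0)
    idf = np.log(1.0 + avg_words_per_class / term_freq)
    return tf * idf


# Two clusters, three terms; the middle term is shared by both clusters
toy_weights = c_tf_idf([[5, 1, 0], [0, 1, 5]])
print(toy_weights.round(3))
```

Each cluster is treated as one long document, so terms that are frequent in one cluster but rare overall receive the highest weights.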

In terms of functionality, BERTopic allows for effective topic reduction and representation. It allows users to select the number of topics to keep after topic modeling has been performed, providing an easy way to decrease or increase the granularity of topics. The visualization of topics and their sizes, as well as the extraction of most representative documents for a given topic, makes it an interactive and flexible tool. 

Furthermore, BERTopic supports multiple languages and provides an easy interface to use any transformer model available in Hugging Face's model hub.

![plate](https://maartengr.github.io/BERTopic/algorithm/modularity.svg)


##### Quick example

The immediate benefit to leveraging Hugging Face's large library is that one can seamlessly push and pull trained topic models to and from the [HF Hub](https://huggingface.co/docs/hub/index). Let's pick a small embedding model, [paraphrase-albert-small-v2](https://huggingface.co/sentence-transformers/paraphrase-albert-small-v2). For more information on the integration between HF and BERTopic, read the recent [blogpost](https://huggingface.co/blog/bertopic) about it.

Note you may have some problems running BERTopic, given incompatibilities with old numpy versions. If that's your case, try to upgrade to a newer version of numpy than we run in this tutorial.

In [None]:
from bertopic import BERTopic

topic_model = BERTopic(
    language="english",
    n_gram_range=(1, 3),
    min_topic_size=50,
    nr_topics="auto",
    seed_topic_list=None,
    embedding_model="paraphrase-albert-small-v2",
    umap_model=None
)

topics, probs = topic_model.fit_transform(data.preprocessed_abstract)
# topic_model.push_to_hf_hub('dampudia/taltech_articles')
# topic_model = BERTopic.load("davanstrien/transformers_issues_topics")


There are a number of attributes that you can access after having trained your BERTopic model or loading one from the Hub:


| Attribute              | Description                                                                                 |
|------------------------|---------------------------------------------------------------------------------------------|
| topics_                | The topics that are generated for each document after training or updating the topic model. |
| probabilities_         | The probabilities that are generated for each document if HDBSCAN is used.                  |
| topic_sizes_           | The size of each topic.                                                                     |
| topic_mapper_          | A class for tracking topics and their mappings anytime they are merged/reduced.             |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values.                           |
| c_tf_idf_              | The topic-term matrix as calculated through c-TF-IDF.                                       |
| topic_labels_          | The default labels for each topic.                                                          |
| custom_labels_         | Custom labels for each topic as generated through `.set_topic_labels`.                      |
| topic_embeddings_      | The embeddings for each topic if `embedding_model` was used.                                |
| representative_docs_   | The representative documents for each topic if HDBSCAN is used.                             |


In [None]:
topic_model.get_topic_freq()

In [None]:
topic_model.get_topic(0)[:10]

BERTopic has visualisation tools that are similar to those of LDAvis. 

In [None]:
topic_model.visualize_topics()

One can also reduce the number of topics by hierarchically clustering them.

In [None]:
topic_model.reduce_topics(data.preprocessed_abstract, nr_topics=60)

Because topics and search terms can be embedded in the same vector space, we can look up the topics whose representations are most similar to any query term.

In [None]:
similar_topics, similarity = topic_model.find_topics("gpu", top_n=5)
print(similar_topics)
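
Conceptually, this kind of lookup ranks topics by the similarity between the query's embedding and each topic's embedding. The sketch below illustrates the idea with cosine similarity on made-up two-dimensional vectors; the function `most_similar_topics` and the toy embeddings are assumptions and do not reproduce BERTopic's internals.

```python
import numpy as np


def most_similar_topics(query_vec, topic_vecs, top_n=2):
    """Return ids and cosine similarities of the topics closest to the query."""
    query_vec = np.asarray(query_vec, dtype=float)
    topic_vecs = np.asarray(topic_vecs, dtype=float)
    sims = topic_vecs @ query_vec / (
        np.linalg.norm(topic_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    # Sort topics from most to least similar and keep the first top_n
    order = np.argsort(sims)[::-1][:top_n]
    return order.tolist(), sims[order].tolist()


# Hypothetical topic embeddings and a query that points mostly along axis 0
toy_topic_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_ids, toy_sims = most_similar_topics([1.0, 0.1], toy_topic_vecs)
print(toy_ids, [round(s, 3) for s in toy_sims])
```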