# **Tutorial on FASTopic**

Author: **[Xiaobao Wu](https://bobxwu.github.io/)**

<br>

![stars](https://img.shields.io/github/stars/bobxwu/FASTopic?logo=github)
[![PyPI](https://img.shields.io/pypi/v/fastopic)](https://pypi.org/project/fastopic)
[![Downloads](https://static.pepy.tech/badge/fastopic)](https://pepy.tech/project/fastopic)
[![LICENSE](https://img.shields.io/github/license/bobxwu/fastopic)](https://www.apache.org/licenses/LICENSE-2.0/)
[![arXiv](https://img.shields.io/badge/arXiv-2405.17978-<COLOR>.svg)](https://arxiv.org/pdf/2405.17978.pdf)
[![Contributors](https://img.shields.io/github/contributors/bobxwu/fastopic)](https://github.com/bobxwu/fastopic/graphs/contributors/)


FASTopic is a fast, adaptive, stable, and transferable topic model. It differs
from conventional (LDA), VAE-based (ProdLDA, ETM), and clustering-based (Top2Vec, BERTopic) methods.
It uses optimal transport between the document, topic, and word embeddings from pretrained Transformers to model topics and the topic distributions of documents.
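
To give a feel for what "optimal transport between embeddings" means, here is a minimal NumPy sketch of entropic-regularized transport (Sinkhorn iterations) between stand-in document and topic embeddings. It is a generic illustration, not FASTopic's implementation: the function name `sinkhorn_plan`, the random embeddings, the uniform marginals, and the values of `epsilon` and `n_iters` are all assumptions made for this example.

```python
import numpy as np

def sinkhorn_plan(cost, epsilon=0.5, n_iters=200):
    """Entropic-regularized transport plan between two uniform distributions.

    Illustrative helper (hypothetical, not part of FASTopic).
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)        # uniform mass on the rows (documents)
    b = np.full(m, 1.0 / m)        # uniform mass on the columns (topics)
    K = np.exp(-cost / epsilon)    # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)            # rescale rows to match a
        v = b / (K.T @ u)          # rescale columns to match b
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
doc_emb = rng.normal(size=(6, 8))    # stand-in document embeddings
topic_emb = rng.normal(size=(3, 8))  # stand-in topic embeddings

# Squared Euclidean distance between every document and every topic.
cost = ((doc_emb[:, None, :] - topic_emb[None, :, :]) ** 2).sum(axis=-1)
cost = cost / cost.max()             # scale to [0, 1] for numerical stability

plan = sinkhorn_plan(cost)
# Normalizing each row gives one distribution over topics per document.
doc_topic = plan / plan.sum(axis=1, keepdims=True)
print(doc_topic.round(3))
```

Documents that lie close to a topic embedding receive more transported mass for that topic, which is the intuition behind reading a transport plan as a document-topic distribution.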

See our paper: **[[NeurIPS 2024] FASTopic: Pretrained Transformer is a Fast, Adaptive, Stable, and Transferable Topic Model](https://arxiv.org/pdf/2405.17978.pdf)**

<br>

<img src='https://github.com/BobXWu/FASTopic/raw/master/docs/img/illustration.svg' width='300pt'></img>


## Install FASTopic

In [1]:
!pip install fastopic



## Download a dataset

We download the preprocessed [NYT](https://github.com/BobXWu/TopMost/tree/main/data) dataset, which contains news articles from the New York Times.

In [2]:
import topmost
from topmost.data import download_dataset
from fastopic import FASTopic

download_dataset("NYT", cache_path="./datasets")
dataset = topmost.data.DynamicDataset("./datasets/NYT", as_tensor=False)
docs = dataset.train_texts

100%|██████████| 15.1M/15.1M [00:01<00:00, 10.3MB/s]


train size:  8254
test size:  918
vocab size:  10000
average length: 175.429
num of each time slice:  11 [ 194  265  431  554  744  837  802  884 1283 1400  860]


In [4]:
import pandas as pd
import numpy as np
import os
import sys

# Make the local `vector_search` package importable (machine-specific path).
sys.path.append("/Users/hendrikweichel/projects/NaceCodeClassification/nace_remedi/Summarization_Classification/src")

from vector_search.Available_embedding_models import AvailableEmbeddingModels
from vector_search import pypdf_extraction
from vector_search import text_splitting
from vector_search import filter_chunks

In [5]:
# Two candidate annual reports; only the active assignment below is used.
# pdf_path = "/Users/hendrikweichel/projects/NaceCodeClassification/nace_remedi/Summarization_Classification/data/annual_reports/conti_annual-report-2023-data.pdf"
pdf_path = "/Users/hendrikweichel/projects/NaceCodeClassification/nace_remedi/Summarization_Classification/data/annual_reports/mercedes-benz-annual-report-2023-incl-combined-management-report-mbg-ag-2.pdf"

In [6]:
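# Extract the text of each PDF page.
pages = pypdf_extraction.pypdf_extract(pdf_path=pdf_path)
#pages = pypdf_extraction.pypdf_extract_sync(pdf_path=pdf_path)
# Split the pages into overlapping chunks; each chunk becomes one document.
chunks = text_splitting.split_pages(pages, chunk_size=1000, chunk_overlap=40)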

Extracting text: 100%|██████████| 353/353 [00:06<00:00, 56.18it/s] 


In [16]:
docs = [chunk["text"] for chunk in chunks]
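
Chunks extracted from a PDF often include near-empty pieces such as page numbers or headings. The following sketch shows one way to normalize whitespace and drop very short chunks before training. It assumes only that each chunk is a dict with a `"text"` key, as in the cell above; the helper `clean_docs`, the `min_words` threshold, and the example chunks are hypothetical.

```python
def clean_docs(chunks, min_words=5):
    """Collapse whitespace and drop chunks with fewer than `min_words` words.

    Hypothetical helper for illustration; the threshold is an arbitrary choice.
    """
    docs = []
    for chunk in chunks:
        # split() without arguments also removes newlines and repeated spaces.
        text = " ".join(chunk["text"].split())
        if len(text.split()) >= min_words:
            docs.append(text)
    return docs

# Made-up chunks standing in for the output of the splitting step.
example_chunks = [
    {"text": "Revenue increased   in the\nreporting period compared to the prior year."},
    {"text": "  \n 12 "},
    {"text": "Table of contents"},
]
cleaned = clean_docs(example_chunks)
print(cleaned)
```

If documents are filtered this way, any per-document metadata (for example time slices) has to be filtered with the same mask so that the lists stay aligned.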

## Train FASTopic

In [17]:
model = FASTopic(num_topics=50, verbose=True)
topic_top_words, doc_topic_dist = model.fit_transform(docs)

2025-07-02 18:03:43,925 - FASTopic - use device: cpu
2025-07-02 18:03:43,926 - FASTopic - First fit the model.
loading train texts: 100%|██████████| 1347/1347 [00:00<00:00, 7499.59it/s]
parsing texts: 100%|██████████| 1347/1347 [00:00<00:00, 16513.83it/s]

The parameter 'token_pattern' will not be used since 'tokenizer' is not None'

2025-07-02 18:03:45,733 - TopMost - Real vocab size: 8474
2025-07-02 18:03:45,734 - TopMost - Real training size: 1347 	 avg length: 71.891


Batches:   0%|          | 0/43 [00:00<?, ?it/s]

Training FASTopic:   4%|▍         | 9/200 [00:01<00:30,  6.26it/s]2025-07-02 18:03:58,290 - FASTopic - Epoch: 010 loss: 649.498
Training FASTopic:  10%|▉         | 19/200 [00:03<00:35,  5.12it/s]2025-07-02 18:04:00,177 - FASTopic - Epoch: 020 loss: 634.634
Training FASTopic:  14%|█▍        | 29/200 [00:05<00:37,  4.58it/s]2025-07-02 18:04:02,336 - FASTopic - Epoch: 030 loss: 612.498
Training FASTopic:  20%|█▉        | 39/200 [00:07<00:34,  4.68it/s]2025-07-02 18:04:04,524 - FASTopic - Epoch: 040 loss: 595.677
Training FASTopic:  24%|██▍       | 49/200 [00:09<00:33,  4.52it/s]2025-07-02 18:04:06,734 - FASTopic - Epoch: 050 loss: 582.668
Training FASTopic:  30%|██▉       | 59/200 [00:11<00:30,  4.55it/s]2025-07-02 18:04:08,946 - FASTopic - Epoch: 060 loss: 572.221
Training FASTopic:  34%|███▍      | 69/200 [00:14<00:28,  4.54it/s]2025-07-02 18:04:11,140 - FASTopic - Epoch: 070 loss: 563.656
Training FASTopic:  40%|███▉      | 79/200 [00:16<00:27,  4.46it/s]2025-07-02 18:04:13,346 - FASTo

Topic 0: derivative hedges fair instruments nominal equivalents securities marketable derivatives hedge instrument amortized debt payables measured
Topic 1: finan ate accounting disclosures cial obtained opinions material exists matters independent estimates refer presentation view
Topic 2: bbac thbv saft imported dbrs flexibility opening vansmercedes purely suvs ldt brokerage minority sells rat
Topic 3: footnotes wwweuroncapcom acquiring shift persistently emerging clarification showrooms dear ystems roll platformthe functionscorporate geospatial strikes
Topic 4: private modular bevs want vanea upper eqt quickly citan positioning archi growthmercedes gamma continents anchored
Topic 5: human training integrity employees managers culture principles social cms diversity rights work responsibility qualification working
Topic 6: leveladjusted levelmercedes tracts levelresearch dollar nega levelrevenue covid penetration attribution remarketing unable adjusting verified unlike
Topic 7: bpo v
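
A common next step is to assign each document to its most probable topic. The sketch below assumes that `doc_topic_dist` is a NumPy array of shape `(num_docs, num_topics)` whose rows are topic distributions; to stay self-contained it uses a small made-up array, `doc_topic_dist_demo`, in place of the real output.

```python
import numpy as np

# Made-up stand-in for doc_topic_dist: 4 documents, 3 topics, rows sum to 1.
doc_topic_dist_demo = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.5, 0.2],
    [0.6, 0.3, 0.1],
])

# Index of the most probable topic for each document.
dominant_topic = doc_topic_dist_demo.argmax(axis=1)

# Number of documents whose dominant topic is k, for every topic k.
topic_sizes = np.bincount(dominant_topic, minlength=doc_topic_dist_demo.shape[1])

print(dominant_topic)  # [0 2 1 0]
print(topic_sizes)     # [2 1 1]
```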

## Topic info

We can get the top words of a topic together with their probabilities.

In [18]:
model.get_topic(topic_idx=36)

(('declarationannual', 0.011193578),
 ('contents', 0.008970907),
 ('information', 0.008371554),
 ('governanceannual', 0.007466087),
 ('governance', 0.0074163196))
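
The result is a tuple of `(word, probability)` pairs, so it converts directly into a dict or a formatted table. The sketch below copies the output shown above into a variable named `topic_36` (a name chosen for this example) instead of calling the model.

```python
# Output of model.get_topic(topic_idx=36), copied from the cell above.
topic_36 = (
    ("declarationannual", 0.011193578),
    ("contents", 0.008970907),
    ("information", 0.008371554),
    ("governanceannual", 0.007466087),
    ("governance", 0.0074163196),
)

word_prob = dict(topic_36)            # word -> probability lookup
words = [word for word, _ in topic_36]
total = sum(word_prob.values())       # probability mass of the listed words

for word, prob in topic_36:
    print(f"{word:<20s} {prob:.4f}")
print(f"mass covered by top {len(words)} words: {total:.4f}")
```

Here the five listed words cover about 4% of the topic's probability mass. Fused tokens such as `declarationannual` probably come from words being merged during PDF text extraction, which is worth checking before interpreting the topics.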

## Visualize topic-word distributions

In [20]:
fig = model.visualize_topic(top_n=20)
fig.show()

## Visualize topic hierarchy

We use the learned topic embeddings and `scipy.cluster.hierarchy` to build a hierarchy of discovered topics.

In [11]:
fig = model.visualize_topic_hierarchy()
fig.show()


The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
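
The message above is a warning from `scipy.cluster.hierarchy`: it is raised when `linkage` receives a square distance matrix, which it then interprets as a matrix of observations rather than distances. The sketch below shows the condensed form that `linkage` expects for precomputed distances. It uses random stand-in embeddings and is not FASTopic's internal code; the cosine metric, the `average` linkage method, and the number of clusters are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
topic_embeddings = rng.normal(size=(10, 16))  # stand-in for learned topic embeddings

# pdist returns the condensed form: one entry per unordered pair of topics.
condensed = pdist(topic_embeddings, metric="cosine")

# The equivalent square matrix (symmetric, zero diagonal). Passing this
# directly to linkage() is what triggers the warning shown above.
square = squareform(condensed)

# Build the hierarchy from the condensed distances.
Z = linkage(condensed, method="average")

# Cut the tree into at most 3 groups of topics.
labels = fcluster(Z, t=3, criterion="maxclust")
print(Z.shape, labels)
```

`squareform` converts in both directions, so an existing square distance matrix can be condensed with `squareform(square)` before it is passed to `linkage`.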



## Visualize topic weights

We plot the weights of topics in the given dataset.

In [12]:
fig = model.visualize_topic_weights(top_n=20, height=500)
fig.show()

## Get topic activity over time


Topic activity refers to the weight of a topic at a time slice.
We additionally pass the time slices of the documents, `time_slices`, to compute and plot topic activity over time.
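
One plausible way to compute such an activity table is to sum the document-topic weights of all documents that fall into the same time slice. The sketch below does this with NumPy on made-up data. It illustrates the idea only and may differ from what `topic_activity_over_time` computes; the helper name `topic_activity_sketch` and the example arrays are hypothetical.

```python
import numpy as np

def topic_activity_sketch(doc_topic_dist, time_slices):
    """Sum topic weights per time slice.

    Returns the sorted unique time slices and an array of shape
    (num_topics, num_time_slices). Hypothetical helper for illustration.
    """
    doc_topic_dist = np.asarray(doc_topic_dist)
    time_slices = np.asarray(time_slices)
    unique_times = np.unique(time_slices)
    activity = np.stack(
        [doc_topic_dist[time_slices == t].sum(axis=0) for t in unique_times],
        axis=1,
    )
    return unique_times, activity

# Made-up example: 3 documents, 2 topics, 2 time slices.
theta_demo = np.array([[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]])
times_demo = np.array([0, 0, 1])

unique_times, activity = topic_activity_sketch(theta_demo, times_demo)
print(unique_times)  # [0 1]
print(activity)      # topic 0: 1.4 then 0.1; topic 1: 0.6 then 0.9
```

The computation only makes sense when there is exactly one time slice per document, in the same order as the documents used for training.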


In [15]:
time_slices = dataset.train_times
act = model.topic_activity_over_time(time_slices)
fig = model.visualize_topic_activity(top_n=6, topic_activity=act, time_slices=time_slices)
fig.show()

AssertionError: 
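
The empty `AssertionError` is most likely a length mismatch. `docs` was overwritten in cell `In [16]` with the 1347 PDF chunks the model was trained on, while `dataset.train_times` still holds one time slice per NYT training document (8254). That FASTopic asserts equal lengths internally is an assumption here, but a check like the following sketch, with the hypothetical helper `check_time_slices`, makes the problem visible before the call.

```python
def check_time_slices(docs, time_slices):
    """Raise a descriptive error unless there is one time slice per document.

    Hypothetical helper for illustration; not part of FASTopic.
    """
    n_docs, n_times = len(docs), len(time_slices)
    if n_docs != n_times:
        raise ValueError(
            f"time_slices has {n_times} entries but the model was trained "
            f"on {n_docs} documents; they must match one to one."
        )
    return True

# Made-up data: the first call passes, the second reproduces the mismatch.
demo_docs = ["doc a", "doc b", "doc c"]
ok = check_time_slices(demo_docs, [0, 0, 1])

try:
    check_time_slices(demo_docs, [0, 1])
    mismatch_message = None
except ValueError as err:
    mismatch_message = str(err)
print(mismatch_message)
```

To plot topic activity for the PDF chunks, each chunk would need its own time slice. Alternatively, the model can be trained on `dataset.train_texts` so that `dataset.train_times` lines up with the training documents.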