# **Tutorial on FASTopic**

Author: **[Xiaobao Wu](https://bobxwu.github.io/)**

<br>

![stars](https://img.shields.io/github/stars/bobxwu/FASTopic?logo=github)
[![PyPI](https://img.shields.io/pypi/v/fastopic)](https://pypi.org/project/fastopic)
[![Downloads](https://static.pepy.tech/badge/fastopic)](https://pepy.tech/project/fastopic)
[![LICENSE](https://img.shields.io/github/license/bobxwu/fastopic)](https://www.apache.org/licenses/LICENSE-2.0/)
[![arXiv](https://img.shields.io/badge/arXiv-2405.17978-<COLOR>.svg)](https://arxiv.org/pdf/2405.17978.pdf)
[![Contributors](https://img.shields.io/github/contributors/bobxwu/fastopic)](https://github.com/bobxwu/fastopic/graphs/contributors/)


FASTopic is a fast, adaptive, stable, and transferable topic model. It differs from
conventional (LDA), VAE-based (ProdLDA, ETM), and clustering-based (Top2Vec, BERTopic) methods.
It uses optimal transport between the document, topic, and word embeddings from pretrained Transformers to model topics and the topic distributions of documents.

Check our paper: **[FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm](https://arxiv.org/pdf/2405.17978.pdf)**

<br>

<img src='https://github.com/BobXWu/FASTopic/raw/master/docs/img/illustration.svg' width='300pt'></img>
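
To make the optimal transport idea above concrete, here is a minimal, generic sketch of entropic-regularized optimal transport solved with Sinkhorn iterations. It is not FASTopic's implementation: the toy 2-D embeddings, the squared Euclidean cost, the uniform marginals, and the `epsilon` value are all assumptions chosen for illustration.

```python
import math


def sinkhorn_plan(cost, row_mass, col_mass, epsilon=0.5, n_iters=200):
    """Return a transport plan whose row/column sums match the given masses."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel K_ij = exp(-C_ij / epsilon)
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows so that each row of the plan sums to row_mass[i]
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so that each column of the plan sums to col_mass[j]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Toy embeddings (made up): two topics, four words, two words near each topic
topic_embeddings = [[0.0, 0.0], [1.0, 1.0]]
word_embeddings = [[0.1, 0.0], [0.0, 0.2], [0.9, 1.0], [1.0, 0.8]]

cost = [[squared_distance(t, w) for w in word_embeddings] for t in topic_embeddings]
plan = sinkhorn_plan(cost, row_mass=[0.5, 0.5], col_mass=[0.25] * 4)

for topic_idx, row in enumerate(plan):
    print(topic_idx, [round(p, 4) for p in row])
```

Each row of `plan` says how much mass a topic sends to each word. In this toy setup, a topic sends most of its mass to the words whose embeddings lie closest to it.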


## Install FASTopic

In [None]:
!pip install fastopic

Collecting fastopic
  Downloading fastopic-0.0.3-2-py3-none-any.whl (14 kB)
Collecting topmost>=0.0.4 (from fastopic)
  Downloading topmost-0.0.4-5-py3-none-any.whl (82 kB)
Collecting scipy<=1.10.1 (from topmost>=0.0.4->fastopic)
  Downloading scipy-1.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
Collecting sentence-transformers<3.0.0,>=2.6.0 (from topmost>=0.0.4->fastopic)
  Downloading sentence_transformers-2.7.0-py3-none-any.whl (171 kB)
Collecting bertopic>=0.15.0 (from topmost>=

## Download a dataset

We download the preprocessed [NYT](https://github.com/BobXWu/TopMost/tree/main/data) dataset, which contains news articles from The New York Times.

In [None]:
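import topmost
from topmost.data import download_dataset
from fastopic import FASTopic

# Download the preprocessed NYT dataset into ./datasets
download_dataset("NYT", cache_path="./datasets")
# Load it as a dynamic dataset, which also provides the time slice of each document
dataset = topmost.data.DynamicDataset("./datasets/NYT", as_tensor=False)
# Training documents used to fit the model
docs = dataset.train_texts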

Downloading https://raw.githubusercontent.com/BobXWu/TopMost/master/data/NYT.zip to ./datasets/NYT.zip


100%|██████████| 15070620/15070620 [00:00<00:00, 300763016.96it/s]


all train size:  8254
all test size:  918
all vocab size:  10000
average length: 175.429
num of each time slice:  11 [ 194  265  431  554  744  837  802  884 1283 1400  860]


## Train FASTopic

In [None]:
model = FASTopic(num_topics=50, verbose=True)
topic_top_words, doc_topic_dist = model.fit_transform(docs)

2024-06-23 04:20:34,300 - FASTopic - use device: cuda
loading train texts: 100%|██████████| 8254/8254 [00:03<00:00, 2651.56it/s]
parsing texts: 100%|██████████| 8254/8254 [00:02<00:00, 3351.50it/s]
2024-06-23 04:20:43,012 - TopMost - Real vocab size: 10000
2024-06-23 04:20:43,080 - TopMost - Real training size: 8254 	 avg length: 175.429


Batches:   0%|          | 0/258 [00:00<?, ?it/s]

Training FASTopic:   4%|▍         | 9/200 [00:00<00:09, 20.86it/s]2024-06-23 04:21:06,454 - FASTopic - Epoch: 010 loss: 1610.891
Training FASTopic:   9%|▉         | 18/200 [00:00<00:11, 15.47it/s]2024-06-23 04:21:07,134 - FASTopic - Epoch: 020 loss: 1580.282
Training FASTopic:  14%|█▍        | 28/200 [00:02<00:26,  6.56it/s]2024-06-23 04:21:08,440 - FASTopic - Epoch: 030 loss: 1548.460
Training FASTopic:  19%|█▉        | 38/200 [00:03<00:15, 10.57it/s]2024-06-23 04:21:09,265 - FASTopic - Epoch: 040 loss: 1524.797
Training FASTopic:  24%|██▍       | 48/200 [00:03<00:12, 11.72it/s]2024-06-23 04:21:10,090 - FASTopic - Epoch: 050 loss: 1505.981
Training FASTopic:  29%|██▉       | 58/200 [00:04<00:11, 11.98it/s]2024-06-23 04:21:10,927 - FASTopic - Epoch: 060 loss: 1491.090
Training FASTopic:  34%|███▍      | 68/200 [00:05<00:10, 12.11it/s]2024-06-23 04:21:11,749 - FASTopic - Epoch: 070 loss: 1478.911
Training FASTopic:  39%|███▉      | 78/200 [00:06<00:10, 12.00it/s]2024-06-23 04:21:12,595 

Topic 0: queue essays thriller novels sequel novel comic prose genre remembers film characters scene funny memoir
Topic 1: russian nuclear saudi putin ukrainian sanctions moscow iran missile russia ukraine vladimir iranian invasion arabia
Topic 2: yemen pilgrims houthis yemeni abdul maj houthi iraqis ahmed sana clerics kidnapping assailants missions abducted
Topic 3: mueller impeachment inquiry accusations counsel harassment attorney fbi stone investigation roger allegations committee lawyer giuliani
Topic 4: agency information email documents records accounts agencies privacy surveillance report comment postal ton account classified
Topic 5: supreme abortion court judge lawsuit ruling courts laws judges ruled ban rights legal filed amendment
Topic 6: michael barbaro recording archived know going think really just right kind want get like lot
Topic 7: gang solitary freddie stabbed manslaughter affidavit pistol knife imprisonment bystanders rioting unconscious pinned mcdonald rampage
To
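
`fit_transform` returns the top words of each topic and the topic distributions of the documents. As a small sketch of how the two can be combined, the helper below picks the highest-weight topic for each document. It assumes `doc_topic_dist` has one row per document and one column per topic. The toy values and the `dominant_topics` helper are hypothetical stand-ins, not part of FASTopic.

```python
def dominant_topics(doc_topic_dist):
    """Return (topic_index, weight) of the highest-weight topic for each document."""
    result = []
    for row in doc_topic_dist:
        best = max(range(len(row)), key=lambda k: row[k])
        result.append((best, row[best]))
    return result


# Toy stand-ins for the values returned by fit_transform (made up for illustration)
toy_doc_topic_dist = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.25, 0.5, 0.25],
]
toy_topic_top_words = [
    "court judge ruling",
    "russia ukraine putin",
    "novel film memoir",
]

for doc_idx, (topic_idx, weight) in enumerate(dominant_topics(toy_doc_topic_dist)):
    print(f"doc {doc_idx}: topic {topic_idx} ({weight:.2f}) -> {toy_topic_top_words[topic_idx]}")
```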

## Topic info

We can retrieve the top words of a topic together with their probabilities.

In [None]:
model.get_topic(topic_idx=36)

(('cancer', 0.004797671),
 ('monkeypox', 0.0044828397),
 ('certificates', 0.004410268),
 ('redfield', 0.004407463),
 ('administering', 0.0043857736))
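
The result is a tuple of `(word, probability)` pairs, so it can be post-processed with plain Python. The sketch below copies the values printed above and formats them. It also compares each probability with a uniform distribution over the 10000-word vocabulary, assuming that the probabilities are taken over the whole vocabulary. `format_topic` and `lift_over_uniform` are hypothetical helpers.

```python
# Values copied from the output of model.get_topic(topic_idx=36) above
topic_words = (
    ("cancer", 0.004797671),
    ("monkeypox", 0.0044828397),
    ("certificates", 0.004410268),
    ("redfield", 0.004407463),
    ("administering", 0.0043857736),
)

VOCAB_SIZE = 10000  # vocabulary size reported when loading the dataset


def format_topic(pairs, digits=4):
    """Render (word, probability) pairs as a single readable string."""
    return ", ".join(f"{word} ({prob:.{digits}f})" for word, prob in pairs)


def lift_over_uniform(prob, vocab_size=VOCAB_SIZE):
    """How many times larger prob is than the uniform probability 1 / vocab_size."""
    return prob * vocab_size


print(format_topic(topic_words))
for word, prob in topic_words:
    print(f"{word}: {lift_over_uniform(prob):.1f}x uniform")
```

Under that assumption, the small absolute values are expected: a uniform distribution would give every word a probability of 0.0001.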

## Visualize topic-word distributions

In [None]:
fig = model.visualize_topic(top_n=5)
fig.show()

## Visualize topic hierarchy

We use the learned topic embeddings and `scipy.cluster.hierarchy` to build a hierarchy of discovered topics.

In [None]:
fig = model.visualize_topic_hierarchy()
fig.show()
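
For intuition, here is a minimal sketch of hierarchical clustering with `scipy.cluster.hierarchy` on toy embeddings. The 2-D vectors, the average linkage, and the Euclidean metric are assumptions made for illustration. The settings that `visualize_topic_hierarchy` uses internally may differ.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy "topic embeddings" (made up): topics 0-1 and topics 2-3 form two groups
toy_topic_embeddings = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [1.0, 1.0],
    [1.0, 0.9],
])

# Agglomerative clustering; each row of the linkage matrix records one merge step
merges = linkage(toy_topic_embeddings, method="average", metric="euclidean")

# Cut the hierarchy into two flat clusters
labels = fcluster(merges, t=2, criterion="maxclust")

print(merges.shape)
print(labels)
```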

## Visualize topic weights

We plot the weights of topics in the given dataset.

In [None]:
fig = model.visualize_topic_weights(top_n=20, height=500)
fig.show()

## Get topic activity over time


Topic activity is the weight of a topic within a time slice.
To compute and plot topic activity over time, we additionally pass the time slices of the documents, `time_slices`.


In [None]:
time_slices = dataset.train_times
act = model.topic_activity_over_time(time_slices)
fig = model.visualize_topic_activity(top_n=6, topic_activity=act, time_slices=time_slices)
fig.show()
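
One way to read topic activity is as an aggregate of the document-topic weights within each time slice. The sketch below averages the weights per slice on toy data. This is an assumption about the general idea, and `topic_activity_over_time` may aggregate or normalize differently. `topic_activity_by_slice` is a hypothetical helper.

```python
from collections import defaultdict


def topic_activity_by_slice(doc_topic_dist, time_slices):
    """Average the topic weights of the documents that fall in each time slice."""
    if len(doc_topic_dist) != len(time_slices):
        raise ValueError("need exactly one time slice per document")
    sums = {}
    counts = defaultdict(int)
    for row, t in zip(doc_topic_dist, time_slices):
        if t not in sums:
            sums[t] = [0.0] * len(row)
        for k, weight in enumerate(row):
            sums[t][k] += weight
        counts[t] += 1
    # Map each time slice to its per-topic mean weight, ordered by slice
    return {t: [s / counts[t] for s in sums[t]] for t in sorted(sums)}


# Toy data (made up): three documents, two topics, two time slices
toy_doc_topic_dist = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]
toy_time_slices = [0, 0, 1]

activity = topic_activity_by_slice(toy_doc_topic_dist, toy_time_slices)
for t, weights in activity.items():
    print(t, [round(w, 2) for w in weights])
```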