# **Topic Model Evaluation**
Here, you will find the code needed to run the experiments of the paper:

*BERTopic: Neural topic modeling with a class-based TF-IDF procedure*.

The package itself can be found [here](https://github.com/MaartenGr/BERTopic) and the repository for evaluation [here]().

## **Installation**
First, we need to install a few packages in order to run our experiments. Most of the packages are installed through the `tm_evaluation` package of which [OCTIS](https://github.com/MIND-Lab/OCTIS) is an important component. 

You can install the evaluation package with `pip install .` from the root. To additionally install CTM, run `pip install .[ctm]`. To install BERTopic, run `pip install bertopic==v0.9.4` after installing the base package, or use `pip install .[bertopic]`. Top2Vec should be installed with `pip install top2vec==v1.0.26` after installing the base package.

To run a faster version of LDAseq for dynamic topic modeling, we need to uninstall gensim and install a specific merge that allows for this speed-up. First, run `pip uninstall gensim -y`, then, run `pip install git+https://github.com/RaRe-Technologies/gensim.git@refs/pull/3172/merge`

**NOTE**: After installing the above packages, make sure to restart the runtime; otherwise, you are likely to run into issues.
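After restarting, it can help to confirm that the pinned versions are the ones actually installed. The following is a minimal sketch using only the standard library; the `EXPECTED` mapping and the helper names are illustrative and not part of the evaluation package.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins taken from the install commands above (without the leading "v").
EXPECTED = {"bertopic": "0.9.4", "top2vec": "1.0.26"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_pins(expected=EXPECTED):
    """Map each package name to (installed version, whether it matches the pin)."""
    report = {}
    for name, pinned in expected.items():
        found = installed_version(name)
        report[name] = (found, found == pinned)
    return report


for name, (found, ok) in check_pins().items():
    status = "ok" if ok else "check this"
    print(f"{name}: installed={found} ({status})")
```

A package reported as `None` was not installed in the current environment, which is expected if you skipped the optional extras.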

# 1. **Data**
Some of the data can be accessed through OCTIS, such as the `20NewsGroup` and `BBC_News` datasets. Other datasets, however, are downloaded and then run through OCTIS so that they can be used in its pipeline.

The datasets that we are going to be preparing are: 
* Trump's tweets
* United Nations general debates between 2006 and 2015 

In [1]:
from evaluation import Trainer, DataLoader



In [2]:
%%time
dataloader = DataLoader(dataset="incels").preprocess_octis(documents_path="incels.txt", output_folder="incels")



created vocab
45290
words filtering done
Wall time: 5min 16s


## Trump
The data can be found here: https://www.thetrumparchive.com/faq

Using our `DataLoader` we can prepare the documents and save them in an OCTIS-based format: 

In [2]:
%%time
dataloader = DataLoader(dataset="trump").prepare_docs(save="trump.txt").preprocess_octis(output_folder="trump")



created vocab
53637
words filtering done
CPU times: total: 3min 57s
Wall time: 4min 3s


Additionally, there is a DTM variant that creates 10 timesteps to be used in the dynamic topic modeling experiments:

In [None]:
%%time
dataloader = DataLoader(dataset="trump_dtm").prepare_docs(save="trump_dtm.txt").preprocess_octis(output_folder="trump_dtm")

## United Nations

This dataset contains the transcriptions of the United Nations (UN) general debates between 2006 and 2015. The data can be found here: https://runestone.academy/runestone/books/published/httlads/_static/un-general-debates.csv

In [None]:
%%time
dataloader = DataLoader(dataset="un_dtm").prepare_docs(save="un_dtm.txt").preprocess_octis(output_folder="un_dtm")

created vocab
69447
words filtering done
CPU times: user 22min, sys: 21.5 s, total: 22min 21s
Wall time: 22min 22s


# 2. **Evaluation**
After preparing our data, we can start evaluating the topic models as used in the experiments. OCTIS already has a number of models prepared that we can use directly as shown below. 

First, we specify what the dataset is and whether that was a custom dataset not found in OCTIS. To run our custom trump dataset, we run `dataset, custom = "trump", True`. In contrast, if we are to use the prepackaged 20NewsGroup dataset, we run `dataset, custom = "20NewsGroup", False` instead. 

The OCTIS datasets can be found [here](https://github.com/MIND-Lab/OCTIS#available-datasets). 
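The `custom` flag simply records whether the dataset name refers to something OCTIS ships. As a small sketch of that rule, the hypothetical helper below derives the pair from the name; `PREPACKAGED` lists only the two OCTIS datasets named in this document and is not a complete list.

```python
# Only the OCTIS datasets mentioned in this document; extend as needed.
PREPACKAGED = {"20NewsGroup", "BBC_News"}


def dataset_spec(name):
    """Return (dataset, custom), where custom is True for non-OCTIS data."""
    return name, name not in PREPACKAGED


dataset, custom = dataset_spec("trump")
print(dataset, custom)

dataset, custom = dataset_spec("20NewsGroup")
print(dataset, custom)
```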

Second, we define a number of parameters to be used for the model. They use the following format:

`params = {"num_topics": [(i+1)*10 for i in range(5)], "random_state": random_state}`

where we define a list of topic counts to loop over, calculating the evaluation metrics for each, alongside the other parameters used in the models.
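To make the format concrete, the sketch below expands such a dictionary into one configuration per topic count. How the `Trainer` iterates internally is not shown in this document, so `expand_params` is a hypothetical helper that only illustrates the idea; the value of `random_state` is chosen arbitrarily.

```python
def expand_params(params, sweep_key="num_topics"):
    """Turn {'num_topics': [10, 20, ...], ...} into one dict per topic count."""
    fixed = {key: value for key, value in params.items() if key != sweep_key}
    return [{sweep_key: value, **fixed} for value in params[sweep_key]]


random_state = 42  # arbitrary value for illustration
params = {"num_topics": [(i + 1) * 10 for i in range(5)], "random_state": random_state}

configs = expand_params(params)
for config in configs:
    print(config)
```

The list comprehension yields the topic counts 10, 20, 30, 40 and 50, so each call to the trainer covers five settings.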

#### **Parameters**
The parameters for LDA and NMF:


```python
params = {"num_topics": [(i+1)*10 for i in range(5)], "random_state": random_state}
```

The parameters for Top2Vec:

```python
params = {"nr_topics": [(i+1)*10 for i in range(5)],
          "hdbscan_args": {'min_cluster_size': 15,
                            'metric': 'euclidean',
                            'cluster_selection_method': 'eom'}}
```
Note that the `min_cluster_size` is 15 for all datasets except BBC_News.

The parameters for CTM:

```python
params = {
    "n_components": [(i+1)*10 for i in range(5)],
    "contextual_size":768
}
```

The parameters for BERTopic:

```python
params = {
    "nr_topics": [(i+1)*10 for i in range(5)],
    "min_topic_size": 15,
    "verbose": True
}
```

Note that the `min_topic_size` is 15 for all datasets except BBC_News. Also note that we do not set an `embedding_model` here. We do this on purpose, as we can generate the embeddings beforehand and pass those to BERTopic.

## **OCTIS**
Here, we can run the experiments for NMF and LDA. 

#### NMF

In [None]:
for i, random_state in enumerate([0, 21, 42]):
    dataset, custom = "trump", True
    params = {"num_topics": [(i+1)*10 for i in range(5)], "random_state": random_state}

    trainer = Trainer(dataset=dataset,
                      model_name="NMF",
                      params=params,
                      custom_dataset=custom,
                      verbose=True)
    results = trainer.train(save=f"NMF_trump_{i+1}")

#### LDA

In [4]:
for i, random_state in enumerate([0, 21, 42]):
    dataset, custom = "incels", True
    params = {"num_topics": [(i+1)*10 for i in range(5)], "random_state": random_state}

    trainer = Trainer(dataset=dataset,
                      model_name="LDA",
                      params=params,
                      custom_dataset=custom,
                      verbose=True)
    results = trainer.train(save=f"LDA_incels_{i+1}")



Results
npmi: -0.03822239357559332
diversity: 0.53
 
Results
npmi: -0.04654416942755496
diversity: 0.47
 
Results
npmi: -0.026994572816327247
diversity: 0.44333333333333336
 
Results
npmi: -0.02407425278256937
diversity: 0.4525
 
Results
npmi: -0.043619316027619745
diversity: 0.566
 




Results
npmi: -0.024009496036820174
diversity: 0.47
 
Results
npmi: -0.022318416614106516
diversity: 0.435
 
Results
npmi: -0.03637574135670597
diversity: 0.47
 
Results
npmi: -0.0516035350050972
diversity: 0.5375
 
Results
npmi: -0.03938090473203538
diversity: 0.518
 




Results
npmi: -0.03246790542930718
diversity: 0.48
 
Results
npmi: -0.04489353551954052
diversity: 0.485
 
Results
npmi: -0.029764886318674997
diversity: 0.51
 
Results
npmi: -0.03978900011277152
diversity: 0.495
 
Results
npmi: -0.028831246428640815
diversity: 0.51
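Each `Results` block reports `npmi` (topic coherence) and `diversity`. Topic diversity is commonly defined as the share of unique words among the top-k words of all topics. The sketch below implements that definition in plain Python; whether the evaluation package computes it in exactly this way, and with `topk=10`, is an assumption.

```python
def topic_diversity(topics, topk=10):
    """Fraction of unique words among the top-k words of every topic."""
    top_words = [word for topic in topics for word in topic[:topk]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Toy example: two topics that share one of their top-2 words.
topics = [["game", "team", "season"], ["team", "player", "coach"]]
print(topic_diversity(topics, topk=2))
```

Under that assumption, a diversity of 0.53 for a 10-topic model would mean 53 distinct words among the 100 top words.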
 


## **CTM**
Here, we use the CombinedTM of the Contextualized Topic Models: https://github.com/MilaNLProc/contextualized-topic-models



In [5]:
for i in range(3):
    dataset, custom = "incels", True
    params = {
        "n_components": [(i+1)*10 for i in range(5)],
        "contextual_size":768,
        "num_data_loader_workers":0
    }

    trainer = Trainer(dataset=dataset,
                      model_name="CTM_CUSTOM",
                      params=params,
                      custom_dataset=custom,
                      verbose=True)
    results = trainer.train(save=f"CTM_incels_{i+1}")

Batches:   0%|          | 0/358 [00:00<?, ?it/s]

Epoch: [100/100]	 Seen Samples: [7150600/7150600]	Train Loss: 113.24878258993618	Time: 0:00:33.028306: : 100it [55:13, 33.13s/it]
Sampling: [20/20]: : 20it [08:20, 25.01s/it]


Results
npmi: -0.09682970609323242
diversity: 0.86
 


Epoch: [100/100]	 Seen Samples: [7150600/7150600]	Train Loss: 115.60714677379565	Time: 0:00:33.306663: : 100it [55:28, 33.28s/it]
Sampling: [20/20]: : 20it [08:24, 25.22s/it]


Results
npmi: -0.05243977003303975
diversity: 0.81
 


Epoch: [100/100]	 Seen Samples: [7150600/7150600]	Train Loss: 118.66196548068847	Time: 0:00:33.173135: : 100it [55:19, 33.19s/it]
Sampling: [20/20]: : 20it [08:26, 25.34s/it]


Results
npmi: -0.030922979685721663
diversity: 0.7033333333333334
 


Epoch: [100/100]	 Seen Samples: [7150600/7150600]	Train Loss: 122.02981500719235	Time: 0:00:33.163312: : 100it [55:20, 33.20s/it]
Sampling: [20/20]: : 20it [08:29, 25.49s/it]


Results
npmi: -0.01764962060192852
diversity: 0.61
 


Epoch: [100/100]	 Seen Samples: [7150600/7150600]	Train Loss: 125.35346149520014	Time: 0:00:33.271828: : 100it [55:28, 33.28s/it]
Sampling: [20/20]: : 20it [08:30, 25.53s/it]


Results
npmi: -0.02201759693033918
diversity: 0.566
 


Batches:   0%|          | 0/358 [00:00<?, ?it/s]

Epoch: [33/100]	 Seen Samples: [2359698/7150600]	Train Loss: 113.33262253799889	Time: 0:00:33.336203: : 33it [18:14, 33.21s/it]

KeyboardInterrupt: 

In [6]:
for i in range(2):
    dataset, custom = "incels", True
    params = {
        "n_components": [100, 125, 150, 200, 250],
        "contextual_size":768,
        "num_data_loader_workers":0
    }

    trainer = Trainer(dataset=dataset,
                      model_name="CTM_CUSTOM",
                      params=params,
                      custom_dataset=custom,
                      verbose=True)
    results = trainer.train(save=f"CTM_incels_big_{i+1}")

Batches:   0%|          | 0/358 [00:00<?, ?it/s]

Epoch: [100/100]	 Seen Samples: [7150600/7150600]	Train Loss: 144.81998253151397	Time: 0:00:33.917171: : 100it [55:50, 33.51s/it]
Sampling: [20/20]: : 20it [08:37, 25.89s/it]


Results
npmi: -0.01926399125800415
diversity: 0.321
 


Epoch: [100/100]	 Seen Samples: [7150600/7150600]	Train Loss: 152.24581625197973	Time: 0:00:33.598527: : 100it [56:02, 33.62s/it]
Sampling: [20/20]: : 20it [08:43, 26.19s/it]


Results
npmi: -0.012917291169221717
diversity: 0.3064
 


Epoch: [100/100]	 Seen Samples: [7150600/7150600]	Train Loss: 162.345519608112	Time: 0:00:33.721982: : 100it [56:10, 33.70s/it]
Sampling: [20/20]: : 20it [08:48, 26.45s/it]


Results
npmi: -0.011227304923263858
diversity: 0.25133333333333335
 


Epoch: [100/100]	 Seen Samples: [7150600/7150600]	Train Loss: 181.6485341169469	Time: 0:00:33.930000: : 100it [56:34, 33.94s/it]
Sampling: [20/20]: : 20it [08:54, 26.74s/it]


Results
npmi: -0.0124014507755403
diversity: 0.2075
 


Epoch: [100/100]	 Seen Samples: [7150600/7150600]	Train Loss: 199.7473824926514	Time: 0:00:34.429891: : 100it [56:58, 34.18s/it]
Sampling: [20/20]: : 20it [09:04, 27.21s/it]


Results
npmi: -0.008336508369607568
diversity: 0.1912
 


Batches:   0%|          | 0/358 [00:00<?, ?it/s]

Epoch: [100/100]	 Seen Samples: [7150600/7150600]	Train Loss: 143.73808723765313	Time: 0:00:33.758122: : 100it [56:00, 33.60s/it]
Sampling: [20/20]: : 20it [08:38, 25.94s/it]


Results
npmi: -0.016081933324634475
diversity: 0.355
 


Epoch: [100/100]	 Seen Samples: [7150600/7150600]	Train Loss: 152.831568555272	Time: 0:00:33.764243: : 100it [56:10, 33.70s/it] 
Sampling: [20/20]: : 20it [08:43, 26.20s/it]


Results
npmi: -0.01694039142042814
diversity: 0.304
 


Epoch: [100/100]	 Seen Samples: [7150600/7150600]	Train Loss: 162.5063383284707	Time: 0:00:33.881380: : 100it [56:26, 33.87s/it]
Sampling: [20/20]: : 20it [08:49, 26.45s/it]


Results
npmi: -0.011059868713475573
diversity: 0.25266666666666665
 


Epoch: [100/100]	 Seen Samples: [7150600/7150600]	Train Loss: 181.32811751933951	Time: 0:00:33.879039: : 100it [56:42, 34.03s/it]
Sampling: [20/20]: : 20it [08:56, 26.82s/it]


Results
npmi: -0.008153982731317504
diversity: 0.213
 


Epoch: [100/100]	 Seen Samples: [7150600/7150600]	Train Loss: 199.4580075803298	Time: 0:00:34.244610: : 100it [57:07, 34.28s/it]
Sampling: [20/20]: : 20it [09:03, 27.18s/it]


Results
npmi: -0.008644787721432222
diversity: 0.1836
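Since each setting is run more than once, the scores are easier to compare after averaging over runs. The sketch below averages the two `CTM_incels_big` runs printed above. The values are copied from that output and rounded to six decimals; matching each `Results` block to an `n_components` value assumes the results are printed in the order of the list.

```python
from statistics import mean

# (n_components, npmi, diversity) per run, in printed order.
runs = [
    [(100, -0.019264, 0.321), (125, -0.012917, 0.3064), (150, -0.011227, 0.251333),
     (200, -0.012401, 0.2075), (250, -0.008337, 0.1912)],
    [(100, -0.016082, 0.355), (125, -0.016940, 0.304), (150, -0.011060, 0.252667),
     (200, -0.008154, 0.213), (250, -0.008645, 0.1836)],
]


def average_runs(runs):
    """Average npmi and diversity across runs for each n_components value."""
    summary = {}
    for per_setting in zip(*runs):  # groups the same setting across runs
        n_components = per_setting[0][0]
        summary[n_components] = {
            "npmi": mean(row[1] for row in per_setting),
            "diversity": mean(row[2] for row in per_setting),
        }
    return summary


summary = average_runs(runs)
for n_components, scores in summary.items():
    print(f"{n_components:>4}  npmi={scores['npmi']:.4f}  diversity={scores['diversity']:.4f}")
```

In these two runs, the averaged diversity drops steadily as `n_components` grows, while the averaged NPMI moves slightly closer to zero.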
 


## **BERTopic**

To speed up BERTopic, we can generate the embeddings before passing them to the `Trainer`. This way, the same embeddings do not have to be generated 5 times, which speeds up evaluation quite a bit.

In [7]:
%%capture
from sentence_transformers import SentenceTransformer

# Prepare data
dataset, custom = "incels", True
data_loader = DataLoader(dataset)
_, timestamps = data_loader.load_docs()
data = data_loader.load_octis(custom)
data = [" ".join(words) for words in data.get_corpus()]

# Extract embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(data, show_progress_bar=True)

As shown above, we load the `data` with the data loader and join the tokens of each document to generate our training data. Then, we pass it to the sentence transformer model of our choice and generate the embeddings.

Next, we pass these embeddings to the `bt_embeddings` parameter to speed up training: 

In [4]:
for i in range(3):
    params = {
        "embedding_model": "all-MiniLM-L6-v2",
        "nr_topics": [(i+1)*10 for i in range(5)],
        "min_topic_size": 15,
        "diversity": None,
        "verbose": True
    }

    trainer = Trainer(dataset=dataset,
                      model_name="BERTopic",
                      params=params,
                      bt_embeddings=embeddings,
                      custom_dataset=custom,
                      verbose=True)
    results = trainer.train(save=f"BERTopic_incels_{i+1}")

2023-03-25 16:22:24,039 - BERTopic - Reduced dimensionality with UMAP
2023-03-25 16:22:31,698 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-25 16:22:58,788 - BERTopic - Reduced number of topics from 694 to 11


Results
npmi: -0.004570451738533479
diversity: 0.4
 


2023-03-25 16:24:23,757 - BERTopic - Reduced dimensionality with UMAP
2023-03-25 16:24:29,512 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-25 16:24:56,735 - BERTopic - Reduced number of topics from 690 to 21


Results
npmi: -0.01535210597493621
diversity: 0.375
 


2023-03-25 16:26:23,552 - BERTopic - Reduced dimensionality with UMAP
2023-03-25 16:26:29,398 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-25 16:26:57,715 - BERTopic - Reduced number of topics from 730 to 31


Results
npmi: 0.015255031960605145
diversity: 0.43
 


2023-03-25 16:28:29,695 - BERTopic - Reduced dimensionality with UMAP
2023-03-25 16:28:35,610 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-25 16:29:02,886 - BERTopic - Reduced number of topics from 684 to 41


Results
npmi: 0.021258500329169597
diversity: 0.4625
 


2023-03-25 16:30:34,419 - BERTopic - Reduced dimensionality with UMAP
2023-03-25 16:30:40,255 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-25 16:31:08,096 - BERTopic - Reduced number of topics from 720 to 51


Results
npmi: 0.025689675194642332
diversity: 0.52
 


2023-03-25 16:32:40,595 - BERTopic - Reduced dimensionality with UMAP
2023-03-25 16:32:46,714 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-25 16:33:29,412 - BERTopic - Reduced number of topics from 719 to 11


Results
npmi: -0.007103936336285556
diversity: 0.43
 


2023-03-25 16:34:49,659 - BERTopic - Reduced dimensionality with UMAP
2023-03-25 16:34:55,756 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-25 16:35:24,545 - BERTopic - Reduced number of topics from 736 to 21


Results
npmi: -0.009354392005119014
diversity: 0.405
 


2023-03-25 16:36:45,501 - BERTopic - Reduced dimensionality with UMAP
2023-03-25 16:36:51,517 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-25 16:37:19,993 - BERTopic - Reduced number of topics from 712 to 31


Results
npmi: 0.014735427828483676
diversity: 0.46
 


2023-03-25 16:38:48,176 - BERTopic - Reduced dimensionality with UMAP
2023-03-25 16:38:54,044 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-25 16:39:21,665 - BERTopic - Reduced number of topics from 703 to 41


Results
npmi: 0.019513620071159456
diversity: 0.4925
 


2023-03-25 16:40:51,407 - BERTopic - Reduced dimensionality with UMAP
2023-03-25 16:40:57,126 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-25 16:41:24,959 - BERTopic - Reduced number of topics from 715 to 51


Results
npmi: 0.019151979268951427
diversity: 0.488
 


2023-03-25 16:43:01,682 - BERTopic - Reduced dimensionality with UMAP
2023-03-25 16:43:07,579 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-25 16:43:35,901 - BERTopic - Reduced number of topics from 719 to 11


Results
npmi: -0.01040172161329305
diversity: 0.38
 


2023-03-25 16:45:00,643 - BERTopic - Reduced dimensionality with UMAP
2023-03-25 16:45:06,538 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-25 16:45:34,947 - BERTopic - Reduced number of topics from 712 to 21


Results
npmi: 0.0006207277646848468
diversity: 0.44
 


2023-03-25 16:46:58,407 - BERTopic - Reduced dimensionality with UMAP
2023-03-25 16:47:04,234 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-25 16:47:46,681 - BERTopic - Reduced number of topics from 701 to 31


Results
npmi: 0.006973075556417615
diversity: 0.4633333333333333
 


2023-03-25 16:49:10,732 - BERTopic - Reduced dimensionality with UMAP
2023-03-25 16:49:16,605 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-25 16:49:45,183 - BERTopic - Reduced number of topics from 736 to 41


Results
npmi: 0.023247660928769717
diversity: 0.5025
 


2023-03-25 16:51:15,545 - BERTopic - Reduced dimensionality with UMAP
2023-03-25 16:51:21,343 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-25 16:51:49,272 - BERTopic - Reduced number of topics from 710 to 51


Results
npmi: 0.032055600204222934
diversity: 0.526
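The extended experiment in the next cell passes lists for both `nr_topics` and `min_topic_size`. Judging from its log, the `Trainer` appears to evaluate every combination, with `nr_topics` changing slowest; that ordering is inferred from the output and not confirmed by the package. The sketch below enumerates such a grid with `itertools.product`.

```python
from itertools import product

nr_topics = [None, 10, 25, 50, 75, 100]
min_topic_size = [10, 25, 50, 75, 100]

# Cartesian product: the first list varies slowest, the second fastest.
grid = list(product(nr_topics, min_topic_size))

print(len(grid), "combinations per run")
for topics, size in grid[:6]:
    print(f"nr_topics={topics}, min_topic_size={size}")
```

With six topic settings and five cluster sizes, each of the three runs would cover 30 models, which is worth keeping in mind when estimating runtime.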
 


In [15]:
for i in range(3):
    params = {
        "embedding_model": "all-MiniLM-L6-v2",
        "nr_topics": [None, 10, 25, 50, 75, 100],
        "min_topic_size": [10, 25, 50, 75, 100],
        "diversity": None,
        "verbose": True
    }

    trainer = Trainer(dataset=dataset,
                      model_name="BERTopic",
                      params=params,
                      bt_embeddings=embeddings,
                      custom_dataset=custom,
                      verbose=True)
    results = trainer.train(save=f"BERTopic_incels_extended_{i+1}")

2023-03-26 13:11:55,098 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:12:02,060 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Results
npmi: -0.14437168660825733
diversity: 0.7679611650485437
 


2023-03-26 13:16:37,171 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:16:42,878 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Results
npmi: -0.10726969119614334
diversity: 0.7492836676217765
 


2023-03-26 13:18:54,902 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:19:01,273 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Results
npmi: -0.04179228964749067
diversity: 0.7196319018404908
 


2023-03-26 13:20:57,059 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:21:04,558 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Results
npmi: -0.0020698533028827935
diversity: 0.6392523364485981
 


2023-03-26 13:22:52,341 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:23:00,680 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Results
npmi: 0.011842347466977217
diversity: 0.580952380952381
 


2023-03-26 13:24:42,064 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:24:48,753 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 13:25:30,810 - BERTopic - Reduced number of topics from 1196 to 11


Results
npmi: -0.006806845830887295
diversity: 0.46
 


2023-03-26 13:26:51,408 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:26:57,139 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 13:27:16,130 - BERTopic - Reduced number of topics from 362 to 11


Results
npmi: -0.013072839179449916
diversity: 0.37
 


2023-03-26 13:28:36,744 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:28:43,151 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 13:28:57,280 - BERTopic - Reduced number of topics from 174 to 11


Results
npmi: -0.017532417016635243
diversity: 0.37
 


2023-03-26 13:30:23,772 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:30:31,018 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 13:30:43,320 - BERTopic - Reduced number of topics from 105 to 11


Results
npmi: -0.00643639185295942
diversity: 0.37
 


2023-03-26 13:32:03,729 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:32:12,361 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 13:32:24,871 - BERTopic - Reduced number of topics from 89 to 11


Results
npmi: -0.010031836407403233
diversity: 0.38
 


2023-03-26 13:33:45,581 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:33:52,547 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 13:34:36,438 - BERTopic - Reduced number of topics from 1252 to 26


Results
npmi: 0.010065820441362262
diversity: 0.464
 


2023-03-26 13:35:59,654 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:36:05,442 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 13:36:24,537 - BERTopic - Reduced number of topics from 362 to 26


Results
npmi: 0.009541190895024969
diversity: 0.412
 


2023-03-26 13:37:48,153 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:37:54,504 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 13:38:08,167 - BERTopic - Reduced number of topics from 157 to 26


Results
npmi: 0.00010253445925954424
diversity: 0.384
 


2023-03-26 13:39:32,460 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:39:39,721 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 13:39:52,384 - BERTopic - Reduced number of topics from 113 to 26


Results
npmi: 0.005590126001773842
diversity: 0.38
 


2023-03-26 13:41:20,887 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:41:29,085 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 13:41:40,926 - BERTopic - Reduced number of topics from 84 to 26


Results
npmi: -0.0003194572448947324
diversity: 0.348
 


2023-03-26 13:43:03,968 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:43:11,056 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 13:43:54,306 - BERTopic - Reduced number of topics from 1219 to 51


Results
npmi: 0.02034090954512196
diversity: 0.516
 


2023-03-26 13:45:20,427 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:45:26,761 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 13:45:45,790 - BERTopic - Reduced number of topics from 376 to 51


Results
npmi: 0.026359669274147565
diversity: 0.498
 


2023-03-26 13:47:11,505 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:47:17,676 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 13:47:30,950 - BERTopic - Reduced number of topics from 164 to 51


Results
npmi: 0.024152977108041828
diversity: 0.496
 


2023-03-26 13:49:01,756 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:49:09,047 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 13:49:20,880 - BERTopic - Reduced number of topics from 107 to 51


Results
npmi: 0.018228457901122274
diversity: 0.496
 


2023-03-26 13:50:46,984 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:50:55,136 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 13:51:06,303 - BERTopic - Reduced number of topics from 80 to 51


Results
npmi: 0.020478484787817237
diversity: 0.474
 


2023-03-26 13:52:32,737 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:52:39,773 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 13:53:23,020 - BERTopic - Reduced number of topics from 1241 to 76


Results
npmi: 0.020308909957589057
diversity: 0.5853333333333334
 


2023-03-26 13:54:52,056 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:54:57,562 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 13:55:15,886 - BERTopic - Reduced number of topics from 368 to 76


Results
npmi: 0.020442260449947937
diversity: 0.5266666666666666
 


2023-03-26 13:56:44,881 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:56:50,958 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 13:57:03,633 - BERTopic - Reduced number of topics from 157 to 76


Results
npmi: 0.013544076446195681
diversity: 0.54
 


2023-03-26 13:58:37,635 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 13:58:44,762 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 13:58:56,144 - BERTopic - Reduced number of topics from 110 to 76


Results
npmi: 0.008251151238510836
diversity: 0.5706666666666667
 


2023-03-26 14:00:26,642 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:00:34,745 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:00:45,860 - BERTopic - Reduced number of topics from 87 to 76


Results
npmi: 0.015765179898670257
diversity: 0.5773333333333334
 


2023-03-26 14:02:20,237 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:02:27,065 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:03:08,993 - BERTopic - Reduced number of topics from 1204 to 101


Results
npmi: 0.010113043837290598
diversity: 0.627
 


2023-03-26 14:04:44,315 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:04:50,041 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:05:07,534 - BERTopic - Reduced number of topics from 355 to 101


Results
npmi: 0.01040257849717709
diversity: 0.589
 


2023-03-26 14:06:47,280 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:06:53,655 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:07:05,833 - BERTopic - Reduced number of topics from 152 to 101


Results
npmi: 0.0008632300351767871
diversity: 0.62
 


2023-03-26 14:08:47,366 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:08:54,573 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:09:05,572 - BERTopic - Reduced number of topics from 101 to 101


Results
npmi: 0.0003723860006885099
diversity: 0.631
 


2023-03-26 14:10:41,526 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:10:49,355 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:10:54,734 - BERTopic - Reduced number of topics from 86 to 86


Results
npmi: 0.012256379332429674
diversity: 0.5752941176470588
 


2023-03-26 14:12:35,975 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:12:42,970 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Results
npmi: -0.14438394935488627
diversity: 0.7672025723472669
 


2023-03-26 14:16:52,746 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:16:58,285 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Results
npmi: -0.10654141779114833
diversity: 0.7642011834319526
 


2023-03-26 14:19:04,787 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:19:11,026 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Results
npmi: -0.03656874426111189
diversity: 0.7126666666666667
 


2023-03-26 14:21:05,314 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:21:12,870 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Results
npmi: -0.007035643061754532
diversity: 0.6449541284403669
 


2023-03-26 14:22:55,407 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:23:03,226 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Results
npmi: 0.014930926342601361
diversity: 0.5823529411764706
 


2023-03-26 14:24:38,886 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:24:45,535 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:25:28,657 - BERTopic - Reduced number of topics from 1216 to 11


Results
npmi: -0.004166216864031016
diversity: 0.43
 


2023-03-26 14:26:46,974 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:26:52,613 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:27:11,576 - BERTopic - Reduced number of topics from 356 to 11


Results
npmi: -0.00514489189709368
diversity: 0.42
 


2023-03-26 14:28:32,767 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:28:38,936 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:28:52,727 - BERTopic - Reduced number of topics from 159 to 11


Results
npmi: -0.014036282402737058
diversity: 0.36
 


2023-03-26 14:30:12,633 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:30:19,843 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:30:32,375 - BERTopic - Reduced number of topics from 108 to 11


Results
npmi: -0.007717937980832447
diversity: 0.37
 


2023-03-26 14:31:56,783 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:32:05,041 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:32:16,991 - BERTopic - Reduced number of topics from 88 to 11


Results
npmi: -0.010939368400249002
diversity: 0.37
 


2023-03-26 14:33:37,559 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:33:44,307 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:34:27,403 - BERTopic - Reduced number of topics from 1198 to 26


Results
npmi: 0.0133288756906549
diversity: 0.464
 


2023-03-26 14:35:51,081 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:35:56,835 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:36:15,990 - BERTopic - Reduced number of topics from 368 to 26


Results
npmi: -0.0006844367318751346
diversity: 0.428
 


2023-03-26 14:37:43,078 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:37:49,223 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:38:02,989 - BERTopic - Reduced number of topics from 165 to 26


Results
npmi: 0.008375620740381657
diversity: 0.428
 


2023-03-26 14:39:25,969 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:39:33,093 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:39:45,412 - BERTopic - Reduced number of topics from 108 to 26


Results
npmi: 0.011473481657473684
diversity: 0.408
 


2023-03-26 14:41:08,119 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:41:16,098 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:41:27,744 - BERTopic - Reduced number of topics from 84 to 26


Results
npmi: -0.0036091746998922424
diversity: 0.376
 


2023-03-26 14:42:56,127 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:43:02,763 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:43:45,953 - BERTopic - Reduced number of topics from 1214 to 51


Results
npmi: 0.026221591525954654
diversity: 0.534
 


2023-03-26 14:45:12,226 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:45:18,047 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:45:36,679 - BERTopic - Reduced number of topics from 360 to 51


Results
npmi: 0.01933360919552401
diversity: 0.454
 


2023-03-26 14:47:02,147 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:47:08,321 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:47:21,766 - BERTopic - Reduced number of topics from 162 to 51


Results
npmi: 0.018753618557306884
diversity: 0.48
 


2023-03-26 14:48:52,322 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:48:59,239 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:49:11,273 - BERTopic - Reduced number of topics from 114 to 51


Results
npmi: 0.023171453735661034
diversity: 0.494
 


2023-03-26 14:50:42,776 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:50:50,848 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:51:02,293 - BERTopic - Reduced number of topics from 87 to 51


Results
npmi: 0.020933138058714523
diversity: 0.464
 


2023-03-26 14:52:32,654 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:52:39,509 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:53:22,135 - BERTopic - Reduced number of topics from 1221 to 76


Results
npmi: 0.018810279966924148
diversity: 0.568
 


2023-03-26 14:54:56,749 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:55:02,329 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:55:20,295 - BERTopic - Reduced number of topics from 353 to 76


Results
npmi: 0.017728005634066606
diversity: 0.552
 


2023-03-26 14:56:50,877 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:56:57,123 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:57:10,002 - BERTopic - Reduced number of topics from 156 to 76


Results
npmi: 0.020785323484659387
diversity: 0.5546666666666666
 


2023-03-26 14:58:38,874 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 14:58:46,113 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 14:58:57,740 - BERTopic - Reduced number of topics from 108 to 76


Results
npmi: 0.009734618921739137
diversity: 0.5786666666666667
 


2023-03-26 15:00:31,812 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:00:39,750 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:00:50,690 - BERTopic - Reduced number of topics from 85 to 76


Results
npmi: 0.016701869385311558
diversity: 0.5653333333333334
 


2023-03-26 15:02:20,006 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:02:26,843 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:03:10,115 - BERTopic - Reduced number of topics from 1259 to 101


Results
npmi: -0.002913690357061117
diversity: 0.647
 


2023-03-26 15:04:51,695 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:04:57,125 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:05:15,210 - BERTopic - Reduced number of topics from 371 to 101


Results
npmi: 0.0021977867974898387
diversity: 0.606
 


2023-03-26 15:06:55,288 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:07:01,621 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:07:14,243 - BERTopic - Reduced number of topics from 161 to 101


Results
npmi: 0.007121128058875267
diversity: 0.627
 


2023-03-26 15:08:50,208 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:08:57,235 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:09:08,286 - BERTopic - Reduced number of topics from 104 to 101


Results
npmi: -0.0022316032195877867
diversity: 0.648
 


2023-03-26 15:10:48,772 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:10:56,857 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:11:02,245 - BERTopic - Reduced number of topics from 87 to 87


Results
npmi: 0.011676044058734127
diversity: 0.5930232558139535
 


2023-03-26 15:12:41,798 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:12:48,449 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Results
npmi: -0.14322951647009158
diversity: 0.7668879668049793
 


2023-03-26 15:17:06,894 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:17:12,576 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Results
npmi: -0.10356173572174336
diversity: 0.7595108695652174
 


2023-03-26 15:19:28,828 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:19:35,124 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Results
npmi: -0.03196943201921524
diversity: 0.7
 


2023-03-26 15:21:24,986 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:21:32,296 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Results
npmi: -0.006826113784433472
diversity: 0.6419047619047619
 


2023-03-26 15:23:14,401 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:23:22,433 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Results
npmi: 0.01076361169787531
diversity: 0.5892857142857143
 


2023-03-26 15:24:58,031 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:25:04,498 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:25:48,541 - BERTopic - Reduced number of topics from 1245 to 11


Results
npmi: -0.007773586993720549
diversity: 0.43
 


2023-03-26 15:27:08,200 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:27:13,865 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:27:33,075 - BERTopic - Reduced number of topics from 372 to 11


Results
npmi: -0.014706967355300743
diversity: 0.37
 


2023-03-26 15:28:54,023 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:29:00,305 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:29:14,214 - BERTopic - Reduced number of topics from 164 to 11


Results
npmi: -0.01578692458635863
diversity: 0.36
 


2023-03-26 15:30:34,556 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:30:41,767 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:30:54,252 - BERTopic - Reduced number of topics from 109 to 11


Results
npmi: -0.014683262493645276
diversity: 0.37
 


2023-03-26 15:32:14,626 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:32:22,478 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:32:34,371 - BERTopic - Reduced number of topics from 83 to 11


Results
npmi: -0.013584132519800219
diversity: 0.35
 


2023-03-26 15:33:54,819 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:34:01,407 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:34:44,353 - BERTopic - Reduced number of topics from 1202 to 26


Results
npmi: 0.013298155646542527
diversity: 0.436
 


2023-03-26 15:36:07,380 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:36:12,894 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:36:31,416 - BERTopic - Reduced number of topics from 339 to 26


Results
npmi: 0.008334856565224409
diversity: 0.396
 


2023-03-26 15:37:53,719 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:37:59,781 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:38:13,626 - BERTopic - Reduced number of topics from 167 to 26


Results
npmi: 0.003445588515009513
diversity: 0.364
 


2023-03-26 15:39:37,222 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:39:44,240 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:39:56,499 - BERTopic - Reduced number of topics from 107 to 26


Results
npmi: 0.009628523263208845
diversity: 0.4
 


2023-03-26 15:41:19,224 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:41:27,159 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:41:38,711 - BERTopic - Reduced number of topics from 83 to 26


Results
npmi: 0.0022253443483912254
diversity: 0.364
 


2023-03-26 15:43:00,980 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:43:07,723 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:43:51,417 - BERTopic - Reduced number of topics from 1217 to 51


Results
npmi: 0.02201266078329074
diversity: 0.528
 


2023-03-26 15:45:18,642 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:45:24,389 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:45:43,321 - BERTopic - Reduced number of topics from 364 to 51


Results
npmi: 0.02486908216462185
diversity: 0.498
 


2023-03-26 15:47:10,047 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:47:16,497 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:47:30,079 - BERTopic - Reduced number of topics from 168 to 51


Results
npmi: 0.024535662048674314
diversity: 0.472
 


2023-03-26 15:48:56,532 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:49:03,869 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:49:15,790 - BERTopic - Reduced number of topics from 105 to 51


Results
npmi: 0.01771468883047794
diversity: 0.474
 


2023-03-26 15:50:42,832 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:50:51,645 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:51:02,935 - BERTopic - Reduced number of topics from 85 to 51


Results
npmi: 0.02161410581087038
diversity: 0.484
 


2023-03-26 15:52:31,438 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:52:37,978 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:53:20,988 - BERTopic - Reduced number of topics from 1205 to 76


Results
npmi: 0.01882927170574085
diversity: 0.6
 


2023-03-26 15:54:50,481 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:54:56,138 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:55:14,636 - BERTopic - Reduced number of topics from 367 to 76


Results
npmi: 0.020441106342870302
diversity: 0.5293333333333333
 


2023-03-26 15:56:44,898 - BERTopic - Reduced dimensionality with UMAP
2023-03-26 15:56:51,662 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2023-03-26 15:57:04,948 - BERTopic - Reduced number of topics from 168 to 76


KeyboardInterrupt: 
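
The `npmi` values in the Results blocks above are normalized pointwise mutual information scores. They range from -1 (the top words of a topic never co-occur) through 0 (the words are independent) to 1 (they always co-occur). The following is a minimal sketch of that idea using document-level co-occurrence on a toy corpus. It is an illustration only: the function names `npmi_pair` and `topic_npmi` are hypothetical, and the evaluation package is assumed to compute coherence through OCTIS rather than through this code, so the numbers will not match the logs.

```python
import math
from itertools import combinations


def npmi_pair(docs, w1, w2):
    """NPMI of two words, estimated from document-level co-occurrence."""
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0:
        return -1.0  # the words never co-occur: lower bound by convention
    if p12 == 1:
        return 1.0   # both words occur in every document: avoid dividing by log(1) = 0
    return math.log(p12 / (p1 * p2)) / -math.log(p12)


def topic_npmi(docs, topic_words):
    """Average NPMI over all word pairs of a single topic."""
    pairs = list(combinations(topic_words, 2))
    return sum(npmi_pair(docs, a, b) for a, b in pairs) / len(pairs)


# Toy corpus: each document is represented as a set of tokens.
docs = [{"cat", "dog"}, {"cat", "dog"}, {"car"}, {"car"}]
print(topic_npmi(docs, ["cat", "dog"]))  # words that always appear together
print(topic_npmi(docs, ["cat", "car"]))  # words that never appear together
```

In this toy corpus the first topic scores at the upper end of the range and the second at the lower end, which is why slightly negative values in the logs indicate topic words that co-occur less often than chance would predict.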

## **Top2Vec**
Aside from its Doc2Vec backend, we also want to evaluate Top2Vec with the `"all-mpnet-base-v2"` SBERT model, since that is the model used in BERTopic. To do so, we make a small change to the core code of Top2Vec: every instance of `"distiluse-base-multilingual-cased"` is replaced with `"all-mpnet-base-v2"`:

In [2]:
import logging
import numpy as np
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import strip_tags
import umap
import hdbscan
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from joblib import dump, load
from sklearn.cluster import dbscan
import tempfile
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from scipy.special import softmax
from top2vec import Top2Vec

try:
    import hnswlib

    _HAVE_HNSWLIB = True
except ImportError:
    _HAVE_HNSWLIB = False

try:
    import tensorflow as tf
    import tensorflow_hub as hub
    import tensorflow_text

    _HAVE_TENSORFLOW = True
except ImportError:
    _HAVE_TENSORFLOW = False

try:
    from sentence_transformers import SentenceTransformer

    _HAVE_TORCH = True
except ImportError:
    _HAVE_TORCH = False

logger = logging.getLogger('top2vec')
logger.setLevel(logging.WARNING)
sh = logging.StreamHandler()
sh.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
logger.addHandler(sh)


def default_tokenizer(doc):
    """Tokenize documents for training and remove too long/short words"""
    return simple_preprocess(strip_tags(doc), deacc=True)


class Top2VecNew(Top2Vec):
    """
    Top2Vec
    Creates jointly embedded topic, document and word vectors.
    Parameters
    ----------
    embedding_model: string
        This will determine which model is used to generate the document and
        word embeddings. The valid string options are:
            * doc2vec
            * universal-sentence-encoder
            * universal-sentence-encoder-multilingual
            * all-mpnet-base-v2 (replaces distiluse-base-multilingual-cased here)
        For large data sets and data sets with very unique vocabulary doc2vec
        could produce better results. This will train a doc2vec model from
        scratch. This method is language agnostic. However multiple languages
        will not be aligned.
        Using the universal sentence encoder options will be much faster since
        those are pre-trained and efficient models. The universal sentence
        encoder options are suggested for smaller data sets. They are also
        good options for large data sets that are in English or in languages
        covered by the multilingual model. It is also suggested for data sets
        that are multilingual.
        For more information on universal-sentence-encoder visit:
        https://tfhub.dev/google/universal-sentence-encoder/4
        For more information on universal-sentence-encoder-multilingual visit:
        https://tfhub.dev/google/universal-sentence-encoder-multilingual/3
        The distiluse-base-multilingual-cased pre-trained sentence transformer
        is suggested for multilingual datasets and languages that are not
        covered by the multilingual universal sentence encoder. The
        transformer is significantly slower than the universal sentence
        encoder options.
        For more information on distiluse-base-multilingual-cased visit:
        https://www.sbert.net/docs/pretrained_models.html
    embedding_model_path: string (Optional)
        Pre-trained embedding models will be downloaded automatically by
        default. However they can also be uploaded from a file that is in the
        location of embedding_model_path.
        Warning: the model at embedding_model_path must match the
        embedding_model parameter type.
    documents: List of str
        Input corpus, should be a list of strings.
    min_count: int (Optional, default 50)
        Ignores all words with total frequency lower than this. For smaller
        corpora a smaller min_count will be necessary.
    speed: string (Optional, default 'learn')
        This parameter is only used when using doc2vec as embedding_model.
        It will determine how fast the model takes to train. The
        fast-learn option is the fastest and will generate the lowest quality
        vectors. The learn option will learn better quality vectors but take
        a longer time to train. The deep-learn option will learn the best
        quality vectors but will take significant time to train. The valid
        string speed options are:
        
            * fast-learn
            * learn
            * deep-learn
    use_corpus_file: bool (Optional, default False)
        This parameter is only used when using doc2vec as embedding_model.
        Setting use_corpus_file to True can sometimes provide speedup for
        large datasets when multiple worker threads are available. Documents
        are still passed to the model as a list of str, the model will create
        a temporary corpus file for training.
    document_ids: List of str, int (Optional)
        A unique value per document that will be used for referring to
        documents in search results. If ids are not given to the model, the
        index of each document in the original corpus will become the id.
    keep_documents: bool (Optional, default True)
        If set to False documents will only be used for training and not saved
        as part of the model. This will reduce model size. When using search
        functions only document ids will be returned, not the actual
        documents.
    workers: int (Optional)
        The amount of worker threads to be used in training the model. Larger
        amount will lead to faster training.
    
    tokenizer: callable (Optional, default None)
        Override the default tokenization method. If None then
        gensim.utils.simple_preprocess will be used.
    use_embedding_model_tokenizer: bool (Optional, default False)
        If using an embedding model other than doc2vec, use the model's
        tokenizer for document embedding. If set to True the tokenizer, either
        default or passed callable will be used to tokenize the text to
        extract the vocabulary for word embedding.
    umap_args: dict (Optional, default None)
        Pass custom arguments to UMAP.
    hdbscan_args: dict (Optional, default None)
        Pass custom arguments to HDBSCAN.
    
    verbose: bool (Optional, default True)
        Whether to print status data during training.
    """

    def __init__(self,
                 documents,
                 min_count=50,
                 embedding_model='doc2vec',
                 embedding_model_path=None,
                 speed='learn',
                 use_corpus_file=False,
                 document_ids=None,
                 keep_documents=True,
                 workers=None,
                 tokenizer=None,
                 use_embedding_model_tokenizer=False,
                 umap_args=None,
                 hdbscan_args=None,
                 verbose=True
                 ):

        if verbose:
            logger.setLevel(logging.DEBUG)
            self.verbose = True
        else:
            logger.setLevel(logging.WARNING)
            self.verbose = False

        if tokenizer is None:
            tokenizer = default_tokenizer

        # validate documents
        if not (isinstance(documents, list) or isinstance(documents, np.ndarray)):
            raise ValueError("Documents need to be a list of strings")
        if not all((isinstance(doc, str) or isinstance(doc, np.str_)) for doc in documents):
            raise ValueError("Documents need to be a list of strings")
        if keep_documents:
            self.documents = np.array(documents, dtype="object")
        else:
            self.documents = None

        # validate document ids
        if document_ids is not None:
            if not (isinstance(document_ids, list) or isinstance(document_ids, np.ndarray)):
                raise ValueError("Documents ids need to be a list of str or int")

            if len(documents) != len(document_ids):
                raise ValueError("Document ids need to match number of documents")
            elif len(document_ids) != len(set(document_ids)):
                raise ValueError("Document ids need to be unique")

            if all((isinstance(doc_id, str) or isinstance(doc_id, np.str_)) for doc_id in document_ids):
                self.doc_id_type = np.str_
            elif all((isinstance(doc_id, int) or isinstance(doc_id, np.int_)) for doc_id in document_ids):
                self.doc_id_type = np.int_
            else:
                raise ValueError("Document ids need to be str or int")

            self.document_ids_provided = True
            self.document_ids = np.array(document_ids)
            self.doc_id2index = dict(zip(document_ids, list(range(0, len(document_ids)))))
        else:
            self.document_ids_provided = False
            self.document_ids = np.array(range(0, len(documents)))
            self.doc_id2index = dict(zip(self.document_ids, list(range(0, len(self.document_ids)))))
            self.doc_id_type = np.int_

        acceptable_embedding_models = ["universal-sentence-encoder-multilingual",
                                       "universal-sentence-encoder",
                                       "all-mpnet-base-v2"]

        self.embedding_model_path = embedding_model_path

        if embedding_model == 'doc2vec':

            # validate training inputs
            if speed == "fast-learn":
                hs = 0
                negative = 5
                epochs = 40
            elif speed == "learn":
                hs = 1
                negative = 0
                epochs = 40
            elif speed == "deep-learn":
                hs = 1
                negative = 0
                epochs = 400
            elif speed == "test-learn":
                hs = 0
                negative = 5
                epochs = 1
            else:
                raise ValueError("speed parameter needs to be one of: fast-learn, learn or deep-learn")

            if workers is not None and not isinstance(workers, int):
                raise ValueError("workers needs to be an int")

            doc2vec_args = {"vector_size": 300,
                            "min_count": min_count,
                            "window": 15,
                            "sample": 1e-5,
                            "negative": negative,
                            "hs": hs,
                            "epochs": epochs,
                            "dm": 0,
                            "dbow_words": 1}

            if workers is not None:
                doc2vec_args["workers"] = workers

            logger.info('Pre-processing documents for training')

            if use_corpus_file:
                processed = [' '.join(tokenizer(doc)) for doc in documents]
                lines = "\n".join(processed)
                temp = tempfile.NamedTemporaryFile(mode='w+t')
                temp.write(lines)
                temp.flush()  # make sure the corpus is on disk before Doc2Vec reads it
                doc2vec_args["corpus_file"] = temp.name

            else:
                train_corpus = [TaggedDocument(tokenizer(doc), [i]) for i, doc in enumerate(documents)]
                doc2vec_args["documents"] = train_corpus

            logger.info('Creating joint document/word embedding')
            self.embedding_model = 'doc2vec'
            self.model = Doc2Vec(**doc2vec_args)

            if use_corpus_file:
                temp.close()

        elif embedding_model in acceptable_embedding_models:

            self.embed = None
            self.embedding_model = embedding_model

            self._check_import_status()

            logger.info('Pre-processing documents for training')

            # preprocess documents
            tokenized_corpus = [tokenizer(doc) for doc in documents]

            def return_doc(doc):
                return doc

            # preprocess vocabulary
            vectorizer = CountVectorizer(tokenizer=return_doc, preprocessor=return_doc)
            doc_word_counts = vectorizer.fit_transform(tokenized_corpus)
            # get_feature_names() was removed in scikit-learn 1.2; this call needs scikit-learn >= 1.0
            words = vectorizer.get_feature_names_out()
            word_counts = np.array(np.sum(doc_word_counts, axis=0).tolist()[0])
            vocab_inds = np.where(word_counts > min_count)[0]

            if len(vocab_inds) == 0:
                raise ValueError(f"A min_count of {min_count} results in "
                                 f"all words being ignored, choose a lower value.")
            self.vocab = [words[ind] for ind in vocab_inds]

            self._check_model_status()

            logger.info('Creating joint document/word embedding')

            # embed words
            self.word_indexes = dict(zip(self.vocab, range(len(self.vocab))))
            self.word_vectors = self._l2_normalize(np.array(self.embed(self.vocab)))

            # embed documents
            if use_embedding_model_tokenizer:
                self.document_vectors = self._embed_documents(documents)
            else:
                train_corpus = [' '.join(tokens) for tokens in tokenized_corpus]
                self.document_vectors = self._embed_documents(train_corpus)

        else:
            raise ValueError(f"{embedding_model} is an invalid embedding model.")

        # create 5D embeddings of documents
        logger.info('Creating lower dimension embedding of documents')

        if umap_args is None:
            umap_args = {'n_neighbors': 15,
                         'n_components': 5,
                         'metric': 'cosine'}

        umap_model = umap.UMAP(**umap_args).fit(self._get_document_vectors(norm=False))

        # find dense areas of document vectors
        logger.info('Finding dense areas of documents')

        if hdbscan_args is None:
            hdbscan_args = {'min_cluster_size': 15,
                            'metric': 'euclidean',
                            'cluster_selection_method': 'eom'}

        cluster = hdbscan.HDBSCAN(**hdbscan_args).fit(umap_model.embedding_)

        # calculate topic vectors from dense areas of documents
        logger.info('Finding topics')

        # create topic vectors
        self._create_topic_vectors(cluster.labels_)

        # deduplicate topics
        self._deduplicate_topics()

        # find topic words and scores
        self.topic_words, self.topic_word_scores = self._find_topic_words_and_scores(topic_vectors=self.topic_vectors)

        # assign documents to topic
        self.doc_top, self.doc_dist = self._calculate_documents_topic(self.topic_vectors,
                                                                      self._get_document_vectors())

        # calculate topic sizes
        self.topic_sizes = self._calculate_topic_sizes(hierarchy=False)

        # re-order topics
        self._reorder_topics(hierarchy=False)

        # initialize variables for hierarchical topic reduction
        self.topic_vectors_reduced = None
        self.doc_top_reduced = None
        self.doc_dist_reduced = None
        self.topic_sizes_reduced = None
        self.topic_words_reduced = None
        self.topic_word_scores_reduced = None
        self.hierarchy = None

        # initialize document indexing variables
        self.document_index = None
        self.serialized_document_index = None
        self.documents_indexed = False
        self.index_id2doc_id = None
        self.doc_id2index_id = None

        # initialize word indexing variables
        self.word_index = None
        self.serialized_word_index = None
        self.words_indexed = False

    def _check_import_status(self):
        if self.embedding_model != 'all-mpnet-base-v2':
            if not _HAVE_TENSORFLOW:
                raise ImportError(f"{self.embedding_model} is not available.\n\n"
                                  "Try: pip install top2vec[sentence_encoders]\n\n"
                                  "Alternatively try: pip install tensorflow tensorflow_hub tensorflow_text")
        else:
            if not _HAVE_TORCH:
                raise ImportError(f"{self.embedding_model} is not available.\n\n"
                                  "Try: pip install top2vec[sentence_transformers]\n\n"
                                  "Alternatively try: pip install torch sentence_transformers")

    def _check_model_status(self):
        if self.embed is None:
            if self.verbose is False:
                logger.setLevel(logging.DEBUG)

            if self.embedding_model != "all-mpnet-base-v2":
                if self.embedding_model_path is None:
                    logger.info(f'Downloading {self.embedding_model} model')
                    if self.embedding_model == "universal-sentence-encoder-multilingual":
                        module = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3"
                    else:
                        module = "https://tfhub.dev/google/universal-sentence-encoder/4"
                else:
                    logger.info(f'Loading {self.embedding_model} model at {self.embedding_model_path}')
                    module = self.embedding_model_path
                self.embed = hub.load(module)

            else:
                if self.embedding_model_path is None:
                    logger.info(f'Downloading {self.embedding_model} model')
                    module = 'all-mpnet-base-v2'
                else:
                    logger.info(f'Loading {self.embedding_model} model at {self.embedding_model_path}')
                    module = self.embedding_model_path
                model = SentenceTransformer(module)
                self.embed = model.encode

        if self.verbose is False:
            logger.setLevel(logging.WARNING)

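We can then use this `Top2VecNew` class to run our experiments with the `"all-mpnet-base-v2"` model. In the cell below, the `embedding_model` and `custom_model` lines are commented out, so it runs Top2Vec with its default Doc2Vec backend; uncomment both to use `Top2VecNew` instead.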

In [2]:
for i in range(3):
    dataset, custom = "incels", True
    params = {"nr_topics": [(k+1)*10 for k in range(5)],  # 10, 20, ..., 50 topics
              # "embedding_model": "all-mpnet-base-v2",
              "hdbscan_args": {'min_cluster_size': 15,
                               'metric': 'euclidean',
                               'cluster_selection_method': 'eom'}}

    trainer = Trainer(dataset=dataset,
                      custom_dataset=custom,
                    #   custom_model=Top2VecNew,
                      model_name="Top2Vec",
                      params=params,
                      verbose=True)
    results = trainer.train(save=f"Top2Vec_incels_{i+1}")

2023-03-26 16:21:12,903 - top2vec - INFO - Pre-processing documents for training
2023-03-26 16:21:23,010 - top2vec - INFO - Creating joint document/word embedding
2023-03-26 16:25:40,573 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-26 16:26:31,750 - top2vec - INFO - Finding dense areas of documents
2023-03-26 16:26:37,855 - top2vec - INFO - Finding topics
2023-03-26 16:41:29,620 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.1600470102094905
diversity: 0.96
 


2023-03-26 16:41:40,293 - top2vec - INFO - Creating joint document/word embedding
2023-03-26 16:46:00,052 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-26 16:46:36,202 - top2vec - INFO - Finding dense areas of documents
2023-03-26 16:46:42,334 - top2vec - INFO - Finding topics
2023-03-26 17:01:18,016 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.23205979683152655
diversity: 0.89
 


2023-03-26 17:01:28,173 - top2vec - INFO - Creating joint document/word embedding
2023-03-26 17:05:41,836 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-26 17:06:12,553 - top2vec - INFO - Finding dense areas of documents
2023-03-26 17:06:18,664 - top2vec - INFO - Finding topics
2023-03-26 17:20:37,086 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.2064239659960237
diversity: 0.91
 


2023-03-26 17:20:47,105 - top2vec - INFO - Creating joint document/word embedding
2023-03-26 17:25:01,384 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-26 17:25:32,439 - top2vec - INFO - Finding dense areas of documents
2023-03-26 17:25:38,471 - top2vec - INFO - Finding topics
2023-03-26 17:39:49,876 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.26100336239409827
diversity: 0.82
 


2023-03-26 17:39:59,644 - top2vec - INFO - Creating joint document/word embedding
2023-03-26 17:44:14,629 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-26 17:44:50,732 - top2vec - INFO - Finding dense areas of documents
2023-03-26 17:44:56,856 - top2vec - INFO - Finding topics


Results
npmi: -0.25112036852653696
diversity: 0.84
 


2023-03-26 17:59:14,086 - top2vec - INFO - Pre-processing documents for training
2023-03-26 17:59:24,116 - top2vec - INFO - Creating joint document/word embedding
2023-03-26 18:03:38,705 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-26 18:04:14,901 - top2vec - INFO - Finding dense areas of documents
2023-03-26 18:04:21,312 - top2vec - INFO - Finding topics
2023-03-26 18:19:04,696 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.14817388226138214
diversity: 0.96
 


2023-03-26 18:19:14,570 - top2vec - INFO - Creating joint document/word embedding


KeyboardInterrupt: 
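
The Top2Vec runs above report high `diversity` (up to 0.96) alongside strongly negative `npmi`. Topic diversity is commonly defined as the share of unique words among the top words of all topics, so a score of 1.0 means no word is shared between topics. The sketch below assumes that definition; the helper name `topic_diversity` and the toy topics are hypothetical and are not taken from the evaluation package.

```python
def topic_diversity(topics, topk=10):
    """Fraction of unique words among the top-k words of every topic."""
    if not topics:
        raise ValueError("topics must contain at least one topic")
    # Flatten the top-k words of all topics into a single list.
    top_words = [word for topic in topics for word in topic[:topk]]
    return len(set(top_words)) / len(top_words)


# Two toy topics that share the word "game".
topics = [
    ["game", "team", "season"],
    ["game", "player", "coach"],
]
print(topic_diversity(topics, topk=3))  # 5 unique words out of 6
```

Read together with coherence, a high diversity score only says that topics do not repeat each other's words; it does not say that the words within a topic belong together.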

In [3]:
for i in range(3):
    dataset, custom = "incels", True
    params = {"nr_topics": [(k+1)*10 for k in range(5)],  # 10, 20, ..., 50 topics
              "embedding_model": "all-MiniLM-L6-v2",
              "hdbscan_args": {'min_cluster_size': 15,
                               'metric': 'euclidean',
                               'cluster_selection_method': 'eom'}}

    trainer = Trainer(dataset=dataset,
                      custom_dataset=custom,
                    #   custom_model=Top2VecNew,
                      model_name="Top2Vec",
                      params=params,
                      verbose=True)
    results = trainer.train(save=f"Top2Vec_incels_MiniLM_{i+1}")


2023-03-26 18:22:08,810 - top2vec - INFO - Pre-processing documents for training
2023-03-26 18:22:20,297 - top2vec - INFO - Downloading all-MiniLM-L6-v2 model
2023-03-26 18:22:20,925 - top2vec - INFO - Creating joint document/word embedding
2023-03-26 18:23:12,618 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-26 18:23:43,158 - top2vec - INFO - Finding dense areas of documents
2023-03-26 18:23:48,713 - top2vec - INFO - Finding topics
2023-03-26 18:29:26,002 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.2779866971094175
diversity: 0.89
 


2023-03-26 18:29:37,532 - top2vec - INFO - Downloading all-MiniLM-L6-v2 model
2023-03-26 18:29:37,787 - top2vec - INFO - Creating joint document/word embedding
2023-03-26 18:30:26,670 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-26 18:31:01,416 - top2vec - INFO - Finding dense areas of documents
2023-03-26 18:31:06,681 - top2vec - INFO - Finding topics
2023-03-26 18:36:30,768 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.21193039295405142
diversity: 0.86
 


2023-03-26 18:36:42,352 - top2vec - INFO - Downloading all-MiniLM-L6-v2 model
2023-03-26 18:36:42,609 - top2vec - INFO - Creating joint document/word embedding
2023-03-26 18:37:31,746 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-26 18:38:06,796 - top2vec - INFO - Finding dense areas of documents
2023-03-26 18:38:12,263 - top2vec - INFO - Finding topics
2023-03-26 18:43:33,542 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.2347784383154207
diversity: 0.7666666666666667
 


2023-03-26 18:43:45,025 - top2vec - INFO - Downloading all-MiniLM-L6-v2 model
2023-03-26 18:43:45,281 - top2vec - INFO - Creating joint document/word embedding
2023-03-26 18:44:34,392 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-26 18:45:04,693 - top2vec - INFO - Finding dense areas of documents
2023-03-26 18:45:10,074 - top2vec - INFO - Finding topics
2023-03-26 18:50:39,909 - top2vec - INFO - Pre-processing documents for training


Results
npmi: -0.22414189284090452
diversity: 0.745
 


2023-03-26 18:50:51,415 - top2vec - INFO - Downloading all-MiniLM-L6-v2 model
2023-03-26 18:50:51,685 - top2vec - INFO - Creating joint document/word embedding
2023-03-26 18:51:40,955 - top2vec - INFO - Creating lower dimension embedding of documents
2023-03-26 18:52:16,055 - top2vec - INFO - Finding dense areas of documents
2023-03-26 18:52:21,405 - top2vec - INFO - Finding topics


Results
npmi: -0.22570013110888387
diversity: 0.718
 


2023-03-26 18:57:51,785 - top2vec - INFO - Pre-processing documents for training
2023-03-26 18:58:03,812 - top2vec - INFO - Downloading all-MiniLM-L6-v2 model
2023-03-26 18:58:04,066 - top2vec - INFO - Creating joint document/word embedding


KeyboardInterrupt: 

# **DTM Evaluation**

Here, we evaluate BERTopic and LDAseq on a dynamic topic modeling task with two datasets: 
* Trump's tweets
* UN general debates

### **BERTopic**

As seen before, we can load our data and generate embeddings before passing it to our evaluator:

In [None]:
%%capture
from sentence_transformers import SentenceTransformer

# Prepare data
dataset, custom = "trump_dtm", True
data_loader = DataLoader(dataset)
_, timestamps = data_loader.load_docs()
data = data_loader.load_octis(custom)
data = [" ".join(words) for words in data.get_corpus()]

# Extract embeddings
model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode(data, show_progress_bar=True)

Then, we also need to make sure that the timestamps match the data that we are using:

In [None]:
# Match indices
with open(f"./{dataset}/indexes.txt") as f:
    indices = f.readlines()

# A set makes each membership test O(1) instead of scanning a list
indices = {int(index.strip()) for index in indices}
timestamps = [timestamp for index, timestamp in enumerate(timestamps) if index in indices]
len(data), len(timestamps)
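
The index matching can be illustrated on toy data. This is a minimal sketch, assuming (as the cell above suggests) that `indexes.txt` holds one retained document position per line. The names `raw_lines`, `all_timestamps`, `kept`, and `matched`, as well as the values, are made up for illustration.

```python
# Hypothetical stand-ins for the lines of indexes.txt and the full timestamp list
raw_lines = ["0\n", "2\n", "3\n"]
all_timestamps = ["2015-01-01", "2015-02-01", "2015-03-01", "2015-04-01", "2015-05-01"]

# Parse the retained positions into a set for fast membership tests
kept = {int(line.strip()) for line in raw_lines}

# Keep only the timestamps whose position survived preprocessing
matched = [ts for position, ts in enumerate(all_timestamps) if position in kept]
print(matched)
```

After filtering, the number of timestamps should equal the number of retained documents, which is what the final `len(data), len(timestamps)` line in the cell above checks.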

Finally, we run the Trainer as before, this time also passing the timestamps:

In [None]:
for i in range(3):
    params = {
        "nr_topics": [50],
        "min_topic_size": 15,
        "verbose": True,
    }

    trainer = Trainer(dataset=dataset,
                      model_name="BERTopic",
                      params=params,
                      bt_embeddings=embeddings,
                      custom_dataset=custom,
                      bt_timestamps=timestamps,
                      topk=5,
                      bt_nr_bins=10,
                      verbose=True)
    results = trainer.train(f"DynamicBERTopic_trump_{i}")
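
The `bt_nr_bins` argument reduces the timestamps to a fixed number of time points. How `Trainer` applies it is not shown in this notebook. The sketch below only illustrates the general idea of equal-width binning with `pandas.cut`, on made-up years, and should not be read as the package's implementation. The names `years` and `bin_ids` are hypothetical.

```python
import pandas as pd

# Hypothetical timestamps (years) for six documents
years = pd.Series([2009, 2010, 2012, 2015, 2018, 2021])

# Split the observed range into 3 equal-width bins and
# label each document with the integer id of its bin
bin_ids = pd.cut(years, bins=3, labels=False)
print(bin_ids.tolist())
```

With fewer bins, each time point aggregates more documents, which trades temporal resolution for more stable topic representations per bin.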

### **LDAseq**
To run LDAseq, we again prepare our data and match the indices of our timestamps:

In [None]:
import os
import pandas as pd
from tqdm import tqdm

# Prepare data
dataset, custom = "un_dtm", True
data_loader = DataLoader(dataset)
_, timestamps = data_loader.load_docs()
data = data_loader.load_octis(custom)
data = [" ".join(words) for words in data.get_corpus()]

# Match indices
os.listdir(f"{dataset}")
with open(f"{dataset}/indexes.txt") as f:
    indices = f.readlines()
    
indices = [int(index.split("\n")[0]) for index in indices]
indices_test = {index: True for index in indices}
timestamps = [timestamp for index, timestamp in tqdm(enumerate(timestamps)) if indices_test.get(index)]
len(data), len(timestamps)

119320it [03:25, 579.62it/s]
278837it [00:00, 1751620.37it/s]


(273743, 273743)

Then, we simply pass the timestamps and run the trainer for LDAseq:

In [None]:
params = {
    "num_topics": [50],
    "nr_bins": 9,
    "random_state": 42
}

trainer = Trainer(dataset=dataset,
                  model_name="LDAseq",
                  params=params,
                  custom_dataset=custom,
                  bt_timestamps=timestamps,
                  topk=5,
                  verbose=True)
results = trainer.train()

We remove some entries from the results because they are too large to save:

In [None]:
import json

# Drop the bulky entries before serializing the results
del results[0]["Params"]["corpus"]
del results[0]["Params"]["id2word"]
del results[0]["Params"]["time_slice"]

with open("LDAseq_trump.json", 'w') as f:
    json.dump(results, f)
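
The same clean-up can be written so that it tolerates missing keys. This is a minimal sketch on a toy structure: the layout (a list of dicts with a `"Params"` entry) mirrors the cell above, but the `toy_results` name and all values are hypothetical.

```python
import json

# Hypothetical stand-in for the trainer output: one dict per run
toy_results = [{"Params": {"num_topics": 50,
                           "corpus": ["large", "object"],
                           "id2word": {"0": "word"}},
                "Scores": {"npmi": 0.1}}]

# Drop bulky entries; pop with a default does not fail if a key is absent
for run in toy_results:
    for key in ("corpus", "id2word", "time_slice"):
        run["Params"].pop(key, None)

serialized = json.dumps(toy_results)
print(serialized)
```

Using `pop(key, None)` instead of `del` means the cell can be re-run on results that have already been cleaned without raising a `KeyError`.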

# **Wall time**
Here, we focus only on the wall time of each topic model, from instantiating the model to the end of training. To do so, we take the Trump dataset and draw random samples of increasing size (in the cell below, from 1,000 documents upward in steps of 2,000). We then train a model on each sample and track the wall time:

In [None]:
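import random
import time

import numpy as np
import pandas as pd
from tqdm import tqdm
from gensim import corpora
from gensim.models import LdaMulticore, nmf
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from top2vec import Top2Vec
from contextualized_topic_models.models.ctm import CombinedTM

# `tokenized_data`, `preprocess_ctm`, `default_tokenizer`, and the hardware descriptors
# (`cpu_name`, `gpu_name`, `gpu_cudnn`, `gpu_memory`) are assumed to be defined in other cells.

# Note: the "BERTopic_USE" branch calls `embedding_model` directly, so it needs the
# TensorFlow Hub model from the commented line below rather than a model name string.
embedding_model = "all-mpnet-base-v2"
# embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embedding_model_name = "all-mpnet-base-v2"
topic_model_name = "BERTopic_USE"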

results = pd.DataFrame(columns=["dataset", "nr_documents", "vocab_size", "time",
                                "cpu", "gpu", "gpu_cudnn", "gpu_memory", "embedding_model"])
for index, nr_documents in enumerate(tqdm(np.arange(1000, len(data), 2_000, dtype=int))):
    
    selected_data = random.sample(data, nr_documents)
    selected_tokenized_data = random.sample(tokenized_data, nr_documents)
    
    if topic_model_name == "CTM":
        qt, training_dataset = preprocess_ctm(selected_data, embedding_model_name)
    
    # Run model
    start = time.time()
    
    if topic_model_name == "LDA":
        id2word = corpora.Dictionary(selected_tokenized_data)
        id_corpus = [id2word.doc2bow(document) for document in selected_tokenized_data]
        lda = LdaMulticore(id_corpus, id2word=id2word, num_topics=100)
    
    elif topic_model_name == "NMF":
        id2word = corpora.Dictionary(selected_tokenized_data)
        id_corpus = [id2word.doc2bow(document) for document in selected_tokenized_data]
        nmf_model = nmf.Nmf(id_corpus, id2word=id2word, num_topics=100)

    elif topic_model_name == "BERTopic":
        topic_model = BERTopic(embedding_model=embedding_model)    
        topics, probs = topic_model.fit_transform(selected_data)
        
    elif topic_model_name == "BERTopic_Doc2Vec":
        train_corpus = [TaggedDocument(default_tokenizer(doc), [i]) for i, doc in enumerate(selected_data)]
        doc2vec_args = {"vector_size": 300,
                        "min_count": 50,
                        "window": 15,
                        "sample": 1e-5,
                        "negative": 0,
                        "hs": 1,
                        "epochs": 40,
                        "dm": 0,
                        "dbow_words": 1,
                        "documents": train_corpus,
                        "workers": -1}
        model = Doc2Vec(**doc2vec_args)
        embeddings = model.docvecs.vectors_docs
        topic_model = BERTopic()    
        topics, probs = topic_model.fit_transform(selected_data, embeddings)
        
    elif topic_model_name == "BERTopic_USE":
        embeddings = embedding_model(selected_data).cpu().numpy()
        topic_model = BERTopic(embedding_model=embedding_model)    
        topics, probs = topic_model.fit_transform(selected_data, embeddings)

    elif topic_model_name == "Top2Vec":
        model = Top2Vec(selected_data, hdbscan_args={"min_cluster_size": 15}, workers=-1)
        # model = Top2VecNew(selected_data, hdbscan_args={"min_cluster_size": 15}, embedding_model=embedding_model)
        
    elif topic_model_name == "CTM":
        ctm = CombinedTM(n_components=100, contextual_size=768, bow_size=len(qt.vocab))
        ctm.fit(training_dataset)
    
    end = time.time()

    # Calculate vocab size
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(selected_data)
    vocab_size = len(vectorizer.vocabulary_)  # number of unique terms; works across scikit-learn versions
    
    results.loc[len(results)] = [dataset, len(selected_data), vocab_size, end - start, cpu_name, gpu_name, 
                                 gpu_cudnn, gpu_memory, embedding_model_name]