# **Topic Model Evaluation**
Here, you will find the code needed to run the experiments of the paper:

*BERTopic: Neural topic modeling with a class-based TF-IDF procedure*.

The package itself can be found [here](https://github.com/MaartenGr/BERTopic) and the repository for evaluation [here]().

## **Installation**
First, we need to install a few packages in order to run our experiments. Most of the packages are installed through the `tm_evaluation` package of which [OCTIS](https://github.com/MIND-Lab/OCTIS) is an important component. 

You can install the evaluation package with `pip install .` from the root. To additionally install CTM, run `pip install .[ctm]`. To install BERTopic, run `pip install bertopic==v0.9.4` after installing the base package or use `pip install .[bertopic]`. Top2Vec should be installed with `pip install top2vec==v1.0.26` after installing the base package.

In [1]:
!pip install .

ERROR: Directory '.' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.


To run a faster version of LDAseq for dynamic topic modeling, we need to uninstall gensim and install a specific merge that allows for this speed-up. First, run `pip uninstall gensim -y`, then, run `pip install git+https://github.com/RaRe-Technologies/gensim.git@refs/pull/3172/merge`

**NOTE**: After installing the above packages, make sure to restart the runtime; otherwise, you are likely to run into issues.

# 1. **Data**
Some of the data can be accessed through OCTIS, such as the `20NewsGroup` and `BBC_News` datasets. Other datasets, however, must be downloaded and then preprocessed with OCTIS before they can be used in its pipeline.

The datasets that we are going to be preparing are: 
* Trump's tweets
* United Nations general debates between 2006 and 2015 

In [4]:
!pip install octis
from evaluation import Trainer, DataLoader

ModuleNotFoundError: No module named 'octis'

## Trump
The data can be found here: https://www.thetrumparchive.com/faq

Using our `DataLoader` we can prepare the documents and save them in an OCTIS-based format: 

In [None]:
%%time
dataloader = DataLoader(dataset="trump").prepare_docs(save="trump.txt").preprocess_octis(output_folder="trump")

created vocab
53637
words filtering done
CPU times: user 2min 44s, sys: 1.81 s, total: 2min 46s
Wall time: 2min 48s
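
For orientation, OCTIS documents its custom-dataset layout as a folder holding a tab-separated `corpus.tsv` (document, partition, and an optional label) and a `vocabulary.txt` with one word per line. The sketch below writes such a folder by hand for two toy documents. The helper `write_octis_folder` and the toy data are illustrative assumptions and not part of `DataLoader`, whose actual output may contain additional files.

```python
import csv
import tempfile
from pathlib import Path


def write_octis_folder(docs, output_folder, partition="train"):
    """Write tokenized docs in an OCTIS-style folder (hypothetical helper)."""
    folder = Path(output_folder)
    folder.mkdir(parents=True, exist_ok=True)

    # corpus.tsv: one document per row, followed by its partition
    with open(folder / "corpus.tsv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for tokens in docs:
            writer.writerow([" ".join(tokens), partition])

    # vocabulary.txt: one unique word per line
    vocab = sorted({token for tokens in docs for token in tokens})
    (folder / "vocabulary.txt").write_text("\n".join(vocab), encoding="utf-8")
    return vocab


toy_docs = [["make", "america", "great"], ["great", "rally", "tonight"]]
toy_folder = Path(tempfile.mkdtemp()) / "toy_dataset"
toy_vocab = write_octis_folder(toy_docs, toy_folder)
```

A folder written this way should be loadable as a custom dataset, although that step is not exercised here.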


Additionally, there is a DTM variant that creates 10 timesteps to be used in the dynamic topic modeling experiments:

In [None]:
%%time
dataloader = DataLoader(dataset="trump_dtm").prepare_docs(save="trump_dtm.txt").preprocess_octis(output_folder="trump_dtm")
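
How `DataLoader` derives its 10 timesteps is not shown in this notebook. As a rough illustration only, the sketch below assumes equal-width bins over the observed time range; the function `bin_timestamps` and the toy dates are hypothetical.

```python
from datetime import datetime


def bin_timestamps(timestamps, nr_bins=10):
    """Map each timestamp to an equal-width bin index in [0, nr_bins - 1]."""
    start, end = min(timestamps), max(timestamps)
    span = (end - start).total_seconds()
    bins = []
    for ts in timestamps:
        # Relative position in [0, 1]; a zero span puts everything in bin 0
        position = (ts - start).total_seconds() / span if span else 0.0
        # Clamp so the latest timestamp falls in the last bin
        bins.append(min(int(position * nr_bins), nr_bins - 1))
    return bins


toy_dates = [datetime(2009 + year, 1, 1) for year in range(11)]  # 2009..2019
toy_bins = bin_timestamps(toy_dates, nr_bins=10)
```

Equal-width binning keeps the bins comparable in duration, at the cost of uneven document counts per bin; quantile-based bins would make the opposite trade-off.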

## United Nations

This dataset contains the transcriptions of the United Nations (UN) general debates between 2006 and 2015. The data can be found here: https://runestone.academy/runestone/books/published/httlads/_static/un-general-debates.csv

In [None]:
%%time
dataloader = DataLoader(dataset="un_dtm").prepare_docs(save="un_dtm.txt").preprocess_octis(output_folder="un_dtm")

created vocab
69447
words filtering done
CPU times: user 22min, sys: 21.5 s, total: 22min 21s
Wall time: 22min 22s


# 2. **Evaluation**
After preparing our data, we can start evaluating the topic models as used in the experiments. OCTIS already has a number of models prepared that we can use directly as shown below. 

First, we specify the dataset and whether it is a custom dataset that is not bundled with OCTIS. To run our custom Trump dataset, we run `dataset, custom = "trump", True`. In contrast, to use the prepackaged 20NewsGroup dataset, we run `dataset, custom = "20NewsGroup", False` instead.

The OCTIS datasets can be found [here](https://github.com/MIND-Lab/OCTIS#available-datasets). 

Second, we define a number of parameters to be used for the model. It uses the following format: 

`params = {"num_topics": [(i+1)*10 for i in range(5)], "random_state": random_state}`

where we define the numbers of topics to loop over when calculating the evaluation metrics, as well as any other parameters used by the model.
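
The list comprehension in this format evaluates to `[10, 20, 30, 40, 50]`. How the `Trainer` iterates over it internally is not shown here; the sketch below is one plausible reading, in which the list-valued entry is expanded into one configuration per topic count. The helper `expand_params` and the `toy_` names are hypothetical.

```python
def expand_params(params, grid_key="num_topics"):
    """Yield one parameter dict per value of the list-valued grid_key."""
    fixed = {key: value for key, value in params.items() if key != grid_key}
    for value in params[grid_key]:
        yield {grid_key: value, **fixed}


toy_random_state = 42
toy_params = {"num_topics": [(i + 1) * 10 for i in range(5)],
              "random_state": toy_random_state}
toy_configs = list(expand_params(toy_params))
```

Each element of `toy_configs` then describes a single model run, for example 10 topics with the shared random state.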

#### **Parameters**
The parameters for LDA and NMF:


```python
params = {"num_topics": [(i+1)*10 for i in range(5)], "random_state": random_state}
```

The parameters for Top2Vec:

```python
params = {"nr_topics": [(i+1)*10 for i in range(5)],
          "hdbscan_args": {'min_cluster_size': 15,
                            'metric': 'euclidean',
                            'cluster_selection_method': 'eom'}}
```
Note that the `min_cluster_size` is 15 for all datasets except BBC_News.

The parameters for CTM:

```python
params = {
    "n_components": [(i+1)*10 for i in range(5)],
    "contextual_size":768
}
```

The parameters for BERTopic:

```python
params = {
    "nr_topics": [(i+1)*10 for i in range(5)],
    "min_topic_size": 15,
    "verbose": True
}
```

Note that the `min_topic_size` is 15 for all datasets except BBC_News. Note also that we do not set an `embedding_model` here. We do this on purpose, as we can generate the embeddings beforehand and pass those to BERTopic.

## **OCTIS**
Here, we can run the experiments for NMF and LDA. 

#### NMF

In [None]:
for i, random_state in enumerate([0, 21, 42]):
    dataset, custom = "trump", True
    params = {"num_topics": [(i+1)*10 for i in range(5)], "random_state": random_state}

    trainer = Trainer(dataset=dataset,
                      model_name="NMF",
                      params=params,
                      custom_dataset=custom,
                      verbose=True)
    results = trainer.train(save=f"NMF_trump_{i+1}")

#### LDA

In [None]:
for i, random_state in enumerate([0, 21, 42]):
    dataset, custom = "trump", True
    params = {"num_topics": [(i+1)*10 for i in range(5)], "random_state": random_state}

    trainer = Trainer(dataset=dataset,
                      model_name="LDA",
                      params=params,
                      custom_dataset=custom,
                      verbose=True)
    results = trainer.train(save=f"LDA_trump_{i+1}")

## **CTM**
Here, we use the CombinedTM of the Contextualized Topic Models: https://github.com/MilaNLProc/contextualized-topic-models



In [None]:
for i in range(3):
    dataset, custom = "trump", True
    params = {
        "n_components": [(i+1)*10 for i in range(5)],
        "contextual_size":768
    }

    trainer = Trainer(dataset=dataset,
                      model_name="CTM_CUSTOM",
                      params=params,
                      custom_dataset=custom,
                      verbose=True)
    results = trainer.train(save=f"CTM_trump_{i+1}")

## **BERTopic**

To speed up BERTopic, we can generate the embeddings before passing them to the `Trainer`. This way, the same embeddings do not have to be generated 5 times, which speeds up evaluation quite a bit.

In [None]:
%%capture
from sentence_transformers import SentenceTransformer

# Prepare data
dataset, custom = "trump", True
data_loader = DataLoader(dataset)
_, timestamps = data_loader.load_docs()
data = data_loader.load_octis(custom)
data = [" ".join(words) for words in data.get_corpus()]

# Extract embeddings
model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode(data, show_progress_bar=True)

As shown above, we load the `data` with the data loader and join the tokens of each document to create our training data. Then, we pass it to the sentence transformer model of our choice to generate the embeddings.
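
The saving from precomputing can be illustrated with stand-ins for the expensive parts. In the sketch below, `fake_encode` and `fit_topic_model` are hypothetical placeholders (not the SentenceTransformer or BERTopic APIs); the point is only that the encoder runs once while five topic counts are evaluated.

```python
calls = {"encode": 0}


def fake_encode(docs):
    """Stand-in for an expensive sentence encoder (hypothetical)."""
    calls["encode"] += 1
    return [[float(len(doc))] for doc in docs]


def fit_topic_model(docs, nr_topics, embeddings=None):
    """Stand-in for model fitting: embeds only if no embeddings are passed."""
    if embeddings is None:
        embeddings = fake_encode(docs)
    return {"nr_topics": nr_topics, "n_embeddings": len(embeddings)}


toy_docs = ["first tweet", "second tweet", "third tweet"]

# Encode once, then reuse the result for every topic count
toy_embeddings = fake_encode(toy_docs)
toy_runs = [fit_topic_model(toy_docs, k, embeddings=toy_embeddings)
            for k in (10, 20, 30, 40, 50)]
```

Without the `embeddings` argument, the same loop would call the encoder five times.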

Next, we pass these embeddings to the `bt_embeddings` parameter to speed up training: 

In [None]:
for i in range(3):
    params = {
        "embedding_model": "all-mpnet-base-v2",
        "nr_topics": [(i+1)*10 for i in range(5)],
        "min_topic_size": 15,
        "diversity": None,
        "verbose": True
    }

    trainer = Trainer(dataset=dataset,
                      model_name="BERTopic",
                      params=params,
                      bt_embeddings=embeddings,
                      custom_dataset=custom,
                      verbose=True)
    results = trainer.train(save=f"BERTopic_trump_{i+1}")

## **Top2Vec**
Aside from its Doc2Vec backend, we also want to explore its performance using the `"all-mpnet-base-v2"` SBERT model, as that was used in BERTopic. To do so, we make a very slight change to the core code of Top2Vec, namely replacing all instances of `"distiluse-base-multilingual-cased"` with `"all-mpnet-base-v2"`:

In [None]:
import logging
import numpy as np
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import strip_tags
import umap
import hdbscan
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from joblib import dump, load
from sklearn.cluster import dbscan
import tempfile
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from scipy.special import softmax
from top2vec import Top2Vec

try:
    import hnswlib

    _HAVE_HNSWLIB = True
except ImportError:
    _HAVE_HNSWLIB = False

try:
    import tensorflow as tf
    import tensorflow_hub as hub
    import tensorflow_text

    _HAVE_TENSORFLOW = True
except ImportError:
    _HAVE_TENSORFLOW = False

try:
    from sentence_transformers import SentenceTransformer

    _HAVE_TORCH = True
except ImportError:
    _HAVE_TORCH = False

logger = logging.getLogger('top2vec')
logger.setLevel(logging.WARNING)
sh = logging.StreamHandler()
sh.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
logger.addHandler(sh)


def default_tokenizer(doc):
    """Tokenize documents for training and remove too long/short words"""
    return simple_preprocess(strip_tags(doc), deacc=True)


class Top2VecNew(Top2Vec):
    """
    Top2Vec
    Creates jointly embedded topic, document and word vectors.
    Parameters
    ----------
    embedding_model: string
        This will determine which model is used to generate the document and
        word embeddings. The valid string options are:
            * doc2vec
            * universal-sentence-encoder
            * universal-sentence-encoder-multilingual
            * distiluse-base-multilingual-cased
        For large data sets and data sets with very unique vocabulary doc2vec
        could produce better results. This will train a doc2vec model from
        scratch. This method is language agnostic. However multiple languages
        will not be aligned.
        Using the universal sentence encoder options will be much faster since
        those are pre-trained and efficient models. The universal sentence
        encoder options are suggested for smaller data sets. They are also
        good options for large data sets that are in English or in languages
        covered by the multilingual model. It is also suggested for data sets
        that are multilingual.
        For more information on universal-sentence-encoder visit:
        https://tfhub.dev/google/universal-sentence-encoder/4
        For more information on universal-sentence-encoder-multilingual visit:
        https://tfhub.dev/google/universal-sentence-encoder-multilingual/3
        The distiluse-base-multilingual-cased pre-trained sentence transformer
        is suggested for multilingual datasets and languages that are not
        covered by the multilingual universal sentence encoder. The
        transformer is significantly slower than the universal sentence
        encoder options.
        For more information on distiluse-base-multilingual-cased visit:
        https://www.sbert.net/docs/pretrained_models.html
    embedding_model_path: string (Optional)
        Pre-trained embedding models will be downloaded automatically by
        default. However they can also be uploaded from a file that is in the
        location of embedding_model_path.
        Warning: the model at embedding_model_path must match the
        embedding_model parameter type.
    documents: List of str
        Input corpus, should be a list of strings.
    min_count: int (Optional, default 50)
        Ignores all words with total frequency lower than this. For smaller
        corpora a smaller min_count will be necessary.
    speed: string (Optional, default 'learn')
        This parameter is only used when using doc2vec as embedding_model.
        It will determine how fast the model takes to train. The
        fast-learn option is the fastest and will generate the lowest quality
        vectors. The learn option will learn better quality vectors but take
        a longer time to train. The deep-learn option will learn the best
        quality vectors but will take significant time to train. The valid
        string speed options are:
        
            * fast-learn
            * learn
            * deep-learn
    use_corpus_file: bool (Optional, default False)
        This parameter is only used when using doc2vec as embedding_model.
        Setting use_corpus_file to True can sometimes provide speedup for
        large datasets when multiple worker threads are available. Documents
        are still passed to the model as a list of str, the model will create
        a temporary corpus file for training.
    document_ids: List of str, int (Optional)
        A unique value per document that will be used for referring to
        documents in search results. If ids are not given to the model, the
        index of each document in the original corpus will become the id.
    keep_documents: bool (Optional, default True)
        If set to False documents will only be used for training and not saved
        as part of the model. This will reduce model size. When using search
        functions only document ids will be returned, not the actual
        documents.
    workers: int (Optional)
        The amount of worker threads to be used in training the model. Larger
        amount will lead to faster training.
    
    tokenizer: callable (Optional, default None)
        Override the default tokenization method. If None then
        gensim.utils.simple_preprocess will be used.
    use_embedding_model_tokenizer: bool (Optional, default False)
        If using an embedding model other than doc2vec, use the model's
        tokenizer for document embedding. If set to True the tokenizer, either
        default or passed callable will be used to tokenize the text to
        extract the vocabulary for word embedding.
    umap_args: dict (Optional, default None)
        Pass custom arguments to UMAP.
    hdbscan_args: dict (Optional, default None)
        Pass custom arguments to HDBSCAN.
    
    verbose: bool (Optional, default True)
        Whether to print status data during training.
    """

    def __init__(self,
                 documents,
                 min_count=50,
                 embedding_model='doc2vec',
                 embedding_model_path=None,
                 speed='learn',
                 use_corpus_file=False,
                 document_ids=None,
                 keep_documents=True,
                 workers=None,
                 tokenizer=None,
                 use_embedding_model_tokenizer=False,
                 umap_args=None,
                 hdbscan_args=None,
                 verbose=True
                 ):

        if verbose:
            logger.setLevel(logging.DEBUG)
            self.verbose = True
        else:
            logger.setLevel(logging.WARNING)
            self.verbose = False

        if tokenizer is None:
            tokenizer = default_tokenizer

        # validate documents
        if not (isinstance(documents, list) or isinstance(documents, np.ndarray)):
            raise ValueError("Documents need to be a list of strings")
        if not all((isinstance(doc, str) or isinstance(doc, np.str_)) for doc in documents):
            raise ValueError("Documents need to be a list of strings")
        if keep_documents:
            self.documents = np.array(documents, dtype="object")
        else:
            self.documents = None

        # validate document ids
        if document_ids is not None:
            if not (isinstance(document_ids, list) or isinstance(document_ids, np.ndarray)):
                raise ValueError("Documents ids need to be a list of str or int")

            if len(documents) != len(document_ids):
                raise ValueError("Document ids need to match number of documents")
            elif len(document_ids) != len(set(document_ids)):
                raise ValueError("Document ids need to be unique")

            if all((isinstance(doc_id, str) or isinstance(doc_id, np.str_)) for doc_id in document_ids):
                self.doc_id_type = np.str_
            elif all((isinstance(doc_id, int) or isinstance(doc_id, np.int_)) for doc_id in document_ids):
                self.doc_id_type = np.int_
            else:
                raise ValueError("Document ids need to be str or int")

            self.document_ids_provided = True
            self.document_ids = np.array(document_ids)
            self.doc_id2index = dict(zip(document_ids, list(range(0, len(document_ids)))))
        else:
            self.document_ids_provided = False
            self.document_ids = np.array(range(0, len(documents)))
            self.doc_id2index = dict(zip(self.document_ids, list(range(0, len(self.document_ids)))))
            self.doc_id_type = np.int_

        acceptable_embedding_models = ["universal-sentence-encoder-multilingual",
                                       "universal-sentence-encoder",
                                       "all-mpnet-base-v2"]

        self.embedding_model_path = embedding_model_path

        if embedding_model == 'doc2vec':

            # validate training inputs
            if speed == "fast-learn":
                hs = 0
                negative = 5
                epochs = 40
            elif speed == "learn":
                hs = 1
                negative = 0
                epochs = 40
            elif speed == "deep-learn":
                hs = 1
                negative = 0
                epochs = 400
            elif speed == "test-learn":
                hs = 0
                negative = 5
                epochs = 1
            else:
                raise ValueError("speed parameter needs to be one of: fast-learn, learn or deep-learn")

            if workers is None:
                pass
            elif isinstance(workers, int):
                pass
            else:
                raise ValueError("workers needs to be an int")

            doc2vec_args = {"vector_size": 300,
                            "min_count": min_count,
                            "window": 15,
                            "sample": 1e-5,
                            "negative": negative,
                            "hs": hs,
                            "epochs": epochs,
                            "dm": 0,
                            "dbow_words": 1}

            if workers is not None:
                doc2vec_args["workers"] = workers

            logger.info('Pre-processing documents for training')

            if use_corpus_file:
                processed = [' '.join(tokenizer(doc)) for doc in documents]
                lines = "\n".join(processed)
                temp = tempfile.NamedTemporaryFile(mode='w+t')
                temp.write(lines)
                doc2vec_args["corpus_file"] = temp.name


            else:
                train_corpus = [TaggedDocument(tokenizer(doc), [i]) for i, doc in enumerate(documents)]
                doc2vec_args["documents"] = train_corpus

            logger.info('Creating joint document/word embedding')
            self.embedding_model = 'doc2vec'
            self.model = Doc2Vec(**doc2vec_args)

            if use_corpus_file:
                temp.close()

        elif embedding_model in acceptable_embedding_models:

            self.embed = None
            self.embedding_model = embedding_model

            self._check_import_status()

            logger.info('Pre-processing documents for training')

            # preprocess documents
            tokenized_corpus = [tokenizer(doc) for doc in documents]

            def return_doc(doc):
                return doc

            # preprocess vocabulary
            vectorizer = CountVectorizer(tokenizer=return_doc, preprocessor=return_doc)
            doc_word_counts = vectorizer.fit_transform(tokenized_corpus)
            words = vectorizer.get_feature_names()
            word_counts = np.array(np.sum(doc_word_counts, axis=0).tolist()[0])
            vocab_inds = np.where(word_counts > min_count)[0]

            if len(vocab_inds) == 0:
                raise ValueError(f"A min_count of {min_count} results in "
                                 f"all words being ignored, choose a lower value.")
            self.vocab = [words[ind] for ind in vocab_inds]

            self._check_model_status()

            logger.info('Creating joint document/word embedding')

            # embed words
            self.word_indexes = dict(zip(self.vocab, range(len(self.vocab))))
            self.word_vectors = self._l2_normalize(np.array(self.embed(self.vocab)))

            # embed documents
            if use_embedding_model_tokenizer:
                self.document_vectors = self._embed_documents(documents)
            else:
                train_corpus = [' '.join(tokens) for tokens in tokenized_corpus]
                self.document_vectors = self._embed_documents(train_corpus)

        else:
            raise ValueError(f"{embedding_model} is an invalid embedding model.")

        # create 5D embeddings of documents
        logger.info('Creating lower dimension embedding of documents')

        if umap_args is None:
            umap_args = {'n_neighbors': 15,
                         'n_components': 5,
                         'metric': 'cosine'}

        umap_model = umap.UMAP(**umap_args).fit(self._get_document_vectors(norm=False))

        # find dense areas of document vectors
        logger.info('Finding dense areas of documents')

        if hdbscan_args is None:
            hdbscan_args = {'min_cluster_size': 15,
                            'metric': 'euclidean',
                            'cluster_selection_method': 'eom'}

        cluster = hdbscan.HDBSCAN(**hdbscan_args).fit(umap_model.embedding_)

        # calculate topic vectors from dense areas of documents
        logger.info('Finding topics')

        # create topic vectors
        self._create_topic_vectors(cluster.labels_)

        # deduplicate topics
        self._deduplicate_topics()

        # find topic words and scores
        self.topic_words, self.topic_word_scores = self._find_topic_words_and_scores(topic_vectors=self.topic_vectors)

        # assign documents to topic
        self.doc_top, self.doc_dist = self._calculate_documents_topic(self.topic_vectors,
                                                                      self._get_document_vectors())

        # calculate topic sizes
        self.topic_sizes = self._calculate_topic_sizes(hierarchy=False)

        # re-order topics
        self._reorder_topics(hierarchy=False)

        # initialize variables for hierarchical topic reduction
        self.topic_vectors_reduced = None
        self.doc_top_reduced = None
        self.doc_dist_reduced = None
        self.topic_sizes_reduced = None
        self.topic_words_reduced = None
        self.topic_word_scores_reduced = None
        self.hierarchy = None

        # initialize document indexing variables
        self.document_index = None
        self.serialized_document_index = None
        self.documents_indexed = False
        self.index_id2doc_id = None
        self.doc_id2index_id = None

        # initialize word indexing variables
        self.word_index = None
        self.serialized_word_index = None
        self.words_indexed = False

    def _check_import_status(self):
        if self.embedding_model != 'all-mpnet-base-v2':
            if not _HAVE_TENSORFLOW:
                raise ImportError(f"{self.embedding_model} is not available.\n\n"
                                  "Try: pip install top2vec[sentence_encoders]\n\n"
                                  "Alternatively try: pip install tensorflow tensorflow_hub tensorflow_text")
        else:
            if not _HAVE_TORCH:
                raise ImportError(f"{self.embedding_model} is not available.\n\n"
                                  "Try: pip install top2vec[sentence_transformers]\n\n"
                                  "Alternatively try: pip install torch sentence_transformers")

    def _check_model_status(self):
        if self.embed is None:
            if self.verbose is False:
                logger.setLevel(logging.DEBUG)

            if self.embedding_model != "all-mpnet-base-v2":
                if self.embedding_model_path is None:
                    logger.info(f'Downloading {self.embedding_model} model')
                    if self.embedding_model == "universal-sentence-encoder-multilingual":
                        module = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3"
                    else:
                        module = "https://tfhub.dev/google/universal-sentence-encoder/4"
                else:
                    logger.info(f'Loading {self.embedding_model} model at {self.embedding_model_path}')
                    module = self.embedding_model_path
                self.embed = hub.load(module)

            else:
                if self.embedding_model_path is None:
                    logger.info(f'Downloading {self.embedding_model} model')
                    module = 'all-mpnet-base-v2'
                else:
                    logger.info(f'Loading {self.embedding_model} model at {self.embedding_model_path}')
                    module = self.embedding_model_path
                model = SentenceTransformer(module)
                self.embed = model.encode

        if self.verbose is False:
            logger.setLevel(logging.WARNING)

We can then use this `Top2VecNew` class to run our experiments including the `"all-mpnet-base-v2"` model. 

In [None]:
for i in range(3):
    dataset, custom = "trump", True
    params = {"nr_topics": [(i+1)*10 for i in range(5)],
              # "embedding_model": "all-mpnet-base-v2",
              "hdbscan_args": {'min_cluster_size': 15,
                               'metric': 'euclidean',
                               'cluster_selection_method': 'eom'}}

    trainer = Trainer(dataset=dataset,
                      custom_dataset=custom,
                      custom_model=Top2VecNew,
                      model_name="Top2Vec",
                      params=params,
                      verbose=True)
    results = trainer.train(save=f"Top2Vec_trump_{i+1}")

# **DTM Evaluation**

Here, we evaluate BERTopic and LDAseq on a dynamic topic modeling task with two datasets: 
* Trump's tweets
* UN general debates

### **BERTopic**

As seen before, we can load our data and generate embeddings before passing them to our evaluator:

In [None]:
%%capture
from sentence_transformers import SentenceTransformer

# Prepare data
dataset, custom = "trump_dtm", True
data_loader = DataLoader(dataset)
_, timestamps = data_loader.load_docs()
data = data_loader.load_octis(custom)
data = [" ".join(words) for words in data.get_corpus()]

# Extract embeddings
model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode(data, show_progress_bar=True)

Then, we also need to make sure that the timestamps match the data that we are using:

In [None]:
# Match indices
import os
os.listdir(f"./{dataset}")
with open(f"./{dataset}/indexes.txt") as f:
    indices = f.readlines()
    
# Use a set so each membership test is O(1) instead of scanning a list
indices = {int(index.strip()) for index in indices}
timestamps = [timestamp for index, timestamp in enumerate(timestamps) if index in indices]
len(data), len(timestamps)
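
Preprocessing can drop documents, so the timestamps have to be filtered down to the documents that remain. The sketch below shows the same matching on toy data, assuming (as the parsing above suggests) that `indexes.txt` holds one retained document index per line. The function `match_timestamps` and the toy values are illustrative.

```python
def match_timestamps(timestamps, kept_indices):
    """Keep timestamps whose original position survived preprocessing."""
    kept = set(kept_indices)  # set membership is O(1) per lookup
    return [ts for index, ts in enumerate(timestamps) if index in kept]


# Toy stand-in for the lines of an indexes.txt file
toy_index_lines = ["0\n", "2\n", "3\n"]
toy_kept = [int(line.strip()) for line in toy_index_lines]
toy_timestamps = ["2015-01-01", "2015-06-01", "2016-01-01", "2017-01-01"]
toy_matched = match_timestamps(toy_timestamps, toy_kept)
```

After matching, the number of timestamps should equal the number of retained documents, which is what the length check in the cell above verifies.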

Finally, we can simply run the Trainer as we did before but adding the timestamps:

In [None]:
for i in range(3):
    params = {
        "nr_topics": [50],
        "min_topic_size": 15,
        "verbose": True,
    }

    trainer = Trainer(dataset=dataset,
                      model_name="BERTopic",
                      params=params,
                      bt_embeddings=embeddings,
                      custom_dataset=custom,
                      bt_timestamps=timestamps,
                      topk=5,
                      bt_nr_bins=10,
                      verbose=True)
    results = trainer.train(f"DynamicBERTopic_trump_{i}")

### **LDAseq**
To run LDAseq, we again prepare our data and match the indices of our timestamps:

In [None]:
import os
import pandas as pd
from tqdm import tqdm

# Prepare data
dataset, custom = "un_dtm", True
data_loader = DataLoader(dataset)
_, timestamps = data_loader.load_docs()
data = data_loader.load_octis(custom)
data = [" ".join(words) for words in data.get_corpus()]

# Match indices
os.listdir(f"{dataset}")
with open(f"{dataset}/indexes.txt") as f:
    indices = f.readlines()
    
indices = [int(index.split("\n")[0]) for index in indices]
indices_test = {index: True for index in indices}
timestamps = [timestamp for index, timestamp in tqdm(enumerate(timestamps)) if indices_test.get(index)]
len(data), len(timestamps)

119320it [03:25, 579.62it/s]
278837it [00:00, 1751620.37it/s]


(273743, 273743)

Then, we simply pass the timestamps and run the trainer for LDAseq:

In [None]:
params = {
    "num_topics": [50],
    "nr_bins": 9,
    "random_state": 42
}

trainer = Trainer(dataset=dataset,
                  model_name="LDAseq",
                  params=params,
                  custom_dataset=custom,
                  bt_timestamps=timestamps,
                  topk=5,
                  verbose=True)
results = trainer.train()

We remove some entries from the results because they are too large to save:

In [None]:
results[0]["Params"].keys()
del results[0]["Params"]["corpus"]
del results[0]["Params"]["id2word"]
del results[0]["Params"]["time_slice"]

import json
with open("LDAseq_trump.json", 'w') as f:
    json.dump(results, f)
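
A non-mutating variant of this clean-up can be sketched as follows. It assumes only what the cell above shows, namely that `results` is a list of dictionaries with a `"Params"` entry; the helper `strip_large_params` and the toy data are hypothetical.

```python
import json


def strip_large_params(results, keys=("corpus", "id2word", "time_slice")):
    """Return a copy of results without bulky entries under "Params"."""
    slimmed = []
    for result in results:
        # Missing keys are simply skipped, so no KeyError is raised
        params = {k: v for k, v in result["Params"].items() if k not in keys}
        slimmed.append({**result, "Params": params})
    return slimmed


toy_results = [{"Params": {"num_topics": 50,
                           "corpus": list(range(1000)),
                           "time_slice": [5, 5]}}]
toy_slim = strip_large_params(toy_results)
toy_json = json.dumps(toy_slim)
```

Because the original list is left untouched, the full results stay available in memory for further inspection after saving.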

# **Wall time**
Here, we only focus on the wall time of each topic model, from instantiating the model to training. To do so, we take the Trump dataset and split it up into steps of 1000 documents. Then, we can train a model and track the wall time:

In [None]:
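import random
import time

import numpy as np
import pandas as pd
from bertopic import BERTopic
from contextualized_topic_models.models.ctm import CombinedTM
from gensim import corpora
from gensim.models import LdaMulticore, nmf
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.feature_extraction.text import CountVectorizer
from tqdm import tqdm

# NOTE: `data`, `tokenized_data`, `dataset`, `preprocess_ctm`, `cpu_name`, `gpu_name`,
# `gpu_cudnn` and `gpu_memory` are not defined in this cell and are assumed to exist already.
embedding_model = "all-mpnet-base-v2"
# embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embedding_model_name = "all-mpnet-base-v2"
topic_model_name = "BERTopic_USE"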

results = pd.DataFrame(columns=["dataset", "nr_documents", "vocab_size", "time",
                                "cpu", "gpu", "gpu_cudnn", "gpu_memory", "embedding_model"])
for index, nr_documents in enumerate(tqdm(np.arange(1000, len(data), 2_000, dtype=int))):
    
    selected_data = random.sample(data, nr_documents)
    selected_tokenized_data = random.sample(tokenized_data, nr_documents)
    
    if topic_model_name == "CTM":
        qt, training_dataset = preprocess_ctm(selected_data, embedding_model_name)
    
    # Run model
    start = time.time()
    
    if topic_model_name == "LDA":
        id2word = corpora.Dictionary(selected_tokenized_data)
        id_corpus = [id2word.doc2bow(document) for document in selected_tokenized_data]
        lda = LdaMulticore(id_corpus, id2word=id2word, num_topics=100)
    
    elif topic_model_name == "NMF":
        id2word = corpora.Dictionary(selected_tokenized_data)
        id_corpus = [id2word.doc2bow(document) for document in selected_tokenized_data]
        nmf_model = nmf.Nmf(id_corpus, id2word=id2word, num_topics=100)

    elif topic_model_name == "BERTopic":
        topic_model = BERTopic(embedding_model=embedding_model)    
        topics, probs = topic_model.fit_transform(selected_data)
        
    elif topic_model_name == "BERTopic_Doc2Vec":
        train_corpus = [TaggedDocument(default_tokenizer(doc), [i]) for i, doc in enumerate(selected_data)]
        doc2vec_args = {"vector_size": 300,
                        "min_count": 50,
                        "window": 15,
                        "sample": 1e-5,
                        "negative": 0,
                        "hs": 1,
                        "epochs": 40,
                        "dm": 0,
                        "dbow_words": 1,
                       "documents": train_corpus,
                       "workers": -1}
        model = Doc2Vec(**doc2vec_args)
        embeddings = model.docvecs.vectors_docs
        topic_model = BERTopic()    
        topics, probs = topic_model.fit_transform(selected_data, embeddings)
        
    elif topic_model_name == "BERTopic_USE":
        embeddings = embedding_model(selected_data).cpu().numpy()
        topic_model = BERTopic(embedding_model=embedding_model)    
        topics, probs = topic_model.fit_transform(selected_data, embeddings)

    elif topic_model_name == "Top2Vec":
        model = Top2Vec(selected_data, hdbscan_args={"min_cluster_size": 15}, workers=-1)
#         model = Top2VecNew(selected_data, hdbscan_args={"min_cluster_size": 15}, embedding_model=embedding_model)
        
    elif topic_model_name == "CTM":
        ctm = CombinedTM(n_components=100, contextual_size=768, bow_size=len(qt.vocab))
        ctm.fit(training_dataset)
    
    end = time.time()

    # Calculate vocab size
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(selected_data)
    vocab_size = len(vectorizer.get_feature_names())
    
    results.loc[len(results)] = [dataset, len(selected_data), vocab_size, end - start, cpu_name, gpu_name, 
                                 gpu_cudnn, gpu_memory, embedding_model_name]
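
The timing pattern used above can be reduced to a small, dependency-free sketch. Here `toy_fit` is a hypothetical stand-in for training a topic model, and `time.perf_counter` is used in place of `time.time` because it is a monotonic clock intended for measuring intervals; this is a design suggestion rather than what the cell above does.

```python
import random
import time


def time_model(fit, docs, sizes, seed=0):
    """Return (nr_documents, seconds) pairs for fitting on random subsets."""
    rng = random.Random(seed)
    timings = []
    for nr_documents in sizes:
        subset = rng.sample(docs, nr_documents)
        start = time.perf_counter()  # monotonic clock suited to intervals
        fit(subset)
        timings.append((nr_documents, time.perf_counter() - start))
    return timings


def toy_fit(docs):
    """Stand-in for training a topic model (hypothetical)."""
    return sorted(docs)


toy_corpus = [f"document {i}" for i in range(5000)]
toy_timings = time_model(toy_fit, toy_corpus, sizes=range(1000, 5000, 2000))
```

Sampling happens before the clock starts, so only the fitting step contributes to the measured wall time.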