# BERTopic Best Practices - Exper

BERTopic is modular by nature, so many variations of the topic modeling technique are possible. However, during the development and usage of the package, a set of best practices has emerged that generally leads to great results.

The following steps, parameters, and settings will generally improve the quality of the resulting topics. In other words, after going through the quick start and getting a feeling for the API, these steps should get you to the next level of performance.

**NOTE:**
    Although these are called *best practices*, it does not necessarily mean that they work across all use cases perfectly. The underlying modular nature of BERTopic is meant to take different use cases into account. After going through these practices it is advised to fine-tune wherever necessary.

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="20%">

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)

# **Installing BERTopic**

We start by installing BERTopic from PyPI:

In [6]:
import sys
print(sys.executable)

/home/ubuntu/miniconda/envs/bertopic/bin/python


### Install the library and import some essential packages

In [2]:
# %%capture
# !pip install git+https://github.com/scikit-learn-contrib/hdbscan.git
# !pip install bertopic
# !pip install datasets
# !pip install openai

In [3]:
# %%capture
# !pip install nltk
# import nltk
# nltk.download('words')

In [4]:
# # Testing cell

# from sklearn.datasets import fetch_20newsgroups
# from sentence_transformers import SentenceTransformer
# from bertopic import BERTopic

# # Prepare embeddings
# docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
# sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
# embeddings = sentence_model.encode(docs, show_progress_bar=True)

# # Train our topic model using our pre-trained sentence-transformers embeddings
# topic_model = BERTopic()
# topics, probs = topic_model.fit_transform(docs, embeddings)


In [5]:
# topic_model.get_topic_info()

## Restart the Notebook
After installing BERTopic, some packages that were already loaded were updated and in order to correctly use them, we should now restart the notebook.

From the Menu:

Runtime → Restart Runtime

# Data
For this example, we will use a dataset of TripAdvisor hotel review sentences with two columns: `hotel` (integer) and `sentence` (text).

In [2]:
import tqdm as notebook_tqdm
import modin.pandas as pd

#############################################
### For the purpose of timing comparisons ###
#############################################
import time
import modin
modin.config.Engine.put("Dask")

from distributed import Client
client = Client()

Perhaps you already have a cluster running?
Hosting the HTTP server on port 41559 instead


In [6]:
folder = "/home/ubuntu/WORK/selfexplain-semantic-topic-modeling/"
file_path = folder + "dataset/2_tripadvisor_len_review_227per_hotel.csv"

# df1_ = pd.read_csv(folder + "train.tsv", sep='\t')
# df2_ = pd.read_csv(folder + "dev.tsv", sep="\t")

# df = pd.concat([df1_, df2_], ignore_index = True)
df = pd.read_csv(file_path)

In [7]:
df.info()

<class 'modin.pandas.dataframe.DataFrame'>
RangeIndex: 48198 entries, 0 to 48197
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   hotel     48198 non-null  int64 
 1   sentence  48198 non-null  object
dtypes: int64(1), object(1)
memory usage: 753.2+ KB


In [9]:
# df = pd.DataFrame({
#     "sentence": df['review'].tolist(),
# })

In [10]:
df.sample(5)

| | hotel | sentence |
|---|---|---|
| 44994 | 10311875 | The complex two amazing and the restaurant foo... |
| 31090 | 3591310 | My whom normally were very picky on the hotel ... |
| 41767 | 7915166 | They have done small to stand out , like the c... |
| 43605 | 9761938 | I stayed at the Royal Lotus Hotel in August fo... |
| 12413 | 614747 | PLEASE the one thing we ' t do was sample the ... |


In [11]:
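# The dataset has no 'label' column (only 'hotel' and 'sentence'), so
# df['label'].value_counts() raises KeyError: 'label'.
# Count sentences per hotel instead:
df['hotel'].value_counts()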

In [12]:
# Drop rows with a missing sentence; dropna() returns a new frame, so assign it back
df = df.dropna(subset=['sentence'])
df.info()

<class 'modin.pandas.dataframe.DataFrame'>
RangeIndex: 48198 entries, 0 to 48197
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   hotel     48198 non-null  int64 
 1   sentence  48198 non-null  object
dtypes: int64(1), object(1)
memory usage: 753.2+ KB


In [14]:
df.sample(5)

| | hotel | sentence |
|---|---|---|
| 41521 | 7856034 | Like many , a clear glass panel separates the ... |
| 20019 | 1546077 | I choose this hotel through to go program . |
| 22161 | 1823746 | We found the food a bit limiting and expensive... |
| 27207 | 2555547 | The Golden Legend is perfectly situated for th... |
| 32199 | 3913966 | The only good thing about the breakfast is tha... |


In [15]:
docs = df['sentence'].astype(str)



In [16]:
docs = list(docs)
docs[5:8]

['Luxury Escape package disappointment .',
 'Lovely lake after change of .',
 'Our review is by the disappointment of not being access to the club lounge which was part of the Luxury package .']

In [17]:
len(docs)

48198

**🔥 Tip - Sentence Splitter 🔥**
***
 Whenever you have large documents, you typically want to split them up into either paragraphs or sentences. A nice way to do so is by using NLTK's sentence splitter which is nothing more than:

```python
from nltk.tokenize import sent_tokenize
# Split every document into a list of sentences
sentences = [sent_tokenize(doc) for doc in docs]
sentences = [sentence for doc in sentences for sentence in doc]
```

***

# **Best Practices**

With feedback from the community throughout the development of BERTopic and the core maintainer's personal experience, a number of best practices have been developed that generally lead to an improved topic model.

The goal of these best practices is to quickly guide the user to what is commonly used to speed up training, improve performance, explore alternatives, etc. Instead of having to search through many issues and discussions, an overview of best practices is given here.

To start off, it is important to have a general idea of the pipeline of BERTopic as it relates to many of these best practices.

BERTopic can be viewed as a sequence of steps to create its topic representations. There are five steps to this process:

![https://maartengr.github.io/BERTopic/algorithm/default.svg](https://maartengr.github.io/BERTopic/algorithm/default.svg)

The pipeline above implies significant modularity of BERTopic. Each step in this process was carefully selected such that they are all somewhat independent from one another.

As a result, we can adapt the pipeline to the current state-of-the-art with respect to each individual step:

 ![https://maartengr.github.io/BERTopic/algorithm/modularity.svg](https://maartengr.github.io/BERTopic/algorithm/modularity.svg)

## **Pre-calculate Embeddings**
After having created our data, namely `docs`, we can dive into the very first best practice, **pre-calculating embeddings**.

BERTopic works by converting documents into numerical values, called embeddings. This process can be very costly, especially if we want to iterate over parameters. Instead, we can calculate those embeddings once and feed them to BERTopic to skip calculating embeddings each time.

In [18]:
from sentence_transformers import SentenceTransformer

# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")



In [19]:
embeddings = embedding_model.encode(docs, show_progress_bar=True)

Batches:   0%|          | 0/1507 [00:00<?, ?it/s]
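Because the embeddings only depend on the documents and the embedding model, they can also be stored on disk and reused across sessions. The following is a minimal sketch of that idea, not part of BERTopic: the helper names `docs_fingerprint` and `load_or_compute_embeddings` and the cache file name are hypothetical, and `encode_fn` stands for any callable that maps a list of documents to embeddings (for example `embedding_model.encode`).

```python
import hashlib
import pickle
from pathlib import Path


def docs_fingerprint(docs):
    """Return a stable hash of the documents, used to detect a stale cache."""
    digest = hashlib.sha256()
    for doc in docs:
        digest.update(doc.encode("utf-8"))
        digest.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return digest.hexdigest()


def load_or_compute_embeddings(docs, encode_fn, cache_path):
    """Load cached embeddings if they match `docs`, otherwise compute and cache them."""
    cache_path = Path(cache_path)
    fingerprint = docs_fingerprint(docs)

    # Reuse the cache only when it was built from exactly these documents.
    if cache_path.exists():
        with cache_path.open("rb") as f:
            cached = pickle.load(f)
        if cached.get("fingerprint") == fingerprint:
            return cached["embeddings"]

    # Cache miss: compute the embeddings once and store them with the fingerprint.
    embeddings = encode_fn(docs)
    with cache_path.open("wb") as f:
        pickle.dump({"fingerprint": fingerprint, "embeddings": embeddings}, f)
    return embeddings
```

With the objects defined in this notebook, a call could look like `load_or_compute_embeddings(docs, embedding_model.encode, "embeddings.pkl")`. Only unpickle cache files that you created yourself, since loading a pickle can execute arbitrary code.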

## **Preventing Stochastic Behavior**
In BERTopic, we generally use a dimensionality reduction algorithm to reduce the size of the embeddings. This is done to prevent the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) to a certain degree.

As a default, this is done with [UMAP](https://github.com/lmcinnes/umap) which is an incredible algorithm for reducing dimensional space. However, by default, it shows stochastic behavior which creates different results each time you run it. To prevent that, we will need to set a `random_state` of the model before passing it to BERTopic.

As a result, we can now fully reproduce the results each time we run the model.

In [29]:
from umap import UMAP

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

## **Controlling Number of Topics**
There is a parameter to control the number of topics, namely `nr_topics`. This parameter, however, merges topics **after** they have been created. It is a parameter that supports creating a fixed number of topics.

However, it is advised to control the number of topics through the cluster model, which is HDBSCAN by default. HDBSCAN has a parameter, namely `min_cluster_size`, that indirectly controls the number of topics that will be created. (When no custom cluster model is passed, BERTopic's own `min_topic_size` parameter plays this role.)

A higher `min_cluster_size` will generate fewer topics and a lower `min_cluster_size` will generate more topics.

Here, we will go with `min_cluster_size=150`, which favors fewer, larger topics.

In [20]:
from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(min_cluster_size=150, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

## **Speed Up with cuML on the GPU**

In [21]:
# from cuml.cluster import HDBSCAN
# from cuml.manifold import UMAP

# # Create instances of GPU-accelerated UMAP and HDBSCAN
# umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
# hdbscan_model = HDBSCAN(min_cluster_size=150, metric='euclidean', cluster_selection_method='eom', min_samples=10, gen_min_span_tree=True, prediction_data=True)

## **Improving Default Representation**
The default representation of topics is calculated through [c-TF-IDF](https://maartengr.github.io/BERTopic/algorithm/algorithm.html#5-topic-representation). However, c-TF-IDF is powered by the [CountVectorizer](https://maartengr.github.io/BERTopic/getting_started/vectorizers/vectorizers.html) which converts text into tokens. Using the CountVectorizer, we can do a number of things:

* Remove stopwords
* Ignore infrequent words
* Increase the n-gram range

In other words, we can preprocess the topic representations **after** documents are assigned to topics. This will not influence the clustering process in any way.

Here, we will ignore English stopwords and infrequent words. Moreover, by increasing the n-gram range we will consider topic representations that are made up of one or two words.

In [22]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))
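To make the role of the vectorizer more concrete, here is a minimal sketch of a class-based TF-IDF score. It assumes the form `tf * log(1 + A / f)`, where `tf` is the normalized frequency of a word within a topic, `f` is its frequency across all topics, and `A` is the average number of words per topic. The function name `c_tf_idf`, the whitespace tokenization, and the toy documents are illustrative assumptions; this is not BERTopic's implementation.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score words per topic with a simplified class-based TF-IDF.

    `docs_per_topic` maps a topic id to the list of documents assigned to it.
    """
    # Treat all documents of a topic as one "class document".
    # Whitespace tokenization is a simplification of CountVectorizer.
    counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }

    # f: frequency of each word across all classes.
    total = Counter()
    for counter in counts.values():
        total.update(counter)

    # A: average number of words per class.
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for topic, counter in counts.items():
        n_words = sum(counter.values())
        scores[topic] = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in counter.items()
        }
    return scores


# Toy example: "bar" occurs in both topics, so it scores lower than "pool" or "staff".
toy_topics = {0: ["pool pool bar"], 1: ["staff staff bar"]}
toy_scores = c_tf_idf(toy_topics)
print(max(toy_scores[0], key=toy_scores[0].get))
```

Since the score is computed per topic after clustering, changing stop words, `min_df`, or the n-gram range only changes which terms are scored, not which documents belong to which topic.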

## **Additional Representations**
Previously, we have tuned the default representation but there are quite a number of [other topic representations](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html) in BERTopic that we can choose from. From [KeyBERTInspired](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#keybertinspired) and [PartOfSpeech](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#partofspeech), to [OpenAI's ChatGPT](https://maartengr.github.io/BERTopic/getting_started/representation/llm.html#chatgpt) and [open-source](https://maartengr.github.io/BERTopic/getting_started/representation/llm.html#langchain) alternatives, many representations are possible.

In BERTopic, you can model many different topic representations simultaneously to test them out and get different perspectives of topic descriptions. This is called [multi-aspect](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) topic modeling.

Here, we will demonstrate a number of interesting and useful representations in BERTopic:

* KeyBERTInspired
  * A method that derives inspiration from how KeyBERT works
* PartOfSpeech
  * Using spaCy's POS tagging to extract words
* MaximalMarginalRelevance
  * Diversify the topic words
* OpenAI
  * Use ChatGPT to label our topics


In [23]:
import spacy
import openai
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, OpenAI, PartOfSpeech




In [84]:
# KeyBERT
keybert_model = KeyBERTInspired()

# Part-of-Speech
pos_model = PartOfSpeech("en_core_web_sm")

# MMR
mmr_model = MaximalMarginalRelevance(diversity=0.3)

# GPT-3.5
# Read the API key from the environment instead of hardcoding it in the notebook
import os
openai.api_key = os.environ.get("OPENAI_API_KEY")
prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic: <topic label>
"""
openai_model = OpenAI(model="gpt-3.5-turbo", exponential_backoff=True, chat=True, prompt=prompt)

# All representation models
representation_model = {
    "KeyBERT": keybert_model,
#     "OpenAI": openai_model,  # Uncomment if you will use OpenAI
    "MMR": mmr_model,
    "POS": pos_model
}
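The `diversity` value passed to `MaximalMarginalRelevance` trades relevance against redundancy among the selected words. The sketch below illustrates the general maximal marginal relevance idea on hypothetical 2-D vectors. The function names and vectors are made up for illustration, and BERTopic's own implementation works on real word and topic embeddings and may differ in detail.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(topic_vec, word_vecs, top_n=3, diversity=0.3):
    """Select up to `top_n` words, balancing relevance to the topic against redundancy."""
    candidates = list(word_vecs)
    if not candidates:
        return []
    relevance = {w: cosine(word_vecs[w], topic_vec) for w in candidates}

    # Start with the word most similar to the topic.
    selected = [max(candidates, key=relevance.get)]
    candidates.remove(selected[0])

    while candidates and len(selected) < top_n:
        def mmr_score(word):
            # Penalize words that are close to something already selected.
            redundancy = max(cosine(word_vecs[word], word_vecs[s]) for s in selected)
            return (1 - diversity) * relevance[word] - diversity * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical vectors: "pools" is nearly identical to "pool", "bar" is less relevant but different.
toy_topic = [1.0, 0.0]
toy_words = {"pool": [1.0, 0.0], "pools": [0.99, 0.05], "bar": [0.7, 0.7]}
print(mmr(toy_topic, toy_words, top_n=2, diversity=0.0))
print(mmr(toy_topic, toy_words, top_n=2, diversity=0.8))
```

With `diversity=0.0` the selection is purely by relevance, so the near-duplicate is kept; with a high diversity the second pick shifts to the less similar word.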

## **Training**
Now that we have a set of best practices, we can use them in our training loop. Here, several different representations, keywords and labels for our topics will be created. If you want to iterate over the topic model it is advised to use the pre-calculated embeddings as that significantly speeds up training.

In [25]:
embeddings.shape

(48198, 384)

In [26]:
len(docs)

48198

In [88]:
representation_model

{'KeyBERT': KeyBERTInspired(),
 'MMR': MaximalMarginalRelevance(diversity=0.3),
 'POS': PartOfSpeech(model=<spacy.lang.en.English object at 0x7fcf04bf9570>,
              pos_patterns=[[{'POS': 'ADJ'}, {'POS': 'NOUN'}], [{'POS': 'NOUN'}],
                            [{'POS': 'ADJ'}]])}

In [89]:
from bertopic import BERTopic

topic_model = BERTopic(

  # Pipeline models
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,
  representation_model=representation_model,

  # Hyperparameters
  top_n_words=10,
  verbose=True
)



In [90]:
topics, probs = topic_model.fit_transform(docs, embeddings)

2023-07-22 20:42:19,989 - BERTopic - Reduced dimensionality
2023-07-22 20:42:23,063 - BERTopic - Clustered reduced embeddings


In [91]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,KeyBERT,MMR,POS,Representative_Docs
0,-1,19862,-1_hotel_room_great_good,"[hotel, room, great, good, breakfast, staff, s...","[hotel, good hotel, hotel good, hotel great, g...","[hotel, room, great, good, breakfast, staff, s...","[hotel, room, great, good, breakfast, staff, s...","[Hotel ., I would this hotel ., Excelente hote..."
1,0,1983,0_pool_swimming_swimming pool_pool area,"[pool, swimming, swimming pool, pool area, are...","[swimming pool, pool, pool pool, pool beautifu...","[pool, swimming, swimming pool, pool area, are...","[pool, swimming, area, bar, nice, gym, great, ...","[Excellent swimming pool ., pool ., , of Pool .]"
2,1,1730,1_stayed_nights_stayed nights_stay,"[stayed, nights, stayed nights, stay, stay sta...","[stayed nights, nights stayed, stayed night, s...","[stayed, nights, stayed nights, stay, stay sta...","[nights, stay, night, trip, days, pleasant, en...","[Stayed here for nights with my ., We stayed h..."
3,2,1660,2_airport_taxi_bay_sapa,"[airport, taxi, bay, sapa, tour, check, train,...","[bay tour, bay trip, airport, picked airport, ...","[airport, taxi, bay, sapa, tour, check, train,...","[airport, taxi, bay, tour, check, train, lugga...",[We booked a tour to Ha Long Bay at the tour d...
4,3,1240,3_clean_comfortable_spacious_clean comfortable,"[clean, comfortable, spacious, clean comfortab...","[extremely clean, clean clean, clean good, cle...","[clean, comfortable, spacious, clean comfortab...","[clean, comfortable, spacious, comfortable cle...","[are clean ., The are clean ., The are very cl..."
5,4,1219,4_location_town_walk_walking,"[location, town, walk, walking, distance, walk...","[walking distance, location walking, walk loca...","[location, town, walk, walking, distance, walk...","[location, town, walk, distance, old town, cit...","[all in walking distance ., There are many and..."
6,5,1186,5_staff_hotel staff_hotel_staff hotel,"[staff, hotel staff, hotel, staff hotel, frien...","[hotel staff, staff hotel, hotel friendly, hot...","[staff, hotel staff, hotel, staff hotel, frien...","[staff, hotel, friendly, service, helpful, gre...","[Very good Hotel Staff very friendly ., Hotel ..."
7,6,1009,6_food_restaurant_ate_dinner,"[food, restaurant, ate, dinner, food good, del...","[restaurant, restaurant food, food restaurant,...","[food, restaurant, ate, dinner, food good, del...","[food, restaurant, dinner, delicious, good, go...","[We in the restaurant ., Food ., The restauran..."
8,7,981,7_definitely_come_return_definitely stay,"[definitely, come, return, definitely stay, st...","[definitely stay, stay return, stay definitely...","[definitely, come, return, definitely stay, st...","[visit, time, love, minute, nights, beautiful,...","[, We would definitely stay here again ., We w..."
9,8,883,8_hoi_stay hoi_hotel hoi_shuttle,"[hoi, stay hoi, hotel hoi, shuttle, town, la, ...","[visiting hoi, visit hoi, hoi, hoi stay, hoi l...","[hoi, stay hoi, hotel hoi, shuttle, town, la, ...","[shuttle, town, stay, resort, hotel, bus, beac...","[A must experience in Hoi An ., Best of Hoi An..."


In [95]:
topic_model.get_topic_info().to_csv("get_topic_info.csv", index=False)

To get all representations for a single topic, we simply run the following:

In [92]:
topic_model.get_topic(1, full=True)

{'Main': [('stayed', 0.09515127978062499),
  ('nights', 0.07920593455508315),
  ('stayed nights', 0.06635965662848976),
  ('stay', 0.06267688225272228),
  ('stay stayed', 0.047088970208189564),
  ('night', 0.03511292740146744),
  ('trip', 0.030524743779363),
  ('days', 0.028295033660056187),
  ('nights stayed', 0.02284097596197115),
  ('spent', 0.022469608943984754)],
 'KeyBERT': [('stayed nights', 0.8053507),
  ('nights stayed', 0.77842695),
  ('stayed night', 0.746093),
  ('stay nights', 0.73240304),
  ('night stayed', 0.7186469),
  ('nights stay', 0.7136446),
  ('night stay', 0.66900367),
  ('stayed days', 0.6564697),
  ('place stayed', 0.63687956),
  ('stay night', 0.6365199)],
 'MMR': [('stayed', 0.09515127978062499),
  ('nights', 0.07920593455508315),
  ('stayed nights', 0.06635965662848976),
  ('stay', 0.06267688225272228),
  ('stay stayed', 0.047088970208189564),
  ('night', 0.03511292740146744),
  ('trip', 0.030524743779363),
  ('days', 0.028295033660056187),
  ('nights stayed

In [93]:
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"

saved_model_dir = "bertopic_20230723_01"
topic_model.save(saved_model_dir, serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

**NOTE**: The labels generated by OpenAI's **ChatGPT** are especially interesting to use throughout your model. Below, we will go into more detail on how to set them as a custom label.

**🔥 Tip - Parameters 🔥**
***
If you would like to return the topic-document probability matrix, then it is advised to use `calculate_probabilities=True`. Do note that this can significantly slow down training. To speed it up, use [cuML's HDBSCAN](https://maartengr.github.io/BERTopic/getting_started/clustering/clustering.html#cuml-hdbscan) instead. You could also approximate the topic-document probability matrix with `.approximate_distribution` which will be discussed later.
***

## **(Custom) Labels**
The default label of each topic is the topic number followed by its top 4 words, joined with underscores (for example, `0_pool_swimming_swimming pool_pool area`).
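As an illustration, the small sketch below reproduces the format of the `Name` column shown in the `.get_topic_info()` output of this notebook. The helper `default_topic_name` is hypothetical, and the choice of `n_words=4` is inferred from that output rather than taken from BERTopic's source.

```python
def default_topic_name(topic_id, words, n_words=4):
    """Join the topic number and its first `n_words` words with underscores."""
    return "_".join([str(topic_id)] + list(words[:n_words]))


# Words copied from the main representation of topic 1 in this notebook.
topic_1_words = ["stayed", "nights", "stayed nights", "stay", "stay stayed"]
print(default_topic_name(1, topic_1_words))
# -> 1_stayed_nights_stayed nights_stay
```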

This, of course, might not be the best label that you can think of for a certain topic. Instead, we can use `.set_topic_labels` to manually label all or certain topics.

We can also use `.set_topic_labels` to use one of the other topic representations that we had before, like `KeyBERTInspired` or even `OpenAI`.

In [70]:
from nltk.corpus import wordnet

def get_top_k_related_words(word, k):
    """
    Get the top K words related to the given word using WordNet.

    Parameters:
        word (str): The word to find related words for.
        k (int): The number of top related words to return.

    Returns:
        list: A list of the top K words related to the given word.
    """
    synsets = wordnet.synsets(word)
    related_words = set()

    for synset in synsets:
        for lemma in synset.lemmas():
            related_words.add(lemma.name())

    # Remove the original word from the set if it exists
    related_words.discard(word)

    # Convert the set to a list and get the first K elements
    top_k_related_words = list(related_words)[:k]

    return top_k_related_words

# Test the function
word = "service"
top_k = 20
top_k_related_words = get_top_k_related_words(word, top_k)
print("Top {} words related to '{}' are: {}".format(top_k, word, top_k_related_words))

" | ".join(top_k_related_words)


Top 20 words related to 'service' are: ['religious_service', 'serve', 'servicing', 'service_of_process', 'divine_service', 'table_service', 'help', 'military_service', 'serving', 'Service', 'Robert_William_Service', 'inspection_and_repair', 'overhaul', 'avail', 'armed_service']


'religious_service | serve | servicing | service_of_process | divine_service | table_service | help | military_service | serving | Service | Robert_William_Service | inspection_and_repair | overhaul | avail | armed_service'

In [40]:
# topic_model.topic_aspects_["KeyBERT"].items()[0]

In [96]:
# Label the topics yourself
# topic_model.set_topic_labels({
#     0: "value | valuate | evaluate | measure | economic_value | rate | prise | prize | time_value | assess",
#     1: "location | locating | positioning | localization | placement | position | localisation | emplacement", 
#     2: "cleanliness | clean | clean_house | dirty | clear | unclean",
#     3: "service religious_service | serve | servicing | service_of_process | divine_service | table_service | help | military_service | serving | Service | inspection_and_repair | overhaul | avail | armed_service"
# })


# ---------
# or use one of the other topic representations, like KeyBERTInspired
top_k = 5  # number of KeyBERT words to join into each label
keybert_topic_labels = {topic: " | ".join(list(zip(*values))[0][:top_k]) for topic, values in topic_model.topic_aspects_["KeyBERT"].items()}
# print(keybert_topic_labels)
topic_model.set_topic_labels(keybert_topic_labels)


# ----------
# or ChatGPT's labels
# chatgpt_topic_labels = {topic: " | ".join(list(zip(*values))[0]) for topic, values in topic_model.topic_aspects_["OpenAI"].items()}
# chatgpt_topic_labels[-1] = "Outlier Topic"
# topic_model.set_topic_labels(chatgpt_topic_labels)
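The dictionary comprehensions above rely on `zip(*values)` to separate the words from their scores. The sketch below spells out the same step on a toy aspect in the `{topic: [(word, score), ...]}` shape. The helper name `labels_from_aspect` is hypothetical, and the toy values are copied from the KeyBERT output for topic 1 shown earlier.

```python
def labels_from_aspect(aspect, top_k=5, sep=" | "):
    """Build one label per topic from its `top_k` highest-ranked words."""
    labels = {}
    for topic, word_scores in aspect.items():
        if not word_scores:
            continue  # nothing to label for an empty representation
        # zip(*pairs) transposes [(word, score), ...] into (words, scores).
        words, _scores = zip(*word_scores)
        labels[topic] = sep.join(words[:top_k])
    return labels


toy_aspect = {
    1: [("stayed nights", 0.8053507), ("nights stayed", 0.77842695), ("stayed night", 0.746093)],
}
print(labels_from_aspect(toy_aspect, top_k=2))
# -> {1: 'stayed nights | nights stayed'}
```

The resulting dictionary has the shape that `.set_topic_labels` is called with in the cell above.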

Now that we have set the updated topic labels, we can access them with the many functions used throughout BERTopic. Most notably, you can show the updated labels in visualizations with the `custom_labels=True` parameter.

In [97]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,CustomName,Representation,KeyBERT,MMR,POS,Representative_Docs
0,-1,19862,-1_hotel_room_great_good,hotel | good hotel | hotel good | hotel great ...,"[hotel, room, great, good, breakfast, staff, s...","[hotel, good hotel, hotel good, hotel great, g...","[hotel, room, great, good, breakfast, staff, s...","[hotel, room, great, good, breakfast, staff, s...","[Hotel ., I would this hotel ., Excelente hote..."
1,0,1983,0_pool_swimming_swimming pool_pool area,swimming pool | pool | pool pool | pool beauti...,"[pool, swimming, swimming pool, pool area, are...","[swimming pool, pool, pool pool, pool beautifu...","[pool, swimming, swimming pool, pool area, are...","[pool, swimming, area, bar, nice, gym, great, ...","[Excellent swimming pool ., pool ., , of Pool .]"
2,1,1730,1_stayed_nights_stayed nights_stay,stayed nights | nights stayed | stayed night |...,"[stayed, nights, stayed nights, stay, stay sta...","[stayed nights, nights stayed, stayed night, s...","[stayed, nights, stayed nights, stay, stay sta...","[nights, stay, night, trip, days, pleasant, en...","[Stayed here for nights with my ., We stayed h..."
3,2,1660,2_airport_taxi_bay_sapa,bay tour | bay trip | airport | picked airport...,"[airport, taxi, bay, sapa, tour, check, train,...","[bay tour, bay trip, airport, picked airport, ...","[airport, taxi, bay, sapa, tour, check, train,...","[airport, taxi, bay, tour, check, train, lugga...",[We booked a tour to Ha Long Bay at the tour d...
4,3,1240,3_clean_comfortable_spacious_clean comfortable,extremely clean | clean clean | clean good | c...,"[clean, comfortable, spacious, clean comfortab...","[extremely clean, clean clean, clean good, cle...","[clean, comfortable, spacious, clean comfortab...","[clean, comfortable, spacious, comfortable cle...","[are clean ., The are clean ., The are very cl..."
5,4,1219,4_location_town_walk_walking,walking distance | location walking | walk loc...,"[location, town, walk, walking, distance, walk...","[walking distance, location walking, walk loca...","[location, town, walk, walking, distance, walk...","[location, town, walk, distance, old town, cit...","[all in walking distance ., There are many and..."
6,5,1186,5_staff_hotel staff_hotel_staff hotel,hotel staff | staff hotel | hotel friendly | h...,"[staff, hotel staff, hotel, staff hotel, frien...","[hotel staff, staff hotel, hotel friendly, hot...","[staff, hotel staff, hotel, staff hotel, frien...","[staff, hotel, friendly, service, helpful, gre...","[Very good Hotel Staff very friendly ., Hotel ..."
7,6,1009,6_food_restaurant_ate_dinner,restaurant | restaurant food | food restaurant...,"[food, restaurant, ate, dinner, food good, del...","[restaurant, restaurant food, food restaurant,...","[food, restaurant, ate, dinner, food good, del...","[food, restaurant, dinner, delicious, good, go...","[We in the restaurant ., Food ., The restauran..."
8,7,981,7_definitely_come_return_definitely stay,definitely stay | stay return | stay definitel...,"[definitely, come, return, definitely stay, st...","[definitely stay, stay return, stay definitely...","[definitely, come, return, definitely stay, st...","[visit, time, love, minute, nights, beautiful,...","[, We would definitely stay here again ., We w..."
9,8,883,8_hoi_stay hoi_hotel hoi_shuttle,visiting hoi | visit hoi | hoi | hoi stay | ho...,"[hoi, stay hoi, hotel hoi, shuttle, town, la, ...","[visiting hoi, visit hoi, hoi, hoi stay, hoi l...","[hoi, stay hoi, hotel hoi, shuttle, town, la, ...","[shuttle, town, stay, resort, hotel, bus, beac...","[A must experience in Hoi An ., Best of Hoi An..."


Notice that the overview in `.get_topic_info` now also includes the column `CustomName`. That is the custom label that we just created for each topic.

## **Topic-Document Distribution**
If using `calculate_probabilities=True` is not possible, then you can [approximate the topic-document distributions](https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html) using `.approximate_distribution`. It is a fast and flexible method for creating different topic-document distributions.
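The `window` and `stride` arguments suggest a sliding window over the tokens of each document. The sketch below is a simplified illustration of that idea under stated assumptions: it splits on whitespace, scores each window by plain keyword overlap (a toy stand-in for the similarity BERTopic computes), and normalizes the summed scores. The function names, the default values, and the keyword sets are hypothetical.

```python
def token_windows(doc, window=4, stride=1):
    """Split a document into overlapping token sets using a sliding window."""
    tokens = doc.split()
    if len(tokens) <= window:
        return [" ".join(tokens)] if tokens else []
    # Note: with stride > 1 the last few tokens may not be covered by any window.
    return [
        " ".join(tokens[start:start + window])
        for start in range(0, len(tokens) - window + 1, stride)
    ]


def approximate_topic_distribution(doc, topic_keywords, window=4, stride=1):
    """Toy stand-in: score each window by keyword overlap, sum, then normalize."""
    totals = {topic: 0.0 for topic in topic_keywords}
    for chunk in token_windows(doc, window, stride):
        words = set(chunk.lower().split())
        for topic, keywords in topic_keywords.items():
            totals[topic] += len(words & keywords)
    norm = sum(totals.values())
    return {topic: (score / norm if norm else 0.0) for topic, score in totals.items()}


# Hypothetical keyword sets loosely based on topics 0 and 5 in this notebook.
toy_keywords = {0: {"pool", "swimming"}, 5: {"staff", "friendly"}}
toy_doc = "The pool was great and the staff were friendly"
print(token_windows("We would definitely stay here again", window=4, stride=2))
print(approximate_topic_distribution(toy_doc, toy_keywords))
```

A larger `window` gives each token set more context, while a larger `stride` produces fewer token sets and therefore less work per document.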

In [117]:
# `topic_distr` contains the distribution of topics in each document
# topic_distr, _ = topic_model.approximate_distribution(docs, window=8, stride=4)
topic_distr, _ = topic_model.approximate_distribution(docs, use_embedding_model=True)



 55%|███████████████████████████████████████████████████▊                                          | 27/49 [01:35<01:25,  3.88s/it]

Batches:   0%|          | 0/335 [00:00<?, ?it/s]

 57%|█████████████████████████████████████████████████████▋                                        | 28/49 [01:39<01:22,  3.92s/it]

Batches:   0%|          | 0/330 [00:00<?, ?it/s]

 59%|███████████████████████████████████████████████████████▋                                      | 29/49 [01:43<01:18,  3.91s/it]

Batches:   0%|          | 0/299 [00:00<?, ?it/s]

 61%|█████████████████████████████████████████████████████████▌                                    | 30/49 [01:46<01:12,  3.80s/it]

Batches:   0%|          | 0/302 [00:00<?, ?it/s]

 63%|███████████████████████████████████████████████████████████▍                                  | 31/49 [01:50<01:07,  3.74s/it]

Batches:   0%|          | 0/304 [00:00<?, ?it/s]

 65%|█████████████████████████████████████████████████████████████▍                                | 32/49 [01:53<01:02,  3.70s/it]

Batches:   0%|          | 0/319 [00:00<?, ?it/s]

 67%|███████████████████████████████████████████████████████████████▎                              | 33/49 [01:57<00:58,  3.69s/it]

Batches:   0%|          | 0/307 [00:00<?, ?it/s]

 69%|█████████████████████████████████████████████████████████████████▏                            | 34/49 [02:01<00:55,  3.67s/it]

Batches:   0%|          | 0/274 [00:00<?, ?it/s]

 71%|███████████████████████████████████████████████████████████████████▏                          | 35/49 [02:04<00:49,  3.55s/it]

Batches:   0%|          | 0/269 [00:00<?, ?it/s]

 73%|█████████████████████████████████████████████████████████████████████                         | 36/49 [02:07<00:44,  3.41s/it]

Batches:   0%|          | 0/287 [00:00<?, ?it/s]

 76%|██████████████████████████████████████████████████████████████████████▉                       | 37/49 [02:11<00:40,  3.42s/it]

Batches:   0%|          | 0/289 [00:00<?, ?it/s]

 78%|████████████████████████████████████████████████████████████████████████▉                     | 38/49 [02:14<00:37,  3.40s/it]

Batches:   0%|          | 0/273 [00:00<?, ?it/s]

 80%|██████████████████████████████████████████████████████████████████████████▊                   | 39/49 [02:17<00:33,  3.34s/it]

Batches:   0%|          | 0/325 [00:00<?, ?it/s]

 82%|████████████████████████████████████████████████████████████████████████████▋                 | 40/49 [02:21<00:31,  3.45s/it]

Batches:   0%|          | 0/289 [00:00<?, ?it/s]

 84%|██████████████████████████████████████████████████████████████████████████████▋               | 41/49 [02:24<00:27,  3.42s/it]

Batches:   0%|          | 0/297 [00:00<?, ?it/s]

 86%|████████████████████████████████████████████████████████████████████████████████▌             | 42/49 [02:27<00:21,  3.11s/it]

Batches:   0%|          | 0/300 [00:00<?, ?it/s]

 88%|██████████████████████████████████████████████████████████████████████████████████▍           | 43/49 [02:30<00:18,  3.16s/it]

Batches:   0%|          | 0/310 [00:00<?, ?it/s]

 90%|████████████████████████████████████████████████████████████████████████████████████▍         | 44/49 [02:33<00:16,  3.21s/it]

Batches:   0%|          | 0/286 [00:00<?, ?it/s]

 92%|██████████████████████████████████████████████████████████████████████████████████████▎       | 45/49 [02:36<00:12,  3.18s/it]

Batches:   0%|          | 0/281 [00:00<?, ?it/s]

 94%|████████████████████████████████████████████████████████████████████████████████████████▏     | 46/49 [02:39<00:08,  2.93s/it]

Batches:   0%|          | 0/296 [00:00<?, ?it/s]

 96%|██████████████████████████████████████████████████████████████████████████████████████████▏   | 47/49 [02:40<00:05,  2.51s/it]

Batches:   0%|          | 0/299 [00:00<?, ?it/s]

 98%|████████████████████████████████████████████████████████████████████████████████████████████  | 48/49 [02:43<00:02,  2.70s/it]

Batches:   0%|          | 0/57 [00:00<?, ?it/s]

100%|██████████████████████████████████████████████████████████████████████████████████████████████| 49/49 [02:44<00:00,  3.36s/it]


Next, let's take a look at a specific document and see how the topic distribution was extracted:

In [115]:
doc_id = 2
print(docs[doc_id])

This is not my first stay at the hotel , and i was mildly .


In [116]:
# topic_distr

In [104]:
# Visualize the topic-document distribution for a single document
topic_model.visualize_distribution(topic_distr[doc_id])

In [105]:
# Same visualization for a single document, but labelled with the custom topic labels
topic_model.visualize_distribution(topic_distr[doc_id], custom_labels=True)

It seems to have extracted a number of relevant topics and shows how these topics are distributed across the document. We can go one step further and visualize them on a token level:

In [114]:
# Calculate the topic distributions on a token level for a single document
# NOTE: a separate name is used for the document-level result so the corpus-level `topic_distr` is not overwritten
doc_topic_distr, topic_token_distr = topic_model.approximate_distribution(docs[doc_id], calculate_tokens=True)

# Visualize the token-level distributions
df = topic_model.visualize_approximate_distribution(docs[doc_id], topic_token_distr[0])
df

100%|███████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 238.79it/s]


ValueError: Length mismatch: Expected axis has 12 elements, new values have 17 elements

**🔥 Tip - `use_embedding_model` 🔥**
***
By default, we compare the c-TF-IDF representations of the token sets with those of all topics. Because of the bag-of-words representation, this is quite fast. However, you might want to use the selected `embedding_model` for this comparison instead. Do note that, because of the many token sets, this is often considerably slower:

```python
topic_distr, _ = topic_model.approximate_distribution(docs, use_embedding_model=True)
```
***
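To make the token-set idea more concrete, here is a minimal, self-contained sketch of a sliding-window approximation. It is not BERTopic's implementation: the function names (`token_sets`, `approximate_token_distribution`) and the toy `topic_words` weights are made up for illustration, and a plain sum of word weights stands in for the similarity between c-TF-IDF representations. The `window=4` and `stride=1` values are assumed to match the defaults of `.approximate_distribution`.

```python
import re


def token_sets(tokens, window=4, stride=1):
    """Return (start, end) index pairs of sliding windows over the tokens."""
    if len(tokens) <= window:
        return [(0, len(tokens))]
    return [(i, i + window) for i in range(0, len(tokens) - window + 1, stride)]


def approximate_token_distribution(doc, topic_words, window=4, stride=1):
    """Score every token against every topic using overlapping token sets.

    topic_words is a list with one {word: weight} dict per topic (toy stand-in
    for a topic representation).
    """
    # Tokenize with the same pattern scikit-learn's CountVectorizer uses by default
    tokens = re.findall(r"(?u)\b\w\w+\b", doc.lower())
    n_topics = len(topic_words)
    token_scores = [[0.0] * n_topics for _ in tokens]

    for start, end in token_sets(tokens, window, stride):
        window_tokens = tokens[start:end]
        for topic_id, weights in enumerate(topic_words):
            # Simplification: the score of a token set is the sum of its word weights
            score = sum(weights.get(token, 0.0) for token in window_tokens)
            # Every token in the window receives the score of the whole window
            for i in range(start, end):
                token_scores[i][topic_id] += score

    # Document-level distribution: sum the token scores per topic and normalize
    totals = [sum(row[t] for row in token_scores) for t in range(n_topics)]
    norm = sum(totals)
    doc_distr = [total / norm if norm else 0.0 for total in totals]
    return tokens, token_scores, doc_distr


# Hypothetical topics: one about hotel stays, one about food
topic_words = [{"hotel": 0.9, "stay": 0.6}, {"food": 0.8, "breakfast": 0.7}]
tokens, token_scores, doc_distr = approximate_token_distribution(
    "This is not my first stay at the hotel", topic_words
)
for token, scores in zip(tokens, token_scores):
    print(f"{token:>6}: {scores}")
print("document-level distribution:", doc_distr)
```

Because the windows overlap, tokens that sit close to a topic word also pick up some of its score, which is why token-level visualizations tend to highlight short phrases rather than isolated words.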




## **Outlier Reduction**
By default, HDBSCAN generates outliers, which is a helpful mechanism for creating accurate topic representations. However, you might want to assign every single document to a topic. We can use `.reduce_outliers` to map some or all outliers to a topic:

In [52]:
# Reduce outliers
new_topics = topic_model.reduce_outliers(docs, topics)

# Reduce outliers with pre-calculated embeddings instead
new_topics = topic_model.reduce_outliers(docs, topics, strategy="embeddings", embeddings=embeddings)

100%|██████████████████████████| 43/43 [00:23<00:00,  1.82it/s]


**💡  NOTE - Update Topics with Outlier Reduction 💡**
***
After having generated updated topic assignments, we can pass them to BERTopic in order to update the topic representations:

```python
topic_model.update_topics(docs, topics=new_topics)
```

It is important to realize that updating the topics this way may lead to errors if topic reduction or topic merging techniques are used afterwards. The reason for this is that when you assign one -1 document to topic 1 and another -1 document to topic 2, it is unclear how the -1 documents should be mapped: are they matched to topic 1 or to topic 2?
***
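The ambiguity described in this note can be checked directly. The sketch below is only an illustration with made-up assignments: `ambiguous_mappings` is a hypothetical helper, not part of BERTopic, that reports every original topic id whose documents ended up in more than one topic after reassignment.

```python
from collections import defaultdict


def ambiguous_mappings(old_topics, new_topics):
    """Return the original topic ids that were mapped to more than one new topic."""
    targets = defaultdict(set)
    for old, new in zip(old_topics, new_topics):
        targets[old].add(new)
    return {old: sorted(news) for old, news in targets.items() if len(news) > 1}


# Hypothetical assignments before and after outlier reduction
topics = [-1, 0, 1, -1, 2, -1]
new_topics = [1, 0, 1, 2, 2, -1]

# The outlier topic -1 no longer maps to a single topic
print(ambiguous_mappings(topics, new_topics))
```

If the result is not empty, a simple one-to-one mapping from old to new topic ids no longer exists, which is the situation the note warns about.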

## **Visualize Topics**

With visualizations, we are entering the realm of subjective "best practices". These are things that I generally do because I like the representations, but your experience might differ.

Having said that, there are two visualizations that are my go-to when visualizing the topics themselves:

* `topic_model.visualize_topics()`
* `topic_model.visualize_hierarchy()`

In [53]:
topic_model.visualize_topics(custom_labels=True)

In [54]:
topic_model.visualize_hierarchy(custom_labels=True)

## **Visualize Documents**

When visualizing documents, it helps to have embedded the documents beforehand to speed up computation. Fortunately, we have already done that as a "best practice".

Visualizing documents in 2-dimensional space helps in understanding the underlying structure of the documents and topics.

In [55]:
# Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)

The following plot is **interactive** which means that you can zoom in, double click on a label to only see that one and generally interact with the plot:

In [56]:
# Visualize the documents in 2-dimensional space and show the documents themselves on hover
# NOTE: You can hide the hover with `hide_document_hover=True` which is especially helpful if you have a large dataset
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings, custom_labels=True)

In [None]:
# We can also hide the annotations to get a clearer overview of the topics
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings, custom_labels=True, hide_annotations=True)

**💡  NOTE - 2-dimensional space 💡**
***
Although visualizing the documents in 2-dimensional space gives an idea of their underlying structure, there is a risk involved.

Visualizing the documents in 2-dimensional space means that we have lost a significant amount of information, since the original embeddings have hundreds of dimensions (384 for `all-MiniLM-L6-v2`). Condensing all that information into 2 dimensions without loss is simply not possible. In other words, it is merely an **approximation**, albeit quite an accurate one.
***

## **Serialization**

When saving a BERTopic model, there are several ways of doing so. You can save the model with `pickle`, `pytorch`, or `safetensors`.

Personally, I would advise going with `safetensors` whenever possible. The reason for this is that the format allows for a very small topic model to be saved and shared.

When saving a model with `safetensors`, it skips over saving the dimensionality reduction and clustering models. The `.transform` function will still work without these models but instead assign topics based on the similarity between document embeddings and the topic embeddings.

As a result, the `.transform` step might give different results but it is generally worth it considering the smaller and significantly faster model.

In [None]:
saved_model_dir = "bertopic_230718_1200"

In [None]:
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save(saved_model_dir, serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

**💡  NOTE - Embedding Model 💡**
***
Using `safetensors`, we are not saving the underlying embedding model but merely a pointer to the model. For instance, in the example above we are saving the string `"sentence-transformers/all-MiniLM-L6-v2"` so that we can load in the embedding model alongside the topic model.

This currently only works if you are using a sentence transformer model. If you are using a different model, you can load it in when loading the topic model like this:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Define embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Load model and add embedding model
loaded_model = BERTopic.load("path/to/my/model_dir", embedding_model=embedding_model)
```
***

As mentioned above, loading can be done as follows:

In [None]:
from sentence_transformers import SentenceTransformer

# Define embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Load model and add embedding model
loaded_model = BERTopic.load(saved_model_dir, embedding_model=embedding_model)

## **Inference**

To speed up inference, we can leverage a "best practice" that we used before, namely serialization. When you save a model as `safetensors` and then load it in, the dimensionality reduction and clustering steps are removed from the pipeline.

Instead, the assignment of topics is done through cosine similarity of document embeddings and topic embeddings. This speeds up inference significantly.
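As a rough illustration of similarity-based assignment, the sketch below picks, for each document embedding, the topic whose embedding has the highest cosine similarity. The helper names (`cosine_similarity`, `assign_topics`) and the 3-dimensional toy vectors are assumptions made for this example; BERTopic's internals may differ in the details.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_topics(doc_embeddings, topic_embeddings):
    """Return a (topic_id, similarity) pair for every document embedding.

    topic_embeddings maps a topic id to its embedding vector.
    """
    assignments = []
    for doc in doc_embeddings:
        sims = {tid: cosine_similarity(doc, vec) for tid, vec in topic_embeddings.items()}
        best = max(sims, key=sims.get)
        assignments.append((best, sims[best]))
    return assignments


# Toy embeddings: two topics and two documents
topic_embeddings = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0]}
doc_embeddings = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]

for topic_id, similarity in assign_topics(doc_embeddings, topic_embeddings):
    print(f"topic {topic_id} (similarity {similarity:.3f})")
```

No dimensionality reduction or clustering is involved here, only one similarity computation per document and topic, which is why this route is cheap at inference time.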

To show its effect, let's start by disabling the logger:

In [None]:
from bertopic._utils import MyLogger
logger = MyLogger("ERROR")
loaded_model.verbose = False
topic_model.verbose = False

Then, we run inference on both the loaded model and the non-loaded model:

**100 documents**

In [None]:
%timeit loaded_model.transform(docs[:100])

In [None]:
%timeit topic_model.transform(docs[:100])

**1000 documents**

In [None]:
%timeit loaded_model.transform(docs[:1000])

In [None]:
%timeit topic_model.transform(docs[:1000])

**10_000 documents**

In [None]:
%timeit loaded_model.transform(docs[:10000])

In [None]:
%timeit topic_model.transform(docs[:10000])

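Because the `loaded_model` skips the dimensionality reduction and clustering steps, it should be quite a bit faster for inference than the original `topic_model`. Running the timing cells above lets you verify this on your own data.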
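The `%timeit` magic is only available in IPython and Jupyter. Outside a notebook, a similar comparison can be sketched with the standard-library `timeit` module. In the example below, `benchmark`, `full_pipeline`, and `lightweight_pipeline` are hypothetical stand-ins so that the snippet runs on its own; in practice you would pass `topic_model.transform` and `loaded_model.transform` instead.

```python
import timeit


def benchmark(func, *args, repeat=3, number=5):
    """Return the best average time per call, in seconds."""
    timings = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(timings) / number


# Hypothetical stand-in for a model that runs the full pipeline
def full_pipeline(batch):
    return [sum(ord(ch) ** 2 for ch in doc * 50) for doc in batch]


# Hypothetical stand-in for a model that skips the expensive steps
def lightweight_pipeline(batch):
    return [len(doc) for doc in batch]


sample_docs = ["a short example document"] * 100

for name, func in [("full", full_pipeline), ("lightweight", lightweight_pipeline)]:
    print(f"{name}: {benchmark(func, sample_docs) * 1000:.3f} ms per call")
```

Taking the minimum over several repeats reduces the influence of background noise on the measurement, which is also the recommendation in the `timeit` documentation.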

--------


In [None]:
# Zip the directory that was saved above so it can be downloaded or shared
!zip -r {saved_model_dir}.zip {saved_model_dir}/