# **Tutorial** - Topic Modeling with BERTopic
(last updated 01-09-2022)

In this tutorial we will explore how to use BERTopic to create topics from a dataset of TripAdvisor hotel reviews. The most frequent use cases and methods are discussed, together with important parameters to look out for.


## BERTopic
BERTopic is a topic modeling technique that leverages 🤗 transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="40%">

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)
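
Once the accelerator is enabled, it can be worth confirming that the runtime actually sees a GPU. The following is a minimal sketch that only relies on the `nvidia-smi` command-line tool being on the `PATH`; the helper name `gpu_visible` is our own and not part of BERTopic or Colab.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` is available and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # The NVIDIA driver tools are not installed or not on PATH.
        return False
    try:
        # `nvidia-smi -L` prints one line per detected GPU.
        result = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False`, the rest of the notebook still runs, only on the CPU.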

# **Installing BERTopic**

We start by installing BERTopic from PyPI:

In [1]:
import sys
print(sys.executable)
# Print the interpreter path to check which environment the notebook kernel is running in.

/home/s2110149/.anaconda3/envs/semantic-topic-modeling/bin/python


In [2]:
%%capture
# !pip install bertopic

## Restart the Notebook
Installing BERTopic updates some packages that were already loaded. To use the updated versions correctly, we should now restart the notebook.

From the Menu:

Runtime → Restart Runtime

# Data
We use a dataset of TripAdvisor hotel reviews.

In [3]:
import pandas as pd

### Preprocessing

In [None]:

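# `gdrive_path` is not defined in this notebook; it is assumed to hold the folder containing the raw CSV.
df = pd.read_csv(gdrive_path + "/tripadvisor_raw.csv")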


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 533430 entries, 0 to 533429
Data columns (total 24 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   Unnamed: 0.1            533430 non-null  int64  
 1   Unnamed: 0              533430 non-null  int64  
 2   user                    533430 non-null  object 
 3   user_helpful_votes      436162 non-null  object 
 4   user_contribution       462777 non-null  object 
 5   user_rate               533430 non-null  int64  
 6   user_Value              368161 non-null  float64
 7   user_Location           368061 non-null  float64
 8   user_Cleanliness        369361 non-null  float64
 9   user_Service            477812 non-null  float64
 10  review_title            527770 non-null  object 
 11  review_text             482656 non-null  object 
 12  hotel                   533430 non-null  int64  
 13  hotel_name              533351 non-null  object 
 14  hotel_url           

In [None]:
df.iloc[0]

Unnamed: 0.1                                                         501060
Unnamed: 0                                                            14916
user                                                               Anders M
user_helpful_votes                                                        1
user_contribution                                                         3
user_rate                                                                 5
user_Value                                                              5.0
user_Location                                                           5.0
user_Cleanliness                                                        NaN
user_Service                                                            5.0
review_title                                  Great place to stay in hoi an
review_text                 We had an amazing stay at green grass villa....
hotel                                                               9682001
hotel_name  

In [None]:
# from sklearn.datasets import fetch_20newsgroups
# docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

In [None]:
df['review'] = df['review_title'] + " " + df['review_text']
df['review'][10]

'excellent stay 방이 깨끗하고 좋았고, 방마다 테라스가 있어서 저녁에 맥주마시기도 좋았어요 직원도 친절하시고 좋았습니다^^.,직원분중에 Ha 씨가 되게 친절하시고 좋았어요. the room is very cool with balcony, the breakfast is very delicous with viet nam food, the staff is very friendly espcialy ms Ha, i will come back here in the near future...'

In [None]:
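# Keep only reviews of at least 40 characters; .copy() avoids a SettingWithCopyWarning on the assignment below.
df = df[df['review'].str.len() >= 40].copy()
# Make sure the column holds plain strings.
df['review'] = df['review'].astype(str)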
df.to_csv('1_tripadvisor_df.csv', index=False)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 482645 entries, 0 to 533429
Data columns (total 25 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   Unnamed: 0.1            482645 non-null  int64  
 1   Unnamed: 0              482645 non-null  int64  
 2   user                    482645 non-null  object 
 3   user_helpful_votes      389821 non-null  object 
 4   user_contribution       415651 non-null  object 
 5   user_rate               482645 non-null  int64  
 6   user_Value              329174 non-null  float64
 7   user_Location           329373 non-null  float64
 8   user_Cleanliness        330417 non-null  float64
 9   user_Service            430619 non-null  float64
 10  review_title            482645 non-null  object 
 11  review_text             482645 non-null  object 
 12  hotel                   482645 non-null  int64  
 13  hotel_name              482574 non-null  object 
 14  hotel_url           
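
The preprocessing above concatenates the title and the text and then filters on length. A small sketch on a made-up DataFrame (the rows below are invented for illustration) shows what happens to rows with a missing title or text: the concatenation yields a missing value, and the length filter then drops that row. Filling missing parts with an empty string first is one possible alternative, not what the notebook does.

```python
import pandas as pd

# Toy data: row 1 has no title, row 2 has no text.
toy = pd.DataFrame({
    "review_title": ["Great stay", None, "Nice"],
    "review_text": [
        "The room was spotless and the staff were lovely.",
        "Text without a title, long enough to pass the length filter.",
        None,
    ],
})

# Same recipe as in the notebook: a missing part makes the whole review missing.
toy["review"] = toy["review_title"] + " " + toy["review_text"]
kept = toy[toy["review"].str.len() >= 40].copy()

# Alternative: treat missing parts as empty strings before concatenating.
filled = toy["review_title"].fillna("") + " " + toy["review_text"].fillna("")
kept_filled = toy[filled.str.len() >= 40].copy()

print(len(kept), len(kept_filled))
```

With the notebook's recipe only the complete row survives, while the `fillna("")` variant also keeps the review that merely lacks a title.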

### Load from clean df

In [4]:
df = pd.read_csv('1.tripadvisor_df.csv')
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 439996 entries, 0 to 439995
Data columns (total 25 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   Unnamed: 0.1            439996 non-null  int64  
 1   Unnamed: 0              439995 non-null  float64
 2   user                    439995 non-null  object 
 3   user_helpful_votes      356293 non-null  object 
 4   user_contribution       379771 non-null  object 
 5   user_rate               439995 non-null  float64
 6   user_Value              300608 non-null  float64
 7   user_Location           300744 non-null  float64
 8   user_Cleanliness        301678 non-null  float64
 9   user_Service            392979 non-null  float64
 10  review_title            439993 non-null  object 
 11  review_text             439995 non-null  object 
 12  hotel                   439995 non-null  float64
 13  hotel_name              439933 non-null  object 
 14  hotel_url           

In [5]:
from sklearn.model_selection import train_test_split
import numpy as np

In [6]:
X_train, X_test, y_train, y_test = train_test_split(df, np.arange(len(df)), test_size=0.33, random_state=42)

In [7]:
len(X_train)

294797
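
The training size follows from the split arithmetic. As far as we understand scikit-learn's behaviour, a fractional `test_size` is rounded up to a whole number of test samples and the remainder goes to the training set. The sketch below reproduces that calculation for the 439,996 rows loaded above; it is a back-of-the-envelope check, not a call into scikit-learn.

```python
import math

n_samples = 439_996   # rows in the cleaned DataFrame loaded above
test_size = 0.33      # fraction passed to train_test_split

# Assumption: the test set size is ceil(test_size * n_samples).
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

The resulting training size matches the `len(X_train)` output shown above.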

In [8]:
# reviews= df['review']
reviews = X_train['review'].astype('str')
len(reviews)

294797

In [9]:
docs = reviews
len(docs)

294797

In [10]:
docs[5]

'Wonderful oasis, exemplary service We arrived after 11 pm, following a pretty basic 17 hour train trip from Ho Chi Minh City. The hotel sent a car and driver to collect us from the station, complete with chilled face towels and water bottles (such a treat after the train trip!) and welcomed us graciously. The service and standards here are exemplary. And the views are stunning. A great hotel.'
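
Note that `docs` is still a pandas Series carrying the shuffled original row labels, so `docs[5]` looks up the row *labelled* 5 rather than the sixth training document. The toy sketch below (values invented) illustrates the difference, and how converting to a plain list gives positional access. BERTopic's documentation describes its input as a list of strings, so `docs.tolist()` may be the safer thing to pass to the model.

```python
import pandas as pd

# A Series whose integer labels are out of order, as after a shuffled split.
shuffled = pd.Series(["third", "first", "fourth", "second"], index=[2, 0, 3, 1])

by_label = shuffled.loc[0]      # the row labelled 0
by_position = shuffled.iloc[0]  # the first row in the current order

docs_list = shuffled.tolist()   # plain list: indexing is purely positional

print(by_label, by_position, docs_list[0])
```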

# **Topic Modeling**

In this example, we will go through the main components of BERTopic and the steps necessary to create a strong topic model.




## Training

We start by instantiating BERTopic. We set language to `english` since our documents are in the English language. If you would like to use a multi-lingual model, please use `language="multilingual"` instead.

We will also calculate the topic probabilities. However, this can slow down BERTopic significantly on large amounts of data (more than 100,000 documents). If you want to speed up the model, it is advised to turn this off with `calculate_probabilities=False`.
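
To get a feel for why the probabilities are costly, consider the size of the resulting document-topic matrix, which has one row per document and one column per topic. The sketch below is a rough estimate only: the number of topics is a hypothetical value (it is unknown before training) and we assume 8-byte floating-point entries.

```python
n_docs = 294_797       # training documents in this notebook
n_topics = 500         # hypothetical: the real number is only known after fitting
bytes_per_value = 8    # assumption: float64 entries

matrix_bytes = n_docs * n_topics * bytes_per_value
print(f"Approximate size of the probability matrix: {matrix_bytes / 1e9:.2f} GB")
```

On top of the memory, computing the matrix adds processing time, which is why turning it off helps for large collections.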


In [19]:
%%capture
# Suppress the console log while installing
!pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com

In [30]:
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

# Pass the above models to be used in BERTopic
# topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)

ModuleNotFoundError: No module named 'cuml'
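
Since cuML is not importable in this environment, one option is to fall back to BERTopic's defaults when the import fails. The sketch below is one way to do that, assuming that passing `None` for `umap_model` and `hdbscan_model` makes BERTopic use its built-in CPU models; the `bertopic_kwargs` dictionary is just a convenience name of ours.

```python
try:
    # GPU-accelerated implementations; only available if cuML is installed.
    from cuml.cluster import HDBSCAN
    from cuml.manifold import UMAP
except ImportError:
    # Fall back: None lets BERTopic pick its default models.
    umap_model = None
    hdbscan_model = None
else:
    umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
    hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

bertopic_kwargs = {"umap_model": umap_model, "hdbscan_model": hdbscan_model}
print("Using cuML:", umap_model is not None)
```

The keyword arguments can then be unpacked into the constructor, for example `BERTopic(language="english", **bertopic_kwargs)`.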

In [19]:
# import os
# os.environ['TRANSFORMERS_CACHE'] = '/home/s2110149/WORKING/semantic-topic-modeling/.cache/'

In [11]:
!which python

/home/s2110149/.anaconda3/bin/python
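
The two paths printed so far differ: `sys.executable` points into the `semantic-topic-modeling` environment, while `!which python` resolves to the base Anaconda installation. That suggests `!pip install ...` may be installing into a different environment than the one the kernel imports from, which would be one plausible explanation for the `ModuleNotFoundError` above. The sketch below is a rough heuristic check that compares the two locations by directory.

```python
import os
import shutil
import sys

kernel_python = sys.executable            # interpreter running this kernel
shell_python = shutil.which("python")     # interpreter a shell command would find

# Heuristic: compare the directories the two interpreters live in.
same_interpreter = (
    shell_python is not None
    and os.path.dirname(os.path.abspath(shell_python))
    == os.path.dirname(os.path.abspath(kernel_python))
)

print("kernel:", kernel_python)
print("shell: ", shell_python)
print("same environment:", same_interpreter)

# In a notebook, either of these installs into the kernel's own environment:
#   %pip install bertopic
#   !{sys.executable} -m pip install bertopic
```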


In [None]:
from bertopic import BERTopic

topic_model = BERTopic(
    language="english",
    calculate_probabilities=True,
    verbose=True,
    # umap_model=umap_model,
    # hdbscan_model=hdbscan_model
)
topics, probs = topic_model.fit_transform(docs)

  @numba.jit()
  @numba.jit()
  @numba.jit()
  from .autonotebook import tqdm as notebook_tqdm
  @numba.jit()
Downloading (…)e9125/.gitattributes: 100%|██████████| 1.18k/1.18k [00:00<00:00, 357kB/s]
Downloading (…)_Pooling/config.json: 100%|██████████| 190/190 [00:00<00:00, 123kB/s]
Downloading (…)7e55de9125/README.md: 100%|██████████

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

**NOTE**: Use `language="multilingual"` to select a model that supports 50+ languages.
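
The `huggingface/tokenizers` fork warning in the training output names its own remedy: set the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch, to be run in a cell before the embedding model is first used:

```python
import os

# Explicitly disable tokenizer parallelism, as suggested by the warning text.
# This needs to happen before the tokenizers library is first used.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Choosing `"false"` trades some tokenization speed for a quieter, deadlock-safe run; `"true"` is the other value the warning mentions.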

## Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents.

In [None]:
freq = topic_model.get_topic_info()
freq.head(20)

Topic -1 refers to all outliers and should typically be ignored. Next, let's take a look at a frequent topic that was generated:

In [None]:
topic_model.get_topic(0)  # Select the most frequent topic
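
Because topic -1 collects the outliers, it can help to drop it from the overview and to check how large a share of the documents it contains. The sketch below uses a made-up table with the `Topic`, `Count` and `Name` columns that `get_topic_info()` returns; the counts and names are invented for illustration.

```python
import pandas as pd

# Toy stand-in for topic_model.get_topic_info(); values are invented.
toy_info = pd.DataFrame({
    "Topic": [-1, 0, 1, 2],
    "Count": [120, 80, 45, 30],
    "Name": ["-1_the_and_to", "0_pool_beach_resort", "1_breakfast_buffet", "2_airport_taxi"],
})

# Drop the outlier topic from the overview.
without_outliers = toy_info[toy_info["Topic"] != -1]

# Share of documents that were not assigned to any topic.
outlier_share = toy_info.loc[toy_info["Topic"] == -1, "Count"].sum() / toy_info["Count"].sum()

print(without_outliers)
print(f"Outlier share: {outlier_share:.1%}")
```

The same two lines can be applied to the real `freq` DataFrame from the cell above.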

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP.
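
If reproducible topics matter more than speed, one common approach is to pass a UMAP model with a fixed `random_state`. The sketch below assumes the `umap-learn` package is installed; the import is done inside a helper (a name of our own, `build_seeded_umap`) so that the cell can be defined even where the package is missing. The parameter values are illustrative.

```python
# Illustrative UMAP settings with a fixed seed for reproducibility.
UMAP_KWARGS = {
    "n_neighbors": 15,
    "n_components": 5,
    "min_dist": 0.0,
    "metric": "cosine",
    "random_state": 42,
}


def build_seeded_umap():
    """Create a UMAP model with a fixed seed (requires `umap-learn`)."""
    from umap import UMAP  # imported lazily so this cell runs without umap-learn
    return UMAP(**UMAP_KWARGS)


# Usage, once umap-learn and BERTopic are available:
#   umap_model = build_seeded_umap()
#   topic_model = BERTopic(umap_model=umap_model)
```

Fixing the seed may make UMAP slower, since it can limit the parallelism UMAP uses internally.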

## Attributes

There are a number of attributes that you can access after having trained your BERTopic model:


| Attribute | Description |
|------------------------|---------------------------------------------------------------------------------------------|
| topics_               | The topics that are generated for each document after training or updating the topic model. |
| probabilities_ | The probabilities that are generated for each document if HDBSCAN is used. |
| topic_sizes_           | The size of each topic                                                                      |
| topic_mapper_          | A class for tracking topics and their mappings anytime they are merged/reduced.             |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values.                             |
| c_tf_idf_              | The topic-term matrix as calculated through c-TF-IDF.                                       |
| topic_labels_          | The default labels for each topic.                                                          |
| custom_labels_         | Custom labels for each topic as generated through `.set_topic_labels`.                                                               |
| topic_embeddings_      | The embeddings for each topic if `embedding_model` was used.                                                              |
| representative_docs_   | The representative documents for each topic if HDBSCAN is used.                                                |

For example, to access the predicted topics for the first 10 documents, we simply run the following:

In [None]:
topic_model.topics_[:10]
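
The attributes are related to one another. For instance, the topic sizes can be thought of as a count of how often each topic id occurs in `topics_`. The sketch below illustrates that relationship on an invented list of assignments; it does not reproduce BERTopic's internal implementation.

```python
from collections import Counter

# Invented topic assignments for eight documents (-1 marks outliers).
toy_topics = [0, 1, -1, 0, 2, 0, 1, -1]

# Count how many documents fall into each topic.
sizes = Counter(toy_topics)

for topic, count in sizes.most_common():
    print(f"topic {topic}: {count} documents")
```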

# **Visualization**
There are several visualization options available in BERTopic, namely the visualization of topics, probabilities and topics over time. Topic modeling is, to a certain extent, quite subjective. Visualizations help understand the topics that were created.

## Visualize Topics
After having trained our `BERTopic` model, we could iteratively go through perhaps a hundred topics to get a good
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation.
Instead, we can visualize the topics that were generated in a way very similar to
[LDAvis](https://github.com/cpsievert/LDAvis):

In [None]:
topic_model.visualize_topics()

## Visualize Topic Probabilities

The probabilities returned from `transform()` or `fit_transform()` (stored in `probs` above, and available when `calculate_probabilities=True`) can
be used to understand how confident BERTopic is that certain topics can be found in a document.

To visualize the distributions, we simply call:

In [None]:
topic_model.visualize_distribution(probs[200], min_probability=0.015)
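
The `min_probability` argument controls which topics appear in the chart. As a rough illustration of the idea, assuming topics are shown only when their probability exceeds the threshold, the sketch below filters an invented probability vector for a single document.

```python
import numpy as np

# Invented topic probabilities for one document (one entry per topic).
doc_probs = np.array([0.01, 0.40, 0.005, 0.02, 0.30])
min_probability = 0.015

# Indices of the topics that would remain after thresholding.
shown_topics = np.where(doc_probs > min_probability)[0]

for topic in shown_topics:
    print(f"topic {topic}: {doc_probs[topic]:.3f}")
```

Raising the threshold yields a shorter, more readable chart; lowering it reveals weaker topic associations.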

## Visualize Topic Hierarchy

The topics that were created can be hierarchically reduced. To understand the potential hierarchical structure of the topics, we can use `scipy.cluster.hierarchy` to create clusters and visualize how they relate to one another. This can help in selecting an appropriate `nr_topics` when reducing the number of topics that you have created.

In [None]:
topic_model.visualize_hierarchy(top_n_topics=50)
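
To illustrate what hierarchical clustering of topics looks like, here is a small sketch using `scipy.cluster.hierarchy` directly. The four topic vectors are invented, and Ward linkage is our choice for the example; it is not meant to mirror exactly what `visualize_hierarchy` does internally.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Invented topic vectors: topics 0 and 1 are close, as are topics 2 and 3.
topic_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
])

# Build the merge tree, then cut it so that at most two clusters remain.
merge_tree = linkage(topic_vectors, method="ward")
labels = fcluster(merge_tree, t=2, criterion="maxclust")

print(labels)
```

Cutting the tree at a given number of clusters is analogous to choosing `nr_topics`: topics that end up in the same cluster are candidates for merging.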

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [None]:
topic_model.visualize_barchart(top_n_topics=5)

## Visualize Topic Similarity
Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity matrix by applying cosine similarity to those topic embeddings. The result is a matrix indicating how similar topics are to each other.

In [None]:
topic_model.visualize_heatmap(n_clusters=20, width=1000, height=1000)
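
The similarity matrix behind the heatmap can be sketched in a few lines of NumPy: normalize each topic vector to unit length and take the pairwise dot products. The vectors below are invented for illustration.

```python
import numpy as np

# Invented topic vectors; topics 0 and 1 point in nearly the same direction.
topic_vectors = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.1, 0.9],
    [0.0, 1.0, 0.0],
])

# Cosine similarity = dot product of unit-length vectors.
unit = topic_vectors / np.linalg.norm(topic_vectors, axis=1, keepdims=True)
similarity = unit @ unit.T

print(np.round(similarity, 3))
```

The diagonal is 1 (each topic is identical to itself) and the matrix is symmetric, which is why the heatmap mirrors along its diagonal.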

## Visualize Term Score Decline
Topics are represented by a number of words, starting with the most representative word. Each word is represented by a c-TF-IDF score. The higher the score, the more representative the word is of the topic. Since the topic words are sorted by their c-TF-IDF score, the scores slowly decline with each word that is added. At some point, adding words to the topic representation only marginally increases the total c-TF-IDF score and would not be beneficial for its representation.

To visualize this effect, we can plot the c-TF-IDF scores for each topic by the term rank of each word. In other words, the position of the words (term rank), where the word with the highest c-TF-IDF score has a rank of 1, is put on the x-axis, whereas the y-axis shows the c-TF-IDF scores. The result is a visualization that shows the decline of the c-TF-IDF score when adding words to the topic representation. It allows you, using the elbow method, to select the best number of words in a topic.


In [None]:
topic_model.visualize_term_rank()
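
The elbow can also be located numerically. The sketch below uses a simple heuristic of our own (not something BERTopic provides): walk down the ranked scores and stop at the first rank where the drop to the next word falls below a quarter of the initial drop. The scores are invented.

```python
# Invented c-TF-IDF scores for one topic, sorted from rank 1 downwards.
scores = [0.10, 0.07, 0.05, 0.045, 0.043, 0.042]

# Drop in score when moving from each rank to the next one.
drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]

# Heuristic threshold: a quarter of the first (largest) drop.
threshold = 0.25 * drops[0]

n_words = len(scores)  # default: keep every word if no elbow is found
for rank, drop in enumerate(drops, start=1):
    if drop < threshold:
        n_words = rank  # words beyond this rank add little
        break

print("Suggested number of words:", n_words)
```

The threshold fraction is arbitrary; in practice, inspecting the plot remains the more reliable guide.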

# **Topic Representation**
After having created the topic model, you might not be satisfied with some of the parameters you have chosen. Fortunately, BERTopic allows you to update the topics after they have been created.

This allows for fine-tuning the model to your specifications and wishes.

## Update Topics
When you have trained a model and viewed the topics and the words that represent them,
you might not be satisfied with the representation. Perhaps you forgot to remove
stopwords or you want to try out a different `n_gram_range`. We can use the function `update_topics` to update
the topic representation with new parameters for `c-TF-IDF`:


In [None]:
topic_model.update_topics(docs, n_gram_range=(1, 2))

In [None]:
topic_model.get_topic(0)   # Select the same topic that we viewed before

## Topic Reduction
We can also reduce the number of topics after having trained a BERTopic model. The advantage of doing so
is that you can decide on the number of topics after knowing how many were actually created. It is difficult to
predict before training your model how many topics are in your documents and how many will be extracted.
Instead, we can decide afterwards how many topics seem realistic:





In [None]:
topic_model.reduce_topics(docs, nr_topics=60)

In [None]:
# Access the newly updated topics with:
print(topic_model.topics_)

# **Search Topics**
After having trained our model, we can use `find_topics` to search for topics that are similar
to an input `search_term`. Here, we search for topics that closely relate to the
search term "vehicle". Then, we extract the most similar topic and check the results:

In [None]:
similar_topics, similarity = topic_model.find_topics("vehicle", top_n=5)
similar_topics

In [None]:
topic_model.get_topic(similar_topics[0])  # Inspect the most similar topic
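
Conceptually, a topic search ranks topics by how close they are to the search term in embedding space. The sketch below illustrates that ranking with invented two-dimensional vectors standing in for the topic embeddings and the embedded search term; it is a simplified picture, not BERTopic's actual implementation.

```python
import numpy as np

# Invented stand-ins: one embedding per topic, plus an embedded search term.
topic_embeddings = np.array([
    [0.9, 0.1],
    [0.1, 0.9],
    [0.7, 0.7],
    [-0.8, 0.2],
])
query_embedding = np.array([1.0, 0.0])

# Cosine similarity between the query and every topic.
sims = (topic_embeddings @ query_embedding) / (
    np.linalg.norm(topic_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

# Keep the top_n most similar topics, best first.
top_n = 2
ranked_topics = np.argsort(sims)[::-1][:top_n]

for topic in ranked_topics:
    print(f"topic {topic}: similarity {sims[topic]:.3f}")
```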

# **Model serialization**
The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved. However, UMAP and HDBSCAN will be saved.

In [None]:
# Save model
topic_model.save("my_model")

In [None]:
# Load model
my_model = BERTopic.load("my_model")

# **Embedding Models**
The parameter `embedding_model` takes in a string pointing to a sentence-transformers model, a SentenceTransformer, or a Flair DocumentEmbedding model.

## Sentence-Transformers
You can select any model from sentence-transformers and pass it to BERTopic through `embedding_model`:



In [None]:
topic_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")

Or select a SentenceTransformer model with your own parameters:


In [None]:
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
topic_model = BERTopic(embedding_model=sentence_model, verbose=True)

Click [here](https://www.sbert.net/docs/pretrained_models.html) for a list of supported sentence transformers models.  
