https://medium.com/grabngoinfo/hyperparameter-tuning-for-bertopic-model-in-python-104445778347

https://maartengr.github.io/BERTopic/api/bertopic.html

In [1]:
# Install bertopic
!pip install bertopic flair

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bertopic
  Downloading bertopic-0.15.0-py2.py3-none-any.whl (143 kB)
Collecting flair
  Downloading flair-0.12.2-py3-none-any.whl (373 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.29.tar.gz (5.2 MB)
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap-learn-0.5.3.tar.gz (88 kB)

In [4]:
# Data processing
import pandas as pd
import numpy as np
# Dimension reduction
from umap import UMAP
from sklearn.decomposition import PCA
# Clustering
from hdbscan import HDBSCAN
from sklearn.cluster import KMeans
# Count vectorization
from sklearn.feature_extraction.text import CountVectorizer
# Sentence transformer
from sentence_transformers import SentenceTransformer
# Flair
from transformers.pipelines import pipeline
from flair.embeddings import TransformerDocumentEmbeddings, WordEmbeddings, DocumentPoolEmbeddings, StackedEmbeddings
# Topic model
from bertopic import BERTopic

In [5]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
# Change directory
import os
os.chdir("/content/drive/MyDrive/Colab Notebooks")
# Print out the current directory
!pwd

Mounted at /content/drive
/content/drive/MyDrive/Colab Notebooks


In [6]:
# Read in data
amz_review = pd.read_csv('amazon_cells_labelled.txt', sep='\t', names=['review', 'label'])
# Drop the label
amz_review = amz_review.drop('label', axis=1)
# Take a look at the data
amz_review.head()

Unnamed: 0,review
0,So there is no way for me to plug it in here i...
1,"Good case, Excellent value."
2,Great for the jawbone.
3,Tied to charger for conversations lasting more...
4,The mic is great.


Dimensionality reduction is necessary because clustering models work better on low-dimensional data than on high-dimensional data. Document embeddings usually have hundreds of dimensions, so we reduce the dimensionality before passing the embeddings to a clustering model.
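
One way to see why clustering struggles in high dimensions is that distances between points become nearly indistinguishable as the number of dimensions grows. The sketch below illustrates this on uniformly random points rather than real embeddings. The helper `relative_contrast` and the choice of 5 versus 384 dimensions are assumptions made for this example.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (farthest - nearest) / nearest distance from one query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


# A low-dimensional space (like the UMAP output) versus a space with
# a few hundred dimensions (like typical document embeddings)
contrast_low = relative_contrast(n_points=200, n_dims=5)
contrast_high = relative_contrast(n_points=200, n_dims=384)

print(f"Relative contrast in 5 dimensions:   {contrast_low:.2f}")
print(f"Relative contrast in 384 dimensions: {contrast_high:.2f}")
```

A smaller relative contrast means the nearest and farthest neighbors are almost equally far away, which gives a density-based clustering model little to work with.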

The default algorithm for dimension reduction is UMAP (Uniform Manifold Approximation and Projection). Compared with other dimension reduction techniques such as PCA (Principal Component Analysis), UMAP preserves much of the data's local and global structure when reducing the dimensionality, which is important for representing the semantics of the text data. The UMAP model accepts custom hyperparameters, and another dimension reduction model such as PCA can be passed to umap_model in its place.



In [7]:
# Initiate UMAP
umap_model = UMAP(n_neighbors=15,
                  n_components=5,
                  min_dist=0.0,
                  metric='cosine',
                  random_state=100)
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model)
# Run BERTopic model
topics, probs = topic_model.fit_transform(amz_review['review'])
# Get the list of topics
topic_model.get_topic_info()


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,261,-1_the_it_and_my,"[the, it, and, my, not, is, this, of, on, to]",[I have been very happy with the 510 and have ...
1,0,112,0_phone_had_this_and,"[phone, had, this, and, the, have, verizon, it...","[Great Phone., It was a great phone., This is ..."
2,1,74,1_sound_hear_quality_the,"[sound, hear, quality, the, volume, is, to, yo...",[The sound is clear and the people I talk to o...
3,2,53,2_product_price_good_great,"[product, price, good, great, am, purchase, ha...","[Excellent product for the price., Great produ..."
4,3,44,3_ear_comfortable_fits_ears,"[ear, comfortable, fits, ears, the, earpiece, ...","[It was quite comfortable in the ear., It fits..."
5,4,41,4_headset_best_headphones_sound,"[headset, best, headphones, sound, my, bt, for...","[My headset works just peachy-keen., Best head..."
6,5,38,5_battery_life_original_the,"[battery, life, original, the, is, lasts, hour...","[Battery has no life., Battery life is also gr..."
7,6,36,6_works_worked_work_great,"[works, worked, work, great, far, doesnt, so, ...","[Works great., Works great!., Works great!.]"
8,7,31,7_charger_charge_car_plug,"[charger, charge, car, plug, it, not, work, ch...",[I purcashed this for the car charger and it d...
9,8,30,8_case_cases_holster_but,"[case, cases, holster, but, the, sex, my, made...","[This case seems well made., Great case and pr..."


In [8]:
# PCA for dimensionality reduction
pca_model = PCA(n_components=15)
# Initiate BERTopic
topic_model = BERTopic(umap_model=pca_model)
# Run BERTopic model
topics, probs = topic_model.fit_transform(amz_review['review'])
# Get the list of topics
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,823,-1_the_and_it_is,"[the, and, it, is, to, this, my, not, of, for]","[It is simple to use and I like it., The price..."
1,0,85,0_the_disappointed_very_is,"[the, disappointed, very, is, of, what, easy, ...","[very disappointed., VERY DISAPPOINTED., If yo..."
2,1,48,1_phone_this_the_have,"[phone, this, the, have, and, great, best, had...","[Great Phone., Great Phone., This is a great p..."
3,2,16,2_headset_best_bluetooth_this,"[headset, best, bluetooth, this, excellent, ve...","[Best headset ever!!!., Excellent bluetooth he..."
4,3,16,3_product_price_great_good,"[product, price, great, good, excellent, for, ...","[Excellent product for the price., Great produ..."
5,4,12,4_works_great_worked_well,"[works, great, worked, well, described, fine, ...","[Works great!., Works great., Works great.]"


In [9]:
# Clustering model
hdbscan_model = HDBSCAN(min_cluster_size=10, min_samples=10, metric='euclidean', prediction_data=True)
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
# Run BERTopic model
topics, probs = topic_model.fit_transform(amz_review['review'])
# Get the list of topics
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,261,-1_the_it_and_my,"[the, it, and, my, not, is, this, of, on, to]",[I have been very happy with the 510 and have ...
1,0,112,0_phone_had_this_and,"[phone, had, this, and, the, have, verizon, it...","[Great Phone., It was a great phone., This is ..."
2,1,74,1_sound_hear_quality_the,"[sound, hear, quality, the, volume, is, to, yo...",[The sound is clear and the people I talk to o...
3,2,53,2_product_price_good_great,"[product, price, good, great, am, purchase, ha...","[Excellent product for the price., Great produ..."
4,3,44,3_ear_comfortable_fits_ears,"[ear, comfortable, fits, ears, the, earpiece, ...","[It was quite comfortable in the ear., It fits..."
5,4,41,4_headset_best_headphones_sound,"[headset, best, headphones, sound, my, bt, for...","[My headset works just peachy-keen., Best head..."
6,5,38,5_battery_life_original_the,"[battery, life, original, the, is, lasts, hour...","[Battery has no life., Battery life is also gr..."
7,6,36,6_works_worked_work_great,"[works, worked, work, great, far, doesnt, so, ...","[Works great., Works great!., Works great!.]"
8,7,31,7_charger_charge_car_plug,"[charger, charge, car, plug, it, not, work, ch...",[I purcashed this for the car charger and it d...
9,8,30,8_case_cases_holster_but,"[case, cases, holster, but, the, sex, my, made...","[This case seems well made., Great case and pr..."


In [10]:
# Clustering model
kmeans_model = KMeans(n_clusters=15)
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=kmeans_model)
# Run BERTopic model
topics, probs = topic_model.fit_transform(amz_review['review'])
# Get the list of topics
topic_model.get_topic_info()



Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,103,0_product_price_good_this,"[product, price, good, this, great, was, happy...",[The price was very good and with the free shi...
1,1,100,1_phone_this_and_have,"[phone, this, and, have, had, is, verizon, the...","[Great phone!., Great phone., This is a great ..."
2,2,82,2_battery_charger_the_to,"[battery, charger, the, to, charge, it, plug, ...","[New Battery works great in phone., I got the ..."
3,3,79,3_case_and_nice_it,"[case, and, nice, it, very, is, the, fit, of, ...","[Great case and price!, Good case!., The look ..."
4,4,76,4_sound_the_hear_quality,"[sound, the, hear, quality, is, volume, to, an...","[very clear, quality sound and you don't have ..."
5,5,75,5_is_use_the_to,"[is, use, the, to, device, and, easy, it, soft...","[Very easy to use., Easy to use., It is simple..."
6,6,65,6_service_customer_junk_company,"[service, customer, junk, company, back, piece...","[Worst customer service., Obviously they have ..."
7,7,64,7_waste_money_what_dont,"[waste, money, what, dont, disappointed, not, ...","[Waste of money., Dont waste your money..., Wh..."
8,8,64,8_headset_bluetooth_the_best,"[headset, bluetooth, the, best, my, for, heads...","[Love this headset!, Excellent bluetooth heads..."
9,9,63,9_ear_comfortable_the_my,"[ear, comfortable, the, my, fits, is, jabra, e...","[It was quite comfortable in the ear., It fits..."


### Hyperparameter Tuning for Language Embeddings

In [11]:
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, language="multilingual")

https://www.sbert.net/docs/pretrained_models.html

In [12]:
# Initiate a sentence transformer model
sentence_model = SentenceTransformer("paraphrase-albert-small-v2")
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, embedding_model=sentence_model)
# Run BERTopic model
topics, probs = topic_model.fit_transform(amz_review['review'])
# Get the list of topics
topic_model.get_topic_info()


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,308,-1_it_the_not_to,"[it, the, not, to, my, you, and, for, was, is]",[If you like a loud buzzing to override all yo...
1,0,118,0_product_price_good_this,"[product, price, good, this, recommend, great,...","[Excellent product for the price., Great produ..."
2,1,70,1_phone_this_is_great,"[phone, this, is, great, and, phones, the, bes...","[Great phone., This is a great phone!., Great ..."
3,2,42,2_phone_to_this_same,"[phone, to, this, same, the, my, if, dropped, ...",[You need at least 3 mins to get to your phone...
4,3,40,3_ear_fits_the_ears,"[ear, fits, the, ears, jabra, my, earpiece, co...",[I've tried several different earpieces for my...
5,4,38,4_easy_device_use_to,"[easy, device, use, to, it, is, and, the, grea...","[The handsfree part works fine, but then the c..."
6,5,38,5_it_broke_not_fit,"[it, broke, not, fit, doesn, then, within, the...","[All three broke within two months of use., Fi..."
7,6,37,6_battery_life_original_is,"[battery, life, original, is, the, long, as, d...","[Battery has no life., Battery life is also gr..."
8,7,37,7_sound_poor_the_quality,"[sound, poor, the, quality, low, is, volume, a...","[How can that be?The audio quality is poor., A..."
9,8,32,8_disappointed_very_disappointment_company,"[disappointed, very, disappointment, company, ...","[disappointed., VERY DISAPPOINTED., very disap..."


The Hugging Face model hub hosts thousands of pretrained models. In this example, we load an English model called distilroberta-base in a Hugging Face feature-extraction pipeline and pass the pipeline to the embedding_model parameter.

In [13]:
# Initiate a pretrained model
hf_model = pipeline("feature-extraction", model="distilroberta-base")
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, embedding_model=hf_model)
# Run BERTopic model
topics, probs = topic_model.fit_transform(amz_review['review'])
# Get the list of topics
topic_model.get_topic_info()


Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,979,0_the_and_it_is,"[the, and, it, is, this, to, phone, my, for, of]",[I love my 350 headset.. My Jabra350 bluetooth...
1,1,21,1_do_not_after_sucks,"[do, not, after, sucks, phone, days, piece, bu...","[WARNING - DO NOT BUY!!., DO NOT PURCHASE THIS..."


Flair is an NLP (Natural Language Processing) library that lets us choose almost any embedding model, or combine several embedding models.

To use a single embedding model with Flair, we can pass the model name to TransformerDocumentEmbeddings, and use it as the input for the embedding_model option in BERTopic.

In [14]:
# Initiate a pretrained embedding model
roberta_model = TransformerDocumentEmbeddings('roberta-base')
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, embedding_model=roberta_model)
# Run BERTopic model
topics, probs = topic_model.fit_transform(amz_review['review'])
# Get the list of topics
topic_model.get_topic_info()


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,5,-1_money_my_wasted_needless,"[money, my, wasted, needless, threw, window, w...",[So I basically threw my money out the window ...
1,0,926,0_the_and_it_is,"[the, and, it, is, this, to, phone, my, of, for]","[It is very comfortable on the ear., I've had ..."
2,1,49,1_don_doesn_it_work,"[don, doesn, it, work, buy, not, didn, make, y...","[Don't buy this product., Don't buy it., Don't..."
3,2,20,2_the_to_it_good,"[the, to, it, good, that, but, is, not, this, ...","[They do not last forever, but is not overly e..."


To use multiple embedding models with Flair, we first initiate the pretrained embedding models, then stack them with StackedEmbeddings, and finally pass the stacked embeddings to the BERTopic embedding_model parameter.

In [15]:
# Initiate a pretrained embedding model
roberta_model = TransformerDocumentEmbeddings('roberta-base')
# Initiate another pretrained embedding model ('crawl' loads fastText vectors, not GloVe)
fasttext_embedding = WordEmbeddings('crawl')
# Pool the word vectors into one vector per document
document_fasttext_embeddings = DocumentPoolEmbeddings([fasttext_embedding])
# Stack the two pretrained embedding models
stacked_embeddings = StackedEmbeddings(embeddings=[roberta_model, document_fasttext_embeddings])
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, embedding_model=stacked_embeddings)
# Run BERTopic model
topics, probs = topic_model.fit_transform(amz_review['review'])
# Get the list of topics
topic_model.get_topic_info()






Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,354,-1_the_it_to_and,"[the, it, to, and, my, phone, this, is, was, you]",[You need at least 3 mins to get to your phone...
1,0,199,0_is_the_it_and,"[is, the, it, and, ear, very, to, of, use, not]","[It is simple to use and I like it., It fits m..."
2,1,117,1_have_had_ve_this,"[have, had, ve, this, for, and, with, it, ever...",[I've had this for nearly 2 years and it has w...
3,2,41,2_product_price_great_good,"[product, price, great, good, excellent, for, ...","[Excellent product for the price., Great produ..."
4,3,32,3_was_great_works_packaged,"[was, great, works, packaged, deal, this, item...","[Great it was new packaged nice works good, no..."
5,4,27,4_my_with_work_phone,"[my, with, work, phone, the, motorola, did, no...","[I connected my wife's bluetooth,(Motorola HS8..."
6,5,22,5_great_phone_love_armband,"[great, phone, love, armband, wallet, earphone...","[Great phone!., Great phone., Great phone!.]"
7,6,20,6_the_out_my_beep,"[the, out, my, beep, on, calls, of, off, and, ...",[While I managed to bend the leaf spring back ...
8,7,19,7_do_not_sucks_after,"[do, not, sucks, after, days, piece, phone, bu...","[DO NOT PURCHASE THIS PHONE., AFTER ARGUING WI..."
9,8,18,8_disappointed_very_disappointing_with,"[disappointed, very, disappointing, with, acce...","[disappointed., very disappointed., VERY DISAP..."


### Hyperparameter Tuning for Number of Topics

BERTopic uses the number of clusters created by the HDBSCAN model as the number of topics by default, but we can reduce the number of topics by changing the value of the nr_topics parameter.

* nr_topics=None indicates that there is no topic reduction.
* nr_topics="auto" indicates an automatic topic reduction of the HDBSCAN results by merging topics that are close to each other.
* nr_topics=15 indicates that the target number of topics is 15.
* The nr_topics value should always be smaller than the number of topics created by nr_topics=None.
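
The idea of merging topics that are close to each other can be sketched in plain Python. This is a simplified illustration and not BERTopic's implementation: the function `reduce_topics_sketch`, the two-dimensional topic vectors, and the greedy rule of merging the smallest topic first are all assumptions made for the example. Only the topic sizes are borrowed from the earlier output.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def reduce_topics_sketch(topic_vectors, topic_sizes, nr_topics):
    """Greedily merge the smallest topic into its most similar topic."""
    vectors = dict(topic_vectors)
    sizes = dict(topic_sizes)
    while len(vectors) > nr_topics:
        smallest = min(sizes, key=sizes.get)
        others = [t for t in vectors if t != smallest]
        target = max(others, key=lambda t: cosine(vectors[smallest], vectors[t]))
        # The merged topic vector is the size-weighted average of both topics
        total = sizes[smallest] + sizes[target]
        vectors[target] = [
            (sizes[target] * a + sizes[smallest] * b) / total
            for a, b in zip(vectors[target], vectors[smallest])
        ]
        sizes[target] = total
        del vectors[smallest], sizes[smallest]
    return sizes


# Hypothetical 2-D topic vectors; real topic vectors have many more dimensions
topic_vectors = {
    "battery": [1.0, 0.1],
    "charger": [0.9, 0.2],
    "sound": [0.1, 1.0],
    "headset": [0.2, 0.9],
}
topic_sizes = {"battery": 38, "charger": 31, "sound": 74, "headset": 41}

merged = reduce_topics_sketch(topic_vectors, topic_sizes, nr_topics=2)
print(merged)
```

In this toy setup the charger topic is absorbed by the battery topic and the headset topic by the sound topic, because those pairs point in nearly the same direction.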

In [16]:
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, nr_topics=15)
# Run BERTopic model
topics, probs = topic_model.fit_transform(amz_review['review'])
# Get the list of topics
topic_model.get_topic_info()



Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,261,-1_the_it_and_is,"[the, it, and, is, my, not, this, to, of, on]",[I am very impressed with the job that Motorol...
1,0,199,0_phone_battery_the_this,"[phone, battery, the, this, and, it, to, is, h...","[New Battery works great in phone., It was a g..."
2,1,171,1_the_headset_sound_ear,"[the, headset, sound, ear, is, and, to, my, on...",[This is simply the BEST bluetooth headset for...
3,2,73,2_product_price_good_great,"[product, price, good, great, am, this, very, ...","[Excellent product for the price., Great produ..."
4,3,59,3_recommend_item_would_this,"[recommend, item, would, this, it, device, to,...",[I am not impressed with this and i would not ...
5,4,41,4_service_customer_junk_piece,"[service, customer, junk, piece, bad, poor, qu...","[Worst Customer Service Ever., Customer servic..."
6,5,36,5_works_worked_work_great,"[works, worked, work, great, far, so, doesnt, ...","[Works great!., Works great., Works great.]"
7,6,30,6_case_cases_the_but,"[case, cases, the, but, holster, my, and, of, ...","[This case seems well made., Great case and pr..."
8,7,26,7_waste_money_dont_your,"[waste, money, dont, your, what, time, warning...","[Don't waste your money!., Dont waste your mon..."
9,8,22,8_reception_calls_signal_the,"[reception, calls, signal, the, is, get, very,...","[Bad Reception., I great reception all the tim..."


When the text corpus is large, training a BERTopic model can take a long time. Rerunning the model each time we change the number of topics can waste a lot of time and resources. The good news is that the BERTopic package has a **reduce_topics** method that uses the existing model information to do a topic reduction.

In [17]:
# Further reduce topics
topic_model.reduce_topics(amz_review['review'], nr_topics=10)
# Get the list of topics
topic_model.get_topic_info()



Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,261,-1_the_it_and_is,"[the, it, and, is, my, not, this, to, of, on]",[I am very impressed with the job that Motorol...
1,0,392,0_the_phone_and_is,"[the, phone, and, is, to, it, this, my, with, ...",[I've had this bluetoooth headset for some tim...
2,1,209,1_product_this_great_it,"[product, this, great, it, works, good, price,...","[For the price on Amazon, it is an excellent p..."
3,2,30,2_case_cases_the_but,"[case, cases, the, but, holster, and, my, this...","[This case seems well made., Great case and pr..."
4,3,26,3_waste_money_dont_your,"[waste, money, dont, your, what, time, of, war...","[Don't waste your money!., don't waste your mo..."
5,4,21,4_disappointed_very_disappointing_disappointment,"[disappointed, very, disappointing, disappoint...","[Disappointed!., very disappointed., VERY DISA..."
6,5,18,5_camera_take_the_is,"[camera, take, the, is, pictures, and, cool, a...","[Pros:-Good camera - very nice pictures , also..."
7,6,16,6_buttons_are_keyboard_the,"[buttons, are, keyboard, the, difficult, is, b...","[Reaching for the bottom row is uncomfortable,..."
8,7,15,7_nice_look_sharp_very,"[nice, look, sharp, very, and, cheap, looks, g...","[It looks very nice., Its well-designed and ve..."
9,8,12,8_back_refund_return_company,"[back, refund, return, company, sending, me, u...",[I wish I could return the unit and get back m...


Another way of adjusting the number of topics is to control the minimum number of documents in a topic, which is set with the min_topic_size parameter.

* A low value for min_topic_size allows fewer documents to form a topic, so the topic model produces more topics.
* A high value for min_topic_size requires many documents to form a topic, so the topic model produces fewer topics.
* The default value for min_topic_size is 10. A general guideline is to set a low value for a smaller dataset and a high value for a larger dataset.

With the default clustering model, setting min_topic_size is the same as setting min_cluster_size in HDBSCAN. When a custom hdbscan_model is passed, min_topic_size is not used, so set min_cluster_size on that model instead.

In [18]:
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, min_topic_size=25)
# Run BERTopic model
topics, probs = topic_model.fit_transform(amz_review['review'])
# Get the list of topics
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,310,-1_the_and_it_my,"[the, and, it, my, is, to, of, with, not, phone]",[My experience was terrible..... This was my f...
1,0,274,0_product_this_it_good,"[product, this, it, good, not, great, the, pri...",[The price was very good and with the free shi...
2,1,101,1_phone_this_the_to,"[phone, this, the, to, and, is, have, had, gre...","[Great phone., Great phone., This is a great p..."
3,2,82,2_case_and_the_is,"[case, and, the, is, it, nice, very, fit, look...","[Other than that, the leather is nice and soft..."
4,3,73,3_sound_the_hear_is,"[sound, the, hear, is, quality, and, to, volum...","[very clear, quality sound and you don't have ..."
5,4,49,4_headset_the_for_headphones,"[headset, the, for, headphones, my, headsets, ...","[Best headset ever!!!., Love this headset!, Th..."
6,5,45,5_ear_the_comfortable_fits,"[ear, the, comfortable, fits, my, ears, jabra,...",[I've tried several different earpieces for my...
7,6,37,6_battery_life_the_original,"[battery, life, the, original, is, to, and, af...",[He was very impressed when going from the ori...
8,7,29,7_charger_charge_car_it,"[charger, charge, car, it, plug, the, not, in,...","[Great charger., I purcashed this for the car ..."


### Hyperparameter for Top Words

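* **n_gram_range** specifies the range of n-gram sizes that can appear in a topic representation. The default (1, 1) uses single words only, while (1, 3) also allows two-word and three-word phrases.
* **top_n_words** controls how many words are used to describe each topic. The default is 10.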
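
To make the effect of n_gram_range concrete, here is a minimal plain-Python sketch of n-gram extraction. The function `extract_ngrams` is a hypothetical helper written for this example. It splits on whitespace only, whereas CountVectorizer applies its own tokenization rules.

```python
def extract_ngrams(text, n_gram_range=(1, 1)):
    """Return all n-grams of the text whose size falls within n_gram_range."""
    tokens = text.lower().split()
    low, high = n_gram_range
    ngrams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


review = "battery life is great"
unigrams = extract_ngrams(review, (1, 1))
up_to_trigrams = extract_ngrams(review, (1, 3))

print(unigrams)
print(up_to_trigrams)
```

With (1, 3), phrases such as "battery life" become candidates for the topic representation alongside the single words.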

In [19]:
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, n_gram_range=(1, 3))
# Run BERTopic model
topics, probs = topic_model.fit_transform(amz_review['review'])
# Get the list of topics
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,261,-1_the_it_and_is,"[the, it, and, is, my, not, this, to, of, on]","[The calls drop, the phone comes on and off at..."
1,0,112,0_phone_this phone_this_had,"[phone, this phone, this, had, the, and, it, h...","[Great phone!., If you like a loud buzzing to ..."
2,1,74,1_sound_the_is_hear,"[sound, the, is, hear, quality, and, to, volum...",[It is easy to turn on and off when you are in...
3,2,53,2_product_price_good_the price,"[product, price, good, the price, for the pric...","[Great product for the price!., Excellent prod..."
4,3,44,3_ear_the ear_the_comfortable,"[ear, the ear, the, comfortable, fits, my ear,...",[This is so embarassing and also my ears hurt ...
5,4,41,4_headset_this headset_my_the,"[headset, this headset, my, the, best, sound, ...",[If the two were seperated by a mere 5+ ft I s...
6,5,38,5_battery_the battery_battery is_the,"[battery, the battery, battery is, the, life, ...","[The battery life is highly unacceptable., You..."
7,6,36,6_works_worked_work_great,"[works, worked, work, great, works great, for ...","[Works great!., Everything worked on the first..."
8,7,31,7_charger_charge_the charger_it,"[charger, charge, the charger, it, car, plug, ...",[it did not work in my cell phone plug i am ve...
9,8,30,8_case_this case_cases_the,"[case, this case, cases, the, the case, but, a...","[When I placed my treo into the case, not only..."


In [20]:
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, top_n_words=5)
# Run BERTopic model
topics, probs = topic_model.fit_transform(amz_review['review'])
# Get the top words for topic 1
topic_model.get_topic(1)

[('sound', 0.06238531040131652),
 ('hear', 0.05110186293603991),
 ('quality', 0.04487960503083769),
 ('the', 0.040710454262311985),
 ('volume', 0.03832639720202994)]

### Hyperparameters for Words Universe

There are two ways to control how many words are used in CountVectorizer and c-TF-IDF.

* min_df sets a threshold on document frequency. For example, min_df=10 means that any word appearing in fewer than 10 documents is excluded from the c-TF-IDF calculation. A general guideline is to set a high min_df value for a large corpus and a low value for a small corpus. Note that BERTopic applies the vectorizer to one combined document per topic, so a high min_df can also remove topic-specific words that occur in only a few topics.
* max_features sets the maximum number of words to include in the c-TF-IDF calculation. max_features=1_000 means that only the 1,000 most frequent words in the corpus are included.
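
The two filters can be sketched in plain Python to show how they differ. The function `build_vocabulary` and the four toy documents are assumptions made for this illustration. The sketch mirrors the documented behavior of the two CountVectorizer parameters but does not reproduce its tokenization.

```python
from collections import Counter


def build_vocabulary(documents, min_df=1, max_features=None):
    """Filter terms by document frequency, then keep the most frequent ones."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: the number of documents that contain the term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Term frequency: the total number of occurrences across all documents
    term_freq = Counter(term for tokens in tokenized for term in tokens)
    vocab = [term for term in term_freq if doc_freq[term] >= min_df]
    if max_features is not None:
        vocab = sorted(vocab, key=lambda t: (-term_freq[t], t))[:max_features]
    return sorted(vocab)


docs = [
    "great phone great price",
    "great battery",
    "battery died fast",
    "phone works",
]

vocab_min_df = build_vocabulary(docs, min_df=2)
vocab_max_features = build_vocabulary(docs, max_features=2)

print(vocab_min_df)
print(vocab_max_features)
```

min_df removes every term that occurs in too few documents, however many terms remain, while max_features keeps a fixed number of terms ranked by total frequency.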

In [21]:
# Count vectorizer
vectorizer_model = CountVectorizer(min_df=10)
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, vectorizer_model=vectorizer_model)
# Run BERTopic model
topics, probs = topic_model.fit_transform(amz_review['review'])
# Get the list of topics
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,261,-1_it_the_not_and,"[it, the, not, and, my, is, on, of, this, was]","[The calls drop, the phone comes on and off at..."
1,0,112,0_phone_had_this_have,"[phone, had, this, have, and, was, the, that, ...","[The phone loads super!, This is hands down th..."
2,1,74,1_quality_the_is_you,"[quality, the, is, you, to, on, and, have, use...","[And the sound quality is great., very clear, ..."
3,2,53,2_product_good_great_excellent,"[product, good, great, excellent, this, im, al...","[Great product., A pretty good product., Good ..."
4,3,44,3_one_my_the_in,"[one, my, the, in, is, on, and, be, like, your]",[I usually don't like headbands but this one i...
5,4,41,4_my_for_excellent_from,"[my, for, excellent, from, very, love, just, u...",[I was looking for this headset for a long tim...
6,5,38,5_the_is_to_from,"[the, is, to, from, with, and, has, well, boug...",[Appears to actually outperform the original b...
7,6,36,6_works_work_great_so,"[works, work, great, so, me, has, for, good, h...","[Works great., Works great., Works great!.]"
8,7,31,7_work_not_it_in,"[work, not, it, in, out, to, as, for, time, the]",[I got the car charger and not even after a we...
9,8,30,8_but_an_well_its,"[but, an, well, its, was, the, my, of, this, has]","[Looks good in the picture, but this case was ..."


In [22]:
# Count vectorizer
vectorizer_model = CountVectorizer(max_features=1_000)
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, vectorizer_model=vectorizer_model)
# Run BERTopic model
topics, probs = topic_model.fit_transform(amz_review['review'])
# Get the list of topics
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,261,-1_the_it_and_my,"[the, it, and, my, not, is, this, of, on, to]",[I own a Jabra Earset and was very happy with ...
1,0,112,0_phone_had_this_and,"[phone, had, this, and, have, the, verizon, it...","[O my gosh the best phone I have ever had., It..."
2,1,74,1_sound_hear_quality_the,"[sound, hear, quality, the, volume, is, to, au...",[The sound is clear and the people I talk to o...
3,2,53,2_product_price_good_great,"[product, price, good, great, am, purchase, ha...","[Excellent product for the price., Great produ..."
4,3,44,3_ear_comfortable_fits_ears,"[ear, comfortable, fits, ears, the, comfortabl...","[Painful on the ear., It fits my ear well and ..."
5,4,41,4_headset_best_headphones_sound,"[headset, best, headphones, sound, my, bt, for...","[My headset works just peachy-keen., Love this..."
6,5,38,5_battery_life_original_the,"[battery, life, original, the, lasts, is, hour...","[Battery has no life., The battery works great..."
7,6,36,6_works_worked_work_great,"[works, worked, work, great, far, doesnt, so, ...","[Works great., Works great!., Works great!.]"
8,7,31,7_charger_charge_car_plug,"[charger, charge, car, plug, it, not, work, ch...",[I purcashed this for the car charger and it d...
9,8,30,8_case_cases_holster_but,"[case, cases, holster, but, the, sex, my, made...",[The case is great and works fine with the 680...


### Hyperparameter for Diversifying Topic Representation

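The diversity hyperparameter removes words with the same or similar meanings from a topic's representation. It ranges from 0 to 1, where 0 means least diversity and 1 means most diversity. In BERTopic 0.15.0, the version installed above, BERTopic no longer accepts diversity as a direct argument and raises a TypeError if it is passed. Set it through the MaximalMarginalRelevance representation model instead.

In [23]:
# Maximal marginal relevance representation model
from bertopic.representation import MaximalMarginalRelevance
representation_model = MaximalMarginalRelevance(diversity=0.8)
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, representation_model=representation_model)
# Run BERTopic model
topics, probs = topic_model.fit_transform(amz_review['review'])
# Get the list of topics
topic_model.get_topic_info()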
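
BERTopic diversifies topic words with maximal marginal relevance (MMR). The block below is a minimal sketch of that idea, not BERTopic's own implementation: the 2-D word vectors, the topic vector, and the helper names `cosine` and `mmr` are all made up for illustration. Each step picks the word that is most relevant to the topic while penalizing similarity to words already selected, and `diversity` sets the weight of that penalty.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mmr(topic_vec, word_vecs, top_n=3, diversity=0.8):
    """Greedy MMR selection: relevance to the topic minus redundancy with picked words."""
    relevance = {w: cosine(v, topic_vec) for w, v in word_vecs.items()}
    # Start with the single most relevant word
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(top_n, len(word_vecs)):
        best_word, best_score = None, -math.inf
        for word, vec in word_vecs.items():
            if word in selected:
                continue
            # Redundancy = similarity to the closest already-selected word
            redundancy = max(cosine(vec, word_vecs[s]) for s in selected)
            score = (1 - diversity) * relevance[word] - diversity * redundancy
            if score > best_score:
                best_word, best_score = word, score
        selected.append(best_word)
    return selected

# Hypothetical 2-D embeddings, chosen so that three words are near-duplicates
topic_vec = (1.0, 0.0)
word_vecs = {
    "works": (1.0, 0.0),
    "worked": (0.99, 0.10),
    "work": (0.98, 0.15),
    "great": (0.60, 0.80),
    "charger": (0.50, -0.85),
}

low_diversity = mmr(topic_vec, word_vecs, top_n=3, diversity=0.0)
high_diversity = mmr(topic_vec, word_vecs, top_n=3, diversity=0.8)
print(low_diversity)
print(high_diversity)
```

With diversity set to 0 the selection is purely by relevance, so the near-duplicates "works", "worked", and "work" fill the list. At 0.8 the redundancy penalty dominates and the duplicates give way to less similar words.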


### Hyperparameter for Stopwords


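If the top words representing the topics contain stopwords, we can remove them by passing stop_words="english" to CountVectorizer. The vectorizer only affects the topic representation, not the document embeddings or the clustering, so the topic sizes stay the same.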

In [24]:
# Count vectorizer
vectorizer_model = CountVectorizer(stop_words="english")
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, vectorizer_model=vectorizer_model)
# Run BERTopic model
topics, probs = topic_model.fit_transform(amz_review['review'])
# Get the list of topics
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,261,-1_phone_good_better_dont,"[phone, good, better, dont, motorola, make, fi...",[The pairing of the two devices was so easy it...
1,0,112,0_phone_verizon_ive_tmobile,"[phone, verizon, ive, tmobile, best, phones, n...","[Great phone., Great Phone., This is a great p..."
2,1,74,1_sound_hear_quality_volume,"[sound, hear, quality, volume, audio, talk, pe...","[The sound quality is excellent as well., Poor..."
3,2,53,2_price_product_good_great,"[price, product, good, great, purchase, happy,...","[Excellent product for the price., Great produ..."
4,3,44,3_ear_comfortable_fits_ears,"[ear, comfortable, fits, ears, earpiece, comfo...",[I've tried several different earpieces for my...
5,4,41,4_headset_headphones_best_bt,"[headset, headphones, best, bt, sound, bluetoo...","[Love this headset!, Its the best headset I ha..."
6,5,38,5_battery_life_original_lasts,"[battery, life, original, lasts, hours, litera...","[The battery works great!, Battery life is als..."
7,6,36,6_works_worked_work_far,"[works, worked, work, far, doesnt, great, char...","[Works great., Works great!., Works great.]"
8,7,31,7_charger_charge_plug_car,"[charger, charge, plug, car, chargers, holds, ...",[it did not work in my cell phone plug i am ve...
9,8,30,8_case_cases_holster_sex,"[case, cases, holster, sex, scratched, extra, ...","[Great case and price!, This case seems well m..."
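
To see what stop_words="english" does in isolation, the sketch below fits CountVectorizer on three made-up reviews (not rows from the dataset) with and without the stopword list and compares the vocabularies. The helper name `fitted_vocabulary` is an assumption for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews for illustration only
docs = [
    "The battery life is great and it lasts for hours",
    "This case is not a good fit for my phone",
    "The charger did not work in my car",
]

def fitted_vocabulary(vectorizer, texts):
    """Fit the vectorizer and return its vocabulary as a sorted list of words."""
    return sorted(vectorizer.fit(texts).vocabulary_)

vocab_all = fitted_vocabulary(CountVectorizer(), docs)
vocab_no_stop = fitted_vocabulary(CountVectorizer(stop_words="english"), docs)

# Words dropped by scikit-learn's built-in English stopword list
removed = sorted(set(vocab_all) - set(vocab_no_stop))
print("Removed:", removed)
print("Kept:", vocab_no_stop)
```

If a model has already been fitted, BERTopic's update_topics method accepts a vectorizer_model as well, so the representation can be recomputed without rerunning the embedding and clustering steps.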


### Hyperparameter for Topic Probability Output

* When calculate_probabilities = True, the probability of each document belonging to each topic is calculated. The topic with the highest probability is the predicted topic for a new document. This probability represents how confident we are about finding the topic in the document.

* When calculate_probabilities = False, only the probability of each document's assigned topic is returned, and the probabilities for all the other topics are not calculated. This saves computation time and cost. If there is no new document to predict, we do not need to calculate the probabilities.

We can visualize the probabilities for a single document by passing that document's row of probabilities to visualize_distribution. visualize_distribution has a default probability threshold (min_probability) of 0.015, so only topics with a probability greater than 0.015 are included.

In [25]:
# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, calculate_probabilities=True)
# Run BERTopic model
topics, probabilities = topic_model.fit_transform(amz_review['review'])
# Visualize probability distribution
topic_model.visualize_distribution(topic_model.probabilities_[0], min_probability=0.015)

In [26]:
# Check the content for the first review
amz_review['review'][0]

'So there is no way for me to plug it in here in the US unless I go by a converter.'

In [27]:
# Get probabilities for all topics
topic_model.probabilities_[0]

array([0.01304336, 0.01299055, 0.00881539, 0.01149096, 0.01389758,
       0.02600692, 0.01078748, 0.12686491, 0.01086359, 0.01014737,
       0.01166604, 0.0129515 , 0.00953966, 0.0097009 , 0.03185295,
       0.01051503, 0.01180338, 0.00837704, 0.01532238, 0.01534351,
       0.01160797, 0.01479929, 0.01068419, 0.01943851, 0.00956005])
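
The array above can be inspected directly with NumPy. The sketch below copies the printed values, finds the most probable topic, and lists the topics that clear the 0.015 threshold described earlier. The variable names are assumptions, and position i is assumed to correspond to topic i, with the outlier topic -1 excluded.

```python
import numpy as np

# Probabilities for the first review, copied from the output above
probs = np.array([
    0.01304336, 0.01299055, 0.00881539, 0.01149096, 0.01389758,
    0.02600692, 0.01078748, 0.12686491, 0.01086359, 0.01014737,
    0.01166604, 0.0129515, 0.00953966, 0.0097009, 0.03185295,
    0.01051503, 0.01180338, 0.00837704, 0.01532238, 0.01534351,
    0.01160797, 0.01479929, 0.01068419, 0.01943851, 0.00956005,
])

# Index of the most probable topic and its probability
best_topic = int(np.argmax(probs))
best_prob = float(probs[best_topic])

# Topics with a probability greater than the 0.015 threshold
min_probability = 0.015
shown_topics = np.flatnonzero(probs > min_probability).tolist()

print(best_topic, round(best_prob, 4))
print(shown_topics)
```

For this review, index 7 has the highest probability, and six of the 25 topics pass the threshold.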