BERTopic is a Python topic modeling library that combines transformer embeddings with clustering algorithms to identify topics in text, a common task in NLP (Natural Language Processing).

#### Document Embedding:

Firstly, we need to get the embeddings for all the documents. Embeddings are the vector representation of the documents.  
  
1. By default, BERTopic uses an English sentence-transformers model to get the document embeddings.  
2. If the documents contain multiple languages, we can use BERTopic(language="multilingual"), which supports topic modeling in over 50 languages.  
3. BERTopic also supports pre-trained models from other Python packages such as Hugging Face Transformers and Flair.


#### Document Clustering:

After the text documents have been transformed into embeddings, the next step is to run a clustering model on the embedded documents. Because the embedding vectors usually have very high dimensionality, a dimension reduction technique is applied first.  
  
1. The default algorithm for dimension reduction is UMAP (Uniform Manifold Approximation & Projection). Compared with linear techniques such as PCA (Principal Component Analysis), UMAP tends to preserve more of the data's local structure while keeping much of its global structure, which is important for representing the semantics of text data. BERTopic lets us use other dimensionality reduction techniques by changing the umap_model argument of the BERTopic constructor.  
2. The default algorithm for clustering is HDBSCAN, a density-based clustering model. It determines the number of clusters automatically, so unlike many clustering models it does not require specifying the number of clusters beforehand.



#### Topic Representation: 

After assigning each document in the corpus to a cluster, the next step is to get the topic representation using a class-based TF-IDF called c-TF-IDF. The words with the highest c-TF-IDF scores are selected to represent each topic.  
  
1. c-TF-IDF is similar to TF-IDF in that it measures term importance by term frequency while taking into account the whole corpus (all the text data in the analysis).  
2. c-TF-IDF differs from TF-IDF in the level at which term frequency is measured. In regular TF-IDF, TF measures the term frequency in each document. In c-TF-IDF, TF measures the term frequency in each cluster, where each cluster contains many documents.  
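
To make the cluster-level term frequency concrete, here is a minimal sketch of a class-based TF-IDF in plain Python. It assumes the commonly cited weighting tf × log(1 + A / f), where A is the average number of words per cluster and f is the frequency of the term across all clusters. The function name c_tf_idf, the toy clusters, and the lack of normalization are illustrative assumptions, not BERTopic's internal implementation.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """clusters: dict mapping a cluster id to a list of documents (strings)."""
    # Term frequency per cluster: all documents of a cluster are
    # concatenated and treated as one long document.
    tf = {cid: Counter(" ".join(docs).lower().split())
          for cid, docs in clusters.items()}

    # Frequency of each term across all clusters.
    total = Counter()
    for counts in tf.values():
        total.update(counts)

    # Average number of words per cluster.
    avg_words = sum(total.values()) / len(tf)

    # Weight each cluster-level term frequency by an IDF-like factor.
    return {
        cid: {term: count * math.log(1 + avg_words / total[term])
              for term, count in counts.items()}
        for cid, counts in tf.items()
    }


# Toy clusters (made-up reviews, already lowercased and cleaned).
clusters = {
    0: ["great sound quality", "sound is loud and clear"],
    1: ["battery died fast", "battery life is short"],
}
scores = c_tf_idf(clusters)
top_terms = {cid: max(s, key=s.get) for cid, s in scores.items()}
print(top_terms)
```

In this toy example, a word such as "is" that appears in both clusters receives a lower score than a word of equal frequency that appears in only one cluster.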

#### Maximal Marginal Relevance (MMR) (optional): 

After extracting the most important terms describing each cluster, there is an optional step to refine the terms using Maximal Marginal Relevance (MMR). MMR has two benefits:  
  
1. It increases the coherence among the terms of the same topic by removing irrelevant terms.  
2. It increases the diversity of the topic representation by removing synonyms and variations of the same word.
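
The trade-off between relevance and redundancy can be sketched with a small greedy MMR selection in NumPy. The function mmr, the diversity parameter name, and the three-dimensional toy vectors are assumptions made for illustration; real word and document embeddings would come from the embedding model.

```python
import numpy as np


def mmr(doc_vec, word_vecs, words, top_n=3, diversity=0.5):
    """Greedily pick words that are relevant to doc_vec but not redundant."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    relevance = [cos(w, doc_vec) for w in word_vecs]
    # Start with the single most relevant word.
    selected = [int(np.argmax(relevance))]

    while len(selected) < min(top_n, len(words)):
        best, best_score = None, -np.inf
        for i in range(len(words)):
            if i in selected:
                continue
            # Redundancy: similarity to the closest already-selected word.
            redundancy = max(cos(word_vecs[i], word_vecs[j]) for j in selected)
            score = (1 - diversity) * relevance[i] - diversity * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)

    return [words[i] for i in selected]


# Made-up vectors: "sound" and "sounds" point in almost the same direction.
words = ["sound", "sounds", "volume", "price"]
word_vecs = np.array([
    [1.00, 0.0, 0.0],
    [0.99, 0.1, 0.0],
    [0.70, 0.7, 0.0],
    [0.00, 0.2, 1.0],
])
doc_vec = np.array([1.0, 0.3, 0.1])

relevant_only = mmr(doc_vec, word_vecs, words, top_n=2, diversity=0.0)
diverse = mmr(doc_vec, word_vecs, words, top_n=2, diversity=0.4)
print(relevant_only, diverse)
```

With diversity set to 0, the two near-duplicate words are both kept. With a moderate diversity value, the second pick moves to a different but still relevant word.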

In [1]:
from bertopic import BERTopic



In [2]:
# Data processing
import pandas as pd
import numpy as np
# Text preprocessing
import nltk
nltk.download('stopwords')
nltk.download('omw-1.4')
nltk.download('wordnet')
wn = nltk.WordNetLemmatizer()
# Topic model
from bertopic import BERTopic
# Dimension reduction
from umap import UMAP

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/keithlowton/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/keithlowton/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/keithlowton/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# Read in data
amz_review = pd.read_csv('amazon_cells_labelled.txt', sep='\t', names=['review', 'label'])
# Drop the label
amz_review = amz_review.drop('label', axis=1)
# Take a look at the data
amz_review.head()

Unnamed: 0,review
0,So there is no way for me to plug it in here i...
1,"Good case, Excellent value."
2,Great for the jawbone.
3,Tied to charger for conversations lasting more...
4,The mic is great.


In [4]:
# Get the dataset information
amz_review.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  1000 non-null   object
dtypes: object(1)
memory usage: 7.9+ KB


### Text Data Preprocessing 

In [5]:
# Load the default English stopwords
stopwords = nltk.corpus.stopwords.words('english')
print(f'There are {len(stopwords)} default stopwords. They are {stopwords}')

There are 179 default stopwords. They are ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'no

In [6]:
# Remove stopwords
amz_review['review_without_stopwords'] = amz_review['review'].apply(lambda x: ' '.join([w for w in x.split() if w.lower() not in stopwords]))
# Lemmatization
amz_review['review_lemmatized'] = amz_review['review_without_stopwords'].apply(lambda x: ' '.join([wn.lemmatize(w) for w in x.split() if w not in stopwords]))
# Take a look at the data
amz_review.head()

Unnamed: 0,review,review_without_stopwords,review_lemmatized
0,So there is no way for me to plug it in here i...,way plug US unless go converter.,way plug US unless go converter.
1,"Good case, Excellent value.","Good case, Excellent value.","Good case, Excellent value."
2,Great for the jawbone.,Great jawbone.,Great jawbone.
3,Tied to charger for conversations lasting more...,Tied charger conversations lasting 45 minutes....,Tied charger conversation lasting 45 minutes.M...
4,The mic is great.,mic great.,mic great.


### Topic Modeling Using BERTopic

By default, the BERTopic model produces different results on each run because of the stochasticity inherited from UMAP.  
  
To get reproducible topics, we need to pass a value to the random_state parameter of the UMAP constructor.  
  
n_neighbors=15 means that the local neighborhood size for UMAP is 15. This parameter controls the balance between local and global structure in the data.  
1. A low value forces UMAP to focus more on local structure, and may lose insights into the big picture.  
2. A high value pushes UMAP to look at the broader neighborhood, and may lose details of the local structure.  
3. The default n_neighbors value for UMAP is 15.  
  
n_components=5 indicates that the target dimension from UMAP is 5. This is the dimension of the data that will be passed into the clustering model.  
  
min_dist controls how tightly UMAP is allowed to pack points together. It is the minimum distance between points in the low-dimensional space.  
  
1. Small values of min_dist result in clumpier embeddings, which is good for clustering. Since our goal for dimension reduction is to build clustering models, we set min_dist to 0.  
2. Large values of min_dist prevent UMAP from packing points together and preserve the broad structure of the data.  
  
metric='cosine' indicates that we will use cosine distance to measure how far apart two embeddings are.  
random_state sets a random seed to make the UMAP results reproducible.  
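
As a quick illustration of the cosine metric, the sketch below computes cosine distance by hand with NumPy. The helper cosine_distance and the toy vectors are assumptions for illustration only. The example shows that cosine distance depends on the direction of the vectors and not on their length.

```python
import numpy as np


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


short_review = np.array([1.0, 2.0, 0.0])
long_review = np.array([10.0, 20.0, 0.0])   # same direction, 10x the magnitude
other_review = np.array([0.0, 0.0, 3.0])    # orthogonal direction

d_same = cosine_distance(short_review, long_review)
d_other = cosine_distance(short_review, other_review)
print(round(d_same, 6), round(d_other, 6))
```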
  
After initiating the UMAP model, we pass it to the BERTopic model, set the language to English, and set the calculate_probabilities parameter to True.
  
Finally, we pass the processed review documents to the topic model and save the results for topics and topic probabilities.  
  
* The values in topics represent the topic each document is assigned to.  
* The values in probabilities represent the probability that a document belongs to each of the topics.
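
To show how the two outputs relate, here is a small sketch that works on a made-up probability matrix with one row per document and one column per topic. The numbers and the 0.1 cutoff are hypothetical. The topic that the clustering step actually assigns (including -1 for outliers) can differ from the most probable topic computed here.

```python
import numpy as np

# Made-up probabilities: 3 documents x 3 topics (rows need not sum to 1).
probabilities = np.array([
    [0.70, 0.20, 0.05],
    [0.10, 0.15, 0.60],
    [0.02, 0.03, 0.01],
])

# Most probable topic and its probability for each document.
most_likely = probabilities.argmax(axis=1)
confidence = probabilities.max(axis=1)

# Hypothetical rule: treat documents below the cutoff as outliers (-1).
cutoff = 0.1
assigned = np.where(confidence >= cutoff, most_likely, -1)
print(most_likely, assigned)
```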

In [None]:
# Initiate UMAP
umap_model = UMAP(n_neighbors=15, 
                  n_components=5, 
                  min_dist=0.0, 
                  metric='cosine', 
                  random_state=100)

# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, language="english", calculate_probabilities=True)

# Run BERTopic model
topics, probabilities = topic_model.fit_transform(amz_review['review_lemmatized'])

### Extract Topics From Topic Modeling

Calling the get_topic_info() method on the topic model returns the list of topics. The output has 31 rows in total.
  
* Topic -1 should be ignored. It indicates that the reviews are not assigned to any specific topic. The count for topic -1 is 322, meaning that 322 reviews are outliers that do not belong to any topic.  
* Topics 0 to 29 are the 30 topics created for the reviews. They are ordered by the number of reviews in each topic, so topic 0 has the highest number of reviews.  
* The Name column lists the top terms for each topic. For example, the top 4 terms for topic 0 are sound, quality, volume, and audio, indicating that it is a topic related to sound quality.
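
The counts described above can also be derived directly from a list of topic assignments. The sketch below uses a short made-up list in place of the real topics output to compute the share of outliers and the topics ordered by size.

```python
from collections import Counter

# Made-up topic assignments; -1 marks outliers.
topics = [0, -1, 1, 0, -1, 2, 0, 1, -1, 0]

counts = Counter(topics)
outlier_share = counts[-1] / len(topics)

# Topics ordered from largest to smallest, ignoring the outlier topic.
ranked = [t for t, _ in counts.most_common() if t != -1]
print(outlier_share, ranked)
```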

In [None]:
# Get the list of topics
topic_model.get_topic_info()

In [None]:
# Get top 10 terms for a topic
topic_model.get_topic(0)

In [None]:
# Visualize top topic keywords
topic_model.visualize_barchart(top_n_topics=12)

In [None]:
# Visualize term rank decrease
topic_model.visualize_term_rank()

### Topic Similarities

In [None]:
# Visualize intertopic distance
topic_model.visualize_topics()

In [None]:
# Visualize connections between topics using hierarchical clustering
topic_model.visualize_hierarchy(top_n_topics=10)

In [None]:
# Visualize similarity using heatmap
topic_model.visualize_heatmap()

### Topic Model Predicted Probabilities

In [None]:
# Visualize probability distribution
topic_model.visualize_distribution(topic_model.probabilities_[0], min_probability=0.015)

### Topic Model In-sample Predictions

In [None]:
# Get the topic predictions
topic_prediction = topic_model.topics_[:]
# Save the predictions in the dataframe
amz_review['topic_prediction'] = topic_prediction
# Take a look at the data
amz_review.head()

### Topic Model Predictions on New Data

First, let's decide the number of topics to include in the prediction.  
  
1. If we would like to assign only one topic to the document, then the number of topics should be 1.  
2. If we would like to assign multiple topics to the document, then the number of topics should be greater than 1. Here we retrieve the top 3 topics that are most relevant to the new review.  

After that, we pass the new review and the number of topics to the find_topics method. This gives us the topic numbers and their similarity values.  

Finally, the results are printed. The top 3 similar topics for the new review are topic 1, topic 0, and topic 2, with similarities of 0.43, 0.34, and 0.30.  
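
Conceptually, this kind of lookup ranks topics by the similarity between the embedding of the new text and an embedding for each topic. The sketch below reproduces that idea with cosine similarity on made-up two-dimensional vectors. The function top_similar_topics and all vectors are illustrative assumptions and do not show BERTopic's internal code.

```python
import numpy as np


def top_similar_topics(query_vec, topic_vecs, top_n=3):
    """Return the ids and cosine similarities of the top_n closest topics."""
    query = query_vec / np.linalg.norm(query_vec)
    topics = topic_vecs / np.linalg.norm(topic_vecs, axis=1, keepdims=True)
    sims = topics @ query                      # cosine similarity per topic
    order = np.argsort(sims)[::-1][:top_n]     # highest similarity first
    return order.tolist(), sims[order].tolist()


# Made-up topic embeddings (row index = topic id) and a query embedding.
topic_vecs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.0],
])
query_vec = np.array([2.0, 1.0])

ids, sims = top_similar_topics(query_vec, topic_vecs, top_n=3)
print(ids, np.round(sims, 2))
```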

In [None]:
# New data for the review
new_review = "I like the new headphone. Its sound quality is great."
# Find topics
num_of_topics = 3
similar_topics, similarity = topic_model.find_topics(new_review, top_n=num_of_topics)
# Print results
print(f'The top {num_of_topics} similar topics are {similar_topics}, and the similarities are {np.round(similarity,2)}')

In [None]:
# Print the top keywords for the top similar topics
for i in range(num_of_topics):
  print(f'The top keywords for topic {similar_topics[i]} are:')
  print(topic_model.get_topic(similar_topics[i]))

### Save and Load Topic Models

In [None]:
# Save the topic model
topic_model.save("amz_review_topic_model")
# Load the topic model
my_model = BERTopic.load("amz_review_topic_model")