# Overview

In this notebook, we perform topic modeling with [BERTopic](https://github.com/MaartenGr/BERTopic) on a dataset of queries.

In [1]:
!pip install bertopic

Collecting bertopic
  Downloading bertopic-0.16.0-py2.py3-none-any.whl.metadata (21 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting sentence-transformers>=0.4.1 (from bertopic)
  Downloading sentence_transformers-2.6.1-py3-none-any.whl.metadata (11 kB)
Collecting cython<3,>=0.27 (from hdbscan>=0.8.29->bertopic)
  Using cached Cython-0.29.37-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl.metadata (3.1 kB)
Downloading bertopic-0.16.0-py2.py3-none-any.whl (154 kB)

In [61]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from bertopic import BERTopic

import re
import string

import nltk


from tqdm import tqdm


# Loading datasets

We will use a dataset of agricultural queries, loading only the `queryInEnglish` column of the CSV file.

In [4]:
data= pd.read_csv("/kaggle/input/stencil/predicted_values.csv", usecols=["queryInEnglish"])

data.head()

| | queryInEnglish |
|---|---|
| 0 | 'If there is a pest in the cultivation of okra... |
| 1 | 'Remedies for leaf mold of the chilli plant ' |
| 2 | 'What is the right time to sow mustard?' |
| 3 | 'Late Wheat Wairati ' |
| 4 | 'Paddy variety' |


# Pipeline of BERTopic

Before we start topic modeling, it helps to know the pipeline of BERTopic. BERTopic can be viewed as a sequence of steps that create its topic representations.

Here is the process:

![https://maartengr.github.io/BERTopic/algorithm/default.svg](https://maartengr.github.io/BERTopic/algorithm/default.svg)

We can adapt the pipeline to the current state of the art with respect to each individual step:

![https://maartengr.github.io/BERTopic/algorithm/modularity.svg](https://maartengr.github.io/BERTopic/algorithm/modularity.svg)

# Pre-calculate Embeddings

We are going to execute the first step of the BERTopic pipeline, which is computing the `embeddings`. If you want to compute embeddings with multiple GPUs, check [Computing Embeddings Streaming](https://www.kaggle.com/code/aisuko/computing-embeddings-streaming) and [Computing Embeddings with Multi GPUs](https://www.kaggle.com/code/aisuko/computing-embeddings-with-multi-gpus).

In [5]:
%%capture
from sentence_transformers import SentenceTransformer

encoder=SentenceTransformer('all-MiniLM-L6-v2').to('cuda')
encoder.max_seq_length=256
encoder

In [25]:
corpus_embeddings=encoder.encode(dataset, show_progress_bar=True)
len(corpus_embeddings)

Batches:   0%|          | 0/1183 [00:00<?, ?it/s]

37838

# Preventing Stochastic Behavior

We generally use a dimensionality reduction algorithm to reduce the size of the embeddings. This is done to mitigate the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) to a certain degree. By default, this is done with `UMAP`, which is an incredible algorithm for reducing dimensional space. However, by default, it shows stochastic behavior, which produces different results each time you run it. To prevent that, we need to set the `random_state` of the model before passing it to BERTopic.
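The effect of a fixed seed can be illustrated without UMAP itself. The sketch below is a toy random projection, not UMAP, and `random_projection` is a hypothetical helper. It only shows that an algorithm driven by a seeded random number generator returns identical output across runs, while a different seed gives a different output.

```python
import random

def random_projection(vectors, n_components, seed):
    """Project vectors onto random directions (toy stand-in, not UMAP)."""
    rng = random.Random(seed)  # seeded generator, similar in spirit to random_state
    dim = len(vectors[0])
    # draw one random direction per output component
    directions = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_components)]
    return [
        [sum(v * d for v, d in zip(vec, direction)) for direction in directions]
        for vec in vectors
    ]

vectors = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0]]
run_a = random_projection(vectors, n_components=2, seed=42)
run_b = random_projection(vectors, n_components=2, seed=42)
run_c = random_projection(vectors, n_components=2, seed=7)

print(run_a == run_b)  # same seed, same result
print(run_a == run_c)  # different seed, different result
```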

In [8]:
from umap import UMAP

umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
umap_model

# Controlling Number of Topics

There is a parameter to control the number of topics, namely `nr_topics`. This parameter merges topics `after` they have been created, which makes it possible to end up with a fixed number of topics. However, it is advised to control the number of topics through the cluster model, which is `HDBSCAN` by default. `HDBSCAN` has a parameter, namely `min_cluster_size`, that indirectly controls the number of topics that will be created. When no custom cluster model is passed, BERTopic exposes the same control as `min_topic_size`.

A higher `min_cluster_size` will generate fewer topics and a lower `min_cluster_size` will generate more topics. Here, we will go with `min_cluster_size=300`.
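To build intuition for why a larger minimum size leads to fewer topics, here is a deliberately simplified sketch. It is not HDBSCAN: the real algorithm works on a density hierarchy, and small groups may be absorbed into larger clusters or labeled as outliers. The group sizes and the helper `count_topics` are made up for illustration.

```python
def count_topics(group_sizes, min_cluster_size):
    """Count groups large enough to survive as a topic (toy rule, not HDBSCAN)."""
    return sum(1 for size in group_sizes if size >= min_cluster_size)

# hypothetical sizes of dense groups of documents in a corpus
group_sizes = [3000, 1800, 900, 400, 250, 90, 60, 45, 20]

counts = {min_size: count_topics(group_sizes, min_size) for min_size in (40, 300, 1000)}
print(counts)
# {40: 8, 300: 4, 1000: 2}
```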

In [28]:
from hdbscan import HDBSCAN

hdbscan_model=HDBSCAN(min_cluster_size=300, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
hdbscan_model

# Improving Default Representation

The default representation of topics is calculated through c-TF-IDF. However, c-TF-IDF is powered by the `CountVectorizer`, which converts text into tokens. Using the `CountVectorizer`, we can do a number of things:
* Remove stopwords
* Ignore infrequent words
* Increase the n-gram range

In other words, we can preprocess the topic representations after documents are assigned to topics. This will not influence the clustering process in any way. Here we will ignore English stopwords and infrequent words. Moreover, by increasing the n-gram range we will consider topic representations that are made up of one or two words.
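The three options above can be sketched in plain Python. This is a minimal illustration and not scikit-learn's implementation: the stop word list is a tiny made-up one, tokenization is a plain whitespace split, and `ngrams` and `build_vocabulary` are hypothetical helpers.

```python
from collections import Counter

STOP_WORDS = {"the", "of", "in", "is", "for", "to", "what"}  # tiny illustrative list

def ngrams(tokens, n_min, n_max):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def build_vocabulary(docs, ngram_range=(1, 2), min_df=2):
    """Keep the n-grams that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        # drop stop words first, then build n-grams from what is left
        tokens = [t for t in doc.lower().split() if t not in STOP_WORDS]
        doc_freq.update(set(ngrams(tokens, *ngram_range)))  # count each document once
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

docs = [
    "what is the right time to sow mustard",
    "right time for paddy",
    "mustard variety",
]
vocabulary = build_vocabulary(docs)
print(vocabulary)
# ['mustard', 'right', 'right time', 'time']
```

Terms such as `sow` or `paddy` appear in only one document and are dropped by `min_df=2`, while the bigram `right time` survives because it occurs in two documents.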

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model=CountVectorizer(stop_words='english', min_df=2, ngram_range=(1,2))
vectorizer_model
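For context on what the vectorizer feeds into, here is a minimal sketch of a c-TF-IDF style weighting. It assumes the form described in the BERTopic documentation, where the frequency of a term inside a class is multiplied by `log(1 + A / f)`, with `A` the average number of words per class and `f` the frequency of the term across all classes. Normalizing the term frequency by class length is an assumption of this sketch, and `c_tf_idf` is a hypothetical helper, not BERTopic's code.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """Score terms per class as tf(t, c) * log(1 + A / f(t)).

    class_docs maps a class (topic) id to the documents assigned to it.
    """
    # treat all documents of a class as one long document
    class_counts = {c: Counter(" ".join(docs).split()) for c, docs in class_docs.items()}
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # A: average number of words per class
    avg_words = sum(total_counts.values()) / len(class_counts)
    scores = {}
    for c, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[c] = {
            term: (freq / n_words) * math.log(1 + avg_words / total_counts[term])
            for term, freq in counts.items()
        }
    return scores

toy_topics = {
    0: ["paddy crop disease", "paddy variety"],
    1: ["mustard crop", "mustard sowing time"],
}
scores = c_tf_idf(toy_topics)
top_term = {c: max(s, key=s.get) for c, s in scores.items()}
print(top_term)
# {0: 'paddy', 1: 'mustard'}
```

In this toy example `crop` occurs in both classes, so it scores lower within class 0 than `disease`, which occurs only there.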

# Additional Representations

In [13]:
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, PartOfSpeech

keybert_model=KeyBERTInspired()

pos_model=PartOfSpeech('en_core_web_sm')

mmr_model=MaximalMarginalRelevance(diversity=0.3)

In [14]:
representation_model={
    'KeyBERT':keybert_model,
    'MMR':mmr_model,
    'POS':pos_model
}

In [17]:
# A list of contractions from http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he's": "he is",
"how'd": "how did",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'll": "i will",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'll": "it will",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"must've": "must have",
"mustn't": "must not",
"needn't": "need not",
"oughtn't": "ought not",
"shan't": "shall not",
"sha'n't": "shall not",
"she'd": "she would",
"she'll": "she will",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"that'd": "that would",
"that's": "that is",
"there'd": "there had",
"there's": "there is",
"they'd": "they would",
"they'll": "they will",
"they're": "they are",
"they've": "they have",
"wasn't": "was not",
"we'd": "we would",
"we'll": "we will",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"where'd": "where did",
"where's": "where is",
"who'll": "who will",
"who's": "who is",
"won't": "will not",
"wouldn't": "would not",
"you'd": "you would",
"you'll": "you will",
"you're": "you are"
}

In [19]:
# Basic text preprocessing
def clean_text(text):
    '''Lowercase the text, expand contractions, and remove links, punctuation,
    newline characters, and words containing numbers.'''
    # convert to lower case
    text = str(text).lower()

    # replace contractions with their longer forms
    text = " ".join(contractions.get(word, word) for word in text.split())

    # remove urls
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # remove punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # remove newline characters
    text = re.sub(r'\n', '', text)
    # remove words containing numbers
    text = re.sub(r'\w*\d\w*', '', text)
    return text

data['queryInEnglish'] = data['queryInEnglish'].apply(clean_text)

In [20]:
dataset=data['queryInEnglish'].to_list()

In [23]:
dataset[0]

'if there is a pest in the cultivation of okra it will have to be given a bean '

# Training

In [29]:
from bertopic import BERTopic

topic_model=BERTopic(
    embedding_model=encoder,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    
    # hyperparameters
    top_n_words=10,
    verbose=True
)

topics, probs=topic_model.fit_transform(dataset, corpus_embeddings)

2024-04-14 10:25:06,554 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-14 10:26:11,797 - BERTopic - Dimensionality - Completed ✓
2024-04-14 10:26:11,799 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-14 10:26:18,840 - BERTopic - Cluster - Completed ✓
2024-04-14 10:26:18,852 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-14 10:26:21,592 - BERTopic - Representation - Completed ✓


In [30]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,KeyBERT,MMR,POS,Representative_Docs
0,-1,20289,-1_crop_scheme_wheat_bph,"[crop, scheme, wheat, bph, variety, rabi, seed...","[agriculture, agriculture related, benefits go...","[crop, scheme, wheat, bph, variety, rabi, seed...","[crop, scheme, wheat, bph, variety, rabi, seed...",[what scheme can i avail of agriculture relate...
1,0,2905,0_paddy_paddy crop_rice_paddy paddy,"[paddy, paddy crop, rice, paddy paddy, control...","[paddy, paddy paddy, paddy bph, bph paddy, pad...","[paddy, paddy crop, rice, paddy paddy, control...","[paddy, rice, control, crop, leaf, disease, pe...","[paddy, paddy, paddy]"
2,1,2291,1_brinjal_tree_mango_fruit,"[brinjal, tree, mango, fruit, coconut, leaves,...","[brinjal fruit, brinjal tree, brinjal trees, b...","[brinjal, tree, mango, fruit, coconut, leaves,...","[brinjal, tree, mango, fruit, coconut, leaves,...","[the brinjal planted fruit on the tree, what a..."
3,2,1916,2_poka_dhana_ra_ki,"[poka, dhana, ra, ki, roga, hai, pain, kana, c...","[roga poka, dhana, bindha poka, pokara, kanda ...","[poka, dhana, ra, ki, roga, hai, pain, kana, c...","[karibi, niyantran, kate, trafa, , , , , , ]","[dhana re kandabindha poka ra parichalana, dha..."
4,3,1304,3_disease_insect_pesticides_pest,"[disease, insect, pesticides, pest, medicine, ...","[disease, disease pest, diseases, disease inse...","[disease, insect, pesticides, pest, medicine, ...","[disease, pesticides, medicine, control, use, ...","[disease, pesticides medicine, what is the med..."
5,4,1177,4_agriculture_odisha_farmer_farming,"[agriculture, odisha, farmer, farming, schemes...","[agriculture, agricultural, scheme agricultura...","[agriculture, odisha, farmer, farming, schemes...","[agriculture, farmer, farming, schemes, farmer...","[what is union agriculture , agriculture, agri..."
6,5,933,5_moong_moong crop_moong cultivation_cultivation,"[moong, moong crop, moong cultivation, cultiva...","[moong, given moong, leaves moong, moong tree,...","[moong, moong crop, moong cultivation, cultiva...","[moong, cultivation, crop, variety, leaves, di...","[moong varait, moong, moong trips]"
7,6,727,6_mustard_mustard crop_mustard cultivation_mus...,"[mustard, mustard crop, mustard cultivation, m...","[mustard, mustard seed, mustard plant, medicin...","[mustard, mustard crop, mustard cultivation, m...","[mustard, variety, duration, crop, cultivation...","[mustard, taria mustard, powerymildew in musta..."
8,7,698,7_worm_worms_case worm_prevention,"[worm, worms, case worm, prevention, case, pad...","[worm, worms, disease worms, worms control, ca...","[worm, worms, case worm, prevention, case, pad...","[worm, worms, prevention, case, paddy, medicin...","[the worm, matiagundia worm, the worm]"
9,8,682,8_wheat_weed_weeds_wheat crop,"[wheat, weed, weeds, wheat crop, weed control,...","[crop weed, medicine wheat, wheat medicine, wh...","[wheat, weed, weeds, wheat crop, weed control,...","[wheat, weed, weeds, control, medicine, weedic...","[weed in wheat, weed in wheat, weed in wheat]"


In [32]:
#To get all representations for a single topic, we simply run the following:
topic_model.get_topic(4, full=True)

{'Main': [('agriculture', 0.11382990994222071),
  ('odisha', 0.10011492607847983),
  ('farmer', 0.08613944452686714),
  ('farming', 0.0725369436856289),
  ('schemes', 0.05837343321792601),
  ('farmers', 0.0577841606173724),
  ('agricultural', 0.05289852394094719),
  ('farm', 0.0445187483714174),
  ('department', 0.044074263716734195),
  ('scheme', 0.04264539155593236)],
 'KeyBERT': [('agriculture', 0.92828184),
  ('agricultural', 0.89815557),
  ('scheme agricultural', 0.7803847),
  ('department agriculture', 0.7570596),
  ('schemes agriculture', 0.7543499),
  ('agriculture department', 0.75192195),
  ('agricultural schemes', 0.7480914),
  ('odisha agriculture', 0.7464147),
  ('agriculture extension', 0.74289244),
  ('farmers', 0.71894413)],
 'MMR': [('agriculture', 0.11382990994222071),
  ('odisha', 0.10011492607847983),
  ('farmer', 0.08613944452686714),
  ('farming', 0.0725369436856289),
  ('schemes', 0.05837343321792601),
  ('farmers', 0.0577841606173724),
  ('agricultural', 0.05289

In [34]:
topic_model.visualize_barchart(top_n_topics =20, n_words = 10).show()

In [35]:
topic_model.visualize_topics().show()

In [36]:


# Build custom topic labels from the top three KeyBERTInspired keywords of each topic
keybert_topic_labels={topic: ' | '.join(list(zip(*values))[0][:3]) for topic, values in topic_model.topic_aspects_['KeyBERT'].items()}
topic_model.set_topic_labels(keybert_topic_labels)

Now that we have set the updated topic labels, we can access them with the many functions used throughout BERTopic. Most notably, we can show the updated labels in visualizations with the `custom_labels=True` parameter. We can also see that `.get_topic_info` now includes the column `CustomName`, which holds the custom label that we just created for each topic.
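To see what the label-building expression in the previous cell does, here is the same logic unrolled on toy data. `toy_aspects` is a made-up stand-in for `topic_model.topic_aspects_['KeyBERT']`, assuming each topic maps to a list of `(word, score)` pairs ordered from most to least relevant.

```python
# toy stand-in for topic_model.topic_aspects_['KeyBERT']: topic id -> list of (word, score)
toy_aspects = {
    0: [("paddy", 0.9), ("paddy crop", 0.8), ("rice", 0.7), ("leaf", 0.5)],
    1: [("mustard", 0.9), ("mustard seed", 0.8), ("mustard plant", 0.7), ("sowing", 0.4)],
}

labels = {}
for topic, values in toy_aspects.items():
    words, _scores = zip(*values)          # unzip the pairs into words and scores
    labels[topic] = " | ".join(words[:3])  # keep the first three words

print(labels)
# {0: 'paddy | paddy crop | rice', 1: 'mustard | mustard seed | mustard plant'}
```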

In [37]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,CustomName,Representation,KeyBERT,MMR,POS,Representative_Docs
0,-1,20289,-1_crop_scheme_wheat_bph,agriculture | agriculture related | benefits g...,"[crop, scheme, wheat, bph, variety, rabi, seed...","[agriculture, agriculture related, benefits go...","[crop, scheme, wheat, bph, variety, rabi, seed...","[crop, scheme, wheat, bph, variety, rabi, seed...",[what scheme can i avail of agriculture relate...
1,0,2905,0_paddy_paddy crop_rice_paddy paddy,paddy | paddy paddy | paddy bph,"[paddy, paddy crop, rice, paddy paddy, control...","[paddy, paddy paddy, paddy bph, bph paddy, pad...","[paddy, paddy crop, rice, paddy paddy, control...","[paddy, rice, control, crop, leaf, disease, pe...","[paddy, paddy, paddy]"
2,1,2291,1_brinjal_tree_mango_fruit,brinjal fruit | brinjal tree | brinjal trees,"[brinjal, tree, mango, fruit, coconut, leaves,...","[brinjal fruit, brinjal tree, brinjal trees, b...","[brinjal, tree, mango, fruit, coconut, leaves,...","[brinjal, tree, mango, fruit, coconut, leaves,...","[the brinjal planted fruit on the tree, what a..."
3,2,1916,2_poka_dhana_ra_ki,roga poka | dhana | bindha poka,"[poka, dhana, ra, ki, roga, hai, pain, kana, c...","[roga poka, dhana, bindha poka, pokara, kanda ...","[poka, dhana, ra, ki, roga, hai, pain, kana, c...","[karibi, niyantran, kate, trafa, , , , , , ]","[dhana re kandabindha poka ra parichalana, dha..."
4,3,1304,3_disease_insect_pesticides_pest,disease | disease pest | diseases,"[disease, insect, pesticides, pest, medicine, ...","[disease, disease pest, diseases, disease inse...","[disease, insect, pesticides, pest, medicine, ...","[disease, pesticides, medicine, control, use, ...","[disease, pesticides medicine, what is the med..."
5,4,1177,4_agriculture_odisha_farmer_farming,agriculture | agricultural | scheme agricultural,"[agriculture, odisha, farmer, farming, schemes...","[agriculture, agricultural, scheme agricultura...","[agriculture, odisha, farmer, farming, schemes...","[agriculture, farmer, farming, schemes, farmer...","[what is union agriculture , agriculture, agri..."
6,5,933,5_moong_moong crop_moong cultivation_cultivation,moong | given moong | leaves moong,"[moong, moong crop, moong cultivation, cultiva...","[moong, given moong, leaves moong, moong tree,...","[moong, moong crop, moong cultivation, cultiva...","[moong, cultivation, crop, variety, leaves, di...","[moong varait, moong, moong trips]"
7,6,727,6_mustard_mustard crop_mustard cultivation_mus...,mustard | mustard seed | mustard plant,"[mustard, mustard crop, mustard cultivation, m...","[mustard, mustard seed, mustard plant, medicin...","[mustard, mustard crop, mustard cultivation, m...","[mustard, variety, duration, crop, cultivation...","[mustard, taria mustard, powerymildew in musta..."
8,7,698,7_worm_worms_case worm_prevention,worm | worms | disease worms,"[worm, worms, case worm, prevention, case, pad...","[worm, worms, disease worms, worms control, ca...","[worm, worms, case worm, prevention, case, pad...","[worm, worms, prevention, case, paddy, medicin...","[the worm, matiagundia worm, the worm]"
9,8,682,8_wheat_weed_weeds_wheat crop,crop weed | medicine wheat | wheat medicine,"[wheat, weed, weeds, wheat crop, weed control,...","[crop weed, medicine wheat, wheat medicine, wh...","[wheat, weed, weeds, wheat crop, weed control,...","[wheat, weed, weeds, control, medicine, weedic...","[weed in wheat, weed in wheat, weed in wheat]"


# Topic-Document Distribution

If using `calculate_probabilities=True` is not possible, then we can approximate the topic-document distributions using `.approximate_distribution`. It is a fast and flexible method for creating different topic-document distributions.
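The `window` and `stride` arguments refer to a sliding window over the tokens of each document. The sketch below is a simplified illustration of that idea, assuming that each token set is scored against a topic and that the score is then added to every token the set covers. BERTopic's actual implementation differs in details such as normalization. The helpers `token_sets` and `token_scores` and the scores themselves are hypothetical.

```python
def token_sets(tokens, window=4, stride=1):
    """Return (start, tokens[start:start + window]) for each sliding-window position."""
    last_start = max(len(tokens) - window, 0)
    return [(start, tokens[start:start + window]) for start in range(0, last_start + 1, stride)]

def token_scores(tokens, set_scores, window=4, stride=1):
    """Add each token set's score to every token it covers and sum per token."""
    totals = [0.0] * len(tokens)
    for (start, members), score in zip(token_sets(tokens, window, stride), set_scores):
        for offset in range(len(members)):
            totals[start + offset] += score
    return totals

tokens = "if there is pest in the cultivation of okra".split()
sets = token_sets(tokens)
# hypothetical similarity of each token set to a single topic
set_scores = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]

print(len(sets), sets[0][1])
# 6 ['if', 'there', 'is', 'pest']
print(token_scores(tokens, set_scores))
# [1.0, 2.0, 2.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

Tokens covered by several high-scoring sets accumulate a larger value, which is why token-level scores tend to rise and fall gradually along a document.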

In [39]:
# `topic_distr` contains the distribution of topics in each document
topic_distr, _ =topic_model.approximate_distribution(dataset, window=8, stride=4)

100%|██████████| 38/38 [00:01<00:00, 34.02it/s]


## Visualization

Visualize the topic-document distribution for a single document

In [44]:
topic_model.visualize_distribution(topic_distr[0])

In [45]:
# Visualize the same topic-document distribution, this time with the custom topic labels
topic_model.visualize_distribution(topic_distr[0], custom_labels=True)

In [46]:

topic_distr, topic_token_distr=topic_model.approximate_distribution(dataset[0], calculate_tokens=True)

# visualize the token-level distributions
df=topic_model.visualize_approximate_distribution(dataset[0], topic_token_distr[0])
df

100%|██████████| 1/1 [00:00<00:00, 240.47it/s]


| | if | there | is | pest | in | the | cultivation | of | okra | it | will | have | to | be | given | bean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0_paddy_paddy crop_rice_paddy paddy | 0.114 | 0.228 | 0.342 | 0.456 | 0.342 | 0.228 | 0.114 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3_disease_insect_pesticides_pest | 0.216 | 0.432 | 0.647 | 0.806 | 0.59 | 0.374 | 0.158 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5_moong_moong crop_moong cultivation_cultivation | 0.0 | 0.0 | 0.0 | 0.0 | 0.109 | 0.109 | 0.109 | 0.109 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 18_groundnut_nut_ground nut_ground | 0.0 | 0.0 | 0.0 | 0.111 | 0.221 | 0.221 | 0.221 | 0.111 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## Tip - use_embedding_model

By default, we compare the c-TF-IDF calculations between the token sets and all topics. Due to its bag-of-words representation, this is quite fast. However, we might want to use the selected `embedding_model` instead to do this comparison. Do note that, due to the many token sets, it is often computationally quite a bit slower:

```python
topic_distr, _ = topic_model.approximate_distribution(dataset, use_embedding_model=True)
```


# Outlier Reduction

By default, HDBSCAN generates outliers, which is a helpful mechanic for creating accurate topic representations. However, you might want to assign every single document to a topic. We can use `.reduce_outliers` to map some or all outliers to a topic:

In [48]:
# Reduce outliers
new_topics=topic_model.reduce_outliers(dataset,topics)

# Reduce outliers with pre-calculated embeddings instead
new_topics=topic_model.reduce_outliers(dataset,topics, strategy='embeddings', embeddings=corpus_embeddings)

100%|██████████| 21/21 [00:01<00:00, 18.72it/s]
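The idea behind the `embeddings` strategy can be sketched as follows, assuming it compares each outlier's embedding with one embedding per topic using cosine similarity and picks the closest topic. The toy vectors, the `threshold` argument, and the helper `reassign_outliers` are illustrative assumptions, not BERTopic's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def reassign_outliers(topics, doc_embeddings, topic_embeddings, threshold=0.0):
    """Move each -1 document to its most similar topic if the similarity reaches the threshold."""
    new_topics = []
    for topic, emb in zip(topics, doc_embeddings):
        if topic == -1:
            best, sim = max(
                ((t, cosine(emb, t_emb)) for t, t_emb in topic_embeddings.items()),
                key=lambda pair: pair[1],
            )
            topic = best if sim >= threshold else -1
        new_topics.append(topic)
    return new_topics

topic_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}  # toy topic vectors
doc_embeddings = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9], [-1.0, -1.0]]
toy_topics = [0, -1, 1, -1]

reassigned = reassign_outliers(toy_topics, doc_embeddings, topic_embeddings, threshold=0.5)
print(reassigned)
# [0, 1, 1, -1]
```

The second document moves to topic 1, while the last one stays an outlier because it is not similar enough to any topic.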


## Note-Update Topics with Outlier Reduction

After having generated updated topic assignments, we can pass them to BERTopic in order to update the topic representations:

```python
topic_model.update_topics(dataset, topics=new_topics)
```

It is important to realize that updating the topics this way may lead to errors if topic reduction or topic merging techniques are used afterwards. The reason is that when you assign one -1 document to topic 1 and another -1 document to topic 2, it is unclear how the -1 topic itself should be mapped: to topic 1 or to topic 2?


# Visualize Topics

With visualizations, we are entering the realm of subjective `best practices`. We can create the visualizations with `topic_model.visualize_topics()` and `topic_model.visualize_hierarchy()`.

In [49]:
topic_model.visualize_topics(custom_labels=True)

# Visualize Documents

When visualizing documents, it helps to have embedded the documents beforehand to speed up computation. Fortunately, we have already done that as a `best practice`. Visualizing documents in 2-dimensional space helps in understanding the underlying structure of the documents and topics.

In [50]:
# reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
reduced_embeddings=UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(corpus_embeddings)

In [53]:
# We can also hide the annotations
topic_model.visualize_documents(dataset, reduced_embeddings=reduced_embeddings, custom_labels=True, hide_annotations=True)

## Note - 2-dimensional space

Although visualizing the documents in 2-dimensional space gives an idea of their underlying structure, there is a risk involved. Visualizing the documents in 2-dimensional space means that we have lost significant information, since the original embeddings have 384 dimensions. Condensing all that information into 2 dimensions is simply not possible. In other words, it is merely an `approximation`, albeit quite an accurate one.
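A tiny example makes the loss of information concrete. The sketch below uses a crude projection that simply drops coordinates, which is not what UMAP does, and `project_2d` is a hypothetical helper. It shows two points that are far apart in three dimensions but coincide once reduced to two.

```python
import math

def project_2d(point):
    """Keep only the first two coordinates (a crude stand-in for dimensionality reduction)."""
    return point[:2]

a = [1.0, 2.0, 0.0]
b = [1.0, 2.0, 5.0]

dist_original = math.dist(a, b)
dist_projected = math.dist(project_2d(a), project_2d(b))
print(dist_original)   # 5.0
print(dist_projected)  # 0.0
```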


# Serialization

When saving a BERTopic model, there are several ways of doing so: we can save the model with `pickle`, `pytorch`, or `safetensors`. When saving a model with `safetensors`, it skips over saving the dimensionality reduction and clustering models. The `.transform` function will still work without these models but will instead assign topics based on the similarity between document embeddings and the topic embeddings.

As a result, the `.transform` step might give different results, but it is generally worth it considering the smaller and significantly faster model.

In [54]:
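# pointer to the embedding model used above; it is stored alongside the saved topic model
embedding_model='sentence-transformers/all-MiniLM-L6-v2'
# the first argument is the output directory (named 'embedding_model' here)
topic_model.save('embedding_model', serialization='safetensors', save_ctfidf=True, save_embedding_model=embedding_model)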