<div class="alert"  style="background-color:#1f2e6b; color:white; padding:0px 5px; border-radius:10px;"><h2 style='margin:10px'>BERTopic</h2></div>

In [None]:
%pip install bertopic # Necessary to run in Google Colab

Finally, we will try out BERTopic. In simple terms, this model's pipeline starts by using a BERT transformer to embed the documents and then applies a clustering algorithm to group similar documents together.

As we will see, the clusters it produces are coherent and easy to read.
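The embed-then-cluster idea can be sketched without BERTopic itself. The snippet below is only an illustration under stated assumptions: TF-IDF vectors stand in for the transformer embeddings, KMeans stands in for the clustering step (BERTopic's defaults are UMAP followed by HDBSCAN), and the toy documents are made up.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents: two about websites, two about marketing
toy_docs = [
    "build a responsive website with html and css",
    "fix css layout bugs on our website",
    "run a social media marketing campaign",
    "plan marketing posts for social media",
]

# Step 1: embed. TF-IDF is a stand-in for the transformer embeddings.
embeddings = TfidfVectorizer().fit_transform(toy_docs)

# Step 2: cluster the document vectors (KMeans is a stand-in as well).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

for doc, label in zip(toy_docs, labels):
    print(label, doc)
```

Documents that share vocabulary end up with the same label. BERTopic follows the same two steps, but with semantic embeddings it can also group documents that use different words for the same idea.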

In [2]:
from bertopic import BERTopic # Import the BERTopic model

In [52]:
import pandas as pd

In [53]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words="english")

We will initialize BERTopic with a pretrained MiniLM model (`all-MiniLM-L6-v2`), a lightweight sentence-transformer that maps each sentence to a dense vector.


In [5]:
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2") 

While BERTopic suggests keeping stop words to make its model perform better, I have found better results on my datasets when removing them. Perhaps the stop-word "noise" is only helpful when the model is fed more text; here, each set of documents we pass to the model is much smaller because it has been filtered by job category.

In [None]:
topic_model_yes_vec = BERTopic(embedding_model="all-MiniLM-L6-v2", vectorizer_model=vectorizer_model)
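To see what the `stop_words="english"` setting changes, the sketch below fits two `CountVectorizer` instances on made-up sentences and compares their vocabularies. As far as I understand BERTopic's design, the `vectorizer_model` is only used to build the topic keywords after the documents have been embedded and clustered, so the embeddings still see the full text; treat that as an assumption rather than something verified in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up job-description fragments, for illustration only
toy_docs = [
    "We are looking for a developer to build the website",
    "The developer will maintain the website and the database",
]

with_stop_words = CountVectorizer().fit(toy_docs)
without_stop_words = CountVectorizer(stop_words="english").fit(toy_docs)

# The second vocabulary drops words such as "the", "and", "for"
print(sorted(with_stop_words.vocabulary_))
print(sorted(without_stop_words.vocabulary_))
```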

Now it's time to import and process the dataset.

In [9]:
full_df = pd.read_csv("/content/drive/MyDrive/!AAA/full.csv") # The dataset's personal Google Drive path is specified

In [10]:
full_df_index = full_df["Category_1"].value_counts().head(20).index

In [11]:
full_df_index

Index(['Web Development', 'Social Media Marketing', 'Marketing Strategy',
       '3D Modeling', 'Email Marketing', 'Mobile App Development', '3D Design',
       'Market Research', '3D Rendering', 'Android App Development',
       '3D Animation', 'Python', 'Data Scraping', 'Data Analysis',
       'Search Engine Marketing', 'Sales & Marketing', 'Influencer Marketing',
       'Data Science', 'B2B Marketing', 'Lead Generation'],
      dtype='object')

In [14]:
docs = full_df["Description"].tolist()

The following creates a list that holds, for each of the most popular job categories, the entirety of its job descriptions, so that a BERTopic model can easily be trained on each category in the for loop further below.

In [56]:
small_doc_array = []

for index in full_df_index:
  condition1 = full_df["Category_1"] == index
  docs = full_df[condition1]["Description"].tolist()
  small_doc_array.append(docs)

In [57]:
len(small_doc_array)

20
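The same grouping can also be kept in a dictionary keyed by category name, which avoids having to line up positions in two separate structures. The sketch below uses a made-up toy frame with the same two column names; the `dropna()` and `astype(str)` calls are a precaution I am assuming is useful, since missing descriptions would otherwise reach the model as non-string values.

```python
import pandas as pd

# Toy stand-in for full_df, with the same column names
toy_df = pd.DataFrame({
    "Category_1": ["Python", "Python", "3D Design", "Python", "3D Design", "Lead Generation"],
    "Description": ["d1", "d2", "d3", "d4", "d5", "d6"],
})

# Keep only the most frequent categories (2 here, 20 in the notebook)
top_categories = toy_df["Category_1"].value_counts().head(2).index

# Map each category name to the list of its descriptions
docs_by_category = {
    category: toy_df.loc[toy_df["Category_1"] == category, "Description"]
    .dropna()
    .astype(str)
    .tolist()
    for category in top_categories
}

print({category: len(docs) for category, docs in docs_by_category.items()})
```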

In [60]:
full_df_index[0]

'Web Development'

# Training


The following loop trains a model for each category and saves it, so that the models can easily be loaded once the project has been deployed.

In [None]:
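for curr_index in range(len(small_doc_array)):
  # Fit on this category's job descriptions (model with the stop-word vectorizer)
  topics, probs = topic_model_yes_vec.fit_transform(small_doc_array[curr_index])
  # Save under the category name, keeping the c-TF-IDF representation
  topic_model_yes_vec.save(f"/content/drive/MyDrive/!AAA/topic_model_save/{full_df_index[curr_index]}", serialization="safetensors", save_ctfidf=True)
  print(full_df_index[curr_index]) # Progress indicator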
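One variation worth considering is to create a fresh model for every category instead of refitting the same instance, so that no state can carry over between categories. The sketch below is a hypothetical helper (`train_and_save_all` and `DummyModel` are names I made up); it uses a dummy stand-in so it runs without BERTopic, and it only assumes the model exposes `fit_transform` and `save` with the arguments used in the loop above.

```python
from pathlib import Path
from typing import Callable, Sequence


def train_and_save_all(
    categories: Sequence[str],
    doc_groups: Sequence[Sequence[str]],
    make_model: Callable[[], object],
    save_dir: str,
) -> list[str]:
    """Fit one fresh model per category and save it under save_dir/<category>."""
    saved_paths = []
    for category, docs in zip(categories, doc_groups):
        model = make_model()  # new, unfitted model for every category
        model.fit_transform(list(docs))
        path = str(Path(save_dir) / category)
        model.save(path, serialization="safetensors", save_ctfidf=True)
        saved_paths.append(path)
        print(category)  # progress indicator
    return saved_paths


class DummyModel:
    """Stand-in with the two methods the loop needs; not a real topic model."""

    def __init__(self):
        self.n_docs = None

    def fit_transform(self, docs):
        self.n_docs = len(docs)
        return [0] * len(docs), None

    def save(self, path, serialization, save_ctfidf):
        # A real model would write files here; the dummy only reports.
        print(f"would save a model fitted on {self.n_docs} docs to {path}")


paths = train_and_save_all(
    ["Python", "Data Science"],
    [["doc a", "doc b"], ["doc c"]],
    DummyModel,
    "topic_model_save",
)
```

With BERTopic, the factory could be something like `lambda: BERTopic(embedding_model="all-MiniLM-L6-v2", vectorizer_model=vectorizer_model)`, with `full_df_index` and `small_doc_array` passed as the first two arguments.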

# Loading models

This code loads the model for whichever category the end user wants to look up.

In [44]:
curr_index = 0
print(full_df_index[curr_index])

loaded_model = BERTopic.load(f"/content/drive/MyDrive/!AAA/topic_model_save/{full_df_index[curr_index]}")


Web Development
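In a deployed setting the category would come from the user rather than from a hard-coded index. The sketch below shows one way to turn a user's input into a saved-model path; `resolve_model_path`, `SAVE_DIR`, and the shortened `CATEGORIES` list are hypothetical names introduced for illustration, and the relative folder is an assumption standing in for the Google Drive path used above.

```python
from pathlib import Path

SAVE_DIR = Path("topic_model_save")  # assumed location of the saved models
CATEGORIES = ["Web Development", "Python", "Data Science"]  # subset for illustration


def resolve_model_path(user_query: str) -> Path:
    """Match the query to a known category, ignoring case and outer spaces."""
    lookup = {name.casefold(): name for name in CATEGORIES}
    key = user_query.strip().casefold()
    if key not in lookup:
        raise KeyError(f"No saved model for category {user_query!r}")
    return SAVE_DIR / lookup[key]


print(resolve_model_path("  web development "))
```

The resulting path can then be passed to `BERTopic.load` in the same way as in the cell above.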


# Visualization

Now we're ready to visualize what our topics look like, according to BERTopic.


In [45]:
print(full_df_index[curr_index])
loaded_model.visualize_barchart() # call loaded model

Web Development


It is clear that the clusters BERTopic produces are comprehensive and easy to read, and that they require very little tinkering compared to the other models. The downside remains that the meaning of each topic still has to be inferred by the user.