# What is BERTopic
BERTopic is a topic modelling framework that embeds documents with transformer-based models and then clusters those embeddings into topics. See the [quick start](https://maartengr.github.io/BERTopic/index.html#quick-start) for details.

Here, BERTopic is used to group similar job postings into clusters (topics) and to extract the keywords that describe each cluster (topic representations).

# Setting up cuML and other packages

In [1]:
!pip install bertopic --upgrade
!pip install cudf-cu12 dask-cudf-cu12 --extra-index-url=https://pypi.nvidia.com
!pip install cuml-cu12 --extra-index-url=https://pypi.nvidia.com
!pip install cugraph-cu12 --extra-index-url=https://pypi.nvidia.com
!pip install --upgrade cupy-cuda12x -f https://pip.cupy.dev/aarch64

Collecting bertopic
  Downloading bertopic-0.16.2-py2.py3-none-any.whl (158 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.37-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap_learn-0.5.6-py3-none-any.whl (85 kB)
Collecting sentence-transformers>=0.4.1 (from bertopic)
 

# Downloading dataset from Kaggle
This requires a kaggle.json API token in the current working directory. You can download it from your Kaggle account settings.

In [2]:
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!kaggle datasets download -d arshkon/linkedin-job-postings
!unzip linkedin-job-postings.zip

Dataset URL: https://www.kaggle.com/datasets/arshkon/linkedin-job-postings
License(s): CC-BY-SA-4.0
Downloading linkedin-job-postings.zip to /content
 99% 156M/158M [00:11<00:00, 12.1MB/s]
100% 158M/158M [00:11<00:00, 14.0MB/s]
Archive:  linkedin-job-postings.zip
  inflating: companies/companies.csv  
  inflating: companies/company_industries.csv  
  inflating: companies/company_specialities.csv  
  inflating: companies/employee_counts.csv  
  inflating: jobs/benefits.csv       
  inflating: jobs/job_industries.csv  
  inflating: jobs/job_skills.csv     
  inflating: jobs/salaries.csv       
  inflating: mappings/industries.csv  
  inflating: mappings/skills.csv     
  inflating: postings.csv            


In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP



# Loading and Processing the Data

In [3]:
df = pd.read_csv("/content/postings.csv")

df

Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,expiry,closed_time,formatted_experience_level,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,HOURLY,"Princeton, NJ",2774458.0,20.0,,...,1.715990e+12,,,Requirements: \n\nWe are seeking a College or ...,1.713398e+12,,0,FULL_TIME,USD,BASE_SALARY
1,1829192,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,HOURLY,"Fort Collins, CO",,1.0,,...,1.715450e+12,,,,1.712858e+12,,0,FULL_TIME,USD,BASE_SALARY
2,10998357,The National Exemplar,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,YEARLY,"Cincinnati, OH",64896719.0,8.0,,...,1.715870e+12,,,We are currently accepting resumes for FOH - A...,1.713278e+12,,0,FULL_TIME,USD,BASE_SALARY
3,23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,YEARLY,"New Hyde Park, NY",766262.0,16.0,,...,1.715488e+12,,,This position requires a baseline understandin...,1.712896e+12,,0,FULL_TIME,USD,BASE_SALARY
4,35982263,,Service Technician,Looking for HVAC service tech with experience ...,80000.0,YEARLY,"Burlington, IA",,3.0,,...,1.716044e+12,,,,1.713452e+12,,0,FULL_TIME,USD,BASE_SALARY
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123844,3906267117,Lozano Smith,Title IX/Investigations Attorney,Our Walnut Creek office is currently seeking a...,195000.0,YEARLY,"Walnut Creek, CA",56120.0,1.0,,...,1.716163e+12,,Mid-Senior level,,1.713571e+12,,0,FULL_TIME,USD,BASE_SALARY
123845,3906267126,Pinterest,"Staff Software Engineer, ML Serving Platform",About Pinterest:\n\nMillions of people across ...,,,United States,1124131.0,3.0,,...,1.716164e+12,,Mid-Senior level,,1.713572e+12,www.pinterestcareers.com,0,FULL_TIME,,
123846,3906267131,EPS Learning,"Account Executive, Oregon/Washington",Company Overview\n\nEPS Learning is a leading ...,,,"Spokane, WA",90552133.0,3.0,,...,1.716164e+12,,Mid-Senior level,,1.713572e+12,epsoperations.bamboohr.com,0,FULL_TIME,,
123847,3906267195,Trelleborg Applied Technologies,Business Development Manager,The Business Development Manager is a 'hunter'...,,,"Texas, United States",2793699.0,4.0,,...,1.716165e+12,,,,1.713573e+12,,0,FULL_TIME,,


## Column Selection
I use only the title and description columns for clustering, because they are the only free-text content in a job posting. Other attributes such as currency, location, and company could also have been included, but I believe they are better suited to filtering the results.

In [4]:
df = df[["job_id", "title", "description"]]
df

Unnamed: 0,job_id,title,description
0,921716,Marketing Coordinator,Job descriptionA leading real estate firm in N...
1,1829192,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ..."
2,10998357,Assitant Restaurant Manager,The National Exemplar is accepting application...
3,23221523,Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...
4,35982263,Service Technician,Looking for HVAC service tech with experience ...
...,...,...,...
123844,3906267117,Title IX/Investigations Attorney,Our Walnut Creek office is currently seeking a...
123845,3906267126,"Staff Software Engineer, ML Serving Platform",About Pinterest:\n\nMillions of people across ...
123846,3906267131,"Account Executive, Oregon/Washington",Company Overview\n\nEPS Learning is a leading ...
123847,3906267195,Business Development Manager,The Business Development Manager is a 'hunter'...


## Cleaning

Rows that share the same title and description are dropped, since duplicates contribute nothing to the topic representations.

Extra Note: Extensive cleaning such as punctuation or HTML removal has not been done because transformer embeddings are used. The author of BERTopic generally recommends against pre-processing the text, and suggests relying instead on the post-processing options the framework provides for constructing and refining topic representations.

In [5]:
df = df.drop_duplicates(subset=["title", "description"])
df

Unnamed: 0,job_id,title,description
0,921716,Marketing Coordinator,Job descriptionA leading real estate firm in N...
1,1829192,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ..."
2,10998357,Assitant Restaurant Manager,The National Exemplar is accepting application...
3,23221523,Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...
4,35982263,Service Technician,Looking for HVAC service tech with experience ...
...,...,...,...
123844,3906267117,Title IX/Investigations Attorney,Our Walnut Creek office is currently seeking a...
123845,3906267126,"Staff Software Engineer, ML Serving Platform",About Pinterest:\n\nMillions of people across ...
123846,3906267131,"Account Executive, Oregon/Washington",Company Overview\n\nEPS Learning is a leading ...
123847,3906267195,Business Development Manager,The Business Development Manager is a 'hunter'...


In [6]:
# Handle null values and empty strings.
# .copy() makes df an independent DataFrame, so the column assignments
# below do not trigger pandas' SettingWithCopyWarning.

df = df.dropna(subset=["description"]).copy()
df["title"] = df["title"].fillna("")
df["text_representation"] = df["title"] + ": " + df["description"]
df


Unnamed: 0,job_id,title,description,text_representation
0,921716,Marketing Coordinator,Job descriptionA leading real estate firm in N...,Marketing Coordinator: Job descriptionA leadin...
1,1829192,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",Mental Health Therapist/Counselor: At Aspen Th...
2,10998357,Assitant Restaurant Manager,The National Exemplar is accepting application...,Assitant Restaurant Manager: The National Exem...
3,23221523,Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,Senior Elder Law / Trusts and Estates Associat...
4,35982263,Service Technician,Looking for HVAC service tech with experience ...,Service Technician: Looking for HVAC service ...
...,...,...,...,...
123844,3906267117,Title IX/Investigations Attorney,Our Walnut Creek office is currently seeking a...,Title IX/Investigations Attorney: Our Walnut C...
123845,3906267126,"Staff Software Engineer, ML Serving Platform",About Pinterest:\n\nMillions of people across ...,"Staff Software Engineer, ML Serving Platform: ..."
123846,3906267131,"Account Executive, Oregon/Washington",Company Overview\n\nEPS Learning is a leading ...,"Account Executive, Oregon/Washington: Company ..."
123847,3906267195,Business Development Manager,The Business Development Manager is a 'hunter'...,Business Development Manager: The Business Dev...


In [8]:
df.isna().sum()

job_id                 0
title                  0
description            0
text_representation    0
dtype: int64

# Embeddings

I use the SBERT model "all-MiniLM-L6-v2" to encode the documents. BERTopic can compute embeddings internally, but I compute them explicitly so that they can be saved and reloaded, since they are expensive to compute.

In [7]:
# Path where the embeddings are cached on Google Drive
file_path = "/content/drive/MyDrive/embeddings.npy"

# Check if 'embeddings.npy' exists
if os.path.exists(file_path):
    # Load the embeddings from the file
    embeddings = np.load(file_path)
    print(f"Embeddings loaded from {file_path}")
else:
    # Compute the embeddings
    sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = sentence_model.encode(df["text_representation"].tolist(), show_progress_bar=True)

    # Save the embeddings to a file
    np.save(file_path, embeddings)
    print(f"Embeddings computed and saved to {file_path}")


Embeddings loaded from /content/drive/MyDrive/embeddings.npy
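
One caveat with this caching pattern: the file on disk is reused even if `df` has changed since the embeddings were computed, and a row-count mismatch would only surface later during fitting. Below is a minimal sketch of a guard. The helper `load_or_compute` and the stand-in `fake_encode` (used here in place of `sentence_model.encode`) are hypothetical names introduced for illustration.

```python
import os
import tempfile

import numpy as np


def load_or_compute(path, docs, encode):
    """Return cached embeddings if they match `docs`, else recompute and save.

    `encode` is any callable mapping a list of strings to a 2-D array
    (hypothetical helper; in the notebook this would be sentence_model.encode).
    """
    if os.path.exists(path):
        cached = np.load(path)
        # Reuse the cache only if it has exactly one row per document.
        if cached.shape[0] == len(docs):
            return cached
    embeddings = np.asarray(encode(docs))
    np.save(path, embeddings)
    return embeddings


def fake_encode(docs):
    # Stand-in for a real embedding model: one 3-dimensional vector per document.
    return np.array([[len(d), d.count(" "), 1.0] for d in docs], dtype=np.float32)


cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.npy")
first = load_or_compute(cache_path, ["data scientist", "nurse"], fake_encode)
# A changed document list invalidates the cache, so the embeddings are recomputed.
second = load_or_compute(cache_path, ["data scientist", "nurse", "chef"], fake_encode)
```

Note that a row-count check only catches changes in the number of documents. If the text itself changes while the count stays the same, the cache would still need to be deleted by hand.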


# Topic Modelling / Clustering

In [8]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# UMAP is used for dimensionality reduction. This is applied on the encoded embeddings.
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)

# HDBSCAN is used to construct the clusters from the reduced embeddings
hdbscan_model = HDBSCAN(min_samples=20, min_cluster_size=50, gen_min_span_tree=True, prediction_data=True)

# Class-based TF-IDF (c-TF-IDF) model used to extract topic representations from the clusters.
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Stop words are removed only when building topic representations (post-processing); the embeddings are unaffected.
vectorizer_model = CountVectorizer(stop_words="english")

topic_model = BERTopic(
    verbose=True,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    calculate_probabilities=True,
    nr_topics='auto'
)

topic_model.fit(df["text_representation"], embeddings=embeddings)

2024-06-19 10:38:40,939 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-06-19 10:39:04,428 - BERTopic - Dimensionality - Completed ✓
2024-06-19 10:39:04,442 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-06-19 10:39:08,469 - BERTopic - Cluster - Completed ✓
2024-06-19 10:39:08,473 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-06-19 10:40:09,427 - BERTopic - Representation - Completed ✓
2024-06-19 10:40:09,526 - BERTopic - Topic reduction - Reducing number of topics
2024-06-19 10:41:09,638 - BERTopic - Topic reduction - Reduced number of topics from 198 to 94


<bertopic._bertopic.BERTopic at 0x7fcc23137ca0>

In [22]:
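# With safetensors serialization, save_embedding_model takes the Hugging Face model name as a string.
topic_model.save("/content/drive/MyDrive/linkedin-job-postings-bertopic-model", serialization="safetensors", save_ctfidf=True, save_embedding_model="sentence-transformers/all-MiniLM-L6-v2")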

In [None]:
topic_model = BERTopic.load("/content/drive/MyDrive/linkedin-job-postings-bertopic-model")

# Visualizing the Topic Representations

The following graph visualizes the topics/clusters, their keywords, and the semantic distance between these clusters.

From manual inspection, topic 19 and the topics around it relate to Python, AI, ML, NLP, and similar fields.

Extra Note: Manual inspection was used here, but we could just as easily take a set of known AI postings and compute their similarity to the topic representations to find the best-matching topics.
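
The similarity-based alternative from the note can be sketched in plain Python. The vectors below are made-up toy values and `rank_topics` is a hypothetical helper. In practice the topic vectors and the embedding of a known AI posting would come from the fitted model and the same sentence-transformer; BERTopic's `find_topics` method offers a comparable text-based search.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_topics(query_vec, topic_vecs):
    """Return (topic_id, similarity) pairs, most similar topic first."""
    scores = [(topic, cosine(query_vec, vec)) for topic, vec in topic_vecs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional "topic embeddings" keyed by topic id (illustrative values only).
topic_vecs = {
    3: [0.9, 0.1, 0.0],
    19: [0.1, 0.9, 0.3],
    42: [0.0, 0.2, 0.9],
}
# Toy embedding of a posting already known to be about AI/ML.
known_ai_posting = [0.2, 0.8, 0.4]

ranking = rank_topics(known_ai_posting, topic_vecs)
best_topic = ranking[0][0]
```

With several known postings, averaging their rankings (or their embeddings) would give a more robust choice than relying on a single example.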

In [9]:
# Look at topic 19 and the topics around it. It sits at the bottom of the plot, horizontally centered.
topic_model.visualize_topics()

# Validating our inspection

We can pass a query (a job posting) to the topic model to see which cluster it is assigned to. Here I use a job posting from LinkedIn (not part of the training dataset) that relates to data science, ML, and AI.

As the output shows, it is indeed assigned to topic 19.

In [21]:
query = ["""Job Requirements

Bachelor’s/Master’s degree in Computer Science or a related field.
At least 3+ years of experience as an AI Engineer.
Proficiency in Python, TensorFlow, AutoML, and/or PyTorch.
Experience with Multi Objective Decision Engine, Game Thoery and Behavioral Analytics

Skills

Machine Learning
Data Analysis
Apache Kafka/Spark
AutoML
Google ML
Multi Objective Decision Engine (MODA)
AHP
TOPSIS
Game Theory
Behavioral Analytics
Natural Language Processing
Computer Vision
TensorFlow
PyTorch

What You’ll Be Working On

Job Details

Join a leading AI research team and work on cutting-edge AI projects. Develop machine learning models, natural language processing algorithms, and computer vision solutions.

Job Responsibilities

Design and implement machine learning models and algorithms.
Create data pipelines using different tools like Apache Kafka/Spark and/or other tools.
Collaborate with data scientists and engineers on AI projects.
Stay updated with AI research and industry developments."""]

predictions, _ = topic_model.transform(query)

print("The posting is assigned to Topic :", predictions[0])

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2024-06-19 11:03:17,666 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2024-06-19 11:03:17,678 - BERTopic - Dimensionality - Completed ✓
2024-06-19 11:03:17,680 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2024-06-19 11:03:17,685 - BERTopic - Probabilities - Start calculation of probabilities with HDBSCAN
2024-06-19 11:03:17,692 - BERTopic - Probabilities - Completed ✓
2024-06-19 11:03:17,694 - BERTopic - Cluster - Completed ✓


The posting is assigned to Topic : 19
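
The call above discards the second return value of `transform`. Because the model was built with `calculate_probabilities=True`, that value should hold one row of per-topic probabilities for each query, which helps judge how confident an assignment is. Below is a minimal sketch, assuming each row is indexed by topic id starting at 0 (with the outlier topic -1 excluded). The numbers are invented and `top_k_topics` is a hypothetical helper.

```python
def top_k_topics(prob_row, k=3):
    """Return the k most probable (topic_id, probability) pairs from one row."""
    ranked = sorted(enumerate(prob_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Invented probabilities for a model with 5 topics (ids 0 to 4).
prob_row = [0.02, 0.10, 0.05, 0.70, 0.13]

top = top_k_topics(prob_row, k=2)
```

In the notebook this would look like `predictions, probs = topic_model.transform(query)` followed by `top_k_topics(probs[0])`. A prediction of -1 means the posting was treated as an outlier rather than assigned to a topic.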


## Keywords for Topic 19

In [11]:
topic_model.get_topic(19)

[('ai', 0.5244880126501201),
 ('ml', 0.41946001641971714),
 ('machine', 0.36277151744175334),
 ('generative', 0.3283438351880089),
 ('learning', 0.29958531592935594),
 ('models', 0.287739375114762),
 ('nlp', 0.2849024203071012),
 ('algorithms', 0.2764329250061235),
 ('mlops', 0.25330154111381803),
 ('pytorch', 0.25212059877439913)]
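
The scores returned by `get_topic` are c-TF-IDF weights. As a rough illustration of how such weights arise, here is a simplified pure-Python sketch of the formula described in the BERTopic documentation: the term frequency within a class, multiplied by log(1 + A / f_t), where A is the average number of words per class and f_t is the term's frequency across all classes. The function `c_tf_idf`, its square-root handling of `reduce_frequent_words`, and the toy tokens are illustrative assumptions, not the library's implementation.

```python
import math
from collections import Counter


def c_tf_idf(class_docs, reduce_frequent_words=False):
    """Score each term per class; `class_docs` maps class id -> list of tokens."""
    counts = {c: Counter(tokens) for c, tokens in class_docs.items()}
    # f_t: frequency of each term across all classes.
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    # A: average number of tokens per class.
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for c, counter in counts.items():
        n_tokens = sum(counter.values())
        scores[c] = {}
        for term, freq in counter.items():
            tf = freq / n_tokens  # L1-normalised term frequency within the class
            if reduce_frequent_words:
                tf = math.sqrt(tf)  # dampens very frequent terms
            scores[c][term] = tf * math.log(1 + avg_words / total[term])
    return scores


# Two toy "topics", each already merged into a single bag of tokens.
class_docs = {
    0: ["ai", "ml", "ai", "python", "team"],
    1: ["nurse", "patient", "care", "team", "care"],
}
scores = c_tf_idf(class_docs)
top_term = {c: max(s, key=s.get) for c, s in scores.items()}
```

In this toy example "team" appears in both classes, so it scores lower than a class-specific term with the same in-class frequency, which is the behaviour that keeps generic words out of the topic keywords.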

# Postings assigned to Topic 19

Here I select all postings assigned to topic 19. As you can see, these are indeed related to AI, ML, and similar fields.

In [20]:
topic_document_df = topic_model.get_document_info(df['text_representation'])
topic_document_df[topic_document_df['Topic'] == 19]

Unnamed: 0,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Probability,Representative_document
333,Artificial Intelligence Engineer Intern - Chat...,19,19_ai_ml_machine_generative,"[ai, ml, machine, generative, learning, models...",[Senior Lead Engineer - Generative AI Infrastr...,ai - ml - machine - generative - learning - mo...,0.244751,False
520,Senior Machine Learning Research Engineer: Sym...,19,19_ai_ml_machine_generative,"[ai, ml, machine, generative, learning, models...",[Senior Lead Engineer - Generative AI Infrastr...,ai - ml - machine - generative - learning - mo...,1.000000,False
663,Machine Learning Engineer: Job Title: Python A...,19,19_ai_ml_machine_generative,"[ai, ml, machine, generative, learning, models...",[Senior Lead Engineer - Generative AI Infrastr...,ai - ml - machine - generative - learning - mo...,0.280125,False
1077,Generative AI Engineer: Role: Generative AI Pr...,19,19_ai_ml_machine_generative,"[ai, ml, machine, generative, learning, models...",[Senior Lead Engineer - Generative AI Infrastr...,ai - ml - machine - generative - learning - mo...,0.241297,False
1532,Principal Platform Engineer: About the company...,19,19_ai_ml_machine_generative,"[ai, ml, machine, generative, learning, models...",[Senior Lead Engineer - Generative AI Infrastr...,ai - ml - machine - generative - learning - mo...,0.365454,False
...,...,...,...,...,...,...,...,...
104614,Data Scientist: hackajob transforms your job s...,19,19_ai_ml_machine_generative,"[ai, ml, machine, generative, learning, models...",[Senior Lead Engineer - Generative AI Infrastr...,ai - ml - machine - generative - learning - mo...,0.303813,False
107320,Data Scientist: The Data Scientist f is a key ...,19,19_ai_ml_machine_generative,"[ai, ml, machine, generative, learning, models...",[Senior Lead Engineer - Generative AI Infrastr...,ai - ml - machine - generative - learning - mo...,0.250737,False
109109,Machine Learning Engineer I: From the day we o...,19,19_ai_ml_machine_generative,"[ai, ml, machine, generative, learning, models...",[Senior Lead Engineer - Generative AI Infrastr...,ai - ml - machine - generative - learning - mo...,0.295044,False
109606,"Staff Engineer, Machine Learning: Madhive is t...",19,19_ai_ml_machine_generative,"[ai, ml, machine, generative, learning, models...",[Senior Lead Engineer - Generative AI Infrastr...,ai - ml - machine - generative - learning - mo...,1.000000,False
