<a href="https://colab.research.google.com/github/m-newhauser/weaviate-job-postings/blob/main/weaviate_job_postings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering AI-related job postings with Sentence Transformers, Weaviate, and BERTopic
**Mary Newhauser**

This notebook is a tutorial on how to leverage vectors stored in a [Weaviate](https://weaviate.io/) database for topic modeling. More specifically, we’ll train a topic model on text embeddings for AI- and ML-related job postings from the Kaggle [LinkedIn Job Postings (2023 - 2024)](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings) dataset using exclusively open source models and frameworks.

In this notebook we will:
* Import and preprocess the Kaggle dataset using `Pandas`
* Generate text embeddings for concatenated job titles and descriptions using the [sentence-transformers/paraphrase-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2) model
* Set up a `Weaviate` vector database and insert the text embeddings
* Query the database to select only jobs related to AI and ML
* Train a zero-shot topic model with [BERTopic](https://maartengr.github.io/BERTopic/index.html)
* Manually label the resulting topics
* Visualize and interpret the resulting topics

## Set up environment

First, install all the packages we need for the notebook and import our libraries.
*This may take a minute or two.*

In [2]:
%%capture
!pip install -q -U bertopic
!pip install -q -U plotly
!pip install -q -U tqdm
!pip install -q -U accelerate
!pip install transformers=="4.39.0"
!pip install -q scipy
!pip install -q datasets
!pip install sentence-transformers
!pip install -U weaviate-client

In [1]:
import getpass
import json
import os
import random
import requests
import csv
import shutil

import numpy as np
import pandas as pd
import torch

import colorcet as cc

from datasets import Dataset, DatasetDict
from google.colab import userdata
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from umap import UMAP
from sklearn.feature_extraction.text import CountVectorizer
from scipy import stats

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from bertopic.vectorizers import ClassTfidfTransformer

import weaviate
import weaviate.classes as wvc
from weaviate.classes.config import Configure, Property, DataType, Tokenization, VectorDistances
from weaviate.classes.query import MetadataQuery

In [4]:
# Create a folder for output data
os.makedirs("output", exist_ok=True)
print("Folder 'output' created successfully.")

Folder 'output' created successfully.


In [11]:
# Set display options to expand the DataFrame view
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_colwidth', None) # Show full column content
pd.set_option('display.width', 800)        # Set a larger display width

# Part 1: Generate embeddings

## Load and pre-process data
The fastest way to load our [Kaggle dataset](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings) in Colab is to download it directly from the source, which we can do with the following commands.

In [6]:
!kaggle datasets download -d arshkon/linkedin-job-postings

Dataset URL: https://www.kaggle.com/datasets/arshkon/linkedin-job-postings
License(s): CC-BY-SA-4.0
Downloading linkedin-job-postings.zip to /content
 95% 150M/158M [00:01<00:00, 84.8MB/s]
100% 158M/158M [00:01<00:00, 94.1MB/s]


In [7]:
# Unzip the dataset
!unzip linkedin-job-postings.zip

Archive:  linkedin-job-postings.zip
  inflating: companies/companies.csv  
  inflating: companies/company_industries.csv  
  inflating: companies/company_specialities.csv  
  inflating: companies/employee_counts.csv  
  inflating: jobs/benefits.csv       
  inflating: jobs/job_industries.csv  
  inflating: jobs/job_skills.csv     
  inflating: jobs/salaries.csv       
  inflating: mappings/industries.csv  
  inflating: mappings/skills.csv     
  inflating: postings.csv            


For the purposes of this task, we'll only be analyzing the `postings.csv` file, which contains the `title`, `description`, and other metadata for job postings.

In [2]:
# Load the job postings data
postings_df = pd.read_csv("postings.csv")
print(f"Successfully read {len(postings_df)} rows.")

Successfully read 123849 rows.


Now we clean and standardize the dataset by dropping unneeded columns, converting data types, dropping rows that do not contain a job description, and dropping duplicate rows. We'll also drop many columns of the data to make sure that we can store as many embeddings as possible.

In [3]:
# Pre-process the dataframe
preprocessed_df = (
    postings_df
    .dropna(subset=['description'])                                     # Drop rows where 'description' is null
    .drop_duplicates(subset=['title', 'description'], keep='first')     # Drop rows that have the same title and description
    .assign(
        company_id=lambda df: df['company_id'].apply(lambda x: '{:.0f}'.format(float(x)) if pd.notnull(x) and isinstance(x, (int, float)) else x),  # Convert numerical values to string without decimals
        job_id=lambda df: df['job_id'].astype(str)                      # Convert column to string
    )

)[["job_id", "title", "description", "company_name"]]                   # Keep only the most important columns
print(f"Pre-processed dataset now contains {preprocessed_df.shape[0]} rows.")

Pre-processed dataset now contains 110898 rows.


### Generate sentence embeddings

Because job descriptions are short, often similar to each other, and lack context, we will generate embeddings on job titles concatenated with their respective job descriptions. This will give more context to the job titles and, as a result, increase the quality of the embeddings.

We're going to use `sentence-transformers/paraphrase-mpnet-base-v2` as our embedding model because it's:
* Open source
* Small enough to load in a notebook without quantization
* Good at handling longer text sequences
* Embeds the text in `768` dimensions, which gives us more detail than lower dimensional embeddings

In [5]:
# Concatenate job titles with job descriptions
def concatenate_titles(row):
    return f"{row['title']} - {row['description']}"

# Create a list of concatenated titles
concatenated_titles = preprocessed_df.apply(concatenate_titles, axis=1).tolist()

In [12]:
# Load sentence transformers embedding model
embedding_model = SentenceTransformer("sentence-transformers/paraphrase-mpnet-base-v2")

In [None]:
# Generate embeddings
embeddings = embedding_model.encode(
    concatenated_titles,
    show_progress_bar=True,
    batch_size=16                     # Choose smaller batch size to accommodate Colab's single GPU
)
print(f"Generated embeddings for {len(embeddings)} title/description pairs.")

Batches:   0%|          | 0/6932 [00:00<?, ?it/s]

Generated embeddings for 110898 title/description pairs.


# Part 2: Populate Weaviate vector database
## Set up a Weaviate database
There are a few different ways to populate a Weaviate vector database. We can either generate the embeddings upon insertion or generate the embeddings separately and then insert them into an empty database. Since we're using a fully open source embedding model and limited to just a single Google Colab GPU, we've chosen to generate our embeddings ahead of time.

For this section, we'll need the following credentials, which we can store as [secrets](https://medium.com/@parthdasawant/how-to-use-secrets-in-google-colab-450c38e3ec75) in Google Colab and access them within our notebook via `userdata.get("SECRET_NAME")`:
* HuggingFace API token (instructions [here](https://huggingface.co/docs/api-inference/en/quicktour))
* Weaviate API token (instructions [here](https://weaviate.io/developers/wcs/quickstart))

We also need our Weaviate REST endpoint, which you can find using the same instructions for getting a Weaviate API token.

In [None]:
# Get API keys from Colab secrets
weaviate_key = userdata.get("WCS_API_KEY")
huggingface_key = userdata.get("HF_TOKEN")

# Weaviate REST endpoint
weaviate_url = "https://my-sandbox-1xdpgej9.weaviate.network"

In [None]:
# Connect to Weaviate
client = weaviate.connect_to_wcs(
    cluster_url=weaviate_url,
    auth_credentials=weaviate.auth.AuthApiKey(weaviate_key),
)

Now that we're connected to our Weaviate cluster, we need to create a `collection` in which to store our vectors and metadata. Because we're bringing our own pre-generated embeddings, we won't be using a vectorizer for the collection.

In [None]:
# Create a collection
collection = client.collections.create(
    name="JobsPostings",
    vectorizer_config=wvc.config.Configure.Vectorizer.none(),     # Because we're bringing our pre-generated embeddings
)

## Upload to Weaviate vector database
To upload our data and vectors to Weaviate, we'll first convert them to the necessary formats and then insert them in batches.

In [None]:
# Convert dataset to dictionary
records = preprocessed_df.to_dict(orient="records")

# Convert embeddings from array to list
vectors = [embedding.tolist() for embedding in embeddings]

In [None]:
# Define the collection
collection = client.collections.get("Jobs")

# Insert all data in batches
with collection.batch.dynamic() as batch:
    for i, record in enumerate(records):
        batch.add_object(
            properties=record,                          # Non-embedded data
            vector=vectors[i],                          # Embeddings of title-descriptions
        )

# Print number of rows successfully upserted
print(f"Upserted {collection.aggregate.over_all(total_count=True).total_count} rows to collection.")

Upserted 116129 rows to collection.


In [None]:
# Close the Weaviate client
client.close()

## Query the vector database
We have plenty of job listings in our dataset, but not all of them are related to ML or AI. In order to gather this subset of data, we'll query the vector database using Weaviate's `.near_vector()` functionality.

While it would certainly be possible to use a RAG framework to retrieve AI-related job postings from the larger dataset, we're using vector search because it's the lowest hanging fruit. Furthermore, because we only need the vectors themselves, we really don't need a generative LLM. This keeps our approach simpler, eliminating the need for prompt engineering, response parsing, and the use of expensive LLM API endpoints.

### Reconnect to Weaviate
Just as before, we'll connect to Weaviate, specify the collection we're working with, and load our embedding model.

In [4]:
# Get API keys from Colab secrets
weaviate_key = userdata.get("WCS_API_KEY")
huggingface_key = userdata.get("HF_TOKEN")

# Weaviate REST endpoint
weaviate_url = "https://my-sandbox-1xdpgej9.weaviate.network"

# Connect to Weaviate
client = weaviate.connect_to_wcs(
    cluster_url=weaviate_url,
    auth_credentials=weaviate.auth.AuthApiKey(weaviate_key),
)

# Specify a collection
collection = client.collections.get("Jobs")

### Embed the query

Because we added pre-generated embeddings to our collection, we didn't specify a vectorizer upon creation. This means that we need to embed our search query (with the same model used to generate our vectors previously).

To fetch a broad and large enough sample that will work for topic modeling, we'll just use a simple keyword query `"data science, machine learning, artificial intelligence"`.

In [5]:
# Load sentence transformers embedding model
embedding_model = SentenceTransformer("sentence-transformers/paraphrase-mpnet-base-v2")

# Define a search query
query = "data science, machine learning, artificial intelligence"

# Embed the query using the same model used to embed the dataset
query_embedding = embedding_model.encode(query).tolist()

### Fetch and post-process query results
The full dataset contains ~111k job postings, of which we assume only a fraction are related to AI and machine learning. To identify these job postings, we use vector similarity search, which can only return a finite amount of data. That's why we limit the returned properties to only the most important data and limit the search to return the top `10000` vectors. With this, we make the assumption that jobs related to AI and ML make up ~10% of the overall dataset, which seems reasonable. In order to filter out noise, we'll set the `distance` metric to `0.83`, which we arrive at after doing some ad hoc analysis of of threshold distances and their relationship to ML-related jobs.

In [6]:
# Perform the query
response = collection.query.near_vector(
    near_vector=query_embedding,
    limit=10000,
    return_properties=["title", "description"],
    include_vector=True,
    distance=0.83,                                            # Set a similarity threshold
    return_metadata=MetadataQuery(distance=True)
)

# Print number of items returned in query
print(f"{len(response.objects)} items returned.")

10000 items returned.


## Part 3: Cluster AI/ML-related jobs with BERTopic
Now that we have our target dataset, we can proceed with training a topic model to explore the different types of AI-related jobs in the dataset.

`BERTopic` is an open source framework that leverages `Transformers` and cTF-IDF to cluster embedded text data. We're using BERTopic because of its modularity, simplicity, and rich offering of features, including its zero-shot capabilities, which we'll discuss late on.

### Prepare data for clustering
First, we'll take the data retrieved by the query and transform it to a dataframe and create the `docs` and `embeddings` objects required to run `BERTopic`.

In [7]:
# Save all the properties of returned vectors to dataframe
properties_df = pd.DataFrame([o.properties for o in response.objects])
properties_df["distance"] = [o.metadata.distance for o in response.objects]

# Extract embeddings from returned query and format as np.array
embeddings = np.array([o.vector["default"] for o in response.objects])

# Create docs for BERTopic using job titles
docs = properties_df["title"].tolist()

### Choose BERTopic sub-models

Next, we specify all the sub-models we'd like to use for our `BERTopic` model. We'll use the c-tfidf model to reduce the inclusion of frequent (and less meaningful) words in our topic representation and a count vectorizer model to remove stop words. To gently guide our topic model in the right direction, we'll set some seed words we want the model to give extra importance to.

For our representation model, we'll use a maximal marginal relevance model. This model takes into account the the similarity of keywords/keyphrases within each document, returning a more diverse and descriptive set of keywords for describing each cluster.

In [8]:
# Minimize inclusion of frequent words in the cluster representation
ctfidf_model = ClassTfidfTransformer(
    reduce_frequent_words=True,
    seed_words=[
        "machine learning engineer",
        "data scientist",
        "data science",
        "data engineer",
        "data engineering",
        "data architect",
        " AI "
    ]
)

# Add a vectorizer model to remove English stopwords
vectorizer_model = CountVectorizer(
    stop_words="english",
    min_df=2,
    ngram_range=(1, 2)                            # Target only shorter stopwords
)

# Create your representation model
representation_model = MaximalMarginalRelevance(
    diversity=0.4                                 # Increase this for more descriptive keywords
)

### Zero-shot topic model
Originally, traditional `BERTopic` representation models struggled to generate  descriptive clusters for this dataset. They primarily struggled with assigning the appropriate level of meaning to words that describe the skill level of job postings, and were creating clusters around terms like "intern," "manager," and "junior" rather than creating clusters for different types of jobs.

To solve this problem, we use `BERTopic`'s zero-shot topic model representation, which allows us to specify topics that we assume are already present in the dataset. This gives the model more direction while also allowing it to discover additional topics, preserving the unsupervised nature of topic modeling. We'll provide the model a small list of general job titles we expect to be in the corpus and set a similarity threshold for zero-shot assignment of job titles to these categories.

In order to avoid overfitting the model and ending up with too many clusters, we'll set the minimum topic size to `60`.

In [9]:
# Define some known topics in the dataset
zeroshot_topic_list = [
    "Machine Learning Engineer",
    "Data Scientist",
    "Data Analyst",
    "MLOps",
    "Data Engineer",
    "Software Engineer",
]

# Fit model using the zero-shot topics and sub-models
topic_model = BERTopic(
    embedding_model=embedding_model,
    min_topic_size=55,                                             # We can reduce n_topics later on
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.83,                                   # Similarity threshold for assignment to a known topic
    representation_model=representation_model,
    verbose=True
)
topics, _ = topic_model.fit_transform(docs, embeddings)

2024-06-06 08:20:35,800 - BERTopic - Zeroshot Step 1 - Finding documents that could be assigned to either one of the zero-shot topics
2024-06-06 08:20:36,235 - BERTopic - Zeroshot Step 1 - Completed ✓
2024-06-06 08:20:36,244 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-06-06 08:21:39,301 - BERTopic - Dimensionality - Completed ✓
2024-06-06 08:21:39,305 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-06-06 08:21:40,489 - BERTopic - Cluster - Completed ✓
2024-06-06 08:21:40,502 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-06-06 08:22:11,631 - BERTopic - Representation - Completed ✓


In [37]:
# Collect topics in a dataframe
topic_info_df = topic_model.get_topic_info()

# Get some descriptive statistics
num_topics = (topic_info_df.shape[0]-1)
num_unclustered = topic_info_df.query('Topic == -1')['Count'].iloc[0]

print(f"Resulting topic model has {num_topics} topics. \n")
print(f"{num_unclustered} ({(num_unclustered / len(docs)):.2%}) documents are unclustered. \n")

# Examine the topics
topic_info_df

Resulting topic model has 25 topics. 

4629 (46.29%) documents are unclustered. 



Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,4629,-1_data engineer_intern_business analyst_management,"[data engineer, intern, business analyst, management, project manager, modeler, client, specialist, development, quality]","[Data Science intern (30 May) DIN47, Data Science Intern, Summer 2024, Associate Director, Data Science]"
1,0,885,0_security analyst_security engineer_iam_senior security,"[security analyst, security engineer, iam, senior security, analyst cyber, cloud security, engineer information, engineer cyber, systems, identity]","[Cyber Security Analyst, Senior Information Cyber Security Analyst, Sr. Cyber Security Analyst]"
2,1,630,1_engineer data_architect data_azure data_engineer azure,"[engineer data, architect data, azure data, engineer azure, python, etl data, architect senior, aws, enterprise data, enterprise]","[Azure Data Architect, Azure Data Architect, Azure Data Architect]"
3,2,522,2_site reliability_reliability engineer_cloud engineer_software engineer,"[site reliability, reliability engineer, cloud engineer, software engineer, backend, stack engineer, devops engineer, engineer site, staff software, engineer java]","[Staff Site Reliability Engineer, Senior Software Engineer (Site Reliability Engineer), Cloud Site Reliability Engineer]"
4,3,316,3_service desk_information technology_support technician_desk analyst,"[service desk, information technology, support technician, desk analyst, servicenow, support analyst, specialist information, desk technician, analyst service, helpdesk]","[Information Technology Support Specialist, Information Technology Help Desk, Help Desk Support Specialist]"
5,4,306,4_machine learning_learning_ai engineer_generative ai,"[machine learning, learning, ai engineer, generative ai, genai, gen ai, senior ai, deep, llm, engineer senior]","[Senior Machine Learning Engineer, Machine Learning Engineer - RenderWolf.Ai, AI ML Software Engineer - Artificial Integence / Machine Learning ]"
6,5,279,5_analyst data_analyst w2_demand generation_analyst junior,"[analyst data, analyst w2, demand generation, analyst junior, market research, analyst marketing, analyst contract, forecasting, analyst digital, management analyst]","[Data Analyst, Sr Data Analyst - W2 - 12 Months - Sunnyvale, CA or New York, NY or Hoboken, NJ / 2 days/week onsite, Data Analyst 3]"
7,6,228,6_customer success_sales development_account manager_manager customer,"[customer success, sales development, account manager, manager customer, business development, sales manager, inside sales, sales account, manager salesforce, analyst account]","[Lead Customer Success Manager, Enterprise Customer Success Manager, Enterprise Customer Success Manager]"
8,7,217,7_statistical_programming_biostatistics_science research,"[statistical, programming, biostatistics, science research, data scientist, epidemiologist, medical laboratory, engineer clinical, manager clinical, clinical data]","[Associate Director, Statistical Programming, (CLS) Clinical Laboratory Scientist - Competitive pay + Large sign on/relocation bonus, Medical Laboratory Scientist (MT/MLT) (ASCP) - Microbiology - 10K Sign on bonus]"
9,8,188,8_remote customer_recruiter_work home_talent,"[remote customer, recruiter, work home, talent, analyst remote, program assistant, representative remote, remote salesforce, sales representative, est]","[Customer Service ( Remote work), Customer Service ( Remote work ), Customer Service ( Remote work )]"


### Create topic labels based on keywords
`BERTopic` doesn't actually give us topic labels each of the document clusters. Instead, it gives us the most representative keywords and documents for each cluster, leaving us the task of creating topic labels.

*Side note: I tried using several open source LLMs (Mistral, Zephyr) to generate topic labels based on keywords. The results were surprisingly disappointing, with responses suffering from serious formatting issues, cut off words, and overall low quality topic labels. As a result, I decided to used GPT-4o to give me ideas to label these topics manually. Even some of those responses were inaccurate.*

In [150]:
# Manually define topic labels
new_labels = [
    "Unclustered",
    "Cybersecurity Analytics and Engineering",
    "Cloud and DevOps Engineering",
    "(Enterprise) Data Engineering",
    "AI and Machine Learning Engineering",
    "Data Analytics",
    "IT Service and Support Desk",
    "Customer Success and Sales Management",
    "Clinical Data Science",
    "Remote Customer Service and Talent Management",
    "Financial Risk and Regulatory Analysis",
    "(Senior) Database Administration",
    "Automation and Mechanical Engineering",
    "Business Process Analytics",
    "Digital Marketing Strategy",
    "Quality Assurance and Automation",
    "Data Science",
    "Healthcare Information Management",
    "Customer Support and Insurance",
    "Oracle Cloud ERP Solutions",
    "Power BI and Reporting Analyst",
    "Business Intelligence and Analysis",
    "Data Center Operations",
    "Data Entry and Administration",
    "(Enterprise) Product Management",
    "Commercial Truck Driving"
]

# Update topic model with custom labels
topic_model.set_topic_labels(new_labels)

# Regenerate topic info df to include custom labels
topic_info_df = topic_model.get_topic_info()
topic_info_df


Unnamed: 0,Topic,Count,Name,CustomName,Representation,Representative_Docs
0,-1,4190,-1_data engineering_intern_sap_project manager,Unclustered,"[data engineering, intern, sap, project manager, program, business analyst, program manager, consultant, developer, development]","[NLP/data science/Data Analysis(Intern apr 30) DIN01, Vice President, Finance – Data Science & Visualization Analytics, Data Science intern ( Apr 30) -DIN01]"
1,0,908,0_threat_iam_engineer security_analyst security,Cybersecurity Analytics and Engineering,"[threat, iam, engineer security, analyst security, senior security, analyst information, cloud security, engineer information, cybersecurity engineer, security consultant]","[Cyber Security Analyst, Cyber Security Analyst, Senior Information Cyber Security Analyst]"
2,1,669,1_software engineer_site reliability_reliability engineer_cloud engineer,Cloud and DevOps Engineering,"[software engineer, site reliability, reliability engineer, cloud engineer, stack engineer, staff software, technical lead, java developer, devops engineer, engineer site]","[Site Reliability Engineer II, Sr. Java Developer + Site Reliability Engineer, Cloud Site Reliability Engineer]"
3,2,643,2_data engineer_architect data_azure data_big,(Enterprise) Data Engineering,"[data engineer, architect data, azure data, big, senior data, gcp, python, etl data, architect senior, engineer enterprise]","[Azure Data Architect, Azure Data Architect, Azure Data Architect]"
4,3,349,3_learning engineer_learning_ai engineer_artificial intelligence,AI and Machine Learning Engineering,"[learning engineer, learning, ai engineer, artificial intelligence, senior machine, generative ai, genai, gen ai, deep learning, llm]","[Machine Learning Engineer - RenderWolf.Ai, AI ML Software Engineer - Artificial Integence / Machine Learning , Senior Machine Learning Engineer, Generative AI]"
5,4,306,4_analyst data_analytics insights_analyst marketing_analyst w2,Data Analytics,"[analyst data, analytics insights, analyst marketing, analyst w2, analyst senior, forecasting, market research, ecommerce, director analytics, demand generation]","[Senior Manager, Data Analytics and Insights, Senior Analyst, Data & Analytics, Senior Analyst, Data & Analytics]"
6,5,304,5_service desk_information technology_help desk_support technician,IT Service and Support Desk,"[service desk, information technology, help desk, support technician, desk analyst, desk technician, specialist information, analyst service, helpdesk, sales advisor]","[Information Technology Support Specialist, Information Technology Help Desk, Help Desk Support Specialist]"
7,6,278,6_customer success_development representative_account executive_salesforce,Customer Success and Sales Management,"[customer success, development representative, account executive, salesforce, inside, enterprise account, inside sales, west, sales manager, analyst account]","[Enterprise Customer Success Manager, Enterprise Customer Success Manager, Enterprise Customer Success Manager]"
8,7,254,7_data scientist_research data_science research_biostatistics,Clinical Data Science,"[data scientist, research data, science research, biostatistics, statistician, clinical data, medical laboratory, manager clinical, genomics, laboratory scientist]","[Medical Laboratory Scientist (MT/MLT) (ASCP) - Microbiology - 10K Sign on bonus, (CLS) Clinical Laboratory Scientist - Competitive pay + Large sign on/relocation bonus, Clinical Laboratory Scientist (CLS) - Competitive pay + Large sign on/relocation bonus]"
9,8,208,8_customer service_remote customer_recruiter_work home,Remote Customer Service and Talent Management,"[customer service, remote customer, recruiter, work home, talent, technician field, analyst remote, representative, remote salesforce, program assistant]","[Customer Service ( Remote work ), Customer Service ( Remote work ), Customer Service ( Remote work )]"


## Part 4: Visualize and interpret topic model
### Prepare data for visualization
Now, we'll do some simple `Pandas` manipulations to clean up our data and format it to be visualized with `Plotly`. We'll use `BERTopic`'s `.get_document_info()` method to get a dataframe that contains each individual data point along with it's assigned topic.

In [151]:
# Get the topic label for each concatenated job posting
document_info_df = topic_model.get_document_info(docs, df=properties_df)

# Post-process cluster data for visualization
document_topics_df = (
    document_info_df
    .rename(
        columns={
            "Topic": "topic_number",
            "CustomName": "topic_label",
            "Representation": "representation",
            "Probability": "probability"
        }
    )
    .drop(columns=["Document", "Representative_Docs", "Name", "Top_n_words", "Representative_document"])
)
document_topics_df.head(3)


Unnamed: 0,description,title,distance,topic_number,topic_label,representation,probability
0,"The company is an applied behavioral research company working at the intersection of ML, social sciences, and recommendation systems / Prediction - as - a - service. The company enables businesses to build privacy preserving recommendation and behavioral technologies competitive to big tech without the use of interpretable raw customer data.We’re looking to develop the next generation of privacy preserving machine learning products that understand and predict behavior at scale. Our products and research teams need to handle information at a massive scale across a number of unstructured dimensions. We're looking for engineers who bring fresh ideas from all areas, including information retrieval, distributed computing, large-scale system design, networking and data storage, privacy, artificial intelligence, and NLP.Job Qualification:● Bachelor’s degree or equivalent practical experience.● 5+ years of experience with software development in one or more programming languages, and with data structures/algorithms.● 5+ years with two or more languages/softwares included but not limited to: Python, Apache, Presto, R, ML/optimization, Scala● 5+ years of experience in one or more of the following areas: machine learning, recommendation systems, pattern recognition, NLP, data mining or artificial intelligence● 5+ years of experience with ML/AI algorithms and tools, deep learning and/or natural language processing.Responsibilities:● You enjoy partnering with data science teams to deploy and scale advanced algorithms● You strive to write elegant code, and you're comfortable with picking up new technologies independently● You enjoy collaborating with colleagues/partners internally and externally● You are passionate about building intuitive data models and an expert in distributed data processing patterns● You are comfortable working in a rapidly changing environment with ambiguous requirements. You are nimble and take intelligent risksWhat you will do:● Engineer efficient, adaptable, and scalable data pipelines to process structured and unstructured data● Maintain and rethink existing datasets and pipelines to service a wider variety of use cases● Develop highly scalable classifiers and tools leveraging machine learning, data regression, and rules-based models● Adapt standard machine learning methods to best exploit modern parallel environments (e.g. distributed clusters, multicore SMP, and GPU)",Machine Learning Engineer [4414],0.450224,3,AI and Machine Learning Engineering,"[learning engineer, learning, ai engineer, artificial intelligence, senior machine, generative ai, genai, gen ai, deep learning, llm]",1.0
1,"Qualifications:Master's or Ph.D. in Computer Science, Data Science, Statistics, or a related field.10+ years of experience in data science, machine learning, and AI.Strong expertise in NLP techniques, including text preprocessing, entity recognition, and sentiment analysis.Proficiency in machine learning tools and libraries such as SpaCy, TensorFlow, PyTorch, scikit-learn, and Azure ML.Experience building and deploying machine learning models in the Azure cloud environment.Familiarity with DevOps practices and tools for continuous integration and deployment (CI/CD).Excellent problem-solving skills and the ability to work in a fast-paced, collaborative environment.Strong communication and leadership skills, with a track record of successfully leading data science projects from conception to production.\n",Technical Lead- ML and Data Science,0.466082,3,AI and Machine Learning Engineering,"[learning engineer, learning, ai engineer, artificial intelligence, senior machine, generative ai, genai, gen ai, deep learning, llm]",0.629122
2,"Project Scope and Brief Description:\nNext-generation Artificial Intelligence for Genomics will use more complex datatypes and be applied to new crop contexts. We need a Data Scientist with demonstrated expertise in training and evaluating transformers such as BERT and its derivatives.\nSkills / Experience:\nRequired: Proficiency with Python, pyTorch, Linux, Docker, Kubernetes, Jupyter. Expertise in Deep Learning, Transformers, Natural Language Processing, Large Language Models\nPreferred: Experience with genomics data, molecular genetics. Distributed computing tools like Ray, Dask, Spark.",Data Scientist,0.471149,-1,Unclustered,"[data engineering, intern, sap, project manager, program, business analyst, program manager, consultant, developer, development]",0.0


### Reduce dimensions to get (x, y) coordinates
Before we can plot the data, we need to further reduce the vectors for the job postings to two dimensions, our x and y coordinates using `UMAP`.

In [152]:
# Reduce dimensions of embeddings for visualization
reduced_embeddings = UMAP(n_neighbors=20, n_components=2, min_dist=0.0, metric='cosine', random_state=42).fit_transform(embeddings)

# Add as coordinates to clusters df
document_topics_df = (
    document_topics_df
    .assign(
        x_coord=reduced_embeddings[:, 0],
        y_coord=reduced_embeddings[:, 1]
    )
)

Finally, we'll remove outliers the in the dataset so that we get a more concise view for our scatterplot.

In [153]:
# Calculate Z-scores for the specified columns and filter out outliers
z_scores = np.abs(stats.zscore(document_topics_df[["x_coord", "y_coord"]]))
filtered_df = document_topics_df[(z_scores < 3).all(axis=1)]

### Visualize clusters with `Plotly`
Because our sample size is smaller, we can take advantage of the granularity that a Plotly scatterplot offers. We'll be able to hover over individual points in the plot to see the job title and other relevant metadata.

In [156]:
# Get a list of unique colors for each cluster (and set "Unclustered" to gray)
color_swatches = ['rgb(211, 211, 211)'] + cc.glasbey_light

# Create a dictionary mapping each cluster to a color
unique_clusters = filtered_df['topic_label'].unique()
color_map = {cluster: color_swatches[i] for i, cluster in enumerate(unique_clusters)}

# Set the Unclustered color to light gray
color_map[unique_clusters[1]] = 'rgb(211, 211, 211)'

In [161]:
# Create a scatter plot
fig = go.Figure()

# Add scatter plot for each cluster
for cluster in filtered_df['topic_label'].unique():
    cluster_df = filtered_df[filtered_df['topic_label'] == cluster]
    fig.add_trace(go.Scattergl(
        x=cluster_df['x_coord'],
        y=cluster_df['y_coord'],
        mode='markers',
        marker=dict(
            color=color_map[cluster],
            size=3
        ),
        name=cluster,
        text=[f'Title: {title}<br>Cluster: {cluster}' for title in cluster_df['title']],
        hoverinfo='text'
    ))

# Update layout with ideal plot size, centered title, and customized legend
fig.update_layout(
    title=dict(
        text='Machine Learning and AI Job Postings Topic Map',
        x=0.5,
        xanchor='center'
    ),
    xaxis_title='X Coordinate',
    yaxis_title='Y Coordinate',
    showlegend=True,
    legend=dict(
        x=1.02,  # Move legend slightly to the right
        y=1,
        traceorder='normal',
        font=dict(size=12),
        title=dict(text='Cluster Label', side='top'),
        bordercolor='rgba(0,0,0,0)',  # Remove border color
        borderwidth=0  # Set border width to 0
    ),
    font=dict(
    family="Helvetica, Arial, sans-serif",
    size=15,
    color="black"
    ),
    width=1200,
    height=800
)

### Interpret topic model
Considering the relatively small size and inconsistent quality of the dataset, the topic model performs well.

We can see that several relevant types of jobs related to AI and ML appear in the clusters like `AI and Machine Learning Engineering`, `Data Science`, `Data Analytics`, `(Enterprise) Data Engineering` and so on. The model also picks up some more obscure types of jobs that are related to AI but that we might sometimes forget about including, `Data Entry and Administration`, `Data Center Operations`.

Unfortunately, the proportion of unclustered documents (46%) is quite high. Upon further investigation, many of the unclustered documents were actually highly relevant and should have been assigned to existing clusters, in particularly to the `AI and Machine Learning Engineering` and `Data Science` clusters.

The model also returns a few topics that are completely irrelevant including, `Healthcare Information Management` and `Customer Support and Insurance`.

#### Limitations
The primary constraint for this model is the dataset. The overall dataset is large enough to train a robust topic model, however when we filter out all jobs not related to AI and ML, the sample size decreases significantly. Furthermore, the general quality and length of job titles and descriptions is inconsistent. Some descriptions provide a high degree of detail and context for the positions, while others are short and basic.

Additionally, there is no broad consensus among employers on what defines and differentiates machine learning-related roles from each other. Terms like "Machine Learning Engineer" and "AI Engineer" are often used interchangably by recruiters. At the same time, many differing opinions exist regarding what differentiates a "Data Scientist" from a "Machine Learning Engineer" and what the qualifications for "junior" and "senior" roles should be. Concepts that are nebulous and hard to define for humans will generally also be hard to define for ML models.


#### Conclusion
In conclusion, the topic model demonstrates robust performance given the limitations of the dataset. Although a high proportion of relevant documents remain unclustered, the model successfully identifies key job categories related to AI and ML as well as more obscure ones.

## Save output files
Finally, we'll save all the files locally and zip them!

In [162]:
# Save dataframes
topic_info_df.to_csv("output/topic_info.csv", index=False, sep=";", quoting=csv.QUOTE_MINIMAL)
document_topics_df.to_csv("output/document_topics.csv", index=False, sep=";", quoting=csv.QUOTE_MINIMAL)

# Save topic model
topic_model.save("output/topic_model", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

# Save plot
pio.write_html(fig, file="output/topic_map.html", auto_open=True)

# Zip the folder
shutil.make_archive("output.zip".replace(".zip", ""), "zip", "output")
print(f"Folder 'output' zipped successfully into 'output.zip'")

Folder 'output' zipped successfully into 'output.zip'


In [163]:
# Close the Weaviate client
client.close()