# **Version 2**


**Improvements / Changes**
* Reorder code ✔
* Place all installs in one cell ✔
* Place all library imports in one cell ✔
* Adjust the prompt: ask for multiple representative jobs per topic instead of one label ✔
* Improve topic modeling with BERTopic by adjusting parameters ✔
* Use a larger part of the dataset ✔
* Test longer job descriptions by increasing `MAX_WORDS` ✔
* Test with preprocessed_historical_data.csv ✔
* IMPORTANT: use `reduce_outliers()` (look into this with GPT), `reduce_topics()`, and `update_topics(representation_model, top_n_words, n_gram_range, vectorizer_model)`


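1. Reduce outliers with `reduce_outliers()` using one of the strategies `embeddings`, `distributions`, or `probabilities` (threshold=0.5?). Set `calculate_probabilities=True` when initializing BERTopic (needed for the `probabilities` strategy), and pass the precomputed embeddings when using the `embeddings` strategy.
2. Reduce topics with `reduce_topics()` to around 35-70 topics.
3. Update topics with `update_topics()` based on the representation model, top n words, and n-gram range.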
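
The `probabilities` strategy in step 1 can be pictured as follows. This is a minimal plain-Python sketch, not BERTopic's own code: the function name `reassign_outliers` and the toy inputs are made up for illustration, and it assumes one probability row per document with one column per topic.

```python
def reassign_outliers(topics, probabilities, threshold=0.0):
    """Move outlier documents (-1) to their most probable topic.

    Illustrative only: mimics the idea behind
    reduce_outliers(strategy="probabilities"), not its implementation.
    """
    new_topics = []
    for topic, row in zip(topics, probabilities):
        # Index of the topic with the highest probability for this document
        best_topic = max(range(len(row)), key=lambda i: row[i])
        if topic == -1 and row[best_topic] >= threshold:
            new_topics.append(best_topic)
        else:
            new_topics.append(topic)
    return new_topics


# Toy example: three documents, two topics
toy_topics = [-1, 0, -1]
toy_probs = [
    [0.10, 0.70],  # outlier, confident about topic 1
    [0.90, 0.05],  # already assigned, left unchanged
    [0.20, 0.30],  # outlier, below the threshold used here
]
print(reassign_outliers(toy_topics, toy_probs, threshold=0.5))
```

A higher threshold leaves more documents in the outlier topic, so the threshold trades coverage against the purity of the existing topics.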

In [None]:
%%capture
!pip install transformers bertopic accelerate bitsandbytes xformers adjustText datamapplot umap-learn optuna gensim

In [None]:
from google.colab import drive

import pandas as pd

from huggingface_hub import notebook_login

from torch import cuda
from torch import bfloat16

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer
from transformers import pipeline

from umap import UMAP
from hdbscan import HDBSCAN

from bertopic import BERTopic
from bertopic.representation import TextGeneration

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import ParameterGrid

import datamapplot
import numpy as np
import matplotlib.pyplot as plt

import optuna
from gensim.models.coherencemodel import CoherenceModel
import gensim.corpora as corpora
from gensim.corpora import Dictionary


In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
dataset = pd.read_csv('/content/drive/My Drive/Colab Notebooks/preprocessed_historical_data 4.csv')
dataset = dataset.sample(frac=0.25, random_state=42)

dataset

Unnamed: 0,job_id,company_name,title,description,location,formatted_work_type,original_listed_time,expiry,job_posting_url,formatted_experience_level,cleaned_description,cleaned_title
73989,3902944011,Current Power,Senior Automation Engineer - Power Systems,The Senior Automation / Power Systems Engineer...,"Houston, TX",Full-time,2024-04-16 15:01:21,2024-05-16 15:01:21,https://www.linkedin.com/jobs/view/3902944011/...,Mid-Senior level,senior automation power system engineer primar...,senior automation engineer power system
59308,3901960222,DISH Network,DISH Installation Technician - Field,"Company Summary\n\nDISH, an EchoStar Company, ...","Orange, TX",Full-time,2024-04-18 21:49:34,2024-05-18 22:02:03,https://www.linkedin.com/jobs/view/3901960222/...,Not Specified,company summary dish echostar company reimagin...,dish installation technician field
44663,3900944095,"Coca-Cola Bottling Company UNITED, Inc.",Order Builder,Division: North Alabama\n\nDepartment : Oxford...,"Oxford, AL",Full-time,2024-04-17 21:06:28,2024-05-17 21:18:20,https://www.linkedin.com/jobs/view/3900944095/...,Entry level,division north alabama department oxford wareh...,order builder
81954,3903878594,Denver7 (KMGH-TV),"Mountain Multimedia Journalist, KMGH","KMGH, the E.W. Scripps Company ABC affiliate i...","Denver, CO",Full-time,2024-04-19 03:01:37,2024-05-19 03:12:28,https://www.linkedin.com/jobs/view/3903878594/...,Entry level,kmgh scripps company abc affiliate denver colo...,mountain multimedia journalist kmgh
113151,3905670593,BAYADA Home Health Care,Licensed Practical Nurse (LPN),"Come for the Flexibility, Stay for the Culture...","Teterboro, NJ",Full-time,2024-04-18 00:00:00,2024-05-19 09:55:42,https://www.linkedin.com/jobs/view/3905670593/...,Entry level,come flexibility stay culture needing life bal...,licensed practical nurse lpn
...,...,...,...,...,...,...,...,...,...,...,...,...
118068,3906088336,Activ8 Recruitment & Solutions,Sales Representative - Resin / Polymer Materia...,An international chemical manufacturing compan...,"Novi, MI",Full-time,2024-04-19 18:32:38,2024-05-19 18:32:38,https://www.linkedin.com/jobs/view/3906088336/...,Associate,international chemical manufacturing company n...,sale representative resin polymer material hyb...
96827,3904951786,"Take2 Consulting, LLC",Director Sales Market,Are you prepared to excel as a Market Director...,"Vienna, VA",Full-time,2024-04-18 15:33:33,2024-05-18 15:33:32,https://www.linkedin.com/jobs/view/3904951786/...,Director,prepared excel market director innovative rapi...,director sale market
80628,3903843265,Podium,Sales Operations Analyst (2024),"At Podium, our mission is to help local busine...","Lehi, UT",Part-time,2024-04-18 00:00:00,2024-05-18 22:55:23,https://www.linkedin.com/jobs/view/3903843265/...,Mid-Senior level,podium mission help local business win lead co...,sale operation analyst 2024
8368,3886202733,Jobot,Attorney (MedMal),Want to learn more about this role and Jobot? ...,"Ontario, CA",Full-time,2024-04-06 09:47:53,2024-05-06 10:13:45,https://www.linkedin.com/jobs/view/3886202733/...,Mid-Senior level,want learn role jobot click jobot logo follow ...,attorney medmal


In [None]:
# Fill missing values in the "cleaned_description" column with an empty string
dataset["cleaned_description"] = dataset["cleaned_description"].fillna("")

# Ensure all values are strings
dataset["cleaned_description"] = dataset["cleaned_description"].astype(str)

# Limit the number of words in the description column so the model can handle the inputs
MAX_WORDS = 50
dataset["cleaned_description"] = dataset["cleaned_description"].apply(lambda x: ' '.join(x.split()[:MAX_WORDS]))

# Combine titles and descriptions into a single column (missing titles become empty strings)
dataset["combined"] = dataset["cleaned_title"].fillna("") + " " + dataset["cleaned_description"]

titles = dataset["cleaned_title"]
description = dataset["cleaned_description"]
combined = dataset["combined"].astype(str)
combined = combined.reset_index(drop=True)

# Check combined variable
print(f"Example combined entry: {dataset['combined'].iloc[0]}")

Example combined entry: senior automation engineer power system senior automation power system engineer primarily responsible conception design development implementation electrical system candidate extensive knowledge electrical power system ac drive technology active front end afe system plc programming experience candidate also posse ability apply mathematical engineering principle detailed description responsible designing developing electrical system including generator control switchboard ac
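
The truncation and concatenation above boil down to a few string operations per row. A small standalone sketch, assuming a hypothetical helper `build_document` and a toy title/description pair (neither is part of the notebook):

```python
def build_document(title, description, max_words=50):
    """Join a title with the first `max_words` words of a description."""
    # Treat anything that is not a string (None, NaN) as missing text
    title = title if isinstance(title, str) else ""
    description = description if isinstance(description, str) else ""
    truncated = " ".join(description.split()[:max_words])
    return f"{title} {truncated}".strip()


example = build_document(
    "order builder",
    "division north alabama department oxford warehouse",
    max_words=3,
)
print(example)
```

Raising `max_words` gives the embedding model more context per posting, at the cost of longer inputs.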


# **BERTopic Model**

In [None]:
# Pre-calculate embeddings
embedding_model = SentenceTransformer("BAAI/bge-large-en")
embeddings = embedding_model.encode(combined, show_progress_bar=True)


In [None]:
vectorizer_model = CountVectorizer(stop_words="english")

In [None]:
# def objective(trial):
#     # Suggest hyperparameters
#     min_topic_size = trial.suggest_int("min_topic_size", 50, 100)
#     n_neighbors = trial.suggest_int("n_neighbors", 50, 100)
#     n_components = trial.suggest_int("n_components", 2, 4)
#     min_dist = trial.suggest_float("min_dist", 0.1, 0.3)
#     n_gram_range = trial.suggest_categorical("n_gram_range", [(1, 2), (1, 3), (1, 4)])

#     # Sub-models
#     umap_model = UMAP(
#         n_neighbors=n_neighbors,
#         n_components=n_components,
#         min_dist=min_dist,
#         metric="cosine",
#         random_state=42
#     )
#     hdbscan_model = HDBSCAN(
#         min_cluster_size=20,
#         min_samples=20,
#         metric="euclidean",
#         cluster_selection_method="eom",
#         prediction_data=True
#     )

#     # Initialize BERTopic
#     topic_model = BERTopic(
#         vectorizer_model=vectorizer_model,
#         umap_model=umap_model,
#         # hdbscan_model=hdbscan_model,
#         n_gram_range=n_gram_range,
#         min_topic_size=min_topic_size,
#         verbose=True
#     )

#     # Fit the model
#     topics, _ = topic_model.fit_transform(combined, embeddings)

#     # Count the number of outliers (-1 topics)
#     outlier_count = sum(1 for t in topics if t == -1)
#     outlier_percentage = outlier_count / len(combined)
#     print(f"Trial {trial.number}: Outlier Percentage: {outlier_percentage:.2%}")

#     # Calculate the number of topics (excluding outliers)
#     unique_topics = set(t for t in topics if t != -1)
#     num_topics = len(unique_topics)
#     print(f"Trial {trial.number}: Number of topics: {num_topics}")

#     # Constraints
#     if outlier_percentage > 0.30 or not (200 <= num_topics <= 1000):
#         print(f"Trial {trial.number}: Rejected due to constraints (Outliers/Topics).")
#         return 0  # Penalize heavily if constraints are violated

#     # Filter out topics with ID -1
#     filtered_data = [
#         (doc, topic) for doc, topic in zip(combined, topics) if topic != -1
#     ]
#     if not filtered_data:  # Handle case where all topics are outliers
#         print(f"Trial {trial.number} failed: All documents classified as outliers.")
#         return 0

#     filtered_combined, filtered_topics = zip(*filtered_data)

#     # Group documents by topic
#     documents = pd.DataFrame({
#         "Document": filtered_combined,
#         "ID": range(len(filtered_combined)),
#         "Topic": filtered_topics
#     })
#     documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
#     cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

#     # Define vectorizer and analyzer
#     vectorizer = topic_model.vectorizer_model
#     analyzer = vectorizer.build_analyzer()

#     # Create tokens, dictionary, and corpus
#     words = vectorizer.get_feature_names_out()
#     tokens = [analyzer(doc) for doc in cleaned_docs]
#     dictionary = corpora.Dictionary(tokens)
#     corpus = [dictionary.doc2bow(token) for token in tokens]

#     # Extract topic words
#     topic_words = [[word for word, _ in topic_model.get_topic(topic)]
#                    for topic in range(len(set(filtered_topics)))]

#     # Compute coherence score
#     coherence_model = CoherenceModel(
#         topics=topic_words,
#         texts=tokens,
#         dictionary=dictionary,
#         coherence='c_v'
#     )
#     coherence_score = coherence_model.get_coherence()

#     # Add penalties to the coherence score
#     penalty = 0
#     if outlier_percentage > 0.40:  # Add penalty for an outlier share above 40%
#         penalty -= (outlier_percentage - 0.40) * 10  # Adjust penalty weight as needed
#     if num_topics < 35:  # Add penalty for fewer topics
#         penalty -= (35 - num_topics) * 0.5
#     elif num_topics > 100:  # Add penalty for excess topics
#         penalty -= (num_topics - 100) * 0.5

#     final_score = coherence_score + penalty
#     print(f"Trial {trial.number}: Coherence Score: {coherence_score}, Final Score: {final_score}")

#     return final_score
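
The scoring idea in the commented-out objective, a coherence score minus penalties for too many outliers or an undesired number of topics, can be isolated in a small pure function. This is a sketch with hypothetical names; the default bounds and weights are copied from the commented code above and are starting values, not tuned recommendations.

```python
def penalized_score(coherence, outlier_share, num_topics,
                    max_outlier_share=0.40, min_topics=35, max_topics=100,
                    outlier_weight=10.0, topic_weight=0.5):
    """Subtract penalties from a coherence score (higher is better)."""
    penalty = 0.0
    if outlier_share > max_outlier_share:
        # Penalize in proportion to how far the outlier share exceeds the limit
        penalty += (outlier_share - max_outlier_share) * outlier_weight
    if num_topics < min_topics:
        penalty += (min_topics - num_topics) * topic_weight
    elif num_topics > max_topics:
        penalty += (num_topics - max_topics) * topic_weight
    return coherence - penalty


print(penalized_score(0.70, outlier_share=0.25, num_topics=60))  # inside all bounds
print(penalized_score(0.70, outlier_share=0.25, num_topics=30))  # too few topics
```

Keeping the penalty separate from the model fit makes it easy to check that the bounds and weights behave as intended before spending GPU time on Optuna trials.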


In [None]:
# # Optimize hyperparameters using Optuna
# study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=10)

# # Best hyperparameters
# best_params = study.best_params
# print("Best hyperparameters:", best_params)
# print("Best coherence score:", study.best_value)

In [None]:
# df = study.trials_dataframe()
# df

Unnamed: 0,number,value,datetime_start,datetime_complete,duration,params_min_dist,params_min_topic_size,params_n_components,params_n_gram_range,params_n_neighbors,state
0,0,0.784551,2024-12-16 15:32:56.006404,2024-12-16 15:34:32.093599,0 days 00:01:36.087195,0.208005,48,4,"(1, 2)",42,COMPLETE
1,1,0.833971,2024-12-16 15:34:32.095618,2024-12-16 15:35:53.682678,0 days 00:01:21.587060,0.218563,42,5,"(1, 4)",43,COMPLETE
2,2,0.80396,2024-12-16 15:35:53.688213,2024-12-16 15:37:15.176044,0 days 00:01:21.487831,0.202496,55,3,"(1, 3)",50,COMPLETE
3,3,0.729255,2024-12-16 15:37:15.179735,2024-12-16 15:38:23.819442,0 days 00:01:08.639707,0.489785,71,2,"(1, 3)",43,COMPLETE
4,4,0.779651,2024-12-16 15:38:23.822740,2024-12-16 15:39:56.783837,0 days 00:01:32.961097,0.34915,53,4,"(1, 4)",61,COMPLETE


In [None]:
# Test without labeling topics
#{'min_topic_size': 79, 'n_neighbors': 59, 'n_components': 2, 'min_dist': 0.2575987952763153, 'n_gram_range': (1, 4)}

umap_model = UMAP(
    n_neighbors=10,
    n_components=5,
    min_dist=0.1,
    metric='cosine',
    random_state=42)

hdbscan_model = HDBSCAN(min_cluster_size=15,
                        min_samples=None,
                        metric="euclidean",
                        cluster_selection_method="eom",
                        prediction_data=True)

# Initialize BERTopic with custom sub-models
topic_model = BERTopic(

  # Sub-models
  # vectorizer_model=vectorizer_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,

  # Hyperparameters
  min_topic_size=20,  # Not used here: the custom hdbscan_model sets the cluster size (min_cluster_size=15)
  n_gram_range=(1,3),
  verbose=True,
  calculate_probabilities=True  # Needed for reduce_outliers(strategy='probabilities')
)

# Train model
topics, probs = topic_model.fit_transform(combined, embeddings)

# topic_model.save("bertopic_model_all")

2024-12-23 10:37:10,095 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-12-23 10:38:02,682 - BERTopic - Dimensionality - Completed ✓
2024-12-23 10:38:02,685 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-12-23 10:42:26,236 - BERTopic - Cluster - Completed ✓
2024-12-23 10:42:26,250 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-12-23 10:42:38,327 - BERTopic - Representation - Completed ✓
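
Two quick diagnostics for a fit like this are the share of documents left in the outlier topic (-1) and the number of real topics. A sketch with a hypothetical helper `summarize_assignments` on toy labels; in the notebook the input would be the `topics` list returned by `fit_transform`.

```python
from collections import Counter


def summarize_assignments(topics):
    """Return (outlier share, number of non-outlier topics, three largest topics)."""
    counts = Counter(topics)
    n_docs = len(topics)
    outlier_share = counts.get(-1, 0) / n_docs if n_docs else 0.0
    real_topics = {t: n for t, n in counts.items() if t != -1}
    largest = sorted(real_topics.items(), key=lambda item: item[1], reverse=True)[:3]
    return outlier_share, len(real_topics), largest


toy_topics = [-1, 0, 0, 1, -1, 2, 0, 1]
share, n_topics, largest = summarize_assignments(toy_topics)
print(f"Outliers: {share:.1%}, topics: {n_topics}, largest: {largest}")
```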


In [None]:
topic_model.get_topic_info().head(20)

NameError: name 'topic_model' is not defined

In [None]:
topic_model.get_topic(0)

[('project', 0.02756681500625131),
 ('project manager', 0.02164597991562017),
 ('manager', 0.010982741335317759),
 ('construction', 0.010321413063816879),
 ('program manager', 0.007896146846620006),
 ('project management', 0.006782478946692517),
 ('construction project', 0.006224687326564847),
 ('management', 0.005823530239290635),
 ('program', 0.005674891334953495),
 ('senior project', 0.005386164758610527)]

In [None]:
print(probs.shape)

print(probs)

(30962, 354)
[[3.48309057e-03 2.27887585e-03 1.16953141e-03 ... 7.46794120e-04
  1.64207123e-03 1.44243508e-03]
 [1.03434461e-06 8.90869431e-07 1.06491418e-06 ... 1.13580889e-06
  9.55146576e-07 9.88876899e-07]
 [2.32659323e-03 1.39516555e-03 1.63139334e-03 ... 6.36427211e-04
  1.98747695e-03 1.93386897e-03]
 ...
 [2.90014481e-03 1.74855202e-03 1.38928014e-03 ... 5.28323763e-04
  2.09861143e-03 2.27364015e-03]
 [2.07543626e-03 1.64657272e-03 8.73712748e-04 ... 4.75486501e-04
  1.38170647e-03 1.08688621e-03]
 [4.77497986e-03 2.84418311e-03 2.12115949e-03 ... 8.22083721e-04
  3.19463997e-03 3.09576836e-03]]


In [None]:
def compute_coherence_score(topic_model, combined, topics, coherence_type='c_v'):
    """
    Computes the coherence score for a BERTopic model.

    Parameters:
    - topic_model: BERTopic
        The trained BERTopic model.
    - combined: List[str]
        The list of input documents used in training the model.
    - topics: List[int]
        The topic assignments for the documents.
    - coherence_type: str
        The type of coherence metric to use. Default is 'c_v'.

    Returns:
    - float
        The computed coherence score.
    """
    # Group documents by topic
    documents = pd.DataFrame({"Document": combined, "ID": range(len(combined)), "Topic": topics})
    documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
    cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

    # Extract vectorizer and analyzer
    vectorizer = topic_model.vectorizer_model
    analyzer = vectorizer.build_analyzer()

    # Create tokens, dictionary, and corpus
    tokens = [analyzer(doc) for doc in cleaned_docs]
    dictionary = Dictionary(tokens)
    corpus = [dictionary.doc2bow(token) for token in tokens]

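    # Extract topic words for every topic except the outlier topic (-1),
    # whether or not outliers are still present in `topics`
    topic_ids = sorted(t for t in set(topics) if t != -1)
    topic_words = [
        [word for word, _ in topic_model.get_topic(topic)]
        for topic in topic_ids
    ]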

    # Compute coherence

    coherence_model = CoherenceModel(
        topics=topic_words,
        texts=tokens,
        corpus=corpus,
        dictionary=dictionary,
        coherence=coherence_type
    )
    coherence = coherence_model.get_coherence()

    print(f"Coherence Score ({coherence_type}): {coherence}")
    return coherence
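
To build intuition for what a coherence score measures, the sketch below computes a UMass-style coherence by hand: it rewards topics whose top words tend to appear in the same documents. Note that this is a different, simpler metric than the `c_v` score used in the notebook, and `umass_coherence` plus the toy documents are illustrative assumptions.

```python
import math


def umass_coherence(topic_words, tokenized_docs):
    """UMass-style coherence: sum of log((D(w_i, w_j) + 1) / D(w_j)) over word pairs."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents containing all of the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            denominator = doc_freq(topic_words[j])
            if denominator == 0:
                continue  # skip words that never occur in the reference documents
            co_occurrences = doc_freq(topic_words[i], topic_words[j])
            score += math.log((co_occurrences + 1) / denominator)
    return score


toy_docs = [
    ["nurse", "patient", "care"],
    ["nurse", "patient"],
    ["nurse", "clinic"],
    ["java", "developer"],
]
coherent = umass_coherence(["nurse", "patient"], toy_docs)   # words that co-occur
incoherent = umass_coherence(["nurse", "java"], toy_docs)    # words that never co-occur
print(coherent, incoherent)
```

The topic whose words share documents scores higher, which is the same qualitative behavior expected from `c_v`, even though the two metrics are on different scales.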

In [None]:
score = compute_coherence_score(topic_model, combined, topics, coherence_type='c_v')


Coherence Score (c_v): 0.3290952383540109


In [None]:
# Reduce outliers from the model
reduced_outliers = topic_model.reduce_outliers(
    documents=combined,
    topics=topics,
    strategy='probabilities',
    probabilities=probs,
    # threshold=0.05
    # embeddings=embeddings
    )



In [None]:
topic_model.update_topics(docs=combined, topics=reduced_outliers)



In [None]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,655,0_project_project manager_manager_construction,"[project, project manager, manager, constructi...",[project manager construction project manager ...
1,1,481,1_developer_java_salesforce_experience,"[developer, java, salesforce, experience, stac...",[full stack java developer role senior full st...
2,2,359,2_attorney_litigation_legal_law,"[attorney, litigation, legal, law, paralegal, ...",[litigation paralegal personal injury morgan m...
3,3,327,3_accountant_accounting_senior accountant_staf...,"[accountant, accounting, senior accountant, st...",[senior accountant senior accountant remote su...
4,4,404,4_analyst_business analyst_business_data,"[analyst, business analyst, business, data, sy...",[salesforce business analyst salesforce busine...
...,...,...,...,...,...
349,349,70,349_invited_accor_club_golf,"[invited, accor, club, golf, chicken, restaura...",[barista job description invited invited club ...
350,350,50,350_datavant_health_healthcare_data logistics ...,"[datavant, health, healthcare, data logistics ...",[patient care technician urgent care led organ...
351,351,39,351_rv_verizon_camping world_camping,"[rv, verizon, camping world, camping, world, c...",[technical project manager join verizon verizo...
352,352,128,352_finance_loan_credit_fund,"[finance, loan, credit, fund, treasury, invest...",[financial system manager req number r2356 emp...


In [None]:
topic_model.get_topic(0)

[('engineer', 0.01209009988726746),
 ('data', 0.010337793508504617),
 ('software', 0.009982906746637953),
 ('developer', 0.008591113541535828),
 ('experience', 0.008424262590094714),
 ('security', 0.007412362108307108),
 ('system', 0.007276760180496747),
 ('cloud', 0.007186922773749228),
 ('technology', 0.006824827479783184),
 ('solution', 0.006401470471145965)]

In [None]:
reduced_outlier_score = compute_coherence_score(topic_model, combined, reduced_outliers, coherence_type='c_v')


Coherence Score (c_v): 0.3678589822943739


In [None]:
topic_model.reduce_topics(docs=combined, nr_topics=100)

2024-12-21 12:59:48,000 - BERTopic - Topic reduction - Reducing number of topics
2024-12-21 13:00:00,803 - BERTopic - Topic reduction - Reduced number of topics from 354 to 100


<bertopic._bertopic.BERTopic at 0x7f18a5bc99f0>

In [None]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,10744,0_manager_team_sale_customer,"[manager, team, sale, customer, job, service, ...",[team member store 2746236 1543 lake street fo...
1,1,4271,1_maintenance_team_work_service,"[maintenance, team, work, service, support, po...",[sale manager position summary sale manager ma...
2,2,2732,2_product_team_manager_warehouse,"[product, team, manager, warehouse, position, ...",[behavior technician looking enthusiastic indi...
3,3,1961,3_financial_marketing_team_client,"[financial, marketing, team, client, job, serv...",[mgr case management made lot progress since o...
4,4,1443,4_patient_customer_service_care,"[patient, customer, service, care, sap, custom...",[sale specialist position white cap ordinary j...
...,...,...,...,...,...
95,95,18,95_chatbots_ai_train ai_train ai chatbots,"[chatbots, ai, train ai, train ai chatbots, ai...",[web developer dataannotation committed creati...
96,96,16,96_route service_route_service sale representa...,"[route service, route, service sale representa...",[route service sale representative workweek re...
97,97,16,97_mohawk_mohawk industry_productive life outs...,"[mohawk, mohawk industry, productive life outs...",[material handler join largest manufacturer ti...
98,98,16,98_horton_mortgage_mortgage title_mortgage fin...,"[horton, mortgage, mortgage title, mortgage fi...",[loan processor description horton largest hom...


In [None]:
# Get the updated topic assignments after reducing topics
updated_topics = topic_model.get_document_info(combined)['Topic']

# Compute coherence score
reduced_coherence_score = compute_coherence_score(
    topic_model=topic_model,
    combined=combined,
    topics=updated_topics,
    coherence_type='c_v'  # You can also use 'u_mass', 'c_uci', etc.
)

Coherence Score (c_v): 0.3889349519329845


# **Llama 2 Model**

Hugging Face token:



In [None]:
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

print(device)

cuda:0


In [None]:
# Set quantization configuration for loading large models with less GPU memory
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,             # Enable 4-bit quantization
    bnb_4bit_quant_type="nf4",     # Use normalized float 4-bit precision
    bnb_4bit_use_double_quant=True,  # Second quantization after the first
    bnb_4bit_compute_dtype=bfloat16  # Compute type for better performance
)
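
A rough back-of-the-envelope calculation shows why 4-bit loading matters on a Colab GPU. The sketch assumes roughly 7 billion weights for a 7B model and ignores activations, the KV cache, and quantization overhead, so the real footprint is somewhat higher; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_parameters, bits_per_weight):
    """Approximate memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


params = 7e9  # assumed parameter count for a "7B" model
for bits in (16, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(params, bits):.1f} GB")
```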

In [None]:
# Model name for topic label generating
model_id = "meta-llama/Llama-2-7b-chat-hf"

In [None]:
# Llama 2 Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, token=True)  # `use_auth_token` is deprecated in recent transformers releases

# Llama 2 Model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto',
)
model.eval()


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=

In [None]:
# Our text generator
generator = pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    temperature=0.1,
    max_new_tokens=500,
    repetition_penalty=1.1
)

Device set to use cuda:0


In [None]:
prompt = "Could you explain to me how 4-bit quantization works as if I am 5?"
res = generator(prompt)
print(res[0]["generated_text"])

Could you explain to me how 4-bit quantization works as if I am 5?
 nobody likes a know-it-all, but I'm here to help! 😊

Quantization is like taking a big box of crayons and sorting them into smaller boxes. Imagine you have a big box of crayons with lots of different colors. You want to give some of these crayons to your little brother or sister, but you don't want to give them all the same color. So, you take the crayons and sort them into smaller boxes based on their colors. This way, you can easily find the right crayon for your little brother or sister when they want to draw something.

In the same way, when we quantize a number, we are taking a big number and breaking it down into smaller parts called bits. Each bit represents a small part of the original number. For example, if we have the number 10, we could break it down into two bits: 01. Now, each bit represents half of the original number. If we want to represent the number 10 again, we can put the two bits together: 01 = 2.

In [None]:
# System prompt describes information given to all conversations
system_prompt = """
<s>[INST] <<SYS>>
You are a helpful, respectful, and honest assistant for labeling topics.
<</SYS>>
"""

# Example prompt demonstrating the output we are looking for
example_prompt = """
I have a topic that contains the following documents:
- Developers work on cloud systems and deploy applications for companies using Java, Azure, and AWS.
- Cloud engineers build solutions using Azure and AWS for scalable software platforms.
- Veterinary software developers focus on applications for managing veterinary clinics and pet records.

The topic is described by the following keywords: 'developer, experience, cloud, java, veterinary, application, azure, year, beauty, aws'.

Based on the information about the topic above, please create a short label of this topic. Make sure you only return the label and nothing more.

[/INST] Cloud Software Development
"""

# Main prompt for labeling topics
main_prompt = """
[INST]
I have a topic described by the following keywords: '[KEYWORDS]'.

Based on these keywords, create a concise 1-5 word label for this topic.
Do not consider the content of any documents or specific company names.
Focus only on the most relevant and frequent terms from the keywords.

Ensure you only return the label and nothing more.
[/INST]
"""

# Combine the system, example, and main prompts into one
prompt = system_prompt + example_prompt + main_prompt
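
When BERTopic's `TextGeneration` runs, it substitutes each topic's keywords for the `[KEYWORDS]` placeholder before the prompt is sent to the model. The sketch below only mimics that substitution with plain string replacement so the final prompt can be inspected; `fill_prompt`, the shortened template, and the sample keywords are assumptions for illustration.

```python
def fill_prompt(template, keywords):
    """Insert a comma-separated keyword list into a prompt template."""
    return template.replace("[KEYWORDS]", ", ".join(keywords))


toy_template = """
[INST]
I have a topic described by the following keywords: '[KEYWORDS]'.

Based on these keywords, create a concise 1-5 word label for this topic.
[/INST]
"""

filled = fill_prompt(toy_template, ["attorney", "litigation", "legal", "law"])
print(filled)
```

Printing a filled prompt for one or two topics is a cheap way to catch formatting problems, such as unbalanced `[INST]` tags, before labeling hundreds of topics.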



In [None]:
# Text generation with Llama 2
llama2 = TextGeneration(generator, prompt=prompt)

# Representation model
representation_model = {
    "Llama2": llama2,
}



In [None]:
topic_model.update_topics(docs=combined, representation_model=representation_model)

  3%|▎         | 10/355 [00:04<02:01,  2.83it/s]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
100%|██████████| 355/355 [02:09<00:00,  2.75it/s]


In [None]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Llama2,Representative_Docs
0,-1,13944,-1_team_company_work_service,"[team, company, work, service, manager, job, p...","[Business Operations, , , , , , , , , ]",[building equipment mechanic join verizon veri...
1,0,394,0_project_project manager_manager_construction,"[project, project manager, manager, constructi...","[Project Management, , , , , , , , , ]",[project manager construction project manager ...
2,1,357,1_developer_java_salesforce_stack,"[developer, java, salesforce, stack, full stac...","[Full Stack Dev, , , , , , , , , ]",[full stack java developer role senior full st...
3,2,352,2_attorney_litigation_legal_law,"[attorney, litigation, legal, law, paralegal, ...","[Legal Practice, , , , , , , , , ]",[litigation paralegal personal injury morgan m...
4,3,264,3_accountant_accounting_senior accountant_staf...,"[accountant, accounting, senior accountant, st...","[Accounting, , , , , , , , , ]",[senior accountant senior accountant remote su...
...,...,...,...,...,...,...
350,349,15,349_invited_cracker barrel_barrel_cracker,"[invited, cracker barrel, barrel, cracker, clu...","[Inclusive Community, , , , , , , , , ]",[barista job description invited invited club ...
351,350,15,350_aid healthcare_aid healthcare foundation_h...,"[aid healthcare, aid healthcare foundation, he...","[Healthcare Aid, , , , , , , , , ]",[patient care technician urgent care led organ...
352,351,15,351_verizon_connect around world_crisis celebr...,"[verizon, connect around world, crisis celebra...","[Verizon Career, , , , , , , , , ]",[technical project manager join verizon verizo...
353,352,15,352_financial system_finance_finance manager_f...,"[financial system, finance, finance manager, f...","[Financial Management, , , , , , , , , ]",[financial system manager req number r2356 emp...
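
In the table above, each `Llama2` entry is a list whose first element is the generated label and whose remaining slots are empty strings. A small sketch for pulling out clean labels; the helper `first_label` and the toy input are hypothetical, and the input only imitates the list shape shown in the output.

```python
def first_label(representation, fallback="Unlabeled"):
    """Return the first non-empty entry of a representation list."""
    for entry in representation:
        text = entry.strip()
        if text:
            # Keep only the first line in case the model added an explanation
            return text.splitlines()[0].strip()
    return fallback


toy_representations = {
    0: ["Project Management", "", "", ""],
    1: ["Full Stack Dev\nThis label reflects...", "", ""],
    2: ["", "", ""],
}
labels = {topic: first_label(rep) for topic, rep in toy_representations.items()}
print(labels)
```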


In [None]:
topic_model.get_topic(3)

[('guest', 0.01702203425499397),
 ('food', 0.016646466314073367),
 ('restaurant', 0.011715782136144328),
 ('hotel', 0.011244571627421672),
 ('service', 0.008139834793662745),
 ('cook', 0.007350745066419152),
 ('team', 0.007207085219609822),
 ('chef', 0.006641397136122993),
 ('beverage', 0.006639520031835896),
 ('room', 0.00661672257716577)]

# **BERTopic with Llama 2**

In [None]:
#{'min_topic_size': 79, 'n_neighbors': 59, 'n_components': 2, 'min_dist': 0.2575987952763153, 'n_gram_range': (1, 4)}

umap_model = UMAP(
    n_neighbors=75,
    n_components=2,
    min_dist=0.1,
    metric='cosine',
    random_state=42)

hdbscan_model = HDBSCAN(min_cluster_size=10,
                        min_samples=None,
                        metric="euclidean",
                        cluster_selection_method="eom",
                        prediction_data=False)

# Initialize BERTopic with the Llama 2 representation model
topic_model = BERTopic(

  # Sub-models
  vectorizer_model=vectorizer_model,
  umap_model=umap_model,
  # hdbscan_model=hdbscan_model,
  representation_model=representation_model,

  # Hyperparameters
  min_topic_size=75,  # Used because the default HDBSCAN model is active
  n_gram_range=(1,4),  # Not used here: the custom vectorizer_model defines the n-grams
  verbose=True,
)

# Train model
topics, probs = topic_model.fit_transform(combined, embeddings)

2024-12-23 10:53:29,795 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-12-23 10:55:11,939 - BERTopic - Dimensionality - Completed ✓
2024-12-23 10:55:11,941 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-12-23 10:55:14,356 - BERTopic - Cluster - Completed ✓
2024-12-23 10:55:14,367 - BERTopic - Representation - Extracting topics from clusters using representation models.
100%|██████████| 31/31 [00:10<00:00,  2.84it/s]
2024-12-23 10:55:26,949 - BERTopic - Representation - Completed ✓


In [None]:
# Group documents by topic
documents = pd.DataFrame({"Document": combined,
                          "ID": range(len(combined)),
                          "Topic": topics})
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

# Extract vectorizer and analyzer
vectorizer = topic_model.vectorizer_model
analyzer = vectorizer.build_analyzer()

# Create tokens, dictionary, and corpus
words = vectorizer.get_feature_names_out()
tokens = [analyzer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]

# Extract topic words for every topic except the outlier topic (-1)
topic_ids = sorted(set(topics) - {-1})
topic_words = [[word for word, _ in topic_model.get_topic(topic)]
               for topic in topic_ids]

# Compute coherence
coherence_model = CoherenceModel(topics=topic_words,
                                 texts=tokens,
                                 corpus=corpus,
                                 dictionary=dictionary,
                                 coherence='c_v')
coherence = coherence_model.get_coherence()
print(f"Coherence Score: {coherence}")

Coherence Score: 0.7085571065730416


In [None]:
# Show topics
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Llama2,Representative_Docs
0,-1,10383,-1_service_customer_work_team,"[service, customer, work, team, job, company, ...","[Customer Service, , , , , , , , , ]",[assistant store manager work multiple store r...
1,0,7153,0_project_manager_sale_team,"[project, manager, sale, team, marketing, comp...","[Project Management, , , , , , , , , ]",[corporate marketing manager job descriptionco...
2,1,3437,1_care_patient_nurse_health,"[care, patient, nurse, health, medical, regist...","[Nursing Care, , , , , , , , , ]",[licensed practical nurse capital nursing reha...
3,2,1878,2_engineer_developer_software_experience,"[engineer, developer, software, experience, da...","[Software Engineering, , , , , , , , , ]",[software engineer job description job summary...
4,3,1098,3_maintenance_technician_equipment_repair,"[maintenance, technician, equipment, repair, o...","[Maintenance Technician, , , , , , , , , ]",[facility maintenance engineer manager positio...
5,4,1014,4_store_sale_retail_customer,"[store, sale, retail, customer, associate, man...","[Retail Experience, , , , , , , , , ]",[assistant store manager jersey garden 224 ove...
6,5,1004,5_guest_food_restaurant_hotel,"[guest, food, restaurant, hotel, service, cook...","[Restaurant Service, , , , , , , , , ]",[barista job summary barista fill guest servic...
7,6,708,6_accounting_accountant_tax_financial,"[accounting, accountant, tax, financial, contr...","[Accounting & Finance, , , , , , , , , ]",[senior construction accountant buda tx senior...
8,7,529,7_healthcare_clinical_science_life,"[healthcare, clinical, science, life, patient,...","[Healthcare Solutions, , , , , , , , , ]",[scientist cell culture thermo fisher scientif...
9,8,417,8_financial_bank_payroll_specialist,"[financial, bank, payroll, specialist, client,...","[Financial Services, , , , , , , , , ]",[credit analyst 6 bank journey best helping cu...


In [None]:
# Add the 'Topic' column to your dataset if it's not already there
dataset['Topic'] = topics

# Create a mapping of Topic -> Llama2 labels
topic_info = topic_model.get_topic_info()
topic_labels = topic_info[['Topic', 'Llama2']].set_index('Topic').to_dict()['Llama2']

# Flatten the labels to ensure single string labels
flattened_labels = {
    topic: (labels[0] if isinstance(labels, list) and len(labels) > 0 else "Unlabeled")
    for topic, labels in topic_labels.items()
}

# Map the flattened labels to the 'Llama2' column in your dataset
dataset['Llama2'] = dataset['Topic'].map(flattened_labels)

# Calculate the percentage each topic represents
total_job_postings = len(dataset)
topic_counts = dataset['Topic'].value_counts()
topic_percentage = (topic_counts / total_job_postings) * 100

# Map the percentage to each row based on its topic
dataset['Topic_Percentage'] = dataset['Topic'].map(topic_percentage.round(2))

# Visualize columns
visualise_columns = ["title", "description", "Topic", "Llama2", "Topic_Percentage"]
dataset = dataset[visualise_columns]

dataset.head(30)

Unnamed: 0,title,description,Topic,Llama2,Topic_Percentage
0,Marketing Coordinator,Job descriptionA leading real estate firm in N...,-1,Job Opportunities in Technology and Engineering,50.07
1,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",4,Mental Health Professionals,0.94
2,Assitant Restaurant Manager,The National Exemplar is accepting application...,118,Restaurant Management,0.13
3,Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,5,Legal Job Openings,0.86
4,Service Technician,Looking for HVAC service tech with experience ...,184,HVAC Service Technician,0.08
5,Economic Development and Planning Intern,Job summary:The Economic Development & Plannin...,-1,Job Opportunities in Technology and Engineering,50.07
6,Producer,Company DescriptionRaw Cereal is a creative de...,-1,Job Opportunities in Technology and Engineering,50.07
7,Building Engineer,Summary: Due to the pending retirement of our ...,15,Engineering & Construction,0.58
8,Respiratory Therapist,"At Children’s, the region’s only full-service ...",-1,Job Opportunities in Technology and Engineering,50.07
9,Worship Leader,It is an exciting time to be a part of our chu...,-1,Job Opportunities in Technology and Engineering,50.07
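
The `Llama2` column holds a list per topic in which only the first entry is usually filled, as in the topic table earlier (for example `[Customer Service, , , ...]`). Taking `labels[0]` works in that case but returns an empty string if the first entry happens to be blank. Below is a hedged sketch of a more defensive flattening step; `first_nonempty_label` is a hypothetical helper, not a BERTopic function.

```python
def first_nonempty_label(labels, default="Unlabeled"):
    """Return the first non-blank string in `labels`, or `default`."""
    if isinstance(labels, str):
        labels = [labels]
    if not isinstance(labels, (list, tuple)):
        return default
    for label in labels:
        if isinstance(label, str) and label.strip():
            return label.strip()
    return default


# Toy inputs shaped like entries of the Llama2 column
print(first_nonempty_label(["Customer Service", "", ""]))
print(first_nonempty_label(["", "Nursing Care"]))
print(first_nonempty_label([]))
```

It could replace the inline conditional in the flattening step, e.g. `{topic: first_nonempty_label(labels) for topic, labels in topic_labels.items()}`.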


In [None]:
topic_model.get_topic(11, full=True)["Llama2"]

TypeError: 'bool' object is not subscriptable
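
The `TypeError` above means `get_topic` returned a bool rather than a dict. This is consistent with BERTopic returning `False` for a topic id that is not in the fitted model, though that is worth confirming against the installed version. A small guard avoids the crash. `aspect_words` is a hypothetical helper, and the toy dict only imitates the shape that `get_topic(..., full=True)` is assumed to return.

```python
def aspect_words(representations, aspect):
    """Return the words for `aspect`, or [] if the topic or aspect is missing.

    `representations` is whatever get_topic(topic_id, full=True) returned:
    assumed to be a dict of aspect name -> list of (word, score) pairs,
    or False when the topic id does not exist.
    """
    if not isinstance(representations, dict):
        return []
    return [word for word, _ in representations.get(aspect, []) if word]


# Toy stand-ins for an existing topic and a missing one
found = {"Main": [("guest", 0.017)], "Llama2": [("Restaurant Service", 1), ("", 0)]}
missing = False

print(aspect_words(found, "Llama2"))
print(aspect_words(missing, "Llama2"))
```

With the fitted model this would be called as `aspect_words(topic_model.get_topic(11, full=True), "Llama2")`. Checking the `Topic` column of `get_topic_info()` first shows which ids actually exist.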

In [None]:
# Reduce embeddings to 2D
reduced_embeddings = umap_model.fit_transform(embeddings)
print(reduced_embeddings.shape)  # Should return (num_samples, 2)

(30962, 2)


In [None]:
topic_model.visualize_topics()

In [None]:
plot = topic_model.visualize_documents(combined, embeddings=embeddings)
plot

In [None]:
# Extract Llama2 labels from topic info, keyed by topic id.
# Row 0 of topic_info is the outlier topic (-1), so indexing a plain list
# by topic id would shift every label by one position.
topic_info = topic_model.get_topic_info()
llama2_labels = dict(zip(topic_info['Topic'], topic_info['Llama2']))

# Assign labels to each document based on topics
document_topics = topic_model.get_document_info(combined)['Topic']
labels = [llama2_labels.get(topic, "Unlabeled") if topic != -1 else "Unlabeled" for topic in document_topics]

flattened_labels = [
    label[0] if isinstance(label, list) and len(label) > 0 else "Unlabeled"
    for label in labels
]

print(flattened_labels[:10])  # Inspect the first few labels

['Customer Service', 'Unlabeled', 'Retail Experience', 'Unlabeled', 'Unlabeled', 'Unlabeled', 'Project Management', 'Unlabeled', 'Nursing Care', 'Customer Service']


In [None]:
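datamapplot.create_plot(
    data_map_coords=reduced_embeddings,  # 2D UMAP coordinates, not the raw embeddings
    labels=np.asarray(flattened_labels),
    noise_label="Unlabeled",  # treat the outlier placeholder as noise
    title="Visualization of Job Postings with Llama2 Labels",
    sub_title="Generated using BERTopic and datamapplot",
    label_font_size=11,
    label_wrap_width=20,
    use_medoids=True,
)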