# Semantic search for job postings
---


### Intro

**Use case:**
- A user submits a query listing their main skills and gets LinkedIn job listings that match those skills. Plain general-purpose similarity may not be the best option here, because queries differ greatly from job postings.

**The task:**
- Fine-tune an embedding model to optimize semantic job search.

**Example:**
- input: "data scientist, 6 years experience, LLMs, credit risk, content marketing"
- positive match: job listings for a data scientist position
- negative match: job listings for a data engineer position
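
The example above can be read as a ranking problem: embed the query and each listing, then sort listings by cosine similarity. The sketch below uses made-up 3-dimensional vectors in place of real model embeddings; the vectors and the `rank_listings` helper are illustrative assumptions, not part of the original notebook.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_listings(query_vec, listing_vecs):
    """Return (title, score) pairs sorted from most to least similar."""
    scored = [
        (title, cosine_similarity(query_vec, vec))
        for title, vec in listing_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings; a real system would get these from the model.
query_vec = [0.9, 0.2, 0.1]
listing_vecs = {
    "Data Engineer": [0.2, 0.9, 0.3],
    "Data Scientist": [0.8, 0.3, 0.1],
}

ranking = rank_listings(query_vec, listing_vecs)
for title, score in ranking:
    print(f"{title}: {score:.3f}")
```

With a well-tuned embedding model, the data scientist listing should score higher than the data engineer listing for this query, as it does with these toy vectors.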

> **Fine-tuning text embeddings for a specific use case**
>
> A bit on embeddings and fine-tuning:
> - The purpose is to fine-tune general-purpose text embeddings for a specific domain or use case, in order to improve semantic search.
> - Fine-tuning means adapting a model to a specific use case through additional training.
> - A text embedding is a semantically meaningful vector representation of text.
> - With base model embeddings we retrieve the most similar chunks, but the most similar chunk is not always the one we are looking for.
> - A fine-tuned model can be fine-tuned further to add specific behaviour.
>
> Method:
> - **Contrastive Learning** (CL) in embeddings is a machine learning technique used to train models to create a representation space (embedding space) where similar data points are mapped close to each other, and dissimilar data points are pushed farther apart.
> - The core idea of contrastive learning revolves around the construction and use of sample pairs:
>     - Anchor Sample (x): The original data point.
>     - Positive Pair (x+): A data point that is semantically similar or related to the anchor.
>     - Negative Pairs (x−): Data points that are semantically dissimilar or unrelated to the anchor.
> - The training objective: the model (an encoder network) is trained using a contrastive loss function (like InfoNCE or Triplet Loss) that optimizes the distances between the embeddings of these pairs:
>     - Pull Positives Closer: The loss function penalizes the model when the embedding distance between the Anchor and the Positive Pair is large (or their similarity is low). This forces similar data into tight clusters in the embedding space.
>     - Push Negatives Apart: The loss function penalizes the model when the embedding distance between the Anchor and a Negative Pair is small (or their similarity is high). This separates dissimilar data, ensuring the embeddings are highly discriminative.
>
> By minimizing this loss, the model learns meaningful vector representations (embeddings) that effectively capture the inherent similarity and dissimilarity relationships within the data.
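
To make the two loss functions named above concrete, here is a minimal sketch of a triplet loss and an InfoNCE loss for a single anchor, written in plain Python. The cosine-based distance, the `margin` and `temperature` values, and the toy vectors are assumptions chosen for illustration; training libraries implement batched versions of these ideas.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the negative is at least `margin` farther away than the positive."""
    d_pos = 1.0 - cosine_similarity(anchor, positive)  # cosine distance
    d_neg = 1.0 - cosine_similarity(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Cross-entropy of picking the positive among positive + negatives."""
    pos_logit = cosine_similarity(anchor, positive) / temperature
    logits = [pos_logit] + [
        cosine_similarity(anchor, neg) / temperature for neg in negatives
    ]
    # Numerically stable log-sum-exp over all logits.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - pos_logit

# Hypothetical toy embeddings: x_pos is close to the anchor, x_neg is orthogonal.
x = [1.0, 0.0]
x_pos = [0.9, 0.1]
x_neg = [0.0, 1.0]

good_triplet = triplet_loss(x, x_pos, x_neg)   # positive near, negative far
bad_triplet = triplet_loss(x, x_neg, x_pos)    # roles swapped
good_nce = info_nce_loss(x, x_pos, [x_neg])
bad_nce = info_nce_loss(x, x_neg, [x_pos])
print(good_triplet, bad_triplet, good_nce, bad_nce)
```

Both losses are small when the positive sits close to the anchor and the negative sits far away, and grow when the roles are swapped, which is the pull/push behaviour described above.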

### Dataset

**Dataset construction:**
1. Download the job listings dataset: https://huggingface.co/datasets/datastax/linkedin_job_listings
2. Keep only: ["Data Scientist", "Data Analyst", "Machine Learning Engineer", "Data Engineer", "AI Engineer", "Deep Learning"]
3. Keep the job description field (`job_description_pos`)
4. Create a dataset with queries generated from the job descriptions (`query`)
5. Create negative sample pairs by matching each query with its least similar job descriptions (`job_description_neg`)
6. Shuffle and split: train: 0.8, test: 0.1, validation: 0.1
7. Upload to Hugging Face: `dataset_dict.push_to_hub("shawhin/ai-job-embedding-finetuning")`
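
Steps 5 and 6 can be sketched as follows. The source does not show how the negatives or the split were actually produced, so this is only one plausible reading: the helper names, the precomputed similarity matrix, and the seed are all hypothetical. Entry `similarity[i][j]` is assumed to hold the similarity between query `i` and job description `j`, with description `i` being the positive for query `i`.

```python
import random

def pick_least_similar_negatives(similarity):
    """For each query i, return the index of the least similar
    job description, skipping its own positive (index i)."""
    negatives = []
    for i, row in enumerate(similarity):
        candidates = [j for j in range(len(row)) if j != i]
        negatives.append(min(candidates, key=lambda j: row[j]))
    return negatives

def shuffle_and_split(rows, train=0.8, test=0.1, seed=42):
    """Shuffle rows and split into train/test/validation;
    validation receives whatever remains after train and test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_test = int(len(rows) * test)
    return {
        "train": rows[:n_train],
        "test": rows[n_train:n_train + n_test],
        "validation": rows[n_train + n_test:],
    }

# Hypothetical query-vs-description similarity scores for three pairs.
similarity = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.6],
    [0.4, 0.3, 0.7],
]
negative_idx = pick_least_similar_negatives(similarity)
print(negative_idx)

splits = shuffle_and_split(range(100))
print({name: len(part) for name, part in splits.items()})
```

Because the row counts are truncated to integers, split sizes can differ by a row or two from the exact 0.8/0.1/0.1 proportions when the dataset size is not a multiple of ten.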

In [None]:
# Load environment variables

import dotenv
dotenv.load_dotenv()    

True

In [None]:
# Download dataset from huggingface

from datasets import load_dataset
import pandas as pd

# Data: https://huggingface.co/datasets/shawhin/ai-job-embedding-finetuning
dataset = load_dataset("shawhin/ai-job-embedding-finetuning")

print("*"*50)
print(dataset) # structure
print("*"*50)
# print(dataset['train'][0]) # example
# Print the first 10 training examples
for i in range(10):
    print(dataset['train'][i])
print("*"*50)


**************************************************
DatasetDict({
    train: Dataset({
        features: ['query', 'job_description_pos', 'job_description_neg'],
        num_rows: 809
    })
    validation: Dataset({
        features: ['query', 'job_description_pos', 'job_description_neg'],
        num_rows: 101
    })
    test: Dataset({
        features: ['query', 'job_description_pos', 'job_description_neg'],
        num_rows: 102
    })
})
**************************************************
{'query': 'Data engineering Azure cloud Apache Spark Kafka', 'job_description_pos': 'Skills:Proven experience in data engineering and workflow development.Strong knowledge of Azure cloud services.Proficiency in Apache Spark and Apache Kafka.Excellent programming skills in Python/Java.Hands-on experience with Azure Synapse, DataBricks, and Azure Data Factory.\nNice To Have Skills:Experience with BI Tools such as Tableau or Power BI.Familiarity with Terraform for infrastructure as code.Knowledge of 

### Fine-tuning

---
References:
- [Fine-Tuning Text Embeddings For Domain-specific Search (w/ Python)](https://www.youtube.com/watch?v=hOLBrIjRAj4 "Fine-Tuning Text Embeddings For Domain-specific Search (w/ Python)")
- [Sentence transformer - Trainer](https://sbert.net/docs/sentence_transformer/training_overview.html#trainer "Sentence transformer - Trainer")