# Semantic search for job postings
---

**Use case:**
- A user submits a query listing their main skills and gets LinkedIn job listings that match those skills. Plain general-purpose similarity may not be the best option here, because queries differ greatly from job postings.

**The task:**
- Fine-tune an embedding model to optimize semantic job search.

**Example:**
- input: "data scientist 6 years experience, LLMs, credit risk, content marketing"
- output: job posting for position "Technical curriculum developer III, AI/ML, Cloud Learning"
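
The matching step in this example can be sketched as a nearest-neighbour search over embedding vectors. The snippet below is only a minimal illustration, not the notebook's implementation: the three-dimensional vectors are made-up placeholders standing in for real model embeddings, the two extra posting titles are invented for contrast, and `rank_postings` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_postings(query_vec, posting_vecs):
    """Return posting titles sorted from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), title)
        for title, vec in posting_vecs.items()
    ]
    scored.sort(reverse=True)
    return [title for _, title in scored]


# Toy vectors (assumption): a real system would obtain these from an embedding model.
query_vec = [0.9, 0.1, 0.3]
posting_vecs = {
    "Technical curriculum developer III, AI/ML, Cloud Learning": [0.8, 0.2, 0.4],
    "Warehouse associate": [0.1, 0.9, 0.1],  # invented title for contrast
    "Staff accountant": [0.2, 0.3, 0.9],     # invented title for contrast
}

ranking = rank_postings(query_vec, posting_vecs)
print(ranking[0])
```

With a general-purpose model, the vector for a short skills query and the vector for a long job description may not land this close together, which is the gap fine-tuning is meant to narrow.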



> **Fine-tuning text embeddings for a specific use case**
>
> A bit on embeddings and fine-tuning:
> - The goal is to fine-tune general-purpose text embeddings for a specific domain or use case, in order to improve semantic search.
> - Fine-tuning means adapting a model to a specific use case through additional training.
> - A text embedding is a semantically meaningful vector representation of text.
> - With base-model embeddings we retrieve the most similar chunks, but the most similar chunk is not always the one we are looking for.
> - A fine-tuned model can be fine-tuned further to add specific behaviour.
>
> Method:
> - **Contrastive Learning** (CL) is a machine learning technique that trains a model to build a representation space (embedding space) in which similar data points are mapped close to each other and dissimilar data points are pushed farther apart.
> - The core idea of contrastive learning revolves around the construction and use of sample pairs:
>     - Anchor Sample (x): The original data point.
>     - Positive Pair (x+): A data point that is semantically similar or related to the anchor.
>     - Negative Pairs (x−): Data points that are semantically dissimilar or unrelated to the anchor.
> - The training objective: the model (an encoder network) is trained using a contrastive loss function (like InfoNCE or Triplet Loss) that optimizes the distances between the embeddings of these pairs:
>     - Pull Positives Closer: The loss function penalizes the model when the embedding distance between the Anchor and the Positive Pair is large (or their similarity is low). This forces similar data into tight clusters in the embedding space.
>     - Push Negatives Apart: The loss function penalizes the model when the embedding distance between the Anchor and a Negative Pair is small (or their similarity is high). This separates dissimilar data, ensuring the embeddings are highly discriminative.
>
> By minimizing this loss, the model learns meaningful vector representations (embeddings) that effectively capture the inherent similarity and dissimilarity relationships within the data.
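
The pull/push behaviour described above can be made concrete with a small sketch of the two losses named in the text. This is an illustration under stated assumptions, not the training code used later: it uses cosine similarity as the similarity measure, and the two-dimensional vectors, the `margin` and `temperature` defaults, and the function names are all hypothetical choices.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Triplet loss on cosine distance (1 - cosine similarity).

    Zero once the positive is at least `margin` more similar to the
    anchor than the negative is; positive otherwise.
    """
    sim_pos = cosine_similarity(anchor, positive)
    sim_neg = cosine_similarity(anchor, negative)
    return max(0.0, sim_neg - sim_pos + margin)


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: cross-entropy of picking the positive among all candidates."""
    logits = [cosine_similarity(anchor, positive) / temperature]
    logits += [cosine_similarity(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidate logits.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]


# Toy embeddings (assumption): one vector near the anchor, one far from it.
anchor = [1.0, 0.0]
near = [0.9, 0.1]
far = [0.1, 0.9]

# Well-arranged space: the positive is near the anchor, the negative is far.
triplet_good = triplet_loss(anchor, positive=near, negative=far)
nce_good = info_nce_loss(anchor, positive=near, negatives=[far])

# Badly arranged space: the positive is far, the negative is near.
triplet_bad = triplet_loss(anchor, positive=far, negative=near)
nce_bad = info_nce_loss(anchor, positive=far, negatives=[near])

print(triplet_good, triplet_bad)
print(nce_good, nce_bad)
```

In both cases the loss is small when the positive already sits closer to the anchor than the negative, and large when the arrangement is reversed, so gradient descent on either loss moves the encoder toward the first arrangement.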

### 1. Data

In [None]:
# Dependencies


---
[Reference link](https://www.youtube.com/watch?v=hOLBrIjRAj4 "Fine-Tuning Text Embeddings For Domain-specific Search (w/ Python)")