# Job Matching with Vector Similarity Search

This notebook demonstrates how to use ChromaDB for semantic job matching using vector similarity search. We'll load 1000 job postings and show how natural language queries can find relevant jobs even without exact keyword matches.


## 1. Understanding the Data Set

We'll start by reviewing `jobs_1000.jsonl` to understand its structure and content. Knowing the file's format and fields up front makes the loading steps that follow easier to design.

- File Name: `jobs_1000.jsonl`
- One JSON object per row
- Each object contains the following fields:
    - *title*
    - *description*
    - *location*
    - *salary*
- The file contains 1000 rows
- Here's a sample JSON object, formatted for better readability:

```json
{
    "title": "Lead Data Scientist", 
    "description": "We're seeking a talented lead data scientist to join our leading SaaS provider. You'll lead technical initiatives and mentor junior developers using NumPy, Spark and Airflow. Our hybrid work model fosters collaboration and innovation. We offer learning and development budget and are committed to your professional growth.", 
    "location": "Miami, FL", 
    "salary": "$139,607 - $224,885"
}
```
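
As a rough illustration of the structure described above, the sketch below parses JSONL lines and checks that each record carries the four expected fields. The helper name `read_jobs` and the shortened description text are assumptions made for this example; an in-memory `io.StringIO` stands in for the real file.

```python
import io
import json

REQUIRED_FIELDS = {"title", "description", "location", "salary"}

def read_jobs(lines):
    """Parse JSONL lines into dicts, skipping blank lines and checking fields."""
    jobs = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        job = json.loads(line)
        missing = REQUIRED_FIELDS - job.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing fields {sorted(missing)}")
        jobs.append(job)
    return jobs

# Stand-in for the real file: the sample record with a shortened description.
sample = io.StringIO(
    '{"title": "Lead Data Scientist", "description": "Shortened for this example.", '
    '"location": "Miami, FL", "salary": "$139,607 - $224,885"}\n'
)
jobs_preview = read_jobs(sample)
print(jobs_preview[0]["title"])  # Lead Data Scientist
```

Passing an open file handle instead of `sample` would work the same way, since both yield one line per iteration.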

## 2. Introduction to ChromaDB and Setup

We'll use ChromaDB to create a vector database of job postings. ChromaDB automatically computes embeddings for our documents, enabling semantic search capabilities.


In [1]:
import chromadb

In [2]:
# Create ChromaDB client
client = chromadb.Client()

In [3]:
client.get_settings()

Settings(environment='', chroma_api_impl='chromadb.api.rust.RustBindingsAPI', chroma_server_nofile=None, chroma_server_thread_pool_size=40, tenant_id='default', topic_namespace='default', chroma_server_host=None, chroma_server_headers=None, chroma_server_http_port=None, chroma_server_ssl_enabled=False, chroma_server_ssl_verify=None, chroma_server_api_default_path=<APIVersion.V2: '/api/v2'>, chroma_server_cors_allow_origins=[], chroma_http_keepalive_secs=40.0, chroma_http_max_connections=None, chroma_http_max_keepalive_connections=None, is_persistent=False, persist_directory='./chroma', chroma_memory_limit_bytes=0, chroma_segment_cache_policy=None, allow_reset=False, chroma_auth_token_transport_header=None, chroma_client_auth_provider=None, chroma_client_auth_credentials=None, chroma_server_auth_ignore_paths={'APIVersion.V2': ['GET'], 'APIVersion.V2/heartbeat': ['GET'], 'APIVersion.V2/version': ['GET'], 'APIVersion.V1': ['GET'], 'APIVersion.V1/heartbeat': ['GET'], 'APIVersion.V1/version

In [4]:
dict(client.get_settings())

{'environment': '',
 'chroma_api_impl': 'chromadb.api.rust.RustBindingsAPI',
 'chroma_server_nofile': None,
 'chroma_server_thread_pool_size': 40,
 'tenant_id': 'default',
 'topic_namespace': 'default',
 'chroma_server_host': None,
 'chroma_server_headers': None,
 'chroma_server_http_port': None,
 'chroma_server_ssl_enabled': False,
 'chroma_server_ssl_verify': None,
 'chroma_server_api_default_path': <APIVersion.V2: '/api/v2'>,
 'chroma_server_cors_allow_origins': [],
 'chroma_http_keepalive_secs': 40.0,
 'chroma_http_max_connections': None,
 'chroma_http_max_keepalive_connections': None,
 'is_persistent': False,
 'persist_directory': './chroma',
 'chroma_memory_limit_bytes': 0,
 'chroma_segment_cache_policy': None,
 'allow_reset': False,
 'chroma_auth_token_transport_header': None,
 'chroma_client_auth_provider': None,
 'chroma_client_auth_credentials': None,
 'chroma_server_auth_ignore_paths': {'APIVersion.V2': ['GET'],
  'APIVersion.V2/heartbeat': ['GET'],
  'APIVersion.V2/version'

## 3. Create ChromaDB Collection

We'll create a new collection to store our job postings. ChromaDB will automatically compute embeddings when we add documents.


In [5]:
collections = client.list_collections()

In [6]:
collections

[]

In [7]:
if len(collections) > 0:
    for collection in collections:
        if collection.name == "job_postings":
            print("Collection already exists, deleting it")
            client.delete_collection(collection.name)


In [8]:
# Create a new collection for job postings
collection = client.create_collection("job_postings")

In [9]:
collections = client.list_collections()

In [10]:
collections

[Collection(name=job_postings)]

In [11]:
# Use a separate loop variable so `collection` keeps pointing at the collection created above
for existing in collections:
    print(existing.name)

job_postings


In [12]:
print(f"Collection created: {collection.name}")
print(f"Documents: {collection.count()}")


Collection created: job_postings
Documents: 0


## 4. Understanding the Basic ChromaDB Operations

In this section, we'll explore the basic ChromaDB operations by performing the following steps:

1. **Read Data**: We'll read a single job posting from the `jobs_1000.jsonl` file to understand the data structure.
2. **Insert Data**: We'll insert the record into a ChromaDB collection and validate the insertion.
3. **Delete Data**: We'll delete the data from the collection to prepare it for loading the entire dataset.

This process will help us understand the data load workflow and ensure a smooth integration with ChromaDB.

In [13]:
# Read all lines; the context manager closes the file afterwards
with open('jobs_1000.jsonl', encoding='utf-8') as f:
    jobs = f.read().splitlines()

In [14]:
type(jobs)

list

In [15]:
type(jobs[0])

str

In [16]:
len(jobs)

1000

In [18]:
jobs[0]

'{"title": "Lead Data Scientist", "description": "We\'re seeking a talented lead data scientist to join our leading SaaS provider. You\'ll lead technical initiatives and mentor junior developers using NumPy, Spark and Airflow. Our hybrid work model fosters collaboration and innovation. We offer learning and development budget and are committed to your professional growth.", "location": "Miami, FL", "salary": "$139,607 - $224,885"}'

In [19]:
import json

In [20]:
job = json.loads(jobs[0])

In [21]:
type(job)

dict

In [22]:
job

{'title': 'Lead Data Scientist',
 'description': "We're seeking a talented lead data scientist to join our leading SaaS provider. You'll lead technical initiatives and mentor junior developers using NumPy, Spark and Airflow. Our hybrid work model fosters collaboration and innovation. We offer learning and development budget and are committed to your professional growth.",
 'location': 'Miami, FL',
 'salary': '$139,607 - $224,885'}

In [23]:
id = 1

In [24]:
description = job['description']

In [25]:
del job['description']

In [26]:
job

{'title': 'Lead Data Scientist',
 'location': 'Miami, FL',
 'salary': '$139,607 - $224,885'}

In [27]:
collection.delete(ids=[f"job_{id}"])

In [28]:
collection.add(
    documents=[description],
    ids=[f"job_{id}"],
    metadatas=[job]
)

In [29]:
collection.count()

1

In [30]:
collection.get()

{'ids': ['job_1'],
 'embeddings': None,
 'documents': ["We're seeking a talented lead data scientist to join our leading SaaS provider. You'll lead technical initiatives and mentor junior developers using NumPy, Spark and Airflow. Our hybrid work model fosters collaboration and innovation. We offer learning and development budget and are committed to your professional growth."],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': [{'title': 'Lead Data Scientist',
   'location': 'Miami, FL',
   'salary': '$139,607 - $224,885'}]}

In [31]:
collection.get(include=["documents", "metadatas", "embeddings"])

{'ids': ['job_1'],
 'embeddings': array([[-8.56935084e-02, -7.63464272e-02, -2.85115447e-02,
          2.71393172e-02,  1.73438936e-02, -8.86584148e-02,
         -2.58482154e-02, -4.57224762e-03, -7.12163299e-02,
          2.18408834e-02, -9.21703056e-02, -5.44975810e-02,
          2.08729636e-02,  1.03937695e-02,  2.23925561e-02,
          3.29139940e-02, -6.00729696e-02, -7.63795823e-02,
          2.27446854e-03, -1.17449425e-01, -1.32697999e-01,
         -1.84980780e-02, -3.22042294e-02, -7.11596906e-02,
          3.69420946e-02, -1.94074865e-02,  2.80121211e-02,
          1.07993055e-02, -2.90644672e-02, -2.65571456e-02,
          5.93542401e-03, -1.21784788e-02,  1.91518608e-02,
          6.49471804e-02,  3.83931696e-02,  5.19959852e-02,
          1.95624493e-02, -2.12178994e-02, -5.07351831e-02,
         -1.06918784e-02,  1.63193792e-02, -5.79283908e-02,
         -1.35355238e-02,  2.09595915e-02, -2.12800782e-02,
         -7.08005354e-02, -1.92942470e-02, -5.15225679e-02,
       

In [32]:
collection.delete(ids=[f"job_{id}"])

In [33]:
collection.get()

{'ids': [],
 'embeddings': None,
 'documents': [],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': []}

In [34]:
# collection.update modifies an existing record, identified by its id
# collection.upsert inserts a record, or updates it if the id already exists

## 5. Load Data from JSONL

We'll read the job postings from `jobs_1000.jsonl` and prepare them for insertion into ChromaDB. Each record contains:
- **title**: Job title
- **description**: Rich job description
- **location**: Job location
- **salary**: Salary range

For semantic search, we'll combine the title and description as the document text, and store title, location, and salary as metadata for filtering.


In [35]:
# Read and parse JSONL file
documents = []
ids = []
metadatas = []

with open('jobs_1000.jsonl', 'r', encoding='utf-8') as f:
    for idx, line in enumerate(f):
        job = json.loads(line.strip())
        
        # Combine title and description for semantic search
        doc_text = f"{job['title']}: {job['description']}"
        documents.append(doc_text)
        
        # Generate unique ID
        ids.append(f"job_{idx}")
        
        # Store metadata for filtering
        metadatas.append({
            "title": job['title'],
            "location": job['location'],
            "salary": job['salary']
        })

print(f"Loaded {len(documents)} job postings")

Loaded 1000 job postings


In [36]:
documents[0]

"Lead Data Scientist: We're seeking a talented lead data scientist to join our leading SaaS provider. You'll lead technical initiatives and mentor junior developers using NumPy, Spark and Airflow. Our hybrid work model fosters collaboration and innovation. We offer learning and development budget and are committed to your professional growth."

In [37]:
ids[0]

'job_0'

In [38]:
metadatas[0]

{'title': 'Lead Data Scientist',
 'location': 'Miami, FL',
 'salary': '$139,607 - $224,885'}

In [39]:
# Preview first job posting
print("Sample job posting:")
print(f"ID: {ids[0]}")
print(f"Document: {documents[0][:200]}...")
print(f"Metadata: {metadatas[0]}")


Sample job posting:
ID: job_0
Document: Lead Data Scientist: We're seeking a talented lead data scientist to join our leading SaaS provider. You'll lead technical initiatives and mentor junior developers using NumPy, Spark and Airflow. Our ...
Metadata: {'title': 'Lead Data Scientist', 'location': 'Miami, FL', 'salary': '$139,607 - $224,885'}
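
The salary in the metadata above is a single string, so it can only be matched exactly. One possible refinement, sketched here under assumptions, is to parse the range into two integers and store them as extra metadata fields. The function `parse_salary_range` and the field names `salary_min` and `salary_max` are hypothetical and not part of the original notebook.

```python
import re

def parse_salary_range(salary):
    """Turn a string like '$139,607 - $224,885' into (139607, 224885)."""
    amounts = [int(m.replace(",", "")) for m in re.findall(r"\d[\d,]*", salary)]
    if len(amounts) != 2:
        raise ValueError(f"unexpected salary format: {salary!r}")
    return min(amounts), max(amounts)

sample_metadata = {
    "title": "Lead Data Scientist",
    "location": "Miami, FL",
    "salary": "$139,607 - $224,885",
}
salary_min, salary_max = parse_salary_range(sample_metadata["salary"])

# Hypothetical enriched metadata with numeric bounds alongside the original string
enriched = {**sample_metadata, "salary_min": salary_min, "salary_max": salary_max}
print(enriched["salary_min"], enriched["salary_max"])  # 139607 224885
```

With numeric fields like these in place, a range filter such as `where={"salary_min": {"$gte": 150000}}` becomes possible, since ChromaDB's comparison operators apply to numeric metadata values.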


## 6. Add Documents to Collection

Now we'll add all 1000 job postings to the collection. ChromaDB will automatically:
1. Compute embeddings for each document using the default embedding model
2. Store the embeddings along with the documents and metadata
3. Index them for fast similarity search
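
A single `add` call is enough for 1000 documents, but larger datasets may exceed the client's maximum batch size. The sketch below shows one way to split a load into batches. The helpers `batched` and `add_in_batches` and the batch size of 256 are assumptions for illustration, and a recording stub stands in for `collection.add` so the example runs without a database.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def add_in_batches(add_fn, documents, ids, metadatas, batch_size=256):
    """Call add_fn once per batch; add_fn stands in for collection.add."""
    total = 0
    for doc_batch, id_batch, meta_batch in zip(
        batched(documents, batch_size),
        batched(ids, batch_size),
        batched(metadatas, batch_size),
    ):
        add_fn(documents=doc_batch, ids=id_batch, metadatas=meta_batch)
        total += len(id_batch)
    return total

# Demo with placeholder data and a stub that records each batch size
calls = []
demo_ids = [f"job_{i}" for i in range(1000)]
demo_docs = [f"document {i}" for i in range(1000)]
demo_metas = [{"title": "placeholder"} for _ in range(1000)]
added = add_in_batches(
    lambda **kwargs: calls.append(len(kwargs["ids"])),
    demo_docs, demo_ids, demo_metas,
)
print(added, calls)  # 1000 [256, 256, 256, 232]
```

In the notebook, passing `collection.add` as `add_fn` would load the real `documents`, `ids`, and `metadatas` lists the same way.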


In [40]:
len(documents)

1000

In [41]:
len(set(ids))

1000

In [42]:
len(metadatas)

1000

In [43]:
# Add all documents to the collection
# ChromaDB will automatically compute embeddings
collection.add(
    documents=documents,
    ids=ids,
    metadatas=metadatas
)

In [44]:
print(f"Successfully added {collection.count()} job postings to the collection")


Successfully added 1000 job postings to the collection


In [45]:
collection.get(limit=10)

{'ids': ['job_0',
  'job_1',
  'job_2',
  'job_3',
  'job_4',
  'job_5',
  'job_6',
  'job_7',
  'job_8',
  'job_9'],
 'embeddings': None,
 'documents': ["Lead Data Scientist: We're seeking a talented lead data scientist to join our leading SaaS provider. You'll lead technical initiatives and mentor junior developers using NumPy, Spark and Airflow. Our hybrid work model fosters collaboration and innovation. We offer learning and development budget and are committed to your professional growth.",
  "Senior DBA: We're seeking a talented senior dba to join our fast-growing startup. You'll lead technical initiatives and mentor junior developers using Flask and Django. We offer a onsite with remote flexibility approach, allowing you to balance office collaboration with remote work. We offer stock options and are committed to your professional growth.",
  "Senior Security Engineer: Exciting opportunity for a senior security engineer at a e-commerce platform. You'll lead technical initiatives

## 7. Understanding the Default Embedding Model

ChromaDB uses **`all-MiniLM-L6-v2`** as its default embedding model. Here's what you need to know:

### Key Features:
- **Model**: `all-MiniLM-L6-v2` (a Sentence Transformers model based on Microsoft's MiniLM)
- **Dimensions**: 384-dimensional embeddings
- **Optimization**: Balanced for both speed and quality
- **Download**: Automatically downloaded on first use (you may have seen a progress bar)

### How Embeddings Enable Semantic Search:

1. **Vector Representation**: Each document is converted into a 384-dimensional vector that captures its semantic meaning
2. **Semantic Understanding**: Similar concepts are positioned close together in the vector space
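3. **Similarity Calculation**: When you query, your query text is also embedded, and ChromaDB returns the documents whose vectors are closest to it. By default the distance is squared L2 (Euclidean); cosine distance can be selected when the collection is created.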
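
To make "close together in the vector space" concrete, the sketch below computes two common measures, cosine similarity and squared Euclidean (L2) distance, on made-up 3-dimensional vectors. Real embeddings have 384 dimensions, and which measure a collection uses depends on its configuration. The vectors and function names here are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (higher means more similar)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def squared_l2(a, b):
    """Sum of squared coordinate differences (lower means more similar)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy vectors, invented for this example
query = [1.0, 0.0, 1.0]
close_doc = [0.9, 0.1, 0.8]
far_doc = [0.0, 1.0, 0.0]

for name, doc in [("close_doc", close_doc), ("far_doc", far_doc)]:
    print(name,
          f"cosine={cosine_similarity(query, doc):.3f}",
          f"squared_l2={squared_l2(query, doc):.3f}")
```

Both measures rank `close_doc` ahead of `far_doc` here, but they run in opposite directions: similarity grows as vectors align, while distance shrinks.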

### Semantic vs Keyword Matching:

- **Keyword Matching**: "Python developer" only matches documents containing exact words "Python" and "developer"
- **Semantic Matching**: "Python developer" can match "software engineer using Python", "backend developer with Python experience", etc., even if the exact phrase isn't present

This is why vector similarity search is powerful for job matching - it understands context and meaning, not just exact words!
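
The keyword side of this comparison is easy to reproduce in plain Python. The sketch below is a deliberately naive matcher, written for illustration only, that requires every query word to appear in the document. The function names and the second example document are assumptions; the semantic side needs an embedding model and is not reproduced here.

```python
import re

def tokens(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_match(query, document):
    """True only if every query word appears verbatim in the document."""
    return tokens(query) <= tokens(document)

exact = "Python Developer: Join our leading SaaS provider as a python developer."
related = "Backend engineer with Python experience"

print(keyword_match("Python developer", exact))    # True
print(keyword_match("Python developer", related))  # False
```

The second document is clearly relevant to the query, yet the keyword matcher rejects it because the word "developer" never appears.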


## 8. Example Similarity Search Queries

Let's demonstrate the power of semantic search with various natural language queries. Notice how the results match based on meaning, not just keywords. We will also add metadata filtering wherever it is applicable.


In [46]:
results = collection.get(limit=5)

In [47]:
results

{'ids': ['job_0', 'job_1', 'job_2', 'job_3', 'job_4'],
 'embeddings': None,
 'documents': ["Lead Data Scientist: We're seeking a talented lead data scientist to join our leading SaaS provider. You'll lead technical initiatives and mentor junior developers using NumPy, Spark and Airflow. Our hybrid work model fosters collaboration and innovation. We offer learning and development budget and are committed to your professional growth.",
  "Senior DBA: We're seeking a talented senior dba to join our fast-growing startup. You'll lead technical initiatives and mentor junior developers using Flask and Django. We offer a onsite with remote flexibility approach, allowing you to balance office collaboration with remote work. We offer stock options and are committed to your professional growth.",
  "Senior Security Engineer: Exciting opportunity for a senior security engineer at a e-commerce platform. You'll lead technical initiatives and mentor junior developers using Spring Boot and Python. We 

In [48]:
for i in range(len(results['documents'])):
    id = results['ids'][i]
    document = results['documents'][i]
    metadata = results['metadatas'][i]

    print(f"ID: {id}")
    print(f"Document: {document}")
    print(f"Metadata: {metadata}")
    print('--------------------------------')


ID: job_0
Document: Lead Data Scientist: We're seeking a talented lead data scientist to join our leading SaaS provider. You'll lead technical initiatives and mentor junior developers using NumPy, Spark and Airflow. Our hybrid work model fosters collaboration and innovation. We offer learning and development budget and are committed to your professional growth.
Metadata: {'location': 'Miami, FL', 'salary': '$139,607 - $224,885', 'title': 'Lead Data Scientist'}
--------------------------------
ID: job_1
Document: Senior DBA: We're seeking a talented senior dba to join our fast-growing startup. You'll lead technical initiatives and mentor junior developers using Flask and Django. We offer a onsite with remote flexibility approach, allowing you to balance office collaboration with remote work. We offer stock options and are committed to your professional growth.
Metadata: {'title': 'Senior DBA', 'location': 'Hybrid - Austin, TX', 'salary': '$134,387 - $227,292'}
--------------------------

In [51]:
results = collection.query(
    query_texts=["remote Python Developer"],
    n_results=3
)

In [52]:
results

{'ids': [['job_344', 'job_164', 'job_255']],
 'embeddings': None,
 'documents': [["Python Developer: Join our leading SaaS provider as a python developer. You'll design and implement robust solutions for complex problems using Django, Rust and C++. Our fully remote approach fosters collaboration and innovation. We offer comprehensive health benefits and are committed to your professional growth.",
   "Python Developer: Exciting opportunity for a python developer at a cloud services provider. You'll design and implement robust solutions for complex problems using Go, Python and Express. We offer a onsite with remote flexibility approach, allowing you to balance office collaboration with remote work. We offer 401k matching and are committed to your professional growth.",
   "Python Developer: Exciting opportunity for a python developer at a established tech company. You'll design and implement robust solutions for complex problems using Node.js, Java and Flask. Our remote-first approach 

In [None]:
results = collection.query(
    query_texts=["data engineer"],
    n_results=2
)

In [None]:
results

In [None]:
results['ids'][0]

In [None]:
results['ids'][0][0]

In [None]:
results['documents'][0][0]

In [None]:
results['metadatas'][0][0]

In [None]:
results['distances'][0][0]

In [None]:
for i in range(len(results['documents'][0])):
    print(results['ids'][0][i])
    print(results['documents'][0][i])
    print(results['metadatas'][0][i])
    print(results['distances'][0][i])
    print('--------------------------------')


In [None]:
results = collection.query(
    query_texts=["data engineer", "python developer"],
    n_results=2
)

In [None]:
results

In [None]:
for outer in range(len(results['ids'])):
    for inner in range(len(results['ids'][outer])):
        print(results['ids'][outer][inner])
        print(results['documents'][outer][inner])
        print(results['metadatas'][outer][inner])
        print(results['distances'][outer][inner])
        print('--------------------------------')
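
Query results are nested: each key holds one inner list per query text, in the same order as `query_texts`. As a sketch of how that shape can be flattened into one row per match, the helper below works on a mocked result dictionary. The function name `flatten_results` and all values in `mock_results` are invented for illustration and are not real query output.

```python
def flatten_results(query_texts, results):
    """Turn nested query results into a flat list of per-match rows."""
    rows = []
    per_query = zip(
        query_texts,
        results["ids"],
        results["documents"],
        results["metadatas"],
        results["distances"],
    )
    for query, ids, docs, metas, dists in per_query:
        matches_for_query = zip(ids, docs, metas, dists)
        for rank, (doc_id, doc, meta, dist) in enumerate(matches_for_query, start=1):
            rows.append({
                "query": query,
                "rank": rank,
                "id": doc_id,
                "document": doc,
                "metadata": meta,
                "distance": dist,
            })
    return rows

# Mocked structure with placeholder values (two queries, two matches each)
mock_results = {
    "ids": [["job_a", "job_b"], ["job_c", "job_d"]],
    "documents": [["doc a", "doc b"], ["doc c", "doc d"]],
    "metadatas": [[{"title": "A"}, {"title": "B"}], [{"title": "C"}, {"title": "D"}]],
    "distances": [[0.1, 0.2], [0.3, 0.4]],
}
rows = flatten_results(["data engineer", "python developer"], mock_results)
for row in rows:
    print(row["query"], row["rank"], row["id"], row["distance"])
```

A flat list like this is convenient to pass to a table library or to sort across queries.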

In [None]:
def print_results(queries, collection, where=None, limit=5):
    """Run each query against the collection and print the ranked matches."""
    results = collection.query(
        query_texts=queries,
        where=where,
        n_results=limit
    )
    for outer in range(len(results['documents'])):
        # One result list per query text, in the same order as `queries`
        print(f"\nQuery: {queries[outer]}")
        for inner in range(len(results['documents'][outer])):
            doc = results['documents'][outer][inner]
            metadata = results['metadatas'][outer][inner]
            distance = results['distances'][outer][inner]

            # Split "Title: Description" at the first colon only
            title, _, description = doc.partition(':')
            description = description.strip()

            print(f"\n{inner+1}. {title}")
            print(f"   Location: {metadata['location']}")
            print(f"   Salary: {metadata['salary']}")
            # Truncate long descriptions so the trailing "..." is meaningful
            print(f"   Description: {description[:200]}...")
            print(f"   Distance: {distance:.4f} (lower is more similar)")

In [None]:
print_results(["remote Python developer", "data engineer", "data scientist"], collection)

In [None]:
results = collection.query(
    query_texts=["remote python developer"],
    n_results=5
)

In [None]:
results

In [None]:
results = collection.query(
    query_texts=["remote python developer"],
    where={"title": "Python Developer"},
    n_results=5
)

In [None]:
results

In [None]:
results = collection.query(
    query_texts=["remote python developer"],
    where={"title": {"$in": ["Python Developer", "Senior Python Developer"]}},
    n_results=5
)

In [None]:
results # [['job_344', 'job_164', 'job_255', 'job_65', 'job_688']]

In [None]:
print_results(["python, java and react"], collection, where={"location": "Remote"})

In [None]:
print_results(["hybrid data engineer"], collection, where={"location": {"$in": ["Miami, FL", "Austin, TX"]}}, limit=10)

In [None]:
print_results(["Python"], collection, where={"$and": [{"location": {"$ne":"Remote"}}, {"title": "Data Engineer"}]})
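
Metadata filters compare whole values, so it helps to reason about which records a `where` clause can match. The toy evaluator below mimics a small subset of the filter syntax used above (`$eq`, `$ne`, `$in`, `$nin`, `$and`, `$or`) on plain dictionaries. It is a simplified sketch for building intuition, not ChromaDB's implementation, and the function name `matches` is an assumption.

```python
import operator

# Subset of comparison operators; names mirror the `where` syntax used above
OPERATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}

def matches(metadata, where):
    """Return True if one metadata dict satisfies the filter (toy evaluator)."""
    for key, condition in where.items():
        if key == "$and":
            ok = all(matches(metadata, clause) for clause in condition)
        elif key == "$or":
            ok = any(matches(metadata, clause) for clause in condition)
        elif isinstance(condition, dict):
            # Operator form, e.g. {"location": {"$ne": "Remote"}}
            ok = all(OPERATORS[op](metadata.get(key), operand)
                     for op, operand in condition.items())
        else:
            # A bare value is shorthand for equality
            ok = metadata.get(key) == condition
        if not ok:
            return False
    return True

miami = {"title": "Lead Data Scientist", "location": "Miami, FL"}
austin_hybrid = {"title": "Senior DBA", "location": "Hybrid - Austin, TX"}
city_filter = {"location": {"$in": ["Miami, FL", "Austin, TX"]}}

print(matches(miami, city_filter))          # True
print(matches(austin_hybrid, city_filter))  # False
```

The second result is worth noting: a location stored as `'Hybrid - Austin, TX'`, as in the earlier sample output for `job_1`, does not match the string `"Austin, TX"`, so the `$in` list would need the exact stored value to include such postings.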