# BetterJobSearch Tutorial

This notebook demonstrates the core features of BetterJobSearch:

1. Loading job data
2. Building the RAG search index
3. Semantic and hybrid search
4. Clustering visualization
5. Launching the web UI

## Setup

First, install the package in editable mode with the `all` extra:

```bash
uv pip install -e ".[all]"
```

In [None]:
# Import the main modules
from src.pipeline import load_jobs, build_index
from src import rag
from src import clustering

## 1. Loading Job Data

Load jobs from a JSON file. The sample data is included in `data/sample_jobs.json`.
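
The cells below assume each record is a dictionary with a nested `job_data` object. That shape is inferred from the fields this tutorial reads, not from a documented schema. A minimal sketch of a defensive accessor, where `summarize_job` and the example record are hypothetical and not part of the package:

```python
def summarize_job(job: dict) -> dict:
    """Pull display fields from one record, tolerating missing keys."""
    # Assumed shape: {"job_data": {"title": ..., "companyName": ..., "location": ...}}
    job_data = job.get("job_data") or {}
    return {
        "title": job_data.get("title", "Unknown"),
        "company": job_data.get("companyName", "Unknown"),
        "location": job_data.get("location", "Unknown"),
    }


# Hypothetical record, for illustration only
example = {"job_data": {"title": "Data Engineer", "companyName": "Acme"}}
print(summarize_job(example))
```

Falling back to a placeholder keeps later display code from failing on records that lack a field.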

In [None]:
# Load the sample jobs
jobs = load_jobs("data/sample_jobs.json")

# Inspect the first job
print(f"Total jobs: {len(jobs)}")
print("\nFirst job structure:")
first_job = jobs[0]
job_data = first_job.get('job_data', {})  # nested record holding the display fields
print(f"  Title: {job_data.get('title')}")
print(f"  Company: {job_data.get('companyName')}")
print(f"  Location: {job_data.get('location')}")

## 2. Building the Search Index

The RAG pipeline chunks job descriptions and builds both vector (FAISS) and keyword (BM25) indexes.
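
To make the chunking step concrete, here is a rough sketch of fixed-size word windows with overlap. This is one common strategy and an assumption on our part; the tutorial does not say how `build_index` actually splits descriptions, and `chunk_words` is a hypothetical helper.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 12-word "description", for illustration only
description = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_words(description, size=5, overlap=2):
    print(chunk)
```

Overlap means a phrase that straddles a window boundary still appears intact in at least one chunk, at the cost of indexing some words twice.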

In [None]:
# Build the index (this may take a few minutes for large datasets)
# Skip this cell if you've already built the index
build_index(jobs=jobs)

In [None]:
# Load the index into memory
rag.load_cache()

# Check how many chunks were created
all_chunks = rag.get_all_chunks()
print(f"Total chunks: {len(all_chunks)}")
print(f"Average chunks per job: {len(all_chunks) / len(jobs):.1f}")

## 3. Semantic and Hybrid Search

Search combines vector similarity with BM25 keyword matching.
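
One widely used way to merge a vector ranking with a BM25 ranking is reciprocal rank fusion (RRF). The sketch below illustrates the idea only; the tutorial does not state which fusion method `rag.retrieve` uses, and the function name, the constant `c=60`, and the job IDs are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], c: int = 60) -> list[str]:
    """Fuse ranked ID lists; each list contributes 1 / (c + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Sort by fused score descending, ties broken by ID for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["job_3", "job_1", "job_7"]  # hypothetical vector ranking
bm25_hits = ["job_1", "job_9", "job_3"]    # hypothetical BM25 ranking
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```

Because RRF works on ranks rather than raw scores, it avoids having to normalize cosine similarities and BM25 scores onto a common scale.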

In [None]:
# Simple search
query = "machine learning engineer with Python experience"
results = rag.retrieve(query, k=5)

print(f"Search: '{query}'\n")
for i, chunk in enumerate(results, 1):
    meta = chunk.meta
    print(f"[{i}] {meta.get('title', 'Unknown')} @ {meta.get('company', 'Unknown')}")
    print(f"    {chunk.text[:100]}...")
    print()

## 4. Clustering

Cluster the chunks returned by a search to discover market segments.
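
As a rough illustration of how per-cluster keywords can be derived once every text has a cluster label, the sketch below counts the most frequent non-stopword terms per label. It is not the method used by the `clustering` module, which this tutorial does not describe; `top_keywords`, the stopword list, and the sample texts and labels are hypothetical.

```python
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def top_keywords(texts, labels, n=3):
    """Return the n most frequent non-stopword terms for each cluster label."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        tokens = [t for t in text.lower().split() if t not in STOPWORDS]
        counts[label].update(tokens)
    result = {}
    for label, counter in counts.items():
        # Most frequent first; alphabetical order breaks ties deterministically
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        result[label] = [term for term, _ in ranked[:n]]
    return result


# Hypothetical texts and labels (e.g. from k-means over embeddings)
texts = ["python backend api", "python api services",
         "react frontend ui", "react ui design"]
labels = [0, 0, 1, 1]
print(top_keywords(texts, labels, n=2))
```

Raw term frequency is the simplest choice; a TF-IDF style weighting would down-rank words that are common to every cluster.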

In [None]:
# Cluster chunks from a search
query = "software engineer"
chunks = rag.retrieve(query, k=50)

result = clustering.cluster_chunks(chunks, n_clusters=5)

print(f"Clustered {len(chunks)} chunks into {result['k']} clusters\n")
print("Top keywords per cluster:")
for cluster_id, keywords in result['keywords'].items():
    print(f"  Cluster {cluster_id}: {', '.join(keywords[:5])}")

## 5. Launching the Web UI

For the full interactive experience, launch the web UI from the command line:

```bash
python -m src.pipeline ui
```

Then open http://localhost:8050 in your browser.

Alternatively, uncomment the cell below to launch the UI from the notebook:

In [None]:
# Uncomment to launch the UI (this will start a server)
# from src.pipeline import run_ui
# run_ui(port=8050)