# Job Search & Ranking Notebook

This notebook demonstrates how to:

1. Query the Google Custom Search JSON API for Machine Learning Engineer and Data Scientist job postings.  
2. Extract and normalize key fields (title, company, location, description, unique ID).  
3. Generalize the pipeline to return and process the top _N_ results.  
4. Parse, clean, and rank the retrieved job listings for downstream analysis or matching.

Eventually this will be integrated into an orchestrated agent workflow.

In [46]:
%pip install -q openai-agents python-dotenv aiohttp backoff beautifulsoup4
from dotenv import dotenv_values
config = dotenv_values(".env")

Note: you may need to restart the kernel to use updated packages.


## Google Custom Search Engine
A custom Google search is a relatively inexpensive way to gather a large number of job posting URLs.

With a Google search, you can also filter on domain (greenhouse.io) and on keywords ("remote", "Machine Learning Engineer", "Data Scientist").

In contrast, integrated web tools called by LLMs typically return only a small number of results, with limited pagination. They are also pricey.
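
The domain and keyword filters described above can be composed programmatically rather than typed by hand. The sketch below is only an illustration: `build_query` is a hypothetical helper (not part of the Google API), and it assumes that `site:` and `intext:` operators behave in the Custom Search engine as they do in a regular Google search.

```python
def build_query(site: str, titles: list[str], extra_terms: list[str]) -> str:
    """Compose a search string from a domain filter, job titles, and extra keywords.

    Hypothetical helper for illustration; it only formats a string.
    """
    # Any one of the titles may appear in the page text.
    title_clause = " OR ".join(f'intext:"{t}"' for t in titles)
    # Every extra term is quoted so it is matched as an exact phrase.
    extras = " ".join(f'"{t}"' for t in extra_terms)
    return f"site:{site} ({title_clause}) {extras}".strip()


example_query = build_query(
    "boards.greenhouse.io",
    ["Machine Learning Engineer", "Data Scientist"],
    ["remote"],
)
print(example_query)
```

Keeping the titles and keywords in lists makes it easy to run the same pipeline for a different role or job board by changing only the arguments.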

In [6]:
import hashlib, re, requests
from itertools import count
from typing import List, Dict

def extract_job_id(url: str) -> str:
    """Greenhouse numeric ID if present, otherwise SHA-256 of the URL."""
    m = re.search(r"/jobs/(\d+)", url)
    return m.group(1) if m else hashlib.sha256(url.encode()).hexdigest()


def parse_google_jobs(json_data: dict) -> List[Dict]:
    """Transform one Google CSE response page into a list of normalised dicts."""
    jobs = []
    for item in json_data.get("items", []):
        url = item["link"]
        jobs.append(
            {
                "unique_id": extract_job_id(url),
                "job_title": (
                    item.get("pagemap", {})
                        .get("metatags", [{}])[0]
                        .get("og:title", item["title"])
                ),
                "url": url,
                "company": url.split("/")[3] if "://" in url else None,
            }
        )
    return jobs


def fetch_google_jobs(
    *,
    query: str,
    api_key: str,
    cse_id: str,
    top_n: int = 50,
    user_agent: str | None = None,
    page_size: int = 10,
) -> List[Dict]:
    """
    Return **up to** `top_n` unique Greenhouse jobs for `query`.

    Parameters
    ----------
    top_n     : how many results you *want* (any positive int, even >100).
    page_size : 1-10.  Values above 10 are clamped to 10 (the API maximum).

    Notes
    -----
    Google never returns more than 100 total results per query and
      rejects requests that page past that limit with a 400 error.
    Each API call counts against your daily quota and costs $5 per 1,000
      queries beyond the first 100 free queries.
    """
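    if top_n < 1:
        return []

    # Clamp once so the loop step and the `num` parameter always agree
    # (a step of 0 would loop forever; a step above 10 would skip results).
    page_size = max(1, min(page_size, 10))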

    headers = {"User-Agent": user_agent} if user_agent else None
    results, seen = [], set()

    for start in count(1, page_size):
        #  stop before Google returns 400 error.
        if start > 99 or len(results) >= top_n:
            break

        resp = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={
                "key": api_key,
                "cx": cse_id,
                "q": query,
                "num": min(page_size, 10),   # 10 is the API max
                "start": start,
            },
            headers=headers,
            timeout=30,
        )
        resp.raise_for_status()
        page_jobs = parse_google_jobs(resp.json())

        if not page_jobs:             # no more hits – bail early
            break

        for job in page_jobs:
            if job["unique_id"] not in seen:
                seen.add(job["unique_id"])
                results.append(job)
                if len(results) == top_n:
                    break

    if len(results) < top_n:
        print(
            f"[fetch_google_jobs] Requested {top_n} results but the API only "
            f"returned {len(results)} (Google caps each query at 100)."
        )

    return results

In [7]:
HARD_LIMIT = 99
query = 'site:boards.greenhouse.io intext:"Apply" (intext:"Machine Learning" OR intext:"Data Scientist") "remote"'
args = {
    "api_key": config["GOOGLE_API_KEY"],
    "cse_id":  config["GOOGLE_CSE_ID"],
    "query":   query,
    "top_n": HARD_LIMIT,
    "page_size": 10
}

In [8]:
google_results = fetch_google_jobs(**args)

[fetch_google_jobs] Requested 99 results but the API only returned 94 (Google caps each query at 100).
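
To see why the cap bites, it can help to list the `start` offsets a run will request. This is a small sketch under stated assumptions: `page_starts` is a hypothetical helper, it ignores de-duplication (which can make the real function fetch extra pages), and it assumes the API accepts `start=91` with `num=10`, which is consistent with the 94 results returned above.

```python
def page_starts(top_n: int, page_size: int = 10, cap: int = 100) -> list[int]:
    """1-based `start` offsets needed to cover min(top_n, cap) results.

    Hypothetical helper: mirrors the paging arithmetic only, no network calls.
    """
    page_size = max(1, min(page_size, 10))   # the API allows at most 10 per page
    wanted = min(top_n, cap)                 # nothing past the 100th result is reachable
    return list(range(1, wanted + 1, page_size))


print(page_starts(99))    # ten pages: 1, 11, ..., 91
print(page_starts(500))   # still ten pages, because of the cap
```

A request for 500 results therefore costs the same ten API calls as a request for 99. To collect more postings, vary the query (for example one query per job title) instead of raising `top_n`.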


In [9]:
google_results[:5]

[{'unique_id': '6828507',
  'job_title': 'Staff Data Scientist - Remote, United States',
  'url': 'https://boards.greenhouse.io/toast/jobs/6828507',
  'company': 'toast'},
 {'unique_id': '6799092',
  'job_title': 'Stitch Fix, Your Personal Stylist',
  'url': 'https://boards.greenhouse.io/stitchfix/jobs/6799092',
  'company': 'stitchfix'},
 {'unique_id': '6751486',
  'job_title': 'Staff Machine Learning Engineer, Edge AI/Model Optimization - Remote - US',
  'url': 'https://boards.greenhouse.io/samsara/jobs/6751486',
  'company': 'samsara'},
 {'unique_id': '5492159004',
  'job_title': 'Senior Data Scientist, ePROs',
  'url': 'https://boards.greenhouse.io/thymecare/jobs/5492159004',
  'company': 'thymecare'},
 {'unique_id': '4919145',
  'job_title': 'Work at Samsara: Apply to open roles today',
  'url': 'https://boards.greenhouse.io/samsara/jobs/4919145',
  'company': 'samsara'}]

## ATS JSON Endpoint
Many large applicant tracking systems (ATSs) publish an unauthenticated JSON endpoint that contains every field you care about (title, location, description, employment type, compensation, etc.), so you can skip any paid extraction service entirely.

Why use `aiohttp`? Mainly because we want to hit many URLs quickly from Python without spinning up threads.

In [19]:
import asyncio, html, re
import aiohttp, backoff
from bs4 import BeautifulSoup

GH_RE = re.compile(r'boards\.greenhouse\.io/(?P<co>[^/]+)/jobs/(?P<id>\d+)')

def api_from_job_url(url: str) -> str | None:
    m = GH_RE.search(url)
    if not m:            # not a Greenhouse link
        return None
    return f"https://boards-api.greenhouse.io/v1/boards/{m.group('co')}/jobs/{m.group('id')}"

def html_to_plain(raw: str) -> str:
    """Greenhouse `content` → readable text."""
    if not raw:
        return ""
    soup = BeautifulSoup(html.unescape(raw), "html.parser")   # unescape *before* parse
    for t in soup(["script", "style"]):
        t.decompose()
    text = soup.get_text(separator="\n\n", strip=True)
    text = html.unescape(text).replace("\xa0", " ")
    return "\n".join(ln.rstrip() for ln in text.splitlines() if ln.strip())

@backoff.on_exception(backoff.expo, aiohttp.ClientError, max_tries=5)
async def fetch(session, job_url):
    api = api_from_job_url(job_url)
    if not api:
        return {"url": job_url, "status": "skip"}
    # ClientTimeout is the documented form; total=30 caps the whole request at 30 s.
    async with session.get(api, timeout=aiohttp.ClientTimeout(total=30)) as r:
        data = await r.json()
        location = data.get("location") or {}   # guard against a null location
        return {
            "url": job_url,
            "status": r.status,
            "company": data.get("board_token"),
            "job_id": data.get("id") or "",
            "title": data.get("title") or "",
            "location": location.get("name") or "",
            "description": html_to_plain(data.get("content")),
        }

async def grab(urls, concurrency=25):
    async with aiohttp.ClientSession() as sess:
        sem = asyncio.Semaphore(concurrency)
        async def limited(u):
            async with sem:
                return await fetch(sess, u)
        return await asyncio.gather(*(limited(u) for u in urls))

In [20]:
job_descriptions = await grab(job['url'] for job in google_results)

Not every request succeeds, because many Google results point to a company's job board rather than to an individual job posting.
- Reaching the actual posting requires going one level deeper than the page Google returned.

For now, we will omit these. But soon we will set up an agent tool call that does this for us, e.g. https://chatgpt.com/share/6837b52e-8810-8012-9233-eb07c48e6510
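
Before calling the API, the URLs can be split into those that already point at a posting and those that need the extra hop. The sketch below assumes the same URL shape used earlier in this notebook (`boards.greenhouse.io/<company>/jobs/<id>`). `split_urls` is a hypothetical helper, and the board-only URL in the example is made up for illustration.

```python
import re

# Same shape as the posting URLs seen in the search results above.
POSTING_RE = re.compile(r"boards\.greenhouse\.io/(?P<co>[^/]+)/jobs/(?P<id>\d+)")


def split_urls(urls):
    """Partition URLs into (posting_urls, other_urls) by URL shape alone."""
    postings, others = [], []
    for u in urls:
        (postings if POSTING_RE.search(u) else others).append(u)
    return postings, others


postings, others = split_urls([
    "https://boards.greenhouse.io/toast/jobs/6828507",   # individual posting
    "https://boards.greenhouse.io/samsara",              # illustrative board page
])
print(len(postings), len(others))
```

URL shape is only a first-pass check. A URL can look like a posting and still fail at the API (for example, if the posting has been taken down), so the status filter in the next cell is still needed.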

In [29]:
# Keep only the responses where the Greenhouse API returned the posting (HTTP 200).
success = lambda x: x['status'] == 200
filtered_result = list(filter(success, job_descriptions))
print(f"Found {len(filtered_result)} valid job descriptions out of {len(google_results)} google results")

Found 49 valid job descriptions out of 94 google results


## LLM Job Ranker

For this kind of task, I found that o4-mini works really well and gpt-4.1 works fairly well, but the smaller non-reasoning models do not give reliable ranks that I would agree with.

In [45]:
from pathlib import Path

# Read the resume once; Path.read_text() opens and closes the file for us.
resume_text = Path("resume.txt").read_text()

system_prompt = f"""\
You are a career assistant that helps the user determine whether a job is relevant or not. Given the job description, rank the job based on how well the job description matches the user preferences, experience, and resume.

For each job description provided, output the following information:
1. Company name
2. Job title
3. Overall job match score, which ranges from 1 (bad match) to 5 (great match)
4. One sentence justification

User preferences:
- Job does not require significant software engineering background
- Years of work experience less than 7

User resume:
{resume_text}
"""

In [33]:
import asyncio
from openai import AsyncOpenAI
from pydantic import BaseModel

client = AsyncOpenAI()

class JobRank(BaseModel):
    company: str
    job_title: str
    rank: int
    reason: str

async def rank_job(job):
    resp = await client.responses.parse(
        model="o4-mini",
        input=[
            {"role": "system", "content": system_prompt},
            {"role": "user",   "content": f"Job description:\n{job['description']}"},
        ],
        text_format=JobRank,
    )
    return resp.output_parsed

async def main():
    # schedule one task per job
    tasks = [asyncio.create_task(rank_job(job)) for job in filtered_result]
    # run them all concurrently
    ranked = await asyncio.gather(*tasks)
    return ranked

final_result = await main()

In [39]:
len(final_result)

49

In [38]:
final_result[10]

JobRank(company='Coinbase', job_title='Data Scientist', rank=4, reason='Your strong quantitative, Python and SQL experience and expertise in experimentation and modeling align well with this role’s requirements, though it involves more product-focused ETL and code review responsibilities than purely research work.')
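
The scores come back in the same order as the input, so a final sorting step is needed to get an actual ranking. The sketch below is one way to do it. `RankedJob` is a dataclass stand-in for the `JobRank` model so the snippet runs on its own, `top_matches` is a hypothetical helper, and the sample rows are made-up placeholders rather than real results.

```python
from dataclasses import dataclass


@dataclass
class RankedJob:
    """Stand-in for the JobRank model above (same field names)."""
    company: str
    job_title: str
    rank: int      # 1 (bad match) to 5 (great match)
    reason: str


def top_matches(ranked, min_score: int = 4):
    """Keep jobs scoring at least `min_score`, best first (ties keep input order)."""
    keep = [r for r in ranked if r.rank >= min_score]
    return sorted(keep, key=lambda r: r.rank, reverse=True)


# Placeholder data for illustration only.
sample = [
    RankedJob("ExampleCo", "Data Scientist", 4, "placeholder"),
    RankedJob("OtherCo", "ML Platform Engineer", 2, "placeholder"),
    RankedJob("ThirdCo", "Applied Scientist", 5, "placeholder"),
]
best = top_matches(sample)
print([(r.company, r.rank) for r in best])
```

In the notebook itself the same function could be applied directly to `final_result`, since it only reads the `rank` attribute.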