# Chapter 2: AI-Powered Job Title Matching

## Learning Objectives

By the end of this chapter, you will be able to:
- Use AI to match job titles between different data sources
- Create intelligent mapping functions using PydanticAI
- Handle the complexity of job title variations automatically
- Build functions that will integrate seamlessly with your data pipeline

---

## Building on Our AI Foundation

In Chapter 1, we saw how AI can intelligently parse salary data. Now we'll apply similar techniques to solve an even trickier problem: matching job titles between different datasets.

### The Job Title Matching Challenge

In [None]:
from pathlib import Path

import nest_asyncio
import pandas as pd
from dotenv import load_dotenv

nest_asyncio.apply()

load_dotenv()

# Load OEWS data
bls_file = Path("../01_module/data/bls_jobs_metro_area_2024.csv")
oews_df = pd.read_csv(bls_file)

scraped_file_paths = list(Path("../01_module/data").glob("scraped_indeed_*.csv"))
dataframes = [pd.read_csv(file) for file in scraped_file_paths]
indeed_df = pd.concat(dataframes, ignore_index=True)
print(f"📊 Loaded {len(scraped_file_paths)} scraped files with {len(indeed_df)} total jobs")


print("\n=== THE MATCHING CHALLENGE ===")
print("Indeed job titles (sample):")
for title in indeed_df["job_title"].unique()[:8]:
    print(f"  • {title}")

print("\nOEWS official job titles (sample):")
oews_titles = oews_df["OCC_TITLE"].unique()
for title in oews_titles[:8]:
    print(f"  • {title}")

print(
    f"\nChallenge: Map {indeed_df['job_title'].nunique()} Indeed titles → {len(oews_df['OCC_TITLE'].unique())} OEWS categories"
)

In [None]:
# Mini primer: one-shot structured response
from pydantic import BaseModel, Field
from pydantic_ai import Agent


class Echo(BaseModel):
    clean_title: str = Field(..., description="Normalized form")


mini_agent = Agent(
    "gemini-1.5-flash", output_type=Echo, system_prompt="Return a concise cleaned job title."
)

demo = mini_agent.run_sync("Sr. Software Developer (Backend)").output

demo

> **Learner Challenge**: Look at the Indeed titles vs OEWS titles. Try to manually match 3-4 Indeed titles to their closest OEWS categories. What challenges do you encounter? How many different ways can the same job be described?
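
To see why plain string similarity struggles with this task, the sketch below scores each title against the categories by character overlap using the standard library's `difflib`. It is only a baseline for comparison, not part of the workshop code: the function name `fuzzy_baseline` and the 0.6 cutoff are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_baseline(title: str, categories: list[str], cutoff: float = 0.6) -> str:
    """Return the category most similar to `title` by character overlap, else "Other"."""
    best_category, best_score = "Other", 0.0
    for category in categories:
        # ratio() is 1.0 for identical strings and 0.0 when nothing is shared
        score = SequenceMatcher(None, title.lower(), category.lower()).ratio()
        if score > best_score:
            best_category, best_score = category, score
    return best_category if best_score >= cutoff else "Other"


# Illustrative categories and titles taken from this chapter's examples
categories = ["Software Developers", "Database Administrators", "Data Scientists"]
for title in ["Database Administrators", "Python Engineer", "Database Admin"]:
    print(f"{title!r} -> {fuzzy_baseline(title, categories)!r}")
```

Identical or near-identical strings score well, but a title like "Python Engineer" shares little text with "Software Developers", so character overlap alone is unlikely to connect them. That gap is what the AI matcher is meant to close.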

---

## Setting Up AI Job Title Matching

Since we already configured PydanticAI in Chapter 1, we can jump straight into building our matching function. We start with a schema and an agent that match one title at a time.

In [None]:
# Structured output schema for title matching
class JobTitleMatch(BaseModel):
    matched_title: str
    confidence: int = Field(..., description="1-10 confidence score")
    reasoning: str = Field(..., description="Brief explanation of the match")


SYSTEM_PROMPT = """You are an expert at matching job titles to official occupational categories.
Return ONLY what fits the schema.

Rules:
1. Pick the closest official title (variations & seniority collapse to the base category)
2. If nothing fits reasonably, use "Other"
3. Confidence 1-10 (10 = exact / near-exact semantic match)
4. Reasoning: short, crisp, no repetition of the rules
"""

job_matcher = Agent("gemini-1.5-flash", output_type=JobTitleMatch, system_prompt=SYSTEM_PROMPT)

---

## Creating the Match Function

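Let's build our core matching function that will handle the complexity of job title variations. The cell below redefines `JobTitleMatch` with an `original_title` field and wraps it in a batch result model, so a single call can match many titles at once.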

In [None]:
# --- 1. Define Structured Outputs ---

# Pydantic model for a SINGLE title match
class JobTitleMatch(BaseModel):
    original_title: str = Field(..., description="The original, unmodified job title.")
    matched_title: str = Field(..., description="The best matching official OEWS category.")
    confidence: int = Field(..., description="A 1-10 confidence score for the match.")
    reasoning: str = Field(..., description="A brief explanation for the match decision.")


# Pydantic model for a BATCH of title matches
class JobTitleMatchResults(BaseModel):
    matches: list[JobTitleMatch]


SYSTEM_PROMPT = """
You are an expert at matching a list of job titles to a list of official occupational categories.
For each job title provided, you must find the single best fit from the official category list.

Rules:
1. Prioritize semantic meaning over keywords. Collapse seniority (e.g., "Senior," "Lead," "II") and variations into the base category.
2. If no official category is a reasonable fit, you MUST use the category "Other".
3. Provide a confidence score from 1 (low confidence) to 10 (exact or near-exact semantic match).
4. Reasoning should be brief, explaining why the match was made (e.g., "Semantic match on developer roles").
"""

# This agent is specifically designed for the batch operation
batch_job_matcher = Agent(
    "gemini-1.5-flash", output_type=JobTitleMatchResults, system_prompt=SYSTEM_PROMPT
)


def ai_match_job_titles(indeed_titles: list[str], oews_titles: list[str]) -> list[dict]:
    """
    Matches a list of Indeed job titles to the best OEWS category using a batch AI call.
    This function relies exclusively on the AI agent.
    """
    # Filter out any empty/null titles before processing
    valid_titles = [t for t in indeed_titles if t and isinstance(t, str) and t.strip()]
    if not valid_titles:
        return []

    # The AI-powered batch approach is the only path.
    try:
        options_text = "\n".join(f"- {t}" for t in oews_titles) + "\n- Other"
        titles_text = "\n".join(f"- {t}" for t in valid_titles)

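        # Build the prompt from explicit lines so the interpolated lists keep consistent indentation
        prompt = (
            f"Official Categories:\n{options_text}\n\n"
            f"Job Titles to Match:\n{titles_text}\n\n"
            "Return a structured list of matches for the job titles provided."
        )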
        result = batch_job_matcher.run_sync(prompt)
        return [match.model_dump() for match in result.output.matches]

    except Exception as e:
        print(f"A critical AI error occurred during batch matching: {e}")

        return [
            {
                "original_title": title,
                "matched_title": "Error",
                "confidence": 0,
                "reasoning": f"AI processing failed: {str(e)}",
            }
            for title in valid_titles
        ]
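
Structured output guarantees the shape of each match, but the schema above does not force `matched_title` to be one of the options we supplied, nor does it keep the score within 1-10. A minimal check, assuming the results are the dictionaries returned by `ai_match_job_titles` (the helper name `find_invalid_matches` and the sample data are hypothetical):

```python
def find_invalid_matches(results: list[dict], oews_titles: list[str]) -> list[dict]:
    """Return results whose category is not an allowed option or whose confidence is out of range."""
    allowed = set(oews_titles) | {"Other"}
    invalid = []
    for result in results:
        bad_category = result.get("matched_title") not in allowed
        confidence = result.get("confidence")
        bad_confidence = not isinstance(confidence, int) or not 1 <= confidence <= 10
        if bad_category or bad_confidence:
            invalid.append(result)
    return invalid


# Made-up sample results for illustration; not real model output
official = ["Database Administrators", "Software Developers"]
sample_results = [
    {"original_title": "Database Admin", "matched_title": "Database Administrators", "confidence": 9},
    {"original_title": "Python Engineer", "matched_title": "Python Developers", "confidence": 8},
    {"original_title": "DevOps Engineer", "matched_title": "Error", "confidence": 0},
]

for bad in find_invalid_matches(sample_results, official):
    print(f"Needs attention: {bad['original_title']!r} -> {bad['matched_title']!r}")
```

The fallback entries produced by the `except` branch (`"Error"` with confidence 0) are flagged by the same check, so failed batches surface alongside invented categories.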

---

## Testing Our Matching Function

Let's test our function with some challenging job title variations.

In [None]:
# Target OEWS titles (tech/data focus subset)
target_oews_titles = [
    "Computer Systems Analysts",
    "Information Security Analysts",
    "Computer and Information Research Scientists",
    "Computer Network Support Specialists",
    "Computer User Support Specialists",
    "Database Administrators",
    "Network and Computer Systems Administrators",
    "Computer Programmers",
    "Software Developers",
    "Software Quality Assurance Analysts and Testers",
    "Web Developers",
    "Web and Digital Interface Designers",
    "Data Scientists",
    "Market Research Analysts and Marketing Specialists",
]

# A list of real-world, messy job titles to test
test_titles = [
    "Senior Data Scientist",
    "Full Stack Developer",
    "Python Engineer",
    "Database Admin",
    "UI/UX Designer",
    "DevOps Engineer",
    "Machine Learning Engineer",
    "Frontend Developer",
]

# --- 1. Print the Table Header ---
print("=== TESTING AI-ONLY JOB TITLE MATCHING ===")
print(f"{'Indeed Title':<30} | {'Matched BLS Title':<45} | {'Conf':<5} | Reasoning")
print("-" * 120)

# --- 2. Make a SINGLE Batch Call ---
# Call the batch function once with the entire list of titles.
all_matched_results = ai_match_job_titles(test_titles, target_oews_titles)

# --- 3. Loop Through the Results ---
# Iterate through the list of result dictionaries to print each match.
for result in all_matched_results:
    # Use .get() for safe access to dictionary keys, providing defaults if a key is missing.
    original = result.get("original_title", "N/A")
    matched = result.get("matched_title", "N/A")
    confidence = result.get("confidence", 0)
    reasoning = result.get("reasoning", "N/A")

    print(f"{original:<30} | {matched:<45} | {confidence:<5} | {reasoning}")

> **Learner Challenge**: Add 2-3 job titles from your scraped data that weren't in the test list. Run them through the matching function and evaluate the results. Do the confidence scores seem reasonable? Are there any matches you disagree with?
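
One way to act on the confidence scores is to separate matches into an accepted group and a group for manual review. The sketch below assumes results are the dictionaries returned by `ai_match_job_titles`; the helper name `split_by_confidence` and the threshold of 7 are arbitrary assumptions, not values from the workshop.

```python
def split_by_confidence(results: list[dict], threshold: int = 7) -> tuple[list[dict], list[dict]]:
    """Split results into (accepted, needs_review) around a confidence threshold."""
    accepted = [r for r in results if r.get("confidence", 0) >= threshold]
    needs_review = [r for r in results if r.get("confidence", 0) < threshold]
    return accepted, needs_review


# Made-up sample results for illustration; not real model output
sample_results = [
    {"original_title": "Senior Data Scientist", "matched_title": "Data Scientists", "confidence": 10},
    {"original_title": "DevOps Engineer", "matched_title": "Other", "confidence": 4},
    {"original_title": "UI/UX Designer", "matched_title": "Web and Digital Interface Designers", "confidence": 7},
]

accepted, needs_review = split_by_confidence(sample_results)
print(f"Accepted: {len(accepted)} | Needs review: {len(needs_review)}")
for result in needs_review:
    print(f"  Review: {result['original_title']!r} -> {result['matched_title']!r}")
```

Because the error fallback uses a confidence of 0, failed matches always land in the review group regardless of the threshold chosen.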

---

## Applying Matches to Our Dataset

Now let's apply our matching function to our actual scraped data.

In [None]:
# Apply matching to unique job titles (for efficiency)
unique_titles = indeed_df["job_title"].dropna().unique().tolist()
print(f"Processing {len(unique_titles)} unique job titles in a single batch call...")

# --- 1. Make ONE Batch Call ---
# This is much more efficient than calling the API in a loop.
# It sends all titles at once and gets all results back.
batch_results = ai_match_job_titles(unique_titles, oews_titles)

# --- 2. Convert List of Results to a Dictionary ---
# Use a dictionary comprehension for a clean, Pythonic way to create a lookup map.
# This maps each original title to its full match result dictionary.
title_matches = {result.get("original_title"): result for result in batch_results}

# --- 3. Final Confirmation ---
print(f"\n✅ Matching completed for {len(title_matches)} titles!")

# (Optional) Display a few examples to verify the results
print("\n--- Example Matches ---")
for result in batch_results[:5]:  # Show the first 5 results
    original = result.get("original_title", "N/A")
    matched = result.get("matched_title", "N/A")
    confidence = result.get("confidence", 0)
    print(f"'{original}'  →  '{matched}' (Confidence: {confidence}/10)")
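
A single batch call works well for a modest number of titles. If the list of unique titles grows large, one request might run into the model's input or output limits, so a middle ground is to send the titles in chunks. The following is a sketch under that assumption: `chunked`, `match_in_chunks`, and the chunk size of 50 are hypothetical, and `stand_in_matcher` only imitates the return shape of `ai_match_job_titles` so the example runs without an API key.

```python
from collections.abc import Callable, Iterator


def chunked(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


def match_in_chunks(
    titles: list[str],
    categories: list[str],
    matcher: Callable[[list[str], list[str]], list[dict]],
    chunk_size: int = 50,
) -> list[dict]:
    """Call `matcher` once per chunk and concatenate the results in order."""
    results: list[dict] = []
    for chunk in chunked(titles, chunk_size):
        results.extend(matcher(chunk, categories))
    return results


def stand_in_matcher(titles: list[str], categories: list[str]) -> list[dict]:
    """Placeholder for ai_match_job_titles: labels every title "Other"."""
    return [
        {"original_title": t, "matched_title": "Other", "confidence": 1, "reasoning": "stand-in"}
        for t in titles
    ]


demo_titles = [f"Job {i}" for i in range(5)]
demo_results = match_in_chunks(demo_titles, ["Software Developers"], stand_in_matcher, chunk_size=2)
print(f"Matched {len(demo_results)} titles in chunks of 2")
```

In the notebook, the call would look like `match_in_chunks(unique_titles, list(oews_titles), ai_match_job_titles)`. The trade-off is more API calls in exchange for smaller, more predictable responses.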

In [None]:
import pandas as pd


def apply_job_title_matches(df: pd.DataFrame, matches: dict[str, dict]) -> pd.DataFrame:
    """
    Applies the matched title results to the main DataFrame using a merge.

    Args:
        df: The original DataFrame with a 'job_title' column.
        matches: A dictionary mapping unique job titles to their match results.

    Returns:
        A new DataFrame with added columns for the matched title, confidence, and reasoning.
    """
    # Convert the matches dictionary into a DataFrame for efficient joining.
    # The dictionary keys (job titles) will become the index.
    matches_df = pd.DataFrame.from_dict(matches, orient="index")

    # Rename columns for clarity before merging
    matches_df = matches_df.rename(
        columns={
            "matched_title": "bls_title",
            "confidence": "match_confidence",
            "reasoning": "match_reasoning",
        }
    )

    # Perform a left merge to add the match data to the original DataFrame.
    # This joins rows where the 'job_title' in the left DataFrame
    # matches the index (the original title) in the right DataFrame.
    result_df = df.merge(matches_df, left_on="job_title", right_index=True, how="left")

    # Fill any non-matches with default values
    result_df["bls_title"] = result_df["bls_title"].fillna("Other")
    result_df["match_confidence"] = result_df["match_confidence"].fillna(1)
    result_df["match_reasoning"] = result_df["match_reasoning"].fillna("n/a")

    return result_df


indeed_with_matches = apply_job_title_matches(indeed_df, title_matches)

# Summarize the matching results
print("=== MATCHING RESULTS ===")
print(f"Jobs processed: {len(indeed_with_matches)} | Unique titles: {len(title_matches)}")
print("\nTop mapped categories:")
print(indeed_with_matches["bls_title"].value_counts().head(8))
print("\nConfidence distribution:")
print(indeed_with_matches["match_confidence"].value_counts().sort_index())

display(indeed_with_matches.head(10))
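
The merge above joins on exact string equality, so a title that the model returns with different spacing or casing finds no partner and falls through to the "Other" default. A small sketch of one way to detect this, assuming results are the dictionaries returned by `ai_match_job_titles` (the helpers `normalize_title`, `build_lookup`, and `find_unmatched` are hypothetical names):

```python
def normalize_title(title: str) -> str:
    """Collapse runs of whitespace and ignore case so near-identical titles compare equal."""
    return " ".join(title.split()).casefold()


def build_lookup(results: list[dict]) -> dict[str, dict]:
    """Map each normalized original title to its match result."""
    return {
        normalize_title(r["original_title"]): r
        for r in results
        if r.get("original_title")
    }


def find_unmatched(titles: list[str], lookup: dict[str, dict]) -> list[str]:
    """Return the input titles that have no entry in the lookup."""
    return [t for t in titles if normalize_title(t) not in lookup]


# Made-up sample: the returned title differs in case and spacing from the input
returned = [{"original_title": "senior  data scientist", "matched_title": "Data Scientists"}]
lookup = build_lookup(returned)
print(find_unmatched(["Senior Data Scientist", "DevOps Engineer"], lookup))
```

Titles reported by `find_unmatched` are the ones that would otherwise be silently labeled "Other", so they are worth re-submitting or reviewing by hand.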

---

## Saving Our Matching Function

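Let's add our batch job title matching functions to our janitor module. The `-a` flag tells `%%writefile` to append to the existing file, so run this cell only once; re-running it appends duplicate definitions.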

In [None]:
%%writefile -a ../workshoplib/src/workshoplib/janitor.py

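# Imports for the definitions below (harmless repeats if janitor.py already has them)
import pandas as pd
from pydantic import BaseModel, Field
from pydantic_ai import Agent


class JobTitleMatch(BaseModel):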
    original_title: str = Field(..., description="The original, unmodified job title.")
    matched_title: str = Field(..., description="The best matching official OEWS category.")
    confidence: int = Field(..., description="A 1-10 confidence score for the match.")
    reasoning: str = Field(..., description="A brief explanation for the match decision.")


class JobTitleMatchResults(BaseModel):
    matches: list[JobTitleMatch]


# Create the batch job matching agent
_batch_job_matcher = Agent(
    "gemini-1.5-flash",
    output_type=JobTitleMatchResults,
    system_prompt="""
    You are an expert at matching a list of job titles to a list of official occupational categories.
    For each job title provided, you must find the single best fit from the official category list.

    Rules:
    1. Prioritize semantic meaning over keywords. Collapse seniority (e.g., "Senior," "Lead," "II") and variations into the base category.
    2. If no official category is a reasonable fit, you MUST use the category "Other".
    3. Provide a confidence score from 1 (low confidence) to 10 (exact or near-exact semantic match).
    4. Reasoning should be brief, explaining why the match was made (e.g., "Semantic match on developer roles").
    """,
)


def ai_match_job_titles(indeed_titles: list[str], oews_titles: list[str]) -> list[dict]:
    """
    Matches a list of Indeed job titles to the best OEWS category using a batch AI call.

    This function uses Google's Gemini AI to efficiently match multiple job titles
    in a single API call, making it much faster and more cost-effective than individual calls.

    Args:
        indeed_titles: List of job titles from Indeed scraping
        oews_titles: List of official OEWS job categories

    Returns:
        List of dictionaries, each containing original_title, matched_title, confidence, and reasoning
    """

    # Filter out any empty/null titles before processing
    valid_titles = [t for t in indeed_titles if t and isinstance(t, str) and t.strip()]

    if not valid_titles:
        return []

    try:
        # Format options and titles for the AI
        options_text = "\n".join(f"- {t}" for t in oews_titles) + "\n- Other"
        titles_text = "\n".join(f"- {t}" for t in valid_titles)

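        # Build the prompt from explicit lines so the interpolated lists keep consistent indentation
        prompt = (
            "Return a structured list of matches for the job titles provided.\n\n"
            f"Official Categories:\n{options_text}\n\n"
            f"Job Titles to Match:\n{titles_text}"
        )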

        result = _batch_job_matcher.run_sync(prompt)
        return [match.model_dump() for match in result.output.matches]

    except Exception as e:
        print(f"A critical AI error occurred during batch matching: {e}")

        # Return error entries for all titles
        return [
            {
                "original_title": title,
                "matched_title": "Error",
                "confidence": 0,
                "reasoning": f"AI processing failed: {str(e)}",
            }
            for title in valid_titles
        ]

def apply_job_title_matches(df: pd.DataFrame, matches: dict[str, dict]) -> pd.DataFrame:
    """
    Applies the matched title results to the main DataFrame using a merge.

    Args:
        df: The original DataFrame with a 'job_title' column.
        matches: A dictionary mapping unique job titles to their match results.

    Returns:
        A new DataFrame with added columns for the matched title, confidence, and reasoning.
    """
    # Convert the matches dictionary into a DataFrame for efficient joining.
    # The dictionary keys (job titles) will become the index.
    matches_df = pd.DataFrame.from_dict(matches, orient='index')

    # Rename columns for clarity before merging
    matches_df = matches_df.rename(columns={
        'matched_title': 'bls_title',
        'confidence': 'match_confidence',
        'reasoning': 'match_reasoning'
    })

    # Perform a left merge to add the match data to the original DataFrame.
    # This joins rows where the 'job_title' in the left DataFrame
    # matches the index (the original title) in the right DataFrame.
    result_df = df.merge(
        matches_df,
        left_on='job_title',
        right_index=True,
        how='left'
    )

    # Fill any non-matches with default values
    result_df['bls_title'] = result_df['bls_title'].fillna('Other')
    result_df['match_confidence'] = result_df['match_confidence'].fillna(1)
    result_df['match_reasoning'] = result_df['match_reasoning'].fillna('n/a')

    return result_df

In [None]:
import nest_asyncio
from dotenv import load_dotenv

load_dotenv()

nest_asyncio.apply()

from workshoplib.janitor import ai_match_job_titles

print("✅ Janitor module updated successfully!")
print("\nAvailable functions:")
print("  • traditional_parse_salaries (rule-based batch processing)")
print("  • ai_parse_salaries (AI-powered batch processing)")
print("  • ai_match_job_titles (AI-powered batch job title matching)")
print("  • apply_job_title_matches (merges match results into a DataFrame)")

# Quick test with a small sample; separate names keep the full `oews_titles` list intact
sample_titles = ["Data Analyst", "Software Engineer", "UX Designer"]
sample_categories = ["Data Scientist", "Software Developer", "User Experience Designer"]
test_result = ai_match_job_titles(sample_titles, sample_categories)

print(f"\nTesting with: {sample_titles}")
print("Results:")
for result in test_result:
    original = result.get("original_title", "N/A")
    matched = result.get("matched_title", "N/A")
    confidence = result.get("confidence", 0)
    print(f"  '{original}' → '{matched}' (confidence: {confidence}/10)")

---

## Chapter Summary

### What We've Accomplished

**Technical Skills:**
- ✅ **AI Job Matching** - Created intelligent job title mapping using PydanticAI
- ✅ **Complex Data Handling** - Processed variations and edge cases automatically
- ✅ **Confidence Scoring** - Built in quality assessment with AI confidence ratings
- ✅ **Batch Processing** - Applied AI functions efficiently to real datasets
- ✅ **Code Organization** - Extended our janitor module with reusable functions

**Key Insights:**
- AI excels at fuzzy matching and handling variations
- Confidence scores help identify uncertain matches for review
- Structured outputs make AI integration reliable and predictable
- Building reusable functions accelerates future development

### Preview: Chapter 3 Data Integration

In Chapter 3, we will:
- Create a location matching function to complete our data integration
- Use all our janitor functions together to clean our Indeed data
- Merge Indeed and OEWS datasets for the first time
- Save our final clean dataset for the afternoon dashboard module

We're building toward something powerful - stay with me!

---

*Ready for Chapter 3: Complete Data Integration*