# Chapter 1: AI-Powered Data Cleaning

## Learning Objectives

By the end of this chapter, you will be able to:
- Understand why scraped data needs cleaning and standardization
- Create traditional parsing functions for salary data
- Set up PydanticAI with Google Gemini for intelligent data processing
- Build AI-powered functions that outperform traditional rule-based approaches
- Save reusable data cleaning functions to your workshoplib

---

## Module 2: From Messy Data to Clean Insights

In Module 1, we extracted job data from Indeed.com. Scraped data is rarely clean out of the box, so let's examine what we collected and see why it needs cleaning.

### The Reality of Scraped Data

In [None]:
from pathlib import Path

import pandas as pd

DATA_DIR = Path("data").absolute()
DATA_DIR.mkdir(exist_ok=True)

# Load BLS OEWS (Occupational Employment and Wage Statistics) data
bls_file = Path("../01_module/data/bls_jobs_metro_area_2024.csv")
oews_df = pd.read_csv(bls_file)

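# Load scraped Indeed data (sorted so the row order is reproducible across runs)
scraped_file_paths = sorted(Path("../01_module/data").glob("scraped_indeed_*.csv"))
if not scraped_file_paths:
    raise FileNotFoundError("No scraped_indeed_*.csv files found; run Module 1 first.")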

dataframes = [pd.read_csv(file) for file in scraped_file_paths]
indeed_df = pd.concat(dataframes, ignore_index=True)
print(f"📊 Loaded {len(scraped_file_paths)} scraped files with {len(indeed_df)} total jobs")

print("\n=== INDEED DATA SAMPLE ===")
print(f"Shape: {indeed_df.shape}")
print(f"Columns: {list(indeed_df.columns)}")
print("\nFirst few salary values:")
for i, salary in enumerate(indeed_df["salary"].head()):
    print(f"  {i + 1}. '{salary}'")

> **Learner Challenge**: Examine the salary values in your scraped data. What patterns do you notice? What formats would be challenging to parse with traditional code? Make a list of at least 3 different salary formats you observe.
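
One way to approach this challenge is to collapse each salary string into a coarse "shape" and count how often each shape occurs. The sketch below is a minimal illustration using only the standard library. The `salary_shape` helper and the sample strings are hypothetical and not part of the workshop code; in the notebook you could feed it `indeed_df["salary"].dropna()` instead.

```python
import re
from collections import Counter


def salary_shape(text: str) -> str:
    """Collapse a salary string into a coarse pattern (hypothetical helper)."""
    shape = text.lower().strip()
    # Replace every number (with optional commas or decimals) with the placeholder "N"
    return re.sub(r"\d[\d,]*(?:\.\d+)?", "N", shape)


# Made-up sample values for illustration only
sample = [
    "$50,000 - $70,000 a year",
    "$55,000 - $65,000 a year",
    "$25 an hour",
    "Up to $80K",
    "From $95,000 a year",
]

shape_counts = Counter(salary_shape(s) for s in sample)
for shape, count in shape_counts.most_common():
    print(f"{count:>3}  {shape}")
```

Each distinct shape in the output is a format that a rule-based parser has to handle explicitly.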

---

## Traditional Approach: Manual Salary Parsing

Let's start by building a traditional function to parse salary data. This will help us appreciate why AI is so powerful for this task.

In [None]:
import re
from typing import Optional

import pandas as pd


def _extract_numbers(text: str) -> list[float]:
    """Extracts all numerical values from a string, handling 'K' notation."""

    # First, handle 'k' notation (e.g., "80K", "120k") by converting to full numbers
    text = re.sub(r"(\d+)k", lambda m: str(int(m.group(1)) * 1000), text, flags=re.IGNORECASE)

    # Next, find all remaining numbers, with or without thousands separators
    # (\d+ rather than \d{1,3} so expanded values such as "80000" stay whole)
    numbers_as_strings = re.findall(r"\$?(\d+(?:,\d{3})*(?:\.\d+)?)", text)

    if not numbers_as_strings:
        return []

    # Convert all found strings to floats, handling potential errors
    try:
        return [float(num.replace(",", "")) for num in numbers_as_strings]
    except ValueError:
        return []


def _calculate_min_max(numbers: list[float], text: str) -> tuple[Optional[float], Optional[float]]:
    """Determines the min and max salary from a list of numbers."""

    if len(numbers) >= 2:
        return min(numbers), max(numbers)
    elif len(numbers) == 1:
        # If text indicates "up to", this single number is the max
        if "up to" in text or "up-to" in text:
            return None, numbers[0]
        # Otherwise, the single number is both the min and max
        else:
            return numbers[0], numbers[0]
    else:
        return None, None


def _parse_single_salary(salary_string: str) -> dict:
    """
    Orchestrates the parsing of a single salary string using helper functions.
    """
    # Handle empty or non-string inputs first
    if pd.isna(salary_string) or not str(salary_string).strip():
        return {"min_salary": None, "max_salary": None, "salary_type": "unknown"}

    text = str(salary_string).lower().strip()

    # Check for non-numeric terms first
    if any(word in text for word in ["competitive", "negotiable", "commission", "doe"]):
        return {"min_salary": None, "max_salary": None, "salary_type": "non_numeric"}

    # Use helpers to get numbers and determine min/max
    numbers = _extract_numbers(text)
    min_salary, max_salary = _calculate_min_max(numbers, text)

    # Determine the salary type and convert if necessary
    if "hour" in text or "/hr" in text:
        salary_type = "hourly_converted"
        # Assume 40 hours/week, 52 weeks/year for annual conversion
        annualization_factor = 40 * 52
        min_salary = min_salary * annualization_factor if min_salary is not None else None
        max_salary = max_salary * annualization_factor if max_salary is not None else None
    else:
        salary_type = "annual"

    # If no numbers were successfully parsed, classify as unknown
    if min_salary is None and max_salary is None:
        salary_type = "unknown"

    return {"min_salary": min_salary, "max_salary": max_salary, "salary_type": salary_type}


# --- Public-facing batch processing function ---
def traditional_parse_salaries(salary_list: list[str]) -> list[dict]:
    """
    Parses a list of salary strings by applying the parsing logic to each item.

    Args:
        salary_list: A list of raw salary text strings.

    Returns:
        A list of dictionaries, each containing parsed salary information.
    """
    # Use a list comprehension for a clean and efficient way to process the batch
    return [_parse_single_salary(s) for s in salary_list]


# --- Test the batch function ---
test_salaries = [
    "$50,000 - $70,000 per year",
    "$25 per hour",
    "Up to $80K",
    "Competitive salary",
    "$120k annually",
    None,
    "Varies based on experience",
]

print("=== TESTING TRADITIONAL BATCH SALARY PARSING ===")
parsed_results = traditional_parse_salaries(test_salaries)

for original, parsed in zip(test_salaries, parsed_results):
    print(f"Input: '{original}'")
    print(f"Output: {parsed}")
    print("-" * 20)
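
The hourly conversion used by the parser is plain arithmetic: 40 hours per week times 52 weeks gives 2,080 working hours per year. The sketch below works through a few rates under that full-time assumption. The `annualize` helper is a hypothetical name used only for this illustration.

```python
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52
ANNUALIZATION_FACTOR = HOURS_PER_WEEK * WEEKS_PER_YEAR  # 2,080 hours per year


def annualize(hourly_rate: float) -> float:
    """Convert an hourly rate to an annual figure, assuming full-time work."""
    return hourly_rate * ANNUALIZATION_FACTOR


for rate in (15.0, 25.0, 30.0):
    print(f"${rate:.2f}/hour -> ${annualize(rate):,.0f}/year")
```

Part-time roles or postings with overtime would need a different factor, which is one of the edge cases a fixed rule cannot see.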

---

## Setting Up AI Tools: PydanticAI + Google Gemini

Now let's set up AI tools that can handle salary parsing much more intelligently.

### Getting Your Google API Key

In [None]:
import os

from dotenv import load_dotenv

load_dotenv()

# Check if we have a Google API key
API_KEY = os.getenv("GOOGLE_API_KEY")

if not API_KEY:
    print("🔑 You need a Google API key to use Gemini!")
    print("\nTo get your free API key:")
    print("1. Go to https://aistudio.google.com/app/apikey")
    print("2. Sign in with your Google account")
    print("3. Click 'Create API Key'")
    print("4. Copy the key and add it to your .env file:")
    print("   GOOGLE_API_KEY=your_key_here")
    print("\nAfter adding the key, restart the notebook.")
else:
    print("✅ Google API key found!")

### Installing and Configuring PydanticAI

In [None]:
import nest_asyncio
from pydantic import BaseModel
from pydantic_ai import Agent
from pydantic_ai.models.google import GoogleModel
from pydantic_ai.providers.google import GoogleProvider

# Jupyter already runs an event loop; nest_asyncio lets run_sync() work inside it
nest_asyncio.apply()

provider = GoogleProvider(api_key=API_KEY)
model = GoogleModel("gemini-1.5-flash", provider=provider)


# Test our AI setup with a simple example
class TestResponse(BaseModel):
    message: str
    number: int


test_agent = Agent(
    model,
    output_type=TestResponse,
    system_prompt="You are a helpful assistant. Return a friendly message and a random number.",
)

result = test_agent.run_sync("Say hello and give me a number between 1 and 100")
print(f"🤖 AI Test Result: {result.output.message} (Number: {result.output.number})")
print("✅ AI setup successful!")

---

## AI-Powered Salary Parsing

Now let's create an AI function that can parse salaries much more intelligently than our traditional approach.

In [None]:
import pandas as pd
from pydantic import BaseModel, Field
from typing import Literal, Optional
from pydantic_ai import Agent

SalaryType = Literal["annual", "hourly_converted", "non_numeric", "unknown"]


# This Pydantic model ensures the AI's response is always in a clean, predictable format.
class SalaryInfo(BaseModel):
    min_salary: Optional[float]
    max_salary: Optional[float]
    salary_type: SalaryType
    confidence: int = Field(
        ..., description="A 1-10 confidence score (1 = not confident, 10 = very confident)"
    )


class SalaryParsingResults(BaseModel):
    salaries: list[SalaryInfo] = Field(
        ..., description="A list of parsed salary information objects."
    )


salary_agent = Agent(
    model,  # Reuse the GoogleModel configured above so the agent uses the same API key
    output_type=SalaryParsingResults,
    system_prompt="""
    You are an expert at parsing a LIST of salary information strings from job postings.
    Process each item in order and return exactly one structured object per input item, in the same order.

    Rules:
        1. Convert hourly rates to annual amounts. Assume a standard 40-hour work week and 52 weeks per year.
        2. Handle salary ranges (e.g., "$50K - $70K") by setting both min_salary and max_salary.
        3. Handle single values (e.g., "Up to $80,000" or "$25/hour") appropriately.
        4. For non-numeric salaries (e.g., "Competitive salary"), set salary_type to 'non_numeric' and set min_salary and max_salary to null.
        5. Always provide a confidence score from 1 (not confident) to 10 (very confident).
        6. If the input is empty, nonsensical, or unparsable, set salary_type to 'unknown'.

    Example Input List:
    - "$25 an hour"
    - "$50K - $70K"

    Example Output: A JSON object containing a list with two salary info objects.
    """,
)


def ai_parse_salaries(salary_list: list[str]) -> list[dict]:
    """
    Uses the AI agent to parse a list of salary strings in a single batch operation.

    Args:
        salary_list: A list of raw salary text strings from job postings.

    Returns:
        A list of dictionaries, each containing parsed salary information.
    """
    # Handle empty or null input list.
    if not salary_list:
        return []

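    # Format the list into a single prompt for the AI. Null or blank entries are kept
    # as empty strings so the model returns one result per input, in the same order.
    cleaned = ["" if pd.isna(s) else str(s).strip() for s in salary_list]
    formatted_list = "\n".join(f"- '{s}'" for s in cleaned)
    prompt = f"Parse the following list of salary strings:\n{formatted_list}"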

    try:
        # Make a single synchronous call to the agent for the entire batch.
        result = salary_agent.run_sync(prompt)

        # Convert the list of Pydantic models from the result into a list of dictionaries.
        return [s.model_dump() for s in result.output.salaries]
    except Exception as e:
        print(f"AI batch parsing error: {e}")
        # Return one separate error dict per input so the output length matches
        return [
            {"min_salary": None, "max_salary": None, "salary_type": "error", "confidence": 1}
            for _ in salary_list
        ]

In [None]:
test_salaries = [
    "$55,000 - $65,000 a year",
    "$30 an hour",
    "Up to $120k",
    "Competitive Salary",
    "From $95,000 a year",
    None,
    "Varies",
]

print("=== TESTING BATCH AI SALARY PARSING ===")
parsed_results = ai_parse_salaries(test_salaries)

# Display the results, pairing each input with its corresponding output.
for original, parsed in zip(test_salaries, parsed_results):
    print(f"Input: '{original}'")
    print(f"AI Output: {parsed}")
    print("-" * 20)
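
The `SalaryInfo` schema is what keeps these outputs predictable. To see what it does without calling the API, the sketch below validates two hand-written payloads against a copy of the model. Both payloads are made up for illustration and are not real model responses; the second one deliberately violates the schema.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError

SalaryType = Literal["annual", "hourly_converted", "non_numeric", "unknown"]


class SalaryInfo(BaseModel):
    min_salary: Optional[float]
    max_salary: Optional[float]
    salary_type: SalaryType
    confidence: int = Field(..., description="A 1-10 confidence score")


# Hypothetical payloads standing in for what a model might return
good_payload = {"min_salary": 50000, "max_salary": 70000, "salary_type": "annual", "confidence": 9}
bad_payload = {"min_salary": "lots", "max_salary": None, "salary_type": "weekly", "confidence": 9}

parsed = SalaryInfo.model_validate(good_payload)
print(parsed.model_dump())

try:
    SalaryInfo.model_validate(bad_payload)
except ValidationError as exc:
    # Collect the names of the fields that failed validation
    bad_fields = sorted(err["loc"][0] for err in exc.errors())
    print(f"Rejected fields: {bad_fields}")
```

A response that does not fit the schema raises a validation error rather than silently flowing into the DataFrame as a malformed row.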

### Comparing Traditional vs AI Approaches

In [None]:
# --- Define Test Data ---
test_salaries = [
    "$55,000 - $65,000 a year",
    "$30 an hour",
    "Up to $120k",
    "Competitive Salary",
    "From $95,000 a year",
    None,
    "Varies",
]

# Process the entire list with the traditional, rule-based function.
all_trad_results = traditional_parse_salaries(test_salaries)

# Process the entire list with the AI-powered function.
all_ai_results = ai_parse_salaries(test_salaries)

print("=== TRADITIONAL vs AI COMPARISON ===")
print(f"{'Salary Input':<35} | {'Traditional Result':<25} | {'AI Result':<25} | {'AI Confidence'}")
print("-" * 115)

# Use zip() to iterate through the original inputs and both sets of results simultaneously.
for salary, trad_result, ai_result in zip(test_salaries, all_trad_results, all_ai_results):
    # Format the traditional result for clean printing.
    trad_min = trad_result.get("min_salary")
    trad_max = trad_result.get("max_salary")
    trad_summary = f"${trad_min or 0:,.0f} - ${trad_max or 0:,.0f}"

    # Format the AI result for clean printing.
    ai_min = ai_result.get("min_salary")
    ai_max = ai_result.get("max_salary")
    ai_summary = f"${ai_min or 0:,.0f} - ${ai_max or 0:,.0f}"

    # Handle display for non-numeric types for better readability.
    if trad_result.get("salary_type") in ["non_numeric", "unknown"]:
        trad_summary = f"({trad_result.get('salary_type')})"

    if ai_result.get("salary_type") in ["non_numeric", "unknown", "error"]:
        ai_summary = f"({ai_result.get('salary_type')})"

    # Ensure the original salary string is not None for printing.
    salary_str = str(salary) if salary is not None else "None"

    print(
        f"{salary_str:<35} | {trad_summary:<25} | {ai_summary:<25} | {ai_result.get('confidence', 'N/A')}/10"
    )

> **Learner Challenge**: Add 2-3 additional salary strings to the test data that you think might be challenging to parse. Try some edge cases like "DOE (Depends on Experience)" or "$15-20/hr + tips". Compare how the traditional vs AI approaches handle these cases.
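
When comparing the two approaches on a larger sample, it can help to flag rows that deserve a manual look. The sketch below marks a row for review when the AI confidence is low or when the two parsers disagree on the salary range. The `needs_review` helper, the threshold of 7, and the sample results are all assumptions made for illustration; the dictionaries are hand-written and not actual parser output.

```python
def needs_review(trad: dict, ai: dict, min_confidence: int = 7) -> bool:
    """Return True when an AI result should be checked by a person (hypothetical helper)."""
    low_confidence = ai.get("confidence", 0) < min_confidence
    trad_range = (trad.get("min_salary"), trad.get("max_salary"))
    ai_range = (ai.get("min_salary"), ai.get("max_salary"))
    return low_confidence or trad_range != ai_range


# Hand-written example results, not real parser output
trad_results = [
    {"min_salary": 55000.0, "max_salary": 65000.0, "salary_type": "annual"},
    {"min_salary": 95000.0, "max_salary": 95000.0, "salary_type": "annual"},
    {"min_salary": None, "max_salary": None, "salary_type": "unknown"},
]
ai_results = [
    {"min_salary": 55000.0, "max_salary": 65000.0, "salary_type": "annual", "confidence": 10},
    {"min_salary": 95000.0, "max_salary": None, "salary_type": "annual", "confidence": 8},
    {"min_salary": None, "max_salary": None, "salary_type": "unknown", "confidence": 4},
]

flags = [needs_review(t, a) for t, a in zip(trad_results, ai_results)]
print(flags)
```

Disagreement does not say which parser is right; it only narrows down where to look.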

---

## Saving Our Functions to WorkshopLib

Let's save our salary parsing functions to our `workshoplib` so we can use them in other modules.

In [None]:
%%writefile ../workshoplib/src/workshoplib/janitor.py

"""
Data cleaning and parsing utilities for the workshop.

This module contains functions for cleaning messy real-world data,
including both traditional rule-based approaches and AI-powered solutions.
"""

import re
from typing import Literal, Optional

import pandas as pd
from pydantic import BaseModel, Field
from pydantic_ai import Agent


def _extract_numbers(text: str) -> list[float]:
    """Extracts all numerical values from a string, handling 'K' notation."""

    # First, handle 'k' notation (e.g., "80K", "120k") by converting to full numbers
    text = re.sub(r'(\d+)k', lambda m: str(int(m.group(1)) * 1000), text, flags=re.IGNORECASE)

    # Next, find all remaining numbers, with or without thousands separators
    # (\d+ rather than \d{1,3} so expanded values such as "80000" stay whole)
    numbers_as_strings = re.findall(r'\$?(\d+(?:,\d{3})*(?:\.\d+)?)', text)

    if not numbers_as_strings:
        return []

    # Convert all found strings to floats, handling potential errors
    try:
        return [float(num.replace(',', '')) for num in numbers_as_strings]
    except ValueError:
        return []

def _calculate_min_max(numbers: list[float], text: str) -> tuple[Optional[float], Optional[float]]:
    """Determines the min and max salary from a list of numbers."""

    if len(numbers) >= 2:
        return min(numbers), max(numbers)
    elif len(numbers) == 1:
        # If text indicates "up to", this single number is the max
        if 'up to' in text or 'up-to' in text:
            return None, numbers[0]
        # Otherwise, the single number is both the min and max
        else:
            return numbers[0], numbers[0]
    else:
        return None, None

def _parse_single_salary(salary_string: str) -> dict:
    """
    Orchestrates the parsing of a single salary string using helper functions.
    """
    # Handle empty or non-string inputs first
    if pd.isna(salary_string) or not str(salary_string).strip():
        return {'min_salary': None, 'max_salary': None, 'salary_type': 'unknown'}

    text = str(salary_string).lower().strip()

    # Check for non-numeric terms first
    if any(word in text for word in ['competitive', 'negotiable', 'commission', 'doe']):
        return {'min_salary': None, 'max_salary': None, 'salary_type': 'non_numeric'}

    # Use helpers to get numbers and determine min/max
    numbers = _extract_numbers(text)
    min_salary, max_salary = _calculate_min_max(numbers, text)

    # Determine the salary type and convert if necessary
    if 'hour' in text or '/hr' in text:
        salary_type = 'hourly_converted'
        # Assume 40 hours/week, 52 weeks/year for annual conversion
        annualization_factor = 40 * 52
        min_salary = min_salary * annualization_factor if min_salary is not None else None
        max_salary = max_salary * annualization_factor if max_salary is not None else None
    else:
        salary_type = 'annual'

    # If no numbers were successfully parsed, classify as unknown
    if min_salary is None and max_salary is None:
        salary_type = 'unknown'

    return {
        'min_salary': min_salary,
        'max_salary': max_salary,
        'salary_type': salary_type
    }

def traditional_parse_salaries(salary_list: list[str]) -> list[dict]:
    """
    Parses a list of salary strings by applying the parsing logic to each item.

    Args:
        salary_list: A list of raw salary text strings.

    Returns:
        A list of dictionaries, each containing parsed salary information.
    """
    # Use a list comprehension for a clean and efficient way to process the batch
    return [_parse_single_salary(s) for s in salary_list]

SalaryType = Literal['annual', 'hourly_converted', 'non_numeric', 'unknown']

# This Pydantic model ensures the AI's response is always in a clean, predictable format.
class SalaryInfo(BaseModel):
    min_salary: Optional[float]
    max_salary: Optional[float]
    salary_type: SalaryType
    confidence: int = Field(..., description="A 1-10 confidence score (1 = not confident, 10 = very confident)")


class SalaryParsingResults(BaseModel):
    salaries: list[SalaryInfo] = Field(..., description="A list of parsed salary information objects.")

_salary_agent = Agent(
    'google-gla:gemini-1.5-flash',  # Explicit provider prefix; the API key is read from the environment
    output_type=SalaryParsingResults,
    system_prompt="""
    You are an expert at parsing a LIST of salary information strings from job postings.
    Process each item in order and return exactly one structured object per input item, in the same order.

    Rules:
        1. Convert hourly rates to annual amounts. Assume a standard 40-hour work week and 52 weeks per year.
        2. Handle salary ranges (e.g., "$50K - $70K") by setting both min_salary and max_salary.
        3. Handle single values (e.g., "Up to $80,000" or "$25/hour") appropriately.
        4. For non-numeric salaries (e.g., "Competitive salary"), set salary_type to 'non_numeric' and set min_salary and max_salary to null.
        5. Always provide a confidence score from 1 (not confident) to 10 (very confident).
        6. If the input is empty, nonsensical, or unparsable, set salary_type to 'unknown'.

    Example Input List:
    - "$25 an hour"
    - "$50K - $70K"

    Example Output: A JSON object containing a list with two salary info objects.
    """
)

def ai_parse_salaries(salary_list: list[str]) -> list[dict]:
    """
    Uses the AI agent to parse a list of salary strings in a single batch operation.

    Args:
        salary_list: A list of raw salary text strings from job postings.

    Returns:
        A list of dictionaries, each containing parsed salary information.
    """
    # Handle empty or null input list.
    if not salary_list:
        return []

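    # Format the list into a single prompt for the AI. Null or blank entries are kept
    # as empty strings so the model returns one result per input, in the same order.
    cleaned = ["" if pd.isna(s) else str(s).strip() for s in salary_list]
    formatted_list = "\n".join(f"- '{s}'" for s in cleaned)
    prompt = f"Parse the following list of salary strings:\n{formatted_list}"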

    try:
        # Make a single synchronous call to the agent for the entire batch.
        result = _salary_agent.run_sync(prompt)

        # Convert the list of Pydantic models from the result into a list of dictionaries.
        return [s.model_dump() for s in result.output.salaries]
    except Exception as e:
        print(f"AI batch parsing error: {e}")
        # Return one separate error dict per input so the output length matches
        return [
            {'min_salary': None, 'max_salary': None, 'salary_type': 'error', 'confidence': 1}
            for _ in salary_list
        ]

In [None]:
import pprint as pp
import nest_asyncio
from dotenv import load_dotenv

load_dotenv()

nest_asyncio.apply()

from workshoplib.janitor import ai_parse_salaries, traditional_parse_salaries

print("✅ Janitor module updated successfully!")
print("\nAvailable functions:")
print("  • traditional_parse_salaries (rule-based batch processing)")
print("  • ai_parse_salaries (AI-powered batch processing)")

# Quick test with sample data
test_salaries = ["$75,000 per year", "$30 an hour", "Competitive salary"]

print(f"\nTesting with: {test_salaries}")
traditional_results = traditional_parse_salaries(test_salaries)
ai_results = ai_parse_salaries(test_salaries)

print("\nTraditional results: ")
pp.pprint(traditional_results, compact=True, width=80)

print("\nAI results: ")
pp.pprint(ai_results, compact=True, width=80)
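
`ai_parse_salaries` sends the whole list in a single request, which is fine for a handful of strings. For a full scraped dataset, one option is to split the list into smaller batches. The sketch below shows the idea with two hypothetical helpers, `chunked` and `parse_in_batches`. The batch size of 25 is an arbitrary assumption, and `fake_parse` is a stand-in so the example runs without an API key.

```python
from typing import Callable, Iterator


def chunked(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


def parse_in_batches(
    salaries: list, parse_fn: Callable[[list], list[dict]], batch_size: int = 25
) -> list[dict]:
    """Apply a batch parser chunk by chunk and concatenate the results."""
    results: list[dict] = []
    for batch in chunked(salaries, batch_size):
        results.extend(parse_fn(batch))
    return results


# Stand-in parser for this sketch; in the notebook you would pass
# ai_parse_salaries instead.
def fake_parse(batch: list) -> list[dict]:
    return [{"raw": s, "salary_type": "unknown"} for s in batch]


demo = parse_in_batches([f"${n},000 a year" for n in range(50, 57)], fake_parse, batch_size=3)
print(len(demo))
```

Smaller batches also limit how many rows are affected if a single request fails.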

---

## Chapter Summary

### What We've Accomplished

**Technical Skills:**
- ✅ **Data Assessment** - Identified the messiness in real-world scraped data
- ✅ **Traditional Parsing** - Built rule-based salary parsing logic
- ✅ **AI Setup** - Configured PydanticAI with Google Gemini
- ✅ **AI-Powered Parsing** - Created intelligent salary parsing with confidence scores
- ✅ **Code Organization** - Saved reusable functions to workshoplib

**Key Insights:**
- Real-world scraped data is rarely clean and usually needs standardization before analysis
- Traditional approaches require extensive edge case handling
- AI can handle formats that rules miss and reports a self-assessed confidence score
- Structured outputs make AI results more reliable

### Looking Ahead to Chapter 2

In Chapter 2, we will:
- Load both Indeed and OEWS datasets
- Use AI to match messy job titles to official BLS categories
- Create intelligent matching functions
- Add these tools to our janitor module

The foundation we've built here will make Chapter 2 much smoother!

---

*Ready for Chapter 2: AI-Powered Job Title Matching*