# Chapter 2: Targeted Web Scraping with Playwright

## Learning Objectives

By the end of this chapter, you will be able to:
- Set up and configure Playwright for automated web scraping
- Navigate job search websites programmatically
- Extract structured data from dynamic web pages
- Handle common web scraping challenges (rate limiting, dynamic content)
- Save scraped data in a format suitable for further analysis

---

## Introduction to Web Scraping

> **Instructor Cue:** Begin by pulling up Indeed.com in your browser. Perform a manual search for "Data Scientists" in Oregon to show the class what we're trying to automate. Point out the various elements we want to extract and discuss the challenges of manual data collection at scale.

Based on our exploratory data analysis, we identified high-value target occupations in specific states. While the BLS OEWS data provides excellent foundational insights, it lacks the granular, real-time information that current job postings offer, such as:

- Specific skill requirements
- Company details and culture information
- Exact salary ranges and benefits
- Remote work options
- Educational requirements

**The Problem with Websites Nowadays 😒**

The modern web is dynamic. Many websites, like Indeed.com, use JavaScript to load content after the initial page loads. A simple tool like the `requests` library can't see this content because it only gets the initial HTML.

Let's first see what happens when we try the traditional approach with `requests`:

In [None]:
import requests
from bs4 import BeautifulSoup
from IPython.display import display, HTML

response = requests.get(
    "https://www.indeed.com/jobs?q=data+analyst&l=Beaverton%2C+OR",
    timeout=30,  # avoid hanging indefinitely if the server never responds
)
soup = BeautifulSoup(response.text, "html.parser")

print(f"Status Code: {response.status_code}")
print(f"Page title: {soup.title.text if soup.title else 'No title found'}")
print(f"Total HTML length: {len(response.text):,} characters")

# Look for job listings
job_cards = soup.find_all("div", class_="job_seen_beacon")
print(f"Job cards found with requests: {len(job_cards)}")

display(HTML(response.text))

As you can see, `requests` retrieves only the initial HTML. The job listings are loaded dynamically with JavaScript after the page loads, and on top of that our request is being blocked by Cloudflare's anti-scraping measures.
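
To make the "initial HTML only" point concrete, here is a minimal, self-contained sketch using just the standard library. Both HTML snippets are invented for illustration (they are not Indeed's real markup); only the `job_seen_beacon` class name is taken from the cell above. In the first snippet the job card exists only as a string inside a `<script>` tag, so a parser that never executes JavaScript cannot find it.

```python
from html.parser import HTMLParser

# Hypothetical page as served: the job card is only created by JavaScript at runtime.
STATIC_HTML = """
<html>
  <body>
    <div id="results"></div>
    <script>
      document.getElementById("results").innerHTML =
        '<div class="job_seen_beacon">Data Analyst</div>';
    </script>
  </body>
</html>
"""

# Hypothetical DOM after a real browser has executed the script.
RENDERED_HTML = '<div id="results"><div class="job_seen_beacon">Data Analyst</div></div>'


class JobCardCounter(HTMLParser):
    """Count <div> start tags whose class list contains 'job_seen_beacon'."""

    def __init__(self) -> None:
        super().__init__()
        self.cards = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "job_seen_beacon" in classes:
            self.cards += 1


def count_job_cards(html: str) -> int:
    """Parse the given HTML without running any JavaScript and count job cards."""
    counter = JobCardCounter()
    counter.feed(html)
    return counter.cards


static_count = count_job_cards(STATIC_HTML)
rendered_count = count_job_cards(RENDERED_HTML)
print(f"Cards in the served HTML:   {static_count}")
print(f"Cards in the rendered DOM:  {rendered_count}")
```

The served HTML yields zero cards while the rendered DOM yields one. That gap is what a real browser engine closes for us.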

This is where Playwright shines 💪🏾!

---

### Why Playwright Over Other Tools?

> **Instructor Cue:** Ask the audience: "Has anyone used BeautifulSoup or Selenium before? What challenges did you encounter?" Use their responses to highlight Playwright's advantages.

Playwright offers several advantages for modern web scraping:

1. **Fast and Reliable**: Built for modern web applications
2. **Handles JavaScript**: Executes dynamic content automatically
3. **Multiple Browser Support**: Chromium, Firefox, and WebKit (the engine behind Safari)
4. **Built-in Waiting**: Intelligent waiting for elements to load
5. **Robust Error Handling**: Better handling of network issues and timeouts

---

## Getting Started: Web Scraping with Playwright

**Playwright** is a browser automation library with Python bindings. It can launch a browser, navigate to pages, click buttons, and read content after the page's JavaScript has run. This makes it well suited to scraping modern websites.

Let's start by installing and configuring Playwright for our scraping task:


In [None]:
# NOTE: For this workshop, all dependencies are installed from pyproject.toml with `uv sync`,
# so the install command in this cell is left commented out.
# If you haven't set up your environment yet, run these in your terminal:
#   uv sync
#   playwright install chromium
# This will ensure all required packages (including patchright, pandas, etc.) are installed.
# To install the browser from inside the notebook instead, uncomment the next line:
# !playwright install chromium

In [None]:
import asyncio
from datetime import datetime

import pandas as pd

# Try to import patchright first (which works better with Indeed), fall back to playwright
try:
    from patchright.async_api import async_playwright

    print("Using patchright for better compatibility with Indeed")
except ImportError:
    print("Patchright not found, falling back to standard playwright")
    from playwright.async_api import async_playwright

In [None]:
from pathlib import Path

DATA_DIR = Path("data").absolute()
DATA_DIR.mkdir(exist_ok=True)

> **Instructor Cue:** Walk through the installation process step by step. If anyone encounters installation issues, help them troubleshoot. Explain that we're using the async version of Playwright for better performance and compatibility with Jupyter notebooks.

---

### Step 1. Building Our Job Scraper Class

We'll define a class to encapsulate our scraping logic. This is a good practice that keeps our code organized and reusable.

In [None]:
class _Step1_IndeedJobScraper:
    """
    Step 1: Initial skeleton for our Indeed.com job scraper class using Playwright/Patchright (Async API for Jupyter).
    This class sets up the basic structure for future scraping logic.
    """

    def __init__(
        self,
        job_title: str = "Data Analyst",
        location: str = "Beaverton, OR",
        headless: bool = False,
    ):
        self.job_title = job_title
        self.location = location
        self.headless = headless

    async def run(self) -> list:
        """Run the Indeed job scraper.

        Returns:
            list: List of job dictionaries
        """
        # Our scraping logic will go here
        job_listings = []

        print(f"🔍 Starting scraper for '{self.job_title}' in '{self.location}'...")

        return job_listings


# Test our job scraper
job_scraper = _Step1_IndeedJobScraper()
# test_jobs = await job_scraper.run()
# print(f"📋 Scraper ready! Currently returns {len(test_jobs)} jobs")

### Step 2. Launching The Browser

> **Instructor Cue:** Explain the importance of responsible scraping practices. Discuss rate limiting, robots.txt files, and ethical considerations.
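
As a small illustration of the responsible-scraping points above, the sketch below checks a `robots.txt` policy and computes a randomized delay between requests. It runs fully offline: the `robots.txt` text, the `workshop-bot` user agent, and the `polite_delay` helper are all hypothetical and chosen only for this example.

```python
import random
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "workshop-bot"  # assumed name, not a real crawler

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def polite_delay(base_seconds: float, jitter_seconds: float = 1.0) -> float:
    """Return base_seconds plus a random jitter in [0, jitter_seconds]."""
    return base_seconds + random.uniform(0, jitter_seconds)


allowed = robots.can_fetch(USER_AGENT, "https://example.com/jobs")
blocked = robots.can_fetch(USER_AGENT, "https://example.com/private/report")
crawl_delay = robots.crawl_delay(USER_AGENT) or 1  # fall back to 1s if unspecified

print(f"/jobs allowed:           {allowed}")
print(f"/private/report allowed: {blocked}")
print(f"Next request in ~{polite_delay(crawl_delay):.1f}s")
```

For a live site you would point the parser at the real file with `set_url(...)` and `read()` instead of `parse(...)`, and sleep for the computed delay between page visits.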

In [None]:
from playwright.async_api import Browser, Page, Playwright


class _Step2_IndeedJobScraper(_Step1_IndeedJobScraper):
    """
    Step 2: Enhanced job scraper for Indeed.com using Playwright/Patchright (Async API for Jupyter).
    This class builds upon the initial skeleton by adding functionality to launch the browser.
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.browser: Browser = None
        self.playwright: Playwright = None
        self.page: Page = None

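    async def init_browser(self):
        """Launch a persistent Chromium context with Playwright (Async API for Jupyter).

        Note: launch_persistent_context() returns a BrowserContext, which we store in self.browser.
        """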

        self.playwright = await async_playwright().start()
        self.browser = await self.playwright.chromium.launch_persistent_context(
            user_data_dir="./.browser_data",
            channel="chromium",
            headless=self.headless,
            no_viewport=True,
        )

    async def run(self):
        # Our scraping logic will go here
        job_listings = []

        print(f"🔍 Starting scraper for '{self.job_title}' in '{self.location}'...")

        try:
            await self.init_browser()

            self.page = await self.browser.new_page()
            await asyncio.sleep(2)

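        finally:
            # Close manually, guarding against a failed launch (attributes may still be None)
            if self.browser:
                await self.browser.close()
            if self.playwright:
                await self.playwright.stop()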

        return job_listings


# Test our job scraper
job_scraper = _Step2_IndeedJobScraper()
# test_jobs = await job_scraper.run()
# print(f"📋 Scraper ready! Currently returns {len(test_jobs)} jobs")

### Step 3. Defining our Job Search Logic

> **Instructor Cue:** Open the browser developer tools and show the class how to inspect elements to find the CSS selectors we'll use. This is a great hands-on moment to demonstrate how web scraping detective work happens.

In [None]:
class _Step3_IndeedJobScraper(_Step2_IndeedJobScraper):
    """
    Step 3: Job scraper for Indeed.com using Playwright/Patchright (Async API for Jupyter).
    This class builds upon the previous steps by adding functionality to fill the job search form.
    """

    async def fill_job_search_form(self):
        """
        Fill the job search form on Indeed.
        """
        # Navigate and search
        print("🌐 Navigating to Indeed...")
        await self.page.goto("https://www.indeed.com", timeout=60000)

        print("📝 Filling search form...")
        # Fill in Job Title
        await self.page.locator('input[name="q"]').click()
        await self.page.locator('input[name="q"]').fill(self.job_title)

        # Clear the default location before filling
        # await self.page.locator('input[name="l"]').press("Control+A")
        # await self.page.locator('input[name="l"]').press("Delete")

        # Fill in Location
        await self.page.locator('input[name="l"]').click()
        await self.page.locator('input[name="l"]').fill(self.location)

        await asyncio.sleep(2)  # Small wait to ensure input is registered

        await self.page.click('button[type="submit"]')

        print("⏳ Waiting for page to load...")
        await asyncio.sleep(3)
        await self.page.wait_for_selector(".jobsearch-LeftPane #mosaic-jobResults")

    async def run(self):
        # Our scraping logic will go here
        job_listings = []

        print(f"🔍 Starting scraper for '{self.job_title}' in '{self.location}'...")

        try:
            await self.init_browser()

            self.page = await self.browser.new_page()
            await asyncio.sleep(2)

            await self.fill_job_search_form()

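        finally:
            # Close manually, guarding against a failed launch (attributes may still be None)
            if self.browser:
                await self.browser.close()
            if self.playwright:
                await self.playwright.stop()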

        return job_listings


# Test our job scraper
job_scraper = _Step3_IndeedJobScraper(
    job_title="Cyber Security Engineer", location="New Orleans, LA"
)
# test_jobs = await job_scraper.run()
# print(f"📋 Scraper ready! Currently returns {len(test_jobs)} jobs")
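
The search form ultimately lands on a results URL of the same shape as the one we requested with `requests` earlier (`/jobs?q=...&l=...`). As a hedged aside, the sketch below builds that URL from a job title and location using only the standard library. `build_search_url` is a hypothetical helper, not part of the workshop code, and whether navigating straight to such a URL behaves the same as submitting the form (for example, with respect to bot checks) is an assumption you would need to verify.

```python
from urllib.parse import urlencode


def build_search_url(job_title: str, location: str) -> str:
    """Build an Indeed-style search URL with properly encoded query parameters."""
    query = urlencode({"q": job_title, "l": location})
    return f"https://www.indeed.com/jobs?{query}"


search_url = build_search_url("data analyst", "Beaverton, OR")
print(search_url)
```

`urlencode` takes care of the spaces and the comma, so user-supplied titles and locations never need to be escaped by hand.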

### Step 4. Define Job Detail Extraction Logic

> **Instructor Cue:** Point out the defensive programming practices here - checking if elements exist before accessing them, handling exceptions gracefully. Ask the class: "Why is this error handling so important in web scraping?"
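
The defensive pattern in the next cell (check that an element exists, read its text with a timeout, fall back to `None` on any error) repeats for almost every field. One possible way to factor it out is sketched below. `safe_inner_text` is a hypothetical helper, and `StubLocator` is an invented stand-in that mimics only the two locator methods used here, so the sketch runs without a browser.

```python
import asyncio
from typing import Any, Optional


async def safe_inner_text(locator: Any, timeout_ms: int = 1500) -> Optional[str]:
    """Return the locator's stripped text, or None if it is missing or errors out."""
    try:
        if await locator.count() == 0:
            return None
        text = await locator.inner_text(timeout=timeout_ms)
        return text.strip() or None
    except Exception:
        # Scraping is best-effort: one bad field should not abort the whole card.
        return None


class StubLocator:
    """Hypothetical stand-in for a Playwright locator, for offline testing only."""

    def __init__(self, text: Optional[str] = None, fail: bool = False) -> None:
        self._text = text
        self._fail = fail

    async def count(self) -> int:
        return 0 if self._text is None else 1

    async def inner_text(self, timeout: int = 0) -> str:
        if self._fail:
            raise TimeoutError("simulated timeout")
        return self._text


async def demo() -> list:
    """Exercise the helper on a present, a missing, and a failing element."""
    return [
        await safe_inner_text(StubLocator("  Acme Corp \n")),
        await safe_inner_text(StubLocator(None)),
        await safe_inner_text(StubLocator("x", fail=True)),
    ]
```

In a notebook you would call it as `await safe_inner_text(card.locator('[data-testid="company-name"]'))`. The class below keeps the explicit version so each step stays visible.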

In [None]:
class _Step4_IndeedJobScraper(_Step3_IndeedJobScraper):
    """
    Step 4: Job scraper for Indeed.com using Playwright/Patchright (Async API for Jupyter).
    This class builds upon the previous steps by adding functionality to extract job listings.
    """

    def __init__(self, *args, max_results: int = 10, **kwargs):
        # max_results is keyword-only so positional args still map to job_title/location
        super().__init__(*args, **kwargs)

        self.max_results = max_results
        self._timeout_ms = 1500  # small timeout so we don't hang forever

    async def extract_job_listings(self, job_listings: list):
        """
        Extract job listings from the search results page.
        """

        print("🔍 Looking for job listings...")

        job_cards = self.page.locator(".cardOutline")
        card_count = await job_cards.count()

        if card_count == 0:
            print("❌ No job cards found. Website structure may have changed.")
            return job_listings

        num_to_extract = min(card_count, self.max_results)
        print(f"✅ Found {card_count} jobs. Extracting data from the first {num_to_extract}...\n")

        for i in range(num_to_extract):
            card = job_cards.nth(i)
            job_data = {
                "job_title": None,
                "company_name": None,
                "location": None,
                "salary": None,
                "job_description": None,
            }

            try:
                # Click the card to load the description (only if it exists)
                if await card.count() == 0:
                    print(f"  ⚠️ Card {i + 1} not found; skipping...")
                    continue

                await card.click(timeout=self._timeout_ms)

                # Wait for the job description panel to load its content
                await asyncio.sleep(1.5)

                # == Job Title ==
                title_elem = card.locator("h2 a span")
                if await title_elem.count() > 0:
                    title_attr = await title_elem.get_attribute("title", timeout=self._timeout_ms)

                    if title_attr:
                        job_data["job_title"] = title_attr
                    else:
                        try:
                            job_data["job_title"] = await title_elem.inner_text(
                                timeout=self._timeout_ms
                            )
                        except Exception:
                            job_data["job_title"] = None
                else:
                    print(f"\tℹ️ [{i + 1}] Job Title element missing.")

                # == Company Name ==
                company_elem = card.locator('[data-testid="company-name"]')
                if await company_elem.count() > 0:
                    try:
                        job_data["company_name"] = await company_elem.inner_text(
                            timeout=self._timeout_ms
                        )
                    except Exception:
                        job_data["company_name"] = None
                else:
                    print(f"\tℹ️ [{i + 1}] Company name missing.")

                # == Location ==
                location_elem = card.locator('[data-testid="text-location"]')
                if await location_elem.count() > 0:
                    try:
                        job_data["location"] = await location_elem.inner_text(
                            timeout=self._timeout_ms
                        )
                    except Exception:
                        job_data["location"] = None
                else:
                    print(f"\tℹ️ [{i + 1}] Location missing.")

                # == Salary ==
                company_div = card.locator('div[class*="company_location"]')
                if await company_div.count() > 0:
                    salary_range_div = company_div.locator("+ div")
                    if await salary_range_div.count() > 0:
                        try:
                            job_data["salary"] = (
                                await salary_range_div.inner_text(timeout=self._timeout_ms)
                            ).strip()
                        except Exception:
                            job_data["salary"] = None
                    else:
                        print(f"\tℹ️ [{i + 1}] Salary information missing.")

                # == Job Description ==
                description_elem = self.page.locator("#jobDescriptionText")
                if await description_elem.count() > 0:
                    try:
                        job_data["job_description"] = (
                            await description_elem.inner_text(timeout=self._timeout_ms)
                        ).strip()
                    except Exception:
                        job_data["job_description"] = None
                else:
                    print(f"\tℹ️ [{i + 1}] Job Description missing.")

                job_listings.append(job_data)

                if all(job_data.values()):
                    print(f"\t=> 📝 [{i + 1}] Extracted: {job_data['job_title']}")
                else:
                    print(f"\t=> 📝 Extracted Job [{i + 1}] Details with Missing Fields")

                print("-" * 60)
            except Exception as e:
                print(f"\t⚠️ Error extracting job {i + 1}: {e}")
                continue

        return job_listings

    async def run(self):
        # Our scraping logic will go here
        job_listings = []

        print(f"🔍 Starting scraper for '{self.job_title}' in '{self.location}'...")

        try:
            await self.init_browser()

            self.page = await self.browser.new_page()
            await asyncio.sleep(2)

            await self.fill_job_search_form()
            await self.extract_job_listings(job_listings)

        except Exception as e:
            print(f"❌ Error during scraping: {e}")
            return job_listings
        finally:
            print(f"📊 Extracted a total of {len(job_listings)} job listings")

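            # Close manually, guarding against a failed launch (attributes may still be None)
            if self.browser:
                await self.browser.close()
            if self.playwright:
                await self.playwright.stop()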

            print("🔒 Browser closed")

        return job_listings


# Test our job scraper
job_scraper = _Step4_IndeedJobScraper(
    job_title="Data Scientist", location="San Jose, CA", max_results=15
)
test_jobs = await job_scraper.run()

print(f"\n📋 Scraper ready! Currently returns {len(test_jobs)} jobs")

> **Instructor Cue:** Show the test_jobs data structure to the class. Point out that we have a list of dictionaries, each representing a job. Mention that this is perfect for converting to a pandas DataFrame and saving as CSV. Ask: "What advantages does CSV format give us for data analysis?"

In [None]:
# import pprint as pp; pp.pprint(test_jobs, compact=True, width=80)

#### Save Scraper to Script

> **Instructor Cue:** Ask the class: "Why save our code to a separate file instead of keeping it in the notebook?" Take responses, then explain the benefits below.

We've built a working scraper through experimentation. Now let's make it reusable by saving it to our `workshoplib` package.

**Benefits of Modular Code:**
- **Reuse across modules** - Import our scraper in Module 2 instead of copy-pasting code
- **Share with others** - Your `workshoplib` becomes a professional toolkit
- **Maintain and improve** - Update the library without breaking existing notebooks
- **Industry practice** - Professional teams organize code this way

**Why Build It Incrementally?**

We use `%%writefile` first, then `%%writefile -a` (append) to:
- Show clear structure: imports → class → functions
- Understand dependencies between components
- Practice proper Python module organization
- Make debugging easier when issues arise
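
For intuition, `%%writefile` and `%%writefile -a` behave much like opening a file in write mode and append mode. The sketch below reproduces that in plain Python inside a temporary directory; the file name `demo_module.py` and its contents are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "demo_module.py"  # hypothetical file name

    # Like `%%writefile`: mode "w" creates the file or overwrites it completely.
    with open(module_path, "w", encoding="utf-8") as f:
        f.write("import re\n")

    # Like `%%writefile -a`: mode "a" adds to the end of the existing file.
    with open(module_path, "a", encoding="utf-8") as f:
        f.write("\ndef greet():\n    return 'hello'\n")

    contents = module_path.read_text(encoding="utf-8")

print(contents)
```

One practical consequence: re-running an append cell adds the same code a second time, so after editing, re-run the cells in order starting from the plain `%%writefile` cell.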

> **Instructor Cue:** Emphasize that this mirrors real development work - code is organized into logical, reusable components.

Let's save our scraper for use throughout the workshop:

In [None]:
%%writefile ../workshoplib/src/workshoplib/indeed_scraper.py

import asyncio
import re
from datetime import datetime
from pathlib import Path

import pandas as pd
from playwright.async_api import Browser, Page, Playwright

# Try to import patchright first (which works better with Indeed), fall back to playwright
try:
    from patchright.async_api import async_playwright
    print("Using patchright for better compatibility with Indeed")
except ImportError:
    print("Patchright not found, falling back to standard playwright")
    from playwright.async_api import async_playwright

Then let's save the `IndeedJobScraper` class that we built iteratively over the previous steps:

In [None]:
%%writefile -a ../workshoplib/src/workshoplib/indeed_scraper.py

class IndeedJobScraper:
    """
    Job scraper for Indeed.com using Playwright/Patchright (Async API for Jupyter).
    """
    def __init__(self, job_title: str, location: str, headless: bool = True, max_results: int = 10):
        self.job_title = job_title
        self.location = location
        self.headless = headless
        self.browser: Browser = None
        self.playwright: Playwright = None
        self.page: Page = None
        self.max_results = max_results
        self._timeout_ms = 1500  # small timeout so we don't hang forever

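    async def init_browser(self):
        """Launch a persistent Chromium context with Playwright (Async API for Jupyter).

        Note: launch_persistent_context() returns a BrowserContext, which we store in self.browser.
        """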

        self.playwright = await async_playwright().start()
        self.browser = await self.playwright.chromium.launch_persistent_context(
            user_data_dir="./.browser_data",
            channel="chromium",
            headless=self.headless,
            no_viewport=True,
        )

    async def fill_job_search_form(self):
        """
        Fill the job search form on Indeed.
        """
        # Navigate and search
        print("🌐 Navigating to Indeed...")
        await self.page.goto("https://www.indeed.com", timeout=60000)

        print("📝 Filling search form...")
        # Fill in Job Title
        await self.page.locator('input[name="q"]').click()
        await self.page.locator('input[name="q"]').fill(self.job_title)

        # Clear the default location before filling
        # await self.page.locator('input[name="l"]').press("Control+A")
        # await self.page.locator('input[name="l"]').press("Delete")

        # Fill in Location
        await self.page.locator('input[name="l"]').click()
        await self.page.locator('input[name="l"]').fill(self.location)

        await asyncio.sleep(2)  # Small wait to ensure input is registered

        await self.page.click('button[type="submit"]')

        print("⏳ Waiting for page to load...")
        await asyncio.sleep(3)
        await self.page.wait_for_selector(".jobsearch-LeftPane #mosaic-jobResults")

    async def extract_job_listings(self, job_listings: list):
        """
        Extract job listings from the search results page.
        """

        print("🔍 Looking for job listings...")

        job_cards = self.page.locator('.cardOutline')
        card_count = await job_cards.count()

        if card_count == 0:
            print("❌ No job cards found. Website structure may have changed.")
            return job_listings

        num_to_extract = min(card_count, self.max_results)
        print(f"✅ Found {card_count} jobs. Extracting data from the first {num_to_extract}...\n")

        for i in range(num_to_extract):
            card = job_cards.nth(i)
            job_data = {
                "job_title": None,
                "company_name": None,
                "location": None,
                "salary": None,
                "job_description": None
            }

            try:
                # Click the card to load the description (only if it exists)
                if await card.count() == 0:
                    print(f"  ⚠️ Card {i+1} not found; skipping...")
                    continue

                await card.click(timeout=self._timeout_ms)

                # Wait for the job description panel to load its content
                await asyncio.sleep(1.5)

                # == Job Title ==
                title_elem = card.locator('h2 a span')
                if await title_elem.count() > 0:
                    title_attr = await title_elem.get_attribute('title', timeout=self._timeout_ms)

                    if title_attr:
                        job_data["job_title"] = title_attr
                    else:
                        try:
                            job_data["job_title"] = await title_elem.inner_text(timeout=self._timeout_ms)
                        except Exception:
                            job_data["job_title"] = None
                else:
                    print(f"\tℹ️ [{i+1}] Job Title element missing.")

                # == Company Name ==
                company_elem = card.locator('[data-testid="company-name"]')
                if await company_elem.count() > 0:
                    try:
                        job_data["company_name"] = await company_elem.inner_text(timeout=self._timeout_ms)
                    except Exception:
                        job_data["company_name"] = None
                else:
                    print(f"\tℹ️ [{i+1}] Company name missing.")

                # == Location ==
                location_elem = card.locator('[data-testid="text-location"]')
                if await location_elem.count() > 0:
                    try:
                        job_data["location"] = await location_elem.inner_text(timeout=self._timeout_ms)
                    except Exception:
                        job_data["location"] = None
                else:
                    print(f"\tℹ️ [{i+1}] Location missing.")

                # == Salary ==
                company_div = card.locator('div[class*="company_location"]')
                if await company_div.count() > 0:
                    salary_range_div = company_div.locator('+ div')
                    if await salary_range_div.count() > 0:
                        try:
                            job_data["salary"] = (await salary_range_div.inner_text(timeout=self._timeout_ms)).strip()
                        except Exception:
                            job_data["salary"] = None
                    else:
                        print(f"\tℹ️ [{i+1}] Salary information missing.")

                # == Job Description ==
                description_elem = self.page.locator('#jobDescriptionText')
                if await description_elem.count() > 0:
                    try:
                        job_data["job_description"] = (await description_elem.inner_text(timeout=self._timeout_ms)).strip()
                    except Exception:
                        job_data["job_description"] = None
                else:
                    print(f"\tℹ️ [{i+1}] Job Description missing.")

                job_listings.append(job_data)

                if all(job_data.values()):
                    print(f"\t=> 📝 [{i+1}] Extracted: {job_data['job_title']}")
                else:
                    print(f"\t=> 📝 Extracted Job [{i+1}] Details with Missing Fields")

                print("-" * 60)
            except Exception as e:
                print(f"\t⚠️ Error extracting job {i+1}: {e}")
                continue

        return job_listings

    async def run(self):
        # Launch the browser, run the search, extract the listings, then clean up
        job_listings = []

        print(f"🔍 Starting scraper for '{self.job_title}' in '{self.location}'...")

        try:
            await self.init_browser()

            self.page = await self.browser.new_page()
            await asyncio.sleep(2)

            await self.fill_job_search_form()
            await self.extract_job_listings(job_listings)

        except Exception as e:
            print(f"❌ Error during scraping: {e}")
            return job_listings
        finally:
            print(f"📊 Extracted a total of {len(job_listings)} job listings")

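            # Close manually, guarding against a failed launch (attributes may still be None)
            if self.browser:
                await self.browser.close()
            if self.playwright:
                await self.playwright.stop()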

            print("🔒 Browser closed")

        return job_listings


---

### Step 5. Saving Job Data to CSV File

Now that we have our job data extracted, we need to save it in a format that's easy to work with in our next modules. CSV files are perfect for this because they're:
- Easy to read with pandas
- Compatible with Excel and other tools
- Human-readable for quick inspection
- Small in file size for efficient storage

Let's create a simple function to save our job data:

In [None]:
%%writefile -a ../workshoplib/src/workshoplib/indeed_scraper.py

def save_jobs2csv(job_data: list, job_title: str, location: str, data_dir: Path) -> str:
    """
    Save a list of job dictionaries to a CSV file.

    Args:
        job_data: List of dictionaries containing job information
        job_title: The job title used in the search (for filename)
        location: The location used in the search (for filename)
        data_dir: Directory to save the file in (required; e.g. the notebook's DATA_DIR)

    Returns:
        str: Path to the saved CSV file
    """
    # Step 1: Check if we have data to save
    if not job_data:
        print("⚠️ No job data provided. Nothing to save.")
        return ""

    print(f"📋 Converting {len(job_data)} job records to DataFrame...")

    # Step 2: Create DataFrame from our list of job dictionaries
    df = pd.DataFrame(job_data)

    # Step 3: Create a clean filename from job_title and location
    # Remove special characters that could cause file system problems
    clean_job_title = re.sub(r'[^\w\s-]', '', job_title.strip()).replace(' ', '_').lower()
    clean_location = re.sub(r'[^\w\s-]', '', location.strip()).replace(' ', '_').lower()

    # Step 4: Create the filename using our specified format
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"scraped_indeed_{clean_job_title}_{clean_location}_jobs_{timestamp}.csv"

    # Step 5: Create the full file path
    file_path = data_dir / filename

    # Step 6: Save to CSV (index=False means don't save row numbers)
    df.to_csv(file_path, index=False)

    print(f"💾 Saved {len(job_data)} jobs to: {file_path}")
    print(f"📊 Data shape: {df.shape[0]} rows × {df.shape[1]} columns")
    print(f"📂 File size: {file_path.stat().st_size:,} bytes")

    return str(file_path)

In [None]:
from workshoplib.indeed_scraper import save_jobs2csv

if test_jobs:
    print("🚀 Testing our save function...")
    saved_file = save_jobs2csv(
        job_data=test_jobs,
        job_title=job_scraper.job_title,
        location=job_scraper.location,
        data_dir=DATA_DIR,
    )

    # Let's quickly inspect what we saved
    print("\n📋 Quick preview of saved data:")
    df_preview = pd.read_csv(saved_file)
    display(df_preview.head())  # Show the first five rows
    print(f"\n🏷️ Column names: {list(df_preview.columns)}")
    print(f"📊 Data types:\n{df_preview.dtypes}")
else:
    print("⚠️ No job data to save. Run the scraper first!")

> **Instructor Cue:** Walk through each step of the function slowly. Point to the filename creation process - show how "Data Analyst" becomes "data_analyst" and "New Orleans, LA" becomes "new_orleans_la". Ask the class: "Why do we need to clean these strings for filenames?" Emphasize that `index=False` prevents pandas from adding row numbers to our CSV.
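
To try the filename cleaning on its own, the sketch below isolates the same expression used inside `save_jobs2csv`. The wrapper name `clean_for_filename` is hypothetical, and the third example string is invented to show a character that gets stripped.

```python
import re


def clean_for_filename(text: str) -> str:
    """Drop characters other than word chars, whitespace, and hyphens; then snake_case."""
    return re.sub(r"[^\w\s-]", "", text.strip()).replace(" ", "_").lower()


examples = ["Data Analyst", "New Orleans, LA", "R&D Analyst"]
cleaned = {text: clean_for_filename(text) for text in examples}

for original, result in cleaned.items():
    print(f"{original!r:>20} -> {result!r}")
```

Commas, ampersands, slashes, and similar characters are removed before the spaces become underscores, so the search terms cannot produce a path the file system rejects.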

#### Quick Sanity Check

Let's also verify our saved file and see what we accomplished:

In [None]:
print("📁 Let's see what files we've created in our data directory:")
for p in sorted(
    DATA_DIR.glob("scraped_indeed_*.csv"), key=lambda p: p.stat().st_mtime, reverse=True
):
    file_size = p.stat().st_size
    modified = datetime.fromtimestamp(p.stat().st_mtime)
    print(f"   📄 {p.name}")
    print(f"      Size: {file_size:,} bytes")
    print(f"      Modified: {modified}")
    print()

# Quick data inspection
if "saved_file" in locals() and saved_file:
    print("🔍 Quick data quality check:")
    df_check = pd.read_csv(saved_file)
    print(f"   📊 Total jobs saved: {len(df_check)}")

    # Check for missing data (we'll clean this in Module 2!)
    for col in df_check.columns:
        missing_count = df_check[col].isna().sum()
        missing_pct = (missing_count / len(df_check)) * 100
        print(f"   📈 {col}: {missing_count} missing values ({missing_pct:.1f}%)")

> **Instructor Cue:** Show the file in the data directory. Open it in a text editor or Excel to demonstrate that it's a standard CSV. Point out the missing data percentages and say: "Don't worry about these missing values - in Module 2, we'll use AI to help us clean and organize this messy real-world data!"
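
The same missing-data check can also be run on the raw list of dictionaries, before anything is written to disk. The sketch below is one way to do that with plain Python. `missing_summary` and the two sample records are hypothetical and not part of `workshoplib`.

```python
from typing import Dict, List, Optional


def missing_summary(jobs: List[Dict[str, Optional[str]]]) -> Dict[str, float]:
    """Return, per field, the percentage of records where the value is None or empty."""
    if not jobs:
        return {}
    fields = jobs[0].keys()  # assumes every record shares the first record's keys
    return {
        field: 100 * sum(1 for job in jobs if not job.get(field)) / len(jobs)
        for field in fields
    }


# Invented sample records shaped like the scraper's output (fewer fields for brevity).
sample_jobs = [
    {"job_title": "Data Analyst", "salary": "$75K"},
    {"job_title": "Jr. Data Analyst", "salary": None},
]

summary = missing_summary(sample_jobs)
for field, pct in summary.items():
    print(f"{field}: {pct:.1f}% missing")
```

Checking right after scraping makes it easier to tell a selector that has stopped matching (a field missing in every record) from listings that simply omit a value.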

## Chapter 2 Wrap-Up: From Web to Structured Data

> **Instructor Cue:** Take a moment to celebrate with the class! Show them the CSV file they just created. Open it in Excel or a text editor and point out both the successful data extraction AND the messy, inconsistent nature of real-world data.

### 🎉 **What We've Accomplished Together**

In just one chapter, we've built something powerful:

**🔧 Technical Wins:**
- ✅ **Conquered Modern Web Scraping** - Moved beyond simple requests to handle JavaScript-heavy sites
- ✅ **Built a Robust Scraper Class** - 5 iterative steps from concept to working code
- ✅ **Extracted Real Job Data** - Not toy examples, but actual Indeed.com listings with all their complexity
- ✅ **Created Reusable Tools** - Our `IndeedJobScraper` is now part of `workshoplib` for future use

**📊 Data Discovery Success:**
- ✅ **Captured 5 Critical Fields** - Job titles, companies, locations, salaries, and full descriptions
- ✅ **Preserved Raw Reality** - We kept the data exactly as we found it, warts and all
- ✅ **Built a Data Pipeline** - From web page to structured CSV in minutes

### 🤔 **But Wait... There's a Problem!**

> **Instructor Cue:** Ask the class to look at their CSV data. Point out issues like: "Senior Data Scientist" vs "Data Scientist - Senior Level" vs "Sr. Data Scientist". Show salary inconsistencies like "$80K" vs "$80,000/year" vs "Competitive salary". This creates the perfect setup for Module 2.

Take a closer look at our beautifully extracted data. Notice anything... *messy*?

**The Reality of Web-Scraped Data:**
- 🔍 **Job Titles Are Inconsistent** - "Data Analyst" vs "Data Analytics Specialist" vs "Jr. Data Analyst"
- 💰 **Salary Formats Vary Wildly** - "$75,000/year" vs "$75K" vs "Competitive" vs missing entirely
- 📍 **Location Data Is All Over the Map** - "Portland, OR" vs "Portland, Oregon" vs "Portland Metro Area"
- 📝 **Descriptions Are Information Gold Mines** - But buried in paragraphs of unstructured text

**This is where most data projects stall out.** 😤

Traditionally, you'd spend days or weeks writing complex regex patterns, building lookup tables, and creating custom parsing logic for each field. It's tedious, error-prone, and breaks every time the data format changes slightly.
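
To give a feel for that traditional approach, here is a deliberately simple regex-based salary parser applied to the formats mentioned above. It is a sketch under narrow assumptions (US dollar amounts, an optional `K` suffix), and `parse_salary` is a hypothetical name. Notice how much it quietly ignores.

```python
import re
from typing import Optional

# Matches a dollar amount such as "$75,000" or "$75K" (optional thousands suffix).
_SALARY_RE = re.compile(r"\$\s*(\d[\d,]*(?:\.\d+)?)\s*([kK])?")


def parse_salary(text: str) -> Optional[float]:
    """Return the first dollar amount found in text as a float, or None if absent."""
    match = _SALARY_RE.search(text)
    if match is None:
        return None
    amount = float(match.group(1).replace(",", ""))
    if match.group(2):  # a "K" suffix means thousands
        amount *= 1_000
    return amount


samples = ["$75,000/year", "$75K", "Competitive", "$80K-$100K DOE"]
parsed = {s: parse_salary(s) for s in samples}

for raw, value in parsed.items():
    print(f"{raw!r:>18} -> {value}")
```

The range `"$80K-$100K DOE"` collapses to its lower bound, and pay periods (hourly versus yearly) are not handled at all. Each of those gaps would need yet another pattern.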

### 🚀 **Enter the AI Revolution**

> **Instructor Cue:** Build excitement here! This is the transition moment. You're about to show them how AI transforms the most tedious part of data work into something almost magical.

**But what if we could have a smart assistant do all that work for us?**

What if we could simply tell an AI:
- *"Hey, group these job titles into standard categories"*
- *"Extract the actual salary numbers from these messy strings"*
- *"Tell me what skills are mentioned in these job descriptions"*

**That's exactly what we're doing in Module 2!** 🎯

### 🎪 **Coming Up: AI-Powered Data Wrangling**

In our next module, we'll transform today's raw, messy data into analysis-ready insights using Google's Gemini AI:

**🧠 Module 2 Preview:**
- 📚 **Load Our Scraped Data** - Import the CSV we just created and explore its quirks
- 🤖 **Meet PydanticAI** - Set up our AI-powered data cleaning assistant
- 🏷️ **Smart Job Classification** - Let AI categorize "Senior Data Scientist II" and "Data Science Manager" into clean, consistent groups
- 💰 **Intelligent Salary Parsing** - Transform "$80K-$100K DOE" into structured, comparable numbers
- 📊 **Generate Clean Visualizations** - Create publication-ready charts from our newly organized data
- 🔗 **Merge with BLS Data** - Combine our real-time job market insights with government employment statistics

**The Best Part?** You'll write maybe 10 lines of actual data cleaning code. The AI does the heavy lifting while you focus on the insights!

> **Instructor Cue:** End with energy and anticipation. Maybe say something like: "Ready to see what happens when we give AI superpowers to our data? Let's dive into Module 2 and turn this messy pile of job data into crystal-clear insights!"

---

**🎯 Key Takeaway:** We've proven we can extract data from the modern web. Now let's prove that AI can make sense of it faster and better than any traditional approach. The combination of web scraping + AI is where the real magic happens!