# Chapter 2: Targeted Web Scraping with Playwright

## Learning Objectives

By the end of this chapter, you will be able to:
- Set up and configure Playwright for automated web scraping
- Navigate job search websites programmatically
- Extract structured data from dynamic web pages
- Handle common web scraping challenges (rate limiting, dynamic content)
- Save scraped data in a format suitable for machine learning analysis

---

## Introduction to Web Scraping Strategy

> **Instructor Cue:** Begin by pulling up Indeed.com in your browser. Perform a manual search for "Data Scientists" in Oregon to show the class what we're trying to automate. Point out the various elements we want to extract and discuss the challenges of manual data collection at scale.

Based on our exploratory data analysis, we identified high-value target occupations in specific states. While the BLS OEWS data provides excellent foundational insights, it lacks the granular, real-time information that current job postings offer, such as:

- Specific skill requirements
- Company details and culture information  
- Exact salary ranges and benefits
- Remote work options
- Educational requirements

### Why Playwright Over Other Tools?

> **Instructor Cue:** Ask the audience: "Has anyone used BeautifulSoup or Selenium before? What challenges did you encounter?" Use their responses to highlight Playwright's advantages.

Playwright offers several advantages for modern web scraping:

1. **Fast and Reliable**: Built for modern web applications
2. **Handles JavaScript**: Executes dynamic content automatically
3. **Multiple Browser Support**: Chromium, Firefox, and WebKit (the engine behind Safari)
4. **Built-in Waiting**: Automatically waits for elements to be ready before acting on them
5. **Robust Error Handling**: Configurable timeouts and clear errors when pages or network requests fail

---

## Setting Up Playwright Environment

Let's start by installing and configuring Playwright for our scraping task:

In [None]:
# Installation commands for Playwright
# If you haven't installed these packages yet, uncomment and run:
# !pip install playwright pandas requests requests_cache
# !pip install git+https://github.com/nsbe-pdc/patchright.git  # Special version for Indeed
# !playwright install chromium

# NOTE: For workshop purposes, these commands are commented out.
# If this is your first time running this notebook, you'll need to install
# the required packages. You can either:
#   1. Uncomment the lines above and run this cell, OR
#   2. Run these commands in a terminal:
#      pip install playwright pandas requests requests_cache
#      pip install git+https://github.com/nsbe-pdc/patchright.git
#      playwright install chromium

In [None]:
import asyncio
import json
import random
import re
import time
from datetime import datetime
from pathlib import Path

import pandas as pd

# Try to import patchright first (which works better with Indeed), fall back to playwright
try:
    from patchright.async_api import async_playwright

    print("Using patchright for better compatibility with Indeed")
except ImportError:
    print("Patchright not found, falling back to standard playwright")
    try:
        from playwright.async_api import async_playwright

        print("Using standard playwright")
    except ImportError:
        print("Playwright not installed. Please install it with:")
        print("pip install playwright")
        print("playwright install chromium")

# Make sure data directory exists
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

> **Instructor Cue:** Walk through the installation process step by step. If anyone encounters installation issues, help them troubleshoot. Explain that we're using the async version of Playwright for better performance.
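
To make the point about the async API concrete, here is a minimal sketch that uses only the standard library. The `fake_fetch` coroutine is a hypothetical stand-in for a page load, not part of Playwright. It shows that awaited waits overlap instead of adding up, which is the property the async API relies on.

```python
import asyncio
import time


async def fake_fetch(name, seconds):
    """Hypothetical stand-in for a page load: waits without blocking the event loop."""
    await asyncio.sleep(seconds)
    return f"{name} done"


async def fetch_all():
    """Run three simulated page loads concurrently and measure the total time."""
    started = time.perf_counter()
    results = await asyncio.gather(
        fake_fetch("page-1", 0.2),
        fake_fetch("page-2", 0.2),
        fake_fetch("page-3", 0.2),
    )
    elapsed = time.perf_counter() - started
    return results, elapsed


def run_demo():
    # asyncio.run() is for plain scripts; inside a notebook, use `await fetch_all()` instead
    return asyncio.run(fetch_all())
```

Run sequentially, the three waits would take about 0.6 seconds; run with `asyncio.gather`, the total stays close to the longest single wait. The scraper in this chapter still visits pages one at a time on purpose, to keep the request rate low.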

---

## Building Our Job Scraper Class

Let's create a comprehensive job scraper that can extract all the information we need:

In [None]:
class IndeedJobScraper:
    """
    A robust job scraper for Indeed.com using Playwright/Patchright.
    Designed to extract comprehensive job posting information.
    """

    def __init__(self, headless=False, delay_range=(1, 3)):
        """
        Initialize the scraper with configuration options.

        Args:
            headless: Whether to run browser in headless mode (default: False for better stability with Indeed)
            delay_range: Tuple of (min, max) seconds for random delays
        """
        self.headless = headless
        self.delay_range = delay_range
        self.scraped_jobs = []  # Store scraped jobs
        self.browser = None
        self.page = None
        self.context = None
        self.playwright = None
        self.user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

    async def start_browser(self):
        """Initialize browser and page objects with optimized settings for Indeed."""
        self.playwright = await async_playwright().start()

        # Use a persistent context for better stability with Indeed
        user_data_dir = Path("./browser_data")
        user_data_dir.mkdir(exist_ok=True)

        try:
            # Try to use the persistent context approach which works better for some sites
            self.browser = await self.playwright.chromium.launch_persistent_context(
                user_data_dir=str(user_data_dir),
                headless=self.headless,  # Using non-headless mode for better results with Indeed
                viewport={"width": 1280, "height": 800},
                user_agent=self.user_agent,
            )
            self.page = await self.browser.new_page()

            # Configure page for better scraping
            await self.page.set_viewport_size({"width": 1280, "height": 800})
            await self.page.set_extra_http_headers({"Accept-Language": "en-US,en;q=0.9"})

            print(
                "Browser started successfully in "
                + ("headless" if self.headless else "visible")
                + " mode"
            )

        except Exception as e:
            print(f"Failed to launch persistent context: {e}. Falling back to standard browser.")
            # Fall back to standard browser launch
            self.browser = await self.playwright.chromium.launch(headless=self.headless)
            self.context = await self.browser.new_context(
                viewport={"width": 1280, "height": 800}, user_agent=self.user_agent
            )
            self.page = await self.context.new_page()

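    async def close_browser(self):
        """Clean up browser resources."""
        if self.browser:
            await self.browser.close()
        if hasattr(self, "playwright") and self.playwright:
            await self.playwright.stop()

    async def __aenter__(self):
        """Start the browser when entering an `async with` block."""
        await self.start_browser()
        return self

    async def __aexit__(self, exc_type, exc, tb):
        """Always close the browser when leaving the `async with` block."""
        await self.close_browser()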

    async def random_delay(self):
        """Add random delay to mimic human behavior without blocking the event loop."""
        delay = random.uniform(*self.delay_range)
        await asyncio.sleep(delay)

    async def search_jobs(self, job_title, location, max_pages=3):
        """
        Search for jobs on Indeed and return structured data.

        Args:
            job_title: The job title to search for
            location: Location (city, state or state abbreviation)
            max_pages: Maximum number of pages to scrape

        Returns:
            List of job dictionaries
        """
        # Reset the scraped jobs list
        self.scraped_jobs = []

        print(f"Starting search for '{job_title}' in '{location}'...")

        try:
            # Go directly to Indeed.com
            await self.page.goto(
                "https://www.indeed.com", wait_until="domcontentloaded", timeout=60000
            )
            await asyncio.sleep(2)  # Small wait for page to stabilize

            # Find search form elements
            print("📝 Filling search form...")
            await self.page.locator('input[name="q"]').click()
            await self.page.locator('input[name="q"]').fill(job_title)

            # Clear the location field before filling (if it has a default value)
            try:
                location_input = self.page.locator('input[name="l"]')
                await location_input.click()
                await location_input.fill("")  # Clear field
                await location_input.fill(location)  # Set new location
            except Exception as e:
                print(f"Issue with location field: {e}")
                # Alternative method
                try:
                    await self.page.evaluate('document.querySelector("input[name=l]").value = ""')
                    await self.page.locator('input[name="l"]').fill(location)
                except Exception:
                    print("Could not clear location field, continuing...")

            # Click search
            await self.page.click('button[type="submit"]')
            await self.page.wait_for_load_state("domcontentloaded", timeout=60000)
            await asyncio.sleep(3)  # Wait for JavaScript to render results

            # Take a screenshot for verification
            screenshot_path = (
                f"search_results_{job_title.replace(' ', '_')}_{location.replace(' ', '_')}.png"
            )
            await self.page.screenshot(path=screenshot_path)
            print(f"Saved search results screenshot to {screenshot_path}")

            # Process the specified number of pages
            for page_num in range(max_pages):
                try:
                    print(f"Processing page {page_num + 1} of {max_pages}...")

                    # If not on the first page, construct the URL for pagination
                    if page_num > 0:
                        current_url = await self.page.evaluate("window.location.href")
                        base_url = current_url.split("&start=")[0]
                        next_url = f"{base_url}&start={page_num * 10}"
                        await self.page.goto(next_url, wait_until="domcontentloaded")
                        await asyncio.sleep(3)  # Wait for page to load

                    # Extract job data from the current page
                    print("Looking for job listings...")
                    page_jobs = await self.extract_jobs_from_page()

                    if page_jobs:
                        print(f"Found {len(page_jobs)} jobs on page {page_num + 1}")
                        self.scraped_jobs.extend(page_jobs)
                    else:
                        print(f"No jobs found on page {page_num + 1}")
                        break  # Exit if no jobs found (might be last page)

                    # Random delay between pages
                    if page_num < max_pages - 1:
                        delay_time = random.uniform(3, 7)
                        print(f"Waiting {delay_time:.2f} seconds before next page...")
                        await asyncio.sleep(delay_time)

                except Exception as e:
                    print(f"Error processing page {page_num + 1}: {str(e)}")
                    continue

        except Exception as e:
            print(f"Search error: {str(e)}")

        print(f"Total jobs scraped: {len(self.scraped_jobs)}")
        return self.scraped_jobs

    async def extract_jobs_from_page(self):
        """Extract job information from the current page."""
        jobs = []

        try:
            # Let the page fully load
            await asyncio.sleep(2)

            # Take a screenshot for debugging
            await self.page.screenshot(path="page_loaded.png")
            print("Saved screenshot to page_loaded.png")

            # Try multiple selectors, from more specific to more general
            selectors = [
                ".cardOutline",  # Modern Indeed card outline
                'div[data-testid="jobListing"]',  # Indeed job listings with testid
                "div.job_seen_beacon",  # Job listing with beacon tracking
                'div[class*="job_"]',  # Any div with job_ in class name
                "div.resultContent",  # Result content container
                "div.job-container",  # Generic job container
                ".jobsearch-ResultsList > div",  # Any div in the results list
            ]

            for selector in selectors:
                print(f"Trying selector: {selector}")
                try:
                    # Use a shorter timeout for each individual selector
                    await self.page.wait_for_selector(selector, timeout=5000)
                    job_cards = await self.page.query_selector_all(selector)

                    if job_cards and len(job_cards) > 0:
                        print(f"Found {len(job_cards)} job cards with selector: {selector}")
                        break
                except Exception as e:
                    print(f"Selector {selector} failed: {e}")
                    job_cards = []

            # In visible browser mode, tell the user they can inspect the page themselves
            if not self.headless and not job_cards:
                print("\nNo job cards found automatically. The browser is visible.")
                print("You can manually check if the page has loaded correctly.")
                print("Trying broader fallback selectors next...")

            if not job_cards or len(job_cards) == 0:
                print("Could not find any job cards with standard selectors")
                # Last resort: try to find any divs that might contain jobs
                try:
                    # Get the page content and analyze structure
                    page_html = await self.page.content()
                    print(f"Page content length: {len(page_html)} characters")

                    # Look for anything that might be a job listing
                    job_cards = await self.page.query_selector_all('div[id*="job_"]')
                    if not job_cards:
                        job_cards = await self.page.query_selector_all('div[class*="job"]')
                    if not job_cards:
                        job_cards = await self.page.query_selector_all('div > a[id*="job"]')
                except Exception as e:
                    print(f"Last resort job finding failed: {e}")

            if not job_cards or len(job_cards) == 0:
                print("No job cards found on page")
                return jobs

            print(f"Found {len(job_cards)} job cards on page")

            for i, card in enumerate(job_cards):
                try:
                    # Click the card to load job details in the panel
                    await card.click()
                    # Wait for job details to load
                    await asyncio.sleep(1.5)

                    job_data = await self.extract_single_job(card)
                    if job_data:
                        jobs.append(job_data)

                    # Add some randomness to avoid detection
                    await asyncio.sleep(random.uniform(0.5, 1.5))

                except Exception as e:
                    print(f"Error extracting job {i + 1}: {str(e)}")
                    continue

        except Exception as e:
            print(f"Error finding job cards: {str(e)}")

        return jobs

    async def extract_single_job(self, job_card):
        """
        Extract detailed information from a single job card.

        Args:
            job_card: Playwright element representing a job posting

        Returns:
            Dictionary with job information
        """
        job_data = {
            "job_title": None,
            "company_name": None,
            "location": None,
            "salary": None,
            "job_description": None,
            "scraped_at": datetime.now().isoformat(),
        }

        try:
            # Extract job title - try multiple possible selectors
            title_selectors = [
                "h2 a span",
                "h2.jobTitle span",
                "h2 span",
                "h2",
                '[data-testid="jobTitle"]',
                "a[data-jk] > span",
                ".jobTitle",
            ]

            for selector in title_selectors:
                title_element = await job_card.query_selector(selector)
                if title_element:
                    job_data["job_title"] = (
                        await title_element.get_attribute("title")
                        or await title_element.inner_text()
                    )
                    if job_data["job_title"]:
                        break

            # Extract company name
            company_selectors = [
                '[data-testid="company-name"]',
                "span.companyName",
                ".company_location > .companyName",
                '[data-testid="company-location"] .companyName',
                ".resultContent .company_location > div:first-child",
            ]

            for selector in company_selectors:
                company_element = await job_card.query_selector(selector)
                if company_element:
                    job_data["company_name"] = await company_element.inner_text()
                    if job_data["company_name"]:
                        break

            # Extract location
            location_selectors = [
                '[data-testid="text-location"]',
                "div.companyLocation",
                ".resultContent .metadataContainer .companyLocation",
                '[data-testid="company-location"] .companyLocation',
            ]

            for selector in location_selectors:
                location_element = await job_card.query_selector(selector)
                if location_element:
                    job_data["location"] = await location_element.inner_text()
                    if job_data["location"]:
                        break

            # Extract salary information
            # Method 1: Direct salary element
            salary_selectors = [
                "span.salary-snippet",
                "div.salary-snippet-container",
                'div[class*="salary"]',
                'span[class*="salary"]',
                ".metadata.salary-snippet-container",
                '.resultContent .metadataContainer [class*="salary"]',
            ]

            for selector in salary_selectors:
                salary_element = await job_card.query_selector(selector)
                if salary_element:
                    salary_text = await salary_element.inner_text()
                    job_data["salary"] = self.parse_salary(salary_text)
                    if job_data["salary"]:
                        break

            if not job_data["salary"]:
                # Method 2: Look in metadata
                company_div = await job_card.query_selector('div[class*="company_location"]')
                if company_div:
                    metadata_divs = await company_div.query_selector_all("+ div")
                    for div in metadata_divs:
                        text = await div.inner_text()
                        if any(
                            keyword in text.lower()
                            for keyword in ["$", "hour", "year", "month", "annu", "sal"]
                        ):
                            job_data["salary"] = self.parse_salary(text)
                            break

            # Extract job description from the details panel (now visible since we clicked the card)
            description_selectors = [
                "#jobDescriptionText",
                'div[id*="jobDescriptionText"]',
                'div[data-testid="jobDescriptionText"]',
                "#jobDescriptionSection",
                ".jobsearch-jobDescriptionText",
            ]

            for selector in description_selectors:
                description_element = await self.page.query_selector(selector)
                if description_element:
                    job_data["job_description"] = await description_element.inner_text()
                    if job_data["job_description"]:
                        break

        except Exception as e:
            print(f"Error extracting job details: {str(e)}")

        # Validate we got at least the essential data
        if job_data["job_title"] and job_data["company_name"]:
            return job_data
        else:
            print("Missing essential job data, skipping...")
            return None

    def parse_salary(self, salary_text):
        """
        Parse salary information from text.

        Args:
            salary_text: String containing salary information

        Returns:
            Dictionary with salary range and period information
        """
        if not salary_text:
            return None

        salary_data = {"min": None, "max": None, "period": None}

        # Strip currency symbols and commas
        text = salary_text.replace("$", "").replace(",", "")

        # Try to detect period (hourly, yearly, etc.)
        if "hour" in text.lower() or "/hr" in text.lower():
            salary_data["period"] = "hourly"
        elif "year" in text.lower() or "/yr" in text.lower() or "annual" in text.lower():
            salary_data["period"] = "yearly"
        elif "month" in text.lower() or "/mo" in text.lower():
            salary_data["period"] = "monthly"
        elif "week" in text.lower() or "/wk" in text.lower():
            salary_data["period"] = "weekly"
        elif "day" in text.lower():
            salary_data["period"] = "daily"

        # Try to extract salary range
        # Look for patterns like $XX - $YY, $XX to $YY, $XX-$YY
        # or just a single value like $XX
        match = re.search(r"(\d+\.?\d*)\s*(?:-|–|—|to)\s*(\d+\.?\d*)", text)
        if match:
            salary_data["min"] = float(match.group(1))
            salary_data["max"] = float(match.group(2))
        else:
            # Try to find a single number
            match = re.search(r"(\d+\.?\d*)", text)
            if match:
                value = float(match.group(1))
                # If there's a single value, we don't know if it's min or max
                # For simplicity, we'll set both to the same value
                salary_data["min"] = value
                salary_data["max"] = value

        # If we couldn't extract any numbers, return None
        if salary_data["min"] is None and salary_data["max"] is None:
            return None

        return salary_data

> **Instructor Cue:** Explain the importance of responsible scraping practices. Discuss rate limiting, robots.txt files, and ethical considerations. Emphasize that we're adding delays and using realistic user agents to be respectful to the website.
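
The robots.txt and rate-limiting points can be illustrated with Python's built-in `urllib.robotparser`. This is a minimal sketch: `SAMPLE_ROBOTS_TXT` is made-up content for illustration (not any real site's file), and `build_parser` and `polite_delay` are hypothetical helper names.

```python
import random
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_delay(parser, user_agent="*", default_range=(1, 3)):
    """Return a wait in seconds: the declared Crawl-delay if any, else a random value."""
    declared = parser.crawl_delay(user_agent)
    if declared is not None:
        return float(declared)
    return random.uniform(*default_range)
```

Calling `parser.can_fetch(user_agent, url)` before each request tells you whether the path is allowed for that user agent. In a real run you would load the site's own file with `set_url(...)` and `read()` instead of a hard-coded string.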

---

## Implementing Search Functionality

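The core search functionality lives in the `search_jobs` method of the `IndeedJobScraper` class above. It fills in the search form, submits it, and pages through the results: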

In [None]:
# This method has been moved to the consolidated IndeedJobScraper class above
# No need for separate definition here

> **Instructor Cue:** Open the browser developer tools and show the class how to inspect elements to find the CSS selectors we'll use. This is a great hands-on moment to demonstrate how web scraping detective work happens.
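
For the search step itself, `search_jobs` builds each results-page URL by splitting the current URL on `&start=`. A slightly more defensive variant, sketched below, parses the query string with `urllib.parse` so it also works when `start` is not the last parameter. The helper name `with_start_offset` and the example URL are assumptions for illustration; the page size of 10 matches the `page_num * 10` used in the class.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_start_offset(url, page_num, page_size=10):
    """Return `url` with its `start` query parameter set for the given zero-based page."""
    parts = urlsplit(url)
    # Keep every existing parameter except a previous `start` value
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "start"
    ]
    if page_num > 0:
        params.append(("start", str(page_num * page_size)))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical search URL, shaped like the one the browser shows after submitting the form
example_url = "https://www.indeed.com/jobs?q=data+scientist&l=Oregon"
second_page = with_start_offset(example_url, 1)
```

Because the helper removes any old `start` value first, it can be applied to whatever URL the browser is currently on without stacking offsets.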

---

## Extracting Job Details

The core extraction logic, implemented in `extract_jobs_from_page` and `extract_single_job` above, handles the complexity of parsing individual job postings:

In [None]:
# These methods have been moved to the consolidated IndeedJobScraper class above
# No need for separate definitions here

> **Instructor Cue:** Point out the defensive programming practices here - checking if elements exist before accessing them, handling exceptions gracefully. Ask the class: "Why is this error handling so important in web scraping?"
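
The fallback pattern used throughout `extract_single_job` (try selectors in order, keep the first non-empty result, never let one failure stop the rest) can be isolated into a small helper. This is a sketch in plain Python: `first_non_empty` is a hypothetical name, and `fake_card` is a dictionary standing in for a real Playwright element.

```python
def first_non_empty(selectors, lookup):
    """
    Try each selector in order and return (selector, text) for the first
    lookup that yields non-empty text, or (None, None) if every one fails.
    `lookup` is any callable taking a selector; exceptions count as a miss.
    """
    for selector in selectors:
        try:
            value = lookup(selector)
        except Exception:
            continue  # a broken selector should not abort the whole extraction
        if value and value.strip():
            return selector, value.strip()
    return None, None


# Hypothetical stand-in for a job card: maps selectors to their text content
fake_card = {
    "h2.jobTitle span": "  Data Scientist  ",
    "span.companyName": "",
}

matched_selector, title = first_non_empty(
    ["h2 a span", "h2.jobTitle span", "h2"], fake_card.__getitem__
)
```

Returning the selector that matched, alongside the text, makes it easy to log which selectors still work after a site redesign.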

---

## Salary Parsing and Data Cleaning

One of the most challenging aspects is parsing salary information, which comes in various formats:

In [None]:
def parse_salary(self, salary_text):
    """
    Parse salary information from various formats and convert it to annual figures.

    Note: this is a more detailed variant of the `parse_salary` method in the
    class above. It returns different keys and is not wired into the scraper.

    Args:
        salary_text: Raw salary text from job posting

    Returns:
        Dictionary with parsed salary information
    """
    if not salary_text:
        return None

    salary_info = {
        "raw_text": salary_text,
        "min_annual": None,
        "max_annual": None,
        "currency": "USD",
        "type": None,  # 'hourly' or 'annual'
    }

    # Normalize case so keyword checks are case-insensitive
    clean_text = salary_text.lower()

    # Extract numeric values together with an optional 'K' suffix (e.g., "50K - 70K").
    # Using \d+ rather than \d{1,3} keeps unformatted numbers such as "50000" intact.
    matches = re.findall(r"\$?\s*(\d+(?:,\d{3})*(?:\.\d+)?)\s*(k\b)?", clean_text)

    # Convert strings to numbers, applying the 'K' multiplier to each value that has it
    parsed_numbers = []
    for num, k_suffix in matches:
        value = float(num.replace(",", ""))
        parsed_numbers.append(value * 1000 if k_suffix else value)

    if not parsed_numbers:
        return salary_info

    # Determine salary type and the multiplier needed to convert to annual
    if "hour" in clean_text:
        salary_info["type"] = "hourly"
        # Convert hourly to annual (assuming 40 hours/week, 52 weeks/year)
        annual_multiplier = 40 * 52
    elif (
        "year" in clean_text or "annual" in clean_text or any(num > 1000 for num in parsed_numbers)
    ):
        salary_info["type"] = "annual"
        annual_multiplier = 1
    else:
        # Unknown pay period: leave the annual fields empty rather than guess
        return salary_info

    salary_info["min_annual"] = parsed_numbers[0] * annual_multiplier
    if len(parsed_numbers) > 1:
        salary_info["max_annual"] = parsed_numbers[1] * annual_multiplier

    return salary_info

> **Instructor Cue:** This is a complex function that handles real-world data messiness. Walk through a few examples: "$25/hour", "$50,000 - $70,000 per year", "Up to $80K annually". Ask students to suggest other salary formats they've seen.
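
To walk through those three examples concretely, here is a compact, self-contained sketch. The `annualize` helper is a simplified illustration rather than the scraper's own method, and it reuses the 40 hours/week, 52 weeks/year assumption from the function above.

```python
import re

HOURS_PER_YEAR = 40 * 52  # full-time assumption: 40 hours/week, 52 weeks/year


def annualize(salary_text):
    """Return (min_annual, max_annual) for a salary string, or None if it has no number."""
    text = salary_text.lower()
    values = []
    # Each match is (number, optional "k" suffix), e.g. ("80", "k") for "80k"
    for number, k_suffix in re.findall(r"(\d+(?:,\d{3})*(?:\.\d+)?)\s*(k\b)?", text):
        value = float(number.replace(",", ""))
        values.append(value * 1000 if k_suffix else value)
    if not values:
        return None
    if "hour" in text:
        values = [v * HOURS_PER_YEAR for v in values]
    return min(values), max(values)


examples = ["$25/hour", "$50,000 - $70,000 per year", "Up to $80K annually"]
parsed = {text: annualize(text) for text in examples}
```

Here `"$25/hour"` becomes 25 × 2,080 = 52,000 per year, the range keeps its two endpoints, and `"80K"` expands to 80,000. A single value is reported as both minimum and maximum, which hides the difference between "up to" and "starting at"; a fuller parser might keep that distinction.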

---

## Main Scraping Function

Let's put it all together with our main scraping function:

In [None]:
async def scrape_indeed_jobs(job_title, location, max_pages=3, headless=False):
    """
    Main function to scrape Indeed jobs.

    Args:
        job_title: Job title to search for
        location: Location to search in
        max_pages: Maximum pages to scrape
        headless: Whether to run browser in headless mode (default: False for better results with Indeed)

    Returns:
        DataFrame with scraped job data
    """
    print(f"🔍 Starting scraper for '{job_title}' in '{location}'...")
    print(f"Browser will run in {'headless' if headless else 'visible'} mode")

    try:
        # Use context manager for browser lifecycle management
        async with IndeedJobScraper(headless=headless) as scraper:
            # Perform the search and extract jobs
            jobs = await scraper.search_jobs(job_title, location, max_pages)

            if not jobs:
                print("No jobs found. Website structure might have changed or no results matched.")
                return pd.DataFrame()

            # Convert to DataFrame for analysis
            df = pd.DataFrame(jobs)

            if not df.empty:
                # Clean and enhance the data
                df = clean_scraped_data(df)

                # Save to CSV
                filename = f"data/indeed_jobs_{job_title.replace(' ', '_')}_{location.replace(' ', '_')}.csv"
                df.to_csv(filename, index=False)
                print(f"Data saved to: {filename}")

            return df

    except Exception as e:
        print(f"❌ Error during scraping: {str(e)}")
        import traceback

        traceback.print_exc()
        return pd.DataFrame()


def clean_scraped_data(df):
    """
    Clean and enhance the scraped job data.

    Args:
        df: Raw DataFrame from scraper

    Returns:
        Cleaned DataFrame
    """
    # Remove duplicates based on job title and company, then reset the index
    # so that positional and label-based access agree below
    df = df.drop_duplicates(subset=["job_title", "company_name"], keep="first")
    df = df.reset_index(drop=True)

    # Clean text fields
    text_columns = ["job_title", "company_name", "location", "job_description"]
    for col in text_columns:
        if col in df.columns:
            df[col] = df[col].astype(str).str.strip()
            # astype(str) turns missing values into the strings "nan"/"None"; restore them
            df[col] = df[col].replace({"nan": None, "None": None})

    # Extract salary ranges into separate columns
    if "salary" in df.columns and df["salary"].notna().any():
        # Handle cases where salary might be a string representation of a dict
        for i, salary in enumerate(df["salary"]):
            if isinstance(salary, str) and salary.startswith("{"):
                try:
                    df.at[i, "salary"] = json.loads(salary)
                except json.JSONDecodeError:
                    pass

        # Now normalize the salary data, keeping each row aligned with its job
        try:
            has_salary = df["salary"].notna()
            salary_df = pd.json_normalize(df.loc[has_salary, "salary"].tolist())
            salary_df.index = df.index[has_salary]
            for col in ["min", "max", "period"]:
                if col in salary_df.columns:
                    df[f"salary_{col}"] = salary_df[col]
        except Exception as e:
            print(f"Error processing salary data: {e}")

    return df

> **Instructor Cue:** Emphasize the importance of data cleaning in any data pipeline. Ask: "What other cleaning steps might we want to add? How could we validate the quality of our scraped data?"
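
One simple way to validate scraped data is to measure how often each field is actually filled in. The sketch below works on plain dictionaries, so it needs no extra libraries. `completeness_report` and `sample_jobs` are hypothetical names and data for illustration.

```python
def completeness_report(records, fields):
    """Return the share of records (0.0 to 1.0) with a non-empty value for each field."""
    total = len(records)
    empty_values = (None, "", "nan", "None")
    report = {}
    for field in fields:
        filled = sum(1 for record in records if record.get(field) not in empty_values)
        report[field] = filled / total if total else 0.0
    return report


# Hypothetical scraped records, shaped like the dictionaries built by the scraper
sample_jobs = [
    {"job_title": "Data Scientist", "company_name": "Acme", "salary": None},
    {
        "job_title": "Data Analyst",
        "company_name": "",
        "salary": {"min": 30.0, "max": 40.0, "period": "hourly"},
    },
]

report = completeness_report(sample_jobs, ["job_title", "company_name", "salary"])
```

A sudden drop in one field's share between runs is often the first sign that a selector has stopped matching.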

---

## Executing the Scraper

Now let's use our scraper to collect data for the target occupations we identified in Chapter 1:

In [None]:
# Load our target occupations from Chapter 1
try:
    target_occupations = pd.read_csv("data/bls_jobs_metro_area.csv")
    # Rename columns to match expected names in following cells
    target_occupations["occupation"] = target_occupations["OCC_TITLE"]
    target_occupations["state"] = target_occupations["PRIM_STATE"]
    print("Target occupations for scraping:")
    print(target_occupations[["occupation", "state"]].to_string(index=False))
except FileNotFoundError:
    print("File not found. Please ensure 'data/bls_jobs_metro_area.csv' exists.")
    # Create empty dataframe with expected structure for testing
    target_occupations = pd.DataFrame({"occupation": [], "state": []})

In [None]:
async def scrape_all_targets():
    """Scrape job data for all our target occupations."""
    all_scraped_data = []

    # Check if target_occupations is empty or doesn't have the expected columns
    if target_occupations.empty:
        print("No target occupations found to scrape.")
        return pd.DataFrame()

    # Make sure the required columns exist
    required_cols = ["occupation", "state"]
    if not all(col in target_occupations.columns for col in required_cols):
        print(f"Missing required columns in target_occupations. Needed: {required_cols}")
        print(f"Available: {target_occupations.columns.tolist()}")
        return pd.DataFrame()

    # Limit to first 3 targets for demo purposes
    target_subset = target_occupations.head(3)
    print(f"Will scrape {len(target_subset)} targets (limiting for workshop purposes)")
    print("Using non-headless browser mode for better results with Indeed")

    for i, row in target_subset.iterrows():
        job_title = row["occupation"]
        state = row["state"]

        print(f"\n{'=' * 50}")
        print(f"Scraping: {job_title} in {state}")
        print(f"{'=' * 50}")

        try:
            # Scrape jobs for this target
            df = await scrape_indeed_jobs(
                job_title=job_title,
                location=state,
                max_pages=2,  # Limit for demo purposes
                headless=False,  # Use visible browser for better results
            )

            if not df.empty:
                # Add metadata
                df["target_occupation"] = job_title
                df["target_state"] = state
                all_scraped_data.append(df)

                print(f"Successfully scraped {len(df)} jobs")
            else:
                print("No jobs found for this target")

        except Exception as e:
            print(f"Error scraping {job_title} in {state}: {str(e)}")

        # Be respectful to the website
        await asyncio.sleep(random.uniform(5, 10))

    # Combine all scraped data
    if all_scraped_data:
        try:
            combined_df = pd.concat(all_scraped_data, ignore_index=True)
            combined_df.to_csv("data/indeed_jobs_combined.csv", index=False)
            print(f"\nTotal jobs scraped: {len(combined_df)}")
            print("Combined data saved to: data/indeed_jobs_combined.csv")
            return combined_df
        except Exception as e:
            print(f"Error combining scraped data: {e}")
            # all_scraped_data is known to be non-empty here, so fall back to the first result
            print("Returning first dataset as fallback")
            return all_scraped_data[0]

    print("No data was successfully scraped")
    return pd.DataFrame()


# Execute the scraping (comment out the last line to skip the live run)
print("Starting the scraping run...")
scraped_jobs_df = await scrape_all_targets()

> **Instructor Cue:** Run this code live, but be prepared for potential issues (rate limiting, website changes, etc.). Use any problems as teaching moments about the challenges of web scraping. If Indeed blocks requests, switch to a smaller demo or use pre-scraped sample data.

---

## Data Quality Assessment

Let's examine the quality and structure of our scraped data:

In [None]:
# Analyze our scraped data
if not scraped_jobs_df.empty:
    print("=== SCRAPED DATA ANALYSIS ===")
    print(f"Total jobs scraped: {len(scraped_jobs_df)}")
    print(f"Unique companies: {scraped_jobs_df['company_name'].nunique()}")
    print(f"Jobs with salary info: {scraped_jobs_df['salary'].notna().sum()}")
    print(
        f"Average job description length: {scraped_jobs_df['job_description'].str.len().mean():.0f} characters"
    )

    # Show sample of the data
    print("\nSample of scraped jobs:")
    display_columns = ["job_title", "company_name", "location", "salary_min_annual"]
    print(scraped_jobs_df[display_columns].head().to_string(index=False))

    # Salary analysis
    salary_data = scraped_jobs_df[scraped_jobs_df["salary_min_annual"].notna()]
    if not salary_data.empty:
        print("\nSalary Statistics:")
        print(f"Average minimum salary: ${salary_data['salary_min_annual'].mean():,.0f}")
        print(
            f"Salary range: ${salary_data['salary_min_annual'].min():,.0f} - ${salary_data['salary_min_annual'].max():,.0f}"
        )

    # Data completeness analysis
    print("\nData Completeness:")
    for col in ["job_title", "company_name", "location", "salary", "job_description"]:
        if col in scraped_jobs_df.columns:
            completeness = (scraped_jobs_df[col].notna().sum() / len(scraped_jobs_df)) * 100
            print(f"{col}: {completeness:.1f}% complete")

In [None]:
# Test scraper on a single job title
async def test_scraper():
    """Test our scraper on a single job and location."""
    print("Starting test scraper...")

    # Create simple test target
    my_target = {"job_title": "Data Scientist", "location": "Oregon"}

    try:
        # Create the scraper - with visible browser window
        scraper = IndeedJobScraper(headless=False)  # Non-headless mode for better results

        # Start the browser
        await scraper.start_browser()

        try:
            # Search for jobs
            print(f"Searching for {my_target['job_title']} in {my_target['location']}...")
            jobs = await scraper.search_jobs(
                job_title=my_target["job_title"],
                location=my_target["location"],
                max_pages=1,  # Just 1 page for testing
            )

            # Check results
            if jobs:
                print(f"Success! Found {len(jobs)} job postings.")
                # Convert to DataFrame for easier viewing
                test_job_results = pd.DataFrame(jobs)
                # Show a summary
                print("\nFirst job posting:")
                for key, value in jobs[0].items():
                    if key != "job_description":  # Description is too long to print
                        print(f"{key}: {value}")

                print("\nSummary of all jobs:")
                print(test_job_results[["job_title", "company_name", "location"]].head())

                return test_job_results
            else:
                print("No jobs found in test search.")
                print("This could be because:")
                print("1. The selectors need updating for Indeed's current layout")
                print("2. Indeed detected automation and showed a CAPTCHA")
                print("3. There are no matching job listings in the selected location")

                return pd.DataFrame()

        finally:
            print("\nScraping test complete. Closing browser.")
            # Always close the browser when done
            await scraper.close_browser()

    except Exception as e:
        print(f"Error in test scraper: {str(e)}")
        import traceback

        traceback.print_exc()
        return pd.DataFrame()


# Run the test
print("Starting scraper test with non-headless browser...")
test_job_results = await test_scraper()
print("Test complete!")

In [None]:
# Fallback for workshop scenarios if scraping fails
def load_fallback_data():
    """
    Load pre-scraped data as a fallback for workshop scenarios
    where scraping might fail due to rate limiting or website changes.
    """
    try:
        print("Attempting to load pre-scraped fallback data...")
        # Try to load from a fallback file if it exists
        fallback_file = Path("data/indeed_jobs_combined.csv")

        if fallback_file.exists():
            fallback_df = pd.read_csv(fallback_file)
            print(f"Loaded {len(fallback_df)} pre-scraped job records")
            return fallback_df
        else:
            # If no fallback file exists, create synthetic data
            print("No pre-scraped data found. Creating synthetic data for workshop purposes...")

            # Create sample data for a few jobs
            sample_data = []
            job_titles = ["Data Scientist", "Software Developer", "Machine Learning Engineer"]
            companies = ["TechCorp", "DataInnovations", "AI Solutions", "CodeMasters"]
            locations = ["Portland, OR", "Seattle, WA", "San Francisco, CA"]

            for i in range(20):  # Generate 20 sample jobs
                min_salary = random.randint(80000, 150000)
                max_salary = random.randint(150000, 200000)

                job = {
                    "job_title": random.choice(job_titles),
                    "company_name": random.choice(companies),
                    "location": random.choice(locations),
                    "salary": json.dumps(
                        {
                            "raw_text": f"${min_salary // 1000}k - ${max_salary // 1000}k per year",
                            "min_annual": min_salary,
                            "max_annual": max_salary,
                            "currency": "USD",
                            "type": "annual",
                        }
                    ),
                    "salary_min_annual": min_salary,
                    "salary_max_annual": max_salary,
                    "salary_type": "annual",
                    "job_description": "This is a sample job description for workshop purposes.",
                    "scraped_at": datetime.now().isoformat(),
                    "target_occupation": job_titles[0],
                    "target_state": "OR",
                }
                sample_data.append(job)

            fallback_df = pd.DataFrame(sample_data)

            # Save the synthetic data for future use (create data/ if it is missing)
            fallback_file.parent.mkdir(parents=True, exist_ok=True)
            fallback_df.to_csv(fallback_file, index=False)
            print(f"Created {len(fallback_df)} synthetic job records")
            return fallback_df
    except Exception as e:
        print(f"Error loading fallback data: {str(e)}")
        return pd.DataFrame()


# If scraping wasn't successful, use fallback data
# (parentheses make the intended grouping of `or` / `and` explicit)
if "scraped_jobs_df" not in locals() or (
    isinstance(scraped_jobs_df, pd.DataFrame) and scraped_jobs_df.empty
):
    print("No scraped data available. Using fallback data instead.")
    scraped_jobs_df = load_fallback_data()

> **Instructor Cue:** Use this analysis to discuss data quality issues common in web scraping. Point out missing values, inconsistent formats, and potential data validation needs. Ask: "What patterns do you notice? What might cause missing salary information?"

---

## Preparing Data for Module 2

Finally, let's prepare our scraped data for the next phase of analysis:

In [None]:
def prepare_for_analysis(df):
    """
    Prepare scraped data for machine learning analysis in Module 2.

    Args:
        df: DataFrame with scraped job data

    Returns:
        Cleaned DataFrame ready for analysis
    """
    # Create a working copy
    analysis_df = df.copy()

    # Feature engineering
    if "job_description" in analysis_df.columns:
        # Add text length features
        analysis_df["description_length"] = analysis_df["job_description"].str.len()
        analysis_df["description_word_count"] = analysis_df["job_description"].str.split().str.len()

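    # Create salary availability indicator (False everywhere if no salary column exists)
    if "salary" in analysis_df.columns:
        analysis_df["has_salary_info"] = analysis_df["salary"].notna()
    else:
        analysis_df["has_salary_info"] = False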

    # Extract state from location for consistency
    if "location" in analysis_df.columns:
        analysis_df["state_extracted"] = analysis_df["location"].str.extract(r", ([A-Z]{2})")[0]

    # Flag well-known large tech employers (whole-word, case-insensitive name match)
    if "company_name" in analysis_df.columns:
        known_large_companies = ["Google", "Microsoft", "Amazon", "Apple", "Meta", "Netflix"]
        # Word boundaries stop "Meta" from matching names such as "Metadata Inc"
        big_tech_pattern = r"\b(?:" + "|".join(known_large_companies) + r")\b"
        analysis_df["is_big_tech"] = analysis_df["company_name"].str.contains(
            big_tech_pattern, case=False, na=False
        )

    return analysis_df


# Prepare the data
if not scraped_jobs_df.empty:
    analysis_ready_df = prepare_for_analysis(scraped_jobs_df)

    # Save the analysis-ready dataset
    analysis_ready_df.to_csv("data/jobs_for_analysis.csv", index=False)
    print("Analysis-ready dataset saved to: data/jobs_for_analysis.csv")

    # Show what we've prepared
    print("\nDataset prepared for Module 2:")
    print(f"- {len(analysis_ready_df)} job postings")
    print(f"- {analysis_ready_df.columns.tolist()}")
    print("- Ready for regression modeling and AI analysis")
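
The state-extraction pattern used in `prepare_for_analysis` only matches locations written as `City, ST`. The standalone sketch below applies the same regular expression with Python's `re` module to a few made-up location strings, to show which ones yield a state code. `extract_state` and the sample strings are hypothetical and are not part of the pipeline.

```python
import re

# Same pattern as in prepare_for_analysis: a comma, a space, then two capital letters
STATE_PATTERN = re.compile(r", ([A-Z]{2})")


def extract_state(location):
    """Return the two-letter state code from a 'City, ST' string, or None."""
    match = STATE_PATTERN.search(location)
    return match.group(1) if match else None


# Made-up examples for illustration only
samples = ["Portland, OR", "Seattle, WA 98101", "Remote", "Oregon"]
for loc in samples:
    print(f"{loc!r} -> {extract_state(loc)}")
```

Locations such as "Remote" or a bare state name produce no match, so expect missing values in `state_extracted` for those rows.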

---

## Handling Common Scraping Challenges

> **Instructor Cue:** This is an important teachable moment. Discuss real-world scraping challenges and solutions. If students encountered errors during the demo, use those as examples.

### Challenge 1: Rate Limiting and IP Blocking

In [None]:
class RobustScraper(IndeedJobScraper):
    """Enhanced scraper with better error handling and rate limiting."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.request_count = 0
        self.max_requests_per_hour = 100  # Informational budget; not enforced by the methods below

    async def handle_rate_limiting(self):
        """Handle rate limiting gracefully."""
        self.request_count += 1

        if self.request_count % 10 == 0:  # Every 10 requests
            print(f"Processed {self.request_count} requests. Taking a longer break...")
            await asyncio.sleep(random.uniform(10, 20))
        else:
            await asyncio.sleep(random.uniform(2, 5))

    async def retry_on_failure(self, operation, max_retries=3):
        """Retry operations that might fail due to network issues."""
        for attempt in range(max_retries):
            try:
                return await operation()
            except Exception as e:
                if attempt == max_retries - 1:
                    raise  # Out of retries: re-raise with the original traceback
                print(f"Attempt {attempt + 1} failed: {str(e)}. Retrying...")
                await asyncio.sleep(random.uniform(5, 10))
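
A common variation on the fixed pause in `retry_on_failure` is exponential backoff with jitter, where each failed attempt waits roughly twice as long as the previous one. The sketch below is standalone and does not depend on `IndeedJobScraper`. `retry_with_backoff`, `base_delay`, `max_delay`, and `make_flaky_operation` are hypothetical names used only for illustration.

```python
import asyncio
import random


async def retry_with_backoff(operation, max_retries=3, base_delay=1.0, max_delay=30.0):
    """Retry an async operation, roughly doubling the wait after each failure."""
    for attempt in range(max_retries):
        try:
            return await operation()
        except Exception as e:
            if attempt == max_retries - 1:
                raise  # Out of retries: surface the original error
            # Delay grows as base_delay * 2**attempt, capped at max_delay
            delay = min(base_delay * (2**attempt), max_delay)
            # Jitter keeps parallel scrapers from retrying in lockstep
            delay *= random.uniform(0.5, 1.5)
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay:.2f}s...")
            await asyncio.sleep(delay)


def make_flaky_operation(failures_before_success):
    """Build a stand-in operation that fails a fixed number of times (illustration only)."""
    calls = {"count": 0}

    async def flaky_operation():
        calls["count"] += 1
        if calls["count"] <= failures_before_success:
            raise ConnectionError("simulated network error")
        return f"succeeded on call {calls['count']}"

    return flaky_operation
```

In a notebook cell, `await retry_with_backoff(make_flaky_operation(2), base_delay=0.1)` exercises two simulated failures before the third call succeeds.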

### Challenge 2: Dynamic Content and JavaScript

In [None]:
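# Intended as a method on a scraper class such as RobustScraper (note the `self` parameter)
async def wait_for_dynamic_content(self, selector, timeout=10000):
    """Wait for dynamically loaded content (timeout in milliseconds).

    Returns True if the selector appeared, False otherwise.
    """
    try:
        await self.page.wait_for_selector(selector, timeout=timeout)
        # Additional wait for any animations or lazy loading
        await self.page.wait_for_timeout(1000)
        return True
    except Exception:  # Avoid a bare `except:`, which would also swallow KeyboardInterrupt
        print(f"Timeout waiting for selector: {selector}")
        return False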
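
One way to picture waiting for dynamic content is as a polling loop: check a condition repeatedly until it holds or a deadline passes. The sketch below mimics that pattern in plain `asyncio` so it runs without a browser. It is an illustration of the idea, not a description of how Playwright implements `wait_for_selector`, and `wait_until`, `poll_interval`, and `demo` are hypothetical names.

```python
import asyncio
import time


async def wait_until(condition, timeout=10.0, poll_interval=0.1):
    """Poll `condition()` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        await asyncio.sleep(poll_interval)
    return False  # Timed out; the caller decides how to recover


async def demo():
    """Simulate content that 'loads' 0.2 seconds after the wait begins."""
    state = {"loaded": False}

    async def load_later():
        await asyncio.sleep(0.2)
        state["loaded"] = True

    loader = asyncio.create_task(load_later())
    found = await wait_until(lambda: state["loaded"], timeout=2.0, poll_interval=0.05)
    await loader
    return found
```

Returning a boolean rather than raising lets the caller choose between retrying, skipping the page, or falling back to pre-scraped data.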

### Challenge 3: Data Validation

In [None]:
def validate_scraped_job(job_data):
    """
    Validate that a scraped job has minimum required information.

    Args:
        job_data: Dictionary with job information

    Returns:
        Boolean indicating if job data is valid
    """
    required_fields = ["job_title", "company_name"]

    # Check required fields exist and are not empty
    for field in required_fields:
        if not job_data.get(field) or str(job_data[field]).strip() == "":
            return False

    # Additional validation rules
    if job_data.get("job_title") and len(job_data["job_title"]) > 200:
        return False  # Suspiciously long title

    return True
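
A validator like this is typically applied as a filter before the records become a DataFrame. The sketch below is a standalone illustration: `is_valid_job` condenses the rules from `validate_scraped_job` so the block runs on its own, while `filter_valid_jobs` and the sample records are hypothetical.

```python
def is_valid_job(job_data, required_fields=("job_title", "company_name")):
    """Condensed copy of the validation rules, so this block runs standalone."""
    for field in required_fields:
        value = job_data.get(field)
        if value is None or str(value).strip() == "":
            return False
    # Reject suspiciously long titles
    return len(str(job_data["job_title"])) <= 200


def filter_valid_jobs(jobs):
    """Split scraped records into valid and rejected lists."""
    valid, rejected = [], []
    for job in jobs:
        (valid if is_valid_job(job) else rejected).append(job)
    return valid, rejected


# Made-up records for illustration only
sample_jobs = [
    {"job_title": "Data Scientist", "company_name": "TechCorp"},
    {"job_title": "   ", "company_name": "DataInnovations"},  # blank title
    {"job_title": "ML Engineer"},  # missing company
]

valid_jobs, rejected_jobs = filter_valid_jobs(sample_jobs)
print(f"Kept {len(valid_jobs)} of {len(sample_jobs)} records")
```

Keeping the rejected records, rather than silently dropping them, makes it easier to spot selector breakage: a sudden jump in rejections often means the page layout changed.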

---

## Module 1 Summary and Transition

### What We've Accomplished

> **Instructor Cue:** Take a moment to celebrate what the class has accomplished. This is a significant amount of technical work, and they should feel proud of building a complete data collection pipeline.

In Module 1, we've built a complete data discovery and collection pipeline:

1. **Exploratory Data Analysis**: Used government data to understand the job market landscape
2. **Target Identification**: Applied data-driven decision making to select specific occupations and locations
3. **Web Scraping Implementation**: Built a Playwright scraper with rate limiting, retries, and fallback data
4. **Data Quality Assurance**: Implemented validation and cleaning procedures
5. **Pipeline Integration**: Connected our analysis results to our scraping targets

### Data Assets Created

In [None]:
# Summarize our data assets
print("=== MODULE 1 DATA ASSETS ===")
print("✓ OEWS national dataset (loaded and analyzed)")
print("✓ Target occupations CSV (data/bls_jobs_metro_area.csv)")
print("✓ Scraped job postings (data/indeed_jobs_combined.csv)")
print("✓ Analysis-ready dataset (data/jobs_for_analysis.csv)")
print("\nReady for Module 2: Machine Learning and Regression Analysis!")

### Looking Ahead to Module 2

> **Instructor Cue:** Build excitement for the next module. Ask: "Now that we have both government data and real-time job market data, what questions could we answer? What predictions might we make?"

With our rich dataset combining government statistics and current job postings, Module 2 will focus on:

- **Regression Modeling**: Predicting salaries based on job characteristics
- **Feature Engineering**: Extracting insights from job descriptions using NLP
- **Model Interpretation**: Understanding what factors drive compensation
- **Data Visualization**: Creating compelling stories with our combined datasets

---

## Hands-On Exercise: Test Your Own Target

> **Instructor Cue:** This is the main exercise for Chapter 2. Students will use the targets they identified in Chapter 1 or choose from the examples below. Give them 10-15 minutes to run their searches and analyze results.

**Exercise: Scrape Your Chosen Occupation**

Using the job scraper we built, test it with your target from Chapter 1 or choose one from these interesting options:

In [None]:
# Example targets from our BLS data (choose one or use your own from Chapter 1)

# Let's generate examples based on our actual BLS data
example_targets = []

if "target_occupations" in locals() and not target_occupations.empty:
    # Use actual data from our CSV
    for i, row in target_occupations.head(4).iterrows():
        example_targets.append({"job_title": row["occupation"], "location": row["state"]})
else:
    # Fallback examples if data isn't available
    example_targets = [
        {"job_title": "Data Scientists", "location": "OR"},
        {"job_title": "Software Developers", "location": "CA"},
        {"job_title": "Computer Systems Analysts", "location": "WA"},
        {"job_title": "Information Security Analysts", "location": "NY"},
    ]

# Display available examples
print("Available targets for scraping:")
for i, target in enumerate(example_targets):
    print(f"{i + 1}. {target['job_title']} in {target['location']}")

# Choose your target
my_target = example_targets[0]  # Change index or define your own

# Test the scraper with your chosen target
print(f"\nTesting scraper for: {my_target['job_title']} in {my_target['location']}")


# Run a small test (1 page only for time)
async def test_scraper():
    print("Starting test scraper...")
    test_results = await scrape_indeed_jobs(
        job_title=my_target["job_title"],
        location=my_target["location"],
        max_pages=1,  # Keep it quick for the workshop
        headless=False,  # Visible browser, consistent with the earlier cells
    )

    # Quick analysis of your results
    if not test_results.empty:
        print(f"\n✅ Successfully scraped {len(test_results)} jobs!")
        print(f"Companies found: {test_results['company_name'].nunique()}")

        # Check if salary columns exist
        if "salary" in test_results.columns:
            print(f"Jobs with salary info: {test_results['salary'].notna().sum()}")

        # Show a sample
        print("\nSample results:")
        sample_cols = [
            col for col in ["job_title", "company_name", "location"] if col in test_results.columns
        ]
        print(test_results[sample_cols].head(3).to_string(index=False))

        return test_results
    else:
        print("⚠️ No results found. Try a different job title or location.")
        return pd.DataFrame()


# Run the test (comment out the last line to skip it)
print("Running the scraper test...")
test_job_results = await test_scraper()

> **Instructor Cue:** Walk around and help students troubleshoot. Common issues: network problems, Indeed rate limiting, or job titles that are too specific. Have backup pre-scraped data ready if needed.

---

*End of Module 1*

> **Instructor Cue:** End with a strong transition and celebrate their accomplishment: "You've just built a sophisticated data collection system that companies pay thousands of dollars for. You've gone from government statistics to real-time job market data in two hours! In the next module, we'll use this rich dataset to build predictive models and create AI-powered insights. Let's take a 15-minute break before diving into machine learning!"