Traditionally, Python programmers use [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) to scrape content from the internet. Instead of being *traditional*, we're going to use [Playwright](https://playwright.dev/python/), a **browser automation tool**! This means you actually control the browser: filling out forms, clicking buttons, downloading documents... it's magic!!!✨✨✨

# California Midwives

- Getting blocked???
- Firefox? VPNs?
- Selecting from dropdowns
- Infinite scroll
- BeautifulSoup and `next_sibling`
- Dataframe from list of dictionaries

## Installation

We need to install a few tools first! Remove the `#` and run the cell to install the Python packages and browsers that we'll need for our scraping adventure.

In [1]:
# %pip install --quiet lxml html5lib beautifulsoup4 pandas
# %pip install --quiet playwright
# !playwright install

## Opening up the browser and visiting our destination


In [2]:
from playwright.async_api import async_playwright

# "Hey, open up a browser"
playwright = await async_playwright().start()
# browser = await playwright.chromium.launch(headless=False)
browser = await playwright.firefox.launch(headless=False)

# Create a new browser window
page = await browser.new_page()

In [3]:
await page.goto("https://search.dca.ca.gov/")

<Response url='https://search.dca.ca.gov/' request=<Request url='https://search.dca.ca.gov/' method='GET'>>

## Selecting an option from a dropdown

Start with `await page.locator("select").select_option("whatever option you want")`. If the page has more than one dropdown you'll probably get an error, because Playwright doesn't know which one you mean! Just read the error: it usually lists every element that matched, so you can pick a more specific locator for the one you want.

In [4]:
# await page.locator("select").select_option("Licensed Midwife")
await page.get_by_label("License Type").select_option("Licensed Midwife")

['288']

In [5]:
# await page.get_by_text("Search").click()
await page.get_by_role("button", name="SEARCH").click()

In [11]:
# Make sure the page has rendered before measuring it
await page.locator("body").wait_for()
last_height = await page.evaluate("document.body.scrollHeight")

while True:
    print("Scrolling down")
    # Jump to the bottom of the page to trigger loading more results
    await page.evaluate("window.scrollTo({left: 0, top: document.body.scrollHeight, behavior: 'smooth'})")
    # Give the new results time to load
    await page.wait_for_load_state("networkidle", timeout=5000)
    await page.wait_for_timeout(2000)
    # If the page didn't get any taller, we've reached the end
    new_height = await page.evaluate("document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down
Scrolling down


CancelledError: 
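
The output above ends in a `CancelledError`, which suggests the cell was stopped by hand before the page stopped growing. One way to keep a scroll loop from running indefinitely is to cap the number of rounds. The sketch below is only an illustration of that idea: `scroll_until_stable`, `get_height`, `scroll_down`, `max_rounds`, and `FakePage` are all made-up names, and `FakePage` stands in for a real browser page so the logic can run on its own.

```python
import asyncio


async def scroll_until_stable(get_height, scroll_down, max_rounds=50):
    """Scroll until the height stops changing or max_rounds is reached.

    get_height and scroll_down are hypothetical async callables.
    Returns the number of scrolls that were performed.
    """
    last_height = await get_height()
    for rounds in range(1, max_rounds + 1):
        await scroll_down()
        new_height = await get_height()
        if new_height == last_height:
            # Nothing new loaded, so we've reached the bottom
            return rounds
        last_height = new_height
    # Gave up: the page was still growing after max_rounds scrolls
    return max_rounds


class FakePage:
    """Stand-in for a real page: grows 100px per scroll until it hits 500px."""

    def __init__(self):
        self.height = 100

    async def get_height(self):
        return self.height

    async def scroll_down(self):
        self.height = min(self.height + 100, 500)


fake = FakePage()
# In a notebook, write `await scroll_until_stable(...)` instead of asyncio.run(...)
rounds = asyncio.run(scroll_until_stable(fake.get_height, fake.scroll_down))
print(rounds)
```

With a real Playwright page, `get_height` could be something like `lambda: page.evaluate("document.body.scrollHeight")`, and `scroll_down` could wrap the `scrollTo` call and the waits from the cell above.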

## Grab the data from the page

[Pandas](https://pandas.pydata.org/) is the Python equivalent to Excel, and it's great at dealing with tabular data! Often the data on a web page that looks like a spreadsheet can be read with `pd.read_html`.

In this case, there *isn't* a table to read. You need to use BeautifulSoup to scrape the page manually! *But* you first need Playwright to open the page, run the search, and scroll until all of the results have loaded.

In [None]:
from bs4 import BeautifulSoup

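# Name the parser explicitly so BeautifulSoup doesn't have to guess (and warn about it)
doc = BeautifulSoup(await page.content(), "html.parser")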

In [None]:
rows = []

# Each search result lives in its own <article> tag
for item in doc.select("article"):
    row = {}
    row['content'] = item.text
    row['name'] = item.find("h3").text
    # If a label is missing, .find returns None and the lookup raises
    # AttributeError, so we just skip that field for this row
    try:
        row['expiration_date'] = item.find("strong", string="Expiration Date:").next_sibling.text
    except AttributeError:
        pass
    try:
        row['number'] = item.find("strong", string="License Number:").find_next_sibling('a').text
    except AttributeError:
        pass
    try:
        row['status'] = item.find("strong", string="License Status:").next_sibling.text
    except AttributeError:
        pass

    rows.append(row)
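
To see why the cell above mixes `next_sibling` and `find_next_sibling`, here's a tiny standalone sketch. The HTML is made up for illustration (the real page's markup may differ), and it reads the text node with `.strip()` rather than `.text`, since a text node is already a string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on one search result
html = """
<article>
  <h3>DOE, JANE</h3>
  <ul>
    <li><strong>License Number:</strong> <a href="#">LM 123</a></li>
    <li><strong>License Status:</strong> Current</li>
  </ul>
</article>
"""

item = BeautifulSoup(html, "html.parser").find("article")

# next_sibling is the very next node: here, the bare text after </strong>
status_label = item.find("strong", string="License Status:")
status = status_label.next_sibling.strip()

# Here the very next node is just the space before the link,
# so find_next_sibling("a") skips ahead to the next <a> tag instead
number_label = item.find("strong", string="License Number:")
number = number_label.find_next_sibling("a").text.strip()

print(status, number)
```

The rule of thumb: `next_sibling` gives you whatever comes next, even if it's just text or whitespace, while `find_next_sibling` keeps going until it finds a matching tag.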

In [None]:
import pandas as pd

df = pd.DataFrame(rows)
df.tail()
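
Because the `try`/`except` blocks skip any field that's missing, some of the dictionaries in `rows` may have fewer keys than others. That's fine: pandas lines the keys up into columns and fills the gaps with `NaN`. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical rows: the second one has no "number" key
example_rows = [
    {"name": "DOE, JANE", "number": "LM 123", "status": "Current"},
    {"name": "ROE, MARY", "status": "Cancelled"},
]

example_df = pd.DataFrame(example_rows)
print(example_df)
```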

## Saving the results

Now we'll save it to a CSV file! Easy peasy.

In [None]:
df.to_csv("output.csv", index=False)