Traditionally, Python programmers use [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) to scrape content from the internet. Instead of being *traditional*, we're going to use [Playwright](https://playwright.dev/python/), a **browser automation tool**! This means you actually control the browser! Filling out forms, clicking buttons, downloading documents... it's magic!!!✨✨✨

# Texas Tow Truck Details

- Dropdowns
- Filling out text fields
- Clicking
- Looping through licenses
- Saving entire HTML page for cleanup later

## Installation

We need to install a few tools first! Remove the `#` and run the cell to install the Python packages and browsers that we'll need for our scraping adventure.

In [None]:
# %pip install --quiet lxml html5lib beautifulsoup4 pandas
# %pip install --quiet playwright
# !playwright install

## Opening up the browser and visiting our destination


In [None]:
from playwright.async_api import async_playwright

# "Hey, open up a browser"
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)

# Create a new browser window
page = await browser.new_page()

In [None]:
await page.goto("https://www.tdlr.texas.gov/LicenseSearch/")

## Selecting an option from a dropdown

A good first try is `await page.locator("select").select_option("whatever option you want")`. If there are multiple dropdowns on the page you'll get an error, because Playwright doesn't know which one you want to use! Read the error message: it lists the elements that matched, so you can pick a more specific selector, like the dropdown's `id`.

In [None]:
# await page.locator("select").select_option("Tow Companies")
await page.locator("#SelectStatus").select_option("Tow Companies")
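If the error message is hard to read, another way to find the right dropdown is to list every `<select>` in the page's HTML along with its `id` and options. This is only a rough sketch using Python's built-in `html.parser`. The `SelectLister` class, the `list_selects` helper, and the sample HTML (including `other_dropdown` and the `Electricians` option) are made up for illustration. In the notebook you would pass in the result of `await page.content()` instead.

```python
from html.parser import HTMLParser


class SelectLister(HTMLParser):
    """Collect the id and option labels of every <select> in an HTML string."""

    def __init__(self):
        super().__init__()
        self.selects = []        # one dict per <select> found so far
        self._in_option = False  # True while we are inside an <option>

    def handle_starttag(self, tag, attrs):
        if tag == "select":
            self.selects.append({"id": dict(attrs).get("id"), "options": []})
        elif tag == "option" and self.selects:
            self._in_option = True
            self.selects[-1]["options"].append("")

    def handle_endtag(self, tag):
        if tag in ("option", "select"):
            self._in_option = False

    def handle_data(self, data):
        # Text inside an <option> is that option's label
        if self._in_option:
            self.selects[-1]["options"][-1] += data


def list_selects(html):
    """Return a list of {"id": ..., "options": [...]} dicts, one per dropdown."""
    parser = SelectLister()
    parser.feed(html)
    parser.close()
    for select in parser.selects:
        select["options"] = [label.strip() for label in select["options"]]
    return parser.selects


# Made-up sample HTML, NOT the real page.
# In the notebook you would use: html = await page.content()
sample_html = """
<select id="SelectStatus"><option>Tow Companies</option><option>Electricians</option></select>
<select id="other_dropdown"><option>A</option></select>
"""

for select in list_selects(sample_html):
    print(select["id"], select["options"])
```

Once you spot the dropdown that contains the option you want, its `id` goes after a `#` in the locator, as in `page.locator("#SelectStatus")`.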

In [None]:
await page.locator("#mcrbutton").click()

In [None]:
# await page.locator("input").fill("006534502C")
await page.locator("#mcrdata").fill("006534502C")

In [None]:
# await page.get_by_text("Search").click()
await page.get_by_role("button", name="Search").click()

## Try to grab the tables from the page... and fail!

Usually we can use `pd.read_html` to grab the tables from the page, but in this case it doesn't really work!

In [None]:
from io import StringIO
import pandas as pd

html = await page.content()
tables = pd.read_html(StringIO(html))
len(tables)

The page isn't organized well! As an alternative, we'll just save *all of the content from the page* and clean it up later in Excel or pandas.

## Checking many licenses and saving the HTML

Below we'll go through a series of licenses and save content about each one to our CSV file.

Normally we'd pull specific pieces of data out of the page and put them into separate columns. In this case, though, we're just going to save the full HTML of each page in a column called `page_contents` and pull out the pieces we need later.

I could have taken this list of licenses from the previous search or put together a manual list of license numbers.

In [None]:
licenses = [
    '006534502C',
    '006563949C',
    '006546993C',
    '006579452C',
    '006532111C',
]

In [None]:
# This is a list of page content
pages = []

# Go through each license
for license in licenses:
    print("Searching for", license)
    await page.goto("https://www.tdlr.texas.gov/LicenseSearch/")
    await page.locator("#SelectStatus").select_option("Tow Companies")

    # Click radio button and fill out field
    await page.locator("#mcrbutton").click()
    await page.locator("#mcrdata").fill(license)

    # Click search
    await page.get_by_role("button", name="Search").click()

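    try:
        # Wait up to 5 seconds for the details to show up
        await page.get_by_text("Certificate Information").wait_for(timeout=5000)

        html = await page.content()
        # Add the current page to the list
        pages.append(html)
    except Exception:
        # It didn't work! Let's just add nothing for this license.
        # (Catching Exception instead of using a bare except means you can
        # still interrupt the cell while the loop is running.)
        pages.append(None)
        print("No details")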

In [None]:
df = pd.DataFrame({
    'TDLR': licenses,
    'page_contents': pages
})
df.head()
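The "clean it up later" step is left open in this notebook. As one possible starting point, here is a minimal sketch that turns a saved page into plain text using Python's built-in `html.parser`. The `TextExtractor` class, the `page_text` helper, and the sample page (including the company name) are all made up for illustration, and the real pages will likely need more careful handling.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []       # pieces of visible text, in page order
        self._skip_depth = 0   # greater than 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def page_text(html):
    """Return the visible text of a page, or None if no page was saved."""
    if html is None:
        return None
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up stand-in for one saved page, NOT real search results
sample = (
    "<html><head><style>b {color: red}</style></head>"
    "<body><h2>Certificate Information</h2>"
    "<p>Name: <b>EXAMPLE TOWING</b></p></body></html>"
)
print(page_text(sample))
```

With the DataFrame above, something like `df["page_contents"].apply(page_text)` should give you one block of text per license, with `None` staying `None` for the licenses that had no details.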

## Saving the results

Now we'll save it to a CSV file! Easy peasy.

In [None]:
df.to_csv("output.csv", index=False)
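If you later read the file back without pandas, two things are worth knowing: whole HTML pages can be bigger than the default field size limit of Python's `csv` module, and licenses with no details come back as empty strings. The sketch below shows both. It uses a made-up, in-memory stand-in for `output.csv` (the HTML in it is invented), so treat it as an illustration and not as the exact contents of your file.

```python
import csv
import io

# Saved pages can be larger than the csv module's default field limit,
# so raise it before reading. A fixed large value is used here as an
# assumption about what is "big enough" for these pages.
csv.field_size_limit(10**9)

# Made-up stand-in for output.csv: one saved page and one failed lookup
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["TDLR", "page_contents"])
writer.writerow(
    ["006534502C", '<html>\n<body class="x">details, here</body>\n</html>']
)
writer.writerow(["006563949C", None])  # None is written as an empty field

# Read it back: quoted fields keep their newlines, commas, and quotes
buffer.seek(0)
rows = list(csv.DictReader(buffer))

for row in rows:
    status = "saved" if row["page_contents"] else "no details"
    print(row["TDLR"], status)
```

To read the real file, you would swap the `StringIO` buffer for `open("output.csv", newline="", encoding="utf-8")`.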