Traditionally, Python programmers use [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) to scrape content from the internet. Instead of being *traditional*, we're going to use [Playwright](https://playwright.dev/python/), a **browser automation tool**! This means you actually control the browser! Filling out forms, clicking buttons, downloading documents... it's magic!!!✨✨✨

# Texas Medical Board Actions Details

- Filling out text inputs
- Inspecting the page
- Combining dataframes
- Looping through licenses
- Downloading PDFs (using Firefox)

## Installation

We need to install a few tools first! Remove the `#` and run the cell to install the Python packages and browsers that we'll need for our scraping adventure.

In [1]:
# %pip install --quiet lxml html5lib beautifulsoup4 pandas
# %pip install --quiet playwright
# !playwright install

## Opening up the browser and visiting our destination

We've been using Chromium (basically Chrome) for most of our exercises, but in this case we're using Firefox! Chromium for some reason sometimes gets blocked, while Firefox doesn't. Not sure why!

In [43]:
from playwright.async_api import async_playwright

# "Hey, open up a browser"
playwright = await async_playwright().start()
browser = await playwright.firefox.launch(headless=False)

# Create a new browser window
page = await browser.new_page()

In [30]:
await page.goto("https://profile.tmb.state.tx.us/SearchBA.aspx?eb2d4a70-6591-4ad4-ae6d-a2727e84cb39")

<Response url='https://profile.tmb.state.tx.us/SearchBA.aspx?eb2d4a70-6591-4ad4-ae6d-a2727e84cb39' request=<Request url='https://profile.tmb.state.tx.us/SearchBA.aspx?eb2d4a70-6591-4ad4-ae6d-a2727e84cb39' method='GET'>>

## Filling in a single license and searching

Here we fill in a text field, click a couple of buttons, and wait for one more button to show up before clicking it. Nothing crazy!

In [31]:
# M6992
# Q5611
# M1444
await page.locator("#BodyContent_tbLicense").fill("M6992")

In [32]:
await page.get_by_role("button", name="Search").click()

In [33]:
await page.get_by_text("Document").click()

### Opening up the documents section

In [34]:
await page.get_by_role("button", name="ExpandWeb Documents").wait_for()

In [35]:
await page.get_by_role("button", name="ExpandWeb Documents").click()

I dug into the page's HTML: each document link has the class `doclink`, and there are 3 of them.

In [36]:
links = page.locator(".doclink")
await links.count()

3
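Before downloading anything, it can help to see what each link is called. This is a minimal sketch, not part of the original notebook: `link_texts` is a hypothetical helper name, and it assumes it is handed a Playwright locator such as the `links` variable above. It only uses the locator's `count()`, `nth()` and `inner_text()` methods.

```python
async def link_texts(links):
    """Return the visible text of every element matched by a locator."""
    texts = []
    for i in range(await links.count()):
        # nth(i) narrows the locator down to the i-th match (zero-based)
        text = await links.nth(i).inner_text()
        texts.append(text.strip())
    return texts
```

In a notebook cell you would call it as `await link_texts(links)`.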

### Downloading one of the PDFs

In Chrome/Chromium the PDF opens inside the page by default. We're using Firefox so that clicking the link actually triggers a download.

In [37]:
# Start waiting for the download
async with page.expect_download() as download_info:
    # Perform the action that initiates download
    # (nth() is zero-based, so nth(1) is the second link)
    await links.nth(1).click()
download = await download_info.value

print("Saving as", download.suggested_filename)

# Wait for the download process to complete and save the downloaded file somewhere
await download.save_as(download.suggested_filename)

Saving as 0000C54E.PDF
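Saving with just the suggested filename drops every PDF into the notebook's working directory. One possible tweak, sketched here under my own assumptions, is to build a folder per license first. `download_target` and the `downloads` folder name are hypothetical and not from the original notebook.

```python
from pathlib import Path


def download_target(folder, license_number, suggested_filename):
    """Build a per-license path such as downloads/M6992/0000C54E.PDF."""
    # Keep only the final path component of the suggested name
    safe_name = Path(suggested_filename).name
    target_dir = Path(folder) / license_number
    # Create the folder (and any parents) if it is not there yet
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / safe_name
```

The result can be passed straight to the save step, e.g. `await download.save_as(download_target("downloads", "M6992", download.suggested_filename))`.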


### Return to the previous page

In [38]:
#await page.get_by_text("Back").click()
await page.get_by_role("link", name="Back").click()

## Putting it all together

If we have a CSV with licenses in it, we can read in the CSV file and extract a list of licenses.

In [2]:
import pandas as pd

df = pd.read_csv("licenses.csv")
df.head()

Unnamed: 0,Name,License,Type,Address,City,Board Actions
0,"HENSEN, ERIC LOGAN",R0868,Physician License,112 MEDICAL DR,PALESTINE,
1,"CASTILLO, MARGUI",NONE,,,,
2,"HADEN, MARSHALL LYNN",BP10078006,Physician-in-Training Permit,6431 FANNIN ST,HOUSTON,
3,"CHAVEZ, JORGE",NONE,,,,


In [5]:
licenses = df['License'].unique().tolist()

# 'NONE' isn't a real license number, so we'll remove it.
licenses.remove('NONE')
licenses

['R0868', 'BP10078006']
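`list.remove` raises a `ValueError` if `'NONE'` is not in the list, and it does nothing about empty cells. A slightly more defensive version might look like the sketch below. `clean_licenses` is a hypothetical helper, and treating `"NONE"` and the empty string as the only placeholders is an assumption based on the sample rows above.

```python
import math


def clean_licenses(values, placeholders=("NONE", "")):
    """Drop placeholders, missing values and duplicates, keeping first-seen order."""
    cleaned = []
    seen = set()
    for value in values:
        # pandas represents empty cells as float NaN
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        text = str(value).strip()
        if text.upper() in placeholders or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

With the dataframe above this would be called as `licenses = clean_licenses(df['License'].tolist())`.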

That CSV only gives us a couple of licenses, so instead we'll use a list of licenses that I put together manually.

In [44]:
licenses = ['M6992', 'Q5611', 'M1444']

for license in licenses:
    print("Searching for", license)
    await page.goto("https://profile.tmb.state.tx.us/SearchBA.aspx?eb2d4a70-6591-4ad4-ae6d-a2727e84cb39")
    
    # Fill out license
    await page.locator("#BodyContent_tbLicense").fill(license)
    
    # Click search button
    await page.get_by_role("button", name="Search").click()
    
    # Move into Documents section
    await page.get_by_text("Document").click()
    
    # Expand documents tab
    await page.get_by_role("button", name="ExpandWeb Documents").wait_for()
    await page.get_by_role("button", name="ExpandWeb Documents").click()

    # Find every document link and count them
    links = page.locator(".doclink")
    count = await links.count()

    # Download all documents
    for i in range(count):
        # Wait for download
        async with page.expect_download() as download_info:
            await links.nth(i).click()
        download = await download_info.value
        
        print("Saving as", download.suggested_filename)
        
        # Wait for the download process to complete and save the downloaded file somewhere
        await download.save_as(download.suggested_filename)

Searching for M6992


Exception: Locator.click: Connection closed while reading from the driver
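A loop like the one above stops at the first exception, so the remaining licenses never get tried. One way to keep going is to wrap each license in a `try`/`except` and note what failed. This is only a sketch: `run_each` is a hypothetical helper, and it does not fix the underlying problem. If the browser itself has gone away, every later license will presumably fail the same way until the browser is launched again.

```python
async def run_each(items, worker):
    """Await worker(item) for every item, recording failures instead of stopping."""
    succeeded = []
    failed = {}
    for item in items:
        try:
            await worker(item)
            succeeded.append(item)
        except Exception as error:
            # Remember what went wrong and move on to the next item
            failed[item] = repr(error)
    return succeeded, failed
```

Here `worker` would be an `async def` function of your own that takes one license number and runs the body of the loop (search, expand, download). In a notebook cell: `succeeded, failed = await run_each(licenses, worker)`.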