Traditionally Python programmers use [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) to scrape content from the interent. Instead of being *traditional*, we're going to use [Playwright](https://playwright.dev/python/), a **browser automation tool**! This means you actually control the browser! Filling out forms, clicking buttons, downloading documents... it's magic!!!✨✨✨

# Texas Medical Board Actions

- Filling out text inputs
- Inspecting the page
- Looping through date ranges
- Combining dataframes

## Installation

We need to install a few tools first! Remove the `#` and run the cell to install the Python packages and browsers that we'll need for our scraping adventure.

In [1]:
# %pip install --quiet lxml html5lib beautifulsoup4 pandas
# %pip install --quiet playwright
# !playwright install

## Opening up the browser and visiting our destination


In [2]:
from playwright.async_api import async_playwright

# "Hey, open up a browser"
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)

# Create a new browser window
page = await browser.new_page()

In [3]:
await page.goto("https://profile.tmb.state.tx.us/SearchBA.aspx?eb2d4a70-6591-4ad4-ae6d-a2727e84cb39")

<Response url='https://profile.tmb.state.tx.us/SearchBA.aspx?eb2d4a70-6591-4ad4-ae6d-a2727e84cb39' request=<Request url='https://profile.tmb.state.tx.us/SearchBA.aspx?eb2d4a70-6591-4ad4-ae6d-a2727e84cb39' method='GET'>>

## Filling in a text field

You always start with `await page.locator("input").fill("whatever you want")`. You'll probably get an error because there are multiple inputs on the page, but Playwright doesn't know which one you want to use! Just read the error and figure out the right one.

In [4]:
#await page.locator("input").fill("3/11/2021")
await page.locator("#BodyContent_tbBADate").fill("3/11/2021")
await page.locator("#BodyContent_tbBADateRangeEnd").fill("3/1/2024")

## Click the search button and wait for the results to show up

In [5]:
await page.get_by_role("button", name="Search").click()

In [9]:
await page.get_by_text("Board Actions").wait_for()

## Grab the tables from the page

[Pandas](https://pandas.pydata.org/) is the Python equivalent to Excel, and it's great at dealing with tabular data! Often the data on a web page that looks like a spreadsheet can be read with `pd.read_html`.

You use `await page.content()` to save the contents of the page, then feed it to `read_html` to find the tables. `len(tables)` checks the number of tables you have, then you manually poke around to see which one is the one you're interested in. `tables[0]` is the first one, `tables[1]` is the second one, and so on...

In [10]:
from io import StringIO
import pandas as pd

html = await page.content()
tables = pd.read_html(StringIO(html))
len(tables)

2

In [11]:
tables[1]

Unnamed: 0,Name,License,Type,Address,City,Board Actions
0,"ABBASI, MAAZ AHMED",M6992,Physician License,7107 SENTINEL FLS,MISSOURI CITY,
1,"ABEBE, EYOEL",Q5611,Physician License,1717 MAIN ST,DALLAS,
2,"ABOUGHALI, WAEL ATA",M1444,Physician License,9330 BROADWAY ST SUITE B306,PEARLAND,
3,"ABRON, STEPHANIE CHRISTINA",N7983,Physician License,6220 WESTPARK DRIVE,HOUSTON,
4,"ACHARYA, SUJEET S",N8748,Physician License,1411 N. BECKLEY AVE.,DALLAS,
5,"ACOSTA, DIANA MORIN",NONE,,,,
6,"ACREE, JOSHUA SCOTT",R4654,Physician License,3136 HORIZON RD,ROCKWALL,
7,"ACUNA, JUAN",NONE,,,,
8,"ADAMS, BENJAMIN DONALD",N9993,Physician License,525 N. GARFIELD AVE,MONTEREY PARK,
9,"ADAMS, JEROME MARK",C5941,Physician License,4820 ROYAL OAK ST,WICHITA FALLS,


In [12]:
#await page.get_by_text("Back").click()
await page.get_by_role("link", name="Back").click()

## Cycling through dates

We'll start from one date, and work forward to another date. This is just to see what it looks like.

In [13]:
import datetime

start_date = datetime.date(2023, 4, 1)
end_date = datetime.date(2023, 4, 10)

# Loop over each day from the start date to the end date
current_date = start_date
while current_date <= end_date:
    current_date = current_date + datetime.timedelta(days=1)
    date_str = current_date.strftime("%m/%d/%Y")
    print("Searching", date_str)

Searching 04/02/2023
Searching 04/03/2023
Searching 04/04/2023
Searching 04/05/2023
Searching 04/06/2023
Searching 04/07/2023
Searching 04/08/2023
Searching 04/09/2023
Searching 04/10/2023
Searching 04/11/2023


## Putting it all together

Now we'll visit the page, go through each date, saving the results to our `all_data` overall list of data.

In [14]:
await page.goto("https://profile.tmb.state.tx.us/SearchBA.aspx?eb2d4a70-6591-4ad4-ae6d-a2727e84cb39")

<Response url='https://profile.tmb.state.tx.us/SearchBA.aspx?eb2d4a70-6591-4ad4-ae6d-a2727e84cb39' request=<Request url='https://profile.tmb.state.tx.us/SearchBA.aspx?eb2d4a70-6591-4ad4-ae6d-a2727e84cb39' method='GET'>>

In [15]:
import datetime
import pandas as pd
from io import StringIO

all_data = pd.DataFrame()

start_date = datetime.date(2023, 4, 1)
end_date = datetime.date(2023, 5, 1)

# Loop over each day from the start date to the end date
current_date = start_date
while current_date <= end_date:
    current_date = current_date + datetime.timedelta(days=1)
    date_str = current_date.strftime("%m/%d/%Y")
    
    print("Searching", date_str)
    
    await page.locator("#BodyContent_tbBADate").fill(date_str)
    await page.locator("#BodyContent_tbBADateRangeEnd").fill(date_str)

    await page.get_by_role("button", name="Search").click()

    try:
        await page.get_by_text("Board Actions").wait_for()
        html = await page.content()
        tables = pd.read_html(StringIO(html))
        df = tables[1]
    
        all_data = pd.concat([all_data, df], ignore_index = True)
    except:
        print("Failed")

    await page.get_by_role("link", name="Back").click()

Searching 04/02/2023
Failed
Searching 04/03/2023
Searching 04/04/2023
Searching 04/05/2023
Searching 04/06/2023


In [16]:
all_data

Unnamed: 0,Name,License,Type,Address,City,Board Actions
0,"HENSEN, ERIC LOGAN",R0868,Physician License,112 MEDICAL DR,PALESTINE,
1,"CASTILLO, MARGUI",NONE,,,,
2,"HADEN, MARSHALL LYNN",BP10078006,Physician-in-Training Permit,6431 FANNIN ST,HOUSTON,
3,"CHAVEZ, JORGE",NONE,,,,


## Saving the results

Now we'll save it to a CSV file! Easy peasy.

In [17]:
all_data.to_csv("licenses.csv", index=False)