In this example we are going to scrape the [New Jersey Division of Consumer Affairs license search site](https://newjersey.mylicense.com/verification/Search.aspx) for perfusionists, who are not *perfumerists*. A perfusionist is, sadly, instead "a specialized healthcare professional who operates equipment to support a patient's circulatory or respiratory function during surgery."

Traditionally Python programmers use [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) to scrape content from the internet. Instead of being *traditional*, we're going to use [Playwright](https://playwright.dev/python/), a **browser automation tool**! This means you actually control the browser! Filling out forms, clicking buttons, downloading documents... it's magic!!!✨✨✨

# New Jersey Perfusionists

- Dropdowns
- Clicking
- Combining dataframes
- Looping through page numbers

## Installation

We need to install a few tools first! Remove the `#` and run the cell to install the Python packages and browsers that we'll need for our scraping adventure.

In [1]:
# %pip install --quiet lxml html5lib beautifulsoup4 pandas
# %pip install --quiet playwright
# !playwright install

## Opening up the browser and visiting our destination


In [2]:
from playwright.async_api import async_playwright

# "Hey, open up a browser"
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)

# Create a new browser window
page = await browser.new_page()

In [3]:
await page.goto("https://newjersey.mylicense.com/verification/Search.aspx")

<Response url='https://newjersey.mylicense.com/verification/Search.aspx' request=<Request url='https://newjersey.mylicense.com/verification/Search.aspx' method='GET'>>

## Selecting an option from a dropdown

You always start with `await page.locator("select").select_option("whatever option you want")`. If there are multiple dropdowns on the page you'll probably get an error, because Playwright doesn't know which one you want to use! Just read the error, which lists the elements that matched, and figure out the right one. Here it's the dropdown with the id `t_web_lookup__profession_name`.
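If the error is hard to read, another option is to list the dropdowns yourself. This is only a minimal sketch, assuming `page` is the Playwright page opened earlier. `list_dropdown_ids` is a made-up helper name, not something Playwright provides.

```python
async def list_dropdown_ids(page):
    """Return the id attribute of every <select> element on the page."""
    dropdowns = page.locator("select")
    ids = []
    for i in range(await dropdowns.count()):
        # get_attribute returns None when a dropdown has no id
        ids.append(await dropdowns.nth(i).get_attribute("id"))
    return ids

# In a notebook cell, with the search page already open:
# await list_dropdown_ids(page)
```

Any id it returns can be used as `page.locator("#that_id")`.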

In [4]:
# await page.locator("select").select_option("Acupuncture")
await page.locator("#t_web_lookup__profession_name").select_option("Perfusionist")

['Perfusionist']

In [5]:
# await page.get_by_text("Search").click()
await page.get_by_role("button", name="Search").click()

## Grab the tables from the page

[Pandas](https://pandas.pydata.org/) is the Python equivalent to Excel, and it's great at dealing with tabular data! Often the data on a web page that looks like a spreadsheet can be read with `pd.read_html`.

You use `await page.content()` to save the contents of the page, then feed it to `read_html` to find the tables. `len(tables)` checks the number of tables you have, then you manually poke around to see which one is the one you're interested in. `tables[0]` is the first one, `tables[1]` is the second one, and so on...
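To see the idea without a browser, here is a small sketch that runs `read_html` on a hand-written HTML string. The HTML and the names in it are invented for illustration, and it assumes `lxml` (or `beautifulsoup4` plus `html5lib`) is installed as in the installation step.

```python
from io import StringIO

import pandas as pd

# Made-up page with two tables: a layout table and a results table
sample_html = """
<table>
  <thead><tr><th>Menu</th></tr></thead>
  <tbody><tr><td>Home</td></tr></tbody>
</table>
<table>
  <thead><tr><th>Full Name</th><th>City</th></tr></thead>
  <tbody>
    <tr><td>PERSON ONE</td><td>Trenton</td></tr>
    <tr><td>PERSON TWO</td><td>Newark</td></tr>
  </tbody>
</table>
"""

sample_tables = pd.read_html(StringIO(sample_html))
print(len(sample_tables))   # how many tables were found

# The second table is the one that looks like search results
sample_results = sample_tables[1]
print(sample_results)
```

The real page works the same way, just with many more layout tables to poke through.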

In [5]:
import pandas as pd
from io import StringIO

html = await page.content()
tables = pd.read_html(StringIO(html))
len(tables)


44

In this case we like the fourth table, `tables[3]`.

In [6]:
tables[3]

Unnamed: 0,Full Name,License Number,Profession,License Type,License Status,City,State
0,A CHO,,Acupuncture,Acupuncturist,Pending,Union City,NJ
1,A CHO,,,,,,
2,,,,,,,
3,,,,,,,
4,AARON PARK,25MZ00081700,Acupuncture,Acupuncturist,Active,Toms River,NJ
...,...,...,...,...,...,...,...
156,ALISA CLARK,,Acupuncture,Acupuncturist,Pending,Sarasota,FL
157,ALISA CLARK,,,,,,
158,,,,,,,
159,,,,,,,
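The table comes through with filler rows: each name is repeated on a mostly empty row, followed by two blank rows. One possible cleanup is sketched below on a few rows copied from the output above. It assumes every real result has a `Profession` value, which holds for the rows shown but is not guaranteed for the whole site.

```python
import pandas as pd

# A few rows shaped like the output above (None marks an empty cell)
columns = ["Full Name", "License Number", "Profession", "License Type",
           "License Status", "City", "State"]
rows = [
    ["A CHO", None, "Acupuncture", "Acupuncturist", "Pending", "Union City", "NJ"],
    ["A CHO", None, None, None, None, None, None],
    [None, None, None, None, None, None, None],
    [None, None, None, None, None, None, None],
    ["AARON PARK", "25MZ00081700", "Acupuncture", "Acupuncturist", "Active", "Toms River", "NJ"],
]
raw = pd.DataFrame(rows, columns=columns)

# Assumption: a real result row always has a Profession.
# License Number is not a safe filter because pending licenses lack one.
cleaned = raw.dropna(subset=["Profession"]).reset_index(drop=True)
print(cleaned)
```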


## Clicking actual page numbers and saving as you go along

Sometimes you click "next" buttons, but sometimes it's easier (or necessary) to click the page numbers in the results. In this case, we go from page 1 to page 4 and scrape the contents of each page. (Yes, `range(1,5)` stops at 4.)
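If the `range` behavior feels surprising, a quick check shows which numbers the loop will visit. The variable names here are just for illustration.

```python
page_numbers = list(range(1, 5))
print(page_numbers)

# The page links are matched by their text, so each number becomes a string
link_labels = [str(i) for i in page_numbers]
print(link_labels)
```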

In [12]:
import pandas as pd
from io import StringIO

all_data = pd.DataFrame()

# Try it for several pages
for i in range(1,5):
    print("Clicking page", i)
    await page.get_by_text(str(i), exact=True).click()
    
    # Get all of the tables on the page
    html = await page.content()
    tables = pd.read_html(StringIO(html))
    df = tables[3]

    # Add the table from this page to the results collected so far
    all_data = pd.concat([all_data, df], ignore_index=True)

Clicking page 1
Clicking page 2
Clicking page 3
Clicking page 4
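A common alternative to growing a dataframe inside the loop is to collect each page's table in a list and call `pd.concat` once at the end, which avoids copying the accumulated rows on every pass. The sketch below uses invented stand-in data instead of the real page, and the `page` column is a hypothetical extra added only to show where each row came from.

```python
import pandas as pd

frames = []
for page_number in range(1, 5):
    # Stand-in for tables[3]; a real run would read the page's HTML here
    df = pd.DataFrame({
        "Full Name": [f"PERSON {page_number}A", f"PERSON {page_number}B"],
        "page": [page_number, page_number],
    })
    frames.append(df)

# One concat at the end; ignore_index renumbers the rows from 0
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```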


In [13]:
all_data

Unnamed: 0,Full Name,License Number,Profession,License Type,License Status,City,State
0,ABIGAIL E STROUD,25MI00030500,Perfusionist,Perfusionist,Active,Cherry Hill,NJ
1,ABIGAIL E STROUD,,,,,,
2,,,,,,,
3,,,,,,,
4,ADAM C YOUNG,25MI00039200,Perfusionist,Perfusionist,Active,Philadelphia,PA
...,...,...,...,...,...,...,...
800,FREDERICK PAUL WEBER,,Perfusionist,Perfusionist,Pending,Philadelphia,PA
801,FREDERICK PAUL WEBER,,,,,,
802,,,,,,,
803,,,,,,,


## Saving the results

Now we'll save it to a CSV file! Easy peasy.

In [14]:
all_data.to_csv("output.csv", index=False)