Traditionally, Python programmers use [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) to scrape content from the internet. Instead of being *traditional*, we're going to use [Playwright](https://playwright.dev/python/), a **browser automation tool**! This means you actually control the browser! Filling out forms, clicking buttons, downloading documents... it's magic!!!✨✨✨

# Chicago Building Records

- Clicking
- Filling out forms with `type` instead of `fill`
- Extracting a single table
- Looping through addresses
- CSS selectors?
- Clicking links
- Dataframe manipulation
- Combining dataframes

## Installation

We need to install a few tools first! Remove the `#` and run the cell to install the Python packages and browsers that we'll need for our scraping adventure.

In [None]:
# %pip install --quiet lxml html5lib beautifulsoup4 pandas
# %pip install --quiet playwright
# !playwright install

## Opening up the browser and visiting our destination


In [None]:
from playwright.async_api import async_playwright

# "Hey, open up a browser"
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
context = await browser.new_context()

# Create a new browser window
page = await context.new_page()

In [None]:
await page.goto("https://webapps1.chicago.gov/buildingrecords/home")

## Clicking a button

In [None]:
await page.locator("#rbnAgreement1").click()
await page.get_by_text("Submit").click()

## Filling in a field that *demands* keyboard input

You usually use `.fill` to write in a text box. But some forms want to know someone typed in it! In this case, you'll use `.type` instead.

In [None]:
# await page.locator("input").fill("400 e 41st st")
# await page.get_by_label("Building Address").fill("400 e 41st st")
await page.get_by_label("Building Address").type("400 e 41st st")
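
One caveat worth knowing: in recent Playwright releases `.type` is marked as deprecated, and `press_sequentially` is the suggested replacement for typing one key at a time. Here's a minimal sketch of a helper that uses whichever one is available. The name `type_like_a_person` and the `delay_ms` parameter are made up for this example, they are not part of Playwright.

```python
async def type_like_a_person(locator, text, delay_ms=50):
    """Send text to a Playwright locator one key press at a time."""
    if hasattr(locator, "press_sequentially"):
        # Newer Playwright: the replacement for the deprecated .type
        await locator.press_sequentially(text, delay=delay_ms)
    else:
        # Older Playwright: fall back to .type, as used in the cell above
        await locator.type(text, delay=delay_ms)
```

You would call it *instead of* the `.type` line, not in addition to it: `await type_like_a_person(page.get_by_label("Building Address"), "400 e 41st st")`.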

In [None]:
await page.get_by_text("Submit").click()

## Grab the data from the page

[Pandas](https://pandas.pydata.org/) is the Python equivalent to Excel, and it's great at dealing with tabular data! Often the data on a web page that looks like a spreadsheet can be read with `pd.read_html`.

In this case, it can! *But* you first needed to use Playwright to open the page, agree to the terms, and execute the search. Once the results are on screen, `page.content()` hands the page's current HTML over to pandas.

In [None]:
import pandas as pd
from io import StringIO

html = await page.content()
tables = pd.read_html(StringIO(html))
len(tables)

In [None]:
df = tables[2]
df.head()

In [None]:
df.to_csv("output.csv", index=False)

## Getting details of each of the inspections

How many links are on the page? I found `"#resultstable_inspections a"` by knowing how CSS selectors work. It means "links inside of an element with an id of `resultstable_inspections`."

In [None]:
links = page.locator("#resultstable_inspections a")
count = await links.count()
count
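
If CSS selectors are new to you, here's a tiny sketch of what that selector does and does not match. It runs on made-up HTML (the `resultstable_permits` table and all of the links are invented for the example) and uses Python's built-in `xml.etree.ElementTree`, which understands a small subset of XPath rather than CSS. The XPath below is a rough translation of `#resultstable_inspections a`, so treat it as an illustration and not as how Playwright works internally.

```python
import xml.etree.ElementTree as ET

# Made-up page: one stray link, one other table, and the table we care about
SAMPLE_HTML = """
<html><body>
  <a href="/home">Home</a>
  <table id="resultstable_permits">
    <tr><td><a href="/permit/1">P-1</a></td></tr>
  </table>
  <table id="resultstable_inspections">
    <tr><td><a href="/inspection/101">101</a></td></tr>
    <tr><td><a href="/inspection/102">102</a></td></tr>
  </table>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE_HTML)

# "any element with id resultstable_inspections, then any <a> inside it"
matches = root.findall(".//*[@id='resultstable_inspections']//a")
link_texts = [a.text for a in matches]
print(link_texts)
```

Only the two links inside the inspections table come back. The "Home" link and the permit link are skipped because they don't live inside the element with that id.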

### Clicking a single link for details

When we click the link it opens up a new page. Below we click one of the links and wait until the "Print" text shows up on the page. We can talk about the new page with `new_page`, while `page` is still the original page.

In [None]:
async with context.expect_page() as new_page_info:
    await links.nth(1).click()

new_page = await new_page_info.value
await new_page.get_by_text("Print").wait_for()

Now we'll pull the content from the new page just like we normally do.

In [None]:
html = await new_page.content()
tables = pd.read_html(StringIO(html))
len(tables)

It's kind of a weird table, so we need to clean it up a bit. I know I (probably) promised that all of this would be 100% cut-and-paste reusable but SADLY this time it is not.

In [None]:
df = tables[0]
df.columns = df.columns.droplevel()
df
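
What is `droplevel` doing there? When a table has stacked header rows, pandas stores the column names as a two-level `MultiIndex`, and `droplevel()` throws away the top level. Here's a small sketch with an invented table (the column names and the single row are placeholders, not real inspection data) so you can see the before and after.

```python
import pandas as pd

# Invented stand-in for a table with two header rows
columns = pd.MultiIndex.from_tuples([
    ("Violations", "Code"),
    ("Violations", "Description"),
    ("Violations", "Status"),
])
demo = pd.DataFrame([["X1", "example description", "OPEN"]], columns=columns)

levels_before = demo.columns.nlevels

# droplevel() with no arguments removes the first (top) level
demo.columns = demo.columns.droplevel()

print(levels_before, "->", demo.columns.nlevels)
print(list(demo.columns))
```

If a table only has one header row, `droplevel()` raises an error, because there has to be at least one level left.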

## Putting it all together

Let's experiment with looping through the different links, and then make it work.

In [None]:
links = page.locator("#resultstable_inspections a")
count = await links.count()
count

In [None]:
for i in range(3):
    link = links.nth(i)
    inspection_num = await link.inner_text()
    print("Inspection number", inspection_num)

In [None]:
for i in range(3):
    link = links.nth(i)
    inspection_num = await link.inner_text()
    print("Inspection number", inspection_num)

    async with context.expect_page() as new_page_info:
        await link.click()

    new_page = await new_page_info.value
    await new_page.get_by_text("Print").wait_for()

    await new_page.close()

Okay let's go!!!

In [None]:
all_data = pd.DataFrame()

# for i in range(count):
for i in range(5):
    # Get the link and the link details
    link = links.nth(i)
    inspection_num = await link.inner_text()

    # Open the page
    async with context.expect_page() as new_page_info:
        await link.click()

    # Access the page
    new_page = await new_page_info.value
    await new_page.get_by_text("Print").wait_for()

    # Grab the table
    try:
        print("Saving violations for", inspection_num)
        html = await new_page.content()
        tables = pd.read_html(StringIO(html), header=None)
        df = tables[0]
        df.columns = df.columns.droplevel()
        df['inspection_num'] = inspection_num
        all_data = pd.concat([all_data, df], ignore_index=True)
    except ValueError:
        # pd.read_html raises ValueError when the page has no tables.
        # Anything else (a missing parser, a typo) should still show up as an error.
        print("No violations for", inspection_num)

    # Close the page
    await new_page.close()
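
A side note on the loop above: calling `pd.concat` on every pass works fine for a handful of inspections, but a common alternative is to collect the dataframes in a list and combine them once at the end. This is only a sketch of that pattern. The `fake_results` data is invented so the example runs without a browser.

```python
import pandas as pd

# Invented stand-in for "one table per inspection page"
fake_results = [
    ("100", [{"Code": "X1", "Status": "OPEN"}, {"Code": "X2", "Status": "CLOSED"}]),
    ("200", [{"Code": "X3", "Status": "OPEN"}]),
]

frames = []
for inspection_num, rows in fake_results:
    df = pd.DataFrame(rows)
    df["inspection_num"] = inspection_num
    frames.append(df)

# pd.concat complains about an empty list, so fall back to an empty dataframe
combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
print(combined)
```

The result is the same shape as `all_data`: one row per violation, with an `inspection_num` column saying which inspection it came from.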

In [None]:
all_data

## Saving the results

Now we'll save it to a CSV file! Easy peasy.

In [None]:
all_data.to_csv("output.csv", index=False)