In this example we are going to scrape the [OpenSyllabus works page](https://analytics.opensyllabus.org/record/works) for books included on college syllabi.

Traditionally, Python programmers use [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) to scrape content from the internet. Instead of being *traditional*, we're going to use [Playwright](https://playwright.dev/python/), a **browser automation tool**! This means you actually control the browser: filling out forms, clicking buttons, downloading documents... it's magic!!!✨✨✨

# OpenSyllabus works list

## What we'll learn/use

- Selectors
- 'Show more' button pagination
- Creating and saving a dataframe

## Installation

We need to install a few tools first! Remove the `#` and run the cell to install the Python packages and browsers that we'll need for our scraping adventure.

In [None]:
# %pip install --quiet lxml html5lib beautifulsoup4 pandas
# %pip install --quiet playwright
# !playwright install chromium firefox

## Requests + BS4 = doesn't work

First, let's see how we can tell that this site does *not* work with the "normal" approach to scraping, with requests and BeautifulSoup.

In [None]:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://analytics.opensyllabus.org/record/works")
# Name the parser explicitly so BeautifulSoup doesn't warn about guessing one
doc = BeautifulSoup(response.text, "html.parser")

No errors yet, but let's try to find some book titles... 

In [None]:
doc.find_all(class_='sc-9d100f21-9 bNgiIK')

No luck! Maybe if we look at the actual text of the response itself?

In [None]:
response.text

Looks like **the content isn't even on the page**. Now we can move on to *Playwright!*
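If squinting at a wall of HTML isn't your thing, here is a rough sketch of another way to check: strip out the scripts and styles and see whether any readable text is left. The `visible_text` helper and the `shell_html` sample are made up for illustration; the real response from OpenSyllabus will look different.

```python
from bs4 import BeautifulSoup


def visible_text(html: str) -> str:
    """Return the text a reader would see, ignoring scripts and styles."""
    soup = BeautifulSoup(html, "html.parser")
    # Scripts and styles never show up as readable content, so drop them
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return soup.get_text(" ", strip=True)


# Made-up example of a JavaScript-rendered page:
# an empty placeholder element plus a script that fills it in later
shell_html = """
<html><body>
  <div id="root"></div>
  <script>window.__DATA__ = {"title": "Example Book"};</script>
</body></html>
"""

print(repr(visible_text(shell_html)))
```

With the real page you could try `visible_text(response.text)`. If very little comes back, the data is probably being loaded by JavaScript after the page opens.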

## Opening up the browser and visiting our destination

In [None]:
from playwright.async_api import async_playwright

# "Hey, open up a browser"
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)

# Create a new browser window
page = await browser.new_page()

In [None]:
await page.goto("https://analytics.opensyllabus.org/record/works")

## Clicking load more buttons

The page that we're looking at has plenty of data, but there's more in hiding – if you scroll to the bottom of the page you'll see a "Show more" button.

Let's grab it using `page.get_by_text` and click it.

In [None]:
await page.get_by_text("Show more").click()

Again! And again! And again! Let's click it **five more times**.

We'll wait two seconds between each click so it doesn't accidentally double-click the button.

In [None]:
import asyncio

for i in range(5):
    await page.get_by_text("Show more").click(timeout=5000)
    # asyncio.sleep pauses without blocking the notebook's event loop
    await asyncio.sleep(2)

If you replaced `for i in range(5)` with `while True`, it would keep clicking until the "Show more" button runs out. Once there's no more button on the page, Playwright waits for it to appear (30 seconds by default, or 5 seconds with the `timeout=5000` used above) and then raises a `TimeoutError`.
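If you'd rather stop cleanly than wait for an error, one possible approach is to check whether the button still exists before each click. This is only a sketch: `click_until_gone` is a hypothetical helper, and `max_clicks` and `pause_ms` are made-up parameters you'd tune for the site. It relies on Playwright's `locator.count()`, which returns right away instead of waiting.

```python
async def click_until_gone(page, text="Show more", max_clicks=50, pause_ms=1000):
    """Click the element matching `text` until it disappears or max_clicks is hit."""
    clicks = 0
    while clicks < max_clicks:
        button = page.get_by_text(text)
        # count() answers immediately, so a missing button ends the loop
        # instead of waiting for a click timeout
        if await button.count() == 0:
            break
        await button.click()
        clicks += 1
        # Give the new rows time to load before looking for the button again
        await page.wait_for_timeout(pause_ms)
    return clicks


# In the notebook you would call it like this:
# total_clicks = await click_until_gone(page, max_clicks=10)
```

The `max_clicks` cap is there so a page with thousands of results doesn't keep you clicking all afternoon. If the button briefly vanishes while new rows load, a short `pause_ms` could end the loop early, so err on the generous side.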

## Grab the content from the page

Now we need to get the *content* from the page. Everyone loves using BeautifulSoup to scrape, so why don't we just do that? You use `await page.content()` to save the contents of the page and feed it directly to BeautifulSoup.

In [None]:
from bs4 import BeautifulSoup

html = await page.content()
# Name the parser explicitly so BeautifulSoup doesn't warn about guessing one
doc = BeautifulSoup(html, "html.parser")

Once you push the HTML from the Playwright page into BeautifulSoup, you can use all the same selectors and filters that you can in your "normal" scraping world.

In [None]:
titles = doc.find_all(class_='sc-9d100f21-9 bNgiIK')
for title in titles[:10]:
    print(title.text)

## Developing your selectors

Let's be honest: **writing custom scraping code isn't anyone's favorite thing to do.**

To put together your selectors to grab the "right" data, I suggest using [my ChatGPT prompt](https://chatgpt.com/share/683013d1-b394-800d-bbce-64a00778b0de) to help. You can see the [original prompt here](https://gist.github.com/jsoma/d46ba769764866331a83d702a3054751) if you'd like to use it with another AI tool.

You want to right-click the data you're interested in, then select **Inspect**. That provides two approaches to finding your region of interest: either browsing around on the right-hand side...

<img src="finding-row-1.gif" style="max-width: 600px">

...or using the element selector and clicking on the left-hand side.

<img src="finding-row-2.gif" style="max-width: 600px">

[Pandas](https://pandas.pydata.org/) is the Python equivalent to Excel, and it's great at dealing with tabular data! If you can build a list of dictionaries, one per row, pandas can turn it into a table that's easy to save.

In [None]:
import pandas as pd

rows = []
for tr in doc.select("tr.duwfJU"):
    row = {}

    # select_one returns None when nothing matches, so calling .text on it
    # raises AttributeError. We skip that field and keep the rest of the row.
    try:
        row["rank"] = tr.select_one("p.hOuBOS").text.strip()
    except AttributeError:
        pass

    try:
        row["title"] = tr.select_one("a[href^='/singleton/works'] p").text.strip()
    except AttributeError:
        pass

    try:
        row["author"] = tr.select_one("a[href^='/singleton/authors'] p").text.strip()
    except AttributeError:
        pass

    try:
        row["score"] = tr.select_one("div[name='score-star'] + p").text.strip()
    except AttributeError:
        pass

    try:
        row["appearances"] = tr.select_one("div.elzFcv").text.strip()
    except AttributeError:
        pass

    rows.append(row)

df = pd.DataFrame(rows)
df.head()
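All those `try`/`except` blocks get repetitive, so here is one possible way to tidy them up. It's a sketch, not the only way to do it: `text_or_none` is a hypothetical helper, and `sample_html` is a made-up row that imitates the structure the selectors above expect. It also sticks to the `href`-based selectors, since class names like `duwfJU` look auto-generated and may change when the site is updated.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    found = parent.select_one(selector)
    return found.get_text(strip=True) if found is not None else None


# Made-up row for illustration; the real page has more columns
sample_html = """
<table><tr>
  <td><a href="/singleton/works/123"><p> Example Title </p></a></td>
  <td><a href="/singleton/authors/456"><p>Example Author</p></a></td>
</tr></table>
"""

tr = BeautifulSoup(sample_html, "html.parser").select_one("tr")
row = {
    "title": text_or_none(tr, "a[href^='/singleton/works'] p"),
    "author": text_or_none(tr, "a[href^='/singleton/authors'] p"),
    # Not present in the sample row, so this one comes back as None
    "score": text_or_none(tr, "div[name='score-star'] + p"),
}
print(row)
```

One difference from the loop above: a missing field ends up as `None` in the dictionary instead of being left out. Either way, pandas shows it as an empty cell.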

## Saving the file

Now we can save our pandas dataframe to a CSV to open up in Excel or wherever else!

In [None]:
df.to_csv("books.csv", index=False)