These are the notes for the [Advanced scraping with Playwright](https://www.eventbrite.com/e/advanced-web-scraping-with-playwright-lede-2024-info-session-tickets-810653144377) session, which was a sample class and info session hosted by [Professor Jonathan Soma](https://jonathansoma.com/) for the [Lede Program](https://ledeprogram.com/), a summer data journalism intensive at Columbia Journalism School.

## Requests and BeautifulSoup intro

The traditional entry point for learning to scrape in Python is by using [requests](https://pypi.org/project/requests/) and [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/). It's usually great!

In the case below, we're using it to scrape headlines from [Le Monde's English website](https://www.lemonde.fr/en/).

In [4]:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.lemonde.fr/en/")
doc = BeautifulSoup(response.text)

Sometimes you'll get lucky and be able to scrape by just specifying a tag name...

In [5]:
headlines = doc.find_all('h3')
for headline in headlines:
    print(headline.text)

Gazprombank executives quietly sold their French villas after the Ukraine invasion
Our die-hard obsession with virginity
In Egypt, Gazans are 'caught between war and a country that doesn't want' them
Two killed in Pakistan election protest as Khan allies lead in vote count
Jacques Doillon, accused of rape by several actors, denounces the 'lies'
France's new Education Minister Nicole Belloubet appointed in the midst of a school crisis
French prosecutors seek trial for Lafarge cement group over terror financing
Iran's Ayatollah Ali Khamenei removed from Instagram and Facebook
Prince Harry settles phone hacking lawsuit against UK tabloid
Vladimir Putin tells Tucker Carlson that Russia cannot be defeated in Ukraine
European countries are turning to 'selective immigration' to mend labor shortages
Three actresses accuse director Jacques Doillon of rape, sexual assault and harassment
Sections
French Focus
Opinion
Informations légales le Monde


...but more often than not a class is going to be more effective.

In [3]:
headlines = doc.find_all(class_='article__title')
for headline in headlines:
    print(headline.text)

Putin tells the West that Russia cannot be defeated in Ukraine
In Egypt, Gazans are 'caught between war and a country that doesn't want' them
Ukraine's Hungarian minority, caught between defending its identity and fear of Orban's policies
In the Khan Yunis tunnels, Israeli army has little hope of freeing hostages by force
Paris votes on SUVs: The end of the road for big cars?
Why did astronauts leave poop on the moon, and what can we learn from it?
What's the role of France's prime minister?
Footage of Cyclone Belal hitting France's Réunion Island
Pesticides: 'We, researchers, condemn the way scientific knowledge is being shelved'
TotalEnergies makes its biggest profit ever
Coca-Cola: Sponsor of the Paris 2024 Olympics and leading global plastic polluter
UNESCO World Heritage site is engulfed by flames in Patagonia, Argentina
With the Paris airport border police: 'The risk, for us, is that we won't be able to send them back'
Three actresses accuse director Jacques Doillon of rape, sexu

## Where requests + BeautifulSoup fails

Some websites you'll be able to download fine with requests, but when you start trying to use BeautifulSoup *nothing shows up*. For example, if we try to access [OpenSyllabus listing pages](https://explorer.opensyllabus.org/results-list/titles?size=50) we won't see any books show up in BeautifulSoup.

In [17]:
response = requests.get("https://explorer.opensyllabus.org/results-list/titles?size=50")
doc = BeautifulSoup(response.text)

In [18]:
doc.find_all(class_='name-div')

[]

This is because the page retrieved by requests *doesn't actually have all those books on it*.

In [10]:
response.text

'<!doctype html><html lang="en"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width,initial-scale=1"><script async src="https://www.googletagmanager.com/gtag/js?id=UA-72367808-1"></script><script>function gtag(){dataLayer.push(arguments)}window.dataLayer=window.dataLayer||[],gtag("js",new Date),gtag("config","UA-72367808-1")</script><link rel="shortcut icon" href="/favicon.ico"><link rel="stylesheet" href="https://unpkg.com/leaflet@1.0.3/dist/leaflet.css"/><meta property="og:site_name" content="Open Syllabus"/><meta property="og:title" content="Open Syllabus: Explorer"/><meta property="og:description" content="Mapping the college curriculum across 7,292,573 syllabi."/><meta property="og:image" content="https://opensyllabus.org/og-image.jpg"/><meta name="twitter:card" content="summary_large_image"/><meta name="twitter:site" content="@opensyllabus"/><title>Open Syllabus</title><link href="/static/css/main.a318ad5b.css" rel="stylesheet"></head><body><div id="root

This is because visiting this site is a two-step process: first you load this bare-bones skeleton page, then **the browser goes and gets the actual information.** Requests doesn't do that second step, so we need to try another tool!
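One quick way to tell whether you've hit this kind of page is to measure how much visible text the raw HTML actually contains. Below is a minimal sketch using only Python's built-in `html.parser`. The `looks_like_skeleton` helper, the 200-character threshold, and the two sample HTML strings are all assumptions made up for illustration, so tune them for the site you're working with.

```python
from html.parser import HTMLParser

class VisibleTextCounter(HTMLParser):
    """Count characters of text inside <body>, ignoring <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.text_chars += len(data.strip())

def looks_like_skeleton(html, min_chars=200):
    """Guess that a page is a JavaScript skeleton if its body has almost no text."""
    counter = VisibleTextCounter()
    counter.feed(html)
    counter.close()
    return counter.text_chars < min_chars

# Hypothetical examples: an empty "root" div versus a page with real text in it
skeleton_html = (
    '<html><head><title>Open Syllabus</title></head>'
    '<body><div id="root"></div><script>window.app = {}</script></body></html>'
)
full_html = "<html><body><p>" + "The Elements of Style by William Strunk. " * 10 + "</p></body></html>"

print(looks_like_skeleton(skeleton_html))  # True
print(looks_like_skeleton(full_html))      # False
```

If this returns `True` for `response.text`, that's a decent hint that requests alone won't be enough for the page.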

## Enter Playwright

Instead of pulling the raw HTML contents of the page, Playwright actually controls your browser for you! It can load pages up, you can click things, fill out forms, all sorts of things.

## Installing Playwright

To install Playwright, you just need two commands: one to install the library, the other to install the necessary browsers. You can run them both from the command line, or put `!` in front of them if you're running them from a Jupyter Notebook.

```bash
pip install playwright
playwright install
```

If you come from a Selenium background, it's a lot easier than tracking down webdrivers, eh?

## Using Playwright

To begin we'll just access the same OpenSyllabus page as before and see the actual contents.

In [11]:
from playwright.async_api import async_playwright

# "Hey, open up a browser"
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)

# Create a new browser window
page = await browser.new_page()

# Tell it to go to this page
await page.goto("https://explorer.opensyllabus.org/results-list/titles?size=50")

<Response url='https://explorer.opensyllabus.org/results-list/titles?size=50' request=<Request url='https://explorer.opensyllabus.org/results-list/titles?size=50' method='GET'>>

Some people will actually scrape the page with Playwright, grabbing titles and all of that, but I find it's easiest to take the HTML (the *full* HTML, after the skeleton has been filled in) and feed it to BeautifulSoup, just like we're used to.

In [19]:
html_content = await page.content()

doc = BeautifulSoup(html_content)

In [20]:
doc.find_all(class_='name-div')

[<div class="name-div"><p><a href="/result/title?id=9199819950029">The Elements of Style</a></p><span class="name"><div><a href="/result/author?id=William+Strunk">William Strunk</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
 <div class="name-div"><p><a href="/result/title?id=33749853015144">A Writer's Reference</a></p><span class="name"><div><a href="/result/author?id=Diana+Hacker">Diana Hacker</a></div></span><span class="publisher"><div><a href="/result/publisher?id=St.+Martin%27s+%2F+Bedford+Books"><div class="div-link">St. Martin's / Bedford Books</div>,</a><div class="div1-no-link">1989</div></div></span></div>,
 <div class="name-div"><p><a href="/result/title?id=7988639699494">A Manual for Writers of Term Papers, Theses, and Dissertations</a></p><span class="name"><div><a href="/result/author?id=Kate+L.+Turabian">Kate L. Turabian</a></div></span><span class="publisher"><div><a href="/result/publisher?id=University+of+Chicag

Now that we know how to access the page, we can grab the content just like we'd do with a "normal" requests/BeautifulSoup page.

In [21]:
books = doc.find_all('div', class_='title-item')

for book in books:
    print("----")
    print(book)
    print(book.find('a').text)
    print(book.find('span', class_='name').text)
    print(book.find(class_='appearances').text)
    print(book.find(class_='score').text)

----
<div class="title-item"><div class="rank">1</div><div class="title"><div class="name-div"><p><a href="/result/title?id=9199819950029">The Elements of Style</a></p><span class="name"><div><a href="/result/author?id=William+Strunk">William Strunk</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">15,533</div><div class="score">100</div></div>
The Elements of Style
William Strunk
15,533
100
----
<div class="title-item"><div class="rank">2</div><div class="title"><div class="name-div"><p><a href="/result/title?id=33749853015144">A Writer's Reference</a></p><span class="name"><div><a href="/result/author?id=Diana+Hacker">Diana Hacker</a></div></span><span class="publisher"><div><a href="/result/publisher?id=St.+Martin%27s+%2F+Bedford+Books"><div class="div-link">St. Martin's / Bedford Books</div>,</a><div class="div1-no-link">1989</div></div></span></div></div><div class="appearances">14,931</div><div clas

In [22]:
books = doc.find_all('div', class_='title-item')

all_data = []
for book in books:
    data = {
        'name': book.find('a').text,
        'author': book.find('span', class_='name').text,
        'appearances': book.find(class_='appearances').text,
        'score': book.find(class_='score').text
    }
    all_data.append(data)

all_data

[{'name': 'The Elements of Style',
  'author': 'William Strunk',
  'appearances': '15,533',
  'score': '100'},
 {'name': "A Writer's Reference",
  'author': 'Diana Hacker',
  'appearances': '14,931',
  'score': '100'},
 {'name': 'A Manual for Writers of Term Papers, Theses, and Dissertations',
  'author': 'Kate L. Turabian',
  'appearances': '13,426',
  'score': '100'},
 {'name': 'The Communist Manifesto',
  'author': 'Karl Marx',
  'appearances': '11,234',
  'score': '100'},
 {'name': 'The Republic',
  'author': 'Plato',
  'appearances': '9,883',
  'score': '100'},
 {'name': 'Calculus',
  'author': 'James Stewart',
  'appearances': '9,682',
  'score': '100'},
 {'name': 'Frankenstein',
  'author': 'Mary Wollstonecraft Shelley',
  'appearances': '9,320',
  'score': '100'},
 {'name': 'The Canterbury Tales',
  'author': 'Geoffrey Chaucer',
  'appearances': '9,172',
  'score': '100'},
 {'name': 'Nicomachean Ethics',
  'author': 'Aristotle',
  'appearances': '9,104',
  'score': '100'},
 {'n
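Every value in `all_data` is a string (note the quotes around `'15,533'`), so sorting or doing math on those columns won't behave the way you'd hope. Here's a minimal sketch of cleaning them up before they go anywhere else. The `to_int` helper is a name I made up for illustration, and the two rows are copied from the output above.

```python
# Two rows copied from the all_data output above
all_data = [
    {'name': 'The Elements of Style', 'author': 'William Strunk',
     'appearances': '15,533', 'score': '100'},
    {'name': "A Writer's Reference", 'author': 'Diana Hacker',
     'appearances': '14,931', 'score': '100'},
]

def to_int(text):
    """Turn a string like '15,533' into the integer 15533."""
    return int(text.replace(",", "").strip())

# Build new dictionaries with the numeric fields converted
cleaned = [
    {**row,
     'appearances': to_int(row['appearances']),
     'score': to_int(row['score'])}
    for row in all_data
]

print(cleaned[0]['appearances'])  # 15533
```

A list like `cleaned` can be used anywhere `all_data` is used, including when building a DataFrame.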

And then we can do all anyone ever wants to do, which is convert it into a spreadsheet!

In [23]:
import pandas as pd

df = pd.DataFrame(all_data)
df.head()

|  | name | author | appearances | score |
|---|---|---|---|---|
| 0 | The Elements of Style | William Strunk | 15533 | 100 |
| 1 | A Writer's Reference | Diana Hacker | 14931 | 100 |
| 2 | A Manual for Writers of Term Papers, Theses, a... | Kate L. Turabian | 13426 | 100 |
| 3 | The Communist Manifesto | Karl Marx | 11234 | 100 |
| 4 | The Republic | Plato | 9883 | 100 |


## Interacting with the page

If we scroll down a bit, we see that the page **only lists the top 50 books**. We want more than that! And we get that by clicking the "Show more" button.

Playwright makes it easy with `page.get_by_text` and `.click()`.

In [24]:
await page.get_by_text("Show more").click()

Notice that we didn't have to scroll down! If you're used to Selenium, it would lose its mind whenever you tried to click something that wasn't visible on screen. Playwright doesn't care: it finds the button, scrolls to it, and clicks it!

If we want to click three times? Ten times? Just write a loop!

In [25]:
for _ in range(3):
    await page.get_by_text("Show more").click()

Notice that you **didn't have to wait for the content to load**. Selenium loves to throw errors if you try to click content that hasn't loaded yet. Playwright doesn't care, it just waits for it to show up! This can be a pain if you had a typo and have to wait out the 30-second default timeout before Playwright says "maybe you made a mistake," but I think we can live with that.
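If waiting that long for a typo to surface gets annoying, `.click()` accepts a `timeout` in milliseconds. Here's a minimal sketch that wraps the loop above in a function with a shorter timeout. The `click_repeatedly` name and the 5-second value are my own choices for illustration, not something from the session.

```python
async def click_repeatedly(page, label, times=3, timeout_ms=5000):
    """Click the element with the given text `times` times and return the click count."""
    clicks = 0
    for _ in range(times):
        # timeout is in milliseconds; if the element never shows up,
        # Playwright gives up after this long instead of its 30,000 ms default
        await page.get_by_text(label).click(timeout=timeout_ms)
        clicks += 1
    return clicks
```

In the notebook you'd call it as `await click_repeatedly(page, "Show more", times=10)`.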

## Filling out forms

Let's try another page where we need to fill out some forms. The [North Dakota well search page](https://www.dmr.nd.gov/oilgas/findwellsvw.asp) is a good one!

In [26]:
# Imports
from playwright.async_api import async_playwright

playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)
page = await browser.new_page()

await page.goto("https://www.dmr.nd.gov/oilgas/findwellsvw.asp")

Selecting from dropdowns is easy! Instead of importing a thousand additional tools (*cough Selenium cough*) we can just use `.select_option`.

What element on the page do we select? I don't know, and I don't care! An easy shortcut is to guess incorrectly – Playwright will automatically provide you with some options. If I try to just find all of the `select` fields...

```python
await page.locator("select").select_option('135')
```

...it gives me a few ideas for what I *should* have done.

```
1) <select size="1" id="ddmOperator" name="ddmOperator">…</select> aka get_by_label("Operator:")
2) <select size="1" id="ddmField" name="ddmField">…</select> aka get_by_label("Field:")
3) <select size="1" id="ddmSection" name="ddmSection">…</select> aka get_by_label("Section:")
4) <select size="1" id="ddmTownship" name="ddmTownship">…</select> aka get_by_label("Township:")
5) <select size="1" id="ddmRange" name="ddmRange">…</select> aka get_by_label("Range:")
```

I think `get_by_label("Township:")` seems good!

In [29]:
await page.get_by_label("Township:").select_option('135')

['135']

In [32]:
# await page.get_by_text("Submit").click()
await page.get_by_role("button", name="Submit").click()

Since the data is a table, we can actually feed the HTML directly into pandas.

In [33]:
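from io import StringIO

html_content = await page.content()
# Wrap the HTML in StringIO: newer versions of pandas warn when handed a raw HTML string
tables = pd.read_html(StringIO(html_content))
len(tables)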

3

With some experimentation we can figure out it's the third table.

In [37]:
df = tables[2]
df.head()

|  | File No | CTB No | API No | Well Type | Well Status | Status Date | DTD | Location | Operator | Well Name | Field |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1355 |  | 3304700007 | OG | DRY | 2/1/1957 | 3200.0 | NWNW 11-135-72 | CALVERT DRILLING, INC. | ARNOLD GERBER 1 | WILDCAT |
| 1 | 5523 |  | 3304700020 | OG | DRY | 11/9/1974 | 5320.0 | NWNW 29-135-73 | WISE OIL COMPANY NO. 2 ET AL | BALTZER A. WEIGEL 1 | WILDCAT |
| 2 | 10369 |  | 3302900028 | OG | DRY | 10/11/1983 | 4000.0 | NWNE 22-135-74 | ARKLA EXPLORATION CO. | ELLEFSON 1 | WILDCAT |
| 3 | 10173 |  | 3302900027 | OG | DRY | 6/24/1983 | 5865.0 | SWSW 14-135-76 | SOUTHWESTERN ENERGY PRODUCTION CO. | BEASTROM 1-14 | WILDCAT |
| 4 | 16476 |  | 3302900032 | GASD | DRY | 11/13/2009 | 1691.0 | SESE 15-135-76 | STAGHORN ENERGY, LLC | WEISER 1-15 | WILDCAT |
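Hard-coding `tables[2]` works until the page gains or loses a table. A slightly sturdier approach is to pick the table by one of its column names. This is a minimal sketch, and `pick_table` is a helper name I made up; it only assumes each table has a `.columns` attribute, which pandas DataFrames do.

```python
def pick_table(tables, required_column):
    """Return the first table whose columns include `required_column`."""
    for table in tables:
        if required_column in list(table.columns):
            return table
    raise ValueError(f"No table has a column named {required_column!r}")
```

With the list from `pd.read_html`, that would look like `df = pick_table(tables, "CTB No")`. pandas' `read_html` also accepts a `match` argument that only returns tables containing the given text, which can get you to the same place.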


It was township 135, so we can even save it as `135.csv`.

In [38]:
df.to_csv("135.csv", index=False)

## Waiting for elements to appear

If we want to search for *multiple* townships, we could write a loop to go through it.

If we do it "normally," though, we run into an error. Even though we can see the table on the page, it doesn't make it into pandas.

```
---> 16 df = tables[2]
     17 df.to_csv(filename, index=False)

IndexError: list index out of range
```

This is because the table doesn't load until a tiny bit after the page loads. Playwright doesn't know we are waiting for the table, though, so it hands the incomplete page to pandas. If you were working with Selenium you might use `time.sleep` or the awful, horrible version of waiting they support, but with Playwright it's easy!

We're just going to wait for the "CTB No" field to show up:

```python
await page.get_by_text('CTB No').wait_for()
```

The code underneath that line won't continue until "CTB No" appears on the page.
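If the idea of "waiting for something to appear" feels like magic, here is a conceptual sketch of it in plain Python: check a condition over and over until it's true or time runs out. This is *not* how Playwright is implemented, just an illustration of the idea. `wait_until`, `demo`, and the fake `page_text` list are all hypothetical names.

```python
import asyncio
import time

async def wait_until(condition, timeout=30.0, interval=0.1):
    """Poll condition() until it is true; raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while not condition():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        await asyncio.sleep(interval)

async def demo():
    page_text = []  # stand-in for whatever is currently on the page

    async def load_table_late():
        # Pretend the table shows up a moment after the page loads
        await asyncio.sleep(0.3)
        page_text.append("CTB No")

    loader = asyncio.create_task(load_table_late())
    await wait_until(lambda: "CTB No" in page_text, timeout=5)
    await loader
    return page_text

# In a notebook, use `await demo()` instead of asyncio.run(...)
print(asyncio.run(demo()))  # ['CTB No']
```

The real `wait_for()` saves us from writing any of this, which is the whole point.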

In [40]:
from io import StringIO

township_numbers = ['129', '130', '135']

for num in township_numbers:
    # Fill it in
    print("Searching for page", num)
    await page.locator("#ddmTownship").select_option(num)
    await page.get_by_text("Submit", exact=True).click()

    # Wait for the table to show up
    await page.get_by_text('CTB No').wait_for()

    # Grab the table from the page (StringIO avoids pandas' raw-HTML-string warning)
    html = await page.content()
    tables = pd.read_html(StringIO(html))
    df = tables[2]

    # Build filename and save it
    filename = f"{num}.csv"
    print("Got it - saving as", filename)
    df.to_csv(filename, index=False)

Searching for page 129


Got it - saving as 129.csv
Searching for page 130


Got it - saving as 130.csv
Searching for page 135
Got it - saving as 135.csv



## More forms

Let's try to scrape the [South Dakota Board of Technical Professions](https://apps.sd.gov/ld17btp/licenseelist.aspx) this time. We're going to end up downloading some content, so I'm adding `downloads_path="."`, which makes Playwright download files into the same folder as this notebook.

In [52]:
from playwright.async_api import async_playwright

# "Hey, open up a browser"
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(
    headless=False,
    downloads_path="."
)

# Create a new browser window
page = await browser.new_page()

# Tell it to go to this page
await page.goto("https://apps.sd.gov/ld17btp/licenseelist.aspx")

<Response url='https://apps.sd.gov/ld17btp/licenseelist.aspx' request=<Request url='https://apps.sd.gov/ld17btp/licenseelist.aspx' method='GET'>>

If we want to start typing in the "Name" field, we can take the same shortcut as above and ask for `page.locator("input")` instead of anything more specific. The error message lists the first 10 of the 20 matching elements.

```
Error: Error: strict mode violation: locator("input") resolved to 20 elements:
    1) <input type="hidden" id="ctl00_RadScriptManager1_TSM" n…/> aka locator("#ctl00_RadScriptManager1_TSM")
    2) <input value="" type="hidden" id="__EVENTTARGET" name="…/> aka locator("[id=\"__EVENTTARGET\"]")
    3) <input value="" type="hidden" id="__EVENTARGUMENT" name…/> aka locator("[id=\"__EVENTARGUMENT\"]")
    4) <input type="hidden" id="__VIEWSTATE" name="__VIEWSTATE…/> aka locator("[id=\"__VIEWSTATE\"]")
    5) <input type="hidden" value="C7B208E6" id="__VIEWSTATEGE…/> aka locator("[id=\"__VIEWSTATEGENERATOR\"]")
    6) <input value="" type="hidden" id="__VIEWSTATEENCRYPTED"…/> aka locator("[id=\"__VIEWSTATEENCRYPTED\"]")
    7) <input type="hidden" id="__EVENTVALIDATION" name="__EVE…/> aka locator("[id=\"__EVENTVALIDATION\"]")
    8) <input type="text" maxlength="50" id="ctl00_ContentPlac…/> aka locator("#ctl00_ContentPlaceHolder1_txtName")
    9) <input type="text" value="All" autocomplete="off" class…/> aka locator("#ctl00_ContentPlaceHolder1_ddlPEDisc_Input")
    10) <input type="hidden" autocomplete="off" id="ctl00_Conte…/> aka locator("#ctl00_ContentPlaceHolder1_ddlPEDisc_ClientState")
    ...
```

Luckily it seems like `locator("#ctl00_ContentPlaceHolder1_txtName")` is probably what we're looking for!

In [53]:
await page.locator("#ctl00_ContentPlaceHolder1_txtName").fill('SMITH')

Now we want to select something from the Profession dropdown. Because it **isn't a normal select box** we have to actually click the dropdown arrow, then click the profession we're interested in.

In [54]:
await page.locator("#ctl00_ContentPlaceHolder1_ddlProfession_Arrow").click()
await page.get_by_text("Professional Engineer").click()

Now let's click that Search button! If we try `page.get_by_text("Search")` we get a few options and we pick the most likely one.

In [55]:
await page.get_by_role("button", name="Search").click()

Now that the page loads we *could* use `page.content()` to feed the table into pandas... or we could just click the "Download CSV" button!

In [49]:
await page.locator("#ctl00_ContentPlaceHolder1_rgLicensee_ctl00_ctl02_ctl00_ExportToCsvButton").click()

It downloads – with an awful filename of `9b96da64-757f-4166-a84c-63fa1830e77c` – into the current folder. Thank you `downloads_path="."` that we set up when we launched the browser!

If we want to be a little more in control of the filename, we can skip `downloads_path` and write slightly more complicated code. In the code below, we listen for the download to happen and save it under a "good" filename once it starts.

In [56]:
async with page.expect_download() as download_info:
    # Perform the action that initiates download
    await page.locator("#ctl00_ContentPlaceHolder1_rgLicensee_ctl00_ctl02_ctl00_ExportToCsvButton").click()
download = await download_info.value

# Wait for the download process to complete and save the downloaded file somewhere
await download.save_as("content.csv")

Did it work???

In [58]:
df = pd.read_csv("content.csv")
df.head()

|  | Profession | Name | Address | City | State | Zip | Phone | Registration<br/>Number | PE<br/>Disc. | Expiration<br/>Date | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PE | Lane Lee Goldsmith | PO Box 123 | Mobridge | SD | 57601 | (605) 350-4625 | 8908 | CE | 6/30/2024 | Active |
| 1 | PE | Todd L. Goldsmith | 5212 Basswood St | Rapid City | SD | 57703 | (605) 848-0040 | 5163 | CE | 12/31/2024 | Active |
| 2 | PE | Willie Morgan NeSmith, Jr. | 7300 Marks Lane | Austell | GA | 30168 | (770) 941-5100 | 10305 | CE | 7/31/2024 | Active |
| 3 | PE | Aaron D. Smith | 1801 W 32nd St, Bldg B Suite 104 | Joplin | MO | 64804 | (417) 624-0444 | 8630 | CE | 7/31/2025 | Active |
| 4 | PE | Alexander Smith | 8309 W 42nd Street | Sioux Falls | SD | 57106 | 6052202447 | 15570 | CE | 12/31/2025 | Active |
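If you run several searches, a fixed name like `content.csv` gets overwritten every time. One option is to build the filename from the search terms. This is a minimal sketch, and `build_filename` is a hypothetical helper of my own, not something Playwright provides.

```python
import re

def build_filename(*parts, extension="csv"):
    """Join search terms into a lowercase, dash-separated filename."""
    slug = "-".join(parts).lower()
    # Collapse anything that isn't a letter or digit into a single dash
    slug = re.sub(r"[^a-z0-9]+", "-", slug).strip("-")
    return f"{slug}.{extension}"

print(build_filename("SMITH", "Professional Engineer"))
# smith-professional-engineer.csv
```

You'd then pass the result to `download.save_as(...)` in place of `"content.csv"`.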


## Solving CAPTCHAs

When you're looking to scrape a site with a browser automation tool, you'll often run up against CAPTCHAs that demand to know that you are *not* a robot. Since you *are* a robot, it can be problematic.

Luckily, a lot of services exist to help you out with that! The one that's easiest to use and requires the least technical skill is probably NopeCHA. I [wrote up how to use NopeCHA to solve CAPTCHAs with Playwright here](https://jonathansoma.com/everything/scraping/solving-captchas-in-playwright-with-nopecha/). It involves downloading an extension, embedding an API key into it, then visiting the CAPTCHA-y page with the Playwright browser.

NopeCHA works pretty well from my experience, but it's always a cat and mouse game! You might have luck using [2captcha](https://github.com/2captcha/2captcha-python) or [anti-captcha](https://anti-captcha.com/) if NopeCHA comes up empty.