# Scraping basics for Playwright

If you already feel comfortable with scraping, feel free to skip this notebook and go straight to the next one. The same goes if you get bored partway through.

> The [scraping section](https://jonathansoma.com/everything/scraping/) on my Everything I Know site might be helpful.
>
> I know I love them, but **you don't have to use CSS selectors!**

## Part 0: Imports

Import what you need to use Playwright, and start up a new browser to use for scraping. 

> If you end up opening a lot of Chromes/Chromiums, shutting down the Python kernel with the stop button is an easy way to make them go away! You'll have to re-run your notebook, but at least you won't have sixty icons in your dock.

In [3]:
from playwright.async_api import async_playwright
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

## Part 1: Scraping by class

Scrape the content at http://jonathansoma.com/lede/static/by-class.html, printing out the title, subhead, and byline.

In [4]:
await page.goto("http://jonathansoma.com/lede/static/by-class.html")

<Response url='https://jonathansoma.com/lede/static/by-class.html' request=<Request url='https://jonathansoma.com/lede/static/by-class.html' method='GET'>>

In [5]:
from bs4 import BeautifulSoup


In [6]:
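html = await page.content()

# Name the parser explicitly so BeautifulSoup doesn't have to guess (and warn about it)
doc = BeautifulSoup(html, "html.parser")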

In [7]:
doc.find(class_='title').text

'How to Scrape Things'

In [9]:
doc.select_one('.title').text

'How to Scrape Things'

In [11]:
doc.select_one('.subhead').text

'Some Supplemental Materials'

In [12]:
doc.select_one('.byline').text

'By Jonathan Soma'
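
Both styles find the same elements, so CSS selectors really are optional. Here is a minimal sketch, assuming markup like the hypothetical `html` string below: only the class names `title`, `subhead`, and `byline` come from the cells above, and the tag names are a guess.

```python
from bs4 import BeautifulSoup

# Stand-in HTML: the tag names are assumptions, the class names match the cells above
html = """
<h1 class="title">How to Scrape Things</h1>
<h3 class="subhead">Some Supplemental Materials</h3>
<p class="byline">By Jonathan Soma</p>
"""
doc = BeautifulSoup(html, "html.parser")

# Keyword style: class_ has a trailing underscore because class is a Python keyword
by_find = doc.find(class_="byline").text

# CSS-selector style: a leading dot means "class"
by_select = doc.select_one(".byline").text

print(by_find == by_select)
```

Pick whichever reads better to you. The two calls return the same element here.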

## Part 2: Scraping using tags

Scrape the content at http://jonathansoma.com/lede/static/by-tag.html, printing out the title, subhead, and byline.

In [13]:
await page.goto("http://jonathansoma.com/lede/static/by-tag.html")

<Response url='https://jonathansoma.com/lede/static/by-tag.html' request=<Request url='https://jonathansoma.com/lede/static/by-tag.html' method='GET'>>

In [14]:
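# Re-parse after navigating, otherwise doc still holds the previous page
html = await page.content()
doc = BeautifulSoup(html, "html.parser")
doc.select_one('h1').text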

'How to Scrape Things'

In [15]:
doc.select_one('h3').text

'Some Supplemental Materials'

In [16]:
doc.select_one('p').text

'By Jonathan Soma'
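
One thing to watch for: a `BeautifulSoup` document is a snapshot of whatever HTML you handed it, so it does not update when the browser moves to a new page. The sketch below uses two made-up HTML strings in place of `await page.content()` to show the effect.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for what page.content() returns on two different pages
first_page = "<h1>First page</h1>"
second_page = "<h1>Second page</h1>"

doc = BeautifulSoup(first_page, "html.parser")
html = second_page  # pretend the browser navigated somewhere new

# doc still holds the old page until it is rebuilt from the new HTML
stale_title = doc.select_one("h1").text

doc = BeautifulSoup(html, "html.parser")
fresh_title = doc.select_one("h1").text

print(stale_title, "->", fresh_title)
```

In the notebook, that means grabbing `page.content()` and rebuilding `doc` after every `page.goto`.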

## Part 3: Scraping using a single tag

Scrape the content at http://jonathansoma.com/lede/static/by-list.html, printing out the title, subhead, and byline in sentences, e.g. "the title is `______`."

> **This will be important for the next few:** you can use `.get_by_text` but it seems kind of silly since maybe the text would change. I think getting them all, then using list indexes like `[0]`, etc, would be better! If I sold you on CSS selectors, you can also look up `nth-of-type` and use it with `.select_one`.

In [17]:
await page.goto("http://jonathansoma.com/lede/static/by-list.html")

<Response url='https://jonathansoma.com/lede/static/by-list.html' request=<Request url='https://jonathansoma.com/lede/static/by-list.html' method='GET'>>

In [18]:
html = await page.content()

doc = BeautifulSoup(html)

In [19]:
paras = doc.select("p")
paras 

[<p>How to Scrape Things</p>,
 <p>Some Supplemental Materials</p>,
 <p>By Jonathan Soma</p>]

In [20]:
paras[0].text

'How to Scrape Things'

In [21]:
paras[1].text

'Some Supplemental Materials'

In [22]:
paras[2].text

'By Jonathan Soma'

In [23]:
await page.get_by_text("By Jonathan Soma").text_content()

'By Jonathan Soma'
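
The two position-based approaches from the note above can be sketched side by side. The three `<p>` tags are taken from the output shown earlier; the wrapping `<div>` is an assumption so the example is self-contained.

```python
from bs4 import BeautifulSoup

# Stand-in HTML: three <p> tags as in the output above; the wrapping <div> is an assumption
html = """
<div>
  <p>How to Scrape Things</p>
  <p>Some Supplemental Materials</p>
  <p>By Jonathan Soma</p>
</div>
"""
doc = BeautifulSoup(html, "html.parser")

# Option 1: grab every <p>, then pick by position (list indexes start at 0)
paras = doc.select("p")
subhead_by_index = paras[1].text

# Option 2: nth-of-type counts from 1, so the second <p> is nth-of-type(2)
subhead_by_css = doc.select_one("p:nth-of-type(2)").text

print(f"The subhead is {subhead_by_index}")
```

Note the off-by-one difference: `[1]` and `nth-of-type(2)` point at the same paragraph.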

## Part 4: Scraping a single table row

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, printing out the title, subhead, and byline in sentences, e.g. "the title is `______`."

In [25]:
await page.goto("http://jonathansoma.com/lede/static/single-table-row.html")

<Response url='https://jonathansoma.com/lede/static/single-table-row.html' request=<Request url='https://jonathansoma.com/lede/static/single-table-row.html' method='GET'>>

In [26]:
html = await page.content()

doc = BeautifulSoup(html)

In [27]:
cells = doc.select("td")
cells 

[<td>How to Scrape Things</td>,
 <td>Some Supplemental Materials</td>,
 <td>By Jonathan Soma</td>]
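
To finish the exercise, the cells still need to be turned into sentences. A possible sketch using f-strings follows; the HTML is a stand-in rebuilt from the output above, and the `<table>`/`<tr>` wrapper is an assumption.

```python
from bs4 import BeautifulSoup

# Stand-in HTML built from the cells shown above; the <table>/<tr> wrapper is an assumption
html = """
<table><tr>
  <td>How to Scrape Things</td>
  <td>Some Supplemental Materials</td>
  <td>By Jonathan Soma</td>
</tr></table>
"""
doc = BeautifulSoup(html, "html.parser")
cells = doc.select("td")

# One sentence per cell, in the order the cells appear in the row
sentences = [
    f"The title is {cells[0].text}.",
    f"The subhead is {cells[1].text}.",
    f"The byline is {cells[2].text}.",
]
for sentence in sentences:
    print(sentence)
```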

## Part 5: Saving into a dictionary

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [28]:
await page.goto("http://jonathansoma.com/lede/static/single-table-row.html")

<Response url='https://jonathansoma.com/lede/static/single-table-row.html' request=<Request url='https://jonathansoma.com/lede/static/single-table-row.html' method='GET'>>

In [29]:
html = await page.content()

doc = BeautifulSoup(html)

In [30]:
cells = doc.select("td")
cells 

[<td>How to Scrape Things</td>,
 <td>Some Supplemental Materials</td>,
 <td>By Jonathan Soma</td>]

In [31]:
book = {
    'title': cells[0].text,
    'subhead': cells[1].text,
    'byline': cells[2].text
}
book


{'title': 'How to Scrape Things',
 'subhead': 'Some Supplemental Materials',
 'byline': 'By Jonathan Soma'}
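
If typing out each index feels repetitive, one alternative is to pair a list of key names with the cells using `zip`. This is only a sketch of the same idea, using stand-in HTML rebuilt from the output above (the `<table>`/`<tr>` wrapper is an assumption).

```python
from bs4 import BeautifulSoup

# Stand-in HTML built from the cells shown above; the <table>/<tr> wrapper is an assumption
html = """
<table><tr>
  <td>How to Scrape Things</td>
  <td>Some Supplemental Materials</td>
  <td>By Jonathan Soma</td>
</tr></table>
"""
cells = BeautifulSoup(html, "html.parser").select("td")

# zip pairs the first key with the first cell, the second with the second, and so on
keys = ["title", "subhead", "byline"]
book = {key: cell.text for key, cell in zip(keys, cells)}

print(book)
```

This relies on the cells always arriving in the same order as `keys`.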

## Part 6: Scraping multiple table rows

Scrape the content at http://jonathansoma.com/lede/static/multiple-table-rows.html, printing out each title, subhead, and byline.

> You won't use pandas for this one, either!

In [32]:
await page.goto("http://jonathansoma.com/lede/static/multiple-table-rows.html")

<Response url='https://jonathansoma.com/lede/static/multiple-table-rows.html' request=<Request url='https://jonathansoma.com/lede/static/multiple-table-rows.html' method='GET'>>

In [33]:
html = await page.content()

doc = BeautifulSoup(html)

In [35]:
rows = doc.select("tr")
rows

[<tr>
 <td>How to Scrape Things</td>
 <td>Some Supplemental Materials</td>
 <td>By Jonathan Soma</td>
 </tr>,
 <tr>
 <td>How to Scrape Many Things</td>
 <td>But, Is It Even Possible?</td>
 <td>By Sonathan Joma</td>
 </tr>,
 <tr>
 <td>The End of Scraping</td>
 <td>Let's All Use CSV Files</td>
 <td>By Amos Nathanos</td>
 </tr>]

In [47]:
for row in rows:
    print("--------")
    cells = row.select("td")
    print("Title is", cells[0].text)
    print("Subhead is", cells[1].text)
    print("Byline is", cells[2].text)
    

--------
Title is How to Scrape Things
Subhead is Some Supplemental Materials
Byline is By Jonathan Soma
--------
Title is How to Scrape Many Things
Subhead is But, Is It Even Possible?
Byline is By Sonathan Joma
--------
Title is The End of Scraping
Subhead is Let's All Use CSV Files
Byline is By Amos Nathanos
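
The same loop can also unpack each row's cells straight into named variables, which some people find easier to read than `cells[0]`, `cells[1]`, `cells[2]`. A small sketch, with the rows copied from the output above and the `<table>` wrapper assumed:

```python
from bs4 import BeautifulSoup

# Rows copied from the output above; the <table> wrapper is an assumption
html = """
<table>
  <tr><td>How to Scrape Things</td><td>Some Supplemental Materials</td><td>By Jonathan Soma</td></tr>
  <tr><td>How to Scrape Many Things</td><td>But, Is It Even Possible?</td><td>By Sonathan Joma</td></tr>
  <tr><td>The End of Scraping</td><td>Let's All Use CSV Files</td><td>By Amos Nathanos</td></tr>
</table>
"""
doc = BeautifulSoup(html, "html.parser")

lines = []
for row in doc.select("tr"):
    # Unpacking only works when the row has exactly three cells
    title, subhead, byline = [cell.text for cell in row.select("td")]
    lines.append(f"{title} / {subhead} / {byline}")

for line in lines:
    print(line)
```

If a row ever has more or fewer than three cells, the unpacking line raises a `ValueError`, which can be a handy early warning that the table isn't shaped the way you expected.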


## Part 7: Scraping an actual table

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a list of dictionaries.

> Don't use pandas here, either, even though that's exactly what we did in class.
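
Real tables often start with a header row made of `<th>` cells, and a loop that reads `cells[0]` breaks on that row because it has no `<td>` cells. The sketch below shows one way to skip it. The markup is hypothetical: the header row and its labels are assumptions, not copied from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the header row and its labels are assumptions
html = """
<table>
  <tr><th>title</th><th>subhead</th><th>byline</th></tr>
  <tr><td>How to Scrape Things</td><td>Some Supplemental Materials</td><td>By Jonathan Soma</td></tr>
  <tr><td>How to Scrape Many Things</td><td>But, Is It Even Possible?</td><td>By Sonathan Joma</td></tr>
</table>
"""
doc = BeautifulSoup(html, "html.parser")

all_books = []
for row in doc.select("tr"):
    cells = row.select("td")
    if not cells:
        # Header rows use <th>, so they have no <td> cells to read
        continue
    all_books.append({
        "title": cells[0].text,
        "subhead": cells[1].text,
        "byline": cells[2].text,
    })

print(all_books)
```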

In [39]:
await page.goto("http://jonathansoma.com/lede/static/multiple-table-rows.html")

<Response url='https://jonathansoma.com/lede/static/multiple-table-rows.html' request=<Request url='https://jonathansoma.com/lede/static/multiple-table-rows.html' method='GET'>>

In [40]:
html = await page.content()

doc = BeautifulSoup(html)

In [48]:
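all_books = []

# Re-select the rows from the freshly parsed page
rows = doc.select("tr")

for row in rows:
    cells = row.select("td")

    # Build and append inside the loop so every row is saved, not just the last one
    book = {
        'title': cells[0].text,
        'subhead': cells[1].text,
        'byline': cells[2].text
    }
    all_books.append(book)

all_books

[{'title': 'How to Scrape Things',
  'subhead': 'Some Supplemental Materials',
  'byline': 'By Jonathan Soma'},
 {'title': 'How to Scrape Many Things',
  'subhead': 'But, Is It Even Possible?',
  'byline': 'By Sonathan Joma'},
 {'title': 'The End of Scraping',
  'subhead': "Let's All Use CSV Files",
  'byline': 'By Amos Nathanos'}]

## Part 8: Scraping multiple table rows into a list of dictionaries

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a pandas DataFrame.

> There are two ways to do this one! One uses just pandas, the other one uses the result from Part 7.

In [49]:
import pandas as pd
df = pd.DataFrame(all_books)
df

| | title | subhead | byline |
|---|---|---|---|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |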


## Part 9: Scraping into a file

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html and save it as `output.csv`
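
One way to approach this is to build a DataFrame from a list of dictionaries and call `to_csv`. The sketch below hard-codes sample records copied from the earlier output so it runs on its own; in the notebook you would pass in the list you scraped in Part 7 instead.

```python
import pandas as pd

# Sample records copied from the earlier output; in the notebook this would be
# the list of dictionaries built in Part 7
all_books = [
    {"title": "How to Scrape Things",
     "subhead": "Some Supplemental Materials",
     "byline": "By Jonathan Soma"},
    {"title": "How to Scrape Many Things",
     "subhead": "But, Is It Even Possible?",
     "byline": "By Sonathan Joma"},
    {"title": "The End of Scraping",
     "subhead": "Let's All Use CSV Files",
     "byline": "By Amos Nathanos"},
]
df = pd.DataFrame(all_books)

# index=False leaves out the 0, 1, 2 row labels so the file holds only the three columns
df.to_csv("output.csv", index=False)

# Read the file back as a quick check
saved = pd.read_csv("output.csv")
print(saved.shape)
```

Leaving off `index=False` adds an extra unnamed column of row numbers to the file, which shows up as `Unnamed: 0` when you read it back in.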