# 5-2: More Scraping + actionability

- Scraping next elements
- Scraping HTML tables
- Image downloading
- Driving a webpage

Using locators to find elements on the page is a fundamental part of web scraping. In this notebook, we'll learn how to use Playwright to find elements on the page using different types of locators.


## Scraping HTML tables

It's easy to scrape HTML tables into a pandas DataFrame.

`pd.read_html()` reads every HTML `<table>` on the page and returns a list of pandas DataFrames, one per table.
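
Because `pd.read_html()` always returns a list, it helps to see it on a small page before pointing it at a real one. The sketch below uses made-up HTML (the table contents are hypothetical, not taken from the course site) and assumes pandas plus an HTML parser such as `lxml` are installed. It also shows the `match` parameter, which keeps only tables containing the given text.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two tables, standing in for a scraped page.
HTML = """
<table>
  <tr><th>Week</th><th>Topic</th></tr>
  <tr><td>1</td><td>Intro</td></tr>
  <tr><td>2</td><td>Variables</td></tr>
</table>
<table>
  <tr><th>Grade</th><th>Points</th></tr>
  <tr><td>A</td><td>930</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table>
tables = pd.read_html(StringIO(HTML))
print(len(tables))

# pick a table by position
schedule = tables[0]
print(schedule.columns.tolist())

# or keep only the tables whose text matches a string
grades = pd.read_html(StringIO(HTML), match="Grade")[0]
print(grades)
```

Selecting by `match` tends to be less brittle than a fixed index when the page layout changes.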

#### 5-2-tables.py

```python
from io import StringIO

from playwright.sync_api import Playwright, sync_playwright
import pandas as pd

def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://ist256.com/fall2023/")

    # Let's scrape the page!
    # page.content() returns the rendered HTML as a string;
    # wrap it in StringIO because pandas 2.1 and later deprecate
    # passing literal HTML to read_html

    # get a list of all tables on the page
    dfs = pd.read_html(StringIO(page.content()))

    # print the first table
    print(dfs[0])
    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)
```


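## Scraping the next adjacent element

Sometimes one selector **finds** an element, but the content you want to scrape is in the **next** element right after it.

Call `.query_selector('~ *')` on the found element to get its next sibling. The `~ *` selector matches every sibling that follows, and `query_selector()` returns only the first match, which is the adjacent one.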


#### 5-2-after.py

```python

from playwright.sync_api import Playwright, sync_playwright, expect


def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://ist256.com/fall2023/syllabus/")
    
    # find the heading by its CSS selector
    outcomes = page.query_selector("h3#learning-outcomes")
    print(outcomes.inner_text())
    # '~ *' matches the siblings that follow; query_selector returns the first one
    next_element = outcomes.query_selector('~ *')
    print(next_element.inner_text())

    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)
    
```

## Challenge 5-2-1: 

Scrape the additional textbook recommendations from:

https://ist256.com/fall2023/syllabus/

Print each recommendation in a loop.


## Downloading an Image

You can download an image by using Playwright to read the `<img>` element's `src` attribute, then fetching that URL with `requests`.


#### 5-2-image.py

```python

from urllib.parse import urljoin

from playwright.sync_api import Playwright, sync_playwright
import requests

def download_image(url):
    # use the last path segment of the URL as the local filename
    filename = url.split("/")[-1]
    response = requests.get(url)
    response.raise_for_status()  # stop here if the download failed
    with open(filename, 'wb') as file:
        file.write(response.content)
    return filename

def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    site = "https://ist256.com/fall2023/"
    page.goto(site)

    image = page.query_selector("img.logo")
    image_source = image.get_attribute("src")
    print(image_source)

    # urljoin handles relative, root-relative, and absolute src values
    filename = download_image(urljoin(site, image_source))
    print(filename)
    
    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)


```
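
How the full image URL should be built depends on the form the `src` attribute takes: relative to the page, relative to the site root, or already absolute. A minimal sketch using only the standard library shows how `urljoin` resolves each case and how to take a filename from the URL path so a query string does not end up in it. The `src` values and the helper names `image_url` and `image_filename` are hypothetical examples, not values from the course site.

```python
import posixpath
from urllib.parse import urljoin, urlparse

SITE = "https://ist256.com/fall2023/"


def image_url(base: str, src: str) -> str:
    """Resolve an <img> src value against the page URL."""
    return urljoin(base, src)


def image_filename(url: str) -> str:
    """Take the filename from the URL path, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)


# Hypothetical src values covering the three common forms.
examples = [
    "images/logo.png",                         # relative to the page
    "/static/logo.png",                        # relative to the site root
    "https://cdn.example.com/a/logo.png?v=2",  # already absolute
]

for src in examples:
    full = image_url(SITE, src)
    print(full, "->", image_filename(full))
```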

## Playwright Codegen 

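Playwright has a codegen feature that records your interactions with a webpage and generates the code to repeat them. You can pass a starting URL after `codegen` to open that page when the recorder launches.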

```bash

python -m playwright codegen 

```

Let's use playwright to search for a course in the course catalog and return the title and description.

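`get_by_role()` finds elements by their ARIA role, such as `link` or `button`, and can narrow the match with the element's accessible `name`. The role can be implicit: a `<button>` has the `button` role without a `role` attribute.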

#### 5-2-codegen.py

```python
from playwright.sync_api import Playwright, sync_playwright


def run(playwright: Playwright) -> None:
    course = "IST 356"

    # playwright codegen
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("http://coursecatalog.syr.edu/")
    page.get_by_label("Search Keyword Field").click()
    page.get_by_label("Search Keyword Field, required").fill(course)
    page.get_by_label("Search Keyword Field, required").press("Enter")
    page.get_by_role("link", name=f"Best Match: {course}").click()
    # the print link opens a popup; capture the new page it creates
    with page.expect_popup() as page1_info:
        page.get_by_role("link", name="Print (opens a new window) ").click()
    page1 = page1_info.value
    # wait for the popup to finish loading before reading it
    page1.wait_for_load_state()

    # scrape the text of the first table on the print view
    course_table = page1.query_selector("table")
    course_text = course_table.inner_text()
    print(course_text)

    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)

```


## Challenge 5-2-2: Scraping Google Searcher

Use Playwright codegen to extract the SU 2023 football schedule from https://cuse.com

Input the year, output the schedule.
