# 🌐 Scraping, Part 8: Scraping gracefully

*Sleeping, announcing yourself, caching, and catching HTTP errors.*

In [1]:
import requests
from bs4 import BeautifulSoup

## Sleeping 💤

If you're scraping a lot of pages, or a delicate site, you might want to pause between requests so you don't overwhelm the server.

In [2]:
from time import sleep

for i in range(3):
    print("Fetching page " + str(i + 1))
    sleep(1)

Fetching page 1
Fetching page 2
Fetching page 3
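
A fixed one-second pause is often enough. As an optional variation, here is a minimal sketch that adds a little random "jitter" to each pause, so requests don't arrive on a perfectly regular beat. The helper name `polite_pause` and its parameters are made up for this example, not part of any library.

```python
import random
from time import sleep


def polite_pause(base_seconds=1.0, jitter_seconds=0.5, sleeper=sleep):
    """Sleep for base_seconds plus a random extra of up to jitter_seconds.

    `sleeper` is a hypothetical hook (defaulting to time.sleep) that makes
    the helper easy to try out without actually waiting.
    """
    delay = base_seconds + random.uniform(0, jitter_seconds)
    sleeper(delay)
    return delay


# Short values here so the demo finishes quickly.
for i in range(3):
    print("Fetching page " + str(i + 1))
    polite_pause(base_seconds=0.1, jitter_seconds=0.05)
```

In a real scraping loop, you would call `polite_pause()` right after each request.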


## Announcing yourself 📢

Are you undertaking a substantial scraping project? Announce yourself with HTTP headers, so the site's operators can see who is making the requests and how to reach you.

In [3]:
ident = (
    "Jeremy Singer-Vine (jsvine@gmail.com), " + 
    "scraping for educational purposes"
)

html = requests.get(
    "https://example.com",
    headers = {
        "From": ident
    }
).text
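
The `From` header is intended to carry a contact email address, while `User-Agent` is free-form. One possible variation, sketched below, puts just the address in `From` and the fuller description in `User-Agent`. The function name `build_polite_headers` and the placeholder name and address are assumptions for illustration only.

```python
def build_polite_headers(name, email, purpose):
    """Build headers that say who is scraping, how to reach them, and why."""
    return {
        # From is meant to carry a contact email address.
        "From": email,
        # User-Agent is free-form, so the fuller description can go here.
        # Note: this replaces the default User-Agent your HTTP library sends.
        "User-Agent": f"{name} ({email}), {purpose}",
    }


# Placeholder values; swap in your own details.
headers = build_polite_headers(
    "Your Name",
    "you@example.com",
    "scraping for educational purposes",
)
print(headers)
```

You would then pass the result along with each request, e.g. `requests.get("https://example.com", headers=headers)`.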

## Caching 💾

Only fetch each page once (unless it's changing rapidly). How?

- Come up with a predictable `page->filepath` naming scheme
- On each iteration, get the page's `filepath` in the scheme
- If the `filepath` does not yet exist, save the HTML to the `filepath` after you fetch it
- If the `filepath` already exists, skip the fetching and just load the file

Make a subdirectory wherever your notebook is, using your command line:

```sh
mkdir table-pages
```

Import `Path`:

In [4]:
from pathlib import Path

Implement caching:

In [5]:
BASE_URL = "https://scraping-practice-jsvine.vercel.app/launches/paginated/"

for i in range(3):
    dest = Path("table-pages/" + str(i + 1) + ".html")
    print(dest)

table-pages/1.html
table-pages/2.html
table-pages/3.html


In [6]:
BASE_URL = "https://scraping-practice-jsvine.vercel.app/launches/paginated/"

for i in range(3):
    dest = Path("table-pages/" + str(i + 1) + ".html")
    
    if dest.exists(): # ... load it from file
        print(f"Already have {dest}, loading!")
        page_html = dest.read_text(encoding="utf-8")
        
    else: # ... fetch it
        page_url = BASE_URL + "?page=" + str(i + 1)
        print("Fetching " + page_url)
        page_html = requests.get(page_url).text
        
        # ... and then save it to file
        dest.write_text(page_html, encoding="utf-8")
            
    # Name the parser explicitly so BeautifulSoup doesn't have to guess one
    page_soup = BeautifulSoup(page_html, "html.parser")
    heading = page_soup.select("h3")[0]
    print(heading.text)

Already have table-pages/1.html, loading!
Page 1 of 23
Already have table-pages/2.html, loading!
Page 2 of 23
Already have table-pages/3.html, loading!
Page 3 of 23
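
If you find yourself repeating this pattern, the same steps can be wrapped in a small function. The sketch below is one way to do it, not the only one: `fetch_cached` is a made-up name, and it takes the fetching step as a `fetch` argument (any function that turns a URL into text) so the example can run here with a stand-in instead of a real request. It also creates the cache directory itself, so the `mkdir` step isn't needed.

```python
import tempfile
from pathlib import Path


def fetch_cached(url, dest, fetch):
    """Return the page text for url, reading from dest if it is already cached."""
    dest = Path(dest)
    if dest.exists():
        return dest.read_text(encoding="utf-8")
    page_html = fetch(url)
    # Create the cache directory on first use.
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(page_html, encoding="utf-8")
    return page_html


# Stand-in fetcher (hypothetical) that records how often it is called.
calls = []


def fake_fetch(url):
    calls.append(url)
    return "<h3>Page 1 of 23</h3>"


with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "table-pages" / "1.html"
    first = fetch_cached("https://example.com/?page=1", dest, fake_fetch)
    second = fetch_cached("https://example.com/?page=1", dest, fake_fetch)

# Only the first call reached the fetcher; the second was served from the file.
print(len(calls))
```

With `requests`, the call would look like `fetch_cached(page_url, dest, lambda url: requests.get(url).text)`.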


## Catching HTTP errors 🥴

So far, we've been dealing with reliable servers. But sometimes they get overloaded or malfunction.

There is a wide variety of errors, and of ways to handle them, but we'll focus on one specific and common case here: a server that sometimes responds with a `500` status code instead of the page we asked for.

In [7]:
flaky_url = "https://scraping-practice-jsvine.vercel.app/launches/paginated/flaky/"

requests.get(flaky_url)

<Response [500]>

In [8]:
response = requests.get(flaky_url)
print(response.status_code)

200
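
The same URL returned `500` once and `200` the next time. Whether a failed request is worth retrying depends on the status code. The sketch below shows one rough rule of thumb (an assumption for this example, not a standard): treat `429` and the `5xx` codes as temporary, and everything else outside the `2xx` range as not worth retrying. The function name `classify_status` is made up.

```python
from http import HTTPStatus


def classify_status(status_code):
    """Return 'ok', 'retry', or 'give up' for an HTTP status code."""
    if 200 <= status_code < 300:
        return "ok"
    # 429 (Too Many Requests) and 5xx codes usually signal a temporary problem.
    if status_code == 429 or 500 <= status_code < 600:
        return "retry"
    # Anything else (e.g. 404) is unlikely to succeed on a second try.
    return "give up"


for code in (200, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", classify_status(code))
```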


Here's what happens when we try to scrape a flaky server with no safeguards:

In [9]:
for i in range(3):
    page_url = flaky_url + "?page=" + str(i + 1)
    print("Fetching " + page_url)
    page_html = requests.get(page_url).text            
    page_soup = BeautifulSoup(page_html, "html.parser")
    heading = page_soup.select("h3")[0]
    print(heading.text)

Fetching https://scraping-practice-jsvine.vercel.app/launches/paginated/flaky/?page=1


IndexError: list index out of range

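The response had no `h3` element (most likely because the server sent back an error page instead of the table page), so `select("h3")` returned an empty list and indexing it raised the `IndexError`.

And here's *one* way we can fix that: keep requesting the page until the server responds with status `200`.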

In [None]:
for i in range(3):
    page_url = flaky_url + "?page=" + str(i + 1)
    while True:
        print("Fetching " + page_url)
        page_response = requests.get(page_url)
        if page_response.status_code == 200:
            break
        else:
            print("Sleeping and then trying again...")
            sleep(3)
            
    page_html = page_response.text            
    page_soup = BeautifulSoup(page_html, "html.parser")
    heading = page_soup.select("h3")[0]
    print(heading.text)
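
The loop above keeps trying forever, which is fine for a server that eventually recovers but not for one that is down for good. A possible refinement, sketched here under a few assumptions, caps the number of attempts and doubles the wait after each failure. `fetch_with_retries`, `FakeResponse`, and `make_flaky_fetch` are names invented for this example; the fake server fails twice and then succeeds, so the sketch runs without a network.

```python
from time import sleep


def fetch_with_retries(fetch, url, max_attempts=5, base_delay=1.0, sleeper=sleep):
    """Call fetch(url) until it returns status 200, at most max_attempts times.

    Waits base_delay, then 2 * base_delay, then 4 * base_delay, and so on
    between attempts. Raises RuntimeError if every attempt fails.
    """
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code == 200:
            return response
        if attempt < max_attempts - 1:
            delay = base_delay * 2 ** attempt
            print(f"Got {response.status_code}; sleeping {delay}s and then trying again...")
            sleeper(delay)
    raise RuntimeError(f"Giving up on {url} after {max_attempts} attempts")


# Stand-in for a flaky server (hypothetical), so no real requests are made.
class FakeResponse:
    def __init__(self, status_code, text=""):
        self.status_code = status_code
        self.text = text


def make_flaky_fetch(failures):
    """Return a fetch function that fails `failures` times, then succeeds."""
    remaining = [failures]

    def fetch(url):
        if remaining[0] > 0:
            remaining[0] -= 1
            return FakeResponse(500)
        return FakeResponse(200, "<h3>Page 1 of 23</h3>")

    return fetch


# Record the delays instead of actually sleeping, to keep the demo fast.
delays = []
response = fetch_with_retries(
    make_flaky_fetch(2), "https://example.com/?page=1", sleeper=delays.append
)
print(response.status_code, delays)
```

With `requests`, the real call would be `fetch_with_retries(requests.get, page_url)`, and `response.text` would then go to `BeautifulSoup` as before.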

---
