# Beyond INST126: Web Scraping

In this class activity, we will work on a concrete web scraping task. We will use a special website called [toscrape.com](//toscrape.com) which offers two scraping sandbox environments:
1. **[books.toscrape.com](//books.toscrape.com)** &ndash; a fictional bookstore using static pages and pagination,
2. **[quotes.toscrape.com](//quotes.toscrape.com)** &ndash; a fictional repository of famous quotes, using API endpoints and asynchronous requests.

For this activity we will focus **on the book store (books.toscrape.com)** only.

## Getting started

Visit the bookstore and browse through a few pages of results. Locate a book price and observe the HTML code associated with it. You will need to know what the HTML looks like in order to extract the price information. To inspect the HTML of a page in Firefox or Google Chrome, right-click anywhere on its background (but not on an image), and click `Inspect` or `View Page Source` (the names of these options may differ depending on your browser).

Once you have a good idea of what the underlying HTML looks like, start working on the two tasks below.

In [None]:
# Write down your roles
DRIVER = ""
NAVIGATOR = ""

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

---

# Reference -- keep this open while working on the tasks!

- [Quickstart guide to Requests](https://requests.readthedocs.io/en/latest/user/quickstart/)
- **Cheat sheet of common HTML elements** with examples, split into:
  + [Inline elements](https://developer.mozilla.org/en-US/docs/Learn/HTML/Cheatsheet#inline_elements), like links or images;
  + [Block elements](https://developer.mozilla.org/en-US/docs/Learn/HTML/Cheatsheet#block_elements), like paragraphs or tables.

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

---

## Task 1: scrape a single page of results

Write a function called `scrapepage` that scrapes all the prices listed on a single page of the bookstore.

Your function should take one parameter &ndash; the URL (as a string) of the page to scrape. It should return a list with all the book prices present on the page, each converted to a `float`.

### Hints

1. The text of the response is available in an attribute called `text`. Consider using a `for` loop to iterate through it. As this is a single string (with `\n` separators), you will first need to `.split()` it into multiple lines.
2. To remove the HTML markup surrounding the price, you can again make use of the `.split()` method of the string type.
3. If you see the symbol `Â` being printed in the price, then make sure to specify the encoding before extracting the price, like this:
```
    response = requests.get("https://books.toscrape.com/")
    response.encoding = 'utf-8'
```
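The hints above can be tried out offline before touching the real site. The sketch below is only an illustration: the HTML in `sample_html` is a made-up sample (not taken from the bookstore), and the variable names are assumptions. It shows splitting a string into lines, pulling out the text between a `>` and the next `<`, and one way the stray `Â` can appear when UTF-8 bytes are read with the wrong encoding.

```python
# Offline sketch: no network access, the HTML below is a made-up sample.
sample_html = (
    '<ul>\n'
    '  <li class="item">first</li>\n'
    '  <li class="other">second</li>\n'
    '  <li class="item">third</li>\n'
    '</ul>'
)

items = []
for line in sample_html.split("\n"):      # one string -> list of lines
    if 'class="item"' in line:            # keep only the lines we care about
        # Take the text between the first ">" and the "<" that follows it
        text = line.split(">")[1].split("<")[0]
        items.append(text)

print(items)  # ['first', 'third']

# Why "Â" can show up: the UTF-8 bytes of "£" decoded with the wrong encoding
garbled = "£".encode("utf-8").decode("latin-1")
print(garbled)  # Â£
```

The same pattern should carry over to the price lines, once you have identified a piece of text that appears only on those lines.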

In [None]:
import requests 

# Your solution here
...

# This will scrape the front page of the bookstore
scrapepage("https://books.toscrape.com")

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

---

In [None]:
# Now switch roles
DRIVER = ""
NAVIGATOR = ""

## Task 2

Write a function called `averageprice` that computes the average price of books found on multiple pages of the bookstore. Your function should call the `scrapepage` function defined above to fetch one page at a time, scrape the book prices, and add them to a final list. Then, it should compute the average price. (For this last step you can convert the list to a Pandas series and make use of the `.mean()` method.)

A list with the URLs of ten pages is already provided for you in the cell below. If all goes well, using your function on those URLs should result in an average price of `34.79625`.
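One detail worth checking when collecting prices from several pages is how the per-page lists are combined. The sketch below is a minimal illustration, not the expected solution: `fake_scrapepage` is a hypothetical stand-in that returns hard-coded prices so the example runs without network access, and the page names are invented.

```python
# Hypothetical stand-in for scrapepage, so the sketch runs without network access.
def fake_scrapepage(url):
    pages = {"page-1": [10.0, 20.0], "page-2": [30.0]}
    return pages[url]

all_prices = []
for url in ["page-1", "page-2"]:
    # extend adds each price individually; append would nest the whole list
    all_prices.extend(fake_scrapepage(url))

average = sum(all_prices) / len(all_prices)
print(all_prices)  # [10.0, 20.0, 30.0]
print(average)     # 20.0
```

Converting `all_prices` to a Pandas series and calling `.mean()` on it would give the same value here.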

In [None]:
import requests 
import pandas as pd

# Your solution here
...

ten_locations = [
    "https://books.toscrape.com/",
    "https://books.toscrape.com/catalogue/page-2.html",
    "https://books.toscrape.com/catalogue/page-3.html",
    "https://books.toscrape.com/catalogue/page-4.html",
    "https://books.toscrape.com/catalogue/page-5.html",
    "https://books.toscrape.com/catalogue/page-6.html",
    "https://books.toscrape.com/catalogue/page-7.html",
    "https://books.toscrape.com/catalogue/page-8.html",
    "https://books.toscrape.com/catalogue/page-9.html",
    "https://books.toscrape.com/catalogue/page-10.html"
]

averageprice(ten_locations)

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

---

# Solutions + Extra Challenge

## Task 1

In [None]:
import requests 

### BEGIN SOLUTION
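def scrapepage(url):
    """Return the list of book prices (as floats) found on the page at url."""
    prices = []
    response = requests.get(url)
    # Force UTF-8 so the pound sign is not decoded as "Â£"
    response.encoding = "utf-8"
    for line in response.text.split("\n"):
        # Lines holding a price carry the "price_color" CSS class
        if 'price_color' in line:
            line = line.strip()
            # Keep the text between the first ">" and the following "<"
            price = line.split(">")[1].split("<")[0]
            # Drop the currency symbol before converting
            price = price.strip("£")
            prices.append(float(price))
    return prices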
### END SOLUTION

# This will scrape the front page of the bookstore
scrapepage("https://books.toscrape.com")

## Task 2

In [None]:
import requests 
import pandas as pd

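### BEGIN SOLUTION
def averageprice(locations):
    """Return the average book price across all pages in locations."""
    prices = []
    for url in locations:
        # scrapepage returns a list, so extend (not append) keeps prices flat
        prices.extend(scrapepage(url))
    s = pd.Series(prices)
    return s.mean()
### END SOLUTION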

ten_locations = [
    "https://books.toscrape.com/",
    "https://books.toscrape.com/catalogue/page-2.html",
    "https://books.toscrape.com/catalogue/page-3.html",
    "https://books.toscrape.com/catalogue/page-4.html",
    "https://books.toscrape.com/catalogue/page-5.html",
    "https://books.toscrape.com/catalogue/page-6.html",
    "https://books.toscrape.com/catalogue/page-7.html",
    "https://books.toscrape.com/catalogue/page-8.html",
    "https://books.toscrape.com/catalogue/page-9.html",
    "https://books.toscrape.com/catalogue/page-10.html"
]

averageprice(ten_locations)

## Extra Challenge (no solution provided)

Modify the `averageprice` function so that, instead of receiving a fixed list of URLs to scrape, it identifies the next URL to visit automatically. To do so, you can take advantage of the fact that, at the bottom of each page, there is a &ldquo;Next&rdquo; button pointing to the next page in the sequence.
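
One building block that may help, without giving the challenge away: links in HTML are often written relative to the current page, so the address found in a link usually has to be combined with the URL of the page it came from. The sketch below uses `urljoin` from the standard library; the `href` values are hypothetical examples, on the assumption that the &ldquo;Next&rdquo; link is relative.

```python
from urllib.parse import urljoin

# Hypothetical href values, assumed to be relative like a typical "Next" link.
front = urljoin("https://books.toscrape.com/", "catalogue/page-2.html")
later = urljoin("https://books.toscrape.com/catalogue/page-2.html", "page-3.html")

print(front)  # https://books.toscrape.com/catalogue/page-2.html
print(later)  # https://books.toscrape.com/catalogue/page-3.html
```

Finding the `href` itself in the page text, and deciding when to stop, is left as part of the challenge.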