# Web scraping example

This notebook shows how to use web scraping to retrieve price data.

Specifically, it uses the test website [Books to scrape](https://books.toscrape.com/) to demonstrate how to gather data from a website, including navigation through its pages. We used the browser's inspection tools to analyze the HTML structure of the webpages and define the paths for data extraction. The techniques and style used here are only one example; you could achieve the same result in many different ways. The structure is also specific to this website, so you will need to adapt this code for others.

There are additional resources you may explore in order to get some practice with different website architectures before applying your efforts to real websites:

- [Web Scraper test sites](https://www.webscraper.io/test-sites)
- [web-scraping.dev](https://www.web-scraping.dev/)

You can run this notebook in your own environment, provided you install the Python packages listed in the requirements.txt file. You can also run it in a cloud environment such as Google Colab. The minimum Python version tested with this notebook is 3.9; we suggest 3.10 or newer.

Please be considerate in your web scraping operations and always include a delay between calls to avoid overloading the source website.
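
One way to apply such a delay is a small helper that sleeps for a random interval before each request. This is a minimal sketch: `polite_pause` and its default bounds are hypothetical choices for illustration, and the `sleep` parameter exists only so the pause can be skipped in a dry run.

```python
import random
import time


def polite_pause(min_seconds: float = 1.0, max_seconds: float = 1.5, sleep=time.sleep) -> float:
    """Sleep for a random duration between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Dry run with a no-op sleep, so the example finishes immediately
delay = polite_pause(sleep=lambda seconds: None)
print(f"Would have waited {delay:.2f} seconds")
```

Randomizing the interval spreads requests less regularly than a fixed delay; the bounds should be adapted to what the source website can reasonably handle.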

## Import libraries

In [35]:
import time
import random
import urllib.parse
from datetime import datetime
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Setup variables

Request headers are the primary way a website operator can identify your requests. It is good practice to use a User-Agent name that identifies your institution and to provide an email address where you can be contacted.

In [3]:
heads = {
    'User-Agent':'Scraping Project Name', # Change with the name of your project or institution
    'email': 'your.email@institution.gov' # Change with your own email address
    }

s = requests.Session()
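
Headers can also be attached to the session once, so that every request made through it carries them. The sketch below uses the same placeholder values as above; the variable name `session` is only illustrative. It also uses `From`, the standard HTTP header for a contact address, whereas a custom `email` header is transmitted but may not be looked at by operators.

```python
import requests

session = requests.Session()
# Headers set here are sent with every request made through this session
session.headers.update({
    "User-Agent": "Scraping Project Name",  # Change to your project or institution
    "From": "your.email@institution.gov",   # Change to your own email address
})

print(session.headers["User-Agent"])
```

With this setup, calls such as `session.get(url)` no longer need the `headers` argument.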

In [2]:
shop_url = "https://books.toscrape.com/"

## Acquire category links

Get the homepage

In [19]:
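with s.get(shop_url, headers=heads) as res:
    # The page is UTF-8, but requests may fall back to ISO-8859-1 and garble "£"
    res.encoding = "utf-8"
    response = BeautifulSoup(res.text, "html.parser")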

Get the list of categories, excluding the full catalogue (which is in a different `ul` element).

In [20]:
categories = [  # We use a list comprehension to iterate over the various elements
                # You could also use a regular for-loop for the same purpose
    {
        "category": item.get_text().strip(), # We use "strip" to remove white space around the text
        "url": urllib.parse.urljoin(shop_url, item.get("href"))  # Links in the menu are relative, we need to add the website root
    }
    for item in
    response.find("ul", class_="nav-list").find("ul").find_all("a")
]

Example category

In [21]:
categories[0]

{'category': 'Travel',
 'url': 'https://books.toscrape.com/catalogue/category/books/travel_2/index.html'}
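
To see how `urllib.parse.urljoin` resolves relative links, the sketch below joins three kinds of relative paths to a base URL. The `page-2.html` link appears on the site's category pages; the other two `href` values are illustrative assumptions about what the site's links look like.

```python
from urllib.parse import urljoin

root = "https://books.toscrape.com/"
category_page = "https://books.toscrape.com/catalogue/category/books/mystery_3/index.html"

# A path relative to the site root (illustrative menu-style href)
menu_link = urljoin(root, "catalogue/category/books/travel_2/index.html")

# A sibling file replaces the last path segment of the base URL
next_page = urljoin(category_page, "page-2.html")

# Each "../" climbs one directory up from the base URL (illustrative href)
product_link = urljoin(category_page, "../../../sharp-objects_997/index.html")

print(menu_link)
print(next_page)
print(product_link)
```

The base URL matters: the same relative link resolves differently depending on the page it was found on, which is why product and pagination links are joined to the category URL rather than to the website root.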

## Acquire product data across categories

### Exploration

Before structuring the data acquisition with functions and loops, it is common practice to manually explore a few web pages to check the data structure and organization.

In [22]:
with s.get(categories[1].get("url"), headers=heads) as res:
    # The page is UTF-8, but requests may fall back to ISO-8859-1 and garble "£"
    res.encoding = "utf-8"
    response = BeautifulSoup(res.text, "html.parser")

Get product data

In [29]:
products = [
    {
        "name": item.find("h3").find("a").get("title"), # text inside the tag is truncated in some instances
        "price": item.find("p", {"class": "price_color"}).get_text(),
        "link": urllib.parse.urljoin(  # Also in this case links are relative
            categories[1].get("url"),
            item.find("h3").find("a").get("href")),
        "category": categories[1].get("category")
    }
    for item in
    response.find_all("article", {"class": "product_pod"})
]

In [30]:
products[:4]

[{'name': 'Sharp Objects',
  'price': '£47.82',
  'link': 'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
  'category': 'Mystery'},
 {'name': 'In a Dark, Dark Wood',
  'price': '£19.63',
  'link': 'https://books.toscrape.com/catalogue/in-a-dark-dark-wood_963/index.html',
  'category': 'Mystery'},
 {'name': 'The Past Never Ends',
  'price': '£56.50',
  'link': 'https://books.toscrape.com/catalogue/the-past-never-ends_942/index.html',
  'category': 'Mystery'},
 {'name': 'A Murder in Time',
  'price': '£16.64',
  'link': 'https://books.toscrape.com/catalogue/a-murder-in-time_877/index.html',
  'category': 'Mystery'}]
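
The price is scraped as text that includes the currency symbol, so it usually needs converting to a number before analysis. The sketch below is one possible approach, assuming prices use a dot as the decimal separator as on this site; `parse_price` is a hypothetical helper. It keeps only the numeric part, so it also tolerates a stray mis-decoded character in front of the pound sign.

```python
import re
from decimal import Decimal


def parse_price(text: str) -> Decimal:
    """Extract the first number found in a price string as a Decimal."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"No numeric price found in {text!r}")
    return Decimal(match.group())


# Works with a clean currency symbol and with a mis-decoded one
prices = [parse_price(raw) for raw in ["£47.82", "Â£19.63"]]
print(prices)
```

`Decimal` avoids the rounding surprises of binary floating point, which is convenient for monetary values; `float(match.group())` would be a simpler alternative if exactness is not a concern.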

Find pagination link for next page

In [26]:
response.find("ul", {"class": "pager"}).find("li", {"class": "next"}).find("a").get("href")

'page-2.html'

In [27]:
next_page = urllib.parse.urljoin( # Join relative link
    categories[1].get("url"),
    response.find("ul", {"class": "pager"}).find("li", {"class": "next"}).find("a").get("href"))

In [28]:
next_page

'https://books.toscrape.com/catalogue/category/books/mystery_3/page-2.html'

### Systematic data acquisition

After the exploration, it is common practice to arrange the data acquisition using functions and loops. The function below acquires data from a category page and recursively extends the acquisition to the following pages, if they exist.

In [40]:
def get_product_data(category: dict, s: requests.Session, heads: dict) -> list:
    """
    Get product data from a category page and recursively extend
    the acquisition to following pages if they exist.
    Arguments:
    category (dict): dictionary with category name and url
    s (requests.Session): session object
    heads (dict): request headers

    Returns:
    results: list of product data
    """
    # Random delay to prevent server overloading
    time.sleep(random.randint(10, 15)/10)
    # Get category webpage
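    with s.get(category.get("url"), headers=heads) as res:
        # The page is UTF-8, but requests may fall back to ISO-8859-1 and garble "£"
        res.encoding = "utf-8"
        response = BeautifulSoup(res.text, "html.parser")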
    # Extract product data
    results = [
        {
            "name": item.find("h3").find("a").get("title"), # text inside the tag is truncated in some instances
            "price": item.find("p", {"class": "price_color"}).get_text(),
            "link": urllib.parse.urljoin(  # Also in this case links are relative
                category.get("url"),
                item.find("h3").find("a").get("href")),
            "category": category.get("category"),
            "date": datetime.now().strftime("%Y-%m-%d"),
        }
        for item in
        response.find_all("article", {"class": "product_pod"})
    ]
    # Check if there is a next page, navigating the structure step by step
    # First check if there is a pagination area
    next_page = response.find("ul", {"class": "pager"})
    if next_page is not None:
        next_page = next_page.find("li", {"class": "next"})
        # Second, check if there is a next page in the navigation
        if next_page is not None:
            # Compose next page URL
            new_url = urllib.parse.urljoin( # Join relative link
                category.get("url"),
                next_page.find("a").get("href"))
            # Acquire data from next page
            new_results = get_product_data(
                # Update category object with next page url
                # The key must be "category", as read above, or later pages lose their category
                category={"category": category.get("category"), "url": new_url},
                s=s,
                heads=heads)
            # Join results
            results.extend(new_results)
    return results
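
Recursion is fine for categories with a handful of pages. For listings with very many pages, a loop is an alternative that does not depend on Python's recursion limit. The sketch below shows only the navigation logic: `iter_page_urls` and `get_next_href` are hypothetical names, and a dictionary of made-up links on `example.com` stands in for real HTTP requests. In the notebook, `get_next_href` would fetch the page and return the `href` of the "next" link, or `None` when there is none.

```python
from typing import Callable, Iterator, Optional
from urllib.parse import urljoin


def iter_page_urls(
    start_url: str,
    get_next_href: Callable[[str], Optional[str]],
    max_pages: int = 1000,
) -> Iterator[str]:
    """Yield the URL of each page in a paginated listing, in order."""
    url: Optional[str] = start_url
    seen = set()
    # Stop when there is no next link, on a repeated URL, or at the page cap
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        yield url
        href = get_next_href(url)  # relative link to the next page, or None
        url = urljoin(url, href) if href else None


# Hypothetical three-page category, standing in for real HTTP requests
BASE = "https://example.com/books/mystery/"
FAKE_NEXT_LINKS = {
    BASE + "index.html": "page-2.html",
    BASE + "page-2.html": "page-3.html",
    BASE + "page-3.html": None,
}

pages = list(iter_page_urls(BASE + "index.html", FAKE_NEXT_LINKS.get))
print(pages)
```

The `seen` set and the `max_pages` cap are safeguards against pagination links that loop back on themselves.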


Loop over all categories to acquire their data

In [41]:
products = []
for category in categories:
    # Random delay to prevent server overloading
    time.sleep(random.randint(10, 15)/10)
    new_products = get_product_data(category, s, heads)
    products.extend(new_products)

Arrange the results in a DataFrame and save

In [43]:
data_df = pd.DataFrame(products)
# Save file with current date in the name
data_df.to_csv("book_prices_{}.csv".format(datetime.now().strftime("%Y-%m-%d")), index=False)

## Next steps

You may have seen that each book's webpage contains additional information, such as the ISBN code (which uniquely identifies each book), the number of units in stock, and more. If this information is important to you, you may want to extend the data acquisition to each product page and enrich your data.
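
As a starting point, the sketch below reads a two-column attribute table into a dictionary. It assumes the product page lists its attributes in table rows made of a `th` label and a `td` value; the sample HTML, including the row labels and values, is made up for illustration, so check the real structure with the browser inspection tools first. `parse_product_table` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for the attribute table of a product page
SAMPLE_HTML = """
<table class="table table-striped">
  <tr><th>UPC</th><td>abc123</td></tr>
  <tr><th>Availability</th><td>In stock (22 available)</td></tr>
</table>
"""


def parse_product_table(html: str) -> dict:
    """Map each row label (th) to its value (td) in the first table found."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return {}
    details = {}
    for row in table.find_all("tr"):
        header, value = row.find("th"), row.find("td")
        # Skip rows that do not follow the label/value layout
        if header is not None and value is not None:
            details[header.get_text(strip=True)] = value.get_text(strip=True)
    return details


details = parse_product_table(SAMPLE_HTML)
print(details)
```

The resulting dictionary could be merged into the product record collected from the category page, using the product link as the key.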

However, you should also consider the additional load those extra calls would impose on the source website. Since product attributes rarely change, a more balanced approach may be to acquire them at a lower frequency (for instance, once a month) rather than daily.
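
One simple way to implement this is to record when each product's attributes were last acquired and to revisit the product page only once that record is old enough. This is a minimal sketch: `attributes_due` and the 30-day threshold are hypothetical choices, and how the last acquisition date is stored is left open.

```python
from datetime import date, timedelta
from typing import Optional


def attributes_due(last_fetched: Optional[date], today: date, max_age_days: int = 30) -> bool:
    """Return True if the product attributes were never fetched or are too old."""
    if last_fetched is None:
        return True
    return today - last_fetched >= timedelta(days=max_age_days)


today = date(2024, 2, 15)
print(attributes_due(None, today))               # never fetched
print(attributes_due(date(2024, 1, 1), today))   # fetched 45 days earlier
print(attributes_due(date(2024, 2, 10), today))  # fetched 5 days earlier
```

Prices would still be collected on every run from the category pages, while product pages would be requested only when this check returns `True`.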