# Web Scraping 101 (oDCM)

*In the Webdata for Dummies tutorial we had a first brief look at scraping elements from a single web page (the "A Light in the Attic" book). In this tutorial we extend this idea by collecting information from multiple pages, exporting the data to CSV, and running Python files from the command line.*

--- 

## Learning Objectives

Students will be able to: 
* Generate a seed by scraping and parsing URLs from a parent object
* Select data for extraction on a website using CSS selectors
* Loop through a list of URLs to capture data in bulk using functions
* Write a dictionary of data to a CSV file (enriched with metadata)
* Run an entire data collection from a Python script on the command line






--- 

## Acknowledgements
This course draws on online resources built by Brian Keegan, Colt Steele, David Amos, Hannah Cushman Garland, Kimberly Fessel, and Thomas Laetsch. 


--- 

## Contact
For technical issues try to be as specific as possible (e.g., include screenshots, your notebook, errors) so that we can help you better.

**WhatsApp**  
+31 13 466 8938

**Email**  
odcm@uvt.nl

---

## 1. Generating Seeds


### 1.1 Collecting Links


In web scraping we typically refer to a "seed" as a starting point for data collection. For example, we can first scrape all book links from the overview page and then iterate over all links to scrape the product description (or anything else on that page). In this case we pick the [book catalogue](https://books.toscrape.com/catalogue/category/books_1/index.html) as our seed. From this page there are two ways to move towards a book page: either by clicking on the book cover or on the title of the book (figure below). 

<img src="images/books_links.png" align="left" width=80%/>

This also becomes clear once we inspect the underlying HTML code with Google Chrome Inspector. The thumbnails (`<img>`) are surrounded by `<a>` tags which contain a link (`href`) to the book. Also, within the book headers (`<h3>`) we find nested links (`<a>`) to the book pages:

<img src="images/inspector_links.png" align="left" width=90%/>

Previously, we got away with selecting elements by tag (e.g., `<h2>`) but this time we'll run into problems if we try to filter down on `<a>` tags (i.e., links). Why? Because the overview page also contains `<a>` elements we are not interested in: 

* "Books to Scrape" link at the top
* "Home" breadcrumb link 
* Left sidebar with all book genres (e.g., Travel)
* The next button at the bottom

For that reason, we need to be more specific so that we only scrape the links to the book pages and ignore all other `<a>` tags. Let's briefly revisit the notion of HTML classes. A class is often used as a reference in the code, for example, to make all text elements with a given class blue or to increase their font size. In the screenshot above you find an `<article>` tag with class `product_pod` in which a `<div>` is nested that contains the image and the link attribute we're after. Every link to a book is nested within an element of this class, but the aforementioned `<a>` tags on other parts of the page are not. Thus, if we can tell our scraper that we're only interested in the `<a>` tags within the `product_pod` class, we end up with our desired selection of links.
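
To make this idea concrete, here is a minimal offline sketch. The HTML snippet is hand-written and only imitates the structure of the overview page (an assumption, not the real page source). It also shows `.select()` with a CSS selector as one possible alternative way of expressing "only the `<a>` tags inside a `product_pod` header".

```python
from bs4 import BeautifulSoup

# hand-written HTML that imitates the overview page (an assumption for illustration)
html = """
<a href="index.html">Books to Scrape</a>
<article class="product_pod">
  <div class="image_container">
    <a href="../../a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic"></a>
  </div>
  <h3><a href="../../a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# every <a> tag in the snippet, including the link at the top
all_links = soup.find_all("a")

# CSS selector: <a> tags inside an <h3> inside an <article> with class product_pod
book_links = soup.select("article.product_pod h3 a")

print(len(all_links))         # 3
print(len(book_links))        # 1
print(book_links[0]["href"])  # ../../a-light-in-the-attic_1000/index.html
```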

Again, we'll use `.find_all()` to capture all matching elements on the page. The difference, however, is that we specify a class (`class_=`) rather than an HTML tag, together with the class name we need (`product_pod`). This returns a list with all 20 `product_pod` elements on the page (i.e., one for each book). In this example, we pick the first book from the list (A Light in the Attic, element `[0]` from the list) and extract the first `<a>` tag nested within the `product_pod` element. Finally, we pull out the `href` attribute from the `<a>` tag which gives us the book link.

In [42]:
import requests
from bs4 import BeautifulSoup

# make a get request to the books overview page (see Webdata for Dummies tutorial)
url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")  # name the parser explicitly to avoid a warning

# return the href attribute in the <a> tag nested within the first product class element
soup.find_all(class_="product_pod")[0].find("a").attrs["href"]

'../../a-light-in-the-attic_1000/index.html'

Note the `../../` in front of the link, which tells the browser to go up two directories from the directory of the current page:
* Current URL: https://books.toscrape.com/catalogue/category/books_1/index.html
* Current directory: https://books.toscrape.com/catalogue/category/books_1/
* 1 step back: https://books.toscrape.com/catalogue/category/
* 2 steps back: https://books.toscrape.com/catalogue/

Thereafter, it appends `a-light-in-the-attic_1000/index.html` to the URL which forms the full link to the [A Light in the Attic](https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html) book. 
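
As a small sketch of how this resolution works, Python's standard library offers `urljoin()`, which combines the URL of the current page with a relative link in the same way a browser would. The variable names below are chosen for illustration only.

```python
from urllib.parse import urljoin

# the page on which the relative link was found
page_url = "https://books.toscrape.com/catalogue/category/books_1/index.html"
relative_link = "../../a-light-in-the-attic_1000/index.html"

# urljoin resolves the two "../" steps against the directory of page_url
full_link = urljoin(page_url, relative_link)
print(full_link)
# https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
```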

#### Exercise 1
1. Extract the link from the second book (Tipping the Velvet) using BeautifulSoup.
2. Create a new variable `book_url` that concatenates the base URL (`https://books.toscrape.com/catalogue/`) and the string you extracted in the previous exercise. Use slicing to remove the `../../` part in between. The final output should be: `https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html` 

### 1.2 Iterating over items

Ideally, we'd like our code to extract the URL of every book (not just a single one). Hence, we store all `product_pod` elements in a variable `books`, loop over them, and append the link of each book to a list `book_urls`. 

In [50]:
# list of all books on the overview page
books = soup.find_all(class_="product_pod")
book_urls = []

for book in books: 
    book_url = book.find("a").attrs["href"]
    book_urls.append(book_url)
    
# print the first five urls
print(book_urls[0:5])

['../../a-light-in-the-attic_1000/index.html', '../../tipping-the-velvet_999/index.html', '../../soumission_998/index.html', '../../sharp-objects_997/index.html', '../../sapiens-a-brief-history-of-humankind_996/index.html']


In practice, it may be more convenient to create a dictionary in which the `book_title` is the key and the `book_url` the value. This way it is more intuitive to look up the URL of a given book because you don't have to remember its exact position in the list but can simply pass the title of the book. 

In the Chrome Inspector screenshot at the beginning of this section, you can see that the book title is stored in the `alt` attribute of the `<img>` tag (as well as in the `title` attribute of the second `<a>` tag). Using a similar approach as above, we collect the `book_title` and `book_url` of each book and use these records to update `book_dict`.

In [52]:
book_dict = {}

for book in books: 
    book_title = book.find("img").attrs["alt"] 
    book_url = book.find("a").attrs["href"]
    book_dict[book_title] = book_url

As a result, we can simply pass the book title (mind the capitals!) to the dictionary to obtain the corresponding URL.

In [54]:
print(book_dict['A Light in the Attic'])

../../a-light-in-the-attic_1000/index.html


#### Exercise 2
1. As in exercise 1.2, write a program that transforms the relative URLs (`../..`) in `book_dict` into full URLs. Tip: you can use `for key, value in book_dict.items():` to iterate over the key-value pairs in the dictionary and update the URLs accordingly. 
2. One of the books on `books.toscrape.com` is [Black Dust](https://books.toscrape.com/catalogue/black-dust_976/index.html). What happens once you pass this title as a key to `book_dict`? Why is that? 

### 1.3 Page Navigation
The [`books.toscrape.com`](https://books.toscrape.com/catalogue/category/books_1/index.html) website contains 1000 books which are spread across 50 pages. At the bottom of the page, you can click on "next" to move to the next page. 

<img src="images/books.png" align="left" width=80%/>

Once you repeat this a couple of times, you get the following pattern of URLs:

`https://books.toscrape.com/catalogue/category/books_1/page-1.html`
`https://books.toscrape.com/catalogue/category/books_1/page-2.html`
`https://books.toscrape.com/catalogue/category/books_1/page-3.html`

Can you guess the next one? Indeed, the URL can be divided into a fixed base URL (`https://books.toscrape.com/catalogue/category/books_1/`) and a counter that depends on the page you're visiting (e.g., `page-1.html`). Now let's create a list of all 50 URLs! First, we create an f-string `counter_url` in which we can change the `counter` variable. Next, we concatenate the `base_url` and the `counter_url` to get the `full_url`. 



In [68]:
base_url = "https://books.toscrape.com/catalogue/category/books_1/"
counter = 1
counter_url = f"page-{counter}.html"
full_url = base_url + counter_url 
print(full_url)

https://books.toscrape.com/catalogue/category/books_1/page-1.html


In a similar fashion, we generate a list of 50 `page_urls` with a for loop over `range(1, 51)`. The stop value of `range()` is exclusive, so the counter runs from 1 up to and including 50 (not 51!). 

In [66]:
base_url = "https://books.toscrape.com/catalogue/category/books_1/"
page_urls = []

for counter in range(1, 51):
    counter_url = f"page-{counter}.html"
    full_url = base_url + counter_url 
    page_urls.append(full_url)

As expected, this gives a list of all page urls that contain books. 

In [77]:
# print the number of page urls in the list
print(f"The number of page urls in the list is: {len(page_urls)}")

The number of page urls in the list is: 50
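
As a side note, the same list can be built more compactly with a list comprehension. The sketch below is an alternative formulation of the loop above, not a replacement for it, and it also prints the first and last URL as a quick sanity check.

```python
base_url = "https://books.toscrape.com/catalogue/category/books_1/"

# one f-string per counter value from 1 up to and including 50
page_urls = [f"{base_url}page-{counter}.html" for counter in range(1, 51)]

print(len(page_urls))  # 50
print(page_urls[0])    # https://books.toscrape.com/catalogue/category/books_1/page-1.html
print(page_urls[-1])   # https://books.toscrape.com/catalogue/category/books_1/page-50.html
```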


#### Exercise 3
In this exercise, we practise generating a seed for another website, [`quotes.toscrape.com`](https://quotes.toscrape.com/), which displays 100 famous quotes from GoodReads categorized by tag. 

<img src="images/quotes.png" align="left" width=70% style="border: 1px solid black" />

1. Make yourself comfortable with how the [site](https://quotes.toscrape.com) works and ask yourself questions such as: how does the navigation work, how many pages are there, what is the base URL, and how does it change if I move to the next page?
2. Generate a list `quote_pager_urls` that contains the page urls we need if we'd like to scrape all 100 quotes.

## 2. Data Extraction


### 2.1 Timers
Next, we combine the concepts from sections 1.2 and 1.3 to extract the book URLs across different pages. But before we start implementing this, we need to realize that sending many requests in quick succession can overload a server. Therefore, it's highly recommended to pause between requests rather than sending them all at once. This reduces the risk that your IP address (i.e., the numerical label assigned to each device connected to the internet) gets blocked, after which you can no longer visit (and scrape) the website. 

In Python you can import the `sleep` function from the `time` module, which pauses the execution of subsequent commands for a given amount of time. For example, the print statement after `sleep(5)` will only be executed after 5 seconds:

In [81]:
# run this cell again to see the timer in action yourself!
from time import sleep
sleep(5)
print("I'll be printed to the console after 5 seconds!")

I'll be printed to the console after 5 seconds!
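
Instead of a fixed pause, some scrapers wait for a random amount of time between requests. The sketch below illustrates that idea with a hypothetical helper `polite_sleep()` (not part of any library). No real requests are sent, and the pauses are kept very short so the demo finishes quickly.

```python
import random
from time import sleep

def polite_sleep(min_seconds=1.0, max_seconds=3.0):
    """Pause for a random duration between min_seconds and max_seconds (hypothetical helper)."""
    pause = random.uniform(min_seconds, max_seconds)
    sleep(pause)
    return pause

# pretend to visit two pages; nothing is downloaded here
for page in ["page-1.html", "page-2.html"]:
    waited = polite_sleep(0.1, 0.2)
    print(f"visited {page}, then waited {waited:.2f} seconds")
```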


With this addition to our toolkit, let's finish up our book URL scraper by putting together everything we have learned thus far. Following Python conventions, let's modularize our code into functions to improve readability and reusability. 

First, we define a function `generate_page_urls()` that takes a base url and an upper limit of the number of pages (50) as input parameters. This way we can easily update our scraper if more books are added or if the base url changes. 

Second, the `extract_book_urls()` function takes a list of page urls as input and returns a dictionary of book titles and URLs. Note the two-step structure of the for-loops: on every page we create a `books` object which we subsequently loop over by extracting the `book_title` and `book_url` from each book. These records are added to the dictionary `book_dict` which is eventually returned by the function.

In [129]:
def generate_page_urls(base_url, num_pages):
    '''generate a list of full page urls from a base url and a counter that takes on the values between 1 and num_pages'''
    page_urls = []
    
    for counter in range(1, num_pages + 1):
        counter_url = f"page-{counter}.html"
        full_url = base_url + counter_url 
        page_urls.append(full_url)
        
    return page_urls
    
def extract_book_urls(page_urls):
    '''collect the book title and url for every book on all page urls'''
    book_dict = {}
    
    for page_url in page_urls: 
        res = requests.get(page_url)
        soup = BeautifulSoup(res.text, "html.parser")
        books = soup.find_all(class_="product_pod")

        for book in books: 
            book_title = book.find("img").attrs["alt"] 
            book_url = "https://books.toscrape.com/catalogue/" + book.find("a").attrs["href"][6:]
            book_dict[book_title] = book_url
            
        sleep(1)  # pause 1 second after each request
            
    return book_dict
    
base_url = "https://books.toscrape.com/catalogue/category/books_1/"
page_urls = generate_page_urls(base_url, 2) # to save time and resources we only scrape the first 2 pages
book_dict = extract_book_urls(page_urls)

Although this code works without problems, there is one little improvement that we can make. If the number of pages changes, we need to manually update the `num_pages` parameter. For example, we may miss out once new books are added which appear on page 51 and further. 


### 2.2 Next Page Button

A general solution is therefore to look up whether there is a `next` button on the page (HTML code below). If so, it means a next page exists and we keep on incrementing the page counter by 1. If not, it means we have reached the last page. 

<img src="images/next_page.png" align="left" width=90% style="border: 1px solid black" />

As such, we write the function `check_next_page()` which takes a URL as input and returns the outgoing link of the next button (if present):

In [120]:
def check_next_page(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    next_btn = soup.find(class_= "next")
    return next_btn.find("a").attrs["href"] if next_btn else None

page_1 = "https://books.toscrape.com/catalogue/page-1.html"
print(f"The next page is: {check_next_page(page_1)}")

The next page is: page-2.html


We pass it the first page of the bookshop, and it returns the link to the second page (note that `page-2.html` is a relative path from the current URL). Now let's check what happens once we pass it the 50th page!

#### Exercise 4
1. Pass `https://books.toscrape.com/catalogue/page-50.html` to `check_next_page()` and observe the output. Is that what you expected? 
2. Write a function that checks whether the output of `check_next_page()` is not `None` (i.e., anything but `None`). If so, it should return a new variable `page_url` that concatenates the base URL and the relative path to the next page. If not, it should print the statement `This is the last page!`

As a last step, we revise the `extract_book_urls()` function. Instead of generating the list of page URLs upfront, we now use a `while` loop that keeps running as long as there is a page URL. At the end of each iteration we update `page_url` according to the link of the next button (using `check_next_page()`). On the last page there is no next page URL and thus we break out of the while loop. All in all, we have modularized our code into functions, made it future-proof (e.g., if new books are added), and reduced the number of lines of code to get the job done! Try to run the function and inspect the output for yourself (you may need to wait for a bit as the scraper loops through all 50 pages!).

In [132]:
def extract_book_urls(page_url):
    '''collect the book title and url for every book, following the next button until the last page'''
    base_url = "https://books.toscrape.com/catalogue/category/books_1/"
    book_dict = {}

    while page_url: 
        res = requests.get(page_url)
        soup = BeautifulSoup(res.text, "html.parser")
        books = soup.find_all(class_="product_pod")

        for book in books: 
            book_title = book.find("img").attrs["alt"] 
            # [6:] drops the leading "../../" (6 characters) from the relative link
            book_url = "https://books.toscrape.com/catalogue/" + book.find("a").attrs["href"][6:]
            book_dict[book_title] = book_url

        sleep(1)  # pause 1 second after each request
        
        # call check_next_page() only once per page to avoid sending a duplicate request
        next_page = check_next_page(page_url)
        if next_page is not None: 
            page_url = base_url + next_page
        else: 
            break
        
    return book_dict

# start from the category page, where book links begin with "../../" as the slicing above assumes
book_dict = extract_book_urls("https://books.toscrape.com/catalogue/category/books_1/page-1.html")
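
To see the logic of such a `while` loop in isolation, here is a minimal offline sketch. The dictionary `fake_next_links` is made up for illustration and plays the role of `check_next_page()`: it returns the relative link of the next button, or `None` on the last page.

```python
# stand-in for the website: each page maps to the link of its "next" button
# (None on the last page); these three pages are made up for illustration
fake_next_links = {
    "page-1.html": "page-2.html",
    "page-2.html": "page-3.html",
    "page-3.html": None,
}

def follow_next_links(start_page):
    """Return the pages visited when following "next" links until none is left."""
    visited = []
    page = start_page
    while page:
        visited.append(page)
        page = fake_next_links[page]  # plays the role of check_next_page()
    return visited

print(follow_next_links("page-1.html"))
# ['page-1.html', 'page-2.html', 'page-3.html']
```

Because the loop stops as soon as there is no next link, it needs no prior knowledge of the number of pages.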

### 2.3 Page-level Data Collection

Do you remember trying to obtain the URL of the [Black Dust](https://books.toscrape.com/catalogue/black-dust_976/index.html) book in exercise 2? Let's see whether it works...

In [135]:
print(book_dict["Black Dust"])

https://books.toscrape.com/catalogue/black-dust_976/index.html


Excellent, it works flawlessly! But why did we need the book URLs in the first place? They form the seed for other web scraping efforts. For example, the product descriptions can only be obtained from the book pages themselves, which means we need to loop over all book URLs to extract the right information. In the follow-up exercise, we'll look at how to do this. 

#### Exercise 5
1. We'd like to extract the product description of our 3 favorite books. Fill in the blanks below to finish the `get_book_description()` function. Each `#` represents a single missing character (e.g. `####` means the solution requires 4 characters). 
2. Run the function and inspect the output. If you look carefully, you may spot `\x80\x99t` symbols throughout the product description. Look up the original text on the book pages and compare it side-by-side with the output in `book_descriptions`. What do these symbols mean? In the Web Scraping Advanced tutorial in week 5, we'll go more in-depth about what they are exactly and how you can handle such characters.

In [None]:
def get_book_description(books):
    book_descriptions = {}

    for book in books: 
        page_url = book_dict[####]

        res = requests.get(########)
        soup = BeautifulSoup(res.text, "html.parser")

        # tip: look at the Google Inspector screenshot below 
        description = soup.find(id="content_inner").find_all("p")[#].get_text()
        book_descriptions[####] = ###########

    return book_descriptions

favorite_books = ["Black Dust", "The Grand Design", "Twenties Girl"]
book_descriptions = get_book_description(##############)

<img src="images/black_dust.png" align="left" width=90% style="border: 1px solid black" />

### 2.4 CSV Export

* Include relevant metadata (e.g., time of extraction)

In [None]:
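from csv import DictWriter

with open("book_descriptions.csv", "w", newline="", encoding="utf-8") as file: 
    headers = ["title", "description"]
    csv_writer = DictWriter(file, fieldnames=headers)
    csv_writer.writeheader()
    # book_descriptions maps each title to its description, so we build one row per pair
    for title, description in book_descriptions.items(): 
        csv_writer.writerow({"title": title, "description": description})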
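
One possible way to include the time of extraction as metadata is sketched below. The records, the column name `extracted_at`, and the file name are assumptions made for illustration. The placeholder dictionary stands in for the scraped `book_descriptions`.

```python
import csv
from datetime import datetime, timezone

# made-up records standing in for the scraped book descriptions
book_descriptions = {
    "Example Book A": "A placeholder description.",
    "Example Book B": "Another placeholder description.",
}

# one timestamp (UTC, ISO 8601) shared by all rows of this extraction
extraction_time = datetime.now(timezone.utc).isoformat(timespec="seconds")

with open("book_descriptions_with_metadata.csv", "w", newline="", encoding="utf-8") as file:
    headers = ["title", "description", "extracted_at"]
    csv_writer = csv.DictWriter(file, fieldnames=headers)
    csv_writer.writeheader()
    for title, description in book_descriptions.items():
        csv_writer.writerow(
            {"title": title, "description": description, "extracted_at": extraction_time}
        )
```

Storing the timestamp in every row makes it possible to tell data from different scraping runs apart once the files are combined.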

## 3. Executing Python Files

* Discuss pros and cons of Jupyter Notebooks and `.py` files
    * Top down hierarchy 
    * Experiment and develop code on the fly
    * Easier to debug (run cell by cell) 
    * Reproducibility (Latex + Markdown + inline plots) 
    * Browse through directories (file explorer)
    * Can handle large codebases far easier
    * Presenting/sharing -> iPython
    * Prototyping -> iPython -> Move code into functions -> create new cell that uses function from cell above -> when all seems to be running good -> copy paste in a `.py` file -> move to module
* Introduction to Spyder
* Instructions on how to run `.py` files in the terminal 
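
As a rough sketch of what such a `.py` file could look like, the example below wraps the page URL generator from section 2.1 in a script. The file name `scrape_books.py` and the `main()` function are assumptions chosen for illustration. The `if __name__ == "__main__":` line makes sure `main()` only runs when the file is executed directly, not when it is imported.

```python
# scrape_books.py (hypothetical file name)

def generate_page_urls(base_url, num_pages):
    """Generate the full page urls for pages 1 up to and including num_pages."""
    return [f"{base_url}page-{counter}.html" for counter in range(1, num_pages + 1)]

def main():
    base_url = "https://books.toscrape.com/catalogue/category/books_1/"
    # only build the urls of the first 2 pages; no requests are sent in this sketch
    for page_url in generate_page_urls(base_url, 2):
        print(page_url)

if __name__ == "__main__":
    main()
```

Saved under that name, the script could be started from a terminal opened in the same directory with `python scrape_books.py` (on some systems the command is `python3`).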

