# Web Scraping 101

*After finishing this tutorial, you can extract data from multiple pages on the web, and export such data to CSV files so that you can use it in an analysis. Plan a few hours to work through this notebook. Taking a few breaks in between keeps you sharp!*

*Just starting out with web scraping? Then make sure to have followed the ["webdata for dummies" tutorial](https://odcm.hannesdatta.com/docs/tutorials/webdata-for-dummies/) first.*

*Enjoy!*

--- 

## Learning Objectives

* Identifying a strategy for generating seeds (“sampling”)
    * Extracting multiple elements at once using the `.find_all()` function
    * Preventing array misalignment
* Navigating on a website 
    * Using URLs to programmatically visit web pages
    * Writing loops to execute data collections in bulk using functions
* Improving extraction design
    * Implementing timers and modularizing extraction code
    * Storing data in CSV or JSON files with relevant meta data
* Scraping more advanced, dynamic websites
    * Understanding the difference between headless requests and browser emulation 
    * Learning when to apply each of the two methods (using `requests` and `selenium`)

--- 

<div class="alert alert-block alert-info"><b>Support Needed?</b> 
    For technical issues outside of scheduled classes, please check the <a href="https://odcm.hannesdatta.com/docs/course/support" target="_blank">support section</a> on the course website.
</div>


## 1. Generating seeds ("sampling")


__Importance__

So far, we've extracted (=parsed) some information (e.g., titles, product names, prices) from products' individual *product pages*. What we haven't done yet is decide for __which products to obtain that information__. Ideally, we would like to capture information for a *sample of books* (or users, movies, series, etc.).

In web scraping, we typically refer to a "seed" as a starting point for a data collection. Without a seed, there's no data to collect.

For example, before we can crawl through all books available on [this site](https://books.toscrape.com/catalogue/category/books_1/index.html), we first need to generate a *list of all books on the page*.

One way to get there would be to:

1. first scrape all book links (“seeds”) from the overview page, and 
2. then iterate over all links to scrape the product description (or anything else on that page; we have done this in the webdata for dummies tutorial).

Note that the overview page allows us to "navigate" to the individual book pages, either by clicking on the book cover or the book title (see red boxes in the figure below). 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscraping101/images/books_links.png" align="left" width=80%/>

### 1.1 Collecting links to use as seeds

Let's check out how the links from the book covers or book titles are encoded in the website's source code.

Open the [book catalogue](https://books.toscrape.com/catalogue/category/books_1/index.html), and inspect the underlying HTML code with the Chrome Inspector (right click --> inspect element). 

The book covers (`<img>`) are surrounded by `<a>` tags, which contain a link (`href`) to the book. 

Also, the book titles (inside the `<h3>` tags) are wrapped in `<a>` tags with the relevant links to the book pages.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscraping101/images/inspector_links.png" align="left" width=80%/>

How could we tell a computer to capture the links to the various books on the site?

One simple way is to select *elements by their tags*. For example, to extract all links (`<a>` tags). 

<div class="alert alert-block alert-info"><b>How to extract multiple elements at once?</b>
    <br>
    
- By working through other tutorials, you may already be familiar with the <code>.find()</code> function of BeautifulSoup. The <code>.find()</code> function returns the <b>first element</b> that matches your particular "search query". <br>
- If you want to extract <b>all elements</b> that match a particular search pattern (say, a class name), you can use BeautifulSoup's <code>.find_all()</code> function.<br>
- Note that the "result" of the <code>.find_all()</code> function is a list of results <b>that you need to iterate through.</b>

</div>
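To make the difference concrete, here is a minimal sketch that runs without an internet connection. The HTML snippet and the file names in it are made up for illustration; they do not come from the books site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for illustration only
html = """
<ul>
  <li><a href="page-a.html">First</a></li>
  <li><a href="page-b.html">Second</a></li>
  <li><a href="page-c.html">Third</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# .find() returns only the first matching element
first_link = soup.find("a")
print(first_link["href"])

# .find_all() returns a list of all matching elements,
# so we iterate over it to get each link
all_links = soup.find_all("a")
hrefs = [link["href"] for link in all_links]
print(hrefs)
```

The first `print` shows a single link, the second a list with one entry per `<a>` tag.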


__Exercise 1.1__

Please run the code cell below, which extracts all links (the `a` tag!), and prints the URL (`href`) to the screen. Don't worry, you don't need to understand the code yet, we'll go over it line by line shortly!

If you look at these links more closely, you'll notice that we're not interested in many of these links... 

Make a list of all links we're *not* interested in (i.e., those *not* pointing to a book page). Which ones are those? Can you find out why they are there?

In [1]:
# Run this code now
import requests
from bs4 import BeautifulSoup

# make a get request to the books overview page (see Webdata for Dummies tutorial)
user_agent = {'User-agent': 'Mozilla/5.0'}
url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'
res = requests.get(url, headers = user_agent)
res.encoding = res.apparent_encoding

soup = BeautifulSoup(res.text)

# print the href attribute of every <a> tag on the page
for link in soup.find_all("a"): 
    print(link.attrs["href"])

../../../index.html
../../../index.html
index.html
../books/travel_2/index.html
../books/mystery_3/index.html
../books/historical-fiction_4/index.html
../books/sequential-art_5/index.html
../books/classics_6/index.html
../books/philosophy_7/index.html
../books/romance_8/index.html
../books/womens-fiction_9/index.html
../books/fiction_10/index.html
../books/childrens_11/index.html
../books/religion_12/index.html
../books/nonfiction_13/index.html
../books/music_14/index.html
../books/default_15/index.html
../books/science-fiction_16/index.html
../books/sports-and-games_17/index.html
../books/add-a-comment_18/index.html
../books/fantasy_19/index.html
../books/new-adult_20/index.html
../books/young-adult_21/index.html
../books/science_22/index.html
../books/poetry_23/index.html
../books/paranormal_24/index.html
../books/art_25/index.html
../books/psychology_26/index.html
../books/autobiography_27/index.html
../books/parenting_28/index.html
../books/adult-fiction_29/index.html
../books/humo

**Your answer**

...

__Solution__

The links we want to ignore are...

* "Books to Scrape" link at the top
* "Home" breadcrumb link 
* Left sidebar with all book genres (e.g., Travel)
* The next button at the bottom

These links are present on the page because users need them to navigate the site. This can also be seen in the animation:

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscraping101/images/books_overview.gif" align="left" width=50%/>

### 1.2 Collecting *More Specific* Links

__Importance__

We've just discovered that selecting elements by their tags gives us many irrelevant links. But, how can we narrow down these links, or, in other words, __how can we scrape only the book links we're interested in?__

To answer this question, we need to briefly revisit how HTML code is structured. __Open your browser's inspect mode again and hover over the product pictures on the site.__

After inspecting, you'd probably notice that the page is generated according to a rigid structure: all product links are contained in an `<article>` tag, with the class name `product_pod`. The "wrong links" extracted above (i.e., the ones in the page's header and sidebar) are *not* part of these elements. 

So, if we can tell our scraper that we're only interested in the `<a>` tags *within the `product_pod` class*, we end up with our desired selection of links. 

__Let's try it out__

Like before, we'll use `.find_all()` to capture all matching elements on the page. The difference, however, is that we do not directly try to extract the __links__ with the tag `a`, but first try to obtain a __list with product containers__ identified by the classname `product_pod`.

Run the code below, in which we first try to capture all book containers using the `product_pod` class.


In [2]:
import requests
from bs4 import BeautifulSoup

# make a get request to the books overview page (see Webdata for Dummies tutorial)
url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'
header = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers=header)
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text)

# return all book containers
books = soup.find_all(class_="product_pod")
len(books)

20

As expected, we retrieve 20 book containers. You can now also use the books object to look at the data for the first, second, third, ... book.

In [3]:
books[0]

<article class="product_pod">
<div class="image_container">
<a href="../../a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../../../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="../../a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

...to subsequently try to extract the link for the first book...

In [4]:
books[0].find('a')['href']

'../../a-light-in-the-attic_1000/index.html'

...the second book...

In [5]:
books[1].find('a')['href']

'../../tipping-the-velvet_999/index.html'

...or all books.

In [6]:
links = []
for book in books:
    links.append(book.find('a')['href'])
links

['../../a-light-in-the-attic_1000/index.html',
 '../../tipping-the-velvet_999/index.html',
 '../../soumission_998/index.html',
 '../../sharp-objects_997/index.html',
 '../../sapiens-a-brief-history-of-humankind_996/index.html',
 '../../the-requiem-red_995/index.html',
 '../../the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 '../../the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 '../../the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
 '../../the-black-maria_991/index.html',
 '../../starving-hearts-triangular-trade-trilogy-1_990/index.html',
 '../../shakespeares-sonnets_989/index.html',
 '../../set-me-free_988/index.html',
 '../../scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html',
 '../../rip-it-up-and-start-again_986/index.html',
 '../../our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html',

Note the `../../` in front of the link: it tells the browser to go up two directories from the directory of the current page:
* Current URL: https://books.toscrape.com/catalogue/category/books_1/index.html
* Current directory: https://books.toscrape.com/catalogue/category/books_1/
* 1 step back: https://books.toscrape.com/catalogue/category/
* 2 steps back: https://books.toscrape.com/catalogue/

Thereafter, it appends `a-light-in-the-attic_1000/index.html` to the URL which forms the full link to the [A Light in the Attic](https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html) book. 
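If you prefer not to resolve relative links by hand, Python's standard library can do it. The sketch below uses `urljoin` from `urllib.parse`, which is not used elsewhere in this tutorial, so treat it as one possible alternative rather than the tutorial's method.

```python
from urllib.parse import urljoin

# URL of the overview page on which the relative link was found
page_url = "https://books.toscrape.com/catalogue/category/books_1/index.html"

# relative link as extracted from the first book container
relative_link = "../../a-light-in-the-attic_1000/index.html"

# urljoin resolves the "../../" steps relative to the page URL
absolute_link = urljoin(page_url, relative_link)
print(absolute_link)
```

Because `urljoin` starts from the page on which the link was found, it keeps working for relative links with a different number of `../` steps.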

Pretty cool, right? So let's proceed with some exercises.

#### Exercise 1.2
1. Modify the loop (`for book in books`) above to extract the *absolute URLs* rather than the relative URLs. Specifically, combine the website's URL (`https://books.toscrape.com/catalogue/`) and the string you extracted in the previous code snippet (`../../a-light-....`). You can remove the `../../` by using the `.replace('../../', '')` function on the URL. The final URL needs to be: `https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html` 
2. Write a function to collect all links (seeds) from this page, i.e., including loading packages, making the HTTP request, and returning the information as an array.

In [7]:
# your answer goes here!

#### Solutions

In [8]:
# Question 1 
links = []
for book in books:
    extracted_link = book.find('a')['href'].replace('../../','')
    combined_link = "https://books.toscrape.com/catalogue/" + extracted_link
    links.append(combined_link)
links

['https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
 'https://books.toscrape.com/catalogue/soumission_998/index.html',
 'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'https://books.toscrape.com/catalogue/the-requiem-red_995/index.html',
 'https://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 'https://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 'https://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
 'https://books.toscrape.com/catalogue/the-black-maria_991/index.html',
 'https://books.toscrape.com/catalogue/starving-hearts-triangular-trade-tr

In [9]:
# Question 2
import requests
from bs4 import BeautifulSoup

def get_all_links(url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'):
    # make a get request to the books overview page (see Webdata for Dummies tutorial)
    print(f'Getting links from page {url}.')
    header = {'User-agent': 'Mozilla/5.0'}
    res = requests.get(url, headers=header)
    res.encoding = res.apparent_encoding
    soup = BeautifulSoup(res.text)

    # collect all book containers on this page
    books = soup.find_all(class_="product_pod")

    links = []
    for book in books:
        extracted_link = book.find('a')['href'].replace('../../','')
        combined_link = "https://books.toscrape.com/catalogue/" + extracted_link
        links.append(combined_link)
    return links # to return all links

get_all_links()

Getting links from page https://books.toscrape.com/catalogue/category/books_1/index.html.


['https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
 'https://books.toscrape.com/catalogue/soumission_998/index.html',
 'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'https://books.toscrape.com/catalogue/the-requiem-red_995/index.html',
 'https://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 'https://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 'https://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
 'https://books.toscrape.com/catalogue/the-black-maria_991/index.html',
 'https://books.toscrape.com/catalogue/starving-hearts-triangular-trade-tr

### 1.3 Preventing array misalignment

So far, we have only extracted *one* piece of information (the URL) from the product overview pages. But, what if we want to use the product overview page to extract multiple data points (say, about the price and the review valence)?

A simple solution may be to just use multiple `.find_all()` commands.

__Example__:


In [10]:
# Run this code now
import requests
from bs4 import BeautifulSoup

header = {'User-agent': 'Mozilla/5.0'}
url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'
res = requests.get(url, headers=header)
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text)

# getting the titles
book_titles = []
for title in soup.find_all('h3'): book_titles.append(title.get_text())

# getting the valence
stars = []
for star in soup.find_all(class_='star-rating'): stars.append(star.attrs['class'][1])

# book titles
print(book_titles)

# stars
print(stars)

['A Light in the ...', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History ...', 'The Requiem Red', 'The Dirty Little Secrets ...', 'The Coming Woman: A ...', 'The Boys in the ...', 'The Black Maria', 'Starving Hearts (Triangular Trade ...', "Shakespeare's Sonnets", 'Set Me Free', "Scott Pilgrim's Precious Little ...", 'Rip it Up and ...', 'Our Band Could Be ...', 'Olio', 'Mesaerion: The Best Science ...', 'Libertarianism for Beginners', "It's Only the Himalayas"]
['Three', 'One', 'One', 'Four', 'Five', 'One', 'Four', 'Three', 'Four', 'One', 'Two', 'Four', 'Five', 'Five', 'Five', 'Three', 'One', 'One', 'Two', 'Two']


While this approach seems easily implemented, it is __highly error-prone and needs to be avoided.__

<div class="alert alert-block alert-info"><b>What's an array misalignment?</b>
    <br>
    
<ul>
<li>
When extracting information from the web, we are prone to "ripping apart" the website's original structure by putting data points into separate arrays (e.g., one list for book titles and another for stars). </li>
<li>In so doing, we violate the data's original structure: we should store information on books, and <b>each book</b> has a title and rating.</li>
    <li>The <b>correct way of organizing the data</b> is to create a list of books (e.g., a list of dictionaries) and then store each attribute (e.g., the title, the valence, etc.) <b>within</b> these objects. <b>Only if we store data this way</b> can we be sure to store everything correctly. </li>
<li>When we do not adhere to this practice, we run the risk of "array misalignment". For example, if only ONE data point were missing for a book, then the (independent) book_titles array (say, with 20 items) wouldn't be "1:1 aligned" with the valence array (say, with only 19 items).</li>
</ul>

</div>
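The sketch below illustrates the problem on a tiny, made-up HTML snippet in which the second book has no rating. The snippet only mimics the structure of the books site; the titles and ratings are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the second book has no rating element
html = """
<article class="product_pod"><h3>Book A</h3><p class="star-rating Three"></p></article>
<article class="product_pod"><h3>Book B</h3></article>
<article class="product_pod"><h3>Book C</h3><p class="star-rating Five"></p></article>
"""
soup = BeautifulSoup(html, "html.parser")

# two independent lists, as in the error-prone approach
titles = [h3.get_text() for h3 in soup.find_all("h3")]
stars = [p["class"][1] for p in soup.find_all(class_="star-rating")]

# the lists have different lengths: 3 titles, but only 2 ratings
print(len(titles), len(stars))

# pairing by position now attaches Book C's rating to Book B
pairs = list(zip(titles, stars))
print(pairs)
```

Nothing in this code raises an error, which is exactly why the misalignment is easy to miss.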

__So, how to do it correctly?__

We will first have to iterate through each __book__, and within each book extract the information.

Storing the information in a list of dictionaries mirrors this structure most closely (see the example below):

In [11]:
# Run this code now
import requests
from bs4 import BeautifulSoup

header = {'User-agent': 'Mozilla/5.0'}
url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'
res = requests.get(url, headers=header)
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text)

# loop through the books
books = []
for book in soup.find_all('article', class_='product_pod'):
    title = book.find('h3').get_text()
    valence = book.find(class_='star-rating').attrs['class'][1]
    
    obj = {'title': title,
           'valence': valence}
    
    books.append(obj)
    
books

[{'title': 'A Light in the ...', 'valence': 'Three'},
 {'title': 'Tipping the Velvet', 'valence': 'One'},
 {'title': 'Soumission', 'valence': 'One'},
 {'title': 'Sharp Objects', 'valence': 'Four'},
 {'title': 'Sapiens: A Brief History ...', 'valence': 'Five'},
 {'title': 'The Requiem Red', 'valence': 'One'},
 {'title': 'The Dirty Little Secrets ...', 'valence': 'Four'},
 {'title': 'The Coming Woman: A ...', 'valence': 'Three'},
 {'title': 'The Boys in the ...', 'valence': 'Four'},
 {'title': 'The Black Maria', 'valence': 'One'},
 {'title': 'Starving Hearts (Triangular Trade ...', 'valence': 'Two'},
 {'title': "Shakespeare's Sonnets", 'valence': 'Four'},
 {'title': 'Set Me Free', 'valence': 'Five'},
 {'title': "Scott Pilgrim's Precious Little ...", 'valence': 'Five'},
 {'title': 'Rip it Up and ...', 'valence': 'Five'},
 {'title': 'Our Band Could Be ...', 'valence': 'Three'},
 {'title': 'Olio', 'valence': 'One'},
 {'title': 'Mesaerion: The Best Science ...', 'valence': 'One'},
 {'title':

## 2. Navigating on a Website

### 2.1. Using URLs

__Importance__

Alright - what have we learnt up to this point?

We've learnt how to extract seeds from __one page.__

So... what's missing?

Exactly! The [`books.toscrape.com`](https://books.toscrape.com/catalogue/category/books_1/index.html) site contains many books, spread across __50 pages__. 

So, the goal of this section is to navigate through the __entire book assortment__, not only the first 20 books!

__Let's try it out__

Open [the website](https://books.toscrape.com/catalogue/category/books_1/index.html), and click on the "next" button at the bottom of the page.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscraping101/images/books.png" align="left" width=60%/>


Repeat this a couple of times, and observe how the URL in your navigation bar is changing...

- `https://books.toscrape.com/catalogue/category/books_1/page-1.html`
- `https://books.toscrape.com/catalogue/category/books_1/page-2.html`
- `https://books.toscrape.com/catalogue/category/books_1/page-3.html`

Can you guess the next one...?

Indeed! The URL can be divided into a __fixed base part__ (`https://books.toscrape.com/catalogue/category/books_1/`), and a __counter__ that is dependent on the page you're visiting (e.g., `page-1.html`). 

__Now let's create a list of all 50 URLs!__ 

First, we create a counter variable and set it to 1, the first page. Then, as long as the counter does not exceed 50, we insert it into the page URL, append that URL to a list, and increase the counter by one.

In [12]:
counter = 1
page_urls = []
while counter <= 50:
    page_urls.append(f'https://books.toscrape.com/catalogue/page-{counter}.html')
    counter+=1
page_urls

['https://books.toscrape.com/catalogue/page-1.html',
 'https://books.toscrape.com/catalogue/page-2.html',
 'https://books.toscrape.com/catalogue/page-3.html',
 'https://books.toscrape.com/catalogue/page-4.html',
 'https://books.toscrape.com/catalogue/page-5.html',
 'https://books.toscrape.com/catalogue/page-6.html',
 'https://books.toscrape.com/catalogue/page-7.html',
 'https://books.toscrape.com/catalogue/page-8.html',
 'https://books.toscrape.com/catalogue/page-9.html',
 'https://books.toscrape.com/catalogue/page-10.html',
 'https://books.toscrape.com/catalogue/page-11.html',
 'https://books.toscrape.com/catalogue/page-12.html',
 'https://books.toscrape.com/catalogue/page-13.html',
 'https://books.toscrape.com/catalogue/page-14.html',
 'https://books.toscrape.com/catalogue/page-15.html',
 'https://books.toscrape.com/catalogue/page-16.html',
 'https://books.toscrape.com/catalogue/page-17.html',
 'https://books.toscrape.com/catalogue/page-18.html',
 'https://books.toscrape.com/catalogu

As expected, this gives a list of all page URLs that contain books. 

In [13]:
# print the number of page urls (btw, run print(page_urls) for yourself to see all page URLs!)
print("The number of page urls in the list is: " + str(len(page_urls)))

The number of page urls in the list is: 50
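As a side note, the same list can be built more compactly with `range()` and a list comprehension. This is a sketch of an equivalent approach; the names `url_template`, `n_pages`, and `page_urls_alt` are hypothetical and chosen only for this example.

```python
# URL pattern with a placeholder for the page number
url_template = 'https://books.toscrape.com/catalogue/page-{}.html'

# number of pages, taken from the text above
n_pages = 50

# range(1, n_pages + 1) yields 1, 2, ..., 50
page_urls_alt = [url_template.format(page) for page in range(1, n_pages + 1)]

print(len(page_urls_alt))
```

Whether you use a `while` loop or a list comprehension is a matter of taste; both still require you to know the number of pages in advance.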


#### Exercise 2.1

Let's take a step back again, and practice getting seeds from *another website*: [`quotes.toscrape.com`](https://quotes.toscrape.com/) displays 100 famous quotes from GoodReads, categorized by tag. 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscraping101/images/quotes.png" align="left" width=60% style="border: 1px solid black" />

1. Make yourself comfortable with how the [site](https://quotes.toscrape.com) works and ask yourself questions such as: how does the navigation work, how many pages are there, what is the base URL, and how does it change if I move to the next page?
2. Generate a list `quote_page_urls` that contains the page URLs we need if we'd like to scrape all 100 quotes.

In [14]:
# your answer goes here!

#### Solutions
1. The 100 quotes are evenly spread across 10 pages. The base URL is `https://quotes.toscrape.com/page/` followed by a page number between 1 and 10.

In [15]:
counter = 1
quote_page_urls = []
while counter <= 10:
    quote_page_urls.append(f'https://quotes.toscrape.com/page/{counter}')
    counter+=1
quote_page_urls


['https://quotes.toscrape.com/page/1',
 'https://quotes.toscrape.com/page/2',
 'https://quotes.toscrape.com/page/3',
 'https://quotes.toscrape.com/page/4',
 'https://quotes.toscrape.com/page/5',
 'https://quotes.toscrape.com/page/6',
 'https://quotes.toscrape.com/page/7',
 'https://quotes.toscrape.com/page/8',
 'https://quotes.toscrape.com/page/9',
 'https://quotes.toscrape.com/page/10']

Of course, one of the big disadvantages of this "manual" link building is that we need to "know" how many pages to extract information from. This may vastly differ by category. 

We turn towards this issue next.

### 2.2 Using links contained in elements (e.g., buttons)

__Importance__

So far, the book link extraction has worked without problems. Yet, there's still one little improvement that we can make: *if the number of pages changes*, we need to manually update the number of pages from which we retrieve seeds.

A general solution is therefore to look up whether there is a `next` button on the page (see HTML code below). We can then either "grab" the URL and visit it (so, in essence, we're still using URLs to navigate), or - instead - "click" on it.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscraping101/images/next_page.png" align="left" width=60% style="border: 1px solid black" />

__Let's try it out__

So, let's write a snippet that "captures" the link of the next page button on the [books page](https://books.toscrape.com).

We always proceed in small steps.

In [16]:
# Step 1: Load the website's source code and convert to BeautifulSoup object
url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'
header = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers = header)
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text)

In [17]:
# Step 2: Trying to locate the "next" class.
soup.find(class_='next')

<li class="next"><a href="page-2.html">next</a></li>

In [18]:
# Step 3: Trying to locate the <a> tag within the "next" class

In [19]:
soup.find(class_='next').find('a')

<a href="page-2.html">next</a>

In [20]:
# Step 4: Trying to extract the link ('href' attribute)
soup.find(class_='next').find('a')['href']

'page-2.html'

At each iteration, we can observe how we're getting closer to the information we need.

Now, we only need to combine the base URL with the relative link to the next page.

In [21]:
next_page = soup.find(class_='next').find('a')['href']
'https://books.toscrape.com/catalogue/category/books_1/' + next_page

'https://books.toscrape.com/catalogue/category/books_1/page-2.html'

__Exercise 2.2__

Please first load the snippet below, which has wrapped the "next page" capturing in a function. Observe the use of `try` and `except`, which accounts for the last page NOT having a next page button.

In [22]:
base_url = 'https://books.toscrape.com/catalogue/category/books_1/'

def next_page(url):
    header = {'User-agent': 'Mozilla/5.0'}
    res = requests.get(url, headers = header)
    res.encoding = res.apparent_encoding
    soup = BeautifulSoup(res.text)
    try:
        next_page = soup.find(class_='next').find('a')['href']
    except:
        next_page = 'no next page'
    return(base_url + next_page)



1. Pass `https://books.toscrape.com/catalogue/page-49.html` to `next_page()` and observe the output. Then, use  `https://books.toscrape.com/catalogue/page-50.html`. Is that what you expected? 

2. Write a while loop that assembles a list of all product pages for the book category (`'https://books.toscrape.com/catalogue/category/books_1/'`), by extracting next page URLs from each page and appending them to an array/list called `urls`.


In [23]:
# write your code here

__Solution__

In [24]:
# Question 1
next_page('https://books.toscrape.com/catalogue/page-49.html') # works


'https://books.toscrape.com/catalogue/category/books_1/page-50.html'

In [25]:
next_page('https://books.toscrape.com/catalogue/page-50.html') # no next button: "no next page" gets appended to the base URL

'https://books.toscrape.com/catalogue/category/books_1/no next page'
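The last output is not a valid URL: the marker text simply got glued onto the base URL. One possible alternative, sketched below, is to return `None` when there is no next page. The function `find_next_url` is hypothetical and, unlike `next_page()`, takes already-downloaded HTML as input so that the example runs offline; the two HTML strings are made up and only mimic the site's pagination markup.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def find_next_url(html, current_url):
    """Return the absolute URL of the next page, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    next_item = soup.find(class_="next")
    # no "next" element (or no link inside it) means we are on the last page
    if next_item is None or next_item.find("a") is None:
        return None
    # resolve the relative link against the page it was found on
    return urljoin(current_url, next_item.find("a")["href"])

# Hypothetical HTML snippets standing in for downloaded pages
page_with_next = '<ul><li class="next"><a href="page-50.html">next</a></li></ul>'
last_page = '<ul><li class="current">Page 50 of 50</li></ul>'

print(find_next_url(page_with_next, "https://books.toscrape.com/catalogue/page-49.html"))
print(find_next_url(last_page, "https://books.toscrape.com/catalogue/page-50.html"))
```

A loop can then stop with a simple `if next_url is None: break` instead of searching for a marker text inside a string.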

In [26]:
# Question 2
urls = []

# define first URL to start from
url = 'https://books.toscrape.com/catalogue/category/books_1/'

while True:
    print('Trying to get next page URL from ' + url)
    next_url = next_page(url)
    if 'no next page' in next_url: break
    url = next_url
    urls.append(url)
    
urls

Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-2.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-3.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-4.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-5.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-6.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-7.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-8.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-9.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-10.html
Trying to get next p

['https://books.toscrape.com/catalogue/category/books_1/page-2.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-3.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-4.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-5.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-6.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-7.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-8.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-9.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-10.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-11.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-12.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-13.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-14.html',
 'https://books.toscrape.com/catalogue/category/books_1/page-15.html',
 'https://book

### 2.3 Collecting all seeds

Up to this moment, we have defined what seeds are (crucially important for sampling!), and introduced several ways through which you can navigate on a site. The only thing that's missing is combining these two things: navigating through all of the available pages, and collecting seeds for which we can later extract data.

__Exercise 2.3__

Using the solution from exercise 2.2, write code that navigates through all pages of the book category and stores product URLs in a list of dictionaries, containing the following data points:
- product URL
- URL of the page from which the product URL was captured
- current time stamp
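For the time stamp, one common option is Unix time, i.e., the number of seconds since January 1, 1970 (UTC). The sketch below shows how to create such a time stamp and how to convert one back into a readable date; the variable names are hypothetical, and the fixed value is only used to make the conversion reproducible.

```python
import time
from datetime import datetime, timezone

# current time as a Unix time stamp (whole seconds)
timestamp = int(time.time())
print(timestamp)

# converting a stored time stamp back into a readable UTC date
example_timestamp = 1676484017
readable = datetime.fromtimestamp(example_timestamp, tz=timezone.utc).isoformat()
print(readable)
```

Storing the integer keeps the data compact and unambiguous across time zones; you can always convert it to a readable date later during the analysis.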


__Solution__

In [27]:
import time

seeds = []
url = 'https://books.toscrape.com/catalogue/category/books_1/' #initialize for first page
counter = 0 #initialize counter so that you can break earlier from this loop when needed

while True:
    counter+=1
    
    #if (counter>4): break # uncomment this line if you want to break after x iterations for prototyping
    
    print(f'Trying to get next page URL from {url}')
    
    header = {'User-agent': 'Mozilla/5.0'}
    res = requests.get(url, headers=header)
    res.encoding = res.apparent_encoding
    soup = BeautifulSoup(res.text)
    
    
    # extract information
    urls = soup.find_all(class_="product_pod")
    for book in urls:
        url_book = book.find("a").attrs["href"]
        book_url = "https://books.toscrape.com/catalogue/" + url_book
        book_url = book_url.replace('../', '')
        seeds.append({'product_url': book_url,
                      'page_url': url,
                      'timestamp': int(time.time())})
    
    # next page available?
    try:
        url = 'https://books.toscrape.com/catalogue/category/books_1/' + soup.find(class_='next').find('a')['href']
    except (AttributeError, TypeError):
        break # no next page present (the 'next' element was not found)


Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-2.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-3.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-4.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-5.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-6.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-7.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-8.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-9.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-10.html
Trying to get next p

In [28]:
# take a look at the collected seeds
seeds

[{'product_url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1676484017},
 {'product_url': 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1676484017},
 {'product_url': 'https://books.toscrape.com/catalogue/soumission_998/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1676484017},
 {'product_url': 'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1676484017},
 {'product_url': 'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1676484017},
 {'product_url': 'https://books.toscr

To retrieve product information, you could now loop through this list of links and obtain the respective product information (see webdata for dummies tutorial).

In [29]:
len(seeds)

1000
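The solution above turns relative links into absolute ones by concatenating strings and removing `../`. As a possible alternative, the sketch below uses `urljoin` from Python's standard library, which resolves a relative link against the page on which it was found. The two `href` values are illustrative assumptions about what the links on the category page look like, so please check them against the actual page source.

```python
from urllib.parse import urljoin

# page on which the links were found
page_url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'

# assumed examples of relative links (href attributes) on that page
book_href = '../../a-light-in-the-attic_1000/index.html'
next_href = 'page-2.html'

# urljoin resolves '../' segments relative to the page URL
book_url = urljoin(page_url, book_href)
next_url = urljoin(page_url, next_href)

print(book_url)
print(next_url)
```

Because the link is resolved relative to the current page, the same call works for both product links and the "next" link.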

## 3. Improving Extraction Design

### 3.1 Timers

__Importance__

Before running some of the cells above, you may have noticed the use of the `time.sleep` function. Sending many requests in quick succession can overload a server, so it's highly recommended to pause between requests. Pausing also lowers the risk that your IP address (i.e., the numerical label assigned to each device connected to the internet) gets blocked, after which you could no longer visit (and scrape) the website.

__Let's try it out__

In Python, you can import the `time` module. Its `sleep()` function pauses the execution of subsequent commands for a given number of seconds. For example, the print statement after `time.sleep(3)` will only be executed after 3 seconds:

In [30]:
# run this cell again to see the timer in action yourself!
import time
pause = 3
time.sleep(pause)
print(f"I'll be printed to the console after {pause} seconds!")

I'll be printed to the console after 3 seconds!
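A common variation, which is not part of the original exercise, is to wait for a random amount of time so that requests are not sent at perfectly regular intervals. The sketch below assumes a hypothetical helper called `polite_pause`; the name and the default values are illustrative only.

```python
import random
import time

def polite_pause(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random duration between min_seconds and max_seconds."""
    duration = random.uniform(min_seconds, max_seconds)
    time.sleep(duration)
    return duration

# short values here so that the demo finishes quickly
waited = polite_pause(0.1, 0.2)
print(f'Waited {waited:.2f} seconds')
```

In a scraper, you could call `polite_pause()` once after each request instead of a fixed `time.sleep(1)`.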


__Exercise 3.1__

Modify the code above to sleep for 2 minutes. Go grab a coffee in between. Did it take you longer than 2 minutes?

(if you want to abort the running code, just select the cell and push the "stop" button)

In [31]:
# your answer goes here!

**Solution**  

In [32]:
time.sleep(2*60)
print("Done!")

Done!


### 3.2 Modularization

**Importance**  

In scraping, many things have to be executed *multiple times*. For example, whenever we open a new page on books.toscrape.com, we would like to extract all the available book links.

To help us execute things over and over again, we will "modularize" our code into functions. We can then call these functions whenever we need them. Another benefit from using functions is that we can improve the readability and reusability of our code. If you need a quick refresher on functions, please revisit section 4 of the [Python Bootcamp](https://odcm.hannesdatta.com/docs/tutorials/pythonbootcamp/).

**Let's try it out**

Let's finish up our book URL scraper by putting together everything we have learned thus far.

1. We need a function that extracts all seeds, given a category URL. We would like to store these seeds in a JSON file and save it to the disk. This will constitute our "sample" going forward.
2. We need a function that opens this JSON file, and captures all of the relevant product information (for now, let's use the title and price).

__Exercise 3.2__

Write a function to accomplish (1) above (capturing the seeds and storing them in a JSON file). Start with the solution in 2.3.

__Solution__

In [33]:
import time

def get_seeds(start_url = 'https://books.toscrape.com/catalogue/category/books_1/'):
    seeds = []
    url = start_url
    counter = 0 #initialize counter so that you can break earlier from this loop when needed

    while True:
        counter+=1

        if (counter>4): break # prototyping: stop after 4 pages; comment out this line to collect all pages

        print(f'Trying to get next page URL from {url}')

        header = {'User-agent': 'Mozilla/5.0'}
        res = requests.get(url, headers=header)
        res.encoding = res.apparent_encoding
        soup = BeautifulSoup(res.text)

        # extract information
        urls = soup.find_all(class_="product_pod")
        for book in urls:
            url_book = book.find("a").attrs["href"]
            book_url = "https://books.toscrape.com/catalogue/" + url_book
            book_url = book_url.replace('../', '')
            seeds.append({'product_url': book_url,
                          'page_url': url,
                          'timestamp': int(time.time())})
        
        # next page available?
        try:
            url = 'https://books.toscrape.com/catalogue/category/books_1/' + soup.find(class_='next').find('a')['href']
        except (AttributeError, TypeError):
            break # no next page present (the 'next' element was not found)

    return seeds


In [34]:
data = get_seeds('https://books.toscrape.com/catalogue/category/books_1/')

Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-2.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-3.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-4.html


In [35]:
# preview the data
data

[{'product_url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1676484189},
 {'product_url': 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1676484189},
 {'product_url': 'https://books.toscrape.com/catalogue/soumission_998/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1676484189},
 {'product_url': 'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1676484189},
 {'product_url': 'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1676484189},
 {'product_url': 'https://books.toscr

In [36]:
# store data in new-line separated JSON files

import json
with open('seeds.json', 'w', encoding = 'utf-8') as f:
    for item in data:
        f.write(json.dumps(item))
        f.write('\n')
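New-line separated JSON stores one dictionary per line, which makes it easy to read the file back line by line. The sketch below wraps writing and reading in two small functions. The function names and the file name `seeds_example.json` are hypothetical, chosen so that the demo does not overwrite `seeds.json`.

```python
import json

def write_json_lines(items, filename):
    """Write one JSON object per line."""
    with open(filename, 'w', encoding='utf-8') as f:
        for item in items:
            f.write(json.dumps(item) + '\n')

def read_json_lines(filename):
    """Read the file back into a list of dictionaries."""
    with open(filename, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# one example seed, in the same format as the output of get_seeds()
example = [{'product_url': 'https://books.toscrape.com/catalogue/soumission_998/index.html',
            'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
            'timestamp': 1676484189}]

write_json_lines(example, 'seeds_example.json')
loaded = read_json_lines('seeds_example.json')
print(loaded == example)
```

Reading the file back and comparing it with the original list is a quick way to check that nothing got lost while saving.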

__Exercise 3.3__

Now, let's write some code that loads `seeds.json`, and visits each of the websites to extract the product title and price. Remember to build in a little timer (e.g., waiting for 1 second). The prototype/starting code below stops automatically after 5 iterations to minimize server load. Try removing the prototyping condition using the comment character `#` when you think you're done!


In [37]:
# start from the code below
import time # we need the time package for implementing a bit of waiting time

content = open('seeds.json', 'r').readlines() # let's read in the seed data

counter = 0 # initialize counter to 0

# loop through all lines of the JSON file
for line in content:
    # increment counter and check whether prototyping condition is met
    counter = counter + 1
    if counter>5: break # deactivate this if you want to loop through the entire file
        
    # convert loaded data to JSON object/dictionary for querying
    obj = json.loads(line)
    
    # show URL for which product information needs to be captured
    print(obj['product_url'])
    
    # finally, sleep for a second
    time.sleep(1)
    

https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
https://books.toscrape.com/catalogue/soumission_998/index.html
https://books.toscrape.com/catalogue/sharp-objects_997/index.html
https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html


<div class="alert alert-block alert-info"><b>Tips</b>
    <br>
    <ul>
        <li>
            Use the function <code>parse_website</code> from exercise 1.6 in the "webdata for dummies" tutorial and remove the file saving part.
        </li>
    </ul>
</div>


__Solution__

In [38]:
# Paste the parse_website() function here from an earlier tutorial. Remember to include the import statements!
import requests
from bs4 import BeautifulSoup

def parse_website(url):
    header = {'User-agent': 'Mozilla/5.0'} # the user agent tells the server which browser is (seemingly) making the request
    request = requests.get(url, headers = header)
    request.encoding = request.apparent_encoding # use the encoding detected from the page content
    source_code = request.text

    # make information "extractable" using BeautifulSoup
    soup = BeautifulSoup(source_code)
    
    # extract title, price, availability, and star rating
    title = soup.find('h1').get_text()
    price = soup.find(class_='price_color').get_text()
    instock = soup.find(class_='instock availability').get_text().strip()
    stars = soup.find(class_='star-rating').attrs['class'][1]

    data = {'title': title,
            'price': price,
            'instock': instock,
            'stars': stars}
    
    return(data)

In [39]:
# test whether the function works (I just randomly picked a book)
parse_website('https://books.toscrape.com/catalogue/set-me-free_988/index.html')

{'title': 'Set Me Free',
 'price': '£17.46',
 'instock': 'In stock (19 available)',
 'stars': 'Five'}
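`parse_website()` downloads and parses a page in one go. While prototyping, it can help to separate the parsing step so that it can be tried out on a piece of HTML without sending a request. The sketch below assumes a hypothetical function `parse_book_html()` and a made-up HTML snippet that mimics the structure of a product page. Unlike the solution above, it returns `None` for elements that are missing instead of raising an error.

```python
from bs4 import BeautifulSoup

def parse_book_html(source_code):
    """Extract product details from the HTML of a single product page."""
    soup = BeautifulSoup(source_code, 'html.parser')  # name the parser explicitly

    def text_or_none(element):
        # return the stripped text, or None when the element was not found
        return element.get_text().strip() if element is not None else None

    stars_tag = soup.find(class_='star-rating')
    return {'title': text_or_none(soup.find('h1')),
            'price': text_or_none(soup.find(class_='price_color')),
            'instock': text_or_none(soup.find(class_='instock availability')),
            # the rating is stored as the second CSS class, e.g. "star-rating Five"
            'stars': stars_tag['class'][1] if stars_tag is not None else None}

# made-up HTML snippet that mimics the structure of a product page
sample = '''
<h1>Set Me Free</h1>
<p class="price_color">£17.46</p>
<p class="instock availability"> In stock (19 available) </p>
<p class="star-rating Five"></p>
'''

print(parse_book_html(sample))
print(parse_book_html('<h1>Only a title</h1>'))
```

To use it on a live page, you could pass `request.text` (obtained as in `parse_website()`) to `parse_book_html()`.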

In [40]:
# now combine the starting code from above with the parse_website() function

import json
import time # we need the time package for implementing a bit of waiting time

content = open('seeds.json', 'r').readlines() # let's read in the seed data

counter = 0 # initialize counter to 0

# loop through all lines of the JSON file
for line in content:
    # increment counter and check whether prototyping condition is met
    counter = counter + 1
    if counter>5: break # deactivate this if you want to loop through the entire file
        
    # convert loaded data to JSON object/dictionary for querying
    obj = json.loads(line)
    
    # show URL for which product information needs to be captured
    url = obj['product_url']
    print(f'Retrieving data for {url}.')
    
    retrieved_data = parse_website(url)
    retrieved_data['timestamp_retrieval'] = int(time.time())
    # store data (mode 'a' appends, so re-running this cell adds the same books again)
    with open('book_data.json', 'a', encoding = 'utf-8') as f:
        f.write(json.dumps(retrieved_data))
        f.write('\n')
    
    # finally, sleep for a second
    time.sleep(1)
 

Retrieving data for https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html.
Retrieving data for https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html.
Retrieving data for https://books.toscrape.com/catalogue/soumission_998/index.html.
Retrieving data for https://books.toscrape.com/catalogue/sharp-objects_997/index.html.
Retrieving data for https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html.


In [41]:
# inspect data in pandas
import pandas as pd
pd.read_json('book_data.json', lines=True)

| | title | price | instock | stars | timestamp_retrieval |
|---|---|---|---|---|---|
| 0 | A Light in the Attic | £51.77 | In stock (22 available) | Three | 2023-02-15 08:14:16 |
| 1 | Tipping the Velvet | £53.74 | In stock (20 available) | One | 2023-02-15 08:14:17 |
| 2 | Soumission | £50.10 | In stock (20 available) | One | 2023-02-15 08:14:19 |
| 3 | Sharp Objects | £47.82 | In stock (20 available) | Four | 2023-02-15 08:14:20 |
| 4 | Sapiens: A Brief History of Humankind | £54.23 | In stock (20 available) | Five | 2023-02-15 08:14:22 |
| 5 | A Light in the Attic | £51.77 | In stock (22 available) | Three | 2023-02-15 18:03:18 |
| 6 | Tipping the Velvet | £53.74 | In stock (20 available) | One | 2023-02-15 18:03:20 |
| 7 | Soumission | £50.10 | In stock (20 available) | One | 2023-02-15 18:03:21 |
| 8 | Sharp Objects | £47.82 | In stock (20 available) | Four | 2023-02-15 18:03:23 |
| 9 | Sapiens: A Brief History of Humankind | £54.23 | In stock (20 available) | Five | 2023-02-15 18:03:25 |
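The table above lists each book twice. Judging by the timestamps, the collection cell was presumably run twice, and because the file is opened in append mode (`'a'`), the second run added the same five books again. The sketch below shows one possible way to skip pages that were already collected. It assumes that each stored record also contains its `product_url` (which the solution above does not store). The names `collect_new`, `fake_parse`, and `book_data_demo.json` are hypothetical.

```python
import json
import os
import time

def collect_new(urls, parse, filename, pause=1.0):
    """Parse only URLs that are not yet stored in filename (JSON lines)."""
    done = set()
    if os.path.exists(filename):
        with open(filename, 'r', encoding='utf-8') as f:
            done = {json.loads(line).get('product_url') for line in f if line.strip()}

    added = 0
    with open(filename, 'a', encoding='utf-8') as f:
        for url in urls:
            if url in done:
                continue  # collected in an earlier run
            record = parse(url)
            record['product_url'] = url
            record['timestamp_retrieval'] = int(time.time())
            f.write(json.dumps(record) + '\n')
            done.add(url)
            added += 1
            time.sleep(pause)
    return added

# stand-in for parse_website() so that this demo runs without internet access
def fake_parse(url):
    return {'title': url.split('/')[-2]}

demo_file = 'book_data_demo.json'
if os.path.exists(demo_file):
    os.remove(demo_file)  # start the demo from a clean file

demo_urls = ['https://books.toscrape.com/catalogue/soumission_998/index.html',
             'https://books.toscrape.com/catalogue/sharp-objects_997/index.html']
first_run = collect_new(demo_urls, fake_parse, demo_file, pause=0)
second_run = collect_new(demo_urls, fake_parse, demo_file, pause=0)
print(first_run, second_run)
```

In the real scraper, you would pass `parse_website` instead of `fake_parse` and keep a pause of at least one second.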


### 3.3 Summary

At the beginning of this tutorial, we set out the promise of writing multi-page scrapers from start to finish. Although the examples we have studied are relatively simple, the same principles (seed definition, data extraction plan, page-level data collection) apply to any other website you'd like to scrape. 

But... then, there are more *advanced websites*, which we address next.

## 4. Scraping more advanced, dynamic websites

In previous tutorials, you have used the `requests` library to retrieve web data. For example, re-run the following code.



In [42]:
import requests
from bs4 import BeautifulSoup

header = {'User-agent': 'Mozilla/5.0'}
request = requests.get('https://books.toscrape.com/catalogue/sharp-objects_997/index.html', headers = header)
request.encoding = request.apparent_encoding
source_code = request.text

# save website 
f=open('simple_website.html','w',encoding='utf-8')
f.write(source_code)
f.close()

# parse some information
soup=BeautifulSoup(source_code)
soup.find('h1')

<h1>Sharp Objects</h1>

This works well for relatively simple websites, but... try the same for the homepage of Twitch!

In [43]:
request = requests.get('https://www.twitch.tv/', headers = header)
request.encoding = request.apparent_encoding
source_code = request.text
soup=BeautifulSoup(source_code)

# save website 
f=open('advanced_website.html','w',encoding='utf-8')
f.write(source_code)
f.close()

When you open `advanced_website.html` in your browser, you quickly realize there is a problem: the saved file doesn't show what you see when you visit the URL manually. This mainly has to do with how advanced a website is. Twitch is quite a dynamic site with a video player, previews, real-time updates on the number of streams, etc. Much of that content is loaded by JavaScript after the initial page load, and the `requests` library only downloads the initial source code without running any scripts.

So, we're resorting to an alternative way to retrieve data, using `selenium`.
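To illustrate the problem, the sketch below uses a made-up page (not the actual Twitch source code) in which the list of streams would only be filled in by JavaScript inside a browser. A parser that only sees the downloaded source code, as is the case with `requests`, finds the placeholder text but not the streams. The class name `VisibleText` is a hypothetical helper built on Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# made-up page: the stream list would be filled in by JavaScript in a browser
SAMPLE_HTML = """
<html><body>
  <p>Loading...</p>
  <div id="streams"></div>
  <script>
    document.getElementById('streams').textContent = 'Stream 1, Stream 2';
  </script>
</body></html>
"""

class VisibleText(HTMLParser):
    """Collects text that is present in the HTML itself, ignoring scripts."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False

    def handle_data(self, data):
        # keep non-empty text that is not part of a script
        if not self.in_script and data.strip():
            self.texts.append(data.strip())

parser = VisibleText()
parser.feed(SAMPLE_HTML)
print(parser.texts)
```

A browser would run the script and show the streams; the static source code only contains the placeholder.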

### 4.1 Making a connection to a website using Selenium

<div class="alert alert-block alert-warning"><b>Installing Selenium and Chromedriver</b> 

To install Selenium and Chromedriver locally, please follow the <a href="https://tilburgsciencehub.com/configure/python-for-scraping/?utm_campaign=referral-short">Tutorial on Tilburg Science Hub</a>.
    
You can also use the code snippet below to automate the installation. Running this snippet takes a little longer each time, but the benefit is that it almost always works!
</div>


In [44]:
# Installing and starting up Chrome using Webdriver Manager
!pip install webdriver_manager
!pip install selenium

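from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Opening the Twitch site
# (Selenium 4 expects the driver path wrapped in a Service object)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))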

url = "https://twitch.tv/"
driver.get(url)

If everything went smoothly, your computer opened a new Chrome window and loaded `twitch.tv`. 

<div class="alert alert-block alert-info"><b>Using Google Colab</b> 

If you're using Google Colab, you won't see a browser window open up.
    
Whenever you switch pages, just open that page manually in your own browser. Although this feels a little less interactive, you will still be able to work through this tutorial!

</div>

From now onwards, you can use `driver.get('https://google.com')` to point to different websites (i.e., you don't need to install it over and over again, unless you open up a new instance of Jupyter Notebook).

### 4.2 Using BeautifulSoup with Selenium


We can now also try to extract information. Note that we're converting the source code of the site to a `BeautifulSoup` object (because you may have learnt how to use `BeautifulSoup` earlier).

In [45]:
# we also need the time package to wait a few seconds until the page is loaded
import time
url = "https://twitch.tv/"
driver.get(url)
time.sleep(3)

Rather than using the "source code" obtained with the `requests` library, we can now convert the page source loaded by Selenium to a BeautifulSoup object.

In [46]:
soup=BeautifulSoup(driver.page_source)

...and start experimenting with querying the site, such as retrieving the titles of the currently active streams.

In [47]:
streams = soup.find_all('a', attrs = {'data-test-selector':"TitleAndChannel"})

# print a list of stream names
counter = 0
for stream in streams:
    counter = counter + 1
    print('Stream ' + str(counter) + ': ' + stream.get_text())


Stream 1: 🔴CLICK HERE🔴CLICK NOW🔴CLICKY CLICKY🔴NEWS BIG🔴DRAMA MEGA🔴NO VALENTINE ANDY🔴LONELY CERTIFIED CONTENT🔴BASEMENT WARLORD🔴#1 GOBLIN🔴xQc
Stream 2: HIGLIGHTS: G2 Esports vs Heroic - IEM Katowice 2023 - Grand FinalESL_CSGO
Stream 3: VCT LOCK//IN  TH vs. EG— Alpha Bracket Day 3VALORANT
Stream 4: PSN: AuzioMF - 86+ MIXED CAMPAIGN PLAYER PICKS! 🔥 !prime @AuzioMFAuzioMF
Stream 5: ADEYEMI'S ARMYdannyaarons
Stream 6: #ANALYSE 5 AVEC SLIPIXotplol_
Stream 7: [DROPS] Annie Huffley Hufflepuff playthrough - HARD mode - 100% challenges complete, 92% trophies !nordvpnAnnieFuchsia
Stream 8: freelancingsips_
Stream 9: [DROPS] CAN I GET A HOYYAHHHHHHHHH!!! hogwarts later <3 !discord !skillsharelydiaviolet
Stream 10: ❌Donaton Stream❌ !Donaton !TSamSaberi
Stream 11: DON'T HAVE MUCH TIME BUT I WANT TO STREAMFoolish_Gamers
Stream 12: 🦐 WHOLESOME ORCA IS BACK! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ~ BLO'HOLE BLAST FLAVOR RELEASE → !gg 《VTuber》!socials !gg !merchShylily
Stream 13: i r wizard | disc

Wow - this is cool. You've just learnt a second way to open websites: using `selenium`. The benefit of `selenium` is that you can work with highly dynamic websites (which also helps you avoid getting blocked). The drawback is that `selenium` is slower than just using the `requests` library, and it may sometimes be buggy on computers without a screen (which matters when you scale up your data collection).

<div class="alert alert-block alert-info"><b>Awesome stuff with Selenium</b> 

Selenium is your best shot at navigating a dynamic website. It can do amazing things, such as 
    
<ul>
    <li>"clicking" on buttons</li>
    <li>scrolling through a site</li>
    <li>hovering over items and capturing information from popups,</li>
    <li>starting to play a stream,</li>
    <li>typing text and submitting it in the chat, and</li>
    <li>so much more...!</li>
</ul>
    
Note though that we won't cover the advanced functionality of Selenium in this tutorial, but the optional "Web data advanced" tutorial holds the necessary information.
   
</div>



__Exercise 4.1__

Please write code snippets to extract the following pieces of information. Do you choose `requests` or `selenium`?

1. The titles of all `<h2>` tags from `https://odcm.hannesdatta.com/docs/course/`
2. The titles of all available TV series from `https://www.bol.com/nl/nl/l/series/3133/30291/` (about 24)

Hint for question 2: the product links can be selected with the query below.

```
soup.find_all('a', class_='product-title')
```


When using `selenium`, you also need the `time` package to wait a few seconds until the page has loaded.

```
import time
url = "https://twitch.tv/" # some example URL
driver.get(url)
time.sleep(3)
```

In [48]:
# write your solution here

In [49]:
# Solution to question 1:
header = {'User-agent': 'Mozilla/5.0'} # the user agent tells the server which browser is (seemingly) making the request
request = requests.get('https://odcm.hannesdatta.com/docs/course/', headers = header)
request.encoding = request.apparent_encoding # use the encoding detected from the page content
soup = BeautifulSoup(request.text)
for title in soup.find_all('h2'): print(title.get_text())

Instructor
Course description
Prerequisites
Teaching format
Assessment
Code of Conduct
Structure of the course
More links


In [50]:
# Solution to question 2:
driver.get('https://www.bol.com/nl/nl/l/series/3133/30291/')
time.sleep(3)
soup = BeautifulSoup(driver.page_source)

In [51]:
urls = []
for url in soup.find_all('a', class_='product-title'):
    urls.append(url.attrs['href'])
urls

['/nl/nl/p/midsomer-murders-seizoen-19-deel-2/9200000119833762/',
 '/nl/nl/p/midsomer-murders-seizoen-17/9200000132010294/',
 '/nl/nl/p/ncis-seizoen-19/9300000135569426/',
 '/nl/nl/p/fawlty-towers/9300000087454356/',
 '/nl/nl/p/sisi-seizoen-2/9300000139818897/',
 '/nl/nl/p/chicago-fire-seizoen-10/9300000123634169/',
 '/nl/nl/p/flikken-maastricht-seizoen-16/9300000096688928/',
 '/nl/nl/p/star-trek-discovery-seizoen-4/9300000127973053/',
 '/nl/nl/p/house-of-the-dragon-seizoen-1/9300000127606162/',
 '/nl/nl/p/game-of-thrones-seizoen-1-8/9300000045366024/',
 '/nl/nl/p/ncis-los-angeles-s12/9300000058801046/',
 '/nl/nl/p/star-trek-picard-seizoen-2/9300000123707493/',
 '/nl/nl/p/nachtwacht-het-donkere-spiegelbeeld/9300000128499338/',
 '/nl/nl/p/midsomer-murders-seizoen-12-deel-2/9200000132010284/',
 '/nl/nl/p/midsomer-murders-seizoen-18-deel-1/9200000132010326/',
 '/nl/nl/p/columbo-complete-collection/9200000096426621/',
 '/nl/nl/p/outlander-seizoen-6-blu-ray-import-met-nl-ondertiteling/93000

### 4.3 Using interactive elements (e.g., by clicking buttons)

__Importance__

For more dynamic websites, we may have to click on certain elements (rather than extracting some URL).

<div class="alert alert-block alert-info"><b>Extracting elements using Selenium, not BeautifulSoup</b> 

Selenium is really great for navigating dynamic websites. There are two ways in which you can use it for querying sites:
    
<ul>
    <li>put the "selenium" source code (<code>driver.page_source</code>) to BeautifulSoup, and then use BeautifulSoup commands, or </li>
    <li>directly use selenium (and its own query language) to extract elements.</li>
</ul>
    
In the next few examples, we are using selenium's "internal" query language, which you can identify easily because it is a method of the `driver` object and because it has a different name (`find_element`, instead of `find` or `find_all`).
    
Want to know more about selenium's built-in query language? Check out the "Advanced Web Scraping Tutorial", or dig up some extra material from the web. Knowing both BeautifulSoup and Selenium makes you most productive!
  
</div>

__Try it out__

If you haven't done so, rerun the installation code for `selenium` from above. Then, proceed by running the following cell and observe what happens in your browser.


In [52]:
driver.get('https://books.toscrape.com/catalogue/category/books_1/')

After a few seconds, your browser will have loaded the website in Chrome. Now, run the next cells.

In [53]:
# Step 1: Let's try locating the element
from selenium.webdriver.common.by import By
driver.find_element(By.CLASS_NAME, 'next')

<selenium.webdriver.remote.webelement.WebElement (session="381144fe48efd393c0dbb5cb4d5a4689", element="ecbeb367-1848-4d47-bdba-e7a98ed9578e")>

In [54]:
# Step 2: Finding the link within the `next` class
driver.find_element(By.CLASS_NAME, 'next').find_element(By.TAG_NAME, 'a')

<selenium.webdriver.remote.webelement.WebElement (session="381144fe48efd393c0dbb5cb4d5a4689", element="5f555662-5379-47ab-9a2a-e6e76c2ea298")>

In [55]:
# Step 3: Clicking the link!
driver.find_element(By.CLASS_NAME, 'next').find_element(By.TAG_NAME, 'a').click()

Boom! In step 3, we finally clicked on the link. Just try rerunning this cell with step 3 over and over again. Does iterating through the pages work?!

__Exercise 4.2__

Iterate through the entire set of pages, until there are no new pages left. This time, use `selenium` and click on the next page button. You can start on page 47 (`https://books.toscrape.com/catalogue/category/books_1/page-47.html`) to speed up this exercise a bit.

Make use of the `time.sleep(2)` function to make the code wait a bit after each page load.


__Solution__

In [57]:
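import time
from selenium.common.exceptions import NoSuchElementException

urls = []
driver.get('https://books.toscrape.com/catalogue/category/books_1/page-47.html')
time.sleep(2)

while True:
    try:
        # click the "next" link; this raises NoSuchElementException on the last page
        driver.find_element(By.CLASS_NAME, 'next').find_element(By.TAG_NAME, 'a').click()
        time.sleep(2) # wait for the next page to load
    except NoSuchElementException:
        break # no "next" button left: we have reached the last page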
urls

[]
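The solution above clicks through the pages but never adds anything to `urls`, which is why the output is an empty list. The sketch below shows one possible extension that records the address of every page it visits. The names `click_through_pages` and `click_next_with_selenium` are hypothetical helpers. `FakeDriver` is a stand-in with three made-up pages so that the demo runs without a browser, and the Selenium imports are placed inside the function for the same reason.

```python
import time

def click_next_with_selenium(driver):
    """Click the 'next' link; return False when the page has none."""
    # imported here so that the demo below also runs without Selenium installed
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By
    try:
        driver.find_element(By.CLASS_NAME, 'next').find_element(By.TAG_NAME, 'a').click()
        return True
    except NoSuchElementException:
        return False

def click_through_pages(driver, click_next, pause=2.0, max_pages=100):
    """Visit page after page and return the URLs of all pages seen."""
    urls = [driver.current_url]
    while len(urls) < max_pages and click_next(driver):
        time.sleep(pause)  # give the next page time to load
        urls.append(driver.current_url)
    return urls

# stand-in for a real Selenium driver, with three made-up pages
class FakeDriver:
    def __init__(self):
        self.pages = ['page-48.html', 'page-49.html', 'page-50.html']
        self.position = 0

    @property
    def current_url(self):
        return self.pages[self.position]

def fake_click_next(driver):
    # move to the next made-up page, if there is one
    if driver.position + 1 < len(driver.pages):
        driver.position += 1
        return True
    return False

visited = click_through_pages(FakeDriver(), fake_click_next, pause=0)
print(visited)
```

With a real browser, you would first call `driver.get(...)` on the starting page and then run `click_through_pages(driver, click_next_with_selenium)`.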

## After-class exercises

### Exercise 1

Extending the code written for exercise 3.2 in "Web data 101", please collect seeds from ten self-chosen product categories and store them in a file called `all_seeds.json`.

### Exercise 2

Please use the code written in exercise 3.3 in "Web Data 101" and extend it to capture more information (e.g., not only title and price, but also other attributes/data points you are interested in). In particular, try getting the product description!

Try running your code and store the product data in a JSON file called `all_books.json`.

### Exercise 3

Please complete an entire data collection project in a `.py` file, capturing data for 10 product categories and all products contained on all of the pages. You can proceed in two steps: first collect the seeds, then obtain all data. In addition, parse all retrieved data to a CSV file (with rows and columns), using `pd.read_json(filename, lines = True)` for reading in the JSON data, and the resulting DataFrame's `.to_csv(filename)` method for saving the data in tabular format.

Run your data collection from the terminal.

The final deliverables are
- `all_seeds.json`
- `all_books.json`
- `all_books.csv`
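The conversion step from new-line separated JSON to CSV can be sketched as follows. The two records and the file names `books_example.json` and `books_example.csv` are made up for illustration; your own files will contain more fields.

```python
import json
import pandas as pd

# two made-up records, written as new-line separated JSON
records = [{'title': 'Soumission', 'price': '£50.10'},
           {'title': 'Sharp Objects', 'price': '£47.82'}]
with open('books_example.json', 'w', encoding='utf-8') as f:
    for record in records:
        f.write(json.dumps(record) + '\n')

# read the JSON lines into a table and save it as CSV
df = pd.read_json('books_example.json', lines=True)
df.to_csv('books_example.csv', index=False)  # index=False leaves out the row numbers
print(df.shape)
```

Note that `to_csv` is a method of the DataFrame (here `df`), not a function of the `pd` module.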




## Backup: Executing Python Files

### Jupyter Notebooks versus editors such as Visual Studio Code, PyCharm, or Spyder

Jupyter Notebooks are ideal for combining programming and markdown (e.g., text, plots, equations), making them the default choice for sharing and presenting reproducible data analyses. Since we can execute code blocks one by one, they're suitable for developing and debugging code on the fly. 

That said, Jupyter Notebooks also have some severe limitations when using them in production environments. That's where an "Integrated Development Environment" (IDE) comes in, such as Visual Studio Code or PyCharm. Let's revisit the most important differences.

First, the order in which you run cells within a notebook may affect the results. While prototyping, you may lose sight of the top-down hierarchy, which can cause problems once you restart the kernel (e.g., a library is imported after it is being used). Second, there is no easy way to browse through directories and files within a Jupyter Notebook. Third, notebooks cannot handle large codebases or big data particularly well. 

That's why we recommend starting in Jupyter Notebooks, moving code into functions along the way, and once all seems to be running well, save your Jupyter Notebook as a `.py` file and continue working with it in Visual Studio Code.

Below, we introduce you to the IDE (here, Spyder, but VS Code looks very similar), and show you how to run Python files from the command line. 

### Introduction to Spyder
The first time, you need to click on the green "Install" button in Anaconda Navigator, after which you start Spyder by clicking on the blue "Launch" button (alternatively, type `spyder` in the terminal). 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscraping101/images/anaconda_navigator.png" width=90% align="left" style="border: 1px solid black" />


The main interface consists of three panels: 
1. **Code editor** = where you write Python code (i.e., the content of code cells in a notebook)
2. **Variable / files** = depending on which tab you choose either an overview of all declared variables (e.g. look up their type or change their values) or a file explorer (e.g., to open other Python files)
3. **Console** = the output of running the Python script from the code editor (what normally appears below each cell in a notebook)

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscraping101/images/spyder.png" width=90% align="left" style="border: 1px solid black" />

**Let's try it out!**     
Copy the solution from exercise 3.3 to a new file, called `webscraping_101.py`. To run the script you can

- click on the green play button to run all code, or
- highlight the parts of the script you want to execute and then click the run selection button.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscraping101/images/toolbar.png" width=40% align="left" style="border: 1px solid black" />

Once the script is running, you may need to interrupt the execution because it is simply taking too long or you spotted a bug somewhere. Click on the red rectangle in the console to stop the execution. 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscraping101/images/interrupt.gif" width=80% align="left" style="border: 1px solid black" />

### Run Python Files 

__For Mac and Linux users__

1. Open the terminal and navigate to the folder in which the `.py` file has been saved (use `cd` to change directories and `ls` to list all files).
2. Run the Python script by typing `python <FILENAME.py>` (e.g., `python webscraping_101.py`).

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscraping101/images/running_python.gif" width=60% align="left" style="border: 1px solid black" />

__For Windows users__

1. Open Windows explorer and navigate to the folder in which the `.py` file has been saved. Type `cmd` to open the command prompt. Alternatively, open the command prompt from the start menu (and use `cd` to change directories and `dir` to list files).
2. Activate Anaconda by typing `conda activate`.
3. Run the Python script by typing `python <FILENAME.py>` (e.g., `python webscraping_101.py`).
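When a script is run from the terminal, it is common (though not required) to put the main steps into a function and call it at the bottom of the file. The sketch below shows a possible layout for a file such as `webscraping_101.py`. The function `main` and its parameters are hypothetical placeholders for your own data collection steps.

```python
import time

def main(max_items=3, pause=0.0):
    """Hypothetical entry point: loop over a few items, pausing in between."""
    processed = []
    for item in range(max_items):
        print(f'Processing item {item + 1} of {max_items}')
        processed.append(item)
        time.sleep(pause)
    return processed

# this block only runs when the file is executed directly,
# e.g., with `python webscraping_101.py`, not when it is imported
if __name__ == '__main__':
    main()
```

This keeps the code easy to test: other files can import `main` without starting the data collection right away.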