# Webdata for Dummies (oDCM)

*The internet offers abundant possibilities to collect data that can be used in empirical research projects and provide business value. In this tutorial we start off with a gentle introduction to online data collection. What is web scraping? What does API actually stand for, and how does it differ from web scraping? All this and much more, coming up in the Webdata for Dummies tutorial!*

--- 

## Learning Objectives

Students will be able to: 
* Define what web scraping is and the issues surrounding it
* Develop strategies for identifying relevant structures in semi-structured data using browser console tools
* Utilize Python-based libraries to make requests and parse web data
* Navigate and access structured web data like HTML and JSON
* Retrieve data from platforms' application programming interfaces (APIs)


--- 

## Acknowledgements
This course draws on online resources built by Brian Keegan, Colt Steele, David Amos, Hannah Cushman Garland, Kimberly Fessel, and Thomas Laetsch. 


--- 

## Contact
For technical issues try to be as specific as possible (e.g., include screenshots, your notebook, errors) so that we can help you better.

**WhatsApp**  
+31 13 466 8938

**Email**  
odcm@uvt.nl

---

## 1. Why Learn Web Scraping?

Say that you want to store or analyze data from a website. You could manually copy and paste the data from each page, but that approach has several limitations. What if the data on the page gets updated? Or what if there are simply so many pages that you cannot do it all by hand? Web scraping helps you overcome these issues by programmatically grabbing data from the web. Before we can extract elements from a website, we need to understand how a page is built up.

--- 

## 2. HTML basics

### 2.1 Structure 
A web page consists of specifically formatted text files which serve various functions:

- `.html` (HyperText Markup Language) files give structure to a page (e.g., menu navigation, text, tables)
- `.css` (Cascading Style Sheet) files determine how the page looks (e.g., colors and fonts)
- `.js` (JavaScript) files add interactivity (e.g., page animations)

Most HTML elements are represented by a pair of tags -- an opening tag and a closing tag. A table, for example, starts with `<table>` and ends with `</table>`. The first tag tells the browser: "Hey! I got a table here! Render it as a table." The closing tag (note the forward slash!) tells the browser: "Hey! I'm all done with that table, thanks." Inside the table are nested more HTML tags representing rows (`<tr>`) and cells (`<td>`). 

```html
<html>
    <table id="example-table" class="striped-table" style="width: 95%">
        <tr> <!-- Header -->
            <td>Column A</td>
            <td>Column B</td>
        </tr>
        <tr> <!-- Row 1 --->
            <td>Row 1, Column A</td>
            <td>Row 1, Column B</td>
        </tr>
        <tr> <!-- Row 2 --->
            <td>Row 2, Column A</td>
            <td>Row 2, Column B</td>
        </tr>
    </table>
</html>
```

This is what the rendered HTML table looks like:

<html>
    <table id="example-table" class="striped-table" style="width: 95%">
        <tr> <!-- Header -->
            <td>Column A</td>
            <td>Column B</td>
        </tr>
        <tr> <!-- Row 1 --->
            <td>Row 1, Column A</td>
            <td>Row 1, Column B</td>
        </tr>
        <tr> <!-- Row 2 --->
            <td>Row 2, Column A</td>
            <td>Row 2, Column B</td>
        </tr>
    </table>
</html>


HTML elements can have any number of attributes, such as IDs, which uniquely identify elements --

```html
<table id="example-table">
```

-- classes, which identify a type of element (contrary to ids, a class may be used more than once) --

```html
<table class="striped-table">
```

-- and styles, which define how specific elements appear (e.g. the width of the table) --

```html
<table style="width:95%;">
```
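These attributes are exactly what a scraper uses to tell elements apart. As a rough illustration (not part of the course material), the sketch below reads the attributes of the example table with Python's built-in `html.parser` module. The class name `TableAttributeCollector` is made up for this example, and the table is shortened to a single row.

```python
from html.parser import HTMLParser

# A trimmed-down copy of the example table (one row only, for brevity)
html_snippet = """
<table id="example-table" class="striped-table" style="width: 95%">
    <tr>
        <td>Column A</td>
        <td>Column B</td>
    </tr>
</table>
"""


class TableAttributeCollector(HTMLParser):
    """Hypothetical helper: remembers the attributes of every <table> tag."""

    def __init__(self):
        super().__init__()
        self.tables = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "table":
            self.tables.append(dict(attrs))


collector = TableAttributeCollector()
collector.feed(html_snippet)

table_attributes = collector.tables[0]
print(table_attributes["id"])     # example-table
print(table_attributes["class"])  # striped-table
print(table_attributes["style"])  # width: 95%
```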

As you may have already noticed, we use spaces (or tabs) to separate the elements from one another (a.k.a. indentation) to provide structure and improve readability. For example, the `<table>` tag is placed farther to the right than the `<html>` tag, which indicates that the table is nested within the HTML block.

This may be a lot to take in if you're completely new to HTML, but don't worry: the goal of this section is not to teach you how to write HTML from scratch but rather to show you what HTML is and why it is relevant for web scraping.

#### Exercise 1
Double click on the rendered table above to edit the HTML structure, finish the exercises below, and run the cell again (`Shift + Enter`) to see the result of your changes.

1. Add another row in the table above so that it becomes a 2 (columns) x 4 (rows) table. That is, 3 regular rows and 1 table header row.
2. Fill the cells with the corresponding text labels (e.g., Row 3, Column A). 
3. Change the table width to `50%` so that the table becomes narrower.

--- 
### 2.2 Inspecting HTML in your browser

You can look at the HTML that makes up any web page by _inspecting the source_ in a web browser. We assume you're using Chrome; if you're using Safari or Firefox, the workflow is slightly different.

#### Inspect element

You can inspect specific elements on the page by right-clicking on the page and selecting "Inspect" or "Inspect Element" from the context menu that pops up. Hover over elements in the "Elements" tab to highlight them on the page. This can be helpful when you're trying to figure how to uniquely identify the element you want to scrape.

In this exercise we look at the HTML structure of a fictitious [online bookstore](https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html). Each of the 1000 books has its own page which shows the title, stock level, star rating, product description, and a table with other product information. Note that the prices and ratings are randomly generated and therefore the figures on your screen may deviate from the ones below.

<img src="images/inspect.png" align="left" width=80%/>

In the screenshot above I selected the book title ("A Light in the Attic"), right-clicked, and chose "Inspect". The same text is highlighted in blue in the HTML code below. The `<h1>` and `</h1>` tags surrounding the title indicate that this text is a header on the web page. Move your pointer down to the line below (`<p class="price_color">£51.77</p>`) and you'll see that the top part of the screen now highlights the price (rather than the title) of the book. This way you can easily investigate any webpage.

As we discussed earlier, tags can be nested within other tags. This also becomes clear from the screenshot below in which the small gray triangles (▶) indicate that there is code hidden within these blocks. Click on them to expand the code, see what's inside, and click again to collapse them. 

<img src="images/html_structure.png" align="left" width=80%/>

#### Exercise 2
1. Use the inspect tool to find the HTML element that constitutes the table header "**Number of reviews**" at the bottom of the [page](https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html).
2. Look up how many elements on the page are associated with the class `sub-header` (within the Inspector screen use `Ctrl+F` on a PC or `⌘+F` on Mac to search)
3. You can make local (only on your computer) changes to the web page by double clicking in the inspector and swapping the code for something else (yes, you can overwrite what's already written there!). Change the price of the book to £39.95 and assign it a 5 star-rating. What happens once you refresh the page?

<img src="images/exercise_inspector.png" width=40% align="left"  style="border: 1px solid black"/>




### 2.3 Requesting HTML
Rather than using the Inspector to look up the source code, we can use Python's `requests` library for that purpose. As this library is not loaded by default, we first need to import it. The full source code contains over 9000 characters, so we only print out the part that holds the product description.

In [1]:
import requests

# make a get request to the "A Light in the Attic" webpage
url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
book_request = requests.get(url)

# return the source code from the request object
book_source_code = book_request.text

# print out part of the source code
print(book_source_code[5710:6860])

        <div id="product_description" class="sub-header">
            <h2>Product Description</h2>
        </div>
        <p>It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to r

Now it's your turn! Use slicing in the code below so that it prints out only the title of the book, *A Light in the Attic*, without the whole shebang around it!

In [None]:
print(book_source_code[... : ...])
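If guessing the slice positions feels tedious, Python's built-in `str.find()` can look them up for you. The sketch below uses a made-up HTML string (not the real page source), and the marker strings are assumptions chosen for this illustration only.

```python
# Made-up snippet standing in for a page's source code (an assumption for illustration)
sample_source = '<div><p class="stock">In stock (22 available)</p><p>Other text</p></div>'

start_marker = '<p class="stock">'
end_marker = '</p>'

# str.find() returns the index of the first match, or -1 if there is none
start = sample_source.find(start_marker) + len(start_marker)

# start searching for the closing tag from where the text begins
end = sample_source.find(end_marker, start)

stock_text = sample_source[start:end]
print(stock_text)  # In stock (22 available)
```

This still breaks as soon as the page layout changes, which is why the parsing libraries introduced next are the more robust choice.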

As you probably noticed, it was quite a hassle to extract the right elements from the page this way. Fortunately, there is a much better way using XPath selectors and CSS locators, which we'll discuss in week 3. For now we'll give you a quick preview of what's coming up next.

### 2.4 Introduction to BeautifulSoup 
This open-source Python library allows you to navigate through and extract data from HTML files. It does NOT gather information from the web; for that we use `requests`, as we did above. So first you send a request to a page to gather the data, and then you pass the result to BeautifulSoup to extract the information.

In the code snippet below we import the package and turn the `book_source_code` (the HTML code from the "A Light in the Attic" webpage we used earlier) into a BeautifulSoup object. Once parsed, we can easily navigate the code by tag name. Since we know that the title is surrounded by `<h1>` tags (see the Chrome Inspector screenshot above), we use `soup.find('h1')` to print out the title of the book. Do you see how much easier that is?!

In [3]:
from bs4 import BeautifulSoup

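# name the parser explicitly so BeautifulSoup does not have to guess (and warn about it)
soup = BeautifulSoup(book_source_code, 'html.parser')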
print(soup.find('h1'))

<h1>A Light in the Attic</h1>


The `.find()` method always returns the first matching element it finds. For example, the web page has two `<h2>` elements which contain the "Product Description" and "Product Information" subheaders. Only the first one is returned by `.find()`:

In [4]:
print(soup.find('h2'))

<h2>Product Description</h2>


To capture all matching `<h2>` elements, use the `.find_all()` method like this:

In [5]:
print(soup.find_all('h2'))

[<h2>Product Description</h2>, <h2>Product Information</h2>]


Note that it now returns a list of elements (`[element1, element2]`), so to access individual elements you need to apply indexing: 

In [6]:
# obtain first h2 element 
print(soup.find_all('h2')[0])

# obtain second h2 element
print(soup.find_all('h2')[1])

<h2>Product Description</h2>
<h2>Product Information</h2>


Both subheaders are still surrounded by `<h2>` and `</h2>` tags. To get rid of them, append `.get_text()` to your code: 

In [7]:
# sub header without h2 tags
print(soup.find_all('h2')[0].get_text())

Product Description
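Tag names alone are often not specific enough, because a page usually contains many `<p>` or `<div>` elements. BeautifulSoup can also filter on attributes such as the class. The sketch below runs on a made-up snippet (loosely modelled on a book page, not the real source code); the `availability` class is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Made-up snippet loosely modelled on a book page (not the real source code)
sample_html = """
<div class="product">
    <h1>A Light in the Attic</h1>
    <p class="price_color">51.77</p>
    <p class="availability">In stock (22 available)</p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ (with a trailing underscore) filters on the class attribute,
# because `class` is a reserved word in Python
price_tag = sample_soup.find("p", class_="price_color")
print(price_tag.get_text())  # 51.77

# attributes of a tag can be read like dictionary keys;
# class comes back as a list since a tag may carry several classes
print(price_tag["class"])  # ['price_color']

# find_all() without a class filter returns every <p> element
paragraphs = sample_soup.find_all("p")
print(len(paragraphs))  # 2
```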


#### Exercise 4

1. Collect the HTML table that contains product information (UPC, type, price, tax, etc.) of the "A Light in the Attic" book using BeautifulSoup. The output should look like this (the stray "Â" in front of the pound sign is a character-encoding artifact): 

```html
<table class="table table-striped">
    <tr>
        <th>UPC</th><td>a897fe39b1053632</td>
    </tr>
    <tr>
        <th>Product Type</th><td>Books</td>
    </tr>
    <tr>
        <th>Price (excl. tax)</th><td>Â£51.77</td>
    </tr>
    <tr>
        <th>Price (incl. tax)</th><td>Â£51.77</td>
    </tr>
    <tr>
        <th>Tax</th><td>Â£0.00</td>
    </tr>
    <tr>
        <th>Availability</th>
        <td>In stock (22 available)</td>
    </tr>
    <tr>
        <th>Number of reviews</th>
        <td>0</td>
    </tr>
</table>
```


2. Obtain the tax amount from the table. Tip: you can chain `.find()` and `.find_all()` statements, for example: `.find('body').find_all('h1')` returns all `<h1>`s in the `body`. Your code should return the following output: 

```html
Â£0.00 
```

--- 

## 3. Application Programming Interface


### 3.1 What is an API?
Another common data collection method is using an Application Programming Interface (API). That's a mouthful, but in essence it is nothing more than a version of a website intended for computers to talk with one another. Here's an example of what the [output](https://www.reddit.com/r/science/comments/k0bjqt/study_finds_users_not_notifications_initiate_89.json) of an API may look like: 

<img src="images/api_example.png" width=70% align="left"  style="border: 1px solid black"/>


A few things stand out right away: it only contains text, organized in data structures such as lists (`[]`) and dictionaries (`{}`); there's no human interface with buttons, menus, and links; yet you can access it like any other website by entering the URL in your browser (`reddit.com/r/science/...` in this example). In fact, the API output above corresponds to the following [Reddit webpage](https://www.reddit.com/r/science/comments/k0bjqt/study_finds_users_not_notifications_initiate_89) (Reddit is an American social news aggregation and discussion site). For example, have a look at the third and fourth lines of the output above, which state the `title` of the post you see below.  

<img src="images/reddit.png" width=70% align="left"  style="border: 1px solid black"/>

If you have taken a quick look at the [API output](https://www.reddit.com/r/science/comments/k0bjqt/study_finds_users_not_notifications_initiate_89/.json), you may have come to the conclusion that making sense of raw JavaScript Object Notation (JSON) is easier said than done. Fortunately, this [plugin](https://chrome.google.com/webstore/detail/json-viewer/gbmdgpbipfallnflgajpaliibnhdgobh) automatically formats and highlights the output so that it's easier to digest. For the following exercise, we therefore highly recommend installing the Chrome plugin. Alternatively, you can copy and paste the JSON output into this [online viewer](http://jsonviewer.stack.hu) and inspect the "Viewer" tab. 
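In Python, JSON maps directly onto dictionaries and lists, so navigating it comes down to chaining keys and positions. The sketch below uses a small made-up API output; the field names are hypothetical and deliberately do NOT mirror Reddit's actual structure.

```python
import json

# Made-up API output (hypothetical field names, NOT Reddit's actual structure)
raw_output = """
{
    "post": {"title": "Example post", "score": 42},
    "comments": [
        {"author": "user_a", "text": "First!"},
        {"author": "user_b", "text": "Interesting read."}
    ]
}
"""

# json.loads() converts JSON text into Python dictionaries and lists
data = json.loads(raw_output)

# dictionaries are accessed by key ...
print(data["post"]["title"])  # Example post

# ... and lists by position (starting at 0)
print(data["comments"][1]["author"])  # user_b

# loop over a list to collect one field from every element
authors = [comment["author"] for comment in data["comments"]]
print(authors)  # ['user_a', 'user_b']
```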

#### Exercise 5
Navigate through the JSON tree structure of the post above and answer the following questions:
1. At the parent level you find two dictionaries at line 5 and 197 (i.e., the blue arrows). Collapse the content and describe in your own words what each dictionary represents. How does it relate to the Reddit HTML page? 

<img src="images/json_viewer.png" width=70% align="left"  style="border: 1px solid black"/>

2. Suppose that you want to extract the date of each comment. What path do you need to navigate? Note that times are often recorded in UTC (Coordinated Universal Time), a time standard used across the world.
3. The first comment is from the post author (fotogneric) and has gathered the most points. How many downvotes did this comment get (you find the answer in the JSON output)? 
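APIs commonly store such times as a Unix timestamp, i.e., the number of seconds since 1 January 1970 (UTC). Assuming that is what you find in the JSON output (check the value yourself; the number below is a made-up example), a minimal sketch of turning it into a readable date:

```python
from datetime import datetime, timezone

# Hypothetical timestamp: seconds elapsed since 1 January 1970 (UTC)
timestamp = 1600000000

# convert to a timezone-aware datetime in UTC
moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
print(moment.isoformat())  # 2020-09-13T12:26:40+00:00
```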



Now that you understand what APIs are, you may rightfully wonder: why should I learn APIs when I could scrape the elements from the website instead (like the book webshop)? One of the major advantages of APIs is that you can directly access the data you need without all the hassle of selecting the right HTML tags. Another advantage is that you can often customize your API request (e.g., the first 100 comments, or only posts about science), which may not always be possible in the web interface. That's why we recommend using an API whenever possible. However, in practice, you may find yourself in a scenario where there is no API to access the data you want, or the access is too limited. In these scenarios, web scraping allows you to access the data as long as it is available on a website.

In week 3, we'll teach you how to extract data from the Reddit API. For now, we'll look at a somewhat simpler example to get you started with APIs. 

### 3.2 Open an API endpoint
[icanhazdadjoke.com](https://icanhazdadjoke.com) is the largest selection of dad jokes on the internet. Every time you visit the webpage, it shows you a random joke. The reason we picked this website is that we can call the API without providing authentication tokens. This means that you can collect jokes without creating an account, and that the API does not ask for a secret key or token (i.e., a password that tells the API: "this is user X requesting data about Y!"). In general, there are limits to how often or how much data you can request. Therefore, most commercial APIs require authentication so that they can track your usage of the API and verify whether you're a paying subscriber. 

<img src="images/icanhazdadjoke.png" width=70% align="left"  style="border: 1px solid black"/>

In section 2.3 on web scraping, we requested the HTML data from the "A Light in the Attic" page and saved it to a variable `book_request`. Here we also use the `requests` library, but this time we add `headers={"Accept": "application/json"}` to our request so that we get back the output in JSON rather than raw HTML. Thereafter, we extract the JSON from the response object (`response.json()`). Each `joke_request` object is a dictionary with three keys: 
* `id` = a unique identifier for each joke
* `joke` = the text of the joke
* `status` = the HTTP status code (200 indicates a successful request)

In [8]:
# request JSON output from icanhazdadjoke API
url = "https://icanhazdadjoke.com"
response = requests.get(url, headers={"Accept": "application/json"})
joke_request = response.json() 
print(joke_request)

{'id': 'fNmOm3Ediyd', 'joke': 'What do you call a dictionary on drugs? High definition.', 'status': 200}
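Since the `status` field tells you whether the request succeeded, it can be worth checking it before using the joke. The sketch below is one possible way to do that; `extract_joke()` is a hypothetical helper (not part of the API or of `requests`), and the failing payload is made up for illustration.

```python
def extract_joke(payload):
    """Hypothetical helper: return the joke text, or None if the request was not successful."""
    if payload.get("status") != 200:
        return None
    return payload.get("joke")


# the dictionary printed above, copied by hand so this runs without an internet connection
example_payload = {
    "id": "fNmOm3Ediyd",
    "joke": "What do you call a dictionary on drugs? High definition.",
    "status": 200,
}

# made-up payload that mimics an unsuccessful request
failed_payload = {"status": 404, "message": "Not found"}

print(extract_joke(example_payload))  # What do you call a dictionary on drugs? High definition.
print(extract_joke(failed_payload))   # None
```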


#### Exercise 6
1. What happens if you run the cell above again? Why is that? 
2. You can extract the text of the joke as follows: `joke_request['joke']`. Revise the code snippet above such that it stores the text of 10 jokes in a list (tip: use a for-loop).  

### 3.3 Parameters

As mentioned earlier, it is common practice to customize requests so that the API returns the exact data you need. You have probably already done this a dozen times without even knowing it. For example, if you Google the word "cat", the results page may look something like this:

<img src="images/google.png" width=70% align="left"  style="border: 1px solid black"/>

Note how the link in the browser starts off with [`google.com/search?q=cat...`](https://www.google.com/search?q=cat). Thus, the search query `cat` is already embedded in the link itself. So rather than filling out the search box on the webpage itself, you can also tweak it in the URL directly. In a similar way, you can request `cat` jokes from the [`icanhazdadjoke.com/search`](https://icanhazdadjoke.com/search?term=cat) page with the `term` parameter:

<img src="images/search.png" width=70% align="left"  style="border: 1px solid black"/>

With this idea in mind, we can define a `search_url` and pass the `params` argument, a dictionary of parameters that further specify our request.

In [9]:
search_url = "https://icanhazdadjoke.com/search"

response = requests.get(search_url, 
                        headers={"Accept": "application/json"}, 
                        params={"term": "cat"})
joke_request = response.json()

The `joke_request` object contains a list with all cat-related jokes (`joke_request['results']`), the search term (`cat`), and the total number of jokes (`10`).
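To see how a parameter dictionary relates to the URL in your browser, the sketch below builds the query string by hand with Python's standard library. This only approximates what `requests` does for you behind the scenes; the `"cat food"` search term is a made-up example.

```python
from urllib.parse import urlencode

search_url = "https://icanhazdadjoke.com/search"
params = {"term": "cat"}

# urlencode() turns a dictionary into a query string such as "term=cat"
query_string = urlencode(params)
full_url = f"{search_url}?{query_string}"
print(full_url)  # https://icanhazdadjoke.com/search?term=cat

# values with spaces or special characters are escaped automatically,
# and multiple parameters are joined with "&"
multi_query = urlencode({"term": "cat food", "limit": 10})
print(multi_query)  # term=cat+food&limit=10
```

After a real request you can inspect `response.url` to check which URL was actually called.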

#### Exercise 7
1. Change the search term parameter to `dog` and revisit `joke_request['results']`. How many dog jokes are there? 
2. Write a function `find_joke()` that takes a query as input parameter and returns the first related joke from the `icanhazdadjoke` search API. 
3. Currently, the API provides 650 jokes in total. You can check that for yourself by passing an empty string (`""`) as a search `term`. The results object only shows you the first 20 jokes. To view the remaining 630 jokes you need pagination: rather than returning all output at once, the API divides the data into subsets which can be accessed on various pages. By default each page contains 20 jokes, where page 1 shows joke 1 to 20, page 2 jokes 21 to 40, ..., and page 33 jokes 641 to 650. You can adjust the number of results on each page (max. 30) with the `limit` parameter (e.g., `params={"term": "", "limit": 10}`). Choose different levels for limit (e.g., `10`, `20`, and `30`) and see how it affects the `total_pages` field. Why is that? 
4. You can specify the current page number with the `page` parameter (e.g., `params={"term": "", "page": 2}`). Adapt the function `find_joke()` such that it loops over all available pages and stores the jokes in a list. You can leave the `limit` parameter at its default value (20). Make sure that your function also works when you pass it a search `term`. Tip: to determine how many pages you need to loop through, you can use the `total_pages` field (e.g., there are only 10 cat jokes, so in that case 1 page would suffice). 
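The idea behind pagination can be tried out locally, without calling the API. The sketch below splits a plain Python list into pages; `total_pages()` and `get_page()` are hypothetical helpers written for this illustration, and the numbers 1 to 650 merely stand in for the jokes.

```python
import math


def total_pages(n_items, limit=20):
    """Number of pages needed to show n_items with `limit` items per page."""
    return math.ceil(n_items / limit)


def get_page(items, page, limit=20):
    """Return the items on a given page (pages are numbered from 1)."""
    start = (page - 1) * limit
    return items[start:start + limit]


# stand-in for the jokes: the numbers 1 to 650
items = list(range(1, 651))

n_pages = total_pages(len(items))
print(n_pages)  # 33

# the last page is only partly filled
last_page = get_page(items, page=n_pages)
print(last_page[0], last_page[-1])  # 641 650
```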

---