# Web data for dummies (in-class)

*The internet offers abundant possibilities to collect data that can be used in empirical research projects or provide business value. This tutorial is a gentle introduction to web scraping and APIs in Python. Get inspired now!*

## Learning Objectives

After completion of this tutorial, students will be able to:

* Explain the differences between retrieving data from websites vs. APIs
* Retrieve web data in Python using the `requests` library, and store retrieved data in HTML or JSON/TXT files for further inspection.
* Use browser control tools to develop strategies for capturing data from websites (e.g., text, numbers, pictures)
* Select elements from websites using BeautifulSoup (selectors)
* Select elements from JSON dictionaries obtained through APIs (attribute-value pairs)
* Apply programming concepts (e.g., loops, functions) to the collection of web data

---
<div class="alert alert-block alert-info"><b>Support Needed?</b> 
    For technical issues outside of scheduled classes, please check the <a href="https://odcm.hannesdatta.com/docs/course/support" target="_blank">support section</a> on the course website.
</div>

---

## 1. Web Scraping

### 1.1 What is web scraping?

Say that you want to capture and analyze data from a website. You could simply copy-paste the data from each page, but doing this job by hand has severe limitations. What if the data on the page gets updated (i.e., would you have time to copy-paste the new data, too)? Or what if there are simply so many pages that you can't possibly do it all by hand (e.g., thousands of product pages)? 

Web scraping can help you overcome these issues __by programmatically extracting data from the web__. Before we can extract/grab/capture/scrape information from a website, we need a bit of background on how websites work technically, so let's focus on that first.


### 1.2. How websites work

#### Importance

It's vital to take some time to get familiar with HTML, the primary markup language used to build websites. Once we're familiar with HTML (and the structure of websites), we can rapidly navigate complex websites to extract the elements we're interested in (e.g., price, product categories, ...). In other words: to reach our end goal, we do have to give you some technical details first.

So, here we go: A web page consists of various text files, each with its own style, formatting, and syntax. These files each serve a specific purpose:

- `.html` (HyperText Markup Language) files give structure to a page (e.g., where's the menu?, which content to show (e.g., text, tables)?)
- `.css` (Cascading Style Sheet) files determine how the page looks (e.g., which color do headers have? what's the font used for text in a paragraph?)
- `.js` (JavaScript) files add interactivity (e.g., button animations)

#### Let's try it out
Check out this simple [example](https://codepen.io/rcyou/pen/QEObEk/). The site shows the source code of a site (`.html`, `.css`, and `.js`) in an online editor, along with a rendered ("viewable") version of the site. Once you make changes to the code, the site gets automatically updated.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webdata-for-dummies/images/codepen.png" align="left" width=60%/>


#### Exercise 1 
Just to get a feeling for how things work, let's make the following changes in the [CodePen snippet](https://codepen.io/rcyou/pen/QEObEk/): 
1. Change the text between the `<h1>` tags to `I am a purple of size 3em`. 
2. Change the `h1` font-size to `3em` and the color to purple (add `color: purple;` below `margin-bottom`).  
3. Remove the JavaScript code. What happens now when you click the blue button?

---

#### Solutions
Clicking the button should no longer trigger the script to hide the paragraph text.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webdata-for-dummies/images/purple_headline.png" align="left" width=60%/>



--- 
### 1.3 Advancing HTML skills

#### Importance

Most HTML elements are represented by a pair of tags - an opening tag and a closing tag. 

For example, a table starts with `<table>` and ends with `</table>`. 

The first tag tells the browser: "Hey! I got a table here! Render it as a table, so it displays nicely on the site." 

The closing tag (note the forward-slash!) tells the browser: "Hey! I'm all done with that table, thanks." Inside the table are nested more HTML tags representing rows (`<tr>`) and cells (`<td>`).


```html
<html>
    <table id="example-table" class="striped-table" style="width: 95%">
        <tr> <!-- Header row -->
            <td>Column A</td>
            <td>Column B</td>
        </tr>
        <tr> <!-- Row 1 --->
            <td>Row 1, Column A</td>
            <td>Row 1, Column B</td>
        </tr>
        <tr> <!-- Row 2 --->
            <td>Row 2, Column A</td>
            <td>Row 2, Column B</td>
        </tr>
    </table>
</html>
```

This is what the rendered HTML table looks like:

<html>
    <table id="example-table" class="striped-table" style="width: 95%">
        <tr> <!-- Header -->
            <td>Column A</td>
            <td>Column B</td>
        </tr>
        <tr> <!-- Row 1 --->
            <td>Row 1, Column A</td>
            <td>Row 1, Column B</td>
        </tr>
        <tr> <!-- Row 2 --->
            <td>Row 2, Column A</td>
            <td>Row 2, Column B</td>
        </tr>
    </table>
</html>


HTML elements can have any number of

- __attributes__, such as IDs, which *uniquely* identify elements

```html
<table id="example-table">
```

- __classes__, which identify a *type* of an element (contrary to ids, a class can be used more than once)

```html
<table class="striped-table">
```

- and __styles__, which define how specific elements *appear* (e.g. the width of the table)

```html
<table style="width:95%;">
```
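
Tags and attributes are not only readable for humans; a program can pick them up, too. The minimal sketch below uses Python's built-in `html.parser` module (not used elsewhere in this tutorial; the class name `TagCollector` is made up for this illustration) to list every opening tag of a shortened version of the example table, together with its attributes.

```python
from html.parser import HTMLParser

# shortened version of the example table from this section
example_html = """
<html>
    <table id="example-table" class="striped-table" style="width: 95%">
        <tr>
            <td>Column A</td>
            <td>Column B</td>
        </tr>
    </table>
</html>
"""


class TagCollector(HTMLParser):
    """Collects every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.found_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g., [('id', 'example-table')]
        self.found_tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed(example_html)

# print each tag name followed by a dictionary of its attributes
for tag, attributes in collector.found_tags:
    print(tag, attributes)
```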

As you may already have noticed, we use spaces (or tabs) to separate the elements from one another (the geeks among us will call this "indentation") to provide structure and improve readability.

Yes, that's right. *Improve readability*.

Code may look complex to read at first, but when you take a closer look at it, it boils down to simple English, following a particular structure (also known as syntax).

For example, the fact that the `<table>` tag is placed farther to the right than the `<html>` tag indicates that the table is nested within the HTML block.

This may be a lot to take in if you're entirely new to HTML, but don't worry, as the goal of this section is not to teach you how to code from scratch but rather to teach you what HTML is and why it is relevant for web scraping.


__Let's try it out__

Double-click on the rendered table below to edit its HTML structure. Try to change some simple things, e.g., the text. Re-run the cell (Shift + Enter, or click the Run button in Jupyter Notebook). Watch your changes come alive!


#### Exercise 2

Please finish the exercises below. After each change, rerun the cell.

__Important:__ don't make too many changes at once. Always proceed in small steps to see whether your code (still) works!

1. Add another row in the table above to become a 2 (columns) x 4 (rows) table. That is 3 regular rows and 1 table header row.
2. Fill the cells with the corresponding text labels (e.g., Row 3, Column A). 
3. Change the table width to `50%` so that the table becomes narrower.

*Make your changes here:*

<html>
    <table id="example-table" class="striped-table" style="width: 95%">
        <tr> <!-- Header -->
            <td>Column A</td>
            <td>Column B</td>
        </tr>
        <tr> <!-- Row 1 --->
            <td>Row 1, Column A</td>
            <td>Row 1, Column B</td>
        </tr>
        <tr> <!-- Row 2 --->
            <td>Row 2, Column A</td>
            <td>Row 2, Column B</td>
        </tr>
    </table>
</html>

**Solutions**
<html>
    <table id="example-table" class="striped-table" style="width: 50%">
        <tr> <!-- Header -->
            <td>Column A</td>
            <td>Column B</td>
        </tr>
        <tr> <!-- Row 1 --->
            <td>Row 1, Column A</td>
            <td>Row 1, Column B</td>
        </tr>
        <tr> <!-- Row 2 --->
            <td>Row 2, Column A</td>
            <td>Row 2, Column B</td>
        </tr>
        <tr> <!-- Row 3 --->
            <td>Row 3, Column A</td>
            <td>Row 3, Column B</td>
        </tr>
    </table>
</html>

--- 
### 1.4 Finding content in a website's source code

#### Importance

Alright, we've now covered tables (`<table>`). However, there are many different tags in HTML, and it's impossible to memorize all of them. That's why developers use a pretty handy tool to *inspect the source* of a website directly in the browser. From now onwards, we recommend using Chrome (in Safari and Firefox, things look slightly different, and we can't cover those, unfortunately).

Suppose you have identified an element you want to capture (e.g., a price or the name of a product). You can "ask" your browser for the specific HTML tag of that object (so it becomes easier to capture that element later). 

#### Let's try it out 


How does it work? Start by inspecting specific elements on the page by *right-clicking on the page* and selecting __"Inspect"__ from the context menu that pops up. Then, hover over elements in the "Elements" tab to highlight them on the page. This can be super helpful when you're trying to figure out how to (uniquely) identify the element you want to scrape.

Check out the HTML structure of this fictitious [online bookstore](https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html). Each of the 1000 books has its own page, which shows the title, stock level, star rating, product description, and a table with other product information. Note that the prices and ratings are randomly generated, and therefore the figures on your screen may deviate from the ones below.


<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webdata-for-dummies/images/inspect.gif" align="left" width=60%/>

In the screenshot above, we've selected the book title ("A Light in the Attic"), right-clicked on it, and chosen "Inspect." The same text is highlighted in blue in the HTML code below. 

Try it out yourself!

The `<h1>` and `</h1>` tags surrounding the title indicate that this text is a header on the web page. Move your pointer down to the line below (`<p class="price_color">£51.77</p>`), and you'll see that on the top screen, it now highlights the price (rather than the title) of the book. This way, you can quickly investigate any webpage. 

Also, try this out yourself!

As we discussed earlier, tags can be nested within other tags. This also becomes clear from the screenshot below, in which the small gray triangles (▶) indicate that there is code hidden within these blocks. Click on them to expand the code, see what's inside, and click again to collapse them.


<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webdata-for-dummies/images/html_structure.png" align="left" width=70%/>

#### Exercise 3
1. Use the inspect tool to find the HTML element that constitutes the table header "**Number of reviews**" at the bottom of this [page](https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html).
2. Look up how many elements on the page are associated with the class `sub-header` (within the Inspector screen, use `Ctrl+F` on a PC or `⌘+F` on Mac to search)
3. You can make local (only on your computer) changes to the web page by double-clicking in the inspector and swapping the code for something else (yes, you can overwrite what's already written there!). Change the price of the book to £39.95 and assign it a five-star rating. 
4. After making the changes in 3.), refresh the page (reload it). What happens (and why)?


*A "faked" price and star-rating*

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webdata-for-dummies/images/exercise_inspector.png" width=40% align="left"  style="border: 1px solid black"/>


In [None]:
# your answer goes here!

#### Solutions
1. The `<th>` (table header) tags enclose the text "Number of reviews." 
2. Three elements are associated with the class `sub-header` (product description, product information, reviews)
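3. The price can be changed by overwriting the text inside the `price_color` element, and the star rating by changing the class attribute to `star-rating Five`.
4. Once you refresh the page, the original (unedited) price and star rating appear again, because your edits only changed the local copy in your browser, not the page stored on the web server.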
---


### 1.5 Loading a website's source code into Python

#### Importance

Alright. Up to this moment, we've learned about HTML and fiddled around with a website's source code. But we finally want to understand how we can load a website's source code into Python.

Rather than (manually) using the Inspector, we now automate these tasks using Python's `requests` library. Libraries are "extensions" to Python, but most of them are not loaded by default. So let's import the library using `import requests`.

The total source code of the website, by the way, contains over 9000 characters. Therefore, the output below shows only the beginning of it.


#### Let's try it out

Please (re)run the code cell below.

In [31]:
import requests

# make a get request to the "A Light in the Attic" webpage
url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
user_agent = {'User-agent': 'Mozilla/5.0'} # the user agent tells the web server which browser (here: a Mozilla-compatible one) is requesting the page
book_request = requests.get(url, headers = user_agent)
book_request.encoding = 'utf-8' # the page declares UTF-8 in its <meta> tag; setting it here avoids garbled characters such as "Â£"

# return the source code from the request object
book_source_code = book_request.text

# print out the first 1,000 characters of the source code to verify you have loaded the correct page
print(book_source_code[:1000])



<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    A Light in the Attic | Books to Scrape - Sandbox
</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
        <meta name="created" content="24th Jun 2016 09:29" />
        <meta name="description" content="
    It&#39;s hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein&#39;s humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and lov

#### Exercise 4
1. Change the code snippet above to download the website for [this product](https://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html).
2. Write a little code snippet that stores the website's raw code in a file called `website.html`. Recall that you can use previously learnt concepts, specifically `f = open(<filename>, 'w', encoding='utf-8')`, `f.write()`, and `f.close()`.
3. Write a loop to store the raw HTML source code for the first four books on [this page](https://books.toscrape.com/catalogue/category/books_1/index.html). Before starting, create a list of dictionaries with URLs and filenames to store the websites.

```
urls = [{'url': 'first_url',
         'filename': 'filename1.html'},
        {'url': 'second_url',
         'filename': 'filename2.html'}]
```
         
         


In [32]:
# start with your code here

#### Solutions 

In [33]:
# 
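
One possible solution, written as a minimal sketch. The function names `download_source_code`, `save_source_code`, and `download_and_save` are made up for this illustration, and the list `urls` only contains the two book pages linked in this tutorial; for question 3, replace its entries with the four URLs you find on the category page.

```python
import requests

# identify ourselves as a regular browser, as in the example above
user_agent = {'User-agent': 'Mozilla/5.0'}

# questions 1 and 3: a list of dictionaries with URLs and filenames
urls = [{'url': 'https://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html',
         'filename': 'website.html'},
        {'url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
         'filename': 'a-light-in-the-attic.html'}]


def download_source_code(url):
    """Download a web page and return its source code as text."""
    page_request = requests.get(url, headers=user_agent)
    return page_request.text


def save_source_code(source_code, filename):
    """Question 2: store the raw source code in a file."""
    f = open(filename, 'w', encoding='utf-8')
    f.write(source_code)
    f.close()


def download_and_save(pages):
    """Question 3: loop over the list and store each page in its own file."""
    for page in pages:
        source_code = download_source_code(page['url'])
        save_source_code(source_code, page['filename'])
```

Calling `download_and_save(urls)` downloads each page and writes it to the filename listed next to it.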

---
### 1.6 Extracting information from a website's source code using `BeautifulSoup` 

#### Importance

It's useful to store raw data from websites (we will make use of this a lot). But how can we extract specific information from a website, such as a product's title or price? 

Fortunately, we can make use of the *structured* nature of HTML, by selecting information on the basis of:
 
- tag names (e.g., `<h1>`, `<table>`), or
- attributes such as IDs (e.g., `<table id="example-table">`), or class names (e.g., `<table class="striped-table">`)
    
For now, we'll show you how to apply these concepts using *BeautifulSoup*, a fantastic Python library that allows you to navigate and extract data from HTML files. BeautifulSoup does NOT gather information from the web itself (for this, we still use `requests`, as above). 

#### Let's try it out
First, we import the package `BeautifulSoup` and turn the `book_source_code` (the HTML code from the "A Light in the Attic" webpage we used earlier) into a BeautifulSoup object. Once converted, we can easily navigate the code by *tag names*, *attribute names*, or *class names*.

Since we know that the title is surrounded by `<h1>` tags (see the Chrome Inspector screenshot above), we use `soup.find('h1')` to print out the title of the book.

Please run the following cells to see things in action!


In [13]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(book_source_code, 'html.parser') # name the parser explicitly to avoid a warning
print(soup.find('h1'))

<h1>A Light in the Attic</h1>


The `.find()` method always returns the first matching element that it finds. For example, the web page has two `<h2>` elements which contain the "Product Description" and "Product Information" subheaders.  Only the first one will be returned by `.find()`:

In [14]:
print(soup.find('h2'))

<h2>Product Description</h2>


To capture __all__ matching `<h2>` elements you use the `find_all()` method like this:

In [6]:
print(soup.find_all('h2'))

[<h2>Product Description</h2>, <h2>Product Information</h2>]


Note that it now returns a list of elements (`[element1, element2]`), so to access individual elements you need to apply indexing: 

In [15]:
# obtain first h2 element 
print(soup.find_all('h2')[0])

# obtain second h2 element
print(soup.find_all('h2')[1])

<h2>Product Description</h2>
<h2>Product Information</h2>


Both subheaders are still surrounded by `<h2>` and `</h2>` tags. To get rid of them, append `.get_text()` to your code:

In [16]:
# sub header without h2 tags
print(soup.find_all('h2')[0].get_text())

Product Description


#### Exercise 5

1. Retrieve the website's source code, and capture the following information:
    - Product title,
    - price,
    - in-stock availability, and
    - the number of stars.
    
__Tips:__

- To extract information using a class, use the `class_` argument of the `find()` function.

```
soup.find(class_ = 'class_name_to_find')
```

- You can also extract information by *counting* the number of matching elements. For example, ```len(soup.find_all('h2'))``` returns the number of `h2` elements on a site.

In [None]:
# your answer goes here!

#### Solutions

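In [26]:
# title
print(soup.find('h1').get_text())
# price
print(soup.find(class_='price_color').get_text())
# in-stock availability (.strip() removes the surrounding whitespace)
print(soup.find(class_='instock availability').get_text().strip())
# number of stars: the rating is the second class name of the element (e.g., "star-rating Three");
# counting the `icon-star` elements does not work, as every rating is drawn with five of them
print(soup.find(class_='star-rating')['class'][1])

A Light in the Attic
£51.77
In stock (22 available)
Three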


### 1.7 Writing your complete first web scraper

__Exercise 6__

Now it's time to put into action everything we have learnt so far.

- Use the list of four URLs specified above.
- Write one loop that stores the raw website data in separate HTML files (storing raw data for diagnostic purposes is very helpful!).
- In the same loop, store the extracted information (title, price, in-stock availability, and number of stars) in a dictionary, and write each dictionary to a new-line separated JSON file.

In [29]:
# write your code here


__Solution__
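
A possible solution is sketched below. It assumes the `urls` list of dictionaries from Exercise 4; the function names and the output filename `books.json` are made up for this illustration. The number of stars is read from the second class name of the `star-rating` element (e.g., `star-rating Five`, as seen in Exercise 3).

```python
import json

import requests
from bs4 import BeautifulSoup


def extract_book_info(source_code):
    """Extract title, price, availability and star rating from a book page."""
    soup = BeautifulSoup(source_code, 'html.parser')
    # the class attribute is returned as a list, e.g., ['star-rating', 'Three']
    rating_classes = soup.find(class_='star-rating')['class']
    return {'title': soup.find('h1').get_text(),
            'price': soup.find(class_='price_color').get_text(),
            'availability': soup.find(class_='instock availability').get_text().strip(),
            'stars': rating_classes[1]}


def scrape_books(pages, output_filename='books.json'):
    """Store the raw HTML of every page and write one JSON dictionary per line."""
    with open(output_filename, 'w', encoding='utf-8') as output_file:
        for page in pages:
            page_request = requests.get(page['url'], headers={'User-agent': 'Mozilla/5.0'})
            page_request.encoding = 'utf-8'  # the pages declare UTF-8 in their <meta> tag
            source_code = page_request.text

            # keep the raw data for diagnostic purposes
            with open(page['filename'], 'w', encoding='utf-8') as html_file:
                html_file.write(source_code)

            # one dictionary per book, each on its own line
            book_info = extract_book_info(source_code)
            output_file.write(json.dumps(book_info, ensure_ascii=False) + '\n')
```

Calling `scrape_books(urls)` creates one HTML file per book plus `books.json`, in which every line holds the dictionary of one book.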

### 1.8 Wrapping up

Congrats! You've just learned the first steps in collecting online data from websites! Along with boosting your "geek"-factor (wait till you show this to your friends!), you've gained an intuition on how websites are built up (HTML, CSS, JS), how source code translates into a rendered (visual) website (or, in other words, you know how to spoof websites - now, take screenshots and show *that* to your friends...), how websites can be loaded into Python, and how you can use `BeautifulSoup` to extract information using tag or attribute names and classes.

--- 

## 2. Application Programming Interface (API)


### 2.1 What is an API?

An equally important data collection method is the use of an Application Programming Interface (API). That's a mouthful, but in essence, it is nothing more than a version of a _website intended for computers, rather than humans, to talk to one another_. 

APIs are everywhere, and most are used to provide...
- data (e.g., retrieve a user name and demographics), 
- functions (e.g., start playing music from Spotify, turn on your lamps in your "smart home"), or 
- algorithms (e.g., submit an image, retrieve a written text for what's *on* the image).

In what follows, we'll introduce you to the API of [Reddit](https://www.reddit.com), a popular American social news aggregation and discussion site that is sometimes described as the *front page of the internet*. Reddit gives you an up-to-date view of what's happening around the world and is based on the principle that the community of around 1 billion users decides what is newsworthy and what's not through a voting system. 

You can think of Reddit upvotes as Facebook likes. Posts are arranged based on the number of votes, and those with many upvotes are featured on the homepage. The grey number next to each post represents the sum of votes (= upvotes - downvotes).


<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webdata-for-dummies/images/reddit_homepage.png" width=60% align="left"  style="border: 1px solid black"/>



### 2.2 How APIs work

__Importance__

APIs work very similarly to websites. At the core, instead of obtaining the source code of a (rendered) website, you obtain the content of a website in a format that computers can easily process. APIs provide you with simpler and more scalable ways to obtain data, so you really have to understand how they work.

__Let's try it out__

Consider the screenshot above (a view of the Reddit website). Here's an example of what the output of the Reddit API looks like (click on the link to view it in your browser):

https://www.reddit.com/r/science/comments/k0bjqt/study_finds_users_not_notifications_initiate_89.json

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webdata-for-dummies/images/api_example.png" width=60% align="left"  style="border: 1px solid black"/>

*A few things stand out right away:*

- the output only contains text, which is structured according to a data structure (e.g., array or list (`[]`) and dictionary (`{}`)), 
- there's no human interface with buttons, menus, and links, yet...
- you can access it like any other website by filling out the URL in your browser (`reddit.com/r/science/...` in this example).

In fact, the API output above corresponds to the (visual) Reddit thread, which you can open here:

https://www.reddit.com/r/science/comments/k0bjqt/study_finds_users_not_notifications_initiate_89

For example, look at the third and fourth lines of the API output above, which state the `title` of the post that you can also see below on the rendered website.  

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webdata-for-dummies/images/reddit.png" width=60% align="left"  style="border: 1px solid black"/>



*Tip:*
If you have taken a look at the API output, you may conclude that making sense of raw JavaScript Object Notation (JSON) is easier said than done. Fortunately, this [plugin](https://chrome.google.com/webstore/detail/json-viewer/gbmdgpbipfallnflgajpaliibnhdgobh) automatically formats and highlights the output such that it's easier to digest. For the following exercise, we therefore highly recommend installing the Chrome plugin. 

Install it, and view the output again. That's much better, right? (Alternatively, you can copy-paste the JSON output into this [online viewer](http://jsonviewer.stack.hu) and inspect the "Viewer" tab).
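
If you prefer to stay in Python, the built-in `json` module offers a similar view. The minimal sketch below uses a small, hand-made JSON string (not real Reddit output; the keys are chosen for illustration only) to show how JSON text maps to Python lists and dictionaries, and how it can be printed in an indented, readable form.

```python
import json

# a small, hand-made JSON string (not real API output)
raw_json = ('[{"kind": "post", "data": {"title": "Example title", "ups": 10}}, '
            '{"kind": "comments", "data": {"children": []}}]')

# json.loads() turns the text into Python lists ([]) and dictionaries ({})
parsed = json.loads(raw_json)
print(type(parsed))     # the outer level is a list
print(type(parsed[0]))  # each element of that list is a dictionary

# json.dumps() with indent=4 turns it back into text, with one key per line
print(json.dumps(parsed, indent=4))
```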


#### Exercise 6

Navigate through the JSON tree structure of the post above and answer the following questions:

1. At the parent level you find two dictionaries at line 5 and 197 (i.e., the blue arrows). Collapse the content and describe in your own words what each dictionary represents. How does it relate to the Reddit HTML page? 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webdata-for-dummies/images/reddit_api.gif" width=70% align="left"  style="border: 1px solid black"/>

2. The first comment is from the post author (fotogneric) and has gathered the most points. How many downvotes did this comment get (you find the answer in the JSON output)? 

3. Suppose that you want to extract the date and time each comment was created. What path do you need to navigate? 

(Note that times are often stored as Unix time (also known as epoch time): the number of seconds elapsed since January 1, 1970 (UTC). Because it does not depend on time zones, it can be used as a universal time scale around the world. Copy-paste the timestamp into an online [epoch converter](https://www.epochconverter.com), and check whether it corresponds with the date and time on the webpage.)



**Your answer**  

...

#### Solutions
1. The dictionary that starts at line 5 contains data on the post (title, subreddit, upvote ratio, thumbnail/image, link to article). The other dictionary stores the comments of the post (author, body text, timestamp). 
2. At the moment of writing this solution (December 2020) the post has 0 downvotes (`'downs': 0`).
3. The `created` key stores a large number (e.g., 1606274053) that can be translated into a date and time (for this example: 25 November 2020 03:14 GMT). The corresponding path for the timestamp of the first comment is: `request[1]['data']['children'][0]['data']['created']` (a written description that follows these directions also suffices: first, you take the 2nd element `[1]` in the list, then you choose the `data` key, etc.).
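
The path from solution 3 can also be tried out in Python. The sketch below uses a small hand-made stand-in for the API output (only the keys needed for the path are included, and the title and author name are invented) and converts the timestamp with the built-in `datetime` module.

```python
import json
from datetime import datetime, timezone

# hand-made stand-in for the API output: a list with two dictionaries
# (the post and its comments); only the keys needed for the path are included
raw_json = """
[
    {"data": {"children": [{"data": {"title": "Example post"}}]}},
    {"data": {"children": [{"data": {"author": "example_user", "created": 1606274053}}]}}
]
"""
request = json.loads(raw_json)

# follow the path step by step:
# 2nd element of the list -> 'data' -> 'children' -> 1st comment -> 'data' -> 'created'
created = request[1]['data']['children'][0]['data']['created']

# convert the seconds elapsed since January 1, 1970 into a date and time (UTC)
created_datetime = datetime.fromtimestamp(created, tz=timezone.utc)
print(created_datetime.strftime('%Y-%m-%d %H:%M'))  # prints 2020-11-25 03:14
```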
---

### 2.2 Retrieve data from an API endpoint

__Importance__

We now proceed to *downloading* data from an API. But since APIs are mostly provided via paid subscriptions, all we can offer here is a nerdy example *of a free API*, which - unfortunately - isn't even remotely related to the field of marketing. 

We proudly present to you... __[icanhazdadjoke.com](https://icanhazdadjoke.com)!__ ...which is the largest selection of *dad jokes on the internet* (yes, we also didn't know that existed!).

---

*Background: Free versus paid APIs*

Paid APIs require their users to authenticate themselves. Think of an authentication key as a "key to unlock the service." Web services use such authentication keys to track whether you're allowed to use the API and how much you use it. This offers numerous opportunities for API business models, in which, for example, the service employs a pay-by-request (or by 1,000 requests) model.

---

*Back to Icanhazdadjoke.com...*

We picked this web site because we don't have to use any authentication token, and there's no limit to retrieving data. But, how does the website work, and why is there an API?

Every time you visit the site, the site shows a *random joke*. From a technical perspective, each time a user opens the site, a little software program on the server makes an API call to the dad joke API to draw a new joke to be displayed. The designers have split the showing of information (website) from the actual content (the jokes, available through the API). This offers the opportunity to provide the data in two ways: an excellent visual representation of dad jokes (the website) and a service for drawing jokes programmatically to embed in other software products.

Sounds familiar? Yep! Facebook and Instagram do precisely the same. Instead of tying in their technology with the website, they have split the visual representation from the actual content. This allows social media networks to monetize their data in other ways (e.g., by having advertisers programmatically access the Facebook API to learn about potential targets for their ad campaigns).

__Let's try it out__

Try generating a [random joke](http://icanhazdadjoke.com) on the website... (click the link)


<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webdata-for-dummies/images/icanhazdadjoke.gif" width=70% align="left"  style="border: 1px solid black"/>


Can we retrieve that data via the service's API?

As with web scraping (where we stored the response for the bookstore page in the `book_request` variable), we proceed with the `requests` library. The only difference is that we need to add a so-called header, so that the API knows it's talking to a computer (in Python: `headers={"Accept": "application/json"}`).

Each `joke_request` response from the API contains three attributes: 
* `id` = a unique identifier for each joke
* `joke` = the text of the joke
* `status` = the HTTP status code (200 indicates a successful request)

Try to spot those attributes in the printed JSON response!

In [16]:
# request JSON output from icanhazdadjoke API
url = "https://icanhazdadjoke.com"
response = requests.get(url, headers={"Accept": "application/json"})
joke_request = response.json() 
print(joke_request)

{'id': 'AQn3wPKeqrc', 'joke': 'It was raining cats and dogs the other day. I almost stepped in a poodle.', 'status': 200}


#### Exercise 7
1. What happens if you run the cell above again? Why is that? 
2. Turn off your WiFi and try running the cell again. What happens this time? 
3. Turn on your WiFi again and revise the code snippet above, so that it stores the text of 10 jokes in a list. You can extract the text of the joke as follows: `joke_request['joke']` (tip: use a for-loop). 

In [None]:
# your answer goes here!

#### Solutions
1. Another random joke is generated, so the `id` and `joke` change every time. 
2. A connection error occurs because the `requests` package could not establish a connection with the API.


In [15]:
# Question 3
url = "https://icanhazdadjoke.com"
jokes = [] 

for counter in range(10):
    response = requests.get(url, headers={"Accept": "application/json"})
    joke_request = response.json() 
    jokes.append(joke_request['joke'])  # store only the text of the joke

### 2.3 APIs versus web scrapers

Now that you understand what APIs are, you may rightfully wonder: why should I learn APIs when I could scrape the elements from the website instead (like the book webshop)?

One of the major advantages of APIs is that you can directly access the data you need *without all the hassle of selecting the right HTML tags*. Another advantage is that you can often customize your API request (e.g., the first 100 comments or only posts about science), which may not always be possible in the web interface. 
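
Such customizations are typically passed as query parameters that are appended to the URL. With `requests`, they are usually handed over via the `params` argument of `requests.get()`; to show the resulting URL without making a request, the sketch below builds it with Python's built-in `urllib.parse`. The parameter name `limit` is an assumption for illustration; check the documentation of the API you use for the parameters it actually supports.

```python
from urllib.parse import urlencode

# the Reddit thread used earlier in this tutorial
base_url = 'https://www.reddit.com/r/science/comments/k0bjqt/study_finds_users_not_notifications_initiate_89.json'

# hypothetical customization: ask for at most 100 comments
# ('limit' is an assumed parameter name, see the note above)
parameters = {'limit': 100}

# query parameters follow the URL after a question mark, as name=value pairs
request_url = base_url + '?' + urlencode(parameters)
print(request_url)
```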

Last, using APIs is legitimized by a web site (mostly, you will have to pay a license fee to use APIs!). So it's a more stable and legit way to retrieve web data compared to web scraping. That's also why we recommend using an API whenever possible. 

In practice, though, APIs really can't give you all the data you possibly want, and web scraping allows you to access complementary data (e.g., viewable on a website or somewhere hidden in the source code).


### 2.4 Wrap-up

After finishing this section, you should not only have learned a couple of new dad jokes, but you should also be able to explain to your parents what an API is and give practical examples of how consumers - as well as companies - may benefit from using them. Furthermore, you should be able to request data from various APIs  (even if it's one you have never seen before!).

---