# Webdata for Dummies (oDCM)

*The internet offers abundant possibilities to collect data that can be used in empirical research projects and provide business value. In this tutorial we start off with a gentle introduction to online data collection. What is web scraping, and how does it differ from APIs? Get inspired now - a working scraper in less than an hour! Go for it!*

--- 

## Learning Objectives

After completion of this tutorial, students will be able to:

* Explain the differences between websites and APIs
* Retrieve web data in Python (`requests`), and save request results in HTML or JSON/TXT files for further inspection
* Develop strategies for capturing data (e.g., text, numbers, pictures) from websites, using browser console tools
* Select elements from websites using selectors in BeautifulSoup
* Select elements from the JSON result set of APIs using attribute-value pairs

--- 

## Acknowledgements
This course draws on a variety of online resources which can be retrieved from the [course website](https://odcm.hannesdatta.com/#student-profile--prerequisites). 


--- 

## Support Needed?
For technical issues outside of scheduled classes, please check the [support section](https://odcm.hannesdatta.com/docs/course/support) on the course website.

---

## 1. Web Scraping

### 1.1 Why Learn Web Scraping?

Say that you want to store or analyze data from a website. You could manually copy and paste the data from each page, but that has several limitations. What if the data on the page gets updated? Or what if there are simply so many pages that you cannot do it all by hand? Web scraping can help you overcome these issues by programmatically grabbing data from the web. Before we can extract elements from a website, we need to understand how a page is built up. Once we're familiar with the page structure, we can quickly navigate towards the elements on the page we need for our scraping project (e.g., price or product category). Thus, understanding HTML (the markup language in which web pages are written) is vital to getting started with web scraping. 

--- 

### 1.2 HTML structure

A web page consists of specifically formatted text files which serve various functions:

- `.html` (HyperText Markup Language) files give structure to a page (e.g., menu navigation, text, tables) 
- `.css` (Cascading Style Sheet) files determine how the page looks (e.g., colors and fonts)
- `.js` (JavaScript) files add interactivity (e.g., button animations)

Here's a simple [example](https://codepen.io/rcyou/pen/QEObEk/) of a header (`<h1>`), a paragraph (`<p>`), and a button that hides the paragraph once clicked. The HTML, CSS, and JavaScript code are available in the online editor, and the output of the code is rendered below it. Whenever you make a change, the output is updated automatically.

<img src="images/codepen.png" align="left" width=60%/>

#### Exercise 1 
Just to get a feeling for how things work, let's make the following changes in the [CodePen snippet](https://codepen.io/rcyou/pen/QEObEk/): 
1. Change the text between the `<h1>` tags to `I am a purple headline of size 3em`. 
2. Change the `h1` font-size to `3em` and the color to purple (add `color: purple;` below `margin-bottom`).  
3. Remove the JavaScript code. What happens now once you click on the blue button? 

---

#### Solutions
Clicking the button should no longer trigger the script to hide the paragraph text.

<img src="images/purple_headline.png" align="left" width=60%/>


Most HTML elements are represented by a pair of tags -- an opening tag and a closing tag. A table, for example, starts with `<table>` and ends with `</table>`. The first tag tells the browser: "Hey! I got a table here! Render it as a table so it displays nicely on the site." The closing tag (note the forward slash!) tells the browser: "Hey! I'm all done with that table, thanks." Inside the table are nested more HTML tags representing rows (`<tr>`) and cells (`<td>`). 

```html
<html>
    <table id="example-table" class="striped-table" style="width: 95%">
        <tr> <!-- Header -->
            <td>Column A</td>
            <td>Column B</td>
        </tr>
        <tr> <!-- Row 1 --->
            <td>Row 1, Column A</td>
            <td>Row 1, Column B</td>
        </tr>
        <tr> <!-- Row 2 --->
            <td>Row 2, Column A</td>
            <td>Row 2, Column B</td>
        </tr>
    </table>
</html>
```

This is what the rendered HTML table looks like:

<html>
    <table id="example-table" class="striped-table" style="width: 95%">
        <tr> <!-- Header -->
            <td>Column A</td>
            <td>Column B</td>
        </tr>
        <tr> <!-- Row 1 --->
            <td>Row 1, Column A</td>
            <td>Row 1, Column B</td>
        </tr>
        <tr> <!-- Row 2 --->
            <td>Row 2, Column A</td>
            <td>Row 2, Column B</td>
        </tr>
    </table>
</html>


HTML elements can have any number of __attributes__. Three common ones are:

-- __IDs__, which *uniquely* identify elements --

```html
<table id="example-table">
```

-- __classes__, which identify a *type* of element (contrary to ids, a class can be used more than once) --

```html
<table class="striped-table">
```

-- and __styles__, which define how specific elements *appear* (e.g. the width of the table) --

```html
<table style="width:95%;">
```
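
To see how a program reads these attributes, here is a minimal sketch that uses `html.parser` from Python's standard library (not the tool used later in this tutorial; the class name `AttributeCollector` is made up for illustration). It collects the attributes of each opening tag as name-value pairs:

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collects the attributes of every opening tag it encounters."""

    def __init__(self):
        super().__init__()
        self.found = {}  # tag name -> dictionary of attributes

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.found[tag] = dict(attrs)


collector = AttributeCollector()
collector.feed('<table id="example-table" class="striped-table" style="width: 95%"></table>')

table_attributes = collector.found["table"]
print(table_attributes["id"])     # example-table
print(table_attributes["class"])  # striped-table
print(table_attributes["style"])  # width: 95%
```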

As you may have already noticed, we use spaces (or TABs) to separate the elements from one another (the geeks among us will call this "indentation") to provide structure and improve readability.

Yes, that's right. *Improve readability*.

Code may look complex to read at first, but when you do take a closer look at it, it boils down to simple English, following a particular structure (also known as syntax).

For example, the fact that the `<table>` tag is placed farther to the right than the `<html>` tag indicates that the table is nested within the HTML block.
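
The nesting that the indentation expresses can also be made visible with a few lines of Python. The sketch below again relies on the standard library's `html.parser` (the class name `NestingPrinter` is an illustrative choice) and prints every opening tag, indented by how deeply it is nested:

```python
from html.parser import HTMLParser


class NestingPrinter(HTMLParser):
    """Records each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # indent the tag name by four spaces per nesting level
        self.outline.append("    " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


html_doc = "<html><table><tr><td>Column A</td><td>Column B</td></tr></table></html>"
printer = NestingPrinter()
printer.feed(html_doc)
print("\n".join(printer.outline))
```

Even though `html_doc` is written on a single line without any indentation, the printed outline shows `table` inside `html`, `tr` inside `table`, and both `td` cells inside `tr`.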

This may be a lot to take in if you're completely new to HTML, but don't worry: the goal of this section is not to teach you how to code from scratch but rather to show you what HTML is, and why it is relevant for web scraping. 

#### Exercise 2
Double click on the rendered table above to edit the HTML structure, finish the exercises below, and run the cell again (`Shift + Enter`) to see the result of your changes.

1. Add another row in the table above so that it becomes a 2 (columns) x 4 (rows) table. That is, 3 regular rows and 1 table header row.
2. Fill the cells with the corresponding text labels (e.g., Row 3, Column A). 
3. Change the table width to `50%` so that the table becomes narrower.


**Solutions**
<html>
    <table id="example-table" class="striped-table" style="width: 50%">
        <tr> <!-- Header -->
            <td>Column A</td>
            <td>Column B</td>
        </tr>
        <tr> <!-- Row 1 --->
            <td>Row 1, Column A</td>
            <td>Row 1, Column B</td>
        </tr>
        <tr> <!-- Row 2 --->
            <td>Row 2, Column A</td>
            <td>Row 2, Column B</td>
        </tr>
        <tr> <!-- Row 3 --->
            <td>Row 3, Column A</td>
            <td>Row 3, Column B</td>
        </tr>
    </table>
</html>

--- 
### 1.3 Inspecting HTML in your browser

You can look at the HTML that makes up any web page by _inspecting the source_ in a web browser. We recommend using Chrome from now on (in Safari and Firefox, things look slightly different, and unfortunately we can't cover those).

#### Inspect element

You can inspect specific elements on the page by right-clicking on the page and selecting "Inspect" from the context menu that pops up. Hover over elements in the "Elements" tab to highlight them on the page. This can be helpful when you're trying to figure out how to uniquely identify the element you want to scrape.

In this exercise we look at the HTML structure of a fictitious [online bookstore](https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html). Each of the 1000 books has its own page which shows the title, stock level, star rating, product description, and a table with other product information. Note that the prices and ratings are randomly generated and therefore the figures on your screen may deviate from the ones below.

<img src="images/inspect.gif" align="left" width=60%/>

In the screenshot above I selected the book title ("A Light in the Attic"), right-clicked, and chose "Inspect". The same text is highlighted in blue in the HTML code below. The `<h1>` and `</h1>` tags surrounding the title indicate that this text is a header on the web page. Move your pointer down to the line below (`<p class="price_color">£51.77</p>`) and you'll see that the top screen now highlights the price (rather than the title) of the book. This way you can easily investigate any webpage. 

As we discussed earlier, tags can be nested within other tags. This also becomes clear from the screenshot below in which the small gray triangles (▶) indicate that there is code hidden within these blocks. Click on them to expand the code, see what's inside, and click again to collapse them. 

<img src="images/html_structure.png" align="left" width=70%/>

#### Exercise 3
1. Use the inspect tool to find the HTML element that constitutes the table header "**Number of reviews**" at the bottom of the [page](https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html).
2. Look up how many elements on the page are associated with the class `sub-header` (within the Inspector screen use `Ctrl+F` on a PC or `⌘+F` on Mac to search)
3. You can make local (only on your computer) changes to the web page by double clicking in the inspector and swapping the code for something else (yes, you can overwrite what's already written there!). Change the price of the book to £39.95 and assign it a 5 star-rating. What happens once you refresh the page (and why)?

<img src="images/exercise_inspector.png" width=40% align="left"  style="border: 1px solid black"/>

#### Solutions
1. The `<th>` (table header) tags enclose the text "Number of reviews" 
2. 3 elements are associated with the class `sub-header` (product description, product information, reviews)
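3. The price can be changed by editing the text inside the `<p class="price_color">` element, and the star rating by changing the class attribute to `star-rating Five`. Once you refresh the page, the original (unedited) price and star rating appear again: your edits only exist in the copy of the page in your browser, and refreshing downloads the page from the server anew.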
--- 

### 1.4 Requesting HTML
Rather than using the Inspector to look up the source code, we can use Python's `requests` library for that purpose. As this library is not loaded by default, we first need to import it. The full source code contains over 9,000 characters, so we only print the part that holds the product description.

In [2]:
import requests

# make a get request to the "A Light in the Attic" webpage
url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
book_request = requests.get(url)

# return the source code from the request object
book_source_code = book_request.text

# print out part of the source code
print(book_source_code[5710:6860])

        <div id="product_description" class="sub-header">
            <h2>Product Description</h2>
        </div>
        <p>It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to r
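
For further inspection it can help to save the source code to a file that you can open in a text editor or browser. A minimal sketch: the file name `book_page.html` is an arbitrary choice, and the short string is a stand-in for the real `book_source_code` so that the example also runs without an internet connection.

```python
from pathlib import Path

# stand-in for the source code obtained via book_request.text
book_source_code = "<html><h1>A Light in the Attic</h1></html>"

# write the HTML to disk (the file name is an arbitrary choice)
output_file = Path("book_page.html")
output_file.write_text(book_source_code, encoding="utf-8")

# read the file back to confirm that nothing got lost
saved_html = output_file.read_text(encoding="utf-8")
print(len(saved_html), "characters saved")
```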

#### Exercise 4
Now it's your turn! Use slicing in the code below so that it prints out only the title of the book: *A Light in the Attic* (and nothing else)! 

In [None]:
print(book_source_code[...:...])

#### Solutions 

In [55]:
# one of these solutions (multiple solutions possible)
print(book_source_code[3319:3339]) 
print(book_source_code[4479:4499])
print(book_source_code[5872:5892])

A Light in the Attic
A Light in the Attic
A Light in the Attic
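
The positions in the solution were found by trial and error, and they stop working as soon as the page changes. As a sketch of a slightly more robust alternative, `str.find()` can look up where the `<h1>` tags are located. The short string below is only a stand-in for the real page source:

```python
# stand-in for book_source_code (the real page is much longer)
book_source_code = "<html><body><h1>A Light in the Attic</h1><p>£51.77</p></body></html>"

opening_tag = "<h1>"
closing_tag = "</h1>"

# position right after the opening tag, and position where the closing tag starts
start = book_source_code.find(opening_tag) + len(opening_tag)
end = book_source_code.find(closing_tag)

title = book_source_code[start:end]
print(title)  # A Light in the Attic
```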


As you probably noticed, extracting the right elements from the page this way is quite a hassle. Fortunately, there are much better ways to select elements, such as CSS selectors and XPath expressions, although these are beyond the scope of this tutorial. For now we'll give you a quick preview of what's coming up next. 

### 1.5 Introduction to BeautifulSoup 
This open-source Python library allows you to navigate through and extract data from HTML files. It does NOT gather information from the web; for this we use `requests`, as we did above. So first you send a request to a page to gather the data, and then you pass the result to BeautifulSoup to extract the information. 

In the code snippet below we import the package and turn the `book_source_code` (the HTML code from the "A Light in the Attic" webpage we used earlier) into a BeautifulSoup object. Once parsed, we can easily navigate the code by tag name. Since we know that the title is surrounded by `<h1>` tags (see the Chrome Inspector screenshot above), we use `soup.find('h1')` to print out the title of the book. Do you see how much easier that is?!

In [56]:
from bs4 import BeautifulSoup

# name the parser explicitly (Python's built-in one) to avoid a warning
soup = BeautifulSoup(book_source_code, 'html.parser')
print(soup.find('h1'))

<h1>A Light in the Attic</h1>


The `.find()` method always returns the first matching element that it finds. For example, the web page has two `<h2>` elements which contain the "Product Description" and "Product Information" subheaders. Only the first one will be returned by `.find()`:

In [3]:
print(soup.find('h2'))

<h2>Product Description</h2>


To capture all matching `<h2>` elements, use the `.find_all()` method like this:

In [4]:
print(soup.find_all('h2'))

[<h2>Product Description</h2>, <h2>Product Information</h2>]


Note that it now returns a list of elements (`[element1, element2]`), so to access individual elements you need to apply indexing: 

In [5]:
# obtain first h2 element 
print(soup.find_all('h2')[0])

# obtain second h2 element
print(soup.find_all('h2')[1])

<h2>Product Description</h2>
<h2>Product Information</h2>


Both subheaders are still surrounded by `<h2>` and `</h2>` tags. To get rid of them, append `.get_text()` to your code: 

In [6]:
# sub header without h2 tags
print(soup.find_all('h2')[0].get_text())

Product Description
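
Tag names are not the only handle. The Inspector showed that the price sits in a `<p>` element with the class `price_color`, and `.find()` also accepts a `class_` argument to filter on classes. A minimal sketch on a hand-written fragment that only mimics the book page (so it runs offline; the star rating in it is made up, and `fragment_soup` is an illustrative name):

```python
from bs4 import BeautifulSoup

# hand-written fragment that mimics part of the book page
html_fragment = """
<div class="product_main">
    <h1>A Light in the Attic</h1>
    <p class="price_color">£51.77</p>
    <p class="star-rating Three"></p>
</div>
"""

fragment_soup = BeautifulSoup(html_fragment, "html.parser")

# select by tag name AND class
price = fragment_soup.find("p", class_="price_color").get_text()

# attributes can be read like dictionary keys; class holds a list of class names
rating_classes = fragment_soup.find("p", class_="star-rating")["class"]

print(price)           # £51.77
print(rating_classes)  # ['star-rating', 'Three']
```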


#### Exercise 5

1. Collect the HTML table that contains product information (UPC, type, price, tax, etc.) of the "A Light in the Attic" book using BeautifulSoup. The output should look like this (the stray "Â" in front of the pound sign is a character-encoding artifact that you can ignore for now): 

```html
<table class="table table-striped">
    <tr>
        <th>UPC</th><td>a897fe39b1053632</td>
    </tr>
    <tr>
        <th>Product Type</th><td>Books</td>
    </tr>
    <tr>
        <th>Price (excl. tax)</th><td>Â£51.77</td>
    </tr>
    <tr>
        <th>Price (incl. tax)</th><td>Â£51.77</td>
    </tr>
    <tr>
        <th>Tax</th><td>Â£0.00</td>
    </tr>
    <tr>
        <th>Availability</th>
        <td>In stock (22 available)</td>
    </tr>
    <tr>
        <th>Number of reviews</th>
        <td>0</td>
    </tr>
</table>
```


2. Obtain the tax amount from the table. Tip: you can chain `.find()` and `find_all()` statements, for example: `.find('body').find_all('h1')` returns all `<h1>`s in the `body`. Your code should return the following output: 

```html
Â£0.00 
```

#### Solutions

In [66]:
# Question 1
print(soup.find('table')) # or print(soup.find_all('table')[0])

# Question 2
print(soup.find('table').find_all('tr')[4].find('td').get_text())

Â£0.00
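
The stray `Â` is an encoding artifact rather than part of the price. One plausible explanation, stated here as an assumption about this page, is that the bytes are UTF-8 but get decoded as ISO-8859-1 (Latin-1). The sketch below reproduces that effect with plain strings:

```python
pound = "£0.00"

# the pound sign takes up two bytes in UTF-8
utf8_bytes = pound.encode("utf-8")

# decoding those bytes with the wrong codec yields the stray character
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)  # Â£0.00

# reversing the two steps restores the original text
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired)  # £0.00
```

With `requests`, the usual remedy is to set `book_request.encoding = 'utf-8'` before reading `book_request.text`.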


After finishing this section, you should have gained an intuitive understanding of the building blocks of websites and learned how to view the underlying HTML code, request a website through Python code, and select HTML elements by tag. 

--- 

## 2. Application Programming Interface


### 2.1 What is an API?
Another common data collection method is using an Application Programming Interface (API). That's a mouthful, but in essence it is nothing more than a version of a _website intended for computers, rather than humans, to talk to one another_. 

<img src="images/reddit_homepage.png" width=60% align="left"  style="border: 1px solid black"/>



[Reddit](https://www.reddit.com), a popular American social news aggregation and discussion site that is sometimes described as the *front page of the internet*, gives you an up-to-date view of what's happening around the world. It's based on the principle that the community (hundreds of millions of users) decides what is newsworthy and what's not through a voting system. You can think of Reddit upvotes as Facebook likes. Posts are ranked based on their votes, and those with many upvotes are featured on the homepage. The gray number next to each post represents the net score (= upvotes - downvotes). 

Here's an example of what the [output](https://www.reddit.com/r/science/comments/k0bjqt/study_finds_users_not_notifications_initiate_89.json) of the Reddit API may look like: 

<img src="images/api_example.png" width=60% align="left"  style="border: 1px solid black"/>

A few things stand out right away. The output only contains text, organized in data structures such as lists (`[]`) and dictionaries (`{}`). There is no human interface with buttons, menus, and links. Yet you can access it like any other website by entering the URL in your browser (`reddit.com/r/science/...` in this example). In fact, the API output above corresponds to the following [Reddit thread](https://www.reddit.com/r/science/comments/k0bjqt/study_finds_users_not_notifications_initiate_89). For example, have a look at the third and fourth lines of the output above, which state the `title` of the post you see below.  

<img src="images/reddit.png" width=60% align="left"  style="border: 1px solid black"/>

If you have taken a quick look at the [API output](https://www.reddit.com/r/science/comments/k0bjqt/study_finds_users_not_notifications_initiate_89/.json), you may have come to the conclusion that making sense of raw JavaScript Object Notation (JSON) is easier said than done. Fortunately, this [plugin](https://chrome.google.com/webstore/detail/json-viewer/gbmdgpbipfallnflgajpaliibnhdgobh) automatically formats and highlights the output so that it's easier to digest. For the following exercise, we therefore highly recommend installing the Chrome plugin. Alternatively, you can copy and paste the JSON output into this [online viewer](http://jsonviewer.stack.hu) and inspect the "Viewer" tab. 

#### Exercise 6
Navigate through the JSON tree structure of the post above and answer the following questions:
1. At the parent level you find two dictionaries at line 5 and 197 (i.e., the blue arrows). Collapse the content and describe in your own words what each dictionary represents. How does it relate to the Reddit HTML page? 

<img src="images/reddit_api.gif" width=70% align="left"  style="border: 1px solid black"/>

2. The first comment is from the post author (fotogneric) and has gathered the most points. How many downvotes did this comment get (you find the answer in the JSON output)? 

3. Suppose that you want to extract the date and time each comment was created. What path do you need to navigate? Note that timestamps are often stored as Unix time (also known as Epoch time): the number of seconds elapsed since January 1, 1970 (midnight UTC). Because this number does not depend on time zones, it can be used as a universal time scale around the world. Copy and paste the timestamp into an online [epoch converter](https://www.epochconverter.com) and check whether it corresponds with the date and time on the webpage.



#### Solutions
1. The dictionary that starts at line 5 contains data on the post (title, subreddit, upvote ratio, thumbnail/image, link to article). The other dictionary stores the comments of the post (author, body text, timestamp). 
2. At the moment of writing this solution (December '20) the post has 0 downvotes (`'downs': 0`).
3. The `created` key stores a large number (e.g., 1606274053) that can be translated into a date and time (for this example: 25 November 2020 03:14 GMT). The corresponding path for the timestamp of the first comment is: `request[1]['data']['children'][0]['data']['created']` (a written description that follows these directions also suffices: first, you take the 2nd element `[1]` in the list, then you choose the `data` key, etc.).
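
The path from solution 3 can also be followed step by step in Python. The structure below is a heavily trimmed, hand-made stand-in for the API output (only the keys on the path are kept, and `json_text` is a made-up variable name), so treat it as a sketch rather than real Reddit data:

```python
import json
from datetime import datetime, timezone

# trimmed stand-in: a list with two dictionaries (the post and its comments)
json_text = """
[
  {"data": {"children": [{"data": {"title": "(post title)"}}]}},
  {"data": {"children": [{"data": {"created": 1606274053}}]}}
]
"""
request = json.loads(json_text)

# 2nd element of the list -> 'data' -> 'children' -> 1st comment -> 'data' -> 'created'
timestamp = request[1]['data']['children'][0]['data']['created']

# convert the seconds since January 1, 1970 (UTC) into a readable date and time
created_at = datetime.fromtimestamp(timestamp, tz=timezone.utc)
print(created_at)  # 2020-11-25 03:14:13+00:00
```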
---

Now that you understand what APIs are, you may rightfully wonder: why should I learn APIs when I could scrape the elements from the website instead (like the book webshop)?

One of the major advantages of APIs is that you can directly access the data you need *without all the hassle of selecting the right HTML tags*. Another advantage is that you can often customize your API request (e.g., the first 100 comments, or only posts about science) which may not always be possible in the web interface. 

That's why we recommend using an API whenever possible. However, in practice, you may find yourself in a scenario where there is no API to access the data you want, or the access is too limited. In these scenarios, web scraping would allow you to still access the data, as long as it is available (i.e., viewable on a website, or somewhere hidden in the source code).

Extracting data from the Reddit API is beyond the scope of this tutorial, hence we'll look at a somewhat simpler example to get you started with APIs. 

### 2.2 Retrieve data from an API endpoint
Many APIs are only available via paid subscriptions (e.g., the [Chartmetric API](https://api.chartmetric.com/apidoc/#api-Artist-GetPlaylistsByArtists) offers music data analytics on Spotify playlists, among others).

So, we have to make do with a nerdy example which isn't even remotely related to the field of marketing. We proudly present to you... __[icanhazdadjoke.com](https://icanhazdadjoke.com)!__ ...which is the largest selection of *dad jokes on the internet* (yes, we didn't know that existed either!).

The reason we picked this website is that we can call the API without providing authentication tokens, or payment. This means that *you* can collect jokes without creating an account, and that the API does not ask for a secret key or token (i.e., password to tell the API: "this is user X requesting data about Y!"). In general, there are limits to how often or how much data you can request. Therefore, most commercial APIs require authentication so that they can track your usage of the API and verify whether you're a paid subscriber or not. 

So how does the website work, and why is there an API? Every time you visit the site, it shows a *random joke*. From a technical perspective, each time a user opens the site, a little software program on the server makes an API call to the dad joke API to draw a new joke to be displayed. So, the designers have split the displaying of information (website) from the actual content (the jokes, available through the API). This offers the opportunity to provide the data in two ways: a nice visual representation of dad jokes (the website), and a service for drawing jokes programmatically to embed in other software products.

Sounds familiar? Yep! Facebook and Instagram do exactly the same. Instead of tying in their technology with the website, they have split the visual representation from the actual content. This allows social media networks to monetize their data in other ways (e.g., by having advertisers programmatically access the Facebook API to learn about potential targets for their ad campaigns).

But let's focus again, and try to generate a [random joke](https://icanhazdadjoke.com) yourself...


<img src="images/icanhazdadjoke.gif" width=70% align="left"  style="border: 1px solid black"/>

In section 1.4 on web scraping, we requested the HTML of the "A Light in the Attic" page and stored the result in the variable `book_request`. Here we also use the `requests` library, but this time we add `headers={"Accept": "application/json"}` to our request so that we get back JSON rather than raw HTML. Thereafter, we extract the JSON from the response object (`response.json()`). Each `joke_request` object is a dictionary with three keys: 
* `id` = a unique identifier for each joke
* `joke` = the text of the joke
* `status` = the HTTP status code (200 indicates a successful request)

In [99]:
# request JSON output from icanhazdadjoke API
url = "https://icanhazdadjoke.com"
response = requests.get(url, headers={"Accept": "application/json"})
joke_request = response.json() 
print(joke_request)

{'id': 'fNeVDI6USnb', 'joke': 'I made a playlist for hiking. It has music from Peanuts, The Cranberries, and Eminem. I call it my Trail Mix.', 'status': 200}
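
Because every request returns a different joke, you may want to save a result for later inspection. A minimal sketch using Python's built-in `json` module: the dictionary is copied from the output above so that the example runs offline, and the file name `joke.json` is an arbitrary choice.

```python
import json

# the joke from the output above, used as a stand-in for a fresh API response
joke_request = {
    'id': 'fNeVDI6USnb',
    'joke': 'I made a playlist for hiking. It has music from Peanuts, '
            'The Cranberries, and Eminem. I call it my Trail Mix.',
    'status': 200,
}

# save the dictionary as a JSON file (the file name is an arbitrary choice)
with open("joke.json", "w", encoding="utf-8") as json_file:
    json.dump(joke_request, json_file, indent=2)

# load the file again to check that the content survived the round trip
with open("joke.json", encoding="utf-8") as json_file:
    loaded_joke = json.load(json_file)

print(loaded_joke['joke'])
```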


#### Exercise 7
1. What happens if you run the cell above again? Why is that? 
2. Turn off your Wi-Fi and try running the cell again. What happens this time? 
3. You can extract the text of the joke as follows: `joke_request['joke']`. Revise the code snippet above such that it stores the text of 10 jokes in a list (tip: use a for-loop).  

#### Solutions
1. Another random joke is generated, so the `id` and `joke` change every time. 
2. A connection error occurs because the `requests` package could not establish a connection with the API.

In [102]:
# Solution Question 3
url = "https://icanhazdadjoke.com"
jokes = [] 

for counter in range(10):
    # each request returns a new random joke
    response = requests.get(url, headers={"Accept": "application/json"})
    joke_request = response.json() 
    # store only the text of the joke
    jokes.append(joke_request['joke'])
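
The connection error from question 2 would stop such a loop halfway. As a sketch of one way to guard against that (the function name `fetch_joke` and the 10-second timeout are illustrative choices, not something the API prescribes), the request can be wrapped in `try`/`except`:

```python
import requests


def fetch_joke(url="https://icanhazdadjoke.com"):
    """Return the text of one random joke, or None if the request fails."""
    try:
        response = requests.get(url,
                                headers={"Accept": "application/json"},
                                timeout=10)
        # raise an error for unsuccessful status codes (e.g., 404 or 500)
        response.raise_for_status()
        return response.json()['joke']
    except (requests.exceptions.RequestException, ValueError, KeyError):
        # no connection, timeout, bad status code, or unexpected response body
        return None


joke = fetch_joke()
if joke is None:
    print("Could not retrieve a joke, please check your connection.")
else:
    print(joke)
```

Returning `None` instead of crashing lets the calling code decide what to do, for example skipping the failed request and continuing with the next one.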

### 2.3 Parameters

As mentioned earlier, it is common practice to customize requests so that the API returns the exact data you need. You have probably already done this a dozen times without even knowing it. For example, if you Google the word `cat`, the results page may look something like this:

<img src="images/google.png" width=60% align="left"  style="border: 1px solid black"/>

Note how the link in the browser starts off with [`google.com/search?q=cat...`](https://www.google.com/search?q=cat). Thus, the search query `cat` is already embedded in the link itself. So rather than filling out the search box on the webpage itself, you can also tweak it in the URL directly. In a similar way, you can request `cat` jokes from the [`icanhazdadjoke.com/search`](https://icanhazdadjoke.com/search?term=cat) page with the `term` parameter:

<img src="images/search.png" width=60% align="left"  style="border: 1px solid black"/>

With this idea in mind, we can change the URL to `search_url` and pass the `params` argument, a dictionary with parameters that further specify our request.

In [9]:
search_url = "https://icanhazdadjoke.com/search"

response = requests.get(search_url, 
                        headers={"Accept": "application/json"}, 
                        params={"term": "cat"})
joke_request = response.json()

The `joke_request` object contains a list with all cat-related jokes (`joke_request['results']`), the search term (`cat`), and the total number of jokes (`10`).
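
Under the hood, the `params` dictionary simply ends up in the URL. The sketch below builds the same kind of link by hand with `urllib.parse` from the standard library. `requests` takes care of this step for you, so this is only meant to show what happens:

```python
from urllib.parse import urlencode

search_url = "https://icanhazdadjoke.com/search"

# turn the dictionary into a query string and attach it after a question mark
cat_url = search_url + "?" + urlencode({"term": "cat"})
print(cat_url)  # https://icanhazdadjoke.com/search?term=cat

# several parameters are joined with an ampersand
paged_url = search_url + "?" + urlencode({"term": "", "limit": 10, "page": 2})
print(paged_url)  # https://icanhazdadjoke.com/search?term=&limit=10&page=2
```

After a real request, `response.url` shows the final URL that `requests` composed, which is a handy way to check your parameters.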

#### Exercise 8
1. Change the search term parameter to `dog` and revisit `joke_request['results']`. How many dog jokes are there? 
2. Write a function `find_joke()` that takes a query as input parameter and returns the first related joke from the `icanhazdadjoke` search API. Your function should also work if there are no results.
3. Currently, the API provides 650 jokes in total. You can check that for yourself by passing an empty string (`""`) as a search `term`. The results object only shows you the first 20 jokes. To view the remaining 630 jokes you need pagination: rather than returning all output at once, the API divides the data into subsets which can be accessed on various pages. By default each page contains 20 jokes, where page 1 shows joke 1 to 20, page 2 jokes 21 to 40, ..., and page 33 jokes 641 to 650. You can adjust the number of results on each page (max. 30) with the `limit` parameter (e.g., `params={"term": "", "limit": 10}`). Choose different levels for limit (e.g., `10`, `20`, and `30`) and see how it affects the `total_pages` field. Why is that? 
4. You can specify the current page number with the `page` parameter (e.g., `params={"term": "", "page": 2}`). Adapt the function `find_joke()` such that it loops over all available pages and stores the ids and jokes in a list. You can leave the `limit` parameter at its default value (20). Make sure that your function also works when you pass it a search `term`. Tip: to determine how many pages you need to loop through, you can use the `total_pages` field (e.g., there are only 10 cat jokes, so in that case 1 page would suffice). 

In [108]:
# Question 1 
search_url = "https://icanhazdadjoke.com/search"

response = requests.get(search_url, 
                        headers={"Accept": "application/json"}, 
                        params={"term": "dog"})
joke_request = response.json()
print(f"The number of dog jokes is: {joke_request['total_jokes']}")

The number of dog jokes is: 12


In [113]:
# Question 2
def find_joke(term):
    search_url = "https://icanhazdadjoke.com/search"

    response = requests.get(search_url, 
                            headers={"Accept": "application/json"}, 
                            params={"term": term})
    joke_request = response.json()
    
    # return the first joke, or None if the search has no results
    if joke_request['total_jokes'] > 0: 
        return joke_request["results"][0]
    return None

find_joke("asdfasdf")

In [120]:
# Question 3
for limit in range(10, 31, 10):
    response = requests.get(search_url, 
                            headers={"Accept": "application/json"}, 
                            params={"term": "", 
                                   "limit": limit})
    joke_request = response.json()
    print(f"Limit {limit} gives {joke_request['total_pages']} pages")
# the higher the limit, the more results fit on a single page, and thus the lower the number of pages  

Limit 10 gives 65 pages
Limit 20 gives 33 pages
Limit 30 gives 22 pages
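
The page counts follow directly from dividing the total number of jokes by the limit and rounding up, because the last page may be only partly filled. A small sketch of that arithmetic, assuming the total of 650 jokes mentioned in the exercise:

```python
import math

total_jokes = 650  # number of jokes at the time of writing

pages_per_limit = {}
for limit in range(10, 31, 10):
    # the last page may be only partly filled, hence rounding up
    pages_per_limit[limit] = math.ceil(total_jokes / limit)
    print(f"Limit {limit} gives {pages_per_limit[limit]} pages")
```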


In [135]:
# Question 4
def find_joke(term):
    search_url = "https://icanhazdadjoke.com/search"
    page = 1
    jokes = []

    while True: 
        response = requests.get(search_url, 
                                headers={"Accept": "application/json"}, 
                                params={"term": term,  # optionally you can add "limit": 20 but that's already the default so it doesn't change anything
                                        "page": page})
        joke_request = response.json()
        jokes.extend(joke_request['results'])
        # continue only as long as the current page is not the last one
        if joke_request['current_page'] < joke_request['total_pages']:
            page += 1
        else: 
            return jokes

output = find_joke("")
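
To check the stopping condition of such a loop without sending dozens of requests, you can run it against a stand-in. In the sketch below, `fake_search()` is a hypothetical offline replacement for one call to the search endpoint; it returns made-up jokes and only mimics the `current_page`, `total_pages`, and `results` fields:

```python
import math

# 45 made-up jokes that stand in for the API's database
FAKE_JOKES = [{"id": str(number), "joke": f"joke {number}"} for number in range(1, 46)]


def fake_search(page, limit=20):
    """Hypothetical offline stand-in for one request to the search endpoint."""
    start = (page - 1) * limit
    return {"current_page": page,
            "total_pages": math.ceil(len(FAKE_JOKES) / limit),
            "results": FAKE_JOKES[start:start + limit]}


def collect_all_jokes():
    page = 1
    jokes = []
    requests_made = 0
    while True:
        joke_request = fake_search(page)
        requests_made += 1
        jokes.extend(joke_request["results"])
        # stop once the last page has been processed
        if joke_request["current_page"] >= joke_request["total_pages"]:
            return jokes, requests_made
        page += 1


all_jokes, requests_made = collect_all_jokes()
print(len(all_jokes), "jokes collected in", requests_made, "requests")
```

With 45 jokes and 20 jokes per page, the loop should need exactly three requests. One request more would mean that the loop runs past the last page.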

And that wraps up this tutorial! After finishing this section, you should not only have learned a couple of new dad jokes but you should also be able to explain to your parents what an API is and give practical examples of how consumers - as well as companies - may benefit from using them. Furthermore, you should be able to request data from various APIs  (even if it's one you have never seen before!). 

---