# API 101 (oDCM)

*... (focus lies on pagination and parameters; look at two case studies: icanhazdadjoke and Reddit)*

--- 

## Learning Objectives

Students will be able to: 
* Send HTTP requests to retrieve data from APIs
* Iterate over multiple pages 
* Extract and store the results of API requests

--- 

## Acknowledgements
This course draws on a variety of online resources which can be retrieved from the [course website](https://odcm.hannesdatta.com/#student-profile--prerequisites). 


--- 

## Support Needed?
For technical issues outside of scheduled classes, please check the [support section](https://odcm.hannesdatta.com/docs/course/support) on the course website.

## 1. icanhazdadjoke
### 1.1 Use parameters to modify the API results   
We know you love dad jokes, so guess what? We're back with many more jokes, and you're going to learn how to save them all!

As you may remember you can customize requests so that the API returns the *exact* data you need. You have probably already done this a dozen times without even knowing it. For example, if you Google the word `cat`, the results page may look something like this:

<img src="images/google.png" width=60% align="left"  style="border: 1px solid black"/>

Note how the link in the browser starts off with [`google.com/search?q=cat...`](https://www.google.com/search?q=cat). Thus, the search query `cat` is already embedded in the link itself. Cool, right?

__Let's try it out__

So, rather than filling out the search box on the webpage itself, you can also tweak it in the URL directly. Try it!

In a similar way, you can request `cat` jokes from the [`icanhazdadjoke.com/search`](https://icanhazdadjoke.com/search?term=cat) page with the `term` parameter:

<img src="images/search.png" width=60% align="left"  style="border: 1px solid black"/>

With this idea in mind, we can point `search_url` at the search page and pass the `params` argument, a dictionary of parameters that further specify our request. Run the cell below to see cat jokes!

In [156]:
import requests

search_url = "https://icanhazdadjoke.com/search"

response = requests.get(search_url, 
                        headers={"Accept": "application/json"}, 
                        params={"term": "cat"})
joke_request = response.json()
print(joke_request)

{'current_page': 1, 'limit': 20, 'next_page': 1, 'previous_page': 1, 'results': [{'id': '8UnrHe2T0g', 'joke': '‘Put the cat out’ … ‘I didn’t realize it was on fire'}, {'id': 'iGJeVKmWDlb', 'joke': 'My cat was just sick on the carpet, I don’t think it’s feline well.'}, {'id': 'daaUfibh', 'joke': 'Why was the big cat disqualified from the race? Because it was a cheetah.'}, {'id': '1wkqrcNCljb', 'joke': "Did you know that protons have mass? I didn't even know they were catholic."}, {'id': 'BQfaxsHBsrc', 'joke': 'What do you call a pile of cats?  A Meowtain.'}, {'id': 'O7haxA5Tfxc', 'joke': 'Where do cats write notes?\r\nScratch Paper!'}, {'id': 'TS0gFlqr4ob', 'joke': 'What do you call a group of disorganized cats? A cat-tastrophe.'}, {'id': '0wcFBQfiGBd', 'joke': 'Did you hear the joke about the wandering nun? She was a roman catholic.'}, {'id': 'AQn3wPKeqrc', 'joke': 'It was raining cats and dogs the other day. I almost stepped in a poodle.'}, {'id': '39Etc2orc', 'joke': 'Why did the man

The `joke_request` object now contains a list with all cat-related jokes (`joke_request['results']`), the search term (`cat`), and the total number of jokes (`10`).
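To connect the `params` dictionary back to the link you saw in the browser, the minimal sketch below builds the query string by hand with Python's built-in `urllib`. It only illustrates the idea: `requests` does this encoding for you when you pass `params`, and no request is actually sent here.

```python
from urllib.parse import urlencode

search_url = "https://icanhazdadjoke.com/search"
params = {"term": "cat"}

# urlencode turns the dictionary into "key=value" pairs joined by "&"
full_url = f"{search_url}?{urlencode(params)}"
print(full_url)  # https://icanhazdadjoke.com/search?term=cat
```

Adding more parameters to the dictionary simply extends the part after the `?`.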

#### Exercise 1
1. Change the search term parameter to `dog` and revisit `joke_request['results']`. How many dog jokes are there? 
2. Write a function `find_jokes()` that takes a query as an input parameter and returns the number of jokes from the `icanhazdadjoke` search API. 




In [None]:
# Question 1 
search_url = "https://icanhazdadjoke.com/search"

response = requests.get(search_url, 
                        headers={"Accept": "application/json"}, 
                        params={"term": "dog"})
joke_request = response.json()
print(f"The number of dog jokes is: {joke_request['total_jokes']}")

In [165]:
# Question 2
def find_jokes(term):
    search_url = "https://icanhazdadjoke.com/search"

    response = requests.get(search_url, 
                            headers={"Accept": "application/json"}, 
                            params={"term": term})
    joke_request = response.json()
    num_results = joke_request['total_jokes']
    return num_results

find_jokes("asdfasdf")

0

### 1.2 Pagination

<img src="images/num_jokes.png" width=60% align="left"  style="border: 1px solid black"/>

Currently, the API provides about 649 jokes in total. You can check that for yourself by passing an empty string (`""`) as a search `term` in the web interface (or in the `find_jokes()` function of course!). 

The API output, however, only shows you the first 20 jokes. To view the remaining 629 jokes, you need pagination. That is, the API divides the data into subsets which can be accessed on various pages, rather than returning all output at once. 

In [168]:
print(f'The total number of jokes is {find_jokes("")}')

The total number of jokes is 649


By default, each page contains 20 jokes, where page 1 shows jokes 1 to 20, page 2 jokes 21 to 40, ..., and page 33 jokes 641 to 649. You can adjust the number of results on each page (max. 30) with the `limit` parameter (e.g., `params={"limit": 10}`). In the example below we set `limit` equal to `10`, `20`, and `30` and see how it affects the number of pages. As expected, we find that the higher the limit, the more results fit on a single page, and thus the lower the number of pages.

In [169]:
for limit in range(10, 31, 10):
    response = requests.get(search_url, 
                            headers={"Accept": "application/json"}, 
                            params={"term": "", 
                                   "limit": limit})
    joke_request = response.json()
    print(f"Limit {limit} gives {joke_request['total_pages']} pages")

Limit 10 gives 65 pages
Limit 20 gives 33 pages
Limit 30 gives 22 pages
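The page counts above follow directly from the total number of jokes: divide by the limit and round up, because the last page may be only partly filled. The sketch below checks this without calling the API. The total of 649 is taken from the text above and will drift as jokes are added, and `pages_needed` is a helper name made up for this illustration.

```python
import math

def pages_needed(total_items, limit):
    """Number of pages required when each page holds at most `limit` items."""
    return math.ceil(total_items / limit)

total_jokes = 649  # value reported in this tutorial; the live API may differ
for limit in range(10, 31, 10):
    print(f"Limit {limit} gives {pages_needed(total_jokes, limit)} pages")
```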


--- 
#### Exercise 2

In addition to the `limit` parameter, you can specify the current page number with the `page` parameter (e.g., `params={"term": "", "page": 2}`). 

Adapt the function `find_jokes()` such that it loops over all available pages and stores the ids and jokes in a list. You can leave the `limit` parameter at its default value (20). Make sure that your function also works when you pass it a search `term`. Tip: to determine how many pages you need to loop through, you can use the `total_pages` field (e.g., there are only ten cat jokes, so in that case, 1 page would suffice).

#### Solutions

In [None]:
def find_jokes(term):
    search_url = "https://icanhazdadjoke.com/search"
    page = 1
    jokes = []

    while True: 
        response = requests.get(search_url, 
                                headers={"Accept": "application/json"}, 
                                params={"term": term,  # optionally you can add "limit": 20 but that's already the default so it doesn't change anything
                                        "page": page})
        joke_request = response.json()
        jokes.extend(joke_request['results'])
        # continue only while there is a next page; `<` avoids requesting a page beyond `total_pages`
        if joke_request['current_page'] < joke_request['total_pages']:
            page += 1
        else: 
            return jokes

output = find_jokes("")
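A page loop like this is easier to reason about when the stopping rule can be tried without the network. The sketch below separates the loop from the request: `collect_all_pages` accepts any function that returns one parsed page, and `fake_fetch` is a hypothetical stand-in that mimics the `current_page`, `total_pages`, and `results` fields with 45 made-up jokes. It is an illustration under those assumptions, not the API itself.

```python
def collect_all_pages(fetch_page):
    """Collect `results` from every page; `fetch_page(page)` returns one parsed response."""
    page = 1
    items = []
    while True:
        data = fetch_page(page)
        items.extend(data["results"])
        # stop as soon as the last page has been collected
        if data["current_page"] >= data["total_pages"]:
            return items
        page += 1

# Hypothetical stand-in for the API: 45 fake jokes, 20 per page
FAKE_JOKES = [{"id": str(i), "joke": f"joke {i}"} for i in range(45)]
requested_pages = []

def fake_fetch(page, limit=20):
    requested_pages.append(page)  # remember which pages were asked for
    start = (page - 1) * limit
    return {"current_page": page,
            "total_pages": -(-len(FAKE_JOKES) // limit),  # ceiling division
            "results": FAKE_JOKES[start:start + limit]}

all_jokes = collect_all_pages(fake_fetch)
print(len(all_jokes), requested_pages)  # 45 [1, 2, 3]
```

Replacing `fake_fetch` with a function that calls `requests.get(...)` and returns `response.json()` would give the real version.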

### 1.3 Wrap-up
...

--- 
## 2. Reddit

### 2.1 Subreddits

Although we already touched upon the Reddit API last time, we'll provide a more thorough description of subreddits here as this entire tutorial is devoted to getting started with the API. Users can post content in subreddits, which are niche communities around a specific topic. There is a subreddit for almost everything, and they all start with `reddit.com/r/...`, for example, [askreddit](https://www.reddit.com/r/AskReddit), [aww](https://www.reddit.com/r/aww/), [gifs](https://www.reddit.com/r/gifs/), [showerthoughts](https://www.reddit.com/r/Showerthoughts), [lifehacks](https://www.reddit.com/r/lifehacks), [getmotivated](https://www.reddit.com/r/GetMotivated), [moviedetails](https://www.reddit.com/r/MovieDetails), [todayilearned](https://www.reddit.com/r/todayilearned/), or [foodporn](https://www.reddit.com/r/FoodPorn/). 

<img src="images/reddit_science.png" width=60% align="left"  style="border: 1px solid black"/>

Subreddits are hosted by moderators and come with their own set of rules (e.g., links to papers you share in [`r/science`](https://www.reddit.com/r/science/) must be less than 6 months old). Other users can join a subreddit so that they receive updates about new posts and comments.

<img src="images/reddit_moderators.png" width=60% align="left"  style="border: 1px solid black"/>

#### Exercise 3
Consult the [`marketing`](https://www.reddit.com/r/marketing/hot/) subreddit and answer the following questions: 
1. For your thesis, you need to collect a couple more survey responses. Are you allowed to share a link to your survey in this subreddit? Please elaborate how you came to this conclusion. 
2. You're a bit stubborn and decide to do it anyway and therefore run the risk of being reported by one of the moderators. How many moderators take care of managing this subreddit? 
3. As on other social media platforms, you can navigate to Reddit user profiles and learn more about these users. Inspect the profile of one of the moderators of the marketing subreddit, [`sixwaystop313`](https://www.reddit.com/user/sixwaystop313), and describe in your own words what types of information you can gather about this user. How is the feed organized? 

#### Solutions
1. No, the subreddit rules prescribe users not to post surveys and homework assignments (right sidebar).
2. `r/marketing` is moderated by 10 users (of which 1 AutoModerator)
3. On a user page you find the bio, trophies, communities the user moderates, connected accounts, and most importantly: all of the user's posts and comments.

---

### 2.2 API headers  


**Importance**  

To request data from the Reddit API we need to include so-called `headers` in our request. HTTP headers are an important part of the API request as they include meta-data associated with the request (e.g., type of browser, language, expected data format, etc.). 
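To make the idea of headers as metadata concrete, the sketch below builds a request object with Python's built-in `urllib` and lists the headers attached to it, without sending anything. The two header values are illustrative assumptions (the user-agent string is made up) and not a statement of what Reddit requires.

```python
from urllib.request import Request

url = "https://www.reddit.com/r/marketing/about/moderators/.json"

# Headers are plain key-value pairs that travel along with the request
example_headers = {
    "User-Agent": "odcm-tutorial-example/0.1",  # hypothetical identifier
    "Accept": "application/json",               # the data format we would like back
}

request = Request(url, headers=example_headers)  # built, but not sent

# urllib normalizes the capitalization of header names (e.g., "User-agent")
for name, value in request.header_items():
    print(f"{name}: {value}")
```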

**Let's try it out**  

Below we make a request to the moderators page of the [`marketing`](https://www.reddit.com/r/marketing/) subreddit that includes such a header. In the upcoming exercise, we make our very first request to the Reddit API and parse the output!

In [84]:
import requests
url = 'https://www.reddit.com/r/marketing/about/moderators/.json'

headers = {'authority': 'www.reddit.com', 'cache-control': 'max-age=0', 'upgrade-insecure-requests': '1', 'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36', 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9', 'sec-fetch-site': 'same-origin', 'sec-fetch-mode': 'navigate', 'sec-fetch-user': '?1', 'sec-fetch-dest': 'document', 'accept-language': 'en-GB,en;q=0.9'}
response = requests.get(url, headers=headers)
json_response = response.json()

#### Exercise 4
1. First, take a look at the `json_response` object. Then, leave out the `headers` parameter in your request, run the cell again, and inspect the `json_response` another time. Are there any differences? 
2. Write a for-loop that prints the moderator `name` of the `marketing` subreddit. Every subreddit includes a bot moderator (`AutoModerator`) which should not be included.
3. Convert your code from the previous exercise into a function `get_mods()` that takes a `subreddit` as input and returns a list of moderator names. Test your function for the `science` subreddit. How many moderators does it have? 

#### Solutions
1. Without the `headers` parameter, it returns an error code (429).

In [85]:
# Question 2 
# don't forget to run the request object with headers again!
for item in json_response['data']['children']: 
    moderator_name = item['name']
    if moderator_name != 'AutoModerator': 
        print(moderator_name)

dpatrick86
v022450781
r0nin
Gustomaximus
everythingswan
sixwaystop313
shampine
JonODonovan
AptSeagull


In [90]:
# Question 3
def get_mods(subreddit):
    moderator_names = []
    response = requests.get(f'https://www.reddit.com/r/{subreddit}/about/moderators/.json', headers=headers)
    json_response = response.json()
    for item in json_response['data']['children']:
        moderator_name = item['name']
        if moderator_name != 'AutoModerator':
            moderator_names.append(moderator_name)
    return moderator_names
    
science_moderators = get_mods('science')
print(f"The science subreddit is moderated by {len(science_moderators)} users")

The `science` subreddit is moderated by 1545 users
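The filtering step can also be tried in isolation, which helps when the API is rate-limiting you. The sketch below applies the same logic to a small hand-written dictionary that only mimics the `data` → `children` → `name` structure used above; `extract_moderator_names` and `sample_response` are names introduced for this illustration.

```python
def extract_moderator_names(json_response, exclude=("AutoModerator",)):
    """Return moderator names from a parsed moderators response, skipping bots in `exclude`."""
    return [item["name"]
            for item in json_response["data"]["children"]
            if item["name"] not in exclude]

# Hand-written, heavily abbreviated stand-in for a real response
sample_response = {
    "data": {
        "children": [
            {"name": "AutoModerator"},
            {"name": "dpatrick86"},
            {"name": "sixwaystop313"},
        ]
    }
}

print(extract_moderator_names(sample_response))  # ['dpatrick86', 'sixwaystop313']
```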


---
### 2.3 Pagination

**Importance**  

In addition to subreddits (`r/...`) and moderator pages (`.../about/moderators`), Reddit users have their own profile page. Let's have another look at the marketing moderator [profile](https://www.reddit.com/user/sixwaystop313) we saw before, this time requested as JSON. Each of the `children` in `data` is characterized by a type (e.g., `t1` = comment, `t3` = post), subreddit, timestamp, number of comments, upvotes, downvotes, and many other fields. 

In [133]:
mod = "sixwaystop313"
response = requests.get(f'https://www.reddit.com/user/{mod}.json', headers=headers)
json_response = response.json()
json_response['data']['children'][0] # first item in the list

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'Detroit',
  'selftext': '',
  'author_fullname': 't2_3pmgd',
  'saved': False,
  'mod_reason_title': None,
  'gilded': 0,
  'clicked': False,
  'title': "Bedrock offers outdoor dining, free parking and more with 'Decked Out Detroit'",
  'link_flair_richtext': [],
  'subreddit_name_prefixed': 'r/Detroit',
  'hidden': False,
  'pwls': 6,
  'link_flair_css_class': '',
  'downs': 0,
  'thumbnail_height': 78,
  'top_awarded_type': None,
  'hide_score': False,
  'name': 't3_kaidyk',
  'quarantine': False,
  'link_flair_text_color': 'light',
  'upvote_ratio': 0.9,
  'author_flair_background_color': None,
  'subreddit_type': 'public',
  'ups': 21,
  'total_awards_received': 0,
  'media_embed': {},
  'thumbnail_width': 140,
  'author_flair_template_id': None,
  'is_original_content': False,
  'user_reports': [],
  'secure_media': None,
  'is_reddit_media_domain': False,
  'is_meta': False,
  'category': None,
  'secure_media_embe

#### Exercise 5
1. In the `json_response` object find a comment of the author (`kind`: `'t1'`). You can change the counter `[0]` until you come across a comment.
2. Store the text of the comment in a variable called `comment_text`. 
3. How many objects are stored in `json_response['data']['children']`? What's the last one called? 

#### Solutions
1. At the moment of writing the 3rd item in the list is a comment:
`json_response['data']['children'][2]`
2. The text of the comment is stored in a `body` element: 
`comment_text = json_response['data']['children'][2]['data']['body']`
3. The object comprises 25 items (`len(json_response['data']['children'])`)
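Rather than changing the counter by hand, you can let Python pick out the comments. The sketch below runs on a small hand-written list that only mimics the `kind` and `data` keys shown above; the titles and comment texts are made up for illustration.

```python
# Hand-written stand-in for json_response['data']['children']
sample_children = [
    {"kind": "t3", "data": {"title": "Example post title", "subreddit": "Detroit"}},
    {"kind": "t3", "data": {"title": "Another example post", "subreddit": "Detroit"}},
    {"kind": "t1", "data": {"body": "Example comment text", "subreddit": "marketing"}},
]

# Comments are the items whose kind is 't1'; their text sits in data -> body
comments = [child["data"]["body"]
            for child in sample_children
            if child["kind"] == "t1"]

# Position of the first comment in the list (None if there is no comment)
first_comment_index = next(
    (i for i, child in enumerate(sample_children) if child["kind"] == "t1"), None)

print(first_comment_index, comments)  # 2 ['Example comment text']
```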

As you just noticed, the API only returns a subset of all records (every time you scroll to the bottom of the page, it pulls in new data, ordered chronologically). It relies on a concept called "pagination" so that it does not need to send all data at once (which would take ages for a user who has been active on Reddit since 2009!). In many ways it's similar to the `books.toscrape.com` site where we looped through all pages to obtain the book URLs (`/page-1`, `/page-2`, etc.). Similarly, we can pass our request a parameter `after` that tells the API which part of the data it needs to return. The difference, however, is that it's not a number (like `/page-1`) but a seemingly random string of characters that can only be obtained from the previous request. More specifically, the response to the request we already made contains this string in its `after` field:

In [145]:
json_response['data']['after']

't1_gdif7j5'

In [151]:
# request the next page by passing the `after` value from the previous response (?after=...)
after = json_response['data']['after']
url = f'https://www.reddit.com/user/{mod}.json?after={after}'
response = requests.get(url, headers=headers)
json_response_after = response.json()
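Following `after` from one response to the next can be wrapped in a loop. The sketch below shows that pattern without touching the network: `collect_listing` takes any function that returns a parsed page for a given `after` value, and `FAKE_PAGES` is a hypothetical three-page stand-in whose cursor strings are invented. It assumes that the last page reports `after` as `None`.

```python
def collect_listing(fetch_page, max_pages=20):
    """Follow `after` cursors until there are no more pages or `max_pages` is reached."""
    items = []
    after = None  # None means: start at the first page
    for _ in range(max_pages):
        data = fetch_page(after)["data"]
        items.extend(data["children"])
        after = data["after"]
        if after is None:  # no further pages
            break
    return items

# Hypothetical stand-in for the API: three tiny pages keyed by cursor
FAKE_PAGES = {
    None: {"data": {"children": ["a", "b"], "after": "t1_page2"}},
    "t1_page2": {"data": {"children": ["c", "d"], "after": "t1_page3"}},
    "t1_page3": {"data": {"children": ["e"], "after": None}},
}

def fake_fetch(after):
    return FAKE_PAGES[after]

print(collect_listing(fake_fetch))  # ['a', 'b', 'c', 'd', 'e']
```

The `max_pages` cap keeps the number of requests bounded even for very active users.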

#### Exercise 6 
1. In which subreddit was the user very active?
2. ... 
3. ... 

Use screenshots to show how one page ends and how the next API request picks up where it left off. 

In [None]:
mod = "wub_wub"
after = None

activity_list = []

for _ in range(20): # don't set this too high, otherwise it takes very long (API limit of 1000)
    if after is None:
        url = f"https://www.reddit.com/user/{mod}.json"
    else: 
        url = f"https://www.reddit.com/user/{mod}.json?after={after}"
    response = requests.get(url, headers=headers)
    json_response = response.json()
    # build the list from the page we just fetched, not from an earlier response
    mod_activity = [{'mod': mod, 'activity_utc': int(item['data']['created_utc'])} for item in json_response['data']['children']]
    activity_list.extend(mod_activity)
    after = json_response['data']['after']
    if after is None:  # no further pages, stop instead of starting over at page 1
        break

Reddit API documentation: https://www.reddit.com/dev/api/

Type prefixes:
* `t1` = comment
* `t2` = account
* `t3` = link
* `t4` = message
* `t5` = subreddit
* `t6` = award
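These prefixes show up in identifiers such as `t1_gdif7j5` (the `after` value above) and `t3_kaidyk` (the `name` of the post above). A minimal sketch for translating them into readable labels; `KIND_LABELS` and `describe_fullname` are names introduced here for illustration.

```python
KIND_LABELS = {
    "t1": "comment",
    "t2": "account",
    "t3": "link",
    "t4": "message",
    "t5": "subreddit",
    "t6": "award",
}

def describe_fullname(fullname):
    """Split an identifier like 't1_gdif7j5' into (label, id)."""
    prefix, _, item_id = fullname.partition("_")
    return KIND_LABELS.get(prefix, "unknown"), item_id

print(describe_fullname("t1_gdif7j5"))  # ('comment', 'gdif7j5')
print(describe_fullname("t3_kaidyk"))   # ('link', 'kaidyk')
```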

* Have a look at the account pages of the moderators
    * https://www.reddit.com/user/dpatrick86
    * reddit.com/user/XXX
    
    

In [58]:
# after = collect more information - beyond what you see
mod = "wub_wub"
response = requests.get(f'https://www.reddit.com/user/{mod}.json', headers=headers)
json_response = response.json()


In [59]:
# 25 results
len(json_response['data']['children'])

25

In [60]:
# created utc 
[{'mod': mod, 'activity_utc': int(item['data']['created_utc'])} for item in json_response['data']['children']]

[{'mod': 'wub_wub', 'activity_utc': 1607509350},
 {'mod': 'wub_wub', 'activity_utc': 1607321170},
 {'mod': 'wub_wub', 'activity_utc': 1607188726},
 {'mod': 'wub_wub', 'activity_utc': 1607105309},
 {'mod': 'wub_wub', 'activity_utc': 1607100694},
 {'mod': 'wub_wub', 'activity_utc': 1607100606},
 {'mod': 'wub_wub', 'activity_utc': 1606980304},
 {'mod': 'wub_wub', 'activity_utc': 1606978359},
 {'mod': 'wub_wub', 'activity_utc': 1606977826},
 {'mod': 'wub_wub', 'activity_utc': 1606906189},
 {'mod': 'wub_wub', 'activity_utc': 1606903404},
 {'mod': 'wub_wub', 'activity_utc': 1606809046},
 {'mod': 'wub_wub', 'activity_utc': 1606730701},
 {'mod': 'wub_wub', 'activity_utc': 1606664323},
 {'mod': 'wub_wub', 'activity_utc': 1606650180},
 {'mod': 'wub_wub', 'activity_utc': 1606553745},
 {'mod': 'wub_wub', 'activity_utc': 1606489021},
 {'mod': 'wub_wub', 'activity_utc': 1606483945},
 {'mod': 'wub_wub', 'activity_utc': 1606482863},
 {'mod': 'wub_wub', 'activity_utc': 1606482170},
 {'mod': 'wub_wub', 

In [62]:
# request the next page by passing the `after` value from the previous response (?after=...)
after = json_response['data']['after']
response = requests.get(f'https://www.reddit.com/user/{mod}.json?after={after}', headers=headers)
json_response_after = response.json()
json_response_after

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 25,
  'children': [{'kind': 't1',
    'data': {'total_awards_received': 0,
     'approved_at_utc': None,
     'comment_type': None,
     'awarders': [],
     'mod_reason_by': None,
     'banned_by': None,
     'author_flair_type': 'text',
     'removal_reason': None,
     'link_id': 't3_jxc8bu',
     'author_flair_template_id': None,
     'likes': None,
     'replies': '',
     'user_reports': [],
     'saved': False,
     'id': 'gcx16tw',
     'banned_at_utc': None,
     'mod_reason_title': None,
     'gilded': 0,
     'archived': False,
     'no_follow': True,
     'author': 'wub_wub',
     'num_comments': 9,
     'edited': False,
     'can_mod_post': False,
     'created_utc': 1605852330.0,
     'send_replies': True,
     'parent_id': 't1_gcwvwlp',
     'score': 1,
     'author_fullname': 't2_5352p',
     'over_18': False,
     'treatment_tags': [],
     'approved_by': None,
     'mod_note': None,
     'all_awardings': [],
     

### 2.4 Time Conversion

In [75]:
import time 

time_example = 1595571434
print(time.strftime("%Y-%m-%d", time.gmtime(time_example)))
print(time.strftime("%H:%M", time.gmtime(time_example)))  # %M = minutes (%m would print the month)

2020-07-24
06:17
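The `created_utc` values returned by Reddit are Unix timestamps: seconds since 1 January 1970 (UTC). As an alternative sketch, the `datetime` module converts them into a readable date and time in one step. Note that in format strings `%M` stands for minutes and `%m` for the month. `utc_string` is a helper name made up for this example; the two timestamps are taken from this tutorial.

```python
from datetime import datetime, timezone

def utc_string(timestamp):
    """Format a Unix timestamp as 'YYYY-MM-DD HH:MM:SS' in UTC."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")

print(utc_string(1595571434))  # 2020-07-24 06:17:14
print(utc_string(1607509350))  # 2020-12-09 10:22:30
```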


### 2.5 Wrap-Up

You cannot simply scrape every site, for example when the content sits behind a log-in or when the website has taken measures to prevent its data from being scraped. 