# API 101 (oDCM)


--- 

## Learning Objectives

Students will be able to: 
* Send HTTP requests to retrieve data from APIs
* Iterate over multiple pages 
* Extract and store the results of an API request

--- 

## Acknowledgements
This course draws on online resources built by Adam Williamson, Brian Keegan, Colt Steele, David Amos, Hannah Cushman Garland, Kimberly Fessel, and Thomas Laetsch. 


--- 

## Contact
For technical issues, be as specific as possible (e.g., include screenshots, your notebook, and error messages) so that we can help you better.

**WhatsApp**  
+31 13 466 8938

**Email**  
odcm@uvt.nl

## 1. Data Collection

### 1.1 What is Reddit?

Although we already touched upon the Reddit API last time, we'll provide a more thorough description here, as this entire tutorial is devoted to getting started with the API. Reddit is sometimes described as the *front page of the internet* since it gives you an up-to-date view of what's happening around the world. It's based on the principle that the community of around 1 billion users decides what is newsworthy and what's not through a voting system. You can think of Reddit upvotes as Facebook likes. Posts are arranged based on the number of votes, and those with many upvotes are featured on the homepage. The gray number next to each post is the net score (upvotes minus downvotes; 7013 in the figure below).

<img src="images/reddit_science.png" width=70% align="left"  style="border: 1px solid black"/>

Users can post content in subreddits which are niche communities around a specific topic. There is a subreddit for almost everything, and they all start with `reddit.com/r/...`, for example, [askreddit](https://www.reddit.com/r/AskReddit), [aww](https://www.reddit.com/r/aww/), [gifs](https://www.reddit.com/r/gifs/), [showerthoughts](https://www.reddit.com/r/Showerthoughts), [lifehacks](https://www.reddit.com/r/lifehacks), [getmotivated](https://www.reddit.com/r/GetMotivated), [moviedetails](https://www.reddit.com/r/MovieDetails), [todayilearned](https://www.reddit.com/r/todayilearned/), or [foodporn](https://www.reddit.com/r/FoodPorn/). Subreddits are hosted by moderators and come with their own set of rules (e.g., links to papers you share in [`r/science`](https://www.reddit.com/r/science/) must be less than 6 months old). Other users can join a subreddit so that they receive updates about new posts and comments.

<img src="images/reddit_moderators.png" width=70% align="left"  style="border: 1px solid black"/>

* Who are the moderators of a subreddit?
* When are moderators most active in posting and commenting?

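* Add `.json` to the end of a Reddit URL to get the page's data as JSON
* Request headers: copy the request as cURL from your browser and paste it into a converter (https://curl.trillworks.com) to get the matching Python code
* Limit: 1000 requests
* You can't push any data to Reddit this way
* There are more things you can do, but those require authentication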
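
The `.json` note above can be sketched as a small helper. `to_json_url` is a hypothetical name used only for illustration (it is not part of Reddit's API or of `requests`), and it assumes the URL has no query string. It only builds the URL and does not send a request.

```python
def to_json_url(url):
    """Return the JSON version of a Reddit page URL by appending '.json'."""
    if url.endswith('.json'):
        return url  # already a JSON URL, nothing to do
    if not url.endswith('/'):
        url += '/'  # make sure there is exactly one slash before '.json'
    return url + '.json'


print(to_json_url('https://www.reddit.com/r/marketing/about/moderators'))
```

The printed URL has the same shape as the one passed to `requests.get` in the next cell.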

In [24]:
import requests

headers = {'authority': 'www.reddit.com', 'cache-control': 'max-age=0', 'upgrade-insecure-requests': '1', 'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36', 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9', 'sec-fetch-site': 'same-origin', 'sec-fetch-mode': 'navigate', 'sec-fetch-user': '?1', 'sec-fetch-dest': 'document', 'accept-language': 'en-GB,en;q=0.9'}

response = requests.get('https://www.reddit.com/r/marketing/about/moderators/.json', headers=headers)


In [25]:
response

<Response [200]>
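
The number in `<Response [200]>` is the HTTP status code, and codes in the 200 range mean the request succeeded. As a rough sketch, the standard library can translate a code into its official phrase. `explain_status` is a hypothetical helper; with `requests`, the number itself is available as `response.status_code`.

```python
from http import HTTPStatus


def explain_status(code):
    """Describe an HTTP status code, e.g. the 200 in <Response [200]>."""
    phrase = HTTPStatus(code).phrase
    outcome = 'success' if 200 <= code < 300 else 'no success'
    return f'{code} {phrase} ({outcome})'


for code in (200, 404, 429):
    print(explain_status(code))
```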

In [26]:
json_response = response.json()
json_response['data']['children']

[{'name': 'dpatrick86',
  'author_flair_text': None,
  'mod_permissions': ['all'],
  'date': 1284657752.0,
  'rel_id': 'rb_7n2fa',
  'id': 't2_3c2ya',
  'author_flair_css_class': None},
 {'name': 'v022450781',
  'author_flair_text': '@valters',
  'mod_permissions': ['all'],
  'date': 1304131436.0,
  'rel_id': 'rb_epdsv',
  'id': 't2_4zxlp',
  'author_flair_css_class': 'fl-marketer'},
 {'name': 'r0nin',
  'author_flair_text': None,
  'mod_permissions': ['all'],
  'date': 1304616384.0,
  'rel_id': 'rb_exhei',
  'id': 't2_3f1dg',
  'author_flair_css_class': None},
 {'name': 'Gustomaximus',
  'author_flair_text': 'Professional',
  'mod_permissions': ['all'],
  'date': 1304616402.0,
  'rel_id': 'rb_exhf2',
  'id': 't2_43san',
  'author_flair_css_class': 'fl-professional'},
 {'name': 'everythingswan',
  'author_flair_text': None,
  'mod_permissions': ['all'],
  'date': 1329421838.0,
  'rel_id': 'rb_ym8h7',
  'id': 't2_4myk4',
  'author_flair_css_class': None},
 {'name': 'sixwaystop313',
  'a

In [22]:
# create a list of moderators
[item['name'] for item in json_response['data']['children']]

for item in json_response['data']['children']: 
    print(item['name'])

dpatrick86
v022450781
r0nin
Gustomaximus
everythingswan
sixwaystop313
shampine
JonODonovan
AutoModerator
AptSeagull


In [23]:
# ignore the AutoModerator
for item in json_response['data']['children']: 
    if item['name'] != 'AutoModerator':
        print(item['name'])

dpatrick86
v022450781
r0nin
Gustomaximus
everythingswan
sixwaystop313
shampine
JonODonovan
AptSeagull


In [32]:
# function to get the moderators from any subreddit
def get_mods(subreddit):
    response = requests.get(f'https://www.reddit.com/r/{subreddit}/about/moderators/.json', headers=headers)
    json_response = response.json()
    mods = [item['name'] for item in json_response['data']['children'] if item['name'] != 'AutoModerator']
    return mods
    
get_mods('marketing')

['dpatrick86',
 'v022450781',
 'r0nin',
 'Gustomaximus',
 'everythingswan',
 'sixwaystop313',
 'shampine',
 'JonODonovan',
 'AptSeagull']

In [35]:
# return both the subreddit and its moderators, stored as a list of dictionaries
def get_mods(subreddit):
    response = requests.get(f'https://www.reddit.com/r/{subreddit}/about/moderators/.json', headers=headers)
    json_response = response.json()
    mods = [{'subreddit': subreddit, 'mod': item['name']} for item in json_response['data']['children'] if item['name'] != 'AutoModerator']
    return mods
    
get_mods('marketing')

# we can't simply use {subreddit: name} here, because a dictionary key must always be unique!

[{'subreddit': 'marketing', 'mod': 'dpatrick86'},
 {'subreddit': 'marketing', 'mod': 'v022450781'},
 {'subreddit': 'marketing', 'mod': 'r0nin'},
 {'subreddit': 'marketing', 'mod': 'Gustomaximus'},
 {'subreddit': 'marketing', 'mod': 'everythingswan'},
 {'subreddit': 'marketing', 'mod': 'sixwaystop313'},
 {'subreddit': 'marketing', 'mod': 'shampine'},
 {'subreddit': 'marketing', 'mod': 'JonODonovan'},
 {'subreddit': 'marketing', 'mod': 'AptSeagull'}]
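
The remark about unique keys can be checked with a few lines of plain Python. This is a minimal sketch on three of the names from the output above; the variable names `overwritten`, `records`, and `grouped` are only for illustration.

```python
mods = ['dpatrick86', 'v022450781', 'r0nin']

# A single dictionary keeps only the last value written to a key,
# so every moderator overwrites the previous one.
overwritten = {}
for mod in mods:
    overwritten['marketing'] = mod
print(overwritten)

# Option 1 (used above): one small dictionary per moderator.
records = [{'subreddit': 'marketing', 'mod': mod} for mod in mods]
print(records)

# Option 2: one key per subreddit that maps to a list of moderators.
grouped = {'marketing': mods}
print(grouped)
```

The list of small dictionaries is convenient when the data later goes into a table with one row per moderator.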

Reddit identifies the type of each object with a prefix (see https://www.reddit.com/dev/api/):

* `t1` = comment
* `t2` = account
* `t3` = link
* `t4` = message
* `t5` = subreddit
* `t6` = award
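
These prefixes show up in identifiers such as `t2_3c2ya` (a moderator's `id`) or `t3_jxc8bu` (a comment's `link_id`). A minimal sketch of how to decode them is given below; `KINDS` and `parse_fullname` are hypothetical names, not something the Reddit API provides.

```python
KINDS = {
    't1': 'comment',
    't2': 'account',
    't3': 'link',
    't4': 'message',
    't5': 'subreddit',
    't6': 'award',
}


def parse_fullname(fullname):
    """Split an identifier like 't2_3c2ya' into its kind and its id."""
    prefix, _, item_id = fullname.partition('_')
    return KINDS.get(prefix, 'unknown'), item_id


print(parse_fullname('t2_3c2ya'))
print(parse_fullname('t3_jxc8bu'))
```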

* Look at the account pages of the moderators
    * https://www.reddit.com/user/dpatrick86
    * reddit.com/user/XXX

**Comments**  
* `body` = text of the comment  
* `created_utc` = date and time of the comment in epoch time, i.e., the number of seconds since 1 January 1970 (see https://www.epochconverter.com and Exercise 5 of Web Data for Dummies)


In [58]:
# after = collect more information - beyond what you see
mod = "wub_wub"
response = requests.get(f'https://www.reddit.com/user/{mod}.json', headers=headers)
json_response = response.json()


In [59]:
# 25 results
len(json_response['data']['children'])

25

In [60]:
# created utc 
[{'mod': mod, 'activity_utc': int(item['data']['created_utc'])} for item in json_response['data']['children']]

[{'mod': 'wub_wub', 'activity_utc': 1607509350},
 {'mod': 'wub_wub', 'activity_utc': 1607321170},
 {'mod': 'wub_wub', 'activity_utc': 1607188726},
 {'mod': 'wub_wub', 'activity_utc': 1607105309},
 {'mod': 'wub_wub', 'activity_utc': 1607100694},
 {'mod': 'wub_wub', 'activity_utc': 1607100606},
 {'mod': 'wub_wub', 'activity_utc': 1606980304},
 {'mod': 'wub_wub', 'activity_utc': 1606978359},
 {'mod': 'wub_wub', 'activity_utc': 1606977826},
 {'mod': 'wub_wub', 'activity_utc': 1606906189},
 {'mod': 'wub_wub', 'activity_utc': 1606903404},
 {'mod': 'wub_wub', 'activity_utc': 1606809046},
 {'mod': 'wub_wub', 'activity_utc': 1606730701},
 {'mod': 'wub_wub', 'activity_utc': 1606664323},
 {'mod': 'wub_wub', 'activity_utc': 1606650180},
 {'mod': 'wub_wub', 'activity_utc': 1606553745},
 {'mod': 'wub_wub', 'activity_utc': 1606489021},
 {'mod': 'wub_wub', 'activity_utc': 1606483945},
 {'mod': 'wub_wub', 'activity_utc': 1606482863},
 {'mod': 'wub_wub', 'activity_utc': 1606482170},
 {'mod': 'wub_wub', 

In [62]:
# pass the `after` token as a query parameter (?after=...) to request the next page
after = json_response['data']['after']
response = requests.get(f'https://www.reddit.com/user/{mod}.json?after={after}', headers=headers)
json_response_after = response.json()
json_response_after

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 25,
  'children': [{'kind': 't1',
    'data': {'total_awards_received': 0,
     'approved_at_utc': None,
     'comment_type': None,
     'awarders': [],
     'mod_reason_by': None,
     'banned_by': None,
     'author_flair_type': 'text',
     'removal_reason': None,
     'link_id': 't3_jxc8bu',
     'author_flair_template_id': None,
     'likes': None,
     'replies': '',
     'user_reports': [],
     'saved': False,
     'id': 'gcx16tw',
     'banned_at_utc': None,
     'mod_reason_title': None,
     'gilded': 0,
     'archived': False,
     'no_follow': True,
     'author': 'wub_wub',
     'num_comments': 9,
     'edited': False,
     'can_mod_post': False,
     'created_utc': 1605852330.0,
     'send_replies': True,
     'parent_id': 't1_gcwvwlp',
     'score': 1,
     'author_fullname': 't2_5352p',
     'over_18': False,
     'treatment_tags': [],
     'approved_by': None,
     'mod_note': None,
     'all_awardings': [],
     

In [67]:
mod = "wub_wub"
after = None

activity_list = []

for _ in range(20):
    # the first request has no `after` token yet
    if after is None:
        url = f"https://www.reddit.com/user/{mod}.json"
    else: 
        url = f"https://www.reddit.com/user/{mod}.json?after={after}"
    response = requests.get(url, headers=headers)
    page = response.json()
    # use the page we just fetched, not the result of an earlier cell
    mod_activity = [{'mod': mod, 'activity_utc': int(item['data']['created_utc'])} for item in page['data']['children']]
    activity_list.extend(mod_activity)
    after = page['data']['after']
    if after is None:  # no more pages left
        break

In [69]:
len(activity_list)

500
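
The paging logic of the loop above can also be written as a function that can be tried without an internet connection. This is a sketch under assumptions: `collect_activity`, `fake_fetch`, and the two tiny pages in `FAKE_PAGES` are made up for illustration and only mimic the `data` / `children` / `after` structure shown earlier.

```python
def collect_activity(fetch_page, mod, max_pages=20):
    """Follow `after` tokens until the listing ends or max_pages is reached."""
    activity = []
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)
        for item in page['data']['children']:
            activity.append({'mod': mod,
                             'activity_utc': int(item['data']['created_utc'])})
        after = page['data']['after']
        if after is None:  # the API signals the last page with after = None
            break
    return activity


# Stand-in for the real API: two fake pages keyed by the token that requests them.
FAKE_PAGES = {
    None: {'data': {'after': 't1_abc',
                    'children': [{'data': {'created_utc': 1607509350.0}}]}},
    't1_abc': {'data': {'after': None,
                        'children': [{'data': {'created_utc': 1607321170.0}}]}},
}


def fake_fetch(after):
    return FAKE_PAGES[after]


print(collect_activity(fake_fetch, 'wub_wub'))
```

For real data, `fetch_page` could wrap a `requests.get(...)` call with `params={'after': after}`; `requests` leaves out parameters whose value is `None`, so the first request would need no special case.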

In [75]:
import time 

time_example = 1595571434
print(time.strftime("%Y-%m-%d", time.gmtime(time_example)))
print(time.strftime("%H:%M", time.gmtime(time_example)))  # %M = minutes (%m would be the month)

2020-07-24
06:17
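
Once timestamps can be converted, the question of when moderators are most active comes down to counting activity per hour of the day. Below is a minimal sketch using the first six `activity_utc` values from the output above; `activity_by_hour` is a hypothetical helper, and all hours are in UTC.

```python
from collections import Counter
from datetime import datetime, timezone


def activity_by_hour(timestamps):
    """Count how many epoch timestamps fall in each hour of the day (UTC)."""
    hours = [datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in timestamps]
    return Counter(hours)


sample = [1607509350, 1607321170, 1607188726, 1607105309, 1607100694, 1607100606]
counts = activity_by_hour(sample)
for hour, n in sorted(counts.items()):
    print(f'{hour:02d}:00 UTC  {n}')
```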


### 1.3 Modularize code

In [None]:
# listing endpoints: /hot, /new, /top
# query parameters:
#   count=10
#   limit=10
#   t= one of {hour, day, week, month, year, all}
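
One way to modularize these notes is a function that turns a subreddit, a listing type, and the optional parameters into a URL plus a `params` dictionary. This is a sketch under assumptions: `build_listing_request` is a hypothetical helper, and the URL pattern `/r/<subreddit>/<sort>.json` is assumed by analogy with the `.json` URLs used earlier. It only prepares the request and does not send it.

```python
def build_listing_request(subreddit, sort='hot', limit=10, t=None):
    """Return the URL and query parameters for a subreddit listing."""
    if sort not in ('hot', 'new', 'top'):
        raise ValueError(f'unknown sort: {sort}')
    url = f'https://www.reddit.com/r/{subreddit}/{sort}.json'
    params = {'limit': limit}
    if t is not None:
        params['t'] = t  # time window: hour, day, week, month, year, or all
    return url, params


url, params = build_listing_request('marketing', sort='top', t='week')
print(url)
print(params)
```

The two return values can then be passed on as `requests.get(url, headers=headers, params=params)`.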

## 2. Data Preprocessing

### 2.1 Time
### 2.2 Missing Values 
### 2.3 Data Imputation
### 2.4 Export Data

---

* People rarely edit the query parameters in the URL by hand; they use the buttons and sliders on the page, which set those parameters for them
* For each joke, we have an `id` and the `joke` text

In [2]:
import requests 

# rather than hardcoding it like this
response = requests.get(
    "http://www.example.com?key1=value1&key2=value2"
)

# this is the preferred way 
response = requests.get(
    "http://www.example.com",
    params={
        "key1": "value1",
        "key2": "value2"
    }
)
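
One reason to prefer `params` is that special characters get encoded for you. The sketch below uses `urlencode` from the standard library to show what such an encoded query string looks like; `requests` applies comparable encoding when it builds the final URL. The search term `cats & dogs` is just an example value.

```python
from urllib.parse import urlencode

# Simple values end up exactly like the hardcoded version.
params = {"key1": "value1", "key2": "value2"}
query = urlencode(params)
print("http://www.example.com?" + query)

# Spaces and ampersands would break a hand-written URL,
# so they are replaced by safe equivalents.
tricky = urlencode({"term": "cats & dogs"})
print(tricky)
```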

In [None]:
# documentation: https://icanhazdadjoke.com/api
# page = which page of the results to fetch (default: 1)
# limit = number of results per page (default: 20) (max:30)
# term = search term to use (default: list all jokes)
url = "https://icanhazdadjoke.com/search"
response = requests.get(
    url,
    headers={"Accept": "application/json"},
    params={"term": "cat", "page": 2, "limit": 1},
)
data = response.json()  # parsed JSON, returned as a Python dictionary
data

In [None]:
# API project
* Print a random joke that matches the user's search query
* If there are no jokes about the search term, print an appropriate message ("Sorry, I don't have any jokes about X. Please try again.")
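
A possible starting point for the project, kept offline so the logic can be tested first. It assumes the search response stores its jokes in a `results` list of dictionaries with an `id` and a `joke`; `pick_joke` is a hypothetical helper, and `sample_data` is invented placeholder data rather than real API output.

```python
import random


def pick_joke(data, term):
    """Return a random joke from a search response, or an apology if there is none."""
    results = data.get('results', [])
    if not results:
        return f"Sorry, I don't have any jokes about {term}. Please try again."
    return random.choice(results)['joke']


# Placeholder data in the assumed shape of a search response.
sample_data = {'results': [{'id': 'abc123', 'joke': 'Placeholder joke about cats.'}]}

print(pick_joke(sample_data, 'cat'))
print(pick_joke({'results': []}, 'spreadsheets'))
```

In the notebook, `data` would come from `response.json()` and `term` from `input()`.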

* https://github.com/kimfetti/Conferences/tree/master/PyCon_2020
* https://www.youtube.com/watch?v=RUQWPJ1T6Zc&t=190s
* https://github.com/hancush/web-scraping-with-python/blob/master/session/web-scraping-with-python.ipynb#HTML-basics
* https://www.udemy.com/course/the-modern-python3-bootcamp/learn/lecture/7991196#overview
* https://campus.datacamp.com/courses/web-scraping-with-python/introduction-to-html?ex=1
* https://realpython.com/python-web-scraping-practical-introduction/
* https://github.com/CU-ITSS/Web-Data-Scraping-S2019