# 🗓️ Week 08: Data summarisation and more grammar-of-graphics

In this week’s lecture, we’re going to explore a different approach to collecting data from the web. Instead of scraping data from a single page, we’ll connect to something called an API. This will enable us to gather data from a website in a more organised manner, usually in JSON or XML formats.

Additionally, we’ll dive into a theoretical framework for data visualisation known as the grammar of graphics. We’ll use the `plotnine` library in Python to apply this framework and create more effective and engaging visualisations. Looking forward to an exciting week!

## Setup

In [None]:
import os
import json
import requests

import pandas as pd
import matplotlib.pyplot as plt

from pprint import pprint
from scrapy import Selector
from tqdm.notebook import tqdm

## 1. Collecting data from an API

So far we have spent a reasonable amount of time gathering data via web scraping. We send requests to a web server, pretending to be a browser operating at the request of a human user, and then we reverse engineer the HTML code to extract the information we need. 

However, web scraping is not the only way to obtain data from the web; sometimes, it isn't even the best way!

Many websites, platforms and organizations offer **direct access** to their data via an API (Application Programming Interface). I will illustrate the process by using the Reddit API.

### 1.1 First, an IMPORTANT detour: read credentials

To use the API, we need to provide several sensitive pieces of information: 

- Your Reddit username (do you want people to know it?)
- Your Reddit password, in plain text (do you want people to know it?)
- Your Reddit app's client ID (do you want anyone to send requests on your behalf?)
- Your Reddit app's client secret (do you want anyone to send requests on your behalf?)

If I leave this information in the notebook (on GitHub, especially), anyone reading it can impersonate me and send requests to Reddit on my behalf, which is a **serious security risk**.

**Never leave your credentials anywhere in your GitHub repository or notebook.**

In [2]:
credentials_file_path = "./credentials.json"

#open the file and load the data into a variable
with open(credentials_file_path, "r") as file:
    credentials = json.load(file)
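Before moving on, it can be worth checking that the file contains every key the rest of this notebook expects. Here is a minimal sketch, assuming the four key names used in the cells below; the helper name `find_missing_keys` and the placeholder values are made up for illustration:

```python
# Key names taken from the cells later in this notebook
REQUIRED_KEYS = {"app_client_id", "app_client_secret",
                 "reddit_username", "reddit_password"}

def find_missing_keys(credentials):
    """Return the sorted list of required keys absent from the dict."""
    return sorted(REQUIRED_KEYS - set(credentials))

# Placeholder values for illustration only; never commit real ones
example_credentials = {"app_client_id": "xxx", "reddit_username": "some_user"}

missing = find_missing_keys(example_credentials)
print(missing)  # ['app_client_secret', 'reddit_password']
```

Adding `credentials.json` to your `.gitignore` file also keeps Git from committing it by accident.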

Okay, now that I've got that out of the way, let's access the API!

### 1.2 Obtain a Token

The Reddit API requires, probably for security reasons, that we obtain an **access token** every time we access it via a script. We first send a request containing our credentials to a specific **API endpoint** and get back a string of characters, a token, that we can use to access the API for a limited amount of time.

In [4]:
#We will use the requests library, only this time we have to set up authentication parameters first
client_auth = requests.auth.HTTPBasicAuth(credentials["app_client_id"], credentials["app_client_secret"])

#You also need to send, via HTTP POST, your Reddit username and password
post_data = {"grant_type": "password", "username": credentials["reddit_username"], "password": credentials["reddit_password"]}

#Just like Wikimedia, the Reddit API also asks that we identify ourselves in the User-Agent
headers = {"User-Agent": f"LSE DS105W API practice by {credentials['reddit_username']}"}
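If you are curious about what the `HTTPBasicAuth` object does with the client ID and secret: HTTP Basic authentication joins the two with a colon, Base64-encodes the result and sends it in an `Authorization` header. The sketch below reproduces that idea with the standard library only. The function name `basic_auth_header` and the values are placeholders, and this is an illustration of the scheme, not the exact code inside `requests`:

```python
import base64

def basic_auth_header(client_id, client_secret):
    """Build the Authorization header value used by HTTP Basic auth."""
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")

# Placeholder values, not real credentials
header_value = basic_auth_header("my_id", "my_secret")
print(header_value)
```

Note that Base64 is an encoding, not encryption: anyone who sees the header can decode it, which is one more reason to keep these values out of your repository.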

**Actually send the request**

In [5]:
#From their documentation, I learned this is the endpoint I need
ACCESS_TOKEN_ENDPOINT = "https://www.reddit.com/api/v1/access_token"

#This time we are sending an HTTP POST request instead of an HTTP GET
response = requests.post(ACCESS_TOKEN_ENDPOINT, auth=client_auth, data=post_data, headers=headers)
response.json()

{'access_token': 'eyJhbGciOiJSUzI1NiIsImtpZCI6IlNIQTI1NjpzS3dsMnlsV0VtMjVmcXhwTU40cWY4MXE2OWFFdWFyMnpLMUdhVGxjdWNZIiwidHlwIjoiSldUIn0.eyJzdWIiOiJ1c2VyIiwiZXhwIjoxNzIwMDMzNjM5LjE1ODg4OCwiaWF0IjoxNzE5OTQ3MjM5LjE1ODg4OCwianRpIjoiQTBzNEJwTnJWU1liNkNIdDljTjRoemM2T21uVlVnIiwiY2lkIjoiWTh1d0hxYVNuNGpoa1FsU3FoTF9BdyIsImxpZCI6InQyX3h3aDhxOHE0dyIsImFpZCI6InQyX3h3aDhxOHE0dyIsImxjYSI6MTcxMjYwODAzMjA4NCwic2NwIjoiZUp5S1Z0SlNpZ1VFQUFEX193TnpBU2MiLCJmbG8iOjl9.i0mWaMhgtTcS2nlx-TbTlyFOe6xdYTYVeAyMYnxPeElM11hlgA-DfaVY6EfcAHUTe2fk63g1zySmH_1kVIrvqlr_SzRDt9ZgV7hXNbolLjVxDIuLGXYNtr7-KIaqA5XF1dfHoliK1Rj9BhnA8itD-rwHafokkg8gpkWzx40urhLjNpmWoclh3B5LcMfJpQcMB-X8pHNSbOWxHK47_w8Av3d5aErzP_mNa2JdPni9KQY-Quy8R4lpB9hZJFBWNDjeLPb4ErobF9eXiyFhhdxhgIKPC8Bdr-Vm_H4YqRXwDlY-t3E_oayQ4ga-oUgageXIBl9nYjUACL7DJM2pJYctjg',
 'token_type': 'bearer',
 'expires_in': 86400,
 'scope': '*'}
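If the credentials are wrong, the response will not contain an `access_token` key, and indexing it directly would fail with a confusing `KeyError`. A small defensive sketch follows; the helper name `extract_token` and both payloads are made up for illustration, and the exact shape of Reddit's error responses is an assumption:

```python
def extract_token(payload):
    """Return the access token, or raise if the payload has none."""
    if "access_token" not in payload:
        raise ValueError(f"No access token in response: {payload}")
    return payload["access_token"]

# Made-up payloads for illustration
ok_payload = {"access_token": "abc123", "token_type": "bearer",
              "expires_in": 86400, "scope": "*"}
bad_payload = {"error": "invalid_grant"}

print(extract_token(ok_payload))

try:
    extract_token(bad_payload)
except ValueError as error:
    print(error)
```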

Double-check: how long does the token last in hours?

In [6]:
86400/(60*60)

24.0
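The same check can be done with `datetime`, which also lets us estimate when the token stops working. This is a sketch that assumes the token was issued at the moment the cell runs; the variable names are my own:

```python
from datetime import datetime, timedelta, timezone

expires_in = 86400  # seconds, copied from the response above

lifetime = timedelta(seconds=expires_in)
print(lifetime)  # 1 day, 0:00:00

# Assumption: treat "now" as the moment the token was issued
issued_at = datetime.now(timezone.utc)
expires_at = issued_at + lifetime
print(expires_at.isoformat())
```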

Let's save our token:

In [8]:
my_token = response.json()["access_token"]

From now on, all my requests need to be accompanied by these HTTP headers:

In [11]:
headers = {"Authorization": f"bearer {my_token}", 
           "User-Agent": f"LSE DS105W API practice by {credentials['reddit_username']}"}

### 1.3 Send our first request with the token

In [13]:
BASE_ENDPOINT = "https://oauth.reddit.com"

response = requests.get(f"{BASE_ENDPOINT}/top?limit=100", headers=headers)

response.status_code

200

In [14]:
response.json()

{'kind': 'Listing',
 'data': {'after': None,
  'dist': 16,
  'modhash': None,
  'geo_filter': '',
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'math',
     'selftext': '',
     'author_fullname': 't2_agjaq',
     'saved': False,
     'mod_reason_title': None,
     'gilded': 0,
     'clicked': False,
     'title': 'Amateur Mathematicians Find Fifth ‘Busy Beaver’ Turing Machine | Quanta Magazine - Ben Brubaker - Computability | After decades of uncertainty, a motley team of programmers has proved precisely how complicated simple computer programs can get',
     'link_flair_richtext': [],
     'subreddit_name_prefixed': 'r/math',
     'hidden': False,
     'pwls': 6,
     'link_flair_css_class': None,
     'downs': 0,
     'thumbnail_height': 73,
     'top_awarded_type': None,
     'hide_score': False,
     'name': 't3_1dtmzyb',
     'quarantine': False,
     'link_flair_text_color': 'dark',
     'upvote_ratio': 1.0,
     'author_flair_backgrou

In [23]:
#Obtain the data from the response
data_page01 = response.json()

#Children is where the data actually is
data_page01["data"]["children"][1]

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'datascience',
  'selftext': 'I’m currently &lt;12 months into my role as a senior data scientist at my company, where I work with a small cross-functional team of seven developers (front end, backend, infra) I’m collaborating with another data scientist who is personal friends with my manager. However, I’ve been facing some challenges that I hope to get advice on.\n\nThe other data scientist in my team spends most of his time reading and posting academic papers on SOTA models (most are shit and irrelevant that generates 0 business value in our use case) onto the group chat and disappears for most of the day, but my manager buys into it bc it is SOTA. While he constantly suggests building out these models, he does not code or contribute to the development work. This behavior significantly increases my workload, as I cannot delegate these tasks to anyone else due to our small team size.\n\nI’ve tried addressing thi
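Each element of `children` wraps the actual post inside its own `data` key, so getting a tidy table means going one level deeper for every post. Here is a minimal sketch on a tiny made-up listing with the same nesting as the response above; the function name `extract_posts` and the choice of fields are mine:

```python
# A tiny made-up listing with the same nesting as the Reddit response
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_example",
        "children": [
            {"kind": "t3", "data": {"subreddit": "math",
                                    "title": "First post",
                                    "upvote_ratio": 1.0}},
            {"kind": "t3", "data": {"subreddit": "datascience",
                                    "title": "Second post",
                                    "upvote_ratio": 0.9}},
        ],
    },
}

def extract_posts(listing, fields=("subreddit", "title", "upvote_ratio")):
    """Flatten a listing into one small dict per post."""
    return [
        {field: child["data"].get(field) for field in fields}
        for child in listing["data"]["children"]
    ]

posts = extract_posts(sample_listing)
print(posts[1]["subreddit"])  # datascience
```

A list of dictionaries like `posts` can be passed straight to `pd.DataFrame(posts)` to get one row per post.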

From reading the [Reddit API documentation](https://www.reddit.com/dev/api), I learned that we have to use the `after` parameter to paginate. That is, after sending a request, I have to retrieve the `after` value from the response and use it in the next request.

In [24]:
after_id = data_page01["data"]["after"]

Now, I can send a new request to get the next 100 top posts:

In [26]:
data_page02 = requests.get(f"{BASE_ENDPOINT}/top?limit=100&after={after_id}", headers=headers).json()
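One caveat: `after` can be `None` (as far as I can tell, that is how the API signals that there are no more pages), and pasting it into an f-string would then send the literal text `after=None`. The sketch below builds the URL with `urllib.parse.urlencode` and leaves the parameter out in that case. The helper name `build_top_url` and the example ID are hypothetical:

```python
from urllib.parse import urlencode

BASE_ENDPOINT = "https://oauth.reddit.com"

def build_top_url(limit=100, after=None):
    """Build the /top URL, adding `after` only when we have one."""
    params = {"limit": limit}
    if after is not None:
        params["after"] = after
    return f"{BASE_ENDPOINT}/top?{urlencode(params)}"

print(build_top_url())
# https://oauth.reddit.com/top?limit=100
print(build_top_url(after="t3_example"))
# https://oauth.reddit.com/top?limit=100&after=t3_example
```

Alternatively, `requests.get` accepts a `params=` dictionary and builds the query string for you.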

I could continue doing this, accessing the next 100 posts from each subsequent page. As mentioned in the past, creating these variables by hand (`data_page02`, `data_page03`, ...) is a very bad practice. We should use a loop instead, ideally combined with a custom function.
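To give you an idea of what that loop could look like, here is a sketch that keeps following `after` until it runs out or a page limit is reached. So that it runs without touching the API, the request is replaced by a stand-in function and made-up pages; `collect_pages`, `fake_fetch` and `FAKE_PAGES` are all names invented for this example:

```python
def collect_pages(fetch_page, max_pages=5):
    """Follow `after` values until they run out or max_pages is reached."""
    pages = []
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)
        pages.append(page)
        after = page["data"]["after"]
        if after is None:  # assumed to mean there is nothing left
            break
    return pages

# Stand-in for the real request so the sketch runs offline
FAKE_PAGES = {
    None: {"data": {"after": "t3_a", "children": ["post 1", "post 2"]}},
    "t3_a": {"data": {"after": "t3_b", "children": ["post 3"]}},
    "t3_b": {"data": {"after": None, "children": ["post 4"]}},
}

def fake_fetch(after):
    """Return the made-up page that follows the given `after` value."""
    return FAKE_PAGES[after]

pages = collect_pages(fake_fetch)
print(len(pages))  # 3
```

With the real API, `fetch_page` would wrap the `requests.get(...).json()` call used above. The `max_pages` cap is there so a mistake cannot turn into an endless stream of requests.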

But let's stop here! Let me save those two JSON files and send them to you so that you can use them in the next section.

In [27]:
os.makedirs("w08_reddit_data", exist_ok=True)

with open("w08_reddit_data/page01.json", "w") as file:
    json.dump(data_page01, file)

with open("w08_reddit_data/page02.json", "w") as file:
    json.dump(data_page02, file)
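Reading a saved page back is the mirror image: `json.load` instead of `json.dump`. The sketch below writes a made-up page to a temporary folder first so that it is self-contained; with the real files you would open `w08_reddit_data/page01.json` directly:

```python
import json
import os
import tempfile

# Made-up page, written to a temporary folder so the sketch is self-contained
sample_page = {"kind": "Listing", "data": {"after": None, "children": []}}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "page01.json")

    with open(path, "w") as file:
        json.dump(sample_page, file)

    with open(path, "r") as file:
        loaded_page = json.load(file)

print(loaded_page == sample_page)  # True
```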