# 🗓️ Week 08: Data summarisation and more grammar-of-graphics

In this week’s lecture, we’re going to explore a different approach to collecting data from the web. Instead of scraping data from a single page, we’ll connect to something called an API. This will enable us to gather data from a website in a more organised manner, usually in JSON or XML formats.

Additionally, we’ll dive into a theoretical framework for data visualisation known as the grammar of graphics. We’ll use the `plotnine` library in Python to apply this framework and create more effective and engaging visualisations. Looking forward to an exciting week!

## Setup

In [22]:
import os
import json
import requests

import pandas as pd
import matplotlib.pyplot as plt

from pprint import pprint
from scrapy import Selector
from tqdm.notebook import tqdm

## 1. Collecting data from an API

So far we have spent a reasonable amount of time gathering data via web scraping. We send requests to a web server, pretending to be a browser operating at the request of a human user, and then we reverse engineer the HTML code to extract the information we need. 

However, web scraping is not the only way to obtain data from the web; sometimes, it isn't even the best way!

Many websites, platforms and organisations offer **direct access** to their data via an API (Application Programming Interface). I will illustrate the process using the Reddit API.

### 1.1 First, an IMPORTANT detour: read credentials

To use the API, we need to provide several sensitive pieces of information: 

- Your Reddit username (do you want people to know it?)
- Your Reddit password, in plain text (do you want people to know it?)
- Your Reddit app's client ID (do you want anyone to send requests on your behalf?)
- Your Reddit app's client secret (do you want anyone to send requests on your behalf?)

If I leave this information in the notebook (on GitHub, especially), anyone reading it can impersonate me and send requests to Reddit on my behalf: a **serious security risk**.

**Never leave your credentials anywhere in your GitHub repository or notebook.**

In [39]:
credentials_file_path = "./credentials.json"

#open the file and load the data into a variable
with open(credentials_file_path, "r") as file:
    credentials = json.load(file)
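
As an optional hardening of the cell above, the sketch below wraps the loading step in a function that fails early when a key is missing. The four key names are the ones this notebook reads from `credentials.json` later on; the function name `load_credentials` and the constant `REQUIRED_KEYS` are illustrative choices, not part of any library.

```python
import json

# Keys this notebook reads from credentials.json in later cells
REQUIRED_KEYS = {"app_client_id", "app_client_secret",
                 "reddit_username", "reddit_password"}


def load_credentials(path):
    """Load a JSON credentials file and check that no expected key is missing."""
    with open(path, "r") as file:
        credentials = json.load(file)

    # Set difference: whatever is required but absent from the file
    missing = REQUIRED_KEYS - set(credentials)
    if missing:
        raise KeyError(f"Missing keys in {path}: {sorted(missing)}")
    return credentials
```

Calling `credentials = load_credentials("./credentials.json")` would then stand in for the cell above. Listing `credentials.json` in the repository's `.gitignore` keeps Git from picking the file up in the first place.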

Okay, now that I've got that out of the way, let's access the API!

### 1.2 Obtain a Token

The Reddit API requires, probably for security reasons, that we obtain an **access token** every time we access it via a script. We first send a request containing our credentials to a specific **API endpoint** and get back a token: a string of characters that we can use to access the API for a limited amount of time.

In [40]:
#We will use the requests library, only this time we have to set up authentication parameters first
client_auth = requests.auth.HTTPBasicAuth(credentials["app_client_id"], credentials["app_client_secret"])

#You also need to send, via HTTP POST, your Reddit username and password
post_data = {"grant_type": "password", "username": credentials["reddit_username"], "password": credentials["reddit_password"]}

#Just like Wikimedia, the Reddit API also asks that we identify ourselves in the User-Agent
headers = {"User-Agent": f"LSE DS105W API practice by {credentials['reddit_username']}"}
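
For the curious: HTTP Basic authentication, which `HTTPBasicAuth` implements, does not send the client ID and secret as they are. It joins them as `id:secret`, Base64-encodes the result and sends it in an `Authorization: Basic ...` header. The sketch below rebuilds that header by hand with the standard library. The two credential values are made up, and the helper name `basic_auth_header` is only for illustration.

```python
import base64


def basic_auth_header(client_id, client_secret):
    """Build the value of an HTTP Basic 'Authorization' header by hand."""
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Made-up values, NOT real credentials
header_value = basic_auth_header("my_client_id", "my_client_secret")
print(header_value)

# Base64 is an encoding, not encryption: anyone can reverse it
decoded = base64.b64decode(header_value.split(" ")[1]).decode("utf-8")
print(decoded)
```

Because the encoding is trivially reversible, the header is only protected in transit by HTTPS, which is one more reason to keep the credentials out of anything you share.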

**Actually send the request**

In [41]:
#From their documentation, I learned this is the endpoint I need
ACCESS_TOKEN_ENDPOINT = "https://www.reddit.com/api/v1/access_token"

#This time we are sending an HTTP POST request instead of an HTTP GET
response = requests.post(ACCESS_TOKEN_ENDPOINT, auth=client_auth, data=post_data, headers=headers)
response.json()

{'access_token': 'eyJhbGciOiJSUzI1NiIsImtpZCI6IlNIQTI1NjpzS3dsMnlsV0VtMjVmcXhwTU40cWY4MXE2OWFFdWFyMnpLMUdhVGxjdWNZIiwidHlwIjoiSldUIn0.eyJzdWIiOiJ1c2VyIiwiZXhwIjoxNzIwMTMzOTAwLjMzNjgwMywiaWF0IjoxNzIwMDQ3NTAwLjMzNjgwMiwianRpIjoiWVgtb1Z6c3dXRjVNSFBvWGhQV3F6ZnQtbGs3bDdBIiwiY2lkIjoiWTh1d0hxYVNuNGpoa1FsU3FoTF9BdyIsImxpZCI6InQyX3h3aDhxOHE0dyIsImFpZCI6InQyX3h3aDhxOHE0dyIsImxjYSI6MTcxMjYwODAzMjA4NCwic2NwIjoiZUp5S1Z0SlNpZ1VFQUFEX193TnpBU2MiLCJmbG8iOjl9.G0H9NRu5JGYNEO8kIXTQSZfNWLsO1lZNBPtEB72EFnFvX5K1pJNFmd7rKxRqXuA6Ov7pnpBwHOhLSvR1ZpKVsehQQGcmN7sooTWadJr_V6I6oc4_oeo0YQbzzCBzqwCpi7ZaK1w_dqEwuAeR5p48N_m5hSfIjRbGCuUz7zHSBDqPDscnaNu4qdTtQl26PEwo62BbZzKu3v9ZlvSmVfmkdoYZ3chxSkbwMS-9-uMArB80tbPUTbYS-QGMlyfPK-1OFh91TcW0PgVfVuMu8mmQBWvegBtaE6yNsaR91Bp6KRccrwab67L8eh6Pb1dmFsKVoI84YATYJdsbnK6_zysIwg',
 'token_type': 'bearer',
 'expires_in': 86400,
 'scope': '*'}

Double-check: how long does the token last in hours?

In [26]:
86400/(60*60)

24.0
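
The same conversion can be done with the standard library's `datetime` module, which also gives the moment the token stops working. This is a minimal sketch: `expires_in` is the value from the response above, while `issued_at` is a made-up timestamp chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

expires_in = 86400  # seconds, as reported in the token response

lifetime = timedelta(seconds=expires_in)
hours = lifetime / timedelta(hours=1)  # dividing two timedeltas gives a float
print(hours)

# Hypothetical issue time; in practice, record datetime.now(timezone.utc)
# right after the token request succeeds
issued_at = datetime(2024, 7, 3, 12, 0, tzinfo=timezone.utc)
expires_at = issued_at + lifetime
print(expires_at.isoformat())
```

Keeping `expires_at` around lets a longer script compare it with the current time and request a fresh token before the old one runs out.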

Let's save our token:

In [42]:
my_token = response.json()["access_token"]

From now on, all my requests need to be accompanied by these HTTP headers:

In [43]:
headers = {"Authorization": f"bearer {my_token}", 
           "User-Agent": f"LSE DS105W API practice by {credentials['reddit_username']}"}

### 1.3 Send our first request with the token

In [57]:
BASE_ENDPOINT = "https://oauth.reddit.com"

response = requests.get(f"{BASE_ENDPOINT}/top?limit=100", headers=headers)

response.status_code

200

In [58]:
response.json()

{'kind': 'Listing',
 'data': {'after': None,
  'dist': 15,
  'modhash': None,
  'geo_filter': '',
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'math',
     'selftext': '',
     'author_fullname': 't2_iwtba27g',
     'saved': False,
     'mod_reason_title': None,
     'gilded': 0,
     'clicked': False,
     'title': '“If they know my work, they don’t need my C.V. If they need my C.V., they don’t know my work.” Grigory Perelman',
     'link_flair_richtext': [],
     'subreddit_name_prefixed': 'r/math',
     'hidden': False,
     'pwls': 6,
     'link_flair_css_class': None,
     'downs': 0,
     'thumbnail_height': 78,
     'top_awarded_type': None,
     'hide_score': False,
     'name': 't3_1duau50',
     'quarantine': False,
     'link_flair_text_color': 'dark',
     'upvote_ratio': 0.95,
     'author_flair_background_color': None,
     'subreddit_type': 'public',
     'ups': 503,
     'total_awards_received': 0,
     'media_embed': {},
     't

In [59]:
#Obtain the data from the response
data_page01 = response.json()

#'children' is where the data actually is
len(data_page01["data"]["children"])

15
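
To make the nesting explicit, here is a minimal sketch that walks a hand-built dictionary shaped like the response above. The `listing` variable is a heavily trimmed stand-in (two posts, four fields each, with values copied from the outputs in this notebook), not the real response.

```python
# Trimmed stand-in for response.json(): a Listing wraps a list of children,
# and each child wraps the actual post under its own "data" key
listing = {
    "kind": "Listing",
    "data": {
        "after": None,
        "dist": 2,
        "children": [
            {"kind": "t3",
             "data": {"id": "1duau50", "subreddit": "math", "ups": 503, "num_comments": 45}},
            {"kind": "t3",
             "data": {"id": "1due2e8", "subreddit": "math", "ups": 131, "num_comments": 12}},
        ],
    },
}

children = listing["data"]["children"]
print(len(children))

# The fields of each post live one level further down, under child["data"]
posts = [child["data"] for child in children]
for post in posts:
    print(post["id"], post["subreddit"], post["ups"])
```

Note the two different `data` keys: the outer one belongs to the listing, the inner one to each individual post.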

From reading the [Reddit API documentation](https://www.reddit.com/dev/api), I learned that we have to use the `after` parameter to paginate. That is, after sending a request, I have to retrieve the `after` value from the response and use it in the next request.

In [60]:
after_id = data_page01["data"]["after"]

Now I can send a new request for the next page of up to 100 top posts:

In [61]:
data_page02 = requests.get(f"{BASE_ENDPOINT}/top?limit=100&after={after_id}", headers=headers).json()

I could continue doing this, accessing the next 100 posts from each subsequent page. As mentioned in the past, creating these variables by hand (`data_page02`, `data_page03`, ...) is very bad practice. We should use a loop instead, ideally combined with a custom function.
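
One way such a loop could look is sketched below. To keep it runnable without a network connection, the function does not call the API itself: it receives a `fetch_page` callable that takes an `after` value and returns one parsed page. The names `fetch_all_pages`, `fetch_page` and `FAKE_PAGES`, as well as the cursor values, are made up for this illustration.

```python
def fetch_all_pages(fetch_page, max_pages=10):
    """Collect listing pages by following the `after` cursor.

    `fetch_page` is any callable that takes an `after` value (None for the
    first page) and returns the parsed JSON of one listing page.
    """
    pages = []
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)
        pages.append(page)
        after = page["data"]["after"]
        if after is None:  # no cursor returned: this was the last page
            break
    return pages


# Stand-in for the API so the sketch runs offline: three fake pages
# chained together by made-up cursors
FAKE_PAGES = {
    None: {"data": {"after": "t3_aaa", "children": [{"data": {"id": "p1"}}]}},
    "t3_aaa": {"data": {"after": "t3_bbb", "children": [{"data": {"id": "p2"}}]}},
    "t3_bbb": {"data": {"after": None, "children": [{"data": {"id": "p3"}}]}},
}

pages = fetch_all_pages(lambda after: FAKE_PAGES[after])
print(len(pages))
```

With the variables from this notebook, `fetch_page` could be something like `lambda after: requests.get(f"{BASE_ENDPOINT}/top", params={"limit": 100, "after": after}, headers=headers).json()`; `requests` leaves out any parameter whose value is `None`, so the first call carries no `after` at all. The `max_pages` cap is a safety net against an endless loop, and the `after is None` check matters in practice: the first response shown above already came back with `'after': None`.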

But let's stop here! Let me save those two JSON files and send them to you so that you can use them in the next section.

In [62]:
os.makedirs("w08_reddit_data", exist_ok=True)

with open("w08_reddit_data/page01.json", "w") as file:
    json.dump(data_page01, file)

with open("w08_reddit_data/page02.json", "w") as file:
    json.dump(data_page02, file)
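
If there were more than two pages, the two `with` blocks above would be better written as a loop. The sketch below is one way to do it, together with the matching step of reading the files back; `save_pages` and `load_pages` are hypothetical helper names, and the `pageNN.json` naming simply mirrors the file names used above.

```python
import json
import os


def save_pages(pages, folder):
    """Write each page to folder/pageNN.json and return the file paths."""
    os.makedirs(folder, exist_ok=True)
    paths = []
    for number, page in enumerate(pages, start=1):
        # :02d pads the page number with a leading zero (page01, page02, ...)
        path = os.path.join(folder, f"page{number:02d}.json")
        with open(path, "w") as file:
            json.dump(page, file)
        paths.append(path)
    return paths


def load_pages(paths):
    """Read the JSON files back into a list of dictionaries."""
    pages = []
    for path in paths:
        with open(path, "r") as file:
            pages.append(json.load(file))
    return pages
```

Here, `save_pages([data_page01, data_page02], "w08_reddit_data")` would produce the same two files as the cell above.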

## 2. Summarising and plotting data

In [72]:
# Get the data out of the 'children' key-value pair first
df1 = pd.DataFrame.from_dict(data_page01["data"]["children"])
df2 = pd.DataFrame.from_dict(data_page02["data"]["children"])

#Focus on the 'data' column and unpack it using json_normalize
df1 = pd.json_normalize(df1["data"], max_level=1)
df2 = pd.json_normalize(df2["data"], max_level=1)

#Summarize the code above into a list comprehension
all_pages = [data_page01, data_page02]
all_dfs = [pd.DataFrame.from_dict(page["data"]["children"]) for page in all_pages] 
all_dfs = [pd.json_normalize(df["data"], max_level=1) for df in all_dfs]

df = pd.concat(all_dfs, ignore_index=True)

df.head()

| | approved_at_utc | subreddit | selftext | author_fullname | saved | mod_reason_title | gilded | clicked | title | link_flair_richtext | ... | url | subreddit_subscribers | created_utc | num_crossposts | media | is_video | preview.images | preview.enabled | media_metadata.seqy3ayqzaad1 | link_flair_template_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | math | | t2_iwtba27g | False | | 0 | False | “If they know my work, they don’t need my C.V.... | [] | ... | https://www.newyorker.com/magazine/2006/08/28/... | 2992487 | 1720002000.0 | 0 | | False | [{'source': {'url': 'https://external-preview.... | False | | |
| 1 | | math | My new paper on the continuum hypothesis is av... | t2_emauyjh0 | False | | 0 | False | A mathematical thought experiment, showing how... | [] | ... | https://www.reddit.com/r/math/comments/1due2e8... | 2992487 | 1720012000.0 | 1 | | False | | | {'status': 'valid', 'e': 'Image', 'm': 'image/... | |
| 2 | | datascience | I have been working as a Data Scientist at an ... | t2_bv171ji2 | False | | 0 | False | My current data scientist job is putting me in... | [] | ... | https://www.reddit.com/r/datascience/comments/... | 1813569 | 1720015000.0 | 0 | | False | | | | ea9e2296-0db0-11ef-bb4d-6e8d785fd493 |
| 3 | | datascience | I had a coffee chat with a director here at th... | t2_i69qgpqa | False | | 0 | False | Do you guys agree with the hate on Kmeans?? | [] | ... | https://www.reddit.com/r/datascience/comments/... | 1813569 | 1720018000.0 | 0 | | False | | | | 937a6f50-d780-11e7-826d-0ed1beddcc82 |
| 4 | | datascience | In another sub, I was reading about people wan... | t2_gajawaxj4 | False | | 0 | False | Ageism in the field | [] | ... | https://www.reddit.com/r/datascience/comments/... | 1813569 | 1720021000.0 | 0 | | False | | | | 4fad7108-d77d-11e7-b0c6-0ee69f155af2 |
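
Column names such as `preview.images` and `preview.enabled` come from `max_level=1`: `json_normalize` unpacks nested dictionaries one level deep and joins the keys with a dot. The sketch below shows the effect of different `max_level` values on a single made-up post (`toy_posts` is invented for this illustration and is much simpler than a real Reddit post).

```python
import pandas as pd

# Made-up post with two levels of nesting under "preview"
toy_posts = [
    {"id": "abc", "ups": 10,
     "preview": {"enabled": False,
                 "images": {"source": {"url": "https://example.com/img.png"}}}},
]

flat0 = pd.json_normalize(toy_posts, max_level=0)   # nothing unpacked
flat1 = pd.json_normalize(toy_posts, max_level=1)   # one level unpacked
flat_all = pd.json_normalize(toy_posts)             # every level unpacked

print(list(flat0.columns))
print(list(flat1.columns))
print(list(flat_all.columns))
```

With `max_level=0` the whole `preview` dictionary stays in a single cell, with `max_level=1` it splits into `preview.enabled` and `preview.images`, and without a limit the columns go all the way down to `preview.images.source.url`. In the real data, `preview.images` holds a list rather than a dictionary, and lists are kept as-is in one cell, which matches the output above.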


In [75]:
# Another way to obtain the same dataframe as above, using only pandas functions

df_reddit = (
    pd.concat([
        pd.json_normalize(data_page01["data"]["children"], max_level=0),
        pd.json_normalize(data_page02["data"]["children"], max_level=0)
    ])
)

df_reddit = pd.json_normalize(df_reddit["data"], max_level=1)

# Sort columns by name to make it easier for me to figure out what I want to keep
df_reddit = df_reddit.reindex(sorted(df_reddit.columns), axis=1)

selected_cols = ['id', 'title', 'subreddit', 'ups', 'num_comments'] 

df_reddit = df_reddit[selected_cols]

df_reddit.head(7)

| | id | title | subreddit | ups | num_comments |
|---|---|---|---|---|---|
| 0 | 1duau50 | “If they know my work, they don’t need my C.V.... | math | 503 | 45 |
| 1 | 1due2e8 | A mathematical thought experiment, showing how... | math | 131 | 12 |
| 2 | 1duf019 | My current data scientist job is putting me in... | datascience | 90 | 39 |
| 3 | 1dug1va | Do you guys agree with the hate on Kmeans?? | datascience | 54 | 61 |
| 4 | 1duh8lg | Ageism in the field | datascience | 29 | 29 |
| 5 | 1dubva0 | Finding the 6th busy beaver number (Σ(6), AKA ... | math | 35 | 2 |
| 6 | 1dumr9e | How do you guys take down lecture notes? | math | 21 | 20 |
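
As one possible first step towards the summaries this section is named after, the sketch below groups posts by subreddit and adds up their upvotes and comments. To stay self-contained it rebuilds only the seven rows displayed above by hand (without titles) in a dataframe called `posts`; on the real data the same `groupby` chain could be applied to `df_reddit`. The summary column names are my own choice.

```python
import pandas as pd

# The seven rows displayed above, typed in by hand (titles left out)
posts = pd.DataFrame({
    "id": ["1duau50", "1due2e8", "1duf019", "1dug1va", "1duh8lg", "1dubva0", "1dumr9e"],
    "subreddit": ["math", "math", "datascience", "datascience", "datascience", "math", "math"],
    "ups": [503, 131, 90, 54, 29, 35, 21],
    "num_comments": [45, 12, 39, 61, 29, 2, 20],
})

# Named aggregation: new_column=(source_column, aggregation_function)
summary = (
    posts.groupby("subreddit")
    .agg(n_posts=("id", "count"),
         total_ups=("ups", "sum"),
         total_comments=("num_comments", "sum"))
    .reset_index()
)
print(summary)
```

For these seven rows, r/math gathers far more upvotes (690 against 173), while r/datascience attracts more comments (129 against 79). That is the kind of contrast a plot can then make visible at a glance.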
