# Week 2A: Accessing data from Reddit

## Overview

Reddit is an online bulletin board system that hosts user-generated content, e.g., text, image, video, and audio `posts`. It is organized into `subreddits`, communities that each serve as a bulletin board for a specific topic or a specific group of people. Users can `comment` on posts, and other users can `upvote` or `downvote` posts and comments. Each subreddit is run by `moderators` who try to enforce community rules on the posts and discussion therein.

## APIs

There are two APIs that are widely used to collect data from Reddit:
- Reddit API (https://www.reddit.com/dev/api/) - This is the most detailed API, with endpoints that let us find almost anything on Reddit. There is a Python wrapper, `praw`, that helps us access this API (https://praw.readthedocs.io/en/stable/index.html).
- Pushshift API (https://www.reddit.com/r/pushshift/comments/bcxguf/new_to_pushshift_read_this_faq/) - Pushshift is a big-data storage and analytics project that provides access to Reddit data, albeit with some delay for specific content (e.g., edits to comments might not be reflected instantly). It lets us process data on the server before retrieving it, e.g., counting comments by specific users. With the raw Reddit API we would need to download the data and do this locally; the Pushshift API spares us that step. There is a Python wrapper, `psaw`, that helps us access Reddit data easily using the Pushshift API (https://pypi.org/project/psaw/).

In this exercise, we will use `psaw`. 

In [None]:
%%capture
# We call the Python package manager pip to install the psaw package
# (advanced) Prefixing a command with ! in an IPython notebook runs it in the system shell rather than in Python; try running !ls in a new cell
!pip install --upgrade psaw

### Pushshift API

There are two ways to access this API 
- Plain API through https://api.pushshift.io/.  
- (advanced) Elasticsearch search engine through https://elastic.pushshift.io/. This search engine is designed for fast aggregation and querying of big data.

The full API is documented here https://github.com/pushshift/api. However, in this exercise, we will learn how to use the plain API through https://api.pushshift.io/ and `psaw`. 

### Querying Pushshift API manually

To use the Pushshift API, we need to know the endpoints that are accessible. Each endpoint serves a specific purpose. There are two endpoints available for this API:
- `/reddit/search/comment` to search for comments
- `/reddit/search/submission` to search for posts

Thus, for example, if we need to search for comments, they can be accessed via https://api.pushshift.io/reddit/search/comment. 

Once we have the correct address, we need a query to search the database. Any parameters sent to the API are appended to the URL after a `?`. For example, if we need to look up submissions that contain the word "science", our query will look like:

```
https://api.pushshift.io/reddit/search/submission/?q=science
```

If we click on the above link or copy and paste it into any browser, we will see a JSON response from this endpoint giving us the 25 (by default) most recent posts containing the word "science". Each post is a set of key-value pairs. An example of a single post in the response is as follows:

        {
            "all_awardings": [],
            "allow_live_comments": false,
            "author": "Own_Professional_190",
            "author_flair_css_class": null,
            "author_flair_richtext": [],
            "author_flair_text": null,
            "author_flair_type": "text",
            "author_fullname": "t2_jaz08fd5",
            "author_is_blocked": false,
            "author_patreon_flair": false,
            "author_premium": false,
            "awarders": [],
            "can_mod_post": false,
            "contest_mode": false,
            "created_utc": 1644227451,
            "domain": "self.UToledo",
            "full_link": "https://www.reddit.com/r/UToledo/comments/smmgy3/questions_about_transferring/",
            "gildings": {},
            "id": "smmgy3",
            "is_created_from_ads_ui": false,
            "is_crosspostable": true,
            "is_meta": false,
            "is_original_content": false,
            "is_reddit_media_domain": false,
            "is_robot_indexable": true,
            "is_self": true,
            "is_video": false,
            "link_flair_background_color": "",
            "link_flair_richtext": [],
            "link_flair_text_color": "dark",
            "link_flair_type": "text",
            "locked": false,
            "media_only": false,
            "no_follow": true,
            "num_comments": 0,
            "num_crossposts": 0,
            "over_18": false,
            "permalink": "/r/UToledo/comments/smmgy3/questions_about_transferring/",
            "pinned": false,
            "retrieved_on": 1644227461,
            "score": 1,
            "selftext": "Hi, \n\nI am an international student looking for universities to transfer. I heard that Toledo is known for its engineering program and co-op program. I am interested in computer science and data science. Can anybody tell me about general thoughts on university life at Toledo? It can include anything - reputation, class experience, dorm life, life outside of university, and so on. \n\nThank you in advance and stay safe :D",
            "send_replies": true,
            "spoiler": false,
            "stickied": false,
            "subreddit": "UToledo",
            "subreddit_id": "t5_2wpwg",
            "subreddit_subscribers": 182,
            "subreddit_type": "public",
            "thumbnail": "self",
            "title": "Questions about transferring",
            "total_awards_received": 0,
            "treatment_tags": [],
            "upvote_ratio": 1.0,
            "url": "https://www.reddit.com/r/UToledo/comments/smmgy3/questions_about_transferring/"
        }

There are many ways to query this API. Head over to https://github.com/pushshift/api#search-parameters-for-comments to check the parameters that you can pass to the API to refine your queries.
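
The query-string pattern described above can be sketched in Python as follows. This is only an illustration of how such URLs are assembled, not part of `psaw`: the helper name `build_search_url` is made up for this example, and the `subreddit` and `size` parameter names are assumptions that should be checked against the Pushshift documentation. The code only builds the URL and does not contact the API.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io"  # base address used in the text above


def build_search_url(endpoint, **params):
    """Return the full URL for an endpoint plus its query parameters.

    Hypothetical helper for illustration only.
    """
    # Everything after the "?" is the query string: key=value pairs joined by "&".
    return f"{BASE_URL}{endpoint}/?{urlencode(params)}"


# "subreddit" and "size" are assumed parameter names; check the documentation.
url = build_search_url(
    "/reddit/search/submission", q="science", subreddit="askscience", size=10
)
print(url)
```

Pasting the printed URL into a browser should behave like the manual query shown earlier, provided the assumed parameters are accepted by the endpoint.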


### Understanding the JSON response

Here are a few keys returned by the API and what they mean. Most of them are self-explanatory, and which ones are needed will depend heavily on the specific use cases. 

----

| **Key**        | **Description**                                                                    |
|----------------|------------------------------------------------------------------------------------|
| _url_          | URL of the `post` or `comment`                                                     |
| _author_       | username of the redditor who created this `post` or `comment`                      |
| _created_utc_  | time in UTC (seconds since the Unix epoch) when this `post` or `comment` was created |
| _subreddit_    | `subreddit` on which this `post` or `comment` was created                          |
| _title_        | title of the `post`                                                                |
| _selftext_     | text content of the `post` (comments store their text under a different key)       |
| _retrieved_on_ | time in UTC when this data was retrieved by Pushshift                              |

----
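
As a small illustration of reading these keys, the sketch below parses a trimmed copy of the example response shown earlier (most keys omitted) and converts the two timestamps, assuming they are seconds since the Unix epoch. It uses only the standard library and does not contact the API.

```python
import json
from datetime import datetime, timezone

# A trimmed copy of the example response shown earlier (most keys omitted).
raw = """
{
    "author": "Own_Professional_190",
    "created_utc": 1644227451,
    "retrieved_on": 1644227461,
    "subreddit": "UToledo",
    "title": "Questions about transferring"
}
"""

post = json.loads(raw)  # JSON object -> Python dictionary

# Interpret the timestamp as seconds since the Unix epoch, in UTC.
created = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)

# How long after creation the post was retrieved, in seconds.
delay = post["retrieved_on"] - post["created_utc"]

print(post["subreddit"], "|", post["title"])
print("created:", created.isoformat())
print("seconds between creation and retrieval:", delay)
```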

### Querying Pushshift API using Python


There are several parameters that can be passed to this search query. We will work through some of those parameters in this notebook. From here on, we will use the `psaw` wrapper.

In [None]:
import pandas as pd
from psaw import PushshiftAPI

# This command instantiates an object whose methods will be used throughout the exercise:
api = PushshiftAPI()

The `PushshiftAPI` instance has two main methods:
1. `search_submissions` to query the `/reddit/search/submission` endpoint
2. `search_comments` to query the `/reddit/search/comment` endpoint


There are lots of parameters that the above two methods can take. You can check them out here - https://pushshift.io/api-parameters/. Which ones you need depends on what you want to do with these APIs, at which point it is mostly a matter of reading the documentation. We will be using some common parameters in the exercises that follow.

<div class="alert alert-info">

**Exercise 0.1:** We are going to start by collecting the 50 most recent posts from the subreddit Ask Me Anything (IAmA; https://www.reddit.com/r/IAmA/) with more than 1,000 upvotes:
</div>

In [None]:
posts = api.search_submissions(subreddit='IAmA', score=">1000", limit=50)

In [None]:
type(posts) # what is the type of the above object

generator

The API returns a generator object, i.e., an iterator that yields the results one at a time. We can only access its elements sequentially through iteration; we cannot index into the generator, e.g., with `posts[0]`.
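
The behaviour of generators can be tried out without calling the API. In the sketch below, `fake_posts` is a hypothetical stand-in for the object returned by `psaw`; it only illustrates that a generator cannot be indexed and can be consumed once.

```python
from itertools import islice


def fake_posts(n):
    """Hypothetical stand-in for the generator returned by the API."""
    for i in range(n):
        yield {"id": i, "title": f"post {i}"}


posts_demo = fake_posts(5)

# Indexing a generator raises a TypeError.
try:
    posts_demo[0]
    indexable = True
except TypeError:
    indexable = False

first_two = list(islice(posts_demo, 2))  # consume the first two items
remaining = list(posts_demo)             # consume whatever is left
leftover = list(posts_demo)              # the generator is now exhausted

print(indexable, len(first_two), len(remaining), len(leftover))
# prints: False 2 3 0
```

This is why the results are usually copied into a list or a DataFrame right away: once the generator has been iterated over, its items are gone.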

<div class="alert alert-info">

**Exercise 0.2:** Make a pandas dataframe of the entries returned by the API
</div>

The code below is a list comprehension that loops through the generator and extracts relevant data for each matching Reddit post. It then turns that list into a Pandas DataFrame.

Note: Each element of `posts` is of type `psaw.PushshiftAPI.submission`, which is a special object. This object provides an attribute `d_` that holds a Python dictionary, giving easier access to all collected attributes. We will use this attribute to build a better representation for our purpose.
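
To see the `d_` pattern in isolation, here is a rough sketch that uses made-up stand-in objects instead of real `psaw` results. The only assumption carried over from the note above is that each result exposes its data as a dictionary under `d_`; the authors, scores, and titles are invented.

```python
from types import SimpleNamespace

import pandas as pd

# Hypothetical stand-ins for psaw result objects: each exposes a dict as `d_`.
fake_results = [
    SimpleNamespace(d_={"author": "alice", "score": 1200, "title": "First post"}),
    SimpleNamespace(d_={"author": "bob", "score": 3400, "title": "Second post"}),
]

# One dictionary per row; the dictionary keys become the column names.
df_demo = pd.DataFrame([r.d_ for r in fake_results])

print(df_demo.shape)
print(list(df_demo.columns))
```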

In [None]:
df_posts = pd.DataFrame([p.d_ for p in posts]) # We iterate over the generator



<div class="alert alert-info">

**Exercise 0.3:** Now check for yourself the following - 

1. Number of rows and columns in the resulting dataframe 
2. The list of fieldnames that are returned by the API
3. Look at 10 random rows of data only for the columns "author", "subreddit", "title", and the upvote "score".

</div>

In [None]:
print("Shape (nb of rows, nb of columns):", df_posts.shape)

Shape (nb of rows, nb of columns): (50, 73)


In [None]:
# Which attributes do we now have access to?
df_posts.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id',
       'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked',
       'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'post_hint',
       'preview', 'pwls', 'retrieved_on', 'score', 'selftext', 'send_replies',
       'spoiler', 'stickied', 'subreddit', 'subre

In [None]:
df_posts[['author', 'subreddit', 'title', 'score']].sample(10)

Unnamed: 0,author,subreddit,title,score
17,mister4string,IAmA,"On the first night of Christmas, a stranger ga...",10490
2,PerthNerdTherapist,IAmA,I am a full time nerd therapist! I run Dungeon...,1567
4,EmDawgrzz,IAmA,I am a 19 year old gal making a living by work...,1885
28,iamthatis,IAmA,"I'm Christian Selig, I used to work at Apple a...",2674
47,ReviewMeta,IAmA,"I'm Tommy, I built ReviewMeta - a site that de...",19338
42,meigom,IAmA,"My name is Meigo Märk and I walked 20,000 kilo...",11523
15,Rick_Smith_Axon,IAmA,"I am Rick Smith, the founder and CEO of Axon E...",23850
14,MainlyMozartSD,IAmA,I'm the Principal Bass of the San Francisco Sy...,6719
13,PhilipRosedale,IAmA,"I am Philip Rosedale, founder of Second Life a...",2665
26,iamthatis,IAmA,[Update on yesterday's Apollo SPCA Fundraiser]...,12560


### Exercise 1: Extracting texts from Reddit posts

<div class="alert alert-info">

**Exercise 1.1:** Let's now collect submissions related to the Oxford Internet Institute. 
- search for the exact keyword "Oxford Internet Institute", and select only posts with at least 10 upvotes
- make a relevant dataframe where each row consists of a single entry returned by the API. 
- Check the number of rows and columns in the resulting dataframe
</div>

You will have to check the parameters and their description here - https://github.com/pushshift/api#search-parameters-for-submissions. For example (make sure you double check the corresponding entries in the documentation),
- `q` takes in the query, i.e., the keyword that you search for in the posts
- `score` takes in a string to constrain the score range of the posts



In [None]:
posts = # YOUR CODE HERE
df_posts = pd.DataFrame([p.d_ for p in posts])


<div class="alert alert-info">

**Exercise 1.2:** Let's look at some of the columns 
- have a look at the titles, number of comments, and the date
- Can you interpret the date column? Head over to https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html and use `pd.to_datetime` for conversion of this column to an appropriate human-readable format. 

</div>

In [None]:
# YOUR CODE HERE 

In [None]:
# Reminder: this is all the columns you can access:
# df_posts.columns


<div class="alert alert-info">

**Exercise 1.3:** How would you list the subreddit of posts mentioning the OII?

</div>


In [None]:
list_subreddits = # YOUR CODE HERE 
print(list_subreddits)


<div class="alert alert-info">

**Exercise 1.4:** How would you list the usernames mentioning the OII?

</div>


In [None]:
list_usernames = # YOUR CODE HERE 
print(list_usernames)


<div class="alert alert-info">

**Exercise 1.5:** Collect 100 posts related to ```policy``` regulation from the subreddits ```climatechange``` and ```datascience```. For this, modify the query below and the subreddit list:

</div>

In [None]:
query = "policy"
subreddit = "climatechange,datascience"

posts = # YOUR CODE HERE 
df_posts = pd.DataFrame([p.d_ for p in posts])
df_posts['created_utc'] = # YOUR CODE HERE 

df_posts[['title', 'num_comments', 'created_utc']].sample(10)

### Exercise 2: Accessing Comments in Reddit

Comments are accessible via the psaw method `api.search_comments`. For the list of accepted parameters, head over to https://github.com/pushshift/api#search-parameters-for-comments.

<div class="alert alert-info">

**Exercise 2.1:** Search for comments containing the word "sociology" with a `score` greater than 1000. Afterwards, do the following - 
- Make a dataframe of the entries returned by the query
- Convert the `created_utc` column to a more human-readable datetime format
</div>



In [None]:
comments = # YOUR CODE HERE 
comments = pd.DataFrame([c.d_ for c in comments])
comments['created_utc'] = # YOUR CODE HERE 

<div class="alert alert-info">

**Exercise 2.2:** Let's look at the content of the comments
- Check which key of the entries contains the content of the comments
- Have a look at the score, text and subreddit of the comments we collected
</div>


In [None]:
comments[[
    # YOUR CODE HERE: Comma separated list of columns 
    ]].sample(10)

To search for multiple phrases at once, such as comments that mention sociology AND internet, we can use parentheses and the AND operator `&` in the query string: `query = "(sociology) & (internet)"`

<div class="alert alert-info">

**Exercise 2.3:** Let's look at the comments that contain the above query
- make a dataframe of the entries returned by the above query
- convert the `created_utc` column to a human-readable format
- Sort the resulting dataframe by descending score; check out https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
- have a look at the score, text and subreddit of the comments we collected
</div>


In [None]:
query = "(sociology) & (internet)"

comments = # YOUR CODE HERE
comments = pd.DataFrame([c.d_ for c in comments])
comments['created_utc'] = # YOUR CODE HERE

In [None]:
# Let's sort comments by decreasing score:
comments = # YOUR CODE HERE

In [None]:
# The top 10 comments:
# YOUR CODE HERE

### Exercise 3: Accessing User data in Reddit

<div class="alert alert-info">

**Exercise 3.1:** Find comments made by the user `nasa`; limit the query to 1000 entries. 
- Make a dataframe of the entries returned by the API
- Convert `created_utc` to a human-readable format
- Count the number of comments that `nasa` made on each subreddit. Check out https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html to do so.
</div>

In [None]:
user_comments = # YOUR CODE HERE
df_user_comments = pd.DataFrame([c.d_ for c in user_comments])
df_user_comments['created_utc'] = # YOUR CODE HERE


In [None]:
df_user_comments[['subreddit', 'created_utc']].sample(10)

In [None]:
# YOUR CODE HERE to count the number of comments per subreddit

## Homework: Subreddits of users that post about OII


<div class="alert alert-info">

- Write a function `subreddits_of_user` that takes a string input `username` and returns a `list` of the unique subreddits on which that `username` comments. 
- Find all posts with score `>10` that contain the keyword "Oxford Internet Institute" 
- Find the list of unique authors in the entries returned above
- Call `subreddits_of_user` on each of the authors found above
</div>



In [None]:
def subreddits_of_user(username):
    user_comments = # YOUR CODE HERE
    df_user_comments = pd.DataFrame([c.d_ for c in user_comments])
    
    # YOUR CODE HERE
    # return list of unique subreddits

In [None]:
posts_oii =  # YOUR CODE HERE
df_posts_oii = pd.DataFrame([p.d_ for p in posts_oii])

list_of_usernames =  # YOUR CODE HERE: list of unique author names

for user in list_of_usernames:
    print(user)
    print(# YOUR CODE HERE)
