# Case Study: Scraping Reddit with Pushshift

This lesson walks you through the basics of scraping Reddit data using the popular Pushshift API. We'll cover some of the useful parameters at our disposal, then work through a number of exercises that put those parameters to use. Finally, we'll have some fun visualizing patterns in Reddit content.

## Reddit

[Reddit](https://www.reddit.com/) provides an additional avenue for large-scale digital data collection and analysis.


Notably, Reddit allows its users, known as Redditors, to create and maintain communities, or "subreddits," within the larger Reddit platform.  Redditors can opt in or out of a wide variety of communities at any time, allowing them to personalize their feeds and engage with a uniquely tailored online social circle. 

## Pushshift

Per the [subreddit](https://www.reddit.com/r/pushshift/comments/bcxguf/new_to_pushshift_read_this_faq/) dedicated to its discussion, pushshift.io allows users to analyze and aggregate large volumes of data from Reddit, while also providing the option to specify the time ranges from which to collect data.


### Setup

As in the previous chapter discussing more general web scraping through APIs, we'll be importing both `requests` and `pandas`. 

In [1]:
import requests
import pandas as pd

We'll also want to save our base URL as a string for later use.  When using Pushshift, we'll have to indicate whether we want to scrape submissions or comments in our request within the base URL. To scrape submissions, the base URL should end with `/submission/`. To scrape comments, the base URL should end with `/comment/`.  So, if we want to request submissions we'd set our base URL as the following: 

In [17]:
base_url = 'https://api.pushshift.io/reddit/search/submission/'
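
Since the two endpoints differ only in their final path segment, one way to avoid typos is to build both URLs from a shared root. This is a minimal sketch; `SEARCH_ROOT` and `build_base_url` are hypothetical names introduced here for illustration, not part of Pushshift or `requests`.

```python
# Root of the search API, taken from the base URL used in this lesson.
SEARCH_ROOT = 'https://api.pushshift.io/reddit/search'


def build_base_url(kind):
    """Return the search URL for 'submission' or 'comment' (hypothetical helper)."""
    if kind not in ('submission', 'comment'):
        raise ValueError("kind must be 'submission' or 'comment'")
    return f'{SEARCH_ROOT}/{kind}/'


submission_url = build_base_url('submission')
comment_url = build_base_url('comment')
print(submission_url)
print(comment_url)
```

The submission URL produced here is the same string we saved as `base_url` above.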

### Parameters

We'll now discuss a number of the parameters available through Pushshift. You can find a full list of parameters available through the API [here](https://pushshift.io/api-parameters/).  We'll return to some of the parameters listed at the link provided later on in the chapter, when we begin analyzing some scraped Reddit content. 

- **sort**: Allows you to control the direction in which you scrape results. Results can be scraped in ascending ('asc') or descending ('desc') order.


  - **sort_type**: Allows you to set the field used to sort results, such as `num_comments` or `score`.


- **created_utc**: This parameter is useful for restricting requests by the time at which content was created, expressed as a Unix epoch timestamp (seconds since January 1, 1970, in Coordinated Universal Time, or UTC). There are a number of online tools available for converting human-readable dates to epoch timestamps; I typically use [this one](https://www.epochconverter.com/).


  - **before**: Use this to scrape content created *before* a given epoch timestamp.
  - **after**: Use this to scrape content created *after* a given epoch timestamp.


- **size**: Allows you to set the maximum number of results returned by a request. A single request returns at most 1,000 submissions or comments; collecting more requires looping over multiple requests.


- **author**: Allows you to request only the content produced by a single Redditor, or a specific set of Redditors.


- **subreddit**: Allows you to request only the content from a single subreddit, or a specific set of subreddits.


- **score**: Allows you to sort or otherwise restrict requests depending on the net number of upvotes versus downvotes received by content.


- **num_comments**: Allows you to sort or otherwise restrict requests depending on the total number of comments on submissions. 
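
Because `size` caps a single request, pulling more content means looping and moving the `after` boundary forward each time. The sketch below shows one way that loop could look. It rests on several assumptions: `collect_all` and `fetch_page` are hypothetical names, results are assumed to come back oldest first, and `after` is assumed to exclude the boundary timestamp (so items sharing a timestamp at a page boundary could be missed). A small in-memory stand-in replaces the real API so the loop can be run offline.

```python
def collect_all(fetch_page, after, before, page_size=100):
    """Gather items created between `after` and `before` via repeated requests.

    `fetch_page` is a caller-supplied function (hypothetical) that takes a
    parameter dictionary and returns a list of dicts, each with a
    'created_utc' key, sorted oldest first.
    """
    results = []
    while True:
        params = {'after': after, 'before': before, 'size': page_size,
                  'sort': 'asc', 'sort_type': 'created_utc'}
        page = fetch_page(params)
        if not page:
            break  # nothing left in the window
        results.extend(page)
        # Move the window start to the newest item seen so far.
        after = page[-1]['created_utc']
    return results


def fake_fetch(params):
    """In-memory stand-in for the API, holding seven made-up items."""
    items = [{'created_utc': t} for t in range(1, 8)]
    hits = [i for i in items
            if params['after'] < i['created_utc'] < params['before']]
    return hits[:params['size']]


everything = collect_all(fake_fetch, after=0, before=10, page_size=3)
print(len(everything))
```

With a real endpoint, `fetch_page` would wrap the request itself and return the list stored under the response's `data` key; adding a short pause between requests is a sensible courtesy to the server.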


### Pushshift Exercise 1: Making a Request

Using the API documentation outlined above as a guide, let's use Pushshift to scrape some Reddit content. We'll start by looking at some submissions from [r/changemyview](https://www.reddit.com/r/changemyview/), which describes itself as "a place to post an opinion you accept may be flawed, in an effort to understand other perspectives on the issue."


Let's look at the 50 submissions to r/changemyview from 2019 that received the highest number of comments. We can start by setting the parameters for our request in a dictionary: 

In [14]:
parameters = {'subreddit' : 'changemyview',
              'sort'      : 'desc',
              'sort_type' : 'num_comments',
              'after'     : '1546300800',   # 2019-01-01 00:00:00 UTC
              'before'    : '1577836800',   # 2020-01-01 00:00:00 UTC
              'size'      : 50}
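
The `after` and `before` values in this request are epoch timestamps marking the start of 2019 and the start of 2020. As an alternative to an online converter, here is a minimal sketch of computing them with Python's standard `datetime` module; `to_epoch` is a hypothetical helper name introduced for illustration.

```python
from datetime import datetime, timezone


def to_epoch(year, month, day):
    """Return the Unix timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


after = to_epoch(2019, 1, 1)
before = to_epoch(2020, 1, 1)
print(after, before)  # prints: 1546300800 1577836800
```

Passing `tzinfo=timezone.utc` matters here: without it, Python would interpret the date in your machine's local time zone and the boundaries would shift by your UTC offset.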

Now we're ready to make our request to the API.

In [15]:
r = requests.get(base_url, params = parameters)
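
It can help to see what a parameter dictionary turns into on the wire. The sketch below uses the standard library's `urlencode` to build a query string by hand, which should closely resemble what `requests` assembles from `params`; it is shown for inspection only and does not contact the API. The names `query_string` and `full_url` are introduced here for illustration.

```python
from urllib.parse import urlencode

base_url = 'https://api.pushshift.io/reddit/search/submission/'
parameters = {'subreddit': 'changemyview',
              'sort': 'desc',
              'sort_type': 'num_comments',
              'after': '1546300800',
              'before': '1577836800',
              'size': 50}

# Encode the dictionary as key=value pairs joined by '&'.
query_string = urlencode(parameters)
full_url = f'{base_url}?{query_string}'
print(full_url)
```

After a real request, `r.url` shows the URL that `requests` actually used, which is a quick way to check for a mistyped parameter.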

The API returns its results in JSON format. We'll parse the response and load the list stored under its `data` key into a dataframe, so we can work with it within the notebook.

In [11]:
df = pd.DataFrame(r.json()['data'])

Try opening the dataframe in the cell below.

In [16]:
df

Unnamed: 0,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,can_mod_post,...,allow_live_comments,awarders,steward_reports,total_awards_received,author_premium,og_description,og_title,gilded,post_hint,preview
0,Amiller1776,,,[],3∆,dark,text,t2_19engkdd,False,False,...,,,,,,,,,,
1,It_is_not_that_hard,,,[],,,text,t2_3wkz6h8h,False,False,...,False,[],[],0.0,,,,,,
2,YoloPudding,,,[],,,text,t2_ycthq,False,False,...,False,[],[],0.0,False,,,,,
3,oshawottblue,,,[],,,text,t2_jjmij,False,False,...,,,,0.0,,,,,,
4,Asker1777,,,[],,,text,t2_2vvqwp7o,False,False,...,,,,,,,,,,
5,RedUlster,,,[],,,text,t2_4hnldl0n,False,False,...,False,[],[],0.0,,,,,,
6,carlsaganheaven,,,[],,,text,t2_3nghffvk,False,False,...,False,,,0.0,,,,,,
7,rudz97,,,[],,,text,t2_4192p599,False,False,...,False,,[],0.0,,,,,,
8,phileconomicus,,,[],,,text,t2_5qehc,False,False,...,False,,,0.0,,,,1.0,self,"{'enabled': False, 'images': [{'id': 'cRvVtrz-..."
9,Akukurotenshi,,,[],,,text,t2_3ndqqb0u,False,False,...,False,[],[],0.0,,,,1.0,,


There's a lot going on! It will be useful to pick out the keys from the JSON response (now the dataframe's columns) that we'd like to look at, so we're not bombarded with too much information at one time.

In [12]:
df.keys()

Index(['author', 'author_flair_background_color', 'author_flair_css_class',
       'author_flair_richtext', 'author_flair_text', 'author_flair_text_color',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'can_mod_post', 'contest_mode', 'created_utc', 'domain', 'full_link',
       'gildings', 'id', 'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers',
       'subreddit_type', 'suggested_sort', 'thumbnail', 'title', 'updated_utc',
       'url', 'whitelist_status', 'wls', 'all_awa

Some of these keys (for example, `author`, `created_utc`, `num_comments`, and `score`) were outlined in the parameter list above, but a few of the keys are new. Here are a couple of important keys we haven't yet discussed:

- **title**: This is the title a Redditor has given their submission.  


- **selftext**: This is the text contained within the submission itself. While a submission will always contain some text in the `title`, some image-only submissions will not contain any text in `selftext`. 
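
With these keys in hand, one way to cut the dataframe down to size is to keep only the columns of interest. The sketch below runs on a tiny stand-in dataframe with made-up values rather than real API output; `sample`, `columns_of_interest`, and `trimmed` are names introduced here, and the same steps should apply to a dataframe built from a real response.

```python
import pandas as pd

# Stand-in rows with made-up values; a real dataframe would come from the API.
sample = pd.DataFrame([
    {'author': 'example_user_a', 'title': 'CMV: example title one',
     'selftext': 'Example body text.', 'created_utc': 1546300800,
     'num_comments': 120, 'score': 45, 'is_self': True},
    {'author': 'example_user_b', 'title': 'CMV: example title two',
     'selftext': '', 'created_utc': 1546387200,
     'num_comments': 80, 'score': 12, 'is_self': True},
])

columns_of_interest = ['author', 'title', 'selftext',
                       'created_utc', 'num_comments', 'score']

# Keep only the columns that are present, in case a response omits some.
available = [c for c in columns_of_interest if c in sample.columns]
trimmed = sample[available].copy()

# Convert epoch seconds into timezone-aware datetimes for readability.
trimmed['created'] = pd.to_datetime(trimmed['created_utc'], unit='s', utc=True)
print(trimmed)
```

Filtering against the columns actually present avoids a `KeyError` when a field is missing from a particular batch of results.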

#### Pushshift Exercise 1 Answer (to be removed / added as Sample Answer Code later on)

In [8]:
parameters = {'subreddit' : 'changemyview',
              'sort'      : 'desc',
              'after'     : '1546300800',
              'before'    : '1577836800',
              'sort_type' : 'num_comments',
              'size'      : 50}

For relatively simple, easily obtainable data visualizations built on the Pushshift API, we can check out [Pushshift Reddit Search](https://redditsearch.io/).

### Pushshift Exercise 2