# Pushshift API - Extracting Subreddit Submissions

---

Our main objective in this notebook is to extract submissions (posts) from two of our favorite subreddits, [/r/explainlikeimfive](https://www.reddit.com/r/explainlikeimfive/) and [/r/legaladvice](https://www.reddit.com/r/legaladvice/), via [Pushshift's API](https://github.com/pushshift/api).

This handy RESTful web API, created by the /r/datasets mod team, lets us pull submission data from subreddits, newest posts first.  It accepts search parameters so that you only get back the data you want.  In this notebook we ask for the following fields of each submission:

- subreddit
- title
- selftext
- author
- created_utc
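
To make the idea of search parameters concrete, here is a minimal sketch that builds the query URL for one page of results using only the standard library. The helper name `build_query_url` is hypothetical, and joining `fields` into one comma-separated value is an assumption about how the API accepts multiple fields; the endpoint is the one used later in this notebook.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://api.pushshift.io/reddit/search/submission"

def build_query_url(subreddit, size=100, fields=None, before=None):
    """Return the search URL for one page of submissions (hypothetical helper)."""
    params = {"subreddit": subreddit, "size": size}
    if fields:
        # Assumption: multiple fields are sent as one comma-separated value.
        params["fields"] = ",".join(fields)
    if before is not None:
        # 'before' is an epoch timestamp; only older submissions should come back.
        params["before"] = int(before)
    return f"{BASE_URL}?{urlencode(params)}"

url = build_query_url("explainlikeimfive", fields=["title", "created_utc"])
print(url)
```

Printing the URL before sending any request is a cheap way to check that the parameters are spelled the way you intended.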

In this notebook, we primarily work with one function that calls the API, extracts the submissions, and returns a pandas DataFrame.  The main steps covered in the `get_reddit` function are as follows:

1.  Create a blank Pandas DataFrame
2.  Use Pushshift's URL and specific parameters to get back what we want
3.  Call the API using `requests.get()` and get back a JSON response holding up to 100 submissions
4.  Query the API 49 more times, each time asking only for posts older than the last one received, while using a throttle timer to be mindful of Pushshift's server
5.  Append these rows to our main DataFrame
6.  Return the DataFrame
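
The steps above boil down to a paging loop: fetch a batch, remember the timestamp of its oldest post, and ask for anything older. The sketch below shows that loop in isolation so it can run offline. `collect_pages`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names invented for illustration, and the stand-in fetcher only imitates the newest-first ordering that the real API is assumed to use.

```python
import time

def collect_pages(fetch_page, n_pages=50, pause=3.0):
    """Gather up to n_pages batches of posts, walking backwards in time.

    fetch_page is a hypothetical callable: given `before` (an epoch
    timestamp or None), it returns a list of post dicts, newest first.
    """
    posts = []
    before = None
    for page in range(n_pages):
        batch = fetch_page(before)
        if not batch:
            break  # nothing older is left, so stop early
        posts.extend(batch)
        before = batch[-1]["created_utc"]  # oldest post seen so far
        if page < n_pages - 1:
            time.sleep(pause)  # throttle between requests
    return posts

# Offline stand-in for the API: 250 fake posts with descending timestamps.
FAKE_POSTS = [{"title": f"post {i}", "created_utc": 1000 - i} for i in range(250)]

def fake_fetch(before, size=100):
    older = [p for p in FAKE_POSTS if before is None or p["created_utc"] < before]
    return older[:size]

result = collect_pages(fake_fetch, n_pages=50, pause=0)
print(len(result))
```

Passing the fetcher in as an argument keeps the paging logic testable without touching the network; in the notebook itself the same role is played by `requests.get()`.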

In the end, we write the result to a CSV file in our `./data` folder.

In [2]:
import requests
import pandas as pd
import time

In [3]:
# Largely adapted from Paulene's code that she graciously shared!  Thanks Paulene!

def get_reddit(subreddit):
    '''Pull up to 5,000 submissions from a subreddit via Pushshift and return a DataFrame'''
    
    # Instantiating some variables
    count = 0
    df = pd.DataFrame(None)
    
    # Pushshift's URL and params
    url = "https://api.pushshift.io/reddit/search/submission"
    params = {
        'subreddit': subreddit,
        'size': 100,
        'fields': ['subreddit', 'title', 'selftext', 'author', 'created_utc']}
    res = requests.get(url, params)
    data = res.json()
    posts = data['data']
    
    # Creating Pandas DataFrame for submissions
    df_new = pd.DataFrame(data = posts)
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    df = pd.concat([df, df_new])
    count += 1
    print(f'Round {count}')
    
    # While count is less than 50, bring in more submissions.
    while count < 50:
        
        params2 = {
            'subreddit': subreddit,
            'size': 100,
            'fields': ['subreddit', 'title', 'selftext', 'author', 'created_utc'],
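            # Only request posts older than the last one collected so far;
            # select the column by name rather than by position
            'before': int(df['created_utc'].iloc[-1])}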
        res2 = requests.get(url, params2)
        data2 = res2.json()
        posts2 = data2['data']
        
        # Creating DataFrame
        df_new = pd.DataFrame(data = posts2)
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        df = pd.concat([df, df_new])
        
        count += 1
        
        # Progress printouts:  Code adapted from global lect: NLP I
        # count now equals the number of completed rounds, so report it directly
        if count % 10 == 0:
            print(f'Round {count} of 50.') # printed checks help confirm that the function is working
        
        time.sleep(3)

    # Replace the repeated 0-99 indices from each batch with one running index
    df = df.reset_index(drop=True)
        
    return df
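
`get_reddit` assumes every request succeeds and that every reply contains a `data` key. As a hedged sketch of one way to guard against that, the hypothetical helper `extract_posts` below returns an empty list when the reply is unusable. It relies only on the `status_code` attribute and `json()` method that a `requests` response provides; `FakeResponse` is an invented offline stand-in, and the 429 status is just an example of a non-success reply.

```python
class FakeResponse:
    """Offline stand-in exposing the two members used below (hypothetical)."""

    def __init__(self, status_code, payload=None):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        if self._payload is None:
            raise ValueError("body is not valid JSON")
        return self._payload

def extract_posts(response):
    """Return the list of posts in one API reply, or [] if it is unusable."""
    if response.status_code != 200:
        return []
    try:
        payload = response.json()
    except ValueError:
        # The body was not valid JSON.
        return []
    if not isinstance(payload, dict):
        return []
    posts = payload.get("data", [])
    return posts if isinstance(posts, list) else []

ok = extract_posts(FakeResponse(200, {"data": [{"title": "ELI5: tides"}]}))
failed = extract_posts(FakeResponse(429))
print(len(ok), len(failed))
```

An empty list is a convenient signal here because the paging loop can treat it the same way as "no older posts left" and stop instead of raising partway through a long run.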


### Calling our function and outputting our data

In [4]:
eli5 = get_reddit('explainlikeimfive')

Round 1
Round 10 of 50.
Round 20 of 50.
Round 30 of 50.
Round 40 of 50.
Round 50 of 50.


In [9]:
# eli5.to_csv('./data/eli5.csv', index=False)

In [4]:
advice = get_reddit('advice')

Round 1
Round 10 of 50.
Round 20 of 50.
Round 30 of 50.
Round 40 of 50.
Round 50 of 50.


In [6]:
advice.to_csv('./data/advice.csv', index=False)