# 01 Web Scraping Submissions from Reddit

In this notebook, we'll scrape the required submission data from Reddit for our analysis.  

In [1]:
import requests
import pandas as pd
import time
import os

### Scraping Subreddits: r/Republican, r/democrats, and r/Libertarian

The following subreddits were selected for their membership size and posting frequency, which let us pull 3,000 posts from each within the 4-5 weeks before 11/28/2020.  
* r/democrats: 158k Members
* r/Republican: 163k Members
* r/Libertarian: 442k Members

The pushshift.io API documentation has not been updated to reflect that the maximum number of submissions returned per request has dropped from 500 to 100. The code below uses the new limit of 100.

Additionally, I could not find a documented rate limit for requests, but the project lead, Jason Baumgartner, posted the following on Reddit several months ago:  
> In an effort to relieve some of the strain on the API, the rate limit is being adjusted to one request per second. *(Full post can be found [here](https://www.reddit.com/r/pushshift/comments/g7125k/in_an_effort_to_relieve_some_of_the_strain_on_the/))*

To meet (and exceed) this requirement, there is a three-second delay between requests.
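As a rough check on what these limits imply, the arithmetic can be sketched as below. The helper name `plan_requests` and its parameter names are hypothetical and only illustrate the calculation; the inputs are the figures stated above (3,000 posts per subreddit, 100 posts per request, three subreddits, a three-second delay).

```python
import math


def plan_requests(target_submissions, page_size, n_subreddits, delay_seconds):
    """Return (requests per subreddit, total requests, minimum seconds spent sleeping)."""
    # each request returns at most page_size posts, so round up
    requests_per_subreddit = math.ceil(target_submissions / page_size)
    total_requests = requests_per_subreddit * n_subreddits
    # one delay follows every request; network time is not included
    min_sleep_seconds = total_requests * delay_seconds
    return requests_per_subreddit, total_requests, min_sleep_seconds


per_subreddit, total, sleep_seconds = plan_requests(3_000, 100, 3, 3)
print(per_subreddit, total, sleep_seconds)  # 30 90 270
```

So the full pull should take at least four and a half minutes of waiting, plus however long the API needs to respond.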

In [2]:
# setting base url for scraping submissions with the pushshift.io API
url = 'https://api.pushshift.io/reddit/search/submission'

# list of subreddits to pull
subreddits = ['Republican', 'democrats', 'Libertarian']

# creating DataFrames for holding raw data pulls
raw_gop_df = pd.DataFrame()
raw_dem_df = pd.DataFrame()
raw_librt_df = pd.DataFrame()

# new max size for api (used to be 500)
params = {'size' : 100} 

# number of desired unique submissions for training and testing on each subreddit
target_submissions = 3_000

# number of responses required to meet the desired target submissions
num_gets = target_submissions//params['size'] # must be integer to be used in for loop

# iterating through subreddits in list
for subreddit in subreddits:
    # setting params for request
    params['subreddit'] = subreddit
    params['before'] = '' # always want to start with most recent
    
    # iterate through number requests/responses needed
    for get in range(num_gets):
        
        # passing requests url and params
        response = requests.get(url, params=params)
        
        # checking status code and printing status message to stdout for monitoring
        if response.status_code == 200:
            print(f'Request status good for request {get + 1} of {num_gets} on subreddit {subreddit}')
        else:
            # skipping failed requests so response.json() is not called on an error response
            print(f'Possible issue: {response.status_code}')
            time.sleep(3)
            continue
        
        # processing the received json response to create a DataFrame
        data = response.json()
        posts = data['data']
        df = pd.DataFrame(posts)
        
        # sorting df by 'created_utc' to get the oldest timecode to use for next request
        df = df.sort_values('created_utc', ascending=True).reset_index(drop=True)
        
        # setting param to pull requests before the last pull
        params['before'] = df['created_utc'][0]

        
        # adding DataFrame to appropriate raw data frame
        # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
        if subreddit == subreddits[0]:
            raw_gop_df = pd.concat([raw_gop_df, df], ignore_index=True)
        elif subreddit == subreddits[1]:
            raw_dem_df = pd.concat([raw_dem_df, df], ignore_index=True)
        else:
            raw_librt_df = pd.concat([raw_librt_df, df], ignore_index=True)
        
        time.sleep(3)

Request status good for request 1 of 30 on subreddit Republican
Request status good for request 2 of 30 on subreddit Republican
Request status good for request 3 of 30 on subreddit Republican
Request status good for request 4 of 30 on subreddit Republican
Request status good for request 5 of 30 on subreddit Republican
Request status good for request 6 of 30 on subreddit Republican
Request status good for request 7 of 30 on subreddit Republican
Request status good for request 8 of 30 on subreddit Republican
Request status good for request 9 of 30 on subreddit Republican
Request status good for request 10 of 30 on subreddit Republican
Request status good for request 11 of 30 on subreddit Republican
Request status good for request 12 of 30 on subreddit Republican
Request status good for request 13 of 30 on subreddit Republican
Request status good for request 14 of 30 on subreddit Republican
Request status good for request 15 of 30 on subreddit Republican
Request status good for request 16
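The loop above pages backwards through time by reusing the oldest `created_utc` value as the next `before` cutoff. A minimal sketch of that idea, separated from the network call, might look like the following. The names `paginate_before` and `fake_fetch` are hypothetical, and `fake_fetch` is only a stand-in for the API so the logic can be run offline; it is not how pushshift.io behaves in every detail.

```python
def paginate_before(fetch_page, n_pages):
    """Collect pages of posts, moving the 'before' cutoff back after each page.

    fetch_page is a hypothetical callable that takes a 'before' timestamp
    (or None for the most recent posts) and returns a list of post dicts,
    each with a 'created_utc' key.
    """
    collected = []
    before = None  # None means "start with the most recent posts"
    for _ in range(n_pages):
        posts = fetch_page(before)
        if not posts:  # nothing older is available, so stop early
            break
        collected.extend(posts)
        # the oldest timestamp in this page becomes the next cutoff
        before = min(post['created_utc'] for post in posts)
    return collected


def fake_fetch(before, size=4):
    """Stand-in for the API: ten made-up posts with timestamps 10 down to 1."""
    stamps = [t for t in range(10, 0, -1) if before is None or t < before]
    return [{'id': f'post{t}', 'created_utc': t} for t in stamps[:size]]


posts = paginate_before(fake_fetch, n_pages=5)
print(len(posts))  # 10
```

Stopping on an empty page is a design choice added here for the sketch: it avoids failing when a subreddit has fewer posts than requested. The resulting list of dicts could be passed to `pd.DataFrame` just as in the cell above.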

All data has been scraped. Now to save it and move on to cleaning and EDA.

In [5]:
# saving raw DataFrames in case more info is needed from them later
os.makedirs('./data/raw', exist_ok=True) # making sure the output folder exists
raw_gop_df.to_csv('./data/raw/raw_gop_data.csv', index=False)
raw_dem_df.to_csv('./data/raw/raw_dem_data.csv', index=False)
raw_librt_df.to_csv('./data/raw/raw_librt_data.csv', index=False)
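
Before moving on, it may be worth reading a saved file back to confirm how many rows were written and how many of them are unique submissions. The sketch below uses a small made-up DataFrame and a temporary folder rather than the real files; the helper name `summarize_saved` is hypothetical, and the `id` column is an assumption about the fields in the raw pull.

```python
import os
import tempfile

import pandas as pd


def summarize_saved(path, id_col='id'):
    """Read a saved CSV back and report total rows and unique ids."""
    df = pd.read_csv(path)
    return {'rows': len(df), 'unique_ids': df[id_col].nunique()}


# made-up frame standing in for a raw pull; 'b2' is deliberately duplicated
demo_df = pd.DataFrame({'id': ['a1', 'b2', 'b2'], 'created_utc': [3, 2, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_raw.csv')
    demo_df.to_csv(demo_path, index=False)
    summary = summarize_saved(demo_path)

print(summary)  # {'rows': 3, 'unique_ids': 2}
```

If the two numbers differ for a real file, the duplicates can be dropped during cleaning.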