# General Assembly DSI 13 EC 
# Project 3 - Web APIs & NLP
## Mike Bell 
### October 23, 2020

# Project Overview

The goal of this project is to use NLP and various classification models to predict whether a Reddit post came from r/math or r/physics. The classification uses only the text content of the post (title and body); comments and pictures/media are not used.

Most of the code here can easily be adapted to any two subreddits: each notebook contains a 'subreddits' variable, a list holding the names of the two subreddits to be compared. Most of the data scraping, preprocessing and cleaning can be done by simply changing these names and running the notebooks. The data directory '../data/subreddit0_subreddit1_data/' (built from the two names, e.g. '../data/math_physics_data/') is used to save and retrieve the csv files generated during the scraping and preprocessing stages.



## Notebook 1: Data Collection using the Pushshift Reddit API

In this notebook we use the Pushshift Reddit API to scrape and collect data from two subreddits.
Our primary goal is to train a classification model to predict which subreddit a submission came from based on its text (both title and body text).

Our primary analysis will focus on the r/math and r/physics subreddits, but as stated above, the code below is set up to quickly scrape data from any two given subreddits.

Since Pushshift only allows 100 results per request, our scraping process pulls 100 items and tracks the UTC created timestamp of the oldest post retrieved. On the next request, we use this timestamp as the 'before' parameter to get another 100 posts older than this one. This process is repeated until the desired number of requests has been made.
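
The paging scheme described above can be sketched offline, without touching the network. This is only an illustration under stated assumptions: `paginate_before`, `fake_fetch` and `FAKE_POSTS` are hypothetical names that are not part of the notebook, and the stand-in fetcher assumes 'before' is exclusive (it returns only posts strictly older than the given timestamp).

```python
def paginate_before(fetch_page, start_before, max_requests):
    """Walk backwards in time, one page per request.

    fetch_page(before) is a hypothetical callable returning a list of dicts
    with a 'created_utc' key, all strictly older than `before`.
    """
    before = start_before
    pages = []
    for _ in range(max_requests):
        page = fetch_page(before)
        if not page:  # nothing older is left, so stop early
            break
        pages.append(page)
        # the oldest timestamp on this page becomes the next 'before' value
        before = min(post['created_utc'] for post in page)
    return pages


# Stand-in for the API (assumption): 250 fake posts with timestamps 1..250,
# served newest first, at most `size` per call.
FAKE_POSTS = [{'created_utc': t} for t in range(250, 0, -1)]


def fake_fetch(before, size=100):
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    return older[:size]


pages = paginate_before(fake_fetch, start_before=251, max_requests=10)
print([len(p) for p in pages])  # sizes of the pages that were retrieved
```

One consequence of this design is worth keeping in mind: if 'before' is exclusive, as assumed here, posts that share the exact timestamp of the oldest post on a page but did not fit on that page would be skipped by the next request.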

Each request for each subreddit is written to its own csv file; no aggregation or cleaning is performed in this notebook.

In [1]:
import requests
import pandas as pd
import time
import os
from datetime import datetime, timezone
pd.set_option('display.max_rows', 500)

In [3]:
# Submission endpoint for pushshift API
SUB_ENDPT = 'https://api.pushshift.io/reddit/search/submission'

In [4]:
# Pulls submissions from a given subreddit, with parameters
# sub_size (int): maximum number of posts to retrieve (limited by 100)
# before (int): only look for posts created before this UTC timestamp 
# after (int): only look for posts created after this UTC timestamp
# Note: We also choose to automatically ignore any posts with selftext equal to '[removed]', which typically indicates
# a post was automoderated for being inappropriate for the given subreddit.

def get_subreddit(subreddit, sub_size = 100, before = '', after = ''):
   
    sub_params = {    
        'subreddit': subreddit,
        'size' : sub_size, 
        'before' : before,
        'after' : after,
        'selftext:not' : '[removed]'
    }

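    res = requests.get(SUB_ENDPT, params = sub_params)
    # fail loudly on a non-200 response rather than on res.json() below
    res.raise_for_status()
    
    subs = pd.DataFrame(res.json()['data'])
    
    return subs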
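
As a possible variation on the parameter handling above, the query dict can be built so that unused time bounds are left out rather than sent as empty strings. This is a minimal sketch, not the notebook's method: `build_submission_params` is a hypothetical helper, and whether Pushshift treats an empty 'before' or 'after' differently from a missing one is not established here.

```python
def build_submission_params(subreddit, size=100, before=None, after=None):
    """Build the query dict, leaving out any time bound that was not supplied."""
    params = {
        'subreddit': subreddit,
        'size': min(size, 100),       # the notebook states a cap of 100 per request
        'selftext:not': '[removed]',  # same filter as in get_subreddit
    }
    if before is not None:
        params['before'] = int(before)
    if after is not None:
        params['after'] = int(after)
    return params


print(build_submission_params('math', before=1603411200))
```

The resulting dict could be passed as `params` to `requests.get` in place of `sub_params`.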

In [16]:
# folders and filenames are automatically generated based on the two subreddit names given in the list 'subreddits'
subreddits = [ 'math', 'physics']
subreddit_dir = f'../data/{subreddits[0]}_{subreddits[1]}_data/'
if not os.path.exists(subreddit_dir):
    os.makedirs(subreddit_dir)

In [None]:
# Iterate and make requests to each subreddit
# the lowest utc timestamp for each previous request is tracked and used as the 'before' parameter 
# for the next request

num_requests = 1000

curr_time = int(datetime.now(timezone.utc).timestamp())
oldest_utc = [curr_time,curr_time]
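for i in range(num_requests):
    for idx, subr in enumerate(subreddits):
        df = get_subreddit(subr, before = oldest_utc[idx])
        # an empty response has no 'created_utc' column, so skip it to avoid a KeyError
        if df.empty:
            continue
        oldest_utc[idx] = int(df['created_utc'].min())
        df.to_csv(f'{subreddit_dir}{subr}_{i}.csv', index = False)
        # short pause between requests so we do not hammer the API
        time.sleep(1)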
