# 1. Data Collection
***

In this notebook, we will be gathering our Reddit posts and comment in order to train our classification model. We'll leverage the PushShift API to access this data, which allows us to quickly grab a large number of posts from our subreddits (Jokes and DadJokes). From there, we'll gather the top comments for each of our posts.

**Executive Summary**:

 - [Grabbing Posts](#Pull-Posts-Down-from-Postshift)
 - [Grabbing Comments](#Pull-Associated-Comments-In)
 - [Combine Data, Save to .csv](#Combine-Data,-Save-to-.csv)

### Import Relevant Libraries

In [132]:
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import time
import requests
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Import CountVectorizer and TFIDFVectorizer from feature_extraction.text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

The PushShift Reddit API has two main functionalities -- querying for posts and querying for comments. The API can also take in a **before date** parameter which will allow us to change the timeframe and gather more posts than the 1000 limit that is built into API calls. The API only takes *epoch time* as a format here so we will have to do some math-- specifically we'll grab today's date in epoch time, and iteratively subract two weeks from the current data to grab older posts.

We then leverage **requests** to query the API, grabbing the returned JSON object and building a temporary dataframe to store it. Each of these dataframes are then concatenated together for ease of reference/maniupation using **pandas.**

### Pull Posts Down from PostShift

In [133]:
#Build header that we need to pass through with our API request
headers = {'User-agent': 'Admin'}

#Find current time in epochs, and assign two weeks of epoch time to a variable
epoch_time = int(time.time())
two_weeks = 1209600

#Build our API request, grab the JSON that's returned and store in a dataframe. We'll use this dataframe to build off of
url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=dadjokes&size=1&sort=des&before=%s' % (epoch_time)
res = requests.get(url, headers=headers)
json = res.json()
jokes_df = pd.DataFrame(json['data'])

#These are the subreddits we're interested in
subreddits = ['jokes', 'dadjokes']

#For the last 8 weeks, grab 1000 posts from the PushShift Reddit API for our subreddits. 
#Each iteration, store in a dataframe and append it to our existing df
for subreddit in subreddits:
 
    for i in range(1,4):

        url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=' + subreddit + '&size=1000&sort=des&before=%s' % (epoch_time -i*two_weeks)
        res = requests.get(url, headers=headers)
        json = res.json()

        temp_df = pd.DataFrame(json['data'])
        jokes_df = pd.concat([jokes_df,temp_df],ignore_index=True)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.




### Pull Associated Comments In

Now that we have our comments in a dataframe, we can take the **permalink** column and use it to build a second query to grab comments for our posts. Here, we will not leverage PushShift and explore a different approach. adding **'.json'** to a Reddit URL converts the content on said page to a JSON object. For all of our posts, we will take the corresponding comment URL and attached '.json' to it in order to grab the top ten parent comments. We'll build a string of the top ten parent comments and store it, alongside the 'permalink' URL, in a dictionary, so that we can build this into a dataframe and merge with our existing dataframe containing our posts.

In [136]:
#Grab our comments
comment_urls = list(jokes_df['permalink'])

In [137]:
#Initialize a list to store our comment dicts
post_comment_text = []

#For every comment URL in our list
for comment_url in comment_urls:
    
    #Build http request for our comment and store the response
    comment_request_url = 'https://www.reddit.com/' + comment_url + '.json'
    res = requests.get(comment_request_url, headers=headers)
   
    #Check for request status. If we were succesfull proceed with comments pull
    if res.status_code ==200:
        
        #Build a dict that will be home to our comment URL and the comment string
        comment_dict = {}
        comment_dict['permalink'] = comment_url
        comment_text = []
        comment_json = res.json()
        
        #If there are more than ten parent comments...
        if (len(comment_json[1]['data']['children']) > 10):
            
            #Just grab the top ten parent comments
            for i in range(10):
                comment_text.append(comment_json[1]['data']['children'][i]['data']['body'])

            #Store as a pipe-delimited string in our dict, and take that dict and store in our list
            comment_dict['comment_text'] = "|".join(comment_text)
            post_comment_text.append(comment_dict)
        
        #If there are between 0 and 10 parent comments...
        elif ((len(comment_json[1]['data']['children']) < 10) & (len(comment_json[1]['data']['children']) > 0)):

            #Grab each comment and append to list
            for i in range(len(comment_json[1]['data']['children'])):
                    comment_text.append(comment_json[1]['data']['children'][i]['data']['body'])
                    
             #Store as a pipe-delimited string in our dict, and take that dict and store in our list
            comment_dict['comment_text'] = "|".join(comment_text)
            post_comment_text.append(comment_dict)
                                     
        else:
            pass
        
            
    #If we fail, break and share the HTTP response we got 
    else:
        print(res.status_code)
        break
        
    #Sleep because we're nice people
    time.sleep(1)
        

In [138]:
#Check length of comment list
len(post_comment_text)

6610

### Combine Data, Save to .csv

In [139]:
#Store our comments in a dataframe
comment_df = pd.DataFrame(post_comment_text)

In [140]:
#check head of comments df
comment_df.head()

Unnamed: 0,permalink,comment_text
0,/r/Jokes/comments/dafwnf/i_ran_into_the_doctor...,Quick we’re running out of time
1,/r/Jokes/comments/dafvvi/the_only_two_white_ac...,One of my favorite parts of Between Two Fern: ...
2,/r/Jokes/comments/dafsqg/prisoner_and_guard/,I can't be the only one this doesn't make sens...
3,/r/Jokes/comments/dafskz/why_do_condoms_darken...,No|I don't get it\n\n\nI wanna laugh...
4,/r/Jokes/comments/dafsed/i_hate_when_girls_try...,......


In [131]:
#Merge our dataframes together using 'permalink'
jokes_df = pd.merge(jokes_df, comment_df, how='left', left_on='permalink', right_on='permalink')

#Just grab the columns we're interested in and drop the rest from our df
jokes_df = jokes_df[['title','selftext','score','over_18','created_utc','subreddit','num_comments','permalink','url','id','author', 'comment_text']]

#Write our dataframe to a csv file so we can pull in later
jokes_df.to_csv('./datasets/jokedata.csv', index=False)