# Can we determine if a Reddit user found a post satisfying or infuriating?
## More specifically, can we use NLP to determine whether a user posted their comment on r/oddlysatisfying or r/mildlyinfuriating?
The goal of this project is to build a machine learning model that can read bulk comment data from Reddit and classify which subreddit each comment came from. This project will only be looking at the two subreddits mentioned above.


## Data Collection
Here we will be using the Pushshift.io Reddit API to collect comments posted on r/mildlyinfuriating and r/oddlysatisfying. We'll be shooting for about 100,000 comments from each subreddit so that even after cleaning we will still have a sizable chunk of data to work with.

In [1]:
# imports
import requests
import pandas as pd
import time

I'll be using two different functions here for collection. One function will handle individual requests to the pushshift servers for batches of 100 comments at a time. The other will set up a loop to gather and append each batch to a dataframe until we've reached the desired amount of data.

In [2]:
# function that gets a collection of comments harvested from our desired subreddit using the pushshift api
def get_reddit_comment_data(subreddit, size = 100, before = None):
    
    # set our url and parameters of data we're collecting
    url = 'https://api.pushshift.io/reddit/search/comment'
    params = {
        'subreddit': subreddit,
        'size' : size,
        'before' : before
    }
    
    # connect to the page and collect the data returned, raising on HTTP errors
    res = requests.get(url, params=params, timeout=30)
    res.raise_for_status()
    data = res.json()
    
    #convert our data into a DataFrame
    df = pd.DataFrame(data['data'])
    
    # return relevant columns from our dataframe
    return df[['author','body','created_utc','subreddit']]

In [3]:
# this function makes multiple requests successively so we can exceed the 100 comment limit and appends each request to a dataframe
def get_bulk_reddit_comment_data(sub, loop = 10, stamp = None):
    
    print(f'Starting collection of {loop*100} comments from r/{sub}')
    print(f'This will take at least {int(loop*2)} seconds')
    
    # retrieve the first batch of comments, starting with the most recent comment if no start time is specified
    bulk_df = get_reddit_comment_data(subreddit = sub, before = stamp)
    
    # retrieve the timestamp of the oldest comment in the batch (the smallest created_utc value)
    timestamp = bulk_df['created_utc'].min()
    
    # grab the other 9 (or loop-1) sets of data requested
    for i in range(loop-1):
        
        # grab the next older batch of comments, i.e. those posted before the oldest timestamp in the previous batch
        temp_df = get_reddit_comment_data(subreddit = sub, before = timestamp)
        
        # set the new oldest comment timestamp based on the batch we just collected
        timestamp = temp_df['created_utc'].min()
        
        # append our new batch of comments to the master dataframe
        bulk_df = pd.concat([bulk_df,temp_df], axis=0)
        
        # wait 2 seconds before next loop to be polite to the pushshift servers
        time.sleep(2)
        
        # progress bar since this can take a while to run with the baked in wait between loops
        print(f'{sub} progress is {int((i+1)/loop*100)}%', end='\r', flush=True)
    
    # final print statement when loop completes to let us know
    print(f'r/{sub} data collection complete! Size is {bulk_df.shape}')
    
    return bulk_df

Now that we have the tools to collect, let's get a bunch of data! This can take a while with a 2-second delay per batch, so I included a little progress bar to check in on it (1,000 loops for each of the 2 subreddits will take over an hour).
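As a rough check on that estimate, here is a minimal sketch of the arithmetic. The helper name `estimate_collection_minutes` is hypothetical (it is not part of the notebook), and it only counts the built-in wait between batches, so the real run will be somewhat longer once request time is added.

```python
# Hypothetical helper: lower bound on collection time, counting only the sleep between batches.
def estimate_collection_minutes(loops, n_subreddits=2, delay_seconds=2):
    total_seconds = loops * delay_seconds * n_subreddits
    return total_seconds / 60

# 1,000 loops of 100 comments each, for two subreddits, with a 2-second wait per loop
full_run = estimate_collection_minutes(loops=1000)
print(f'Full collection needs at least {full_run:.1f} minutes of waiting')
```

With these assumptions the waiting alone comes to 4,000 seconds, which is a little over an hour.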

The originally gathered data is included; it was collected with the same code as the smaller example below, which also shows how the progress bar works if you want to try it yourself!

In [4]:
# get comments from r/mildlyinfuriating and store them for cleaning
mild_df = get_bulk_reddit_comment_data(sub = 'mildlyinfuriating', loop=10)
mild_df.to_csv('../data/mildlyinfuriating_example.csv', index=False)

# get comments from r/oddlysatisfying and store them for cleaning
odd_df = get_bulk_reddit_comment_data(sub = 'oddlysatisfying', loop=10)
odd_df.to_csv('../data/oddlysatisfying_example.csv', index=False)

Starting collection of 1000 comments from r/mildlyinfuriating
This will take at least 20 seconds
r/mildlyinfuriating data collection complete! Size is (1000, 4)
Starting collection of 1000 comments from r/oddlysatisfying
This will take at least 20 seconds
r/oddlysatisfying data collection complete! Size is (1000, 4)


Now we can send out our collected comments for data cleaning!
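Before handing the data off, it can help to confirm that both collections share the same columns and that no comment was gathered twice. The sketch below is only an illustration under assumptions: the tiny frames are made-up stand-ins for `mild_df` and `odd_df`, and the duplicate rule (same author, body, and timestamp) is my assumption, not something the cleaning step is known to use.

```python
import pandas as pd

# Made-up stand-in rows with the same four columns the collection functions return
mild_df = pd.DataFrame({
    'author': ['user_a', 'user_b', 'user_b'],
    'body': ['so annoying', 'why would anyone do this', 'why would anyone do this'],
    'created_utc': [300, 200, 200],
    'subreddit': ['mildlyinfuriating'] * 3,
})
odd_df = pd.DataFrame({
    'author': ['user_c', 'user_d'],
    'body': ['that fit is perfect', 'could watch this all day'],
    'created_utc': [310, 250],
    'subreddit': ['oddlysatisfying'] * 2,
})

# Stack both subreddits into one frame with a fresh index
combined = pd.concat([mild_df, odd_df], axis=0, ignore_index=True)

# Assumed duplicate rule: identical author, body, and timestamp
combined = combined.drop_duplicates(subset=['author', 'body', 'created_utc']).reset_index(drop=True)

# Count how many comments remain per subreddit
counts = combined['subreddit'].value_counts()
print(counts)
```

Checking the per-subreddit counts at this point gives an early warning if one class ended up much smaller than the other.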