In [1]:
import pandas as pd
import requests
import json
from datetime import datetime

## Set Time Anchor

This timestamp was created during exploratory pulls to serve as a fixed reference point for collecting non-duplicate data over multiple days. The original code that created it is kept, commented out, under "Original Time Anchor Creation".

In [2]:
time_anchor = 1587230611
datetime.fromtimestamp(time_anchor)

datetime.datetime(2020, 4, 18, 12, 23, 31)
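`datetime.fromtimestamp` converts to the local timezone of the machine running the notebook, so the value shown above depends on where the notebook was run. Since `created_utc` values are epoch seconds in UTC, a timezone-independent view of the anchor can be sketched as follows (the name `anchor_utc` is only illustrative):

```python
from datetime import datetime, timezone

time_anchor = 1587230611

# Passing tz=timezone.utc makes the result independent of the local timezone.
anchor_utc = datetime.fromtimestamp(time_anchor, tz=timezone.utc)
print(anchor_utc.isoformat())  # 2020-04-18T17:23:31+00:00
```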

## Data Collection from r/askscience

The Pushshift API only allows pulling down 500 posts per request, so reaching the goal of 5,000+ observations for each subreddit takes repeated requests.
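The number of requests follows directly from the 500-post page size. A small sketch of that arithmetic, where `requests_needed` and `PAGE_SIZE` are hypothetical names introduced only for this example:

```python
import math

PAGE_SIZE = 500  # posts per request, per the limit described above


def requests_needed(target_posts, page_size=PAGE_SIZE):
    """Return how many requests are needed to collect at least target_posts."""
    return math.ceil(target_posts / page_size)


print(requests_needed(5_000))   # 10
print(requests_needed(20_000))  # 40
```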

In [3]:
# grab the first 500 posts with the pushshift API

url = 'https://api.pushshift.io/reddit/search/submission'

paramsfirst500 = {
    'subreddit': 'askscience',
    'size': 500,
    'before': time_anchor
}

resfirst500 = requests.get(url, params=paramsfirst500)

# check status for success
resfirst500.status_code

200

In [4]:
# convert to json, extract only a list of posts
datafirst500 = resfirst500.json()
first500posts = datafirst500['data']

# sanity check for titles
[i['title'] for i in first500posts][:5]

['Is there a term for the compulsive stripping during a breakdown?',
 'Funny and Cute Baby - Funny Cute Videos | Simple relief from corona Virus',
 'Is there an optimal amount of subjects your brain can learn over a period of time?',
 'A term for compulsive stripping, pulling hair.',
 'Is there any benefit of scrubbing onion on hot grills?']

#### Original Time Anchor Creation

In [5]:
# create a reference timestamp for collecting more data over the next few days

# time_anchor = first500posts[0]['created_utc']

#### Gather More Data

Now to loop backwards through time and pull in more posts: each request uses the oldest timestamp from the previous batch as its `before` cutoff.

In [6]:
# instantiate list to hold all data from the loops
data = first500posts.copy()

# get oldest post's timestamp each loop to go further back
oldesttimestamp = first500posts[-1]['created_utc']

In [7]:
# function to grab posts using the pushshift API

def pull_posts_before(subreddit, n, final_destination, starting_when=oldesttimestamp):
    '''Pull n batches of up to 500 posts each from `subreddit`, working backwards in time.

    `final_destination` must be a list; each batch is appended to it in place.
    `starting_when` is the `before` cutoff for the first request. Its default is
    the value `oldesttimestamp` had when this function was defined.'''
    for i in range(n):

        # set request parameters
        paramsnext500 = {
            'subreddit': subreddit,
            'size': 500,
            'before': starting_when
        }

        # create request
        resnext500 = requests.get(url, params=paramsnext500)

        # print out status code each loop to ensure success
        print('Pulling down data... Status Code:', resnext500.status_code)

        # convert to json, strip away outer layer to get only post data
        datanext500 = resnext500.json()
        next500posts = datanext500['data']

        # stop early if nothing older is left; indexing an empty list below would raise IndexError
        if not next500posts:
            break

        # append to data
        final_destination.extend(next500posts)

        # use the oldest post in this batch as the cutoff for the next loop
        starting_when = next500posts[-1]['created_utc']
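The backwards-paging idea can also be tried without touching the network. The sketch below is not the notebook's function: `pull_posts_paged`, `fetch_page`, and `fake_fetch` are hypothetical names, and the fetch step is passed in as a parameter so a stand-in can replace the real request. It stops when a page comes back empty and can pause between requests.

```python
import time


def pull_posts_paged(fetch_page, n, starting_when, pause=0.0):
    """Collect up to n pages, walking backwards in time.

    fetch_page is a hypothetical callable that takes a 'before' timestamp
    and returns a list of post dicts, each with a 'created_utc' key.
    """
    collected = []
    before = starting_when
    for _ in range(n):
        page = fetch_page(before)
        if not page:
            # an empty page means there is nothing older left to fetch
            break
        collected.extend(page)
        # the oldest post on this page becomes the cutoff for the next request
        before = min(post['created_utc'] for post in page)
        time.sleep(pause)
    return collected


# Stand-in for the API: 25 fake posts, newest first, served 10 at a time.
fake_posts = [{'created_utc': t} for t in range(25, 0, -1)]


def fake_fetch(before):
    return [p for p in fake_posts if p['created_utc'] < before][:10]


posts = pull_posts_paged(fake_fetch, n=5, starting_when=26)
print(len(posts))  # 25
```

In the notebook, `fetch_page` could wrap the same `requests.get` call used in `pull_posts_before`. Taking the minimum `created_utc` instead of the last element avoids relying on the order in which posts are returned.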

In [8]:
# use pull_posts_before with n = 39 to add 19,500 more posts to our first 500
pull_posts_before('askscience', 39, data)

# check length of data, should be 20,000
len(data)

Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down

20000

In [9]:
# check for duplicate titles to make sure the loop isn't grabbing the same data over and over,
# e.g. by reaching the end of the subreddit and re-pulling its first-ever 500 posts

titles = [i['title'] for i in data]
print(len(titles) - len(set(titles)))

451


In [10]:
# 451 duplicates will need to be removed before modelling.
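One way the later removal could look, as a minimal sketch rather than the method actually used downstream: keep the first post seen for each title and drop the rest. The helper name `drop_duplicate_titles` and the sample data are made up for illustration.

```python
def drop_duplicate_titles(posts):
    """Return posts with repeated titles removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for post in posts:
        title = post['title']
        if title not in seen:
            seen.add(title)
            unique.append(post)
    return unique


sample = [{'title': 'a'}, {'title': 'b'}, {'title': 'a'}]
print(len(sample) - len(drop_duplicate_titles(sample)))  # 1
```

Once the posts are in a dataframe, `DataFrame.drop_duplicates(subset='title')` does the same job.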

## Data Collection for r/shittyaskscience

In [11]:
# instantiate list for posts
shitty_data = []

# call pull_posts_before with n = 40 to get 20,000 posts
pull_posts_before('shittyaskscience', 40, shitty_data)

# check length of data, should be 20,000
len(shitty_data)

Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down data... Status Code: 200
Pulling down

20000

In [12]:
# check for duplicate titles
shitty_titles = [i['title'] for i in shitty_data]
print(len(shitty_titles) - len(set(shitty_titles)))

353


In [13]:
# we have 353 duplicates to remove later

## Dataframe Creation

Only the values for "subreddit" and "title" are needed for the models, but "removed_by_category" and "banned_by" values show if a post has been deleted or removed by moderators, rendering said post invalid. Unnecessary columns can be dropped later.
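A possible filter based on those two columns is sketched below. It assumes that a null in both "removed_by_category" and "banned_by" means the post was not removed; that reading of the data is an assumption, and `keep_valid_posts` and the toy frame are hypothetical.

```python
import pandas as pd


def keep_valid_posts(frame):
    """Keep rows where neither removal column is populated.

    Assumption: a null in 'removed_by_category' and 'banned_by' means the
    post was not removed. Columns absent from the frame are ignored.
    """
    mask = pd.Series(True, index=frame.index)
    for col in ('removed_by_category', 'banned_by'):
        if col in frame.columns:
            mask &= frame[col].isna()
    # only the two columns needed for the models are returned
    return frame.loc[mask, ['subreddit', 'title']]


toy = pd.DataFrame({
    'subreddit': ['askscience'] * 3,
    'title': ['kept', 'removed by mod', 'banned'],
    'removed_by_category': [None, 'moderator', None],
    'banned_by': [None, None, 'moderators'],
})
print(keep_valid_posts(toy))
```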

In [14]:
# build dataframes
df = pd.DataFrame(data)
df['subreddit'] = 'askscience'

shitty_df = pd.DataFrame(shitty_data)
shitty_df['subreddit'] = 'shittyaskscience'

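# combine dataframes and renumber the rows from 0
# (reset_index returns a new dataframe, so the result has to be assigned)
combined_df = pd.concat([df, shitty_df])
combined_df = combined_df.reset_index(drop=True)
combined_df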

,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,can_gild,category,content_categories,hidden,quarantine,removal_reason,subreddit_name_prefixed,suggested_sort,rte_mode,author_id
0,[],False,Pakislav,,[],,text,t2_8ozi5,False,False,...,,,,,,,,,,
1,[],False,BabyParenting1920,,[],,text,t2_63yulwx7,False,False,...,,,,,,,,,,
2,[],False,Medwin_the_Scaled,,[],,text,t2_5d29szr2,False,False,...,,,,,,,,,,
3,[],False,[deleted],,,,,,,,...,,,,,,,,,,
4,[],False,Silmarlion,,[],,text,t2_f5edp,False,False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,,,Redskuling,,[],,text,,,,...,,,,,,,,,markdown,
39996,,,TheCreatorLovesYou,,[],,text,,,,...,,,,,,,,,markdown,
39997,,,P0J0,,[],,text,,,,...,,,,,,,,,markdown,
39998,,,bloodofgore,,[],,text,,,,...,,,,,,,,,markdown,


In [15]:
# check the target
combined_df['subreddit'].value_counts()

shittyaskscience    20000
askscience          20000
Name: subreddit, dtype: int64

## Export Data to .csv

In [16]:
# the file is named 'b4timeanchor.csv' to keep this data separate from newer data to be collected in the future

combined_df.to_csv('./data/b4timeanchor.csv', index=False)
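`to_csv` raises an error if the target folder does not exist, so the `./data` directory has to be in place before the export. A small sketch of an export helper that creates missing parent folders first; `export_csv` is a hypothetical name, and the demonstration writes to a temporary directory rather than `./data`:

```python
from pathlib import Path
import tempfile

import pandas as pd


def export_csv(frame, path):
    """Write frame to path without the index, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    return path


# Demonstration in a temporary directory so nothing in ./data is touched.
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame({'subreddit': ['askscience'], 'title': ['example']})
    written = export_csv(toy, Path(tmp) / 'data' / 'toy.csv')
    round_trip = pd.read_csv(written)

print(round_trip.shape)
```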