## Data Collection
For this project, I used the Pushshift API to gather Reddit data from the Xbox One (Xbox) and PS4 (PS) subreddits.

In [1]:
#Imported Libraries
import time
import requests
import numpy as np
import pandas as pd

The code block below gathers posts from a subreddit and returns them as a pandas DataFrame. Under the Pushshift API's updated rules, each request returns at most 100 posts, so I nested a while loop in the function to keep requesting posts until the desired number is reached. To prevent duplicate posts, the oldest (minimum) `created_utc` value from each response, a Unix timestamp, is passed as the `before` query parameter of the next request. Lastly, I pause for 3 seconds after each request out of politeness and to comply with rate-limiting policies. Print statements track the progress of the pull.

In [2]:
def get_data(subreddit, size, before_time=1594969200):
   
    '''Pulls posts from the given subreddit until `size` posts are collected and returns them as a dataframe'''

    base_url = 'https://api.pushshift.io/reddit/search/submission'
    posts = [] 
    posts_length = 0 
    #before_time = 1594969200 is the Unix timestamp for July 17, 2020 at 12:00:00 A.M.
    
    #Loops until the requested number of posts has been scraped from the subreddit
    while posts_length < size:
        res = requests.get(base_url, params = {"subreddit": subreddit, "size": 100, "before": before_time}).json()
        data = res['data']
        #Stops early if no older posts are returned; an empty response has no created_utc column
        if not data:
            break
        df = pd.DataFrame(data)
        
        #Records the minimum created_utc of each request as the `before` cutoff for the next request
        before_time = int(df['created_utc'].min())
        posts.append(df)
        posts_length += len(data)
    
        #Prints the number of posts pulled so far, once every 1,000 posts
        if (posts_length) % 1000 == 0:
            print(f'Post {posts_length} of {size}')
        
        #Delay each loop by 3 seconds - for ethical purposes
        time.sleep(3)
    
    return pd.concat(posts)
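
The `before` cursor logic can be checked offline. The following is a minimal sketch, not the code used for this project: `paginate_before` and `fake_fetch` are hypothetical names, and the fake endpoint only mimics the newest-first, `before`-filtered behaviour assumed of Pushshift. The sketch also stops when a page comes back empty and trims the result to `size`.

```python
import pandas as pd


def paginate_before(fetch, size, before_time, page_size=100):
    """Collect up to `size` posts by repeatedly asking for posts older than a cursor.

    `fetch(before, limit)` is a hypothetical callable returning a list of dicts,
    newest first, each with a `created_utc` Unix timestamp.
    """
    pages = []
    collected = 0
    while collected < size:
        data = fetch(before_time, page_size)
        if not data:  # nothing older than the cursor: stop instead of looping forever
            break
        df = pd.DataFrame(data)
        # The oldest timestamp on this page becomes the cursor for the next request
        before_time = int(df['created_utc'].min())
        pages.append(df)
        collected += len(df)
    if not pages:
        return pd.DataFrame()
    return pd.concat(pages, ignore_index=True).head(size)


def fake_fetch(before, limit):
    """Stand-in for the API: 250 posts with timestamps 1..250, newest first."""
    older = [t for t in range(250, 0, -1) if t < before]
    return [{'id': f'p{t}', 'created_utc': t} for t in older[:limit]]


sample = paginate_before(fake_fetch, size=230, before_time=10_000)
print(len(sample), sample['id'].is_unique)
```

Because `before` is exclusive in this sketch, posts that share the exact timestamp of a page's oldest post could be skipped; that trade-off is what keeps pages from overlapping.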

### Xbox One Data

I collected 10,000 posts from the r/xboxone subreddit and stored them in a CSV file called `xbox_one.csv`.

In [8]:
xbox_one = get_data('xboxone', 10_000)

Post 1000 of 10000
Post 2000 of 10000
Post 3000 of 10000
Post 4000 of 10000
Post 5000 of 10000
Post 6000 of 10000
Post 7000 of 10000
Post 8000 of 10000
Post 9000 of 10000
Post 10000 of 10000


In [9]:
xbox_one.to_csv('../datasets/xbox_one.csv')
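
A quick sanity check on a collected frame might look like the sketch below. It uses a small hand-made stand-in (`demo`, with hypothetical values) rather than the real `xbox_one` data, and it assumes each post carries an `id` field alongside `created_utc`.

```python
import pandas as pd

# Hypothetical stand-in for a collected frame; real data has many more columns
demo = pd.DataFrame({
    'id': ['a1', 'a2', 'a2', 'a3'],
    'created_utc': [1594969100, 1594969000, 1594969000, 1594968900],
})

# Count posts that appear more than once, then keep the first copy of each
n_duplicates = int(demo.duplicated(subset='id').sum())
deduped = demo.drop_duplicates(subset='id').reset_index(drop=True)

# Convert Unix timestamps (seconds) to timezone-aware datetimes to inspect the time span
created = pd.to_datetime(deduped['created_utc'], unit='s', utc=True)
print(n_duplicates, len(deduped))
print(created.min(), created.max())
```

If `n_duplicates` is zero on the real data, the `before` cursor did its job; the printed time span shows how far back the 10,000 posts reach.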

### PS4 Data

I collected 10,000 posts from the r/ps4 subreddit and stored them in a CSV file called `ps_4.csv`.

In [6]:
#Gathering data from PS4 subreddit 
ps_4 = get_data('PS4', 10000)

Post 1000 of 10000
Post 2000 of 10000
Post 3000 of 10000
Post 4000 of 10000
Post 5000 of 10000
Post 6000 of 10000
Post 7000 of 10000
Post 8000 of 10000
Post 9000 of 10000
Post 10000 of 10000


In [7]:
ps_4.to_csv('../datasets/ps_4.csv')
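
One detail worth checking when the CSVs are read back: `to_csv` writes the DataFrame index as an extra unnamed column by default, and a frame built with `pd.concat` from per-request pages repeats the same index labels on every page. The sketch below uses a temporary file and hypothetical toy data rather than the real datasets to show one way to leave that column out.

```python
import os
import tempfile

import pandas as pd

# Two hypothetical "pages", concatenated the same way as in get_data
pages = [pd.DataFrame({'id': ['a', 'b']}), pd.DataFrame({'id': ['c', 'd']})]
combined = pd.concat(pages)
print(combined.index.tolist())  # index labels repeat across pages

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.csv')
    # index=False leaves the repeated index out of the file
    combined.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
```

An alternative with the same effect on the saved file is to pass `ignore_index=True` to `pd.concat` and drop the index column after loading.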