## 01 Data Scraping  
(Problem statement here)

I need to write a function that pulls at least 10,000 posts from each subreddit and collects them in a dataframe that I can save as a csv. I'll need one or more "for" loops and some parameters that specify which types of posts I would like to retrieve.  
### Skip Ahead:  
[Function 1](#Function-1)  
[Function 2](#Function-2)  
[Epilogue](#Epilogue)

In [1]:
import requests
import pandas as pd
import numpy as np
import time
import random
#https://stackoverflow.com/questions/25351968/how-to-display-full-non-truncated-dataframe-information-in-html-when-convertin
pd.set_option('display.max_colwidth', None)
#https://stackoverflow.com/questions/19124601/pretty-print-an-entire-pandas-series-dataframe
pd.set_option('display.max_rows', None)

### Function 1
#### get_posts()

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'
params = {
    'subreddit' : 'Xbox',    #subreddit in params will be whatever subreddit is entered into function
    'size' : 100          #will pull 100 posts at a time for given subreddit
}
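
For reference, the `params` dict ends up as the query string of the request URL. A minimal sketch of roughly what gets sent, using the standard library's `urllib.parse.urlencode` as a stand-in (the notebook itself hands the dict to `requests`, which does its own encoding):

```python
from urllib.parse import urlencode

url = 'https://api.pushshift.io/reddit/search/submission'
params = {'subreddit': 'Xbox', 'size': 100}

# Build the full request URL by hand to see what the API would receive.
full_url = f'{url}?{urlencode(params)}'
print(full_url)
```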

In [3]:
#referred to lesson 5.04 notes and Pushshift tutorial notes

In [4]:
def get_posts(sub, parameters):  #takes 2 arguments - subreddit and parameter dict
    #pushshift url:
    url = 'https://api.pushshift.io/reddit/search/submission'
    #make sure the request uses the subreddit passed into the function:
    parameters = {**parameters, 'subreddit': sub}
    r = requests.get(url, params=parameters)
    #break out if status code is not okay
    if r.status_code != 200:
        return f'Error! Code {r.status_code}.'  #note: a string, not a df, comes back on failure
    #convert the json response to a dataframe
    data = r.json()
    posts = data['data']
    df = pd.DataFrame(posts)
    return df

In [5]:
#test
df = get_posts('Xbox', params)

In [6]:
#df.columns

In [7]:
#df = df[['subreddit', 'id', 'author', 'num_comments', 'selftext', 'title', 'upvote_ratio', 'url']]

In [8]:
df['created_utc'].tail(1) #this is the time stamp from the last post I pulled down

99    1602785049
Name: created_utc, dtype: int64
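
The `created_utc` value is a Unix epoch timestamp in seconds. A small sketch, using only the standard library, of how the value from the output above could be turned into a readable UTC datetime (the variable names are mine):

```python
from datetime import datetime, timezone

last_created_utc = 1602785049  # value shown in the output above

# Interpret the epoch seconds as a timezone-aware UTC datetime.
as_datetime = datetime.fromtimestamp(last_created_utc, tz=timezone.utc)
print(as_datetime.isoformat())  # 2020-10-15T18:04:09+00:00
```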

In [9]:
#https://stackoverflow.com/questions/31614804/how-to-delete-a-column-in-pandas-dataframe-based-on-a-condition

Success! This function pulls down 100 posts for a given subreddit and parameter dict. Now I need to call it in a loop and append each batch to a single df until I have 10,000 posts. I also want to filter out any posts that will not be useful to my model. In this case, I want "is_self" posts, i.e. text posts as opposed to outside links or photos, so on each pass I'd like to drop any posts in my df where ['is_self'] == False. I'd also like to filter out any [removed] posts, which are posts that were deleted after posting. Finally, I need to pass a timestamp with each request to get posts older than the last batch I pulled.
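
As a quick sanity check of that plan, here is a sketch of the two filters and the timestamp lookup on a tiny made-up batch. The rows below are invented for illustration; only the column names (`is_self`, `selftext`, `created_utc`, `id`) come from the data pulled above.

```python
import pandas as pd

# Invented toy batch standing in for one 100-post pull.
batch = pd.DataFrame({
    'id': ['a1', 'a2', 'a3', 'a4'],
    'is_self': [True, False, True, True],
    'selftext': ['Controller drift fix?', '', '[removed]', 'Game Pass question'],
    'created_utc': [1602785400, 1602785300, 1602785200, 1602785049],
})

# Oldest timestamp in the batch, to be used as the 'before' value of the next request.
oldest_seen = int(batch['created_utc'].min())

# Keep text posts only, then drop posts whose body was removed.
keep = batch[batch['is_self'] & (batch['selftext'] != '[removed]')]

print(list(keep['id']), oldest_seen)
```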

### Function 2
#### all_the_posts()

In [10]:
#https://www.geeksforgeeks.org/how-to-create-an-empty-dataframe-and-append-rows-columns-to-it-in-pandas/

In [11]:
def all_the_posts(sub_list): #will pull at least 10k posts from subreddits in list
    for sub in sub_list:
        params = {'subreddit' : sub,       #subreddit in params will be whatever subreddit is entered into function
                    'size' : 100}          #will pull 100 posts at a time for given subreddit
        df_m = pd.DataFrame()              #create an empty master data frame
        count = 0                          #create a counter to count how many loops I make
        while len(df_m) < 10_000:          #loop will run until minimum post goal is met
            df_new = get_posts(sub, params)                    #call in the get_posts function, save to df
            df_new = df_new[df_new['is_self']==True]           #only keep text posts
            df_new = df_new[df_new['selftext']!='[removed]']   #delete any "removed" posts
            df_m = pd.concat([df_m, df_new])                   #add new df to master df
            #print(len(df_m))                                   #print size of master df
            count+=1                                           #advance counter by 1
            print(f'Loop {count} for {sub} complete; {len(df_m)} posts collected') #print status message
            params = {'subreddit' : sub,
                     'size':100,
                     'before': df_m['created_utc'].min()}      #change params to pull posts older than oldest...
            time.sleep(random.randint(4,11))                   #...post in master df + sleep between d/l's
        #loop done, I only need to save relevant columns (selected below)
        df_m = df_m[['subreddit', 'id', 'author', 'num_comments', 'selftext', 'title', 'upvote_ratio', 'url']]
        df_m.to_csv(f'./sub_data/{sub}_10k.csv', index=False)  #after acquiring 10k posts, save to csv
    return 'Data collection complete'        #return completion message when both csv's are saved
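
One thing to watch in the loop above: the `before` value is taken from the filtered master frame, so if an entire batch were filtered out, the cursor would not move and the same request could repeat. The loop also has no exit if a subreddit runs out of posts before the goal is met. Below is a sketch of one way to guard against both, not the code that produced my csv files. It uses plain lists of dicts and a fake fetch function in place of the API; `collect`, `fake_fetch`, `keep_text_post` and `ARCHIVE` are hypothetical names and the data is made up.

```python
# Made-up archive standing in for the API: timestamps 1..50, even ones are text posts.
ARCHIVE = [{'created_utc': t, 'is_self': t % 2 == 0, 'selftext': 'text'}
           for t in range(1, 51)]

def fake_fetch(before, size=10):
    """Return up to `size` posts older than `before`, newest first (None = start at newest)."""
    older = [p for p in ARCHIVE if before is None or p['created_utc'] < before]
    older.sort(key=lambda p: p['created_utc'], reverse=True)
    return older[:size]

def keep_text_post(post):
    """Same two filters as above: text posts only, no removed bodies."""
    return post['is_self'] and post['selftext'] != '[removed]'

def collect(fetch, keep, goal):
    """Page backwards in time until `goal` kept posts are collected or nothing is left."""
    kept = []
    before = None
    while len(kept) < goal:
        batch = fetch(before)
        if not batch:  # nothing older left, so stop instead of looping forever
            break
        kept.extend(p for p in batch if keep(p))
        # Move the cursor using the raw batch, so it advances even if every post was filtered out.
        before = min(p['created_utc'] for p in batch)
    return kept

print(len(collect(fake_fetch, keep_text_post, goal=12)))    # 15
print(len(collect(fake_fetch, keep_text_post, goal=1000)))  # 25
```

In this toy run the first call stops after three batches of 10 (5 kept per batch), and the second call ends once the archive is exhausted rather than hanging.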

In [12]:
#all_the_posts(['Xbox', 'Playstation'])

### Epilogue  
That did it! I was able to collect 10,005 rows of Xbox data in 171 loops and 10,021 rows of Playstation data in 237 loops, automatically filtering out non-text and [removed] posts in the process and keeping only the relevant text and identifying features. I now have my two csv files, Xbox_10k and Playstation_10k, which I can combine and analyze to create a model.