#### Step 1: Data Collection

In [1]:
import requests
import pandas as pd

First, the submission search endpoint of Pushshift's Reddit API is set as the base URL.

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'

Next is the function that does the actual pulling of posts.

First it tries to read in the already saved "posts.csv" file, filters it to the posts from the subreddit passed as a parameter, and takes the UTC timestamp of the last of those rows. If the file cannot be read, a new, blank dataframe is created and the last timestamp is set to None.

The API accepts several query parameters, so they are collected in a dictionary. The subreddit is the function argument, the size is 100 (the maximum the API allows per request), and, if it exists, the last timestamp is passed as 'before' so that only older posts are returned.

The response from the requests module is saved as 'res', and the posts are pulled from its JSON body. The columns kept are the subreddit the post is from, the post's title, the post's text, and the UTC timestamp. These rows (up to 100) are then appended to the bottom of the existing dataframe of posts, and the result is re-saved, overwriting the previous version.

In [3]:
def update_posts(subreddit):
    """
    Pulls up to 100 reddit posts and adds them to the dataset.
    
    Args:
        subreddit: the name of the subreddit to pull posts from
    
    Returns:
        pandas DataFrame: the posts added to the dataset
    """
    columns = ['subreddit', 'title', 'selftext', 'created_utc']
    try:
        data = pd.read_csv('../data/posts.csv')
    except (FileNotFoundError, pd.errors.EmptyDataError):
        # No usable saved file yet, so start from a blank dataset
        data = pd.DataFrame(columns=columns)
    
    # Timestamp of the last saved post from this subreddit, if there is one.
    # This lookup sits outside the try block so that a subreddit with no saved
    # posts yet does not reset the rows saved for other subreddits.
    check = data[data['subreddit'].str.lower() == subreddit.lower()]
    before = int(check['created_utc'].iloc[-1]) if len(check) > 0 else None
    
    params = {'subreddit': subreddit, 'size': 100}
    if before is not None:
        params['before'] = before
    
    res = requests.get(url, params=params)
    res.raise_for_status()  # stop on an HTTP error instead of saving bad data
    update = pd.DataFrame(res.json()['data'])[columns]
    
    data = pd.concat([data, update], ignore_index=True)
    data.to_csv('../data/posts.csv', index=False)
    
    return update
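
To see how the 'before' parameter lets successive calls page backward through a subreddit without touching the network, the sketch below isolates two steps of 'update_posts': building the request parameters and keeping the four dataset columns from a response. The helper names `build_params` and `extract_posts` are hypothetical, and the sample payload is made up; it only imitates the `{'data': [...]}` shape that the function reads.

```python
import pandas as pd

COLUMNS = ['subreddit', 'title', 'selftext', 'created_utc']

def build_params(subreddit, before=None):
    """Request parameters for one pull (hypothetical helper)."""
    params = {'subreddit': subreddit, 'size': 100}
    if before is not None:
        params['before'] = int(before)
    return params

def extract_posts(payload):
    """Keep only the dataset columns from a response payload (hypothetical helper)."""
    # reindex tolerates an empty payload or a missing field, filling gaps with NaN
    return pd.DataFrame(payload['data']).reindex(columns=COLUMNS)

# Made-up payload for illustration only, not real API output
sample = {'data': [
    {'subreddit': 'tea', 'title': 'First flush?', 'selftext': '',
     'created_utc': 1600000200, 'score': 3},
    {'subreddit': 'tea', 'title': 'Gaiwan advice', 'selftext': 'Help',
     'created_utc': 1600000100, 'score': 1},
]}

posts = extract_posts(sample)
# The next pull would ask only for posts older than the last saved row
next_params = build_params('tea', before=posts['created_utc'].iloc[-1])
print(next_params)
```

Because each pull is bounded by the last saved timestamp, repeated calls walk further back in time instead of fetching the same 100 posts again.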

The above 'update_posts' function is called for r/coffee and r/tea together, so that each round requests the same number of posts from both subreddits and the dataset stays balanced.

In [4]:
def run_updates():
    """
    Pulls and adds 100 reddit posts from each of r/coffee and r/tea
    
    Args:
        None
        
    Returns:
        None
    """
    update_posts('coffee')
    update_posts('tea')
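
One way to confirm that the two subreddits stay balanced is to count the rows per subreddit after an update. The sketch below uses a small made-up dataframe in place of the saved file, and the helper name `subreddit_counts` is hypothetical.

```python
import pandas as pd

def subreddit_counts(data):
    """Number of saved posts per subreddit, ignoring capitalization (hypothetical helper)."""
    return data['subreddit'].str.lower().value_counts()

# Made-up rows standing in for pd.read_csv('../data/posts.csv')
data = pd.DataFrame({
    'subreddit': ['Coffee', 'tea', 'Coffee', 'tea'],
    'title': ['a', 'b', 'c', 'd'],
})

counts = subreddit_counts(data)
print(counts)
# The dataset is balanced when every subreddit has the same count
print(counts.nunique() == 1)
```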

The 'run_updates' function is called in a loop until the saved dataset contains at least the requested number of posts.

In [5]:
def update_to(n):
    """
    Continually pulls and adds reddit posts to the dataset
    
    Args:
        n (int): the minimum number of posts the dataset must contain before the loop stops
        
    Returns:
        None
    """
    while len(pd.read_csv('../data/posts.csv')) < n:
        run_updates()
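
As written, the loop assumes that "posts.csv" already exists and that every round adds rows. A more defensive variant, sketched below, stops after a fixed number of rounds and pauses between requests. The names `update_until`, `count_rows`, `run_once`, `max_rounds`, and `pause` are hypothetical, and the stand-in functions replace the real file and network access so that the example runs on its own.

```python
import time

def update_until(n, count_rows, run_once, max_rounds=50, pause=1.0):
    """
    Calls run_once() until count_rows() reaches n or max_rounds is hit.

    Returns the final row count.
    """
    rows = count_rows()
    for _ in range(max_rounds):
        if rows >= n:
            break
        run_once()
        time.sleep(pause)  # wait between rounds to go easy on the API
        rows = count_rows()
    return rows

# Stand-ins for the saved file and for run_updates(), for illustration only
store = []

def fake_count():
    return len(store)

def fake_run():
    store.extend(['post'] * 200)  # imitates 100 posts from each of two subreddits

print(update_until(500, fake_count, fake_run, pause=0))
```

With the real functions, `count_rows` could wrap `pd.read_csv` and return 0 when the file is missing, and `run_once` would be 'run_updates'.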