## Project 3 - Subreddit 
### by Steven Tran

<span style='color: red;'>Add TOC when available</span>

---

# 01: Data Collection

In this notebook, I collect (X) posts from two subreddits. The table below lists the two subreddits and the date range from which user posts were collected:

| Subreddit | # of user posts | FROM | TO |
|-----------|-----------------|------|----|
| [Sub1](#)      |                 |      |    |
| [Sub2](#)      |                 |      |    |


In [6]:
import requests, time, pandas as pd

In [67]:
#access pushshift api to get reddit posts:
base_url = 'https://api.pushshift.io/reddit/search/submission'

def get_posts(subreddit, num_valid_posts = 200):
    #define some boundaries:
    target_valid_posts = num_valid_posts #200 by default
    count_posts_collected = 0
    beginning_date = 1609459200 # Jan 1 2021 00:00:00 UTC; start at this date and go back in time
    final_df = pd.DataFrame()
    times_run = 0
    seconds_waited = 0
    
    
    #check if the subreddit is valid:
    check_params = {
        'subreddit': subreddit,
        'size': 0,
        'metadata': 'true'
    }

    if requests.get(base_url,check_params).json()['metadata']['total_results'] > target_valid_posts:
        #initial parameters:
        params = {
            'subreddit': subreddit,
            'size': 100,
            'before': beginning_date
        }
        
        while count_posts_collected < target_valid_posts:
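            #get posts:
            res = requests.get(base_url, params=params).json()['data']
            if not res:
                #no older posts left, so stop rather than fail on an empty response:
                print(f'Ran out of posts for {subreddit} after collecting {count_posts_collected}.')
                break
            #reindex keeps the five columns even if one is missing from this batch:
            this_data = pd.DataFrame(res).reindex(columns=['id','subreddit','title','selftext','created_utc'])

            #get new time (needs to happen before a bunch of rows get removed):
            new_time = int(this_data['created_utc'].iloc[-1])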
            
            #drop removed and deleted posts:
            this_data = this_data[~this_data['selftext'].isin(['[removed]', '[deleted]'])]
            
            #append new posts:
            if final_df.shape[0] == 0:
                final_df = this_data
            else:
                final_df = pd.concat([final_df,this_data])
           
            #get new size:
            count_posts_collected += this_data.shape[0]
            new_size = min(target_valid_posts - count_posts_collected,100)

            #set new params:
            params = {
                'subreddit': subreddit,
                'size': new_size,
                'before': new_time,                  
            }
            
            #wait a bit before we go again:
            time.sleep(10)
            seconds_waited += 10
            times_run += 1
        
    else:
        print(f'Error, {subreddit} did not have enough posts to be considered.')
    print()
    print(f'Went out to get posts {times_run} times. Waited {seconds_waited} seconds or {seconds_waited/60:.1f} minutes.')
    #reset the index so row labels are unique across the concatenated batches:
    return final_df.reset_index(drop=True)

def get_metadata(subreddit):
    meta_params = {
        'subreddit': subreddit,
        'size': 0,
        'metadata': 'true'
    }
    
    res = requests.get(base_url, meta_params)
    sub_metadata = res.json()['metadata']
    return {'sub_metadata': sub_metadata}
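
The `beginning_date` constant in `get_posts` is a Unix timestamp, so it is easy to mistype. Here is a minimal sketch for double-checking it with the standard library. It uses only the value already hard-coded above, and the variable name `as_utc` is just for illustration.

```python
from datetime import datetime, timezone

# Same value as beginning_date inside get_posts
beginning_date = 1609459200

# Interpret the timestamp explicitly as UTC (not local time)
as_utc = datetime.fromtimestamp(beginning_date, tz=timezone.utc)
print(as_utc.isoformat())  # 2021-01-01T00:00:00+00:00
```

Passing `tz=timezone.utc` matters here: without it, `fromtimestamp` converts to the local timezone of the machine, which could make the boundary look like Dec 31 2020.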

---

Below are a few candidate subreddits to include.

In [71]:
#movies:
movies = (get_posts('movies', 8_000), get_metadata('movies'))


Went out to get posts 111 times. Waited 1110 seconds or 18.5 minutes.
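
To fill in the summary table at the top of this notebook, the post count and date range can be read straight off the collected DataFrame. The following is a rough sketch: `summarize_posts` is a hypothetical helper (not part of the code above), and the example frame is made-up data used only to show the shape of the output.

```python
import pandas as pd

def summarize_posts(df):
    """Return the post count and the UTC date range covered by created_utc."""
    # created_utc holds Unix timestamps in seconds
    created = pd.to_datetime(df['created_utc'], unit='s', utc=True)
    return {
        'n_posts': len(df),
        'from': created.min().strftime('%Y-%m-%d'),
        'to': created.max().strftime('%Y-%m-%d'),
    }

# Made-up timestamps, only to illustrate the output
example = pd.DataFrame({'created_utc': [1609459199, 1609372800, 1609286400]})
summary = summarize_posts(example)
print(summary)
```

Since `movies` is a `(DataFrame, metadata)` tuple, the call for the real data would presumably be `summarize_posts(movies[0])`.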
