## Project 3 - Subreddit 
### by Steven Tran

<span style='color: red;'>Add TOC when available</span>

---

# 01: Data Collection

In this notebook, I collect (X) posts from two subreddits. The table below lists the two subreddits and the date range from which user posts were collected:

| Subreddit | # of user posts | FROM | TO |
|-----------|-----------------|------|----|
| [Sub1](#)      |                 |      |    |
| [Sub2](#)      |                 |      |    |


In [1]:
import requests, time, pandas as pd, numpy as np

In [2]:
#access pushshift api to get reddit posts:
base_url = 'https://api.pushshift.io/reddit/search/submission'

def get_metadata(subreddit):
    meta_params = {
        'subreddit': subreddit,
        'size': 0,
        'metadata': 'true'
    }
    
    res = requests.get(base_url, params=meta_params)
    sub_metadata = res.json()['metadata']
    return {'sub_metadata': sub_metadata}

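def get_posts(subreddit, num_valid_posts = 200):
    """Collect posts with usable selftext from `subreddit`, paging backward in time from Jan 1 2021 UTC."""
    #define some boundaries:
    target_valid_posts = num_valid_posts #200 by default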
    count_posts_collected = 0
    beginning_date = 1609459200 # jan 1 2021 00:00:00 # start at this date and go back in time; UTC
    final_df = pd.DataFrame()
    times_run = 0
    seconds_waited = 0
    

    if get_metadata(subreddit)['sub_metadata']['total_results'] > target_valid_posts:
    #if requests.get(base_url,check_params).json()['metadata']['total_results'] > target_valid_posts:
        #initial parameters:
        params = {
            'subreddit': subreddit,
            'size': min(target_valid_posts,100),
            'before': beginning_date
        }
        
        while count_posts_collected < target_valid_posts:
            try:
                #get posts:
                res = requests.get(base_url, params=params).json()['data']
                this_data = pd.DataFrame(res)[['id','subreddit','author','title','selftext','created_utc']]

                #get new time (needs to happen before a bunch of rows get removed):
                new_time = int(this_data['created_utc'].iloc[-1])

                # drop removed, deleted, and missing selftext in one pass:
                is_invalid = this_data['selftext'].isin(['[removed]','[deleted]']) | this_data['selftext'].isna()
                this_data = this_data[~is_invalid]
                #append new posts assuming this_data is not of length 0:
                if final_df.shape[0] == 0:
                    final_df = this_data
                else:
                    final_df = pd.concat([final_df,this_data])

                #get new size:
                count_posts_collected += this_data.shape[0]
                new_size = min(target_valid_posts - count_posts_collected,100)

                #set new params:
                params = {
                    'subreddit': subreddit,
                    'size': new_size,
                    'before': new_time,                  
                }
                
                #show in console what the current collected count is:
                if count_posts_collected < target_valid_posts:
                    print(f'Collected {count_posts_collected} observations, continuing.', end='\r')
                else:
                    print('==========================================================')
                    print(f'Collected {count_posts_collected} observations.')
                    print('\nSuccess.')
                #wait a bit before we go again:
                time.sleep(10)
                seconds_waited += 10
                times_run += 1
                
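            except Exception as err:
                # stop on any request or parsing failure and return what was collected so far,
                # so callers always get a DataFrame back (res may not exist if the request itself failed)
                print(f'Error ({err!r}), returning data collected so far. Managed to collect {count_posts_collected}')
                final_df.reset_index(drop=True,inplace=True)
                return final_df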
        
    else:
        print(f'Error, {subreddit} did not have enough posts to be considered.')
    print()
    print(f'Went out to get posts {times_run} times. Waited {seconds_waited} seconds or {seconds_waited/60:.1f} minutes.')
    final_df.reset_index(drop=True,inplace=True)
    return final_df
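
The paging logic in `get_posts` is easier to follow without the network calls mixed in. Below is a minimal sketch of the same `before`-cursor idea run against an in-memory list. `FAKE_POSTS`, `fake_fetch`, and `collect_valid` are hypothetical names used only for illustration (they are not part of the Pushshift API), and the sketch assumes each page comes back newest first.

```python
# Hypothetical stand-in for the API: 10 fake posts, newest first,
# with every third post marked as removed.
FAKE_POSTS = [
    {'id': i, 'created_utc': 1_000 - i,
     'selftext': '[removed]' if i % 3 == 0 else f'text {i}'}
    for i in range(10)
]

def fake_fetch(before, size):
    """Return up to `size` posts created strictly before `before`, newest first."""
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    return older[:size]

def collect_valid(target, page_size=4, before=2_000):
    """Page backward in time until `target` valid posts are collected or data runs out."""
    collected = []
    while len(collected) < target:
        page = fake_fetch(before, min(target - len(collected), page_size))
        if not page:
            break  # no older posts left
        # move the cursor using the unfiltered page, so removed posts still advance it
        before = page[-1]['created_utc']
        collected.extend(
            p for p in page if p['selftext'] not in ('[removed]', '[deleted]')
        )
    return collected

valid = collect_valid(target=5)
print([p['id'] for p in valid])  # prints [1, 2, 4, 5, 7]
```

The cursor is taken from the raw page before filtering. If it were taken from the filtered rows and a whole page happened to be removed, the loop would request the same page again.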



---

## Go get the data

I'd caution against running this whole script... on my machine, each subreddit took roughly 14 to 36 minutes of waiting to get the target 8k posts. YMMV.
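
For a rough sense of the best case, the wait time can be estimated from the two constants in `get_posts`: at most 100 posts per request and a 10-second sleep after each one. This is a minimal sketch; `min_wait_seconds` is a hypothetical helper, and it assumes every returned post is valid, which is never quite true in practice.

```python
import math

def min_wait_seconds(target_posts, page_size=100, sleep_seconds=10):
    """Lower bound on total sleep time if every returned post were valid."""
    requests_needed = math.ceil(target_posts / page_size)
    return requests_needed * sleep_seconds

best_case = min_wait_seconds(8_000)
print(f'{best_case} seconds, or {best_case / 60:.1f} minutes')  # prints 800 seconds, or 13.3 minutes
```

Removed, deleted, and empty posts are dropped after each request, so real runs need more than the minimum 80 requests and take longer than this bound.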

In [3]:
#personal finance:
pf = get_posts('personalfinance', 8_000)
pf.to_csv('../data/posts__personal_finance.csv', index=False)

Collected 8000 observations.

Success.

Went out to get posts 103 times. Waited 1030 seconds or 17.2 minutes.


In [4]:
del pf # for memory release

---

In [5]:
#financial planning:
fp = get_posts('FinancialPlanning', 8_000)
fp.to_csv('../data/posts__financial_planning.csv',index=False)

Collected 8000 observations.

Success.

Went out to get posts 83 times. Waited 830 seconds or 13.8 minutes.


In [6]:
del fp # for memory release

---

In [7]:
#unpopular opinion
uo = get_posts('unpopularopinion',8_000)
uo.to_csv('../data/posts__unpopular_opinion.csv', index=False)

Collected 8000 observations.

Success.

Went out to get posts 215 times. Waited 2150 seconds or 35.8 minutes.


In [8]:
del uo #for memory release

---

In [9]:
#offmychest:
omc = get_posts('offmychest',8_000)
omc.to_csv('../data/posts__offmychest.csv',index=False)

Collected 8000 observations.

Success.

Went out to get posts 119 times. Waited 1190 seconds or 19.8 minutes.


In [10]:
del omc #for memory release

---

In [11]:
#dad jokes:
dj = get_posts('dadjokes', 8_000)
dj.to_csv('../data/posts__dad_jokes.csv',index=False)

Collected 8000 observations.

Success.

Went out to get posts 104 times. Waited 1040 seconds or 17.3 minutes.


In [12]:
del dj #for memory release

---

In [13]:
#jokes:
jokes = get_posts('jokes',8_000)
jokes.to_csv('../data/posts__jokes.csv',index=False)

Collected 8000 observations.

Success.

Went out to get posts 101 times. Waited 1010 seconds or 16.8 minutes.


In [14]:
del jokes #for memory release
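
The table at the top of this notebook still needs a post count and date range for each subreddit. Below is a minimal sketch of how those values could be derived from a collected DataFrame, assuming `created_utc` holds epoch seconds. `summarize_posts` is a hypothetical helper, and the three-row frame is made-up sample data, not collected posts.

```python
import pandas as pd

def summarize_posts(df):
    """Return the post count and UTC date range for one subreddit's posts."""
    created = pd.to_datetime(df['created_utc'], unit='s', utc=True)
    return {
        'subreddit': df['subreddit'].iloc[0],
        'n_posts': len(df),
        'from': created.min().strftime('%Y-%m-%d'),
        'to': created.max().strftime('%Y-%m-%d'),
    }

# made-up sample rows for illustration only
sample = pd.DataFrame({
    'subreddit': ['jokes', 'jokes', 'jokes'],
    'created_utc': [1609459199, 1609372800, 1609286400],
})
print(summarize_posts(sample))
```

The same function could be pointed at a saved file, for example `summarize_posts(pd.read_csv('../data/posts__jokes.csv'))`.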

---