## Phase 1 - Problem Definition
### 1.1 Official Goal(s):
For project 3, your goal is two-fold:
1. Using [Pushshift's](https://github.com/pushshift/api) API, you'll collect posts from two subreddits of your choosing.
2. You'll then use NLP to train a classifier on which subreddit a given post came from. This is a binary classification problem.



### 1.2 Subreddit Selection

Reddit.com has a 'best of' feature, both for Reddit as a whole and for specific subreddits. How a post or comment gets selected as a 'best of' is a fascinating rabbit hole to dig into. See the guest post by Randall Munroe of XKCD fame on the subject here:
    https://redditblog.com/2009/10/15/reddits-new-comment-sorting-system/
    or take a look at the reddit post asking the same question here:
    https://www.reddit.com/r/NoStupidQuestions/comments/6cmz29/how_does_reddit_determine_the_best_ranking_in_a/

TL;DR: it uses a statistical algorithm that tracks activity, number of comments, and number of upvotes to determine which comments and posts get the most engagement, and flags them for a redditor's review.
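
For the statistically curious: the linked blog post describes the 'best' comment sort as ranking by the lower bound of a Wilson score confidence interval on the fraction of upvotes. Here is a minimal sketch of that formula. The function name and the choice of `z = 1.96` (a 95% interval) are my own assumptions for illustration, and this is not reddit's actual code.

```python
from math import sqrt


def wilson_lower_bound(upvotes, downvotes, z=1.96):
    """Lower bound of the Wilson score interval for the upvote fraction.

    z=1.96 corresponds to a 95% interval; that value is an assumption here.
    """
    n = upvotes + downvotes
    if n == 0:
        return 0.0
    p = upvotes / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)


# A single upvote scores lower than 100 up / 10 down, because one vote
# is much weaker evidence than 110 votes.
print(wilson_lower_bound(1, 0))
print(wilson_lower_bound(100, 10))
```

The point of using the lower bound is that a comment with very few votes is not ranked above one with a long, mostly positive track record.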

The question I wanted to examine is whether or not the titles can be parsed to determine whether they come from the original subreddit or from the 'best of' subreddit *in the same category.*
    
To do this I'm looking at the subreddits of:

    1. r/legaladvice
    2. r/bestoflegaladvice

In short: can we build a model that will predict whether a post is from the legal advice subreddit or the curated best of legal advice subreddit?

### 1.3 Problem Statement

How well can we train a classification model to correctly classify the title of a post as belonging to the r/legaladvice subreddit or the r/bestoflegaladvice subreddit?

Stretch question: Can we predict which r/legaladvice posts are most likely to be added to r/bestoflegaladvice?

## Phase 2 - Data Gathering


### 2.0 imports

In [1]:
import requests
import pandas as pd
import numpy as np
import time
from datetime import datetime

### 2.1 define function(s) to gather posts from reddit using pushshift API

In [2]:
def get_posts(subreddit, n):
    """Collect up to n submissions from a subreddit via the Pushshift API."""
    url = 'https://api.pushshift.io/reddit/search/submission'
    if n <= 100:
        # a single request is enough
        params = {
            'subreddit': subreddit,
            'size': n
        }
        res = requests.get(url, params=params)
        res.raise_for_status()
        posts = res.json()['data']
    else:
        # note: Pushshift.io has a hard limit of 100 posts returned per API hit,
        # so request 100 at a time and loop until n posts are collected

        # midnight today in epoch date time format
        today = datetime.now()
        now = today.replace(hour=0, minute=0, second=0, microsecond=0)
        epoch = int(now.timestamp())

        params = {
            'subreddit': subreddit,
            'size': 100,  # pull 100 posts at a time
            'before': epoch  # start from midnight today
        }
        posts = []
        # until I have as many posts as called for
        while len(posts) < n:
            res = requests.get(url, params=params)
            res.raise_for_status()
            batch = res.json()['data']
            # stop if the API has no older posts to return
            if not batch:
                break
            posts.extend(batch)
            # the last post in the batch is the oldest (assuming the default
            # newest-first ordering); a short batch no longer raises an IndexError
            oldest = batch[-1]['created_utc']
            print(oldest)
            print(len(posts))
            # ask for posts older than the oldest one seen so far
            params['before'] = oldest
            # pause for 5 seconds so we're not hitting the API too fast
            time.sleep(5)

    # trim any overshoot from the final batch
    return pd.DataFrame(posts[:n])
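
The paging logic in `get_posts` (request a batch, then move the `before` cursor to the oldest `created_utc` seen) can be tried offline without touching the API. Below is a minimal sketch under stated assumptions: `collect_before`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names invented for this illustration, and the fake data simply stands in for Pushshift responses sorted newest first.

```python
def collect_before(fetch_page, n, start_before, page_size=100):
    """Page backwards in time until n items are collected or data runs out."""
    items = []
    before = start_before
    while len(items) < n:
        # never ask for more than is still needed
        batch = fetch_page(before=before, size=min(page_size, n - len(items)))
        if not batch:
            break  # nothing older is available
        items.extend(batch)
        # move the cursor to the oldest item seen so far
        before = batch[-1]['created_utc']
    return items


# Hypothetical stand-in for the API: 250 fake posts, newest first.
FAKE_POSTS = [{'id': i, 'created_utc': 1_000_000 - i} for i in range(250)]


def fake_fetch(before, size):
    """Return up to `size` fake posts strictly older than `before`."""
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    return older[:size]


posts = collect_before(fake_fetch, n=120, start_before=1_000_001)
print(len(posts), posts[0]['id'], posts[-1]['id'])
```

Passing the fetch function in as an argument keeps the cursor logic separate from the network call, which makes edge cases like a short final batch easy to check.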

### 2.2 gather 5_000 posts and titles from each of 2 subreddits
- r/bestoflegaladvice
- r/legaladvice

NOTE: I started with 1,000 of each, planning to revisit if it looked like we needed more data.
NOTE2: Increased the call to 5,000 of each.

In [3]:
bola_df = get_posts('bestoflegaladvice', 5_000)
print('r/bestoflegaladvice complete')

1594010486
100
1592482797
200
1591304165
300
1590281738
400
1589406545
500
1588262292
600
1587156358
700
1586049803
800
1585659170
900
1584522247
1000
1583448396
1100
1582724865
1200
1581809060
1300
1581054492
1400
1580286038
1500
1579365255
1600
1578518076
1700
1577854171
1800
1577217411
1900
1576568363
2000
1575812212
2100
1574901864
2200
1574123219
2300
1573266252
2400
1572363297
2500
1571211780
2600
1570404554
2700
1569526292
2800
1568723513
2900
1567706207
3000
1566789161
3100
1565849695
3200
1565009333
3300
1564114306
3400
1563375873
3500
1562637320
3600
1561705837
3700
1561130359
3800
1560561386
3900
1559911840
4000
1559089487
4100
1558233213
4200
1557496183
4300
1556931968
4400
1556280887
4500
1555523245
4600
1554854692
4700
1554379003
4800
1553864374
4900
1553255503
5000
r/bestoflegaladvice complete


In [4]:
la_df = get_posts('legaladvice', 5_000)
print('r/legaladvice complete')

1595092520
100
1595077135
200
1595049873
300
1595038261
400
1595028793
500
1595019983
600
1595012757
700
1595006539
800
1594997823
900
1594976108
1000
1594959485
1100
1594951098
1200
1594942471
1300
1594934997
1400
1594928398
1500
1594921790
1600
1594913881
1700
1594900515
1800
1594875256
1900
1594866981
2000
1594859666
2100
1594852141
2200
1594844690
2300
1594837682
2400
1594830874
2500
1594823477
2600
1594800572
2700
1594784847
2800
1594777108
2900
1594769995
3000
1594763195
3100
1594756908
3200
1594751259
3300
1594743020
3400
1594732328
3500
1594707774
3600
1594695114
3700
1594688751
3800
1594681713
3900
1594675089
4000
1594668568
4100
1594661848
4200
1594655941
4300
1594643829
4400
1594619664
4500
1594607468
4600
1594597545
4700
1594588136
4800
1594578294
4900
1594567373
5000
r/legaladvice complete
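
The epoch values printed during each pull are easier to read as dates. The sketch below converts the first and last printed value for each subreddit (the oldest post of the first and last batch) and measures the gap between them. The names `pulls`, `spans`, and `to_utc_date` are my own for this illustration.

```python
from datetime import datetime, timezone

# First and last 'created_utc' values printed by each pull above
# (the oldest post in the first batch and in the last batch).
pulls = {
    'bestoflegaladvice': (1594010486, 1553255503),
    'legaladvice': (1595092520, 1594567373),
}


def to_utc_date(epoch):
    """Convert an epoch timestamp to a UTC calendar date."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).date()


spans = {}
for name, (newest, oldest) in pulls.items():
    # 86,400 seconds per day
    spans[name] = (newest - oldest) / 86_400
    print(f"r/{name}: {to_utc_date(oldest)} to {to_utc_date(newest)} "
          f"({spans[name]:.1f} days)")
```

The r/bestoflegaladvice batches reach back more than a year, while the r/legaladvice batches cover only about six days. The two classes are therefore sampled from very different time windows, which may be worth keeping in mind when interpreting the model.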


In [5]:
bola_df[['id', 'title']].head(5)

| | id | title |
|---|---|---|
| 0 | htlkyk | STOP!!! READ PLEASE!! 💯 Help officers save liv... |
| 1 | htfxe1 | LAOP's friend is picked up secret-police style... |
| 2 | ht9n1z | LAOP is being harassed by someone claiming to ... |
| 3 | ht6jc5 | LAUKOP is on a maverick weed vigilante quest |
| 4 | ht41gd | I dont understand my mom sold me and two of my... |


Interestingly, the r/bestoflegaladvice data comes back with 'selftext' listed as 'deleted'. That's okay for this iteration, since we're only using titles to predict the class.

This might be something to revisit later.

In [6]:
la_df[['id', 'title']].head(5)

| | id | title |
|---|---|---|
| 0 | htnf9d | I just bought a car from a private party and c... |
| 1 | htner0 | My mom is still collecting child support for m... |
| 2 | htneb3 | Certification Lost in the Mail |
| 3 | htncsy | Graffiti w/ Washable Chalk Spray Paint? |
| 4 | htnc9d | My soon-to-be fiancé's ex raped and impregnate... |


In [7]:
bola_df[['title', 'subreddit']].head(5)

| | title | subreddit |
|---|---|---|
| 0 | STOP!!! READ PLEASE!! 💯 Help officers save liv... | bestoflegaladvice |
| 1 | LAOP's friend is picked up secret-police style... | bestoflegaladvice |
| 2 | LAOP is being harassed by someone claiming to ... | bestoflegaladvice |
| 3 | LAUKOP is on a maverick weed vigilante quest | bestoflegaladvice |
| 4 | I dont understand my mom sold me and two of my... | bestoflegaladvice |


I'm going to save my bulk file first so I don't have to hit the Pushshift API again until I decide I want more data.

In [8]:
bulk_df = pd.concat([bola_df, la_df], ignore_index = True)

In [9]:
bulk_df.to_csv('./data/bulkredditdata.csv', index=False)
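
One way to make the "don't hit the API again" idea automatic is a small load-or-fetch helper: read the CSV if it already exists, otherwise run the pull and save the result. This is only a sketch, and `load_or_fetch` is a hypothetical helper of my own, not something used elsewhere in this notebook.

```python
from pathlib import Path

import pandas as pd


def load_or_fetch(path, fetch):
    """Return the cached CSV at `path` if present; otherwise call fetch() and cache it."""
    path = Path(path)
    if path.exists():
        return pd.read_csv(path)
    df = fetch()
    # create the target folder (e.g. ./data) if it is missing
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return df


# Intended use in this notebook (not executed here, since it calls the API):
# bulk_df = load_or_fetch(
#     './data/bulkredditdata.csv',
#     lambda: pd.concat([get_posts('bestoflegaladvice', 5_000),
#                        get_posts('legaladvice', 5_000)], ignore_index=True),
# )
```

Creating the parent folder first also avoids the error `to_csv` raises when `./data` does not exist yet.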

For the moment, this is what I'm looking for: the two dataframes are compatible and contain the data I need. Time to filter and save a clean .csv.

### 2.3 Data Cleaning

In this particular case, I have no NaNs, but I have a lot more columns than I am currently examining.

For this version of the project, all I am looking at is each post's title and which subreddit it belongs to. I'll slice the data down to those two columns and save it as clean_reddit_data.csv.

In [10]:
clean_bola_df = bola_df[['title', 'subreddit']].copy()
clean_la_df = la_df[['title', 'subreddit']].copy()

In [11]:
clean_df = pd.concat([clean_bola_df, clean_la_df], ignore_index = True)
clean_df.shape

(10000, 2)
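
Before saving, a few quick checks can confirm the "no NaNs" observation and show the class balance. The sketch below runs on a toy frame with invented titles (`toy_df` is a stand-in; on the real data the same calls would be made on `clean_df`).

```python
import pandas as pd

# Toy stand-in for clean_df; the titles are made up for illustration.
toy_df = pd.DataFrame({
    'title': ['Landlord kept my deposit', 'LAOP vs. the HOA',
              'Landlord kept my deposit', None],
    'subreddit': ['legaladvice', 'bestoflegaladvice',
                  'legaladvice', 'legaladvice'],
})

# missing values per column
missing_per_column = toy_df.isna().sum()
# rows that exactly repeat an earlier row
duplicate_rows = int(toy_df.duplicated().sum())
# share of posts from each subreddit
class_share = toy_df['subreddit'].value_counts(normalize=True)

print(missing_per_column)
print(duplicate_rows)
print(class_share)
```

The largest class share is also the accuracy a model would get by always guessing that class. With 5,000 posts from each subreddit, that baseline is 50%.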

### 2.4 save data to CSV for EDA

In [12]:
clean_df.to_csv('./data/clean_reddit_data.csv', index = False)

All set! Moving on to EDA and modeling.