# Project 3: Reddit Predictor

## DEFINE THE PROBLEM:

Reddit's servers went down, and in the process subreddit posts got mixed up, so we no longer know which post came from which subreddit. A crack team of developers was able to fix the website and get it up and running again, organized and ready to go. They recruited a team of data scientists to help prevent this disorganization from ever happening again. Their request: based on a post and its references, can we predict which subreddit it comes from?

## GATHER THE DATA:

### Code to gather data

In [2]:
# Load data collection libraries
import requests
import time
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression as lr
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer 
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

In [3]:
# Function to pull subreddit posts
def get_subreddit(url, n_pulls, headers):    

    # Create empty templates
    posts = []
    after = None

    # Loop over the pulls; each request returns up to 25 posts by default
    for pull_num in range(n_pulls):
        print("Pulling data attempted", pull_num+1,"time(s)")

        if after is None:
            new_url = url                 # base case
        else:
            new_url = url+"?after="+after # subsequent iterations

        res = requests.get(new_url, headers=headers)

        if res.status_code == 200:
            subreddit_json = res.json()                      # Pull JSON
            posts.extend(subreddit_json['data']['children']) # Get subreddit posts
            after = subreddit_json['data']['after']          # 'after' = ID of the last post in this iteration
        else:
            print("We've run into an error. The status code is:", res.status_code)
            break

        time.sleep(1)
        
    return posts
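
One thing to watch in the loop above: if Reddit returns `after` as null once a listing is exhausted, the next iteration falls back to the base URL and pulls the first page again, so `posts` may contain repeats. A minimal sketch of dropping them follows, assuming each post carries a unique `name` field under `data`. The helper `drop_duplicate_posts` and the toy posts are hypothetical and not part of the original notebook.

```python
def drop_duplicate_posts(posts):
    """Keep the first occurrence of each post, keyed on its 'name' field (assumed unique)."""
    seen = set()
    unique = []
    for post in posts:
        key = post["data"]["name"]
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique


# Toy input shaped like subreddit_json['data']['children'] (made-up values)
sample_posts = [
    {"data": {"name": "t3_aaa", "title": "First"}},
    {"data": {"name": "t3_bbb", "title": "Second"}},
    {"data": {"name": "t3_aaa", "title": "First"}},  # repeat of the first post
]

unique_posts = drop_duplicate_posts(sample_posts)
print(len(unique_posts), "unique posts out of", len(sample_posts))
```

In the notebook this could be applied to the list returned by `get_subreddit` before building the DataFrame.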

### Load in first subreddit

In [4]:
# Call function

# Define URL and username
url_name = "https://www.reddit.com/r/politics.json"
username = {"User-agent": 'boom-deva'}      # header to prevent 429 error

data = get_subreddit(url_name, n_pulls = 43, headers = username)

Pulling data attempted 1 time(s)
Pulling data attempted 2 time(s)
Pulling data attempted 3 time(s)
Pulling data attempted 4 time(s)
Pulling data attempted 5 time(s)
Pulling data attempted 6 time(s)
Pulling data attempted 7 time(s)
Pulling data attempted 8 time(s)
Pulling data attempted 9 time(s)
Pulling data attempted 10 time(s)
Pulling data attempted 11 time(s)
Pulling data attempted 12 time(s)
Pulling data attempted 13 time(s)
Pulling data attempted 14 time(s)
Pulling data attempted 15 time(s)
Pulling data attempted 16 time(s)
Pulling data attempted 17 time(s)
Pulling data attempted 18 time(s)
Pulling data attempted 19 time(s)
Pulling data attempted 20 time(s)
Pulling data attempted 21 time(s)
Pulling data attempted 22 time(s)
Pulling data attempted 23 time(s)
Pulling data attempted 24 time(s)
Pulling data attempted 25 time(s)
Pulling data attempted 26 time(s)
Pulling data attempted 27 time(s)
Pulling data attempted 28 time(s)
Pulling data attempted 29 time(s)
Pulling data attempted 

```python
data[0]['data']
```
`data` is a list, but each element behaves like a dictionary, so we need a loop to pull each post out of it. The line above shows how to reach a single post. After looping, put the results into a DataFrame, do the same for the second subreddit, and then join the two. Also keep track of how the subreddits will be binarized: one should receive 1 and the other 0. Create that new column after building the DataFrame for each subreddit.
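
The join and binarize steps described above can be sketched as follows. This is only an illustration with made-up titles; the names `politics`, `stocks`, `combined`, and the column `is_politics` are assumptions rather than anything defined in the notebook, and which subreddit gets the 1 is an arbitrary choice.

```python
import pandas as pd

# Toy stand-ins for the two per-subreddit DataFrames (made-up titles)
politics = pd.DataFrame(
    {"Text": ["senate passes bill", "new poll released"], "subreddit": "politics"}
)
stocks = pd.DataFrame(
    {"Text": ["$750,000 to invest", "rate my portfolio"], "subreddit": "stocks"}
)

# Stack the two frames and renumber the rows from 0
combined = pd.concat([politics, stocks], ignore_index=True)

# Binarize the target: 1 for politics, 0 for stocks
combined["is_politics"] = (combined["subreddit"] == "politics").astype(int)

print(combined)
```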

In [5]:
subreddit_1 = {i:data[i]["data"]["title"].lower() for i in range(len(data))}
subreddit_1 = pd.DataFrame.from_dict(subreddit_1, orient = 'index', columns = ['Text'])
subreddit_1['subreddit'] = data[0]['data']['subreddit']
subreddit_1.head()

| | Text | subreddit |
|---|---|---|
| 0 | i’m alan s. inouye, a registered lobbyist, and... | politics |
| 1 | subpoena for mueller report and documents appr... | politics |
| 2 | sen. elizabeth warren will unveil a bill to ma... | politics |
| 3 | mueller testimony before congress ‘inevitable,... | politics |
| 4 | american meritocracy is a myth - recent scanda... | politics |


In [6]:
subreddit_1.to_csv("Subreddit_1_Politics")

### Load Second subreddit

In [7]:
# Call function

# Define URL and username
url_name = "https://www.reddit.com/r/stocks.json"
username = {"User-agent": 'boom-deva'}      # header to prevent 429 error

data = get_subreddit(url_name, n_pulls = 43, headers = username)

Pulling data attempted 1 time(s)
Pulling data attempted 2 time(s)
Pulling data attempted 3 time(s)
Pulling data attempted 4 time(s)
Pulling data attempted 5 time(s)
Pulling data attempted 6 time(s)
Pulling data attempted 7 time(s)
Pulling data attempted 8 time(s)
Pulling data attempted 9 time(s)
Pulling data attempted 10 time(s)
Pulling data attempted 11 time(s)
Pulling data attempted 12 time(s)
Pulling data attempted 13 time(s)
Pulling data attempted 14 time(s)
Pulling data attempted 15 time(s)
Pulling data attempted 16 time(s)
Pulling data attempted 17 time(s)
Pulling data attempted 18 time(s)
Pulling data attempted 19 time(s)
Pulling data attempted 20 time(s)
Pulling data attempted 21 time(s)
Pulling data attempted 22 time(s)
Pulling data attempted 23 time(s)
Pulling data attempted 24 time(s)
Pulling data attempted 25 time(s)
Pulling data attempted 26 time(s)
Pulling data attempted 27 time(s)
Pulling data attempted 28 time(s)
Pulling data attempted 29 time(s)
Pulling data attempted 

In [8]:
# To clean the data, put it into a corpus: a DataFrame with one document (post title) per row in the Text column
subreddit_2 = {i:data[i]["data"]["title"].lower() for i in range(len(data))}
subreddit_2 = pd.DataFrame.from_dict(subreddit_2, orient = 'index', columns = ['Text'])
subreddit_2['subreddit'] = data[0]['data']['subreddit']
subreddit_2.head()

| | Text | subreddit |
|---|---|---|
| 0 | rate my portfolio - r/stocks quarterly thread ... | stocks |
| 1 | r/stocks daily discussion wednesday - apr 03, ... | stocks |
| 2 | $750,000 to invest | stocks |
| 3 | what i learned this winter... | stocks |
| 4 | amazon's giant 'dystopian' delivery-drone blim... | stocks |


In [9]:
subreddit_2.to_csv("Subreddit_2_Stocks")
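
Because `to_csv` writes the row index as an unnamed first column by default, reading these files back later will produce an extra `Unnamed: 0` column unless the index is handled explicitly. A small sketch of the round trip follows, using an in-memory buffer and made-up rows instead of the real files; the names `frame`, `buffer`, and `restored` are illustrative only.

```python
import io

import pandas as pd

# Toy frame standing in for subreddit_2 (made-up rows)
frame = pd.DataFrame(
    {"Text": ["rate my portfolio", "$750,000 to invest"], "subreddit": "stocks"}
)

# Write to an in-memory buffer; the index goes out as an unnamed first column
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)

# index_col=0 tells pandas to treat that first column as the index again
restored = pd.read_csv(buffer, index_col=0)
print(restored.columns.tolist())
```

An alternative is to pass `index=False` to `to_csv` so the index is never written in the first place.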