# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills. Collecting data via an API request and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

### Import necessary files

In [249]:
#Importing almost every file we have learned thus far, if used in the project or not.
import requests
import json
import pandas as pd
import re 
from nltk.corpus import stopwords
import nltk
import nltk
nltk.download('wordnet')
from sklearn.feature_extraction import stop_words
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction import text
from sklearn.metrics import roc_auc_score, roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.tree import export_graphviz
from sklearn.linear_model import LogisticRegression

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn import tree
from sklearn.metrics import classification_report, confusion_matrix

[nltk_data] Downloading package wordnet to /Users/premrao/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Creating a Function to Scrape data from the reddit website

In [250]:
import time

In [251]:
def redditscraper(url, number_of_posts):

    all_posts =[]
    p = {}
        #using a for loop to iterate through the number of posts
    for _ in range(number_of_posts): 
        

        # Get the posts by hitting the url, put it in json and store it
        res = requests.get(url,params=p, headers={'User-agent': 'premrao'})
        data = res.json()
        res.raise_for_status()
        list_of_posts = data['data']['children']
        all_posts = all_posts + list_of_posts

        # reassign the after to the current 'after', and then update the url to hit
        after = data['data']['after']
        if after == None:
            print("Print 'after' recieved: " , after)
            break
        else:
            p.update({'after':after})
            print("Print 'after' recieved: ", after)
            
        # go to sleep for half a second so you do not overwhelm reddit servers or cause an alert on their system
      
        time.sleep(0.5)
        
    return all_posts
        
        

In [252]:
#Scraping from the nba subreddit
#as it is in batches of 25, I made the number of posts 40 or over
nba_subreddit=redditscraper(url='http://www.reddit.com/r/nba.json',number_of_posts = 50)

Print 'after' recieved:  t3_9exrkc
Print 'after' recieved:  t3_9eymsc
Print 'after' recieved:  t3_9exdyc
Print 'after' recieved:  t3_9eo4by
Print 'after' recieved:  t3_9ewhv9
Print 'after' recieved:  t3_9ep9ym
Print 'after' recieved:  t3_9epqq4
Print 'after' recieved:  t3_9engg5
Print 'after' recieved:  t3_9ej6s7
Print 'after' recieved:  t3_9ehgt6
Print 'after' recieved:  t3_9ej5d9
Print 'after' recieved:  t3_9ehs0k
Print 'after' recieved:  t3_9e4245
Print 'after' recieved:  t3_9eair6
Print 'after' recieved:  t3_9e5hty
Print 'after' recieved:  t3_9e27gg
Print 'after' recieved:  t3_9dwokl
Print 'after' recieved:  t3_9dyxda
Print 'after' recieved:  t3_9e2bkb
Print 'after' recieved:  t3_9dxft7
Print 'after' recieved:  t3_9dxbnd
Print 'after' recieved:  t3_9dykuq
Print 'after' recieved:  t3_9dnt5r
Print 'after' recieved:  t3_9dw440
Print 'after' recieved:  t3_9dzsfi
Print 'after' recieved:  t3_9dnct8
Print 'after' recieved:  t3_9dl7ky
Print 'after' recieved:  t3_9dpq1h
Print 'after' reciev

In [253]:
#scraping from the wallstreetbets subreddit

wallstreetbets_subreddit=redditscraper(url='http://www.reddit.com/r/wallstreetbets.json',number_of_posts = 50)

Print 'after' recieved:  t3_9ex64d
Print 'after' recieved:  t3_9ezob5
Print 'after' recieved:  t3_9eqkpu
Print 'after' recieved:  t3_9eokf0
Print 'after' recieved:  t3_9est22
Print 'after' recieved:  t3_9egnum
Print 'after' recieved:  t3_9eudxm
Print 'after' recieved:  t3_9edyon
Print 'after' recieved:  t3_9e6x41
Print 'after' recieved:  t3_9dzaf9
Print 'after' recieved:  t3_9dspwi
Print 'after' recieved:  t3_9dxqi9
Print 'after' recieved:  t3_9e2qfu
Print 'after' recieved:  t3_9dz2tt
Print 'after' recieved:  t3_9e1tff
Print 'after' recieved:  t3_9dyqpg
Print 'after' recieved:  t3_9dpiq4
Print 'after' recieved:  t3_9dwpl1
Print 'after' recieved:  t3_9dnfet
Print 'after' recieved:  t3_9djjcb
Print 'after' recieved:  t3_9dmpsx
Print 'after' recieved:  t3_9di8hj
Print 'after' recieved:  t3_9dkeme
Print 'after' recieved:  t3_9dm2mp
Print 'after' recieved:  t3_9dmxlu
Print 'after' recieved:  t3_9dj0r1
Print 'after' recieved:  t3_9di3z0
Print 'after' recieved:  t3_9dm8oj
Print 'after' reciev

## Exploratory Data Analysis (EDA)

In [254]:
nba_subreddit[0]['data'].keys()

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'link_flair_text', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'parent_whitelist_status', 'hide_score', 'name', 'quarantine', 'link_flair_text_color', 'author_flair_background_color', 'subreddit_type', 'ups', 'domain', 'media_embed', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'saved', 'can_mod_post', 'score', 'approved_by', 'thumbnail', 'edited', 'author_flair_css_class', 'author_flair_richtext', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'banned_by', 'author_flair_type', 'contest_mode', 'selftext_html', 'likes', 'suggested_sort', 'banned_at_utc', 'view_count', 'archived', 'no_follow', 'is_crosspostable', 'pinned', 'over_18', 'media_only', 'link_flair_template_

In [255]:
wallstreetbets_subreddit[0]['data'].keys()

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'thumbnail_height', 'hide_score', 'name', 'quarantine', 'link_flair_text_color', 'author_flair_background_color', 'subreddit_type', 'ups', 'domain', 'media_embed', 'thumbnail_width', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score', 'approved_by', 'thumbnail', 'edited', 'author_flair_css_class', 'author_flair_richtext', 'post_hint', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'banned_by', 'author_flair_type', 'contest_mode', 'selftext_html', 'likes', 'suggested_sort', 'banned_at_utc', 'view_count', 'archived', 'no_follow', 'is_crosspostable', 'pinned', 'over_18', 'preview',

In [256]:
#Converting the data into a dataframe using list comprehension
nba = pd.DataFrame(x['data'] for x in nba_subreddit)

In [285]:
nba.head(10)

Unnamed: 0,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,...,suggested_sort,thumbnail,title,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,,,False,AutoModerator,,,,[],,,...,new,,Daily Locker Room and Free Talk + Game Threads...,16,https://www.reddit.com/r/nba/comments/9exfie/d...,[],,False,all_ads,6
1,,,False,Chancelor_West,,,Celtics4,[],1b704ed0-362b-11e8-86da-0e69a4981634,Jabari Birdman,...,qa,,[Announcement] Julie Phayer will be joining us...,93,https://www.reddit.com/r/nba/comments/9eqyxp/a...,[],,False,all_ads,6
2,,,False,flimsyfresh,,,Lakers2,[],3c4644d4-362b-11e8-b8e6-0e22df847c30,Lakers,...,,,My Uber driver had NBA Jam hooked up for passe...,14181,https://np.imgur.com/A1aWu27.jpg,[],,False,all_ads,6
3,,,False,Ih8reposts,,,76ers3,[],,[PHI] Tiago Splitter,...,,,Fun Fact: Andre Drummond has the most Offensiv...,681,https://www.reddit.com/r/nba/comments/9ey2vf/f...,[],,False,all_ads,6
4,,,False,Mind_Fcuk,,,Bullets,[],,Bullets,...,,,Kyrie Irving is enrolled in a program at Harva...,252,https://www.boston.com/sports/boston-celtics/2...,[],,False,all_ads,6
5,,,False,KSmooove,,,Lakers1,[],,Lakers,...,,,Gilbert Arenas puts up $100k for shootout vs N...,155,https://www.youtube.com/watch?v=fZf0Gx4-5tU,[],,False,all_ads,6
6,,,False,gnullify,,,Kings3,[],deafde6e-3feb-11e8-89dc-0e993ebc6d5c,Kings,...,,,Jaylen Brown hooping in T-Mac's INSANE home gym,464,https://youtu.be/1wViFt76cmk,[],,False,all_ads,6
7,,,False,scooper1030,,,Suns1,[],,Suns,...,,,[OC] Making the case for TJ Warren as the NBA'...,176,https://www.reddit.com/r/nba/comments/9eyxpi/o...,[],,False,all_ads,6
8,,,False,vvvdvvv,,,Spurs2,[],,Spurs,...,,,What's the jersey you bought that you regrette...,240,https://www.reddit.com/r/nba/comments/9ex7i9/w...,[],,False,all_ads,6
9,,,False,Mebegilley,,,Pacers2,[],,Pacers,...,,,"Kevin Garnett 32 Points, 21 Rebs, 5 Blocks 200...",89,https://www.youtube.com/watch?v=Vo4eqnG-lIQ,[],,False,all_ads,6


In [258]:
wsb = pd.DataFrame(x['data'] for x in wallstreetbets_subreddit)

In [286]:
wsb.head(10)

Unnamed: 0,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,...,thumbnail_height,thumbnail_width,title,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,,,False,AutoModerator,,,,[],,,...,,,"What Are Your Moves Tomorrow, September 12",8,https://www.reddit.com/r/wallstreetbets/commen...,[],,False,house_only,1
1,,,False,jarredshere,,,,[],,,...,140.0,140.0,"Coming this Fall, The AMD Story",3554,https://i.redd.it/tkwzgwrc4ml11.jpg,[],,False,house_only,1
2,,,False,WrennAmethyst,,,,[],,,...,114.0,140.0,Any emoji pattern experts in here?,239,https://i.imgur.com/a5fr5G5.jpg,[],,False,house_only,1
3,,,False,kiLo28,,,,[],,,...,87.0,140.0,"We made it boyos, 100% club.",192,https://i.redd.it/mvxnjqsywml11.png,[],,False,house_only,1
4,,,False,CANT_MILK_THOSE,,,,[],,,...,120.0,140.0,Summary of Cramer's Micron Analysis these last...,184,https://i.redd.it/rejhxgqkqml11.png,[],,False,house_only,1
5,,,False,IOutsourced,,,,[],,,...,73.0,140.0,EA Holders watching TTWO and ATVI today,96,https://i.redd.it/sp1rq9bornl11.jpg,[],,False,house_only,1
6,,,False,POCKET_POOL_CHAMP,,,,"[{'e': 'text', 't': '⭐'}]",,⭐,...,24.0,140.0,$MSFT making my micro hard,214,https://i.redd.it/u1gvwlveeml11.png,[],,False,house_only,1
7,,,False,GreatTraderOnizuka,,,,[],,,...,140.0,140.0,"NVDA $48k put follow up, closed",160,https://i.redd.it/g60qi7b0hml11.jpg,[],,False,house_only,1
8,,,False,jeffynihao,,,,"[{'e': 'text', 't': 'Anyong Haseyo!'}]",,Anyong Haseyo!,...,,,"I sold SQ, ROKU, and AMD for MU 3 months ago",116,https://www.reddit.com/r/wallstreetbets/commen...,[],,False,house_only,1
9,,,False,WSBConsensus,,,,"[{'e': 'text', 't': 'cannot properly use flair'}]",,cannot properly use flair,...,,,85% of insured Amazon (AMZN) Prime memebers op...,75,https://www.reddit.com/r/wallstreetbets/commen...,[],,False,house_only,1


In [260]:
#Using frames to concat 2 dataframes
frames = [nba,wsb]
combined_df = pd.concat(frames)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.


  This is separate from the ipykernel package so we can avoid doing imports until


In [261]:
#calling the subreddit column in the combined dataframe to identify which post is in which subreddit
combined_df['subreddit']

0                 nba
1                 nba
2                 nba
3                 nba
4                 nba
5                 nba
6                 nba
7                 nba
8                 nba
9                 nba
10                nba
11                nba
12                nba
13                nba
14                nba
15                nba
16                nba
17                nba
18                nba
19                nba
20                nba
21                nba
22                nba
23                nba
24                nba
25                nba
26                nba
27                nba
28                nba
29                nba
            ...      
856    wallstreetbets
857    wallstreetbets
858    wallstreetbets
859    wallstreetbets
860    wallstreetbets
861    wallstreetbets
862    wallstreetbets
863    wallstreetbets
864    wallstreetbets
865    wallstreetbets
866    wallstreetbets
867    wallstreetbets
868    wallstreetbets
869    wallstreetbets
870    wal

In [262]:
wsb.shape

(886, 95)

In [263]:
nba.shape

(730, 93)

In [264]:
combined_df.shape

(1616, 97)

### Save your results as a CSV


In [265]:
# Export to csv
combined_df.to_csv('combined_DF')

In [266]:
#Using the title to establish the difference as every post has a title that is relevant to the sub whereas selftext may vary
combined_df['title']

0      Daily Locker Room and Free Talk + Game Threads...
1      [Announcement] Julie Phayer will be joining us...
2      My Uber driver had NBA Jam hooked up for passe...
3      Fun Fact: Andre Drummond has the most Offensiv...
4      Kyrie Irving is enrolled in a program at Harva...
5      Gilbert Arenas puts up $100k for shootout vs N...
6        Jaylen Brown hooping in T-Mac's INSANE home gym
7      [OC] Making the case for TJ Warren as the NBA'...
8      What's the jersey you bought that you regrette...
9      Kevin Garnett 32 Points, 21 Rebs, 5 Blocks 200...
10     What are your favourites NBA urban legends and...
11     Melo and Harden working on their chemistry in ...
12                 Counting down 50-31 on the SI Top 100
13     [Charania] Free agent Luol Deng has reached ag...
14     Danny Green on Kahwi's future: "The city of To...
15     Possible new OKC Thunder city edition jersey. ...
16     CJ commented “Carrrrrrrrrryyyyyyyy” and Jaylen...
17     [OC] Check out my 2019 N

## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 

In [267]:
#Converting the subreddit names into 0's and 1's
combined_df.subreddit = combined_df.subreddit.str.replace('wallstreetbets','1') 
combined_df.subreddit = combined_df.subreddit.str.replace('nba','0') 

In [268]:
#Instantiate y to the subreddit column of the combined dataframe
y = combined_df.subreddit
#I chose title as X as every reddit post has a title that is geared towards its respective subreddit discusion
X = combined_df.title
# Train test split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

### Count Vectorizer and fitting with Logistic Regression

In [280]:
# Vectorize using the number of features that you scrape
cvec = CountVectorizer(stop_words='english')
#fit and transform the train, but only transform the test
X_train_counts = cvec.fit_transform(X_train)

X_test_counts = cvec.transform(X_test)





In [281]:
#Use Logistic regression to fit and score for count vectorizer
log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
print("Train Score :")
print(log_reg.score(X_train_counts,y_train))
print("Test Score:")
log_reg.score(X_test_counts, y_test)



Train Score :
0.9963031423290203
Test Score:


0.8838951310861424

### Tf-Idf and fitting, with Logistic Regression

In [282]:
#Same process for Tf-Idf as CountVec
tvec = TfidfVectorizer(stop_words='english')

X_train_tfid = tvec.fit_transform(X_train)
X_test_tfid = tvec.transform(X_test)

In [283]:
#declare the model

log_reg.fit(X_train_tfid, y_train)
print("Test Score :")
print(log_reg.score(X_train_tfid,y_train))
print("Train Score")
log_reg.score(X_test_tfid, y_test)



Test Score :
0.9926062846580407
Train Score


0.897003745318352

As we can see, our score imporves when fitting the model with TF-Idf over CountVectorizer

In [284]:
#From the vectorizer we can aquire the words we want
columns=cvec.get_feature_names()
#Make a DF of the words
co_ef = pd.DataFrame(log_reg.coef_, columns = columns)

#Sort from most to least compatible with wallstreetbets
df_coef = co_ef.T.sort_values(by=0, ascending=False).T

df_coef.head()



Unnamed: 0,mu,amd,calls,buy,short,tesla,week,tsla,long,puts,...,allen,james,basketball,season,players,game,player,lebron,team,nba
0,2.494332,2.147041,1.733749,1.612342,1.498732,1.465826,1.429913,1.398775,1.391733,1.35384,...,-1.196126,-1.206766,-1.260888,-1.393497,-1.807938,-1.887318,-2.00782,-2.013422,-2.377234,-3.869232


#### This list helps us understand what words relate with each subreddit. Worlds like 'mu' 'amd' and 'tesla' have a strong correlation to r/wallstreetbets whereas words like 'game' 'players' 'lebron' 'nba' have a strong affinity with r/nba

### Predicting subreddit using Random Forests and Decision Tree Classifier 

#### Decision tree-CountVec Model

In [274]:
#Decision tree classifier with CountVec Model
decision_tree = DecisionTreeClassifier(max_features=30,random_state=33)
#                                 ?
decision_tree = decision_tree.fit(X_train_counts, y_train)

print("Train Score:")
print(decision_tree.score(X_train_counts, y_train))
print("Test Score :")
print(decision_tree.score(X_test_counts,y_test))




Train Score:
0.9972273567467652
Test Score :
0.8314606741573034


#### Confusion Matrix and classification report

In [275]:
#Confusion matrix for Decision tree using the predictions from the CountVec Model
predictions= decision_tree.predict(X_test_counts)
print('This is the Classification Report')
print(classification_report(y_test, predictions))
print('This is the Confusion Matrix')
print(confusion_matrix(y_test, predictions))

This is the Classification Report
             precision    recall  f1-score   support

          0       0.91      0.72      0.80       255
          1       0.78      0.94      0.85       279

avg / total       0.84      0.83      0.83       534

This is the Confusion Matrix
[[183  72]
 [ 18 261]]


#### Random Forest for CountVec Model

In [276]:
#Random Forest Classifier with the CountVec Model, 
rf = RandomForestClassifier(n_estimators=50, max_depth=100, max_leaf_nodes=24, max_features=1000)
                                
rf = clf.fit(X_train_counts, y_train)

print("Train Score:")
print(rf.score(X_train_counts, y_train))
print("Test Score :")
print(rf.score(X_test_counts,y_test))



Train Score:
0.8401109057301294
Test Score :
0.7734082397003745


#### Confusion Matrix and classification report

In [277]:
predictions= rf.predict(X_test_counts)
print('This is the Classification Report')
print(classification_report(y_test, predictions))
print('This is the Confusion Matrix')
print(confusion_matrix(y_test, predictions))
rf.get_params

This is the Classification Report
             precision    recall  f1-score   support

          0       0.99      0.53      0.69       255
          1       0.70      0.99      0.82       279

avg / total       0.84      0.77      0.76       534

This is the Confusion Matrix
[[136 119]
 [  2 277]]


<bound method BaseEstimator.get_params of RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=100, max_features=1000, max_leaf_nodes=24,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)>

#### Decision tree for Tf-Idf Model

In [278]:
decision_tree1 = DecisionTreeClassifier(max_features=30,random_state=66)
                              
decision_tree1 = decision_tree1.fit(X_train_tfid, y_train)

print("Train Score:")
print(decision_tree1.score(X_train_tfid, y_train))
print("Test Score :")
print(decision_tree1.score(X_test_tfid,y_test))



Train Score:
1.0
Test Score :
0.8127340823970037


#### Random Forest for Tf-Idf Model

In [279]:
rf1 = RandomForestClassifier(n_estimators=10, max_depth=200, max_leaf_nodes=12, max_features=1000)
                                 
rf1 = rf.fit(X_train_tfid, y_train)

print("Train Score:")
print(rf1.score(X_train_tfid, y_train))
print("Test Score :")
print(rf1.score(X_test_tfid,y_test))



Train Score:
0.8345656192236599
Test Score :
0.7752808988764045


### Concluding thoughts and results are discussed in the Executive Summary in the README file