# Project: Using Reddit's API for Predicting Comments
### Author: Kihoon Sohn

### Table of Contents

- Notebook 1 - Data Fetching: scrape `json` from the web and unpack it into a DataFrame
- Notebook 2 - Data Cleansing: exploratory data analysis and feature engineering
- **Notebook 3 - Data Modeling (current)**: build a predictive model

**Disclaimer**: Due to GitHub's file size restriction, the `/dataset/` folder and other large files are excluded via `.gitignore`. Therefore, the notebook might not be reproducible.

In [1]:
# import libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler

from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.pipeline import Pipeline, make_pipeline
from nltk.stem import PorterStemmer


# set random seed
import numpy as np
np.random.seed(538)

### 3a: Read CSV for NLP (51K hot posts from Reddit)

In [2]:
# read CSV
df = pd.read_csv('./dataset/master_06-02-2018(hotposts).csv')
# drop the leftover index column written by an earlier to_csv call
df.drop(columns='Unnamed: 0', inplace=True)

In [3]:
print(df.columns)
df.head(1)

Index(['id', 'title', 'subreddit', 'num_comments', 'created_utc',
       'fetched time', 'age', 'comments'],
      dtype='object')


| | id | title | subreddit | num_comments | created_utc | fetched time | age | comments |
|---|---|---|---|---|---|---|---|---|
| 0 | 8m1wov | It happens in anime, it happens in life | combinedgifs | 407 | 2018-05-25 13:52:24 | 2018-05-25 20:10:12 | 377.0 | 1 |


### 3b: Build model and evaluate

##### Tree-based Model 1 - Feature: `subreddit` with RandomForest

In [4]:
# Set the X and y
y = df['comments']
X = df['subreddit']

In [5]:
# dummy all the subreddits
X = pd.get_dummies(X, drop_first=True)

In [6]:
# Train / Test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=42)

In [7]:
# Random Forest

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7618310767246937

In [8]:
# to compare score between train / test set.  
rf.score(X_train, y_train)

0.7886533838118661

In [9]:
# GridSearch

rf = RandomForestClassifier()
rf_params = {
    'n_estimators' : [10, 15, 20]
}
gs = GridSearchCV(rf, param_grid=rf_params)
gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)

0.7636720369193357
{'n_estimators': 10}


In [10]:
gs.score(X_test, y_test)

0.7623468729851709

Model 1 result
- The model shows some improvement over the baseline accuracy.
- The train set score (0.789) is slightly higher than the test set score (0.762), so the gap between the two is small.
- GridSearch raises the test score only marginally (0.7618 to 0.7623).
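
The baseline referred to above is never computed in this notebook. A minimal sketch, assuming it means the accuracy of always predicting the most frequent class of `comments`; the helper name `baseline_accuracy` and the toy labels are made up for illustration:

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class (hypothetical helper)."""
    counts = Counter(labels)
    # count of the majority class divided by the number of samples
    return counts.most_common(1)[0][1] / len(labels)

# toy labels, not the real `comments` column
toy_y = [0, 0, 0, 1]
print(baseline_accuracy(toy_y))  # 0.75
```

With the real data, `baseline_accuracy(y_test)` (or `y_test.value_counts(normalize=True).max()` in pandas) would give the number to compare against `rf.score(X_test, y_test)`.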

##### Tree-based Model 2 - Features: `subreddit`, `is_cat`, `is_funny` with RandomForest

In [11]:
# Set the X and y

y = df['comments']
X = df[['subreddit', 'title']].copy(deep=True) 

In [12]:
# flag whether 'cat' / 'funny' appears in the title (case-sensitive substring match)
X['is_cat'] = X['title'].map(lambda x: 1 if 'cat' in x else 0)
X['is_funny'] = X['title'].map(lambda x: 1 if 'funny' in x else 0)

In [13]:
# drop 'title' and one-hot encode the remaining categorical columns
X.drop('title', axis=1, inplace=True)
X = pd.get_dummies(X, drop_first=True)

In [14]:
# Train / Test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=42)

In [15]:
# Random Forest

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7607350096711799

In [16]:
# to compare score between train / test set.  
rf.score(X_train, y_train)

0.7909193909414983

In [17]:
# GridSearch

rf = RandomForestClassifier()
rf_params = {
    'n_estimators' : [10, 15, 20]
}
gs = GridSearchCV(rf, param_grid=rf_params)
gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)

0.7635614999861828
{'n_estimators': 10}


In [18]:
gs.score(X_test, y_test)

0.7622179239200516

Model 2 result
- Compared to Model 1, adding `is_cat` and `is_funny` barely changes the score (test accuracy after GridSearch: 0.7622 vs. 0.7623), so these features bring no meaningful improvement.
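
One possible reason the two flags add so little: `'cat' in x` is a case-sensitive substring test, so it also fires on words such as "category" and misses "Cat". A hedged sketch of a whole-word, case-insensitive alternative; `has_word` is a hypothetical helper, not part of the notebook:

```python
import re

def has_word(title, word):
    """Return 1 if `word` occurs as a whole word in `title`, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return int(re.search(pattern, title, flags=re.IGNORECASE) is not None)

print(has_word("New category of memes", "cat"))  # 0
print(has_word("My Cat is funny", "cat"))        # 1
```

It could be applied as `X['title'].map(lambda t: has_word(t, 'cat'))`. Note that plurals such as "cats" would not match, and whether this changes the score was not tested here.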

##### Tree-based Model 3 - Feature: `title`, `subreddit`, `age` with TfidfVectorizer & Random Forest

In [19]:
# Set the X and y

y = df['comments']
X = df[['subreddit', 'title', 'age']].copy(deep=True)

In [20]:
# dummy all the subreddits
X = pd.get_dummies(X, columns=['subreddit'], drop_first=True)

In [21]:
# Train / Test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=42)

In [22]:
tvec = TfidfVectorizer(stop_words='english')

X_train_matrix = tvec.fit_transform(X_train['title'])
X_test_matrix = tvec.transform(X_test['title'])

In [23]:
X_train_df = pd.DataFrame(X_train_matrix.todense(),
                         columns=tvec.get_feature_names(),
                         index=X_train.index)

In [24]:
X_test_df = pd.DataFrame(X_test_matrix.todense(),
                        columns=tvec.get_feature_names(),
                        index=X_test.index)

In [25]:
# check that the TF-IDF frame is aligned row by row with the original split
assert X_train.index.equals(X_train_df.index)

In [26]:
X_train_all = pd.concat([X_train_df, X_train.drop('title', axis=1)], axis=1)
X_test_all = pd.concat([X_test_df, X_test.drop('title', axis=1)], axis=1)

In [27]:
X_train_all.head()

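| | 00 | 000 | 000ft | 000k | 000km | 000kr | 000lb | 000th | 001 | 00100000 | ... | subreddit_yourmomshousepodcast | subreddit_youseeingthisshit | subreddit_youtube | subreddit_youtubehaiku | subreddit_yuruyuri | subreddit_yvonnestrahovski | subreddit_zelda | subreddit_zen | subreddit_zerocarb | subreddit_zuckmemes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24100 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 34130 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 36106 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 35994 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14459 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |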


In [28]:
rf = RandomForestClassifier()
rf.fit(X_train_all, y_train)
rf.score(X_test_all, y_test)

0.8034816247582205

In [29]:
# to compare score between train / test set.  
rf.score(X_train_all, y_train)

0.9793295935004284

In [30]:
# GridSearch

rf = RandomForestClassifier()
rf_params = {
    'n_estimators' : [10, 15, 20]
}
gs = GridSearchCV(rf, param_grid=rf_params)
gs.fit(X_train_all, y_train)
print(gs.best_score_)
print(gs.best_params_)

0.8058142426838367
{'n_estimators': 20}


In [31]:
gs.score(X_test_all, y_test)

0.8104448742746615

Model 3 result
- The gap between the train set score (0.979) and the test set score (0.803) is large, which suggests that the model is overfitted.
- However, it still shows the best test score of the three models so far.
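
To make the overfitting claim concrete, the train-minus-test gap can be compared across the three models. The scores below are copied from the cell outputs above (default RandomForest, before GridSearch); the dictionary layout is only an illustration:

```python
# (train accuracy, test accuracy) copied from the default RandomForest runs above
scores = {
    "Model 1": (0.7886533838118661, 0.7618310767246937),
    "Model 2": (0.7909193909414983, 0.7607350096711799),
    "Model 3": (0.9793295935004284, 0.8034816247582205),
}

# a larger train-minus-test gap suggests more overfitting
gaps = {name: train - test for name, (train, test) in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.3f}")
```

Model 3's gap is roughly 0.18, against about 0.03 for Models 1 and 2.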

##### Tree-based Model 4 - Feature: `title` with CountVectorizer, TfidfVectorizer, Random Forest

In [32]:
# Set the X and y

y = df['comments']
X = df['title']

In [33]:
# Train / Test Split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=42)

In [34]:
# instantiate the vectorizers, each limited to the 500 most frequent terms

cvec = CountVectorizer(stop_words='english', max_features=500)
tvec = TfidfVectorizer(stop_words='english', max_features=500)

In [35]:
# print(X_train.shape)

cvec.fit(X_train)
X_train_matrix = cvec.transform(X_train)
print(X_train_matrix[:5])

  (0, 64)	1
  (0, 94)	1
  (0, 138)	1
  (0, 495)	1
  (1, 411)	1
  (2, 145)	1
  (2, 393)	1
  (3, 477)	1
  (4, 210)	1
  (4, 323)	1
  (4, 468)	1


In [36]:
print(X_train.iloc[0,])
print(cvec.get_feature_names()[64], cvec.get_feature_names()[94])

After three years, I finally _______ my daughter's cat.
cat daughter


In [37]:
X_train_matrix = cvec.fit_transform(X_train)

In [38]:
forest = RandomForestClassifier(max_depth=10, n_estimators=5)
forest.fit(X_train_matrix, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [39]:
forest.score(X_train_matrix, y_train)

0.7466493492138061

In [40]:
X_test_matrix = cvec.transform(X_test)
forest.predict(X_test_matrix)
forest.score(X_test_matrix, y_test)

0.7484203739522889

In [41]:
X_train_matrix = tvec.fit_transform(X_train)
X_test_matrix = tvec.transform(X_test)

forest.fit(X_train_matrix, y_train)
forest.score(X_test_matrix, y_test)

0.7478401031592521

Model 4 result
- The gap between the train and test scores is small, so the model does not appear to over- or underfit. However, the score stays at about the baseline score.
- Therefore, CVEC or TVEC with RF on `title` alone is not the best modeling approach.

##### Non-tree-based Model 1 - Feature: `title` with CountVectorizer, TfidfVectorizer, Logistic Regression, KNN

In [48]:
# Set the X and y
y = df['comments']
X = df['title']

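stemmer = PorterStemmer()
# NOTE: stem() treats each whole title as a single token, so it lowercases
# the title and only stems its very end (see the output of In [50])
X = X.apply(stemmer.stem)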

In [49]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=42)

In [50]:
X_train.values

array(["after three years, i finally _______ my daughter's cat.",
       'spoiler: this sub currently...', "i'm gluten free and i'm sorri",
       ..., 'this crackhead mat',
       'estonian president with klavan in kiev',
       'hey kitty girls! it’s my birthday today &amp; my mum got me this card'],
      dtype=object)

In [51]:
# let's set CountVectorizer / HashingVectorizer / TFIDF
# by removing english stop words, ngram_range(1,2)

cvec = CountVectorizer(stop_words='english', ngram_range=(1,2))
hvec = HashingVectorizer(stop_words='english')
tvec = TfidfVectorizer(stop_words='english', ngram_range=(1,2))

lr = LogisticRegression()


In [52]:
# `make_pipeline` - credit goes to Harsha
pipe_lr = make_pipeline(cvec, lr)
pipe_lr.fit(X_train, y_train)
pipe_lr.score(X_test, y_test)

0.7409413281753707

In [53]:
# pipeline / fit / score for lr & hvec

pipe_lr_2 = make_pipeline(hvec, lr)
pipe_lr_2.fit(X_train, y_train)
pipe_lr_2.score(X_test, y_test)

0.7474532559638942

In [54]:
# pipeline / fit / score for lr & tvec

pipe_lr_3 = make_pipeline(tvec, lr)
pipe_lr_3.fit(X_train, y_train)
pipe_lr_3.score(X_test, y_test)

0.7460348162475822

In [61]:
# pipeline / fit / score for knn & cvec

knn = KNeighborsClassifier()
pipe_knn = make_pipeline(cvec, knn)
pipe_knn.fit(X_train, y_train)
pipe_knn.score(X_test, y_test)

0.7364925854287556

In [64]:
pipe_knn_2 = make_pipeline(tvec, knn)
pipe_knn_2.fit(X_train, y_train)
pipe_knn_2.score(X_test, y_test)

0.7408768536428111

Model 5 result
- `title` with CVEC or TVEC plus Logistic Regression or KNN does not give the best performance among my models (test accuracy between 0.74 and 0.75).
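
In cell In [48], `PorterStemmer().stem` receives the whole title, so only the end of each title is stemmed (for example "sorri" in the output of In [50]). A sketch of per-word stemming, assuming a simple whitespace split is acceptable; `stem_title` is a hypothetical helper, and `toy_stem` is a stand-in so the block runs without NLTK:

```python
def stem_title(title, stem):
    """Apply the callable `stem` to every whitespace-separated token of `title`."""
    return " ".join(stem(token) for token in title.split())

def toy_stem(word):
    """Stand-in stemmer for illustration only: lowercase and drop a trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

print(stem_title("Three Years of Cats", toy_stem))  # three year of cat
```

With NLTK, one would create `stemmer = PorterStemmer()` once and call `X.apply(lambda t: stem_title(t, stemmer.stem))`. Whether this changes the scores was not tested here.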

##### Non-tree-based Model 2 - Features: `title`, `age`, `subreddit` with TfidfVectorizer, Logistic Regression, KNN

In [4]:
# set the X and y

y = df['comments']
X = df[['title', 'age', 'subreddit']].copy(deep=True)

In [5]:
# Dummy subreddit
X = pd.get_dummies(X, columns=['subreddit'], drop_first = True)

In [6]:
# Train / Test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=42)

In [7]:
# apply TfidfVectorizer to the title (fit on train only, then transform test)
tvec = TfidfVectorizer(stop_words='english')

X_train_matrix = tvec.fit_transform(X_train['title'])
X_test_matrix = tvec.transform(X_test['title'])

X_train_df = pd.DataFrame(X_train_matrix.todense(),
                         columns=tvec.get_feature_names(),
                         index=X_train.index)

X_test_df = pd.DataFrame(X_test_matrix.todense(),
                        columns=tvec.get_feature_names(),
                        index=X_test.index)

# check that the TF-IDF frame is aligned row by row with the original split
assert X_train.index.equals(X_train_df.index)

In [8]:
X_train_all = pd.concat([X_train_df, X_train.drop('title', axis=1)], axis=1)
X_test_all = pd.concat([X_test_df, X_test.drop('title', axis=1)], axis=1)

In [10]:
# Let's fit & score with Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train_all, y_train)
logreg.score(X_test_all, y_test)

0.8368794326241135

In [11]:
# gridsearch for the params
lr_params = {
    'penalty' : ['l1', 'l2']
}
gs = GridSearchCV(LogisticRegression(), param_grid=lr_params)
gs.fit(X_train_all, y_train)
print(gs.best_score_)
print(gs.best_params_)

0.8300218310442977
{'penalty': 'l2'}


In [12]:
# score on test set
gs.score(X_test_all, y_test)

0.8368794326241135

In [13]:
# let's check the best estimator
print(gs.best_estimator_)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)


In [14]:
# coefficients of the best estimator, exponentiated into odds ratios

coefs = pd.DataFrame(gs.best_estimator_.coef_[0], index=X_train_all.columns, columns=['coef'])
coefs['coef'] = np.exp(coefs['coef'])
coefs = coefs.sort_values(by='coef', ascending=False)
coefs.head()

| | coef |
|---|---|
| subreddit_AskReddit | 85.757194 |
| subreddit_cars | 22.536326 |
| subreddit_CFB | 20.950063 |
| subreddit_MemeEconomy | 17.945319 |
| subreddit_Drama | 17.7172 |


In [15]:
# get odds ratio for the age
coefs.loc['age', :]

| | coef |
|---|---|
| age | 1.391433 |
| age | 1.003595 |


**^ I tried to find out why there are two `age` rows in my set. Will dig more.**
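
One plausible explanation, not verified against the data: the TF-IDF vocabulary built from the titles contains the word "age", so after `pd.concat` the token column and the numeric `age` column share a name. The sketch below uses made-up column lists and a hypothetical `tfidf_` prefix to show how such collisions could be detected and avoided:

```python
from collections import Counter

# made-up stand-ins for tvec.get_feature_names() and the non-text columns
tfidf_columns = ["age", "cat", "game"]
other_columns = ["age", "subreddit_zelda"]

# names that occur more than once after combining both lists
counts = Counter(tfidf_columns + other_columns)
duplicates = [name for name, n in counts.items() if n > 1]
print(duplicates)  # ['age']

# prefixing the token columns keeps every name unique
prefixed = ["tfidf_" + name for name in tfidf_columns]
combined = prefixed + other_columns
print(len(set(combined)) == len(combined))  # True
```

In the notebook this would mean passing the prefixed names as `columns=` when building `X_train_df` and `X_test_df`.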

In [17]:
# try KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train_all, y_train)
knn.score(X_test_all, y_test)

0.7647969052224372

Model 6 result
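- Of all the models, Logistic Regression with these features shows the best test score, `0.837`.
- Also, the odds ratio for `age` (1.39) can be interpreted as follows: each additional minute of age increases the odds of a post having a higher number of comments by about 39%. Since `coefs.loc['age', :]` returned two rows (1.39 and 1.0036), this reading assumes that the 1.39 row is the numeric `age` column.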
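
The 39% figure follows from the exponentiated coefficient: an odds ratio of 1.391433 multiplies the odds by about 1.39 for each one-unit increase in the feature. A small worked example using the two `age` values printed above; `pct_change_in_odds` is a hypothetical helper:

```python
import math

def pct_change_in_odds(odds_ratio, units=1):
    """Percent change in the odds after `units` one-unit increases of a feature."""
    return (odds_ratio ** units - 1) * 100

# the two exp(coef) values shown for 'age' above
for odds_ratio in (1.391433, 1.003595):
    coef = math.log(odds_ratio)  # back to the raw logistic-regression coefficient
    print(f"coef={coef:.4f}  +1 unit: {pct_change_in_odds(odds_ratio):.1f}%")
```

Because odds ratios compound multiplicatively, 60 one-unit increases correspond to `pct_change_in_odds(r, units=60)`, not to 60 times the one-unit change.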

### Executive Summary

How do you get attention in a sea of information, or rather, in a sea of Reddit posts? Since this user-generated-content site is ranked the 6th most visited website in the world, it is more and more competitive to stand out. People spend a lot of time in this categorized playground. According to the Alexa Global Top 500 Sites list, people spend 15 minutes and 8 seconds on the site per day on average. This figure outranks those of the top 3 sites from the same source: Google (7:16), YouTube (8:32), and Facebook (10:46). As a result, hundreds of posts are being judged, even at this very moment, instantly as either eye-catching or boring.

To analyze which characteristics and features matter most, a large amount of data is necessary. As a data scientist, I fetched fifty-one thousand Reddit 'hot posts' over a seven-day period and built machine learning models to identify those features and estimate the odds. I used multiple data science methods in Python, such as CountVectorizer, TfidfVectorizer, RandomForest, and Logistic Regression. After evaluating multiple models, `title`, `age`, and `subreddit` turned out to be the best features for building a model.

The words that appeared most often in the titles of popular posts were: "new, one, first, now, time, day, will, made, game, today." Also, the subreddits most predictive of getting attention are: r/AskReddit, r/cars, r/CFB, r/MemeEconomy, r/Drama. In conclusion, to keep your post visible longer by capturing people's attention, choose your subreddit and the words in your title carefully.