# Project 3: Web APIs & NLP

In week four we learned about a few different classifiers. In week five we're learning about webscraping, APIs, and Natural Language Processing (NLP). This project will put those skills to the test.

For project 3, your goal is two-fold:
1. Using [PRAW](https://praw.readthedocs.io/en/stable/index.html), you'll collect posts from two subreddits of your choosing.
2. You'll then use NLP to train a classifier to predict which subreddit a given post came from. This is a binary classification problem.

## Part 3: Preprocessing and Modeling

Now that we have cleaned the data and explored it with EDA, we can preprocess the text and build our models.

In [1]:
# import libraries
import pandas as pd
import numpy as np

from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
# Load cleaned data
subreddits = pd.read_csv("./data/reddit_data_cleaned.csv")
subreddits.head()

Unnamed: 0,subreddit,self_text
0,1,**MAJOR UPDATE: we have his name. Incredible r...
1,1,"On October 1, 2017, Stephen Paddock, a 64-year..."
2,1,Anny van der Groen-Heyligers was a 23 year old...
3,0,POTENTIAL SPOILERS AHEAD:\nSo an anonymous sou...
4,0,This has a Spoiler tag just in case nobody has...


In [3]:
# Taking a look at the shape of the dataset
subreddits.shape

(4727, 2)

### Tokenize

In [4]:
# Tokenize: split the text into distinct chunks based on some pattern.
# RegEx: keep runs of word characters and hyphens; punctuation and whitespace are dropped
tokenizer = RegexpTokenizer(r"[\w\-]+")
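
To see what this pattern keeps and drops, here is a minimal sketch using only the standard library. It assumes that `RegexpTokenizer` in its default (non-gap) mode returns the pattern's matches in order, much like `re.findall`; the helper name `simple_tokenize` and the sample sentence are made up for illustration.

```python
import re

# Same pattern as the tokenizer above: runs of word characters or hyphens.
TOKEN_PATTERN = re.compile(r"[\w\-]+")

def simple_tokenize(text):
    """Return every non-overlapping match of the token pattern, in order."""
    return TOKEN_PATTERN.findall(text)

# Hypothetical sample sentence (not taken from the dataset).
sample = "Took 130,000 won from a 64-year-old man."
tokens = simple_tokenize(sample)
print(tokens)
```

Commas and periods act as separators, so `130,000` becomes two tokens, while hyphenated words such as `64-year-old` stay in one piece.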

In [5]:
# Checking to make sure the tokenizer works properly
tokenizer.tokenize(subreddits.self_text[20])

['THE',
 'SCENE',
 'In',
 'June',
 'of',
 '2001',
 'Lee',
 'Hwang',
 'and',
 'Bang',
 'were',
 'in',
 'need',
 'of',
 'quick',
 'cash',
 'They',
 'were',
 'in',
 'Sokcho',
 'City',
 'South',
 'Korea',
 'during',
 'the',
 'height',
 'of',
 'tourist',
 'season',
 'Around',
 '2',
 'AM',
 'in',
 'the',
 'morning',
 'they',
 'saw',
 'a',
 'man',
 'in',
 'his',
 '40s',
 'go',
 'into',
 'a',
 'hotel',
 'The',
 'three',
 'men',
 'followed',
 'their',
 'target',
 'up',
 'to',
 'a',
 'suite',
 'on',
 'the',
 'third',
 'floor',
 'Lee',
 'rang',
 'the',
 'bell',
 'When',
 'the',
 'man',
 'answered',
 'Lee',
 'told',
 'him',
 'that',
 'he',
 'was',
 'an',
 'employee',
 'at',
 'the',
 'hotel',
 'Lee',
 'went',
 'in',
 'first',
 'and',
 'subdued',
 'the',
 'victim',
 'by',
 'threatening',
 'him',
 'with',
 'a',
 'knife',
 'At',
 'his',
 'signal',
 'Hwang',
 'and',
 'Bang',
 'also',
 'entered',
 'the',
 'suite',
 'and',
 'took',
 '130',
 '000',
 'won',
 'approximately',
 '113',
 'from',
 'the',
 'vict

In [6]:
# tokenizing self_text column
post_tokens = [tokenizer.tokenize(post.lower()) for post in subreddits.self_text]

In [7]:
# No of posts (it should be the same as the length of the dataframe)
len(post_tokens)

4727

In [8]:
# Making sure above works properly
post_tokens[20]

['the',
 'scene',
 'in',
 'june',
 'of',
 '2001',
 'lee',
 'hwang',
 'and',
 'bang',
 'were',
 'in',
 'need',
 'of',
 'quick',
 'cash',
 'they',
 'were',
 'in',
 'sokcho',
 'city',
 'south',
 'korea',
 'during',
 'the',
 'height',
 'of',
 'tourist',
 'season',
 'around',
 '2',
 'am',
 'in',
 'the',
 'morning',
 'they',
 'saw',
 'a',
 'man',
 'in',
 'his',
 '40s',
 'go',
 'into',
 'a',
 'hotel',
 'the',
 'three',
 'men',
 'followed',
 'their',
 'target',
 'up',
 'to',
 'a',
 'suite',
 'on',
 'the',
 'third',
 'floor',
 'lee',
 'rang',
 'the',
 'bell',
 'when',
 'the',
 'man',
 'answered',
 'lee',
 'told',
 'him',
 'that',
 'he',
 'was',
 'an',
 'employee',
 'at',
 'the',
 'hotel',
 'lee',
 'went',
 'in',
 'first',
 'and',
 'subdued',
 'the',
 'victim',
 'by',
 'threatening',
 'him',
 'with',
 'a',
 'knife',
 'at',
 'his',
 'signal',
 'hwang',
 'and',
 'bang',
 'also',
 'entered',
 'the',
 'suite',
 'and',
 'took',
 '130',
 '000',
 'won',
 'approximately',
 '113',
 'from',
 'the',
 'vict

In [9]:
# Creating tokenized column in dataset (one list of tokens per post)
subreddits['tokenized'] = pd.Series(post_tokens, index=subreddits.index)

In [10]:
# Joining each post's tokens back into a single space-separated string
subreddits['tokenized'] = subreddits['tokenized'].apply(lambda row: ' '.join(row))
subreddits

Unnamed: 0,subreddit,self_text,tokenized
0,1,**MAJOR UPDATE: we have his name. Incredible r...,major update we have his name incredible reddi...
1,1,"On October 1, 2017, Stephen Paddock, a 64-year...",on october 1 2017 stephen paddock a 64-year-ol...
2,1,Anny van der Groen-Heyligers was a 23 year old...,anny van der groen-heyligers was a 23 year old...
3,0,POTENTIAL SPOILERS AHEAD:\nSo an anonymous sou...,potential spoilers ahead so an anonymous sourc...
4,0,This has a Spoiler tag just in case nobody has...,this has a spoiler tag just in case nobody has...
...,...,...,...
4722,1,Crazy to see this news:\n\nhttps://wydaily.com...,crazy to see this news https wydaily com lates...
4723,0,"In the finale, Ben survives the night and come...",in the finale ben survives the night and comes...
4724,0,Justt rewatching The Shining. During the scene...,justt rewatching the shining during the scene ...
4725,0,"I made a post a couple years back to this end,...",i made a post a couple years back to this end ...


### Lemmatizing & Stemming

In [11]:
# Instantiate Lemmatizer
wn = WordNetLemmatizer()

# Lemmatize tokens
tokens_lem = [wn.lemmatize(i) for i in post_tokens[2]]

In [12]:
# Instantiate PorterStemmer
ps = PorterStemmer()

# Stem tokens
tokens_stem = [ps.stem(i) for i in post_tokens[2]]

In [13]:
# Taking the above for stemming and making a function to use when modeling

def subreddit_preprocessor_stem(str_input):
    text = str_input.split()
    ps = PorterStemmer()
    return ' '.join([ps.stem(word) for word in text]) 

In [14]:
# Taking the above for lemmatizing and making a function to use when modeling

def subreddit_preprocessor_lem(str_input):
    text = str_input.split()
    wn = WordNetLemmatizer()
    return ' '.join([wn.lemmatize(word) for word in text]) 

## Modeling

In [15]:
# Creating X and y
X = subreddits['tokenized']
y = subreddits['subreddit']

In [16]:
# Baseline: class proportions (always predicting the majority class is right about 51% of the time)
y.value_counts(normalize=True)

subreddit
0    0.510049
1    0.489951
Name: proportion, dtype: float64
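
The baseline is the accuracy we would get by always predicting the most common class. A minimal sketch of that calculation, using toy labels rather than the real `y` (the function name `majority_baseline` is our own):

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Toy labels for illustration only (not the project data).
toy_labels = [0, 0, 0, 1, 1]
baseline_label, baseline_acc = majority_baseline(toy_labels)
print(baseline_label, baseline_acc)
```

Any model we train should beat this number to be worth using.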

In [17]:
# Split the data into the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 2024)
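
Since no `test_size` is passed, the split relies on scikit-learn's defaults. The sketch below reproduces the resulting sizes by hand, assuming the default test fraction of 0.25 and that the test count is rounded up; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test), assuming the test count is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# 4727 is the number of posts in the cleaned dataset.
n_train, n_test = split_sizes(4727)
print(n_train, n_test)
```

The training count of 3545 agrees with the number of rows in the vectorized training matrix later in this notebook.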

In [18]:
# Instantiate a CountVectorizer with the default hyperparameters.
cvec = CountVectorizer()

In [19]:
# Fit the vectorizer on the training data only (this learns the vocabulary)
cvec.fit(X_train)

In [20]:
# Transforming train and test datasets
X_train_cv = cvec.transform(X_train)
X_test_cv = cvec.transform(X_test)

In [21]:
# Checking shape of train set
X_train_cv.shape

(3545, 60819)
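
The 60,819 columns are the distinct tokens learned from the training posts. As a rough sketch of the fit/transform idea, assuming plain whitespace splitting (the real `CountVectorizer` has its own token pattern and returns a sparse matrix), with made-up documents and helper names:

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct whitespace-separated token to a column index."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(vocab)}

def transform(docs, vocab):
    """Count vocabulary tokens per document; unseen tokens are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.split() if tok in vocab)
        row = [0] * len(vocab)
        for tok, n in counts.items():
            row[vocab[tok]] = n
        rows.append(row)
    return rows

# Hypothetical mini corpus for illustration.
train_docs = ["the hotel the suite", "the finale spoiler"]
test_docs = ["the hotel mystery"]

vocab = fit_vocabulary(train_docs)         # learned from training data only
train_matrix = transform(train_docs, vocab)
test_matrix = transform(test_docs, vocab)  # "mystery" has no column, so it is dropped
print(vocab)
print(train_matrix)
print(test_matrix)
```

Fitting on the training set only means the test set cannot influence which columns exist, which keeps the evaluation honest.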

### Modeling with CountVectorizer and MultinomialNB using pipe

In [22]:
# Using a pipeline to chain two steps
# 1. CountVectorizer
# 2. Multinomial Naive Bayes 

pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('nb', MultinomialNB())])

In [23]:
# Fitting pipe on training set
pipe.fit(X_train, y_train)

In [24]:
# Getting train and test accuracy scores
pipe.score(X_train, y_train), pipe.score(X_test, y_test)

(0.9895627644569817, 0.9805414551607445)
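
To give some intuition for what the Naive Bayes step does with the word counts, here is a simplified pure-Python sketch of multinomial Naive Bayes with add-one smoothing (we assume `alpha=1.0`, matching the `MultinomialNB` default). The functions `fit_nb` and `predict_nb` and the toy posts are illustrative only; this is not how scikit-learn implements it internally.

```python
import math
from collections import Counter, defaultdict

def fit_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {tok for doc in docs for tok in doc.split()}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(token_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_lik": {
                tok: math.log((token_counts[label][tok] + alpha) / total)
                for tok in vocab
            },
        }
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for tok in doc.split():
            if tok in params["log_lik"]:  # tokens unseen in training are ignored
                score += params["log_lik"][tok]
        scores[label] = score
    return max(scores, key=scores.get)

# Toy posts for illustration: 1 = mystery-style post, 0 = TV/film-style post.
toy_docs = ["unsolved murder case", "missing person case",
            "movie finale spoiler", "season finale twist"]
toy_y = [1, 1, 0, 0]
nb_model = fit_nb(toy_docs, toy_y)
print(predict_nb(nb_model, "murder case unsolved"))
print(predict_nb(nb_model, "finale spoiler"))
```

Words that appear mostly in one class pull the score toward that class, which is why distinctive vocabulary in each subreddit makes this problem relatively easy.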

### Modeling with CountVectorizer and MultinomialNB using pipe and GridSearching using params

In [25]:
# Instantiate GridSearchCV.
# Used below link to pick the best n_jobs
# https://towardsdatascience.com/understanding-the-n-jobs-parameter-to-speedup-scikit-learn-classification-26e3d1220c28#:~:text=According%20to%20the%20official%20scikit,1%20means%20using%20all%20processors.

pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('nb', MultinomialNB())])

params = {
    'cvec__max_features': [4_000, 5_000, 6_000], 
    'cvec__min_df': [2, 3], 
    'cvec__ngram_range': [(1,1), (1,2)], 
    'cvec__preprocessor': [subreddit_preprocessor_stem, subreddit_preprocessor_lem] 
}

gs = GridSearchCV(pipe, params, cv=5, n_jobs = -1) 

In [26]:
%%time
# Fitting training set
gs.fit(X_train, y_train)

CPU times: total: 7.86 s
Wall time: 14min 27s
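
To put the wall time in context, we can count how many model fits this search performs. The sketch below enumerates the grid with `itertools.product`; the two preprocessor functions are replaced by placeholder strings, and `count_fits` is a hypothetical helper.

```python
from itertools import product

def count_fits(param_grid, cv):
    """Return (number of candidate settings, number of cross-validation fits)."""
    candidates = list(product(*param_grid.values()))
    return len(candidates), len(candidates) * cv

grid = {
    "cvec__max_features": [4_000, 5_000, 6_000],
    "cvec__min_df": [2, 3],
    "cvec__ngram_range": [(1, 1), (1, 2)],
    "cvec__preprocessor": ["stem", "lem"],  # placeholders for the two functions
}
n_candidates, n_fits = count_fits(grid, cv=5)
print(n_candidates, n_fits)
```

That is 24 candidate settings and 120 cross-validation fits, plus one final refit on the full training set (assuming the default `refit=True`). Each fit also stems or lemmatizes every post, which likely explains much of the long run time.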


In [27]:
# Getting accuracy scores for train and test sets
gs.score(X_train, y_train), gs.score(X_test, y_test)

(0.9726375176304655, 0.9568527918781726)

In [28]:
# Getting best parameters from gridsearch
gs.best_params_

{'cvec__max_features': 5000,
 'cvec__min_df': 3,
 'cvec__ngram_range': (1, 1),
 'cvec__preprocessor': <function __main__.subreddit_preprocessor_lem(str_input)>}

In [29]:
# Getting best score from gridsearch
gs.best_score_

0.9706629055007052

### Modeling without preprocessors and changing params

In [30]:
# Instantiate GridSearchCV2
# Used below link to pick the best n_jobs
# https://towardsdatascience.com/understanding-the-n-jobs-parameter-to-speedup-scikit-learn-classification-26e3d1220c28#:~:text=According%20to%20the%20official%20scikit,1%20means%20using%20all%20processors.

pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('nb', MultinomialNB())])


params2 = {
    'cvec__max_features': [5000, 6000, 7000, None], 
    'cvec__min_df': [2, 3], 
    'cvec__ngram_range': [(1,1), (1,2), (2,2), (3,3)]
}

gs2 = GridSearchCV(pipe, params2, cv=5, n_jobs = -1) 

In [31]:
%%time
# Fitting training set
gs2.fit(X_train, y_train)

CPU times: total: 7.73 s
Wall time: 2min 42s


In [32]:
# Getting accuracy scores for train and test sets
gs2.score(X_train, y_train), gs2.score(X_test, y_test) 

(0.9968970380818054, 0.9788494077834179)

In [33]:
# Getting best parameters from gridsearch above
gs2.best_params_

{'cvec__max_features': None, 'cvec__min_df': 2, 'cvec__ngram_range': (2, 2)}
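
The best `ngram_range` here is `(2, 2)`, meaning the features are pairs of adjacent words only. A small sketch of how word n-grams are built for a given range (the helper `word_ngrams` is our own, and it assumes the text is already tokenized):

```python
def word_ngrams(tokens, ngram_range):
    """Return all word n-grams for every n in the inclusive range (lo, hi)."""
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "lee rang the bell".split()
unigrams = word_ngrams(tokens, (1, 1))
bigrams = word_ngrams(tokens, (2, 2))
both = word_ngrams(tokens, (1, 2))
print(unigrams)
print(bigrams)
print(both)
```

With `(2, 2)` single words are not used as features at all, so the model relies entirely on two-word phrases.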

In [34]:
# Getting best score from gridsearch above
gs2.best_score_

0.9873060648801129

### Modeling same as above but replacing None max features with 4000

In [35]:
# Instantiate GridSearchCV3
# Used below link to pick the best n_jobs
# https://towardsdatascience.com/understanding-the-n-jobs-parameter-to-speedup-scikit-learn-classification-26e3d1220c28#:~:text=According%20to%20the%20official%20scikit,1%20means%20using%20all%20processors.

pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('nb', MultinomialNB())])


params3 = {
    'cvec__max_features': [4000, 5000, 6000, 7000], 
    'cvec__min_df': [2, 3], 
    'cvec__ngram_range': [(1,1), (1,2), (2,2), (3,3)]
}

gs3 = GridSearchCV(pipe, params3, cv=5, n_jobs = -1) 

In [36]:
%%time
# Fitting training set
gs3.fit(X_train, y_train)

CPU times: total: 6.42 s
Wall time: 2min 53s


In [37]:
# Getting accuracy scores for train and test sets
gs3.score(X_train, y_train), gs3.score(X_test, y_test)

(0.9844851904090268, 0.9737732656514383)

In [38]:
# Getting best parameters from gridsearch above
gs3.best_params_

{'cvec__max_features': 7000, 'cvec__min_df': 3, 'cvec__ngram_range': (1, 1)}

In [39]:
# Getting best score from gridsearch above
gs3.best_score_

0.9771509167842032

### Modeling same as above but replacing CountVectorizer with TfidfVectorizer

In [40]:
# Instantiate GridSearchCV4
# Used below link to pick the best n_jobs
# https://towardsdatascience.com/understanding-the-n-jobs-parameter-to-speedup-scikit-learn-classification-26e3d1220c28#:~:text=According%20to%20the%20official%20scikit,1%20means%20using%20all%20processors.

pipe2 = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('nb', MultinomialNB())])


params2 = {
    'tvec__max_features': [5000, 6000, 7000, None], 
    'tvec__min_df': [2, 3], 
    'tvec__ngram_range': [(1,1), (1,2), (2,2), (3,3)]
}

gs4 = GridSearchCV(pipe2, params2, cv=5, n_jobs = -1) 

In [41]:
%%time
# Fitting training set
gs4.fit(X_train, y_train)

CPU times: total: 7.5 s
Wall time: 2min 45s


In [42]:
# Getting accuracy scores for train and test sets
gs4.score(X_train, y_train), gs4.score(X_test, y_test)

(0.9974612129760225, 0.9796954314720813)

In [43]:
# Getting best parameters from gridsearch above
gs4.best_params_

{'tvec__max_features': None, 'tvec__min_df': 2, 'tvec__ngram_range': (2, 2)}

In [44]:
# Getting best score from gridsearch above
gs4.best_score_ 

0.9873060648801129
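
TF-IDF reweights raw counts so that terms appearing in many posts count for less. Below is a simplified sketch, assuming scikit-learn's default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The function `tfidf_rows` and the two toy documents are illustrative, and whitespace splitting stands in for the real tokenizer.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {token: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + d)) + 1 for tok, d in df.items()}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = {tok: count * idf[tok] for tok, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({tok: w / norm for tok, w in weights.items()})
    return rows

# Toy documents for illustration.
rows = tfidf_rows(["the hotel suite", "the finale"])
print(rows[0])
```

In the first document, "the" appears in both documents and so receives a lower weight than "hotel" or "suite", which appear in only one.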

### Modeling using RandomForest

In [45]:
# Instantiate GridSearchCV5
# Used below link to pick the best n_jobs
# https://towardsdatascience.com/understanding-the-n-jobs-parameter-to-speedup-scikit-learn-classification-26e3d1220c28#:~:text=According%20to%20the%20official%20scikit,1%20means%20using%20all%20processors.

pipe3 = Pipeline([
    ('cvec', CountVectorizer(ngram_range=(2,2), min_df=2)),
    ('rf', RandomForestClassifier(random_state = 2024))])


params4 = {
    "rf__n_estimators":[100, 150, 200, 500],
    "rf__max_depth": [None, 2, 4],
    "rf__min_samples_leaf": [1, 2, 4]
}

gs5 = GridSearchCV(pipe3, params4, cv=5, n_jobs = -1) 

In [46]:
%%time
# Fitting training set
gs5.fit(X_train, y_train)

CPU times: total: 13.8 s
Wall time: 5min 51s


In [47]:
# Getting accuracy scores for train and test sets
gs5.score(X_train, y_train), gs5.score(X_test, y_test)

(1.0, 0.9754653130287648)

In [48]:
# Getting best parameters from gridsearch above
gs5.best_params_

{'rf__max_depth': None, 'rf__min_samples_leaf': 1, 'rf__n_estimators': 150}

In [49]:
# Getting best score from gridsearch above
gs5.best_score_

0.9734837799717913

### Modeling with RandomForest and TfidfVectorizer

In [50]:
# Instantiate GridSearchCV6
# Used below link to pick the best n_jobs
# https://towardsdatascience.com/understanding-the-n-jobs-parameter-to-speedup-scikit-learn-classification-26e3d1220c28#:~:text=According%20to%20the%20official%20scikit,1%20means%20using%20all%20processors.

pipe4 = Pipeline([
    ('tvec', TfidfVectorizer(ngram_range=(2,2), min_df=2)),
    ('rf', RandomForestClassifier(random_state = 2024))])


params4 = {
    "rf__n_estimators":[100, 150, 200, 500],
    "rf__max_depth": [None, 2, 4],
    "rf__min_samples_leaf": [1, 2, 4]
}

gs6 = GridSearchCV(pipe4, params4, cv=5, n_jobs = -1) 

In [51]:
%%time
# Fitting training set
gs6.fit(X_train, y_train)

CPU times: total: 24 s
Wall time: 6min 31s


In [52]:
# Getting accuracy scores for train and test sets
gs6.score(X_train, y_train), gs6.score(X_test, y_test)

(1.0, 0.9771573604060914)

In [53]:
# Getting best parameters from gridsearch above
gs6.best_params_

{'rf__max_depth': None, 'rf__min_samples_leaf': 1, 'rf__n_estimators': 500}

In [54]:
# Getting best score from gridsearch above
gs6.best_score_

0.9740479548660085

## Conclusion and Recommendations

| Num | Vectorizer | Model | Best CV Score | Train Accuracy | Test Accuracy | Best Parameters |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | CountVectorizer | Multinomial Naïve Bayes | 0.9873 | 0.9969 | 0.9788 | {'cvec__max_features': None, 'cvec__min_df': 2, 'cvec__ngram_range': (2, 2)} |
| 2 | TfidfVectorizer | Multinomial Naïve Bayes | 0.9873 | 0.9975 | 0.9797 | {'tvec__max_features': None, 'tvec__min_df': 2, 'tvec__ngram_range': (2, 2)} |
| 3 | CountVectorizer | Random Forest Classifier | 0.9735 | 1.0 | 0.9755 | {'rf__max_depth': None, 'rf__min_samples_leaf': 1, 'rf__n_estimators': 150} |
| 4 | TfidfVectorizer | Random Forest Classifier | 0.9740 | 1.0 | 0.9772 | {'rf__max_depth': None, 'rf__min_samples_leaf': 1, 'rf__n_estimators': 500} |

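All the models above outperformed the baseline accuracy of 0.510049, which is the share of the majority class. Since the focus was on getting as many correct predictions as possible, we compared the models on test accuracy. The Multinomial Naïve Bayes with TfidfVectorizer (the second model in the table above) had the best predictive performance, with a test accuracy of 0.9797. However, the Multinomial Naïve Bayes with CountVectorizer was very close at 0.9788. Both Random Forest models reached a training accuracy of 1.0 with lower test accuracy, which suggests they overfit somewhat more than the Naïve Bayes models.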
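
The comparison can be reproduced with a short sketch that ranks the four models by test accuracy and reports the gap between train and test accuracy. The numbers are copied from the table above; the tuple layout and variable names are our own.

```python
# (model, best CV score, train accuracy, test accuracy), copied from the table.
results = [
    ("CountVectorizer + MultinomialNB", 0.9873, 0.9969, 0.9788),
    ("TfidfVectorizer + MultinomialNB", 0.9873, 0.9975, 0.9797),
    ("CountVectorizer + RandomForest", 0.9735, 1.0, 0.9755),
    ("TfidfVectorizer + RandomForest", 0.9740, 1.0, 0.9772),
]

# Rank by test accuracy, highest first.
ranked = sorted(results, key=lambda r: r[3], reverse=True)
for name, cv_score, train_acc, test_acc in ranked:
    gap = train_acc - test_acc  # a larger gap hints at more overfitting
    print(f"{name:35s} test={test_acc:.4f} gap={gap:.4f}")

best_name = ranked[0][0]
```

The two Naïve Bayes models rank first and second and also show the smaller train/test gaps.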

As next steps, we could include more subreddits that are relevant and similar to these in our research, labeling all of them as class 0 while UnresolvedMysteries remains class 1, so that the problem stays a binary classification. This may expand our modeling capacity. In addition, image posts were not taken into account when picking the subreddits.

The model works well but does not achieve 100% accuracy. To handle the posts it gets wrong, a "Misplaced Subreddit Post" button could be a good idea, so that users can let us know when a post seems to be in the wrong subreddit.