# Project 3: Web APIs and Classification: Model Benchmarks

In [1]:
#Imports:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import requests
import re
from bs4 import BeautifulSoup as bs
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import random
import time

%matplotlib inline

> <font size=3 color='blue'> You don't have to import some of these libraries for this notebook

In [2]:
# Set the graph style
plt.style.use('ggplot')

## Reading the dataframe

In [3]:
final_df = pd.read_csv('./datasets/final_df.csv')
final_df

Unnamed: 0,text,label
0,driven individual rushing towards dream ever s...,1.0
1,reduce bounce rate webpage,1.0
2,made animated summary lean start eric ries hop...,1.0
3,skate ramp business,1.0
4,help getting textile prototype created,1.0
...,...,...
2161,trying learn various ing strategy came across ...,0.0
2162,pretend know lot finance economics sold positi...,0.0
2163,bill ackman bet market recovery despite covid ...,0.0
2164,news covid vaccine drugmaker pfizer pfe partne...,0.0


## Train test split

Split the data into training and test sets before transforming the text with the count vectorizer, so that the vectorizer's vocabulary is learned from the training set only.

In [4]:
X = final_df['text']
y = final_df['label']

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

# Reset the indexes
X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

In [6]:
print(X_train.shape)
print(X_test.shape)

(1624,)
(542,)


## Transforming the text using `CountVectorizer`

In [7]:
# Import CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the "CountVectorizer" object
cvec = CountVectorizer(analyzer = "word",
                             lowercase= False) 

In [8]:
# Learn the vocabulary from the training set, then convert both sets to token-count matrices
X_train_vec = cvec.fit_transform(X_train)
X_test_vec = cvec.transform(X_test)

In [9]:
# Convert X_train into a DataFrame.

X_train_df = pd.DataFrame(X_train_vec.toarray(),
                          columns=cvec.get_feature_names())
X_train_df

Unnamed: 0,aa,aaa,aapl,aar,aaron,aaxn,aaz,ab,abandon,abbv,...,zm,zoetis,zone,zoo,zookeeper,zoom,zts,zuck,zuckerberg,zweig
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,2,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1619,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1620,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1621,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1622,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
# Convert X_test into a DataFrame.
X_test_df = pd.DataFrame(X_test_vec.toarray(),
                         columns=cvec.get_feature_names())

X_test_df

Unnamed: 0,aa,aaa,aapl,aar,aaron,aaxn,aaz,ab,abandon,abbv,...,zm,zoetis,zone,zoo,zookeeper,zoom,zts,zuck,zuckerberg,zweig
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
537,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
538,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
539,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
540,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Baseline Accuracy

In [11]:
y_train.value_counts(normalize=True)

1.0    0.538793
0.0    0.461207
Name: label, dtype: float64

The baseline accuracy is about 0.54: the majority class is class 1 (the entrepreneur subreddit), which makes up 53.9% of the training set.

The baseline is the accuracy obtained by always predicting the majority class, so a useful model must score higher than this.
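As a minimal sketch of the same idea, scikit-learn's `DummyClassifier` can act as the "model without features". The labels below are a small hypothetical sample, not the project data, and the feature matrix is a placeholder because the dummy model ignores it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels for illustration only: 7 posts of class 1, 6 of class 0
toy_labels = np.array([1] * 7 + [0] * 6)
# Placeholder features; the dummy model never looks at them
toy_features = np.zeros((len(toy_labels), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_features, toy_labels)

baseline_accuracy = baseline.score(toy_features, toy_labels)
majority_share = np.mean(toy_labels == 1)
print(baseline_accuracy, majority_share)
```

The baseline accuracy equals the share of the majority class, which is why `value_counts(normalize=True)` is enough to read it off.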

## Training the logistic regression CV model

In [12]:
from sklearn.linear_model import LogisticRegressionCV

# Find the best solver
solver_lst = ['newton-cg','lbfgs','liblinear','sag', 'saga']

for s in solver_lst:
    lrcv = LogisticRegressionCV(Cs=[0.00001,0.001, 0.01, 0.1,1], cv=5, n_jobs=-1, solver=s, max_iter=7000)
    lrcv.fit(X_train_vec, y_train)
    test_score = lrcv.score(X_test_vec, y_test)
    print(f'For solver {s}, the test score is {test_score}')

For solver newton-cg, the test score is 0.9059040590405905
For solver lbfgs, the test score is 0.9059040590405905
For solver liblinear, the test score is 0.9040590405904059
For solver sag, the test score is 0.9114391143911439
For solver saga, the test score is 0.9132841328413284


The best solver is `saga`, as it has the highest test accuracy.

In [13]:
lrcv = LogisticRegressionCV(Cs=[0.00001,0.001, 0.01, 0.1,1], cv=5, n_jobs=-1, solver='saga', max_iter=5000)
lrcv.fit(X_train_vec, y_train)

LogisticRegressionCV(Cs=[1e-05, 0.001, 0.01, 0.1, 1], cv=5, max_iter=5000,
                     n_jobs=-1, solver='saga')

In [14]:
# Best C value (inverse of regularization strength) from cross-validation
lrcv.C_

array([0.1])

`C` is the inverse of the regularization strength, which is used to reduce a model's overfitting. The lower the value of `C`, the more regularization is applied to the model.
The Logistic Regression CV found 0.1 to be the best value of `C`.
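The effect of `C` can be sketched on synthetic data. This is only an illustration under assumed settings (a generated dataset and two arbitrary `C` values), not a rerun of the model above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only; not the subreddit posts
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

coef_norms = {}
for c_value in [0.001, 1.0]:
    model = LogisticRegression(C=c_value, max_iter=1000)
    model.fit(X_toy, y_toy)
    # Size of the coefficient vector: stronger regularization shrinks it
    coef_norms[c_value] = np.linalg.norm(model.coef_)

print(coef_norms)
```

A smaller `C` pulls the coefficients toward zero, which is the mechanism that limits overfitting.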

In [15]:
# Score for the training set
lrcv.score(X_train_vec, y_train)

0.9673645320197044

In [16]:
# Score for the test set
lrcv.score(X_test_vec, y_test)

0.9132841328413284

The model appears to be overfitting: the training accuracy (0.967) is higher than the test accuracy (0.913).

I can reduce the number of features in the model to reduce the variance, which should decrease the overfitting and may improve the test accuracy.

I can also increase the regularization strength of the model to reduce the overfitting.

In [17]:
from sklearn.metrics import confusion_matrix

y_pred = lrcv.predict(X_test_vec)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print("True Negative:", tn)
print("False Positive:", fp)
print("True Positive:", tp)
print("False Negative:", fn)

True Negative: 231
False Positive: 19
True Positive: 264
False Negative: 28


In [18]:
confusion_matrix(y_test, y_pred)

array([[231,  19],
       [ 28, 264]], dtype=int64)

In [19]:
specificity = tn / (tn+fp) # How accurately can the model predict for the negative class
sensitivity = tp / (tp+fn) # How accurately can the model predict for the positive class
accuracy = (tp+tn) / (tp+tn+fp+fn)

print('Specificity:', round(specificity,2))
print('Sensitivity:',round(sensitivity,2))
print('Accuracy:',round(accuracy,2))

Specificity: 0.92
Sensitivity: 0.9
Accuracy: 0.91


Sensitivity (0.90) is slightly lower than specificity (0.92), which means the model identifies investing posts (the negative class) slightly more reliably than entrepreneur posts (the positive class). However, since the goal is simply to predict the subreddit each post belongs to, optimizing for sensitivity or specificity alone would not be a good measure. Accuracy is a better performance metric here.
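As a cross-check, the same three rates can be recovered with scikit-learn's metric functions. The sketch below rebuilds label arrays from the confusion-matrix counts reported above (231 TN, 19 FP, 28 FN, 264 TP); the reconstructed arrays are a stand-in for `y_test` and `y_pred`, since only the counts matter for these metrics.

```python
from sklearn.metrics import accuracy_score, recall_score

# Rebuild labels from the reported counts: TN, FP, FN, TP
true_labels = [0] * 231 + [0] * 19 + [1] * 28 + [1] * 264
pred_labels = [0] * 231 + [1] * 19 + [0] * 28 + [1] * 264

# Sensitivity is recall for the positive class (1, entrepreneur)
sens_check = recall_score(true_labels, pred_labels, pos_label=1)
# Specificity is recall for the negative class (0, investing)
spec_check = recall_score(true_labels, pred_labels, pos_label=0)
acc_check = accuracy_score(true_labels, pred_labels)

print(round(spec_check, 2), round(sens_check, 2), round(acc_check, 2))
```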

A total of 47 posts were misclassified (19 false positives and 28 false negatives); these are investigated further below.

## Interpreting the coefficients for the Logistic Regression

In [20]:
# Top coefficients for the positive class, the entrepreneur subreddit
lrcv_coef = pd.DataFrame(np.exp(lrcv.coef_[0]),
                          index=cvec.get_feature_names(),
                          columns=['Coefficients']).sort_values('Coefficients',ascending=False)
lrcv_coef.head(10)

Unnamed: 0,Coefficients
business,3.105827
idea,2.274191
product,1.670755
startup,1.62949
marketing,1.468443
start,1.467806
website,1.443104
work,1.420869
help,1.378951
want,1.359499


The logistic regression coefficients represent the change in the log odds that an observation is in the target class, 1, for a one-unit increase in the corresponding X variable. To make sense of them, the log-odds coefficients are converted to odds ratios by exponentiating them.

For example:

For every one-unit increase in the count of `business`, the odds that the observation is in the entrepreneur class are multiplied by about 3.1, provided all other variables are held constant.
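To make the odds-ratio reading concrete, here is a small worked sketch. The odds ratio for `business` is taken from the table above; the starting probability of 0.5 and the helper `prob_after_increase` are hypothetical, chosen only for illustration.

```python
import math

# Odds ratio for `business`, i.e. exp(coefficient), from the table above
odds_ratio_business = 3.105827
# The underlying log-odds coefficient
coef_business = math.log(odds_ratio_business)

def prob_after_increase(prob_before, n_units=1):
    """Probability after adding n_units occurrences, other features fixed."""
    odds_before = prob_before / (1 - prob_before)
    odds_after = odds_before * math.exp(coef_business * n_units)
    return odds_after / (1 + odds_after)

# Hypothetical post that starts at even odds (probability 0.5)
prob_one_more = prob_after_increase(0.5)
print(round(coef_business, 3), round(prob_one_more, 3))
```

The odds are multiplied by a constant factor, but the change in probability depends on where the post starts, so the factor of 3.1 should not be read as a change in probability.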

In [21]:
# Top coefficients for the negative class, for the investing subreddit
lrcv_coef.tail(10)

Unnamed: 0,Coefficients
long,0.708195
market,0.700323
portfolio,0.696829
fund,0.693976
palantir,0.693638
buy,0.6397
ment,0.637986
etf,0.617748
ing,0.423148
stock,0.272978


These tokens have the lowest odds ratios (all below 1), so each occurrence lowers the odds of the entrepreneur class and makes the investing class more likely.

## Logistic Regression: Identifying posts that were misclassified

In [86]:
# Index of Misclassified posts for the test set
index_misclassified_post = y_test[y_pred != y_test].index
index_misclassified_post

Int64Index([ 12,  24,  38,  47,  90, 103, 145, 146, 152, 178, 185, 208, 211,
            223, 232, 246, 290, 316, 385, 387, 420, 432, 437, 499, 500, 503,
            522],
           dtype='int64')

In [87]:
# Creating the dataframe for the misclassified posts
log_reg_misclass_posts = pd.DataFrame(X_test[index_misclassified_post])
log_reg_misclass_posts

Unnamed: 0,text
12,digital advertisement market reach b
24,competitor popping like crazy discussion
38,affordable preferably online magazine subscrip...
47,weed business profitable worth california
90,want learn valuing business best place start
103,suppose based vaccine anticipation news wonder...
145,hi guy stock geared solely towards heavily gho...
146,started ing march adding k month account gradu...
152,researched much understand mortgage company of...
178,stm


In [88]:
# Setting the index values for y_pred
y_pred_series = pd.Series(y_pred, index=X_test.index)
y_pred_series

0      1.0
1      0.0
2      1.0
3      0.0
4      1.0
      ... 
537    0.0
538    1.0
539    0.0
540    0.0
541    1.0
Length: 542, dtype: float64

In [89]:
# Adding the predicted and actual values for the labels
log_reg_misclass_posts['y_true'] =  y_test[index_misclassified_post]
log_reg_misclass_posts['y_pred'] = y_pred_series[index_misclassified_post]

In [90]:
# Misclassified posts and their predicted and actual classification
log_reg_misclass_posts

Unnamed: 0,text,y_true,y_pred
12,digital advertisement market reach b,0.0,1.0
24,competitor popping like crazy discussion,1.0,0.0
38,affordable preferably online magazine subscrip...,1.0,0.0
47,weed business profitable worth california,1.0,0.0
90,want learn valuing business best place start,0.0,1.0
103,suppose based vaccine anticipation news wonder...,0.0,1.0
145,hi guy stock geared solely towards heavily gho...,0.0,1.0
146,started ing march adding k month account gradu...,0.0,1.0
152,researched much understand mortgage company of...,0.0,1.0
178,stm,0.0,1.0


## Combining the misclassified posts and the probability of the posts being classified

In [107]:
# The actual probability of the class being predicted by the model
combined_mis = pd.DataFrame(lrcv.predict_proba(X_test_vec[index_misclassified_post]), columns=['Invest prob', 'Entrepreneur prob'])
combined_mis = pd.concat([combined_mis, X_test[index_misclassified_post].reset_index(drop=True)], axis=1) # Combining the text posts and their probabilities
combined_mis

Unnamed: 0,Invest prob,Entrepreneur prob,text
0,0.546764,0.453236,digital advertisement market reach b
1,0.497689,0.502311,competitor popping like crazy discussion
2,0.452619,0.547381,affordable preferably online magazine subscrip...
3,0.251686,0.748314,weed business profitable worth california
4,0.086048,0.913952,want learn valuing business best place start
5,0.37776,0.62224,suppose based vaccine anticipation news wonder...
6,0.875781,0.124219,hi guy stock geared solely towards heavily gho...
7,0.884543,0.115457,started ing march adding k month account gradu...
8,0.790496,0.209504,researched much understand mortgage company of...
9,0.490412,0.509588,stm


Some posts were misclassified because they contain only a few words, and their predicted class probabilities are close to 0.5, which makes it difficult for the classifier to separate them.
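One way to single out such borderline posts is to measure how far each predicted probability is from 0.5. The sketch below uses the first five probabilities from the table above; the 0.1 cut-off and the column name `margin` are assumptions made for illustration.

```python
import pandas as pd

# First five 'Entrepreneur prob' values from the table above
probs = pd.DataFrame({
    "Entrepreneur prob": [0.453236, 0.502311, 0.547381, 0.748314, 0.913952],
})

# Distance from the 0.5 decision boundary; small values mean an uncertain prediction
probs["margin"] = (probs["Entrepreneur prob"] - 0.5).abs()

# Assumed cut-off: treat predictions within 0.1 of the boundary as borderline
borderline = probs[probs["margin"] < 0.1]
print(borderline)
```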

In [83]:
# Getting the frequency of the words in the misclassified posts
misclass_words = combined_mis['text'].str.cat(sep=' ')

misclass_words_dict = {}
for word in misclass_words.split():
    if word in misclass_words_dict:
        misclass_words_dict[word] = misclass_words_dict[word] + 1
    else:
        misclass_words_dict[word] = 1
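The counting loop above could also be written with `collections.Counter` from the standard library. This is a sketch on two hypothetical posts rather than on `combined_mis`.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for combined_mis['text']
toy_posts = pd.Series(["stock business", "business idea"])

# Join all posts into one string, split on whitespace, and count each word
toy_word_counts = Counter(toy_posts.str.cat(sep=" ").split())

# most_common returns (word, count) pairs sorted by descending count
top_word = toy_word_counts.most_common(1)
print(top_word)
```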

In [105]:
misclass_words_freq = pd.DataFrame(misclass_words_dict.values(), 
             index=misclass_words_dict.keys(),
             columns=['Frequency']).sort_values('Frequency',ascending=False)

# Top 10 words that were in misclassified posts
misclass_words_freq.head(10)

Unnamed: 0,Frequency
company,29
bank,23
revenue,17
reza,13
going,13
business,13
need,12
also,12
startup,12
financing,12


In [106]:
lrcv_coef[lrcv_coef.index.isin(list(misclass_words_freq.index))]

Unnamed: 0,Coefficients
business,3.105827
product,1.670755
startup,1.629490
marketing,1.468443
start,1.467806
...,...
fund,0.693976
buy,0.639700
ment,0.637986
ing,0.423148


The posts and titles that were misclassified as entrepreneur contained words with strong coefficients, such as 'business'.

## Training the Multinomial Naive Bayes Model

I'll be using Multinomial Naive Bayes as the X columns contain the integer counts of the terms in each document.

In [31]:
# Import Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB

In [32]:
# Instantiate the model
mnb = MultinomialNB()
# Fit the training set
mnb.fit(X_train_vec, y_train) 

# Accuracy score of the training set
mnb.score(X_train_vec, y_train)

0.9624384236453202

In [33]:
# Accuracy score of the test set
mnb.score(X_test_vec, y_test)

0.9428044280442804

There is slight overfitting, as the training accuracy (0.962) is higher than the test accuracy (0.943), but Multinomial Naive Bayes seems to generalize better than Logistic Regression and scores well above the baseline.

## Interpreting the probabilities for the Multinomial Naive Bayes Model

In [34]:
mnb_prob = pd.DataFrame({'Token':cvec.get_feature_names(),'Negative Class Prob':mnb.feature_log_prob_[0],
              'Positive Class Prob':mnb.feature_log_prob_[1]})
mnb_prob.head()

Unnamed: 0,Token,Negative Class Prob,Positive Class Prob
0,aa,-10.193635,-10.329
1,aaa,-10.193635,-11.022148
2,aapl,-9.500488,-11.022148
3,aar,-10.193635,-11.022148
4,aaron,-10.886783,-10.329


In [35]:
# Top log probabilities for the positive class, the entrepreneur subreddit

mnb_prob.sort_values('Positive Class Prob', ascending=False)[['Token','Positive Class Prob']].head(10)

Unnamed: 0,Token,Positive Class Prob
1274,business,-4.600525
6799,people,-5.06631
10440,would,-5.155679
5429,like,-5.161361
4022,get,-5.214005
6442,one,-5.311721
4548,idea,-5.321704
10410,work,-5.338568
9461,time,-5.345394
7223,product,-5.352267


In [36]:
# Top log probabilities for the negative class, the investing subreddit
mnb_prob.sort_values('Negative Class Prob', ascending=False)[['Token','Negative Class Prob']].head(10)

Unnamed: 0,Token,Negative Class Prob
8906,stock,-4.625291
1776,company,-4.762099
5702,market,-4.865759
10490,year,-5.093769
10440,would,-5.199807
8368,share,-5.28098
7169,price,-5.397845
5429,like,-5.539675
4780,ing,-5.666427
4678,inc,-5.710633


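These values are log probabilities of each token given the class: the higher (closer to zero) the value, the more frequently the token appears in that class. For the positive class, the entrepreneur subreddit, `business` is the most frequent token.

The same applies to the negative class, the investing subreddit, where `stock` is the most frequent token. Common words such as `would` and `like` rank highly in both classes, so a high log probability alone does not make a token discriminative.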
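A token's usefulness for telling the classes apart can be sketched as the difference between its two log probabilities. The example below uses a tiny hypothetical corpus, not the project data, and the name `log_ratio` is introduced here for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical corpus: label 1 = entrepreneur, label 0 = investing
toy_docs = ["business idea startup", "business product marketing",
            "stock market etf", "stock portfolio fund"]
toy_y = [1, 1, 0, 0]

toy_cvec = CountVectorizer()
toy_counts = toy_cvec.fit_transform(toy_docs)
toy_mnb = MultinomialNB().fit(toy_counts, toy_y)

# Tokens ordered by their column index in the count matrix
tokens = sorted(toy_cvec.vocabulary_, key=toy_cvec.vocabulary_.get)

# log P(token | class 1) - log P(token | class 0):
# positive values lean entrepreneur, negative values lean investing
log_ratio = pd.Series(
    toy_mnb.feature_log_prob_[1] - toy_mnb.feature_log_prob_[0],
    index=tokens,
).sort_values(ascending=False)
print(log_ratio)
```

Tokens that are common in both classes end up with a ratio near zero, which separates them from tokens that are frequent in only one class.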

## Confusion matrix for Multinomial Naive Bayes Model

In [37]:
y_pred = mnb.predict(X_test_vec)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print("True Negative:", tn)
print("False Positive:", fp)
print("True Positive:", tp)
print("False Negative:", fn)

True Negative: 234
False Positive: 16
True Positive: 277
False Negative: 15


In [38]:
confusion_matrix(y_test, y_pred)

array([[234,  16],
       [ 15, 277]], dtype=int64)

In [39]:
specificity = tn / (tn+fp) # How accurately can the model predict for the negative class
sensitivity = tp / (tp+fn) # How accurately can the model predict for the positive class
accuracy = (tp+tn) / (tp+tn+fp+fn)

print('Specificity:', round(specificity,2))
print('Sensitivity:',round(sensitivity,2))
print('Accuracy:',round(accuracy,2))

Specificity: 0.94
Sensitivity: 0.95
Accuracy: 0.94


A total of 31 posts were misclassified (16 false positives and 15 false negatives), which is considerably fewer than the 47 posts misclassified by the logistic regression. The scores are also slightly higher than those of the logistic regression model.

## Misclassified posts for Multinomial Naive Bayes Model

In [40]:
# Index of Misclassified posts for the test set
index_misclassified_post = y_test[y_pred != y_test].index
index_misclassified_post

Int64Index([ 44,  47,  61,  67,  90, 103, 178, 185, 208, 211, 221, 223, 232,
            246, 251, 290, 295, 327, 385, 387, 393, 420, 421, 430, 432, 464,
            470, 499, 500, 503, 522],
           dtype='int64')

In [41]:
# Creating the dataframe for the misclassified posts
mnb_misclass_posts = pd.DataFrame(X_test[index_misclassified_post])
mnb_misclass_posts

Unnamed: 0,text
44,raising usd ask anything
47,weed business profitable worth california
61,amazon start amazon pharmacy w free delivery p...
67,looking buying vending machine trying figure l...
90,want learn valuing business best place start
103,suppose based vaccine anticipation news wonder...
178,stm
185,new targeted matching service grow network des...
208,hello year hard know live chile unemployed sin...
211,st time importer home gym china


In [42]:
# Posts that were misclassified by both logistic regression and MNB
log_reg_misclass_posts['text'][log_reg_misclass_posts['text'].isin(mnb_misclass_posts['text'])]

61     amazon start amazon pharmacy w free delivery p...
67     looking buying vending machine trying figure l...
90          want learn valuing business best place start
103    suppose based vaccine anticipation news wonder...
178                                                  stm
211                      st time importer home gym china
232                                      freight company
246    came across thought might interesting info pub...
290            example mortgage cost landlord price rent
295    actually able get lawsuit purchased within dat...
385    wondering hear mainly public consultation ofte...
421     anyone website stock footage image fat watermark
464        warrant conversion general sbews specifically
499    hi like find community avenue helped learning ...
503                       freight transportation company
Name: text, dtype: object

A total of 15 posts were misclassified by both models.

# Further Model evaluation

## Using Count Vectorizer

### Optimizing Logistic Regression model

In [43]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Pipe to add the count vectorizer and logistic regression
pipe_logreg = Pipeline([
    ('countvec', CountVectorizer(lowercase=False)), # Already converted to lowercase
    ('logreg', lrcv),
])

# Parameters to test the different hyper parameters
params_log_reg = {
    'countvec__ngram_range': [(1,1),(1,2),(2,2)], # Testing unigrams only, unigrams and bigrams, bigrams only
    'countvec__max_features': [1000, 2000, 3000, 4000], # The full vocabulary has about 10,551 features, so try smaller vocabularies
    'countvec__min_df': [1,2], # Minimum number of documents a token must appear in to be included
    'countvec__max_df': [.8, .9], # Ignore tokens that appear in more than 80% or 90% of the documents
}
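The effect of `min_df` and `max_df` can be sketched on a tiny hypothetical corpus (four short documents made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: 'stock' is in every document, 'etf' in only one
toy_corpus = ["stock market", "stock fund", "stock etf", "stock market fund"]

# min_df=2 drops tokens found in fewer than 2 documents ('etf');
# max_df=0.8 drops tokens found in more than 80% of documents ('stock')
toy_vectorizer = CountVectorizer(min_df=2, max_df=0.8)
toy_vectorizer.fit(toy_corpus)

kept_tokens = sorted(toy_vectorizer.vocabulary_)
print(kept_tokens)
```

Both thresholds count documents containing the token, not the total number of occurrences.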

In [44]:
# Instantiate the GridSearchCV

gs_log_reg = GridSearchCV(pipe_logreg,
                 param_grid=params_log_reg,
                 cv=5)

In [45]:
# Fit the model
gs_log_reg.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('countvec',
                                        CountVectorizer(lowercase=False)),
                                       ('logreg',
                                        LogisticRegressionCV(Cs=[1e-05, 0.001,
                                                                 0.01, 0.1, 1],
                                                             cv=5,
                                                             max_iter=5000,
                                                             n_jobs=-1,
                                                             solver='saga'))]),
             param_grid={'countvec__max_df': [0.8, 0.9],
                         'countvec__max_features': [1000, 2000, 3000, 4000],
                         'countvec__min_df': [1, 2],
                         'countvec__ngram_range': [(1, 1), (1, 2), (2, 2)]})

In [46]:
# Find the best parameters
gs_log_reg.best_params_

{'countvec__max_df': 0.8,
 'countvec__max_features': 4000,
 'countvec__min_df': 1,
 'countvec__ngram_range': (1, 1)}

In [47]:
# Best score
gs_log_reg.best_score_

0.9002488129154796

In [48]:
# Accuracy score for the training set
gs_log_reg.score(X_train, y_train)

0.958743842364532

In [49]:
# Accuracy score the test set
gs_log_reg.score(X_test, y_test)

0.9095940959409594

In [50]:
y_pred = gs_log_reg.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print("True Negative:", tn)
print("False Positive:", fp)
print("True Positive:", tp)
print("False Negative:", fn)

True Negative: 230
False Positive: 20
True Positive: 263
False Negative: 29


Despite searching for the best hyperparameters and removing the stopwords, the accuracy score barely changes (test accuracy of 0.910 versus 0.913 before tuning).
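A possible variation, sketched here under assumptions, is to search the regularization strength in the same grid as the vectorizer settings by using a plain `LogisticRegression` in the pipeline instead of the nested `LogisticRegressionCV`. The corpus, grid values and fold count below are hypothetical and kept tiny so the example runs quickly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: label 1 = entrepreneur, label 0 = investing
toy_texts = ["business idea startup", "marketing product website",
             "start business help", "startup product idea",
             "business marketing work", "website business idea",
             "stock market etf", "portfolio fund buy",
             "stock price share", "etf fund market",
             "buy stock portfolio", "share market price"]
toy_targets = [1] * 6 + [0] * 6

toy_pipe = Pipeline([
    ("countvec", CountVectorizer(lowercase=False)),
    ("logreg", LogisticRegression(max_iter=5000)),
])

# Vectorizer and classifier hyperparameters are searched together
toy_params = {
    "countvec__ngram_range": [(1, 1), (1, 2)],
    "logreg__C": [0.01, 0.1, 1],
}

toy_search = GridSearchCV(toy_pipe, param_grid=toy_params, cv=3)
toy_search.fit(toy_texts, toy_targets)
print(toy_search.best_params_)
```

With this layout every combination of vectorizer setting and `C` is scored by the same cross-validation folds, so the chosen `C` is matched to the chosen vocabulary.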

### Optimizing Multinomial Naive Bayes model

In [51]:
# Pipe to add the count vectorizer and Multinomial Naive Bayes model
pipe_mnb = Pipeline([
    ('countvec', CountVectorizer(lowercase=False)), # Already converted to lowercase
    ('mnb', MultinomialNB())
])

# Parameters to test the different hyper parameters
params_mnb = {
    'countvec__ngram_range': [(1,1),(2,2)], # Testing unigrams only and bigrams only
    'countvec__max_features': [8000, 9000, 10000], # The full vocabulary has about 10,551 features, so try slightly smaller vocabularies
    'countvec__min_df': [1,2], # Minimum number of documents a token must appear in to be included
    'countvec__max_df': [.9, .95], # Ignore tokens that appear in more than 90% or 95% of the documents
    'mnb__alpha': [0.1,0.2], # Testing different smoothing (alpha) values
}

In [52]:
# Instantiate the GridSearchCV

gs_mnb = GridSearchCV(pipe_mnb,
                 param_grid=params_mnb,
                 cv=5)

In [53]:
# Fit the model
gs_mnb.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('countvec',
                                        CountVectorizer(lowercase=False)),
                                       ('mnb', MultinomialNB())]),
             param_grid={'countvec__max_df': [0.9, 0.95],
                         'countvec__max_features': [8000, 9000, 10000],
                         'countvec__min_df': [1, 2],
                         'countvec__ngram_range': [(1, 1), (2, 2)],
                         'mnb__alpha': [0.1, 0.2]})

In [54]:
# Best parameters for the model
gs_mnb.best_params_

{'countvec__max_df': 0.9,
 'countvec__max_features': 9000,
 'countvec__min_df': 1,
 'countvec__ngram_range': (1, 1),
 'mnb__alpha': 0.2}

In [55]:
# Best cross-validated score
gs_mnb.best_score_

0.926727445394112

In [56]:
# Score against the training set
gs_mnb.score(X_train, y_train)

0.9655172413793104

In [57]:
# Score against the test set
gs_mnb.score(X_test, y_test)

0.9464944649446494

The model is slightly overfitting, but it does better than the logistic regression model, with a higher accuracy score.

## Using TFIDF Vectorizer

### Optimizing Logistic Regression model

In [58]:
# Pipe to add the TFIDF vectorizer and logistic regression model

# Import TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer
pipe_tvec = Pipeline([
    ('tvec', TfidfVectorizer(lowercase=False)), # Already converted to lowercase
    ('logreg', lrcv)
])

# Parameters to test the different hyper parameters
params_log_reg = {
    'tvec__ngram_range': [(1,1),(1,2),(2,2)], # Testing unigrams only, unigrams and bigrams, bigrams only
    'tvec__max_features': [1000, 2000, 3000, 4000], # The full vocabulary has about 10,551 features, so try smaller vocabularies
    'tvec__min_df': [2,3], # Minimum number of documents a token must appear in to be included
    'tvec__max_df': [.8, .9], # Ignore tokens that appear in more than 80% or 90% of the documents
    #'logreg__solver': ['newton-cg', 'liblinear'], # Testing different algorithms
    #'logreg__C': [0.001, 0.01, 0.8] # Different alpha values, which are regularization hyper parameters to reduce the model's overfitting
}

In [59]:
# Instantiate the GridSearchCV

gs_tvec_log_reg = GridSearchCV(pipe_tvec,
                 param_grid=params_log_reg,
                 cv=5)

In [60]:
# Fit the model
gs_tvec_log_reg.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tvec',
                                        TfidfVectorizer(lowercase=False)),
                                       ('logreg',
                                        LogisticRegressionCV(Cs=[1e-05, 0.001,
                                                                 0.01, 0.1, 1],
                                                             cv=5,
                                                             max_iter=5000,
                                                             n_jobs=-1,
                                                             solver='saga'))]),
             param_grid={'tvec__max_df': [0.8, 0.9],
                         'tvec__max_features': [1000, 2000, 3000, 4000],
                         'tvec__min_df': [2, 3],
                         'tvec__ngram_range': [(1, 1), (1, 2), (2, 2)]})

In [61]:
# Best parameters for the model
gs_tvec_log_reg.best_params_

{'tvec__max_df': 0.8,
 'tvec__max_features': 4000,
 'tvec__min_df': 2,
 'tvec__ngram_range': (1, 1)}

In [62]:
# Best model score
gs_tvec_log_reg.best_score_

0.9144178537511871

In [63]:
# Score against the training set
gs_tvec_log_reg.score(X_train, y_train)

0.9802955665024631

In [64]:
# Score against the test set
gs_tvec_log_reg.score(X_test, y_test)

0.9372693726937269

With TFIDF, the test accuracy of the logistic regression model is two to three percentage points higher than with the count vectorizer.

### Optimizing Multinomial Naive Bayes model

In [66]:
# Pipe to add the TFIDF vectorizer and Multinomial Naive Bayes model
pipe_mnb = Pipeline([
    ('tvec', TfidfVectorizer(lowercase=False)), # Already converted to lowercase
    ('mnb', MultinomialNB())
])

# Parameters to test the different hyper parameters
params_mnb = {
    'tvec__ngram_range': [(1,1),(2,2)], # Testing unigrams only and bigrams only
    'tvec__max_features': [5000,8000, 9000, 10000], # The full vocabulary has about 10,551 features, so try smaller vocabularies
    'tvec__min_df': [1,2], # Minimum number of documents a token must appear in to be included
    'tvec__max_df': [.9, .95], # Ignore tokens that appear in more than 90% or 95% of the documents
    'mnb__alpha': [0.1,0.2], # Testing different smoothing (alpha) values
}

In [67]:
# Instantiate the GridSearchCV

gs_tvec_mnb = GridSearchCV(pipe_mnb,
                 param_grid=params_mnb,
                 cv=5)

In [68]:
# Fit the model
gs_tvec_mnb.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tvec',
                                        TfidfVectorizer(lowercase=False)),
                                       ('mnb', MultinomialNB())]),
             param_grid={'mnb__alpha': [0.1, 0.2], 'tvec__max_df': [0.9, 0.95],
                         'tvec__max_features': [5000, 8000, 9000, 10000],
                         'tvec__min_df': [1, 2],
                         'tvec__ngram_range': [(1, 1), (2, 2)]})

In [69]:
# Best parameters for the model
gs_tvec_mnb.best_params_

{'mnb__alpha': 0.1,
 'tvec__max_df': 0.9,
 'tvec__max_features': 10000,
 'tvec__min_df': 1,
 'tvec__ngram_range': (1, 1)}

In [70]:
# Best cross-validated score
gs_tvec_mnb.best_score_

0.94397150997151

In [71]:
# Score against the training set
gs_tvec_mnb.score(X_train, y_train)

0.9895320197044335

In [72]:
# Score against the test set
gs_tvec_mnb.score(X_test, y_test)

0.9501845018450185

The model is slightly overfitting, but it does better than the logistic regression model, with a higher accuracy score.

## Summary table of the accuracy scores for the models

| Set | Type of vectorizer | Type of model | Accuracy Score |
|-|-|-|-|
| Training | Count Vectorizer | Logistic Regression | 0.9581 |
| Training | TFIDF Vectorizer | Logistic Regression | 0.9778 |
| Test | Count Vectorizer | Logistic Regression | 0.9096 |
| Test | TFIDF Vectorizer | Logistic Regression | 0.9336 |
| Training | Count Vectorizer | Multinomial Naive Bayes | 0.9667 |
| Training | TFIDF Vectorizer | Multinomial Naive Bayes | 0.9864 |
| Test | Count Vectorizer | Multinomial Naive Bayes | 0.9354 |
| **Test** | **TFIDF Vectorizer** | **Multinomial Naive Bayes** | **0.9428** |

Overall, Multinomial Naive Bayes generally performs slightly better than Logistic Regression, and the TFIDF Vectorizer performs slightly better than the Count Vectorizer.

> <font size=3 color='blue'>Instead of doing the analysis of word coefficients and misclassified posts on the model that was not tuned with gridsearch, it would be better to do it on the 'best model' you selected. You should also be tuning the entire pipeline, not each step separately

# Conclusion and Recommendation

To conclude, posts and their titles transformed with the TFIDF Vectorizer performed slightly better than those transformed with the Count Vectorizer. Similarly, the Multinomial Naive Bayes (MNB) model performed slightly better than Logistic Regression.
Even though both models performed similarly, the Multinomial Naive Bayes model with the TFIDF Vectorizer was the best model, with a test accuracy of 0.9428, while the logistic regression model with the TFIDF Vectorizer came close with a test accuracy of 0.9336. Both models also beat the baseline accuracy of 0.5388.

I would recommend deploying the MNB model in the entrepreneur subreddit, as it has shown very high accuracy in filtering out investing posts from the entrepreneur subreddit. This would save the time and energy otherwise spent manually looking through the posts.

> <font size=3 color='blue'>Do infer the differences between the subreddits and possible use cases for the model / insights beyond just classification.