<a id='top'></a>
# Reddit API and Classification

**Previous:** [Data Cleaning and EDA](./02_data_cleaning_and_eda.ipynb)

## Preprocessing and Modeling
---

### Imports

In [1]:
# Standard data science imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#NLTK
from nltk.corpus import stopwords # Import the stop word list
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer

#Scikit-learn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve, recall_score, precision_score, balanced_accuracy_score, zero_one_loss

#### Data Imports

In [2]:
df = pd.read_csv('../datasets/cleaned_df.csv')

### Preprocessing stage
In this stage, we split the cleaned data into train and test sets using ```train_test_split```, holding out 20% of the data as the test set.

The **train data** `(train_title & train_subreddit)` goes through a second round of ```train_test_split``` to create our validation split `(X_train, X_test, y_train, y_test)`, which we use to compare models and find the one that generalizes best.

The **test data** `(test_title & test_subreddit)` remains our "unseen data", which will test the success of the final model selected.


The text then undergoes feature extraction (vectorization), for example with `CountVectorizer`, because words must be encoded as integers or floating point values before they can be used as input to machine learning algorithms.

#### 1. Create an unseen data set
In order to test whether the model is successful, we create a set of unseen data (the test set). This checks how well our model generalizes to data it has not seen.

Since the trend in technology changes with time, we will take a randomized sample of 20% of the dataset as the test set.

In [3]:
X = df['title']
y = df['subreddit']

train_title, test_title, train_subreddit, test_subreddit = train_test_split(X,
                                                                            y,
                                                                            stratify=y,
                                                                            random_state=42,
                                                                            test_size=0.20)

In [4]:
unseen_df = pd.DataFrame(data=[test_subreddit, test_title]).T

In [5]:
#Export unseen data to csv.
unseen_df.to_csv('../datasets/unseen.csv', index=False)

#### 2. Create validation set
This set will help us with model evaluation and hyperparameter tuning.
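As a rough check on the second split: the call in `In [6]` does not pass `test_size`, so `train_test_split` falls back to its default of 25%. The sketch below reproduces the resulting row counts by hand. The helper `split_sizes` is hypothetical, and the rounding rule (test share rounded up, remainder to train) is an assumption about how the fraction is turned into whole rows.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the remainder goes to train.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# 941 training rows and 314 validation rows are the sizes reported in this notebook.
n_rows_before_split = 941 + 314

n_train, n_valid = split_sizes(n_rows_before_split, 0.25)
print(n_train, n_valid)
```

Under these assumptions the 1255 rows left after holding out the unseen data split into 941 training rows and 314 validation rows.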

In [6]:
X_train, X_test, y_train, y_test = train_test_split(train_title,
                                                    train_subreddit,
                                                    random_state=42)

In [7]:
#preview X_train
X_train.head()

1319    target ad placeholders reference upcoming new ...
26      product video designed speed helpfulness googl...
674     xiaomi mi mix solidifies status bona fide work...
355              samsung galaxy z fold review future flat
1458                           shortcuts sunday september
Name: title, dtype: object

In [8]:
train_df = pd.DataFrame(data=[y_train, X_train]).T
test_df = pd.DataFrame(data=[y_test, X_test]).T

In [9]:
#Export validation set to csv.
train_df.to_csv('../datasets/train.csv', index=False)
test_df.to_csv('../datasets/test.csv', index=False)

#### 3. Feature Extraction: Vectorize our text data
In order to pass the text features into our model, we need to transform the title strings into a matrix of word (i.e. token) counts.


The ```CountVectorizer``` creates one column per word in the vocabulary, and each row records how many times that word is observed in the corresponding title string.
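To make the idea of a count matrix concrete, here is a minimal pure-Python sketch of a bag-of-words table built from three made-up titles. It is not how `CountVectorizer` is implemented; the real class also lowercases text and, with its default token pattern, ignores one-character tokens. The toy `titles` list is hypothetical.

```python
from collections import Counter

# Hypothetical toy titles, already lowercased and cleaned
titles = [
    "apple watch review",
    "new android app",
    "apple app store app",
]

# Vocabulary: every distinct token, sorted alphabetically
vocab = sorted({tok for title in titles for tok in title.split()})

# One row per title, one column per vocabulary word
rows = []
for title in titles:
    counts = Counter(title.split())
    rows.append([counts[word] for word in vocab])

print(vocab)
for row in rows:
    print(row)
```

Each row is mostly zeros because a single title only uses a few of the words in the vocabulary, which is also what the preview of `X_train_cvec` shows.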

In [10]:
# Instantiate our CountVectorizer.
cvec = CountVectorizer()
# Fit our CountVectorizer on the training data and transform training data.
X_train_cvec = pd.DataFrame(cvec.fit_transform(X_train).toarray(),
                          columns = cvec.get_feature_names())
# Transform our testing data with the already-fit CountVectorizer.
X_test_cvec = pd.DataFrame(cvec.transform(X_test).toarray(),
                         columns = cvec.get_feature_names())

In [11]:
print(f" X_train set has {X_train_cvec.shape[1]} features after CountVectorizer")
X_train_cvec.head()

 X_train set has 2689 features after CountVectorizer


Unnamed: 0,ability,able,abrams,absolutely,accelerate,access,accessories,accessory,according,accordingly,...,ysk,yuzu,zach,zdnet,zenfone,zigbee,zones,zoom,zte,zuckerberg
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
X_train_cvec.shape

(941, 2689)

In [13]:
#preview of frequency of some top features
top20feat = X_train_cvec.sum().sort_values(ascending=False).head(20).index.tolist()
top20featcount = X_train_cvec.sum().sort_values(ascending=False).head(20).values.tolist()
df_top20 = pd.DataFrame(data=[top20feat, top20featcount], index=['word', 'count']).T
df_top20


Unnamed: 0,word,count
0,apple,244
1,app,122
2,ios,107
3,android,105
4,new,100
5,google,90
6,iphone,63
7,watch,61
8,samsung,57
9,pixel,56


## Modelling
Our data has been processed above and is now ready for modelling!

### Baseline accuracy
Baseline accuracy is the proportion of the majority class, regardless of whether it is 1 or 0. It is the accuracy we would get by always predicting that class.
This serves as the benchmark for our models to beat.
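A small sketch of the same calculation without pandas, assuming a hypothetical list of toy labels. The helper name `baseline_accuracy` is made up for illustration.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Return the majority label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Hypothetical toy labels: six posts of class 1, four of class 0
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

label, acc = baseline_accuracy(toy_labels)
print(label, acc)
```

Any model worth keeping should score above this number on data it has not seen.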

In [14]:
#Check baseline accuracy
df['subreddit'].value_counts(normalize=True)

1    0.533461
0    0.466539
Name: subreddit, dtype: float64

### Naive Bayes model
Now that our data is ready to be fed into a model, let's fit our first model, a Naive Bayes model.

In the X_train set above, the columns are all integer counts, so MultinomialNB is a suitable choice.

In [15]:
from sklearn.naive_bayes import MultinomialNB
#1. Instantiate model
nb_c = MultinomialNB()

#2. Fit our model
nb_c_model = nb_c.fit(X_train_cvec, y_train)

#3. Generate our predictions
nb_c_pred = nb_c_model.predict(X_test_cvec)
#   probabilities of predicting 1
nb_c_proba_1 = [i[1] for i in nb_c_model.predict_proba(X_test_cvec)]

#### Evaluating our first model

In [16]:
# define function to print scores:
def model_score(estimator, X_train, y_train, X_test, y_test):
    
    #mean cross val score on train model
    mean_cv_score = round(cross_val_score(estimator, X_train, y_train, cv=5).mean(),4)
    
    #score our model on the training and test set
    train_score = round(estimator.score(X_train, y_train),4)
    test_score = round(estimator.score(X_test, y_test),4)
    
    #print model scores
    print(f"Cross val score on train model: {mean_cv_score}")
    print(f"Accuracy score on train model: {train_score}")
    print(f"Accuracy score on test model: {test_score}")
    
    return mean_cv_score, train_score, test_score

In [17]:
mean_cv_score, train_score, test_score = model_score(nb_c_model, X_train_cvec, y_train, X_test_cvec, y_test)

Cross val score on train model: 0.9373
Accuracy score on train model: 0.9862
Accuracy score on test model: 0.9331


In [18]:
#define function to print performance metrics
def perf_metrics(y_true, y_pred, y_proba):
    
    #Generate a confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    # print categories
    print("True Negatives: %s" % tn)
    print("False Positives: %s" % fp)
    print("False Negatives: %s" % fn)
    print("True Positives: %s" % tp)
    
    #compute metrics
    bal_accuracy = round(balanced_accuracy_score(y_true, y_pred),4)
    misclassification = round(zero_one_loss(y_true, y_pred),4)
    specificity = round(recall_score(y_true, y_pred, pos_label=0),4)
    sensitivity = round(recall_score(y_true, y_pred, pos_label=1),4)
    precision = round(precision_score(y_true, y_pred),4)
    roc_auc = round(roc_auc_score(y_true, y_proba),4)
        
    #print metrics
    print("\nPerformance metrics\n--------------------")
    print(f'Balanced Accuracy: {bal_accuracy}')
    print(f'Misclassification: {misclassification}')
    print(f'Specificity: {specificity}')
    print(f'Sensitivity: {sensitivity}')
    print(f'Precision: {precision}')
    print(f'ROC_AUC: {roc_auc} \n')

    #cross-check with classfication report
    print(classification_report(y_true, y_pred, digits=4))
    
    return tn, fp, fn, tp, bal_accuracy, misclassification, specificity, sensitivity, precision, roc_auc

In [19]:
tn, fp, fn, tp, bal_accuracy, misclassification, specificity, sensitivity, precision, roc_auc = perf_metrics(
    y_test, nb_c_pred, nb_c_proba_1)

True Negatives: 130
False Positives: 14
False Negatives: 7
True Positives: 163

Performance metrics
--------------------
Balanced Accuracy: 0.9308
Misclassification: 0.0669
Specificity: 0.9028
Sensitivity: 0.9588
Precision: 0.9209
ROC_AUC: 0.9764 

              precision    recall  f1-score   support

           0     0.9489    0.9028    0.9253       144
           1     0.9209    0.9588    0.9395       170

    accuracy                         0.9331       314
   macro avg     0.9349    0.9308    0.9324       314
weighted avg     0.9337    0.9331    0.9330       314



Store these scores in a list of dictionaries for reference.

In [20]:
dict_scores = {'model': 'nb',
             'vectorizer': 'cvec',
             'valid_train': train_score,
             'valid_test': test_score,
             'mean_cv': mean_cv_score,
             'tn': tn,
             'fp': fp,
             'fn': fn,
             'tp': tp,
             'bal_accuracy': bal_accuracy,
             'misclassification': misclassification,
             'specificity': specificity,
             'sensitivity': sensitivity,
             'precision': precision,
             'roc_auc' : roc_auc,
             'params' : 'default'}

scores = [dict_scores]


In [21]:
#preview scores
pd.DataFrame(data=scores)

Unnamed: 0,model,vectorizer,valid_train,valid_test,mean_cv,tn,fp,fn,tp,bal_accuracy,misclassification,specificity,sensitivity,precision,roc_auc,params
0,nb,cvec,0.9862,0.9331,0.9373,130,14,7,163,0.9308,0.0669,0.9028,0.9588,0.9209,0.9764,default


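It appears that the model is slightly overfit: accuracy on the training data (0.9862) is higher than on the validation data (0.9331). If we use it as our final model, we can tune hyperparameters such as max_features to limit the vocabulary size, which might reduce the overfitting.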

#### Naive Bayes model + TfidfVectorizer

Next, we will try a different Vectorizer: ```TfidfVectorizer``` and see which scores better.

A Term Frequency-Inverse Document Frequency (TF-IDF) score tells us which words are most discriminating between "documents". Words that occur often in one "document" but don't occur in many "documents" contain a great deal of discriminating power.

Simply put, the mechanism of the TfidfVectorizer is that: 
- Common words are penalized.
- Rare words have more influence.
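The sketch below illustrates this weighting by hand on three made-up titles. It assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalisation of each row, which is what scikit-learn documents as the default behaviour of `TfidfVectorizer`. The toy `docs` and the helper names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy titles
docs = [
    "apple iphone review",
    "apple watch review",
    "apple pixel camera",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many titles does each word appear?
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))


def idf(term):
    """Smoothed inverse document frequency (assumed formula, see lead-in)."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf_row(doc):
    """TF-IDF weights for one tokenized title, scaled to unit L2 norm."""
    tf = Counter(doc)
    raw = {t: tf[t] * idf(t) for t in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}


row = tfidf_row(tokenized[2])
print(row)
```

In the third title, "apple" appears in every document and so gets the smallest weight, while "pixel" and "camera" appear only once in the corpus and get larger weights.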



In [22]:
#Instantiate Tfidf Vectorizer
tvec = TfidfVectorizer()

# Fit our TfidfVectorizer on the training data and transform training data.
X_train_tvec = pd.DataFrame(tvec.fit_transform(X_train).toarray(),
                          columns = tvec.get_feature_names())
# Transform our testing data with the already-fit TfidfVectorizer.
X_test_tvec = pd.DataFrame(tvec.transform(X_test).toarray(),
                         columns = tvec.get_feature_names())

In [23]:
#preview of TfidfVectorized X_train
print(f" X_train set has {X_train_tvec.shape[1]} features after TfidfVectorizer")
X_train_tvec

 X_train set has 2689 features after TfidfVectorizer


Unnamed: 0,ability,able,abrams,absolutely,accelerate,access,accessories,accessory,according,accordingly,...,ysk,yuzu,zach,zdnet,zenfone,zigbee,zones,zoom,zte,zuckerberg
0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
936,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
937,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
938,0.0,0.0,0.0,0.476334,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
939,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
from sklearn.naive_bayes import MultinomialNB
#1. Instantiate model
nb_t = MultinomialNB()

#2. Fit our model
nb_t_model = nb_t.fit(X_train_tvec, y_train)

#3. Generate our predictions
nb_t_pred = nb_t_model.predict(X_test_tvec)
#   probabilities of predicting 1
nb_t_proba_1 = [i[1] for i in nb_t_model.predict_proba(X_test_tvec)]

### Evaluating our Naive Bayes model with TfidfVectorizer

In [25]:
#score our model on the training and test set
mean_cv_score, train_score, test_score = model_score(nb_t, X_train_tvec, y_train, X_test_tvec, y_test)

Cross val score on train model: 0.9235
Accuracy score on train model: 0.9894
Accuracy score on test model: 0.9204


It seems that the model with TfidfVectorizer did not score as well as the model with CountVectorizer. We would still like to store its scores in the list for reference. Since this takes several steps, we have defined a function below that does it for any pipeline.

In [26]:
#define a function to run pipe with vectorizer and estimator
# also stores scores into dictionary

def run_pipe(pipe, X_train, X_test, y_train, y_test, dict_scores, params='default'):
    
    #fit pipeline
    pipe.fit(X_train, y_train)
    
    print(f"Model = {pipe.get_params()['steps'][1][0]} with {pipe.get_params()['steps'][0][0]} ")
    print("--------------------\n")
    
    #evaluate model
    mean_cv_score, train_score, test_score = model_score(pipe, X_train, y_train, X_test, y_test)

    print("--------------------\n")
    
    #generate predictions
    y_pred = pipe.predict(X_test)
    
    #generate probabilities of predicting 1
    y_proba = [i[1] for i in pipe.predict_proba(X_test)]
    
    #compute metrics
    tn, fp, fn, tp, bal_accuracy, misclassification, specificity, sensitivity, precision, roc_auc = perf_metrics(
        y_test, y_pred, y_proba)
    
    #store scores into dictionary
    dict_scores = {'model': pipe.get_params()['steps'][1][0],
                 'vectorizer': pipe.get_params()['steps'][0][0],
                 'valid_train': train_score,
                 'valid_test': test_score,
                 'mean_cv': mean_cv_score,
                 'tn': tn,
                 'fp': fp,
                 'fn': fn,
                 'tp': tp,
                 'bal_accuracy': bal_accuracy,
                 'misclassification': misclassification,
                 'specificity': specificity,
                 'sensitivity': sensitivity,
                 'precision': precision,
                 'roc_auc' :roc_auc,
                 'params': params}
    scores.append(dict_scores)

In [27]:
#create our pipeline
pipe_tvec_nb = Pipeline([('tvec', TfidfVectorizer()),
                         ('nb', MultinomialNB())])
#call function:
run_pipe(pipe_tvec_nb, X_train, X_test, y_train, y_test, dict_scores)

Model = nb with tvec 
--------------------

Cross val score on train model: 0.9256
Accuracy score on train model: 0.9894
Accuracy score on test model: 0.9204
--------------------

True Negatives: 124
False Positives: 20
False Negatives: 5
True Positives: 165

Performance metrics
--------------------
Balanced Accuracy: 0.9158
Misclassification: 0.0796
Specificity: 0.8611
Sensitivity: 0.9706
Precision: 0.8919
ROC_AUC: 0.9753 

              precision    recall  f1-score   support

           0     0.9612    0.8611    0.9084       144
           1     0.8919    0.9706    0.9296       170

    accuracy                         0.9204       314
   macro avg     0.9266    0.9158    0.9190       314
weighted avg     0.9237    0.9204    0.9199       314



In [28]:
#preview scores in a dataframe
pd.DataFrame(data=scores)

Unnamed: 0,model,vectorizer,valid_train,valid_test,mean_cv,tn,fp,fn,tp,bal_accuracy,misclassification,specificity,sensitivity,precision,roc_auc,params
0,nb,cvec,0.9862,0.9331,0.9373,130,14,7,163,0.9308,0.0669,0.9028,0.9588,0.9209,0.9764,default
1,nb,tvec,0.9894,0.9204,0.9256,124,20,5,165,0.9158,0.0796,0.8611,0.9706,0.8919,0.9753,default


### Fit a Logistic Regression model
In order to make a comparison, we will fit another classification model: Logistic Regression.
To streamline the process, let's put the vectorizer and model into pipelines.

In [29]:
#Fit each model into pipeline
pipe_cvec_lr = Pipeline([('cvec', CountVectorizer()),
                  ('lr', LogisticRegression())])

pipe_tvec_lr = Pipeline([('tvec', TfidfVectorizer()),
                  ('lr', LogisticRegression())])


In [30]:
#run the pipeline using our defined function
run_pipe(pipe_cvec_lr, X_train, X_test, y_train, y_test, dict_scores)
run_pipe(pipe_tvec_lr, X_train, X_test, y_train, y_test, dict_scores)

Model = lr with cvec 
--------------------

Cross val score on train model: 0.9235
Accuracy score on train model: 0.9968
Accuracy score on test model: 0.914
--------------------

True Negatives: 130
False Positives: 14
False Negatives: 13
True Positives: 157

Performance metrics
--------------------
Balanced Accuracy: 0.9132
Misclassification: 0.086
Specificity: 0.9028
Sensitivity: 0.9235
Precision: 0.9181
ROC_AUC: 0.9783 

              precision    recall  f1-score   support

           0     0.9091    0.9028    0.9059       144
           1     0.9181    0.9235    0.9208       170

    accuracy                         0.9140       314
   macro avg     0.9136    0.9132    0.9134       314
weighted avg     0.9140    0.9140    0.9140       314

Model = lr with tvec 
--------------------

Cross val score on train model: 0.9362
Accuracy score on train model: 0.9872
Accuracy score on test model: 0.9236
--------------------

True Negatives: 128
False Positives: 16
False Negatives: 8
True P

In [31]:
#preview scores in a dataframe
pd.DataFrame(data=scores)

Unnamed: 0,model,vectorizer,valid_train,valid_test,mean_cv,tn,fp,fn,tp,bal_accuracy,misclassification,specificity,sensitivity,precision,roc_auc,params
0,nb,cvec,0.9862,0.9331,0.9373,130,14,7,163,0.9308,0.0669,0.9028,0.9588,0.9209,0.9764,default
1,nb,tvec,0.9894,0.9204,0.9256,124,20,5,165,0.9158,0.0796,0.8611,0.9706,0.8919,0.9753,default
2,lr,cvec,0.9968,0.914,0.9235,130,14,13,157,0.9132,0.086,0.9028,0.9235,0.9181,0.9783,default
3,lr,tvec,0.9872,0.9236,0.9362,128,16,8,162,0.9209,0.0764,0.8889,0.9529,0.9101,0.9826,default


The purpose of our model is to help moderators identify whether a post is related to the Apple community and at the same time, filter out unrelated posts. Hence, an important metric would be accuracy, which is the rate of correctly predicting all classes. Since our data has slightly unbalanced classes, we can look at the balanced accuracy which is the average of recall obtained on each class `(sensitivity + specificity)/2` .
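As a worked example, the metrics reported for the Naive Bayes model with CountVectorizer can be recomputed directly from its confusion matrix (TN = 130, FP = 14, FN = 7, TP = 163). This is plain arithmetic rather than a call to scikit-learn.

```python
# Confusion matrix of the nb + cvec model, as reported above
tn, fp, fn, tp = 130, 14, 7, 163

specificity = tn / (tn + fp)          # recall on class 0
sensitivity = tp / (tp + fn)          # recall on class 1
balanced_accuracy = (sensitivity + specificity) / 2
accuracy = (tp + tn) / (tn + fp + fn + tp)
misclassification = 1 - accuracy

print(f"Specificity:       {specificity:.4f}")
print(f"Sensitivity:       {sensitivity:.4f}")
print(f"Balanced accuracy: {balanced_accuracy:.4f}")
print(f"Accuracy:          {accuracy:.4f}")
print(f"Misclassification: {misclassification:.4f}")
```

The rounded values agree with the scores table: 0.9028, 0.9588, 0.9308, 0.9331 and 0.0669.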

Based on the models tested above, the Multinomial Naive Bayes model coupled with CountVectorizer has the highest balanced accuracy (0.9308). It also has the highest validation score (0.9331), which suggests that it generalises better than the others. Hence, we will select this model as our baseline model.

<!-- 

Logistic Regression model coupled with the Tf-idf Vectorizer is generalising better, however the model is overfitting to the train data.

In this case, we look at other metrics important to our model: Sensitivity and Specificity scores.




Since our model is to correctly predict whether or not a post belongs to either the Apple or Android subreddit, we would want to optimise both the sensitivity (or Recall/ True Positive) and the specificity (or True Negative) metrics.

probability of correctly classifying your classes. A higher AUC represents a better mode!

This model also has the least False negatives, Hence, we will tune the hyperparameters and see if the model improves. -->

### Tuning hyperparameters of the Multinomial Naive Bayes Model with CountVectorizer

Starting from our baseline model, we will tune its hyperparameters using GridSearchCV, which cross-validates every combination in a parameter grid, to find the combination that generalizes best to unseen data.
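To give a feel for how much work a grid search does, the sketch below enumerates the combinations of the third grid used later in this notebook and counts the fits for 5-fold cross-validation. It only mimics the enumeration with `itertools.product` and is not GridSearchCV itself, which by default also refits the best combination on the full training data afterwards.

```python
from itertools import product

# Same values as the third grid tuned later in this notebook
grid = {
    "cvec__max_df": [0.9, 1.0],
    "cvec__min_df": [1, 2],
    "cvec__max_features": [1500, 2000, 2100],
}

names = sorted(grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(grid[name] for name in names))
]

n_folds = 5
n_fits = len(candidates) * n_folds

print(f"{len(candidates)} candidates x {n_folds} folds = {n_fits} fits")
print(candidates[0])
```

This matches the "12 candidates, totalling 60 fits" message printed when that grid is run.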

We will be tuning the following parameters in our model:
- CountVectorizer:
    - tokenizer (using LemmaTokenizer/ StemTokenizer)
    - max_df (ignores words that appear in more than this number or proportion of titles)
    - min_df (ignores words that appear in fewer than this number or proportion of titles)
    - max_features (keeps only the most frequent words, up to this number)
    - ngram_range (captures n-word phrases)


- MultinomialNB:
    - alpha (additive smoothing strength, which acts as regularization)
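To illustrate what `alpha` does, here is a minimal sketch of additive (Laplace/Lidstone) smoothing for the per-class word likelihood used by a multinomial Naive Bayes model: `(count + alpha) / (class_total + alpha * vocab_size)`. The counts below are hypothetical and the helper name is made up for illustration.

```python
def smoothed_likelihood(word_count, class_total, vocab_size, alpha):
    """P(word | class) with additive smoothing."""
    return (word_count + alpha) / (class_total + alpha * vocab_size)


# Hypothetical class: 10 word occurrences in total, vocabulary of 4 words.
# One word was seen 6 times in this class, another was never seen.
vocab_size, class_total = 4, 10

for alpha in (0.01, 1.0, 100.0):
    p_seen = smoothed_likelihood(6, class_total, vocab_size, alpha)
    p_unseen = smoothed_likelihood(0, class_total, vocab_size, alpha)
    print(f"alpha={alpha:>6}: seen={p_seen:.4f}  unseen={p_unseen:.4f}")
```

A small alpha trusts the observed counts almost completely, while a large alpha pulls every word towards the same probability, so the model relies less on words seen only a few times in training.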
    



In [32]:
#define a function to run pipe with vectorizer and estimator
# also stores scores into dictionary

def run_gs_pipe(pipe, X_train, X_test, y_train, y_test, df_scores, grid_params):
    
    #run gridsearch
    gscv = GridSearchCV(pipe, grid_params, cv=5, verbose=1, n_jobs=-1)
    
    #fit to grid
    gscv.fit(X_train, y_train)

    print(f"Model = {gscv.best_estimator_.steps[1][0]} with {gscv.best_estimator_.steps[0][0]} ")
    print("--------------------\n")
    
    #mean cross val score on train model
    mean_cv_score = round(gscv.best_score_,4)
    
    #score our model on the training and test set
    train_score = round(gscv.score(X_train, y_train),4)
    test_score = round(gscv.score(X_test, y_test),4)
    
    #print model scores
    print(f"Cross val score on train model: {mean_cv_score}")
    print(f"Accuracy score on train model: {train_score}")
    print(f"Accuracy score on test model: {test_score}")
    print(f"Best parameters: {gscv.best_params_}")
    
    print("\n--------------------\n")

    
    #generate predictions
    y_pred = gscv.predict(X_test)
    
    #generate probabilities of predicting 1
    y_proba = [i[1] for i in gscv.predict_proba(X_test)]
    
    #compute metrics
    tn, fp, fn, tp, bal_accuracy, misclassification, specificity, sensitivity, precision, roc_auc = perf_metrics(
        y_test, y_pred, y_proba)
    
    #store scores into dictionary
    dict_scores = {'model': gscv.best_estimator_.steps[1][0],
                 'vectorizer': gscv.best_estimator_.steps[0][0],
                 'valid_train': train_score,
                 'valid_test': test_score,
                 'mean_cv': mean_cv_score,
                 'tn': tn,
                 'fp': fp,
                 'fn': fn,
                 'tp': tp,
                 'bal_accuracy': bal_accuracy,
                 'misclassification': misclassification,
                 'specificity': specificity,
                 'sensitivity': sensitivity,
                 'precision': precision,
                 'roc_auc' :roc_auc,
                 'params': gscv.best_params_}
    scores.append(dict_scores)
    
    return gscv

In [33]:
#create our pipeline
pipe_cvec_nb = Pipeline([('cvec', CountVectorizer()),
                         ('nb', MultinomialNB())])

#### Tuning the CountVectorizer(tokenizer)
In the previous notebook (Data Cleaning and EDA), we cleaned the text in 'title' by removing all non-letters and converting everything to lowercase, so that the same words are counted together.

To further process our text data, we can use Lemmatizing or Stemming to shorten words so we can combine similar forms of the same word.

When we "lemmatize" data, we take words and attempt to return their <i>lemma</i>, or the base/dictionary form of a word.

When we "stem" data, we take words and attempt to return a base form of the word. It tends to be cruder than using lemmatization.

Hence, we have created two classes, `LemmaTokenizer` and `StemTokenizer`, to pass into our hyperparameter tuning and determine whether lemmatizing or stemming produces a better model.
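The difference between the two approaches can be seen in a toy example. The sketch below is not NLTK: `crude_stem` is a hypothetical rule-based suffix stripper standing in for a stemmer, and `toy_lemmatize` is a hypothetical hand-written dictionary lookup standing in for a lemmatizer. It only shows why rule-based stemming tends to be cruder than looking up a dictionary form.

```python
def crude_stem(word):
    """Toy stemmer: strip the first matching suffix. Not the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical hand-written lookup, standing in for a real lexical database
TOY_LEMMAS = {"phones": "phone", "watches": "watch", "leaves": "leaf"}


def toy_lemmatize(word):
    """Toy lemmatizer: return the dictionary form if known, else the word."""
    return TOY_LEMMAS.get(word, word)


for word in ["phones", "watches", "leaves"]:
    print(f"{word:>8} -> stem: {crude_stem(word):<6} lemma: {toy_lemmatize(word)}")
```

The toy stemmer turns "phones" into "phon" and "leaves" into "leav", which are not real words, whereas the lookup returns "phone" and "leaf".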

In [34]:
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, title):
        return [self.wnl.lemmatize(t) for t in word_tokenize(title)]
    
class StemTokenizer(object):
    def __init__(self):
        self.ps = PorterStemmer()
    def __call__(self, title):
        return [self.ps.stem(t) for t in word_tokenize(title)]

In [35]:
%%time
#create our first grid of parameters to tune
param_1 = [{'cvec__tokenizer': [LemmaTokenizer(), StemTokenizer(), None]}]


#run pipe
gscv1 = run_gs_pipe(pipe_cvec_nb, X_train, X_test, y_train, y_test, dict_scores, param_1)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Model = nb with cvec 
--------------------

Cross val score on train model: 0.9384
Accuracy score on train model: 0.9862
Accuracy score on test model: 0.9331
Best parameters: {'cvec__tokenizer': None}

--------------------

True Negatives: 130
False Positives: 14
False Negatives: 7
True Positives: 163

Performance metrics
--------------------
Balanced Accuracy: 0.9308
Misclassification: 0.0669
Specificity: 0.9028
Sensitivity: 0.9588
Precision: 0.9209
ROC_AUC: 0.9764 

              precision    recall  f1-score   support

           0     0.9489    0.9028    0.9253       144
           1     0.9209    0.9588    0.9395       170

    accuracy                         0.9331       314
   macro avg     0.9349    0.9308    0.9324       314
weighted avg     0.9337    0.9331    0.9330       314

CPU times: user 169 ms, sys: 70.5 ms, total: 239 ms
Wall time: 3.77 s


[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:    3.7s finished


The grid search selected `None`, i.e. the default tokenizer, so lemmatizing or stemming did not improve the cross-validation score. Since it would be a redundant step, we will leave the tokenizer out of our grid of hyperparameters.

#### Tuning the CountVectorizer(n_grams)
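As a quick illustration of what `ngram_range` controls, the sketch below generates word n-grams for a made-up title in pure Python. The helper `ngrams` and the toy title are hypothetical and this is not `CountVectorizer`'s own code, but the features it lists are the kind the vectorizer would count for each range.

```python
def ngrams(tokens, ngram_range):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    min_n, max_n = ngram_range
    out = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


# Hypothetical toy title
tokens = "new apple watch review".split()

for rng in [(1, 1), (1, 2), (1, 3)]:
    feats = ngrams(tokens, rng)
    print(rng, len(feats), feats)
```

Widening the range adds phrase features such as "apple watch", but it also grows the number of columns quickly, which gives the model more room to overfit.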

In [36]:
%%time
#create our second grid of parameters to tune
param_2 = [{'cvec__ngram_range' : [(1,1), (1,2), (1,3)] # unigrams or bigrams or tri-grams
           }]


#run pipe
gscv2 = run_gs_pipe(pipe_cvec_nb, X_train, X_test, y_train, y_test, dict_scores, param_2)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:    0.1s finished


Fitting 5 folds for each of 3 candidates, totalling 15 fits
Model = nb with cvec 
--------------------

Cross val score on train model: 0.9394
Accuracy score on train model: 0.9968
Accuracy score on test model: 0.9299
Best parameters: {'cvec__ngram_range': (1, 2)}

--------------------

True Negatives: 130
False Positives: 14
False Negatives: 8
True Positives: 162

Performance metrics
--------------------
Balanced Accuracy: 0.9279
Misclassification: 0.0701
Specificity: 0.9028
Sensitivity: 0.9529
Precision: 0.9205
ROC_AUC: 0.9782 

              precision    recall  f1-score   support

           0     0.9420    0.9028    0.9220       144
           1     0.9205    0.9529    0.9364       170

    accuracy                         0.9299       314
   macro avg     0.9312    0.9279    0.9292       314
weighted avg     0.9303    0.9299    0.9298       314

CPU times: user 120 ms, sys: 8.9 ms, total: 129 ms
Wall time: 197 ms


In [37]:
pd.DataFrame(data=scores)

Unnamed: 0,model,vectorizer,valid_train,valid_test,mean_cv,tn,fp,fn,tp,bal_accuracy,misclassification,specificity,sensitivity,precision,roc_auc,params
0,nb,cvec,0.9862,0.9331,0.9373,130,14,7,163,0.9308,0.0669,0.9028,0.9588,0.9209,0.9764,default
1,nb,tvec,0.9894,0.9204,0.9256,124,20,5,165,0.9158,0.0796,0.8611,0.9706,0.8919,0.9753,default
2,lr,cvec,0.9968,0.914,0.9235,130,14,13,157,0.9132,0.086,0.9028,0.9235,0.9181,0.9783,default
3,lr,tvec,0.9872,0.9236,0.9362,128,16,8,162,0.9209,0.0764,0.8889,0.9529,0.9101,0.9826,default
4,nb,cvec,0.9862,0.9331,0.9384,130,14,7,163,0.9308,0.0669,0.9028,0.9588,0.9209,0.9764,{'cvec__tokenizer': None}
5,nb,cvec,0.9968,0.9299,0.9394,130,14,8,162,0.9279,0.0701,0.9028,0.9529,0.9205,0.9782,"{'cvec__ngram_range': (1, 2)}"


The grid search identified an n-gram range of (1,2) as the best parameter. However, while the 'valid_train' score increased (0.9862 to 0.9968), the 'valid_test' score dropped slightly (0.9331 to 0.9299), so the model is not generalising better. The misclassification rate also increased (0.0669 to 0.0701), which is not a favourable outcome. Hence, we will keep the n-gram range at its default value for best results.

In [38]:
%%time
#create our third grid of parameters to tune
param_3 = [{'cvec__max_df': [0.9, 1.0],
            'cvec__min_df': [1 , 2],
            'cvec__max_features': [1500, 2000, 2100],
           }]


#run pipe
gscv3 = run_gs_pipe(pipe_cvec_nb, X_train, X_test, y_train, y_test, dict_scores, param_3)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Model = nb with cvec 
--------------------

Cross val score on train model: 0.9373
Accuracy score on train model: 0.9819
Accuracy score on test model: 0.9331
Best parameters: {'cvec__max_df': 0.9, 'cvec__max_features': 2100, 'cvec__min_df': 1}

--------------------

True Negatives: 130
False Positives: 14
False Negatives: 7
True Positives: 163

Performance metrics
--------------------
Balanced Accuracy: 0.9308
Misclassification: 0.0669
Specificity: 0.9028
Sensitivity: 0.9588
Precision: 0.9209
ROC_AUC: 0.9761 

              precision    recall  f1-score   support

           0     0.9489    0.9028    0.9253       144
           1     0.9209    0.9588    0.9395       170

    accuracy                         0.9331       314
   macro avg     0.9349    0.9308    0.9324       314
weighted avg     0.9337    0.9331    0.9330       314

CPU times: user 214 ms, sys: 14.4 ms, total: 229 ms
Wall time: 283 ms


[Parallel(n_jobs=-1)]: Done  45 out of  60 | elapsed:    0.2s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:    0.2s finished


In [39]:
pd.DataFrame(data=scores)

Unnamed: 0,model,vectorizer,valid_train,valid_test,mean_cv,tn,fp,fn,tp,bal_accuracy,misclassification,specificity,sensitivity,precision,roc_auc,params
0,nb,cvec,0.9862,0.9331,0.9373,130,14,7,163,0.9308,0.0669,0.9028,0.9588,0.9209,0.9764,default
1,nb,tvec,0.9894,0.9204,0.9256,124,20,5,165,0.9158,0.0796,0.8611,0.9706,0.8919,0.9753,default
2,lr,cvec,0.9968,0.914,0.9235,130,14,13,157,0.9132,0.086,0.9028,0.9235,0.9181,0.9783,default
3,lr,tvec,0.9872,0.9236,0.9362,128,16,8,162,0.9209,0.0764,0.8889,0.9529,0.9101,0.9826,default
4,nb,cvec,0.9862,0.9331,0.9384,130,14,7,163,0.9308,0.0669,0.9028,0.9588,0.9209,0.9764,{'cvec__tokenizer': None}
5,nb,cvec,0.9968,0.9299,0.9394,130,14,8,162,0.9279,0.0701,0.9028,0.9529,0.9205,0.9782,"{'cvec__ngram_range': (1, 2)}"
6,nb,cvec,0.9819,0.9331,0.9373,130,14,7,163,0.9308,0.0669,0.9028,0.9588,0.9209,0.9761,"{'cvec__max_df': 0.9, 'cvec__max_features': 21..."


The scores did not improve on the baseline model, so we will keep CountVectorizer at its default hyperparameters. We will now tune the hyperparameters of the Multinomial Naive Bayes model.

In [40]:
%%time
#create our fourth grid of parameters to tune
param_4 = [{'nb__alpha' : np.logspace(0,5,1000)
           }]


#run pipe
gscv4 = run_gs_pipe(pipe_cvec_nb, X_train, X_test, y_train, y_test, dict_scores, param_4)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 5 folds for each of 1000 candidates, totalling 5000 fits


[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:    0.3s
[Parallel(n_jobs=-1)]: Done 1200 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-1)]: Done 3200 tasks      | elapsed:    9.9s


Model = nb with cvec 
--------------------

Cross val score on train model: 0.9426
Accuracy score on train model: 0.9851
Accuracy score on test model: 0.9363
Best parameters: {'nb__alpha': 1.303512244681509}

--------------------

True Negatives: 130
False Positives: 14
False Negatives: 6
True Positives: 164

Performance metrics
--------------------
Balanced Accuracy: 0.9337
Misclassification: 0.0637
Specificity: 0.9028
Sensitivity: 0.9647
Precision: 0.9213
ROC_AUC: 0.9771 

              precision    recall  f1-score   support

           0     0.9559    0.9028    0.9286       144
           1     0.9213    0.9647    0.9425       170

    accuracy                         0.9363       314
   macro avg     0.9386    0.9337    0.9356       314
weighted avg     0.9372    0.9363    0.9361       314

CPU times: user 6.93 s, sys: 180 ms, total: 7.11 s
Wall time: 15.5 s


[Parallel(n_jobs=-1)]: Done 5000 out of 5000 | elapsed:   15.4s finished
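
As a quick sanity check on the search above, the sketch below rebuilds the same `np.logspace(0, 5, 1000)` grid and locates the selected `alpha` within it. It assumes only NumPy; the variable names are illustrative and are not part of the notebook's helper functions.

```python
import numpy as np

# Same grid as in param_4: 1000 values evenly spaced on a log scale from 10**0 to 10**5
alpha_grid = np.logspace(0, 5, 1000)

# Value reported under "Best parameters" in the output above
best_alpha = 1.303512244681509

# Position of the grid value closest to the selected alpha
best_idx = int(np.argmin(np.abs(alpha_grid - best_alpha)))

print(f"Grid range: {alpha_grid.min():.1f} to {alpha_grid.max():.1f}")
print(f"Selected alpha is at position {best_idx} of {len(alpha_grid)}")
```

The selected value lies close to the lower end of the grid, which starts at 1, so values of `alpha` below 1 were not part of this search.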


In [41]:
pd.DataFrame(data=scores)

Unnamed: 0,model,vectorizer,valid_train,valid_test,mean_cv,tn,fp,fn,tp,bal_accuracy,misclassification,specificity,sensitivity,precision,roc_auc,params
0,nb,cvec,0.9862,0.9331,0.9373,130,14,7,163,0.9308,0.0669,0.9028,0.9588,0.9209,0.9764,default
1,nb,tvec,0.9894,0.9204,0.9256,124,20,5,165,0.9158,0.0796,0.8611,0.9706,0.8919,0.9753,default
2,lr,cvec,0.9968,0.914,0.9235,130,14,13,157,0.9132,0.086,0.9028,0.9235,0.9181,0.9783,default
3,lr,tvec,0.9872,0.9236,0.9362,128,16,8,162,0.9209,0.0764,0.8889,0.9529,0.9101,0.9826,default
4,nb,cvec,0.9862,0.9331,0.9384,130,14,7,163,0.9308,0.0669,0.9028,0.9588,0.9209,0.9764,{'cvec__tokenizer': None}
5,nb,cvec,0.9968,0.9299,0.9394,130,14,8,162,0.9279,0.0701,0.9028,0.9529,0.9205,0.9782,"{'cvec__ngram_range': (1, 2)}"
6,nb,cvec,0.9819,0.9331,0.9373,130,14,7,163,0.9308,0.0669,0.9028,0.9588,0.9209,0.9761,"{'cvec__max_df': 0.9, 'cvec__max_features': 21..."
7,nb,cvec,0.9851,0.9363,0.9426,130,14,6,164,0.9337,0.0637,0.9028,0.9647,0.9213,0.9771,{'nb__alpha': 1.303512244681509}


After tuning the additive smoothing parameter (`alpha`), the performance metrics improved slightly: validation accuracy rose from 0.9331 to 0.9363 and the mean cross-validation score from 0.9373 to 0.9426.
Hence, this model will be selected as our production model.
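
To illustrate what the additive smoothing parameter does, here is a minimal sketch in plain Python using made-up word counts for a single class. The function name `smoothed_word_probs` and the toy counts are hypothetical and only meant to show the standard smoothing formula, (count + alpha) / (total + alpha * vocabulary size), not the internals of our fitted pipeline.

```python
def smoothed_word_probs(word_counts, alpha):
    """Estimate P(word | class) with additive (Laplace/Lidstone) smoothing."""
    vocab_size = len(word_counts)
    total = sum(word_counts)
    return [(count + alpha) / (total + alpha * vocab_size) for count in word_counts]


# Hypothetical counts of three vocabulary words within one class;
# the third word never appears in this class.
toy_counts = [3, 1, 0]

for alpha in (1.0, 1.303512244681509):
    probs = smoothed_word_probs(toy_counts, alpha)
    print(f"alpha={alpha:.4f}: {[round(p, 4) for p in probs]}")
```

A larger `alpha` shifts more probability towards words that were never seen in a class, which prevents a single unseen word from zeroing out the class likelihood.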

In [42]:
#Export scores to csv:
pd.DataFrame(data=scores).to_csv('../datasets/scores.csv', index=False)

## Evaluating our Production Model

In [48]:
predictions = gscv4.predict(test_title)

In [49]:
pred_proba = [i[1] for i in gscv4.predict_proba(test_title)]
    
#compute metrics
tn, fp, fn, tp, bal_accuracy, misclassification, specificity, sensitivity, precision, roc_auc = perf_metrics(
    test_subreddit, predictions, pred_proba)
print(f"Accuracy: {gscv4.score(test_title, test_subreddit)}")

True Negatives: 132
False Positives: 14
False Negatives: 10
True Positives: 158

Performance metrics
--------------------
Balanced Accuracy: 0.9223
Misclassification: 0.0764
Specificity: 0.9041
Sensitivity: 0.9405
Precision: 0.9186
ROC_AUC: 0.98 

              precision    recall  f1-score   support

           0     0.9296    0.9041    0.9167       146
           1     0.9186    0.9405    0.9294       168

    accuracy                         0.9236       314
   macro avg     0.9241    0.9223    0.9230       314
weighted avg     0.9237    0.9236    0.9235       314

Accuracy: 0.9235668789808917
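
The metrics above can be reproduced by hand from the four confusion-matrix counts. The sketch below is a plain-Python check using the counts printed for the unseen test set; it does not call the notebook's `perf_metrics` helper, and the variable names are chosen for illustration.

```python
# Confusion-matrix counts reported above for the unseen test set
tn, fp, fn, tp = 132, 14, 10, 158
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
misclassification = (fp + fn) / total
specificity = tn / (tn + fp)        # true negative rate
sensitivity = tp / (tp + fn)        # true positive rate (recall)
precision = tp / (tp + fp)
bal_accuracy = (specificity + sensitivity) / 2

print(f"Accuracy: {accuracy:.4f}")
print(f"Balanced Accuracy: {bal_accuracy:.4f}")
print(f"Misclassification: {misclassification:.4f}")
print(f"Specificity: {specificity:.4f}")
print(f"Sensitivity: {sensitivity:.4f}")
print(f"Precision: {precision:.4f}")
```

ROC AUC is not included here because it depends on the predicted probabilities rather than on the confusion-matrix counts alone.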


In [50]:
# Store the production model's scores on the unseen test set in a dictionary
prod_scores = {'model': gscv4.best_estimator_.steps[1][0],
             'vectorizer': gscv4.best_estimator_.steps[0][0],
             'tn': tn,
             'fp': fp,
             'fn': fn,
             'tp': tp,
             'mean_cv': gscv4.best_score_,
             'accuracy': gscv4.score(test_title, test_subreddit),
             'bal_accuracy': bal_accuracy,
             'misclassification': misclassification,
             'specificity': specificity,
             'sensitivity': sensitivity,
             'precision': precision,
             'roc_auc': roc_auc,
             'params': gscv4.best_params_}

In [51]:
# Export the production model's scores on the unseen test set to CSV:
pd.DataFrame(data=prod_scores).to_csv('../datasets/prod_score.csv', index=False)

<div style="text-align: right">
    <div class="right"> >>> <b>Next: </b>
        <a href="./04_conclusion_and_recommendation.ipynb">Conclusion and Recommendations</a>
    </div>
    </div>

[Go to top](#top)

---