# Project 3: Suicide Watch

## Notebook 3: Model Optimisation and Evaluation

This notebook contains the optimisation to identify the best hyperparameters for the following models:

* Multinomial Naive Bayes
* Logistic Regression

### Contents

- [Optimising Tfidf Multinomial Naive Bayes](#Optimising-Tfidf-Multinomial-Naive-Bayes)
- [Optimising Tfidf Logistic Regression](#Optimising-Tfidf-Logistic-Regression)

In [1]:
import requests
import time
import pandas as pd
import numpy as np
import ast

# preprocessing imports
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# modeling imports
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# plotting imports
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

### Train/Test Split (Lemmatized Posts)

In [None]:
# Since this is a classification problem, we stratify on y so that the class proportions
# are the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [None]:
# Check that X and y have the same number of rows within each of the train and test sets
print('Number of rows in X_train: {}'.format(len(X_train)))
print('Number of rows in y_train: {}'.format(len(y_train)))
print('Number of rows in X_test: {}'.format(len(X_test)))
print('Number of rows in y_test: {}'.format(len(y_test)))
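
As a rough illustration of what stratifying does, the sketch below splits a small made-up label series (not the project's posts) and compares the share of the positive class in each split. The names `X_toy` and `y_toy` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a 60/40 class split (not the project's posts)
y_toy = pd.Series([1] * 60 + [0] * 40)
X_toy = pd.Series([f"post {i}" for i in range(len(y_toy))])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy
)

# Share of the positive class in each split; stratifying keeps them aligned
train_pos_share = y_tr.mean()
test_pos_share = y_te.mean()
print(train_pos_share, test_pos_share)
```

With the default test size of 25%, both shares should stay close to the overall 60% positive rate, which is the property the split above relies on.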

### GridSearchCV

The GridSearchCV tool lets us try multiple hyperparameter values for each model. It fits a model for every combination of the hyperparameters we specify, scores each one by cross-validation, and selects the highest-scoring combination.

We will run a model for each of the following classifiers:

* Multinomial Naive Bayes
* K-Nearest Neighbors
* Logistic Regression

We will run two GridSearches to benchmark these models for two feature extraction techniques: CountVectorizer and TfidfVectorizer. We can use the accuracy of the results to narrow our model selection to the most effective approaches.

As these models execute, the results will be displayed, then stored into a DataFrame for final comparison.
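
The following is a minimal sketch of that search on a tiny made-up corpus, assuming the same pipeline step names (`cv`, `multi_nb`) used later in this notebook. The documents and labels are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus; labels are illustrative only
docs = ["good day", "great day", "nice good",
        "bad night", "awful night", "bad awful"] * 3
labels = [1, 1, 1, 0, 0, 0] * 3

pipe = Pipeline([("cv", CountVectorizer()), ("multi_nb", MultinomialNB())])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {"cv__ngram_range": [(1, 1), (1, 2)]}

gs_demo = GridSearchCV(pipe, param_grid, cv=3)
gs_demo.fit(docs, labels)

# One candidate per parameter combination, each scored with 3-fold CV
n_candidates = len(gs_demo.cv_results_["params"])
print(n_candidates, gs_demo.best_params_, gs_demo.best_score_)
```

By default the best combination is refit on the full training data, so calling `score` or `predict` on the fitted search object uses that refit model.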

#### Baseline Accuracy

In [None]:
# baseline accuracy
baseline = y_train.value_counts(normalize=True)[1]
baseline

Here we find the baseline accuracy, i.e. the accuracy we would get by always predicting the majority class. We calculate it as the proportion of training posts with the target value of 1 (posts from the suicide subreddit), which is the majority class here. Normalising the value counts gives this proportion directly: a baseline accuracy of 51.7%.
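
As a sanity check on this idea, the sketch below computes the majority-class share on a hypothetical label series and compares it with scikit-learn's `DummyClassifier`. The labels are made up, and `y_demo` is not a variable from this notebook.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical labels: three positives, two negatives
y_demo = pd.Series([1, 1, 1, 0, 0])

# Share of the most frequent class
majority_share = y_demo.value_counts(normalize=True).max()

# A classifier that always predicts the most frequent class;
# the feature values are placeholders because it ignores them
X_placeholder = [[0]] * len(y_demo)
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_placeholder, y_demo)
dummy_acc = dummy.score(X_placeholder, y_demo)

print(majority_share, dummy_acc)
```

Taking `.max()` instead of indexing on a fixed label keeps the baseline correct even if the majority class changes.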

#### Count Vectorizer

In [None]:
# Pipeline steps for each combination of model
# Include a standard scaler for knn and logistic regression: knn relies on distances between samples
# and regularised logistic regression is sensitive to feature scale
steps_list_cv = [ 
    [('cv', CountVectorizer()),('multi_nb', MultinomialNB())],
    [('cv', CountVectorizer()),('scaler', StandardScaler(with_mean=False)),('knn', KNeighborsClassifier())], 
    [('cv', CountVectorizer()),('scaler', StandardScaler(with_mean=False)),('logreg', LogisticRegression())]
]
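
The pipelines above pass `with_mean=False` to `StandardScaler`. A small sketch of why, on a made-up corpus: vectorizers return sparse matrices, and centering would destroy that sparsity, so scikit-learn refuses to center sparse input.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

# Made-up documents for illustration
docs = ["one two", "two three three", "one three"]
X_counts = CountVectorizer().fit_transform(docs)  # scipy sparse matrix

# Scaling by the standard deviation only keeps the matrix sparse
X_scaled = StandardScaler(with_mean=False).fit_transform(X_counts)
still_sparse = sparse.issparse(X_scaled)

# Centering sparse input is rejected with a ValueError
try:
    StandardScaler(with_mean=True).fit_transform(X_counts)
    centering_failed = False
except ValueError:
    centering_failed = True

print(still_sparse, centering_failed)
```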

In [None]:
steps_titles_cv = ['multi_nb + cv','knn + cv','logreg + cv']

In [None]:
pipe_params_cv = [
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)]},
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)]},
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)]}
]   

In [None]:
# instantiate results DataFrame for CountVectorizer
gs_results_cv = pd.DataFrame(columns=['model','best_params','train_accuracy','test_accuracy',
                                      'baseline_accuracy','recall', 'precision', 'f1-score'])
gs_results_cv.head()

In [None]:
# Loop through index of number of steps
for i in range(len(steps_list_cv)):
    # instantiate pipeline 
    pipe = Pipeline(steps=steps_list_cv[i])
    # set up GridSearchCV over the pipeline and its parameter grid
    gs = GridSearchCV(pipe, pipe_params_cv[i], cv=3) 

    model_results = {}

    gs.fit(X_train, y_train)
    
    print('Model: ', steps_titles_cv[i])
    model_results['model'] = steps_titles_cv[i]

    print('Best Params: ', gs.best_params_)
    model_results['best_params'] = gs.best_params_

    print(gs.score(X_train, y_train), '\n')
    model_results['train_accuracy'] = gs.score(X_train, y_train)
    
    print(gs.score(X_test, y_test), '\n')
    model_results['test_accuracy'] = gs.score(X_test, y_test)
    
    model_results['baseline_accuracy'] = baseline
    
    # Display the confusion matrix results showing true/false positive/negative
    tn, fp, fn, tp = confusion_matrix(y_test, gs.predict(X_test)).ravel() 
    print("True Negatives: %s" % tn)
    print("False Positives: %s" % fp)
    print("False Negatives: %s" % fn)
    print("True Positives: %s" % tp, '\n')
    
    model_results['recall'] = tp/(tp+fn)
    model_results['precision'] = tp/(tp+fp)
    model_results['f1-score'] = 2*((tp/(tp+fp))*(tp/(tp+fn)))/((tp/(tp+fp))+(tp/(tp+fn)))

    # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
    gs_results_cv = pd.concat([gs_results_cv, pd.DataFrame([model_results])], ignore_index=True)
    pd.set_option('display.max_colwidth', 200)

In [None]:
gs_results_cv.sort_values('test_accuracy',ascending=False)

Although Multinomial NB gave the highest test accuracy, both Multinomial NB and Logistic Regression appear to overfit heavily: their training accuracy is much higher than their test accuracy. All three models beat the baseline accuracy, but K-Nearest Neighbors only barely.
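
The recall, precision and F1 columns in the results table are derived by hand from the confusion matrix inside the loop. As a cross-check, the sketch below compares the same formulas against scikit-learn's metric functions on made-up predictions. `y_true` and `y_pred` are hypothetical.

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true labels and predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]

# For binary labels 0/1, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall_manual = tp / (tp + fn)
precision_manual = tp / (tp + fp)
f1_manual = (2 * precision_manual * recall_manual
             / (precision_manual + recall_manual))

# The library functions should agree with the manual formulas
print(recall_manual, recall_score(y_true, y_pred))
print(precision_manual, precision_score(y_true, y_pred))
print(f1_manual, f1_score(y_true, y_pred))
```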

#### Tfidf Vectorizer

In [None]:
steps_list_tf = [ 
    [('tf',TfidfVectorizer()),('multi_nb',MultinomialNB())],
    [('tf',TfidfVectorizer()),('scaler',StandardScaler(with_mean=False)),('knn',KNeighborsClassifier())], 
    [('tf',TfidfVectorizer()),('scaler',StandardScaler(with_mean=False)),('logreg',LogisticRegression())]
]

In [None]:
steps_titles_tf = ['multi_nb + tf','knn + tf','logreg + tf']

In [None]:
pipe_params_tf = [
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]},
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]},
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]}
]

In [None]:
# instantiate results DataFrame
gs_results_tf = pd.DataFrame(columns=['model','best_params','train_accuracy','test_accuracy','baseline_accuracy',
                                      'recall', 'precision', 'f1-score'])
gs_results_tf.head()

In [None]:
# Loop through index of number of steps
for i in range(len(steps_list_tf)):
    # instantiate pipeline 
    pipe = Pipeline(steps=steps_list_tf[i])
    # set up GridSearchCV over the pipeline and its parameter grid
    gs = GridSearchCV(pipe, pipe_params_tf[i], cv=3) 

    model_results = {}

    gs.fit(X_train, y_train)
    
    print('Model: ', steps_titles_tf[i])
    model_results['model'] = steps_titles_tf[i]

    print('Best Params: ', gs.best_params_)
    model_results['best_params'] = gs.best_params_

    print(gs.score(X_train, y_train), '\n')
    model_results['train_accuracy'] = gs.score(X_train, y_train)
    
    print(gs.score(X_test, y_test), '\n')
    model_results['test_accuracy'] = gs.score(X_test, y_test)
    
    model_results['baseline_accuracy'] = baseline

    # Display the confusion matrix results showing true/false positive/negative
    tn, fp, fn, tp = confusion_matrix(y_test, gs.predict(X_test)).ravel() 
    print("True Negatives: %s" % tn)
    print("False Positives: %s" % fp)
    print("False Negatives: %s" % fn)
    print("True Positives: %s" % tp, '\n')
    
    model_results['recall'] = tp/(tp+fn)
    model_results['precision'] = tp/(tp+fp)
    model_results['f1-score'] = 2*((tp/(tp+fp))*(tp/(tp+fn)))/((tp/(tp+fp))+(tp/(tp+fn)))

    # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
    gs_results_tf = pd.concat([gs_results_tf, pd.DataFrame([model_results])], ignore_index=True)
    pd.set_option('display.max_colwidth', 200)

In [None]:
gs_results_tf.sort_values('test_accuracy', ascending=False)

Similar to the CountVectorizer results, Multinomial NB gave the highest test accuracy. Both Multinomial NB and Logistic Regression still overfit the training data heavily, but both still beat the baseline.
However, Multinomial NB + Tfidf Vectorizer achieved a higher test accuracy than Multinomial NB + Count Vectorizer.

CountVectorizer represents each document as a vector of raw word counts. As a result, common words that appear in most documents are weighted more heavily than rarer words that carry the topic information. Tfidf scales each term's frequency by its inverse document frequency, so words that occur across many documents receive lower scores than they would with CountVectorizer. Thus, Tfidf is better able to identify the words that are important.
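
To make the difference concrete, here is a small sketch on a made-up three-document corpus in which the word "feel" appears in every document and "happy" in only one. The corpus is illustrative and uses the vectorizers' default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus: "feel" occurs in every document, "happy" in only one
docs = [
    "feel sad feel alone",
    "feel happy today",
    "feel tired alone",
]

cv_demo = CountVectorizer()
counts = cv_demo.fit_transform(docs).toarray()

tf_demo = TfidfVectorizer()
weights = tf_demo.fit_transform(docs).toarray()

# Column positions of the two words (both vectorizers share the vocabulary)
feel_col = cv_demo.vocabulary_["feel"]
happy_col = cv_demo.vocabulary_["happy"]

# In the second document both words occur once, so their counts are equal,
# but Tfidf gives the rarer word "happy" the larger weight
print(counts[1, feel_col], counts[1, happy_col])
print(weights[1, feel_col], weights[1, happy_col])
```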

In [None]:
# Concatenate the dataframes containing results from both vectorizations
results_lem = pd.concat([gs_results_cv, gs_results_tf], ignore_index=True)
results_lem.sort_values('test_accuracy', ascending=False)

### Train/Test Split (Stemmed Posts)

Set X to the stemmed posts and y to the suicide subreddit label.

In [None]:
X = df_prep['post_stem']
y = df_prep['suicide']

In [None]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [None]:
# Check that X and y have the same number of rows within each of the train and test sets
print('Number of rows in X_train: {}'.format(len(X_train)))
print('Number of rows in y_train: {}'.format(len(y_train)))
print('Number of rows in X_test: {}'.format(len(X_test)))
print('Number of rows in y_test: {}'.format(len(y_test)))

### GridSearchCV

A model will be run for the following classifiers:

* Multinomial Naive Bayes
* K-Nearest Neighbors
* Logistic Regression

The results will be displayed, then stored into a DataFrame for a final comparison with the lemmatized posts.

#### Count Vectorizer

In [None]:
# Pipeline steps for each combination of model
# Include a standard scaler for knn and logistic regression: knn relies on distances between samples
# and regularised logistic regression is sensitive to feature scale
steps_list_cv_st = [ 
    [('cv', CountVectorizer()),('multi_nb', MultinomialNB())],
    [('cv', CountVectorizer()),('scaler', StandardScaler(with_mean=False)),('knn', KNeighborsClassifier())], 
    [('cv', CountVectorizer()),('scaler', StandardScaler(with_mean=False)),('logreg', LogisticRegression())]
]

In [None]:
steps_titles_cv_st = ['multi_nb + cv','knn + cv','logreg + cv']

In [None]:
pipe_params_cv_st = [
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)]},
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)]},
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)]}
]   

In [None]:
# instantiate results DataFrame for CountVectorizer
gs_results_cv_st = pd.DataFrame(columns=['model','best_params','train_accuracy','test_accuracy',
                                      'baseline_accuracy','recall', 'precision', 'f1-score'])
gs_results_cv_st.head()

In [None]:
# Loop through index of number of steps
for i in range(len(steps_list_cv_st)):
    # instantiate pipeline 
    pipe = Pipeline(steps=steps_list_cv_st[i])
    # set up GridSearchCV over the pipeline and its parameter grid
    gs = GridSearchCV(pipe, pipe_params_cv_st[i], cv=3) 

    model_results = {}

    gs.fit(X_train, y_train)
    
    print('Model: ', steps_titles_cv_st[i])
    model_results['model'] = steps_titles_cv_st[i]

    print('Best Params: ', gs.best_params_)
    model_results['best_params'] = gs.best_params_

    print(gs.score(X_train, y_train), '\n')
    model_results['train_accuracy'] = gs.score(X_train, y_train)
    
    print(gs.score(X_test, y_test), '\n')
    model_results['test_accuracy'] = gs.score(X_test, y_test)
    
    model_results['baseline_accuracy'] = baseline
    
    # Display the confusion matrix results showing true/false positive/negative
    tn, fp, fn, tp = confusion_matrix(y_test, gs.predict(X_test)).ravel() 
    print("True Negatives: %s" % tn)
    print("False Positives: %s" % fp)
    print("False Negatives: %s" % fn)
    print("True Positives: %s" % tp, '\n')
    
    model_results['recall'] = tp/(tp+fn)
    model_results['precision'] = tp/(tp+fp)
    model_results['f1-score'] = 2*((tp/(tp+fp))*(tp/(tp+fn)))/((tp/(tp+fp))+(tp/(tp+fn)))
    
    # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
    gs_results_cv_st = pd.concat([gs_results_cv_st, pd.DataFrame([model_results])], ignore_index=True)
    pd.set_option('display.max_colwidth', 200)

In [None]:
gs_results_cv_st.sort_values('test_accuracy',ascending=False)

The results from the stemmed data are almost the same as those from the lemmatized data. Multinomial NB remains the best-performing model, with slightly less overfitting than on the lemmatized data.

#### Tfidf Vectorizer

In [None]:
steps_list_tf_st = [ # list of pipeline steps for each model combo
    [('tf',TfidfVectorizer()),('multi_nb',MultinomialNB())],
    [('tf',TfidfVectorizer()),('scaler',StandardScaler(with_mean=False)),('knn',KNeighborsClassifier())], 
    [('tf',TfidfVectorizer()),('scaler',StandardScaler(with_mean=False)),('logreg',LogisticRegression())]
]

In [None]:
steps_titles_tf_st = ['multi_nb + tf','knn + tf','logreg + tf']

In [None]:
pipe_params_tf_st = [
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]},
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]},
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]}
]

In [None]:
# instantiate results DataFrame
gs_results_tf_st = pd.DataFrame(columns=['model','best_params','train_accuracy','test_accuracy','baseline_accuracy',
                                         'recall', 'precision', 'f1-score'])
gs_results_tf_st.head()

In [None]:
# Loop through index of number of steps
for i in range(len(steps_list_tf_st)):
    # instantiate pipeline 
    pipe = Pipeline(steps=steps_list_tf_st[i])
    # set up GridSearchCV over the pipeline and its parameter grid
    gs = GridSearchCV(pipe, pipe_params_tf_st[i], cv=3) 

    model_results = {}

    gs.fit(X_train, y_train)
    
    print('Model: ', steps_titles_tf_st[i])
    model_results['model'] = steps_titles_tf_st[i]

    print('Best Params: ', gs.best_params_)
    model_results['best_params'] = gs.best_params_

    print(gs.score(X_train, y_train), '\n')
    model_results['train_accuracy'] = gs.score(X_train, y_train)
    
    print(gs.score(X_test, y_test), '\n')
    model_results['test_accuracy'] = gs.score(X_test, y_test)
    
    model_results['baseline_accuracy'] = baseline

    # Display the confusion matrix results showing true/false positive/negative
    tn, fp, fn, tp = confusion_matrix(y_test, gs.predict(X_test)).ravel() 
    print("True Negatives: %s" % tn)
    print("False Positives: %s" % fp)
    print("False Negatives: %s" % fn)
    print("True Positives: %s" % tp, '\n')
    
    model_results['recall'] = tp/(tp+fn)
    model_results['precision'] = tp/(tp+fp)
    model_results['f1-score'] = 2*((tp/(tp+fp))*(tp/(tp+fn)))/((tp/(tp+fp))+(tp/(tp+fn)))

    # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
    gs_results_tf_st = pd.concat([gs_results_tf_st, pd.DataFrame([model_results])], ignore_index=True)
    pd.set_option('display.max_colwidth', 200)

In [None]:
gs_results_tf_st.sort_values('test_accuracy', ascending=False)

Multinomial NB is again the best model, with the highest test accuracy, but it overfits heavily.
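
One simple way to quantify the overfitting discussed here is the gap between training and test accuracy. The sketch below adds such a column to a results table. The accuracy values are hypothetical placeholders, not results from this notebook, and `overfit_gap` is a made-up column name.

```python
import pandas as pd

# Hypothetical accuracies for illustration only (not this notebook's results)
results_demo = pd.DataFrame({
    "model": ["model_a", "model_b"],
    "train_accuracy": [0.95, 0.80],
    "test_accuracy": [0.70, 0.78],
})

# A larger gap between train and test accuracy suggests more overfitting
results_demo["overfit_gap"] = (
    results_demo["train_accuracy"] - results_demo["test_accuracy"]
)

print(results_demo.sort_values("overfit_gap", ascending=False))
```

The same column could be added to the results tables in this notebook, since they already contain `train_accuracy` and `test_accuracy`.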

In [None]:
# Concatenate the dataframes containing results from both vectorizations
results_stem = pd.concat([gs_results_cv_st, gs_results_tf_st], ignore_index=True)
results_stem.sort_values('test_accuracy', ascending=False)

In [None]:
# Compare with the above result
results_lem.sort_values('test_accuracy', ascending=False)

For the best-performing model, Multinomial NB with Tfidf Vectorizer, the lemmatized data performed only marginally better in terms of accuracy. However, given the problem at hand, we want to maximise recall, since it measures the model’s ability to find all the data points of interest, i.e. all the suicide posts. Low recall means that posts from people who are actually suicidal would not be flagged. The following models seem to do better in terms of both recall and accuracy:
1. Lemmatized Tfidf Multinomial NB
    * tf__ngram_range=(1,1)
    * tf__stop_words='english'
2. Lemmatized Tfidf Scaled Logistic Regression
    * tf__ngram_range=(1,2)
    * tf__stop_words='english'
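
Since recall matters most here, one option for the optimisation step is to have GridSearchCV rank candidates by recall rather than by accuracy, its default for classifiers. The sketch below shows this on a made-up corpus using the `scoring` parameter. It is an assumption about how the search could be configured, not the method used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus; labels are illustrative only
docs = ["good day", "great day", "nice good",
        "bad night", "awful night", "bad awful"] * 3
labels = [1, 1, 1, 0, 0, 0] * 3

pipe = Pipeline([("tf", TfidfVectorizer()), ("multi_nb", MultinomialNB())])
param_grid = {"tf__ngram_range": [(1, 1), (1, 2)]}

# scoring="recall" ranks the candidates by recall on the positive class
gs_recall = GridSearchCV(pipe, param_grid, cv=3, scoring="recall")
gs_recall.fit(docs, labels)

# score() now reports recall as well, not accuracy
search_score = gs_recall.score(docs, labels)
direct_recall = recall_score(labels, gs_recall.predict(docs))
print(gs_recall.best_params_, search_score, direct_recall)
```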

## To be continued in Notebook 3: Model Optimisation