# Project 3: Suicide Watch

## Notebook 3: Model Selection and Optimisation

This notebook contains the code to assess the effectiveness of candidate classification models on the preprocessed data, including:

* Multinomial Naive Bayes
* K-Nearest Neighbors
* Logistic Regression Classifier

Models are tested using two vectorization transformers: CountVectorizer and TfidfVectorizer.

A GridSearch is run across all models to rule out non-viable options. The models with the most predictive potential are then selected and optimised here:

* Tfidf + Multinomial Naive Bayes
* Tfidf + Logistic Regression

### Contents

- [Train/Test Split (Lemmatized Posts)](#Train/Test-Split-(Lemmatized-Posts))
- [GridSearchCV](#GridSearchCV)
    * [Baseline Accuracy](#Baseline-Accuracy)
    * [Count Vectorizer](#Count-Vectorizer)
    * [Tfidf Vectorizer](#Tfidf-Vectorizer)
- [Train/Test Split (Stemmed Posts)](#Train/Test-Split-(Stemmed-Posts))
- [GridSearchCV](#GridSearchCV)
    * [Count Vectorizer](#Count-Vectorizer)
    * [Tfidf Vectorizer](#Tfidf-Vectorizer)
- [Optimising Tfidf Multinomial Naive Bayes](#Optimising-Tfidf-Multinomial-Naive-Bayes)
- [Optimising Tfidf Logistic Regression](#Optimising-Tfidf-Logistic-Regression)
- [Conclusion and Recommendations](#Conclusion-and-Recommendations)

In [8]:
import requests
import time
import pandas as pd
import numpy as np
import ast

# preprocessing imports
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_extraction import stop_words

# modeling imports
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# plotting imports
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

### Train/Test Split (Lemmatized Posts)

In [2]:
# Import preprocessed data
df_prep = pd.read_csv('../datasets/preprocessed.csv')
df_prep.head()

Unnamed: 0,post_lem,post_stem,suicide
0,ask kindly stop make tea read text p...,ask kindli stop make tea read text p...,1
1,u ryfflex still please let know year ...,u ryfflex still pleas let know year a...,1
2,broken kid like mei see many people like ...,broken kid like mei see mani peopl like ...,1
3,anyone angry bitter time tolerate anybo...,anyon angri bitter time toler anybodi ...,1
4,could use someone talk,could use someon talk,1


Set feature, X and target, y variables for lemmatized posts for train/test split.

In [3]:
X = df_prep['post_lem']
y = df_prep['suicide']

In [4]:
# This is a classification problem, so stratify on y to keep the class proportions
# the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [5]:
# Confirm that X and y have the same number of rows within each split
print('Number of rows in X_train: {}'.format(len(X_train)))
print('Number of rows in y_train: {}'.format(len(y_train)))
print('Number of rows in X_test: {}'.format(len(X_test)))
print('Number of rows in y_test: {}'.format(len(y_test)))

Number of rows in X_train: 1417
Number of rows in y_train: 1417
Number of rows in X_test: 473
Number of rows in y_test: 473
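The split sizes printed above follow from `train_test_split` holding out 25% of the rows when neither `test_size` nor `train_size` is given. The arithmetic can be sketched as below, assuming the test count is rounded up and the remainder goes to the training set. `split_sizes` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test_size.

    Assumes the test count is rounded up and the remainder goes to train.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

# Total row count taken from the output above: 1417 + 473
n_train, n_test = split_sizes(1417 + 473)
print(n_train, n_test)
```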


### GridSearchCV

GridSearchCV lets us search over multiple hyperparameters for each model. It fits a model for every combination of the hyperparameter values we specify, scores each combination by cross-validation, and refits the highest-scoring combination on the full training set.
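As a rough illustration of what the search enumerates, the sketch below expands a parameter grid into its individual candidates using only the standard library. `expand_grid` is a hypothetical helper, and the grid has the same shape as the ones used later in this notebook; the real search additionally cross-validates every candidate.

```python
from itertools import product

def expand_grid(param_grid):
    """Yield one dict per combination of the listed hyperparameter values."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))

# Same shape as the grids used later in this notebook
grid = {
    'cv__stop_words': ['english'],
    'cv__ngram_range': [(1, 1), (1, 2)],
}
candidates = list(expand_grid(grid))
n_folds = 3  # matches cv=3 in the searches below

for candidate in candidates:
    print(candidate)
print(len(candidates) * n_folds, 'cross-validation fits before the final refit')
```

With one stop-word setting and two n-gram ranges there are only two candidates per model, so each search here is cheap; the grids in the optimisation sections are considerably larger.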

We will run a model for each of the following classifiers:

* Multinomial Naive Bayes
* K-Nearest Neighbors
* Logistic Regression

We will run two GridSearches to benchmark these models for two feature extraction techniques: CountVectorizer and TfidfVectorizer. We can use the accuracy of the results to narrow our model selection to the most effective approaches.

As these models execute, the results will be displayed, then stored into a DataFrame for final comparison.

#### Baseline Accuracy

In [6]:
# baseline accuracy
baseline = y_train.value_counts(normalize=True)[1]
baseline

0.5172900494001411

Here we find the baseline accuracy: the accuracy obtained by always predicting the majority class. Posts from the suicide subreddit (target value 1) are the majority class, so normalising the value counts and taking the share of class 1 gives a baseline accuracy of 51.7%.
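The same figure can be reproduced without pandas. The sketch below assumes class counts of 733 and 684, which are inferred from the outputs above (0.51729 of 1417 training rows) rather than read from the data. It also picks the majority class explicitly, whereas indexing `[1]` in the cell above only gives the baseline because class 1 happens to be the larger class.

```python
from collections import Counter

# Assumption: class counts inferred from the outputs above
# (0.5172900494 * 1417 training rows = 733 posts with target value 1)
y_train_labels = [1] * 733 + [0] * 684

counts = Counter(y_train_labels)
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(y_train_labels)
print(majority_class, round(baseline_accuracy, 4))
```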

#### Count Vectorizer

In [10]:
# Pipeline steps for each combination of model
# Scale features for KNN and logistic regression, since distances (KNN) and
# regularisation (logistic regression) are sensitive to feature scale
steps_list_cv = [ 
    [('cv', CountVectorizer()),('multi_nb', MultinomialNB())],
    [('cv', CountVectorizer()),('scaler', StandardScaler(with_mean=False)),('knn', KNeighborsClassifier())], 
    [('cv', CountVectorizer()),('scaler', StandardScaler(with_mean=False)),('logreg', LogisticRegression())]
]

In [11]:
steps_titles_cv = ['multi_nb + cv','knn + cv','logreg + cv']

In [12]:
pipe_params_cv = [
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)]},
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)]},
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)]}
]   

In [13]:
# instantiate results DataFrame for CountVectorizer
gs_results_cv = pd.DataFrame(columns=['model','best_params','train_accuracy','test_accuracy',
                                      'baseline_accuracy','recall', 'precision', 'f1-score'])
gs_results_cv.head()

Unnamed: 0,model,best_params,train_accuracy,test_accuracy,baseline_accuracy,recall,precision,f1-score


In [14]:
# Loop through index of number of steps
for i in range(len(steps_list_cv)):
    # instantiate pipeline 
    pipe = Pipeline(steps=steps_list_cv[i])
    # fit GridSearchCV to model and model's params
    gs = GridSearchCV(pipe, pipe_params_cv[i], cv=3) 

    model_results = {}

    gs.fit(X_train, y_train)
    
    print('Model: ', steps_titles_cv[i])
    model_results['model'] = steps_titles_cv[i]

    print('Best Params: ', gs.best_params_)
    model_results['best_params'] = gs.best_params_

    print(gs.score(X_train, y_train), '\n')
    model_results['train_accuracy'] = gs.score(X_train, y_train)
    
    print(gs.score(X_test, y_test), '\n')
    model_results['test_accuracy'] = gs.score(X_test, y_test)
    
    model_results['baseline_accuracy'] = baseline
    
    # Display the confusion matrix results showing true/false positive/negative
    tn, fp, fn, tp = confusion_matrix(y_test, gs.predict(X_test)).ravel() 
    print("True Negatives: %s" % tn)
    print("False Positives: %s" % fp)
    print("False Negatives: %s" % fn)
    print("True Positives: %s" % tp, '\n')
    
    model_results['recall'] = tp/(tp+fn)
    model_results['precision'] = tp/(tp+fp)
    model_results['f1-score'] = 2*((tp/(tp+fp))*(tp/(tp+fn)))/((tp/(tp+fp))+(tp/(tp+fn)))

    gs_results_cv = gs_results_cv.append(model_results, ignore_index=True)
    pd.set_option('display.max_colwidth', 200)

Model:  multi_nb + cv
Best Params:  {'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}
0.9174311926605505 

0.693446088794926 

True Negatives: 150
False Positives: 78
False Negatives: 67
True Positives: 178 

Model:  knn + cv
Best Params:  {'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}
0.5335215243472125 

0.5179704016913319 

True Negatives: 3
False Positives: 225
False Negatives: 3
True Positives: 242 

Model:  logreg + cv
Best Params:  {'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}
0.9992942836979535 

0.6257928118393234 

True Negatives: 126
False Positives: 102
False Negatives: 75
True Positives: 170 



In [15]:
gs_results_cv.sort_values('test_accuracy',ascending=False)

Unnamed: 0,model,best_params,train_accuracy,test_accuracy,baseline_accuracy,recall,precision,f1-score
0,multi_nb + cv,"{'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.917431,0.693446,0.51729,0.726531,0.695312,0.710579
2,logreg + cv,"{'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.999294,0.625793,0.51729,0.693878,0.625,0.65764
1,knn + cv,"{'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.533522,0.51797,0.51729,0.987755,0.518201,0.679775


Although Multinomial NB gave the highest test accuracy, both Multinomial NB and Logistic Regression overfit the training data heavily: their training accuracy is far higher than their test accuracy. All three models beat the baseline accuracy, but KNearestNeighbours only barely so (0.518 against 0.517), because it labels almost every post as suicidal.
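The metrics in the table can be checked by hand from the confusion-matrix counts printed above. The sketch below wraps the formulas used in the loop in a hypothetical helper, `classification_metrics`, and uses the algebraically equivalent form of F1 that avoids computing precision and recall twice.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute accuracy, recall, precision and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)  # equals 2 * P * R / (P + R)
    return {'accuracy': accuracy, 'recall': recall, 'precision': precision, 'f1': f1}

# Counts printed for multi_nb + cv above
m = classification_metrics(tn=150, fp=78, fn=67, tp=178)
for name, value in m.items():
    print(name, round(value, 4))
```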

#### Tfidf Vectorizer

In [16]:
steps_list_tf = [ 
    [('tf',TfidfVectorizer()),('multi_nb',MultinomialNB())],
    [('tf',TfidfVectorizer()),('scaler',StandardScaler(with_mean=False)),('knn',KNeighborsClassifier())], 
    [('tf',TfidfVectorizer()),('scaler',StandardScaler(with_mean=False)),('logreg',LogisticRegression())]
]

In [17]:
steps_titles_tf = ['multi_nb + tf','knn + tf','logreg + tf']

In [18]:
pipe_params_tf = [
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]},
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]},
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]}
]

In [19]:
# instantiate results DataFrame
gs_results_tf = pd.DataFrame(columns=['model','best_params','train_accuracy','test_accuracy','baseline_accuracy',
                                      'recall', 'precision', 'f1-score'])
gs_results_tf.head()

Unnamed: 0,model,best_params,train_accuracy,test_accuracy,baseline_accuracy,recall,precision,f1-score


In [20]:
# Loop through index of number of steps
for i in range(len(steps_list_tf)):
    # instantiate pipeline 
    pipe = Pipeline(steps=steps_list_tf[i])
    # fit GridSearchCV to model and model's params
    gs = GridSearchCV(pipe, pipe_params_tf[i], cv=3) 

    model_results = {}

    gs.fit(X_train, y_train)
    
    print('Model: ', steps_titles_tf[i])
    model_results['model'] = steps_titles_tf[i]

    print('Best Params: ', gs.best_params_)
    model_results['best_params'] = gs.best_params_

    print(gs.score(X_train, y_train), '\n')
    model_results['train_accuracy'] = gs.score(X_train, y_train)
    
    print(gs.score(X_test, y_test), '\n')
    model_results['test_accuracy'] = gs.score(X_test, y_test)
    
    model_results['baseline_accuracy'] = baseline

    # Display the confusion matrix results showing true/false positive/negative
    tn, fp, fn, tp = confusion_matrix(y_test, gs.predict(X_test)).ravel() 
    print("True Negatives: %s" % tn)
    print("False Positives: %s" % fp)
    print("False Negatives: %s" % fn)
    print("True Positives: %s" % tp, '\n')
    
    model_results['recall'] = tp/(tp+fn)
    model_results['precision'] = tp/(tp+fp)
    model_results['f1-score'] = 2*((tp/(tp+fp))*(tp/(tp+fn)))/((tp/(tp+fp))+(tp/(tp+fn)))

    gs_results_tf = gs_results_tf.append(model_results, ignore_index=True)
    pd.set_option('display.max_colwidth', 200)

Model:  multi_nb + tf
Best Params:  {'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}
0.9414255469301341 

0.7145877378435518 

True Negatives: 143
False Positives: 85
False Negatives: 50
True Positives: 195 

Model:  knn + tf
Best Params:  {'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}
0.5172900494001411 

0.5179704016913319 

True Negatives: 0
False Positives: 228
False Negatives: 0
True Positives: 245 

Model:  logreg + tf
Best Params:  {'tf__ngram_range': (1, 2), 'tf__stop_words': 'english'}
0.9992942836979535 

0.6659619450317125 

True Negatives: 126
False Positives: 102
False Negatives: 56
True Positives: 189 



In [21]:
gs_results_tf.sort_values('test_accuracy', ascending=False)

Unnamed: 0,model,best_params,train_accuracy,test_accuracy,baseline_accuracy,recall,precision,f1-score
0,multi_nb + tf,"{'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}",0.941426,0.714588,0.51729,0.795918,0.696429,0.742857
2,logreg + tf,"{'tf__ngram_range': (1, 2), 'tf__stop_words': 'english'}",0.999294,0.665962,0.51729,0.771429,0.649485,0.705224
1,knn + tf,"{'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}",0.51729,0.51797,0.51729,1.0,0.51797,0.682451


Similar to the result from CountVectorizer, Multinomial NB gave the highest test accuracy. Both Multinomial NB and Logistic Regression still overfit the training data heavily, but both still beat the baseline. 
Comparing Multinomial NB across the two vectorizers, the test accuracy is higher with Tfidf Vectorizer (0.715) than with Count Vectorizer (0.693). 

CountVectorizer gives a vector with the number of times each word appears in the document. As a result, common words that appear in most documents can be weighted more heavily than rarer words that carry the topic information. Tfidf balances each term's frequency against its inverse document frequency, so words that occur across many documents receive lower scores than they would with CountVectorizer. Tfidf is therefore better able to identify the words that are important.
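To make the weighting concrete, here is a minimal pure-Python sketch on a hypothetical three-document corpus. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is intended to mirror TfidfVectorizer's default settings; tokenisation is simplified to whitespace splitting, so the numbers are illustrative only.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document.

    Uses raw term counts, idf = ln((1 + n) / (1 + df)) + 1 and L2 normalisation
    (assumed to mirror TfidfVectorizer's defaults).
    """
    n_docs = len(docs)
    tokenised = [doc.split() for doc in docs]
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    weighted = []
    for tokens in tokenised:
        raw = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({term: w / norm for term, w in raw.items()})
    return weighted

# Hypothetical toy corpus: "feel" is in every document, "tired" in only one
docs = ['feel tired', 'feel sad', 'feel alone today']
weights = tfidf_weights(docs)
print(weights[0])
```

Both words appear once in the first document, so CountVectorizer would give them the same value of 1, while the Tfidf weighting scores "tired" above "feel".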

In [22]:
# Concatenate the dataframes containing results from both vectorizations
results_lem = pd.concat([gs_results_cv, gs_results_tf], ignore_index=True)
results_lem.sort_values('test_accuracy', ascending=False)

Unnamed: 0,model,best_params,train_accuracy,test_accuracy,baseline_accuracy,recall,precision,f1-score
3,multi_nb + tf,"{'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}",0.941426,0.714588,0.51729,0.795918,0.696429,0.742857
0,multi_nb + cv,"{'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.917431,0.693446,0.51729,0.726531,0.695312,0.710579
5,logreg + tf,"{'tf__ngram_range': (1, 2), 'tf__stop_words': 'english'}",0.999294,0.665962,0.51729,0.771429,0.649485,0.705224
2,logreg + cv,"{'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.999294,0.625793,0.51729,0.693878,0.625,0.65764
1,knn + cv,"{'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.533522,0.51797,0.51729,0.987755,0.518201,0.679775
4,knn + tf,"{'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}",0.51729,0.51797,0.51729,1.0,0.51797,0.682451


### Train/Test Split (Stemmed Posts)

Set the feature X to the stemmed posts and the target y to the `suicide` label.

In [23]:
X = df_prep['post_stem']
y = df_prep['suicide']

In [24]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [25]:
# Confirm that X and y have the same number of rows within each split
print('Number of rows in X_train: {}'.format(len(X_train)))
print('Number of rows in y_train: {}'.format(len(y_train)))
print('Number of rows in X_test: {}'.format(len(X_test)))
print('Number of rows in y_test: {}'.format(len(y_test)))

Number of rows in X_train: 1417
Number of rows in y_train: 1417
Number of rows in X_test: 473
Number of rows in y_test: 473


### GridSearchCV

A model will be run for the following classifiers:

* Multinomial Naive Bayes
* K-Nearest Neighbors
* Logistic Regression

The results will be displayed, then stored into a DataFrame for a final comparison with the lemmatized posts.

#### Count Vectorizer

In [26]:
# Pipeline steps for each combination of model
# Scale features for KNN and logistic regression, since distances (KNN) and
# regularisation (logistic regression) are sensitive to feature scale
steps_list_cv_st = [ 
    [('cv', CountVectorizer()),('multi_nb', MultinomialNB())],
    [('cv', CountVectorizer()),('scaler', StandardScaler(with_mean=False)),('knn', KNeighborsClassifier())], 
    [('cv', CountVectorizer()),('scaler', StandardScaler(with_mean=False)),('logreg', LogisticRegression())]
]

In [27]:
steps_titles_cv_st = ['multi_nb + cv','knn + cv','logreg + cv']

In [28]:
pipe_params_cv_st = [
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)]},
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)]},
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)]}
]   

In [29]:
# instantiate results DataFrame for CountVectorizer
gs_results_cv_st = pd.DataFrame(columns=['model','best_params','train_accuracy','test_accuracy',
                                      'baseline_accuracy','recall', 'precision', 'f1-score'])
gs_results_cv_st.head()

Unnamed: 0,model,best_params,train_accuracy,test_accuracy,baseline_accuracy,recall,precision,f1-score


In [30]:
# Loop through index of number of steps
for i in range(len(steps_list_cv_st)):
    # instantiate pipeline 
    pipe = Pipeline(steps=steps_list_cv_st[i])
    # fit GridSearchCV to model and model's params
    gs = GridSearchCV(pipe, pipe_params_cv_st[i], cv=3) 

    model_results = {}

    gs.fit(X_train, y_train)
    
    print('Model: ', steps_titles_cv_st[i])
    model_results['model'] = steps_titles_cv_st[i]

    print('Best Params: ', gs.best_params_)
    model_results['best_params'] = gs.best_params_

    print(gs.score(X_train, y_train), '\n')
    model_results['train_accuracy'] = gs.score(X_train, y_train)
    
    print(gs.score(X_test, y_test), '\n')
    model_results['test_accuracy'] = gs.score(X_test, y_test)
    
    model_results['baseline_accuracy'] = baseline
    
    # Display the confusion matrix results showing true/false positive/negative
    tn, fp, fn, tp = confusion_matrix(y_test, gs.predict(X_test)).ravel() 
    print("True Negatives: %s" % tn)
    print("False Positives: %s" % fp)
    print("False Negatives: %s" % fn)
    print("True Positives: %s" % tp, '\n')
    
    model_results['recall'] = tp/(tp+fn)
    model_results['precision'] = tp/(tp+fp)
    model_results['f1-score'] = 2*((tp/(tp+fp))*(tp/(tp+fn)))/((tp/(tp+fp))+(tp/(tp+fn)))
    
    gs_results_cv_st = gs_results_cv_st.append(model_results, ignore_index=True)
    pd.set_option('display.max_colwidth', 200)

Model:  multi_nb + cv
Best Params:  {'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}
0.8969654199011997 

0.6976744186046512 

True Negatives: 152
False Positives: 76
False Negatives: 67
True Positives: 178 

Model:  knn + cv
Best Params:  {'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}
0.534227240649259 

0.5285412262156448 

True Negatives: 7
False Positives: 221
False Negatives: 2
True Positives: 243 

Model:  logreg + cv
Best Params:  {'cv__ngram_range': (1, 2), 'cv__stop_words': 'english'}
0.9992942836979535 

0.6300211416490487 

True Negatives: 77
False Positives: 151
False Negatives: 24
True Positives: 221 



In [31]:
gs_results_cv_st.sort_values('test_accuracy',ascending=False)

Unnamed: 0,model,best_params,train_accuracy,test_accuracy,baseline_accuracy,recall,precision,f1-score
0,multi_nb + cv,"{'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.896965,0.697674,0.51729,0.726531,0.700787,0.713427
2,logreg + cv,"{'cv__ngram_range': (1, 2), 'cv__stop_words': 'english'}",0.999294,0.630021,0.51729,0.902041,0.594086,0.71637
1,knn + cv,"{'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.534227,0.528541,0.51729,0.991837,0.523707,0.685472


The results from the stemmed data are almost the same as those from the lemmatized data. Multinomial NB remains the best performer, with slightly less overfitting than on the lemmatized data.

#### Tfidf Vectorizer

In [32]:
steps_list_tf_st = [ # list of pipeline steps for each model combo
    [('tf',TfidfVectorizer()),('multi_nb',MultinomialNB())],
    [('tf',TfidfVectorizer()),('scaler',StandardScaler(with_mean=False)),('knn',KNeighborsClassifier())], 
    [('tf',TfidfVectorizer()),('scaler',StandardScaler(with_mean=False)),('logreg',LogisticRegression())]
]

In [33]:
steps_titles_tf_st = ['multi_nb + tf','knn + tf','logreg + tf']

In [34]:
pipe_params_tf_st = [
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]},
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]},
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]}
]

In [35]:
# instantiate results DataFrame
gs_results_tf_st = pd.DataFrame(columns=['model','best_params','train_accuracy','test_accuracy','baseline_accuracy',
                                         'recall', 'precision', 'f1-score'])
gs_results_tf_st.head()

Unnamed: 0,model,best_params,train_accuracy,test_accuracy,baseline_accuracy,recall,precision,f1-score


In [36]:
# Loop through index of number of steps
for i in range(len(steps_list_tf_st)):
    # instantiate pipeline 
    pipe = Pipeline(steps=steps_list_tf_st[i])
    # fit GridSearchCV to model and model's params
    gs = GridSearchCV(pipe, pipe_params_tf_st[i], cv=3) 

    model_results = {}

    gs.fit(X_train, y_train)
    
    print('Model: ', steps_titles_tf_st[i])
    model_results['model'] = steps_titles_tf_st[i]

    print('Best Params: ', gs.best_params_)
    model_results['best_params'] = gs.best_params_

    print(gs.score(X_train, y_train), '\n')
    model_results['train_accuracy'] = gs.score(X_train, y_train)
    
    print(gs.score(X_test, y_test), '\n')
    model_results['test_accuracy'] = gs.score(X_test, y_test)
    
    model_results['baseline_accuracy'] = baseline

    # Display the confusion matrix results showing true/false positive/negative
    tn, fp, fn, tp = confusion_matrix(y_test, gs.predict(X_test)).ravel() 
    print("True Negatives: %s" % tn)
    print("False Positives: %s" % fp)
    print("False Negatives: %s" % fn)
    print("True Positives: %s" % tp, '\n')
    
    model_results['recall'] = tp/(tp+fn)
    model_results['precision'] = tp/(tp+fp)
    model_results['f1-score'] = 2*((tp/(tp+fp))*(tp/(tp+fn)))/((tp/(tp+fp))+(tp/(tp+fn)))

    gs_results_tf_st = gs_results_tf_st.append(model_results, ignore_index=True)
    pd.set_option('display.max_colwidth', 200)

Model:  multi_nb + tf
Best Params:  {'tf__ngram_range': (1, 2), 'tf__stop_words': 'english'}
0.9943542695836274 

0.7124735729386892 

True Negatives: 156
False Positives: 72
False Negatives: 64
True Positives: 181 

Model:  knn + tf
Best Params:  {'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}
0.5172900494001411 

0.5179704016913319 

True Negatives: 0
False Positives: 228
False Negatives: 0
True Positives: 245 

Model:  logreg + tf
Best Params:  {'tf__ngram_range': (1, 2), 'tf__stop_words': 'english'}
0.9992942836979535 

0.6659619450317125 

True Negatives: 133
False Positives: 95
False Negatives: 63
True Positives: 182 



In [37]:
gs_results_tf_st.sort_values('test_accuracy', ascending=False)

Unnamed: 0,model,best_params,train_accuracy,test_accuracy,baseline_accuracy,recall,precision,f1-score
0,multi_nb + tf,"{'tf__ngram_range': (1, 2), 'tf__stop_words': 'english'}",0.994354,0.712474,0.51729,0.738776,0.715415,0.726908
2,logreg + tf,"{'tf__ngram_range': (1, 2), 'tf__stop_words': 'english'}",0.999294,0.665962,0.51729,0.742857,0.65704,0.697318
1,knn + tf,"{'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}",0.51729,0.51797,0.51729,1.0,0.51797,0.682451


Multinomial NB is again the model with the highest test accuracy, but it overfits heavily (training accuracy 0.994 against test accuracy 0.712).

In [38]:
# Concatenate the dataframes containing results from both vectorizations
results_stem = pd.concat([gs_results_cv_st, gs_results_tf_st], ignore_index=True)
results_stem.sort_values('test_accuracy', ascending=False)

Unnamed: 0,model,best_params,train_accuracy,test_accuracy,baseline_accuracy,recall,precision,f1-score
3,multi_nb + tf,"{'tf__ngram_range': (1, 2), 'tf__stop_words': 'english'}",0.994354,0.712474,0.51729,0.738776,0.715415,0.726908
0,multi_nb + cv,"{'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.896965,0.697674,0.51729,0.726531,0.700787,0.713427
5,logreg + tf,"{'tf__ngram_range': (1, 2), 'tf__stop_words': 'english'}",0.999294,0.665962,0.51729,0.742857,0.65704,0.697318
2,logreg + cv,"{'cv__ngram_range': (1, 2), 'cv__stop_words': 'english'}",0.999294,0.630021,0.51729,0.902041,0.594086,0.71637
1,knn + cv,"{'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.534227,0.528541,0.51729,0.991837,0.523707,0.685472
4,knn + tf,"{'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}",0.51729,0.51797,0.51729,1.0,0.51797,0.682451


In [39]:
# Compare with the results from the lemmatized posts
results_lem.sort_values('test_accuracy', ascending=False)

Unnamed: 0,model,best_params,train_accuracy,test_accuracy,baseline_accuracy,recall,precision,f1-score
3,multi_nb + tf,"{'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}",0.941426,0.714588,0.51729,0.795918,0.696429,0.742857
0,multi_nb + cv,"{'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.917431,0.693446,0.51729,0.726531,0.695312,0.710579
5,logreg + tf,"{'tf__ngram_range': (1, 2), 'tf__stop_words': 'english'}",0.999294,0.665962,0.51729,0.771429,0.649485,0.705224
2,logreg + cv,"{'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.999294,0.625793,0.51729,0.693878,0.625,0.65764
1,knn + cv,"{'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.533522,0.51797,0.51729,0.987755,0.518201,0.679775
4,knn + tf,"{'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}",0.51729,0.51797,0.51729,1.0,0.51797,0.682451


For the best performing model, i.e. Multinomial NB with Tfidf Vectorizer, the lemmatized data performed only marginally better in terms of accuracy. However, given the context of the problem, we want to maximise recall, since it measures the model’s ability to find all the data points of interest, i.e. all the suicide posts. If recall is low, red flags will not be raised for some people who are actually suicidal. The following models seem to offer the best balance of recall and accuracy:
1. Lemmatized Tfidf Multinomial NB
    * tf__ngram_range=(1,1)
    * tf__stop_words='english'
2. Lemmatized Tfidf Scaled Logistic Regression
    * tf__ngram_range=(1,2)
    * tf__stop_words='english'
    
We will optimise these models and pick the best.
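Recall has to be read alongside accuracy here: the KNN models reach a recall of about 1.0 only by labelling nearly every post as suicidal, which leaves their accuracy at the baseline. One way to express the selection is sketched below, using the figures from the lemmatized results table. The `min_gain` margin over the baseline and the `shortlist` helper are hypothetical choices made for illustration, not values or functions from the notebook.

```python
# (model, test_accuracy, recall) copied from the lemmatized results table above
results = [
    ('multi_nb + tf', 0.714588, 0.795918),
    ('multi_nb + cv', 0.693446, 0.726531),
    ('logreg + tf', 0.665962, 0.771429),
    ('logreg + cv', 0.625793, 0.693878),
    ('knn + cv', 0.517970, 0.987755),
    ('knn + tf', 0.517970, 1.000000),
]
baseline = 0.51729
min_gain = 0.05  # hypothetical margin over baseline; not a value from the notebook

def shortlist(rows, baseline, min_gain, top_n=2):
    """Keep models that clearly beat the baseline, then rank them by recall."""
    viable = [row for row in rows if row[1] >= baseline + min_gain]
    return sorted(viable, key=lambda row: row[2], reverse=True)[:top_n]

for model, accuracy, recall in shortlist(results, baseline, min_gain):
    print(model, accuracy, recall)
```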

### Optimising Tfidf Multinomial Naive Bayes

Reset X and y so that we use the lemmatized posts instead of stemmed posts. Train/test split again.

In [80]:
X = df_prep['post_lem']
y = df_prep['suicide']

In [81]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [82]:
# Create empty dataframe to store results for Multinomial NB model tweaks
mnb_runs = pd.DataFrame(columns=['train_accuracy','test_accuracy','best_params','recall',
                                 'precision','f1-score'])

In [83]:
# Instantiate Pipeline with fixed parameters from the earlier run
mnb_pipe = Pipeline([('tf',TfidfVectorizer(stop_words='english', ngram_range=(1,1))),
                 ('mnb',MultinomialNB())
                ])

In [84]:
# parameters for GridSearch using Pipeline, formatted to call named estimators
mnb_params = {'mnb__alpha': np.arange(1,1.5,.1),
              'tf__max_features': [2000, 2500, 3000],
              'tf__min_df': [2, 3], 
              'tf__max_df': [.4, 0.45, 0.55]
             }

In [85]:
# Empty dictionary to store results
mnb_results = {} 

mnb_gs = GridSearchCV(mnb_pipe, mnb_params, cv=5)
mnb_gs.fit(X_train, y_train)

print('Train Accuracy: ', mnb_gs.score(X_train, y_train))
mnb_results['train_accuracy'] = mnb_gs.score(X_train, y_train)

print('Test Accuracy: ',mnb_gs.score(X_test, y_test))
mnb_results['test_accuracy'] = mnb_gs.score(X_test, y_test)

print('Best Params: ',mnb_gs.best_params_)
mnb_results['best_params'] = mnb_gs.best_params_ 

tn, fp, fn, tp = confusion_matrix(y_test, mnb_gs.predict(X_test)).ravel()
mnb_results['recall'] = tp/(tp+fn)
mnb_results['precision'] = tp/(tp+fp)
mnb_results['f1-score'] = 2*((tp/(tp+fp))*(tp/(tp+fn)))/((tp/(tp+fp))+(tp/(tp+fn)))

mnb_runs = mnb_runs.append(mnb_results, ignore_index=True)
pd.set_option('display.max_colwidth', 200)

Train Accuracy:  0.8701482004234298
Test Accuracy:  0.693446088794926
Best Params:  {'mnb__alpha': 1.4000000000000004, 'tf__max_df': 0.45, 'tf__max_features': 3000, 'tf__min_df': 2}


In [86]:
mnb_runs

Unnamed: 0,train_accuracy,test_accuracy,best_params,recall,precision,f1-score
0,0.870148,0.693446,"{'mnb__alpha': 1.4000000000000004, 'tf__max_df': 0.45, 'tf__max_features': 3000, 'tf__min_df': 2}",0.808163,0.668919,0.731978


### Optimising Tfidf Logistic Regression

In [87]:
# Create empty dataframe to store results for Logistic Regression model tweaks
lr_runs = pd.DataFrame(columns=['train_accuracy','test_accuracy','best_params','recall',
                                'precision','f1-score'])

In [88]:
# Instantiate Pipeline with fixed parameters from the earlier run
lr_pipe = Pipeline([('tf',TfidfVectorizer(stop_words='english', ngram_range=(1,2))),
                 ('scaler',StandardScaler(with_mean=False)),('lr',LogisticRegression(solver='liblinear'))
                ])

In [89]:
# parameters for GridSearch using Pipeline
lr_params = {'tf__max_features': [1000,2000, 2500],
             'tf__min_df': [2, 3], 
             'tf__max_df': [0.7, 0.75, 0.8],
             'lr__penalty': ['l1', 'l2'],
             'lr__class_weight': [None, 'balanced']}

In [90]:
# Empty dictionary to store results
lr_results = {} 

lr_gs = GridSearchCV(lr_pipe, lr_params, cv=5)
lr_gs.fit(X_train, y_train)

print('Train Accuracy: ', lr_gs.score(X_train, y_train))
lr_results['train_accuracy'] = lr_gs.score(X_train, y_train)

print('Test Accuracy: ',lr_gs.score(X_test, y_test))
lr_results['test_accuracy'] = lr_gs.score(X_test, y_test)

print('Best Params: ',lr_gs.best_params_)
lr_results['best_params'] = lr_gs.best_params_ 

tn, fp, fn, tp = confusion_matrix(y_test, lr_gs.predict(X_test)).ravel()
lr_results['recall'] = tp/(tp+fn)
lr_results['precision'] = tp/(tp+fp)
lr_results['f1-score'] = 2*((tp/(tp+fp))*(tp/(tp+fn)))/((tp/(tp+fp))+(tp/(tp+fn)))

lr_runs = lr_runs.append(lr_results, ignore_index=True)
pd.set_option('display.max_colwidth', 200)

Train Accuracy:  0.9992942836979535
Test Accuracy:  0.6448202959830867
Best Params:  {'lr__class_weight': None, 'lr__penalty': 'l1', 'tf__max_df': 0.8, 'tf__max_features': 2500, 'tf__min_df': 2}


In [91]:
lr_runs

Unnamed: 0,train_accuracy,test_accuracy,best_params,recall,precision,f1-score
0,0.999294,0.64482,"{'lr__class_weight': None, 'lr__penalty': 'l1', 'tf__max_df': 0.8, 'tf__max_features': 2500, 'tf__min_df': 2}",0.693878,0.646388,0.669291


Based on these parameter searches, the model selected for production is **Tfidf Vectorization with Multinomial Naive Bayes** with the following parameters:
* 'mnb__alpha': 1.4000000000000004
* 'tf__max_df': 0.45
* 'tf__max_features': 3000
* 'tf__min_df': 2
* 'tf__stop_words': 'english'
* 'tf__ngram_range': (1,1)

Although the accuracy is lower than that of the same model in the first run earlier in the notebook (0.693 against 0.715), the recall is slightly higher at 0.81 compared to 0.80. It is also less overfitted, so it should generalise better to unseen data. This model also beats the Logistic Regression on recall: 0.81 against 0.69.
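The comparison can be laid out numerically with the figures already reported in this notebook. In the sketch below the gap between training and test accuracy is used as a simple indicator of overfitting; `generalisation_gap` is a hypothetical helper, and the gap is only a rough proxy rather than a formal measure.

```python
# (train_accuracy, test_accuracy, recall) taken from the tables in this notebook
runs = {
    'first run (multi_nb + tf)': (0.941426, 0.714588, 0.795918),
    'tuned multi_nb + tf': (0.870148, 0.693446, 0.808163),
    'tuned logreg + tf': (0.999294, 0.644820, 0.693878),
}

def generalisation_gap(train_accuracy, test_accuracy):
    """Train-minus-test accuracy; a larger gap suggests more overfitting."""
    return train_accuracy - test_accuracy

for name, (train_acc, test_acc, recall) in runs.items():
    gap = generalisation_gap(train_acc, test_acc)
    print(f'{name}: gap={gap:.3f}, recall={recall:.2f}')
```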

The model is chosen mainly on recall, since a false positive is far less costly than a false negative. False negatives should be minimised because if we fail to identify potentially suicidal individuals, we would not be giving them the care and treatment they need, which can result in loss of life. Precision is less important: lower precision means more false positives, but then we are only giving extra care and attention to people who are depressed but not potentially suicidal. They may, in fact, benefit from this extra attention.

### Model Evaluation

In [92]:
# Instantiate the vectorizer with the best parameters.
tf_final = TfidfVectorizer(max_df=0.45, max_features=3000, min_df=2,
                           stop_words='english', ngram_range=(1, 1))
# MultinomialNB accepts sparse matrices directly, so no .todense() is needed.
X_train_final = tf_final.fit_transform(X_train)
X_test_final = tf_final.transform(X_test)

# Fit model.
mnb_final = MultinomialNB(alpha=1.4000000000000004)
mnb_final.fit(X_train_final, y_train)

# Predictions on the test set, used in the error analysis that follows.
preds = mnb_final.predict(X_test_final)
mnb_final.score(X_test_final, y_test)

0.693446088794926

Create a data frame to compare the predicted class with the actual class. We will look through the incorrectly predicted posts for patterns.

In [100]:
model_eval = pd.DataFrame(X_test)
model_eval['suicide'] = y_test
model_eval['predictions'] = preds

In [105]:
# Check dataframe
model_eval.head()

Unnamed: 0,post_lem,suicide,predictions
516,birthday tomorrowi recently go hospital per therapist get ivcd month ago nothing gotten better med helped still find planning death psychiatrist prescribes ambien googled ...,1,1
41,way recover live struggling depression social anxiety etc sometimes wish would transformed another person normal want live good life want live life carrying pain everyd...,1,1
1432,depression feel likei put word time depression feel like go life tight tight pressure squeezing heart day long ignore go throughout day nobody knowing going snap ...,0,0
754,bad person step grandmother extremely abusive toxic mother life got diagnosed cancer last year terminal time however required u dedicate much time longer car whate...,1,1
1824,fucking reasoni cant stand dont wanna kill really hope pray car run fuck rib gunpoint ask fucking shoot mother seem get shes pulling depressed know depressed k...,0,1


In [111]:
# Pick out the false negatives
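# Combine both conditions in one mask; chained boolean indexing triggers a reindexing warning.
fn = model_eval[(model_eval['suicide'] == 1) & (model_eval['predictions'] == 0)]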

In [120]:
print('The number of false negatives is {} out of {} posts.'.format(fn.shape[0], model_eval.shape[0]))

The number of false negatives is 47 out of 473 posts.
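The same masking idea extends to all four cells of the confusion matrix. This is a small sketch on a made-up frame: `toy_eval` and its labels are assumptions for illustration, standing in for `model_eval`.

```python
import pandas as pd

# Toy stand-in for model_eval; these labels are invented for illustration.
toy_eval = pd.DataFrame({
    "suicide":     [1, 1, 1, 0, 0, 1, 0, 0],
    "predictions": [1, 0, 1, 0, 1, 1, 0, 0],
})

# Boolean masks for the actual and predicted positive class.
actual_pos = toy_eval["suicide"] == 1
pred_pos = toy_eval["predictions"] == 1

# Each cell is the count of rows matching one combination of the two masks.
cells = {
    "true_positive": int((actual_pos & pred_pos).sum()),
    "false_negative": int((actual_pos & ~pred_pos).sum()),
    "false_positive": int((~actual_pos & pred_pos).sum()),
    "true_negative": int((~actual_pos & ~pred_pos).sum()),
}
print(cells)
```

Keeping the two masks as named variables avoids repeating the comparisons and makes it easy to pull out the false positives for the same kind of manual review.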


In [119]:
pd.set_option('display.max_colwidth', 1000)
fn

Unnamed: 0,post_lem,suicide,predictions
919,everyone backstabbing sack shit,1,0
942,depression got worse lost cati got better little bit drinking cry nonstop beside cutting drinking alcohol,1,0
290,anyone wanna talk sorry error english first language anxious right long story kind confusing go depressed since ever know started started thing know depressed since kid thing getting worse worse big crisis last two year life expecting thing complicated tried tried still fucked thing cousin life another country asked could stay 3 month thing get better organize live moved one month ago trying build thing live maybe another country another life maybe thing get better know trying asked stay home really kind supportive august 2019 got last month nice wake purpose say bad thing like watch tv cause sleeping couch couch bed insinuated find place since day arrived try help thing clean house wash clothes go supermarket buy food cook food try outside give space still acting bad towards know since asked 3 ...,1,0
445,well see fun worldnothing aint fun nomore nothing excites nomore nothing look foward nomore life aint fun nomore whats point staying,1,0
167,find sense power anything hitting situation fucked every one fucked hate fact headed nothing meaningless middle class civilian life influence meaning anyone get fucking angry everything happening world around want forced play rigged game time feel much authority anything body smacking hard possibly suffering like since born long like want cry help honest people know even cry one would listen like cry screaming loud pillow last ten minute cry screaming animal trapped cage processed meat meat even enjoyed anyone thrown away,1,0
222,old brain common life planet addition new hate reason problem fuck new brain,1,0
377,bullshitso went stupid therapist last week self harm suicide thing school dont really want help least hoping help mental health didnt happen instead going get tested autism asperger dont think autism bad really wanted wanted sort mental health take thing could mean autistic hate dont get referred therapist specializes autism need sorry offensive sorry,1,0
827,feeling suicidali hospitalised twice last week severe depression suicidal thought anxiety fiancé broken absolute mess one someone please tell ok broken,1,0
296,tired people saying get better constantly getting worsefucking liar,1,0
778,know talk anymore around 6 7 dad got scholarship australia really happy someone lot friend exceptional grade etc turned 180 came back australia home country know junior high pretty average execpt grade starting fall got occasional teasing different english accent come high school everything hit real hard grade par dad allow hang friend started heli parent would make irrational night curfew also got numerous trouble school time thought friend left one one even spread rumor gay big country top girlfriend cheated felt desolate unwanted ever since became socially awkward started stutter mumble speech thought jumping roof canceled eventually ate depression loneliness time gaining 10 kilo making obese last year got accepted uni city away home gave much needed liberty away dad tried making friend usual stuff shrugged cool enough plain awkward person kil...,1,0


After studying the posts, no clear trend emerges. That said, some authors may have posted in the subreddit while showing only mild symptoms of depression, rather than being at a more 'advanced' stage of depression that could lead to suicide. Moreover, people express themselves differently: a genuine cry for help such as 'could use someone to talk to' might be phrased quite differently by others having the same suicidal thoughts. More data is needed for the model to pick up these subtle differences in how people express themselves.
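Manual reading could be complemented by a simple token count over the misclassified posts. The sketch below is only one possible approach: `top_tokens` is a hypothetical helper, and the three short posts are invented placeholders rather than rows from `fn`.

```python
from collections import Counter


def top_tokens(posts, n=5):
    """Return the n tokens that appear in the most posts (document frequency)."""
    counts = Counter()
    for post in posts:
        # Use a set so a token repeated within one post is counted once.
        counts.update(set(post.split()))
    return counts.most_common(n)


# Placeholder posts, assumed for illustration only.
toy_fn_posts = [
    "tired people saying get better",
    "get worse every day",
    "people never listen",
]

top = top_tokens(toy_fn_posts, n=2)
print(top)
```

Applied to the real false negatives (for example `fn['post_lem']`), a list like this could show whether the missed posts share vocabulary with the depression class.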

### Conclusion and Recommendations

**Production Model**

As mentioned above, the production model is **Tfidf Vectorization with Multinomial Naive Bayes**, which achieved a fairly high recall of 81%. In other words, the model correctly identified 81% of the posts that actually belonged to the suicidal class.

**Limitations**

It is important to recognise the limitations of using posts or written notes of any kind to raise red flags about potential suicide cases. Written notes cannot capture the tone in which they are written, nor a person's behavioural changes, which usually serve as telltale signs. Thus, while the model can be useful as a first screening, there is also a need to examine behavioural changes in relationships, eating and sleeping habits, etc.

**Further Expansion**

The two subreddits used are depression and suicide, which means that the model could fail to correctly classify those who are both suicidal and depressed. The project could benefit from widening the scope of data collection to include people who may be suicidal due to other conditions such as trauma and anxiety.

Besides, more granular data, such as location, age and gender, would probably help classification, because people of different regions, genders and ages express themselves differently. This would help the model identify nuances and separate the categories better. The project could also be extended to more stakeholders, such as parents, teachers and the community, to help identify suicidal individuals early on. It would be particularly helpful for those without a background in the field, i.e. those who are not trained psychologists or psychiatrists.