# Project 3

Done by: Goh Chun Shan, DSIF 7


## Overview of project notebooks:

1 - Project Overview and Data Acquisition through Webscraping

2 - Exploratory Data Analysis

**3 - Model Tuning and Insights** (current notebook)

Note: In this notebook, the subreddits that each post is from is denoted as binary where: Social Anxiety is 1, OCD is 0.

In [40]:
# Import libaries
import time
import pandas as pd
import numpy as np
import nltk

In [41]:
from sklearn.pipeline import Pipeline, make_pipeline #can try to explore make_pipeline, don't have to write so much
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report, precision_score, f1_score

In [42]:
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import metrics

import warnings
warnings.filterwarnings('ignore')

# Preprocessing

In [43]:
#read in files
df_combined = pd.read_csv('data\combined.csv')

In [44]:
print(df_combined.shape)
df_combined.head()

(18000, 8)


Unnamed: 0,subreddit,selftext,title,id,fulltext,fulltextlength,selftextlength,titlelength
0,0,"I'll go first, I'm 28F. I have a lot of childh...",What are your most disturbing/ disgusting intr...,yy5m0e,What are your most disturbing/ disgusting intr...,1299,1237,116
1,0,How do you all cope with OCD and live on?,"life, coping with OCD",yy5ja0,"life, coping with OCD How do you all cope with...",63,41,22
2,0,"I’ll try to make this as concise as possible, ...","Existential OCD, and fear of Psychosis OCD?",yy4vb3,"Existential OCD, and fear of Psychosis OCD? I’...",1599,1555,24
3,0,I suffer from many different kinds of OCD and ...,OCD is ruining my life and can potentially rui...,yy4v6r,OCD is ruining my life and can potentially rui...,195,138,53
4,0,Last night I went to see Black Panther 2 with ...,"Do the exposure, babe!",yy4qs8,"Do the exposure, babe! Last night I went to se...",503,480,43


In [87]:
df_combined['fulltext'] = df_combined['fulltext'].apply(lambda x:str(x))

### Stopwords

In [71]:
# the default NLTK stopword list
stop_words = set(stopwords.words('english'))  

# add additional stopwords
additional_stopwords = {'ocd','anxiety','social'}
stop_words = stop_words.union(additional_stopwords)

In [137]:
#remove the rows where subreddit only contains stopwords
df = df_combined
df['fulltext'] = df['fulltext'].apply(lambda x: ' '.join([word for word in str(x).split() if word not in (stop_words)]))

## Normalisation

The process of text normalisation aim to reduce the amount of noise in the data, through the removal of cases, punctuations, stopwords, and changing word constructions. 

Lemmatisation and stemming are both methods that try to bring inflated words to the same form, a process that reduces some noise in the data (e.g., from counting run, ran, and running all as different words). If processing speed is not a concern in this case due to the small size of the working corpus, lemmatisation will be used.

### A) Tokenize

In [138]:
tokenizer = RegexpTokenizer('\s+', gaps = True)

In [146]:
text_tokens = [tokenizer.tokenize(text.lower()) for text in (df['fulltext'])]

In [159]:
text_tokens.head()

0    what disturbing disgusting intrusive thoughts ...
1                 life coping ocd how cope ocd live on
2    existential ocd fear psychosis ocd ill try mak...
3    ocd ruining life potentially ruin life i suffe...
4    do exposure babe last night i went see black p...
Name: text_tokens, dtype: object

In [140]:
df['text_tokens'] = pd.DataFrame(data=[text_tokens], index=['text_tokens']).T[['text_tokens']]
df['text_tokens'] = df['text_tokens'].apply(lambda row: ' '.join(row))

In [141]:
df['text_tokens'] = df['text_tokens'].str.replace(r'[^\w\s]+', '')

### B) Lemmatization

In [162]:
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

In [249]:
def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

In [252]:
lems = []
for post in df['text_tokens']:
    tok_post = []
    for word in post.split():
        tok_post.append(lemmatize_text(word)[0])
    lems.append(tok_post)

In [253]:
df['text_lem'] = lems

In [255]:
df['text_lem'] = df['text_lem'].apply(lambda row: ' '.join(row))

### C) Stemming

In [152]:
p_stemmer = PorterStemmer()

In [261]:
stem = []
for post in df['text_tokens']:
    tok_post = []
    for word in post.split():
        tok_post.append(p_stemmer.stem(word)) #why no issue with the lists of lists
    stem.append(tok_post)

In [262]:
df['text_stem'] = stem

In [263]:
df['text_stem'] = df['text_stem'].apply(lambda row: ' '.join(row))

In [264]:
df.head()

Unnamed: 0,subreddit,selftext,title,id,fulltext,fulltextlength,selftextlength,titlelength,text_tokens,text_lem,text_stem
0,0,"I'll go first, I'm 28F. I have a lot of childh...",What are your most disturbing/ disgusting intr...,yy5m0e,What disturbing/ disgusting intrusive thoughts...,1299,1237,116,what disturbing disgusting intrusive thoughts ...,what disturbing disgusting intrusive thought i...,what disturb disgust intrus thought ill go fir...
1,0,How do you all cope with OCD and live on?,"life, coping with OCD",yy5ja0,"life, coping OCD How cope OCD live on?",63,41,22,life coping ocd how cope ocd live on,life coping ocd how cope ocd live on,life cope ocd how cope ocd live on
2,0,"I’ll try to make this as concise as possible, ...","Existential OCD, and fear of Psychosis OCD?",yy4vb3,"Existential OCD, fear Psychosis OCD? I’ll try ...",1599,1555,24,existential ocd fear psychosis ocd ill try mak...,existential ocd fear psychosis ocd ill try mak...,existenti ocd fear psychosi ocd ill tri make c...
3,0,I suffer from many different kinds of OCD and ...,OCD is ruining my life and can potentially rui...,yy4v6r,OCD ruining life potentially ruin life. I suff...,195,138,53,ocd ruining life potentially ruin life i suffe...,ocd ruining life potentially ruin life i suffe...,ocd ruin life potenti ruin life i suffer mani ...
4,0,Last night I went to see Black Panther 2 with ...,"Do the exposure, babe!",yy4qs8,"Do exposure, babe! Last night I went see Black...",503,480,43,do exposure babe last night i went see black p...,do exposure babe last night i went see black p...,do exposur babe last night i went see black pa...


# Train-Test Split

In [405]:
y = df['subreddit']
X = df['text_tokens']
X_lem = df['text_lem']
X_stem = df['text_stem']

In [406]:
# Split the data into the training and testing sets for normal tokenized words
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)

In [407]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(12060,)
(5940,)
(12060,)
(5940,)


In [408]:
X_train = X_train.values.astype('U')
X_test = X_test.values.astype('U')

In [409]:
# LEMMATIZATION: Split the data into the training and testing sets.
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_lem,
                                                    y,
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)

In [410]:
X_train = X_train2.values.astype('U')
X_test = X_test2.values.astype('U')

In [411]:
print(X_train2.shape)
print(X_test2.shape)
print(y_train2.shape)
print(y_test2.shape)

(12060,)
(5940,)
(12060,)
(5940,)


In [412]:
# STEMMING: Split the data into the training and testing sets.
X_train3, X_test3, y_train3, y_test3 = train_test_split(X_stem,
                                                    y,
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)

In [414]:
X_train3 = X_train3.values.astype('U')
X_test3= X_test3.values.astype('U')

In [415]:
print(X_train3.shape)
print(X_test3.shape)
print(y_train3.shape)
print(y_test3.shape)

(12060,)
(5940,)
(12060,)
(5940,)


# Baseline model

The basic baseline model would have an accuracy of 50% since we have the same number of posts from each subreddit (Social Anxiety and OCD).

In [267]:
y.value_counts(normalize = True)

1    0.5
0    0.5
Name: subreddit, dtype: float64

# CountVectorizer 

In [None]:
# instantiate a basic vectoriser with only settings for extracting bi- and tri-grams
#cvec = CountVectorizer(stop_words = stop_words, ngram_range=(2,3))

In [363]:
cvec = CountVectorizer(stop_words = stop_words)

In [364]:
cvec.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words={'haven', 'while', 'at', 'm', 'itself', 'because', "mightn't", 'if', 'some', "don't", "you'll", "she's", 'do', 'after', 'couldn', "shouldn't", 's', 'very', 'are', 't', 'mustn', 'she', "mustn't", 'ocd', "that'll", 'does', 'social', 'they', 'what', "couldn't", 'myself', 'yourself', 'themsel... 'down', 'have', 'both', 'you', 'he', 'a', 'wasn', 'on', 'can', 'them', 'is', "won't", 'did', 'don'},
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [416]:
X_train_cvec = cvec.transform(X_train) 
X_test_cvec = cvec.transform(X_test)



In [417]:
print(X_train_cvec.shape)
print(X_test_cvec.shape)

(12060, 20126)
(5940, 20126)


For Lemmatization

In [418]:
cvec.fit(X_train2)
X_train_cvec2 = cvec.transform(X_train2)
X_test_cvec2 = cvec.transform(X_test2)

In [419]:
print(X_train_cvec2.shape)
print(X_test_cvec2.shape)

(12060, 26439)
(5940, 26439)


For Stemming

In [420]:
cvec.fit(X_train3)
X_train_cvec3 = cvec.transform(X_train3)
X_test_cvec3 = cvec.transform(X_test3)

In [421]:
print(X_train_cvec3.shape)
print(X_test_cvec3.shape)

(12060, 20126)
(5940, 20126)


# Key Metric used for model evaluation: $F_1$-score

The key metric we are using to evaluate the model is 'f1_score'. In this classification problem, we neither want to minimize false positives or negatives as both mental health conditions are equally important and we wish

Instead of using 'Recall' or 'Precision', using 'f1_score' balances our false positives and false negatives. As either false positives or false negatives increase, the denominator increases while the numerator stays fixed, meaning our $F_1$-score decreases.


In [422]:
#Create function for F1 score

def f1_scorer(model, X_train, X_test, y_train, y_test):
    f1_train = f1_score(y_true = y_train,
                        y_pred = model.predict(X_train))
    f1_test = f1_score(y_true = y_test,
                       y_pred = model.predict(X_test))
    
    print("The training F1-score for " + str(model.__class__.__name__) + " is: " + str(f1_train))
    print("The testing F1-score for " + str(model.__class__.__name__) + " is: " + str(f1_test))
    print()

# Model 1 - Random Forest

Using Random Forest for a Binary Classification Problem:

Random forest decorrelates the trees in decision trees from one another. By loooking at the randomly selected subset of the features (X variables), Random forest results in higher bias, lower variance. As the Random Forest method limits the allowed variables to split on in each node, the bias for a single random forest tree is increased even more.

In Random forests, only a subset of features are selected at random out of the total and the best split feature from the subset is used to split each node in a tree, unlike in bagging where all features are considered for splitting a node.

### Model 1.1 RF with normal tokenization

In [307]:
rf = RandomForestClassifier(n_estimators=100)
cross_val_score(rf, X_train, y_train, cv=5).mean()

0.8893864013266999

In [330]:
model_rf = rf.fit(X_train, y_train)

In [309]:
print(f'Score on training set: {rf.score(X_train, y_train)}')
print(f'Score on testing set: {rf.score(X_test, y_test)}')

Score on training set: 0.9986733001658374
Score on testing set: 0.8915824915824916


In [337]:
rf_pred = rf.predict(X_test)

In [338]:
accuracies = cross_val_score(rf, X_train, y_train, cv=5, scoring = 'roc_auc')
print('roc auc training set: ', np.mean(accuracies))
print('roc auc test set : ', metrics.roc_auc_score(y_test, rf_pred))

roc auc training set:  0.952678426551598
roc auc test set :  0.8883838383838384


Evaluate model $F_1$-score with default parameters:

In [340]:
f1_scorer(model_rf, X_train, X_test, y_train, y_test)

The training F1-score for RandomForestClassifier is: 0.9986741796486576
The testing F1-score for RandomForestClassifier is: 0.8871489361702127



Use Gridsearch to fine tune the hyperparameters

#### Random Forest GridSearch

In [423]:
pipe_rf = Pipeline([
    ('cvec', CountVectorizer(stop_words = stop_words)),
    ('rf', RandomForestClassifier())
])


In [436]:
pipe_rf_params = {
    'cvec__max_features': [2_000, 3_000],
    'cvec__ngram_range': [(1,1), (1,2)],
    'rf__n_estimators': [100, 200],
    'rf__max_depth': [None, 1, 2],

}

In [437]:
gs_rf = GridSearchCV(pipe_rf,
                     param_grid = pipe_rf_params, 
                     cv=5)

In [426]:
gs_rf.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words={'haven', '...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'rf__n_estimators': [100, 200], 'rf__max_depth': [None, 1, 2]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [427]:
print(gs_rf.best_score_)
gs_rf.best_params_

0.8932835820895523


{'rf__max_depth': None, 'rf__n_estimators': 200}

In [438]:
gs_rf.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words={'haven', '...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'cvec__max_features': [2000, 3000], 'cvec__ngram_range': [(1, 1), (1, 2)], 'rf__n_estimators': [100, 200], 'rf__max_depth': [None, 1, 2]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [439]:
print(gs_rf.best_score_)
gs_rf.best_params_

0.8905472636815921


{'cvec__max_features': 3000,
 'cvec__ngram_range': (1, 2),
 'rf__max_depth': None,
 'rf__n_estimators': 200}

In [446]:
#change model parameters to optimal ones found during GridSearch
pipe_rf_gs = Pipeline([
    ('cvec', CountVectorizer(stop_words = stop_words, ngram_range= (1, 2), max_features= 3000)),
    ('rf', RandomForestClassifier(n_estimators=200))
])

In [447]:
model_rf1 = pipe_rf_gs.fit(X_train, y_train)

Evaluate $F_1$-score after tuning

In [448]:
f1_scorer(model_rf1, X_train, X_test, y_train, y_test)

The training F1-score for Pipeline is: 0.9981776010603048
The testing F1-score for Pipeline is: 0.8898633833698769



The testing F1-score only improved slightly after tuning to 0.890 (rounded) from 0.8871 (before tuning).

### Model 1.2 RF with Lemmatization


In [440]:
gs_rf.fit(X_train2, y_train2)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words={'haven', '...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'cvec__max_features': [2000, 3000], 'cvec__ngram_range': [(1, 1), (1, 2)], 'rf__n_estimators': [100, 200], 'rf__max_depth': [None, 1, 2]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [441]:
print(gs_rf.best_score_)
gs_rf.best_params_

0.889469320066335


{'cvec__max_features': 3000,
 'cvec__ngram_range': (1, 2),
 'rf__max_depth': None,
 'rf__n_estimators': 200}

In [449]:
#change model parameters to optimal ones found during GridSearch (no change from previous)
pipe_rf_gs = Pipeline([
    ('cvec', CountVectorizer(stop_words = stop_words, ngram_range= (1, 2), max_features= 3000)),
    ('rf', RandomForestClassifier(n_estimators=200))
])

In [450]:
model_rf2 = pipe_rf_gs.fit(X_train2, y_train2)

In [451]:
f1_scorer(model_rf2, X_train2, X_test2, y_train2, y_test2)

The training F1-score for Pipeline is: 0.9981776010603048
The testing F1-score for Pipeline is: 0.8862639217009788



### Model 1.3 RF with Stemming

In [442]:
gs_rf.fit(X_train3, y_train3)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words={'haven', '...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'cvec__max_features': [2000, 3000], 'cvec__ngram_range': [(1, 1), (1, 2)], 'rf__n_estimators': [100, 200], 'rf__max_depth': [None, 1, 2]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [443]:
print(gs_rf.best_score_)
gs_rf.best_params_

0.9014096185737976


{'cvec__max_features': 3000,
 'cvec__ngram_range': (1, 1),
 'rf__max_depth': None,
 'rf__n_estimators': 200}

In [452]:
#change model parameters to optimal ones found during GridSearch (no change from previous)
pipe_rf_gs3 = Pipeline([
    ('cvec', CountVectorizer(stop_words = stop_words, ngram_range= (1, 1), max_features= 3000)),
    ('rf', RandomForestClassifier(n_estimators=200))
])

In [453]:
#model_rf3 = pipe_rf_gs3.fit(X_train3, y_train3)

In [454]:
model_rf3 = pipe_rf_gs3.fit(X_train3, y_train3)
f1_scorer(model_rf3, X_train3, X_test3, y_train3, y_test3)

The training F1-score for Pipeline is: 0.9985084521047397
The testing F1-score for Pipeline is: 0.8978759558198811



Comment:

Random Forest models are prone to overfitting.
We choose Model 1.3 - Random Forest using stemming as it produces highest testing f1-score of 0.90 (rounded up), which has n-gram range of (1,1) unlike the other 2 models which has n-gram range of (1,2).


Confusion matrix and interpretation

In [455]:
y_predict = model_rf3.predict(X_test3) 
# print confusion matrix
cmatrix = confusion_matrix(y_test3, y_predict)
print("Confusion matrix:")
pd.DataFrame(cmatrix, 
             index = ['actual r/social anxiety','actual r/ocd'],
             columns = ['predicted r/social anxiety', 'predicted r/ocd'])
# tn, fp, 
# fn, tp

Confusion matrix:


Unnamed: 0,predicted r/social anxiety,predicted r/ocd
actual r/social anxiety,2697,273
actual r/ocd,328,2642


# Model 2 - Naive Bayes

**Description**:
Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

This model shortens the time taken for training. We use this as a comparison to see whether model performance is improved from the Random Forest model used previously.

### Model 2.1 Naive Bayes with normal Tokenization

In [428]:
pipe_nb = Pipeline([
    ('cvec', CountVectorizer(stop_words = stop_words)),
    ('nb', MultinomialNB())
])

In [429]:
pipe_nb_params = {
    'cvec__max_features': [2_000, 3_000, 4_000, 5_000],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [.9, .95],
    'cvec__ngram_range': [(1,1), (1,2)]
}

In [457]:
# Instantiate GridSearchCV.
gs_nb = GridSearchCV(pipe_nb,
                  param_grid = pipe_nb_params, 
                  cv=5) 

In [458]:
gs_nb.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words={'haven', '...kenizer=None, vocabulary=None)), ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'cvec__max_features': [2000, 3000, 4000, 5000], 'cvec__min_df': [2, 3], 'cvec__max_df': [0.9, 0.95], 'cvec__ngram_range': [(1, 1), (1, 2)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [432]:
print(gs_nb.best_score_)

0.9072139303482587


In [459]:
print(gs_nb.best_score_)
gs_nb.best_params_

0.9072139303482587


{'cvec__max_df': 0.9,
 'cvec__max_features': 5000,
 'cvec__min_df': 3,
 'cvec__ngram_range': (1, 1)}

In [466]:
#change model parameters to optimal ones found during GridSearch 
pipe_nb_gs = Pipeline([
    ('cvec', CountVectorizer(stop_words = stop_words, ngram_range= (1, 1), max_features= 5000, max_df = 0.9, min_df = 3)),
    ('nb', MultinomialNB())
])

In [467]:
model_nb1 = pipe_rf_gs.fit(X_train, y_train)

In [468]:
f1_scorer(model_nb1, X_train, X_test, y_train, y_test)

The training F1-score for Pipeline is: 0.9981776010603048
The testing F1-score for Pipeline is: 0.888288896388795



### Model 2.2  Naive Bayes with Lemmatization

In [460]:
gs_nb2 = GridSearchCV(pipe_nb,
                  param_grid = pipe_nb_params, 
                  cv=5) 

In [461]:
gs_nb2.fit(X_train2, y_train2)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words={'haven', '...kenizer=None, vocabulary=None)), ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'cvec__max_features': [2000, 3000, 4000, 5000], 'cvec__min_df': [2, 3], 'cvec__max_df': [0.9, 0.95], 'cvec__ngram_range': [(1, 1), (1, 2)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [462]:
print(gs_nb2.best_score_)
gs_nb2.best_params_

0.9072139303482587


{'cvec__max_df': 0.9,
 'cvec__max_features': 5000,
 'cvec__min_df': 3,
 'cvec__ngram_range': (1, 1)}

In [469]:
#change model parameters to optimal ones found during GridSearch (no change from previous)
pipe_nb_gs = Pipeline([
    ('cvec', CountVectorizer(stop_words = stop_words, ngram_range= (1, 1), max_features= 5000, max_df = 0.9, min_df = 3)),
    ('nb', MultinomialNB())
])

In [470]:
model_nb2 = pipe_rf_gs.fit(X_train2, y_train2)
f1_scorer(model_nb2, X_train2, X_test2, y_train2, y_test2)

The training F1-score for Pipeline is: 0.9981779029319199
The testing F1-score for Pipeline is: 0.8883650446203065



### Model 2.3  Naive Bayes with Stemming

In [463]:
gs_nb3 = GridSearchCV(pipe_nb,
                  param_grid = pipe_nb_params, 
                  cv=5) 

In [472]:
gs_nb3.fit(X_train3, y_train3)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words={'haven', '...kenizer=None, vocabulary=None)), ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'cvec__max_features': [2000, 3000, 4000, 5000], 'cvec__min_df': [2, 3], 'cvec__max_df': [0.9, 0.95], 'cvec__ngram_range': [(1, 1), (1, 2)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [473]:
print(gs_nb3.best_score_)
gs_nb3.best_params_
#Again, no difference in best parameters

0.9077943615257048


{'cvec__max_df': 0.9,
 'cvec__max_features': 5000,
 'cvec__min_df': 3,
 'cvec__ngram_range': (1, 1)}

In [474]:
model_nb3 = pipe_rf_gs.fit(X_train3, y_train3)
f1_scorer(model_nb3, X_train3, X_test3, y_train3, y_test3)

The training F1-score for Pipeline is: 0.9985082048731975
The testing F1-score for Pipeline is: 0.8975229046487954



In [None]:
#Accuracy score
print('For Naive Bayes Model with Stemming')
print(f'Score on training set: {model_nb3.score(X_train3, y_train3)}')
print(f'Score on testing set: {model_nb3.score(X_test3, y_test3)}')

Confusion matrix and interpretation

In [476]:
y_predict = model_nb3.predict(X_test3) #If using lemmmatization, change to X_test2 or X_test3 for stemming
# print confusion matrix
cmatrix = confusion_matrix(y_test3, y_predict)
print("Confusion matrix:")
pd.DataFrame(cmatrix, 
             index = ['actual r/social anxiety','actual r/ocd'],
             columns = ['predicted r/social anxiety', 'predicted r/ocd'])
# tn, fp, 
# fn, tp

Confusion matrix:


Unnamed: 0,predicted r/social anxiety,predicted r/ocd
actual r/social anxiety,2691,279
actual r/ocd,325,2645


## Extension: Comparison of final model using CountVectorizer with TF-IDF Vectorizer

In [493]:
#Using the Gridsearch parameters that were optimal for Countvectorizer, only switching out the vectorizer to TF-IDF vectorizer
pipe_rf_extension = Pipeline([
    ('tvec', TfidfVectorizer(stop_words = stop_words, ngram_range= (1, 1), max_features= 3000)),
    ('rf', RandomForestClassifier(n_estimators=200))
])

The model performance based on the key metric, F1-score, is better using TF-IDF than CountVectorizer

In [494]:
model_rf4 = pipe_rf_extension.fit(X_train3, y_train3)
f1_scorer(model_rf4, X_train3, X_test3, y_train3, y_test3)

The training F1-score for Pipeline is: 0.9985086992543496
The testing F1-score for Pipeline is: 0.9009345794392524



In [495]:
y_predict = model_rf4.predict(X_test3) 
# print confusion matrix
cmatrix = confusion_matrix(y_test3, y_predict)
print("Confusion matrix:")
pd.DataFrame(cmatrix, 
             index = ['actual r/social anxiety','actual r/ocd'],
             columns = ['predicted r/social anxiety', 'predicted r/ocd'])
# tn, fp, 
# fn, tp

Confusion matrix:


Unnamed: 0,predicted r/social anxiety,predicted r/ocd
actual r/social anxiety,2706,264
actual r/ocd,319,2651


# Conclusion

Comparison of models:
The final model chosen for this classification problem is: Random Forest using Stemming, with TF-IDF Vectorizer.

Using the TF-IDF Vectorizer in place of CountVectorizer improves the accuracy and F1 score only slightly, and all models are overfitted.

# Extension: Visualisation of top words and bigrams

In [None]:
# convert training data to dataframe
X_train_df = pd.DataFrame(tvec.fit_transform(X_train).todense(), 
                          columns=tvec.get_feature_names())

# plot top occuring words
X_train_df.sum().sort_values(ascending=False).head(10).plot(kind='barh');

In [None]:
#Top 10 words
X_train_df = pd.DataFrame(X_train, #what is .todense?
                          columns=cvec.get_feature_names())

# plot top occuring words
X_train_df.sum().sort_values(ascending=False).head(10).plot(kind='barh');

In [None]:
cv_trigrams = CountVectorizer(ngram_range=(3, 3), stop_words=stop_words)
cv_trigrams.fit(X_train2)#df['text_lem'])

trigrams_cv = cv_trigrams.transform(X_train2) #df['text_lem'])
trigrams_df = pd.DataFrame(trigrams_cv.todense(), columns=cv_trigrams.get_feature_names())

trigrams_df.sum().sort_values(ascending=False).head(15)

In [None]:

#For Future Reference only: sample codes to iterate through all 

# create pipeline to check both vectorisers
pl = Pipeline([
    # select the 'title' column
    ('selector', FunctionTransformer(lambda x:x['title'], validate=False)),  
    # vectorizers (and all pipeline steps below) will be specified in the param_grid
    ('vectorizer', None),                                                   
    ('reducer', None),
    ('binarizer', None),
    ('classifier', None)
])

# specify the param grid for gridsearch, which includes different feature selection methods
param_grid = [{
        # vectorisers to try: count vectoriser, tf-idf vectoriser
        'vectorizer': [CountVectorizer(tokenizer = spacy_tokenizer, stop_words = stop_words, ngram_range = (1,3)),
                       TfidfVectorizer(tokenizer = spacy_tokenizer, stop_words = stop_words, ngram_range = (1,3))],
        # feature selection by max df
        'vectorizer__max_df': [1, 0.05, 0.1],

        # to binarise or not to binarise
        'binarizer': [None,
                     Binarizer()],
        # models to test: multinomial Naive Bayes and logistic regression
        'classifier': [MultinomialNB(), RandomForestClassifier()]
    }]

# use kfold for cv to allow shuffling
kf = KFold(n_splits = 2, shuffle = True, random_state = 7)

# perform gridsearch for the best feature selection, model, etc
gs_title = GridSearchCV(pl, cv=kf, param_grid=param_grid, scoring = 'accuracy', iid=False, verbose=True)
gs_title.fit(xtrain, ytrain)
ypred_title = gs_title.predict(xval)

# call .score on the gs object to use the best parameters found during gridsearch to evaluate train and val 
print('train accuracy:', gs_title.score(xtrain, ytrain))
print('validation set accuracy:', gs_title.score(xval, yval))
print(gs_title.best_params_)