## Contents
- [Data Importing](#Data-Importing)
- [Classification metrics](#Classification-metrics)
- [Vectorizer Selection via GridSeachCV](#Vectorizer-Selection-via-GridSeachCV)
- [Hyper-parameter tuning](#Hyper-parameter-tuning)
- [Error Analysis](#Error-Analysis)
- [Redefine Data set and run models](#Redefine-Data-set-and-run-models)
- [Production Model Selection](#Production-Model-Selection)
- [Conclusion and Recommendation](#Conclusion-and-Recommendation)

In [1]:
import pandas as pd
import requests
import string
import re
import nltk
from nltk.stem import WordNetLemmatizer
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import contractions

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, accuracy_score, \
    plot_roc_curve, roc_auc_score, recall_score, precision_score, f1_score
from transformers import pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
import matplotlib.dates as mdate
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

import gensim
import gensim.downloader as api
from gensim.models.word2vec import Word2Vec


from sklearn.model_selection import(
    cross_val_score,
    train_test_split,
    GridSearchCV
)

from sklearn.preprocessing import (
    StandardScaler,
    PolynomialFeatures
)
from datetime import timezone
import datetime

pd.set_option('display.max_colwidth', None)

from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier

## Data importing

In [2]:
df = pd.read_csv('../Data/df.csv')
df.drop(columns = 'Unnamed: 0', inplace = True)

Check for null values

In [3]:
df.isnull().sum()

author                   0
subreddit                0
selftext                 0
title                    0
created_utc              0
datetime                 0
link_flair_css_class     0
alltext                  0
length_text              0
wrdcount_text            0
month                    0
day                      0
clean_text               1
ttl_post                 0
user_contribute_where    0
stem_clean_text          1
lemmi_clean_text         1
dtype: int64

In [4]:
df.dropna(inplace = True)

###  Model Prep: Create `X` and `y` variables

Our candidate features are:
- `stem_clean_text`
- `lemmi_clean_text`

For baseline modelling we will use `lemmi_clean_text`.

Our target will be `subreddit`.

In [5]:
X = df['lemmi_clean_text']
y = df['subreddit']

### Train/Test Split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

## Classification Metrics

Possible classification metrics for assessing how well my model performs are *Accuracy, Misclassification Rate, Sensitivity (True Positive Rate), Specificity (True Negative Rate), Precision (Positive Predictive Value), F1 Score and ROC AUC*.

For this project I will mainly use *Accuracy*, accompanied by a confusion matrix, to determine the effectiveness of my models.

For further evaluation, I will use Specificity to choose the model that best reduces the risk of failing to identify users who potentially have PTSD.

I will also use the ROC curve to gauge the degree of overlap between the words in posts from the two subreddits.
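
As a minimal sketch of how the metrics listed above relate to the confusion matrix, the example below computes each one on a small set of toy labels. The arrays `y_true`, `y_pred` and `y_score` are invented for illustration and are not taken from the Reddit data; class `1` is assumed to be the positive class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy labels, predictions and scores (illustrative only, 1 = positive class)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.7, 0.1])

# For binary labels {0, 1}, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
misclassification_rate = 1 - accuracy
sensitivity = tp / (tp + fn)   # true positive rate (recall)
specificity = tn / (tn + fp)   # true negative rate
precision = tp / (tp + fp)     # positive predictive value
f1 = 2 * precision * sensitivity / (precision + sensitivity)

# ROC AUC is computed from scores or probabilities, not from hard predictions
roc_auc = roc_auc_score(y_true, y_score)

print(f"accuracy={accuracy:.3f}, sensitivity={sensitivity:.3f}, "
      f"specificity={specificity:.3f}, precision={precision:.3f}, "
      f"f1={f1:.3f}, roc_auc={roc_auc:.3f}")
```

Note that sensitivity and specificity swap roles if the other class is treated as positive, so it is worth checking which subreddit is encoded as `1` before reading either number.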

### Baseline Accuracy Check

In [7]:
print(f'Total post: {len(df)}')
df['subreddit'].value_counts(normalize= True)


Total post: 20228


1    0.500395
0    0.499605
Name: subreddit, dtype: float64

With roughly 10,000 posts in each class, the two classes are very well balanced.

From the above, we can also see that the ***baseline accuracy*** is about 50.0% (0.500395).
If nothing were done for the classification model and every post were simply assigned to the PTSD class, I would classify about 50.0% of the posts correctly.

Therefore, any classification model designed for this data must have an accuracy higher than the ***baseline accuracy*** of about 50.0%.
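
The "assign every post to the majority class" baseline can also be checked programmatically. The sketch below uses scikit-learn's `DummyClassifier` on toy labels; `X_toy` and `y_toy` are invented stand-ins with a similar near 50/50 split, not the actual data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels with a near 50/50 split (illustrative only)
y_toy = np.array([1] * 5004 + [0] * 4996)
X_toy = np.zeros((len(y_toy), 1))  # DummyClassifier ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

On the real data, the same idea would be fitting on `X_train, y_train` and scoring on `X_test, y_test`, which gives a floor that any trained model should beat.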

## Vectorizer Selection via GridSeachCV

Consider two NLP vectorizers and iterate over the parameters below, starting with a Logistic Regression model.
- *Count Vectorizer*
- *TF-IDF Vectorizer*

Embed the stop word list into the vectorizers.

In [8]:
# Create the stop word list from NLTK's English stop words
stopwordlist = nltk.corpus.stopwords.words('english')

# Add subreddit names and related terms (plus URL fragments) so the model
# cannot rely on words that directly give away the subreddit
headers = ['ptsd', 'anxiety','anxious', "cptsd", 'trauma','traumatized','traumas',"posttraumatic",'stress','disorder',"traumatic", 'www', 'https', 'mentalgrenade', 'com']
stopwordlist.extend(headers)



In [9]:
# Creating Pipeline with:
# Transformer: TfidfVectorizer (step name 'tvec')
# Estimator: GradientBoostingClassifier (step name 'gboost')

pipe1 = Pipeline([('tvec', TfidfVectorizer(stop_words= stopwordlist)),
                  ('gboost', GradientBoostingClassifier())
                 ])

In [28]:
GradientBoostingClassifier()._get_param_names()

['ccp_alpha',
 'criterion',
 'init',
 'learning_rate',
 'loss',
 'max_depth',
 'max_features',
 'max_leaf_nodes',
 'min_impurity_decrease',
 'min_samples_leaf',
 'min_samples_split',
 'min_weight_fraction_leaf',
 'n_estimators',
 'n_iter_no_change',
 'random_state',
 'subsample',
 'tol',
 'validation_fraction',
 'verbose',
 'warm_start']

In [24]:
pipe1._get_param_names()

['memory', 'steps', 'verbose']
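
`_get_param_names()` is a private helper and, for a `Pipeline`, it only lists the pipeline's own arguments. As a hedged alternative, the public `get_params()` method also exposes the nested parameters of each step using the `<step>__<parameter>` naming that `GridSearchCV` expects. The sketch below rebuilds a pipeline with the same step names (`tvec`, `gboost`) using default settings, purely for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier

# Same step names as pipe1, default settings (illustrative stand-in)
pipe = Pipeline([
    ("tvec", TfidfVectorizer()),
    ("gboost", GradientBoostingClassifier()),
])

# get_params() returns a dict keyed by '<step>__<parameter>' for nested steps
param_names = sorted(pipe.get_params().keys())

tvec_params = [name for name in param_names if name.startswith("tvec__")]
gboost_params = [name for name in param_names if name.startswith("gboost__")]

print(tvec_params)
print(gboost_params)
```

Any key printed here can be used directly as a key in the parameter grid.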

In [31]:

pipe1_params = {'tvec__max_features': [1000],
                'tvec__min_df': [0.02,0.01],
                'tvec__max_df': [0.7,0.8],
                'tvec__ngram_range': [(1,1)],
                'gboost__max_depth': [4,5,6],
                'gboost__n_estimators': [150,200],
                'gboost__learning_rate': [.12,.15,.2]
               }

In [32]:
# Creating a GridSearchCV object for the
# TfidfVectorizer + GradientBoostingClassifier pipeline

gs_pipe1 = GridSearchCV(pipe1, param_grid=pipe1_params, cv=5, n_jobs=-1)
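
Before fitting, it can help to estimate how much work the search involves. The sketch below copies the parameter grid from above into a hypothetical variable `grid` and counts the candidate combinations with `ParameterGrid`; multiplying by the number of folds (`cv=5`) gives the number of cross-validation fits.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the grid defined for pipe1 (variable name 'grid' is illustrative)
grid = {
    "tvec__max_features": [1000],
    "tvec__min_df": [0.02, 0.01],
    "tvec__max_df": [0.7, 0.8],
    "tvec__ngram_range": [(1, 1)],
    "gboost__max_depth": [4, 5, 6],
    "gboost__n_estimators": [150, 200],
    "gboost__learning_rate": [0.12, 0.15, 0.2],
}

n_candidates = len(ParameterGrid(grid))  # product of the list lengths
n_folds = 5                              # matches cv=5 in GridSearchCV
total_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {total_fits} fits")
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training set.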

In [33]:
# Fitting GridSearchCV (TfidfVectorizer + GradientBoostingClassifier) on X_train and y_train.

gs_pipe1.fit(X_train, y_train);

In [None]:
gs_pipe1.best_params_

{'gboost__learning_rate': 0.12,
 'gboost__max_depth': 4,
 'gboost__n_estimators': 150,
 'tvec__max_df': 0.7,
 'tvec__max_features': 1000,
 'tvec__min_df': 0.02,
 'tvec__ngram_range': (1, 1)}

In [None]:
gs_pipe1.best_score_

0.7952665398954626

In [None]:
gs_pipe1.score(X_test,y_test)

0.7929602531144948

In [None]:
gs_pipe1.score(X_train,y_train)

0.8679058730472612
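
A quick way to read the three scores together is to compare them directly. The sketch below hard-codes the values printed above and the majority-class proportion from the earlier `value_counts` output; the interpretation in the comments is a rule of thumb, not a formal test.

```python
# Scores copied from the outputs above
cv_score = 0.7952665398954626      # mean cross-validated accuracy (best_score_)
test_score = 0.7929602531144948    # accuracy on the held-out test set
train_score = 0.8679058730472612   # accuracy on the training set
baseline = 0.500395                # majority-class proportion

# A large train/test gap suggests the model fits the training data more
# closely than it generalises
train_test_gap = train_score - test_score

# Lift over always predicting the majority class
lift_over_baseline = test_score - baseline

print(f"train/test gap: {train_test_gap:.4f}")
print(f"cv vs test difference: {abs(cv_score - test_score):.4f}")
print(f"lift over baseline: {lift_over_baseline:.4f}")
```

Here the cross-validated and test accuracies are close to each other, while the training accuracy is noticeably higher, which hints at some overfitting.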