# Model preparation and benchmarking

In this notebook we will:
- Prepare the data based on observations from the EDA
- Establish a null model prediction
- Train a classifier that does not use the words in the text directly, only numeric features extracted from it, with logistic regression as the estimator
- Train a simple NLP classifier with CountVectorizer as the transformer and logistic regression as the estimator

In [39]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

import re 
from nltk.sentiment.vader import SentimentIntensityAnalyzer # for sentiment analyzer
# sklearn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix


## Reading data and train test split

In [40]:
# read the data from preprocessed data that are saved into files
df = pd.read_csv('./../dataset/offmychestrelationship_advice_processed.csv')

In [42]:
df['subreddit'].value_counts(normalize=True)

1    0.562143
0    0.437857
Name: subreddit, dtype: float64

The two classes are reasonably balanced (about 56% vs. 44%).

In [43]:
X = df[['text', 'word_count', 'sentiment']]
y = df['subreddit']

In [44]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    random_state=42,
                                                    test_size=0.2,
                                                    stratify=y)
X_train = pd.DataFrame(X_train, columns=['text', 'word_count', 'sentiment'])
X_test = pd.DataFrame(X_test, columns=['text', 'word_count', 'sentiment'])

In [45]:
X_test.shape

(3253, 3)

In [46]:
y_test.value_counts(normalize=True)

1    0.56225
0    0.43775
Name: subreddit, dtype: float64

In [47]:
X_train.shape

(13008, 3)

In [48]:
y_train.value_counts(normalize=True)

1    0.562116
0    0.437884
Name: subreddit, dtype: float64

## Null model

In [49]:
y_train.value_counts(normalize=True)

1    0.562116
0    0.437884
Name: subreddit, dtype: float64

Our null model predicts every post as class 1 (relationship_advice), the majority class. This gives a baseline accuracy of 0.56.
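The baseline can also be computed directly from the labels. The snippet below is a minimal sketch on made-up labels with roughly the same 56/44 split; `null_model_accuracy` and `toy_labels` are hypothetical names, not part of the notebook. scikit-learn's `DummyClassifier(strategy='most_frequent')` gives the same kind of baseline.

```python
from collections import Counter


def null_model_accuracy(labels):
    """Return the majority label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Toy labels (illustrative only, not the real data): 56 ones and 44 zeros
toy_labels = [1] * 56 + [0] * 44
label, acc = null_model_accuracy(toy_labels)
print(f'majority class: {label}, null accuracy: {acc:.2f}')
```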

## Numeric features classifier

In this section, we fit a logistic regression model on the train data set using only the numeric features we extracted from the text (word count and sentiment score). The purpose of this model is to see how far these features can take us on their own.

In [50]:
# scale the numeric features with StandardScaler (fit on the training data only)
sc = StandardScaler()

Z_train = sc.fit_transform(X_train[['word_count', 'sentiment']])
Z_test = sc.transform(X_test[['word_count', 'sentiment']])

logr = LogisticRegression()
logr.fit(Z_train, y_train)

In [51]:
print(f'Train data score: {round(logr.score(Z_train, y_train), 2)}')
print(f'Test data score: {round(logr.score(Z_test, y_test), 2)}')

Train data score: 0.64
Test data score: 0.65


As the results show, a simple logistic regression on two numeric columns (text length and sentiment) reaches an accuracy of 0.65, which is better than the null model (0.56). Since we use only two features, we did not tune the regularization and kept scikit-learn's default for ``LogisticRegression`` (an L2 penalty with ``C=1.0``).
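One way to check how much the regularization strength matters with only two features is to compare the default ``C=1.0`` with a very weak penalty. The sketch below uses synthetic two-feature data from `make_classification` as a stand-in for `word_count` and `sentiment`; the data and the variable names are assumptions for illustration, not the notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric features (assumption, not the Reddit data)
X_toy, y_toy = make_classification(n_samples=2000, n_features=2,
                                   n_informative=2, n_redundant=0,
                                   random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          stratify=y_toy, random_state=42)

scores = {}
for C in (1.0, 1e6):  # default strength vs. an almost unregularized model
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C))
    model.fit(X_tr, y_tr)
    scores[C] = model.score(X_te, y_te)

print(scores)
```

If the two test scores are nearly identical, tuning ``C`` is unlikely to change the picture for such a small feature set.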

## Base NLP classifier

For the base NLP model, we use CountVectorizer as the transformer and logistic regression as the estimator and see how this model performs. In the next notebook, we will try other estimators and transformers to find the combination with the best performance.

In [52]:
# create X train and test based on only the text column
Z_train = X_train['text']
Z_test = X_test['text']

It is worth mentioning that the following cell was run many times, narrowing the search to the best parameters found so far, so only the latest search is shown here. With enough compute power this would not be necessary, but given the limited computation power of my personal laptop, I took a divide-and-conquer approach. An alternative is ``RandomizedSearchCV``, which samples a limited number of parameter combinations but might miss some important ones.
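A randomized search over the same kind of pipeline could look like the sketch below. It was not run in this notebook; the tiny corpus, the parameter ranges, and `n_iter=5` are assumptions chosen only to keep the example small and runnable.

```python
from scipy.stats import loguniform
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (illustrative only, not the Reddit data)
advice = ["need advice about my partner",
          "should i talk to my partner",
          "advice on my relationship please"] * 6
vent = ["i just need to vent today",
        "getting this off my chest",
        "had to get this off my chest today"] * 6
docs = advice + vent
labels = [1] * len(advice) + [0] * len(vent)

pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('logr', LogisticRegression())
])

# Lists are sampled uniformly; C is drawn from a log-uniform distribution
param_distributions = {
    'cvec__stop_words': [None, 'english'],
    'cvec__min_df': [1, 2],
    'logr__C': loguniform(1e-2, 1e0),
}

search = RandomizedSearchCV(pipe,
                            param_distributions=param_distributions,
                            n_iter=5,        # only 5 sampled combinations
                            cv=3,
                            random_state=42)
search.fit(docs, labels)
print(search.best_params_)
```

The cost is controlled by ``n_iter`` rather than by the size of the grid, which is the main appeal on a machine with limited compute.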

In [53]:
# create a pipe instance 
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('logr', LogisticRegression())
])

# decide on what parameters to modify for transformer and estimator
pipe_params = {
    'cvec__stop_words': [None, 'english'],
#   'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)],    
    'cvec__ngram_range': [(1, 2)],
    'cvec__max_df': [1.0, 0.8, 0.5],
    'cvec__min_df': [2, 4],
#   'cvec__max_features': [500, 1000, 3000, 5000],
    'cvec__max_features': [3000],

    'logr__C': [0.02, 0.05, 0.07, 0.1, 0.2],
    'logr__penalty': ['l2'], #, 'elasticnet', 'none']
}

gs = GridSearchCV(pipe,
                 param_grid=pipe_params,
                 n_jobs=-1,
                 cv=3,
                 verbose=3)

gs.fit(Z_train, y_train)

Fitting 3 folds for each of 60 candidates, totalling 180 fits


4 fits failed out of a total of 180.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\masou\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\masou\anaconda3\Lib\site-packages\sklearn\base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\masou\anaconda3\Lib\site-packages\sklearn\pipeline.py", line 416, in fit
    Xt = self._fit(X, y, **fit_params_steps)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\masou\anaconda3\Lib\site-pack

In [54]:
print(f'best score: {gs.best_score_}') # through cross-validation
print(f'best params: {gs.best_params_}') # best parameters
print('=============')
pred = gs.predict(Z_test)

print(f'train score: {gs.score(Z_train, y_train)}')
print(f'test score: {gs.score(Z_test, y_test)}')

print('=============')
pd.DataFrame(gs.cv_results_).sort_values(by='mean_test_score', ascending=False).head(5)

best score: 0.8803813038130381
best params: {'cvec__max_df': 0.5, 'cvec__max_features': 3000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': 'english', 'logr__C': 0.07, 'logr__penalty': 'l2'}
train score: 0.9491851168511685
test score: 0.8834921610820781


| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_cvec__max_df | param_cvec__max_features | param_cvec__min_df | param_cvec__ngram_range | param_cvec__stop_words | param_logr__C | param_logr__penalty | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 47 | 21.233511 | 0.252291 | 6.636843 | 0.227249 | 0.5 | 3000 | 2 | (1, 2) | english | 0.07 | l2 | {'cvec__max_df': 0.5, 'cvec__max_features': 30... | 0.882611 | 0.880304 | 0.878229 | 0.880381 | 0.00179 | 1 |
| 57 | 21.384796 | 0.364946 | 6.42736 | 0.414535 | 0.5 | 3000 | 4 | (1, 2) | english | 0.07 | l2 | {'cvec__max_df': 0.5, 'cvec__max_features': 30... | 0.883072 | 0.880074 | 0.877768 | 0.880304 | 0.002172 | 2 |
| 56 | 21.64441 | 0.194813 | 6.025435 | 0.285132 | 0.5 | 3000 | 4 | (1, 2) | english | 0.05 | l2 | {'cvec__max_df': 0.5, 'cvec__max_features': 30... | 0.881227 | 0.882149 | 0.877537 | 0.880304 | 0.001993 | 2 |
| 58 | 20.376885 | 0.334262 | 5.595811 | 0.988325 | 0.5 | 3000 | 4 | (1, 2) | english | 0.1 | l2 | {'cvec__max_df': 0.5, 'cvec__max_features': 30... | 0.883994 | 0.878921 | 0.877076 | 0.879997 | 0.002925 | 4 |
| 46 | 21.558097 | 0.686077 | 6.219707 | 0.223955 | 0.5 | 3000 | 2 | (1, 2) | english | 0.05 | l2 | {'cvec__max_df': 0.5, 'cvec__max_features': 30... | 0.880535 | 0.881458 | 0.877537 | 0.879843 | 0.001674 | 5 |


Now, let us take a look at the predictions and compute the specificity and sensitivity.

In [55]:
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print(f'specificity (true negatives/actual negatives): {round(tn/(tn+fp), 2)}')
print(f'sensitivity (recall) (true positives/actual positives): {round(tp/(tp+fn), 2)}')

specificity (true negatives/actual negatives): 0.87
sensitivity (recall) (true positives/actual positives): 0.89


As we can see, the model performs similarly on the negative and positive classes. This is partly because the data used in this section is relatively balanced.
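For reference, both metrics can be computed by hand from pairs of true and predicted labels. The sketch below uses a handful of made-up labels; `specificity_sensitivity` and the toy lists are hypothetical names for illustration only.

```python
def specificity_sensitivity(y_true, y_pred):
    """Compute specificity and sensitivity for binary labels (0 = negative, 1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    return tn / (tn + fp), tp / (tp + fn)


# Made-up labels (illustrative only): 4 actual negatives, 6 actual positives
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

spec, sens = specificity_sensitivity(y_true_toy, y_pred_toy)
print(f'specificity: {round(spec, 2)}')  # 3 of 4 negatives recovered
print(f'sensitivity: {round(sens, 2)}')  # 5 of 6 positives recovered
```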

## Key takeaways

During benchmarking, we used CountVectorizer to convert the text into numeric features and logistic regression to classify them, training on the train set and evaluating on the test set. Here are a few conclusions after multiple rounds of modeling and parameter tuning:
- We tuned some of the common parameters of CountVectorizer and logistic regression. The best model reached an accuracy of about 0.88 on the test set, which is better than both the null model (0.56) and the numeric-features model (0.65).
- We used regularization to reduce overfitting. It helped to some extent, but the final model still overfits (train accuracy of 0.95 vs. test accuracy of 0.88). l1 regularization was much more aggressive than l2 in limiting model variance, but both achieved similar results. We used l2 because it is supported by more solvers and fits faster.
- We tried various hyperparameter tuning schemes to limit model variance, aiming to make the training and test (or cross-validation) performance comparable while also improving the test/cross-validation scores. However, the test/cross-validation score did not exceed 0.88 in any of the models.
- Looking at the ``cv_results_`` of the latest search, almost all parameter combinations in the GridSearchCV give a similar score. This suggests that tuning the current hyperparameters further will not improve the results, and that the model has already reached its best performance. <br>

**Some notes on overfitting**
- To investigate how the model performs, we focused on the ``max_features`` parameter and increased it from 500 to 7000, and also tried no limit.
- For small ``max_features`` values, both the train and test/cv scores are low (train score of 0.87 and cv score of 0.84 with a maximum of 500 features).
- As ``max_features`` increased (to 3000), the train score increased rapidly, but the cv score increased slowly and reached a plateau at 0.87. Increasing the maximum number of features further pushed the train score up to 0.99, while the test/cv score did not improve beyond 0.87.
- One conclusion is that the observed overfitting comes from using too many features: the extra features improve the train score but do not help the test/cv score.
- In other words, with a maximum of 2000 features, the train score (0.91) is much closer to the test/cv score (0.86) at little cost in performance, and the model overfits far less.
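The comparison described above can be sketched as a loop over ``max_features`` that records the train and cross-validation scores side by side. The corpus below is randomly generated (a few signal words per class plus noise words) and is only a stand-in for the Reddit text; the helper `make_doc` and the chosen ``max_features`` values are assumptions for illustration, while ``C=0.07`` is taken from the best parameters found above.

```python
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

rng = random.Random(42)
noise_vocab = [f'w{i}' for i in range(300)]              # uninformative words
signal = {0: ['vent', 'chest'], 1: ['advice', 'partner']}  # class-specific words


def make_doc(label):
    """Build a synthetic document: mostly noise plus two signal words."""
    words = rng.choices(noise_vocab, k=20) + rng.choices(signal[label], k=2)
    return ' '.join(words)


labels = [i % 2 for i in range(200)]
docs = [make_doc(lbl) for lbl in labels]

results = {}
for max_features in (10, 100, None):  # None means no limit
    pipe = Pipeline([
        ('cvec', CountVectorizer(max_features=max_features)),
        ('logr', LogisticRegression(C=0.07))
    ])
    cv = cross_validate(pipe, docs, labels, cv=3, return_train_score=True)
    results[max_features] = (cv['train_score'].mean(), cv['test_score'].mean())
    print(f'max_features={max_features}: '
          f'train={results[max_features][0]:.2f}, cv={results[max_features][1]:.2f}')
```

A growing gap between the train and cv columns as ``max_features`` increases is the overfitting signature discussed in the notes above.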