# Model preparation and benchmarking

In this notebook we will:
- Prepare the data based on observations from the EDA.
- Build a null model to serve as the baseline.
- Train a classifier that does not use the words in the text directly, only numeric features extracted from the text, with logistic regression as the estimator.
- Train a simple NLP classifier, with CountVectorizer as the transformer and logistic regression as the estimator.

In [51]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

import re 
from nltk.sentiment.vader import SentimentIntensityAnalyzer # for sentiment analyzer
# sklearn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix


## Data preparation

Based on observations from the EDA part, we will prepare a dataframe to use for our analysis. 

In [23]:
# read the data from the .csv file created in the previous section
df = pd.read_csv('./../dataset/offmychestrelationship_advice.csv')
df.head(2)

Unnamed: 0,text,title,listingid,created,url,media,subreddit
0,I had a few toxic relationships as a teenage a...,I'm more than I ever expected to be,18n8wrf,1703116000.0,https://www.reddit.com/r/offmychest/comments/1...,,offmychest
1,"TLDR is, essentially, the title.\n\nFor a litt...",How do I (F27) ask my BF (M27) to be less jeal...,18naxqe,1703122000.0,https://www.reddit.com/r/relationship_advice/c...,,relationship_advice


In [24]:
# drop unnecessary columns
df = df.drop(columns=['listingid', 'created', 'url', 'media'])
# concatenate the text and title columns into a single text column
df['text'] = df.apply(lambda x: x['text'] + x['title'], axis=1)
df = df.drop(columns=['title'])
# convert subreddit names to numbers offmychest = 0, relationship_advice = 1
df['subreddit'] = df['subreddit'].map({'offmychest': 0, 'relationship_advice': 1})
df.head(3)

Unnamed: 0,text,subreddit
0,I had a few toxic relationships as a teenage a...,0
1,"TLDR is, essentially, the title.\n\nFor a litt...",1
2,\r \nAnd probably not in the way you think.\r...,0
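
One detail worth checking in the cell above is that the text and title are joined with no separator, so the last word of the body and the first word of the title can fuse into one token, and a missing body would not concatenate cleanly. The following is a minimal sketch of a more defensive variant, run on a small made-up frame (`demo` and its contents are hypothetical and not part of the dataset); it is an illustration, not the method used to produce the results in this notebook.

```python
import pandas as pd

# Hypothetical two-row frame that mimics the 'text' and 'title' columns.
demo = pd.DataFrame({
    'text': ['Body ends here.', None],
    'title': ['Title starts here', 'Only a title'],
})

# Fill missing values with empty strings, join with a single space,
# then strip the stray space left behind when one side is empty.
combined = (demo['text'].fillna('') + ' ' + demo['title'].fillna('')).str.strip()

print(combined.tolist())
```

Switching to this variant would change the token counts slightly, so the outputs recorded in this notebook would need to be regenerated.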


In [25]:
# add a new column for the word count in the text
df['word_count'] = df['text'].apply(lambda x: len(re.findall(r'(?u)\b\w\w+\b', x)))
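
The regular expression above is the same as the default `token_pattern` of CountVectorizer: it counts runs of two or more word characters, so single-character tokens such as "I" or "a" are ignored. A small sketch on a made-up sentence (the helper `count_words` and the sample string are hypothetical) shows the effect:

```python
import re

# Same pattern as in the cell above: tokens of two or more word characters.
TOKEN_PATTERN = r'(?u)\b\w\w+\b'

def count_words(text):
    """Count the tokens that match TOKEN_PATTERN in a string."""
    return len(re.findall(TOKEN_PATTERN, text))

sample = "I am a fan of NLP, aren't you?"

# Matches: am, fan, of, NLP, aren, you ("I", "a" and "t" are too short).
print(re.findall(TOKEN_PATTERN, sample))
print(count_words(sample))
```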

In [26]:
# add a new column for the sentiment of the text in each listing
sent = SentimentIntensityAnalyzer()
df['sentiment'] = df.apply(lambda x: 
                           sent.polarity_scores(x['text'])['compound'],
                           axis=1)
df.head(3)

Unnamed: 0,text,subreddit,word_count,sentiment
0,I had a few toxic relationships as a teenage a...,0,186,0.9515
1,"TLDR is, essentially, the title.\n\nFor a litt...",1,494,0.8858
2,\r \nAnd probably not in the way you think.\r...,0,190,0.8559


In [27]:
df['subreddit'].value_counts(normalize=True)

1    0.561612
0    0.438388
Name: subreddit, dtype: float64

The classes are reasonably balanced (about 56% relationship_advice and 44% offmychest), so no resampling is needed.

## Train test split

In [28]:
X = df[['text', 'word_count', 'sentiment']]
y = df['subreddit']

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    random_state=42,
                                                    test_size=0.2,
                                                    stratify=y)
X_train = pd.DataFrame(X_train, columns=['text', 'word_count', 'sentiment'])
X_test = pd.DataFrame(X_test, columns=['text', 'word_count', 'sentiment'])

In [30]:
X_test.shape

(2041, 3)

In [31]:
y_test.value_counts(normalize=True)

1    0.561489
0    0.438511
Name: subreddit, dtype: float64

In [32]:
X_train.shape

(8160, 3)

In [33]:
y_train.value_counts(normalize=True)

1    0.561642
0    0.438358
Name: subreddit, dtype: float64

## Null model

In [34]:
y_train.value_counts(normalize=True)

1    0.561642
0    0.438358
Name: subreddit, dtype: float64

Our null model predicts the majority class, 1 (relationship_advice), for every listing. This gives an accuracy of about 0.56.
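
The same baseline can be expressed with scikit-learn's `DummyClassifier`. The sketch below uses synthetic labels with roughly the class balance seen above (562 ones and 438 zeros out of 1000, chosen only for illustration), not the real `y_train`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels that approximate the observed class balance (assumption).
y_demo = np.array([1] * 562 + [0] * 438)
# The dummy model ignores the features, so a placeholder column is enough.
X_demo = np.zeros((len(y_demo), 1))

# Always predict the most frequent class seen during fit.
null_model = DummyClassifier(strategy='most_frequent')
null_model.fit(X_demo, y_demo)

null_accuracy = null_model.score(X_demo, y_demo)
print(round(null_accuracy, 2))
```

With the real data, fitting on `y_train` and scoring on `y_test` would give the baseline that the later models need to beat.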

## Numeric features classifier

In this section, we fit a logistic regression model on the training set using only the numeric features extracted from the text (word count and sentiment score). The purpose of this model is to see how much predictive power these two features have on their own.

In [35]:
# scale the numeric features with StandardScaler (fit on the training data only)
sc = StandardScaler()

Z_train = sc.fit_transform(X_train[['word_count', 'sentiment']])
Z_test = sc.transform(X_test[['word_count', 'sentiment']])

logr = LogisticRegression()
logr.fit(Z_train, y_train)

In [36]:
print(f'Train data score: {round(logr.score(Z_train, y_train), 2)}')
print(f'Test data score: {round(logr.score(Z_test, y_test), 2)}')

Train data score: 0.64
Test data score: 0.61


As we can see from the results, a simple logistic regression on just two numeric columns (text length and sentiment) reaches a test accuracy of 0.61, which is better than the null model (0.56). Since we are using only two features, we did not tune the regularization; note that scikit-learn's LogisticRegression still applies its default L2 penalty with C=1.0.
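
As a side note, the scaling and the estimator can be wrapped in a single Pipeline so that the scaler is always fit on the training data only. The sketch below runs on synthetic numbers (the arrays `X_demo` and `y_demo` are made up for illustration and are unrelated to the real listings) and also prints the default regularization strength:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the word_count and sentiment columns (assumption).
rng = np.random.default_rng(42)
n = 200
word_count = rng.integers(20, 800, size=n)
sentiment = rng.uniform(-1, 1, size=n)
X_demo = np.column_stack([word_count, sentiment])
# Synthetic target loosely tied to word count, plus noise.
y_demo = (word_count + rng.normal(0, 200, size=n) > 400).astype(int)

# Scale, then fit a logistic regression with default settings.
numeric_pipe = Pipeline([
    ('sc', StandardScaler()),
    ('logr', LogisticRegression()),
])
numeric_pipe.fit(X_demo, y_demo)

default_C = numeric_pipe.named_steps['logr'].C
demo_score = numeric_pipe.score(X_demo, y_demo)
print(default_C)
print(round(demo_score, 2))
```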

## Base NLP classifier

For the base NLP model, we use CountVectorizer as the transformer and logistic regression as the estimator and see how this combination performs. In the next notebook, we will try other transformers and estimators to find the one with the best performance.

In [37]:
# create X train and test based on only the text column
Z_train = X_train['text']
Z_test = X_test['text']

In [48]:
# create a pipe instance 
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('logr', LogisticRegression())
])

# decide on what parameters to modify for transformer and estimator
pipe_params = {
    'cvec__stop_words': [None, 'english'],
    'cvec__ngram_range': [(1, 2)],
    'cvec__max_df': [1.0, 0.8, 0.5],
    'cvec__min_df': [2, 4],
#   'cvec__max_features': [500, 1000, 3000, 5000],
    'cvec__max_features': [3000],

    'logr__C': [0.02, 0.05, 0.07, 0.1, 0.2],
    'logr__penalty': ['l2'], #, 'elasticnet', 'none']
}

gs = GridSearchCV(pipe,
                 param_grid=pipe_params,
                 n_jobs=-1,
                 cv=3,
                 verbose=3)

gs.fit(Z_train, y_train)

Fitting 3 folds for each of 60 candidates, totalling 180 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
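
The warning above means that the solver stopped before converging for at least some fits. One common response, which the warning itself suggests, is to raise `max_iter`. A minimal sketch of the same pipeline with a higher iteration limit follows; the value 1000 and the name `pipe_more_iter` are arbitrary choices for illustration, and this variant was not used to produce the results reported below.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Same steps as the pipeline above, but with a higher iteration limit
# for the logistic regression solver (default is 100).
pipe_more_iter = Pipeline([
    ('cvec', CountVectorizer()),
    ('logr', LogisticRegression(max_iter=1000)),
])

print(pipe_more_iter.get_params()['logr__max_iter'])
```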


In [50]:
print(f'best score: {gs.best_score_}') # through cross-validation
print(f'best params: {gs.best_params_}') # best parameters
print('=============')
pred = gs.predict(Z_test)

print(f'train score: {gs.score(Z_train, y_train)}')
print(f'test score: {gs.score(Z_test, y_test)}')

print('=============')
pd.DataFrame(gs.cv_results_).sort_values(by='mean_test_score', ascending=False).head(5)

best score: 0.8703431372549021
best params: {'cvec__max_df': 1.0, 'cvec__max_features': 3000, 'cvec__min_df': 4, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': 'english', 'logr__C': 0.07, 'logr__penalty': 'l2'}
train score: 0.9609068627450981
test score: 0.8682018618324351


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_cvec__max_df,param_cvec__max_features,param_cvec__min_df,param_cvec__ngram_range,param_cvec__stop_words,param_logr__C,param_logr__penalty,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
37,14.346957,0.090426,3.76818,0.03167,0.8,3000,4,"(1, 2)",english,0.07,l2,"{'cvec__max_df': 0.8, 'cvec__max_features': 30...",0.875735,0.870588,0.864706,0.870343,0.004506,1
17,9.105523,0.220617,2.691191,0.031231,1.0,3000,4,"(1, 2)",english,0.07,l2,"{'cvec__max_df': 1.0, 'cvec__max_features': 30...",0.875735,0.870588,0.864706,0.870343,0.004506,1
7,9.422559,0.241344,2.803794,0.050545,1.0,3000,2,"(1, 2)",english,0.07,l2,"{'cvec__max_df': 1.0, 'cvec__max_features': 30...",0.875,0.870588,0.864706,0.870098,0.004217,3
27,12.428231,0.7467,3.362933,0.026647,0.8,3000,2,"(1, 2)",english,0.07,l2,"{'cvec__max_df': 0.8, 'cvec__max_features': 30...",0.875,0.870588,0.864706,0.870098,0.004217,3
26,11.452624,0.446889,3.419002,0.189244,0.8,3000,2,"(1, 2)",english,0.05,l2,"{'cvec__max_df': 0.8, 'cvec__max_features': 30...",0.875,0.871691,0.860662,0.869118,0.00613,5


During the benchmarking process, we used CountVectorizer to convert the text into numbers and logistic regression to train a model on the training data and evaluate it on the test data. Here are a few conclusions after multiple rounds of modeling and parameter tuning:
- We tuned some of the common parameters of CountVectorizer and logistic regression. The best model reached an accuracy of about 0.87, which is better than both the null model and the model with simple numeric features.
- We used regularization to reduce overfitting. It was successful to some extent, but the final model still overfits (train score of 0.96 against a test score of 0.87). l1 regularization was far more aggressive than l2 in limiting model variance, but both achieved similar results. We kept l2 regularization because it works with more solvers and fits faster.
- We tried various hyperparameter tuning schemes to limit model variance, that is, to bring the training and test (or cross-validation) scores closer together while also improving the test/cv scores. However, the test/cv score did not exceed 0.88 in any of the models.
- In an effort to decrease the variance and improve model performance, we increased the number of listings from 7000 to 10000, but model performance did not improve.
- Looking at the ``cv_results_``, we can see that almost all parameter combinations in the GridSearchCV result in a similar score. This suggests that tuning only the current hyperparameters will not improve the results any further and that this model has already reached its best performance.
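
For reference, the size of the grid behind these results can be checked with `ParameterGrid`. The sketch below copies the parameter dictionary from the grid search cell and confirms the 60 candidates and 180 fits reported in the log (the variable names `n_candidates` and `n_fits` are only illustrative):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the GridSearchCV cell above.
pipe_params = {
    'cvec__stop_words': [None, 'english'],
    'cvec__ngram_range': [(1, 2)],
    'cvec__max_df': [1.0, 0.8, 0.5],
    'cvec__min_df': [2, 4],
    'cvec__max_features': [3000],
    'logr__C': [0.02, 0.05, 0.07, 0.1, 0.2],
    'logr__penalty': ['l2'],
}

# Number of parameter combinations: 2 * 1 * 3 * 2 * 1 * 5 * 1.
n_candidates = len(ParameterGrid(pipe_params))
# Each candidate is fit once per cross-validation fold (cv=3).
n_fits = n_candidates * 3

print(n_candidates, n_fits)
```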

**Some notes on overfitting**
- To investigate how the model behaves, we focused on the ``max_features`` parameter and varied it from 500 to 7000, and also tried no limit.
- For small values of ``max_features``, both the train score and the test/cv scores are low (train score of 0.87 and cv score of 0.84 with a maximum of 500 features).
- As ``max_features`` increased to 3000, the train score rose rapidly while the cv score rose slowly and plateaued at 0.87. Increasing the maximum number of features further pushed the train score up to 0.99, while the test/cv score did not improve beyond 0.87.
- One conclusion we can draw is that the observed overfitting comes from using too many features: the additional features improve the training score but do not help the test/cv score.
- In other words, with a maximum of 2000 features, the train score (0.91) is much closer to the test/cv score (0.86) with little loss in performance, and overfitting is much less of a concern.
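
An experiment of this kind can be sketched with `cross_validate` and `return_train_score=True`, which reports train and cv scores side by side for each value of ``max_features``. The example below runs on a tiny made-up corpus (the sentences, labels and the candidate values 5, 20 and None are all hypothetical), so its scores say nothing about the real data; it only shows the mechanics.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

# Tiny synthetic corpus with two classes (illustration only).
advice = [f"how do i ask my boyfriend about topic{i} advice please" for i in range(12)]
vent = [f"i just need to vent about my day number{i} feeling tired" for i in range(12)]
texts = advice + vent
labels = [1] * len(advice) + [0] * len(vent)

rows = []
for max_features in [5, 20, None]:
    pipe = Pipeline([
        ('cvec', CountVectorizer(max_features=max_features)),
        ('logr', LogisticRegression(max_iter=1000)),
    ])
    # return_train_score=True adds the training score of every fold.
    scores = cross_validate(pipe, texts, labels, cv=3, return_train_score=True)
    rows.append({
        'max_features': max_features,
        'train_score': float(np.mean(scores['train_score'])),
        'cv_score': float(np.mean(scores['test_score'])),
    })

for row in rows:
    print(row)
```

A widening gap between `train_score` and `cv_score` as ``max_features`` grows is the pattern described in the notes above.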

Now let us look at the predictions and compute the specificity and sensitivity.

In [53]:
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print(f'specificity (true negatives/actual negatives): {round(tn/(tn+fp), 2)}')
print(f'sensitivity/recall (true positives/actual positives): {round(tp/(tp+fn), 2)}')

specificity (true negatives/actual negatives): 0.86
sensitivity/recall (true positives/actual positives): 0.87


As we can see, the model performs similarly on the negative and positive classes. This is partly because the data used in this section are relatively balanced.
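
To make the two metrics concrete, here is a small worked example on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical and unrelated to the model above). It uses the same `confusion_matrix(...).ravel()` unpacking as the cell above:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 3 actual negatives and 4 actual positives.
y_true_demo = [0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

specificity = tn / (tn + fp)   # share of actual negatives predicted as negative
sensitivity = tp / (tp + fn)   # share of actual positives predicted as positive

print(tn, fp, fn, tp)
print(round(specificity, 2), round(sensitivity, 2))
```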