# Production model
In this notebook, we will pick up our best model from the model tuning notebook and look at its coefficients. Also, we will interpret how this model performs in the case of imbalance data, and how we might want to modify it for data imbalance. 

In [87]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

import re 
from nltk.sentiment.vader import SentimentIntensityAnalyzer # for sentiment analyzer

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

## Data preparation and train/test split
This part is similar to the previous notebooks. 

In [88]:
# import data
df = pd.read_csv('./../dataset/offmychestrelationship_advice.csv')
# drop unnecessary columns
df = df.drop(columns=['listingid', 'created', 'url', 'media'])
# mix text and title columns
df['text'] = df.apply(lambda x: x['text'] + x['title'], axis=1)
df = df.drop(columns=['title'])
# convert subreddit names to numbers offmychest = 0, relationship_advice = 1
df['subreddit'] = df['subreddit'].map(
                                      {'offmychest': 0, 
                                       'relationship_advice': 1})
# add a new column for the word count in the text
df['word_count'] = df['text'].apply(lambda x: 
                                    len(re.findall(r'(?u)\b\w\w+\b', x)))
# add a new column for the sentiment of the text in each listing
sent = SentimentIntensityAnalyzer()
df['sentiment'] = df.apply(lambda x: 
                           sent.polarity_scores(x['text'])['compound'],
                           axis=1)

X = df[['text', 'word_count', 'sentiment']]
y = df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    random_state=42,
                                                    test_size=0.2,
                                                    stratify=y)
X_train = pd.DataFrame(X_train, columns=['text', 'word_count', 'sentiment'])
X_test = pd.DataFrame(X_test, columns=['text', 'word_count', 'sentiment'])

## Production model

This is the best performing model from the tuning stage

In [89]:
# create X train and test best on only the text column
Z_train = X_train['text']
Z_test = X_test['text']

# use the best model from tuning notebook and refit
pipe_cvec_logr_best = Pipeline([
    ('cvec', CountVectorizer(max_df=1.0,
                             max_features=3000,
                             min_df=4,
                             ngram_range=(1,2),
                             stop_words='english')),
    ('logr', LogisticRegression(C=0.07,
                                penalty='l2'))
])
pipe_cvec_logr_best.fit(Z_train, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [90]:
print(f'train score: {pipe_cvec_logr_best.score(Z_train, y_train)}')
print(f'test score: {pipe_cvec_logr_best.score(Z_test, y_test)}')

train score: 0.9660115979381443
test score: 0.8660656793303284


In [91]:
y_test.value_counts(normalize=True)

1    0.541533
0    0.458467
Name: subreddit, dtype: float64

As we can see, the data seems quite balance. Let us extract some precision, recall, and f1 scores as well. 

In [92]:
pred = pipe_cvec_logr_best.predict(Z_test)
prec, recall, f1score, _ = classification_report(y_test, pred, output_dict=True)['1'].values()
print(f'precision: {round(prec,2)}' 
      f'\nrecall: {round(recall, 2)}'
      f'\nf1-score:{round(f1score, 2)}')

precision: 0.88
recall: 0.87
f1-score:0.88


## Inferential analysis on LogReg
Since we have been using logistic regression, we should be able to interpret the model and coefficients. In logistic regression, each coefficient shows the change in log odds of the event happening when the value of the feature changes by one unit. In the context of this project, the features are the number of times the popular words have appeared in our training data. 

First, let us look at the 10 words with the lowest coefficients in the logistic regression. The negative coefficient means as the number of these coefficients increases, the odds of the posts being classified as 0 (r/offmychest) increases. Since these coefficients have the biggest absolute values, a unit increase in the number of these words in a post impacts the log(odd) of our classification by the value of the coefficient (in favor of r/offmychest or the 0 class). 

In [93]:
coeffs_df = pd.DataFrame(list(zip(pipe_cvec_logr_best.named_steps['cvec'].vocabulary_.keys(),
                                  pipe_cvec_logr_best.named_steps['logr'].coef_[0][:])), columns=['feature', 'coefficient'])
coeffs_df.sort_values(by='coefficient', ascending=True).head(10)

Unnamed: 0,feature,coefficient
440,start,-0.71974
2811,anymore don,-0.674661
1695,november,-0.474135
2931,recently started,-0.454099
2330,puts,-0.414242
1319,year old,-0.407345
1216,teach,-0.40436
1160,hearing,-0.39758
1079,hiding,-0.394946
1282,process,-0.392117


Now, let us do similar analysis for words with the highest (and positive) coefficients in our logistic regression model. 

In [94]:
coeffs_df.sort_values(by='coefficient', ascending=True).tail(10)

Unnamed: 0,feature,coefficient
13,respect,0.697787
36,jealousy,0.714501
58,explain,0.716064
16,expect,0.725415
129,clear,0.727029
35,don,0.728254
41,everytime,0.815149
29,talk,0.876478
32,immature,0.887757
47,problem,0.906186


Looking at the list above, we can say that presence (or increase in the number of times these words are used) of the words like 'problem', 'immature', 'talk', and 'expect' has the highest impact on increasing the log(odd) of a listing being classified as 1 (r/relationship_advice).

## Production model with imbalance data
So far, we have been using balanced data. Now, let us assume the data is imbalance (minority class is '1' that is somehow important for us as well-e.g. for this project, class 1 is r/relationship_advice data and let us say we did not have that much of info from the relationship_advice subreddit but being able to classify this subreddit was very important for us)

In [95]:
temp = df[df['subreddit'] == 1]
temp = temp.sample(frac=0.06)
df_imbalance = pd.concat([df[df['subreddit'] == 0], temp], axis=0)

In [96]:
df_imbalance['subreddit'].value_counts(normalize=True)

0    0.933841
1    0.066159
Name: subreddit, dtype: float64

As we can see, the df_imbalance dataframe is severely imbalanced now and the 1 category (which we are 'assuming' is important for us) is underrepresented.

In [97]:
Ximb = df_imbalance.drop(columns='subreddit')
yimb = df_imbalance['subreddit']
Ximb_train, Ximb_test, yimb_train, yimb_test = train_test_split(Ximb, yimb,
                                                                test_size=0.2,
                                                                random_state=42,
                                                                stratify=yimb)

In [98]:
yimb_test.mean(), yimb_train.mean() 

(0.06561679790026247, 0.0662947161142107)

In [99]:
# create Z train and test based on only the text column
Zimb_train = Ximb_train['text']
Zimb_test = Ximb_test['text']

# fit the previous model to this new data
pipe_cvec_logr_best.fit(Zimb_train, yimb_train)

In [100]:
# see the scores
print(f'train score: {pipe_cvec_logr_best.score(Zimb_train, yimb_train)}')
print(f'test score: {pipe_cvec_logr_best.score(Zimb_test, yimb_test)}')

train score: 0.9829340334755498
test score: 0.9330708661417323


As we can see, the score (accuracy) is actually very good but let us see how our model has been performing on our minority class.

In [101]:
predimb = pipe_cvec_logr_best.predict(Zimb_test)
prec_imb, recall_imb, f1score_imb, _ = classification_report(yimb_test, predimb, output_dict=True)['1'].values()
print(f'precision: {round(prec_imb,2)}' 
      f'\nrecall: {round(recall_imb, 2)}'
      f'\nf1-score:{round(f1score_imb, 2)}')

precision: 0.47
recall: 0.18
f1-score:0.26


As we can see, the original production model performs very poorly on the minority class when the data show strong imbalance because the majority will overtake the training process. Let us see if we can alleviate this issue. To do that, we will retrain the model by giving a higher weight to minority class (using `class_weight`). This might reduce our overall model score but should help us get better metrics on the minority class.  

In [102]:
# use the best model from tuning notebook and refit
pipe_cvec_logr_imb = Pipeline([
    ('cvec', CountVectorizer(max_df=1.0,
                             max_features=3000,
                             min_df=4,
                             ngram_range=(1,2),
                             stop_words='english')),
    ('logr', LogisticRegression(C=0.07,
                                penalty='l2',
                                class_weight={0:1, 1:30}))
])
pipe_cvec_logr_imb.fit(Zimb_train, yimb_train)

# see the scores
print(f'train score: {pipe_cvec_logr_imb.score(Zimb_train, yimb_train)}')
print(f'test score: {pipe_cvec_logr_imb.score(Zimb_test, yimb_test)}')
predimb = pipe_cvec_logr_imb.predict(Zimb_test)
prec_imb, recall_imb, f1score_imb, _ = classification_report(yimb_test, predimb, output_dict=True)['1'].values()
print(f'precision: {round(prec_imb,2)}' 
      f'\nrecall: {round(recall_imb, 2)}'
      f'\nf1-score:{round(f1score_imb, 2)}')

train score: 0.9927797833935018
test score: 0.9081364829396326
precision: 0.33
recall: 0.4
f1-score:0.36


This table shows model performance for the original production model vs. that of weighted model for severely imbalanced data:

| metric | model without weighted classes | model with weighted classes|
| --------      | ------ | ------|
| test score    | 0.93   | 0.91  |
| precision     | 0.47  | 0.33  |
| recall        | 0.18   | 0.4  |
| f1-score      | 0.26   | 0.36  |

As we can see from these scores, giving more weight to the positive class (the minority class), our model is able to predict more positives (and a higher percentage of actual positives), this means the model sensitivity (recall) increases and at the same time, the number of false positives increases as well, which means a decreased precision. 


## Key takeaways

As we have seen, the production model does not have any issue predicting balanced data, however, when the data become imbalance, then same production data struggles to predict the positive class (the imbalance class). For this to be resolved, we had to modify the production model by giving more weight to the underrepresented class (here the positive class) so that the model can perform better on that class (and predict more positives).