# Production model
In this notebook, we pick up the best model from the model tuning notebook and look at its coefficients. We also examine how this model performs on imbalanced data and how we might modify it to handle that imbalance.

In [13]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

import re 
from nltk.sentiment.vader import SentimentIntensityAnalyzer # for sentiment analyzer

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

## Data preparation and train/test split
This part is similar to the previous notebooks. 

In [3]:
# import data
df = pd.read_csv('./../dataset/offmychestrelationship_advice.csv')
# drop unnecessary columns
df = df.drop(columns=['listingid', 'created', 'url', 'media'])
# mix text and title columns
df['text'] = df.apply(lambda x: x['text'] + x['title'], axis=1)
df = df.drop(columns=['title'])
# convert subreddit names to numbers offmychest = 0, relationship_advice = 1
df['subreddit'] = df['subreddit'].map(
                                      {'offmychest': 0, 
                                       'relationship_advice': 1})
# add a new column for the word count in the text
df['word_count'] = df['text'].apply(lambda x: 
                                    len(re.findall(r'(?u)\b\w\w+\b', x)))
# add a new column for the sentiment of the text in each listing
sent = SentimentIntensityAnalyzer()
df['sentiment'] = df.apply(lambda x: 
                           sent.polarity_scores(x['text'])['compound'],
                           axis=1)

X = df[['text', 'word_count', 'sentiment']]
y = df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    random_state=42,
                                                    test_size=0.2,
                                                    stratify=y)
X_train = pd.DataFrame(X_train, columns=['text', 'word_count', 'sentiment'])
X_test = pd.DataFrame(X_test, columns=['text', 'word_count', 'sentiment'])

## Production model

This is the best-performing model from the tuning stage.

In [8]:
# create Z train and test based on only the text column
Z_train = X_train['text']
Z_test = X_test['text']

# use the best model from tuning notebook and refit
pipe_cvec_logr_best = Pipeline([
    ('cvec', CountVectorizer(max_df=1.0,
                             max_features=3000,
                             min_df=4,
                             ngram_range=(1,2),
                             stop_words='english')),
    ('logr', LogisticRegression(C=0.07,
                                penalty='l2'))
])
pipe_cvec_logr_best.fit(Z_train, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [9]:
print(f'train score: {pipe_cvec_logr_best.score(Z_train, y_train)}')
print(f'test score: {pipe_cvec_logr_best.score(Z_test, y_test)}')

train score: 0.9609068627450981
test score: 0.8682018618324351


In [10]:
y_test.value_counts(normalize=True)

1    0.561489
0    0.438511
Name: subreddit, dtype: float64

As we can see, the classes are fairly balanced (about 56% vs. 44%). Let us also extract the precision, recall, and f1 scores.

In [30]:
pred = pipe_cvec_logr_best.predict(Z_test)
prec, recall, f1score, _ = classification_report(y_test, pred, output_dict=True)['1'].values()
print(f'precision: {round(prec,2)}' 
      f'\nrecall: {round(recall, 2)}'
      f'\nf1-score:{round(f1score, 2)}')

precision: 0.89
recall: 0.87
f1-score:0.88
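
As a reminder of what these three numbers mean, here is a minimal sketch that computes them by hand from a confusion matrix and checks them against scikit-learn. The labels are hypothetical toy values, not the notebook's data. It also reads the report entries by key, which may be a little more robust than unpacking `.values()` in a fixed order.

```python
import numpy as np
from sklearn.metrics import (classification_report, confusion_matrix,
                             f1_score, precision_score, recall_score)

# hypothetical toy labels: 4 actual positives, 6 actual negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# for binary labels, ravel() returns tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# the same values, read from the report by name
report = classification_report(y_true, y_pred, output_dict=True)['1']

print(f'precision: {precision:.2f} (report: {report["precision"]:.2f})')
print(f'recall: {recall:.2f} (report: {report["recall"]:.2f})')
print(f'f1-score: {f1:.2f} (report: {report["f1-score"]:.2f})')
```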


## Production model with imbalanced data
So far, we have been using fairly balanced data. Now, let us assume the data is imbalanced and that the minority class ('1') is the one we care about. For this project, class 1 is r/relationship_advice: suppose we had only a small number of posts from that subreddit, but classifying it correctly was still very important to us.

In [33]:
temp = df[df['subreddit'] == 1]
temp = temp.sample(frac=0.06)
df_imbalance = pd.concat([df[df['subreddit'] == 0], temp], axis=0)

In [34]:
df_imbalance['subreddit'].value_counts(normalize=True)

0    0.928571
1    0.071429
Name: subreddit, dtype: float64

As we can see, the `df_imbalance` dataframe is now severely imbalanced (about 93% vs. 7%), and class 1, which we are assuming is the important one, is underrepresented.
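
The downsampling step draws a random 6% sample of class 1, so the exact rows can differ between runs. A minimal sketch of the same idea with a fixed seed is shown below. It uses a hypothetical toy dataframe, and `random_state=42` is an assumed value, not something the original cell sets.

```python
import pandas as pd

# hypothetical toy frame: 100 posts per class
toy = pd.DataFrame({'text': [f'post {i}' for i in range(200)],
                    'subreddit': [0] * 100 + [1] * 100})

# keep all of class 0 and a reproducible 6% sample of class 1
minority = toy[toy['subreddit'] == 1].sample(frac=0.06, random_state=42)
toy_imbalance = pd.concat([toy[toy['subreddit'] == 0], minority], axis=0)

# class shares after downsampling
share = toy_imbalance['subreddit'].value_counts(normalize=True)
print(share)
```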

In [52]:
Ximb = df_imbalance.drop(columns='subreddit')
yimb = df_imbalance['subreddit']
Ximb_train, Ximb_test, yimb_train, yimb_test = train_test_split(Ximb, yimb,
                                                                test_size=0.2,
                                                                random_state=42,
                                                                stratify=yimb)

In [53]:
yimb_test.mean(), yimb_train.mean() 

(0.07157676348547717, 0.07139148494288682)

In [56]:
# create Z train and test based on only the text column
Zimb_train = Ximb_train['text']
Zimb_test = Ximb_test['text']

# fit the previous model to this new data
pipe_cvec_logr_best.fit(Zimb_train, yimb_train)

In [57]:
# see the scores
print(f'train score: {pipe_cvec_logr_best.score(Zimb_train, yimb_train)}')
print(f'test score: {pipe_cvec_logr_best.score(Zimb_test, yimb_test)}')

train score: 0.9818276220145379
test score: 0.9346473029045643


As we can see, the score (accuracy) looks very good, but with about 93% of the posts in class 0, a model that always predicted class 0 would reach a similar accuracy. Let us see how the model performs on the minority class.
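
To see why accuracy alone can mislead on imbalanced data, here is a small sketch of a majority-class baseline. The labels are hypothetical toy values with roughly the same 93/7 split; `DummyClassifier` ignores the features entirely and always predicts the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# hypothetical toy labels: 93 majority (0) and 7 minority (1) samples
y_toy = np.array([0] * 93 + [1] * 7)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the baseline

# baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

acc = accuracy_score(y_toy, pred)  # equals the majority share
rec = recall_score(y_toy, pred)    # no minority sample is ever found

print(f'baseline accuracy: {acc:.2f}')
print(f'baseline recall on class 1: {rec:.2f}')
```

The baseline scores high on accuracy while finding none of the minority class, which is why the per-class metrics matter here.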

In [59]:
predimb = pipe_cvec_logr_best.predict(Zimb_test)
prec_imb, recall_imb, f1score_imb, _ = classification_report(yimb_test, predimb, output_dict=True)['1'].values()
print(f'precision: {round(prec_imb,2)}' 
      f'\nrecall: {round(recall_imb, 2)}'
      f'\nf1-score:{round(f1score_imb, 2)}')

precision: 0.6
recall: 0.26
f1-score:0.36


As we can see, the original production model performs poorly on the minority class when the data is strongly imbalanced: it recovers only about a quarter of the class 1 posts (recall of 0.26), because the majority class dominates the training process. Let us see if we can alleviate this issue by retraining the model with a higher weight on the minority class (using `class_weight`). This might reduce the overall accuracy but should improve the metrics on the minority class.
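
The weights used in this notebook are chosen by hand. As a reference point, scikit-learn also accepts `class_weight='balanced'`, which sets each weight inversely proportional to the class frequency, `n_samples / (n_classes * count)`. The sketch below computes those weights for hypothetical toy labels with a 93/7 split; it is not the weighting the notebook uses.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# hypothetical toy labels: 93 majority (0) and 7 minority (1) samples
y_toy = np.array([0] * 93 + [1] * 7)

# 'balanced' weights: n_samples / (n_classes * count of each class)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]),
                               y=y_toy)

# how much more a minority sample counts than a majority sample
ratio = weights[1] / weights[0]

print(f'weights: {dict(zip([0, 1], weights.round(3)))}')
print(f'minority/majority weight ratio: {ratio:.1f}')
```

With this split the balanced ratio is 93/7, about 13, so a manual 1:30 weighting pushes harder toward the minority class than `'balanced'` would.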

In [74]:
# same pipeline as the production model, but with a higher weight on class 1
pipe_cvec_logr_imb = Pipeline([
    ('cvec', CountVectorizer(max_df=1.0,
                             max_features=3000,
                             min_df=4,
                             ngram_range=(1,2),
                             stop_words='english')),
    ('logr', LogisticRegression(C=0.07,
                                penalty='l2',
                                class_weight={0:1, 1:30}))
])
pipe_cvec_logr_imb.fit(Zimb_train, yimb_train)

# see the scores
print(f'train score: {pipe_cvec_logr_imb.score(Zimb_train, yimb_train)}')
print(f'test score: {pipe_cvec_logr_imb.score(Zimb_test, yimb_test)}')
predimb = pipe_cvec_logr_imb.predict(Zimb_test)
prec_imb, recall_imb, f1score_imb, _ = classification_report(yimb_test, predimb, output_dict=True)['1'].values()
print(f'precision: {round(prec_imb,2)}' 
      f'\nrecall: {round(recall_imb, 2)}'
      f'\nf1-score:{round(f1score_imb, 2)}')

train score: 0.9885773624091381
test score: 0.9076763485477178
precision: 0.4
recall: 0.55
f1-score:0.46


This table compares the original production model with the class-weighted model on the severely imbalanced data (precision, recall, and f1-score are for class 1):

| metric              | model without class weights | model with class weights |
| ------------------- | --------------------------- | ------------------------ |
| test accuracy       | 0.93                        | 0.91                     |
| precision (class 1) | 0.60                        | 0.40                     |
| recall (class 1)    | 0.26                        | 0.55                     |
| f1-score (class 1)  | 0.36                        | 0.46                     |

As these scores show, giving more weight to the positive (minority) class makes the model predict more positives and recover a higher share of the actual positives, so sensitivity (recall) rises from 0.26 to 0.55. At the same time, the number of false positives increases, so precision drops from 0.60 to 0.40.
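
The same trade-off can be reproduced on a tiny synthetic problem. The sketch below is only an illustration under stated assumptions: one hypothetical numeric feature, overlapping classes with a 93/7 split, and the same `{0: 1, 1: 30}` weights. It fits a logistic regression with and without class weights and compares how many positives each predicts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# hypothetical 1-D data: the classes overlap on the interval [0, 1]
x_neg = np.linspace(-3, 1, 93)   # majority class (0)
x_pos = np.linspace(0, 3, 7)     # minority class (1)
X_toy = np.concatenate([x_neg, x_pos]).reshape(-1, 1)
y_toy = np.array([0] * 93 + [1] * 7)

results = {}
for name, cw in [('unweighted', None), ('weighted', {0: 1, 1: 30})]:
    model = LogisticRegression(class_weight=cw).fit(X_toy, y_toy)
    pred = model.predict(X_toy)
    results[name] = {
        'predicted_positives': int(pred.sum()),
        # zero_division=0 avoids a warning if no positives are predicted
        'precision': precision_score(y_toy, pred, zero_division=0),
        'recall': recall_score(y_toy, pred),
    }

for name, scores in results.items():
    print(name, scores)
```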


## Key takeaways

As we have seen, the production model has no trouble with balanced data. However, when the data becomes imbalanced, the same production model struggles to predict the positive (minority) class. To address this, we modified the production model by giving more weight to the underrepresented class so that it performs better on that class, predicting more positives at the cost of some precision.