# Baseline Model: Logistic Regression

**THIS IS NOW AN EXPLORATION NOTEBOOK**

The **_actual_** baseline logistic regression model, trained on the large training set, is located in `baseline_logreg_cluster`.

This notebook was used to spot-check various data subsetting and sampling techniques, as well as experiment with TF-IDF vectorizer parameters, truncated SVD, and logistic regression tuning.

**BEGIN ORIGINAL FILE**

**Initial Approach:**
1. Recreate data preparation procedure from "Predicting Amazon Book Review Helpfulness using BERT on TF Hub"
2. Train/evaluate a logistic regression model with no parameter tuning

In [2]:
import numpy as np
import re
import pandas as pd
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [1]:
import os
os.getcwd()

'C:\\Users\\Brad\\Desktop\\Keras - GPU\\Baseline Models'

## Data Prep

In [3]:
# create a dataset by sampling labeled_dev_set.csv
# - 314,080 reviews
my_data = "Data/labeled/labeled_dev_set.csv"
mine = pd.read_csv(my_data)

In [106]:
mine.shape

(314080, 15)

In [35]:
mine.head()

Unnamed: 0,asin,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime,helpful_votes,review_age_days,annual_HVAR,book_num_reviews,std_HVAR,top_quartile_HVAR,most_helpful
0,000100039X,5,I would have to say that this is the best book...,2001-02-24,A26GKZPS079GFF,Areej,Touches my heart.. again and.. again...,982972800,2,4897,0.149071,86.0,2.930287,0.842702,0
1,000100039X,5,This is Gibran's most celebrated work and it i...,2000-05-03,A15ACUAJEJXCS3,Caz,Superb,957312000,1,5194,0.070273,86.0,2.930287,0.842702,0
2,000100039X,5,Gibran Khalil Gibran was born in 1883 in what ...,2006-01-10,AWLFVCT9128JV,"Dave_42 ""Dave_42""",The Lessons Of Life,1136851200,8,3116,0.937099,86.0,2.930287,0.842702,1
3,000100039X,5,_The Prophet_ is a short read (my copy checks ...,2012-08-12,A2NHD7LUXVGTD3,doc peterson,a beautiful poetic commentary on what it is to...,1344729600,1,710,0.514085,86.0,2.930287,0.842702,0
4,000100039X,5,"The Prophet, for me, is a very vivid yet dense...",2007-11-29,AAEP8YFERQ8FC,General Breadbasket,Speak to Us of the Prophet,1196294400,1,2428,0.150329,86.0,2.930287,0.842702,0


In [4]:
# Drop reviews that have no review text
#mine.iloc[94073]['reviewText'] # Example has 'nan' as reviewText
mine.dropna(subset=['reviewText'],inplace=True)

In [37]:
# How many reviews have exactly 0 helpful votes?
sum(mine.helpful_votes == 0)

32445

In [5]:
# How many examples of each group?
print('neg_helpful: {}'.format(mine[(mine.overall == 1) & (mine.most_helpful == 1) & (mine.helpful_votes != 0)].shape))
print('neg_unhelpful: {}'.format(mine[(mine.overall == 1) & (mine.most_helpful == 0) & (mine.helpful_votes == 0)].shape))
print('pos_unhelpful: {}'.format(mine[(mine.overall == 5) & (mine.most_helpful == 0) & (mine.helpful_votes == 0)].shape))
print('pos_helpful: {}'.format(mine[(mine.overall == 5) & (mine.most_helpful == 1) & (mine.helpful_votes != 0)].shape))

neg_helpful: (10061, 15)
neg_unhelpful: (1504, 15)
pos_unhelpful: (14786, 15)
pos_helpful: (40648, 15)


In [15]:
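# Sample the same number of reviews from each group so that positive/negative ratings and
# top-quartile-HVAR vs. zero-helpful-vote reviews are balanced.
# Sampling with replacement is needed because neg_unhelpful has only 1,504 reviews (< 2,000).
num_per_condition = 2000
repl = True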
neg_helpful = mine[(mine.overall == 1) & (mine.most_helpful == 1) & (mine.helpful_votes != 0)].sample(num_per_condition, replace=repl)
neg_unhelpful = mine[(mine.overall == 1) & (mine.most_helpful == 0) & (mine.helpful_votes == 0)].sample(num_per_condition, replace=repl)
pos_unhelpful = mine[(mine.overall == 5) & (mine.most_helpful == 0) & (mine.helpful_votes == 0)].sample(num_per_condition, replace=repl)
pos_helpful = mine[(mine.overall == 5) & (mine.most_helpful == 1) & (mine.helpful_votes != 0)].sample(num_per_condition, replace=repl)
# "reviewText" has the review content
# "most_helpful" has the label of 0 or 1
# "overall" has the star-rating {1,2,3,4,5}

In [16]:
# Experiment with prepending a TEXT representation of the star rating to each review
# as a way to pass the overall rating to our classifier,
# because we haven't figured out how to send categorical data AROUND the text transformer yet
neg_helpful['prepReviewText'] = 'WORST ' + neg_helpful['reviewText']
neg_unhelpful['prepReviewText'] = 'WORST ' + neg_unhelpful['reviewText']
pos_unhelpful['prepReviewText'] = 'BEST ' + pos_unhelpful['reviewText']
pos_helpful['prepReviewText'] = 'BEST ' + pos_helpful['reviewText']
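
One possible way to route the star rating around the text vectorizer is scikit-learn's `ColumnTransformer`, which applies a different transformer to each column and concatenates the results. The sketch below is not what this notebook does; the toy dataframe and the names `features` and `model` are made up for illustration, and only the column names are borrowed from the data above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the sampled reviews (hypothetical rows, same column names).
toy = pd.DataFrame({
    "reviewText": [
        "Terrible plot and flat characters.",
        "Awful. I could not finish it.",
        "A wonderful, moving story.",
        "Loved every page of this book.",
    ],
    "overall": [1, 1, 5, 5],
    "most_helpful": [1, 0, 0, 1],
})

features = ColumnTransformer([
    # A string column name passes a 1-D Series, which TfidfVectorizer expects.
    ("text", TfidfVectorizer(), "reviewText"),
    # A list of column names passes a 2-D frame, which OneHotEncoder expects.
    ("stars", OneHotEncoder(handle_unknown="ignore"), ["overall"]),
])

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(toy[["reviewText", "overall"]], toy["most_helpful"])

# The combined matrix has one column per TF-IDF term plus one per star value.
fitted = model.named_steps["features"]
X_toy = fitted.transform(toy[["reviewText", "overall"]])
n_text = len(fitted.named_transformers_["text"].vocabulary_)
print(X_toy.shape, n_text)
```

With this layout the rating is a separate feature rather than an extra token, so the `WORST`/`BEST` prefixes would no longer be needed.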

In [227]:
neg_helpful['prepReviewText'].head()

311546    WORST The beginning of this terrible tale is r...
304382    WORST Poorly organized and written by someone ...
303849    WORST I'm just not sure where to start. The st...
298457    WORST I agree with the majority of the reviewe...
306659    WORST Isn't Steampunk just a rip-off of that h...
Name: prepReviewText, dtype: object

In [42]:
neg_helpful.head()

Unnamed: 0,asin,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime,helpful_votes,review_age_days,annual_HVAR,book_num_reviews,std_HVAR,top_quartile_HVAR,most_helpful,prepReviewText
312111,1621050203,1,I hate this book. I don't say that lightly or ...,2013-07-15,A2NZNCKZYZYJ5G,Zep Greenfelder,spoiler alert,1373846400,3,373,2.935657,9.0,1.978833,2.703704,1,WORST I hate this book. I don't say that light...
298198,312342020,1,This was one of the most annoying books I've e...,2010-04-25,A13Z3RD1MKC0HB,S. B.,Very Annoying Read!,1272153600,7,1550,1.648387,85.0,3.227744,0.962214,1,WORST This was one of the most annoying books ...
303084,465019358,1,"Like the previous reviewer, I am very impresse...",2011-04-08,A1XIDKCJ7SOVXP,Jackal,Big disappointment given the credentials of th...,1302220800,10,1202,3.036606,10.0,4.052057,1.86888,1,"WORST Like the previous reviewer, I am very im..."
300184,373712553,1,I was very excited to see that Harlequin was c...,2005-02-14,ANC8FA4FHFMCY,Winnie G,Did not live up to the Harlequin hype!,1108339200,21,3446,2.224318,5.0,0.81935,1.164781,1,WORST I was very excited to see that Harlequin...
301986,425269205,1,I too used to read the Sookie books in one sit...,2013-05-20,A3Q9QTWU49BSL9,Amazon Customer,Such a disappointment,1369008000,35,429,29.778555,964.0,32.592742,23.153407,1,WORST I too used to read the Sookie books in o...


In [17]:
# Put the subsets back into a single dataframe
# (pd.concat is used because DataFrame.append is no longer available in pandas 2.x)
stratdf = pd.concat([neg_helpful, neg_unhelpful, pos_unhelpful, pos_helpful], ignore_index=True)
print(f"Our dataset is now {stratdf.shape[0]} reviews.")

Our dataset is now 8000 reviews.


In [18]:
df = shuffle(stratdf,random_state=42)[['reviewText','overall','most_helpful']]
df_prep = shuffle(stratdf,random_state=42)[['prepReviewText','overall','most_helpful']]

In [518]:
# Original Text
X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(
    df.reviewText, df.most_helpful,
    test_size=0.2, random_state=42, stratify=df.most_helpful)


In [19]:
# Prepended Text
X_train_prep, X_test_prep, y_train_prep, y_test_prep = train_test_split(
    df_prep.prepReviewText, df_prep.most_helpful,
    test_size=0.2, random_state=42, stratify=df_prep.most_helpful)
# Ideally I would like to stratify such that train and test have stratified samples across
# BOTH the most_helpful values AND the overall rating, but I keep getting errors when I try to do that
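
One way to stratify on both columns at once is to build a single combined key and pass that to `stratify`. This is a minimal sketch on a made-up dataframe (`toy` and `strata` are illustrative names, not variables from this notebook), and it assumes every rating/label combination has at least two rows, which `train_test_split` requires for stratification.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df_prep: 4 groups (rating x label) with 10 reviews each.
toy = pd.DataFrame({
    "prepReviewText": [f"review {i}" for i in range(40)],
    "overall": [1] * 20 + [5] * 20,
    "most_helpful": ([1] * 10 + [0] * 10) * 2,
})

# Combine both columns into one key such as "1_0" so that each
# (rating, label) pair becomes its own stratum.
strata = toy["overall"].astype(str) + "_" + toy["most_helpful"].astype(str)

X_train, X_test, y_train, y_test = train_test_split(
    toy["prepReviewText"], toy["most_helpful"],
    test_size=0.2, random_state=42, stratify=strata,
)

# Each stratum should be represented equally in the test split.
test_counts = strata.loc[X_test.index].value_counts()
print(test_counts)
```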

## Data Preprocessing

In [461]:
# Build a stemming tokenizer
# - stemming seems to limit the model too much

import nltk

porter_stemmer = nltk.PorterStemmer()

def Tokenizer(str_input):
    # re.findall keeps the matched tokens (words of 2+ characters and selected punctuation);
    # re.sub with this pattern would replace those tokens with spaces instead
    words = re.findall(r"(?u)\b\w\w+\b|!|\?|\"|\'|\*|\-|\;|\:|\,|\.", str_input.lower())
    return [porter_stemmer.stem(word) for word in words]

In [20]:
# Transform text examples

# sklearn's TfidfVectorizer default token_pattern is a regexp that drops punctuation
# - leave out token_pattern to get the default behavior (token_pattern=r"(?u)\b\w\w+\b")
# - pass your own regex to retain punctuation


# consider additional parameter options:
#    token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'|\*|\-|\;|\:|\,|\."
#    min_df=100


vectorizer = TfidfVectorizer(lowercase=True,
                             #tokenizer=Tokenizer,
                             analyzer='word',
                             stop_words=None,
                             token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'|\*|\-|\;|\:|\,|\.",
                             ngram_range=(1,3),
                             max_features=100000)

X_train_prep_vec = vectorizer.fit_transform(X_train_prep)
X_test_prep_vec = vectorizer.transform(X_test_prep)

# vocabulary_ maps each token to its column index, so its length is the token count
print("Token Count: {}".format(len(vectorizer.vocabulary_)))

Token Count: 100000


## First Logistic Regression Model

In [21]:
# This is a binary problem (most_helpful is 0 or 1), so multinomial handling is not needed.

# The SAGA solver is a variant of SAG that also supports the non-smooth penalty='l1' option (i.e. L1 regularization).
# It is therefore the solver of choice for sparse multinomial logistic regression and is also suitable for very large datasets.

# saga handles either the l1 or the l2 penalty

# C = inverse of regularization strength; positive float; smaller = stronger regularization

clf = LogisticRegression(penalty='l2',
                         C=1.0,
                         random_state=42,
                         solver='saga',
                         multi_class='ovr',
                         max_iter=100,
                         n_jobs=-1,
                         verbose=True)



clf.fit(X_train_prep_vec, y_train_prep)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.


convergence after 67 epochs took 2 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    1.3s finished


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=-1,
          penalty='l2', random_state=42, solver='saga', tol=0.0001,
          verbose=True, warm_start=False)

In [22]:
test_predicted_labels = clf.predict(X_test_prep_vec)

In [23]:
# Evaluate with weighted F1 and accuracy
f1_weighted = metrics.f1_score(y_test_prep, test_predicted_labels, average='weighted')
accuracy = metrics.accuracy_score(y_test_prep, test_predicted_labels)

print('Logistic Regression Classifier')
print('-------------\n')
print('Accuracy on test set: {:0.3f}'.format(accuracy))
print('f_1 score (Weighted): {:0.3f}'.format(f1_weighted))


Logistic Regression Classifier
-------------

Accuracy on test set: 0.783
f_1 score (Weighted): 0.783


In [None]:
# Confusion matrix

## Second Logistic Regression Model

In [524]:
clf_2 = LogisticRegression(penalty='l2',
                           C=1.0,
                           random_state=42,
                           solver='saga',
                           multi_class='ovr',
                           max_iter=100,
                           n_jobs=-1,
                           verbose=True)

param_grid = {'C':list(np.linspace(0.1,1.0,10))}            
clf_2 = GridSearchCV(clf_2, param_grid, cv=3, scoring='accuracy', n_jobs=-1, verbose=True)
clf_2.fit(X_train_prep_vec, y_train_prep)
best_c = round(clf_2.best_params_['C'],2)
print('Best value for C: {}'.format(best_c))

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    7.3s finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.


convergence after 20 epochs took 0 seconds
Best value for C: 0.5


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.2s finished


In [525]:
# Fit a model with the best C value
clf_2 = LogisticRegression(penalty='l2',
                           C=best_c,
                           random_state=42,
                           solver='saga',
                           multi_class='ovr',
                           max_iter=100,
                           n_jobs=-1,
                           verbose=True)

clf_2.fit(X_train_prep_vec, y_train_prep)
test_predicted_labels_2 = clf_2.predict(X_test_prep_vec)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.


convergence after 20 epochs took 0 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.2s finished
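
As an aside, `GridSearchCV` refits the best configuration on the full training data by default (`refit=True`), so the tuned model can also be read from `best_estimator_` instead of being refit by hand. A small sketch on synthetic data (the `make_classification` inputs and the names `search` and `best_model` are illustrative only, not part of this notebook):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the vectorized training data.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

search = GridSearchCV(
    LogisticRegression(penalty="l2", solver="saga", max_iter=1000, random_state=42),
    param_grid={"C": list(np.linspace(0.1, 1.0, 10))},
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# Already refit on all of X, y with the best C found by cross-validation.
best_model = search.best_estimator_
preds = best_model.predict(X)
print(search.best_params_)
```

Unlike the manual refit above, this keeps the unrounded value of `C` that the search selected.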


In [526]:
# Evaluate with weighted F1 and accuracy
f1_weighted = metrics.f1_score(y_test_prep, test_predicted_labels_2, average='weighted')
accuracy = metrics.accuracy_score(y_test_prep, test_predicted_labels_2)

print('Logistic Regression Classifier')
print('-------------\n')
print('Accuracy on test set: {:0.3f}'.format(accuracy))
print('f_1 score (Weighted): {:0.3f}'.format(f1_weighted))

Logistic Regression Classifier
-------------

Accuracy on test set: 0.718
f_1 score (Weighted): 0.717


In [513]:
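# Confusion matrix for the tuned model, via the metrics module imported at the top
# Note: the counts recorded below sum to 8,000, so they appear to come from an earlier run
# on a larger sample rather than from the 1,600-review test split of the 8,000-review dataset.
tn, fp, fn, tp = metrics.confusion_matrix(y_test_prep, test_predicted_labels_2).ravel()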
print("True Negative: {}".format(tn))
print("True Positive: {}".format(tp))
print("False Negative: {}".format(fn))
print("False Positive: {}".format(fp))

True Negative: 3364
True Positive: 3096
False Negative: 904
False Positive: 636


## Tweaking

In [None]:
# Tweaking lessons

# acc improves with more data
# use tfidf max_features and/or regularization parameter to control for overfitting

# as the per-group sample size increases, NOT using SVD works better
# - might as well experiment with max_features
# - 10000 features -> acc decr from 79.8 to 75.4
# - 20000 features -> acc 76.6
# - 100000 features -> acc 78.2
# - 300000 features -> acc 79.4
# - 600000 features -> acc 79.9
# - 1000000 features -> acc 80.1
# - 1.97mil features -> acc 79.8

# 32000 examples; 100000 features -> 81.8
#               1mil features -> 84.0

# do not have vectorizer ignore any tokens (min_df, max_df)
# do not remove stop words
# increase to (1,3) ngrams
# try ngrams (1,3) + truncated svd;
# - svd 100 dim acc: .694 (C=0.95)
# - svd 120 dim acc: .682 (C=0.95)
# - svd 130 dim acc: .685 (C=0.86)
# - svd 150 dim acc: .689 (C=0.91)
# - svd 50 dim acc: .688 (C=0.91)
# - svd 300 dim acc: .689 (C=0.67); more dimensions -> regularization works harder

# More data
# Did not oversample:
#  - sample 1500 per group without replacement
#  - svd 300 dim; C=1.0; acc=.732
#  - svd 100 dim; C=1.0; acc=.725
#  - svd 150 dim; C=0.76; acc=.732

# results:

#True Negative: 386
#True Positive: 492
#False Negative: 108
#False Positive: 214

# Oversample:
#  - sample 2000 per group with replacement
#  - svd 100 dim; C=0.95; acc=.742
#  - svd 150 dim; C=0.81; acc=.745
#  - svd 300 dim; C=1.0; acc=.751


# oversampling the underrepresented groups increases accuracy

# Oversample (8000 total) with no SVD acc: 78.2



# reduce dimensions
# - stemming



In [352]:
# Reduce dimensions with truncated SVD
dimensions=100
svd = TruncatedSVD(n_components=dimensions, n_iter=5, random_state=42)
X_train_svd = svd.fit_transform(X_train_prep_vec)
X_test_svd = svd.transform(X_test_prep_vec)

In [353]:
clf_3 = LogisticRegression(penalty='l2',
                           C=1.0,
                           random_state=42,
                           solver='saga',
                           multi_class='ovr',
                           max_iter=100,
                           n_jobs=-1,
                           verbose=True)

param_grid = {'C':list(np.linspace(0.1,1.0,20))}            
clf_3 = GridSearchCV(clf_3, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=True)
clf_3.fit(X_train_svd, y_train_prep)
best_c = round(clf_3.best_params_['C'],2)
print('Best value for C: {}'.format(best_c))

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    3.3s finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.


convergence after 17 epochs took 0 seconds
Best value for C: 0.91


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.2s finished


In [354]:
# Fit a model with the best C value
clf_3 = LogisticRegression(penalty='l2',
                           C=best_c,
                           random_state=42,
                           solver='saga',
                           multi_class='ovr',
                           max_iter=100,
                           n_jobs=-1,
                           verbose=True)

clf_3.fit(X_train_svd, y_train_prep)
test_predicted_labels_3 = clf_3.predict(X_test_svd)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.


convergence after 17 epochs took 1 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.1s finished


In [355]:
# Evaluate with weighted F1 and accuracy
f1_weighted = metrics.f1_score(y_test_prep, test_predicted_labels_3, average='weighted')
accuracy = metrics.accuracy_score(y_test_prep, test_predicted_labels_3)

print('Logistic Regression Classifier - SVD Dimensions = {}'.format(dimensions))
print('-------------\n')
print('Accuracy on test set: {:0.3f}'.format(accuracy))
print('f_1 score (Weighted): {:0.3f}'.format(f1_weighted))

Logistic Regression Classifier - SVD Dimensions = 100
-------------

Accuracy on test set: 0.735
f_1 score (Weighted): 0.735


In [356]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test_prep, test_predicted_labels_3)
tn, fp, fn, tp = confusion_matrix(y_test_prep, test_predicted_labels_3).ravel()
print("True Negative: {}".format(tn))
print("True Positive: {}".format(tp))
print("False Negative: {}".format(fn))
print("False Positive: {}".format(fp))

True Negative: 1186
True Positive: 1166
False Negative: 434
False Positive: 414
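
As a quick sanity check, the four counts above can be turned back into the headline metrics. The sketch below hard-codes the printed counts, and the accuracy it derives matches the 0.735 reported for the SVD model.

```python
# Counts copied from the confusion matrix output above.
tn, fp, fn, tp = 1186, 414, 434, 1166

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all reviews classified correctly
precision = tp / (tp + fp)     # of the reviews predicted helpful, share that truly are
recall = tp / (tp + fn)        # of the truly helpful reviews, share that were found

print("Test examples: {}".format(total))
print("Accuracy: {:0.3f}".format(accuracy))
print("Precision: {:0.4f}".format(precision))
print("Recall: {:0.4f}".format(recall))
```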


## Alternate Labeling Strategy
Adapt from Jen's labeling notebook
1. KBinsDiscretizer with strategy='kmeans'
2. Try 3 classes (helpful, unhelpful, undetermined)
3. Try training a model on helpful and unhelpful classes only; and on all 3
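
A minimal sketch of steps 1 to 3, assuming the binning is applied to a single helpfulness score such as `annual_HVAR`. The scores below are made up, and the mapping of bins to class names is an assumption for illustration, not the labeling notebook's actual procedure.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical helpfulness scores standing in for a column such as annual_HVAR.
scores = np.array([0.0, 0.05, 0.1, 0.15, 0.9, 1.0, 1.1, 2.9, 3.0, 3.1]).reshape(-1, 1)

# strategy='kmeans' places bin edges between 1-D k-means cluster centers.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
labels = binner.fit_transform(scores).ravel().astype(int)

# Ordinal bins are ordered by score: 0 = lowest, 2 = highest.
names = np.array(["unhelpful", "undetermined", "helpful"])[labels]

# For the two-class variant, keep only the outer bins.
binary_mask = labels != 1
print(list(zip(scores.ravel(), names)))
print("Rows kept for the binary model: {}".format(binary_mask.sum()))
```

The fitted bin edges are available in `binner.bin_edges_`, which makes it easy to check where the cut-offs landed before relabeling the full dataset.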