## Overview

This is a "scratchpad" notebook of experiments and ideas. It helped shape the final design of the baseline logistic regression model, which uses balanced groups, balanced classes, and clustering to assign class labels.

It is disorganized, so tread carefully.

In [1]:
import numpy as np
import re
import pandas as pd
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [None]:
# Sampling Approaches/Results

# 1. sampling without replacement from all groups to match underrepresented class maximum (~13000)
#    - 52000 train, 6000 test
#    - f_1: .725

# 2. take all the data and accept the imbalance

# 3. sample 3 groups without replacement up to limit of neg_helpful (~81000)

# 4. try a 3-label approach
#    - All groups Jen made count for 1 and 0
#    - All data we WERE leaving out, make that the third 'undetermined' class

In [108]:
# load training data
big_train = "Data/labeled/labeled_training_set.csv"
train = pd.read_csv(big_train)

In [109]:
train.shape

(2659724, 15)

In [110]:
# Distribution of labels
train['most_helpful'].value_counts()

0    1981169
1     678555
Name: most_helpful, dtype: int64

In [111]:
# Load development data
big_dev = "Data/labeled/labeled_dev_set.csv"
dev = pd.read_csv(big_dev)

In [112]:
dev.shape

(314112, 15)

In [113]:
# Distribution of labels
dev['most_helpful'].value_counts()

0    227188
1     86924
Name: most_helpful, dtype: int64

In [114]:
# Load test data
big_test = "Data/labeled/labeled_test_set.csv"
test = pd.read_csv(big_test)

In [115]:
test.shape

(313656, 15)

In [116]:
# Distribution of labels
test['most_helpful'].value_counts()

0    226650
1     87006
Name: most_helpful, dtype: int64

In [117]:
# Test set size as a fraction of train + test
test.shape[0]/(train.shape[0]+test.shape[0])

0.10548803045692108

In [118]:
# Cleanup reviews without review content
train.dropna(subset=['reviewText'],inplace=True)
dev.dropna(subset=['reviewText'],inplace=True)
test.dropna(subset=['reviewText'],inplace=True)

In [119]:
# How many reviews have exactly 0 helpful votes?
print('train: {}'.format(sum(train.helpful_votes == 0)))
print('dev: {}'.format(sum(dev.helpful_votes == 0)))
print('test: {}'.format(sum(test.helpful_votes == 0)))

train: 271443
dev: 32445
test: 32456


## Data Format
`overall` - customer supplied star-rating

`most_helpful` - outcome variable. Marks reviews (positive or negative) with an annual HVAR value in the top quartile among reviews for the same book.

`helpful_votes` - count variable for helpful votes
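
The `most_helpful` definition above can be sketched with a per-book quantile. This is a minimal illustration, not the code that produced the labeled files. It assumes `asin` identifies the book and `annual_HVAR` holds the annual HVAR value (both appear as columns in the data), and it assumes the cutoff is the 75th percentile with a `>=` comparison. The helper name `label_most_helpful` and the `threshold` column are hypothetical.

```python
import pandas as pd


def label_most_helpful(df, book_col="asin", score_col="annual_HVAR", q=0.75):
    """Flag reviews whose score is at or above the per-book q-quantile (assumed rule)."""
    out = df.copy()
    # Per-book threshold, broadcast back to every review of that book
    out["threshold"] = out.groupby(book_col)[score_col].transform(lambda s: s.quantile(q))
    out["most_helpful"] = (out[score_col] >= out["threshold"]).astype(int)
    return out


# Toy data: two books with made-up HVAR values
toy = pd.DataFrame({
    "asin": ["A", "A", "A", "A", "B", "B"],
    "annual_HVAR": [0.0, 1.0, 2.0, 10.0, 0.5, 0.5],
})
labeled = label_most_helpful(toy)
print(labeled)
```

With a `>=` rule, ties at the threshold are all flagged, so books whose reviews share the same value end up fully labeled as helpful. Whether the original labeling handled ties this way is not stated here.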

**Testing Notes:**

As a 2-class problem with balanced groups and no oversampling, accuracy is approx. 73%.

- Using the top quartile threshold alone (no prepending/no bucketing, 100000 train/40000 test, tfidf 10000 features) - .619
- Using the top quartile threshold + prepending/bucketing (52000 train/6000 test, tfidf no max features, balanced groups) - .719
- Using the top quartile threshold + prepending/bucketing (104000 train/12000 test, tfidf no max features, balanced groups with 50% of neg_unhelpful oversampled) - .72
- Using the top quartile threshold + prepending/bucketing (104000 train/12000 test, tfidf 100k max features, balanced groups with 50% of neg_unhelpful oversampled) - .722
- Using the top quartile threshold + prepending/bucketing (104000 train/12000 test, tfidf 50k max features, balanced groups with 50% of neg_unhelpful oversampled) - .724
- Using the top quartile threshold + prepending/bucketing (104000 train/12000 test, tfidf 25k max features, balanced groups with 50% of neg_unhelpful oversampled) - .722
- Using the top quartile threshold + prepending/bucketing (104000 train/12000 test, tfidf 10k max features, balanced groups with 50% of neg_unhelpful oversampled) - .721
- Using the top quartile threshold + prepending/bucketing (208000 train/24000 test, tfidf 10k max features, balanced groups with 50% of neg_unhelpful oversampled) - .723
- Using the top quartile threshold + prepending/bucketing (208000 train/24000 test, tfidf 100k max features, balanced groups with 50% of neg_unhelpful oversampled) - .723
  - We already know this setup improves with more data
  - Can we improve the result by incorporating a third class?
- What about 3 classes?

In [120]:
# How many examples of each group in training set?
print('neg_helpful: {}'.format(train[(train.overall == 1) & (train.most_helpful == 1) & (train.helpful_votes != 0)].shape))
print('neg_unhelpful: {}'.format(train[(train.overall == 1) & (train.most_helpful == 0) & (train.helpful_votes == 0)].shape))
print('pos_unhelpful: {}'.format(train[(train.overall == 5) & (train.most_helpful == 0) & (train.helpful_votes == 0)].shape))
print('pos_helpful: {}'.format(train[(train.overall == 5) & (train.most_helpful == 1) & (train.helpful_votes != 0)].shape))

neg_helpful: (81152, 15)
neg_unhelpful: (13033, 15)
pos_unhelpful: (121879, 15)
pos_helpful: (316234, 15)


In [36]:
# Total edge classes
81152+13033+121879+316234

532298

In [37]:
# How many examples remain? (potentially "undetermined")
# - for a 3-class problem these will be labeled 'undetermined' and randomly sampled
train.shape[0] - 532298

2127125

In [121]:
# How many examples of each group in dev set?
print('neg_helpful: {}'.format(dev[(dev.overall == 1) & (dev.most_helpful == 1) & (dev.helpful_votes != 0)].shape))
print('neg_unhelpful: {}'.format(dev[(dev.overall == 1) & (dev.most_helpful == 0) & (dev.helpful_votes == 0)].shape))
print('pos_unhelpful: {}'.format(dev[(dev.overall == 5) & (dev.most_helpful == 0) & (dev.helpful_votes == 0)].shape))
print('pos_helpful: {}'.format(dev[(dev.overall == 5) & (dev.most_helpful == 1) & (dev.helpful_votes != 0)].shape))

neg_helpful: (10061, 15)
neg_unhelpful: (1504, 15)
pos_unhelpful: (14786, 15)
pos_helpful: (40648, 15)


In [122]:
# How many examples of each group in test set?
print('neg_helpful: {}'.format(test[(test.overall == 1) & (test.most_helpful == 1) & (test.helpful_votes != 0)].shape))
print('neg_unhelpful: {}'.format(test[(test.overall == 1) & (test.most_helpful == 0) & (test.helpful_votes == 0)].shape))
print('pos_unhelpful: {}'.format(test[(test.overall == 5) & (test.most_helpful == 0) & (test.helpful_votes == 0)].shape))
print('pos_helpful: {}'.format(test[(test.overall == 5) & (test.most_helpful == 1) & (test.helpful_votes != 0)].shape))

neg_helpful: (10500, 15)
neg_unhelpful: (1533, 15)
pos_unhelpful: (14569, 15)
pos_helpful: (40090, 15)


In [123]:
# Sample Training Set
# Below I sample equal numbers of pos/neg reviews and equal numbers of top-quartile-HVAR vs. 0-helpful-vote reviews

num_per_condition = 13000
repl=False

train_neg_helpful = train[(train.overall == 1) & (train.most_helpful == 1) & (train.helpful_votes != 0)].sample(num_per_condition, replace=repl)
train_neg_unhelpful = train[(train.overall == 1) & (train.most_helpful == 0) & (train.helpful_votes == 0)].sample(num_per_condition, replace=repl)
train_pos_unhelpful = train[(train.overall == 5) & (train.most_helpful == 0) & (train.helpful_votes == 0)].sample(num_per_condition, replace=repl)
train_pos_helpful = train[(train.overall == 5) & (train.most_helpful == 1) & (train.helpful_votes != 0)].sample(num_per_condition, replace=repl)
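
The sampling cells call `.sample(...)` without a `random_state`, so each run draws different rows. The following is a minimal sketch of the same four-group sampling with a fixed seed. The helper name `sample_balanced_groups` and the `seed` parameter are hypothetical; the group conditions are copied from the cells in this notebook.

```python
import pandas as pd


def sample_balanced_groups(df, n_per_group, seed=42):
    """Draw n_per_group rows, without replacement, from each of the four edge groups."""
    helpful = (df.most_helpful == 1) & (df.helpful_votes != 0)
    unhelpful = (df.most_helpful == 0) & (df.helpful_votes == 0)
    masks = {
        "neg_helpful": (df.overall == 1) & helpful,
        "neg_unhelpful": (df.overall == 1) & unhelpful,
        "pos_unhelpful": (df.overall == 5) & unhelpful,
        "pos_helpful": (df.overall == 5) & helpful,
    }
    # A fixed random_state makes the draw repeatable across notebook runs
    return {name: df[mask].sample(n_per_group, replace=False, random_state=seed)
            for name, mask in masks.items()}


# Toy data: two rows per group plus one 3-star row that matches no group
toy = pd.DataFrame({
    "overall":       [1, 1, 1, 1, 5, 5, 5, 5, 3],
    "most_helpful":  [1, 1, 0, 0, 0, 0, 1, 1, 1],
    "helpful_votes": [4, 2, 0, 0, 0, 0, 7, 9, 3],
})
groups = sample_balanced_groups(toy, n_per_group=1)
print({name: len(g) for name, g in groups.items()})
```

Calling the helper once per split (train, dev, test) would replace the three near-identical sampling cells.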

In [124]:
# Sample Dev Set
# 1500 = no oversampling

num_per_condition = 1500
repl=False

dev_neg_helpful = dev[(dev.overall == 1) & (dev.most_helpful == 1) & (dev.helpful_votes != 0)].sample(num_per_condition, replace=repl)
dev_neg_unhelpful = dev[(dev.overall == 1) & (dev.most_helpful == 0) & (dev.helpful_votes == 0)].sample(num_per_condition, replace=repl)
dev_pos_unhelpful = dev[(dev.overall == 5) & (dev.most_helpful == 0) & (dev.helpful_votes == 0)].sample(num_per_condition, replace=repl)
dev_pos_helpful = dev[(dev.overall == 5) & (dev.most_helpful == 1) & (dev.helpful_votes != 0)].sample(num_per_condition, replace=repl)

In [125]:
# Sample Test Set
# 1500 = no oversampling

num_per_condition = 1500
repl=False

test_neg_helpful = test[(test.overall == 1) & (test.most_helpful == 1) & (test.helpful_votes != 0)].sample(num_per_condition, replace=repl)
test_neg_unhelpful = test[(test.overall == 1) & (test.most_helpful == 0) & (test.helpful_votes == 0)].sample(num_per_condition, replace=repl)
test_pos_unhelpful = test[(test.overall == 5) & (test.most_helpful == 0) & (test.helpful_votes == 0)].sample(num_per_condition, replace=repl)
test_pos_helpful = test[(test.overall == 5) & (test.most_helpful == 1) & (test.helpful_votes != 0)].sample(num_per_condition, replace=repl)

In [126]:
# Prepend training set

train_neg_helpful['prepReviewText'] = train_neg_helpful.apply(lambda x: 'WORST ' + x.reviewText,axis = 1)
train_neg_unhelpful['prepReviewText'] = train_neg_unhelpful.apply(lambda x: 'WORST ' + x.reviewText,axis = 1)
train_pos_unhelpful['prepReviewText'] = train_pos_unhelpful.apply(lambda x: 'BEST ' + x.reviewText,axis = 1)
train_pos_helpful['prepReviewText'] = train_pos_helpful.apply(lambda x: 'BEST ' + x.reviewText,axis = 1)

In [127]:
# Prepend dev set

dev_neg_helpful['prepReviewText'] = dev_neg_helpful.apply(lambda x: 'WORST ' + x.reviewText,axis = 1)
dev_neg_unhelpful['prepReviewText'] = dev_neg_unhelpful.apply(lambda x: 'WORST ' + x.reviewText,axis = 1)
dev_pos_unhelpful['prepReviewText'] = dev_pos_unhelpful.apply(lambda x: 'BEST ' + x.reviewText,axis = 1)
dev_pos_helpful['prepReviewText'] = dev_pos_helpful.apply(lambda x: 'BEST ' + x.reviewText,axis = 1)

In [128]:
# Prepend test set

test_neg_helpful['prepReviewText'] = test_neg_helpful.apply(lambda x: 'WORST ' + x.reviewText,axis = 1)
test_neg_unhelpful['prepReviewText'] = test_neg_unhelpful.apply(lambda x: 'WORST ' + x.reviewText,axis = 1)
test_pos_unhelpful['prepReviewText'] = test_pos_unhelpful.apply(lambda x: 'BEST ' + x.reviewText,axis = 1)
test_pos_helpful['prepReviewText'] = test_pos_helpful.apply(lambda x: 'BEST ' + x.reviewText,axis = 1)
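
The twelve row-wise `apply` calls above can be expressed as one vectorized operation per DataFrame. A minimal sketch, assuming only 1-star and 5-star rows are present (as in the sampled groups); the helper name `prepend_rating_token` and the constant `PREFIX_BY_RATING` are hypothetical.

```python
import pandas as pd

# Marker token for each star rating used in this notebook
PREFIX_BY_RATING = {1: "WORST ", 5: "BEST "}


def prepend_rating_token(df, text_col="reviewText", rating_col="overall"):
    """Return the review text prefixed with the marker token for its star rating."""
    # Ratings missing from PREFIX_BY_RATING map to NaN, so the result is NaN for those rows
    return df[rating_col].map(PREFIX_BY_RATING) + df[text_col]


toy = pd.DataFrame({
    "overall": [1, 5],
    "reviewText": ["Fell apart quickly.", "Loved every page."],
})
toy["prepReviewText"] = prepend_rating_token(toy)
print(toy["prepReviewText"].tolist())
```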

In [129]:
# Assemble training set

stratdf_train = pd.concat([train_neg_helpful, train_neg_unhelpful,
                           train_pos_unhelpful, train_pos_helpful],
                          ignore_index=True)
print(f"Our training dataset is now {stratdf_train.shape[0]} reviews.")

Our training dataset is now 52000 reviews.


In [130]:
# Assemble dev set

stratdf_dev = pd.concat([dev_neg_helpful, dev_neg_unhelpful,
                         dev_pos_unhelpful, dev_pos_helpful],
                        ignore_index=True)
print(f"Our development dataset is now {stratdf_dev.shape[0]} reviews.")

Our development dataset is now 6000 reviews.


In [131]:
# Assemble test set

stratdf_test = pd.concat([test_neg_helpful, test_neg_unhelpful,
                          test_pos_unhelpful, test_pos_helpful],
                         ignore_index=True)
print(f"Our test dataset is now {stratdf_test.shape[0]} reviews.")

Our test dataset is now 6000 reviews.


In [17]:
stratdf_test.shape[0]/(stratdf_test.shape[0]+stratdf_train.shape[0])

0.10344827586206896

In [None]:
# Shuffle each split and keep only the columns used for modeling
df_train_prep = shuffle(stratdf_train,random_state=42)[['prepReviewText','overall','most_helpful']]
df_dev_prep = shuffle(stratdf_dev,random_state=42)[['prepReviewText','overall','most_helpful']]
df_test_prep = shuffle(stratdf_test,random_state=42)[['prepReviewText','overall','most_helpful']]

In [571]:
df_train_prep.to_csv('train_prep.csv')
df_dev_prep.to_csv('dev_prep.csv')
df_test_prep.to_csv('test_prep.csv')

In [None]:
X_train_prep = df_train_prep['prepReviewText']
X_dev_prep = df_dev_prep['prepReviewText']
X_test_prep = df_test_prep['prepReviewText']

y_train_prep = df_train_prep['most_helpful']
y_dev_prep = df_dev_prep['most_helpful']
y_test_prep = df_test_prep['most_helpful']

In [357]:
# Transform text examples

vectorizer = TfidfVectorizer(lowercase=True,
                             #tokenizer=Tokenizer,
                             analyzer='word',
                             stop_words=None,
                             token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'|\*|\-|\;|\:|\,|\.",
                             ngram_range=(1,2),
                             max_features=None)

X_train_prep_vec = vectorizer.fit_transform(X_train_prep)
X_dev_prep_vec = vectorizer.transform(X_dev_prep)

print("Token Count: {}".format(len(vectorizer.vocabulary_)))

Token Count: 1786688
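
To see what the custom `token_pattern` keeps, the pattern can be applied directly with Python's `re` module. This is a minimal sketch that assumes the vectorizer lowercases the text before tokenizing, as `lowercase=True` requests. The helper `tokenize` is hypothetical and covers only the unigram step, not the bigrams added by `ngram_range=(1,2)`.

```python
import re

# Same pattern as passed to TfidfVectorizer(token_pattern=...) in the cell above
TOKEN_PATTERN = r"(?u)\b\w\w+\b|!|\?|\"|\'|\*|\-|\;|\:|\,|\."


def tokenize(text):
    """Lowercase the text, then return every non-overlapping match of TOKEN_PATTERN."""
    return re.findall(TOKEN_PATTERN, text.lower())


tokens = tokenize("WORST book, I want a refund!")
print(tokens)
```

Words of two or more characters and the listed punctuation marks become tokens, while single-character words such as "I" and "a" are dropped.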


In [363]:
# more than 2 classes...multinomial

# binary problem

# The SAGA solver is a variant of SAG that also supports the non-smooth penalty='l1' option (i.e. L1 regularization).
# It is therefore the solver of choice for sparse multinomial logistic regression, and it is also suitable for very large datasets.

# C = inverse of regularization strength; pos float; smaller = stronger regularization

clf = LogisticRegression(penalty='l2',
                         C=1.0,
                         random_state=42,
                         solver='saga',
                         multi_class='ovr',
                         max_iter=100,
                         n_jobs=-1,
                         verbose=True)



clf.fit(X_train_prep_vec, y_train_prep)
#clf.fit(X_train_svd, y_train_prep)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.


convergence after 20 epochs took 8 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    7.9s finished


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=-1,
          penalty='l2', random_state=42, solver='saga', tol=0.0001,
          verbose=True, warm_start=False)

In [364]:
#dev_predicted_labels = clf.predict(X_dev_svd)
dev_predicted_labels = clf.predict(X_dev_prep_vec)

In [365]:
# Evaluate on the dev set with weighted F1 and accuracy
f1_weighted = metrics.f1_score(y_dev_prep, dev_predicted_labels, average='weighted')
accuracy = metrics.accuracy_score(y_dev_prep, dev_predicted_labels)

print('Logistic Regression Classifier')
print('-------------\n')
print('Accuracy on dev set: {:0.3f}'.format(accuracy))
print('f_1 score (Weighted): {:0.3f}'.format(f1_weighted))

Logistic Regression Classifier
-------------

Accuracy on dev set: 0.730
f_1 score (Weighted): 0.730


In [382]:
# Evaluate on test set

X_test_prep_vec = vectorizer.transform(X_test_prep)
test_predicted_labels = clf.predict(X_test_prep_vec)

# Evaluate on the test set with weighted F1 and accuracy
f1_weighted = metrics.f1_score(y_test_prep, test_predicted_labels, average='weighted')
accuracy = metrics.accuracy_score(y_test_prep, test_predicted_labels)

print('Logistic Regression Classifier - Test Set')
print('-------------\n')
print('Accuracy on test set: {:0.3f}'.format(accuracy))
print('f_1 score (Weighted): {:0.3f}'.format(f1_weighted))

Logistic Regression Classifier - Test Set
-------------

Accuracy on test set: 0.719
f_1 score (Weighted): 0.719


## Second Model

In [351]:
clf_2 = LogisticRegression(penalty='l2',
                           C=1.0,
                           random_state=42,
                           solver='saga',
                           multi_class='ovr',
                           max_iter=100,
                           n_jobs=-1,
                           verbose=True)

param_grid = {'C':list(np.linspace(0.1,1.0,10))}            
clf_2 = GridSearchCV(clf_2, param_grid, cv=3, scoring='accuracy', n_jobs=-1, verbose=True)
clf_2.fit(X_train_prep_vec, y_train_prep)
best_c = round(clf_2.best_params_['C'],2)
print('Best value for C: {}'.format(best_c))

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.1min finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.


convergence after 20 epochs took 4 seconds
Best value for C: 1.0


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    4.0s finished
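
The search above selected `C = 1.0`, which is the upper edge of the grid, so the best value may lie outside the range that was searched. A minimal sketch of a wider, log-spaced grid follows. The bounds (0.01 to 100) are an assumption chosen for illustration, not values tested in this notebook.

```python
import numpy as np

# Original grid from the cell above: 0.1, 0.2, ..., 1.0
linear_grid = np.linspace(0.1, 1.0, 10)

# Hypothetical wider grid: 0.01 to 100 in powers of ten
log_grid = np.logspace(-2, 2, 5)

# Same dictionary shape that GridSearchCV expects for param_grid
param_grid = {"C": log_grid.tolist()}
print(param_grid)
```

If the selected value still falls on an edge of the wider grid, the range can be extended again in the same direction.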


In [303]:
# Fit a model with the best C value
clf_2 = LogisticRegression(penalty='l2',
                           C=best_c,
                           random_state=42,
                           solver='saga',
                           multi_class='ovr',
                           max_iter=100,
                           n_jobs=-1,
                           verbose=True)

clf_2.fit(X_train_prep_vec, y_train_prep)
dev_predicted_labels_2 = clf_2.predict(X_dev_prep_vec)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.


convergence after 20 epochs took 4 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    3.3s finished


In [304]:
# Evaluate on the dev set with weighted F1 and accuracy
f1_weighted = metrics.f1_score(y_dev_prep, dev_predicted_labels_2, average='weighted')
accuracy = metrics.accuracy_score(y_dev_prep, dev_predicted_labels_2)

print('Logistic Regression Classifier')
print('-------------\n')
print('Accuracy on dev set: {:0.3f}'.format(accuracy))
print('f_1 score (Weighted): {:0.3f}'.format(f1_weighted))

Logistic Regression Classifier
-------------

Accuracy on dev set: 0.729
f_1 score (Weighted): 0.729


In [143]:
# Unpack the 2x2 confusion matrix (rows = true labels, columns = predicted labels)
tn, fp, fn, tp = metrics.confusion_matrix(y_dev_prep, dev_predicted_labels_2).ravel()
print("True Negative: {}".format(tn))
print("True Positive: {}".format(tp))
print("False Negative: {}".format(fn))
print("False Positive: {}".format(fp))

True Negative: 2208
True Positive: 2150
False Negative: 850
False Positive: 792
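
The four counts above are enough to derive accuracy, precision, recall, and F1 for the positive class by hand. This is a minimal sketch; the helper name `binary_metrics` is hypothetical. Two caveats apply: the F1 computed here is for the positive class only, not the weighted F1 reported earlier, and the confusion-matrix cell has a different execution number than the evaluation cell, so the counts may come from a different run.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive standard binary classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts copied from the confusion-matrix output above
scores = binary_metrics(tn=2208, fp=792, fn=850, tp=2150)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```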


## Make it multiclass

In [494]:
# Add 'class' column to training data
# - Initially all entries are 'undetermined'

train['class'] = 'undetermined'
dev['class'] = 'undetermined'
test['class'] = 'undetermined'

In [495]:
# Go through and change the label according to conditions

train.loc[(train.overall == 1) & (train.most_helpful == 1) & (train.helpful_votes != 0), 'class'] = 'neg_helpful'
train.loc[(train.overall == 1) & (train.most_helpful == 0) & (train.helpful_votes == 0), 'class'] = 'neg_unhelpful'
train.loc[(train.overall == 5) & (train.most_helpful == 0) & (train.helpful_votes == 0), 'class'] = 'pos_unhelpful'
train.loc[(train.overall == 5) & (train.most_helpful == 1) & (train.helpful_votes != 0), 'class'] = 'pos_helpful'

dev.loc[(dev.overall == 1) & (dev.most_helpful == 1) & (dev.helpful_votes != 0), 'class'] = 'neg_helpful'
dev.loc[(dev.overall == 1) & (dev.most_helpful == 0) & (dev.helpful_votes == 0), 'class'] = 'neg_unhelpful'
dev.loc[(dev.overall == 5) & (dev.most_helpful == 0) & (dev.helpful_votes == 0), 'class'] = 'pos_unhelpful'
dev.loc[(dev.overall == 5) & (dev.most_helpful == 1) & (dev.helpful_votes != 0), 'class'] = 'pos_helpful'

test.loc[(test.overall == 1) & (test.most_helpful == 1) & (test.helpful_votes != 0), 'class'] = 'neg_helpful'
test.loc[(test.overall == 1) & (test.most_helpful == 0) & (test.helpful_votes == 0), 'class'] = 'neg_unhelpful'
test.loc[(test.overall == 5) & (test.most_helpful == 0) & (test.helpful_votes == 0), 'class'] = 'pos_unhelpful'
test.loc[(test.overall == 5) & (test.most_helpful == 1) & (test.helpful_votes != 0), 'class'] = 'pos_helpful'
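
The same labeling rules are repeated for `train`, `dev`, and `test`. A minimal sketch of one reusable function built on `np.select` follows; the helper name `assign_class` is hypothetical, and the conditions are copied from the cell above.

```python
import numpy as np
import pandas as pd


def assign_class(df):
    """Return a Series of group labels; rows matching no condition are 'undetermined'."""
    helpful = (df.most_helpful == 1) & (df.helpful_votes != 0)
    unhelpful = (df.most_helpful == 0) & (df.helpful_votes == 0)
    conditions = [
        (df.overall == 1) & helpful,
        (df.overall == 1) & unhelpful,
        (df.overall == 5) & unhelpful,
        (df.overall == 5) & helpful,
    ]
    labels = ["neg_helpful", "neg_unhelpful", "pos_unhelpful", "pos_helpful"]
    # The conditions are mutually exclusive, so their order does not affect the result
    return pd.Series(np.select(conditions, labels, default="undetermined"), index=df.index)


# Toy data: one row per group plus two rows that match no group
toy = pd.DataFrame({
    "overall":       [1, 1, 5, 5, 3, 1],
    "most_helpful":  [1, 0, 0, 1, 1, 0],
    "helpful_votes": [3, 0, 0, 5, 2, 4],
})
toy["class"] = assign_class(toy)
print(toy["class"].tolist())
```

Applied to each split, this would be `train['class'] = assign_class(train)` and likewise for `dev` and `test`.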


In [496]:
train['class'].value_counts()


undetermined     2127125
pos_helpful       316234
pos_unhelpful     121879
neg_helpful        81152
neg_unhelpful      13033
Name: class, dtype: int64

In [497]:
dev['class'].value_counts()

undetermined     247081
pos_helpful       40648
pos_unhelpful     14786
neg_helpful       10061
neg_unhelpful      1504
Name: class, dtype: int64

In [498]:
test['class'].value_counts()

undetermined     246910
pos_helpful       40090
pos_unhelpful     14569
neg_helpful       10500
neg_unhelpful      1533
Name: class, dtype: int64

In [499]:
# two approaches

# 1. 5-class classification (65,000 total reviews)
#    Sample 13,000 reviews from each of the new categories



# 2. 3-class classification (78,000 total reviews)
#    Sample:
#       pos_helpful + pos_unhelpful = 13,000 + 13,000 = 26,000 helpful
#       neg_helpful + neg_unhelpful = 13,000 + 13,000 = 26,000 unhelpful
#       undetermined                = 26,000 undetermined

num_per_condition = 13000

# mc = multi-class

# Sample multi-class examples from training set
mc_train_pos_helpful = train[train['class'] == 'pos_helpful'].sample(13000, replace=False)
mc_train_pos_unhelpful = train[train['class'] == 'pos_unhelpful'].sample(13000, replace=False)
mc_train_neg_helpful = train[train['class'] == 'neg_helpful'].sample(13000, replace=False)
mc_train_neg_unhelpful = train[train['class'] == 'neg_unhelpful'].sample(13000, replace=False)

mc5_train_undetermined = train[train['class'] == 'undetermined'].sample(13000, replace=False)
mc3_train_undetermined = train[train['class'] == 'undetermined'].sample(26000, replace=False)

# Sample multi-class examples from development set
mc_dev_pos_helpful = dev[dev['class'] == 'pos_helpful'].sample(1500, replace=False)
mc_dev_pos_unhelpful = dev[dev['class'] == 'pos_unhelpful'].sample(1500, replace=False)
mc_dev_neg_helpful = dev[dev['class'] == 'neg_helpful'].sample(1500, replace=False)
mc_dev_neg_unhelpful = dev[dev['class'] == 'neg_unhelpful'].sample(1500, replace=False)

mc5_dev_undetermined = dev[dev['class'] == 'undetermined'].sample(1500, replace=False)
mc3_dev_undetermined = dev[dev['class'] == 'undetermined'].sample(3000, replace=False)



# nothing is prepended

In [500]:
mc_train_pos_helpful.head()

Unnamed: 0,asin,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime,helpful_votes,review_age_days,annual_HVAR,book_num_reviews,std_HVAR,top_quartile_HVAR,most_helpful,class
963554,145161747X,5,In 70 C.E. Jerusalem fell to the Romans and th...,2011-09-20,A3GK1O5S6188AJ,Amy Willingham,4 Stories Separate Yet Connected,1316476800,11,1037,3.871745,84,26.896178,2.159243,1,pos_helpful
44179,0060987103,5,I've owned this book for almost 3yrs and now j...,2003-11-16,AX3WJZLFQQRCT,"R. M. Ettinger ""rme1963""",Wonderfully Done!,1068940800,50,3902,4.677089,471,0.92194,0.707367,1,pos_helpful
376498,0385720564,5,Though it lacks the wide opinion that William ...,2006-02-10,AFVQZQ8PW0L,Harriet Klausner,Terrific biography,1139529600,15,3085,1.774716,15,0.473053,0.83727,1,pos_helpful
327182,0373860056,5,I have been a fan of Adrianne Byrd's writing f...,2007-01-28,A1L0PD6BRHDD31,"Reader Woman ""The Book Diva""",If only collisions where so sexy and so much f...,1169942400,6,2733,0.801317,5,0.288584,0.286162,1,pos_helpful
556652,0615547753,5,I have been so lost trying to figure out how t...,2012-10-15,A3FKSL93FIX0BZ,K. Bosworth,"Excellent, well written book on gastroparesis",1350259200,8,646,4.520124,10,1.467419,1.640382,1,pos_helpful


In [501]:
# Assemble 5-class training set

mc5_train = pd.concat([mc_train_neg_helpful, mc_train_neg_unhelpful,
                       mc_train_pos_unhelpful, mc_train_pos_helpful,
                       mc5_train_undetermined],
                      ignore_index=True)
print(f"Our 5-class training dataset is now {mc5_train.shape[0]} reviews.")

Our 5-class training dataset is now 65000 reviews.


In [464]:
mc5_train.head()

Unnamed: 0,asin,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime,helpful_votes,review_age_days,annual_HVAR,book_num_reviews,std_HVAR,top_quartile_HVAR,most_helpful,class
0,0451239113,1,Parker's writings have dwindled to the point o...,2012-06-09,AN9V3LI0G85K5,Ron cz,Parker's Lost It!!!!!!!!!!,1339200000,6,774,2.829457,16,1.636546,1.781095,1,neg_helpful
1,0026045702,1,The original version of Joy is a strong conten...,2005-06-16,A2K33VWYQC9C2W,jerry i h,Worthless Garbage,1118880000,23,3324,2.525572,102,2.269745,0.941812,1,neg_helpful
2,0380818612,1,I struggled to identify with her main characte...,2010-02-02,A3NY49QVQJQK7P,"Kevin Watt ""I know that poetry is indispensab...",Awful. Not up to her usual awesomeness. Slow...,1265068800,4,1632,0.894608,57,1.681225,0.731463,1,neg_helpful
3,B0057PFWDI,1,I found myself rolling my eyes and laughing wh...,2013-03-19,A1GS6FEGHFNIG,Deana,Hmmm,1363651200,3,491,2.230143,35,0.777522,0.93736,1,neg_helpful
4,0613171373,1,"He was severely abused, (insert graphic detail...",2009-08-18,A3HQD8NJZLA1MO,Amazon Customer,This is a perfect candidate for a book burning,1250553600,6,1800,1.216667,171,1.524067,0.4802,1,neg_helpful


In [502]:
mc5_train['class'].value_counts()

undetermined     13000
neg_helpful      13000
neg_unhelpful    13000
pos_unhelpful    13000
pos_helpful      13000
Name: class, dtype: int64

In [503]:
# Assemble 3-class training set

mc3_train = pd.concat([mc_train_neg_helpful, mc_train_neg_unhelpful,
                       mc_train_pos_unhelpful, mc_train_pos_helpful,
                       mc3_train_undetermined],
                      ignore_index=True)
print(f"Our 3-class training dataset is now {mc3_train.shape[0]} reviews.")

Our 3-class training dataset is now 78000 reviews.


In [467]:
mc3_train.head()

Unnamed: 0,asin,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime,helpful_votes,review_age_days,annual_HVAR,book_num_reviews,std_HVAR,top_quartile_HVAR,most_helpful,class
0,0451239113,1,Parker's writings have dwindled to the point o...,2012-06-09,AN9V3LI0G85K5,Ron cz,Parker's Lost It!!!!!!!!!!,1339200000,6,774,2.829457,16,1.636546,1.781095,1,neg_helpful
1,0026045702,1,The original version of Joy is a strong conten...,2005-06-16,A2K33VWYQC9C2W,jerry i h,Worthless Garbage,1118880000,23,3324,2.525572,102,2.269745,0.941812,1,neg_helpful
2,0380818612,1,I struggled to identify with her main characte...,2010-02-02,A3NY49QVQJQK7P,"Kevin Watt ""I know that poetry is indispensab...",Awful. Not up to her usual awesomeness. Slow...,1265068800,4,1632,0.894608,57,1.681225,0.731463,1,neg_helpful
3,B0057PFWDI,1,I found myself rolling my eyes and laughing wh...,2013-03-19,A1GS6FEGHFNIG,Deana,Hmmm,1363651200,3,491,2.230143,35,0.777522,0.93736,1,neg_helpful
4,0613171373,1,"He was severely abused, (insert graphic detail...",2009-08-18,A3HQD8NJZLA1MO,Amazon Customer,This is a perfect candidate for a book burning,1250553600,6,1800,1.216667,171,1.524067,0.4802,1,neg_helpful


In [504]:
mc3_train['class'].value_counts()

undetermined     26000
neg_helpful      13000
neg_unhelpful    13000
pos_unhelpful    13000
pos_helpful      13000
Name: class, dtype: int64

In [505]:
# Reduce to 3 classes instead of 5
mc3_train.loc[(mc3_train['class'] == 'neg_helpful') | (mc3_train['class'] == 'pos_helpful'), 'class'] = 'helpful'
mc3_train.loc[(mc3_train['class'] == 'neg_unhelpful') | (mc3_train['class'] == 'pos_unhelpful'), 'class'] = 'unhelpful'
mc3_train['class'].value_counts()

unhelpful       26000
undetermined    26000
helpful         26000
Name: class, dtype: int64

In [472]:
mc3_train.head()

Unnamed: 0,asin,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime,helpful_votes,review_age_days,annual_HVAR,book_num_reviews,std_HVAR,top_quartile_HVAR,most_helpful,class
0,0451239113,1,Parker's writings have dwindled to the point o...,2012-06-09,AN9V3LI0G85K5,Ron cz,Parker's Lost It!!!!!!!!!!,1339200000,6,774,2.829457,16,1.636546,1.781095,1,helpful
1,0026045702,1,The original version of Joy is a strong conten...,2005-06-16,A2K33VWYQC9C2W,jerry i h,Worthless Garbage,1118880000,23,3324,2.525572,102,2.269745,0.941812,1,helpful
2,0380818612,1,I struggled to identify with her main characte...,2010-02-02,A3NY49QVQJQK7P,"Kevin Watt ""I know that poetry is indispensab...",Awful. Not up to her usual awesomeness. Slow...,1265068800,4,1632,0.894608,57,1.681225,0.731463,1,helpful
3,B0057PFWDI,1,I found myself rolling my eyes and laughing wh...,2013-03-19,A1GS6FEGHFNIG,Deana,Hmmm,1363651200,3,491,2.230143,35,0.777522,0.93736,1,helpful
4,0613171373,1,"He was severely abused, (insert graphic detail...",2009-08-18,A3HQD8NJZLA1MO,Amazon Customer,This is a perfect candidate for a book burning,1250553600,6,1800,1.216667,171,1.524067,0.4802,1,helpful


In [506]:
# Assemble 5-class dev set

mc5_dev = pd.concat([mc_dev_neg_helpful, mc_dev_neg_unhelpful,
                     mc_dev_pos_unhelpful, mc_dev_pos_helpful,
                     mc5_dev_undetermined],
                    ignore_index=True)
print(f"Our 5-class dev dataset is now {mc5_dev.shape[0]} reviews.")

Our 5-class dev dataset is now 7500 reviews.


In [507]:
# Assemble 3-class dev set

mc3_dev = pd.concat([mc_dev_neg_helpful, mc_dev_neg_unhelpful,
                     mc_dev_pos_unhelpful, mc_dev_pos_helpful,
                     mc3_dev_undetermined],
                    ignore_index=True)
print(f"Our 3-class dev dataset is now {mc3_dev.shape[0]} reviews.")

Our 3-class dev dataset is now 9000 reviews.


In [508]:
# Reduce to 3 classes instead of 5
mc3_dev.loc[(mc3_dev['class'] == 'neg_helpful') | (mc3_dev['class'] == 'pos_helpful'), 'class'] = 'helpful'
mc3_dev.loc[(mc3_dev['class'] == 'neg_unhelpful') | (mc3_dev['class'] == 'pos_unhelpful'), 'class'] = 'unhelpful'
mc3_dev['class'].value_counts()

helpful         3000
undetermined    3000
unhelpful       3000
Name: class, dtype: int64

In [476]:
mc3_dev.head()

Unnamed: 0,asin,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime,helpful_votes,review_age_days,annual_HVAR,book_num_reviews,std_HVAR,top_quartile_HVAR,most_helpful,class
0,0805096663,1,O'Reilly is carried away with his own view on ...,2012-10-05,AFY1YIAQD5Y0J,aPerson,Fictional History,1349395200,26,656,14.466463,1222.0,44.447703,1.13354,1,helpful
1,0983392900,1,"Just because the reading audience is younger, ...",2011-12-27,A309Z6WBC24ZVA,Dona W. Gould,"YA usually expects polished, professional writing",1324944000,7,939,2.72098,30.0,9.809796,1.88127,1,helpful
2,078693560X,1,I purchased this book at the same time that I ...,2003-01-18,A223LGYL4WGN5K,Carrie Johnson,Think like a quarterback and Pass!,1042848000,6,4204,0.520932,10.0,0.279628,0.438973,1,helpful
3,1595910743,1,"A couple of W.'s bidness buds, ""Bush Pioneers,...",2013-03-30,A2YP7JPI48MRG8,"S. J. Snyder ""De gustibus non disputandum""",Good effing doorknob,1364601600,5,480,3.802083,5.0,3.174561,3.328267,1,helpful
4,0891093117,1,Parenting with love and Logic was extremely di...,2004-02-24,A3TCHJ7YERQ938,Amy A Adams,Not for my family,1077580800,90,3802,8.640189,12.0,8.467418,1.236244,1,helpful


In [509]:
from sklearn import preprocessing

# Fit each encoder on the training labels only, then reuse that mapping for dev
le3 = preprocessing.LabelEncoder()
mc3_train['label'] = le3.fit_transform(mc3_train['class'])
mc3_dev['label'] = le3.transform(mc3_dev['class'])

print(le3.inverse_transform([0,1,2]))

le5 = preprocessing.LabelEncoder()
mc5_train['label'] = le5.fit_transform(mc5_train['class'])
mc5_dev['label'] = le5.transform(mc5_dev['class'])

print(le5.inverse_transform([0,1,2,3,4]))

['helpful' 'undetermined' 'unhelpful']
['neg_helpful' 'neg_unhelpful' 'pos_helpful' 'pos_unhelpful'
 'undetermined']
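
The integer labels printed above follow the alphabetical order of the class names. A minimal plain-Python sketch of that mapping, assuming (as the printed output suggests) that class names are sorted before being numbered. `encode_labels` is a hypothetical helper for illustration, not part of scikit-learn:

```python
def encode_labels(class_names):
    """Map each distinct class name to an integer by sorted order (hypothetical helper)."""
    classes = sorted(set(class_names))
    return {name: idx for idx, name in enumerate(classes)}

# Class names taken from the notebook; the order they are passed in does not matter.
three_class = encode_labels(["unhelpful", "helpful", "undetermined", "helpful"])
five_class = encode_labels(["pos_helpful", "neg_unhelpful", "undetermined",
                            "neg_helpful", "pos_unhelpful"])

print(three_class)
print(five_class)
```

Under this ordering, 'undetermined' is label 1 in the 3-class data and label 4 in the 5-class data, which is what the later 2-class and 4-class cells rely on when they filter by label.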


In [486]:
mc5_train.head()

Unnamed: 0,asin,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime,helpful_votes,review_age_days,annual_HVAR,book_num_reviews,std_HVAR,top_quartile_HVAR,most_helpful,class,label
0,0451239113,1,Parker's writings have dwindled to the point o...,2012-06-09,AN9V3LI0G85K5,Ron cz,Parker's Lost It!!!!!!!!!!,1339200000,6,774,2.829457,16,1.636546,1.781095,1,neg_helpful,0
1,0026045702,1,The original version of Joy is a strong conten...,2005-06-16,A2K33VWYQC9C2W,jerry i h,Worthless Garbage,1118880000,23,3324,2.525572,102,2.269745,0.941812,1,neg_helpful,0
2,0380818612,1,I struggled to identify with her main characte...,2010-02-02,A3NY49QVQJQK7P,"Kevin Watt ""I know that poetry is indispensab...",Awful. Not up to her usual awesomeness. Slow...,1265068800,4,1632,0.894608,57,1.681225,0.731463,1,neg_helpful,0
3,B0057PFWDI,1,I found myself rolling my eyes and laughing wh...,2013-03-19,A1GS6FEGHFNIG,Deana,Hmmm,1363651200,3,491,2.230143,35,0.777522,0.93736,1,neg_helpful,0
4,0613171373,1,"He was severely abused, (insert graphic detail...",2009-08-18,A3HQD8NJZLA1MO,Amazon Customer,This is a perfect candidate for a book burning,1250553600,6,1800,1.216667,171,1.524067,0.4802,1,neg_helpful,0


In [493]:
mc5_train.columns

Index(['reviewText', 'overall', 'most_helpful'], dtype='object')

## 5-class model

In [510]:
mc5_train = shuffle(mc5_train,random_state=42)[['reviewText','label']]
mc5_dev = shuffle(mc5_dev,random_state=42)[['reviewText','label']]


In [517]:
mc5_train['label'].value_counts()

4    13000
3    13000
2    13000
1    13000
0    13000
Name: label, dtype: int64

In [518]:
mc5_train.head()

Unnamed: 0,reviewText,label
28450,If texting has become the modern way of commun...,3
50670,Mr. Taubes has done a great service by pulling...,2
15811,"Um yeah, I still wonder why I bought this book...",1
14668,This book is not worth the 5 to 10 minutes it ...,1
57899,This book is fascinating. I was recommended it...,4


In [511]:
X_train_mc5 = mc5_train['reviewText']
X_dev_mc5 = mc5_dev['reviewText']


y_train_mc5 = mc5_train['label']
y_dev_mc5 = mc5_dev['label']


In [512]:
# Transform text examples

vectorizer = TfidfVectorizer(lowercase=True,
                             #tokenizer=Tokenizer,
                             analyzer='word',
                             stop_words=None,
                             token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'|\*|\-|\;|\:|\,|\.",
                             ngram_range=(1,2),
                             max_features=None)

X_train_mc5_vec = vectorizer.fit_transform(X_train_mc5)
X_dev_mc5_vec = vectorizer.transform(X_dev_mc5)

# vocabulary_ is available in every scikit-learn version; get_feature_names() was later removed
print("Token Count: {}".format(len(vectorizer.vocabulary_)))

Token Count: 2234573
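
The custom `token_pattern` keeps words of two or more characters and also keeps selected punctuation marks as tokens of their own. A rough standard-library sketch of what that means for one review, assuming the vectorizer lowercases the text, applies the pattern as a find-all, and joins adjacent tokens with a space to form bigrams. `tokenize` and `with_bigrams` are illustrative names, not scikit-learn functions:

```python
import re

# Same pattern string as in the vectorizer cell above.
TOKEN_PATTERN = r"(?u)\b\w\w+\b|!|\?|\"|\'|\*|\-|\;|\:|\,|\."

def tokenize(text):
    """Lowercase the text, then return every non-overlapping match of the pattern."""
    return re.findall(TOKEN_PATTERN, text.lower())

def with_bigrams(tokens):
    """Unigrams followed by adjacent-pair bigrams, mimicking ngram_range=(1,2)."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

tokens = tokenize("A BAD book. Why?!")
print(tokens)
print(with_bigrams(tokens))
```

Single-character words such as "a" are dropped, while "?" and "!" survive as features. Combined with bigrams and no `max_features` cap, this is one reason the vocabulary runs into the millions.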


In [514]:
# Five classes, so fit a multinomial (softmax) model rather than a binary one.

# The SAGA solver is a variant of SAG that also supports the non-smooth penalty='l1' option (i.e. L1 regularization).
# It is therefore the solver of choice for sparse multinomial logistic regression, and it is also suitable for very large datasets.

# C = inverse of regularization strength; positive float; smaller = stronger regularization

clf = LogisticRegression(penalty='l2',
                         C=1.0,
                         random_state=42,
                         solver='saga',
                         multi_class='multinomial',
                         max_iter=100,
                         n_jobs=-1,
                         verbose=True)



clf.fit(X_train_mc5_vec, y_train_mc5)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.


convergence after 23 epochs took 26 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   26.3s finished


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=-1, penalty='l2', random_state=42, solver='saga',
          tol=0.0001, verbose=True, warm_start=False)

In [515]:
#dev_predicted_labels = clf.predict(X_dev_svd)
dev_mc5_predicted_labels = clf.predict(X_dev_mc5_vec)

In [516]:
# Evaluate on the dev set with weighted F1 and accuracy
f1_weighted = metrics.f1_score(y_dev_mc5, dev_mc5_predicted_labels, average='weighted')
accuracy = metrics.accuracy_score(y_dev_mc5, dev_mc5_predicted_labels)

print('Logistic Regression Classifier - 5 class')
print('-------------\n')
print('Accuracy on dev set: {:0.3f}'.format(accuracy))
print('f_1 score (Weighted): {:0.3f}'.format(f1_weighted))

Logistic Regression Classifier - 5 class
-------------

Accuracy on dev set: 0.562
f_1 score (Weighted): 0.554


In [519]:
from sklearn.metrics import classification_report

target_names = le5.inverse_transform([0,1,2,3,4])

print(classification_report(y_dev_mc5, dev_mc5_predicted_labels, target_names=target_names))

               precision    recall  f1-score   support

  neg_helpful       0.61      0.67      0.63      1500
neg_unhelpful       0.63      0.62      0.63      1500
  pos_helpful       0.53      0.63      0.57      1500
pos_unhelpful       0.56      0.59      0.57      1500
 undetermined       0.45      0.30      0.36      1500

    micro avg       0.56      0.56      0.56      7500
    macro avg       0.56      0.56      0.55      7500
 weighted avg       0.56      0.56      0.55      7500
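
The macro and weighted rows agree here because every class has the same support. A small sketch of both averages, using the rounded per-class F1 values and supports copied from the report above. The helper names are illustrative:

```python
# (F1, support) per class, copied from the 5-class report (F1 rounded to 2 places).
report = {
    "neg_helpful": (0.63, 1500),
    "neg_unhelpful": (0.63, 1500),
    "pos_helpful": (0.57, 1500),
    "pos_unhelpful": (0.57, 1500),
    "undetermined": (0.36, 1500),
}

def macro_average(rows):
    """Plain mean of the per-class scores."""
    return sum(score for score, _ in rows.values()) / len(rows)

def weighted_average(rows):
    """Mean of the per-class scores weighted by each class's support."""
    total = sum(support for _, support in rows.values())
    return sum(score * support for score, support in rows.values()) / total

macro_f1 = macro_average(report)
weighted_f1 = weighted_average(report)
print(round(macro_f1, 3), round(weighted_f1, 3))
```

Both averages come out at about 0.552 from the rounded inputs, in line with the 0.554 weighted F1 computed from unrounded values. The 'undetermined' class (F1 of 0.36) is what pulls the average down.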



## 3-class model

In [520]:
mc3_train = shuffle(mc3_train,random_state=42)[['reviewText','label']]
mc3_dev = shuffle(mc3_dev,random_state=42)[['reviewText','label']]

In [521]:
mc3_train['label'].value_counts()

2    26000
1    26000
0    26000
Name: label, dtype: int64

In [523]:
X_train_mc3 = mc3_train['reviewText']
X_dev_mc3 = mc3_dev['reviewText']


y_train_mc3 = mc3_train['label']
y_dev_mc3 = mc3_dev['label']

In [524]:
# Transform text examples

vectorizer = TfidfVectorizer(lowercase=True,
                             #tokenizer=Tokenizer,
                             analyzer='word',
                             stop_words=None,
                             token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'|\*|\-|\;|\:|\,|\.",
                             ngram_range=(1,2),
                             max_features=None)

X_train_mc3_vec = vectorizer.fit_transform(X_train_mc3)
X_dev_mc3_vec = vectorizer.transform(X_dev_mc3)

# vocabulary_ is available in every scikit-learn version; get_feature_names() was later removed
print("Token Count: {}".format(len(vectorizer.vocabulary_)))

Token Count: 2549824


In [529]:
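# Three classes. The solver and multi_class arguments are commented out below, so this run
# uses the library defaults (the output shows [LibLinear]) rather than SAGA with a multinomial loss.
# With liblinear, n_jobs has no effect.

# For reference: the SAGA solver is a variant of SAG that also supports the non-smooth penalty='l1' option (i.e. L1 regularization).
# It is the solver of choice for sparse multinomial logistic regression, and it is also suitable for very large datasets.

# C = inverse of regularization strength; positive float; smaller = stronger regularization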

clf = LogisticRegression(penalty='l2',
                         C=1.0,
                         random_state=42,
                         #solver='saga',
                         #multi_class='multinomial',
                         max_iter=100,
                         n_jobs=-1,
                         verbose=True)



clf.fit(X_train_mc3_vec, y_train_mc3)


[LibLinear]

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=-1,
          penalty='l2', random_state=42, solver='warn', tol=0.0001,
          verbose=True, warm_start=False)

In [530]:
#dev_predicted_labels = clf.predict(X_dev_svd)
dev_mc3_predicted_labels = clf.predict(X_dev_mc3_vec)

In [531]:
# Evaluate on the dev set with weighted F1 and accuracy
f1_weighted = metrics.f1_score(y_dev_mc3, dev_mc3_predicted_labels, average='weighted')
accuracy = metrics.accuracy_score(y_dev_mc3, dev_mc3_predicted_labels)

print('Logistic Regression Classifier - 3 class')
print('-------------\n')
print('Accuracy on dev set: {:0.3f}'.format(accuracy))
print('f_1 score (Weighted): {:0.3f}'.format(f1_weighted))

Logistic Regression Classifier - 3 class
-------------

Accuracy on dev set: 0.551
f_1 score (Weighted): 0.551


In [534]:
# confusion_matrix was never imported by name; call it through the metrics module
metrics.confusion_matrix(y_dev_mc3, dev_mc3_predicted_labels)

array([[1653,  717,  630],
       [ 873, 1496,  631],
       [ 567,  620, 1813]], dtype=int64)

In [532]:
target_names = le3.inverse_transform([0,1,2])

print(classification_report(y_dev_mc3, dev_mc3_predicted_labels, target_names=target_names))

              precision    recall  f1-score   support

     helpful       0.53      0.55      0.54      3000
undetermined       0.53      0.50      0.51      3000
   unhelpful       0.59      0.60      0.60      3000

   micro avg       0.55      0.55      0.55      9000
   macro avg       0.55      0.55      0.55      9000
weighted avg       0.55      0.55      0.55      9000
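
The accuracy, precision, and recall figures above can be recovered directly from the confusion matrix. A plain-Python sketch, assuming rows are true labels and columns are predicted labels in the order helpful, undetermined, unhelpful:

```python
# Confusion matrix copied from the 3-class cell above: rows = true, columns = predicted.
cm = [[1653, 717, 630],
      [873, 1496, 631],
      [567, 620, 1813]]
names = ["helpful", "undetermined", "unhelpful"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

precision, recall = {}, {}
for i, name in enumerate(names):
    predicted_as_i = sum(cm[r][i] for r in range(len(cm)))  # column sum
    truly_i = sum(cm[i])                                     # row sum
    precision[name] = cm[i][i] / predicted_as_i
    recall[name] = cm[i][i] / truly_i

print(f"accuracy: {accuracy:.3f}")
for name in names:
    print(f"{name:>12}  precision={precision[name]:.2f}  recall={recall[name]:.2f}")
```

The largest off-diagonal entry is 873: undetermined reviews predicted as helpful.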



## 2-class baseline

In [535]:
le3.inverse_transform([0,1,2])

array(['helpful', 'undetermined', 'unhelpful'], dtype=object)

In [539]:
mc3_train.head()

Unnamed: 0,reviewText,label
62226,"In a nutshell, if you don't have this book, yo...",1
42449,This is one of the popular &quot;female myster...,0
26120,Came in perfect shape in the regular mail. Ga...,2
37483,"It's a plot that has been done numerous times,...",2
20944,BAD BOOK. Characters were not even close to...,2


In [540]:
mc3_train[(mc3_train['label'] == 0) | (mc3_train['label'] == 2)]

Unnamed: 0,reviewText,label
42449,This is one of the popular &quot;female myster...,0
26120,Came in perfect shape in the regular mail. Ga...,2
37483,"It's a plot that has been done numerous times,...",2
20944,BAD BOOK. Characters were not even close to...,2
27336,I didn't really know what to expect when I bor...,2
6683,I bought the book despite the bad reviews I re...,0
30708,A Million Suns picks up three months after the...,2
6322,Robert Burney compares himself to John Bradsha...,0
10055,Some people have said it far more eloquently t...,0
4272,Bad. Bad. Bad. It CAN get worse. Why did I rea...,0


In [558]:
mc3_train['reviewText'][26120]

'Came in perfect shape in the regular mail.  Gave it to my very special neighbor. She was totally happy when she saw it.'

In [557]:
mc3_train['label'][20944]

2

In [549]:
# Binary baseline: keep only helpful (0) and unhelpful (2); drop undetermined (1)
train_mask_mc2 = mc3_train['label'].isin([0, 2])
dev_mask_mc2 = mc3_dev['label'].isin([0, 2])

X_train_mc2 = mc3_train.loc[train_mask_mc2, 'reviewText']
y_train_mc2 = mc3_train.loc[train_mask_mc2, 'label']

X_dev_mc2 = mc3_dev.loc[dev_mask_mc2, 'reviewText']
y_dev_mc2 = mc3_dev.loc[dev_mask_mc2, 'label']

In [566]:
X_train_mc2.shape


(52000,)

In [568]:
y_train_mc2.value_counts()

2    26000
0    26000
Name: label, dtype: int64

In [550]:
# Transform text examples

vectorizer = TfidfVectorizer(lowercase=True,
                             #tokenizer=Tokenizer,
                             analyzer='word',
                             stop_words=None,
                             token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'|\*|\-|\;|\:|\,|\.",
                             ngram_range=(1,2),
                             max_features=None)

X_train_mc2_vec = vectorizer.fit_transform(X_train_mc2)
X_dev_mc2_vec = vectorizer.transform(X_dev_mc2)

# vocabulary_ is available in every scikit-learn version; get_feature_names() was later removed
print("Token Count: {}".format(len(vectorizer.vocabulary_)))

Token Count: 1865163


In [551]:
# Binary problem: helpful (0) vs. unhelpful (2), so multi_class='ovr' fits a single classifier.

# The SAGA solver is a variant of SAG that also supports the non-smooth penalty='l1' option (i.e. L1 regularization).
# It is the solver of choice for sparse multinomial logistic regression, and it is also suitable for very large datasets.

# C = inverse of regularization strength; positive float; smaller = stronger regularization

clf = LogisticRegression(penalty='l2',
                         C=1.0,
                         random_state=42,
                         solver='saga',
                         multi_class='ovr',
                         max_iter=100,
                         n_jobs=-1,
                         verbose=True)



clf.fit(X_train_mc2_vec, y_train_mc2)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.


convergence after 22 epochs took 9 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    9.0s finished


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=-1,
          penalty='l2', random_state=42, solver='saga', tol=0.0001,
          verbose=True, warm_start=False)

In [552]:

dev_mc2_predicted_labels = clf.predict(X_dev_mc2_vec)

In [553]:
# Evaluate on the dev set with weighted F1 and accuracy
f1_weighted = metrics.f1_score(y_dev_mc2, dev_mc2_predicted_labels, average='weighted')
accuracy = metrics.accuracy_score(y_dev_mc2, dev_mc2_predicted_labels)

print('Logistic Regression Classifier')
print('-------------\n')
print('Accuracy on dev set: {:0.3f}'.format(accuracy))
print('f_1 score (Weighted): {:0.3f}'.format(f1_weighted))

Logistic Regression Classifier
-------------

Accuracy on dev set: 0.728
f_1 score (Weighted): 0.727


## 4-class; no undetermined

In [559]:
print(le5.inverse_transform([0,1,2,3,4]))

['neg_helpful' 'neg_unhelpful' 'pos_helpful' 'pos_unhelpful'
 'undetermined']


In [560]:
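# Drop the 'undetermined' class (label 4), leaving the four sentiment/helpfulness groups
train_mask_mc4 = mc5_train['label'] != 4
dev_mask_mc4 = mc5_dev['label'] != 4

X_train_mc4 = mc5_train.loc[train_mask_mc4, 'reviewText']
y_train_mc4 = mc5_train.loc[train_mask_mc4, 'label']

X_dev_mc4 = mc5_dev.loc[dev_mask_mc4, 'reviewText']
y_dev_mc4 = mc5_dev.loc[dev_mask_mc4, 'label']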

In [565]:
X_train_mc4.shape


(52000,)

In [567]:
y_train_mc4.value_counts()

3    13000
2    13000
1    13000
0    13000
Name: label, dtype: int64

In [561]:
# Transform text examples

vectorizer = TfidfVectorizer(lowercase=True,
                             #tokenizer=Tokenizer,
                             analyzer='word',
                             stop_words=None,
                             token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'|\*|\-|\;|\:|\,|\.",
                             ngram_range=(1,2),
                             max_features=None)

X_train_mc4_vec = vectorizer.fit_transform(X_train_mc4)
X_dev_mc4_vec = vectorizer.transform(X_dev_mc4)

# vocabulary_ is available in every scikit-learn version; get_feature_names() was later removed
print("Token Count: {}".format(len(vectorizer.vocabulary_)))

Token Count: 1865163


In [562]:
# Four classes, fitted one-vs-rest (multi_class='ovr'): one binary classifier per class.

# The SAGA solver is a variant of SAG that also supports the non-smooth penalty='l1' option (i.e. L1 regularization).
# It is the solver of choice for sparse multinomial logistic regression, and it is also suitable for very large datasets.

# C = inverse of regularization strength; positive float; smaller = stronger regularization

clf = LogisticRegression(penalty='l2',
                         C=1.0,
                         random_state=42,
                         solver='saga',
                         multi_class='ovr',
                         max_iter=100,
                         n_jobs=-1,
                         verbose=True)



clf.fit(X_train_mc4_vec, y_train_mc4)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.


convergence after 19 epochs took 15 seconds
convergence after 22 epochs took 17 seconds
convergence after 23 epochs took 17 seconds
convergence after 23 epochs took 17 seconds


[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:   17.1s finished


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=-1,
          penalty='l2', random_state=42, solver='saga', tol=0.0001,
          verbose=True, warm_start=False)
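
The four "convergence" lines above reflect the one-vs-rest setup: one binary model is fitted per class, whereas the 5-class run with `multi_class='multinomial'` reported a single fit. A plain-Python sketch of how the two schemes can turn per-class scores into probabilities, using made-up scores and assuming one-vs-rest probabilities are per-class sigmoids renormalized to sum to one:

```python
import math

def softmax(scores):
    """Multinomial: exponentiate and normalize jointly over all classes."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ovr_probabilities(scores):
    """One-vs-rest: independent sigmoid per class, then renormalize to sum to 1."""
    sigmoids = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(sigmoids)
    return [p / total for p in sigmoids]

# Hypothetical decision scores for the four classes of a single review.
scores = [1.2, -0.4, 0.3, -1.1]
p_multi = softmax(scores)
p_ovr = ovr_probabilities(scores)

print([round(p, 3) for p in p_multi])
print([round(p, 3) for p in p_ovr])
```

For a fixed set of scores both schemes pick the same top class. In practice the scores themselves differ, because the one-vs-rest models are trained separately and the multinomial model is trained jointly.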

In [563]:
dev_mc4_predicted_labels = clf.predict(X_dev_mc4_vec)

In [564]:
# Evaluate on the dev set with weighted F1 and accuracy
f1_weighted = metrics.f1_score(y_dev_mc4, dev_mc4_predicted_labels, average='weighted')
accuracy = metrics.accuracy_score(y_dev_mc4, dev_mc4_predicted_labels)

print('Logistic Regression Classifier')
print('-------------\n')
print('Accuracy on dev set: {:0.3f}'.format(accuracy))
print('f_1 score (Weighted): {:0.3f}'.format(f1_weighted))

Logistic Regression Classifier
-------------

Accuracy on dev set: 0.668
f_1 score (Weighted): 0.668
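
Accuracies from models with different numbers of classes are not directly comparable, since random guessing gets easier as classes are removed. A rough sketch that puts each run next to its chance level, assuming every dev subset is class-balanced (the 5-class and 3-class reports show equal support; the 2-class and 4-class subsets are assumed balanced because they drop one whole class from those sets):

```python
# (number of classes, dev accuracy) copied from the outputs in this section.
results = {
    "5-class": (5, 0.562),
    "3-class": (3, 0.551),
    "2-class": (2, 0.728),
    "4-class": (4, 0.668),
}

def chance_accuracy(n_classes):
    """Expected accuracy of uniform random guessing on balanced classes."""
    return 1.0 / n_classes

for name, (k, acc) in results.items():
    base = chance_accuracy(k)
    print(f"{name}: accuracy={acc:.3f}  chance={base:.3f}  lift={acc - base:+.3f}")
```

Read this way, the 4-class run without 'undetermined' sits further above chance than the 3-class run that includes it.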
