This notebook is based on the one Joachim shared in class, which predicted movie review sentiment:
https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb#scrollTo=p9gEt5SmM6i6

In [0]:
from datetime import datetime
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import tensorflow_hub as hub

In [2]:
 !pip install bert-tensorflow

Collecting bert-tensorflow
Installing collected packages: bert-tensorflow
Successfully installed bert-tensorflow-1.0.1


In [3]:
import bert
from bert import run_classifier
from bert import optimization
from bert import tokenization

W0802 01:54:29.671370 140667509827456 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/bert/optimization.py:87: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.



In [4]:
# This is so I can access my google storage bucket 
!pip install gcsfs
import gcsfs

Collecting gcsfs
  Downloading https://files.pythonhosted.org/packages/32/bf/a105dffbbebd60662bf95c550bb74e4236ca28e77c06735cf6ec11577d0e/gcsfs-0.3.0-py2.py3-none-any.whl
Installing collected packages: gcsfs
Successfully installed gcsfs-0.3.0


I store output in a GCP bucket. This requires OAuth, so don't leave this cell sitting and spinning: follow the link it prints and authorize access via Google.

Set DO_DELETE to delete and recreate OUTPUT_DIR if it exists. Otherwise, TensorFlow will load existing model checkpoints from that directory (if they exist).

In [5]:
# Set the output directory for saving model file
# Optionally, set a GCP bucket location

# %M is minutes (%m is the month)
OUTPUT_DIR = datetime.strftime(datetime.now(),"%Y%m%d-%H:%M")+'_Berty_output'#@param {type:"string"}
#@markdown Whether or not to clear/delete the directory and create a new one
DO_DELETE = False #@param {type:"boolean"}
#@markdown Set USE_BUCKET and BUCKET if you want to (optionally) store model output on GCP bucket.
USE_BUCKET = True #@param {type:"boolean"}
BUCKET = 'w266' #@param {type:"string"}

if USE_BUCKET:
  OUTPUT_DIR = 'gs://{}/{}'.format(BUCKET, OUTPUT_DIR)
  from google.colab import auth
  auth.authenticate_user()

if DO_DELETE:
  try:
    tf.gfile.DeleteRecursively(OUTPUT_DIR)
  except tf.errors.OpError:
    # Doesn't matter if the directory didn't exist
    pass
tf.gfile.MakeDirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))


***** Model output directory: gs://w266/20190802-01:08_Berty_output *****
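
The directory name printed above ends in `01:08`, even though the warning timestamps put this run at about 01:54 on August 2. A minimal sketch of why, assuming the cell ran at 01:54 (the exact minute is a guess read off those log lines): in `strftime`, `%m` is the month and `%M` is the minute.

```python
from datetime import datetime

# Assumed run time, read off the warning timestamps (the minute is a guess).
run_time = datetime(2019, 8, 2, 1, 54)

with_month = run_time.strftime("%Y%m%d-%H:%m")    # %m -> month (08)
with_minute = run_time.strftime("%Y%m%d-%H:%M")   # %M -> minute (54)

print(with_month)   # 20190802-01:08
print(with_minute)  # 20190802-01:54
```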


# Data

In [8]:
# Here's where I pull data from Brad's storage bucket 
# from google.colab import auth
# auth.authenticate_user()
# !gcloud config set project w266-239820
my_data = "gs://w266_final_proj/labeled_training_set_clust_v3.csv"
mine = pd.read_csv(my_data)

W0802 02:43:59.782456 140667509827456 _default.py:280] No project ID could be determined. Consider running `gcloud config set project` or setting the GOOGLE_CLOUD_PROJECT environment variable


In [0]:
# note that now our target class label is group_z_class

In [0]:
# Cleanup reviews without review content
#mine.iloc[94073]['reviewText'] # Example has 'nan' as reviewText
mine.dropna(subset=['reviewText'],inplace=True)

In [0]:
# Below I sample to have equal amounts of pos/neg reviews and equal amounts of top-quartile-HVAR vs 0 helpful votes
num_per_condition = 13000
neg_helpful = mine[(mine.overall == 1) & (mine.group_z_class == 1) & (mine.helpful_votes != 0)].sample(num_per_condition)
neg_unhelpful = mine[(mine.overall == 1) & (mine.group_z_class == 0) & (mine.helpful_votes == 0)].sample(num_per_condition)
pos_unhelpful = mine[(mine.overall == 5) & (mine.group_z_class == 0) & (mine.helpful_votes == 0)].sample(num_per_condition)
pos_helpful = mine[(mine.overall == 5) & (mine.group_z_class == 1) & (mine.helpful_votes != 0)].sample(num_per_condition)
# "reviewText" has the review content
# "group_z_class" has the label of 0 or 1
# "overall" has the star-rating {1,2,3,4,5}

In [0]:
# Prepending TEXT representation of overall star rating
neg_helpful['prepReviewText'] = 'WORST ' + neg_helpful.reviewText
neg_unhelpful['prepReviewText'] = 'WORST ' + neg_unhelpful.reviewText
pos_unhelpful['prepReviewText'] = 'BEST ' + pos_unhelpful.reviewText
pos_helpful['prepReviewText'] = 'BEST ' + pos_helpful.reviewText

In [19]:
# Put the subsets into the same dataframe again
# pd.concat replaces the chained DataFrame.append calls (append is deprecated in newer pandas)
stratdf = pd.concat([neg_helpful, neg_unhelpful, pos_unhelpful, pos_helpful], ignore_index=True)
print(f"Our dataset is now {stratdf.shape[0]} reviews.")

Our dataset is now 52000 reviews.


In [0]:
from sklearn.utils import shuffle
df = shuffle(stratdf,random_state=42)[['prepReviewText','overall','group_z_class']]

In [21]:
df.head()

Unnamed: 0,prepReviewText,overall,group_z_class
38199,BEST If you are a fan of the statistics of the...,5,0.0
20985,WORST I received a complimentary copy of this ...,1,0.0
13643,WORST When I ordered didn't know it was futuri...,1,0.0
11300,WORST To summarize:Start:He talks about how wo...,1,1.0
32153,BEST With so many reviews there should be litt...,5,0.0


In [0]:
X_train, X_test, y_train, y_test = train_test_split(
    df.prepReviewText, df.group_z_class, test_size=0.2,
    random_state=42, stratify=df.group_z_class)
# Ideally I would like train and test to be stratified across BOTH the
# group_z_class label AND the overall rating, but I keep getting errors when I try to do that
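
One way to stratify on both the label and the star rating is to build a single combined key and pass that to `stratify`. This is a sketch on a small synthetic frame, not the project data; the `strat_key` name and the toy rows are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df: 4 conditions (rating x label), 50 rows each.
rows = []
for overall in (1, 5):
    for label in (0.0, 1.0):
        for i in range(50):
            rows.append({"prepReviewText": f"review {overall}-{label}-{i}",
                         "overall": overall,
                         "group_z_class": label})
toy = pd.DataFrame(rows)

# One string key per row, e.g. "5_1.0"; stratifying on it balances both columns.
strat_key = toy["overall"].astype(str) + "_" + toy["group_z_class"].astype(str)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["prepReviewText"], toy["group_z_class"],
    test_size=0.2, random_state=42, stratify=strat_key)

# Each of the 4 conditions should contribute equally to the test split.
test_counts = strat_key.loc[X_te.index].value_counts()
print(test_counts)
```

The same idea should carry over to the real frame by building the key from `df.overall` and `df.group_z_class`, provided every combination has at least two rows.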

In [23]:
train = pd.concat([X_train,y_train], axis=1)
test = pd.concat([X_test,y_test], axis=1)

print(f"Train has {train.shape[0]} rows and {train.shape[1]} columns.")
print(f"Test has {test.shape[0]} rows and {test.shape[1]} columns.")

Train has 41600 rows and 2 columns.
Test has 10400 rows and 2 columns.


In [24]:
test.head()

Unnamed: 0,prepReviewText,group_z_class
46428,"BEST Brenda Boyd, whose son Kenneth Hall wrote...",1.0
24141,WORST I am just flabbergasted that so many rev...,0.0
16469,WORST With what going. In the real world. I do...,0.0
47417,BEST Alyssa Moss has lived a sad life in the s...,1.0
18085,WORST I want to give this 4 stars but a comput...,0.0


In [0]:
# DATA_COLUMN = 'reviewText'
DATA_COLUMN = 'prepReviewText'
LABEL_COLUMN = 'group_z_class'
# label_list is the list of classes: 0 = no helpful votes received; 1 = z class of helpfulness
label_list = [0, 1]

# Data Preprocessing


In [0]:
# Use the InputExample class from BERT's run_classifier code to create examples from the data
train_InputExamples = train.apply(lambda x: bert.run_classifier.InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this example
                                                                   text_a = x[DATA_COLUMN], 
                                                                   text_b = None, 
                                                                   label = x[LABEL_COLUMN]), axis = 1)

test_InputExamples = test.apply(lambda x: bert.run_classifier.InputExample(guid=None, 
                                                                   text_a = x[DATA_COLUMN], 
                                                                   text_b = None, 
                                                                   label = x[LABEL_COLUMN]), axis = 1)

In [27]:
# This is a path to an uncased (all lowercase) version of BERT
BERT_MODEL_HUB = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"

def create_tokenizer_from_hub_module():
  """Get the vocab file and casing info from the Hub module."""
  with tf.Graph().as_default():
    bert_module = hub.Module(BERT_MODEL_HUB)
    tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
    with tf.Session() as sess:
      vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
                                            tokenization_info["do_lower_case"]])
      
  return bert.tokenization.FullTokenizer(
      vocab_file=vocab_file, do_lower_case=do_lower_case)

tokenizer = create_tokenizer_from_hub_module()

W0802 03:45:26.437048 140667509827456 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/bert/tokenization.py:125: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.



In [28]:
# 512 tokens is the upper limit of the BERT model's input length
# JenD's current Colab environment maxes out at 13 GB of memory, so setting 128 for now
MAX_SEQ_LENGTH = 128
# Convert our train and test features to InputFeatures that BERT understands.
train_features = bert.run_classifier.convert_examples_to_features(train_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer)
test_features = bert.run_classifier.convert_examples_to_features(test_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer)

W0802 03:45:58.717837 140667509827456 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/bert/run_classifier.py:774: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
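
With `MAX_SEQ_LENGTH = 128`, a single-sentence example keeps at most 126 word pieces, because two positions go to `[CLS]` and `[SEP]`; shorter inputs are padded. The sketch below mimics that bookkeeping in plain Python. It is a simplified illustration, not the `run_classifier` code itself: `to_padded_tokens` is a hypothetical helper, and the tokens are placeholder strings rather than real WordPiece output.

```python
def to_padded_tokens(tokens, max_seq_length):
    """Truncate, add [CLS]/[SEP], and build the attention mask (hypothetical helper)."""
    # Reserve two positions for the special tokens.
    kept = tokens[: max_seq_length - 2]
    wrapped = ["[CLS]"] + kept + ["[SEP]"]
    # 1 marks a real token, 0 marks padding.
    input_mask = [1] * len(wrapped) + [0] * (max_seq_length - len(wrapped))
    padded = wrapped + ["[PAD]"] * (max_seq_length - len(wrapped))
    return padded, input_mask

# A review of 193 word pieces is cut down to 126 plus the two special tokens.
long_review = [f"tok{i}" for i in range(193)]
padded, mask = to_padded_tokens(long_review, 128)
print(len(padded), sum(mask), padded[-2])   # 128 128 tok125

# A short review is padded up to the full length.
short_review = ["best", "great", "book"]
padded, mask = to_padded_tokens(short_review, 128)
print(len(padded), sum(mask))               # 128 5
```

Anything past the 126th word piece never reaches the model, which matters for long reviews.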



# Creating the model


In [0]:
def create_model(is_predicting, input_ids, input_mask, segment_ids, labels,
                 num_labels):
  """Creates a classification model."""
  # In future, pass in the rating along a separate wide path
  # See https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/ch10.html#ann_chapter
  # input_rating = keras.layers.Input(shape=[1], name="wide_input")
  
  bert_module = hub.Module(
      BERT_MODEL_HUB,
      trainable=True)
  bert_inputs = dict(
      input_ids=input_ids,
      input_mask=input_mask,
      segment_ids=segment_ids)
  bert_outputs = bert_module(
      inputs=bert_inputs,
      signature="tokens",
      as_dict=True)

  # Using "pooled_output" for classification task of each truncated review.
  deep_layer = bert_outputs["pooled_output"]
  deep_size = deep_layer.shape[-1].value

  # Create our own layer to tune for helpfulness data
  # In future "+1" needed on the deep_size, to weight the rating value arriving from the side
  output_weights = tf.get_variable(
      "output_weights", [num_labels, deep_size ],
      initializer=tf.truncated_normal_initializer(stddev=0.02))

  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

  with tf.variable_scope("loss"):

    # In future prepend the rating of the review
    # concatting_layer = keras.layers.Concatenate()([input_rating, deep_layer])
    
    # In future pass concatting layer to dropout
    output_layer = tf.nn.dropout(deep_layer, keep_prob=.9)

    logits = tf.matmul(output_layer, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    # Convert labels into one-hot encoding
    one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

    predicted_labels = tf.squeeze(tf.argmax(log_probs, axis=-1, output_type=tf.int32))
    # If we're predicting, we want predicted labels and the (log) probabilities.
    if is_predicting:
      return (predicted_labels, log_probs)

    # If we're train/eval, compute loss between predicted and actual label
    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)
    return (loss, predicted_labels, log_probs)
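
The loss in `create_model` is ordinary softmax cross-entropy: take `log_softmax` of the logits, then pick out the negative log-probability of the true class. A small worked sketch in plain Python (the logits are made-up numbers, and the helper names are hypothetical):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax for one example."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def cross_entropy(logits, label):
    """-sum(one_hot * log_probs) reduces to -log_probs[label]."""
    return -log_softmax(logits)[label]

# Two made-up examples: (logits for [Unhelpful, Helpful], true label).
batch = [([2.0, 0.0], 0), ([2.0, 0.0], 1)]
per_example = [cross_entropy(lg, y) for lg, y in batch]
loss = sum(per_example) / len(per_example)

print([round(v, 4) for v in per_example])  # [0.1269, 2.1269]
print(round(loss, 4))                      # 1.1269
```

The same logits cost little when the true label is the favored class and a lot when it is not; the batch loss is the mean of the two.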


Next we'll wrap our model function in a `model_fn_builder` function that adapts our model to work for training, evaluation, and prediction.

In [0]:
# model_fn_builder actually creates our model function
# using the passed parameters for num_labels, learning_rate, etc.
def model_fn_builder(num_labels, learning_rate, num_train_steps,
                     num_warmup_steps):
  """Returns `model_fn` closure for Estimator."""
  def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
    """The `model_fn` for Estimator."""

    input_ids = features["input_ids"]
    input_mask = features["input_mask"]
    segment_ids = features["segment_ids"]
    label_ids = features["label_ids"]

    is_predicting = (mode == tf.estimator.ModeKeys.PREDICT)
    
    # TRAIN and EVAL
    if not is_predicting:

      (loss, predicted_labels, log_probs) = create_model(
        is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)

      train_op = bert.optimization.create_optimizer(
          loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu=False)

      # Calculate evaluation metrics.
      # Note: auc is computed from hard 0/1 predictions here, not from probabilities.
      def metric_fn(label_ids, predicted_labels):
        accuracy = tf.metrics.accuracy(label_ids, predicted_labels)
        f1_score = tf.contrib.metrics.f1_score(
            label_ids,
            predicted_labels)
        auc = tf.metrics.auc(
            label_ids,
            predicted_labels)
        recall = tf.metrics.recall(
            label_ids,
            predicted_labels)
        precision = tf.metrics.precision(
            label_ids,
            predicted_labels) 
        true_pos = tf.metrics.true_positives(
            label_ids,
            predicted_labels)
        true_neg = tf.metrics.true_negatives(
            label_ids,
            predicted_labels)   
        false_pos = tf.metrics.false_positives(
            label_ids,
            predicted_labels)  
        false_neg = tf.metrics.false_negatives(
            label_ids,
            predicted_labels)
        return {
            "eval_accuracy": accuracy,
            "f1_score": f1_score,
            "auc": auc,
            "precision": precision,
            "recall": recall,
            "true_positives": true_pos,
            "true_negatives": true_neg,
            "false_positives": false_pos,
            "false_negatives": false_neg
        }

      eval_metrics = metric_fn(label_ids, predicted_labels)

      if mode == tf.estimator.ModeKeys.TRAIN:
        return tf.estimator.EstimatorSpec(mode=mode,
          loss=loss,
          train_op=train_op)
      else:
        return tf.estimator.EstimatorSpec(mode=mode,
          loss=loss,
          eval_metric_ops=eval_metrics)
    else:
      (predicted_labels, log_probs) = create_model(
        is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)

      predictions = {
          'probabilities': log_probs,  # log-probabilities; exponentiate to get probabilities
          'labels': predicted_labels
      }
      return tf.estimator.EstimatorSpec(mode, predictions=predictions)

  # Return the actual model function in the closure
  return model_fn


In [0]:
# Compute train and warmup steps from batch size
# These hyperparameters are copied from this colab notebook (https://colab.sandbox.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb)
BATCH_SIZE = 32
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 3.0
# Warmup is a period of time where the learning rate
# is small and gradually increases; this usually helps training.
WARMUP_PROPORTION = 0.1
# Model configs
SAVE_CHECKPOINTS_STEPS = 500
SAVE_SUMMARY_STEPS = 100

In [0]:
# Compute the number of train and warmup steps from batch size
num_train_steps = int(len(train_features) / BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)
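
For this run the arithmetic works out as follows: 41,600 training rows at batch size 32 give 1,300 steps per epoch, so 3 epochs is 3,900 steps, with the first 390 used for warmup. The sketch below also traces the learning-rate shape, assuming linear warmup followed by linear decay to zero, which as far as I can tell is what `bert.optimization.create_optimizer` does; `lr_at` is a hypothetical helper, not part of the library.

```python
BATCH_SIZE = 32
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 3.0
WARMUP_PROPORTION = 0.1
num_train_examples = 41600  # rows in the training split

num_train_steps = int(num_train_examples / BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)
print(num_train_steps, num_warmup_steps)  # 3900 390

def lr_at(step):
    """Assumed schedule: linear warmup, then linear decay to zero (hypothetical helper)."""
    if step < num_warmup_steps:
        return LEARNING_RATE * step / num_warmup_steps
    return LEARNING_RATE * (1.0 - step / num_train_steps)

# Learning rate at the start, mid-warmup, end of warmup, and end of training.
for step in (0, 195, 390, 3900):
    print(step, lr_at(step))
```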

In [0]:
# Specify output directory and number of checkpoint steps to save
run_config = tf.estimator.RunConfig(
    model_dir=OUTPUT_DIR,
    save_summary_steps=SAVE_SUMMARY_STEPS,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS)

In [0]:
model_fn = model_fn_builder(
  num_labels=len(label_list),
  learning_rate=LEARNING_RATE,
  num_train_steps=num_train_steps,
  num_warmup_steps=num_warmup_steps)

estimator = tf.estimator.Estimator(
  model_fn=model_fn,
  config=run_config,
  params={"batch_size": BATCH_SIZE})


Next we create an input builder function that takes our training feature set (`train_features`) and produces an input function (`input_fn`) that feeds batches to the model. This is a pretty standard design pattern for working with TensorFlow [Estimators](https://www.tensorflow.org/guide/estimators).

In [0]:
# Create an input function for training. drop_remainder=True is only needed on TPUs, so it stays False here.
train_input_fn = bert.run_classifier.input_fn_builder(
    features=train_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=True,
    drop_remainder=False)

Now we train our model! Using a Colab notebook running on Google's GPUs, this run took about 62 minutes.

In [36]:
print('Beginning Training!')
current_time = datetime.now()
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
print("Training took time ", datetime.now() - current_time)


Beginning Training!


W0802 03:50:46.376146 140667509827456 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0802 03:51:10.639488 140667509827456 deprecation.py:506] From <ipython-input-29-7a75a02b1991>:39: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
W0802 03:51:10.682178 140667509827456 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/bert/optimization.py:27: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_glo

Training took time  1:02:15.635699


Now let's use our test data to see how well our model did:

In [0]:
test_input_fn = run_classifier.input_fn_builder(
    features=test_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=False,
    drop_remainder=False)

In [38]:
estimator.evaluate(input_fn=test_input_fn, steps=None)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
W0802 04:53:43.137621 140667509827456 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.


{'auc': 0.73192304,
 'eval_accuracy': 0.7319231,
 'f1_score': 0.7301587,
 'false_negatives': 1428.0,
 'false_positives': 1360.0,
 'global_step': 3900,
 'loss': 0.7909138,
 'precision': 0.7349961,
 'recall': 0.7253846,
 'true_negatives': 3840.0,
 'true_positives': 3772.0}

The results above are for reviews prepended with WORST or BEST, on train/test data built from 4 conditions across (1*, 5*) and (group_z_class == 1, 0), with training on 41,600 rows and testing on 10,400 rows (stratified on helpful/not).
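
As a sanity check, the reported scores can be recomputed from the four confusion counts in the evaluation output. The sketch also suggests why `auc` equals `eval_accuracy` here: `auc` was fed hard 0/1 predictions, so the ROC curve has a single interior point and the area reduces to the mean of recall and specificity (balanced accuracy), which matches plain accuracy when the classes are balanced.

```python
tp, tn, fp, fn = 3772, 3840, 1360, 1428  # counts from the evaluation output

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# 0.7319 0.735 0.7254 0.7302
print(round(balanced_accuracy, 4))  # 0.7319
```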

Now let's write code to make predictions on new reviews:

In [0]:
def getPrediction(in_sentences):
  """Return (sentence, class probabilities, label name) for each input sentence."""
  labels = ["Unhelpful", "Helpful"]
  # guid="" and label=0 are just dummy values here
  input_examples = [run_classifier.InputExample(guid="", text_a=x, text_b=None, label=0)
                    for x in in_sentences]
  input_features = run_classifier.convert_examples_to_features(
      input_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  predict_input_fn = run_classifier.input_fn_builder(
      features=input_features, seq_length=MAX_SEQ_LENGTH,
      is_training=False, drop_remainder=False)
  predictions = estimator.predict(predict_input_fn)
  return [(sentence, np.exp(prediction['probabilities']), labels[prediction['labels']])
          for sentence, prediction in zip(in_sentences, predictions)]

In [40]:
# Here I have a few "canary in the coal mine" examples
# If the model can't get these right, I put very little faith in it

# This is 1* without sharing why
vacuous_negative = "WORST I like to read many different books from many points of view.  This book was a complete waste of time that I will never get back. Save yourself the trouble and money."

# This is 5* but worthless
vacuous_positive = "BEST My husband has really loved this series and I have heard the same comments from many others who have also read this series."

# This looks like someone who didn't read the book was paid to write a long 5* review
# it sounds like a completely generic description of any cookbook ("it provides recipes to prepare foods...")
# Perhaps mentions of many concepts (ingredients, gourmet, etc.) can fool some people and also an algorithm
# Helpful votes: 71
# Annual HVAR: 5
# For this book, the top quartile was HVAR of 1.7
# Surely adding GENRE would knock this one down
vacuous_cookbook = "BEST As someone who is learning to cook only late in her life, I was apprehensive and embarrassed about asking simple basic questions of friends and family.  Perceiving this, my parents gave me this cookbook, and voila!  -- I can cook!With step-by-step instructions on everything from  cookware, ingredients, buying, preapring, cooking, and serving, there's  nothing this book can't handle.  It provides recipes to prepare foods in  the simplest ways, all the way up to complex gourmet dishes.  And it covers  every imaginable food -- if it isn't in here, I can't imagine where you'd  find it.The language is straightforward and encouraging, with  appropriate editorializing on the author's preferences, and the layout is  clean and easy to read.  I can't say enough good things about this cookbook  -- it never leaves my kitchen counter."

# Reviewer hasn't read it yet
vacuous_not_read = "BEST Just downloaded this series.  Looking forward to getting to read it, once i get past some of the other books on my reading list.  I just love beig able to carryall these books without having to carry them individually!"

# This excerpt of a review got 5* and JenD finds it helpful 
meaty_positive = "BEST After \"Riding Lessons\", which I loved - and \"Flying Changes\", which was a huge disappointment to me, I was not sure what I would find in \"Water for Elephants\".  Wow - what a great read!  Research does pay off hugely - when it enables a writer to place the reader inside another world so easily - the world of the circus. This was a world totally foreign to most of us - but now, so familiar, thanks to Sara.  This book satisfied my three requirements - transportation (take me away from all this),  levitation (lift my spirits and leave me thinking good thoughts) and infiltration (let me get inside the characters so I feel I really know them). Reading \"Water for Elephants\" is time well-spent.  I'm happy to know Sara is working on a fourth book. I'll be first in line to buy it."

# Another excerpt from a 5* with many helpful reviews
meaty_positive2 = "BEST This had the flavor of The Great Gatsby, the well to do characters spend their days going from luncheon to evening parties and everyone is concerned about who's who. That is, on the surface it has that flavor. Beneath is a gripping story more about Trudy and Will than about the piano teacher, Claire and the desperate desire to survive.The characters in this book are well developed. There is a lot going on beneath the surface that the author lets you discern.Life in Hong Kong during the 40's is a lark and all about the parties you go to, until the Japanese occupation.  Will is interned along with many of the other socialites.  Life becomes getting food, keeping warm, keeping from being infested, keeping from being singled out for abuse."

# This excerpt is from a negative 1* review and includes arguments (>300 helpful votes) 
meaty_negative = "WORST Colin Campbell is so intent on promoting a vegan data that he misrepresents the data in the real China Study and cherry picks anti-animal food data. For instance, he rightly cites the link between milk and autoimmune disease but fails to mention that gluten, from wheat and related grains, is at least as important a cause. He writes of the association between casein, a milk protein, with cancer, but fails to mention that whey and butterfat are protective against cancer, and in milk you get all of them. He makes completely false statements like folate not being in meat when organ meats are much higher in folate than any plant source according to the USDA. He assumes nutrient consistency with the US without actually measuring it, despite the fact that soil nutrients and species differences have a huge effect on nutrition."

# This is a scholar reviewing another scholar's 5* work
# Helpful votes: 788 and annual HVAR: 74
meaty_scholar = "BEST Noted historian of the early church Elaine Pagels has produced a clear, cogent, and very effective introduction to the subject of Gnosticism, a different form of Christianity that was declared heretical and virtually stamped out by the orthodox church by the start of the second century after Christ.  Most of what we knew of the Gnostic belief system came from the religious authors who worked so hard to destroy the movement, but that changed drastically with the still relatively recent discovery of a number of lost Gnostic writings near Nag Hammadi in Upper Egypt.  Unlike the Dead Sea Scrolls, this momentous discovery of ancient papyri has received little attention, and I must admit I went into this book knowing virtually nothing about Gnosticism.  As an historian by training and a Christian, the information in these &quot;heretical&quot; texts intrigue me, and I believe that Christians should challenge their faith by examining material that does not fall in line"

# Make the list of canaries
pred_sentences = [vacuous_negative,
                  vacuous_positive,
                  vacuous_cookbook,
                  vacuous_not_read,
                  meaty_negative,
                  meaty_positive,
                  meaty_positive2,
                  meaty_scholar]

# See how any sentence looks in tokens by assigning it to to_tokens below
to_tokens = vacuous_cookbook
tokenized = tokenizer.tokenize(to_tokens)
print(f"That sample tokenizes to {len(tokenized)} tokens.")
print("Beginning with: ","\n",'\t',tokenized[0:5])
print("And ending with: ","\n",'\t',tokenized[-5:])


That sample tokenizes to 193 tokens.
Beginning with:  
 	 ['best', 'as', 'someone', 'who', 'is']
And ending with:  
 	 ['leaves', 'my', 'kitchen', 'counter', '.']


In [41]:
predictions = getPrediction(pred_sentences)
predictions


[('WORST I like to read many different books from many points of view.  This book was a complete waste of time that I will never get back. Save yourself the trouble and money.',
  array([0.9913619 , 0.00863805], dtype=float32),
  'Unhelpful'),
 ('BEST My husband has really loved this series and I have heard the same comments from many others who have also read this series.',
  array([0.98372436, 0.01627563], dtype=float32),
  'Unhelpful'),
 ("BEST As someone who is learning to cook only late in her life, I was apprehensive and embarrassed about asking simple basic questions of friends and family.  Perceiving this, my parents gave me this cookbook, and voila!  -- I can cook!With step-by-step instructions on everything from  cookware, ingredients, buying, preapring, cooking, and serving, there's  nothing this book can't handle.  It provides recipes to prepare foods in  the simplest ways, all the way up to complex gourmet dishes.  And it covers  every imaginable food -- if it isn't in her

JenD is very happy with the above results, from a canary-in-the-coalmine point of view.

In [0]:
# Now combine the test data and the test predictions, for some error analysis
test.reset_index(inplace=True)

In [0]:
test_preds = getPrediction(list(test.prepReviewText))

In [0]:
testy_preds = pd.DataFrame(test_preds, columns=['text','probas','judgment'])

In [0]:
beforeAndAfter = pd.merge(left=test, right=testy_preds,
                                  left_index=True, right_index=True)

In [0]:
beforeAndAfter.head()

Unnamed: 0,index,prepReviewText,most_helpful,text,probas,judgment
0,399,WORST I zipped through the first book of this ...,1,WORST I zipped through the first book of this ...,"[-0.0693467, -2.70311]",Unhelpful
1,3036,BEST I love the Rylie Cruz series and all of R...,1,BEST I love the Rylie Cruz series and all of R...,"[-2.6566658, -0.07276628]",Helpful
2,2596,BEST I am very happy with this book. I haven'...,0,BEST I am very happy with this book. I haven'...,"[-0.043234207, -3.1626618]",Unhelpful
3,77,"WORST Let me be honest here, although this cou...",1,"WORST Let me be honest here, although this cou...","[-0.78407216, -0.60980487]",Helpful
4,326,WORST I absolutely hated this book. The heroi...,1,WORST I absolutely hated this book. The heroi...,"[-2.70472, -0.06923114]",Helpful
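
The `probas` column in this recorded table holds log-probabilities (the output of `log_softmax`), which is why every value is negative. Exponentiating recovers probabilities that sum to 1. A quick sketch using the first row:

```python
import math

log_probs = [-0.0693467, -2.70311]      # first row of the recorded table
probs = [math.exp(v) for v in log_probs]

print([round(p, 3) for p in probs])     # [0.933, 0.067]
print(round(sum(probs), 3))             # 1.0
```

So that review was scored about 93% Unhelpful, even though its recorded label was 1.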


In [0]:
beforeAndAfter.drop(['text'],inplace=True, axis=1)

In [0]:
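# False negatives: labeled helpful, predicted Unhelpful
# (the label column shows as most_helpful in the recorded outputs; in this notebook it is LABEL_COLUMN)
fndf = beforeAndAfter[(beforeAndAfter[LABEL_COLUMN]==1) & (beforeAndAfter.judgment=='Unhelpful')]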

In [0]:
fndf.shape

(157, 5)

In [0]:
# False positives: labeled unhelpful, predicted Helpful
fpdf = beforeAndAfter[(beforeAndAfter[LABEL_COLUMN]==0) & (beforeAndAfter.judgment=='Helpful')]

In [0]:
pd.set_option('display.max_colwidth', 1000)

In [0]:
# Look at some false positives that were 5* (which is what prepended with BEST means)
fpdf[fpdf.prepReviewText.str.startswith('BEST ')].head()

Unnamed: 0,index,prepReviewText,most_helpful,probas,judgment
8,2517,"BEST If you could prove or disprove faith-based knowledge (religious beliefs), faith would not be necessary. Nevertheless, rabid theists engage in attempts to prove, and rabid atheists engage in attempts to disprove. And predictably, they come up as empty-handed as the person referred to in the old adage, ""You can't squeeze blood out of a turnip."" It's impossible, but since hope springs eternal in matters of faith, and too, there is a market for controversy that yields big bucks in book royalties, rabid theists and atheists remain engaged.Unlike others in the current crop of rabid atheists, Harris does not play the game of attempting to disprove faith-based knowledge--at least not in his book, The End of Faith: Religion, Terror, and the Future of Reason. In the philosophical tradition of logical positivism, he asserts instead that without the objective criterion of evidence to support faith-based knowledge, it is nothing more than fairy tale. He goes on to argue that without that o...",0,"[-2.821497, -0.061361484]",Helpful
25,2811,"BEST I read ""Princess in Waiting"" in 2 sittings.I loved it because it is so wonderfully good. And coming from me, that should mean something, because I am a teen, but I don't really like teen books too much, with this series as one of my few exceptions. This fourth book lived up to the other three, it was addictive. This series does keep getting better, with each book entrancing you, and makes you unable to put it down, even as you try, as ""Princess In Waiting"" did. One critic said it is like "" reading a note from your best friend"", and that is so true.You get caught up in the story and the people, that is one of the great qualities about it.You think it is your life, and that you know them. This is the type of book/series that you wait for months for the next one to be released. I did, and I wasn't disappointed. Read it, you won't be either.",0,"[-2.3271813, -0.10266453]",Helpful
70,2159,"BEST I'm building a library of books on fashion, and this one is a keeper. Very nice photos and good descriptions of the dresses.",0,"[-0.9282413, -0.5029372]",Helpful
83,2129,"BEST ""The Capitol Game"" by Brian Haig starts off with an American soldier and his group getting killed in a roadside explosion in Iraq. While wondering how the incident might have been prevented, Haig moves us to present day. An ambitious Wall Street businessman, Jack Wiley, thinks he's found the next billion dollar business--a company on the verge of bankruptcy whose founder has discovered a special type of polymer that makes military vehicles in addition to other transportation, virtually explosion-proof.Wiley tries to sell the idea to the wealthy Capitol Group--a company whose whole business is based on taking over companies and making them profitable. However, Wiley won't tell them any more details about the venture without being guaranteed a solid percentage of future income from the polymer and being charge of the takeover himself. The Capitol Group grows suspicious of Jack, and starts concocting ways of getting its greedy claws on the polymer while moving Jack to the side. O...",0,"[-1.91135, -0.16002858]",Helpful
85,2163,"BEST Several years ago I penned an essay I ""Ten Steps to Eloquence."" In my mind, the final step was the most important: the delivery.Although he may never have read it, its message was lost on Jerry Weissman. A presentation coach with a long list of corporate clients uses his third book to present a seven-step plan for crafting content into a compelling story. Weissman teaches how to overcome public speaking jitters, present with force and conviction and to emotionally connect with any audience. Beyond your words, the author demonstrates how to communicate with your audience using your body language.Readers of the book have access to a website that provides case studies of power presenters--from Martin Luther King, Jr. to John F. Kennedy, from Ronald Reagan to Barack Obama.If you need to deliver, this is a book you cannot afford to miss. It is filled with techniques, tools and wisdom gathered during long career communicating with audiences.",0,"[-2.768761, -0.06479424]",Helpful
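
The probas values above look like log-probabilities rather than plain probabilities: exponentiating each pair gives two numbers that sum to roughly 1. Below is a minimal sketch of that conversion, using pairs copied from the table. The mapping of index 0 to Unhelpful and index 1 to Helpful is an assumption inferred from the judgment column, and the variable names are only illustrative.

```python
import numpy as np

# Log-probability pairs copied from the `probas` column above.
# Assumption: index 0 is "Unhelpful" and index 1 is "Helpful".
log_probs = np.array([
    [-2.821497, -0.061361484],
    [-2.3271813, -0.10266453],
    [-0.9282413, -0.5029372],
])

# Exponentiate to recover ordinary probabilities.
probs = np.exp(log_probs)

# Each row should sum to approximately 1 if these really are log-probabilities.
row_sums = probs.sum(axis=1)

# The predicted class is the index with the larger probability.
predicted = probs.argmax(axis=1)

for p, s, c in zip(probs, row_sums, predicted):
    print(np.round(p, 3), round(float(s), 3), c)
```

Read this way, the third row (the short fashion-book review) is a much less confident "Helpful" call than the first two.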


In [0]:
# Look at some false positives that were 1-star reviews (text prefixed with 'WORST')
fpdf[fpdf.prepReviewText.str.contains('WORST')].head()

Unnamed: 0,index,prepReviewText,most_helpful,probas,judgment
11,1726,"WORST I love Anne Lamott, and I love that her son was a willing participant in this memoir. This truly is a labor of God.",0,"[-1.0120432, -0.45173246]",Helpful
12,1318,"WORST I love Amish fiction and have read every complete series by all other authors and loved every book. I held off on these books only because the author is male and I wasn't sure he could write with the passion that the female authors do. I was right. I suffered through 10 chapters of this book before putting it down for good. Characters are uninteresting and the details are different. I cannot put my finger on exactly how his details are different, but they are uninteresting. I especially did not like the one married female Amish character who was uncharacteristically obsessed with money. Too much worry, and not enough faith in God as I expect to read in Amish fiction. I do not normally give negative reviews. This is only the second book I have ever stopped reading in many years of reading Christian fiction.",0,"[-1.0489589, -0.43124798]",Helpful
13,1028,"WORST Wow, after reading and enjoying the author's first book, Joshua, I fully expected to enjoy his second book as well, Traveler. Not so! The tone of Traveler was so different from Joshua that it truly felt as if a different author had penned it. Joshua was a great character study. The characters and relationships developed as the book proceeded, and the love that was so evident between ""the man"" and Joshua carried the book even when their survival efforts were dark and discouraging. But Traveler ... nothing but disappointing. No character development, no one in the story to care about. In my opinion this book was a waste of time. What a shame, after such a valiant first novel!",0,"[-1.7100235, -0.19950213]",Helpful
28,1299,"WORST This is the only book so far that I almost couldn't finish. It was awful! Not bad enough that she not only shoved ('crammed' might be a better word) her religious....techniques (not just faith) on someone else, but waited until he fell in love with her to do it. This was not only long and unendurable, but completely unbelievable as a story.",0,"[-0.84565264, -0.5608514]",Helpful
47,1441,"WORST I was suckered into this after a promising ""sample"" on Kindle. Completely boring and poorly written. I didn't finish this. It was so boring that I can't even muster up the interest to really analyze why I didn't like it. I just felt I had to do my civic duty and add my *1 star* to warn off readers. Try ANYTHING by Liz Carlyle, Julia Quinn, Suzanne Enoch or Lisa Kleypas as a better option. Even some of the second tier novelists in this genre are a safer bet.",0,"[-0.799672, -0.59688544]",Helpful
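
One caveat about these filters: str.contains matches the marker anywhere in the review text, so a 5-star review that merely mentions the word WORST would also be selected. The sketch below compares it with a prefix match and shows a seeded sample. It assumes the BEST/WORST marker is always prepended to the review text, and the toy rows are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only.
toy = pd.DataFrame({"prepReviewText": [
    "WORST I could not finish it.",
    "BEST This is the WORST-kept secret in fiction, and I loved it.",
    "BEST A keeper.",
]})

# Substring match: also picks up the 5-star review that mentions WORST.
contains_mask = toy.prepReviewText.str.contains("WORST")

# Prefix match: only rows whose text starts with the marker.
prefix_mask = toy.prepReviewText.str.startswith("WORST")

print(contains_mask.tolist())  # [True, True, False]
print(prefix_mask.tolist())    # [True, False, False]

# Passing random_state makes .sample() return the same rows on every run.
first = toy.sample(2, random_state=0)
second = toy.sample(2, random_state=0)
print(first.index.equals(second.index))  # True
```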


In [0]:
# Look at some false negatives that were 5-star reviews (text prefixed with 'BEST')
fndf[fndf.prepReviewText.str.contains('BEST')].sample(5)
# Hmmm... JenD thinks the algorithm may have gotten these right. Yuck.

Unnamed: 0,index,prepReviewText,most_helpful,probas,judgment
96,3155,BEST I usually never write reviews on a book in fact this is my 1st one. C.J.'s VV Inn series is awesome. I loved every page. From the hot sex that Rafe and Dria have to the werewolves getting shot and not knowing who was going to get killed. She goes into such detail on her characters and really brings them to life. I have read many vampire books but not 1 as unusual as this series and it was a fantastic to read. I am so excited for book 4.,1,"[-0.12263182, -2.1592581]",Unhelpful
333,3187,"BEST Even though my profession finds me surrounded by books all day I am a tough sell when it comes to sitting down and reading just for fun, but Changing Shoes is an excellent enjoyable read. I laughed. I cried and I thoroughly enjoyed reading it from cover to cover. You do not have to have been a Guiding Light or soap fan to enjoy piece. It appeals to women of all ages. It makes you think about what life really is all about. I feel truly blessed to have had the opportunity to get to know Tina!",1,"[-0.13903199, -2.0417619]",Unhelpful
195,3668,BEST I love avery and sean they had me laughing and very frustrated. I love hm wards books have read alot of them and loved them. However it frustrates me when I have to wait for the next one!!!,1,"[-0.034964226, -3.3708603]",Unhelpful
476,3911,"BEST I moved up to a 4g smart phone and this book saved me. I used it everyday for two weeks, now I just grab it to fiqure out a new section. Really enhances how to use a smart phone.",1,"[-0.062175777, -2.8087165]",Unhelpful
103,3370,"BEST The first time I read this book was at the age of 13. My mother had an old, worn, 1950s copy of the book on our bookshelf, and one day I just decided to give it a try. I think the first 50 pages were a bit of a mystery to me. I remember struggling with the language at the beginning, and having to look up several words in the dictionary. After 100 pages, I had finally ""learnt the language,"" and managed to at last translate it into my native tongue. I enjoyed it, and over the years I returned to it again and again. Now, at the age of 35, I practically have numerous passages memorized, and reading the book is no longer a culture shock. Every word of this novel is an old friend, and I wouldn't remove a single one from the text. This is the book I turn to when I'm in the mood to read, but not in the mood to read something new.Every time I read it, I am struck by its relevance. I don't mean the morals or the need to marry well in order to win approval from society (although I think ...",1,"[-0.21256766, -1.6528965]",Unhelpful


In [0]:
# Look at some false negatives that were 1-star reviews (text prefixed with 'WORST')
fndf[fndf.prepReviewText.str.contains('WORST')].sample(5)

Unnamed: 0,index,prepReviewText,most_helpful,probas,judgment
491,754,WORST This is a short story with unconvincing characters and no plot that got stretched out to novella length. I was bored to tears reading it.,1,"[-0.11983059, -2.1809936]",Unhelpful
160,722,"WORST I can't believe this is a best-seller. Honestly, I can't believe it ever made it to print. Read Kushiel's Dart instead.It is very badly written, repetetive, poorly edited (if edited at all), boring ... the only thing I can think of, is that this was written by an illiterate for illiterates. It shows badly on our world and our education system that it is a best-seller. I suspect there is a higher literacy rate among the feral cat colony outside than among her readership. Still, she is laughing all the way to the bank, isn't she?There are enough reviews that give details. I just want to cast a surprisingly minority vote for absolutely awful. I admit, I couldn't finish it. I did try, but this is god-awful.I really don't get the appeal of badly-written, soccer-mom porn.I'm going to feed the cats now. The back of the can of cat food is better-written than this.",1,"[-0.49281728, -0.94392633]",Unhelpful
80,188,WORST The story line and the first few pages caught my attention but after getting into the book a bit more I found myself putting it down and deleting it from my kindle. The writing lacks.,1,"[-0.05708686, -2.8915882]",Unhelpful
252,208,"WORST What can I say? Never read a Nora Roberts book that I liked. So I'm the dummy for keepin' on tryin'. I keep thinking that there must be something here simply because she sells so many books, but only goes to show you that there are, apparently, a lot of people who think differently than I do. I wish someone could explain the allure of her books to me. Sorry to be so negative, but I seem to be totally missing something.",1,"[-0.18930922, -1.7575356]",Unhelpful
187,385,"WORST The reason most women find this book exciting, is because the psychology used in this book is what attracts women to men and keeps them interested.It does NOT work the other way around, not for the long term at least. Its a completely different set of emotional priciples that keep men hanging in there for the long haul.Playing aloof to the guy who finally sets your heart on fire, is a sure way to put that fire out. Not to mention all the wonderful emotional thrills you would lose out on, which is reason to be in a relationship in the first place.The only reason to get this book is if:1) You are a total doormat.2) You want a guy to be your doormat, in which case you need to find a total wimp. But ask yourself is this the kind of guy who pulls your heartstrings?3) You want to have an open realtionship (seeing other people outside of the realtionship).So be Bitch at your own risk---you may well lose your soul mate when you finally meet him.",1,"[-0.16153044, -1.9027398]",Unhelpful
