# Amazon Reviews - Prediction of Rating and Helpfulness (an NLP Use Case)
## Machine Learning Scaling

In this notebook, I scale up the machine learning models for predicting the rating and helpfulness of Amazon reviews that I prototyped in the notebook `ml_model_prototyping.ipynb`.

## Table of Contents

* [Text Preprocessing](#tp)
* [Doc2Vec](#dv)
* [ML Scaling: Rating Prediction](#rp)
  * [Dealing with Class Imbalance](#ci)
  * [Logistic Regression](#ilr)
* [ML Scaling: Helpfulness Prediction](#hp)
  * [Linear Regression](#lrg)
* [Making Predictions on New Data](#nd)
* [Conclusion](#cl)

In [4]:
# Time the notebook
import time
start_time = time.time()

In [5]:
# Define the schema of the dataframe to be created
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType, DateType

schema = StructType([
      StructField('marketplace', StringType()),
      StructField('customer_id', StringType()),
      StructField('review_id', StringType()),
      StructField('product_id', StringType()),
      StructField('product_parent', StringType()),
      StructField('product_title', StringType()),
      StructField('product_category', StringType()),
      StructField('star_rating', IntegerType()),
      StructField('helpful_votes', IntegerType()),
      StructField('total_votes', IntegerType()),
      StructField('vine', StringType()),
      StructField('verified_purchase', StringType()),
      StructField('review_headline', StringType()),
      StructField('review_body', StringType()),
      StructField('review_date', DateType())
])

# Read Amazon review files from Amazon S3
review_df = (sqlContext.read.format('com.databricks.spark.csv')
             .schema(schema)
             .option("inferSchema", False)
             .option('delimiter', '\t')
             .option("header", True)
             .load("/mnt/mount_1/tsv/amazon_reviews_us*.gz" ))

In [6]:
review_df = (review_df.filter("verified_purchase = 'Y'").filter("star_rating is not NULL")
                      .filter("review_date is not NULL").filter("helpful_votes is not NULL")
                      .filter("total_votes is not NULL").filter("review_headline is not NULL")
                      .filter("review_body is not NULL").filter("review_id is not NULL"))

In [7]:
# sample_frac = 0.075
# review_df = review_df.sample(False, sample_frac, 42)
review_df.cache()

<a id='tp'></a>
## Text Preprocessing

In [9]:
# Subset required columns
review_rating_vote_df = review_df.select('review_body', 'star_rating', 'helpful_votes', 'total_votes')

The following cleaning steps are applied to the review text:
- Lowercasing
- Stripping HTML tags
- Stripping punctuation
- Collapsing repeated whitespace
- Stripping numbers
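The steps listed above can be approximated with the standard library alone. The sketch below is not the gensim pipeline this notebook uses; `clean_review` and its regular expressions are illustrative assumptions that mimic the same steps on a single string.

```python
import re

def clean_review(text):
    """Approximate the cleaning steps using only the standard library."""
    text = text.lower()                     # lowercase
    text = re.sub(r"<[^>]+>", " ", text)    # strip HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)   # strip punctuation and digits
    text = re.sub(r"\s+", " ", text)        # collapse repeated whitespace
    return text.strip()

print(clean_review("<br />Great phone!!  Lasted 2 days, 100% worth it."))
# prints: great phone lasted days worth it
```

Replacing unwanted characters with a space rather than an empty string keeps neighbouring words from being glued together.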

In [11]:
import gensim.parsing.preprocessing as gsp
from pyspark.sql.functions import udf
from gensim import utils
import re

# Perform following cleaning tasks on each review
cleaning_tasks = [
           gsp.strip_tags, 
           gsp.strip_punctuation,
           gsp.strip_multiple_whitespaces,
           gsp.strip_numeric
          ]

def text_preprocessing(df_row):
  '''Takes a row whose first field is the review text and returns the row with that text cleaned for NLP'''
  review_txt = df_row[0]
  review_txt = review_txt.lower()
  review_txt = utils.to_unicode(review_txt)
  for task in cleaning_tasks:
      review_txt = task(review_txt)
  review_txt = re.sub(r'[^a-zA-Z\s]', "", review_txt)
  return (review_txt, df_row[1], df_row[2], df_row[3])

In [12]:
clean_review_rating_vote_df = review_rating_vote_df.rdd.map(lambda x : text_preprocessing(x)).toDF()

In [13]:
# Rename the columns as _1, _2, _3 and _4 are not descriptive
clean_review_rating_vote_df = (clean_review_rating_vote_df.withColumnRenamed("_1", "review_body")
                                  .withColumnRenamed("_2", "star_rating")
                                  .withColumnRenamed("_3", "helpful_votes")
                                  .withColumnRenamed("_4", "total_votes"))

#### Positive, Negative and Neutral Class Based on Star Rating
The classifier predicts whether a review is positive, negative, or neutral. Star ratings of 4 and 5 are treated as positive, 3 as neutral, and 1 and 2 as negative.

In [15]:
from pyspark.sql.functions import when
clean_review_rating_vote_df = (clean_review_rating_vote_df.withColumn("review_category", 
                                                when(clean_review_rating_vote_df.star_rating.isin(5, 4), 'positive')
                                                .when(clean_review_rating_vote_df.star_rating.isin(1, 2), 'negative')
                                                .otherwise('neutral')))

<a id='dv'></a>
## Doc2Vec
Apache Spark does not provide an API for Doc2Vec, but its Word2Vec transformer, which is based on the skip-gram approach, can be used in its place. As the documentation puts it, "The Word2VecModel transforms each document into a vector using the average of all words in the document" ([Apache Spark Documentation](https://spark.apache.org/docs/latest/ml-features.html#word2vec)).
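To make the averaging idea concrete, here is a toy sketch in plain Python. It is not Spark's implementation: the two-dimensional `toy_vectors` are made up, and skipping unknown words (returning `None` when none are known) is an assumption of this sketch.

```python
def average_vectors(tokens, word_vectors):
    """Average the vectors of the tokens that have an embedding."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None  # assumption: no embedding available for this document
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Hypothetical 2-dimensional word vectors, for illustration only
toy_vectors = {"great": [1.0, 0.0], "phone": [0.0, 1.0], "bad": [-1.0, 0.0]}

print(average_vectors(["great", "phone"], toy_vectors))  # [0.5, 0.5]
```

Because the document vector is a plain average, word order is lost; two reviews with the same words in a different order get the same features.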

In [17]:
from pyspark.ml.feature import Word2Vec
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer

# Tokenize review text
tokenizer = Tokenizer(inputCol="review_body", outputCol="tokens")
# Learn 300-dimensional word vectors; each review is embedded as their average
word2vec = Word2Vec(vectorSize=300, minCount=0, inputCol="tokens", outputCol="features")
doc2vec_pipeline = Pipeline(stages=[tokenizer, word2vec])
doc2vec_model = doc2vec_pipeline.fit(clean_review_rating_vote_df)
doc2vec_df = doc2vec_model.transform(clean_review_rating_vote_df)

<a id='rp'></a>
## ML Scaling: Rating Prediction

In [19]:
from pyspark.ml.feature import StringIndexer

# Encode the target label
string_indexer = StringIndexer(inputCol="review_category", outputCol="label")
doc2vec_df_encoded = string_indexer.fit(doc2vec_df).transform(doc2vec_df)
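By default, `StringIndexer` orders labels by descending frequency, so the most common category receives index 0. The plain-Python sketch below imitates that ordering on made-up labels; `frequency_desc_index` is a hypothetical helper, and breaking ties alphabetically is an assumption of the sketch.

```python
from collections import Counter

def frequency_desc_index(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up label column with a skewed distribution
toy_labels = ["positive"] * 8 + ["negative"] * 2 + ["neutral"] * 1
print(frequency_desc_index(toy_labels))
```

With a distribution skewed like this one, positive maps to 0, negative to 1 and neutral to 2, which is the order any code that decodes numeric predictions has to assume.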

In [20]:
# Split the data into train and test set
train_set, test_set = doc2vec_df_encoded.randomSplit([0.7, 0.3], seed=100)

In [21]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics
import pandas as pd

# A function to get performance metrics
def print_performance_metrics(predictions):
  # Get accuracy of the model
  model_evaluator = MulticlassClassificationEvaluator(
      labelCol="label", predictionCol="prediction", metricName="accuracy")
  accuracy = model_evaluator.evaluate(predictions)
  print("Accuracy: {:.3f}\n".format(accuracy))

  # get rdd of predictions and labels for eval metrics
  predictionAndLabels = predictions.select("prediction","label").rdd

  # Instantiate metrics objects
  multi_metrics = MulticlassMetrics(predictionAndLabels)
  # Get confusion matrix
  cm = multi_metrics.confusionMatrix()
  print("Confusion Matrix:")
  print(cm)
  print("\nConfusion Matrix as a Pandas DataFrame:")
  print(pd.DataFrame(cm.toArray().tolist(), columns=['predicted_pos', 'predicted_neg', 'predicted_neu'], index=['actual_pos', 'actual_neg', 'actual_neu']))
  print("\nFraction of positive reviews correctly predicted as positive (recall): {:.3f}".format(cm[0,0]/(cm[0,0] + cm[0,1] + cm[0,2])))
  print("\nFraction of negative reviews correctly predicted as negative (recall): {:.3f}".format(cm[1,1]/(cm[1,0] + cm[1,1] + cm[1,2])))
  print("\nFraction of neutral reviews correctly predicted as neutral (recall): {:.3f}".format(cm[2,2]/(cm[2,0] + cm[2,1] + cm[2,2])))

<a id='ci'></a>
### Dealing with Class Imbalance

In [23]:
# Target class distribution
review_category_df = clean_review_rating_vote_df.groupby('review_category').count().toPandas()
review_category_df['percentage'] = round(review_category_df['count']/review_category_df['count'].sum()*100)
review_category_df

|   | review_category | count | percentage |
|---|---|---|---|
| 0 | positive | 703314 | 80.0 |
| 1 | neutral | 70825 | 8.0 |
| 2 | negative | 108033 | 12.0 |


One of the problems with this dataset is class imbalance. The class distribution is as follows.

- Positive - 80%
- Negative - 12%
- Neutral - 8%

The positive class is far larger than the negative and neutral classes. This imbalance hurts the model's ability to correctly predict examples of the rare classes. To counter this, PySpark's logistic regression supports class weighting through its `weightCol` parameter: each class is given a weight inversely proportional to its frequency.
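As a rough sketch of inverse-frequency weighting, the snippet below normalises the inverse class counts so that the weights sum to 1. It uses the full-dataset counts from the table above purely for illustration; the notebook itself derives the weights from the training split, and `inverse_frequency_weights` is a hypothetical helper name.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by 1/count, normalised so the weights sum to 1."""
    inverse = {label: 1.0 / count for label, count in class_counts.items()}
    normaliser = sum(inverse.values())
    return {label: value / normaliser for label, value in inverse.items()}

# Full-dataset counts, used here only as an example
counts = {"positive": 703314, "negative": 108033, "neutral": 70825}
weights = inverse_frequency_weights(counts)
for label, weight in weights.items():
    print("{}: {:.3f}".format(label, weight))
```

The rarest class (neutral) ends up with the largest weight, so a misclassified neutral review costs the optimiser more than a misclassified positive one.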

<a id='ilr'></a>
### Logistic Regression

#### Class Weight Estimation

In [27]:
# Get the count of review categories
train_count = train_set.groupby("review_category").count().toPandas()

In [28]:
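# Get the class weight based on its frequency.
# Look up counts by category name, since groupby() does not guarantee row order.
counts = train_count.set_index('review_category')['count']
pos_count = int(counts['positive'])
neu_count = int(counts['neutral'])
neg_count = int(counts['negative'])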

inv_pos = 1/pos_count
inv_neg = 1/neg_count
inv_neu = 1/neu_count

pos_weight = 1/(pos_count * (inv_pos +  inv_neg + inv_neu))
neg_weight = 1/(neg_count * (inv_pos +  inv_neg + inv_neu))
neu_weight = 1/(neu_count * (inv_pos +  inv_neg + inv_neu))
# Print class weights
(pos_weight, neg_weight, neu_weight)

In [29]:
# create a classWeight column in trainset
train_set=(train_set.withColumn("classWeight", when(train_set.review_category == 'positive', pos_weight)
                          .when(train_set.review_category == 'negative', neg_weight)
                          .otherwise(neu_weight)))

In [30]:
# Fit the model and get the predictions for test set
from pyspark.ml.classification import LogisticRegression
# Instantiate a logistic regression classifier. Default parameters are used as parameter tuning did not improve performance.
lr = LogisticRegression(labelCol="label", featuresCol="features", weightCol='classWeight')
# Fit train set
lrModel = lr.fit(train_set)
# Save the model
lrModel.write().overwrite().save('dbfs:/FileStore/lr_model')
# Predict test set
lr_predictions = lrModel.transform(test_set)
# Print performance metrics of logistic regression
print_performance_metrics(lr_predictions)

<a id='hp'></a>
## ML Scaling: Helpfulness Prediction

#### Helpfulness index
The absolute number of helpful votes is not suitable for comparison because the number of total votes varies across reviews. For instance, a review with 15 helpful votes out of 10,000 total votes is arguably less helpful than a review with 10 helpful votes out of 11 total votes. To deal with this, a new helpful index column is created by dividing helpful votes by total votes.

#### Subset Reviews with More Than Five Total Votes
Another issue is that most of the reviews have 0 total votes. Furthermore, a very low number of total votes may bias the analysis. For instance, if a review has only one total vote and that vote is helpful, its helpful index works out to 1, which may or may not be reliable. To circumvent this problem, only reviews with more than 5 total votes are considered for training.
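The two rules above (ratio of helpful to total votes, and a minimum of more than five total votes) can be sketched in plain Python as follows. `helpful_index` and its `min_votes` parameter are hypothetical names used only for this illustration.

```python
def helpful_index(helpful_votes, total_votes, min_votes=5):
    """Return helpful/total, or None when there are too few votes to trust."""
    if total_votes <= min_votes:
        return None  # excluded from training
    return helpful_votes / total_votes

print(helpful_index(15, 10000))  # many votes, few of them helpful
print(helpful_index(10, 11))     # few votes, most of them helpful
print(helpful_index(1, 1))       # too few votes: None
```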

In [34]:
doc2vec_helpful_df = doc2vec_df.filter(doc2vec_df["total_votes"] > 5)

In [35]:
doc2vec_helpful_df = (doc2vec_helpful_df.
                    withColumn('helpful_index', doc2vec_helpful_df.helpful_votes/doc2vec_helpful_df.total_votes))


In [36]:
# Split the data into train and test set
train_set_h, test_set_h = doc2vec_helpful_df.randomSplit([0.7, 0.3], seed=100)

<a id='lrg'></a>
### Linear Regression

In [38]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Instantiate a linear regression model
lrg = LinearRegression(labelCol="helpful_index", featuresCol="features", maxIter=1000, regParam=0.5)
# Fit training Data
lrgModel = lrg.fit(train_set_h)

# Save the model
lrgModel.write().overwrite().save('dbfs:/FileStore/lrg_model')

# Get prediction
lrgPredictions = lrgModel.transform(test_set_h)

# Print evaluation metrics
# Print rmse
rmse = RegressionEvaluator(metricName="rmse", labelCol=lrg.getLabelCol(), predictionCol=lrg.getPredictionCol()).evaluate(lrgPredictions)
print ("RMSE on the test set: {:.3f}".format(rmse))

# Print R2
r2 = RegressionEvaluator(metricName="r2", labelCol=lrg.getLabelCol(), predictionCol=lrg.getPredictionCol()).evaluate(lrgPredictions)
print ("R2 on the test set: {:.3f}".format(r2))

In [39]:
# What would be the R2, if we just predict mean of the helpful_index for all reviews.
mean_help_df = test_set_h.select('helpful_index')
from pyspark.sql.functions import lit
mean_ = mean_help_df.groupBy().avg("helpful_index").take(1)[0][0]
mean_help_df = mean_help_df.withColumn("mean_helpful_index", lit(mean_))
r2_mean = RegressionEvaluator(metricName="r2", labelCol='helpful_index', predictionCol='mean_helpful_index').evaluate(mean_help_df)
print("If we just predict the mean of the helpful_index for all reviews, r2 would be: {}".format(r2_mean))

<a id='nd'></a>
## Making Predictions on New Data
In this section, I predict the star rating and helpful index of a few Amazon reviews that were not part of the train or test set used in this study, to get an idea of how the models perform on new data. These are very recent reviews from the Amazon website.

#### Rating Prediction

In [42]:
# List of new reviews taken from Amazon website
new_reviews = ["Lately I've been having trouble with my cell phone so I decided to give a different company a try. I love this phone! It has a decent battery life, takes great pictures, has quite a bit of volume, I had to turn it down. I selected the one day shipping through Amazon Prime, and it arrived early the following morning 100% charged. All I had to do was move my sim cards over and it was ready to go. It found my provider easily (wind). The only thing that made me laugh is the beautiful rose gold it comes in, with a black case to cover it! I am glad I tried this company, they are doing it right!", 
               "This bag is everything! I was able to pack three pairs of pants of jeans, two dresses, four shirts, night clothes, underwear, an extra pair of shoes and toiletries with some room to spare! Definitely worth the money.", 
               "I got this as a Christmas gift and use it almost daily! From making soups, to homemade nut butters, hummus, veggie burgers... The possibilities are endless! The blade works wonderfully and cleaning the appliance is a breeze. Amazing product for the price!!!",
               "I love this sooo much it is perfect for making tortillas and pancakes. It heats up quickly and it is super easy to clean. I love that you can actually wash the whole thing. The heat is distributed evenly as well. One of my favorite in the kitchen!",
               "My 13 year old son saved up birthday and Christmas money for this Microscope to replace a lower powered one. He is extremely happy with the magnification level and clarity. Also he is happy with how the adjustment knobs move the slides around with fine control. He made a stop-motion animation to show the moving parts of the microscope's key features:",
               
               "I purchased this microscope a year ago and have used it probably a total of 5 hrs. It worked well except that the fine focus control loses traction and therefore isn't very effective. Recently the LED light started intermittently turning off and on and then has now completely stopped working. The company is taking it back for repair but I have to pay the shipping costs.",
               "The ceramic is nice. Really upset though because the plug in has to be wiggled and pushed in just perfectly for it to work ",
               "This works well as a pressure cooker but is not as effective in the slow cooker mode. I've increased the time and temperature using several tried and true recipes (used in traditional slow cooker) but have been disappointed in the results.",
               "We like the looks of the unit. It fits perfectly in our built-in wet bar. It cools the wine but you cannot put champagne bottles in it. We do not like the noise nor the temperature zones because they are never correct. Otherwise, it is okay.",
               "Tv is ok. Not perfect but ok. For some reason I got tv without remote, power cord and screws for a stand which for i have to wait now around a week. So my tv is standing there without even trying it. Hisense support refused to send this overnight so therefore 3 stars. I'm disappointed.",
              
              
              "On first use, the inside strap tore. Not even sure how it happened - bag was not overpacked or used roughly. The strap attaches an inner compartment to the luggage. Very disappointed. Poor quality and expected more. Wary of buying AmazonBasics branded items now!",
               "My 30 year old Black and Decker food processor was too small for some of my recipes. Based on the reviews this seemed to fit the bill. Major disappointment. First it was aggravating because when it is used full it will overflow through the middle area....mess. Second, after only 1 year and ~20 uses the on/off switch will not turn. Had to pull out my trusty 30 year old Black and Decker to the rescue in the middle of a recipe.",
               "Absolutely awful! I had to send my first one back because it cracked and I used my replacement one for the first time and it broke while slicing a potato. Not only did it break while cutting a potato, that was sliced into 4 by the way, it also had the piece fly off the machine and cut my leg in the process. Save your money and buy something that is made with quality materials.",
               "Horrible! Became NOT Non-stick after first use. It is not easy to clean and whatever sticks to it stays. Bought a better one at Biglots for cheaper. Waste of money",
               "Very disappointed with this microscope! Does not do what it promised and very hard to get into focus what you're trying to see, almost like its not level. Once you finally do get it into focus it doesn't stay in focus. Unfortunately the time period for being able to return it closed before I even had the chance to use it."]
new_reviews = [{'review_body':review} for review in new_reviews]
import json
new_reviews = [json.dumps(new_reviews)]
# Convert new reviews to pyspark dataframe
new_reviews_df = sc.parallelize(new_reviews)
new_reviews_df = spark.read.json(new_reviews_df)
# new_reviews_df.limit(2).toPandas()

In [43]:
# Embed the new reviews with the Doc2Vec pipeline fitted on the training corpus.
# The pipeline must not be refitted here: Word2Vec vectors learned from these few
# reviews would not live in the feature space the saved models were trained on.
doc2vec_new_reviews_df = doc2vec_model.transform(new_reviews_df)
# doc2vec_new_reviews_df.limit(2).toPandas()

#### Rating Prediction on New Reviews

In [45]:
pd.set_option('display.max_colwidth', None)  # show full review text; pandas < 1.0 expects -1 instead
# Make predictions for new reviews
prediction_df_lr = lrModel.transform(doc2vec_new_reviews_df).toPandas()
# A function to convert numeric label to string label
def convert_label(x):
  if x == 0:
    return "Positive"
  if x == 1:
    return "Negative"
  return "Neutral"
    
prediction_df_lr['prediction'] = prediction_df_lr['prediction'].apply(convert_label)
prediction_df_lr.columns = ['Review Text', 'tokens', 'features', 'rawPrediction', 'probability', 'Predicted Rating']
prediction_df_lr['Actual Rating'] = ['positive'] * 5 + ['neutral'] * 5 + ['negative'] * 5 
prediction_df_lr[['Review Text', 'Predicted Rating', 'Actual Rating']]

|   | Review Text | Predicted Rating | Actual Rating |
|---|---|---|---|
| 0 | Lately I've been having trouble with my cell phone so I decided to give a different company a try. I love this phone! It has a decent battery life, takes great pictures, has quite a bit of volume, I had to turn it down. I selected the one day shipping through Amazon Prime, and it arrived early the following morning 100% charged. All I had to do was move my sim cards over and it was ready to go. It found my provider easily (wind). The only thing that made me laugh is the beautiful rose gold it comes in, with a black case to cover it! I am glad I tried this company, they are doing it right! | Positive | positive |
| 1 | This bag is everything! I was able to pack three pairs of pants of jeans, two dresses, four shirts, night clothes, underwear, an extra pair of shoes and toiletries with some room to spare! Definitely worth the money. | Positive | positive |
| 2 | I got this as a Christmas gift and use it almost daily! From making soups, to homemade nut butters, hummus, veggie burgers... The possibilities are endless! The blade works wonderfully and cleaning the appliance is a breeze. Amazing product for the price!!! | Positive | positive |
| 3 | I love this sooo much it is perfect for making tortillas and pancakes. It heats up quickly and it is super easy to clean. I love that you can actually wash the whole thing. The heat is distributed evenly as well. One of my favorite in the kitchen! | Positive | positive |
| 4 | My 13 year old son saved up birthday and Christmas money for this Microscope to replace a lower powered one. He is extremely happy with the magnification level and clarity. Also he is happy with how the adjustment knobs move the slides around with fine control. He made a stop-motion animation to show the moving parts of the microscope's key features: | Positive | positive |
| 5 | I purchased this microscope a year ago and have used it probably a total of 5 hrs. It worked well except that the fine focus control loses traction and therefore isn't very effective. Recently the LED light started intermittently turning off and on and then has now completely stopped working. The company is taking it back for repair but I have to pay the shipping costs. | Negative | neutral |
| 6 | The ceramic is nice. Really upset though because the plug in has to be wiggled and pushed in just perfectly for it to work | Neutral | neutral |
| 7 | This works well as a pressure cooker but is not as effective in the slow cooker mode. I've increased the time and temperature using several tried and true recipes (used in traditional slow cooker) but have been disappointed in the results. | Neutral | neutral |
| 8 | We like the looks of the unit. It fits perfectly in our built-in wet bar. It cools the wine but you cannot put champagne bottles in it. We do not like the noise nor the temperature zones because they are never correct. Otherwise, it is okay. | Neutral | neutral |
| 9 | Tv is ok. Not perfect but ok. For some reason I got tv without remote, power cord and screws for a stand which for i have to wait now around a week. So my tv is standing there without even trying it. Hisense support refused to send this overnight so therefore 3 stars. I'm disappointed. | Neutral | neutral |


#### Performance of the model on Amazon reviews that were not part of the Amazon review dataset
I predicted the rating of five positive, five neutral, and five negative reviews taken directly from the Amazon website. The model correctly classified all five positive reviews (100%), four of the five neutral reviews (80%), and all five negative reviews (100%).

#### Helpful Index Prediction on New Reviews

In [48]:
# Linear regression
# Make predictions for new reviews
prediction_df_lrg = lrgModel.transform(doc2vec_new_reviews_df).toPandas()
prediction_df_lrg.columns = ['Review Text', 'tokens', 'features', 'Predicted Helpful Index']
prediction_df_lrg['Predicted Helpful Index'] = round(prediction_df_lrg['Predicted Helpful Index'], 2)
prediction_df_lrg[['Review Text', 'Predicted Helpful Index']]

|   | Review Text | Predicted Helpful Index |
|---|---|---|
| 0 | Lately I've been having trouble with my cell phone so I decided to give a different company a try. I love this phone! It has a decent battery life, takes great pictures, has quite a bit of volume, I had to turn it down. I selected the one day shipping through Amazon Prime, and it arrived early the following morning 100% charged. All I had to do was move my sim cards over and it was ready to go. It found my provider easily (wind). The only thing that made me laugh is the beautiful rose gold it comes in, with a black case to cover it! I am glad I tried this company, they are doing it right! | 0.75 |
| 1 | This bag is everything! I was able to pack three pairs of pants of jeans, two dresses, four shirts, night clothes, underwear, an extra pair of shoes and toiletries with some room to spare! Definitely worth the money. | 0.9 |
| 2 | I got this as a Christmas gift and use it almost daily! From making soups, to homemade nut butters, hummus, veggie burgers... The possibilities are endless! The blade works wonderfully and cleaning the appliance is a breeze. Amazing product for the price!!! | 0.82 |
| 3 | I love this sooo much it is perfect for making tortillas and pancakes. It heats up quickly and it is super easy to clean. I love that you can actually wash the whole thing. The heat is distributed evenly as well. One of my favorite in the kitchen! | 0.86 |
| 4 | My 13 year old son saved up birthday and Christmas money for this Microscope to replace a lower powered one. He is extremely happy with the magnification level and clarity. Also he is happy with how the adjustment knobs move the slides around with fine control. He made a stop-motion animation to show the moving parts of the microscope's key features: | 0.85 |
| 5 | I purchased this microscope a year ago and have used it probably a total of 5 hrs. It worked well except that the fine focus control loses traction and therefore isn't very effective. Recently the LED light started intermittently turning off and on and then has now completely stopped working. The company is taking it back for repair but I have to pay the shipping costs. | 0.77 |
| 6 | The ceramic is nice. Really upset though because the plug in has to be wiggled and pushed in just perfectly for it to work | 0.81 |
| 7 | This works well as a pressure cooker but is not as effective in the slow cooker mode. I've increased the time and temperature using several tried and true recipes (used in traditional slow cooker) but have been disappointed in the results. | 0.79 |
| 8 | We like the looks of the unit. It fits perfectly in our built-in wet bar. It cools the wine but you cannot put champagne bottles in it. We do not like the noise nor the temperature zones because they are never correct. Otherwise, it is okay. | 0.8 |
| 9 | Tv is ok. Not perfect but ok. For some reason I got tv without remote, power cord and screws for a stand which for i have to wait now around a week. So my tv is standing there without even trying it. Hisense support refused to send this overnight so therefore 3 stars. I'm disappointed. | 0.73 |


<a id='cl'></a>
## Conclusion

I used logistic regression to predict the star-rating category of Amazon reviews. The performance of this model is as follows:

- Accuracy: 73.5%
- Positive class recall: 75.4%
- Negative class recall: 70.3%
- Neutral class recall: 59.4%

I used linear regression to predict helpfulness. It produced an R2 of 0.175, meaning the model explains 17.5% of the variance in the helpful index (helpful votes / total votes). Although this performance is poor, it is still better than a zero-information predictor: always predicting the mean of helpful_index gives an R2 no better than zero (negative in this run).
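The baseline comparison can be checked by hand from the definition R2 = 1 - SS_res / SS_tot. The sketch below uses made-up helpful-index values and a hypothetical `r2_score` helper. It shows that a constant prediction equal to the mean of the evaluated values scores zero, and that any other constant scores below zero.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up helpful-index values, for illustration only
actual = [0.2, 0.5, 0.8, 0.9]
mean_value = sum(actual) / len(actual)

r2_own_mean = r2_score(actual, [mean_value] * len(actual))   # constant = own mean
r2_other_constant = r2_score(actual, [0.5] * len(actual))    # any other constant
print(r2_own_mean, r2_other_constant)
```

Any model with a positive R2 is therefore extracting at least some signal beyond what a constant prediction can offer.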

In [50]:
end_time = time.time()
print('It took {:.1f} minutes to run this notebook.'.format((end_time - start_time)/60))