# Can we predict the usefulness of an Amazon review?

On Amazon.com, buyers can review products. For prospective buyers, these reviews offer helpful insights into a product's strengths and possible deficiencies. For Amazon, they are a formidable selling tool: reviews attract users to the Amazon website and thus drive Amazon's sales.

The more helpful and interesting the reviews, the more prospective buyers will use the Amazon website. It is therefore in Amazon's interest to detect useful reviews and make them stand out in the web interface.

Currently, Amazon uses a user voting system where buyers and prospective buyers can upvote a review for its usefulness. But this introduces a lag between the moment the review is published and the moment when enough users have voted and the review is promoted.

So, can we speed up this process using machine learning? Our project attempts to do so.

## About the notebook

This notebook contains Python code that parses a dataset of Amazon reviews using Spark and predicts their usefulness score using machine learning.

## Current implementation

Because we run the code on a cluster in production but in a notebook during development, we need a way to detect which libraries to import. The following snippet does so.

In [1]:
# Detect whether we're running inside the dev notebook
try:
    get_ipython  # this name is only defined under IPython/Jupyter
    notebook = True
except NameError:
    notebook = False

Here are all our imports.

In [2]:
if notebook:
    import findspark
    findspark.init()

from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark import SparkContext

from pyspark.ml.feature import HashingTF, IDF, StringIndexer, Tokenizer, Word2Vec
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

from datetime import datetime  
from datetime import timedelta

Now, we load the dataset using Spark.

In [3]:
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

if notebook:
    dataFile = 'sample_us.tsv'
else:
    dataFile = 'hdfs:///datasets/amazon_multiling/tsv/amazon_reviews_us*tsv.gz'

schema = StructType([
    StructField('marketplace', StringType()),
    StructField('customer_id', IntegerType()),
    StructField('review_id', StringType()),
    StructField('product_id', StringType()),
    StructField('product_parent', IntegerType()),
    StructField('product_title', StringType()),
    StructField('product_category', StringType()),
    StructField('star_rating', IntegerType()),
    StructField('helpful_votes', IntegerType()),
    StructField('total_votes', IntegerType()),
    StructField('vine', StringType()),
    StructField('verified_purchase', StringType()),
    StructField('review_headline', StringType()),
    StructField('review_body', StringType()),
    StructField('review_date', DateType()),
])

df = spark.read.csv(dataFile, sep="\t", header=True, schema=schema)

Basic data cleaning: let's get rid of incomplete entries right away and duplicate the `helpful_votes` column as `label` in preparation for the machine-learning part.

In [4]:
df = df.na.drop()
df = df.selectExpr("helpful_votes as label", "*")

Now, we reduce the dataset size by keeping only the reviews of products that have at least `x_core` reviews.

In [5]:
if notebook:
    x_core = 1 # number of reviews a product must have
else:
    x_core = 5

# This query returns the IDs of products with at least x_core reviews
# (its row count is therefore the number of such products)
query1 = '''
    SELECT product_id
    FROM df
    GROUP BY product_id
    HAVING COUNT(*) >= %s
''' % x_core

# This query returns the rows for reviews for products with at least x reviews
query2 = '''
SELECT *
FROM df
WHERE product_id IN ({})
'''.format(query1)

df.createOrReplaceTempView("df")  # replaces registerTempTable, deprecated since Spark 2.0
df = spark.sql(query2)

Here is the output of `query1` for some values of `x_core`:
    
- Number of 1-core reviews: 21390118
- Number of 2-core reviews: 10213901
- Number of 3-core reviews:  6931152
- Number of 4-core reviews:  5318037
- Number of 5-core reviews:  4342875
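
To make the filtering step concrete, here is a minimal plain-Python sketch of the same idea on a toy list of reviews. It is only an illustration: the `reviews` data and the `x_core_filter` helper are made up and are not part of the notebook. It counts the reviews per product, then keeps only the reviews whose product reaches the threshold.

```python
from collections import Counter


def x_core_filter(reviews, x_core):
    """Keep the reviews whose product has at least x_core reviews."""
    # Counterpart of query1: count the reviews of each product_id
    counts = Counter(r["product_id"] for r in reviews)
    # Counterpart of query2: keep the rows whose product passes the threshold
    return [r for r in reviews if counts[r["product_id"]] >= x_core]


# Hypothetical toy data: product "A" has 3 reviews, product "B" has 1
reviews = [
    {"product_id": "A", "helpful_votes": 2},
    {"product_id": "A", "helpful_votes": 0},
    {"product_id": "B", "helpful_votes": 5},
    {"product_id": "A", "helpful_votes": 1},
]

kept = x_core_filter(reviews, x_core=2)
print(len(kept))  # 3
```

With `x_core=1` every review is kept, which is why the notebook uses that value on the small development sample.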

Let's now use machine learning to attempt to predict each review's `helpful_votes`. First we split the dataset into `train`, `validation` and `test` sets.

In [6]:
(train_set, val_set, test_set) = df.randomSplit([0.90, 0.05, 0.05], seed = 0)
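
As a rough illustration of what a weighted random split does, here is a plain-Python sketch. It is not Spark's actual implementation, and `random_split` is a hypothetical helper. Each row draws a random number and lands in the split whose weight range contains it, so the split sizes are only approximately proportional to the weights.

```python
import random


def random_split(rows, weights, seed=0):
    """Assign each row to one split with probability proportional to its weight."""
    total = sum(weights)
    # Cumulative upper bounds of each split within [0, 1]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding leftovers
            if x < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train, val, test = random_split(range(10000), [0.90, 0.05, 0.05], seed=0)
print(len(train), len(val), len(test))  # roughly 9000, 500 and 500
```

Fixing the seed makes the split reproducible, which matters when comparing models across runs.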

We will create features from the reviews' text content using the TF-IDF method.

In [7]:
tokenizer = Tokenizer(inputCol="review_body", outputCol="words")
hashtf    = HashingTF(numFeatures=2**16, inputCol="words", outputCol='tf')
idf       = IDF(inputCol='tf', outputCol="features", minDocFreq=5) #minDocFreq: remove sparse terms
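
The sketch below illustrates the TF-IDF weighting in plain Python on a toy corpus. It is not the Spark implementation: it keys features by the term itself, whereas `HashingTF` hashes each term into one of `numFeatures` buckets, so distinct terms may collide. The IDF formula, log((N + 1) / (df + 1)), and the handling of `minDocFreq` follow what the Spark MLlib documentation describes. The `tf_idf` helper and the sample sentences are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs, min_doc_freq=0):
    """Return one {term: tf * idf} dict per document."""
    # Like Spark's Tokenizer: lowercase, then split on whitespace
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))
    # Terms seen in fewer than min_doc_freq documents get a weight of 0
    idf = {
        term: math.log((n_docs + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for term, df in doc_freq.items()
    }
    return [
        {term: count * idf[term] for term, count in Counter(words).items()}
        for words in tokenized
    ]


docs = ["great product", "great price", "broken product arrived late"]
vectors = tf_idf(docs)
filtered = tf_idf(docs, min_doc_freq=2)
print(vectors[1])   # "price" (1 document) weighs more than "great" (2 documents)
print(filtered[1])  # "price" is now zeroed out because it is too rare
```

Rare terms get a higher weight than common ones, unless they are so rare that `minDocFreq` discards them.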

Our current model is a linear regression

In [8]:
lr = LinearRegression(maxIter=100, regParam=0.3, elasticNetParam=0.8)
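
To give an idea of what `regParam` and `elasticNetParam` control, the sketch below computes the elastic-net penalty as the Spark MLlib documentation describes it: `regParam * (elasticNetParam * L1 + (1 - elasticNetParam) / 2 * L2²)`. It is a plain-Python illustration that ignores details such as feature standardization. The `elastic_net_penalty` helper and the example weights are hypothetical.

```python
def elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.8):
    """Penalty added to the squared-error loss for a given weight vector."""
    l1 = sum(abs(w) for w in weights)          # lasso part, encourages sparse weights
    l2_squared = sum(w * w for w in weights)   # ridge part, shrinks large weights
    return reg_param * (
        elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    )


penalty = elastic_net_penalty([1.0, -2.0, 0.0])
print(round(penalty, 2))  # 0.87
```

With `elasticNetParam=0.8` the penalty is mostly L1, which tends to drive many of the 2**16 feature weights to exactly zero.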

Let's fit this pipeline to the data

In [9]:
pipeline = Pipeline(stages=[tokenizer, hashtf, idf, lr])

pipeline_fit = pipeline.fit(train_set)
train_df = pipeline_fit.transform(train_set)
#val_df = pipeline_fit.transform(val_set) #to be used later during cross-validation
test_df = pipeline_fit.transform(test_set)

And see how well we did.

In [10]:
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
train_rmse = evaluator.evaluate(train_df)
test_rmse = evaluator.evaluate(test_df)

print("Root Mean Squared Error (RMSE) on train data = %g and on test data = %g" % (train_rmse, test_rmse))

Root Mean Squared Error (RMSE) on train data = 0.664671 and on test data = 0.499935
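
For reference, the RMSE is the square root of the mean squared difference between labels and predictions. The plain-Python sketch below shows the computation on made-up numbers. The `rmse` helper and the sample values are illustrative only and are unrelated to the results above.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equally long sequences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical helpful_votes and predictions for four reviews
labels = [0, 1, 3, 10]
predictions = [0.5, 0.5, 2.0, 6.0]
error = rmse(labels, predictions)
print(error)  # sqrt(17.5 / 4), about 2.09
```

Because errors are squared, the single large miss on the last review dominates the score.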


Running this code on the cluster, the output is:

## Next steps

To improve the predictions, we need to use more complex pipelines:

- using GloVe embeddings instead of TF-IDF;
- using a more flexible model than a linear regression, such as a neural network;
- using metadata such as the reviewer's ID to attempt to increase accuracy.

This new model will be tuned using cross-validation.
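
As a reminder of how k-fold cross-validation selects hyperparameters, here is a minimal plain-Python sketch. It is an assumption about how the tuning could be organised, not the planned implementation. In Spark this would more likely rely on `CrossValidator` and `ParamGridBuilder` from `pyspark.ml.tuning`. The helpers `k_fold_indices`, `select_by_cross_validation` and `dummy_score` are hypothetical.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    # Row i goes to fold i % k, so fold sizes differ by at most one
    folds = [list(range(start, n_rows, k)) for start in range(k)]
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


def select_by_cross_validation(n_rows, k, candidates, score_fn):
    """Return the candidate with the lowest mean validation score, plus all means."""
    mean_scores = {}
    for candidate in candidates:
        scores = [score_fn(candidate, train, val)
                  for train, val in k_fold_indices(n_rows, k)]
        mean_scores[candidate] = sum(scores) / len(scores)
    best = min(mean_scores, key=mean_scores.get)
    return best, mean_scores


# Placeholder score: a real score_fn would fit the pipeline on `train`
# and return the RMSE measured on `val`.
def dummy_score(reg_param, train, val):
    return abs(reg_param - 0.1)


best, means = select_by_cross_validation(
    n_rows=10, k=5, candidates=[0.01, 0.1, 0.3], score_fn=dummy_score)
print(best)  # 0.1
```

Each candidate is scored on every held-out fold, and the one with the lowest average error is kept.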

**Agenda**

- *by December 5th:* implement this new pipeline (10 days)
- *by December 15th:* write a report on our process and findings about Spark and the dataset (10 days)
- then practice oral presentation until presentation date