# Summarize the reviews

This solution adds a feature that spares customers from reading many reviews in order to evaluate a product. To achieve that, we will attempt to extract the most predictive words or sentences from the reviews and present them in a readable format (e.g. a word cloud).

Implementation steps:

- Select a product and extract all the reviews associated with it
- Group the positive reviews (ratings 4 and 5), get a list of words, train a classifier against the rating and determine the most predictive words
- Rank and present those words
- Similarly, group ratings 1, 2 and 3 to cover the negative reviews
- Optionally, cluster the words
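
The steps above can be sketched in plain Python on a handful of made-up reviews. This is only an illustration: it ranks words by raw frequency instead of using a trained model, and `SAMPLE_REVIEWS` and `top_words` are hypothetical names that do not appear elsewhere in this notebook.

```python
import re
from collections import Counter

# Hypothetical sample data: (star rating, review summary) pairs.
SAMPLE_REVIEWS = [
    (5, "Great show, great cast"),
    (4, "Fun and entertaining"),
    (1, "Boring, a waste of time"),
    (2, "Boring plot"),
    (3, "Meh"),
]

def top_words(reviews, ratings, n=3):
    """Return the n most frequent words among reviews with the given ratings."""
    counts = Counter()
    for rating, text in reviews:
        if rating in ratings:
            # Lowercase the text and keep runs of letters only.
            counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts.most_common(n)

positive = top_words(SAMPLE_REVIEWS, ratings={4, 5})
negative = top_words(SAMPLE_REVIEWS, ratings={1, 2, 3})
print(positive)
print(negative)
```

Raw counts favour frequent but uninformative words such as "a" or "of", which is one reason to train a classifier and rank words by how well they predict the rating instead.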

## Loading and preparing the data

In [1]:
# `spark` is assumed to be an existing SparkSession.
all_reviews = (spark
    .read
    .json('./data/raw_data/reviews_Amazon_Instant_Video_5.json.gz')
    .na
    .fill({ 'reviewerName': 'Unknown' }))

In [2]:
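from pyspark.sql.functions import col, udf, trim
from pyspark.sql.types import IntegerType
import re

# Replace every non-letter character with a space; `line or ''` guards against null summaries.
remove_punctuation = udf(lambda line: re.sub('[^A-Za-z]', ' ', line or ''))
# Label 1-2 star reviews as 0 (negative) and everything else as 1 (positive).
make_binary = udf(lambda rating: 0 if rating in [1, 2] else 1, IntegerType())

# Keep only clearly negative (1-2 stars) and clearly positive (5 stars) reviews,
# then drop summaries that are empty after cleaning.
reviews = (all_reviews
    .filter(col('overall').isin([1, 2, 5]))
    .withColumn('label', make_binary(col('overall')))
    .select(col('label').cast('int'), remove_punctuation('summary').alias('summary'))
    .filter(trim(col('summary')) != ''))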
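
To make the effect of the preparation step concrete, here is a minimal plain-Python sketch of the same transformations applied to a few made-up rows. The rows are hypothetical, and the `line or ''` guard against missing summaries is an addition made for this sketch.

```python
import re

def remove_punctuation(line):
    """Replace every non-letter character with a space (None becomes '')."""
    return re.sub('[^A-Za-z]', ' ', line or '')

def make_binary(rating):
    """Map 1-2 stars to 0 (negative) and anything else to 1 (positive)."""
    return 0 if rating in [1, 2] else 1

# Hypothetical rows: (overall, summary)
rows = [(5.0, "Loved it!!!"), (1.0, "Waste of $$$"), (3.0, "It's ok"), (2.0, "...")]

# Keep 1, 2 and 5 star rows only, then label and clean them.
prepared = [
    (make_binary(overall), remove_punctuation(summary))
    for overall, summary in rows
    if overall in [1, 2, 5]
]
# Drop rows whose summary is blank after cleaning.
prepared = [(label, text) for label, text in prepared if text.strip() != '']
print(prepared)
```

The 3-star row is filtered out before labelling, and the row whose summary is only punctuation disappears because nothing is left after cleaning.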

## Benchmark: predict by distribution

In [3]:
accuracy = reviews.filter('label == 1').count() / float(reviews.count())
print('Always predicting 5 stars accuracy: {0}'.format(accuracy))

Always predicting 5 stars accuracy: 0.852979645222
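
The benchmark is a majority-class baseline: because about 85% of the kept reviews are 5-star, a model that ignores the text entirely already reaches that accuracy. A minimal sketch of the same computation on a made-up label list (the labels and the helper name `majority_baseline` are assumptions for illustration):

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / float(len(labels))

# Hypothetical labels: 8 positive reviews and 2 negative ones.
labels = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
label, accuracy = majority_baseline(labels)
print('Always predicting {0} accuracy: {1}'.format(label, accuracy))
```

Any classifier trained on this data is only useful if its accuracy clearly exceeds this figure.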


## Getting lists of words

In [4]:
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

tokenizer = Tokenizer(inputCol='summary', outputCol='words')
hashing_tf = HashingTF(inputCol='words', outputCol='rawFeatures')
log_regression = LogisticRegression()

pipeline = Pipeline(stages=[
    tokenizer, 
    hashing_tf,
    IDF(inputCol='rawFeatures', outputCol='features'),
    log_regression
])

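# Single-point grid: each hyperparameter has exactly one candidate value,
# so cross-validation only estimates the metric and does not tune anything.
paramGrid = (ParamGridBuilder()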
    .addGrid(hashing_tf.numFeatures, [120000])
    .addGrid(log_regression.regParam, [.3])
    .addGrid(log_regression.elasticNetParam, [.01])
    .build())

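# BinaryClassificationEvaluator defaults to areaUnderROC, so avgMetrics holds
# the cross-validated AUC, which is not directly comparable to the accuracy benchmark.
crossval = CrossValidator(estimator=pipeline,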
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)

model = crossval.fit(reviews)
model.avgMetrics

[0.8997182875012932]
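
The feature stages of the pipeline can be illustrated without Spark. The sketch below hashes each word into a fixed number of buckets (the hashing trick behind `HashingTF`) and then applies a smoothed inverse document frequency, log((n_docs + 1) / (doc_freq + 1)). It is an approximation under stated assumptions: CRC32 stands in for the MurmurHash3 function Spark uses, so bucket indices will differ from Spark's, and the tiny documents are made up.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # tiny on purpose; the notebook uses 120000

def bucket(word, num_features=NUM_FEATURES):
    """Map a word to a bucket index (CRC32 is a stand-in hash, not Spark's)."""
    return zlib.crc32(word.encode('utf-8')) % num_features

def hashing_tf(words, num_features=NUM_FEATURES):
    """Sparse term-frequency vector as {bucket index: count}."""
    return dict(Counter(bucket(w, num_features) for w in words))

def idf_weights(tf_vectors):
    """Smoothed IDF per bucket: log((n_docs + 1) / (doc_freq + 1))."""
    n_docs = len(tf_vectors)
    doc_freq = Counter(idx for tf in tf_vectors for idx in tf)
    return {idx: math.log((n_docs + 1.0) / (df + 1.0))
            for idx, df in doc_freq.items()}

# Hypothetical tokenized summaries.
docs = [['great', 'show'], ['great', 'fun'], ['boring', 'show']]
tf_vectors = [hashing_tf(d) for d in docs]
idf = idf_weights(tf_vectors)
tfidf = [{idx: count * idf[idx] for idx, count in tf.items()}
         for tf in tf_vectors]
print(tfidf)
```

With few buckets, different words can collide in the same index; a large `numFeatures` such as the 120000 used above makes collisions rarer at the cost of a wider feature vector.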

In [22]:
from pyspark.sql.functions import explode, max as sql_max
from pyspark.sql.types import FloatType

# One row per token. The column is named 'summary' so that the fitted pipeline
# can score each word as if it were a one-word review summary.
words = (tokenizer
    .transform(reviews)
    .select(explode(col('words')).alias('summary')))

predictors = model.transform(words).select('summary', 'probability')

# 'probability' is a vector: index 0 is P(label = 0), index 1 is P(label = 1).
first = udf(lambda x: x[0].item(), FloatType())
second = udf(lambda x: x[1].item(), FloatType())

# Collapse repeated words to a single row and rank by probability.
predictors_good = (predictors
   .select('summary', second(col('probability')).alias('prob_good'))
   .groupBy('summary')
   .agg(sql_max('prob_good').alias('prob_good'))
   .sort('prob_good', ascending=False))

predictors_bad = (predictors
   .select('summary', first(col('probability')).alias('prob_bad'))
   .groupBy('summary')
   .agg(sql_max('prob_bad').alias('prob_bad'))
   .sort('prob_bad', ascending=False))
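
The ranking step (keep the highest score per word, then sort in descending order) can be sketched in plain Python. The scores below are made up for illustration and `rank_words` is a hypothetical helper, not part of the notebook.

```python
def rank_words(scored, top_n=None):
    """Keep the highest score per word, then sort by score, highest first."""
    best = {}
    for word, prob in scored:
        if word not in best or prob > best[word]:
            best[word] = prob
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n] if top_n is not None else ranked

# Hypothetical (word, probability of a good review) pairs; words may repeat.
scored = [('great', 0.90), ('boring', 0.62), ('great', 0.90), ('fun', 0.89)]
top_two = rank_words(scored, top_n=2)
print(top_two)
```

The same function applied to the probabilities of the negative class would give the counterpart ranking for bad reviews.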

In [24]:
predictors_good.show(50)

+-------------+----------+
|      summary| prob_good|
+-------------+----------+
|    excellent| 0.8987591|
|        great| 0.8983653|
|      awesome| 0.8981597|
|         best|0.89773893|
|          fun|0.89716154|
|       regime|0.89715546|
|         love|0.89715546|
|     breaking|0.89684784|
|         five| 0.8967087|
|       sexist|0.89591557|
|      amazing|  0.895506|
| americanized|  0.895506|
|     favorite| 0.8954664|
|       series| 0.8953605|
|         wait|0.89534503|
|    hilarious|0.89526236|
|        loved| 0.8951537|
|    fantastic| 0.8946325|
|        keeps| 0.8940002|
|    endearing| 0.8940002|
|         show|0.89345884|
|        rocks|  0.893363|
|    wonderful|  0.893318|
|         miss| 0.8927371|
|        loves| 0.8926713|
|    customers| 0.8926098|
|    brilliant| 0.8926098|
|         hang|0.89243275|
|    justified|0.89243275|
|    enjoyable|0.89234406|
|  outstanding|  0.892175|
| entertaining|0.89214385|
|       always| 0.8918692|
|       season|0.89142704|

In [25]:
predictors_bad.show(50)

+-------------+----------+
|      summary|  prob_bad|
+-------------+----------+
|        kiosk|  0.385591|
|          meh|0.38450035|
|disappointing| 0.3837381|
|       boring|0.38158318|
|  unwatchable|0.38135895|
|     mediocre|0.37024063|
|        waste|0.36363545|
|       sucked|0.36315715|
|         skip|0.36098742|
|   depressing|  0.355512|
|      charges|0.35515267|
|           eh|0.35515267|
|         crap|  0.346826|
|          yuk|0.34450716|
|      stinker|0.34450716|
|          ehh|0.34450716|
|    worthless|0.34450716|
|         yawn|0.34450716|
|      garbage|0.34354392|
|      protest|0.34224117|
|       search|0.34196168|
|         nope|0.33954385|
|         mess|0.33739823|
|         weak|0.33659312|
|     tiresome|0.33552718|
|         bore| 0.3343576|
|     terrible|0.33435202|
|      nowhere|0.33308926|
|          ugh| 0.3330458|
|   unbearable|0.33196622|
|       wasted|0.33094722|
|        falls|0.33044004|
|       stupid|0.33038098|
|          hum|0.32854858|