# Summarize the reviews

The idea behind this solution is to offer customers a new feature that reduces the need to read through many reviews in order to evaluate a product. To achieve that, we will attempt to extract the most predictive words or sentences from the reviews and present them in an accessible format (e.g. a word cloud).

Implementation steps:

- Select a product and extract all the reviews associated with it
- Group the reviews rated 4 and 5 as positive, split them into words, train a model against the rating and determine the most common words
- Rank and present the most common words
- Similarly, group the reviews rated 1, 2 and 3 as negative
- Optionally, consider clustering the words
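
The grouping and counting steps above can be illustrated without Spark. The following is a minimal sketch in plain Python, assuming reviews arrive as `(rating, text)` pairs; the function name `top_words` and the sample reviews are made up for illustration and are not part of the notebook.

```python
from collections import Counter

def top_words(reviews, ratings, n=3):
    """Return the n most common lowercase words among reviews whose rating is in `ratings`."""
    counts = Counter()
    for rating, text in reviews:
        if rating in ratings:
            counts.update(text.lower().split())
    return counts.most_common(n)

# Hypothetical sample data, not taken from the dataset.
sample = [
    (5, "great show great cast"),
    (4, "great fun"),
    (1, "boring and slow"),
    (2, "boring plot"),
]

positive = top_words(sample, {4, 5}, n=1)
negative = top_words(sample, {1, 2, 3}, n=1)
```

Raw counts like these favour frequent but uninformative words, which is one reason the notebook trains a classifier instead of counting directly.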

## Loading and preparing the data

In [1]:
all_reviews = (spark
    .read
    .json('./data/raw_data/reviews_Amazon_Instant_Video_5.json.gz')
    .na
    .fill({ 'reviewerName': 'Unknown' }))

In [27]:
from pyspark.sql.functions import col, udf, trim
from pyspark.sql.types import IntegerType
import re

# Keep only ASCII letters and whitespace in the review summary.
remove_punctuation = udf(lambda line: re.sub(r'[^A-Za-z\s]', '', line))
# Map 1-2 star ratings to 0 (negative); the remaining ratings map to 1 (positive).
make_binary = udf(lambda rating: 0 if rating in [1, 2] else 1, IntegerType())

# Keep only clearly negative (1-2 stars) and clearly positive (5 stars) reviews.
reviews = (all_reviews
    .filter(col('overall').isin([1, 2, 5]))
    .withColumn('label', make_binary(col('overall')))
    .select(col('label').cast('int'), remove_punctuation('summary').alias('summary'))
    .filter(trim(col('summary')) != ''))
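
To make the two UDFs concrete, here is a sketch of the same logic as plain Python functions applied to a few made-up values. The `None` guard is an addition for this sketch and is not in the notebook, whose lambda would raise on a missing summary.

```python
import re

def remove_punctuation(line):
    # Keep only ASCII letters and whitespace; treat a missing summary as empty.
    return re.sub(r'[^A-Za-z\s]', '', line or '')

def make_binary(rating):
    # 1-2 stars -> 0 (negative); any other rating -> 1 (positive).
    return 0 if rating in (1, 2) else 1

# Made-up inputs for illustration.
cleaned = remove_punctuation("Didn't like it... 2/10!")
labels = [make_binary(r) for r in (1.0, 2.0, 5.0)]
```

Because apostrophes are deleted rather than replaced by a space, contractions collapse into a single token, which is why a word such as `theyd` can show up among the predictive words.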

## Splitting data and balancing skewness

In [28]:
train, test = reviews.randomSplit([.8, .2], seed=5436)

In [29]:
reviews_bad = train.filter('label == 0')
# Oversample the negative class six-fold so that it roughly matches the positive class.
reviews_bad_multiplied = reviews_bad.union(reviews_bad).union(reviews_bad).union(reviews_bad).union(reviews_bad).union(reviews_bad)

reviews_good = train.filter('label == 1')

train_reviews = reviews_bad_multiplied.union(reviews_good)
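
The factor of six is hard-coded in the cell above. As a sketch of how it could be derived from the class counts instead, the hypothetical helper below (not in the notebook) rounds the majority-to-minority ratio. The counts are invented for illustration and Python 3 division semantics are assumed.

```python
def replication_factor(n_minority, n_majority):
    """Number of copies of the minority class needed to roughly match the majority."""
    if n_minority <= 0:
        raise ValueError("minority class is empty")
    return max(1, round(n_majority / n_minority))

# Invented counts for illustration only.
factor = replication_factor(n_minority=1000, n_majority=5900)
balanced_share = (1000 * factor) / (1000 * factor + 5900)
```

With a factor in hand, the repeated unions could be written as `functools.reduce(lambda a, b: a.union(b), [reviews_bad] * factor)`.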

## Benchmark: predict by distribution

In [30]:
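# Share of (oversampled) negative reviews in the training set, i.e. the
# accuracy of a model that always predicts the negative class (label 0).
accuracy = reviews_bad_multiplied.count() / float(train_reviews.count())
print('Always predicting 1-2 stars accuracy: {0}'.format(accuracy))

Always predicting 1-2 stars accuracy: 0.504377044832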
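
A constant-prediction baseline is simply the share of the class being predicted, so the accuracies for the two possible constant predictions sum to one. A small sketch with made-up labels (the helper name is hypothetical):

```python
def constant_baseline_accuracy(labels, predicted):
    """Accuracy obtained by predicting `predicted` for every example."""
    labels = list(labels)
    return sum(1 for y in labels if y == predicted) / len(labels)

labels = [0, 0, 0, 1, 1]  # made-up labels
acc_negative = constant_baseline_accuracy(labels, predicted=0)
acc_positive = constant_baseline_accuracy(labels, predicted=1)
```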


## Learning pipeline

In [34]:
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

tokenizer = Tokenizer(inputCol='summary', outputCol='words')
hashing_tf = HashingTF(inputCol='words', outputCol='rawFeatures')
log_regression = LogisticRegression()

pipeline = Pipeline(stages=[
    tokenizer, 
    hashing_tf,
    IDF(inputCol='rawFeatures', outputCol='features'),
    log_regression
])

paramGrid = (ParamGridBuilder()
    .addGrid(hashing_tf.numFeatures, [120000])
    .addGrid(log_regression.regParam, [.3])
    .addGrid(log_regression.elasticNetParam, [.01])
    .build())

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)
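
To show roughly what the `HashingTF` and `IDF` stages compute, here is a minimal pure-Python sketch. It is not Spark's implementation: Spark hashes terms with MurmurHash3, whereas this sketch uses `zlib.crc32` as a stand-in, and the small `num_features` and the sample documents are assumptions chosen for readability. The smoothed IDF formula, log((m + 1) / (df + 1)), follows the Spark MLlib documentation.

```python
import math
import zlib
from collections import Counter

def hashing_tf(tokens, num_features):
    """Map each token to a bucket index and count occurrences per bucket."""
    counts = Counter()
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)

def idf_weights(docs_tf, num_docs):
    """Smoothed inverse document frequency per bucket: log((m + 1) / (df + 1))."""
    doc_freq = Counter()
    for tf in docs_tf:
        doc_freq.update(tf.keys())  # each bucket counts once per document
    return {i: math.log((num_docs + 1) / (df + 1)) for i, df in doc_freq.items()}

# Made-up tokenized summaries.
docs = [["great", "show"], ["boring", "show"], ["great", "great", "cast"]]
tfs = [hashing_tf(doc, num_features=1 << 16) for doc in docs]
idf = idf_weights(tfs, num_docs=len(docs))
tfidf = [{i: count * idf[i] for i, count in tf.items()} for tf in tfs]
```

Hashing removes the need for a vocabulary, but the mapping from bucket back to word is lost, which is why the notebook later scores individual words through the fitted model rather than reading coefficients directly.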

## Testing model

In [35]:
model = crossval.fit(train_reviews)
model.avgMetrics

[0.9604014372172315]

In [36]:
BinaryClassificationEvaluator().evaluate(model.transform(test))

0.9164916248100313
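
`BinaryClassificationEvaluator` reports the area under the ROC curve by default, so the two numbers above are AUC values rather than accuracies and are not directly comparable with the accuracy baseline. As a reminder of what that metric measures, here is a small pairwise sketch in plain Python with made-up scores; it is an illustration, not Spark's implementation.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 0, 0]            # made-up labels
scores = [0.9, 0.4, 0.6, 0.1]    # made-up model scores
value = auc(labels, scores)
```

A constant predictor scores 0.5 under this metric, which is the relevant reference point for the results above.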

## Using model to extract most predictive words

In [38]:
from pyspark.sql import functions as F
from pyspark.sql.functions import explode
from pyspark.sql.types import FloatType

# One row per word, kept in a column named 'summary' so the fitted pipeline can score it.
words = (tokenizer
    .transform(reviews)
    .select(explode(col('words')).alias('summary')))

predictors = model.transform(words).select('summary', 'probability')

# probability[0] is P(label 0, negative); probability[1] is P(label 1, positive).
first = udf(lambda x: x[0].item(), FloatType())
second = udf(lambda x: x[1].item(), FloatType())

predictors_good = (predictors
   .select('summary', second(col('probability')).alias('prob_good'))
   .groupBy('summary')
   .agg(F.max('prob_good').alias('prob_good'))
   .sort('prob_good', ascending=False))

predictors_bad = (predictors
   .select('summary', first(col('probability')).alias('prob_bad'))
   .groupBy('summary')
   .agg(F.max('prob_bad').alias('prob_bad'))
   .sort('prob_bad', ascending=False))

In [42]:
predictors_good.toPandas().head(n=50)

| | summary | prob_good |
|---|---|---|
| 0 | five | 0.699386 |
| 1 | excellent | 0.691371 |
| 2 | hilarious | 0.689102 |
| 3 | lie | 0.6879 |
| 4 | awesome | 0.687069 |
| 5 | enjoyable | 0.68496 |
| 6 | brilliant | 0.681011 |
| 7 | customers | 0.681011 |
| 8 | disappoint | 0.678679 |
| 9 | outstanding | 0.678496 |


In [44]:
predictors_bad.toPandas().head(n=50)

| | summary | prob_bad |
|---|---|---|
| 0 | disappointing | 0.686565 |
| 1 | meh | 0.686261 |
| 2 | theyd | 0.680944 |
| 3 | unwatchable | 0.679726 |
| 4 | mediocre | 0.67911 |
| 5 | boring | 0.678281 |
| 6 | eh | 0.674466 |
| 7 | charges | 0.674466 |
| 8 | ditch | 0.673823 |
| 9 | trainwreck | 0.673508 |
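
To present these words as a word cloud, the probabilities need to be turned into display weights. The sketch below is one possible approach, not part of the notebook: the hypothetical helper `font_sizes` linearly rescales probabilities to font sizes, using the first three rows of the positive table as input. The size range is an arbitrary assumption.

```python
def font_sizes(rows, min_size=12, max_size=48):
    """Linearly rescale (word, probability) rows to font sizes for display."""
    probs = [p for _, p in rows]
    low, high = min(probs), max(probs)
    span = high - low
    sizes = {}
    for word, p in rows:
        # If all probabilities are equal, give every word the largest size.
        scale = (p - low) / span if span else 1.0
        sizes[word] = round(min_size + scale * (max_size - min_size))
    return sizes

# First three rows of the positive table.
rows = [("five", 0.699386), ("excellent", 0.691371), ("hilarious", 0.689102)]
sizes = font_sizes(rows)
```

Since the top probabilities lie within a narrow band, rescaling relative to the observed minimum and maximum keeps the differences visible; passing the raw probabilities as weights would render all words at nearly the same size.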
