# Challenge: sentiment analysis for Amazon reviews
Now that you've looked at an example of how we can use Spark in batch mode, it's time to try it out on your own.

In this challenge you'll work on a sentiment analysis dataset: the [Amazon reviews dataset](http://jmcauley.ucsd.edu/data/amazon/). You should choose one of the 5-core datasets. Keep in mind that if the data is gzipped, you'll need to unpack it before you use it.

You should complete this challenge in a Jupyter notebook, which you'll need to run on Colab.

Now, on to the task at hand!

It's always important to start with a clear goal in mind. In this case, we'd like to **determine if we can predict whether a review is positive or negative based on the language in the review.**

We're going to tackle this problem with Spark, so you'll need to apply the principles you've learned thus far in the context of Spark.

Some tips to help you get started:

- Don't forget to install Java, Spark, Findspark and PySpark. You may also need to re-mount your drive to Colab. You can reuse the code from the previous assignment for this purpose.
- PySpark always needs to point at a running Spark instance. You can do that using a **SparkContext**.
- We're working in batch mode, so you'll need to load an entire file into memory in order to run any models you build.
- Spark likes to execute models in a pipeline, so remember that when the time comes to set up your model.
- Spark's machine learning algorithms expect numeric variables.

# Spark Setup

Install Spark and Java.

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz

Install findspark and pyspark packages.

In [0]:
!pip install -q findspark
!pip install -q pyspark

Set path variables for Java and Spark.

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

Mount Google Drive.

In [4]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [5]:
!sudo update-alternatives --config java

There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                            Priority   Status
------------------------------------------------------------
  0            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111      auto mode
  1            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111      manual mode
* 2            /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java   1081      manual mode

Press <enter> to keep the current choice[*], or type selection number: 2


Spark imports.

In [0]:
from pyspark import SparkContext
from pyspark.sql import SparkSession

from pyspark.sql.functions import isnan, when, count, col, udf
from pyspark.sql.types import IntegerType

from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline

Set up specific names and locations.

In [0]:
DATA_PATH = "/content/gdrive/My Drive/thinkful_big_data/Colab Datasets/Movies_and_TV_5.json"
APP_NAME = "Amazon Movies and TV Reviews Sentiment Analysis"
SPARK_URL = "local[*]"

Create spark session and create dataframe.

In [0]:
spark = SparkSession.builder.appName(APP_NAME).master(SPARK_URL).getOrCreate()
df = spark.read.options(inferschema = "true").json(DATA_PATH)

Make sure it's up and running.

In [9]:
spark.sparkContext

Take a quick look at the data.

In [10]:
df.show(5)

+----------+-------+-------+--------------------+-----------+--------------+--------------------+--------------------+--------------+
|      asin|helpful|overall|          reviewText| reviewTime|    reviewerID|        reviewerName|             summary|unixReviewTime|
+----------+-------+-------+--------------------+-----------+--------------+--------------------+--------------------+--------------+
|0005019281| [0, 0]|    4.0|This is a charmin...|02 26, 2008| ADZPIG9QOCDG5|Alice L. Larson "...|good version of a...|    1203984000|
|0005019281| [0, 0]|    3.0|It was good but n...|12 30, 2013|A35947ZP82G7JH|       Amarah Strack|Good but not as m...|    1388361600|
|0005019281| [0, 0]|    3.0|Don't get me wron...|12 30, 2013|A3UORV8A9D5L2E|     Amazon Customer|Winkler's Perform...|    1388361600|
|0005019281| [0, 0]|    5.0|Henry Winkler is ...|02 13, 2008|A1VKW06X1O2X7V|Amazon Customer "...|It's an enjoyable...|    1202860800|
|0005019281| [0, 0]|    4.0|This is one of th...|12 22, 2013|A

Limit the DataFrame to just the two variables of interest: the rating (`overall`) and the review text (`reviewText`).

In [0]:
df = df.select('overall', 'reviewText')

To perform sentiment analysis, the overall score needs a cutoff between positive and negative. I'm going to make that cutoff at 3: reviews with a score below 3 (1 or 2) are negative, and reviews with a score above 3 (4 or 5) are positive. Reviews with a score of 3 are treated as neutral, so they won't help distinguish positive from negative sentiment and will be dropped.

In [12]:
df.groupby('overall').count().show()

+-------+------+
|overall| count|
+-------+------+
|    1.0|104219|
|    4.0|382994|
|    3.0|201302|
|    2.0|102410|
|    5.0|906608|
+-------+------+



It is plain to see that positive reviews far outnumber negative reviews. Depending on the use case, this class imbalance may or may not matter: it can bias the model toward predicting a positive review simply because that class occurs more often in the training set. For this particular model, I am going to assume that the imbalance accurately reflects the distribution in the wild (people generally give movies and TV shows high ratings), so nothing will be done to correct it.

In [13]:
df.select('overall').agg({'overall': 'mean'}).show()

+-----------------+
|     avg(overall)|
+-----------------+
|4.110648217148062|
+-----------------+



The mean rating of about 4.11 confirms that reviews skew strongly positive.

In [14]:
df.select('overall').na.drop().count(), df.count()

(1697533, 1697533)

We can see that none of the reviews are missing the value for the review score.

In [15]:
df.select('reviewText').na.drop().count(), df.count()

(1697533, 1697533)

In [16]:
df.select('overall', 'reviewText').summary().show()

+-------+------------------+--------------------+
|summary|           overall|          reviewText|
+-------+------------------+--------------------+
|  count|           1697533|             1697533|
|   mean| 4.110648217148062|                null|
| stddev|1.1976147523955183|                null|
|    min|               1.0|                    |
|    25%|               4.0|                null|
|    50%|               5.0|                null|
|    75%|               5.0|                null|
|    max|               5.0|~~~~~Prom Night 3...|
+-------+------------------+--------------------+



Filter out all scores equal to 3.

In [0]:
df = df.filter(df.overall != 3)

In [26]:
df.agg({'overall': 'mean'}).show()

+-----------------+
|     avg(overall)|
+-----------------+
|4.260074146304949|
+-----------------+



In [27]:
df.select('overall').distinct().show()

+-------+
|overall|
+-------+
|    1.0|
|    4.0|
|    2.0|
|    5.0|
+-------+



In [0]:
udfRemap = udf(f=lambda x: 1 if x > 3 else 0, returnType=IntegerType())

In [0]:
df = df.withColumn('overall', udfRemap('overall'))

In [45]:
df.groupby('overall').count().show()

+-------+-------+
|overall|  count|
+-------+-------+
|      1|1289602|
|      0| 206629|
+-------+-------+



This reiterates how large the class imbalance is between positive and negative reviews. Even so, there is plenty of data to work with, so let's get it into the right format to start running models.
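One way to put the imbalance in numbers is to ask how accurate a model would be if it always predicted the most common class. The sketch below is plain Python, not Spark, and simply reuses the two counts printed above. It treats the whole filtered dataset as one sample, which is an assumption: the test split will differ slightly.

```python
# Class counts copied from the groupby output above
# (after dropping 3-star reviews): 1 = positive, 0 = negative.
class_counts = {1: 1289602, 0: 206629}

total = sum(class_counts.values())

# Share of each class in the filtered data.
class_share = {label: n / total for label, n in class_counts.items()}

# Accuracy of a trivial model that always predicts the most common class.
majority_baseline = max(class_share.values())

print(f"total reviews:     {total}")
print(f"positive share:    {class_share[1]:.4f}")
print(f"negative share:    {class_share[0]:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

The baseline comes out at roughly 0.86, so an accuracy figure on this data is only informative to the extent that it beats that number.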

In [0]:
(trainSet, testSet) = df.randomSplit([.8, .2], seed=2345)

In [0]:
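# Lowercase each review and split it on whitespace.
tokenizer = Tokenizer(inputCol='reviewText', outputCol='words')
# Hash each token into a fixed-size term-frequency vector.
hashtf = HashingTF(inputCol='words', outputCol='tf')
# Rescale by inverse document frequency, ignoring terms found in fewer than 5 reviews.
idf = IDF(inputCol='tf', outputCol='features', minDocFreq=5)
# Index the 0/1 target; the most frequent value (1, positive) becomes label 0.0.
label_stringIdx = StringIndexer(inputCol='overall', outputCol='label')
pipeline = Pipeline(stages=[tokenizer, hashtf, idf, label_stringIdx])
# Fit on the training split only, then apply the same transformation to both splits.
pipelineFit = pipeline.fit(trainSet)
train_df = pipelineFit.transform(trainSet)
test_df = pipelineFit.transform(testSet)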
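For intuition, here is a rough pure-Python sketch of what the first three pipeline stages compute. It is not Spark's implementation and rests on several assumptions: `zlib.crc32` stands in for the MurmurHash3 function that `HashingTF` uses, the bucket count is tiny (Spark's default is 262,144), the three reviews are made up, and the minimum document frequency is lowered from 5 to 2 so the effect shows on such a small corpus. The IDF formula, log((m + 1) / (df + 1)), is the one given in Spark's documentation.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 16   # toy value chosen for readability (assumption)
MIN_DOC_FREQ = 2   # the pipeline above uses 5; lowered for this tiny corpus

# Made-up reviews, purely illustrative.
docs = [
    "Great movie great cast",
    "Terrible movie",
    "Great fun",
]

def tokenize(text):
    # Like Tokenizer: lowercase, then split on whitespace.
    return text.lower().split()

def bucket(term):
    # Stand-in hash; Spark's HashingTF uses MurmurHash3 instead.
    return zlib.crc32(term.encode("utf-8")) % NUM_BUCKETS

def term_frequencies(tokens):
    # Count how many tokens land in each bucket (colliding terms are merged).
    return Counter(bucket(t) for t in tokens)

tf_vectors = [term_frequencies(tokenize(d)) for d in docs]

# Document frequency: in how many documents each bucket is non-zero.
doc_freq = Counter(b for tf in tf_vectors for b in tf)

def idf(b):
    # Buckets seen in too few documents are zeroed out.
    if doc_freq[b] < MIN_DOC_FREQ:
        return 0.0
    return math.log((len(docs) + 1) / (doc_freq[b] + 1))

tfidf_vectors = [
    {b: count * idf(b) for b, count in tf.items()} for tf in tf_vectors
]

for doc, vec in zip(docs, tfidf_vectors):
    print(doc, "->", vec)
```

Because terms are hashed rather than looked up in a vocabulary, two different words can share a bucket. That is the trade-off for never having to build or store a dictionary over 1.5 million reviews.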

In [51]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(maxIter=100)
lrModel = lr.fit(train_df)
predictions = lrModel.transform(test_df)

from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
evaluator.evaluate(predictions)

0.8251046842815946

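This is the area under the ROC curve (AUC), which is the default metric for `BinaryClassificationEvaluator`. It isn't a bad measure by any means, as it gives a good overall look at the model without depending on a single classification threshold.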
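AUC can be read as the probability that a randomly chosen example of class 1 receives a higher score than a randomly chosen example of class 0. The plain-Python sketch below computes it that way on four made-up scores. The function name `roc_auc` and the data are hypothetical, and the pairwise loop is for intuition only: Spark derives the value from the ROC curve rather than by comparing every pair.

```python
def roc_auc(labels, scores):
    """Fraction of (class 1, class 0) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.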

In [54]:
accuracy = predictions.filter(predictions.label == predictions.prediction).count() / float(testSet.count())
accuracy

0.8745700229321103

And there we have it: using simple logistic regression and the power of big data, the model correctly classified 87% of the reviews. My guess is that it's better at classifying the positives than the negatives. Note that `StringIndexer` assigns index 0 to the most frequent value, so the labels are flipped relative to the `overall` column: a label of 1 now indicates a negative review, and a label of 0 a positive one.
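The label flip follows from `StringIndexer`'s default ordering, which ranks values by descending frequency. The sketch below mimics that rule in plain Python using the counts printed earlier. Those counts are for the full filtered dataset, whereas the indexer was fit on the training split, so it assumes the 80% split keeps the same ordering (very likely given the size of the gap).

```python
from collections import Counter

# Counts of the remapped `overall` column, taken from the groupby output earlier.
# StringIndexer works on string values, hence the string keys.
value_counts = Counter({"1": 1289602, "0": 206629})

# The most frequent value gets index 0.0, the next one 1.0, and so on.
ordered = [value for value, _ in value_counts.most_common()]
index_of = {value: float(i) for i, value in enumerate(ordered)}

print(index_of)  # {'1': 0.0, '0': 1.0}
```

So `overall == 1` (positive) becomes `label == 0.0`, and `overall == 0` (negative) becomes `label == 1.0`.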

In [63]:
predictions.filter((predictions.label == 1) & (predictions.label == predictions.prediction)).count() / predictions.filter(predictions.label == 1).count()

0.6142156391266712

In [64]:
predictions.filter((predictions.label == 0) & (predictions.label == predictions.prediction)).count() / predictions.filter(predictions.label == 0).count()

0.9164680297800548

So, as expected, the model is pretty good at identifying positive reviews, correctly classifying 92% of them, but it is much weaker on negative reviews, catching only 61%. There are numerous ways this could be improved. For one thing, no text cleaning was done at all; the model is essentially working on the raw text (which is quite impressive!), so some cleaning could help. Also, to deal with the class imbalance, it could be a good idea to try SMOTE or some other resampling technique (oversampling negatives, undersampling positives) to get a more even occurrence of each class. But for a first model in Spark, I'll take this as a win!
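As a starting point for the resampling idea, here is a minimal sketch of random undersampling in plain Python. The `undersample` helper and the toy rows are hypothetical and not part of the notebook above. In Spark itself, something like `DataFrame.sampleBy` with per-class fractions would be the closer analogue. Any resampling should be applied to the training split only, so the test set keeps the real class distribution.

```python
import random

def undersample(rows, label_of, seed=0):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    if not rows:
        return []
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    # Size of the smallest class sets the target for all classes.
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced

# Toy data: 8 positive rows (label 1) and 2 negative rows (label 0).
rows = [(1, f"good review {i}") for i in range(8)]
rows += [(0, f"bad review {i}") for i in range(2)]

balanced = undersample(rows, label_of=lambda row: row[0])
print(balanced)
```

Undersampling throws away most of the positive reviews, so it trades data volume for balance. Whether that helps on this dataset would need to be checked against the per-class results above.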