# SENTIMENT ANALYSIS WITH SPARK ML

# Spark ML Main Concepts

The Spark machine learning API in the **spark.ml** package is based on DataFrames. There is also an older RDD-based API in the **spark.mllib** package, but as of Spark 2.0 it has entered maintenance mode, and the DataFrame-based API is now the primary machine learning API for Spark.

Main concepts of Spark ML:

- **Transformer**: transforms one DataFrame into another DataFrame

- **Estimator**: an algorithm (e.g. a learning algorithm) that is fit on a DataFrame and produces a Model, which is itself a Transformer

- **Pipeline**: chains Transformers and Estimators to produce a Model

- **Evaluator**: measures how well a fitted Model does on held-out test data
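
The contract between these concepts can be sketched in plain Python. The classes below (`AddOne`, `MeanCenterer`, `ToyPipeline` and their models) are hypothetical toys that operate on Python lists, not Spark classes; they only mirror the `fit`/`transform` shape of the API.

```python
# Toy stand-ins for the Spark ML abstractions; rows are plain Python numbers.

class AddOne:
    """Transformer: maps one dataset to another without any training."""
    def transform(self, rows):
        return [x + 1 for x in rows]


class MeanCenterer:
    """Estimator: fit() learns something from the data and returns a Model."""
    def fit(self, rows):
        return MeanCentererModel(sum(rows) / len(rows))


class MeanCentererModel:
    """Model: a Transformer that carries the state learned by fit()."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [x - self.mean for x in rows]


class ToyPipeline:
    """Pipeline: an Estimator that runs its stages in order."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):     # Estimator: train it first
                stage = stage.fit(rows)
            rows = stage.transform(rows)  # output feeds the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    """Fitted pipeline: every stage is now a Transformer."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


model = ToyPipeline([AddOne(), MeanCenterer()]).fit([1, 2, 3])
centered = model.transform([1, 2, 3])
```

Fitting the pipeline turns each Estimator stage into a Model, so the fitted pipeline consists only of Transformers and can be applied to new data.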



# Amazon product data
We will use a [dataset](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Books_5.json.gz)[1] that contains 8.9M book reviews from Amazon, spanning May 1996 - July 2014.

Dataset characteristics:
- Number of reviews: 8.9M
- Size: 8.8GB (uncompressed)
- HDFS blocks: 70 (each with 3 replicas)


[1] J. McAuley, C. Targett, J. Shi, A. van den Hengel. Image-based recommendations on styles and substitutes. SIGIR, 2015. http://jmcauley.ucsd.edu/data/amazon/

The reviews are in English, so we set the locale accordingly:

In [1]:
import os
os.environ['LANG']='en_US.UTF-8'

Alternatively, you can set the variable in the shell before launching the notebook:
```bash
export LANG=en_US.UTF-8
```

# Load Data

In [2]:
%%time
raw_reviews = spark.read.json('/tmp/reviews_Books_5_small.json')

CPU times: user 1.33 ms, sys: 2.63 ms, total: 3.97 ms
Wall time: 8.74 s


In [3]:
%%time
all_reviews = raw_reviews.select('reviewText', 'overall')
all_reviews.cache()
all_reviews.show(2)

+--------------------+-------+
|          reviewText|overall|
+--------------------+-------+
|When most people ...|    5.0|
|I own a lot of Pa...|    4.0|
+--------------------+-------+
only showing top 2 rows

CPU times: user 3.8 ms, sys: 1.35 ms, total: 5.16 ms
Wall time: 2.32 s


# Prepare data
We drop neutral reviews by keeping only those with an overall score of 1 or 5 stars.
We also filter out reviews that contain no text.

In [4]:
nonneutral_reviews = all_reviews.filter(
    (all_reviews.overall == 1.0) | (all_reviews.overall == 5.0))
reviews = nonneutral_reviews.filter(nonneutral_reviews.reviewText != '')

In [5]:
reviews.cache()
all_reviews.unpersist()

DataFrame[reviewText: string, overall: double]

# Split Data

In [6]:
trainingData, testData = reviews.randomSplit([0.8, 0.2])
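
`randomSplit` assigns rows at random, so the resulting sizes only approximate the 80/20 weights and change between runs unless a value is passed to its optional `seed` argument. The idea can be sketched in plain Python; `random_split` below is a hypothetical helper that illustrates weighted per-row assignment and is not how Spark implements it.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to exactly one split by drawing a uniform number."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds of each split on [0, 1], e.g. [0.8, 1.0].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point drift
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if u < upper:
                splits[i].append(row)
                break
    return splits


train, test = random_split(range(1000), [0.8, 0.2], seed=42)
```

With a fixed seed the same rows land in the same split on every run, which makes results such as the evaluation metric easier to reproduce.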

# Generate Pipeline
![pipeline](http://hadoop.cesga.es/files/sentiment_analysis/pipeline.jpg)

## Binarizer
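A transformer that converts numerical features to binary (0/1) features: values greater than the threshold become 1.0 and all others become 0.0. With a threshold of 2.5, 5-star reviews get label 1 (positive) and 1-star reviews get label 0 (negative).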

In [7]:
from pyspark.ml.feature import Binarizer

binarizer = Binarizer(threshold=2.5, inputCol='overall', outputCol='label')
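
The rule the Binarizer applies to each value can be sketched in plain Python; `binarize` is a hypothetical helper used only for illustration. Note that the comparison is strict, so a value equal to the threshold maps to 0.0.

```python
def binarize(value, threshold):
    """Return 1.0 if value is strictly greater than threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0


# The two scores that survive the filtering step: 1 star and 5 stars.
labels = [binarize(stars, 2.5) for stars in [1.0, 5.0]]
```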

## Tokenizer
A transformer that converts the input string to lowercase and then splits it on whitespace.

In [8]:
from pyspark.ml.feature import Tokenizer
tokenizer = Tokenizer(inputCol='reviewText', outputCol='words')
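
A rough plain-Python equivalent of this behaviour, assuming simple whitespace splitting (`tokenize` is a hypothetical helper, not the Spark implementation):

```python
def tokenize(text):
    """Lowercase the text, then split it on whitespace."""
    return text.lower().split()


words = tokenize("I LOVED this Book")
with_punctuation = tokenize("Great book!")
```

Because the split only looks at whitespace, punctuation stays attached to the neighbouring word (`book!` above), so `book` and `book!` count as different terms in the later stages.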

## StopWordsRemover
A transformer that filters stop words out of the input, i.e. very common words such as "the" or "a" that carry little information about sentiment.

In [9]:
from pyspark.ml.feature import StopWordsRemover
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol='filtered')
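
The filtering step can be sketched as follows. `STOP_WORDS` is a tiny hypothetical list chosen for this example; the list Spark uses by default is much longer.

```python
# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {"i", "this", "the", "a", "is", "of"}


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Keep only the words that are not in the stop-word set."""
    return [w for w in words if w.lower() not in stop_words]


filtered = remove_stop_words(["i", "loved", "this", "book"])
```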

## HashingTF
A transformer that converts a sequence of words into a fixed-length feature vector. It maps each term to an index with a hashing function and stores the term's frequency at that index.

In [10]:
from pyspark.ml.feature import HashingTF
hashingTF = HashingTF(inputCol=remover.getOutputCol(), outputCol='features')
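
The hashing trick can be sketched in plain Python. `hashing_tf` is a hypothetical helper, and `zlib.crc32` is only a deterministic stand-in: it is not the hash function Spark uses, so the indices will differ from Spark's. The default of 262144 (2^18) features matches the `HashingTF` default.

```python
import zlib
from collections import Counter


def hashing_tf(words, num_features=262144):
    """Map words to a sparse {index: count} term-frequency vector."""
    counts = Counter()
    for word in words:
        # crc32 is a stand-in chosen because it is deterministic across runs;
        # it is NOT the hash function Spark uses.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


vector = hashing_tf(["loved", "book", "loved"], num_features=16)
```

No vocabulary has to be built or stored, at the price of collisions: two different words can hash to the same index, which becomes more likely as `num_features` shrinks.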

# Estimator
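## LogisticRegression
An estimator that fits a logistic regression model on the `features` and `label` columns. Here training is capped at 10 iterations and the regularization parameter is set to 0.01.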

In [11]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(maxIter=10, regParam=0.01)
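
What the fitted model computes, and what `regParam` penalizes, can be sketched as below. This assumes the default `elasticNetParam` of 0 (a pure L2 penalty); the helpers, the example weights and the exact scaling of the penalty are illustrative assumptions and may differ from Spark's internals.

```python
import math


def predict_proba(weights, bias, features):
    """Probability of the positive class for a sparse {index: value} vector."""
    margin = bias + sum(weights.get(i, 0.0) * v for i, v in features.items())
    return 1.0 / (1.0 + math.exp(-margin))


def l2_penalty(weights, reg_param):
    """L2 term added to the training loss; a larger reg_param shrinks weights."""
    return reg_param * 0.5 * sum(w * w for w in weights.values())


# Hypothetical weights for two hashed feature indices.
weights = {3: 1.5, 7: -2.0}
p_positive = predict_proba(weights, bias=0.0, features={3: 1.0})
p_negative = predict_proba(weights, bias=0.0, features={7: 1.0})
penalty = l2_penalty(weights, reg_param=0.01)
```

A review containing a term with a positive weight is pushed towards label 1, and one containing a term with a negative weight towards label 0.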

# Pipeline

In [12]:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[binarizer, tokenizer, remover, hashingTF, lr])

In [13]:
%%time
pipeLineModel = pipeline.fit(trainingData)

CPU times: user 30.7 ms, sys: 7.76 ms, total: 38.5 ms
Wall time: 6.18 s


# Evaluation

In [14]:
%%time
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()

predictions = pipeLineModel.transform(testData)

aur = evaluator.evaluate(predictions)

print('Area under ROC: ', aur)

Area under ROC:  0.762599469496
CPU times: user 22.1 ms, sys: 9.19 ms, total: 31.3 ms
Wall time: 1.34 s
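
`BinaryClassificationEvaluator` reports the area under the ROC curve by default: 0.5 corresponds to random guessing and 1.0 to a perfect ranking. One way to read the metric is as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below computes it that way on a handful of made-up scores; `area_under_roc` is a hypothetical helper, and Spark computes the metric differently and at scale.

```python
def area_under_roc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly.

    Ties count as half a win. Quadratic in the input size, so this is
    only suitable for illustration.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1.0]
    negatives = [s for l, s in zip(labels, scores) if l == 0.0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores: one positive (0.4) is ranked below a negative (0.6).
auc = area_under_roc([1.0, 1.0, 0.0, 0.0], [0.9, 0.4, 0.6, 0.1])
```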


# Hyperparameter Tuning

In [15]:
%%time
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
param_grid = ParamGridBuilder() \
            .addGrid(hashingTF.numFeatures, [10000, 100000]) \
            .addGrid(lr.regParam, [0.01, 0.1, 1.0]) \
            .addGrid(lr.maxIter, [10, 20]) \
            .build()
            
cv = (CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(param_grid)
      .setNumFolds(3))

cv_model = cv.fit(trainingData)

CPU times: user 3.32 s, sys: 949 ms, total: 4.27 s
Wall time: 1min 10s
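
The wall time grows with the size of the grid, because every parameter combination is fitted once per fold. A small plain-Python sketch of the count, using the same candidate values as the grid above (the dictionary layout is just for illustration):

```python
from itertools import product

# Same candidate values as the parameter grid above.
grid = {
    "numFeatures": [10000, 100000],
    "regParam": [0.01, 0.1, 1.0],
    "maxIter": [10, 20],
}
names = sorted(grid)
combinations = [dict(zip(names, values))
                for values in product(*(grid[n] for n in names))]

num_folds = 3
fits_during_search = len(combinations) * num_folds
```

That is 12 combinations and 36 pipeline fits during the search. The best combination is then refitted on the full training data, and that refitted model is what `cv_model` uses when transforming new data.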


In [16]:
%%time
new_predictions = cv_model.transform(testData)
new_aur = evaluator.evaluate(new_predictions)
print('Area under ROC: ', new_aur)

Area under ROC:  0.94139895864
CPU times: user 25 ms, sys: 4.84 ms, total: 29.8 ms
Wall time: 355 ms
