# SENTIMENT ANALYSIS WITH SPARK ML

# Spark ML Main Concepts

The Spark machine learning API in the **spark.ml** package is based on DataFrames. There is also an RDD-based machine learning API in the **spark.mllib** package, but as of Spark 2.0 it has entered maintenance mode. The primary machine learning API for Spark is now the DataFrame-based API.

Main concepts of Spark ML:

- **Transformer**: transforms one DataFrame into another DataFrame

- **Estimator**: e.g. a learning algorithm that trains on a DataFrame and produces a Model

- **Pipeline**: chains Transformers and Estimators to produce a Model

- **Evaluator**: measures how well a fitted Model does on held-out test data
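
To make these roles concrete, here is a minimal plain-Python sketch of how the four concepts fit together. It does not use the pyspark API: a "DataFrame" is just a list of dicts, and all class and function names (`Lowercase`, `MajorityLabel`, `accuracy`, ...) are hypothetical stand-ins chosen for illustration.

```python
# Plain-Python sketch of the Spark ML concepts (NOT the pyspark API).
# A "DataFrame" here is simply a list of row dicts.

class Lowercase:
    """Transformer: turns one dataset into another."""
    def transform(self, rows):
        return [{**r, "text": r["text"].lower()} for r in rows]

class MajorityLabelModel:
    """Model: the Transformer that an Estimator produces."""
    def __init__(self, label):
        self.label = label
    def transform(self, rows):
        return [{**r, "prediction": self.label} for r in rows]

class MajorityLabel:
    """Estimator: learns from a dataset and returns a Model."""
    def fit(self, rows):
        labels = [r["label"] for r in rows]
        return MajorityLabelModel(max(set(labels), key=labels.count))

class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

class Pipeline:
    """Chains Transformers and Estimators; fit() fits each Estimator."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: fit it first
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)

def accuracy(rows):
    """Evaluator: scores predictions against the true labels."""
    return sum(r["prediction"] == r["label"] for r in rows) / len(rows)

train = [{"text": "Great", "label": 1.0},
         {"text": "Loved it", "label": 1.0},
         {"text": "Awful", "label": 0.0}]
test = [{"text": "Superb", "label": 1.0}, {"text": "Bad", "label": 0.0}]

model = Pipeline([Lowercase(), MajorityLabel()]).fit(train)
predictions = model.transform(test)
score = accuracy(predictions)  # 0.5: the toy model always predicts 1.0
```

The key point is that fitting a pipeline returns a fitted model that is itself a Transformer, which can then be applied to held-out data and scored by an Evaluator.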


# Amazon product data
We will use a [dataset](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Books_5.json.gz)[1] that contains 8.9M book reviews from Amazon, spanning May 1996 - July 2014.

Dataset characteristics:
- Number of reviews: 8.9M
- Size: 8.8GB (uncompressed)
- HDFS blocks: 70 (each with 3 replicas)


[1] Image-based recommendations on styles and substitutes
J. McAuley, C. Targett, J. Shi, A. van den Hengel
SIGIR, 2015
http://jmcauley.ucsd.edu/data/amazon/

The reviews are in English, so we will set the locale accordingly:

In [None]:
import os
os.environ['LANG']='en_US.UTF-8'

As an alternative you can set the environment before launching the notebook with:
```bash
export LANG=en_US.UTF-8
```

# Load Data

For this lab we will use a reduced dataset:

In [None]:
%%time
raw_reviews = spark.read.json('/tmp/reviews_Books_5_small.json')

In [None]:
raw_reviews.count()

In [None]:
raw_reviews.show()

We are only interested in the following columns:
- `reviewText`: contains the text of the review sent by the user
- `overall`: contains the overall rating of the product (from 1 to 5 stars)

So we will only keep those columns in the dataset:

In [None]:
%%time
all_reviews = ...

To improve performance we will cache this dataframe:

In [None]:
all_reviews.cache()

Let's verify that the dataframe now has only the `reviewText` and `overall` columns:

In [None]:
all_reviews.show(2)

# Prepare data
We will avoid neutral reviews by keeping only the reviews with an overall score of 1 or 5 stars.

In [None]:
nonneutral_reviews = all_reviews.filter(...)

We will also filter out the reviews that contain no text:

In [None]:
reviews = nonneutral_reviews.filter(...)
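
The two filtering rules can be illustrated on a handful of rows in plain Python. This is only a sketch of the logic, not the DataFrame code for the cells above; the sample rows are made up, and "no text" is assumed here to mean an empty string.

```python
# Hypothetical sample rows; real rows come from the reviews dataset.
rows = [
    {"reviewText": "Loved it", "overall": 5.0},
    {"reviewText": "It was fine", "overall": 3.0},
    {"reviewText": "", "overall": 1.0},
    {"reviewText": "Terrible", "overall": 1.0},
]

# Rule 1: keep only clearly negative (1 star) or clearly positive (5 stars) reviews.
nonneutral = [r for r in rows if r["overall"] in (1.0, 5.0)]

# Rule 2: drop reviews whose text is empty (assumption: "no text" == "").
with_text = [r for r in nonneutral if r["reviewText"]]

print(len(rows), len(nonneutral), len(with_text))  # 4 3 2
```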

Let's cache the new dataframe and unpersist the previous one:

In [None]:
reviews.cache()
all_reviews.unpersist()

# Split Data

Let's split the dataset into training and test subsets (80% training, 20% test):

In [None]:
trainingData, testData = ...
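
As an illustration of what a seeded random split does, the following plain-Python sketch assigns each row independently to the training set with probability 0.8. The helper `random_split` is hypothetical and is not the DataFrame method; note that with this kind of per-row sampling the resulting sizes are only approximately 80/20.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train, test = random_split(rows, 0.8, seed=42)

# Every row lands in exactly one subset; sizes are close to 800 / 200.
print(len(train) + len(test))  # 1000
```

Fixing the seed is what makes an experiment repeatable: the same rows end up in the same subset on every run.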

# Generate Pipeline
We will now create the following pipeline:

![pipeline](http://hadoop.cesga.es/files/sentiment_analysis/pipeline.jpg)

## Binarizer
Let's generate our `label` column from the `overall` column. Our label column should contain only 0 or 1, so we have to transform the overall column so it will contain 0 for reviews below 2.5 and 1 for reviews above 2.5.

For that we will use a Binarizer, a transformer that converts numerical features to binary (0/1) features:

In [None]:
from pyspark.ml.feature import Binarizer

binarizer = Binarizer(...)
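
The thresholding rule itself is simple: values strictly greater than the threshold become 1.0 and all other values become 0.0. A plain-Python sketch of that rule (the function `binarize` is a hypothetical helper, not part of pyspark):

```python
def binarize(value, threshold):
    """Return 1.0 if value is strictly above the threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0

# 1-star and 5-star scores, plus the boundary value itself.
labels = [binarize(v, 2.5) for v in [1.0, 5.0, 2.5]]
print(labels)  # [0.0, 1.0, 0.0]
```

Since only 1-star and 5-star reviews remain after filtering, any threshold between 1 and 5 would produce the same labels here.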

## Tokenizer
Now let's start working on our feature vector. The first step is to split our `reviewText` column into words.

For that we will use a Tokenizer, a transformer that converts the input string to lowercase and then splits it on whitespace.

In [None]:
from pyspark.ml.feature import Tokenizer
tokenizer = ...
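
In plain Python, the same idea can be approximated as follows. This is a sketch rather than the Tokenizer itself: Python's `str.split()` collapses runs of whitespace, so the handling of repeated spaces may differ from Spark's.

```python
def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

tokens = tokenize("This book is GREAT and moving")
print(tokens)  # ['this', 'book', 'is', 'great', 'and', 'moving']
```

Note that punctuation is not stripped, so a word followed by a comma or a period stays attached to that character.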

## StopWordsRemover
Next we have to remove all stop words. Stop words are words that are filtered out (i.e. stopped) before processing natural language text because they carry little meaning, for example "the", "a", "and" or "or".

For that we will use a StopWordsRemover, a transformer that filters stop words out of its input. Note: null values in the input array are preserved unless null is explicitly added to stopWords.

In [None]:
from pyspark.ml.feature import StopWordsRemover
remover = ...

We can see the list of words that will be removed with:

In [None]:
remover.getStopWords()
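
Conceptually, the filtering step is a membership test against the stop word list. A plain-Python sketch, using a short made-up list (`STOP_WORDS` is an assumption for illustration; the real default list is much longer):

```python
# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"the", "a", "and", "or", "is", "this"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not stop words, preserving order."""
    return [t for t in tokens if t not in stop_words]

filtered = remove_stop_words(["this", "book", "is", "great", "and", "moving"])
print(filtered)  # ['book', 'great', 'moving']
```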

## HashingTF
Last but not least, we will convert our sequence of words into a vector that represents the text and that we will store in our `features` column.

For that we will use the HashingTF transformer, which converts a sequence of words into a fixed-length feature vector. It maps a sequence of terms to their term frequencies using a hashing function.

In [None]:
from pyspark.ml.feature import HashingTF
hashingTF = ...
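
The hashing trick behind this transformer can be sketched in a few lines of plain Python: each term is hashed to an index in a fixed-length vector, and the value at that index counts how often the term occurs. The sketch below uses `zlib.crc32` as a stand-in hash function, which is an assumption made for illustration and not the hash Spark uses, so the indices will not match those of HashingTF.

```python
import zlib

def hashing_tf(tokens, num_features):
    """Map tokens to a fixed-length term-frequency vector via hashing."""
    vector = [0.0] * num_features
    for token in tokens:
        # Stand-in hash (crc32); deterministic across runs, unlike hash().
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector

features = hashing_tf(["great", "book", "great"], num_features=16)
print(len(features), sum(features))  # 16 3.0
```

Because different terms can hash to the same index, a small `numFeatures` causes collisions, while a larger one costs more memory. This trade-off is why `numFeatures` is worth tuning.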

# Estimator
Now it is time to choose the Estimator, i.e. the machine learning algorithm to use.

We are trying to predict whether a review is positive (1.0) or negative (0.0), so this is a classification problem and we will use the LogisticRegression Estimator, a common algorithm for classification tasks.

## LogisticRegression

We will set the maximum number of iterations to 10 and the regularization parameter to 0.01:

In [None]:
from pyspark.ml.classification import LogisticRegression

In [None]:
lr = LogisticRegression(...)
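
To give some intuition for what the regularization parameter does, here is a plain-Python sketch of a logistic model and an L2-regularized log loss. It shows one common textbook formulation; the exact objective and scaling used inside Spark may differ. The weights, bias and examples are made-up values.

```python
import math

def sigmoid(z):
    """Squash a real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability that the example belongs to the positive class."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def regularized_log_loss(examples, weights, bias, reg_param):
    """Mean log loss plus an L2 penalty (one common formulation)."""
    data_loss = 0.0
    for features, label in examples:
        p = predict_proba(features, weights, bias)
        data_loss -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
    penalty = reg_param * 0.5 * sum(w * w for w in weights)
    return data_loss / len(examples) + penalty

# Hypothetical two-feature examples: (features, label).
examples = [([2.0, 0.0], 1.0), ([0.0, 1.0], 0.0)]
weights, bias = [1.5, -2.0], -0.5

p_positive = predict_proba([2.0, 0.0], weights, bias)  # above 0.5 -> class 1.0
p_negative = predict_proba([0.0, 1.0], weights, bias)  # below 0.5 -> class 0.0
loss_unregularized = regularized_log_loss(examples, weights, bias, reg_param=0.0)
loss_regularized = regularized_log_loss(examples, weights, bias, reg_param=0.01)
```

A larger regularization parameter penalizes large weights more strongly, which tends to reduce overfitting at the cost of a looser fit to the training data. The maximum number of iterations simply bounds how long the optimizer may search for the weights.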

# Pipeline

Finally we will create a pipeline that merges all the steps:

In [None]:
from pyspark.ml import Pipeline
pipeline = ...

And we can now fit our model:

In [None]:
%%time
pipeLineModel = pipeline.fit(trainingData)

# Evaluation

Now let's see how well our model performs.

Choose an Evaluator and get some metrics about our classification (e.g. the area under the ROC curve, AUR):

In [None]:
%%time
...
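
As background on the metric, the area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The plain-Python sketch below computes it directly from that definition by comparing every positive/negative pair. This is an illustration of the metric, not the way an Evaluator computes it, and the scores and labels are made up.

```python
def area_under_roc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1.0]
    negatives = [s for s, y in zip(scores, labels) if y == 0.0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted probabilities and true labels.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1.0, 0.0, 1.0, 0.0]
auc = area_under_roc(scores, labels)
print(auc)  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the metric does not depend on any particular decision threshold.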

# Hyperparameter Tuning

Finally, let's see if we can fine-tune some of the parameters. We will try the following:
- For HashingTF we will tune the `numFeatures` parameter to see how it influences our result. We will try 10000 and 100000.
- For LogisticRegression we will tune the `regParam` and `maxIter` parameters. For the first one we will use 0.01, 0.1 and 1.0, and for the second one we will use 10 and 20 as the maximum number of iterations.

For that we will use a CrossValidator with 3 folds.

In [None]:
%%time
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
param_grid = ParamGridBuilder() \
            .addGrid(hashingTF.numFeatures, [..., ...]) \
            .addGrid(lr.regParam, ...) \
            .addGrid(...) \
            .build()
            
cv = (CrossValidator()
      .setEstimator(...)
      .setEvaluator(...)
      .setEstimatorParamMaps(param_grid)
      .setNumFolds(...))

cv_model = cv.fit(trainingData)
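
It helps to estimate the cost of this search before running it. The following plain-Python sketch enumerates the parameter combinations described above and counts how many models cross-validation has to train; it only does the arithmetic and does not use the pyspark tuning classes.

```python
from itertools import product

# Candidate values taken from the description above.
num_features_values = [10000, 100000]
reg_param_values = [0.01, 0.1, 1.0]
max_iter_values = [10, 20]
num_folds = 3

# Every combination of the three parameters is one grid point.
grid = [
    {"numFeatures": n, "regParam": r, "maxIter": m}
    for n, r, m in product(num_features_values, reg_param_values, max_iter_values)
]

fits_during_cv = len(grid) * num_folds
print(len(grid), fits_during_cv)  # 12 36
```

So the grid has 2 x 3 x 2 = 12 combinations, and with 3 folds the pipeline is fitted 36 times during cross-validation, plus a final fit of the best combination on the whole training set. This is why the cell can take much longer than fitting a single pipeline.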

Finally let's evaluate the performance of the tuned model:

In [None]:
%%time
new_predictions = cv_model.transform(testData)
new_aur = evaluator.evaluate(new_predictions)

Let's see the new AUR value:

In [None]:
new_aur