<a href="http://www.calstatela.edu/centers/hipic"><img align="left" src="https://avatars2.githubusercontent.com/u/4156894?v=3&s=100"/>
</a>
<img align="right" alt="California State University, Los Angeles" src="http://www.calstatela.edu/sites/default/files/groups/California%20State%20University%2C%20Los%20Angeles/master_logo_full_color_horizontal_centered.svg" style="width: 360px;"/>

# CIS5560 Term Project Tutorial

------
#### Authors: [Monika Mishra](https://www.linkedin.com/in/monika-mishra-8b2a4115/), [Amogh Mahesh](https://www.linkedin.com/in/amoghmahesh/), [Aakanksha Tasgaonkar](https://www.linkedin.com/in/aakanksha-tasgaonkar-272ba393/)

#### Instructor: [Jongwook Woo](https://www.linkedin.com/in/jongwook-woo-7081a85)

#### Date: 04/29/2019

## Text Analysis
In this lab, you will create a classification model that performs sentiment analysis on product reviews.
### Import Spark SQL and Spark ML Libraries

First, import the libraries you will need:

In [4]:
from pyspark.sql.types import *
from pyspark.sql.functions import *

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer, StopWordsRemover

### Load Source Data
Now load the data into a DataFrame. The data consists of product reviews.

### Read the CSV file from DBFS (Databricks File System)

### TODO 1: Upload the file under Data in the left frame, then follow the steps below to create and read your table
### NOTE: See reference [[1](https://docs.databricks.com/user-guide/tables.html#create-a-table)]
1. After the ratings.csv file is added under Data in the left frame, create a table using the UI, choosing "Upload File".
1. Click "Preview Table" to view the table, then select the option "First line is header", because the first row of ratings.csv is a header.
1. When you click the "Create Table" button, note the table name, for example, _ratings_csv_.

In [7]:
text_csv = sqlContext.sql("SELECT * FROM ratings_csv")

text_csv.show(5)

### Prepare the Data
The features for the classification model will be derived from the "review_headline" column. The label is derived from star_rating (an integer from 1 to 5): a review is labeled positive (1) if star_rating > 3 and negative (0) otherwise.

In [9]:
textdata = text_csv.select("review_headline", ((col("star_rating") > 3).cast("Integer").alias("label")))
textdata.show(truncate = False)
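
The labeling rule can be checked in plain Python without a cluster. This is a minimal sketch, not part of the lab: `to_label` is a hypothetical helper that mirrors what `(col("star_rating") > 3).cast("Integer")` does, including the fact that a missing rating yields a missing label in Spark.

```python
def to_label(star_rating):
    """Hypothetical helper: 1 for ratings above 3, 0 otherwise, None if missing."""
    if star_rating is None:
        # In Spark, comparing null with 3 gives null, and the cast keeps it null
        return None
    return int(star_rating > 3)

# Map every possible star rating to its binary label
labels = {stars: to_label(stars) for stars in range(1, 6)}
print(labels)  # {1: 0, 2: 0, 3: 0, 4: 1, 5: 1}
```

Note that a 3-star review counts as negative under this rule.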

## Show Donut Chart for the ratings 

1. Select Pie chart 
1. Select __Plot Options__ with (Keys: label, Series Grouping: Keep it blank, Values: count, Aggregation: SUM, Donut: checked)

In [11]:
display(textdata.groupBy("label").count().orderBy("label"))

label,count
0,24453
1,114144
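
The counts above show that the two classes are far from balanced. The short plain-Python sketch below, using only the two counts from the output, computes the share of positive reviews; a classifier that always predicted "positive" would already be correct for that share of the full data set, which is worth keeping in mind when reading the metrics later.

```python
# Counts copied from the groupBy("label").count() output above
counts = {0: 24453, 1: 114144}

total = sum(counts.values())
positive_share = counts[1] / total

print(f"total reviews = {total}")
print(f"positive share = {positive_share:.4f}")
```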


### Split the Data
As in most classification workflows, you'll split the data into a training set (70%) and a test set (30%) for evaluating the trained model.

In [13]:
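# 70/30 random split; the seed is fixed for repeatability
splits = textdata.randomSplit([0.7, 0.3], seed=0)
textrain = splits[0]
# rename the test label so it is not confused with the model's output columns
textest = splits[1].withColumnRenamed("label", "trueLabel")
textrain_rows = textrain.count()
textest_rows = textest.count()
print("Training Rows:", textrain_rows, " Testing Rows:", textest_rows)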
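
The split sizes are only approximately 70% and 30%, because each row is assigned to a side at random rather than by exact count. The plain-Python sketch below illustrates that idea under stated assumptions: `random_split` is a hypothetical helper that uses Python's `random` module, not Spark's sampler, so it will not reproduce the same rows as `randomSplit`.

```python
import random

def random_split(rows, weights, seed):
    """Hypothetical two-way split: each row goes to 'train' with probability
    weights[0] / sum(weights), otherwise to 'test'."""
    rng = random.Random(seed)  # a fixed seed gives the same split on every run
    threshold = weights[0] / sum(weights)
    train, test = [], []
    for row in rows:
        (train if rng.random() < threshold else test).append(row)
    return train, test

rows = list(range(1000))
train, test = random_split(rows, [0.7, 0.3], seed=0)
print("Training Rows:", len(train), " Testing Rows:", len(test))

# Re-running with the same seed reproduces the same split
train_again, test_again = random_split(rows, [0.7, 0.3], seed=0)
print("Same split:", train == train_again and test == test_again)
```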

In [14]:
textest.show(5)

In [15]:
textdata.show(5,truncate = False)

### Define the Pipeline
The pipeline for the model consists of the following stages:
- A Tokenizer to split the review headlines into individual words.
- A StopWordsRemover to remove common words such as "a" or "the" that have little predictive value.
- A HashingTF class to generate numeric vectors from the text values.
- A LogisticRegression algorithm to train a binary classification model.

In [17]:
# split each review headline into a list of lowercase words
tokenizer = Tokenizer(inputCol="review_headline", outputCol="Words")
# remove stop words
swr = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="MeaningfulWords")
# hash each word to a feature index and count term frequencies
hashTF = HashingTF(inputCol=swr.getOutputCol(), outputCol="features")
# logistic regression classifier trained on the hashed features
lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10, regParam=0.01)

# chain the stages: 3 transformers (tokenizer, swr, hashTF) followed by 1 estimator (lr)
pipeline = Pipeline(stages=[tokenizer, swr, hashTF, lr])
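
To see what the three feature stages do to a single headline, here is a rough plain-Python sketch. Everything in it is an illustrative assumption rather than Spark's implementation: the stop-word list is a tiny made-up set, the bucket count is deliberately small, and `zlib.crc32` stands in for the MurmurHash3 function that Spark's HashingTF uses, so the bucket indices will differ from Spark's.

```python
import zlib
from collections import Counter

# Tiny illustrative stop-word list; Spark's default list is much longer
STOP_WORDS = {"a", "an", "the", "and", "is", "it", "this"}
# Deliberately small; Spark's HashingTF defaults to 2**18 buckets
NUM_FEATURES = 16

def tokenize(text):
    # Lowercase and split on whitespace, in the spirit of Tokenizer
    return text.lower().split()

def remove_stop_words(words):
    # Drop words that carry little predictive value
    return [w for w in words if w not in STOP_WORDS]

def hashing_tf(words, num_features=NUM_FEATURES):
    # Map each word to a bucket index and count occurrences per bucket.
    # Different words can collide in the same bucket.
    return dict(Counter(zlib.crc32(w.encode("utf-8")) % num_features for w in words))

headline = "This is a great great product"
words = tokenize(headline)
meaningful = remove_stop_words(words)
features = hashing_tf(meaningful)

print(words)
print(meaningful)
print(features)  # sparse vector as {bucket index: term count}
```

The resulting dictionary plays the role of the sparse "features" vector that the logistic regression stage consumes.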

### Run the Pipeline as an Estimator
The pipeline itself is an estimator, and so it has a **fit** method that you can call to run the pipeline on a specified DataFrame. In this case, you will run the pipeline on the training data to train a model.

In [19]:
piplineModel = pipeline.fit(textrain)
print ("Pipeline complete!")

### Test the Pipeline Model
The model produced by the pipeline is a transformer that will apply all of the stages in the pipeline to a specified DataFrame and apply the trained model to generate predictions. In this case, you will transform the **test** DataFrame using the pipeline to generate label predictions.

In [21]:
prediction = piplineModel.transform(textest)
predicted = prediction.select("review_headline", "prediction", "trueLabel")
predicted.show(10)

In [22]:
predicted10 = prediction.select("*")
predicted10.show(10)

### Compute Confusion Matrix Metrics
Classifiers are typically evaluated by creating a *confusion matrix*, which indicates the number of:
- True Positives
- True Negatives
- False Positives
- False Negatives

From these core measures, other evaluation metrics such as *precision* and *recall* can be calculated.

In [24]:
# count each cell of the confusion matrix on the test predictions
tp = float(predicted.filter("prediction == 1.0 AND trueLabel == 1").count())
fp = float(predicted.filter("prediction == 1.0 AND trueLabel == 0").count())
tn = float(predicted.filter("prediction == 0.0 AND trueLabel == 0").count())
fn = float(predicted.filter("prediction == 0.0 AND trueLabel == 1").count())
metrics = spark.createDataFrame([
 ("TP", tp),
 ("FP", fp),
 ("TN", tn),
 ("FN", fn),
 ("Precision", tp / (tp + fp)),
 ("Recall", tp / (tp + fn))],["metric", "value"])
metrics.show()

## Show Bar Chart for metrics 

1. Select Bar chart 
1. Select __Plot Options__ with (Key: metric, Series Grouping: Keep it blank, Values: value, Aggregation: SUM, Grouped)

In [26]:
display(metrics)

metric,value
TP,32503.0
FP,3920.0
TN,3516.0
FN,1799.0
Precision,0.8923756966751778
Recall,0.947554078479389


Metrics:
- True Positive = 32503
- False Positive = 3920
- True Negative = 3516
- False Negative = 1799
- Precision = 0.8923756966751778
- Recall = 0.947554078479389
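
The same four counts also yield other common metrics. The plain-Python sketch below recomputes precision and recall from the values listed above and adds accuracy, F1, and specificity (the true negative rate); these three are not computed in the lab itself and are shown only as a possible extension.

```python
# Confusion-matrix counts copied from the metrics above
tp, fp, tn, fn = 32503, 3920, 3516, 1799

precision = tp / (tp + fp)            # share of predicted positives that are correct
recall = tp / (tp + fn)               # share of actual positives that are found
accuracy = (tp + tn) / (tp + fp + tn + fn)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)          # share of actual negatives that are found

for name, value in [("Precision", precision), ("Recall", recall),
                    ("Accuracy", accuracy), ("F1", f1),
                    ("Specificity", specificity)]:
    print(f"{name}: {value:.4f}")
```

Specificity comes out below 0.5: of the 7436 negative reviews in the test set, 3920 are predicted as positive. Precision and recall alone do not show this, which is why it helps to look at the negative class separately when the data is imbalanced.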

### Review the Area Under ROC
Another way to assess the performance of a classification model is to measure the area under the ROC curve for the model. The spark.ml library includes a **BinaryClassificationEvaluator** class that you can use to compute this.

In [29]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
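# NOTE: rawPredictionCol="prediction" passes the hard 0/1 predictions, so the ROC curve has a single operating point.
# The evaluator's default, rawPredictionCol="rawPrediction", would use the model's continuous scores and give a different value.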
evaluator = BinaryClassificationEvaluator(labelCol="trueLabel", rawPredictionCol="prediction", metricName="areaUnderROC")
aur = evaluator.evaluate(prediction)
print ("AUR = ", aur)


AUR = 0.7101944679648156
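
Because the evaluator above is given the hard 0/1 `prediction` column rather than continuous scores, the ROC curve has only one interior point, (FPR, TPR). The area under the line segments through (0, 0), (FPR, TPR), and (1, 1) then reduces to (TPR + 1 - FPR) / 2. The plain-Python sketch below checks this against the confusion-matrix counts reported earlier; it appears consistent with the value printed above, though it is a hand calculation and not Spark's own code.

```python
# Confusion-matrix counts copied from the metrics section
tp, fp, tn, fn = 32503, 3920, 3516, 1799

tpr = tp / (tp + fn)  # true positive rate (recall)
fpr = fp / (fp + tn)  # false positive rate

# Trapezoid areas under the segments (0,0)-(fpr,tpr) and (fpr,tpr)-(1,1)
auc_hard = 0.5 * fpr * tpr + (1 - fpr) * (tpr + 1) / 2

# The same area in closed form
auc_closed_form = (tpr + 1 - fpr) / 2

print("AUC from hard predictions =", auc_hard)
```

Evaluating with the model's continuous scores instead would trace the full ROC curve and would generally give a different number.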