<a href="http://www.calstatela.edu/centers/hipic"><img align="left" src="https://avatars2.githubusercontent.com/u/4156894?v=3&s=100">
</a>
<img align="right" alt="California State University, Los Angeles" src="http://www.calstatela.edu/sites/default/files/groups/California%20State%20University%2C%20Los%20Angeles/master_logo_full_color_horizontal_centered.svg" style="width: 360px;"/>

#    CIS5560 Term Project Tutorial 
##   PySpark Collaborative Filtering in Databricks

------
#### Authors: [Monika Mishra](https://www.linkedin.com/in/monika-mishra-8b2a4115/), [Amogh Mahesh](https://www.linkedin.com/in/amoghmahesh/), [Aakanksha Tasgaonkar](https://www.linkedin.com/in/aakanksha-tasgaonkar-272ba393/)

#### Instructor: [Jongwook Woo](https://www.linkedin.com/in/jongwook-woo-7081a85)

#### Date: 04/26/2019

## Collaborative Filtering
Collaborative filtering is a machine learning technique that predicts the ratings users would give to items, based on the ratings that they and other users have already given.

### Import the ALS class
In this exercise, you will use the Alternating Least Squares (ALS) collaborative filtering algorithm to create a recommender.

### Two types of user preferences:

__Explicit preference__ (also referred to as "explicit feedback"), such as a rating that a user gives to an item. This is the default for ALS.

__Implicit preference__ (also referred to as "implicit feedback"), such as a user's "view" and "buy" history.
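
To make the difference concrete, the sketch below shows one common way that implicit counts are turned into a binary preference plus a confidence weight, following the formulation of Hu, Koren, and Volinsky for implicit-feedback ALS. It is a plain-Python illustration, not Spark's implementation; the function name `implicit_to_preference` and the view counts are hypothetical.

```python
def implicit_to_preference(count, alpha=1.0):
    """Map an interaction count to a (preference, confidence) pair.

    preference is 1.0 if any interaction was observed, else 0.0;
    confidence is 1 + alpha * count, so repeated interactions weigh more.
    """
    preference = 1.0 if count > 0 else 0.0
    confidence = 1.0 + alpha * count
    return preference, confidence

# Hypothetical view counts for one user
views = {"Books": 3, "Toys": 0, "Music": 1}
weighted = {item: implicit_to_preference(n, alpha=2.0) for item, n in views.items()}
for item, (pref, conf) in weighted.items():
    print(item, pref, conf)
```

An explicit rating, by contrast, is used directly as the value to predict, which is why this tutorial does not need the `alpha` weight.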

### You will use only the explicit preference in this tutorial

In [4]:
# ALS estimator, the RMSE evaluator, and the tuning helpers used in this tutorial
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

from pyspark.sql import functions as F

# StringIndexer converts the product_category strings to numeric indices
from pyspark.ml.feature import StringIndexer


### Load Source Data
The source data for the recommender is in a single CSV file, _amazon_rec.csv_.

Read the CSV file from DBFS (Databricks File System).

### Follow these steps to create a table from the file after uploading it under Data in the left frame

1. After the _amazon_rec.csv_ file is added under Data in the left frame, create a table using the UI, specifically the "Upload File" option
1. Click "Preview Table" to view the table, and select the option "First line is header", because _amazon_rec.csv_ has a header as its first row
1. __Change the data type__ of the table columns: (customer_id: int, star_rating: double, product_category: string)
1. When you click the create table button, remember the table name, for example, _amazon_

### Query the table created in the steps above using Spark SQL and assign the result to ratings_all
#### For example, _spark.sql("SELECT * FROM amazon")_ selects every column; the cell below selects only the columns needed for the recommender

In [8]:
ratings_all = spark.sql("SELECT customer_id, product_category, star_rating FROM amazon")
ratings_all.show(5)

### Count the number of rows

In [10]:
ratings_all.count()

### Drop duplicate rows, if any are present

In [12]:
ratings = ratings_all.drop_duplicates()
ratings.count()

### The ALS model requires all inputs to be numerical.
### Use StringIndexer to convert product_category to product_categoryIdx

In [14]:
new_rating = ratings
strIdx = StringIndexer(inputCol = "product_category", outputCol = "product_categoryIdx")
sm = strIdx.fit(new_rating)
new_rating = sm.transform(new_rating)
new_rating.show(10)
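
To see what the indexing step does, here is a minimal plain-Python sketch that mimics StringIndexer's default behavior of giving index 0 to the most frequent label. The alphabetical tie-breaking is an assumption that may differ between Spark versions, and the function name `fit_string_index` and the sample categories are hypothetical.

```python
from collections import Counter

def fit_string_index(values):
    """Return {label: index}, most frequent label first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    # StringIndexer emits the index as a double, so floats are used here too
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical product categories
categories = ["Books", "Toys", "Books", "Music", "Books", "Toys"]
mapping = fit_string_index(categories)
indexed = [mapping[c] for c in categories]
print(mapping)
print(indexed)
```

The learned mapping plays the role of the fitted model `sm`: it is built once from the data and then applied to every row.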

### Prepare the Data
To prepare the data, split it into a training set and a test set.

In [16]:
# Keep only the user, item, and rating columns that ALS needs
data = new_rating.select("customer_id", "product_categoryIdx", "star_rating")
# Split the rows randomly: about 70% for training and 30% for testing
splits = data.randomSplit([0.7, 0.3])
train = splits[0].withColumnRenamed("star_rating", "label")
test = splits[1].withColumnRenamed("star_rating", "trueLabel")
train_rows = train.count()
test_rows = test.count()
print("Training Rows:", train_rows, " Testing Rows:", test_rows)
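
The row counts printed above will only be close to 70% and 30%, because a weighted random split assigns each row independently. The following plain-Python sketch illustrates that idea under the assumption of one uniform random draw per row; `random_split` and the sample rows are hypothetical and this is not Spark's implementation.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split using an independent uniform random draw."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split's share of the [0, 1) interval
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point shortfall in the bounds
            if x < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))
train_part, test_part = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_part), len(test_part))
```

Fixing the seed makes the split repeatable. In PySpark, `randomSplit` accepts an optional `seed` argument for the same purpose.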

### Build the Recommender
In ALS, users and products are described by a small set of latent features (factors) that can be used to predict missing entries.

The ALS class is an estimator, so you can use its **fit** method to train a model, or you can include it in a pipeline. Rather than specifying a feature vector and a label, the ALS algorithm requires a numeric user ID, item ID, and rating.
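
To illustrate how latent factors can fill in missing ratings, here is a minimal plain-Python sketch of alternating least squares with a single latent factor per user and per item. It is a simplification under stated assumptions (rank 1, a plain L2 penalty `reg`, tiny hypothetical data) and is not the algorithm as Spark implements it; the function name `als_rank1` is hypothetical.

```python
def als_rank1(ratings, n_iter=10, reg=0.1):
    """ratings: dict {(user, item): rating}. Returns (user_factors, item_factors)."""
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    x = {u: 1.0 for u in users}  # one latent factor per user
    y = {i: 1.0 for i in items}  # one latent factor per item
    for _ in range(n_iter):
        # Hold item factors fixed and solve the regularized least squares per user
        for u in users:
            num = sum(r * y[i] for (uu, i), r in ratings.items() if uu == u)
            den = reg + sum(y[i] ** 2 for (uu, i) in ratings if uu == u)
            x[u] = num / den
        # Hold user factors fixed and solve per item
        for i in items:
            num = sum(r * x[u] for (u, ii), r in ratings.items() if ii == i)
            den = reg + sum(x[u] ** 2 for (u, ii) in ratings if ii == i)
            y[i] = num / den
    return x, y

# Hypothetical ratings; user 3 has not rated item "B"
observed = {(1, "A"): 5.0, (1, "B"): 3.0, (2, "A"): 4.0, (2, "B"): 2.0, (3, "A"): 5.0}
user_f, item_f = als_rank1(observed)
missing = user_f[3] * item_f["B"]
print("Predicted rating for user 3, item B:", round(missing, 2))
```

Each half-step is an ordinary least-squares problem, which is what makes the alternating scheme practical: the product of a user factor and an item factor gives the predicted rating, including for pairs that were never observed.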

In [18]:
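# coldStartStrategy="drop" discards NaN predictions for unseen users or items, so the tuning RMSE is not NaN
als = ALS(userCol="customer_id", itemCol="product_categoryIdx", ratingCol="label", coldStartStrategy="drop")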

#### Add paramGrid and Validation

In [20]:
# alpha is omitted: ALS uses it only when implicitPrefs=True, so it has no effect on explicit ratings
paramGrid = ParamGridBuilder() \
                    .addGrid(als.rank, [1, 5]) \
                    .addGrid(als.maxIter, [5, 10]) \
                    .addGrid(als.regParam, [0.3, 0.1, 0.01]) \
                    .build()
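
A parameter grid is the Cartesian product of the listed values, and every combination is one ALS model that has to be trained during tuning. The plain-Python sketch below counts the combinations for the rank, maxIter, and regParam values used in this tutorial; `expand_grid` is a hypothetical helper, not part of PySpark.

```python
from itertools import product

grid = {
    "rank": [1, 5],
    "maxIter": [5, 10],
    "regParam": [0.3, 0.1, 0.01],
}

def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]

param_maps = expand_grid(grid)
print(len(param_maps))   # 2 * 2 * 3 combinations
print(param_maps[0])

# Adding a fourth parameter with two values doubles the number of models to fit
with_alpha = expand_grid({**grid, "alpha": [2.0, 3.0]})
print(len(with_alpha))
```

Because ALS uses alpha only when implicitPrefs=True, tuning it on explicit ratings doubles the training time without changing the resulting models.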



In [21]:
cv = TrainValidationSplit(estimator=als, evaluator=RegressionEvaluator(), estimatorParamMaps=paramGrid, trainRatio=0.8)
model = cv.fit(train)

### Test the Recommender
Now that you've trained the recommender, you can see how accurately it predicts known ratings in the test set.

In [23]:
prediction = model.transform(test)

# Remove NaN values from prediction (due to SPARK-14489) [1]
prediction = prediction.filter(~F.isnan(F.col("prediction")))

# Round predictions to the nearest whole number and take the absolute value
prediction = prediction.withColumn("prediction", F.abs(F.round(prediction["prediction"], 0)))
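
NaN handling is easy to get wrong because the comparison rules differ between environments: Spark SQL treats NaN as equal to NaN, while plain Python follows IEEE 754, where NaN is unequal to everything, including itself. The plain-Python sketch below, using hypothetical prediction values, shows why an explicit NaN test such as `math.isnan` (or `F.isnan` on a Spark column) states the intent unambiguously.

```python
import math

nan = float("nan")
predictions = [4.2, nan, 3.1, nan, 1.7]  # hypothetical model output

# In plain Python, NaN != NaN is True, so this comparison removes nothing
kept_by_comparison = [p for p in predictions if p != float("nan")]

# An explicit NaN test removes exactly the NaN entries
kept_by_isnan = [p for p in predictions if not math.isnan(p)]

# Round to whole numbers and take the absolute value, as in the cell above
rounded = [abs(round(p)) for p in kept_by_isnan]

print(len(kept_by_comparison), len(kept_by_isnan))
print(rounded)
```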


In [24]:
prediction.join(new_rating, ["customer_id", "product_categoryIdx"]).select("customer_id", "product_category", "prediction", "trueLabel").show(50, truncate=False)

#### RegressionEvaluator
Calculate RMSE using RegressionEvaluator.

__NOTE:__ make sure to set [predictionCol="prediction"]

In [26]:
# RegressionEvaluator: predictionCol="prediction", metricName="rmse"

evaluator = RegressionEvaluator(labelCol="trueLabel", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(prediction)

print("Root Mean Square Error (RMSE):", rmse)
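
For reference, RMSE is the square root of the mean squared difference between predicted and true ratings. The plain-Python sketch below computes it by hand on hypothetical values; `root_mean_squared_error` is an illustrative helper, not a PySpark function.

```python
import math

def root_mean_squared_error(predictions, labels):
    """Square root of the mean squared difference between predictions and labels."""
    if len(predictions) != len(labels) or not predictions:
        raise ValueError("predictions and labels must be non-empty and of equal length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical rounded predictions and true star ratings
predicted = [4.0, 3.0, 5.0, 2.0]
actual = [5.0, 3.0, 4.0, 4.0]
score = root_mean_squared_error(predicted, actual)
print(round(score, 4))
```

RMSE is expressed in the same units as the ratings, so a lower value means the predictions are, on average, closer to the true star ratings.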

**References**

1. Predicting Song Listens Using Apache Spark, https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3175648861028866/48824497172554/657465297935335/latest.html
1. Dataset: https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_multilingual_US_v1_00.tsv.gz