<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://cdn2.hubspot.net/hubfs/438089/docs/training/dblearning-banner.png" alt="Databricks Learning" width="555" height="64">
</div>

&copy; 2018 Databricks, Inc. All rights reserved.<br/>

### Challenges
* The business wants better product recommendations and higher conversion on the website and in emails
* Data scientists spend most of their time connecting and wrangling data, and very little on actual data science
* Data science is hard to scale from sample data to large data sets


### Azure Databricks Solutions
* With all the data in one place (Azure Storage, Azure Data Lake), data scientists can spend their time on data science
* Azure Databricks scales ML to GB, TB, or PB of data
* Easily go into production with ML (save results to CosmosDB)

### Why Contoso uses Azure Databricks for ML
* With millions of users and hundreds of thousands of products, product recommendations need more than a single machine
* Easy APIs for a newer data science team
* Store results in CosmosDB for online serving (emails, website, etc.)

#### Azure Databricks for Machine Learning and Data Scientists
![arch](https://kpistoropen.blob.core.windows.net/collateral/roadshow/azure_roadshow_ml.png)

# ![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) Providing Product Recommendations

One of the most common uses of big data is to predict what users want.  This allows Google to show you relevant ads, Amazon to recommend relevant products, and Netflix to recommend movies that you might like.  This lab will demonstrate how we can use Apache Spark to recommend products to a user.  

We will start with some basic techniques, and then use the SparkML library's Alternating Least Squares method to make more sophisticated predictions. Here are the SparkML [Python docs](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html) and the [Scala docs](https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.ml.package).

For this lesson, we will use around 900,000 historical product ratings from our company Initech.

In this lab:
* *Part 0*: Exploratory Analysis
* *Part 1*: Collaborative Filtering
* *Part 2*: Analysis

## ![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/wiki-book/general/logo_spark_tiny.png) *Part 0:* Exploratory Analysis

Let's start by taking a look at our data.  It's already mounted for us at `dbfs:/mnt/databricks-workshop-datasets/Contoso-retail/initech/productRatings/`.  Exploratory analysis should answer questions such as:

* How many observations do I have?
* What are the features?
* Do I have missing values?
* What do summary statistics (e.g. mean and variance) tell me about my data?

Start by importing the data.  Bind it to `product_ratings` by running the cell below.

In [7]:
product_ratings = spark.read.parquet("dbfs:/mnt/databricks-workshop-datasets/Contoso-retail/initech/productRatings/")

In [8]:
display(product_ratings)

product_id,user_id,rating
2,1,3.5
29,1,3.5
32,1,3.5
31,1,3.5
29,1,4.0
3,2,4.0
1,3,4.0
24,3,3.0
32,3,4.0
31,3,5.0


Take a count of the data using the `count()` DataFrame method.

In [10]:
product_ratings.count()
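
To make the exploratory questions concrete, here is a minimal plain-Python sketch that answers them for only the ten rows displayed above. It is an illustration on a tiny sample, not a substitute for running the analysis on the full DataFrame (where Spark methods such as `count()` and `describe()` do the same work at scale). The variable names are hypothetical.

```python
from statistics import mean, variance

# The ten (product_id, user_id, rating) rows shown in the display output above.
rows = [
    (2, 1, 3.5), (29, 1, 3.5), (32, 1, 3.5), (31, 1, 3.5), (29, 1, 4.0),
    (3, 2, 4.0), (1, 3, 4.0), (24, 3, 3.0), (32, 3, 4.0), (31, 3, 5.0),
]

ratings = [rating for _, _, rating in rows]

n_observations = len(rows)                                  # how many observations?
n_missing = sum(1 for row in rows if None in row)           # any missing values?
n_users = len({user_id for _, user_id, _ in rows})          # distinct users in the sample
n_products = len({product_id for product_id, _, _ in rows}) # distinct products in the sample
rating_mean = mean(ratings)                                 # summary statistics
rating_variance = variance(ratings)                         # sample variance

print(n_observations, n_missing, n_users, n_products, rating_mean, rating_variance)
```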

#### What do these product_ids mean?

* There is a product lookup dataset in Parquet located here: `dbfs:/mnt/databricks-workshop-datasets/Contoso-retail/initech/productsShort/`

In [12]:
# Use the Spark Parquet reader, as for product_ratings above, to read the products data
product_df = spark.read.parquet("dbfs:/mnt/databricks-workshop-datasets/Contoso-retail/initech/productsShort/")

In [13]:
display(product_df)

product_id,category,brand,model,Price,processor,size,display
1,Laptops,HP,"""Spectre x360 2-in-1 13.3"""" 4K Ultra HD Touch-Screen Laptop""",1499.99,,,
2,Laptops,Microsoft,"Surface Pro – 12.3""""",1299.99,,,
3,Laptops,Microsoft,"Surface Book 2 - 13.5""""",1499.99,,,
4,Laptops,Dell,"XPS 2-in-1 13.3""""",1949.99,,,
5,Laptops,Lenovo,"Yoga 920 2-in-1 13.9""""",1799.99,,,
6,Laptops,Apple,"""MacBook Pro - 15"""" Display""",2659.99,,,
7,Laptops,Apple,"""MacBook Pro - 13"""" Display""",1499.99,,,
8,Laptops,Apple,"""MacBook Pro - 15.4"""" Display""",1999.99,,,
9,Laptops,Apple,"""MacBook Air - 13.3"""" Display""",999.99,,,
10,Laptops,HP,"""Pavilion x360 2-in-1 14"""" Touch-Screen Laptop""",999.99,,,


## ![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/wiki-book/general/logo_spark_tiny.png) *Part 1:* Collaborative Filtering

The image below (from [Wikipedia][collab]) shows an example of predicting a user's rating with collaborative filtering. First, people rate different items (such as videos, products, articles, images, or games). The system then predicts a user's rating for an item that the user has not rated yet. These predictions are built on the existing ratings of other users whose ratings are similar to those of the active user. For instance, in the image below the system has predicted that the active user will not like the video.  
![collaborative filtering](https://courses.edx.org/c4x/BerkeleyX/CS100.1x/asset/Collaborative_filtering.gif)
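
The idea in the paragraph above can be sketched in a few lines of plain Python. This is a neighborhood-based illustration under stated assumptions: the users, items, and ratings are made up, and cosine similarity is just one possible choice of similarity measure. It is not the method this lab uses later (that is ALS), only a way to see "similar users vote on the missing rating" in action.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating}. "alice" has not rated "video_d".
ratings = {
    "alice": {"video_a": 5.0, "video_b": 1.0, "video_c": 4.0},
    "bob":   {"video_a": 4.0, "video_b": 2.0, "video_c": 5.0, "video_d": 1.0},
    "carol": {"video_a": 1.0, "video_b": 5.0, "video_d": 5.0},
}

def cosine_similarity(a, b):
    """Cosine similarity over the items both users rated (0.0 if none shared)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def predict(user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    weighted_sum, similarity_sum = 0.0, 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        sim = cosine_similarity(ratings[user], their_ratings)
        weighted_sum += sim * their_ratings[item]
        similarity_sum += sim
    return weighted_sum / similarity_sum if similarity_sum else None

prediction = predict("alice", "video_d")
print(prediction)
```

Because "bob" rates items much like "alice" does and disliked `video_d`, his low rating carries more weight than "carol"'s high one, so the prediction lands closer to his.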

[SparkML]: http://spark.apache.org/docs/latest/ml-guide.html
[collab]: https://en.wikipedia.org/?title=Collaborative_filtering
[collab2]: http://recommender-systems.org/collaborative-filtering/

In [15]:
# Use 60% of the data for training, and hold out 20% for validation and 20% for testing
seed = 1800009193
(training_df, validation_df, test_df) = product_ratings.randomSplit([.6, .2, .2], seed=seed)
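
A rough plain-Python sketch of what a weighted random split does, assuming each row is assigned independently by a seeded random draw. This illustrates the concept rather than reproducing Spark's implementation; `random_split` is a hypothetical helper. The point to notice is that the split sizes are only approximately 60/20/20, while the seed makes the result reproducible.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to exactly one split using an independent uniform draw."""
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, e.g. [0.6, 0.8, 1.0].
    bounds, running = [], 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for k, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < upper or k == len(bounds) - 1:
                splits[k].append(row)
                break
    return splits

rows = list(range(10_000))  # stand-in for rating rows
train_rows, validation_rows, test_rows = random_split(rows, [0.6, 0.2, 0.2], seed=1800009193)
print(len(train_rows), len(validation_rows), len(test_rows))
```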

In [16]:
display(training_df)

product_id,user_id,rating
1,3,4.0
1,6,5.0
1,10,4.0
1,11,4.5
1,14,4.5
1,19,5.0
1,31,3.0
1,34,5.0
1,39,5.0
1,47,1.0


### My Ratings
* Fill in your ratings for products in the `product_df` above
* Pick 5-10 product ids to rate
* Choose ratings from 1 to 5

In [18]:
my_user_id = 0
my_rated_products = [
     (1, my_user_id, 5), # Replace with your ratings.
     (2, my_user_id, 5),
     (3, my_user_id, 5),
     (4, my_user_id, 5),
     (6, my_user_id, 1),
     (7, my_user_id, 1),
     (9, my_user_id, 1),
     (9, my_user_id, 1),
     (9, my_user_id, 1),
     ]

In [19]:
my_ratings_df = spark.createDataFrame(my_rated_products, ['product_id','user_id','rating'])

Join your ratings with the `product_df` to see your ratings with the product metadata

In [21]:
display(my_ratings_df.join(product_df, ['product_id']))

product_id,user_id,rating,category,brand,model,Price,processor,size,display
1,0,5,Laptops,HP,"""Spectre x360 2-in-1 13.3"""" 4K Ultra HD Touch-Screen Laptop""",1499.99,,,
2,0,5,Laptops,Microsoft,"Surface Pro – 12.3""""",1299.99,,,
3,0,5,Laptops,Microsoft,"Surface Book 2 - 13.5""""",1499.99,,,
4,0,5,Laptops,Dell,"XPS 2-in-1 13.3""""",1949.99,,,
6,0,1,Laptops,Apple,"""MacBook Pro - 15"""" Display""",2659.99,,,
7,0,1,Laptops,Apple,"""MacBook Pro - 13"""" Display""",1499.99,,,
9,0,1,Laptops,Apple,"""MacBook Air - 13.3"""" Display""",999.99,,,
9,0,1,Laptops,Apple,"""MacBook Air - 13.3"""" Display""",999.99,,,
9,0,1,Laptops,Apple,"""MacBook Air - 13.3"""" Display""",999.99,,,


Union your ratings with `training_df` so that the model is trained on your ratings as well.

In [23]:
training_with_my_ratings_DF = training_df.union(my_ratings_df)

### Alternating Least Squares

In this part, we will use the Apache Spark ML Pipeline implementation of Alternating Least Squares, [ALS (Python)](http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.recommendation.ALS) or [ALS (Scala)](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.recommendation.ALS). ALS takes a training dataset (DataFrame) and several parameters that control the model creation process.

The process we will use for determining the best model is as follows:
1. Pick a set of model parameters. ALS approximates the ratings matrix as the product of a Users matrix and a Products matrix. The most important model parameter is the *rank*, which is the number of columns in the Users matrix or, equivalently, the number of rows in the Products matrix. In general, a lower rank means higher error on the training dataset, but a high rank may lead to [overfitting](https://en.wikipedia.org/wiki/Overfitting).  We will train a model with a rank of 2 using the `training_with_my_ratings_DF` dataset.

2. Set the appropriate parameters on the `ALS` object:
    * The "User" column will be set to the values in our `user_id` DataFrame column.
    * The "Item" column will be set to the values in our `product_id` DataFrame column.
    * The "Rating" column will be set to the values in our `rating` DataFrame column.
    * We'll be using a regularization parameter of 0.1.
    
   **Note**: Read the documentation for the ALS class **carefully**. It will help you accomplish this step.
3. Have the fitted model (i.e., the result of `ALS.fit()`) write its predicted values to a _new_ column
   called "prediction".

4. Create a model using `ALS.fit()` for each rank value we want to try; in this lab that is a single model with rank 2. We'll fit 
   against the training data set with our own ratings added (`training_with_my_ratings_DF`).

5. We'll run our prediction against our validation data set (`validation_df`) and check the error.

6. Use `.setColdStartStrategy("drop")` so that rows for users or products that did not appear in the training data, for which the model predicts NaN, are dropped and do not turn the error metric into NaN.
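
To make the "alternating" part of the name less abstract, here is a minimal plain-Python sketch of the idea for rank 1: hold the product factors fixed and solve for the user factors in closed form, then swap roles, and repeat. The toy ratings and all function names are hypothetical. Spark's ALS is distributed, supports higher ranks, and differs in details such as how the regularization is scaled, so treat this as an illustration of the technique and not as Spark's implementation.

```python
# Hypothetical toy ratings as (user, product, rating) triples.
ratings = [
    (0, 0, 5.0), (0, 1, 4.0),
    (1, 0, 4.0), (1, 2, 1.0),
    (2, 1, 5.0), (2, 2, 2.0),
]
n_users, n_products = 3, 3
reg = 0.1  # regularization strength

def objective(user_f, product_f):
    """Regularized squared error that each half-step minimizes."""
    error = sum((r - user_f[u] * product_f[p]) ** 2 for u, p, r in ratings)
    penalty = sum(x * x for x in user_f) + sum(x * x for x in product_f)
    return error + reg * penalty

def update_users(product_f):
    """With products fixed, each user factor has a closed-form ridge solution."""
    num, den = [0.0] * n_users, [reg] * n_users
    for u, p, r in ratings:
        num[u] += r * product_f[p]
        den[u] += product_f[p] ** 2
    return [n / d for n, d in zip(num, den)]

def update_products(user_f):
    """With users fixed, each product factor has a closed-form ridge solution."""
    num, den = [0.0] * n_products, [reg] * n_products
    for u, p, r in ratings:
        num[p] += r * user_f[u]
        den[p] += user_f[u] ** 2
    return [n / d for n, d in zip(num, den)]

user_f = [1.0] * n_users
product_f = [1.0] * n_products
history = [objective(user_f, product_f)]
for _ in range(10):
    user_f = update_users(product_f)
    product_f = update_products(user_f)
    history.append(objective(user_f, product_f))

# A predicted rating is the product of the two factors (a dot product for rank > 1).
predicted_user0_product2 = user_f[0] * product_f[2]
print(history[0], history[-1], predicted_user0_product2)
```

Each half-step exactly minimizes the objective over one side, so the objective never increases from one iteration to the next.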

In [25]:
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.pipeline import Pipeline
import pandas
import numpy

# Let's initialize our ALS learner
als = ALS()
(als.setPredictionCol("prediction")
   .setUserCol("user_id")
   .setItemCol("product_id")
   .setRatingCol("rating")
   .setMaxIter(5)
   .setSeed(seed)
   .setRegParam(0.1)
   .setRank(2)
   .setColdStartStrategy("drop")
)

# Fitting our model to the training data:
model = als.fit(training_with_my_ratings_DF)

#Create predictions on our validation dataframe
predict_df = model.transform(validation_df)

#We define our evaluation metrics
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                              predictionCol="prediction")
rmse = evaluator.evaluate(predict_df)
print("Root-mean-square error = " + str(rmse))

### Validation: 
Let's see how we did against known ratings.

In [27]:
display(predict_df)

product_id,user_id,rating,prediction
31,471,5.0,4.5278306
31,1645,5.0,4.138175
31,2366,3.0,3.9707217
31,4101,4.5,3.6996498
31,4935,3.0,2.4631135
31,5518,5.0,4.096911
31,9465,5.0,3.238582
31,11748,5.0,3.8859398
31,12799,3.0,4.531582
31,18800,4.5,4.4509277
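
As a sanity check on what the `rmse` metric means, this plain-Python sketch computes the root-mean-square error for only the ten rows displayed above. The value will differ from the evaluator's result, which covers the whole validation set; the helper name `rmse` is hypothetical.

```python
from math import sqrt

# (rating, prediction) pairs copied from the ten displayed rows above.
pairs = [
    (5.0, 4.5278306), (5.0, 4.138175), (3.0, 3.9707217), (4.5, 3.6996498),
    (3.0, 2.4631135), (5.0, 4.096911), (5.0, 3.238582), (5.0, 3.8859398),
    (3.0, 4.531582), (4.5, 4.4509277),
]

def rmse(pairs):
    """Square each error, average the squares, then take the square root."""
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return sqrt(sum(squared_errors) / len(squared_errors))

sample_rmse = rmse(pairs)
print("Root-mean-square error on the displayed rows = " + str(sample_rmse))
```

Because the errors are squared before averaging, a few large misses (such as predicting 3.2 for a rating of 5.0) weigh more heavily than many small ones.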


## ![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/wiki-book/general/logo_spark_tiny.png) *Part 2:* Your Recommendations:
Let's look at what ALS recommended for your user based on your ratings.

In [29]:
# Get the top 10 recommendations for every user, then keep only the row for your user id
predictions = model.recommendForAllUsers(10)
my_predictions = predictions.filter("user_id = 0")

In [30]:
display(my_predictions)

user_id,recommendations
0,"List(List(1, 3.3164597), List(17, 3.279634), List(34, 3.1747158), List(31, 3.1237397), List(29, 3.118293), List(14, 3.0347006), List(36, 3.026896), List(32, 3.0149722), List(21, 2.9904776), List(11, 2.900852))"


In [31]:
from pyspark.sql.functions import explode

# Turn the array of recommendations into one row per recommended product
my_recs = (my_predictions
   .select("user_id", explode("recommendations").alias("recommendations"))
   .select("user_id", "recommendations.product_id", "recommendations.rating"))

In [32]:
display(my_recs)

user_id,product_id,rating
0,1,3.3164597
0,17,3.279634
0,34,3.1747158
0,31,3.1237397
0,29,3.118293
0,14,3.0347006
0,36,3.026896
0,32,3.0149722
0,21,2.9904776
0,11,2.900852
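
What the `explode` step above does can be pictured in plain Python: one input row holding a list of recommendations becomes one output row per list element. The sketch below uses the first three recommendations from the output above; the tuple layout is an assumption chosen for illustration, not Spark's internal representation.

```python
# One row shaped like the recommendations output: (user_id, [(product_id, rating), ...]).
rows = [
    (0, [(1, 3.3164597), (17, 3.279634), (34, 3.1747158)]),
]

# "Explode": repeat user_id once for every element of the nested list.
flat = [
    (user_id, product_id, rating)
    for user_id, recommendations in rows
    for product_id, rating in recommendations
]

for row in flat:
    print(row)
```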


### Join

In [34]:
from pyspark.sql.functions import explode

# Flatten the recommendations as before, then join to attach the product metadata
my_recs = (my_predictions
   .select("user_id", explode("recommendations").alias("recommendations"))
   .select("user_id", "recommendations.product_id", "recommendations.rating")
   .join(product_df, ['product_id']))

In [35]:
display(my_recs)

product_id,user_id,rating,category,brand,model,Price,processor,size,display
1,0,3.3164597,Laptops,HP,"""Spectre x360 2-in-1 13.3"""" 4K Ultra HD Touch-Screen Laptop""",1499.99,,,
17,0,3.279634,Laptops,Microsoft,"""Surface Book 13.5"""" Touch Screen with Performance Base""",2399.99,,,
34,0,3.1747158,tablets,Dell,Latitude 5285 2-in-1,299.99,Core-i,12.3,920×1280 (3:2)
31,0,3.1237397,tablets,Samsung,"""""""Galaxy Book 10.6"""""""" """"""",299.99,Core-m,10.6,1920×1280 (3:2)
29,0,3.118293,tablets,HP®,Pro x2 612 G2,299.99,Core-m,12.0,1920×1280 (3:2)
14,0,3.0347006,Laptops,Samsung,"""Notebook 9 spin 13.3"""" Touch-Screen Laptop""",1199.99,,,
36,0,3.026896,tablets,Lenovo®,Miix 720,299.99,Core-i,12.0,2880×1920 (3:2)
32,0,3.0149722,tablets,Fujitsu,Stylistic Q737,299.99,Core-i,13.3,1920×1080 (16:9)
21,0,2.9904776,tablets,HP®,Envy x2 (2018),299.99,Core-m,12.3,1920×1280 (16:9)
11,0,2.900852,Laptops,Dell,"""2-in-1 17.3"""" Touch-Screen Laptop""",1299.99,,,


## Next Step

[MLFlow]($../4-ML/4-02 MLFlow)

Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>