# Recommender Code Along

The classic recommender tutorial uses the [MovieLens data set](https://grouplens.org/datasets/movielens/). It plays the same role for recommenders that the iris or MNIST data sets play for other algorithms. Let's do a code along to get an idea of how this all works!


Looking for more datasets? Check out: https://gist.github.com/entaroadun/1653794

In [62]:
from pyspark.sql import SparkSession
# May take a little while on a local computer
spark = SparkSession.builder.appName("Recommender Systems").getOrCreate()

In [63]:
# Check whether the Spark session variable (spark) exists, then print the Spark context configuration
try:
    spark
except NameError:
    print("Spark session does not exist. Please create the Spark session first (run the cell above).")
else:
    configurations = spark.sparkContext.getConf().getAll()
    for item in configurations:
        print(item)

('spark.app.id', 'local-1646750177273')
('spark.app.name', 'Recommender Systems')
('spark.driver.host', '192.168.59.1')
('spark.app.startTime', '1646750177217')
('spark.rdd.compress', 'True')
('spark.serializer.objectStreamReset', '100')
('spark.master', 'local[*]')
('spark.submit.pyFiles', '')
('spark.executor.id', 'driver')
('spark.submit.deployMode', 'client')
('spark.sql.warehouse.dir', 'file:/Users/gerhardwenzel/Development/Spark101/Apache-Spark-Tutorials/spark-warehouse')
('spark.driver.port', '58026')
('spark.ui.showConsoleProgress', 'true')


With collaborative filtering, we make predictions (filtering) about a user's interests by collecting preference or taste information from many users (collaborating). The underlying assumption is that if user A shares user B's opinion on one issue, A is more likely to share B's opinion on a different issue x than the opinion of a randomly chosen user.

The image below (from Wikipedia) shows an example of collaborative filtering. At first, people rate different items (like videos, images, games). Then, the system predicts a user's rating for an item the user has not rated yet. The new predictions are built on the existing ratings of other users whose ratings are similar to the active user's. In the image, the system predicts that the user will not like the video.

<img src=https://upload.wikimedia.org/wikipedia/commons/5/52/Collaborative_filtering.gif />
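
The assumption described above can be sketched in a few lines of plain Python. This is only an illustration of user-based collaborative filtering with cosine similarity; the toy ratings and the helper names (`cosine_similarity`, `predict`) are made up for this example and are not how Spark's ALS works internally.

```python
from math import sqrt

# Toy ratings: user -> {item: rating}. All values are invented for illustration.
ratings = {
    "A": {"video": 5.0, "image": 1.0, "game": 4.0},
    "B": {"video": 5.0, "image": 1.0, "game": 5.0, "book": 2.0},
    "C": {"video": 1.0, "image": 5.0, "book": 5.0},
}


def cosine_similarity(u, v):
    """Cosine similarity over the items both users rated (0.0 if none)."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += sim
    return weighted_sum / weight_total if weight_total else None


# A agrees with B far more than with C, so A's predicted rating for "book"
# is pulled toward B's rating (2.0) rather than C's (5.0).
print(predict("A", "book"))
```

Because A's past ratings resemble B's, the prediction lands closer to B's opinion than a plain average of B and C would.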

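Spark's MLlib machine learning library provides a collaborative filtering implementation based on Alternating Least Squares (ALS). The implementation has the parameters listed here. The names are those of the RDD-based API; the DataFrame-based `ALS` class used below calls `iterations` `maxIter` and `lambda` `regParam`, and splits `numBlocks` into `numUserBlocks` and `numItemBlocks`.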

* numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure).
* rank is the number of latent factors in the model.
* iterations is the number of iterations to run.
* lambda specifies the regularization parameter in ALS.
* implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
* alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.
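
To make `rank`, `iterations`, and `lambda` more concrete, here is a minimal pure-Python sketch of alternating least squares with a single latent factor (rank 1) and plain L2 regularization. The function name `als_rank1`, the toy ratings, and the simplifications (no blocking, no implicit feedback, one factor) are assumptions made for illustration; Spark's implementation differs in many details.

```python
# Observed ratings as (user, item, rating). Values are invented for illustration.
observed = [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 2.0), (1, 1, 1.0), (2, 0, 4.0)]
n_users, n_items = 3, 2


def als_rank1(observed, n_users, n_items, iterations=10, reg=0.01):
    """Alternate closed-form least-squares updates for one factor per user and item."""
    user_factors = [1.0] * n_users
    item_factors = [1.0] * n_items
    for _ in range(iterations):
        # Hold item factors fixed and solve for each user factor.
        for u in range(n_users):
            rows = [(i, r) for (user, i, r) in observed if user == u]
            numerator = sum(r * item_factors[i] for i, r in rows)
            denominator = reg + sum(item_factors[i] ** 2 for i, _r in rows)
            user_factors[u] = numerator / denominator
        # Hold user factors fixed and solve for each item factor.
        for i in range(n_items):
            rows = [(u, r) for (u, item, r) in observed if item == i]
            numerator = sum(r * user_factors[u] for u, r in rows)
            denominator = reg + sum(user_factors[u] ** 2 for u, _r in rows)
            item_factors[i] = numerator / denominator
    return user_factors, item_factors


user_factors, item_factors = als_rank1(observed, n_users, n_items)
# User 2 never rated item 1; the prediction is the product of the two factors.
print(user_factors[2] * item_factors[1])
```

User 2 rated item 0 the same way user 0 did, so the predicted rating for the unseen item 1 ends up close to user 0's rating for it.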

Let's see this all in action!

In [64]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

In [65]:
data = spark.read.csv('data/movielens_ratings.csv', inferSchema=True, header=True)

In [66]:
data.head()

Row(movieId=2, rating=3.0, userId=0)

In [67]:
data.describe().show()

+-------+------------------+------------------+------------------+
|summary|           movieId|            rating|            userId|
+-------+------------------+------------------+------------------+
|  count|              1501|              1501|              1501|
|   mean| 49.40572951365756|1.7741505662891406|14.383744170552964|
| stddev|28.937034065088994| 1.187276166124803| 8.591040424293272|
|    min|                 0|               1.0|                 0|
|    max|                99|               5.0|                29|
+-------+------------------+------------------+------------------+



We can split the data to evaluate how well our model performs, but keep in mind that it is very hard to know conclusively how well a recommender system is truly working for some topics, especially when subjectivity is involved. For example, not everyone who loves Star Wars is going to love Star Trek, even though a recommendation system may suggest otherwise.

In [68]:
# The data set is small, so we use a 0.8 / 0.2 split
(training, test) = data.randomSplit([0.8, 0.2])
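
A random split assigns each row to one side independently, so the resulting sizes are only approximately 80% and 20% and change from run to run. The plain-Python sketch below illustrates that idea with a fixed seed; the function name `random_split` is hypothetical and this is not Spark's code. PySpark's `randomSplit` also accepts an optional `seed` argument if you want repeatable splits.

```python
import random


def random_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to training or test with an independent random draw."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    training, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            training.append(row)
        else:
            test.append(row)
    return training, test


rows = list(range(1501))  # stand-in for the 1501 rating rows
training_rows, test_rows = random_split(rows)
print(len(training_rows), len(test_rows))
```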

In [69]:
# Build the recommendation model using ALS on the training data
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating")
model = als.fit(training)

Now let's see how the model performed!

In [70]:
# Generate predictions for the test data (the RMSE is computed further below)
predictions = model.transform(test)

In [71]:
predictions.show()

+-------+------+------+------------+
|movieId|rating|userId|  prediction|
+-------+------+------+------------+
|      1|   1.0|     6|  -2.4862545|
|      1|   1.0|    14|-0.029667616|
|      1|   3.0|    25|   1.3826276|
|      1|   1.0|    28|    5.173739|
|      6|   1.0|     1|   1.9349315|
|      6|   1.0|     6|  0.31446147|
|      3|   1.0|     9|    0.863403|
|      3|   3.0|    14|   1.6506743|
|      3|   1.0|    29|   0.9501966|
|      5|   2.0|    26|   3.0781229|
|      5|   1.0|    29|   1.5328526|
|      4|   3.0|    10|   0.3294614|
|      4|   1.0|    12|   1.7030511|
|      4|   1.0|    23|   1.2644027|
|      2|   4.0|     8|   5.6450586|
|      2|   4.0|    10|   0.8373728|
|      2|   1.0|    15|    2.296519|
|      2|   1.0|    19|   2.7675266|
|      2|   2.0|    20|  -0.0735859|
|      2|   1.0|    25|   1.4798206|
+-------+------+------+------------+
only showing top 20 rows



In [72]:
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print(f"Root-mean-square error = {rmse}")

Root-mean-square error = 1.8689654288101252


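The RMSE describes our error in the same units as the rating column (stars). An RMSE of about 1.87 on a 1-to-5 scale means the predictions are off by roughly two stars, which is a large error.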
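
For reference, the RMSE is the square root of the mean squared difference between actual and predicted ratings. The sketch below computes it by hand on the first four rows of the predictions table above. The clipping step is an optional post-processing idea that is not part of the original notebook; it assumes ratings lie between 1 and 5, as the `describe()` output suggests.

```python
from math import sqrt


def rmse(labels, predictions):
    """Root-mean-square error between two equally long, non-empty sequences."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equally long")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return sqrt(sum(squared_errors) / len(labels))


# First four rows of the predictions table shown earlier.
actual = [1.0, 1.0, 3.0, 1.0]
predicted = [-2.4862545, -0.029667616, 1.3826276, 5.173739]
print(rmse(actual, predicted))

# Assumption: valid ratings lie in [1, 5], so clip out-of-range predictions.
clipped = [min(5.0, max(1.0, p)) for p in predicted]
print(rmse(actual, clipped))
```

On these four rows, clipping lowers the error because two of the predictions fall below the minimum possible rating.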

So now that we have the model, how would you actually supply a recommendation to a user?

The same way we did with the test data! For example:

In [73]:
single_user = test.filter(test['userId']==11).select(['movieId','userId'])

In [74]:
# User 11 has 13 ratings in this test split (the count varies with the random split)
# Realistically, this should be some sort of hold-out set!
single_user.show()

+-------+------+
|movieId|userId|
+-------+------+
|     16|    11|
|     30|    11|
|     35|    11|
|     38|    11|
|     45|    11|
|     51|    11|
|     69|    11|
|     70|    11|
|     71|    11|
|     77|    11|
|     80|    11|
|     94|    11|
|     99|    11|
+-------+------+



In [75]:
recommendations = model.transform(single_user)

In [76]:
recommendations.orderBy('prediction', ascending=False).show()

+-------+------+-----------+
|movieId|userId| prediction|
+-------+------+-----------+
|     30|    11|  7.1394625|
|     51|    11|   3.928402|
|     71|    11|  3.7681916|
|     38|    11|  3.0069559|
|     77|    11|  2.1474154|
|     70|    11|  2.1207266|
|     45|    11|  1.4916965|
|     69|    11|  1.4902996|
|     16|    11|  1.2746805|
|     80|    11|  1.1755676|
|     99|    11|  0.7980049|
|     94|    11|   0.720018|
|     35|    11|-0.26762515|
+-------+------+-----------+
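
Note that the rows scored above are movies user 11 has already rated in the test set. For real recommendations you would usually score items the user has not rated yet. Here is a minimal plain-Python sketch of building those candidate rows; the helper name `unseen_candidates` and the sample pairs are hypothetical.

```python
def unseen_candidates(user_id, rated_pairs, all_item_ids):
    """Return (item_id, user_id) rows for items the user has not rated yet."""
    seen = {item for user, item in rated_pairs if user == user_id}
    return [(item, user_id) for item in sorted(all_item_ids) if item not in seen]


# Hypothetical (userId, movieId) pairs and item catalogue.
rated_pairs = [(11, 16), (11, 30), (11, 35), (3, 16), (3, 40)]
all_item_ids = {16, 30, 35, 40, 51}

candidates = unseen_candidates(11, rated_pairs, all_item_ids)
print(candidates)  # [(40, 11), (51, 11)]
```

In PySpark, rows like these could be turned into a DataFrame with `spark.createDataFrame(candidates, ['movieId', 'userId'])` and passed to `model.transform`. Recent PySpark versions also offer `ALSModel.recommendForUserSubset`, which may cover this case directly.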



## Recommender Systems Project

The whole world seems to be hearing about your amazing new ability to analyze big data and build useful systems! You've just taken on a contract with a new online food delivery company. This company is trying to differentiate itself by recommending new meals to customers based on other customers' preferences.

Can you build them a recommendation system?

Your final result should be in the form of a function that can take in a Spark DataFrame of a single customer's ratings for various meals and output their top 3 suggested meals.
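
One possible shape for the last step of such a function is sketched here in plain Python. It assumes the predicted ratings have already been collected from the model as `(meal_id, predicted_rating)` pairs; the function name `top_meals` and the scores are hypothetical, and the meal names are a subset of the mapping defined later in this notebook.

```python
import heapq

# Subset of the meal ID -> name mapping defined later in this notebook.
meal_names = {2: "Chicken Curry", 5: "Hamburger", 7: "Nachos", 17: "Sushi Plate", 21: "Lasagna"}


def top_meals(scored_meals, names, n=3):
    """Return the names of the n meals with the highest predicted rating."""
    best = heapq.nlargest(n, scored_meals, key=lambda pair: pair[1])
    return [names.get(meal_id, f"meal {meal_id}") for meal_id, _ in best]


# Hypothetical (meal_id, predicted_rating) pairs for one customer.
scored = [(2, 3.1), (5, 1.2), (7, 4.4), (17, 0.3), (21, 2.7)]
print(top_meals(scored, meal_names))  # ['Nachos', 'Chicken Curry', 'Lasagna']
```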

Best of luck!

In [77]:
import pandas as pd

In [78]:
df = pd.read_csv('data/movielens_ratings.csv')

In [79]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
movieId,1501.0,49.40573,28.937034,0.0,24.0,50.0,74.0,99.0
rating,1501.0,1.774151,1.187276,1.0,1.0,1.0,2.0,5.0
userId,1501.0,14.383744,8.59104,0.0,7.0,14.0,22.0,29.0


In [80]:
df.corr()

Unnamed: 0,movieId,rating,userId
movieId,1.0,0.036569,0.003267
rating,0.036569,1.0,0.056411
userId,0.003267,0.056411,1.0


In [81]:
import numpy as np
# Treat only IDs 0-31 as meal IDs; larger IDs become NaN
df['mealskew'] = df['movieId'].apply(lambda movie_id: np.nan if movie_id > 31 else movie_id)

In [82]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
movieId,1501.0,49.40573,28.937034,0.0,24.0,50.0,74.0,99.0
rating,1501.0,1.774151,1.187276,1.0,1.0,1.0,2.0,5.0
userId,1501.0,14.383744,8.59104,0.0,7.0,14.0,22.0,29.0
mealskew,486.0,15.502058,9.250634,0.0,7.0,15.0,23.0,31.0


In [83]:
mealmap = { 2. : "Chicken Curry",
           3. : "Spicy Chicken Nuggets",
           5. : "Hamburger",
           9. : "Taco Surprise",
           11. : "Meatloaf",
           12. : "Ceaser Salad",
           15. : "BBQ Ribs",
           17. : "Sushi Plate",
           19. : "Cheesesteak Sandwich",
           21. : "Lasagna",
           23. : "Orange Chicken",
           26. : "Spicy Beef Plate",
           27. : "Salmon with Mashed Potatoes",
           28. : "Penne Tomatoe Pasta",
           29. : "Pork Sliders",
           30. : "Vietnamese Sandwich",
           31. : "Chicken Wrap",
           np.nan: "Cowboy Burger",
           4. : "Pretzels and Cheese Plate",
           6. : "Spicy Pork Sliders",
           13. : "Mandarin Chicken Plate",
           14. : "Kung Pao Chicken",
           16. : "Fried Rice Plate",
           8. : "Chicken Chow Mein",
           10. : "Roasted Eggplant",
           18. : "Pepperoni Pizza",
           22. : "Pulled Pork Plate",
           0. : "Cheese Pizza",
           1. : "Burrito",
           7. : "Nachos",
           24. : "Chili",
           20. : "Southwest Salad",
           25.: "Roast Beef Sandwich"}

In [84]:
df['meal_name'] = df['mealskew'].map(mealmap)

In [85]:
df.to_csv('data/Meal_Info.csv',index=False)

In [86]:
data = spark.read.csv('data/Meal_Info.csv',inferSchema=True,header=True)

In [87]:
data = data.dropna()
data.show()

+-------+------+------+--------+--------------------+
|movieId|rating|userId|mealskew|           meal_name|
+-------+------+------+--------+--------------------+
|      2|   3.0|     0|     2.0|       Chicken Curry|
|      3|   1.0|     0|     3.0|Spicy Chicken Nug...|
|      5|   2.0|     0|     5.0|           Hamburger|
|      9|   4.0|     0|     9.0|       Taco Surprise|
|     11|   1.0|     0|    11.0|            Meatloaf|
|     12|   2.0|     0|    12.0|        Ceaser Salad|
|     15|   1.0|     0|    15.0|            BBQ Ribs|
|     17|   1.0|     0|    17.0|         Sushi Plate|
|     19|   1.0|     0|    19.0|Cheesesteak Sandw...|
|     21|   1.0|     0|    21.0|             Lasagna|
|     23|   1.0|     0|    23.0|      Orange Chicken|
|     26|   3.0|     0|    26.0|    Spicy Beef Plate|
|     27|   1.0|     0|    27.0|Salmon with Mashe...|
|     28|   1.0|     0|    28.0| Penne Tomatoe Pasta|
|     29|   1.0|     0|    29.0|        Pork Sliders|
|     30|   1.0|     0|    3

In [88]:
(training, test) = data.randomSplit([0.8, 0.2])

In [89]:
# Build the recommendation model using ALS on the training data
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="mealskew", ratingCol="rating")
model = als.fit(training)

In [90]:
# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test)

predictions.show()

evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

+-------+------+------+--------+--------------------+-----------+
|movieId|rating|userId|mealskew|           meal_name| prediction|
+-------+------+------+--------+--------------------+-----------+
|      0|   1.0|    22|     0.0|        Cheese Pizza| -1.3463306|
|      1|   1.0|     3|     1.0|             Burrito|  1.0270131|
|      2|   3.0|     6|     2.0|       Chicken Curry|  2.3317757|
|      2|   2.0|     7|     2.0|       Chicken Curry|  3.6757278|
|      2|   4.0|    10|     2.0|       Chicken Curry|-0.14808428|
|      2|   1.0|    12|     2.0|       Chicken Curry|  2.3583088|
|      2|   4.0|    28|     2.0|       Chicken Curry|  0.3500839|
|      3|   2.0|     8|     3.0|Spicy Chicken Nug...|   1.106826|
|      3|   1.0|     9|     3.0|Spicy Chicken Nug...|  0.9973233|
|      3|   1.0|    13|     3.0|Spicy Chicken Nug...|-0.06744638|
|      3|   3.0|    14|     3.0|Spicy Chicken Nug...|  0.7161862|
|      3|   2.0|    22|     3.0|Spicy Chicken Nug...| -0.6051092|
|      4| 

## Stop The Spark Session

In [91]:
# stop the underlying SparkContext.
try:
    spark
except NameError:
    print("Spark session does not exist - nothing to stop.")
else:
    spark.stop()