# Building a Movie Recommendation System in PySpark - Lab Code-along
![images of vhs tapes on shelf](img/movies.jpg)

## Introduction

In this last lab, we will implement a movie recommendation system using Alternating Least Squares (ALS) in the Spark programming environment.<br> Spark's machine learning library `ml` comes packaged with a very efficient implementation of the ALS algorithm. 

The lab will require you to put into practice your Spark programming skills for creating and manipulating PySpark DataFrames. We will go through a step-by-step process of developing a movie recommendation system using ALS and PySpark with the MovieLens dataset.

Note: You are advised to refer to [PySpark Documentation](http://spark.apache.org/docs/2.2.0/api/python/index.html) heavily for completing this lab as it will introduce a few new methods. 

## Objectives

You will be able to:

* Identify the key components of ALS
* Demonstrate an understanding of how recommendation systems are used to personalize online services/products
* Parse and filter datasets into a Spark DataFrame, performing basic feature selection
* Run a brief hyper-parameter selection activity through a scalable grid search
* Train and evaluate the predictive performance of a recommendation system
* Generate predictions from the trained model

## Building a Recommendation System

We have seen how recommendation systems have played an integral part in the success of Amazon (Books, Items), Pandora/Spotify (Music), Google (News, Search), YouTube (Videos), etc. For Amazon, these systems reportedly bring in more than 30% of total revenue. For Netflix, reportedly 75% of the movies that people watch are based on some sort of recommendation.

> The goal of Recommendation Systems is to find what is likely to be of interest to the user. This enables organizations to offer a high level of personalization and customer tailored services.

### We sort of get the concept

For online video content services like Netflix and Hulu, building robust movie recommendation systems is extremely important. A simple example of a recommendation system at work:

1.    User A watches Game of Thrones and Breaking Bad.
2.    User B performs a search query for Game of Thrones.
3.    The system suggests Breaking Bad to user B from data collected about user A.
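The three steps above can be sketched as a simple "watched together" count. This is only a minimal illustration in plain Python, not the ALS approach used later in the lab. User A comes from the example; users C and D, the extra titles, and the `suggest` helper are hypothetical.

```python
from collections import Counter

# User A is from the example above; C and D are hypothetical extra users.
histories = {
    "A": {"Game of Thrones", "Breaking Bad"},
    "C": {"Game of Thrones", "Breaking Bad", "The Wire"},
    "D": {"Game of Thrones", "Better Call Saul"},
}

def suggest(query_title, histories, top_n=1):
    """Rank other titles by how many users watched them alongside `query_title`."""
    counts = Counter()
    for watched in histories.values():
        if query_title in watched:
            # count every other title this user watched
            counts.update(watched - {query_title})
    return [title for title, _ in counts.most_common(top_n)]

# User B searches for Game of Thrones
suggestions = suggest("Game of Thrones", histories)
print(suggestions)
```

Counting co-occurrences like this ignores how much each user liked a title, which is the gap that rating-based collaborative filtering fills.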


This lab will guide you through a step-by-step process of developing such a movie recommendation system. We will use the MovieLens dataset to build a movie recommendation system using the collaborative filtering technique with Spark's Alternating Least Squares implementation. After building that recommendation system, we will go through the process of adding a new user to the dataset with some new ratings and obtaining new recommendations for that user.

## Will Nightengale like Toy Story?

Collaborative filtering and matrix decomposition allow us to use the history of other users' ratings, along with the entire community of ratings, to answer that question.

![image1](img/collab.png)


## Person vs vegetable

It's important to realize that there are two sides to recommendation: user-based and item-based.

![image2](img/item_user_based.png)

## Code for model

If we wanted, we could jump to the code right now to make this happen.

But would we understand it?
```python
from pyspark.ml.recommendation import ALS

als = ALS(
    userCol='userId',
    itemCol='movieId',
    ratingCol='rating',
)

als_model = als.fit(movie_ratings)
```

## Documentation Station

Let's explore the [documentation](http://spark.apache.org/docs/2.4.3/api/python/pyspark.ml.html#module-pyspark.ml.recommendation) together to maybe get a better idea of what is happening. 

- which parameters make sense?
- which are completely foreign?

## Assumptions

Matrix decomposition is built on the theory that every individual (user, movie) score is actually the **dot product** of two separate vectors:
- user characteristics 
- movie characteristics

Wait, do you mean things like gender, or whether the movie is sci-fi or action? Do we have that data?

![beyonce-gif](img/beyonce.gif)

## The hidden matrices
![image4](img/matrix_decomp.png)

## Embeddings

Embeddings are low dimensional hidden factors for items and users.

For example, say we have 5-dimensional (i.e., **rank** = 5) embeddings for both items and users (5 is chosen arbitrarily; this could be any number, as we saw with PCA and dimensionality reduction).

For user-X & movie-A, we can say those 5 numbers might represent 5 different characteristics about the movie e.g.:

- How much movie-A is political
- How recent is the movie
- How much special effects are in movie A
- How dialogue driven is the movie
- How linear is the narrative in the movie

In a similar way, 5 numbers in the user embedding matrix might represent:

- How much does user-X like sci-fi movies
- How much does user-X like recent movies … and so on.

But we have *no actual idea* what those factors actually represent.

### If we knew the feature embeddings in advance, it would look something like this:

In [1]:
import numpy as np

# the original matrix of ratings (np.nan = not yet rated)
R = np.array([[2, np.nan, np.nan, 1, 4],
              [5, 1, 2, np.nan, 2],
              [3, np.nan, np.nan, 3, np.nan],
              [1, np.nan, 4, 2, 1]])

# users X factors
P = np.array([[-0.63274434,  1.33686735, -1.55128517],
              [-2.23813661,  0.5123861,  0.14087293],
              [-1.0289794,  1.62052691,  0.21027516],
              [-0.06422255,  1.62892864,  0.33350709]])

# items X factors (transposed below as Q.T to get factors X items)
Q = np.array([[-2.09507374,  0.52351075,  0.01826269],
              [-0.45078775, -0.07334991,  0.18731052],
              [-0.34161766,  2.46215058, -0.18942263],
              [-1.0925736,  1.04664756,  0.69963111],
              [-0.78152923,  0.89189076, -1.47144019]])

What about that `np.nan` in the third row, last column? How will that item be reviewed by that user?

In [2]:
P[2].dot(Q.T[:,4])

1.9401031341455333

## Wait, I saw a transpose in there - what's the actual formula?

Terms:<br>
$R$ is the full user-item rating matrix.

$P$ is a matrix that contains the users and the $k$ factors, represented as (user, factor).

$Q^T$ is a matrix that contains the items and the $k$ factors, represented as (factor, item).

$\hat{r}_{u,i}$ represents our prediction for the true rating $r_{u,i}$. To get an individual rating, take the dot product of a row of $P$ and a column of $Q^T$.

For the entire matrix:
$$ R \approx PQ^T $$ 

or for individual ratings:

$$\hat{r}_{u,i} = q_i^T p_u $$ 





### Let's get the whole matrix!

In [3]:
P.dot(Q.T)

array([[ 1.99717984, -0.10339773,  3.80157388,  1.00522135,  3.96947118],
       [ 4.95987359,  0.99772807,  1.9994742 ,  3.08017572,  1.99887552],
       [ 3.00799117,  0.38437256,  4.30166793,  2.96747131,  1.94010313],
       [ 0.99340337, -0.02806164,  3.96943336,  2.00841398,  1.01228247]])

### Look at those results

Are they _exactly_ correct?
![check](img/check.gif)
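One way to answer that question is to compare the reconstruction with the original matrix only where a rating is actually known. The sketch below reuses the same `R`, `P` and `Q` as the cells above; `known` and `errors` are names introduced here for illustration.

```python
import numpy as np

# Same R, P and Q as in the cells above.
R = np.array([[2, np.nan, np.nan, 1, 4],
              [5, 1, 2, np.nan, 2],
              [3, np.nan, np.nan, 3, np.nan],
              [1, np.nan, 4, 2, 1]])
P = np.array([[-0.63274434,  1.33686735, -1.55128517],
              [-2.23813661,  0.5123861,  0.14087293],
              [-1.0289794,  1.62052691,  0.21027516],
              [-0.06422255,  1.62892864,  0.33350709]])
Q = np.array([[-2.09507374,  0.52351075,  0.01826269],
              [-0.45078775, -0.07334991,  0.18731052],
              [-0.34161766,  2.46215058, -0.18942263],
              [-1.0925736,  1.04664756,  0.69963111],
              [-0.78152923,  0.89189076, -1.47144019]])

R_hat = P.dot(Q.T)
known = ~np.isnan(R)               # True where a rating was actually observed
errors = R_hat[known] - R[known]   # compare only on the observed entries

print("max absolute error:", np.abs(errors).max())
print("RMSE on known ratings:", np.sqrt(np.mean(errors ** 2)))
```

The errors are small but not zero: the factorization approximates the known ratings rather than reproducing them exactly.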

## ALS benefit: Loss Function

The Loss function $L$ can be calculated as:

$$ L = \sum_{(u,i) \in \kappa}(r_{u,i} - q_i^T p_u)^2 + \lambda( ||q_i||^2 + ||p_u||^2)$$

Where $\kappa$ is the set of (u,i) pairs for which $r_{u,i}$ is known.

To avoid overfitting, the loss function also includes a regularization term weighted by the parameter $\lambda$. For a chosen $\lambda$, we look for the $p_u$ and $q_i$ that minimize the squared difference between the known ratings in our dataset $R$ and our predictions, plus that penalty.

There's the **least squares** part of ALS, got it!
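As a rough illustration, the loss can be computed directly in NumPy by looping over the known (u, i) pairs. This is a minimal sketch that treats the penalty as part of each summed term; the function name `als_loss` and the tiny matrices are hypothetical and chosen so the arithmetic is easy to follow.

```python
import numpy as np

def als_loss(R, P, Q, lam):
    """Squared error plus penalty, summed over the known ratings in R."""
    total = 0.0
    for u, i in zip(*np.where(~np.isnan(R))):   # the (u, i) pairs in kappa
        p_u, q_i = P[u], Q[i]
        residual = R[u, i] - q_i @ p_u
        total += residual ** 2 + lam * (q_i @ q_i + p_u @ p_u)
    return total

# Tiny hypothetical example: 2 users, 2 items, rank 1.
R = np.array([[4.0, np.nan],
              [np.nan, 1.0]])
P = np.array([[2.0], [1.0]])   # user factors
Q = np.array([[2.0], [1.0]])   # item factors

print(als_loss(R, P, Q, lam=0.0))   # both known ratings are fit exactly
print(als_loss(R, P, Q, lam=0.1))   # only the penalty remains
```

With `lam=0.0` the loss is zero because both known ratings are reproduced exactly; with a positive `lam`, large factor values are penalized even when the fit is perfect.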

## So now we use gradient descent, right?

![incorrect](img/incorrect.gif)

### Here comes the alternating part

ALS alternates between holding the $q_i$'s constant and the $p_u$'s constant. 

While all $q_i$'s are held constant, each $p_u$ is computed by solving the least squares problem.<br>
After that process has taken place, all the $p_u$'s are held constant while the $q_i$'s are altered to solve the least squares problem, again, each independently.<br> 
This process repeats many times until you've reached convergence (ideally).

### Changing Loss function:

Let's first assume the item vectors are fixed and solve for the user vectors:

$$p_u=\left(\sum_{r_{u,i}\in r_{u*}}{q_iq_i^T} + \lambda I_k\right)^{-1}\sum_{r_{u,i}\in r_{u*}}{r_{u,i}\,q_{i}}$$

Then we hold the user vectors constant and solve for the item vectors:

$$q_i=\left(\sum_{r_{u,i}\in r_{*i}}{p_up_u^T} + \lambda I_k\right)^{-1}\sum_{r_{u,i}\in r_{*i}}{r_{u,i}\,p_{u}}$$

Here $r_{u*}$ is the set of known ratings from user $u$, $r_{*i}$ is the set of known ratings for item $i$, and $I_k$ is the $k \times k$ identity matrix.

This process repeats until convergence.
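The two closed-form updates can be sketched in NumPy on the small ratings matrix from earlier. This is a toy illustration under stated assumptions, not Spark's implementation: `solve_side` and `objective` are hypothetical helper names, and the rank, `lam` and iteration count are arbitrary choices. The tracked objective is the squared error on known ratings plus `lam` times the squared norm of every factor vector, which is the quantity these updates minimize.

```python
import numpy as np

def solve_side(R, fixed, lam):
    """Solve each row's factor vector in closed form while `fixed` stays constant."""
    k = fixed.shape[1]
    solved = np.zeros((R.shape[0], k))
    for row in range(R.shape[0]):
        known = ~np.isnan(R[row])
        F = fixed[known]                    # factors paired with this row's known ratings
        A = F.T @ F + lam * np.eye(k)       # sum of outer products + lambda * I_k
        b = F.T @ R[row, known]             # sum of rating * factor vector
        solved[row] = np.linalg.solve(A, b)
    return solved

def objective(R, P, Q, lam):
    """Squared error on known ratings plus the penalty on all factor vectors."""
    known = ~np.isnan(R)
    sq_err = ((R - P @ Q.T)[known] ** 2).sum()
    return sq_err + lam * ((P ** 2).sum() + (Q ** 2).sum())

R = np.array([[2, np.nan, np.nan, 1, 4],
              [5, 1, 2, np.nan, 2],
              [3, np.nan, np.nan, 3, np.nan],
              [1, np.nan, 4, 2, 1]])

rng = np.random.default_rng(0)
k, lam = 3, 0.1                             # arbitrary choices for this sketch
P = rng.normal(size=(R.shape[0], k))        # random starting user factors
Q = rng.normal(size=(R.shape[1], k))        # random starting item factors

history = [objective(R, P, Q, lam)]
for _ in range(15):
    P = solve_side(R, Q, lam)               # items fixed, solve every user vector
    Q = solve_side(R.T, P, lam)             # users fixed, solve every item vector
    history.append(objective(R, P, Q, lam))

print(np.round(history, 4))
```

Because each half-step solves its least squares problem exactly, the objective never increases from one sweep to the next.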

# Turn and Talk
What levers do we have available to adjust?

![lever-choice](img/levers.jpeg)

- Pros and cons of lambda size?
- Iterations?

# Enough - let's get to the data

### Importing the Data
To begin with:
* initialize a SparkSession object
* import the dataset found at './data/ratings.csv' into a pyspark DataFrame

In [4]:
import pyspark

spark = pyspark.sql.SparkSession.builder.getOrCreate()
sc = spark.sparkContext

In [5]:
!unzip data.zip

Archive:  data.zip
   creating: data/
  inflating: data/ratings.csv        
  inflating: data/movies.csv         


In [6]:
!ls data/

movies.csv  ratings.csv


In [7]:
!head data/ratings.csv













In [8]:
# read in the dataset into pyspark DataFrame
movie_ratings = spark.read.csv('data/ratings.csv',
                               inferSchema=True,
                               header=True)

Check the data types of each of the values to ensure that they are a type that makes sense given the column.

In [9]:
movie_ratings.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



In [10]:
movie_ratings.head(5)

[Row(userId=1, movieId=1, rating=4.0, timestamp=964982703),
 Row(userId=1, movieId=3, rating=4.0, timestamp=964981247),
 Row(userId=1, movieId=6, rating=4.0, timestamp=964982224),
 Row(userId=1, movieId=47, rating=5.0, timestamp=964983815),
 Row(userId=1, movieId=50, rating=5.0, timestamp=964982931)]

In [11]:
movie_ratings.show(5)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
+------+-------+------+---------+
only showing top 5 rows



We aren't going to need the time stamp, so we can go ahead and remove that column.

In [12]:
movie_ratings = movie_ratings.drop('timestamp')

In [13]:
movie_ratings.show(5)

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   4.0|
|     1|      3|   4.0|
|     1|      6|   4.0|
|     1|     47|   5.0|
|     1|     50|   5.0|
+------+-------+------+
only showing top 5 rows



### Fitting the Alternating Least Squares Model

Because this dataset is already preprocessed for us, we can go ahead and fit the Alternating Least Squares model.

* Use the randomSplit method on the pyspark DataFrame to separate the dataset into a training and test set
* Import the `ALS` class from `pyspark.ml.recommendation`.
* Fit the Alternating Least Squares model to the training dataset and assign the result to a variable (`als_model` below). Make sure to set `userCol`, `itemCol`, and `ratingCol` to the appropriate column names for this dataset.

In [14]:
# split into training and testing sets

mr_train, mr_test = movie_ratings.randomSplit([0.8, 0.2])

In [15]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS, ALSModel

als = ALS(
    userCol='userId',
    itemCol='movieId',
    ratingCol='rating',
)

In [16]:
# Build the recommendation model using ALS on the training data
# fit the ALS model to the training set

als_model = als.fit(mr_train)

Now you've fit the model, and it's time to evaluate it to determine just how well it performed.

* import `RegressionEvaluator` from pyspark.ml.evaluation ([documentation](http://spark.apache.org/docs/2.4.3/api/python/pyspark.ml.html#pyspark.ml.evaluation.RegressionEvaluator))
* generate predictions with your model for the test set by using the `transform` method on your ALS model
* evaluate your model and print out the RMSE from your test set [options for evaluating regressors](http://spark.apache.org/docs/2.4.3/api/python/pyspark.ml.html#pyspark.ml.evaluation.RegressionEvaluator.metricName)

In [17]:
train_predictions = als_model.transform(mr_train)

In [18]:
train_predictions.show(5)

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   191|    148|   5.0| 4.9157906|
|   133|    471|   4.0| 3.2069576|
|   597|    471|   2.0|  3.830933|
|   385|    471|   4.0| 3.0270627|
|   436|    471|   3.0| 3.7656024|
+------+-------+------+----------+
only showing top 5 rows



In [19]:
evaluator = RegressionEvaluator(metricName="rmse", 
                                labelCol="rating",
                                predictionCol="prediction")

#notice how we dropna, why are we doing this?
rmse = evaluator.evaluate(train_predictions.dropna())

print(f"Root-mean-square error (train)= {rmse}")



Root-mean-square error (train)= 0.5637573121228293
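
To make the metric concrete, here is a minimal NumPy sketch of the same RMSE calculation, applied only to the five training rows printed by `train_predictions.show(5)` above. The names `ratings`, `preds` and `rmse_sample` are introduced here for illustration.

```python
import numpy as np

# The five (rating, prediction) pairs from train_predictions.show(5) above.
ratings = np.array([5.0, 4.0, 2.0, 4.0, 3.0])
preds = np.array([4.9157906, 3.2069576, 3.830933, 3.0270627, 3.7656024])

# RMSE: square the errors, average them, take the square root
rmse_sample = np.sqrt(np.mean((ratings - preds) ** 2))
print(rmse_sample)
```

Five rows are far too few to summarize the model, so this value differs from the full training RMSE; the point is only to show what the evaluator computes.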


In [20]:
#let's check with our test data
predictions = als_model.transform(mr_test)
evaluator = RegressionEvaluator(metricName="rmse",
                                labelCol="rating",
                                predictionCol="prediction")

rmse = evaluator.evaluate(predictions.dropna())

print(f"Root-mean-square error = {rmse}")



Root-mean-square error = 0.885214503040523


In [21]:
predictions.persist()

DataFrame[userId: int, movieId: int, rating: double, prediction: float]

In [22]:
movie_ratings.show(6)

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   4.0|
|     1|      3|   4.0|
|     1|      6|   4.0|
|     1|     47|   5.0|
|     1|     50|   5.0|
|     1|     70|   3.0|
+------+-------+------+
only showing top 6 rows



In [23]:
predictions.show(6)

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   602|    471|   4.0|  3.112138|
|   409|    471|   3.0| 3.8394618|
|   610|    471|   4.0| 3.1897154|
|   448|    471|   4.0|  3.429043|
|   216|    471|   3.0| 3.2501829|
|   287|    471|   4.5| 2.7882257|
+------+-------+------+----------+
only showing top 6 rows



In [24]:
user_factors = als_model.userFactors

In [25]:
user_factors.show(10)

+---+--------------------+
| id|            features|
+---+--------------------+
| 10|[-1.1203301, -1.0...|
| 20|[-0.020354815, -0...|
| 30|[-0.70407623, -0....|
| 40|[-0.16421832, -0....|
| 50|[-0.17447253, 0.3...|
| 60|[-0.38603196, -0....|
| 70|[-0.18689997, -0....|
| 80|[-0.5450059, -0.2...|
| 90|[0.3272208, -0.04...|
|100|[-0.5109115, -0.7...|
+---+--------------------+
only showing top 10 rows



In [26]:
item_factors = als_model.itemFactors

In [27]:
item_factors.show(10)

+---+--------------------+
| id|            features|
+---+--------------------+
| 10|[-0.2615902, -0.3...|
| 20|[-0.42568925, 0.0...|
| 30|[0.24362114, 1.37...|
| 40|[-1.2288331, 0.26...|
| 50|[-0.23277198, 0.3...|
| 60|[-0.16328321, -0....|
| 70|[-0.11268958, -0....|
| 80|[-0.24506415, 0.4...|
|100|[0.1341564, -0.26...|
|110|[-0.550132, -0.17...|
+---+--------------------+
only showing top 10 rows



### Important Question

Will Billy like movie m?

In [28]:
import numpy as np

In [29]:
billy_row = user_factors[user_factors['id'] == 10].first()
billy_factors = np.array(billy_row['features'])

In [30]:
m_row = item_factors[item_factors['id'] == 300].first()
m_factors = np.array(m_row['features'])

In [31]:
billy_factors

array([-1.1203301 , -1.07146513, -0.31242731,  0.50631529,  0.71689254,
        0.45523372, -0.23791394, -1.0488621 , -0.46222615,  0.06714125])

In [32]:
m_factors

array([ 0.03458975, -0.15960222,  0.17158447,  0.79453808,  0.10294127,
       -0.31332606,  0.0296643 , -0.84704024, -1.78182173, -0.48461902])

In [33]:
billy_factors @ m_factors

2.0845341451602653

In [34]:
billy_factors.dot(m_factors)

2.0845341451602653

In [35]:
billy_preds = train_predictions[train_predictions['userId'] == 10]

In [36]:
billy_preds.sort('movieId').show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|    10|    296|   1.0| 1.9804393|
|    10|    588|   4.0|  2.940047|
|    10|    912|   4.0| 2.9359443|
|    10|   1028|   0.5|  2.090569|
|    10|   1088|   3.0| 2.9005334|
|    10|   1247|   3.0| 2.2021194|
|    10|   1307|   3.0| 3.0473313|
|    10|   1784|   3.5| 2.7882361|
|    10|   1907|   4.0| 3.2249389|
|    10|   2571|   0.5| 2.8633056|
|    10|   2671|   3.5| 3.4982302|
|    10|   2762|   0.5| 2.7024276|
|    10|   2858|   1.0| 1.9088267|
|    10|   2959|   0.5| 2.4237351|
|    10|   3578|   4.0| 3.5481024|
|    10|   3882|   3.0|  3.127587|
|    10|   4246|   3.5| 2.8391597|
|    10|   4306|   4.5|  3.182148|
|    10|   4447|   4.5|  3.602854|
|    10|   4993|   4.0| 3.2524076|
+------+-------+------+----------+
only showing top 20 rows



In [37]:
!grep "^300," < data/movies.csv




## Okay, what *will* Billy like?

In [38]:
recs = als_model.recommendForAllUsers(numItems=10)

In [39]:
recs[recs['userId']==10].first()['recommendations']

[Row(movieId=6686, rating=5.04002046585083),
 Row(movieId=32892, rating=4.991785526275635),
 Row(movieId=222, rating=4.922060012817383),
 Row(movieId=74946, rating=4.896431922912598),
 Row(movieId=90888, rating=4.8028883934021),
 Row(movieId=8869, rating=4.7076616287231445),
 Row(movieId=103372, rating=4.619114875793457),
 Row(movieId=78637, rating=4.581262588500977),
 Row(movieId=71579, rating=4.5738911628723145),
 Row(movieId=32598, rating=4.5477213859558105)]
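
Conceptually, a top-N list like this can be produced by scoring every item against one user's factor vector and keeping the highest scores. The sketch below uses hypothetical rank-2 factors and item ids; it illustrates the idea only and is not how Spark computes `recommendForAllUsers` internally.

```python
import numpy as np

# Hypothetical rank-2 factors, for illustration only.
user_vec = np.array([1.0, 0.5])
item_ids = np.array([10, 20, 30, 40])
item_mat = np.array([[0.2, 0.1],
                     [1.5, 0.4],
                     [0.9, 2.0],
                     [-0.3, 0.8]])

scores = item_mat @ user_vec          # predicted rating for every item
top = np.argsort(scores)[::-1][:2]    # indices of the two highest scores
top_ids = item_ids[top]

print(list(zip(top_ids, scores[top])))
```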

In [40]:
!grep "^71579," < data/movies.csv




## Your turn!!


Marcy is user number 126. Will she like the movie NeverEnding Story III (id = 278)?

What movies will Marcy like?

In [41]:
#your code here

In [42]:
marcy_row = user_factors[user_factors['id'] == 126].first()
# use Marcy's own row here, not Billy's
marcy_factors = np.array(marcy_row['features'])

In [43]:
marcy_row

Row(id=126, features=[0.4223337769508362, -0.9833168387413025, 0.3187357187271118, 1.3234115839004517, 0.2914188504219055, 0.3199594020843506, 0.33847591280937195, -0.2346716672182083, -1.1305001974105835, -0.18424473702907562])

In [44]:
m_row = item_factors[item_factors['id'] == 278].first()
m_factors = np.array(m_row['features'])

In [45]:
marcy_factors

array([ 0.42233378, -0.98331684,  0.31873572,  1.32341158,  0.29141885,
        0.3199594 ,  0.33847591, -0.23467167, -1.1305002 , -0.18424474])

In [46]:
m_factors

array([-0.1059352 ,  0.2480803 , -0.29835272,  0.50908625,  0.09189876,
       -0.28101298,  0.04428867, -0.15351939, -0.57745975,  0.18948704])


In [47]:
# predicted rating = dot product of Marcy's factors and the movie's factors
round(marcy_factors @ m_factors, 4)

0.8957

In [48]:
round(marcy_factors.dot(m_factors), 4)

0.8957

In [49]:
marcy_preds = train_predictions[train_predictions['userId'] == 126]

In [50]:
marcy_preds.sort('movieId').show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   126|     34|   3.0|  3.182831|
|   126|     47|   5.0| 3.2512407|
|   126|    150|   4.0| 3.7855096|
|   126|    153|   4.0| 3.3547292|
|   126|    161|   3.0| 3.6528955|
|   126|    185|   3.0|  3.569079|
|   126|    208|   2.0| 2.3691292|
|   126|    231|   1.0| 2.1674452|
|   126|    288|   1.0| 1.6922612|
|   126|    292|   3.0|  3.393481|
|   126|    296|   1.0| 2.8955328|
|   126|    318|   5.0|   3.81286|
|   126|    329|   4.0| 3.4668243|
|   126|    339|   3.0|  3.905585|
|   126|    344|   1.0|  2.002422|
|   126|    349|   4.0| 3.7426054|
|   126|    356|   4.0| 4.2265134|
|   126|    364|   5.0| 3.8689878|
|   126|    367|   5.0| 3.4152043|
|   126|    377|   5.0| 4.0025735|
+------+-------+------+----------+
only showing top 20 rows



In [51]:
!grep "^126," < data/movies.csv




In [52]:
recs = als_model.recommendForAllUsers(numItems=10)

In [53]:
recs[recs['userId']==126].first()['recommendations']

[Row(movieId=1939, rating=5.108115196228027),
 Row(movieId=3142, rating=4.8823113441467285),
 Row(movieId=3086, rating=4.864048480987549),
 Row(movieId=2316, rating=4.760489463806152),
 Row(movieId=6732, rating=4.684388160705566),
 Row(movieId=1727, rating=4.631467342376709),
 Row(movieId=2990, rating=4.616357803344727),
 Row(movieId=1172, rating=4.594834804534912),
 Row(movieId=1096, rating=4.569124221801758),
 Row(movieId=940, rating=4.51115608215332)]

In [54]:
!grep "^3682," < data/movies.csv




## Parameters to tune

- `maxIter`: the maximum number of iterations to run (defaults to 10)
- `rank`: the number of latent factors in the model (defaults to 10)
- `regParam`: the regularization parameter in ALS (defaults to 0.1)
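
To show what a grid search over `rank` and `regParam` looks like, here is a toy NumPy stand-in that scores each combination on a held-out validation set. Everything in it is an assumption for illustration: the synthetic ratings, the helper names `solve_side`, `fit_als` and `rmse`, and the grid values. In Spark itself, the usual tools for this are `ParamGridBuilder` and `CrossValidator` from `pyspark.ml.tuning`.

```python
import itertools
import numpy as np

def solve_side(R, fixed, lam):
    """Solve each row's factor vector in closed form while `fixed` stays constant."""
    k = fixed.shape[1]
    solved = np.zeros((R.shape[0], k))
    for row in range(R.shape[0]):
        known = ~np.isnan(R[row])
        F = fixed[known]
        solved[row] = np.linalg.solve(F.T @ F + lam * np.eye(k),
                                      F.T @ R[row, known])
    return solved

def fit_als(R, rank, lam, iters=10, seed=0):
    """Alternate user and item solves for a fixed number of iterations."""
    rng = np.random.default_rng(seed)
    Q = rng.normal(size=(R.shape[1], rank))
    for _ in range(iters):
        P = solve_side(R, Q, lam)
        Q = solve_side(R.T, P, lam)
    return P, Q

def rmse(R, P, Q):
    """Root-mean-square error on the known entries of R."""
    known = ~np.isnan(R)
    return np.sqrt(np.mean((R - P @ Q.T)[known] ** 2))

# Synthetic ratings (an assumption): a rank-3 signal plus a little noise.
rng = np.random.default_rng(42)
full = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 30))
full += 0.1 * rng.normal(size=full.shape)
observed = rng.random(full.shape) < 0.6               # ratings we pretend to know
holdout = observed & (rng.random(full.shape) < 0.2)   # about 20% go to validation

R_train = np.where(observed & ~holdout, full, np.nan)
R_val = np.where(holdout, full, np.nan)

# Try every (rank, regParam) combination and score it on the validation set
results = {}
for rank, lam in itertools.product([2, 3, 8], [0.01, 0.1, 1.0]):
    P, Q = fit_als(R_train, rank, lam)
    results[(rank, lam)] = rmse(R_val, P, Q)

best = min(results, key=results.get)
print("best (rank, regParam):", best, "validation RMSE:", results[best])
```

Scoring on held-out ratings rather than training ratings matters here, because a higher rank or a smaller regularization value will almost always fit the training data better.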


## Strengths of ALS

- Helpful when analysis of content is expensive or difficult
- Requires little domain knowledge to create recommendations
- You don't need features of the user or item to create recommendations

## Weaknesses with ALS

- Cold start:  When we have new users or items that have no rating history and on which the model has not been trained.
- "Gray Sheep":  When a customer's tastes are very different from those of other customers
- Not very scalable


## Some great technical resources:

- [good one from Stanford](http://stanford.edu/~rezab/classes/cme323/S15/notes/lec14.pdf)
- [the netflix recommendation project](https://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf)