# MovieLens

In this lab you are going to build a movie recommender system.

To train it we will use the MovieLens dataset.
- [MovieLens Datasets](https://grouplens.org/datasets/movielens/)

## Collaborative filtering
Collaborative filtering is commonly used for recommender systems. These techniques aim to fill in the missing entries of a user-item association matrix. spark.ml currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries. spark.ml uses the alternating least squares (ALS) algorithm to learn these latent factors. The implementation in spark.ml has the following parameters:

-    numBlocks is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to 10).
-    rank is the number of latent factors in the model (defaults to 10).
-    maxIter is the maximum number of iterations to run (defaults to 10).
-    regParam specifies the regularization parameter in ALS (defaults to 0.1 in the `pyspark.ml.recommendation.ALS` API).
-    implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data (defaults to false which means using explicit feedback).
-    alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations (defaults to 1.0).
-    nonnegative specifies whether or not to use nonnegative constraints for least squares (defaults to false).
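
To make the alternating least squares idea concrete, here is a minimal pure-Python sketch for the special case of a single latent factor (`rank=1`). It is not Spark's implementation: the names `als_rank1` and `loss`, the toy ratings, and the plain L2 penalty `reg` are assumptions made for illustration. With the item factors fixed, each user factor has a closed-form least-squares solution, and vice versa; the loop alternates between the two, playing the role of `maxIter`.

```python
def loss(ratings, u, v, reg):
    """Regularized squared error, computed over the observed ratings only."""
    err = sum((r - u[i] * v[j]) ** 2 for i, j, r in ratings)
    return err + reg * (sum(x * x for x in u) + sum(x * x for x in v))


def als_rank1(ratings, n_users, n_items, reg=0.1, max_iter=10):
    """Toy ALS with one latent factor per user and per item (hypothetical helper)."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _iteration in range(max_iter):
        # Fix the item factors and solve for every user factor in closed form.
        for i in range(n_users):
            rated = [(j, r) for a, j, r in ratings if a == i]
            u[i] = sum(r * v[j] for j, r in rated) / (reg + sum(v[j] ** 2 for j, _r in rated))
        # Fix the user factors and solve for every item factor in closed form.
        for j in range(n_items):
            raters = [(i, r) for i, b, r in ratings if b == j]
            v[j] = sum(r * u[i] for i, r in raters) / (reg + sum(u[i] ** 2 for i, _r in raters))
    return u, v


# (user, item, rating) triples; user 0 has not rated item 2.
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

u, v = als_rank1(toy_ratings, n_users=3, n_items=3)
initial_loss = loss(toy_ratings, [1.0] * 3, [1.0] * 3, reg=0.1)
final_loss = loss(toy_ratings, u, v, reg=0.1)

print("loss:", initial_loss, "->", final_loss)
print("predicted rating of user 0 for item 2:", u[0] * v[2])
```

The last line is the "fill in the missing entry" step: the prediction for an unobserved (user, item) pair is the product of the learned factors.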

Note: The DataFrame-based API for ALS currently only supports integers for user and item ids. Other numeric types are supported for the user and item id columns, but the ids must be within the integer value range.
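
A quick way to check this constraint before training is to compare the ids against the 32-bit signed integer range. The sketch below is plain Python; the helper name `fits_in_int32` and the sample ids are illustrative assumptions, not part of the dataset or of Spark.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def fits_in_int32(ids):
    """Return True if every id lies within the 32-bit signed integer range."""
    return all(INT32_MIN <= i <= INT32_MAX for i in ids)


small_ids_ok = fits_in_int32([1, 610, 193609])
large_ids_ok = fits_in_int32([1, 3_000_000_000])

print(small_ids_ok)
print(large_ids_ok)
```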

You can find all the details here:
- [Spark Collaborative Filtering Guide](https://spark.apache.org/docs/latest/ml-collaborative-filtering.html)

## Dataset

For the lab we will use the "Small" dataset from "MovieLens Latest Datasets", but you can then experiment with the larger ones.

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.
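
These counts already show why the user-item matrix is mostly made of missing entries. The short calculation below uses only the figures quoted above; the variable names are our own.

```python
n_ratings = 100836
n_users = 610
n_movies = 9742

# Number of cells in the full user-item matrix.
cells = n_users * n_movies

# Fraction of cells that actually contain a rating.
density = n_ratings / cells

# Average number of ratings per user.
ratings_per_user = n_ratings / n_users

print("cells:", cells)
print("density: {:.2%}".format(density))
print("ratings per user: {:.1f}".format(ratings_per_user))
```

Fewer than 2% of the cells are filled, so almost all of the matrix is what the recommender has to predict.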

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files links.csv, movies.csv, ratings.csv and tags.csv.

### Variables
ratings.csv:
- userId
- movieId
- rating: Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars)
- timestamp: Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
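
As a small illustration of the timestamp format, the sketch below converts one value into a readable UTC date using only the Python standard library. The value `964982703` is the timestamp of the first row of ratings.csv.

```python
from datetime import datetime, timezone

ts = 964982703  # seconds since 1970-01-01 00:00:00 UTC

# Interpret the value as UTC so the result does not depend on the local time zone.
rated_at = datetime.fromtimestamp(ts, tz=timezone.utc)

print(rated_at.isoformat())  # 2000-07-30T18:45:03+00:00
```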

movies.csv:
- movieId
- title
- genres

# Load Data
We start by downloading the data. It is available under the data tab.

* [ml-latest-small](https://files.grouplens.org/datasets/movielens/ml-latest-small.zip)

We will need the following files that are in CSV format:
- ratings.csv
- movies.csv

Then we have to upload the files to HDFS, to the `datasets/movielens-latest-small` directory.

We can now load the data and explore it:

In [1]:
ratings = spark.read.csv('datasets/movielens-latest-small/ratings.csv', header=True,
                         inferSchema=True)

In [2]:
ratings.createOrReplaceTempView('ratings')

In [3]:
movies = spark.read.csv('datasets/movielens-latest-small/movies.csv', header=True,
                         inferSchema=True)

In [4]:
movies.createOrReplaceTempView('movies')

Some tuning: we cache both DataFrames because we will query them repeatedly.

In [5]:
ratings.cache()
movies.cache()

DataFrame[movieId: int, title: string, genres: string]

Let's see what the data looks like:

In [6]:
ratings.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



In [7]:
ratings.show(5)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
+------+-------+------+---------+
only showing top 5 rows



In [8]:
ratings.count()

100836

In [9]:
movies.show(5)

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
+-------+--------------------+--------------------+
only showing top 5 rows



In [10]:
movies.count()

9742

## Data Exploration

In [11]:
ratings.select('rating').describe().show()

+-------+------------------+
|summary|            rating|
+-------+------------------+
|  count|            100836|
|   mean| 3.501556983616962|
| stddev|1.0425292390606342|
|    min|               0.5|
|    max|               5.0|
+-------+------------------+



Out of curiosity, let's see which users have the highest number of reviews:

In [12]:
spark.sql('''select userId, count(*) as count from ratings group by userId order by count desc limit 10''').show()

+------+-----+
|userId|count|
+------+-----+
|   414| 2698|
|   599| 2478|
|   474| 2108|
|   448| 1864|
|   274| 1346|
|   610| 1302|
|    68| 1260|
|   380| 1218|
|   606| 1115|
|   288| 1055|
+------+-----+



Incredible! There are users who have watched and rated thousands of films.

Let's see which films have the highest number of reviews:

In [13]:
spark.sql('''select movieId, count(*) as count from ratings group by movieId order by count desc limit 10''').show()

+-------+-----+
|movieId|count|
+-------+-----+
|    356|  329|
|    318|  317|
|    296|  307|
|    593|  279|
|   2571|  278|
|    260|  251|
|    480|  238|
|    110|  237|
|    589|  224|
|    527|  220|
+-------+-----+



Actually it would be much more interesting to see the title of the film:

In [14]:
spark.sql('''select movies.title, count(*) as count
             from ratings inner join movies 
             on ratings.movieId=movies.movieId 
             group by movies.title
             order by count desc limit 10''').show()

+--------------------+-----+
|               title|count|
+--------------------+-----+
| Forrest Gump (1994)|  329|
|Shawshank Redempt...|  317|
| Pulp Fiction (1994)|  307|
|Silence of the La...|  279|
|  Matrix, The (1999)|  278|
|Star Wars: Episod...|  251|
|Jurassic Park (1993)|  238|
|   Braveheart (1995)|  237|
|Terminator 2: Jud...|  224|
|Schindler's List ...|  220|
+--------------------+-----+



## Training

In this case we are dealing with a collaborative filtering problem so we will use the ALS Estimator:

In [15]:
from pyspark.ml.recommendation import ALS

In [16]:
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")

We use `coldStartStrategy='drop'` to ensure we don't have issues during evaluation with NaN metrics.

```
When making predictions using an ALS model, it is common to encounter users and/or items in the test dataset that were not present during training the model. 

By default, Spark assigns NaN predictions during ALSModel.transform when a user and/or item factor is not present in the model. 

However, this is undesirable during cross-validation, since any NaN predicted values will result in NaN results for the evaluation metric (for example when using RegressionEvaluator). This makes model selection impossible.

Spark allows users to set the coldStartStrategy parameter to “drop” in order to drop any rows in the DataFrame of predictions that contain NaN values.

Currently the supported cold start strategies are “nan” (the default behavior mentioned above) and “drop”. Further strategies may be supported in future.
```
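
The effect described in the quote can be reproduced in a few lines of plain Python. This is only a sketch with made-up numbers: the `rmse` helper and the three (rating, prediction) pairs are assumptions for illustration, and the last pair stands in for a user or movie that was not seen during training.

```python
import math


def rmse(pairs):
    """Root-mean-square error over (actual rating, prediction) pairs."""
    return math.sqrt(sum((label - pred) ** 2 for label, pred in pairs) / len(pairs))


# The last prediction is NaN, as happens for a cold-start user or item.
predictions = [(4.0, 3.5), (3.0, 3.5), (5.0, float("nan"))]

# A single NaN prediction turns the whole metric into NaN.
with_nan = rmse(predictions)

# Dropping the NaN rows (what coldStartStrategy="drop" does) gives a usable metric.
dropped = [(label, pred) for label, pred in predictions if not math.isnan(pred)]
without_nan = rmse(dropped)

print(with_nan)
print(without_nan)
```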

We can now split the data into training and test sets and fit the model:

In [17]:
training, test = ratings.randomSplit([0.8, 0.2])
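
`randomSplit` also accepts an optional `seed` argument. Without it, every run produces a different split, so the counts and metrics you obtain will differ slightly from the ones shown in this notebook. The split sizes are also only approximately 80/20. The pure-Python sketch below illustrates why, under the simplifying assumption that each row is assigned independently from one random draw; `random_split` is a hypothetical helper, not Spark's implementation.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split using a single uniform draw per row."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative, normalized upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for k, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding in the bounds.
            if x < upper or k == len(bounds) - 1:
                splits[k].append(row)
                break
    return splits


rows = list(range(1000))
train, test = random_split(rows, [0.8, 0.2], seed=42)
train_again, test_again = random_split(rows, [0.8, 0.2], seed=42)

print(len(train), len(test))
print(train == train_again and test == test_again)
```

With the same seed the split is identical on every run, while the sizes only fluctuate around the requested proportions.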

In [18]:
%%time
model = als.fit(training)

CPU times: user 8.84 ms, sys: 2.61 ms, total: 11.5 ms
Wall time: 4.35 s


## Evaluation

In [19]:
from pyspark.ml.evaluation import RegressionEvaluator

In [20]:
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')

In [21]:
test.count()

20275

In [22]:
predictions = model.transform(test)

In [23]:
predictions.show(5)

+------+-------+------+----------+----------+
|userId|movieId|rating| timestamp|prediction|
+------+-------+------+----------+----------+
|   409|    471|   3.0| 967912821| 3.3105135|
|   372|    471|   3.0| 874415126|  1.962655|
|   387|    471|   3.0|1139047519| 3.3807425|
|   555|    471|   3.0| 978746933|  3.146028|
|   216|    471|   3.0| 975212641| 2.9404469|
+------+-------+------+----------+----------+
only showing top 5 rows



In [24]:
rmse = evaluator.evaluate(predictions)

Root-mean squared error (RMSE):

In [25]:
rmse

1.0872392249023355

So, roughly speaking, our predictions are off by about 1 star (RMSE penalizes large errors more heavily than a plain mean absolute error would).
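
To see what the evaluator computes, the sketch below applies the RMSE formula by hand to the five prediction rows displayed earlier. Only those five rows are used, so the result is not the RMSE of the whole test set.

```python
import math

# (rating, prediction) pairs copied from the predictions.show(5) output.
rows = [
    (3.0, 3.3105135),
    (3.0, 1.962655),
    (3.0, 3.3807425),
    (3.0, 3.146028),
    (3.0, 2.9404469),
]

# Square each error, average the squares, then take the square root.
squared_errors = [(rating - prediction) ** 2 for rating, prediction in rows]
rmse_sample = math.sqrt(sum(squared_errors) / len(squared_errors))

print(round(rmse_sample, 3))
```

Most of this value comes from the single prediction of 1.962655, which shows how one large error dominates the metric.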


## Useful methods in the ALS model

The fitted ALS model (`ALSModel`) has some useful methods to simplify the generation of recommendations.

**For each user generate his/her top 5 movie recommendations**

In [26]:
recommendations_per_user = model.recommendForAllUsers(5)

In [27]:
recommendations_per_user.show(2)

+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|   471|[[70946, 12.38880...|
|   463|[[80906, 6.749953...|
+------+--------------------+
only showing top 2 rows
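
Conceptually, each recommendation is a movie whose latent factors have a large dot product with the user's latent factors. The sketch below shows that idea in plain Python; the factor values, the movie ids 10, 20 and 30, and the helper `recommend_top_k` are invented for illustration and are not taken from the trained model.

```python
import heapq


def recommend_top_k(user_factors, item_factors, k):
    """Score every item by a dot product and keep the k best (hypothetical helper)."""
    scores = (
        (sum(a * b for a, b in zip(user_factors, factors)), item_id)
        for item_id, factors in item_factors.items()
    )
    return [(item_id, score) for score, item_id in heapq.nlargest(k, scores)]


# Two latent factors per movie, keyed by a made-up movie id.
item_factors = {10: [0.9, 0.1], 20: [0.2, 0.8], 30: [0.5, 0.5]}
user_factors = [2.0, 1.0]

top2 = recommend_top_k(user_factors, item_factors, 2)
print(top2)
```

Because the score is an unconstrained dot product, it is not limited to the 0.5 to 5.0 rating scale, which is why predicted ratings above 5 appear in the output above.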



**For each movie recommend the top 5 users that would enjoy it**

In [28]:
recommendations_per_movie = model.recommendForAllItems(5)

In [29]:
recommendations_per_movie.show(2)

+-------+--------------------+
|movieId|     recommendations|
+-------+--------------------+
|   1580|[[498, 5.9596877]...|
|   4900|[[126, 6.972471],...|
+-------+--------------------+
only showing top 2 rows



**Generate top 5 movie recommendations for a specific group of users.**

The `users` variable must contain a DataFrame containing a column of user ids. The column name must match `userCol`.

In [30]:
from pyspark.sql import Row

In [31]:
users = spark.createDataFrame([Row(userId=414), Row(userId=599)])

In [32]:
users.show()

+------+
|userId|
+------+
|   414|
|   599|
+------+



In [33]:
recommendations_for_user_group = model.recommendForUserSubset(users, 5)

In [34]:
recommendations_for_user_group.show()

+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|   599|[[3089, 4.9771776...|
|   414|[[89904, 5.474591...|
+------+--------------------+



**Generate top 5 user recommendations for a specific set of movies**

In [35]:
selected_movies = spark.createDataFrame([Row(movieId=356), Row(movieId=318)])

In [36]:
recommendations_for_selected_movies = model.recommendForItemSubset(selected_movies, 5)

In [37]:
recommendations_for_selected_movies.show()

+-------+--------------------+
|movieId|     recommendations|
+-------+--------------------+
|    318|[[429, 5.7827845]...|
|    356|[[429, 5.533156],...|
+-------+--------------------+



### For fun: let's add your movie preferences and see what you get as recommendations

Let's see which user ids are taken:

In [38]:
from pyspark.sql.functions import col, min, max

In [39]:
ratings.select(min(col('userId')).alias('min'), max(col('userId')).alias('max')).show()

+---+---+
|min|max|
+---+---+
|  1|610|
+---+---+



In [40]:
training.select(min(col('userId')).alias('min'), max(col('userId')).alias('max')).show()

+---+---+
|min|max|
+---+---+
|  1|610|
+---+---+



Or we could just use the `describe` method:

In [41]:
ratings.describe('userId').show()

+-------+------------------+
|summary|            userId|
+-------+------------------+
|  count|            100836|
|   mean|326.12756356856676|
| stddev| 182.6184914635004|
|    min|                 1|
|    max|               610|
+-------+------------------+



The existing userIds range from 1 to 610, so we can safely take the unused userId 1000 for ourselves.

Now let's look at some movie titles to generate our reviews. Let's imagine that we are Harry Potter fans:

In [42]:
movies.where(col('title').like('Harry Potter%')).limit(20).toPandas()

   movieId  title                                              genres
0  4896     Harry Potter and the Sorcerer's Stone (a.k.a. ...  Adventure|Children|Fantasy
1  5816     Harry Potter and the Chamber of Secrets (2002)     Adventure|Fantasy
2  8368     Harry Potter and the Prisoner of Azkaban (2004)    Adventure|Fantasy|IMAX
3  40815    Harry Potter and the Goblet of Fire (2005)         Adventure|Fantasy|Thriller|IMAX
4  54001    Harry Potter and the Order of the Phoenix (2007)   Adventure|Drama|Fantasy|IMAX
5  69844    Harry Potter and the Half-Blood Prince (2009)      Adventure|Fantasy|Mystery|Romance|IMAX
6  81834    Harry Potter and the Deathly Hallows: Part 1 (...  Action|Adventure|Fantasy|IMAX
7  88125    Harry Potter and the Deathly Hallows: Part 2 (...  Action|Adventure|Drama|Fantasy|Mystery|IMAX


In [43]:
my_prefs = spark.createDataFrame([
    Row(userId=1000, movieId=4896, rating=5.0),
    Row(userId=1000, movieId=5816, rating=5.0),
    Row(userId=1000, movieId=8368, rating=5.0),
    Row(userId=1000, movieId=40815, rating=5.0),
    Row(userId=1000, movieId=54001, rating=5.0),
    Row(userId=1000, movieId=69844, rating=5.0),
    Row(userId=1000, movieId=81834, rating=5.0),
    Row(userId=1000, movieId=88125, rating=5.0),
])

In [44]:
my_prefs.show()

+-------+------+------+
|movieId|rating|userId|
+-------+------+------+
|   4896|   5.0|  1000|
|   5816|   5.0|  1000|
|   8368|   5.0|  1000|
|  40815|   5.0|  1000|
|  54001|   5.0|  1000|
|  69844|   5.0|  1000|
|  81834|   5.0|  1000|
|  88125|   5.0|  1000|
+-------+------+------+



Let's get rid of the timestamp column that we do not use (so we do not have to specify it in `my_prefs`):

In [45]:
training_new = training.drop('timestamp')

Let's create the new training set:

In [46]:
training_extended = training_new.unionByName(my_prefs)

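NOTE: Be careful if you use `union` instead of `unionByName`. As is standard in SQL, `union` resolves columns by position (not by name). Since the columns of `my_prefs` are ordered as movieId, rating, userId, a positional `union` would put the values in the wrong columns.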
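
The difference between resolving columns by position and by name can be sketched in plain Python, with rows as tuples and column names as lists. The two helper functions are hypothetical stand-ins that only mimic the column-matching behaviour; they are not Spark code.

```python
def union_by_position(cols_a, rows_a, cols_b, rows_b):
    """Keep the left column names and append the right rows unchanged."""
    return cols_a, rows_a + rows_b


def union_by_name(cols_a, rows_a, cols_b, rows_b):
    """Reorder the right rows so that their values line up with the left column names."""
    order = [cols_b.index(c) for c in cols_a]
    return cols_a, rows_a + [tuple(row[i] for i in order) for row in rows_b]


training_cols = ["userId", "movieId", "rating"]
training_rows = [(1, 1, 4.0)]

# Same column order as the my_prefs.show() output above.
prefs_cols = ["movieId", "rating", "userId"]
prefs_rows = [(4896, 5.0, 1000)]

_, by_position = union_by_position(training_cols, training_rows, prefs_cols, prefs_rows)
_, by_name = union_by_name(training_cols, training_rows, prefs_cols, prefs_rows)

# By position, 4896 lands in userId, 5.0 in movieId and 1000 in rating.
print(by_position[-1])
# By name, the row is reordered to (userId, movieId, rating).
print(by_name[-1])
```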

In [47]:
training.count()

80561

In [48]:
training_extended.count()

80569

In [49]:
training_extended.where('userId = 1000').show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|  1000|   4896|   5.0|
|  1000|   5816|   5.0|
|  1000|   8368|   5.0|
|  1000|  40815|   5.0|
|  1000|  54001|   5.0|
|  1000|  69844|   5.0|
|  1000|  81834|   5.0|
|  1000|  88125|   5.0|
+------+-------+------+



And now we retrain the model:

In [50]:
%%time
model_new = als.fit(training_extended)

CPU times: user 41.7 ms, sys: 24.3 ms, total: 66 ms
Wall time: 5.79 s


Finally let's find what we get as recommendations:

In [51]:
my_user = spark.createDataFrame([Row(userId=1000)])

In [52]:
%%time
results = model_new.recommendForUserSubset(my_user, 10)

CPU times: user 1.13 ms, sys: 447 µs, total: 1.58 ms
Wall time: 131 ms


NOTE: Compare the wall times. The call to `recommendForUserSubset` returns almost immediately because Spark evaluates lazily: the actual computation only runs when we retrieve the results:

In [53]:
%%time
my_recommendations = results.collect()

CPU times: user 96.7 ms, sys: 64.4 ms, total: 161 ms
Wall time: 8.87 s
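
The deferred execution seen in these two timings can be mimicked with a Python generator. This is only an analogy, not how Spark works internally; `expensive_score` and the `calls` list are made up to record when work actually happens.

```python
calls = []


def expensive_score(x):
    """Stand-in for costly work; records each call so we can see when it runs."""
    calls.append(x)
    return x * x


# Like a Spark transformation: this only describes the work, nothing runs yet.
lazy_results = (expensive_score(x) for x in range(5))
calls_before = len(calls)

# Like an action such as collect(): this triggers the computation.
materialized = list(lazy_results)
calls_after = len(calls)

print(calls_before, calls_after)
print(materialized)
```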


In [54]:
my_recommendations

[Row(userId=1000, recommendations=[Row(movieId=5690, rating=6.74323034286499), Row(movieId=1280, rating=6.063719749450684), Row(movieId=3181, rating=6.059947967529297), Row(movieId=27611, rating=6.056857109069824), Row(movieId=1299, rating=5.9727630615234375), Row(movieId=4642, rating=5.9463348388671875), Row(movieId=2730, rating=5.933717727661133), Row(movieId=83803, rating=5.889867782592773), Row(movieId=3508, rating=5.886873245239258), Row(movieId=47423, rating=5.8755340576171875)])]

Let's see how we can get the title:

In [55]:
movies.where(col('movieId')==5690).select('title').collect()

[Row(title=u'Grave of the Fireflies (Hotaru no haka) (1988)')]

And these are the recommendations:

In [56]:
from __future__ import print_function

for m in my_recommendations[0].recommendations:
    title = movies.where(col('movieId')==m.movieId).select('title').collect()[0].title
    print(m.movieId, title, m.rating)  

5690 Grave of the Fireflies (Hotaru no haka) (1988) 6.74323034286
1280 Raise the Red Lantern (Da hong deng long gao gao gua) (1991) 6.06371974945
3181 Titus (1999) 6.05994796753
27611 Battlestar Galactica (2003) 6.05685710907
1299 Killing Fields, The (1984) 5.97276306152
4642 Hedwig and the Angry Inch (2000) 5.94633483887
2730 Barry Lyndon (1975) 5.93371772766
83803 Day & Night (2010) 5.88986778259
3508 Outlaw Josey Wales, The (1976) 5.88687324524
47423 Half Nelson (2006) 5.87553405762


Not that bad, but just out of curiosity, let's check whether we have the 'Fantastic Beasts' saga, also by J. K. Rowling:

In [57]:
movies.where(col('title').like('Fantastic Beasts%')).show(truncate=False)

+-------+----------------------------------------------+-------+
|movieId|title                                         |genres |
+-------+----------------------------------------------+-------+
|135143 |Fantastic Beasts and Where to Find Them (2016)|Fantasy|
+-------+----------------------------------------------+-------+



In [58]:
ratings.where('movieId = 135143').show()

+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
|    50| 135143|   2.0|1514239413|
|    62| 135143|   4.0|1521489754|
|    68| 135143|   4.0|1526947551|
|    98| 135143|   2.5|1532457800|
|   125| 135143|   3.5|1480789755|
|   210| 135143|   4.5|1517086938|
|   212| 135143|   4.5|1527796007|
|   318| 135143|   4.5|1512681753|
|   352| 135143|   4.0|1493931961|
|   380| 135143|   4.0|1493473809|
|   382| 135143|   4.0|1515162207|
|   517| 135143|   3.5|1487966240|
|   551| 135143|   3.0|1504320996|
|   596| 135143|   3.5|1535627421|
+------+-------+------+----------+



We can see there is one of those films in the dataset.

Too bad it did not show up in our recommendations!