Movie Recommendation with MLlib
===============================
<!--adapted from https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html -->
In this lab, we will use MLlib to make personalized movie recommendations tailored _for you_. We will work with 10 million ratings from 72,000 users on 10,000 movies, collected by [MovieLens](http://movielens.umn.edu/). The dataset can be found at http://grouplens.org/datasets/movielens. You may want to start with a smaller version of the dataset.

1. Data set
------------------------------
We will use two files from this MovieLens dataset: "`ratings.dat`" and "`movies.dat`". All ratings are contained in the file "`ratings.dat`" and are in the following format:

    UserID::MovieID::Rating::Timestamp

Movie information is in the file "`movies.dat`" and is in the following format:

    MovieID::Title::Genres
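
As a quick illustration of these two formats, the sketch below splits one line of each kind on the `::` delimiter in plain Python. The sample lines are made up for illustration (they are not copied from the dataset), the helper names `split_rating` and `split_movie` are hypothetical, and the `|` separator between genres is an assumption about the file.

```python
def split_rating(line):
    """Split a 'UserID::MovieID::Rating::Timestamp' line into typed fields."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), float(rating), int(timestamp)


def split_movie(line):
    """Split a 'MovieID::Title::Genres' line; genres are assumed to be '|'-separated."""
    movie_id, title, genres = line.strip().split("::")
    return int(movie_id), title, genres.split("|")


# Made-up sample lines in the two formats described above.
print(split_rating("42::1210::4::978300760\n"))
print(split_movie("1210::Star Wars: Episode VI - Return of the Jedi (1983)::Action|Adventure\n"))
```

A single `:` inside a title is harmless because only the two-character `::` sequence is treated as a delimiter.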

2. Collaborative filtering
------------------------------
Collaborative filtering is commonly used for recommender systems. These techniques aim to fill in the missing entries of a user-item association matrix, in our case, the user-movie rating matrix. MLlib currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries. In particular, we implement the alternating least squares (ALS) algorithm to learn these latent factors.
<img src="https://databricks-training.s3.amazonaws.com/img/matrix_factorization.png" title="Matrix Factorization" alt="Matrix Factorization" width="50%">
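
To make the idea of alternating least squares concrete, here is a toy sketch in plain Python. It is not MLlib's implementation: it uses a single latent factor per user and per movie (rank 1) so that each least-squares step has a closed form, and the ratings, the value of `lam`, and all variable names are made up for illustration.

```python
from math import sqrt

# Made-up observed ratings keyed by (user, movie); every other entry is missing.
observed = {
    (0, 0): 5.0, (0, 1): 4.0,
    (1, 0): 4.0, (1, 2): 2.0,
    (2, 1): 2.0, (2, 2): 1.0,
}
num_users, num_movies = 3, 3
lam = 0.01  # regularization constant (illustrative value)

# One latent factor per user and per movie (rank 1), initialized to 1.0.
user_f = [1.0] * num_users
movie_f = [1.0] * num_movies


def objective():
    """Regularized squared error that each ALS half-step minimizes."""
    err = sum((r - user_f[u] * movie_f[m]) ** 2 for (u, m), r in observed.items())
    reg = lam * (sum(x * x for x in user_f) + sum(y * y for y in movie_f))
    return err + reg


history = [objective()]
for _ in range(10):
    # Fix the movie factors and solve the 1-D least-squares problem for each user.
    for u in range(num_users):
        rated = [(m, r) for (uu, m), r in observed.items() if uu == u]
        user_f[u] = sum(r * movie_f[m] for m, r in rated) / (
            sum(movie_f[m] ** 2 for m, r in rated) + lam)
    # Fix the user factors and solve for each movie.
    for m in range(num_movies):
        raters = [(u, r) for (u, mm), r in observed.items() if mm == m]
        movie_f[m] = sum(r * user_f[u] for u, r in raters) / (
            sum(user_f[u] ** 2 for u, r in raters) + lam)
    history.append(objective())

# Fill in a missing entry: the predicted rating of user 0 for movie 2.
predicted = user_f[0] * movie_f[2]
rmse = sqrt(sum((r - user_f[u] * movie_f[m]) ** 2
                for (u, m), r in observed.items()) / len(observed))
print("objective: %.3f -> %.3f, train RMSE %.3f, predicted r(0,2) = %.2f"
      % (history[0], history[-1], rmse, predicted))
```

Because each half-step exactly minimizes the objective over one set of factors while the other is held fixed, the objective never increases from one sweep to the next. With a higher rank, each step becomes a small linear system instead of a single division.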

3. Create training examples
------------------------------
To make recommendations _for you_, we are going to learn your taste by asking you to rate a few movies. We have selected a small set of movies that have received the most ratings from users in the MovieLens dataset. You can rate those movies by running the following:

In [32]:
import itertools
import sys

from math import sqrt
from operator import add
from os import remove, removedirs
from os.path import dirname, join, isfile
from time import time

from pyspark import SparkConf, SparkContext
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

import numpy as np

In [2]:
topMovies = """1,Toy Story (1995)
780,Independence Day (a.k.a. ID4) (1996)
590,Dances with Wolves (1990)
1210,Star Wars: Episode VI - Return of the Jedi (1983)
648,Mission: Impossible (1996)
344,Ace Ventura: Pet Detective (1994)
165,Die Hard: With a Vengeance (1995)
153,Batman Forever (1995)
597,Pretty Woman (1990)
1580,Men in Black (1997)
231,Dumb & Dumber (1994)"""

ratingsFile = 'personalRatings.txt'

In [3]:
if isfile(ratingsFile):
    r = raw_input("Looks like you've already rated the movies. Overwrite ratings (y/N)? ")
    if r and r[0].lower() == "y":
        remove(ratingsFile)

Looks like you've already rated the movies. Overwrite ratings (y/N)? N


In [4]:
if not isfile(ratingsFile):
    prompt = "Please rate the following movie (1-5 (best), or 0 if not seen): "
    print prompt

    now = int(time())
    n = 0

    f = open(ratingsFile, 'w')
    for line in topMovies.split("\n"):
        ls = line.strip().split(",")
        valid = False
        while not valid:
            rStr = raw_input(ls[1] + ": ")
            r = int(rStr) if rStr.isdigit() else -1
            if r < 0 or r > 5:
                print prompt
            else:
                valid = True
                if r > 0:
                    f.write("0::%s::%d::%d\n" % (ls[0], r, now))
                    n += 1
    f.close()

After you’re done rating the movies, we save your ratings in `personalRatings.txt` in the MovieLens format, where a special user id `0` is assigned to you.

If you’d like to see how your ratings affect your recommendations, re-run the cells above and answer `y` when asked whether to overwrite your ratings.

4. Setup
------------------------------

We will be using a standalone project template for this exercise.

The following is the main file you are going to edit and run.

In [5]:
# %load MovieLensALS.py
#!/usr/bin/env python

import sys
import itertools
from math import sqrt
from operator import add
from os.path import join, isfile, dirname

from pyspark import SparkConf, SparkContext
from pyspark.mllib.recommendation import ALS

def parseRating(line):
    """
    Parses a rating record in MovieLens format userId::movieId::rating::timestamp .
    """
    fields = line.strip().split("::")
    return long(fields[3]) % 10, (int(fields[0]), int(fields[1]), float(fields[2]))

def parseMovie(line):
    """
    Parses a movie record in MovieLens format movieId::movieTitle .
    """
    fields = line.strip().split("::")
    return int(fields[0]), fields[1]

def loadRatings(ratingsFile):
    """
    Load ratings from file.
    """
    if not isfile(ratingsFile):
        print "File %s does not exist." % ratingsFile
        sys.exit(1)
    f = open(ratingsFile, 'r')
    ratings = filter(lambda r: r[2] > 0, [parseRating(line)[1] for line in f])
    f.close()
    if not ratings:
        print "No ratings provided."
        sys.exit(1)
    else:
        return ratings

def computeRmse(model, data, n):
    """
    Compute RMSE (Root Mean Squared Error).
    """
    predictions = model.predictAll(data.map(lambda x: (x[0], x[1])))
    predictionsAndRatings = predictions.map(lambda x: ((x[0], x[1]), x[2])) \
      .join(data.map(lambda x: ((x[0], x[1]), x[2]))) \
      .values()
    return sqrt(predictionsAndRatings.map(lambda x: (x[0] - x[1]) ** 2).reduce(add) / float(n))

if __name__ == "__main__":
    if (len(sys.argv) != 3):
        print "Usage: /path/to/spark/bin/spark-submit --driver-memory 2g " + \
          "MovieLensALS.py movieLensDataDir personalRatingsFile"
        sys.exit(1)

    # set up environment
    conf = SparkConf() \
      .setAppName("MovieLensALS") \
      .set("spark.executor.memory", "2g")
#     sc = SparkContext(conf=conf)

    # load personal ratings
    myRatings = loadRatings(sys.argv[2])
    myRatingsRDD = sc.parallelize(myRatings, 1)
    
    # load ratings and movie titles

    movieLensHomeDir = sys.argv[1]

    # ratings is an RDD of (last digit of timestamp, (userId, movieId, rating))
    ratings = sc.textFile(join(movieLensHomeDir, "ratings.dat")).map(parseRating)

    # movies is a dict mapping movieId to movieTitle
    movies = dict(sc.textFile(join(movieLensHomeDir, "movies.dat")).map(parseMovie).collect())

    # your code here
    
    # clean up
    sc.stop()



Let’s first take a closer look at our template code, then we’ll start adding code to the template. Locate the `MovieLensALS.py` file and open it with a text editor.

The code uses the SparkContext to read in ratings. Recall that the rating file is a text file with "`::`" as the delimiter. The code parses each line to create an RDD of `(key, (userId, movieId, rating))` pairs, where the key is the last digit of the timestamp; we keep only that digit and use it as a pseudo-random key for splitting the data later. Each rating is a plain `(int, int, float)` tuple, which MLlib accepts in place of its `Rating` class.

In [6]:
def parseRating(line):
    """
    Parses a rating record in MovieLens format userId::movieId::rating::timestamp .
    """
    fields = line.strip().split("::")
    return long(fields[3]) % 10, (int(fields[0]), int(fields[1]), float(fields[2]))

In [7]:
movieLensHomeDir = 's3://dsci/6007/data/MovieLens/movielens/medium/'

# ratings is an RDD of (last digit of timestamp, (userId, movieId, rating))
ratings = sc.textFile(join(movieLensHomeDir, "ratings.dat")).map(parseRating)

Next, the code reads in movie ids and titles and collects them into a dictionary that maps movie id to title.

In [8]:
def parseMovie(line):
    fields = line.split("::")
    return int(fields[0]), fields[1]

movies = dict(sc.textFile(join(movieLensHomeDir, "movies.dat")).map(parseMovie).collect())

Now, let’s get a summary of the ratings.

In [9]:
numRatings = ratings.count()
numUsers = ratings.values().map(lambda r: r[0]).distinct().count()
numMovies = ratings.values().map(lambda r: r[1]).distinct().count()

print "Got {:,} ratings from {:,} users on {:,} movies.".format(numRatings, numUsers, numMovies)

Got 1,000,209 ratings from 6,040 users on 3,706 movies.


5. Splitting training data
------------------------------

In [10]:
def loadRatings(ratingsFile):
    """
    Load ratings from file.
    """
    if not isfile(ratingsFile):
        print "File %s does not exist." % ratingsFile
        sys.exit(1)
    f = open(ratingsFile, 'r')
    ratings = filter(lambda r: r[2] > 0, [parseRating(line)[1] for line in f])
    f.close()
    if not ratings:
        print "No ratings provided."
        sys.exit(1)
    else:
        return ratings

In [11]:
# load personal ratings
myRatings = loadRatings(ratingsFile)
myRatingsRDD = sc.parallelize(myRatings, 1)

We will use MLlib’s `ALS` to train a `MatrixFactorizationModel`. `ALS.train` takes an RDD of `(user, product, rating)` tuples and has training parameters such as the rank of the matrix factors and the regularization constant. To determine a good combination of the training parameters, we split the data into three non-overlapping subsets, named training, validation, and test, based on the last digit of the timestamp: digits 0-5 go to training, 6-7 to validation, and 8-9 to test. We will train multiple models on the training set, select the best model on the validation set based on RMSE (Root Mean Squared Error), and finally evaluate the best model on the test set. We also add your ratings to the training set to make recommendations for you. We hold the training, validation, and test sets in memory by calling `cache` because we need to visit them multiple times.

In [12]:
numPartitions = 4
training = ratings.filter(lambda x: x[0] < 6) \
  .values() \
  .union(myRatingsRDD) \
  .repartition(numPartitions) \
  .cache()

validation = ratings.filter(lambda x: x[0] >= 6 and x[0] < 8) \
  .values() \
  .repartition(numPartitions) \
  .cache()

test = ratings.filter(lambda x: x[0] >= 8).values().cache()

numTraining = training.count()
numValidation = validation.count()
numTest = test.count()

print "Training: {:,}; validation: {:,}; test: {:,}".format(numTraining, numValidation, numTest)

Training: 602,252; validation: 198,919; test: 199,049
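
The split sizes above follow from the bucketing rule: six of the ten possible last digits go to training and two each to validation and test, which gives roughly 60%/20%/20% if the digits are evenly distributed. Here is a plain-Python sketch of the same rule; `bucket_for` is a hypothetical helper, and the even-distribution assumption is only approximate for real timestamps.

```python
from collections import Counter


def bucket_for(last_digit):
    """Map the last digit of a timestamp to a split, mirroring the filters above."""
    if last_digit < 6:
        return "training"
    elif last_digit < 8:
        return "validation"
    return "test"


# Count how many of the ten possible digits land in each split.
shares = Counter(bucket_for(d) for d in range(10))
for name in ("training", "validation", "test"):
    print("%-10s %d/10 of the digits" % (name, shares[name]))
```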


6. Training using ALS
------------------------------
In this section, we will use `ALS.train` to train several models, then select and evaluate the best one. Among the training parameters of ALS, the most important ones are rank, lambda (regularization constant), and number of iterations. The `train` method of ALS we are going to use is defined as follows:
```python
class ALS(object):

    def train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1):
        # ...
        return MatrixFactorizationModel(sc, mod)
```
Ideally, we want to try a large number of combinations of them in order to find the best one. Due to time constraints, we will test only 12 combinations resulting from the cross product of 2 different ranks (6 and 8), 3 different lambdas (0.5, 0.8, and 1.0), and 2 different numbers of iterations (10 and 20). We use the provided method `computeRmse` to compute the RMSE on the validation set for each model. The model with the smallest RMSE on the validation set is selected, and its RMSE on the test set is used as the final metric.

In [85]:
ranks = [6, 8]
lambdas = [0.5, 0.8, 1.0]
numIters = [10, 20]
bestModel = None
rmse_val = float("inf")
best_rank = 0
best_lambda = -1.0
best_iteration = -1
for rank, lmbda, numIter in itertools.product(ranks, lambdas, numIters):
    # train is a classmethod, so no ALS instance is needed
    model = ALS.train(training, rank, numIter, lmbda)
    validationRmse = computeRmse(model, validation, numValidation)
    if validationRmse < rmse_val:
        bestModel = model
        rmse_val = validationRmse
        best_rank = rank
        best_lambda = lmbda
        best_iteration = numIter

# evaluate the best model on the test set
test_rmse = computeRmse(bestModel, test, numTest)
print("The best model has validation RMSE {} and test RMSE {}, lambda {}, iterations {}, rank {}".format(
    rmse_val, test_rmse, best_lambda, best_iteration, best_rank))

The best model has validation RMSE 1.04564380399 and test RMSE 1.0424118474, lambda 0.5, iterations 10, rank 8


In [86]:
print(bestModel)

<pyspark.mllib.recommendation.MatrixFactorizationModel object at 0x7f7368c4bcd0>


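Spark might take a minute or two to train the models. The exact numbers depend on the parameter grid and on your ratings; for comparison, a run with a different grid (one that includes lambda 10.0) printed the following: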

    The best model was trained using rank 8 and lambda 10.0, and its RMSE on test is 0.8808492431998702.
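
For reference, the quantity that `computeRmse` reports can be reproduced on a handful of numbers without Spark. The sketch below pairs predictions with actual ratings by their `(user, movie)` key, which plays the role of the `join`, and then takes the square root of the mean squared difference. All values are made up, and `rmse_by_key` is a hypothetical name.

```python
from math import sqrt


def rmse_by_key(predicted, actual):
    """RMSE over the (user, movie) keys present in both dictionaries."""
    shared = [key for key in actual if key in predicted]
    squared_error = sum((predicted[key] - actual[key]) ** 2 for key in shared)
    return sqrt(squared_error / float(len(shared)))


# Made-up ratings and predictions keyed by (userId, movieId).
actual = {(1, 10): 4.0, (1, 20): 2.0, (2, 10): 5.0}
predicted = {(1, 10): 3.0, (1, 20): 2.0, (2, 10): 3.0}

print(rmse_by_key(predicted, actual))
```

Note that `computeRmse` divides by the count `n` passed in rather than by the number of joined pairs; the two agree when every pair receives a prediction.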

7. Recommending movies for you
------------------------------
As the last part of our tutorial, let’s take a look at what movies our model recommends for you. This is done by generating `(0, movieId)` pairs for all movies you haven’t rated and calling the model’s [`predictAll`](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/recommendation.html#MatrixFactorizationModel.predictAll) method to get predictions. `0` is the special user id assigned to you.
```python
class MatrixFactorizationModel(object):
    def predictAll(self, usersProducts):
        # ...
        return RDD(self._java_model.predict(usersProductsJRDD._jrdd),
                   self._context, RatingDeserializer())
```
After we get all predictions, let us list the top 50 recommendations and see whether they look good to you.

In [15]:
my_movies_rated = [i[1] for i in myRatings]

In [16]:
candidate_movies = [ m for m in movies if m not in my_movies_rated]
candidate_movies_rdd = sc.parallelize(candidate_movies )

In [18]:
predictions = bestModel.predictAll(candidate_movies_rdd.map(lambda x: (0,x))).collect()

In [22]:
predictions[2]

Rating(user=0, product=320, rating=2.308844064740798)

In [24]:
recommendations = sorted(predictions, key=lambda x: x[2], reverse=True)[:50]

print "Movies for you:"
for i in xrange(len(recommendations)):
    print ("%2d: %s" % (i + 1, movies[recommendations[i][1]])).encode('ascii', 'ignore')

Movies for you:
 1: I Am Cuba (Soy Cuba/Ya Kuba) (1964)
 2: Time of the Gypsies (Dom za vesanje) (1989)
 3: Smashing Time (1967)
 4: Gate of Heavenly Peace, The (1995)
 5: Follow the Bitch (1998)
 6: Zachariah (1971)
 7: Bewegte Mann, Der (1994)
 8: Institute Benjamenta, or This Dream People Call Human Life (1995)
 9: For All Mankind (1989)
10: Hour of the Pig, The (1993)
11: Man of the Century (1999)
12: Lamerica (1994)
13: Lured (1947)
14: Apple, The (Sib) (1998)
15: Sanjuro (1962)
16: I Can't Sleep (J'ai pas sommeil) (1994)
17: Bells, The (1926)
18: Shawshank Redemption, The (1994)
19: Collectionneuse, La (1967)
20: Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)
21: 24 7: Twenty Four Seven (1997)
22: Usual Suspects, The (1995)
23: Godfather, The (1972)
24: Close Shave, A (1995)
25: Big Trees, The (1952)
26: Wrong Trousers, The (1993)
27: Paths of Glory (1957)
28: Soft Fruit (1999)
29: Schindler's List (1993)
30: Third Man, The (1949)
31: Sunset Blvd. (a.k.a. Sun

The output should be similar to

    Movies recommended for you:
     1: Silence of the Lambs, The (1991)
     2: Saving Private Ryan (1998)
     3: Godfather, The (1972)
     4: Star Wars: Episode IV - A New Hope (1977)
     5: Braveheart (1995)
     6: Schindler's List (1993)
     7: Shawshank Redemption, The (1994)
     8: Star Wars: Episode V - The Empire Strikes Back (1980)
     9: Pulp Fiction (1994)
    10: Alien (1979)
    ...

YMMV, and don’t expect to see movies from this decade, because the data set is old.
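
Sorting every prediction works, but only the top few are needed. As a possible alternative, `heapq.nlargest` from the standard library picks the highest-scoring entries without a full sort. The sketch below uses made-up `(user, movieId, rating)` tuples and a small title map with made-up ids in place of the collected predictions and the `movies` dictionary.

```python
import heapq

# Made-up stand-ins for the collected predictions and the movie title map.
predictions = [(0, 101, 3.9), (0, 102, 2.1), (0, 103, 4.4), (0, 104, 3.2)]
titles = {101: "Godfather, The (1972)", 102: "Zachariah (1971)",
          103: "Paths of Glory (1957)", 104: "Third Man, The (1949)"}

# Keep the two predictions with the highest rating (third tuple field).
top = heapq.nlargest(2, predictions, key=lambda p: p[2])

for rank, (_, movie_id, rating) in enumerate(top, start=1):
    print("%2d: %s (predicted %.1f)" % (rank, titles[movie_id], rating))
```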

8. Exercises
------------------------------
### 8.1 Comparing to a naïve baseline
Does ALS output a non-trivial model? We can compare the evaluation result with a naive baseline model that only outputs the average rating (or you may try one that outputs the average rating per movie). Computing the baseline’s RMSE is straightforward:

In [65]:
number_of_ratings = ratings.count()

In [62]:
total_ratings = ratings.map( lambda x: x[1][2]).reduce(add)

In [66]:
average_rating = total_ratings / number_of_ratings 

In [87]:
average_rating

3.581564453029317

In [88]:
## get the average rating for our rated movies
my_ratings = [i[2] for i in myRatings]
avg_vector = [average_rating for _ in range(len(myRatings))]

In [89]:
average_rmse = np.linalg.norm(np.array(avg_vector) - np.array(my_ratings))/sqrt(len(avg_vector))

In [90]:
average_rmse

1.3278188379711944

In [43]:
## next, get the movies we have already rated and see what the RMSE is to predict the ratings for these movies

In [71]:
my_rated_movies = [i[1] for i in myRatings]

In [91]:
## get the RMSE
model_rmse = np.linalg.norm(np.array([bestModel.predict(0, i) for i in my_rated_movies])-np.array(my_ratings))/\
 sqrt(len(my_rated_movies))

In [93]:
print("The model performs {:2%} better compared to the average rating".format((average_rmse- model_rmse)/model_rmse))

The model performs 3.231833% better compared to the average rating


The output should be similar to

    The best model improves the baseline by 20.96%.

It seems obvious that the trained model would outperform the naive baseline. However, a bad combination of training parameters would lead to a model worse than this naive baseline. Choosing the right set of parameters is quite important for this task.
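
A minimal sketch of the baseline comparison in plain Python, with made-up numbers: the baseline predicts the mean training rating for every held-out rating, and the improvement is expressed relative to the baseline RMSE. The `model_rmse` value is a placeholder rather than a measured result. Note that the cell above divides by the model RMSE instead, which gives a slightly different percentage.

```python
from math import sqrt

# Made-up training and held-out ratings.
train_ratings = [5.0, 4.0, 3.0, 4.0, 2.0, 3.0]
held_out = [4.0, 2.0, 5.0, 3.0]

# The naive baseline predicts the training mean for every held-out rating.
mean_rating = sum(train_ratings) / len(train_ratings)
baseline_rmse = sqrt(sum((mean_rating - r) ** 2 for r in held_out) / len(held_out))

# Placeholder for the RMSE of a trained model on the same held-out ratings.
model_rmse = 0.9

improvement = (baseline_rmse - model_rmse) / baseline_rmse * 100
print("Baseline RMSE %.3f; the model improves on it by %.2f%%." % (baseline_rmse, improvement))
```

A negative `improvement` would mean the model does worse than always predicting the mean, which is the situation the paragraph above warns about.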

### 8.2. Augmenting matrix factors
In this tutorial, we add your ratings to the training set. A better way to get the recommendations for you is training a matrix factorization model first and then augmenting the model using your ratings. If this sounds interesting to you, you can take a look at the implementation of [MatrixFactorizationModel](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/recommendation.html#MatrixFactorizationModel) and see how to update the model for new users and new movies.
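
One common way to do this, sketched here under stated assumptions rather than taken from MLlib, is to keep the learned movie factors fixed and solve a small regularized least-squares problem for the new user's factor vector. The example uses rank 2 so that the normal equations can be solved by hand; the movie factors, the ratings, and the `fold_in_user` helper are all hypothetical.

```python
def fold_in_user(movie_factors, user_ratings, lam):
    """Solve (A^T A + lam * I) x = A^T r for a rank-2 user vector x.

    movie_factors maps movieId -> (f1, f2); user_ratings maps movieId -> rating.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for movie_id, rating in user_ratings.items():
        f1, f2 = movie_factors[movie_id]
        a11 += f1 * f1
        a12 += f1 * f2
        a22 += f2 * f2
        b1 += f1 * rating
        b2 += f2 * rating
    a11 += lam
    a22 += lam
    det = a11 * a22 - a12 * a12
    # Cramer's rule for the symmetric 2x2 system.
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


# Made-up movie factors from an already trained model, and a new user's ratings.
movie_factors = {1: (1.0, 0.0), 2: (0.0, 1.0), 3: (1.0, 1.0), 4: (2.0, 1.0)}
user_ratings = {1: 2.0, 2: 3.0, 3: 5.0}

x = fold_in_user(movie_factors, user_ratings, lam=0.1)
# Predicted rating for a movie the user has not rated: dot product with its factors.
predicted = x[0] * movie_factors[4][0] + x[1] * movie_factors[4][1]
print("user factors %s, predicted rating for movie 4: %.2f" % (x, predicted))
```

The appeal of this approach is that a new user only requires solving one small system, with no retraining of the movie factors.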