In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").getOrCreate()

## Collaborative Filtering
Collaborative filtering is a machine learning technique that predicts the ratings users would give to items, based on the ratings that other users have already given.

### Import the ALS class
In this exercise, you will use the Alternating Least Squares (ALS) collaborative filtering algorithm to create a recommender.

In [2]:
from pyspark.ml.recommendation import ALS

### Load Source Data
The source data for the recommender is in two files: one containing numeric IDs for movies and users, along with user ratings, and the other containing details of the movies. The two are joined on the **movieId** column.

In [3]:
ratings = spark.read.csv('../data/ratings.csv', inferSchema=True, header=True)
movies = spark.read.csv('../data/movies.csv', inferSchema=True, header=True)
ratings.join(movies, "movieId").show()

+-------+------+------+----------+--------------------+--------------------+
|movieId|userId|rating| timestamp|               title|              genres|
+-------+------+------+----------+--------------------+--------------------+
|     31|     1|   2.5|1260759144|Dangerous Minds (...|               Drama|
|   1029|     1|   3.0|1260759179|        Dumbo (1941)|Animation|Childre...|
|   1061|     1|   3.0|1260759182|     Sleepers (1996)|            Thriller|
|   1129|     1|   2.0|1260759185|Escape from New Y...|Action|Adventure|...|
|   1172|     1|   4.0|1260759205|Cinema Paradiso (...|               Drama|
|   1263|     1|   2.0|1260759151|Deer Hunter, The ...|           Drama|War|
|   1287|     1|   2.0|1260759187|      Ben-Hur (1959)|Action|Adventure|...|
|   1293|     1|   2.0|1260759148|       Gandhi (1982)|               Drama|
|   1339|     1|   3.5|1260759125|Dracula (Bram Sto...|Fantasy|Horror|Ro...|
|   1343|     1|   2.0|1260759131|    Cape Fear (1991)|            Thriller|

### Prepare the Data
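To prepare the data, select the user ID, movie ID, and rating columns, then split the rows into a training set (roughly 70%) and a test set (roughly 30%). The rating column is renamed to **label** in the training set and **trueLabel** in the test set.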

In [4]:
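# Keep only the columns the recommender needs
data = ratings.select("userId", "movieId", "rating")
# Randomly split the rows using weights of 0.7 and 0.3
splits = data.randomSplit([0.7, 0.3])
# Rename the rating column so the known value is easy to tell apart from the prediction
train = splits[0].withColumnRenamed("rating", "label")
test = splits[1].withColumnRenamed("rating", "trueLabel")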
train_rows = train.count()
test_rows = test.count()
print("Training Rows:", train_rows, " Testing Rows:", test_rows)

Training Rows: 69722  Testing Rows: 30282
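
The counts above add up to 100,004 rows, of which about 69.7% landed in the training set rather than exactly 70%. This is expected: a random split assigns each row by chance, so the weights act as probabilities rather than exact proportions. The following is a minimal pure-Python sketch of that idea, not Spark's actual implementation. The `random_split` helper, the stand-in integer rows, and the seed value are all assumptions made for illustration.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to exactly one split by drawing one uniform number per row."""
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, e.g. roughly [0.7, 1.0]
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    last = len(bounds) - 1
    for row in rows:
        draw = rng.random()
        for index, upper in enumerate(bounds):
            # The last split also catches any draw missed because of rounding
            if draw < upper or index == last:
                splits[index].append(row)
                break
    return splits

rows = list(range(100004))  # stand-in rows; same total as the counts above
train_part, test_part = random_split(rows, [0.7, 0.3], seed=42)
train_share = len(train_part) / len(rows)
print("Training share:", round(train_share, 4))
```

In the notebook itself, passing a seed to `randomSplit` (for example `data.randomSplit([0.7, 0.3], seed=42)`) should make the split repeatable between runs on the same data.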


### Build the Recommender
The ALS class is an estimator, so you can use its **fit** method to train a model, or you can include it in a pipeline. Rather than specifying a feature vector and a label, the ALS algorithm requires a numeric user ID, item ID, and rating.

In [5]:
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="label")
model = als.fit(train)
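
To make the roles of `maxIter` and `regParam` more concrete, here is a minimal pure-Python sketch of alternating least squares with a single latent factor per user and per item. It is a simplification for illustration, not Spark's implementation, which uses several factors per user and item (set by its `rank` parameter) and runs distributed. The `als_rank1` and `squared_error` helpers and the toy ratings are hypothetical.

```python
from collections import defaultdict

def als_rank1(ratings, iterations=5, reg=0.01):
    """Fit one latent factor per user and per item from (user, item, rating) triples."""
    by_user = defaultdict(list)
    by_item = defaultdict(list)
    for user, item, rating in ratings:
        by_user[user].append((item, rating))
        by_item[item].append((user, rating))

    user_factor = {user: 1.0 for user in by_user}
    item_factor = {item: 1.0 for item in by_item}

    for _ in range(iterations):
        # Hold item factors fixed: each user factor has a closed-form solution
        for user, rated in by_user.items():
            numerator = sum(r * item_factor[i] for i, r in rated)
            denominator = sum(item_factor[i] ** 2 for i, r in rated) + reg
            user_factor[user] = numerator / denominator
        # Hold user factors fixed: solve for each item factor the same way
        for item, rated in by_item.items():
            numerator = sum(r * user_factor[u] for u, r in rated)
            denominator = sum(user_factor[u] ** 2 for u, r in rated) + reg
            item_factor[item] = numerator / denominator
    return user_factor, item_factor

def squared_error(ratings, user_factor, item_factor):
    """Sum of squared differences between known and predicted ratings."""
    return sum((r - user_factor[u] * item_factor[i]) ** 2 for u, i, r in ratings)

# Made-up (userId, movieId, rating) triples, for illustration only
toy_ratings = [
    (1, 10, 4.0), (1, 20, 2.0),
    (2, 10, 5.0), (2, 20, 2.5),
    (3, 10, 2.0), (3, 30, 3.0),
]

initial_error = squared_error(
    toy_ratings,
    {u: 1.0 for u, _, _ in toy_ratings},
    {i: 1.0 for _, i, _ in toy_ratings},
)
users, items = als_rank1(toy_ratings, iterations=5, reg=0.01)
fitted_error = squared_error(toy_ratings, users, items)
print("Error before:", initial_error, "after:", round(fitted_error, 3))
```

Each pass of the outer loop corresponds to one iteration in the sense of `maxIter`, and `reg` plays the role of `regParam` by keeping the factors from growing too large when a user or item has only a few ratings.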

### Test the Recommender
Now that you've trained the recommender, you can see how accurately it predicts known ratings in the test set.

In [6]:
prediction = model.transform(test)
prediction.join(movies, "movieId").select("userId", "title", "prediction", "trueLabel").show(100, truncate=False)

+------+--------------------------------+----------+---------+
|userId|title                           |prediction|trueLabel|
+------+--------------------------------+----------+---------+
|232   |Guilty as Sin (1993)            |2.815981  |4.0      |
|380   |Guilty as Sin (1993)            |1.6607947 |3.0      |
|30    |Guilty as Sin (1993)            |2.4048593 |4.0      |
|588   |Hudsucker Proxy, The (1994)     |3.4587421 |3.0      |
|274   |Hudsucker Proxy, The (1994)     |3.9263043 |5.0      |
|292   |Hudsucker Proxy, The (1994)     |4.1340666 |3.5      |
|624   |Hudsucker Proxy, The (1994)     |3.8228023 |4.0      |
|195   |Hudsucker Proxy, The (1994)     |2.7618933 |3.0      |
|30    |Hudsucker Proxy, The (1994)     |4.8375697 |4.0      |
|521   |Hudsucker Proxy, The (1994)     |3.4877195 |3.5      |
|547   |What Happened Was... (1994)     |0.78678083|3.0      |
|509   |What Happened Was... (1994)     |0.8098248 |3.0      |
|463   |Dirty Dancing (1987)            |3.471386  |3.0      |
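
One common way to summarize how close the predictions are to the known ratings is the root mean squared error (RMSE). The sketch below computes it in plain Python for the 13 rows displayed above only, so the result is illustrative and says little about the full test set. The `rmse` helper is a hypothetical name. It skips NaN predictions, which ALS can return for users or movies that did not appear in the training split.

```python
import math

# (prediction, trueLabel) pairs copied from the rows displayed above
shown_rows = [
    (2.815981, 4.0), (1.6607947, 3.0), (2.4048593, 4.0),
    (3.4587421, 3.0), (3.9263043, 5.0), (4.1340666, 3.5),
    (3.8228023, 4.0), (2.7618933, 3.0), (4.8375697, 4.0),
    (3.4877195, 3.5), (0.78678083, 3.0), (0.8098248, 3.0),
    (3.471386, 3.0),
]

def rmse(pairs):
    """Root mean squared error over (prediction, actual) pairs, ignoring NaN predictions."""
    usable = [(p, a) for p, a in pairs if not math.isnan(p)]
    mean_squared = sum((p - a) ** 2 for p, a in usable) / len(usable)
    return math.sqrt(mean_squared)

sample_rmse = rmse(shown_rows)
print("RMSE over the displayed rows:", round(sample_rmse, 3))
```

On the full `prediction` DataFrame, the same metric is usually computed with `RegressionEvaluator` from `pyspark.ml.evaluation`, using `labelCol="trueLabel"`, `predictionCol="prediction"`, and `metricName="rmse"`. Setting `coldStartStrategy="drop"` on the ALS estimator is one way to exclude NaN predictions before evaluating.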

The data used in this exercise describes 5-star rating activity from [MovieLens](http://movielens.org), a movie recommendation service. It was created by GroupLens, a research group in the Department of Computer Science and Engineering at the University of Minnesota, and is used here with permission.

This dataset and other GroupLens data sets are publicly available for download at <http://grouplens.org/datasets/>.

For more information, see F. Maxwell Harper and Joseph A. Konstan. 2015. [The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015)](http://dx.doi.org/10.1145/2827872)