<a href="https://colab.research.google.com/github/profshai/pyspark-big-data/blob/main/movie_recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie recommendation using collaborative filtering

The dataset is the MovieLens dataset (https://grouplens.org/datasets/movielens/).

In [1]:
pip install pyspark

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/89/db/e18cfd78e408de957821ec5ca56de1250645b05f8523d169803d8df35a64/pyspark-3.1.2.tar.gz (212.4MB)
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.2-py2.py3-none-any.whl size=212880768 sha256=f6e31ac405118098fa0b312b2dfffae1325ec0108aa7177ae0c74c8757e79b82
  Stored in directory: /root/.cache/pip/wheels/40/1b/2c/30f43be2627857ab80062bef1527c0128f7b4070b6b2d02139
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1.2


### Import libraries

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('recommender').getOrCreate()

Spark's MLlib machine learning library implements collaborative filtering using Alternating Least Squares (ALS). The parameters below use the names from the older RDD-based API; where the DataFrame-based API used in this notebook (pyspark.ml.recommendation.ALS) names them differently, the equivalent is given in parentheses:

* numBlocks (numUserBlocks and numItemBlocks) is the number of blocks used to parallelize computation (in the RDD-based API, set to -1 to auto-configure).
* rank is the number of latent factors in the model.
* iterations (maxIter) is the number of iterations to run.
* lambda (regParam) specifies the regularization parameter in ALS.
* implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
* alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.
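
To make the roles of rank, iterations, and lambda more concrete, here is a minimal pure-Python sketch of explicit-feedback ALS with a single latent factor (rank 1). It is an illustration only, not Spark's implementation: the function name `als_rank1`, the toy ratings, and the use of plain L2 regularization are all assumptions made for this example.

```python
def als_rank1(ratings, iterations=10, lam=0.01):
    """Fit one latent factor per user and per item to observed ratings.

    ratings: dict mapping (user, item) -> rating (explicit feedback).
    Returns (user_factors, item_factors) as dicts.
    """
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}

    for _ in range(iterations):
        # Step 1: hold item factors fixed; each user factor then has a
        # closed-form regularized least-squares solution.
        for u in users:
            obs = [(item_f[i], r) for (uu, i), r in ratings.items() if uu == u]
            user_f[u] = sum(v * r for v, r in obs) / (lam + sum(v * v for v, _ in obs))
        # Step 2: hold user factors fixed and solve for each item factor.
        for i in items:
            obs = [(user_f[u], r) for (u, ii), r in ratings.items() if ii == i]
            item_f[i] = sum(w * r for w, r in obs) / (lam + sum(w * w for w, _ in obs))
    return user_f, item_f


# Hypothetical toy ratings that are exactly rank 1 (rating = a[user] * b[item]).
toy = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0}
user_f, item_f = als_rank1(toy)

# A predicted rating is the product of the user factor and the item factor.
predictions = {(u, i): user_f[u] * item_f[i] for (u, i) in toy}
```

With a higher rank, each scalar update becomes a small linear system per user or item, but the alternation between the two steps is the same.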

Let's see this all in action!

In [5]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

### Import dataset

In [6]:
data = spark.read.csv('movielens_ratings.csv',inferSchema=True,header=True)

In [7]:
data.head()

Row(movieId=2, rating=3.0, userId=0)

In [8]:
data.describe().show()

+-------+------------------+------------------+------------------+
|summary|           movieId|            rating|            userId|
+-------+------------------+------------------+------------------+
|  count|              1501|              1501|              1501|
|   mean| 49.40572951365756|1.7741505662891406|14.383744170552964|
| stddev|28.937034065088994| 1.187276166124803| 8.591040424293272|
|    min|                 0|               1.0|                 0|
|    max|                99|               5.0|                29|
+-------+------------------+------------------+------------------+



### Split data

In [9]:
(training, test) = data.randomSplit([0.8, 0.2])
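
Because no seed is passed above, the split (and therefore every number further down) will differ from run to run; `randomSplit` accepts an optional `seed` argument for reproducibility. The sketch below imitates the idea in plain Python, assuming each row is assigned to a split by an independent random draw, which is why the split sizes are only approximately 80/20. The helper `random_split` is hypothetical and is not Spark's code.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split using an independent uniform draw."""
    rng = random.Random(seed)
    total = float(sum(weights))

    # Cumulative upper bounds of the normalized weights, e.g. [0.8, 1.0].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for k, upper in enumerate(bounds):
            if draw < upper:
                splits[k].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound.
            splits[-1].append(row)
    return splits


# 1501 stand-in rows, matching the row count reported by describe() above.
rows = list(range(1501))
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=42)
```

Calling `random_split` again with the same seed returns the same two lists, which is the property a fixed seed buys you.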

### Build model

In [10]:
# Build the recommendation model using ALS on the training data
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating")
model = als.fit(training)

### Evaluate model

In [11]:
# Generate predictions for the test data (the RMSE is computed further below)
predictions = model.transform(test)

In [12]:
predictions.show()

+-------+------+------+-----------+
|movieId|rating|userId| prediction|
+-------+------+------+-----------+
|     31|   4.0|    12|  2.0425363|
|     31|   1.0|    13|  1.2831385|
|     31|   1.0|    24|  1.8250816|
|     85|   1.0|    12|  1.2649267|
|     85|   1.0|    13|  2.2662847|
|     85|   1.0|     4|   2.869074|
|     65|   1.0|    16| 0.57939017|
|     65|   2.0|     3|  2.2559707|
|     65|   1.0|     2|  2.2925167|
|     53|   1.0|     9|   2.147821|
|     53|   1.0|     7|  1.3996441|
|     53|   1.0|    25|   0.443898|
|     53|   5.0|    21|  3.4773057|
|     78|   1.0|    20| 0.46641803|
|     78|   1.0|    11|  1.0706434|
|     34|   1.0|    28| 0.98601615|
|     34|   1.0|    19|  1.0932337|
|     34|   4.0|     2|  0.1441027|
|     81|   3.0|    26|    4.40609|
|     28|   3.0|     1|-0.36992237|
+-------+------+------+-----------+
only showing top 20 rows



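The predictions are poor: several are far from the actual rating, and some fall outside the 1-5 rating range (for example, -0.37 for movieId 28). A likely cause is the small dataset, which has only 1,501 ratings.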

In [13]:
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 1.6741877405710142


The RMSE expresses the error in the same units as the rating column (stars). Here the typical prediction error is about 1.67 stars on a 1-5 scale, which is large.
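
For reference, the metric itself is simple to compute by hand. The sketch below defines a hypothetical `rmse` helper and applies it to the first three rows of the predictions table shown earlier; it illustrates the formula only and is not how `RegressionEvaluator` is implemented.

```python
import math


def rmse(pairs):
    """Root-mean-square error over (actual_rating, predicted_rating) pairs."""
    pairs = list(pairs)
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(pairs))


# First three (rating, prediction) rows from the predictions output above.
sample = [(4.0, 2.0425363), (1.0, 1.2831385), (1.0, 1.8250816)]
sample_rmse = rmse(sample)
```

Because the errors are squared before averaging, a few large misses (such as predicting about 2.0 for a 4.0 rating) dominate the result.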

### Recommend movies to a single user

In [14]:
single_user = test.filter(test['userId']==11).select(['movieId','userId'])

In [15]:
# UserId 11 had 10 ratings in the test data set 
single_user.show()

+-------+------+
|movieId|userId|
+-------+------+
|     16|    11|
|     22|    11|
|     25|    11|
|     36|    11|
|     41|    11|
|     51|    11|
|     61|    11|
|     77|    11|
|     78|    11|
|     94|    11|
+-------+------+



In [16]:
recommendations = model.transform(single_user)

In [17]:
recommendations.orderBy('prediction',ascending=False).show()

+-------+------+----------+
|movieId|userId|prediction|
+-------+------+----------+
|     94|    11| 3.9720926|
|     22|    11| 3.8743114|
|     36|    11| 3.5741959|
|     77|    11| 3.1154115|
|     51|    11|  2.313581|
|     61|    11| 1.4004419|
|     41|    11| 1.3451228|
|     78|    11| 1.0706434|
|     25|    11|0.32700703|
|     16|    11|-1.3389496|
+-------+------+----------+



userId 11 is predicted to enjoy movieId 94 the most, so it should be recommended first. Don't recommend movieId 16: it has the lowest predicted rating, so they are likely to dislike it.
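
Turning scored movies into a short recommendation list is a sort followed by a cut-off. The sketch below does this in plain Python using the (movieId, prediction) pairs from the output above; the helper `top_n` and its `min_score` threshold are hypothetical names introduced for illustration. Note that these ten movies are ones userId 11 already rated in the test set, so a real recommender would score movies the user has not yet seen. Spark's ALSModel also provides methods such as recommendForUserSubset for this purpose.

```python
# (movieId, predicted rating) pairs for userId 11, copied from the output above.
scored = [
    (16, -1.3389496),
    (22, 3.8743114),
    (25, 0.32700703),
    (36, 3.5741959),
    (41, 1.3451228),
    (51, 2.313581),
    (61, 1.4004419),
    (77, 3.1154115),
    (78, 1.0706434),
    (94, 3.9720926),
]


def top_n(scored, n, min_score=None):
    """Return up to n movie IDs, highest predicted rating first."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if min_score is not None:
        # Optionally drop movies predicted below a minimum rating.
        ranked = [pair for pair in ranked if pair[1] >= min_score]
    return [movie_id for movie_id, _ in ranked[:n]]


top_three = top_n(scored, 3)
worth_recommending = top_n(scored, 10, min_score=3.0)
```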

End of Notebook!