Dataset availability: https://www.kaggle.com/datasets/shubhammehta21/movie-lens-small-latest-dataset

### 1. Import libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

### 2. Spark session

In [2]:
spark = SparkSession.builder.appName('recommender').getOrCreate()

### 3. Read dataset

In [None]:
df = spark.read.csv('ratings.csv', inferSchema=True, header=True)
df.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)


In [3]:
df.show(3)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
+------+-------+------+---------+
only showing top 3 rows



In [4]:
df.describe().show()

+-------+------------------+----------------+------------------+--------------------+
|summary|            userId|         movieId|            rating|           timestamp|
+-------+------------------+----------------+------------------+--------------------+
|  count|            100836|          100836|            100836|              100836|
|   mean|326.12756356856676|19435.2957177992| 3.501556983616962|1.2059460873684695E9|
| stddev| 182.6184914635004|35530.9871987003|1.0425292390606342|2.1626103599513078E8|
|    min|                 1|               1|               0.5|           828124615|
|    max|               610|          193609|               5.0|          1537799250|
+-------+------------------+----------------+------------------+--------------------+



### 4. Alternating least squares modelling

In [12]:
train, test = df.randomSplit([0.8, 0.2])

In [34]:
als = ALS(maxIter=5,
          regParam=0.01,
          userCol='userId',
          itemCol='movieId',
          ratingCol='rating')

In [35]:
model = als.fit(train)

In [36]:
predictions = model.transform(test)
predictions.show(5)

+------+-------+------+----------+----------+
|userId|movieId|rating| timestamp|prediction|
+------+-------+------+----------+----------+
|    91|    471|   1.0|1112713817| 2.1595163|
|   372|    471|   3.0| 874415126| 3.2865875|
|   599|    471|   2.5|1498518822| 3.0892537|
|   182|    471|   4.5|1054779644| 4.2635202|
|   474|    471|   3.0| 974668858| 3.9824388|
+------+-------+------+----------+----------+
only showing top 5 rows



In [37]:
evaluator = RegressionEvaluator(metricName='rmse',
                                labelCol='rating',
                                predictionCol='prediction')

In [38]:
rmse = evaluator.evaluate(predictions)
print('RMSE:', rmse)

RMSE: nan


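Note that the RMSE comes out as NaN. ALS cannot produce a prediction for users or items that appear in the test split but not in the training split, so those rows get a NaN prediction, and a single NaN is enough to make the whole metric NaN.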
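
Why one missing prediction wipes out the metric can be seen in a minimal plain-Python sketch. No Spark is involved here; the `rmse` helper and the toy numbers are made up purely for illustration.

```python
import math

def rmse(labels, preds):
    """Root mean square error over paired labels and predictions."""
    squared = [(y - p) ** 2 for y, p in zip(labels, preds)]
    return math.sqrt(sum(squared) / len(squared))

# Illustrative values only, not taken from the dataset
labels = [4.0, 3.0, 5.0]
clean = [3.5, 3.0, 4.0]
with_nan = [3.5, float("nan"), 4.0]  # one "cold start" prediction

clean_rmse = rmse(labels, clean)
nan_rmse = rmse(labels, with_nan)

print(clean_rmse)  # a finite number
print(nan_rmse)    # nan: the NaN propagates through the sum and the square root
```

Any arithmetic involving NaN yields NaN, so the sum of squared errors, the mean, and the square root all become NaN.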

We can inspect the NaN predictions through a temporary SQL view.

In [None]:
# createOrReplaceTempView lets this cell be re-run without a "view already exists" error
predictions.createOrReplaceTempView("predictions_table")

In [28]:
spark.sql("select * from predictions_table where isnan(prediction)").show(5)

+------+-------+------+----------+----------+
|userId|movieId|rating| timestamp|prediction|
+------+-------+------+----------+----------+
|    89|  99130|   3.5|1520409565|       NaN|
|   610|  27480|   4.5|1479545320|       NaN|
|   599|   4957|   2.0|1519144322|       NaN|
|   225|   3042|   3.0| 949111600|       NaN|
|   294|   3042|   2.0| 966597039|       NaN|
+------+-------+------+----------+----------+
only showing top 5 rows



To avoid NaN predictions, we can set `coldStartStrategy` to `"drop"`. The model then drops every row whose user or item was not seen during training, so no NaN reaches the evaluator.
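
The effect of dropping can be sketched in plain Python as a simple filter. The rows below are hypothetical and only mimic the shape of the predictions table; this is an illustration of the idea, not how Spark implements it.

```python
import math

# Hypothetical (userId, movieId, rating, prediction) rows;
# NaN marks a row whose user or item was unseen during training.
rows = [
    (1, 10, 4.0, 3.6),
    (2, 10, 3.0, float("nan")),
    (3, 20, 5.0, 4.4),
    (4, 30, 2.0, float("nan")),
]

# Keep only rows with a usable prediction, as the "drop" strategy would
kept = [r for r in rows if not math.isnan(r[3])]
dropped = len(rows) - len(kept)

print(f"kept {len(kept)} of {len(rows)} rows, dropped {dropped}")
```

The trade-off is that the metric is then computed on fewer rows and says nothing about how the model would serve new users or new items.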

In [40]:
als = ALS(maxIter=5,
          regParam=0.01,
          userCol='userId',
          itemCol='movieId',
          ratingCol='rating',
          coldStartStrategy="drop")

In [41]:
model = als.fit(train)

In [42]:
predictions = model.transform(test)
predictions.show(5)

+------+-------+------+----------+----------+
|userId|movieId|rating| timestamp|prediction|
+------+-------+------+----------+----------+
|    91|    471|   1.0|1112713817| 2.1595163|
|   372|    471|   3.0| 874415126| 3.2865875|
|   599|    471|   2.5|1498518822| 3.0892537|
|   182|    471|   4.5|1054779644| 4.2635202|
|   474|    471|   3.0| 974668858| 3.9824388|
+------+-------+------+----------+----------+
only showing top 5 rows



In [43]:
# createOrReplaceTempView lets this cell be re-run without a "view already exists" error
predictions.createOrReplaceTempView("predictions_table_new")

In [44]:
spark.sql("select * from predictions_table_new where isnan(prediction)").show(5)

+------+-------+------+---------+----------+
|userId|movieId|rating|timestamp|prediction|
+------+-------+------+---------+----------+
+------+-------+------+---------+----------+



### 5. Evaluating the model

In [45]:
evaluator = RegressionEvaluator(metricName='rmse',
                                labelCol='rating',
                                predictionCol='prediction')

In [46]:
rmse = evaluator.evaluate(predictions)
print('RMSE:', rmse)

RMSE: 1.0845095846134665
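
To make the metric concrete, the sketch below recomputes RMSE by hand in plain Python for the five prediction rows displayed earlier. It covers only those five rows, so its result is not expected to match the figure above, which is computed over the whole test split.

```python
import math

# (rating, prediction) pairs copied from the five rows shown by predictions.show(5)
pairs = [
    (1.0, 2.1595163),
    (3.0, 3.2865875),
    (2.5, 3.0892537),
    (4.5, 4.2635202),
    (3.0, 3.9824388),
]

# RMSE = sqrt(mean((rating - prediction)^2))
squared_errors = [(rating - pred) ** 2 for rating, pred in pairs]
sample_rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

print(round(sample_rmse, 4))
```

Because the errors are squared before averaging, a single large miss (such as the first row) weighs more heavily than several small ones.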


### 6. Implementing the model on sample users

In [47]:
# Let's select a certain user ID e.g. 12

this_user = test.filter(test['userId'] == 12).select('userId', 'movieId')
this_user.show()

+------+-------+
|userId|movieId|
+------+-------+
|    12|     39|
|    12|    256|
|    12|    543|
|    12|   1357|
|    12|   1405|
|    12|   2485|
+------+-------+



In [48]:
recommendation_this_user = model.transform(this_user)
recommendation_this_user.show()

+------+-------+----------+
|userId|movieId|prediction|
+------+-------+----------+
|    12|   2485| 3.1074028|
|    12|   1357| 4.4349504|
|    12|     39|  4.948274|
|    12|    543| 2.1941657|
|    12|   1405| 1.6225376|
|    12|    256| 3.1183014|
+------+-------+----------+



In [49]:
recommendation_this_user.orderBy('prediction', ascending=False).show()

+------+-------+----------+
|userId|movieId|prediction|
+------+-------+----------+
|    12|     39|  4.948274|
|    12|   1357| 4.4349504|
|    12|    256| 3.1183014|
|    12|   2485| 3.1074028|
|    12|    543| 2.1941657|
|    12|   1405| 1.6225376|
+------+-------+----------+



In [65]:
df.filter(df['movieId'] == 39).select('userId', 'movieId', 'rating').orderBy('rating', ascending=False).show(100)

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|   200|     39|   5.0|
|   506|     39|   5.0|
|   583|     39|   5.0|
|    33|     39|   5.0|
|   224|     39|   5.0|
|   602|     39|   5.0|
|   603|     39|   5.0|
|    58|     39|   5.0|
|   240|     39|   5.0|
|   468|     39|   5.0|
|   484|     39|   4.5|
|   594|     39|   4.5|
|   357|     39|   4.5|
|   525|     39|   4.5|
|   313|     39|   4.0|
|   414|     39|   4.0|
|    68|     39|   4.0|
|    12|     39|   4.0|
|   177|     39|   4.0|
|   275|     39|   4.0|
|   280|     39|   4.0|
|   347|     39|   4.0|
|   367|     39|   4.0|
|   409|     39|   4.0|
|   458|     39|   4.0|
|    64|     39|   4.0|
|   509|     39|   4.0|
|   512|     39|   4.0|
|   555|     39|   4.0|
|   566|     39|   4.0|
|   592|     39|   4.0|
|    95|     39|   4.0|
|   596|     39|   4.0|
|   111|     39|   4.0|
|   597|     39|   4.0|
|    56|     39|   4.0|
|   104|     39|   4.0|
|   121|     39|   4.0|
|   174|     39|

In [63]:
df.filter(df['userId'] == 414).select('userId', 'movieId', 'rating').orderBy('movieId').show(100)

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|   414|      1|   4.0|
|   414|      2|   3.0|
|   414|      3|   4.0|
|   414|      5|   2.0|
|   414|      6|   3.0|
|   414|      7|   3.0|
|   414|      8|   3.0|
|   414|     10|   3.0|
|   414|     11|   5.0|
|   414|     15|   2.0|
|   414|     16|   3.0|
|   414|     17|   4.0|
|   414|     18|   3.0|
|   414|     21|   4.0|
|   414|     22|   3.0|
|   414|     23|   2.0|
|   414|     24|   3.0|
|   414|     25|   3.0|
|   414|     27|   2.0|
|   414|     31|   3.0|
|   414|     32|   5.0|
|   414|     34|   5.0|
|   414|     36|   3.0|
|   414|     39|   4.0|
|   414|     42|   2.0|
|   414|     44|   2.0|
|   414|     45|   3.0|
|   414|     46|   2.0|
|   414|     47|   4.0|
|   414|     48|   3.0|
|   414|     50|   5.0|
|   414|     52|   3.0|
|   414|     54|   1.0|
|   414|     57|   3.0|
|   414|     62|   4.0|
|   414|     65|   2.0|
|   414|     71|   2.0|
|   414|     72|   4.0|
|   414|     75|

### 7. Fine-tuning the model to reduce RMSE

In [67]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

In [68]:
# Grid over rank, number of iterations and regularisation strength
param_grid = (ParamGridBuilder()
              .addGrid(als.rank, [5, 10, 15, 20])
              .addGrid(als.maxIter, [5, 10])
              .addGrid(als.regParam, [0.01, 0.05, 0.1, 0.15])
              .build())

evaluator = RegressionEvaluator(metricName="rmse",
                                labelCol="rating",
                                predictionCol="prediction")

# 5-fold cross-validation over every parameter combination
cv = CrossValidator(estimator=als,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=5)

print("Num models to be tested: ", len(param_grid))

Num models to be tested:  32
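
The count of 32 is simply the product of the grid sizes (4 ranks × 2 iteration settings × 4 regularisation values). A small plain-Python sketch reproduces it and also estimates the training cost; the variable names are illustrative and the lists mirror the grid defined above.

```python
from itertools import product

ranks = [5, 10, 15, 20]
max_iters = [5, 10]
reg_params = [0.01, 0.05, 0.1, 0.15]
num_folds = 5

# Every (rank, maxIter, regParam) combination is one candidate model
combos = list(product(ranks, max_iters, reg_params))

# Each candidate is trained once per fold during cross-validation
fits_during_cv = len(combos) * num_folds

print(len(combos), fits_during_cv)  # 32 160
```

With 5 folds this means 160 ALS fits before the best parameter set is refit on the full training data, which is why the next cell can take a while to run.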


In [69]:
modelcv = cv.fit(train)

In [70]:
best_model = modelcv.bestModel

In [71]:
test_predictions = best_model.transform(test)
rmse = evaluator.evaluate(test_predictions)
print(rmse)

0.8652763490211332
