## Movie Recommendation Model using Alternating Least Squares (ALS) ML algorithm

#### Requirements to run this notebook:
- Spark environment / Spark DataFrame API
- MovieLens dataset

In [0]:
%pip install findspark

In [0]:
## check Spark environment

import findspark
findspark.init()
findspark.find()

In [0]:
## Databricks runtime and Scala version

spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")
# Databricks Runtime Version: 8.3 (includes Apache Spark 3.1.1, Scala 2.12)

In [0]:
## SparkSession: pre-configured for Databricks Notebook by default (on Databricks Community Edition)

#from pyspark.sql import SparkSession
#spark = SparkSession.builder.appName("Databricks Shell").getOrCreate()
spark

#### Process Overview
- Step 1: Load MovieLens dataset and perform exploratory data analysis
- Step 2: Split movie ratings data into training and testing datasets
- Step 3: Build ALS machine learning model to predict users' movie ratings
- Step 4: Predict movie ratings for all users using the ALS model 
- Step 5: Check performance metrics of the model
- Step 6: Generate movie recommendations for existing users
- Step 7: Generate movie recommendations for a new user

### Step 1: Load MovieLens dataset and perform exploratory data analysis

In [0]:
movies = spark.read.format("csv") \
  .option("inferSchema", "true") \
  .option("header", "true") \
  .option("sep", ",") \
  .load("/FileStore/tables/ml_latest_small_movies.csv")

movies.show(5)

In [0]:
ratings = spark.read.format("csv") \
  .option("inferSchema", "true") \
  .option("header", "true") \
  .option("sep", ",") \
  .load("/FileStore/tables/ml_latest_small_ratings.csv")

ratings.show(5)

In [0]:
ratings = ratings.drop('timestamp')
ratings.show(5)

In [0]:
ratings.printSchema()

In [0]:
ratings.describe().show()
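
Collaborative filtering works on a user-by-movie rating matrix in which most cells are empty, so it can be useful to quantify that sparsity during exploratory analysis. The following is a minimal, Spark-free sketch on a toy list of `(userId, movieId, rating)` tuples. The function name `rating_sparsity` and the toy data are illustrative assumptions and are not part of the notebook.

```python
def rating_sparsity(rows):
    """Return the fraction of empty cells in the user x movie rating matrix."""
    users = {user_id for user_id, _, _ in rows}
    movies = {movie_id for _, movie_id, _ in rows}
    possible = len(users) * len(movies)
    # Count distinct (user, movie) pairs in case a pair is listed twice
    observed = len({(user_id, movie_id) for user_id, movie_id, _ in rows})
    return 1.0 - observed / possible


# Toy data (hypothetical): 3 users, 3 movies, 4 ratings
toy_ratings = [(1, 10, 4.0), (1, 20, 3.5), (2, 10, 5.0), (3, 30, 2.0)]
sparsity = rating_sparsity(toy_ratings)
print(f"Sparsity = {sparsity:.2%}")
```

On the Spark DataFrame, the same three counts could be taken from `ratings.count()`, `ratings.select("userId").distinct().count()` and `ratings.select("movieId").distinct().count()`.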

### Step 2: Split movie ratings data into training and testing datasets

In [0]:
train_df, test_df = ratings.randomSplit([0.8,0.2], seed = 1234)

### Step 3: Build ALS machine learning model to predict users' movie ratings

In [0]:
# Load packages and modules

from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [0]:
# Create ALS model

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", coldStartStrategy="drop", nonnegative=True)
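
To give an intuition for what the estimator does, here is a toy sketch of the alternating idea with a single latent factor: each rating is approximated by `x[user] * y[movie]`, and the two sets of factors are updated in turn with a closed-form ridge solution while the other set is held fixed. This is an illustration under simplifying assumptions (rank 1, a plain L2 penalty, pure Python) and is not Spark's implementation. The names `als_rank1` and `toy_ratings` are hypothetical.

```python
def als_rank1(ratings, n_iters=20, reg=0.1):
    """Fit rating ~ x[user] * y[item] by alternating ridge updates."""
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    x = {u: 1.0 for u in users}  # user factors
    y = {i: 1.0 for i in items}  # item factors
    for _ in range(n_iters):
        # Hold item factors fixed and solve for each user factor
        for u in users:
            num = sum(y[i] * r for uu, i, r in ratings if uu == u)
            den = reg + sum(y[i] ** 2 for uu, i, _ in ratings if uu == u)
            x[u] = num / den
        # Hold user factors fixed and solve for each item factor
        for i in items:
            num = sum(x[u] * r for u, ii, r in ratings if ii == i)
            den = reg + sum(x[u] ** 2 for u, ii, _ in ratings if ii == i)
            y[i] = num / den
    return x, y


# Toy data (hypothetical): user 2 rates every movie twice as high as user 1.
# The pair (user 1, movie 30) is left out on purpose.
toy_ratings = [
    (1, 10, 1.0), (1, 20, 2.0),
    (2, 10, 2.0), (2, 20, 4.0), (2, 30, 5.0),
]
x, y = als_rank1(toy_ratings)
pred = x[1] * y[30]  # predicted rating for the held-out pair
print(f"Predicted rating for user 1, movie 30: {pred:.2f}")
```

Because user 2 gave movie 30 a 5.0 and user 1 rates at roughly half of user 2's level, the prediction lands somewhat below 2.5, pulled down slightly by the regularization term.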

In [0]:
# Create a pipeline

pipeline = Pipeline(stages=[als])

In [0]:
# Create a param_grid of hyperparameters for tuning the model

param_grid = ParamGridBuilder()\
                    .addGrid(als.rank, [5, 25, 50])\
                    .addGrid(als.maxIter, [20])\
                    .addGrid(als.regParam, [0.01, 0.1])\
                    .build()

In [0]:
# Set model evaluation metric (evaluator) as RMSE

evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

In [0]:
# Build the model with cross validation setup

model_cv = CrossValidator(estimator=pipeline, estimatorParamMaps=param_grid,
                          evaluator=evaluator, numFolds=3)
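
The cost of the search grows with both the grid size and the number of folds. As a rough worked example, the sketch below enumerates the same hyperparameter combinations in plain Python and counts the model fits that a 3-fold cross-validation over them implies. The dictionary `grid_values` mirrors the values passed to `ParamGridBuilder` above; the variable names are illustrative.

```python
from itertools import product

# Same candidate values as the param_grid above
grid_values = {"rank": [5, 25, 50], "maxIter": [20], "regParam": [0.01, 0.1]}
num_folds = 3

# Cartesian product of all candidate values, one dict per combination
names = list(grid_values)
param_maps = [dict(zip(names, combo)) for combo in product(*grid_values.values())]

cv_fits = len(param_maps) * num_folds  # one fit per combination per fold
total_fits = cv_fits + 1               # plus the final refit of the best combination

for params in param_maps:
    print(params)
print(f"{len(param_maps)} combinations x {num_folds} folds = {cv_fits} fits (+1 refit)")
```

Each of these fits runs up to `maxIter` ALS iterations, which is why the grid is kept small.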

In [0]:
# Fit the model with train_df

model = model_cv.fit(train_df)

In [0]:
# Check best model hyperparameters

bestModel = model.bestModel.stages[0]

bestModel_rank = bestModel.rank
bestModel_MaxIter = bestModel._java_obj.parent().getMaxIter()
bestModel_RegParam = bestModel._java_obj.parent().getRegParam()

print("Best Model Hyperparameters:")
print(f" rank = {bestModel_rank}")
print(f" MaxIter = {bestModel_MaxIter}")
print(f" RegParam = {bestModel_RegParam}")

### Step 4: Predict movie ratings for all users using the ALS model

In [0]:
# Generate predictions for all user ratings using the best ALS model

pred_df = bestModel.transform(test_df)

pred_df.show(5)
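
Because the ALS estimator was created with `coldStartStrategy="drop"`, `pred_df` can contain fewer rows than `test_df`: test rows whose user or movie never appears in the training split cannot be scored and are dropped. The sketch below shows that check on toy tuples in plain Python. The function name `cold_start_rows` and the data are hypothetical.

```python
def cold_start_rows(train_rows, test_rows):
    """Return test rows whose user or movie is absent from the training rows."""
    train_users = {user_id for user_id, _, _ in train_rows}
    train_movies = {movie_id for _, movie_id, _ in train_rows}
    return [
        row for row in test_rows
        if row[0] not in train_users or row[1] not in train_movies
    ]


# Toy (userId, movieId, rating) tuples
toy_train = [(1, 10, 4.0), (1, 20, 3.0), (2, 10, 5.0)]
toy_test = [(2, 20, 4.5), (3, 10, 2.0), (1, 30, 3.5)]

unscorable = cold_start_rows(toy_train, toy_test)
print(unscorable)  # user 3 and movie 30 are unseen in training
```

Comparing `pred_df.count()` with `test_df.count()` would show how many rows were affected in the actual split.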

In [0]:
# Sort pred_df by "userId" and "rating" in descending order

pred_df.sort("userId","rating", ascending=False).limit(5).show()

### Step 5: Check performance metrics of the model

In [0]:
# Check RMSE evaluation metric for the model predictions

rmse = evaluator.evaluate(pred_df)
print(f"RMSE = {rmse:.4f}")
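
RMSE is the square root of the mean squared difference between actual and predicted ratings, so it is expressed in the same units as the ratings themselves. A small worked example in plain Python, with made-up `(rating, prediction)` pairs and an illustrative helper name `compute_rmse`:

```python
import math


def compute_rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical (rating, prediction) pairs; errors are 0.5, 0.0 and 1.0
toy_pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0)]
toy_rmse = compute_rmse(toy_pairs)
print(f"RMSE = {toy_rmse:.4f}")
```

Squaring the errors means the single 1.0-star miss contributes four times as much as the 0.5-star miss.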

#### Re-run Step 3: Build ALS machine learning model to predict users' movie ratings (using updated sets of hyperparameters)

In [0]:
# Create ALS model
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", coldStartStrategy="drop", nonnegative=True)

# Create a pipeline
pipeline = Pipeline(stages=[als])

# Create a param_grid of hyperparameters for tuning the model
param_grid = ParamGridBuilder()\
                    .addGrid(als.rank, [40, 50, 60])\
                    .addGrid(als.maxIter, [40])\
                    .addGrid(als.regParam, [0.1])\
                    .build()

# Set model evaluation metric (evaluator) as RMSE
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

# Build the model with cross validation setup
model_cv = CrossValidator(estimator=pipeline, estimatorParamMaps=param_grid,
                          evaluator=evaluator, numFolds=3)

# Fit the model with train_df
model = model_cv.fit(train_df)

In [0]:
# Check best model hyperparameters

bestModel = model.bestModel.stages[0]

bestModel_rank = bestModel.rank
bestModel_MaxIter = bestModel._java_obj.parent().getMaxIter()
bestModel_RegParam = bestModel._java_obj.parent().getRegParam()

print("Best Model Hyperparameters:")
print(f" rank = {bestModel_rank}")
print(f" MaxIter = {bestModel_MaxIter}")
print(f" RegParam = {bestModel_RegParam}")

#### Re-run Step 4: Predict movie ratings for all users using the ALS model (using the updated Best Model Hyperparameters)

In [0]:
# Generate predictions for all user ratings using the best ALS model

pred_df = bestModel.transform(test_df)

pred_df.show(5)

In [0]:
# Sort pred_df by "userId" and "rating" in descending order

pred_df.sort("userId","rating", ascending=False).limit(5).show()

#### Re-run Step 5: Check performance metrics of the model (using the updated Best Model Hyperparameters)

In [0]:
# Check RMSE evaluation metric for the model predictions

rmse = evaluator.evaluate(pred_df)
print(f"RMSE = {rmse:.4f}")

### Step 6: Generate movie recommendations for existing users

In [0]:
# Generate top 10 movie recommendations for all users

recommendations_all_10 = bestModel.recommendForAllUsers(10)
recommendations_all_10.show(5)

In [0]:
# Check top 10 movie recommendations for userID = 1

recommendations_all_10.filter(recommendations_all_10.userId == 1).show()

In [0]:
# Unpack the list of top 10 movieIds from recommendations column using <explode> function

recommendations_user1 = recommendations_all_10.filter(recommendations_all_10.userId == 1)

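# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is its replacement
recommendations_user1.createOrReplaceTempView("ALS_recommendations_user1")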
 
recommendations_user1_exploded = spark.sql("SELECT userId, recommendation.movieId AS movieId,\
                               recommendation.rating AS pred_rating \
                               FROM ALS_recommendations_user1 \
                               LATERAL VIEW explode(recommendations) exploded_table \
                               AS recommendation")

recommendations_user1_exploded.show() 
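
The reshaping done by `LATERAL VIEW explode(...)` turns one row per user, holding a list of `(movieId, rating)` pairs, into one row per recommended movie. The sketch below mimics that on plain Python data to make the before and after shapes explicit. The function name `explode_recommendations` and the toy rows are hypothetical.

```python
def explode_recommendations(rows):
    """Flatten (userId, [(movieId, rating), ...]) rows into one row per movie."""
    return [
        {"userId": user_id, "movieId": movie_id, "pred_rating": rating}
        for user_id, recommendations in rows
        for movie_id, rating in recommendations
    ]


# Toy rows shaped like the output of recommendForAllUsers
toy_rows = [(1, [(50, 4.9), (60, 4.7)]), (2, [(70, 4.8)])]
flat = explode_recommendations(toy_rows)
for row in flat:
    print(row)
```

The same reshaping is also available without SQL through `pyspark.sql.functions.explode`.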

In [0]:
# Add movie info (movie title and genres) to top 10 movie recommendations data (with movieIds and ratings)

recommendations_user1_exploded.join(movies, ["movieId"], "left").show()

### Step 7: Generate movie recommendations for a new user

#### Step 7.1: Add new user data to the existing ratings data

In [0]:
# Let's assume a new user's favorite movies are the following:
# Toy Story (1995), Lion King, The (1994), Shrek (2001), Finding Nemo (2003) 
# which can be converted as a list of movieIds as below:
new_user_movieIds = [1, 364, 4306, 6377]

# set new userId as the highest userId value among the existing users + 1
new_userId = ratings.agg({"userId": "max"}).first()[0] + 1
rating_max = 5.0

# Create a list of new user's movie rating data
new_user_ratings = [(new_userId, movieId, rating_max) for movieId in new_user_movieIds]

# ratings.columns = ['userId', 'movieId', 'rating']

# Create a new dataframe that contains new user's ratings
new_user_df = spark.createDataFrame(new_user_ratings, ratings.columns)
new_user_df.show()

In [0]:
# Add new_user_df to the existing ratings data

ratings_all_new = ratings.union(new_user_df)

ratings_all_new.filter(ratings_all_new.userId == new_userId).show()

#### Step 7.2: Split movie ratings data into training and testing datasets

In [0]:
train_df, test_df = ratings_all_new.randomSplit([0.8,0.2], seed = 1234)

#### Step 7.3: Build ALS machine learning model to predict users' movie ratings

In [0]:
## Rebuild ALS model with updated ratings datasets that include new user's ratings

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", coldStartStrategy="drop", nonnegative=True)

pipeline = Pipeline(stages=[als])

# param_grid contains only the bestModel hyperparameters that were obtained above
param_grid = ParamGridBuilder()\
                    .addGrid(als.rank, [bestModel_rank])\
                    .addGrid(als.maxIter, [bestModel_MaxIter])\
                    .addGrid(als.regParam, [bestModel_RegParam])\
                    .build()

# Set model evaluation metric (evaluator) as RMSE
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

# Build the model with cross validation setup
model_cv = CrossValidator(estimator=pipeline, estimatorParamMaps=param_grid,
                          evaluator=evaluator, numFolds=3)

# Fit the model with train_df
model = model_cv.fit(train_df)

#### Step 7.4: Predict movie ratings for all users using the ALS model

In [0]:
# Generate predictions for all user ratings using the new ALS model

pred_df = model.transform(test_df)

pred_df.show(5)

In [0]:
# Sort pred_df by "userId" and "rating" in descending order

pred_df.sort("userId","rating", ascending=False).limit(5).show()

In [0]:
# Check best model hyperparameters, which should remain the same as given above

bestModel = model.bestModel.stages[0]

bestModel_rank = bestModel.rank
bestModel_MaxIter = bestModel._java_obj.parent().getMaxIter()
bestModel_RegParam = bestModel._java_obj.parent().getRegParam()

print("Best Model Hyperparameters:")
print(f" rank = {bestModel_rank}")
print(f" MaxIter = {bestModel_MaxIter}")
print(f" RegParam = {bestModel_RegParam}")

#### Step 7.5: Check performance metrics of the model

In [0]:
# Check RMSE evaluation metric for the model predictions

rmse = evaluator.evaluate(pred_df)
print(f"RMSE = {rmse:.4f}")

#### Step 7.6: Generate movie recommendations for existing users

In [0]:
# Generate top 10 movie recommendations for all users

recommendations_all_10 = bestModel.recommendForAllUsers(10)
recommendations_all_10.show(5)

#### Step 7.7: Retrieve movie recommendations for the new user

In [0]:
# Unpack the list of top 10 movieIds from recommendations column using <explode> function

recommendations_new_user = recommendations_all_10.filter(recommendations_all_10.userId == new_userId)

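# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is its replacement
recommendations_new_user.createOrReplaceTempView("ALS_recommendations_new_user")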
 
recommendations_new_user_exploded = spark.sql("SELECT userId, recommendation.movieId AS movieId,\
                               recommendation.rating AS pred_rating \
                               FROM ALS_recommendations_new_user \
                               LATERAL VIEW explode(recommendations) exploded_table \
                               AS recommendation")

recommendations_new_user_exploded.show() 

In [0]:
# Add movie info (movie title and genres) to top 10 movie recommendations data (with movieIds and ratings)

recommendations_new_user_exploded.join(movies, ["movieId"], "left").toPandas()

| | movieId | userId | pred_rating | title | genres |
|---|---|---|---|---|---|
| 0 | 67504 | 672 | 5.357431 | Land of Silence and Darkness (Land des Schweig... | Documentary |
| 1 | 83411 | 672 | 5.357431 | Cops (1922) | Comedy |
| 2 | 83318 | 672 | 5.357431 | Goat, The (1921) | Comedy |
| 3 | 527 | 672 | 5.131235 | Schindler's List (1993) | Drama\|War |
| 4 | 59684 | 672 | 5.086157 | Lake of Fire (2006) | Documentary |
| 5 | 54328 | 672 | 5.074544 | My Best Friend (Mon meilleur ami) (2006) | Comedy |
| 6 | 31435 | 672 | 5.051767 | Rory O'Shea Was Here (Inside I'm Dancing) (2004) | Drama |
| 7 | 1148 | 672 | 4.985118 | Wallace & Gromit: The Wrong Trousers (1993) | Animation\|Children\|Comedy\|Crime |
| 8 | 4306 | 672 | 4.983896 | Shrek (2001) | Adventure\|Animation\|Children\|Comedy\|Fantasy\|Ro... |
| 9 | 905 | 672 | 4.95656 | It Happened One Night (1934) | Comedy\|Romance |


In [0]:
Top10_movies = recommendations_new_user_exploded.join(movies, ["movieId"], "left").select("title").toPandas()

In [0]:
Top10_movies

| | title |
|---|---|
| 0 | Land of Silence and Darkness (Land des Schweig... |
| 1 | Cops (1922) |
| 2 | Goat, The (1921) |
| 3 | Schindler's List (1993) |
| 4 | Lake of Fire (2006) |
| 5 | My Best Friend (Mon meilleur ami) (2006) |
| 6 | Rory O'Shea Was Here (Inside I'm Dancing) (2004) |
| 7 | Wallace & Gromit: The Wrong Trousers (1993) |
| 8 | Shrek (2001) |
| 9 | It Happened One Night (1934) |
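
The top 10 for the new user includes Shrek (2001), movieId 4306, which is one of the four movies the new user was assumed to have rated already. `recommendForAllUsers` ranks movies by predicted rating and does not skip ones a user has rated, so a post-filter is one possible way to keep only unseen titles. A minimal sketch in plain Python, using the movieIds from the output above; the helper name `drop_already_rated` is hypothetical.

```python
def drop_already_rated(recommended_ids, rated_ids):
    """Keep recommended movieIds, in order, that the user has not rated."""
    rated = set(rated_ids)
    return [movie_id for movie_id in recommended_ids if movie_id not in rated]


# The new user's seed movies from Step 7.1
new_user_movieIds = [1, 364, 4306, 6377]
# movieIds of the top 10 recommendations shown above, in ranked order
top10_ids = [67504, 83411, 83318, 527, 59684, 54328, 31435, 1148, 4306, 905]

unseen_ids = drop_already_rated(top10_ids, new_user_movieIds)
print(unseen_ids)
```

Since the filter can shorten the list, requesting a few more than 10 recommendations up front would leave room to still return 10 unseen movies.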
