# Collaborative Recommendations Engine (Part 2/3)

> This notebook is part of a series that walks you through the process of building a good collaborative recommendations engine (including the mistakes we made along the way). The series is split into three parts. If you haven't already, we recommend reading the first part of the series before continuing with this one, as we won't repeat the same explanations.

- Part 1: Our Attempt at Building an Item-Item Collaborative Recommendations Engine
- **Part 2: Fixing our Item-Item Collaborative Recommendations Engine**
- Part 3: Improving our Collaborative Recommendations Engine by leveraging techniques other than Item-Item Collaborative Filtering...

## Part 2: Fixing our Item-Item Collaborative Recommendations Engine
In Part 1, we laid the foundation of our system and ran into difficulties predicting values efficiently for large datasets.

### Step 1: Importing the necessary libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as spark_sum, udf, row_number, expr, coalesce, lit
from pyspark.sql.types import DoubleType
from pyspark.sql import Window

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import  Vectors
from pyspark.ml.evaluation import RegressionEvaluator


### Step 2: Spark Configuration/Setup
As in Part 1, we will be using Spark.

Throughout this program, we will also need user-defined functions (UDFs) to apply our own functions to DataFrame columns. They are implemented and registered in the following code block.

In [2]:
# Spark Initialization/Setup

spark = SparkSession.builder \
    .appName("recommenderTest") \
    .config("spark.some.config.option", "some-value") \
    .config("spark.executor.memory", "7g") \
    .config("spark.driver.memory", "7g") \
    .config("spark.sql.shuffle.partitions", "32") \
    .config("spark.sql.pivotMaxValues", "20000") \
    .config("spark.master", "local[*]") \
    .config("spark.sql.codegen.wholeStage", "false") \
    .getOrCreate()

spark.sparkContext.setLogLevel("FATAL")

# -------------- UDF Helper Functions --------------

# Cosine Similarity Measure Calculation between two feature vectors
# Note: We won't be using Pearson anymore. More details will follow on why.
def co_sym(x, y):
    return float(x.dot(y) / (Vectors.norm(x, 2) * Vectors.norm(y, 2)))

# -------------- UDF Registration --------------
# Registered under the name "dot_udf" so it can be called from SQL expressions (selectExpr)
dot_udf = udf(co_sym, DoubleType())
spark.udf.register("dot_udf", dot_udf)

<function __main__.co_sym(x, y)>

### Step 3: Data Preparation/Loading
Data preparation is identical to Part 1: nothing is required except dropping the timestamp column.

For this notebook, we will be using ratings_small.csv to validate our method. ratings_small.csv is part of the MovieLens dataset (the file format is identical to the previous part). It contains 100,004 ratings and 3,671 tag applications across 9,742 movies.

In [3]:
df = spark.read.csv("data/ratings_small.csv", header=True, inferSchema=True)
df = df.drop("timestamp")

### Step 4: Data Modeling
This step is crucial to building our recommender: it is where we build our item-item collaborative filtering model. We will follow these steps:
1. Create the utility matrix, where the rows are the movies and the columns are the users (in our code: this is the pivoted `df`)
2. Calculate the similarity between each pair of movies $i$ and $j$ (in our code: this is represented as similarity_matrix) using the following formula (Cosine Similarity Measure), where $r_{u,i}$ is the rating given by user $u$ to movie $i$:
$$similarity = \frac{\sum_{u \in U} (r_{u,i})(r_{u,j})}{\sqrt{\sum_{u \in U} (r_{u,i})^2} \sqrt{\sum_{u \in U} (r_{u,j})^2}}$$
3. Build a new matrix which, for each user, contains every rated movie paired with all other movies and their similarity values (in our code: this is represented as df_user_movie_rating). This matrix will be used to predict the ratings for each user
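
To make steps 1 and 2 concrete, here is a minimal plain-Python sketch (no Spark) on a tiny, made-up set of ratings. The helper names `build_item_vectors` and `cosine_similarity` and the sample data are illustrative assumptions, not part of the notebook. Missing ratings are filled with 0, as the notebook does with `fillna(0)`; the zero-norm guard is an extra safety check that the notebook's `co_sym` does not have.

```python
import math

# Hypothetical toy ratings: (userId, movieId, rating)
ratings = [
    (1, 10, 4.0), (1, 20, 5.0),
    (2, 10, 2.0), (2, 30, 3.0),
    (3, 20, 4.0), (3, 30, 1.0),
]

def build_item_vectors(rows):
    """Pivot rating triples into one vector per movie (one slot per user, 0 = unrated)."""
    users = sorted({user for user, _, _ in rows})
    index = {user: k for k, user in enumerate(users)}
    vectors = {}
    for user, movie, rating in rows:
        vectors.setdefault(movie, [0.0] * len(users))[index[user]] = rating
    return vectors

def cosine_similarity(x, y):
    """Cosine of the angle between two rating vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

vectors = build_item_vectors(ratings)
print(vectors[10], vectors[20])
print(cosine_similarity(vectors[10], vectors[20]))
```

Only users who rated both movies contribute to the numerator, while every rating contributes to the norms, which is why two movies with few raters in common end up with a low similarity.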

**Note**: 
In this notebook, our model implementation differs from Part 1 in two ways:
- We split our dataset into a training set (80%) and a testing set (20%). We were very careful to ensure that the ratings in the testing set are not in the training set (nor in any similarity values computed from it). This ensures that we evaluate the model on data it has not seen, so an overfitted model is not rewarded.
- As previously hinted, we use the Cosine Similarity Measure to calculate the similarity between each pair of movies, because the Pearson Correlation Measure is not suitable for the way we predict values: with Pearson, some predicted values fell outside the 0-5 range, which is not desirable. The Cosine Similarity Measure is also more suitable for sparse datasets.
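
A small sketch of why the choice of similarity measure affects the range of predictions. It assumes the similarity-weighted average used for prediction in Step 5; the function name `weighted_average` and the numbers are made up for illustration. Cosine similarity on non-negative rating vectors is never negative, so the prediction stays between the lowest and highest rating used. Pearson can be negative, and a single negative weight is enough to push the result off the scale.

```python
def weighted_average(ratings, similarities):
    """Similarity-weighted average of ratings (same form as the prediction formula)."""
    weighted_sum = sum(r * s for r, s in zip(ratings, similarities))
    return weighted_sum / sum(similarities)

ratings = [5.0, 1.0]

# Non-negative weights (what cosine yields on non-negative ratings):
# the result is a blend of the ratings and stays between 1 and 5.
in_range = weighted_average(ratings, [0.9, 0.5])

# One negative weight (possible with Pearson): the denominator shrinks
# and the result escapes the rating scale.
out_of_range = weighted_average(ratings, [0.9, -0.5])

print(in_range, out_of_range)
```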

In [4]:
# -------------- Data Initialization --------------
# Data Splitting (80% training, 20% testing)
training, test = df.randomSplit([0.8, 0.2])
# DF to be used for Training (easier to use another variable name for testing purposes)
df = training
# DF to be used for User-Movie-Rating Matrix (will be used later on - Setting initial state for now...)
df_user_movie_rating = df

# -------------- Building Similarity Matrix --------------
df = df.groupBy("movieId").pivot("userId").agg({"rating": "first"}).fillna(0)
df = df.sort("movieId")

# Build Vector Columns from DF Matrix
assembler = VectorAssembler(inputCols=df.columns[1:], outputCol="features")
df_vector = assembler.transform(df).select('movieId', 'features')
df_vector = df_vector.repartition(10)

# Compute Cosine Similarity Measure to fill data into Similarity Matrix
similarity_matrix = df_vector.alias("a").crossJoin(df_vector.alias("b")) \
    .where("a.movieId != b.movieId") \
    .selectExpr("a.movieId as movieId", "b.movieId as movieId_1",
                "dot_udf(a.features, b.features) as similarity")
similarity_matrix.show(10, 10)

# Build User-Movie-Rating Matrix (where for each user, we have all the movies combinations with the similarity values)
df_user_movie_rating = df_user_movie_rating.join(
    similarity_matrix,
    df_user_movie_rating.movieId == similarity_matrix.movieId,
    how='left'
).drop(similarity_matrix.movieId)
# The similar movie's id is already exposed as "movieId_1" (see the selectExpr above),
# so no column rename is needed here.
df_user_movie_rating.show(10)

                                                                                

+-------+---------+----------+
|movieId|movieId_1|similarity|
+-------+---------+----------+
|   1393|    72356|       0.0|
|   1393|     1875|0.13639...|
|   1393|    95508|0.11356...|
|   1393|     1327|0.13387...|
|   1393|    59022|0.16576...|
|   1393|      569|0.10198...|
|   1393|    26680|0.05962...|
|   1393|     2796|0.07571...|
|   1393|     8923|0.06602...|
|   1393|     6754|0.12117...|
+-------+---------+----------+
only showing top 10 rows



+------+-------+------+---------+-------------------+
|userId|movieId|rating|movieId_1|         similarity|
+------+-------+------+---------+-------------------+
|     1|   1343|   2.0|     1393|0.25837203347039406|
|     1|   1343|   2.0|    72356|                0.0|
|     1|   1343|   2.0|     1875|0.04559607525875532|
|     1|   1343|   2.0|    95508|0.23728949893812476|
|     1|   1343|   2.0|     1327| 0.2191986497404764|
|     1|   1343|   2.0|    59022|0.14513629044340784|
|     1|   1343|   2.0|      569|0.04447007244699521|
|     1|   1343|   2.0|    26680|0.11389895949029989|
|     1|   1343|   2.0|     2796|0.19774124911510396|
|     1|   1343|   2.0|     8923| 0.1261172499428486|
+------+-------+------+---------+-------------------+
only showing top 10 rows



                                                                                

### Step 5: Data Prediction
Now that we have a dataframe containing all the possible movie similarity combinations for each user, we can use it to predict (*for real, this time...*) the ratings for each user. We need to be able to predict ratings in order to evaluate our model. Here are the steps we will follow:
1. Join the Test Set with the User-Movie Rating Matrix (we will then have all the possible movie similarity combinations for each user in the test set)
2. Get top-N similar movies (Default=2) for each user (where N is a parameter that we can tune - more details on this later)
3. Calculate the predicted rating (in a Dataframe) for each user $u$ and test movie $j$ using the following formula, where $i$ runs over the top-N movies most similar to $j$ that the user has rated and $r_{u,i}$ is the user's rating for movie $i$:
$$predictedRating_{u,j} = \frac{\sum_{i=1}^{N} (similarity_{i,j})(r_{u,i})}{\sum_{i=1}^{N} (similarity_{i,j})}$$
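
The steps above can be sketched in plain Python for a single (user, movie) pair. This is an illustration under assumptions, not the notebook's Spark code: `predict_rating` and the sample neighbours are hypothetical, and the fallback to 0.0 imitates the `coalesce(..., lit(0.0))` used in the cell below when no prediction can be computed.

```python
def predict_rating(neighbours, n):
    """Predict a rating from (similarity, rating) pairs of movies the user already rated.

    Keeps the n most similar movies and returns their similarity-weighted average.
    """
    top_n = sorted(neighbours, key=lambda pair: pair[0], reverse=True)[:n]
    total_similarity = sum(similarity for similarity, _ in top_n)
    if total_similarity == 0:
        return 0.0  # assumption: same fallback value as the notebook's coalesce
    weighted_sum = sum(similarity * rating for similarity, rating in top_n)
    return weighted_sum / total_similarity

# Hypothetical neighbours of one test movie for one user: (similarity, rating)
neighbours = [(0.5, 4.0), (0.1, 1.0), (0.3, 2.0), (0.0, 5.0)]
print(predict_rating(neighbours, n=2))
```

With `n=2`, only the pairs with similarity 0.5 and 0.3 are used, so the two least similar movies have no influence on the prediction.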

In [5]:
# N-Value (Top N Similar Movies to be used for Prediction) - 16 for this run (other values are compared in Step 7)
nValue = 16

# Create DF Alias To Use In Prediction Computations
test_df_alias = test.alias("tst")
sm_alias = df_user_movie_rating.alias("sm")

# Window Specification to help with Ranking Top N Similar Movies
window_spec = Window.partitionBy("tst.userId", "tst.movieId").orderBy(col("sm.similarity").desc())

# Join Similarity Matrix with Test DF to get the top N similar movies for each user-movie pair in the test set
joined_df = (test_df_alias
             .join(sm_alias, (col("tst.userId") == col("sm.userId")) & (col("tst.movieId") == col("sm.movieId_1")))
             .withColumn("rank", row_number().over(window_spec))
             .filter(col("rank") <= nValue))

# Compute Weighted Rating
weighted_df = joined_df.withColumn("weighted_rating", col("sm.rating") * col("sm.similarity"))

# Compute Predicted Rating (and save it in Result DF - where we are storing: userId, movieId, rating, predicted_rating)
result_df = (weighted_df
             .groupBy("tst.userId", "tst.movieId", "tst.rating")
             .agg(spark_sum("weighted_rating").alias("sum_weighted_rating"),
                  spark_sum("sm.similarity").alias("sum_similarity"))
             .withColumn("predicted_rating", expr("sum_weighted_rating / sum_similarity"))
             .drop("sum_weighted_rating", "sum_similarity")
             # Fall back to 0.0 when the prediction is null
             .withColumn("predicted_rating", coalesce(col("predicted_rating"), lit(0.0))))
result_df.show(10)

+------+-------+------+------------------+
|userId|movieId|rating|  predicted_rating|
+------+-------+------+------------------+
|     2|    589|   5.0| 3.616155322230661|
|     3|   1378|   4.0|3.4536925810355763|
|     4|   1194|   5.0| 4.168033799934364|
|     4|   1219|   5.0| 4.821456884742036|
|     4|   1344|   5.0|  4.81944174326345|
|     4|   1377|   3.0| 4.221582692338619|
|     4|   2902|   2.0| 3.745014821592258|
|     5|   4963|   3.0| 3.705112158996253|
|     6|   1747|   2.0| 2.814214525105227|
|     6|   2502|   3.5|3.0026333945756085|
+------+-------+------+------------------+
only showing top 10 rows



                                                                                

### Step 6: Model Evaluation
Great! We have a model that can predict ratings for each user. Now, we need to evaluate it. We will use the Root Mean Squared Error (RMSE), computed on the predictions for our test set. RMSE measures how far the predicted values are from the actual values on average. The lower the RMSE, the better our model is. The formula is as follows, where $y_i$ is the actual rating, $\hat{y}_i$ the predicted rating and $n$ the number of predictions:
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^2}$$
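
As a quick sanity check of the metric, here is a minimal plain-Python version of RMSE, computed as the square root of the mean squared error. The function name `rmse` and the three sample ratings are made up for illustration; the notebook itself relies on `RegressionEvaluator`.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean of the squared differences between actual and predicted values."""
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings: errors are 1, 0 and -2, so the squared errors are 1, 0 and 4
actual = [5.0, 4.0, 3.0]
predicted = [4.0, 4.0, 5.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, the single error of 2 weighs four times as much as the error of 1, so RMSE penalizes large misses more than small ones.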

**Note**: As previously mentioned, we made sure not to contaminate our test set with data that is in the training set. This ensures that an overfitted model is not rewarded.

In [6]:
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="predicted_rating")
rmse = evaluator.evaluate(result_df)
print("Root Mean Squared Error (RMSE) on test data = {:.4f}".format(rmse))

spark.stop() # Stop Spark Session

                                                                                

Root Mean Squared Error (RMSE) on test data = 0.8938


### Step 7: Model Performance
Now that we have run our model, we can evaluate its performance (by looking at the RMSE and the time needed to run the model) while tuning its parameters (in our case: the Top-N value used in the prediction). In order to have a better overview, we ran our model 4 times with different values of Top-N (2, 4, 6, 16). Here are the results:

| Test | Top-N | RMSE Value | Time |
|--|--|--|--|
| Test 1 | 2 | 1.0185 | 42.523 min |
| Test 2 | 4 | 0.9362 | 41.975 min |
| Test 3 | 6 | 0.8972 | 42.73 min |
| Test 4 | 16 | 0.8938 | 43.02 min |

Looking at the results, we can see that Top-N=6 is the best value for our model: it has a low RMSE (0.8972) while being faster than Top-16 (42.73 min vs. 43.02 min). If you compare it with the Performance Scale shown in the slides, our model performs exactly as expected for a basic collaborative filtering technique.

For now, we won't use other evaluation techniques, because we aren't satisfied with the speed of our model. We will look into improving our model in Part 3, and from there we will be able to evaluate it using techniques other than just RMSE.

### Step 8: Now What?
Although we have a model that can predict ratings for each user, it still has some issues.

> TL;DR Parallelization itself doesn't guarantee optimal performance. You must also choose the right algorithm for the job.

1. **Model is slow**: Implementing item-item filtering this way, without factorization/KNN techniques, isn't very efficient because it requires managing a lot of large dataframe conversions. We use many join operations (to build the similarity matrix, to predict and to evaluate), which are very costly on a single machine. Our code would be much faster on a distributed system (with multiple nodes using `Spark broadcasting`), thanks to our Spark code! The issue is not that we aren't using Spark's parallelization, but that we are using inefficient methods to build our model.
2. **Model is not very accurate**: We are using a very simple model (item-item collaborative filtering) without any other techniques to improve it (adding biases, normalization, etc. would simply make our system even slower). We will use other techniques in the next part of the series to see if we can improve our model.
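
To give a rough sense of scale for the first issue, the sketch below counts the rows produced by a cross join that only excludes self-pairs, as the similarity step in Step 4 does. The function names are hypothetical, and the movie count is the one quoted in Step 3; the training split may contain fewer movies, so treat the result as an upper-bound estimate rather than a measurement.

```python
def ordered_pairs(n_items):
    """Rows produced by a cross join that only drops self-pairs (a != b)."""
    return n_items * (n_items - 1)

def unordered_pairs(n_items):
    """Distinct pairs if each (a, b) / (b, a) couple were computed only once."""
    return n_items * (n_items - 1) // 2

n_movies = 9742  # movie count quoted in Step 3 (assumption: all of them appear in training)
print(ordered_pairs(n_movies), unordered_pairs(n_movies))
```

Since cosine similarity is symmetric, half of the ordered pairs carry duplicate values. Each of these similarity rows is then joined against the user ratings, which multiplies the amount of data again.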

We will be fixing these issues in the next part of the series.
