# Collaborative Recommendations Engine (Part 3/3)

> This notebook is part of a three-part series that walks through building a good collaborative recommendations engine, including the mistakes we made along the way. If you haven't already, we recommend reading the first two parts before continuing with this one, as we won't repeat their explanations here.

- Part 1: Our Attempt at Building an Item-Item Collaborative Recommendations Engine
- Part 2: Fixing our Item-Item Collaborative Recommendations Engine
- **Part 3: Improving our Collaborative Recommendations Engine by leveraging other techniques than Item-Item Collaborative Filtering...**

## Part 3: Improving our Collaborative Recommendations Engine by leveraging other techniques...
In Part 2, we ended up with a working model, but its performance was not optimal. In this notebook, we look at other techniques used in the industry to build a better model.

Although our previous code used Spark and parallelization, our algorithm wasn't optimized for large datasets because of its cross joins and many DataFrame manipulations. We realized in Part 2 that our method isn't frequently used in the industry. **We made a mistake.** It is mostly used for small datasets or for educational purposes. We need to think bigger and use a more scalable and efficient solution, such as:

- _ANN (Approximate Nearest Neighbors)_, which finds similar items in a large dataset by trading a little accuracy for much faster lookups.
- _KNN (K-Nearest Neighbors)_, which finds the k most similar items exactly, at a higher search cost.
- _SVD (Singular Value Decomposition)_, a matrix factorization technique that finds latent factors in a large dataset.
- _ALS (Alternating Least Squares)_, also a matrix factorization technique that finds latent factors, fitted by alternately solving for the user factors and the item factors.

In our case, we believe it makes the most sense to use ALS, since it is a latent-factor matrix factorization technique that we learned in class. We will test it with item-item and user-item approaches, using parameter tuning and biases.
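
To make the "alternating" idea concrete, here is a minimal sketch of ALS in plain Python, restricted to a single latent factor (rank 1) so that each least-squares step has a closed-form scalar solution. The toy ratings, the function names, and the plain `reg` penalty are assumptions made for illustration; Spark's `ALS` is a distributed implementation with a higher rank, and its details differ.

```python
# Toy rank-1 ALS: every user and every item gets a single latent number.
# Data and hyperparameters are made up for illustration only.

def als_rank1(ratings, n_iters=20, reg=0.1):
    """ratings: dict mapping (user, item) -> rating. Returns (user_f, item_f)."""
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}

    for _ in range(n_iters):
        # Hold item factors fixed; solve the regularized least squares per user.
        for u in users:
            num = sum(r * item_f[i] for (uu, i), r in ratings.items() if uu == u)
            den = reg + sum(item_f[i] ** 2 for (uu, i) in ratings if uu == u)
            user_f[u] = num / den
        # Hold user factors fixed; solve per item.
        for i in items:
            num = sum(r * user_f[u] for (u, ii), r in ratings.items() if ii == i)
            den = reg + sum(user_f[u] ** 2 for (u, ii) in ratings if ii == i)
            item_f[i] = num / den
    return user_f, item_f


def squared_error(ratings, user_f, item_f):
    """Sum of squared errors over the observed ratings only."""
    return sum((r - user_f[u] * item_f[i]) ** 2 for (u, i), r in ratings.items())


toy_ratings = {
    (1, "A"): 5.0, (1, "B"): 3.0,
    (2, "A"): 4.0, (2, "C"): 2.0,
    (3, "B"): 2.5, (3, "C"): 2.0,
}
user_f, item_f = als_rank1(toy_ratings)

# Predict a rating that user 1 never gave (item "C") from the learned factors.
pred_1_C = user_f[1] * item_f["C"]
```

Each half-step is an ordinary regularized least-squares problem, which is why the method parallelizes well: all users can be solved independently once the item factors are fixed, and vice versa.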

### Step 1: Importing the necessary libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row
from pyspark.sql.functions import col, udf, exp, max as spark_max  # alias avoids shadowing Python's built-in max
from pyspark.sql.types import FloatType, LongType
import warnings
warnings.filterwarnings('ignore')

### Step 2: Spark Configuration/Setup
As in Part 2, we will be using Spark.

In [2]:
spark = SparkSession.builder \
    .appName("Item-Item Recommender System") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()


### Step 3: Data Preparation/Loading
As in Part 2, we use the same dataset and the same data preparation.

In [3]:
csv_file_path = "data/ratings_small.csv"
data = spark.read.csv(csv_file_path, header=True, inferSchema=True)
data = data.drop("timestamp")

# split data into train and test
train_data, test_data = data.randomSplit([0.8, 0.2], seed=42)


### Step 4: Data Modeling

Define the ALS algorithm for collaborative filtering.

In [4]:
# create ALS model
als = ALS(maxIter=15, rank=10, regParam=0.15, userCol="userId", itemCol="movieId", ratingCol="rating", coldStartStrategy="drop")

# fit model
model = als.fit(train_data)
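
A note on `coldStartStrategy="drop"`: after a random split, a user or movie can end up only in the test set, so the model has no learned factors for it and the prediction comes back as NaN, which would also turn the RMSE into NaN. With `"drop"`, those rows are removed from the predictions. The plain-Python sketch below only imitates that behaviour; the data, the `predict` helper, and its placeholder score are hypothetical.

```python
import math

# (userId, movieId, rating) triples; values are made up.
train = [(1, 10, 4.0), (1, 20, 3.0), (2, 10, 5.0)]
test = [(2, 20, 4.5), (2, 30, 2.0), (3, 10, 3.5)]  # movie 30 and user 3 are unseen

seen_users = {u for u, _, _ in train}
seen_movies = {m for _, m, _ in train}


def predict(user, movie):
    """Stand-in for the model: NaN when either side has no learned factors."""
    if user not in seen_users or movie not in seen_movies:
        return math.nan
    return 4.0  # placeholder score, not a real prediction


scored = [(u, m, r, predict(u, m)) for u, m, r in test]

# Equivalent in spirit to coldStartStrategy="drop": discard NaN predictions.
kept = [row for row in scored if not math.isnan(row[3])]
```

Only the `(2, 20)` row survives here, because both the user and the movie were seen during training. The trade-off is that the evaluation silently ignores cold-start cases.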



### Step 5: Data Prediction
We then use the model to make predictions on our test dataset. In this step, the model adds a column named `prediction`, computed from the learned latent factors, which we can use to evaluate the model's performance on the test set. Five example rows are shown below.

In [5]:
# Make predictions on the test set
predictions = model.transform(test_data)

# show 5 rows
predictions.show(5)

                                                                                

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   148|    185|   3.0|  3.131124|
|   148|    364|   4.0|  4.014933|
|   148|    596|   4.5| 3.8559518|
|   148|   1028|   5.0| 3.9720638|
|   148|   1136|   4.5| 4.3673754|
+------+-------+------+----------+
only showing top 5 rows
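
For intuition, each value in the `prediction` column is the dot product of a user's latent factor vector and a movie's latent factor vector. The sketch below uses made-up vectors of length 3 for brevity (the model above uses `rank=10`); in PySpark the learned vectors can be inspected through `model.userFactors` and `model.itemFactors`.

```python
def predict_rating(user_factors, item_factors):
    """Dot product of a user's and an item's latent factor vectors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))


# Hypothetical factor vectors; real ones come from the fitted model.
user_vec = [1.2, 0.4, -0.3]
movie_vec = [2.5, 1.0, 0.8]

predicted = predict_rating(user_vec, movie_vec)
```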



### Step 6: Model Evaluation

We chose RMSE and R-squared as evaluation metrics, since both work well for regression-based models.

In [6]:
# Evaluate the model using RMSE (Root Mean Squared Error)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = {:.4f}".format(rmse))

# Evaluate the model by calculating the R-squared
evaluator_r2 = RegressionEvaluator(metricName="r2", labelCol="rating", predictionCol="prediction")
r2 = evaluator_r2.evaluate(predictions)
print("R-squared on test data = {:.4f}".format(r2))

spark.stop()

Root Mean Squared Error (RMSE) on test data = 0.9029
R-squared on test data = 0.2613
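
To show what the two metrics measure, here is a small sketch that computes them by hand on the five rows displayed in Step 5. The function names are our own, and because this uses only five rows, the results will not match the full test-set figures above.

```python
import math

# (rating, prediction) pairs copied from the five rows shown in Step 5.
rows = [
    (3.0, 3.131124),
    (4.0, 4.014933),
    (4.5, 3.8559518),
    (5.0, 3.9720638),
    (4.5, 4.3673754),
]


def rmse(pairs):
    """Root of the mean squared difference between rating and prediction."""
    return math.sqrt(sum((y - p) ** 2 for y, p in pairs) / len(pairs))


def r_squared(pairs):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    mean_y = sum(y for y, _ in pairs) / len(pairs)
    ss_res = sum((y - p) ** 2 for y, p in pairs)
    ss_tot = sum((y - mean_y) ** 2 for y, _ in pairs)
    return 1.0 - ss_res / ss_tot


sample_rmse = rmse(rows)
sample_r2 = r_squared(rows)
```

RMSE is expressed in rating units (stars), while R-squared compares the model against always predicting the mean rating: a value near 0 means the model is barely better than that baseline.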


### Step 7: Model Performance

We tested several hyperparameter combinations, first by running a grid search over `maxIter`, `rank` and `regParam`. Initially this led us to believe that a rank of 50, with more latent factors, would perform better. After further evaluation and manually adjusting the parameters one at a time, we found that the combination of 15, 10 and 0.15 respectively performed best on both RMSE and R^2. While the RMSE is reasonable, the R^2 shows that the model explains only about a quarter of the variance in the ratings. This is probably due to the sparse dataset: our analysis indicated a sparsity of 98.36%, which is quite high and makes it difficult to estimate a user's ratings accurately.
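
Sparsity here means the share of user-movie cells that have no rating. A minimal sketch of the calculation, assuming it is defined as one minus the number of observed ratings divided by users times items (the toy data and function name are illustrative):

```python
def sparsity(ratings):
    """ratings: iterable of (user, item) pairs, one per observed rating."""
    pairs = set(ratings)
    n_users = len({u for u, _ in pairs})
    n_items = len({i for _, i in pairs})
    return 1.0 - len(pairs) / (n_users * n_items)


# 4 observed ratings out of 3 users x 3 items = 9 possible cells.
toy = [(1, "A"), (1, "B"), (2, "A"), (3, "C")]
toy_sparsity = sparsity(toy)
```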


| Test | maxIter | rank | regParam | RMSE | R^2 | Time |
|--|--|--|--|--|--|--|
| Test 1 | 10 | 5 | 0.1 | 0.9200 | 0.2329 | 8.0s |
| Test 2 | 15 | 10 | 0.1 | 0.9114 | 0.2472 | 23.0s |
| **Test 3 (best)** | 15 | 10 | 0.15 | **0.9007** | **0.2647** | 26.5s |
| Test 4 | 15 | 10 | 0.16 | 0.9011 | 0.2641 | 22.5s |
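
The kind of sweep described in this step can be sketched as a loop over every parameter combination. This is not the code we ran: `fit_and_score` is a hypothetical callable that would train ALS with the given parameters and return a validation RMSE, and `fake_rmse` is a made-up stand-in so the sketch runs without Spark.

```python
from itertools import product


def grid_search(fit_and_score, max_iters, ranks, reg_params):
    """Try every (max_iter, rank, reg_param) combination; lower score wins.

    Returns a (best_score, best_params) tuple.
    """
    results = [
        (fit_and_score(*params), params)
        for params in product(max_iters, ranks, reg_params)
    ]
    return min(results)


def fake_rmse(max_iter, rank, reg_param):
    """Made-up scoring function, NOT a real model evaluation."""
    return abs(reg_param - 0.15) + abs(rank - 10) / 100 + 1 / max_iter


best_score, best_params = grid_search(
    fake_rmse,
    max_iters=[10, 15],
    ranks=[5, 10, 50],
    reg_params=[0.1, 0.15, 0.16],
)
```

In Spark itself, `ParamGridBuilder` and `CrossValidator` from `pyspark.ml.tuning` cover the same ground, with the added benefit of cross-validated scores rather than a single split.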

### Step 8: And? What's next?

We made several attempts to improve on the latent factors by adding baselines for `user_rating_mean - global_mean` and `item_rating_mean - global_mean`, and even temporal adjustments, but each time the evaluation metrics got worse. This is probably also due to the sparsity of the matrix, which prevents the means from accurately representing a user or a movie. Below is some of the code we used for these experiments.


```python
# Baseline statistics, computed on the training split only
global_mean = train_data.groupBy().mean("rating").collect()[0][0]
user_mean = train_data.groupBy("userId").mean("rating").withColumnRenamed("avg(rating)", "user_mean")
movie_mean = train_data.groupBy("movieId").mean("rating").withColumnRenamed("avg(rating)", "movie_mean")
...
# Shift each ALS prediction by 10% of the user and movie offsets from the global mean
predictions = predictions.withColumn('prediction', col('prediction') + 0.1 * (col('user_mean') - global_mean) + 0.1 * (col('movie_mean') - global_mean))
```
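
Written as a formula, the adjustment in the snippet above is as follows. The symbols are our own notation: $\hat{r}_{ui}$ is the ALS prediction for user $u$ and movie $i$, $\bar{r}_u$ is `user_mean`, $\bar{r}_i$ is `movie_mean`, and $\mu$ is `global_mean`.

$$
\hat{r}'_{ui} = \hat{r}_{ui} + 0.1\,(\bar{r}_u - \mu) + 0.1\,(\bar{r}_i - \mu)
$$

When a user or a movie has only a handful of ratings, $\bar{r}_u$ or $\bar{r}_i$ is estimated from very few values, so the offset mostly adds noise. That is consistent with the worse scores we observed.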

### Recap of Collaboration

When it comes to performance, the latent factor model required much less computation because of dimensionality reduction. Although it performed worse on the metrics for our 10K dataset, we expect that with a larger and less sparse dataset this method would prove to be the winner. In a dataset where each user has more ratings, the latent factors could capture the user's preferences rather than relying solely on item-item similarity.

