### Initializing a Spark session

In [1]:
import os
os.environ['SPARK_DRIVER_MEMORY'] = '16g'
os.environ['SPARK_EXECUTOR_MEMORY'] = '16g'

# Allocate 16 GB of memory to both the Spark driver and the executors.
# The extra memory is needed to run StringIndexer in later steps.
# These variables must be set before the Spark session is created.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/06/03 18:25:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Reading the dataset

In [2]:
df = spark.read.format("csv")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load("ratings_Beauty.csv")\
  .coalesce(10)

                                                                                

### Exploring the dataset

In [3]:
# Check the column names and data types
df.printSchema()

root
 |-- UserId: string (nullable = true)
 |-- ProductId: string (nullable = true)
 |-- Rating: double (nullable = true)
 |-- Timestamp: integer (nullable = true)



The schema shows that UserId and ProductId are strings, but the ALS model requires numeric user and item IDs whose values fit in the integer range. We therefore encode both columns as numeric indices with StringIndexer.
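To make the encoding step concrete, here is a minimal pure-Python sketch of frequency-based label indexing, which is how the default `frequencyDesc` ordering of StringIndexer is documented to behave: the most frequent label gets index 0.0, and ties are broken alphabetically. The helper `fit_string_index` and the sample IDs are hypothetical and exist only for illustration; they are not part of Spark or of the dataset.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a float index, most frequent first.

    Ties in frequency are broken alphabetically. This imitates the
    documented default ordering of StringIndexer; it is not Spark code.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical user IDs, not taken from ratings_Beauty.csv
user_ids = ["A3", "A1", "A3", "A2", "A3", "A1"]

mapping = fit_string_index(user_ids)
encoded = [mapping[u] for u in user_ids]

print(mapping)
print(encoded)
```

Because indices follow frequency, low index values correspond to the most active users and the most rated products, which matches the minimum index of 0.0 seen in the summary statistics further down.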

In [12]:
from pyspark.ml.feature import StringIndexer

# Create StringIndexers for UserId and ProductId
userIndexer = StringIndexer(inputCol="UserId", outputCol="UserIdInt", handleInvalid="skip")
productIndexer = StringIndexer(inputCol="ProductId", outputCol="ProductIdInt", handleInvalid="skip")

# Fit and transform UserId column
df_indexed_user = userIndexer.fit(df).transform(df)

# Fit and transform ProductId column
df_indexed_product = productIndexer.fit(df_indexed_user).transform(df_indexed_user)

# Drop the original string columns
final_df = df_indexed_product.drop("UserId", "ProductId")

# Rename the encoded columns
final_df = final_df.withColumnRenamed("UserIdInt", "UserId").withColumnRenamed("ProductIdInt", "ProductId")

# Show the final DataFrame
final_df.show()

24/06/03 18:32:27 WARN DAGScheduler: Broadcasting large task binary with size 36.5 MiB

+------+----------+--------+---------+
|Rating| Timestamp|  UserId|ProductId|
+------+----------+--------+---------+
|   5.0|1369699200| 70392.0| 145790.0|
|   3.0|1355443200|265306.0| 103581.0|
|   5.0|1404691200|552933.0| 103581.0|
|   4.0|1382572800|536779.0| 145791.0|
|   1.0|1274227200| 14679.0| 145792.0|
|   5.0|1404518400|    86.0| 145793.0|
|   5.0|1371945600|   483.0| 145794.0|
|   5.0|1373068800|  2928.0| 145795.0|
|   5.0|1401840000|994647.0| 145796.0|
|   4.0|1389052800|242707.0|  81247.0|
|   5.0|1372032000|   483.0|  81247.0|
|   4.0|1378252800|300178.0|  81247.0|
|   5.0|1372118400| 35883.0| 145797.0|
|   5.0|1371686400|  2928.0| 145798.0|
|   5.0|1372118400| 35883.0| 103582.0|
|   5.0|1373414400|  5107.0| 103582.0|
|   5.0|1372896000|  2928.0| 103583.0|
|   5.0|1372896000|139615.0| 103583.0|
|   5.0|1373068800|  2928.0| 103584.0|
|   5.0|1372291200|   483.0| 103584.0|
+------+----------+--------+---------+
only showing top 20 rows



                                                                                

In [14]:
final_df.describe().show()

24/06/03 18:33:22 WARN DAGScheduler: Broadcasting large task binary with size 50.9 MiB

+-------+------------------+--------------------+------------------+------------------+
|summary|            Rating|           Timestamp|            UserId|         ProductId|
+-------+------------------+--------------------+------------------+------------------+
|  count|           2023070|             2023070|           2023070|           2023070|
|   mean| 4.149035871225415|1.3603887365637374E9|396279.36594087206|31677.089125438073|
| stddev|1.3115045737121593| 4.611860421680957E7| 375833.6848674939| 50072.68503668203|
|    min|               1.0|           908755200|               0.0|               0.0|
|    max|               5.0|          1406073600|         1210270.0|          249273.0|
+-------+------------------+--------------------+------------------+------------------+



                                                                                

### Alternating Least Squares model

In [15]:
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

In [16]:
train, test = final_df.randomSplit([0.8, 0.2], seed=42)

In [17]:
als = ALS(maxIter=5,
          regParam=0.01,
          userCol='UserId',
          itemCol='ProductId',
          ratingCol='Rating')

In [18]:
model = als.fit(train)

24/06/03 18:34:12 WARN DAGScheduler: Broadcasting large task binary with size 50.9 MiB
24/06/03 18:35:06 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
24/06/03 18:35:15 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK

In [19]:
predictions = model.transform(test)
predictions.show(5)

24/06/03 18:37:21 WARN DAGScheduler: Broadcasting large task binary with size 50.9 MiB

+------+----------+--------+---------+----------+
|Rating| Timestamp|  UserId|ProductId|prediction|
+------+----------+--------+---------+----------+
|   1.0|1135468800|268271.0|   6552.0|       NaN|
|   1.0|1192233600|383781.0|  11941.0|       NaN|
|   1.0|1182211200|  1628.0|  42381.0|-1.1854051|
|   1.0|1017273600|817867.0|  57774.0|       NaN|
|   1.0|1214179200|     7.0|  72777.0| 2.9140964|
+------+----------+--------+---------+----------+
only showing top 5 rows



                                                                                

In [20]:
evaluator = RegressionEvaluator(metricName='rmse',
                                labelCol='Rating',
                                predictionCol='prediction')

In [21]:
rmse = evaluator.evaluate(predictions)
print('RMSE:', rmse)

24/06/03 18:39:23 WARN DAGScheduler: Broadcasting large task binary with size 50.9 MiB

RMSE: nan


                                                                                

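The evaluator returns NaN because some of the predictions are NaN. The random split leaves users or items in the test set that never appear in the training set, so the model has no factors for them and cannot produce a prediction. To avoid this, we can set the cold start strategy to "drop": rows with NaN predictions are removed, so unseen users and items are excluded from the evaluation.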
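The effect of a single NaN on the metric can be reproduced without Spark. The sketch below is a toy illustration under stated assumptions: the factor values, the IDs, and the helpers `predict` and `rmse` are all hypothetical. It treats a prediction as the dot product of a user's and an item's latent factors, returns NaN when either side was never seen in training, and then compares the RMSE with and without the NaN rows.

```python
import math


def rmse(pairs):
    """Root mean squared error over (label, prediction) pairs."""
    squared = [(label - pred) ** 2 for label, pred in pairs]
    return math.sqrt(sum(squared) / len(squared))


def predict(user_factors, item_factors, user, item):
    """Dot product of latent factors; NaN if the user or item is unseen."""
    if user not in user_factors or item not in item_factors:
        return float("nan")
    return sum(u * v for u, v in zip(user_factors[user], item_factors[item]))


# Hypothetical factors, standing in for what a model learns from `train`
user_factors = {0: [1.0, 2.0], 1: [0.5, 1.0]}
item_factors = {10: [2.0, 1.0], 11: [1.0, 1.0]}

# (user, item, rating) rows of a toy test split; user 2 is not in training
test_rows = [(0, 10, 5.0), (1, 11, 2.0), (2, 10, 4.0)]

scored = [(r, predict(user_factors, item_factors, u, i)) for u, i, r in test_rows]

# One NaN prediction is enough to turn the whole metric into NaN
rmse_all = rmse(scored)

# Dropping the NaN rows first, in the spirit of coldStartStrategy="drop"
kept = [(r, p) for r, p in scored if not math.isnan(p)]
rmse_dropped = rmse(kept)

print("RMSE with NaN rows:", rmse_all)
print("RMSE after dropping NaN rows:", rmse_dropped)
```

Note that dropping rows changes what is being measured: the resulting RMSE only covers user and item pairs the model has seen before.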

In [22]:
als = ALS(maxIter=5,
          regParam=0.01,
          userCol='UserId',
          itemCol='ProductId',
          ratingCol='Rating',
          coldStartStrategy="drop")

In [23]:
model = als.fit(train)

24/06/03 18:42:32 WARN DAGScheduler: Broadcasting large task binary with size 50.9 MiB

In [24]:
predictions = model.transform(test)
predictions.show(5)

24/06/03 18:45:07 WARN DAGScheduler: Broadcasting large task binary with size 50.9 MiB

+------+----------+-------+---------+-----------+
|Rating| Timestamp| UserId|ProductId| prediction|
+------+----------+-------+---------+-----------+
|   5.0|1383696000| 6832.0|     28.0| -1.6007009|
|   5.0|1393200000|15164.0|     28.0| -0.8685224|
|   5.0|1387756800|37008.0|     28.0|0.075431585|
|   5.0|1357516800|37743.0|     28.0|-0.46197078|
|   5.0|1370390400|41810.0|     28.0| -1.6041979|
+------+----------+-------+---------+-----------+
only showing top 5 rows



                                                                                

In [25]:
evaluator = RegressionEvaluator(metricName='rmse',
                                labelCol='Rating',
                                predictionCol='prediction')

In [26]:
rmse = evaluator.evaluate(predictions)
print('RMSE:', rmse)

24/06/03 18:47:52 WARN DAGScheduler: Broadcasting large task binary with size 50.9 MiB

RMSE: 6.436409561114752


                                                                                

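An RMSE of about 6.44 is very high for ratings on a 1-to-5 scale: the typical error exceeds the width of the scale itself, and several of the predictions shown above are negative. Lower RMSE values (closer to 0) are better, as they indicate that the model's predictions are more accurate.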
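One way to put an RMSE value in context is to compare it with a trivial baseline. The sketch below uses a handful of toy ratings and a hypothetical `rmse` helper to show that always predicting the mean rating yields an RMSE equal to the population standard deviation of the ratings. For this dataset, `describe()` reported a standard deviation of roughly 1.31 for Rating over all rows, so an RMSE above 6 suggests the model does worse than that constant guess would, although the exact baseline on the test split may differ slightly.

```python
import math
import statistics


def rmse(labels, predictions):
    """Root mean squared error between two equally long sequences."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


# Toy ratings for illustration only
ratings = [5.0, 3.0, 5.0, 4.0, 1.0]

# Baseline: predict the mean rating for every row
mean_rating = sum(ratings) / len(ratings)
baseline_rmse = rmse(ratings, [mean_rating] * len(ratings))

print("Mean rating:", mean_rating)
print("Baseline RMSE:", baseline_rmse)
print("Population std dev:", statistics.pstdev(ratings))
```

A model is only adding value if its RMSE on held-out data is clearly below this kind of baseline.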

### Applying the model to a sample user

In [27]:
this_user = test.filter(test['UserId'] == 15164.0).select('UserId', 'ProductId')
this_user.show()

24/06/03 18:51:30 WARN DAGScheduler: Broadcasting large task binary with size 50.9 MiB

+-------+---------+
| UserId|ProductId|
+-------+---------+
|15164.0|     28.0|
|15164.0|   3510.0|
+-------+---------+



                                                                                

In [28]:
recommendation_this_user = model.transform(this_user)
recommendation_this_user.show()

24/06/03 18:51:58 WARN DAGScheduler: Broadcasting large task binary with size 50.9 MiB

+-------+---------+----------+
| UserId|ProductId|prediction|
+-------+---------+----------+
|15164.0|     28.0|-0.8685224|
|15164.0|   3510.0| 3.9482117|
+-------+---------+----------+



                                                                                