# <center> <img src="../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> **Big Data** </center>
---
### <center> **Autumn 2025** </center>
---
### <center> **Final Project: Machine Learning** </center>
### <center> **Alternating Least Squares (ALS)** </center>
---
**Professor**: Pablo Camarillo Ramirez


**Student**: Ana Carolina Arellano Valdez

# Machine Learning algorithm to use: **Alternating Least Squares (ALS)**
I want to solve the book recommendation problem using the Alternating Least Squares (ALS) algorithm. I chose this problem because I used to be a disciplined reader, but the book recommendations I received from family and friends did not always match my interests. I later found Goodreads, a site where you upload your book preferences and reviews and, based on them, receive recommendations for books you might like. However, I think those recommendations could be improved with a more sophisticated algorithm such as ALS.

# Dataset Description
- Source: Kaggle -> https://www.kaggle.com/datasets/tranhungnghiep/goodbooks6m/data?select=ratings.csv
- Size of the dataset: Sample of 99 out of 6 million ratings
- How many unique users/items are there? 
  - Unique users: 5
  - Unique books: 96

The ratings.csv file looks like this:

| user_id | book_id | rating |
|---------|---------|--------|
| 1       | 258     | 5      |
| 2       | 4081    | 4      |
| 2       | 260     | 5      |
| 2       | 9296    | 5      |
| 2       | 2318    | 3      |

# ML Training Process
## Create Spark Session

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ML: ALS") \
    .master("spark://spark-master:7077") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("INFO")

# Optimization (reduce the number of shuffle partitions)
spark.conf.set("spark.sql.shuffle.partitions", "5")

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/06 18:50:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Load the dataset

In [2]:
from carolinarellano.spark_utils import SparkUtils

books_ratings_path = "/opt/spark/work-dir/data/ml/als"
books_ratings_schema = SparkUtils.generate_schema([("user_id", "int"), ("book_id", "int"), ("rating", "int")])

# Source https://www.kaggle.com/datasets/tranhungnghiep/goodbooks6m/data?select=ratings.csv
books_ratings_df = spark.read \
                    .option("header", "true") \
                    .option("delimiter", ",") \
                    .schema(books_ratings_schema) \
                    .csv(books_ratings_path)

books_ratings_df.printSchema()
books_ratings_df.show(n=3)

root
 |-- user_id: integer (nullable = true)
 |-- book_id: integer (nullable = true)
 |-- rating: integer (nullable = true)




+-------+-------+------+
|user_id|book_id|rating|
+-------+-------+------+
|      1|    258|     5|
|      2|   4081|     4|
|      2|    260|     5|
+-------+-------+------+
only showing top 3 rows


                                                                                

In [4]:
print(f"Number of items or books (n): {books_ratings_df.groupBy('book_id').count().count()}")
print(f"Number of users (m): {books_ratings_df.groupBy('user_id').count().count()}")

Number of items or books (n): 96
Number of users (m): 5
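
The counts above also say something about how sparse the rating matrix is. The following is a small arithmetic sketch in plain Python, assuming each of the 99 rows is a distinct (user, book) pair; the variable names are illustrative and not part of the notebook.

```python
# Counts reported above for the sample.
num_ratings = 99
num_users = 5
num_books = 96

# Every (user, book) pair is one cell of the rating matrix.
total_cells = num_users * num_books
density = num_ratings / total_cells
sparsity = 1 - density

# Each of the 96 books has at least one rating, so only
# num_ratings - num_books ratings are left over to give a book a second rating.
max_books_rated_more_than_once = num_ratings - num_books

print(f"Matrix cells: {total_cells}")
print(f"Density: {density:.4f}")
print(f"Sparsity: {sparsity:.4f}")
print(f"Upper bound on books with more than one rating: {max_books_rated_more_than_once}")
```

In other words, at least 93 of the 96 books were rated exactly once, which limits how much the model can learn about similarities between books.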


## Create & Train the ML Model

**Model choice:** Alternating Least Squares (ALS)
I used Alternating Least Squares (ALS) because the task is classic user–item rating prediction and top-N recommendations. Since the data contains explicit 1–5 ratings, I trained ALS in explicit-feedback mode.

**Feature mapping:**
- userCol = "user_id"
- itemCol = "book_id"
- ratingCol = "rating"

**Hyperparameters:**
- rank = 3 -> The number of latent factors in the model. I chose 3 because it is a small dataset and I wanted to avoid overfitting.
- regParam = 0.1 -> The regularization parameter. I chose 0.1 to prevent overfitting while still allowing the model to learn from the data.
- maxIter = 5 -> The number of iterations to run. I chose 5 to balance between training time and model performance.
- coldStartStrategy = "drop" -> Drops any row whose prediction is NaN, which happens when the user or the book was not seen during training. In this notebook the model is evaluated on the same data it was trained on, so no such rows occur.
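
To make the roles of `rank`, `regParam` and `maxIter` more concrete, here is a minimal NumPy sketch of the alternating idea behind ALS: fix the book factors and solve a small ridge regression per user, then fix the user factors and do the same per book. It is a simplified illustration, not Spark's implementation (Spark's version is distributed and, according to its documentation, scales the regularization by the number of ratings). The function `als_explicit` and the toy ratings are hypothetical.

```python
import numpy as np


def als_explicit(ratings, num_users, num_items, rank=3, reg=0.1, iters=5, seed=0):
    """Toy explicit-feedback ALS over (user, item, rating) triples with 0-based ids."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(num_users, rank))  # user factors
    V = rng.normal(scale=0.1, size=(num_items, rank))  # item factors

    # Index the observed ratings by user and by item.
    by_user = {u: [] for u in range(num_users)}
    by_item = {i: [] for i in range(num_items)}
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))

    def solve(fixed, observed):
        # Ridge regression: minimize ||A x - y||^2 + reg * ||x||^2.
        idx = [j for j, _ in observed]
        y = np.array([r for _, r in observed], dtype=float)
        A = fixed[idx]
        return np.linalg.solve(A.T @ A + reg * np.eye(rank), A.T @ y)

    def objective():
        # Regularized squared error over the observed ratings only.
        sq_err = sum((r - U[u] @ V[i]) ** 2 for u, i, r in ratings)
        return sq_err + reg * (np.sum(U ** 2) + np.sum(V ** 2))

    history = [objective()]
    for _ in range(iters):
        for u, obs in by_user.items():  # fix V, solve each user row
            if obs:
                U[u] = solve(V, obs)
        for i, obs in by_item.items():  # fix U, solve each item row
            if obs:
                V[i] = solve(U, obs)
        history.append(objective())
    return U, V, history


# Hypothetical toy ratings, not taken from the Goodbooks sample.
toy_ratings = [(0, 0, 5), (0, 1, 4), (0, 3, 2), (1, 0, 5),
               (1, 1, 4), (1, 2, 3), (2, 1, 4), (2, 3, 2)]
U, V, history = als_explicit(toy_ratings, num_users=3, num_items=4)
print([round(float(h), 3) for h in history])
```

Because each half-step solves its least-squares problem exactly, the regularized objective stored in `history` should not increase from one iteration to the next, which is why a handful of iterations is often enough on small data.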

In [10]:
from pyspark.ml.recommendation import ALS

als = ALS(
    userCol="user_id",
    itemCol="book_id",
    ratingCol="rating",
    maxIter=5,
    regParam=0.1,
    rank=3,
    coldStartStrategy="drop",  # Drop rows whose prediction is NaN
)

# Train the model
print("Starting model training...")
model = als.fit(books_ratings_df)
print("Recommendation system generated successfully")

Starting model training...
Recommendation system generated successfully


## Persist the model

In [11]:
als_model_path = "/opt/spark/work-dir/data/mlmodels/als/als_books"
model.write().overwrite().save(als_model_path)

                                                                                

## Predictions

In [12]:
from pyspark.ml.recommendation import ALSModel

# Optionally reload the persisted model instead of reusing the in-memory one
# als_model = ALSModel.load(als_model_path)

# Generate the top 5 recommendations for each user
user_recommendations = model.recommendForAllUsers(numItems=5)
# Show recommendations
user_recommendations.show(truncate=False)





+-------+------------------------------------------------------------------------------------------------------+
|user_id|recommendations                                                                                       |
+-------+------------------------------------------------------------------------------------------------------+
|1      |[{1796, 4.927858}, {258, 4.927858}, {9296, 3.699322}, {8519, 3.699322}, {3753, 3.699322}]             |
|2      |[{9296, 4.950305}, {8519, 4.950305}, {3753, 4.950305}, {2686, 4.950305}, {301, 4.950305}]             |
|4      |[{1237, 4.9404473}, {693, 4.9404473}, {325, 4.9404473}, {103, 4.9404473}, {102, 4.9404473}]           |
|6      |[{6351, 3.9524057}, {5556, -0.12159826}, {3638, -0.12159826}, {2738, -0.12159826}, {867, -0.12159826}]|
|8      |[{9114, 4.9561243}, {5425, 4.9561243}, {4622, 4.9561243}, {3020, 4.9561243}, {2732, 4.9561243}]       |
+-------+------------------------------------------------------------------------------------------------------+
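
The lists above include books the user has already rated: user 1 is recommended book 258 and user 2 is recommended book 9296, and both pairs appear in the dataset preview. The sketch below shows one possible way to filter those out in plain Python. It only uses the rows visible in this notebook (the first two users' recommendations and the five preview ratings), so other already-rated books may remain; the helper `drop_already_rated` is hypothetical.

```python
# Top-5 lists copied from the output above: (book_id, predicted rating).
recommendations = {
    1: [(1796, 4.927858), (258, 4.927858), (9296, 3.699322), (8519, 3.699322), (3753, 3.699322)],
    2: [(9296, 4.950305), (8519, 4.950305), (3753, 4.950305), (2686, 4.950305), (301, 4.950305)],
}

# Ratings copied from the dataset preview: (user_id, book_id, rating).
ratings_preview = [(1, 258, 5), (2, 4081, 4), (2, 260, 5), (2, 9296, 5), (2, 2318, 3)]


def drop_already_rated(recs, ratings):
    """Remove recommended books that the same user has already rated."""
    seen = {}
    for user_id, book_id, _ in ratings:
        seen.setdefault(user_id, set()).add(book_id)
    return {
        user_id: [(book_id, score) for book_id, score in items
                  if book_id not in seen.get(user_id, set())]
        for user_id, items in recs.items()
    }


filtered = drop_already_rated(recommendations, ratings_preview)
for user_id, items in filtered.items():
    print(user_id, [book_id for book_id, _ in items])
```

On the Spark side, a similar result could probably be obtained by exploding the `recommendations` column and applying a `left_anti` join against the ratings DataFrame, requesting a few extra items per user so that five remain after filtering.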

                                                                                

## ML Evaluation
**Evaluator choice:** RegressionEvaluator
I used RegressionEvaluator because the task is to predict numeric ratings (1-5) for user-item pairs. RegressionEvaluator compares the predicted ratings with the actual ones and computes regression metrics such as RMSE (Root Mean Square Error), the square root of the mean squared prediction error.


**Metric used:** RMSE (Root Mean Square Error)
I chose RMSE as the evaluation metric because it provides a clear measure of how well the model's predicted ratings match the actual ratings. RMSE penalizes larger errors more than smaller ones, making it a useful metric for the recommendation task where accurate predictions are crucial for user satisfaction.
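
As a small illustration of why RMSE penalizes large errors more, the sketch below compares it with the mean absolute error (MAE) on two made-up error lists that have the same MAE. The numbers are hypothetical and are not produced by the model.

```python
import math


def rmse(errors):
    """Square root of the mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical prediction errors (predicted rating minus actual rating).
spread_out = [1.0, 1.0, 1.0, 1.0]    # four small misses
one_big_miss = [0.0, 0.0, 0.0, 4.0]  # three exact hits and one large miss

print(mae(spread_out), rmse(spread_out))      # 1.0 1.0
print(mae(one_big_miss), rmse(one_big_miss))  # 1.0 2.0
```

Both lists have the same average miss, but RMSE doubles for the list containing a single large error.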

In [13]:
from pyspark.ml.evaluation import RegressionEvaluator

# Generate predictions for the same ratings used for training (no held-out split)
predictions = model.transform(books_ratings_df)


# Set up evaluator to compute RMSE
evaluator = RegressionEvaluator(
    metricName="rmse", 
    labelCol="rating", 
    predictionCol="prediction"
)

# Calculate RMSE
rmse = evaluator.evaluate(predictions)
print(f"Root-mean-square error (RMSE) = {rmse}")

Root-mean-square error (RMSE) = 0.0476170399192705
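
Note that this RMSE is measured on the same rows the model was trained on, so it describes how well the model fits known ratings rather than how well it predicts unseen ones. A held-out split (for example with `DataFrame.randomSplit` in PySpark) would be the usual next step, but with 99 ratings spread over 96 books most held-out rows would refer to a book that is missing from the training set, and `coldStartStrategy="drop"` would discard them. The plain-Python sketch below illustrates that effect on hypothetical data where every book is rated once; `holdout_split` and `scorable_rows` are illustrative helpers, not Spark APIs.

```python
import random


def holdout_split(ratings, test_size, seed=42):
    """Shuffle the ratings and hold out the last `test_size` rows."""
    rng = random.Random(seed)
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) - test_size
    return shuffled[:cut], shuffled[cut:]


def scorable_rows(train, test):
    """Keep test rows whose user and book both appear in the training rows."""
    known_users = {user for user, _, _ in train}
    known_books = {book for _, book, _ in train}
    return [(user, book, rating) for user, book, rating in test
            if user in known_users and book in known_books]


# Hypothetical data: 5 users, 20 books, every book rated exactly once.
ratings = [(k % 5 + 1, 100 + k, 3 + k % 3) for k in range(20)]

train, test = holdout_split(ratings, test_size=4)
kept = scorable_rows(train, test)
print(f"train={len(train)} test={len(test)} scorable test rows={len(kept)}")
```

When every book has a single rating, no held-out row can be scored, whatever the shuffle. A larger sample of the 6 million ratings, with several ratings per book, would make a held-out RMSE meaningful.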


In [None]:
sc.stop()