## Part 1
# Assignment 04 - Recommendation System using Collaborative Filtering 

## 1. Spark Initialization

In [1]:
# Import findspark to make pyspark importable as a regular library
import findspark
findspark.init()

In [2]:
# Import SparkSession
from pyspark.sql import SparkSession

In [3]:
# Create Spark Session
spark = SparkSession.builder.appName("Recommendation System").getOrCreate()

In [4]:
# Print the SparkSession object to confirm it was created
print(spark)

<pyspark.sql.session.SparkSession object at 0x10bcdd860>


## 2. Load Dataset
### 2.1 Ratings Data

In [5]:
# Load the dataset
ratings_file = "/Users/mocatfrio/Documents/big-data/recommendation-system/flask-app/csv/ratings.csv"
ratings_df = spark.read.load(ratings_file, format="csv", sep=",", inferSchema="true", header="true")

In [6]:
# Show dataset
ratings_df.show()

+-------+-------+------+
|user_id|book_id|rating|
+-------+-------+------+
|      1|    258|     5|
|      2|   4081|     4|
|      2|    260|     5|
|      2|   9296|     5|
|      2|   2318|     3|
|      2|     26|     4|
|      2|    315|     3|
|      2|     33|     4|
|      2|    301|     5|
|      2|   2686|     5|
|      2|   3753|     5|
|      2|   8519|     5|
|      4|     70|     4|
|      4|    264|     3|
|      4|    388|     4|
|      4|     18|     5|
|      4|     27|     5|
|      4|     21|     5|
|      4|      2|     5|
|      4|     23|     5|
+-------+-------+------+
only showing top 20 rows



In [7]:
ratings_df.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- book_id: integer (nullable = true)
 |-- rating: integer (nullable = true)



In [8]:
ratings_df.count()

5976479

### 2.2 Books Data

In [9]:
# Load the dataset
books_file = "/Users/mocatfrio/Documents/big-data/recommendation-system/flask-app/csv/books.csv"
books_df = spark.read.load(books_file, format="csv", sep=",", inferSchema="true", header="true")

In [10]:
# Show dataset
books_df.show(truncate=False)

+-------+-----------------+------------+--------+-----------+----------+----------------+------------------------------------------------------+-------------------------+-------------------------------------------------------------+-----------------------------------------------------------+-------------+--------------+-------------+------------------+-----------------------+---------+---------+---------+---------+---------+-----------------------------------------------------------+-----------------------------------------------------------+
|book_id|goodreads_book_id|best_book_id|work_id |books_count|isbn      |isbn13          |authors                                               |original_publication_year|original_title                                               |title                                                      |language_code|average_rating|ratings_count|work_ratings_count|work_text_reviews_count|ratings_1|ratings_2|ratings_3|ratings_4|ratings_5|image_url              

In [11]:
books_df.printSchema()

root
 |-- book_id: integer (nullable = true)
 |-- goodreads_book_id: integer (nullable = true)
 |-- best_book_id: integer (nullable = true)
 |-- work_id: integer (nullable = true)
 |-- books_count: integer (nullable = true)
 |-- isbn: string (nullable = true)
 |-- isbn13: double (nullable = true)
 |-- authors: string (nullable = true)
 |-- original_publication_year: double (nullable = true)
 |-- original_title: string (nullable = true)
 |-- title: string (nullable = true)
 |-- language_code: string (nullable = true)
 |-- average_rating: string (nullable = true)
 |-- ratings_count: string (nullable = true)
 |-- work_ratings_count: string (nullable = true)
 |-- work_text_reviews_count: string (nullable = true)
 |-- ratings_1: double (nullable = true)
 |-- ratings_2: integer (nullable = true)
 |-- ratings_3: integer (nullable = true)
 |-- ratings_4: integer (nullable = true)
 |-- ratings_5: integer (nullable = true)
 |-- image_url: string (nullable = true)
 |-- small_image_url: string (nu

In [12]:
books_df.count()

10000

## 3. Collaborative Filtering using ALS (Alternating Least Squares) Algorithm
* **Collaborative filtering** is commonly used for recommender systems. These techniques aim to fill in the missing entries of a user-item association matrix. The `spark.ml` library uses the **Alternating Least Squares (ALS) algorithm** to learn latent factors that describe users and items and that can be used to predict the missing entries. We can evaluate the recommendation model by measuring the root-mean-square error (RMSE) of its rating predictions.
* Spark's ALS implementation expects each row to consist of a user, an item (here, a book), and a rating. `ratings_df` already has this shape (`user_id`, `book_id`, `rating`), so no preprocessing is needed.
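
To make the "alternating" part concrete, here is a minimal NumPy sketch of the idea on a toy 4x4 matrix. It is not Spark's implementation: the matrix, `rank`, `reg`, `n_iter`, and the helper names are all illustrative assumptions, and the regularization is a plain ridge term rather than whatever scaling Spark applies internally.

```python
import numpy as np

# Toy ratings matrix (made up): rows are users, columns are books, 0 means "not rated".
R = np.array([
    [5, 4, 0, 1],
    [4, 0, 1, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)
observed = R > 0

rank, reg, n_iter = 2, 0.1, 10  # illustrative values, not tuned
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(R.shape[0], rank))  # user latent factors
V = rng.normal(scale=0.1, size=(R.shape[1], rank))  # book latent factors


def observed_rmse(U, V):
    """RMSE between R and U @ V.T, measured on the rated cells only."""
    err = (R - U @ V.T)[observed]
    return float(np.sqrt(np.mean(err ** 2)))


def solve_side(fixed, ratings, mask):
    """Solve one ridge-regression problem per row while `fixed` is held constant."""
    out = np.zeros((ratings.shape[0], fixed.shape[1]))
    for i in range(ratings.shape[0]):
        F = fixed[mask[i]]  # factors of the entries this row has rated
        A = F.T @ F + reg * np.eye(fixed.shape[1])
        b = F.T @ ratings[i, mask[i]]
        out[i] = np.linalg.solve(A, b)
    return out


initial_rmse = observed_rmse(U, V)
for _ in range(n_iter):
    U = solve_side(V, R, observed)      # fix book factors, update user factors
    V = solve_side(U, R.T, observed.T)  # fix user factors, update book factors
final_rmse = observed_rmse(U, V)

print(initial_rmse, final_rmse)
print(np.round(U @ V.T, 2))  # filled-in matrix, including the unrated cells
```

Each half-step is an ordinary least-squares problem because one factor matrix is held fixed, which is why the two updates can simply alternate.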

In [13]:
# Import the ALS estimator and the regression evaluator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row

In [14]:
# Split the ratings data into a training set (80%) and a test set (20%)
(training, test) = ratings_df.randomSplit([0.8, 0.2])
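
The split above is random, so the resulting sets (and the RMSE reported later) can change between runs; `randomSplit` accepts an optional `seed` argument if reproducibility matters. The pure-Python sketch below illustrates the general idea of a weighted, seeded split by drawing one uniform number per row. The function `random_split` is a hypothetical helper for illustration, not Spark code, and like any per-row sampling scheme it yields split sizes that are only approximately 80/20.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row."""
    total = sum(weights)
    # Cumulative upper bounds of each split's share, e.g. [0.8, 1.0]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for idx, upper in enumerate(bounds):
            if x < upper:
                splits[idx].append(row)
                break
    return splits


rows = list(range(1000))
training, test = random_split(rows, [0.8, 0.2], seed=42)
print(len(training), len(test))  # roughly 800 and 200, not exactly
```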

In [15]:
# Build the recommendation model using ALS on the training data
# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
als = ALS(maxIter=5, regParam=0.01, userCol="user_id", itemCol="book_id", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(training)
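
The reason for `coldStartStrategy="drop"` can be seen with plain Python. A user or book that appears only in the test set has no learned factors, so its prediction is NaN, and a single NaN turns the whole error metric into NaN. The ratings and predictions below are made-up values for illustration.

```python
import math


def rmse(pairs):
    """Root-mean-square error over (actual, predicted) pairs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))


# (actual rating, predicted rating); NaN stands for a user or book unseen in training
pairs = [(5, 4.6), (3, 3.4), (4, float("nan")), (2, 2.5)]

rmse_with_nan = rmse(pairs)
kept = [(a, p) for a, p in pairs if not math.isnan(p)]  # what "drop" does, in spirit
rmse_after_drop = rmse(kept)

print(rmse_with_nan)    # nan
print(rmse_after_drop)
```

Dropping these rows makes the metric computable, but it also means the RMSE says nothing about how the model handles new users or new books.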

In [16]:
# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 0.8372556230095466
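
For reference, the RMSE reported above follows the standard definition. Writing $r_{ui}$ for the actual rating of book $i$ by user $u$, $\hat{r}_{ui}$ for the predicted rating, and $T$ for the set of test pairs that survive the cold-start drop (these symbols are introduced here, not taken from the notebook):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} \left( r_{ui} - \hat{r}_{ui} \right)^2}
$$

Because the error is in the same units as the ratings, a value of about 0.84 means predictions are typically off by somewhat less than one rating point.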


In [17]:
# Generate top 10 book recommendations for each user
userRecs = model.recommendForAllUsers(10)
userRecs.show(truncate=False)

+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|user_id|recommendations                                                                                                                                                                            |
+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|148    |[[7913, 4.9231935], [8233, 4.6173744], [9842, 4.589587], [7593, 4.5557823], [8548, 4.530573], [9531, 4.5245495], [5978, 4.514138], [7401, 4.498357], [9569, 4.4904485], [7803, 4.4813657]] |
|463    |[[9418, 6.085741], [8601, 6.002691], [9682, 5.900895], [9806, 5.781508], [9073, 5.7021027], [7245, 5.616987], [8455, 5.587755], [5978, 5.5631685], [8827, 5.556246], [5897, 5.544558]]     |
|471    |[
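
Some predicted ratings in this output exceed 5 (for example 6.085741 for user 463). ALS fits an unconstrained regression, so nothing bounds its predictions to the rating scale. If the scores are shown to end users, one option is to clip them for display. The sketch below assumes a 1 to 5 rating scale, which is an assumption rather than something the notebook states, and `clip_rating` is a hypothetical helper.

```python
def clip_rating(score, low=1.0, high=5.0):
    """Clamp a predicted rating into the assumed [low, high] rating scale."""
    return max(low, min(high, score))


# First three (book_id, predicted rating) pairs for user 463 from the output above
recs_463 = [(9418, 6.085741), (8601, 6.002691), (9682, 5.900895)]
clipped = [(book_id, clip_rating(score)) for book_id, score in recs_463]
print(clipped)  # [(9418, 5.0), (8601, 5.0), (9682, 5.0)]
```

Clipping erases the ordering among scores above the cap, so it is probably best to rank by the raw score and clip only the displayed value.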

In [18]:
# Generate top 10 user recommendations for each book
bookRecs = model.recommendForAllItems(10)
bookRecs.show(truncate=False)

+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|book_id|recommendations                                                                                                                                                                                      |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1580   |[[40668, 5.8876557], [32085, 5.607617], [41174, 5.5197353], [8211, 5.511538], [17265, 5.494228], [38076, 5.4850326], [39052, 5.422405], [40526, 5.3720927], [42544, 5.335964], [20411, 5.31097]]     |
|4900   |[[43675, 6.451187], [46765, 6.3850846], [29476, 6.2075367], [29996, 6.1975913], [30178, 6.1287475], [48687, 6.115594], [44668, 6.098499], [21562, 6.0251055], [

In [19]:
# Generate top 10 book recommendations for a specified set of users
users = ratings_df.select(als.getUserCol()).distinct().limit(3)
userSubsetRecs = model.recommendForUserSubset(users, 10)
userSubsetRecs.show(truncate=False)

+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|user_id|recommendations                                                                                                                                                                          |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1580   |[[9918, 5.61707], [8703, 5.5557356], [8533, 5.38871], [7844, 5.2646084], [9036, 5.2417045], [9071, 5.1399784], [3836, 5.1153626], [7440, 5.1114545], [9061, 5.098318], [9114, 5.0831666]]|
|463    |[[9418, 6.085741], [8601, 6.002691], [9682, 5.900895], [9806, 5.781508], [9073, 5.7021027], [7245, 5.616987], [8455, 5.587755], [5978, 5.5631685], [8827, 5.556246], [5897, 5.544558]]   |
|1238   |[[9036, 5.9

In [20]:
# Generate top 10 user recommendations for a specified set of books
books = ratings_df.select(als.getItemCol()).distinct().limit(3)
bookSubsetRecs = model.recommendForItemSubset(books, 10)
bookSubsetRecs.show(truncate=False)

+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|book_id|recommendations                                                                                                                                                                                      |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|471    |[[44119, 5.759254], [40668, 5.691053], [20313, 5.6232214], [9530, 5.6173387], [29838, 5.541521], [46914, 5.5317917], [20796, 5.4254336], [22180, 5.3727074], [10353, 5.357664], [14157, 5.351783]]   |
|2142   |[[40668, 6.461544], [42544, 6.3050637], [45628, 6.199746], [34036, 6.1381516], [35709, 6.1267443], [40526, 6.1183352], [40026, 6.0962234], [42394, 6.0418777], 
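
Each row of these results holds a nested list of `[id, predicted rating]` pairs, which is awkward to join against other tables. The plain-Python sketch below shows the flattening step on the first three entries for book 471 from the output above; `flatten` is a hypothetical helper. On the real DataFrame, `pyspark.sql.functions.explode` applied to the `recommendations` column should give the same one-row-per-pair shape, though that is not run here.

```python
# One row shaped like the output above: (book_id, [(user_id, predicted rating), ...])
row = (471, [(44119, 5.759254), (40668, 5.691053), (20313, 5.6232214)])


def flatten(book_id, recommendations):
    """Turn one nested recommendations row into flat (book_id, user_id, score) records."""
    return [(book_id, user_id, score) for user_id, score in recommendations]


for record in flatten(*row):
    print(record)
```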

## 6. References

* https://spark.apache.org/docs/latest/ml-collaborative-filtering.html