# Assignment 04 - Recommendation System using Collaborative Filtering 
## 1. Preparation
### 1.1 Requirements

1. Apache Spark 2.4.0 Binary (https://spark.apache.org/downloads.html)
2. PySpark 2.4.2 (Apache Spark Python API)
3. Jupyter Notebook (https://jupyter.org/install)
4. Numpy 1.16.3

### 1.2 Dataset
* Dataset name: [Goodbooks-10k](https://github.com/zygmuntz/goodbooks-10k)
* Description: This dataset contains about six million ratings for the ten thousand most popular books (those with the most ratings). It includes several kinds of data:
    * Explicit ratings
    * Implicit feedback indicators (books marked to read)
    * Tabular data or metadata (book info)
    * Tags
* We only need the explicit ratings and the book metadata, so we load just two files: **ratings.csv** and **books.csv**.
    

## 2. Spark Initialization

In [1]:
# Import findspark to make pyspark importable as a regular library
import findspark
findspark.init()

In [2]:
# Import SparkSession
from pyspark.sql import SparkSession

In [3]:
# Create Spark Session
spark = SparkSession.builder.appName("Recommendation System").getOrCreate()

In [4]:
# Print the SparkSession object to confirm that it was created
print(spark)

<pyspark.sql.session.SparkSession object at 0x119567d30>


## 3. Load Dataset
### 3.1 Ratings

In [5]:
# Load the dataset
ratingsDF = spark.read.load("/Users/mocatfrio/Documents/dataset-bigdata/goodbooks-10k/ratings.csv", format="csv", sep=",", inferSchema="true", header="true")

# Show dataset
ratingsDF.show()

+-------+-------+------+
|user_id|book_id|rating|
+-------+-------+------+
|      1|    258|     5|
|      2|   4081|     4|
|      2|    260|     5|
|      2|   9296|     5|
|      2|   2318|     3|
|      2|     26|     4|
|      2|    315|     3|
|      2|     33|     4|
|      2|    301|     5|
|      2|   2686|     5|
|      2|   3753|     5|
|      2|   8519|     5|
|      4|     70|     4|
|      4|    264|     3|
|      4|    388|     4|
|      4|     18|     5|
|      4|     27|     5|
|      4|     21|     5|
|      4|      2|     5|
|      4|     23|     5|
+-------+-------+------+
only showing top 20 rows



In [6]:
ratingsDF.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- book_id: integer (nullable = true)
 |-- rating: integer (nullable = true)



In [7]:
ratingsDF.count()

5976479

### 3.2 Book Metadata

In [8]:
# Load the dataset
booksDF = spark.read.load("/Users/mocatfrio/Documents/dataset-bigdata/goodbooks-10k/books.csv", format="csv", sep=",", inferSchema="true", header="true")

# Show dataset
booksDF.show(truncate=False)

+-------+-----------------+------------+--------+-----------+----------+----------------+------------------------------------------------------+-------------------------+-------------------------------------------------------------+-----------------------------------------------------------+-------------+--------------+-------------+------------------+-----------------------+---------+---------+---------+---------+---------+-----------------------------------------------------------+-----------------------------------------------------------+
|book_id|goodreads_book_id|best_book_id|work_id |books_count|isbn      |isbn13          |authors                                               |original_publication_year|original_title                                               |title                                                      |language_code|average_rating|ratings_count|work_ratings_count|work_text_reviews_count|ratings_1|ratings_2|ratings_3|ratings_4|ratings_5|image_url              

In [None]:
booksDF.printSchema()

root
 |-- book_id: integer (nullable = true)
 |-- goodreads_book_id: integer (nullable = true)
 |-- best_book_id: integer (nullable = true)
 |-- work_id: integer (nullable = true)
 |-- books_count: integer (nullable = true)
 |-- isbn: string (nullable = true)
 |-- isbn13: double (nullable = true)
 |-- authors: string (nullable = true)
 |-- original_publication_year: double (nullable = true)
 |-- original_title: string (nullable = true)
 |-- title: string (nullable = true)
 |-- language_code: string (nullable = true)
 |-- average_rating: string (nullable = true)
 |-- ratings_count: string (nullable = true)
 |-- work_ratings_count: string (nullable = true)
 |-- work_text_reviews_count: string (nullable = true)
 |-- ratings_1: double (nullable = true)
 |-- ratings_2: integer (nullable = true)
 |-- ratings_3: integer (nullable = true)
 |-- ratings_4: integer (nullable = true)
 |-- ratings_5: integer (nullable = true)
 |-- image_url: string (nullable = true)
 |-- small_image_url: string (nullable = true)

In [None]:
booksDF.count()

10000

## 4. Collaborative Filtering using ALS (Alternating Least Squares) Algorithm
* **Collaborative filtering** is commonly used for recommender systems. These techniques aim to fill in the missing entries of a user-item association matrix. `spark.ml` supports model-based collaborative filtering, in which users and items are described by a small set of latent factors that can be used to predict the missing entries, and it uses the alternating least squares (ALS) algorithm to learn these latent factors. We can then evaluate the recommendation model by measuring the root-mean-square error (RMSE) of its rating predictions.
* Spark's ALS implementation expects each row to contain a user, an item (here, a book), and a rating. `ratingsDF` already has exactly these columns (`user_id`, `book_id`, `rating`) as integers, so no preprocessing is needed.
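
To make the idea of "alternating" concrete, here is a minimal NumPy sketch of ALS on a tiny, made-up rating set. It is not Spark's implementation (which is distributed and, as far as the Spark documentation describes, scales the regularization differently). The function name `als_sketch`, its parameters, and the toy ratings are all assumptions chosen for illustration.

```python
import numpy as np

def als_sketch(ratings, n_users, n_items, rank=2, reg=0.1, n_iter=10, seed=0):
    """Toy ALS: learn user and item latent factors from (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_users, rank))  # user latent factors
    V = rng.normal(scale=0.1, size=(n_items, rank))  # item latent factors

    # Group the observed ratings by user and by item
    by_user, by_item = {}, {}
    for u, i, r in ratings:
        by_user.setdefault(u, []).append((i, r))
        by_item.setdefault(i, []).append((u, r))

    def solve(fixed, observed):
        # Ridge regression restricted to the observed entries of one row/column
        idx = [j for j, _ in observed]
        y = np.array([r for _, r in observed], dtype=float)
        A = fixed[idx]
        return np.linalg.solve(A.T @ A + reg * np.eye(rank), A.T @ y)

    for _ in range(n_iter):
        for u, obs in by_user.items():   # fix V, solve for each user vector
            U[u] = solve(V, obs)
        for i, obs in by_item.items():   # fix U, solve for each item vector
            V[i] = solve(U, obs)
    return U, V

# Hypothetical toy data: users 0-1 like books 0-1, users 2-3 like books 2-3.
ratings = [
    (0, 0, 5), (0, 1, 5), (0, 2, 1),
    (1, 0, 5), (1, 1, 5), (1, 3, 1),
    (2, 0, 1), (2, 2, 5), (2, 3, 5),
    (3, 1, 1), (3, 2, 5), (3, 3, 5),
]
U, V = als_sketch(ratings, n_users=4, n_items=4)

# Error on the observed ratings, and a prediction for a missing entry
train_rmse = float(np.sqrt(np.mean([(r - U[u] @ V[i]) ** 2 for u, i, r in ratings])))
missing_prediction = float(U[0] @ V[3])  # user 0 never rated book 3
print(train_rmse, missing_prediction)
```

Each half-step is an ordinary regularized least-squares problem because one factor matrix is held fixed, which is what makes the alternating scheme tractable.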

In [None]:
# Import the ALS estimator and the evaluator used to score its predictions
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

In [None]:
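# Randomly split the ratings into training (80%) and test (20%) sets
# (pass seed=<int> to randomSplit to make the split reproducible)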
(training, test) = ratingsDF.randomSplit([0.8, 0.2])

In [None]:
# Build the recommendation model using ALS on the training data
# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
als = ALS(maxIter=5, regParam=0.01, userCol="user_id", itemCol="book_id", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(training)
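
The reason for `coldStartStrategy="drop"` can be illustrated without Spark. When a user or book in the test split never appeared in the training split, the model has no factors for it and the prediction is NaN, and a single NaN makes the whole RMSE NaN. The sketch below uses made-up (actual, predicted) pairs; the `rmse` helper and the numbers are assumptions for illustration only.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (actual, predicted) pairs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

# Hypothetical predictions: the last row stands for a user that was absent
# from the training split, so its prediction is NaN.
predictions = [(5, 4.6), (3, 3.4), (4, float("nan"))]

# One NaN prediction poisons the metric
rmse_with_nan = rmse(predictions)

# Dropping NaN rows first (the effect of the 'drop' strategy) gives a usable value
kept = [(a, p) for a, p in predictions if not math.isnan(p)]
rmse_dropped = rmse(kept)

print(rmse_with_nan, rmse_dropped)
```

Note that dropping rows means the metric is computed only over users and books the model has seen, so it says nothing about true cold-start quality.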

In [None]:
# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 0.842758951882525
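
For reference, the metric reported above follows the standard RMSE definition. In the notation assumed here, $T$ is the set of (user, book) pairs in the test split that survive the cold-start drop, $r_{ui}$ is the actual rating, and $\hat{r}_{ui}$ is the predicted rating:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} \left( r_{ui} - \hat{r}_{ui} \right)^2 }
$$

Because the ratings lie on a 1 to 5 scale, a value of about 0.84 means the predictions are typically off by a little less than one star. The exact figure will vary between runs unless the train/test split is seeded.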


In [None]:
# Generate top 10 book recommendations for each user
userRecs = model.recommendForAllUsers(10)
userRecs.show(truncate=False)
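
Conceptually, a top-N list for one user comes from scoring every book with the dot product of the user's latent vector and the book's latent vector, then keeping the highest scores. The following plain-Python sketch shows that ranking step only; the function `top_k_items`, the book IDs, and the factor values are hypothetical and are not taken from the fitted model.

```python
def top_k_items(user_vec, item_factors, k):
    """Return the k (item_id, score) pairs with the highest predicted score."""
    scores = {
        item_id: sum(u * v for u, v in zip(user_vec, vec))
        for item_id, vec in item_factors.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

# Hypothetical 2-dimensional latent factors for four books and one user
item_factors = {10: [1.0, 0.0], 20: [0.0, 1.0], 30: [1.0, 1.0], 40: [0.5, 0.2]}
user_vec = [2.0, 1.0]

recs = top_k_items(user_vec, item_factors, k=2)
print(recs)
```

Spark performs the same kind of scoring for every user at once, which is why `recommendForAllUsers` can be expensive on a dataset of this size.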

In [None]:
# Generate top 10 user recommendations for each book
bookRecs = model.recommendForAllItems(10)
bookRecs.show(truncate=False)

In [None]:
# Generate top 10 book recommendations for a specified set of users
users = ratingsDF.select(als.getUserCol()).distinct().limit(3)
userSubsetRecs = model.recommendForUserSubset(users, 10)
userSubsetRecs.show(truncate=False)

In [None]:
# Generate top 10 user recommendations for a specified set of books
books = ratingsDF.select(als.getItemCol()).distinct().limit(3)
bookSubsetRecs = model.recommendForItemSubset(books, 10)
bookSubsetRecs.show(truncate=False)

## 5. References

* https://spark.apache.org/docs/latest/ml-collaborative-filtering.html