<a href="https://colab.research.google.com/github/pxuanbach/recommendation-system/blob/main/Demo_model_based_approach.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Demo model-based approach

### 1. Install and import the required packages

In [None]:
!pip install pyspark

In [None]:
import pandas as pd
from pyspark.sql.functions import col, explode
from pyspark import SparkContext

Create a Spark session

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Recommendations Demo').getOrCreate()

### 2. Load the data

In [8]:
movies = spark.read.csv('movies.csv', header=True)
ratings = spark.read.csv('ratings.csv', header=True)

In [9]:
ratings.show()

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
|     1|     70|   3.0|964982400|
|     1|    101|   5.0|964980868|
|     1|    110|   4.0|964982176|
|     1|    151|   5.0|964984041|
|     1|    157|   5.0|964984100|
|     1|    163|   5.0|964983650|
|     1|    216|   5.0|964981208|
|     1|    223|   3.0|964980985|
|     1|    231|   5.0|964981179|
|     1|    235|   4.0|964980908|
|     1|    260|   5.0|964981680|
|     1|    296|   3.0|964982967|
|     1|    316|   3.0|964982310|
|     1|    333|   5.0|964981179|
|     1|    349|   4.0|964982563|
+------+-------+------+---------+
only showing top 20 rows



Adjust the schema

In [10]:
ratings = ratings.\
    withColumn('userId', col('userId').cast('integer')).\
    withColumn('movieId', col('movieId').cast('integer')).\
    withColumn('rating', col('rating').cast('float')).\
    drop('timestamp')
ratings.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   4.0|
|     1|      3|   4.0|
|     1|      6|   4.0|
|     1|     47|   5.0|
|     1|     50|   5.0|
|     1|     70|   3.0|
|     1|    101|   5.0|
|     1|    110|   4.0|
|     1|    151|   5.0|
|     1|    157|   5.0|
|     1|    163|   5.0|
|     1|    216|   5.0|
|     1|    223|   3.0|
|     1|    231|   5.0|
|     1|    235|   4.0|
|     1|    260|   5.0|
|     1|    296|   3.0|
|     1|    316|   3.0|
|     1|    333|   5.0|
|     1|    349|   4.0|
+------+-------+------+
only showing top 20 rows
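
The cast matters because `spark.read.csv` with `header=True` and no `inferSchema=True` loads every column as a string, while ALS expects numeric user, item, and rating columns. The plain-Python sketch below (no Spark needed; the sample IDs are taken from the table above) illustrates one visible symptom of leaving IDs as text: they sort lexicographically rather than numerically.

```python
# movieId values as they would arrive from an untyped CSV read (strings)
movie_ids_as_text = ["1", "3", "47", "101", "6"]

# Strings are compared character by character, so "101" sorts before "3"
print(sorted(movie_ids_as_text))

# After casting to int, the order is numeric
movie_ids = [int(m) for m in movie_ids_as_text]
print(sorted(movie_ids))
```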



### 3. Compute the sparsity

In [12]:
# Count the total number of ratings in the dataset
numerator = ratings.select("rating").count()
print("Total number of ratings", numerator)

# Count the number of distinct userIds and distinct movieIds
num_users = ratings.select("userId").distinct().count()
num_movies = ratings.select("movieId").distinct().count()
print("Num users", num_users, "Num movies", num_movies)

# Set the denominator equal to the number of users multiplied by the number of movies
denominator = num_users * num_movies

# Divide the numerator by the denominator
sparsity = (1.0 - (numerator *1.0)/denominator)*100
print("The ratings dataframe is ", "%.2f" % sparsity + "% empty.")

Total number of ratings 40022
Num users 274 Num movies 6222
The ratings dataframe is  97.65% empty.
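
As a quick sanity check, the sparsity figure can be reproduced by hand from the three counts printed above. This is a minimal sketch in plain Python; the variable names are illustrative and only the counts come from the output.

```python
# Counts copied from the cell output above
num_ratings = 40022
num_users = 274
num_movies = 6222

# Every (user, movie) pair is one cell of the user-item matrix
total_cells = num_users * num_movies

# Sparsity = share of cells that hold no rating, as a percentage
sparsity = (1.0 - num_ratings / total_cells) * 100
print(f"{total_cells} cells, {sparsity:.2f}% empty")
```

In other words, only about 2.35% of the possible user-movie pairs carry a rating, which is the gap a matrix-factorization model such as ALS tries to fill.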


### 4. Group the data by counting the number of ratings

Group by user

In [13]:
# Group data by userId, count ratings
userId_ratings = ratings.groupBy("userId").count().orderBy('count', ascending=False)
userId_ratings.show()

+------+-----+
|userId|count|
+------+-----+
|    68| 1260|
|   249| 1046|
|   182|  977|
|   177|  904|
|   232|  862|
|   274|  793|
|   105|  722|
|    19|  703|
|   111|  646|
|   217|  613|
|   140|  608|
|    91|  575|
|    28|  570|
|   219|  528|
|    89|  518|
|    64|  517|
|   226|  507|
|    18|  502|
|    57|  476|
|    21|  443|
+------+-----+
only showing top 20 rows



Group by movie

In [14]:
# Group data by movieId, count ratings
movieId_ratings = ratings.groupBy("movieId").count().orderBy('count', ascending=False)
movieId_ratings.show()

+-------+-----+
|movieId|count|
+-------+-----+
|    296|  144|
|    356|  144|
|    318|  140|
|   2571|  129|
|    593|  125|
|    260|  113|
|    110|  104|
|    480|  103|
|   1196|  102|
|    589|  100|
|      1|   99|
|   1210|   96|
|   1198|   96|
|    780|   95|
|     47|   95|
|    150|   94|
|   2858|   90|
|    527|   90|
|    592|   89|
|   2028|   89|
+-------+-----+
only showing top 20 rows
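
For readers less familiar with Spark, the two `groupBy(...).count().orderBy('count', ascending=False)` calls behave roughly like a frequency count sorted in descending order. The sketch below mimics that with `collections.Counter`; the rows are invented toy data, not values from the dataset.

```python
from collections import Counter

# Toy (userId, movieId) pairs, invented for illustration only
rows = [(1, 296), (1, 356), (2, 296), (3, 296), (3, 356), (3, 318)]

ratings_per_user = Counter(user for user, _ in rows)
ratings_per_movie = Counter(movie for _, movie in rows)

# most_common() returns (key, count) pairs sorted by count, highest first
print(ratings_per_user.most_common())
print(ratings_per_movie.most_common())
```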



### 5. Build the ALS model

In [19]:
from pyspark.ml.recommendation import ALS

In [20]:
# Create test and train set
(train, test) = ratings.randomSplit([0.8, 0.2], seed = 1234)

# Create ALS model
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", nonnegative = True, implicitPrefs = False, coldStartStrategy="drop")

# Confirm that a model called "als" was created
type(als)

pyspark.ml.recommendation.ALS
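
The `coldStartStrategy="drop"` argument matters for the evaluation later on. When a user or movie in the test split never appeared in the training split, ALS has no factors for it and, with the default strategy (`"nan"`), produces a NaN prediction. The plain-Python sketch below uses made-up numbers to show why such rows have to be dropped before computing an error metric.

```python
import math

# (rating, prediction) pairs; the NaN stands in for a cold-start row (made-up data)
pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, math.nan)]

def mean_squared_error(rows):
    return sum((rating - pred) ** 2 for rating, pred in rows) / len(rows)

# A single NaN prediction turns the whole metric into NaN
mse_with_nan = mean_squared_error(pairs)

# Dropping rows with NaN predictions, as "drop" does, gives a usable number
kept = [(rating, pred) for rating, pred in pairs if not math.isnan(pred)]
mse_after_drop = mean_squared_error(kept)

print(mse_with_nan, mse_after_drop)
```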

Tune the hyperparameters of the ALS model:
- rank: the rank of the factorization, i.e. the number of latent factors.
- regParam: the regularization parameter (>= 0).
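
To make the two hyperparameters concrete, here is a minimal NumPy sketch of alternating least squares on a tiny made-up rating matrix. It is not Spark's implementation: it ignores the `nonnegative` constraint and uses a plain ridge penalty, whereas Spark scales the regularization by the number of ratings per user or item. The function name `als` and all data are illustrative assumptions. `rank` sets the width of the factor matrices and `reg_param` is the ridge term added to each least-squares solve.

```python
import numpy as np

def als(R, mask, rank, reg_param, max_iter=10, seed=0):
    """Alternate ridge-regression solves for user factors U and item factors V."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.random((n_users, rank))
    V = rng.random((n_items, rank))
    reg = reg_param * np.eye(rank)
    for _ in range(max_iter):
        # Fix V and solve a small regularized least-squares problem per user
        for u in range(n_users):
            Vu = V[mask[u]]  # factors of the items this user rated
            U[u] = np.linalg.solve(Vu.T @ Vu + reg, Vu.T @ R[u, mask[u]])
        # Fix U and do the same per item
        for i in range(n_items):
            Ui = U[mask[:, i]]  # factors of the users who rated this item
            V[i] = np.linalg.solve(Ui.T @ Ui + reg, Ui.T @ R[mask[:, i], i])
    return U, V

# Toy 4x4 rating matrix (made-up values); 0 marks a missing rating
R = np.array([[5, 4, 0, 1],
              [4, 0, 1, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0

U, V = als(R, mask, rank=2, reg_param=0.1)
pred = U @ V.T  # predicted ratings for every (user, item) cell
train_rmse = float(np.sqrt(np.mean((R[mask] - pred[mask]) ** 2)))
print("training RMSE:", round(train_rmse, 3))
```

A larger `rank` lets the model fit the observed ratings more closely, while a larger `reg_param` shrinks the factors and trades training fit for smoother predictions, which is why both are tuned together.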

In [28]:
# Import the requisite items
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Add hyperparameters and their respective values to param_grid
param_grid = ParamGridBuilder() \
            .addGrid(als.rank, [10, 50, 100]) \
            .addGrid(als.regParam, [.01, .05, .1]) \
            .build()

Then create a RegressionEvaluator

In [29]:
# Define an RMSE evaluator and print the number of parameter combinations in the grid
evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol="rating",
    predictionCol="prediction")
print("Num models to be tested: ", len(param_grid))

Num models to be tested:  9


In [30]:
# Build cross validation using CrossValidator
cv = CrossValidator(estimator=als, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)
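
It helps to estimate the training cost before fitting. The sketch below enumerates the same grid values in plain Python and counts how many ALS fits the 5-fold cross-validation implies; the variable names are illustrative.

```python
from itertools import product

# Same values as the parameter grid and numFolds used above
ranks = [10, 50, 100]
reg_params = [0.01, 0.05, 0.1]
num_folds = 5

# One candidate model per (rank, regParam) pair
candidates = list(product(ranks, reg_params))

# Each candidate is trained once per fold
fits_during_cv = len(candidates) * num_folds
print(len(candidates), "candidates,", fits_during_cv, "fits during cross-validation")
```

`CrossValidator` then refits the best parameter combination once more on the full training set, so the `fit` call in the next section can take a while, especially at rank 100.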

### 6. Inspect the parameters selected for the best model

In [31]:
# Fit the cross validator to the 'train' dataset
model = cv.fit(train)

# Extract the best model from the fitted cross validator
best_model = model.bestModel

In [33]:
print("**Best Model**")
# Print "Rank"
print("  Rank:", best_model._java_obj.parent().getRank())
# Print "MaxIter"
print("  MaxIter:", best_model._java_obj.parent().getMaxIter())
# Print "RegParam"
print("  RegParam:", best_model._java_obj.parent().getRegParam())

**Best Model**
  Rank: 100
  MaxIter: 10
  RegParam: 0.1


Look at the predictions

In [32]:
# Predict on the held-out test set and compute its RMSE
test_predictions = best_model.transform(test)
RMSE = evaluator.evaluate(test_predictions)
print(RMSE)

0.9514086493223866


In [34]:
test_predictions.show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   108|   1959|   5.0| 3.6098402|
|    27|   1580|   3.0| 3.5642745|
|    91|   2122|   4.0| 2.8503916|
|   157|   3175|   2.0| 3.1136878|
|   232|   1580|   3.5|  3.662118|
|   232|  44022|   3.0| 2.9948294|
|   246|   1645|   4.0|  3.458491|
|   111|   1088|   3.0| 3.0659535|
|   111|   3175|   3.5| 2.6549397|
|    47|   1580|   1.5|   2.59041|
|   140|   1580|   3.0| 3.6025956|
|   177|   1088|   3.5| 3.4895437|
|   177|   3175|   2.0|  3.340192|
|   177|  54190|   3.0| 3.4577494|
|   274|   1580|   3.0| 3.4999018|
|   182|   1645|   4.5|  2.764576|
|   218|    471|   4.0| 2.5128827|
|   164|   1580|   5.0| 4.1676044|
|    57|   1580|   4.0| 3.3473587|
|    48|   1580|   5.0| 3.6713548|
+------+-------+------+----------+
only showing top 20 rows
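
RMSE is the square root of the mean squared difference between `rating` and `prediction`. The sketch below recomputes it by hand on the first three rows of the table above, just to show the mechanics. It will not match the 0.95 reported for the full test set, since it covers only three rows.

```python
import math

# (rating, prediction) copied from the first three rows of the table above
rows = [(5.0, 3.6098402), (3.0, 3.5642745), (4.0, 2.8503916)]

# Square each error, average, then take the square root
squared_errors = [(rating - prediction) ** 2 for rating, prediction in rows]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(f"RMSE over {len(rows)} rows: {rmse:.4f}")
```

Because errors are squared before averaging, a few large misses (such as predicting 3.6 for a 5.0 rating) weigh more heavily than many small ones.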



### 7. Generate recommendations

In [35]:
# Generate the top 10 recommendations for every user
nrecommendations = best_model.recommendForAllUsers(10)
nrecommendations.limit(10).show()

+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|     1|[{1658, 5.610651}...|
|     2|[{131724, 4.90323...|
|     3|[{70946, 4.930261...|
|     4|[{3365, 4.9455643...|
|     5|[{290, 4.61122}, ...|
|     6|[{86, 4.7455153},...|
|     7|[{8908, 4.4055104...|
|     8|[{1250, 4.50605},...|
|     9|[{1250, 4.8904414...|
|    10|[{71579, 4.858259...|
+------+--------------------+



In [36]:
nrecommendations = nrecommendations\
    .withColumn("rec_exp", explode("recommendations"))\
    .select('userId', col("rec_exp.movieId"), col("rec_exp.rating"))

nrecommendations.limit(10).show()

+------+-------+---------+
|userId|movieId|   rating|
+------+-------+---------+
|     1|   1658| 5.610651|
|     1|   1262| 5.216543|
|     1|   1250|5.2105474|
|     1|   7842| 5.200619|
|     1|  58559|5.1815085|
|     1|  91529| 5.161586|
|     1|   1066| 5.151671|
|     1|  49272| 5.148035|
|     1|  55118| 5.133019|
|     1|   5690| 5.125258|
+------+-------+---------+
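
The `explode` step turns each user's array of `{movieId, rating}` structs into one row per recommended movie. A rough plain-Python equivalent is sketched below. Only user 1's first two entries are copied from the output above, and the dictionary layout is an illustrative assumption, not Spark's internal representation.

```python
# One row per user, holding a list of (movieId, rating) pairs
nested = [
    {"userId": 1, "recommendations": [(1658, 5.610651), (1262, 5.216543)]},
]

# "Explode": emit one flat row per element of the list column
flat = [
    (row["userId"], movie_id, rating)
    for row in nested
    for movie_id, rating in row["recommendations"]
]
print(flat)
```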



### 8. Do the recommendations make sense?

In [37]:
nrecommendations.join(movies, on='movieId').filter('userId = 100').show()

+-------+------+---------+--------------------+--------------------+
|movieId|userId|   rating|               title|              genres|
+-------+------+---------+--------------------+--------------------+
|   1958|   100|4.7395883|Terms of Endearme...|        Comedy|Drama|
|   1658|   100| 4.738078|Life Less Ordinar...|    Romance|Thriller|
|  58559|   100|4.6711817|Dark Knight, The ...|Action|Crime|Dram...|
|   4041|   100|4.6689363|Officer and a Gen...|       Drama|Romance|
|   1250|   100|4.6532607|Bridge on the Riv...| Adventure|Drama|War|
| 104374|   100| 4.631496|   About Time (2013)|Drama|Fantasy|Rom...|
|   2423|   100| 4.603807|Christmas Vacatio...|              Comedy|
|   1096|   100| 4.590178|Sophie's Choice (...|               Drama|
|   1066|   100| 4.579955|Shall We Dance (1...|Comedy|Musical|Ro...|
|   1284|   100|  4.56247|Big Sleep, The (1...|Crime|Film-Noir|M...|
+-------+------+---------+--------------------+--------------------+
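
One thing to check when judging these results: `recommendForAllUsers` scores every movie, including ones the user has already rated. The sketch below compares the recommended IDs above with the movies user 100 rated 5.0 (taken from the ratings table in the next cell). The variable names are illustrative.

```python
# movieIds recommended to user 100 (from the table above)
recommended = [1958, 1658, 58559, 4041, 1250, 104374, 2423, 1096, 1066, 1284]

# movieIds user 100 rated 5.0 (from the ratings table in the next cell)
rated_five = {2423, 1101, 4041, 1958, 5620}

# Recommendations the user has already rated 5.0, and the remainder
already_top_rated = [m for m in recommended if m in rated_five]
not_in_top_rated = [m for m in recommended if m not in rated_five]

print("already rated 5.0:", already_top_rated)
print("other recommendations:", not_in_top_rated)
```

The overlap suggests the model has picked up this user's taste, but the overlapping titles are not useful as new suggestions. In PySpark, one possible way to remove them would be a `left_anti` join of the recommendations against `ratings` on `userId` and `movieId`; that step is not part of the original notebook.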



In [39]:
ratings.join(movies, on='movieId').filter('userId = 100').sort('rating', ascending=False).show()

+-------+------+------+--------------------+--------------------+
|movieId|userId|rating|               title|              genres|
+-------+------+------+--------------------+--------------------+
|   2423|   100|   5.0|Christmas Vacatio...|              Comedy|
|   1101|   100|   5.0|      Top Gun (1986)|      Action|Romance|
|   4041|   100|   5.0|Officer and a Gen...|       Drama|Romance|
|   1958|   100|   5.0|Terms of Endearme...|        Comedy|Drama|
|   5620|   100|   5.0|Sweet Home Alabam...|      Comedy|Romance|
|    919|   100|   4.5|Wizard of Oz, The...|Adventure|Childre...|
|    934|   100|   4.5|Father of the Bri...|              Comedy|
|     28|   100|   4.5|   Persuasion (1995)|       Drama|Romance|
|     95|   100|   4.5| Broken Arrow (1996)|Action|Adventure|...|
|   1028|   100|   4.5| Mary Poppins (1964)|Children|Comedy|F...|
|   1091|   100|   4.5|Weekend at Bernie...|              Comedy|
|     16|   100|   4.5|       Casino (1995)|         Crime|Drama|
|   1246| 