# Recommender by ALS

With collaborative filtering, we make predictions (filtering) about the interests of a user by collecting preference or taste information from many users (collaborating). The underlying assumption is that if user A has the same opinion as user B on one issue, A is more likely to share B's opinion on a different issue x than the opinion of a randomly chosen user.

First, people rate different items (such as videos, images, or games). Then the system predicts a user's rating for an item that the user has not rated yet. The predictions are built from the existing ratings of other users whose ratings are similar to those of the active user.
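
The neighborhood idea described above can be sketched in a few lines of plain Python. This is only an illustration and not part of the Spark pipeline in this notebook: the toy `ratings` dictionary, the user names, and the helpers `cosine` and `predict` are all hypothetical, and cosine similarity is just one common choice of similarity measure.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two users' rating dicts (item -> rating)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    num = sum(a[i] * b[i] for i in common)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den

def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        w = cosine(ratings[user], their)
        num += w * their[item]
        den += abs(w)
    return num / den if den else None

# Hypothetical toy data: A agrees with B on x and y, and disagrees with C.
ratings = {
    "A": {"x": 5, "y": 1},
    "B": {"x": 5, "y": 1, "z": 4},
    "C": {"x": 1, "y": 5, "z": 2},
}
print(predict(ratings, "A", "z"))  # lands closer to B's rating of z than to C's
```

Because A's ratings resemble B's more than C's, B's opinion of `z` gets the larger weight in the prediction.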

Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. Matrix factorization algorithms work by decomposing the user-item interaction matrix into the product of two rectangular matrices of lower dimensionality.
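
In symbols, the decomposition can be written as follows. The notation is introduced here for illustration and is not taken from the notebook: $m$ users, $n$ items, and $k$ latent factors.

```latex
R \approx U V^{\top}, \qquad
R \in \mathbb{R}^{m \times n},\;
U \in \mathbb{R}^{m \times k},\;
V \in \mathbb{R}^{n \times k},\;
k \ll \min(m, n)

\hat{r}_{ui} = \mathbf{u}_u^{\top} \mathbf{v}_i
```

Each row $\mathbf{u}_u$ of $U$ is one user's latent-factor vector, each row $\mathbf{v}_i$ of $V$ is one item's, and the predicted rating $\hat{r}_{ui}$ is their dot product.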

Alternating least squares (ALS) matrix factorization: the idea is to take a large (potentially huge) matrix and factor it into a smaller representation. ALS alternates between two least-squares problems: it holds the item factors fixed and solves for the user factors, then holds the user factors fixed and solves for the item factors. We end up with two lower-dimensional matrices whose product approximates the original matrix. ALS is built into Apache Spark.
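
To make the alternation concrete, here is a minimal single-machine sketch in plain Python. It is not Spark's implementation (which is distributed and differs in details such as how regularization is scaled); the function names, the toy ratings, and the plain L2 penalty `reg` are assumptions made for this illustration.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def update(fixed, observed, rank, reg):
    """Regularized least-squares solution for one factor vector.

    fixed holds the factor vectors of the other side;
    observed is a list of (index into fixed, rating) pairs.
    """
    A = [[reg if i == j else 0.0 for j in range(rank)] for i in range(rank)]
    b = [0.0] * rank
    for idx, rating in observed:
        f = fixed[idx]
        for i in range(rank):
            b[i] += rating * f[i]
            for j in range(rank):
                A[i][j] += f[i] * f[j]
    return solve(A, b)

def objective(ratings, U, V, reg):
    """Squared error on observed ratings plus an L2 penalty on all factors."""
    err = sum((r - sum(a * b for a, b in zip(U[u], V[i]))) ** 2 for u, i, r in ratings)
    penalty = sum(x * x for row in U + V for x in row)
    return err + reg * penalty

def als(ratings, n_users, n_items, rank=2, reg=0.1, iterations=10, seed=0):
    rng = random.Random(seed)
    U = [[0.0] * rank for _ in range(n_users)]
    V = [[rng.random() for _ in range(rank)] for _ in range(n_items)]
    by_user = [[] for _ in range(n_users)]
    by_item = [[] for _ in range(n_items)]
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    losses = []
    for _ in range(iterations):
        # Fix the item factors and solve for every user, then swap roles.
        U = [update(V, by_user[u], rank, reg) for u in range(n_users)]
        V = [update(U, by_item[i], rank, reg) for i in range(n_items)]
        losses.append(objective(ratings, U, V, reg))
    return U, V, losses

# Hypothetical toy data: (user, item, rating) triples with integer IDs.
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
U, V, losses = als(toy, n_users=3, n_items=3)
print(losses)  # the regularized objective should not increase between iterations
```

Each half-step solves its least-squares problem exactly, which is why the objective cannot go up from one iteration to the next.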


In [24]:
sc

In [1]:
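# Import the os module itself: later cells call os.path.join
# and reuse the name "path" as a variable.
import os

ROOT_DIR = "./"
DATA_DIR = os.path.join(ROOT_DIR, 'data')
MODEL_DIR = os.path.join(ROOT_DIR, 'model')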

## 1. Data preprocessing

In [2]:
""" 
Read ratings data from file to dataframe
"""

path = os.path.join(DATA_DIR, 'rating20')
rating_data = spark.read.json(path)
rating_data = rating_data.dropDuplicates()
rating_data.show()

+----------+-------+--------------+
|      asin|overall|    reviewerID|
+----------+-------+--------------+
|0005092663|    5.0| ASDTPYYN9SMLU|
|0006486576|    2.0| A7OBFVHNJGI2A|
|076700941X|    5.0| AWEIALFUT5WRK|
|0767020294|    5.0| AZ8J3JYF838YT|
|0767809254|    5.0|A3SS5RJ7ZCSTO8|
|0767805534|    3.0|A3QZZG1V7QKRAC|
|0767809254|    5.0|A2IJLGHGXGC5RH|
|076780192X|    4.0|A2SDAMVSIKKUEI|
|0767819462|    5.0|A22FAP1Y5IRMVX|
|0767815335|    4.0|A3F0KRN9UINV0O|
|0767726227|    5.0|A2YNUGYZ05D4BW|
|0767830555|    3.0|A1OOYXSUI4Z943|
|0767837398|    4.0|A3POEMFB7467SV|
|0767817664|    5.0|A1HBEX2D49HWGD|
|0767834739|    5.0|A1PO5IWPI80U3N|
|0767834739|    4.0|A1RVI9K5DDB2JE|
|0767853636|    5.0|A2LVE70T3FZRTK|
|0767853636|    5.0|A3FZNRXVGHKBVT|
|0767853636|    5.0| A2MXJZMMJJK3D|
|0767836359|    5.0|A2BAJEVWC65S6K|
+----------+-------+--------------+
only showing top 20 rows



In [4]:
movie2int = rating_data.select('asin')\
    .distinct()\
    .rdd.zipWithIndex()\
    .map(lambda l: (l[0][0], l[1]))\
    .toDF().withColumnRenamed("_1","asin").withColumnRenamed("_2","m_id")

user2int = rating_data.select('reviewerID')\
    .distinct()\
    .rdd.zipWithIndex()\
    .map(lambda l: (l[0][0], l[1]))\
    .toDF().withColumnRenamed("_1","reviewerID").withColumnRenamed("_2","u_id")
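
MLlib's `Rating` takes integer user and product IDs (see the constructor comment in the ALS cell of this notebook), which is why the string `asin` and `reviewerID` values are mapped to integers here. The same idea in plain Python, as a small illustration: `build_index` is a hypothetical helper that the notebook does not define, and the sample list simply reuses a few `asin` values from the output above.

```python
def build_index(values):
    """Assign a contiguous integer ID, starting at 0, to each distinct value."""
    index = {}
    for v in values:
        if v not in index:
            index[v] = len(index)
    return index

asins = ["0767809254", "0767834739", "0767809254", "0767853636"]
movie_index = build_index(asins)                          # asin -> integer ID
reverse_index = {i: a for a, i in movie_index.items()}    # integer ID -> asin
print(movie_index)
```

The integer assigned to a given ID depends on the order in which the values are seen, so a mapping like this needs to be stored if the integers are to be translated back later.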

In [None]:
"""
Store the mappings from the original reviewerID and asin strings
    to their integer IDs for later use
    
"""

user_path = os.path.join(DATA_DIR, 'users')
movie_path = os.path.join(DATA_DIR,'movies')

user2int.write.json(user_path, mode='overwrite', compression='gzip')
movie2int.write.json(movie_path, mode='overwrite', compression='gzip')

In [41]:
# Read the user and movie mappings back if you have already saved them

# user_path = os.path.join(DATA_DIR, 'users')
# movie_path = os.path.join(DATA_DIR,'movies')

# user2int = spark.read.json(user_path).withColumnRenamed("_1","reviewerID").withColumnRenamed("_2","u_id")
# movie2int = spark.read.json(movie_path).withColumnRenamed("_1","asin").withColumnRenamed("_2","m_id")


In [5]:
data_withId = rating_data\
    .join(user2int, on=['reviewerID'], how='left')\
    .join(movie2int, on=['asin'], how='left')
    
data_withId.show()

+----------+--------------+-------+------+----+
|      asin|    reviewerID|overall|  u_id|m_id|
+----------+--------------+-------+------+----+
|0607987162|A2GHPO1D0GH52D|    5.0|742012| 426|
|0764009303|A13STKDYQO5DQ8|    5.0|131267| 450|
|0783218923| AFOJ3MR6F17YA|    5.0| 24526|  19|
|0783218923| A1S900P7YCWGR|    5.0| 60095|  19|
|0783218923|A3Q40Y178OJ49D|    5.0| 59816|  19|
|0783218923|A1CHGP3HLNOBSA|    5.0|116864|  19|
|0783218923|A1XX3I76SZQJSV|    3.0|198892|  19|
|0783218923| ANUYSD5I85AMM|    5.0|209142|  19|
|0783218923| A69WZA12K55D7|    4.0|298626|  19|
|0783218923|A2WTA9L7YOZTGZ|    3.0|322001|  19|
|0783218923|A3LDI2BHDJZDTX|    5.0|358538|  19|
|0783218923|A2KCTUV4SFZJK6|    3.0|393977|  19|
|0783218923|A1Z0Z8LL5ZETOD|    1.0|543834|  19|
|0783218923|A3UN7HNDZAINWJ|    5.0|576276|  19|
|0783218923|A3B8ZYWL77H6G0|    4.0|620661|  19|
|0783218923| AWO9XN8MM0HLG|    5.0|616682|  19|
|0783218923|A2DJYERRIH5C8T|    3.0|735018|  19|
|0783218923|A3GW7MV7UN4ZJ4|    5.0|76029

## 2. Building a Recommender using the matrix factorization method

In [8]:
"""

Split the dataset into a training set and a test set
    ` Training set: 80% of the records, chosen at random
    ` Test set: the remaining 20%
    
training: user, product (movie) and rating of each review in the training set
test: user, product (movie) and rating of each review in the test set

"""
from pyspark.mllib.recommendation import Rating
from pyspark.sql.functions import col

ratings = data_withId.select(col('u_id').alias('user'), col('m_id').alias('product'), col('overall').alias('rating'))

(training, test) = ratings.randomSplit([0.8, 0.2])
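
`randomSplit` assigns rows to the splits at random, so the 80/20 proportions are approximate rather than exact, and the split changes between runs unless a seed is supplied (`randomSplit` accepts an optional `seed` argument). The plain-Python sketch below illustrates that behavior on a list; `random_split` is a hypothetical helper written for this example and is not how Spark implements the operation internally.

```python
import random

def random_split(rows, train_fraction=0.8, seed=None):
    """Send each row to the training list with probability train_fraction."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

train, test = random_split(list(range(1000)), 0.8, seed=42)
print(len(train), len(test))  # roughly 800 and 200, not exactly
```

Passing the same seed reproduces the same split, which makes evaluation numbers comparable across runs.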

In [9]:
rating_path = os.path.join(DATA_DIR,'ratings_int')
ratings.write.json(rating_path, mode='overwrite', compression='gzip')

In [18]:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

In [10]:
"""
Recommendation model using ALS
    ` rank = 10
    ` number of iterations = 10
"""

# Rating constructor: Rating(int user, int product, double rating) 

rank = 10
numIterations = 10

# Access the columns by name so the mapping does not depend on column order
train_rdd = training.rdd.map(lambda r: Rating(r['user'], r['product'], r['rating']))
model = ALS.train(train_rdd, rank=rank, numIterations=numIterations)

In [11]:
"""
Save the model 
    ` Clean the directory before writing
"""

import os
import sys

MODEL_DIR = os.path.join(ROOT_DIR, 'model', 'als')

!rm -r {MODEL_DIR + "/*"}

try:
    model.save(sc, MODEL_DIR)
    print ("Save model successfully at ", MODEL_DIR)
except Exception as e:
    # Report the failure instead of silently discarding it
    print("Failed to save model:", e)

Save model successfully at  ./model/als


In [19]:
model = MatrixFactorizationModel.load(sc,os.path.join(ROOT_DIR, 'model', 'als'))

## 3. Model Evaluation 

In this part we evaluate the recommendation model on both the training dataset and the test dataset.
Root mean squared error (RMSE) is used as the error metric.


In [37]:
train_pred = model.predictAll(training.rdd.map(lambda r: (r[0], r[1]))).toDF()
train_label_pred = training.join(train_pred.select('user', 'product', col('rating').alias('pred')), on=['user','product'], how='left')

In [38]:
from pyspark.mllib.evaluation import RegressionMetrics

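# RegressionMetrics expects (prediction, observation) pairs.
# Rows without a prediction are dropped before computing the metrics.
metrics = RegressionMetrics(train_label_pred.dropna('any').select('pred', 'rating').rdd)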

rmse = metrics.rootMeanSquaredError

print("Root Mean Squared Error = " + str(rmse))

Root Mean Squared Error = 0.9251515741173353
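
For reference, the metric itself is simple to state in plain Python. The sketch below is an illustration only: `rmse` is a hypothetical helper, the sample pairs are made up, and skipping pairs whose prediction is `None` mirrors the `dropna` step above, which removes rows the model returned no prediction for.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (rating, prediction) pairs.

    Pairs without a prediction (None) are skipped.
    """
    scored = [(r, p) for r, p in pairs if p is not None]
    if not scored:
        raise ValueError("no pairs with a prediction")
    return sqrt(sum((r - p) ** 2 for r, p in scored) / len(scored))

# Two scored pairs with errors 1 and 2, plus one pair without a prediction.
print(rmse([(5.0, 4.0), (3.0, 5.0), (4.0, None)]))
```

When many rows are dropped, the reported RMSE describes only the pairs the model could score, so the count of dropped rows is worth checking alongside it.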


## 4. Apply model for recommendation

In [6]:
path = os.path.join(DATA_DIR, 'movie-meta')
data = spark.read.json(path)
data = data.select('asin', 'title').dropDuplicates()

# rat_path = os.path.join(DATA_DIR, 'ratings_int')
# data_withId = spark.read.json(rat_path)

In [37]:
data.show()

+----------+--------------------+
|      asin|               title|
+----------+--------------------+
|0310280923|You Teach Vol. 6:...|
|0780618211|Fire From the Sky...|
|0783116640|  Goodnight Moon VHS|
|0792845765|Robocop Collectio...|
|1560571284|In the Land of th...|
|1561270431|    Les Brigands VHS|
|1572521961|Becoming Colette VHS|
|1573628646|Ryder Cup '99 - B...|
|1580023045|San Jos&eacute; d...|
|1603998276|Power Rangers Sea...|
|1605290572|Look Better Naked...|
|1612616542|Come, Follow Me: ...|
|1934347795|        Toshin Iaido|
|1935127268|Dahn Yoga Essenti...|
|1936599015|    Darwin's Heretic|
|1940611008|Quilt It! 100 Series|
|3912900442|The Queen of Offi...|
|5555291744|Bonjour Tristesse...|
|6300150879|           Eleni VHS|
|6300166597|If It's a Man Han...|
+----------+--------------------+
only showing top 20 rows



In [23]:
ratings.first()

Row(product=110434, rating=5.0, user=1944708)

In [47]:
from pyspark.sql.functions import col

def id2int(user):
    """
    input: reviewerID
    return user's corresponding integer
    """
    return user2int.where(user2int.reviewerID==user).first()['u_id']

def int2asin(mint):
    """
    input: integer movie ID (m_id)
    return the movie's corresponding asin
    """
    return movie2int.where(movie2int.m_id==mint).first()['asin']

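def recommend(userId, num = 10):
    """
    input: reviewerID and the number of recommendations
    return the asins of the recommended movies, best first
    """
    userInt = id2int(userId)
    # recommendProducts returns Rating(user, product, rating) tuples, highest predicted rating first
    lst = model.recommendProducts(userInt, num)
    lst_asin = [int2asin(a[1]) for a in lst]
    # The titles are shown in DataFrame order, which isin() does not tie to the ranking
    rcm = data.where(col('asin').isin(lst_asin)).select('title')
    rcm.show(truncate = 0)
    return lst_asin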

def show_rated_info(user):
    has_rated_score = data_withId.where(col('reviewerID')==user).select('asin', 'overall')
    has_rated_movie = [a[0] for a in has_rated_score.select('asin').collect()]
    m_rated_info = data.select('asin', 'title').where(col('asin').isin(has_rated_movie))
    has_rated_score.join(m_rated_info, on=['asin']).show(truncate = 0)
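
Conceptually, a factorization model recommends by scoring every item with the dot product of the user's factor vector and the item's factor vector, then keeping the highest scores. The plain-Python sketch below illustrates that step under stated assumptions: `top_n`, the toy factor vectors, and the `exclude` parameter (for skipping items the user has already rated) are hypothetical additions for this example, not something the notebook's `recommend` function does.

```python
import heapq

def top_n(user_vec, item_vecs, n=10, exclude=()):
    """Return the n item IDs with the highest dot-product score for one user."""
    skip = set(exclude)
    scores = (
        (sum(u * v for u, v in zip(user_vec, vec)), item)
        for item, vec in item_vecs.items()
        if item not in skip
    )
    return [item for score, item in heapq.nlargest(n, scores)]

# Hypothetical item factors (item ID -> vector) and one user vector.
item_vecs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.5, 0.5]}
print(top_n([2.0, 1.0], item_vecs, n=2, exclude=[0]))  # → [2, 1]
```

Returning the IDs as an ordered list keeps the ranking intact, so any later lookup of titles can be presented in the same order.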

In [None]:
"""
Pick one user as an example; the model returns a list of suitable movies, ordered from the highest to the lowest predicted rating
"""

In [23]:
test_user = 'AWEIALFUT5WRK'

In [24]:
print('User %s has rated these movies: \n'%test_user)
show_rated_info(test_user)

User AWEIALFUT5WRK has rated these movies: 

+----------+-------+-------------------------------------------+
|asin      |overall|title                                      |
+----------+-------+-------------------------------------------+
|B00YSG2ZPA|5.0    |Band of Brothers(Elite SC/BD+DCExp12-21)   |
|076700941X|5.0    |Upstairs Downstairs - The Fourth Season VHS|
+----------+-------+-------------------------------------------+



In [48]:
print('So we have some recommendations for this user: \n')
recommend(test_user, 10)

So we have some recommendations for this user: 

+-------------------------------------------------------------------------------------+
|title                                                                                |
+-------------------------------------------------------------------------------------+
|COMBAT TACTICS: Decision making in weapon based martial arts                         |
|Hummingbird Magic                                                                    |
|Mama's Family Complete Seasons 1 &amp; 2 DVD Set                                     |
|Utu VHS                                                                              |
|The Youngest Guns                                                                    |
|Distant Drums                                                                        |
| Beneath The Surface                                                                 |
|The Rhythm of Vinyasa                                                        