# Recommender by ALS

With collaborative filtering, we make predictions (filtering) about the interests of a user by collecting preference or taste information from many users (collaborating). The underlying assumption is that if user A has the same opinion as user B on one issue, A is more likely to share B's opinion on a different issue x than to share the opinion of a randomly chosen user.

First, people rate different items (such as videos, images, or games). Then the system predicts a user's rating for an item that user has not rated yet. These predictions are built from the existing ratings of other users whose ratings are similar to those of the active user.
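The idea above can be sketched in plain Python as a similarity-weighted average. This is only an illustration of user-based collaborative filtering, not the method used later in this notebook; the toy `ratings` dictionary and the helpers `cosine_similarity` and `predict` are made up for the example.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}. Not taken from the dataset.
ratings = {
    "alice": {"video1": 5.0, "video2": 3.0, "game1": 4.0},
    "bob":   {"video1": 4.0, "video2": 2.0, "game1": 5.0, "game2": 4.0},
    "carol": {"video1": 1.0, "video2": 5.0, "game2": 2.0},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict(user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    weighted_sum, weight_total = 0.0, 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += sim
    return weighted_sum / weight_total if weight_total else None

# alice has not rated game2; bob rates like alice, so his rating weighs more.
prediction = predict("alice", "game2")
print(prediction)
```

Because alice's past ratings resemble bob's more than carol's, the prediction lands closer to bob's rating for `game2` than to carol's.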

Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. Matrix factorization algorithms work by decomposing the user-item interaction matrix into the product of two lower-dimensional rectangular matrices.
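In symbols, a sketch of the usual formulation, assuming $m$ users, $n$ items and $k$ latent factors (the symbols $R$, $U$, $V$ and $k$ are introduced here for illustration and do not appear elsewhere in this notebook):

$$
R \approx U V^{\top}, \qquad R \in \mathbb{R}^{m \times n},\; U \in \mathbb{R}^{m \times k},\; V \in \mathbb{R}^{n \times k},\; k \ll \min(m, n)
$$

$$
\hat{r}_{ui} = \mathbf{u}_u^{\top} \mathbf{v}_i = \sum_{f=1}^{k} U_{uf} V_{if}
$$

Here $\hat{r}_{ui}$ is the predicted rating of user $u$ for item $i$, obtained from row $u$ of $U$ and row $i$ of $V$. Storing $U$ and $V$ takes $(m + n)k$ numbers instead of $mn$.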

Alternating least squares (ALS) matrix factorization: the idea is to take a large (potentially huge) matrix and factor it into a smaller representation. We end up with two lower-dimensional matrices whose product approximates the original one. The factors are fitted by alternating: one matrix is held fixed while the other is solved as a least-squares problem, then the roles are swapped. ALS is built into Apache Spark.
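To make the alternation concrete, here is a minimal single-machine sketch in plain Python. It assumes a regularized squared-error objective over the observed ratings only, and it is not how Spark implements ALS (Spark distributes the work across blocks). The names `solve`, `update_factors`, `als`, `lam` and the toy triples are all hypothetical.

```python
import random

def solve(A, b):
    """Solve the small square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def update_factors(fixed, observed, k, lam):
    """Recompute one side's factor vectors while the other side stays fixed.

    observed[i] lists (j, rating) pairs for row i; fixed[j] is the factor
    vector of the opposite side.
    """
    updated = []
    for pairs in observed:
        # Normal equations: (sum_j f_j f_j^T + lam * I) x = sum_j r_ij f_j
        A = [[lam if p == q else 0.0 for q in range(k)] for p in range(k)]
        rhs = [0.0] * k
        for j, rating in pairs:
            f = fixed[j]
            for p in range(k):
                rhs[p] += rating * f[p]
                for q in range(k):
                    A[p][q] += f[p] * f[q]
        updated.append(solve(A, rhs))
    return updated

def als(triples, n_users, n_items, k=2, lam=0.1, iterations=10, seed=0):
    rng = random.Random(seed)
    U = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    V = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    by_user = [[] for _ in range(n_users)]
    by_item = [[] for _ in range(n_items)]
    for u, i, r in triples:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    for _ in range(iterations):
        U = update_factors(V, by_user, k, lam)  # hold V fixed, solve for U
        V = update_factors(U, by_item, k, lam)  # hold U fixed, solve for V
    return U, V

def rmse(triples, U, V):
    errors = [(r - sum(a * b for a, b in zip(U[u], V[i]))) ** 2 for u, i, r in triples]
    return (sum(errors) / len(errors)) ** 0.5

# Hypothetical toy data: (user index, item index, rating).
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
       (2, 1, 2.0), (2, 2, 5.0), (3, 0, 1.0), (3, 2, 4.0)]
U, V = als(toy, n_users=4, n_items=3)
print(rmse(toy, U, V))
```

Each half-step is an ordinary regularized least-squares problem with a closed-form solution, which is what makes the alternation attractive compared with optimizing both factor matrices at once.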


In [1]:
sc

In [2]:
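# Import the os module itself: later cells call os.path.join
# and reuse the name `path` as a variable.
import os

ROOT_DIR = "./"
DATA_DIR = os.path.join(ROOT_DIR, 'data')
MODEL_DIR = os.path.join(ROOT_DIR, 'model')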

In [3]:
"""
Utilities definition
"""

import hashlib

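class Utils:

    @staticmethod
    def hashToInt(s):
        '''
        Map a string ID to an integer below 10**8.
        The mapping is deterministic but not guaranteed to be unique:
        two distinct IDs can collide after the modulo.
        '''
        # Encode explicitly so that sha1 always receives bytes
        return int(hashlib.sha1(s.encode('utf-8')).hexdigest(), 16) % (10 ** 8)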

#Test
Utils.hashToInt('A141HP4LYPWMSR')

5460385L
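Because the hash is reduced modulo 10**8, two different IDs can end up with the same integer, and ALS would then treat them as one user or one product. A rough way to gauge the risk is the birthday-problem estimate below. It assumes the hashed values are uniformly spread over the 10**8 buckets; the function name and the ID counts are hypothetical and chosen only for illustration.

```python
def expected_colliding_pairs(n_ids, n_buckets=10 ** 8):
    """Birthday-problem estimate of the number of colliding pairs
    among n_ids values hashed uniformly into n_buckets."""
    return n_ids * (n_ids - 1) / (2.0 * n_buckets)

# Hypothetical numbers of distinct IDs, for illustration only.
for n in (10 ** 4, 10 ** 5, 10 ** 6):
    print(n, expected_colliding_pairs(n))
```

If the estimate is not negligible for the real number of distinct IDs, an explicit index built from the distinct ID list would avoid collisions altogether.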

## 1. Data preprocessing

In [4]:
""" 
Read ratings data from file to RDD
Split each line into 4 parts separated by the commas 
"""

path = os.path.join(DATA_DIR, 'ratings')
data = sc.textFile(path).map(lambda l: l.split(','))

In [5]:
# import pyspark

# data.persist(pyspark.StorageLevel.MEMORY_AND_DISK)
# data.take(5)

In [6]:
"""
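Create an RDD named 'ratings' which is:
    ` A set of 'Rating' objects
    ` Each combining 3 fields: product id as int, user id as int, rating score as float

Note: the fields of Rating are named (user, product, rating). The call below
puts the product id in the first slot and the user id in the second, so in the
trained model the "user" side holds products and the "product" side holds users.
The later cells unpack the tuples in the same (p, u, r) order, so they stay consistent.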
"""

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

ratings = data.map(lambda (i, p, u, r): Rating(Utils.hashToInt(p), Utils.hashToInt(u), float(r)))

In [7]:
"""

Store the integer userId and movieId together with
    their original string IDs for later use
    
"""

# distinct() keeps one row per ID instead of one row per rating
users = data.map(lambda (i, p, u, r): (Utils.hashToInt(u), u)).distinct().toDF()
movies = data.map(lambda (i, p, u, r): (Utils.hashToInt(p), p)).distinct().toDF()

user_path = os.path.join(DATA_DIR, 'users')
movie_path = os.path.join(DATA_DIR,'movies')

users.write.csv(user_path, mode='overwrite', compression='gzip')
movies.write.csv(movie_path, mode='overwrite', compression='gzip')


In [7]:
"""

Split the dataset into a train set and a test set
    ` Train set: a random sample of about 80% of the records
    ` Test set: the remaining records (about 20%)
    
X_train: movie and user of each review from Trainset
X_test: movie and user of each review from Testset

"""

(train_data, test_data) = ratings.randomSplit([0.8, 0.2])

X_train = train_data.map(lambda (p, u, r): (int(p), int(u)))

X_test = test_data.map(lambda (p, u, r): (int(p), int(u)))


In [8]:
train_data.count()

6328853

In [9]:
test_data.count()

1582831

## 2. Building a Recommender using matrix factorization

In [10]:
"""

Recommendation model using ALS
    ` rank = 10
    ` number of iterations = 10 

===> Other hyperparameters (rank, iterations, regularization) still need to be tried to find the best model 

"""

rank = 10
numIterations = 10

model = ALS.train(train_data, rank, numIterations)
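One way to try other hyperparameters is a plain grid search. The sketch below keeps the search loop independent of Spark so it can be run anywhere: `grid_search`, `fake_fit_and_score` and the grid values are hypothetical, and the scoring function is a stand-in for "train ALS with these settings and return the RMSE on a held-out validation split".

```python
from itertools import product

def grid_search(param_grid, fit_and_score):
    """Try every parameter combination and return (best_params, best_score, results).

    fit_and_score(**params) must return a validation error; lower is better.
    """
    names = sorted(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, fit_and_score(**params)))
    best_params, best_score = min(results, key=lambda item: item[1])
    return best_params, best_score, results

# Hypothetical stand-in for "train ALS and return the validation RMSE".
# With Spark, the body could instead call
#   ALS.train(train_data, rank, numIterations, lambda_)
# and score the resulting model on a validation split.
def fake_fit_and_score(rank, lambda_):
    return abs(rank - 20) / 10.0 + abs(lambda_ - 0.1)

grid = {"rank": [5, 10, 20], "lambda_": [0.01, 0.1, 1.0]}
best_params, best_score, results = grid_search(grid, fake_fit_and_score)
print(best_params, best_score)
```

Scoring on a separate validation split, rather than on the test set, keeps the test RMSE an honest estimate of how the chosen model generalizes.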

## 3. Model Evaluation 

In this part we evaluate the recommendation model on both the training set and the test set. 
RMSE (root mean squared error) is used as the metric.
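For reference, RMSE is the square root of the mean of the squared differences between actual and predicted ratings. A minimal plain-Python version, with a hypothetical helper name and made-up example pairs:

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    pairs = list(pairs)
    return sqrt(sum((actual - predicted) ** 2 for actual, predicted in pairs) / len(pairs))

# Hypothetical (actual, predicted) ratings, for illustration only.
example = [(5.0, 4.0), (3.0, 3.0), (1.0, 3.0), (4.0, 5.0)]
print(rmse(example))
```

Since the differences are squared, the result does not depend on which element of each pair is the actual rating and which is the prediction, and large individual errors weigh more heavily than small ones.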


In [12]:
from pyspark.mllib.evaluation import RegressionMetrics

In [14]:
# Evaluate the model on training data

predictions = model.predictAll(X_train).map(lambda r: ((r[0], r[1]), r[2]))

labelsAndPreds = train_data.map(lambda (p, u, r): ((p,u),r)) \
                                .join(predictions).map(lambda (k, v): v)


metrics = RegressionMetrics(labelsAndPreds)

rmse = metrics.rootMeanSquaredError

print("Root Mean Squared Error = " + str(rmse))

Root Mean Squared Error = 0.349035992892


In [15]:
# Evaluate the model on test data

test_preds = model.predictAll(X_test).map(lambda r: ((r[0], r[1]), r[2]))

test_labelsAndPreds = test_data.map(lambda (p, u, r): ((p,u),r)) \
                                .join(test_preds).map(lambda (k, v): v)

test_metrics = RegressionMetrics(test_labelsAndPreds)

test_rmse = test_metrics.rootMeanSquaredError

print("Root Mean Squared Error = " + str(test_rmse))

Root Mean Squared Error = 1.12854318275
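One thing worth checking when reading the test RMSE: `predictAll` only returns predictions for pairs whose two IDs both have factors in the trained model, and the inner join then silently drops every test pair without a prediction. The plain-Python sketch below mimics that join on hypothetical toy data to show how coverage could be measured; with the RDDs above, the analogous check would compare `test_labelsAndPreds.count()` with `test_data.count()`.

```python
# Hypothetical toy data keyed by (product, user), mirroring the RDD layout above.
test_ratings = {(1, 10): 4.0, (2, 10): 3.0, (3, 11): 5.0, (4, 12): 2.0}

# Suppose the model could only score pairs whose product and user were both
# seen during training, so (4, 12) has no prediction.
predictions = {(1, 10): 3.5, (2, 10): 3.2, (3, 11): 4.1}

# Inner join: only keys present on both sides survive, as with RDD.join.
joined = {key: (rating, predictions[key])
          for key, rating in test_ratings.items() if key in predictions}

# Fraction of test ratings that actually contribute to the reported RMSE.
coverage = len(joined) / float(len(test_ratings))
print(coverage)
```

A coverage well below 1 would mean the reported test RMSE describes only the pairs the model was able to score.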


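Note: the RMSE on the training data (0.349) is much lower than on the test data (1.129). A gap this large suggests that the model overfits the training ratings; a different rank or a stronger regularization may narrow it.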

In [13]:
# Save model
model.save(sc, "./model")

In [14]:
# sc.stop()