### Movie Recommendation System using MovieLens Dataset
In this notebook, we explore two collaborative-filtering approaches to building a recommendation system: memory-based and model-based. The analysis uses a sample of the MovieLens ratings dataset, with model training and inference implemented on Apache Spark.

#### Table of Contents
1. [Data Import](#import)
2. [Sampling Ratings Dataset](#sampling)
3. [ALS Model Training](#alstrain)
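
To make the memory-based approach concrete before turning to Spark, here is a minimal sketch of user-based collaborative filtering in plain Python. The toy ratings, the choice of cosine similarity, and the helper names `cosine_similarity` and `predict_rating` are all assumptions made for illustration; they are not part of the notebook's pipeline.

```python
from math import sqrt

# Toy ratings, invented for illustration: user -> {movie: rating}
toy_ratings = {
    "u1": {"m1": 5.0, "m2": 3.0, "m3": 4.0},
    "u2": {"m1": 4.0, "m2": 3.0, "m4": 5.0},
    "u3": {"m2": 1.0, "m3": 2.0, "m4": 4.0},
}

def cosine_similarity(a, b):
    """Cosine similarity between two users over the movies both have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = sqrt(sum(a[m] ** 2 for m in common))
    norm_b = sqrt(sum(b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b)

def predict_rating(ratings, user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    weighted_sum, weight_total = 0.0, 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[movie]
        weight_total += abs(sim)
    # None signals that no other user has rated this movie
    return weighted_sum / weight_total if weight_total else None

pred = predict_rating(toy_ratings, "u1", "m4")
print(pred)
```

A memory-based method like this keeps the raw ratings around and computes similarities at prediction time, whereas a model-based method such as ALS first compresses the ratings into learned factors.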

In [303]:
from pyspark import SparkContext, SQLContext

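# Import only the column helpers used below; a wildcard import would shadow
# Python built-ins such as sum, max, and round
from pyspark.sql.functions import col, when
from pyspark.sql import functions as F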

from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

#### 1. Data Import <a id = import></a>
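We load the raw MovieLens ratings file (`data/raw/ratings.csv`) into a Spark DataFrame, reading column names from the header row and letting Spark infer the column types.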

In [195]:
sc = SparkContext()
sqlContext = SQLContext(sc)

csvf = 'com.databricks.spark.csv'
ratings = sqlContext.read.format(csvf).options(header='true', inferschema='true').load('data/raw/ratings.csv')

#### 2. Sampling Ratings Dataset <a id = sampling></a>
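We group users into three classes by how many ratings they have given, using the 25th and 75th percentiles of the per-user rating count as cut-offs, and then sample ratings from each class at a different rate.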

In [311]:
ratings_count = ratings.groupby(['userID']).count()
quantile = ratings_count.approxQuantile('count', [0.25, 0.75], 0)

print("Ratings Count by User: 25th Percentile = "+str(quantile[0]))
print("Ratings Count by User: 75th Percentile = "+str(quantile[1]))

Ratings Count by User: 25th Percentile = 35.0
Ratings Count by User: 75th Percentile = 155.0


In [None]:
ratings_count = ratings_count.withColumn(
    'user_class', when(col('count') < quantile[0], 1).when(col('count') < quantile[1], 2).otherwise(3))
ratings_count = ratings_count.withColumnRenamed('userID', 'userID2')
ratings = ratings.join(ratings_count, ratings['userID'] == ratings_count['userID2'])
ratings = ratings.select(['userID', 'movieID', 'rating', 'timestamp', 'user_class'])
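
The `when(...).when(...).otherwise(...)` chain above amounts to a three-way bucketing rule. A minimal plain-Python sketch of the same rule follows; the function name `user_class` and the example counts are hypothetical, while the thresholds are the two percentiles printed earlier.

```python
def user_class(count, q25, q75):
    """Mirror the when/otherwise chain: 1 = light, 2 = medium, 3 = heavy rater."""
    if count < q25:
        return 1
    if count < q75:
        return 2
    return 3

# Thresholds taken from the quantile output above
q25, q75 = 35.0, 155.0

# Example per-user rating counts, chosen to probe both boundaries
classes = {n: user_class(n, q25, q75) for n in (20, 35, 154, 155, 900)}
print(classes)  # → {20: 1, 35: 2, 154: 2, 155: 3, 900: 3}
```

Because both comparisons are strict, a user whose count equals a threshold exactly falls into the higher class.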

In [319]:
ratings_sampled = ratings.sampleBy('user_class', fractions = {1: 0.00001, 2: 0.0001, 3: 0.001}, seed = 10)
print("Total Ratings in Sample = "+str(ratings_sampled.count()))
print("Distinct User IDs = "+str(ratings_sampled.select('userID').distinct().count())+
      " & Distinct Movie IDs = "+str(ratings_sampled.select('movieID').distinct().count()))

Total Ratings in Sample = 14305
Distinct User IDs = 10869 & Distinct Movie IDs = 4335
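
`sampleBy` draws a stratified sample without replacement: each row is kept independently with the probability assigned to its stratum, and a stratum missing from `fractions` is treated as having fraction zero. The plain-Python sketch below illustrates that idea under stated assumptions; `sample_by` is a hypothetical helper, the rows are synthetic, and the fractions are much larger than the notebook's so that the toy data yields a visible sample.

```python
import random

def sample_by(rows, key, fractions, seed):
    """Keep each row independently with the probability assigned to its stratum."""
    rng = random.Random(seed)
    # Strata missing from `fractions` get probability 0.0 and are never kept
    return [row for row in rows if rng.random() < fractions.get(row[key], 0.0)]

# Synthetic rows: 1,000 ratings in each user class
rows = [{"user_class": c, "rating_id": i} for c in (1, 2, 3) for i in range(1000)]
sample = sample_by(rows, "user_class", {1: 0.01, 2: 0.1, 3: 0.5}, seed=10)

kept = {c: sum(1 for r in sample if r["user_class"] == c) for c in (1, 2, 3)}
print(kept)  # counts land near 10, 100, and 500 but vary with the seed
```

Since every row is an independent draw, the sample size is itself random, so the totals reported above are one realization for `seed = 10` rather than exact multiples of the fractions.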


#### 3. ALS Model Training <a id = alstrain></a>
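We split the sampled ratings 80/20 into training and test sets, fit an ALS (alternating least squares) matrix-factorization model on the training set, evaluate it on the test set with RMSE, and finally generate top-10 movie recommendations per user.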

In [313]:
(training, test) = ratings_sampled.randomSplit([0.8, 0.2])

In [316]:
als = ALS(maxIter=5, 
          regParam=0.01, 
          userCol="userID", 
          itemCol="movieID", 
          ratingCol="rating", 
          coldStartStrategy="drop")
model = als.fit(training)
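
For reference, the fit above can be read as minimizing a regularized squared-error objective over user factors $x_u$ and movie factors $y_i$. The notation is ours: $\Omega$ is the set of observed (user, movie) pairs in `training`, $\lambda$ is `regParam` (0.01 here), and the factors have dimension $k$, the `rank`, which is left at its default. The weights $n_u$ and $n_i$ (the number of ratings per user and per movie) follow our reading of the Spark documentation, which describes scaling the regularization by rating counts; treat that detail as an assumption rather than a statement about the exact implementation.

$$
\min_{X,\,Y} \sum_{(u,i) \in \Omega} \left( r_{ui} - x_u^\top y_i \right)^2
+ \lambda \left( \sum_u n_u \lVert x_u \rVert^2 + \sum_i n_i \lVert y_i \rVert^2 \right)
$$

With the movie factors held fixed, each user vector has a closed-form least-squares solution, where $\Omega_u$ is the set of movies rated by user $u$:

$$
x_u = \left( \sum_{i \in \Omega_u} y_i y_i^\top + \lambda n_u I_k \right)^{-1} \sum_{i \in \Omega_u} r_{ui}\, y_i
$$

The update for each $y_i$ is symmetric, with the user factors held fixed. ALS alternates between the two updates, and `maxIter=5` caps the number of such rounds.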

In [320]:
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", 
                                labelCol="rating", 
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-Mean-Square-Error = " + str(rmse))

Root-Mean-Square-Error = 3.884697743067076
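
RMSE is expressed in the same units as the ratings, so it can be read as a typical prediction error in stars. The sketch below computes it by hand on a few invented rating pairs; the `rmse` helper and the numbers are illustrative and are not drawn from the notebook's data. Given that MovieLens ratings top out at 5 stars, a value near 3.9 suggests the model is predicting poorly, which likely reflects how sparse the sample is (14,305 ratings spread over 10,869 users).

```python
from math import sqrt

def rmse(actual, predicted):
    """Root of the mean squared difference between paired ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Invented ratings and predictions, for illustration only
actual = [4.0, 3.0, 5.0, 1.0]
predicted = [3.0, 3.0, 1.0, 3.0]

# Squared errors are 1, 0, 16, 4, so the mean is 5.25
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a single large miss (the 4-star error above) dominates the result.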


In [321]:
# Generate top 10 movie recommendations for each user
userRecs = model.recommendForAllUsers(10)

# Generate top 10 movie recommendations for a specified set of users.
# Users are drawn from the training set so that the model has factors for them.
users = training.select(als.getUserCol()).distinct().limit(3)
userSubsetRecs = model.recommendForUserSubset(users, 10)
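
Conceptually, a top-N recommendation from a factor model scores every movie by the dot product of the user's factor vector with the movie's factor vector and keeps the highest scores. The sketch below shows that idea in plain Python; the rank-2 factors are made up, `top_k_items` is a hypothetical helper, and this is not how Spark distributes the computation.

```python
import heapq

def top_k_items(user_vector, item_factors, k):
    """Score every item by dot product with the user vector; return the k best."""
    scores = (
        (sum(u * v for u, v in zip(user_vector, factors)), item)
        for item, factors in item_factors.items()
    )
    # nlargest orders by score first, returning the best k in descending order
    return [(item, score) for score, item in heapq.nlargest(k, scores)]

# Made-up rank-2 factors, not taken from the trained model
item_factors = {"m1": [1.0, 0.0], "m2": [0.0, 1.0], "m3": [1.0, 1.0], "m4": [-1.0, 0.5]}
user_vector = [2.0, 1.0]

print(top_k_items(user_vector, item_factors, 2))  # → [('m3', 3.0), ('m1', 2.0)]
```

In the Spark result, each row of `userRecs` pairs a `userID` with a `recommendations` array of (movie, predicted rating) entries, which plays the same role as the list returned here.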