#### Using PySpark to Train a Recommendation System for Movies

By: Matt Purvis

##### Imports

In [0]:
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

In [0]:
# Create Spark Session
spark = SparkSession.builder.appName('rec').getOrCreate()

##### Import data and preview

In [0]:
data = spark.sql('select * from movielens_ratings_csv')

In [0]:
# Preview columns and dtypes
data.printSchema()

In [0]:
# Show the data
data.show()

In [0]:
# Describe data
data.describe().show()

##### Train Test Split

In [0]:
# 80/20 split; the seed makes the split reproducible between runs
training, test = data.randomSplit([0.8, 0.2], seed=42)

##### Creating and fitting the model

In [0]:
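# coldStartStrategy='drop' removes rows with NaN predictions (users or movies unseen in training)
als = ALS(maxIter=5, regParam=0.01, userCol='userId', itemCol='movieId', ratingCol='rating', coldStartStrategy='drop')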

In [0]:
model = als.fit(training)

##### Making Predictions

In [0]:
predictions = model.transform(test)

In [0]:
predictions.show()

##### Evaluating the model

In [0]:
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')

In [0]:
rmse = evaluator.evaluate(predictions)

In [0]:
print('RMSE')
print(rmse)
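
As a reference for what the evaluator reports: RMSE is the square root of the mean squared difference between actual and predicted ratings. The sketch below computes it in plain Python and compares it against a naive baseline that always predicts the mean rating. The helper name `rmse` and all ratings here are made up for illustration and are not taken from the MovieLens data.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError('inputs must be non-empty and the same length')
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up ratings on a 1 to 5 scale, for illustration only
actual = [5.0, 3.0, 1.0, 4.0]
predicted = [4.0, 3.0, 2.0, 2.0]

model_rmse = rmse(actual, predicted)

# Naive baseline: always predict the mean of the actual ratings
mean_rating = sum(actual) / len(actual)
baseline_rmse = rmse(actual, [mean_rating] * len(actual))

print('model RMSE:', model_rmse)
print('baseline RMSE:', baseline_rmse)
```

An RMSE is easiest to judge relative to a baseline like this one: a model that does not clearly beat the mean-rating baseline has learned little about individual users.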

The RMSE is poor given that ratings are on a scale of 1 to 5. The dataset is small, so more data would probably improve the model. Further tuning of the ALS parameters could also help.

##### How to use the model to make predictions on new data

In [0]:
# Get a sample user: the movies that user 11 rated in the test set
single_user = test.filter(test['userId'] == 11).select(['movieId', 'userId'])

In [0]:
single_user.show()

In [0]:
# Make predictions
recommendations = model.transform(single_user)

In [0]:
# Display predictions, ordered from best to worst
recommendations.orderBy('prediction', ascending=False).show()

In this run, we would highly recommend movie 25 to this user and would not recommend movie 80 at all. We think they may moderately enjoy movie 47.