# Movie Recommender

GroupLens Research has collected and published rating data sets from the MovieLens website (http://movielens.org). The data sets were collected over various periods of time, depending on the size of the set.

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

### Start Spark Session

In [2]:
from pyspark.sql import SparkSession

In [4]:
spark = SparkSession.builder.appName('rec_sys').getOrCreate()

### Import Data

In [5]:
df = spark.read.csv('movielens_ratings.csv', inferSchema=True, header=True)

In [6]:
df.show()

+-------+------+------+
|movieId|rating|userId|
+-------+------+------+
|      2|   3.0|     0|
|      3|   1.0|     0|
|      5|   2.0|     0|
|      9|   4.0|     0|
|     11|   1.0|     0|
|     12|   2.0|     0|
|     15|   1.0|     0|
|     17|   1.0|     0|
|     19|   1.0|     0|
|     21|   1.0|     0|
|     23|   1.0|     0|
|     26|   3.0|     0|
|     27|   1.0|     0|
|     28|   1.0|     0|
|     29|   1.0|     0|
|     30|   1.0|     0|
|     31|   1.0|     0|
|     34|   1.0|     0|
|     37|   1.0|     0|
|     41|   2.0|     0|
+-------+------+------+
only showing top 20 rows



In [7]:
df.count()

1501

### Split Data Into Train and Test

In [8]:
train, test = df.randomSplit([.8,.2])
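
`randomSplit` assigns rows to the two parts at random, so the resulting sizes are only approximately 80% and 20% of the 1501 rows, and they change from run to run unless a seed is supplied (PySpark's `randomSplit` accepts an optional `seed` argument). The plain-Python sketch below illustrates that idea only; it is not Spark's implementation, and the helper name `random_split` and the seed value 42 are assumptions made for illustration.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with one independent random draw."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train_rows, test_rows = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train_rows.append(row)
        else:
            test_rows.append(row)
    return train_rows, test_rows

# Stand-in for the 1501 rating rows; only the row count matters here.
rows = list(range(1501))
train_rows, test_rows = random_split(rows, 0.8, seed=42)
print(len(train_rows), len(test_rows))
```

Because every row is drawn independently, the split sizes vary around the requested fractions rather than matching them exactly.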

### Collaborative Filtering Using Alternating Least Squares (ALS)

In [9]:
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

In [10]:
als = ALS(maxIter=5, regParam=0.01, userCol='userId', itemCol='movieId', ratingCol='rating')

In [11]:
model = als.fit(train)
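
As a rough guide to what `fit` optimizes, the textbook form of the regularized matrix factorization objective is shown below. The notation is introduced here for illustration and is not taken from the notebook: $\Omega$ is the set of (user, movie) pairs with a rating in `train`, $r_{ui}$ is the observed rating, $x_u$ and $y_i$ are the latent factor vectors for user $u$ and movie $i$, and $\lambda$ plays the role of `regParam`. Spark's implementation may differ in details such as how the regularization term is scaled.

$$
\min_{X,\,Y} \sum_{(u,i) \in \Omega} \left( r_{ui} - x_u^{\top} y_i \right)^2 + \lambda \left( \sum_{u} \lVert x_u \rVert^2 + \sum_{i} \lVert y_i \rVert^2 \right)
$$

The "alternating" part refers to holding $Y$ fixed while solving a least squares problem for each $x_u$, then holding $X$ fixed while solving for each $y_i$, and repeating this for `maxIter` rounds. A prediction for a (user, movie) pair is the dot product $x_u^{\top} y_i$, which is not constrained to the rating scale; this is why some predicted values in this notebook are negative.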

### Predictions

In [12]:
predictions = model.transform(test)

In [13]:
predictions.show()

+-------+------+------+----------+
|movieId|rating|userId|prediction|
+-------+------+------+----------+
|     31|   1.0|     4| 2.5437841|
|     31|   1.0|    24| 1.2198374|
|     31|   1.0|     0| 1.9999307|
|     85|   1.0|    26|  2.453824|
|     85|   5.0|    16| 1.8313253|
|     85|   5.0|     8|0.07489713|
|     65|   1.0|    28| 1.9786984|
|     65|   1.0|     2| 2.1857023|
|     53|   1.0|    12|0.38200295|
|     53|   3.0|    13| 2.9108722|
|     53|   1.0|     6| 1.4255377|
|     53|   1.0|     7|-1.7644683|
|     78|   1.0|    27| 0.9612028|
|     34|   1.0|    17| 0.8784369|
|     34|   3.0|    25|0.29068732|
|     81|   1.0|    15| 2.6637037|
|     81|   2.0|     9| 1.4798212|
|     81|   3.0|    18| 3.2504678|
|     28|   3.0|     1|0.26722473|
|     26|   1.0|     6| 2.2969806|
+-------+------+------+----------+
only showing top 20 rows



In [14]:
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')

In [15]:
rmse = evaluator.evaluate(predictions)

### Model Performance

In [16]:
print('RMSE: ', rmse)

RMSE:  1.760349994573671
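
The `rmse` metric is the square root of the mean squared difference between the `rating` and `prediction` columns, so it is expressed in rating units. The plain-Python sketch below reproduces that calculation on a handful of rows; the helper name `rmse` and the `sample_*` variables are illustrative and not part of the notebook, and the three rows are only a small excerpt, so the result will not match the full test-set value.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# First three (rating, prediction) pairs from the predictions table above.
sample_ratings = [1.0, 1.0, 1.0]
sample_predictions = [2.5437841, 1.2198374, 1.9999307]
sample_rmse = rmse(sample_ratings, sample_predictions)
print("RMSE on three sample rows:", sample_rmse)
```

Since the ratings displayed in this notebook fall between 1.0 and 5.0, an RMSE near 1.76 means a typical prediction is off by a large share of that range.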


### Predicting a Single User

In [17]:
single_user = test.filter(test['userId']==11).select(['movieId', 'userId'])

In [18]:
single_user.show()

+-------+------+
|movieId|userId|
+-------+------+
|      6|    11|
|      9|    11|
|     10|    11|
|     39|    11|
|     41|    11|
|     43|    11|
|     48|    11|
|     50|    11|
|     59|    11|
|     62|    11|
|     64|    11|
|     86|    11|
+-------+------+



In [19]:
recommendation = model.transform(single_user)

In [20]:
recommendation.orderBy('prediction', ascending=False).show()

+-------+------+-----------+
|movieId|userId| prediction|
+-------+------+-----------+
|     48|    11|  4.1461964|
|     50|    11|  3.7933547|
|     39|    11|  3.7105079|
|     59|    11|  2.4018178|
|      6|    11|   2.355648|
|     43|    11|  2.3349094|
|     10|    11|  2.2312279|
|     64|    11|  1.9377606|
|     41|    11|  1.7963191|
|     86|    11|  1.7506684|
|      9|    11|0.028414845|
|     62|    11|-0.77857643|
+-------+------+-----------+



In [21]:
spark.stop()