In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Recommendation system basic').getOrCreate()

Collaborative filtering makes predictions about a user's interests (filtering) by collecting preference or taste information from many users (collaborating). The underlying assumption is that if user A shares user B's opinion on one issue, A is more likely to share B's opinion on a different issue x than the opinion of a randomly chosen user.

The image below (from Wikipedia) shows an example of collaborative filtering. First, people rate different items (such as videos, images, and games). Then the system predicts a user's rating for an item that user has not rated yet. The predictions are built from the existing ratings of other users whose ratings are similar to the active user's. In the image, the system predicts that the user will not like the video.

<img src=https://upload.wikimedia.org/wikipedia/commons/5/52/Collaborative_filtering.gif />
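
To make the idea concrete, here is a minimal sketch of user-based collaborative filtering in plain Python. The users, items, and ratings are hypothetical toy values (not from the dataset used below), and cosine similarity with a similarity-weighted average is only one of several reasonable choices. It is not the ALS method used later in this notebook.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}. Not taken from the dataset.
ratings = {
    "A": {"video": 1.0, "image": 5.0, "game": 4.0},
    "B": {"video": 1.0, "image": 5.0, "game": 5.0, "book": 2.0},
    "C": {"video": 5.0, "image": 1.0, "game": 2.0, "book": 5.0},
}


def cosine_similarity(u, v):
    """Cosine similarity between two users over the items both have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += sim
    return weighted_sum / weight_total if weight_total else None


# A has not rated "book"; B (very similar to A) rated it 2.0, C rated it 5.0.
prediction = predict(ratings, "A", "book")
print(prediction)
```

Because A's ratings agree much more closely with B's than with C's, the prediction for A lands nearer to B's rating of the book than to C's.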

Spark's MLlib machine learning library provides a collaborative filtering implementation based on Alternating Least Squares (ALS). The implementation has the parameters listed below. Where the DataFrame-based pyspark.ml API used in this notebook names a parameter differently from the older RDD-based API, the pyspark.ml name is given in parentheses:

* numBlocks (numUserBlocks and numItemBlocks) is the number of blocks used to parallelize computation (the RDD-based API accepts -1 to auto-configure).
* rank is the number of latent factors in the model.
* iterations (maxIter) is the number of iterations to run.
* lambda (regParam) specifies the regularization parameter in ALS.
* implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
* alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.
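
To show roughly what rank, the iteration count, and the regularization parameter control, here is a minimal pure-Python sketch of alternating least squares with rank fixed at 1. The ratings are hypothetical toy values, the variable names are made up for illustration, and this is a simplification for intuition, not how Spark implements ALS.

```python
# Hypothetical toy data: (user, item, rating) triples, not from the dataset.
observed = [
    (0, 0, 5.0), (0, 1, 3.0),
    (1, 0, 4.0), (1, 2, 1.0),
    (2, 1, 2.0), (2, 2, 1.0),
]
n_users, n_items = 3, 3
reg = 0.1         # plays the role of lambda / regParam
iterations = 10   # plays the role of iterations / maxIter

# Rank is fixed at 1 here, so each user and item has a single latent factor.
user_f = [1.0] * n_users
item_f = [1.0] * n_items


def squared_error():
    """Sum of squared errors over the observed ratings."""
    return sum((r - user_f[u] * item_f[i]) ** 2 for u, i, r in observed)


initial_error = squared_error()
for _ in range(iterations):
    # Hold the item factors fixed; solve the regularized least squares per user.
    for u in range(n_users):
        num = sum(r * item_f[i] for uu, i, r in observed if uu == u)
        den = reg + sum(item_f[i] ** 2 for uu, i, r in observed if uu == u)
        user_f[u] = num / den
    # Hold the user factors fixed; solve the regularized least squares per item.
    for i in range(n_items):
        num = sum(r * user_f[u] for u, ii, r in observed if ii == i)
        den = reg + sum(user_f[u] ** 2 for u, ii, r in observed if ii == i)
        item_f[i] = num / den
final_error = squared_error()
print(initial_error, final_error)
```

With a higher rank, each scalar update becomes a small linear system per user or item, but the alternating structure stays the same.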

Let's see this all in action!

In [2]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

In [3]:
data = spark.read.csv('movielens_ratings.csv', inferSchema=True, header=True)

In [4]:
data.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- userId: integer (nullable = true)



In [5]:
data.show()

+-------+------+------+
|movieId|rating|userId|
+-------+------+------+
|      2|   3.0|     0|
|      3|   1.0|     0|
|      5|   2.0|     0|
|      9|   4.0|     0|
|     11|   1.0|     0|
|     12|   2.0|     0|
|     15|   1.0|     0|
|     17|   1.0|     0|
|     19|   1.0|     0|
|     21|   1.0|     0|
|     23|   1.0|     0|
|     26|   3.0|     0|
|     27|   1.0|     0|
|     28|   1.0|     0|
|     29|   1.0|     0|
|     30|   1.0|     0|
|     31|   1.0|     0|
|     34|   1.0|     0|
|     37|   1.0|     0|
|     41|   2.0|     0|
+-------+------+------+
only showing top 20 rows



In [6]:
data.describe().show()

+-------+------------------+------------------+------------------+
|summary|           movieId|            rating|            userId|
+-------+------------------+------------------+------------------+
|  count|              1501|              1501|              1501|
|   mean| 49.40572951365756|1.7741505662891406|14.383744170552964|
| stddev|28.937034065088994| 1.187276166124803| 8.591040424293272|
|    min|                 0|               1.0|                 0|
|    max|                99|               5.0|                29|
+-------+------------------+------------------+------------------+



In [7]:
# Randomly split the data into training (90%) and test (10%) sets
train, test = data.randomSplit([0.9,0.1])

In [8]:
als = ALS(regParam=0.01, userCol='userId', itemCol='movieId', ratingCol='rating')

In [9]:
recomm_model = als.fit(train)

In [10]:
# Apply the trained model to the test data to get predicted ratings
test_result = recomm_model.transform(test)

In [11]:
test_result.show()

+-------+------+------+-----------+
|movieId|rating|userId| prediction|
+-------+------+------+-----------+
|     31|   1.0|    26|  0.3706654|
|     31|   3.0|     8|  1.0377382|
|     85|   1.0|    25| -1.0857711|
|     85|   1.0|    29| -1.3310037|
|     53|   1.0|    25|  1.0998933|
|     78|   1.0|    28|  0.9556845|
|     78|   1.0|    17|  0.6684927|
|     81|   1.0|    21|  2.0102956|
|     28|   1.0|     0| 0.26456326|
|     12|   1.0|     1|  3.3449736|
|     12|   1.0|    24| -0.1114272|
|     12|   2.0|     0| -2.1473265|
|     91|   3.0|     0|  1.3555651|
|     22|   1.0|    10|-0.40224358|
|     22|   3.0|    29|  1.8946289|
|     93|   1.0|     1|  0.5577408|
|     47|   4.0|    25| 0.20588377|
|      1|   1.0|    18|-0.43171194|
|     13|   1.0|     1|  0.5564422|
|     13|   2.0|    10|  3.1317255|
+-------+------+------+-----------+
only showing top 20 rows
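
Several predictions above are negative, even though the observed ratings range from 1.0 to 5.0 (see the describe() output). ALS predictions are not constrained to the rating scale. One optional post-processing step, shown here as a plain-Python sketch on a few rows copied from the table, is to clamp predictions to the observed range. The clamp helper is hypothetical and is not part of the original analysis.

```python
# A few (rating, prediction) pairs copied from the output above.
rows = [
    (1.0, 0.3706654),
    (1.0, -1.0857711),
    (1.0, 3.3449736),
    (2.0, -2.1473265),
]

# Minimum and maximum rating as reported by data.describe().
MIN_RATING, MAX_RATING = 1.0, 5.0


def clamp(value, low=MIN_RATING, high=MAX_RATING):
    """Limit a predicted rating to the closed interval [low, high]."""
    return max(low, min(high, value))


clamped = [(rating, clamp(pred)) for rating, pred in rows]
for rating, pred in clamped:
    print(rating, pred)
```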



In [12]:
# Compute the root mean squared error (RMSE) of the predictions on the test set
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')
rmse = evaluator.evaluate(test_result)
print("The Root mean Square Error for the analysis is {}".format(rmse))

The Root mean Square Error for the analysis is 1.497836548325522
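
For reference, RMSE is the square root of the mean squared difference between the actual rating and the prediction. The sketch below recomputes it in plain Python for just the first three rows of the test_result table shown earlier, so its value covers only that small sample and will differ from the full-test-set figure above.

```python
from math import sqrt

# (rating, prediction) pairs copied from the first rows of test_result.
pairs = [(1.0, 0.3706654), (3.0, 1.0377382), (1.0, -1.0857711)]


def rmse(pairs):
    """Square root of the mean squared difference between label and prediction."""
    return sqrt(sum((label - pred) ** 2 for label, pred in pairs) / len(pairs))


sample_rmse = rmse(pairs)
print(round(sample_rmse, 3))
```

An RMSE of about 1.5 on a 1.0 to 5.0 scale means the typical prediction is off by roughly one and a half rating points.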


In [13]:
# Select the test-set rows for a single user (userId 20) to score with the model
user = test.filter(test['userId'] == 20).select('userId','movieId')

In [14]:
user.show()

+------+-------+
|userId|movieId|
+------+-------+
|    20|      0|
|    20|     48|
|    20|     55|
|    20|     66|
|    20|     79|
+------+-------+



In [15]:
recommendations = recomm_model.transform(user)

In [16]:
from pyspark.sql.functions import desc

In [17]:
recommendations.orderBy(desc('prediction')).show()

+------+-------+------------+
|userId|movieId|  prediction|
+------+-------+------------+
|    20|     66|   0.8624274|
|    20|     48|  0.74465984|
|    20|     79|   0.5775627|
|    20|     55|  0.40462604|
|    20|      0|-0.061816126|
+------+-------+------------+



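Ordering by predicted rating gives the model's ranking of these movies for user 20: movie 66 is predicted to be liked the most and movie 0 the least. To check these predictions, we can look at the ratings user 20 actually gave these movies in the test set.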

In [18]:
usercheck = test.filter(test['userId'] == 20).select('userId','movieId','rating')

In [19]:
usercheck.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|    20|      0|   1.0|
|    20|     48|   1.0|
|    20|     55|   1.0|
|    20|     66|   1.0|
|    20|     79|   1.0|
+------+-------+------+
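
User 20 rated all five of these movies 1.0, so this slice cannot confirm the predicted ordering. It only shows how far each prediction is from the actual rating. A small plain-Python sketch of that comparison, using the values copied from the two tables above (the variable names are illustrative):

```python
# Values copied from the prediction and rating tables for userId 20.
predicted = {66: 0.8624274, 48: 0.74465984, 79: 0.5775627, 55: 0.40462604, 0: -0.061816126}
actual = {0: 1.0, 48: 1.0, 55: 1.0, 66: 1.0, 79: 1.0}

# Absolute error per movie, the mean absolute error, and the worst prediction.
abs_errors = {movie: abs(actual[movie] - predicted[movie]) for movie in actual}
mae = sum(abs_errors.values()) / len(abs_errors)
worst_movie = max(abs_errors, key=abs_errors.get)

for movie, err in sorted(abs_errors.items(), key=lambda kv: kv[1]):
    print(movie, round(err, 3))
print("mean absolute error:", round(mae, 3), "worst movie:", worst_movie)
```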

