# **A Study of a Recommendation Algorithm Applied to a Movie Dataset**
**This notebook is a way to practice my data science knowledge**

The notebook walks through a workflow for building a movie recommender with the ALS (Alternating Least Squares) algorithm.

Its main purpose is to serve as a step-by-step guide that I can review later and reuse as a reference for future cases.

## Workflow stages
The solution workflow goes through six stages.
1.   Load the Data.
2.   Data pre-processing.
3.   Splitting the data between training and testing.
4.   Create the recommender system through ALS.
5.   Make predictions.
6.   Test the model.

In [0]:
from pyspark.sql import SparkSession  #Entry point used to create the Spark session

In [0]:
#Start (or reuse) the Spark session
spark = SparkSession.builder.appName("recomendation").getOrCreate()

In [0]:
%fs ls /FileStore/tables

path,name,size,modificationTime
dbfs:/FileStore/tables/movies-1.csv,movies-1.csv,494431,1662647401000
dbfs:/FileStore/tables/movies-2.csv,movies-2.csv,494431,1662647447000
dbfs:/FileStore/tables/movies.csv,movies.csv,494431,1662647363000
dbfs:/FileStore/tables/ratings.csv,ratings.csv,2483723,1662647649000
dbfs:/FileStore/tables/u.data,u.data,1979173,1662474869000


In [0]:
#Path to the ratings file to use
diretorioRecomendacao="/FileStore/tables/u.data"

#1) Load the Data

In [0]:
#Read the stored file as an RDD of text lines (one string per record)
rdd_movies = spark.sparkContext.textFile(diretorioRecomendacao)

In [0]:
rdd_movies.take(10)  #User id | Item id | Rating | Timestamp

Out[31]: ['196\t242\t3\t881250949',
 '186\t302\t3\t891717742',
 '22\t377\t1\t878887116',
 '244\t51\t2\t880606923',
 '166\t346\t1\t886397596',
 '298\t474\t4\t884182806',
 '115\t265\t2\t881171488',
 '253\t465\t5\t891628467',
 '305\t451\t3\t886324817',
 '6\t86\t3\t883603013']
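
Each record is a tab-separated string, so the later cells split it on `\t` and cast the fields. As a minimal sketch in plain Python (no Spark needed), the parsing step looks roughly like this; the helper name `parse_line` is hypothetical and not part of the notebook:

```python
def parse_line(line):
    """Split one raw u.data record into typed fields."""
    user_id, item_id, rating, timestamp = line.split("\t")
    return int(user_id), int(item_id), float(rating), int(timestamp)


# First record shown in the output above.
print(parse_line("196\t242\t3\t881250949"))
```

The notebook keeps only the first three fields, since the timestamp is not used by the model.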

In [0]:
#Defining the libraries to use
from pyspark.mllib.recommendation import ALS, Rating  #MLlib's ALS trainer and the Rating(user, product, rating) record type

#2) Data pre-processing
#3) Splitting the data between training and testing

In [0]:
#Splitting data between training and testing
(trainRatings, testRatings) = rdd_movies.randomSplit([0.7, 0.3])

In [0]:
trainRatings.take(5)

Out[34]: ['196\t242\t3\t881250949',
 '186\t302\t3\t891717742',
 '22\t377\t1\t878887116',
 '244\t51\t2\t880606923',
 '166\t346\t1\t886397596']

In [0]:
testRatings.first()  #Show the first record of the test RDD

Out[35]: '115\t265\t2\t881171488'
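
`randomSplit` assigns records at random, so the 70/30 proportions are approximate, and the split changes between runs unless a seed is passed (`randomSplit` accepts an optional `seed` argument). The idea can be sketched in plain Python as follows; `random_split` is a hypothetical helper that illustrates the concept and is not how Spark implements it:

```python
import random


def random_split(records, train_fraction=0.7, seed=42):
    """Send each record to train with probability train_fraction, else to test."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for record in records:
        (train if rng.random() < train_fraction else test).append(record)
    return train, test


train, test = random_split(range(1000))
print(len(train), len(test))  # roughly 700 / 300; exact sizes depend on the seed
```

Fixing the seed matters when comparing models, because otherwise each run is evaluated on a different test set.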

In [0]:
trainingData = trainRatings.map(lambda l: l.split('\t')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))  #Parse each line into a Rating(user, product, rating)

In [0]:
trainingData.first()  #Show the first Rating in the parsed RDD

Out[37]: Rating(user=196, product=242, rating=3.0)

In [0]:
#Test data: keep only the (user, movie) pairs, which is what predictAll takes as input
testData = testRatings.map(lambda l: l.split('\t')).map(lambda l: (int(l[0]), int(l[1])))

In [0]:
testData.first()  #User id | Movie id

Out[39]: (115, 265)

#4) Create the recommender system through ALS

In [0]:
#Defining model variables
rank = 10  #Number of latent factors: R (m x n) ≈ P (m x rank) * Q (rank x n), where m = number of users and n = number of items
numIterations = 50  #Number of iterations performed by the model
model = ALS.train(trainingData, rank, numIterations)  #Train the model
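
To make the "alternating" part of ALS concrete, here is a toy sketch in plain Python. It assumes rank 1, so each least-squares step has a closed form, and it uses a made-up 2x2 rating matrix. It is only an illustration of the idea and not MLlib's implementation; the names `als_rank1` and `squared_error` and the `reg` value are hypothetical.

```python
def als_rank1(ratings, n_users, n_items, iterations=20, reg=0.1):
    """Toy rank-1 ALS. ratings is a list of (user, item, rating) triples."""
    p = [1.0] * n_users  # one factor per user
    q = [1.0] * n_items  # one factor per item
    for _ in range(iterations):
        # Hold q fixed and solve for each user factor (1-D ridge regression).
        for u in range(n_users):
            seen = [(i, r) for (uu, i, r) in ratings if uu == u]
            p[u] = sum(r * q[i] for i, r in seen) / (sum(q[i] ** 2 for i, _ in seen) + reg)
        # Hold p fixed and solve for each item factor.
        for i in range(n_items):
            seen = [(u, r) for (u, ii, r) in ratings if ii == i]
            q[i] = sum(r * p[u] for u, r in seen) / (sum(p[u] ** 2 for u, _ in seen) + reg)
    return p, q


def squared_error(ratings, p, q):
    """Sum of squared differences between ratings and p[u] * q[i]."""
    return sum((r - p[u] * q[i]) ** 2 for (u, i, r) in ratings)


toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 2.0)]
p, q = als_rank1(toy_ratings, n_users=2, n_items=2)
print(squared_error(toy_ratings, p, q))
```

With `rank = 10`, each of these scalar solves becomes a small 10-dimensional least-squares problem, but the alternation between user and item factors is the same.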

#5) Model prediction

In [0]:
model.predict(253, 465)  # Input (User,Movie)

Out[41]: 4.498855577951222

In [0]:
prediction = model.predictAll(testData)  #Prediction for all test data
prediction.first()

Out[43]: Rating(user=368, product=320, rating=2.2249046999913586)

In [0]:
prediction = prediction.map(lambda l: ((l[0], l[1]), l[2]))  #Re-key as ((user, movie), predicted rating) so it can be joined with the actual ratings
prediction.take(5)

Out[44]: [((368, 320), 2.2249046999913586),
 ((264, 320), 3.715630913536507),
 ((833, 320), 4.992979846646292),
 ((731, 320), 1.080591809504766),
 ((342, 320), 4.609023387897583)]

In [0]:
testRating2 = testRatings.map(lambda l: l.split('\t')).map(lambda l: ((int(l[0]), int(l[1])), float(l[2])))  #Key the actual test ratings by (user, movie) for the join below

In [0]:
testRating2.first()

Out[46]: ((115, 265), 2.0)

In [0]:
ratingsAndPredictions = testRating2.join(prediction)
ratingsAndPredictions.take(5)

Out[47]: [((290, 88), (4.0, 4.025118623429075)),
 ((160, 174), (5.0, 4.260708884789807)),
 ((42, 96), (5.0, 4.343224565491234)),
 ((90, 648), (4.0, 4.952638079588513)),
 ((222, 750), (5.0, 3.8093080486624205))]
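
`join` here is an inner join on the `(user, movie)` key: only keys present in both RDDs survive, and each result pairs the actual rating with the predicted one. The same semantics can be sketched in plain Python with dictionaries and made-up values. `inner_join` is a hypothetical helper, and the sketch assumes each key appears at most once per side:

```python
def inner_join(left, right):
    """Keep keys present in both dicts; pair their values as (left, right)."""
    return {key: (left[key], right[key]) for key in left if key in right}


actual = {(1, 10): 4.0, (1, 20): 2.0, (2, 10): 5.0}
predicted = {(1, 10): 3.7, (2, 10): 4.4}  # no prediction for (1, 20)
print(inner_join(actual, predicted))
```

Test pairs without a prediction drop out of the join, so the error computed afterwards covers only the pairs the model could score.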

#6) Model evaluation

In [0]:
MSE = ratingsAndPredictions.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print ("Mean square error = " + str(MSE))

Mean square error = 1.3369087341053123
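
The MSE is expressed in squared rating units, so its square root (RMSE) is often easier to read: for the value above it is about 1.16 rating points. A minimal sketch of both metrics in plain Python, where `mean_squared_error` is a hypothetical helper and the sample pairs are two rows of the join output rounded to two decimals:

```python
import math


def mean_squared_error(pairs):
    """pairs is an iterable of (actual, predicted) ratings."""
    pairs = list(pairs)
    return sum((actual - predicted) ** 2 for actual, predicted in pairs) / len(pairs)


sample = [(4.0, 4.03), (5.0, 4.26)]
mse = mean_squared_error(sample)
rmse = math.sqrt(mse)  # same scale as the ratings themselves
print(mse, rmse)
```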


In [0]:
#Given a user -> recommend the top 5 movies
model.recommendProducts(105, 5) 

Out[49]: [Rating(user=105, product=390, rating=7.155605273006472),
 Rating(user=105, product=580, rating=7.042244960950049),
 Rating(user=105, product=1404, rating=6.916491217418537),
 Rating(user=105, product=1022, rating=6.816778710676243),
 Rating(user=105, product=1203, rating=6.730966788552899)]
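
The predicted scores above exceed 5 because the factorization does not constrain its outputs to the rating scale. They are still usable for ranking; if a displayed value should stay on the scale, it can be clamped. A minimal sketch, assuming a 1 to 5 scale (an assumption based on the ratings seen in the data) and a hypothetical helper `clip_rating`:

```python
def clip_rating(score, low=1.0, high=5.0):
    """Clamp a predicted score to the assumed rating scale."""
    return max(low, min(high, score))


# Top score from the recommendation output above.
print(clip_rating(7.155605273006472))
```

Clamping should be applied only for display, since it would flatten the ordering among the top recommendations.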

In [0]:
#Given a movie -> recommend the top 5 users
model.recommendUsers(1, 5)  #movie: Toy Story (1995)

Out[50]: [Rating(user=688, product=1, rating=7.601942677360231),
 Rating(user=366, product=1, rating=7.005675834269713),
 Rating(user=909, product=1, rating=6.971875553043447),
 Rating(user=341, product=1, rating=6.763404449100201),
 Rating(user=228, product=1, rating=6.70832972446382)]

In [0]:
#Show the latent-factor vector of one user (a row of P), returned as (user id, factors)
model.userFeatures().take(1)[0]

Out[51]: (8,
 array('d', [-0.09816134721040726, -0.8855308890342712, -0.6927691698074341, -0.05442076176404953, -1.3535518646240234, 0.3800413906574249, 0.977526843547821, -0.9397379159927368, 0.01582619734108448, -1.2322837114334106]))

In [0]:
#Show the latent-factor vector of one product (a column of Q), returned as (product id, factors)
model.productFeatures().take(1)[0]

Out[52]: (8,
 array('d', [0.5154821276664734, -1.613100528717041, 0.7003354430198669, -0.4653001129627228, -1.4897795915603638, 0.4245077967643738, 1.3665841817855835, -0.11095596104860306, -0.0847773626446724, -0.5163061618804932]))
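
In a matrix factorization model, the predicted rating for a (user, product) pair is the dot product of the two latent-factor vectors. A minimal sketch in plain Python, using made-up rank-3 vectors rather than the ones printed above (`predict_rating` is a hypothetical helper):

```python
def predict_rating(user_vector, product_vector):
    """Dot product of a user's and a product's latent-factor vectors."""
    return sum(u * q for u, q in zip(user_vector, product_vector))


# Made-up rank-3 vectors, for illustration only.
user_vector = [0.5, -1.0, 2.0]
product_vector = [1.0, 0.5, 1.5]
print(predict_rating(user_vector, product_vector))  # 0.5 - 0.5 + 3.0 = 3.0
```

Applying the same function to the two 10-element arrays above would give the model's score for user 8 and product 8.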