# Example 09: Recommender System
## MovieLens Movie Recommendation

A recommender system selects personalized items for a user (customer) based on that user's interests and on the interests of similar users, given the context they share. The technique can recommend many kinds of items, such as books, movies, news, music, videos, web pages, and products in an online store. This example recommends movies to a given user based on that user's tastes and the tastes of similar users. The dataset comes from the MovieLens movie recommendation site.

In [1]:
import pyspark
from pyspark.sql import SparkSession, Row

from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

import time
start_time = time.time()

data_path='./data/'

# Try a larger dataset
#data_path='/data/dataset/MovieLens/ml-20m/'

In [2]:
# Create Spark Session
sc = SparkSession.builder \
     .master("local[*]") \
     .appName("Movies_Recommendation") \
     .getOrCreate()

## Ratings Data File Structure (ratings.csv)
-----------------------------------------

All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

    userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
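
As a small illustration of the format described above, the sketch below parses one line of `ratings.csv` with the standard library and converts the timestamp to a timezone-aware UTC datetime. The `Rating` tuple, the `parse_rating_line` helper, and the sample line are assumptions made for this example; they are not part of the notebook or of any MovieLens tooling.

```python
import csv
from datetime import datetime, timezone
from typing import NamedTuple


class Rating(NamedTuple):
    """One row of ratings.csv (hypothetical helper type for this sketch)."""
    user_id: int
    movie_id: int
    rating: float
    rated_at: datetime


def parse_rating_line(line: str) -> Rating:
    """Parse a 'userId,movieId,rating,timestamp' line into a Rating."""
    user_id, movie_id, rating, timestamp = next(csv.reader([line]))
    value = float(rating)
    # Ratings run from 0.5 to 5.0 in half-star steps, so 2 * rating must be a whole number.
    if not (0.5 <= value <= 5.0) or value * 2 != int(value * 2):
        raise ValueError(f"rating outside the half-star scale: {rating}")
    # The timestamp counts seconds since 1970-01-01 00:00:00 UTC.
    rated_at = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return Rating(int(user_id), int(movie_id), value, rated_at)


# Illustrative line in the documented format (an assumed example value).
sample = parse_rating_line("1,1,4.0,964982703")
print(sample.rated_at.isoformat())  # 2000-07-30T18:45:03+00:00
```

Keeping the datetime timezone-aware avoids silently interpreting the value in the local timezone of the machine running the notebook.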

In [3]:
df = sc.read.format("csv").options(sep=',',header='true',inferschema='true').\
     load(data_path+"ratings.csv.gz")

df.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



In [4]:
ratedf = df.select(['userId', 'movieId', 'rating'])
ratedf.head(5)

[Row(userId=1, movieId=1, rating=4.0),
 Row(userId=1, movieId=3, rating=4.0),
 Row(userId=1, movieId=6, rating=4.0),
 Row(userId=1, movieId=47, rating=5.0),
 Row(userId=1, movieId=50, rating=5.0)]

In [5]:
ratedf.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   4.0|
|     1|      3|   4.0|
|     1|      6|   4.0|
|     1|     47|   5.0|
|     1|     50|   5.0|
|     1|     70|   3.0|
|     1|    101|   5.0|
|     1|    110|   4.0|
|     1|    151|   5.0|
|     1|    157|   5.0|
|     1|    163|   5.0|
|     1|    216|   5.0|
|     1|    223|   3.0|
|     1|    231|   5.0|
|     1|    235|   4.0|
|     1|    260|   5.0|
|     1|    296|   3.0|
|     1|    316|   3.0|
|     1|    333|   5.0|
|     1|    349|   4.0|
+------+-------+------+
only showing top 20 rows



## Movies Data File Structure (movies.csv)
---------------------------------------

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,title,genres

Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses.
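
Because the release year is embedded in the title, it can be separated with a regular expression. The following is a minimal sketch that assumes the year is a four-digit number in parentheses at the very end of the title; `split_title_year` is a hypothetical helper, and titles that do not follow the pattern are returned with `None` as the year.

```python
import re
from typing import Optional, Tuple

# Matches a four-digit year in parentheses at the end of the title.
_YEAR_SUFFIX = re.compile(r"\s*\((\d{4})\)\s*$")


def split_title_year(title: str) -> Tuple[str, Optional[int]]:
    """Return (title without the year, year), or (title, None) if no year is found."""
    match = _YEAR_SUFFIX.search(title)
    if match is None:
        return title.strip(), None
    return title[: match.start()].strip(), int(match.group(1))


print(split_title_year("Toy Story (1995)"))  # ('Toy Story', 1995)
print(split_title_year("Untitled Film"))     # ('Untitled Film', None)
```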

Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)
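
The pipe-separated `genres` field can be turned into a list with a plain string split. This is a minimal sketch in pure Python; `split_genres` is a hypothetical helper, and mapping the `(no genres listed)` placeholder to an empty list is a design choice made for this example.

```python
from typing import List

NO_GENRES = "(no genres listed)"


def split_genres(genres: str) -> List[str]:
    """Split a pipe-separated genres field into a list of genre names."""
    if genres == NO_GENRES:
        # Treat the placeholder as "no genres" rather than as a genre of its own.
        return []
    return genres.split("|")


print(split_genres("Adventure|Animation|Children|Comedy|Fantasy"))
# ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']
print(split_genres(NO_GENRES))  # []
```

On a Spark DataFrame, `pyspark.sql.functions.split` can play the same role; its pattern is a regular expression, so the pipe has to be escaped there.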

In [6]:
df = sc.read.format("csv").options(sep=',',header='true',inferschema='true').\
     load(data_path+"movies.csv.gz")

df.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)



In [7]:
moviedf = df.select(['movieId', 'title', 'genres'])
moviedf.head(5)

[Row(movieId=1, title='Toy Story (1995)', genres='Adventure|Animation|Children|Comedy|Fantasy'),
 Row(movieId=2, title='Jumanji (1995)', genres='Adventure|Children|Fantasy'),
 Row(movieId=3, title='Grumpier Old Men (1995)', genres='Comedy|Romance'),
 Row(movieId=4, title='Waiting to Exhale (1995)', genres='Comedy|Drama|Romance'),
 Row(movieId=5, title='Father of the Bride Part II (1995)', genres='Comedy')]

## Rating Statistics

In [8]:
ratedf.describe().show()

+-------+------------------+----------------+------------------+
|summary|            userId|         movieId|            rating|
+-------+------------------+----------------+------------------+
|  count|            100836|          100836|            100836|
|   mean|326.12756356856676|19435.2957177992| 3.501556983616962|
| stddev| 182.6184914635004|35530.9871987003|1.0425292390606342|
|    min|                 1|               1|               0.5|
|    max|               610|          193609|               5.0|
+-------+------------------+----------------+------------------+
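
To make the summary rows concrete, the sketch below recomputes the same five statistics in pure Python for the first five ratings printed earlier. The `describe` function here is a hypothetical stand-in written for this example, not Spark's implementation; it uses the sample standard deviation (dividing by n - 1), which is what Spark's `describe()` reports as `stddev`.

```python
import statistics
from typing import Dict, Sequence


def describe(values: Sequence[float]) -> Dict[str, float]:
    """Return count, mean, sample standard deviation, min, and max."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        # statistics.stdev divides by n - 1 (sample standard deviation).
        "stddev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


# The first five ratings of user 1 shown in the earlier output.
ratings = [4.0, 4.0, 4.0, 5.0, 5.0]
summary = describe(ratings)
print(summary)
```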



In [9]:
# Configure ALS
als = ALS(maxIter=5, regParam=0.01, userCol='userId', itemCol='movieId', ratingCol='rating',
          coldStartStrategy='drop')

# Split the data: 80% for training and 20% for testing
training, test = ratedf.randomSplit([0.8,0.2])

model = als.fit(training)

# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error (RMSE) = " + str(rmse))

Root-mean-square error (RMSE) = 1.0806088998600494
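
For reference, RMSE is the square root of the mean squared difference between the actual and the predicted ratings. The sketch below computes it in pure Python on made-up values; the `rmse` function and the sample numbers are assumptions for illustration and are unrelated to the result printed above.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings and predictions.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(rmse(actual, predicted))  # 0.75
```

Read this way, an RMSE of about 1.08 suggests that predictions miss the held-out ratings by roughly one star on the 0.5 to 5.0 scale, with larger errors weighted more heavily than small ones.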


In [10]:
# Generate top 5 movie recommendations for each user
userRecs = model.recommendForAllUsers(5)

userRecs.show(truncate=False)

+------+--------------------------------------------------------------------------------------------------+
|userId|recommendations                                                                                   |
+------+--------------------------------------------------------------------------------------------------+
|1     |[{183897, 7.972645}, {1866, 7.9014792}, {945, 7.458456}, {4821, 7.1990886}, {134368, 7.081739}]   |
|2     |[{45668, 9.398457}, {1866, 9.235968}, {102903, 8.895226}, {54256, 8.519}, {1468, 8.437317}]       |
|3     |[{74754, 7.0828533}, {1232, 6.5853977}, {1034, 6.4000034}, {8810, 6.248864}, {5055, 6.1870885}]   |
|4     |[{3134, 8.045471}, {1483, 7.8819046}, {48322, 7.5527663}, {6993, 7.5115657}, {4642, 7.363976}]    |
|5     |[{4642, 9.11238}, {1483, 8.348793}, {8910, 8.155349}, {2469, 7.960324}, {714, 7.8431997}]         |
|6     |[{2937, 6.9683504}, {1354, 6.6519384}, {3134, 6.6474795}, {183897, 6.5362234}, {3676, 6.378956}]  |
|7     |[{1464, 8.312141}, {

In [11]:
for line in userRecs.take(3):
    print(line)

Row(userId=1, recommendations=[Row(movieId=183897, rating=7.972644805908203), Row(movieId=1866, rating=7.901479244232178), Row(movieId=945, rating=7.458456039428711), Row(movieId=4821, rating=7.1990885734558105), Row(movieId=134368, rating=7.0817389488220215)])
Row(userId=2, recommendations=[Row(movieId=45668, rating=9.398456573486328), Row(movieId=1866, rating=9.235967636108398), Row(movieId=102903, rating=8.895225524902344), Row(movieId=54256, rating=8.519000053405762), Row(movieId=1468, rating=8.43731689453125)])
Row(userId=3, recommendations=[Row(movieId=74754, rating=7.082853317260742), Row(movieId=1232, rating=6.585397720336914), Row(movieId=1034, rating=6.400003433227539), Row(movieId=8810, rating=6.24886417388916), Row(movieId=5055, rating=6.187088489532471)])
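
The `rating` values in these rows are predicted scores, not observed ratings. An ALS model learns one latent factor vector per user and one per movie, and the predicted score is the dot product of the two. Nothing clips that product to the 0.5 to 5.0 scale, which is why scores such as 7.97 can appear. The sketch below illustrates the idea in pure Python with made-up rank-2 factors; `predict` and `top_k` are hypothetical helpers, not the Spark API.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def predict(user_factors: Sequence[float], item_factors: Sequence[float]) -> float:
    """Predicted score: dot product of the user and item factor vectors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))


def top_k(
    user_factors: Sequence[float],
    items: Dict[int, Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return the k (movieId, score) pairs with the highest predicted score."""
    scored = ((movie_id, predict(user_factors, f)) for movie_id, f in items.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical rank-2 factors for one user and three movies.
user = [2.0, 1.0]
movies = {1: [1.5, 0.5], 2: [0.5, 0.5], 3: [3.0, 1.0]}
print(top_k(user, movies, 2))  # [(3, 7.0), (1, 3.5)]
```

In this toy example, movie 3 scores 7.0, above the top of the rating scale, even though every factor value is small.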


In [12]:
# Generate top 5 user recommendations for each movie
movieRecs = model.recommendForAllItems(5)
movieRecs.show(truncate=False)

+-------+-----------------------------------------------------------------------------------------+
|movieId|recommendations                                                                          |
+-------+-----------------------------------------------------------------------------------------+
|1      |[{498, 6.5737}, {413, 5.9587665}, {344, 5.951416}, {278, 5.571675}, {393, 5.489744}]     |
|12     |[{259, 12.570951}, {231, 10.248701}, {67, 9.625432}, {77, 9.32859}, {569, 9.111862}]     |
|13     |[{77, 7.005076}, {224, 6.501129}, {569, 5.739014}, {485, 5.6819344}, {498, 5.597794}]    |
|22     |[{243, 5.750332}, {502, 5.679482}, {548, 5.609611}, {173, 5.3974214}, {461, 5.329376}]   |
|26     |[{461, 6.608678}, {548, 5.668672}, {236, 5.5750694}, {423, 5.572011}, {468, 5.2249146}]  |
|27     |[{549, 9.383605}, {548, 8.301372}, {497, 7.9008756}, {557, 7.6610274}, {423, 7.2323446}] |
|28     |[{494, 7.82404}, {423, 7.47771}, {407, 7.271427}, {53, 7.12323}, {120, 7.085224}]        |


In [13]:
for line in movieRecs.take(4):
    print(line)

Row(movieId=1, recommendations=[Row(userId=498, rating=6.573699951171875), Row(userId=413, rating=5.958766460418701), Row(userId=344, rating=5.951416015625), Row(userId=278, rating=5.571674823760986), Row(userId=393, rating=5.489744186401367)])
Row(movieId=12, recommendations=[Row(userId=259, rating=12.570951461791992), Row(userId=231, rating=10.248701095581055), Row(userId=67, rating=9.625432014465332), Row(userId=77, rating=9.328590393066406), Row(userId=569, rating=9.111862182617188)])
Row(movieId=13, recommendations=[Row(userId=77, rating=7.005075931549072), Row(userId=224, rating=6.501129150390625), Row(userId=569, rating=5.739014148712158), Row(userId=485, rating=5.681934356689453), Row(userId=498, rating=5.597794055938721)])
Row(movieId=22, recommendations=[Row(userId=243, rating=5.750331878662109), Row(userId=502, rating=5.6794819831848145), Row(userId=548, rating=5.6096110343933105), Row(userId=173, rating=5.397421360015869), Row(userId=461, rating=5.329376220703125)])
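
The recommendation rows identify movies only by `movieId`. Once a handful of rows have been collected to the driver, the IDs can be resolved to titles with a dictionary lookup. The sketch below is pure Python: the titles are copied from the first rows of `movies.csv` shown earlier, while the `(movieId, score)` pairs and the `with_titles` helper are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# movieId -> title, copied from the first rows of movies.csv shown earlier.
titles: Dict[int, str] = {
    1: "Toy Story (1995)",
    2: "Jumanji (1995)",
    3: "Grumpier Old Men (1995)",
}


def with_titles(
    recs: Sequence[Tuple[int, float]],
    titles: Dict[int, str],
) -> List[Tuple[str, float]]:
    """Replace each movieId with its title, keeping the predicted score."""
    return [
        (titles.get(movie_id, f"<unknown movieId {movie_id}>"), score)
        for movie_id, score in recs
    ]


# Hypothetical (movieId, predicted score) pairs for one user.
recs = [(3, 4.8), (1, 4.5), (99, 4.1)]
for title, score in with_titles(recs, titles):
    print(f"{score:.1f}  {title}")
```

For full-size results that should stay in Spark, exploding the `recommendations` column and joining with `moviedf` on `movieId` would serve the same purpose.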


In [14]:
print("--- Execution time: %s seconds ---" % (time.time() - start_time))

--- Execution time: 32.838313579559326 seconds ---


In [None]:
# input() returns a string, so convert it to an integer before comparing with userId
enter_user = int(input("Enter the ID of the user to get recommendations for: "))
userRecs.where(userRecs.userId == enter_user).select("recommendations.movieId",
              "recommendations.rating").collect()

In [None]:
# input() returns a string, so convert it to an integer before comparing with movieId
enter_movie = int(input("Enter the ID of the movie to recommend to users: "))
movieRecs.where(movieRecs.movieId == enter_movie).select("recommendations.userId",
                "recommendations.rating").collect()

In [None]:
sc.stop()