# Example 09: Recommender System
## MovieLens Movie Recommendation

A recommender system selects personalized items for a user (customer) based on that user's interests and on the interests of similar users, taking into account the context they share. The technique can recommend many kinds of items, such as books, movies, news, music, videos, web pages, and products in an online store. This example recommends movies to a given user based on that user's tastes and the tastes of similar users. The dataset comes from the MovieLens movie recommendation site.
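
To make the idea of "users with similar tastes" concrete, here is a minimal sketch of user-based collaborative filtering in plain Python. The toy ratings, the user names, and the helpers `cosine_similarity` and `recommend` are hypothetical illustrations. The notebook itself relies on Spark's ALS matrix factorization, not on this neighbor-based approach.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {movie: rating}. This is not MovieLens data.
ratings = {
    "ana":   {"A": 5.0, "B": 4.0, "C": 1.0},
    "bruno": {"A": 5.0, "B": 5.0, "D": 5.0},
    "carla": {"A": 1.0, "C": 5.0, "E": 3.0},
}

def cosine_similarity(u, v):
    """Cosine similarity computed over the movies both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    norm_u = sqrt(sum(u[m] ** 2 for m in common))
    norm_v = sqrt(sum(v[m] ** 2 for m in common))
    return dot / (norm_u * norm_v)

def recommend(target, ratings, top_n=1):
    """Rank movies the target has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        sim = cosine_similarity(ratings[target], other_ratings)
        for movie, rating in other_ratings.items():
            if movie in ratings[target]:
                continue  # skip movies the target already rated
            scores[movie] = scores.get(movie, 0.0) + sim * rating
            weights[movie] = weights.get(movie, 0.0) + sim
    ranked = sorted(
        ((scores[m] / weights[m], m) for m in scores if weights[m] > 0),
        reverse=True,
    )
    return [movie for _, movie in ranked[:top_n]]

top_for_ana = recommend("ana", ratings)
print(top_for_ana)
```

In this toy data, "ana" agrees closely with "bruno" on the movies they share, so the movie that only "bruno" rated highly ranks first for her.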

In [1]:
import pyspark
from pyspark.sql import SparkSession, Row

from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

import time
start_time = time.time()

data_path='./data/'

# Try a larger dataset
#data_path='/data/dataset/MovieLens/ml-20m/'

In [2]:
# Create Spark Session
sc = SparkSession.builder \
     .master("local[*]") \
     .appName("Movies_Recommendation") \
     .getOrCreate()

## Ratings Data File Structure (ratings.csv)
-----------------------------------------

All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

    userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
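
As a small illustration of the format described above, the sketch below parses rows in the `ratings.csv` layout with the standard library, checks the half-star scale, and converts the timestamp to a UTC datetime. The sample rows and the helper `is_valid_rating` are assumptions made for the example; the timestamps in particular are illustrative values.

```python
import csv
import io
from datetime import datetime, timezone

# Illustrative sample in the ratings.csv layout (header plus two rows).
sample = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,964982703\n"
    "1,3,4.0,964981247\n"
)

def is_valid_rating(rating):
    """True for values from 0.5 to 5.0 in half-star increments."""
    return 0.5 <= rating <= 5.0 and (rating * 2).is_integer()

parsed = []
for row in csv.DictReader(sample):
    parsed.append({
        "userId": int(row["userId"]),
        "movieId": int(row["movieId"]),
        "rating": float(row["rating"]),
        # Seconds since 1970-01-01 00:00:00 UTC -> timezone-aware datetime.
        "rated_at": datetime.fromtimestamp(int(row["timestamp"]), tz=timezone.utc),
    })

for record in parsed:
    print(record["userId"], record["movieId"], record["rating"],
          record["rated_at"].isoformat(), is_valid_rating(record["rating"]))
```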

In [3]:
df = sc.read.format("csv").options(sep=',',header='true',inferschema='true').\
     load(data_path+"ratings.csv.gz")

df.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



In [4]:
ratedf = df.select(['userId', 'movieId', 'rating'])
ratedf.head(5)

[Row(userId=1, movieId=1, rating=4.0),
 Row(userId=1, movieId=3, rating=4.0),
 Row(userId=1, movieId=6, rating=4.0),
 Row(userId=1, movieId=47, rating=5.0),
 Row(userId=1, movieId=50, rating=5.0)]

In [5]:
ratedf.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   4.0|
|     1|      3|   4.0|
|     1|      6|   4.0|
|     1|     47|   5.0|
|     1|     50|   5.0|
|     1|     70|   3.0|
|     1|    101|   5.0|
|     1|    110|   4.0|
|     1|    151|   5.0|
|     1|    157|   5.0|
|     1|    163|   5.0|
|     1|    216|   5.0|
|     1|    223|   3.0|
|     1|    231|   5.0|
|     1|    235|   4.0|
|     1|    260|   5.0|
|     1|    296|   3.0|
|     1|    316|   3.0|
|     1|    333|   5.0|
|     1|    349|   4.0|
+------+-------+------+
only showing top 20 rows



## Movies Data File Structure (movies.csv)
---------------------------------------

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,title,genres

Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses.

Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)
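
The two conventions described above (a pipe-separated genre list and the release year in parentheses at the end of the title) can be unpacked with a few lines of Python. This is a sketch only: `split_genres` and `release_year` are hypothetical helper names, and the sample record is the Toy Story row that appears in this notebook's `movies.csv` output.

```python
import re

# One record in the movies.csv layout (the Toy Story row from this dataset).
movie = {
    "movieId": 1,
    "title": "Toy Story (1995)",
    "genres": "Adventure|Animation|Children|Comedy|Fantasy",
}

def split_genres(genres):
    """Split the pipe-separated genres field; the placeholder means no genres."""
    if genres == "(no genres listed)":
        return []
    return genres.split("|")

def release_year(title):
    """Return the year found in trailing parentheses, or None if there is none."""
    match = re.search(r"\((\d{4})\)\s*$", title)
    return int(match.group(1)) if match else None

genre_list = split_genres(movie["genres"])
year = release_year(movie["title"])
print(genre_list, year)
```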

In [6]:
df = sc.read.format("csv").options(sep=',',header='true',inferschema='true').\
     load(data_path+"movies.csv.gz")

df.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)



In [7]:
moviedf = df.select(['movieId', 'title', 'genres'])
moviedf.head(5)

[Row(movieId=1, title='Toy Story (1995)', genres='Adventure|Animation|Children|Comedy|Fantasy'),
 Row(movieId=2, title='Jumanji (1995)', genres='Adventure|Children|Fantasy'),
 Row(movieId=3, title='Grumpier Old Men (1995)', genres='Comedy|Romance'),
 Row(movieId=4, title='Waiting to Exhale (1995)', genres='Comedy|Drama|Romance'),
 Row(movieId=5, title='Father of the Bride Part II (1995)', genres='Comedy')]

## Rating Statistics

In [8]:
ratedf.describe().show()

+-------+------------------+----------------+------------------+
|summary|            userId|         movieId|            rating|
+-------+------------------+----------------+------------------+
|  count|            100836|          100836|            100836|
|   mean|326.12756356856676|19435.2957177992| 3.501556983616962|
| stddev| 182.6184914635004|35530.9871987003|1.0425292390606342|
|    min|                 1|               1|               0.5|
|    max|               610|          193609|               5.0|
+-------+------------------+----------------+------------------+
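
For intuition about what `describe()` reports, the sketch below recomputes the same five summary values in plain Python for a small sample: the 20 ratings of user 1 displayed in the ratings table earlier in this notebook. It assumes that the `stddev` row is the sample (n - 1) standard deviation, which is what `statistics.stdev` computes.

```python
import statistics

# Ratings from the first 20 rows of the ratings table shown earlier (user 1).
sample_ratings = [4.0, 4.0, 4.0, 5.0, 5.0, 3.0, 5.0, 4.0, 5.0, 5.0,
                  5.0, 5.0, 3.0, 5.0, 4.0, 5.0, 3.0, 3.0, 5.0, 4.0]

summary = {
    "count": len(sample_ratings),
    "mean": statistics.mean(sample_ratings),
    "stddev": statistics.stdev(sample_ratings),  # sample (n - 1) standard deviation
    "min": min(sample_ratings),
    "max": max(sample_ratings),
}

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

Note that the `userId` and `movieId` columns in the table above are identifiers, so their mean and standard deviation carry no useful meaning; only the `rating` column is worth interpreting.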



In [9]:
# Configure ALS
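# coldStartStrategy='drop' discards predictions for users or movies that were
# not seen during training, so the RMSE is not computed over NaN values
als = ALS(maxIter=5, regParam=0.01, userCol='userId', itemCol='movieId', ratingCol='rating',
          coldStartStrategy='drop')

# Split the dataset: 80% for training and 20% for testing
training, test = ratedf.randomSplit([0.8, 0.2])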

model = als.fit(training)

# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error (RMSE) = " + str(rmse))

Root-mean-square error (RMSE) = 1.0750506567550213
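
The RMSE reported above is the square root of the mean squared difference between actual and predicted ratings. The sketch below computes the same metric in plain Python on a handful of made-up values; the function `rmse` and the sample lists are illustrative and not part of the notebook.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root-mean-square error between two equally long, non-empty sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Made-up ratings and predictions, for illustration only.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.5]

score = rmse(actual, predicted)
print(score)
```

An RMSE near 1.08, as in the run above, means the typical prediction error is roughly one star on the 5-star scale. Because large errors are squared, a few badly wrong predictions raise the metric more than many small ones.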


In [10]:
# Generate top 5 movie recommendations for each user
userRecs = model.recommendForAllUsers(5)

userRecs.show(truncate=False)

+------+-------------------------------------------------------------------------------------------------+
|userId|recommendations                                                                                  |
+------+-------------------------------------------------------------------------------------------------+
|1     |[{85, 6.4950376}, {247, 6.0013113}, {28, 5.986198}, {44761, 5.885227}, {51931, 5.8429537}]       |
|2     |[{1726, 8.964352}, {85, 8.619324}, {6731, 7.3211703}, {28, 6.9898663}, {2261, 6.8948894}]        |
|3     |[{7991, 5.151084}, {37830, 5.1316137}, {2851, 5.117914}, {6835, 5.103821}, {5746, 5.103821}]     |
|4     |[{1620, 6.739647}, {2275, 6.5970106}, {2901, 6.554268}, {446, 6.438507}, {1173, 6.3667836}]      |
|5     |[{85, 7.7918}, {41716, 7.1699595}, {2530, 6.657156}, {1334, 6.6378627}, {2387, 6.5986567}]       |
|6     |[{1411, 6.277818}, {3618, 6.103948}, {1223, 6.039419}, {85, 5.994721}, {2384, 5.9404774}]        |
|7     |[{52435, 8.398609}, {3754, 8.

In [11]:
for line in userRecs.take(3):
    print(line)

Row(userId=1, recommendations=[Row(movieId=85, rating=6.49503755569458), Row(movieId=247, rating=6.001311302185059), Row(movieId=28, rating=5.9861979484558105), Row(movieId=44761, rating=5.885227203369141), Row(movieId=51931, rating=5.842953681945801)])
Row(userId=2, recommendations=[Row(movieId=1726, rating=8.964351654052734), Row(movieId=85, rating=8.61932373046875), Row(movieId=6731, rating=7.321170330047607), Row(movieId=28, rating=6.989866256713867), Row(movieId=2261, rating=6.8948893547058105)])
Row(userId=3, recommendations=[Row(movieId=7991, rating=5.151083946228027), Row(movieId=37830, rating=5.131613731384277), Row(movieId=2851, rating=5.117914199829102), Row(movieId=6835, rating=5.10382080078125), Row(movieId=5746, rating=5.10382080078125)])
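
Several predicted ratings above exceed 5.0 even though the rating scale stops at 5 stars. ALS fits an unconstrained regression, so nothing bounds its output to the rating scale. The scores still work for ranking; if they are to be displayed as star ratings, one option is to clamp them after ranking. The helper `clamp_rating` below is a hypothetical sketch, and the sample pairs are the user 3 recommendations from the output above, rounded to three decimals.

```python
MIN_RATING, MAX_RATING = 0.5, 5.0

def clamp_rating(value, low=MIN_RATING, high=MAX_RATING):
    """Limit a predicted rating to the valid range of the rating scale."""
    return max(low, min(high, value))

# (movieId, predicted rating) pairs for user 3, rounded from the output above.
user3_recs = [(7991, 5.151), (37830, 5.132), (2851, 5.118), (6835, 5.104), (5746, 5.104)]

# Clamp only for display; keep the original order, which reflects the raw scores.
display_recs = [(movie_id, clamp_rating(score)) for movie_id, score in user3_recs]
print(display_recs)
```

Clamping makes every score in this list equal to 5.0, which is why ranking should use the raw predictions and clamping should be applied only when presenting the result.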


In [12]:
# Generate top 5 user recommendations for each movie
movieRecs = model.recommendForAllItems(5)
movieRecs.show(truncate=False)

+-------+-----------------------------------------------------------------------------------------+
|movieId|recommendations                                                                          |
+-------+-----------------------------------------------------------------------------------------+
|1      |[{423, 5.47678}, {543, 5.4741597}, {569, 5.4077687}, {574, 5.4033713}, {106, 5.383932}]  |
|12     |[{295, 8.182247}, {472, 8.027278}, {257, 7.9246087}, {77, 7.6918116}, {459, 7.509772}]   |
|13     |[{423, 7.3308654}, {399, 7.042202}, {485, 6.979488}, {407, 6.676213}, {196, 5.9709945}]  |
|22     |[{130, 6.118208}, {192, 6.0991898}, {375, 5.9425855}, {37, 5.922715}, {531, 5.89345}]    |
|26     |[{485, 8.167679}, {461, 7.369815}, {498, 7.1125364}, {468, 6.854341}, {458, 6.8336816}]  |
|27     |[{485, 10.336701}, {468, 8.800411}, {548, 8.262348}, {502, 8.033107}, {107, 7.5116234}]  |
|28     |[{536, 12.047627}, {472, 9.742122}, {251, 9.548626}, {548, 8.825282}, {557, 8.541237}]   |


In [13]:
for line in movieRecs.take(4):
    print(line)

Row(movieId=1, recommendations=[Row(userId=423, rating=5.476779937744141), Row(userId=543, rating=5.4741597175598145), Row(userId=569, rating=5.407768726348877), Row(userId=574, rating=5.403371334075928), Row(userId=106, rating=5.383932113647461)])
Row(movieId=12, recommendations=[Row(userId=295, rating=8.182247161865234), Row(userId=472, rating=8.027277946472168), Row(userId=257, rating=7.9246087074279785), Row(userId=77, rating=7.691811561584473), Row(userId=459, rating=7.509771823883057)])
Row(movieId=13, recommendations=[Row(userId=423, rating=7.330865383148193), Row(userId=399, rating=7.042201995849609), Row(userId=485, rating=6.979487895965576), Row(userId=407, rating=6.676212787628174), Row(userId=196, rating=5.970994472503662)])
Row(movieId=22, recommendations=[Row(userId=130, rating=6.118207931518555), Row(userId=192, rating=6.099189758300781), Row(userId=375, rating=5.942585468292236), Row(userId=37, rating=5.922715187072754), Row(userId=531, rating=5.893449783325195)])


In [14]:
print("--- Execution time: %s seconds ---" % (time.time() - start_time))

--- Execution time: 32.386048793792725 seconds ---


In [15]:
# input() returns a string, so convert it to an integer before comparing with userId
enter_user = int(input("Enter the user ID to get recommendations for:"))
userRecs.where(userRecs.userId == enter_user).select("recommendations.movieId",
              "recommendations.rating").collect()

Enter the user ID to get recommendations for: 1


[Row(movieId=[85, 247, 28, 44761, 51931], rating=[6.49503755569458, 6.001311302185059, 5.9861979484558105, 5.885227203369141, 5.842953681945801])]
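
The query above returns two parallel arrays, one with movie IDs and one with predicted ratings. A short sketch of pairing them into one record per recommendation follows. The two lists are copied from the output above into plain Python lists, and the `recommendations` structure is an illustrative choice, not something the notebook defines.

```python
# Values copied from the collected row above.
movie_ids = [85, 247, 28, 44761, 51931]
scores = [6.49503755569458, 6.001311302185059, 5.9861979484558105,
          5.885227203369141, 5.842953681945801]

# zip() pairs the i-th movie ID with the i-th predicted rating.
recommendations = [
    {"movieId": movie_id, "score": round(score, 2)}
    for movie_id, score in zip(movie_ids, scores)
]

for rec in recommendations:
    print(rec["movieId"], rec["score"])
```

With one record per movie, the IDs can then be matched against `movies.csv` to show titles instead of numbers.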

In [16]:
# input() returns a string, so convert it to an integer before comparing with movieId
enter_movie = int(input("Enter the movie ID to find users to recommend it to:"))
movieRecs.where(movieRecs.movieId == enter_movie).select("recommendations.userId",
                "recommendations.rating").collect()

Enter the movie ID to find users to recommend it to: 12


[Row(userId=[295, 472, 257, 77, 459], rating=[8.182247161865234, 8.027277946472168, 7.9246087074279785, 7.691811561584473, 7.509771823883057])]

In [17]:
sc.stop()