# Example 09: Recommender System
## MovieLens Movie Recommendation

A recommender system selects personalized items for a user (customer) based on that user's interests and on the interests of similar users, taking into account the context they share. The technique can recommend a wide range of items, such as books, movies, news, music, videos, web pages, and products in an online store. This example recommends movies to a given user based on that user's tastes and the tastes of similar users. The dataset comes from the MovieLens movie recommendation site.
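To make the idea of "similar users" concrete, here is a minimal sketch of a user-based neighborhood predictor in plain Python. It is not the ALS matrix factorization that this notebook trains; the toy ratings, the user names `u1` to `u3`, and the helper names `cosine` and `predict` are all hypothetical and exist only for illustration.

```python
from math import sqrt

# Hypothetical toy ratings (user -> {item: rating}); not taken from MovieLens
ratings = {
    "u1": {"A": 5.0, "B": 4.0, "C": 1.0},
    "u2": {"A": 5.0, "B": 5.0, "D": 4.0},
    "u3": {"A": 1.0, "C": 5.0, "D": 2.0},
}

def cosine(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[item]
        den += sim
    return num / den if den else None

# u1 has not rated item "D"; the estimate leans toward u2, who is more similar to u1
print(predict("u1", "D"))
```

The estimate for `u1` ends up closer to the rating given by `u2` than to the one given by `u3`, because `u1` and `u2` agree on the items they both rated.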

In [1]:
import pyspark
from pyspark.sql import SparkSession, Row

from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

import time
start_time = time.time()

data_path='./data/'

# Try a larger dataset
#data_path='/data/dataset/MovieLens/ml-20m/'

In [2]:
# Create Spark Session
spark = SparkSession.builder \
       .master("local[*]") \
       .appName("Movies_Recommendation") \
       .getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/02/08 18:51:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/02/08 18:51:53 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


## Ratings Data File Structure (ratings.csv)

All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

    userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
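As a small illustration of the timestamp format, the sketch below converts a raw value into a timezone-aware UTC datetime using only the standard library. The helper name `rating_time` is hypothetical.

```python
from datetime import datetime, timezone

def rating_time(ts):
    """Convert a timestamp in seconds since the Unix epoch to an aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

print(rating_time(0).isoformat())  # 1970-01-01T00:00:00+00:00
```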

In [3]:
df = spark.read.format("csv").options(sep=',',header='true',inferschema='true').\
     load(data_path+"ratings.csv.gz")

df.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



In [4]:
ratedf = df.select(['userId', 'movieId', 'rating'])
ratedf.head(5)

[Row(userId=1, movieId=1, rating=4.0),
 Row(userId=1, movieId=3, rating=4.0),
 Row(userId=1, movieId=6, rating=4.0),
 Row(userId=1, movieId=47, rating=5.0),
 Row(userId=1, movieId=50, rating=5.0)]

In [5]:
ratedf.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   4.0|
|     1|      3|   4.0|
|     1|      6|   4.0|
|     1|     47|   5.0|
|     1|     50|   5.0|
|     1|     70|   3.0|
|     1|    101|   5.0|
|     1|    110|   4.0|
|     1|    151|   5.0|
|     1|    157|   5.0|
|     1|    163|   5.0|
|     1|    216|   5.0|
|     1|    223|   3.0|
|     1|    231|   5.0|
|     1|    235|   4.0|
|     1|    260|   5.0|
|     1|    296|   3.0|
|     1|    316|   3.0|
|     1|    333|   5.0|
|     1|    349|   4.0|
+------+-------+------+
only showing top 20 rows



## Movies Data File Structure (movies.csv)

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,title,genres

Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses.

Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)
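The pipe-separated `genres` field can be split into a list with a few lines of Python. This is a minimal sketch: the helper name `split_genres` is hypothetical, and mapping the `(no genres listed)` placeholder to an empty list is an assumption about how one might want to treat it.

```python
def split_genres(genres):
    """Split the pipe-separated genres field into a list of genre names."""
    # Assumption: treat the placeholder value as "no genres" rather than as a genre
    if genres == "(no genres listed)":
        return []
    return genres.split("|")

print(split_genres("Adventure|Animation|Children|Comedy|Fantasy"))
# ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']
```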

In [6]:
df = spark.read.format("csv").options(sep=',',header='true',inferschema='true').\
           load(data_path+"movies.csv.gz")

df.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)



In [7]:
moviedf = df.select(['movieId', 'title', 'genres'])
moviedf.head(5)

[Row(movieId=1, title='Toy Story (1995)', genres='Adventure|Animation|Children|Comedy|Fantasy'),
 Row(movieId=2, title='Jumanji (1995)', genres='Adventure|Children|Fantasy'),
 Row(movieId=3, title='Grumpier Old Men (1995)', genres='Comedy|Romance'),
 Row(movieId=4, title='Waiting to Exhale (1995)', genres='Comedy|Drama|Romance'),
 Row(movieId=5, title='Father of the Bride Part II (1995)', genres='Comedy')]

## Rating Statistics

In [8]:
ratedf.describe().show()

+-------+------------------+----------------+------------------+
|summary|            userId|         movieId|            rating|
+-------+------------------+----------------+------------------+
|  count|            100836|          100836|            100836|
|   mean|326.12756356856676|19435.2957177992| 3.501556983616962|
| stddev| 182.6184914635004|35530.9871987003|1.0425292390606342|
|    min|                 1|               1|               0.5|
|    max|               610|          193609|               5.0|
+-------+------------------+----------------+------------------+



In [9]:
# Configure ALS
als = ALS(maxIter=5, regParam=0.01, userCol='userId', itemCol='movieId', ratingCol='rating', \
          coldStartStrategy='drop')

# Split the dataset: 80% for training and 20% for testing
training, test = ratedf.randomSplit([0.8,0.2])

model = als.fit(training)

# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error (RMSE) = " + str(rmse))

22/02/08 18:52:02 WARN InstanceBuilder$NativeLAPACK: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK


Root-mean-square error (RMSE) = 1.0942630918977896
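For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The plain-Python sketch below computes it for a handful of made-up pairs; the helper name `rmse` and the sample values are hypothetical and unrelated to the result printed above.

```python
from math import sqrt

def rmse(pairs):
    """Root-mean-square error over (actual, predicted) rating pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("rmse() needs at least one (actual, predicted) pair")
    return sqrt(sum((actual - predicted) ** 2 for actual, predicted in pairs) / len(pairs))

# Hypothetical (actual, predicted) pairs: squared errors are 1, 0 and 4
sample = [(4.0, 3.0), (5.0, 5.0), (3.0, 5.0)]
print(rmse(sample))
```

An RMSE of about 1 on a 0.5 to 5.0 scale means predictions are typically off by roughly one star.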


In [10]:
# Generate top 5 movie recommendations for each user
userRecs = model.recommendForAllUsers(5)

userRecs.show(truncate=False)



+------+--------------------------------------------------------------------------------------------------+
|userId|recommendations                                                                                   |
+------+--------------------------------------------------------------------------------------------------+
|1     |[{3435, 6.0119767}, {55276, 5.7843413}, {3134, 5.771266}, {1797, 5.7708206}, {3846, 5.7700787}]   |
|2     |[{213, 8.605273}, {417, 8.542724}, {3019, 7.9154744}, {79592, 7.9052444}, {1468, 7.7565255}]      |
|3     |[{3265, 6.9433246}, {135861, 6.383185}, {112623, 6.3495502}, {487, 5.941598}, {7169, 5.869364}]   |
|4     |[{5135, 8.165978}, {166534, 7.6370225}, {4041, 7.4442196}, {26865, 7.317074}, {3030, 7.229325}]   |
|5     |[{446, 10.052595}, {3089, 9.013862}, {7099, 8.805123}, {2361, 8.537218}, {5135, 8.521127}]        |
|6     |[{3265, 7.1632953}, {232, 7.1197286}, {166534, 6.882869}, {28, 6.5892534}, {3341, 6.5189123}]     |
|7     |[{3134, 11.69991}, {

                                                                                

In [11]:
for line in userRecs.take(3):
    print(line)

Row(userId=1, recommendations=[Row(movieId=3435, rating=6.011976718902588), Row(movieId=55276, rating=5.784341335296631), Row(movieId=3134, rating=5.771265983581543), Row(movieId=1797, rating=5.770820617675781), Row(movieId=3846, rating=5.770078659057617)])
Row(userId=2, recommendations=[Row(movieId=213, rating=8.605273246765137), Row(movieId=417, rating=8.542723655700684), Row(movieId=3019, rating=7.9154744148254395), Row(movieId=79592, rating=7.90524435043335), Row(movieId=1468, rating=7.75652551651001)])
Row(userId=3, recommendations=[Row(movieId=3265, rating=6.943324565887451), Row(movieId=135861, rating=6.383184909820557), Row(movieId=112623, rating=6.349550247192383), Row(movieId=487, rating=5.941597938537598), Row(movieId=7169, rating=5.869363784790039)])
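Each row nests a list of `(movieId, rating)` records under a single user, which is awkward to filter or join. The sketch below flattens that shape into one tuple per recommendation. It uses `namedtuple` stand-ins so it runs without Spark; the names `Rec`, `UserRecs`, and `flatten` are hypothetical. Because `pyspark.sql.Row` also supports attribute access, the same function should work on the rows returned by `userRecs.take(...)`, though that is not verified here.

```python
from collections import namedtuple

# Stand-ins for pyspark.sql.Row so the sketch runs without a Spark session
Rec = namedtuple("Rec", ["movieId", "rating"])
UserRecs = namedtuple("UserRecs", ["userId", "recommendations"])

def flatten(rows):
    """Yield one (userId, movieId, score) tuple per nested recommendation."""
    for row in rows:
        for rec in row.recommendations:
            yield (row.userId, rec.movieId, rec.rating)

# Two of the recommendations shown above for user 1
rows = [UserRecs(1, [Rec(3435, 6.0119767), Rec(55276, 5.7843413)])]
for triple in flatten(rows):
    print(triple)
```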


In [12]:
# Generate top 5 user recommendations for each movie
movieRecs = model.recommendForAllItems(5)
movieRecs.show(truncate=False)



+-------+-----------------------------------------------------------------------------------------+
|movieId|recommendations                                                                          |
+-------+-----------------------------------------------------------------------------------------+
|28     |[{173, 11.805707}, {392, 9.127658}, {120, 9.0567255}, {407, 8.947717}, {574, 8.853763}]  |
|31     |[{295, 7.114854}, {67, 6.668771}, {557, 6.567077}, {548, 6.3582892}, {441, 6.304737}]    |
|34     |[{295, 6.121164}, {35, 5.965703}, {203, 5.590754}, {451, 5.531224}, {267, 5.4585023}]    |
|53     |[{126, 8.545336}, {360, 8.002896}, {203, 7.9733253}, {120, 7.962228}, {173, 7.8958163}]  |
|65     |[{441, 7.320288}, {548, 5.8378067}, {497, 5.803167}, {539, 5.5712986}, {598, 5.507638}]  |
|76     |[{295, 9.218039}, {329, 8.212451}, {441, 8.037064}, {35, 7.89428}, {433, 6.9272914}]     |
|78     |[{574, 6.4934654}, {549, 5.601421}, {257, 5.4445596}, {589, 5.40549}, {174, 5.3622537}]  |


                                                                                

In [13]:
for line in movieRecs.take(4):
   print(line)



Row(movieId=28, recommendations=[Row(userId=173, rating=11.805706977844238), Row(userId=392, rating=9.127657890319824), Row(userId=120, rating=9.05672550201416), Row(userId=407, rating=8.94771671295166), Row(userId=574, rating=8.85376262664795)])
Row(movieId=31, recommendations=[Row(userId=295, rating=7.114853858947754), Row(userId=67, rating=6.668770790100098), Row(userId=557, rating=6.567077159881592), Row(userId=548, rating=6.3582892417907715), Row(userId=441, rating=6.304737091064453)])
Row(movieId=34, recommendations=[Row(userId=295, rating=6.121163845062256), Row(userId=35, rating=5.965703010559082), Row(userId=203, rating=5.59075403213501), Row(userId=451, rating=5.531223773956299), Row(userId=267, rating=5.458502292633057)])
Row(movieId=53, recommendations=[Row(userId=126, rating=8.54533576965332), Row(userId=360, rating=8.002896308898926), Row(userId=203, rating=7.973325252532959), Row(userId=120, rating=7.962227821350098), Row(userId=173, rating=7.895816326141357)])


                                                                                

In [14]:
print("--- Execution time: %s seconds ---" % (time.time() - start_time))

--- Execution time: 21.476658582687378 seconds ---


In [15]:
# Read the user ID as an integer so it matches the integer userId column
enter_user = int(input("Enter the ID of the user to get recommendations for:"))
userRecs.where(userRecs.userId == enter_user).select("recommendations.movieId",
              "recommendations.rating").collect()

Enter the ID of the user to get recommendations for: 1


[Row(movieId=[3435, 55276, 3134, 1797, 3846], rating=[6.011976718902588, 5.784341335296631, 5.771265983581543, 5.770820617675781, 5.770078659057617])]

In [16]:
# Read the movie ID as an integer so it matches the integer movieId column
enter_movie = int(input("Enter the ID of the movie to recommend:"))
movieRecs.where(movieRecs.movieId == enter_movie).select("recommendations.userId",
                "recommendations.rating").collect()

Enter the ID of the movie to recommend: 28


[Row(userId=[173, 392, 120, 407, 574], rating=[11.805706977844238, 9.127657890319824, 9.05672550201416, 8.94771671295166, 8.85376262664795])]

In [17]:
spark.stop()