# Example 09: Recommender System
## MovieLens Movie Recommendation

A recommender system selects personalized items for a user (customer) based on that user's interests and on the interests of similar users, taking into account the context they share. The technique can recommend many kinds of items, such as books, movies, news, music, videos, web pages, and products in an online store. This example recommends movies to a given user based on that user's tastes and on the tastes of similar users. The dataset comes from the MovieLens movie recommendation site.

In [1]:
import pyspark
from pyspark.sql import SparkSession, Row

from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

import time
start_time = time.time()

data_path='./data/'

# Try a larger dataset
#data_path='/data/dataset/MovieLens/ml-20m/'

In [2]:
# Create Spark Session
sc = SparkSession.builder \
     .master("local[*]") \
     .appName("Movies_Recommendation") \
     .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/17 08:51:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Ratings Data File Structure (ratings.csv)

All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

    userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
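
The format described above can be checked by hand before loading the file into Spark. The following is a minimal sketch in plain Python, not part of the original notebook: the helper name `parse_rating_line` and the sample line (including its timestamp value) are illustrative assumptions.

```python
from datetime import datetime, timezone


def parse_rating_line(line):
    """Parse one 'userId,movieId,rating,timestamp' line into typed fields."""
    user_id, movie_id, rating, timestamp = line.strip().split(",")
    rating = float(rating)
    # Ratings run from 0.5 to 5.0 in half-star steps,
    # so rating * 2 must be a whole number between 1 and 10.
    if not (0.5 <= rating <= 5.0) or rating * 2 != int(rating * 2):
        raise ValueError(f"rating outside the half-star scale: {rating}")
    # Timestamps are seconds since 1970-01-01 00:00:00 UTC.
    rated_at = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return int(user_id), int(movie_id), rating, rated_at


# Hypothetical sample line; the timestamp value is illustrative only.
sample = parse_rating_line("1,1,4.0,964982703")
print(sample[3].isoformat())
```

Passing `tz=timezone.utc` keeps the conversion independent of the machine's local time zone, which matches the UTC definition given above.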

In [3]:
df = sc.read.format("csv").options(sep=',',header='true',inferschema='true').\
     load(data_path+"ratings.csv.gz")

df.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



In [4]:
ratedf = df.select(['userId', 'movieId', 'rating'])
ratedf.head(5)

[Row(userId=1, movieId=1, rating=4.0),
 Row(userId=1, movieId=3, rating=4.0),
 Row(userId=1, movieId=6, rating=4.0),
 Row(userId=1, movieId=47, rating=5.0),
 Row(userId=1, movieId=50, rating=5.0)]

In [5]:
ratedf.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   4.0|
|     1|      3|   4.0|
|     1|      6|   4.0|
|     1|     47|   5.0|
|     1|     50|   5.0|
|     1|     70|   3.0|
|     1|    101|   5.0|
|     1|    110|   4.0|
|     1|    151|   5.0|
|     1|    157|   5.0|
|     1|    163|   5.0|
|     1|    216|   5.0|
|     1|    223|   3.0|
|     1|    231|   5.0|
|     1|    235|   4.0|
|     1|    260|   5.0|
|     1|    296|   3.0|
|     1|    316|   3.0|
|     1|    333|   5.0|
|     1|    349|   4.0|
+------+-------+------+
only showing top 20 rows



## Movies Data File Structure (movies.csv)

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,title,genres

Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses.

Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)
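
Because the genres arrive as a single pipe-separated string, they usually need to be split before any per-genre analysis. The sketch below shows one way to do that in plain Python; the helper name `parse_genres` is an assumption for illustration, and the sample string is the genres value of the Toy Story row in `movies.csv`.

```python
def parse_genres(genres):
    """Split a pipe-separated genres field into a list of genre names."""
    # The placeholder value means the movie has no genre information.
    if genres == "(no genres listed)":
        return []
    return genres.split("|")


toy_story_genres = parse_genres("Adventure|Animation|Children|Comedy|Fantasy")
print(toy_story_genres)
print(parse_genres("(no genres listed)"))
```

On a Spark DataFrame the equivalent would probably use `pyspark.sql.functions.split`, where the pipe has to be escaped because the pattern is a regular expression.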

In [6]:
df = sc.read.format("csv").options(sep=',',header='true',inferschema='true').\
     load(data_path+"movies.csv.gz")

df.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)



In [7]:
moviedf = df.select(['movieId', 'title', 'genres'])
moviedf.head(5)

[Row(movieId=1, title='Toy Story (1995)', genres='Adventure|Animation|Children|Comedy|Fantasy'),
 Row(movieId=2, title='Jumanji (1995)', genres='Adventure|Children|Fantasy'),
 Row(movieId=3, title='Grumpier Old Men (1995)', genres='Comedy|Romance'),
 Row(movieId=4, title='Waiting to Exhale (1995)', genres='Comedy|Drama|Romance'),
 Row(movieId=5, title='Father of the Bride Part II (1995)', genres='Comedy')]

## Rating Statistics

In [8]:
ratedf.describe().show()

+-------+------------------+----------------+------------------+
|summary|            userId|         movieId|            rating|
+-------+------------------+----------------+------------------+
|  count|            100836|          100836|            100836|
|   mean|326.12756356856676|19435.2957177992| 3.501556983616962|
| stddev| 182.6184914635004|35530.9871987003|1.0425292390606342|
|    min|                 1|               1|               0.5|
|    max|               610|          193609|               5.0|
+-------+------------------+----------------+------------------+
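
To make the rows of this summary concrete, the sketch below recomputes the same five statistics in plain Python for the five ratings printed earlier by `ratedf.head(5)`. The helper name `summarize` is an assumption, and the sample standard deviation is used here on the understanding that this is what `describe()` reports.

```python
import statistics


def summarize(values):
    """Return count, mean, sample standard deviation, min, and max."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        # statistics.stdev divides by n - 1 (sample standard deviation).
        "stddev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


# Ratings of the first five rows shown by ratedf.head(5).
first_ratings = [4.0, 4.0, 4.0, 5.0, 5.0]
summary = summarize(first_ratings)
for name, value in summary.items():
    print(name, value)
```

Note that the `userId` and `movieId` columns are identifiers, so their mean and standard deviation in the table above carry no real meaning; only the count, minimum, and maximum are informative for them.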



In [9]:
# Configure ALS
als = ALS(maxIter=5, regParam=0.01, userCol='userId', itemCol='movieId', ratingCol='rating',
          coldStartStrategy='drop')

# Split the dataset: 80% for training and 20% for testing
training, test = ratedf.randomSplit([0.8, 0.2])

model = als.fit(training)

# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error (RMSE) = " + str(rmse))

24/01/17 08:51:33 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.VectorBLAS


Root-mean-square error (RMSE) = 1.0577009728715718
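
The RMSE reported above is the square root of the mean squared difference between the actual and the predicted ratings, so a value near 1.06 suggests that predictions are typically off by about one star. The sketch below computes the same metric in plain Python; the function name `rmse` and the three rating pairs are hypothetical and are not taken from the test set.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical actual ratings and model predictions.
actual_ratings = [4.0, 3.0, 5.0]
predicted_ratings = [3.5, 3.0, 4.0]
error = rmse(actual_ratings, predicted_ratings)
print(error)
```

Since `randomSplit` is called without a seed, the split differs between runs and the RMSE will vary slightly each time the notebook is executed.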


In [10]:
# Generate top 5 movie recommendations for each user
userRecs = model.recommendForAllUsers(5)

userRecs.show(truncate=False)



+------+--------------------------------------------------------------------------------------------------+
|userId|recommendations                                                                                   |
+------+--------------------------------------------------------------------------------------------------+
|1     |[{3676, 7.308286}, {89904, 7.224451}, {2568, 7.069147}, {102123, 7.0021195}, {119141, 6.728187}]  |
|2     |[{142422, 7.3629603}, {3791, 7.3262672}, {33880, 7.1006045}, {95720, 6.999377}, {3061, 6.8449497}]|
|3     |[{6380, 6.5234504}, {158872, 6.146253}, {166534, 5.626844}, {46948, 5.4502206}, {1965, 5.430073}] |
|4     |[{3272, 8.050189}, {3030, 7.4631953}, {2730, 7.4063616}, {1256, 7.284992}, {8154, 7.10944}]       |
|5     |[{4102, 6.661257}, {3266, 6.587204}, {2137, 6.2627063}, {3040, 6.23706}, {6461, 6.2127013}]       |
|6     |[{97304, 6.163364}, {1916, 6.0844436}, {89904, 5.848202}, {84847, 5.7961025}, {4041, 5.795106}]   |
|7     |[{3525, 7.865737}, {

                                                                                

In [11]:
for line in userRecs.take(3):
    print(line)

Row(userId=1, recommendations=[Row(movieId=3676, rating=7.308286190032959), Row(movieId=89904, rating=7.224451065063477), Row(movieId=2568, rating=7.069147109985352), Row(movieId=102123, rating=7.002119541168213), Row(movieId=119141, rating=6.728187084197998)])
Row(userId=2, recommendations=[Row(movieId=142422, rating=7.362960338592529), Row(movieId=3791, rating=7.326267242431641), Row(movieId=33880, rating=7.10060453414917), Row(movieId=95720, rating=6.9993767738342285), Row(movieId=3061, rating=6.844949722290039)])
Row(userId=3, recommendations=[Row(movieId=6380, rating=6.5234503746032715), Row(movieId=158872, rating=6.1462531089782715), Row(movieId=166534, rating=5.6268439292907715), Row(movieId=46948, rating=5.450220584869385), Row(movieId=1965, rating=5.430072784423828)])
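
Several predicted ratings above are greater than 5.0 even though the rating scale ends at 5 stars. An ALS model scores a user and movie pair as the dot product of two learned factor vectors, and nothing bounds that product to the rating scale. The sketch below illustrates this with hypothetical factor vectors (not values from the trained model) and shows one possible way to clip a score for display; the function names are assumptions.

```python
def predict_rating(user_factors, item_factors):
    """Dot product of a user's and an item's latent factor vectors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))


def clip_to_scale(score, low=0.5, high=5.0):
    """Clamp a raw score to the 0.5 to 5.0 star scale."""
    return max(low, min(high, score))


# Hypothetical latent factors for one user and one movie.
user_factors = [1.5, 2.0, 0.5]
item_factors = [2.0, 2.0, 1.0]

raw_score = predict_rating(user_factors, item_factors)
display_score = clip_to_scale(raw_score)
print(raw_score, display_score)
```

Clipping is only a presentation choice: for ranking recommendations the raw scores are more useful, because clipping would turn every score above 5.0 into a tie.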


In [12]:
# Generate top 5 user recommendations for each movie
movieRecs = model.recommendForAllItems(5)
movieRecs.show(truncate=False)



+-------+-----------------------------------------------------------------------------------------+
|movieId|recommendations                                                                          |
+-------+-----------------------------------------------------------------------------------------+
|12     |[{423, 13.603161}, {468, 11.867344}, {243, 9.948024}, {340, 9.324197}, {396, 8.880301}]  |
|26     |[{413, 8.366593}, {531, 7.087544}, {197, 6.893214}, {498, 6.8756533}, {536, 6.7145615}]  |
|27     |[{208, 7.948241}, {598, 7.8396993}, {569, 7.204832}, {505, 6.919182}, {544, 6.8781276}]  |
|28     |[{399, 9.210004}, {443, 8.201692}, {278, 7.092686}, {421, 6.9759607}, {566, 6.9277353}]  |
|31     |[{283, 7.9452305}, {485, 6.8117495}, {413, 6.778816}, {569, 6.6152673}, {505, 6.5838885}]|
|34     |[{413, 7.4244547}, {461, 6.837977}, {22, 6.625615}, {536, 6.4195004}, {502, 6.2688913}]  |
|44     |[{498, 5.3707647}, {31, 5.096789}, {319, 5.0943384}, {147, 5.0853972}, {576, 4.8676753}] |


                                                                                

In [13]:
for line in movieRecs.take(4):
    print(line)



Row(movieId=12, recommendations=[Row(userId=423, rating=13.603160858154297), Row(userId=468, rating=11.86734390258789), Row(userId=243, rating=9.948023796081543), Row(userId=340, rating=9.324196815490723), Row(userId=396, rating=8.880301475524902)])
Row(movieId=26, recommendations=[Row(userId=413, rating=8.366593360900879), Row(userId=531, rating=7.087543964385986), Row(userId=197, rating=6.893214225769043), Row(userId=498, rating=6.875653266906738), Row(userId=536, rating=6.714561462402344)])
Row(movieId=27, recommendations=[Row(userId=208, rating=7.948241233825684), Row(userId=598, rating=7.8396992683410645), Row(userId=569, rating=7.204832077026367), Row(userId=505, rating=6.919181823730469), Row(userId=544, rating=6.878127574920654)])
Row(movieId=28, recommendations=[Row(userId=399, rating=9.210003852844238), Row(userId=443, rating=8.201691627502441), Row(userId=278, rating=7.092686176300049), Row(userId=421, rating=6.975960731506348), Row(userId=566, rating=6.927735328674316)])


                                                                                

In [14]:
print("--- Execution time: %s seconds ---" % (time.time() - start_time))

--- Execution time: 20.994229316711426 seconds ---


In [15]:
# Ask for a user ID and show that user's recommended movie IDs and predicted ratings
enter_user = int(input("Enter the ID of the user to get recommendations for: "))
userRecs.where(userRecs.userId == enter_user).select("recommendations.movieId",
              "recommendations.rating").collect()

Enter the ID of the user to get recommendations for: 1


[Row(movieId=[3676, 89904, 2568, 102123, 119141], rating=[7.308286190032959, 7.224451065063477, 7.069147109985352, 7.002119541168213, 6.728187084197998])]
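
The result above lists movie IDs only, which is hard to read without the titles. The sketch below shows the lookup idea in plain Python. The titles come from the `moviedf.head(5)` output earlier in this notebook, while the recommendation IDs and scores are hypothetical, as is the helper name `attach_titles`.

```python
def attach_titles(movie_ids, scores, titles):
    """Pair each recommended movie's title with its predicted score."""
    return [
        # Fall back to a placeholder when an ID has no known title.
        (titles.get(movie_id, f"<unknown movieId {movie_id}>"), score)
        for movie_id, score in zip(movie_ids, scores)
    ]


# Titles taken from the first rows of movies.csv shown earlier.
titles = {
    1: "Toy Story (1995)",
    3: "Grumpier Old Men (1995)",
    5: "Father of the Bride Part II (1995)",
}

# Hypothetical recommendation output for one user.
recommended_ids = [3, 1, 999]
recommended_scores = [4.8, 4.5, 4.1]

for title, score in attach_titles(recommended_ids, recommended_scores, titles):
    print(f"{score:.1f}  {title}")
```

With the Spark DataFrames used here, the same result could likely be obtained by exploding the `recommendations` column and joining it with `moviedf` on `movieId`.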

In [16]:
# Ask for a movie ID and show the users it is most strongly recommended to
enter_movie = int(input("Enter the ID of the movie you want to recommend: "))
movieRecs.where(movieRecs.movieId == enter_movie).select("recommendations.userId",
                "recommendations.rating").collect()

Enter the ID of the movie you want to recommend: 1


                                                                                

[Row(userId=[99, 147, 486, 192, 340], rating=[5.723738670349121, 5.477389335632324, 5.250665187835693, 5.110465049743652, 5.078319072723389])]

In [17]:
sc.stop()