# Example 09: Recommender System
## MovieLens Movie Recommendation

A recommender system selects personalized items for a user (customer) based on that user's interests and on the interests of similar users, taking into account the context they share. The technique can recommend many kinds of items, such as books, movies, news, music, videos, web pages, and products in an online store. This example recommends movies to a given user based on that user's tastes and on the tastes of similar users. The dataset comes from the MovieLens movie recommendation site.
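
To make the idea of "similar users" concrete, here is a minimal sketch of user-based collaborative filtering in plain Python. It is not the ALS model this notebook trains; it is a simpler nearest-neighbor technique shown only for intuition. The users, movies, and ratings are hypothetical, and `cosine` and `recommend` are illustrative helper names.

```python
from math import sqrt

# Hypothetical ratings: user -> {movie: rating}. Not taken from MovieLens.
ratings = {
    "ana":   {"A": 5.0, "B": 4.0, "C": 1.0},
    "bruno": {"A": 5.0, "B": 5.0, "D": 4.5},
    "carla": {"A": 1.0, "C": 5.0, "E": 4.0},
}

def cosine(u, v):
    """Cosine similarity computed over the movies both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    norm_u = sqrt(sum(u[m] ** 2 for m in common))
    norm_v = sqrt(sum(v[m] ** 2 for m in common))
    return dot / (norm_u * norm_v)

def recommend(user, ratings, k=2):
    """Rank movies the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for movie, rating in other_ratings.items():
            if movie in ratings[user]:
                continue  # skip movies the user already rated
            scores[movie] = scores.get(movie, 0.0) + sim * rating
            weights[movie] = weights.get(movie, 0.0) + sim
    ranked = {m: scores[m] / weights[m] for m in scores if weights[m] > 0}
    return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)[:k]

top = recommend("ana", ratings)
print(top)
```

In this toy data "ana" agrees with "bruno" far more than with "carla", so similarity decides whose ratings count most whenever several neighbors have rated the same unseen movie.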

In [1]:
import findspark
findspark.init()

In [2]:
import pyspark
from pyspark.sql import SparkSession, Row

from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

import time
start_time = time.time()

data_path='./data/'

# Try a larger dataset
#data_path='/data/dataset/MovieLens/ml-20m/'

In [3]:
# Create Spark Session
spark = SparkSession.builder \
       .master("local[*]") \
       .appName("Movies_Recommendation") \
       .getOrCreate()

## Ratings Data File Structure (ratings.csv)

All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

    userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
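
As a small illustration of the rating scale and timestamp encoding described above, the sketch below validates a rating and converts a timestamp to a UTC datetime using only the standard library. The helper names `is_valid_rating` and `rating_time` are hypothetical, and the sample timestamp is an illustrative value rather than a row quoted in this notebook.

```python
from datetime import datetime, timezone

def is_valid_rating(rating):
    """True if the rating lies on the 0.5 to 5.0 scale in half-star steps."""
    rating = float(rating)
    return 0.5 <= rating <= 5.0 and (rating * 2).is_integer()

def rating_time(timestamp):
    """Convert seconds since 1970-01-01 00:00 UTC to a timezone-aware datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)

# Illustrative timestamp value (an assumption, not quoted from the output below).
example = rating_time(964982703)
print(example.isoformat())
print(is_valid_rating(4.5), is_valid_rating(4.2), is_valid_rating(0.0))
```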

In [4]:
df = spark.read.format("csv").options(sep=',',header='true',inferschema='true').\
     load(data_path+"ratings.csv.gz")

df.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



In [5]:
ratedf = df.select(['userId', 'movieId', 'rating'])
ratedf.head(5)

[Row(userId=1, movieId=1, rating=4.0),
 Row(userId=1, movieId=3, rating=4.0),
 Row(userId=1, movieId=6, rating=4.0),
 Row(userId=1, movieId=47, rating=5.0),
 Row(userId=1, movieId=50, rating=5.0)]

In [6]:
ratedf.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   4.0|
|     1|      3|   4.0|
|     1|      6|   4.0|
|     1|     47|   5.0|
|     1|     50|   5.0|
|     1|     70|   3.0|
|     1|    101|   5.0|
|     1|    110|   4.0|
|     1|    151|   5.0|
|     1|    157|   5.0|
|     1|    163|   5.0|
|     1|    216|   5.0|
|     1|    223|   3.0|
|     1|    231|   5.0|
|     1|    235|   4.0|
|     1|    260|   5.0|
|     1|    296|   3.0|
|     1|    316|   3.0|
|     1|    333|   5.0|
|     1|    349|   4.0|
+------+-------+------+
only showing top 20 rows



## Movies Data File Structure (movies.csv)

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,title,genres

Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses.

Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)
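
The title and genre conventions above can be unpacked with a few lines of standard-library Python. This is a sketch under stated assumptions: `parse_movie` and `YEAR_PATTERN` are hypothetical names, the release year is assumed to be the last parenthesized four-digit number in the title, and `(no genres listed)` is mapped to an empty list as a design choice. The two sample rows match movies shown later in this notebook.

```python
import csv
import io
import re

# Matches a four-digit release year in parentheses at the end of the title.
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")

def parse_movie(row):
    """Turn one movies.csv record (a dict of strings) into typed fields."""
    match = YEAR_PATTERN.search(row["title"])
    genres = row["genres"].split("|")
    if genres == ["(no genres listed)"]:
        genres = []  # treat the placeholder as "no genres"
    return {
        "movieId": int(row["movieId"]),
        "title": row["title"],
        "year": int(match.group(1)) if match else None,
        "genres": genres,
    }

# In-memory stand-in for movies.csv so the example runs without the dataset.
sample = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "5,Father of the Bride Part II (1995),Comedy\n"
)
movies = [parse_movie(row) for row in csv.DictReader(sample)]
for movie in movies:
    print(movie["movieId"], movie["year"], movie["genres"])
```

Reading the file with the `csv` module rather than splitting on commas matters because titles can themselves contain commas, in which case the field is quoted.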

In [7]:
df = spark.read.format("csv").options(sep=',',header='true',inferschema='true').\
           load(data_path+"movies.csv.gz")

df.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)



In [8]:
moviedf = df.select(['movieId', 'title', 'genres'])
moviedf.head(5)

[Row(movieId=1, title='Toy Story (1995)', genres='Adventure|Animation|Children|Comedy|Fantasy'),
 Row(movieId=2, title='Jumanji (1995)', genres='Adventure|Children|Fantasy'),
 Row(movieId=3, title='Grumpier Old Men (1995)', genres='Comedy|Romance'),
 Row(movieId=4, title='Waiting to Exhale (1995)', genres='Comedy|Drama|Romance'),
 Row(movieId=5, title='Father of the Bride Part II (1995)', genres='Comedy')]

## Rating Statistics

In [9]:
ratedf.describe().show()

+-------+------------------+----------------+------------------+
|summary|            userId|         movieId|            rating|
+-------+------------------+----------------+------------------+
|  count|            100836|          100836|            100836|
|   mean|326.12756356856676|19435.2957177992| 3.501556983616962|
| stddev| 182.6184914635004|35530.9871987003|1.0425292390606342|
|    min|                 1|               1|               0.5|
|    max|               610|          193609|               5.0|
+-------+------------------+----------------+------------------+



In [10]:
# Configure ALS
als = ALS(maxIter=5, regParam=0.01, userCol='userId', itemCol='movieId', ratingCol='rating', \
          coldStartStrategy='drop')

# Split the data: 80% for training and 20% for testing
training, test = ratedf.randomSplit([0.8,0.2])

model = als.fit(training)

# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error (RMSE) = " + str(rmse))

Root-mean-square error (RMSE) = 1.0865871535586957
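
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings, so a value near 1.09 means predictions are off by roughly one star on average. The sketch below computes the metric by hand on hypothetical numbers; it illustrates the formula and is not a substitute for `RegressionEvaluator`. The function name `rmse_manual` is an assumption.

```python
from math import sqrt

def rmse_manual(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings and predictions, not taken from the model above.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

# Squared errors: 0.25, 0.0, 1.0, 1.0 -> mean 0.5625 -> square root 0.75
print(rmse_manual(actual, predicted))
```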


In [11]:
# Generate top 5 movie recommendations for each user
userRecs = model.recommendForAllUsers(5)

userRecs.show(truncate=False)

+------+---------------------------------------------------------------------------------------------------+
|userId|recommendations                                                                                    |
+------+---------------------------------------------------------------------------------------------------+
|471   |[[3676, 9.503193], [85, 9.185207], [71899, 8.886112], [134170, 8.2760315], [26865, 7.57923]]       |
|463   |[[53127, 7.8692055], [3272, 7.8565254], [119141, 7.5148816], [1974, 7.204374], [158872, 7.133631]] |
|496   |[[89904, 11.066743], [213, 9.602147], [53127, 8.465222], [1866, 8.436421], [79428, 8.102955]]      |
|148   |[[2427, 7.0396204], [3441, 6.965892], [5292, 6.7121906], [3030, 6.3832707], [3089, 6.328024]]      |
|540   |[[599, 7.5825143], [4678, 6.8466644], [7360, 6.6380506], [59900, 6.6304708], [1956, 6.629635]]     |
|392   |[[1974, 10.596278], [5419, 10.529247], [215, 10.111864], [2084, 9.823917], [4062, 9.7230215]]      |
|243   |[[179819, 1

In [12]:
for line in userRecs.take(3):
    print(line)

Row(userId=471, recommendations=[Row(movieId=3676, rating=9.503192901611328), Row(movieId=85, rating=9.18520736694336), Row(movieId=71899, rating=8.886112213134766), Row(movieId=134170, rating=8.276031494140625), Row(movieId=26865, rating=7.579229831695557)])
Row(userId=463, recommendations=[Row(movieId=53127, rating=7.869205474853516), Row(movieId=3272, rating=7.856525421142578), Row(movieId=119141, rating=7.514881610870361), Row(movieId=1974, rating=7.204373836517334), Row(movieId=158872, rating=7.133631229400635)])
Row(userId=496, recommendations=[Row(movieId=89904, rating=11.066742897033691), Row(movieId=213, rating=9.602147102355957), Row(movieId=53127, rating=8.465222358703613), Row(movieId=1866, rating=8.436421394348145), Row(movieId=79428, rating=8.102954864501953)])
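
Several predicted ratings in the output above exceed 5.0, the top of the rating scale. ALS predictions are dot products of learned factor vectors and are not constrained to the rating range, so values like these can appear, particularly with weak regularization. If a bounded value is needed for display, one option is to clip the score, as in the sketch below. `clip_rating` is a hypothetical helper, and the sample pairs are copied from the first row of the output above.

```python
def clip_rating(score, low=0.5, high=5.0):
    """Clamp a predicted score into the 0.5 to 5.0 rating scale."""
    return max(low, min(high, score))

# (movieId, predicted rating) pairs for user 471, taken from the output above.
recommendations = [(3676, 9.503193), (85, 9.185207), (26865, 7.57923)]

# Keep the raw score for ranking and add a clipped value for display only.
display = [(movie_id, score, clip_rating(score)) for movie_id, score in recommendations]
for movie_id, score, shown in display:
    print(movie_id, score, shown)
```

Clipping discards the differences between scores above 5.0, so any ranking should still use the raw values.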


In [13]:
# Generate top 5 user recommendations for each movie
movieRecs = model.recommendForAllItems(5)
movieRecs.show(truncate=False)

+-------+------------------------------------------------------------------------------------------+
|movieId|recommendations                                                                           |
+-------+------------------------------------------------------------------------------------------+
|1580   |[[147, 6.2455316], [344, 5.7582912], [77, 5.737737], [543, 5.7029176], [71, 5.443395]]    |
|4900   |[[147, 7.459399], [197, 6.453449], [549, 6.0067153], [130, 5.7156615], [176, 5.6444206]]  |
|6620   |[[399, 10.76585], [468, 9.644726], [259, 8.955781], [218, 8.326175], [289, 8.121714]]     |
|7340   |[[329, 6.8539414], [392, 6.6830053], [557, 6.2859898], [399, 5.99953], [197, 5.8264093]]  |
|32460  |[[535, 8.6159525], [296, 7.7539415], [494, 7.31887], [120, 7.047518], [55, 6.8850083]]    |
|54190  |[[147, 8.324658], [13, 7.7683086], [423, 7.618715], [535, 7.5837674], [518, 7.543573]]    |
|471    |[[37, 8.984481], [224, 7.840829], [399, 7.65471], [557, 7.3898625], [497, 7.055581

In [14]:
for line in movieRecs.take(4):
    print(line)

Row(movieId=1580, recommendations=[Row(userId=147, rating=6.2455315589904785), Row(userId=344, rating=5.758291244506836), Row(userId=77, rating=5.73773717880249), Row(userId=543, rating=5.702917575836182), Row(userId=71, rating=5.443395137786865)])
Row(movieId=4900, recommendations=[Row(userId=147, rating=7.459399223327637), Row(userId=197, rating=6.45344877243042), Row(userId=549, rating=6.006715297698975), Row(userId=130, rating=5.715661525726318), Row(userId=176, rating=5.644420623779297)])
Row(movieId=6620, recommendations=[Row(userId=399, rating=10.765850067138672), Row(userId=468, rating=9.644725799560547), Row(userId=259, rating=8.955780982971191), Row(userId=218, rating=8.32617473602295), Row(userId=289, rating=8.121713638305664)])
Row(movieId=7340, recommendations=[Row(userId=329, rating=6.853941440582275), Row(userId=392, rating=6.683005332946777), Row(userId=557, rating=6.285989761352539), Row(userId=399, rating=5.999529838562012), Row(userId=197, rating=5.826409339904785)])

In [15]:
print("--- Execution time: %s seconds ---" % (time.time() - start_time))

--- Execution time: 25.465017795562744 seconds ---


In [None]:
# Read the user ID as an integer so it matches the integer userId column
enter_user = int(input("Enter the ID of the user who should receive recommendations: "))
userRecs.where(userRecs.userId == enter_user).select("recommendations.movieId",
              "recommendations.rating").collect()

In [None]:
# Read the movie ID as an integer so it matches the integer movieId column
enter_movie = int(input("Enter the ID of the movie you want to recommend: "))
movieRecs.where(movieRecs.movieId == enter_movie).select("recommendations.userId",
                "recommendations.rating").collect()

In [None]:
spark.stop()