# Example 09: Recommender System
## MovieLens Movie Recommendation

A recommender system selects personalized items for a user (customer) based on that user's interests and on the interests of similar users, given the context they share. The technique can recommend many kinds of items, such as books, movies, news, music, videos, web pages, and products in an online store. This example recommends movies to a given user based on that user's tastes and on the tastes of similar users. The dataset comes from the MovieLens movie recommendation site.

In [1]:
import findspark
findspark.init()

In [2]:
import pyspark
from pyspark.sql import SparkSession, Row

from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

import time
start_time = time.time()

data_path='./data/'

# Try a larger dataset
#data_path='/data/dataset/MovieLens/ml-20m/'

In [3]:
# Create Spark Session
spark = SparkSession.builder \
       .master("local[*]") \
       .appName("Movies_Recommendation") \
       .getOrCreate()

## Ratings Data File Structure (ratings.csv)
-----------------------------------------

All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

    userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
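
Because the timestamps are plain Unix epoch seconds, they can be turned into readable dates with the Python standard library. The sketch below is illustrative only: the helper name `to_utc` is an assumption and not part of the notebook, and the second sample value is just an arbitrary epoch value.

```python
from datetime import datetime, timezone


def to_utc(epoch_seconds: int) -> datetime:
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# The epoch itself, then an arbitrary sample value.
print(to_utc(0).isoformat())
print(to_utc(964982703).isoformat())
```

Passing `tz=timezone.utc` keeps the result independent of the machine's local time zone, which matches the UTC definition given above.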

In [4]:
df = spark.read.format("csv").options(sep=',',header='true',inferschema='true').\
     load(data_path+"ratings.csv.gz")

df.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



In [5]:
ratedf = df.select(['userId', 'movieId', 'rating'])
ratedf.head(5)

[Row(userId=1, movieId=1, rating=4.0),
 Row(userId=1, movieId=3, rating=4.0),
 Row(userId=1, movieId=6, rating=4.0),
 Row(userId=1, movieId=47, rating=5.0),
 Row(userId=1, movieId=50, rating=5.0)]

In [6]:
ratedf.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   4.0|
|     1|      3|   4.0|
|     1|      6|   4.0|
|     1|     47|   5.0|
|     1|     50|   5.0|
|     1|     70|   3.0|
|     1|    101|   5.0|
|     1|    110|   4.0|
|     1|    151|   5.0|
|     1|    157|   5.0|
|     1|    163|   5.0|
|     1|    216|   5.0|
|     1|    223|   3.0|
|     1|    231|   5.0|
|     1|    235|   4.0|
|     1|    260|   5.0|
|     1|    296|   3.0|
|     1|    316|   3.0|
|     1|    333|   5.0|
|     1|    349|   4.0|
+------+-------+------+
only showing top 20 rows



## Movies Data File Structure (movies.csv)
---------------------------------------

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,title,genres

Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses.

Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)
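
Since the `genres` field is a single pipe-separated string, it usually has to be split before any per-genre analysis. A minimal sketch in plain Python follows; the function name `parse_genres` and the choice to map `(no genres listed)` to an empty list are assumptions made for illustration, not something the dataset prescribes.

```python
from typing import List

# Placeholder value used when a movie has no genre information.
NO_GENRES = "(no genres listed)"


def parse_genres(field: str) -> List[str]:
    """Split a pipe-separated genres field into a list of genre names."""
    if field == NO_GENRES:
        return []
    return field.split("|")


print(parse_genres("Adventure|Children|Fantasy"))
print(parse_genres(NO_GENRES))
```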

In [7]:
df = spark.read.format("csv").options(sep=',',header='true',inferschema='true').\
           load(data_path+"movies.csv.gz")

df.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)



In [8]:
moviedf = df.select(['movieId', 'title', 'genres'])
moviedf.head(5)

[Row(movieId=1, title='Toy Story (1995)', genres='Adventure|Animation|Children|Comedy|Fantasy'),
 Row(movieId=2, title='Jumanji (1995)', genres='Adventure|Children|Fantasy'),
 Row(movieId=3, title='Grumpier Old Men (1995)', genres='Comedy|Romance'),
 Row(movieId=4, title='Waiting to Exhale (1995)', genres='Comedy|Drama|Romance'),
 Row(movieId=5, title='Father of the Bride Part II (1995)', genres='Comedy')]

## Rating Statistics

In [9]:
ratedf.describe().show()

+-------+------------------+----------------+------------------+
|summary|            userId|         movieId|            rating|
+-------+------------------+----------------+------------------+
|  count|            100836|          100836|            100836|
|   mean|326.12756356856676|19435.2957177992| 3.501556983616962|
| stddev| 182.6184914635004|35530.9871987003|1.0425292390606342|
|    min|                 1|               1|               0.5|
|    max|               610|          193609|               5.0|
+-------+------------------+----------------+------------------+



In [10]:
# Configure ALS
als = ALS(maxIter=5, regParam=0.01, userCol='userId', itemCol='movieId', ratingCol='rating', \
          coldStartStrategy='drop')

# Split the data: 80% for training and 20% for testing
training, test = ratedf.randomSplit([0.8,0.2])

model = als.fit(training)

# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error (RMSE) = " + str(rmse))

Root-mean-square error (RMSE) = 1.0959359106861477
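
For reference, RMSE is the square root of the mean squared difference between the true ratings and the predicted ratings. The sketch below computes it in plain Python on small lists; the function name `rmse` and the sample values are hypothetical. It is meant to show the formula, not to reproduce how `RegressionEvaluator` is implemented internally.

```python
import math


def rmse(labels, predictions):
    """Root-mean-square error between two equal-length sequences of numbers."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one value is required")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings and predictions, for illustration only.
print(rmse([4.0, 3.0], [3.0, 5.0]))
```

An RMSE of about 1.1 on a 0.5 to 5.0 scale means the predictions are off by roughly one star on average, with larger errors weighted more heavily.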


In [11]:
# Generate top 5 movie recommendations for each user
userRecs = model.recommendForAllUsers(5)

userRecs.show(truncate=False)

+------+--------------------------------------------------------------------------------------------------+
|userId|recommendations                                                                                   |
+------+--------------------------------------------------------------------------------------------------+
|471   |[[3969, 7.0078754], [48322, 6.818779], [5785, 6.739111], [1147, 6.493653], [918, 6.3156576]]      |
|463   |[[49932, 6.9578238], [4642, 6.5255427], [71899, 6.3419843], [1260, 6.2815895], [1251, 6.2326956]] |
|496   |[[265, 7.269724], [3089, 7.1702523], [613, 7.1454577], [4941, 7.1369314], [2390, 7.0252786]]      |
|148   |[[932, 6.487558], [92535, 6.190889], [1809, 6.1229095], [84847, 5.8580184], [1126, 5.821113]]     |
|540   |[[1916, 7.2871857], [1211, 7.065893], [5048, 6.906929], [1952, 6.6632004], [2290, 6.529852]]      |
|392   |[[501, 10.30653], [412, 9.654592], [2259, 8.984282], [1354, 8.974475], [2935, 8.8553505]]         |
|243   |[[5668, 10.899073], 

In [12]:
for line in userRecs.take(3):
    print(line)

Row(userId=471, recommendations=[Row(movieId=3969, rating=7.007875442504883), Row(movieId=48322, rating=6.818778991699219), Row(movieId=5785, rating=6.739110946655273), Row(movieId=1147, rating=6.493652820587158), Row(movieId=918, rating=6.315657615661621)])
Row(userId=463, recommendations=[Row(movieId=49932, rating=6.957823753356934), Row(movieId=4642, rating=6.525542736053467), Row(movieId=71899, rating=6.341984272003174), Row(movieId=1260, rating=6.281589508056641), Row(movieId=1251, rating=6.232695579528809)])
Row(userId=496, recommendations=[Row(movieId=265, rating=7.269723892211914), Row(movieId=3089, rating=7.170252323150635), Row(movieId=613, rating=7.145457744598389), Row(movieId=4941, rating=7.136931419372559), Row(movieId=2390, rating=7.025278568267822)])
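
The recommendation rows carry only `movieId` values, so the output is easier to read once each ID is looked up in a title table. The sketch below uses plain Python dictionaries rather than Spark. The `titles` mapping reuses the first rows of `movies.csv` shown earlier, while the `recs` list and its scores are hypothetical values chosen only for illustration, and `with_titles` is an assumed helper name.

```python
# Titles taken from the first rows of movies.csv shown earlier.
titles = {
    1: "Toy Story (1995)",
    2: "Jumanji (1995)",
    3: "Grumpier Old Men (1995)",
}

# Hypothetical (movieId, predicted rating) pairs, not real model output.
recs = [(3, 4.7), (1, 4.5), (999, 4.1)]


def with_titles(recs, titles):
    """Replace each movieId with its title, keeping the predicted rating."""
    return [(titles.get(movie_id, "<unknown title>"), score)
            for movie_id, score in recs]


for title, score in with_titles(recs, titles):
    print(title, score)
```

On the full dataset the same idea would more naturally be expressed as a join between the recommendations and `moviedf`.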


In [13]:
# Generate top 5 user recommendations for each movie
movieRecs = model.recommendForAllItems(5)
movieRecs.show(truncate=False)

+-------+------------------------------------------------------------------------------------------+
|movieId|recommendations                                                                           |
+-------+------------------------------------------------------------------------------------------+
|1580   |[[544, 5.865222], [531, 5.1814184], [569, 5.1122675], [389, 4.985305], [518, 4.936157]]   |
|4900   |[[364, 7.7238946], [423, 7.3393326], [363, 6.5847754], [360, 6.529024], [77, 6.1763105]]  |
|6620   |[[289, 7.8415065], [394, 7.7843814], [502, 7.572922], [154, 7.5063796], [157, 7.40295]]   |
|7340   |[[461, 6.350884], [295, 5.4302626], [396, 5.1350555], [485, 5.094689], [383, 4.751594]]   |
|32460  |[[423, 9.602335], [548, 8.643217], [549, 8.191774], [128, 6.6855235], [407, 6.5842657]]   |
|54190  |[[548, 7.9672494], [549, 7.871696], [53, 7.582603], [197, 7.546384], [291, 7.5416565]]    |
|471    |[[461, 9.356191], [413, 8.700609], [502, 8.562303], [303, 7.946674], [423, 7.78631

In [14]:
for line in movieRecs.take(4):
   print(line)

Row(movieId=1580, recommendations=[Row(userId=544, rating=5.865221977233887), Row(userId=531, rating=5.181418418884277), Row(userId=569, rating=5.11226749420166), Row(userId=389, rating=4.985304832458496), Row(userId=518, rating=4.9361572265625)])
Row(movieId=4900, recommendations=[Row(userId=364, rating=7.7238945960998535), Row(userId=423, rating=7.339332580566406), Row(userId=363, rating=6.584775447845459), Row(userId=360, rating=6.529024124145508), Row(userId=77, rating=6.1763105392456055)])
Row(movieId=6620, recommendations=[Row(userId=289, rating=7.841506481170654), Row(userId=394, rating=7.78438138961792), Row(userId=502, rating=7.572922229766846), Row(userId=154, rating=7.5063796043396), Row(userId=157, rating=7.402949810028076)])
Row(movieId=7340, recommendations=[Row(userId=461, rating=6.350883960723877), Row(userId=295, rating=5.430262565612793), Row(userId=396, rating=5.1350555419921875), Row(userId=485, rating=5.094688892364502), Row(userId=383, rating=4.751594066619873)])


In [15]:
print("--- Execution time: %s seconds ---" % (time.time() - start_time))

--- Execution time: 24.703402519226074 seconds ---


In [None]:
# input() returns a string, so convert it to match the integer userId column
enter_user = int(input("Enter the user ID to get recommendations for: "))
userRecs.where(userRecs.userId == enter_user).select("recommendations.movieId",\
              "recommendations.rating").collect()

In [None]:
# input() returns a string, so convert it to match the integer movieId column
enter_movie = int(input("Enter the movie ID to find users to recommend it to: "))
movieRecs.where(movieRecs.movieId == enter_movie).select("recommendations.userId",\
                "recommendations.rating").collect()

In [None]:
spark.stop()