# Example 09: Recommender System
## MovieLens Movie Recommendation

A recommender system selects personalized items for a user (customer) based on that user's interests and on the interests of similar users, taking into account the context they share. The technique can recommend many kinds of items, such as books, movies, news, music, videos, web pages, and products in an online store. This example recommends movies to a given user based on that user's tastes and the tastes of similar users. The dataset comes from the MovieLens movie recommendation site.
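
To make the idea of "similar users" concrete, here is a minimal sketch of user-based collaborative filtering in plain Python. It is only an illustration: the notebook itself trains an ALS model, and the user names, movie labels, and ratings below are hypothetical toy values rather than MovieLens data.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {movie: rating}. Not taken from MovieLens.
ratings = {
    "ana":   {"A": 5.0, "B": 4.0, "C": 1.0},
    "bruno": {"A": 4.0, "B": 5.0, "D": 5.0},
    "carla": {"A": 1.0, "C": 5.0, "E": 4.0},
}

def cosine_similarity(a, b):
    """Cosine similarity over the movies both users rated (0.0 if none)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)

def recommend(user, k=2):
    """Rank movies the user has not seen by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        for movie, rating in other_ratings.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * rating
                weights[movie] = weights.get(movie, 0.0) + sim
    ranked = sorted(
        ((scores[m] / weights[m], m) for m in scores if weights[m] > 0),
        reverse=True,
    )
    return [movie for _, movie in ranked[:k]]

print(recommend("ana"))
```

Here "ana" agrees with "bruno" far more than with "carla", and the only movies she has not rated are "D" and "E", so "D" (rated 5.0 by "bruno") ranks ahead of "E".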

In [1]:
import findspark
findspark.init()

In [2]:
import pyspark
from pyspark.sql import SparkSession, Row

from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

import time
start_time = time.time()

data_path='./data/'

# Try a larger dataset
#data_path='/data/textdata/MovieLens/ml-20m/'

In [3]:
# Create Spark Session
spark = SparkSession.builder \
       .master("local[*]") \
       .appName("Movies_Recommendation") \
       .getOrCreate()

## Ratings Data File Structure (ratings.csv)

All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

    userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
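
As a rough illustration of this format, the sketch below parses one rating line with the standard library, converts the timestamp to a UTC datetime, and checks that the rating sits on the half-star scale. The sample row and the helper names `parse_rating` and `is_valid_rating` are assumptions made for this example; the timestamp value is made up and not copied from the dataset.

```python
import csv
import io
from datetime import datetime, timezone

# Header plus one illustrative row; the timestamp is a made-up value
# (946684800 seconds after the epoch), not a value from ratings.csv.
sample = "userId,movieId,rating,timestamp\n1,1,4.0,946684800\n"

def parse_rating(row):
    """Convert one CSV row (a dict of strings) into typed values."""
    return {
        "userId": int(row["userId"]),
        "movieId": int(row["movieId"]),
        "rating": float(row["rating"]),
        # Timestamps are seconds since 1970-01-01 00:00 UTC.
        "rated_at": datetime.fromtimestamp(int(row["timestamp"]), tz=timezone.utc),
    }

def is_valid_rating(rating):
    """True for 0.5, 1.0, ..., 5.0 (half-star increments)."""
    return 0.5 <= rating <= 5.0 and (rating * 2).is_integer()

records = [parse_rating(r) for r in csv.DictReader(io.StringIO(sample))]
print(records[0]["rated_at"].isoformat(), is_valid_rating(records[0]["rating"]))
```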

In [4]:
df = spark.read.format("csv").options(sep=',',header='true',inferschema='true').\
     load(data_path+"ratings.csv.gz")

df.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



In [5]:
ratedf = df.select(['userId', 'movieId', 'rating'])
ratedf.head(5)

[Row(userId=1, movieId=1, rating=4.0),
 Row(userId=1, movieId=3, rating=4.0),
 Row(userId=1, movieId=6, rating=4.0),
 Row(userId=1, movieId=47, rating=5.0),
 Row(userId=1, movieId=50, rating=5.0)]

In [6]:
ratedf.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   4.0|
|     1|      3|   4.0|
|     1|      6|   4.0|
|     1|     47|   5.0|
|     1|     50|   5.0|
|     1|     70|   3.0|
|     1|    101|   5.0|
|     1|    110|   4.0|
|     1|    151|   5.0|
|     1|    157|   5.0|
|     1|    163|   5.0|
|     1|    216|   5.0|
|     1|    223|   3.0|
|     1|    231|   5.0|
|     1|    235|   4.0|
|     1|    260|   5.0|
|     1|    296|   3.0|
|     1|    316|   3.0|
|     1|    333|   5.0|
|     1|    349|   4.0|
+------+-------+------+
only showing top 20 rows



## Movies Data File Structure (movies.csv)

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,title,genres

Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses.

Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)
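
The title and genre conventions above can be unpacked with a few lines of plain Python. This is a sketch under stated assumptions: the helper name `parse_movie` and the regular expression are illustrative choices, and titles that do not end in a four-digit year in parentheses are returned unchanged with the year set to `None`.

```python
import re

# Matches a trailing "(YYYY)" release year, e.g. " (1995)".
YEAR_SUFFIX = re.compile(r"\s*\((\d{4})\)\s*$")

def parse_movie(title, genres):
    """Split a movies.csv title into (name, year) and genres into a list."""
    match = YEAR_SUFFIX.search(title)
    year = int(match.group(1)) if match else None
    name = YEAR_SUFFIX.sub("", title) if match else title
    # The placeholder "(no genres listed)" is mapped to an empty list.
    genre_list = [] if genres == "(no genres listed)" else genres.split("|")
    return name, year, genre_list

print(parse_movie("Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"))
```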

In [7]:
df = spark.read.format("csv").options(sep=',',header='true',inferschema='true').\
           load(data_path+"movies.csv.gz")

df.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)



In [8]:
moviedf = df.select(['movieId', 'title', 'genres'])
moviedf.head(5)

[Row(movieId=1, title='Toy Story (1995)', genres='Adventure|Animation|Children|Comedy|Fantasy'),
 Row(movieId=2, title='Jumanji (1995)', genres='Adventure|Children|Fantasy'),
 Row(movieId=3, title='Grumpier Old Men (1995)', genres='Comedy|Romance'),
 Row(movieId=4, title='Waiting to Exhale (1995)', genres='Comedy|Drama|Romance'),
 Row(movieId=5, title='Father of the Bride Part II (1995)', genres='Comedy')]

## Rating Statistics

In [9]:
ratedf.describe().show()

+-------+------------------+----------------+------------------+
|summary|            userId|         movieId|            rating|
+-------+------------------+----------------+------------------+
|  count|            100836|          100836|            100836|
|   mean|326.12756356856676|19435.2957177992| 3.501556983616962|
| stddev| 182.6184914635004|35530.9871987003|1.0425292390606342|
|    min|                 1|               1|               0.5|
|    max|               610|          193609|               5.0|
+-------+------------------+----------------+------------------+



In [10]:
# Configure ALS
als = ALS(maxIter=5, regParam=0.01, userCol='userId', itemCol='movieId', ratingCol='rating', \
          coldStartStrategy='drop')

# Split the dataset: 80% for training and 20% for testing
training, test = ratedf.randomSplit([0.8,0.2])

model = als.fit(training)

# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error (RMSE) = " + str(rmse))

Root-mean-square error (RMSE) = 1.1041152882807121
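
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The sketch below computes it by hand in plain Python; the three (actual, predicted) pairs are hypothetical values chosen for the arithmetic, not rows from the test set.

```python
from math import sqrt

def rmse(pairs):
    """Root-mean-square error over (actual, predicted) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (actual, predicted) pair")
    return sqrt(sum((actual - pred) ** 2 for actual, pred in pairs) / len(pairs))

# Hypothetical pairs: errors are 0.5, 1.0 and -1.5,
# so the squared errors sum to 0.25 + 1.0 + 2.25 = 3.5.
sample = [(4.0, 3.5), (5.0, 4.0), (3.0, 4.5)]
print(rmse(sample))  # sqrt(3.5 / 3), roughly 1.08
```

An RMSE near 1.1, as reported above, means a typical prediction is off by about one star on the 0.5 to 5.0 scale.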


In [11]:
# Generate top 5 movie recommendations for each user
userRecs = model.recommendForAllUsers(5)

userRecs.show(truncate=False)

+------+--------------------------------------------------------------------------------------------------+
|userId|recommendations                                                                                   |
+------+--------------------------------------------------------------------------------------------------+
|471   |[[26258, 12.84515], [4941, 11.294691], [2290, 11.14419], [3089, 10.463146], [1809, 10.145598]]    |
|463   |[[932, 8.336502], [4678, 7.8344407], [157296, 6.437334], [3925, 6.302414], [194, 6.2880564]]      |
|496   |[[92535, 8.381637], [86345, 8.325246], [49932, 8.315734], [8973, 7.5235357], [86347, 7.5186667]]  |
|148   |[[5291, 5.718426], [5882, 5.7034245], [932, 5.700269], [3741, 5.495344], [26865, 5.381207]]       |
|540   |[[932, 8.648776], [3265, 6.521791], [80906, 6.4970374], [4678, 6.3557663], [7360, 6.2567873]]     |
|392   |[[7046, 8.612724], [5419, 8.467591], [106100, 7.8681784], [2261, 7.8283925], [103341, 7.678608]]  |
|243   |[[158872, 9.826565],

In [12]:
for line in userRecs.take(3):
    print(line)

Row(userId=471, recommendations=[Row(movieId=26258, rating=12.845149993896484), Row(movieId=4941, rating=11.29469108581543), Row(movieId=2290, rating=11.144189834594727), Row(movieId=3089, rating=10.463146209716797), Row(movieId=1809, rating=10.145598411560059)])
Row(userId=463, recommendations=[Row(movieId=932, rating=8.336502075195312), Row(movieId=4678, rating=7.8344407081604), Row(movieId=157296, rating=6.437334060668945), Row(movieId=3925, rating=6.3024139404296875), Row(movieId=194, rating=6.288056373596191)])
Row(userId=496, recommendations=[Row(movieId=92535, rating=8.381636619567871), Row(movieId=86345, rating=8.32524585723877), Row(movieId=49932, rating=8.315733909606934), Row(movieId=8973, rating=7.52353572845459), Row(movieId=86347, rating=7.518666744232178)])
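
Several predicted scores above are well outside the 0.5 to 5.0 star scale (for example 12.845 for user 471). ALS predicts a score as the inner product of a learned user factor vector and a learned item factor vector, and nothing bounds that product to the rating scale. The sketch below shows the arithmetic and one simple way to clip scores for display; the factor values and the helper names `predict` and `clip_rating` are hypothetical, not taken from the trained model.

```python
def predict(user_factors, item_factors):
    """Predicted score as the inner product of two factor vectors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))

def clip_rating(score, low=0.5, high=5.0):
    """Clamp a raw score to the MovieLens star scale."""
    return max(low, min(high, score))

# Hypothetical latent factors, chosen only to illustrate the arithmetic.
user_vec = [1.2, 0.8, 2.0]
item_vec = [2.5, 1.0, 1.5]

raw = predict(user_vec, item_vec)  # 3.0 + 0.8 + 3.0
print(raw, clip_rating(raw))
```

Clipping changes only the displayed value. For ranking, the raw scores are usually kept, since clipping would make every score above 5.0 tie.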


In [13]:
# Generate top 5 user recommendations for each movie
movieRecs = model.recommendForAllItems(5)
movieRecs.show(truncate=False)

+-------+----------------------------------------------------------------------------------------+
|movieId|recommendations                                                                         |
+-------+----------------------------------------------------------------------------------------+
|1580   |[[35, 6.8533063], [536, 6.36949], [485, 6.201755], [584, 5.87738], [267, 5.7080536]]    |
|4900   |[[549, 8.06773], [126, 7.9640265], [598, 7.507944], [530, 7.2240415], [261, 7.152564]]  |
|5300   |[[461, 8.871581], [295, 6.164115], [197, 5.6758566], [114, 5.460502], [258, 5.3874536]] |
|6620   |[[471, 8.186786], [461, 8.027531], [114, 7.696641], [259, 7.6481266], [530, 7.600788]]  |
|7340   |[[423, 5.3940163], [224, 4.4409695], [96, 4.316804], [383, 3.9536536], [81, 3.9005015]] |
|32460  |[[393, 9.411356], [55, 9.010375], [399, 8.616556], [423, 8.392638], [544, 7.8689055]]   |
|54190  |[[508, 9.152799], [34, 7.429445], [574, 7.4289594], [67, 7.380638], [71, 7.247784]]     |
|471    |[

In [14]:
for line in movieRecs.take(4):
   print(line)

Row(movieId=1580, recommendations=[Row(userId=35, rating=6.853306293487549), Row(userId=536, rating=6.369490146636963), Row(userId=485, rating=6.201755046844482), Row(userId=584, rating=5.877379894256592), Row(userId=267, rating=5.7080535888671875)])
Row(movieId=4900, recommendations=[Row(userId=549, rating=8.067729949951172), Row(userId=126, rating=7.96402645111084), Row(userId=598, rating=7.507944107055664), Row(userId=530, rating=7.22404146194458), Row(userId=261, rating=7.15256404876709)])
Row(movieId=5300, recommendations=[Row(userId=461, rating=8.871581077575684), Row(userId=295, rating=6.164114952087402), Row(userId=197, rating=5.675856590270996), Row(userId=114, rating=5.4605021476745605), Row(userId=258, rating=5.387453556060791)])
Row(movieId=6620, recommendations=[Row(userId=471, rating=8.186785697937012), Row(userId=461, rating=8.027530670166016), Row(userId=114, rating=7.696640968322754), Row(userId=259, rating=7.648126602172852), Row(userId=530, rating=7.600788116455078)])

In [15]:
print("--- Execution time: %s seconds ---" % (time.time() - start_time))

--- Execution time: 32.69687843322754 seconds ---


In [16]:
# input() returns a string; convert it to match the integer userId column
enter_user = int(input("Enter the ID of the user to get recommendations for: "))
userRecs.where(userRecs.userId == enter_user).select("recommendations.movieId",\
              "recommendations.rating").collect()

Enter the ID of the user to get recommendations for: 31


[Row(movieId=[26258, 1701, 2290, 926, 1354], rating=[9.391887664794922, 8.904085159301758, 8.642967224121094, 8.587239265441895, 8.582169532775879])]
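
The collected row holds two parallel lists, one of movie IDs and one of predicted scores. A short sketch of pairing them up for display follows, using the values printed above for user 31; the variable names are illustrative.

```python
# Values copied from the output above for user 31.
movie_ids = [26258, 1701, 2290, 926, 1354]
scores = [9.391887664794922, 8.904085159301758, 8.642967224121094,
          8.587239265441895, 8.582169532775879]

# zip() pairs each movieId with the score at the same position.
pairs = list(zip(movie_ids, scores))

for rank, (movie_id, score) in enumerate(pairs, start=1):
    print(f"{rank}. movieId={movie_id} predicted={score:.2f}")
```

To show titles instead of IDs, these pairs could be matched against the `movieId` column of `movies.csv`.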

In [17]:
# input() returns a string; convert it to match the integer movieId column
enter_movie = int(input("Enter the ID of the movie to recommend: "))
movieRecs.where(movieRecs.movieId == enter_movie).select("recommendations.userId",\
                "recommendations.rating").collect()

Enter the ID of the movie to recommend: 833


[Row(userId=[71, 34, 126, 544, 574], rating=[7.287454605102539, 6.620984077453613, 6.58880615234375, 6.5091776847839355, 6.363091468811035])]

In [18]:
spark.stop()