# Example 09: Recommender System
## MovieLens Movie Recommendation

A recommender system selects personalized items for a user (customer) based on that user's interests and on the interests of similar users, taking into account the context they share. The technique can recommend many kinds of items, for example books, movies, news, music, videos, web pages, and products in an online store. This example recommends movies to a given user based on that user's tastes and on the tastes of similar users. The dataset comes from the MovieLens movie recommendation site.
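
The idea of recommending from the tastes of similar users can be illustrated with a small, self-contained sketch. This is a plain nearest-neighbour approach on hypothetical toy data, not the ALS matrix factorization this notebook trains with Spark; the `ratings` dictionary and both function names are assumptions made for illustration only.

```python
from math import sqrt

# Hypothetical toy data: userId -> {movie title: rating}
ratings = {
    1: {"Toy Story": 5.0, "Jumanji": 3.0, "Heat": 4.0},
    2: {"Toy Story": 5.0, "Jumanji": 3.5, "Heat": 4.0, "Casino": 4.5},
    3: {"Toy Story": 1.0, "Jumanji": 5.0, "Sabrina": 4.0},
}

def cosine_similarity(a, b):
    """Cosine similarity over the movies both users rated (0.0 if none)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = sqrt(sum(a[m] ** 2 for m in common))
    norm_b = sqrt(sum(b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b)

def recommend(user, ratings, min_rating=4.0):
    """Suggest movies the most similar user rated highly that `user` has not rated."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: cosine_similarity(ratings[user], ratings[u]))
    return sorted(
        movie
        for movie, rating in ratings[nearest].items()
        if movie not in ratings[user] and rating >= min_rating
    )

print(recommend(1, ratings))  # ['Casino']
```

User 2 rates the shared movies almost exactly as user 1 does, so the one movie user 2 liked that user 1 has not rated becomes the suggestion.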

In [1]:
import findspark
findspark.init()

In [2]:
import pyspark
from pyspark.sql import SparkSession, Row

from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

import time
start_time = time.time()

data_path='./data/'

# Try a larger dataset
#data_path='/data/textdata/MovieLens/ml-20m/'

In [3]:
# Create Spark Session
spark = SparkSession.builder \
       .master("local[*]") \
       .appName("Movies_Recommendation") \
       .getOrCreate()

## Ratings Data File Structure (ratings.csv)

All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

    userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
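
As a rough illustration of the format described above, the sketch below parses one line of `ratings.csv`, checks the half-star scale, and converts the timestamp to a UTC datetime. The function name `parse_rating_line` and the sample line are assumptions for illustration; the notebook itself lets Spark read and type the file.

```python
from datetime import datetime, timezone

def parse_rating_line(line):
    """Parse one 'userId,movieId,rating,timestamp' line into a dict."""
    user_id, movie_id, rating, timestamp = line.strip().split(",")
    rating = float(rating)
    # Ratings run from 0.5 to 5.0 in half-star steps,
    # so twice the rating must be a whole number.
    if not (0.5 <= rating <= 5.0 and (2 * rating).is_integer()):
        raise ValueError(f"invalid rating: {rating}")
    return {
        "userId": int(user_id),
        "movieId": int(movie_id),
        "rating": rating,
        # Timestamps are seconds since 1970-01-01 00:00 UTC.
        "rated_at": datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
    }

# Illustrative sample line, not read from the file.
record = parse_rating_line("1,1,4.0,964982703")
print(record["rated_at"].isoformat())  # 2000-07-30T18:45:03+00:00
```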

In [4]:
df = spark.read.format("csv").options(sep=',',header='true',inferschema='true').\
     load(data_path+"ratings.csv.gz")

df.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



In [5]:
ratedf = df.select(['userId', 'movieId', 'rating'])
ratedf.head(5)

[Row(userId=1, movieId=1, rating=4.0),
 Row(userId=1, movieId=3, rating=4.0),
 Row(userId=1, movieId=6, rating=4.0),
 Row(userId=1, movieId=47, rating=5.0),
 Row(userId=1, movieId=50, rating=5.0)]

In [6]:
ratedf.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   4.0|
|     1|      3|   4.0|
|     1|      6|   4.0|
|     1|     47|   5.0|
|     1|     50|   5.0|
|     1|     70|   3.0|
|     1|    101|   5.0|
|     1|    110|   4.0|
|     1|    151|   5.0|
|     1|    157|   5.0|
|     1|    163|   5.0|
|     1|    216|   5.0|
|     1|    223|   3.0|
|     1|    231|   5.0|
|     1|    235|   4.0|
|     1|    260|   5.0|
|     1|    296|   3.0|
|     1|    316|   3.0|
|     1|    333|   5.0|
|     1|    349|   4.0|
+------+-------+------+
only showing top 20 rows



## Movies Data File Structure (movies.csv)

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,title,genres

Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses.

Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)
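
The title and genre conventions above can be unpacked with a few lines of plain Python. This is only a sketch: `parse_movie` is a hypothetical helper, and it assumes the release year is the last parenthesized four-digit group in the title.

```python
import re

def parse_movie(title, genres):
    """Split a title such as 'Toy Story (1995)' and a pipe-separated genre string."""
    match = re.search(r"\((\d{4})\)\s*$", title)
    year = int(match.group(1)) if match else None
    name = title[:match.start()].strip() if match else title.strip()
    # '(no genres listed)' marks a movie without genre labels.
    genre_list = [] if genres == "(no genres listed)" else genres.split("|")
    return name, year, genre_list

print(parse_movie("Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"))
# ('Toy Story', 1995, ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'])
```

When reading `movies.csv` outside Spark, a real CSV parser such as Python's `csv` module is safer than splitting on commas, because some titles contain commas and are quoted.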

In [7]:
df = spark.read.format("csv").options(sep=',',header='true',inferschema='true').\
           load(data_path+"movies.csv.gz")

df.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)



In [8]:
moviedf = df.select(['movieId', 'title', 'genres'])
moviedf.head(5)

[Row(movieId=1, title='Toy Story (1995)', genres='Adventure|Animation|Children|Comedy|Fantasy'),
 Row(movieId=2, title='Jumanji (1995)', genres='Adventure|Children|Fantasy'),
 Row(movieId=3, title='Grumpier Old Men (1995)', genres='Comedy|Romance'),
 Row(movieId=4, title='Waiting to Exhale (1995)', genres='Comedy|Drama|Romance'),
 Row(movieId=5, title='Father of the Bride Part II (1995)', genres='Comedy')]

## Rating Statistics

In [9]:
ratedf.describe().show()

+-------+------------------+----------------+------------------+
|summary|            userId|         movieId|            rating|
+-------+------------------+----------------+------------------+
|  count|            100836|          100836|            100836|
|   mean|326.12756356856676|19435.2957177992| 3.501556983616962|
| stddev| 182.6184914635004|35530.9871987003|1.0425292390606342|
|    min|                 1|               1|               0.5|
|    max|               610|          193609|               5.0|
+-------+------------------+----------------+------------------+



In [10]:
# Configure ALS
als = ALS(maxIter=5, regParam=0.01, userCol='userId', itemCol='movieId', ratingCol='rating', \
          coldStartStrategy='drop')

# Split the data: 80% for training and 20% for testing
training, test = ratedf.randomSplit([0.8,0.2])

model = als.fit(training)

# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error (RMSE) = " + str(rmse))

Root-mean-square error (RMSE) = 1.0856305391170333
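
For reference, the RMSE reported above is the square root of the mean squared difference between actual and predicted ratings. The sketch below computes it in plain Python on hypothetical numbers. It also shows the effect of clamping predictions to the 0.5 to 5.0 star scale, since the ALS scores printed in this notebook are not bounded by that scale (several exceed 5.0). Clamping is an optional post-processing step assumed here for illustration; the notebook does not apply it.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def clip(value, low=0.5, high=5.0):
    """Clamp a predicted rating to the 0.5-5.0 star scale."""
    return max(low, min(high, value))

# Hypothetical ratings and raw predictions, for illustration only.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [4.5, 3.0, 7.0, 0.0]

raw_error = rmse(actual, predicted)
clipped_error = rmse(actual, [clip(p) for p in predicted])
print(raw_error, clipped_error)
```

On these made-up values the clamped predictions give a lower error, because the out-of-range predictions are pulled back toward ratings that can actually occur.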


In [11]:
# Generate top 10 movie recommendations for each user
userRecs = model.recommendForAllUsers(10)
userRecs.show()
#userRecs.printSchema()

+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|   471|[[55280, 7.869307...|
|   463|[[1734, 6.552576]...|
|   496|[[7318, 11.441045...|
|   148|[[89904, 7.353350...|
|   540|[[2301, 6.7797527...|
|   392|[[674, 10.768398]...|
|   243|[[102, 12.640464]...|
|    31|[[89118, 10.17698...|
|   516|[[89118, 8.239606...|
|   580|[[89904, 6.405048...|
|   251|[[104218, 6.61549...|
|   451|[[3089, 9.183769]...|
|    85|[[3030, 7.1850796...|
|   137|[[1701, 5.909634]...|
|    65|[[7346, 7.217666]...|
|   458|[[53127, 8.946671...|
|   481|[[7099, 7.798819]...|
|    53|[[89904, 9.787351...|
|   255|[[33834, 10.12877...|
|   588|[[945, 6.5845184]...|
+------+--------------------+
only showing top 20 rows



In [12]:
for line in userRecs.take(3):
    print(line)

Row(userId=471, recommendations=[Row(movieId=55280, rating=7.869307518005371), Row(movieId=49274, rating=7.541896343231201), Row(movieId=7748, rating=7.420228004455566), Row(movieId=674, rating=7.395407676696777), Row(movieId=6669, rating=7.349298000335693), Row(movieId=7160, rating=7.036937236785889), Row(movieId=68945, rating=6.739740371704102), Row(movieId=27815, rating=6.735745429992676), Row(movieId=170355, rating=6.711194038391113), Row(movieId=3379, rating=6.711194038391113)])
Row(userId=463, recommendations=[Row(movieId=1734, rating=6.552576065063477), Row(movieId=86347, rating=6.404196739196777), Row(movieId=5650, rating=6.116823196411133), Row(movieId=4144, rating=5.990962982177734), Row(movieId=106100, rating=5.971987724304199), Row(movieId=56145, rating=5.871619701385498), Row(movieId=185029, rating=5.868698596954346), Row(movieId=6548, rating=5.847761154174805), Row(movieId=2469, rating=5.808485984802246), Row(movieId=2467, rating=5.710673809051514)])
Row(userId=496, recom

In [13]:
# Generate top 10 user recommendations for each movie
movieRecs = model.recommendForAllItems(10)
movieRecs.show()

+-------+--------------------+
|movieId|     recommendations|
+-------+--------------------+
|   1580|[[433, 6.2147884]...|
|   4900|[[126, 5.2460976]...|
|   5300|[[505, 5.9176984]...|
|   6620|[[340, 12.005905]...|
|   7340|[[498, 7.5222373]...|
|  32460|[[549, 9.503585],...|
|  54190|[[498, 8.526159],...|
|    471|[[461, 8.524869],...|
|   1591|[[173, 9.514005],...|
| 140541|[[394, 7.001133],...|
|   1342|[[55, 12.601469],...|
|   2122|[[557, 8.52813], ...|
|   2142|[[340, 5.877474],...|
|   7982|[[55, 7.262258], ...|
|  44022|[[399, 7.9155116]...|
| 141422|[[231, 4.1886244]...|
| 144522|[[295, 3.9284961]...|
|    833|[[231, 7.4209723]...|
|   5803|[[413, 7.527453],...|
|   7833|[[173, 8.176621],...|
+-------+--------------------+
only showing top 20 rows



In [14]:
for line in movieRecs.take(4):
    print(line)

Row(movieId=1580, recommendations=[Row(userId=433, rating=6.214788436889648), Row(userId=231, rating=5.942447662353516), Row(userId=267, rating=5.916658401489258), Row(userId=548, rating=5.666942596435547), Row(userId=53, rating=5.636122226715088), Row(userId=544, rating=5.485495090484619), Row(userId=77, rating=5.291179656982422), Row(userId=259, rating=5.275787353515625), Row(userId=228, rating=5.237707614898682), Row(userId=276, rating=5.007348537445068)])
Row(movieId=4900, recommendations=[Row(userId=126, rating=5.246097564697266), Row(userId=48, rating=4.946140289306641), Row(userId=502, rating=4.741703510284424), Row(userId=138, rating=4.68482780456543), Row(userId=31, rating=4.613824367523193), Row(userId=243, rating=4.544923305511475), Row(userId=529, rating=4.477650165557861), Row(userId=591, rating=4.460307598114014), Row(userId=43, rating=4.390514373779297), Row(userId=337, rating=4.367153644561768)])
Row(movieId=5300, recommendations=[Row(userId=505, rating=5.91769838333129

In [15]:
print("--- Execution time: %s seconds ---" % (time.time() - start_time))

--- Execution time: 25.55022931098938 seconds ---


In [16]:
# Convert the input to int so it matches the integer userId column
enter_user = int(input("Enter the ID of the user to get recommendations for:"))
userRecs.where(userRecs.userId == enter_user) \
        .select("recommendations.movieId", "recommendations.rating") \
        .collect()

Enter the ID of the user to get recommendations for: 65


[Row(movieId=[7346, 89904, 1218, 390, 5075, 501, 158783, 65642, 1966, 3858], rating=[7.217666149139404, 7.165190696716309, 7.008078575134277, 6.583594799041748, 6.545704364776611, 6.475786209106445, 6.4643425941467285, 6.444797039031982, 6.4007368087768555, 6.3879618644714355])]
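
The collected row holds two parallel lists, movie IDs and predicted scores, which are hard to read without titles. A minimal sketch of turning such lists into a readable ranking follows. The helper `top_recommendations` and the `titles` lookup are hypothetical; the two titles come from the `movies.csv` sample shown earlier, and the scores are made up. Inside Spark, a similar result could probably be reached by exploding the `recommendations` column and joining with `moviedf` on `movieId`.

```python
def top_recommendations(movie_ids, scores, titles, n=3):
    """Pair parallel ID/score lists, sort by score, and attach titles."""
    ranked = sorted(zip(movie_ids, scores), key=lambda pair: pair[1], reverse=True)
    return [
        (titles.get(movie_id, f"<unknown movieId {movie_id}>"), round(score, 2))
        for movie_id, score in ranked[:n]
    ]

# Hypothetical lookup; in practice it would be built from movies.csv.
titles = {1: "Toy Story (1995)", 2: "Jumanji (1995)"}

# Made-up IDs and scores, for illustration only.
print(top_recommendations([2, 1, 999999], [3.9, 4.7, 4.1], titles))
# [('Toy Story (1995)', 4.7), ('<unknown movieId 999999>', 4.1), ('Jumanji (1995)', 3.9)]
```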

In [17]:
# Convert the input to int so it matches the integer movieId column
enter_movie = int(input("Enter the ID of the movie to recommend:"))
movieRecs.where(movieRecs.movieId == enter_movie) \
         .select("recommendations.userId", "recommendations.rating") \
         .collect()

Enter the ID of the movie to recommend: 5300


[Row(userId=[505, 55, 589, 537, 258, 295, 209, 333, 364, 393], rating=[5.917698383331299, 5.6272969245910645, 5.354638576507568, 5.277253150939941, 5.2599639892578125, 5.177928924560547, 5.095948696136475, 5.09009313583374, 4.935408592224121, 4.905584812164307])]

In [18]:
spark.stop()