# Recommender systems

In this example, we present a few ways to build a recommender system for movies.

The methodology covered is collaborative filtering.

## Load packages

In [1]:
library(reshape2)
library(tidyverse)
library(magrittr)
library(recommenderlab)
library(Matrix)
library(NMF)
library(NNLM)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.0.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.6
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘magrittr’

The following object is masked from ‘package:purrr’:

    set_names

The following object is masked from ‘package:tidyr’:

    extract

Loading required package: Matrix

Attaching package: ‘Matrix’

The following object is masked from ‘package:tidyr’:

    expand

Loading required package: arules

Attaching package: ‘arules’

The following object is masked from ‘package:dplyr’:

    recode

The following objects are masked from ‘package:base’:

    abbreviate, write

Loading required package: proxy

Attaching package: ‘proxy’

The following object is masked from ‘package:Matrix’

## Load data

In [2]:
dados_ratings <- read_csv("/home/vm-data-science/education/dados/movies_ratings_example.csv")

Parsed with column specification:
cols(
  user_id = col_integer(),
  user = col_character(),
  movie = col_character(),
  rating = col_double()
)


In [3]:
dados_ratings %>% head

user_id,user,movie,rating
1,antonio,batman,2.5
1,antonio,matrix,3.0
1,antonio,spiderman,3.5
1,antonio,ella,2.5
1,antonio,blue_lagoon,3.0
2,nunes,batman,3.0


## Analysis

### Convert to a user/item matrix

The "NA" values are the movies that the users have not rated yet.

The goal is to estimate these values with the methods presented below, so that we can decide whether or not to recommend these movies.

In [4]:
user_item_matrix <- dados_ratings %>% 
  spread( key = movie, value = rating ) %>% 
  select( -user )

In [5]:
user_item_matrix

user_id,batman,blue_lagoon,ella,lost_in_translation,matrix,spiderman
1,2.5,3.0,2.5,,3.0,3.5
2,3.0,4.0,,3.5,3.5,
3,,,2.5,4.0,3.5,3.0
4,,3.0,2.0,3.0,4.0,2.0
5,3.0,3.0,3.5,5.0,4.0,
6,,1.0,4.0,,4.5,
7,3.0,2.0,,,3.5,4.0
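
As a quick sanity check, the cells that still need to be estimated can be counted directly. The sketch below retypes the matrix above in base R; the object name `ratings` is an assumption made for illustration and is not part of the notebook.

```r
# Ratings retyped from the user/item matrix above (NA = not rated yet).
ratings <- matrix(
  c(2.5, 3.0, 2.5,  NA, 3.0, 3.5,
    3.0, 4.0,  NA, 3.5, 3.5,  NA,
     NA,  NA, 2.5, 4.0, 3.5, 3.0,
     NA, 3.0, 2.0, 3.0, 4.0, 2.0,
    3.0, 3.0, 3.5, 5.0, 4.0,  NA,
     NA, 1.0, 4.0,  NA, 4.5,  NA,
    3.0, 2.0,  NA,  NA, 3.5, 4.0),
  nrow = 7, byrow = TRUE,
  dimnames = list(as.character(1:7),
                  c("batman", "blue_lagoon", "ella",
                    "lost_in_translation", "matrix", "spiderman"))
)

n_rated   <- sum(!is.na(ratings))           # observed ratings
n_missing <- sum(is.na(ratings))            # cells to be estimated
missing_per_user <- rowSums(is.na(ratings)) # unrated movies per user

c(rated = n_rated, missing = n_missing)
missing_per_user
```

Of the 42 cells, 30 hold a rating and 12 are missing, so every user has at least one movie that could be recommended.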


## Memory-based algorithms (*Memory Based Reasoning*)

These algorithms first compute the similarity between users (*User based filtering*) or between items (*Item based filtering*). We present both methods.

To perform the calculations, we use functions from the recommenderlab package.

With this package, the matrix must first be converted to the "realRatingMatrix" format.

In [6]:
user_item_matrix_reclab <- as.matrix(user_item_matrix[,-1]) %>% 
                                as(., "realRatingMatrix")

In [7]:
user_item_matrix_reclab

7 x 6 rating matrix of class ‘realRatingMatrix’ with 30 ratings.

### ***Item Based filtering***

This method follows these steps:

1 - For every pair of items, compute the similarity between them.

2 - For each item, identify the *k* most similar items. 

3 - Identify the groups of items most associated with each user.

4 - Recommend the group of items most associated with the user.
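
The four steps can be sketched in base R as follows. This is a minimal illustration and not recommenderlab's implementation: it assumes cosine similarity computed only over users who rated both items, and the names `ratings`, `cosine_pair`, `sim` and `predict_user` are hypothetical.

```r
# Illustrative sketch only; this is not recommenderlab's implementation.
# Ratings retyped from the user/item matrix (NA = not rated yet).
ratings <- matrix(
  c(2.5, 3.0, 2.5,  NA, 3.0, 3.5,
    3.0, 4.0,  NA, 3.5, 3.5,  NA,
     NA,  NA, 2.5, 4.0, 3.5, 3.0,
     NA, 3.0, 2.0, 3.0, 4.0, 2.0,
    3.0, 3.0, 3.5, 5.0, 4.0,  NA,
     NA, 1.0, 4.0,  NA, 4.5,  NA,
    3.0, 2.0,  NA,  NA, 3.5, 4.0),
  nrow = 7, byrow = TRUE,
  dimnames = list(as.character(1:7),
                  c("batman", "blue_lagoon", "ella",
                    "lost_in_translation", "matrix", "spiderman"))
)

# Step 1: cosine similarity for every pair of items, using only the
# users who rated both items (an assumption of this sketch).
cosine_pair <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  sum(a[both] * b[both]) / sqrt(sum(a[both]^2) * sum(b[both]^2))
}
items <- colnames(ratings)
sim <- matrix(0, length(items), length(items), dimnames = list(items, items))
for (i in items) {
  for (j in setdiff(items, i)) {
    sim[i, j] <- cosine_pair(ratings[, i], ratings[, j])
  }
}

# Step 2: for each item (row), keep only its k most similar items.
k <- 3
for (i in items) {
  ranked <- order(sim[i, ], decreasing = TRUE)
  sim[i, ranked[-seq_len(k)]] <- 0
}

# Steps 3 and 4: score each unrated item for one user as the
# similarity-weighted average of that user's own ratings, then rank.
predict_user <- function(user) {
  r <- ratings[user, ]
  rated <- !is.na(r)
  scores <- vapply(items[!rated], function(i) {
    w <- sim[i, rated]
    sum(w * r[rated]) / sum(w)
  }, numeric(1))
  sort(scores, decreasing = TRUE)
}

predict_user("1")
```

For user 1 the only unrated movie is lost_in_translation, and this sketch scores it at roughly 2.83. Agreement with a library implementation on other data is not guaranteed, because libraries may normalize ratings or weight neighbours differently.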

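- **Similarity matrix**

The similarity matrix is computed between movies (6 x 6) and is shown below.

The values are cosine similarities, not distances, so values closer to 1 indicate more similar items. The zeros on the diagonal are placeholders rather than computed values: the similarity of an item with itself is not stored, and its cosine similarity would be 1.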
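
The cosine between two rating vectors is their dot product divided by the product of their norms. As a hedged worked example, the sketch below recomputes the batman / blue_lagoon entry by hand. It assumes that only users who rated both movies enter the calculation; the vector names are hypothetical.

```r
# Ratings for two movies, retyped from the user/item matrix (users 1 to 7).
batman      <- c(2.5, 3.0, NA,  NA, 3.0,  NA, 3.0)
blue_lagoon <- c(3.0, 4.0, NA, 3.0, 3.0, 1.0, 2.0)

# Keep only the users who rated both movies (assumption of this sketch).
both <- !is.na(batman) & !is.na(blue_lagoon)

# Cosine: dot product divided by the product of the vector norms.
cosine <- sum(batman[both] * blue_lagoon[both]) /
  sqrt(sum(batman[both]^2) * sum(blue_lagoon[both]^2))

round(cosine, 7)
```

Users 1, 2, 5 and 7 rated both movies, which gives 34.5 / sqrt(33.25 * 38), approximately 0.9706, in line with the batman / blue_lagoon entry of the matrix.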

In [8]:
similarity_items <- similarity(user_item_matrix_reclab, 
                               method = "cosine", 
                               which = "items")
as.matrix(similarity_items)

movie,batman,blue_lagoon,ella,lost_in_translation,matrix,spiderman
batman,0.0,0.9705803,0.997227,0.9847836,0.9982897,0.9997098
blue_lagoon,0.9705803,0.0,0.8528029,0.9582708,0.911839,0.9197925
ella,0.997227,0.8528029,0.0,0.998777,0.9808042,0.992093
lost_in_translation,0.9847836,0.9582708,0.998777,0.0,0.9822047,0.9984604
matrix,0.9982897,0.911839,0.9808042,0.9822047,0.0,0.9515988
spiderman,0.9997098,0.9197925,0.992093,0.9984604,0.9515988,0.0


- **Building the model**

The similarity matrix is built internally by the model.

In [9]:
# caution: this takes a long time if the matrix is very large
# https://github.com/mhahsler/recommenderlab/blob/master/R/RECOM_IBCF.R

item_based_rec_model <- Recommender( data = user_item_matrix_reclab, 
                                     method = "IBCF", # Item based
                                     parameter = list(k = 3, normalize = NULL))

In [10]:
item_based_rec_model@model

$description
[1] "IBCF: Reduced similarity matrix"

$sim
6 x 6 sparse Matrix of class "dgCMatrix"
                       batman blue_lagoon      ella lost_in_translation
batman              .                   . 0.9972270           .        
blue_lagoon         0.9705803           . .                   0.9582708
ella                0.9972270           . .                   0.9987770
lost_in_translation 0.9847836           . 0.9987770           .        
matrix              0.9982897           . 0.9808042           0.9822047
spiderman           0.9997098           . 0.9920930           0.9984604
                       matrix spiderman
batman              0.9982897 0.9997098
blue_lagoon         .         0.9197925
ella                .         0.9920930
lost_in_translation .         0.9984604
matrix              .         .        
spiderman           .         .        

$k
[1] 3

$method
[1] "Cosine"

$normalize
NULL

$normalize_sim_matrix
[1] FALSE

$alpha
[1] 0.5

$na_as_zero
[1] FALSE

- **Using the model**

In [11]:
numero_recomendações <- 5

In [12]:
item_based_recomendacoes <- predict( item_based_rec_model,
                                     user_item_matrix_reclab,
                                     n = numero_recomendações )

- **Predicted ratings for the unseen movies**

In [13]:
movies <- sapply( item_based_recomendacoes@items, 
                  function(x){ colnames(user_item_matrix[,-1])[x] } )

dados_ratings_previstos_ibcf <- cbind( melt(movies), ratings = item_based_recomendacoes@ratings %>% unlist )

dados_ratings_previstos_ibcf %<>% 
  rename( user_id = L1 ) %>% 
  left_join(., dados_ratings %>% 
              distinct( user, user_id),
            by ="user_id" )

In [14]:
dados_ratings_previstos_ibcf

value,user_id,ratings,user
lost_in_translation,1,2.834827,antonio
ella,2,3.250194,nunes
spiderman,2,3.249844,nunes
blue_lagoon,3,3.510244,carlota
batman,3,3.000177,carlota
batman,4,2.666587,fiona
spiderman,5,3.833695,valkyria
batman,6,4.250133,jacinta
lost_in_translation,6,4.0,jacinta
spiderman,6,4.0,jacinta


In [16]:
dados_ratings_previstos_ibcf %>% 
    filter( user_id == 5 ) %>% 
    arrange( desc(ratings) )

value,user_id,ratings,user
spiderman,5,3.833695,valkyria


### ***User based filtering***

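- **Similarity matrix**

The values are cosine similarities between users, not distances, so values closer to 1 indicate more similar users. The zeros on the diagonal are placeholders rather than computed values: the similarity of a user with themselves is not stored, and its cosine similarity would be 1.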
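
The user-based variant can be sketched in base R in the same spirit. This is an illustration under stated assumptions, not recommenderlab's UBCF: similarities are cosines over the movies both users rated, and a prediction is the similarity-weighted average over the `nn` most similar users who rated the movie. The names `ratings`, `cosine_pair`, `sim` and `predict_rating` are hypothetical.

```r
# Illustrative sketch only; recommenderlab's UBCF may weight and normalize
# differently, so the numbers need not match its output.
ratings <- matrix(
  c(2.5, 3.0, 2.5,  NA, 3.0, 3.5,
    3.0, 4.0,  NA, 3.5, 3.5,  NA,
     NA,  NA, 2.5, 4.0, 3.5, 3.0,
     NA, 3.0, 2.0, 3.0, 4.0, 2.0,
    3.0, 3.0, 3.5, 5.0, 4.0,  NA,
     NA, 1.0, 4.0,  NA, 4.5,  NA,
    3.0, 2.0,  NA,  NA, 3.5, 4.0),
  nrow = 7, byrow = TRUE,
  dimnames = list(as.character(1:7),
                  c("batman", "blue_lagoon", "ella",
                    "lost_in_translation", "matrix", "spiderman"))
)

# Cosine similarity restricted to the entries present in both vectors.
cosine_pair <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  sum(a[both] * b[both]) / sqrt(sum(a[both]^2) * sum(b[both]^2))
}

# Similarity between every pair of users (diagonal left at 0).
users <- rownames(ratings)
sim <- matrix(0, length(users), length(users), dimnames = list(users, users))
for (u in users) {
  for (v in setdiff(users, u)) {
    sim[u, v] <- cosine_pair(ratings[u, ], ratings[v, ])
  }
}

# Predict one rating: similarity-weighted average over the nn most similar
# users, skipping neighbours who have not rated the movie.
predict_rating <- function(user, movie, nn = 3) {
  neighbours <- names(sort(sim[user, ], decreasing = TRUE))[seq_len(nn)]
  r  <- ratings[neighbours, movie]
  w  <- sim[user, neighbours]
  ok <- !is.na(r)
  sum(w[ok] * r[ok]) / sum(w[ok])
}

predict_rating("3", "blue_lagoon")
```

Because the prediction is a weighted average with positive weights, it always falls between the lowest and highest rating given by the selected neighbours.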

In [17]:
similarity_users <- similarity(user_item_matrix_reclab, 
                               method = "cosine", 
                               which = "users")
as.matrix(similarity_users)

user_id,1,2,3,4,5,6,7
1,0.0,0.9981648,0.9909091,0.9508468,0.9917917,0.8817122,0.9793213
2,0.9981648,0.0,0.9977852,0.9836167,0.9714729,0.8060805,0.9537445
3,0.9909091,0.9977852,0.0,0.9765627,0.997394,0.9943456,0.9897444
4,0.9508468,0.9836167,0.9765627,0.0,0.9663543,0.8823398,0.9155755
5,0.9917917,0.9714729,0.997394,0.9663543,0.0,0.9395973,0.9897553
6,0.8817122,0.8060805,0.9943456,0.8823398,0.9395973,0.0,0.9551954
7,0.9793213,0.9537445,0.9897444,0.9155755,0.9897553,0.9551954,0.0


- **Building the model**

In [18]:
user_based_rec_model <- Recommender( data = user_item_matrix_reclab, 
                                     method = "UBCF", # User based 
                                     parameter = list(nn = 3, normalize = NULL) )

In [19]:
user_based_rec_model@model

$description
[1] "UBCF-Real data: contains full or sample of data set"

$data
7 x 6 rating matrix of class ‘realRatingMatrix’ with 30 ratings.

$method
[1] "cosine"

$nn
[1] 3

$sample
[1] FALSE

$normalize
NULL

$verbose
[1] FALSE


- **Using the model**

In [20]:
numero_recomendacoes <- 10

In [21]:
user_based_recomendacoes <- predict( user_based_rec_model,
                                     user_item_matrix_reclab,
                                     n = numero_recomendacoes )

- **Movie recommendations for the user**

In [22]:
movies <- sapply( user_based_recomendacoes@items, 
                  function(x){ colnames(user_item_matrix[,-1])[x] } )

dados_ratings_previstos_ubcf <- cbind( melt(movies), ratings = user_based_recomendacoes@ratings %>% unlist )

dados_ratings_previstos_ubcf %<>% 
  rename( user_id = L1 ) %>% 
  left_join(., dados_ratings %>% 
              distinct( user, user_id),
            by ="user_id" )

In [23]:
dados_ratings_previstos_ubcf

value,user_id,ratings,user
lost_in_translation,1,2.826976,antonio
spiderman,2,2.1652338,nunes
ella,2,1.6655401,nunes
blue_lagoon,3,2.3315208,carlota
batman,3,1.9983905,carlota
batman,4,0.9968484,fiona
spiderman,5,2.1622789,valkyria
spiderman,6,2.306738,jacinta
lost_in_translation,6,1.3484751,jacinta
batman,6,0.9715363,jacinta


In [24]:
dados_ratings_previstos_ubcf %>% 
    filter( user_id == 3 ) %>% 
    arrange( desc(ratings) )

value,user_id,ratings,user
blue_lagoon,3,2.331521,carlota
batman,3,1.99839,carlota


## Model-based algorithms

Using matrix factorization, these algorithms fill in the "NA" values directly in the user/item matrix.

We use the NNLM package together with the NMF package. NNLM supports models based on *Alternating Least Squares*, which save time and memory when estimating the *ratings*. In the call below, `k = 2` sets the number of latent factors.

https://towardsdatascience.com/prototyping-a-recommender-system-step-by-step-part-2-alternating-least-square-als-matrix-4a76c58714a1
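
To make the idea concrete, here is a minimal base R sketch of a masked non-negative factorization. It swaps in plain multiplicative updates restricted to the observed cells, which is not the algorithm NNLM uses internally; all names (`ratings`, `mask`, `filled`, `W`, `H`, `masked_mse`) are hypothetical, and the result depends on the random start.

```r
# Illustrative masked NMF with multiplicative updates.
# This is NOT what NNLM does internally; it only shows the idea R ~ W %*% H.
set.seed(42)

ratings <- matrix(
  c(2.5, 3.0, 2.5,  NA, 3.0, 3.5,
    3.0, 4.0,  NA, 3.5, 3.5,  NA,
     NA,  NA, 2.5, 4.0, 3.5, 3.0,
     NA, 3.0, 2.0, 3.0, 4.0, 2.0,
    3.0, 3.0, 3.5, 5.0, 4.0,  NA,
     NA, 1.0, 4.0,  NA, 4.5,  NA,
    3.0, 2.0,  NA,  NA, 3.5, 4.0),
  nrow = 7, byrow = TRUE,
  dimnames = list(as.character(1:7),
                  c("batman", "blue_lagoon", "ella",
                    "lost_in_translation", "matrix", "spiderman"))
)

mask   <- !is.na(ratings)           # TRUE where a rating was observed
filled <- ifelse(mask, ratings, 0)  # NA replaced by 0 (ignored via the mask)

k   <- 2                            # number of latent factors
W   <- matrix(runif(nrow(ratings) * k), nrow(ratings), k)
H   <- matrix(runif(k * ncol(ratings)), k, ncol(ratings))
eps <- 1e-9                         # avoids division by zero

# Mean squared error over the observed cells only.
masked_mse <- function(W, H) mean((ratings - W %*% H)[mask]^2)
mse_start <- masked_mse(W, H)

for (iter in 1:500) {
  W <- W * ((filled %*% t(H)) / ((mask * (W %*% H)) %*% t(H) + eps))
  H <- H * ((t(W) %*% filled) / (t(W) %*% (mask * (W %*% H)) + eps))
}
mse_end <- masked_mse(W, H)

# The product fills every cell, including those that were NA.
completed <- W %*% H
dimnames(completed) <- dimnames(ratings)
round(completed, 2)
```

Only the observed cells drive the fit, so the estimates for the unrated cells are extrapolations and are not constrained to stay within the rating scale.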

In [25]:
user_item_matrix_nnlm <- as.matrix(user_item_matrix[,-1])

In [26]:
user_item_matrix_nnlm

batman,blue_lagoon,ella,lost_in_translation,matrix,spiderman
2.5,3.0,2.5,,3.0,3.5
3.0,4.0,,3.5,3.5,
,,2.5,4.0,3.5,3.0
,3.0,2.0,3.0,4.0,2.0
3.0,3.0,3.5,5.0,4.0,
,1.0,4.0,,4.5,
3.0,2.0,,,3.5,4.0


In [27]:
fatoracao_rec_model <- nnmf(user_item_matrix_nnlm, 
                            k = 2,  
                            method = 'scd', 
                            loss = 'mse')

In [28]:
complete_user_item_matrix <- fatoracao_rec_model$W %*% fatoracao_rec_model$H

In [29]:
complete_user_item_matrix

batman,blue_lagoon,ella,lost_in_translation,matrix,spiderman
2.643294,2.639233,2.520116,3.938678,3.389984,3.253988
2.90735,3.906254,2.434689,3.414793,3.748188,2.10046
2.752215,2.974938,2.547696,3.893485,3.534098,3.053634
2.584164,3.335451,2.209941,3.160066,3.328871,2.068232
3.208547,2.920639,3.154122,5.039658,4.109397,4.366837
3.460083,1.039054,4.110634,7.364349,4.390422,7.819339
2.71304,2.179266,2.764584,4.526804,3.46911,4.120288


We can combine the two matrices to obtain the *ratings* of the movies that have not been watched yet and could be recommended.
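
The combination relies on R treating `TRUE` as 1 and `FALSE` as 0 in arithmetic. A tiny sketch with toy values (not taken from the notebook; `observed`, `estimated` and `candidates` are hypothetical names) shows the effect:

```r
# Toy 2 x 2 example; the values are made up for illustration.
observed  <- matrix(c(4, NA, NA, 2), nrow = 2)        # NA = not rated
estimated <- matrix(c(3.9, 2.5, 4.4, 2.1), nrow = 2)  # completed matrix

# is.na() yields TRUE/FALSE; in arithmetic TRUE counts as 1 and FALSE as 0,
# so the product keeps estimates only where no rating was observed.
candidates <- is.na(observed) * estimated
candidates
```

Cells that already had a rating become 0, which is why a later filter on values greater than 0 isolates the candidate recommendations.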

In [30]:
matriz_recomendacoes <- is.na(user_item_matrix_nnlm) * round(complete_user_item_matrix, 2)

In [31]:
matriz_recomendacoes

batman,blue_lagoon,ella,lost_in_translation,matrix,spiderman
0.0,0.0,0.0,3.94,0,0.0
0.0,0.0,2.43,0.0,0,2.1
2.75,2.97,0.0,0.0,0,0.0
2.58,0.0,0.0,0.0,0,0.0
0.0,0.0,0.0,0.0,0,4.37
3.46,0.0,0.0,7.36,0,7.82
0.0,0.0,2.76,4.53,0,0.0


We attach the users again.

In [32]:
matriz_recomendacoes <- cbind( user_item_matrix$user_id, data.frame(matriz_recomendacoes) )

In [33]:
# a few adjustments
matriz_recomendacoes %<>% 
    rename( user_id = `user_item_matrix$user_id` )

In [34]:
matriz_recomendacoes %<>% 
    left_join(., dados_ratings %>% 
              distinct( user, user_id),
            by ="user_id" )

In [35]:
matriz_recomendacoes

user_id,batman,blue_lagoon,ella,lost_in_translation,matrix,spiderman,user
1,0.0,0.0,0.0,3.94,0,0.0,antonio
2,0.0,0.0,2.43,0.0,0,2.1,nunes
3,2.75,2.97,0.0,0.0,0,0.0,carlota
4,2.58,0.0,0.0,0.0,0,0.0,fiona
5,0.0,0.0,0.0,0.0,0,4.37,valkyria
6,3.46,0.0,0.0,7.36,0,7.82,jacinta
7,0.0,0.0,2.76,4.53,0,0.0,didira


We can reshape this into a long-format data set with the possible recommendations.
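
As an alternative sketch in base R, the positions of the "NA" cells can be used directly, which avoids relying on 0 as a marker for already-rated movies. The values are toy data and the names `observed`, `estimated`, `idx` and `long` are hypothetical.

```r
# Toy 2 x 2 example; the values are made up for illustration.
observed  <- matrix(c(4, NA, NA, 2), nrow = 2,
                    dimnames = list(c("1", "2"), c("batman", "ella")))
estimated <- matrix(c(3.9, 2.5, 4.4, 2.1), nrow = 2,
                    dimnames = dimnames(observed))

# Row/column positions of the cells without an observed rating.
idx <- which(is.na(observed), arr.ind = TRUE)

# One row per (user, movie) pair that could be recommended.
long <- data.frame(
  user_id = as.integer(rownames(observed)[idx[, "row"]]),
  movie   = colnames(observed)[idx[, "col"]],
  ratings = estimated[idx]
)

# Order by user, then by predicted rating (highest first).
long[order(long$user_id, -long$ratings), ]
```

This keeps a predicted rating even if it happens to be exactly 0, at the cost of a little more indexing.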

In [36]:
banco_ratings <- matriz_recomendacoes %>% 
    gather( key = movie, value = ratings, -c(user_id, user)  ) %>% 
    filter( ratings > 0 )

In [37]:
banco_ratings

user_id,user,movie,ratings
3,carlota,batman,2.75
4,fiona,batman,2.58
6,jacinta,batman,3.46
3,carlota,blue_lagoon,2.97
2,nunes,ella,2.43
7,didira,ella,2.76
1,antonio,lost_in_translation,3.94
6,jacinta,lost_in_translation,7.36
7,didira,lost_in_translation,4.53
2,nunes,spiderman,2.1


We can see the top 5 recommendations for user 3.

In [38]:
banco_ratings %>% 
    filter( user_id == 3 ) %>% 
    arrange( desc(ratings) ) %>% 
    head(5)

user_id,user,movie,ratings
3,carlota,blue_lagoon,2.97
3,carlota,batman,2.75


## Comparisons

In [None]:
dados_ratings_previstos_ibcf %>% 
    filter( user_id == 3 ) %>% 
    arrange( desc(ratings) )

In [None]:
dados_ratings_previstos_ubcf %>% 
    filter( user_id == 3 ) %>% 
    arrange( desc(ratings) )

In [None]:
banco_ratings %>% 
    filter( user_id == 3 ) %>% 
    arrange( desc(ratings) )
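
The three results for user 3 can also be placed side by side. The sketch below copies the predicted ratings from the outputs shown earlier in this notebook; the object names `comparacao` and `ranks` are hypothetical.

```r
# Predicted ratings for user 3 (carlota), copied from the earlier outputs.
comparacao <- data.frame(
  movie = c("blue_lagoon", "batman"),
  ibcf  = c(3.510244, 3.000177),  # item based filtering
  ubcf  = c(2.331521, 1.998390),  # user based filtering
  nmf   = c(2.97, 2.75)           # matrix factorization
)

# Rank of each movie within each method (1 = highest predicted rating).
ranks <- sapply(comparacao[, -1], function(x) rank(-x))
rownames(ranks) <- comparacao$movie
ranks
```

For this user all three methods rank blue_lagoon above batman, although the predicted values differ between methods.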