***Dr. Emmanuel Dufourq*** www.emmanueldufourq.com

***African Institute for Mathematical Sciences***

***2019***


Credits: This notebook extends the one provided by Prof. Ian Durbach (https://github.com/iandurbach/). Link to the original notebook: https://github.com/iandurbach/datasci-fi/blob/921f19903c20db3b3a957da5ec80d78988a29376/lesson6-more-nns.ipynb

***NOTE***

Be sure to enable hardware acceleration so that the notebook runs on the GPU. Click on `Runtime`, change `runtime type`, and select `GPU` for the *hardware accelerator* option.




In this example we build a recommender system for the full "small" MovieLens dataset. Previously we saw how to use matrix decomposition to represent each movie and each user as a vector of latent variables. Here we use neural networks to learn the "weights" in these latent factors.

In [2]:
devtools::install_github("rstudio/keras")

Downloading GitHub repo rstudio/keras@master


reticulate (NA -> 96421b5c6...) [GitHub]
tensorflow (NA -> 07d9bd539...) [GitHub]
tfruns     (NA -> 1.4         ) [CRAN]
config     (NA -> 0.3         ) [CRAN]


Downloading GitHub repo rstudio/reticulate@master

✔  checking for file ‘/tmp/RtmpkhHpVX/remotes822045c049/rstudio-reticulate-96421b5/DESCRIPTION’
─  preparing ‘reticulate’:
✔  checking DESCRIPTION meta-information
─  cleaning src
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘reticulate_1.13.0-9000.tar.gz’
   


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Downloading GitHub repo rstudio/tensorflow@master


config (NA -> 0.3) [CRAN]
tfruns (NA -> 1.4) [CRAN]


Installing 2 packages: config, tfruns
Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

✔  checking for file ‘/tmp/RtmpkhHpVX/remotes822a3e54fe/rstudio-tensorflow-07d9bd5/DESCRIPTION’
─  preparing ‘tensorflow’:
✔  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘tensorflow_1.14.0.9000.tar.gz’
   


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Installing 2 packages: tfruns, config
Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Skipping install of 'reticulate' from a github remote, the SHA1 (96421b5c) has not changed since last install.
  Use `force = TRUE` to force installation
Skipping install of 'tensorflow' from a github remote, the SHA1 (07d9bd53) has not changed since last install.
  Use `force = TRUE` to force installation

✔  checking for file ‘/tmp/RtmpkhHpVX/remotes822d897ddd/rstudio-keras-8758aae/DESCRIPTION’
─  preparing ‘keras’:
✔  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts (485ms)
─  checking for empty or unneeded directories
   Removed empty directory ‘keras/man-roxygen’
─  building ‘keras_2.2.4.1.9001.tar.gz’
   


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)


In [3]:
library(tidyverse)
library(keras)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.0     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


### Load the data

In [0]:
load(url("https://github.com/iandurbach/datasci-fi/raw/master/data/movielens-small.RData"))

### Look at what is available

In [6]:
ls()

### Check the "ratings" data

In [7]:
ratings

userId,movieId,rating,timestamp
<int>,<int>,<dbl>,<int>
1,31,2.5,1260759144
1,1029,3.0,1260759179
1,1061,3.0,1260759182
1,1129,2.0,1260759185
1,1172,4.0,1260759205
1,1263,2.0,1260759151
1,1287,2.0,1260759187
1,1293,2.0,1260759148
1,1339,3.5,1260759125
1,1343,2.0,1260759131


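### Re-index the user and movie IDs to start at 0

The raw IDs have gaps (for example, the movie IDs above jump from 31 to 1029), so rather than simply subtracting 1 we convert each ID to its position among the distinct IDs via `factor()` and then subtract 1. This gives consecutive, zero-based indices, which is what the embedding layers used later expect.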

In [0]:
ratings <- ratings %>% mutate(userId = -1 + as.numeric(factor(userId)),
                              movieId = -1 + as.numeric(factor(movieId)))
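
The effect of `factor()` in this step can be checked on a handful of made-up IDs. The sketch below is only an illustration: the vector `raw_ids` is hypothetical and not taken from the dataset.

```r
# Hypothetical raw IDs with gaps, similar in spirit to the MovieLens movie IDs
raw_ids <- c(31, 1029, 31, 7, 1029, 500)

# factor() gives each distinct ID a level (in sorted order); as.numeric()
# returns the 1-based level code, so subtracting 1 makes the index 0-based
new_ids <- as.numeric(factor(raw_ids)) - 1

new_ids                                        # 1 3 1 0 3 2
max(new_ids) + 1 == length(unique(raw_ids))    # TRUE: no gaps remain
```

Equal raw IDs always map to the same new index, so no rating is reassigned to a different user or movie.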

In [9]:
ratings

userId,movieId,rating,timestamp
<dbl>,<dbl>,<dbl>,<int>
0,30,2.5,1260759144
0,833,3.0,1260759179
0,859,3.0,1260759182
0,906,2.0,1260759185
0,931,4.0,1260759205
0,1017,2.0,1260759151
0,1041,2.0,1260759187
0,1047,2.0,1260759148
0,1083,3.5,1260759125
0,1087,2.0,1260759131


### Set the number of users and movies

In [0]:
n_users <- length(unique(ratings$userId))
n_movies <- length(unique(ratings$movieId))

In [25]:
n_users

In [14]:
n_movies

### Choose the number of dimensions to use in each embedding

In [0]:
n_factors <- 50

### Split the data into training and testing

Randomly assign 80% of the ratings to the training data and keep the remaining 20% aside as test data.

In [0]:
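# TRUE marks a training row (roughly 80% of the ratings); the split is random
train_indicator <- (runif(nrow(ratings)) < 0.8)
training_ratings <- ratings[train_indicator,]
# Use logical NOT for the complement; -train_indicator would only drop row 1
test_ratings <- ratings[!train_indicator,]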
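
When splitting with a logical indicator, the complement has to be taken with `!` rather than `-`. The sketch below illustrates the difference on a small made-up table; `toy`, `keep` and the seed are hypothetical choices for this example only.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in for the ratings table
toy <- data.frame(id = 1:10, rating = c(3, 4, 2, 5, 1, 4, 3, 2, 5, 4))

keep <- runif(nrow(toy)) < 0.8   # logical vector, TRUE = training row

train <- toy[keep, ]
test  <- toy[!keep, ]            # logical NOT selects the complement

# Together the two pieces cover every row exactly once
nrow(train) + nrow(test) == nrow(toy)

# By contrast, -keep is a vector of 0s and -1s: the 0s are ignored and
# the -1s drop row 1, so almost the whole table is returned
nrow(toy[-keep, ])
```

With `-keep`, the "test" set would overlap almost entirely with the training set, which would make the test error look better than it really is.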

### A note on the Keras Functional API

The functional API lets you build more complex models than are possible with Sequential models. For instance, it allows a model to have multiple inputs and outputs.

With the functional API, the overall architecture is defined by the `keras_model` function, which takes the model's inputs and outputs. Further information here: https://keras.rstudio.com/reference/keras_model.html

We can therefore define the inputs and outputs separately and then bring everything together.

We begin by defining two input layers and then one output layer. Finally, we bring all three together using `keras_model`.

Further information: https://keras.rstudio.com/articles/functional_api.html

### Define two input layers

Below we create two input layers, one for the user and one for the movie. These are created separately.

Each input has shape 1: a single value, the index of the user or movie, which the corresponding embedding layer will look up.

In [0]:
user_in <- layer_input(shape = c(1), dtype = 'int64', name = 'user_in')
movie_in <- layer_input(shape = c(1), dtype = 'int64', name = 'movie_in')

### Define two embedding layers (which connect to the two inputs)

Next we define two separate embedding layers: one takes the user input and the other takes the movie input.

The user embedding has dimension *n_users* x *n_factors*.

The movie embedding has dimension *n_movies* x *n_factors*.

In both cases the input is a single value.

### Question: how many parameters will there be in the user and movie embedding layers? What would each embedding layer look like if you had to draw a simple representation of it?


In [0]:
user_emb <- user_in %>% layer_embedding(input_dim = n_users, output_dim = n_factors, input_length = 1)
movie_emb <- movie_in %>% layer_embedding(input_dim = n_movies, output_dim = n_factors, input_length = 1)
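
One way to picture an embedding layer is as a lookup table: a weight matrix with one row per index, from which the layer returns the row for the index it receives. The sketch below mimics this in base R under that assumption. The sizes, the random weights and the helper `embed()` are hypothetical and are not part of Keras.

```r
set.seed(42)  # arbitrary seed for a reproducible sketch

n_users_toy   <- 4   # hypothetical small sizes, for illustration only
n_factors_toy <- 3

# One row of weights per user index, one column per latent factor
user_weights <- matrix(rnorm(n_users_toy * n_factors_toy),
                       nrow = n_users_toy, ncol = n_factors_toy)

# Keras indices are 0-based while R matrices are 1-based, hence the + 1
embed <- function(idx, weights) weights[idx + 1, , drop = FALSE]

embed(2, user_weights)    # latent vector (1 x 3) for user index 2
length(user_weights)      # trainable parameters: 4 * 3 = 12
```

Training adjusts the entries of this matrix, so the number of parameters in an embedding layer is simply `input_dim * output_dim`.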

### Define one output layer

We now "combine" the embeddings and then proceed as usual.

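To combine the embeddings we simply concatenate them and then flatten the result. The flattened vector then passes through dropout, a dense layer with 70 ReLU units, a second dropout layer, and a final dense layer with a single output: the predicted rating.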

In [0]:
predictions <- layer_concatenate(c(user_emb, movie_emb)) %>%
  layer_flatten() %>% 
  layer_dropout(0.3) %>%
  layer_dense(70, activation='relu') %>% 
  layer_dropout(0.75) %>%
  layer_dense(1)
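
To make the shapes concrete, here is a rough base-R sketch of what this stack computes for one (user, movie) pair at prediction time. All weights are random stand-ins rather than trained values, the variable names are hypothetical, and dropout is left out because it is only active during training.

```r
set.seed(7)  # arbitrary seed

n_factors_toy <- 50
user_vec  <- rnorm(n_factors_toy)   # stand-in for a looked-up user embedding
movie_vec <- rnorm(n_factors_toy)   # stand-in for a looked-up movie embedding

x <- c(user_vec, movie_vec)         # concatenate + flatten: length 100

# Random stand-ins for the dense-layer weights and biases
W1 <- matrix(rnorm(100 * 70, sd = 0.1), nrow = 100)  # 100 inputs -> 70 units
b1 <- rep(0, 70)
W2 <- matrix(rnorm(70, sd = 0.1), nrow = 70)         # 70 units -> 1 output
b2 <- 0

relu <- function(z) pmax(z, 0)

h <- relu(as.vector(x %*% W1) + b1)    # hidden layer, length 70
rating_hat <- sum(h * W2[, 1]) + b2    # single unbounded prediction

c(length(x), length(h), length(rating_hat))   # 100 70 1
```

The final layer has no activation, so the predicted rating is not restricted to the range of the observed ratings.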

### Create the Model

Now that we have created the two inputs and one output, we can create the final model, which links everything together.

All the complexity went into defining the inputs and outputs, so combining everything takes very little code.

In [0]:
model <- keras_model(c(user_in, movie_in), predictions)

### Print out a summary of the model

Notice how there are two inputs which are both followed by an embedding layer. Then notice how the embeddings are concatenated together (each had 50 dimensions which results in a final shape of 100).

In [23]:
summary(model)

Model: "model"
________________________________________________________________________________
Layer (type)              Output Shape      Param #  Connected to               
user_in (InputLayer)      [(None, 1)]       0                                   
________________________________________________________________________________
movie_in (InputLayer)     [(None, 1)]       0                                   
________________________________________________________________________________
embedding (Embedding)     (None, 1, 50)     33550    user_in[0][0]              
________________________________________________________________________________
embedding_1 (Embedding)   (None, 1, 50)     453300   movie_in[0][0]             
________________________________________________________________________________
concatenate (Concatenate) (None, 1, 100)    0        embedding[0][0]            
                                                     embedding_1[0][0]          
_____________
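
The parameter counts in the summary can be checked by hand. The sketch below works backwards from the two embedding counts printed above. The dense-layer counts are derived from the layer definitions (100 inputs into 70 units, then 70 into 1) rather than from the printed summary, which is cut off before those rows.

```r
n_factors_chk <- 50

# Embedding parameter counts as printed in the summary
user_emb_params  <- 33550
movie_emb_params <- 453300

# Each embedding has (number of indices) x n_factors weights, so the
# implied numbers of distinct users and movies are:
user_emb_params / n_factors_chk     # 671
movie_emb_params / n_factors_chk    # 9066

# A dense layer has inputs * units weights plus one bias per unit
dense1_params <- 100 * 70 + 70      # 7070
dense2_params <- 70 * 1 + 1         # 71

user_emb_params + movie_emb_params + dense1_params + dense2_params   # 493991
```

Almost all of the parameters sit in the movie embedding, which is typical when the number of items is much larger than the hidden layers.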

### Question: Can you roughly draw what the network looks like given this information?

### Compile

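As before, we compile the model by choosing an optimizer and a loss function: here the Adam optimizer and mean squared error (MSE). There is nothing new in this step.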

In [0]:
model %>% compile(optimizer='Adam', loss='mse')
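
For reference, the `mse` loss is the mean of the squared differences between the true and predicted ratings. A minimal sketch with made-up numbers (`y_true` and `y_pred` are hypothetical):

```r
# Hypothetical true ratings and model predictions
y_true <- c(4.0, 2.5, 5.0, 3.0)
y_pred <- c(3.5, 3.0, 4.0, 3.0)

mse  <- mean((y_true - y_pred)^2)   # the quantity that loss = 'mse' minimises
rmse <- sqrt(mse)                   # back on the same scale as the ratings

mse     # 0.375
rmse
```

Taking the square root is often convenient when reporting results, since an RMSE can be read directly as a typical error in rating points.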

### What data do we input into the NN?

In our network we specified that there are two inputs, one for the user id and the other for the movie id. Let's create a list which contains these two data arrays and check the dimension.

In [29]:
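# One array per model input, in the same order as passed to keras_model()
x_train <- list(training_ratings$userId, training_ratings$movieId)
y_train <- as.matrix(training_ratings$rating)
lengths(x_train)  # number of training examples in each input
dim(y_train)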

### Fitting the model

Very similar to before. We provide the training features, training targets and the number of epochs.

In [0]:
model %>% fit(x = x_train, 
           y = y_train, 
           batch_size=64, 
           epochs=2)
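
The batch size and the number of epochs together determine how many weight updates are performed. A small sketch of the arithmetic, assuming a hypothetical training set of 80000 ratings (the actual number depends on the random split):

```r
n_train_toy <- 80000   # hypothetical number of training ratings
batch_size  <- 64
epochs      <- 2

# One update per batch; the last batch may be smaller than batch_size
steps_per_epoch <- ceiling(n_train_toy / batch_size)
total_updates   <- steps_per_epoch * epochs

steps_per_epoch   # 1250
total_updates     # 2500
```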

### Evaluate the performance on the test data

In [31]:
model %>% evaluate(list(test_ratings$userId, test_ratings$movieId), 
                test_ratings$rating)

### Predicting on the test data

In [32]:
model %>% predict(list(test_ratings$userId, test_ratings$movieId))

0
2.762176
2.772403
2.720495
3.235361
2.967207
3.095442
3.469285
2.555620
2.921433
2.446158
