# Exploration: Embeddings

## Lesson 4 Homework Assignment

MovieLens dataset: https://grouplens.org/datasets/movielens/

In [1]:
import os
current_dir = os.getcwd()

LESSON_HOME_DIR = current_dir + '/'
DATA_HOME_DIR = LESSON_HOME_DIR + 'data/'

#DATASET_DIR = DATA_HOME_DIR + 'ml-20m/'
DATASET_DIR = DATA_HOME_DIR + 'ml-small/'
MODEL_DIR = DATASET_DIR + 'models/'

In [3]:
if not os.path.exists(DATASET_DIR):
    %cd $DATA_HOME_DIR
    #!wget http://files.grouplens.org/datasets/movielens/ml-20m.zip
    #!unzip ml-20m.zip
    !wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
    !unzip ml-latest-small.zip && mv ml-latest-small ml-small

In [4]:
if not os.path.exists(MODEL_DIR): os.mkdir(MODEL_DIR)

## Data Setup

In [5]:
import pandas as pd
import numpy as np

In [6]:
ratings = pd.read_csv(DATASET_DIR+'ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [7]:
len(ratings)

100004

Movie names for a more user-friendly display

In [8]:
movie_names = pd.read_csv(DATASET_DIR+'movies.csv').set_index('movieId')['title'].to_dict()

In [9]:
users = ratings.userId.unique()
movies = ratings.movieId.unique()

Update movie and user ids so that they are contiguous integers, which we want when using embeddings.

In [10]:
userid2idx = {o:i for i,o in enumerate(users)}
movieid2idx = {o:i for i,o in enumerate(movies)}

ratings.movieId = ratings.movieId.apply(lambda x: movieid2idx[x])
ratings.userId = ratings.userId.apply(lambda x: userid2idx[x])

In [11]:
user_min, user_max, movie_min, movie_max = (ratings.userId.min(), 
    ratings.userId.max(), ratings.movieId.min(), ratings.movieId.max())

user_min, user_max, movie_min, movie_max

(0, 670, 0, 9065)

In [12]:
n_users = ratings.userId.nunique()
n_movies = ratings.movieId.nunique()

n_users, n_movies

(671, 9066)

Set the number of latent factors in each embedding.

In [13]:
n_factors = 50

Slick way to randomly split data into training and validation.

In [14]:
msk = np.random.rand(len(ratings)) < 0.8
trn = ratings[msk]
val = ratings[~msk]

In [None]:
batch_size=64

## Dot Product Model

The most basic model is a dot product of a movie embedding and a user embedding. Let's see how well that works:

In [29]:
from keras.layers import Input, Embedding, merge
from keras.layers.core import Flatten
from keras.models import Model
from keras.optimizers import Adam
from keras.regularizers import l2

In [72]:
def embedding_input(name, n_in, n_out, reg):
    inp = Input(shape=(1,), dtype='int64', name=name)
    return inp, Embedding(n_in, n_out, input_length=1, W_regularizer=l2(reg))(inp)

In [73]:
def create_bias(inp, n_in):
    x = Embedding(n_in, 1, input_length=1)(inp)
    return Flatten()(x)

In [19]:
user_in, user_embed = embedding_input('user_in', n_users, n_factors, 1e-4)
movie_in, movie_embed = embedding_input('movie_in', n_movies, n_factors, 1e-4)

In [22]:
x = merge([user_embed, movie_embed], mode='dot')
x = Flatten()(x)
model = Model([user_in, movie_in], x)
model.compile(Adam(0.001), loss='mse')

In [30]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
user_in (InputLayer)             (None, 1)             0                                            
____________________________________________________________________________________________________
movie_in (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
embedding_1 (Embedding)          (None, 1, 50)         33550       user_in[0][0]                    
____________________________________________________________________________________________________
embedding_2 (Embedding)          (None, 1, 50)         453300      movie_in[0][0]                   
___________________________________________________________________________________________

In [63]:
model.optimizer.lr.get_value().item()

0.0010000000474974513

In [31]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=batch_size, nb_epoch=1, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 79956 samples, validate on 20048 samples
Epoch 1/1


<keras.callbacks.History at 0x7f24339e9dd0>

Let's track predictions as we train.

In [53]:
predictions = []

In [54]:
predict = np.squeeze(model.predict([np.array([3]), np.array([6])])).item()

predictions.append(predict)
predictions

[2.9463322162628174]

In [66]:
model.optimizer.lr=0.01

In [67]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=batch_size, nb_epoch=3, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 79956 samples, validate on 20048 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f2432e98f10>

In [68]:
predict = np.squeeze(model.predict([np.array([3]), np.array([6])])).item()

predictions.append(predict)
predictions

[2.9463322162628174, 4.9115986824035645]

In [69]:
model.optimizer.lr=0.001

In [70]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=batch_size, nb_epoch=6, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 79956 samples, validate on 20048 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


<keras.callbacks.History at 0x7f2432e98f90>

In [71]:
predict = np.squeeze(model.predict([np.array([3]), np.array([6])])).item()

predictions.append(predict)
predictions

[2.9463322162628174, 4.9115986824035645, 4.935777187347412]

According to the course, the best benchmarks for `loss` are a bit over `0.9`, so this model doesn't seem to be working that well...

### Adding Bias

Bias represents how positive or negative each user is, and how good each movie is.

In [None]:
user_in, user_embed = embedding_input('user_in', n_users, n_factors, 1e-4)
movie_in, movie_embed = embedding_input('movie_in', n_movies, n_factors, 1e-4)

In [74]:
user_bias = create_bias(user_in, n_users)
movie_bias = create_bias(movie_in, n_movies)

In [75]:
x = merge([user_embed, movie_embed], mode='dot')
x = Flatten()(x)
x = merge([x, user_bias], mode='sum')
x = merge([x, movie_bias], mode='sum')
model = Model([user_in, movie_in], x)
model.compile(Adam(0.001), loss='mse')

In [76]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=batch_size, nb_epoch=1, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 79956 samples, validate on 20048 samples
Epoch 1/1


<keras.callbacks.History at 0x7f2431424f50>

In [77]:
predictions = []

In [78]:
predict = np.squeeze(model.predict([np.array([3]), np.array([6])])).item()

predictions.append(predict)
predictions

[4.878170013427734]

In [79]:
model.optimizer.lr=0.01

In [80]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=batch_size, nb_epoch=3, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 79956 samples, validate on 20048 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f2431043b90>

In [81]:
predict = np.squeeze(model.predict([np.array([3]), np.array([6])])).item()

predictions.append(predict)
predictions

[4.878170013427734, 4.722662925720215]

In [82]:
model.optimizer.lr=0.001

In [83]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=batch_size, nb_epoch=6, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 79956 samples, validate on 20048 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


<keras.callbacks.History at 0x7f2432e08290>

In [84]:
predict = np.squeeze(model.predict([np.array([3]), np.array([6])])).item()

predictions.append(predict)
predictions

[4.878170013427734, 4.722662925720215, 4.69352912902832]

With bias, we arrive at an upper-4 rating after the very first iteration (took more epochs without bias).

Loss is also lower after the equivalent epochs cycle. So let's keep training!

In [85]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=batch_size, nb_epoch=10, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 79956 samples, validate on 20048 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f2431043e50>

In [86]:
predict = np.squeeze(model.predict([np.array([3]), np.array([6])])).item()

predictions.append(predict)
predictions

[4.878170013427734, 4.722662925720215, 4.69352912902832, 4.727512836456299]

In [87]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=5, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 79956 samples, validate on 20048 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f2431a2a310>

In [88]:
predict = np.squeeze(model.predict([np.array([3]), np.array([6])])).item()

predictions.append(predict)
predictions

[4.878170013427734,
 4.722662925720215,
 4.69352912902832,
 4.727512836456299,
 4.831404685974121]

Loss is now quite a bit better than the `0.9` benchmark.

In [89]:
if not os.path.exists(MODEL_DIR + 'bias.h5'):
    model.save_weights(MODEL_DIR + 'bias.h5')
model.load_weights(MODEL_DIR + 'bias.h5')

## Deep Neural Network Model

Rather than creating a special purpose architecture (like our dot-product with bias earlier), it's often both easier and more accurate to use a standard neural network. Let's try it! Here, we simply concatenate the user and movie embeddings into a single vector, which we feed into the neural net.

In [94]:
from keras.layers.core import Dense, Dropout

In [90]:
user_in, user_embed = embedding_input('user_in', n_users, n_factors, 1e-4)
movie_in, movie_embed = embedding_input('movie_in', n_movies, n_factors, 1e-4)

In [95]:
x = merge([user_embed, movie_embed], mode='concat')
x = Flatten()(x)
x = Dropout(0.3)(x)
x = Dense(70, activation='relu')(x)
x = Dropout(0.75)(x)
x = Dense(1)(x)
nn = Model([user_in, movie_in], x)
nn.compile(Adam(0.001), loss='mse')

Run the same initial epoch cycle again...

In [96]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=batch_size, nb_epoch=1, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 79956 samples, validate on 20048 samples
Epoch 1/1


<keras.callbacks.History at 0x7f242d7ab210>

In [97]:
predictions = []

In [98]:
predict = np.squeeze(model.predict([np.array([3]), np.array([6])])).item()

predictions.append(predict)
predictions

[4.839086532592773]

In [99]:
model.optimizer.lr=0.01

In [100]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=batch_size, nb_epoch=3, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 79956 samples, validate on 20048 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f2431043e10>

In [101]:
predict = np.squeeze(model.predict([np.array([3]), np.array([6])])).item()

predictions.append(predict)
predictions

[4.839086532592773, 4.8729634284973145]

In [102]:
model.optimizer.lr=0.001

In [103]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=batch_size, nb_epoch=6, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 79956 samples, validate on 20048 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


<keras.callbacks.History at 0x7f243316d4d0>

In [104]:
predict = np.squeeze(model.predict([np.array([3]), np.array([6])])).item()

predictions.append(predict)
predictions

[4.839086532592773, 4.8729634284973145, 4.852595329284668]

Note that the first epoch loss for the neural net was better than the 25th epoch loss for the dot product with bias model!

## Latent Factor Analysis

...of the top 2000 most popular movies.

In [105]:
counts = ratings.groupby('movieId')['rating'].count()
topMovies = counts.sort_values(ascending=False)[:2000]
topMovies = np.array(topMovies.index)

In [106]:
# arbitrary indices from counting movies with the most ratings
topMovies[:10]

array([ 57,  49,  99,  92, 143,  72, 402, 417,  79,  89])

In [108]:
# MovieLens indices
[movies[topMovies[i]] for i in range(10)]

[356, 296, 318, 593, 260, 480, 2571, 1, 527, 589]

In [109]:
# Movie names
[movie_names[movies[topMovies[i]]] for i in range(10)]

['Forrest Gump (1994)',
 'Pulp Fiction (1994)',
 'Shawshank Redemption, The (1994)',
 'Silence of the Lambs, The (1991)',
 'Star Wars: Episode IV - A New Hope (1977)',
 'Jurassic Park (1993)',
 'Matrix, The (1999)',
 'Toy Story (1995)',
 "Schindler's List (1993)",
 'Terminator 2: Judgment Day (1991)']

We'll look at the movie embeddings. We create a 'model' - which in keras is simply a way of associating one or more inputs with one more more outputs, using the functional API. Here, our input is the movie id (a single id), and the output is the movie's embedding (an array of 50 latent factors).

In [113]:
get_movie_emb = Model(movie_in, movie_embed)
movie_emb = np.squeeze(get_movie_emb.predict([topMovies]))
movie_emb.shape

(2000, 50)

Because it's hard to interpret 50 latent factors, we use [PCA](https://plot.ly/ipython-notebooks/principal-component-analysis/) to perform dimensionality reduction.