Recommender systems are usually built from several connected components, each of which can be implemented with neural networks:

1. Retrieval - selects an initial set of candidates by efficiently weeding out all candidates the user is not interested in.
2. Ranking 
3. Post-Ranking



In [2]:
pip install -q tensorflow-recommenders


In [3]:
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

import pprint
import numpy as np

In [4]:
# Ratings Data 
ratings=tfds.load("movielens/100k-ratings", split="train")

Downloading and preparing dataset movielens/100k-ratings/0.1.0 (download: 4.70 MiB, generated: 32.41 MiB, total: 37.10 MiB) to /root/tensorflow_datasets/movielens/100k-ratings/0.1.0...



Dataset movielens downloaded and prepared to /root/tensorflow_datasets/movielens/100k-ratings/0.1.0. Subsequent calls will reuse this data.


In [5]:
for x in ratings.take(1).as_numpy_iterator():
  pprint.pprint(x)

{'bucketized_user_age': 45.0,
 'movie_genres': array([7]),
 'movie_id': b'357',
 'movie_title': b"One Flew Over the Cuckoo's Nest (1975)",
 'raw_user_age': 46.0,
 'timestamp': 879024327,
 'user_gender': True,
 'user_id': b'138',
 'user_occupation_label': 4,
 'user_occupation_text': b'doctor',
 'user_rating': 4.0,
 'user_zip_code': b'53211'}


In [6]:
# Features of all available movies 
movies = tfds.load("movielens/100k-movies", split="train")

Downloading and preparing dataset movielens/100k-movies/0.1.0 (download: 4.70 MiB, generated: 150.35 KiB, total: 4.84 MiB) to /root/tensorflow_datasets/movielens/100k-movies/0.1.0...



Dataset movielens downloaded and prepared to /root/tensorflow_datasets/movielens/100k-movies/0.1.0. Subsequent calls will reuse this data.


In [7]:
for x in movies.take(1).as_numpy_iterator():
  pprint.pprint(x)

{'movie_genres': array([4]),
 'movie_id': b'1681',
 'movie_title': b'You So Crazy (1994)'}


In [8]:
ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"]
})

movies = movies.map(lambda x: x["movie_title"])

In [9]:
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

In [10]:
train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

In [11]:
movie_titles = movies.batch(1_000)
user_ids = ratings.batch(1_000_000).map(lambda x: x["user_id"])

In [12]:
unique_movie_titles = np.unique(np.concatenate(list(movie_titles)))

In [13]:
unique_user_ids = np.unique(np.concatenate(list(user_ids)))

In [14]:
unique_movie_titles[:10]

array([b"'Til There Was You (1997)", b'1-900 (1994)',
       b'101 Dalmatians (1996)', b'12 Angry Men (1957)', b'187 (1997)',
       b'2 Days in the Valley (1996)',
       b'20,000 Leagues Under the Sea (1954)',
       b'2001: A Space Odyssey (1968)',
       b'3 Ninjas: High Noon At Mega Mountain (1998)',
       b'39 Steps, The (1935)'], dtype=object)

## Implementing the model

The first step is to decide on the dimensionality of the query and candidate representations.

Higher values generally give more accurate models, but they are also slower to fit and more prone to overfitting.




## The query tower

In [15]:
embedding_dimension = 32

Second, we define the model itself. We convert user id strings into integer indices, then map those indices to user embeddings via an embedding layer.

We use the list of unique user ids we computed earlier as the vocabulary.

In [16]:
user_model = tf.keras.Sequential([
    tf.keras.layers.StringLookup(
        vocabulary=unique_user_ids, mask_token=None),
    # Add an additional embedding to account for unknown tokens.
    tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dimension)
])
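
As a rough illustration of what the two layers above do, the following plain-Python sketch mimics the lookup and embedding steps without TensorFlow. The toy vocabulary, the 4-dimensional table, and the helper names `lookup` and `embed` are assumptions made for this example; they are not part of the notebook or of the Keras API.

```python
import random

# Hypothetical toy vocabulary; the notebook uses unique_user_ids instead.
vocabulary = [b"138", b"42", b"7"]
embedding_dimension = 4  # the notebook uses 32

# Index 0 is reserved for out-of-vocabulary ids, so known ids start at 1.
token_to_index = {token: i + 1 for i, token in enumerate(vocabulary)}


def lookup(token):
    """Map a raw id to its integer index, falling back to the OOV slot 0."""
    return token_to_index.get(token, 0)


# One row per vocabulary entry, plus one extra row for the OOV slot.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-0.05, 0.05) for _ in range(embedding_dimension)]
    for _ in range(len(vocabulary) + 1)
]


def embed(token):
    """Return the embedding row for a raw id."""
    return embedding_table[lookup(token)]


print(lookup(b"42"))       # index of a known id
print(lookup(b"unknown"))  # unseen ids all share the OOV row
```

The extra row is why the embedding layer is sized `len(vocabulary) + 1`: every id outside the vocabulary is routed to the same shared vector.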

A simple model like this corresponds exactly to a classic matrix factorization approach.

We can easily extend it into an arbitrarily complex model using standard Keras components (for example, by defining a subclass of tf.keras.Model), as long as it returns an embedding_dimension-wide output at the end.
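
To make the matrix factorization connection concrete, here is a minimal sketch, assuming the affinity score is the dot product of the two tower outputs. The embeddings and the `affinity` helper are hypothetical values chosen for illustration.

```python
# Hypothetical 2-dimensional embeddings for two users and three movies.
user_embeddings = {"u1": [1.0, 0.0], "u2": [0.5, 0.5]}
movie_embeddings = {"m1": [2.0, 1.0], "m2": [0.0, 3.0], "m3": [1.0, 1.0]}


def affinity(user_vector, movie_vector):
    """Dot product of a user embedding and a movie embedding."""
    return sum(u * m for u, m in zip(user_vector, movie_vector))


# Scoring every user against every movie is the same as multiplying the
# user matrix by the transposed movie matrix: the matrix factorization form.
scores = {
    user: {
        movie: affinity(user_vector, movie_vector)
        for movie, movie_vector in movie_embeddings.items()
    }
    for user, user_vector in user_embeddings.items()
}

for user, row in scores.items():
    print(user, row)
```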

## The candidate tower

In [17]:
movie_model = tf.keras.Sequential(
    [tf.keras.layers.StringLookup(
         vocabulary=unique_movie_titles, mask_token=None), 
     tf.keras.layers.Embedding(len(unique_movie_titles) + 1, embedding_dimension)]
)

## Metrics
We have (user, movie) pairs. To figure out how good our model is, we need to compare the affinity score that the model calculates for this pair to the scores of all the other possible candidates: if the score for the positive pair is higher than for all other candidates, our model is highly accurate.

To do this, we can use the tfrs.metrics.FactorizedTopK metric. It has one required argument: the dataset of candidates that are used as implicit negatives for evaluation.

In our case, that's the movies dataset, converted into embeddings via our movie model.

In [18]:
metrics = tfrs.metrics.FactorizedTopK(
    candidates=movies.batch(128).map(movie_model)
)
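
The idea behind a top-K metric can be sketched in plain Python. This is only an illustration of the concept under simplifying assumptions (dot-product scores, ties counted in favour of the positive); it is not how `tfrs.metrics.FactorizedTopK` is implemented, and the helper names and toy vectors are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_hit(query, positive, candidates, k):
    """True if fewer than k candidates score strictly higher than the positive."""
    positive_score = dot(query, positive)
    better = sum(1 for c in candidates if dot(query, c) > positive_score)
    return better < k


def top_k_accuracy(pairs, candidates, k):
    """Fraction of (query, positive) pairs whose positive lands in the top k."""
    hits = [top_k_hit(query, positive, candidates, k) for query, positive in pairs]
    return sum(hits) / len(hits)


# Hypothetical candidate embeddings and (query, positive) embedding pairs.
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # positive ties for the best score
    ([0.0, 1.0], [1.0, 0.0]),  # two candidates outrank the positive
]

for k in (1, 3):
    print(k, top_k_accuracy(pairs, candidates, k))
```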

## Loss 

TFRS has several loss layers and tasks to make this easy.

We'll use the Retrieval task object, a convenience wrapper that bundles together the loss function and metric computation.

In [19]:
task = tfrs.tasks.Retrieval(metrics=metrics)

The task itself is a Keras layer that takes the query and candidate embeddings as arguments and returns the computed loss. We'll use that to implement the model's training loop.
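
A minimal sketch of the kind of loss such a task computes, assuming the default behaviour is a softmax cross-entropy over the batch in which each query's own candidate is the positive and the other candidates in the batch act as negatives, summed over examples. The function name and toy embeddings are hypothetical, and the real task may differ in details.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def in_batch_softmax_loss(query_embeddings, candidate_embeddings):
    """Summed cross-entropy where row i's correct class is candidate i."""
    total = 0.0
    for i, query in enumerate(query_embeddings):
        logits = [dot(query, candidate) for candidate in candidate_embeddings]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]
    return total


# Hypothetical batch of two (query, positive candidate) embedding pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]     # each query matches its own candidate
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # each query matches the other candidate

print(in_batch_softmax_loss(queries, aligned))
print(in_batch_softmax_loss(queries, misaligned))
```

The loss is lower when each query scores its own candidate above the rest of the batch, which is what training pushes the two towers towards.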

## The Full Model
We can now put together the full training model. TFRS exposes a base model class (tfrs.models.Model) which streamlines building models: all we need to do is set up the components in the __init__ method and implement the compute_loss method, which takes in the raw features and returns a loss value.

The base model will then take care of creating the appropriate training loop to fit our model.

In [20]:
from typing import Dict, Text

class MovieLensModel(tfrs.Model):

  def __init__(self, user_model, movie_model):
    super().__init__()
    self.movie_model: tf.keras.Model = movie_model
    self.user_model: tf.keras.Model = user_model
    self.task: tf.keras.layers.Layer = task

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor: 
    # Pick out the user features and pass them to the user model.
    user_embeddings = self.user_model(features["user_id"])
    # Pick out the movie features and pass them to the movie model,
    # getting positive embeddings back.
    positive_movie_embeddings = self.movie_model(features["movie_title"])

    # The task computes the loss and the metrics.
    return self.task(user_embeddings, positive_movie_embeddings)


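The tfrs.Model base class is only a convenience. The same functionality can be achieved by inheriting from tf.keras.Model directly and overriding the train_step and test_step methods:

In [21]: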
class NoBaseClassMovielensModel(tf.keras.Model):

  def __init__(self, user_model, movie_model):
    super().__init__()
    self.movie_model: tf.keras.Model = movie_model
    self.user_model: tf.keras.Model = user_model
    self.task: tf.keras.layers.Layer = task

  def train_step(self, features: Dict[Text, tf.Tensor]) -> Dict[Text, tf.Tensor]:
    # Set up a gradient tape to record gradients.
    with tf.GradientTape() as tape:

      # Loss computation.
      user_embeddings = self.user_model(features["user_id"])
      positive_movie_embeddings = self.movie_model(features["movie_title"])
      loss = self.task(user_embeddings, positive_movie_embeddings)

      # Handle regularization losses as well.
      regularization_loss = sum(self.losses)

      total_loss = loss + regularization_loss

    gradients = tape.gradient(total_loss, self.trainable_variables)
    self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))

    metrics = {metric.name: metric.result() for metric in self.metrics}
    metrics["loss"] = loss
    metrics["regularization_loss"] = regularization_loss
    metrics["total_loss"] = total_loss

    return metrics

  def test_step(self, features: Dict[Text, tf.Tensor]) -> Dict[Text, tf.Tensor]:

    # Loss computation.
    user_embeddings = self.user_model(features["user_id"])
    positive_movie_embeddings = self.movie_model(features["movie_title"])
    loss = self.task(user_embeddings, positive_movie_embeddings)

    # Handle regularization losses as well.
    regularization_loss = sum(self.losses)

    total_loss = loss + regularization_loss

    metrics = {metric.name: metric.result() for metric in self.metrics}
    metrics["loss"] = loss
    metrics["regularization_loss"] = regularization_loss
    metrics["total_loss"] = total_loss

    return metrics
    



## Fitting and Evaluating

After defining the model, we can fit and evaluate it with the standard Keras routines.

In [22]:
model = MovieLensModel(user_model, movie_model)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

In [23]:
# shuffle batch and cache
cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()

In [24]:
model.fit(cached_train, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f67bf869e10>

## TensorBoard
If you want to monitor the training process with TensorBoard, you can pass a tf.keras.callbacks.TensorBoard callback to the fit() call, with its log directory set to logs/fit.

Then start TensorBoard using %tensorboard --logdir logs/fit

In [25]:
model.evaluate(cached_test, return_dict=True)



{'factorized_top_k/top_100_categorical_accuracy': 0.23274999856948853,
 'factorized_top_k/top_10_categorical_accuracy': 0.022299999371170998,
 'factorized_top_k/top_1_categorical_accuracy': 0.000699999975040555,
 'factorized_top_k/top_50_categorical_accuracy': 0.12460000067949295,
 'factorized_top_k/top_5_categorical_accuracy': 0.009499999694526196,
 'loss': 28244.771484375,
 'regularization_loss': 0,
 'total_loss': 28244.771484375}

Test set performance is much worse than training performance. This is due to two factors:

1. Our model is likely to perform better on the data that it has seen, simply because it can memorize it. This overfitting phenomenon is especially strong when models have many parameters. It can be mitigated by model regularization and by using user and movie features that help the model generalise better to unseen data.

2. The model is re-recommending some of the users' already watched movies. These known positive watches can crowd test movies out of the top K recommendations. This can be tackled by excluding previously seen movies from test recommendations.
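
The second point can be sketched as a simple filter applied to a ranked list. The function `recommend_unseen` and the toy titles are hypothetical; this is plain Python for illustration, not a TFRS API.

```python
def recommend_unseen(ranked_titles, seen_titles, k):
    """Return the first k ranked titles the user has not already watched."""
    seen = set(seen_titles)
    unseen = [title for title in ranked_titles if title not in seen]
    return unseen[:k]


# Hypothetical model output, ordered from highest to lowest score.
ranked = ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"]
# Hypothetical titles this user already rated in the training data.
watched = ["Movie A", "Movie C"]

print(recommend_unseen(ranked, watched, k=2))
```

Without the filter, the two watched titles would occupy two of the top three slots and push unseen test movies further down the list.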



## Making Predictions

Now that we have a model, we would like to be able to make predictions. We can use the tfrs.layers.factorized_top_k.BruteForce layer to do this

In [29]:
# Create a model that takes in raw query features and
# recommends movies out of the entire movies dataset.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index_from_dataset(
    tf.data.Dataset.zip((movies.batch(100), movies.batch(100).map(model.movie_model)))
)

# Get recommendations 
_, titles = index(tf.constant(["42"]))
print(f"Recommendations for user 42: {titles[0, :3]}")

Recommendations for user 42: [b'Bridges of Madison County, The (1995)'
 b'Father of the Bride Part II (1995)' b'Rudy (1993)']
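
Conceptually, a brute-force index scores the query embedding against every candidate embedding and keeps the best k. The sketch below illustrates that idea in plain Python, assuming dot-product scoring; `brute_force_top_k` and the toy embeddings are hypothetical, and this is not the TFRS implementation.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_top_k(query_embedding, candidates, k):
    """Score every candidate against the query and keep the k best.

    `candidates` is a list of (title, embedding) pairs. The cost grows
    linearly with the number of candidates, since each one is scored.
    """
    scored = ((dot(query_embedding, emb), title) for title, emb in candidates)
    return heapq.nlargest(k, scored)


# Hypothetical candidate embeddings and a hypothetical query embedding.
candidates = [("A", [1.0, 0.0]), ("B", [0.0, 1.0]), ("C", [1.0, 1.0])]
query = [1.0, 0.5]

print(brute_force_top_k(query, candidates, k=2))
```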


The brute force layer is going to be too slow to serve a model with many possible candidates. The following sections show how to speed this up by using an approximate retrieval index. 

## Model Serving 

After the model is trained, we need to deploy it. 

In a two-tower retrieval model, serving has two components: 
- a serving query model, which takes in features of the query and transforms them into a query embedding

- a serving candidate model. This most often takes the form of an approximate nearest neighbours (ANN) index, which allows fast approximate lookup of candidates in response to a query produced by the query model

In TFRS, both components can be packaged into a single exportable model, giving us a model that takes the raw user id and returns the titles of top movies for that user.

This is done by exporting the model in the SavedModel format, which makes it possible to serve it using TensorFlow Serving.

To deploy a model like this, we simply export the BruteForce layer we created above: 

In [30]:
import tempfile
import os
# Export the query model
with tempfile.TemporaryDirectory() as tmp: 
  path = os.path.join(tmp, "model")

  # Save the index
  tf.saved_model.save(index, path)

  # Load it back; this can also be done in TensorFlow Serving.
  loaded = tf.saved_model.load(path)


  # Pass a user id in, get top predicted movie titles back
  scores, titles = loaded(["42"])

  print(f"Recommendations: {titles[0][:3]}")



INFO:tensorflow:Assets written to: /tmp/tmp7v6i_27u/model/assets




Recommendations: [b'Bridges of Madison County, The (1995)'
 b'Father of the Bride Part II (1995)' b'Rudy (1993)']
