In this tutorial, we build a simple two-tower ranking model using the [MovieLens 100K dataset](https://grouplens.org/datasets/movielens/100k/) with TF-Ranking. The model can rank and recommend movies for a given user according to that user's predicted ratings.

## Setup

Import the TF-Ranking library and its dependencies. The `tensorflow-ranking` and `tensorflow-datasets` packages must already be installed:

In [1]:
from typing import Dict, Tuple
import pprint

import tensorflow as tf

import tensorflow_datasets as tfds
import tensorflow_ranking as tfr

## Read the data

Prepare to train a model by creating a ratings dataset and movies dataset. Use `user_id` as the query input feature, `movie_title` as the document input feature, and `user_rating` as the label to train the ranking model.

In [2]:
%%capture --no-display
# Ratings data.
ratings = tfds.load('movielens/100k-ratings', split="train")
# Features of all the available movies.
movies = tfds.load('movielens/100k-movies', split="train")

# Select the basic features.
ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
    "user_rating": x["user_rating"]
})

In [3]:
ratings

<MapDataset element_spec={'movie_title': TensorSpec(shape=(), dtype=tf.string, name=None), 'user_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'user_rating': TensorSpec(shape=(), dtype=tf.float32, name=None)}>

In [4]:
movies

<PrefetchDataset element_spec={'movie_genres': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'movie_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'movie_title': TensorSpec(shape=(), dtype=tf.string, name=None)}>

Build vocabularies to convert all user ids and all movie titles into integer indices for embedding layers:

In [5]:
movies = movies.map(lambda x: x["movie_title"])
users = ratings.map(lambda x: x["user_id"])

In [6]:
# StringLookup maps each string to an integer index. With mask_token=None,
# index 0 is reserved for out-of-vocabulary strings.
user_ids_vocabulary = tf.keras.layers.StringLookup(mask_token=None)
user_ids_vocabulary.adapt(users.batch(1000))

movie_titles_vocabulary = tf.keras.layers.StringLookup(mask_token=None)
movie_titles_vocabulary.adapt(movies.batch(1000))
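
To make the index mapping concrete, here is a minimal plain-Python sketch of what a string lookup with `mask_token=None` does conceptually: index 0 is reserved for out-of-vocabulary strings, and known strings get indices starting at 1. The helper names `build_vocab` and `lookup` are hypothetical, and the token ordering used here (first seen) is an assumption that may differ from the order the Keras layer produces.

```python
from typing import Dict, Iterable, List

OOV_INDEX = 0  # With mask_token=None, index 0 is reserved for unknown strings.


def build_vocab(tokens: Iterable[str]) -> Dict[str, int]:
    """Assigns each distinct token an index starting at 1 (hypothetical helper)."""
    vocab: Dict[str, int] = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab) + 1
    return vocab


def lookup(vocab: Dict[str, int], tokens: Iterable[str]) -> List[int]:
    """Maps tokens to indices, sending unknown tokens to OOV_INDEX."""
    return [vocab.get(token, OOV_INDEX) for token in tokens]


# Made-up user ids for illustration only.
user_vocab = build_vocab(["138", "92", "138", "301"])
indices = lookup(user_vocab, ["92", "301", "unseen-user"])

# An embedding table needs one row per known token plus one row for OOV.
embedding_rows = len(user_vocab) + 1
```

The extra row for the out-of-vocabulary index is why the embedding layers in the model are sized with `vocabulary_size()` rather than with the number of distinct strings.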

Group the ratings by `user_id` to form the per-user lists that a ranking model trains on:

In [7]:
key_func = lambda x: user_ids_vocabulary(x["user_id"])
reduce_func = lambda key, dataset: dataset.batch(100)
ds_train = ratings.group_by_window(key_func=key_func, reduce_func=reduce_func, window_size=100)
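
Conceptually, this step collects each user's ratings into lists of at most 100 items. The plain-Python sketch below illustrates that idea on made-up data. `group_into_lists` is a hypothetical helper, and it ignores the streaming order in which `tf.data` emits windows.

```python
from collections import defaultdict
from typing import Dict, List


def group_into_lists(ratings: List[Dict], list_size: int) -> List[List[Dict]]:
    """Groups ratings by user_id and splits each user's ratings into chunks
    of at most list_size items (hypothetical helper)."""
    per_user = defaultdict(list)
    for rating in ratings:
        per_user[rating["user_id"]].append(rating)

    lists = []
    for user_ratings in per_user.values():
        for start in range(0, len(user_ratings), list_size):
            lists.append(user_ratings[start:start + list_size])
    return lists


# Made-up ratings for illustration only.
toy_ratings = [
    {"user_id": "1", "movie_title": "A", "user_rating": 4.0},
    {"user_id": "2", "movie_title": "A", "user_rating": 2.0},
    {"user_id": "1", "movie_title": "B", "user_rating": 5.0},
    {"user_id": "1", "movie_title": "C", "user_rating": 1.0},
]
toy_lists = group_into_lists(toy_ratings, list_size=2)
list_lengths = [len(lst) for lst in toy_lists]
```

Every resulting list contains ratings from a single user, and a user with more ratings than `list_size` contributes several lists.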

In [8]:
for x in ds_train.take(1):
	for key, value in x.items():
		print(f"Shape of {key}: {value.shape}")
		print(f"Example values of {key}: {value.numpy()}")
		print()

Shape of movie_title: (100,)
Example values of movie_title: [b'Man Who Would Be King, The (1975)' b'Silence of the Lambs, The (1991)'
 b'Next Karate Kid, The (1994)' b'2001: A Space Odyssey (1968)'
 b'Usual Suspects, The (1995)' b'Critical Care (1997)'
 b'Annie Hall (1977)' b'Manhattan (1979)' b'Picture Bride (1995)'
 b'Jefferson in Paris (1995)' b'Baton Rouge (1988)'
 b'Pink Floyd - The Wall (1982)' b'Searching for Bobby Fischer (1993)'
 b'Vermont Is For Lovers (1992)' b'Nightmare on Elm Street, A (1984)'
 b'Raging Bull (1980)' b"Nobody's Fool (1994)"
 b'Star Trek: The Motion Picture (1979)' b'To Die For (1995)'
 b'When Harry Met Sally... (1989)' b'Graduate, The (1967)'
 b'Shawshank Redemption, The (1994)' b'Just Cause (1995)'
 b'Murder in the First (1995)' b'Tommy Boy (1995)'
 b'Miami Rhapsody (1995)' b'Star Trek: Generations (1994)'
 b'Circle of Friends (1995)' b'Last of the Mohicans, The (1992)'
 b'Return of Martin Guerre, The (Retour de Martin Guerre, Le) (1982)'
 b'Congo (1995)' 

Generate batched features and labels:

In [9]:
def _features_and_labels(
		x: Dict[str, tf.Tensor]) -> Tuple[Dict[str, tf.Tensor], tf.Tensor]:
	labels = x.pop("user_rating")
	return x, labels


ds_train = ds_train.map(_features_and_labels)
ds_train = ds_train.apply(tf.data.experimental.dense_to_ragged_batch(batch_size = 32))

The `user_id` and `movie_title` tensors generated in `ds_train` have shape `[32, None]`. The second dimension is 100 in most cases, except for lists that contain fewer than 100 items. The model therefore needs to work on ragged tensors.

In [10]:
for x, label in ds_train.take(1):
	for key, value in x.items():
		print(f"Shape of {key}: {value.shape}")
		print(f"Example values of {key}: {value.numpy()}")
		print()
	print(f"Shape of label: {label.shape}")
	print(f"Example values of label: {label.numpy()}")

Shape of movie_title: (32, None)
Example values of movie_title: [[b'Man Who Would Be King, The (1975)'
  b'Silence of the Lambs, The (1991)' b'Next Karate Kid, The (1994)' ...
  b'Swan Princess, The (1994)' b'Alice in Wonderland (1951)'
  b'Amadeus (1984)']
 [b'Flower of My Secret, The (Flor de mi secreto, La) (1995)'
  b'Little Princess, The (1939)' b'Time to Kill, A (1996)' ...
  b'Caro Diario (Dear Diary) (1994)' b'Wings of the Dove, The (1997)'
  b'Mrs. Doubtfire (1993)']
 [b'Kundun (1997)' b'Scream (1996)' b'Power 98 (1995)' ...
  b"Sophie's Choice (1982)" b'Giant (1956)'
  b'FairyTale: A True Story (1997)']
 ...
 [b'Assassins (1995)' b'Harlem (1993)' b'Rumble in the Bronx (1995)' ...
  b'Sudden Death (1995)' b'Empire Strikes Back, The (1980)'
  b'Monty Python and the Holy Grail (1974)']
 [b'Bob Roberts (1992)' b'Willy Wonka and the Chocolate Factory (1971)'
  b'Hot Shots! Part Deux (1993)' ... b'Back to the Future (1985)'
  b'Three Colors: Blue (1993)' b'Michael (1996)']
 [b'Litt

## Define a model

Define a ranking model by inheriting from `tf.keras.Model` and implementing the `call` method:

In [11]:
class MovieLensRankingModel(tf.keras.Model):

	def __init__(self, user_vocab, movie_vocab):
		super().__init__()

		# Set up user and movie vocabulary and embedding.
		self.user_vocab = user_vocab
		self.movie_vocab = movie_vocab
		self.user_embed = tf.keras.layers.Embedding(
			user_vocab.vocabulary_size(), 64)
		self.movie_embed = tf.keras.layers.Embedding(
			movie_vocab.vocabulary_size(), 64)

	def call(self, features: Dict[str, tf.Tensor]) -> tf.Tensor:
		# Define how the ranking scores are computed: 
		# Take the dot-product of the user embeddings with the movie embeddings.

		user_embeddings = self.user_embed(self.user_vocab(features["user_id"]))
		movie_embeddings = self.movie_embed(
			self.movie_vocab(features["movie_title"])
		)

		return tf.reduce_sum(user_embeddings * movie_embeddings, axis = 2)
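
The score computation can be illustrated without TensorFlow. The sketch below assumes tiny hand-written embeddings (the values and the helper `dot_scores` are made up for illustration) and computes one score per movie as the dot product with the user embedding, which is what an element-wise product followed by a sum over the last axis amounts to.

```python
from typing import List


def dot_scores(user_embedding: List[float],
               movie_embeddings: List[List[float]]) -> List[float]:
    """Returns one dot-product score per movie for a single user."""
    return [
        sum(u * m for u, m in zip(user_embedding, movie_embedding))
        for movie_embedding in movie_embeddings
    ]


user_embedding = [1.0, 0.5]
movie_embeddings = [
    [2.0, 0.0],   # 1.0 * 2.0 + 0.5 * 0.0 = 2.0
    [0.0, 4.0],   # 1.0 * 0.0 + 0.5 * 4.0 = 2.0
    [-1.0, 1.0],  # 1.0 * -1.0 + 0.5 * 1.0 = -0.5
]
scores = dot_scores(user_embedding, movie_embeddings)
```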

Create the model, and then compile it with ranking `tfr.keras.losses` and `tfr.keras.metrics`, which are the core of the TF-Ranking package. 

This example uses a ranking-specific **softmax loss**, a listwise loss that encourages all relevant items in a list to be ranked above the irrelevant ones. Unlike the softmax loss in multi-class classification, where exactly one class is positive and the rest are negative, the TF-Ranking version supports multiple relevant documents per query list as well as non-binary (graded) relevance labels.
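
As a rough sketch of the idea, the plain-Python function below computes a listwise softmax loss for one list as the negative sum of each label times the log-softmax of its score. This follows the commonly stated form of the loss. It is an illustration under that assumption, not TF-Ranking's implementation, which also handles masking, ragged inputs, and reduction across lists. The name `listwise_softmax_loss` is hypothetical.

```python
import math
from typing import List


def listwise_softmax_loss(labels: List[float], scores: List[float]) -> float:
    """Computes -sum_i labels[i] * log(softmax(scores)[i]) for one list."""
    max_score = max(scores)  # Subtract the max for numerical stability.
    log_normalizer = max_score + math.log(
        sum(math.exp(s - max_score) for s in scores))
    return -sum(y * (s - log_normalizer) for y, s in zip(labels, scores))


# Made-up graded labels: several items are relevant, to different degrees.
labels = [5.0, 3.0, 1.0]
good_loss = listwise_softmax_loss(labels, scores=[2.0, 1.0, 0.0])
bad_loss = listwise_softmax_loss(labels, scores=[0.0, 1.0, 2.0])
```

Scores that order the items consistently with their labels (`good_loss`) yield a lower loss than scores that reverse the order (`bad_loss`), and every item with a nonzero label contributes to the loss.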

For ranking metrics, this example uses **Normalized Discounted Cumulative Gain (NDCG)** and **Mean Reciprocal Rank (MRR)**, which measure the user utility of a ranked query list with position discounts. For more details, see the [offline metrics](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Offline_metrics) section of the Wikipedia article on evaluation measures in information retrieval.
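
To make the position discount concrete, the plain-Python sketch below computes NDCG and the reciprocal rank for a single list. It assumes the common choices of gain `2^label - 1` and discount `1 / log2(rank + 1)` for NDCG, and it treats any label greater than zero as relevant for the reciprocal rank. The library's defaults and thresholds may differ. MRR is the mean of the reciprocal rank over lists. The helper names `ndcg` and `reciprocal_rank` are hypothetical.

```python
import math
from typing import List


def _ranked_labels(labels: List[float], scores: List[float]) -> List[float]:
    """Returns the labels ordered by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def _dcg(ranked: List[float]) -> float:
    """Discounted cumulative gain with gain 2^label - 1 (an assumption)."""
    return sum((2.0 ** label - 1.0) / math.log2(rank + 1)
               for rank, label in enumerate(ranked, start=1))


def ndcg(labels: List[float], scores: List[float]) -> float:
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    ideal = _dcg(sorted(labels, reverse=True))
    return _dcg(_ranked_labels(labels, scores)) / ideal if ideal > 0 else 0.0


def reciprocal_rank(labels: List[float], scores: List[float]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, label in enumerate(_ranked_labels(labels, scores), start=1):
        if label > 0:
            return 1.0 / rank
    return 0.0


# Made-up labels and scores for illustration only.
ndcg_value = ndcg([3.0, 2.0, 0.0], [0.1, 0.9, 0.5])
rr_value = reciprocal_rank([0.0, 0.0, 1.0], [0.9, 0.5, 0.1])
```

In the first example the most relevant item is ranked last, so NDCG falls below 1. In the second, the only relevant item sits at rank 3, so the reciprocal rank is one third.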

In [12]:
# Create the ranking model, trained with a ranking loss and evaluated with
# ranking metrics.
model = MovieLensRankingModel(user_ids_vocabulary, movie_titles_vocabulary)
optimizer = tf.keras.optimizers.Adagrad(0.5)
loss = tfr.keras.losses.get(
	loss = tfr.keras.losses.RankingLossKey.SOFTMAX_LOSS, ragged = True
)
eval_metrics = [
	tfr.keras.metrics.get(key = "ndcg", name = "metric/ndcg", ragged = True),
	tfr.keras.metrics.get(key = "mrr", name = "metric/mrr", ragged = True)
]
model.compile(optimizer = optimizer, loss = loss, metrics = eval_metrics)

## Train and evaluate the model

Train the model with `model.fit`.

In [13]:
model.fit(ds_train, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3

<keras.callbacks.History at 0x7fba21b59a30>

Generate predictions and evaluate.

In [14]:
# Get the movie title candidate list. A batch size of 2000 exceeds the number
# of movies, so the first batch contains every title.
movie_titles = next(iter(movies.batch(2000)))

In [15]:
movie_titles

<tf.Tensor: shape=(1682,), dtype=string, numpy=
array([b'You So Crazy (1994)', b'Love Is All There Is (1996)',
       b'Fly Away Home (1996)', ..., b'Great White Hype, The (1996)',
       b'Venice/Venice (1992)', b'Stalingrad (1993)'], dtype=object)>

In [16]:
# Generate the input for user 42.
inputs = {
	"user_id": tf.expand_dims(tf.repeat("42", repeats = movie_titles.shape[0]), axis = 0),
	"movie_title": tf.expand_dims(movie_titles, axis = 0)
}

In [17]:
# Get movie recommendations for user 42.
scores = model(inputs)
titles = tfr.utils.sort_by_scores(scores, [tf.expand_dims(movie_titles, axis = 0)])[0]
print(f"Top 5 recommendations for user 42: {titles[0, :5]}")

Top 5 recommendations for user 42: [b'Sound of Music, The (1965)' b'Titanic (1997)'
 b"It's a Wonderful Life (1946)" b'Air Force One (1997)'
 b'Jerry Maguire (1996)']
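
The ranking step amounts to sorting the candidate titles by descending score and keeping the first few. Here is a minimal plain-Python sketch of that idea, with made-up titles and scores and a hypothetical helper `top_k_titles`:

```python
from typing import List


def top_k_titles(scores: List[float], titles: List[str], k: int) -> List[str]:
    """Returns the k titles with the highest scores, best first."""
    ranked = sorted(zip(scores, titles), key=lambda pair: pair[0], reverse=True)
    return [title for _, title in ranked[:k]]


# Made-up candidates for illustration only.
toy_titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
toy_scores = [0.2, 1.5, -0.3, 0.9]
top_two = top_k_titles(toy_scores, toy_titles, k=2)
```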
