# Recommendation System: Ranking Stage

Real-world recommender systems are often composed of two stages:

- The retrieval stage is responsible for selecting an initial set of hundreds of candidates from all possible candidates. The main objective of this model is to efficiently weed out all candidates that the user is not interested in. Because the retrieval model may be dealing with millions of candidates, it has to be computationally efficient.
- The ranking stage takes the outputs of the retrieval model and fine-tunes them to select the best possible handful of recommendations. Its task is to narrow down the set of items the user may be interested in to a shortlist of likely candidates.
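
The two stages above can be sketched with a toy example in plain Python. This is only an illustration of the funnel shape: the item names, vectors, popularity values, and both scoring functions are made up, and a real retrieval stage would use an approximate nearest-neighbour index rather than a full sort.

```python
# Toy two-stage recommender: cheap retrieval, then a costlier ranking pass.
# All names, vectors, and scoring rules below are hypothetical.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(user_vec, item_vecs, k):
    """Stage 1: keep the k items with the highest dot-product score."""
    by_score = sorted(item_vecs, key=lambda name: dot(user_vec, item_vecs[name]), reverse=True)
    return by_score[:k]

def rank(user_vec, candidates, item_vecs, popularity, n):
    """Stage 2: re-score only the retrieved candidates with a richer function."""
    def score(name):
        return dot(user_vec, item_vecs[name]) + 0.5 * popularity[name]
    return sorted(candidates, key=score, reverse=True)[:n]

item_vecs = {
    "A": [1.0, 0.0],
    "B": [0.8, 0.2],
    "C": [0.0, 1.0],
    "D": [-1.0, 0.0],
}
popularity = {"A": 0.1, "B": 0.9, "C": 0.5, "D": 0.9}
user_vec = [1.0, 0.1]

candidates = retrieve(user_vec, item_vecs, k=2)   # cheap filter over all items
shortlist = rank(user_vec, candidates, item_vecs, popularity, n=1)  # costly pass over few items
print(candidates, shortlist)
```

Note that the ranking stage can reorder the candidates it receives: the extra signal only has to be computed for the few items that survive retrieval.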

In [1]:
# Import packages
import os
import numpy as np
import tensorflow as tf
from pprint import pprint
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

tf.__version__

'2.7.0'

## Prepare data

In [2]:
os.listdir("/database/tensorflow-datasets/")

['movielens', 'datasets', 'tiny_shakespeare', 'imdb_reviews', 'downloads']

For the ranking stage, the user ratings serve as the training objective: the model learns to predict the rating a user gave to a movie.

In [3]:
ratings = tfds.load("movielens/100k-ratings", split="train", data_dir="/database/tensorflow-datasets/")

ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
    "user_rating": x["user_rating"]
})


In [4]:
for x in ratings.take(3).as_numpy_iterator():
	pprint(x)

{'movie_title': b"One Flew Over the Cuckoo's Nest (1975)",
 'user_id': b'138',
 'user_rating': 4.0}
{'movie_title': b'Strictly Ballroom (1992)',
 'user_id': b'92',
 'user_rating': 2.0}
{'movie_title': b'Very Brady Sequel, A (1996)',
 'user_id': b'301',
 'user_rating': 4.0}


In [5]:
# Create train and test split
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)
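
The split above relies on the shuffled order being the same every time the dataset is read, which is what reshuffle_each_iteration=False provides. A plain-Python sketch of why that matters follows; the list of ten integers and the reversal used to simulate "a different order on the second read" are illustrative assumptions, not how tf.data shuffles internally.

```python
import random

def shuffled_once(items, seed):
    """Shuffle a copy a single time; every later read sees the same order."""
    order = list(items)
    random.Random(seed).shuffle(order)
    return order

examples = list(range(10))
fixed = shuffled_once(examples, seed=42)

train_ids = fixed[:8]   # analogous to shuffled.take(8)
test_ids = fixed[8:]    # analogous to shuffled.skip(8)

# With one fixed order the two splits cannot overlap.
overlap = set(train_ids) & set(test_ids)

# If each read produced a different order, take() and skip() would disagree.
# Reversing the list stands in for "a different order" to keep this deterministic.
second_pass = fixed[::-1]
leaky_test_ids = second_pass[8:]
leak = set(train_ids) & set(leaky_test_ids)

print(len(overlap), len(leak))
```

In the second case the test examples also appear in the training split, so the evaluation would be measured partly on data the model has already seen.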

In [6]:
movie_titles = ratings.batch(1_000_000).map(lambda x: x["movie_title"])
user_ids = ratings.batch(1_000_000).map(lambda x: x["user_id"])

unique_movie_titles = np.unique(np.concatenate(list(movie_titles)))
unique_user_ids = np.unique(np.concatenate(list(user_ids)))

## Implement Model

In [7]:
class RankingModel(tf.keras.Model):
	def __init__(self, embed_dim = 64) -> None:
		super().__init__()
		embedding_dimension = embed_dim
		self.user_embeddings = tf.keras.Sequential([
			tf.keras.layers.StringLookup(vocabulary=unique_user_ids, mask_token=None),
			tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dimension)
		])
		self.movie_embeddings = tf.keras.Sequential([
			tf.keras.layers.StringLookup(vocabulary=unique_movie_titles, mask_token=None),
			tf.keras.layers.Embedding(len(unique_movie_titles) + 1, embedding_dimension)
		])
		self.ratings = tf.keras.Sequential([
			tf.keras.layers.Dense(256, activation="relu"),
			tf.keras.layers.Dense(64, activation="relu"),
			tf.keras.layers.Dense(1)
		])

	def call(self, inputs):
		user_id, movie_title = inputs
		user_embedding = self.user_embeddings(user_id)
		movie_embedding = self.movie_embeddings(movie_title)
		return self.ratings(tf.concat([user_embedding, movie_embedding], axis=1))

The model takes a user ID and a movie title and outputs a predicted rating.
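
To make the data flow concrete, here is a minimal plain-Python sketch of the lookup, embedding, and concatenation steps. It assumes the default StringLookup behaviour with mask_token=None, where index 0 is reserved for out-of-vocabulary tokens, which is why the embedding tables above have one extra row. The tiny vocabularies, the helper names, and the random initialisation are hypothetical stand-ins, not the Keras implementation.

```python
import random

def build_lookup(vocabulary):
    """Map each known token to 1..N; index 0 is reserved for unknown tokens."""
    table = {token: i + 1 for i, token in enumerate(vocabulary)}
    return lambda token: table.get(token, 0)

def build_embedding(num_rows, dim, seed):
    """A table of small random vectors, one row per index."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(num_rows)]

user_vocab = ["138", "301", "92"]   # toy stand-in for unique_user_ids
movie_vocab = ["Speed (1994)", "Strictly Ballroom (1992)"]
dim = 4

user_lookup = build_lookup(user_vocab)
movie_lookup = build_lookup(movie_vocab)
user_table = build_embedding(len(user_vocab) + 1, dim, seed=0)    # +1 for the unknown row
movie_table = build_embedding(len(movie_vocab) + 1, dim, seed=1)

def features_for(user_id, movie_title):
    """Concatenate both embeddings; the dense layers would consume this vector."""
    return user_table[user_lookup(user_id)] + movie_table[movie_lookup(movie_title)]

known = features_for("92", "Speed (1994)")
unknown = features_for("9999", "Speed (1994)")   # unseen user falls back to row 0
print(len(known), user_lookup("9999"))
```

An unseen user still gets a prediction, but it is based on the shared out-of-vocabulary embedding rather than anything learned about that user.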

In [8]:
RankingModel()((["42"], ["One Flew Over the Cuckoo's Nest (1975)"])) # Without training

Consider rewriting this model with the Functional API.


<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[0.00769137]], dtype=float32)>

We'll use the Ranking task object from TFRS, a convenience wrapper that bundles together the loss function and metric computation. Paired with the MeanSquaredError Keras loss, it trains the model to predict the ratings.

In [9]:
task = tfrs.tasks.Ranking(
	loss = tf.keras.losses.MeanSquaredError(),
	metrics=[tf.keras.metrics.RootMeanSquaredError()]
)

The task itself is a Keras layer that takes the true labels and the predicted ratings as arguments and returns the computed loss.
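
As a rough illustration of what the loss and the metric measure, the following plain-Python sketch computes mean squared error and its square root by hand. The labels are the three ratings printed earlier; the predictions are made up, and the function name is illustrative rather than part of Keras or TFRS.

```python
import math

def mean_squared_error(labels, predictions):
    """Average of the squared differences between labels and predictions."""
    errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return sum(errors) / len(errors)

labels = [4.0, 2.0, 4.0]        # ratings from the three examples shown earlier
predictions = [3.5, 3.0, 4.0]   # hypothetical model outputs

mse = mean_squared_error(labels, predictions)   # (0.25 + 1.0 + 0.0) / 3
rmse = math.sqrt(mse)                           # same units as the ratings
print(mse, rmse)
```

RMSE is reported as the metric because it is expressed in rating points, which makes it easier to interpret than the squared loss.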

Now we put all this together into a model.

In [10]:
class MovielensModel(tfrs.models.Model):

	def __init__(self):
		super().__init__()
		self.ranking_model = RankingModel()
		self.task: tf.keras.layers.Layer = tfrs.tasks.Ranking(
			loss = tf.keras.losses.MeanSquaredError(),
			metrics=[tf.keras.metrics.RootMeanSquaredError()]
		)

	def call(self, features):
		return self.ranking_model((features["user_id"], features["movie_title"]))

	def compute_loss(self, features, training=False):
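		# Read the label without mutating the caller's feature dict;
		# call() only uses "user_id" and "movie_title".
		labels = features["user_rating"]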
		rating_predictions = self(features)
		return self.task(labels=labels, predictions=rating_predictions)

## Learn and evaluate

In [11]:
model = MovielensModel()
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

In [12]:
cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()

In [13]:
model.fit(cached_train, epochs=10, validation_data=cached_test)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f181c0df490>

In [14]:
model.evaluate(cached_test, return_dict=True)



{'root_mean_squared_error': 0.9844292402267456,
 'loss': 0.9631555080413818,
 'regularization_loss': 0,
 'total_loss': 0.9631555080413818}

## Testing the ranking model

In [15]:
test_ratings = {}
test_movie_titles = ["M*A*S*H (1970)", "Dances with Wolves (1990)", "Speed (1994)"]

for movie_title in test_movie_titles:
	test_ratings[movie_title] = model({
		"user_id": np.array(["42"]),
		"movie_title": np.array([movie_title])
	})

print("Ratings:")
for title, score in sorted(test_ratings.items(), key=lambda x: x[1], reverse=True):
	print(f"{title}: {score}")

Ratings:
M*A*S*H (1970): [[3.8500729]]
Dances with Wolves (1990): [[3.6245043]]
Speed (1994): [[3.5271575]]
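
Each score above is printed as a nested (1, 1) array. When building a shortlist it can be handier to flatten the scores to plain floats first. A minimal sketch, using nested lists as stand-ins for the model outputs and a hypothetical top_k helper:

```python
# Nested lists stand in for the (1, 1)-shaped model outputs printed above.
raw_scores = {
    "Speed (1994)": [[3.5271575]],
    "M*A*S*H (1970)": [[3.8500729]],
    "Dances with Wolves (1990)": [[3.6245043]],
}

def top_k(scores, k):
    """Flatten each score to a float and return the k best (title, score) pairs."""
    flat = {title: float(value[0][0]) for title, value in scores.items()}
    return sorted(flat.items(), key=lambda pair: pair[1], reverse=True)[:k]

best = top_k(raw_scores, k=2)
for title, score in best:
    print(f"{title}: {score:.2f}")
```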


The model above is a decent start towards building a ranking system. Beyond the model itself, a careful understanding of which objectives are worth optimizing is also necessary.