# Recommendation System: Retrieval Stage

Retrieval models are often composed of two sub-models:

- A query model computing the query representation (normally a fixed-dimensionality embedding vector) using query features.
- A candidate model computing the candidate representation (an equally-sized vector) using the candidate features.

The outputs of the two models are then multiplied together (typically as a dot product of the two vectors) to give a query-candidate affinity score, with higher scores expressing a better match between the candidate and the query.
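To make the affinity score concrete, here is a minimal NumPy sketch. The embedding values are made up for illustration; in the real model they would come from the query and candidate towers.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings (illustrative values only).
query_embedding = np.array([0.5, -1.0, 0.25, 2.0])
candidate_embeddings = np.array([
    [0.5, -1.0, 0.25, 2.0],    # points the same way as the query
    [-0.5, 1.0, -0.25, -2.0],  # points the opposite way
    [1.0, 0.0, 0.0, 0.0],      # only weakly related
])

# Affinity score: dot product between the query and each candidate.
scores = candidate_embeddings @ query_embedding

# A higher score means a better match, so rank by descending score.
ranking = np.argsort(-scores)
print(scores, ranking)
```

The candidate aligned with the query ranks first and the opposite one ranks last.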

In [1]:
# Import packages
import os
import numpy as np
import tensorflow as tf
from pprint import pprint
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

from typing import Dict, Text

tf.__version__

'2.7.0'

## Prepare data

In [2]:
os.listdir("/database/tensorflow-datasets/")

['movielens', 'datasets', 'tiny_shakespeare', 'imdb_reviews', 'downloads']

In [3]:
# Load data
ratings = tfds.load("movielens/100k-ratings", split="train", data_dir="/database/tensorflow-datasets/")
movies = tfds.load("movielens/100k-movies", split="train", data_dir="/database/tensorflow-datasets/")

2021-12-10 19:53:35.479201: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-10 19:53:35.483276: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-10 19:53:35.483725: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-10 19:53:35.484593: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags

In [4]:
for x in ratings.take(1).as_numpy_iterator():
	pprint(x)

{'bucketized_user_age': 45.0,
 'movie_genres': array([7]),
 'movie_id': b'357',
 'movie_title': b"One Flew Over the Cuckoo's Nest (1975)",
 'raw_user_age': 46.0,
 'timestamp': 879024327,
 'user_gender': True,
 'user_id': b'138',
 'user_occupation_label': 4,
 'user_occupation_text': b'doctor',
 'user_rating': 4.0,
 'user_zip_code': b'53211'}


2021-12-10 19:53:36.202170: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


In [5]:
for x in movies.take(1).as_numpy_iterator():
	pprint(x)

{'movie_genres': array([4]),
 'movie_id': b'1681',
 'movie_title': b'You So Crazy (1994)'}


2021-12-10 19:53:36.294936: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


In [6]:
# For this iteration, keeping only `movie_title` and `user_id` information
ratings = ratings.map(lambda x: {
	"movie_title": x["movie_title"],
	"user_id": x["user_id"],
})
movies = movies.map(lambda x: x["movie_title"])

In [7]:
# Create train and test sets using a random split (a time-based split would be preferable)
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)
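The random split above is a simplification. A time-based split could look roughly like the sketch below. It assumes the `timestamp` field is kept rather than dropped in the earlier `map`. The rows and the `temporal_split` helper are hypothetical and written in plain Python for clarity.

```python
# Hypothetical (user_id, movie_title, timestamp) rows for illustration.
interactions = [
    ("138", "A", 879024327),
    ("681", "B", 875000000),
    ("346", "C", 880000000),
    ("42", "D", 876500000),
    ("939", "E", 878000000),
]

def temporal_split(rows, train_fraction=0.8):
    """Sort by timestamp and cut once, so test rows are never earlier than train rows."""
    ordered = sorted(rows, key=lambda row: row[2])
    cutoff = int(len(ordered) * train_fraction)
    return ordered[:cutoff], ordered[cutoff:]

train_rows, test_rows = temporal_split(interactions)
print(len(train_rows), len(test_rows))
```

Splitting this way means the model is evaluated on interactions that happened after everything it was trained on, which is closer to how a recommender is used in practice.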

In [8]:
for x in train.take(1).as_numpy_iterator():
	pprint(x)

{'movie_title': b'Postman, The (1997)', 'user_id': b'681'}


In [9]:
for x in test.take(1).as_numpy_iterator():
	pprint(x)

{'movie_title': b'M*A*S*H (1970)', 'user_id': b'346'}


In [10]:
# Get the unique movie titles and user IDs present in the data
movie_titles = movies.batch(1000)
user_ids = ratings.batch(1_000_000).map(lambda x: x["user_id"])

unique_movie_titles = np.unique(np.concatenate(list(movie_titles)))
unique_user_ids = np.unique(np.concatenate(list(user_ids)))

print(unique_movie_titles[:10])
print(unique_user_ids[:5])

[b"'Til There Was You (1997)" b'1-900 (1994)' b'101 Dalmatians (1996)'
 b'12 Angry Men (1957)' b'187 (1997)' b'2 Days in the Valley (1996)'
 b'20,000 Leagues Under the Sea (1954)' b'2001: A Space Odyssey (1968)'
 b'3 Ninjas: High Noon At Mega Mountain (1998)' b'39 Steps, The (1935)']
[b'1' b'10' b'100' b'101' b'102']


## Implement model

Choosing the architecture of our model is a key part of modelling.

Because we are building a two-tower retrieval model, we can build each tower separately and then combine them in the final model.
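As a rough mental model, each tower is a vocabulary lookup followed by an embedding table. The NumPy sketch below mimics that idea. The `LookupTower` class is hypothetical and untrained. It reserves index 0 for out-of-vocabulary tokens, which is why an embedding table needs one more row than the vocabulary has entries.

```python
import numpy as np

class LookupTower:
    """Toy stand-in for a string lookup followed by an embedding table."""

    def __init__(self, vocabulary, dim, seed=0):
        # Index 0 is reserved for unknown tokens; known tokens start at 1.
        self.index = {token: i + 1 for i, token in enumerate(vocabulary)}
        rng = np.random.default_rng(seed)
        # One extra row for the out-of-vocabulary index.
        self.table = rng.normal(size=(len(vocabulary) + 1, dim))

    def __call__(self, tokens):
        ids = [self.index.get(token, 0) for token in tokens]
        return self.table[ids]

tower = LookupTower(vocabulary=[b"1", b"10", b"100"], dim=32)
embeddings = tower([b"10", b"unknown-user"])
print(embeddings.shape)
```

Unknown IDs all share the single row at index 0, so the model can still produce an embedding for users or movies it never saw during training.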

In [11]:
# Set embedding dimension
embedding_dimension = 32

In [12]:
# Define user model
user_model = tf.keras.Sequential([
	tf.keras.layers.StringLookup(vocabulary=unique_user_ids, mask_token=None),
	tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dimension),
	tf.keras.layers.BatchNormalization()
])

In [13]:
# Define candidate model
movie_model = tf.keras.Sequential([
	tf.keras.layers.StringLookup(vocabulary=unique_movie_titles, mask_token=None),
	tf.keras.layers.Embedding(len(unique_movie_titles) + 1, embedding_dimension),
	tf.keras.layers.BatchNormalization()
])

In our training data we have positive (user, movie) pairs. To figure out how good our model is, we need to compare the affinity score that the model calculates for this pair to the scores of all the other possible candidates: if the score for the positive pair is higher than for all other candidates, our model is highly accurate.
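The comparison described above amounts to checking whether the positive candidate lands among the K highest-scoring candidates. Here is a minimal NumPy sketch with made-up scores. The `top_k_hit` helper is illustrative and is not part of TensorFlow Recommenders.

```python
import numpy as np

def top_k_hit(scores, positive_index, k):
    """Return True if the positive candidate is among the k highest scores."""
    top_k = np.argsort(-scores)[:k]
    return positive_index in top_k

# Hypothetical affinity scores of one user against five candidates.
scores = np.array([0.1, 2.3, -0.7, 1.9, 0.4])
positive_index = 3  # the movie the user actually watched

hit_at_1 = top_k_hit(scores, positive_index, k=1)
hit_at_2 = top_k_hit(scores, positive_index, k=2)
print(hit_at_1, hit_at_2)
```

Averaging such hits over all evaluation pairs gives a top-K accuracy of the kind reported later in this notebook.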

In [14]:
# Set metric
metrics = tfrs.metrics.FactorizedTopK(candidates=movies.batch(256).map(movie_model))

In [15]:
# Set objective
retrieval_task = tfrs.tasks.Retrieval(metrics=metrics)

The task itself is a Keras layer that takes the query and candidate embeddings as arguments, and returns the computed loss: we'll use that to implement the model's training loop.
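To give some intuition for the loss, the sketch below implements an in-batch softmax cross-entropy in NumPy: every other candidate in the batch acts as a negative for a given query. To my understanding this is close to the default behaviour of `tfrs.tasks.Retrieval`, but treat it as an approximation rather than the library's exact implementation.

```python
import numpy as np

def in_batch_softmax_loss(query_emb, candidate_emb):
    """Summed cross-entropy where row i's positive is candidate i."""
    # scores[i, j]: affinity of query i with candidate j in the batch.
    scores = query_emb @ candidate_emb.T
    # Numerically stable log-softmax over each row.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # The diagonal holds the log-probabilities of the positive pairs.
    return -np.diag(log_probs).sum()

queries = 5.0 * np.eye(3)
aligned_loss = in_batch_softmax_loss(queries, 5.0 * np.eye(3))
# Rolling the rows pairs every query with the wrong candidate.
shuffled_loss = in_batch_softmax_loss(queries, np.roll(5.0 * np.eye(3), 1, axis=0))
print(aligned_loss, shuffled_loss)
```

The loss is small when each query scores its own candidate highest and large when the pairs are mismatched.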

In [16]:
# Combine the candidate and user model to build the complete retrieval model
class MovielensModel(tfrs.Model):
	def __init__(self, user_model, movie_model, retrieval_task):
		super().__init__()
		self.movie_model: tf.keras.Model = movie_model
		self.user_model: tf.keras.Model = user_model
		self.task: tf.keras.layers.Layer = retrieval_task
	
	def compute_loss(self, features, training=False):
		user_embeddings = self.user_model(features["user_id"])
		positive_movie_embeddings = self.movie_model(features["movie_title"])
		return self.task(user_embeddings, positive_movie_embeddings)

The tfrs.Model base class is simply a convenience class: it allows us to compute both training and test losses using the same method.

## Learn and evaluate model

In [17]:
# Get combined model
main_model = MovielensModel(user_model, movie_model, retrieval_task)

# Compile model
main_model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

Now shuffle, batch and cache training and evaluation data.

In [18]:
cached_train = train.shuffle(100_000).batch(4096).cache()
cached_test = test.batch(4096).cache()

In [19]:
# Train model
main_model.fit(
	cached_train, epochs=5, validation_data=cached_test
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fcc4041f070>

In [20]:
# Evaluate model
main_model.evaluate(cached_test, return_dict=True)



{'factorized_top_k/top_1_categorical_accuracy': 0.0002500000118743628,
 'factorized_top_k/top_5_categorical_accuracy': 0.002199999988079071,
 'factorized_top_k/top_10_categorical_accuracy': 0.007000000216066837,
 'factorized_top_k/top_50_categorical_accuracy': 0.07885000109672546,
 'factorized_top_k/top_100_categorical_accuracy': 0.17739999294281006,
 'loss': 28888.240234375,
 'regularization_loss': 0,
 'total_loss': 28888.240234375}

Test set performance is much worse than training performance and starts degrading right after the first epoch. Two factors contribute:

1. Our model is likely to perform better on data it has seen, simply because it can memorize it. This overfitting is especially strong when models have many parameters. It can be mitigated by model regularization and by using user and movie features that help the model generalize better to unseen data.
2. The model is re-recommending some of the movies users have already watched. These known-positive watches can crowd test movies out of the top-K recommendations.

The second phenomenon can be tackled by excluding previously seen movies from test recommendations. This approach is relatively common in the recommender systems literature, but we don't follow it in these tutorials. If not recommending past watches is important, we should expect appropriately specified models to learn this behaviour automatically from past user history and contextual information.
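For reference, excluding previously seen movies can be done as a post-processing step on the ranked list. The sketch below uses made-up scores and a hypothetical `recommend_unseen` helper. This notebook does not apply the filter.

```python
import numpy as np

def recommend_unseen(scores, titles, seen_titles, k):
    """Return the k best-scoring titles the user has not already watched."""
    ranked = [titles[i] for i in np.argsort(-scores)]
    return [title for title in ranked if title not in seen_titles][:k]

# Hypothetical candidates, scores, and watch history for one user.
titles = ["A", "B", "C", "D"]
scores = np.array([0.9, 0.8, 0.3, 0.5])
seen = {"A"}

top2 = recommend_unseen(scores, titles, seen, k=2)
print(top2)
```

Because filtering happens after ranking, it is usually necessary to retrieve more than K candidates first so that K remain once watched titles are dropped.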

## Make predictions

In [21]:
# Generate index
index = tfrs.layers.factorized_top_k.BruteForce(main_model.user_model)
index.index_from_dataset(movies.batch(100).map(lambda title: (title, main_model.movie_model(title))))

<tensorflow_recommenders.layers.factorized_top_k.BruteForce at 0x7fcc4047b6a0>

In [22]:
# Get recommendations.
_, titles = index(tf.constant(["42"]))
print(f"Recommendations for user 42: {titles[0, :3]}")

Recommendations for user 42: [b'Before and After (1996)' b'Jack (1996)'
 b'All Dogs Go to Heaven 2 (1996)']


In [23]:
# Get some recommendations
i, titles = index(np.array(["939"]))
print("Top 5 recommendations:", titles[0, :5])

Top 5 recommendations: tf.Tensor(
[b'For Richer or Poorer (1997)' b'That Old Feeling (1997)'
 b'Flubber (1997)' b'Eye for an Eye (1996)' b"Preacher's Wife, The (1996)"], shape=(5,), dtype=string)


In this notebook, we built a user-to-movie model. However, for some applications (for example, product detail pages) it's common to perform item-to-item (for example, movie-to-movie or product-to-product) recommendations.
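One simple way to approximate item-to-item recommendations, assuming candidate embeddings are already available, is a nearest-neighbour search among the item vectors. The NumPy sketch below uses cosine similarity on made-up embeddings. This is an illustrative shortcut and not necessarily how a dedicated item-to-item model would be trained.

```python
import numpy as np

def most_similar_items(item_embeddings, item_index, k):
    """Return indices of the k items closest to item_index by cosine similarity."""
    norms = np.linalg.norm(item_embeddings, axis=1, keepdims=True)
    normalized = item_embeddings / norms
    similarity = normalized @ normalized[item_index]
    # Exclude the query item itself from its own neighbours.
    similarity[item_index] = -np.inf
    return np.argsort(-similarity)[:k]

# Hypothetical 2-dimensional item embeddings.
embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

neighbours = most_similar_items(embeddings, item_index=0, k=2)
print(neighbours)
```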