Real-world recommender systems typically consist of two distinct stages:

1. **Retrieval Stage:**
   The primary role of the retrieval stage is to efficiently select an initial set of hundreds of candidates from a vast pool of possibilities. Its fundamental objective is to swiftly eliminate candidates that are unlikely to interest the user. Given the potential involvement with millions of candidates, the retrieval model must prioritize computational efficiency.

2. **Ranking Stage:**
   Following the retrieval stage, the ranking stage refines the outputs from the retrieval model to pinpoint the best possible recommendations. This stage is dedicated to narrowing down the set of items the user might find interesting, presenting a concise list of highly probable candidates.

Retrieval models often consist of two integral sub-models:

1. **Query Model:**
   Responsible for computing the query representation, usually manifested as a fixed-dimensionality embedding vector, utilizing relevant query features.

2. **Candidate Model:**
   Tasked with computing the candidate representation, also in the form of an equally-sized vector, using the respective candidate features.

The outputs of these two models are then combined, typically by taking the dot product of the two embedding vectors, to produce a query-candidate affinity score. Higher scores indicate a stronger match between the candidate and the query, which helps select more personalized and relevant recommendations.
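
The affinity score described above is commonly computed as a dot product of the query and candidate embeddings. The following is a minimal sketch in plain Python, assuming made-up 4-dimensional embeddings; the `affinity` helper and all values are hypothetical and only illustrate the idea.

```python
def affinity(query_embedding, candidate_embedding):
    """Dot product of two equally sized embedding vectors."""
    if len(query_embedding) != len(candidate_embedding):
        raise ValueError("embeddings must have the same dimensionality")
    return sum(q * c for q, c in zip(query_embedding, candidate_embedding))


# Hypothetical embeddings for one user and two books.
user = [0.5, -1.0, 0.0, 2.0]
book_a = [1.0, -1.0, 0.5, 1.0]    # points in a similar direction to the user
book_b = [-1.0, 1.0, 0.0, -0.5]   # points away from the user

score_a = affinity(user, book_a)
score_b = affinity(user, book_b)
print(score_a, score_b)
```

The book with the higher score would be ranked ahead of the other for this user.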




## Imports and Environment Setup

In [112]:
# !pip install tensorflow_datasets==4.9.2 --upgrade
# !pip install --force-reinstall -v protobuf==3.20.3
# !pip install pandas numpy tensorflow tensorflow-recommenders



In [110]:
import os
from typing import Dict, Text

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_recommenders as tfrs
import tensorflow_datasets as tfds

In [71]:
df = pd.read_csv("Preprocessed_data.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,user_id,location,age,isbn,rating,book_title,book_author,year_of_publication,publisher,img_s,img_m,img_l,Summary,Language,Category,city,state,country
0,0,2,"stockton, california, usa",18.0,195153448,0,Classical Mythology,Mark P. O. Morford,2002.0,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,Provides an introduction to classical myths pl...,en,['Social Science'],stockton,california,usa
1,1,8,"timmins, ontario, canada",34.7439,2005018,5,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],timmins,ontario,canada
2,2,11400,"ottawa, ontario, canada",49.0,2005018,0,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],ottawa,ontario,canada
3,3,11676,"n/a, n/a, n/a",34.7439,2005018,8,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],,,
4,4,41385,"sudbury, ontario, canada",34.7439,2005018,0,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],sudbury,ontario,canada


## Data Preprocessing

In [73]:
filtered_data = df[["user_id", "book_title"]].astype({"user_id": np.str_, "book_title": np.str_})

In [74]:
ratings_dataset = tf.data.Dataset.from_tensor_slices((tf.cast(filtered_data['user_id'], tf.string), \
                                                      tf.cast(filtered_data['book_title'], tf.string)))

We keep only the `user_id` and `book_title` fields and expose each rating as a dictionary of features:

In [164]:
ratings = ratings_dataset.map(lambda x0, x1: {
    "user_id": x0,
    "book_title": x1,
})

books = ratings_dataset.map(lambda _, title: title)

for x in ratings.take(3).as_numpy_iterator():
    print(x)

{'user_id': b'2', 'book_title': b'Classical Mythology'}
{'user_id': b'8', 'book_title': b'Clara Callan'}
{'user_id': b'11400', 'book_title': b'Clara Callan'}


To facilitate model fitting and evaluation, it is essential to partition the dataset into distinct training and evaluation sets. In a real-world recommender system, this segregation is commonly based on time, where data up to a specific time point \(T\) is utilized for predicting interactions occurring after \(T\).
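
To make the time-based split concrete, here is a minimal sketch in plain Python. It assumes each interaction carries a `timestamp` field, which the dataset used in this notebook does not have; the `Interaction` type, the timestamps, and the cutoff are all hypothetical.

```python
from typing import NamedTuple


class Interaction(NamedTuple):
    user_id: str
    book_title: str
    timestamp: int  # hypothetical field; not present in the dataset used here


def split_by_time(interactions, cutoff):
    """Interactions at or before `cutoff` go to train, later ones to test."""
    train = [x for x in interactions if x.timestamp <= cutoff]
    test = [x for x in interactions if x.timestamp > cutoff]
    return train, test


# Made-up interaction log, for illustration only.
log = [
    Interaction("2", "Classical Mythology", 100),
    Interaction("8", "Clara Callan", 250),
    Interaction("11400", "Clara Callan", 400),
    Interaction("8", "Classical Mythology", 520),
]

train_log, test_log = split_by_time(log, cutoff=300)
print(len(train_log), len(test_log))
```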

In this simple illustration, however, we use a random split: after shuffling, the first 80,000 ratings go to the training set and the next 20,000 to the test set, which gives an 80/20 split over the 100,000 ratings drawn.

In [165]:
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

Let's also identify the unique user IDs and book titles present in the data. This is important because we need to be able to map the raw values of our categorical features to embedding vectors in our models. To do that, we need a vocabulary that maps a raw feature value to an integer in a contiguous range. This allows us to look up the corresponding embeddings in our embedding tables.

In [77]:
book_titles = books.batch(1_000)
user_ids = ratings.batch(1_000_000).map(lambda x: x["user_id"])
unique_book_titles = np.unique(np.concatenate(list(book_titles)))
unique_user_ids = np.unique(np.concatenate(list(user_ids)))

unique_book_titles[:10]

array([b' A Light in the Storm: The Civil War Diary of Amelia Martin, Fenwick Island, Delaware, 1861 (Dear America)',
       b' Always Have Popsicles',
       b" Apple Magic (The Collector's series)",
       b' Ask Lily (Young Women of Faith: Lily Series, Book 5)',
       b' Beyond IBM: Leadership Marketing and Finance for the 1990s',
       b' Clifford Visita El Hospital (Clifford El Gran Perro Colorado)',
       b' Dark Justice', b' Deceived',
       b' Earth Prayers From around the World: 365 Prayers, Poems, and Invocations for Honoring the Earth',
       b' Final Fantasy Anthology: Official Strategy Guide (Brady Games)'],
      dtype=object)
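
The role of the vocabulary can be sketched in plain Python. This is an illustration rather than what the Keras layers do internally: `build_vocabulary` and `lookup` are hypothetical helpers, and reserving index 0 for unknown values is an assumption chosen to match the "+ 1" used later when sizing the embedding tables.

```python
def build_vocabulary(values):
    """Map each distinct raw value to an integer in 1..N; 0 is reserved for unknowns."""
    return {value: index for index, value in enumerate(sorted(set(values)), start=1)}


def lookup(vocabulary, value, oov_index=0):
    """Return the integer id for `value`, or `oov_index` if it was never seen."""
    return vocabulary.get(value, oov_index)


raw_user_ids = ["2", "8", "11400", "8", "2"]
vocab = build_vocabulary(raw_user_ids)

print(vocab)
print(lookup(vocab, "8"), lookup(vocab, "999"))
```

Because the IDs are strings, they are sorted lexicographically, so `"11400"` comes before `"2"`.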

## Implementing a model

Selecting the architecture for our model is a critical aspect of the modeling process.

Given that we are constructing a two-tower retrieval model, we have the flexibility to build each tower independently and subsequently integrate them into the final model.

### The query tower
We'll begin by establishing the query tower.

The initial step involves determining the dimensionality of the query and candidate representations:

In [78]:
embedding_dimension = 32

Choosing higher values for the dimensionality may lead to models that are potentially more accurate, but they might also require more time for fitting and could be more susceptible to overfitting.
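
One way to see the cost side of this trade-off is to count the weights in an embedding table, which grow linearly with the dimensionality. The sketch below uses a hypothetical vocabulary size of 100,000; the helper name and the numbers are for illustration only.

```python
def embedding_parameters(vocabulary_size, embedding_dimension, oov_slots=1):
    """Number of trainable weights in a table of shape (vocab + oov, dim)."""
    return (vocabulary_size + oov_slots) * embedding_dimension


# Hypothetical vocabulary size, chosen only for illustration.
vocabulary_size = 100_000
for dimension in (32, 64, 128):
    print(dimension, embedding_parameters(vocabulary_size, dimension))
```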

The next step is to define the model. In this context, we will utilize Keras preprocessing layers. Initially, we'll convert user IDs to integers and then transform them into user embeddings using an `Embedding` layer. It's noteworthy that we employ the list of unique user IDs obtained earlier as a vocabulary:

In [79]:
user_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(
      vocabulary=unique_user_ids, mask_token=None),
  # We add an additional embedding to account for unknown tokens.
  tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dimension)
])
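
Conceptually, the two layers above perform a string-to-integer lookup followed by a row lookup in a weight table. The plain-Python sketch below mimics that flow under simplifying assumptions: the table is randomly initialised with arbitrary small values, nothing is trained, and `make_embedding_table` and `embed` are hypothetical helpers rather than Keras APIs.

```python
import random


def make_embedding_table(num_rows, dimension, seed=0):
    """Randomly initialised table: one row of `dimension` floats per integer id."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dimension)]
            for _ in range(num_rows)]


def embed(raw_id, vocabulary, table):
    """Raw string id -> integer id (0 if unknown) -> embedding row."""
    return table[vocabulary.get(raw_id, 0)]


vocabulary = {"11400": 1, "2": 2, "8": 3}  # index 0 is reserved for unknown ids
# One extra row so that unknown ids also have an embedding.
table = make_embedding_table(num_rows=len(vocabulary) + 1, dimension=4)

known = embed("8", vocabulary, table)
unknown = embed("not-a-user", vocabulary, table)
print(len(table), len(known), unknown is table[0])
```

The extra row is the reason the `Embedding` layer is sized with `len(unique_user_ids) + 1`.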

### The candidate tower

We can do the same with the candidate tower.

In [80]:
book_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(
      vocabulary=unique_book_titles, mask_token=None),
  tf.keras.layers.Embedding(len(unique_book_titles) + 1, embedding_dimension)
])

### Metrics

Within our training data, we possess positive (user, book) pairs. Evaluating our model's performance entails comparing the affinity score calculated by the model for this pair against the scores for all other potential candidates. If the score for the positive pair surpasses that of all other candidates, our model is deemed highly accurate.

To facilitate this evaluation, we can employ the `tfrs.metrics.FactorizedTopK` metric. This metric necessitates one essential argument: the dataset of candidates employed as implicit negatives for evaluation.

In our scenario, this corresponds to the `books` dataset, transformed into embeddings via our book model:

In [81]:
metrics = tfrs.metrics.FactorizedTopK(
  candidates=books.batch(128).map(book_model)
)
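
To make the idea behind a top-K metric concrete, the sketch below scores each query against every candidate and checks whether the positive candidate lands in the top K. It is a simplified plain-Python illustration with made-up 2-dimensional embeddings, not the `FactorizedTopK` implementation; the helper names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k_hit(query, positive_index, candidates, k):
    """True if the positive candidate is among the k highest-scoring candidates."""
    scores = [dot(query, candidate) for candidate in candidates]
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return positive_index in ranked[:k]


def top_k_accuracy(queries, positive_indices, candidates, k):
    """Fraction of (query, positive) pairs whose positive lands in the top k."""
    hits = [top_k_hit(q, p, candidates, k) for q, p in zip(queries, positive_indices)]
    return sum(hits) / len(hits)


# Hypothetical 2-dimensional embeddings.
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
queries = [[2.0, 0.5], [0.1, 1.0]]
positive_indices = [0, 2]  # the book each user actually interacted with

top_1 = top_k_accuracy(queries, positive_indices, candidates, k=1)
top_2 = top_k_accuracy(queries, positive_indices, candidates, k=2)
print(top_1, top_2)
```

Here the first user's positive book is only the second-best match, so it counts as a hit for K = 2 but not for K = 1.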

### Loss

The subsequent component is the loss employed for training our model. TFRS provides various loss layers and tasks to simplify this process.

In this case, we will utilize the `Retrieval` task object—a convenient wrapper that combines the loss function and metric computation:

In [82]:
task = tfrs.tasks.Retrieval(
  metrics=metrics
)

The task itself serves as a Keras layer, accepting the query and candidate embeddings as arguments and producing the computed loss. We'll leverage this task layer to implement the training loop for our model.
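
As far as I understand, the default loss of the `Retrieval` task is a softmax cross-entropy in which the other candidates in the same batch act as negatives; treat that as an assumption and check the TFRS documentation before relying on it. Under that assumption, the computation can be sketched in plain Python as follows (the function name and the toy embeddings are hypothetical):

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def in_batch_softmax_loss(query_embeddings, candidate_embeddings):
    """Summed cross-entropy where the correct candidate for row i is column i."""
    total = 0.0
    for i, query in enumerate(query_embeddings):
        logits = [dot(query, candidate) for candidate in candidate_embeddings]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_normaliser = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_normaliser - logits[i]
    return total


users = [[1.0, 0.0], [0.0, 1.0]]
aligned_books = [[3.0, 0.0], [0.0, 3.0]]   # each user matches their own book
swapped_books = [[0.0, 3.0], [3.0, 0.0]]   # each user matches the other book

aligned_loss = in_batch_softmax_loss(users, aligned_books)
swapped_loss = in_batch_softmax_loss(users, swapped_books)
print(aligned_loss, swapped_loss)
```

The loss is small when each user's embedding scores highest against their own positive book and large when another book in the batch scores higher.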

### The full model

We can now integrate all the components into a model. TFRS provides a base model class (`tfrs.models.Model`) that simplifies the model-building process: we just need to configure the components in the `__init__` method and implement the `compute_loss` method, which takes in the raw features and returns a loss value.

The base model will handle the creation of the appropriate training loop to fit our model.

In [83]:
class BookModel(tfrs.Model):

    def __init__(self, user_model, book_model):
        super().__init__()
        self.book_model: tf.keras.Model = book_model
        self.user_model: tf.keras.Model = user_model
        self.task: tf.keras.layers.Layer = task

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        # We pick out the user features and pass them into the user model.
        user_embeddings = self.user_model(features["user_id"])
        # And pick out the book features and pass them into the book model,
        # getting embeddings back.
        positive_book_embeddings = self.book_model(features["book_title"])

        # The task computes the loss and the metrics.
        return self.task(user_embeddings, positive_book_embeddings)

The `tfrs.Model` base class is simply a convenience class: it allows us to compute both training and test losses using the same method.

## Fitting and evaluating

After defining the model, we can use standard Keras fitting and evaluation routines to fit and evaluate the model.

Let's first instantiate the model.

In [84]:
model = BookModel(user_model, book_model)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

Then shuffle, batch, and cache the training and evaluation data.

In [85]:
cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()

Then train the model:

In [86]:
model.fit(cached_train, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x1e0c98dae90>

Finally, we can evaluate our model on the test set:

In [None]:
model.evaluate(cached_test, return_dict=True)

Performance on the test set is typically lower than on the training set, for two main reasons:

1. **Overfitting:** The model tends to perform better on data it saw during training, which it can partly memorize. This effect is stronger in models with many parameters. Regularization and the use of additional user and book features can mitigate it and help the model generalize to unseen data.

2. **Re-recommending previously read books:** The model may recommend books a user has already read, and these known positives can crowd the held-out test books out of the top K recommendations. Although it is common practice to exclude past interactions from test recommendations, we do not do so in this tutorial. If avoiding such recommendations matters, an appropriately specified model should learn that behavior from the user's history and contextual information. Moreover, recommending the same item more than once is often appropriate, for example for evergreen TV series or regularly purchased items.
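
If excluding past interactions is desired, one simple option is to post-filter the retrieved list. The sketch below is a plain-Python illustration, not part of this notebook's pipeline; `filter_seen` is a hypothetical helper and the reading history is made up.

```python
def filter_seen(ranked_titles, seen_titles, k):
    """Drop titles the user has already interacted with, then keep the first k."""
    seen = set(seen_titles)
    return [title for title in ranked_titles if title not in seen][:k]


# Hypothetical ranked output and reading history.
ranked = ["Angels & Demons", "Dead Aim", "The Lake House", "Toxin", "Icebound"]
history = ["Dead Aim", "Toxin"]

recommendations = filter_seen(ranked, history, k=3)
print(recommendations)
```

In practice one would retrieve more than K candidates up front so that enough remain after filtering.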

## Making predictions

Now that we have a model, we would like to be able to make predictions. We can use the `tfrs.layers.factorized_top_k.BruteForce` layer to do this.

In [109]:
# Get the unique set of book titles to use as retrieval candidates.
batched_unique_books = tf.data.Dataset.from_tensor_slices(unique_book_titles)

# Create a model that takes in raw query features and
# recommends books out of the entire set of unique titles.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index_from_dataset(
  tf.data.Dataset.zip((batched_unique_books.batch(100), batched_unique_books.batch(100).map(model.book_model)))
)

# Get recommendations for a single user.
user_id = "10"
_, titles = index(tf.constant([user_id]))
print(f"Recommendations for user {user_id}: {titles[0, :10]}")

Recommendations for user 10: [b'Angels & Demons'
 b'Cruel & Unusual (Kay Scarpetta Mysteries (Paperback))'
 b'Harry Potter and the Chamber of Secrets (Book 2)' b'Dead Aim'
 b'The Lake House' b'Jack & Jill (Alex Cross Novels)' b'Toxin'
 b'Along Came a Spider (Alex Cross Novels)' b'When the Wind Blows'
 b'Icebound']


### Export the model

In [162]:
# Export the query model.
path = "saved_index"
tf.saved_model.save(index, path)

# Load it back;
loaded = tf.saved_model.load(path)

# Pass a user id in, get top predicted book titles back.
scores, titles = loaded(["100"])

for x in titles[0][:3]:
    title_string = x.numpy().decode("utf-8")
    print(f"Recommended Book: {title_string}")
