Our recommender system will generate recommendations in the following way:
1. Generate a coarse set of candidate items (e.g. 100).
2. Filter out bad candidate items (e.g. items the user has already bought).
3. Rank candidate items.
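
As a rough illustration of how these three stages fit together, here is a plain-Python sketch. The function `recommend`, the toy scores, and the item IDs are all hypothetical; in the real system the retrieval and ranking scores would come from trained models.

```python
def recommend(retrieval_scores, already_bought, ranking_scores,
              n_candidates=100, n_final=10):
    # 1. Candidate generation: keep the items with the highest retrieval score.
    candidates = sorted(retrieval_scores, key=retrieval_scores.get,
                        reverse=True)[:n_candidates]

    # 2. Filtering: drop items the user has already bought.
    candidates = [item for item in candidates if item not in already_bought]

    # 3. Ranking: order the remaining candidates by the ranking score.
    ranked = sorted(candidates, key=lambda item: ranking_scores[item],
                    reverse=True)
    return ranked[:n_final]


# Toy inputs, for illustration only.
retrieval_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
ranking_scores = {"a": 0.2, "b": 0.5, "c": 0.9, "d": 0.99}

result = recommend(retrieval_scores, already_bought={"b"},
                   ranking_scores=ranking_scores, n_candidates=3, n_final=2)
print(result)  # ['c', 'a']
```

Note that item `d` has the best ranking score but never reaches the ranking stage, because it did not survive candidate generation.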

In this notebook, we will build a model for the first step. Our model will be based on the two-tower architecture, which trains query (user) embeddings to be close to candidate (item) embeddings in a shared space. The idea is that the embedding of a user should be close to all the embeddings of items the user has previously bought.

Let's go ahead and load the data.

In [1]:
import pandas as pd

dtype = {"user_id": object, "item_id": object}

train_df = pd.read_csv("train_df.csv", dtype=dtype)
val_df = pd.read_csv("val_df.csv", dtype=dtype)


We will train our retrieval model with a subset of features.

For the query embedding we will use:
- `user_id`: ID of the customer.
- `age`: age of the customer at the time of purchase.
- `month_sin`, `month_cos`: time of year the purchase was made.
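
The `month_sin` and `month_cos` columns are already present in the CSV files. The sketch below shows one common way such cyclical features are derived, assuming months are numbered 1 to 12. The helper `encode_month` is hypothetical, and the preprocessing actually used for this dataset may differ.

```python
import math


def encode_month(month):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * math.pi * (month - 1) / 12
    return math.sin(angle), math.cos(angle)


def distance(p, q):
    return math.dist(p, q)


dec, jan, jul = encode_month(12), encode_month(1), encode_month(7)

# December and January end up next to each other, while July sits on the
# opposite side of the circle.
print(distance(dec, jan) < distance(jan, jul))  # True
```

Using two coordinates instead of the raw month number avoids an artificial jump between December (12) and January (1).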

For the candidate embedding we will use:
- `item_id`: ID of the item.
- `garment_group_name`: type of garment.
- `index_group_name`: menswear/ladieswear etc.

In [2]:
retrieval_features = ["user_id", "item_id", "age", "month_sin",
                      "month_cos", "garment_group_name", "index_group_name"]
train_df = train_df[retrieval_features]
val_df = val_df[retrieval_features]

We need lists of user and item IDs to initialize our embeddings.
The garment group and index group lists serve as vocabularies for the one-hot encodings (this will be fixed in Hopsworks later).

In [3]:
user_id_list = train_df["user_id"].unique().tolist()
item_id_list = train_df["item_id"].unique().tolist()
garment_group_list = train_df["garment_group_name"].unique().tolist()
index_group_list = train_df["index_group_name"].unique().tolist()

print(f"Number of transactions: {len(train_df):,}")
print(f"Number of users: {len(user_id_list):,}")
print(f"Number of items: {len(item_id_list):,}")

Number of transactions: 672,157
Number of users: 32,353
Number of items: 64,267


In [4]:
import tensorflow as tf

def df_to_ds(df):
    return tf.data.Dataset.from_tensor_slices({col : df[col] for col in df})

BATCH_SIZE = 2048
# Shuffle individual rows before batching, so that batches are composed differently each epoch.
ds_train = df_to_ds(train_df).cache().shuffle(BATCH_SIZE*10).batch(BATCH_SIZE)
ds_val = df_to_ds(val_df).batch(BATCH_SIZE).cache()



In [5]:
# Just a check that it works.
# list(ds_train.take(1).as_numpy_iterator())

Next we specify the dimensionality of our embeddings. Here we choose a relatively small dimensionality to prevent overfitting on the training data.

In [6]:
EMBEDDING_DIMENSION = 16
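
To get a feel for what this choice means in terms of model size, the arithmetic below counts the parameters of the two ID embedding tables, using the user and item counts printed earlier. It assumes one extra row per table for unknown IDs and covers only the embedding tables, not the dense layers.

```python
EMBEDDING_DIMENSION = 16
n_users, n_items = 32_353, 64_267  # counts printed earlier in this notebook

# Each table has one row per known ID plus one row for unknown IDs.
user_table_params = (n_users + 1) * EMBEDDING_DIMENSION
item_table_params = (n_items + 1) * EMBEDDING_DIMENSION

print(f"{user_table_params:,}")  # 517,664
print(f"{item_table_params:,}")  # 1,028,288
```

Doubling the dimensionality doubles both numbers, which is one reason a small value helps limit overfitting when many IDs have only a few transactions.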

Now we will create the actual models. We will create three models:
- Query model: generates a query representation given user and transaction features.
- Candidate model: generates an item representation given item features.
- Two-tower model: trains the query and candidate models jointly.

In [7]:
import tensorflow as tf

class UserTower(tf.keras.Model):

    def __init__(self):
        super().__init__()

        self.user_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=user_id_list,
                mask_token=None
            ),
            tf.keras.layers.Embedding(
                # We add an additional embedding to account for unknown tokens.
                len(user_id_list) + 1,
                EMBEDDING_DIMENSION
            )
        ])
        
        self.normalized_age = tf.keras.layers.Normalization(axis=None)

        self.fnn = tf.keras.Sequential([
            tf.keras.layers.Dense(EMBEDDING_DIMENSION, activation="relu"),
            tf.keras.layers.Dense(EMBEDDING_DIMENSION)
        ])

    def call(self, inputs):
        concatenated_inputs = tf.concat([
            self.user_embedding(inputs["user_id"]),
            tf.reshape(self.normalized_age(inputs["age"]), (-1,1)),
            tf.reshape(inputs["month_sin"], (-1,1)),
            tf.reshape(inputs["month_cos"], (-1,1))
        ], axis=1)

        outputs = self.fnn(concatenated_inputs)

        return outputs


user_model = UserTower()
user_model.normalized_age.adapt(ds_train.map(lambda x : x["age"]))
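
With `axis=None`, the `Normalization` layer learns a single mean and variance over all age values during `adapt` and then standardizes its input with them. The plain-Python sketch below mimics that computation on made-up ages; it leaves out the small epsilon Keras uses to guard against division by zero.

```python
import math

ages = [20.0, 30.0, 40.0]  # made-up values, for illustration only

# Statistics that adapt() would estimate from the training data.
mean = sum(ages) / len(ages)
variance = sum((a - mean) ** 2 for a in ages) / len(ages)  # population variance

# What the layer applies at call time.
normalized = [(a - mean) / math.sqrt(variance) for a in ages]
print(normalized)
```

The normalized ages have zero mean and unit variance, which puts them on a scale comparable to the `month_sin` and `month_cos` inputs.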

In [8]:
class ItemTower(tf.keras.Model):

    def __init__(self):
        super().__init__()

        self.item_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=item_id_list,
                mask_token=None
            ),
            tf.keras.layers.Embedding(
                # We add an additional embedding to account for unknown tokens.
                len(item_id_list) + 1,
                EMBEDDING_DIMENSION
            )
        ])

        self.garment_group_tokenizer = tf.keras.layers.StringLookup(vocabulary=garment_group_list, mask_token=None)
        self.index_group_tokenizer = tf.keras.layers.StringLookup(vocabulary=index_group_list, mask_token=None)

        self.fnn = tf.keras.Sequential([
            tf.keras.layers.Dense(EMBEDDING_DIMENSION, activation="relu"),
            tf.keras.layers.Dense(EMBEDDING_DIMENSION)
        ])

    def call(self, inputs):
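        # StringLookup reserves index 0 for unknown values, so the one-hot
        # depth must be the vocabulary size plus one.
        garment_group_embedding = tf.one_hot(
            self.garment_group_tokenizer(inputs["garment_group_name"]),
            len(garment_group_list) + 1
        )

        index_group_embedding = tf.one_hot(
            self.index_group_tokenizer(inputs["index_group_name"]),
            len(index_group_list) + 1
        )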

        concatenated_inputs = tf.concat([
            self.item_embedding(inputs["item_id"]),
            garment_group_embedding,
            index_group_embedding
        ], axis=1)

        outputs = self.fnn(concatenated_inputs)

        return outputs


item_model = ItemTower()
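
The interplay between the lookup layers and the one-hot encoding is easy to get wrong, so here is a plain-Python imitation. It assumes the default `StringLookup` behaviour with `mask_token=None`: index 0 is reserved for unknown strings and the vocabulary occupies indices 1 through `len(vocabulary)`. The helpers `lookup` and `one_hot` and the toy vocabulary are hypothetical stand-ins for the TensorFlow operations.

```python
vocabulary = ["Ladieswear", "Menswear", "Sport"]  # toy vocabulary


def lookup(value):
    """Imitates StringLookup: 0 for unknown values, 1..len(vocabulary) otherwise."""
    return vocabulary.index(value) + 1 if value in vocabulary else 0


def one_hot(index, depth):
    """Imitates tf.one_hot: all zeros if index falls outside [0, depth)."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


depth = len(vocabulary) + 1  # one slot per vocabulary entry plus the unknown slot

print(one_hot(lookup("Sport"), depth))    # [0.0, 0.0, 0.0, 1.0]
print(one_hot(lookup("Unknown"), depth))  # [1.0, 0.0, 0.0, 0.0]
```

With a depth of only `len(vocabulary)`, the last vocabulary entry would fall outside the one-hot range and be encoded as all zeros.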

In [9]:
import tensorflow_recommenders as tfrs

item_df = train_df[["item_id", "garment_group_name", "index_group_name"]].drop_duplicates(subset="item_id")

# Convert the deduplicated item table to a dataset.
item_ds = df_to_ds(item_df)

In [10]:
# TODO change variable name of user, item model to query, candidate model.

class TwoTowerModel(tf.keras.Model):
    def __init__(self, user_model, item_model):
        super().__init__()
        self.user_model = user_model
        self.item_model = item_model
        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=item_ds.batch(BATCH_SIZE).map(self.item_model)
            )
        )

    def train_step(self, batch) -> tf.Tensor:
        # Set up a gradient tape to record gradients.
        with tf.GradientTape() as tape:

            # Loss computation.
            user_embeddings = self.user_model(batch)
            item_embeddings = self.item_model(batch)
            loss = self.task(user_embeddings, item_embeddings,
                             compute_metrics=False)

            # Handle regularization losses as well.
            regularization_loss = sum(self.losses)

            total_loss = loss + regularization_loss

        gradients = tape.gradient(total_loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))

        metrics = {
            "loss": loss,
            "regularization_loss": regularization_loss,
            "total_loss": total_loss
        }

        return metrics

    def test_step(self, batch) -> tf.Tensor:
        # Loss computation.
        user_embeddings = self.user_model(batch)
        item_embeddings = self.item_model(batch)

        loss = self.task(user_embeddings, item_embeddings,
                         compute_metrics=True)

        # Handle regularization losses as well.
        regularization_loss = sum(self.losses)

        total_loss = loss + regularization_loss

        metrics = {metric.name: metric.result() for metric in self.metrics}
        # metrics = {}
        metrics["loss"] = loss
        metrics["regularization_loss"] = regularization_loss
        metrics["total_loss"] = total_loss

        return metrics
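
The `tfrs.tasks.Retrieval` task scores every query in a batch against every candidate in the same batch and treats the matching pair as the correct class, so the other items in the batch act as negatives. The sketch below reproduces that idea in plain Python with made-up two-dimensional embeddings. It is a simplified illustration rather than the library's implementation, and `in_batch_softmax_loss` is a hypothetical helper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def in_batch_softmax_loss(query_embeddings, candidate_embeddings):
    """Sum of softmax cross-entropy losses where query i's positive is candidate i."""
    total = 0.0
    for i, query in enumerate(query_embeddings):
        # Score this query against every candidate in the batch.
        logits = [dot(query, candidate) for candidate in candidate_embeddings]
        log_normalizer = math.log(sum(math.exp(logit) for logit in logits))
        # Cross-entropy with the i-th candidate as the correct class.
        total += log_normalizer - logits[i]
    return total


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_softmax_loss(queries, [[2.0, 0.0], [0.0, 2.0]])
swapped = in_batch_softmax_loss(queries, [[0.0, 2.0], [2.0, 0.0]])

print(aligned < swapped)  # True
```

The loss is low when each query embedding points towards its own item and away from the other items in the batch, which is what pulls user and item embeddings together in the shared space.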


In [11]:
import tensorflow_addons as tfa

model = TwoTowerModel(user_model, item_model)
# optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.01)
optimizer = tfa.optimizers.AdamW(weight_decay=0.001, learning_rate=0.01)
model.compile(optimizer=optimizer)

In [12]:
model.fit(ds_train, validation_data=ds_val, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x125740f40>

Finally, we save our models. During inference we will use the user tower to generate a query embedding. The items associated with the top-k closest candidate embeddings will serve as candidates. These will then be filtered by some criteria (e.g. do not recommend items the customer has already bought) and ranked by a so-called *ranking model*.

In [15]:
model.user_model.save("user_model")
model.item_model.save("item_model")

INFO:tensorflow:Assets written to: user_model/assets


Retrieving the top-k closest candidate embeddings in a brute-force way (computing the distances between the query embedding and all candidate embeddings) would be too expensive in a practical setting. In the next notebook, we will index the item embeddings using OpenSearch, which will allow us to retrieve candidates with very low latency.
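
For reference, brute-force retrieval amounts to scoring every item against the query and keeping the best k, as in the sketch below. The embeddings are random stand-ins and `brute_force_top_k` is a hypothetical helper; the point is that the work per query grows linearly with the number of items.

```python
import heapq
import random


def brute_force_top_k(query, item_embeddings, k):
    """Score every item with a dot product and return the k best item IDs."""
    scores = {
        item_id: sum(q * e for q, e in zip(query, embedding))
        for item_id, embedding in item_embeddings.items()
    }
    return heapq.nlargest(k, scores, key=scores.get)


random.seed(0)
dim = 16

# Random stand-ins for the trained item embeddings and a query embedding.
item_embeddings = {
    f"item_{i}": [random.gauss(0, 1) for _ in range(dim)] for i in range(1000)
}
query = [random.gauss(0, 1) for _ in range(dim)]

top_items = brute_force_top_k(query, item_embeddings, k=5)
print(top_items)
```

With 64,267 items and 16 dimensions, this is roughly a million multiply-adds per query, and the cost keeps growing with the size of the catalogue.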