## Train Retrieval Model

In this notebook, we train a retrieval model that can quickly generate a small subset of candidate items from a large collection. The model is based on the two-tower architecture, which embeds queries and candidates (keys) into a shared low-dimensional vector space. Here, a query consists of features of a customer and a transaction (e.g. the timestamp of the purchase), whereas a candidate consists of features of a particular item. Every query has a user ID and every candidate has an item ID, and the model is trained so that the query embedding of a user lies close to the embeddings of the items that user has previously bought.

### Data

Let's go ahead and load the data.

In [1]:
import pandas as pd

dtype = {"customer_id": object, "article_id": object}

train_df = pd.read_csv("train_df.csv", dtype=dtype)
val_df = pd.read_csv("val_df.csv", dtype=dtype)

We will train our retrieval model with a subset of features.

For the query embedding we will use:
- `customer_id`: ID of the customer.
- `age`: age of the customer at the time of purchase.
- `month_sin`, `month_cos`: time of year the purchase was made.

For the candidate embedding we will use:
- `article_id`: ID of the item.
- `garment_group_name`: type of garment.
- `index_group_name`: menswear/ladieswear etc.
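
The `month_sin` and `month_cos` columns look like a cyclical encoding of the purchase month. The formula used to build them is not shown in this notebook, so the sketch below is an assumption: it maps a month in 1..12 onto the unit circle so that December and January end up close together. `encode_month` is a hypothetical helper, not part of the pipeline.

```python
import math

def encode_month(month):
    """Map a calendar month (1-12) to a (sin, cos) point on the unit circle.

    Assumed formula; the notebook does not show how the columns were built.
    """
    angle = 2 * math.pi * (month - 1) / 12
    return math.sin(angle), math.cos(angle)

jan, jun, dec = encode_month(1), encode_month(6), encode_month(12)

# December is adjacent to January on the circle, whereas June is far away.
print(math.dist(dec, jan) < math.dist(jun, jan))
```

Using two columns rather than the raw month number avoids an artificial jump between month 12 and month 1.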

In [2]:
import tensorflow as tf

query_features = ["customer_id", "age", "month_sin", "month_cos"]
candidate_features = ["article_id", "garment_group_name", "index_group_name"]

retrieval_features = query_features + candidate_features

train_df = train_df[retrieval_features]
val_df = val_df[retrieval_features]

def df_to_ds(df):
    return tf.data.Dataset.from_tensor_slices({col : df[col] for col in df})

BATCH_SIZE = 2048
# Shuffle individual transactions before batching so that batches differ between epochs.
ds_train = df_to_ds(train_df).cache().shuffle(BATCH_SIZE*10).batch(BATCH_SIZE)
ds_val = df_to_ds(val_df).batch(BATCH_SIZE).cache()


We will need a list of user and item IDs when we initialize our embeddings.

In [3]:
user_id_list = train_df["customer_id"].unique().tolist()
item_id_list = train_df["article_id"].unique().tolist()

# TODO will be handled by Hopsworks when we create dataset with label encoder.
garment_group_list = train_df["garment_group_name"].unique().tolist()
index_group_list = train_df["index_group_name"].unique().tolist()

print(f"Number of transactions: {len(train_df):,}")
print(f"Number of users: {len(user_id_list):,}")
print(f"Number of items: {len(item_id_list):,}")

Number of transactions: 409,857
Number of users: 23,838
Number of items: 56,690


### Two Tower Model

The two tower model consists of two models:
- Query model: generates a query representation given user and transaction features.
- Candidate model: generates an item representation given item features.

Both models produce embeddings that live in the same embedding space. We keep this space low-dimensional to reduce overfitting on the training data. (Otherwise, the model might simply memorize previous purchases and end up recommending items customers have already bought.)

In [4]:
EMB_DIM = 16

We start with creating the query model.

In [5]:
import tensorflow as tf

class UserTower(tf.keras.Model):

    def __init__(self):
        super().__init__()

        self.user_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=user_id_list,
                mask_token=None
            ),
            tf.keras.layers.Embedding(
                # We add an additional embedding to account for unknown tokens.
                len(user_id_list) + 1,
                EMB_DIM
            )
        ])
        
        self.normalized_age = tf.keras.layers.Normalization(axis=None)

        self.fnn = tf.keras.Sequential([
            tf.keras.layers.Dense(EMB_DIM, activation="relu"),
            tf.keras.layers.Dense(EMB_DIM)
        ])

    def call(self, inputs):
        concatenated_inputs = tf.concat([
            self.user_embedding(inputs["customer_id"]),
            tf.reshape(self.normalized_age(inputs["age"]), (-1,1)),
            tf.reshape(inputs["month_sin"], (-1,1)),
            tf.reshape(inputs["month_cos"], (-1,1))
        ], axis=1)

        outputs = self.fnn(concatenated_inputs)

        return outputs


user_model = UserTower()

# TODO this will be done in Hopsworks with the min-max scaler.
user_model.normalized_age.adapt(ds_train.map(lambda x : x["age"]))

# Initialize model with inputs.
query_ds = ds_train.map(lambda x : {feat : x[feat] for feat in query_features})
user_model(next(iter(query_ds)))

<tf.Tensor: shape=(2048, 16), dtype=float32, numpy=
array([[ 0.5593176 , -0.0049925 ,  0.09828547, ..., -0.34590882,
         0.00600495,  0.16044746],
       [ 0.43364993, -0.11357062,  0.10666486, ..., -0.21101117,
         0.01319867,  0.11278725],
       [-0.17184407,  0.21663298, -0.08368293, ..., -0.26022148,
        -0.23559812, -0.06363247],
       ...,
       [-0.13941592,  0.17771576,  0.11363616, ..., -0.13024214,
        -0.2633681 ,  0.06311373],
       [ 0.18995321,  0.26337442, -0.16955397, ..., -0.6095935 ,
         0.14430802,  0.20826371],
       [ 0.10882069,  0.43169028, -0.27546012, ..., -1.0224997 ,
         0.26312116,  0.09213994]], dtype=float32)>

The candidate model is very similar to the query model. The main difference is that it takes two categorical features as input, which we one-hot encode.

In [6]:
class ItemTower(tf.keras.Model):

    def __init__(self):
        super().__init__()

        self.item_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=item_id_list,
                mask_token=None
            ),
            tf.keras.layers.Embedding(
                # We add an additional embedding to account for unknown tokens.
                len(item_id_list) + 1,
                EMB_DIM
            )
        ])

        self.garment_group_tokenizer = tf.keras.layers.StringLookup(vocabulary=garment_group_list, mask_token=None)
        self.index_group_tokenizer = tf.keras.layers.StringLookup(vocabulary=index_group_list, mask_token=None)

        self.fnn = tf.keras.Sequential([
            tf.keras.layers.Dense(EMB_DIM, activation="relu"),
            tf.keras.layers.Dense(EMB_DIM)
        ])

    def call(self, inputs):
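        # StringLookup reserves index 0 for unknown values, so the vocabulary
        # occupies indices 1..N and the one-hot depth must be N + 1.
        garment_group_embedding = tf.one_hot(
            self.garment_group_tokenizer(inputs["garment_group_name"]),
            len(garment_group_list) + 1
        )

        index_group_embedding = tf.one_hot(
            self.index_group_tokenizer(inputs["index_group_name"]),
            len(index_group_list) + 1
        )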

        concatenated_inputs = tf.concat([
            self.item_embedding(inputs["article_id"]),
            garment_group_embedding,
            index_group_embedding
        ], axis=1)

        outputs = self.fnn(concatenated_inputs)

        return outputs


item_model = ItemTower()
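
The `+ 1` in the embedding sizes comes from the index convention of `StringLookup`: with `mask_token=None` and the default single out-of-vocabulary bucket, index 0 is reserved for unknown strings and the vocabulary occupies indices 1 to N. The pure-Python sketch below mimics that convention for illustration only. `lookup` and `one_hot` are hypothetical helpers, not Keras APIs, and the vocabulary is an illustrative example.

```python
def lookup(vocabulary, value):
    """Return 0 for unknown values, otherwise 1 + position in the vocabulary."""
    try:
        return vocabulary.index(value) + 1
    except ValueError:
        return 0

def one_hot(index, depth):
    """Vector of length `depth` with a 1 at `index` (all zeros if out of range)."""
    return [1 if i == index else 0 for i in range(depth)]

vocab = ["Jersey Basic", "Knitwear", "Shoes"]  # illustrative categories
depth = len(vocab) + 1  # one extra slot for the out-of-vocabulary index

for name in ["Knitwear", "Shoes", "Socks"]:
    print(name, one_hot(lookup(vocab, name), depth))
```

A depth of `len(vocab) + 1` gives every index, including the last vocabulary entry, its own slot. With a depth of only `len(vocab)`, the last entry would fall out of range and be encoded as all zeros.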

We will evaluate the two tower model using the *top-100 accuracy*. That is, for each transaction in the validation data we generate the associated query embedding and retrieve the 100 items that are closest to this query in the embedding space. The top-100 accuracy measures how often the item that was actually bought is among these 100 items. To evaluate this, we create a dataset of all unique items in the training data.
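
Before building that dataset, a toy illustration of the metric may help. The sketch below computes top-k accuracy from candidate lists that have already been retrieved; it is not the TFRS implementation, and `top_k_accuracy` and the item IDs are hypothetical.

```python
def top_k_accuracy(retrieved, purchased, k):
    """Fraction of transactions whose purchased item is among the first k retrieved.

    retrieved: ranked candidate ID lists, one per transaction.
    purchased: the item ID actually bought in each transaction.
    """
    hits = sum(1 for ranked, item in zip(retrieved, purchased) if item in ranked[:k])
    return hits / len(purchased)

retrieved = [
    ["a", "b", "c"],  # purchased item "b" is at rank 2
    ["c", "a", "b"],  # purchased item "b" is at rank 3
    ["a", "c", "d"],  # purchased item "b" was not retrieved
]
purchased = ["b", "b", "b"]

print(top_k_accuracy(retrieved, purchased, k=2))  # 1 hit out of 3 transactions
print(top_k_accuracy(retrieved, purchased, k=3))  # 2 hits out of 3 transactions
```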

In [7]:
# Build the candidate catalogue: one row per unique item.
# Chaining drop_duplicates returns a new DataFrame and avoids modifying a slice in place.
item_df = train_df[candidate_features].drop_duplicates(subset="article_id")
item_ds = df_to_ds(item_df)


With this in place, we can finally create our two tower model.

In [8]:
import tensorflow_recommenders as tfrs

class TwoTowerModel(tf.keras.Model):
    def __init__(self, user_model, item_model):
        super().__init__()
        self.user_model = user_model
        self.item_model = item_model
        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=item_ds.batch(BATCH_SIZE).map(self.item_model)
            )
        )

    def train_step(self, batch) -> dict:
        # Set up a gradient tape to record gradients.
        with tf.GradientTape() as tape:

            # Loss computation.
            user_embeddings = self.user_model(batch)
            item_embeddings = self.item_model(batch)
            loss = self.task(user_embeddings, item_embeddings,
                             compute_metrics=False)

            # Handle regularization losses as well.
            regularization_loss = sum(self.losses)

            total_loss = loss + regularization_loss

        gradients = tape.gradient(total_loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))

        metrics = {
            "loss": loss,
            "regularization_loss": regularization_loss,
            "total_loss": total_loss
        }

        return metrics

    def test_step(self, batch) -> dict:
        # Loss computation.
        user_embeddings = self.user_model(batch)
        item_embeddings = self.item_model(batch)

        loss = self.task(user_embeddings, item_embeddings,
                         compute_metrics=True)

        # Handle regularization losses as well.
        regularization_loss = sum(self.losses)

        total_loss = loss + regularization_loss

        metrics = {metric.name: metric.result() for metric in self.metrics}
        metrics["loss"] = loss
        metrics["regularization_loss"] = regularization_loss
        metrics["total_loss"] = total_loss

        return metrics
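
To our understanding, the default loss of `tfrs.tasks.Retrieval` treats the other items in a batch as negatives: every user embedding is scored against every item embedding in the batch, and a softmax cross-entropy is applied in which the matching item is the positive. The pure-Python sketch below illustrates that idea on a made-up batch of two. It is a simplified approximation (it ignores TFRS options such as temperature or sampling corrections), and `in_batch_softmax_loss` is a hypothetical helper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_softmax_loss(user_embeddings, item_embeddings):
    """Cross-entropy where row i's positive is item i and all other items are negatives."""
    total = 0.0
    for i, user in enumerate(user_embeddings):
        scores = [dot(user, item) for item in item_embeddings]
        # Numerically stable log-sum-exp over all items in the batch.
        max_score = max(scores)
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_norm - scores[i]  # summed over the batch
    return total

users = [[1.0, 0.0], [0.0, 1.0]]
aligned_items = [[2.0, 0.0], [0.0, 2.0]]  # each user points at its own item
swapped_items = [[0.0, 2.0], [2.0, 0.0]]  # each user points at the other item

print(in_batch_softmax_loss(users, aligned_items))
print(in_batch_softmax_loss(users, swapped_items))
```

The loss is small when each user embedding scores highest against its own item and large when it scores highest against another item in the batch, which is what pulls users towards the items they bought.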


#### Model Training

We'll train our model using the AdamW optimizer, which applies decoupled weight decay as a form of regularization during training.

In [9]:
import tensorflow_addons as tfa

model = TwoTowerModel(user_model, item_model)
optimizer = tfa.optimizers.AdamW(weight_decay=0.001, learning_rate=0.01)
model.compile(optimizer=optimizer)
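
To make the role of the two optimizer arguments concrete, here is a sketch of a single AdamW update for one scalar parameter. It follows the general AdamW recipe rather than the `tensorflow_addons` source code. In particular, libraries differ on whether the decay term is also scaled by the learning rate; the sketch does not scale it, which is an assumption. `adamw_step` is a hypothetical helper.

```python
import math

def adamw_step(param, grad, m, v, t, lr=0.01, weight_decay=0.001,
               beta1=0.9, beta2=0.999, eps=1e-7):
    """One AdamW update for a single scalar parameter (t is the 1-based step count)."""
    # Decoupled weight decay: shrink the parameter directly, not via the gradient.
    decayed = param - weight_decay * param

    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    new_param = decayed - lr * m_hat / (math.sqrt(v_hat) + eps)
    return new_param, m, v

# With a zero gradient, only the decay term acts on the parameter.
param, m, v = adamw_step(param=1.0, grad=0.0, m=0.0, v=0.0, t=1)
print(param)
```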

In [10]:
model.fit(ds_train, validation_data=ds_val, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x10f63e220>

Finally we save our models.

In [11]:
model.user_model.save("user_model")
model.item_model.save("item_model")


INFO:tensorflow:Assets written to: user_model/assets
INFO:tensorflow:Assets written to: item_model/assets


### Next Steps

Retrieving the top-k closest candidate embeddings in a brute-force way (computing the distances between the query embedding and all candidate embeddings) is too expensive in a practical setting. In the next notebook, we will index the item embeddings using OpenSearch, which will allow us to retrieve candidates with very low latency.
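
For reference, the brute-force approach amounts to scoring every candidate against the query and keeping the best k, so its cost grows linearly with the size of the catalogue. A minimal sketch with made-up two-dimensional embeddings follows; `brute_force_top_k` is a hypothetical helper, and the dot product is assumed as the similarity score.

```python
import heapq

def brute_force_top_k(query, candidates, k):
    """Score every candidate against the query and return the k best item IDs."""
    scored = (
        (sum(q * c for q, c in zip(query, emb)), item_id)
        for item_id, emb in candidates.items()
    )
    return [item_id for _, item_id in heapq.nlargest(k, scored)]

candidates = {
    "item_a": [0.9, 0.1],
    "item_b": [0.1, 0.9],
    "item_c": [0.7, 0.7],
}
query = [1.0, 0.2]

print(brute_force_top_k(query, candidates, k=2))
```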