# TensorFlow Recommenders (TFRS) Demo 

In this notebook we will demonstrate the use of TFRS on sales data from [Babyshop](https://www.babyshop.se/). This tutorial is heavily based on the [official TensorFlow Recommenders tutorials](https://www.tensorflow.org/recommenders/examples/quickstart). 

<div class="alert alert-block alert-info">
There are a lot of machine learning "best practices" that are ignored in this notebook for the sake of simplicity. The focus is to get an introduction to TFRS and general understanding of how this library works, not to build an industrial recommendation system. 
</div>

## **Imports**

In [227]:
from typing import Dict, Any, Text

import numpy as np 
import pandas as pd

import tensorflow as tf
import tensorflow_recommenders as tfrs
import tensorflow_data_validation as tfdv

# **Reading in the Data** 

First we will read in the training and test data. 

<div class="alert alert-block alert-info">
<b>NOTE:</b> See <code>EDA.ipynb</code> for analysis on the data and details on how the train and test sets were created. 
</div>

In the following cells we will create an even smaller version of the dataset so that we can train in a reasonable amount of time on a CPU. 

In [228]:
train_df = pd.read_csv('train.csv', dtype={'user_no': str, 'item_no': str})
test_df = pd.read_csv('test.csv', dtype={'user_no': str, 'item_no': str})

# For evaluation
item_info_df = pd.read_csv('item_info.csv', dtype={'item_no': str})

In [233]:
display(train_df)

Unnamed: 0,user_no,item_no,gender_description,brand,product_group,first_interaction_month
0,3514657341026450752,-8200171396217105230,girls,jacadi,all in ones,5
1,-2544835772752526495,6010486836306001722,unisex,done by deer,tableware,11
2,-6023760384625599940,-289310928076258010,unisex,axkid,car seat accessories,2
3,4084143572023326121,-1069008842172275553,boys,pom dapi,sandals,5
4,-4787976733877481713,608763176274829755,unisex,little luwi,tops,8
...,...,...,...,...,...,...
667004,6183491195824661353,-487489333946043722,girls,ikks,dresses,4
667005,-8074445800271606192,7154496603299236573,unisex,tommee tippee,baby feeding,3
667006,3873852775369901008,3465194094158419708,unisex,by nils,sandals,5
667007,-1306455725574612144,2424760068735106973,girls,kenzo,tops,1


We will obtain our smaller dataset by just taking the top 2000 users (i.e. users with the most interactions) in the training data. 

In [229]:
NUM_USERS = 2000
top_users = train_df['user_no'].value_counts()[:NUM_USERS].index

train_df_filtered = train_df.loc[train_df['user_no'].isin(top_users)]
test_df_filtered = test_df.loc[test_df['user_no'].isin(top_users)]
items = train_df_filtered['item_no'].unique()

In the following cells we create TensorFlow datasets out of the Pandas DataFrames and print out the first few instances just to get an idea of what the datasets look like. 

In [230]:
train_dataset = tf.data.Dataset.from_tensor_slices(dict(train_df_filtered))
test_dataset = tf.data.Dataset.from_tensor_slices(dict(test_df_filtered))

items_dataset = tf.data.Dataset.from_tensor_slices(items)

In [231]:
for item in items_dataset.take(3):
    print(item)

tf.Tensor(b'-1119687312509640915', shape=(), dtype=string)
tf.Tensor(b'-3219910350938683317', shape=(), dtype=string)
tf.Tensor(b'1179978263120783371', shape=(), dtype=string)


In [232]:
for interaction in train_dataset.take(3):
    print(interaction)

{'user_no': <tf.Tensor: shape=(), dtype=string, numpy=b'-2683506524939646253'>, 'item_no': <tf.Tensor: shape=(), dtype=string, numpy=b'-1119687312509640915'>, 'gender_description': <tf.Tensor: shape=(), dtype=string, numpy=b'unisex'>, 'brand': <tf.Tensor: shape=(), dtype=string, numpy=b'reima'>, 'product_group': <tf.Tensor: shape=(), dtype=string, numpy=b'boots'>, 'first_interaction_month': <tf.Tensor: shape=(), dtype=int64, numpy=11>}
{'user_no': <tf.Tensor: shape=(), dtype=string, numpy=b'-8270295623916047084'>, 'item_no': <tf.Tensor: shape=(), dtype=string, numpy=b'-3219910350938683317'>, 'gender_description': <tf.Tensor: shape=(), dtype=string, numpy=b'boys'>, 'brand': <tf.Tensor: shape=(), dtype=string, numpy=b'moschino kid-teen'>, 'product_group': <tf.Tensor: shape=(), dtype=string, numpy=b'tops'>, 'first_interaction_month': <tf.Tensor: shape=(), dtype=int64, numpy=11>}
{'user_no': <tf.Tensor: shape=(), dtype=string, numpy=b'-1493854771764820101'>, 'item_no': <tf.Tensor: shape=()

---
---

# **Baseline**

The first thing we can do is start with a very "naive" baseline; for every interaction in the test dataset we will just predict the top 100 items from the training set. This will give us a reference point for any metrics we calculate after training a model. 

A side benefit is that we can get a better understanding of TFRS by recreating the way that metrics are calculated by TFRS. See [here](https://github.com/tensorflow/recommenders/blob/8b249f3fc0f8d3d907eecf010809a5df3759d65d/tensorflow_recommenders/metrics/factorized_top_k.py#L64) for the source code; the following cells are basically a simplified version of the code found in the TFRS library. 

In [154]:
NUM_TOP_ITEMS = 100
top_items = train_df_filtered['item_no'].value_counts()[:NUM_TOP_ITEMS].index

ks = (1, 5, 10, 50, 100)
metrics = [tf.keras.metrics.Mean() for k in ks]

true_candidates = tf.expand_dims(tf.constant(test_df_filtered['item_no'].values), 1)
retrieved_candidates = tf.expand_dims(top_items, 1)
# Pretend like we retrieve the same top 100 candidates for every interaction in test data
retrieved_candidates = tf.transpose(tf.repeat(retrieved_candidates, 
                                              tf.constant(true_candidates.shape[0]), 
                                              axis=1))
ids_match = tf.cast(tf.math.equal(true_candidates, retrieved_candidates), tf.float32)

In [234]:
retrieved_candidates

<tf.Tensor: shape=(2273, 100), dtype=string, numpy=
array([[b'-2131113190737351926', b'-608163241791914349',
        b'4783972269932241964', ..., b'-6581604590878446705',
        b'-11796169933509936', b'-964263282460141044'],
       [b'-2131113190737351926', b'-608163241791914349',
        b'4783972269932241964', ..., b'-6581604590878446705',
        b'-11796169933509936', b'-964263282460141044'],
       [b'-2131113190737351926', b'-608163241791914349',
        b'4783972269932241964', ..., b'-6581604590878446705',
        b'-11796169933509936', b'-964263282460141044'],
       ...,
       [b'-2131113190737351926', b'-608163241791914349',
        b'4783972269932241964', ..., b'-6581604590878446705',
        b'-11796169933509936', b'-964263282460141044'],
       [b'-2131113190737351926', b'-608163241791914349',
        b'4783972269932241964', ..., b'-6581604590878446705',
        b'-11796169933509936', b'-964263282460141044'],
       [b'-2131113190737351926', b'-608163241791914349',
    

In [155]:
for k, metric in zip(ks, metrics):
    # By slicing until :k we assume scores are sorted.
    # Clip to only count multiple matches once.
    match_found = tf.clip_by_value(
        tf.reduce_sum(ids_match[:, :k], axis=1, keepdims=True),
        0.0, 1.0
    )
    metric.update_state(match_found)

In [156]:
for k, metric in zip(ks, metrics):
    print(f'Top {k} categorical accuracy: {metric.result().numpy():.5f}')

Top 1 categorical accuracy: 0.00176
Top 5 categorical accuracy: 0.00660
Top 10 categorical accuracy: 0.00968
Top 50 categorical accuracy: 0.02728
Top 100 categorical accuracy: 0.04004
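
To make the metric above concrete, here is a minimal NumPy sketch of the same top-k hit rate on toy data. The function name `top_k_hit_rate` and the ids are made up for illustration; this is not the TFRS implementation, only the same idea in a few lines.

```python
import numpy as np

def top_k_hit_rate(true_ids, retrieved_ids, k):
    """Fraction of rows whose true id appears among the first k retrieved ids."""
    true_ids = np.asarray(true_ids).reshape(-1, 1)      # shape (n, 1)
    retrieved_ids = np.asarray(retrieved_ids)[:, :k]    # shape (n, k)
    hits = (true_ids == retrieved_ids).any(axis=1)      # one boolean per row
    return hits.mean()

# Toy data: 4 test interactions, 3 retrieved items each (already sorted by score).
true_ids = ["a", "b", "c", "d"]
retrieved_ids = [["a", "x", "y"],
                 ["x", "b", "y"],
                 ["x", "y", "z"],
                 ["x", "y", "d"]]

for k in (1, 2, 3):
    print(f"Top {k} hit rate: {top_k_hit_rate(true_ids, retrieved_ids, k)}")
```

Using `any` over the first `k` columns plays the same role as the sum-then-clip step in the cells above: a row counts once no matter how many matches it contains.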


# Creating a Simple Model

We will start by creating a very simple model similar to the one created in [the TFRS basic retrieval tutorial](https://www.tensorflow.org/recommenders/examples/basic_retrieval). Quoting from the tutorial, the model will be composed of two sub-models: 

> 1. A query model computing the query representation (normally a fixed-dimensionality embedding vector) using query features
> 2. A candidate model computing the candidate representation (an equally-sized vector) using the candidate features
> 
> The outputs of the two models are then multiplied together to give a query-candidate affinity score, with higher scores expressing a better match between the candidate and the query.

For our use case, we will pretend that we want to recommend items to users. As such, our **query** model will produce representations of the **users** (and potentially additional **context**, such as time, device, etc.) and our **candidate** model will produce representations of the **items**. 

For the rest of the notebook we will refer to the "query" model as a `user_model` and the "candidate" model as an `item_model`.
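
The scoring step described in the quote can be sketched with plain NumPy. The embeddings below are random placeholders standing in for the outputs of a hypothetical `user_model` and `item_model`; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder tower outputs: 2 users and 5 items embedded in the same 4-d space.
user_embeddings = rng.normal(size=(2, 4))
item_embeddings = rng.normal(size=(5, 4))

# Affinity score for every (user, item) pair: the dot product of the two embeddings.
scores = user_embeddings @ item_embeddings.T          # shape (2, 5)

# Rank the items for each user, best match first.
ranking = np.argsort(-scores, axis=1)
print(scores.shape)
print(ranking[:, :3])
```

Both towers must produce vectors of the same width, otherwise the dot product is undefined.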

<div class="alert alert-block alert-info">
<b>Tip:</b>  There is nothing forcing us to associate users with a query model and items with a candidate model. For example, we could just as easily use items for both the query model and the candidate model to build an <b>item-item</b> recommender. 
</div>

In the following cells we will build each tower separately (via the `create_embedding_model` function). We will also define the task, which in this case will be a retrieval task. Finally we will put together the two sub-models and the task in a `tfrs.Model`, which allows us to implement a model by only implementing the `__init__` and `compute_loss` methods; the base model class will take care of the training loop. 

In [238]:
def get_vocab(df, feature, top_n=None):
    return df[feature].value_counts()[:top_n].index

def create_embedding_model(df, feature, num_oov_indices=1, embedding_dim=32):
    feature_vocab = get_vocab(df, feature)
    embedding_model = tf.keras.Sequential([
        tf.keras.layers.StringLookup(vocabulary=feature_vocab, 
                                     num_oov_indices=num_oov_indices),
        tf.keras.layers.Embedding(len(feature_vocab) + num_oov_indices, embedding_dim)
    ])
    
    
    return embedding_model

class SimpleTFRSModel(tfrs.Model):

    def __init__(self, user_model, item_model, task):
        super().__init__()
        self.user_model: tf.keras.Model = user_model
        self.item_model: tf.keras.Model = item_model
        self.task: tf.keras.layers.Layer = task
            

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        # We pick out the user features and pass them into the user model
        # and item features to pass to the item model. Use the returned embeddings 
        # to calculate the loss
        user_embeddings = self.user_model(features['user_no'])
        positive_item_embeddings = self.item_model(features['item_no'])
        # The task computes the loss and the metrics. Don't compute metrics during training 
        # because it will take too long otherwise
        return self.task(user_embeddings, positive_item_embeddings, compute_metrics=not training)

In [236]:
user_model = create_embedding_model(train_df_filtered, "user_no")
item_model = create_embedding_model(train_df_filtered, "item_no")
metrics = tfrs.metrics.FactorizedTopK(
  candidates=items_dataset.batch(128).map(item_model)
)
task = tfrs.tasks.Retrieval(
  metrics=metrics
)

simple_tfrs_model = SimpleTFRSModel(user_model, item_model, task)

---
---

<div class="alert alert-block alert-warning">
<b>The above is just a convenience!</b> The following class is a simplified version of what
is actually going on under-the-hood:

```python 
class NonTFRSModel(tf.keras.Model):
    def __init__(self, user_model, item_model, metrics):
        """
        Note that we don't pass in the task! That's because we define 
        what the task is here.
        """
        super().__init__()
        self.user_model = user_model 
        self.item_model = item_model 
        # When we perform retrieval, the default loss is actually just good 
        # old CategoricalCrossentropy :) 
        self._loss = tf.keras.losses.CategoricalCrossentropy(
            from_logits=True, reduction=tf.keras.losses.Reduction.SUM
        )
        self._factorized_metrics = metrics

    def calc_loss(self, query_embeddings, candidate_embeddings): 
        scores = tf.linalg.matmul(
            query_embeddings, 
            candidate_embeddings, 
            transpose_b=True
        )
        num_queries, num_candidates = scores.shape
        labels = tf.eye(num_queries, num_candidates)
        loss = self._loss(y_true=labels, y_pred=scores)
        self._factorized_metrics.update_state(
            query_embeddings, 
            candidate_embeddings
        )
        return loss
    

    def train_step(self, features: Dict[Text, tf.Tensor]) -> Dict[Text, tf.Tensor]:
        with tf.GradientTape() as tape: 
            user_embeddings = self.user_model(features['user_no'])
            positive_item_embeddings = self.item_model(features['item_no'])
            loss = self.calc_loss(user_embeddings, positive_item_embeddings)

        gradients = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))

        metrics = {metric.name: metric.result() for metric in self.metrics}
        return metrics 

    def test_step(self, features: Dict[Text, tf.Tensor]) -> Dict[Text, tf.Tensor]: 
        user_embeddings = self.user_model(features['user_no'])
        positive_item_embeddings = self.item_model(features['item_no'])

        # Use calc_loss (defined above); it also updates the top-k metrics.
        loss = self.calc_loss(user_embeddings, positive_item_embeddings)

        metrics = {metric.name: metric.result() for metric in self.metrics}
        return metrics 
```

We can then instantiate and compile a model like so: 

```python 
simple_model = NonTFRSModel(user_model, item_model, metrics)
# Need to specify run_eagerly=True because we need the shape of the scores 
# in the calc_loss function
simple_model.compile(optimizer=tf.keras.optimizers.Adam(), run_eagerly=True)
```

After that we can just train the model the same as below :)

</div>
---
---
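
The loss used in `calc_loss` above can be illustrated without TensorFlow. The following is a minimal NumPy sketch, assuming (as in the class above) that the labels are an identity matrix and the per-row losses are summed; the function name `in_batch_softmax_loss` and the toy embeddings are made up for illustration.

```python
import numpy as np

def in_batch_softmax_loss(query_embeddings, candidate_embeddings):
    """Summed cross-entropy where row i's positive candidate is column i."""
    scores = query_embeddings @ candidate_embeddings.T       # (batch, batch) logits
    scores = scores - scores.max(axis=1, keepdims=True)      # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # Identity labels: only the diagonal (the true pairs) contributes.
    return -np.trace(log_probs)

# Three orthogonal toy queries in a 4-d space.
queries = np.eye(3, 4)
matched = 10.0 * queries                  # candidate i points the same way as query i
mismatched = np.roll(matched, 1, axis=0)  # every query is paired with the wrong candidate

print(in_batch_softmax_loss(queries, matched))
print(in_batch_softmax_loss(queries, mismatched))
```

The loss is close to zero when each query scores highest against its own candidate and large when another candidate in the batch wins, which is why the other items in a batch act as negatives.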

In [239]:
train_dataset_interactions = train_dataset.map(lambda x: {
    'user_no': x['user_no'],
    'item_no': x['item_no']
})
test_dataset_interactions = test_dataset.map(lambda x: {
    'user_no': x['user_no'],
    'item_no': x['item_no']
})

train_ds = train_dataset_interactions.shuffle(1_000).batch(4096)
test_ds = test_dataset_interactions.batch(4096)

<div class="alert alert-block alert-info">
In the interest of accelerating training as much as possible, we skip metric computation during the training steps (metrics are still computed on the validation data). If we were training "for real" we'd probably want to monitor the training and implement early stopping.
</div>

In [240]:
simple_tfrs_model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))
history = simple_tfrs_model.fit(train_ds, 
                                epochs=5, 
                                validation_data=test_ds)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## Evaluation

In [241]:
train_results = simple_tfrs_model.evaluate(train_ds, return_dict=True)
test_results = simple_tfrs_model.evaluate(test_ds, return_dict=True)



In [242]:
print(f"Train top-100 accuracy:  {train_results['factorized_top_k/top_100_categorical_accuracy']}")
print(f"Test top-100 accuracy:  {test_results['factorized_top_k/top_100_categorical_accuracy']}")

Train top-100 accuracy:  0.9680023789405823
Test top-100 accuracy:  0.04223493114113808


Our model is overfitting like crazy. Quoting from the tutorial, this is due to two factors: 

> 1. Our model is likely to perform better on the data that it has seen, simply because it can memorize it. This overfitting phenomenon is especially strong when models have many parameters. It can be mediated by model regularization and use of user and [item] features that help the model generalize better to unseen data.
> 2. The model is re-recommending some of users' already [bought items]. These known-positive [purchases] can crowd out test [items] out of top K recommendations.

## Serving and Qualitative Evaluation

In order to serve the model, we create an "index". Basically this is a way for us to do nearest neighbor search in the embedding space: we take a "query" (in this case a user), calculate its embedding, and then compare that embedding to the embeddings of all candidate items. 

In this case, the number of candidate items is very small, so we just brute force the search. For real-world use cases we would want to use an approximate nearest neighbor search. TFRS allows us to build an index based on [ScaNN](https://github.com/google-research/google-research/tree/master/scann) if we install the optional dependency. 
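
As a rough illustration of what a brute-force index does, here is a minimal NumPy sketch. The helper `brute_force_top_k`, the item ids and the 2-d embeddings are all hypothetical; the real layer works on batches of queries and on the trained embeddings.

```python
import numpy as np

def brute_force_top_k(query_embedding, candidate_ids, candidate_embeddings, k):
    """Score every candidate against the query and return the k best."""
    scores = candidate_embeddings @ query_embedding      # one dot product per candidate
    top = np.argsort(-scores)[:k]                        # indices of the k highest scores
    return scores[top], [candidate_ids[i] for i in top]

# Made-up 2-d embeddings for four hypothetical items.
candidate_ids = ["item_a", "item_b", "item_c", "item_d"]
candidate_embeddings = np.array([[1.0, 0.0],
                                 [0.0, 1.0],
                                 [0.7, 0.7],
                                 [-1.0, 0.0]])
query_embedding = np.array([1.0, 0.2])

top_scores, top_ids = brute_force_top_k(query_embedding, candidate_ids,
                                        candidate_embeddings, k=2)
print(top_ids)  # ['item_a', 'item_c']
```

The cost grows linearly with the number of candidates, which is why approximate search becomes attractive for large catalogues.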

In [243]:
# Create a model that takes in raw query features and
# recommends items out of the entire items dataset.
index = tfrs.layers.factorized_top_k.BruteForce(simple_tfrs_model.user_model)
_ = index.index_from_dataset(
        tf.data.Dataset.zip((items_dataset.batch(100), 
                             items_dataset.batch(100).map(simple_tfrs_model.item_model))))

To qualitatively analyze the performance of the model, we can look at the predictions for a random user. 

<div class="alert alert-block alert-info">
<b>Tip: </b> Rerun the next few cells to get predictions for different users. 
</div>

In [255]:
random_user = np.random.choice(train_df_filtered['user_no'].unique())

In [256]:
%%time
# Get recommendations.
_, titles = index(tf.constant([random_user]))

CPU times: user 4.02 ms, sys: 4.22 ms, total: 8.24 ms
Wall time: 13.9 ms


In [259]:
items_to_exclude = train_df_filtered.loc[train_df_filtered['user_no'] == random_user]['item_no'].unique()

In [261]:
%%time
_, titles = index.query_with_exclusions(tf.constant([random_user]), 
                                       tf.constant([items_to_exclude]))

CPU times: user 2.17 ms, sys: 1.02 ms, total: 3.19 ms
Wall time: 1.87 ms
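
The idea behind querying with exclusions is to keep already-purchased items out of the returned top k. A minimal sketch of that idea follows, taking the simplest route (mask, then rank); it is not meant to mirror how TFRS implements `query_with_exclusions`, and the helper name, ids and scores are made up.

```python
import numpy as np

def top_k_with_exclusions(scores, candidate_ids, excluded_ids, k):
    """Return the k best-scoring ids after masking out the excluded ones."""
    scores = np.array(scores, dtype=float)               # copy so the input is untouched
    excluded_set = set(excluded_ids)
    excluded = np.array([c in excluded_set for c in candidate_ids], dtype=bool)
    scores[excluded] = -np.inf                           # excluded items sort last
    top = np.argsort(-scores)[:k]
    return [candidate_ids[i] for i in top]

candidate_ids = ["item_a", "item_b", "item_c", "item_d"]
scores = [0.9, 0.8, 0.3, 0.1]

print(top_k_with_exclusions(scores, candidate_ids, excluded_ids=[], k=2))
print(top_k_with_exclusions(scores, candidate_ids, excluded_ids=["item_a"], k=2))
```

With `item_a` excluded, the next-best item moves up, so the list no longer spends a slot on something the user already owns.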


**Historical purchases**

In [262]:
train_df_filtered.loc[train_df_filtered['user_no'] == random_user]

Unnamed: 0,user_no,item_no,gender_description,brand,product_group,first_interaction_month
4198,7586537222774793661,-3755111142162014470,girls,kuling,swimwear and coverups,8
25021,7586537222774793661,3629669565886549596,unisex,oii,tops,8
76505,7586537222774793661,-4572276524960750184,unisex,buddy & hope,all in ones,8
107637,7586537222774793661,6246954727138772455,unisex,ergobaby,carriers and slings,8
111045,7586537222774793661,6828600928526932258,girls,mini rodini,tops,8
113278,7586537222774793661,-7286957941988748272,unisex,mini rodini,all in ones,8
123273,7586537222774793661,-1712642357406945566,girls,stella mccartney kids,coats and jackets,8
159195,7586537222774793661,-4157873162967126783,unisex,kuling,headwear,8
187626,7586537222774793661,-7773341554472956586,unisex,frugi,coveralls,11
196172,7586537222774793661,1032216803311351638,girls,stella mccartney kids,clothing sets,8


**Recommendations**

In [263]:
recommendations = [item.numpy().decode() for item in titles[0]]
item_info_df.loc[item_info_df['item_no'].isin(recommendations)]

Unnamed: 0,item_no,colour,gender_description,brand,product_group,min_age,max_age
9315,7532757415729254540,grey,unisex,stoy,vehicles,,
9926,-2175418266946900817,pink,girls,oii,dresses,1.0,12.0
13988,478295614227143228,blue,girls,wolf & rita,tops,1.0,10.0
17043,7735931853117354,yellow,unisex,kuling,swimwear and coverups,0.625,6.0
23589,-5329910941720654887,cream,unisex,kuling,headwear,0.125,2.0
23971,5352525995966123584,white,unisex,aden + anais,textile,,
27911,188818157454194145,beige,girls,how to kiss a frog,tops,3.0,14.0
36300,5160161305215034223,cream,unisex,kuling,swimwear and coverups,0.625,8.0
53200,-434987919998693808,unknown,girls,oii,dresses,1.0,12.0
56856,6765400635300223494,black,girls,nike,underwear,6.0,12.0


---
---
---

## **Content-Based Filtering**

Another way to approach recommendations is to base them solely on content metadata, rather than learning from patterns in interactions across the customer base as a whole. As such, we will likely not get any "novel" recommendations and instead many of the recommendations will be very similar to the user's purchase history. 

In order to take advantage of some TFRS "machinery", we can build user and item models as before. However, this time instead of *learning* embeddings for each individual user and each item, we will manually compute the representations of each user and each item. 

In this case an item embedding will just consist of the concatenated one-hot encodings of the brand, product group, and gender description, and a user embedding will be the average of all the item embeddings in their purchase history. 
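
A tiny worked example may help before the real computation. The vocabularies and purchase history below are made up and far smaller than the real ones, and only two of the three features are used to keep the vectors short.

```python
import numpy as np

# Hypothetical vocabularies for two features.
genders = ["boys", "girls", "unisex"]
groups = ["tops", "headwear"]

def one_hot(value, vocabulary):
    return np.array([1.0 if value == v else 0.0 for v in vocabulary])

def item_embedding(gender, group):
    # Concatenate the per-feature one-hot vectors into a single item vector.
    return np.concatenate([one_hot(gender, genders), one_hot(group, groups)])

# A made-up purchase history of three items.
history = [("girls", "tops"), ("unisex", "tops"), ("unisex", "headwear")]

# The user vector is the mean of the item vectors in the history.
user_embedding = np.mean([item_embedding(g, p) for g, p in history], axis=0)
print(user_embedding)
```

Each feature's block of the user vector sums to 1, so an entry can be read as the share of that user's purchases with that attribute.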

In [264]:
top_brands = get_vocab(train_df_filtered, 'brand', 100)
top_groups = get_vocab(train_df_filtered, 'product_group', 50)
COLS_TO_KEEP = ['gender_description', 'brand', 'product_group']

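def precompute_embeddings(df, agg_col):
    # NOTE: this overwrites 'brand' and 'product_group' in the DataFrame that is
    # passed in; values outside the top vocabularies become 'other' in later cells too.
    df.loc[:, 'brand'] = df['brand'].apply(lambda x: x if x in top_brands else 'other')
    df.loc[:, 'product_group'] = df['product_group'].apply(lambda x: x if x in top_groups else 'other')
    # One-hot encode the categorical columns, then average the rows per agg_col value.
    df_one_hot = pd.get_dummies(df[COLS_TO_KEEP + [agg_col]], columns=COLS_TO_KEEP)
    return df_one_hot.groupby(agg_col).agg('mean')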

In [265]:
precomputed_user_embeddings = precompute_embeddings(train_df_filtered, agg_col='user_no')
precomputed_item_embeddings = precompute_embeddings(item_info_df, agg_col='item_no')

In [266]:
display(precomputed_user_embeddings)

user_no,gender_description_boys,gender_description_girls,gender_description_unisex,brand_1+ in the family,brand_a happy brand,brand_a monday in copenhagen,brand_adidas,brand_beau loves,brand_billieblush,brand_bisgaard,...,product_group_stroller parts and customisati,product_group_strollers,product_group_swimwear and coverups,product_group_tableware,product_group_textile,product_group_tops,product_group_trainers,product_group_underwear,product_group_vehicles,product_group_water toys
-1012876894217140776,0.000000,0.277778,0.722222,0.000000,0.166667,0.000000,0.055556,0.0,0.0,0.0,...,0.0,0.0,0.055556,0.0,0.000000,0.000000,0.055556,0.166667,0.0,0.0
-1022934284196456562,0.000000,0.166667,0.833333,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.111111,0.0,0.055556,0.111111,0.000000,0.000000,0.0,0.0
-1030336247862550277,0.444444,0.388889,0.166667,0.000000,0.000000,0.111111,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.500000,0.000000,0.000000,0.0,0.0
-1031375167955555195,0.000000,0.736842,0.263158,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.210526,0.052632,0.105263,0.0,0.0
-1041412818309902183,0.200000,0.550000,0.250000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.600000,0.000000,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
968073716034597193,0.111111,0.777778,0.111111,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.166667,0.000000,0.000000,0.0,0.0
976567085753614314,0.000000,0.947368,0.052632,0.000000,0.000000,0.052632,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.052632,0.210526,0.000000,0.157895,0.0,0.0
979760204207844065,0.055556,0.388889,0.555556,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.055556,0.0,0.0
987479213534973896,0.000000,0.277778,0.722222,0.166667,0.000000,0.000000,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.055556,0.000000,0.000000,0.277778,0.0,0.0


In [267]:
display(precomputed_item_embeddings)

item_no,gender_description_boys,gender_description_girls,gender_description_unisex,brand_1+ in the family,brand_a happy brand,brand_a monday in copenhagen,brand_adidas,brand_beau loves,brand_billieblush,brand_bisgaard,...,product_group_stroller parts and customisati,product_group_strollers,product_group_swimwear and coverups,product_group_tableware,product_group_textile,product_group_tops,product_group_trainers,product_group_underwear,product_group_vehicles,product_group_water toys
-10001501373726678,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
-1000182030290830232,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
-1000183384954605528,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
-1000321715684049686,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
-1000570342615087077,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999030474988862413,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
999032067904529387,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
999084409713144028,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
999328979874402204,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [268]:
def create_precomputed_embedding_model(precomputed_embeddings):
    num_columns = len(precomputed_embeddings.columns)
    embedding_matrix = np.concatenate((np.zeros((1, num_columns)),
                                      precomputed_embeddings.values))
    embedding_layer = tf.keras.layers.Embedding(*embedding_matrix.shape,
                                                embeddings_initializer=tf.keras.initializers.Constant(
                                                    embedding_matrix),
                                                trainable=False)
    model = tf.keras.Sequential([
        tf.keras.layers.StringLookup(
            vocabulary=precomputed_embeddings.index,
            num_oov_indices=1
        ),
        embedding_layer
    ])
    return model

In [269]:
user_model = create_precomputed_embedding_model(precomputed_user_embeddings)
item_model = create_precomputed_embedding_model(precomputed_item_embeddings)

In [270]:
item_model(tf.constant(['-1000183384954605528']))

<tf.Tensor: shape=(1, 155), dtype=float32, numpy=
array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)>
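
The lookup performed by the model above can be mimicked in plain Python and NumPy. This is only a sketch with a made-up two-item table; the names `known_ids`, `id_to_row` and `lookup` are hypothetical.

```python
import numpy as np

# Hypothetical precomputed table: one row per known id.
known_ids = ["item_a", "item_b"]
precomputed = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 1.0]])

# Row 0 is reserved for out-of-vocabulary ids and holds all zeros.
embedding_matrix = np.concatenate([np.zeros((1, precomputed.shape[1])), precomputed])
id_to_row = {item_id: row for row, item_id in enumerate(known_ids, start=1)}

def lookup(item_id):
    # Unknown ids fall back to row 0, mirroring num_oov_indices=1.
    return embedding_matrix[id_to_row.get(item_id, 0)]

print(lookup("item_b"))      # [0. 1. 1.]
print(lookup("never_seen"))  # [0. 0. 0.]
```

Because the out-of-vocabulary row is all zeros, an unknown id scores 0 against every candidate instead of raising an error.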

In [272]:
items_dataset = tf.data.Dataset.from_tensor_slices(item_info_df['item_no'])
# Create a model that takes in raw query features and
# recommends items out of the entire items dataset.
index = tfrs.layers.factorized_top_k.BruteForce(user_model)
index.index_from_dataset(
  tf.data.Dataset.zip((items_dataset.batch(100), items_dataset.batch(100).map(item_model)))
)

<tensorflow_recommenders.layers.factorized_top_k.BruteForce at 0x15ac55850>

In [279]:
random_user = np.random.choice(train_df_filtered['user_no'].unique())

In [280]:
items_to_exclude = train_df_filtered.loc[train_df_filtered['user_no'] == random_user]['item_no'].unique()

In [281]:
%%time
_, titles = index.query_with_exclusions(tf.constant([random_user]), 
                                       tf.constant([items_to_exclude]))

CPU times: user 7.91 ms, sys: 1.42 ms, total: 9.33 ms
Wall time: 8.77 ms


In [282]:
train_df_filtered.loc[train_df_filtered['user_no'] == random_user]

Unnamed: 0,user_no,item_no,gender_description,brand,product_group,first_interaction_month
12866,1314757480229460543,374555410294867495,unisex,carena,other,4
85644,1314757480229460543,-1135627525164144199,unisex,garbo&friends,textile,4
118165,1314757480229460543,-6063671196338916197,boys,other,headwear,8
142592,1314757480229460543,-8143881665306767562,unisex,bobo choses,headwear,8
144750,1314757480229460543,1132221932709541777,unisex,bobo choses,headwear,8
185363,1314757480229460543,-7423664940335137580,unisex,bobo choses,headwear,8
201656,1314757480229460543,-8745981706606507622,girls,molo,headwear,8
276629,1314757480229460543,-7293654529543277906,unisex,other,baby feeding,4
361298,1314757480229460543,1934457679759312272,unisex,garbo&friends,textile,4
379618,1314757480229460543,-7669056745363357315,girls,other,headwear,8


In [283]:
recommendations = [item.numpy().decode() for item in titles[0]]
item_info_df.loc[item_info_df['item_no'].isin(recommendations)]

Unnamed: 0,item_no,colour,gender_description,brand,product_group,min_age,max_age
15,457467103957514638,beige,unisex,other,headwear,0.125,0.125
19,5751912426656680356,white,unisex,other,headwear,0.125,2.0
209,3958544075576929766,beige,unisex,other,headwear,0.625,8.0
252,5765193949212180695,yellow,unisex,other,headwear,0.375,2.0
376,-8745569070866202486,brown,unisex,other,headwear,0.875,6.0
495,7912340583694710060,grey,unisex,other,headwear,0.125,2.0
670,-8259248676814643686,black,unisex,other,headwear,,
942,2589311358657624625,black,unisex,other,headwear,0.125,4.0
1022,5146575204438291293,grey,unisex,other,headwear,,
1092,7732139752943756076,white,unisex,other,headwear,0.125,2.0


**Usually the recommendations have little or no diversity**

In [284]:
test_users_dataset = tf.data.Dataset.from_tensor_slices(test_df_filtered['user_no'])

In [287]:
%%time
_, retrieved_items = index(tf.constant(test_df_filtered['user_no']), k=100)

CPU times: user 2.46 s, sys: 378 ms, total: 2.84 s
Wall time: 970 ms


It takes about 1 s to retrieve 100 recommendations for each of the 2,273 interactions in the test set. 

In [289]:
ids_match = tf.cast(tf.math.equal(true_candidates, retrieved_items), tf.float32)

In [290]:
metrics = [tf.keras.metrics.Mean() for k in ks]
for k, metric in zip(ks, metrics):
    # By slicing until :k we assume scores are sorted.
    # Clip to only count multiple matches once.
    match_found = tf.clip_by_value(
        tf.reduce_sum(ids_match[:, :k], axis=1, keepdims=True),
        0.0, 1.0
    )
    metric.update_state(match_found)

In [291]:
for k, metric in zip(ks, metrics):
    print(f'Top {k} categorical accuracy: {metric.result().numpy():.5f}')

Top 1 categorical accuracy: 0.00000
Top 5 categorical accuracy: 0.00308
Top 10 categorical accuracy: 0.00836
Top 50 categorical accuracy: 0.04048
Top 100 categorical accuracy: 0.07347


---

# **Using Additional Features**

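Now let's add additional features. The user tower still embeds only the user ID, while the item tower combines the item ID with the gender description, brand, and product group.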
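
Since the two towers are compared with a dot product, their outputs need the same width. The short check below adds up the widths used by the item tower in the next cell. It assumes the gender vocabulary holds exactly the three values seen earlier (boys, girls, unisex), which is not confirmed by the code shown here.

```python
# Widths of the pieces concatenated by the item tower in the next cell.
item_id_dim = 16
gender_one_hot_dim = 3      # assumption: vocabulary is boys, girls, unisex, with no OOV slot
brand_dim = 8
product_group_dim = 5

item_tower_dim = item_id_dim + gender_one_hot_dim + brand_dim + product_group_dim
user_tower_dim = 32         # default embedding_dim of the user tower

print(item_tower_dim, user_tower_dim)
```

If the gender vocabulary had a different size, the widths would no longer match and a projection layer (such as the commented-out `Dense(32)` layers below) would be needed on at least one tower.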

In [304]:
class UserModel(tf.keras.Model):
    def __init__(self, unique_users, num_oov_indices=1, embedding_dim=32):
        super().__init__()
        
        self.user_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=unique_users, 
                                         num_oov_indices=num_oov_indices),
            tf.keras.layers.Embedding(len(unique_users) + num_oov_indices, embedding_dim)
        ])
        
    def call(self, inputs):
        return self.user_embedding(inputs['user_no'])
    
class ItemModel(tf.keras.Model):
    def __init__(self, 
                 items, 
                 gender_description,
                 top_brands, 
                 top_groups, 
                 num_oov_indices=1, 
                 embedding_dim=16):
        super().__init__()
        
        self.item_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=items, 
                                         num_oov_indices=num_oov_indices),
            tf.keras.layers.Embedding(len(items) + num_oov_indices, embedding_dim)
        ])
        
        self.gender_description_lookup = tf.keras.layers.StringLookup(vocabulary=gender_description, 
                                                                      output_mode='one_hot',
                                                                      num_oov_indices=0)
        self.brand_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=top_brands, 
                                         num_oov_indices=num_oov_indices),
            tf.keras.layers.Embedding(len(top_brands) + num_oov_indices, 8)
        ])
        self.product_group_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=top_groups, 
                                         num_oov_indices=num_oov_indices),
            tf.keras.layers.Embedding(len(top_groups) + num_oov_indices, 5)
        ])
        
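    def call(self, inputs):
        # With the default sizes the output width is
        # 16 (item) + len(gender_description) (one-hot) + 8 (brand) + 5 (product group).
        # It must equal the query tower's output width (32 by default).
        return tf.concat([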
             self.item_embedding(inputs['item_no']),
             self.gender_description_lookup(inputs['gender_description']),
             self.brand_embedding(inputs['brand']),
             self.product_group_embedding(inputs['product_group'])
        ], axis=1)
    
class TFRSContextModel(tfrs.models.Model):
    def __init__(self, 
                 unique_users,
                 items, 
                 gender_description,
                 top_brands, 
                 top_groups):
        super().__init__()
        self.query_model = tf.keras.Sequential([
            UserModel(unique_users), 
            #tf.keras.layers.Dense(32)
        ])
        self.candidate_model = tf.keras.Sequential([
            ItemModel(items, gender_description, top_brands, top_groups),
            #tf.keras.layers.Dense(32)
        ])
        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=items_dataset_w_context.batch(128).map(self.candidate_model)
            )
        )
    def compute_loss(self, inputs, training=False):
        query_embeddings = self.query_model({
            'user_no': inputs['user_no']
        })
        candidate_embeddings = self.candidate_model({
            'item_no': inputs['item_no'],
            'gender_description': inputs['gender_description'],
            'brand': inputs['brand'],
            'product_group': inputs['product_group']
        })
        
        return self.task(query_embeddings, candidate_embeddings, compute_metrics=not training)
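
The retrieval task scores each user against each item with a dot product, so both towers must produce vectors of the same width. The sketch below is a back-of-the-envelope check that assumes the default dimensions in the classes above. `item_tower_width` is a hypothetical helper, not part of the notebook or of TFRS.

```python
def item_tower_width(n_gender_categories, item_dim=16, brand_dim=8, group_dim=5):
    """Width of the concatenated item vector, assuming the default sizes in ItemModel."""
    # The one-hot gender lookup adds one column per category
    # (num_oov_indices=0 means there is no extra OOV column).
    return item_dim + n_gender_categories + brand_dim + group_dim


USER_TOWER_WIDTH = 32  # default embedding_dim of UserModel

for n in (2, 3, 4):
    width = item_tower_width(n)
    status = "matches" if width == USER_TOWER_WIDTH else "does NOT match"
    print(f"{n} gender categories -> item tower width {width} ({status} the user tower)")
```

With these defaults the widths agree only when there are exactly three gender categories. If the vocabulary changes, one option is to enable the commented-out `Dense(32)` layers so that both towers project to a fixed width.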

**TODO: fix the items dataset.**

In [305]:
items_df = item_info_df.loc[item_info_df['item_no'].isin(items)][
    ['item_no', 'gender_description', 'brand', 'product_group']]

items_dataset_w_context = tf.data.Dataset.from_tensor_slices(dict(items_df))
unique_users = get_vocab(train_df_filtered, 'user_no')
gender_description = get_vocab(train_df_filtered, 'gender_description')
top_brands = get_vocab(train_df_filtered, 'brand')
top_groups = get_vocab(train_df_filtered, 'product_group')

In [306]:
model = TFRSContextModel(unique_users, items, gender_description, top_brands, top_groups)

Consider rewriting this model with the Functional API.


In [307]:
model.compile(optimizer=tf.keras.optimizers.Adam())

In [308]:
cached_train = train_dataset.shuffle(1_000).batch(1024).cache()
cached_test = test_dataset.batch(512).cache()

In [309]:
history = model.fit(cached_train, epochs=5)

Epoch 1/5
Consider rewriting this model with the Functional API.
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [310]:
results = model.evaluate(cached_test, return_dict=True)

Consider rewriting this model with the Functional API.


In [None]:
# Create a model that takes in raw query features, and
index = tfrs.layers.factorized_top_k.BruteForce(model.query_model)
# recommends items out of the entire items dataset.
_ = index.index_from_dataset(
        tf.data.Dataset.zip((items_dataset.batch(100), 
                             items_dataset_w_context.batch(100).map(model.candidate_model))))
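
As a rough illustration of what a brute-force index does, the plain-Python sketch below scores every candidate against a query with a dot product and keeps the k best. The function `brute_force_top_k`, the item ids, and the 2-d embeddings are made up for illustration. This is not the TFRS implementation.

```python
def brute_force_top_k(query, candidates, k):
    """Score every candidate against the query by dot product and return the k best ids."""
    scored = [
        (sum(q * c for q, c in zip(query, embedding)), item_id)
        for item_id, embedding in candidates
    ]
    # Highest score first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Made-up 2-d embeddings, for illustration only.
candidates = [
    ("item_a", [1.0, 0.0]),
    ("item_b", [0.0, 1.0]),
    ("item_c", [0.7, 0.7]),
]
query = [0.9, 0.2]

print(brute_force_top_k(query, candidates, k=2))
```

Because every candidate is scored, the result is exact, at a cost that grows linearly with the number of items.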

In [None]:
for item in train_dataset.take(3).batch(3):
    print(item)

In [None]:
unique_users

In [None]:
top_brands = train_df_filtered['brand'].value_counts()[:100].index
top_groups = train_df_filtered['product_group'].value_counts()[:50].index
gender_description = train_df_filtered['gender_description'].unique()
item_model = ItemModel(items, gender_description, top_brands, top_groups)

In [None]:
for item in train_dataset.take(3).batch(3):
    print(item)

In [None]:
item_model(item)

In [None]:
train_df['gender_description'].unique()

In [None]:
tf.keras.layers.StringLookup?

In [None]:
gender_lookup = tf.keras.layers.StringLookup(vocabulary=train_df['gender_description'].unique(), 
                                             output_mode='one_hot', 
                                             num_oov_indices=0)

In [None]:
gender_lookup(tf.constant(['boys']))
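
For intuition, here is a plain-Python sketch of what a one-hot lookup without an OOV bucket computes. The helper `one_hot_lookup` and the vocabulary order are assumptions for illustration. In the notebook the order is whatever `unique()` returns, and, per the Keras documentation, `StringLookup` with `num_oov_indices=0` raises an error on unknown tokens.

```python
def one_hot_lookup(vocabulary, value):
    """Return a one-hot list for `value`; unknown values raise because there is no OOV slot."""
    if value not in vocabulary:
        raise ValueError(f"{value!r} is not in the vocabulary and no OOV bucket is configured")
    return [1.0 if token == value else 0.0 for token in vocabulary]


vocabulary = ["girls", "unisex", "boys"]  # assumed order, for illustration only
print(one_hot_lookup(vocabulary, "boys"))
```

The output has one column per vocabulary entry, which is why the gender feature contributes `len(gender_description)` columns to the item vector.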

In [None]:
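# Reference snippet from the official TFRS tutorials (MovieLens example).
# `unique_movie_titles` and `movies` must be defined as in that tutorial before this cell can run.
class MovieModel(tf.keras.Model):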

  def __init__(self):
    super().__init__()

    max_tokens = 10_000

    self.title_embedding = tf.keras.Sequential([
      tf.keras.layers.StringLookup(
          vocabulary=unique_movie_titles, mask_token=None),
      tf.keras.layers.Embedding(len(unique_movie_titles) + 1, 32)
    ])

    self.title_vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=max_tokens)

    self.title_text_embedding = tf.keras.Sequential([
      self.title_vectorizer,
      tf.keras.layers.Embedding(max_tokens, 32, mask_zero=True),
      tf.keras.layers.GlobalAveragePooling1D(),
    ])

    self.title_vectorizer.adapt(movies)

  def call(self, titles):
    return tf.concat([
        self.title_embedding(titles),
        self.title_text_embedding(titles),
    ], axis=1)

In [None]:
for item in train_dataset.take(1):
    print(item)