# Training TFRS

 - [ ] Fix the data 
     - [x] Get a reasonable amount of data, make sure there is overlap in train/test 
     - [ ] Set up a flag so we can use all vs. subset of data depending on CPU/GPU
 - [ ] Set up eval procedure - **Clean this up a bit more**
     - [x] Metrics 
     - [ ] Coverage/Popularity
     - [x] Qualitative evaluation of predictions 
 - [x] Baselines - **Done, just need to clean**
     - [x] Most popular 
     - [x] Domain Knowledge 
     - [x] kNN
 - [ ] TFRS
     - [x] Simple model 
     - [ ] With Context Features
     - [ ] Sequential 
     - [ ] Memory Efficient
 - [ ] Serving 
     - [x] In memory 
     - [ ] TFS
 - [ ] E2E with TFX
 - [ ] Alternatives 
     - [ ] LightFM, Microsoft Recommenders, Transformer-based recommenders
 - [ ] Clean Notebook
     - [ ] References to Papers / Books
     - [ ] Evaluation notes
     - [ ] Shortcomings/Future work 
    
After adding context features, train a more advanced model on a GPU, and then build the E2E pipeline with TFX.

## **Imports**

In [119]:
from typing import Dict, Any, Text

import numpy as np 
import pandas as pd

import tensorflow as tf
import tensorflow_recommenders as tfrs
import tensorflow_data_validation as tfdv

# **Reading in the Data** 

First we will read in the training and test data. 

<div class="alert alert-block alert-info">
<b>NOTE:</b> See <code>EDA.ipynb</code> for analysis on the data and details on how the train and test sets were created. 
</div>

In the following cells we will cheat a bit and create an even smaller version of the dataset, keeping only the most active users, so that we can train in a reasonable amount of time on a CPU.

In [148]:
train_df = pd.read_csv('train.csv', dtype={'user_no': str, 'item_no': str})
test_df = pd.read_csv('test.csv', dtype={'user_no': str, 'item_no': str})

# For evaluation
item_info_df = pd.read_csv('item_info.csv', dtype={'item_no': str})

In [149]:
NUM_USERS = 2000
top_users = train_df['user_no'].value_counts()[:NUM_USERS].index

train_df_filtered = train_df.loc[train_df['user_no'].isin(top_users)]
test_df_filtered = test_df.loc[test_df['user_no'].isin(top_users)]
items = train_df_filtered['item_no'].unique()

In the following cell we create TensorFlow datasets out of the Pandas DataFrames.

In [150]:
train_dataset = tf.data.Dataset.from_tensor_slices(dict(train_df_filtered))
test_dataset = tf.data.Dataset.from_tensor_slices(dict(test_df_filtered))

items_dataset = tf.data.Dataset.from_tensor_slices(items)

In [151]:
for item in items_dataset.take(3):
    print(item)

tf.Tensor(b'-1119687312509640915', shape=(), dtype=string)
tf.Tensor(b'-3219910350938683317', shape=(), dtype=string)
tf.Tensor(b'1179978263120783371', shape=(), dtype=string)


In [152]:
for interaction in train_dataset.take(3):
    print(interaction)

{'user_no': <tf.Tensor: shape=(), dtype=string, numpy=b'-2683506524939646253'>, 'item_no': <tf.Tensor: shape=(), dtype=string, numpy=b'-1119687312509640915'>, 'gender_description': <tf.Tensor: shape=(), dtype=string, numpy=b'unisex'>, 'brand': <tf.Tensor: shape=(), dtype=string, numpy=b'reima'>, 'product_group': <tf.Tensor: shape=(), dtype=string, numpy=b'boots'>, 'first_interaction_month': <tf.Tensor: shape=(), dtype=int64, numpy=11>}
{'user_no': <tf.Tensor: shape=(), dtype=string, numpy=b'-8270295623916047084'>, 'item_no': <tf.Tensor: shape=(), dtype=string, numpy=b'-3219910350938683317'>, 'gender_description': <tf.Tensor: shape=(), dtype=string, numpy=b'boys'>, 'brand': <tf.Tensor: shape=(), dtype=string, numpy=b'moschino kid-teen'>, 'product_group': <tf.Tensor: shape=(), dtype=string, numpy=b'tops'>, 'first_interaction_month': <tf.Tensor: shape=(), dtype=int64, numpy=11>}
{'user_no': <tf.Tensor: shape=(), dtype=string, numpy=b'-1493854771764820101'>, 'item_no': <tf.Tensor: shape=()

---
---

# **Baseline**

The first thing we can do is start with a very "naive" baseline: for every interaction in the test dataset we will just predict the 100 most popular items in the training set. This gives us a reference point for any metrics we calculate after training a model.

A side benefit is that we can recreate the way that metrics are calculated by TFRS. See [here](https://github.com/tensorflow/recommenders/blob/8b249f3fc0f8d3d907eecf010809a5df3759d65d/tensorflow_recommenders/metrics/factorized_top_k.py#L64) for the source code. 

In [154]:
NUM_TOP_ITEMS = 100
top_items = train_df_filtered['item_no'].value_counts()[:NUM_TOP_ITEMS].index

ks = (1, 5, 10, 50, 100)
metrics = [tf.keras.metrics.Mean() for k in ks]

true_candidates = tf.expand_dims(tf.constant(test_df_filtered['item_no'].values), 1)
retrieved_candidates = tf.expand_dims(top_items, 1)
# Pretend like we retrieve the same top 100 candidates for every interaction in test data
retrieved_candidates = tf.transpose(tf.repeat(retrieved_candidates, 
                                              tf.constant(true_candidates.shape[0]), 
                                              axis=1))
ids_match = tf.cast(tf.math.equal(true_candidates, retrieved_candidates), tf.float32)

In [155]:
for k, metric in zip(ks, metrics):
    # By slicing until :k we assume scores are sorted.
    # Clip to only count multiple matches once.
    match_found = tf.clip_by_value(
        tf.reduce_sum(ids_match[:, :k], axis=1, keepdims=True),
        0.0, 1.0
    )
    metric.update_state(match_found)

In [156]:
for k, metric in zip(ks, metrics):
    print(f'Top {k} categorical accuracy: {metric.result().numpy():.5f}')

Top 1 categorical accuracy: 0.00176
Top 5 categorical accuracy: 0.00660
Top 10 categorical accuracy: 0.00968
Top 50 categorical accuracy: 0.02728
Top 100 categorical accuracy: 0.04004
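
To make the metric concrete, here is a minimal NumPy sketch of the same top-k accuracy computation on toy data. The function name `top_k_accuracy` and the item ids are hypothetical and only illustrate the broadcasting trick used above; this is not the TFRS implementation.

```python
import numpy as np

def top_k_accuracy(true_items, retrieved_items, k):
    """Fraction of rows whose true item appears among the first k retrieved items."""
    true_items = np.asarray(true_items).reshape(-1, 1)   # shape (n, 1)
    retrieved = np.asarray(retrieved_items)[:, :k]       # shape (n, k), assumed sorted by score
    # Broadcasting compares each true item against its own row of retrieved items.
    hits = (true_items == retrieved).any(axis=1)
    return hits.mean()

# Hypothetical toy data: three test interactions, the same ranked list for each one.
popular = np.array(["a", "b", "c", "d"])
retrieved = np.tile(popular, (3, 1))
true_items = np.array(["b", "z", "d"])

acc_at_1 = top_k_accuracy(true_items, retrieved, k=1)
acc_at_2 = top_k_accuracy(true_items, retrieved, k=2)
acc_at_4 = top_k_accuracy(true_items, retrieved, k=4)
print(acc_at_1, acc_at_2, acc_at_4)
```

Using `any` instead of a clipped sum gives the same result: a row counts once no matter how many times the true item appears in the list.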


# Creating a Simple Model

We will start by creating a very simple model similar to the one in [the TFRS basic retrieval tutorial](https://www.tensorflow.org/recommenders/examples/basic_retrieval). Quoting from the tutorial, the model consists of two sub-models:

> 1. A query model computing the query representation (normally a fixed-dimensionality embedding vector) using query features
> 2. A candidate model computing the candidate representation (an equally-sized vector) using the candidate features
> 
> The outputs of the two models are then multiplied together to give a query-candidate affinity score, with higher scores expressing a better match between the candidate and the query.

For our use case, we will pretend that we want to recommend items to users. As such, our **query** model will produce representations of the **users** (and potentially additional **context**, such as time, device, etc.) and our **candidate** model will produce representations of the **items**. 

For the rest of the notebook we will refer to the "query" model as the `user_model` and the "candidate" model as the `item_model`.
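
As a rough illustration of the affinity score described in the quote, the sketch below uses NumPy with random toy embeddings; the sizes and values are assumptions chosen only for the example. Each score is the dot product of one user vector with one item vector.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim = 4  # toy size; the models in this notebook use 32

# Hypothetical embeddings for 2 users (queries) and 3 items (candidates).
user_embeddings = rng.normal(size=(2, embedding_dim))
item_embeddings = rng.normal(size=(3, embedding_dim))

# scores[i, j] is the dot product of user i with item j.
scores = user_embeddings @ item_embeddings.T

# Items ranked from best to worst match for each user.
ranking = np.argsort(-scores, axis=1)
print(scores.shape)
print(ranking)
```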

In [158]:
def get_vocab(df, feature, top_n=None):
    return df[feature].value_counts()[:top_n].index

def create_embedding_model(df, feature, num_oov_indices=1, embedding_dim=32):
    feature_vocab = get_vocab(df, feature)
    feature_input = tf.keras.Input(shape=(), dtype="string", name=feature)
    feature_lookup = tf.keras.layers.StringLookup(
        vocabulary=feature_vocab,
        mask_token=None,
        num_oov_indices=num_oov_indices,
        name=f"{feature}_lookup"
    )(feature_input)
    feature_embedding = tf.keras.layers.Embedding(len(feature_vocab) + num_oov_indices, 
                                                  embedding_dim)(feature_lookup)
    return tf.keras.models.Model(feature_input, feature_embedding)

class SimpleTFRSModel(tfrs.Model):

    def __init__(self, user_model, item_model, task):
        super().__init__()
        self.user_model: tf.keras.Model = user_model
        self.item_model: tf.keras.Model = item_model
        self.task: tf.keras.layers.Layer = task
            

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        # We pick out the user features and pass them into the user model
        # and item features to pass to the item model. Use the returned embeddings 
        # to calculate the loss
        user_embeddings = self.user_model(features['user_no'])
        positive_item_embeddings = self.item_model(features['item_no'])
        # The task computes the loss and the metrics.
        return self.task(user_embeddings, positive_item_embeddings, compute_metrics=not training)

In [159]:
user_model = create_embedding_model(train_df_filtered, "user_no")
item_model = create_embedding_model(train_df_filtered, "item_no")
metrics = tfrs.metrics.FactorizedTopK(
  candidates=items_dataset.batch(128).map(item_model)
)
task = tfrs.tasks.Retrieval(
  metrics=metrics
)

simple_tfrs_model = SimpleTFRSModel(user_model, item_model, task)

---
---

<div class="alert alert-block alert-warning">
<b>The above is just a convenience!</b> The following class is a simplified version of what
is actually going on under-the-hood:

```python 
class NonTFRSModel(tf.keras.Model):
    def __init__(self, user_model, item_model, metrics):
        """
        Note that we don't pass in the task! That's because we define 
        what the task is here.
        """
        super().__init__()
        self.user_model = user_model 
        self.item_model = item_model 
        # When we perform retrieval, the default loss is actually just good 
        # old CategoricalCrossentropy :) 
        self._loss = tf.keras.losses.CategoricalCrossentropy(
            from_logits=True, reduction=tf.keras.losses.Reduction.SUM
        )
        self._factorized_metrics = metrics

    def calc_loss(self, query_embeddings, candidate_embeddings): 
        scores = tf.linalg.matmul(
            query_embeddings, 
            candidate_embeddings, 
            transpose_b=True
        )
        num_queries, num_candidates = scores.shape
        labels = tf.eye(num_queries, num_candidates)
        loss = self._loss(y_true=labels, y_pred=scores)
        self._factorized_metrics.update_state(
            query_embeddings, 
            candidate_embeddings
        )
        return loss
    

    def train_step(self, features: Dict[Text, tf.Tensor]) -> Dict[Text, tf.Tensor]:
        with tf.GradientTape() as tape: 
            user_embeddings = self.user_model(features['user_no'])
            positive_item_embeddings = self.item_model(features['item_no'])
            loss = self.calc_loss(user_embeddings, positive_item_embeddings)

        gradients = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))

        metrics = {metric.name: metric.result() for metric in self.metrics}
        return metrics 

    def test_step(self, features: Dict[Text, tf.Tensor]) -> Dict[Text, tf.Tensor]: 
        user_embeddings = self.user_model(features['user_no'])
        positive_item_embeddings = self.item_model(features['item_no'])

        # calc_loss updates the metrics as a side effect; the loss itself is not reported here.
        loss = self.calc_loss(user_embeddings, positive_item_embeddings)

        metrics = {metric.name: metric.result() for metric in self.metrics}
        return metrics 
```

We can then instantiate and compile a model like so: 

```python 
simple_model = NonTFRSModel(user_model, item_model, metrics)
# Need to specify run_eagerly=True because we need the shape of the scores 
# in the calc_loss function
simple_model.compile(optimizer=tf.keras.optimizers.Adam(), run_eagerly=True)
```

After that we can just train the model the same as below :)
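
The loss in `calc_loss` can also be written out by hand. The following is a minimal NumPy sketch, assuming the same setup as above (identity labels, summed cross-entropy over the score matrix); the function name `in_batch_softmax_loss` is hypothetical.

```python
import numpy as np

def in_batch_softmax_loss(query_embeddings, candidate_embeddings):
    """Summed cross-entropy where the positive for row i is candidate i."""
    scores = query_embeddings @ candidate_embeddings.T       # shape (batch, batch)
    scores = scores - scores.max(axis=1, keepdims=True)      # for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # Identity labels: only the diagonal (the true user/item pairs) contributes.
    return -np.trace(log_probs)

# All-zero embeddings give uniform scores, so each row costs log(batch_size).
uniform_loss = in_batch_softmax_loss(np.zeros((3, 2)), np.zeros((3, 2)))

# Strongly aligned pairs push the loss towards zero.
aligned = 10.0 * np.eye(3)
aligned_loss = in_batch_softmax_loss(aligned, aligned)
print(uniform_loss, aligned_loss)
```

Every other item in the batch acts as a negative for a given user, which is why no explicit negative sampling appears in the class above.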

</div>
---
---

In [161]:
train_dataset_interactions = train_dataset.map(lambda x: {
    'user_no': x['user_no'],
    'item_no': x['item_no']
})
test_dataset_interactions = test_dataset.map(lambda x: {
    'user_no': x['user_no'],
    'item_no': x['item_no']
})

train_ds = train_dataset_interactions.shuffle(1_000).batch(4096)
test_ds = test_dataset_interactions.batch(4096)

<div class="alert alert-block alert-info">
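In the interest of accelerating training as much as possible, we skip metric computation during the training steps (see <code>compute_metrics=not training</code> in <code>compute_loss</code>); metrics are only computed on the validation data. If we were training "for real" we'd probably want to monitor training more closely and implement early stopping.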
</div>

In [162]:
simple_tfrs_model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))
history = simple_tfrs_model.fit(train_ds, 
                                epochs=5, 
                                validation_data=test_ds)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## Evaluation

In [163]:
train_results = simple_tfrs_model.evaluate(train_ds, return_dict=True)
test_results = simple_tfrs_model.evaluate(test_ds, return_dict=True)



In [164]:
print(f"Train top-100 accuracy:  {train_results['factorized_top_k/top_100_categorical_accuracy']}")
print(f"Test top-100 accuracy:  {test_results['factorized_top_k/top_100_categorical_accuracy']}")

Train top-100 accuracy:  0.9690590500831604
Test top-100 accuracy:  0.0439947210252285


## Serving and Qualitative Evaluation

In [165]:
# Create a model that takes in raw query features and
# recommends items out of the entire items dataset.
index = tfrs.layers.factorized_top_k.BruteForce(simple_tfrs_model.user_model)
_ = index.index_from_dataset(
        tf.data.Dataset.zip((items_dataset.batch(100), 
                             items_dataset.batch(100).map(simple_tfrs_model.item_model))))

In [166]:
random_user = np.random.choice(train_df_filtered['user_no'].unique())
train_df_filtered.loc[train_df_filtered['user_no'] == random_user]

Unnamed: 0,user_no,item_no,gender_description,brand,product_group,first_interaction_month
74747,-3104547778465153302,-6830050435975456605,unisex,oyoy,textile,11
116801,-3104547778465153302,8324485338373092391,unisex,garbo&friends,bedding,11
212342,-3104547778465153302,885525305765675599,girls,marmar copenhagen,bottoms,10
254351,-3104547778465153302,-880741914853140430,unisex,filibabba,bedding,11
265275,-3104547778465153302,-4942960870640161773,unisex,baby einstein,first toys and baby toys,11
266383,-3104547778465153302,3041342533360743315,unisex,hust&claire,all in ones,11
287945,-3104547778465153302,131144182965917918,unisex,hust&claire,bottoms,11
358410,-3104547778465153302,7197441133579403316,girls,buddy & hope,headwear,11
370123,-3104547778465153302,142365317642445102,unisex,mainio,all in ones,11
434516,-3104547778465153302,1869695463816614457,unisex,buddy & hope,underwear,11


In [172]:
%%time
# Get recommendations.
_, titles = index(tf.constant([random_user]))

CPU times: user 3.65 ms, sys: 1.69 ms, total: 5.34 ms
Wall time: 4.97 ms


In [170]:
items_to_exclude = train_df_filtered.loc[train_df_filtered['user_no'] == random_user]['item_no'].unique()

In [173]:
%%time
_, titles = index.query_with_exclusions(tf.constant([random_user]), 
                                       tf.constant([items_to_exclude]))

CPU times: user 2.44 ms, sys: 1.25 ms, total: 3.69 ms
Wall time: 3.87 ms
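
Conceptually, a brute-force index scores every candidate against the query and keeps the best ones. The NumPy sketch below shows that idea, including exclusions; it is an illustration under toy assumptions (the function `brute_force_top_k` and the data are hypothetical), not how TFRS implements `BruteForce`.

```python
import numpy as np

def brute_force_top_k(query_embedding, item_ids, item_embeddings, k, exclude=()):
    """Score every candidate, drop excluded ids, return the k best ids and scores."""
    scores = item_embeddings @ query_embedding                 # shape (num_items,)
    excluded = np.isin(item_ids, np.asarray(list(exclude), dtype=str))
    scores = np.where(excluded, -np.inf, scores)               # excluded items can never win
    top = np.argsort(-scores)[:k]
    return item_ids[top], scores[top]

# Hypothetical catalogue of four items with 2-dimensional embeddings.
item_ids = np.array(["a", "b", "c", "d"])
item_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
query = np.array([1.0, 0.0])

ids_all, _ = brute_force_top_k(query, item_ids, item_embeddings, k=2)
ids_excl, _ = brute_force_top_k(query, item_ids, item_embeddings, k=2, exclude=["a"])
print(ids_all, ids_excl)
```

Excluding items the user has already interacted with simply frees up their slots for the next-best candidates.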


In [174]:
recommendations = [item.numpy().decode() for item in titles[0]]
item_info_df.loc[item_info_df['item_no'].isin(recommendations)]

Unnamed: 0,item_no,colour,gender_description,brand,product_group,min_age,max_age
4390,-169382843652926954,white,girls,marmar copenhagen,dresses,0.125,3.0
5523,-5047690686078044677,red,girls,stella mccartney kids,jumpers and knitwear,1.0,14.0
5769,1015996855129878374,yellow,unisex,mam,baby feeding,,
15648,-3106656952336550512,cream,unisex,mp,underwear,0.125,11.0
28960,5304782731933651079,cream,girls,louise misha,all in ones,0.375,2.0
30257,-2534104804647427698,pink,girls,mp,underwear,0.125,10.0
43568,8787244317887926731,green,unisex,joha,scarves,0.875,6.0
44137,4852358975316315615,blue,unisex,kuling,baselayers,1.0,12.0
50387,-2800734550847316720,white,girls,stella mccartney kids,jumpers and knitwear,1.0,14.0
55512,2025859253558055261,blue,unisex,bobo choses,coats and jackets,1.0,11.0


---
---
---

## Content-Based Filtering

In [204]:
top_brands = get_vocab(train_df_filtered, 'brand', 100)
top_groups = get_vocab(train_df_filtered, 'product_group', 50)
COLS_TO_KEEP = ['gender_description', 'brand', 'product_group']

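def precompute_embeddings(df, agg_col):
    # Note: the two assignments below overwrite 'brand' and 'product_group'
    # in the DataFrame that is passed in; rare values become 'other'.
    df.loc[:, 'brand'] = df['brand'].apply(lambda x: x if x in top_brands else 'other')
    df.loc[:, 'product_group'] = df['product_group'].apply(lambda x: x if x in top_groups else 'other')
    # One-hot encode the categorical features, then average them per user or item.
    df_one_hot = pd.get_dummies(df[COLS_TO_KEEP + [agg_col]], columns=COLS_TO_KEEP)
    return df_one_hot.groupby(agg_col).agg('mean')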

In [205]:
precomputed_user_embeddings = precompute_embeddings(train_df_filtered, agg_col='user_no')
precomputed_item_embeddings = precompute_embeddings(item_info_df, agg_col='item_no')
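
To see what these precomputed "embeddings" contain, here is a small pandas sketch on hypothetical toy interactions. It repeats the one-hot-then-average step for a single feature and is only meant to illustrate the idea.

```python
import pandas as pd

# Hypothetical interactions: user u1 saw two tops and one dress, u2 saw one dress.
toy = pd.DataFrame({
    "user_no": ["u1", "u1", "u1", "u2"],
    "product_group": ["tops", "tops", "dresses", "dresses"],
})

# One-hot encode the feature, then average per user to get a taste profile.
one_hot = pd.get_dummies(toy, columns=["product_group"], dtype=float)
profiles = one_hot.groupby("user_no").mean()
print(profiles)
```

Each user row is the share of that user's interactions falling into each category, while an item row (one interaction per item) stays a plain one-hot vector.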

In [206]:
display(precomputed_user_embeddings)

user_no,gender_description_boys,gender_description_girls,gender_description_unisex,brand_1+ in the family,brand_a happy brand,brand_adidas,brand_beau loves,brand_billieblush,brand_bisgaard,brand_bobo choses,...,product_group_stroller parts and customisati,product_group_strollers,product_group_swimwear and coverups,product_group_tableware,product_group_textile,product_group_tops,product_group_trainers,product_group_underwear,product_group_vehicles,product_group_water toys
-1012876894217140776,0.000000,0.277778,0.722222,0.000000,0.166667,0.055556,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.055556,0.0,0.000000,0.000000,0.055556,0.166667,0.0,0.0
-1022934284196456562,0.000000,0.166667,0.833333,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.111111,0.0,0.055556,0.111111,0.000000,0.000000,0.0,0.0
-1030336247862550277,0.444444,0.388889,0.166667,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.000000,0.500000,0.000000,0.000000,0.0,0.0
-1031375167955555195,0.000000,0.736842,0.263158,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.052632,...,0.0,0.0,0.000000,0.0,0.000000,0.210526,0.052632,0.105263,0.0,0.0
-1041412818309902183,0.200000,0.550000,0.250000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.050000,...,0.0,0.0,0.000000,0.0,0.000000,0.600000,0.000000,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
968073716034597193,0.111111,0.777778,0.111111,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.055556,...,0.0,0.0,0.000000,0.0,0.000000,0.166667,0.000000,0.000000,0.0,0.0
976567085753614314,0.000000,0.947368,0.052632,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.052632,0.210526,0.000000,0.157895,0.0,0.0
979760204207844065,0.055556,0.388889,0.555556,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.055556,0.0,0.0
987479213534973896,0.000000,0.277778,0.722222,0.166667,0.000000,0.000000,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.055556,0.000000,0.000000,0.277778,0.0,0.0


In [207]:
display(precomputed_item_embeddings)

item_no,gender_description_boys,gender_description_girls,gender_description_unisex,brand_1+ in the family,brand_a happy brand,brand_adidas,brand_beau loves,brand_billieblush,brand_bisgaard,brand_bobo choses,...,product_group_stroller parts and customisati,product_group_strollers,product_group_swimwear and coverups,product_group_tableware,product_group_textile,product_group_tops,product_group_trainers,product_group_underwear,product_group_vehicles,product_group_water toys
-10001501373726678,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
-1000182030290830232,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
-1000183384954605528,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
-1000321715684049686,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
-1000570342615087077,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999030474988862413,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
999032067904529387,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
999084409713144028,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
999328979874402204,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [208]:
def create_embedding_model(precomputed_embeddings):
    # Note: this redefines the earlier `create_embedding_model`. Here the
    # embeddings are precomputed and frozen rather than learned.
    num_columns = len(precomputed_embeddings.columns)
    # Prepend an all-zero row: index 0 is the out-of-vocabulary slot of StringLookup.
    embedding_matrix = np.concatenate((np.zeros((1, num_columns)),
                                      precomputed_embeddings.values))
    embedding_layer = tf.keras.layers.Embedding(*embedding_matrix.shape,
                                                embeddings_initializer=tf.keras.initializers.Constant(
                                                    embedding_matrix),
                                                trainable=False)
    model = tf.keras.Sequential([
        tf.keras.layers.StringLookup(
            vocabulary=precomputed_embeddings.index,
            num_oov_indices=1
        ),
        embedding_layer
    ])
    return model
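
The role of the extra zero row can be shown without TensorFlow. The sketch below is a plain NumPy stand-in for the lookup-plus-embedding pair, with hypothetical ids and values: index 0 is kept for unknown ids, so they map to an all-zero vector.

```python
import numpy as np

vocabulary = ["u1", "u2"]                      # hypothetical known ids
profiles = np.array([[0.2, 0.8], [1.0, 0.0]])  # one precomputed row per known id

# Row 0 is reserved for out-of-vocabulary ids and left as all zeros.
embedding_matrix = np.concatenate([np.zeros((1, profiles.shape[1])), profiles])
index_of = {token: i + 1 for i, token in enumerate(vocabulary)}

def lookup(token):
    """Return the precomputed row for a known id, or the zero row otherwise."""
    return embedding_matrix[index_of.get(token, 0)]

print(lookup("u2"), lookup("unknown"))
```

One consequence is that an unknown user scores zero against every item, so their ranking carries no information.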

In [209]:
user_model = create_embedding_model(precomputed_user_embeddings)
item_model = create_embedding_model(precomputed_item_embeddings)

In [211]:
item_model(tf.constant(['-1000183384954605528']))

<tf.Tensor: shape=(1, 153), dtype=float32, numpy=
array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)>

In [212]:
items_dataset = tf.data.Dataset.from_tensor_slices(item_info_df['item_no'])

In [213]:
# Create a model that takes in raw query features and
# recommends items out of the entire items dataset.
index = tfrs.layers.factorized_top_k.BruteForce(user_model)
index.index_from_dataset(
  tf.data.Dataset.zip((items_dataset.batch(100), items_dataset.batch(100).map(item_model)))
)

<tensorflow_recommenders.layers.factorized_top_k.BruteForce at 0x14f6e9890>

In [214]:
random_user = np.random.choice(train_df_filtered['user_no'].unique())
train_df_filtered.loc[train_df_filtered['user_no'] == random_user]

Unnamed: 0,user_no,item_no,gender_description,brand,product_group,first_interaction_month
36657,5994354835073478558,-2081566317162731741,girls,liilu,tops,11
87177,5994354835073478558,-1225443341060517789,unisex,kids concept,role play,11
143821,5994354835073478558,-6345811451254769024,girls,oii,jumpers and knitwear,11
159171,5994354835073478558,-5501720600386163534,girls,other,tops,10
170635,5994354835073478558,-1355619664042809785,unisex,other,stationary,11
212141,5994354835073478558,8199939935313300336,girls,liilu,dresses,11
221101,5994354835073478558,-7103983907925988144,girls,camper,shoes,10
240354,5994354835073478558,-1327811681384147519,girls,liilu,dresses,11
277017,5994354835073478558,-7083720758992576476,unisex,stoy,stationary,11
292293,5994354835073478558,-6265121426273307255,unisex,kuling,gloves and mittens,10


In [222]:
items_to_exclude = train_df_filtered.loc[train_df_filtered['user_no'] == random_user]['item_no'].unique()

In [224]:
%%time
_, titles = index.query_with_exclusions(tf.constant([random_user]), 
                                       tf.constant([items_to_exclude]))

CPU times: user 8.25 ms, sys: 1.64 ms, total: 9.89 ms
Wall time: 9.71 ms


In [225]:
recommendations = [item.numpy().decode() for item in titles[0]]
item_info_df.loc[item_info_df['item_no'].isin(recommendations)]

Unnamed: 0,item_no,colour,gender_description,brand,product_group,min_age,max_age
29,-8150507846269859328,white,girls,other,tops,7.0,14.0
144,-2697913540689879618,cream,girls,other,tops,7.0,14.0
223,6809060874831808586,pink,girls,other,tops,3.0,14.0
449,-1574081396994835414,blue,girls,other,tops,7.0,14.0
451,-1434471219688457854,blue,girls,other,tops,3.0,14.0
463,8090136361583216871,pink,girls,other,tops,5.0,14.0
510,3719459509698442638,grey,girls,other,tops,3.0,14.0
641,5380787914814790039,black,girls,other,tops,7.0,14.0
648,7831423384100472341,pink,girls,other,tops,0.125,0.875
655,4893874383433347764,white,girls,other,tops,2.0,14.0


**Looks like it 'memorizes' users' tastes more**

In [217]:
test_users_dataset = tf.data.Dataset.from_tensor_slices(test_df_filtered['user_no'])

In [218]:
# Pass a tensor rather than a pandas Series to avoid a Keras input warning.
_, retrieved_items = index(tf.constant(test_df_filtered['user_no'].values), k=100)


In [219]:
ids_match = tf.cast(tf.math.equal(true_candidates, retrieved_items), tf.float32)

In [220]:
metrics = [tf.keras.metrics.Mean() for k in ks]
for k, metric in zip(ks, metrics):
    # By slicing until :k we assume scores are sorted.
    # Clip to only count multiple matches once.
    match_found = tf.clip_by_value(
        tf.reduce_sum(ids_match[:, :k], axis=1, keepdims=True),
        0.0, 1.0
    )
    metric.update_state(match_found)

In [226]:
for k, metric in zip(ks, metrics):
    print(f'Top {k} categorical accuracy: {metric.result().numpy():.5f}')

Top 1 categorical accuracy: 0.00000
Top 5 categorical accuracy: 0.00308
Top 10 categorical accuracy: 0.00836
Top 50 categorical accuracy: 0.04048
Top 100 categorical accuracy: 0.07347


---

## Context Features

Now let's add context features (gender description, brand, and product group) to the item side of the model.

In [None]:
class UserModel(tf.keras.Model):
    def __init__(self, unique_users, num_oov_indices=1, embedding_dim=32):
        super().__init__()
        
        self.user_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=unique_users, 
                                         num_oov_indices=num_oov_indices),
            tf.keras.layers.Embedding(len(unique_users) + num_oov_indices, embedding_dim)
        ])
        
    def call(self, inputs):
        return self.user_embedding(inputs['user_no'])
    
class ItemModel(tf.keras.Model):
    def __init__(self, 
                 items, 
                 gender_description,
                 top_brands, 
                 top_groups, 
                 num_oov_indices=1, 
                 embedding_dim=16):
        super().__init__()
        
        self.item_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=items, 
                                         num_oov_indices=num_oov_indices),
            tf.keras.layers.Embedding(len(items) + num_oov_indices, embedding_dim)
        ])
        
        self.gender_description_lookup = tf.keras.layers.StringLookup(vocabulary=gender_description, 
                                                                      output_mode='one_hot',
                                                                      num_oov_indices=0)
        self.brand_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=top_brands, 
                                         num_oov_indices=num_oov_indices),
            tf.keras.layers.Embedding(len(top_brands) + num_oov_indices, 8)
        ])
        self.product_group_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=top_groups, 
                                         num_oov_indices=num_oov_indices),
            tf.keras.layers.Embedding(len(top_groups) + num_oov_indices, 5)
        ])
        
    def call(self, inputs):
        return tf.concat([
             self.item_embedding(inputs['item_no']),
             self.gender_description_lookup(inputs['gender_description']),
             self.brand_embedding(inputs['brand']),
             self.product_group_embedding(inputs['product_group'])
        ], axis=1)
    
class TFRSContextModel(tfrs.models.Model):
    def __init__(self, 
                 unique_users,
                 items, 
                 gender_description,
                 top_brands, 
                 top_groups):
        super().__init__()
        self.query_model = tf.keras.Sequential([
            UserModel(unique_users), 
            #tf.keras.layers.Dense(32)
        ])
        self.candidate_model = tf.keras.Sequential([
            ItemModel(items, gender_description, top_brands, top_groups),
            #tf.keras.layers.Dense(32)
        ])
        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=items_dataset_w_context.batch(128).map(self.candidate_model)
            )
        )
    def compute_loss(self, inputs, training=False):
        query_embeddings = self.query_model({
            'user_no': inputs['user_no']
        })
        candidate_embeddings = self.candidate_model({
            'item_no': inputs['item_no'],
            'gender_description': inputs['gender_description'],
            'brand': inputs['brand'],
            'product_group': inputs['product_group']
        })
        
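        # Skip the expensive top-k metrics during training, as in the simple model.
        return self.task(query_embeddings, candidate_embeddings, compute_metrics=not training)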
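
Because the task multiplies query and candidate embeddings together, the concatenated item vector has to be exactly as wide as the user vector. The NumPy sketch below checks the widths used above; it assumes three gender categories (boys, girls, unisex, as in the one-hot columns shown earlier), which makes the item side 16 + 3 + 8 + 5 = 32.

```python
import numpy as np

batch = 2
item_id_part = np.zeros((batch, 16))  # item id embedding
gender_part = np.zeros((batch, 3))    # one-hot; assumes 3 gender categories
brand_part = np.zeros((batch, 8))     # brand embedding
group_part = np.zeros((batch, 5))     # product group embedding

candidate = np.concatenate([item_id_part, gender_part, brand_part, group_part], axis=1)
query = np.zeros((batch, 32))         # user embedding

# The dot product is only defined when both sides have the same width.
assert candidate.shape[1] == query.shape[1], "query and candidate widths must match"
scores = query @ candidate.T
print(candidate.shape, scores.shape)
```

If the widths stop matching, for example after adding another feature, a projection such as the commented-out `Dense(32)` layers in `TFRSContextModel` is one way to bring both sides back to the same size.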

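**TODO: fix the items dataset.** The candidates used for the metrics and the index must carry the same context features that `ItemModel` expects, not just the item ids.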

In [None]:
items_df = item_info_df.loc[item_info_df['item_no'].isin(items)][
    ['item_no', 'gender_description', 'brand', 'product_group']]

items_dataset_w_context = tf.data.Dataset.from_tensor_slices(dict(items_df))

In [None]:
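unique_users = train_df_filtered['user_no'].unique()  # assumed definition; not defined elsewhere in the notebook
model = TFRSContextModel(unique_users, items, gender_description, top_brands, top_groups)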

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam())

In [None]:
cached_train = train_dataset.shuffle(1_000).batch(1024).cache()
cached_test = test_dataset.batch(512).cache()

In [None]:
history = model.fit(cached_train, epochs=5)

In [None]:
results = model.evaluate(cached_test, return_dict=True)

In [None]:
# Create a model that takes in raw query features and
# recommends items out of the entire items dataset.
index = tfrs.layers.factorized_top_k.BruteForce(model.query_model)
# Take the identifiers from the same dataset as the embeddings so the two stay aligned.
_ = index.index_from_dataset(
        tf.data.Dataset.zip((items_dataset_w_context.batch(100).map(lambda x: x['item_no']),
                             items_dataset_w_context.batch(100).map(model.candidate_model))))

In [None]:
for item in train_dataset.take(3).batch(3):
    print(item)

In [None]:
unique_users

In [None]:
top_brands = train_df_filtered['brand'].value_counts()[:100].index
top_groups = train_df_filtered['product_group'].value_counts()[:50].index
gender_description = train_df_filtered['gender_description'].unique()
item_model = ItemModel(items, gender_description, top_brands, top_groups)

In [None]:
for item in train_dataset.take(3).batch(3):
    print(item)

In [None]:
item_model(item)

In [None]:
train_df['gender_description'].unique()

In [None]:
tf.keras.layers.StringLookup?

In [None]:
gender_lookup = tf.keras.layers.StringLookup(vocabulary=train_df['gender_description'].unique(), 
                                             output_mode='one_hot', 
                                             num_oov_indices=0)

In [None]:
gender_lookup(tf.constant(['boys']))

In [None]:
class MovieModel(tf.keras.Model):

  def __init__(self):
    super().__init__()

    max_tokens = 10_000

    self.title_embedding = tf.keras.Sequential([
      tf.keras.layers.StringLookup(
          vocabulary=unique_movie_titles, mask_token=None),
      tf.keras.layers.Embedding(len(unique_movie_titles) + 1, 32)
    ])

    self.title_vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=max_tokens)

    self.title_text_embedding = tf.keras.Sequential([
      self.title_vectorizer,
      tf.keras.layers.Embedding(max_tokens, 32, mask_zero=True),
      tf.keras.layers.GlobalAveragePooling1D(),
    ])

    self.title_vectorizer.adapt(movies)

  def call(self, titles):
    return tf.concat([
        self.title_embedding(titles),
        self.title_text_embedding(titles),
    ], axis=1)

In [None]:
for item in train_dataset.take(1):
    print(item)