# Training TFRS

 - [ ] Fix the data 
     - [x] Get a reasonable amount of data, make sure there is overlap in train/test 
     - [ ] Set up a flag so we can use all vs. subset of data depending on CPU/GPU
 - [ ] Set up eval procedure - **Clean this up a bit more**
     - [x] Metrics 
     - [ ] Coverage/Popularity
     - [x] Qualitative evaluation of predictions 
 - [x] Baselines - **Done, just need to clean**
     - [x] Most popular 
     - [x] Domain Knowledge 
     - [x] kNN
 - [ ] TFRS
     - [x] Simple model 
     - [ ] With Context Features
     - [ ] Sequential 
     - [ ] Memory Efficient
 - [ ] Serving 
     - [x] In memory 
     - [ ] TFS
 - [ ] E2E with TFX
 - [ ] Alternatives 
     - [ ] LightFM, Microsoft Recommenders, Transformer recommenders
 - [ ] Clean Notebook
     - [ ] References to Papers / Books
     - [ ] Evaluation notes
     - [ ] Shortcomings/Future work 
    
After adding context features, train a more advanced model on GPU, and then do E2E with TFX.

In [1]:
from typing import Dict, Any, Text

import numpy as np 
import pandas as pd

import tensorflow as tf
import tensorflow_recommenders as tfrs
import tensorflow_data_validation as tfdv

## **Reading in the Data** 

In [51]:
train_df = pd.read_csv('train.csv', dtype={'user_no': str, 'item_no': str})
test_df = pd.read_csv('test.csv', dtype={'user_no': str, 'item_no': str})

# For evaluation
item_info_df = pd.read_csv('item_info.csv', dtype={'item_no': str})

<div class="alert alert-block alert-info">
<b>TODO:</b> Move this preprocessing to the EDA notebook so that this notebook is more streamlined and can
simply read in data that is ready to go.
    
<b>NOTE:</b> We cheat a bit here and build an artificial dataset in which every user is a repeat user, i.e. appears in both train and test.
    
Create **two** versions of the dataset (abbreviated and full) so that we can run on either CPU or GPU.
</div>
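
The two-versions idea in the note above could be sketched as a single helper with an optional cap on the number of users. This is only a sketch: `filter_to_overlap_users`, `USE_FULL_DATA`, and the toy frames are hypothetical names introduced for illustration and are not part of the notebook.

```python
from typing import Optional

import pandas as pd


def filter_to_overlap_users(
    train_df: pd.DataFrame,
    test_df: pd.DataFrame,
    num_users: Optional[int] = None,
):
    """Keep only users present in both splits, optionally only the most active ones.

    num_users=None keeps every overlapping user (the "full" dataset);
    an integer keeps that many of the most active training users (the "abbreviated" one).
    """
    overlap = set(train_df["user_no"]) & set(test_df["user_no"])
    counts = train_df.loc[train_df["user_no"].isin(overlap), "user_no"].value_counts()
    keep = counts.index if num_users is None else counts.index[:num_users]
    return (
        train_df.loc[train_df["user_no"].isin(keep)],
        test_df.loc[test_df["user_no"].isin(keep)],
    )


# Toy data: user "c" appears only in train, user "d" only in test.
toy_train = pd.DataFrame({"user_no": ["a", "a", "a", "b", "b", "c"],
                          "item_no": ["1", "2", "3", "1", "4", "5"]})
toy_test = pd.DataFrame({"user_no": ["a", "b", "d"],
                         "item_no": ["2", "4", "6"]})

USE_FULL_DATA = False  # hypothetical flag: flip to True when a GPU is available
small_train, small_test = filter_to_overlap_users(
    toy_train, toy_test, num_users=None if USE_FULL_DATA else 1
)
full_train, full_test = filter_to_overlap_users(toy_train, toy_test)
```

In the notebook itself the flag could plausibly be derived from `len(tf.config.list_physical_devices("GPU")) > 0` rather than set by hand.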

In [67]:
NUM_USERS = 5000

overlap_users = set(train_df['user_no']) & set(test_df['user_no'].unique())
top_users = train_df[train_df['user_no'].isin(overlap_users)]['user_no'].value_counts()[:NUM_USERS].index

In [169]:
train_df_filtered = train_df.loc[train_df['user_no'].isin(top_users)]
test_df_filtered = test_df.loc[test_df['user_no'].isin(top_users)]
items = train_df_filtered['item_no'].unique()

In [69]:
train_dataset = tf.data.Dataset.from_tensor_slices(dict(train_df_filtered))
test_dataset = tf.data.Dataset.from_tensor_slices(dict(test_df_filtered))

items_dataset = tf.data.Dataset.from_tensor_slices(items)

In [70]:
for item in items_dataset.take(3):
    print(item)

tf.Tensor(b'2561421211445868078', shape=(), dtype=string)
tf.Tensor(b'-5587843449775984456', shape=(), dtype=string)
tf.Tensor(b'-6916770089740843404', shape=(), dtype=string)


In [71]:
for elem in train_dataset.take(3):
    print(elem)

{'user_no': <tf.Tensor: shape=(), dtype=string, numpy=b'-6613028768649161262'>, 'item_no': <tf.Tensor: shape=(), dtype=string, numpy=b'2561421211445868078'>, 'gender_description': <tf.Tensor: shape=(), dtype=string, numpy=b'unisex'>, 'brand': <tf.Tensor: shape=(), dtype=string, numpy=b'stoy'>, 'product_group': <tf.Tensor: shape=(), dtype=string, numpy=b'role play'>}
{'user_no': <tf.Tensor: shape=(), dtype=string, numpy=b'-6613028768649161262'>, 'item_no': <tf.Tensor: shape=(), dtype=string, numpy=b'-5587843449775984456'>, 'gender_description': <tf.Tensor: shape=(), dtype=string, numpy=b'unisex'>, 'brand': <tf.Tensor: shape=(), dtype=string, numpy=b'stoy'>, 'product_group': <tf.Tensor: shape=(), dtype=string, numpy=b'role play'>}
{'user_no': <tf.Tensor: shape=(), dtype=string, numpy=b'-6613028768649161262'>, 'item_no': <tf.Tensor: shape=(), dtype=string, numpy=b'-6916770089740843404'>, 'gender_description': <tf.Tensor: shape=(), dtype=string, numpy=b'unisex'>, 'brand': <tf.Tensor: shape

In [72]:
unique_users = train_df_filtered['user_no'].unique()
unique_items_training = set(train_df_filtered['item_no'])
unique_items_test = set(test_df_filtered['item_no'])

print(len(unique_users))
print(len(unique_items_training))
print(len(unique_items_test))
print(len(unique_items_test - unique_items_training))

5000
25266
6444
2434


## Creating the Model

In [9]:
EMBEDDING_DIM = 32
NUM_OOV_INDICES = 1

user_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(
      vocabulary=unique_users, 
      num_oov_indices=NUM_OOV_INDICES),
  tf.keras.layers.Embedding(len(unique_users) + NUM_OOV_INDICES, EMBEDDING_DIM)
])

item_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(
      vocabulary=items, 
      num_oov_indices=NUM_OOV_INDICES),
  tf.keras.layers.Embedding(len(items) + NUM_OOV_INDICES, EMBEDDING_DIM)
])

metrics = tfrs.metrics.FactorizedTopK(
  candidates=items_dataset.batch(128).map(item_model)
)

task = tfrs.tasks.Retrieval(
  metrics=metrics
)

In [10]:
class SimpleTFRSModel(tfrs.Model):

    def __init__(self, user_model, item_model, task):
        super().__init__()
        self.user_model: tf.keras.Model = user_model
        self.item_model: tf.keras.Model = item_model
        self.task: tf.keras.layers.Layer = task

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        # We pick out the user features and pass them into the user model.
        user_embeddings = self.user_model(features["user_no"])
        # And pick out the item features and pass them into the item model,
        # getting embeddings back.
        positive_item_embeddings = self.item_model(features["item_no"])

        # The task computes the loss and the metrics.
        return self.task(user_embeddings, positive_item_embeddings)

---
---

<div class="alert alert-block alert-warning">
<b>The above is just a convenience!</b> The following class is a simplified version of what
is actually going on under-the-hood:

```python 
class NonTFRSModel(tf.keras.Model):
    def __init__(self, user_model, item_model, metrics):
        """
        Note that we don't pass in the task! That's because we define 
        what it is here.
        """
        super().__init__()
        self.user_model = user_model 
        self.item_model = item_model 
        # When we perform retrieval, the default loss is actually just good 
        # old CategoricalCrossentropy :) 
        self._loss = tf.keras.losses.CategoricalCrossentropy(
            from_logits=True, reduction=tf.keras.losses.Reduction.SUM
        )
        self._factorized_metrics = metrics

    def calc_loss(self, query_embeddings, candidate_embeddings): 
        scores = tf.linalg.matmul(
            query_embeddings, 
            candidate_embeddings, 
            transpose_b=True
        )
        num_queries, num_candidates = scores.shape
        labels = tf.eye(num_queries, num_candidates)
        loss = self._loss(y_true=labels, y_pred=scores)
        self._factorized_metrics.update_state(
            query_embeddings, 
            candidate_embeddings
        )
        return loss
    

    def train_step(self, features: Dict[Text, tf.Tensor]) -> Dict[Text, tf.Tensor]:
        with tf.GradientTape() as tape: 
            user_embeddings = self.user_model(features['user_no'])
            positive_item_embeddings = self.item_model(features['item_no'])
            loss = self.calc_loss(user_embeddings, positive_item_embeddings)

        gradients = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))

        metrics = {metric.name: metric.result() for metric in self.metrics}
        return metrics 

    def test_step(self, features: Dict[Text, tf.Tensor]) -> Dict[Text, tf.Tensor]: 
        user_embeddings = self.user_model(features['user_no'])
        positive_item_embeddings = self.item_model(features['item_no'])

        # calc_loss also updates the top-k metrics; the loss value itself is not reported here.
        loss = self.calc_loss(user_embeddings, positive_item_embeddings)

        metrics = {metric.name: metric.result() for metric in self.metrics}
        return metrics 
```

We can then instantiate and compile a model like so: 

```python 
simple_model = NonTFRSModel(user_model, item_model, metrics)
# Need to specify run_eagerly=True because we need the shape of the scores 
# in the calc_loss function
simple_model.compile(optimizer=tf.keras.optimizers.Adam(), run_eagerly=True)
```

After that we can just train the model the same as below :)

</div>
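
To make the loss in the note above concrete, here is a framework-free NumPy sketch of the same idea: score every user in the batch against every item in the batch, treat the diagonal as the positives, and sum the cross-entropy. `in_batch_softmax_loss` is a hypothetical helper written for this illustration, not a TFRS function.

```python
import numpy as np


def in_batch_softmax_loss(query_embeddings: np.ndarray,
                          candidate_embeddings: np.ndarray) -> float:
    """Summed categorical cross-entropy where row i's positive is candidate i."""
    # (batch, dim) @ (dim, batch) -> (batch, batch) affinity scores.
    scores = query_embeddings @ candidate_embeddings.T
    # Numerically stable log-softmax over each row.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Identity labels pick out the diagonal: the true (user, item) pairs.
    return float(-np.trace(log_probs))


# All-zero embeddings give uniform scores, so each row contributes log(batch_size).
uniform_loss = in_batch_softmax_loss(np.zeros((4, 8)), np.zeros((4, 8)))
# If each user embedding points strongly at its own item, the loss shrinks towards 0.
aligned_loss = in_batch_softmax_loss(10.0 * np.eye(4), np.eye(4))
```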
---
---

In [11]:
model = SimpleTFRSModel(user_model, item_model, task)
model.compile(optimizer=tf.keras.optimizers.Adam())

In [12]:
train_dataset_interactions = train_dataset.map(lambda x: {
    'user_no': x['user_no'],
    'item_no': x['item_no']
})
test_dataset_interactions = test_dataset.map(lambda x: {
    'user_no': x['user_no'],
    'item_no': x['item_no']
})

cached_train = train_dataset_interactions.shuffle(1_000).batch(1024).cache()
cached_test = test_dataset_interactions.batch(512).cache()

In [122]:
history = model.fit(cached_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Evaluation

In [123]:
results = model.evaluate(cached_test, return_dict=True)



## Serving

In [124]:
# Create a model that takes in raw query features and recommends
# items out of the entire items dataset.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index_from_dataset(
  tf.data.Dataset.zip((items_dataset.batch(100), items_dataset.batch(100).map(model.item_model)))
)

<tensorflow_recommenders.layers.factorized_top_k.BruteForce at 0x14991ad10>

In [127]:
random_user = np.random.choice(train_df_filtered['user_no'].unique())
train_df_filtered.loc[train_df_filtered['user_no'] == random_user]

Unnamed: 0,user_no,item_no,gender_description,brand,product_group
377129,1145355627971110554,-7865088438347131541,girls,wheat,bottoms
377130,1145355627971110554,-4157873162967126783,unisex,kuling,headwear
377131,1145355627971110554,-2142740165482218263,unisex,kuling,headwear
377132,1145355627971110554,3748431471949385807,unisex,kuling,swimwear and coverups
377133,1145355627971110554,-2860174162663871712,unisex,kuling,headwear
377134,1145355627971110554,-2195934864809708124,unisex,wheat,fleeces and midlayers
377135,1145355627971110554,1513658394069720986,unisex,wheat,fleeces and midlayers
377136,1145355627971110554,-6850173515075499791,unisex,wheat,clothing sets
377137,1145355627971110554,-2195934864809708124,unisex,wheat,fleeces and midlayers
377138,1145355627971110554,1513658394069720986,unisex,wheat,fleeces and midlayers


In [129]:
%%time
# Get recommendations.
_, titles = index(tf.constant([random_user]))

CPU times: user 4.58 ms, sys: 6.71 ms, total: 11.3 ms
Wall time: 21 ms


In [132]:
%%time
items_to_exclude = train_df_filtered.loc[train_df_filtered['user_no'] == random_user]['item_no'].unique()
_, titles = index.query_with_exclusions(tf.constant([random_user]), 
                                       tf.constant([items_to_exclude]))

CPU times: user 4.08 ms, sys: 704 µs, total: 4.79 ms
Wall time: 3.78 ms
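
Conceptually, querying with exclusions is brute-force scoring followed by a mask. The NumPy sketch below shows that idea on toy data; `top_k_with_exclusions` and the toy ids are hypothetical, and this is not how the TFRS layer is actually implemented.

```python
import numpy as np


def top_k_with_exclusions(query, candidate_embeddings, candidate_ids, exclusions, k):
    """Brute-force top-k by dot product, skipping any id listed in `exclusions`."""
    scores = candidate_embeddings @ query
    # Excluded candidates get -inf so they can never reach the top k.
    mask = np.array([cid in exclusions for cid in candidate_ids])
    scores = np.where(mask, -np.inf, scores)
    order = np.argsort(-scores)[:k]
    return scores[order], candidate_ids[order]


candidate_ids = np.array(["boots", "sandals", "coat", "hat"])
candidate_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.5, 0.5]])
query = np.array([1.0, 0.0])

_, plain = top_k_with_exclusions(query, candidate_embeddings, candidate_ids, set(), k=2)
_, filtered = top_k_with_exclusions(query, candidate_embeddings, candidate_ids, {"boots"}, k=2)
```

Without exclusions the already-purchased item can take the top slot; with it excluded, the next-best candidates move up.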


In [133]:
recommendations = [item.numpy().decode() for item in titles[0]]
item_info_df.loc[item_info_df['item_no'].isin(recommendations)]

Unnamed: 0,item_no,colour,gender_description,brand,product_group,min_age,max_age
8723,-6452537443298138438,cream,unisex,bobo choses,all in ones,1.0,11.0
16614,-5461181132081057096,blue,girls,burberry,dresses,2.0,14.0
21874,-873465860918484678,navy,unisex,kuling,sandals,0.875,6.0
39166,-8429863690086218988,pink,girls,adidas,trainers,4.0,10.0
41496,8659013735764980519,grey,unisex,bugaboo,stroller parts and customisati,,
42154,7480282260445719099,black,boys,nike,trainers,0.375,5.0
52105,1493209376961654965,navy,boys,didriksons,coats and jackets,1.0,9.0
54505,-6447463798668859639,purple,unisex,buddy & hope,stroller accessories,,
56913,-7634805924562764179,black,unisex,reima,trainers,0.875,5.0
58036,-5678741866268285557,blue,unisex,bobo choses,swimwear and coverups,1.0,11.0


---
---
---

## **Baselines**

### **Top Items**

**Let's find the top 100 items in the training dataset and always predict them, in the same order, for every interaction in the test dataset.**

In [73]:
NUM_TOP_ITEMS = 100
top_items = train_df_filtered['item_no'].value_counts()[:NUM_TOP_ITEMS].index

In [77]:
top_items_in_test_dataset = test_df_filtered.loc[test_df_filtered['item_no'].isin(top_items)]

print(len(top_items_in_test_dataset))
print(len(test_df_filtered['item_no'].unique()))
print(len(test_df_filtered))

721
6444
12447


In [78]:
ks = (1, 5, 10, 50, 100)
metrics = [tf.keras.metrics.Mean() for k in ks]

In [79]:
true_candidates = tf.expand_dims(tf.constant(test_df_filtered['item_no'].values), 1)

In [80]:
retrieved_candidates = tf.expand_dims(top_items, 1)
retrieved_candidates = tf.transpose(tf.repeat(retrieved_candidates, tf.constant(true_candidates.shape[0]), axis=1))

In [81]:
ids_match = tf.cast(tf.math.equal(true_candidates, retrieved_candidates), tf.float32)

In [82]:
ids_match

<tf.Tensor: shape=(12447, 100), dtype=float32, numpy=
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)>

In [83]:
for k, metric in zip(ks, metrics):
    # By slicing until :k we assume scores are sorted.
    # Clip to only count multiple matches once.
    match_found = tf.clip_by_value(
        tf.reduce_sum(ids_match[:, :k], axis=1, keepdims=True),
        0.0, 1.0
    )
    metric.update_state(match_found)

In [84]:
for metric in metrics:
    print(metric.result())

tf.Tensor(0.0042580543, shape=(), dtype=float32)
tf.Tensor(0.006025548, shape=(), dtype=float32)
tf.Tensor(0.007953724, shape=(), dtype=float32)
tf.Tensor(0.030609785, shape=(), dtype=float32)
tf.Tensor(0.057925604, shape=(), dtype=float32)
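
The metric computed above is a top-k hit rate: for each test interaction, does the true item appear among the first k recommended items? A small NumPy version on toy data may make the broadcasting easier to follow; `hit_rate_at_k` and the toy arrays are made up for this illustration.

```python
import numpy as np


def hit_rate_at_k(true_items: np.ndarray, retrieved: np.ndarray, k: int) -> float:
    """Fraction of rows whose true item appears among the first k retrieved items."""
    # (n, 1) == (n, m) broadcasts to an (n, m) boolean match matrix.
    matches = true_items[:, None] == retrieved
    return float(matches[:, :k].any(axis=1).mean())


# The popularity baseline recommends the same ranked list to everyone.
top_items = np.array(["A", "B", "C"])
true_items = np.array(["B", "Z", "A", "C"])  # one test interaction per row
retrieved = np.tile(top_items, (len(true_items), 1))

rates = {k: hit_rate_at_k(true_items, retrieved, k) for k in (1, 2, 3)}
```

Item "Z" never appears in the recommended list, so the hit rate tops out at 0.75 no matter how large k gets. The same ceiling applies to test items that are missing from the candidate list in the real data.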


### **Top Items Domain Knowledge**

Since the test data is from November, let's keep only the product groups we expect to sell at that time of year and exclude the rest (e.g. shorts, sandals, swimwear).

In [85]:
item_info_df.loc[item_info_df['item_no'].isin(top_items)]['product_group'].unique()

array(['bottoms', 'coats and jackets', 'jumpers and knitwear', 'shorts',
       'coveralls', 'trainers', 'dresses', 'tops', 'boots',
       'clothing sets', 'strollers', 'stroller accessories', 'furniture',
       'fleeces and midlayers', 'sandals', 'carriers and slings',
       'gloves and mittens', 'role play', 'swimwear and coverups',
       'stationary'], dtype=object)

In [86]:
GROUPS_TO_INCLUDE = ['jumpers and knitwear', 'coveralls', 'boots', 'coats and jackets', 'stroller accessories', 
                      'fleeces and midlayers', 'winter sets', 'gloves and mittens', 'headwear']

items_to_consider = item_info_df.loc[item_info_df['product_group'].isin(GROUPS_TO_INCLUDE)]['item_no']

In [87]:
top_items_filtered = train_df_filtered[
    train_df_filtered['item_no'].isin(items_to_consider)]['item_no'].value_counts()[:NUM_TOP_ITEMS].index

In [88]:
len(set(top_items_filtered) - set(top_items))

53

In [89]:
retrieved_candidates = tf.expand_dims(top_items_filtered, 1)
retrieved_candidates = tf.transpose(tf.repeat(retrieved_candidates, tf.constant(true_candidates.shape[0]), axis=1))

In [90]:
ids_match = tf.cast(tf.math.equal(true_candidates, retrieved_candidates), tf.float32)

In [91]:
metrics = [tf.keras.metrics.Mean() for k in ks]
for k, metric in zip(ks, metrics):
    # By slicing until :k we assume scores are sorted.
    # Clip to only count multiple matches once.
    match_found = tf.clip_by_value(
        tf.reduce_sum(ids_match[:, :k], axis=1, keepdims=True),
        0.0, 1.0
    )
    metric.update_state(match_found)

In [92]:
for metric in metrics:
    print(metric.result())

tf.Tensor(0.0042580543, shape=(), dtype=float32)
tf.Tensor(0.00867679, shape=(), dtype=float32)
tf.Tensor(0.010845987, shape=(), dtype=float32)
tf.Tensor(0.042741224, shape=(), dtype=float32)
tf.Tensor(0.06467422, shape=(), dtype=float32)


## Content-Based

In [171]:
top_brands = train_df_filtered['brand'].value_counts()[:100].index
top_groups = train_df_filtered['product_group'].value_counts()[:50].index
train_df_filtered.loc[:, 'brand'] = train_df_filtered['brand'].apply(lambda x: x if x in top_brands else 'niche_brand')
train_df_filtered.loc[:, 'product_group'] = train_df_filtered['product_group'].apply(lambda x: x if x in top_groups else 'niche_group')

In [172]:
train_df_filtered

Unnamed: 0,user_no,item_no,gender_description,brand,product_group
10,-6613028768649161262,2561421211445868078,unisex,stoy,role play
11,-6613028768649161262,-5587843449775984456,unisex,stoy,role play
12,-6613028768649161262,-6916770089740843404,unisex,stoy,role play
13,-6613028768649161262,-8288550518819679828,unisex,stoy,role play
14,-6613028768649161262,-3646011484357966884,unisex,bugaboo,stroller accessories
...,...,...,...,...,...
578584,-3683116124016444198,-971688822500808947,boys,niche_brand,swimwear and coverups
578585,-3683116124016444198,-1012351534660867109,unisex,garbo&friends,bedding
578586,-3683116124016444198,2879887491631046190,unisex,kuling,coats and jackets
578587,-3683116124016444198,7875147516452490830,unisex,kuling,coveralls


In [173]:
train_df_one_hot = pd.get_dummies(train_df_filtered[['user_no', 'gender_description', 'brand', 'product_group']], 
                                  columns=['gender_description', 'brand', 'product_group'])
train_df_one_hot

Unnamed: 0,user_no,gender_description_boys,gender_description_girls,gender_description_unisex,brand_a happy brand,brand_adidas,brand_babybjörn,brand_babyzen,brand_beau loves,brand_bergans,...,product_group_stroller parts and customisati,product_group_strollers,product_group_swimwear and coverups,product_group_tableware,product_group_textile,product_group_tops,product_group_trainers,product_group_underwear,product_group_vehicles,product_group_water toys
10,-6613028768649161262,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11,-6613028768649161262,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12,-6613028768649161262,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
13,-6613028768649161262,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14,-6613028768649161262,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
578584,-3683116124016444198,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
578585,-3683116124016444198,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
578586,-3683116124016444198,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
578587,-3683116124016444198,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [174]:
user_embeddings = train_df_one_hot.groupby('user_no').agg('mean')

user_embeddings

Unnamed: 0_level_0,gender_description_boys,gender_description_girls,gender_description_unisex,brand_a happy brand,brand_adidas,brand_babybjörn,brand_babyzen,brand_beau loves,brand_bergans,brand_besafe,...,product_group_stroller parts and customisati,product_group_strollers,product_group_swimwear and coverups,product_group_tableware,product_group_textile,product_group_tops,product_group_trainers,product_group_underwear,product_group_vehicles,product_group_water toys
user_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
-1001697075369787517,0.388889,0.500000,0.111111,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.666667,0.000000,0.0,0.0
-1004190764919556160,0.000000,0.000000,1.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.111111,0.333333,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0
-1005108460398818827,0.000000,0.529412,0.470588,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.058824,0.058824,0.0,0.117647,0.000000,0.058824,0.0,0.0
-1006521943957043595,0.000000,0.055556,0.944444,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.666667,0.000000,0.000000,0.0,0.0
-101493426712742714,0.111111,0.722222,0.166667,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.055556,0.000000,0.111111,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98933627682977975,0.000000,0.000000,1.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.000000,0.157895,0.0,0.157895,0.000000,0.000000,0.0,0.0
989454361282535063,0.000000,0.250000,0.750000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.062500,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0
991464767946384812,0.058824,0.176471,0.764706,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.117647,0.000000,0.0,0.058824,0.000000,0.000000,0.0,0.0
992880867697223146,0.000000,0.000000,1.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0


In [175]:
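# Row 0 is the all-zero vector for out-of-vocabulary users (StringLookup index 0).
user_embeddings_matrix = np.concatenate((np.zeros((1, user_embeddings.shape[1])), user_embeddings.values))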

In [176]:
user_embedding_layer = tf.keras.layers.Embedding(*user_embeddings_matrix.shape, 
                                                 embeddings_initializer=tf.keras.initializers.Constant(user_embeddings_matrix),
                                                 trainable=False)

In [177]:
user_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(
      vocabulary=user_embeddings.index, 
      num_oov_indices=NUM_OOV_INDICES),
  user_embedding_layer
])
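
The extra leading row matters because, with one OOV index and no mask token, `StringLookup` reserves index 0 for unknown strings and numbers the vocabulary from 1. The plain-Python sketch below mimics that alignment with a hypothetical three-user vocabulary. It illustrates the indexing only and does not reproduce the Keras layers.

```python
import numpy as np

vocabulary = ["u1", "u2", "u3"]
# Hand-built feature rows, one per vocabulary entry, in the same order.
features = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])

# Index 0 is reserved for out-of-vocabulary ids, so known ids start at 1 ...
token_to_index = {token: i + 1 for i, token in enumerate(vocabulary)}
# ... and the table needs an extra leading row to stay aligned with them.
table = np.concatenate((np.zeros((1, features.shape[1])), features))


def embed(user_id: str) -> np.ndarray:
    """Mimics lookup-then-embed: unknown ids map to row 0, the all-zero vector."""
    return table[token_to_index.get(user_id, 0)]
```

An unknown user therefore gets a zero profile, which scores 0 against every item under a dot product.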

In [184]:
item_info_df.loc[:, 'brand'] = item_info_df['brand'].apply(lambda x: x if x in top_brands else 'niche_brand')
item_info_df.loc[:, 'product_group'] = item_info_df['product_group'].apply(lambda x: x if x in top_groups else 'niche_group')
item_embeddings = pd.get_dummies(item_info_df[['gender_description', 'brand', 'product_group']], 
                                 columns=['gender_description', 'brand', 'product_group'])

In [185]:
item_embeddings

Unnamed: 0,gender_description_boys,gender_description_girls,gender_description_unisex,brand_a happy brand,brand_adidas,brand_babybjörn,brand_babyzen,brand_beau loves,brand_bergans,brand_besafe,...,product_group_stroller parts and customisati,product_group_strollers,product_group_swimwear and coverups,product_group_tableware,product_group_textile,product_group_tops,product_group_trainers,product_group_underwear,product_group_vehicles,product_group_water toys
0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61699,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
61700,0,0,1,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
61701,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
61702,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [186]:
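# Row 0 is the all-zero vector for out-of-vocabulary items (StringLookup index 0).
item_embeddings_matrix = np.concatenate((np.zeros((1, item_embeddings.shape[1])), item_embeddings.values))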

item_embedding_layer = tf.keras.layers.Embedding(*item_embeddings_matrix.shape, 
                                                 embeddings_initializer=tf.keras.initializers.Constant(item_embeddings_matrix),
                                                 trainable=False)

In [187]:
item_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(
      vocabulary=item_info_df['item_no'], 
      num_oov_indices=NUM_OOV_INDICES),
  item_embedding_layer
])

In [189]:
item_model('206890150141030846')

Consider rewriting this model with the Functional API.


<tf.Tensor: shape=(154,), dtype=float32, numpy=
array([0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0.], dtype=float32)>

In [191]:
items_dataset = tf.data.Dataset.from_tensor_slices(item_info_df['item_no'])

In [193]:
# Create a model that takes in raw query features and recommends
# items out of the entire items dataset.
index = tfrs.layers.factorized_top_k.BruteForce(user_model)
index.index_from_dataset(
  tf.data.Dataset.zip((items_dataset.batch(100), items_dataset.batch(100).map(item_model)))
)

<tensorflow_recommenders.layers.factorized_top_k.BruteForce at 0x14f7e9750>

In [194]:
random_user = np.random.choice(train_df_filtered['user_no'].unique())
train_df_filtered.loc[train_df_filtered['user_no'] == random_user]

Unnamed: 0,user_no,item_no,gender_description,brand,product_group
317520,1912488871163207781,8168121107461146825,unisex,kuling,clothing sets
317521,1912488871163207781,-2786258447471871904,unisex,kuling,boots
317522,1912488871163207781,4142893553733654846,unisex,bugaboo,car seats
317523,1912488871163207781,6028803586253333036,unisex,niche_brand,car seats
317524,1912488871163207781,8168121107461146825,unisex,kuling,clothing sets
317525,1912488871163207781,-2786258447471871904,unisex,kuling,boots
317526,1912488871163207781,-7613506151564862065,unisex,niche_brand,strollers
317527,1912488871163207781,4142893553733654846,unisex,bugaboo,car seats
317528,1912488871163207781,4142893553733654846,unisex,bugaboo,car seats
317529,1912488871163207781,522343477169152754,unisex,bugaboo,car seats


In [196]:
%%time
items_to_exclude = train_df_filtered.loc[train_df_filtered['user_no'] == random_user]['item_no'].unique()
_, titles = index.query_with_exclusions(tf.constant([random_user]), 
                                       tf.constant([items_to_exclude]))

CPU times: user 14.5 ms, sys: 2.41 ms, total: 16.9 ms
Wall time: 19.8 ms


In [197]:
recommendations = [item.numpy().decode() for item in titles[0]]
item_info_df.loc[item_info_df['item_no'].isin(recommendations)]

Unnamed: 0,item_no,colour,gender_description,brand,product_group,min_age,max_age,product_gorup
73,5448066451714887268,black,unisex,kuling,boots,1.0,10.0,boots
660,-8460217038453222213,brown,unisex,kuling,boots,0.875,10.0,boots
1054,-1418496272304003044,yellow,unisex,kuling,boots,0.875,10.0,boots
2016,1591483067575855474,red,unisex,kuling,boots,0.875,10.0,boots
2084,-2441044268599888379,green,unisex,kuling,boots,0.125,2.0,boots
2206,8738811477785577407,pink,unisex,kuling,boots,0.875,10.0,boots
2272,-1434358603626196733,grey,unisex,kuling,boots,0.125,2.0,boots
3170,4480433044825856994,cream,unisex,kuling,boots,0.875,10.0,boots
3390,5339705274970828498,black,unisex,kuling,boots,0.125,2.0,boots
3670,8942458444938394870,green,unisex,kuling,boots,0.875,10.0,boots


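**The recommendations are all kuling boots, matching this user's earlier purchases, so this model seems to 'memorize' users' tastes more than the learned embeddings above.**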
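
A small worked example may explain this behaviour, assuming the index ranks items by the dot product between the user profile and the item's one-hot vector. The score is then the sum of the shares of the user's history that match each of the item's attributes, so the most frequent brand and product group win. The four feature columns and the toy items below are made up for illustration.

```python
import numpy as np

# Columns: [brand_kuling, brand_bugaboo, group_boots, group_car_seats]
history = np.array([
    [1, 0, 1, 0],  # kuling boots
    [1, 0, 1, 0],  # kuling boots
    [0, 1, 0, 1],  # bugaboo car seat
])
# The user profile is the mean of the one-hot rows in the purchase history.
user_profile = history.mean(axis=0)

candidates = {
    "kuling boots": np.array([1, 0, 1, 0]),
    "kuling car seat": np.array([1, 0, 0, 1]),
    "bugaboo car seat": np.array([0, 1, 0, 1]),
}
scores = {name: float(vec @ user_profile) for name, vec in candidates.items()}
best = max(scores, key=scores.get)
```

Every item sharing the top brand and group gets the same maximal score, which is consistent with a result list made up of near-identical items in different colours.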

In [200]:
test_users_dataset = tf.data.Dataset.from_tensor_slices(test_df_filtered['user_no'])

In [205]:
_, retrieved_items = index(test_df_filtered['user_no'], k=100)

6         -6613028768649161262
7         -6613028768649161262
8         -6613028768649161262
38        -2029740236817510102
                  ...         
432066    -6153155530715126273
432338    -3695442683323654294
432459    -3683116124016444198
432460    -3683116124016444198
432461    -3683116124016444198
Name: user_no, Length: 12447, dtype: object
Consider rewriting this model with the Functional API.


In [211]:
ids_match = tf.cast(tf.math.equal(true_candidates, retrieved_items), tf.float32)

In [212]:
metrics = [tf.keras.metrics.Mean() for k in ks]
for k, metric in zip(ks, metrics):
    # By slicing until :k we assume scores are sorted.
    # Clip to only count multiple matches once.
    match_found = tf.clip_by_value(
        tf.reduce_sum(ids_match[:, :k], axis=1, keepdims=True),
        0.0, 1.0
    )
    metric.update_state(match_found)

In [213]:
for metric in metrics:
    print(metric.result())

tf.Tensor(0.0012854503, shape=(), dtype=float32)
tf.Tensor(0.005704186, shape=(), dtype=float32)
tf.Tensor(0.010042581, shape=(), dtype=float32)
tf.Tensor(0.038161807, shape=(), dtype=float32)
tf.Tensor(0.064352855, shape=(), dtype=float32)


---

## Context Features