<a href="https://colab.research.google.com/github/mrudulaacharya/RecommenderSystem/blob/main/Recommender_System_ASOS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Festival x ASOS
## Build and Deploy a Recommender System in 3 Hours.

# Imports

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import os

# Import training data

In [2]:
train = pd.read_parquet("https://raw.githubusercontent.com/ASOS/dsf2020/main/dsf_asos_train_with_alphanumeric_dummy_ids.parquet")
valid = pd.read_parquet("https://raw.githubusercontent.com/ASOS/dsf2020/main/dsf_asos_valid_with_alphanumeric_dummy_ids.parquet")
dummy_users = pd.read_csv("https://raw.githubusercontent.com/ASOS/dsf2020/main/dsf_asos_dummy_users_with_alphanumeric_dummy_ids.csv", header=None).values.flatten().astype(str)
products = pd.read_csv("https://raw.githubusercontent.com/ASOS/dsf2020/main/dsf_asos_productIds.csv", header=None).values.flatten().astype(int)

# The briefest intro to TensorFlow

Tensors

In [3]:
x=tf.constant([1,2,3,4])
tf.math.square(x)

<tf.Tensor: shape=(4,), dtype=int32, numpy=array([ 1,  4,  9, 16], dtype=int32)>

In [4]:
tf.constant([[1,2,3],[4,5,6]], dtype=tf.float32)

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)>

In [5]:
x=tf.Variable([1,2,3,4,5],dtype=tf.float32)
x

<tf.Variable 'Variable:0' shape=(5,) dtype=float32, numpy=array([1., 2., 3., 4., 5.], dtype=float32)>

Gradients

In [6]:
with tf.GradientTape() as tape:
  y=tf.math.square(x)

In [7]:
y

<tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 1.,  4.,  9., 16., 25.], dtype=float32)>

In [8]:
dy_dx=tape.gradient(y,x)
dy_dx

<tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 2.,  4.,  6.,  8., 10.], dtype=float32)>
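The gradient returned by the tape is the analytic derivative of x² with respect to x, which is 2x. As a sanity check, here is a minimal sketch in plain Python (no TensorFlow) that estimates the same derivative with a central finite difference. The helper names `square` and `numerical_gradient` are illustrative only and do not come from the notebook.

```python
# Central finite-difference check of d(x^2)/dx = 2x (plain Python, no TensorFlow).
def square(v):
    return v * v


def numerical_gradient(f, v, eps=1e-3):
    # Symmetric difference quotient: (f(v + eps) - f(v - eps)) / (2 * eps)
    return (f(v + eps) - f(v - eps)) / (2 * eps)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # same values as the tf.Variable above
grads = [numerical_gradient(square, v) for v in xs]
print([round(g, 6) for g in grads])
```

The estimates should agree with the `dy_dx` tensor above up to floating-point error.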

Multiply and add tensors

In [9]:
x = tf.constant([[1,2,3]], dtype=tf.float32)
Y = tf.constant([[1,2,3, 4], [1,2,3,4], [1,2,3,4]], dtype=tf.float32)

In [10]:
tf.matmul(x,Y)

<tf.Tensor: shape=(1, 4), dtype=float32, numpy=array([[ 6., 12., 18., 24.]], dtype=float32)>

In [11]:
z = tf.constant([10, 11, 12, 13], dtype=tf.float32)

In [12]:
tf.matmul(x,Y)+z

<tf.Tensor: shape=(1, 4), dtype=float32, numpy=array([[16., 23., 30., 37.]], dtype=float32)>

This operation is very common in deep learning, so it has been abstracted:

In [13]:
dl1=tf.keras.layers.Dense(4,use_bias=True,weights=[Y,z])
dl1

<keras.src.layers.core.dense.Dense at 0x7a6b511045e0>

You can also choose to apply a function (an activation) to each value in the output:

In [14]:
dl2=tf.keras.layers.Dense(4,use_bias=True,weights=[Y,z],activation=lambda x: x+1)
dl2(x)

<tf.Tensor: shape=(1, 4), dtype=float32, numpy=array([[17., 24., 31., 38.]], dtype=float32)>
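To make the abstraction concrete, the following is a rough sketch of what a dense layer computes: a matrix product, plus a bias, followed by an optional activation. It is written in plain Python rather than TensorFlow so that the arithmetic is visible. The function name `dense` and its arguments are assumptions for illustration, not the Keras implementation.

```python
def dense(x_row, kernel, bias, activation=None):
    # One output per kernel column: dot(x_row, column) + bias entry.
    out = [sum(xi * w for xi, w in zip(x_row, col)) + b
           for col, b in zip(zip(*kernel), bias)]
    # Apply the activation element-wise if one was given.
    return [activation(v) for v in out] if activation else out


x_row = [1.0, 2.0, 3.0]                 # same values as x
kernel = [[1.0, 2.0, 3.0, 4.0]] * 3     # same values as Y
bias = [10.0, 11.0, 12.0, 13.0]         # same values as z

plain = dense(x_row, kernel, bias)
shifted = dense(x_row, kernel, bias, activation=lambda v: v + 1)
print(plain)
print(shifted)
```

With these inputs, `plain` reproduces `tf.matmul(x,Y)+z` and `shifted` reproduces the output of `dl2(x)`.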

We can put different layers together in a sequence:

In [15]:


dl3 = tf.keras.layers.Dense(1, use_bias=False,
                            weights=[tf.constant([[0], [1], [0], [1]],
                                                 dtype=tf.float32)])

In [16]:
x_b=dl2(x)
x_b

<tf.Tensor: shape=(1, 4), dtype=float32, numpy=array([[17., 24., 31., 38.]], dtype=float32)>

In [17]:
dl3(x_b)

<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[62.]], dtype=float32)>

We get more flexibility by subclassing tf.keras.Model:

In [18]:
class simple_model(tf.keras.Model):
  def __init__(self):
    super(simple_model,self).__init__()
    self.dl2=tf.keras.layers.Dense(4,use_bias=True,weights=[Y,z],activation=lambda x: x+1)

    self.dl3 = tf.keras.layers.Dense(1, use_bias=False, \
                             weights=[tf.constant([[0], [1], [0], [1]], \
                                                  dtype=tf.float32)])
  def call(self,x):
    x_b= self.dl2(x)
    return self.dl3(x_b),x_b,x_b+104

In [19]:
sm=simple_model()
sm(x)

(<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[62.]], dtype=float32)>,
 <tf.Tensor: shape=(1, 4), dtype=float32, numpy=array([[17., 24., 31., 38.]], dtype=float32)>,
 <tf.Tensor: shape=(1, 4), dtype=float32, numpy=array([[121., 128., 135., 142.]], dtype=float32)>)

So far we have set the weights of the dense layers ourselves. If we don't set them, the kernel is initialised randomly and the bias starts at zero.

In [20]:
dl6 = tf.keras.layers.Dense(4, use_bias=True)
dl6(x)

<tf.Tensor: shape=(1, 4), dtype=float32, numpy=
array([[-1.9119163 , -1.2725966 , -0.06874073, -1.577482  ]],
      dtype=float32)>

In [21]:
dl6.get_weights()

[array([[ 0.7462485 , -0.52048755, -0.49333078,  0.60282767],
        [-0.898436  ,  0.54611504, -0.3534335 , -0.2771924 ],
        [-0.28709757, -0.6147797 ,  0.37715232, -0.54197496]],
       dtype=float32),
 array([0., 0., 0., 0.], dtype=float32)]

# Define a Recommender Model

An embedding layer stores a vector of numbers, initially random, for each user and each product. Looking up an id returns the corresponding row of the layer's weight matrix.

In [22]:
embed1=tf.keras.layers.Embedding(5,8)

In [23]:
embed1(2)

<tf.Tensor: shape=(8,), dtype=float32, numpy=
array([ 0.04388313,  0.01725307, -0.02241429, -0.02908378, -0.00377575,
        0.00076275,  0.04973755, -0.02922085], dtype=float32)>

In [24]:
embed1.get_weights()

[array([[-0.00629555,  0.00349354,  0.02777816,  0.03024511, -0.04761508,
         -0.03286769,  0.03993586, -0.0126165 ],
        [ 0.01483366, -0.0219169 ,  0.02989353, -0.01681351, -0.03050611,
         -0.03133475,  0.03699403,  0.03459718],
        [ 0.04388313,  0.01725307, -0.02241429, -0.02908378, -0.00377575,
          0.00076275,  0.04973755, -0.02922085],
        [-0.01515958,  0.01866445, -0.02639969, -0.00662656,  0.04458531,
          0.0327865 , -0.01246912,  0.00703238],
        [-0.03415183,  0.04842332, -0.04785762, -0.02507461,  0.04499492,
         -0.00338442,  0.00487215, -0.03473597]], dtype=float32)]
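Comparing the two outputs above, `embed1(2)` is simply row 2 of the weight matrix. A minimal plain-Python sketch of that idea follows. The function `make_embedding_table` is hypothetical, and the ±0.05 range is an assumption chosen to match the magnitudes printed above.

```python
import random


def make_embedding_table(num_ids, dim, seed=0):
    # One row of small random numbers per id (range is an assumption).
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(num_ids)]


table = make_embedding_table(5, 8)  # 5 ids, 8 numbers each, like Embedding(5, 8)
vector = table[2]                   # an embedding lookup is just row indexing
print(vector)
```

Training then adjusts these rows so that the vectors become informative rather than random.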

The score of a product for a user is the dot product of the user embedding and the product embedding.

In [25]:
dummy_users

array(['pmfkU4BNZhmtLgJQwJ7x', 'UDRRwOlzlWVbu7H8YCCi',
       'QHGAef0TI6dhn0wTogvW', ..., 'lcORJ5hemOZc1iGo9z7k',
       '5CqDquDAszqJp27P7AL8', 'SSPNYxJMfuKhoe1dg24m'], dtype='<U20')

In [26]:
products

array([ 8650774,  9306139,  9961521, ..., 12058614, 12058615, 11927550])

In [27]:
dummy_user_embedding=tf.keras.layers.Embedding(len(dummy_users),6)
product_embedding=tf.keras.layers.Embedding(len(products),6)

In [28]:
dummy_user_embedding(1)

<tf.Tensor: shape=(6,), dtype=float32, numpy=
array([-0.02030497, -0.0323224 , -0.02752995, -0.01977856,  0.00534099,
        0.04994592], dtype=float32)>

In [29]:
product_embedding(99)

<tf.Tensor: shape=(6,), dtype=float32, numpy=
array([-0.00860673,  0.04042614,  0.00945725, -0.01182505,  0.04237033,
        0.01999631], dtype=float32)>

In [30]:
tf.tensordot(dummy_user_embedding(1),product_embedding(99),axes=[[0],[0]])

<tf.Tensor: shape=(), dtype=float32, numpy=6.664824e-05>

We can score multiple products at the same time, which is what we need to create a ranking.

In [31]:
example_product=tf.constant([1,77,104,2062])
product_embedding(example_product)

<tf.Tensor: shape=(4, 6), dtype=float32, numpy=
array([[ 0.00130073, -0.00392114,  0.04112608, -0.01473174,  0.02102882,
         0.03601812],
       [ 0.02719109,  0.01521922,  0.03307286, -0.01474122,  0.04461883,
        -0.00448322],
       [ 0.01756961,  0.02070396, -0.04838504,  0.03711827,  0.04838953,
        -0.02082442],
       [-0.04111825,  0.03575439, -0.03130037, -0.02729225, -0.04122292,
         0.0280551 ]], dtype=float32)>

In [32]:
tf.tensordot(dummy_user_embedding(1),product_embedding(example_product),axes=[[0],[1]])

<tf.Tensor: shape=(4,), dtype=float32, numpy=array([ 0.00117078, -0.00164858, -0.00120971,  0.0022618 ], dtype=float32)>
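The scoring and ranking step can be sketched without TensorFlow. The example below scores a few made-up product vectors against one made-up user vector and sorts the products by score. All names and numbers here are illustrative assumptions, not values from the dataset.

```python
def dot(a, b):
    # Dot product of two equal-length vectors.
    return sum(ai * bi for ai, bi in zip(a, b))


def score_products(user_vec, product_vecs):
    # One score per product: user embedding dotted with product embedding.
    return [dot(user_vec, p) for p in product_vecs]


def rank(scores):
    # Product positions ordered from highest to lowest score.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


user_vec = [1.0, 0.0, 2.0]
product_vecs = [[0.5, 1.0, 0.0],
                [1.0, 1.0, 1.0],
                [0.0, 3.0, -1.0]]

scores = score_products(user_vec, product_vecs)
ranking = rank(scores)
print(scores, ranking)
```

Here the second product gets the highest score, so it would be recommended first.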

We can also score multiple users against multiple products at once, which we need to do in order to train quickly on batches.

In [33]:
dummy_users

array(['pmfkU4BNZhmtLgJQwJ7x', 'UDRRwOlzlWVbu7H8YCCi',
       'QHGAef0TI6dhn0wTogvW', ..., 'lcORJ5hemOZc1iGo9z7k',
       '5CqDquDAszqJp27P7AL8', 'SSPNYxJMfuKhoe1dg24m'], dtype='<U20')

In [34]:
products

array([ 8650774,  9306139,  9961521, ..., 12058614, 12058615, 11927550])

But we need to map product ids to embedding ids.

In [35]:
product_table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(tf.constant(products, dtype=tf.int32),
                                        range(len(products))), -1)

In [36]:
product_table.lookup(tf.constant([12058614]))

<tf.Tensor: shape=(1,), dtype=int32, numpy=array([29693], dtype=int32)>
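Conceptually, the hash table is a dictionary from product id to row position, with -1 returned for ids it has never seen. A plain-Python sketch of the same behaviour, using the first three ids from `products` (the names `id_to_index` and `lookup` are illustrative):

```python
product_ids = [8650774, 9306139, 9961521]  # first three ids from `products`

# Map each product id to its position, which doubles as the embedding row.
id_to_index = {pid: i for i, pid in enumerate(product_ids)}


def lookup(pid, default=-1):
    # Unknown ids fall back to the default, like StaticHashTable's default value.
    return id_to_index.get(pid, default)


print(lookup(9306139))  # a known id
print(lookup(123))      # an unknown id
```

An index of -1 is not a valid embedding row, so unknown ids have to be handled before the embedding lookup.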

Let's put the lookup tables and the embeddings together in one model:

In [37]:
class SimpleRecommender(tf.keras.Model):
    def __init__(self, dummy_users, products, length_of_embedding):
        super(SimpleRecommender, self).__init__()
        self.products = tf.constant(products, dtype=tf.int32)
        self.dummy_users = tf.constant(dummy_users, dtype=tf.string)
        self.dummy_user_table = tf.lookup.StaticHashTable(tf.lookup.KeyValueTensorInitializer(self.dummy_users, range(len(dummy_users))), -1)
        self.product_table = tf.lookup.StaticHashTable(tf.lookup.KeyValueTensorInitializer(self.products, range(len(products))), -1)

        self.user_embedding = tf.keras.layers.Embedding(len(dummy_users),length_of_embedding)
        self.product_embedding = tf.keras.layers.Embedding(len(products), length_of_embedding)
        self.dot=tf.keras.layers.Dot(axes=-1)
    def call(self,inputs):
        user=inputs[0]
        products=inputs[1]

        user_embedding_index=self.dummy_user_table.lookup(user)
        product_embedding_index=self.product_table.lookup(products)

        user_embedding_values=self.user_embedding(user_embedding_index)
        product_embedding_values=self.product_embedding(product_embedding_index)
        return tf.squeeze(self.dot([user_embedding_values,product_embedding_values]),1)
    @tf.function
    def call_item_item(self, product):
        product_x = self.product_table.lookup(product)
        pe = tf.expand_dims(self.product_embedding(product_x), 0)

        all_pe = tf.expand_dims(self.product_embedding.embeddings, 0)#note this only works if the layer has been built!
        scores = tf.reshape(self.dot([pe, all_pe]), [-1])

        top_scores, top_indices = tf.math.top_k(scores, k=100)
        top_ids = tf.gather(self.products, top_indices)
        return top_ids, top_scores
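The `call_item_item` method scores one product's embedding against every product embedding and keeps the 100 highest scores. The sketch below shows the same idea in plain Python on a tiny made-up catalogue. The ids, vectors and the helper `top_k_similar` are hypothetical.

```python
import heapq


def top_k_similar(query_index, embeddings, ids, k):
    # Score the query embedding against every embedding (itself included).
    query = embeddings[query_index]
    scores = [sum(q * e for q, e in zip(query, emb)) for emb in embeddings]
    # Keep the positions of the k largest scores, best first.
    best = heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])
    return [ids[i] for i in best], [scores[i] for i in best]


ids = [101, 102, 103, 104]  # hypothetical product ids
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

top_ids, top_scores = top_k_similar(0, embeddings, ids, k=2)
print(top_ids, top_scores)
```

As in the model, the query product itself is among the candidates, so it can appear in its own recommendations.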

In [38]:
dummy_users

array(['pmfkU4BNZhmtLgJQwJ7x', 'UDRRwOlzlWVbu7H8YCCi',
       'QHGAef0TI6dhn0wTogvW', ..., 'lcORJ5hemOZc1iGo9z7k',
       '5CqDquDAszqJp27P7AL8', 'SSPNYxJMfuKhoe1dg24m'], dtype='<U20')

In [39]:
products

array([ 8650774,  9306139,  9961521, ..., 12058614, 12058615, 11927550])

In [40]:
srl=SimpleRecommender(dummy_users,products,15)
srl([tf.constant([['pmfkU4BNZhmtLgJQwJ7x'],['UDRRwOlzlWVbu7H8YCCi']]),
     tf.constant([[8650774,9306139,9961521],[12058614, 12058615, 11927550]])])

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[-0.0008847 , -0.00087931,  0.00066579],
       [-0.00554606, -0.00035141,  0.000104  ]], dtype=float32)>

# Creating a dataset

In [41]:
train

| Unnamed: 0 | dummyUserId | productId |
|---|---|---|
| 0 | b'PIXcm7Ru5KmntCy0yA1K' | 10524048 |
| 1 | b'd0RILFB1hUzNSINMY4Ow' | 9137713 |
| 2 | b'Ebax7lyhnKRm4xeRlWW2' | 5808602 |
| 3 | b'vtigDw2h2vxKt0sJpEeU' | 10548272 |
| 4 | b'r4GfiEaUGxziyjX0PyU6' | 10988173 |
| ... | ... | ... |
| 165037 | b'7Eom5Ancozj01ozGxAMK' | 9071435 |
| 165038 | b'zi9vZETHqSIZK0TM2nZc' | 10413104 |
| 165039 | b'fVCveec9P946asY5wqGm' | 9859881 |
| 165040 | b'VJtfpw602SZHh2qwarK4' | 10809487 |


First create a tf.data.Dataset from the user purchase pairs.

In [42]:
dummy_user_tensor = tf.constant(train[["dummyUserId"]].values, dtype=tf.string)
product_tensor = tf.constant(train[["productId"]].values, dtype=tf.int32)

dataset = tf.data.Dataset.from_tensor_slices((dummy_user_tensor, product_tensor))
for x, y in dataset:
    print(x)
    print(y)
    break

tf.Tensor([b'PIXcm7Ru5KmntCy0yA1K'], shape=(1,), dtype=string)
tf.Tensor([10524048], shape=(1,), dtype=int32)


In [43]:
products

array([ 8650774,  9306139,  9961521, ..., 12058614, 12058615, 11927550])

In [44]:
random_negatives_indexes=tf.random.uniform((7,),minval=0,maxval=len(products),dtype=tf.int32)
random_negatives_indexes

<tf.Tensor: shape=(7,), dtype=int32, numpy=array([ 6281, 24684, 22198,  9655,  3612,  3657, 29249], dtype=int32)>

In [45]:
tf.gather(products,random_negatives_indexes)

<tf.Tensor: shape=(7,), dtype=int64, numpy=
array([10251000, 10203163, 12813179, 11839245,  9060003, 11419441,
       10614982])>

In [46]:
products[5776]

8413619

For each purchase, let's sample a number of random products and treat them as products the user did not purchase. The model then scores every candidate, and we know we are doing a good job if the highest score goes to the product that the user actually purchased.

We can do this using dataset.map, labelling each example with a one-hot vector whose first position marks the purchased product:

In [47]:
tf.one_hot(0,depth=11)

<tf.Tensor: shape=(11,), dtype=float32, numpy=array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)>

In [48]:
class Mapper():

    def __init__(self, possible_products, num_negative_products):
        self.num_possible_products = len(possible_products)
        self.possible_products_tensor = tf.constant(possible_products, dtype=tf.int32)

        self.num_negative_products = num_negative_products
        self.y=tf.one_hot(0,num_negative_products+1)
    def __call__(self, user, product):
        random_negatives_indexes=tf.random.uniform((self.num_negative_products,),minval=0,maxval=self.num_possible_products,dtype=tf.int32)
        negatives=tf.gather(self.possible_products_tensor,random_negatives_indexes)
        candidates=tf.concat([product,negatives],axis=0)
        return (user, candidates),self.y

In [49]:
dataset = tf.data.Dataset.from_tensor_slices((dummy_user_tensor, product_tensor)).map(Mapper(products,10))
for (u,c),y in dataset:
  print(u)
  print(c)
  print(y)
  break

tf.Tensor([b'PIXcm7Ru5KmntCy0yA1K'], shape=(1,), dtype=string)
tf.Tensor(
[10524048 11808358 10336332 10578070 12807935 10405852 11988828 12049009
 10309923 11539026  9991302], shape=(11,), dtype=int32)
tf.Tensor([1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(11,), dtype=float32)
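For readers who prefer to see the sampling step without TensorFlow, here is a plain-Python sketch of what the mapper does for a single purchase. The function `add_negatives` and the short `catalogue` list are illustrative assumptions. Like the mapper above, it samples with replacement and does not check whether the user actually bought a sampled product.

```python
import random


def add_negatives(purchased_id, catalogue, num_negatives, rng):
    # Draw random products to serve as negatives (with replacement).
    negatives = [rng.choice(catalogue) for _ in range(num_negatives)]
    # The purchased product always goes first, so the label is one-hot at 0.
    candidates = [purchased_id] + negatives
    labels = [1.0] + [0.0] * num_negatives
    return candidates, labels


rng = random.Random(0)
catalogue = [8650774, 9306139, 9961521, 12058614, 12058615, 11927550]

candidates, labels = add_negatives(10524048, catalogue, 10, rng)
print(candidates)
print(labels)
```

Because the catalogue is large, a random product is rarely one the user bought, which is why uniform sampling is a workable shortcut.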


Let's bring the steps together and define a function that creates a dataset:

In [50]:
def get_dataset(df,products,num_negative_products):
    dummy_user_tensor = tf.constant(df[["dummyUserId"]].values, dtype=tf.string)
    product_tensor = tf.constant(df[["productId"]].values, dtype=tf.int32)

    dataset = tf.data.Dataset.from_tensor_slices((dummy_user_tensor, product_tensor))
    dataset=dataset.map(Mapper(products,num_negative_products))
    dataset=dataset.batch(1024)
    return dataset

In [51]:

for(u,c),y in get_dataset(train,products,4):
  print(u)
  print(c)
  print(y)
  break

tf.Tensor(
[[b'PIXcm7Ru5KmntCy0yA1K']
 [b'd0RILFB1hUzNSINMY4Ow']
 [b'Ebax7lyhnKRm4xeRlWW2']
 ...
 [b'xuX9n8PHfSR0AP3UZ8ar']
 [b'iNnxsPFfOa9884fMjVPJ']
 [b'aD8Mn12im8lFPzXAY41P']], shape=(1024, 1), dtype=string)
tf.Tensor(
[[10524048 10190429 11948499 11731326 12876986]
 [ 9137713 13407839 12644776  9920911 11238599]
 [ 5808602 11729741 12667576 11934281 12285465]
 ...
 [11541336  9229776 11239026  9586825  5775999]
 [ 7779232  8057148 12563470  9186535 12507393]
 [ 4941259 12447795 10925767 13167798 13086186]], shape=(1024, 5), dtype=int32)
tf.Tensor(
[[1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 ...
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]], shape=(1024, 5), dtype=float32)


# Train a model

We need to compile a model, set the loss and create an evaluation metric. Then we need to train the model.

In [52]:
model=SimpleRecommender(dummy_users,products,15)
model.compile(loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),optimizer=tf.keras.optimizers.SGD(learning_rate=100),metrics=[tf.metrics.CategoricalAccuracy()])
model.fit(get_dataset(train,products,100),validation_data=get_dataset(valid,products,100),epochs=5)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7a6b5033f2e0>
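The loss treats the candidate scores as logits and asks how much probability the softmax puts on position 0, the purchased product. The plain-Python sketch below is a rough illustration of that calculation for a single example, not the Keras implementation. The function names are assumptions.

```python
import math


def softmax_cross_entropy(logits, true_index=0):
    # Numerically stable log-sum-exp, then subtract the true logit.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in logits))
    return log_sum_exp - logits[true_index]


def is_correct(logits, true_index=0):
    # Accuracy counts an example as correct when the true item scores highest.
    return max(range(len(logits)), key=lambda i: logits[i]) == true_index


# 1 purchased product + 100 negatives, all scored equally (an untrained model).
uniform_loss = softmax_cross_entropy([0.0] * 101)
# The purchased product scored well above the negatives.
good_loss = softmax_cross_entropy([5.0] + [0.0] * 100)
print(uniform_loss, good_loss)
```

With 100 negatives, a model that scores every candidate equally has a loss of log(101), so a training loss below that level indicates the model has learned something.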

Let's do a manual check on whether the model is any good.

In [53]:
test_product = 11698965

In [54]:
print("Recs for item {}: {}".format(test_product, model.call_item_item(tf.constant(test_product, dtype=tf.int32))))

Recs for item 11698965: (<tf.Tensor: shape=(100,), dtype=int32, numpy=
array([ 6848491, 10880400, 10360535,  9099410, 10233576, 11846940,
       10957845, 10555920,  7908164, 13329218, 10076518, 12297404,
        9538276, 11738002, 12400123, 10437215, 11001547, 11021553,
       11713447,  8651669, 12227117, 12296243, 12159348, 11165000,
       10490468, 11483210, 10231893, 11689898, 12103379, 10733129,
       11443074, 11512765, 10156207, 10389382, 10490500, 11399580,
       11797027, 12076028, 11264278,  9686939, 10637112, 11147714,
       10309779, 11489908, 11492507,  9489862, 11187698, 10887227,
       11409811, 10571268, 12731290, 10143326, 10795252, 12006313,
       10573792, 11668069, 11648408, 11344547, 12896403, 10021205,
       11036544, 12656573, 10367329, 12551056, 11855889, 12640851,
       10528018, 11165924, 11033925, 12145899,  9789201,  9128301,
       10896827, 10183144, 11198974,  9327768,  8174687, 12335362,
       10966981, 11953072, 11642522,  7880723, 11827084, 1

# Save the model

In [55]:
model_path = "models/recommender/1"

In [56]:
input_signature = tf.TensorSpec(shape=(), dtype=tf.int32)

In [66]:
signatures = { 'call_item_item': model.call_item_item.get_concrete_function(input_signature)}

In [58]:
tf.saved_model.save(model,model_path,signatures=signatures)



In [69]:
imported_model = tf.saved_model.load(model_path)
list(imported_model.signatures.keys())

['call_item_item']

In [70]:
result_tensor=imported_model.signatures['call_item_item'](tf.constant([22103359]))
from IPython.core.display import HTML

def path_to_image_html(path):
  return '<img src= "https://images.asos.media.com/products/ugg-classic-mini-boots-in-black-suede/'+ str(path)+'-2" width="60">'
result_df=pd.DataFrame(result_tensor['output_0'].numpy(),columns=['ProductUrl']).head(5)
HTML(result_df.to_html(escape=False,formatters=dict(ProductUrl=path_to_image_html)))

InvalidArgumentError: Graph execution error:

Detected at node embedding_6/embedding_lookup defined at (most recent call last):
<stack traces unavailable>
indices[0] = -1 is not in [0, 29696)
	 [[{{node embedding_6/embedding_lookup}}]] [Op:__inference_signature_wrapper_6337]
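The error above occurs because 22103359 is not one of the ids in `products`: the hash table returns its default value of -1, which is not a valid row of the embedding matrix (valid rows are 0 to 29695). One way to avoid this, sketched here in plain Python, is to check requested ids against the known catalogue before calling the model. The helper `split_known_ids` and the short `catalogue` list are hypothetical.

```python
def split_known_ids(requested_ids, known_ids):
    # Separate ids the model has an embedding for from ids it has never seen.
    known = set(known_ids)
    found = [pid for pid in requested_ids if pid in known]
    missing = [pid for pid in requested_ids if pid not in known]
    return found, missing


catalogue = [8650774, 9306139, 9961521]  # stand-in for `products`
found, missing = split_known_ids([9306139, 22103359], catalogue)
print(found, missing)
```

Only the ids in `found` would be passed to the saved signature; ids in `missing` need a fallback, such as returning popular products instead.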

Zipping the saved model will make it easier to download.

In [65]:
from zipfile import ZipFile
import os

# Create a ZipFile object
with ZipFile('models.zip', 'w') as zipObj:
    # Iterate over all the files in the directory
    for folderName, subfolders, filenames in os.walk("models/recommender"):
        for filename in filenames:
            # Create the complete path of the file in the directory
            filePath = os.path.join(folderName, filename)
            # Add the file to the zip
            zipObj.write(filePath)