## Create Ranking Dataset

In this notebook, we create a dataset for our ranking model. Because our dataset consists only of positive user-item interactions (transactions), we need to sample negatives. Without negative examples the model has nothing to contrast the positives against, and it might simply recommend all items to all users.

This notebook generates either the training or the validation data. To generate both datasets, run the notebook once, change `USE_TRAIN` below to `False`, and run the notebook again.

In [1]:
import pandas as pd

USE_TRAIN = True

if USE_TRAIN:
    df = pd.read_csv("train_df.csv")
    ds_name = "ranking_train_df.csv"
else:
    # Use validation data.
    df = pd.read_csv("val_df.csv")
    ds_name = "ranking_val_df.csv"

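These are the true positive pairs: each row is an observed transaction, described by the query features and the purchased `article_id`.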

In [2]:
query_features = ["customer_id", "age", "month_sin", "month_cos"]

positive_pairs = df[query_features + ["article_id"]].copy()

positive_pairs

Unnamed: 0,customer_id,age,month_sin,month_cos,article_id
0,f19f841dd4ecf49ff4e7dfef40109df86649957783eb14...,26.0,1.224647e-16,-1.000000e+00,599580061
1,d5f60ad06c6da745351126948f290749f5880e770a1c1a...,23.0,-8.660254e-01,-5.000000e-01,738943005
2,6f61fa26a25954c17f63765d2a734495e911c8284336f0...,45.0,-8.660254e-01,-5.000000e-01,700938001
3,ad9772ef769c604ebd7f0a855951eaf3d18306632d3119...,74.0,5.000000e-01,8.660254e-01,842605004
4,c70790e9f60f4593a5ed53d06f3a1b9b9a2a8ac72372ba...,29.0,-8.660254e-01,5.000000e-01,674651002
...,...,...,...,...,...
409852,616d15f691328a9c0d1db5b74ea59eb01431fd88c38d25...,26.0,-5.000000e-01,8.660254e-01,688796001
409853,3d7af77f522b463540e8df305946ed398dc696c1d817ee...,58.0,-1.000000e+00,-1.836970e-16,710899002
409854,3b201b60fd85a82ce8b6cea640817850d28f69c9d1151b...,51.0,5.000000e-01,-8.660254e-01,811900002
409855,70e30dea501c117ee15739c2c1bcde049e0e5b8572d1ff...,60.0,5.000000e-01,8.660254e-01,853097001


In [3]:
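# Sample 10 negatives per positive pair.
n_neg = len(positive_pairs) * 10

# Draw query rows (customer and context features) with replacement.
negative_pairs = positive_pairs[query_features]\
    .sample(n_neg, replace=True, random_state=1)\
    .reset_index(drop=True)

# Pair each query with an independently drawn article. Articles are drawn
# from the transactions, so popular items are sampled more often.
# Note: a sampled pair may coincide with a real transaction.
negative_pairs["article_id"] = positive_pairs["article_id"]\
    .sample(n_neg, replace=True, random_state=2).to_numpy()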

negative_pairs

Unnamed: 0,customer_id,age,month_sin,month_cos,article_id
0,407e47239b68af2ffa2f1972ea08420a8b72f5b247f2fd...,28.0,5.000000e-01,-0.866025,728162005
1,22a6b19d37713870b8154229c943b2c028b34e3b79adbd...,28.0,-5.000000e-01,0.866025,810169018
2,8253b8636fb7d65e31698b4ae203e6ef538990eb05811a...,27.0,-8.660254e-01,-0.500000,372860001
3,8137751c14064ff4ea8631fcdf930e35db3d73df11b66d...,22.0,8.660254e-01,-0.500000,702841001
4,ec2dd15d6e86cc00e5f0e4b954d24aefa4d3d839c7baf6...,50.0,1.224647e-16,-1.000000,900176001
...,...,...,...,...,...
4098565,70f4ca906186650f2aae7ecba8d266014a8fe751858bb6...,25.0,0.000000e+00,1.000000,719530004
4098566,a5ba79dd9869da0f5f2ecb81495564d8fde8bb06c4bbda...,31.0,-5.000000e-01,-0.866025,778064012
4098567,a05b2d59f80ae1edfb9623dd776a572a91201e3cbe313e...,34.0,5.000000e-01,-0.866025,470985009
4098568,d0460f72bb366a1297d525bfc6adecc9ea3f0f003a3830...,29.0,8.660254e-01,-0.500000,669786002
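
Because customers and articles are sampled independently, some sampled pairs may coincide with real transactions and would then be labelled 0 even though they are positives. The sketch below shows one way to count and drop such collisions with an anti-join. It runs on a toy frame with hypothetical values, and the helper name `drop_false_negatives` is an assumption, not part of the original notebook.

```python
import pandas as pd


def drop_false_negatives(negatives, positives, keys=("customer_id", "article_id")):
    """Remove sampled negatives whose key pair also occurs in the positives."""
    keys = list(keys)
    # Unique positive key pairs, so the left join cannot duplicate rows.
    pos_keys = positives[keys].drop_duplicates()
    # indicator=True adds a `_merge` column; "both" marks a collision.
    flagged = negatives.merge(pos_keys, on=keys, how="left", indicator=True)
    clean = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")
    return clean.reset_index(drop=True)


# Toy data (hypothetical values, for illustration only).
toy_pos = pd.DataFrame({"customer_id": ["a", "b"], "article_id": [1, 2]})
toy_neg = pd.DataFrame({"customer_id": ["a", "a", "b"], "article_id": [1, 2, 1]})

toy_clean = drop_false_negatives(toy_neg, toy_pos)
n_collisions = len(toy_neg) - len(toy_clean)
```

Whether filtering is worth the extra join depends on how often collisions occur; with a large catalogue they are typically rare, but counting them first is cheap.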


In [4]:
# Add labels.
positive_pairs["label"] = 1
negative_pairs["label"] = 0

# Concatenate.
ranking_df = pd.concat([positive_pairs, negative_pairs], ignore_index=True)

In [5]:
ranking_df

Unnamed: 0,customer_id,age,month_sin,month_cos,article_id,label
0,f19f841dd4ecf49ff4e7dfef40109df86649957783eb14...,26.0,1.224647e-16,-1.000000,599580061,1
1,d5f60ad06c6da745351126948f290749f5880e770a1c1a...,23.0,-8.660254e-01,-0.500000,738943005,1
2,6f61fa26a25954c17f63765d2a734495e911c8284336f0...,45.0,-8.660254e-01,-0.500000,700938001,1
3,ad9772ef769c604ebd7f0a855951eaf3d18306632d3119...,74.0,5.000000e-01,0.866025,842605004,1
4,c70790e9f60f4593a5ed53d06f3a1b9b9a2a8ac72372ba...,29.0,-8.660254e-01,0.500000,674651002,1
...,...,...,...,...,...,...
4508422,70f4ca906186650f2aae7ecba8d266014a8fe751858bb6...,25.0,0.000000e+00,1.000000,719530004,0
4508423,a5ba79dd9869da0f5f2ecb81495564d8fde8bb06c4bbda...,31.0,-5.000000e-01,-0.866025,778064012,0
4508424,a05b2d59f80ae1edfb9623dd776a572a91201e3cbe313e...,34.0,5.000000e-01,-0.866025,470985009,0
4508425,d0460f72bb366a1297d525bfc6adecc9ea3f0f003a3830...,29.0,8.660254e-01,-0.500000,669786002,0


In [6]:
# Merge with item features.
item_df = pd.read_csv("articles.csv")
item_df.drop_duplicates(subset="article_id", inplace=True)

ranking_df = ranking_df.merge(item_df, on="article_id")
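
The merge above uses pandas' default inner join, so any pair whose `article_id` is missing from `articles.csv` would be dropped silently. A minimal sketch of a more defensive variant, using toy frames with hypothetical values: `validate="many_to_one"` raises if the item table contains duplicate keys, and `indicator=True` makes unmatched rows visible.

```python
import pandas as pd

# Toy stand-ins for ranking_df and item_df (hypothetical values).
pairs = pd.DataFrame({"article_id": [1, 2, 3], "label": [1, 0, 0]})
items = pd.DataFrame({
    "article_id": [1, 2],
    "garment_group_name": ["Swimwear", "Jersey Basic"],
})

merged = pairs.merge(
    items,
    on="article_id",
    how="left",              # keep every pair, even without item features
    validate="many_to_one",  # raise MergeError on duplicate article_ids in items
    indicator=True,          # adds `_merge`: "both" or "left_only"
)

# Number of pairs that found no matching item row.
n_unmatched = int((merged["_merge"] == "left_only").sum())
```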

In [7]:
import tensorflow as tf

# Load models.
item_model = tf.keras.models.load_model("item_model")
user_model = tf.keras.models.load_model("user_model")

In [8]:
# Model expects string.
ranking_df["article_id"] = ranking_df["article_id"].astype(str)
ranking_df["customer_id"] = ranking_df["customer_id"].astype(str)

Next, we compute the query and candidate embeddings.

In [9]:
import numpy as np

retrieval_features = ["customer_id", "article_id", "age", "month_sin",
                      "month_cos", "garment_group_name", "index_group_name"]

item_feat = ["article_id", "garment_group_name", "index_group_name"]

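# Build a tf.data pipeline over the columns the retrieval models consume.
ranking_ds = tf.data.Dataset.from_tensor_slices(
    {col: ranking_df[col] for col in retrieval_features})

# The item model only receives the item features.
item_emb_ds = ranking_ds.batch(2048)\
    .map(lambda x: item_model({i: x[i] for i in item_feat}))

# The user model receives the full feature dict.
user_emb_ds = ranking_ds.batch(2048)\
    .map(lambda x: user_model(x))

# Collect the batched embeddings into a single array per model.
item_emb_arr = np.concatenate([batch.numpy() for batch in item_emb_ds])
user_emb_arr = np.concatenate([batch.numpy() for batch in user_emb_ds])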

item_emb_df = pd.DataFrame(item_emb_arr).add_prefix("item_emb_")
user_emb_df = pd.DataFrame(user_emb_arr).add_prefix("user_emb_")

ranking_df = pd.concat([ranking_df, item_emb_df, user_emb_df], axis=1)
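
`pd.concat(..., axis=1)` aligns rows by index label, not by position. The embedding frames are built from NumPy arrays and therefore carry a fresh `RangeIndex`, so the step above assumes that `ranking_df` does too (which holds here, because `merge` on a column returns a new integer index). The toy example below, with hypothetical values, shows what happens when that assumption is violated and how `reset_index(drop=True)` guards against it.

```python
import numpy as np
import pandas as pd

# A frame whose index is not 0..n-1, e.g. after filtering rows.
features = pd.DataFrame({"age": [26.0, 23.0, 45.0]}, index=[10, 11, 12])

# Embeddings built from an array get a fresh RangeIndex (0, 1, 2).
emb = pd.DataFrame(np.arange(6, dtype=float).reshape(3, 2)).add_prefix("item_emb_")

# Misaligned: rows are matched by label, so the result has 6 rows padded with NaN.
misaligned = pd.concat([features, emb], axis=1)

# Aligned: reset the index first so both frames share the labels 0..2.
aligned = pd.concat([features.reset_index(drop=True), emb], axis=1)
```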

There are several "duplicated" categorical features in the dataset. For instance, `index_code` and `index_name` encode the same feature in different formats (int, string). We therefore keep only one column per feature and drop the redundant ones.
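
One way to verify that two columns carry the same information is to check that each value of one maps to exactly one value of the other, in both directions. The sketch below does this on a toy frame; the values and the helper name `is_one_to_one` are hypothetical and not part of the original notebook.

```python
import pandas as pd


def is_one_to_one(df, col_a, col_b):
    """Return True if col_a and col_b determine each other."""
    a_to_b = df.groupby(col_a)[col_b].nunique()
    b_to_a = df.groupby(col_b)[col_a].nunique()
    return bool((a_to_b == 1).all() and (b_to_a == 1).all())


# Toy item table (hypothetical values, for illustration only).
toy_items = pd.DataFrame({
    "index_code": [1, 1, 2, 3],
    "index_name": ["Ladieswear", "Ladieswear", "Divided", "Menswear"],
    "colour_group_name": ["Black", "White", "Black", "Black"],
})

# The code/name pair is redundant; code and colour are not.
same_feature = is_one_to_one(toy_items, "index_code", "index_name")
different_feature = is_one_to_one(toy_items, "index_code", "colour_group_name")
```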

In [10]:
def exclude_feat(s):
    # Match ID, number, and code columns by their suffix.
    return s.endswith(("_id", "_no", "_code"))

features_to_exclude = [col for col in ranking_df.columns if exclude_feat(col)]
# Also drop the free-text columns.
features_to_exclude += ["prod_name", "detail_desc"]

ranking_df.drop(features_to_exclude, axis="columns", inplace=True)

ranking_df.head()

Unnamed: 0,age,month_sin,month_cos,label,product_type_name,product_group_name,graphical_appearance_name,colour_group_name,perceived_colour_value_name,perceived_colour_master_name,...,user_emb_6,user_emb_7,user_emb_8,user_emb_9,user_emb_10,user_emb_11,user_emb_12,user_emb_13,user_emb_14,user_emb_15
0,26.0,1.224647e-16,-1.0,1,Swimwear bottom,Swimwear,All over pattern,Off White,Light,White,...,-0.251088,-0.710878,0.117995,0.327928,0.314271,0.250304,0.900349,-0.123851,0.141342,0.149506
1,30.0,1.224647e-16,-1.0,1,Swimwear bottom,Swimwear,All over pattern,Off White,Light,White,...,-1.266206,-0.219487,0.040475,0.144094,1.306867,0.826521,-0.558121,-0.165437,0.094604,0.187149
2,24.0,1.0,6.123234000000001e-17,1,Swimwear bottom,Swimwear,All over pattern,Off White,Light,White,...,-1.019965,-0.076945,-1.407597,-0.701969,-0.585404,-0.08742,0.371376,0.274598,1.399602,-0.130431
3,27.0,0.5,-0.8660254,1,Swimwear bottom,Swimwear,All over pattern,Off White,Light,White,...,-1.810096,-0.191291,-0.692682,-0.35607,0.753533,-0.276697,-0.877726,-0.135839,0.838847,-0.419043
4,24.0,-0.5,-0.8660254,1,Swimwear bottom,Swimwear,All over pattern,Off White,Light,White,...,-0.430608,-0.468733,0.100904,0.125014,0.428927,0.615911,0.02609,-0.342208,-0.211479,0.085512


In [11]:
ranking_df.to_csv(ds_name, index=False)

### Next Steps

In the next notebook, we'll train a ranking model on the dataset we created in this notebook.