In [1]:
import hopsworks

project = hopsworks.login()  # insert API Key from https://app.hopsworks.ai

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://hopsworks0.logicalclocks.com/p/119


## Requirements

Install libraries:

* **tensorflow** (version 2.11) [already installed]

## Create Ranking Dataset

In this notebook, we'll create a dataset for our ranking model. Because the dataset contains only positive user-item interactions (transactions), we need negative sampling: without negative examples, the model could simply learn to recommend every item to every user.

A single run of this notebook generates either the training or the validation data. To generate both datasets, run the notebook once with `USE_TRAIN = True`, then set `USE_TRAIN` below to `False` and run it again.
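As an alternative to flipping `USE_TRAIN` by hand, the per-split steps could be wrapped in a function and applied to both splits in one pass. The following is a minimal sketch on toy data that mirrors the sampling cells further down; the helper name `make_ranking_pairs`, its parameters, and the toy frames are hypothetical and not part of the original notebook.

```python
import pandas as pd

def make_ranking_pairs(df, query_features, neg_per_pos=10, seed=1):
    """Build labelled (query, article) pairs: observed positives plus sampled negatives."""
    positives = df[query_features + ["article_id"]].copy()
    n_neg = len(positives) * neg_per_pos

    # Sample query rows and article ids independently to form random pairs.
    negatives = (
        positives[query_features]
        .sample(n_neg, replace=True, random_state=seed)
        .reset_index(drop=True)
    )
    negatives["article_id"] = (
        positives["article_id"]
        .sample(n_neg, replace=True, random_state=seed + 1)
        .to_numpy()
    )

    positives["label"] = 1
    negatives["label"] = 0
    return pd.concat([positives, negatives], ignore_index=True)

# Toy stand-ins for train_df and val_df (hypothetical values).
toy_splits = {
    "train": pd.DataFrame({
        "customer_id": ["a", "a", "b"],
        "age": [43.0, 43.0, 24.0],
        "article_id": ["1", "2", "3"],
    }),
    "validation": pd.DataFrame({
        "customer_id": ["c", "d"],
        "age": [32.0, 51.0],
        "article_id": ["2", "4"],
    }),
}

ranking = {
    name: make_ranking_pairs(split_df, ["customer_id", "age"], neg_per_pos=2)
    for name, split_df in toy_splits.items()
}
```

Each entry of `ranking` could then be written to its own `ranking_{split}.csv`, which would avoid the second manual run.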

In [3]:
fs = project.get_feature_store()

feature_view = fs.get_feature_view("retrieval", version=1)

train_df, val_df, test_df, y_train, y_val, y_test = feature_view.get_train_validation_test_split(training_dataset_version=1)

Connected. Call `.close()` to terminate connection gracefully.


In [4]:
train_df["article_id"] = train_df["article_id"].astype(str) # to be deleted
val_df["article_id"] = val_df["article_id"].astype(str)
test_df["article_id"] = test_df["article_id"].astype(str)

> <span style="color:darkred">NOTE: </span>&emsp;Repeat all cells below with both `USE_TRAIN = True` and `USE_TRAIN = False`.

In [13]:
USE_TRAIN = False

split, df = ("train", train_df) if USE_TRAIN else ("validation", val_df)
df['article_id'] = df['article_id'].astype(str)

ds_name = f"ranking_{split}.csv"

These are the true positive pairs.

In [14]:
query_features = ["customer_id", "age", "month_sin", "month_cos"]

positive_pairs = df[query_features + ["article_id"]].copy()
positive_pairs

Unnamed: 0,customer_id,age,month_sin,month_cos,article_id
0,009f96da58ca990356bf27485cbf09400997771be94b62...,43.0,-0.866025,0.500000,554541016
1,009f96da58ca990356bf27485cbf09400997771be94b62...,43.0,-0.866025,0.500000,625532002
2,0261af7c1dd3be5ce2584e10ad9ce9d5a0d955ddb35524...,42.0,-0.866025,0.500000,215589001
3,0261af7c1dd3be5ce2584e10ad9ce9d5a0d955ddb35524...,42.0,-0.866025,0.500000,262277011
4,0261af7c1dd3be5ce2584e10ad9ce9d5a0d955ddb35524...,42.0,-0.866025,0.500000,341129001
...,...,...,...,...,...
421422,fd24678386349c556e81e5d3326f62c727993720f25930...,24.0,0.500000,-0.866025,851094002
421423,fd24678386349c556e81e5d3326f62c727993720f25930...,24.0,0.500000,-0.866025,885047002
421424,fd24678386349c556e81e5d3326f62c727993720f25930...,24.0,0.500000,-0.866025,885047004
421425,fd24678386349c556e81e5d3326f62c727993720f25930...,24.0,-0.866025,-0.500000,893059005


In [15]:
n_neg = len(positive_pairs)*10

negative_pairs = positive_pairs[query_features]\
    .sample(n_neg, replace=True, random_state=1)\
    .reset_index(drop=True)

negative_pairs["article_id"] = positive_pairs["article_id"]\
    .sample(n_neg, replace=True, random_state=2).to_numpy()

negative_pairs

Unnamed: 0,customer_id,age,month_sin,month_cos,article_id
0,709ed03562312c3fbf887a34e31273e41596d8c58beac3...,32.0,8.660254e-01,5.000000e-01,675699001
1,8c009f53427921fd3ecae69695893a8a6a5f91aef96736...,23.0,1.000000e+00,6.123234e-17,757805001
2,d336efc8b2fcc693ae46d7e2efc1263913980ad07ed204...,33.0,5.000000e-01,8.660254e-01,680832003
3,df807d4027823ee48048e61a8508562acfa61ab1c1e7fe...,39.0,-5.000000e-01,8.660254e-01,827687002
4,d05a70aad5bc9109eb74b77cb8382fab0df1823ab3e9b0...,65.0,5.000000e-01,-8.660254e-01,693246008
...,...,...,...,...,...
4214265,8dcb279daad8fc5d26e520512bb1e357b0d7ff1c503c5a...,25.0,1.224647e-16,-1.000000e+00,721544002
4214266,9634ea9f3c1396b666f6d0ca25b044f32bbf82812dee97...,43.0,-5.000000e-01,-8.660254e-01,893796002
4214267,1ec0ce7a4ef416d1986a6826f6289e2c283ac6ea38a539...,51.0,8.660254e-01,-5.000000e-01,633808005
4214268,a9a120c1ddf6d34ee411341c6bbd23014854a7a9a42b0b...,35.0,1.000000e+00,6.123234e-17,615141002
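Because queries and articles are sampled independently, a sampled "negative" can coincide with a pair the customer actually bought. The notebook keeps such pairs; if they should be removed, one option is an anti-join against the positives. The sketch below uses toy data, and the helper `drop_false_negatives` is a hypothetical name introduced for illustration.

```python
import pandas as pd

def drop_false_negatives(negatives, positives, keys=("customer_id", "article_id")):
    """Remove sampled negatives whose key pair also occurs among the positives."""
    keys = list(keys)
    known = positives[keys].drop_duplicates()
    # A left merge with indicator=True marks rows that have no positive match.
    merged = negatives.merge(known, on=keys, how="left", indicator=True)
    kept = merged[merged["_merge"] == "left_only"]
    return kept.drop(columns="_merge").reset_index(drop=True)

# Toy data (hypothetical values).
positives = pd.DataFrame({"customer_id": ["a", "a", "b"], "article_id": ["1", "2", "3"]})
negatives = pd.DataFrame({"customer_id": ["a", "b", "b"], "article_id": ["1", "1", "3"]})

clean = drop_false_negatives(negatives, positives)
```

Here only the pair (`b`, `1`) survives, since the other two sampled pairs are real purchases.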


In [16]:
import pandas as pd

# Add labels.
positive_pairs["label"] = 1
negative_pairs["label"] = 0

# Concatenate.
ranking_df = pd.concat([positive_pairs, negative_pairs], ignore_index=True)
ranking_df.head()

Unnamed: 0,customer_id,age,month_sin,month_cos,article_id,label
0,009f96da58ca990356bf27485cbf09400997771be94b62...,43.0,-0.866025,0.5,554541016,1
1,009f96da58ca990356bf27485cbf09400997771be94b62...,43.0,-0.866025,0.5,625532002,1
2,0261af7c1dd3be5ce2584e10ad9ce9d5a0d955ddb35524...,42.0,-0.866025,0.5,215589001,1
3,0261af7c1dd3be5ce2584e10ad9ce9d5a0d955ddb35524...,42.0,-0.866025,0.5,262277011,1
4,0261af7c1dd3be5ce2584e10ad9ce9d5a0d955ddb35524...,42.0,-0.866025,0.5,341129001,1
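Since ten negatives are sampled per positive, the labels are imbalanced by construction. A quick way to confirm the ratio after concatenation is to count the labels; the sketch below does this on a toy label column rather than on `ranking_df`.

```python
import pandas as pd

# Toy labels: 3 positives and 30 sampled negatives (hypothetical values).
labels = pd.Series([1] * 3 + [0] * 30, name="label")

counts = labels.value_counts()
neg_per_pos = counts.loc[0] / counts.loc[1]
positive_share = labels.mean()
```

On the real data the same two lines applied to `ranking_df["label"]` should report roughly ten negatives per positive.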


In [17]:
# Merge with item features.
articles_fg = fs.get_feature_group("articles")
item_df = articles_fg.read()
item_df.drop_duplicates(subset="article_id", inplace=True)
ranking_df = ranking_df.merge(item_df, on="article_id")



Finished: Reading data from Hopsworks, using Hive (9.45s) 
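Note that `merge` defaults to an inner join, so any pair whose `article_id` is missing from the item features is dropped silently. A hedged way to check this is a left join with `indicator=True`, together with `validate="many_to_one"` to assert that each article appears only once on the right-hand side. The sketch below uses toy frames with hypothetical values.

```python
import pandas as pd

# Toy stand-ins for ranking_df and item_df (hypothetical values).
pairs = pd.DataFrame({"article_id": ["1", "2", "2", "9"], "label": [1, 0, 1, 0]})
items = pd.DataFrame({"article_id": ["1", "2"], "colour_group_name": ["Black", "White"]})

merged = pairs.merge(
    items,
    on="article_id",
    how="left",              # keep every pair, matched or not
    validate="many_to_one",  # raises if items has duplicate article_id values
    indicator=True,          # adds a _merge column describing each row's origin
)

# Pairs without item features show up as "left_only".
n_unmatched = int((merged["_merge"] == "left_only").sum())
```

If `n_unmatched` is zero on the real data, the inner join used above loses nothing.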


Next, we remove redundant features.

There are several "duplicated" categorical features in the dataset. For instance, `index_code` and `index_name` encode the same feature in different formats (int, string). We therefore deduplicate these features by dropping the `_id`, `_no`, and `_code` columns and keeping the name columns.
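Before dropping a column, it can be worth confirming that it really duplicates its counterpart. One possible check is whether the two columns map one-to-one. The sketch below uses a toy frame; the helper `is_one_to_one` and the values are hypothetical.

```python
import pandas as pd

def is_one_to_one(df, col_a, col_b):
    """Return True if each value of col_a pairs with exactly one value of col_b, and vice versa."""
    pairs = df[[col_a, col_b]].drop_duplicates()
    return bool(pairs[col_a].is_unique and pairs[col_b].is_unique)

# Toy item features (hypothetical values).
toy_items = pd.DataFrame({
    "index_code": ["A", "A", "B"],
    "index_name": ["Ladieswear", "Ladieswear", "Menswear"],
    "section_no": [1, 2, 3],
})

codes_match_names = is_one_to_one(toy_items, "index_code", "index_name")
codes_match_sections = is_one_to_one(toy_items, "index_code", "section_no")
```

In this toy frame the code and name columns are redundant, while `index_code` and `section_no` are not.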

In [18]:
def exclude_feat(s):
    return s.endswith("_id") or s.endswith("_no") or s.endswith("_code")

features_to_exclude = [col for col in ranking_df.columns if exclude_feat(col)]
features_to_exclude.append("prod_name")

ranking_df.drop(features_to_exclude, axis="columns", inplace=True)
ranking_df.head()

Unnamed: 0,age,month_sin,month_cos,label,product_type_name,product_group_name,graphical_appearance_name,colour_group_name,perceived_colour_value_name,perceived_colour_master_name,department_name,index_name,index_group_name,section_name,garment_group_name
0,43.0,-0.866025,0.5,1,Trousers,Garment Lower body,Denim,Dark Blue,Medium Dusty,Blue,Young Boy Denim,Children Sizes 134-170,Baby/Children,Young Boy,Trousers Denim
1,40.0,-0.866025,-0.5,1,Trousers,Garment Lower body,Denim,Dark Blue,Medium Dusty,Blue,Young Boy Denim,Children Sizes 134-170,Baby/Children,Young Boy,Trousers Denim
2,38.0,-0.5,0.8660254,1,Trousers,Garment Lower body,Denim,Dark Blue,Medium Dusty,Blue,Young Boy Denim,Children Sizes 134-170,Baby/Children,Young Boy,Trousers Denim
3,44.0,-0.866025,-0.5,1,Trousers,Garment Lower body,Denim,Dark Blue,Medium Dusty,Blue,Young Boy Denim,Children Sizes 134-170,Baby/Children,Young Boy,Trousers Denim
4,48.0,-1.0,-1.83697e-16,1,Trousers,Garment Lower body,Denim,Dark Blue,Medium Dusty,Blue,Young Boy Denim,Children Sizes 134-170,Baby/Children,Young Boy,Trousers Denim


In [19]:
ranking_df.to_csv(ds_name, index=False)

In [20]:
dataset_api = project.get_dataset_api()
uploaded_file_path = dataset_api.upload(ds_name, "Resources", overwrite=True)

Uploading: 0.000%|          | 0/673391867 elapsed<00:00 remaining<?

### Next Steps

In the next notebook, we'll train a ranking model on the dataset we created in this notebook.