## Create Ranking Dataset

In this notebook, we create a dataset for our ranking model. Because the dataset contains only positive user-item interactions (transactions), we need negative sampling. Without negative examples, the model could learn to score every user-item pair as positive and end up recommending all items to all users.

This notebook generates either the training or the validation data, depending on `USE_TRAIN`. To generate both datasets, run the notebook once, set `USE_TRAIN` below to `False`, and run it again.

In [13]:
import hsfs

USE_TRAIN = True

conn = hsfs.connection()
fs = conn.get_feature_store()

# Load training dataset.
feature_view = fs.get_feature_view("retrieval", version=1)
train_df, y_train, val_df, y_val, test_df, y_test = feature_view.get_train_validation_test_split(training_dataset_version=4)
# td = fs.get_training_dataset("retrieval_1")

split, df = ("train", train_df) if USE_TRAIN else ("validation", val_df)
df['article_id'] = df['article_id'].astype(str)

ds_name = f"ranking_{split}.csv"

Connected. Call `.close()` to terminate connection gracefully.
2022-09-16 11:48:41,273 INFO: USE `rec_featurestore`
2022-09-16 11:48:41,981 INFO: SELECT `fg2`.`customer_id` `customer_id`, `fg2`.`article_id` `article_id`, `fg2`.`month_sin` `month_sin`, `fg2`.`month_cos` `month_cos`, `fg0`.`age` `age`, `fg1`.`garment_group_name` `garment_group_name`, `fg1`.`index_group_name` `index_group_name`
FROM `rec_featurestore`.`transactions_1` `fg2`
INNER JOIN `rec_featurestore`.`customers_1` `fg0` ON `fg2`.`customer_id` = `fg0`.`customer_id`
INNER JOIN `rec_featurestore`.`articles_1` `fg1` ON `fg2`.`article_id` = `fg1`.`article_id`


The rows of the selected split are the true positive pairs: each one is a recorded transaction between a customer and an article.

In [14]:
query_features = ["customer_id", "age", "month_sin", "month_cos"]

positive_pairs = df[query_features + ["article_id"]].copy()

positive_pairs

Unnamed: 0,customer_id,age,month_sin,month_cos,article_id
0,126af14ee80591f35c8e8a153da81eb09d72012a1e073f...,29.0,-0.866025,-0.5,863646003
2,1de87318baafea4da0f39ee90334f7b06339dd0daa0c95...,26.0,-0.866025,-0.5,876342001
3,b772351a040d7bbbba637a3eb4b9d2b7d95a2a7ed12c76...,55.0,-0.866025,-0.5,237347063
5,29f0b7182797df0333eeaf8628598ee10850d1c51a3766...,58.0,-0.866025,-0.5,896989001
6,6366885922ad927f5b94920f675c364b6feed381b36a9d...,28.0,-0.866025,-0.5,867257005
...,...,...,...,...,...
1838,089e1021b4640a7ea620e99509a89b84965740aa918337...,25.0,-0.866025,-0.5,814224001
1840,7f02dc707b74d8c8c76dff8b724905b338f82d1882b74e...,33.0,-0.866025,-0.5,873279004
1841,7f02dc707b74d8c8c76dff8b724905b338f82d1882b74e...,33.0,-0.866025,-0.5,873279003
1842,d302d3f0f3278c6142f6d3ee3f365d236fce31426cfde2...,57.0,-0.866025,-0.5,922037001


In [15]:
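# Sample 10 negatives for every positive pair.
n_neg = len(positive_pairs) * 10

# Draw random query rows (customer and context features), with replacement.
negative_pairs = positive_pairs[query_features]\
    .sample(n_neg, replace=True, random_state=1)\
    .reset_index(drop=True)

# Pair each sampled query with an independently sampled article.
# A random pair can coincide with a real transaction; such rows are not filtered out here.
negative_pairs["article_id"] = positive_pairs["article_id"]\
    .sample(n_neg, replace=True, random_state=2).to_numpy()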

negative_pairs

Unnamed: 0,customer_id,age,month_sin,month_cos,article_id
0,cf3d07adfd9dc298426b78d8c0d2e481f6817c4a8fac1d...,32.0,-0.866025,-0.5,822344010
1,4d8fd09f7f062adc4cdedac5616381d35208928b6a6d80...,27.0,-0.866025,-0.5,923037003
2,19e1efd91ce1ccd488e732af0c20e56e0d218b066b1aa9...,42.0,-0.866025,-0.5,902388003
3,ee0aad902eb77f680346fba98bde65f4f86a01be052e42...,29.0,-0.866025,-0.5,919176001
4,57ea94b48f4b8002d2992bd9b8f4d0f50b0a7ae4f2f537...,27.0,-0.866025,-0.5,798915004
...,...,...,...,...,...
14755,6366885922ad927f5b94920f675c364b6feed381b36a9d...,28.0,-0.866025,-0.5,610776105
14756,319a7911f96dc5596ec2f79aced8ecbf630ed98627140c...,50.0,-0.866025,-0.5,673281013
14757,5d72be2a7393375abd543548a72a27c95efe962d31b000...,31.0,-0.866025,-0.5,573085028
14758,4de6988c4971cfc0c6910a85a54fa8d6fb11be64b87b98...,49.0,-0.866025,-0.5,850917001
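
Because queries and articles are sampled independently, a sampled "negative" can happen to be a pair that also appears among the positives. The notebook does not handle this case. The following is a minimal sketch of one way to filter such rows, assuming pairs are identified by `customer_id` and `article_id`; the helper `drop_known_positives` and the toy frames are hypothetical and not part of the original notebook.

```python
import pandas as pd


def drop_known_positives(negatives, positives, keys=("customer_id", "article_id")):
    """Remove sampled negatives whose key pair also occurs in the positives."""
    keys = list(keys)
    known = positives[keys].drop_duplicates()
    # A left merge with indicator=True marks rows found only in `negatives`.
    flagged = negatives.merge(known, on=keys, how="left", indicator=True)
    kept = flagged[flagged["_merge"] == "left_only"]
    return kept.drop(columns="_merge").reset_index(drop=True)


# Toy data (illustrative values only).
toy_pos = pd.DataFrame({"customer_id": ["a", "b"], "article_id": ["1", "2"]})
toy_neg = pd.DataFrame({"customer_id": ["a", "a", "b"], "article_id": ["1", "2", "1"]})

clean = drop_known_positives(toy_neg, toy_pos)
print(clean)
```

In the toy example, the sampled pair `("a", "1")` is dropped because it is a known positive. Filtering slightly reduces the number of negatives, so the 10:1 ratio would no longer be exact.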


In [16]:
import pandas as pd

# Add labels.
positive_pairs["label"] = 1
negative_pairs["label"] = 0

# Concatenate.
ranking_df = pd.concat([positive_pairs, negative_pairs], ignore_index=True)

In [17]:
ranking_df

Unnamed: 0,customer_id,age,month_sin,month_cos,article_id,label
0,126af14ee80591f35c8e8a153da81eb09d72012a1e073f...,29.0,-0.866025,-0.5,863646003,1
1,1de87318baafea4da0f39ee90334f7b06339dd0daa0c95...,26.0,-0.866025,-0.5,876342001,1
2,b772351a040d7bbbba637a3eb4b9d2b7d95a2a7ed12c76...,55.0,-0.866025,-0.5,237347063,1
3,29f0b7182797df0333eeaf8628598ee10850d1c51a3766...,58.0,-0.866025,-0.5,896989001,1
4,6366885922ad927f5b94920f675c364b6feed381b36a9d...,28.0,-0.866025,-0.5,867257005,1
...,...,...,...,...,...,...
16231,6366885922ad927f5b94920f675c364b6feed381b36a9d...,28.0,-0.866025,-0.5,610776105,0
16232,319a7911f96dc5596ec2f79aced8ecbf630ed98627140c...,50.0,-0.866025,-0.5,673281013,0
16233,5d72be2a7393375abd543548a72a27c95efe962d31b000...,31.0,-0.866025,-0.5,573085028,0
16234,4de6988c4971cfc0c6910a85a54fa8d6fb11be64b87b98...,49.0,-0.866025,-0.5,850917001,0


In [18]:
# Merge with item features.
articles_fg = fs.get_feature_group("articles")
item_df = articles_fg.read()
item_df.drop_duplicates(subset="article_id", inplace=True)
ranking_df = ranking_df.merge(item_df, on="article_id")



2022-09-16 11:49:18,019 INFO: USE `rec_featurestore`
2022-09-16 11:49:18,756 INFO: SELECT `fg0`.`article_id` `article_id`, `fg0`.`product_code` `product_code`, `fg0`.`prod_name` `prod_name`, `fg0`.`product_type_no` `product_type_no`, `fg0`.`product_type_name` `product_type_name`, `fg0`.`product_group_name` `product_group_name`, `fg0`.`graphical_appearance_no` `graphical_appearance_no`, `fg0`.`graphical_appearance_name` `graphical_appearance_name`, `fg0`.`colour_group_code` `colour_group_code`, `fg0`.`colour_group_name` `colour_group_name`, `fg0`.`perceived_colour_value_id` `perceived_colour_value_id`, `fg0`.`perceived_colour_value_name` `perceived_colour_value_name`, `fg0`.`perceived_colour_master_id` `perceived_colour_master_id`, `fg0`.`perceived_colour_master_name` `perceived_colour_master_name`, `fg0`.`department_no` `department_no`, `fg0`.`department_name` `department_name`, `fg0`.`index_code` `index_code`, `fg0`.`index_name` `index_name`, `fg0`.`index_group_no` `index_group_no`, `
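
`DataFrame.merge` performs an inner join by default, so any pair whose `article_id` has no match in the item features is silently dropped. The sketch below shows one way to make that visible, under the assumption that item features are keyed by `article_id`; the helper `merge_item_features` and the toy frames are hypothetical.

```python
import pandas as pd


def merge_item_features(pairs, items, key="article_id"):
    """Inner-join item features onto pairs and report how many pairs were lost."""
    items = items.drop_duplicates(subset=key)
    # validate="many_to_one" raises if the item side still has duplicate keys.
    merged = pairs.merge(items, on=key, how="inner", validate="many_to_one")
    n_dropped = len(pairs) - len(merged)
    return merged, n_dropped


# Toy data (illustrative values only); article "3" has no item features.
toy_pairs = pd.DataFrame({"article_id": ["1", "2", "3"], "label": [1, 0, 0]})
toy_items = pd.DataFrame(
    {"article_id": ["1", "1", "2"], "colour_group_name": ["Beige", "Beige", "Black"]}
)

merged, n_dropped = merge_item_features(toy_pairs, toy_items)
print(f"{n_dropped} pair(s) had no matching item features")
```

Both sides of the join need the same key dtype; the notebook casts `article_id` to `str` in the first cell, so the item side would have to use strings as well.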

In [19]:
import hsml

conn = hsml.connection()
mr = conn.get_model_registry()

Connected. Call `.close()` to terminate connection gracefully.


In [20]:
import tensorflow as tf

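The next cell was meant to compute the query and candidate embeddings. Its code is commented out, so no embeddings are added to the ranking dataset in this run.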

In [21]:
import numpy as np

# # Retrieve input feature names.
# candidate_model_schema = candidate_model.model_schema['input_schema']['columnar_schema']
# item_features = [feat['name'] for feat in candidate_model_schema]
# query_model_schema = query_model.model_schema['input_schema']['columnar_schema']
# query_features = [feat['name'] for feat in query_model_schema]

# def df_to_ds(df):
#     return tf.data.Dataset.from_tensor_slices({col : df[col] for col in df})

# item_ds = df_to_ds(ranking_df[item_features])
# query_ds = df_to_ds(ranking_df[query_features])

# ranking_df = pd.concat([ranking_df, item_emb_df, user_emb_df], axis=1)

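There are several "duplicated" categorical features in the dataset. For instance, `index_code` and `index_name` encode the same feature in different formats (int, string). We therefore keep the human-readable name columns and drop every column ending in `_id`, `_no`, or `_code` (which also removes `customer_id` and `article_id`), along with `prod_name`.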
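
Before dropping a column as redundant, it can be worth confirming that it really maps one-to-one onto its counterpart. The following is a small sketch of such a check; the helper `is_one_to_one` and the toy values are hypothetical and only illustrate the idea.

```python
import pandas as pd


def is_one_to_one(df, col_a, col_b):
    """Return True if each value of col_a pairs with exactly one value of col_b and vice versa."""
    pairs = df[[col_a, col_b]].drop_duplicates()
    return bool(pairs[col_a].is_unique and pairs[col_b].is_unique)


# Toy data (illustrative values only).
toy = pd.DataFrame(
    {
        "index_code": [1, 1, 2],
        "index_name": ["Ladieswear", "Ladieswear", "Divided"],
        "colour_group_name": ["Beige", "Black", "Beige"],
    }
)

print(is_one_to_one(toy, "index_code", "index_name"))
print(is_one_to_one(toy, "index_code", "colour_group_name"))
```

If a pair of columns fails this check, the two columns carry different information and dropping one by suffix alone would lose it.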

In [22]:
def exclude_feat(s):
    return s.endswith("_id") or s.endswith("_no") or s.endswith("_code")

features_to_exclude = [col for col in ranking_df.columns if exclude_feat(col)]
features_to_exclude.append("prod_name")

ranking_df.drop(features_to_exclude, axis="columns", inplace=True)

ranking_df.head()

Unnamed: 0,age,month_sin,month_cos,label,product_type_name,product_group_name,graphical_appearance_name,colour_group_name,perceived_colour_value_name,perceived_colour_master_name,department_name,index_name,index_group_name,section_name,garment_group_name
0,29.0,-0.866025,-0.5,1,Sweater,Garment Upper body,Melange,Beige,Medium Dusty,Beige,Knitwear,Ladieswear,Ladieswear,Womens Everyday Collection,Knitwear
1,30.0,-0.866025,-0.5,1,Sweater,Garment Upper body,Melange,Beige,Medium Dusty,Beige,Knitwear,Ladieswear,Ladieswear,Womens Everyday Collection,Knitwear
2,54.0,-0.866025,-0.5,0,Sweater,Garment Upper body,Melange,Beige,Medium Dusty,Beige,Knitwear,Ladieswear,Ladieswear,Womens Everyday Collection,Knitwear
3,27.0,-0.866025,-0.5,0,Sweater,Garment Upper body,Melange,Beige,Medium Dusty,Beige,Knitwear,Ladieswear,Ladieswear,Womens Everyday Collection,Knitwear
4,30.0,-0.866025,-0.5,0,Sweater,Garment Upper body,Melange,Beige,Medium Dusty,Beige,Knitwear,Ladieswear,Ladieswear,Womens Everyday Collection,Knitwear


In [23]:
ranking_df.to_csv(ds_name, index=False)

In [24]:
import hopsworks

connection = hopsworks.connection()
project = connection.get_project()
dataset_api = project.get_dataset_api()
uploaded_file_path = dataset_api.upload(ds_name, "Resources", overwrite=True)

Connected. Call `.close()` to terminate connection gracefully.



### Next Steps

In the next notebook, we'll train a ranking model on the dataset created here.