## Create Ranking Dataset

In this notebook, we'll create a dataset for our ranking model. Our data consists only of positive user-item interactions (transactions), so we need negative sampling: we pair customers with randomly chosen items and label those pairs as negatives. (Otherwise our model might just recommend all items to all users.)

This notebook can generate both the training and the validation data. To produce both datasets, run the notebook once, change `USE_TRAIN` below to the other value (`True` or `False`), and run the notebook again.

In [None]:
# Fill in the details and run this cell only if you are running Python outside Hopsworks.
import os

# Read the API key from a local file so it is not hard-coded in the notebook.
with open("api-key.txt", "r") as f:
    key = f.read().rstrip()

os.environ["HOPSWORKS_PROJECT"] = "hm"
os.environ["HOPSWORKS_HOST"] = "35.240.81.237"
os.environ["HOPSWORKS_API_KEY"] = key

In [None]:
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

In [13]:
USE_TRAIN = False

# Load training dataset.
td = fs.get_training_dataset("retrieval_1")

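# Pick the split to read and the name of the CSV file this notebook writes.
split = "train" if USE_TRAIN else "validation"
ds_name = f"ranking_{split}.csv"

df = td.read(split)

# Treat article IDs as strings (identifiers), not numbers.
df['article_id'] = df['article_id'].astype(str)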

Connected. Call `.close()` to terminate connection gracefully.




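Each row of `df` is an actual transaction, so the query features together with the purchased `article_id` form the true positive pairs.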

In [14]:
query_features = ["customer_id", "age", "month_sin", "month_cos"]

positive_pairs = df[query_features + ["article_id"]].copy()

positive_pairs

Unnamed: 0,customer_id,age,month_sin,month_cos,article_id
0,0261af7c1dd3be5ce2584e10ad9ce9d5a0d955ddb35524...,42.0,-8.660254e-01,5.000000e-01,774039004
1,105f4c9f7ac528b9440bb3126f4ac2cce18992121f5f71...,35.0,5.000000e-01,8.660254e-01,573152007
2,105f4c9f7ac528b9440bb3126f4ac2cce18992121f5f71...,35.0,-1.000000e+00,-1.836970e-16,578752001
3,105f4c9f7ac528b9440bb3126f4ac2cce18992121f5f71...,35.0,8.660254e-01,-5.000000e-01,608213007
4,105f4c9f7ac528b9440bb3126f4ac2cce18992121f5f71...,35.0,8.660254e-01,5.000000e-01,617322013
...,...,...,...,...,...
105596,fa3abedbd19d8358085ac4fcdd1e9355b531222a251a79...,18.0,-8.660254e-01,5.000000e-01,783925003
105597,fb90ddf5c8c9fdb3dde6cd9c073edc037c770ec80e4ea4...,43.0,1.224647e-16,-1.000000e+00,743644002
105598,fd24678386349c556e81e5d3326f62c727993720f25930...,24.0,5.000000e-01,-8.660254e-01,851094001
105599,fd24678386349c556e81e5d3326f62c727993720f25930...,24.0,5.000000e-01,-8.660254e-01,885047002


In [15]:
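# Sample ten negatives per positive pair.
n_neg = len(positive_pairs)*10

# Draw query rows (customer and context features) at random, with replacement.
negative_pairs = positive_pairs[query_features]\
    .sample(n_neg, replace=True, random_state=1)\
    .reset_index(drop=True)

# Pair each sampled query with an independently sampled article.
# A different seed keeps the two draws from picking the same rows in the same order.
negative_pairs["article_id"] = positive_pairs["article_id"]\
    .sample(n_neg, replace=True, random_state=2).to_numpy()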

negative_pairs



Unnamed: 0,customer_id,age,month_sin,month_cos,article_id
0,0f863752123dfd4214ec2f8d0fbfd6b163c00080cd8a7c...,44.0,8.660254e-01,5.000000e-01,680262010
1,00ffed0316ae807cb9439799d73cd61fd7f6479a0f9a19...,46.0,-8.660254e-01,5.000000e-01,706016006
2,20509ffc475342dec9fab94d3f343ebeb3ee2436b7f946...,58.0,-1.000000e+00,-1.836970e-16,772570002
3,10ced0273136735c24606f0d9908e2d7b70d8f65b2e7f0...,26.0,1.224647e-16,-1.000000e+00,875081008
4,5e3225a80c7911d11cfccc6605a35c9a3af9c3a70cd2f3...,21.0,1.224647e-16,-1.000000e+00,605690002
...,...,...,...,...,...
1056005,27b960e9de9f652dd6a8286c710a99b1769101ee03ff1a...,37.0,-8.660254e-01,5.000000e-01,764045001
1056006,3a3d0cc3dec8ef744158b98301b5005ba5f67cbffa218b...,24.0,8.660254e-01,5.000000e-01,863456004
1056007,83545155a5fd15a09c294045e2c30dae5fe7447096870e...,57.0,-5.000000e-01,8.660254e-01,918813001
1056008,eab4cd86bd7d84417893f0a92b0a0e1332c7c917dd5171...,46.0,1.224647e-16,-1.000000e+00,666080002
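
Because customers and articles are sampled independently, a sampled pair can coincide with a real purchase and would then be labelled as a negative. The sketch below shows one possible way to drop such collisions with an anti-join. It runs on small made-up frames rather than on the notebook's data, and the helper name `drop_false_negatives` is an assumption introduced for illustration; the notebook itself does not apply this filter.

```python
import pandas as pd

def drop_false_negatives(negatives, positives, keys=("customer_id", "article_id")):
    """Remove sampled negatives whose key pair also occurs among the positives."""
    keys = list(keys)
    observed = positives[keys].drop_duplicates()
    # Left join with an indicator column, then keep rows found only in `negatives`.
    flagged = negatives.merge(observed, on=keys, how="left", indicator=True)
    kept = flagged[flagged["_merge"] == "left_only"]
    return kept.drop(columns="_merge").reset_index(drop=True)

# Toy data (hypothetical values, not taken from the dataset).
toy_positives = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "article_id": ["1", "2", "3"],
})
toy_negatives = pd.DataFrame({
    "customer_id": ["a", "b", "b"],
    "article_id": ["3", "3", "1"],
})

# The pair ("b", "3") is a real purchase, so it is removed from the negatives.
clean_negatives = drop_false_negatives(toy_negatives, toy_positives)
print(clean_negatives)
```

With a large catalogue the share of such collisions is likely small, so whether the extra join is worth it depends on the dataset.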


In [16]:
import pandas as pd

# Add labels.
positive_pairs["label"] = 1
negative_pairs["label"] = 0

# Concatenate.
ranking_df = pd.concat([positive_pairs, negative_pairs], ignore_index=True)



In [17]:
ranking_df

Unnamed: 0,customer_id,age,month_sin,month_cos,article_id,label
0,0261af7c1dd3be5ce2584e10ad9ce9d5a0d955ddb35524...,42.0,-8.660254e-01,5.000000e-01,774039004,1
1,105f4c9f7ac528b9440bb3126f4ac2cce18992121f5f71...,35.0,5.000000e-01,8.660254e-01,573152007,1
2,105f4c9f7ac528b9440bb3126f4ac2cce18992121f5f71...,35.0,-1.000000e+00,-1.836970e-16,578752001,1
3,105f4c9f7ac528b9440bb3126f4ac2cce18992121f5f71...,35.0,8.660254e-01,-5.000000e-01,608213007,1
4,105f4c9f7ac528b9440bb3126f4ac2cce18992121f5f71...,35.0,8.660254e-01,5.000000e-01,617322013,1
...,...,...,...,...,...,...
1161606,27b960e9de9f652dd6a8286c710a99b1769101ee03ff1a...,37.0,-8.660254e-01,5.000000e-01,764045001,0
1161607,3a3d0cc3dec8ef744158b98301b5005ba5f67cbffa218b...,24.0,8.660254e-01,5.000000e-01,863456004,0
1161608,83545155a5fd15a09c294045e2c30dae5fe7447096870e...,57.0,-5.000000e-01,8.660254e-01,918813001,0
1161609,eab4cd86bd7d84417893f0a92b0a0e1332c7c917dd5171...,46.0,1.224647e-16,-1.000000e+00,666080002,0
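
Since we sample ten negatives per positive, roughly one row in eleven should carry label 1. A quick sanity check along the following lines can confirm that the concatenation kept both classes in the expected proportion. This is a minimal sketch on a toy frame; `toy_ranking_df` is a hypothetical stand-in for `ranking_df`.

```python
import pandas as pd

# Toy labels (hypothetical): 2 positives and 10 negatives per positive.
toy_ranking_df = pd.DataFrame({"label": [1] * 2 + [0] * 20})

# Count rows per class and compute the share of positives.
label_counts = toy_ranking_df["label"].value_counts()
n_pos = int(label_counts.get(1, 0))
n_neg = int(label_counts.get(0, 0))
positive_fraction = n_pos / len(toy_ranking_df)

print(f"positives={n_pos}, negatives={n_neg}, positive fraction={positive_fraction:.3f}")
```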


In [18]:
# Merge with item features.
articles_fg = fs.get_feature_group("articles")
item_df = articles_fg.read()
item_df.drop_duplicates(subset="article_id", inplace=True)
ranking_df = ranking_df.merge(item_df, on="article_id")



2022-06-07 13:06:09,813 INFO: USE `rec_featurestore`
2022-06-07 13:06:10,535 INFO: SELECT `fg0`.`article_id` `article_id`, `fg0`.`product_code` `product_code`, `fg0`.`prod_name` `prod_name`, `fg0`.`product_type_no` `product_type_no`, `fg0`.`product_type_name` `product_type_name`, `fg0`.`product_group_name` `product_group_name`, `fg0`.`graphical_appearance_no` `graphical_appearance_no`, `fg0`.`graphical_appearance_name` `graphical_appearance_name`, `fg0`.`colour_group_code` `colour_group_code`, `fg0`.`colour_group_name` `colour_group_name`, `fg0`.`perceived_colour_value_id` `perceived_colour_value_id`, `fg0`.`perceived_colour_value_name` `perceived_colour_value_name`, `fg0`.`perceived_colour_master_id` `perceived_colour_master_id`, `fg0`.`perceived_colour_master_name` `perceived_colour_master_name`, `fg0`.`department_no` `department_no`, `fg0`.`department_name` `department_name`, `fg0`.`index_code` `index_code`, `fg0`.`index_name` `index_name`, `fg0`.`index_group_no` `index_group_no`, `
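
Note that `DataFrame.merge` performs an inner join by default, so any ranking row whose `article_id` has no match in the item features is silently dropped. The sketch below, on made-up toy frames, shows one way to count such rows with `how="left"` and `indicator=True`; `validate="many_to_one"` additionally raises an error if the item side contains duplicate keys. The frame names and values are assumptions for illustration only.

```python
import pandas as pd

# Toy frames (hypothetical values): article "9" has no item features.
toy_ranking = pd.DataFrame({"article_id": ["1", "2", "9"], "label": [1, 0, 0]})
toy_items = pd.DataFrame({
    "article_id": ["1", "2"],
    "colour_group_name": ["Black", "White"],
})

# Default inner join: the unmatched row disappears without warning.
inner = toy_ranking.merge(toy_items, on="article_id")

# Left join with an indicator column makes unmatched rows visible.
checked = toy_ranking.merge(
    toy_items,
    on="article_id",
    how="left",
    indicator=True,
    validate="many_to_one",  # each article_id may appear only once in toy_items
)
n_unmatched = int((checked["_merge"] == "left_only").sum())

print(f"rows after inner join: {len(inner)}, unmatched rows: {n_unmatched}")
```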

In [19]:
import hsml

conn = hsml.connection()
mr = conn.get_model_registry()


Connected. Call `.close()` to terminate connection gracefully.


In [20]:
import tensorflow as tf

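Next, we would compute the query and candidate embeddings. This step is currently disabled: the code in the following cell is commented out, so no embeddings are added to the dataset.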

In [21]:
import numpy as np

# # Retrieve input feature names.
# candidate_model_schema = candidate_model.model_schema['input_schema']['columnar_schema']
# item_features = [feat['name'] for feat in candidate_model_schema]
# query_model_schema = query_model.model_schema['input_schema']['columnar_schema']
# query_features = [feat['name'] for feat in query_model_schema]

# def df_to_ds(df):
#     return tf.data.Dataset.from_tensor_slices({col : df[col] for col in df})

# item_ds = df_to_ds(ranking_df[item_features])
# query_ds = df_to_ds(ranking_df[query_features])

# ranking_df = pd.concat([ranking_df, item_emb_df, user_emb_df], axis=1)

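There are several "duplicated" categorical features in the dataset. For instance, `index_code` and `index_name` encode the same feature in different formats (int, string). We therefore deduplicate these features by dropping every column whose name ends in `_id`, `_no` or `_code` and keeping the `_name` variants. Note that this suffix rule also removes `customer_id` and `article_id`.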
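
Before dropping a column, it can be worth confirming that it really duplicates its counterpart. The following sketch checks whether two columns map one-to-one onto each other. The helper `is_one_to_one` and the toy values are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

def is_one_to_one(df, col_a, col_b):
    """Return True if every value of col_a pairs with exactly one value of col_b, and vice versa."""
    pairs = df[[col_a, col_b]].drop_duplicates()
    return bool(pairs[col_a].is_unique and pairs[col_b].is_unique)

# Toy item features (hypothetical values).
toy_items = pd.DataFrame({
    "index_code": ["A", "A", "B", "C"],
    "index_name": ["Ladieswear", "Ladieswear", "Menswear", "Children"],
    "department_no": [1, 2, 3, 4],
    "department_name": ["Jersey", "Jersey", "Knitwear", "Denim"],
})

# The index columns are redundant; the department columns are not,
# because two department numbers share the name "Jersey" in this toy data.
index_redundant = is_one_to_one(toy_items, "index_code", "index_name")
department_redundant = is_one_to_one(toy_items, "department_no", "department_name")
print(index_redundant, department_redundant)
```

If a pair turns out not to be one-to-one, dropping the code column loses information, and it may be better to keep it.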

In [22]:
def exclude_feat(s):
    # ID-style columns: identifiers and codes that a "_name" column already covers.
    return s.endswith(("_id", "_no", "_code"))

features_to_exclude = [col for col in ranking_df.columns if exclude_feat(col)]
features_to_exclude.append("prod_name")

ranking_df.drop(features_to_exclude, axis="columns", inplace=True)

ranking_df.head()

Unnamed: 0,age,month_sin,month_cos,label,product_type_name,product_group_name,graphical_appearance_name,colour_group_name,perceived_colour_value_name,perceived_colour_master_name,department_name,index_name,index_group_name,section_name,garment_group_name
0,42.0,-0.8660254,0.5,1,Leggings/Tights,Garment Lower body,All over pattern,Black,Dark,Black,Jersey,Ladieswear,Ladieswear,Mama,Jersey Fancy
1,25.0,0.8660254,-0.5,0,Leggings/Tights,Garment Lower body,All over pattern,Black,Dark,Black,Jersey,Ladieswear,Ladieswear,Mama,Jersey Fancy
2,29.0,-0.8660254,0.5,0,Leggings/Tights,Garment Lower body,All over pattern,Black,Dark,Black,Jersey,Ladieswear,Ladieswear,Mama,Jersey Fancy
3,36.0,1.224647e-16,-1.0,0,Leggings/Tights,Garment Lower body,All over pattern,Black,Dark,Black,Jersey,Ladieswear,Ladieswear,Mama,Jersey Fancy
4,25.0,1.0,6.123234000000001e-17,0,Leggings/Tights,Garment Lower body,All over pattern,Black,Dark,Black,Jersey,Ladieswear,Ladieswear,Mama,Jersey Fancy


In [None]:
ranking_df.to_csv(ds_name, index=False)

In [24]:
import hopsworks

connection = hopsworks.connection()
project = connection.get_project()
dataset_api = project.get_dataset_api()
uploaded_file_path = dataset_api.upload(ds_name, "Resources", overwrite=True)

Connected. Call `.close()` to terminate connection gracefully.


Uploading: 0.000%|          | 0/187122926 elapsed<00:00 remaining<?

### Next Steps

In the next notebook, we'll train a ranking model on the dataset we created in this notebook.