In [2]:
import hopsworks

project = hopsworks.login()  # insert API Key from https://app.hopsworks.ai

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://b0636a00-6406-11ed-88f4-3779517939b7.cloud.hopsworks.ai:443/p/119


## Requirements

Install libraries:

* **tensorflow** (version 2.9.1)

In [3]:
!pip install --quiet tensorflow==2.9.1

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
nbconvert 7.0.0 requires jinja2>=3.0, but you have jinja2 2.11.3 which is incompatible.
nbconvert 7.0.0 requires mistune<3,>=2.0.3, but you have mistune 0.8.4 which is incompatible.
hsfs 3.1.0.dev1 requires markupsafe<2.1.0, but you have markupsafe 2.1.1 which is incompatible.

[notice] A new release of pip available: 22.2.2 -> 22.3.1
[notice] To update, run: python3.9 -m pip install --upgrade pip


## Create Ranking Dataset

In this notebook, we'll create a dataset for our ranking model. Since our dataset consists only of positive user-item interactions (transactions), we need to do negative sampling. Otherwise the model would never see a negative example and might simply recommend all items to all users.

This notebook can generate both training and validation data. To generate both datasets, run the notebook once with `USE_TRAIN = True`, then set `USE_TRAIN` below to `False` and run the notebook again.

In [4]:
fs = project.get_feature_store()

feature_view = fs.get_feature_view("retrieval", version=1)

train_df, val_df, test_df, y_train, y_val, y_test = feature_view.get_train_validation_test_split(training_dataset_version=1)



Connected. Call `.close()` to terminate connection gracefully.


In [5]:
train_df["article_id"] = train_df["article_id"].astype(str) # to be deleted
val_df["article_id"] = val_df["article_id"].astype(str)
test_df["article_id"] = test_df["article_id"].astype(str)

In [14]:
# Repeat all cells below with both USE_TRAIN = True and USE_TRAIN = False

USE_TRAIN = False

split, df = ("train", train_df) if USE_TRAIN else ("validation", val_df)
df['article_id'] = df['article_id'].astype(str)

ds_name = f"ranking_{split}.csv"

These are the true positive pairs.

In [15]:
query_features = ["customer_id", "age", "month_sin", "month_cos"]

positive_pairs = df[query_features + ["article_id"]].copy()
positive_pairs

Unnamed: 0,customer_id,age,month_sin,month_cos,article_id
0,105f4c9f7ac528b9440bb3126f4ac2cce18992121f5f71...,35.0,-5.000000e-01,8.660254e-01,179123001
1,105f4c9f7ac528b9440bb3126f4ac2cce18992121f5f71...,35.0,-1.000000e+00,-1.836970e-16,617835003
2,105f4c9f7ac528b9440bb3126f4ac2cce18992121f5f71...,35.0,-1.000000e+00,-1.836970e-16,622444003
3,105f4c9f7ac528b9440bb3126f4ac2cce18992121f5f71...,35.0,-1.000000e+00,-1.836970e-16,653567001
4,105f4c9f7ac528b9440bb3126f4ac2cce18992121f5f71...,35.0,-8.660254e-01,5.000000e-01,672491002
...,...,...,...,...,...
52516,f57a3de3dbcfc1f4026f06c36b5e1d9802a85f3503a59b...,23.0,0.000000e+00,1.000000e+00,507909001
52517,f57a3de3dbcfc1f4026f06c36b5e1d9802a85f3503a59b...,23.0,-5.000000e-01,-8.660254e-01,711823004
52518,f6eeaa81b9dff0256beb2b1c31268626ae7ecb7669788f...,26.0,-8.660254e-01,5.000000e-01,630116001
52519,f6eeaa81b9dff0256beb2b1c31268626ae7ecb7669788f...,26.0,5.000000e-01,-8.660254e-01,829643001


In [16]:
n_neg = len(positive_pairs)*10

negative_pairs = positive_pairs[query_features]\
    .sample(n_neg, replace=True, random_state=1)\
    .reset_index(drop=True)

negative_pairs["article_id"] = positive_pairs["article_id"]\
    .sample(n_neg, replace=True, random_state=2).to_numpy()

negative_pairs

Unnamed: 0,customer_id,age,month_sin,month_cos,article_id
0,950d0078a50b4174aad1270c475b2ad0e0e149dc767d27...,63.0,-8.660254e-01,5.000000e-01,687704022
1,a61e6bc03fa00146e2940b728dd331c36df956a485d747...,57.0,5.000000e-01,-8.660254e-01,843805003
2,654ccec13aeb2316ff6a615fa629cd606094a766b98022...,50.0,-1.000000e+00,-1.836970e-16,863001001
3,28a6df5f0cce682959b1cb23dc0958761dd3d470cc9706...,45.0,-1.000000e+00,-1.836970e-16,547780004
4,8fab1eb72edc2c7ba438eb7051871e81cbb6344391d557...,46.0,1.224647e-16,-1.000000e+00,928210002
...,...,...,...,...,...
525205,ee87369883b00e0b94b43cc240c2f5574d31e1124eb021...,27.0,0.000000e+00,1.000000e+00,879781002
525206,66785d1aeb70927d8cd69d766f3d787a19cf771965876d...,45.0,5.000000e-01,-8.660254e-01,711982001
525207,0c44b5697c84f49c39712df9b5452ec131cf6cf4090e35...,34.0,-5.000000e-01,-8.660254e-01,856109001
525208,9186f3dbc99115d92c82e78e455f928f54abaa2b08cde5...,39.0,1.224647e-16,-1.000000e+00,442915014
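
Because customers and articles are sampled independently, a sampled "negative" pair can coincide with a real transaction. The notebook does not filter these out; the sketch below shows one way it could be done, using a left merge with `indicator=True`. The toy frames and the name `clean_negatives` are made up for illustration.

```python
import pandas as pd

# Toy data standing in for the notebook's frames (values are made up).
positive_pairs = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2"],
    "article_id": ["a1", "a2", "a1"],
})
negative_pairs = pd.DataFrame({
    "customer_id": ["c1", "c2", "c2", "c1"],
    "article_id": ["a1", "a2", "a1", "a3"],
})

# Unique (customer, article) pairs that were actually observed.
observed = positive_pairs[["customer_id", "article_id"]].drop_duplicates()

# Left-join the negatives against the observed pairs. The `_merge` column
# says whether a sampled pair also occurs as a real transaction.
flagged = negative_pairs.merge(
    observed, on=["customer_id", "article_id"], how="left", indicator=True
)

# Keep only the pairs that never occurred as a transaction.
clean_negatives = (
    flagged[flagged["_merge"] == "left_only"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)
print(clean_negatives)
```

With a large catalogue such collisions are probably rare, so whether the extra step is worth it depends on the data.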


In [17]:
import pandas as pd

# Add labels.
positive_pairs["label"] = 1
negative_pairs["label"] = 0

# Concatenate.
ranking_df = pd.concat([positive_pairs, negative_pairs], ignore_index=True)
ranking_df.head()

Unnamed: 0,customer_id,age,month_sin,month_cos,article_id,label
0,105f4c9f7ac528b9440bb3126f4ac2cce18992121f5f71...,35.0,-0.5,0.8660254,179123001,1
1,105f4c9f7ac528b9440bb3126f4ac2cce18992121f5f71...,35.0,-1.0,-1.83697e-16,617835003,1
2,105f4c9f7ac528b9440bb3126f4ac2cce18992121f5f71...,35.0,-1.0,-1.83697e-16,622444003,1
3,105f4c9f7ac528b9440bb3126f4ac2cce18992121f5f71...,35.0,-1.0,-1.83697e-16,653567001,1
4,105f4c9f7ac528b9440bb3126f4ac2cce18992121f5f71...,35.0,-0.866025,0.5,672491002,1
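
As an alternative to re-running the notebook with a different `USE_TRAIN` value, the steps so far (positive pairs, sampled negatives, labels, concatenation) could be wrapped in a function and applied to each split. This is only a sketch: `make_ranking_df`, `neg_ratio`, and `seed` are hypothetical names, and the toy frame stands in for `train_df` and `val_df`.

```python
import pandas as pd

def make_ranking_df(df, query_features, neg_ratio=10, seed=1):
    """Hypothetical helper: positives plus `neg_ratio` sampled negatives per positive."""
    positives = df[query_features + ["article_id"]].copy()
    n_neg = len(positives) * neg_ratio

    # Sample query rows and article ids independently, as in the cells above.
    negatives = (
        positives[query_features]
        .sample(n_neg, replace=True, random_state=seed)
        .reset_index(drop=True)
    )
    negatives["article_id"] = (
        positives["article_id"]
        .sample(n_neg, replace=True, random_state=seed + 1)
        .to_numpy()
    )

    positives["label"] = 1
    negatives["label"] = 0
    return pd.concat([positives, negatives], ignore_index=True)

# Toy frame standing in for train_df / val_df (values are made up).
toy = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "age": [35.0, 23.0],
    "article_id": ["a1", "a2"],
})
splits = {"train": toy, "validation": toy}
ranking = {
    name: make_ranking_df(frame, ["customer_id", "age"])
    for name, frame in splits.items()
}
print({name: len(frame) for name, frame in ranking.items()})
```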


In [18]:
# Merge with item features.
articles_fg = fs.get_feature_group("articles")
item_df = articles_fg.read()
item_df.drop_duplicates(subset="article_id", inplace=True)
ranking_df = ranking_df.merge(item_df, on="article_id")



2022-11-15 11:04:13,866 INFO: USE `recsys_featurestore`
2022-11-15 11:04:14,820 INFO: SELECT `fg0`.`article_id` `article_id`, `fg0`.`product_code` `product_code`, `fg0`.`prod_name` `prod_name`, `fg0`.`product_type_no` `product_type_no`, `fg0`.`product_type_name` `product_type_name`, `fg0`.`product_group_name` `product_group_name`, `fg0`.`graphical_appearance_no` `graphical_appearance_no`, `fg0`.`graphical_appearance_name` `graphical_appearance_name`, `fg0`.`colour_group_code` `colour_group_code`, `fg0`.`colour_group_name` `colour_group_name`, `fg0`.`perceived_colour_value_id` `perceived_colour_value_id`, `fg0`.`perceived_colour_value_name` `perceived_colour_value_name`, `fg0`.`perceived_colour_master_id` `perceived_colour_master_id`, `fg0`.`perceived_colour_master_name` `perceived_colour_master_name`, `fg0`.`department_no` `department_no`, `fg0`.`department_name` `department_name`, `fg0`.`index_code` `index_code`, `fg0`.`index_name` `index_name`, `fg0`.`index_group_no` `index_group_no`
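
Note that `merge` defaults to an inner join, so any pair whose `article_id` is missing from the item features is silently dropped. A minimal sketch of how one might check for that, on made-up toy frames, uses `how="left"` with `indicator=True`; `validate="many_to_one"` additionally raises if the item table has duplicate keys.

```python
import pandas as pd

# Toy frames (values are made up); "a9" has no item features.
ranking_df = pd.DataFrame({
    "article_id": ["a1", "a2", "a9"],
    "label": [1, 0, 0],
})
item_df = pd.DataFrame({
    "article_id": ["a1", "a2", "a3"],
    "colour_group_name": ["Black", "White", "Blue"],
})

# Keep every ranking row and record whether a matching item was found.
merged = ranking_df.merge(
    item_df,
    on="article_id",
    how="left",
    validate="many_to_one",  # raises if article_id is not unique in item_df
    indicator=True,
)
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(f"{n_unmatched} of {len(merged)} rows have no item features")
```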



Next, we remove redundant columns from the merged dataset.

There are several "duplicated" categorical features in the dataset. For instance, `index_code` and `index_name` encode the same feature in different formats (int, string). We therefore deduplicate these features by keeping the string `*_name` columns and dropping the `*_id`, `*_no`, and `*_code` columns.

In [19]:
def exclude_feat(s):
    return s.endswith("_id") or s.endswith("_no") or s.endswith("_code")

features_to_exclude = [col for col in ranking_df.columns if exclude_feat(col)]
features_to_exclude.append("prod_name")

ranking_df.drop(features_to_exclude, axis="columns", inplace=True)
ranking_df.head()

Unnamed: 0,age,month_sin,month_cos,label,product_type_name,product_group_name,graphical_appearance_name,colour_group_name,perceived_colour_value_name,perceived_colour_master_name,department_name,index_name,index_group_name,section_name,garment_group_name
0,35.0,-0.5,0.8660254,1,Leggings/Tights,Garment Lower body,Solid,Black,Dark,Black,Jersey Basic,Ladieswear,Ladieswear,Womens Everyday Basics,Jersey Basic
1,33.0,-0.866025,0.5,1,Leggings/Tights,Garment Lower body,Solid,Black,Dark,Black,Jersey Basic,Ladieswear,Ladieswear,Womens Everyday Basics,Jersey Basic
2,64.0,-1.0,-1.83697e-16,1,Leggings/Tights,Garment Lower body,Solid,Black,Dark,Black,Jersey Basic,Ladieswear,Ladieswear,Womens Everyday Basics,Jersey Basic
3,54.0,0.866025,0.5,1,Leggings/Tights,Garment Lower body,Solid,Black,Dark,Black,Jersey Basic,Ladieswear,Ladieswear,Womens Everyday Basics,Jersey Basic
4,33.0,-0.5,0.8660254,1,Leggings/Tights,Garment Lower body,Solid,Black,Dark,Black,Jersey Basic,Ladieswear,Ladieswear,Womens Everyday Basics,Jersey Basic
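
Before dropping a code column, one could verify that it really is redundant with its name column, that is, that the two map one-to-one. The sketch below does this with `groupby(...).nunique()`. The helper `is_one_to_one` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy item table (values are made up).
items = pd.DataFrame({
    "index_code": [1, 1, 2, 3],
    "index_name": ["Ladieswear", "Ladieswear", "Menswear", "Sport"],
    "department_name": ["Jersey Basic", "Knitwear", "Jersey Basic", "Shoes"],
})

def is_one_to_one(df, a, b):
    """True if each value of `a` maps to exactly one value of `b` and vice versa."""
    a_to_b = df.groupby(a)[b].nunique().max() == 1
    b_to_a = df.groupby(b)[a].nunique().max() == 1
    return bool(a_to_b and b_to_a)

print(is_one_to_one(items, "index_code", "index_name"))       # redundant pair
print(is_one_to_one(items, "index_code", "department_name"))  # not redundant
```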


In [20]:
ranking_df.to_csv(ds_name, index=False)

In [21]:
dataset_api = project.get_dataset_api()
uploaded_file_path = dataset_api.upload(ds_name, "Resources", overwrite=True)


### Next Steps

In the next notebook, we'll train a ranking model on the dataset we created in this notebook.