In [1]:
import hopsworks

project = hopsworks.login()  # insert API Key from https://app.hopsworks.ai

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://hopsworks0.logicalclocks.com/p/119


## Train Ranking Model

In this notebook, we will train a ranking model using gradient boosted trees (CatBoost).

Let's start by loading the datasets we created in the previous notebook.

In [2]:
dataset_api = project.get_dataset_api()

dataset_api.download("Resources/ranking_train.csv", overwrite=True)
dataset_api.download("Resources/ranking_validation.csv", overwrite=True)

'/srv/hops/jupyter/Projects/rec/rec__meb10000/0609bf947e47d55d9b97af03ab8cc959ee09238750f25b1b1bdc7746e73fb889/ranking_validation.csv'

In [3]:
import pandas as pd

X_train = pd.read_csv("ranking_train.csv")
X_val = pd.read_csv("ranking_validation.csv")

y_train = X_train.pop("label")
y_val = X_val.pop("label")

X_train.sample(5)

Unnamed: 0,age,month_sin,month_cos,product_type_name,product_group_name,graphical_appearance_name,colour_group_name,perceived_colour_value_name,perceived_colour_master_name,department_name,index_name,index_group_name,section_name,garment_group_name
3774240,37.0,-0.5,0.8660254,Sneakers,Shoes,Solid,Black,Dark,Black,Sneakers,Ladies Accessories,Ladieswear,Womens Shoes,Shoes
5009,24.0,0.5,0.8660254,T-shirt,Garment Upper body,All over pattern,Beige,Medium Dusty,Beige,Jersey,Ladieswear,Ladieswear,H&M+,Jersey Fancy
3067453,31.0,-0.8660254,0.5,Trousers,Garment Lower body,Solid,Black,Dark,Black,Denim Trousers,Divided,Divided,Ladies Denim,Trousers Denim
4482394,24.0,-1.0,-1.83697e-16,Shorts,Garment Lower body,Denim,Dark Blue,Medium Dusty,Blue,Denim Other Garments,Divided,Divided,Ladies Denim,Trousers Denim
3783567,45.0,1.224647e-16,-1.0,Blouse,Garment Upper body,Solid,Black,Dark,Black,Tops Woven,Divided,Divided,Divided Collection,Blouses
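
The `month_sin` and `month_cos` columns look like a cyclical encoding of the calendar month. The sketch below assumes the previous notebook computed them as the sine and cosine of `2π · month / 12`; that formula is an assumption, although the sampled values above are consistent with it. The helper names `encode_month` and `decode_month` are illustrative only.

```python
import math


def encode_month(month: int) -> tuple[float, float]:
    """Map a calendar month (1-12) to a point on the unit circle."""
    angle = 2 * math.pi * month / 12
    return math.sin(angle), math.cos(angle)


def decode_month(month_sin: float, month_cos: float) -> int:
    """Recover the calendar month (1-12) from its sin/cos encoding."""
    angle = math.atan2(month_sin, month_cos)  # angle in [-pi, pi]
    month = round(angle * 12 / (2 * math.pi)) % 12
    return 12 if month == 0 else month


# (month_sin, month_cos) pairs copied from three of the sampled rows above
for month_sin, month_cos in [(-0.5, 0.8660254), (0.5, 0.8660254), (-1.0, -1.83697e-16)]:
    print(decode_month(month_sin, month_cos))
```

Encoding the month on a circle keeps December and January close together, which a plain integer month (12 vs. 1) would not.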


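Let's train a CatBoost classifier, treating every string column as a categorical feature and using the validation set for early stopping.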

In [4]:
from catboost import CatBoostClassifier, Pool

cat_features = list(
    X_train.select_dtypes(include=['string', 'object']).columns
)

pool_train = Pool(X_train, y_train, cat_features=cat_features)
pool_val = Pool(X_val, y_val, cat_features=cat_features)

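model = CatBoostClassifier(
    learning_rate=0.2,
    iterations=100,
    depth=10,
    scale_pos_weight=10,       # up-weight the rarer positive class
    early_stopping_rounds=5,   # stop after 5 rounds without validation improvement
    use_best_model=True        # keep the iteration with the best validation loss
)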

model.fit(pool_train, eval_set=pool_val)

0:	learn: 0.6931471	test: 0.6931472	best: 0.6931472 (0)	total: 3.11s	remaining: 5m 8s
1:	learn: 0.6907613	test: 0.6907587	best: 0.6907587 (1)	total: 8.15s	remaining: 6m 39s
2:	learn: 0.6843511	test: 0.6844698	best: 0.6844698 (2)	total: 12.6s	remaining: 6m 48s
3:	learn: 0.6816790	test: 0.6818446	best: 0.6818446 (3)	total: 17.6s	remaining: 7m 1s
4:	learn: 0.6790418	test: 0.6792755	best: 0.6792755 (4)	total: 21.7s	remaining: 6m 51s
5:	learn: 0.6781290	test: 0.6783837	best: 0.6783837 (5)	total: 26.2s	remaining: 6m 50s
6:	learn: 0.6781289	test: 0.6783837	best: 0.6783837 (6)	total: 27.5s	remaining: 6m 5s
7:	learn: 0.6745419	test: 0.6748731	best: 0.6748731 (7)	total: 31.5s	remaining: 6m 2s
8:	learn: 0.6724568	test: 0.6728372	best: 0.6728372 (8)	total: 35.6s	remaining: 5m 59s
9:	learn: 0.6718549	test: 0.6722387	best: 0.6722387 (9)	total: 39.9s	remaining: 5m 59s
10:	learn: 0.6701141	test: 0.6705919	best: 0.6705919 (10)	total: 43.8s	remaining: 5m 54s
11:	learn: 0.6696620	test: 0.6701440	best: 0.

<catboost.core.CatBoostClassifier at 0x7eff01db6fb0>
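
The training cell sets `early_stopping_rounds=5` and `use_best_model=True`. As a rough illustration of the idea, the sketch below tracks the best validation loss and stops once it has not improved for a given number of rounds. It is a simplified stand-in, not CatBoost's actual overfitting detector; the function name `best_iteration` and the loss values are hypothetical.

```python
def best_iteration(val_losses, patience):
    """Return (best_index, stop_index) for a sequence of validation losses.

    Training stops at the first iteration that is `patience` rounds past
    the best one seen so far, or at the end of the sequence.
    """
    best_idx = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx, i
    return best_idx, len(val_losses) - 1


# Hypothetical losses: improvement until iteration 2, then a plateau
losses = [0.690, 0.680, 0.670, 0.671, 0.672, 0.673, 0.674, 0.675, 0.660]
best, stopped = best_iteration(losses, patience=5)
print(f"best iteration: {best}, stopped at: {stopped}")
```

Note that the later, lower loss at the end of the hypothetical list is never reached: early stopping trades a possible late improvement for shorter training time.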

Next, we'll evaluate how well the model performs on the validation data.

In [5]:
from sklearn.metrics import classification_report, precision_recall_fscore_support

preds = model.predict(pool_val)

precision, recall, fscore, _ = precision_recall_fscore_support(y_val, preds, average="binary")

metrics = {
    "precision" : precision,
    "recall" : recall,
    "fscore" : fscore
}

print(classification_report(y_val, preds))

              precision    recall  f1-score   support

           0       0.94      0.57      0.71    527930
           1       0.13      0.65      0.22     52793

    accuracy                           0.58    580723
   macro avg       0.54      0.61      0.47    580723
weighted avg       0.87      0.58      0.67    580723
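
To make the report easier to read, here is a minimal sketch of how precision, recall and F1 for the positive class follow from the prediction counts. The helper `binary_prf` and the toy labels are illustrative and not part of the notebook; in the notebook itself, `precision_recall_fscore_support` does this work.

```python
def binary_prf(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example: 3 positives, 7 negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 1, 0, 0, 0, 0]
precision, recall, f1 = binary_prf(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so plugging the rounded positive-class values from the report (0.13 and 0.65) into the same formula gives roughly 0.22.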



The model has a low F1-score on the positive class (0.22; higher is better): recall is 0.65, but precision is only 0.13, so most candidates predicted as positive are not. Performance could potentially be improved by adding more features to the dataset, e.g. image embeddings.
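
The classes are imbalanced: the support column of the report shows ten negatives for every positive in the validation set. A common starting point for a positive-class weight is the ratio of negatives to positives, sketched below. The helper name `scale_pos_weight` is illustrative, and whether the training split has the same ratio as the validation split is an assumption.

```python
from collections import Counter


def scale_pos_weight(labels, positive=1):
    """Return the negatives-to-positives ratio for a list of labels."""
    counts = Counter(labels)
    n_pos = counts[positive]
    n_neg = sum(counts.values()) - n_pos
    if n_pos == 0:
        raise ValueError("labels contain no positive examples")
    return n_neg / n_pos


# Support counts taken from the validation report above
labels = [0] * 527930 + [1] * 52793
print(scale_pos_weight(labels))
```

Up-weighting positives this way typically raises recall at the expense of precision, which is worth keeping in mind when reading the positive-class scores.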

Let's see which features our model considers important.

In [6]:
# Map each feature name to its importance score
feat_to_score = dict(zip(X_train.columns, model.feature_importances_))

# Sort features from most to least important
feat_to_score = dict(
    sorted(
        feat_to_score.items(),
        key=lambda item: item[1],
        reverse=True
    )
)

feat_to_score

{'product_type_name': 15.87629939347693,
 'month_sin': 15.79725897870585,
 'month_cos': 12.214018360626726,
 'product_group_name': 9.691322782725956,
 'age': 7.50713547362113,
 'department_name': 7.478385013417299,
 'index_name': 6.381776981099198,
 'garment_group_name': 5.514835119565458,
 'perceived_colour_value_name': 5.251788315022826,
 'section_name': 3.847254588462731,
 'graphical_appearance_name': 3.525305037012214,
 'perceived_colour_master_name': 2.7111396725528145,
 'colour_group_name': 2.2097454439160678,
 'index_group_name': 1.993734839794825}

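The model places the highest importance on `product_type_name` and the cyclical month features (`month_sin`, `month_cos`), followed by `product_group_name` and `age`. Note that the dataset contains no user or item embedding features; adding well-trained embeddings as features could yield a better ranking model.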
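
The scores above sum to roughly 100, so they can be read as percentage shares. The sketch below groups them by where each feature comes from. The scores are copied from the output above and rounded to four decimals; the grouping into customer, transaction date and article features is an assumption about the origin of each column, and `feature_group` is a hypothetical helper.

```python
from collections import defaultdict

# Importance scores copied from the output above (rounded)
importances = {
    "product_type_name": 15.8763,
    "month_sin": 15.7973,
    "month_cos": 12.2140,
    "product_group_name": 9.6913,
    "age": 7.5071,
    "department_name": 7.4784,
    "index_name": 6.3818,
    "garment_group_name": 5.5148,
    "perceived_colour_value_name": 5.2518,
    "section_name": 3.8473,
    "graphical_appearance_name": 3.5253,
    "perceived_colour_master_name": 2.7111,
    "colour_group_name": 2.2097,
    "index_group_name": 1.9937,
}


def feature_group(name):
    """Assign a feature to an assumed source group."""
    if name == "age":
        return "customer"
    if name.startswith("month_"):
        return "transaction date"
    return "article"


group_totals = defaultdict(float)
for name, score in importances.items():
    group_totals[feature_group(name)] += score

for group, total in sorted(group_totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{group:>16}: {total:5.1f}")
```

Under this grouping, article attributes account for the largest share of importance, the two month features come second, and `age` is the only customer feature.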

Finally, we'll save our model.

In [7]:
import joblib

joblib.dump(model, 'ranking_model.pkl')

['ranking_model.pkl']

### Upload Model to Model Registry

We'll upload the model to the Hopsworks Model Registry, together with its validation metrics, its input and output schema, and an input example.

In [8]:
# connect to Hopsworks Model Registry
mr = project.get_model_registry()

Connected. Call `.close()` to terminate connection gracefully.


In [9]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_example = X_train.sample().to_dict("records")
input_schema = Schema(X_train)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema, output_schema)

ranking_model = mr.python.create_model(
    name="ranking_model", metrics=metrics,
    model_schema=model_schema,
    input_example=input_example,
    description="Ranking model that scores item candidates")

ranking_model.save("ranking_model.pkl")

Model created, explore it at https://hopsworks0.logicalclocks.com/p/119/models/ranking_model/1


Model(name: 'ranking_model', version: 1)

### Next Steps

We have now trained both a retrieval model and a ranking model, which together allow us to generate recommendations for users. In the next notebook, we'll look at how to deploy these models with the `HSML` library.