In [2]:
import hopsworks

project = hopsworks.login()  # insert API Key from https://app.hopsworks.ai

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://b0636a00-6406-11ed-88f4-3779517939b7.cloud.hopsworks.ai:443/p/119


## Train Ranking Model

In this notebook, we will train a ranking model using gradient boosted trees. 

Let's start by loading the datasets we created in the previous notebook.

In [3]:
dataset_api = project.get_dataset_api()

dataset_api.download("Resources/ranking_train.csv", overwrite=True)
dataset_api.download("Resources/ranking_validation.csv", overwrite=True)

Downloading: 0.000%|          | 0/674667596 elapsed<00:00 remaining<?

Downloading: 0.000%|          | 0/83896790 elapsed<00:00 remaining<?

'/home/javierdlrm/Workspace/Hopsworks/hopsworks-tutorials-1/rec/ranking_validation.csv'

In [4]:
import pandas as pd

X_train = pd.read_csv("ranking_train.csv")
X_val = pd.read_csv("ranking_validation.csv")

y_train = X_train.pop("label")
y_val = X_val.pop("label")

X_train.sample(5)

| | age | month_sin | month_cos | product_type_name | product_group_name | graphical_appearance_name | colour_group_name | perceived_colour_value_name | perceived_colour_master_name | department_name | index_name | index_group_name | section_name | garment_group_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1813705 | 44.0 | 0.866025 | -0.5 | Dress | Garment Full body | Denim | Blue | Light | Blue | Dresses | Divided | Divided | Divided Collection | Dresses Ladies |
| 2254877 | 25.0 | 0.5 | 0.866025 | Trousers | Garment Lower body | All over pattern | Blue | Medium Dusty | Blue | Trouser | Ladieswear | Ladieswear | Womens Everyday Collection | Trousers |
| 1155760 | 32.0 | 0.5 | 0.866025 | Underwear Tights | Socks & Tights | All over pattern | Black | Dark | Black | Tights basic | Lingeries/Tights | Ladieswear | Womens Nightwear, Socks & Tigh | Socks and Tights |
| 892496 | 18.0 | 0.5 | 0.866025 | Dress | Garment Full body | Solid | Black | Dark | Black | Basic 1 | Divided | Divided | Divided Basics | Jersey Basic |
| 2799615 | 25.0 | 0.0 | 1.0 | Jacket | Garment Upper body | Solid | Blue | Medium | Blue | Outdoor/Blazers | Divided | Divided | Divided Collection | Outdoor |


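Let's train a model. We use a CatBoost classifier, treat all string columns as categorical features, and set `scale_pos_weight=10` to up-weight the positive class, which is outnumbered 10 to 1 in the validation data (see the classification report further down). Training stops early if the validation loss has not improved for 5 iterations.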

In [5]:
from catboost import CatBoostClassifier, Pool

cat_features = list(
    X_train.select_dtypes(include=['string', 'object']).columns
)

pool_train = Pool(X_train, y_train, cat_features=cat_features)
pool_val = Pool(X_val, y_val, cat_features=cat_features)

model = CatBoostClassifier(
    learning_rate=0.2,
    iterations=100,
    depth=10,
    scale_pos_weight=10,
    early_stopping_rounds=5,
    use_best_model=True
)

model.fit(pool_train, eval_set=pool_val)

0:	learn: 0.6869087	test: 0.6870068	best: 0.6870068 (0)	total: 1.77s	remaining: 2m 54s
1:	learn: 0.6850737	test: 0.6852878	best: 0.6852878 (1)	total: 3.13s	remaining: 2m 33s
2:	learn: 0.6850736	test: 0.6852879	best: 0.6852878 (1)	total: 3.46s	remaining: 1m 51s
3:	learn: 0.6820082	test: 0.6823183	best: 0.6823183 (3)	total: 5.04s	remaining: 2m 1s
4:	learn: 0.6785049	test: 0.6788925	best: 0.6788925 (4)	total: 6.38s	remaining: 2m 1s
5:	learn: 0.6776948	test: 0.6781578	best: 0.6781578 (5)	total: 7.41s	remaining: 1m 56s
6:	learn: 0.6752486	test: 0.6758257	best: 0.6758257 (6)	total: 8.4s	remaining: 1m 51s
7:	learn: 0.6740182	test: 0.6746670	best: 0.6746670 (7)	total: 9.23s	remaining: 1m 46s
8:	learn: 0.6723343	test: 0.6730809	best: 0.6730809 (8)	total: 10.1s	remaining: 1m 41s
9:	learn: 0.6723343	test: 0.6730809	best: 0.6730809 (8)	total: 10.3s	remaining: 1m 32s
10:	learn: 0.6704537	test: 0.6712865	best: 0.6712865 (10)	total: 11s	remaining: 1m 29s
11:	learn: 0.6700372	test: 0.6709268	best: 0.6

<catboost.core.CatBoostClassifier at 0x7f3834e3e0d0>
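
The `best:` column in the training log reflects the early-stopping logic enabled by `early_stopping_rounds=5` and `use_best_model=True`. The snippet below is a simplified sketch of that idea, not CatBoost's actual implementation: the function name `early_stop` and the loss values are hypothetical and chosen only for illustration.

```python
def early_stop(val_losses, patience):
    """Return (best_iteration, stopped_at) for a sequence of validation losses.

    Training stops once `patience` iterations have passed without
    improving on the best validation loss seen so far.
    """
    best_iter, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember where it occurred
            best_iter, best_loss = i, loss
        elif i - best_iter >= patience:
            # No improvement for `patience` iterations: stop here
            return best_iter, i
    # The loss sequence ended before the stopping rule triggered
    return best_iter, len(val_losses) - 1


# Hypothetical validation losses; the last value is never reached
losses = [0.70, 0.68, 0.67, 0.671, 0.672, 0.673, 0.674, 0.675, 0.60]
best_iter, stopped_at = early_stop(losses, patience=5)
print(best_iter, stopped_at)
```

With `use_best_model=True`, the final model would keep only the trees up to the best iteration rather than all trees built before stopping.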

Next, we'll evaluate how well the model performs on the validation data.

In [6]:
from sklearn.metrics import classification_report, precision_recall_fscore_support

preds = model.predict(pool_val)

precision, recall, fscore, _ = precision_recall_fscore_support(y_val, preds, average="binary")

metrics = {
    "precision" : precision,
    "recall" : recall,
    "fscore" : fscore
}

print(classification_report(y_val, preds))

              precision    recall  f1-score   support

           0       0.94      0.57      0.71    525210
           1       0.13      0.65      0.22     52521

    accuracy                           0.57    577731
   macro avg       0.54      0.61      0.46    577731
weighted avg       0.87      0.57      0.66    577731



The model has a low F1-score on the positive class (0.22). Recall is reasonable at 0.65, but precision is only 0.13, so most of the items predicted as positive are false positives. The performance could potentially be improved by adding more features to the dataset, e.g. image embeddings.
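
To make the relationship between these metrics concrete, here is a minimal sketch that computes precision, recall and F1 for the positive class from raw labels. The helper `binary_prf` and the toy labels are hypothetical; the notebook itself relies on `precision_recall_fscore_support` from scikit-learn.

```python
def binary_prf(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Toy data: 2 positives and 10 negatives (imbalanced, like the real dataset)
y_true = [1, 1] + [0] * 10
# The model finds one positive but also flags four negatives as positive
y_pred = [1, 0] + [1] * 4 + [0] * 6

precision, recall, f1 = binary_prf(y_true, y_pred)
print(precision, recall, f1)
```

Even a few false positives per true positive pull precision, and with it F1, down sharply when negatives far outnumber positives.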

Let's see which features our model considers important.

In [7]:
# Pair each feature with its importance score and sort in descending order
feat_to_score = dict(
    sorted(
        zip(X_train.columns, model.feature_importances_),
        key=lambda item: item[1],
        reverse=True
    )
)

feat_to_score

{'product_type_name': 16.089815083678165,
 'month_sin': 15.215697873860751,
 'month_cos': 14.327220746946468,
 'product_group_name': 10.984299334165481,
 'age': 7.366683690336384,
 'garment_group_name': 6.169202336103466,
 'index_name': 6.158166644254398,
 'department_name': 5.889640566283271,
 'perceived_colour_value_name': 4.652374917327656,
 'section_name': 4.070205049786829,
 'graphical_appearance_name': 3.10524045270199,
 'perceived_colour_master_name': 2.5523558214357744,
 'index_group_name': 1.7757813603025683,
 'colour_group_name': 1.6433161228168147}

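The model places the highest importance on `product_type_name` and on the seasonal features `month_sin` and `month_cos`, followed by `product_group_name` and `age`. Note that this dataset contains no user or item embedding features; adding them could give the ranking model more signal to work with.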
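
The `month_sin` and `month_cos` features that rank near the top look like a cyclical encoding of the purchase month. The notebook that creates them is not shown here, so the formula below is an assumption; it does reproduce the sample values displayed earlier (0.5 and 0.866025) if months are numbered 1 to 12. The helper name `encode_month` is hypothetical.

```python
import math


def encode_month(month):
    """Map a month number (assumed 1-12) to a point on the unit circle."""
    angle = 2 * math.pi * month / 12  # assumed formula, not taken from the source
    return math.sin(angle), math.cos(angle)


january = encode_month(1)
february = encode_month(2)
december = encode_month(12)

# On the circle, December is as close to January as February is,
# which a plain 1-12 integer encoding would not capture.
dist_dec_jan = math.dist(december, january)
dist_jan_feb = math.dist(january, february)
print(january, dist_dec_jan, dist_jan_feb)
```

Encoding the month this way lets a model treat the turn of the year as continuous rather than as a jump from 12 back to 1.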

Finally, we'll save our model.

In [8]:
import joblib

joblib.dump(model, 'ranking_model.pkl')

['ranking_model.pkl']

### Upload Model to Model Registry

We'll upload the model to the Hopsworks Model Registry.

In [9]:
# connect to Hopsworks Model Registry
mr = project.get_model_registry()

Connected. Call `.close()` to terminate connection gracefully.


In [10]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_example = X_train.sample().to_dict("records")
input_schema = Schema(X_train)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema, output_schema)

ranking_model = mr.python.create_model(
    name="ranking_model", metrics=metrics,
    model_schema=model_schema,
    input_example=input_example,
    description="Ranking model that scores item candidates")

ranking_model.save("ranking_model.pkl")

  0%|          | 0/6 [00:00<?, ?it/s]

Model created, explore it at https://b0636a00-6406-11ed-88f4-3779517939b7.cloud.hopsworks.ai:443/p/119/models/ranking_model/1


Model(name: 'ranking_model', version: 1)

### Next Steps

Now that we have trained both a retrieval and a ranking model, we can generate recommendations for users. In the next notebook, we'll take a look at how we can deploy these models with the `HSML` library.
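
As a rough illustration of how the ranking model's output might be turned into recommendations, the sketch below sorts candidate items by their predicted score and keeps the top few. The function `top_k`, the item names and the scores are all hypothetical; in practice the scores could come from something like `model.predict_proba(...)[:, 1]`, but how the next notebook does this is an assumption here.

```python
def top_k(candidates, scores, k):
    """Return the k candidates with the highest scores, best first."""
    ranked = sorted(
        zip(candidates, scores),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [item for item, _ in ranked[:k]]


# Hypothetical candidates from the retrieval step and their ranking scores
candidate_items = ["item_a", "item_b", "item_c", "item_d"]
positive_scores = [0.12, 0.81, 0.47, 0.66]

recommendations = top_k(candidate_items, positive_scores, k=2)
print(recommendations)
```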