## Train Ranking Model

In this notebook, we will train a ranking model using gradient boosted trees (CatBoost).

Let's start by loading the datasets we created in the previous notebook.

In [1]:
import hopsworks

connection = hopsworks.connection()
project = connection.get_project()
dataset_api = project.get_dataset_api()

dataset_api.download("Resources/ranking_train.csv", overwrite=True)
dataset_api.download("Resources/ranking_validation.csv", overwrite=True)

Connected. Call `.close()` to terminate connection gracefully.


'/srv/hops/staging/private_dirs/be32cfc58e38bc51dc0e9d4be4ca61a9c9ce3be4d493628f6566b837ab451442/ranking_validation.csv'

In [2]:
import pandas as pd

X_train = pd.read_csv("ranking_train.csv")
X_val = pd.read_csv("ranking_validation.csv")
y_train = X_train.pop("label")
y_val = X_val.pop("label")

X_train.sample(5)

| | age | month_sin | month_cos | product_type_name | product_group_name | graphical_appearance_name | colour_group_name | perceived_colour_value_name | perceived_colour_master_name | department_name | index_name | index_group_name | section_name | garment_group_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6201 | 30.0 | -0.866025 | -0.5 | Cardigan | Garment Upper body | Melange | Light Purple | Dusty Light | Lilac Purple | Knitwear | Ladieswear | Ladieswear | Womens Everyday Collection | Knitwear |
| 10102 | 57.0 | -0.866025 | -0.5 | Shirt | Garment Upper body | Solid | White | Light | White | Shirt S&T | Menswear | Menswear | Men Suits & Tailoring | Shirts |
| 13552 | 32.0 | -0.866025 | -0.5 | Top | Garment Upper body | Solid | White | Light | White | Jersey | Ladieswear | Ladieswear | Womens Tailoring | Jersey Fancy |
| 6995 | 34.0 | -0.866025 | -0.5 | Bra | Underwear | Solid | Blue | Medium Dusty | Blue | Loungewear | Ladieswear | Ladieswear | Womens Nightwear, Socks & Tigh | Under-, Nightwear |
| 9151 | 38.0 | -0.866025 | -0.5 | Bra | Underwear | Solid | Light Purple | Light | Pink | Expressive Lingerie | Lingeries/Tights | Ladieswear | Womens Lingerie | Under-, Nightwear |


Let's train a model.

In [3]:
from catboost import CatBoostClassifier, Pool

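# Treat every string-typed column as a categorical feature.
cat_features = list(
    X_train.select_dtypes(include=['string', 'object']).columns
)

pool_train = Pool(X_train, y_train, cat_features=cat_features)
pool_val = Pool(X_val, y_val, cat_features=cat_features)

model = CatBoostClassifier(
    learning_rate=0.2,
    iterations=100,
    depth=10,
    scale_pos_weight=10,      # up-weight the rare positive class
    early_stopping_rounds=5,  # stop after 5 iterations without validation improvement
    use_best_model=True,      # keep only the trees up to the best validation iteration
)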

model.fit(pool_train, eval_set=pool_val)

0:	learn: 0.6922620	test: 0.6931107	best: 0.6931107 (0)	total: 65.9ms	remaining: 6.52s
1:	learn: 0.6903400	test: 0.6932205	best: 0.6931107 (0)	total: 79.1ms	remaining: 3.88s
2:	learn: 0.6887709	test: 0.6923902	best: 0.6923902 (2)	total: 95.9ms	remaining: 3.1s
3:	learn: 0.6866559	test: 0.6926094	best: 0.6923902 (2)	total: 111ms	remaining: 2.66s
4:	learn: 0.6864945	test: 0.6925962	best: 0.6923902 (2)	total: 119ms	remaining: 2.25s
5:	learn: 0.6841399	test: 0.6925168	best: 0.6923902 (2)	total: 133ms	remaining: 2.09s
6:	learn: 0.6841182	test: 0.6925195	best: 0.6923902 (2)	total: 139ms	remaining: 1.84s
7:	learn: 0.6833490	test: 0.6926753	best: 0.6923902 (2)	total: 146ms	remaining: 1.68s
Stopped by overfitting detector  (5 iterations wait)

bestTest = 0.6923902366
bestIteration = 2

Shrink model to first 3 iterations.


<catboost.core.CatBoostClassifier at 0x7f5d1aaf8850>
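
The log above shows the effect of `early_stopping_rounds=5` together with `use_best_model=True`: the validation loss last improved at iteration 2, training stopped five iterations later, and only the first three trees were kept. As a rough illustration, the stopping rule can be sketched in plain Python using the `test` losses from the log. The helper `early_stopping` is hypothetical and simplified; CatBoost's own overfitting detector may differ in its details.

```python
def early_stopping(losses, patience):
    """Return (best_iteration, stop_iteration) for per-iteration validation losses.

    Hypothetical helper: stops once `patience` iterations have passed
    without a new minimum of the validation loss.
    """
    best_loss = float("inf")
    best_iter = -1
    for i, loss in enumerate(losses):
        if loss < best_loss:
            # New best validation loss: remember where it happened.
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:
            # No improvement for `patience` iterations: stop here.
            return best_iter, i
    # Ran out of iterations without triggering the stopping rule.
    return best_iter, len(losses) - 1


# Validation ("test") losses copied from the training log above.
test_losses = [
    0.6931107, 0.6932205, 0.6923902, 0.6926094,
    0.6925962, 0.6925168, 0.6925195, 0.6926753,
]

best_iter, stop_iter = early_stopping(test_losses, patience=5)
trees_kept = best_iter + 1  # iterations are zero-indexed
print(best_iter, stop_iter, trees_kept)
```

With these inputs the sketch picks iteration 2 as the best one and stops at iteration 7, which matches `bestIteration = 2` and the "Shrink model to first 3 iterations" message in the log.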

Next, we'll evaluate how well the model performs on the validation data.

In [4]:
from sklearn.metrics import classification_report, precision_recall_fscore_support

preds = model.predict(pool_val)

precision, recall, fscore, _ = precision_recall_fscore_support(y_val, preds, average="binary")

metrics = {
    "precision" : precision,
    "recall" : recall,
    "fscore" : fscore
}

print(classification_report(y_val, preds))

              precision    recall  f1-score   support

           0       0.91      0.66      0.77      1840
           1       0.10      0.36      0.15       184

    accuracy                           0.64      2024
   macro avg       0.51      0.51      0.46      2024
weighted avg       0.84      0.64      0.71      2024



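The model performs poorly on the positive class: precision is 0.10 and recall is 0.36, giving an F1-score of 0.15 (higher is better). The training log points the same way, since the validation loss barely moved from its starting value and the best model uses only three trees. The performance could potentially be improved by adding more features to the dataset, e.g. image embeddings.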
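
To make the numbers in the report concrete, here is a minimal sketch of how precision, recall and F1 are computed for the positive class. The helper `binary_prf` and the toy labels are made up for illustration; they are not taken from the validation set.

```python
def binary_prf(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Toy, imbalanced example (2 positives, 8 negatives); values are invented.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = binary_prf(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Even in this toy case, a few false positives among the many negatives are enough to pull precision well below recall, which is the same pattern the report shows for class 1.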

Let's see which features our model considers important.

In [5]:
# Map each feature to its importance score, sorted from most to least important.
feat_to_score = dict(
    sorted(
        zip(X_train.columns, model.feature_importances_),
        key=lambda item: item[1],
        reverse=True,
    )
)

feat_to_score


{'department_name': 35.449607417242156,
 'age': 15.537052663681996,
 'index_group_name': 14.708430114295725,
 'colour_group_name': 9.304488168320994,
 'garment_group_name': 7.74209060943304,
 'perceived_colour_value_name': 6.79673137133605,
 'index_name': 6.4150935095333175,
 'perceived_colour_master_name': 4.038320460325405,
 'product_group_name': 0.008185685831313004,
 'month_sin': 0.0,
 'month_cos': 0.0,
 'product_type_name': 0.0,
 'graphical_appearance_name': 0.0,
 'section_name': 0.0}

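The model relies most heavily on `department_name` (about 35% of the total importance), followed by `age` and `index_group_name`. Five features, including both month encodings, have zero importance, which is plausible for a model that was shrunk to only three trees. None of the inputs are embedding features, so adding user and item embeddings is one change that could yield a better ranking model.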
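
As a small follow-up, the importances can be summarised without retraining. The sketch below copies the scores printed above into a plain dictionary, then lists the features that were never used and the smallest set of top features that accounts for at least 80% of the total importance. The 80% cut-off and the variable names are arbitrary choices for illustration.

```python
# Scores copied from the output above, already sorted in descending order.
feat_to_score = {
    'department_name': 35.449607417242156,
    'age': 15.537052663681996,
    'index_group_name': 14.708430114295725,
    'colour_group_name': 9.304488168320994,
    'garment_group_name': 7.74209060943304,
    'perceived_colour_value_name': 6.79673137133605,
    'index_name': 6.4150935095333175,
    'perceived_colour_master_name': 4.038320460325405,
    'product_group_name': 0.008185685831313004,
    'month_sin': 0.0,
    'month_cos': 0.0,
    'product_type_name': 0.0,
    'graphical_appearance_name': 0.0,
    'section_name': 0.0,
}

total = sum(feat_to_score.values())

# Features the trained model never split on.
unused = [name for name, score in feat_to_score.items() if score == 0.0]

# Smallest prefix of the sorted features covering at least 80% of the total.
threshold = 0.8 * total  # arbitrary cut-off, chosen for illustration
top_features, running = [], 0.0
for name, score in feat_to_score.items():
    if running >= threshold:
        break
    top_features.append(name)
    running += score

print("unused:", unused)
print("top features:", top_features)
```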

Finally, we'll save our model.

In [6]:
import joblib

joblib.dump(model, 'ranking_model.pkl')

['ranking_model.pkl']

### Upload Model to Model Registry

We'll upload the model to the Hopsworks Model Registry.

In [7]:
import hsml

# connect to Hopsworks Model Registry
conn = hsml.connection()
mr = conn.get_model_registry()

Connected. Call `.close()` to terminate connection gracefully.


In [8]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_example = X_train.sample().to_dict("records")
input_schema = Schema(X_train)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema, output_schema)

ranking_model = mr.python.create_model(
    name="ranking_model", metrics=metrics,
    model_schema=model_schema,
    input_example=input_example, description="Ranking model")

ranking_model.save("ranking_model.pkl")

Model created, explore it at https://2176a0f0-3503-11ed-be64-b1a4781e5f0a.cloud.hopsworks.ai/p/135/models/ranking_model/1


Model(name: 'ranking_model', version: 1)

### Next Steps

We have now trained both a retrieval model and a ranking model, which together allow us to generate recommendations for users. In the next notebook, we'll look at how to deploy these models with the `HSML` library.