## <span style="color:#ff5f27">👨🏻‍🏫 Train Ranking Model </span>

In this notebook, you will train a ranking model using gradient boosted trees. 

## <span style="color:#ff5f27">📝 Imports </span>

In [1]:
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import classification_report, precision_recall_fscore_support
import joblib

## <span style="color:#ff5f27">🔮 Connect to Hopsworks Feature Store </span>

In [2]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://snurran.hops.works/p/11383


In [4]:
users_fg = fs.get_feature_group(
    name="users",
    version=1,
)

videos_fg = fs.get_feature_group(
    name="videos",
    version=1,
)

rank_fg = fs.get_feature_group(
    name="ranking",
    version=1,
)

## <span style="color:#ff5f27">⚙️ Feature View Creation </span>

In [5]:
# Select all user features
selected_features_users = users_fg.select_all()

fs.get_or_create_feature_view(
    name='users',
    query=selected_features_users,
    version=1,
)

Feature view created successfully, explore it at 
https://snurran.hops.works/p/11383/fs/11331/fv/users/version/1


<hsfs.feature_view.FeatureView at 0x7f283ba2d0c0>

In [6]:
# Select all video features
selected_features_videos = videos_fg.select_all()

fs.get_or_create_feature_view(
    name='videos',
    query=selected_features_videos,
    version=1,
)

Feature view created successfully, explore it at 
https://snurran.hops.works/p/11383/fs/11331/fv/videos/version/1


<hsfs.feature_view.FeatureView at 0x7f283ba4c130>

In [7]:
# Select ranking features, excluding the ID columns
selected_features_ranking = rank_fg.select_except(["user_id", "video_id"])

feature_view_ranking = fs.get_or_create_feature_view(
    name='ranking',
    query=selected_features_ranking,
    labels=["label"],
    version=1,
)

Feature view created successfully, explore it at 
https://snurran.hops.works/p/11383/fs/11331/fv/ranking/version/1


## <span style="color:#ff5f27">🗄️ Training Data Loading </span>

In [8]:
X_train, X_val, y_train, y_val = feature_view_ranking.train_test_split(
    test_size=0.1,
    description='Ranking training dataset',
)

X_train.head(3)

Finished: Reading data from Hopsworks, using ArrowFlight (7.98s) 




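|   | category | views | likes | video_length | upload_date | gender | age | country |
|---|----------|-------|-------|--------------|-------------|--------|-----|---------|
| 0 | Travel | 88173 | 30625 | 236 | 2022-08-16 | Female | 81 | Congo - Brazzaville |
| 1 | Sports | 66526 | 515 | 154 | 2024-02-18 | Female | 22 | Ethiopia |
| 2 | Comedy | 14472 | 3323 | 226 | 2022-07-28 | Female | 69 | Haiti |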


In [9]:
y_train.head(3)

|   | label |
|---|-------|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |


## <span style="color:#ff5f27">🏃🏻‍♂️ Model Training </span>

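Next, you'll train a CatBoost classifier on the training split, using the validation split as the evaluation set for early stopping.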

In [10]:
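# Treat all string/object columns as categorical features
cat_features = list(
    X_train.select_dtypes(include=['string', 'object']).columns
)

pool_train = Pool(X_train, y_train, cat_features=cat_features)
pool_val = Pool(X_val, y_val, cat_features=cat_features)

model = CatBoostClassifier(
    learning_rate=0.2,
    iterations=100,
    depth=10,
    scale_pos_weight=10,       # weight multiplier for the positive class (label 1)
    early_stopping_rounds=5,   # stop after 5 iterations without validation improvement
    use_best_model=True,       # keep the iteration with the best validation score
)

# Train on the training pool; the validation pool drives early stopping
model.fit(
    pool_train,
    eval_set=pool_val,
)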

0:	learn: 0.6047943	test: 0.6049987	best: 0.6049987 (0)	total: 683ms	remaining: 1m 7s
1:	learn: 0.5468728	test: 0.5472443	best: 0.5472443 (1)	total: 1.08s	remaining: 53.1s
2:	learn: 0.5077067	test: 0.5082187	best: 0.5082187 (2)	total: 1.38s	remaining: 44.6s
3:	learn: 0.4808279	test: 0.4814580	best: 0.4814580 (3)	total: 1.98s	remaining: 47.5s
4:	learn: 0.4622726	test: 0.4630056	best: 0.4630056 (4)	total: 2.28s	remaining: 43.3s
5:	learn: 0.4494611	test: 0.4502821	best: 0.4502821 (5)	total: 2.48s	remaining: 38.9s
6:	learn: 0.4406433	test: 0.4415399	best: 0.4415399 (6)	total: 2.88s	remaining: 38.3s
7:	learn: 0.4346065	test: 0.4355671	best: 0.4355671 (7)	total: 3.28s	remaining: 37.7s
8:	learn: 0.4305001	test: 0.4315150	best: 0.4315150 (8)	total: 3.78s	remaining: 38.2s
9:	learn: 0.4277257	test: 0.4287866	best: 0.4287866 (9)	total: 4.38s	remaining: 39.4s
10:	learn: 0.4258645	test: 0.4269650	best: 0.4269650 (10)	total: 4.98s	remaining: 40.3s
11:	learn: 0.4246241	test: 0.4257559	best: 0.4257559

<catboost.core.CatBoostClassifier at 0x7f283ba85d80>
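
With `early_stopping_rounds=5` and `use_best_model=True`, training can end before all 100 iterations have run. As a rough illustration of that kind of stopping rule (a simplified sketch, not CatBoost's actual implementation), the code below scans a list of validation losses and stops once the loss has gone `patience` iterations without improving. The function name and the loss values are made up for the example.

```python
def early_stopping_iteration(val_losses, patience):
    """Return (best_iteration, stop_iteration) for a simple patience rule.

    Training stops at the first iteration that is `patience` steps past
    the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_iter = -1
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember where it happened
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:
            # No improvement for `patience` iterations: stop here
            return best_iter, i
    # Ran out of iterations without triggering the rule
    return best_iter, len(val_losses) - 1


# Hypothetical validation losses (not taken from the training log)
example_losses = [0.60, 0.55, 0.51, 0.52, 0.53, 0.54]
best_iter, stop_iter = early_stopping_iteration(example_losses, patience=3)
print(f"best iteration: {best_iter}, stopped at: {stop_iter}")
```

In this sketch, keeping only the trees up to `best_iter` plays the role that `use_best_model=True` plays in the training cell.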

## <span style="color:#ff5f27">👮🏻‍♂️ Model Validation </span>

Next, you'll evaluate how well the model performs on the validation data.

In [11]:
preds = model.predict(pool_val)

precision, recall, fscore, _ = precision_recall_fscore_support(y_val, preds, average="binary")

metrics = {
    "precision" : precision,
    "recall" : recall,
    "fscore" : fscore,
}
print(classification_report(y_val, preds))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00     63866
           1       0.36      1.00      0.53     36059

    accuracy                           0.36     99925
   macro avg       0.18      0.50      0.27     99925
weighted avg       0.13      0.36      0.19     99925
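
The report shows a recall of 1.00 for class 1 and 0.00 for class 0, which is the pattern you get when every validation row is predicted as positive. The sketch below reproduces the class-1 row from the support counts alone, under the assumption that the classifier labels everything as 1. It also computes the negative-to-positive ratio, a commonly used starting point for `scale_pos_weight` on imbalanced data. The helper name is hypothetical.

```python
def all_positive_metrics(n_negative, n_positive):
    """Precision, recall and F1 for class 1 when every row is predicted as 1."""
    true_positives = n_positive    # every actual positive is predicted positive
    false_positives = n_negative   # every actual negative is also predicted positive
    false_negatives = 0            # nothing is ever predicted negative
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Support counts taken from the classification report
n_negative, n_positive = 63866, 36059

precision, recall, f1 = all_positive_metrics(n_negative, n_positive)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Ratio of negatives to positives in the validation split
balanced_weight = n_negative / n_positive
print(f"negative/positive ratio: {balanced_weight:.2f}")
```

The values match the class-1 row of the report (0.36, 1.00, 0.53). The ratio comes out at roughly 1.77, well below the `scale_pos_weight=10` used in training, so the large positive-class weight is a plausible reason for the all-positive predictions. Because a ranker only needs the ordering of predicted probabilities, a threshold-free metric may also be worth checking before retuning.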





In [12]:
feat_to_score = {
    feature: score 
    for feature, score 
    in zip(
        X_train.columns, 
        model.feature_importances_,
    )
}

feat_to_score = dict(
    sorted(
        feat_to_score.items(),
        key=lambda item: item[1],
        reverse=True,
    )
)
feat_to_score

{'age': 31.650692148883884,
 'gender': 21.34485610363139,
 'video_length': 15.269523002273727,
 'views': 11.866658360703779,
 'likes': 7.924140862241619,
 'country': 6.816744170131442,
 'category': 5.127385352134173,
 'upload_date': 0.0}

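The model places the most importance on the user attributes `age` and `gender`, followed by the video attributes `video_length` and `views`. `upload_date` has zero importance, so in its current form it does not contribute to the predictions; deriving a numeric feature from it (for example, the video's age in days) might make it more useful.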
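
To see how the importance splits between user-side and video-side features, the scores printed above can be summed per group. The sketch below does this with the values copied from the output. The assignment of columns to groups is an assumption based on the feature group names (`users` and `videos`), not something the notebook states explicitly.

```python
# Importance scores copied from the output above
feat_to_score = {
    "age": 31.650692148883884,
    "gender": 21.34485610363139,
    "video_length": 15.269523002273727,
    "views": 11.866658360703779,
    "likes": 7.924140862241619,
    "country": 6.816744170131442,
    "category": 5.127385352134173,
    "upload_date": 0.0,
}

# Assumption: these columns come from the users feature group,
# and everything else comes from the videos feature group
USER_FEATURES = {"age", "gender", "country"}

grouped = {"user": 0.0, "video": 0.0}
for feature, score in feat_to_score.items():
    group = "user" if feature in USER_FEATURES else "video"
    grouped[group] += score

print({group: round(total, 1) for group, total in grouped.items()})
```

Under that grouping, user features account for about 59.8 of the total importance and video features for about 40.2.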

Finally, you'll save your model.

In [13]:
joblib.dump(model, 'ranking_model.pkl')

['ranking_model.pkl']

### <span style="color:#ff5f27">💾 Upload Model to Model Registry </span>

You'll upload the model to the Hopsworks Model Registry.

In [14]:
# Connect to Hopsworks Model Registry
mr = project.get_model_registry()

Connected. Call `.close()` to terminate connection gracefully.


In [16]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_example = X_train.sample().to_dict("records")
input_schema = Schema(X_train)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema, output_schema)

ranking_model = mr.python.create_model(
    name="ranking_model", 
    metrics=metrics,
    model_schema=model_schema,
    input_example=input_example,
    description="Ranking model that scores item candidates",
)
ranking_model.save("ranking_model.pkl")

  0%|          | 0/6 [00:00<?, ?it/s]

Model created, explore it at https://snurran.hops.works/p/11383/models/ranking_model/1


Model(name: 'ranking_model', version: 1)
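
The registered model is described as scoring item candidates. As a hypothetical illustration of how such scores might be turned into a ranked list downstream, the sketch below sorts candidate video IDs by their positive-class probability and keeps the top `k`. The function name, IDs, and probabilities are invented for the example; in practice the scores would come from something like `model.predict_proba(...)[:, 1]`.

```python
def top_k_candidates(candidate_ids, scores, k):
    """Return the k candidate IDs with the highest scores, best first."""
    if len(candidate_ids) != len(scores):
        raise ValueError("candidate_ids and scores must have the same length")
    # Pair each candidate with its score and sort by score, highest first
    ranked = sorted(
        zip(candidate_ids, scores),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [candidate_id for candidate_id, _ in ranked[:k]]


# Hypothetical candidates and positive-class probabilities
candidate_ids = ["video_a", "video_b", "video_c", "video_d"]
scores = [0.12, 0.87, 0.45, 0.66]

top_two = top_k_candidates(candidate_ids, scores, k=2)
print(top_two)
```

Ranking by probability rather than by the hard 0/1 prediction keeps the ordering informative even when the classifier assigns the same label to every candidate.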

---
## <span style="color:#ff5f27">⏩️ Next Steps </span>

Now you have trained both a retrieval and a ranking model, which will allow you to generate recommendations for users. In the next notebook, you'll take a look at how you can deploy these models with the `HSML` library.