## <span style="color:#ff5f27">👨🏻‍🏫 Train Ranking Model </span>

In this notebook, you will train a ranking model with gradient-boosted trees, using CatBoost.

In [None]:
import time

# Start the timer
notebook_start_time = time.time()

## <span style="color:#ff5f27">📝 Imports </span>

In [None]:
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import classification_report, precision_recall_fscore_support
import joblib

## <span style="color:#ff5f27">🔮 Connect to Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

In [None]:
customers_fg = fs.get_feature_group(
    name="customers",
    version=1,
)

articles_fg = fs.get_feature_group(
    name="articles",
    version=1,
)

trans_fg = fs.get_feature_group(
    name="transactions",
    version=1,
)

interactions_fg = fs.get_feature_group(
    name="interactions",
    version=1,
)

rank_fg = fs.get_feature_group(
    name="ranking",
    version=1,
)

## <span style="color:#ff5f27">⚙️ Feature View Creation </span>

In [None]:
# Select features
selected_features_customers = customers_fg.select_all()

fs.get_or_create_feature_view( 
    name='customers',
    query=selected_features_customers,
    version=1,
)

In [None]:
# Select features
selected_features_articles = articles_fg.select_except(['embeddings']) 

fs.get_or_create_feature_view(
    name='articles',
    query=selected_features_articles,
    version=1,
)

In [None]:
# Join transaction, customer, article, and interaction features
selected_features_llm_assistant = trans_fg.select([
    "customer_id",
    "t_dat",
    "price",
    "sales_channel_id",
    "year",
    "month",
    "day",
    "day_of_week",
]).join(
    customers_fg.select([
        "club_member_status",
        "age",
        "age_group",
    ]), 
    on="customer_id", 
    prefix="customer_",
).join(
    articles_fg.select([
        "prod_name",
        "product_type_name",
        "product_group_name",
        "graphical_appearance_name",
        "colour_group_name",
        "section_name",
        "garment_group_name",
        "article_description",
    ]), 
    on="article_id", 
    prefix="article_",
).join(
    interactions_fg.select([
        "interaction_score",
    ]),
    on=["customer_id", "article_id"],
    prefix="interaction_",
)

# Create the feature view
llm_assistant_feature_view = fs.get_or_create_feature_view(
    name='llm_assistant_context',
    query=selected_features_llm_assistant,
    version=1
)

In [None]:
# Select features
selected_features_ranking = rank_fg.select_except(["customer_id", "article_id"]).join(
    trans_fg.select(["month_sin", "month_cos"]), 
    prefix="trans_",
)

feature_view_ranking = fs.get_or_create_feature_view(
    name='ranking',
    query=selected_features_ranking,
    labels=["label"],
    version=1,
)
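
The `month_sin` and `month_cos` features joined above are computed in an earlier notebook, not here. As a minimal sketch, assuming they follow the usual cyclical encoding (sine and cosine of the month's angle on a 12-step circle), the snippet below shows why such a pair is useful: December and January end up as close to each other as any two neighbouring months. The helper names are hypothetical.

```python
import math


def encode_month(month):
    """Map a calendar month (1-12) onto the unit circle.

    Assumption: this mirrors how month_sin / month_cos were built upstream.
    """
    angle = 2 * math.pi * month / 12
    return math.sin(angle), math.cos(angle)


def month_distance(a, b):
    """Euclidean distance between two encoded months."""
    return math.dist(encode_month(a), encode_month(b))


# December -> January is the same step as January -> February ...
print(month_distance(12, 1), month_distance(1, 2))
# ... whereas the raw month numbers 12 and 1 are 11 apart.
print(abs(12 - 1))
```
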

## <span style="color:#ff5f27">🗄️ Train Data loading </span>

In [None]:
X_train, X_val, y_train, y_val = feature_view_ranking.train_test_split(
    test_size=0.1,
    description='Ranking training dataset',
)

X_train.head(3)

In [None]:
y_train.head(3)

## <span style="color:#ff5f27">🏃🏻‍♂️ Model Training </span>

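Next, you'll train the model. The cell below marks string and object columns as categorical features, wraps the training and validation sets in CatBoost `Pool` objects, and fits a `CatBoostClassifier`. Setting `scale_pos_weight=10` up-weights the positive class, and `early_stopping_rounds=5` together with `use_best_model=True` stops training once the validation metric has not improved for 5 iterations and keeps the best iteration.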

In [None]:
cat_features = list(
    X_train.select_dtypes(include=['string', 'object']).columns
)

pool_train = Pool(X_train, y_train, cat_features=cat_features)
pool_val = Pool(X_val, y_val, cat_features=cat_features)

model = CatBoostClassifier(
    learning_rate=0.2,
    iterations=100,
    depth=10,
    scale_pos_weight=10,
    early_stopping_rounds=5,
    use_best_model=True,
)

model.fit(
    pool_train, 
    eval_set=pool_val,
)
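
The model above uses `scale_pos_weight=10`. A common heuristic for this kind of weight is the ratio of negative to positive examples in the training labels; whether the value 10 in this notebook was chosen that way is an assumption. The sketch below computes the ratio on a toy label list. With the real data you could pass `y_train["label"]` instead, assuming the label column is named `label` as in the feature view.

```python
from collections import Counter


def suggested_scale_pos_weight(labels):
    """Return negatives / positives for a binary 0/1 label sequence."""
    counts = Counter(labels)
    positives, negatives = counts[1], counts[0]
    if positives == 0:
        raise ValueError("labels contain no positive examples")
    return negatives / positives


# Toy labels for illustration only: 1 positive for every 10 negatives.
toy_labels = [1] + [0] * 10
print(suggested_scale_pos_weight(toy_labels))
```
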

## <span style="color:#ff5f27">👮🏻‍♂️ Model Validation </span>

Next, you'll evaluate how well the model performs on the validation data.

In [None]:
preds = model.predict(pool_val)

precision, recall, fscore, _ = precision_recall_fscore_support(y_val, preds, average="binary")

metrics = {
    "precision" : precision,
    "recall" : recall,
    "fscore" : fscore,
}
print(classification_report(y_val, preds))

The classification report shows that the model has a low F1-score on the positive class (higher is better). Adding more features to the dataset, such as image embeddings, could improve performance.
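
To make the reported numbers concrete, here is a small worked sketch of how precision, recall, and F1 for the positive class follow from the counts of true positives, false positives, and false negatives. The labels and predictions are made up for illustration; `precision_recall_fscore_support` with `average="binary"` reports the same three quantities.

```python
def binary_prf(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: 2 true positives, 2 false positives, 1 false negative.
toy_true = [1, 0, 0, 1, 0, 1, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_prf(toy_true, toy_pred))
```

In this toy case precision is 2/4, recall is 2/3, and F1 is 4/7: many false positives pull precision, and therefore F1, down even when recall is reasonable.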

Let's see which features your model considers important.

In [None]:
feat_to_score = {
    feature: score 
    for feature, score 
    in zip(
        X_train.columns, 
        model.feature_importances_,
    )
}

feat_to_score = dict(
    sorted(
        feat_to_score.items(),
        key=lambda item: item[1],
        reverse=True,
    )
)
feat_to_score

The model places high importance on the user and item embedding features, so better-trained embeddings could yield a better ranking model.

Finally, you'll save your model.

In [None]:
joblib.dump(model, 'ranking_model.pkl')

### <span style="color:#ff5f27">💾  Upload Model to Model Registry </span>

You'll upload the model to the Hopsworks Model Registry.

In [None]:
# Connect to Hopsworks Model Registry
mr = project.get_model_registry()

In [None]:
input_example = X_train.sample().to_dict("records")
                                         
ranking_model = mr.python.create_model(
    name="ranking_model", 
    description="Ranking model that scores item candidates",
    version=1,
    metrics=metrics,
    feature_view=feature_view_ranking,
    input_example=input_example,
)
ranking_model.save("ranking_model.pkl")

---

In [None]:
# End the timer
notebook_end_time = time.time()

# Calculate and print the execution time
notebook_execution_time = notebook_end_time - notebook_start_time
print(f"⌛️ Notebook Execution time: {notebook_execution_time:.2f} seconds")

---
## <span style="color:#ff5f27">⏩️ Next Steps </span>

You have now trained both a retrieval model and a ranking model, which together let you generate recommendations for users. In the next notebook, you'll see how to deploy these models with the `HSML` library.
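
As a rough illustration of how the two models fit together when generating recommendations: assuming the retrieval model has produced a short list of candidate articles and the ranking model has assigned each one a positive-class probability (for a CatBoost classifier, for example via `predict_proba`), the last step is a sort. The article IDs and scores below are hypothetical, and the deployed pipeline may differ.

```python
def top_k(candidates, scores, k=3):
    """Return the k candidates with the highest scores, best first."""
    ranked = sorted(
        zip(candidates, scores),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [candidate for candidate, _ in ranked[:k]]


# Hypothetical candidates and ranking scores, for illustration only.
candidate_ids = ["a1", "a2", "a3", "a4"]
candidate_scores = [0.12, 0.87, 0.45, 0.63]
print(top_k(candidate_ids, candidate_scores, k=2))
```
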