# Feature engineering
Create custom features. Feature categories:
- User features
- Item features
- Frequency features

First step: load the data and prepare it before training, validation, and testing.

In [10]:
import pandas as pd

In [11]:
postgres_uri = "postgresql+psycopg2://backend:backend@localhost:5432/app_db"

customers = pd.read_sql_table("customers", postgres_uri)
articles = pd.read_sql_table("articles", postgres_uri)
transactions = pd.read_sql_table("transactions", postgres_uri)

In [12]:
customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5407 entries, 0 to 5406
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   customer_uuid           5407 non-null   object
 1   fn                      5407 non-null   int64 
 2   active                  5407 non-null   int64 
 3   club_member_status      5407 non-null   object
 4   fashion_news_frequency  2909 non-null   object
 5   age                     5407 non-null   int64 
 6   postal_code             5407 non-null   object
 7   customer_id             5407 non-null   int64 
dtypes: int64(4), object(4)
memory usage: 338.1+ KB


In [13]:
articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13673 entries, 0 to 13672
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   article_uuid        13673 non-null  object
 1   prod_name           13673 non-null  object
 2   product_type_no     13673 non-null  int64 
 3   product_type_name   13673 non-null  object
 4   product_group_no    13673 non-null  int64 
 5   product_group_name  13673 non-null  object
 6   department_no       13673 non-null  int64 
 7   department_name     13673 non-null  object
 8   index_code          13673 non-null  object
 9   index_name          13673 non-null  object
 10  index_group_no      13673 non-null  int64 
 11  index_group_name    13673 non-null  object
 12  section_no          13673 non-null  int64 
 13  section_name        13673 non-null  object
 14  garment_group_no    13673 non-null  int64 
 15  garment_group_name  13673 non-null  object
 16  detail_desc         13

In [14]:
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 514251 entries, 0 to 514250
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   transaction_uuid  514251 non-null  object        
 1   t_dat             514251 non-null  datetime64[ns]
 2   price             514251 non-null  float64       
 3   sales_channel_id  514251 non-null  int64         
 4   customer_uuid     514251 non-null  object        
 5   article_uuid      514251 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 23.5+ MB


Prepare user/item features.

In [15]:
# .copy() makes these independent DataFrames, so the in-place edits below
# do not trigger SettingWithCopyWarning.
article_features = articles[[
    "article_uuid", "article_id", "product_type_no", "product_group_no", "department_no", "index_code",
    "index_group_no", "section_no", "garment_group_no"
]].copy()
customer_features = customers[[
    "customer_uuid", "customer_id", "age"
]].copy()

In [16]:
# Shift the 1-based ids to 0-based so they can serve as matrix row/column indices.
article_features["article_id"] -= 1

In [17]:
customer_features["customer_id"] -= 1

Create the interaction matrix (user × item).

In [18]:
interactions = transactions.copy()

In [19]:
interactions = interactions.merge(customer_features, how="left", on="customer_uuid")
interactions = interactions.merge(article_features, how="left", on="article_uuid")

In [20]:
interactions = interactions[["customer_id", "article_id", "t_dat"]]

In [21]:
interactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 514251 entries, 0 to 514250
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   customer_id  514251 non-null  int64         
 1   article_id   514251 non-null  int64         
 2   t_dat        514251 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(2)
memory usage: 11.8 MB


# Training
Let's divide the data into three parts. The first part is the data for the first stage. The second part is the data for the second stage. The third is the test data for evaluating the overall model.  
We will use time intervals. Min date: 2020-03-22; max date: 2020-09-22  
- first stage train (2020-03-22/2020-06-22)
- first stage validate (2020-06-22/2020-08-22)
- second stage train (2020-06-22/2020-08-22)
- second stage validate (2020-08-22/2020-09-11)
- test (2020-09-11/2020-09-22)

Every user in the later samples must also appear in the first-stage training sample. The code below enforces this by filtering on `users_in_train` and then checks it with assertions.

In [22]:
min_date = "2020-03-22 00:00:00"
date_1 = "2020-06-22 00:00:00"
date_2 = "2020-08-22 00:00:00" 
date_3 = "2020-09-11 00:00:00" 
date_4 = "2020-09-22 00:00:00"

users_in_train = set(interactions[interactions["t_dat"] <= date_1]["customer_id"])

first_stage_train = interactions[interactions["t_dat"] <= date_1]
first_stage_validate = interactions[(interactions["t_dat"] > date_1) & (interactions["t_dat"] <= date_2) & (interactions["customer_id"].isin(users_in_train))]

second_stage_train = first_stage_validate
second_stage_validate = interactions[(interactions["t_dat"] > date_2) & (interactions["t_dat"] <= date_3) & (interactions["customer_id"].isin(users_in_train))]

test = interactions[(interactions["t_dat"] > date_3) & (interactions["t_dat"] <= date_4) & (interactions["customer_id"].isin(users_in_train))]


assert len(set(first_stage_validate["customer_id"]) - users_in_train) == 0
assert len(set(second_stage_validate["customer_id"]) - users_in_train) == 0
assert len(set(test["customer_id"]) - users_in_train) == 0

In [23]:
assert str(first_stage_train["t_dat"].min()) == min_date
assert str(first_stage_train["t_dat"].max()) == date_1

assert str(first_stage_validate["t_dat"].min() - pd.Timedelta(days=1)) == date_1
assert str(first_stage_validate["t_dat"].max()) == date_2

assert str(second_stage_validate["t_dat"].min() - pd.Timedelta(days=1)) == date_2
assert str(second_stage_validate["t_dat"].max()) == date_3

assert str(test["t_dat"].min() - pd.Timedelta(days=1)) == date_3
assert str(test["t_dat"].max()) == date_4

In [24]:
print(
    f"first_stage_train length: {len(first_stage_train)};\n"
    f"first_stage_validate length: {len(first_stage_validate)};\n"
    f"second_stage_train length: {len(second_stage_train)};\n"
    f"second_stage_validate length: {len(second_stage_validate)};\n"
    f"test length: {len(test)}"
)

first_stage_train length: 263965;
first_stage_validate length: 175544;
second_stage_train length: 175544;
second_stage_validate length: 47794;
test length: 20737
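
The interval logic above can be wrapped in a small helper. The following is a minimal sketch rather than the notebook's own code: `split_by_time` and the toy frame are hypothetical, and it assumes intervals that are open on the left and closed on the right, as in the filters above.

```python
import pandas as pd


def split_by_time(df, boundaries, time_col="t_dat"):
    """Split df into consecutive chunks (previous boundary, boundary].

    The first chunk holds everything up to and including boundaries[0].
    """
    bounds = [pd.Timestamp(b) for b in boundaries]
    chunks = [df[df[time_col] <= bounds[0]]]
    for lo, hi in zip(bounds, bounds[1:]):
        chunks.append(df[(df[time_col] > lo) & (df[time_col] <= hi)])
    return chunks


toy = pd.DataFrame({
    "customer_id": [0, 1, 0, 2, 1],
    "t_dat": pd.to_datetime(
        ["2020-03-22", "2020-06-22", "2020-07-01", "2020-08-30", "2020-09-15"]
    ),
})
train_part, validate_part, test_part = split_by_time(
    toy, ["2020-06-22", "2020-08-22", "2020-09-22"]
)

# Keep only users already seen in the first chunk (cold-start users are dropped).
known_users = set(train_part["customer_id"])
validate_part = validate_part[validate_part["customer_id"].isin(known_users)]
test_part = test_part[test_part["customer_id"].isin(known_users)]
```

In the toy data, customer 2 first appears after the training window and is therefore filtered out of the test chunk.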


In [25]:
min_date = "2020-03-22 00:00:00"
date_1 = "2020-09-01 00:00:00"
date_2 = "2020-09-11 00:00:00"

train = interactions[interactions["t_dat"] <= date_1]
validate = interactions[(interactions["t_dat"] > date_1) & (interactions["t_dat"] <= date_2)]
test = interactions[(interactions["t_dat"] > date_2)]


assert len(set(validate["customer_id"]) - set(train["customer_id"])) == 0
assert len(set(test["customer_id"]) - set(train["customer_id"])) == 0

print(
    f"train length: {len(train)};\n"
    f"validate length: {len(validate)};\n"
    f"test length: {len(test)};"
)

train length: 468843;
validate length: 24189;
test length: 21219;


# Training first stage
Create an ALS model that predicts user × item interactions.

In [58]:
import os

import numpy as np
import optuna
import implicit

import mlflow
import mlflow.pyfunc

from scipy.sparse import coo_matrix, csr_matrix

from metrics import recall_at_k, precision_at_k

import plotly.express as px
import plotly.graph_objects as go

from typing import List, Any, Dict

In [27]:
os.environ["MLFLOW_TRACKING_URI"] = "http://localhost:5000"
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://localhost:9000"
os.environ["AWS_ACCESS_KEY_ID"] = "admin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "password"

In [28]:
length_customers = len(customers)
length_articles = len(articles)

In [29]:
def to_coo(interactions, users_len, items_len):
    row = interactions["customer_id"].values
    col = interactions["article_id"].values
    data = np.ones(interactions.shape[0])

    return coo_matrix((data, (row, col)), shape=(users_len, items_len), dtype=np.float32)


def to_csr(interactions, users_len, items_len):
    # Avoid local names that shadow the imported coo_matrix / csr_matrix.
    return to_coo(interactions, users_len, items_len).tocsr()


train_matrix = to_csr(train, length_customers, length_articles)
validate_matrix = to_csr(validate, length_customers, length_articles)
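
To make the matrix layout concrete, here is a small sketch on made-up data (the toy frame and its shape are assumptions for illustration only). It also shows one detail worth knowing: when a COO matrix is converted to CSR, duplicate `(row, col)` entries are summed, so a repeated purchase ends up as a count greater than 1.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

toy = pd.DataFrame({
    "customer_id": [0, 0, 1, 0],
    "article_id": [2, 2, 0, 1],
})

rows = toy["customer_id"].to_numpy()
cols = toy["article_id"].to_numpy()
data = np.ones(len(toy), dtype=np.float32)

# 2 users x 3 items; the duplicate (0, 2) pair is summed on conversion to CSR.
toy_matrix = coo_matrix((data, (rows, cols)), shape=(2, 3)).tocsr()
dense = toy_matrix.toarray()
# dense == [[0, 1, 2],
#           [1, 0, 0]]
```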

In [30]:
def items_vector(df, customer_id):
    return df[df["customer_id"] == customer_id]["article_id"].to_list()


def estimate_metrics(model, df, k):
    data = df.copy()
    data["candidates"] = data["customer_id"].apply(lambda x: model.recommend(x, train_matrix[x], N=k)[0])
    recall = []
    precision = []
    for _, actual, candidates in data.values:
        recall.append(recall_at_k(actual, candidates, k=k))
        precision.append(precision_at_k(actual, candidates, k=k))
    return {"recall": np.mean(recall), "precision": np.mean(precision)}


validate_matrix = pd.DataFrame({"customer_id": validate["customer_id"].unique()})
validate_matrix["actual"] = validate_matrix["customer_id"].apply(lambda x: items_vector(validate, x))
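
The local `metrics` module imported earlier is not shown in this notebook. Below is a minimal sketch of what its two functions might compute, assuming `actual` is the list of purchased article ids and `candidates` is a ranked list of recommendations. The `_sketch` suffix marks these as hypothetical stand-ins; the real implementation may differ, for example in how it handles empty inputs.

```python
def recall_at_k_sketch(actual, candidates, k):
    """Share of the purchased items that appear among the top-k candidates."""
    actual_set = set(actual)
    if not actual_set:
        return 0.0
    hits = len(actual_set & set(list(candidates)[:k]))
    return hits / len(actual_set)


def precision_at_k_sketch(actual, candidates, k):
    """Share of the top-k candidates that were actually purchased."""
    if k <= 0:
        return 0.0
    hits = len(set(actual) & set(list(candidates)[:k]))
    return hits / k


# Top-3 candidates are [4, 9, 1]; two of them (4 and 1) were purchased.
r = recall_at_k_sketch([1, 2, 3, 4], [4, 9, 1, 7, 2], k=3)     # 2 of 4 purchases found
p = precision_at_k_sketch([1, 2, 3, 4], [4, 9, 1, 7, 2], k=3)  # 2 of 3 candidates are hits
```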

In [122]:
class ALSModel(mlflow.pyfunc.PythonModel):
    def __init__(self, model: implicit.als.AlternatingLeastSquares, user_item: csr_matrix):
        self.model = model
        self.sparse_matrix = user_item

    def predict(self, context, model_input: List[int], params: Dict[str, Any] = None) -> List[list]:
        if params is None:
            params = {}
        N = params.get("N", 10)

        return self.model.recommend(model_input, self.sparse_matrix[model_input], N=N)

    def get_raw_model(self):
        return self.model

In [123]:
def objective(trial):
    factors = trial.suggest_int("factors", 10, 200)
    regularization = trial.suggest_float("regularization", 1e-6, 1e-2, log=True)
    alpha = trial.suggest_int("alpha", -5, 15)
    iterations = trial.suggest_int("iterations", 5, 350)

    model = implicit.als.AlternatingLeastSquares(
        factors=factors,
        regularization=regularization,
        alpha=alpha,
        iterations=iterations,
        random_state=42)

    model.fit(train_matrix, show_progress=False)

    return estimate_metrics(model, validate_matrix, 25)["recall"]


study = optuna.create_study(
    direction="maximize",
    pruner=optuna.pruners.MedianPruner(
        n_startup_trials=5, n_warmup_steps=30, interval_steps=10
    ),
)
study.optimize(objective, n_trials=1)

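with mlflow.start_run(run_name="ALS w/ recall&precision"):
    # Refit with the tuned hyperparameters so the logged params describe the logged model.
    als_model = implicit.als.AlternatingLeastSquares(**study.best_params, random_state=42)
    als_model.fit(train_matrix)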

    mlflow.log_params(study.best_params)

    # recall & precision @10
    metrics = estimate_metrics(als_model, validate_matrix, 10)
    mlflow.log_metric("recall_k10", metrics["recall"])
    mlflow.log_metric("precision_k10", metrics["precision"])


    xs = np.arange(100, 800, 50)
    ys = [estimate_metrics(als_model, validate_matrix, k)["recall"] for k in xs]

    fig = go.Figure()

    fig.add_trace(go.Scatter(
        x=xs,
        y=ys,
        name="recall"
    ))

    fig.add_trace(go.Scatter(
        x=xs,
        y=np.gradient(ys) * 10,
        name="change rate / gradient * 10",
        line=dict(color='red', width=2, dash='dot')
    ))

    mlflow.log_figure(fig, "plot_artifacts/change_recall_rate.png")
    mlflow.pyfunc.log_model(
        artifact_path="als_model",
        python_model=ALSModel(als_model, train_matrix)
    )

[I 2025-04-01 19:15:28,011] A new study created in memory with name: no-name-59e12661-6a31-481b-a53a-844cec4e45c5
[I 2025-04-01 19:15:33,208] Trial 0 finished with value: 0.009932489050596556 and parameters: {'factors': 89, 'regularization': 0.007749916301453231, 'alpha': 2, 'iterations': 82}. Best is trial 0 with value: 0.009932489050596556.
2025/04/01 19:15:54 INFO mlflow.models.signature: Inferring model signature from type hints
2025/04/01 19:15:54 INFO mlflow.models.signature: Failed to infer output type hint, setting output schema to AnyType. Invalid type hint `list`, it must include a valid element type. Type hints must be a list[...] where collection element type is one of these types: [<class 'int'>, <class 'str'>, <class 'bool'>, <class 'float'>, <class 'bytes'>, <class 'datetime.datetime'>], pydantic BaseModel subclasses, lists and dictionaries of primitive types, or typing.Any. Check https://mlflow.org/docs/latest/model/python

🏃 View run ALS w/ recall&precision at: http://localhost:5000/#/experiments/0/runs/6fc4ed3663ac4a2f9d210a4c4bcb6d8f
🧪 View experiment at: http://localhost:5000/#/experiments/0
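
The dotted "change rate" trace in the run above comes from `np.gradient`. A short sketch on made-up recall values shows what it returns: central differences in the interior and one-sided differences at the edges. Called with only the values, it assumes unit spacing between points; passing the `k` grid as a second argument gives the change per unit of `k` instead.

```python
import numpy as np

ks = np.array([100, 150, 200, 250])
recalls = np.array([0.10, 0.20, 0.25, 0.27])  # made-up values for illustration

per_step = np.gradient(recalls)    # change per index step (spacing assumed to be 1)
per_k = np.gradient(recalls, ks)   # change per unit of k (spacing taken from ks)
# per_step is approximately [0.1, 0.075, 0.035, 0.02]
```

With a uniform grid the two differ only by the constant step size (50 here), so the shape of the curve is the same either way.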


In [124]:
mlflow.register_model(
    "runs:/6fc4ed3663ac4a2f9d210a4c4bcb6d8f/als_model",
    "ALS_Model"
)

Registered model 'ALS_Model' already exists. Creating a new version of this model...
2025/04/01 19:16:17 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: ALS_Model, version 3
Created version '3' of model 'ALS_Model'.


<ModelVersion: aliases=[], creation_timestamp=1743520577914, current_stage='None', description='', last_updated_timestamp=1743520577914, name='ALS_Model', run_id='6fc4ed3663ac4a2f9d210a4c4bcb6d8f', run_link='', source='s3://mlflow/0/6fc4ed3663ac4a2f9d210a4c4bcb6d8f/artifacts/als_model', status='READY', status_message=None, tags={}, user_id='', version='3'>

# Second stage train

Prepare the data for the CatBoost training and validation step.

Match candidates for the second-stage model.

In [128]:
from catboost import CatBoost

In [129]:
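def freq_feature(left, right, group_by, agg_col, feature_name):
    # Count rows of `right` per group and attach the count to `left` as a float column.
    counts = (
        right.groupby(by=group_by)[agg_col]
        .count()
        .rename(feature_name)
        .astype(float)
    )
    return left.merge(counts, how="left", on=group_by)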


def match_candidates(df):
    candidates = pd.DataFrame({"customer_id": df["customer_id"].unique()})
    candidates["candidates"] = candidates["customer_id"].apply(lambda x: als_model.recommend(x, train_matrix[x], 200)[0])

    articles = candidates.apply(lambda x: pd.Series(x["candidates"]), axis=1).stack().reset_index(level=1, drop=True)
    articles.name = "article_id"

    return candidates.drop("candidates", axis=1).join(articles)


def merge_features(data, customer_features, article_features):
    data = data.merge(customer_features, how="left", on="customer_id")
    data = data.merge(article_features, how="left", on="article_id")
    return data


def add_freq_features(data, right):
    data = freq_feature(data, right, ["article_id"], "article_uuid", "article_freq")
    data = freq_feature(data, right, ["customer_id", "product_group_no"], "article_id", "product_group_freq")
    data = freq_feature(data, right, ["customer_id", "index_code"], "article_id", "index_freq")
    data = freq_feature(data, right, ["customer_id", "garment_group_no"], "article_id", "garment_group_freq")
    return data
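
A toy run of the frequency-feature idea, on made-up data, may help. This is a sketch under the assumption that a frequency is simply the number of historical purchases per group. It also shows why the columns need a fill step later: the left join leaves `NaN` for groups the customer never bought from.

```python
import pandas as pd

history = pd.DataFrame({
    "customer_id": [0, 0, 0, 1],
    "product_group_no": [2, 2, 3, 2],
    "article_id": [10, 11, 12, 10],
})
pairs = pd.DataFrame({
    "customer_id": [0, 0, 1, 1],
    "product_group_no": [2, 3, 2, 3],
})

# Purchases per (customer, product group), as a float column.
counts = (
    history.groupby(["customer_id", "product_group_no"])["article_id"]
    .count()
    .rename("product_group_freq")
    .astype(float)
    .reset_index()
)
with_freq = pairs.merge(counts, how="left", on=["customer_id", "product_group_no"])

# Customer 1 never bought from group 3, so the left join leaves NaN there;
# filling with 0 turns "no match" into "zero purchases".
with_freq["product_group_freq"] = with_freq["product_group_freq"].fillna(0)
```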

In [131]:
candidates = match_candidates(train)
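
`match_candidates` turns one list of recommendations per customer into one row per (customer, article) pair. As a sketch on made-up data, `DataFrame.explode` is one alternative way to get the same long format; the frame and column values here are illustrative assumptions.

```python
import pandas as pd

recs = pd.DataFrame({
    "customer_id": [0, 1],
    "candidates": [[5, 7, 9], [7, 2]],
})

# One row per (customer, candidate article). explode keeps the original index
# and an object dtype, so cast the ids back to int and reset the index.
long_form = (
    recs.explode("candidates")
    .rename(columns={"candidates": "article_id"})
    .astype({"article_id": "int64"})
    .reset_index(drop=True)
)
```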

In [132]:
X_train = train.copy()
X_train = X_train[["customer_id", "article_id"]]

In [133]:
# Every observed purchase is a positive example.
X_train["target"] = 1

In [134]:
X_train = X_train.merge(candidates, how="outer", on=["customer_id", "article_id"])
X_train = X_train.drop_duplicates(subset=["customer_id", "article_id"])
X_train.fillna(0, inplace=True)
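
The labelling step can be traced on a toy example (made-up ids, for illustration only): purchases carry `target = 1`, ALS candidates that were never purchased come in through the outer merge with a missing target, and that missing value is filled with 0.

```python
import pandas as pd

purchases = pd.DataFrame({"customer_id": [0, 0, 1], "article_id": [5, 5, 7]})
purchases["target"] = 1

candidates_toy = pd.DataFrame({"customer_id": [0, 0, 1], "article_id": [5, 9, 2]})

labelled = (
    purchases.merge(candidates_toy, how="outer", on=["customer_id", "article_id"])
    .drop_duplicates(subset=["customer_id", "article_id"])
    .fillna({"target": 0})
    .sort_values(["customer_id", "article_id"])
    .reset_index(drop=True)
)
# (0, 5) was purchased and recommended -> 1; (0, 9) and (1, 2) were only recommended -> 0.
```

Because the pair `(0, 5)` matches on both keys, the merge keeps its positive label, so a purchased item is never relabelled as a negative.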

In [135]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1437903 entries, 0 to 1550242
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   customer_id  1437903 non-null  int64  
 1   article_id   1437903 non-null  int64  
 2   target       1437903 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 43.9 MB


In [136]:
X_train = merge_features(X_train, customer_features, article_features)

In [137]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1437903 entries, 0 to 1437902
Data columns (total 13 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   customer_id       1437903 non-null  int64  
 1   article_id        1437903 non-null  int64  
 2   target            1437903 non-null  float64
 3   customer_uuid     1437903 non-null  object 
 4   age               1437903 non-null  int64  
 5   article_uuid      1437903 non-null  object 
 6   product_type_no   1437903 non-null  int64  
 7   product_group_no  1437903 non-null  int64  
 8   department_no     1437903 non-null  int64  
 9   index_code        1437903 non-null  object 
 10  index_group_no    1437903 non-null  int64  
 11  section_no        1437903 non-null  int64  
 12  garment_group_no  1437903 non-null  int64  
dtypes: float64(1), int64(9), object(3)
memory usage: 142.6+ MB


# Feature engineering

In [138]:
full_transactions = transactions.copy()
full_transactions = full_transactions.merge(customer_features, how="left", on="customer_uuid")
full_transactions = full_transactions.merge(article_features, how="left", on="article_uuid")

full_transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 514251 entries, 0 to 514250
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   transaction_uuid  514251 non-null  object        
 1   t_dat             514251 non-null  datetime64[ns]
 2   price             514251 non-null  float64       
 3   sales_channel_id  514251 non-null  int64         
 4   customer_uuid     514251 non-null  object        
 5   article_uuid      514251 non-null  object        
 6   customer_id       514251 non-null  int64         
 7   age               514251 non-null  int64         
 8   article_id        514251 non-null  int64         
 9   product_type_no   514251 non-null  int64         
 10  product_group_no  514251 non-null  int64         
 11  department_no     514251 non-null  int64         
 12  index_code        514251 non-null  object        
 13  index_group_no    514251 non-null  int64         
 14  sect

In [139]:
X_train = add_freq_features(X_train, full_transactions)

In [140]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1437903 entries, 0 to 1437902
Data columns (total 17 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   customer_id         1437903 non-null  int64  
 1   article_id          1437903 non-null  int64  
 2   target              1437903 non-null  float64
 3   customer_uuid       1437903 non-null  object 
 4   age                 1437903 non-null  int64  
 5   article_uuid        1437903 non-null  object 
 6   product_type_no     1437903 non-null  int64  
 7   product_group_no    1437903 non-null  int64  
 8   department_no       1437903 non-null  int64  
 9   index_code          1437903 non-null  object 
 10  index_group_no      1437903 non-null  int64  
 11  section_no          1437903 non-null  int64  
 12  garment_group_no    1437903 non-null  int64  
 13  article_freq        1437903 non-null  float64
 14  product_group_freq  1405504 non-null  float64
 15  index_freq     

In [141]:
X_train.sample(1)

| | customer_id | article_id | target | customer_uuid | age | article_uuid | product_type_no | product_group_no | department_no | index_code | index_group_no | section_no | garment_group_no | article_freq | product_group_freq | index_freq | garment_group_freq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8348 | 31 | 3421 | 1.0 | 666b4c99-f2e6-4909-b4e2-2ed5e31e4f27 | 51 | f943cd17-3034-4394-99ed-6cac6bfc6ae6 | 272 | 2 | 1710 | A | 1 | 6 | 1009 | 6.0 | 26.0 | 93.0 | 22.0 |


In [142]:
def prepare(data, freq_columns, drop_columns):
    # A missing frequency means "no purchases in this group", so fill it with 0.
    # Assigning the filled columns back avoids chained in-place assignment.
    data = data.drop(columns=drop_columns)
    data[freq_columns] = data[freq_columns].fillna(0)
    return data


In [143]:
freq_columns = ["article_freq", "product_group_freq", "index_freq", "garment_group_freq"]
drop_columns = ["customer_uuid", "article_uuid"]

X_train = prepare(X_train, freq_columns, drop_columns)

In [144]:
X_train

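| | customer_id | article_id | target | age | product_type_no | product_group_no | department_no | index_code | index_group_no | section_no | garment_group_no | article_freq | product_group_freq | index_freq | garment_group_freq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 113 | 1.0 | 44 | 286 | 3 | 3710 | B | 1 | 61 | 1017 | 17.0 | 3.0 | 19.0 | 10.0 |
| 1 | 0 | 131 | 0.0 | 44 | 273 | 2 | 3608 | B | 1 | 62 | 1021 | 114.0 | 49.0 | 19.0 | 0.0 |
| 2 | 0 | 132 | 0.0 | 44 | 286 | 3 | 3710 | B | 1 | 61 | 1017 | 86.0 | 3.0 | 19.0 | 10.0 |
| 3 | 0 | 191 | 0.0 | 44 | 302 | 1 | 3611 | B | 1 | 62 | 1021 | 239.0 | 0.0 | 19.0 | 0.0 |
| 4 | 0 | 308 | 0.0 | 44 | 272 | 2 | 1747 | D | 2 | 53 | 1009 | 128.0 | 49.0 | 19.0 | 18.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1437898 | 5406 | 13161 | 1.0 | 25 | 272 | 2 | 1772 | D | 2 | 57 | 1016 | 80.0 | 36.0 | 18.0 | 12.0 |
| 1437899 | 5406 | 13163 | 1.0 | 25 | 272 | 2 | 1772 | D | 2 | 57 | 1016 | 108.0 | 36.0 | 18.0 | 12.0 |
| 1437900 | 5406 | 13179 | 0.0 | 25 | 254 | 0 | 1919 | A | 1 | 2 | 1005 | 46.0 | 29.0 | 52.0 | 16.0 |
| 1437901 | 5406 | 13299 | 0.0 | 25 | 272 | 2 | 1939 | A | 1 | 2 | 1009 | 53.0 | 36.0 | 52.0 | 11.0 |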


In [145]:
y_train = X_train[["target"]]
X_train = X_train.drop(columns=["target"])

In [146]:
def rerank(customer_id, df, k=5):
    return df[df["customer_id"] == customer_id].\
                sort_values("proba", ascending=False).head(k)["article_id"].tolist()
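
`rerank` filters the whole frame once per customer. As a hedged alternative sketch on made-up scores, the top-k lists for all customers can be computed in a single grouped pass; the `scored` frame and `k = 2` are assumptions for illustration.

```python
import pandas as pd

scored = pd.DataFrame({
    "customer_id": [0, 0, 0, 1, 1],
    "article_id": [5, 7, 9, 7, 2],
    "proba": [0.2, 0.9, 0.5, 0.1, 0.4],
})

# Sort every customer's candidates by score once, then keep the top k per customer.
k = 2
top_k = (
    scored.sort_values(["customer_id", "proba"], ascending=[True, False])
    .groupby("customer_id")["article_id"]
    .apply(lambda s: s.head(k).tolist())
)
# top_k is a Series indexed by customer_id: 0 -> [7, 9], 1 -> [2, 7]
```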

In [148]:
def match_rerank(df, rerank_model):
    # Build candidates from the interactions passed in rather than the global `validate`.
    data = match_candidates(df)

    data = merge_features(data, customer_features, article_features)
    data = add_freq_features(data, full_transactions)
    data = prepare(data, freq_columns, drop_columns)

    data["proba"] = rerank_model.predict(data, prediction_type="Probability")[:, 1]

    compared = pd.DataFrame({"customer_id": df["customer_id"].unique()})
    compared["reranked"] = compared["customer_id"].apply(lambda x: rerank(x, data, 25))
    compared["actual"] = compared["customer_id"].apply(lambda x: df[df["customer_id"] == x]["article_id"].to_list())

    return compared


def objective(trial):
    with mlflow.start_run():
        params = {
            "objective": "Logloss",
            "iterations": trial.suggest_int("iterations", 50, 500),
            "max_depth": trial.suggest_int("max_depth", 6, 12),
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
            # suggest_loguniform is deprecated; suggest_float(..., log=True) replaces it.
            "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1e-5, 1e2, log=True),
            "task_type": "GPU"
        }

        catboost_model = CatBoost(
            params=params
        )

        catboost_model.fit(X_train, y_train, cat_features=["index_code"], verbose=False)
        candidates = match_rerank(validate, catboost_model)

        recall = []
        precision = []

        for _, reranked, actual in candidates.values:
            recall.append(recall_at_k(actual, reranked, 25))
            precision.append(precision_at_k(actual, reranked, 25))

        return np.mean(recall)


study = optuna.create_study(
    direction="maximize",
    pruner=optuna.pruners.MedianPruner(
        n_startup_trials=5, n_warmup_steps=30, interval_steps=10
    )
)

study.optimize(objective, n_trials=1)

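with mlflow.start_run(run_name="catboost/match candidates + rerank"):
    # best_params holds only the tuned values; re-add the fixed settings used during tuning.
    catboost_model = CatBoost(params={"objective": "Logloss", "task_type": "GPU", **study.best_params})
    catboost_model.fit(X_train, y_train, cat_features=["index_code"], verbose=False)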

    mlflow.log_params(study.best_params)

    candidates = match_rerank(validate, catboost_model)

    recall = []
    precision = []

    for _, reranked, actual in candidates.values:
        recall.append(recall_at_k(actual, reranked, 25))
        precision.append(precision_at_k(actual, reranked, 25))

    mean_recall = np.mean(recall)
    mean_precision = np.mean(precision)

    mlflow.log_metric("recall", mean_recall)
    mlflow.log_metric("precision", mean_precision)
    
    mlflow.catboost.log_model(
        catboost_model,
        artifact_path="catboost_model"
    )

[I 2025-04-01 19:19:23,193] A new study created in memory with name: no-name-bd9d0888-47ea-498e-a2ad-6c0625a5301b

[I 2025-04-01 19:19:50,709] Trial 0 finished with value: 0.007492031349641516 and parameters: {'iterations': 277, 'max_depth': 6, 'learning_rate': 0.18948380663010458, 'l2_leaf_reg': 35.396948663

🏃 View run whimsical-snail-678 at: http://localhost:5000/#/experiments/0/runs/efd07227491b4413ac2316e7888bba44
🧪 View experiment at: http://localhost:5000/#/experiments/0



🏃 View run catboost/match candidates + rerank at: http://localhost:5000/#/experiments/0/runs/3169bfe1d1d043c49348138f252779b4
🧪 View experiment at: http://localhost:5000/#/experiments/0


In [149]:
mlflow.register_model(
    "runs:/3169bfe1d1d043c49348138f252779b4/catboost_model",
    "CatBoost_Model"
)

Successfully registered model 'CatBoost_Model'.
2025/04/01 19:21:43 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: CatBoost_Model, version 1
Created version '1' of model 'CatBoost_Model'.


<ModelVersion: aliases=[], creation_timestamp=1743520903964, current_stage='None', description='', last_updated_timestamp=1743520903964, name='CatBoost_Model', run_id='3169bfe1d1d043c49348138f252779b4', run_link='', source='s3://mlflow/0/3169bfe1d1d043c49348138f252779b4/artifacts/catboost_model', status='READY', status_message=None, tags={}, user_id='', version='1'>