# 52nd Place Solution Notebook

This notebook is a cleaned version of my final submission.  
See [this post](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/discussion/324076/) for some details about my solution.

The notebook itself contains minimal code. Most of the code is imported from my [handmhelpers dataset](https://www.kaggle.com/datasets/jacob34/handmhelpers), which is synced to [this GitHub repo](https://github.com/JacobCP/kaggle-handm-helpers).  
See [this post](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/discussion/324078) for some details about my code development.  

**Please note:**  
I plan to keep updating the GitHub repo as I try to recreate some of the strategies shared by winning teams.  
Some of those changes may break the code used by this notebook.  
To keep this notebook functional, I will no longer update the dataset to reflect changes made to the repo. It will remain at commit 86c412e902a7692b24e15791322a8dfeb5a761eb.

In [1]:
%%time
import os
import sys
import copy
from datetime import datetime
import gc
import pickle as pkl
import shelve

import pandas as pd
import numpy as np
import cudf
    
sys.path.append("../input/")
from handmhelpers import io as h_io, sub as h_sub, cv as h_cv, fe as h_fe
from handmhelpers import modeling as h_modeling, candidates as h_can, pairs as h_pairs

CPU times: user 1.61 s, sys: 388 ms, total: 1.99 s
Wall time: 4.35 s


## Load and convert data

In [2]:
%%time

c, t, a = h_io.load_data(files=['customers.csv', 'transactions_train.csv', 'articles.csv'])        

index_to_id_dict_path = h_fe.reduce_customer_id_memory(c, [t])
t["week_number"] = h_fe.day_week_numbers(t["t_dat"])
t["t_dat"] = h_fe.day_numbers(t["t_dat"])

CPU times: user 5.64 s, sys: 2.75 s, total: 8.39 s
Wall time: 53.3 s
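
The cell above replaces calendar dates with integer day and week numbers, and the main function later re-expresses them relative to the most recent period. The sketch below shows one plausible convention for that kind of conversion. It is an assumption for illustration only: the function names are hypothetical, and the actual `handmhelpers` helpers may anchor weeks differently.

```python
from datetime import date


def to_day_numbers(dates):
    """Map each date to the number of days since the earliest date (assumed convention)."""
    first = min(dates)
    return [(d - first).days for d in dates]


def to_week_numbers(day_numbers):
    """Group day numbers into consecutive 7-day weeks, starting at day 0 (assumed convention)."""
    return [n // 7 for n in day_numbers]


def relative_to_latest(numbers):
    """Re-express numbers as 'how many periods ago', so 0 is the most recent."""
    latest = max(numbers)
    return [latest - n for n in numbers]


dates = [date(2020, 9, 1), date(2020, 9, 8), date(2020, 9, 22)]
day_numbers = to_day_numbers(dates)
week_numbers = to_week_numbers(day_numbers)
weeks_ago = relative_to_latest(week_numbers)
print(day_numbers, week_numbers, weeks_ago)
```

Working with small integers instead of date objects keeps the transactions table compact and makes "last N weeks" filters simple comparisons.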


## Get item pairs

In [3]:
%%time

pairs_per_item = 5

week_number_pairs = {}
for week_number in [96, 97, 98, 99, 100, 101, 102, 103, 104]:
    print(f"Creating pairs for week number {week_number}")
    week_number_pairs[week_number] = h_pairs.create_pairs(
        t, week_number, pairs_per_item, verbose=False
    )

Creating pairs for week number 96
Creating pairs for week number 97
Creating pairs for week number 98
Creating pairs for week number 99
Creating pairs for week number 100
Creating pairs for week number 101
Creating pairs for week number 102
Creating pairs for week number 103
Creating pairs for week number 104
CPU times: user 39.2 s, sys: 25.1 s, total: 1min 4s
Wall time: 1min 4s
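
For readers unfamiliar with item pairs, here is a minimal sketch of one common way to build them: count how many customers bought both articles, then keep the most frequent partners of each article. This is not the implementation behind `h_pairs.create_pairs`, which may filter by week or weight counts differently. The function name `top_pairs` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def top_pairs(purchases, pairs_per_item):
    """purchases: iterable of (customer_id, article_id) tuples."""
    # Collect the distinct articles bought by each customer.
    baskets = defaultdict(set)
    for customer_id, article_id in purchases:
        baskets[customer_id].add(article_id)

    # For every ordered pair (a, b), count the customers who bought both.
    pair_counts = defaultdict(Counter)
    for articles in baskets.values():
        for a, b in permutations(sorted(articles), 2):
            pair_counts[a][b] += 1

    # Keep the most frequently co-purchased partners of each article.
    return {
        a: [b for b, _ in counts.most_common(pairs_per_item)]
        for a, counts in pair_counts.items()
    }


purchases = [
    ("c1", "shirt"), ("c1", "jeans"),
    ("c2", "shirt"), ("c2", "jeans"), ("c2", "socks"),
    ("c3", "shirt"), ("c3", "socks"), ("c3", "jeans"),
    ("c4", "shirt"), ("c4", "jeans"),
]
pairs = top_pairs(purchases, pairs_per_item=1)
print(pairs["shirt"], pairs["jeans"])
```

A pair table like this lets a retrieval step propose, for each article a customer recently bought, the articles most often bought alongside it.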


## Main retrieval/features function!

In [4]:
def create_candidates_with_features_df(t, c, a, customer_batch=None, **kwargs):
    # splitting cv
    features_df, label_df = h_cv.feature_label_split(
        t, kwargs["label_week"], kwargs["feature_periods"]
    )
    
    # converting day and week numbers to relative values (how long ago)
    features_df["t_dat"] = h_fe.how_many_ago(features_df["t_dat"])
    features_df["week_number"] = h_fe.how_many_ago(features_df["week_number"])
    
    # pull out the article pairs built for the week before the label week
    article_pairs_df = week_number_pairs[kwargs["label_week"]-1]
    
    # check if we can limit customers
    if len(label_df) > 0:
        customers = label_df["customer_id"].unique()
    elif customer_batch is not None:
        customers = customer_batch
    else:
        customers = None
    
    ############################################
    # creating candidates (and adding features)
    ###########################################
    
    features_db = shelve.open("features_db") 
    
    # creating candidate (and saving features created)
    recent_customer_cand, features_db["customer_article"] = (
        h_can.create_recent_customer_candidates(
            features_df,
            kwargs["ca_num_weeks"],
            customers=customers,
        )
    )
    
    (cust_last_week_cand,
     cust_last_week_pair_cand,
     features_db["clw"],
     features_db["clw_pairs"]) = h_can.create_last_customer_weeks_and_pairs(
        features_df,
        article_pairs_df,
        kwargs["clw_num_weeks"],
        kwargs["clw_num_pair_weeks"],
        customers=customers,
    )
    
    popular_cand, features_db["popular_articles"] = h_can.create_popular_article_cand(
        features_df,
        c,
        a,
        kwargs["pa_num_weeks"],
        kwargs["hier_col"],
        num_candidates=kwargs["num_recent_candidates"],
        num_articles=kwargs["num_recent_articles"],
        customers=customers,
    )
    age_bucket_can, _, _ = h_can.create_age_bucket_candidates(
        features_df,
        c,
        kwargs["num_age_buckets"],
        articles=kwargs["num_recent_articles"],
        customers=customers,
    )
    
    cand = [
        recent_customer_cand, cust_last_week_cand, cust_last_week_pair_cand,
        popular_cand, age_bucket_can,
    ]
    cand = cudf.concat(cand).drop_duplicates()
    cand = cand.sort_values(["customer_id", "article_id"]).reset_index(drop=True)
    
    del recent_customer_cand, cust_last_week_cand, cust_last_week_pair_cand, age_bucket_can, popular_cand
    
    cand = h_can.filter_candidates(cand, t, **kwargs)
    
    # creating other features
    h_fe.create_cust_hier_features(features_df, a, kwargs["hier_cols"], features_db)
    h_fe.create_price_features(features_df, features_db)
    h_fe.create_cust_features(c, features_db)
    h_fe.create_article_cust_features(features_df, c, features_db)
    h_fe.create_lag_features(features_df, a, kwargs["lag_days"], features_db)
    h_fe.create_rebuy_features(features_df, features_db)
    h_fe.create_cust_t_features(features_df, a, features_db)
    h_fe.create_art_t_features(features_df, features_db)
    
    del features_df

    # another filter at the end, for the ones that didn't get filtered earlier
    if customers is not None:
        cand = cand[cand["customer_id"].isin(customers)]
    
    # report on recall/precision of candidates
    if kwargs["cv"]:
        ground_truth_candidates = label_df[["customer_id", "article_id"]].drop_duplicates()
        h_cv.report_candidates(cand, ground_truth_candidates)
        del ground_truth_candidates        
    
    # adding features to candidates
    cand_with_f_df = h_can.add_features_to_candidates(
        cand, features_db, c, a
    )
    
    # manually adding article features (couldn't use shelve for some reason)
    for article_col in kwargs["article_columns"]:
        art_col_map = a.set_index("article_id")[article_col]
        cand_with_f_df[article_col] = cand_with_f_df["article_id"].map(art_col_map)
    
    # limiting features
    if kwargs["selected_features"] is not None:
        cand_with_f_df = cand_with_f_df[
            ["customer_id", "article_id"] + kwargs["selected_features"]
        ]
        
    features_db.close()
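    # clean up the files shelve wrote to disk (the extensions depend on the dbm backend)
    for ext in (".bak", ".dir", ".dat"):
        if os.path.exists("features_db" + ext):
            os.remove("features_db" + ext)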
    
    assert len(cand) == len(cand_with_f_df), "seem to have duplicates in the feature dfs"
    del cand
    
    return cand_with_f_df, label_df
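
The function above reports candidate recall and precision when `cv` is set. The counts printed in the CV log (for example `8.00% (18,241/227,910)`) are consistent with the usual definitions: hits divided by ground-truth pairs for recall, and hits divided by candidate pairs for precision. A minimal sketch under that assumption follows. The function name is hypothetical and this is not the code behind `h_cv.report_candidates`.

```python
def candidate_recall_precision(candidates, ground_truth):
    """Both arguments are iterables of (customer_id, article_id) pairs."""
    candidates, ground_truth = set(candidates), set(ground_truth)
    # A hit is a candidate pair that the customer actually bought.
    hits = len(candidates & ground_truth)
    recall = hits / len(ground_truth) if ground_truth else 0.0
    precision = hits / len(candidates) if candidates else 0.0
    return hits, recall, precision


candidates = [("c1", "a"), ("c1", "b"), ("c2", "a"), ("c2", "c")]
ground_truth = [("c1", "a"), ("c2", "c"), ("c2", "d"), ("c3", "a")]
hits, recall, precision = candidate_recall_precision(candidates, ground_truth)
print(f"candidates recall: {recall:.2%} ({hits}/{len(ground_truth)})")
print(f"candidates precision: {precision:.2%} ({hits}/{len(candidates)})")
```

Recall bounds what the ranking model can achieve, since it can only rank pairs that were retrieved, while precision indicates how many negative rows the model has to sort through.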

In [5]:
def calculate_model_score(ids_df, preds, truth_df):
    predictions = h_modeling.create_predictions(ids_df, preds)
    true_labels = h_cv.ground_truth(truth_df).set_index("customer_id")["prediction"]
    score = round(h_cv.comp_average_precision(true_labels, predictions),5)
    
    return score
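
The competition metric is MAP@12, averaged over customers who made at least one purchase in the label week. As a reference for what the scoring function is measuring, here is a minimal pure-Python sketch of that metric. It is an illustration with hypothetical function names, not the code behind `h_cv.comp_average_precision`.

```python
def average_precision_at_k(actual, predicted, k=12):
    """AP@k for one customer: actual is a collection of bought items,
    predicted is a ranked list of recommended items."""
    actual = set(actual)
    if not actual:
        return 0.0
    hits, score, seen = 0, 0.0, set()
    for rank, item in enumerate(predicted[:k], start=1):
        # Count each relevant item once, even if it is predicted twice.
        if item in actual and item not in seen:
            hits += 1
            score += hits / rank
        seen.add(item)
    return score / min(len(actual), k)


def mean_average_precision_at_k(actual_by_customer, predicted_by_customer, k=12):
    """Average AP@k over the customers that have ground-truth purchases."""
    scores = [
        average_precision_at_k(actual, predicted_by_customer.get(customer, []), k)
        for customer, actual in actual_by_customer.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


actual = {"c1": ["a", "b"], "c2": ["c"]}
predicted = {"c1": ["a", "x", "b"], "c2": ["x", "c"]}
score = mean_average_precision_at_k(actual, predicted)
print(round(score, 5))
```

Customers without a prediction simply score zero here, and customers without purchases are left out of the average.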

## Parameters - one place for all!

In [6]:
cv_params = {
    "cv": True,
    "feature_periods": 105,
    "label_week": 104,
    "index_to_id_dict_path": index_to_id_dict_path,
    "pairs_file_version": "_v3_5_ex",
    "num_recent_candidates": 36,
    "num_recent_articles": 12,
    "hier_col": "department_no",
    "ca_num_weeks": 3,
    "clw_num_weeks": 12,
    "clw_num_pair_weeks": 2,
    "pa_num_weeks": 1,
    "num_age_buckets": 4,
    "filter_recent_art_weeks": 1,
    "filter_num_articles": None,
    "lag_days": [1, 3, 14, 30],
    "article_columns": ["index_code"],
    "hier_cols": [
        "department_no", "section_no", "index_group_no", "index_code",
        "product_type_no", "product_group_name"
    ],
    "selected_features": None,
    "lgbm_params": {"n_estimators": 200, "num_leaves": 20},
    "log_evaluation": 10,
    "early_stopping": 20,
    "eval_at": 12,
    "save_model": True,
    "num_concats": 5,
}
sub_params = {
    "cv": False,
    "feature_periods": 105,
    "label_week": 105,
    "index_to_id_dict_path": index_to_id_dict_path,
    "pairs_file_version": "_v3_5_ex",
    "num_recent_candidates": 60,
    "num_recent_articles": 12,
    "hier_col": "department_no",
    "ca_num_weeks": 3,
    "clw_num_weeks": 12,
    "clw_num_pair_weeks": 2,
    "pa_num_weeks": 1,
    "num_age_buckets": 4,
    "filter_recent_art_weeks": 1,
    "filter_num_articles": None,
    "lag_days": [1, 3, 14, 30],
    "article_columns": ["index_code"],
    "hier_cols": [
        "department_no", "section_no", "index_group_no", "index_code",
        "product_type_no", "product_group_name"
    ],
    "selected_features": None,
    "lgbm_params": {
        "n_estimators": 150,
        "num_leaves": 20,    
    },
    "log_evaluation": 10,
    "eval_at": 12,
    "prediction_models": ["model_104", "model_105"],
    "save_model": True,
    "num_concats": 5,
}
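
The two dictionaries above differ in only a handful of keys (`cv`, `label_week`, `num_recent_candidates`, `n_estimators`, `early_stopping`, `prediction_models`). One way to keep them in sync is to derive the submission parameters from the CV ones. The sketch below shows the idea on a subset of the keys. The `derive_params` helper is hypothetical and is not part of `handmhelpers`.

```python
import copy


def derive_params(base, overrides, drop=()):
    """Return a deep copy of base with some keys dropped and others overridden.
    Nested dicts (such as lgbm_params) are merged rather than replaced."""
    params = copy.deepcopy(base)
    for key in drop:
        params.pop(key, None)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(params.get(key), dict):
            params[key].update(value)
        else:
            params[key] = value
    return params


# A subset of the CV parameters, with values copied from the cell above.
base_cv = {
    "cv": True,
    "label_week": 104,
    "num_recent_candidates": 36,
    "lgbm_params": {"n_estimators": 200, "num_leaves": 20},
    "early_stopping": 20,
}
derived_sub = derive_params(
    base_cv,
    {
        "cv": False,
        "label_week": 105,
        "num_recent_candidates": 60,
        "lgbm_params": {"n_estimators": 150},
        "prediction_models": ["model_104", "model_105"],
    },
    drop=("early_stopping",),
)
print(derived_sub)
```

The deep copy matters because `lgbm_params` is a nested dict: a shallow copy would let the submission override leak back into the CV settings.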

In [7]:
cand_features_func = create_candidates_with_features_df
scoring_func = calculate_model_score

In [None]:
%%time
cv_weeks = [104]
results = h_modeling.run_all_cvs(
    t, c, a, cand_features_func, scoring_func, 
    cv_weeks=cv_weeks, **cv_params
)

preparing training modeling dfs for 103...
candidates recall: 8.00% (18,241/227,910)
candidates precision: 0.68% (18,241/2,693,138)
preparing training modeling dfs for 102...
candidates recall: 8.15% (19,394/238,074)
candidates precision: 0.70% (19,394/2,764,674)
preparing training modeling dfs for 101...
candidates recall: 7.83% (19,968/255,172)
candidates precision: 0.67% (19,968/2,981,682)
preparing training modeling dfs for 100...
candidates recall: 7.92% (18,278/230,825)
candidates precision: 0.65% (18,278/2,819,547)
preparing training modeling dfs for 99...
candidates recall: 7.58% (17,975/237,160)
candidates precision: 0.61% (17,975/2,924,751)
concatenating all weeks together
preparing evaluation modeling dfs...
candidates recall: 8.81% (18,820/213,728)
candidates precision: 0.76% (18,820/2,474,671)




Training until validation scores don't improve for 20 rounds
[10]	train's map@12: 0.280074	train's ndcg@12: 0.372799	validation's map@12: 0.274462	validation's ndcg@12: 0.366002
[20]	train's map@12: 0.288354	train's ndcg@12: 0.382036	validation's map@12: 0.277939	validation's ndcg@12: 0.371061
[30]	train's map@12: 0.292605	train's ndcg@12: 0.386755	validation's map@12: 0.280205	validation's ndcg@12: 0.373149
[40]	train's map@12: 0.295994	train's ndcg@12: 0.390606	validation's map@12: 0.280637	validation's ndcg@12: 0.374345
[50]	train's map@12: 0.298685	train's ndcg@12: 0.394046	validation's map@12: 0.282639	validation's ndcg@12: 0.377264
[60]	train's map@12: 0.301864	train's ndcg@12: 0.397141	validation's map@12: 0.284198	validation's ndcg@12: 0.378867
[70]	train's map@12: 0.304544	train's ndcg@12: 0.399944	validation's map@12: 0.284481	validation's ndcg@12: 0.378954
[80]	train's map@12: 0.307066	train's ndcg@12: 0.402337	validation's map@12: 0.285979	validation's ndcg@12: 0.380301
[90

In [None]:
%%time
gc.collect()
h_modeling.full_sub_train_run(t, c, a, cand_features_func, scoring_func, **sub_params)
predictions = h_modeling.full_sub_predict_run(
    t, c, a, cand_features_func, **sub_params
)

In [None]:
sub = h_sub.create_sub(c["customer_id"], predictions, index_to_id_dict_path)
sub.to_csv('dev_submission.csv', index=False)

display(sub.head())
print(sub.shape)