# Introduction
This notebook shows the steps I followed when I faced the assignment's data for the first time.

Following the advice I was given, I did not spend much time on exploratory data analysis and made only the simple checks that were needed.


First, let's see how much space the data takes up on disk. I am going to use inline bash commands, thanks to Jupyter's magic, to get the file sizes in human-readable units:

In [1]:
!du -sh s3/sessions.csv

130M	s3/sessions.csv


In [2]:
!du -sh cache/venues.csv

 92K	cache/venues.csv


That is not much space, so I decided to read each file whole into memory with the plain pandas `read_csv` method. At these sizes it should not cause any out-of-memory errors.
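The same size check can also be done from Python, which is handy when the check has to live in a script rather than a notebook. This is a minimal sketch: the helper name `file_size_mb` is hypothetical, and it runs on a throwaway temporary file instead of the real CSVs. Note that it reports the file length, which is only roughly comparable to what `du` prints, and that a dataframe in memory can be larger than the file on disk.

```python
import os
import tempfile
from pathlib import Path


def file_size_mb(path: str) -> float:
    """Return the size of a file in mebibytes (hypothetical helper)."""
    return Path(path).stat().st_size / 1024**2


# Demonstration on a throwaway file; in the notebook the path
# would be something like "s3/sessions.csv".
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(b"x" * 2048)  # write exactly 2048 bytes
demo_path = tmp.name

demo_size_mb = file_size_mb(demo_path)
print(f"{demo_size_mb:.6f} MiB")

os.remove(demo_path)  # clean up the temporary file
```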

Before reading the files, I want to take a look at the data:

In [3]:
!head s3/sessions.csv

,purchased,session_id,position_in_list,venue_id,has_seen_venue_in_this_session,is_new_user,is_from_order_again,is_recommended
3362,True,010befaf-c5aa-43ba-8561-a3e2ccab277b,0,1424193000929084737,True,False,True,False
3363,False,010befaf-c5aa-43ba-8561-a3e2ccab277b,1,-1970346298375932149,False,False,True,False
3364,False,010befaf-c5aa-43ba-8561-a3e2ccab277b,2,-3266889597638182283,False,False,False,False
3365,False,010befaf-c5aa-43ba-8561-a3e2ccab277b,3,-1599030892486223118,False,False,False,False
3366,False,010befaf-c5aa-43ba-8561-a3e2ccab277b,4,-1514693057066477057,False,False,False,False
3367,False,010befaf-c5aa-43ba-8561-a3e2ccab277b,5,5188415623516897594,False,False,True,False
3368,False,010befaf-c5aa-43ba-8561-a3e2ccab277b,6,-3853058847403276971,False,False,False,False
3369,False,010befaf-c5aa-43ba-8561-a3e2ccab277b,7,5332634774045137434,False,False,True,False
3370,False,010befaf-c5aa-43ba-8561-a3e2ccab277b,8,1007590810240705088,False,False,False,False


In [4]:
!head cache/venues.csv

,venue_id,conversions_per_impression,price_range,rating,popularity,retention_rate
0,1424193000929084737,0.4034916201117318,1,8.6,5.537811281291531,0.38496450000000004
1,-1970346298375932149,0.1914396887159533,1,8.6,0.9660341183903789,0.2500005
2,-3266889597638182283,0.27538177546568215,2,9.0,5.905066922300413,0.30422099999999996
3,-1599030892486223118,0.060618556701030925,2,8.8,0.7547005381380858,0.31437149999999997
4,-1514693057066477057,0.1709515859766277,2,9.2,5.506692230041323,0.32709750000000004
5,5188415623516897594,0.5651188299817185,1,8.2,17.36076312483206,0.621054
6,-3853058847403276971,0.4390477202678287,1,9.0,9.332084213889917,0.42274049999999996
7,5332634774045137434,0.2117497886728656,1,8.8,1.4423698410277639,0.3639705
8,1007590810240705088,0.15143496845104823,2,8.6,2.1768811230455274,0.29713649999999997


# Initialization
In this section all the needed libraries are imported. There are only a few of them: enough to dive into the data and fit a `baseline` ranker with minimal risk of overfitting, nothing special.

I could spend more time here, but I will leave that for the next iteration of the service.

In [5]:
import pandas as pd
from catboost import CatBoostRanker, Pool, cv
from sklearn.model_selection import KFold

from train.src.utils import prepare_datasets, train_and_evaluate, calculate_params

In [6]:
def show_results(results: list[dict]) -> None:
    """Print the results of a cross-validation experiment for a ranking model.

    Arguments:
        results -- a list of dictionaries containing the best scores and feature importances for each fold.

    Returns:
        None. This function only prints the results to the standard output.
    """

    # Extract the MAP@10 scores for the validation set from each fold: 
    map_10 = [result["best_score"]["validation"]["MAP:top=10"] for result in results]
    # Calculate the mean and standard deviation of the MAP@10 scores across all folds:
    mean, std_dev = calculate_params(map_10)
    print("\n---\n")
    # Print the summary statistics for the MAP@10 distribution:
    print("MAP@10 distribution parameters:")
    print(f"mean = {mean:.3f}, std_dev = {std_dev:.3f}")
    # Rough interval (mean +/- 2 std_dev); with only a few folds this is an approximation:
    print(f"95% confidence interval from {(mean-2*std_dev):.3f} to {(mean+2*std_dev):.3f}")

    for i, result in enumerate(results):
        print("\n---\n")

        learn = result["best_score"]["learn"]["MAP:top=10"]
        validation = result["best_score"]["validation"]["MAP:top=10"]

        print(f"Here are results for fold #{i+1} of {len(results)}:")
        print(f"Best performer model hit MAP@10 {validation:.3f} for test data, {learn:.3f} for train data")
        print("\n")

        feature_importances = result["feature_importances"]
        feature_importances = sorted(feature_importances.items(), key=lambda pair: pair[1], reverse=True)

        print("Feature's importances from the most to the least:\n")
        for feature, importance in feature_importances:
            print(f"importance: {importance:.2f}, feature: {feature}")
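The `calculate_params` helper is imported from `train.src.utils` and is not shown in this notebook. A minimal stand-in could look like the sketch below. It assumes the helper returns the mean and the population standard deviation of the fold scores; the numbers printed for the 2-fold run later in the notebook are consistent with that, but it remains an assumption. The name `calculate_params_sketch` is hypothetical.

```python
import statistics


def calculate_params_sketch(values: list[float]) -> tuple[float, float]:
    """Return (mean, population standard deviation) of the given scores.

    Hypothetical stand-in for train.src.utils.calculate_params;
    the real helper may be implemented differently.
    """
    mean = statistics.mean(values)
    std_dev = statistics.pstdev(values)  # population std, i.e. ddof=0
    return mean, std_dev


# Fold scores rounded to three digits, as printed for the 2-fold run.
demo_mean, demo_std = calculate_params_sketch([0.730, 0.739])
print(f"mean = {demo_mean:.4f}, std_dev = {demo_std:.4f}")
```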

In [7]:
RANDOM_STATE = 21

# Work with data 

In [8]:
df_sessions = pd.read_csv("s3/sessions.csv", low_memory=False, index_col=0)
df_venues = pd.read_csv("cache/venues.csv", low_memory=False, index_col=0)

In [9]:
df_sessions.head()

Unnamed: 0,purchased,session_id,position_in_list,venue_id,has_seen_venue_in_this_session,is_new_user,is_from_order_again,is_recommended
3362,True,010befaf-c5aa-43ba-8561-a3e2ccab277b,0,1424193000929084737,True,False,True,False
3363,False,010befaf-c5aa-43ba-8561-a3e2ccab277b,1,-1970346298375932149,False,False,True,False
3364,False,010befaf-c5aa-43ba-8561-a3e2ccab277b,2,-3266889597638182283,False,False,False,False
3365,False,010befaf-c5aa-43ba-8561-a3e2ccab277b,3,-1599030892486223118,False,False,False,False
3366,False,010befaf-c5aa-43ba-8561-a3e2ccab277b,4,-1514693057066477057,False,False,False,False


In [10]:
df_sessions.shape

(1369807, 8)

In [11]:
df_venues.head()

Unnamed: 0,venue_id,conversions_per_impression,price_range,rating,popularity,retention_rate
0,1424193000929084737,0.403492,1,8.6,5.537811,0.384964
1,-1970346298375932149,0.19144,1,8.6,0.966034,0.250001
2,-3266889597638182283,0.275382,2,9.0,5.905067,0.304221
3,-1599030892486223118,0.060619,2,8.8,0.754701,0.314371
4,-1514693057066477057,0.170952,2,9.2,5.506692,0.327097


Check whether the venue identifier is unique within the dataframe. This matters for how the cache database is organized:

In [12]:
df_venues["venue_id"].nunique() == df_venues.shape[0]

True

Yes, the identifiers are unique, so `venue_id` can serve as the key in the data schema. The choice of storage technology for the venues' data will be described as part of the presentation.
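If the check had returned `False`, the next question would be which rows collide. A small sketch on a toy frame (the values are made up for illustration) shows the pandas calls that answer both questions, `Series.is_unique` and `Series.duplicated`:

```python
import pandas as pd

# Toy frame with a deliberately duplicated venue_id (hypothetical values).
toy_venues = pd.DataFrame(
    {"venue_id": [10, 11, 11, 12], "rating": [8.6, 9.0, 9.0, 8.8]}
)

# Same question as comparing nunique() with the row count.
is_key = toy_venues["venue_id"].is_unique

# keep=False marks every row that takes part in a collision.
duplicated_rows = toy_venues[toy_venues["venue_id"].duplicated(keep=False)]

print(is_key)
print(duplicated_rows)
```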

In [13]:
df_sessions.session_id.nunique()

4415

In [14]:
df_sessions.session_id.value_counts()

eac7cdc2-6047-4dfe-896a-7abbd31bab12    2436
922817cf-06eb-4593-90dd-bc9f658ce1f6    2420
f5bc3cc1-bcd1-47ab-a249-b952c469f4a6    2420
de874ff2-9db1-45b3-9e88-de621379c141    2348
b8954892-7082-4cdb-a877-5ec701ef37dd    2340
                                        ... 
e9f1051e-9f5a-459d-a05c-09388447c359      21
89e66ad1-f7d5-4c81-9859-2448d1db7869      14
58D77E89-C7C9-4AB8-896E-99215F7374FC      13
579727F1-1348-4780-B41F-235182EE74D4      13
FBC28911-DECF-4ABF-863F-C4DB415B677A      11
Name: session_id, Length: 4415, dtype: int64

In [15]:
# Count the purchases in each session, then count how many sessions have each total:
purchases_per_session = df_sessions.groupby(by="session_id")["purchased"].sum()
purchases_per_session.value_counts(sort=False).rename_axis(
    "purchased_in_session"
).reset_index(
    name="num_of_sessions"
).sort_values(
    by="purchased_in_session", ascending=True
)

Unnamed: 0,purchased_in_session,num_of_sessions
1,0,34
0,1,4367
3,2,1
2,4,13


Check that venue data can be fetched for every venue that appears in the sessions. Yes, it can: the two sets of venue_ids are equal:

In [16]:
set(df_sessions.venue_id) - set(df_venues.venue_id)

set()

In [17]:
set(df_venues.venue_id) - set(df_sessions.venue_id)

set()
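The two set differences answer different questions: the first finds venues that appear in sessions but have no venue record, the second finds venue records that never appear in any session. A sketch on toy frames (hypothetical values) where both differences are non-empty makes the direction of each check explicit:

```python
import pandas as pd

# Toy frames with hypothetical values: venue 3 has no venue record,
# and venue 4 never appears in a session.
toy_sessions = pd.DataFrame({"session_id": ["a", "a", "b"], "venue_id": [1, 2, 3]})
toy_venues = pd.DataFrame({"venue_id": [1, 2, 4]})

session_venues = set(toy_sessions["venue_id"].tolist())
known_venues = set(toy_venues["venue_id"].tolist())

# Venues we would fail to enrich with venue features.
missing_from_venues = sorted(session_venues - known_venues)
# Venue records that no session refers to.
never_shown = sorted(known_venues - session_venues)

print(missing_from_venues, never_shown)
```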

In [18]:
df_sessions.shape[1]

8

Prepare a full dataset that includes all the available data.

In [19]:
df_all = pd.merge(left=df_sessions, right=df_venues, on="venue_id", how="left")

Simple quality checks:

In [20]:
df_sessions.shape[1] + df_venues.shape[1] - 1 == df_all.shape[1]

True

In [21]:
df_sessions.shape[0] == df_all.shape[0]

True
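As an alternative to checking shapes after the fact, `pd.merge` can enforce some of these expectations itself through its `validate` and `indicator` parameters. The sketch below uses toy frames with made-up values; it is one possible way to do the check, not the one used in this notebook.

```python
import pandas as pd

# Toy frames with hypothetical values; venue 3 has no row in toy_venues.
toy_sessions = pd.DataFrame({"session_id": ["a", "a", "b"], "venue_id": [1, 2, 3]})
toy_venues = pd.DataFrame({"venue_id": [1, 2], "rating": [8.6, 9.0]})

merged = pd.merge(
    toy_sessions,
    toy_venues,
    on="venue_id",
    how="left",
    validate="many_to_one",  # raises pandas.errors.MergeError if venue_id repeats on the right
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# A left join with a unique right key keeps the row count of the left frame.
same_row_count = len(merged) == len(toy_sessions)
# Rows whose venue_id found no match in toy_venues.
unmatched_rows = int((merged["_merge"] == "left_only").sum())

print(same_row_count, unmatched_rows)
```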

In [22]:
df_all["purchased"] = df_all.purchased.astype(int)

In [23]:
df_all.sort_values(by=["session_id", "position_in_list"], ascending=True, inplace=True)

In [24]:
df_all.reset_index(drop=True, inplace=True)

# Modelling

In [25]:
del df_all["position_in_list"]
del df_all["venue_id"]

In [26]:
df_all.head()

Unnamed: 0,purchased,session_id,has_seen_venue_in_this_session,is_new_user,is_from_order_again,is_recommended,conversions_per_impression,price_range,rating,popularity,retention_rate
0,0,0013B033-6B3E-4FDF-AFA1-3B1B1E830892,True,False,False,False,0.152793,1,9.4,0.897205,0.276595
1,0,0013B033-6B3E-4FDF-AFA1-3B1B1E830892,True,False,False,False,0.068783,2,10.0,0.194737,0.375
2,0,0013B033-6B3E-4FDF-AFA1-3B1B1E830892,True,False,False,False,0.04,1,,1.71379,0.034091
3,0,0013B033-6B3E-4FDF-AFA1-3B1B1E830892,True,False,False,False,0.281423,3,8.8,16.555698,0.220719
4,0,0013B033-6B3E-4FDF-AFA1-3B1B1E830892,True,False,False,False,0.316285,1,9.2,4.036884,0.351399


In [27]:
sessions = df_all["session_id"].unique()

I chose the CatBoost library for ranking the list of venues because it gives a strong baseline out of the box and offers many ways to tune the model and the metric through its parameters, including a grid search over them.

Last but not least, I know this library well enough.

In [28]:
ranker = CatBoostRanker(
    loss_function="YetiRank",
    iterations=4000,
    early_stopping_rounds=100,
    random_state=RANDOM_STATE,
    name="ranker' training",
    eval_metric="MAP:top=10",
    task_type="CPU",
    logging_level="Silent",
    use_best_model=True,
    learning_rate=None,
    random_strength=None,
    ignored_features=None,
    min_data_in_leaf=None,
    min_child_samples=None,
    diffusion_temperature=None,
    bagging_temperature=2.0,
    depth=None,
)

One of the challenges of machine learning is to avoid overfitting, which means that the model fits the training data too closely and fails to generalize to new, unseen data.

To detect overfitting, I used cross-validation over several folds, a technique that splits the data into subsets and, for each subset in turn, trains the model on the rest of the data and tests it on that subset. The split is made by session_id, so all rows of a session land in the same fold.

For a first look at the metrics, I chose the minimal number of folds (2), just to get a fitted estimator in the shortest time.

This means that I divided the data into two parts, one for training and one for testing, and repeated the process twice by switching the roles of the parts.

This way, I could quickly get an idea of how well the model performs on different data and adjust the parameters accordingly.
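Keeping every row of a session in a single fold is what makes the validation score honest for a ranking task. The notebook achieves this by splitting the array of unique session ids with `KFold`. A sketch of an alternative that gives the same guarantee, scikit-learn's `GroupKFold`, is shown below on toy data (the session ids are made up). Unlike the shuffled `KFold` used here, `GroupKFold` without extra arguments is deterministic.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 6 sessions with 3 rows each (hypothetical ids).
session_ids = np.repeat(["s1", "s2", "s3", "s4", "s5", "s6"], 3)
features = np.arange(len(session_ids)).reshape(-1, 1)

folds = []
for train_idx, test_idx in GroupKFold(n_splits=2).split(features, groups=session_ids):
    train_sessions = set(session_ids[train_idx].tolist())
    test_sessions = set(session_ids[test_idx].tolist())
    folds.append((train_sessions, test_sessions))
    print(f"train: {sorted(train_sessions)}, test: {sorted(test_sessions)}")
```

Every session appears in exactly one test fold, and no session is shared between the train and test parts of a fold.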

In [29]:
kf = KFold(n_splits=2, shuffle=True, random_state=RANDOM_STATE)
results = []
for train, test in kf.split(sessions):
    sessions_train = set(sessions[train])
    sessions_test = set(sessions[test])
    df_train = df_all[df_all["session_id"].isin(sessions_train)]
    df_test = df_all[df_all["session_id"].isin(sessions_test)]
    train_set, eval_set, names = prepare_datasets(df_train, df_test)
    res = train_and_evaluate(train_set, eval_set, names, ranker=ranker)
    results.append(res)
show_results(results)


---

MAP@10 distribution parameters:
mean = 0.735, std_dev = 0.004
95% confidence interval from 0.726 to 0.743

---

Here are results for fold #1 of 2:
Best performer model hit MAP@10 0.730 for test data, 0.792 for train data


Feature's importances from the most to the least:

importance: 32.17, feature: has_seen_venue_in_this_session
importance: 15.75, feature: popularity
importance: 14.82, feature: conversions_per_impression
importance: 14.07, feature: retention_rate
importance: 11.23, feature: rating
importance: 4.83, feature: price_range
importance: 3.66, feature: is_from_order_again
importance: 2.67, feature: is_recommended
importance: 0.80, feature: is_new_user

---

Here are results for fold #2 of 2:
Best performer model hit MAP@10 0.739 for test data, 0.779 for train data


Feature's importances from the most to the least:

importance: 34.26, feature: has_seen_venue_in_this_session
importance: 15.20, feature: retention_rate
importance: 15.11, feature: conversions_per_impressi

Surprisingly, the values of the MAP metric are very good!
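To make these numbers easier to read, here is a minimal sketch of how MAP@10 can be computed from ranked lists of 0/1 purchase labels. It uses one common definition of average precision at k, normalized by the number of hits inside the top k; CatBoost's exact normalization may differ, so treat this as an illustration only. The function names are hypothetical. With a single purchase per session, which is the typical case in this data, the per-session score is simply 1 / rank of the purchased venue if it is in the top 10, and 0 otherwise.

```python
def average_precision_at_k(relevance: list[int], k: int = 10) -> float:
    """Average precision over the top-k items of one ranked list of 0/1 labels."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision_at_k(ranked_sessions: list[list[int]], k: int = 10) -> float:
    """Mean of the per-session average precision values."""
    scores = [average_precision_at_k(relevance, k) for relevance in ranked_sessions]
    return sum(scores) / len(scores)


# Labels are listed in the order the model ranked the venues (toy data).
toy_sessions = [
    [1, 0, 0],       # purchase ranked first -> AP = 1.0
    [0, 1, 0],       # purchase ranked second -> AP = 0.5
    [0] * 10 + [1],  # purchase ranked 11th, outside the top 10 -> AP = 0.0
]
toy_map = mean_average_precision_at_k(toy_sessions)
print(f"MAP@10 = {toy_map:.3f}")
```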

Let's go deeper into the analysis of feature importances.

Here we see importance values calculated with the LossFunctionChange method: for each feature, the value represents the difference between the loss value of the model with this feature and without it.
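The idea of "loss with the feature versus loss without it" can be illustrated on a toy problem. The sketch below retrains an ordinary least squares model without each feature in turn and reports how much the training loss grows. This is the exact retrain-and-compare variant on synthetic data, chosen for clarity; it is not CatBoost's implementation, which approximates the model without the feature instead of retraining. All names and data here are hypothetical.

```python
import numpy as np


def mse_of_least_squares_fit(X: np.ndarray, y: np.ndarray) -> float:
    """Fit ordinary least squares with an intercept and return the training MSE."""
    design = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return float(np.mean(residuals**2))


def drop_feature_loss_change(X: np.ndarray, y: np.ndarray, names: list[str]) -> dict[str, float]:
    """For each feature: loss of the model without it minus loss of the full model."""
    full_loss = mse_of_least_squares_fit(X, y)
    changes = {}
    for i, name in enumerate(names):
        reduced = np.delete(X, i, axis=1)  # drop column i and refit
        changes[name] = mse_of_least_squares_fit(reduced, y) - full_loss
    return changes


# Synthetic data: one strong feature, one weak feature, one pure-noise feature.
rng = np.random.default_rng(21)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

loss_change = drop_feature_loss_change(X, y, ["strong", "weak", "noise"])
for name, change in sorted(loss_change.items(), key=lambda pair: pair[1], reverse=True):
    print(f"loss change: {change:.3f}, feature: {name}")
```

A larger value means that removing the feature hurts the model more, which is how the importances printed above should be read.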

The most important feature is the binary `has_seen_venue_in_this_session`, which has a significant impact on the model performance.

The main problem is that this feature is not available at the moment of model serving: we don't know which venues the user has seen in the session.

There are several options to solve this problem:
1) remove this feature from the model,
2) or use it as a feature for the next session. But we don't have enough data to train such a model: there isn't any user_id field.

That is why we can't use it in our model.

I assumed that the candidates for a session are pre-calculated during an overnight batch job.

So, the model should take a list of candidates and return a list of the most relevant venues for a session.



In [30]:
cols_reduced = list(df_all.columns)

In [31]:
cols_reduced.remove("has_seen_venue_in_this_session")

I removed this feature from the model and trained a simple baseline model over 4 folds:

In [32]:
kf = KFold(n_splits=4, shuffle=True, random_state=RANDOM_STATE)
results = []
for train, test in kf.split(sessions):
    sessions_train = set(sessions[train])
    sessions_test = set(sessions[test])
    df_train = df_all[df_all["session_id"].isin(sessions_train)][cols_reduced]
    df_test = df_all[df_all["session_id"].isin(sessions_test)][cols_reduced]
    train_set, eval_set, names = prepare_datasets(df_train, df_test)
    res = train_and_evaluate(train_set, eval_set, names, ranker=ranker)
    results.append(res)
show_results(results)


---

MAP@10 distribution parameters:
mean = 0.495, std_dev = 0.014
95% confidence interval from 0.468 to 0.522

---

Here are results for fold #1 of 4:
Best performer model hit MAP@10 0.493 for test data, 0.535 for train data


Feature's importances from the most to the least:

importance: 29.12, feature: popularity
importance: 21.61, feature: conversions_per_impression
importance: 17.61, feature: retention_rate
importance: 14.65, feature: rating
importance: 6.35, feature: is_from_order_again
importance: 5.91, feature: price_range
importance: 4.23, feature: is_recommended
importance: 0.51, feature: is_new_user

---

Here are results for fold #2 of 4:
Best performer model hit MAP@10 0.494 for test data, 0.527 for train data


Feature's importances from the most to the least:

importance: 20.25, feature: conversions_per_impression
importance: 19.87, feature: popularity
importance: 19.64, feature: retention_rate
importance: 13.21, feature: rating
importance: 11.68, feature: is_from_order

Nothing suspicious. The model is fitted, and we save the best estimator to a file. It will be used in the next step, when we prepare the model for serving.

In [35]:
# Keep the model from the fold with the highest validation MAP@10:
best_result = max(results, key=lambda result: result["best_score"]["validation"]["MAP:top=10"])
best_ranker = best_result["model"]
best_ranker.save_model("./s3/weights.cbm")

This notebook is used as a base for the "production" training pipeline, represented by the [ranker.py](train/src/ranker.py) file.

That is all for the training part. Now we can start to serve the model. 
