In [None]:
!wget -O zen_dataset.tar.gz https://www.dropbox.com/s/5ugsinj434yzmu6/zen_dataset.tar.gz?dl=0
!tar -xzvf zen_dataset.tar.gz

In [None]:
!pip install catboost

# Ranking model

A classical multistage recommendation pipeline looks like this:

![](https://raw.githubusercontent.com/girafe-ai/recsys/3f374f49cede21d25c777aa3a274b9cbadc29d19/homeworks/recsys-pipeline.png)

1. Candidate selection: at this stage we use relatively simple models (kNN over embeddings, collaborative filtering results, or SLIM).
1. Ranking model: candidates collected at the previous stage are scored with a more complex model (nowadays usually gradient boosting).
1. Reranking: applying business logic, heuristics, and ad hoc rules.

We discussed this pipeline in the first lecture.

In this task you will build a dataset and train a ranking model on the Dzen dataset.

You need to use a gradient boosting model, e.g. CatBoost (any other library works just as well). You will need the following features:
* Dot product and cosine distance between user and item embeddings. Use *explicit* and *implicit ALS* and content models as embedding sources.
* Item and user statistics (counters) such as CTR, view count, etc.
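
As an illustration of a counter feature, here is a minimal sketch of a smoothed CTR. The smoothing itself is an optional choice, not something the task requires: it pulls the CTR of rarely shown items towards a prior so that items with a handful of views do not get extreme values. The function name `smoothed_ctr`, the `clicks` and `shows` arrays, and the `prior_weight` default are all hypothetical and are not taken from the dataset.

```python
import numpy as np


def smoothed_ctr(clicks, shows, prior_ctr=None, prior_weight=10.0):
    """Additively smoothed CTR: (clicks + w * prior) / (shows + w).

    `clicks` and `shows` are hypothetical per-item (or per-user) counters.
    """
    clicks = np.asarray(clicks, dtype=float)
    shows = np.asarray(shows, dtype=float)
    if prior_ctr is None:
        # Fall back to the global CTR as the prior.
        prior_ctr = clicks.sum() / max(shows.sum(), 1.0)
    return (clicks + prior_weight * prior_ctr) / (shows + prior_weight)
```

With `prior_weight=0` and non-zero `shows` this reduces to the plain ratio `clicks / shows`.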


In [None]:
import numpy as np
import pandas as pd

import tqdm
import json
import itertools
import collections

import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score

sns.set()

In [None]:
item_counts = pd.read_csv('zen_dataset/item_counts.csv', index_col=0)
item_meta = pd.read_csv('zen_dataset/item_meta.gz', compression='gzip', index_col=0)
user_ratings = pd.read_csv('zen_dataset/user_ratings.gz', compression='gzip', index_col=0)

In [None]:
item_counts['itemId'] = item_counts['itemId'].apply(str)
item_meta['itemId'] = item_meta['itemId'].apply(str)

In [None]:
def parse_ratings_history(string):
    return json.loads(string.replace("'", '"'))

In [None]:
user_encoder = LabelEncoder().fit(user_ratings['userId'])
item_encoder = LabelEncoder().fit(item_counts['itemId'])

all_items = item_counts['itemId']
indices = item_encoder.transform(all_items)
item_to_id = dict(zip(all_items, indices))

## ALS (10 points)

Train explicit and implicit ALS.

You may use your implementation from the previous homework or a library implementation such as _implicit_.
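
For reference, the alternating updates of explicit ALS can be sketched on a small dense matrix as below. This is a toy illustration under stated assumptions, not the expected solution: the name `explicit_als_sketch`, the dense `ratings` and boolean `mask` inputs, and the `reg` value are hypothetical, and a real dataset would need sparse matrices. Implicit ALS differs in that every user-item cell contributes to the loss, weighted by a confidence such as `1 + alpha * r`.

```python
import numpy as np


def explicit_als_sketch(ratings, mask, dimension=8, steps=10, reg=0.1, seed=0):
    """Toy explicit ALS: fit only the entries where `mask` is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    users = rng.normal(scale=0.1, size=(n_users, dimension))
    items = rng.normal(scale=0.1, size=(n_items, dimension))
    ridge = reg * np.eye(dimension)

    for _ in range(steps):
        # Fix item embeddings, solve a ridge regression per user.
        for u in range(n_users):
            observed = mask[u]
            if observed.any():
                V = items[observed]
                users[u] = np.linalg.solve(V.T @ V + ridge, V.T @ ratings[u, observed])
        # Fix user embeddings, solve a ridge regression per item.
        for i in range(n_items):
            observed = mask[:, i]
            if observed.any():
                U = users[observed]
                items[i] = np.linalg.solve(U.T @ U + ridge, U.T @ ratings[observed, i])
    return users, items
```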

In [None]:
DIMENSION = ... # choose appropriately

In [None]:
def train_eals(data, dimension=DIMENSION, steps=10):
    # your preferred training method

    return user_embeddings, item_embeddings


def train_ials(data, dimension=DIMENSION, steps=10, alpha=10):
    # your preferred training method

    return user_embeddings, item_embeddings

In [None]:
eals_user_embeddings, eals_item_embeddings = ...

In [None]:
ials_user_embeddings, ials_item_embeddings = ...

## Content model (5 points)

Choose a content model you like and apply it to all the items (remember to use all the available information).

Some sane choices are RuBERT or any CLIP model. You can find them on [Hugging Face](https://huggingface.co/models).

Which dimensionality would you choose for the content model? Why?

In [None]:
content_item_embeddings = ...

## ALS step from content model (5 points)

To use content-based information (important for cold items and users) in user-item ranking, you need to build user embeddings. One simple way is to perform a single ALS step with the item embeddings held fixed, which yields the user embeddings.

You will then use these embeddings to compute user-item features for the final ranking.
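
With the item embeddings fixed, one ALS step for a single user is just a ridge regression of that user's ratings on the embeddings of the items they rated. A minimal sketch, assuming explicit ratings and a hypothetical regularization weight `reg` (the function name and argument layout are illustrative only):

```python
import numpy as np


def als_user_step(item_embeddings, user_item_ids, user_ratings, reg=1.0):
    """Solve for one user's embedding with item embeddings held fixed."""
    V = item_embeddings[np.asarray(user_item_ids)]  # (n_rated, dim)
    r = np.asarray(user_ratings, dtype=float)       # (n_rated,)
    A = V.T @ V + reg * np.eye(V.shape[1])
    return np.linalg.solve(A, V.T @ r)
```

Repeating this for every user gives a user embedding matrix that lives in the same space as the content item embeddings.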

In [None]:
content_user_embeddings = ...

## Ranking (10 points)


Build embedding-based user-item features for the ranking model.
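
The two features themselves are cheap to compute in a vectorized way. The sketch below is only an illustration with hypothetical names (`dot_and_cosine`, `eps`); it returns the cosine similarity, from which the cosine distance is one minus the value.

```python
import numpy as np


def dot_and_cosine(user_vector, item_matrix, eps=1e-12):
    """Dot product and cosine similarity between one user and many items."""
    user_vector = np.asarray(user_vector, dtype=float)
    item_matrix = np.asarray(item_matrix, dtype=float)
    dot = item_matrix @ user_vector
    norms = np.linalg.norm(item_matrix, axis=1) * np.linalg.norm(user_vector)
    # Guard against zero vectors (e.g. cold users or items).
    cos = dot / np.maximum(norms, eps)
    return dot, cos
```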

In [None]:
class EmbeddingFeatureGetter:
    def __init__(self, user_embeddings, item_embeddings):
        self.user_embeddings = user_embeddings
        self.item_embeddings = item_embeddings

    def get_features(self, user_id, item_ids):
        """
        * user_id -- user index to compute features
        * item_ids -- list of item indexes
        """
        ...

        return dot, cos

In [None]:
eals_features_getter = EmbeddingFeatureGetter(eals_user_embeddings, eals_item_embeddings)
ials_features_getter = EmbeddingFeatureGetter(ials_user_embeddings, ials_item_embeddings)
content_features_getter = EmbeddingFeatureGetter(content_user_embeddings, content_item_embeddings)

Build item and user features:

In [None]:
item_features = [...]

In [None]:
user_features = [...]

Then build the dataset for the boosting model.

In CatBoost it is called a `Pool` and can be built [according to the documentation](https://catboost.ai/en/docs/concepts/python-reference_catboost).

Once again, you may use any library; there is almost no difference between them nowadays.

Don't forget to use statistical features!
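
One practical detail when passing `group_id`: CatBoost expects the rows of each group to be stored contiguously. A minimal sketch of reordering the arrays before building the `Pool`, assuming the group is the user and that features, labels, and group ids are aligned row by row (the helper name `sort_by_group` is hypothetical):

```python
import numpy as np


def sort_by_group(features, labels, group_ids):
    """Reorder rows so that all rows of one group are adjacent."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    group_ids = np.asarray(group_ids)
    # A stable sort keeps the original row order inside each group.
    order = np.argsort(group_ids, kind="stable")
    return features[order], labels[order], group_ids[order]
```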

In [None]:
import catboost

In [None]:
train_features = ...
train_labels = [...]
train_group_ids = [...]

In [None]:
train_pool = catboost.Pool(train_features, train_labels, group_id=train_group_ids)

In [None]:
test_features = ...
test_labels = [...]
test_group_ids = [...]

In [None]:
test_pool = catboost.Pool(test_features, test_labels, group_id=test_group_ids)

Train the model. For the best results you may need to read about [loss functions](https://catboost.ai/en/docs/references/training-parameters/common#loss_function) and their parameters.

Try two cases:
* binary classification
* ranking loss

For the ranking loss, pay attention to the number of sampled pairs, as too many can make training very slow.
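
The two cases differ only in the parameter dictionary. The sketch below shows one possible way to build it, assuming the `Logloss` and `PairLogit` loss names and the `PairLogit:max_pairs=N` syntax from the CatBoost loss-function documentation. The helper `make_params` and all numeric values are illustrative placeholders, not recommended settings.

```python
def make_params(loss_function, max_pairs=None, iterations=500):
    """Build a CatBoost parameter dict; every value here is illustrative."""
    if max_pairs is not None:
        # Limit the number of generated pairs per group to bound training time.
        loss_function = f"{loss_function}:max_pairs={max_pairs}"
    return {"loss_function": loss_function, "iterations": iterations, "verbose": 100}


binary_params = make_params("Logloss")
ranking_params = make_params("PairLogit", max_pairs=50)
```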


In [None]:
cb = catboost.CatBoost({...})

In [None]:
cb.fit(train_pool, eval_set=test_pool)

Choose at least three metrics relevant for this setup and measure them for both loss functions:
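
As one possible starting point, per-user NDCG@k and reciprocal rank can be computed as below and then averaged over users. This is a sketch under assumptions: labels are used directly as gains, and `per_user` is a hypothetical list of `(labels, scores)` pairs, one per user, which may differ from the layout you choose for `per_user_predictions`.

```python
import numpy as np


def dcg_at_k(labels_in_rank_order, k):
    """Discounted cumulative gain of the top-k labels (label = gain)."""
    top = np.asarray(labels_in_rank_order, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, top.size + 2))
    return float(np.sum(top * discounts))


def ndcg_at_k(labels, scores, k=10):
    """NDCG@k for one user; 0.0 if the user has no positive labels."""
    labels = np.asarray(labels, dtype=float)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    ideal = dcg_at_k(np.sort(labels)[::-1], k)
    return dcg_at_k(labels[order], k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(labels, scores):
    """1 / rank of the first positive item, or 0.0 if there is none."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = np.flatnonzero(np.asarray(labels)[order] > 0)
    return 1.0 / (hits[0] + 1) if hits.size else 0.0


def mean_over_users(metric, per_user):
    """Average a per-user metric over (labels, scores) pairs."""
    return float(np.mean([metric(labels, scores) for labels, scores in per_user]))
```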

In [None]:
per_user_predictions = [...]  # user -> predictions

In [None]:
# your code here

## Analysis and conclusion (10 points)

Analyse feature importance using [SHAP values](https://github.com/shap/shap) and make some visualizations.
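
Before plotting, a quick numeric summary can help. The sketch below ranks features by mean absolute SHAP value. It assumes the SHAP matrix has shape `(n_objects, n_features + 1)` with the expected value in the last column, which is the layout CatBoost is understood to return from `get_feature_importance(data=pool, type='ShapValues')`; check this against your library version. The function name `mean_abs_shap` is hypothetical.

```python
import numpy as np


def mean_abs_shap(shap_matrix, feature_names):
    """Rank features by mean |SHAP|, dropping the trailing expected-value column."""
    shap_matrix = np.asarray(shap_matrix, dtype=float)
    contributions = shap_matrix[:, :-1]
    importance = np.abs(contributions).mean(axis=0)
    order = np.argsort(-importance)
    return [(feature_names[i], float(importance[i])) for i in order]
```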

Draw conclusions about which loss works better in this case and how this is correlated with the features and the results (concrete rankings).

In [None]:
# your code here

**Your conclusion here**