# Course: Recommender Systems

# Lesson 6 homework. Two-level recommendation models

**Task 1.**

A) Try different candidate-generation strategies. Which of them give the highest recall@k?
- For now, select 50 candidates (k=50)
- Measure quality on data_val_matcher: the 6 weeks following the training period

Do own recommendations + top-popular give the best recall?  

B)* How does recall@k depend on k? For one candidate-generation scheme, plot this dependence for k = {20, 50, 100, 200, 500}  
C)* Based on the previous question, which value of k do you think is the most reasonable?

**Task 2.**

Train a 2nd-level model, and in doing so:

- Add at least 2 features each for the user, the item, and the user-item pair

- Measure precision@5 separately for the 1st-level model and for the two-level model on data_val_ranker

- Did precision@5 improve with the two-level model?

---

## Loading libraries

In [86]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Sparse matrices
from scipy.sparse import csr_matrix

# Matrix factorization
from implicit import als

# Second-level model
from lightgbm import LGBMClassifier

import os, sys
module_path = os.path.abspath(os.path.join(os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

# Our own helper functions
from metrics import precision_at_k, recall_at_k
from utils import prefilter_items
from recommenders import MainRecommender

## Reading the data

In [87]:
data = pd.read_csv('retail_train.csv')
item_features = pd.read_csv('product.csv')
user_features = pd.read_csv('hh_demographic.csv')

## Preparing the dataset

In [88]:
ITEM_COL = 'item_id'
USER_COL = 'user_id'

In [89]:
# column processing
item_features.columns = [col.lower() for col in item_features.columns]
user_features.columns = [col.lower() for col in user_features.columns]

item_features.rename(columns={'product_id': ITEM_COL}, inplace=True)
user_features.rename(columns={'household_key': USER_COL }, inplace=True)

## Splitting the dataset into training, validation and test samples

In [90]:
# The training and validation scheme matters!
# -- older purchases -- | -- 6 weeks -- | -- 3 weeks --
# choose the size of the 2nd dataset (6 weeks) --> learning curve (recall@k as a function of dataset size)


VAL_MATCHER_WEEKS = 6
VAL_RANKER_WEEKS = 3

In [91]:
# data for training the matching model
data_train_matcher = data[data['week_no'] < data['week_no'].max() - (VAL_MATCHER_WEEKS + VAL_RANKER_WEEKS)]

# data for validating the matching model
# recall is computed on this sample
data_val_matcher = data[(data['week_no'] >= data['week_no'].max() - (VAL_MATCHER_WEEKS + VAL_RANKER_WEEKS)) &
                      (data['week_no'] < data['week_no'].max() - (VAL_RANKER_WEEKS))]


# data for training the ranking model (the 2nd-level model)
data_train_ranker = data_val_matcher.copy()  # For clarity. We modify it later, so the two will differ

# data for testing the ranking and matching models
# the combined quality of the whole pipeline is tested on this sample
data_val_ranker = data[data['week_no'] >= data['week_no'].max() - VAL_RANKER_WEEKS]
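
The split above can be sketched as a single function. This is only an illustration on toy data: the function name `time_split` and the toy frame are assumptions, not part of the notebook's `utils` module.

```python
import pandas as pd

def time_split(df, val_matcher_weeks=6, val_ranker_weeks=3, week_col='week_no'):
    """Split transactions by week into matcher-train, matcher-val and ranker-val."""
    last_week = df[week_col].max()
    ranker_start = last_week - val_ranker_weeks          # first week of the last window
    matcher_start = ranker_start - val_matcher_weeks     # first week of the middle window

    train_matcher = df[df[week_col] < matcher_start]
    val_matcher = df[(df[week_col] >= matcher_start) & (df[week_col] < ranker_start)]
    val_ranker = df[df[week_col] >= ranker_start]
    return train_matcher, val_matcher, val_ranker

# toy data (hypothetical): one transaction per week, weeks 1..20
toy = pd.DataFrame({'week_no': range(1, 21), 'user_id': 1, 'item_id': range(100, 120)})
train_m, val_m, val_r = time_split(toy)
print(len(train_m), len(val_m), len(val_r))
```

With these boundaries the last window contains `val_ranker_weeks + 1` distinct week numbers (17 to 20 in the toy data), because the comparison is `>= max - 3`.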

In [92]:
def print_stats_data(df_data, name_df):
    print(name_df)
    print(f"Shape: {df_data.shape} Users: {df_data[USER_COL].nunique()} Items: {df_data[ITEM_COL].nunique()}")

In [93]:
print_stats_data(data_train_matcher,'train_matcher')
print_stats_data(data_val_matcher,'val_matcher')
print_stats_data(data_train_ranker,'train_ranker')
print_stats_data(data_val_ranker,'val_ranker')

train_matcher
Shape: (2108779, 12) Users: 2498 Items: 83685
val_matcher
Shape: (169711, 12) Users: 2154 Items: 27649
train_ranker
Shape: (169711, 12) Users: 2154 Items: 27649
val_ranker
Shape: (118314, 12) Users: 2042 Items: 24329


In [94]:
# the output above shows how the numbers of users and items vary across the samples

In [95]:
data_train_matcher.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0


## Prefiltering items

In [96]:
n_items_before = data_train_matcher['item_id'].nunique()

data_train_matcher = prefilter_items(data_train_matcher, item_features=item_features, take_n_popular=5000)

n_items_after = data_train_matcher['item_id'].nunique()
print('Decreased # items from {} to {}'.format(n_items_before, n_items_after))

Decreased # items from 83685 to 5001


## Moving from cold start to warm start

In [97]:
# find the users present in all three samples
common_users = list(set(data_train_matcher.user_id.values)&(set(data_val_matcher.user_id.values))&set(data_val_ranker.user_id.values))

data_train_matcher = data_train_matcher[data_train_matcher.user_id.isin(common_users)]
data_val_matcher = data_val_matcher[data_val_matcher.user_id.isin(common_users)]
data_train_ranker = data_train_ranker[data_train_ranker.user_id.isin(common_users)]
data_val_ranker = data_val_ranker[data_val_ranker.user_id.isin(common_users)]

print_stats_data(data_train_matcher,'train_matcher')
print_stats_data(data_val_matcher,'val_matcher')
print_stats_data(data_train_ranker,'train_ranker')
print_stats_data(data_val_ranker,'val_ranker')

train_matcher
Shape: (784420, 13) Users: 1915 Items: 4999
val_matcher
Shape: (163261, 12) Users: 1915 Items: 27118
train_ranker
Shape: (163261, 12) Users: 1915 Items: 27118
val_ranker
Shape: (115989, 12) Users: 1915 Items: 24042


In [98]:
# Now this is a warm start with respect to users

## Initializing and training the recommender

In [99]:
recommender = MainRecommender(data_train_matcher)


### Ways to obtain candidates

All of these options can later be combined into one

(!) If a model recommends fewer than N items, the recommendations are padded with top-popular items up to N
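
The padding rule can be sketched as follows. The helper name `pad_with_popular` is hypothetical; how `MainRecommender` actually implements padding is not shown in this notebook.

```python
def pad_with_popular(recs, popular, n):
    """Extend `recs` with items from `popular` (most popular first) up to n unique items."""
    result = list(dict.fromkeys(recs))[:n]  # drop duplicates while keeping order
    seen = set(result)
    for item in popular:
        if len(result) >= n:
            break
        if item not in seen:  # never recommend the same item twice
            result.append(item)
            seen.add(item)
    return result

# the model returned only two items, so two more come from the popular list
padded = pad_with_popular([5, 7], popular=[7, 1, 2, 3], n=4)
print(padded)
```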

In [100]:
# Take test user 2375

In [101]:
recommender.get_als_recommendations(2375, N=5)

[871756, 899624, 844179, 1044078, 832678]

In [102]:
recommender.get_own_recommendations(2375, N=5)

[948640, 918046, 847962, 907099, 873980]

In [103]:
recommender.get_similar_items_recommendation(2375, N=5)

[1046545, 1044078, 999270, 1078652, 963542]

In [104]:
recommender.get_similar_users_recommendation(2375, N=5)

[1133654, 825317, 1097398, 1107760, 918638]

## Evaluating matching recall

**Task 1.**

A) Try different candidate-generation strategies. Which of them give the highest recall@k?
- For now, select 50 candidates (k=50)
- Measure quality on data_val_matcher: the 6 weeks following the training period

Do own recommendations + top-popular give the best recall?  

B)* How does recall@k depend on k? For one candidate-generation scheme, plot this dependence for k = {20, 50, 100, 200, 500}  
C)* Based on the previous question, which value of k do you think is the most reasonable?

In [105]:
ACTUAL_COL = 'actual'

In [106]:
result_eval_matcher = data_val_matcher.groupby(USER_COL)[ITEM_COL].unique().reset_index()
result_eval_matcher.columns=[USER_COL, ACTUAL_COL]
result_eval_matcher.head(2)

Unnamed: 0,user_id,actual
0,1,"[853529, 865456, 867607, 872137, 874905, 87524..."
1,6,"[1024306, 1102949, 6548453, 835394, 940804, 96..."


In [107]:
# number of candidates to generate per user
N_PREDICT = 50 

In [108]:
%%time
# for clarity everything is written inline, without functions; your task is to be able to wrap all of this in functions
result_eval_matcher['own_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_own_recommendations(x, N=N_PREDICT))
result_eval_matcher['sim_item_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_similar_items_recommendation(x, N=N_PREDICT))
result_eval_matcher['als_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_als_recommendations(x, N=N_PREDICT))

Wall time: 42.7 s


In [109]:
result_eval_matcher.head(8)

Unnamed: 0,user_id,actual,own_rec,sim_item_rec,als_rec
0,1,"[853529, 865456, 867607, 872137, 874905, 87524...","[856942, 9297615, 5577022, 877391, 9655212, 10...","[6514302, 5582712, 9297615, 5577022, 888210, 9...","[5574377, 883616, 888543, 1037332, 1077133, 10..."
1,6,"[1024306, 1102949, 6548453, 835394, 940804, 96...","[13003092, 995598, 923600, 972416, 1084036, 11...","[948650, 5569845, 8357613, 941361, 1074754, 11...","[1051516, 871611, 878996, 1084036, 896613, 102..."
2,7,"[836281, 843306, 845294, 914190, 920456, 93886...","[998519, 894360, 7147142, 9338009, 896666, 939...","[5585510, 7152455, 1044078, 12384779, 948468, ...","[1039627, 1100140, 10285022, 9803591, 912817, ..."
3,8,"[868075, 886787, 945611, 1005186, 1008787, 101...","[12808385, 939860, 981660, 7410201, 5577022, 6...","[5569845, 5592888, 1044078, 908318, 12731436, ...","[916122, 1029743, 985999, 839243, 869388, 1281..."
4,9,"[883616, 1029743, 1039126, 1051323, 1082772, 1...","[872146, 918046, 9655676, 985622, 1056005, 109...","[1008032, 1074754, 901062, 904493, 996269, 713...","[5585510, 1074333, 970866, 1091865, 6039859, 6..."
5,13,"[6544236, 822407, 908317, 1056775, 1066289, 11...","[965772, 9488065, 10342382, 6554400, 862070, 1...","[1074754, 1120559, 1008547, 6553237, 7147915, ...","[9707240, 7409644, 1139782, 12172071, 1029743,..."
6,14,"[917277, 981760, 878234, 925514, 986394, 10220...","[902377, 822161, 874563, 1123106, 8090610, 138...","[1074754, 910673, 985999, 1025611, 990335, 135...","[1127758, 910673, 1025611, 836445, 846823, 113..."
7,15,"[996016, 1014509, 1044404, 1087353, 976199, 10...","[823576, 1052975, 1053530, 1071196, 1010051, 1...","[901062, 1074754, 1135476, 1091926, 999999, 10...","[863632, 1042616, 1001827, 823576, 1034956, 10..."


In [110]:
%%time
# result_eval_matcher['sim_user_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_similar_users_recommendation(x, N=50))

Wall time: 0 ns


### Wrapping example

In [111]:
# a rough, simple example of wrapping this in a function
def evalRecall(df_result, target_col_name, recommend_model):
    result_col_name = 'result'
    # request as many candidates as recall@k will inspect
    df_result[result_col_name] = df_result[target_col_name].apply(lambda x: recommend_model(x, N=N_PREDICT))
    score = df_result.apply(lambda row: recall_at_k(row[result_col_name], row[ACTUAL_COL], k=N_PREDICT), axis=1).mean()
    return score

In [112]:
# evalRecall(result_eval_matcher, USER_COL, recommender.get_own_recommendations)

In [113]:
def calc_recall(df_data, top_k):
    for col_name in df_data.columns[2:]:
        score = df_data.apply(lambda row: recall_at_k(row[col_name], row[ACTUAL_COL], k=top_k), axis=1).mean()
        yield col_name, score

In [114]:
def calc_precision(df_data, top_k):
    for col_name in df_data.columns[2:]:
        score = df_data.apply(lambda row: precision_at_k(row[col_name], row[ACTUAL_COL], k=top_k), axis=1).mean()
        yield col_name, score
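
The `metrics` module is imported but not shown in this notebook. A minimal sketch of what the two metrics might compute, under the usual definitions, is given below; the `_sketch` names are hypothetical and the argument order (recommended first, actual second) is taken from the calls above.

```python
def precision_at_k_sketch(recommended, actual, k=5):
    """Share of the top-k recommended items that were actually bought."""
    top_k = list(recommended)[:k]
    if not top_k:
        return 0.0
    actual_set = set(actual)
    hits = sum(1 for item in top_k if item in actual_set)
    return hits / len(top_k)

def recall_at_k_sketch(recommended, actual, k=50):
    """Share of the actually bought items that appear in the top-k recommendations."""
    actual_set = set(actual)
    if not actual_set:
        return 0.0
    top_k = set(list(recommended)[:k])
    return len(top_k & actual_set) / len(actual_set)

recommended = [1, 2, 3, 4, 5, 6]
actual = [2, 5, 9, 10]
p5 = precision_at_k_sketch(recommended, actual, k=5)  # items 2 and 5 hit: 2 of 5
r5 = recall_at_k_sketch(recommended, actual, k=5)     # 2 of the 4 bought items found
print(p5, r5)
```

Under these definitions, recall@k cannot decrease as k grows, while precision@k typically falls.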

### Recall@50 of matching

In [115]:
TOPK_RECALL = 50

In [116]:
sorted(calc_recall(result_eval_matcher, TOPK_RECALL), key=lambda x: x[1],reverse=True)

[('own_rec', 0.061684201353290766),
 ('als_rec', 0.04834377722420033),
 ('sim_item_rec', 0.0314682489314974)]

### Precision@5 of matching

In [117]:
TOPK_PRECISION = 5

In [118]:
sorted(calc_precision(result_eval_matcher, TOPK_PRECISION), key=lambda x: x[1],reverse=True)

[('own_rec', 0.18872062663185182),
 ('als_rec', 0.12637075718015564),
 ('sim_item_rec', 0.06725848563968714)]

## Solution to task 1

In [119]:
TOPK_RECALL = 50

In [120]:
sorted(calc_recall(result_eval_matcher, TOPK_RECALL), key=lambda x: x[1],reverse=True)

[('own_rec', 0.061684201353290766),
 ('als_rec', 0.04834377722420033),
 ('sim_item_rec', 0.0314682489314974)]

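**Conclusion:** Own recommendations (items the user has already bought, padded with top-popular) give the highest recall@50 among the tested options: 0.0617, versus 0.0483 for ALS and 0.0315 for similar-item recommendations.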

In [126]:
def recall_search(df_data, target_col_name, recommend_model, top_k):
    
    summary = []
    for k in top_k:
        result_col_name = 'result'
        
        df_data[result_col_name] = df_data[target_col_name].apply(lambda x: recommend_model(x, N=k))
        
        score = df_data.apply(lambda row: recall_at_k(row[result_col_name], row[ACTUAL_COL], k=k), axis=1).mean()                
        
        print(f'For k = {k}:\n{score}')
        print()
        
        summary.append([k, score])
        
    return summary

In [128]:
top_k = [20, 50, 100, 200, 300, 400, 500]

In [130]:
%%time
summary_own = recall_search(result_eval_matcher, USER_COL, recommender.get_own_recommendations, top_k)

For k = 20:
0.0364815577009746

For k = 50:
0.061684201353290766

For k = 100:
0.09211914788591925

For k = 200:
0.1325640196447428

For k = 300:
0.15565414254099597

For k = 400:
0.17037858778850923

For k = 500:
0.18061825158867673

Wall time: 1min 25s


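**Conclusion for Task 1:** On this sample, recall@k grows monotonically as k goes from 20 to 500 (from 0.036 to 0.181), but with diminishing returns: going from 100 to 200 candidates adds about 0.040 recall, while going from 400 to 500 adds only about 0.010.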
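
One way to reason about a sensible k is to look at the recall gained per 100 extra candidates. The sketch below only re-uses the (k, recall@k) values printed above; the variable names are illustrative.

```python
# (k, recall@k) pairs copied from the recall_search output above
summary = [
    (20, 0.0364815577009746),
    (50, 0.061684201353290766),
    (100, 0.09211914788591925),
    (200, 0.1325640196447428),
    (300, 0.15565414254099597),
    (400, 0.17037858778850923),
    (500, 0.18061825158867673),
]

gains = []
for (k_prev, r_prev), (k_next, r_next) in zip(summary, summary[1:]):
    # recall added per 100 additional candidates on this interval
    gain_per_100 = (r_next - r_prev) / (k_next - k_prev) * 100
    gains.append(gain_per_100)
    print(f'{k_prev:>3} -> {k_next:>3}: +{gain_per_100:.4f} recall per 100 candidates')
```

Each additional candidate also adds a row per user to the ranker's training set, so the shrinking gain has to be weighed against the cost of the second stage.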

# Ranking part

### Training the 2nd-level model on the selected candidates

- Train on data_train_ranker
- Train *only* on the selected candidates
- *As an example*, I generate the top-50 candidates via get_own_recommendations
- (!) If a user has bought fewer than 50 items, get_own_recommendations pads the recommendations with top-popular items

In [132]:
# -- older purchases -- | -- 6 weeks -- | -- 3 weeks --

## Preparing the training data

In [133]:
# users from the ranking train set
df_match_candidates = pd.DataFrame(data_train_ranker[USER_COL].unique())
df_match_candidates.columns = [USER_COL]

In [134]:
# collect candidates from the first stage (matcher)
df_match_candidates['candidates'] = df_match_candidates[USER_COL].apply(lambda x: recommender.get_own_recommendations(x, N=N_PREDICT))

In [135]:
df_match_candidates.head(2)

Unnamed: 0,user_id,candidates
0,2070,"[1105426, 1097350, 879194, 948640, 928263, 944..."
1,2021,"[950935, 1119454, 835578, 863762, 1097398, 101..."


In [136]:
df_items = df_match_candidates.apply(lambda x: pd.Series(x['candidates']), axis=1).stack().reset_index(level=1, drop=True)
df_items.name = 'item_id'

In [137]:
df_match_candidates = df_match_candidates.drop('candidates', axis=1).join(df_items)

In [138]:
df_match_candidates.head(8)

Unnamed: 0,user_id,item_id
0,2070,1105426
0,2070,1097350
0,2070,879194
0,2070,948640
0,2070,928263
0,2070,944588
0,2070,1032703
0,2070,1138596
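
The same long format (one row per user-item pair) can also be produced with `DataFrame.explode`, available in pandas 0.25 and later. A small sketch on a toy frame, with item ids borrowed from the output above:

```python
import pandas as pd

candidates = pd.DataFrame({
    'user_id': [2070, 2021],
    'candidates': [[1105426, 1097350, 879194], [950935, 1119454]],
})

long_df = (candidates
           .explode('candidates')                      # one row per list element
           .rename(columns={'candidates': 'item_id'})
           .reset_index(drop=True))
long_df['item_id'] = long_df['item_id'].astype('int64')  # explode leaves an object column
print(long_df)
```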


### Check warm start

In [139]:
print_stats_data(df_match_candidates, 'match_candidates')

match_candidates
Shape: (95750, 2) Users: 1915 Items: 4437


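### Building the ranking train set from the stage-1 candidates

In [140]:
df_ranker_train = data_train_ranker[[USER_COL, ITEM_COL]].copy()
df_ranker_train['target'] = 1  # purchases only

# NOTE: data_train_ranker has one row per transaction, so an item bought several times
# duplicates its candidate row in this merge (99,399 rows below vs 95,750 candidates)
df_ranker_train = df_match_candidates.merge(df_ranker_train, on=[USER_COL, ITEM_COL], how='left')

df_ranker_train['target'] = df_ranker_train['target'].fillna(0)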

In [141]:
df_ranker_train.target.value_counts()

0.0    88346
1.0    11053
Name: target, dtype: int64
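
The row count above (88,346 + 11,053 = 99,399) exceeds the 95,750 candidate pairs because repeated purchases multiply rows in the left merge. A minimal sketch of one way to keep exactly one row per candidate, on toy frames with hypothetical ids:

```python
import pandas as pd

candidates = pd.DataFrame({'user_id': [1, 1, 1], 'item_id': [10, 20, 30]})
# item 20 was bought twice in the ranker-train window; item 99 is not a candidate
purchases = pd.DataFrame({'user_id': [1, 1, 1], 'item_id': [20, 20, 99]})

# naive merge: the repeated purchase duplicates the (1, 20) candidate row
naive = candidates.merge(purchases.assign(target=1), on=['user_id', 'item_id'], how='left')

# de-duplicated positives: one row per purchased (user, item) pair
positives = purchases[['user_id', 'item_id']].drop_duplicates().assign(target=1)
train = candidates.merge(positives, on=['user_id', 'item_id'], how='left')
train['target'] = train['target'].fillna(0).astype(int)

print(len(naive), len(train))
print(train)
```

Whether duplicates should be dropped is a modeling choice: keeping them implicitly up-weights frequently bought items.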

In [142]:
df_ranker_train.head(9)

Unnamed: 0,user_id,item_id,target
0,2070,1105426,0.0
1,2070,1097350,0.0
2,2070,879194,0.0
3,2070,948640,0.0
4,2070,928263,0.0
5,2070,944588,0.0
6,2070,1032703,0.0
7,2070,1138596,0.0
8,2070,1092937,1.0


(!) Each user has 50 candidate item_ids (rows for repeatedly purchased items are duplicated by the merge above)

In [143]:
df_ranker_train['target'].mean()

0.11119830179378062

- For simplicity, we use LightGBM with loss = binary for now. This is classic binary classification
- This is an example *without* feature generation

## Preparing features for model training

In [144]:
item_features.head(2)

Unnamed: 0,item_id,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product
0,25671,2,GROCERY,National,FRZN ICE,ICE - CRUSHED/CUBED,22 LB
1,26081,2,MISC. TRANS.,National,NO COMMODITY DESCRIPTION,NO SUBCOMMODITY DESCRIPTION,


In [145]:
user_features.head(2)

Unnamed: 0,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,user_id
0,65+,A,35-49K,Homeowner,2 Adults No Kids,2,None/Unknown,1
1,45-54,A,50-74K,Homeowner,2 Adults No Kids,2,None/Unknown,7


In [146]:
df_ranker_train = df_ranker_train.merge(item_features, on=ITEM_COL, how='left')
df_ranker_train = df_ranker_train.merge(user_features, on=USER_COL, how='left')

df_ranker_train.head(9)

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc
0,2070,1105426,0.0,69,DELI,Private,SANDWICHES,SANDWICHES - (COLD),,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
1,2070,1097350,0.0,2468,GROCERY,National,DOMESTIC WINE,VALUE GLASS WINE,4 LTR,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
2,2070,879194,0.0,69,DRUG GM,Private,DIAPERS & DISPOSABLES,BABY DIAPERS,14 CT,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
3,2070,948640,0.0,1213,DRUG GM,National,ORAL HYGIENE PRODUCTS,WHITENING SYSTEMS,3 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
4,2070,928263,0.0,69,DRUG GM,Private,DIAPERS & DISPOSABLES,BABY DIAPERS,13 CT,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
5,2070,944588,0.0,1094,MEAT-PCKGD,National,LUNCHMEAT,HAM,12 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
6,2070,1032703,0.0,1087,SEAFOOD-PCKGD,National,SEAFOOD - FROZEN,FRZN BRD STICK/PORTON,10.5 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
7,2070,1138596,0.0,111,DRUG GM,National,CIGARETTES,CIGARETTES,523670 CTN,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
8,2070,1092937,1.0,1089,MEAT-PCKGD,National,LUNCHMEAT,BOLOGNA,16OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown


**user_id features:**
    - Average basket value
    - Average amount spent per item in each category
    - Number of purchases in each category
    - Purchase frequency, times per month
    - Share of purchases made on weekends
    - Share of purchases made in the morning/afternoon/evening

**item_id features**:
    - Number of purchases per week
    - Average number of purchases per week of one item in the category
    - (Number of purchases per week) / (Average number of purchases per week of one item in the category)
    - Price (can be computed from retail_train.csv)
    - Price / Average price of an item in the category
    
**user_id - item_id pair features**
    - (Average amount spent per item in each category (take the category of item_id)) - (Price of item_id)
    - (Number of purchases by the user in a given category per week) - (Average number of purchases by all users in that category per week)
    - (Number of purchases by the user in a given category per week) / (Average number of purchases by all users in that category per week)

### Solution to task 2: (a) feature generation

In [147]:
df_ranker_train.shape

(99399, 16)

#### 1. 'mean_cheque' = Average basket value

In [148]:
# uses the full dataset loaded above: data = pd.read_csv('retail_train.csv')

users_sales = data.groupby(USER_COL)['sales_value'].sum().reset_index()
num_baskets = data.groupby(USER_COL)['basket_id'].nunique().reset_index()
users_sales = users_sales.merge(num_baskets, on=USER_COL, how='left')
users_sales['mean_cheque'] = users_sales['sales_value'] / users_sales['basket_id']
users_sales.drop(['sales_value', 'basket_id'], axis=1, inplace=True)
users_sales.head(2)

Unnamed: 0,user_id,mean_cheque
0,1,50.125443
1,2,41.442045


In [149]:
df_ranker_train = df_ranker_train.merge(users_sales, on=USER_COL, how='left')
df_ranker_train.head(2)

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,mean_cheque
0,2070,1105426,0.0,69,DELI,Private,SANDWICHES,SANDWICHES - (COLD),,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,12.92937
1,2070,1097350,0.0,2468,GROCERY,National,DOMESTIC WINE,VALUE GLASS WINE,4 LTR,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,12.92937


#### 2. 'mean_department_price' = The user's average price per unit within the category (department)

In [150]:
departments = list(set(df_ranker_train['department'].tolist()))
departments

['MEAT',
 'COSMETICS',
 'NUTRITION',
 'MISC. TRANS.',
 'DRUG GM',
 'SPIRITS',
 'MEAT-PCKGD',
 'DELI',
 'SEAFOOD',
 'GROCERY',
 'SEAFOOD-PCKGD',
 'PRODUCE',
 'PASTRY',
 'FLORAL']

In [151]:
%%time

df_ranker_train['mean_department_price'] = 0

for n in departments:
    dep_df_ranker_train = df_ranker_train[df_ranker_train['department'] == n]
    ids = dep_df_ranker_train[ITEM_COL].tolist()
    dep_data = data[data[ITEM_COL].isin(ids)]
    
    dep_sales = dep_data.groupby(USER_COL).agg({
    'sales_value' : 'sum', 
    'quantity': 'sum'}).reset_index()
    
    dep_sales['dep_mean_price'] = dep_sales['sales_value'] / dep_sales['quantity']
    dep_sales.drop(['sales_value', 'quantity'], axis=1, inplace=True)
    
    for i in range(dep_sales.shape[0]):
        df_ranker_train.loc[(((df_ranker_train[USER_COL] == dep_sales[USER_COL][i]) & (df_ranker_train['department'] == n)) == True), 'mean_department_price'] = dep_sales['dep_mean_price'][i]

df_ranker_train.head(2)

Wall time: 4min 1s


Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,mean_cheque,mean_department_price
0,2070,1105426,0.0,69,DELI,Private,SANDWICHES,SANDWICHES - (COLD),,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,12.92937,3.185789
1,2070,1097350,0.0,2468,GROCERY,National,DOMESTIC WINE,VALUE GLASS WINE,4 LTR,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,12.92937,2.898967
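
The row-by-row `.loc` assignment above took about 4 minutes. A vectorized `groupby` + `merge` sketch of a similar feature is shown below on toy frames. It is an assumption-laden variant rather than a drop-in replacement: it assigns departments from the item table for all transactions, whereas the loop above restricts each department to the items present among the candidates.

```python
import pandas as pd

# toy stand-ins (hypothetical values) for `data` and `item_features`
transactions = pd.DataFrame({
    'user_id':     [1, 1, 1, 2],
    'item_id':     [10, 11, 20, 10],
    'sales_value': [4.0, 2.0, 9.0, 3.0],
    'quantity':    [2, 1, 3, 1],
})
items = pd.DataFrame({'item_id': [10, 11, 20], 'department': ['DELI', 'DELI', 'MEAT']})

# per (user, department): total spend divided by total quantity
tx = transactions.merge(items, on='item_id', how='left')
dep_price = (tx.groupby(['user_id', 'department'], as_index=False)
               .agg(sales_value=('sales_value', 'sum'), quantity=('quantity', 'sum')))
dep_price['mean_department_price'] = dep_price['sales_value'] / dep_price['quantity']
dep_price = dep_price[['user_id', 'department', 'mean_department_price']]

# attach the feature to candidate pairs; pairs with no history get 0, as in the loop
candidates = pd.DataFrame({'user_id': [1, 1, 2], 'item_id': [10, 20, 20]})
candidates = candidates.merge(items, on='item_id', how='left')
candidates = candidates.merge(dep_price, on=['user_id', 'department'], how='left')
candidates['mean_department_price'] = candidates['mean_department_price'].fillna(0)
print(candidates)
```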


#### 3. 'item_id_week_sales' = Number of purchases per week

In [152]:
# 'week_no' - number of distinct weeks in which the item was sold
week_sales = data.groupby(ITEM_COL).agg({ 
    'quantity': 'sum',
    'week_no' : 'nunique'
}).reset_index()
week_sales['item_id_week_sales'] = week_sales['quantity'] / week_sales['week_no']
week_sales.drop(['quantity', 'week_no'], axis=1, inplace=True)
week_sales.head()

Unnamed: 0,item_id,item_id_week_sales
0,25671,2.0
1,26081,1.0
2,26093,1.0
3,26190,1.0
4,26355,2.0


In [153]:
df_ranker_train = df_ranker_train.merge(week_sales, on=ITEM_COL, how='left')
df_ranker_train.head(2)

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,mean_cheque,mean_department_price,item_id_week_sales
0,2070,1105426,0.0,69,DELI,Private,SANDWICHES,SANDWICHES - (COLD),,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,12.92937,3.185789,2.0
1,2070,1097350,0.0,2468,GROCERY,National,DOMESTIC WINE,VALUE GLASS WINE,4 LTR,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,12.92937,2.898967,1.35


#### 4. 'mean_price' = Price

In [154]:
mean_price = data.groupby(ITEM_COL).agg({
    'sales_value' : 'sum', 
    'quantity': 'sum'
}).reset_index()

mean_price['mean_price'] = mean_price['sales_value'] / mean_price['quantity']

mean_price.drop(['sales_value', 'quantity'], axis=1, inplace=True)

mean_price.head(2)

Unnamed: 0,item_id,mean_price
0,25671,3.49
1,26081,0.99


In [155]:
df_ranker_train = df_ranker_train.merge(mean_price, on=ITEM_COL, how='left')
df_ranker_train.head(2)

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,mean_cheque,mean_department_price,item_id_week_sales,mean_price
0,2070,1105426,0.0,69,DELI,Private,SANDWICHES,SANDWICHES - (COLD),,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,12.92937,3.185789,2.0,3.905593
1,2070,1097350,0.0,2468,GROCERY,National,DOMESTIC WINE,VALUE GLASS WINE,4 LTR,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,12.92937,2.898967,1.35,11.471481


#### 5. 'delta_dep_user_price' = Average amount the user spends per item in each category - Average price in that category

In [156]:
%%time

df_ranker_train['delta_dep_user_price'] = 0

for n in departments:
    dep_df_ranker_train = df_ranker_train[df_ranker_train['department'] == n]
    ids = dep_df_ranker_train[ITEM_COL].tolist()
    dep_data = data[data[ITEM_COL].isin(ids)]
    
    dep_mean_price = dep_data['sales_value'].sum() / dep_data['quantity'].sum()
    
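    # NOTE: this groups the full `data`, not `dep_data`, so the user's mean price
    # is taken over all departments rather than within department n
    dep_user_sales = data.groupby(USER_COL).agg({
    'sales_value' : 'sum', 
    'quantity': 'sum'}).reset_index()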
    
    dep_user_sales['mean_dep_user_price'] = dep_user_sales['sales_value'] / dep_user_sales['quantity']
    dep_user_sales.drop(['sales_value', 'quantity'], axis=1, inplace=True)
    
    for i in range(dep_user_sales.shape[0]):
        df_ranker_train.loc[(((df_ranker_train[USER_COL] == dep_user_sales[USER_COL][i]) &
                            (df_ranker_train['department'] == n)) == True), 'delta_dep_user_price'] = dep_user_sales['mean_dep_user_price'][i] - dep_mean_price

df_ranker_train.head(2)

Wall time: 7min 28s


Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,...,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,mean_cheque,mean_department_price,item_id_week_sales,mean_price,delta_dep_user_price
0,2070,1105426,0.0,69,DELI,Private,SANDWICHES,SANDWICHES - (COLD),,45-54,...,50-74K,Unknown,Unknown,1,None/Unknown,12.92937,3.185789,2.0,3.905593,-3.858987
1,2070,1097350,0.0,2468,GROCERY,National,DOMESTIC WINE,VALUE GLASS WINE,4 LTR,45-54,...,50-74K,Unknown,Unknown,1,None/Unknown,12.92937,2.898967,1.35,11.471481,-2.976403


#### 6. 'rel_week_sales' = (Number of purchases by the user in a given category per week) / (Average number of purchases by all users in that category per week)

In [157]:
%%time

df_ranker_train['rel_week_sales'] = 0

for n in departments:
    dep_df_ranker_train = df_ranker_train[df_ranker_train['department'] == n]
    ids = dep_df_ranker_train[ITEM_COL].tolist()
    dep_data = data[data[ITEM_COL].isin(ids)]
    
    dep_mean_week_sales = dep_data['quantity'].sum() / dep_data['week_no'].nunique()
    
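    # NOTE: this groups the full `data`, not `dep_data`, so the user's weekly quantity
    # is taken over all departments rather than within department n
    dep_user_week_sales = data.groupby(USER_COL).agg({ 
    'quantity': 'sum',
    'week_no' : 'nunique'
    }).reset_index()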
    
    dep_user_week_sales['mean_user_week_sales'] = dep_user_week_sales['quantity'] / dep_user_week_sales['week_no']
    dep_user_week_sales.drop(['quantity', 'week_no'], axis=1, inplace=True)
    
    for i in range(dep_user_week_sales.shape[0]):
        df_ranker_train.loc[(((df_ranker_train[USER_COL] == dep_user_week_sales[USER_COL][i]) &
                            (df_ranker_train['department'] == n)) == True), 'rel_week_sales'] = dep_user_week_sales['mean_user_week_sales'][i] / dep_mean_week_sales

df_ranker_train.head(2)

Wall time: 7min 34s


Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,...,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,mean_cheque,mean_department_price,item_id_week_sales,mean_price,delta_dep_user_price,rel_week_sales
0,2070,1105426,0.0,69,DELI,Private,SANDWICHES,SANDWICHES - (COLD),,45-54,...,Unknown,Unknown,1,None/Unknown,12.92937,3.185789,2.0,3.905593,-3.858987,3.49285
1,2070,1097350,0.0,2468,GROCERY,National,DOMESTIC WINE,VALUE GLASS WINE,4 LTR,45-54,...,Unknown,Unknown,1,None/Unknown,12.92937,2.898967,1.35,11.471481,-2.976403,0.276601


## Training the ranking model (the 2nd-level model)

In [158]:
X_train = df_ranker_train.drop('target', axis=1)
y_train = df_ranker_train['target']  # 1-D target, as scikit-learn-style estimators expect

In [159]:
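# NOTE: every column after the two ids is cast to category, including the numeric
# features (mean_cheque, mean_price, ...), so the model treats them as categorical
cat_feats = X_train.columns[2:].tolist()
X_train[cat_feats] = X_train[cat_feats].astype('category')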

cat_feats

['manufacturer',
 'department',
 'brand',
 'commodity_desc',
 'sub_commodity_desc',
 'curr_size_of_product',
 'age_desc',
 'marital_status_code',
 'income_desc',
 'homeowner_desc',
 'hh_comp_desc',
 'household_size_desc',
 'kid_category_desc',
 'mean_cheque',
 'mean_department_price',
 'item_id_week_sales',
 'mean_price',
 'delta_dep_user_price',
 'rel_week_sales']
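
The list above includes the six engineered numeric features. A sketch of one way to cast only the non-numeric columns to `category` and leave the float features numeric is shown below on a toy frame; the frame and the `id_cols` name are illustrative. Integer-coded categoricals such as `manufacturer` would have to be listed explicitly, since their dtype is numeric.

```python
import pandas as pd

X = pd.DataFrame({
    'user_id': [1, 2],
    'item_id': [10, 20],
    'department': ['DELI', 'MEAT'],
    'age_desc': ['45-54', '65+'],
    'mean_cheque': [12.9, 50.1],
    'mean_price': [3.9, 11.5],
})

id_cols = ['user_id', 'item_id']
# only non-numeric columns become categorical; float features stay numeric
cat_cols = [c for c in X.columns
            if c not in id_cols and not pd.api.types.is_numeric_dtype(X[c])]
for c in cat_cols:
    X[c] = X[c].astype('category')

print(cat_cols)
print(X.dtypes)
```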

In [160]:
lgb = LGBMClassifier(objective='binary',
                     max_depth=8,
                     n_estimators=300,
                     learning_rate=0.05,
                     categorical_column=cat_feats)

lgb.fit(X_train, y_train)

train_preds = lgb.predict_proba(X_train)


In [161]:
df_ranker_predict = df_ranker_train.copy()

In [162]:
df_ranker_predict['proba_item_purchase'] = train_preds[:,1]

In [163]:
df_ranker_predict.head(9)

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,...,hh_comp_desc,household_size_desc,kid_category_desc,mean_cheque,mean_department_price,item_id_week_sales,mean_price,delta_dep_user_price,rel_week_sales,proba_item_purchase
0,2070,1105426,0.0,69,DELI,Private,SANDWICHES,SANDWICHES - (COLD),,45-54,...,Unknown,1,None/Unknown,12.92937,3.185789,2.0,3.905593,-3.858987,3.49285,0.073188
1,2070,1097350,0.0,2468,GROCERY,National,DOMESTIC WINE,VALUE GLASS WINE,4 LTR,45-54,...,Unknown,1,None/Unknown,12.92937,2.898967,1.35,11.471481,-2.976403,0.276601,0.009542
2,2070,879194,0.0,69,DRUG GM,Private,DIAPERS & DISPOSABLES,BABY DIAPERS,14 CT,45-54,...,Unknown,1,None/Unknown,12.92937,4.490714,1.588235,7.237222,-5.050091,3.596502,0.011691
3,2070,948640,0.0,1213,DRUG GM,National,ORAL HYGIENE PRODUCTS,WHITENING SYSTEMS,3 OZ,45-54,...,Unknown,1,None/Unknown,12.92937,4.490714,1.289474,6.596122,-5.050091,3.596502,0.001925
4,2070,928263,0.0,69,DRUG GM,Private,DIAPERS & DISPOSABLES,BABY DIAPERS,13 CT,45-54,...,Unknown,1,None/Unknown,12.92937,4.490714,1.90625,7.632459,-5.050091,3.596502,0.37723
5,2070,944588,0.0,1094,MEAT-PCKGD,National,LUNCHMEAT,HAM,12 OZ,45-54,...,Unknown,1,None/Unknown,12.92937,3.889307,1.68,3.531429,-3.286711,1.763872,0.054801
6,2070,1032703,0.0,1087,SEAFOOD-PCKGD,National,SEAFOOD - FROZEN,FRZN BRD STICK/PORTON,10.5 OZ,45-54,...,Unknown,1,None/Unknown,12.92937,4.485,2.342105,3.19236,-5.241217,25.789617,0.021822
7,2070,1138596,0.0,111,DRUG GM,National,CIGARETTES,CIGARETTES,523670 CTN,45-54,...,Unknown,1,None/Unknown,12.92937,4.490714,1.25,34.964545,-5.050091,3.596502,0.002775
8,2070,1092937,1.0,1089,MEAT-PCKGD,National,LUNCHMEAT,BOLOGNA,16OZ,45-54,...,Unknown,1,None/Unknown,12.92937,3.889307,6.848837,2.420594,-3.286711,1.763872,0.304114


## Summary

We trained the ranking model on purchases from data_train_ranker and on candidates from own_recommendations; together these form the training set. The task now is to predict and evaluate on the test set.

# Evaluation on test dataset

In [164]:
result_eval_ranker = data_val_ranker.groupby(USER_COL)[ITEM_COL].unique().reset_index()
result_eval_ranker.columns=[USER_COL, ACTUAL_COL]
result_eval_ranker.head(2)

Unnamed: 0,user_id,actual
0,1,"[821867, 834484, 856942, 865456, 889248, 90795..."
1,6,"[920308, 926804, 946489, 1006718, 1017061, 107..."


## Eval matching on test dataset

In [165]:
%%time
result_eval_ranker['own_rec'] = result_eval_ranker[USER_COL].apply(lambda x: recommender.get_own_recommendations(x, N=N_PREDICT))

Wall time: 11.4 s


In [166]:
# measure precision of the matching model alone, to see how ranking affects the metrics

sorted(calc_precision(result_eval_ranker, TOPK_PRECISION), key=lambda x: x[1], reverse=True)

[('own_rec', 0.1462140992167092)]

## Eval re-ranked matched result on test dataset
Recall the df_match_candidates set produced by own_recommendations: the set of users is fixed and identical, so the candidate predictions are identical too, and we can reuse this dataframe for re-ranking.


In [167]:
def rerank(user_id):
    return df_ranker_predict[df_ranker_predict[USER_COL]==user_id].sort_values('proba_item_purchase', ascending=False).head(5).item_id.tolist()
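
Because the prediction frame can hold several rows for the same user-item pair, a top-5 taken with `head(5)` may contain the same item more than once. A sketch of a de-duplicating variant that also ranks all users in one pass is shown below; the function name `rerank_all` and the toy frame are assumptions.

```python
import pandas as pd

preds = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 2, 2],
    'item_id': [10, 10, 20, 30, 10, 40],
    'proba_item_purchase': [0.9, 0.9, 0.2, 0.5, 0.1, 0.7],
})

def rerank_all(df, k=5):
    """Return {user_id: top-k distinct item_ids ordered by predicted probability}."""
    top = (df.drop_duplicates(['user_id', 'item_id'])          # one row per pair
             .sort_values('proba_item_purchase', ascending=False)
             .groupby('user_id')
             .head(k))                                           # top-k rows per user
    return top.groupby('user_id')['item_id'].apply(list).to_dict()

top2 = rerank_all(preds, k=2)
print(top2)
```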

In [168]:
result_eval_ranker['reranked_own_rec'] = result_eval_ranker[USER_COL].apply(lambda user_id: rerank(user_id))

In [169]:
print(*sorted(calc_precision(result_eval_ranker, TOPK_PRECISION), key=lambda x: x[1], reverse=True), sep='\n')

('own_rec', 0.1462140992167092)
('reranked_own_rec', 0.13973890339425452)


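For each user, we take the top-k predictions ranked by probability. On data_val_ranker, precision@5 did not improve with the two-level model: 0.1397 after re-ranking versus 0.1462 for own recommendations alone.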