# Two-layer hybrid recommender system in retail

**Data:** from [Retail X5 Hero Competition](https://retailhero.ai/c/recommender_system/overview)

**Stack:**

- 1st layer: NLP, Implicit, ItemItemRecommender, ALS, sklearn, pandas, numpy, matplotlib
- 2nd layer: CatBoost, LightGBM



**Stages:**

1. Prepare data: prefiltering
2. Train the 1st-layer model (a baseline built on the `MainRecommender` class)
3. Select the top popular items
4. Choose k for the top-k candidates by Recall@k
5. Engineer features for ranking
6. Train the 2nd-layer model
7. Validate on `retail_test.csv`

## Prepare data

Import necessary libs:

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Second layer models
from lightgbm import LGBMClassifier
from catboost import CatBoost

# Functions for prefilter, evaluate and baseline
from src.utils import prefilter_items
from src.metrics import precision_at_k, recall_at_k
from src.recommenders import MainRecommender

# Ignore some warnings
import warnings
warnings.filterwarnings('ignore')
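
The helpers `precision_at_k` and `recall_at_k` come from `src.metrics`, whose source is not shown here. As a rough illustration of what such metrics usually compute, here is a minimal sketch; the function names `precision_at_k_sketch` and `recall_at_k_sketch` are hypothetical and the project's own implementations may differ in details such as argument order or handling of duplicates.

```python
import numpy as np

def precision_at_k_sketch(recommended, bought, k=5):
    """Share of the top-k recommendations that the user actually bought."""
    top_k = np.asarray(recommended)[:k]
    hits = np.isin(top_k, np.asarray(bought))
    return hits.sum() / k

def recall_at_k_sketch(recommended, bought, k=5):
    """Share of the user's distinct purchases found in the top-k recommendations."""
    bought = np.unique(np.asarray(bought))
    if len(bought) == 0:
        return 0.0  # assumption: a user with no purchases contributes zero recall
    top_k = np.asarray(recommended)[:k]
    hits = np.isin(bought, top_k)
    return hits.sum() / len(bought)

# Synthetic example: 5 recommendations, 4 actual purchases, 2 of them matched
recommended = [10, 20, 30, 40, 50]
bought = [20, 50, 70, 80]

print(precision_at_k_sketch(recommended, bought, k=5))  # 2 hits / 5 = 0.4
print(recall_at_k_sketch(recommended, bought, k=5))     # 2 hits / 4 = 0.5
```

Recall@k is the natural metric for the 1st layer, because its job is to put as many future purchases as possible into the candidate list; precision matters more for the final ranked output.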

In [2]:
data = pd.read_csv('data/retail_train.csv')
test = pd.read_csv('data/retail_test.csv')
item_features = pd.read_csv('data/product.csv')
user_features = pd.read_csv('data/hh_demographic.csv')

Let's look at the data.

In [5]:
data.head(2)

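|   | user_id | basket_id | day | item_id | quantity | sales_value | store_id | retail_disc | trans_time | week_no | coupon_disc | coupon_match_disc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2375 | 26984851472 | 1 | 1004906 | 1 | 1.39 | 364 | -0.6 | 1631 | 1 | 0.0 | 0.0 |
| 1 | 2375 | 26984851472 | 1 | 1033142 | 1 | 0.82 | 364 | 0.0 | 1631 | 1 | 0.0 | 0.0 |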


In [6]:
test.head(2)

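|   | user_id | basket_id | day | item_id | quantity | sales_value | store_id | retail_disc | trans_time | week_no | coupon_disc | coupon_match_disc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1340 | 41652823310 | 664 | 912987 | 1 | 8.49 | 446 | 0.0 | 52 | 96 | 0.0 | 0.0 |
| 1 | 588 | 41652838477 | 664 | 1024426 | 1 | 6.29 | 388 | 0.0 | 8 | 96 | 0.0 | 0.0 |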


In [7]:
item_features.head(2)

|   | PRODUCT_ID | MANUFACTURER | DEPARTMENT | BRAND | COMMODITY_DESC | SUB_COMMODITY_DESC | CURR_SIZE_OF_PRODUCT |
|---|---|---|---|---|---|---|---|
| 0 | 25671 | 2 | GROCERY | National | FRZN ICE | ICE - CRUSHED/CUBED | 22 LB |
| 1 | 26081 | 2 | MISC. TRANS. | National | NO COMMODITY DESCRIPTION | NO SUBCOMMODITY DESCRIPTION |  |


In [8]:
user_features.head(2)

|   | AGE_DESC | MARITAL_STATUS_CODE | INCOME_DESC | HOMEOWNER_DESC | HH_COMP_DESC | HOUSEHOLD_SIZE_DESC | KID_CATEGORY_DESC | household_key |
|---|---|---|---|---|---|---|---|---|
| 0 | 65+ | A | 35-49K | Homeowner | 2 Adults No Kids | 2 | None/Unknown | 1 |
| 1 | 45-54 | A | 50-74K | Homeowner | 2 Adults No Kids | 2 | None/Unknown | 7 |


In [27]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2396804 entries, 0 to 2396803
Data columns (total 12 columns):
 #   Column             Dtype  
---  ------             -----  
 0   user_id            int64  
 1   basket_id          int64  
 2   day                int64  
 3   item_id            int64  
 4   quantity           int64  
 5   sales_value        float64
 6   store_id           int64  
 7   retail_disc        float64
 8   trans_time         int64  
 9   week_no            int64  
 10  coupon_disc        float64
 11  coupon_match_disc  float64
dtypes: float64(4), int64(8)
memory usage: 219.4 MB


In [26]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88734 entries, 0 to 88733
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   user_id            88734 non-null  int64  
 1   basket_id          88734 non-null  int64  
 2   day                88734 non-null  int64  
 3   item_id            88734 non-null  int64  
 4   quantity           88734 non-null  int64  
 5   sales_value        88734 non-null  float64
 6   store_id           88734 non-null  int64  
 7   retail_disc        88734 non-null  float64
 8   trans_time         88734 non-null  int64  
 9   week_no            88734 non-null  int64  
 10  coupon_disc        88734 non-null  float64
 11  coupon_match_disc  88734 non-null  float64
dtypes: float64(4), int64(8)
memory usage: 8.1 MB


In [28]:
item_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92353 entries, 0 to 92352
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   item_id               92353 non-null  int64 
 1   manufacturer          92353 non-null  int64 
 2   department            92353 non-null  object
 3   brand                 92353 non-null  object
 4   commodity_desc        92353 non-null  object
 5   sub_commodity_desc    92353 non-null  object
 6   curr_size_of_product  92353 non-null  object
dtypes: int64(2), object(5)
memory usage: 4.9+ MB


In [29]:
user_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   age_desc             801 non-null    object
 1   marital_status_code  801 non-null    object
 2   income_desc          801 non-null    object
 3   homeowner_desc       801 non-null    object
 4   hh_comp_desc         801 non-null    object
 5   household_size_desc  801 non-null    object
 6   kid_category_desc    801 non-null    object
 7   user_id              801 non-null    int64 
dtypes: int64(1), object(7)
memory usage: 50.2+ KB


In [20]:
def print_stats_data(df_data, name_df):
    print(f'{name_df}:')
    print(f"Shape: {df_data.shape} Users: {df_data['user_id'].nunique()} Items: {df_data['item_id'].nunique()}\n")

In [22]:
print_stats_data(data, 'data')
print_stats_data(test, 'test')

data:
Shape: (2396804, 12) Users: 2499 Items: 89051

test:
Shape: (88734, 12) Users: 1885 Items: 20497



Normalize the column names of the feature datasets so that they share the `item_id` and `user_id` keys with the transaction data.

In [12]:
# column processing
item_features.columns = [col.lower() for col in item_features.columns]
user_features.columns = [col.lower() for col in user_features.columns]

item_features.rename(columns={'product_id': 'item_id'}, inplace=True)
user_features.rename(columns={'household_key': 'user_id'}, inplace=True)

### Split the dataset into train, validation and test

In [17]:
# timeline for train:  -- old purchases -- | -- 6 weeks -- 

VAL_MATCHER_WEEKS = 6

In [23]:
# data for train 1-st layer model (matching)     -- old purchases --
data_train_matcher = data[data['week_no'] < data['week_no'].max() - VAL_MATCHER_WEEKS]

# data for validate 1-st layer model (matching)  -- 6 weeks --
data_val_matcher = data[data['week_no'] >= data['week_no'].max() - VAL_MATCHER_WEEKS]

# data for train 2-nd layer model (ranking)      -- 6 weeks --
data_train_ranker = data_val_matcher.copy()

# data for validate 2-nd layer model (ranking)   -- test data --
data_val_ranker = test.copy()
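
To make the boundary of the time-based split concrete, the same comparison can be run on a toy log. This is a sketch on synthetic data: `toy`, `VAL_WEEKS` and `cutoff` are illustrative names, with `VAL_WEEKS` playing the role of `VAL_MATCHER_WEEKS`.

```python
import pandas as pd

# Synthetic purchase log: one row per week, weeks 1..20
toy = pd.DataFrame({'week_no': range(1, 21), 'item_id': range(100, 120)})

VAL_WEEKS = 6  # plays the role of VAL_MATCHER_WEEKS
cutoff = toy['week_no'].max() - VAL_WEEKS  # 20 - 6 = 14

toy_train = toy[toy['week_no'] < cutoff]   # old purchases: weeks 1..13
toy_val = toy[toy['week_no'] >= cutoff]    # hold-out: weeks 14..20

# The two parts are disjoint and together cover the whole log
assert len(toy_train) + len(toy_val) == len(toy)
assert toy_train['week_no'].max() < toy_val['week_no'].min()

print(toy_val['week_no'].tolist())  # [14, 15, 16, 17, 18, 19, 20]
```

Because the `>=` comparison is inclusive, the hold-out part spans `VAL_WEEKS + 1` distinct week numbers when every week is present in the data. If exactly six weeks are wanted, the cutoff would need to be shifted by one.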

In [24]:
print_stats_data(data_train_matcher,'train_matcher')
print_stats_data(data_val_matcher,'val_matcher')
print_stats_data(data_train_ranker,'train_ranker')
print_stats_data(data_val_ranker,'val_ranker')

train_matcher:
Shape: (2193515, 12) Users: 2499 Items: 85334

val_matcher:
Shape: (203289, 12) Users: 2197 Items: 30040

train_ranker:
Shape: (203289, 12) Users: 2197 Items: 30040

val_ranker:
Shape: (88734, 12) Users: 1885 Items: 20497



The splits differ in their numbers of users and items. We do not handle the cold-start problem at this point; it is dealt with below. For cold-start users, the top popular items from the baseline can serve as a fallback.
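
A minimal sketch of the cold-start fallback idea, on synthetic data: find the users who appear in validation but not in train, and give them the most popular train items. The frames `train` and `val` and the popularity measure (total `quantity`) are assumptions made for illustration, not the logic of `MainRecommender`.

```python
import pandas as pd

# Synthetic train and validation logs
train = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 3],
    'item_id':  [10, 20, 10, 30, 10],
    'quantity': [1, 2, 1, 1, 3],
})
val = pd.DataFrame({'user_id': [1, 4, 4], 'item_id': [20, 10, 40]})

# Users seen in validation but never in train are "cold"
known_users = set(train['user_id'].unique().tolist())
val_users = set(val['user_id'].unique().tolist())
cold_users = sorted(val_users - known_users)

# Rank items by total quantity sold in train and keep the top 2
top_popular = (
    train.groupby('item_id')['quantity'].sum()
         .sort_values(ascending=False)
         .head(2)
         .index.tolist()
)

# Every cold user receives the same popularity-based list
fallback = {user: top_popular for user in cold_users}
print(fallback)  # {4: [10, 20]}
```

The alternative, used when no fallback is wanted, is simply to drop such users from the validation frame before computing metrics.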

### Prefilter data_train_matcher

To begin with, we keep the 5000 most popular items. Later we will try to find a better value.

In [30]:
n_items_before = data_train_matcher['item_id'].nunique()

data_train_matcher = prefilter_items(data_train_matcher, item_features=item_features, take_n_popular=5000)

n_items_after = data_train_matcher['item_id'].nunique()
print(f'Decreased # items from {n_items_before} to {n_items_after}')

Decreased # items from 85334 to 5001
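
The source of `prefilter_items` is not shown here, so the following is only a sketch of one common way to implement a top-N popularity filter. It assumes that filtered-out items are not dropped but relabeled with a single placeholder id, which would be one plausible explanation for seeing 5001 distinct items rather than 5000. Both `prefilter_sketch` and `PLACEHOLDER_ID` are hypothetical names, not taken from `src.utils`.

```python
import pandas as pd

PLACEHOLDER_ID = 999999  # hypothetical id shared by all filtered-out items

def prefilter_sketch(df, take_n_popular=2):
    """Keep the n best-selling items; relabel all others with one placeholder id."""
    popularity = df.groupby('item_id')['quantity'].sum()
    top_items = popularity.sort_values(ascending=False).head(take_n_popular).index
    out = df.copy()  # leave the caller's frame untouched
    out.loc[~out['item_id'].isin(top_items), 'item_id'] = PLACEHOLDER_ID
    return out

# Synthetic log: item 2 sold 3 times, item 1 twice, items 3 and 4 once each
toy = pd.DataFrame({
    'item_id':  [1, 1, 2, 2, 2, 3, 4],
    'quantity': [1, 1, 1, 1, 1, 1, 1],
})

filtered = prefilter_sketch(toy, take_n_popular=2)
print(filtered['item_id'].nunique())  # 2 kept items + 1 placeholder = 3
```

Keeping the rare items under a placeholder preserves the row count and each user's interaction history, at the cost of one artificial item that must be excluded from recommendations.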
