# Candidate ReRank Model using Handcrafted Rules
In this notebook, we present a "candidate rerank" model using handcrafted rules. We can improve this model by engineering features, merging them onto items and users, and training a reranker model (such as XGBoost) to choose our final 20. Furthermore, to tune and improve this notebook, we should build a local CV scheme for experimenting with new logic and/or models.

UPDATE: I published a notebook to compute validation score [here][10] using Radek's scheme described [here][11].

Note that in this competition, a "session" actually means a unique "user". So our task is to predict what each of the `1,671,803` test "users" (i.e. "sessions") will do in the future. For each test "user" (i.e. "session") we must predict what they will `click`, `cart`, and `order` during the remainder of the week-long test period.

### Step 1 - Generate Candidates
For each test user, we generate possible choices, i.e. candidates. In this notebook, we generate candidates from 5 sources:
* User history of clicks, carts, orders
* Most popular 20 clicks, carts, orders during test week
* Co-visitation matrix of click/cart/order to cart/order with type weighting
* Co-visitation matrix of cart/order to cart/order called buy2buy
* Co-visitation matrix of click/cart/order to clicks with time weighting

### Step 2 - ReRank and Choose 20
Given the list of candidates, we must select 20 to be our predictions. In this notebook, we do this with a set of handcrafted rules. We can improve our predictions by training an XGBoost model to select for us. Our handcrafted rules give priority to:
* Most recent previously visited items
* Items previously visited multiple times
* Items previously in cart or order
* Co-visitation matrix of cart/order to cart/order
* Current popular items
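
The priorities above can be sketched as a single scoring function. This is a minimal illustration and not the notebook's actual rules: the function name `rerank`, the linear recency score, the type bonus of 5, and the co-visitation boost of 0.1 are all assumptions chosen for a toy example.

```python
from collections import Counter

def rerank(history, covisit, popular, k=20):
    """history: list of (item, type) pairs, oldest first; type 1 = cart/order, 0 = click.
    covisit: dict mapping an item to a list of neighbouring items.
    popular: list of fallback items, most popular first."""
    scores = Counter()
    n = len(history)
    for pos, (item, typ) in enumerate(history):
        recency = (pos + 1) / n              # later events score higher (assumed formula)
        type_bonus = 5 if typ == 1 else 1    # assumed type weights
        scores[item] += recency * type_bonus  # repeated visits accumulate
    # Co-visited neighbours of each distinct history item get a small boost (assumed value)
    for item in dict.fromkeys(i for i, _ in history):
        for neighbour in covisit.get(item, []):
            scores[neighbour] += 0.1
    ranked = [item for item, _ in scores.most_common(k)]
    # Fill any remaining slots with popular items
    for item in popular:
        if len(ranked) >= k:
            break
        if item not in ranked:
            ranked.append(item)
    return ranked

history = [("a", 0), ("b", 0), ("a", 1)]
covisit = {"a": ["c"], "b": ["c", "d"]}
top5 = rerank(history, covisit, popular=["z", "a", "y"], k=5)
```

Here `a` ranks first because it was visited twice and ended up in a cart/order, and `z` only fills the last slot because the history and co-visitation candidates ran out.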

![](https://raw.githubusercontent.com/cdeotte/Kaggle_Images/main/Nov-2022/c_r_model.png)
  
# Credits
We thank the many Kagglers who have shared ideas. We use the co-visitation matrix idea from Vladimir [here][1]. We use the groupby sort logic from Sinan in the comment section [here][4]. We use the duplicate prediction removal logic from Radek [here][5]. We use the multiple visit logic from Pietro [here][2]. We use the type weighting logic from Ingvaras [here][3]. We use the leaky test data from my previous notebook [here][4]. Some ideas may have originated from Tawara [here][6] and KJ [here][7]. We use Colum2131's parquets [here][8]. The image above is from Ravi's discussion about candidate rerank models [here][9].

[1]: https://www.kaggle.com/code/vslaykovsky/co-visitation-matrix
[2]: https://www.kaggle.com/code/pietromaldini1/multiple-clicks-vs-latest-items
[3]: https://www.kaggle.com/code/ingvarasgalinskas/item-type-vs-multiple-clicks-vs-latest-items
[4]: https://www.kaggle.com/code/cdeotte/test-data-leak-lb-boost
[5]: https://www.kaggle.com/code/radek1/co-visitation-matrix-simplified-imprvd-logic
[6]: https://www.kaggle.com/code/ttahara/otto-mors-aid-frequency-baseline
[7]: https://www.kaggle.com/code/whitelily/co-occurrence-baseline
[8]: https://www.kaggle.com/datasets/columbia2131/otto-chunk-data-inparquet-format
[9]: https://www.kaggle.com/competitions/otto-recommender-system/discussion/364721
[10]: https://www.kaggle.com/cdeotte/compute-validation-score-cv-564
[11]: https://www.kaggle.com/competitions/otto-recommender-system/discussion/364991

# Notes
Below are notes about versions:
* **Version 1 LB 0.573** Uses popular ideas from public notebooks and adds additional co-visitation matrices and additional logic. Has CV `0.563`. See validation notebook version 2 [here][1].
* **Version 2 LB 0.573** Refactor logic for `suggest_buys(df)` to make it clear how new co-visitation matrices rerank the candidates by adding to candidate weights. The new logic also boosts CV by `+0.0003`, and LB is slightly better too. See validation notebook version 3 [here][1].
* **Version 3** is the same as version 2 but 1.5x faster co-visitation matrix computation!
* **Version 4 LB 0.575** Use top20 for clicks and top15 for carts and buys (instead of top40 and top40). This boosts CV `+0.0015` hooray! New CV is `0.5647`. See validation version 5 [here][1].
* **Version 5** is the same as version 4 but 2x faster co-visitation matrix computation! (and 3x faster than version 1)
* **Version 6** Stay tuned for more versions...

[1]: https://www.kaggle.com/code/cdeotte/compute-validation-score-cv-564

# Step 1 - Candidate Generation with RAPIDS
For candidate generation, we build three co-visitation matrices. One computes the popularity of cart/order given a user's previous click/cart/order. We apply type weighting to this matrix. One computes the popularity of cart/order given a user's previous cart/order. We call this the "buy2buy" matrix. One computes the popularity of clicks given a user's previous click/cart/order. We apply time weighting to this matrix. We will use RAPIDS cuDF GPU to compute these matrices quickly!
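
To make the idea concrete, here is a small sketch of a type-weighted co-visitation matrix on toy data. It uses plain pandas and the same column names as the cells below (`UserId`, `ItemId`, `Purchase`); the toy events and the helper name `covisit` are assumptions for illustration only.

```python
import pandas as pd

events = pd.DataFrame({
    "UserId":   [1, 1, 1, 2, 2],
    "ItemId":   ["a", "b", "c", "a", "b"],
    "Purchase": [0, 1, 0, 0, 1],
})
type_weight = {0: 1, 1: 5}

def covisit(events, type_weight):
    # Pair every item with every other item seen by the same user
    pairs = events.merge(events, on="UserId")
    pairs = pairs[pairs.ItemId_x != pairs.ItemId_y]
    # Count each (user, item_x, item_y) combination once
    pairs = pairs.drop_duplicates(["UserId", "ItemId_x", "ItemId_y"])
    # Weight the pair by the type of the second item
    pairs["wgt"] = pairs.Purchase_y.map(type_weight)
    return pairs.groupby(["ItemId_x", "ItemId_y"]).wgt.sum()

matrix = covisit(events, type_weight)
```

Both users pair `a` with the purchased item `b`, so `matrix[("a", "b")]` is 5 + 5 = 10, while the reverse pair `("b", "a")` only collects 1 + 1 = 2 because `a` was not purchased.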

In [1]:
VER = 5

import pandas as pd, numpy as np
from tqdm.notebook import tqdm
import os, sys, pickle, glob, gc
from collections import Counter
# import cudf, itertools
# print('We will use RAPIDS version',cudf.__version__)

## Compute Three Co-visitation Matrices with RAPIDS
We will compute 3 co-visitation matrices using RAPIDS cuDF on GPU. This is 30x faster than using Pandas CPU like other public notebooks! For maximum speed, set the variable `DISK_PIECES` to the smallest number possible based on the GPU you are using without incurring memory errors. If you run this code offline with 32GB GPU RAM, then you can use `DISK_PIECES = 1` and compute each co-visitation matrix in almost 1 minute! Kaggle's GPU only has 16GB RAM, so we use `DISK_PIECES = 4` and it takes an amazing 3 minutes each! Below are some of the tricks to speed up computation:
* Use RAPIDS cuDF GPU instead of Pandas CPU
* Read disk once and save in CPU RAM for later GPU multiple use
* Process largest amount of data possible on GPU at one time
* Merge data in two stages. Multiple small to single medium. Multiple medium to single large.
* Write result as parquet instead of dictionary
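
The piece-wise processing can be sketched as follows. `DISK_PIECES` is not defined in the code cells of this notebook, so this sketch assumes it means splitting the left-hand items into that many groups and building the pairs for one group at a time. It uses pandas rather than cuDF, and the helper names `covisit_counts` and `covisit_in_pieces` are hypothetical.

```python
import pandas as pd

def covisit_counts(left, events):
    """Count distinct users per (ItemId_x, ItemId_y), pairing `left` rows with all events."""
    pairs = left.merge(events, on="UserId")
    pairs = pairs[pairs.ItemId_x != pairs.ItemId_y]
    pairs = pairs.drop_duplicates(["UserId", "ItemId_x", "ItemId_y"])
    return pairs.groupby(["ItemId_x", "ItemId_y"]).size().reset_index(name="wgt")

def covisit_in_pieces(events, disk_pieces):
    items = sorted(events["ItemId"].unique())
    size = -(-len(items) // disk_pieces)  # ceiling division
    parts = []
    for p in range(disk_pieces):
        piece = items[p * size:(p + 1) * size]
        # Only rows whose item falls in this piece go on the left side of the merge,
        # which bounds the size of the intermediate pair table
        left = events[events["ItemId"].isin(piece)]
        parts.append(covisit_counts(left, events))
    # Each ItemId_x lives in exactly one piece, so concatenation needs no re-aggregation
    return pd.concat(parts, ignore_index=True)

events = pd.DataFrame({
    "UserId": [1, 1, 1, 2, 2, 3, 3],
    "ItemId": ["a", "b", "c", "a", "b", "b", "c"],
})
whole = covisit_in_pieces(events, disk_pieces=1)
split = covisit_in_pieces(events, disk_pieces=3)
```

Because every left-hand item belongs to exactly one piece, the split result matches the single-pass result; a smaller number of pieces means fewer passes but a larger intermediate pair table.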

In [4]:
df = pd.read_csv('/kaggle/input/vhac-recsys/training_set.csv')

user_list = df.groupby('UserId')['ItemId'].nunique()
user_list_denoise = user_list[(user_list<=20) & (user_list>=3)].index.to_list()
df = df[df.UserId.isin(user_list_denoise)]

df['type'] = df['Purchase']
df.shape

(257122, 5)

## 1) "Carts Orders" Co-visitation Matrix - Type Weighted

In [5]:
# CREATE PAIRS
df = df.merge(df,on='UserId')
df = df.loc[(df.ItemId_x != df.ItemId_y) ]

In [6]:
type_weight = {0:1, 1:5}

# ASSIGN WEIGHTS
df = df[['UserId', 'ItemId_x', 'ItemId_y','type_y']].drop_duplicates(['UserId', 'ItemId_x', 'ItemId_y'])
df['wgt'] = df.type_y.map(type_weight)

In [7]:
df = df[['ItemId_x', 'ItemId_y','wgt']]
df.wgt = df.wgt.astype('float32')
df = df.groupby(['ItemId_x', 'ItemId_y']).wgt.sum()

In [8]:
# CONVERT MATRIX TO DICTIONARY
df = df.reset_index()
df = df.sort_values(['ItemId_x','wgt'],ascending=[True,False])
# KEEP TOP 15 PER ITEM
df = df.reset_index(drop=True)
df['n'] = df.groupby('ItemId_x').ItemId_y.cumcount()
df = df.loc[df.n<15].drop('n',axis=1)

In [9]:
cart_order = df.copy()

In [11]:
cart_order.to_csv('cart_order_denoise.csv', index = False)
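
The "sort, `cumcount`, filter" pattern used above to keep the strongest neighbours of each item can be seen on a toy frame. The data and the helper name `top_n` are assumptions for illustration; the column names follow the cells above.

```python
import pandas as pd

scores = pd.DataFrame({
    "ItemId_x": ["a", "a", "a", "b", "b"],
    "ItemId_y": ["p", "q", "r", "p", "q"],
    "wgt":      [1.0, 3.0, 2.0, 5.0, 4.0],
})

def top_n(df, n):
    # Sort neighbours of each item by descending weight
    df = df.sort_values(["ItemId_x", "wgt"], ascending=[True, False]).reset_index(drop=True)
    # Rank within each ItemId_x group: 0 is the heaviest neighbour
    df["n"] = df.groupby("ItemId_x").ItemId_y.cumcount()
    return df.loc[df.n < n].drop("n", axis=1)

top2 = top_n(scores, 2)
# Same conversion to a lookup dictionary as used later in the notebook
lookup = top2.groupby("ItemId_x").ItemId_y.apply(list).to_dict()
```

For item `a` the weakest neighbour `p` is dropped, leaving `lookup == {"a": ["q", "r"], "b": ["p", "q"]}`.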

## 2) "Buy2Buy" Co-visitation Matrix

In [12]:
%%time
df = pd.read_csv('/kaggle/input/vhac-recsys/training_set.csv')

user_list = df.groupby('UserId')['ItemId'].nunique()
user_list_denoise = user_list[(user_list<=20) & (user_list>=3)].index.to_list()
df = df[df.UserId.isin(user_list_denoise)]

df['type'] = df['Purchase']
df.shape

df = df.loc[df['type'].isin([1])] # KEEP PURCHASES ONLY (type == 1)
# CREATE PAIRS
df = df.merge(df,on='UserId')
df = df.loc[(df.ItemId_x != df.ItemId_y)] # DROP SELF-PAIRS
# ASSIGN WEIGHTS
df = df[['UserId', 'ItemId_x', 'ItemId_y','type_y']].drop_duplicates(['UserId', 'ItemId_x', 'ItemId_y'])
df['wgt'] = 1
df = df[['ItemId_x', 'ItemId_y','wgt']]
df.wgt = df.wgt.astype('float32')
df = df.groupby(['ItemId_x', 'ItemId_y']).wgt.sum()

# CONVERT MATRIX TO DICTIONARY
df = df.reset_index()
df = df.sort_values(['ItemId_x','wgt'],ascending=[True,False])
# KEEP TOP 15 PER ITEM
df = df.reset_index(drop=True)
df['n'] = df.groupby('ItemId_x').ItemId_y.cumcount()
df = df.loc[df.n<15].drop('n',axis=1)

buy_order = df.copy()

CPU times: user 569 ms, sys: 7.67 ms, total: 576 ms
Wall time: 576 ms


In [14]:
buy_order.to_csv('buy_order_denoise.csv', index = False)

## 3) "Clicks" Co-visitation Matrix - Time Weighted

In [33]:
df = pd.read_csv('/kaggle/input/vhac-recsys/training_set.csv')
user_list = df.groupby('UserId')['ItemId'].nunique()
user_list_denoise = user_list[(user_list<=20) & (user_list>=3)].index.to_list()
df = df[df.UserId.isin(user_list_denoise)]

def add_action_num_reverse_chrono(df):
    df['action_num_reverse_chrono'] = df.groupby('UserId').cumcount(ascending=False)
    return df

def add_session_length(df):
    tmp = df.groupby('UserId')['ItemId'].nunique().reset_index().rename(columns={'ItemId': 'session_length'})
    df = df.merge(tmp, on = 'UserId', how = 'left')
    return df

def add_log_recency_score(df):
    linear_interpolation = 0.1 + ((1-0.1) / (df['session_length']-1)) * (df['session_length']-df['action_num_reverse_chrono']-1)
    df['log_recency_score'] = pd.Series(2**linear_interpolation - 1).fillna(1)
    return df

def add_type_weighted_log_recency_score(df):
    type_weights = {0:1, 1:5}
    type_weighted_log_recency_score = pd.Series(df['Purchase'].apply(lambda x: type_weights[x]) * df['log_recency_score'])
    df['type_weighted_log_recency_score'] = type_weighted_log_recency_score
    return df

def apply(df, pipeline):
    for f in pipeline:
        df = f(df)
    return df

pipeline = [add_action_num_reverse_chrono, add_session_length, add_log_recency_score, add_type_weighted_log_recency_score]

df = apply(df, pipeline)
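
As a worked example of the formula in `add_log_recency_score`, the sketch below computes the scores for a single session with NumPy. The exponent is interpolated linearly from 0.1 (oldest action) to 1.0 (most recent action), so the score `2**exponent - 1` runs from about 0.072 up to exactly 1. The helper name `log_recency_scores` is an assumption, and the single-action case returns 1 to mirror the `fillna(1)` above.

```python
import numpy as np

def log_recency_scores(session_length):
    """Scores for one session, ordered oldest action first."""
    if session_length == 1:
        return np.array([1.0])  # mirrors fillna(1) for the 0/0 case
    # Reverse-chronological action number: 0 is the most recent action
    reverse_chrono = np.arange(session_length)[::-1]
    exponent = 0.1 + ((1 - 0.1) / (session_length - 1)) * (session_length - reverse_chrono - 1)
    return 2 ** exponent - 1

scores = log_recency_scores(4)
```

For a session of four actions the exponents are 0.1, 0.4, 0.7, and 1.0, which is the same sequence as `np.logspace(0.1, 1, 4, base=2) - 1`.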

In [28]:
%%time
df['ts'] = df['action_num_reverse_chrono']
# CREATE PAIRS (in a separate frame, so df keeps its score columns for the next cells)
pairs = df.merge(df,on='UserId')
pairs = pairs.loc[(pairs.ItemId_x != pairs.ItemId_y)]
# ASSIGN WEIGHTS
pairs = pairs[['UserId', 'ItemId_x', 'ItemId_y','ts_x']].drop_duplicates(['UserId', 'ItemId_x', 'ItemId_y'])
pairs['wgt'] = 1 + 3*(pairs.ts_x - 1)/100
pairs = pairs[['ItemId_x', 'ItemId_y','wgt']]
pairs.wgt = pairs.wgt.astype('float32')
pairs = pairs.groupby(['ItemId_x', 'ItemId_y']).wgt.sum()

# SORT NEIGHBOURS OF EACH ITEM BY WEIGHT
pairs = pairs.reset_index()
pairs = pairs.sort_values(['ItemId_x','wgt'],ascending=[True,False])
# KEEP TOP 15 PER ITEM
pairs = pairs.reset_index(drop=True)
pairs['n'] = pairs.groupby('ItemId_x').ItemId_y.cumcount()
pairs = pairs.loc[pairs.n<15].drop('n',axis=1)

click_order = pairs.copy()
click_order

CPU times: user 4.63 s, sys: 1.18 s, total: 5.81 s
Wall time: 5.79 s


| | ItemId_x | ItemId_y | wgt |
|---|---|---|---|
| 0 | 004POG3wLH | rbXvgoHURF | 6.32 |
| 1 | 004POG3wLH | Djzr4rtQ1d | 6.14 |
| 2 | 004POG3wLH | a6fTWfqc6h | 5.02 |
| 3 | 004POG3wLH | z2JS9EHqBR | 4.02 |
| 4 | 004POG3wLH | 52SfqaqILe | 3.90 |
| ... | ... | ... | ... |
| 1674250 | zzyfHckCgU | 17QWtUfOqj | 6.02 |
| 1674251 | zzyfHckCgU | PzcBCDf6R4 | 5.93 |
| 1674252 | zzyfHckCgU | VkffBXXhVi | 5.42 |
| 1674253 | zzyfHckCgU | B3IeyxvddX | 4.99 |


In [29]:
click_order.to_csv('click_order_action_num_reverse_denoise.csv', index = False)

In [31]:
%%time
df['ts'] = df['log_recency_score']
# CREATE PAIRS (in a separate frame, so df keeps its score columns for the next cells)
pairs = df.merge(df,on='UserId')
pairs = pairs.loc[(pairs.ItemId_x != pairs.ItemId_y)]
# ASSIGN WEIGHTS
pairs = pairs[['UserId', 'ItemId_x', 'ItemId_y','ts_x']].drop_duplicates(['UserId', 'ItemId_x', 'ItemId_y'])
pairs['wgt'] = pairs.ts_x
pairs = pairs[['ItemId_x', 'ItemId_y','wgt']]
pairs.wgt = pairs.wgt.astype('float32')
pairs = pairs.groupby(['ItemId_x', 'ItemId_y']).wgt.sum()

# SORT NEIGHBOURS OF EACH ITEM BY WEIGHT
pairs = pairs.reset_index()
pairs = pairs.sort_values(['ItemId_x','wgt'],ascending=[True,False])
# KEEP TOP 15 PER ITEM
pairs = pairs.reset_index(drop=True)
pairs['n'] = pairs.groupby('ItemId_x').ItemId_y.cumcount()
pairs = pairs.loc[pairs.n<15].drop('n',axis=1)

click_order = pairs.copy()
click_order

CPU times: user 4.47 s, sys: 732 ms, total: 5.2 s
Wall time: 5.2 s


| | ItemId_x | ItemId_y | wgt |
|---|---|---|---|
| 0 | 004POG3wLH | Djzr4rtQ1d | 1.852012 |
| 1 | 004POG3wLH | a6fTWfqc6h | 1.546204 |
| 2 | 004POG3wLH | 0jUQo5avOJ | 1.271901 |
| 3 | 004POG3wLH | P04Gvoewcw | 1.104455 |
| 4 | 004POG3wLH | rbXvgoHURF | 1.035571 |
| ... | ... | ... | ... |
| 1674250 | zzyfHckCgU | PzcBCDf6R4 | 2.328075 |
| 1674251 | zzyfHckCgU | BxOTdTTVgN | 2.228202 |
| 1674252 | zzyfHckCgU | XROzLQhTXa | 2.021708 |
| 1674253 | zzyfHckCgU | ELwOkhgtMG | 1.852114 |


In [32]:
click_order.to_csv('click_order_log_recency_score_denoise.csv', index = False)

In [34]:
%%time
df['ts'] = df['type_weighted_log_recency_score']
# CREATE PAIRS (in a separate frame, so df keeps its score columns)
pairs = df.merge(df,on='UserId')
pairs = pairs.loc[(pairs.ItemId_x != pairs.ItemId_y)]
# ASSIGN WEIGHTS
pairs = pairs[['UserId', 'ItemId_x', 'ItemId_y','ts_x']].drop_duplicates(['UserId', 'ItemId_x', 'ItemId_y'])
pairs['wgt'] = pairs.ts_x
pairs = pairs[['ItemId_x', 'ItemId_y','wgt']]
pairs.wgt = pairs.wgt.astype('float32')
pairs = pairs.groupby(['ItemId_x', 'ItemId_y']).wgt.sum()

# SORT NEIGHBOURS OF EACH ITEM BY WEIGHT
pairs = pairs.reset_index()
pairs = pairs.sort_values(['ItemId_x','wgt'],ascending=[True,False])
# KEEP TOP 15 PER ITEM
pairs = pairs.reset_index(drop=True)
pairs['n'] = pairs.groupby('ItemId_x').ItemId_y.cumcount()
pairs = pairs.loc[pairs.n<15].drop('n',axis=1)

click_order = pairs.copy()
click_order

CPU times: user 4.49 s, sys: 777 ms, total: 5.27 s
Wall time: 5.27 s


| | ItemId_x | ItemId_y | wgt |
|---|---|---|---|
| 0 | 004POG3wLH | Djzr4rtQ1d | 1.852012 |
| 1 | 004POG3wLH | rbXvgoHURF | 1.670348 |
| 2 | 004POG3wLH | a6fTWfqc6h | 1.546204 |
| 3 | 004POG3wLH | 2WiKPnbBA7 | 1.534573 |
| 4 | 004POG3wLH | QOOxusL4GZ | 1.324271 |
| ... | ... | ... | ... |
| 1674250 | zzyfHckCgU | PzcBCDf6R4 | 2.328075 |
| 1674251 | zzyfHckCgU | BxOTdTTVgN | 2.228202 |
| 1674252 | zzyfHckCgU | XROzLQhTXa | 2.021708 |
| 1674253 | zzyfHckCgU | ELwOkhgtMG | 1.852114 |


In [35]:
click_order.to_csv('click_order_type_weighted_log_recency_score_denoise.csv', index = False)

# Step 2 - ReRank (choose 20) using handcrafted rules
For description of the handcrafted rules, read this notebook's intro.

In [39]:
test = pd.read_csv('/kaggle/input/vhac-recsys-standard/public_testset.csv', names = ['user_id']+[f'item_id_{i}' for i in range(1,1001)])

In [40]:
%%time
def pqt_to_dict(df):
    return df.groupby('ItemId_x').ItemId_y.apply(list).to_dict()
# BUILD LOOKUP DICTIONARIES FROM THE TWO BUY CO-VISITATION MATRICES

top_20_buys = pqt_to_dict(cart_order)

top_20_buy2buy = pqt_to_dict( buy_order )

CPU times: user 2.72 s, sys: 17.8 ms, total: 2.74 s
Wall time: 2.74 s


In [41]:
print( len( top_20_buy2buy ), len( top_20_buys ) )

6438 72643


In [42]:
import itertools

#type_weight_multipliers = {'clicks': 1, 'carts': 6, 'orders': 3}
type_weight_multipliers = {0: 1, 1: 5}

df = pd.read_csv('/kaggle/input/vhac-recsys-standard/data_final.csv')
df['type'] = df['Purchase']

def suggest_buys(df):
    # USER HISTORY ITEMS
    aids = df.ItemId.tolist()
    # UNIQUE ITEMS AND UNIQUE PURCHASES, IN REVERSE ROW ORDER
    unique_aids = list(dict.fromkeys(aids[::-1]))
    df = df.loc[(df['type']==1)]
    unique_buys = list(dict.fromkeys(df.ItemId.tolist()[::-1]))
    # USE "CART ORDER" CO-VISITATION MATRIX
    aids2 = list(itertools.chain(*[top_20_buys[aid] for aid in unique_aids if aid in top_20_buys]))
    # USE "BUY2BUY" CO-VISITATION MATRIX
    aids3 = list(itertools.chain(*[top_20_buy2buy[aid] for aid in unique_buys if aid in top_20_buy2buy]))
    # RERANK CANDIDATES BY VOTE COUNT, THEN DROP ITEMS ALREADY IN THE USER'S HISTORY
    result = [aid2 for aid2, cnt in Counter(aids2+aids3).most_common(20) if aid2 not in unique_aids]
    return result 
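
The voting step in `suggest_buys` can be illustrated on a toy lookup. One detail worth seeing in isolation: because history items are filtered out after `most_common(20)`, the function can return fewer than 20 items. The sketch below contrasts that order with a hypothetical variant that filters first and truncates second; the helper name `vote` and the toy data are assumptions, and the variant is not what the notebook does.

```python
from collections import Counter
import itertools

def vote(history, neighbours, k=3):
    seen = list(dict.fromkeys(history[::-1]))
    candidates = list(itertools.chain(*[neighbours[i] for i in seen if i in neighbours]))
    counts = Counter(candidates)
    # Same order of operations as suggest_buys: truncate to k, then drop seen items
    truncate_then_filter = [i for i, _ in counts.most_common(k) if i not in seen]
    # Hypothetical variant: drop seen items first, then truncate to k
    filter_then_truncate = [i for i, _ in counts.most_common() if i not in seen][:k]
    return truncate_then_filter, filter_then_truncate

neighbours = {"a": ["b", "c", "d"], "b": ["a", "c", "e"]}
short, full = vote(["a", "b"], neighbours, k=3)
```

With this toy input the first list has only two items because the already-seen item `a` occupied one of the top three slots, while the variant still returns three.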

# Create Submission CSV
Inferring test data with Pandas groupby is slow. We need to accelerate the following code.
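
One possible direction, sketched below under assumptions, is to collect each user's history into plain Python lists once and then call the rule function in an ordinary loop, instead of handing a DataFrame slice to the function for every user. Whether this is actually faster on the real data has not been measured here. The rule `suggest` is a stand-in (purchased items, last row first), not `suggest_buys`.

```python
import pandas as pd

def suggest(items, types):
    # Stand-in rule for illustration: purchased items, last row first
    return [i for i, t in zip(items[::-1], types[::-1]) if t == 1]

events = pd.DataFrame({
    "UserId": ["u1", "u1", "u2", "u2", "u2"],
    "ItemId": ["a", "b", "c", "a", "b"],
    "type":   [0, 1, 1, 0, 1],
})

# Collect each user's items and types into lists in a single groupby pass
hist = events.groupby("UserId").agg(items=("ItemId", list), types=("type", list))
# Apply the rule with a plain Python loop over the collected lists
preds = {u: suggest(i, t) for u, i, t in zip(hist.index, hist["items"], hist["types"])}
```

The result is a dictionary from user to prediction list, which can be turned into a Series with `pd.Series(preds)` if the downstream code expects one.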

In [43]:
%%time
import itertools
pred_df_buys = df.groupby(["UserId"]).apply(
    lambda x: suggest_buys(x)
)

CPU times: user 27.7 s, sys: 76.3 ms, total: 27.8 s
Wall time: 27.8 s


In [44]:
pred_df_buys = pred_df_buys.reset_index()
pred_df_buys.rename(columns = {0: "list_item"}, inplace = True)
submit = pred_df_buys[pred_df_buys.UserId.isin(test.user_id)]

In [45]:
# Pad a list to length 1000, filling with "0" where needed
def pad_list(lst, length=1000):
    return lst + ["0"] * (length - len(lst))

# Create a DataFrame with 1000 item columns
submit = pd.DataFrame(
    submit['list_item'].apply(lambda x: pad_list(x)).tolist(),
    index=submit['UserId'],
    columns=[f'item_{i+1}' for i in range(1000)]
).reset_index()

# Rename the index column to UserId (a no-op when the index is already named UserId)
submit.rename(columns={'index': 'UserId'}, inplace=True)

# Save the DataFrame to CSV without header or index
submit.to_csv('predict.csv', index = None, header = None)
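
The padding step can be checked on a small example. This sketch uses a width of 4 instead of 1000 and adds truncation for lists longer than the target width, which the cell above does not do; the toy predictions and the `fill` parameter are assumptions.

```python
import pandas as pd

def pad_list(lst, length=1000, fill="0"):
    # Truncate to `length`, then pad with `fill` up to `length`
    return lst[:length] + [fill] * (length - len(lst))

preds = pd.Series({"u1": ["a", "b"], "u2": ["c"]})
wide = pd.DataFrame(
    [pad_list(v, length=4) for v in preds],
    index=preds.index,
    columns=[f"item_{i+1}" for i in range(4)],
).reset_index()
```

Every row ends up with one user column plus exactly four item columns, so `wide.shape` is `(2, 5)`.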