# Clicks-only dataset creation
This notebook describes the pre-processing of click data used to create reliable datasets.

Both the original and the reproducibility papers mention only **clicks** and click-related metrics,
yet the accompanying codebase merges **buys** data into the **clicks** data. This has both
positive and negative effects. On the positive side, it prolongs some sessions by adding further
user actions in the form of buy events. On the negative side, buy events are likely not ordered and
are appended to existing sessions as extra items, while it is unclear how they were recorded.
Buys also represent only a small portion of the data (rc15 has 1,110,965 clicks and only 43,946 buys,
i.e. about 4 % as many buys as clicks). Moreover, the published code treats rewards for clicks and buys
differently, which the papers do not describe. Because the reason for merging **buys**
into the **clicks** dataset is unclear and the merge potentially introduces further issues, I decided to
benchmark the models on a clicks-only dataset.


In [None]:
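import os

import numpy as np
import pandas as pd

# Helpers used below; all three come from the project's src/utils module
from src.utils import get_stats, pad_history, to_pickled_df

cur_dir = os.getcwd()
# Trailing slashes are kept because these paths are concatenated with file names below
data_path = cur_dir + '/div4rec/rc15_data/'
data_path_save = cur_dir + '/div4rec/rc15_data/Clicks_only/'
os.makedirs(data_path_save, exist_ok=True)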

### Start with sampled_buys and sampled_clicks
We start with the data produced by sample_data_rc15.py, so up to this point we follow the original work.
However, we omit sampled_buys.df and continue with sampled_clicks.df only. The following code
partially corresponds to merge_and_sort_rc15.py, but we skip merging buys into clicks
and only sort the clicks data.

In [None]:
sampled_clicks = pd.read_pickle(os.path.join(data_path, 'sampled_clicks.df'))
sampled_clicks = sampled_clicks.drop(columns=['category'])
# No buys are merged in, so every event is marked as a click
sampled_clicks['is_buy'] = 0
# Order events chronologically within each session
sampled_clicks = sampled_clicks.sort_values(by=['session_id', 'timestamp'])

sampled_clicks.to_csv(os.path.join(data_path_save, 'sampled_clicks.csv'), index=False, header=True)
to_pickled_df(data_path_save, sampled_clicks=sampled_clicks)

### Continue with split_data.py
Almost no changes: we take our sorted clicks data and split it by session into train, validation and test sets (80 % / 10 % / 10 % of sessions).
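
The split in this notebook shuffles session ids without a fixed seed, so each run produces a different partition. The sketch below shows one possible way to make a session-level split repeatable. The function name `split_session_ids`, the `seed` parameter and the toy ids are illustrative assumptions and are not part of the original split_data.py.

```python
import numpy as np

def split_session_ids(session_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle unique session ids with a seeded generator and cut them into parts.

    `seed` is a hypothetical parameter added here for repeatability.
    """
    rng = np.random.default_rng(seed)
    ids = rng.permutation(np.unique(session_ids))
    # Cut points are the cumulative fractions scaled to the number of sessions.
    cuts = (np.cumsum(fractions)[:-1] * len(ids)).astype(int)
    return np.split(ids, cuts)

# Toy ids standing in for sampled_clicks.session_id
train_ids, val_ids, test_ids = split_session_ids(np.arange(100))
```

Splitting ids rather than rows keeps every session inside exactly one of the three sets, which is the same property the notebook's `isin` filters rely on.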

In [None]:
total_sessions = sampled_clicks.session_id.unique()
# Shuffle session ids in place; no seed is set, so the split differs between runs
np.random.shuffle(total_sessions)

fractions = np.array([0.8, 0.1, 0.1])
# Split the session ids into train/val/test parts at the cumulative fractions
train_ids, val_ids, test_ids = np.array_split(
    total_sessions, (fractions[:-1].cumsum() * len(total_sessions)).astype(int))
# Each session goes to exactly one split
train_sessions = sampled_clicks[sampled_clicks['session_id'].isin(train_ids)]
val_sessions = sampled_clicks[sampled_clicks['session_id'].isin(val_ids)]
test_sessions = sampled_clicks[sampled_clicks['session_id'].isin(test_ids)]

to_pickled_df(data_path_save, sampled_train=train_sessions)
to_pickled_df(data_path_save, sampled_val=val_sessions)
to_pickled_df(data_path_save, sampled_test=test_sessions)

### Generate replay buffers for click data
Here we generate replay buffers for the test, val and train datasets. A replay buffer
is the source of data that is further processed before being loaded by the DataLoader.

A replay_buffer entry (line) has the following format:  
0 &nbsp;&nbsp;&nbsp; [26702, 26702, 26702, 26702, 26702, 26702, 26702, 26702, 26702, 26702]&nbsp;&nbsp;&nbsp; 1 &nbsp; &nbsp; &nbsp; 217 &nbsp;&nbsp;&nbsp; 0 &nbsp;&nbsp;&nbsp; [217, 26702, 26702, 26702, 26702, 26702, 26702, 26702, 26702, 26702] &nbsp;&nbsp;&nbsp; 1 &nbsp;&nbsp;&nbsp; False  
where, from left to right, the fields are: line number; **state**; len_state; action; is_buy; **next_state**; len_next_state (stored as column `len_next_states`); is_done.  
Note that 26702 is the padding item, so the state in this example is an empty sequence.
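
To make the entry format concrete, the following sketch builds the entries for one toy session. It is only an illustration: `pad` is a stand-in for `pad_history` from src.utils (its exact behavior is assumed here to be right-padding to a fixed length and keeping the most recent items), and the state size of 3, the padding id 999 and the item 305 are made-up values.

```python
def pad(history, state_size, pad_item):
    """Keep the most recent `state_size` items and right-pad with `pad_item`."""
    recent = list(history[-state_size:])
    return recent + [pad_item] * (state_size - len(recent))

def session_to_entries(items, state_size, pad_item):
    """Turn one session (a list of clicked item ids) into replay-buffer rows."""
    entries, history = [], []
    for item in items:
        state = pad(history, state_size, pad_item)
        # An empty state is still reported with length 1, as in the example above.
        len_state = min(max(len(history), 1), state_size)
        history.append(item)
        entries.append({
            'state': state,
            'len_state': len_state,
            'action': item,
            'is_buy': 0,  # clicks-only data
            'next_state': pad(history, state_size, pad_item),
            'len_next_state': min(len(history), state_size),
            'is_done': False,
        })
    if entries:
        entries[-1]['is_done'] = True  # the last click closes the session
    return entries

entries = session_to_entries([217, 305], state_size=3, pad_item=999)
for entry in entries:
    print(entry)
```

The first printed entry has a padding-only state and item 217 as the action, mirroring the example line above; the second entry already carries 217 in its state.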

We produce two variants:
 - original approach (skip_length=0): the buffer contains lines with an "empty" **state** (only padding items), see the replay buffer example above
 - improved approach (skip_length=1 or 2): the first state of each session in the buffer already contains one or more items

Explainer: with the original approach, all buffers contain lines with no item in **state** and
a single item in **next_state**. The model therefore has to guess the first item during inference,
and some metrics (such as hit ratio) are affected by including these random guesses.
My estimate is that 10 - 20 % of the entries in the replay buffers are of this kind, and with
around 26,700 items I expect them to be truly random guesses. The improved dataset eliminates
situations in which the model has to infer (guess) from an empty (padding-only) sequence. Training
may also be negatively affected by empty states in the training data.
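
The share of empty-state entries can be estimated from session lengths alone, because each session contributes exactly one such entry when nothing is skipped. The sketch below illustrates this counting; the function names and the toy session lengths are assumptions made for the example and are not rc15 statistics.

```python
def buffer_size(session_lengths, skip_length=0):
    """Number of entries kept when sequences of skip_length or shorter are dropped."""
    return sum(max(n - skip_length, 0) for n in session_lengths)

def empty_state_fraction(session_lengths):
    """Share of skip_length=0 entries whose state contains only padding."""
    total = buffer_size(session_lengths, skip_length=0)
    # Every non-empty session contributes exactly one empty-state entry.
    sessions = sum(1 for n in session_lengths if n > 0)
    return sessions / total if total else 0.0

toy_lengths = [2, 5, 3, 10]  # made-up session lengths
fraction = empty_state_fraction(toy_lengths)
sizes = {skip: buffer_size(toy_lengths, skip) for skip in (0, 1, 2)}
```

Under this counting the share equals one over the mean session length, so an estimate of 10 - 20 % would correspond to sessions averaging roughly 5 to 10 clicks.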

In [None]:
state_size, item_num = get_stats(data_path)

def _state_len(history):
    # Length of the unpadded history, clipped to the range [1, state_size]
    return min(max(len(history), 1), state_size)

def create_buffer(dataset_name, sorted_events, output_path, skip_length=0):
    """Build a replay buffer from time-sorted events and pickle it.

    Transitions whose next_state holds skip_length items or fewer are dropped,
    so skip_length=0 keeps the entries with an empty (padding-only) state.
    """
    pad_item = item_num  # item_num doubles as the padding item id

    state, len_state, action, is_buy = [], [], [], []
    next_state, len_next_state, is_done = [], [], []

    # sort=False keeps sessions in their order of first appearance
    for _, group in sorted_events.groupby('session_id', sort=False):
        history = []
        added = False
        for _, row in group.iterrows():
            # Lengths are taken before padding; this assumes pad_history returns
            # a list of exactly state_size items, which would hide the true length
            s_len = _state_len(history)
            s = pad_history(list(history), state_size, pad_item)
            a = row['item_id']
            is_b = row['is_buy']
            history.append(row['item_id'])
            next_s_len = _state_len(history)
            next_s = pad_history(list(history), state_size, pad_item)
            # sequences of skip_length or shorter are not added to dataset
            if len(history) > skip_length:
                state.append(s)
                len_state.append(s_len)
                action.append(a)
                is_buy.append(is_b)
                next_state.append(next_s)
                len_next_state.append(next_s_len)
                is_done.append(False)
                added = True
        # Mark the last transition only if this session contributed any rows;
        # otherwise is_done[-1] would touch the previous session or an empty list
        if added:
            is_done[-1] = True

    replay_buffer_dict = {
        'state': state,
        'len_state': len_state,
        'action': action,
        'is_buy': is_buy,
        'next_state': next_state,
        'len_next_states': len_next_state,
        'is_done': is_done
    }
    replay_buffer = pd.DataFrame(data=replay_buffer_dict)
    replay_buffer.to_pickle(output_path + f'replay_buffer_{dataset_name}_skip={skip_length}.df')

In [None]:
for dataset in ['val', 'test', 'train']:
    sorted_events = pd.read_pickle(data_path_save + f'sampled_{dataset}.df')
    for skip in [0, 1, 2]:
        create_buffer(dataset, sorted_events, data_path_save, skip)