# Sequential datasets
The goal is to explore the following datasets and compare their characteristics to those reported in \[1] and \[2]:
- Amazon Beauty
- Steam
- MovieLens 1M

## Baseline Statistics
Here are the statistics after applying the preprocessing stages shared by both papers:
- each user has one sequence
- all interactions in a sequence are sorted by time: t=1..T
    - interaction T is used for testing, interaction T-1 for validation, and the rest for training
- users and items with fewer than 5 related actions are ignored
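
The stages above can be sketched in plain pandas. This is only an illustration of the protocol, not the code used by either paper: the column names `user_id`, `item_id` and `timestamp` and both helper functions are my own assumptions, and repeating the filter until nothing changes is also an assumption (a single pass is another possible reading).

```python
import pandas as pd


def k_core_filter(df, k=5):
    """Drop users and items with fewer than k interactions.

    Assumption: the filter is repeated until the frame stops changing.
    """
    while True:
        user_counts = df.groupby("user_id")["item_id"].transform("count")
        item_counts = df.groupby("item_id")["user_id"].transform("count")
        keep = (user_counts >= k) & (item_counts >= k)
        if keep.all():
            return df
        df = df[keep]


def leave_one_out(df):
    """Split each user's time-ordered sequence into train / valid / test."""
    df = df.sort_values(["user_id", "timestamp"], kind="stable")
    # 0 marks the last interaction of a user (T), 1 marks T-1, and so on.
    rank_from_end = df.groupby("user_id").cumcount(ascending=False)
    test = df[rank_from_end == 0]
    valid = df[rank_from_end == 1]
    train = df[rank_from_end >= 2]
    return train, valid, test
```

With `k=5` every surviving user has at least five interactions, so each user contributes at least three training interactions in addition to the validation and test ones.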

### SASRec
![sasrec_stats](SASRec_dataset_stat.png)


### BERT4Rec
![bert4rec_stats](Bert4rec_dataset_stat.png)



## RecBole Statistics
RecBole also has these datasets in its repo. They are already in its `.inter` format (i.e. atomic files).
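
To get a feeling for the atomic file layout outside of RecBole, here is a minimal sketch that reads a `.inter`-style table with pandas. It assumes the file is tab-separated with a `name:type` header. The sample rows and the `read_inter` helper are made up for illustration, and only the `float` type is converted.

```python
import io

import pandas as pd

# Made-up sample in the "name:type" header style; real files are much larger.
SAMPLE_INTER = (
    "user_id:token\titem_id:token\trating:float\ttimestamp:float\n"
    "u1\ti1\t5.0\t1000\n"
    "u1\ti2\t3.0\t1005\n"
    "u2\ti1\t4.0\t1010\n"
)


def read_inter(source):
    """Read a tab-separated atomic file and split 'name:type' headers."""
    df = pd.read_csv(source, sep="\t", dtype=str)
    field_types = {}
    renamed = {}
    for col in df.columns:
        name, ftype = col.split(":")
        field_types[name] = ftype
        renamed[col] = name
    df = df.rename(columns=renamed)
    # Only plain floats are converted here; token fields stay as strings.
    for name, ftype in field_types.items():
        if ftype == "float":
            df[name] = df[name].astype(float)
    return df, field_types


inter_df, inter_types = read_inter(io.StringIO(SAMPLE_INTER))
```

A real file path can be passed instead of the `StringIO` object.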

This is the flow:   

![data_flow](data_flow_en.png)


We need to configure the preprocessing to get similar statistics.
To do that, we look at the statistics of the dataframe that RecBole loads; the preprocessing is configured and runs as part of RecBole.
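
As a reference for the comparison, this is a small sketch of the summary numbers I would compute from an interaction dataframe. The helper name `interaction_stats` and the default column names are assumptions, and the exact set of columns in the papers' tables may differ.

```python
import pandas as pd


def interaction_stats(df, user_col="user_id", item_col="item_id"):
    """Summarize an interaction table: counts, averages and density."""
    n_users = df[user_col].nunique()
    n_items = df[item_col].nunique()
    n_inter = len(df)
    if n_users == 0 or n_items == 0:
        raise ValueError("the interaction table is empty")
    return {
        "users": n_users,
        "items": n_items,
        "interactions": n_inter,
        "avg_actions_per_user": n_inter / n_users,
        "avg_actions_per_item": n_inter / n_items,
        # Fraction of the user-item matrix that is filled.
        "density": n_inter / (n_users * n_items),
    }
```

Running it before and after filtering shows how much each preprocessing setting changes the dataset.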


In [None]:
import os 
import pandas as pd
import numpy as np
from datetime import datetime
from recbole.config import Config
from recbole.data import create_dataset
from recbole.data.utils import get_dataloader
from recbole.utils import init_logger, init_seed, get_model, get_trainer, set_color

In [None]:
from dataclasses import dataclass

@dataclass
class Arguments:
    model: str = 'SASRec'
    dataset: str = 'Amazon_Beauty'
    gpu_id: int = 0                 # read by config_dict below; 0 is an assumed default
    validation: bool = False        # seems to be irrelevant for the sequential protocol
    valid_portion: float = 0.1      # seems to be irrelevant for the sequential protocol

args=Arguments()



Going over the documentation, I have built a dictionary that includes most of the possible parameters for configuring a sequential RS algorithm.

In [None]:
config_dict = {
    ############################
    # Environment settings - see https://recbole.io/docs/user_guide/config/environment_settings.html
    # see default values in recbole/properties/overall.yaml
    # commenting out default values. to change, uncomment
    'gpu_id': args.gpu_id,
    'use_gpu': True,
    # 'seed': 2020,
    # 'state': 'INFO',
    # 'reproducibility': True,
    # 'data_path': 'dataset/',
    # 'checkpoint_dir': 'saved',
    # 'show_progress': True,
    # 'save_dataset': False,
    # 'dataset_save_path': None,
    # 'save_dataloaders': False,
    # 'dataloaders_save_path': None,
    # 'log_wandb': False,
    # 'wandb_project': 'recbole',
    
    ############################
    # Model Settings
    # model architecture parameters: see https://recbole.io/docs/user_guide/model/sequential/sasrec.html or relevant algorithm
    # the below setup for SASRec is taken from recbole/properties/model/SASRec.yaml
    # commenting out default values. to change, uncomment
    # 'n_layers': 2,
    # 'n_heads': 2,
    # 'hidden_size': 64,
    # 'inner_size': 256,
    # 'hidden_dropout_prob': 0.5,
    # 'attn_dropout_prob': 0.5,
    # 'hidden_act': 'gelu',
    # 'layer_norm_eps': 1e-12,
    # 'initializer_range': 0.02,
    # 'loss_type': 'CE',

    ############################
    # Data Settings - dataset basic information and preprocessing
    # see https://recbole.io/docs/user_guide/config/data_settings.html for details
    # see default values in recbole/properties/dataset/sample.yaml
    # in the following, only settings that are relevant to sequential models are added:

    # Sequential Model Needed
    # 'ITEM_LIST_LENGTH_FIELD': 'item_length',
    # 'LIST_SUFFIX': '_list',
    # 'MAX_ITEM_LIST_LENGTH': 50,
    # 'POSITION_FIELD': 'position_id',

    # Selectively Loading
    # 'load_col': {'inter': ['user_id', 'item_id']},
    'load_col': {'inter': ['user_id', 'item_id', 'rating', 'timestamp']},
    # 'unload_col': None,
    # 'unused_col': None,
    # 'additional_feat_suffix': None,
    
    # Filtering
    # 'rm_dup_inter': None,
    # 'val_interval': None,
    # 'filter_inter_by_user_or_item': True,
    'user_inter_num_interval': "[5,inf)",
    # 'item_inter_num_interval': "[4,inf)",

    # Preprocessing
    # 'alias_of_user_id': None,
    # 'alias_of_item_id': None,
    # 'alias_of_item_id': ['item_id_list'],
    # 'alias_of_entity_id': None,
    # 'alias_of_relation_id': None,
    # 'preload_weight': None,
    # 'normalize_field': None,
    # 'normalize_all': None,

    # Benchmark .inter
    # 'benchmark_filename': ['train', 'test'],


    ################################
    # Training settings - see https://recbole.io/docs/user_guide/config/training_settings.html
    # see default values in recbole/properties/overall.yaml
    # defaults are commented out
    # 'epochs': 300,
    'train_batch_size': 2048,
    # 'learner': 'adam',
    # 'learning_rate': 0.001,
    'neg_sampling': None,
    # 'eval_step': 1,
    # 'stopping_step': 10,
    'stopping_step': 50,
    # 'clip_grad_norm': None,
    # # clip_grad_norm:  {'max_norm': 5, 'norm_type': 2}
    # 'weight_decay': 0.0,
    # 'loss_decimal_place': 4,
    # 'require_pow': False,



    ################################
    # Evaluation settings - see https://recbole.io/docs/user_guide/config/evaluation_settings.html
    # see default values in recbole/properties/overall.yaml
    # defaults are commented out
    'eval_args':{
        'split': {'LS': 'valid_and_test'},      # Leave-one-out splitting with valid and test splits (in addition to train)
        'order': 'TO',      # time ordered
        'mode': 'pop100'},
    'repeatable': True,     # must be True for sequential models
    'metrics': ['Hit','NDCG'],
    'topk': [1,5,10],
    'valid_metric': 'Hit@5',
    # 'valid_metric_bigger': True,
    'eval_batch_size': 4096,
    # 'metric_decimal_place': 4       
}
config = Config(model=args.model, dataset=args.dataset, config_dict=config_dict)
config.final_config_dict

In the following, we create the dataset. This calls the constructor of the dataset class that matches the algorithm type (in this case `SequentialDataset`).
It builds the object, and the features end up in the `SequentialDataset.inter_feat` dataframe.  
It also does the cleaning that has to take place, such as the filtering configured above.

In [None]:
dataset = create_dataset(config)

Now that we have the dataframe ready, we can build the dataset (block #4 above).
`build` splits it into train, validation and test. Each of them is a `SequentialDataset` whose `inter_feat` field is no longer a dataframe but an `Interaction`.
The `Interaction` is a set of tensors that will be fed to the model. Let's analyze it.

In [None]:
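# build() returns one dataset per split (train, valid, test with the 'LS' setting above)
train_dataset, valid_dataset, test_dataset = dataset.build()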
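
To make the tensor layout more concrete, here is a rough sketch of how one user sequence could be expanded into (history, target) samples. The field names `item_id_list` and `item_length` follow the config above. The function itself, the right padding with 0 and the truncation to the most recent `max_len` items are my assumptions about the general idea, not RecBole's actual implementation.

```python
import numpy as np


def prefix_samples(sequence, max_len=50):
    """Turn one time-ordered item sequence into (history, target) samples."""
    samples = []
    for i in range(1, len(sequence)):
        # Keep at most the max_len most recent items before position i.
        history = sequence[max(0, i - max_len):i]
        padded = np.zeros(max_len, dtype=np.int64)  # assumption: 0 is the padding id
        padded[:len(history)] = history
        samples.append({
            "item_id_list": padded,
            "item_length": len(history),
            "item_id": sequence[i],  # the item to predict
        })
    return samples
```

Calling `prefix_samples([3, 7, 9, 4], max_len=2)` yields three samples, one for each item that has at least one predecessor.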