# Dataset preparation
Background: I'd like to run the RecBole algorithms on a session-based task on 4 datasets, and compare the results with those published in the following papers:
- Evaluation of Session-based Recommendation Algorithms, Ludewig et al 2018
- Empirical Analysis of Session-Based Recommendation Algorithms, Ludewig et al 2020
- A survey on session-based recommender systems, Wang et al 2021

The first two papers also publish their code in the `session-rec` framework, together with the datasets for download.
All three papers use the same datasets:
![img](SBRS_datasets.png)


The goal is to understand what is needed to create a session-based (SBRS) version of each of the following datasets:
- RetailRocket
- RSC15
- DIGINETICA: we already have `diginetica-session` as an `.inter` file. We need to compare its characteristics with the published ones.
- NowPlaying: we already have `nowplaying-session` as an `.inter` file. We can't generate it from raw data in `session-rec` because we don't have the corresponding preprocessing script.
- Tmall: we already have `tmall-session` as an `.inter` file. We can't generate it from raw data in `session-rec` because the preprocessing script fails with an error.



RecBole has DIGINETICA and NowPlaying in a session-based version. We need to compare their characteristics with the `session-rec` versions.
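
To make the comparison of characteristics concrete, here is a minimal sketch of a helper that summarizes an event log with one row per interaction. The function name `session_stats` and the toy data are hypothetical; the default column names follow the `session-rec` prepared files used later in this notebook.

```python
import pandas as pd

def session_stats(df, session_col="SessionId", item_col="ItemId"):
    """Summarize an event log with one row per (session, item) interaction."""
    n_events = len(df)
    n_sessions = df[session_col].nunique()
    n_items = df[item_col].nunique()
    return {
        "events": n_events,
        "sessions": n_sessions,
        "items": n_items,
        "events_per_session": n_events / n_sessions,
        "events_per_item": n_events / n_items,
    }

# Toy data (made up) just to show the call.
toy = pd.DataFrame({"SessionId": [1, 1, 1, 2, 2], "ItemId": [10, 11, 10, 11, 12]})
stats = session_stats(toy)
print(stats)
```

The resulting numbers can be put side by side with the published table for each dataset.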

## imports


In [1]:
import os 
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
from recbole.config import Config
from recbole.data import create_dataset
from recbole.data.utils import get_dataloader
from recbole.utils import init_logger, init_seed, get_model, get_trainer, set_color

In [2]:
from dataclasses import dataclass

@dataclass
class Arguments:
    model:str = 'GRU4Rec'
    dataset:str = 'diginetica-session'
    validation: bool = False      # Whether to evaluate on a validation set (split from the train set); otherwise on the test set.
    valid_portion: float = 0.1


In [3]:
config_dict_sess = {
    'USER_ID_FIELD': 'session_id',
    'load_col': None,
    'neg_sampling': None,
    'benchmark_filename': ['train', 'test'],
    'alias_of_item_id': ['item_id_list'],
    'topk': [20],
    'metrics': ['Recall', 'MRR'],
    'valid_metric': 'MRR@20',
    'gpu_id': 0
}

# DIGINETICA
Let's start by comparing `diginetica-session` with what we have in `session-rec` and with the table above.

## Recbole

### session based
The script that reads the raw input file, as downloaded from DIGINETICA, and converts it to an `.inter` file is not included in the repository.



In [None]:
args = Arguments(dataset='diginetica-session')
args

In [None]:
config = Config(model=args.model, dataset=f'{args.dataset}', config_dict=config_dict_sess)
config.final_config_dict['data_path'] = os.path.join(os.path.dirname(os.getcwd()),config.final_config_dict['data_path'])
config.final_config_dict

In [None]:
# called from the main script and performs a set of operations to load the .inter file to dataframe
digi_recbole_sess = create_dataset(config)
digi_recbole_sess

In [None]:
digi_recbole_sess_inter_df=digi_recbole_sess.inter_feat
digi_recbole_sess_inter_df.head()

In [None]:
digi_recbole_sess_inter_df.session_id.value_counts()

In [None]:
digi_recbole_sess_inter_df.loc[digi_recbole_sess_inter_df.session_id==1,:]

In [None]:
len(digi_recbole_sess_inter_df)

It's not clear what `item_id` is in this file. Is it the last item in the session (with all the rest in `item_id_list`)?

The script that generates the `.inter` file of the sessionized version is not provided.
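
One way to probe this question is to test the hypothesis that `item_id` is the item that follows the prefix stored in `item_id_list`, so that consecutive rows derived from the same original session chain together. The sketch below uses made-up rows and a hypothetical helper `is_prefix_chain`; the column names are the ones of the raw `.inter` header.

```python
import pandas as pd

# Toy rows in the layout of a *-session .inter file (values are made up).
toy_inter = pd.DataFrame({
    "session_id:token": [1, 2, 3],
    "item_id_list:token_seq": ["5 7 9", "5 7", "5"],
    "item_id:token": [4, 9, 7],
})

def is_prefix_chain(df):
    """True if, for every row after the first, list + target equals the previous row's list."""
    lists = [s.split() for s in df["item_id_list:token_seq"]]
    targets = [str(t) for t in df["item_id:token"]]
    return all(
        lists[i] + [targets[i]] == lists[i - 1]
        for i in range(1, len(df))
    )

print(is_prefix_chain(toy_inter))
```

On real data the chain is expected to break at the boundary between two original sessions, so the check only makes sense on rows known to come from one session.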

#### Step into `create_dataset`
Let's look at the input files first.

In [None]:
print(config.final_config_dict['data_path'])
os.listdir(config.final_config_dict['data_path'])

When `create_dataset` is called, it determines that the dataset class is `SequentialDataset` and instantiates it.  
The constructor calls the constructor of its `super` class (`Dataset`), which eventually loads the above files into a dataframe through the following chain:  
`Dataset._load_data()` --> `Dataset._load_inter_feat` --> `Dataset._load_feat`
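
The files loaded by this chain are tab-separated atomic files whose header cells have the form `name:type`. As an illustration only (this is not RecBole's loader), a hypothetical helper `parse_atomic_header` can split such a header into field names and types:

```python
def parse_atomic_header(header_line, sep="\t"):
    """Map each field name to its declared type, e.g. 'token' or 'token_seq'."""
    fields = {}
    for col in header_line.rstrip("\n").split(sep):
        name, ftype = col.split(":")
        fields[name] = ftype
    return fields

# Header as it appears in the session-based .inter files of this notebook.
header = "session_id:token\titem_id_list:token_seq\titem_id:token"
fields = parse_atomic_header(header)
print(fields)
```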

In [None]:
tr_inter_df=pd.read_csv(os.path.join(config.final_config_dict['data_path'],'diginetica-session.train.inter'),delimiter='\t')
tr_inter_df.head(15)


This is the structure of the file we need to generate.
The idea is to use the pre-processing of `session-rec`, which generates the files in the folder `/home/gkoren2/datasets/recsys/seq_recsys_datasets/diginetica/prepared`:


In [None]:
tr_inter_df.columns

In [None]:
prep_srec_path = '/home/gkoren2/datasets/recsys/seq_recsys_datasets/diginetica/prepared'
os.listdir(prep_srec_path)

We want to do the following conversion:
- `train-item-views_train_full.txt` --> `diginetica-session.train.inter`
- `train-item-views_test.txt` --> `diginetica-session.test.inter`
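
To show what the target layout implies, here is a minimal sketch of how one session could be expanded into (prefix, target) rows, longest prefix first. The helper name `expand_session` and the item values are made up for illustration.

```python
def expand_session(items):
    """Return (prefix, target) pairs for one session, longest prefix first."""
    return [(items[:i], items[i]) for i in reversed(range(1, len(items)))]

pairs = expand_session([10, 20, 30, 40])
for prefix, target in pairs:
    print(prefix, "->", target)
# [10, 20, 30] -> 40
# [10, 20] -> 30
# [10] -> 20
```

A session with N items yields N-1 rows, each of which becomes one line of the `.inter` file.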

### raw

In [None]:
digi_args = Arguments(dataset='diginetica')
digi_args

In [None]:
digi_config_dict= {
        'USER_ID_FIELD': 'session_id',
        'load_col': None,       # load all columns. dont filter anything
        'neg_sampling': None,
        # 'benchmark_filename': ['train', 'test'],
        # 'alias_of_item_id': ['item_id_list'],
        'eval_args':{
            'group_by': 'user',
            'order': 'TO',
            'split':{'LS': 'test_only'},
            'mode': 'uni100'},
        'topk': [20],
        'metrics': ['Recall', 'MRR'],
        'valid_metric': 'MRR@20'
    }
digi_config = Config(model=digi_args.model, dataset=f'{digi_args.dataset}', config_dict=digi_config_dict)
digi_config.final_config_dict['data_path'] = os.path.join(os.path.dirname(os.getcwd()),digi_config.final_config_dict['data_path'])
digi_config.final_config_dict

In [None]:
digi_recbole_raw = create_dataset(digi_config)
digi_recbole_raw

In [None]:
digi_recbole_raw_inter_df = digi_recbole_raw.inter_feat
digi_recbole_raw_inter_df.head()

In [None]:
digi_recbole_raw_inter_df['number of times'].max()

In [None]:
digi_recbole_raw_inter_df.session_id.value_counts()

In [None]:
len(digi_recbole_raw_inter_df)

## session-rec

In [None]:
dataset_path = '/home/gkoren2/datasets/recsys/seq_recsys_datasets/'
datapath_raw = os.path.join(dataset_path,'diginetica','raw')
os.listdir(datapath_raw)

In [None]:
digi_srec_raw = pd.read_csv(os.path.join(datapath_raw,'train-item-views.csv'),delimiter=';')
digi_srec_raw.head()

In [None]:
len(digi_srec_raw)

In [None]:
digi_srec_raw.sessionId.value_counts()

In [None]:
len(digi_srec_raw.itemId.unique())

### preprocessing with `run_preprocessing.py`
This uses two configurations, one with `type: single` and one with `type: window`:

/home/gkoren2/study/git/guyk1971/session-rec/conf/myconf/prep_digi_sb_s.yml
```yml
type: single # single|window
mode: session_based # session_based | session_aware
preprocessor: diginetica # dataset (folder) name
data:
  folder: /home/gkoren2/datasets/recsys/seq_recsys_datasets/diginetica/raw/
  prefix: train-item-views

filter:
  min_item_support: 5
  min_session_length: 2

params:
  days_test: 7

output:
  folder: /home/gkoren2/datasets/recsys/seq_recsys_datasets/diginetica/prepared/
```
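
As a rough illustration of what the `filter` settings mean, the sketch below drops rare items and then short sessions with pandas. It is an assumption about the intent of `min_item_support` and `min_session_length`, not the `session-rec` implementation, which may apply the filters in a different order or more than once. The helper name `filter_events` is hypothetical.

```python
import pandas as pd

def filter_events(df, min_item_support=5, min_session_length=2):
    # Drop items that occur fewer than min_item_support times.
    item_counts = df["ItemId"].map(df["ItemId"].value_counts())
    df = df[item_counts >= min_item_support]
    # Then drop sessions left with fewer than min_session_length events.
    sess_len = df["SessionId"].map(df["SessionId"].value_counts())
    return df[sess_len >= min_session_length]

# Toy data (made up); thresholds lowered so the effect is visible.
toy = pd.DataFrame({
    "SessionId": [1, 1, 2, 2, 3, 3],
    "ItemId":    [1, 2, 1, 1, 1, 3],
})
filtered = filter_events(toy, min_item_support=2, min_session_length=2)
print(filtered)
```

In the toy data only item 1 is frequent enough, and only session 2 still has two events after the item filter.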

/home/gkoren2/study/git/guyk1971/session-rec/conf/myconf/prep_digi_sb_w.yml
```yml
type: window # single|window
mode: session_based # session_based | session_aware
preprocessor: diginetica #
data:
  folder: /home/gkoren2/datasets/recsys/seq_recsys_datasets/diginetica/raw/
  prefix: train-item-views

filter:
  min_item_support: 5
  min_session_length: 2

params:
  days_test: 7
  days_train: 25 #only window
  num_slices: 5 #only window
  days_offset: 45 #only window
  days_shift: 18 #only window

output:
  folder: /home/gkoren2/datasets/recsys/seq_recsys_datasets/diginetica/slices/
```
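
The `days_test` parameter suggests a time-based split. A minimal sketch, assuming a `Time` column in epoch seconds and assuming that sessions ending within the last `days_test` days form the test set; the real `session-rec` code may differ in details such as boundary handling or removing test items that never occur in training. The helper name `split_last_days` is hypothetical.

```python
import pandas as pd

DAY = 86400  # seconds per day

def split_last_days(df, days_test=7):
    """Sessions whose last event falls in the final `days_test` days go to test."""
    session_end = df.groupby("SessionId")["Time"].max()
    cutoff = df["Time"].max() - days_test * DAY
    test_ids = session_end[session_end >= cutoff].index
    is_test = df["SessionId"].isin(test_ids)
    return df[~is_test], df[is_test]

# Toy data (made up): session 1 is old, session 2 ends at the end of the log.
events = pd.DataFrame({
    "SessionId": [1, 1, 2, 2],
    "ItemId": [10, 11, 10, 12],
    "Time": [0, 100, 10 * DAY, 10 * DAY + 50],
})
train, test = split_last_days(events, days_test=7)
print(len(train), len(test))
```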

So now we have `.txt` files that we can read and convert to atomic files.

In [None]:
!tree /home/gkoren2/datasets/recsys/seq_recsys_datasets/diginetica

How do I read them into a dataframe?
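
A minimal sketch of reading such a file, assuming it is tab-separated with a header row, as the `pd.read_csv(..., sep='\t')` calls later in this notebook do. The in-memory sample stands in for a real prepared file, and the `Time` column is an assumption about its layout.

```python
import io
import pandas as pd

# Stand-in for a prepared file; the real column set may differ.
sample = (
    "SessionId\tItemId\tTime\n"
    "1\t100\t1464300000.0\n"
    "1\t101\t1464300060.0\n"
    "2\t100\t1464390000.0\n"
)
df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    dtype={"SessionId": "int64", "ItemId": "int64"},
)
print(df.dtypes)
```

For a file on disk, the `io.StringIO(sample)` argument would be replaced by the file path.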

## Converting from session-rec prepared to RecBole .inter 
We can see the structure of the `.inter` file above; this is our target.   
These are our input files:


In [None]:
from tqdm.notebook import tqdm
prep_srec_path = '/home/gkoren2/datasets/recsys/seq_recsys_datasets/diginetica/prepared'
os.listdir(prep_srec_path)

We want to do the following conversion:
- `train-item-views_train_full.txt` --> `diginetica-session.train.inter`
- `train-item-views_test.txt` --> `diginetica-session.test.inter`

#### Draft Sandbox

In [None]:
src_file='train-item-views_train_full'

train_df = pd.read_csv(os.path.join(prep_srec_path,src_file+'.txt'), sep='\t')
train_df.head(10)

In [None]:
src_file='train-item-views_test'

tst_df = pd.read_csv(os.path.join(prep_srec_path,src_file+'.txt'), sep='\t')
tst_df.head(10)

In [None]:
# Note that the item IDs are not contiguous; we need to remap them to a contiguous range.
print(len(train_df.ItemId.value_counts()))
train_df.ItemId.max()

In [None]:
print(len(tst_df.ItemId.value_counts()))
tst_df.ItemId.max()

In [None]:
train_df.dtypes

In [None]:
src_col = ['SessionId', 'ItemId']
trdf=train_df[src_col].copy()   # copy so the ID remapping below does not write into a slice of train_df
tsdf=tst_df[src_col].copy()
trdf.head()

In [None]:
trdf.ItemId.unique()

In [None]:
# set(tsdf.ItemId.unique()) - set(trdf.ItemId.unique())
len(set(tsdf.ItemId.unique()).union(set(trdf.ItemId.unique())))
# set(trdf.ItemId.unique()).add(set(tsdf.ItemId.unique()))

In [None]:
trg2src_id={k+1:v for k, v in zip(range(len(trdf.ItemId.unique())),trdf.ItemId.unique())}
src2trg_id={v:k for k,v in trg2src_id.items()}
len(trg2src_id)

In [None]:
trdf.ItemId=trdf.ItemId.map(src2trg_id)
tsdf.ItemId=tsdf.ItemId.map(src2trg_id)     # assuming no tst items that are absent from training set
trdf.ItemId.value_counts()

In [None]:
trdf.head(30)

In [None]:
gbs=trdf.groupby('SessionId')

In [None]:
# Since we break each session into several sub-sessions, we need to know how many sub-sessions we'll have in total.
# Knowing this in advance would let us preallocate the dataframe and avoid copying when concatenating.
# A session with N elements is broken into N-1 sub-sessions,
# so the total number of sub-sessions is the sum of (N-1) over all sessions:
gbs.size().sum() - len(gbs)
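
As a worked example of this count on made-up data: three sessions of sizes 3, 2 and 4 give 2 + 1 + 3 = 6 sub-sessions, which equals the total number of events (9) minus the number of sessions (3).

```python
import pandas as pd

# Toy event log (made up): sessions of size 3, 2 and 4.
toy = pd.DataFrame({
    "SessionId": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "ItemId":    [5, 6, 7, 5, 8, 6, 6, 9, 5],
})
sizes = toy.groupby("SessionId").size()
n_sub = int((sizes - 1).sum())          # sum of (N-1) over sessions
n_sub_alt = int(sizes.sum() - len(sizes))  # total events minus number of sessions
print(n_sub, n_sub_alt)
```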

In [None]:
tgt_col=['session_id:token', 'item_id_list:token_seq', 'item_id:token']


In [None]:
def process_session2(sess_df,sid):
    iid = sess_df.ItemId.values
    # join the IDs explicitly: str() on a NumPy array pads numbers and wraps long arrays over several lines
    iids=[(' '.join(map(str,iid[:i])),str(iid[i])) for i in reversed(range(1,len(iid)))]
    iidfd=dict()
    iidfd.update({'session_id:token':[sid+i+1 for i in range(len(iids))]})
    iidfd.update({'item_id_list:token_seq':[i[0] for i in iids]})
    iidfd.update({'item_id:token':[i[1] for i in iids]})
    iidfs=pd.DataFrame(iidfd)
    
    # last sub-session id that was assigned (unchanged if the session has a single item)
    return iidfs,sid+len(iids)
    

In [None]:
from tqdm.notebook import tqdm
maxlen=0
for g in tqdm(list(gbs.groups)[:100]):
    leng=len(gbs.get_group(g))
    if leng>maxlen:
        maxlen=leng

print(maxlen)

In [None]:
len(gbs.groups.keys())

In [None]:
rbdf=[]
# rbdf=pd.DataFrame(columns=tgt_col)
sid=0
for grp in tqdm(gbs.groups):
    sub_rbdf,sid=process_session2(gbs.get_group(grp),sid)
    rbdf.append(sub_rbdf)
rbdf=pd.concat(rbdf,ignore_index=True)
rbdf.head(30)



In [None]:
# same result as above, but concatenating inside the loop copies the growing dataframe on every iteration
rbdf=pd.DataFrame(columns=tgt_col)
sid=0
for grp in tqdm(gbs.groups):
    sub_rbdf,sid=process_session2(gbs.get_group(grp),sid)
    rbdf=pd.concat([rbdf,sub_rbdf],ignore_index=True)
rbdf.head(30)


#### Final Conversion Code

In [None]:
from tqdm.notebook import tqdm

In [6]:
def process_session(sess_df,sid):
    iid = sess_df.ItemId.values
    # join the IDs explicitly: str() on a NumPy array pads numbers and wraps long arrays over several lines
    iids=[(' '.join(map(str,iid[:i])),str(iid[i])) for i in reversed(range(1,len(iid)))]
    iidfd=dict()
    iidfd.update({'session_id:token':[sid+i+1 for i in range(len(iids))]})
    iidfd.update({'item_id_list:token_seq':[i[0] for i in iids]})
    iidfd.update({'item_id:token':[i[1] for i in iids]})
    iidfs=pd.DataFrame(iidfd)
    
    # last sub-session id that was assigned (unchanged if the session has a single item)
    return iidfs,sid+len(iids)


In [7]:
def transform_df(srec_df,n_sess):
    gbs=srec_df.groupby('SessionId')
    recbole_col=['session_id:token', 'item_id_list:token_seq', 'item_id:token']
    rbdf=[]
    sid=0
    n_sess = n_sess or len(gbs.groups)
    for grp in tqdm(list(gbs.groups)[:n_sess]):
        sub_rbdf,sid=process_session(gbs.get_group(grp),sid)
        rbdf.append(sub_rbdf)
    rbdf=pd.concat(rbdf,ignore_index=True)
    return rbdf 


def convert_srec_to_recbole(srec_train_filename,srec_test_filename, inter_train_filename,inter_test_filename,n_sess=None):
    # note: reads from the global `prep_srec_path`, which is set in the cell of the current dataset
    srec_train_df = pd.read_csv(os.path.join(prep_srec_path,srec_train_filename+'.txt'), sep='\t')
    srec_test_df = pd.read_csv(os.path.join(prep_srec_path,srec_test_filename+'.txt'), sep='\t')
    srec_col = ['SessionId', 'ItemId']
    # copy so that remapping the IDs does not write into a slice of the source dataframes
    srec_trn_df=srec_train_df[srec_col].copy()
    srec_tst_df=srec_test_df[srec_col].copy()
    srec_ItemId=set(srec_trn_df.ItemId.unique())
    if set(srec_tst_df.ItemId.unique())-srec_ItemId:
        print("Warning: there are new items in the test set")
        srec_ItemId=srec_ItemId.union(srec_tst_df.ItemId.unique())
    # remap item_id to be sequential from 1 to N (sorted so that the mapping is reproducible)
    src2trg_id={v:k for k,v in enumerate(sorted(srec_ItemId),start=1)}
    srec_trn_df.ItemId=srec_trn_df.ItemId.map(src2trg_id)
    srec_tst_df.ItemId=srec_tst_df.ItemId.map(src2trg_id)
    print(f'generating {inter_train_filename}:')
    rbdf=transform_df(srec_trn_df,n_sess)
    rbdf.to_csv(inter_train_filename,sep='\t',index=False)

    print(f'generating {inter_test_filename}:')
    rbdf=transform_df(srec_tst_df,n_sess)
    rbdf.to_csv(inter_test_filename,sep='\t',index=False)

    print('Done')


In [None]:
srec_train_filename='train-item-views_train_full'
inter_train_filename='./diginetica-sess.train.inter'

srec_test_filename='train-item-views_test'
inter_test_filename='./diginetica-sess.test.inter'

convert_srec_to_recbole(srec_train_filename,srec_test_filename,inter_train_filename,inter_test_filename)

# Tmall

<font color='red'> ERROR: Can't preprocess the raw file in `session-rec`; there is a problem with the preprocessing script. </font>  


Now that we have successfully converted DIGINETICA from `session-rec` to diginetica-sess (`.inter` files) for RecBole, let's repeat the process for another dataset that already has a ready-made `-session.inter` file.

The stages are as follows:
1. Read the RecBole `.inter` file and check its characteristics, for later reference.
1. Read the session-rec file and compare its characteristics with the table at the head of the notebook. They should be the same.
1. Convert the session-rec file to a `-sess.inter` file.
1. Read the new `-sess.inter` file and compare it with the `-session.inter` file.
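
For the final comparison step, a minimal sketch of summarizing two `.inter` dataframes side by side. The helper name `inter_summary` and both toy frames are made up; the column names are the ones of the raw `.inter` header.

```python
import pandas as pd

def inter_summary(df):
    """Basic characteristics of a session-based .inter dataframe."""
    prefix_len = df["item_id_list:token_seq"].astype(str).str.split().str.len()
    return {
        "rows": len(df),
        "distinct_targets": df["item_id:token"].nunique(),
        "mean_prefix_len": float(prefix_len.mean()),
        "max_prefix_len": int(prefix_len.max()),
    }

# Toy stand-ins for the generated (-sess) and reference (-session) files.
ours = pd.DataFrame({
    "item_id_list:token_seq": ["1 2 3", "1 2", "1"],
    "item_id:token": [4, 3, 2],
})
reference = pd.DataFrame({
    "item_id_list:token_seq": ["7 8", "7"],
    "item_id:token": [9, 8],
})
comparison = pd.DataFrame({"sess": inter_summary(ours), "session": inter_summary(reference)})
print(comparison)
```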




In [None]:
# run the import section above 
pd.__version__

## Read RecBole file

In [None]:
args = Arguments(dataset='tmall-session')
args

In [None]:
config = Config(model=args.model, dataset=f'{args.dataset}', config_dict=config_dict_sess)
config.final_config_dict['data_path'] = os.path.join(os.path.dirname(os.getcwd()),config.final_config_dict['data_path'])
config.final_config_dict

In [None]:
print(config.final_config_dict['data_path'])
os.listdir(config.final_config_dict['data_path'])

In [None]:
tr_inter_df=pd.read_csv(os.path.join(config.final_config_dict['data_path'],'tmall-session.train.inter'),delimiter='\t')
tr_inter_df.head(30)


### using `create_dataset` to read the .inter

In [None]:
# called from the main script and performs a set of operations to load the .inter file to dataframe
tmall_session = create_dataset(config)
tmall_session

In [None]:
inter_df=tmall_session.inter_feat
inter_df.head(10)

In [None]:
inter_df.item_length.value_counts()

Note that the item IDs were remapped to be sequential in the order in which they appear in the `item_id` column, which is why the numbers in this column run upward.
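
This kind of first-appearance remapping can be reproduced with `pd.factorize`, which assigns codes in order of first occurrence. The sketch below uses made-up IDs and shifts the codes by one, on the assumption that index 0 is reserved (for example for padding); it illustrates the effect and is not RecBole's own remapping code.

```python
import pandas as pd

raw_ids = pd.Series([907, 15, 907, 388, 15])   # made-up original item IDs
codes, uniques = pd.factorize(raw_ids)         # codes follow order of first appearance
new_ids = codes + 1                            # start at 1, leaving 0 free
print(list(new_ids))    # [1, 2, 1, 3, 2]
print(list(uniques))    # [907, 15, 388]
```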

## preprocess raw file according to session-rec preprocess
As before, we need to create a preprocessing configuration file for the `session-rec` framework and run the preprocessing.

We can use the preprocessing script to prepare the files as we want, according to their protocols.

Within the session-rec framework, I've created the following configuration file for the preprocessing:



/home/gkoren2/study/git/guyk1971/session-rec/conf/myconf/prep_tmal_sb_s.yml
```yml
type: single # single|window
mode: session_based # session_based | session_aware
preprocessor: tmall # dataset (folder) name
data:
  folder: /home/gkoren2/datasets/recsys/seq_recsys_datasets/tmall/raw/
  prefix: dataset15

filter:
  min_item_support: 5
  min_session_length: 2

params:
  days_test: 7

output:
  folder: /home/gkoren2/datasets/recsys/seq_recsys_datasets/tmall/prepared/
```

<font color='red'> It doesn't run. Need to debug it in the session-rec framework and compare with the DIGINETICA preprocessing. </font>

In [None]:
prep_srec_path = '/home/gkoren2/datasets/recsys/seq_recsys_datasets/diginetica/prepared'
os.listdir(prep_srec_path)

# RetailRocket  
The stages are as follows:
1. Read the RecBole `.inter` file and check its characteristics, for later reference.
1. Read the session-rec file and compare its characteristics with the table at the head of the notebook. They should be the same.
1. Convert the session-rec file to a `-sess.inter` file.
1. Read the new `-sess.inter` file and compare it with the `-session.inter` file.


In [4]:
# run the import section above 
pd.__version__

'1.4.3'

## preprocess raw file according to session-rec preprocess
As before, we need to create a preprocessing configuration file for the `session-rec` framework and run the preprocessing.

We can use the preprocessing script to prepare the files as we want, according to their protocols.

Within the session-rec framework, I've created the following configuration file for the preprocessing:



/home/gkoren2/study/git/guyk1971/session-rec/conf/myconf/prep_retailrocket_sb_s.yml
```yml
type: single # single|window
mode: session_based # session_based | session_aware
preprocessor: retailrocket # dataset (folder) name
data:
  folder: /home/gkoren2/datasets/recsys/seq_recsys_datasets/retailrocket/
  prefix: events

filter:
  min_item_support: 5
  min_session_length: 2

params:
  days_test: 7

output:
  folder: /home/gkoren2/datasets/recsys/seq_recsys_datasets/retailrocket/prepared/
```

In [5]:
prep_srec_path = '/home/gkoren2/datasets/recsys/seq_recsys_datasets/retailrocket/prepared'
os.listdir(prep_srec_path)

['events_test.txt', 'events_orig.hdf', 'events_train_full.txt']

We want to do the following conversion:
- `events_train_full.txt` --> `retailrocket-sess.train.inter`
- `events_test.txt` --> `retailrocket-sess.test.inter`

In [None]:
# make sure to run the conversion code cells

In [10]:
srec_train_filename='events_train_full'
inter_train_filename='./retailrocket-sess.train.inter'

srec_test_filename='events_test'
inter_test_filename='./retailrocket-sess.test.inter'

convert_srec_to_recbole(srec_train_filename,srec_test_filename,inter_train_filename,inter_test_filename)

generating ./retailrocket-sess.train.inter:


  0%|          | 0/294437 [00:00<?, ?it/s]

generating ./retailrocket-sess.test.inter:


  0%|          | 0/12197 [00:00<?, ?it/s]

Done


## Reading the resulting .inter file

In [15]:
args = Arguments(dataset='retailrocket-sess')
args

Arguments(model='GRU4Rec', dataset='retailrocket-sess', validation='False', valid_portion=0.1)

In [16]:
config = Config(model=args.model, dataset=f'{args.dataset}', config_dict=config_dict_sess)
config.final_config_dict['data_path'] = os.path.join(os.path.dirname(os.getcwd()),config.final_config_dict['data_path'])
config.final_config_dict

{'gpu_id': 0,
 'use_gpu': True,
 'seed': 2020,
 'state': 'INFO',
 'reproducibility': True,
 'data_path': '/home/gkoren2/study/git/guyk1971/RecBole/dataset/retailrocket-sess',
 'checkpoint_dir': 'saved',
 'show_progress': True,
 'save_dataset': False,
 'dataset_save_path': None,
 'save_dataloaders': False,
 'dataloaders_save_path': None,
 'log_wandb': False,
 'wandb_project': 'recbole',
 'epochs': 300,
 'train_batch_size': 2048,
 'learner': 'adam',
 'learning_rate': 0.001,
 'neg_sampling': None,
 'eval_step': 1,
 'stopping_step': 10,
 'clip_grad_norm': None,
 'weight_decay': 0.0,
 'loss_decimal_place': 4,
 'require_pow': False,
 'eval_args': {'split': {'LS': 'valid_and_test'},
  'order': 'TO',
  'mode': 'full',
  'group_by': 'user'},
 'repeatable': True,
 'metrics': ['Recall', 'MRR'],
 'topk': [20],
 'valid_metric': 'MRR@20',
 'valid_metric_bigger': True,
 'eval_batch_size': 4096,
 'metric_decimal_place': 4,
 'embedding_size': 64,
 'hidden_size': 128,
 'num_layers': 1,
 'dropout_prob': 

In [17]:
# called from the main script and performs a set of operations to load the .inter file to dataframe
retailrocket_sess = create_dataset(config)
retailrocket_sess

retailrocket-sess
The number of users: 749606
Average actions of users: 1.037926641364452
The number of items: 56073
Average actions of items: 15.913014132902461
The number of inters: 778035
The sparsity of the dataset: 99.99814897498487%
Remain Fields: ['session_id', 'item_id_list', 'item_id', 'item_length']

The statistics differ from those described above. That is expected, since we broke sessions into shorter ones, but I'd expect the number of items to be similar.
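
One way to investigate the item count is to count distinct items directly in the raw `.inter` file, both over all columns and over the target column alone. This is a sketch on made-up rows with a hypothetical helper `distinct_items`; it only shows that the two counts can differ, since the first item of a session never appears as a target of that session.

```python
import pandas as pd

def distinct_items(df):
    """Return (items seen anywhere, items seen as targets) as sets of strings."""
    in_lists = set()
    for seq in df["item_id_list:token_seq"].astype(str):
        in_lists.update(seq.split())
    targets = set(df["item_id:token"].astype(str))
    return in_lists | targets, targets

# Toy rows (made up).
toy = pd.DataFrame({
    "item_id_list:token_seq": ["1 2 3", "1 2", "5 6"],
    "item_id:token": [4, 3, 7],
})
all_items, target_items = distinct_items(toy)
print(len(all_items), len(target_items))
```

Comparing both counts with the number of distinct `ItemId` values in the `session-rec` files should show which definition the reported figure follows.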

In [19]:
retailrocket_sess_inter_df=retailrocket_sess.inter_feat
retailrocket_sess_inter_df.head()

|   | session_id | item_id_list | item_id | item_length |
|---|---|---|---|---|
| 0 | 1 | [1, 1, 4, 2, 3, 3, 2] | 1 | 7 |
| 1 | 2 | [1, 1, 4, 2, 3, 3] | 2 | 6 |
| 2 | 3 | [1, 1, 4, 2, 3] | 3 | 5 |
| 3 | 4 | [1, 1, 4, 2] | 3 | 4 |
| 4 | 5 | [1, 1, 4] | 2 | 3 |


# RSC15


## session-rec
The preprocessing config file is as follows:

/home/gkoren2/study/git/guyk1971/session-rec/conf/myconf/prep_rsc15_sb_s.yml
```yml
type: single # single|window
mode: session_based # session_based | session_aware
preprocessor: rsc15 # dataset (folder) name
data:
  folder: /home/gkoren2/datasets/recsys/seq_recsys_datasets/rsc15/raw/
  prefix: rsc15-clicks

filter:
  min_item_support: 5
  min_session_length: 2

params:
  days_test: 7

output:
  folder: /home/gkoren2/datasets/recsys/seq_recsys_datasets/rsc15/single/
```

In [13]:
prep_srec_path = '/home/gkoren2/datasets/recsys/seq_recsys_datasets/rsc15/single'
os.listdir(prep_srec_path)

['rsc15-clicks_test.txt',
 'rsc15-clicks_train_valid.txt',
 'rsc15-clicks_train_tr.txt',
 'rsc15-clicks_train_full.txt']

We want to do the following conversion:
- `rsc15-clicks_train_full.txt` --> `rsc15-sess.train.inter`
- `rsc15-clicks_test.txt` --> `rsc15-sess.test.inter`

In [14]:
srec_train_filename='rsc15-clicks_train_full'
inter_train_filename='./rsc15-sess.train.inter'

srec_test_filename='rsc15-clicks_test'
inter_test_filename='./rsc15-sess.test.inter'

convert_srec_to_recbole(srec_train_filename,srec_test_filename,inter_train_filename,inter_test_filename)

generating ./rsc15-sess.train.inter:


  0%|          | 0/7802144 [00:00<?, ?it/s]

generating ./rsc15-sess.test.inter:


  0%|          | 0/172361 [00:00<?, ?it/s]

Done


## RecBole

In [20]:
args = Arguments(dataset='rsc15-sess')
args

Arguments(model='GRU4Rec', dataset='rsc15-sess', validation='False', valid_portion=0.1)

In [21]:
config = Config(model=args.model, dataset=f'{args.dataset}', config_dict=config_dict_sess)
config.final_config_dict['data_path'] = os.path.join(os.path.dirname(os.getcwd()),config.final_config_dict['data_path'])
config.final_config_dict

{'gpu_id': 0,
 'use_gpu': True,
 'seed': 2020,
 'state': 'INFO',
 'reproducibility': True,
 'data_path': '/home/gkoren2/study/git/guyk1971/RecBole/dataset/rsc15-sess',
 'checkpoint_dir': 'saved',
 'show_progress': True,
 'save_dataset': False,
 'dataset_save_path': None,
 'save_dataloaders': False,
 'dataloaders_save_path': None,
 'log_wandb': False,
 'wandb_project': 'recbole',
 'epochs': 300,
 'train_batch_size': 2048,
 'learner': 'adam',
 'learning_rate': 0.001,
 'neg_sampling': None,
 'eval_step': 1,
 'stopping_step': 10,
 'clip_grad_norm': None,
 'weight_decay': 0.0,
 'loss_decimal_place': 4,
 'require_pow': False,
 'eval_args': {'split': {'LS': 'valid_and_test'},
  'order': 'TO',
  'mode': 'full',
  'group_by': 'user'},
 'repeatable': True,
 'metrics': ['Recall', 'MRR'],
 'topk': [20],
 'valid_metric': 'MRR@20',
 'valid_metric_bigger': True,
 'eval_batch_size': 4096,
 'metric_decimal_place': 4,
 'embedding_size': 64,
 'hidden_size': 128,
 'num_layers': 1,
 'dropout_prob': 0.3,
 '

In [22]:
# called from the main script and performs a set of operations to load the .inter file to dataframe
rsc15_sess = create_dataset(config)
rsc15_sess

rsc15-sess
The number of users: 23156029
Average actions of users: 1.0224371381827662
The number of items: 58302
Average actions of items: 634.3769727499263
The number of inters: 23675583
The sparsity of the dataset: 99.9982463087132%
Remain Fields: ['session_id', 'item_id_list', 'item_id', 'item_length']

In [23]:
rsc15_sess_inter_df=rsc15_sess.inter_feat
rsc15_sess_inter_df.head()

|   | session_id | item_id_list | item_id | item_length |
|---|---|---|---|---|
| 0 | 1 | [736, 3, 2] | 1 | 3 |
| 1 | 2 | [736, 3] | 2 | 2 |
| 2 | 3 | [736] | 3 | 1 |
| 3 | 4 | [8, 8, 7, 6, 5] | 4 | 5 |
| 4 | 5 | [8, 8, 7, 6] | 5 | 4 |
