# Dataset preparation
Background: I'd like to run the RecBole algorithms on a session-based task with 4 datasets and compare the results to those published in the following papers:
- Evaluation of Session-based Recommendation Algorithms, Ludewig et al 2018
- Empirical Analysis of Session-Based Recommendation Algorithms, Ludewig et al 2020
- A survey on session-based recommender systems, Wang et al 2021

The first two papers also published their code as the session-rec framework, which also provides the datasets for download.
All three papers use the same datasets:
![img](SBRS_datasets.png)


The goal is to understand what is needed to create SBRS versions of the following datasets:
- RetailRocket
- RSC15
- DIGINETICA: we already have `diginetica-session`; its characteristics need to be compared with the published ones
- NowPlaying


RecBole provides DIGINETICA and NowPlaying in a session-based version. Their characteristics need to be compared with the `session-rec` versions.
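To make the comparison concrete, the characteristics usually reported for these datasets (number of events, sessions, items, and average session length) can be computed from any event log with one row per interaction. The following is a minimal sketch; the helper name `session_stats` and the toy data are illustrative assumptions and are not part of RecBole or session-rec.

```python
import pandas as pd

def session_stats(df, session_col="session_id", item_col="item_id"):
    """Summarize an event log that has one row per (session, item) interaction."""
    lengths = df.groupby(session_col).size()
    return {
        "events": len(df),
        "sessions": int(lengths.shape[0]),
        "items": int(df[item_col].nunique()),
        "avg_session_length": float(lengths.mean()),
    }

# Toy log: session 1 has 3 events, session 2 has 2 events.
toy = pd.DataFrame({"session_id": [1, 1, 1, 2, 2], "item_id": [10, 11, 10, 12, 11]})
stats = session_stats(toy)
print(stats)
```

The same helper can be pointed at the session-rec files by passing `session_col="SessionId"` and `item_col="ItemId"`.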

## imports


In [6]:
import os 
import numpy as np
import pandas as pd
from recbole.config import Config
from recbole.data import create_dataset
from recbole.data.utils import get_dataloader
from recbole.utils import init_logger, init_seed, get_model, get_trainer, set_color

In [2]:
from dataclasses import dataclass

@dataclass
class Arguments:
    model:str = 'GRU4Rec'
    dataset:str = 'diginetica-session'
    validation: bool = False        # Whether to evaluate on a validation set (split from the train set); otherwise evaluate on the test set.
    valid_portion: float = 0.1


In [3]:
config_dict_sess = {
    'USER_ID_FIELD': 'session_id',
    'load_col': None,
    'neg_sampling': None,
    'benchmark_filename': ['train', 'test'],
    'alias_of_item_id': ['item_id_list'],
    'topk': [20],
    'metrics': ['Recall', 'MRR'],
    'valid_metric': 'MRR@20',
    'gpu_id': 3
}
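The config above asks for `Recall` and `MRR` at `topk` 20. For next-item prediction with a single target per session, both can be computed directly from ranked lists. The sketch below is only an illustration of the definitions, not RecBole's evaluator; the function name and the toy rankings are assumptions.

```python
def recall_mrr_at_k(ranked_lists, targets, k=20):
    """ranked_lists[i] is a ranking (best first) for session i; targets[i] is the true next item."""
    hits, rr_sum = 0, 0.0
    for ranking, target in zip(ranked_lists, targets):
        top_k = list(ranking[:k])
        if target in top_k:
            hits += 1                                   # the target appears in the top k
            rr_sum += 1.0 / (top_k.index(target) + 1)   # reciprocal of its 1-based rank
    n = len(targets)
    return hits / n, rr_sum / n

# Three toy sessions evaluated at k=2: only the first one is a hit (target at rank 2).
recall, mrr = recall_mrr_at_k([[5, 3, 9], [1, 2, 4], [7, 8, 6]], [3, 4, 0], k=2)
print(recall, mrr)
```

With one target per session, Recall@K is the same quantity as the hit rate at K.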

# DIGINETICA
Let's start by comparing `diginetica-session` with what we have in `session-rec` and with the table above.

## RecBole

### session based
The script that reads the raw input file downloaded from DIGINETICA and converts it to an `.inter` file is not included in the repository.



In [7]:
args = Arguments(dataset='diginetica-session')
args

Arguments(model='GRU4Rec', dataset='diginetica-session', validation='False', valid_portion=0.1)

In [5]:
config = Config(model=args.model, dataset=f'{args.dataset}', config_dict=config_dict_sess)
config.final_config_dict['data_path'] = os.path.join(os.path.dirname(os.getcwd()),config.final_config_dict['data_path'])
config.final_config_dict

{'gpu_id': 3,
 'use_gpu': True,
 'seed': 2020,
 'state': 'INFO',
 'reproducibility': True,
 'data_path': '/home/gkoren2/study/git/guyk1971/RecBole/dataset/diginetica-session',
 'checkpoint_dir': 'saved',
 'show_progress': True,
 'save_dataset': False,
 'dataset_save_path': None,
 'save_dataloaders': False,
 'dataloaders_save_path': None,
 'log_wandb': False,
 'wandb_project': 'recbole',
 'epochs': 300,
 'train_batch_size': 2048,
 'learner': 'adam',
 'learning_rate': 0.001,
 'neg_sampling': None,
 'eval_step': 1,
 'stopping_step': 10,
 'clip_grad_norm': None,
 'weight_decay': 0.0,
 'loss_decimal_place': 4,
 'require_pow': False,
 'eval_args': {'split': {'LS': 'valid_and_test'},
  'order': 'TO',
  'mode': 'full',
  'group_by': 'user'},
 'repeatable': True,
 'metrics': ['Recall', 'MRR'],
 'topk': [20],
 'valid_metric': 'MRR@20',
 'valid_metric_bigger': True,
 'eval_batch_size': 4096,
 'metric_decimal_place': 4,
 'embedding_size': 64,
 'hidden_size': 128,
 'num_layers': 1,
 'dropout_prob':

In [6]:
# called from the main script; runs a series of operations that load the .inter files into a dataframe
digi_recbole_sess = create_dataset(config)
digi_recbole_sess

diginetica-session
The number of users: 719471
Average actions of users: 1.0845872656260858
The number of items: 43098
Average actions of items: 18.110520574651286
The number of inters: 780328
The sparsity of the dataset: 99.9974834429483%
Remain Fields: ['session_id', 'item_id_list', 'item_id', 'item_length']
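Two of the figures in this summary can be reproduced by hand, which helps when comparing against the published table. The sketch below assumes that the user and item counts each include one reserved padding id; that assumption is inferred from the numbers themselves rather than taken from the RecBole source.

```python
n_users, n_items, n_inters = 719471, 43098, 780328  # copied from the summary above

# Assumption: the user count includes one padding id, so there are n_users - 1 real sessions.
avg_actions_per_user = n_inters / (n_users - 1)

# Sparsity as the share of empty cells in the users x items matrix, in percent.
sparsity_pct = 100 * (1 - n_inters / (n_users * n_items))

print(avg_actions_per_user, sparsity_pct)
```

Both values agree with the printed summary, so "users" here are really the augmented sub-sessions, each contributing roughly one interaction.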

In [7]:
digi_recbole_sess_inter_df=digi_recbole_sess.inter_feat
digi_recbole_sess_inter_df.head()

Unnamed: 0,session_id,item_id_list,item_id,item_length
0,1,[24864],1,1
1,2,"[137, 3]",2,2
2,3,[137],3,1
3,4,[299],4,1
4,5,[1010],5,1


In [10]:
digi_recbole_sess_inter_df.session_id.value_counts()

1         2
40578     2
40566     2
40567     2
40568     2
         ..
280396    1
280397    1
280398    1
280399    1
719470    1
Name: session_id, Length: 719470, dtype: int64

In [26]:
digi_recbole_sess_inter_df.loc[digi_recbole_sess_inter_df.session_id==1,:]

Unnamed: 0,session_id,item_id_list,item_id,item_length
0,1,[24864],1,1
719470,1,"[20438, 18951, 7866, 20453, 5659]",20453,5


In [17]:
len(digi_recbole_sess_inter_df)

780328

It's not clear what `item_id` is in this file. Is it the last item of the session (with all the preceding items in `item_id_list`)?

The script that generates the `.inter` file of the sessionized version is not provided.

#### Step into `create_dataset`
Let's look at the input files first.

In [6]:
print(config.final_config_dict['data_path'])
os.listdir(config.final_config_dict['data_path'])

/home/gkoren2/study/git/guyk1971/RecBole/dataset/diginetica-session


['diginetica-session.test.inter', 'diginetica-session.train.inter']

When `create_dataset` is called, it determines that the dataset class is `SequentialDataset` and instantiates it.  
The constructor calls the constructor of its `super` class (`Dataset`), which eventually loads the above files into a dataframe through the following chain:  
`Dataset._load_data()` --> `Dataset._load_inter_feat` --> `Dataset._load_feat`
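As a rough illustration of what such a loader has to do, the sketch below reads a tab-separated atomic file whose headers follow the `name:type` convention and turns `*_seq` columns into lists. It is not RecBole's `_load_feat`; the helper `load_atomic` and the sample rows are assumptions made for this example.

```python
import io
import pandas as pd

# Tab-separated sample in the atomic-file layout (values are illustrative).
sample = (
    "session_id:token\titem_id_list:token_seq\titem_id:token\n"
    "1\t1\t2\n"
    "2\t3 4\t5\n"
)

def load_atomic(buffer):
    """Read an atomic file and split its 'name:type' headers (hypothetical helper)."""
    df = pd.read_csv(buffer, sep="\t", dtype=str)
    field_types = dict(col.split(":") for col in df.columns)
    df.columns = list(field_types)
    for name, ftype in field_types.items():
        if ftype.endswith("_seq"):
            df[name] = df[name].str.split(" ")  # space-separated sequence -> list of tokens
    return df, field_types

df, field_types = load_atomic(io.StringIO(sample))
print(field_types)
print(df)
```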

In [37]:
tr_inter_df=pd.read_csv(os.path.join(config.final_config_dict['data_path'],'diginetica-session.train.inter'),delimiter='\t')
tr_inter_df.head(15)


Unnamed: 0,session_id:token,item_id_list:token_seq,item_id:token
0,1,1,2
1,2,3 4,5
2,3,3,4
3,4,6,7
4,5,8,9
5,6,10 11 12 12 13 14,15
6,7,10 11 12 12 13,14
7,8,10 11 12 12,13
8,9,10 11 12,12
9,10,10 11,12
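The rows above suggest that each session is expanded into prefixes: every row holds a prefix in `item_id_list` and the item that follows it in `item_id`, and the next row holds the prefix shortened by one item. The sketch below checks that reading on rows 5 to 9, retyped by hand; the helper name `is_prefix_chain` is an assumption.

```python
import pandas as pd

# Rows 5-9 of the train file shown above, retyped for illustration.
rows = pd.DataFrame({
    "item_id_list:token_seq": ["10 11 12 12 13 14", "10 11 12 12 13", "10 11 12 12", "10 11 12", "10 11"],
    "item_id:token": [15, 14, 13, 12, 12],
})

def is_prefix_chain(df):
    """True if each row's prefix plus its target equals the previous row's prefix."""
    seqs = [s.split() for s in df["item_id_list:token_seq"]]
    targets = [str(t) for t in df["item_id:token"]]
    return all(seqs[k + 1] + [targets[k + 1]] == seqs[k] for k in range(len(seqs) - 1))

chain_ok = is_prefix_chain(rows)
print(chain_ok)
```

If this holds across the file, `item_id` is the next item to predict and `item_id_list` is the history that precedes it.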


This is the structure of the file we need to generate.
The idea is to use the pre-processing of `session-rec`, which writes its output files to the folder `/home/gkoren2/datasets/recsys/seq_recsys_datasets/diginetica/prepared`:


In [30]:
tr_inter_df.columns

Index(['session_id:token', 'item_id_list:token_seq', 'item_id:token'], dtype='object')

In [8]:
prep_srec_path = '/home/gkoren2/datasets/recsys/seq_recsys_datasets/diginetica/prepared'
os.listdir(prep_srec_path)

['train-item-views_train_tr.txt',
 'train-item-views_train_full.txt',
 'train-item-views_train_valid.txt',
 'train-item-views_test.txt']

We want to perform the following conversion:
- `train-item-views_train_full.txt` --> `diginetica-session.train.inter`
- `train-item-views_test.txt` --> `diginetica-session.test.inter`

### raw

In [12]:
digi_args = Arguments(dataset='diginetica')
digi_args

Arguments(model='GRU4Rec', dataset='diginetica', validation='False', valid_portion=0.1)

In [13]:
digi_config_dict= {
        'USER_ID_FIELD': 'session_id',
        'load_col': None,       # load all columns. dont filter anything
        'neg_sampling': None,
        # 'benchmark_filename': ['train', 'test'],
        # 'alias_of_item_id': ['item_id_list'],
        'eval_args':{
            'group_by': 'user',
            'order': 'TO',
            'split':{'LS': 'test_only'},
            'mode': 'uni100'},
        'topk': [20],
        'metrics': ['Recall', 'MRR'],
        'valid_metric': 'MRR@20'
    }
digi_config = Config(model=digi_args.model, dataset=f'{digi_args.dataset}', config_dict=digi_config_dict)
digi_config.final_config_dict['data_path'] = os.path.join(os.path.dirname(os.getcwd()),digi_config.final_config_dict['data_path'])
digi_config.final_config_dict

{'gpu_id': 0,
 'use_gpu': True,
 'seed': 2020,
 'state': 'INFO',
 'reproducibility': True,
 'data_path': '/home/gkoren2/study/git/guyk1971/RecBole/dataset/diginetica',
 'checkpoint_dir': 'saved',
 'show_progress': True,
 'save_dataset': False,
 'dataset_save_path': None,
 'save_dataloaders': False,
 'dataloaders_save_path': None,
 'log_wandb': False,
 'wandb_project': 'recbole',
 'epochs': 300,
 'train_batch_size': 2048,
 'learner': 'adam',
 'learning_rate': 0.001,
 'neg_sampling': None,
 'eval_step': 1,
 'stopping_step': 10,
 'clip_grad_norm': None,
 'weight_decay': 0.0,
 'loss_decimal_place': 4,
 'require_pow': False,
 'eval_args': {'group_by': 'user',
  'order': 'TO',
  'split': {'LS': 'test_only'},
  'mode': 'uni100'},
 'repeatable': True,
 'metrics': ['Recall', 'MRR'],
 'topk': [20],
 'valid_metric': 'MRR@20',
 'valid_metric_bigger': True,
 'eval_batch_size': 4096,
 'metric_decimal_place': 4,
 'embedding_size': 64,
 'hidden_size': 128,
 'num_layers': 1,
 'dropout_prob': 0.3,
 'los

In [21]:
digi_recbole_raw = create_dataset(digi_config)
digi_recbole_raw

diginetica
The number of users: 204790
Average actions of users: 4.078212208663551
The number of items: 184048
Average actions of items: 19.3613918768546
The number of inters: 835173
The sparsity of the dataset: 99.99778416918709%
Remain Fields: ['session_id', 'item_id', 'timestamp', 'number of times', 'item_priceLog2', 'item_name', 'item_category']

In [22]:
digi_recbole_raw_inter_df = digi_recbole_raw.inter_feat
digi_recbole_raw_inter_df.head()

Unnamed: 0,session_id,item_id,timestamp,number of times
0,1,1,1463053000.0,1.0
1,1,2,1463754000.0,1.0
2,1,3,1462967000.0,1.0
3,1,4,1463836000.0,1.0
4,1,5,1462897000.0,1.0


In [24]:
digi_recbole_raw_inter_df['number of times'].max()

22.0

In [25]:
digi_recbole_raw_inter_df.session_id.value_counts()

57593     69
77365     56
47079     52
83362     50
56535     48
          ..
185080     1
97240      1
113837     1
143933     1
184059     1
Name: session_id, Length: 204789, dtype: int64

In [18]:
len(digi_recbole_sess_inter_df)

780328

## session-rec

In [27]:
dataset_path = '/home/gkoren2/datasets/recsys/seq_recsys_datasets/'
datapath_raw = os.path.join(dataset_path,'diginetica','raw')
os.listdir(datapath_raw)

['train-item-views.csv']

In [30]:
digi_srec_raw = pd.read_csv(os.path.join(datapath_raw,'train-item-views.csv'),delimiter=';')
digi_srec_raw.head()

Unnamed: 0,sessionId,userId,itemId,timeframe,eventdate
0,1,,81766,526309,2016-05-09
1,1,,31331,1031018,2016-05-09
2,1,,32118,243569,2016-05-09
3,1,,9654,75848,2016-05-09
4,1,,32627,1112408,2016-05-09


In [31]:
len(digi_srec_raw)

1235380

In [32]:
digi_srec_raw.sessionId.value_counts()

480263    87
129618    81
583862    78
123155    73
120823    72
          ..
349305     1
67155      1
349294     1
67164      1
600687     1
Name: sessionId, Length: 310324, dtype: int64

In [33]:
len(digi_srec_raw.itemId.unique())

122993

### preprocessing with `run_preprocessing.py`
Using two configurations:

`/home/gkoren2/study/git/guyk1971/session-rec/conf/myconf/prep_digi_sb_s.yml`:
```yml
type: single # single|window
mode: session_based # session_based | session_aware
preprocessor: diginetica # dataset (folder) name
data:
  folder: /home/gkoren2/datasets/recsys/seq_recsys_datasets/diginetica/raw/
  prefix: train-item-views

filter:
  min_item_support: 5
  min_session_length: 2

params:
  days_test: 7

output:
  folder: /home/gkoren2/datasets/recsys/seq_recsys_datasets/diginetica/prepared/
```

`/home/gkoren2/study/git/guyk1971/session-rec/conf/myconf/prep_digi_sb_w.yml`:
```yml
type: window # single|window
mode: session_based # session_based | session_aware
preprocessor: diginetica #
data:
  folder: /home/gkoren2/datasets/recsys/seq_recsys_datasets/diginetica/raw/
  prefix: train-item-views

filter:
  min_item_support: 5
  min_session_length: 2

params:
  days_test: 7
  days_train: 25 #only window
  num_slices: 5 #only window
  days_offset: 45 #only window
  days_shift: 18 #only window

output:
  folder: /home/gkoren2/datasets/recsys/seq_recsys_datasets/diginetica/slices/
```
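To get a feel for what the `filter` and `days_test` settings mean, here is a minimal sketch of support filtering followed by a last-days split. It is not session-rec's code: the order of the filtering passes, the treatment of the `Time` column as seconds, and both helper names are assumptions for illustration.

```python
import pandas as pd

def filter_events(df, min_item_support=5, min_session_length=2):
    """Drop rare items first, then sessions that became too short (single pass)."""
    item_support = df.groupby("ItemId").size()
    df = df[df.ItemId.isin(item_support[item_support >= min_item_support].index)]
    session_len = df.groupby("SessionId").size()
    return df[df.SessionId.isin(session_len[session_len >= min_session_length].index)]

def split_last_days(df, days_test=7):
    """Sessions that end within the last `days_test` days go to test, the rest to train."""
    session_end = df.groupby("SessionId").Time.max()
    cutoff = df.Time.max() - 86400 * days_test   # assumes Time is in seconds
    train_ids = session_end[session_end < cutoff].index
    test_ids = session_end[session_end >= cutoff].index
    return df[df.SessionId.isin(train_ids)], df[df.SessionId.isin(test_ids)]

day = 86400
toy = pd.DataFrame({
    "SessionId": [1, 1, 2, 2, 3],
    "ItemId": [10, 11, 10, 11, 12],
    "Time": [0, 60, 10 * day, 10 * day + 60, 10 * day + 120],
})
filtered = filter_events(toy, min_item_support=2, min_session_length=2)  # item 12 and session 3 drop out
train, test = split_last_days(filtered, days_test=7)
print(train.SessionId.unique(), test.SessionId.unique())
```

The thresholds are lowered for the toy data; the configurations above use `min_item_support: 5`, `min_session_length: 2` and `days_test: 7`.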

So now we have txt files that we can read and convert to atomic files.

In [36]:
!tree /home/gkoren2/datasets/recsys/seq_recsys_datasets/diginetica

/home/gkoren2/datasets/recsys/seq_recsys_datasets/diginetica
├── prepared
│   ├── train-item-views_test.txt
│   ├── train-item-views_train_full.txt
│   ├── train-item-views_train_tr.txt
│   └── train-item-views_train_valid.txt
├── raw
│   └── train-item-views.csv
└── slices
    ├── train-item-views_test.0.txt
    ├── train-item-views_test.1.txt
    ├── train-item-views_test.2.txt
    ├── train-item-views_test.3.txt
    ├── train-item-views_test.4.txt
    ├── train-item-views_train_full.0.txt
    ├── train-item-views_train_full.1.txt
    ├── train-item-views_train_full.2.txt
    ├── train-item-views_train_full.3.txt
    └── train-item-views_train_full.4.txt

3 directories, 15 files


How do we read these files into a dataframe? They are tab-separated text, so `pd.read_csv(..., sep='\t')` is enough.

## Converting from session-rec prepared to RecBole .inter 
The structure of the `.inter` file is shown above. This is our target.   
These are our 'input' files:


In [8]:
from tqdm.notebook import tqdm
prep_srec_path = '/home/gkoren2/datasets/recsys/seq_recsys_datasets/diginetica/prepared'
os.listdir(prep_srec_path)

['train-item-views_train_tr.txt',
 'train-item-views_train_full.txt',
 'train-item-views_train_valid.txt',
 'train-item-views_test.txt']

As noted earlier, the conversion is:
- `train-item-views_train_full.txt` --> `diginetica-session.train.inter`
- `train-item-views_test.txt` --> `diginetica-session.test.inter`

#### Draft Sandbox

In [11]:
src_file='train-item-views_train_full'

train_df = pd.read_csv(os.path.join(prep_srec_path,src_file+'.txt'), sep='\t')
train_df.head(10)

Unnamed: 0,SessionId,Time,ItemId,Date,Datestamp,TimeO,ItemSupport
0,1,1462752000.0,9654,2016-05-09,1462752000.0,2016-05-09 00:01:15.848000+00:00,399
1,1,1462752000.0,33043,2016-05-09,1462752000.0,2016-05-09 00:02:53.912000+00:00,195
2,1,1462752000.0,32118,2016-05-09,1462752000.0,2016-05-09 00:04:03.569000+00:00,67
3,1,1462752000.0,12352,2016-05-09,1462752000.0,2016-05-09 00:05:29.870000+00:00,327
4,1,1462752000.0,35077,2016-05-09,1462752000.0,2016-05-09 00:06:30.072000+00:00,102
5,1,1462752000.0,36118,2016-05-09,1462752000.0,2016-05-09 00:08:07.369000+00:00,130
6,1,1462753000.0,81766,2016-05-09,1462752000.0,2016-05-09 00:08:46.309000+00:00,61
7,1,1462753000.0,31331,2016-05-09,1462752000.0,2016-05-09 00:17:11.018000+00:00,55
8,1,1462753000.0,32627,2016-05-09,1462752000.0,2016-05-09 00:18:32.408000+00:00,29
9,2,1462752000.0,100747,2016-05-09,1462752000.0,2016-05-09 00:00:38.317000+00:00,147


In [9]:
src_file='train-item-views_test'

tst_df = pd.read_csv(os.path.join(prep_srec_path,src_file+'.txt'), sep='\t')
tst_df.head(10)

Unnamed: 0,SessionId,Time,ItemId,Date,Datestamp,TimeO,ItemSupport
0,289,1464221000.0,125013,2016-05-26,1464221000.0,2016-05-26 00:00:18.301000+00:00,10
1,289,1464222000.0,64068,2016-05-26,1464221000.0,2016-05-26 00:14:07.735000+00:00,30
2,289,1464222000.0,133346,2016-05-26,1464221000.0,2016-05-26 00:14:38.934000+00:00,36
3,289,1464222000.0,438457,2016-05-26,1464221000.0,2016-05-26 00:18:34.305000+00:00,6
4,289,1464222000.0,198930,2016-05-26,1464221000.0,2016-05-26 00:18:48.607000+00:00,10
5,289,1464222000.0,438457,2016-05-26,1464221000.0,2016-05-26 00:19:13.391000+00:00,6
6,290,1464221000.0,169589,2016-05-26,1464221000.0,2016-05-26 00:00:11.947000+00:00,50
7,290,1464221000.0,354859,2016-05-26,1464221000.0,2016-05-26 00:05:12.051000+00:00,7
8,302,1464221000.0,36202,2016-05-26,1464221000.0,2016-05-26 00:00:45.583000+00:00,82
9,302,1464221000.0,79520,2016-05-26,1464221000.0,2016-05-26 00:02:50.221000+00:00,197


In [12]:
# note that the item IDs are not sequential (43105 distinct IDs, max ID 707327); we need to remap them to 1..N
print(len(train_df.ItemId.value_counts()))
train_df.ItemId.max()

43105


707327

In [13]:
print(len(tst_df.ItemId.value_counts()))
tst_df.ItemId.max()

21139


707327

In [14]:
train_df.dtypes

SessionId        int64
Time           float64
ItemId           int64
Date            object
Datestamp      float64
TimeO           object
ItemSupport      int64
dtype: object

In [14]:
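src_col = ['SessionId', 'ItemId']
# take explicit copies so that the later ItemId assignment does not raise SettingWithCopyWarning
trdf=train_df[src_col].copy()
tsdf=tst_df[src_col].copy()
trdf.head()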

Unnamed: 0,SessionId,ItemId
0,1,9654
1,1,33043
2,1,32118
3,1,12352
4,1,35077


In [16]:
trdf.ItemId.unique()

array([  9654,  33043,  32118, ..., 258318, 416695, 381794])

In [29]:
# size of the union of test and train item IDs; it equals the number of train items,
# so the test set contains no items that are absent from the training set
len(set(tsdf.ItemId.unique()).union(set(trdf.ItemId.unique())))

43105

In [5]:
# map original item IDs to sequential IDs 1..N, in order of first appearance in the training set
trg2src_id=dict(enumerate(trdf.ItemId.unique(), start=1))
src2trg_id={v:k for k,v in trg2src_id.items()}
len(trg2src_id)

43105
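An equivalent way to build this mapping is `pd.factorize`, which assigns codes in order of first appearance. The sketch below uses made-up item IDs and also shows what happens to a test item that never occurs in training; the variable names are assumptions.

```python
import pandas as pd

train_items = pd.Series([9654, 33043, 32118, 9654, 12352])
test_items = pd.Series([32118, 9654, 99999])   # 99999 never appears in train

# factorize returns 0-based codes and the unique values in order of first appearance
codes, uniques = pd.factorize(train_items)
src2trg = {src: code + 1 for code, src in enumerate(uniques)}  # shift so IDs start at 1

train_ids = train_items.map(src2trg)
test_ids = test_items.map(src2trg)             # items unseen in train map to NaN
print(train_ids.tolist(), test_ids.tolist())
```

A NaN after mapping is a cheap signal that the test set contains an item the training set does not.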

In [6]:
trdf.ItemId=trdf.ItemId.map(src2trg_id)
tsdf.ItemId=tsdf.ItemId.map(src2trg_id)     # assuming no tst items that are absent from training set
trdf.ItemId.value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tdf.ItemId=tdf.ItemId.map(src2trg_id)


323      913
492      886
138      839
2785     810
899      764
        ... 
42738      1
43035      1
43034      1
41318      1
42991      1
Name: ItemId, Length: 43105, dtype: int64

In [8]:
trdf.head(30)

Unnamed: 0,SessionId,ItemId
0,1,1
1,1,2
2,1,3
3,1,4
4,1,5
5,1,6
6,1,7
7,1,8
8,1,9
9,2,10


In [7]:
gbs=trdf.groupby('SessionId')

In [12]:
# Each session is broken into several sub-sessions, so we need the total number of sub-sessions.
# Knowing it lets us allocate the dataframe in advance and avoid copies while concatenating.
# A session with N items is broken into N-1 sub-sessions,
# so the total is the sum of (N-1) over all sessions, i.e. (number of events) - (number of sessions):
gbs.size().sum() - len(gbs)

727563
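The count above relies on the identity that the sum of (N - 1) over all sessions equals the number of events minus the number of sessions. A small worked check on toy data (the session sizes are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"SessionId": [1, 1, 1, 2, 2, 3, 3, 3, 3]})
sizes = toy.groupby("SessionId").size()          # 3, 2 and 4 events

n_rows_per_session = int((sizes - 1).sum())      # (3-1) + (2-1) + (4-1) = 6
n_rows_shortcut = len(toy) - len(sizes)          # 9 events - 3 sessions = 6
print(n_rows_per_session, n_rows_shortcut)
```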

In [9]:
tgt_col=['session_id:token', 'item_id_list:token_seq', 'item_id:token']


In [10]:
def process_session2(sess_df,sid):
    iid = sess_df.ItemId.values
    # join the prefix explicitly: str(ndarray) pads numbers of different widths and wraps long arrays
    iids=[(' '.join(map(str,iid[:i])),str(iid[i])) for i in reversed(range(1,len(iid)))]
    iidfd=dict()
    iidfd.update({'session_id:token':[sid+i+1 for i in range(len(iids))]})
    iidfd.update({'item_id_list:token_seq':[i[0] for i in iids]})
    iidfd.update({'item_id:token':[i[1] for i in iids]})
    iidfs=pd.DataFrame(iidfd)
    
    return iidfs,iidfd['session_id:token'][-1]
    
    

In [95]:
from tqdm.notebook import tqdm
# find the length of the longest session in the training set
maxlen=0
for g in tqdm(list(gbs.groups)):
    leng=len(gbs.get_group(g))
    if leng>maxlen:
        maxlen=leng

print(maxlen)

  0%|          | 0/188806 [00:00<?, ?it/s]

70


In [11]:
len(gbs.groups.keys())

188807

In [12]:
rbdf=[]
# rbdf=pd.DataFrame(columns=tgt_col)
sid=0
for grp in tqdm(gbs.groups):
    sub_rbdf,sid=process_session2(gbs.get_group(grp),sid)
    rbdf.append(sub_rbdf)
rbdf=pd.concat(rbdf,ignore_index=True)
rbdf.head(30)



  0%|          | 0/188807 [00:00<?, ?it/s]

Unnamed: 0,session_id:token,item_id_list:token_seq,item_id:token
0,1,1 2 3 4 5 6 7 8,9
1,2,1 2 3 4 5 6 7,8
2,3,1 2 3 4 5 6,7
3,4,1 2 3 4 5,6
4,5,1 2 3 4,5
5,6,1 2 3,4
6,7,1 2,3
7,8,1,2
8,9,10 11 12 13 14 15 13 11 16,17
9,10,10 11 12 13 14 15 13 11,16


In [None]:
# slower variant, kept for comparison: growing the dataframe inside the loop copies it on every iteration
rbdf=pd.DataFrame(columns=tgt_col)
sid=0
for grp in tqdm(gbs.groups):
    sub_rbdf,sid=process_session2(gbs.get_group(grp),sid)
    rbdf=pd.concat([rbdf,sub_rbdf],ignore_index=True)
rbdf.head(30)
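Another option, sketched below, is to skip the per-session dataframes altogether: collect plain Python lists and build a single dataframe at the end. This is an alternative sketch and not the code used in this notebook; the function name `expand_sessions` is an assumption, and it relies on the rows of each session already being in time order, as they are in the prepared files.

```python
import pandas as pd

def expand_sessions(df):
    """One output row per (prefix, next item) pair, longest prefix first within each session."""
    session_ids, prefixes, targets = [], [], []
    next_id = 1
    for _, items in df.groupby("SessionId", sort=True)["ItemId"]:
        items = [str(i) for i in items]
        for cut in range(len(items) - 1, 0, -1):
            session_ids.append(next_id)
            prefixes.append(" ".join(items[:cut]))   # history, space separated
            targets.append(items[cut])               # item to predict
            next_id += 1
    return pd.DataFrame({
        "session_id:token": session_ids,
        "item_id_list:token_seq": prefixes,
        "item_id:token": targets,
    })

toy = pd.DataFrame({"SessionId": [1, 1, 1, 2, 2], "ItemId": [1, 2, 3, 10, 11]})
out = expand_sessions(toy)
print(out)
```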


#### Final Code

In [3]:
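def process_session(sess_df,sid):
    """Expand one session into RecBole rows: each prefix of the session predicts the item that follows it.

    Returns the rows and the last session id used, so that the caller can continue the numbering.
    """
    iid = sess_df.ItemId.values
    # longest prefix first; join explicitly because str(ndarray) pads numbers and wraps long arrays
    iids=[(' '.join(map(str,iid[:i])),str(iid[i])) for i in reversed(range(1,len(iid)))]
    iidfd={
        'session_id:token':[sid+i+1 for i in range(len(iids))],
        'item_id_list:token_seq':[seq for seq,_ in iids],
        'item_id:token':[target for _,target in iids],
    }
    iidfs=pd.DataFrame(iidfd)
    # a single-item session yields no rows; the running id then stays unchanged
    return iidfs,sid+len(iids)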


In [None]:
def transform_df(srec_df,n_sess=None):
    """Convert a session-rec dataframe to RecBole rows; n_sess limits the number of sessions (None = all)."""
    gbs=srec_df.groupby('SessionId')
    rbdf=[]
    sid=0
    n_sess = n_sess or len(gbs.groups)
    for grp in tqdm(list(gbs.groups)[:n_sess]):
        sub_rbdf,sid=process_session(gbs.get_group(grp),sid)
        rbdf.append(sub_rbdf)
    # concatenate once at the end instead of growing a dataframe inside the loop
    rbdf=pd.concat(rbdf,ignore_index=True)
    return rbdf 


def convert_srec_to_recbole(srec_train_filename,srec_test_filename, inter_train_filename,inter_test_filename,n_sess=None):
    # input files are read from the notebook-level prep_srec_path; only these two columns are needed
    srec_col = ['SessionId', 'ItemId']
    srec_train_df = pd.read_csv(os.path.join(prep_srec_path,srec_train_filename+'.txt'), sep='\t', usecols=srec_col)
    srec_test_df = pd.read_csv(os.path.join(prep_srec_path,srec_test_filename+'.txt'), sep='\t', usecols=srec_col)
    train_items = set(srec_train_df.ItemId.unique())
    test_items = set(srec_test_df.ItemId.unique())
    if test_items - train_items:
        print("Warning: there are new items in the test set")
    # remap item ids to be sequential from 1 to N over all items seen in train or test;
    # sorting keeps the mapping reproducible
    src2trg_id={src:trg for trg,src in enumerate(sorted(train_items | test_items), start=1)}
    srec_train_df['ItemId']=srec_train_df.ItemId.map(src2trg_id)
    srec_test_df['ItemId']=srec_test_df.ItemId.map(src2trg_id)

    print(f'generating {inter_train_filename}:')
    rbdf=transform_df(srec_train_df,n_sess)
    rbdf.to_csv(inter_train_filename,sep='\t',index=False)

    print(f'generating {inter_test_filename}:')
    rbdf=transform_df(srec_test_df,n_sess)
    rbdf.to_csv(inter_test_filename,sep='\t',index=False)

    print('Done')


In [None]:
srec_train_filename='train-item-views_train_full'
inter_train_filename='./diginetica-sess.train.inter'

srec_test_filename='train-item-views_test'
inter_test_filename='./diginetica-sess.test.inter'

convert_srec_to_recbole(srec_train_filename,srec_test_filename,inter_train_filename,inter_test_filename)
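Before pointing RecBole at the generated files, it may be worth reading them back and checking their structure. The following is a minimal sketch of such a check; `check_inter` is a hypothetical helper, and the in-memory samples stand in for the real `.inter` files.

```python
import io
import pandas as pd

def check_inter(buffer, n_events=None, n_sessions=None):
    """Basic structural checks on a generated .inter file (path or file-like object)."""
    df = pd.read_csv(buffer, sep="\t", dtype=str)
    expected = ["session_id:token", "item_id_list:token_seq", "item_id:token"]
    problems = []
    if list(df.columns) != expected:
        problems.append("unexpected header")
    elif df.isna().any().any():
        problems.append("empty cells")
    elif not df["session_id:token"].is_unique:
        problems.append("duplicate session ids")
    # every session of N events should contribute N-1 rows
    if n_events is not None and n_sessions is not None and len(df) != n_events - n_sessions:
        problems.append("row count differs from events - sessions")
    return problems

header = "session_id:token\titem_id_list:token_seq\titem_id:token\n"
good = header + "1\t1 2\t3\n2\t1\t2\n3\t10\t11\n"   # 5 events in 2 sessions -> 3 rows
bad = header + "1\t1 2\t3\n1\t1\t2\n"               # session id 1 is repeated

good_problems = check_inter(io.StringIO(good), n_events=5, n_sessions=2)
bad_problems = check_inter(io.StringIO(bad))
print(good_problems, bad_problems)
```

For the real files, pass the path of the `.inter` file instead of a `StringIO` object.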

# NowPlaying

## RecBole

## session-rec

# RetailRocket

## session-rec

## RecBole

# RSC15

## session-rec

## RecBole