# Getting artificial datasets ready for RecBole

https://github.com/RUCAIBox/RecBole/discussions/792


Here is a summary of the answer to this question.
To load a new dataset and run models on it, follow these steps:



Prepare your dataset files:

RecBole ships with a default dataset, ml-100k. To use another dataset, you need to **prepare your data and convert the raw data into Atomic Files** (see the Atomic Files docs for the format). We have also prepared some popular datasets, and you can download their atomic files from our Google Drive or Baidu Wangpan. Then create a folder called MyDataset and organize the files like this:

-MyDataset
    -DataA
        -DataA.inter
        -DataA.item
        - ......
    -DataB
        -DataB.inter
    ........
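
A minimal sketch of building that layout with the standard library. The helper name `make_dataset_dirs` and the empty placeholder files are assumptions made for illustration; in practice you would move your real atomic files into place.

```python
from pathlib import Path
import tempfile


def make_dataset_dirs(root, datasets):
    """Create root/<name>/<name>.<suffix> placeholders for each dataset.

    `datasets` maps a dataset name to the atomic-file suffixes it has,
    e.g. {"DataA": ["inter", "item"]}. Hypothetical helper, not a RecBole API.
    """
    created = []
    for name, suffixes in datasets.items():
        folder = Path(root) / name
        folder.mkdir(parents=True, exist_ok=True)
        for suffix in suffixes:
            path = folder / f"{name}.{suffix}"
            path.touch()  # empty placeholder; replace with the real atomic file
            created.append(path)
    return created


# Build the example layout in a temporary directory.
root = Path(tempfile.mkdtemp()) / "MyDataset"
files = make_dataset_dirs(root, {"DataA": ["inter", "item"], "DataB": ["inter"]})
for f in files:
    print(f.relative_to(root.parent))
```

Note that each dataset folder and the files inside it share the same base name.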

Set config:

If you load a new dataset, **the default config settings may need to change, so you need to set the config yourself.** Before doing so, I strongly recommend reading our config settings docs and data args docs first; otherwise you may run into many problems.

https://recbole.io/docs/user_guide/config_settings.html

Here are some config settings you may need to change:

data_path: The path of "MyDataset" (mentioned before)

load_col: Which atomic files and which columns to load

USER_ID_FIELD: Field name of user ID feature

ITEM_ID_FIELD: Field name of item ID feature

RATING_FIELD: Field name of rating feature

TIME_FIELD: Field name of timestamp feature
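
As an illustration, here is how these settings might look for an interaction file whose columns are user_id, item_id and timestamp (the column names used for the artificial datasets in this notebook). The path is a placeholder, RATING_FIELD is left out because such a file has no rating column, and the check at the end is only a convenience for this note, not a RecBole feature.

```python
# Hypothetical settings for a dataset with user_id, item_id and timestamp columns.
parameter_dict = {
    "data_path": "MyDataset/",  # placeholder path to the folder described above
    "load_col": {"inter": ["user_id", "item_id", "timestamp"]},
    "USER_ID_FIELD": "user_id",
    "ITEM_ID_FIELD": "item_id",
    "TIME_FIELD": "timestamp",
}

# Sanity check (not part of RecBole): every *_FIELD should name a loaded column.
loaded = {field for fields in parameter_dict["load_col"].values() for field in fields}
missing = [
    key
    for key, value in parameter_dict.items()
    if key.endswith("_FIELD") and value not in loaded
]
print("fields not covered by load_col:", missing)
```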

Load settings and run the model:

RecBole supports three ways to load settings: config files, parameter dicts, and the command line. See our config settings docs for details and examples of each. After that, you can simply run the model with RecBole and carry on with your research.
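
To make the three options concrete, the sketch below expresses the same settings as a YAML config file, as a parameter dict, and as command-line style arguments. The file name `my_config.yaml` and the chosen settings are placeholders; the snippet only prepares the inputs and does not import RecBole.

```python
from pathlib import Path
import tempfile

# 1) Config file: plain YAML text written to disk.
yaml_text = "topk: [5]\nvalid_metric: Recall@5\nuse_gpu: False\n"
config_file = Path(tempfile.mkdtemp()) / "my_config.yaml"  # placeholder name
config_file.write_text(yaml_text)

# 2) Parameter dict: the same settings as a Python dict.
parameter_dict = {"topk": [5], "valid_metric": "Recall@5", "use_gpu": False}

# 3) Command line: the same settings as --key=value arguments.
cli_args = [f"--{key}={value}" for key, value in parameter_dict.items()]
print(" ".join(cli_args))

# With RecBole installed, the first two would be passed roughly like this:
#   Config(model='BPR', dataset='ml-100k', config_file_list=[str(config_file)])
#   Config(model='BPR', dataset='ml-100k', config_dict=parameter_dict)
```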

## Atomic Files


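Each column header in an atomic file has the form feat_name:feat_type, for example user_id:token.


The supported values of feat_type, with explanations and examples: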

- token, single discrete feature, [user_id, age]

- token_seq, discrete features sequence, [review]

- float, single continuous feature, [rating, timestamp]

- float_seq, continuous feature sequence, [vector]
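
A small sketch of how such headers can be assembled and checked against the four feature types listed above. `make_header` is a hypothetical helper written for this note, not something RecBole provides.

```python
VALID_TYPES = {"token", "token_seq", "float", "float_seq"}


def make_header(fields):
    """Turn (feat_name, feat_type) pairs into 'feat_name:feat_type' strings."""
    header = []
    for name, feat_type in fields:
        if feat_type not in VALID_TYPES:
            raise ValueError(f"unknown feat_type {feat_type!r} for field {name!r}")
        header.append(f"{name}:{feat_type}")
    return header


header = make_header(
    [("user_id", "token"), ("item_id", "token"), ("timestamp", "float")]
)
print("\t".join(header))
```

The resulting list has the same shape as the `header=` argument used in the conversion cells of this notebook.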

In [1]:
import pandas as pd

### sudden_drift_all_items_seen_dataset

In [None]:
df = pd.read_csv('sudden_drift_all_items_seen_dataset.csv')
df.head()

In [12]:
df.to_csv('sudden_drift_all_items_seen_dataset.inter',header=['user_id:token','item_id:token','timestamp:float'], sep='\t', index=False)

### sudden_drift_dataset

In [None]:
df = pd.read_csv('sudden_drift_dataset.csv')
df.head()

In [14]:
df.to_csv('sudden_drift_dataset.inter',header=['user_id:token','item_id:token','timestamp:float'], sep='\t', index=False)
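
Both conversions follow the same pattern, so they could be folded into one helper. Below is a pandas-free sketch using only the standard library. `csv_to_inter` is a hypothetical name, and it assumes the CSV has a header row and exactly three columns in the order user, item, timestamp; the toy rows are made up for the demo.

```python
import csv
import tempfile
from pathlib import Path

INTER_HEADER = ["user_id:token", "item_id:token", "timestamp:float"]


def csv_to_inter(csv_path, inter_path, header=INTER_HEADER):
    """Copy a CSV to a tab-separated .inter file with typed column headers."""
    n_rows = 0
    with open(csv_path, newline="") as src, open(inter_path, "w", newline="") as dst:
        reader = csv.reader(src)
        next(reader)  # drop the original header row
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        for row in reader:
            if len(row) != len(header):
                raise ValueError(f"expected {len(header)} columns, got {len(row)}")
            writer.writerow(row)
            n_rows += 1
    return n_rows


# Demo on a toy file with made-up interactions.
tmp = Path(tempfile.mkdtemp())
csv_path = tmp / "toy.csv"
csv_path.write_text("user,item,ts\n1,10,0\n2,11,1\n")
inter_path = tmp / "toy.inter"
n = csv_to_inter(csv_path, inter_path)
print(n, "rows written")
print(inter_path.read_text())
```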

## Config file

data_path (str): The path of the input dataset. Defaults to 'dataset/'.
load_col (dict): Keys are the suffixes of the atomic files to load; values are the lists of field names to load from each file. If a suffix is not in load_col, the corresponding atomic file is not loaded. If load_col is None, all existing atomic files are loaded. Defaults to {inter: [user_id, item_id]}.
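
Since load_col refers to field names without their type suffix, a quick way to catch typos is to compare the requested fields with the header line of the atomic file. The sketch below does this on a header string; `missing_fields` is a hypothetical helper, and the tab separator matches the `sep='\t'` used when writing the `.inter` files in this notebook.

```python
def missing_fields(header_line, load_col_fields, sep="\t"):
    """Return the requested fields that do not appear in the atomic-file header."""
    columns = header_line.rstrip("\n").split(sep)
    available = {col.split(":")[0] for col in columns}  # strip the ':type' part
    return [field for field in load_col_fields if field not in available]


header_line = "user_id:token\titem_id:token\ttimestamp:float\n"
print(missing_fields(header_line, ["user_id", "item_id"]))  # → []
print(missing_fields(header_line, ["user_id", "rating"]))   # → ['rating']
```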



In [1]:
from recbole.config import Config


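# Note: 'dataset' is expected to be the dataset name rather than a file path;
# the printed data_path below shows RecBole joining data_path with this value.
parameter_dict = {
    'dataset': 'artificial_data/sudden_drift_all_items_seen_dataset.inter',
    'data_path': 'artificial_data/',
    'load_col': {'inter': ['user_id', 'item_id']}
}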


config = Config(model='BPR', dataset=None, config_dict=parameter_dict)
print('dataset: ', config['dataset'])
print('data_path: ', config['data_path'])
print('load_col: ', config['load_col'])


dataset:  ./processed_datasets/artificial_data/sudden_drift_all_items_seen_dataset.iter
data_path:  ./processed_datasets/artificial_data/sudden_drift_all_items_seen\./processed_datasets/artificial_data/sudden_drift_all_items_seen_dataset.iter
load_col:  {'inter': ['user_id', 'item_id']}
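
Judging from the printed data_path, RecBole appears to append the dataset name to data_path and, following the folder layout described earlier, to look for `<dataset>.inter` inside that folder. A small sketch of that resolution, where `expected_inter_path` is a hypothetical helper written for this note rather than a RecBole function:

```python
import os


def expected_inter_path(data_path, dataset):
    """Where the .inter file is assumed to live: <data_path>/<dataset>/<dataset>.inter."""
    dataset_dir = os.path.join(data_path, dataset)
    return os.path.join(dataset_dir, f"{dataset}.inter")


path = expected_inter_path("processed_datasets/artificial_data/", "sudden_drift_dataset")
print(path)
print("exists:", os.path.exists(path))
```

If the file is reported as missing, the usual culprits are a `dataset` value that includes a path or the `.inter` suffix, or a dataset folder whose name differs from the file's base name.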


In [35]:
# import os
# import sys
# sys.path.append(os.path.abspath('') + '/..')

In [6]:
from logging import getLogger
from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.model.general_recommender import BPR
from recbole.trainer import Trainer
from recbole.utils import init_seed, init_logger


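parameter_dict = {
    # The dataset= argument passed to Config below takes precedence over this entry
    # (the logged data_path has no '.inter' suffix).
    'dataset': 'sudden_drift_all_items_seen_dataset.inter',
    'data_path': 'processed_datasets/artificial_data/',
    'load_col': {'inter': ['user_id', 'item_id']},
    'use_gpu': False,
    'topk': 5,
    'valid_metric': 'Recall@5'
}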

# configurations initialization
config = Config(model='BPR', dataset='sudden_drift_all_items_seen_dataset', config_dict=parameter_dict)

# init random seed
init_seed(config['seed'], config['reproducibility'])

# logger initialization
init_logger(config)
logger = getLogger()

# write config info into log
logger.info(config)

# dataset creating and filtering
dataset = create_dataset(config)
logger.info(dataset)

# dataset splitting
train_data, valid_data, test_data = data_preparation(config, dataset)

# model loading and initialization
model = BPR(config, train_data.dataset).to(config['device'])
logger.info(model)

# trainer loading and initialization
trainer = Trainer(config, model)

# model training
best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)

# model evaluation
test_result = trainer.evaluate(test_data)
print(test_result)

22 Dec 12:13    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = False
seed = 2020
state = INFO
reproducibility = True
data_path = processed_datasets/artificial_data/sudden_drift_all_items_seen_dataset
checkpoint_dir = saved
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 300
train_batch_size = 2048
learner = adam
learning_rate = 0.001
train_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'RS': [0.8, 0.1, 0.1]}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'full', 'test': 'full'}}
repeatable = False
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk = [5]
valid_metric = Recall@5
valid_metric_bigger = True
eval_batch_size = 4

OrderedDict([('recall@5', 1.0), ('mrr@5', 0.915), ('ndcg@5', 0.9369), ('hit@5', 1.0), ('precision@5', 0.2)])


In [3]:
from logging import getLogger
from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.model.general_recommender import BPR
from recbole.trainer import Trainer
from recbole.utils import init_seed, init_logger


parameter_dict = {
    'dataset': 'sudden_drift_all_items_seen_dataset.inter',
    'data_path': 'processed_datasets/artificial_data/',
    'load_col': {'inter': ['user_id', 'item_id']},
    'use_gpu': False,
    'topk': 1,
    'valid_metric': 'Recall@1'
}

# configurations initialization
config = Config(model='BPR', dataset='sudden_drift_all_items_seen_dataset', config_dict=parameter_dict)

# init random seed
init_seed(config['seed'], config['reproducibility'])

# logger initialization
init_logger(config)
logger = getLogger()

# write config info into log
logger.info(config)

# dataset creating and filtering
dataset = create_dataset(config)
logger.info(dataset)

# dataset splitting
train_data, valid_data, test_data = data_preparation(config, dataset)

# model loading and initialization
model = BPR(config, train_data.dataset).to(config['device'])
logger.info(model)

# trainer loading and initialization
trainer = Trainer(config, model)

# model training
best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)

# model evaluation
test_result = trainer.evaluate(test_data)
print(test_result)

22 Dec 12:10    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = False
seed = 2020
state = INFO
reproducibility = True
data_path = processed_datasets/artificial_data/sudden_drift_all_items_seen_dataset
checkpoint_dir = saved
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 300
train_batch_size = 2048
learner = adam
learning_rate = 0.001
train_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'RS': [0.8, 0.1, 0.1]}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'full', 'test': 'full'}}
repeatable = False
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk = [1]
valid_metric = Recall@1
valid_metric_bigger = True
eval_batch_size = 4

OrderedDict([('recall@1', 0.8525), ('mrr@1', 0.8525), ('ndcg@1', 0.8525), ('hit@1', 0.8525), ('precision@1', 0.8525)])


In [9]:
from logging import getLogger
from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.model.general_recommender import BPR
from recbole.trainer import Trainer
from recbole.utils import init_seed, init_logger


parameter_dict = {
    'dataset': 'sudden_drift_dataset.inter',
    'data_path': 'processed_datasets/artificial_data/',
    'load_col': {'inter': ['user_id', 'item_id']},
    'use_gpu': False,
    'topk': 5,
    'valid_metric': 'Recall@5'
}

# configurations initialization
config = Config(model='BPR', dataset='sudden_drift_dataset', config_dict=parameter_dict)

# init random seed
init_seed(config['seed'], config['reproducibility'])

# logger initialization
init_logger(config)
logger = getLogger()

# write config info into log
logger.info(config)

# dataset creating and filtering
dataset = create_dataset(config)
logger.info(dataset)

# dataset splitting
train_data, valid_data, test_data = data_preparation(config, dataset)

# model loading and initialization
model = BPR(config, train_data.dataset).to(config['device'])
logger.info(model)

# trainer loading and initialization
trainer = Trainer(config, model)

# model training
best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)

# model evaluation
test_result = trainer.evaluate(test_data)
print(test_result)

22 Dec 12:15    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = False
seed = 2020
state = INFO
reproducibility = True
data_path = processed_datasets/artificial_data/sudden_drift_dataset
checkpoint_dir = saved
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 300
train_batch_size = 2048
learner = adam
learning_rate = 0.001
train_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'RS': [0.8, 0.1, 0.1]}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'full', 'test': 'full'}}
repeatable = False
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk = [5]
valid_metric = Recall@5
valid_metric_bigger = True
eval_batch_size = 4096
metric_decimal_place = 4

Dataset Hyper Parameters:
field_separato

OrderedDict([('recall@5', 1.0), ('mrr@5', 0.7951), ('ndcg@5', 0.8469), ('hit@5', 1.0), ('precision@5', 0.2)])


In [10]:
from logging import getLogger
from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.model.general_recommender import BPR
from recbole.trainer import Trainer
from recbole.utils import init_seed, init_logger


parameter_dict = {
    'dataset': 'sudden_drift_dataset.inter',
    'data_path': 'processed_datasets/artificial_data/',
    'load_col': {'inter': ['user_id', 'item_id']},
    'use_gpu': False,
    'topk': 3,
    'valid_metric': 'Recall@3'
}

# configurations initialization
config = Config(model='BPR', dataset='sudden_drift_dataset', config_dict=parameter_dict)

# init random seed
init_seed(config['seed'], config['reproducibility'])

# logger initialization
init_logger(config)
logger = getLogger()

# write config info into log
logger.info(config)

# dataset creating and filtering
dataset = create_dataset(config)
logger.info(dataset)

# dataset splitting
train_data, valid_data, test_data = data_preparation(config, dataset)

# model loading and initialization
model = BPR(config, train_data.dataset).to(config['device'])
logger.info(model)

# trainer loading and initialization
trainer = Trainer(config, model)

# model training
best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)

# model evaluation
test_result = trainer.evaluate(test_data)
print(test_result)

22 Dec 12:15    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = False
seed = 2020
state = INFO
reproducibility = True
data_path = processed_datasets/artificial_data/sudden_drift_dataset
checkpoint_dir = saved
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 300
train_batch_size = 2048
learner = adam
learning_rate = 0.001
train_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'RS': [0.8, 0.1, 0.1]}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'full', 'test': 'full'}}
repeatable = False
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk = [3]
valid_metric = Recall@3
valid_metric_bigger = True
eval_batch_size = 4096
metric_deci

OrderedDict([('recall@3', 0.7238), ('mrr@3', 0.4604), ('ndcg@3', 0.5278), ('hit@3', 0.7238), ('precision@3', 0.2413)])


In [11]:
from logging import getLogger
from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.model.general_recommender import BPR
from recbole.trainer import Trainer
from recbole.utils import init_seed, init_logger


parameter_dict = {
    'dataset': 'sudden_drift_dataset.inter',
    'data_path': 'processed_datasets/artificial_data/',
    'load_col': {'inter': ['user_id', 'item_id']},
    'use_gpu': False,
    'topk': 2,
    'valid_metric': 'Recall@2'
}

# configurations initialization
config = Config(model='BPR', dataset='sudden_drift_dataset', config_dict=parameter_dict)

# init random seed
init_seed(config['seed'], config['reproducibility'])

# logger initialization
init_logger(config)
logger = getLogger()

# write config info into log
logger.info(config)

# dataset creating and filtering
dataset = create_dataset(config)
logger.info(dataset)

# dataset splitting
train_data, valid_data, test_data = data_preparation(config, dataset)

# model loading and initialization
model = BPR(config, train_data.dataset).to(config['device'])
logger.info(model)

# trainer loading and initialization
trainer = Trainer(config, model)

# model training
best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)

# model evaluation
test_result = trainer.evaluate(test_data)
print(test_result)

22 Dec 12:15    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = False
seed = 2020
state = INFO
reproducibility = True
data_path = processed_datasets/artificial_data/sudden_drift_dataset
checkpoint_dir = saved
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 300
train_batch_size = 2048
learner = adam
learning_rate = 0.001
train_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'RS': [0.8, 0.1, 0.1]}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'full', 'test': 'full'}}
repeatable = False
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk = [2]
valid_metric = Recall@2
valid_metric_bigger = True
eval_batch_size = 4096
metric_deci

OrderedDict([('recall@2', 0.4288), ('mrr@2', 0.3212), ('ndcg@2', 0.3494), ('hit@2', 0.4288), ('precision@2', 0.2144)])


In [12]:
from logging import getLogger
from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.model.general_recommender import BPR
from recbole.trainer import Trainer
from recbole.utils import init_seed, init_logger


parameter_dict = {
    'dataset': 'sudden_drift_dataset.inter',
    'data_path': 'processed_datasets/artificial_data/',
    'load_col': {'inter': ['user_id', 'item_id']},
    'use_gpu': False,
    'topk': 1,
    'valid_metric': 'Recall@1'
}

# configurations initialization
config = Config(model='BPR', dataset='sudden_drift_dataset', config_dict=parameter_dict)

# init random seed
init_seed(config['seed'], config['reproducibility'])

# logger initialization
init_logger(config)
logger = getLogger()

# write config info into log
logger.info(config)

# dataset creating and filtering
dataset = create_dataset(config)
logger.info(dataset)

# dataset splitting
train_data, valid_data, test_data = data_preparation(config, dataset)

# model loading and initialization
model = BPR(config, train_data.dataset).to(config['device'])
logger.info(model)

# trainer loading and initialization
trainer = Trainer(config, model)

# model training
best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)

# model evaluation
test_result = trainer.evaluate(test_data)
print(test_result)

22 Dec 12:15    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = False
seed = 2020
state = INFO
reproducibility = True
data_path = processed_datasets/artificial_data/sudden_drift_dataset
checkpoint_dir = saved
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 300
train_batch_size = 2048
learner = adam
learning_rate = 0.001
train_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'RS': [0.8, 0.1, 0.1]}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'full', 'test': 'full'}}
repeatable = False
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk = [1]
valid_metric = Recall@1
valid_metric_bigger = True
eval_batch_size = 4096
metric_deci

OrderedDict([('recall@1', 0.1925), ('mrr@1', 0.1925), ('ndcg@1', 0.1925), ('hit@1', 0.1925), ('precision@1', 0.1925)])
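
One thing worth noticing in the results above: in every run precision@k equals recall@k divided by k, up to rounding. That is what you would expect if each user has exactly one held-out test item, although the logs do not state this directly, so treat it as an inference. The check below uses only the numbers printed in this notebook.

```python
# (dataset, k) -> metrics copied from the test results printed above
reported = {
    ("sudden_drift_all_items_seen_dataset", 5): {"recall": 1.0, "precision": 0.2},
    ("sudden_drift_all_items_seen_dataset", 1): {"recall": 0.8525, "precision": 0.8525},
    ("sudden_drift_dataset", 5): {"recall": 1.0, "precision": 0.2},
    ("sudden_drift_dataset", 3): {"recall": 0.7238, "precision": 0.2413},
    ("sudden_drift_dataset", 2): {"recall": 0.4288, "precision": 0.2144},
    ("sudden_drift_dataset", 1): {"recall": 0.1925, "precision": 0.1925},
}

for (dataset, k), metrics in reported.items():
    implied_precision = metrics["recall"] / k  # holds if one test item per user
    consistent = abs(implied_precision - metrics["precision"]) < 1e-3
    print(
        f"{dataset:38s} k={k}  recall/k={implied_precision:.4f}  "
        f"precision={metrics['precision']:.4f}  consistent={consistent}"
    )
```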


In [None]:
# from recbole.quick_start import run_recbole

# parameter_dict = {
#     'dataset': '../processed_datasets/ArtificialData/sudden_drift_all_items_seen/sudden_drift_all_items_seen_dataset.inter',
#     'data_path': '../processed_datasets/ArtificialData/sudden_drift_all_items_seen/',
#     'load_col': {'inter': ['user_id', 'item_id']},
#     'use_gpu':False,
# }

# run_recbole(model='BPR', dataset='sudden_drift_all_items_seen_dataset', config_dict=parameter_dict)
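
The cells above differ only in the dataset name and in k, so the runs could be driven by one loop. A sketch under a few assumptions: `build_params` and `run_bpr` are hypothetical helpers, RecBole is installed and the atomic files are in place, and `run_recbole` returns a dict with a `'test_result'` entry (worth checking against your installed version). RecBole is imported lazily so the parameter builder can be tried on its own.

```python
def build_params(k, data_path="processed_datasets/artificial_data/"):
    """Settings shared by the runs above, parameterised by the cutoff k."""
    return {
        "data_path": data_path,
        "load_col": {"inter": ["user_id", "item_id"]},
        "use_gpu": False,
        "topk": k,
        "valid_metric": f"Recall@{k}",
    }


def run_bpr(dataset, k):
    """Train and evaluate BPR on `dataset` with top-k cutoff `k`."""
    # Imported here so build_params can be used without RecBole installed.
    from recbole.quick_start import run_recbole

    return run_recbole(model="BPR", dataset=dataset, config_dict=build_params(k))


print(build_params(3)["valid_metric"])  # → Recall@3

# Usage (requires RecBole and the .inter files):
# for k in (1, 2, 3, 5):
#     result = run_bpr("sudden_drift_dataset", k)
#     print(k, result["test_result"])
```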

### example from docs

In [None]:

# from recbole.config import Config

# parameter_dict = {
#     'gpu_id': 2,
#     'training_batch_size': 512
# }
# config = Config(model='BPR', dataset='ml-100k', config_dict=parameter_dict)
# print('gpu_id: ', config['gpu_id'])
# print('training_batch_size: ', config['training_batch_size'])

In [None]:
# from recbole.config import Config

# config = Config(model='BPR', dataset='ml-100k', config_file_list=['example.yaml'])
# print('gpu_id: ', config['gpu_id'])
# print('training_batch_size: ', config['training_batch_size'])