<i>Copyright (c) Microsoft Corporation.</i>

<i>Licensed under the MIT license.</i>

# MoRec: A Data-Centric Multi-Objective Learning Framework for Responsible Recommendation Systems

MoRec[[1]](https://arxiv.org/abs/2310.13260v1) is a data-centric multi-objective framework designed for responsible recommendation systems. Concretely, MoRec adopts a tri-level framework to optimize diverse objectives simultaneously, comprising a PID-based objective coordinator for trade-off among objectives and an adaptive data sampler for unified objective modeling. 


## Strengths of MoRec
- MoRec is model-agnostic: it can convert an accuracy-oriented model into a multi-objective model.
- MoRec adopts a post-training strategy, so a well-trained model can be converted into a multi-objective model at low cost.
- MoRec exhibits strong objective control: it can optimize a model toward a given objective preference without sacrificing much accuracy.

## Data requirements

MoRec is capable of optimizing accuracy, revenue, fairness and alignment objectives simultaneously. 

- For accuracy, basic user-item interaction files are required, including `train.csv`, `valid.csv`, `test.csv` and `user_history.csv`. 
  `train.csv`, `valid.csv` and `test.csv` contain the interactions in the training set, validation set and test set respectively, and are formatted as follows:

  | user_id | item_id |
  |---------|---------|
  | 1       | 1       |
  | 1       | 2       |
  | ...     | ...     |
  | 100     | 254     |
  | ...     | ...     |

  `user_history.csv` contains each user's interaction history, consisting of the interactions in the training set and validation set, and is formatted as follows:

  | user_id | item_seq |
  |---------|---------|
  | 1       | 1,2,3,...|
  | ...     | ...     |
  | 100     | 254,257,327,... |
  | ...     | ...     |

- For revenue, MoRec samples training instances according to their weights, i.e. item price. For fairness, MoRec aims to improve the accuracy of the most disadvantaged group. For alignment, MoRec aims to align the model's distribution with a pre-defined expected distribution. To model these objectives, `item_meta_morec_filename` is required; it provides the item weights, fairness groups and alignment groups. To specify your own expected distribution for alignment, also provide `align_dist_filename`. By default, the expected distribution is the one derived from the training set. Examples of the item_meta_morec_file and the align_dist_file follow.
  
  - item_meta_morec_file: `item_meta_morec.csv`, columns separated by comma

    | item_id | weight | fair_group | align_group |
    |---------|---------|---------|---------|
    | 1       | 2.35    |  1  |  2 |
    | 2       | 63.21   |  5  |  1 |
    | ...     | ...     | ... | ... |
    | 100     | 5.89   |  5  |  4 |
    | ...     | ...     | ... | ... |

  - align_dist_file: `expected_align_dist.csv`, columns separated by comma

    | group_id | proportion |
    |---------|---------|
    | 1       | 0.21 |
    | 2     | 0.12    |
    | 3    | 0.33 |
    | 4    | 0.22    |
    | 5    | 0.12   |
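
As a rough illustration of what "the distribution derived from the training set" could look like, the sketch below counts training interactions per `align_group` and normalizes the counts into proportions. This is an assumption about the derivation, not MoRec's confirmed implementation, and the helper name `derive_align_dist` is hypothetical.

```python
from collections import Counter


def derive_align_dist(train_pairs, item_to_group):
    """Return {group_id: proportion} from (user_id, item_id) training pairs.

    Assumption: the expected distribution is the share of training
    interactions that fall into each alignment group.
    """
    counts = Counter(item_to_group[item_id] for _, item_id in train_pairs)
    total = sum(counts.values())
    return {group: counts[group] / total for group in sorted(counts)}


# Toy data: four interactions, three items mapped to two alignment groups.
train_pairs = [(1, 1), (1, 2), (2, 1), (3, 3)]
item_to_group = {1: 2, 2: 1, 3: 2}
align_dist = derive_align_dist(train_pairs, item_to_group)

# Print rows in the layout of `expected_align_dist.csv`.
print("group_id,proportion")
for group_id, proportion in align_dist.items():
    print(f"{group_id},{proportion:.2f}")
```

Whatever the source of the proportions, they should sum to 1 across all groups, as in the example table above.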


### Example: MovieLens-100k dataset

#### Data Preparation

The script for downloading and preprocessing ml-100k is in our [example folder](../../preprocess/download_split_ml100k.py). Here we call the function defined in that script. The preprocessed csv files are saved in `~/.unirec/dataset/ml-100k`. 

You can process your own dataset in a similar way to obtain `train.csv`, `valid.csv`, `test.csv` and `user_history.csv` using a leave-one-out strategy. 
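
For readers preparing their own data, a leave-one-out split can be sketched as follows. This is a minimal illustration, not the code in `download_split_ml100k.py`: the function name `leave_one_out_split` is hypothetical, the input is assumed to be ordered chronologically per user, and keeping users with fewer than three interactions in the training set only is a choice made for this sketch.

```python
from collections import defaultdict


def leave_one_out_split(interactions):
    """Split chronologically ordered (user_id, item_id) pairs per user.

    The last item of each user goes to test, the second-to-last to valid,
    and the rest to train. The history holds the train and valid items.
    """
    per_user = defaultdict(list)
    for user_id, item_id in interactions:
        per_user[user_id].append(item_id)

    train, valid, test, history = [], [], [], {}
    for user_id, items in per_user.items():
        if len(items) < 3:
            # Too few interactions to hold anything out (sketch-only choice).
            train.extend((user_id, item) for item in items)
            history[user_id] = list(items)
            continue
        train.extend((user_id, item) for item in items[:-2])
        valid.append((user_id, items[-2]))
        test.append((user_id, items[-1]))
        history[user_id] = items[:-1]
    return train, valid, test, history


interactions = [(1, 1), (1, 2), (1, 3), (1, 4), (2, 7), (2, 8), (2, 9)]
train, valid, test, history = leave_one_out_split(interactions)
print("train:", train)
print("valid:", valid)
print("test:", test)
print("history:", history)
```

Each history entry can then be written as one row of `user_history.csv`, with the items joined by commas in the `item_seq` column.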

Because ml-100k lacks the required information, the columns of the `item_meta_morec.csv` file are filled with random numbers here. For your own dataset, you can obtain them from the real item metadata.
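
To make the random placeholder concrete, here is a minimal sketch of how such a file could be generated. It is not the code used by `prepare_ml100k`; the function `fake_item_meta`, the value ranges and the number of groups are all assumptions chosen for illustration.

```python
import csv
import io
import random


def fake_item_meta(n_items, n_fair_groups=5, n_align_groups=5, seed=0):
    """Generate placeholder rows: (item_id, weight, fair_group, align_group)."""
    rng = random.Random(seed)  # a fixed seed keeps the fake data reproducible
    rows = []
    for item_id in range(1, n_items + 1):  # item ids assumed to start from 1
        weight = round(rng.uniform(1.0, 100.0), 2)  # stand-in for an item price
        fair_group = rng.randint(1, n_fair_groups)
        align_group = rng.randint(1, n_align_groups)
        rows.append((item_id, weight, fair_group, align_group))
    return rows


rows = fake_item_meta(n_items=5)

# Write to an in-memory buffer; use open(path, "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["item_id", "weight", "fair_group", "align_group"])
writer.writerows(rows)
print(buffer.getvalue())
```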

In [1]:
import os
import sys
sys.path.append(os.path.abspath("../../"))

from preprocess.download_split_ml100k import prepare_ml100k

prepare_ml100k()

Load raw dataset from /home/v-huangxu/.unirec/dataset/ml-100k.zip
Unzip raw dataset compressed file into /home/v-huangxu/.unirec/dataset
original dataset size: (100000, 4)
filter by rating>=3 dataset size: (82520, 4)
drop_duplicates dataset size: (82520, 4)
Ite: 0, users: 941 / 943, items: 1016 / 1574
Ite: 1, users: 939 / 941, items: 1016 / 1016
Ite: 2, users: 939 / 939, items: 1016 / 1016
k-core filtered dataset size: (80393, 4)
939 1016
size in Train/Valid/Test: (78515, 2) / (939, 2) / (939, 2)
Processed dataset saved in /home/v-huangxu/.unirec/dataset/ml-100k.


True

##### Binary Data File Preparation

Once the interaction files are processed, UniRec requires them to be converted into binary files for faster loading. We provide a tool in the [example folder](../../preprocess/prepare_data.py) to produce the pickle files.

Note that the function `process_transaction_dataset` requires some meta information about the csv files, such as the directory path, the separator, whether the files have a header, the file format and so on. 

UniRec defines several data file formats; you can list all of them with the code below.

In [2]:
# All supported data file formats

from unirec.constants.protocols import DataFileFormat

for format in DataFileFormat.__members__.values():
    print(f"{format}: {format.value}")

DataFileFormat.T1: user-item
DataFileFormat.T2: user-item-label
DataFileFormat.T2_1: user-item-label-session
DataFileFormat.T3: user-item-rating
DataFileFormat.T4: user-item_group-label_group
DataFileFormat.T5: user-item_seq
DataFileFormat.T5_1: user_item_seq
DataFileFormat.T6: user-item_seq-time_seq
DataFileFormat.T7: label-index_group-value_group


In [3]:
import unirec

from preprocess.prepare_data import process_transaction_dataset

binary_data_folder_path = os.path.expanduser("~/.unirec/dataset/binary/")

UNIREC_PATH = os.path.dirname(unirec.__file__)

BINARY_FILE_CONFIG = {
    "raw_datapath": os.path.expanduser("~/.unirec/dataset/ml-100k"), # the dir of csv files
    "outpathroot": binary_data_folder_path, # the output dir of processed binary files
    "dataset_name": "ml-100k", # the dataset name, set as you like
    "example_yaml_file": os.path.join(UNIREC_PATH, "config/dataset/example.yaml"), # Do not modify the value
    "index_by_zero": 0,  # whether the user_id and item_id start from 0
    "sep": "\t" ,   # the separator of csv files 
    "train_file": 'train.csv',  # the filename of training csv file
    "train_file_format": 'user-item', 
    "train_file_has_header": 1, # whether the training file has header
    "train_file_col_names": "['user_id', 'item_id']",  # the columns of training csv file
    "train_neg_k": 0,  
    "valid_file": 'valid.csv', # the filename of validation csv file
    "valid_file_format": 'user-item', 
    "valid_file_has_header": 1, # whether the validation file has header
    "valid_file_col_names": "['user_id', 'item_id']", # the columns of validation csv file
    "valid_neg_k": 0, 
    "test_file": 'test.csv', # the filename of test csv file
    "test_file_format": 'user-item', 
    "test_file_has_header": 1, # whether the test file has header
    "test_file_col_names": "['user_id', 'item_id']", # the columns of test csv file
    "test_neg_k": 0, 
    "user_history_file": 'user_history.csv', # the filename of history csv file
    "user_history_file_format": 'user-item_seq', 
    "user_history_file_has_header": 1, # whether the history file has header
    "user_history_file_col_names": "['user_id', 'item_seq']" # the columns of history csv file
}

In [4]:
process_transaction_dataset(BINARY_FILE_CONFIG)    # the binary files would be saved in `binary_data_folder_path`

data shape of train.csv is (78515, 2)
data dtypes is user_id    int64
item_id    int64
dtype: object
saving train.pkl at 25/10/2023 20:09:34
finish saving train.pkl at 25/10/2023 20:09:34
In saving:
   user_id  item_id
0        1        1
1        1        2
2        1        3
3        1        4
4        1        5
data.shape=(78515, 2)

data shape of valid.csv is (939, 2)
data dtypes is user_id    int64
item_id    int64
dtype: object
saving valid.pkl at 25/10/2023 20:09:34
finish saving valid.pkl at 25/10/2023 20:09:34
In saving:
   user_id  item_id
0        1      211
1        2      252
2        3      278
3        4      285
4        5      140
data.shape=(939, 2)

data shape of test.csv is (939, 2)
data dtypes is user_id    int64
item_id    int64
dtype: object
saving test.pkl at 25/10/2023 20:09:34
finish saving test.pkl at 25/10/2023 20:09:34
In saving:
   user_id  item_id
0        1      212
1        2      253
2        3       12
3        4      286
4        5       31
data.s

In [5]:
# for the item_meta_morec.csv file, we copy it to the binary file path as well
import shutil

shutil.copyfile(os.path.join(BINARY_FILE_CONFIG['raw_datapath'], 'item_meta_morec.csv'), os.path.join(BINARY_FILE_CONFIG['outpathroot'], BINARY_FILE_CONFIG['dataset_name'], 'item_meta_morec.csv'))

'/home/v-huangxu/.unirec/dataset/binary/ml-100k/item_meta_morec.csv'

#### MoRec pretraining stage: accuracy-oriented model training

Since MoRec provides a post-training strategy to convert a single-objective model (usually an accuracy-oriented model) to a multi-objective model, we need to train the accuracy-oriented model first.

1. First, set up the MoRec configuration, including hyperparameters and file paths.
2. Second, train the model with UniRec's user-friendly interface.

In [6]:
import datetime
from copy import deepcopy

ckpt_output_path = os.path.expanduser("~/.unirec/output")

GLOBAL_CONF = {
    'config_dir': f"{os.path.join(UNIREC_PATH, 'config')}",
    'exp_name': '',
    'checkpoint_dir': datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S"),
    'model': 'MF',
    'dataloader': 'BaseDataset',
    'dataset': 'ml-100k',
    'dataset_path': os.path.join(BINARY_FILE_CONFIG['outpathroot'], "ml-100k"),
    'output_path': ckpt_output_path,
    'learning_rate': 0.001,
    'scheduler': None,
    'dropout_prob': 0.0,
    'embedding_size': 32,
    'use_pre_item_emb': 0,
    'loss_type': 'bpr',
    'max_seq_len': 10,
    'has_user_bias': 0,
    'has_item_bias': 0,
    'epochs':100,
    'early_stop': 10,
    'batch_size': 512,
    'n_sample_neg_train': 4,
    'valid_protocol': 'one_vs_all',
    'test_protocol': 'one_vs_all',
    'grad_clip_value': 0.1,
    'weight_decay': 1e-6,
    'history_mask_mode': 'autoagressive',
    'user_history_filename': "user_history",
    'metrics': "['hit@10', 'rhit@10', 'pop-kl@10', 'least-misery']",
    'key_metric': "hit@10",
    'num_workers': 4,
    'num_workers_test': 0,
    'verbose': 2,
    'neg_by_pop_alpha': 0,
    'item_meta_morec_filename': 'item_meta_morec.csv',
    'align_dist_filename': None,  # the expected alignment distribution is set as the distribution derived from the training set
}

In [7]:

# arguments conflict in notebooks, this is only required in notebooks
# (rebuild the list instead of removing items while iterating over it)
sys.argv[:] = [arg for arg in sys.argv if "-f" not in arg]

from unirec.main import main

pretrain_config = deepcopy(GLOBAL_CONF)

pretrain_config['checkpoint_dir'] = 'morec_pretrain_' + pretrain_config['checkpoint_dir']
pretrain_config['exp_name'] = "MoRec-Pretrain"

pretrain_result = main.run(pretrain_config)

print(pretrain_result)

Load configuration files from /anaconda/envs/unirec/lib/python3.9/site-packages/unirec/config


[INFO] MF-MoRec-Pretrain: config={'gpu_id': 0, 'use_gpu': True, 'seed': 2022, 'state': 'INFO', 'verbose': 2, 'saved': True, 'use_tensorboard': False, 'use_wandb': False, 'init_method': 'normal', 'init_std': 0.02, 'init_mean': 0.0, 'scheduler': None, 'scheduler_factor': 0.1, 'time_seq': 0, 'seq_last': False, 'has_user_emb': True, 'has_user_bias': 0, 'has_item_bias': 0, 'use_features': False, 'use_text_emb': False, 'use_position_emb': True, 'load_pretrained_model': False, 'embedding_size': 32, 'hidden_size': 128, 'inner_size': 128, 'dropout_prob': 0.0, 'epochs': 100, 'batch_size': 512, 'learning_rate': 0.001, 'optimizer': 'adam', 'eval_step': 1, 'early_stop': 10, 'clip_grad_norm': None, 'weight_decay': 1e-06, 'num_workers': 4, 'persistent_workers': False, 'pin_memory': False, 'shuffle_train': False, 'use_pre_item_emb': 0, 'loss_type': 'bpr', 'ccl_w': 150, 'ccl_m': 0.4, 'distance_type': 'dot', 'metrics': "['hit@10', 'rhit@10', 'pop-kl@10', 'least-misery']", 'key_metric': 'hit@10', 'test_p

Writing logs to /home/v-huangxu/.unirec/output/MF-MoRec-Pretrain.2023-10-25_200935.85.txt


Evaluate: 100%|██████████| 2/2 [00:02<00:00,  1.23s/it]
[INFO] MF-MoRec-Pretrain: epoch 0 evaluating [time: 2.47s, hit@10: 0.006390]
[INFO] MF-MoRec-Pretrain: complete scores on valid set: 
hit@10:0.006389776357827476 min-hit@10:0.0 min-rhit@10:0.0 pop-kl@10:0.006275205722284025 rhit@10:0.3903478411526107
[INFO] MF-MoRec-Pretrain: Saving best model at epoch 0 to /home/v-huangxu/.unirec/output/morec_pretrain_2023-10-25_20-09-34/MF-MoRec-Pretrain.pth
[INFO] MF-MoRec-Pretrain: 
>> epoch 1
Train: 100%|██████████| 154/154 [00:01<00:00, 116.79it/s]
[INFO] MF-MoRec-Pretrain: epoch 1 training [time: 1.32s, train loss: 106.7443]
[INFO] MF-MoRec-Pretrain: one_vs_all
Evaluate: 100%|██████████| 2/2 [00:00<00:00,  8.40it/s]
[INFO] MF-MoRec-Pretrain: epoch 1 evaluating [time: 0.24s, hit@10: 0.018104]
[INFO] MF-MoRec-Pretrain: complete scores on valid set: 
hit@10:0.01810436634717785 min-hit@10:0.009950248756218905 min-rhit@10:0.7130259394968312 pop-kl@10:0.0069606678917607315 rhit@10:1.1928485233078

Logger close successfully.
{'hit@10': 0.15335463258785942, 'rhit@10': 8.945560041904715, 'pop-kl@10': 0.0014158862010865995, 'min-hit@10': 0.09032258064516129, 'min-rhit@10': 4.952566997895631}


#### MoRec fine-tuning stage: multi-objective model tuning

In this stage, the pretrained model is loaded and then further trained into a multi-objective model. 

Here we only need to set parameters for MoRec.

In [8]:
# MoRec multi-objective post-training (fine-tuning) stage 
morec_config = deepcopy(GLOBAL_CONF)

morec_config['enable_morec'] = 1
morec_config['exp_name'] = 'Morec-Finetune'

# pretrained model file is loaded by the `model_file` argument
morec_config['model_file'] = os.path.join(pretrain_config['output_path'], pretrain_config['checkpoint_dir'], f"{pretrain_config['model']}-{pretrain_config['exp_name']}.pth")
morec_config['checkpoint_dir'] = "morec_finetune_" + morec_config['checkpoint_dir']

# MoRec parameters
morec_config['morec_objectives']=['fairness', 'alignment', 'revenue']
morec_config["morec_ngroup"] = 10
morec_config["morec_alpha"] = 0.01
morec_config["morec_lambda"] = 0.2
morec_config["morec_expect_loss"] = 0.25
morec_config["morec_beta_min"] = 0.1
morec_config["morec_beta_max"] = 1.5
morec_config["morec_K_p"] = 0.05
morec_config["morec_K_i"] = 0.001
morec_config["morec_objective_controller"] = "PID"
morec_config["morec_objective_weights"] = "[0.1,0.1,0.8]"

morec_config["epochs"] = 10
morec_config["early_stop"] = -1

morec_result = main.run(morec_config)

[INFO] MF-Morec-Finetune: config={'gpu_id': 0, 'use_gpu': True, 'seed': 2022, 'state': 'INFO', 'verbose': 2, 'saved': True, 'use_tensorboard': False, 'use_wandb': False, 'init_method': 'normal', 'init_std': 0.02, 'init_mean': 0.0, 'scheduler': None, 'scheduler_factor': 0.1, 'time_seq': 0, 'seq_last': False, 'has_user_emb': True, 'has_user_bias': 0, 'has_item_bias': 0, 'use_features': False, 'use_text_emb': False, 'use_position_emb': True, 'load_pretrained_model': False, 'embedding_size': 32, 'hidden_size': 128, 'inner_size': 128, 'dropout_prob': 0.0, 'epochs': 10, 'batch_size': 512, 'learning_rate': 0.001, 'optimizer': 'adam', 'eval_step': 1, 'early_stop': -1, 'clip_grad_norm': None, 'weight_decay': 1e-06, 'num_workers': 4, 'persistent_workers': False, 'pin_memory': False, 'shuffle_train': False, 'use_pre_item_emb': 0, 'loss_type': 'bpr', 'ccl_w': 150, 'ccl_m': 0.4, 'distance_type': 'dot', 'metrics': "['hit@10', 'rhit@10', 'pop-kl@10', 'least-misery']", 'key_metric': 'hit@10', 'test_pr

Load configuration files from /anaconda/envs/unirec/lib/python3.9/site-packages/unirec/config
Writing logs to /home/v-huangxu/.unirec/output/MF-Morec-Finetune.2023-10-25_201139.61.txt
static weight: [0.1, 0.1, 0.8].


Evaluate:   0%|          | 0/2 [00:00<?, ?it/s]

Evaluate: 100%|██████████| 2/2 [00:00<00:00,  6.06it/s]
[INFO] MF-Morec-Finetune: epoch 0 evaluating [time: 0.33s, hit@10: 0.203408]
[INFO] MF-Morec-Finetune: complete scores on valid set: 
hit@10:0.20340788072417465 min-hit@10:0.17763157894736842 min-rhit@10:12.053354435024316 pop-kl@10:0.0014936021602110078 rhit@10:12.511718904065457
[INFO] MF-Morec-Finetune: Saving best model at epoch 0 to /home/v-huangxu/.unirec/output/morec_finetune_2023-10-25_20-09-34/MF-Morec-Finetune.pth
[INFO] MF-Morec-Finetune: 
>> epoch 1
Train: 100%|██████████| 154/154 [00:03<00:00, 40.42it/s]
[INFO] MF-Morec-Finetune: epoch 1 training [time: 3.81s, train loss: 5.1587]
[INFO] MF-Morec-Finetune: one_vs_all
Evaluate: 100%|██████████| 2/2 [00:00<00:00,  7.04it/s]
[INFO] MF-Morec-Finetune: epoch 1 evaluating [time: 0.30s, hit@10: 0.198083]
[INFO] MF-Morec-Finetune: complete scores on valid set: 
hit@10:0.19808306709265175 min-hit@10:0.17763157894736842 min-rhit@10:11.699364055209445 pop-kl@10:0.0018346320382672

Logger close successfully.
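
The parameter names `morec_K_p`, `morec_K_i`, `morec_expect_loss`, `morec_beta_min` and `morec_beta_max` suggest a proportional-integral style controller with a clamped output. The sketch below shows a generic clamped PI controller using the values set above. It is only an illustration of the control idea: how MoRec actually maps these parameters, the sign of the error and the meaning of the output are assumptions here, and the class `ClampedPIController` is hypothetical. See the paper and the UniRec source for the real coordinator.

```python
class ClampedPIController:
    """Generic PI controller whose output is clamped to [out_min, out_max]."""

    def __init__(self, k_p, k_i, target, out_min, out_max):
        self.k_p = k_p
        self.k_i = k_i
        self.target = target
        self.out_min = out_min
        self.out_max = out_max
        self.integral = 0.0  # running sum of past errors

    def step(self, measured):
        # Assumed sign convention: a loss above the target gives a positive error.
        error = measured - self.target
        self.integral += error
        raw = self.k_p * error + self.k_i * self.integral
        # Clamp so the control signal stays within the allowed range.
        return min(self.out_max, max(self.out_min, raw))


controller = ClampedPIController(
    k_p=0.05, k_i=0.001, target=0.25, out_min=0.1, out_max=1.5
)
first = controller.step(5.0)    # loss far above the target
second = controller.step(0.25)  # loss exactly on target
third = controller.step(100.0)  # extreme loss, output saturates
print(first, second, third)
```

The proportional term reacts to the current gap, the integral term accumulates past gaps, and the clamp keeps the signal from growing without bound when the measured value stays far from the target.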


#### Performance Comparisons

The MoRec framework aims to improve the model's performance in rhit@10, pop-kl@10 and min-hit@10, which represent revenue, alignment and fairness respectively. In the results below, rhit@10 and min-hit@10 improve while pop-kl@10 increases, and the cost in accuracy is small: a 2.08% relative drop in terms of hit@10. 

Note that the details of the metrics used here are given in our [paper](https://arxiv.org/abs/2310.13260v1). Higher values indicate better performance for all metrics except pop-kl, where lower is better.
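
For intuition only, the sketch below computes simplified versions of a hit rate, a weight-scaled hit rate and a worst-group hit rate on toy data. The exact definitions of rhit@10 and min-hit@10 are those in the paper; the formulas here (each hit scaled by the target item's weight, and the minimum hit rate over fairness groups of the target item) are assumptions, and the function names are hypothetical.

```python
from collections import defaultdict


def hit_at_k(ranked_items, target_item, k):
    """Return 1.0 if the held-out item appears in the top-k ranking, else 0.0."""
    return 1.0 if target_item in ranked_items[:k] else 0.0


def summarize(records, k):
    """records: list of (ranked_items, target_item, weight, fair_group)."""
    hits = [hit_at_k(ranked, target, k) for ranked, target, _, _ in records]
    hit = sum(hits) / len(records)
    # Assumed revenue metric: each hit counts with the target item's weight.
    rhit = sum(h * rec[2] for h, rec in zip(hits, records)) / len(records)
    # Assumed fairness metric: the lowest per-group hit rate.
    per_group = defaultdict(list)
    for h, rec in zip(hits, records):
        per_group[rec[3]].append(h)
    min_hit = min(sum(v) / len(v) for v in per_group.values())
    return {"hit": hit, "rhit": rhit, "min-hit": min_hit}


records = [
    ([1, 2, 3], 2, 10.0, "a"),  # hit within the top 2
    ([4, 5, 6], 9, 5.0, "a"),   # target not ranked at all
    ([7, 8, 9], 7, 2.0, "b"),   # hit within the top 2
    ([1, 5, 9], 9, 4.0, "a"),   # target ranked third, so a miss at k=2
]
scores = summarize(records, k=2)
print(scores)
```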

In [10]:
print("Pretrain: ", pretrain_result)
print("Finetune: ", morec_result)

Pretrain:  {'hit@10': 0.15335463258785942, 'rhit@10': 8.945560041904715, 'pop-kl@10': 0.0014158862010865995, 'min-hit@10': 0.09032258064516129, 'min-rhit@10': 4.952566997895631}
Finetune:  {'hit@10': 0.1501597444089457, 'rhit@10': 9.260119967662664, 'pop-kl@10': 0.002535212338356239, 'min-hit@10': 0.0967741935483871, 'min-rhit@10': 5.953895543650172}
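
The relative change of each metric can be checked directly from the two result dictionaries printed above. The snippet below is a small convenience sketch; the helper `relative_change` is not part of UniRec.

```python
pretrain = {
    "hit@10": 0.15335463258785942,
    "rhit@10": 8.945560041904715,
    "pop-kl@10": 0.0014158862010865995,
    "min-hit@10": 0.09032258064516129,
    "min-rhit@10": 4.952566997895631,
}
finetune = {
    "hit@10": 0.1501597444089457,
    "rhit@10": 9.260119967662664,
    "pop-kl@10": 0.002535212338356239,
    "min-hit@10": 0.0967741935483871,
    "min-rhit@10": 5.953895543650172,
}


def relative_change(before, after):
    """Return (after - before) / before for every metric in `before`."""
    return {m: (after[m] - before[m]) / before[m] for m in before}


changes = relative_change(pretrain, finetune)
for metric, change in changes.items():
    print(f"{metric}: {change:+.2%}")
```

Remember when reading the signs that pop-kl is the one metric where a lower value is better.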
