Open Bandit Dataset is a public real-world logged bandit feedback dataset. The dataset is provided by ZOZO, Inc., the largest Japanese fashion e-commerce company, with a market capitalization of over 5 billion USD (as of May 2020). The company uses multi-armed bandit algorithms to recommend fashion items to users on its large-scale fashion e-commerce platform, ZOZOTOWN. The following figure presents examples of displayed fashion items as actions.
Figure: Recommended fashion items as actions in ZOZOTOWN.
We collected the data in a 7-day experiment in late November 2019 on three “campaigns,” corresponding to all, men's, and women's items, respectively. Each campaign randomly used either the Uniform Random algorithm or the Bernoulli Thompson Sampling (Bernoulli TS) algorithm, which was pre-trained for about a month before the data collection period.
A small-sized version of our data is available at ./obd and can be used for running the examples. The full-sized version is available at https://research.zozo.com/data.html; please download it for research use. See ./obd/README.md for a description of the dataset.
Open Bandit Pipeline is a series of implementations of dataset preprocessing, off-policy evaluation (OPE) estimators, and the evaluation of OPE estimators. This pipeline allows researchers to focus on building their own OPE estimator and easily compare it with others' methods in realistic and reproducible ways. Thus, it facilitates reproducible research on bandit algorithms and off-policy evaluation.
Figure: Structure of Open Bandit Pipeline.
Open Bandit Pipeline consists of the following main modules.
- dataset module: This module provides a data loader for Open Bandit Dataset and a flexible interface for handling logged bandit feedback. It also provides tools to generate synthetic bandit datasets.
- policy module: This module provides interfaces for online and offline bandit algorithms. It also implements several standard algorithms.
- simulator module: This module provides functions for conducting offline bandit simulation.
- ope module: This module provides interfaces for OPE estimators. It also implements several standard OPE estimators.
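As a rough orientation, the snippet below shows how these modules are typically imported. It is only a sketch: class and function names not used in the example later in this README (for example, SyntheticBanditDataset and run_bandit_simulation) are assumptions about the library layout and may differ across versions.
# a hedged sketch of the package layout; names not used later in this README are assumptions
from obp.dataset import OpenBanditDataset, SyntheticBanditDataset   # data loader / synthetic data generator
from obp.policy import BernoulliTS, EpsilonGreedy                    # online bandit algorithms
from obp.simulator import run_bandit_simulation                      # offline bandit simulation (assumed name)
from obp.ope import (
    OffPolicyEvaluation,            # summarizes several estimators
    InverseProbabilityWeighting,    # IPW estimator
    DirectMethod,                   # DM estimator
    DoublyRobust,                   # DR estimator
)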
- Bandit Algorithms (implemented in policy module)
  - Online
    - Context-free
      - Random
      - Epsilon Greedy
      - Bernoulli Thompson Sampling
    - Contextual
      - Linear
        - Linear Epsilon Greedy
        - Linear Thompson Sampling [Agrawal and Goyal 2013]
        - Linear Upper Confidence Bound [Li et al. 2010]
      - Logistic
        - Logistic Epsilon Greedy
        - Logistic Thompson Sampling [Chapelle and Li 2011]
        - Logistic Upper Confidence Bound [Mahajan et al. 2012]
  - Offline (Off-Policy Learning) [Dudík et al. 2014]
    - Inverse Probability Weighting
- OPE Estimators (implemented in ope module)
  - Replay Method [Li et al. 2011]
  - Direct Method [Beygelzimer and Langford 2009]
  - Inverse Probability Weighting [Precup et al. 2000] [Strehl et al. 2010]
  - Self-Normalized Inverse Probability Weighting [Swaminathan and Joachims 2015]
  - Doubly Robust [Dudík et al. 2014]
  - Switch Estimator [Wang et al. 2017]
  - More Robust Doubly Robust [Farajtabar et al. 2018]
  - Doubly Robust with Shrinkage [Su et al. 2020]
  - Double Machine Learning [Narita et al. 2020]
In addition to the above algorithms and estimators, the pipeline provides flexible interfaces. Researchers can therefore easily implement their own algorithms or estimators and evaluate them with our data and pipeline. Moreover, the pipeline provides an interface for handling real-world logged bandit feedback datasets, so practitioners can combine their own datasets with the pipeline and easily evaluate the performance of bandit algorithms in their own settings.
Currently, Open Bandit Dataset & Pipeline facilitate evaluation and comparison related to the following research topics.
- Bandit Algorithms: Our dataset includes large-scale logged bandit feedback collected by the uniform random policy. It therefore enables the evaluation of new online bandit algorithms, including contextual and combinatorial algorithms, in a large real-world setting.
- Off-Policy Evaluation: We provide implementations of the behavior policies used during data collection as part of our pipeline, and our open data contain logged bandit feedback generated by multiple behavior policies. This enables the evaluation of off-policy evaluation with ground-truth for the performance of evaluation policies.
You can install OBP using Python's package manager pip:

pip install obp
You can also install OBP from source:
git clone https://github.com/st-tech/zr-obp
cd zr-obp
python setup.py install
OBP requires the following:
- python>=3.7.0
- matplotlib>=3.2.2
- numpy>=1.18.1
- pandas>=0.25.1
- pyyaml>=5.1
- seaborn>=0.10.1
- scikit-learn>=0.23.1
- scipy>=1.4.1
- tqdm>=4.41.1
We show an example of conducting an offline evaluation of the performance of Bernoulli Thompson Sampling (BernoulliTS) as an evaluation policy, using the Inverse Probability Weighting (IPW) estimator and logged bandit feedback generated by the Random policy (the behavior policy). Only about ten lines of code are needed to complete OPE from scratch.
# a case for implementing OPE of the BernoulliTS policy using log data generated by the Random policy
from obp.dataset import OpenBanditDataset
from obp.policy import BernoulliTS
from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting as IPW
# (1) Data loading and preprocessing
dataset = OpenBanditDataset(behavior_policy='random', campaign='all')
bandit_feedback = dataset.obtain_batch_bandit_feedback()
# (2) Offline Bandit Simulation
evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prior=True, # replicate the policy in the ZOZOTOWN production
    campaign="all",
    random_state=12345
)
action_dist = evaluation_policy.compute_batch_action_dist(
    n_sim=100000, n_rounds=bandit_feedback["n_rounds"]
)
# (3) Off-Policy Evaluation
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)
# estimated performance of BernoulliTS relative to the ground-truth performance of Random
relative_policy_value_of_bernoulli_ts = estimated_policy_value['ipw'] / bandit_feedback['reward'].mean()
print(relative_policy_value_of_bernoulli_ts)
1.198126...
A formal introduction with the same example can be found in the quickstart. Below, we explain some important features of the example.
We prepare an easy-to-use data loader for Open Bandit Dataset.
# load and preprocess raw data in "ALL" campaign collected by the Random policy
dataset = OpenBanditDataset(behavior_policy='random', campaign='all')
# obtain logged bandit feedback generated by the behavior policy
bandit_feedback = dataset.obtain_batch_bandit_feedback()
print(bandit_feedback.keys())
dict_keys(['n_rounds', 'n_actions', 'action', 'position', 'reward', 'pscore', 'context', 'action_context'])
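As a rough illustration of how these fields fit together, each key holds data aligned over the logged rounds; the shapes shown below are assumptions for illustration rather than a specification of the API.
# a hedged sketch of inspecting the logged bandit feedback (shapes are assumptions)
n_rounds = bandit_feedback["n_rounds"]
print(bandit_feedback["action"].shape)   # actions chosen by the behavior policy, e.g., (n_rounds,)
print(bandit_feedback["reward"].shape)   # observed binary rewards (clicks), e.g., (n_rounds,)
print(bandit_feedback["pscore"].shape)   # propensity scores of the behavior policy, e.g., (n_rounds,)
print(bandit_feedback["context"].shape)  # user context features, e.g., (n_rounds, dim_context)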
Users can implement their own feature engineering in the pre_process method of the obp.dataset.OpenBanditDataset class. We show an example of implementing some new feature engineering processes in ./examples/examples_with_obd/custom_dataset.py (a related sketch is given below). Moreover, by following the interface of the obp.dataset.BaseBanditDataset class, one can handle future open datasets for bandit algorithms other than our Open Bandit Dataset. The dataset module also provides a class to generate synthetic bandit datasets.
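As a minimal sketch of the feature-engineering customization mentioned above (the concrete processing here is hypothetical and not taken from custom_dataset.py), a custom dataset class might look like the following:
# a minimal sketch of a custom dataset overriding `pre_process`;
# the added feature engineering is left as a hypothetical placeholder
from dataclasses import dataclass
from obp.dataset import OpenBanditDataset

@dataclass
class MyOpenBanditDataset(OpenBanditDataset):
    def pre_process(self) -> None:
        # run the default preprocessing first
        super().pre_process()
        # then add custom feature engineering, e.g., transform `self.context` here

dataset = MyOpenBanditDataset(behavior_policy="random", campaign="all")
bandit_feedback = dataset.obtain_batch_bandit_feedback()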
After preparing the dataset, we compute the action choice probabilities of the BernoulliTS policy used in ZOZOTOWN production and use them as the evaluation policy.
# define evaluation policy (the Bernoulli TS policy here)
# by activating the `is_zozotown_prior` argument of BernoulliTS, we can replicate BernoulliTS used in ZOZOTOWN production.
evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prior=True, # replicate the BernoulliTS policy in the ZOZOTOWN production
    campaign="all",
    random_state=12345
)
# compute the distribution over actions by the evaluation policy using Monte Carlo simulation
# action_dist is an array of shape (n_rounds, n_actions, len_list)
# representing the distribution over actions made by the evaluation policy
action_dist = evaluation_policy.compute_batch_action_dist(
    n_sim=100000, n_rounds=bandit_feedback["n_rounds"]
)
When is_zozotown_prior=False, a non-informative prior distribution is used. The compute_batch_action_dist method of BernoulliTS computes the action choice probabilities based on the given hyperparameters of the beta distribution. action_dist is an array representing the distribution over actions made by the evaluation policy. Users can implement their own bandit algorithms by following the interfaces implemented in ./obp/policy/base.py.
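For instance, a new context-free policy could be sketched as below. This assumes that the base classes in ./obp/policy/base.py expose a select_action / update_params interface and a random_ attribute, mirroring the implemented algorithms; those details are assumptions and should be checked against base.py.
# a rough sketch of a custom context-free policy
# (the base-class interface assumed here should be checked against ./obp/policy/base.py)
from dataclasses import dataclass
import numpy as np
from obp.policy.base import BaseContextFreePolicy

@dataclass
class UniformOverTopActions(BaseContextFreePolicy):
    """Hypothetical policy: recommend uniformly among the first `n_top_actions` arms."""
    n_top_actions: int = 10

    def select_action(self) -> np.ndarray:
        # return `len_list` distinct actions drawn uniformly from the top arms
        return self.random_.choice(self.n_top_actions, size=self.len_list, replace=False)

    def update_params(self, action: int, reward: float) -> None:
        # this toy policy ignores feedback and never updates its parameters
        pass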
Our final step is off-policy evaluation (OPE), which attempts to estimate the performance of a decision making policy using the logged bandit feedback and the action distribution computed above. Our pipeline also provides an easy procedure for doing OPE as follows.
# estimate the policy value of BernoulliTS based on the distribution over actions by that policy
# it is possible to set multiple OPE estimators to the `ope_estimators` argument
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)
print(estimated_policy_value)
{'ipw': 0.004553...} # dictionary containing estimated policy values by each OPE estimator.
# compare the estimated performance of BernoulliTS (evaluation policy)
# with the ground-truth performance of Random (behavior policy)
relative_policy_value_of_bernoulli_ts = estimated_policy_value['ipw'] / bandit_feedback['reward'].mean()
# our OPE procedure suggests that BernoulliTS improves Random by 19.81%
print(relative_policy_value_of_bernoulli_ts)
1.198126...
Users can implement their own OPE estimator by following the interface of the obp.ope.BaseOffPolicyEstimator class. The obp.ope.OffPolicyEvaluation class summarizes and compares the policy values estimated by several OPE estimators. A detailed usage of this class can be found in the quickstart. bandit_feedback['reward'].mean() is the empirical mean of the factual rewards (the on-policy estimate of the policy value) in the log and thus is the ground-truth performance of the behavior policy (the Random policy in this example).
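To make concrete what the IPW estimate above represents, the following sketch reproduces it by hand from bandit_feedback and action_dist. It mirrors the standard IPW formula rather than the library's internal implementation, so treat it as an illustration only.
# a hand-rolled IPW estimate for illustration (standard formula, not the library internals)
import numpy as np

rounds = np.arange(bandit_feedback["n_rounds"])
# probability that the evaluation policy chooses the logged action at the logged position
pi_e = action_dist[rounds, bandit_feedback["action"], bandit_feedback["position"]]
# importance weights relative to the behavior policy's propensity scores
importance_weights = pi_e / bandit_feedback["pscore"]
manual_ipw_estimate = np.mean(importance_weights * bandit_feedback["reward"])
# relative policy value with respect to the on-policy value of the Random policy
print(manual_ipw_estimate / bandit_feedback["reward"].mean())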
If you use our dataset and pipeline in your work, please cite our paper:
Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. Large-scale Open Dataset, Pipeline, and Benchmark for Bandit Algorithms. https://arxiv.org/abs/2008.07146
Bibtex:
@article{saito2020large,
  title={Large-scale Open Dataset, Pipeline, and Benchmark for Bandit Algorithms},
  author={Saito, Yuta and Shunsuke Aihara and Megumi Matsutani and Yusuke Narita},
  journal={arXiv preprint arXiv:2008.07146},
  year={2020}
}
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
- Yuta Saito (Main Contributor; Hanjuku-kaso Co., Ltd. / Tokyo Institute of Technology)
- Shunsuke Aihara (ZOZO Technologies, Inc.)
- Megumi Matsutani (ZOZO Technologies, Inc.)
- Yusuke Narita (Hanjuku-kaso Co., Ltd. / Yale University)
- Alina Beygelzimer and John Langford. The Offset Tree for Learning with Partial Labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 129–138, 2009.
- Olivier Chapelle and Lihong Li. An Empirical Evaluation of Thompson Sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.
- Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297–306, 2011.
- Alex Strehl, John Langford, Lihong Li, and Sham M Kakade. Learning from Logged Implicit Exploration Data. In Advances in Neural Information Processing Systems, pages 2217–2225, 2010.
- Doina Precup, Richard S. Sutton, and Satinder Singh. Eligibility Traces for Off-Policy Policy Evaluation. In Proceedings of the 17th International Conference on Machine Learning, pages 759–766, 2000.
- Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Doubly Robust Policy Evaluation and Optimization. Statistical Science, 29:485–511, 2014.
- Adith Swaminathan and Thorsten Joachims. The Self-normalized Estimator for Counterfactual Learning. In Advances in Neural Information Processing Systems, pages 3231–3239, 2015.
- Dhruv Kumar Mahajan, Rajeev Rastogi, Charu Tiwari, and Adway Mitra. LogUCB: An Explore-Exploit Algorithm for Comments Recommendation. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 6–15, 2012.
- Lihong Li, Wei Chu, John Langford, Taesup Moon, and Xuanhui Wang. An Unbiased Offline Evaluation of Contextual Bandit Algorithms with Generalized Linear Models. In Journal of Machine Learning Research: Workshop and Conference Proceedings, volume 26, pages 19–36, 2012.
- Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudik. Optimal and Adaptive Off-policy Evaluation in Contextual Bandits. In Proceedings of the 34th International Conference on Machine Learning, pages 3589–3597, 2017.
- Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More Robust Doubly Robust Off-policy Evaluation. In Proceedings of the 35th International Conference on Machine Learning, pages 1447–1456, 2018.
- Nathan Kallus and Masatoshi Uehara. Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning. In Advances in Neural Information Processing Systems, 2019.
- Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. Doubly Robust Off-policy Evaluation with Shrinkage. In Proceedings of the 37th International Conference on Machine Learning, 2020.
- Yusuke Narita, Shota Yasui, and Kohei Yata. Off-policy Bandit and Reinforcement Learning. arXiv preprint arXiv:2002.08536, 2020.
- Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for Machine Learning on Graphs. arXiv preprint arXiv:2005.00687, 2020.
This project is strongly inspired by Open Graph Benchmark, a collection of benchmark datasets, data loaders, and evaluators for graph machine learning: [github] [project page] [paper].