Change default sampling algorithm #2964
Benchmark Task Candidates and Scenarios
Now, we have the following benchmark task candidates, and we would like to consider the following scenarios for each benchmark task.
I have collected exhaustive benchmark results to review Optuna's sampler defaults. The results and discussion are described here. Note that the experiments were conducted targeting single-objective optimization; the benchmark environment for multi-objective optimization is still under development, and collecting exhaustive benchmark results for it is future work. The following three types of experiments were performed.
TL;DR
Summary of Benchmark Environment
During the development of Optuna v3, we have prepared several backend benchmark scripts in optuna/benchmarks. Thank you, @kei-mo, @xadrianzetx, @drumehiron!!! These benchmark scripts can be run locally or on GitHub Actions. Since the experiments we conducted this time dealt with a very large number of studies, trials, problems, and samplers, it was difficult to complete all of them on GitHub Actions, so we ran them on a computational cluster owned by Preferred Networks, Inc. However, you can easily reproduce our experiments using our benchmark scripts and the GitHub Actions runtime environment (of course, due to GitHub Actions resource limitations, this will be a partial reproduction). After reading the details of the experiments described below, if you have any doubts about their reproducibility, you can always try to reproduce them. Also, if you are interested in settings not covered by the following experiments, we encourage you to conduct your own. For actual instructions on how to perform them, please click here.

Summary of Benchmark Experiment

Study and Trial
We ran 1000 trials for each combination of problem and sampler to obtain a study; for BoTorchSampler alone, we ran only 100 trials. We evaluated 100 studies for each configuration. (A minimal sketch of this protocol appears after the Problem Table below.)

Problems
We have prepared a total of 178 problems, divided into four main types: hpobench, nasbench, bayesmark, and sigopt. The backend for hpobench, nasbench's NASBench-101, and sigopt is software called kurobako, which is part of the Optuna organization. The backend for nasbench's NASBench-201 is software called naslib, and bayesmark's backend is software called bayesmark. The breakdown is 4 hpobench, 4 nasbench, 35 bayesmark, and 135 sigopt problems. Regarding higher-dimensional problems, i.e. those with 10 or more dimensions, sigopt contains 14 problems each with 10, 30, 50, and 100 dimensions, and one each with 11 and 12 dimensions. The table below summarizes the characteristics of each problem.

Problem Table
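To make the study-and-trial protocol above concrete, here is a minimal sketch of it in plain Optuna. The toy objective and the two-sampler list are placeholders: the actual experiments run 178 problems through the kurobako, naslib, and bayesmark backends.

```python
import optuna

optuna.logging.set_verbosity(optuna.logging.WARNING)


# Hypothetical stand-in: the real experiments use 178 problems served
# through the kurobako, naslib, and bayesmark backends, not this toy.
def toy_objective(trial: optuna.Trial) -> float:
    x = trial.suggest_float("x", -10, 10)
    return x**2


SAMPLERS = {
    "random": optuna.samplers.RandomSampler,
    "tpe": optuna.samplers.TPESampler,
}
N_STUDIES = 100  # 100 studies per (problem, sampler) configuration
N_TRIALS = 1000  # 1000 trials per study (100 for BoTorchSampler);
                 # reduce both constants for a quick local run.

# best_values[(sampler_name, study_index)] holds the best value of each study.
best_values = {}
for name, sampler_cls in SAMPLERS.items():
    for seed in range(N_STUDIES):
        study = optuna.create_study(sampler=sampler_cls(seed=seed))
        study.optimize(toy_objective, n_trials=N_TRIALS)
        best_values[(name, seed)] = study.best_value
```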
Samplers
Optuna has a variety of samplers implemented, but here we focused our experiments on five of the most promising ones for practical use: TPESampler, CmaEsSampler, QMCSampler, BoTorchSampler, and RandomSampler. GridSampler was excluded from the experiments for implementation reasons, and NSGAIISampler was excluded because it is mainly used for multi-objective optimization. Each sampler has a variety of arguments; in this experiment we focused on those that are most likely to affect performance, or whose influence on the algorithm's behavior is not well understood. For each sampler, the tables below summarize the arguments that were varied in this experiment. For more detailed information, please refer to the Optuna documentation: https://optuna.readthedocs.io/en/latest/reference/samplers/index.html

TPESampler
CmaEsSampler
QMCSampler
BoTorchSampler
RandomSampler

Metrics
For each of n_trials = 25, 50, 75, 100, and 1000, the evaluation is performed after that number of trials has been run. The winning matrix is an n_problems x n_samplers matrix whose (i, j) component is a non-negative integer indicating how many times sampler j beat the other samplers on problem i. "Sampler j beat sampler k (j != k) on problem i" means the following.
Using the constructed winning matrix, the score of each sampler j is evaluated in the following three ways. (A sketch of how scores can be derived from such a matrix appears after the result list below.)

Result of First Experiment

Summary
Roughly, we conclude that TPE is stronger on the HPO- and NAS-based tasks, BoTorch and CMA-ES are stronger on the sigopt function-based tasks, and QMC is stronger on the high-dimensional problems (d >= 10) among the sigopt function-based tasks. The results do not change significantly with n_trials. To explain the differences between problem types: TPE is strong on HPO/NAS tasks such as kurobako and naslib; univariate TPE is strong at n=25, but its performance difference from multivariate TPE is small, and in the other cases multivariate TPE is stronger. TPE is superior on the bayesmark HPO tasks, and QMC (Sobol') is also superior at n=25; there is almost no performance difference between univariate and multivariate TPE, and which one wins depends on n_trials. BoTorch is strong on the sigopt function tasks, and CMA-ES is also strong, though not as strong as BoTorch; at n=1000, CMA-ES is the strongest, since there are no results for BoTorch there. QMC (Sobol') is strong on the high-dimensional (>= 10d) sigopt function problems, and CMA-ES is also strong, though not as strong as QMC (Sobol'). Since whether univariate or multivariate TPE is better depends on n_trials and the type of problem, it is necessary to consider which types of problem you want to prioritize.
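For reference, the sampler configurations discussed above can be instantiated as follows. This is a minimal sketch using Optuna's public API; it shows only the headline options (multivariate, qmc_type), not the full grid of arguments varied in the experiments.

```python
import optuna
from optuna.integration import BoTorchSampler  # requires the botorch package

# Univariate (default) vs. multivariate TPE, as compared in the summary above.
univariate_tpe = optuna.samplers.TPESampler()
multivariate_tpe = optuna.samplers.TPESampler(multivariate=True)

# The other samplers included in the experiments.
cmaes = optuna.samplers.CmaEsSampler()
qmc_sobol = optuna.samplers.QMCSampler(qmc_type="sobol")  # "QMC(sobol)" above
botorch = BoTorchSampler()
random_sampler = optuna.samplers.RandomSampler()

# Any of them can be passed to a study, e.g.:
study = optuna.create_study(sampler=multivariate_tpe)
```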
Below are the settings and the scores (univariate, multivariate) for each of them.

Details
First, as a primary experiment, we fixed only the

Benchmark results: n_trials = 25
Benchmark results: n_trials = 50
Benchmark results: n_trials = 75
Benchmark results: n_trials = 100
Benchmark results: n_trials = 1000
The results of this experiment showed that changing some of the arguments had no significant effect on performance. We fixed them at their defaults and reexamined the results. In all, 10 different samplers were compared. In this experiment, we included all problems from kurobako, bayesmark, and sigopt. To make the results easier to understand, we tabulated the scores for each problem and the results are shown below. Benchmark results for kurobako: n_trials = 25
Benchmark results for bayesmark: n_trials = 25
Benchmark results for sigopt: n_trials = 25
Benchmark results for sigopt >= 10d: n_trials = 25
Benchmark results for all: n_trials = 25
Benchmark results for kurobako: n_trials = 50
Benchmark results for bayesmark: n_trials = 50
Benchmark results for sigopt: n_trials = 50
Benchmark results for sigopt >= 10d: n_trials = 50
Benchmark results for all: n_trials = 50
Benchmark results for kurobako: n_trials = 75
Benchmark results for bayesmark: n_trials = 75
Benchmark results for sigopt: n_trials = 75
Benchmark results for sigopt >= 10d: n_trials = 75
Benchmark results for all: n_trials = 75
Benchmark results for kurobako: n_trials = 100
Benchmark results for bayesmark: n_trials = 100
Benchmark results for sigopt: n_trials = 100
Benchmark results for sigopt >= 10d: n_trials = 100
Benchmark results for all: n_trials = 100
Benchmark results for kurobako: n_trials = 1000
Benchmark results for bayesmark: n_trials = 1000
Benchmark results for sigopt: n_trials = 1000
Benchmark results for sigopt >= 10d: n_trials = 1000
Benchmark results for all: n_trials = 1000
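As a reference for how the scores behind these tables can be computed, here is a minimal sketch of building a winning matrix and a simple score from it. The "beat" criterion below (comparing mean best values across studies) is a simplified stand-in, and all array shapes and values are hypothetical.

```python
import numpy as np

# Hypothetical input: best_values[i, j, s] is the best objective value reached
# by sampler j on problem i in study s (lower is better).
n_problems, n_samplers, n_studies = 5, 4, 100
rng = np.random.default_rng(seed=0)
best_values = rng.random((n_problems, n_samplers, n_studies))

# winning_matrix[i, j] counts how many samplers k (k != j) sampler j beat
# on problem i. Here "beat" is simplified to a lower mean best value.
mean_best = best_values.mean(axis=2)  # shape: (n_problems, n_samplers)
winning_matrix = np.zeros((n_problems, n_samplers), dtype=int)
for i in range(n_problems):
    for j in range(n_samplers):
        for k in range(n_samplers):
            if j != k and mean_best[i, j] < mean_best[i, k]:
                winning_matrix[i, j] += 1

# One possible per-sampler score: total wins across all problems.
scores = winning_matrix.sum(axis=0)
```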
Result of Second Experiment

Summary
The TPESampler algorithm differs greatly depending on whether the multivariate option is enabled, so we evaluated its influence on speed.

Details
We used asv. After installing asv, you can run the speed benchmark by running `asv run`.

Speed benchmark script

```python
from typing import cast
from typing import List
from typing import Union

import optuna
from optuna.samplers import BaseSampler
from optuna.samplers import CmaEsSampler
from optuna.samplers import RandomSampler
from optuna.samplers import TPESampler
from optuna.testing.storages import StorageSupplier


def parse_args(args: str) -> List[Union[int, str]]:
    ret: List[Union[int, str]] = []
    for arg in map(lambda s: s.strip(), args.split(",")):
        try:
            ret.append(int(arg))
        except ValueError:
            ret.append(arg)
    return ret


SAMPLER_MODES = [
    "random",
    "tpe",
    "multivariate_tpe",
    "cmaes",
]


def create_sampler(sampler_mode: str) -> BaseSampler:
    if sampler_mode == "random":
        return RandomSampler()
    elif sampler_mode == "tpe":
        return TPESampler()
    elif sampler_mode == "multivariate_tpe":
        return TPESampler(multivariate=True)
    elif sampler_mode == "cmaes":
        return CmaEsSampler()
    else:
        assert False


class OptimizeSuite:
    def objective(self, trial: optuna.Trial) -> float:
        # x = trial.suggest_float("x", -100, 100)
        # y = trial.suggest_int("y", -100, 100)
        # return x**2 + y**2
        ret = 0
        for i in range(100):
            ret += trial.suggest_float(f"x{i}", -100, 100) ** 2
        return ret

    def optimize(self, storage_mode: str, sampler_mode: str, n_trials: int) -> None:
        with StorageSupplier(storage_mode) as storage:
            sampler = create_sampler(sampler_mode)
            study = optuna.create_study(storage=storage, sampler=sampler)
            study.optimize(self.objective, n_trials=n_trials)

    def time_optimize(self, args: str) -> None:
        storage_mode, sampler_mode, n_trials = parse_args(args)
        storage_mode = cast(str, storage_mode)
        sampler_mode = cast(str, sampler_mode)
        n_trials = cast(int, n_trials)
        self.optimize(storage_mode, sampler_mode, n_trials)

    params = (
        # "inmemory, random, 1000",
        # "inmemory, random, 10000",
        "inmemory, tpe, 1000",
        "inmemory, multivariate_tpe, 1000",
        # "inmemory, cmaes, 1000",
        # "sqlite, random, 1000",
        # "cached_sqlite, random, 1000",
        # # Following benchmarks use fakeredis instead of Redis.
        # "redis, random, 1000",
        # "cached_redis, random, 1000",
    )
    param_names = ["storage, sampler, n_trials"]
    timeout = 600
```

The following is the output.
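As a lighter-weight alternative to the asv suite above, the same two TPE configurations can also be timed directly through Optuna's public API. A minimal sketch (absolute times depend heavily on the environment):

```python
import time

import optuna

optuna.logging.set_verbosity(optuna.logging.WARNING)


def objective(trial: optuna.Trial) -> float:
    # Same 100-dimensional quadratic as in the asv suite above.
    return sum(trial.suggest_float(f"x{i}", -100, 100) ** 2 for i in range(100))


for multivariate in (False, True):
    study = optuna.create_study(
        sampler=optuna.samplers.TPESampler(multivariate=multivariate, seed=0)
    )
    start = time.perf_counter()
    study.optimize(objective, n_trials=200)
    print(f"multivariate={multivariate}: {time.perf_counter() - start:.1f} s")
```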
Result of Third Experiment
The samplers were limited to eight: four variants of TPE (with and without multivariate, crossed with and without constant_liar), CMA-ES, QMC, BoTorch, and Random.

Summary
For kurobako, TPE was found to be consistently strong. (A sketch of the concurrent-optimization setting these experiments model appears after the result list below.)

Details

Benchmark results for kurobako: n_trials = 25, n_concurrency = 5
Benchmark results for sigopt: n_trials = 25, n_concurrency = 5
Benchmark results for sigopt >=10d: n_trials = 25, n_concurrency = 5
Benchmark results for all: n_trials = 25, n_concurrency = 5
Benchmark results for kurobako: n_trials = 25, n_concurrency = 10
Benchmark results for sigopt: n_trials = 25, n_concurrency = 10
Benchmark results for sigopt >=10d: n_trials = 25, n_concurrency = 10
Benchmark results for all: n_trials = 25, n_concurrency = 10
Benchmark results for kurobako: n_trials = 25, n_concurrency = 50
Benchmark results for sigopt: n_trials = 25, n_concurrency = 50
Benchmark results for sigopt >=10d: n_trials = 25, n_concurrency = 50
Benchmark results for all: n_trials = 25, n_concurrency = 50
Benchmark results for kurobako: n_trials = 50, n_concurrency = 5
Benchmark results for sigopt: n_trials = 50, n_concurrency = 5
Benchmark results for sigopt >=10d: n_trials = 50, n_concurrency = 5
Benchmark results for all: n_trials = 50, n_concurrency = 5
Benchmark results for kurobako: n_trials = 50, n_concurrency = 10
Benchmark results for sigopt: n_trials = 50, n_concurrency = 10
Benchmark results for sigopt >=10d: n_trials = 50, n_concurrency = 10
Benchmark results for all: n_trials = 50, n_concurrency = 10
Benchmark results for kurobako: n_trials = 50, n_concurrency = 50
Benchmark results for sigopt: n_trials = 50, n_concurrency = 50
Benchmark results for sigopt >=10d: n_trials = 50, n_concurrency = 50
Benchmark results for all: n_trials = 50, n_concurrency = 50
Benchmark results for kurobako: n_trials = 75, n_concurrency = 5
Benchmark results for sigopt: n_trials = 75, n_concurrency = 5
Benchmark results for sigopt >=10d: n_trials = 75, n_concurrency = 5
Benchmark results for all: n_trials = 75, n_concurrency = 5
Benchmark results for kurobako: n_trials = 75, n_concurrency = 10
Benchmark results for sigopt: n_trials = 75, n_concurrency = 10
Benchmark results for sigopt >=10d: n_trials = 75, n_concurrency = 10
Benchmark results for all: n_trials = 75, n_concurrency = 10
Benchmark results for kurobako: n_trials = 75, n_concurrency = 50
Benchmark results for sigopt: n_trials = 75, n_concurrency = 50
Benchmark results for sigopt >=10d: n_trials = 75, n_concurrency = 50
Benchmark results for all: n_trials = 75, n_concurrency = 50
Benchmark results for kurobako: n_trials = 100, n_concurrency = 5
Benchmark results for sigopt: n_trials = 100, n_concurrency = 5
Benchmark results for sigopt >=10d: n_trials = 100, n_concurrency = 5
Benchmark results for all: n_trials = 100, n_concurrency = 5
Benchmark results for kurobako: n_trials = 100, n_concurrency = 10
Benchmark results for sigopt: n_trials = 100, n_concurrency = 10
Benchmark results for sigopt >=10d: n_trials = 100, n_concurrency = 10
Benchmark results for all: n_trials = 100, n_concurrency = 10
Benchmark results for kurobako: n_trials = 100, n_concurrency = 50
Benchmark results for sigopt: n_trials = 100, n_concurrency = 50
Benchmark results for sigopt >=10d: n_trials = 100, n_concurrency = 50
Benchmark results for all: n_trials = 100, n_concurrency = 50
Benchmark results for kurobako: n_trials = 1000, n_concurrency = 5
Benchmark results for sigopt: n_trials = 1000, n_concurrency = 5
Benchmark results for sigopt >=10d: n_trials = 1000, n_concurrency = 5
Benchmark results for all: n_trials = 1000, n_concurrency = 5
Benchmark results for kurobako: n_trials = 1000, n_concurrency = 10
Benchmark results for sigopt: n_trials = 1000, n_concurrency = 10
Benchmark results for sigopt >=10d: n_trials = 1000, n_concurrency = 10
Benchmark results for all: n_trials = 1000, n_concurrency = 10
Benchmark results for kurobako: n_trials = 1000, n_concurrency = 50
Benchmark results for sigopt: n_trials = 1000, n_concurrency = 50
Benchmark results for sigopt >=10d: n_trials = 1000, n_concurrency = 50
Benchmark results for all: n_trials = 1000, n_concurrency = 50
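For context on the n_concurrency setting in these tables: constant_liar is the TPESampler argument aimed at exactly this concurrent regime. Here is a minimal sketch of a concurrent study; n_jobs-based threading is a simple stand-in for the distributed workers used in the experiments.

```python
import optuna


def objective(trial: optuna.Trial) -> float:
    x = trial.suggest_float("x", -10, 10)
    y = trial.suggest_float("y", -10, 10)
    return x**2 + y**2


# constant_liar=True tells TPE to penalize parameter regions that have
# running trials, which helps when several trials run concurrently.
sampler = optuna.samplers.TPESampler(multivariate=True, constant_liar=True)
study = optuna.create_study(sampler=sampler)

# n_concurrency = 5 in the experiments roughly corresponds to five workers
# pulling trials at once; n_jobs is a simple thread-based stand-in.
study.optimize(objective, n_trials=100, n_jobs=5)
```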
Conclusion
We conducted an exhaustive benchmark experiment to investigate changing Optuna's default sampler as one of the development items for Optuna v3. While we did not obtain enough experimental evidence to change the default, the experiments were a great help in understanding the current behavior of the algorithms, and we hope they will also help users choose samplers. In the future, we will work on improving the behavior of the algorithms by further improving the benchmarking environment. In particular, we are planning to improve TPESampler as soon as possible, so please stay tuned!
Let me close this issue since we have stopped considering changing the default sampling algorithm. Note that we still have a lot of TODOs for Optuna's algorithms. The following are examples.
Motivation
The default sampler, `TPESampler`, has options expected to improve optimization performance in many situations. For Optuna v3, we consider changing the default options to enjoy effective algorithms.

Description
- `multivariate`, `group`, and `constant_liar`
- `n_startup_trials` for `constant_liar` option #3908
- `multivariate`, `group`, and `constant_liar`
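For concreteness, the options listed above are all arguments of `TPESampler`. A minimal sketch of enabling them together (whether these should become the defaults is exactly what this issue left open):

```python
import optuna

# The three TPESampler options discussed in this issue, enabled explicitly.
sampler = optuna.samplers.TPESampler(
    multivariate=True,   # model the joint distribution of parameters
    group=True,          # split parameters into independent groups (requires multivariate=True)
    constant_liar=True,  # penalize running trials to help parallel optimization
)
study = optuna.create_study(sampler=sampler)
```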