Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hyperband and non-linear runtime scale #13

Open
berndbischl opened this issue Oct 1, 2019 · 4 comments
Open

hyperband and non-linear runtime scale #13

berndbischl opened this issue Oct 1, 2019 · 4 comments
Assignees

Comments

@berndbischl
Copy link
Sponsor Member

i am pretty sure, all of the HB formulas dont hold anymore if the tuned algo does not scale linear in runtime with eta.
this is especially true if we use the subsampling trick. we at least need to document this, but also maybe adapt a bit.

this is a more complicated issue, and needs to be discussed in the team
of course, would be great if people already post observations and thoughts here

@SebGGruber
Copy link
Contributor

SebGGruber commented Oct 8, 2019

Documenting appropriately is a no-brainer - I already did that today.
But maybe give an additional argument specifying the runtime complexity of the learner w.r.t. the chosen budget in big-o notation and update the HB formula accordingly, so the first and the last bracket require approximately the same runtime again.

@SebGGruber
Copy link
Contributor

SebGGruber commented Oct 8, 2019

I had some further thoughts about it and if we simply apply the inverse of the complexity on each budget, we should end up with the same budget across all brackets again.
A simple example for O(f(budget)) with f(budget) = log(budget) :

The inverse then would be g := f^(-1) = exp

Assume we are using R = 2 and eta = 2, the brackets layout would look like this:

 bracket | bracket_stage | budget | n.configs | "real" runtime (ignoring constants) = f(budget)
    1             1           1          2          log(1) = 0
    1             2           2          1          log(2) = 0.693
    2             1           2          2          log(2) = 0.693

In this example, the usual hyperband algorithm always takes a lot longer to run (theoretically) in the last bracket compared to the first ones.
How to fix this? Apply the inverse of the complexity function on each budget in each bracket stage:

 bracket | bracket_stage | budget = g(budget_old) | n.configs | "real" runtime (ignoring constants)
   1              1            exp(1) = 0               2           log(0) = 1
   1              2            exp(2) = 7.39            1           log(7.39) = 2
   2              1            exp(2) = 7.39            2           log(7.39) = 2

Now, the sum of the budgets is (approximately) equal across all brackets, again.

The only issue I can see here is, that the user specifies R = 2, but the algorithm then uses a budget of 7.39 for the last bracket stage. This is misleading? But at least the budgets are scaled correctly

@SebGGruber SebGGruber self-assigned this Oct 8, 2019
@SebGGruber
Copy link
Contributor

SebGGruber commented Nov 12, 2019

should work with the param trafo.
I added an example to the "examples" section of the class doc.
It's a dumb example, but it shows how it would work and I didnt know better

set.seed(123)
ps = ParamSet$new(list(
  ParamInt$new("nrounds", lower = 1, upper = 10, tags = "budget"),
  ParamDbl$new("eta",     lower = 0, upper = 1),
  ParamFct$new("booster", levels = c("gbtree", "gblinear", "dart"))
))

ps$trafo = function(x, param_set) {
  x$nrounds = round(log(x$nrounds)) + 1L
  return(x)
}

inst = TuningInstance$new(
   tsk("iris"),
   lrn("classif.xgboost"),
   rsmp("holdout"),
   msr("classif.ce"),
   ps,
   term("evals", n_evals = 100000)
)
tuner = TunerHyperband$new(eta = 2)
tuner$tune(inst)
inst$best()

@juliambr
Copy link
Contributor

juliambr commented May 25, 2020

From a theoretical perspective, I think this issue is not a problem. The only assumption the hyperband paper makes is that the algorithm performance converges with a budget parameter going to infty.

"Adapting for convergence rates" is another topic, and is also given as outlook in the original hyperband paper.
The issue can be closed IMO, but another interesting research question can be opened.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants