agent: Scaling logic refactor #371
Conversation
can't wait to see the test ;)
Status update on this: broadly, the code seems to be working, even though it's based on main from back in July. Next steps are squashing, rebasing/merging onto main, and polishing it up (removing older stuff that's now unnecessary, like some fields in …). Before that, it's probably best to merge #506; it should make things on this PR simpler (even though merging/rebasing will be more complicated).
At a very high level, this work replaces (*Runner).handleVMResources(), moving from an imperative style to an explicit state machine. This new version is more complicated, but ultimately more flexible and easier to extend.

The decision-making "core" of the scaling logic is implemented by (*core.State).NextActions(), which returns an ActionSet indicating what the "caller" should do. NextActions() is a pure function, making this easier to test - at least, in theory.

That method is called and cached by executor.ExecutorCore, where there are a few different threads (each defined in exec_*.go) responsible for implementing the communications with the other components - namely, the scheduler plugin, vm-informant, and NeonVM k8s API. The various "executor" threads are written generically, using dedicated interfaces (e.g. PluginInterface / PluginHandle) that are implemented in pkg/agent/execbridge.go.
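The "pure core, impure executor" split described above might look roughly like the following sketch. The names State, ActionSet, and NextActions come from the PR description, but all fields and the decision rules here are invented for illustration; the real logic is far more involved.

```go
// Hypothetical sketch of the pure-state-machine pattern described above.
// Fields and decision rules are invented for illustration only.
package main

import "fmt"

// ActionSet tells the caller what to do next; nil fields mean "do nothing".
type ActionSet struct {
	PluginRequest *int // resources to request from the scheduler plugin
	NeonVMRequest *int // resources to set via the NeonVM k8s API
}

// State holds everything the decision logic needs. NextActions never
// mutates it, which is what makes the decision-making pure and testable.
type State struct {
	Desired  int // resources we want
	Approved int // resources the scheduler plugin has approved
	Current  int // resources the VM actually has
}

// NextActions is pure: the same State always yields the same ActionSet.
func (s State) NextActions() ActionSet {
	var actions ActionSet
	if s.Desired > s.Approved {
		// Scaling up requires plugin approval first.
		req := s.Desired
		actions.PluginRequest = &req
	} else if target := minInt(s.Desired, s.Approved); s.Current != target {
		// Within the approved bound, adjust the VM directly.
		actions.NeonVMRequest = &target
	}
	return actions
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	s := State{Desired: 4, Approved: 2, Current: 2}
	a := s.NextActions()
	fmt.Println(*a.PluginRequest) // upscale must go through the plugin
}
```

Because NextActions takes no locks and does no I/O, tests can feed it a State and assert on the returned ActionSet directly, while the executor threads translate ActionSet fields into actual requests.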
Squashed, and now merged main -> here
This commit was actually really fun to write. By doing it this way, we actually automatically get "retry with a slightly larger downscale" for free, and hooking everything up was super simple! Also, while we're here, it's worth adding back the other tests from before this PR, now that they pass.
tl;dr of the difference is that require calls t.FailNow(), so we stop at the first error, which is more in line with what we want.
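The distinction (as in testify's assert vs. require packages) can be illustrated without the library itself; this is a minimal stand-in, not testify's actual implementation:

```go
// Minimal illustration of the assert-vs-require distinction: "assert"
// records a failure and keeps going, while "require" stops immediately
// (testify's require does this via t.FailNow()), so later checks never
// run against already-broken state.
package main

import "fmt"

type T struct{ failed bool }

// assertEqual reports the mismatch but lets the test continue.
func (t *T) assertEqual(want, got int, msg string) bool {
	if want != got {
		t.failed = true
		fmt.Println("FAIL (continuing):", msg)
		return false
	}
	return true
}

// requireEqual aborts on mismatch, standing in for t.FailNow().
func (t *T) requireEqual(want, got int, msg string) {
	if want != got {
		t.failed = true
		fmt.Println("FAIL (stopping):", msg)
		panic("FailNow")
	}
}

func main() {
	t := &T{}
	t.assertEqual(1, 2, "assert: mismatch") // prints, then execution continues
	fmt.Println("still running after assert")
	defer func() {
		recover()
		fmt.Println("test aborted after require")
	}()
	t.requireEqual(1, 2, "require: mismatch") // aborts here
	fmt.Println("never reached")
}
```

Stopping at the first failed check keeps the error output focused on the first real divergence instead of a cascade of follow-on failures.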
I think this is probably ready to merge, but I need to give it one more thorough look-over. In its current state, this PR:
There are also some failure cases that this PR does not currently handle, which I'm mostly hoping will be okay enough. In particular: if a request fails, we cannot assume it didn't succeed. For example, if we make a request to the scheduler plugin informing it that we downscaled and the request times out, we can't assume that the scheduler didn't register the change, so we must treat the downscaled amount as the new maximum approved value from the scheduler. There are similar subtleties for requests to the other components as well.

This PR also surfaced some necessary changes:
In any case, I want to let this sit on staging for as long as possible, given the magnitude of the changes.
Ran into this a couple times while writing tests - it's easy to accidentally write Call() where you meant Do(), and then the function call would just never be run, which is hard to debug.
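The footgun being described — a builder-style Call() that only constructs the operation, so nothing runs until Do() — can be sketched like this. The actual helper's API isn't shown in this thread, so these signatures are purely illustrative:

```go
// Hypothetical sketch of the Call()-vs-Do() footgun: Call() only *builds*
// a deferred operation, and nothing executes until Do() runs it. Writing
// Call() where you meant Do() silently does nothing, which is hard to debug.
package main

import "fmt"

type deferredCall struct {
	run func()
}

// Call wraps a function without running it.
func Call(f func()) deferredCall { return deferredCall{run: f} }

// Do actually executes the wrapped function.
func (d deferredCall) Do() { d.run() }

func main() {
	ran := false
	Call(func() { ran = true })     // bug: builds the call but never runs it
	fmt.Println("after Call:", ran) // still false
	Call(func() { ran = true }).Do() // correct: actually runs the function
	fmt.Println("after Do:", ran)
}
```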
With IDs, it was theoretically possible for us to reconnect to the same scheduler instance after disconnecting, which would have the same IDs. We could have mitigated this by including the scheduler's resourceVersion in the ID, but IDs are a little hard to grok anyways and tend to require spooky action at a distance, so generation numbers it is!

Also: this forwards scheduler deletion in trackSchedulerLoop into SchedulerGone calls in the executor.
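The generation-number idea can be sketched briefly (field and method names invented): every (re)connection bumps a monotonic counter, so even reconnecting to the same scheduler instance yields a distinct generation, which reused IDs could not guarantee.

```go
// Sketch of generation numbers for scheduler connections; names invented.
package main

import "fmt"

type schedulerTracker struct {
	generation int
}

// newScheduler is called whenever we (re)connect to a scheduler. The
// returned generation uniquely identifies this connection, even if we
// reconnect to the exact same scheduler instance.
func (t *schedulerTracker) newScheduler() int {
	t.generation++
	return t.generation
}

// isCurrent reports whether a message tagged with gen came from the latest
// connection; anything from an older generation is simply ignored.
func (t *schedulerTracker) isCurrent(gen int) bool {
	return gen == t.generation
}

func main() {
	t := &schedulerTracker{}
	first := t.newScheduler()
	second := t.newScheduler() // reconnect, possibly to the same instance
	fmt.Println(t.isCurrent(first), t.isCurrent(second))
}
```

A nice property of this scheme is that staleness checks are purely local comparisons, with no "spooky action at a distance" needed to invalidate old IDs.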
Essentially, locked state updates should *always* guarantee that a handle to the plugin or vm-monitor (via reading the field of the Runner) will be consistent with the current state of the ExecutorCore.
Added comments re: synchronization explaining why this is ok. tl;dr: updating it requires locking both Runner.lock and the executor's lock, which means that reading it with either one held is ok.
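That locking invariant can be shown in miniature (names invented; the real types are Runner and ExecutorCore): because every write takes both locks, a reader holding either single lock cannot observe a torn or stale value.

```go
// Sketch of the "write under both locks, read under either" invariant.
// Names are invented; a string stands in for the real plugin handle.
package main

import (
	"fmt"
	"sync"
)

type runner struct {
	runnerMu sync.Mutex
	execMu   sync.Mutex
	plugin   string // only written while BOTH mutexes are held
}

// setPlugin takes both locks, preserving the invariant.
func (r *runner) setPlugin(p string) {
	r.runnerMu.Lock()
	defer r.runnerMu.Unlock()
	r.execMu.Lock()
	defer r.execMu.Unlock()
	r.plugin = p
}

// readWithRunnerLock reads under only runnerMu. This is safe because any
// concurrent writer would need runnerMu too, so the read is consistent.
func (r *runner) readWithRunnerLock() string {
	r.runnerMu.Lock()
	defer r.runnerMu.Unlock()
	return r.plugin
}

func main() {
	r := &runner{}
	r.setPlugin("scheduler-handle")
	fmt.Println(r.readWithRunnerLock())
}
```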
Broadly, the design here is to move to a single state object per VM (core.State) that exposes a pure method, (*core.State).NextAction(), which tells the caller what they should do next. This should allow:
In particular, it'll be much easier to address #350 and #23 after this change.
Steps:
- core.State
- core.State
- executeActions, from pkg/agent/runner.go
- core.State