agent: Scaling logic refactor #371

sharnoff · 2023-07-03T14:51:52Z

Broadly, the design here is to move to a single state object per-VM (core.State) that exposes a pure method, (*core.State).NextAction(), which tells the caller what they should do next.

This should allow:

Unit tests for our scaling logic
Easier implementation of changes to the scaling flow that require adding/modifying the state we track

In particular, it'll be much easier to address #350 and #23 after this change.

Steps:

Add core.State
Manage core.State from pkg/agent/runner.go, execute Actions.
Add tests for core.State
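
A minimal sketch of the shape described in the PR description, not the PR's actual code. Every type and field name below (`Resources`, `Action`, `Desired`, and so on) is hypothetical and far simpler than the real core.State; the point is only that a decision method with no I/O and no mutation can be unit-tested by constructing a state value and checking what comes back.

```go
package main

// Resources is a hypothetical, one-dimensional stand-in for VM resources.
type Resources struct {
	CPU uint
}

// ActionKind enumerates what the caller might be told to do (illustrative only).
type ActionKind int

const (
	ActionNone ActionKind = iota
	ActionRequestUpscale
	ActionDownscale
)

// Action is what NextAction hands back to the caller.
type Action struct {
	Kind   ActionKind
	Target Resources
}

// State holds everything the decision needs. All fields are assumptions.
type State struct {
	Current  Resources
	Desired  Resources
	Min, Max Resources
}

// clamp restricts v to the inclusive range [lo, hi].
func clamp(v, lo, hi uint) uint {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// NextAction is pure: it reads the state, performs no I/O, and mutates nothing,
// so the same State always yields the same Action.
func (s State) NextAction() Action {
	target := clamp(s.Desired.CPU, s.Min.CPU, s.Max.CPU)
	switch {
	case target > s.Current.CPU:
		return Action{Kind: ActionRequestUpscale, Target: Resources{CPU: target}}
	case target < s.Current.CPU:
		return Action{Kind: ActionDownscale, Target: Resources{CPU: target}}
	default:
		return Action{Kind: ActionNone, Target: s.Current}
	}
}
```

Because the caller is the one that executes the returned action, a test only needs to build a `State` and compare the result; no scheduler, monitor, or cluster has to exist.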

tychoish

can't wait to see the test ;)

pkg/agent/core/action.go

pkg/agent/core/core.go

sharnoff · 2023-09-27T22:47:53Z

Status update on this: broadly, the code seems to be working, even though it's based on main from back in July. Next steps are squashing, rebasing/merging onto main, and polishing it up: removing older stuff that's now unnecessary (like some fields in agent.Runner), adding comments to all the new methods, and updating the docs at the top of agent/runner.go.

Before that, it's probably best to merge #506 — should make things on this PR simpler (even though merging/rebasing will be more complicated)

At a very high level, this work replaces (*Runner).handleVMResources(), moving from an imperative style to an explicit state machine. This new version is more complicated, but ultimately more flexible and easier to extend. The decision-making "core" of the scaling logic is implemented by (*core.State).NextActions(), which returns an ActionSet indicating what the "caller" should do. NextActions() is a pure function, which makes it easier to test (at least in theory). That method is called and cached by executor.ExecutorCore, where there are a few different threads (each defined in exec_*.go) responsible for communicating with the other components: the scheduler plugin, vm-informant, and NeonVM k8s API. The various "executor" threads are written generically, using dedicated interfaces (e.g. PluginInterface / PluginHandle) that are implemented in pkg/agent/execbridge.go.
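
To illustrate why writing the executor threads against interfaces helps, here is a rough sketch of one step of such a thread. Only the name `PluginInterface` comes from the comment above; its method set, the `core` struct, `runPluginStep`, and `fakePlugin` are all assumptions made up for this example and do not reflect the real signatures in pkg/agent.

```go
package main

import "errors"

// PluginInterface: the name is from the PR; this single method is assumed.
type PluginInterface interface {
	Request(target uint) (permitted uint, err error)
}

// core is a hypothetical stand-in for the cached state and its decision logic.
type core struct {
	desired  uint // what we would like to scale to
	approved uint // what the scheduler has permitted so far
}

// nextPluginRequest reports the target to request, or false if nothing is needed.
func (c *core) nextPluginRequest() (uint, bool) {
	if c.desired > c.approved {
		return c.desired, true
	}
	return 0, false
}

// runPluginStep is one iteration of what an "executor thread" might loop on:
// ask the core what to do, perform the request, feed the result back.
func runPluginStep(c *core, p PluginInterface) error {
	target, ok := c.nextPluginRequest()
	if !ok {
		return nil
	}
	if p == nil {
		return errors.New("no scheduler plugin available")
	}
	permitted, err := p.Request(target)
	if err != nil {
		return err
	}
	c.approved = permitted
	return nil
}

// fakePlugin lets a test drive the step without a real scheduler.
type fakePlugin struct{ limit uint }

// Request grants the target, capped at the fake's limit.
func (f fakePlugin) Request(target uint) (uint, error) {
	if target > f.limit {
		return f.limit, nil
	}
	return target, nil
}
```

The real bridge code would implement the interface with actual network calls; tests can substitute something like `fakePlugin` instead.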

sharnoff · 2023-09-28T23:47:11Z

Squashed, and now merged main -> here

This commit was actually really fun to write. By doing it this way, we actually automatically get "retry with a slightly larger downscale" for free, and hooking everything up was super simple! Also, while we're here, it's worth adding back the other tests from before this PR, now that they pass.

sharnoff · 2023-10-08T17:21:20Z

I think this is probably ready to merge; I just need to give it one more thorough look-over.

In its current state, this PR:

does not fix #350 (agent/core: Use VM spec as source of truth for current resources) or #23 (agent vs NeonVM state is inconsistent when NeonVM fails). Those should be easy enough to do, but I'm leaving them for a follow-up because the resulting behavior is significantly different, and I don't want to block this PR on fixing them.
does fix #252 (autoscaler-agent does not reliably request increases when min bound increases), if that bug did exist before this PR, and adds a couple of tests to check it.

There are also some failure cases that this PR does not currently handle, which I'm mostly hoping will be okay enough. In particular: if a request fails, we cannot assume it didn't succeed. For example, if we make a request to the scheduler plugin informing it that we downscaled but the request times out, we can't assume that the scheduler didn't register the change, so we must treat the downscaled amount as the new maximum approved value from the scheduler. There are similar subtleties for requests to the other components as well.
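
The "a failed request may still have taken effect" rule can be sketched as follows. This is only an illustration of the reasoning in the paragraph above: `schedulerView`, `onResponse`, and `ErrTimeout` are hypothetical names, and the real state tracks more than a single number.

```go
package main

import "errors"

// ErrTimeout is a hypothetical error standing in for any failed request.
var ErrTimeout = errors.New("request timed out")

// schedulerView is what the agent believes the scheduler has approved.
type schedulerView struct {
	approved uint
}

// onResponse updates the view after a request for `requested` resources.
// On success we trust the returned permit. On failure we cannot know whether
// the scheduler registered the request, so we keep the more conservative
// value: the approved amount may go down to `requested`, but never up.
func (v *schedulerView) onResponse(requested, permitted uint, err error) {
	if err == nil {
		v.approved = permitted
		return
	}
	if requested < v.approved {
		v.approved = requested
	}
}
```

Note the asymmetry: a failed request for more resources leaves the approved value unchanged, while a failed notification of a downscale lowers it.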

This PR also surfaced some necessary changes:

A new request type to the vm-monitor for "forced downscaling" — currently, any downscaling can be denied, and that makes it hard to guarantee the "eventual" part of "eventual consistency".
Requests to the scheduler plugin must include what the agent believes the last permit was. It's always possible for the scheduler to get killed and immediately restart, without the agent believing there was any disconnect, which could lead to unintentional overcommitting of resources from the Buffer values if too many agents request upscaling on the first request to the scheduler. (This would also remove the need for separate "informative" requests.)

In any case, I want to let this sit on staging for as long as possible, given the magnitude of the changes.

With IDs, it was theoretically possible for us to reconnect to the same scheduler instance after disconnecting, which would have the same IDs. We could have mitigated this by including the scheduler's resourceVersion in the ID, but IDs are a little hard to grok anyways and tend to require spooky action at a distance, so generation numbers it is!

Also forwards scheduler deletion in trackSchedulerLoop into SchedulerGone calls in the executor.
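
As a rough illustration of the generation-number idea: every (re)connection bumps a counter, and anything tagged with an older generation is ignored, even if it came from the same scheduler instance. `SchedulerGone` is named in the commit message above, but its signature here, along with `schedulerState`, `NewScheduler`, and `ApplyResponse`, is an assumption for this sketch.

```go
package main

// GenerationNumber is bumped on every (re)connection to a scheduler.
type GenerationNumber uint64

// schedulerState is a hypothetical slice of the executor's state.
type schedulerState struct {
	gen      GenerationNumber
	present  bool
	approved uint
}

// NewScheduler records a new connection and returns its generation.
// Reconnecting to the same instance still yields a fresh generation.
func (s *schedulerState) NewScheduler() GenerationNumber {
	s.gen++
	s.present = true
	return s.gen
}

// SchedulerGone marks the scheduler as gone, but only if the caller is
// talking about the current generation. It reports whether it took effect.
func (s *schedulerState) SchedulerGone(gen GenerationNumber) bool {
	if !s.present || gen != s.gen {
		return false
	}
	s.present = false
	return true
}

// ApplyResponse applies a permit only if it belongs to the current generation.
func (s *schedulerState) ApplyResponse(gen GenerationNumber, permitted uint) bool {
	if !s.present || gen != s.gen {
		return false
	}
	s.approved = permitted
	return true
}
```

Unlike an ID derived from the scheduler's identity, the counter is local to the agent, so two connections to the same instance can never be confused with each other.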


tychoish reviewed Jul 3, 2023


This was referenced Jul 3, 2023

agent/core: Use VM spec as source of truth for current resources #350

Open

no unit tests for scaling algorithm #199

Closed

NeonVM: k8s 1.25 networking follow ups #327

Closed

sharnoff mentioned this pull request Jul 12, 2023

agent vs NeonVM state is inconsistent when NeonVM fails #23

Closed

sharnoff mentioned this pull request Sep 1, 2023

Bug: autoscaler-agent background worker panics with rejectedDownscale returned new target less than current #512

Closed

sharnoff mentioned this pull request Sep 26, 2023

Bug(ish): Kernel buff/cache usage prevents downscaling #531

Closed

sharnoff added 2 commits September 28, 2023 13:57

Merge branch 'main' into agent-runner-update-refactor

f0a1169

sharnoff force-pushed the sharnoff/agent-runner-update-refactor branch from 2f639aa to 0faed63 Compare September 28, 2023 23:46

sharnoff added 7 commits September 28, 2023 16:54

Remove make test/build/run dependence on go vet

91d6cc8

Update (*core.State).DesiredResources... for recent changes in main

1021a78

fix core/state_test State initialization

5f4ff2e

small improvement to computeUnit availability

e2b7075

switch arithmetic ordering

bf18987

simplify condition

d2f0992

sharnoff marked this pull request as ready for review September 29, 2023 02:57

sharnoff added 10 commits September 29, 2023 08:11

add more thorough tests (not yet passing)

724ae6b

switch test from testify/assert to testify/require

5ac1b60

tl;dr of the difference is that require calls t.FailNow(), so we stop at the first error, which is more in line with what we want.

add State.debug for print debugging

bcb8d96

fix warn log lines: s/informant/vm-monitor/

921a78e

fix comment

b7d205b

rewrite the rewrite, I guess

f3f527f

agent/executor: Remove reqeust lock usage

42929b6

agent/executor: Simplify request-if-iface-non-nil logic

6bfb7ce

agent: fix unimplemented (*execMonitorHandle).ID()

6acf619

agent/executor: also log ActionSet returned

f490ebd

sharnoff added 6 commits October 7, 2023 19:12

testhelpers: allow separate VmInfo construction

e1ff8f3

rename some testhelpers bits

bf4db09

state_test: remove calls to (*State).Debug()

b6c4d45

state_test: add tests that VM bounds changes are respected

227ea83

Merge branch 'main' into agent-runner-update-refactor

fbb266b

add metric for number of calls to (*core.State).NextActions()

290750d

sharnoff linked an issue Oct 8, 2023 that may be closed by this pull request

autoscaler-agent does not reliably request increases when min bound increases #252

Closed

sharnoff added 2 commits October 8, 2023 12:17

update comments/docs

07fd321

testhelpers: panic if Call() is not resolved

0098d34

Ran into this a couple times while writing tests - it's easy to accidentally write Call() where you meant Do(), and then the function call would just never be run, which is hard to debug.
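
One way such a guard could work, sketched with entirely hypothetical types: the real testhelpers API is not shown in this thread, so `Assert`, `PreparedCall`, and `Equals` below are assumptions, and the actual detection mechanism may differ. Here `Call()` only prepares a function call, and any later helper call panics if the prepared call was never resolved.

```go
package main

// Assert is a hypothetical test helper that tracks unresolved Call()s.
type Assert struct {
	pending bool
}

// PreparedCall is a function call that has been set up but not yet run.
type PreparedCall struct {
	a   *Assert
	run func() int
}

// checkNoPending panics if an earlier Call() was never resolved.
func (a *Assert) checkNoPending() {
	if a.pending {
		panic("previous Call() was never resolved; did you mean Do()?")
	}
}

// Do runs f immediately.
func (a *Assert) Do(f func()) {
	a.checkNoPending()
	f()
}

// Call prepares f but does not run it until the result is compared.
func (a *Assert) Call(f func() int) PreparedCall {
	a.checkNoPending()
	a.pending = true
	return PreparedCall{a: a, run: f}
}

// Equals runs the prepared function and compares its result with want.
func (c PreparedCall) Equals(want int) bool {
	c.a.pending = false
	return c.run() == want
}
```

With this shape, writing `a.Call(f)` and forgetting `.Equals(...)` fails loudly at the next helper call instead of silently skipping `f`.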

sharnoff force-pushed the sharnoff/agent-runner-update-refactor branch from 54d6f27 to 0098d34 Compare October 8, 2023 19:17

sharnoff added 9 commits October 8, 2023 12:34

add util.Broadcaster tests, fix usage in executor

57f71cb

refactor exec_sleeper to match other executor threads

9b59a01

one more executor broadcaster usage fix

240325d

fix typo

59dc242

executor: add warnings when skipping state update

e570cdd

executor: require plugin/monitor interfaces are non-nil during request

f39cf32

Essentially, locked state updates should *always* guarantee that a handle to the plugin or vm-monitor (via reading the field of the Runner) will be consistent with the current state of the ExecutorCore.

remove unnecessary atomics for Runner.{scheduler,monitor}

3592ff1

Added comments re: synchronization to explain why this is ok. tl;dr: writing it requires locking both Runner.lock and the executor's lock, which means that reading it with either is ok.
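
The locking argument can be sketched like this, assuming a simplified `Runner` with two plain mutexes. The field and method names are hypothetical; only the rule itself (writers hold both locks, readers need either one) is taken from the commit message.

```go
package main

import "sync"

// monitorHandle stands in for whatever handle type the Runner stores.
type monitorHandle struct{ id int }

// Runner is a hypothetical, stripped-down version of the real struct.
type Runner struct {
	lock     sync.Mutex // stand-in for Runner.lock
	execLock sync.Mutex // stand-in for the executor's lock
	monitor  *monitorHandle
}

// setMonitor writes the field while holding BOTH locks, always acquired in
// the same order to avoid deadlock.
func (r *Runner) setMonitor(h *monitorHandle) {
	r.lock.Lock()
	defer r.lock.Unlock()
	r.execLock.Lock()
	defer r.execLock.Unlock()
	r.monitor = h
}

// monitorUnderRunnerLock reads the field holding only the runner's lock.
// This is safe because no writer can proceed without that same lock.
func (r *Runner) monitorUnderRunnerLock() *monitorHandle {
	r.lock.Lock()
	defer r.lock.Unlock()
	return r.monitor
}

// monitorUnderExecutorLock reads the field holding only the executor's lock,
// which is safe for the same reason.
func (r *Runner) monitorUnderExecutorLock() *monitorHandle {
	r.execLock.Lock()
	defer r.execLock.Unlock()
	return r.monitor
}
```

Since every write excludes both kinds of reader, no atomics are needed on the field itself.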

reduce executor state logs to debug level

60e8e19

sharnoff merged commit 12b8208 into main Oct 8, 2023
7 checks passed

sharnoff deleted the sharnoff/agent-runner-update-refactor branch October 8, 2023 23:45

This was referenced Oct 16, 2023

agent/executor: Use a single update listener thread, spawn new ones per each request #564

Draft

Bug: autoscaler-agent has memory leak when talking to vm-monitor #503

Closed

sharnoff mentioned this pull request Oct 30, 2023

agent should send vm-monitor upscale message only after resources were added #593

Open
