Bug: autoscaler-agent incorrectly assumes unsuccessful requests didn't succeed #680
What could go wrong with having inconsistent state between agent <-> scheduler or agent <-> vm-monitor? My main concern is that the agent's codebase is already the most complicated part of the system, and this change makes it even more complicated.
for the scheduler, see above:
After that point, we (a) may unintentionally overcommit, and (b) may end up with unexpected results from requests to the scheduler plugin because of that inconsistent state. For the vm-monitor, this would mean that we could run indefinitely in a degraded state (file cache too big / too small), because there's currently no guarantee that we retry failed requests (it doesn't make sense to keep asking for scaling that you no longer think is correct).
Status update: AFAICT:
See #680 for detail on motivation. tl;dr: this fixes a known category of bugs, and AFAICT is a pre-requisite for using the VM spec as a source of truth.

Brief summary of changes:

- Introduce a new `resourceBounds` struct in pkg/agent/core that handles the uncertainty associated with requests that may or may not have succeeded.
- Switch internal usage so plugin permit, vm-monitor approved, and VM spec resources are all represented by `resourceBounds`.
- Add a new test to extensively exercise this (`TestFailuresNotAssumedSuccessful`).

I expect we'll find bugs with this in production. Most of those should be fine: restarting the `pkg/agent.Runner` retries with a fresh slate. Possible liveness issues would be more concerning (e.g. getting into a state where we stop communicating with other components). Those should *hopefully* be caught by the new test.
Problem description / Motivation
I know this seems strange, but hear me out:
With the way that `pkg/agent/core/` is currently implemented, when it makes a request to an external entity (scheduler plugin / NeonVM API / vm-monitor) and the request returns an error, it assumes that the request was not successfully processed. Unfortunately, that is not necessarily the case: it's entirely possible, for example, for the autoscaler-agent to send a request to the scheduler plugin and hit a client-side timeout before receiving the response to a request that was actually handled successfully. The scheduler plugin may not detect in time that the connection was closed.
This can leave the system in an inconsistent state if the autoscaler-agent doesn't retry that exact request to completion: the scheduler (or whichever component) could believe a different amount is reserved for a VM than the autoscaler-agent does.
Implementation ideas
We already track the "current" or "approved" values for the three entities we communicate with. We could also track the lower/upper bound for amounts that they may think we've requested and use e.g. the lower bound for what they've seen as the upper bound for what they've approved (or vice versa).
Here, it's important to be aware that the current NeonVM state is used both as an upper and lower bound, for communications with the scheduler and vm-monitor, respectively. We can't "just" change the meaning of a single set of resources — we'll have to separately track a second set of resources as well.
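The bounds-tracking idea above can be sketched as follows. This is a simplified, hypothetical illustration of the concept behind the PR's `resourceBounds` (the real struct tracks CPU and memory, not a single number): sending a request widens the range of values the remote side might hold, and a confirmed success collapses it again.

```go
package main

import "fmt"

// bounds tracks the range of values the remote side might currently hold,
// given that some of our requests may or may not have been processed.
type bounds struct {
	lower, upper uint
}

// requestSent widens the bounds when a request for `val` goes out:
// until we learn the outcome, the remote side may hold either the old
// value or the new one.
func (b bounds) requestSent(val uint) bounds {
	nb := b
	if val < nb.lower {
		nb.lower = val
	}
	if val > nb.upper {
		nb.upper = val
	}
	return nb
}

// requestSucceeded collapses the bounds: we now know the remote side holds `val`.
func (b bounds) requestSucceeded(val uint) bounds {
	return bounds{lower: val, upper: val}
}

func main() {
	// Remote side known to hold 4 compute units.
	b := bounds{lower: 4, upper: 4}
	// We ask for 6; the call times out, so either value is possible.
	b = b.requestSent(6)
	fmt.Println(b) // {4 6}
	// A later retry succeeds, collapsing the uncertainty.
	b = b.requestSucceeded(6)
	fmt.Println(b) // {6 6}
}
```

Under this scheme, decisions that must not overcommit would use the upper bound, while decisions that must not under-provision would use the lower bound, matching the "lower bound of what they've seen as the upper bound of what they've approved" idea above.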