
Bug: autoscaler-agent incorrectly assumes unsuccessful requests didn't succeed #680

Open
Tracked by #350
sharnoff opened this issue Dec 8, 2023 · 3 comments · May be fixed by #727
Assignees: sharnoff
Labels: c/autoscaling/autoscaler-agent (Component: autoscaling: autoscaler-agent), t/bug (Issue Type: Bug)

Comments

@sharnoff
Member

sharnoff commented Dec 8, 2023

Problem description / Motivation

I know this seems strange, but hear me out:

With the way that pkg/agent/core/ is currently implemented, when it makes a request to an external entity (scheduler plugin / NeonVM API / vm-monitor) and that request returns an error, it assumes the request was not successfully processed.

Unfortunately, that is not necessarily true: it's entirely possible, for example, for the autoscaler-agent to send a request to the scheduler plugin and hit a client-side timeout before receiving the response to a request that was actually handled successfully. The scheduler plugin may not notice in time that the connection was closed.

This can lead to a possibly inconsistent state if the autoscaler-agent doesn't retry that exact request to completion: the scheduler (or whoever) could believe that there's a different amount reserved for a VM than the autoscaler-agent does.
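For illustration, here's a minimal Go sketch of where the ambiguity comes from (the function below is hypothetical, not the agent's actual request code): a client-side timeout error gives no information about whether the server finished handling the request.

```go
package example

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// sendScalingRequest is a hypothetical stand-in for the agent's request path.
// The point: a client-side timeout tells us nothing about whether the server
// finished processing the request.
func sendScalingRequest(ctx context.Context, url string, body []byte) error {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return err
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			// Ambiguous: the scheduler plugin may have already reserved the
			// resources and written a response that we never read. Assuming
			// "not processed" here is the bug described above.
			return fmt.Errorf("request outcome unknown: %w", err)
		}
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	// Only here do we know for certain that the request was handled.
	return nil
}
```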

Implementation ideas

We already track the "current" or "approved" values for the three entities we communicate with. We could additionally track lower/upper bounds on the amounts they may believe we've requested, and use e.g. the lower bound of what they've seen as the upper bound of what they've approved (or vice versa).

Here, it's important to be aware that the current NeonVM state is used both as an upper and lower bound, for communications with the scheduler and vm-monitor, respectively. We can't "just" change the meaning of a single set of resources — we'll have to separately track a second set of resources as well.
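As a very rough sketch of that bound-tracking (the type and method names below are illustrative only, not a concrete proposal):

```go
package example

// resourceBounds tracks the range of values that another component
// (scheduler plugin / NeonVM API / vm-monitor) might currently believe is
// reserved or approved for the VM.
type resourceBounds struct {
	lower uint64
	upper uint64
}

// requestStarted widens the bounds when we send a request for target: until
// we know the outcome, the other side may hold either its old value or the
// new one.
func (b *resourceBounds) requestStarted(target uint64) {
	if target < b.lower {
		b.lower = target
	}
	if target > b.upper {
		b.upper = target
	}
}

// requestSucceeded collapses the bounds back to a single known value.
func (b *resourceBounds) requestSucceeded(target uint64) {
	b.lower, b.upper = target, target
}

// requestFailed deliberately changes nothing: a "failed" request may still
// have been processed, so the uncertainty remains until a later request
// succeeds.
func (b *resourceBounds) requestFailed() {}
```

When acting on these values, we'd take the conservative end of the range, along the lines of using the lower bound of what they've seen as the upper bound of what they've approved (or vice versa), as described above.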

sharnoff added the t/bug (Issue Type: Bug) and c/autoscaling/autoscaler-agent (Component: autoscaling: autoscaler-agent) labels on Dec 8, 2023
@shayanh
Contributor

shayanh commented Dec 19, 2023

What could go wrong with having inconsistent state between agent <-> scheduler or agent <-> vm-monitor?

My main concern is that the agent's codebase is already the most complicated part of the system, and this change makes it even more complicated.

@sharnoff
Member Author

sharnoff commented Dec 21, 2023

What could go wrong with having inconsistent state between agent <-> scheduler or agent <-> vm-monitor?

for the scheduler, see above:

This can lead to a possibly inconsistent state if the autoscaler-agent doesn't retry that exact request to completion: the scheduler (or whoever) could believe that there's a different amount reserved for a VM than the autoscaler-agent does.

After that point, we (a) may unintentionally overcommit, and (b) may end up with unexpected results from requests to the scheduler plugin, because of that inconsistent state.

For the vm-monitor, this would mean that we could run indefinitely in a degraded state (file cache too big / too small), because there's currently no guarantee that we retry failed requests (it doesn't make sense to keep asking for scaling that you no longer think is correct).

sharnoff self-assigned this on Jan 4, 2024
sharnoff added a commit that referenced this issue Jan 6, 2024
See #680 for detail on motivation. tl;dr: this fixes a known category of
bugs, and AFAICT is a pre-requisite for using the VM spec as a source of
truth.

Brief summary of changes:

- Introduce a new `resourceBounds` struct in pkg/agent/core that handles
  the uncertainty associated with requests that may or may not have
  succeeded.
- Switch internal usage so plugin permit, vm-monitor approved, and VM
  spec resources all are represented by `resourceBounds`
- Add a new test to extensively test this (`TestFailuresNotAssumedSuccessful`)

I expect we'll find bugs with this in production. Most of those should be
fine: they can be handled by restarting the `pkg/agent.Runner` and retrying
with a fresh slate.
Possible liveness issues would be more concerning (e.g. getting into a
state where we stop communicating with other components). Those should
*hopefully* be handled by the new test.
sharnoff added a commit that referenced this issue Jan 19, 2024