Bug: autoscaler-agent incorrectly assumes unsuccessful requests didn't succeed #680
What could go wrong with having inconsistent state between agent <-> scheduler or agent <-> vm-monitor? My main concern is that the agent's codebase is already the most complicated part of the system, and this change makes it even more complicated.
for the scheduler, see above:
After that point, we (a) may unintentionally overcommit, and (b) may end up with unexpected results from requests to the scheduler plugin because of that inconsistent state. For the vm-monitor, this would mean that we could run indefinitely in a degraded state (file cache too big / too small), because there's currently no guarantee that we retry failed requests (it doesn't make sense to keep asking for scaling that you no longer think is correct).
Status update: AFAICT:
See #680 for detail on motivation. tl;dr: this fixes a known category of bugs, and AFAICT is a pre-requisite for using the VM spec as a source of truth.

Brief summary of changes:

- Introduce a new `resourceBounds` struct in pkg/agent/core that handles the uncertainty associated with requests that may or may not have succeeded.
- Switch internal usage so plugin permit, vm-monitor approved, and VM spec resources are all represented by `resourceBounds`.
- Add a new test to extensively exercise this (`TestFailuresNotAssumedSuccessful`).

I expect we'll find bugs with this in production. Most of those should be fine: restarting the `pkg/agent.Runner` retries with a fresh slate. Possible liveness issues would be more concerning (e.g. getting into a state where we stop communicating with other components). Those should *hopefully* be caught by the new test.
Problem description / Motivation
I know this seems strange, but hear me out:
With the way that `pkg/agent/core/` is currently implemented, when it makes a request to an external entity (scheduler plugin / NeonVM API / vm-monitor) and the request returns an error, it assumes that the request was not successfully processed. Unfortunately, that is not necessarily the case: it's entirely possible, for example, for the autoscaler-agent to send a request to the scheduler plugin and hit a client-side timeout before receiving the response to a request that was actually handled successfully. The scheduler plugin may not detect in time that the connection was closed.
This can leave the system in an inconsistent state if the autoscaler-agent doesn't retry that exact request to completion: the scheduler (or whichever component) could believe a different amount is reserved for a VM than the autoscaler-agent does.
Implementation ideas
We already track the "current" or "approved" values for the three entities we communicate with. We could also track the lower/upper bound for amounts that they may think we've requested and use e.g. the lower bound for what they've seen as the upper bound for what they've approved (or vice versa).
Here, it's important to be aware that the current NeonVM state is used both as an upper and lower bound, for communications with the scheduler and vm-monitor, respectively. We can't "just" change the meaning of a single set of resources — we'll have to separately track a second set of resources as well.
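The bounds-tracking idea above can be sketched as follows. This is a simplified, hypothetical illustration of the concept behind the PR's `resourceBounds` (the real struct tracks CPU and memory, not a single number): sending a request widens the range of values the remote side might hold, and a confirmed success collapses it again.

```go
package main

import "fmt"

// bounds tracks the range of values the remote side might currently hold,
// given that some of our requests may or may not have been processed.
type bounds struct {
	lower, upper uint
}

// requestSent widens the bounds when a request for `val` goes out:
// until we learn the outcome, the remote side may hold either the old
// value or the new one.
func (b bounds) requestSent(val uint) bounds {
	nb := b
	if val < nb.lower {
		nb.lower = val
	}
	if val > nb.upper {
		nb.upper = val
	}
	return nb
}

// requestSucceeded collapses the bounds: we now know the remote side holds `val`.
func (b bounds) requestSucceeded(val uint) bounds {
	return bounds{lower: val, upper: val}
}

func main() {
	// Remote side known to hold 4 compute units.
	b := bounds{lower: 4, upper: 4}
	// We ask for 6; the call times out, so either value is possible.
	b = b.requestSent(6)
	fmt.Println(b) // {4 6}
	// A later retry succeeds, collapsing the uncertainty.
	b = b.requestSucceeded(6)
	fmt.Println(b) // {6 6}
}
```

Under this scheme, decisions that must not overcommit would use the upper bound, while decisions that must not under-provision would use the lower bound, matching the "lower bound of what they've seen as the upper bound of what they've approved" idea above.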