[Feature Proposal] Add Admission Control - Workload management to improve cluster stability #1144

Closed
mitalawachat opened this issue Aug 24, 2021 · 10 comments
Labels
distributed framework enhancement Enhancement or improvement to existing feature or request

Comments

@mitalawachat

mitalawachat commented Aug 24, 2021

Feature Proposal : Add Admission Control - Workload management to improve cluster stability

Overview:

Admission control is a workload-management mechanism that limits new incoming requests early, when a node begins to come under stress. It would be resource-aware: it accounts for the cost of each new incoming request (memory occupancy) and tracks the point-in-time state of the node (overall JVM memory pressure, JVMMP). This allows real-time, state-based admission control on the node. The feature helps prevent clusters from being overloaded by incoming traffic, whether from a steady increase or a sudden surge.

Requirements:

  • Configure throttling thresholds per REST endpoint or per group of REST endpoints
  • Throttle requests early, based on configuration, even before they reach the REST layer
  • Track rejections per node and expose via stats API
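The first and third requirements above (thresholds per group of endpoints, and per-node rejection tracking) could be modeled roughly as follows. This is a minimal Python sketch, not OpenSearch code: the `ThrottleRule` and `RejectionStats` names, the glob-style URI patterns, the example thresholds, and the shape of the stats dictionary are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass(frozen=True)
class ThrottleRule:
    """Hypothetical rule: one threshold shared by a group of REST endpoints."""
    name: str              # label used when reporting rejections
    uri_patterns: tuple    # glob patterns, e.g. ("/_search", "/*/_search")
    max_in_flight: int     # threshold for the whole group (illustrative value)


class RejectionStats:
    """Per-node rejection counters, shaped like something a stats API could return."""

    def __init__(self) -> None:
        self._rejections = Counter()

    def record(self, rule_name: str) -> None:
        self._rejections[rule_name] += 1

    def as_dict(self) -> dict:
        # The key names here are invented for this sketch.
        return {"admission_control": {"rejections": dict(self._rejections)}}


def match_rule(rules, uri: str) -> Optional[ThrottleRule]:
    """Return the first rule whose patterns match the URI, or None."""
    for rule in rules:
        if any(fnmatchcase(uri, pattern) for pattern in rule.uri_patterns):
            return rule
    return None


RULES = [
    ThrottleRule("search", ("/_search", "/*/_search"), max_in_flight=100),
    ThrottleRule("bulk", ("/_bulk", "/*/_bulk"), max_in_flight=50),
]
```

Grouping endpoints under a named rule keeps the configuration small and gives the rejection counters a natural key.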

Problem Statement: How do we throttle requests dynamically based upon the Request URI pattern?

The idea is to limit the number of requests per node that reach the OpenSearch thread pool for execution. The limit can be based on the number of requests of a particular type that are already in flight, i.e. currently being executed. For example, each request would acquire tokens from a bucket before executing and release them after execution completes. This implies that once all tokens are acquired, any additional requests on the node will be throttled (with a too-many-requests exception) until tokens become available again. This primarily helps protect the node's resources from brownouts, while also ensuring they remain available to other request types without contention.
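The acquire-before-execute, release-after-execute idea can be sketched as below. This is only an illustration in Python under stated assumptions, not the proposed implementation: `InFlightLimiter`, `TooManyRequests`, and the one-token-per-request accounting are hypothetical choices.

```python
import threading
from contextlib import contextmanager


class TooManyRequests(Exception):
    """Stands in for an HTTP 429 response."""
    status = 429


class InFlightLimiter:
    """Fixed pool of tokens for one request type (hypothetical, for illustration)."""

    def __init__(self, max_tokens: int) -> None:
        self._max_tokens = max_tokens
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Take one token if any is free; never blocks."""
        with self._lock:
            if self._in_flight >= self._max_tokens:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Return one token after the request has finished executing."""
        with self._lock:
            self._in_flight -= 1

    @contextmanager
    def admit(self):
        """Hold a token for the duration of the request, or reject immediately."""
        if not self.try_acquire():
            raise TooManyRequests("no tokens available for this request type")
        try:
            yield
        finally:
            self.release()
```

A caller would keep one limiter per request type and wrap execution in `with limiter.admit():`, so that exhausting the tokens for one type does not affect the others.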

Describe alternatives you've considered

Circuit breakers:

Although circuit breakers can safely be assumed to be the last line of defense, protecting nodes from browning out, we still need some mechanism to regulate the number of requests reaching the execution phase. Admission control would understand the workload a node can take and prevent it from being overwhelmed by cutting off any additional workload.
Also, circuit breakers cannot provide limits based on REST endpoints.

Proposed Solution:

Dynamic Cluster Settings exposed by Admission Control:

  • JVMMemoryPressure Controller
    • Evaluates the point-in-time state of the node (overall JVMMP) and decides whether a request should be forwarded or throttled.
  • RequestSize Controller
    • Allocate, say, 10% of the overall JVM size as the token budget of a bucket. Each request acquires tokens from this bucket according to its Content-Length, and releases them when execution completes.
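A rough sketch of the two controllers described above, written in Python for brevity. Everything named here is an assumption for illustration: the class names, the 85% pressure threshold, the callable used to read memory pressure, and the one-token-per-byte accounting are not specified in the proposal.

```python
import threading


class JvmMemoryPressureController:
    """Rejects when point-in-time memory pressure reaches a threshold (hypothetical)."""

    def __init__(self, read_pressure_percent, limit_percent: float = 85.0) -> None:
        self._read = read_pressure_percent  # callable returning current pressure, 0-100
        self._limit = limit_percent         # assumed threshold, not from the proposal

    def admit(self) -> bool:
        return self._read() < self._limit


class RequestSizeController:
    """Byte budget sized as a fraction of the JVM size; one token per byte (assumed)."""

    def __init__(self, heap_bytes: int, fraction: float = 0.10) -> None:
        self._available = int(heap_bytes * fraction)
        self._lock = threading.Lock()

    def try_acquire(self, content_length: int) -> bool:
        """Reserve Content-Length tokens, or refuse if the budget is too low."""
        with self._lock:
            if content_length > self._available:
                return False
            self._available -= content_length
            return True

    def release(self, content_length: int) -> None:
        """Give the tokens back once the request has finished executing."""
        with self._lock:
            self._available += content_length
```

A request would be admitted only if every enabled controller agrees; checking memory pressure first avoids reserving tokens for a request that is going to be rejected anyway.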
mitalawachat added the enhancement label Aug 24, 2021
@anasalkouz
Member

Hi @mitalawachat, could you please elaborate? What do you mean by "early layer"? What mechanism is currently used to restrict incoming requests? What do you want to change?

@anasalkouz
Member

anasalkouz commented Apr 1, 2022

Closing this since we didn't receive a response for a while. @mitalawachat please reopen if needed.

@mitalawachat
Author

@anasalkouz Please help reopen the issue. I could not find time to actively work on this earlier; apologies for that.

@dblock
Member

dblock commented May 2, 2022

I'll reopen. Thanks @mitalawachat

dblock reopened this May 2, 2022
mitalawachat changed the title from "Admission Control - Workload management to improve cluster stability." to "[Feature Proposal] Add Admission Control - Workload management to improve cluster stability" May 9, 2022
@dblock
Member

dblock commented May 23, 2022

This feels like a more upstream and generic version of #1329, or at least something that sits ahead of it. My question is, if back-pressure is implemented well in all paths, do we still want admissions control? How do you see these two features co-exist?

Can admissions control be imagined as a cluster-wide, and eventually consistent feature, improving load balancing? In an asynchronous world I would like a feature where I can generate a request ID on the client, make a request to any node (or make the same request to multiple nodes including the generated ID), get the job queued and executed at least once on the most available node, and then come back for results later by supplying my ID to any node.

@mitalawachat
Author

mitalawachat commented May 27, 2022

> This feels like a more upstream and generic version of #1329, or at least something that sits ahead of it.

Yes, I was proposing that admission control be executed before a request reaches the OpenSearch thread pool for execution. Backpressure and circuit breakers are triggered later.

To achieve this, I was thinking we could add a new Netty handler to the handler chain in Netty4HttpServerTransport's initChannel.
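The effect of putting an admission handler at the front of a handler chain can be illustrated with a small stand-in pipeline. This is plain Python, not the Netty API: `run_chain`, `make_admission_handler`, and `make_auth_handler` are hypothetical names, and the "stress" signal is just a callable supplied by the caller.

```python
from typing import Callable, List, Optional

# A handler returns an HTTP status to stop the chain, or None to continue.
Handler = Callable[[dict], Optional[int]]


def run_chain(handlers: List[Handler], request: dict) -> int:
    """Run handlers in order; the first one that returns a status ends the request."""
    for handler in handlers:
        status = handler(request)
        if status is not None:
            return status
    return 200


def make_admission_handler(node_under_stress: Callable[[], bool]) -> Handler:
    """First handler in the chain: reject with 429 while the node is under stress."""
    def admission_handler(request: dict) -> Optional[int]:
        return 429 if node_under_stress() else None
    return admission_handler


def make_auth_handler(calls: list) -> Handler:
    """Stand-in for later, more expensive work such as authentication."""
    def auth_handler(request: dict) -> Optional[int]:
        calls.append(request["uri"])  # records that the work was actually done
        return None
    return auth_handler
```

Because the admission handler runs first, a rejected request never reaches the later handlers, so no downstream work is spent on it.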

> My question is, if back-pressure is implemented well in all paths, do we still want admissions control? How do you see these two features co-exist?

If a node is already under stress, or in scenarios where there is a sudden burst of requests, admission control aims to prevent the node from doing any work (auth, etc.) that would turn out to be wasteful when the requests are later rejected by other mechanisms (backpressure/cb).

> Can admissions control be imagined as a cluster-wide, and eventually consistent feature, improving load balancing?

I was scoping admission control to throttling, but load balancing seems like a nice feature for a future extension.

@dblock
Member

dblock commented Jun 13, 2022

@mitalawachat I understand why admissions control is better in theory than the existing backpressure/cb methods, but you still have to show at least anecdotal examples where those are net worse than a whole new admissions control feature. Note that having multiple ways to prevent the cluster from overloading may make the system harder for users to reason about.

@mitalawachat
Author

Hi @dblock,

We've actually implemented a version of admission control in the Amazon Managed OpenSearch Service, and it has been available for around two years. It has helped a great deal with cluster stability, especially on smaller instance types with limited resources. On t2/t3 EC2 instance types it has shown a 75%-85% reduction in node drops across various regions, and a ~70% reduction in node drops on other EC2 instance types.

We've observed it helping when a user had downgraded a cluster to an underscaled configuration and JVM memory pressure hovered at 90-95% continuously for a few hours. Admission control helped with selective load shedding, allowing the cluster to do some useful management work; without it, the result would have been node drops or out-of-memory issues.

We've also observed it helping prevent node drops and out-of-memory issues on a user's cluster when a spike in search traffic caused sharp JVM spikes on a few data nodes. With admission control in place, 429 statuses were proactively sent back to clients from the affected nodes, preventing further JVM spikes and keeping those nodes from running into issues.

We are planning to enhance the framework, with community feedback, while open-sourcing it as a core component.

@dblock
Member

dblock commented Jun 15, 2022

Thanks @mitalawachat! I now understand where you come from. These are some very strong numbers. Great to see this feature open-sourced. Make some PRs!

@ajaymovva
Contributor

ajaymovva commented Oct 26, 2023

Closing this issue as we are tracking this feature as below:

RFC for AdmissionController: #8910
Meta Issue: #9504
DOC Issue: [DOC] Create documentation corresponding to changes in _nodes/stats api in OpenSearch for Admission control feature
