Cluster Manager Task Throttling #479

Closed
dhwanilpatel opened this issue Apr 1, 2021 · 12 comments
Labels
distributed framework enhancement Enhancement or improvement to existing feature or request roadmap

Comments

@dhwanilpatel
Contributor

Is your feature request related to a problem? Please describe.

For many cluster activities, data nodes submit tasks to the master node, for example put-mapping, create-index, and shard-started. Sometimes, due to a bug or other issue, data nodes flood the master node with too many tasks, and as a result we see spikes in pending tasks in the master queue. This can affect the master's performance, which in turn can affect the availability of the whole cluster.

We should increase the master's resiliency against such a high number of pending tasks.

Describe the solution you'd like

We can make the master more resilient by adding task throttling on the master node. The master will reject tasks submitted from data nodes based on throttling limits. Throttling should work on a per-task-type basis, so throttling one task type won't affect the submission of other task types.
Once the master rejects a task based on the throttling logic, the data node will retry with exponential backoff to submit the task to the master node.
We should add a dynamic setting for enabling and disabling throttling on the master, and we should also be able to provide the throttling configuration for task types through dynamic settings.
This framework will help when there is a bug or issue in the cluster: we can enable throttling to make the master resilient against a high task volume, and disable it once the underlying bug or issue is resolved.
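
To make the proposal concrete, here is a minimal Python sketch of the idea, not the OpenSearch implementation. All names (`TaskThrottler`, `TaskThrottledError`, `submit_with_backoff`) and the default delays are hypothetical, chosen only for illustration. It assumes the limit is a maximum count of pending tasks per task type, and it models the dynamic setting as a method that can be called at runtime.

```python
import threading
import time


class TaskThrottledError(Exception):
    """Raised when a task is rejected because its type is over the limit."""


class TaskThrottler:
    """Per-task-type admission control for a pending-task queue (illustrative)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._enabled = False  # stands in for the dynamic on/off setting
        self._limits = {}      # task type -> max pending tasks of that type
        self._pending = {}     # task type -> tasks currently queued

    def update_settings(self, enabled, limits):
        """Apply new settings at runtime, like a dynamic cluster setting."""
        with self._lock:
            self._enabled = enabled
            self._limits = dict(limits)

    def on_submit(self, task_type):
        """Admit the task or reject it; only the task's own type is checked."""
        with self._lock:
            count = self._pending.get(task_type, 0)
            limit = self._limits.get(task_type)
            if self._enabled and limit is not None and count >= limit:
                raise TaskThrottledError(task_type)
            self._pending[task_type] = count + 1

    def on_complete(self, task_type):
        """Release the slot once the task has been executed."""
        with self._lock:
            count = self._pending.get(task_type, 0)
            self._pending[task_type] = max(0, count - 1)


def submit_with_backoff(throttler, task_type, max_retries=5,
                        base_delay=0.05, max_delay=5.0, sleep=time.sleep):
    """Data-node side: retry a throttled submission with exponential backoff.

    Returns the number of retries that were needed. The delay doubles on
    every rejection and is capped at max_delay.
    """
    for attempt in range(max_retries + 1):
        try:
            throttler.on_submit(task_type)
            return attempt
        except TaskThrottledError:
            if attempt == max_retries:
                raise  # give up and surface the rejection to the caller
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

Because the counters are keyed by task type, a flood of one type (say put-mapping) is rejected without blocking another type (say create-index). A real retry loop would likely add random jitter to the delay so that many data nodes do not retry in lockstep; it is left out here to keep the sketch deterministic.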

Describe alternatives you've considered

De-duplication of tasks: We already have a de-duplication framework that prevents submitting duplicate tasks to the master node, but it won't help in all cases. Data nodes can submit different tasks and flood the master, or the master can be flooded by customer-driven activity where the tasks are not duplicates. We want to make the master resilient against a high number of pending tasks, and de-duplication alone won't achieve that.
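
A small Python sketch of why de-duplication alone is not enough. The task representation (a `(type, target)` tuple) and the `dedupe` helper are hypothetical and only illustrate the argument; they do not mirror the existing de-duplication framework.

```python
def dedupe(tasks):
    """Drop tasks that are identical to one already seen, keeping order."""
    seen = set()
    unique = []
    for task in tasks:
        if task not in seen:
            seen.add(task)
            unique.append(task)
    return unique


# 1000 identical tasks collapse into a single queue entry.
duplicates = [("shard-started", "index-a/0")] * 1000

# 1000 distinct tasks all survive de-duplication and still fill the queue.
distinct = [("put-mapping", f"index-{i}") for i in range(1000)]

print(len(dedupe(duplicates)), len(dedupe(distinct)))
```

In the second case the queue still grows without bound, which is the gap a per-task-type limit is meant to close.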

Additional context

The master batches tasks, so it iterates over all tasks queued in the master queue to see whether they can be batched. Such tasks also remain in the queue until they are executed, so they consume memory as well (the amount depends on the task type).
So a high number of pending tasks in the master queue can affect the CPU/JVM of the master node and, in turn, the availability of the whole cluster.

@dhwanilpatel dhwanilpatel added the enhancement Enhancement or improvement to existing feature or request label Apr 1, 2021
@dhwanilpatel
Contributor Author

Breaking the changes into multiple PRs:

  • Add master task throttling changes in data/master nodes (Add master task throttling #553)
  • Add throttling stats to the Stats API
  • Add documentation of the new settings

@anasalkouz
Member

Hi @dhwanilpatel, are you actively working on this? Could you please provide some updates?

@dblock
Member

dblock commented Mar 7, 2022

There was an attempt in #553 to implement this that hasn't been finished. Please feel free to pick it up where it was left!

@dhwanilpatel
Contributor Author

dhwanilpatel commented May 31, 2022

Hello,

I am going to pick this up again to take these changes to completion. The major feedback on the last PR (#554) was to break the changes into multiple PRs for ease of review. Below is the plan for how I will break the changes into multiple PRs.

Below is the list of items for future follow-up:

  • Documentation regarding new settings.
  • Throttling stats in Stats API.

@dblock can you please help create the feature branch for this issue, against which we can raise multiple PRs?

@dblock
Member

dblock commented Jun 14, 2022

Sorry for the late reply - I think @CEHENKLE has a process for feature branches.

You don't need to wait on me, raise a PR and we can redirect it to a feature branch when it's ready, too.

@CEHENKLE
Member

@dhwanilpatel Hey, how's it going? Can we help at all?

@dhwanilpatel
Contributor Author

@CEHENKLE so far it is going as per plan. The Data Node and Master Node side changes are in review. After those PRs, the upcoming PRs should be straightforward.

Thanks to the reviewers for their valuable feedback.

@CEHENKLE
Member

@dhwanilpatel Cool beans. LMK if we can help :)

/C

@shwetathareja
Member

@dhwanilpatel we need a task in here for integ tests as well.

@elfisher elfisher added this to 2.3.0 (September 14th) in OpenSearch Project Roadmap Aug 31, 2022
@elfisher

@dhwanilpatel I wanted to confirm if this is on track for the 2.3 release. If so, pls add the v2.3.0 label to this issue. Thanks!

@dhwanilpatel
Contributor Author

dhwanilpatel commented Oct 10, 2022

Created a follow-up issue for exposing the throttling exception to the user: #4724

@elfisher

elfisher commented Nov 7, 2022

@dhwanilpatel given we are calling the feature "Cluster Manager Task Throttling" can we rename this issue "Cluster Manager Task Throttling"?

@andrross andrross changed the title from "Master Task Throttling" to "Cluster Manager Task Throttling" Dec 22, 2022