Adds cluster manager task throttling documentation #1826

Merged: 9 commits merged into main on Nov 15, 2022

kolchfa-aws
Collaborator

Fixes #1792

Checklist

  • [x] By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developer Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
@kolchfa-aws kolchfa-aws requested a review from a team as a code owner November 6, 2022 23:24
@kolchfa-aws kolchfa-aws self-assigned this Nov 6, 2022
@kolchfa-aws kolchfa-aws added 3 - Tech review PR: Tech review in progress v2.4.0 'Issues and PRs related to version v2.4.0' labels Nov 6, 2022
@kolchfa-aws
Collaborator Author

@dhwanilpatel As discussed, could you review for technical accuracy please?


# Cluster manager task throttling

For many cluster activities, such as defining a mapping or creating an index, data nodes submit tasks to the cluster manager. The cluster manager maintains a pending task queue for these tasks and runs them in a single-threaded environment. Sometimes data nodes may flood the cluster manager with too many tasks at the same time. When this happens, the number of tasks in the queue spikes; this affects the cluster manager performance, and may in turn affect the availability of the whole cluster.
Member

The task can land on the cluster manager node directly or be routed through another node.

For many cluster state updates


Member

The cluster manager maintains a pending task queue for these tasks and runs them in a single-

and executes them in ...

Member

Sometimes data nodes may flood the cluster manager with too many tasks at the same time.

In the past, put-mapping or snapshot tasks have caused pending tasks to pile up on the cluster manager.

Member

The ideal solution is to prevent the caller from submitting too many tasks and to fix the underlying issue that caused the flood of pending tasks. However, that can take a long time and leaves the cluster manager vulnerable to such bugs or issues in the meantime.

Member

There is a need to build a protection mechanism in the cluster manager itself.

Collaborator Author

Hi @shwetathareja. Thanks for your suggestions. The word "executes" is on the list of words to avoid in our style guide. The style guide suggests replacing it with the word "run".


For many cluster activities, such as defining a mapping or creating an index, data nodes submit tasks to the cluster manager. The cluster manager maintains a pending task queue for these tasks and runs them in a single-threaded environment. Sometimes data nodes may flood the cluster manager with too many tasks at the same time. When this happens, the number of tasks in the queue spikes; this affects the cluster manager performance, and may in turn affect the availability of the whole cluster.

To avoid task overload on the cluster manager, you can specify to throttle tasks by setting throttling limits. The cluster manager uses the throttling limits to determine whether to reject tasks from the data nodes. It rejects a task if the total number of tasks of the same type in the pending task queue exceeds the threshold. Since the cluster manager throttles tasks based on the task type, rejecting one task does not affect any other tasks of a different type. For example, if the cluster manager rejects a `put-mapping` task, it can still accept a subsequent `create-index` task. If the cluster manager rejects a task, the data node performs retries with exponential backoff to resubmit the task to the cluster manager. If retries are unsuccessful within the timeout period, OpenSearch returns a cluster timeout error.
Member

Task submission can come from any node, including the cluster manager itself, right?

Collaborator Author

@shwetathareja: I have implemented the comments. Please take a look when you get a chance. Thanks!
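
For readers following the thread, the retry behavior described in the draft paragraph above (retries with exponential backoff, then a timeout error) can be sketched roughly as follows. This is a minimal illustration, not the OpenSearch implementation: the exception classes, the function name `submit_with_backoff`, and every numeric default are hypothetical.

```python
import time


class TaskThrottledError(Exception):
    """Hypothetical signal that the cluster manager rejected the task."""


class ClusterTimeoutError(Exception):
    """Hypothetical stand-in for the cluster timeout error."""


def submit_with_backoff(submit, timeout_s=30.0, base_delay_s=0.1,
                        max_delay_s=5.0, sleep=time.sleep,
                        clock=time.monotonic):
    """Call submit() until it succeeds or timeout_s has elapsed.

    The wait doubles after each rejection (exponential backoff) and is
    capped at max_delay_s. All defaults are illustrative assumptions.
    """
    deadline = clock() + timeout_s
    delay = base_delay_s
    while True:
        try:
            return submit()
        except TaskThrottledError:
            remaining = deadline - clock()
            if remaining <= 0:
                raise ClusterTimeoutError("task not accepted before timeout")
            # Never sleep past the deadline.
            sleep(min(delay, remaining))
            delay = min(delay * 2, max_delay_s)
```

The `sleep` and `clock` parameters exist only so the sketch can be exercised with a fake clock instead of real waiting.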

kolchfa-aws and others added 4 commits November 10, 2022 08:53
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

The first line of defense is to implement mechanisms in the caller nodes to avoid task overload on the cluster manager. However, even with those mechanisms in place, the cluster manager needs a built-in way to protect itself---cluster manager task throttling.

To turn on cluster manager task throttling, you need to specify to throttle tasks by setting throttling limits. The cluster manager uses the throttling limits to determine whether to reject a task.
Contributor

specify throttling tasks?


The cluster manager rejects tasks on the task type basis. For any incoming task, the cluster manager evaluates the total number of tasks of the same type in the pending task queue. If this number exceeds the threshold for this task type, the cluster manager rejects the incoming task. Rejecting a task does not affect tasks of a different type. For example, if the cluster manager rejects a `put-mapping` task, it can still accept a subsequent `create-index` task.
Contributor

rejects tasks on the basis of task types?
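
The per-type rejection rule in the paragraph under review can be illustrated with a small sketch. It assumes a simple in-memory counter per task type; the class `PendingTaskQueue` and its methods are hypothetical names, and the boundary choice (reject when accepting the task would push the count above the limit) is an assumption based on `value` being described as a maximum.

```python
from collections import Counter

NO_THROTTLING = -1  # default `value` described in the docs: no throttling


class PendingTaskQueue:
    """Hypothetical model of per-task-type throttling."""

    def __init__(self, thresholds):
        # thresholds maps a task type to its maximum pending count.
        self.thresholds = dict(thresholds)
        self.pending = Counter()

    def try_submit(self, task_type):
        """Return True if the task is accepted, False if it is rejected."""
        limit = self.thresholds.get(task_type, NO_THROTTLING)
        # Assumption: reject when accepting would exceed the limit.
        if limit != NO_THROTTLING and self.pending[task_type] + 1 > limit:
            return False
        self.pending[task_type] += 1
        return True

    def complete(self, task_type):
        """Mark one pending task of this type as finished."""
        if self.pending[task_type] > 0:
            self.pending[task_type] -= 1


queue = PendingTaskQueue({"put-mapping": 2})
queue.try_submit("put-mapping")
queue.try_submit("put-mapping")
rejected = not queue.try_submit("put-mapping")   # limit reached
other_ok = queue.try_submit("create-index")      # different type, unaffected
```

Because counts are kept per type, a rejected `put-mapping` task leaves `create-index` submissions untouched, which matches the example in the draft.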


## Setting throttling limits

You can set the throttling limits by specifying them in the `cluster_manager.throttling.thresholds` object and updating the [OpenSearch cluster settings]({{site.url}}{{site.baseurl}}/api-reference/cluster-settings). The setting is dynamic, so you can change the behavior of this feature without restarting your cluster.
Contributor

set throttling limits?


The following table describes the `cluster_manager.throttling.thresholds` object.

Field name | Description
Contributor

Suggested change:
- Field name | Description
+ Field Name | Description

Field name | Description
:--- | :---
task-type | The task type. See [supported task types](#supported-task-types) for a list of valid values.
value | The maximum number of tasks of the type specified by the `task-type` in the cluster manager's pending task queue. Default is `-1` (no task throttling).
Contributor

tasks of the task-type type specified by?
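
To make the table concrete, here is a hedged sketch that builds a request body for the cluster settings API from the two fields described above. The nesting (task type as the key, with a `value` field beneath it) is inferred from the table, the `persistent` scope is an assumption, the limit of 100 is purely illustrative, and `throttling_settings` is a hypothetical helper name.

```python
import json


def throttling_settings(thresholds):
    """Build a cluster settings body from {task_type: max_pending_tasks}."""
    return {
        "persistent": {  # assumption: persistent rather than transient
            "cluster_manager.throttling.thresholds": {
                task_type: {"value": value}
                for task_type, value in thresholds.items()
            }
        }
    }


# Illustrative limit only; choose values that suit your cluster.
body = throttling_settings({"put-mapping": 100})
print("PUT _cluster/settings")
print(json.dumps(body, indent=2))
```

Sending this body in a `PUT _cluster/settings` request would apply the limit without a restart, since the setting is described as dynamic.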

Contributor

@ariamarble ariamarble left a comment

looks good other than my comments

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

# Cluster manager task throttling

For many cluster state updates, such as defining a mapping or creating an index, nodes submit tasks to the cluster manager. The cluster manager maintains a pending task queue for these tasks and runs them in a single-threaded environment. When nodes send tens of thousands of resource-intensive tasks, like `put-mapping` or snapshot tasks, these tasks pile up in the queue, and the cluster manager is flooded. This affects the cluster manager performance, and may in turn affect the availability of the whole cluster.
Contributor

suggestion only:
"When nodes send tens of thousands of resource-intensive tasks, like put-mapping or snapshot tasks, these tasks can pile up in the queue and flood the cluster manager."
"This affects cluster manager performance..."
or
"This affects the cluster manager's performance..."

Collaborator Author

This is good. I'll change. Thanks!

Contributor

@cwillum cwillum left a comment

Thumbs up.

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Collaborator

@natebower natebower left a comment

@kolchfa-aws Only two minor changes. Thanks!

_opensearch/cluster-manager-task-throttling.md (outdated, resolved)
_opensearch/cluster-manager-task-throttling.md (outdated, resolved)
kolchfa-aws and others added 2 commits November 11, 2022 12:06
Co-authored-by: Nate Bower <nbower@amazon.com>
Co-authored-by: Nate Bower <nbower@amazon.com>
@kolchfa-aws kolchfa-aws added 6 - Done but waiting to merge PR: The work is done and ready to merge and removed 3 - Tech review PR: Tech review in progress labels Nov 11, 2022
@kolchfa-aws kolchfa-aws merged commit 99bc98a into main Nov 15, 2022
@Naarcha-AWS Naarcha-AWS deleted the Fix1792-task-throttling branch December 13, 2022 19:57
@kolchfa-aws kolchfa-aws added v2.5.0 'Issues and PRs related to version v2.5.0' and removed v2.4.0 'Issues and PRs related to version v2.4.0' labels Jan 9, 2023
@hdhalter hdhalter added the release-notes PR: Include this PR in the automated release notes label Jan 13, 2023
@kolchfa-aws kolchfa-aws added the backport 2.5 PR: Backport label for 2.5 label Jan 24, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 24, 2023
* Adds cluster manager task throttling documentation

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Update cluster-manager-task-throttling.md

* Rewording

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* More rewording

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Reworded for clarity

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Incorporated doc review comments

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* More doc review comments

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Update _opensearch/cluster-manager-task-throttling.md

Co-authored-by: Nate Bower <nbower@amazon.com>

* Update _opensearch/cluster-manager-task-throttling.md

Co-authored-by: Nate Bower <nbower@amazon.com>

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Co-authored-by: Nate Bower <nbower@amazon.com>
(cherry picked from commit 99bc98a)
kolchfa-aws added a commit that referenced this pull request Jan 24, 2023

Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Labels
6 - Done but waiting to merge PR: The work is done and ready to merge backport 2.5 PR: Backport label for 2.5 release-notes PR: Include this PR in the automated release notes v2.5.0 'Issues and PRs related to version v2.5.0'
Development

Successfully merging this pull request may close these issues.

Cluster manager task throttling [DOC]
6 participants