Adds cluster manager task throttling documentation #1826
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
@dhwanilpatel As discussed, could you review for technical accuracy please?
# Cluster manager task throttling

For many cluster activities, such as defining a mapping or creating an index, data nodes submit tasks to the cluster manager. The cluster manager maintains a pending task queue for these tasks and runs them in a single-threaded environment. Sometimes data nodes may flood the cluster manager with too many tasks at the same time. When this happens, the number of tasks in the queue spikes; this affects the cluster manager performance, and may in turn affect the availability of the whole cluster.
The task can land on the cluster manager node directly or be routed through another node.
Suggested wording: "For many cluster state updates"
"The cluster manager maintains a pending task queue for these tasks and runs them in a single-"
Suggested wording: "and executes them in ..."
"Sometimes data nodes may flood the cluster manager with too many tasks at the same time."
In the past, `put-mapping` or snapshot tasks have caused too many pending tasks to pile up on the cluster manager.
The ideal solution is to prevent the caller from submitting too many tasks and to fix the underlying issue that caused the flood of pending tasks. However, this can take a long time and leaves the cluster manager vulnerable to such bugs or issues in the meantime.
The cluster manager itself needs a built-in protection mechanism.
Hi @shwetathareja. Thanks for your suggestions. The word "executes" is on the list of words to avoid in our style guide. The style guide suggests replacing it with the word "run".
To avoid task overload on the cluster manager, you can specify to throttle tasks by setting throttling limits. The cluster manager uses the throttling limits to determine whether to reject tasks from the data nodes. It rejects a task if the total number of tasks of the same type in the pending task queue exceeds the threshold. Since the cluster manager throttles tasks based on the task type, rejecting one task does not affect any other tasks of a different type. For example, if the cluster manager rejects a `put-mapping` task, it can still accept a subsequent `create-index` task. If the cluster manager rejects a task, the data node performs retries with exponential backoff to resubmit the task to the cluster manager. If retries are unsuccessful within the timeout period, OpenSearch returns a cluster timeout error.
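
The retry behavior described in this paragraph can be illustrated with a small Python sketch. It is not OpenSearch's implementation: `TaskRejectedError`, `submit_with_backoff`, and all delay and timeout values are hypothetical names and numbers chosen for the example. It only shows the general shape of retrying with exponential backoff until a timeout expires.

```python
import time


class TaskRejectedError(Exception):
    """Hypothetical error: the cluster manager throttled (rejected) the task."""


def submit_with_backoff(submit, timeout_s=30.0, base_delay_s=0.05,
                        max_delay_s=5.0, sleep=time.sleep, clock=time.monotonic):
    """Call submit() until it succeeds or timeout_s has elapsed.

    The wait doubles after every rejection (base, 2x base, 4x base, ...),
    capped at max_delay_s. All defaults are illustrative assumptions.
    """
    deadline = clock() + timeout_s
    attempt = 0
    while True:
        try:
            return submit()
        except TaskRejectedError:
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            # Give up if the next wait would end after the deadline.
            if clock() + delay > deadline:
                raise TimeoutError("task was still throttled when the timeout expired")
            sleep(delay)
            attempt += 1
```

The `sleep` and `clock` parameters exist only so that the sketch can be exercised with a fake clock instead of real waiting.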
Task submission can come from any node, including the cluster manager itself, right?
@shwetathareja: I have implemented the comments. Please take a look when you get a chance. Thanks!
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
The first line of defense is to implement mechanisms in the caller nodes to avoid task overload on the cluster manager. However, even with those mechanisms in place, the cluster manager needs a built-in way to protect itself---cluster manager task throttling.

To turn on cluster manager task throttling, you need to specify to throttle tasks by setting throttling limits. The cluster manager uses the throttling limits to determine whether to reject a task.
specify throttling tasks?
The cluster manager rejects tasks on the task type basis. For any incoming task, the cluster manager evaluates the total number of tasks of the same type in the pending task queue. If this number exceeds the threshold for this task type, the cluster manager rejects the incoming task. Rejecting a task does not affect tasks of a different type. For example, if the cluster manager rejects a `put-mapping` task, it can still accept a subsequent `create-index` task.
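
As a rough illustration of the per-type check described in this paragraph, here is a toy Python model of a pending task queue. The class and method names are hypothetical and do not mirror OpenSearch internals. The sketch also assumes that a task is rejected when admitting it would push the pending count for its type above the limit; the exact boundary condition is an assumption.

```python
from collections import Counter


class ThrottlingQueue:
    """Toy model of a pending task queue with per-task-type limits."""

    def __init__(self, thresholds):
        # Maps task type -> limit. A missing key or -1 means "no throttling".
        self.thresholds = dict(thresholds)
        self.pending = Counter()

    def try_submit(self, task_type):
        """Return True if the task is queued, False if it is rejected."""
        limit = self.thresholds.get(task_type, -1)
        # Only tasks of the same type count toward the limit (assumed boundary: >=).
        if limit >= 0 and self.pending[task_type] >= limit:
            return False
        self.pending[task_type] += 1
        return True

    def complete(self, task_type):
        """Mark one pending task of this type as finished."""
        if self.pending[task_type] > 0:
            self.pending[task_type] -= 1


queue = ThrottlingQueue({"put-mapping": 2})
queue.try_submit("put-mapping")
queue.try_submit("put-mapping")
print(queue.try_submit("put-mapping"))   # rejected: the put-mapping limit is reached
print(queue.try_submit("create-index"))  # accepted: a different task type is unaffected
```

Keeping one counter per task type is what makes a rejected `put-mapping` task independent of a later `create-index` task in this model.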
rejects tasks on the basis of task types?
## Setting throttling limits

You can set the throttling limits by specifying them in the `cluster_manager.throttling.thresholds` object and updating the [OpenSearch cluster settings]({{site.url}}{{site.baseurl}}/api-reference/cluster-settings). The setting is dynamic, so you can change the behavior of this feature without restarting your cluster.
set throttling limits?
The following table describes the `cluster_manager.throttling.thresholds` object.

Field name | Description
Suggested change: `Field name | Description` to `Field Name | Description`
Field name | Description
:--- | :---
task-type | The task type. See [supported task types](#supported-task-types) for a list of valid values.
value | The maximum number of tasks of the type specified by the `task-type` in the cluster manager's pending task queue. Default is `-1` (no task throttling).
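
To make the shape of the setting concrete, the following Python sketch builds a request body for the cluster settings API. It assumes, based on the table above, that the `cluster_manager.throttling.thresholds` object is keyed by task type and that each entry holds a `value` field. The helper name `throttling_settings` and the limit of 100 are illustrative only.

```python
import json


def throttling_settings(limits, scope="persistent"):
    """Build a cluster settings body from {task_type: max_pending_tasks}.

    Assumed layout: thresholds are keyed by task type, each with a "value".
    A value of -1 means no throttling for that task type.
    """
    if scope not in ("persistent", "transient"):
        raise ValueError("scope must be 'persistent' or 'transient'")
    thresholds = {}
    for task_type, value in limits.items():
        if not isinstance(value, int) or value < -1:
            raise ValueError(f"invalid limit for {task_type!r}: {value!r}")
        thresholds[task_type] = {"value": value}
    return {scope: {"cluster_manager.throttling.thresholds": thresholds}}


# Body that could be sent with a PUT request to the _cluster/settings endpoint.
print(json.dumps(throttling_settings({"put-mapping": 100}), indent=2))
```

Because the setting is dynamic, sending an updated body of this form would change the limits without a cluster restart.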
tasks of the task-type
type specified by?
Looks good other than my comments.
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
# Cluster manager task throttling

For many cluster state updates, such as defining a mapping or creating an index, nodes submit tasks to the cluster manager. The cluster manager maintains a pending task queue for these tasks and runs them in a single-threaded environment. When nodes send tens of thousands of resource-intensive tasks, like `put-mapping` or snapshot tasks, these tasks pile up in the queue, and the cluster manager is flooded. This affects the cluster manager performance, and may in turn affect the availability of the whole cluster.
Suggestion only:
"When nodes send tens of thousands of resource-intensive tasks, like `put-mapping` or snapshot tasks, these tasks can pile up in the queue and flood the cluster manager."
"This affects cluster manager performance..."
or
"This affects the cluster manager's performance..."
This is good. I'll change. Thanks!
Thumbs up.
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
@kolchfa-aws Only two minor changes. Thanks!
Co-authored-by: Nate Bower <nbower@amazon.com>
Co-authored-by: Nate Bower <nbower@amazon.com>
* Adds cluster manager task throttling documentation
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Update cluster-manager-task-throttling.md
* Rewording
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* More rewording
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Reworded for clarity
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Incorporated doc review comments
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* More doc review comments
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Update _opensearch/cluster-manager-task-throttling.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Update _opensearch/cluster-manager-task-throttling.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Co-authored-by: Nate Bower <nbower@amazon.com>
(cherry picked from commit 99bc98a)
* Adds cluster manager task throttling documentation
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Update cluster-manager-task-throttling.md
* Rewording
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* More rewording
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Reworded for clarity
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Incorporated doc review comments
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* More doc review comments
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Update _opensearch/cluster-manager-task-throttling.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Update _opensearch/cluster-manager-task-throttling.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Co-authored-by: Nate Bower <nbower@amazon.com>
(cherry picked from commit 99bc98a)
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Fixes #1792
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.