
Add documentation for Shard Hotspot Identification RCA #3741

Merged
merged 18 commits into from
May 2, 2023

ariamarble
Contributor

Description

Adds documentation for Shard Hotspot Identification RCA

Issues Resolved

fixes #3635

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: ariamarble <armarble@amazon.com>
@ariamarble ariamarble self-assigned this Apr 10, 2023
@ariamarble ariamarble added 2 - In progress Issue/PR: The issue or PR is in progress. anomaly-detection alerting v2.7.0 labels Apr 10, 2023
Signed-off-by: ariamarble <armarble@amazon.com>
Signed-off-by: ariamarble <armarble@amazon.com>
Signed-off-by: ariamarble <armarble@amazon.com>
_monitoring-your-cluster/pa/rca/shard-hotspot.md
- `CPU_Utilization`
- `Heap_AllocRate`

These metrics provide an accurate picture of operation intensities for certain shards, such as the following:
Collaborator

I think we should mention here that although the metric that you should monitor depends on the cluster configuration and the workload, these are suggested metrics to monitor for the following operations.

Contributor Author

So, my understanding here is that these are the two metrics specifically monitored by this RCA, and only these two. Are you saying this isn't correct?

Collaborator

No, I understand that those are the two metrics, but the intro sentence here can be rewritten to be more explicit. The question is, how does the user know which metric to monitor? That depends on the user's cluster configuration and workload. However, for workloads with heavy bulk operation usage, for example, we recommend that they look for a high heap allocation rate. For workloads with heavy search request usage, they should look for high CPU utilization, and so on.

Collaborator

Or, another way it can be written is to emphasize which operations may lead to which resource consumption.

Contributor Author

ah, ok, I'll work on it

_monitoring-your-cluster/pa/rca/shard-hotspot.md

#### Response

```json
Collaborator

I would add a table describing the response fields, even though they are pretty self-explanatory. The only things that are not apparent are:

  • What other values are possible for state?
  • If the state is "healthy", is HotClusterSummary populated, empty, or omitted altogether?
  • Is the time period fixed or configurable?

Contributor Author

I've put out a request for some further clarification. In the meantime, I'll write up a table.

ariamarble and others added 2 commits April 19, 2023 18:38
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: ariamarble <armarble@amazon.com>
ariamarble and others added 2 commits April 19, 2023 19:00
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: ariamarble <armarble@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

## Shard hotspot identification

Hot shard identification Root Cause Analysis (RCA) lets you identify a hot shard within an index. A hot shard is an outlier that consumes more resources than other shards and may lead to poor indexing and search performance. The hot shard identification RCA monitors the following metrics:

"may lead to poor indexing and search performance"

Any queries hitting this shard will see an impact, so I'm wondering whether "may lead to" should be replaced with "causes"?

- CPU utilization
- Heap allocation rate

Although the metric that you should monitor depends on the cluster configuration and the workload, refer to the following list for common operations which may lead to increased resource consumption:

"Although the metric that you should monitor depends on the cluster configuration and the workload"

  1. A hot shard is a workload problem and not a cluster configuration problem. Let us remove the 'cluster configuration' part from here.
  2. Replace 'metric' with 'Key Performance Indicator (KPI)' here?

state | Object | The state of the cluster determined by the RCA. The `state` can be `healthy`, `unhealthy`, or `unknown`.
HotClusterSummary.HotNodeSummary.number_of_nodes | Integer | The number of nodes in the cluster.
HotClusterSummary.HotNodeSummary.number_of_unhealthy_nodes | Integer | The number of nodes found to be in an `unhealthy` state.
HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_type | Object | The type of resource checked, either "cpu usage" or "heap".

The type of resource checked, either "cpu usage" or "heap". -> The type of resource causing the unhealthy state, either "cpu usage" or "heap".

HotClusterSummary.HotNodeSummary.number_of_nodes | Integer | The number of nodes in the cluster.
HotClusterSummary.HotNodeSummary.number_of_unhealthy_nodes | Integer | The number of nodes found to be in an `unhealthy` state.
HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_type | Object | The type of resource checked, either "cpu usage" or "heap".
HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_metric | String | The definition of the resource_type. Either "cpu usage(num of cores)" or "heap alloc rate(heap alloc rate in bytes per second)".

cpu usage(num of cores) -> cpu usage ratio, this can be linked to https://opensearch.org/docs/latest/monitoring-your-cluster/pa/reference/

HotClusterSummary.HotNodeSummary.number_of_unhealthy_nodes | Integer | The number of nodes found to be in an `unhealthy` state.
HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_type | Object | The type of resource checked, either "cpu usage" or "heap".
HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_metric | String | The definition of the resource_type. Either "cpu usage(num of cores)" or "heap alloc rate(heap alloc rate in bytes per second)".
HotClusterSummary.HotNodeSummary.HotResourceSummary.threshold | Float | The value that determines if a resource is highly utilized.

highly utilized -> contended

HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_metric | String | The definition of the resource_type. Either "cpu usage(num of cores)" or "heap alloc rate(heap alloc rate in bytes per second)".
HotClusterSummary.HotNodeSummary.HotResourceSummary.threshold | Float | The value that determines if a resource is highly utilized.
HotClusterSummary.HotNodeSummary.HotResourceSummary.value | Float | The current value of the resource.
HotClusterSummary.HotNodeSummary.HotResourceSummary.time_period_seconds | Time | The amount of time that the resource_type has to be above the threshold value in order to mark the shard state as `unhealthy`.

The amount of time that the resource_type has to be above the threshold value in order to mark the shard state as unhealthy -> The amount of time a shard is monitored before declaring its state as healthy/unhealthy

HotClusterSummary.HotNodeSummary.HotResourceSummary.threshold | Float | The value that determines if a resource is highly utilized.
HotClusterSummary.HotNodeSummary.HotResourceSummary.value | Float | The current value of the resource.
HotClusterSummary.HotNodeSummary.HotResourceSummary.time_period_seconds | Time | The amount of time that the resource_type has to be above the threshold value in order to mark the shard state as `unhealthy`.
HotClusterSummary.HotNodeSummary.HotResourceSummary.meta_data | String | The metadata associated with the resource_type.

This is the most important field. We can break it down.
In "meta_data": "QRF4rBM7SNCDr1g3KU6HyA index9 0", the value is made up of three space-separated fields representing Node_Name, Index_Name, and Shard_Id, respectively. Here, shard 0 of index index9 on node QRF4rBM7SNCDr1g3KU6HyA is hot.
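
For readers consuming this field programmatically, here is a minimal Python sketch of the breakdown described in the comment above. It assumes the three parts are always separated by single spaces in the order node, index, shard ID, as in the one example given. The names `HotShard` and `parse_meta_data` are hypothetical and not part of any OpenSearch API.

```python
from typing import NamedTuple


class HotShard(NamedTuple):
    """The three parts of a `meta_data` value (hypothetical helper type)."""
    node: str
    index: str
    shard_id: int


def parse_meta_data(meta_data: str) -> HotShard:
    """Split a `meta_data` string into node name, index name, and shard ID.

    Assumes the layout "<node> <index> <shard_id>" seen in the review comment.
    """
    parts = meta_data.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 space-separated fields, got {len(parts)}")
    node, index, shard_id = parts
    return HotShard(node=node, index=index, shard_id=int(shard_id))


hot = parse_meta_data("QRF4rBM7SNCDr1g3KU6HyA index9 0")
print(f"Shard {hot.shard_id} of index {hot.index} on node {hot.node} is hot")
```

Returning a named tuple keeps the three parts addressable by name, which may be easier to read in alerting code than positional indexing.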

Comment on lines 18 to 21
- Bulk requests: High heap allocation rate.
- Search requests: High CPU utilization
- Complex queries: High CPU utilization and high heap allocation rate.
- Document updates: High CPU utilization and high heap allocation rate.

This is generic and not related to shard hotspots. Instead, we can describe a cluster workload that uses the `_routing` parameter and/or a custom document ID, which maps to a segment on disk. Repeated access to this segment (and, in turn, the OpenSearch shard) leads to more resource consumption, making it a hot shard.

We can discuss more on the phrasing of this.

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
- CPU utilization
- Heap allocation rate

Shards may become hot because of the nature of your workload. When you use a `_routing` parameter or a custom document ID, a specific shard within the cluster receives frequent updates, consuming more CPU and heap resources than other shards.

a specific shard -> specific shard(s)
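
To illustrate why a skewed `_routing` value can concentrate load, here is a small Python sketch. It uses a simplified model in which a document's shard is chosen as `hash(routing) % number_of_primary_shards`. The hash here is `zlib.crc32`, used purely as a stand-in and not the hash OpenSearch itself uses; the function names and the tenant and document IDs are all hypothetical.

```python
import zlib
from collections import Counter
from typing import Iterable


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Pick a shard for a routing value (simplified model, stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


def shard_distribution(routing_values: Iterable[str], num_primary_shards: int) -> Counter:
    """Count how many documents land on each shard."""
    return Counter(shard_for(r, num_primary_shards) for r in routing_values)


# Routing by unique document ID: documents scatter across shards.
spread = shard_distribution((f"doc-{i}" for i in range(1000)), 5)

# Custom routing where 900 of 1,000 documents share one routing value:
# all 900 hash to the same shard, so that shard takes most of the writes.
skewed = shard_distribution(["tenant-a"] * 900 + [f"tenant-{i}" for i in range(100)], 5)

print("by document ID:", dict(spread))
print("by shared routing value:", dict(skewed))
```

In the skewed case, at least 900 documents end up on a single shard regardless of the hash used, which is the kind of workload pattern the paragraph under review describes.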

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
@khushbr khushbr left a comment

LGTM! Thank you for making these changes.

@kolchfa-aws kolchfa-aws added 4 - Doc Review PR: Doc review in progress release-notes PR: Include this PR in the automated release notes and removed 2 - In progress Issue/PR: The issue or PR is in progress. labels Apr 25, 2023
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Contributor

@cwillum cwillum left a comment

Looks great.

HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_metric | String | The definition of the resource_type. Either "cpu usage(num of cores)" or "heap alloc rate(heap alloc rate in bytes per second)".
HotClusterSummary.HotNodeSummary.HotResourceSummary.threshold | Float | The value that determines if a resource is contended.
HotClusterSummary.HotNodeSummary.HotResourceSummary.value | Float | The current value of the resource.
HotClusterSummary.HotNodeSummary.HotResourceSummary.time_period_seconds | Time | The amount of time a shard is monitored before its state was declared as healthy or unhealthy.
Contributor

Suggested change
HotClusterSummary.HotNodeSummary.HotResourceSummary.time_period_seconds | Time | The amount of time a shard is monitored before its state was declared as healthy or unhealthy.
The amount of time a shard was monitored before its state was declared as healthy or unhealthy.
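
As a rough illustration of how the response fields discussed in this thread fit together, the following Python sketch walks a response entry and collects resources whose `value` exceeds `threshold`. The nesting (lists of `HotClusterSummary`, `HotNodeSummary`, and `HotResourceSummary` objects) is an assumption inferred from the dotted field paths in the table, and the numeric values in `sample` are made up. Only the `meta_data` string and the `resource_type` and `resource_metric` strings come from this conversation. The function name `find_hot_resources` is hypothetical.

```python
from typing import Any, Dict, List


def find_hot_resources(rca_entry: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return resources over their threshold from an `unhealthy` RCA entry.

    Any state other than "unhealthy" yields an empty list, so the function
    does not depend on whether HotClusterSummary is present in that case.
    """
    if rca_entry.get("state") != "unhealthy":
        return []
    findings = []
    for cluster in rca_entry.get("HotClusterSummary", []):
        for node in cluster.get("HotNodeSummary", []):
            for resource in node.get("HotResourceSummary", []):
                if resource["value"] > resource["threshold"]:
                    findings.append({
                        "resource_type": resource["resource_type"],
                        "value": resource["value"],
                        "threshold": resource["threshold"],
                        "meta_data": resource["meta_data"],
                    })
    return findings


# Hypothetical entry: structure assumed, numbers invented for illustration.
sample = {
    "state": "unhealthy",
    "HotClusterSummary": [{
        "HotNodeSummary": [{
            "HotResourceSummary": [{
                "resource_type": "cpu usage",
                "resource_metric": "cpu usage(num of cores)",
                "threshold": 0.5,
                "value": 0.8,
                "time_period_seconds": 60,
                "meta_data": "QRF4rBM7SNCDr1g3KU6HyA index9 0",
            }]
        }]
    }],
}

for finding in find_hot_resources(sample):
    print(finding)
```

Checking `state` first sidesteps the open question raised earlier in the review about what `HotClusterSummary` contains when the state is `healthy`.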

nav_order: 30
---

## Hot shard identification
Contributor

Do you want to begin this topic with a heading level 1? Consider that "Response fields" further down is the same heading level.

Collaborator

Thank you! Changed.

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Collaborator

@natebower natebower left a comment

@kolchfa-aws Just a few minor changes. Thanks!

_monitoring-your-cluster/pa/rca/shard-hotspot.md
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
@kolchfa-aws kolchfa-aws added 6 - Done but waiting to merge PR: The work is done and ready to merge and removed 4 - Doc Review PR: Doc review in progress labels Apr 25, 2023
@kolchfa-aws kolchfa-aws merged commit 5a3c106 into main May 2, 2023
vagimeli added a commit that referenced this pull request May 4, 2023
* Shard Hotspot RCA

Signed-off-by: ariamarble <armarble@amazon.com>

* small update

Signed-off-by: ariamarble <armarble@amazon.com>

* API and content update

Signed-off-by: ariamarble <armarble@amazon.com>

* additional updates

Signed-off-by: ariamarble <armarble@amazon.com>

* Apply suggestions from doc review

Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* further doc review changes

Signed-off-by: ariamarble <armarble@amazon.com>

* Apply suggestions from doc review

Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Added response field table

Signed-off-by: ariamarble <armarble@amazon.com>

* Add more details

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Implemented tech review comments

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Last tech review update

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Small change

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* small typo

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Implemented doc review comments

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Update _monitoring-your-cluster/pa/rca/shard-hotspot.md

Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: ariamarble <armarble@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Fanit Kolchina <kolchfa@amazon.com>
Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
harshavamsi pushed a commit to harshavamsi/documentation-website that referenced this pull request Oct 31, 2023
…oject#3741)

@Naarcha-AWS Naarcha-AWS deleted the issue3635 branch March 28, 2024 23:23
Labels
6 - Done but waiting to merge PR: The work is done and ready to merge alerting anomaly-detection release-notes PR: Include this PR in the automated release notes v2.7.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[DOC] Add documentation for Shard Hotspot Identification feature in RCA Engine
7 participants