Add documentation for Shard Hotspot Identification RCA #3741

ariamarble · 2023-04-10T23:20:36Z

Description

Adds documentation for Shard Hotspot Identification RCA

Issues Resolved

fixes #3635

Checklist

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developer Certificate of Origin.
For more information on following the Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: ariamarble <armarble@amazon.com>

…ation-website into issue3635

Signed-off-by: ariamarble <armarble@amazon.com>

_monitoring-your-cluster/pa/rca/shard-hotspot.md

kolchfa-aws · 2023-04-19T15:16:20Z

_monitoring-your-cluster/pa/rca/shard-hotspot.md

+- `CPU_Utilization`
+- `Heap_AllocRate`
+
+These metrics provide an accurate picture of operation intensities for certain shards, such as the following: 


I think we should mention here that although the metric that you should monitor depends on the cluster configuration and the workload, these are suggested metrics to monitor for the following operations.

So, my understanding here is that these are the two metrics specifically monitored by this RCA, and only these two. Are you saying this isn't correct?

No, I understand that those are the two metrics, but the intro sentence here can be rewritten to be more explicit. The question is, how does the user know which metric to monitor? That depends on the user's cluster configuration and workload. However, for workloads with heavy bulk operation usage, for example, we recommend that they look for a high heap allocation rate. For workloads with heavy search request usage, they should look for high CPU utilization, and so on.

Or, another way it can be written is to emphasize which operations may lead to which resource consumption.

ah, ok, I'll work on it

_monitoring-your-cluster/pa/rca/shard-hotspot.md

kolchfa-aws · 2023-04-19T15:36:42Z

_monitoring-your-cluster/pa/rca/shard-hotspot.md

+
+#### Response
+
+```json


I would add a table describing the response fields, even though they are pretty self-explanatory. The only things that are not apparent are:

What other values are possible for state?

If the state is "healthy", is HotClusterSummary populated, empty, or omitted altogether?

Is the time period fixed or configurable?

I've put out a request for some further clarification. In the meantime, I'll write up a table.

_monitoring-your-cluster/pa/rca/shard-hotspot.md

Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

Signed-off-by: ariamarble <armarble@amazon.com>

_monitoring-your-cluster/pa/rca/shard-hotspot.md

Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

Signed-off-by: ariamarble <armarble@amazon.com>

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

khushbr · 2023-04-24T21:48:17Z

_monitoring-your-cluster/pa/rca/shard-hotspot.md

+
+## Shard hotspot identification
+
+Hot shard identification Root Cause Analysis (RCA) lets you identify a hot shard within an index. A hot shard is an outlier that consumes more resources than other shards and may lead to poor indexing and search performance. The hot shard identification RCA monitors the following metrics:


"may lead to poor indexing and search performance"

Any query hitting this shard will be impacted, so I'm wondering whether "may lead to" should be replaced with "causes"?

khushbr · 2023-04-24T21:50:39Z

_monitoring-your-cluster/pa/rca/shard-hotspot.md

+- CPU utilization
+- Heap allocation rate
+
+Although the metric that you should monitor depends on the cluster configuration and the workload, refer to the following list for common operations which may lead to increased resource consumption:


"Although the metric that you should monitor depends on the cluster configuration and the workload"

A hot shard is a workload problem, not a cluster configuration problem. Let's remove the 'cluster configuration' part from here.

Replace 'metric' with 'Key Performance Indicator (KPI)' here?

khushbr · 2023-04-24T21:56:44Z

_monitoring-your-cluster/pa/rca/shard-hotspot.md

+state | Object | The state of the cluster determined by the RCA. The `state` can be `healthy`, `unhealthy`, or `unknown`.
+HotClusterSummary.HotNodeSummary.number_of_nodes | Integer | The number of nodes in the cluster.
+HotClusterSummary.HotNodeSummary.number_of_unhealthy_nodes | Integer | The number of nodes found to be in an `unhealthy` state.
+HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_type | Object | The type of resource checked, either "cpu usage" or "heap".


The type of resource checked, either "cpu usage" or "heap". -> The type of resource causing the unhealthy state, either "cpu usage" or "heap".

khushbr · 2023-04-24T21:58:20Z

_monitoring-your-cluster/pa/rca/shard-hotspot.md

+HotClusterSummary.HotNodeSummary.number_of_nodes | Integer | The number of nodes in the cluster.
+HotClusterSummary.HotNodeSummary.number_of_unhealthy_nodes | Integer | The number of nodes found to be in an `unhealthy` state.
+HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_type | Object | The type of resource checked, either "cpu usage" or "heap".
+HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_metric | String | The definition of the resource_type. Either "cpu usage(num of cores)" or "heap alloc rate(heap alloc rate in bytes per second)".


cpu usage(num of cores) -> cpu usage ratio. This can be linked to https://opensearch.org/docs/latest/monitoring-your-cluster/pa/reference/

khushbr · 2023-04-24T21:59:03Z

_monitoring-your-cluster/pa/rca/shard-hotspot.md

+HotClusterSummary.HotNodeSummary.number_of_unhealthy_nodes | Integer | The number of nodes found to be in an `unhealthy` state.
+HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_type | Object | The type of resource checked, either "cpu usage" or "heap".
+HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_metric | String | The definition of the resource_type. Either "cpu usage(num of cores)" or "heap alloc rate(heap alloc rate in bytes per second)".
+HotClusterSummary.HotNodeSummary.HotResourceSummary.threshold | Float | The value that determines if a resource is highly utilized.


highly utilized -> contended

khushbr · 2023-04-24T22:01:56Z

_monitoring-your-cluster/pa/rca/shard-hotspot.md

+HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_metric | String | The definition of the resource_type. Either "cpu usage(num of cores)" or "heap alloc rate(heap alloc rate in bytes per second)".
+HotClusterSummary.HotNodeSummary.HotResourceSummary.threshold | Float | The value that determines if a resource is highly utilized.
+HotClusterSummary.HotNodeSummary.HotResourceSummary.value | Float | The current value of the resource.
+HotClusterSummary.HotNodeSummary.HotResourceSummary.time_period_seconds | Time | The amount of time that the resource_type has to be above the threshold value in order to mark the shard state as `unhealthy`.


The amount of time that the resource_type has to be above the threshold value in order to mark the shard state as unhealthy -> The amount of time a shard is monitored before declaring its state as healthy/unhealthy

khushbr · 2023-04-24T22:04:29Z

_monitoring-your-cluster/pa/rca/shard-hotspot.md

+HotClusterSummary.HotNodeSummary.HotResourceSummary.threshold | Float | The value that determines if a resource is highly utilized.
+HotClusterSummary.HotNodeSummary.HotResourceSummary.value | Float | The current value of the resource.
+HotClusterSummary.HotNodeSummary.HotResourceSummary.time_period_seconds | Time | The amount of time that the resource_type has to be above the threshold value in order to mark the shard state as `unhealthy`.
+HotClusterSummary.HotNodeSummary.HotResourceSummary.meta_data | String | The metadata associated with the resource_type.


This is the most important field. We can break it down.
"meta_data": "QRF4rBM7SNCDr1g3KU6HyA index9 0": the value here is made up of three space-separated fields representing Node_Name, Index_Name, and Shard_Id, respectively. In this example, shard 0 of index index9 on node QRF4rBM7SNCDr1g3KU6HyA is hot.

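As an illustration of the breakdown above, here is a minimal Python sketch that splits a `meta_data` value into its three parts. It assumes the fields are whitespace-separated and appear in the order Node_Name, Index_Name, Shard_Id, as in the sample value. `HotShard` and `parse_meta_data` are hypothetical names used only for this example; they are not part of the RCA API.

```python
from typing import NamedTuple


class HotShard(NamedTuple):
    """The three parts of a meta_data value (hypothetical container)."""
    node_name: str
    index_name: str
    shard_id: int


def parse_meta_data(meta_data: str) -> HotShard:
    """Split a meta_data string into node name, index name, and shard ID.

    Assumes exactly three whitespace-separated fields, in that order.
    """
    parts = meta_data.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {meta_data!r}")
    node_name, index_name, shard_id = parts
    return HotShard(node_name, index_name, int(shard_id))


# Sample value taken from the review comment above.
hot = parse_meta_data("QRF4rBM7SNCDr1g3KU6HyA index9 0")
print(f"Shard {hot.shard_id} of index {hot.index_name} on node {hot.node_name} is hot")
```

A reader could apply the same split to each `meta_data` value in a response to list the hot shards by index and node.
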
khushbr · 2023-04-24T22:07:42Z

_monitoring-your-cluster/pa/rca/shard-hotspot.md

+- Bulk requests: High heap allocation rate.
+- Search requests: High CPU utilization
+- Complex queries: High CPU utilization and high heap allocation rate.
+- Document updates: High CPU utilization and high heap allocation rate.


This is generic and not related to shard hotspots. Instead, we can describe a cluster workload that uses the `_routing` parameter and/or a custom document ID, which maps to a segment on disk. Repeated access to this segment (and, in turn, to the OpenSearch shard) leads to more resource consumption, and the shard becomes a hot shard.

We can discuss more on the phrasing of this.

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

khushbr · 2023-04-25T01:50:08Z

_monitoring-your-cluster/pa/rca/shard-hotspot.md

+- CPU utilization
+- Heap allocation rate
+
+Shards may become hot because of the nature of your workload. When you use a `_routing` parameter or a custom document ID, a specific shard within the cluster receives frequent updates, consuming more CPU and heap resources than other shards.


a specific shard -> specific shard(s)
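To make the point about `_routing` concrete, here is a rough Python sketch of how a shared routing value can concentrate documents on one shard. It assumes a simplified model in which the target shard is a hash of the routing value modulo the number of primary shards. The MD5-based hash is a stand-in chosen for determinism, not the hash that OpenSearch actually uses, and `shard_for` and `shard_distribution` are hypothetical helper names.

```python
import hashlib
from collections import Counter


def shard_for(routing_value: str, num_shards: int) -> int:
    """Map a routing value to a shard number using a stand-in hash."""
    digest = hashlib.md5(routing_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def shard_distribution(routing_values, num_shards: int) -> Counter:
    """Count how many documents land on each shard."""
    return Counter(shard_for(value, num_shards) for value in routing_values)


# 1,000 documents that all share one routing value (hypothetical "tenant-a").
skewed = shard_distribution(["tenant-a"] * 1000, num_shards=5)

# 1,000 documents that are each routed by a unique ID.
spread = shard_distribution([f"doc-{i}" for i in range(1000)], num_shards=5)

print("shared routing value:", dict(skewed))
print("unique document IDs: ", dict(spread))
```

With a shared routing value, every document lands on a single shard, which is the kind of skew the hot shard RCA is meant to surface.
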

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

khushbr

LGTM! Thank you for making these changes.

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

cwillum

Looks great.

cwillum · 2023-04-25T15:26:23Z

_monitoring-your-cluster/pa/rca/shard-hotspot.md

+HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_metric | String | The definition of the resource_type. Either "cpu usage(num of cores)" or "heap alloc rate(heap alloc rate in bytes per second)".
+HotClusterSummary.HotNodeSummary.HotResourceSummary.threshold | Float | The value that determines if a resource is contended.
+HotClusterSummary.HotNodeSummary.HotResourceSummary.value | Float | The current value of the resource.
+HotClusterSummary.HotNodeSummary.HotResourceSummary.time_period_seconds | Time | The amount of time a shard is monitored before its state was declared as healthy or unhealthy.


Suggested change

-HotClusterSummary.HotNodeSummary.HotResourceSummary.time_period_seconds | Time | The amount of time a shard is monitored before its state was declared as healthy or unhealthy.
+HotClusterSummary.HotNodeSummary.HotResourceSummary.time_period_seconds | Time | The amount of time a shard was monitored before its state was declared as healthy or unhealthy.

cwillum · 2023-04-25T15:27:38Z

_monitoring-your-cluster/pa/rca/shard-hotspot.md

+nav_order: 30
+---
+
+## Hot shard identification


Do you want to begin this topic with a heading level 1? Consider that "Response fields" further down is the same heading level.

Thank you! Changed.

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

_monitoring-your-cluster/pa/rca/shard-hotspot.md

Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

natebower

@kolchfa-aws Just a few minor changes. Thanks!

_monitoring-your-cluster/pa/rca/shard-hotspot.md

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Shard Hotspot RCA
Signed-off-by: ariamarble <armarble@amazon.com>

* small update
Signed-off-by: ariamarble <armarble@amazon.com>

* API and content update
Signed-off-by: ariamarble <armarble@amazon.com>

* additional updates
Signed-off-by: ariamarble <armarble@amazon.com>

* Apply suggestions from doc review
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* further doc review changes
Signed-off-by: ariamarble <armarble@amazon.com>

* Apply suggestions from doc review
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Added response field table
Signed-off-by: ariamarble <armarble@amazon.com>

* Add more details
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Implemented tech review comments
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Last tech review update
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Small change
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* small typo
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Implemented doc review comments
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Update _monitoring-your-cluster/pa/rca/shard-hotspot.md
Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: ariamarble <armarble@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Fanit Kolchina <kolchfa@amazon.com>
Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>

This reverts commit 93aa981.

Shard Hotspot RCA

7354936

Signed-off-by: ariamarble <armarble@amazon.com>

ariamarble self-assigned this Apr 10, 2023

ariamarble added the 2 - In progress, anomaly-detection, alerting, and v2.7.0 labels Apr 10, 2023

ariamarble added 5 commits April 15, 2023 12:49

Merge branch 'main' of https://github.com/opensearch-project/document…

a2cece7

…ation-website into issue3635

Merge branch 'main' of https://github.com/opensearch-project/document…

1bb02f1

…ation-website into issue3635

small update

fa98031

Signed-off-by: ariamarble <armarble@amazon.com>

API and content update

f7556c3

Signed-off-by: ariamarble <armarble@amazon.com>

additional updates

2d0b5d9

Signed-off-by: ariamarble <armarble@amazon.com>

kolchfa-aws reviewed Apr 19, 2023

Naarcha-AWS reviewed Apr 19, 2023

ariamarble and others added 2 commits April 19, 2023 18:38

Apply suggestions from doc review

a9fe815

Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

further doc review changes

abc21b1

Signed-off-by: ariamarble <armarble@amazon.com>

kolchfa-aws reviewed Apr 20, 2023

ariamarble and others added 2 commits April 19, 2023 19:00

Apply suggestions from doc review

3735f4a

Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

Added response field table

8b220f4

Signed-off-by: ariamarble <armarble@amazon.com>

ariamarble assigned kolchfa-aws Apr 23, 2023

Add more details

435d72e

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

kolchfa-aws requested review from carolxob, cwillum, hdhalter, JeffHuss, vagimeli, ananzh, seanneumann, AMoo-Miki and natebower as code owners April 24, 2023 20:02

khushbr reviewed Apr 24, 2023

kolchfa-aws added 2 commits April 24, 2023 18:57

Implemented tech review comments

0c2bb6b

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

Last tech review update

6ac282a

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

khushbr reviewed Apr 25, 2023

Small change

8b0b723

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

khushbr approved these changes Apr 25, 2023

kolchfa-aws added the 4 - Doc Review and release-notes labels and removed the 2 - In progress label Apr 25, 2023

small typo

658d2e7

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

cwillum approved these changes Apr 25, 2023

Implemented doc review comments

b26738a

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

vagimeli approved these changes Apr 25, 2023

Update _monitoring-your-cluster/pa/rca/shard-hotspot.md

461cf79

Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

natebower reviewed Apr 25, 2023

Apply suggestions from code review

2c8d3db

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

kolchfa-aws added the 6 - Done but waiting to merge label and removed the 4 - Doc Review label Apr 25, 2023

bbarani mentioned this pull request Apr 26, 2023

[RELEASE] Release version 2.7.0 opensearch-project/opensearch-build#3230

Closed

42 tasks

kolchfa-aws merged commit 5a3c106 into main May 2, 2023

bbarani mentioned this pull request May 2, 2023

[Retrospective] Release version 2.7.0 opensearch-project/opensearch-build#3431

Closed

vagimeli added a commit that referenced this pull request May 4, 2023

Revert "Add documentation for Shard Hotspot Identification RCA (#3741)"

92cb886

This reverts commit 93aa981.

Naarcha-AWS deleted the issue3635 branch March 28, 2024 23:23
