Add documentation for Shard Hotspot Identification RCA #3741
Signed-off-by: ariamarble <armarble@amazon.com>
Signed-off-by: ariamarble <armarble@amazon.com>
Signed-off-by: ariamarble <armarble@amazon.com>
Signed-off-by: ariamarble <armarble@amazon.com>
- `CPU_Utilization`
- `Heap_AllocRate`

These metrics provide an accurate picture of operation intensities for certain shards, such as the following:
I think we should mention here that although the metric that you should monitor depends on the cluster configuration and the workload, these are suggested metrics to monitor for the following operations.
So, my understanding here is that these are the two metrics specifically monitored by this RCA, and only these two. Are you saying this isn't correct?
No, I understand that those are the two metrics, but the intro sentence here can be rewritten to be more explicit. The question is, how does the user know which metric to monitor? That depends on the user's cluster configuration and workload. However, for workloads with heavy bulk operation usage, for example, we recommend that they look for a high heap allocation rate. For workloads with heavy search request usage, they should look for high CPU utilization, etc.
Or, another way it can be written is to emphasize which operations may lead to which resource consumption.
ah, ok, I'll work on it

#### Response

```json |
I would add a table describing the response fields, even though they are pretty self-explanatory. The only things that are not apparent are:
- What other values are possible for `state`?
- If the state is "healthy", is `HotClusterSummary` populated, empty, or omitted altogether?
- Is the time period fixed or configurable?
I've put out a request for some further clarification; in the meantime, I'll write up a table.
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: ariamarble <armarble@amazon.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: ariamarble <armarble@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

## Shard hotspot identification

Hot shard identification Root Cause Analysis (RCA) lets you identify a hot shard within an index. A hot shard is an outlier that consumes more resources than other shards and may lead to poor indexing and search performance. The hot shard identification RCA monitors the following metrics:
"may lead to poor indexing and search performance"
Any queries hitting this shard will see an impact, so I'm wondering if "may lead to" should be replaced with "causes"?
- CPU utilization
- Heap allocation rate

Although the metric that you should monitor depends on the cluster configuration and the workload, refer to the following list for common operations which may lead to increased resource consumption:
"Although the metric that you should monitor depends on the cluster configuration and the workload"
- Hot Shard is a workload problem and not a cluster configuration problem. Let us remove the 'cluster configuration' part from here.
- Replace 'metric' with 'Key Performance Indicator(KPI)' here?
state | Object | The state of the cluster determined by the RCA. The `state` can be `healthy`, `unhealthy`, or `unknown`.
HotClusterSummary.HotNodeSummary.number_of_nodes | Integer | The number of nodes in the cluster.
HotClusterSummary.HotNodeSummary.number_of_unhealthy_nodes | Integer | The number of nodes found to be in an `unhealthy` state.
HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_type | Object | The type of resource checked, either "cpu usage" or "heap".
The type of resource checked, either "cpu usage" or "heap".
-> The type of resource causing the unhealthy state, either "cpu usage" or "heap".
HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_metric | String | The definition of the resource_type. Either "cpu usage(num of cores)" or "heap alloc rate(heap alloc rate in bytes per second)".
`cpu usage(num of cores)` -> `cpu usage ratio`; this can be linked to https://opensearch.org/docs/latest/monitoring-your-cluster/pa/reference/
HotClusterSummary.HotNodeSummary.HotResourceSummary.threshold | Float | The value that determines if a resource is highly utilized.
highly utilized
-> contended
HotClusterSummary.HotNodeSummary.HotResourceSummary.value | Float | The current value of the resource.
HotClusterSummary.HotNodeSummary.HotResourceSummary.time_period_seconds | Time | The amount of time that the resource_type has to be above the threshold value in order to mark the shard state as `unhealthy`.
The amount of time that the resource_type has to be above the threshold value in order to mark the shard state as unhealthy
-> The amount of time shard is monitored before declaring its state as healthy/unhealthy
HotClusterSummary.HotNodeSummary.HotResourceSummary.meta_data | String | The metadata associated with the resource_type.
This is the most important field. We can break it down. In `"meta_data": "QRF4rBM7SNCDr1g3KU6HyA index9 0"`, the value is made of 3 fields representing `Node_Name Index_Name Shard_Id`, respectively. Shard `0` of index `index9` on node `QRF4rBM7SNCDr1g3KU6HyA` is hot.
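To illustrate the breakdown described in this comment, here is a minimal Python sketch that splits a `meta_data` value into its three parts. It assumes the fields are whitespace-separated and always appear in the order `Node_Name Index_Name Shard_Id`; the names `HotShard` and `parse_meta_data` are hypothetical helpers, not part of any OpenSearch API.

```python
from typing import NamedTuple


class HotShard(NamedTuple):
    """Hypothetical container for the three parts of a meta_data value."""
    node_name: str
    index_name: str
    shard_id: int


def parse_meta_data(meta_data: str) -> HotShard:
    """Split a meta_data string into node name, index name, and shard ID."""
    # Assumption: exactly three whitespace-separated fields, in the order
    # Node_Name Index_Name Shard_Id, as described in the review comment.
    parts = meta_data.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {meta_data!r}")
    node_name, index_name, shard_id = parts
    return HotShard(node_name, index_name, int(shard_id))


hot = parse_meta_data("QRF4rBM7SNCDr1g3KU6HyA index9 0")
print(f"Shard {hot.shard_id} of index {hot.index_name} on node {hot.node_name} is hot.")
```

A named tuple keeps the three parts readable at the call site, which may be useful if the documentation ends up showing how to act on the hot shard.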
- Bulk requests: High heap allocation rate.
- Search requests: High CPU utilization.
- Complex queries: High CPU utilization and high heap allocation rate.
- Document updates: High CPU utilization and high heap allocation rate.
This is generic and not related to shard hotspots. Instead, we can add a note about cluster workloads that use the `_routing` parameter and/or custom document IDs, which map to a segment on disk. Repeated access to this segment (and, in turn, the OpenSearch shard) leads to more resource consumption, so the shard becomes a hot shard.
We can discuss the phrasing of this further.
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
- CPU utilization
- Heap allocation rate

Shards may become hot because of the nature of your workload. When you use a `_routing` parameter or a custom document ID, a specific shard within the cluster receives frequent updates, consuming more CPU and heap resources than other shards.
a specific shard
-> specific shard(s)
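The effect discussed in this thread can be sketched with a small simulation. It uses a simplified model in which a document's shard is chosen by hashing its routing value modulo the number of primary shards. The shard count, the routing values, and the use of `zlib.crc32` as the hash are all assumptions made for illustration; this is not the hash function OpenSearch uses.

```python
import zlib
from collections import Counter

NUM_PRIMARY_SHARDS = 5  # hypothetical index setting


def pick_shard(routing_value: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Pick a shard from a routing value under a simplified routing model."""
    # Stand-in hash: zlib.crc32 only keeps the sketch deterministic.
    # It is NOT the hash function OpenSearch uses.
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def shard_counts(routing_values) -> list:
    """Count how many documents land on each shard."""
    counts = Counter(pick_shard(value) for value in routing_values)
    return [counts.get(shard, 0) for shard in range(NUM_PRIMARY_SHARDS)]


# Each document is routed by its own unique ID, so writes spread out.
spread = shard_counts(f"doc-{i}" for i in range(10_000))

# Every document shares one custom routing value, so one shard takes them all.
skewed = shard_counts("tenant-a" for _ in range(10_000))

print("unique IDs:        ", spread)
print("one routing value: ", skewed)
```

With a single shared routing value, all 10,000 documents land on one shard, which is the kind of skew that can make one or more shards hot.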
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
LGTM! Thank you for making these changes.
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Looks great.
HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_metric | String | The definition of the resource_type. Either "cpu usage(num of cores)" or "heap alloc rate(heap alloc rate in bytes per second)".
HotClusterSummary.HotNodeSummary.HotResourceSummary.threshold | Float | The value that determines if a resource is contended.
HotClusterSummary.HotNodeSummary.HotResourceSummary.value | Float | The current value of the resource.
HotClusterSummary.HotNodeSummary.HotResourceSummary.time_period_seconds | Time | The amount of time a shard is monitored before its state was declared as healthy or unhealthy.
HotClusterSummary.HotNodeSummary.HotResourceSummary.time_period_seconds | Time | The amount of time a shard is monitored before its state was declared as healthy or unhealthy.
HotClusterSummary.HotNodeSummary.HotResourceSummary.time_period_seconds | Time | The amount of time a shard was monitored before its state was declared as healthy or unhealthy.
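As a rough illustration of how the `threshold` and `value` fields in this table relate, the following Python sketch flags a resource as contended. It assumes that a resource counts as contended when `value` is strictly greater than `threshold`, which the table does not state explicitly. The helper name `is_contended` and all numbers in `sample` are hypothetical and are not taken from a real RCA response.

```python
def is_contended(resource_summary: dict) -> bool:
    """Return True if the resource's value exceeds its threshold."""
    # Assumption: "contended" means value > threshold (strict comparison).
    value = float(resource_summary["value"])
    threshold = float(resource_summary["threshold"])
    return value > threshold


# Hypothetical summary that reuses field names from the response field table.
# The numbers are made up for illustration only.
sample = {
    "resource_type": "cpu usage",
    "threshold": 0.027,
    "value": 0.034,
    "time_period_seconds": 600,
}

if is_contended(sample):
    print(f"{sample['resource_type']} is above its threshold")
```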
nav_order: 30
---

## Hot shard identification
Do you want to begin this topic with a heading level 1? Consider that "Response fields" further down is the same heading level.
Thank you! Changed.
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Co-authored-by: Melissa Vagi <vagimeli@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
@kolchfa-aws Just a few minor changes. Thanks!
Co-authored-by: Nathan Bower <nbower@amazon.com> Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Shard Hotspot RCA
  Signed-off-by: ariamarble <armarble@amazon.com>
* small update
  Signed-off-by: ariamarble <armarble@amazon.com>
* API and content update
  Signed-off-by: ariamarble <armarble@amazon.com>
* additional updates
  Signed-off-by: ariamarble <armarble@amazon.com>
* Apply suggestions from doc review
  Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* further doc review changes
  Signed-off-by: ariamarble <armarble@amazon.com>
* Apply suggestions from doc review
  Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Added response field table
  Signed-off-by: ariamarble <armarble@amazon.com>
* Add more details
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Implemented tech review comments
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Last tech review update
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Small change
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* small typo
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Implemented doc review comments
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* Update _monitoring-your-cluster/pa/rca/shard-hotspot.md
  Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: ariamarble <armarble@amazon.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Fanit Kolchina <kolchfa@amazon.com>
Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
Description
Adds documentation for Shard Hotspot Identification RCA
Issues Resolved
fixes #3635
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.