Add documentation for Shard Hotspot Identification RCA #3741

Merged
merged 18 commits into from
May 2, 2023
105 changes: 105 additions & 0 deletions _monitoring-your-cluster/pa/rca/shard-hotspot.md
@@ -0,0 +1,105 @@
---
layout: default
title: Hot shard identification
parent: Root Cause Analysis
grand_parent: Performance Analyzer
nav_order: 30
---

# Hot shard identification

Hot shard identification root cause analysis (RCA) lets you identify a hot shard within an index. A hot shard is an outlier that consumes more resources than other shards and may lead to poor indexing and search performance. The hot shard identification RCA monitors the following metrics:

- CPU utilization
- Heap allocation rate

Shards may become hot because of the nature of your workload. When you use a `_routing` parameter or a custom document ID, a specific shard or several shards within the cluster may receive more frequent updates and consume more CPU and heap resources than other shards.

The hot shard identification RCA compares the CPU utilization and heap allocation rates against their threshold values. If the usage for either metric is greater than the threshold, the shard is considered to be _hot_.
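
The comparison described above can be sketched in a few lines of Python. This is only an illustration of the "either metric greater than its threshold" rule, not the RCA's actual implementation: the `ShardUsage` class, the `is_hot` function, and the way thresholds are passed in are hypothetical names chosen for this example. The sample numbers are borrowed from the example response on this page.

```python
from dataclasses import dataclass


@dataclass
class ShardUsage:
    """Resource usage observed for one shard (hypothetical structure)."""
    cpu_utilization: float  # CPU usage, in cores
    heap_alloc_rate: float  # heap allocation rate, in bytes per second


def is_hot(usage: ShardUsage, cpu_threshold: float, heap_threshold: float) -> bool:
    """Return True if either metric is greater than its threshold."""
    return (usage.cpu_utilization > cpu_threshold
            or usage.heap_alloc_rate > heap_threshold)


# Thresholds and values borrowed from the example response on this page.
CPU_THRESHOLD = 0.027397981341796683
HEAP_THRESHOLD = 7605441.367010161

shard = ShardUsage(cpu_utilization=0.034449630200405396,
                   heap_alloc_rate=10872119.748328414)
print(is_hot(shard, CPU_THRESHOLD, HEAP_THRESHOLD))
```

Because the rule is an "or", a shard that exceeds only one of the two thresholds is still reported as hot.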

For more information about the hot shard identification RCA implementation, see [Hot Shard RCA](https://github.com/opensearch-project/performance-analyzer-rca/blob/main/src/main/java/org/opensearch/performanceanalyzer/rca/store/rca/hotshard/docs/README.md).

#### Example request

The following query requests hot shard identification:

```bash
GET _plugins/_performanceanalyzer/rca?name=HotShardClusterRca
```
{% include copy-curl.html %}
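
For scripted use, the same request could be issued from Python with only the standard library. This is a rough sketch under stated assumptions: the host `localhost`, port `9600`, and plain HTTP are assumptions about a local Performance Analyzer setup and may differ in your cluster, and the helper names `build_rca_url` and `fetch_rca` are hypothetical.

```python
import json
from urllib.parse import urlencode, urlunsplit
from urllib.request import urlopen


def build_rca_url(host: str = "localhost", port: int = 9600,
                  rca_name: str = "HotShardClusterRca") -> str:
    """Build the RCA request URL (host, port, and scheme are assumptions)."""
    query = urlencode({"name": rca_name})
    return urlunsplit(("http", f"{host}:{port}",
                       "/_plugins/_performanceanalyzer/rca", query, ""))


def fetch_rca(url: str, timeout: float = 10.0) -> dict:
    """Send the GET request and decode the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Only the URL is built here; call fetch_rca(url) against a running cluster.
url = build_rca_url()
print(url)
```

Keeping URL construction separate from the network call makes it easy to point the same code at a different node or RCA name.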

#### Example response

The response contains a list of unhealthy shards:

```json
{
  "HotShardClusterRca": [{
    "rca_name": "HotShardClusterRca",
    "timestamp": 1680721367563,
    "state": "unhealthy",
    "HotClusterSummary": [
      {
        "number_of_nodes": 3,
        "number_of_unhealthy_nodes": 1,
        "HotNodeSummary": [
          {
            "node_id": "7kosAbpASsqBoHmHkVXxmw",
            "host_address": "192.168.80.4",
            "HotResourceSummary": [
              {
                "resource_type": "cpu usage",
                "resource_metric": "cpu usage(num of cores)",
                "threshold": 0.027397981341796683,
                "value": 0.034449630200405396,
                "time_period_seconds": 60,
                "meta_data": "ssZw1WRUSHS5DZCW73BOJQ index9 4"
              },
              {
                "resource_type": "heap",
                "resource_metric": "heap alloc rate(heap alloc rate in bytes per second)",
                "threshold": 7605441.367010161,
                "value": 10872119.748328414,
                "time_period_seconds": 60,
                "meta_data": "ssZw1WRUSHS5DZCW73BOJQ index9 4"
              },
              {
                "resource_type": "heap",
                "resource_metric": "heap alloc rate(heap alloc rate in bytes per second)",
                "threshold": 7605441.367010161,
                "value": 8019622.354388569,
                "time_period_seconds": 60,
                "meta_data": "QRF4rBM7SNCDr1g3KU6HyA index9 0"
              }
            ]
          }
        ]
      }
    ]
  }]
}
```

Review comment (Collaborator): I would add a table describing the response fields, even though they are pretty self-explanatory. The only things that are not apparent are:

- What other values are possible for `state`?
- If the state is "healthy", is `HotClusterSummary` populated, empty, or omitted altogether?
- Is the time period fixed or configurable?

Reply (Contributor Author): I've put out a request for some further clarification. In the meantime, I'll write up a table.

## Response fields

The following table lists the response fields.

Field | Type | Description
:--- | :--- | :---
rca_name | String | The name of the RCA. In this case, "HotShardClusterRca".
timestamp | Integer | The timestamp of the RCA, in milliseconds since the Unix epoch.
state | String | The state of the cluster determined by the RCA. The `state` can be `healthy`, `unhealthy`, or `unknown`.
HotClusterSummary.number_of_nodes | Integer | The number of nodes in the cluster.
HotClusterSummary.number_of_unhealthy_nodes | Integer | The number of nodes found to be in an `unhealthy` state.
HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_type | String | The type of resource causing the unhealthy state, either "cpu usage" or "heap".
HotClusterSummary.HotNodeSummary.HotResourceSummary.resource_metric | String | The definition of the `resource_type`. Either "cpu usage(num of cores)" or "heap alloc rate(heap alloc rate in bytes per second)".
HotClusterSummary.HotNodeSummary.HotResourceSummary.threshold | Float | The value that determines whether a resource is contended.
HotClusterSummary.HotNodeSummary.HotResourceSummary.value | Float | The current value of the resource.
HotClusterSummary.HotNodeSummary.HotResourceSummary.time_period_seconds | Integer | The amount of time, in seconds, that a shard was monitored before its state was declared to be healthy or unhealthy.
HotClusterSummary.HotNodeSummary.HotResourceSummary.meta_data | String | The metadata associated with the `resource_type`.

Review comment: `cpu usage(num of cores)` -> cpu usage ratio. This can be linked to https://opensearch.org/docs/latest/monitoring-your-cluster/pa/reference/

In the preceding example response, the last `meta_data` value is `QRF4rBM7SNCDr1g3KU6HyA index9 0`. The `meta_data` string consists of three space-separated fields:

- Node name: `QRF4rBM7SNCDr1g3KU6HyA`
- Index name: `index9`
- Shard ID: `0`

This means that shard `0` of index `index9` on node `QRF4rBM7SNCDr1g3KU6HyA` is hot.
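
As a rough illustration of how a client might extract this information, the following Python sketch walks the nested summaries and splits each `meta_data` string into its three fields. It assumes the response body is a JSON object keyed by the RCA name and that the fields are separated by single spaces, as in the example. The `HotShard` type and `iter_hot_shards` function are hypothetical names, and the sample data is trimmed from the example response.

```python
from typing import Iterator, NamedTuple


class HotShard(NamedTuple):
    """One hot-shard entry extracted from the response (hypothetical type)."""
    node: str
    index: str
    shard_id: int
    resource_type: str
    value: float
    threshold: float


def iter_hot_shards(response: dict,
                    rca_name: str = "HotShardClusterRca") -> Iterator[HotShard]:
    """Yield one HotShard per HotResourceSummary entry in the response."""
    for rca in response.get(rca_name, []):
        for cluster in rca.get("HotClusterSummary", []):
            for node in cluster.get("HotNodeSummary", []):
                for res in node.get("HotResourceSummary", []):
                    # meta_data is assumed to be "<node> <index> <shard ID>".
                    node_name, index, shard = res["meta_data"].split(" ")
                    yield HotShard(node_name, index, int(shard),
                                   res["resource_type"],
                                   res["value"], res["threshold"])


# Trimmed copy of the example response, kept to a single resource summary.
sample = {"HotShardClusterRca": [{
    "rca_name": "HotShardClusterRca",
    "state": "unhealthy",
    "HotClusterSummary": [{
        "number_of_nodes": 3,
        "number_of_unhealthy_nodes": 1,
        "HotNodeSummary": [{
            "node_id": "7kosAbpASsqBoHmHkVXxmw",
            "host_address": "192.168.80.4",
            "HotResourceSummary": [{
                "resource_type": "heap",
                "threshold": 7605441.367010161,
                "value": 8019622.354388569,
                "time_period_seconds": 60,
                "meta_data": "QRF4rBM7SNCDr1g3KU6HyA index9 0",
            }],
        }],
    }],
}]}

for hot in iter_hot_shards(sample):
    print(f"shard {hot.shard_id} of {hot.index} on {hot.node} is hot ({hot.resource_type})")
```

Using `.get(..., [])` at each level means a response without a `HotClusterSummary` simply yields no entries rather than raising an error.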