[Meta] Shard level Indexing Back-Pressure #478

Closed
getsaurabh02 opened this issue Apr 1, 2021 · 2 comments
Labels
enhancement Enhancement or improvement to existing feature or request Meta Meta issue, not directly linked to a PR

Comments

@getsaurabh02
Member

getsaurabh02 commented Apr 1, 2021

Is your feature request related to a problem? Please describe.
Elasticsearch today provides a few gating mechanisms to protect a node under duress, namely queue rejections and circuit breakers. However, queue sizes are fixed and isolated, and do not effectively represent the total work to be done. Similarly, circuit breakers act as a last line of defence, mostly trigger too late, and do not offer fairness. These gaps create availability issues for the cluster when it is under duress due to hardware failures, node performance degradation, or traffic bursts.

Indexing Pressure today addresses this to some extent by rejecting indexing requests based on hard-coded node-level limits. However, there is a need for a smarter rejection mechanism at the shard level, for cases where too many stuck or slow indexing requests breach key performance thresholds (such as throughput). This can prevent the cluster from running into cascading failures.

Describe the solution you'd like
With shard level indexing pressure we want to improve the current Indexing Pressure framework, which performs memory accounting at the node level and rejects requests accordingly. We aim to go a step further and base rejections on memory accounting at the shard level, along with other key performance factors such as throughput and the time of the last successful request. This can be called ShardIndexingPressure.
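To make the idea of shard-level memory accounting concrete, here is a minimal sketch in Python. It is not the OpenSearch implementation: the class names, the role names and the string shard ids are all assumptions made for illustration. It only shows in-flight bytes being tracked per shard and per role, with the node-level figure derived as the sum over shards.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical role names; the issue only says coordinator, primary and replica.
ROLES = ("coordinating", "primary", "replica")


@dataclass
class ShardTracker:
    """In-flight indexing bytes for one shard, split by role (illustrative model)."""
    current_bytes: dict = field(default_factory=lambda: {role: 0 for role in ROLES})


class ShardAccounting:
    """Tracks in-flight bytes per shard; node-level usage is the sum over shards."""

    def __init__(self):
        self._shards = defaultdict(ShardTracker)

    def start(self, shard_id, role, size_bytes):
        # Called when an indexing request for this shard begins.
        self._shards[shard_id].current_bytes[role] += size_bytes

    def finish(self, shard_id, role, size_bytes):
        # Called when the request completes (successfully or not).
        self._shards[shard_id].current_bytes[role] -= size_bytes

    def shard_bytes(self, shard_id, role):
        return self._shards[shard_id].current_bytes[role]

    def node_bytes(self, role):
        return sum(t.current_bytes[role] for t in self._shards.values())
```

With per-shard figures available, a rejection decision can target the shard that is holding memory rather than every request arriving at the node.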

Key features to be covered

  • Granular tracking of indexing task performance at the shard level, for each node role, i.e. coordinator, primary and replica.
  • Smarter rejections by discarding only the requests intended for a problematic index or shard, while still allowing others to continue (fairness in rejection).
  • Rejection thresholds governed by a combination of configurable parameters (such as memory limits on the node) and dynamic parameters (such as latency increase, throughput degradation).
  • Node level and shard level indexing pressure statistics exposed through the stats API.
  • Integration of indexing pressure stats with plugins for metric visibility and auto-tuning in future.
  • Control knobs to tune the key performance thresholds which control rejections, to address any specific requirements or issues.
  • Control knobs to run the feature in Shadow-Mode or Enforced-Mode. In Shadow-Mode only internal rejection breakdown metrics are published; no actual rejections are performed.

Additional context
Shard Indexing Pressure will be available in two phases, i.e. Shadow and Enforced, via dynamic ES settings as below:
Enabled - To turn the shard indexing pressure feature on and off (default off initially).
Enforced - To run the feature in Shadow or Enforced mode when Enabled. In Shadow mode there are no rejections, but metrics are published. In Enforced mode, requests are actually rejected in the cluster.
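The interaction of the two switches can be sketched as follows. This is an illustrative Python model only: `Mode`, `Outcome` and `apply_decision` are hypothetical names, and the single shadow counter stands in for whatever rejection breakdown metrics the feature actually publishes.

```python
from dataclasses import dataclass


@dataclass
class Mode:
    enabled: bool = False   # feature off by default, as described above
    enforced: bool = False  # False means Shadow mode when enabled


@dataclass
class Outcome:
    rejected: bool          # whether the request is actually rejected
    shadow_rejections: int  # running count of "would have rejected" decisions


def apply_decision(mode, would_reject, shadow_rejections=0):
    """Turn a rejection verdict into an action according to the two switches."""
    if not mode.enabled or not would_reject:
        # Feature disabled, or the request passed all checks: nothing to record.
        return Outcome(False, shadow_rejections)
    if mode.enforced:
        # Enforced mode: the request is really rejected.
        return Outcome(True, shadow_rejections)
    # Shadow mode: let the request through, only count the would-be rejection.
    return Outcome(False, shadow_rejections + 1)
```

Running in Shadow mode first lets operators compare the would-be rejection counts against real traffic before any request is turned away.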

Rejection Criteria - Three broad criteria for rejections, as below:
Node Limit - Acts as the last line of defence if the combined utilisation of all the shards reaches the assigned node limit, e.g. 10% of heap. This effectively indicates that a shard is unable to take in more traffic on the node.
Throughput degradation - Detects any hardware or software issue resulting in performance degradation. When node level occupancy is already breaching its soft limit and request turnaround at the shard level is steadily deteriorating, additional requests are rejected until the system recovers.
Last Successful request - Addresses stuck requests or black hole scenarios. When node level occupancy is already breaching the soft limit and a shard has multiple outstanding requests while no requests are completing, then beyond a threshold additional requests are actively discarded until the system recovers.
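The three criteria can be read as one ordered check. The sketch below is an assumption-laden illustration in Python, not the merged implementation: all field names, the default threshold values, the way throughput degradation is measured (recent versus baseline throughput) and the order in which the two secondary checks run are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ShardState:
    outstanding_requests: int          # requests started but not yet completed
    seconds_since_last_success: float  # time since a request last completed
    recent_throughput: float           # e.g. bytes/s over a short recent window
    baseline_throughput: float         # e.g. bytes/s over a longer history


@dataclass
class Limits:
    node_limit_bytes: int              # hard limit, e.g. 10% of heap
    node_soft_limit_bytes: int         # secondary checks apply above this
    degradation_factor: float = 2.0    # hypothetical: "twice as slow as baseline"
    stuck_after_seconds: float = 30.0  # hypothetical stuck-request threshold
    min_outstanding: int = 10          # hypothetical "multiple outstanding" bound


def rejection_reason(node_bytes, request_bytes, shard, limits):
    """Return the name of the criterion that rejects the request, or None."""
    projected = node_bytes + request_bytes
    if projected > limits.node_limit_bytes:
        return "node_limit"
    if projected <= limits.node_soft_limit_bytes:
        return None  # below the soft limit the secondary checks do not apply
    stuck = (shard.outstanding_requests >= limits.min_outstanding
             and shard.seconds_since_last_success > limits.stuck_after_seconds)
    if stuck:
        return "last_successful_request"
    if shard.recent_throughput * limits.degradation_factor < shard.baseline_throughput:
        return "throughput_degradation"
    return None
```

Because the secondary checks look at per-shard state, a slow or stuck shard is rejected above the soft limit while a healthy shard on the same node keeps accepting requests until the node limit itself is reached.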

All the thresholds and key parameters will be available in the form of dynamic ES settings for real-time tuning.
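As a rough picture of what real-time tuning implies, the following Python sketch holds thresholds in a small thread-safe registry that can be updated while requests are being evaluated. The class and the threshold keys are hypothetical and do not correspond to actual setting names; in the cluster the values would be registered as dynamic settings rather than stored in a plain dictionary.

```python
import threading


class DynamicThresholds:
    """Thread-safe holder for thresholds that may change at runtime (illustrative)."""

    def __init__(self, defaults):
        self._lock = threading.Lock()
        self._values = dict(defaults)

    def get(self, key):
        with self._lock:
            return self._values[key]

    def update(self, changes):
        """Apply all changes atomically; reject the whole update on an unknown key."""
        with self._lock:
            unknown = set(changes) - set(self._values)
            if unknown:
                raise KeyError(f"unknown thresholds: {sorted(unknown)}")
            self._values.update(changes)
```

Readers call `get` on every decision, so an operator's update takes effect on the next request without a restart.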

Feature Branch: 478_indexBackPressure

@getsaurabh02 getsaurabh02 added the enhancement Enhancement or improvement to existing feature or request label Apr 1, 2021
getsaurabh02 referenced this issue in getsaurabh02/OpenSearch-backup Apr 1, 2021
… based on key performance thresholds. (#478)

Signed-off-by: Saurabh Singh <sisurab@amazon.com>
getsaurabh02 referenced this issue in getsaurabh02/OpenSearch-backup Apr 5, 2021
Signed-off-by: Saurabh Singh <sisurab@amazon.com>
getsaurabh02 referenced this issue in getsaurabh02/OpenSearch-backup Apr 5, 2021
…ration. (#478)

Signed-off-by: Saurabh Singh <sisurab@amazon.com>
getsaurabh02 referenced this issue in getsaurabh02/OpenSearch-backup Apr 5, 2021
Signed-off-by: Saurabh Singh <sisurab@amazon.com>
getsaurabh02 referenced this issue in getsaurabh02/OpenSearch-backup Apr 5, 2021
Signed-off-by: Saurabh Singh <sisurab@amazon.com>
@getsaurabh02 getsaurabh02 changed the title Shard level Indexing Back-Pressure [Meta] Shard level Indexing Back-Pressure Apr 5, 2021
@getsaurabh02
Member Author

getsaurabh02 commented Apr 5, 2021

Breaking the changes further into more granular PRs as below, to help the reviewers go through the changes logically.

Below are the initial PRs for the IndexingPressure changes, which are now being broken up and merged into the feature branch. Keeping them here for discussion traceability.

Below is the list of items for future follow-ups on the backpressure changes:

  • Doc update once we have one.
  • Generic way to expose metrics to plugins.
  • Revisit modelling ShardIndexingPressureTracker with more concrete info such as Roles. (Issue to be created for discussion) (Add Shard Indexing Pressure Tracker. (#478) #717)
  • With the remodelling of ShardIndexingPressureTracker, there is an opportunity to break the core logic of ShardIndexingPressure for reusability.
  • Evaluate and refactor shard store cleanup strategy.

getsaurabh02 referenced this issue in getsaurabh02/OpenSearch-backup Apr 8, 2021
Signed-off-by: Saurabh Singh <sisurab@amazon.com>
psychbot pushed a commit to getsaurabh02/OpenSearch that referenced this issue Apr 19, 2021
@nknize nknize added Meta Meta issue, not directly linked to a PR v2.0.0 Version 2.0.0 v1.0.0 Version 1.0.0 labels May 12, 2021
getsaurabh02 added a commit to getsaurabh02/OpenSearch that referenced this issue May 17, 2021
Signed-off-by: Saurabh Singh <sisurab@amazon.com>
getsaurabh02 added a commit to getsaurabh02/OpenSearch that referenced this issue May 17, 2021
Signed-off-by: Saurabh Singh <sisurab@amazon.com>
getsaurabh02 added a commit to getsaurabh02/OpenSearch that referenced this issue May 17, 2021
Signed-off-by: Saurabh Singh <sisurab@amazon.com>
adnapibar pushed a commit that referenced this issue Sep 15, 2021
Signed-off-by: Saurabh Singh <sisurab@amazon.com>
adnapibar pushed a commit that referenced this issue Sep 15, 2021
Signed-off-by: Saurabh Singh <sisurab@amazon.com>
adnapibar pushed a commit that referenced this issue Sep 15, 2021
Signed-off-by: Saurabh Singh <sisurab@amazon.com>
adnapibar pushed a commit that referenced this issue Sep 15, 2021
* Add Shard Indexing Pressure Store (#478)

Signed-off-by: Saurabh Singh <sisurab@amazon.com>

* Added comments and shard allocation based on compute in hot store.

Signed-off-by: Saurabh Singh <sisurab@amazon.com>

Co-authored-by: Saurabh Singh <sisurab@amazon.com>
adnapibar pushed a commit that referenced this issue Sep 15, 2021
It introduces a Memory Manager for Shard Indexing Pressure. It is responsible for increasing and decreasing the allocated shard limit based on incoming requests, and validate the current values against the thresholds.

Signed-off-by: Saurabh Singh <sisurab@amazon.com>
@getsaurabh02 getsaurabh02 added this to OpenSearch 1.2 (November 9) in OpenSearch Project Roadmap Sep 29, 2021
@CEHENKLE CEHENKLE removed the v2.0.0 Version 2.0.0 label Sep 29, 2021
getsaurabh02 added a commit to getsaurabh02/OpenSearch that referenced this issue Oct 6, 2021
getsaurabh02 added a commit to getsaurabh02/OpenSearch that referenced this issue Oct 6, 2021
getsaurabh02 added a commit to getsaurabh02/OpenSearch that referenced this issue Oct 6, 2021
getsaurabh02 added a commit to getsaurabh02/OpenSearch that referenced this issue Oct 6, 2021
…h-project#838)

* Add Shard Indexing Pressure Store (opensearch-project#478)

Signed-off-by: Saurabh Singh <sisurab@amazon.com>

* Added comments and shard allocation based on compute in hot store.

Signed-off-by: Saurabh Singh <sisurab@amazon.com>

Co-authored-by: Saurabh Singh <sisurab@amazon.com>
getsaurabh02 added a commit to getsaurabh02/OpenSearch that referenced this issue Oct 6, 2021
…pensearch-project#945)

It introduces a Memory Manager for Shard Indexing Pressure. It is responsible for increasing and decreasing the allocated shard limit based on incoming requests, and validate the current values against the thresholds.

Signed-off-by: Saurabh Singh <sisurab@amazon.com>
@jcgraybill jcgraybill removed this from OpenSearch 1.2 (November 16) in OpenSearch Project Roadmap Nov 2, 2021
@getsaurabh02
Member Author

Closing this issue as changes are already merged as part of #1336


3 participants