Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a counter to node stat api to track shard going from idle to non-idle #12768

Merged
merged 6 commits into from
Mar 26, 2024

Conversation

ruai0511
Copy link
Contributor

@ruai0511 ruai0511 commented Mar 19, 2024

Description

Shards automatically refresh every second, but when a shard doesn't receive search requests for over 30 seconds, it goes into an idle state to improve performance by suspending the implicit index refresh (More information on search idle feature here). However, this introduces a problem: After a shard does idle, the next search request must force a refresh to reflect the latest data. This extra step increases latency.

We want to monitor how often idle shards are reactivated. This PR introduces a counter called search_idle_waken_up_total and exports it in the node stat api.

Related Issues

Resolves #12678

Check List

  • New functionality includes testing.
  • All tests pass
  • New functionality has been documented.
  • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Ruirui Zhang <mariazrr@amazon.com>
Signed-off-by: Ruirui Zhang <mariazrr@amazon.com>
Copy link
Contributor

github-actions bot commented Mar 19, 2024

Compatibility status:

Checks if related components are compatible with change 5d47d6b

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/performance-analyzer.git]

Copy link
Contributor

✅ Gradle check result for 3a13164: SUCCESS

@ruai0511
Copy link
Contributor Author

Good point. I'll address it in a later PR

I think it will make it easier if you should address it all in the same PR because you will need the same version checks for those APIs and it will make backporting a lot easier if you only have 1 PR to backport instead of 2.

Ah you're right about backport! Thanks I'll look into it

Signed-off-by: Ruirui Zhang <mariazrr@amazon.com>
Copy link
Contributor

❌ Gradle check result for 919e4e2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for ef646c6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for c5ee63f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for 8ec3c12: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for 2d9bb3d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@jainankitk
Copy link
Contributor

@ruai0511 - Thank you for this PR. I am wondering if we should also include whether refresh triggered or not as part of search request into coordinator slow logs. That will prevent us from guessing whether refresh happened or not causing the query to run slower.

@msfroh
Copy link
Collaborator

msfroh commented Mar 22, 2024

Do you want to add this metric to the tabular (_cat) APIs as well?

For example in server/src/main/java/org/opensearch/rest/action/cat/RestIndicesAction.java (and the corresponding _cat/nodes, _cat/shards, etc?)

This will probably require changes to the rest spec tests as well. You can use #9622 as a reference.

If you choose to add it to a _cat API (probably _cat/shards, since it's a shard-level metric), please don't add it to the default set of columns. While _node/stats is very verbose (and more verbose with every release), the default set of columns from _cat APIs tend to be limited to the most important values to get a summary of the given resource.

Signed-off-by: Ruirui Zhang <mariazrr@amazon.com>
Copy link
Contributor

❌ Gradle check result for 49ebf20: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Ruirui Zhang <mariazrr@amazon.com>
Signed-off-by: Ruirui Zhang <mariazrr@amazon.com>
Copy link
Contributor

❌ Gradle check result for d2a686e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for 5172c09: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❕ Gradle check result for 5d47d6b: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.cluster.allocation.ClusterRerouteIT.testDelayWithALargeAmountOfShards

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Copy link
Collaborator

@msfroh msfroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank, @ruai0511! Looks good!

I'll add the backport 2.x label, but we'll need to update the PR to change the version check for the new field.

Can you please prepare a PR to update the version check from 3.0 to 2.14?

@msfroh msfroh added the backport 2.x Backport to 2.x branch label Mar 26, 2024
@msfroh msfroh merged commit f2cc3d8 into opensearch-project:main Mar 26, 2024
33 of 34 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-12768-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 f2cc3d8ea314b2d49a8b6ea5f57de5b6aff4faf9
# Push it to GitHub
git push --set-upstream origin backport/backport-12768-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-12768-to-2.x.

ruai0511 added a commit to ruai0511/OpenSearch that referenced this pull request Mar 28, 2024
… from idle to non-idle (opensearch-project#12768)

Signed-off-by: Ruirui Zhang <mariazrr@amazon.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
… from idle to non-idle (opensearch-project#12768)

---------

Signed-off-by: Ruirui Zhang <mariazrr@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
harshavamsi pushed a commit to harshavamsi/OpenSearch that referenced this pull request Apr 29, 2024
… from idle to non-idle (opensearch-project#12768)



---------

Signed-off-by: Ruirui Zhang <mariazrr@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch backport-failed enhancement Enhancement or improvement to existing feature or request Search Search query, autocomplete ...etc
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request] Add metrics to track latency issue due to search idle feature
5 participants