Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request level coordinator slow logs #10650

Merged
merged 1 commit into from Nov 16, 2023

Conversation

dzane17
Copy link
Contributor

@dzane17 dzane17 commented Oct 16, 2023

Related Issues

Resolves #9642
Documentation PR: opensearch-project/documentation-website#5298

Description

Slow logs at coordinator level: As of now, we only have the capability to enable slow logs at a shard level for desired search phase(query and fetch). See this. Setting this threshold is tricky when customer usually sees latency spikes at a request level. Plus shard level slow logs doesn't offer a holistic view. So as part of this, we will also add capabilities to capture slow logs at a request level along with different search phases from coordinator node perspective.

Coordinator slow logs are governed by cluster settings.

// Setting on a whole request level
cluster.search.request.slowlog.threshold.warn: 10s
cluster.search.request.slowlog.threshold.info: 5s
cluster.search.request.slowlog.threshold.debug: 2s
cluster.search.request.slowlog.threshold.trace: 500ms

// Minimum level to print
cluster.search.request.slowlog.level: "trace"

Sample SearchRequestSlowLog:

% cat runTask_index_search_slowlog.log
[2023-10-27T11:38:06,741][TRACE][c.s.r.slowlog] [runTask-0] took[119.5ms], took_millis[119], phase_took_millis[{expand=0, query=76, fetch=24}], total_hits[1 hits], search_type[QUERY_THEN_FETCH], shards[{total: 10, successful: 10, skipped: 0, failed: 0}], source[{"query":{"query_string":{"query":"value1","fields":[],"type":"best_fields","default_operator":"or","max_determinized_states":10000,"enable_position_increments":true,"fuzziness":"AUTO","fuzzy_prefix_length":0,"fuzzy_max_expansions":50,"phrase_slop":0,"escape":false,"auto_generate_synonyms_phrase_query":true,"fuzzy_transpositions":true,"boost":1.0}}}], id[]
% cat runTask_index_search_slowlog.json | jq
{
  "type": "search_request_slowlog",
  "timestamp": "2023-10-27T11:38:06,741-07:00",
  "level": "TRACE",
  "component": "c.s.r.slowlog",
  "cluster.name": "runTask",
  "node.name": "runTask-0",
  "took": "119.5ms",
  "took_millis": "119",
  "phase_took": "{expand=0, query=76, fetch=24}",
  "total_hits": "1 hits",
  "search_type": "QUERY_THEN_FETCH",
  "shards": "{total: 10, successful: 10, skipped: 0, failed: 0}",
  "source": "{\"query\":{\"query_string\":{\"query\":\"value1\",\"fields\":[],\"type\":\"best_fields\",\"default_operator\":\"or\",\"max_determinized_states\":10000,\"enable_position_increments\":true,\"fuzziness\":\"AUTO\",\"fuzzy_prefix_length\":0,\"fuzzy_max_expansions\":50,\"phrase_slop\":0,\"escape\":false,\"auto_generate_synonyms_phrase_query\":true,\"fuzzy_transpositions\":true,\"boost\":1.0}}}",
  "cluster.uuid": "iH0dDisEQFar-1ZyfD_vjw",
  "node.id": "AR_pfXWRSm-di2Ru8aaruQ"
}

Thresholds are set based on total search duration, but logs include phase-level breakdown.

Testing

  1. Launch local OpenSearch server with debugger
    https://github.com/opensearch-project/OpenSearch/blob/main/TESTING.md#launching-and-debugging-from-an-ide
./gradlew run --debug-jvm
  1. Create an index and ingest some data
curl -X PUT "localhost:9200/test2?pretty" -H 'Content-Type: application/json' -d' 
{                                                
  "settings": {
    "number_of_shards": 10
  }
}'

curl -X POST "localhost:9200/_bulk?pretty" -H 'Content-Type: application/json' -d'
{ "index" : { "_index" : "test2", "_id" : "1" } }
{ "field1" : "value1" }                          
{ "index" : { "_index" : "test2", "_id" : "2" } }
{ "field2" : "value2" }
{ "index" : { "_index" : "test2", "_id" : "3" } }
{ "field3" : "value3" }
{ "index" : { "_index" : "test2", "_id" : "4" } }
{ "field4" : "value4" }
'
  1. Update and verify cluster setting
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{"transient" : {"cluster.search.request.slowlog.threshold.trace" : "0ms"}}'
curl -X GET "localhost:9200/_cluster/settings?pretty&flat_settings"
  1. Search request
curl -XGET 'localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d'
{
 "query": { "query_string": { "query": "abc" } }
}'
  1. Check output in OpenSearch server tab

Performance Tests

x% increase in P50, P90, P99 latency observed when logging all requests.
No significant impact to P50, P90, P99 latency observed when logging no requests (all searches under log threshold).

Procedure:

  1. Spin up localhost OpenSearch cluster
./gradlew run
  1. Update slowlog threshold cluster setting
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{"transient" : {"cluster.search.request.slowlog.threshold.trace" : "0ms"}}'
curl -X GET "localhost:9200/_cluster/settings?pretty&flat_settings"
  1. Run opensearch-benchmark nyc_taxis workload
opensearch-benchmark execute-test --workload-path=./nyc_taxis --pipeline=benchmark-only --client-options="basic_auth_user:'admin',basic_auth_password:'Admin_123'" --results-file=/tmp/output.txt

Results

Main 50th percentile latency 90th percentile latency 99th percentile latency
1 6.45966 7.16774 12.2049
2 7.14608 8.80989 14.7602
3 6.23131 9.04382 15.4128
4 6.7432 8.19057 11.8496
5 6.92057 8.19506 9.92856
6 6.01713 7.73637 10.7374
7 6.21754 7.18561 8.82464
8 7.15195 8.22125 12.5205
9 5.51498 7.36984 9.75166
10 7.6733 8.25051 11.0328
Mean 6.60757 8.01707 11.70231
St Dev 0.64087 0.64559 2.12586
Slowlog Off (-1 threshold) 50th percentile latency 90th percentile latency 99th percentile latency
1 7.14947 8.37281 12.2798
2 7.52558 8.81152 15.9566
3 6.53822 7.56006 10.7598
4 8.16498 8.82595 11.5625
5 7.20736 8.58643 10.4346
6 6.43303 8.07587 8.72964
7 6.73321 8.41968 9.49111
8 6.72711 8.32944 14.8351
9 6.53244 7.71201 16.0138
10 6.46849 7.6002 9.50796
Mean 6.94799 8.2294 11.95709
St Dev 0.56304 0.47424 2.73339
Slowlog On (0ms thresold) 50th percentile latency 90th percentile latency 99th percentile latency
1 7.69207 9.87628 13.73330
2 6.76473 7.63000 9.31518
3 6.28997 9.05151 16.19130
4 6.81525 8.31300 17.52860
5 7.74137 8.48340 10.63630
6 6.77349 8.65474 11.93540
7 7.31905 8.94366 15.39660
8 7.78096 8.90685 11.84000
9 6.36214 8.52370 15.33980
10 6.56414 8.96694 12.39780
Mean 7.01032 8.73501 13.43143
St Dev 0.57552 0.58148 2.63721

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Oct 16, 2023

Compatibility status:

Checks if related components are compatible with change 51371e9

Incompatible components

Incompatible components: [https://github.com/opensearch-project/performance-analyzer.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/performance-analyzer-rca.git]

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request feature New feature or request Search Search query, autocomplete ...etc v2.11.0 Issues and PRs related to version 2.11.0 labels Oct 16, 2023
@dzane17 dzane17 marked this pull request as ready for review October 16, 2023 22:43
Copy link
Contributor

❌ Gradle check result for ccc1cd9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for de00a07: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@msfroh
Copy link
Collaborator

msfroh commented Nov 15, 2023

Looks like you have a merge conflict still.

The buildRewriteListener method still uses searchRequestOperationsListener:

> Task :server:compileJava FAILED
/var/jenkins/workspace/gradle-check/search/server/src/main/java/org/opensearch/action/search/TransportSearchAction.java:525: error: cannot find symbol
                searchRequestOperationsListener
                ^
  symbol:   variable searchRequestOperationsListener
  location: class TransportSearchAction
/var/jenkins/workspace/gradle-check/search/server/src/main/java/org/opensearch/action/search/TransportSearchAction.java:579: error: cannot find symbol
                    searchRequestContext
                    ^
  symbol:   variable searchRequestContext
  location: class TransportSearchAction
/var/jenkins/workspace/gradle-check/search/server/src/main/java/org/opensearch/action/search/TransportSearchAction.java:601: error: cannot find symbol
                            searchRequestContext
                            ^
  symbol:   variable searchRequestContext
  location: class TransportSearchAction
/var/jenkins/workspace/gradle-check/search/server/src/main/java/org/opensearch/action/search/TransportSearchAction.java:652: error: cannot find symbol
                                searchRequestContext

Copy link
Contributor

❌ Gradle check result for 0be52eb: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@dzane17 dzane17 force-pushed the slowlog-final branch 2 times, most recently from 4e6f50d to 2eac4cd Compare November 15, 2023 17:26
Copy link
Contributor

❌ Gradle check result for 4e6f50d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for ba36aa2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for 2eac4cd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: David Zane <davizane@amazon.com>
Copy link
Contributor

❕ Gradle check result for 51371e9: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.snapshots.CloneSnapshotIT.testClonePreventsSnapshotDelete
      1 org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=pit/10_basic/Delete all}

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@msfroh msfroh merged commit 37ec313 into opensearch-project:main Nov 16, 2023
29 checks passed
@msfroh msfroh added backport 2.x Backport to 2.x branch and removed v2.11.0 Issues and PRs related to version 2.11.0 labels Nov 16, 2023
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-10650-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 37ec31370ebb91b3af0223def14cb7999e6d554e
# Push it to GitHub
git push --set-upstream origin backport/backport-10650-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-10650-to-2.x.

@dzane17 dzane17 deleted the slowlog-final branch November 16, 2023 22:59
dzane17 added a commit to dzane17/OpenSearch that referenced this pull request Nov 17, 2023
Signed-off-by: David Zane <davizane@amazon.com>
(cherry picked from commit 37ec313)
fahadshamiinsta pushed a commit to fahadshamiinsta/OpenSearch270 that referenced this pull request Dec 4, 2023
Signed-off-by: David Zane <davizane@amazon.com>
rayshrey pushed a commit to rayshrey/OpenSearch that referenced this pull request Mar 18, 2024
Signed-off-by: David Zane <davizane@amazon.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
Signed-off-by: David Zane <davizane@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch backport-failed enhancement Enhancement or improvement to existing feature or request feature New feature or request Search Search query, autocomplete ...etc v2.12.0 Issues and PRs related to version 2.12.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Search Latency Tracking - Coordinator Slow Logs
4 participants