
[BUG][Search Backpressure] High Heap Usage Cancellation Due to High Node-Level CPU Utilization #13295

Open
ticheng-aws opened this issue Apr 19, 2024 · 5 comments
Labels
bug Something isn't working enhancement Enhancement or improvement to existing feature or request Search:Performance Search:Resiliency

Comments

@ticheng-aws
Contributor

ticheng-aws commented Apr 19, 2024

Describe the bug

With the current search backpressure cancellation logic, we've noticed that high-CPU search requests, such as multi-term aggregations, can be cancelled by the task-level heap usage settings even though the node still has sufficient heap memory to process the tasks.

Related component

Search:Resiliency

To Reproduce

Use the multi_term_agg operation in the http_logs workload; it is a representative high-CPU search request.

  1. Set up an OpenSearch cluster and an OpenSearch Benchmark client.
  2. Run the test with the multi_term_agg operation in the http_logs workload, gradually increasing the number of search clients, using the sample command below:
opensearch-benchmark execute-test --pipeline=benchmark-only --client-options='basic_auth_user:<USER>,basic_auth_password:<PASSWORD>,timeout:300' --target-hosts '<END_POINT>:443' --kill-running-processes --workload=http_logs --workload-param='target_throughput:none,number_of_replicas:0,number_of_shards:1,search_clients:2'
  3. Monitor the CPU utilization and JVM memory pressure of your OpenSearch cluster.
  4. Retrieve the cancellation count with the GET _nodes/stats/search_backpressure REST API (example commands below).
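
For steps 3 and 4, the node stats API can be queried directly. Illustrative commands, reusing the <USER>, <PASSWORD>, and <END_POINT> placeholders from the benchmark command above:

  curl -s -u <USER>:<PASSWORD> "https://<END_POINT>:443/_nodes/stats/os,jvm?pretty"
  curl -s -u <USER>:<PASSWORD> "https://<END_POINT>:443/_nodes/stats/search_backpressure?pretty"

The first call reports node-level CPU utilization and JVM memory pressure; the second reports the search backpressure cancellation counts.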

Expected behavior

We need to adjust the current search backpressure cancellation logic so that tasks are cancelled based on measurements of node-level resources. For example, if a node is under duress due to high CPU utilization, we should only consider canceling tasks based on the CPU settings, rather than the heap or elapsed-time settings at the task level.
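
To illustrate the idea, here is a minimal sketch of duress-aware gating. The types and method names below are hypothetical and do not correspond to the actual OpenSearch SearchBackpressureService classes:

  import java.util.List;
  import java.util.Set;

  // Hypothetical sketch only: these names are illustrative, not the real OpenSearch classes.
  enum ResourceType { CPU, HEAP, ELAPSED_TIME }

  // Placeholder for per-task resource usage (CPU time, heap bytes, elapsed time).
  interface TaskStats { }

  // One tracker per resource dimension, each applying its own task-level threshold
  // (heap, CPU time, or elapsed time).
  interface TaskResourceTracker {
      ResourceType resourceType();
      boolean shouldCancel(TaskStats stats);
  }

  final class DuressAwareCancellationPolicy {
      private final List<TaskResourceTracker> trackers;

      DuressAwareCancellationPolicy(List<TaskResourceTracker> trackers) {
          this.trackers = trackers;
      }

      // Consult a tracker only when the node is under duress on that tracker's own
      // dimension, so high node-level CPU alone never triggers heap- or time-based cancellation.
      boolean shouldCancel(TaskStats stats, Set<ResourceType> nodeDuressDimensions) {
          return trackers.stream()
                  .filter(t -> nodeDuressDimensions.contains(t.resourceType()))
                  .anyMatch(t -> t.shouldCancel(stats));
      }
  }

The key point is the filter step: a heap-based tracker is only evaluated if the node is actually under heap duress.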

Additional Details

Host/Environment (please complete the following information):

  • Version: OpenSearch 1.3+
@ticheng-aws ticheng-aws added bug Something isn't working enhancement Enhancement or improvement to existing feature or request untriaged Indexing & Search labels Apr 19, 2024
@ticheng-aws ticheng-aws changed the title [BUG][Search Backpressure] High Heap Usage Cancellations Due to High Node-Level CPU Utilization [BUG][Search Backpressure] High Heap Usage Cancellation Due to High Node-Level CPU Utilization Apr 19, 2024
@jainankitk
Contributor

Assigning to @kaushalmahi12 due to his prior context with query sandboxing and search backpressure.

@kaushalmahi12
Contributor

Search backpressure works as follows; note that the heap_domination threshold here is a mere 0.05 percent of the total JVM memory available to the process. The same flowchart applies to both SearchTasks and SearchShardTasks.

There are basically three trackers which can potentially cancel a task:

  • Heap Based
  • CPU Based
  • Time Based

The basic flaw here is that even when the node is not under duress because of heap, the heap tracker will still kick in and cancel the task.

[Flowchart: search_backpressure_high_level]
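
For context, the effective tracker thresholds and the node duress thresholds (settings under search_backpressure.*, e.g. search_backpressure.node_duress.cpu_threshold and search_backpressure.node_duress.heap_threshold; exact names should be checked against the documentation for the running version) can be inspected with:

  GET _cluster/settings?include_defaults=true&filter_path=*.search_backpressure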

@sohami
Collaborator

sohami commented Apr 25, 2024

@kaushalmahi12 Agreed, and that is what this issue is trying to explain. I think we should check the duress condition for each tracker as well. For example, if the node is under heap duress, then only evaluate tasks for heap-based cancellation.

@kaushalmahi12
Contributor

kaushalmahi12 commented Apr 26, 2024

That's right, @sohami.
The weird thing about this is that tasks are selected for cancellation even when the total JVM allocations by coordinator/shard-level tasks are only 0.05%. ref.
Time-based cancellation is also not justified in cases where the cluster has very light search traffic and the user is fine with higher latencies for those queries (given that only CPU is high, and admission control (AC) is already there to safeguard against new incoming requests).

I think we should increase this threshold for the search workload's JVM usage (or remove it altogether) and separate out the corresponding trackers.
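
As an interim mitigation along those lines, the per-task heap threshold is a dynamic cluster setting and could be raised. Illustrative only; the setting name and a suitable value should be verified against the search backpressure documentation for the version in use:

  PUT _cluster/settings
  {
    "persistent": {
      "search_backpressure.search_shard_task.heap_percent_threshold": <new_threshold>
    }
  }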

@peternied
Member

[Triage - attendees 1 2 3 4 5 6 7 8]
@ticheng-aws Thanks for creating this issue, looking forward to seeing this resolved.
