
[RFC] Detecting search queries which have been cancelled but continues to run for long. #6953

Closed
sgup432 opened this issue Apr 3, 2023 · 22 comments
Labels
distributed framework, enhancement, RFC, Roadmap:StAR (Stability, Availability, Resiliency)

Comments

@sgup432
Contributor

sgup432 commented Apr 3, 2023

Co-Author: @jainankitk

Problem Statement

As of now in OpenSearch, we have a best-effort mechanism to cancel long-running search queries via the task cancel API POST _tasks/<task_id>/_cancel. We can refer to this as soft cancellation, since the mechanism works by marking the task as cancelled and then requires the search thread to check this flag itself at regular intervals. As we have seen in the past, this mechanism doesn't always work: there are cases where a search query continues to run in tight Lucene loops and causes cluster instability.

While we eventually want to solve this problem with a "hard" cancellation, we first want to add a mechanism to detect queries which were cancelled and still continue to run for a while. With the introduction of search backpressure, there will be many such cases, and we need a mechanism to track them. This will also help us test the effectiveness of the existing search cancellation, identify tight loops within OpenSearch/Lucene, and eventually build a case for hard cancelling rogue queries.

Note: This problem deals with detecting cancelled queries at the shard level, not at the coordinator level.

Solution

There are two parts to the solution.

  1. The first part comprises detecting such long-running cancelled queries.
  2. The second part comprises dumping such queries along with their stack traces for later analysis.

We will only focus on Part 1 for now.

Introduction of a new field in the Task API to detect time elapsed for a cancelled query

We will introduce a new field, cancellationTime, in CancellableTask. Every time a task is cancelled, we will capture this timestamp, which will later be used to calculate the time elapsed after the task was cancelled. We will also expose this information in the Tasks API by introducing a new field, running_time_since_cancellation_nanos.

This will give us additional information to display such tasks especially for search.

"tasks" : {
    "oTUltX4IQMOUUVeiohTt8A:464" : {
    "node" : "oTUltX4IQMOUUVeiohTt8A",
    "id" : 464,
    "type" : "transport",
    "action" : "indices:data/read/search",
    "description" : "indices[test], types[test], search_type[QUERY_THEN_FETCH], source[{\"query\":...}]",
    "start_time_in_millis" : 1483478610008,
    "running_time_in_nanos" : 139913830000,
    "cancellable" : true,
    "cancelled" : true,
    "cancelled_at_millis": 1483479610008
    }
}
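For illustration, a minimal sketch (in Java) of how the cancellation timestamp could be captured and exposed; the class, field, and method names below are placeholders, not the actual CancellableTask implementation:

// Hypothetical sketch: capture a cancellation timestamp on a cancellable task.
// Names are illustrative, not the real CancellableTask API.
public class IllustrativeCancellableTask {
    private volatile boolean cancelled = false;
    // Wall-clock time of cancellation, for the proposed cancelled_at_millis field.
    private volatile long cancellationTimeMillis = -1;
    // Monotonic timestamp, used only for computing elapsed time since cancellation.
    private volatile long cancellationTimeNanos = -1;

    public void cancel(String reason) {
        if (!cancelled) {
            cancelled = true;
            cancellationTimeMillis = System.currentTimeMillis();
            cancellationTimeNanos = System.nanoTime();
        }
    }

    public boolean isCancelled() {
        return cancelled;
    }

    /** Elapsed nanos since cancellation, or -1 if the task was never cancelled. */
    public long getRunningTimeSinceCancellationNanos() {
        return cancelled ? System.nanoTime() - cancellationTimeNanos : -1;
    }
}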

Separate monitoring thread to detect such queries

As part of this approach, we will have a monitoring thread (scheduled on the generic threadpool) within OpenSearch that runs every Y seconds and extracts all search tasks which have been running for more than X seconds after cancellation. Here Y < X.

In every such run, the monitoring thread will fetch all currently running search shard tasks using TaskResourceTrackingService::getResourceAwareTasks. We will then filter out the tasks which have been cancelled and have been running for more than X seconds since cancellation.

X ~ 10s - Long enough to detect tight loops within OpenSearch/Lucene and short enough to avoid missing long-running queries after cancellation. Configurable.
Y ~ 5s - As X is 10s, keeping this at 5s means we don't check too frequently while leaving enough buffer to accommodate the execution time of the detection itself. Configurable.
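
A minimal sketch of the detection loop described above, assuming hypothetical task accessors that stand in for TaskResourceTrackingService::getResourceAwareTasks and the task's cancellation state (not the actual OpenSearch signatures):

import java.util.Collection;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative monitor: every Y seconds, count cancelled search shard tasks that are
// still running more than X seconds after cancellation. The interfaces are placeholders.
public class SearchCancellationMonitorSketch {

    interface MonitoredTask {
        boolean isCancelled();
        long runningTimeSinceCancellationNanos(); // -1 if never cancelled
    }

    interface TaskSource {
        Collection<MonitoredTask> currentSearchShardTasks(); // stands in for getResourceAwareTasks()
    }

    private static final long X_NANOS = TimeUnit.SECONDS.toNanos(10); // detection threshold, configurable
    private static final long Y_SECONDS = 5;                          // poll interval, configurable

    private final AtomicLong currentCount = new AtomicLong(); // queries currently in this state
    private final AtomicLong totalCount = new AtomicLong();   // cumulative counter for node stats

    public void start(TaskSource source, ScheduledExecutorService scheduler) {
        scheduler.scheduleAtFixedRate(() -> {
            long current = source.currentSearchShardTasks().stream()
                    .filter(MonitoredTask::isCancelled)
                    .filter(t -> t.runningTimeSinceCancellationNanos() > X_NANOS)
                    .count();
            currentCount.set(current);
            totalCount.addAndGet(current); // a real implementation would avoid re-counting the same task
        }, Y_SECONDS, Y_SECONDS, TimeUnit.SECONDS);
    }
}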

Below are the components that will be newly introduced:

SearchCancellationMonitoringService: Runs every Y seconds, fetches all currently running SearchShardTasks, and then filters the tasks which have been cancelled and have been running for more than X seconds. The count of such tasks is stored in a local variable.
SearchCancellationMonitoringStats: Maintains all stats related to search cancellation from the above. Sample response for _nodes/stats/search_cancellation:

{
  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "cluster_name": "runTask",
  "nodes": {
    "T7aqO6zaQX-lt8XBWBYLsA": {
      "timestamp": 1667409521070,
      "name": "runTask-0",
      "transport_address": "127.0.0.1:9300",
      "host": "127.0.0.1",
      "ip": "127.0.0.1:9300",
      "roles": [],
      "attributes": {
        "testattr": "test",
        "shard_indexing_pressure_enabled": "true"
      },
      "search_cancellation": {
        "search_shard_task": {
          "post_cancellation_stats": {
             "current_query_count": 12,
             "total_query_count": 100
          }
        }
      }
    }
  }
}

Using the above approach, we will have some insight into such queries. OpenSearch clients can also set up a custom cron job to call the node stats API to monitor such queries if required.
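
For example, a client-side cron could poll the node stats and derive a per-interval count from the cumulative total_query_count; the supplier below is a placeholder standing in for fetching and parsing the stats response, not an actual client API:

import java.util.function.LongSupplier;

// Hypothetical client-side helper: derive a per-interval count from the cumulative
// total_query_count returned by the node stats response above.
public class CancellationStatsPoller {

    private long previousTotal = 0;

    /** The supplier stands in for fetching and parsing total_query_count from _nodes/stats. */
    public long pollDelta(LongSupplier totalQueryCount) {
        long currentTotal = totalQueryCount.getAsLong();
        long delta = currentTotal - previousTotal; // queries newly detected since the previous poll
        previousTotal = currentTotal;
        return delta;
    }
}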

Future plan

We can eventually add logic to dump such queries into a log as well (Part 2). The long-term plan is to hard cancel such queries, but that requires a deeper dive, as there seems to be no easy way to do it for now.

Describe alternatives you've considered

  • The only other way is to let OpenSearch clients build a custom cron solution on top of the Task API changes mentioned above and track such queries themselves. But in case we want to extract/dump such rogue queries, that won't be possible with this approach.
  • Using the Search Backpressure Service to detect such queries. The search backpressure service runs every second to identify tasks which are resource intensive and need to be cancelled. We could potentially have used it to identify long-running tasks, but it couples our logic with search backpressure, and we would have to reuse the backpressure stats, which doesn't seem right.


@sgup432 added the enhancement and untriaged labels Apr 3, 2023

@dblock
Member

dblock commented Apr 4, 2023

How will this interact with slow log feature? https://opensearch.org/docs/1.2/opensearch/logs/#slow-logs

@sgup432
Contributor Author

sgup432 commented Apr 4, 2023

How will this interact with slow log feature? https://opensearch.org/docs/1.2/opensearch/logs/#slow-logs

This is not meant to interact with the slow log feature. As part of this proposal we only want to collect stats (a count) of queries which were cancelled but still continue to run for a while. There is currently no way to do this. This will help us identify the effectiveness of search backpressure, which internally tries to cancel queries. We might later plan to dump such queries along with stack traces, but we won't use slow logs for this.

@sgup432 changed the title from "[RFC] Detecting long running and cancelled search queries" to "[RFC] Detecting search queries which have been cancelled but continues to run for long." Apr 4, 2023
@dblock
Member

dblock commented Apr 4, 2023

I guess I was asking a leading question. It feels to me that slow queries are the same thing as long running queries. In this proposal you're adding an API that returns information about long running queries, so can they also serve information about what is currently recorded as a slow query (and dumped in the logs)?

@sgup432
Contributor Author

sgup432 commented Apr 4, 2023

I just changed the description to avoid confusion. Yes, slow logs are the same as long-running queries, based on a defined threshold.
We are essentially capturing stats around long-running queries; the only difference is that these were explicitly cancelled (manually or by search backpressure) and still continue to run even after that (let's say 10 seconds).
Let me know if it makes sense.

@dblock
Member

dblock commented Apr 5, 2023

@sgup432 I understand now.

I think I would prefer canceled_at_nanos to tell me when the task was cancelled (after how long the task was cancelled), rather than adding a field that is now - canceled_at_nanos. This would make the response stable over time and cacheable. Similarly, consider all kinds of post_cancellation_stats to be just stats that can return a stable response of, for example, what's still running. This makes the response stable before and after cancellation, which itself is just a boolean state.

@minalsha removed the untriaged label Apr 6, 2023
@sgup432
Contributor Author

sgup432 commented Apr 7, 2023

@dblock

Yes, adding cancelled_at_nanos does make sense. But I think we can also keep running_time_since_cancellation_nanos, as it is easier to read, similar to running_time_in_nanos. Let me know your thoughts.

Regarding post_cancellation_stats, do you mean renaming it to stats? If yes, I wanted to explicitly call it that to be clearer to the user about what we are showing. If not, can you elaborate with an example?

Also, I will be correcting the above stats to include current and total counts. The total would help us visualize it over time and compare with the previous value.

"post_cancellation_stats": {
             "current_query_count": 12,
             "total_query_count": 100
          }

@dblock
Member

dblock commented Apr 13, 2023

@dblock

Yes, adding cancelled_at_nanos does make sense. But I think we can also keep running_time_since_cancellation_nanos, as it is easier to read, similar to running_time_in_nanos. Let me know your thoughts.

This would be duplicate information (you can calculate running time from the timestamp), so that's another anti-pattern for an API. Return timestamps and let any UX consuming this data decide how to display it, including showing a duration. Speaking of displaying, getting this timestamp makes it possible to display a correct duration live without querying the API again.

Regarding post_cancellation_stats, do you mean renaming it to stats? If yes, I wanted to explicitly call it that to be clearer to the user about what we are showing. If not, can you elaborate with an example?

Yes, I mean renaming post_cancellation_stats to stats. You already have search_cancellation above. But more generally I don't think this API layout is generic enough.

How about a search stats API that includes cancelation stats? Something like this?

"search": {
   "search_shard_task": {
     "status": "cancelled",
     "created_at": ts,
     "cancelled_at": ts,
     "stats": {
         "pending_query_count": 12,
         "cancelled_query_count": 88,
          "total_query_count": 100
      }
   }
}

Also, I will be correcting the above stats to include current and total counts. The total would help us visualize it over time and compare with the previous value.

"post_cancellation_stats": {
             "current_query_count": 12,
             "total_query_count": 100
          }

I am not sure I understand all the kinds of query counts. What kinds are there?

@sgup432
Contributor Author

sgup432 commented Apr 18, 2023

@dblock

This would be duplicate information (you can calculate running time from the timestamp), so that's another anti-pattern for an API. Return timestamps and let any UX consuming this data decide how to display it, including showing a duration. Speaking of displaying, getting this timestamp makes it possible to display a correct duration live without querying the API again.

running_time_since_cancellation_nanos is indeed duplicate information, calculated from the cancellation timestamp, but I still think it is important: this API is usually used by an operator via the command line, and calculating the running time manually (current_time - start_time) requires some website/tool, which is a big disadvantage and time consuming.

How about a search stats API that includes cancelation stats? Something like this?

We already have existing search stats under indices stats, something like below:

"indices": {
  "search": {
         "query_total" : 117284834,
          "query_time_in_millis" : 2309818328,
          "query_current" : 3,
          "fetch_total" : 47100569,
          "fetch_time_in_millis" : 76673528,
          "fetch_current" : 0
  }
}

So I don't think we should have duplicate search stats apart from this. And regarding updating this section with cancellation stats, that is also not possible, as IndicesStats is maintained in memory by IndicesService, and I don't see any way for our service (SearchCancellationMonitoringService) to update these search stats from outside.

Let me know if you have more suggestions. But as we are using a separate service to track such queries, we will need to create a new stats section specific to this use case.

I am not sure I understand all the kinds of query counts. What kinds are there?

"post_cancellation_stats": {
             "current_query_count": 12,
             "total_query_count": 100
          }
  • current_query_count: Returns the currently running queries (at the shard level) which were cancelled more than X seconds ago.
  • total_query_count: Cumulative metric. Returns the total count of such queries since the beginning. This will help us get the count for a particular period by calculating the difference between the current and previous value.

I named it explicitly post_cancellation_stats because we might have more stats under search_cancellation in the future, so it distinguishes well.

@dblock
Member

dblock commented Apr 19, 2023

running_time_since_cancellation_nanos is indeed duplicate information, calculated from the cancellation timestamp, but I still think it is important: this API is usually used by an operator via the command line, and calculating the running time manually (current_time - start_time) requires some website/tool, which is a big disadvantage and time consuming.

I feel pretty strongly about returning a timestamp and not a duration for the reasons above, but I am happy to be convinced otherwise if other maintainers feel that it's not important - maybe @nknize or @andrross?

We already have existing search stats under indices stats, something like below:

"indices": {
  "search": {
         "query_total" : 117284834,
          "query_time_in_millis" : 2309818328,
          "query_current" : 3,
          "fetch_total" : 47100569,
          "fetch_time_in_millis" : 76673528,
          "fetch_current" : 0
  }
}

So I don't think we should have duplicate search stats apart from this. And regarding updating this section with cancellation stats, that is also not possible, as IndicesStats is maintained in memory by IndicesService, and I don't see any way for our service (SearchCancellationMonitoringService) to update these search stats from outside.

Okay, so that's an implementation problem not an API design. Implementation is obviously important, but shouldn't be driving our design decisions. In theory, at least.

What work do you expect SearchCancellationMonitoringService to be doing? If it collects stats about cancelation of queries, why wouldn't that be added to IndicesService? What's so special about search cancelation that requires it to be a standalone function and a separate section? I want to know all kinds of things about searches (how long they took, how often they get canceled, how many have been canceled, etc.).

As an example from above - we don't separate "fetch_total" into a "fetch_stats". So what's the reasoning to separate "post_cancellation_stats"?

@dblock
Member

dblock commented Apr 19, 2023

@sgup432 you should PR what you have, especially if parts of the discussion above are non-controversial (e.g. "Introduction of a new field in Task API to detect time elapsed for cancelled query"), and we can keep discussing in the PR whether that should be a duration or a timestamp - I'm sure we can resolve that quickly either way.

@AmiStrn
Contributor

AmiStrn commented Apr 19, 2023

Why not use the timeout? My guess is that it was not documented correctly, but there is a way to have queries time out reliably after a given time. #986

@ketanv3
Contributor

ketanv3 commented Apr 19, 2023

Thanks for putting up this proposal, @sgup432!

I agree that task cancellations (and timeouts) are ineffective against non-cooperating code flows that do not check the 'cancelled' status regularly. This is analogous to threads that do not check the 'interrupted' status. I'm more interested in knowing how you plan to achieve 'hard' cancellations, considering Thread.stop is dangerous and deprecated. I'm guessing the best we can do is to make more code flows soft-cancellable.
@AmiStrn, I believe this also explains why all queries do not cancel or time out reliably.

I like the idea of identifying tasks that were cancelled long ago. Please note that 'cancelled at' should be measured using a wall clock (System.currentTimeMillis) while 'time since cancellation' (i.e. duration) should be measured using a monotonic clock (System.nanoTime). You should call it cancelled_at_millis to correctly represent the 'cancelled at' time; since a monotonic clock returns nanoseconds past some arbitrary point.
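
To make the clock distinction concrete, a small illustrative sketch (not the real CancellableTask API) capturing both clocks at cancellation time, exposing the wall-clock value and using the monotonic value only for elapsed-time math:

// Sketch (illustrative): capture both clocks at cancellation time.
public final class CancellationClocks {

    private final long cancelledAtMillis = System.currentTimeMillis(); // wall clock: safe to expose as cancelled_at_millis
    private final long cancelledAtNanos = System.nanoTime();           // monotonic: only meaningful as a difference

    /** Elapsed time since cancellation, derived from the monotonic clock, never the wall clock. */
    public long elapsedSinceCancellationNanos() {
        return System.nanoTime() - cancelledAtNanos;
    }

    public long cancelledAtMillis() {
        return cancelledAtMillis;
    }
}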
@dblock, I agree that this is duplicate information and non-cacheable. While we have similar start_time_in_millis and running_time_in_nanos fields, I feel it's better not to repeat the same mistake.

Instead of creating a new SearchCancellationMonitoringService, how about we plug this functionality into the existing TaskCancellationService? That way we can generalise this for all types of task cancellations.

You can have a long polling interval and still catch long-cancelled tasks by plugging into the task completion hook provided by the task resource tracking framework.

@andrross
Member

I agree that this is duplicate information and non-cacheable. While we have similar start_time_in_millis and running_time_in_nanos fields, I feel it's better not to repeat the same mistake.

I agree with both @dblock and @ketanv3 here regarding not including the running time for all the reasons cited above.

@sgup432
Contributor Author

sgup432 commented Apr 19, 2023

@dblock Yes, I will PR the Task API related changes.
@ketanv3 Regarding hard cancellation, the best probable way does seem to be extending the soft cancellation framework, but it is not guaranteed to work every time or cover all tight loops. That's why I wanted to use this proposal to gather more data around it while we work in the background to identify some way to do "hard cancellation". I will look into using TaskCancellationService.

@dblock
Member

dblock commented Apr 20, 2023

@ketanv3 Thanks for a detailed review! Appreciate your continued eyes on this.

@sgup432
Contributor Author

sgup432 commented Apr 20, 2023

@ketanv3

Instead of creating a new SearchCancellationMonitoringService, how about we plug this functionality into the existing TaskCancellationService? That way we can generalise this for all types of task cancellations.

I took a look at TaskCancellationService and also agree on making it generic. It has the functionality to cancel tasks when needed. In our case we want to repeatedly detect long-cancelled tasks that have been running for more than X (10 for now) seconds after cancellation. I see the below option for achieving this:

Extend the TaskCancellationService functionality by also adding a capability to schedule it every Y seconds to detect long-cancelled tasks that have been running for more than X seconds. To achieve this we need to inherit from AbstractLifeCycleComponent and then schedule it accordingly.

We can add generic task cancellation stats like below:

"task_cancellation": {
        "search_shard_task": {
           "stats": {
               "current_query_count": 12,
               "total_query_count": 100
          }
        }
      }

And we can add a setting through which we can enable monitoring of different types of tasks in the future. We will keep it to SearchShardTask for now.
Pros:

  • Can be extended easily for different types of tasks in the future, as you mentioned.

Cons:

  • All the different types of tasks will be bound to the same intervals, X and Y seconds, and we won't be able to change this in case we need to tune them for a particular type of task.

I am not in favor of this approach for the reason mentioned in the cons. We can, though, add common logic in TaskCancellationService and make SearchCancellationMonitoring (and the rest) extend it. This way each task type can have its own intervals and we also achieve code reusability.
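
For illustration, a rough sketch of that "common logic, per-task-type intervals" idea under hypothetical names; it is not the existing TaskCancellationService API:

import java.util.concurrent.TimeUnit;

// Hypothetical sketch of shared cancellation-monitoring logic with per-task-type intervals.
abstract class AbstractCancellationMonitor {

    /** Poll interval (Y) for this task type. */
    protected abstract long pollIntervalSeconds();

    /** Threshold (X) after which a cancelled-but-still-running task is counted. */
    protected abstract long cancellationThresholdNanos();

    /** Count the tasks of this type that are cancelled and still running past the threshold. */
    protected abstract long countLongCancelledTasks();

    // Shared scheduling and stats-publishing logic would live here.
}

// Search keeps its own intervals without binding other task types to them.
class SearchShardTaskCancellationMonitor extends AbstractCancellationMonitor {
    @Override protected long pollIntervalSeconds() { return 5; }
    @Override protected long cancellationThresholdNanos() { return TimeUnit.SECONDS.toNanos(10); }
    @Override protected long countLongCancelledTasks() { return 0; /* filter SearchShardTasks here */ }
}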

You can have a long polling interval and still catch long-cancelled tasks by plugging into the task completion hook provided by the task resource tracking framework.

I am not sure whether this was meant as an extension to the TaskCancellationService suggestion above. But if I understand correctly, you are suggesting hooking into the task completion interface to track long-cancelled tasks. If so, I don't think this will be efficient, as some tasks continue to run even for an hour or sometimes never complete gracefully, because the OpenSearch process is restarted in such cases to avoid the havoc caused by these long-running tasks. So we definitely need a way to schedule a service regularly to keep track of such tasks.

@sgup432
Contributor Author

sgup432 commented Apr 20, 2023

@dblock

What work do you expect SearchCancellationMonitoringService to be doing? If it collects stats about cancelation of queries, why wouldn't that be added to IndicesService? What's so special about search cancelation that requires it to be a standalone function and a separate section? I want to know all kinds of things about searches (how long they took, how often they get canceled, how many have been canceled, etc.).

IndicesService collects these stats (query_total, fetch_total, etc.) via a listener-based approach, where after every phase is done a listener is triggered and stats are collected. This approach will not work very well for our use case, as we need to wait at least X seconds after a task is cancelled, check whether it is still running, and only then collect stats for it.

Also, as @ketanv3 mentioned, we can plan to track this for all kinds of tasks and make it more generic. Through this approach, it makes more sense to add a separate stats section (named task_cancellation, as mentioned above) instead of reusing existing ones.

Let me know if this makes sense.

@ketanv3
Contributor

ketanv3 commented May 3, 2023

@sgup432, I like the ideas presented overall.

All the different types of tasks will be bound to the same intervals, X and Y seconds, and we won't be able to change this in case we need to tune them for a particular type of task.

Do we have a use case for monitoring different types of tasks at different intervals? Does that require separate monitoring threads for each 'cancellable' task type? In that case, you can explore hierarchical timing wheels to schedule work at varying intervals while still keeping the monitoring code single-threaded. But again, this is an implementation detail that can be ironed out when the PRs are out.
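
For a small number of cancellable task types, a single-threaded scheduler already gives per-type intervals without extra monitoring threads; the hierarchical timing wheels mentioned above become attractive as the number of schedules grows. A minimal, hypothetical sketch:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: one monitoring thread, different detection intervals per cancellable task type.
public class SingleThreadedCancellationMonitors {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        // Each task type gets its own interval; all checks share a single thread.
        scheduler.scheduleAtFixedRate(
                () -> System.out.println("check long-cancelled SearchShardTasks"), 5, 5, TimeUnit.SECONDS);
        scheduler.scheduleAtFixedRate(
                () -> System.out.println("check long-cancelled tasks of another type"), 30, 30, TimeUnit.SECONDS);
    }
}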

I don't think this will be efficient, as some tasks continue to run even for an hour or sometimes never complete gracefully, because the OpenSearch process is restarted...

I was pointing out that we may not have to go with very short monitoring intervals, which would otherwise put strain on the cluster. We can combine a longer interval with plugging into that hook to catch tasks that complete before the monitoring cycle.

@sgup432
Contributor Author

sgup432 commented May 3, 2023

@ketanv3 We don't have a use case right now for different intervals for different tasks. But I guess we need to design it in such a way that it accommodates those needs in the future? And thanks, I will look into the link you provided.

And yes, we can discuss implementation related details in the PR.

@sohami added the RFC and Roadmap:StAR (Stability, Availability, Resiliency) labels May 14, 2024
@sohami
Collaborator

sohami commented May 14, 2024

@sgup432 Is there anything pending for this? If not, then let's close this issue. It looks like we have made the changes for the detection part.

@sgup432
Contributor Author

sgup432 commented May 14, 2024

@sohami Yeah nothing else pending for this. Closing this out.

@sgup432 closed this as completed May 14, 2024