
[RFC] Tracking Search latency and stats at Coordinator node level #7334

Open
sgup432 opened this issue Apr 29, 2023 · 8 comments

Labels: enhancement (Enhancement or improvement to existing feature or request), RFC (Issues requesting major changes), Roadmap:Search and ML (Project-wide roadmap label)

Comments

sgup432 (Contributor) commented Apr 29, 2023

Co-Author: @jainankitk

Problem Statement

As of today, we track search request latencies at a shard level via node stats. After every query/fetch phase completes on a shard, we record the time taken for each, accumulate those values, and maintain an overall average that is tracked under stats.

But we don't have a mechanism to track search latencies at the coordinator node. The coordinator node plays an important role in fanning out requests to individual shards/data nodes, aggregating those responses, and eventually sending the response back to the client. We have seen multiple issues in the past where it was hard or impossible to reason about latency issues because of the lack of insight into coordinator-level stats, and we ended up spending a lot of unnecessary time and bandwidth figuring them out. Clients using the search API can only rely on the overall took time (present as part of the search response), which doesn't offer much insight into the time taken by the different phases.

Solution

We want to introduce the below high-level changes to give us more insight into coordinator-level latencies.

  1. Per-request level tracking: As part of this, we will offer a further breakdown of the existing took time in the search response. To do this, we will introduce a new field (phase_took) in the search response which will give clients more insight/visibility into the overall time taken by the different search phases (query/fetch/canMatch etc.).
{
    "took" : 29321,
    "phase_took" : { // new field
        "fetch" : 29253,
        "query" : 14
    },
    "timed_out" : false,
    "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
    },
    "hits" : {
        "total" : {
            "value" : 0,
            "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
    }
}
  2. Add coordinator-node-level stats: As part of this, we will also track search stats from the coordinator node's perspective. Currently we track search shard stats under IndicesStats. We can utilize the same stats and also track overall search-request-level stats, i.e. (total time taken by the coordinator + communication time between coordinator and data nodes + execution time at the data nodes).
"indices" : {
        "search" : {
          "open_contexts" : 1,
          "query_total" : 117336056,
          "query_time_in_millis" : 2244945904,
          "query_current" : 1,
          "fetch_total" : 47060301,
          "fetch_time_in_millis" : 78532219,
          "fetch_current" : 0,
          "scroll_total" : 466,
          "scroll_time_in_millis" : 318294065,
          "scroll_current" : 0,
          "suggest_total" : 0,
          "suggest_time_in_millis" : 0,
          "suggest_current" : 0,
          "request": { // New field. Contains overall search requests stats from coordinator node perspective.
             "query_time_in_millis" : 5244945904,
             "query_current" : 1,
             "query_total" : 7336056,
             "fetch_time_in_millis": 178532219,
             "fetch_current" : 0,
              "fetch_total" : 7060301
          }
        }
}
  3. Slow logs at coordinator level: As of now, we only have the capability to enable slow logs at a shard level for the desired search phases (query and fetch). See this. Setting this threshold is tricky when a customer usually sees latency spikes at the request level. Plus, shard-level slow logs don't offer a holistic view.
    So as part of this, we will also add the capability to capture slow logs at a request level, along with the different search phases, from the coordinator node's perspective.

We plan to use a new prefix, index.search.request, to achieve this.

// Setting on a whole search request level
index.search.request.slowlog.threshold.overall.warn: 10s
index.search.request.slowlog.threshold.overall.info: 5s
index.search.request.slowlog.threshold.overall.debug: 2s
index.search.request.slowlog.threshold.overall.trace: 500ms

// Setting on query phase level
index.search.request.slowlog.threshold.query.warn: 10s
.....

// Setting on fetch phase level
index.search.request.slowlog.threshold.fetch.warn: 1s
.....
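
As a rough sketch, these thresholds could be registered as dynamic index settings, following the pattern of the existing shard-level slow log settings. The class and constant names below are illustrative assumptions, not a committed implementation:

import org.opensearch.common.settings.Setting;
import org.opensearch.common.unit.TimeValue;

// Illustrative only: the setting key matches the proposal above; the class/constant names
// are placeholders. A disabled threshold is conventionally represented by -1.
public final class SearchRequestSlowLogSettings {

    public static final Setting<TimeValue> REQUEST_SLOWLOG_THRESHOLD_OVERALL_WARN = Setting.timeSetting(
        "index.search.request.slowlog.threshold.overall.warn",
        TimeValue.timeValueNanos(-1),      // default: disabled
        TimeValue.timeValueMillis(-1),     // minimum allowed value
        Setting.Property.Dynamic,          // updatable at runtime via the index settings API
        Setting.Property.IndexScope
    );
}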

Approach

We want to design this in a way that is simple, extensible, and has negligible performance impact.

Listener based approach

We can utilize a listener-based approach which is pretty similar to the way the current shard-level stats and slow logs work. Via this mechanism we can solve all three problems mentioned above.

Here, at the start and end of every phase at the coordinator node level, we will invoke a listener which will help us keep track of the took time of individual search phases (query/fetch/canMatch).

High level Listener interface

import java.util.List;

import org.apache.logging.log4j.Logger;

// Listener to track the start and end of phases at the coordinator node level
public interface SearchRequestOperationsListener {

    default void onRequestStart(SearchPhaseContext context) {}
    default void onRequestEnd(SearchPhaseContext context) {}

    // Query phase
    default void onQueryPhaseStart(SearchPhaseContext context) {}
    default void onQueryPhaseEnd(SearchPhaseContext context, long tookTime) {}
    default void onQueryPhaseFailure(SearchPhaseContext context) {}

    // Fetch phase
    default void onFetchPhaseStart(SearchPhaseContext context) {}
    default void onFetchPhaseEnd(SearchPhaseContext context, long tookTime) {}
    default void onFetchPhaseFailure(SearchPhaseContext context) {}

    // Gives the capability to register multiple listeners and invoke them accordingly.
    final class CompositeListener implements SearchRequestOperationsListener {
        private final List<SearchRequestOperationsListener> listeners;
        private final Logger logger;

        CompositeListener(List<SearchRequestOperationsListener> listeners, Logger logger) {
            this.listeners = listeners;
            this.logger = logger;
        }

        @Override
        public void onQueryPhaseStart(SearchPhaseContext context) {
            for (SearchRequestOperationsListener listener : listeners) {
                try {
                    listener.onQueryPhaseStart(context);
                } catch (Exception e) {
                    // A failing listener should never fail the search request itself.
                    logger.warn("onQueryPhaseStart listener failed", e);
                }
            }
        }

        @Override
        public void onQueryPhaseEnd(SearchPhaseContext context, long tookTime) {
            for (SearchRequestOperationsListener listener : listeners) {
                try {
                    listener.onQueryPhaseEnd(context, tookTime);
                } catch (Exception e) {
                    logger.warn("onQueryPhaseEnd listener failed", e);
                }
            }
        }

        // ... the remaining listener methods are delegated in the same way
    }
}

The above listener will be invoked from AbstractSearchAsyncAction, which serves as the entry point for a search request landing on the coordinator node. This action already has a mechanism to keep track of when any phase starts or ends. See this and this.
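
For illustration, a minimal sketch of how such an invocation could be wired around phase execution on the coordinator node. The field and method names below are assumptions for this example, not the actual AbstractSearchAsyncAction implementation (assumes java.util.concurrent.TimeUnit and an assumed requestOperationsListener field holding a CompositeListener):

// Hypothetical wrapper around phase execution on the coordinator node.
private void executePhaseWithListener(SearchPhase phase, SearchPhaseContext context) throws Exception {
    final long startNanos = System.nanoTime();
    requestOperationsListener.onQueryPhaseStart(context);   // or the callback matching the current phase
    try {
        phase.run();
    } catch (Exception e) {
        requestOperationsListener.onQueryPhaseFailure(context);
        throw e;
    }
    final long tookMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    requestOperationsListener.onQueryPhaseEnd(context, tookMillis);
}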

The below components will subscribe to the above listener to achieve our goal.

  • SearchTimeProvider: Currently keeps track of the overall took time in the search response. To offer a breakdown of each phase's took time, we will use the same class and track those values via the above listener.
  • SearchRequestStats: A new class to keep track of coordinator-level stats. We will encapsulate it within the existing SearchStats. SearchRequestStats will also subscribe to the above listener events (a rough sketch follows this list).
  • SearchRequestSlowLogs: Same as above. This will also implement the listener and, via those events, provide the capability to capture slow logs at the coordinator/request level.
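
As a rough sketch of what SearchRequestStats could look like, assuming the existing CounterMetric/MeanMetric utilities in org.opensearch.common.metrics; the exact fields and wiring below are illustrative assumptions:

import org.opensearch.common.metrics.CounterMetric;
import org.opensearch.common.metrics.MeanMetric;

// Illustrative only: a coordinator-level stats holder that subscribes to the listener.
public final class SearchRequestStats implements SearchRequestOperationsListener {

    private final CounterMetric queryCurrent = new CounterMetric();  // in-flight query phases
    private final MeanMetric queryMetric = new MeanMetric();         // total count + total time

    @Override
    public void onQueryPhaseStart(SearchPhaseContext context) {
        queryCurrent.inc();
    }

    @Override
    public void onQueryPhaseEnd(SearchPhaseContext context, long tookTime) {
        queryCurrent.dec();
        queryMetric.inc(tookTime);   // would back query_total / query_time_in_millis in the new "request" section
    }

    @Override
    public void onQueryPhaseFailure(SearchPhaseContext context) {
        queryCurrent.dec();
    }

    // Fetch-phase callbacks would follow the same pattern.
}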

[Diagram: latency_listener_based - listener-based latency tracking flow]

Scenarios to cover

  • Search type: We will handle both the queryAndThenFetch and DFS search types as part of this.

  • SearchPhase: We will cover all possible search phases which exist as of today. Basically, every search phase implements the base class SearchPhase. This includes:

    1. CanMatchPreFilterPhase - Only valid for the QueryAndThenFetch type.
    2. PreQueryPhase/DFSPhase - Only valid for the DFS search type.
    3. QueryPhase - Only valid for the QueryAndThenFetch type.
    4. DFSQueryPhase - Only valid for the DFS type.
    5. FetchSearchPhase - Valid for both types.
    6. ExpandSearchPhase - Valid for both types. This is executed if the search request has inner hits and needs field collapsing.
  • MultiSearch - A multi-search response contains the individual search request responses. We will accordingly provide a phase-wise breakdown for each.

reta (Collaborator) commented May 2, 2023

@sgup432 thanks for the RFC. I think there is a large overlap with an existing issue, #705, to extend the query profiling information with coordinator-level timings. I honestly believe that such details should be captured on demand (with profile = true); otherwise there is a potential to introduce significant overhead (in terms of creating and wrapping different listeners / callbacks / trackers) on each and every request, even if that is not needed by the end user. That may not be a problem for the majority of queries (or searches, to generalize), but some might be impacted.

sgup432 (Contributor) commented May 2, 2023

@reta Thanks for the comments.
This listener-based approach is very lightweight and will have an almost negligible performance impact. It is just like starting and ending a timer, since we are only looking to track the took time of the search phases. It is almost identical to how we collect shard-level stats/latencies, where we see no overhead. If you still see a problem, we can add a cluster setting to turn this off in the worst-case scenario.

The profiler is much more exhaustive and relatively expensive. Also, as I see it, it is only used to debug certain queries or to reproduce issues that happened in the past. Meanwhile, we have seen multiple cases where it is impossible to debug latency-related issues (which are hard to reproduce) at a particular point in the past, because we don't have much insight into coordinator-level latencies/stats. And the current took time also doesn't provide much insight to OpenSearch clients.

reta (Collaborator) commented May 3, 2023

@sgup432 sure, fair enough. Let me ask this question though: let's say you added these took times (as per the proposal):

"phase_took" : { // new field
        "fetch" : 29253,
        "query" : 14,
    },

How is it helpful? I mean, the fetch phase took more time than the query phase, but why? Where do we go next? With profile, we could and should provide significantly more detail, not only at the phase level but at the phase level on each shard, plus the coordinator; at least it will then be clear where the potential problem is.

This is just my opinion, let's see what other folks think.

kkhatua (Member) commented May 3, 2023

Good point, @reta.

What we often come across are cases where an application deploys a new set of queries in addition to an existing set and begins to observe end-to-end latency regressions. The took_time by itself is not very useful. But more than fetch, the value lies in the post-fetch activities, which often happen on the same coordinator node itself.
A classic example would be a workload that traditionally aggregates across, say, 7 days' worth of data. A new use case is now added to the workload which aggregates across 30 days (4x). We likely will not see much change in the query or even fetch phase latencies, but the aggregations will take longer. Overall, we see an increase in the general took_time, but the nodes do not show anything else except higher CPU and memory utilization.

There is also value in the end user being able to track latencies as percentiles, assuming their workloads are independently consuming the OpenSearch resources.

sgup432 (Contributor) commented May 3, 2023

@reta In addition to took time, it is important to add coordinator/request-level stats as part of node stats (as mentioned in the proposal). As of now, the only way to build metrics/dashboards around search latency is at a shard level (by using the node stats API and doing some calculation using query_time_in_millis). There is no way to track coordinator-level stats, which is critical and usually a blind spot as of now.

Along with that, another important aspect is slow logs at a request level. Consider the below scenario:
Some of my search queries are taking more than X seconds overall and I want to debug which ones exactly. The only way to log them today is at a shard level by setting the desired threshold, which doesn't make sense, is tricky, and doesn't give a holistic view (from the coordinator's perspective). I would ideally want to set a slow log at the request level, as proposed above.
Let me know what you think.

reta (Collaborator) commented May 4, 2023

@sgup432 sure, I am all in for more useful stats (this is a separate API(s) though); for search in particular we do have more powerful means to collect them using profiling.

sgup432 (Contributor) commented May 5, 2023

@reta Yes, we do have profiling. But as mentioned, it doesn't provide a historical view of latency data at either the shard or coordinator level. Node stats does that, but it lacks coordinator-level data.
Profiling only helps to selectively debug certain slow queries when needed.
