
[Search Perf] Improve Query Frontend -> Querier Job Throughput #2464

Closed
joe-elliott opened this issue May 11, 2023 · 2 comments


joe-elliott commented May 11, 2023

The querier/query-frontend relationship uses code originally developed for Cortex many years ago, and it is likely showing its age. For larger queries Tempo will often create tens of thousands of jobs, which are piped from the query-frontend to the queriers one at a time. Currently, I believe, there is a bottleneck in delivering these jobs to the queriers at scale.

Code

In the querier you can control the number of jobs it will execute in parallel using max_concurrent_queries. For every concurrent query the querier starts a new goroutine and opens a gRPC connection to the query-frontend.

On the query-frontend side a goroutine is started for every Process call above, and they all block on a call to GetNextRequestForQuerier. The end result is often 10k+ goroutines in the query-frontend waiting on one mutex to deliver one job at a time downstream to a querier.
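To make the shape of this bottleneck concrete, here is a minimal, self-contained Go sketch of the pattern described in the two paragraphs above. This is not Tempo's actual code; apart from the reference to GetNextRequestForQuerier, all names and numbers are invented for illustration. Every querier connection gets its own goroutine on the frontend side, and all of those goroutines contend on a single mutex-protected queue that hands out one job per wakeup.

```go
// Illustrative model only; not Tempo's actual frontend/querier code.
package main

import (
	"fmt"
	"sync"
)

// job stands in for one search job produced by splitting a large query.
type job struct{ id int }

// requestQueue models the frontend's single queue: one slice of pending jobs
// guarded by one mutex, with a condition variable to wake blocked callers.
type requestQueue struct {
	mtx     sync.Mutex
	cond    *sync.Cond
	pending []job
}

func newRequestQueue() *requestQueue {
	q := &requestQueue{}
	q.cond = sync.NewCond(&q.mtx)
	return q
}

// enqueue is called as the frontend splits a query into jobs.
func (q *requestQueue) enqueue(j job) {
	q.mtx.Lock()
	q.pending = append(q.pending, j)
	q.mtx.Unlock()
	q.cond.Signal()
}

// getNextRequest plays the role of GetNextRequestForQuerier: every
// per-connection goroutine blocks here, all contending on the same mutex,
// and each wakeup hands exactly one job to one querier connection.
func (q *requestQueue) getNextRequest() job {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	for len(q.pending) == 0 {
		q.cond.Wait()
	}
	j := q.pending[0]
	q.pending = q.pending[1:]
	return j
}

func main() {
	const (
		queriers             = 3 // the cluster described below runs ~100
		maxConcurrentQueries = 4 // and ~1000 per querier, i.e. ~100k goroutines
		totalJobs            = 24
	)

	q := newRequestQueue()
	var wg sync.WaitGroup

	// One goroutine per querier connection (one per Process call),
	// all parked in getNextRequest.
	for qr := 0; qr < queriers; qr++ {
		for c := 0; c < maxConcurrentQueries; c++ {
			wg.Add(1)
			go func(querier, conn int) {
				defer wg.Done()
				for i := 0; i < totalJobs/(queriers*maxConcurrentQueries); i++ {
					j := q.getNextRequest() // one mutex, one job per wakeup
					fmt.Printf("querier %d conn %d got job %d\n", querier, conn, j.id)
				}
			}(qr, c)
		}
	}

	// A large query fans out into many jobs pushed through the single queue.
	for i := 0; i < totalJobs; i++ {
		q.enqueue(job{id: i})
	}
	wg.Wait()
}
```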

Metrics

This is a graph of requests/second serviced by queriers during a longer TraceQL query. Notice how the least active querier starts 20-30 seconds later than the first queriers and how slow the ramp-up is over the course of the query:

[graph: per-querier requests/second during a long TraceQL query]

It should be noted that CPU or network saturation could also cause an effect like this.

Possible Solutions

1. It's possible that just reducing contention on the mutex linked above would yield improved querier performance. Perhaps we can find a way to efficiently shard that queue and spread the load across N mutexes (see the sketch after this list).

2. Rewrite the relationship between these two components. Perhaps, upon connection, the querier could pass the number of jobs it is willing to take, and the query-frontend could deliver a batch of jobs at once that the querier would respond to one at a time.
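As a purely illustrative sketch of option 1 above (not a proposed patch, and not Tempo's implementation; all names here are invented), the queue could be split into N shards, each with its own mutex, with a querier hashed to a home shard and falling back to the other shards when its own is empty. Connections would then contend on roughly 1/N of the lock traffic:

```go
// Illustrative sharded-queue sketch only; not Tempo's implementation.
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// job stands in for one search job produced by the query-frontend.
type job struct{ id int }

// shard is one independently locked slice of pending jobs.
type shard struct {
	mtx     sync.Mutex
	pending []job
}

// shardedQueue spreads lock contention across N mutexes instead of one.
type shardedQueue struct {
	shards []*shard
}

func newShardedQueue(n int) *shardedQueue {
	q := &shardedQueue{shards: make([]*shard, n)}
	for i := range q.shards {
		q.shards[i] = &shard{}
	}
	return q
}

// shardIndexFor hashes the querier ID so a given querier's connections
// always start on the same, smaller lock.
func (q *shardedQueue) shardIndexFor(querierID string) int {
	h := fnv.New32a()
	h.Write([]byte(querierID))
	return int(h.Sum32() % uint32(len(q.shards)))
}

// enqueue spreads jobs across shards (here simply by job ID; a real design
// would also need to respect per-tenant limits and fairness).
func (q *shardedQueue) enqueue(j job) {
	s := q.shards[j.id%len(q.shards)]
	s.mtx.Lock()
	s.pending = append(s.pending, j)
	s.mtx.Unlock()
}

// dequeue pops one job, starting at the querier's home shard and falling
// back to the other shards so work is never stranded on an idle shard.
func (q *shardedQueue) dequeue(querierID string) (job, bool) {
	start := q.shardIndexFor(querierID)
	for i := 0; i < len(q.shards); i++ {
		s := q.shards[(start+i)%len(q.shards)]
		s.mtx.Lock()
		if len(s.pending) > 0 {
			j := s.pending[0]
			s.pending = s.pending[1:]
			s.mtx.Unlock()
			return j, true
		}
		s.mtx.Unlock()
	}
	return job{}, false
}

func main() {
	q := newShardedQueue(4)
	for i := 0; i < 8; i++ {
		q.enqueue(job{id: i})
	}
	queriers := []string{"querier-0", "querier-1"}
	for n := 0; ; n++ {
		querier := queriers[n%len(queriers)]
		j, ok := q.dequeue(querier)
		if !ok {
			break
		}
		fmt.Printf("%s got job %d\n", querier, j.id)
	}
}
```

A real version would also have to preserve the per-tenant queueing the existing max_outstanding_per_tenant limit implies, which this sketch ignores.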

We are currently seeing a fair amount of querier imbalance and slow spin-up for larger queries. Removing this bottleneck would likely have a large positive impact on performance.

joe-elliott changed the title from "Improve Query Frontend -> Querier Job Throughput" to "[Search Perf] Improve Query Frontend -> Querier Job Throughput" on May 12, 2023
joe-elliott (Member Author) commented:

Some additional analysis on the rate at which queriers draw down the frontend queue.

Relevant config:

```yaml
query_frontend:
    max_outstanding_per_tenant: 100000
    search:
        concurrent_jobs: 75000
querier:
    max_concurrent_queries: 1000
```

With 100 queriers the cluster had capacity for 100,000 concurrent jobs (100 queriers × 1,000 max_concurrent_queries). I repeatedly executed an exhaustive query over 6 hours that created ~65k jobs.

Time spent in queue

```
histogram_quantile(.9, sum by (le) (rate(tempo_query_frontend_queue_duration_seconds_bucket{}[1m])))
```

[graph: p90 queue duration over the course of the test]
It appears that 10% of the jobs spent 4.5s or more in the queue waiting for a querier to service them.

Querier min/max/avg RPS
[graph: querier min/max/avg requests per second]
Still seeing an imbalance across queriers. Our busiest queriers are doing 4x the work of our slowest queriers.

joe-elliott (Member Author) commented:

A number of PRs have been merged to improve this situation. Closing this issue, as any future improvements would require a dedicated redesign of the relationship between the queriers and the frontend and should be tracked in their own issue.
