
Log time in queue per request #4949

Merged: 6 commits merged into main from issue/4774-log_request_enqueue_time on Jan 4, 2022
Conversation

@ssncferreira (Contributor) commented Dec 16, 2021

What this PR does / why we need it:

Add query enqueue time to result statistics and metrics.go.

Incoming requests are handled by the query-frontend, which pushes each request onto an internal queue. The queriers pull requests from this queue and execute them. The time each request spends in the queue is observed by the cortex_query_scheduler_queue_duration_seconds metric.
The idea of this PR is to add this value to the query's result statistics as well as to the metrics.go log line. This is accomplished by setting a new header, X-Query-Enqueue-Time, on the query HTTP-over-gRPC request at the scheduler. A new HTTP middleware function, ExtractQueryMetricsMiddleware, extracts this header on the queriers; the value is then used to populate the new EnqueueTime field of the result statistics and the metrics.go log line.
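
As a rough illustration of the querier-side extraction (a sketch only; the actual middleware in pkg/util/httpreq may differ in signature, header parsing, and in how the value reaches the statistics):

```go
package httpreq

import (
	"context"
	"net/http"
	"time"
)

// Header name used by the scheduler to propagate the queue wait time.
const QueryEnqueueTimeHTTPHeader = "X-Query-Enqueue-Time"

type ctxKey int

const enqueueTimeKey ctxKey = iota

// ExtractQueryMetricsMiddleware reads the enqueue-time header set by the
// scheduler and stores it on the request context, where the query path can
// later copy it into the result statistics. The context-key plumbing here is
// illustrative, not the PR's exact implementation.
func ExtractQueryMetricsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		if v := req.Header.Get(QueryEnqueueTimeHTTPHeader); v != "" {
			// Assumes the scheduler encodes the value as a Go duration string.
			if d, err := time.ParseDuration(v); err == nil {
				req = req.WithContext(context.WithValue(req.Context(), enqueueTimeKey, d))
			}
		}
		next.ServeHTTP(w, req)
	})
}

// EnqueueTimeFromContext returns the queue wait stored by the middleware, if any.
func EnqueueTimeFromContext(ctx context.Context) time.Duration {
	d, _ := ctx.Value(enqueueTimeKey).(time.Duration)
	return d
}
```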

Which issue(s) this PR fixes:
Fixes #4774

Special notes for your reviewer:

Checklist

  • Documentation added
  • Tests updated
  • Add an entry in the CHANGELOG.md about the changes.

A review thread on this diff excerpt from the scheduler's dequeue logging:

```go
r.queueSpan.Finish()

level.Info(s.log).Log("msg", "querier request dequeued", "queryID", r.queryID,
```
Contributor

It could be interesting to add the querier ID too, so we can see whether requests are dequeued fairly across all queriers.

Collaborator

Q: Wondering if the original plan is to bring this queue information as a part of logging in metrics.go?

#4774

Contributor Author

@kavirajk adding this information as part of the logging in metrics.go makes sense; however, we already have this information as part of the scheduler request. To add it to the metrics we would need to somehow pass it downstream to the LogQL metrics. I wasn't able to find a simple way to do that, but let me know if I missed something 🙂

Collaborator

I really like this idea! Is it possible to get the tenant ID here as well?

Contributor Author

@slim-bean good idea. Added in commit 2421761.
From what I understood by looking at the enqueueRequest method, a query can be multi-tenant. Therefore, I used the tenant.TenantIDsFromOrgID method with the scheduled request's orgID (r.userID).

However, the distinction between tenant and orgID is not clear to me in this context. What is the difference between a tenant and an org?
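
For reference, a minimal sketch of the dequeue logging with tenant IDs described above (simplified; the logger and tenant import paths, the field set, and the helper itself are assumptions rather than the PR's exact code):

```go
package scheduler

import (
	"strings"
	"time"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
	"github.com/grafana/dskit/tenant"
)

// logDequeuedRequest logs a dequeued request together with the tenant IDs
// resolved from its orgID (userID). A single orgID may encode several tenants,
// hence TenantIDsFromOrgID rather than treating it as one tenant.
func logDequeuedRequest(logger log.Logger, userID string, queryID uint64, queueTime time.Duration) {
	tenantIDs, err := tenant.TenantIDsFromOrgID(userID)
	if err != nil {
		level.Error(logger).Log("msg", "failed to parse tenant IDs from orgID", "err", err)
		return
	}
	level.Info(logger).Log(
		"msg", "querier request dequeued",
		"tenant_ids", strings.Join(tenantIDs, ","),
		"query_id", queryID,
		"enqueue_time", queueTime,
	)
}
```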

@ssncferreira ssncferreira force-pushed the issue/4774-log_request_enqueue_time branch from a5e55af to 2421761 on December 20, 2021 13:28
@ssncferreira ssncferreira self-assigned this Dec 20, 2021
@ssncferreira ssncferreira marked this pull request as ready for review December 20, 2021 18:35
@ssncferreira ssncferreira requested a review from a team as a code owner December 20, 2021 18:35
@ssncferreira (Contributor Author) commented Dec 21, 2021

When a query is requested, it is split into smaller queries based on shard values. All these smaller queries are then handled by the query-scheduler and distributed to the querier workers. This means that this PR will increase the number of log lines emitted for each requested query.

As a result, and taking into account the constant query requests from loki-canary, this can significantly increase the generated logs.
@grafana/loki-team Could this be a potential problem? 🤔

@slim-bean (Collaborator)

This will definitely introduce a lot of logging, multiple entries per query.

I have a thought: I wonder if we should include a configurable threshold, i.e. only log when a request exceeds a certain time in the queue.

And if we did that, it would be nice if it were part of the runtime config so we could change it on the fly if we wanted: https://github.com/grafana/loki/blob/main/pkg/runtime/config.go

This would also allow it to be configured per tenant at runtime.

The parameter could take a value of -1 for disabled, 0 for log all, and a duration for 'log slower than' (see the sketch below).

What do we think?
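
A hypothetical sketch of that threshold semantics (this approach was ultimately not adopted in the PR; all names here are illustrative):

```go
package scheduler

import "time"

// shouldLogQueueTime interprets the proposed per-tenant setting:
//   threshold < 0  -> never log
//   threshold == 0 -> always log
//   threshold > 0  -> log only when the request waited longer than the threshold
func shouldLogQueueTime(threshold, queueTime time.Duration) bool {
	switch {
	case threshold < 0:
		return false
	case threshold == 0:
		return true
	default:
		return queueTime > threshold
	}
}
```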

@slim-bean (Collaborator)

@ssncferreira could you include an example of what the log line looks like currently?

@slim-bean (Collaborator)

I'm also wondering now if another approach would be to log this info in our metrics.go line, which could then show an aggregated result in the query frontend's metrics.go.

@ssncferreira (Contributor Author)

> @ssncferreira could you include an example of what the log line looks like currently?

As of commit 2421761, for the test query:

{container="query-frontend",namespace="loki-bigtable"} |= "scf50"

the log lines from this PR are shown as:

level=info ts=2021-12-20T19:01:49.130700149Z caller=scheduler.go:477 msg="querier request dequeued" tenant_ids=3927 querier_id=querier-6d44f945fc-2gfqp query_id=17445911836402265139 request="/loki/api/v1/query_range?direction=BACKWARD&end=1640026908919000000&limit=1000&query={container=\"query-frontend\",+namespace=\"loki-bigtable\"}+|=+\"scf50\"&shards=0_of_16&start=1640026608919000000&step=0.200000" enqueue_time(ms)=0.045099

This simple test query generated 3 sub-queries:

  1. sum by count over time: request="/loki/api/v1/query_range?direction=BACKWARD&end=1640026908000000000&limit=1261&query=sum+by(level)(count_over_time({container=\"query-frontend\",+namespace=\"loki-bigtable\"}+|=+\"scf50\"[1s]))&shards=0_of_16&start=1640026800000000000&step=1.000000"
  2. sum by count over time with different start param: request="/loki/api/v1/query_range?direction=BACKWARD&end=1640026799000000000&limit=1261&query=sum+by(level)(count_over_time({container=\"query-frontend\",+namespace=\"loki-bigtable\"}+|=+\"scf50\"[1s]))&shards=0_of_16&start=1640026608000000000&step=1.000000"
  3. the query itself: request="/loki/api/v1/query_range?direction=BACKWARD&end=1640026908919000000&limit=1000&query={container=\"query-frontend\",+namespace=\"loki-bigtable\"}+|=+\"scf50\"&shards=0_of_16&start=1640026608919000000&step=0.200000" enqueue_time(ms)=0.045099

Each of these sub-queries is further split into smaller queries with 16 shard parameters, so this simple test query generated 48 log lines. These tests were done on the dev cluster and can be seen here.

@ssncferreira (Contributor Author) commented Dec 22, 2021

> I'm also wondering now if another approach would be to log this info in our metrics.go line, which could then show an aggregated result in the query frontend's metrics.go.

@slim-bean yes, I'm starting to think this is the best approach, and it makes the logs simpler to analyze. I think I prefer it over the configurable threshold.
However, since a query is split into multiple sub-queries, how would the enqueue time for a single query be calculated? As the sum of the enqueue time of all its sub-queries? Or as the average? 🤔

Nevertheless, I still need to investigate how to achieve this, i.e. how to pass the enqueue time that is already calculated in the scheduler down to the LogQL stats metrics 🕵️‍♀️

@trevorwhitney (Collaborator)

@ssncferreira I know that we currently send metrics in headers, which can be printed by logcli. Could we take a similar approach, where each sub-query includes this enqueue time as metadata in the response it sends back to the query frontend, and the query frontend publishes the metric for each sub-query once the query has finished?

@owen-d (Member) left a comment

one nit, but LGTM

pkg/scheduler/scheduler.go (outdated review thread, resolved)
@pull-request-size pull-request-size bot added size/L and removed size/S labels Dec 30, 2021
@ssncferreira ssncferreira force-pushed the issue/4774-log_request_enqueue_time branch from 731e6fb to 40cf768 on December 30, 2021 16:09
@ssncferreira ssncferreira force-pushed the issue/4774-log_request_enqueue_time branch 6 times, most recently from f0f0a84 to 3b11ac8 on December 30, 2021 16:51
@ssncferreira (Contributor Author) commented Dec 30, 2021

Commit 3b11ac8 presents a different solution that introduces the enqueueTime as part of the metrics.go log. This is accomplished by adding an additional X-Query-Enqueue-Time header to the query HTTP-over-gRPC request in the scheduler. This header is then used to populate the new EnqueueTime field of the result statistics.

The new metrics.go log line now looks like this:

level=info ts=2021-12-30T17:18:18.162215879Z caller=metrics.go:115 org_id=fake latency=fast query="{filename=\"/var/log/app.log\"}" query_type=limited range_type=range length=1h0m1s step=5s duration=3.919166ms status=200 limit=1000 returned_lines=48 throughput=891kB total_bytes=3.5kB enqueue_time=131.291µs

Additionally, the summary stats field in the JSON response now includes an enqueueTime field:

"stats": {
    "summary": {
        "bytesProcessedPerSecond":891005,
        "linesProcessedPerSecond":12247,
        "totalBytesProcessed":3492,
        "totalLinesProcessed":48,
        "execTime":0.003919166,
        "enqueueTime":0.000131291
    }
    ...
}

Example in dev loki-bigtable:

  • Query example:
{container="query-frontend", namespace="loki-bigtable"} |= "scf105"

@ssncferreira ssncferreira requested a review from a team December 30, 2021 19:28
@owen-d (Member) left a comment

This is looking great. A few suggestions:

  1. Can we rename enqueue_time to queue_time? The former would be the moment it enters the queue, whereas the latter more correctly refers to how long the query was queued.
  2. Can we store the queue time in nanoseconds (int64(<time.Duration>))? This will give us more granular data, especially for queries which are queued for less than a second (which is expected); see the small example below.
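
A tiny self-contained illustration of the two representations (not code from the PR; the value mirrors the 131.291µs wait from the example log line):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// A queue wait well under a second, the expected common case.
	queueTime := 131291 * time.Nanosecond // ≈ 131.291µs

	// time.Duration is already an int64 nanosecond count, so storing
	// int64(queueTime) keeps full granularity for sub-second waits.
	fmt.Println(int64(queueTime)) // 131291

	// The seconds-as-float representation used elsewhere in the summary stats.
	fmt.Println(queueTime.Seconds()) // 0.000131291
}
```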

pkg/logqlmodel/stats/context.go (outdated review thread, resolved)
```diff
@@ -168,7 +171,8 @@ func (i *Ingester) Merge(m Ingester) {
 func (r *Result) Merge(m Result) {
 	r.Querier.Merge(m.Querier)
 	r.Ingester.Merge(m.Ingester)
-	r.ComputeSummary(time.Duration(int64((r.Summary.ExecTime + m.Summary.ExecTime) * float64(time.Second))))
+	r.ComputeSummary(time.Duration(int64((r.Summary.ExecTime+m.Summary.ExecTime)*float64(time.Second))),
```
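
For context, the expression in the changed line converts a float64 number of seconds back into a time.Duration (an int64 nanosecond count). A small standalone illustration of that conversion, not the PR's code:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// ExecTime is stored in the summary as float64 seconds, so merging two
	// results adds the seconds and converts the total back to a time.Duration
	// before passing it to ComputeSummary.
	execA, execB := 0.5, 0.25 // seconds

	merged := time.Duration(int64((execA + execB) * float64(time.Second)))
	fmt.Println(merged) // 750ms
}
```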
Member

Not required for this PR, but I wonder why we store execTime as seconds instead of nanoseconds 🤔

Contributor Author

Yes, I agree that this could also be stored as nanoseconds. This way it would also stay consistent with the queueTime. I can address this in a separate PR 👍

Contributor Author

Here is the associated PR: #5034

pkg/util/httpreq/tags.go (outdated review thread, resolved)
pkg/scheduler/scheduler.go (outdated review thread, resolved)
@ssncferreira ssncferreira force-pushed the issue/4774-log_request_enqueue_time branch from f63cc6e to eacd8d1 on January 4, 2022 13:33
@owen-d (Member) left a comment

LGTM, nice work.

@owen-d owen-d merged commit 82f8f3c into main Jan 4, 2022
@owen-d owen-d deleted the issue/4774-log_request_enqueue_time branch January 4, 2022 21:52
@slim-bean (Collaborator)

@ssncferreira awesome work, this looks great!
