This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

slow query log #982

Closed
woodsaj opened this issue Aug 8, 2018 · 6 comments

Comments

@woodsaj
Member

woodsaj commented Aug 8, 2018

We often have customers overwhelming their instances with huge queries.
When this happens it is often difficult to track down what the queries are.

To make this easier, we would define a "slow query" limit and keep a log of all queries that exceed it.
We should then add an API endpoint to get the list of the slow queries.
Each log record should keep (see the record sketch after this list):

  • the targets (functions and series expressions)
  • time range
  • size details (number of series, number of points fetched, number of points returned)
  • timing information
  • client headers (UserAgent and Referrer, which will help identify the source)
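A minimal sketch of what such a record could look like, in Go (field and type names here are illustrative, not a committed schema):

```go
package slowlog

import "time"

// SlowQueryRecord captures one render request that exceeded the
// configured slow-query limit. Field names are illustrative only.
type SlowQueryRecord struct {
	// targets: the functions and series expressions of the request
	Targets []string `json:"targets"`

	// requested time range
	From time.Time `json:"from"`
	To   time.Time `json:"to"`

	// size details
	NumSeries      int `json:"numSeries"`
	PointsFetched  int `json:"pointsFetched"`
	PointsReturned int `json:"pointsReturned"`

	// timing information
	Duration time.Duration `json:"duration"`

	// client headers that help identify the source
	// (note: the HTTP header is spelled "Referer")
	UserAgent string `json:"userAgent"`
	Referer   string `json:"referer"`
}
```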
@Dieterbe
Contributor

Dieterbe commented Aug 9, 2018

+1 for the use case.
But I'm not so sure we should build this implementation into MT itself, as that is feature creep and wouldn't give us overviews aggregated across all MTs in a cluster.

Should an overview of slow queries be made available to our customers? I think that would be good, so that eventually they can self-serve.

I'd rather rely on our jaeger integration, since it already solves much of this, and aligning with that project long term has other benefits (becoming more familiar with jaeger tracing is strategically useful, jaeger will improve and gain new ways to query/analyze the data, etc.).

I see 3 options:

A) Querying the jaeger UI seems to solve the most urgent needs, though we don't want to expose it to our customers, and there are some bugs and limitations, some of which are non-trivial: jaegertracing/jaeger#166, jaegertracing/jaeger#690, jaegertracing/jaeger#892

B) By routing all traces through a bus (e.g. their new kafka support, see jaegertracing/jaeger#929), we'll be able to write our own consumers, allowing us to do our own processing, serve up a slow-query overview aggregated across all instances, etc. (a consumer sketch follows this list)

C) Custom queries directly on the jaeger cassandra database
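For B, a rough sketch of what such a consumer could look like. This assumes spans arrive on a `jaeger-spans` topic and, purely for illustration, that they are JSON-encoded with `operationName` and `duration` fields; the real jaeger kafka output uses jaeger's own span encoding, so this only shows the shape of the approach, not the actual wire format:

```go
package main

import (
	"encoding/json"
	"log"
	"time"

	"github.com/Shopify/sarama"
)

// span is a stand-in for the real jaeger span model; a real consumer
// would decode jaeger's own encoding instead of JSON.
type span struct {
	OperationName string        `json:"operationName"`
	Duration      time.Duration `json:"duration"` // assumed nanoseconds for this sketch
}

const slowThreshold = 10 * time.Second // hypothetical slow-query limit

func main() {
	consumer, err := sarama.NewConsumer([]string{"kafka:9092"}, sarama.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	// "jaeger-spans" is an assumed topic name.
	pc, err := consumer.ConsumePartition("jaeger-spans", 0, sarama.OffsetNewest)
	if err != nil {
		log.Fatal(err)
	}
	defer pc.Close()

	for msg := range pc.Messages() {
		var s span
		if err := json.Unmarshal(msg.Value, &s); err != nil {
			continue
		}
		// keep only render spans that exceed the threshold; a real consumer
		// would aggregate these across all MT instances and serve up an
		// overview, e.g. a slow-query API.
		if s.OperationName == "/render" && s.Duration > slowThreshold {
			log.Printf("slow query: %s took %s", s.OperationName, s.Duration)
		}
	}
}
```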

@Dieterbe
Contributor

Looks like B would be easy to implement, per the latest comments in the linked ticket.
@woodsaj in which ways does solution A fall short?

@woodsaj
Member Author

woodsaj commented Aug 10, 2018

There are two objectives here.

  1. Allow admins to have insight into poorly performing queries, which we can already do with jaeger.
  2. Allow users to get a list of poorly performing queries and some insight into why they are performing poorly.

I am OK with jaeger, but exposing that data to users seems like a much larger project. We need a more immediate solution.
Just having a log of the last 50 slow queries available via an API seems much easier.
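For what it's worth, here is a rough sketch of how small that could be, assuming an in-memory buffer capped at the last N records and reusing the SlowQueryRecord type sketched earlier in this thread (the endpoint path and all names are made up):

```go
package slowlog

import (
	"encoding/json"
	"net/http"
	"sync"
)

// Log keeps the most recent max slow-query records in memory.
type Log struct {
	mu      sync.Mutex
	records []SlowQueryRecord // see the record sketch earlier in this issue
	max     int
}

func New(max int) *Log {
	return &Log{max: max} // e.g. New(50) for the last 50 slow queries
}

// Add appends a record, dropping the oldest once the cap is reached.
func (l *Log) Add(r SlowQueryRecord) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.records = append(l.records, r)
	if len(l.records) > l.max {
		l.records = l.records[1:]
	}
}

// ServeHTTP exposes the log as JSON; it could be mounted at a path like
// /debug/slow-queries (the path is hypothetical).
func (l *Log) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	l.mu.Lock()
	defer l.mu.Unlock()
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(l.records)
}
```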

@Dieterbe
Contributor

> Just having a log of the last 50 slow queries available via an API seems much easier.

But then it's per MT instance, not across the cluster.
I think B is a better solution and should also be doable on a pretty short timeline, with room to develop more later instead of creeping more and more features into MT itself.

@Dieterbe
Contributor

The cortex guys told us we need to sample jaeger traces more aggressively, which means that until jaegertracing/jaeger#425 is implemented, we may randomly discard precisely those spans that correspond to slow queries.
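For context: jaeger's client-side sampling is head-based, so the keep/drop decision is made when the trace starts, before we know whether the query will turn out to be slow. A minimal client config sketch to show the knob in question (the 0.1% rate is just an example, not our actual setting):

```go
package main

import (
	"log"

	jaegercfg "github.com/uber/jaeger-client-go/config"
)

func main() {
	// With a head-based probabilistic sampler, lowering Param to reduce
	// trace volume also drops most slow-query traces, since slow and fast
	// queries are sampled at the same rate.
	cfg := jaegercfg.Configuration{
		ServiceName: "metrictank",
		Sampler: &jaegercfg.SamplerConfig{
			Type:  "probabilistic",
			Param: 0.001, // example rate: keep ~0.1% of traces
		},
	}
	tracer, closer, err := cfg.NewTracer()
	if err != nil {
		log.Fatal(err)
	}
	defer closer.Close()
	_ = tracer
}
```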

@stale

stale bot commented Apr 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 4, 2020
@stale stale bot closed this as completed Apr 11, 2020