This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

slow query log #982

Closed
woodsaj opened this issue Aug 8, 2018 · 6 comments

Comments

@woodsaj
Member

woodsaj commented Aug 8, 2018

We often have customers overwhelming their instances with huge queries.
When this happens it is often difficult to track down what the queries are.

To make this easier, we would define a "slow query" limit and keep a log of all queries that exceed it.
We should then add an API endpoint to get the list of the slow queries.
Each log record should keep (see the record sketch after this list):

  • the targets (functions and series expressions)
  • time range
  • size details (number of series, number of points fetched, number of points returned)
  • timing information
  • client headers (UserAgent and Referrer, which will help identify the source)
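A minimal sketch of what such a record could look like, in Go (field and type names here are illustrative, not a committed schema):

```go
package slowlog

import "time"

// SlowQueryRecord captures one render request that exceeded the
// configured slow-query limit. Field names are illustrative only.
type SlowQueryRecord struct {
	// targets: the functions and series expressions of the request
	Targets []string `json:"targets"`

	// requested time range
	From time.Time `json:"from"`
	To   time.Time `json:"to"`

	// size details
	NumSeries      int `json:"numSeries"`
	PointsFetched  int `json:"pointsFetched"`
	PointsReturned int `json:"pointsReturned"`

	// timing information
	Duration time.Duration `json:"duration"`

	// client headers that help identify the source
	// (note: the HTTP header is spelled "Referer")
	UserAgent string `json:"userAgent"`
	Referer   string `json:"referer"`
}
```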
@Dieterbe
Contributor

Dieterbe commented Aug 9, 2018

+1 for the use case.
But I'm not so sure we should build this implementation into MT itself, as that is feature creep and wouldn't give us overviews aggregated across all MTs in a cluster.

Should an overview of slow queries be made available to our customers? I think that would be good, so that eventually they can self-serve.

I'd rather rely on our jaeger integration, since it already solves much of this, and aligning with that project long term has other benefits (becoming more familiar with jaeger tracing is strategically useful, jaeger will improve and gain new ways to query/analyze the data, etc.).

I see 3 options:

A) Querying the jaeger UI seems to solve the most urgent needs, though we don't want to expose it to our customers, and there are some bugs and limitations, some of which are non-trivial: jaegertracing/jaeger#166, jaegertracing/jaeger#690, jaegertracing/jaeger#892

B) By routing all traces through a bus (e.g. their new kafka support, see jaegertracing/jaeger#929), we'll be able to write our own consumers, allowing us to do our own processing, serve up a slow-query overview aggregated across all instances, etc. (a consumer sketch follows this list)

C) Custom queries directly on the jaeger cassandra database
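For B, a rough sketch of what such a consumer could look like. This assumes spans arrive on a `jaeger-spans` topic and, purely for illustration, that they are JSON-encoded with `operationName` and `duration` fields; the real jaeger kafka output uses jaeger's own span encoding, so this only shows the shape of the approach, not the actual wire format:

```go
package main

import (
	"encoding/json"
	"log"
	"time"

	"github.com/Shopify/sarama"
)

// span is a stand-in for the real jaeger span model; a real consumer
// would decode jaeger's own encoding instead of JSON.
type span struct {
	OperationName string        `json:"operationName"`
	Duration      time.Duration `json:"duration"` // assumed nanoseconds for this sketch
}

const slowThreshold = 10 * time.Second // hypothetical slow-query limit

func main() {
	consumer, err := sarama.NewConsumer([]string{"kafka:9092"}, sarama.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	// "jaeger-spans" is an assumed topic name.
	pc, err := consumer.ConsumePartition("jaeger-spans", 0, sarama.OffsetNewest)
	if err != nil {
		log.Fatal(err)
	}
	defer pc.Close()

	for msg := range pc.Messages() {
		var s span
		if err := json.Unmarshal(msg.Value, &s); err != nil {
			continue
		}
		// keep only render spans that exceed the threshold; a real consumer
		// would aggregate these across all MT instances and serve up an
		// overview, e.g. a slow-query API.
		if s.OperationName == "/render" && s.Duration > slowThreshold {
			log.Printf("slow query: %s took %s", s.OperationName, s.Duration)
		}
	}
}
```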

@Dieterbe
Contributor

Looks like B would be easy to implement, per the latest comments in the linked ticket.
@woodsaj in which ways does solution A fall short?

@woodsaj
Member Author

woodsaj commented Aug 10, 2018

There are two objectives here.

  1. Allow admins to have insight into poorly performing queries, which we can already do with jaeger.
  2. Allow users to get a list of poorly performing queries and some insight into why they are performing poorly.

I am OK with jaeger, but exposing that data to users seems like a much larger project. We need a more immediate solution.
Just having a log of the last 50 slow queries available via an API seems much easier.
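For what it's worth, here is a rough sketch of how small that could be, assuming an in-memory buffer capped at the last N records and reusing the SlowQueryRecord type sketched earlier in this thread (the endpoint path and all names are made up):

```go
package slowlog

import (
	"encoding/json"
	"net/http"
	"sync"
)

// Log keeps the most recent max slow-query records in memory.
type Log struct {
	mu      sync.Mutex
	records []SlowQueryRecord // see the record sketch earlier in this issue
	max     int
}

func New(max int) *Log {
	return &Log{max: max} // e.g. New(50) for the last 50 slow queries
}

// Add appends a record, dropping the oldest once the cap is reached.
func (l *Log) Add(r SlowQueryRecord) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.records = append(l.records, r)
	if len(l.records) > l.max {
		l.records = l.records[1:]
	}
}

// ServeHTTP exposes the log as JSON; it could be mounted at a path like
// /debug/slow-queries (the path is hypothetical).
func (l *Log) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	l.mu.Lock()
	defer l.mu.Unlock()
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(l.records)
}
```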

@Dieterbe
Contributor

> Just having a log of the last 50 slow queries available via an API seems much easier.

But then it's per MT instance, not across the cluster.
I think B is a better solution and should also be doable on a pretty short timeline, with room to develop more later instead of creeping more and more features into MT itself.

@Dieterbe
Contributor

The cortex guys told us we need to sample jaeger traces more aggressively, which means that until jaegertracing/jaeger#425 is implemented, we may randomly discard precisely those spans that correspond to slow queries.
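For context: jaeger's client-side sampling is head-based, so the keep/drop decision is made when the trace starts, before we know whether the query will turn out to be slow. A minimal client config sketch to show the knob in question (the 0.1% rate is just an example, not our actual setting):

```go
package main

import (
	"log"

	jaegercfg "github.com/uber/jaeger-client-go/config"
)

func main() {
	// With a head-based probabilistic sampler, lowering Param to reduce
	// trace volume also drops most slow-query traces, since slow and fast
	// queries are sampled at the same rate.
	cfg := jaegercfg.Configuration{
		ServiceName: "metrictank",
		Sampler: &jaegercfg.SamplerConfig{
			Type:  "probabilistic",
			Param: 0.001, // example rate: keep ~0.1% of traces
		},
	}
	tracer, closer, err := cfg.NewTracer()
	if err != nil {
		log.Fatal(err)
	}
	defer closer.Close()
	_ = tracer
}
```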

@stale

stale bot commented Apr 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 4, 2020
@stale stale bot closed this as completed Apr 11, 2020