
Support multiple targets that act on different time windows #1739

Merged
5 commits merged into grafana:master from more_time on Aug 5, 2020

Conversation

shanson7
Collaborator

To support functions like timeShift and movingWindow, we need to support functions that can adjust the from and to on a per-context level. The archive chosen for unrelated requests should not be influenced by time ranges that were adjusted by other functions.
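
For illustration only, a rough sketch of the kind of per-context hook this means, modeled on the function Context() hooks discussed below; the FuncTimeShift type and its offset field are hypothetical:

func (s *FuncTimeShift) Context(context Context) Context {
        // shift the whole query window back by the configured offset (in seconds),
        // without affecting the contexts of other targets in the same request
        context.from -= uint32(s.offset)
        context.to -= uint32(s.offset)
        return context
}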

@shanson7 shanson7 requested a review from Dieterbe March 31, 2020 15:54
@shanson7
Collaborator Author

shanson7 commented Apr 9, 2020

Note: There are a lot of possibilities for datapoint reduction in the "soft" limit case. We can (should?) start with the largest PN Groups (largest defined as most datapoints? longest time range?) and then move on to the largest singles. Or maybe the other way around? I think the most important part is being consistent, to avoid issues with refreshing causing the graph to change (I think this is already consistent, not sure).
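
For what it's worth, roughly the kind of deterministic ordering I mean (the group type and its methods here are made up, just to illustrate):

// sort candidate groups biggest-first, with a stable tiebreak, so that
// refreshing the same query always reduces the same series in the same order
sort.Slice(groups, func(i, j int) bool {
        if groups[i].Datapoints() != groups[j].Datapoints() {
                return groups[i].Datapoints() > groups[j].Datapoints()
        }
        return groups[i].Key() < groups[j].Key()
})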

@shanson7
Collaborator Author

shanson7 commented Apr 9, 2020

Another thing to consider here is performance for queries that might act on a large number of series.

@Dieterbe
Contributor

Dieterbe commented Apr 9, 2020

To support functions like timeShift and movingWindow, we need to support functions that can adjust the from and to on a per-context level.

In NewPlan, each target has its own context.
e.g. target=foo&target=sum(movingAverage(bar*,"1min")) will result in distinct contexts that can be influenced by the respective functions: "sum" and "movingAverage" can modify the context, which will result in the bar* Req having a modified from/to, whereas the context (and from/to) for foo are unaffected.
I do already notice an interesting case here, though. Our current draft movingAverage code does this:

func (s *FuncMovingAverage) Context(context Context) Context {
        // extend the requested range backwards by the window
        context.from -= uint32(s.window)
        return context
}

This is insufficient.

window can be specified as a time but also as a number of points.
In Graphite, a number of points is respected literally. So:

| taking movingAverage(..., 10) of ... | ... results in |
| --- | --- |
| 1s raw data | 10-secondly points |
| 1min rollup data | 10-minutely points |
| summarize(bar*,"1min","sum") where bar is 1s data | 10-minutely points |

Our current context type is incapable of tracking a "from offset" expressed as a number of points, nor is PlanRequests or anything else currently equipped to handle this, and the third case makes this even harder to pursue. Perhaps your PR solves some of this, I need to dig more into it, but this looks like a big challenge unless we cut some corners.
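
To spell it out (illustrative pseudo-Go, not real code):

// a time-based window can be applied while planning, before any data is fetched:
//     context.from -= uint32(windowSeconds)
//
// a point-count window cannot, because the offset depends on the interval of
// whatever data ends up being fetched (raw vs rollup, or post-summarize):
//     context.from -= windowPoints * interval   // interval unknown at plan time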

And then there's a function like timeStack() which seems like it requires multiple from/to pairs.

@shanson7
Collaborator Author

shanson7 commented Apr 14, 2020

Our current draft movingAverage code... is insufficient. window can be specified as a time but also as a number of points. In Graphite, a number of points is respected literally

Ooh, yeah, that could prove quite difficult to support. Perhaps we could "partially" support functions based on the parameters given? Maybe some simple method to abort and proxy during the planning stage (e.g. when Context() is called). That way, we can use the native (and fast) code when possible, and otherwise proxy to Graphite.
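
Something along these lines, purely as a sketch; the error-returning Context() signature, the sentinel error, and the windowPoints/windowSeconds fields are all hypothetical:

var errProxyToGraphite = errors.New("unsupported parameters, proxy to graphite")

func (s *FuncMovingAverage) Context(context Context) (Context, error) {
        if s.windowPoints > 0 {
                // window given as a number of points: we can't know how far to extend
                // 'from' until the archive/interval is chosen, so bail out and proxy
                return context, errProxyToGraphite
        }
        // window given as a duration: extend the range backwards as before
        context.from -= uint32(s.windowSeconds)
        return context, nil
}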

@Dieterbe
Contributor

Perhaps we could "partially" support functions based on the parameters given? Maybe some simple method to abort and proxy during the planning stage (e.g. when Context() is called). That way, we can use the native (and fast) code when possible, and otherwise proxy to Graphite.

Yeah that seems like a very good idea. Especially because the most common case would be "simple" queries specifying a time window, not a number of points, which we can handle.

@shanson7
Collaborator Author

Hmm, I'm still a little stuck on reconciling:

  1. Graphite compatibility
  2. Supporting "soft" req behavior
  3. Intuitive behavior

Graphite will always choose the first retention that satisfies the entire time range. It doesn't have any special behavior for when this results in a significant amount of data. This "per-series" behavior is what this PR initially intends to emulate, but doing so in the "soft" req handling results in strange returned results.

For example, if we have a query that fetches 100 series with 10k datapoints each (raw) or 1k datapoints (next rollup) and our soft limit is 500k, this PR (at time of this comment) will return the 100 series with 44 of them at raw interval and 56 at the rollup interval.
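
To spell out the arithmetic behind that split:

44 series * 10,000 points (raw)    = 440,000
56 series *  1,000 points (rollup) =  56,000
                             total = 496,000 <= 500,000 (soft limit)

(45 series at raw would already push it to 505,000.)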

I think intuitively, I would expect them to all be "softened" in step (per target?).

@shanson7
Collaborator Author

I'm leaning toward separating reqs by target and, only in the "soft" limit case, repeatedly "coarsening" the data with the most datapoints to fetch until the soft limit is satisfied or there is nothing else we can do to reduce DPs.

In this case, if the same find expression is used in multiple targets, we may still need to fetch them separately if we need to get different archives.
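
Roughly like this (a sketch only; the helper names are made up):

// coarsen the group with the most datapoints until we're under the soft limit,
// or until nothing can be reduced further
for totalPoints(groups) > softLimit {
        g := heaviestCoarsenableGroup(groups) // nil when nothing can be reduced further
        if g == nil {
                break // accept exceeding the soft limit
        }
        g.UseNextRollup() // move that group to the next coarser archive
}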

Collaborator Author

@shanson7 shanson7 left a comment

Ultimately, I decided to leave the "soft" handling the same as it is today and only do the planning differently. I think this takes us one step closer to Graphite behavior and supports timeShift without complicating this PR too much.

@shanson7 shanson7 changed the title from "WIP - support multiple targets that act on different time windows" to "Support multiple targets that act on different time windows" on Apr 17, 2020
@shanson7
Collaborator Author

Any updates?

@Dieterbe
Contributor

Dieterbe commented Aug 3, 2020

For example, if we have a query that fetches 100 series with 10k datapoints each (raw) or 1k datapoints (next rollup) and our soft limit is 500k, this PR (at time of this comment) will return the 100 series with 44 of them at raw interval and 56 at the rollup interval.
I think intuitively, I would expect them to all be "softened" in step (per target?).

I agree that this per-single-granularity will look strange. I don't understand why we need to introduce this granularity.
If the only reason to do this is because they may have different from/to values, then why don't we just first categorize them in groups based on their from/to, and then reduce each group as a whole?
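
i.e. something like this (a sketch; assuming reqs with From/To fields, and reusing the existing reduction logic per group):

func planByWindow(reqs []Req, softLimit int) {
        // bucket requests by their (possibly function-adjusted) time range,
        // then run the existing soft-limit reduction on each bucket as a whole
        type window struct{ from, to uint32 }
        byWindow := make(map[window][]Req)
        for _, req := range reqs {
                w := window{req.From, req.To}
                byWindow[w] = append(byWindow[w], req)
        }
        for _, group := range byWindow {
                reduceAsGroup(group, softLimit) // hypothetical: the current soft-limit logic, applied per group
        }
}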

@shanson7
Collaborator Author

shanson7 commented Aug 3, 2020

I agree that this per-single-granularity will look strange. I don't understand why we need to introduce this granularity.

As I said above:

Ultimately, I decided to leave the "soft" handling the same as it is today and only do the planning differently.

@Dieterbe
Contributor

Dieterbe commented Aug 3, 2020

The stats computation in "4) send out some metrics and we're done!" needs to be updated.
Also, you might be interested in this commit: 1a1d233

@shanson7
Collaborator Author

shanson7 commented Aug 3, 2020

the stats computation in "4) send out some metrics and we're done!" needs to be updated.

Yes, likely. I'll confess I'm not entirely sure what it's doing right now. It seems like it should be indicating the number of the chosen archive as a meter, but I'm not sure why it would multiply by the number of requests. That would distort the input, right? Like, how would it differentiate between 100 reqs with archive 1 and 50 reqs with archive 2?

@Dieterbe
Contributor

Dieterbe commented Aug 3, 2020

Oops, that's a bug. We should be reporting each archive len(reqs) times (at least when they all have the same archive id), like this: #1868. Perhaps we should merge this to master, and then you can rebase on top of master?
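
i.e. roughly (illustrative only; the meter variable here is a stand-in, not the actual stats code):

// record the chosen archive once per request, so 100 reqs on archive 1 and
// 50 reqs on archive 2 show up as 100 and 50 samples respectively
for _, req := range reqs {
        chosenArchiveMeter.Value(int(req.Archive))
}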

@Dieterbe
Contributor

Dieterbe commented Aug 3, 2020

I stand corrected. We only check progress (and possibly bail out) every time we're done processing all requests in rp.single.mdpno that correspond to the same archiveID. So that makes sense.

The PR as a whole makes sense too.
When we do a PR such as for movingAverage, let's make sure to test a case like target=foo&movingAverage(foo.... to make sure we properly process that and don't mangle up the requests, etc.

@Dieterbe
Contributor

Dieterbe commented Aug 5, 2020

Left to do in this PR:

  1. Sean to review and apply (hopefully) 1a1d233
  2. Now that "fix reporting of archives" (#1868) has been merged, rebase on master and fix the stats. (If you don't have time, we can take care of the code changes.)

@shanson7
Collaborator Author

shanson7 commented Aug 5, 2020

  • Sean to review and apply (hopefully) 1a1d233
  • Now that "fix reporting of archives" (#1868) has been merged, rebase on master and fix the stats. (If you don't have time, we can take care of the code changes.)

@Dieterbe Dieterbe merged commit 3b20cae into grafana:master Aug 5, 2020
@shanson7 shanson7 deleted the more_time branch October 22, 2021 16:14