
Add "integral" function to InfluxQL #8194

Merged
jsternberg merged 2 commits into master from js-integral-function on Mar 30, 2017

Conversation

@jsternberg (Contributor) commented Mar 23, 2017

The integral function is an aggregator that returns the area under the curve. It also accepts an optional second argument, a duration that determines the unit of the returned values. The default duration is 1s (similar to derivative()).

The area under the curve can also be grouped into buckets, but integral acts slightly differently than other aggregates. First, integral does not support FILL() and will ignore any FILL() clause on the query. Second, integral automatically interpolates the area under the curve using a point in the next interval, if one exists. So if you group every 20 seconds and record metrics every 10 seconds, the point at 20 seconds is used to find the area under the curve between 10s and 20s. If you record a point every 15 seconds and group every 20 seconds, the point at 30 seconds is used to interpolate the area under the curve between 15 and 20 seconds, and the point at 15 seconds is used to interpolate the area under the curve between 20 and 30 seconds.
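The bucketed trapezoidal integration with boundary interpolation described above can be sketched in a few lines of Python. This is an illustration of the math only, not InfluxDB's actual Go implementation; the function name, signature, and the `unit` handling are invented for the sketch:

```python
# Hypothetical sketch of bucketed trapezoidal integration with linear
# interpolation at bucket boundaries (not InfluxDB's real code).

def integral_by_bucket(points, bucket, unit=1.0):
    """points: sorted (time_sec, value) pairs; returns {bucket_start: area}."""
    areas = {}
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        # Split each segment at every bucket boundary it crosses, so each
        # slice of area is credited to the bucket it falls in.
        start, vstart = t0, v0
        while start < t1:
            boundary = (start // bucket + 1) * bucket
            end = min(boundary, t1)
            # Linearly interpolate the value at the split point.
            vend = v0 + (v1 - v0) * (end - t0) / (t1 - t0)
            area = (vstart + vend) / 2 * (end - start)
            key = int(start // bucket) * bucket
            areas[key] = areas.get(key, 0.0) + area / unit
            start, vstart = end, vend
    return areas
```

The division by `unit` mirrors my reading of the optional second argument (areas reported per `unit` seconds); with the default `unit=1.0` the result is value·seconds.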

Unlike derivative(), you cannot use a function inside of integral(). If you wish to perform a query like that, subqueries are the easiest way.

If there are multiple points at the same time, this is treated as a vertical line. Vertical lines do not add anything to the area under the curve, but they do change the line: the next point is calculated from the last point at that timestamp, rather than the duplicates being ignored entirely. This differs from the traditional behavior of simply ignoring duplicate points in a stream.

Example queries:

SELECT integral(value) FROM cpu
SELECT integral(value, 1m) FROM cpu
SELECT integral(value) FROM cpu GROUP BY time(20s)
SELECT integral(value, 1m) FROM cpu GROUP BY time(20s)
SELECT integral(mean) FROM (SELECT mean(value) FROM cpu GROUP BY time(10s)) GROUP BY time(1m)

@jsternberg (Contributor, Author):

There's a small lingering issue I'm encountering with the interpolation. Since integral acts so differently from other aggregates, there's a question I have: what happens when it is used with fill(), and should that even be possible?

My original idea was just to read in points as a stream and perform interpolation that way. But what happens when a fill specification is included? Imagine you have data equally spaced every 10 seconds and you call integral grouped every 20 seconds. The interpolation feature lets this learn where the next point is to complete the line going into the next interval. But if the next point skips 1 minute into the future, should it interpolate between those two points, or should it cut off the area calculation at the last time before that point?

A specific example:

cpu,value=1 0s
cpu,value=2 10s
cpu,value=3 20s
cpu,value=4 30s
cpu,value=5 50s

> SELECT integral(value) FROM cpu GROUP BY time(20s)

For the first bucket, 0s to 20s, I think this is pretty simple. You would find the area between 0s and 10s and then the area between 10s and 20s. But the point at 20s isn't technically in the first bucket; it's the first point of the next bucket. If that first point in the next bucket were at 21s and the fill is null, none, or some number, what should happen here? The same issue comes up later in the series because 40s is missing and it's the beginning of an interval.
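For concreteness, here is how the per-bucket areas work out for the sample series above under pure linear interpolation (a quick plain-Python check of the arithmetic, not the engine code):

```python
# Per-bucket trapezoid areas for the sample series, assuming pure linear
# interpolation between points (illustrative only).
def trapezoid(t0, v0, t1, v1):
    return (v0 + v1) / 2 * (t1 - t0)

# Bucket 0s-20s: the 20s point closes the line even though it belongs
# to the next bucket.
bucket1 = trapezoid(0, 1, 10, 2) + trapezoid(10, 2, 20, 3)

# Bucket 20s-40s: the 30s-40s slice uses a value interpolated from the
# 50s point, since 40s itself is missing.
v40 = 4 + (5 - 4) * (40 - 30) / (50 - 30)
bucket2 = trapezoid(20, 3, 30, 4) + trapezoid(30, 4, 40, v40)

# Bucket 40s-60s: only the 40s-50s slice exists.
bucket3 = trapezoid(40, v40, 50, 5)

print(bucket1, bucket2, bucket3)  # 40.0 77.5 47.5
```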

@pauldix any thoughts on this? Personally, I don't think integral is complete without some form of interpolation handling the area between different intervals, so I would like to hash out how this should work.

@jsternberg (Contributor, Author):

Note: my current favored plan is just to say fill() doesn't work with integral and tell people to use subqueries with an aggregate if they really need that functionality.

@pauldix (Member) commented Mar 24, 2017

+1 for making fill not work with integral. We should validate this at query parse time and return an error. This is another one of those cases for a stream/functional oriented query language ;)

@jsternberg (Contributor, Author):

The potential issue with returning an error, though, is: should this type of query be allowed?

SELECT mean(value), integral(value) FROM cpu GROUP BY time(20s) FILL(0)

Since we allow multiple aggregates in one query, that FILL() would refer to MEAN().

@pauldix (Member) commented Mar 24, 2017

Hmmm yeah. Maybe with multiple aggregates it would just apply to the ones that work while leaving integral alone.

@jsternberg force-pushed the js-integral-function branch 2 times, most recently from e4cf04e to 0cf6b72, on March 24, 2017 at 15:16
@jsternberg (Contributor, Author):

We don't seem to throw any kind of error when FILL() is used in a situation where it doesn't do anything so I think we should just document it and plan in the future to improve query parsing. Integral is already going to be a very weird function.

@pauldix (Member) commented Mar 24, 2017 via email

@jsternberg (Contributor, Author):

No, I mean just ignore the FILL() function and let it be used. We don't seem to have any verification to see if the FILL() function is used properly anyway. We likely need to start thinking of a plan for a v2 query parser that prevents these PHP-style things, but the current query parser's philosophy is mostly to ignore things that don't make sense silently.

So this would be valid, but also useless:

SELECT integral(value) FROM cpu GROUP BY time(1m) FILL(0)

@pauldix (Member) commented Mar 24, 2017 via email

@Sineos commented Mar 25, 2017

While playing with Grafana to graph impulse meters (e.g. an S0 meter) or consumption values, I noticed that Grafana plots the graph in a way that is not suitable for such a use case:

[image: comparison of a stepped ("test left") graph and a linear graph]

  • At time 2 the impulses give a power value of 10
  • We assume between time 1 and time 2 we had: Work = Delta(t) x 10
  • The "test left" graph is wrong here, since for the same time it would give a Work of 0
  • As of today, Grafana plots "test left" for stepped graphs

This is exactly the use case for which I'm eagerly awaiting the integral implementation. I was wondering whether this is only a graphical problem or whether it would affect this upcoming feature as well.

@jsternberg (Contributor, Author):

@Sineos I'm not sure I understand your point, but I'm going to give it a guess. Is your point that the integral emits the wrong timestamp and that affects the final graph? I think we're currently emitting the later timestamp rather than the earlier timestamp for the area so I would imagine you run into the same problem. Am I understanding what you're saying correctly?

@Sineos commented Mar 28, 2017

@jsternberg
Yes, this is my concern currently.
Let me give you an example: let's measure power consumption in watts.

At t1 = 00:00:00 --> 100 Watts
At t2 = 01:00:00 --> 500 Watts
Delta(Power) = 500 - 100 = 400
Delta(Time) = t2 - t1 = 1h

Energy = 400W x 1h = 400 Wh

So, a typical use case for an integral.

Now the example as a Graph:
[image: graph of the example; blue = linear interpolation, black = stepped]

The blue graph would show the correct result, whereas the black graph would give an Energy value of 100 Wh (the respective areas under the graph).

This would be true for all consumption-based calculations, and also for rate-based calculations if the derivative function follows the same logic.

I guess the idea for a consumption calculation is that I can only look in the past. So if we measure 500 Watts at t2, it is safe to assume that this happened in the time between t1 and t2. So if we choose Delta(t) small enough, the calculation will be pretty accurate.

The same logic applies the other way round: If we measure network traffic and our counter shows 100MB at t1 and 400MB at t2, then 300MB of traffic have been generated between t1 and t2. Given the time and the traffic we can calculate the network bandwidth that has caused the traffic. The result in words would be: From t1 onward we had a rate of X MB/s that would eventually lead to an increase of 300MB at t2.
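The difference between the two graphs in the power example can be reproduced numerically. This is an illustrative comparison in plain Python (the variable names and step convention are mine, not Grafana's or InfluxDB's):

```python
# Two ways to integrate the example series: 100 W at t=0 h, 500 W at t=1 h.
t = [0.0, 1.0]      # hours
w = [100.0, 500.0]  # watts

# Left-step ("test left"): hold each value until the next sample arrives.
left_step = sum(w[i] * (t[i + 1] - t[i]) for i in range(len(t) - 1))

# Linear interpolation (trapezoid): average adjacent values over the interval.
trapezoid = sum((w[i] + w[i + 1]) / 2 * (t[i + 1] - t[i])
                for i in range(len(t) - 1))

print(left_step)  # 100.0 Wh -> the stepped (black) graph
print(trapezoid)  # 300.0 Wh -> the linear (blue) graph
```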

@rbetts added the review label and removed the in progress label on Mar 30, 2017
@dgnorton (Contributor):

I tried creating an ad hoc test for @Sineos's example above and came across what appears to be an inconsistency in the timestamps in the output. E.g., given the following data:

> select * from pwr
name: pwr
time                 watts
----                 -----
1970-01-01T00:00:00Z 100
1970-01-01T01:00:00Z 500
1970-01-01T02:00:00Z 100

It returns the following:

> SELECT integral(watts,1h) FROM pwr WHERE time >= 0 AND time <= 7200000000000 GROUP BY time(1h)
name: pwr
time                 integral
----                 --------
1970-01-01T00:00:00Z 300
1970-01-01T02:00:00Z 300

Note that the timestamp for the first bucket is at the beginning of the first bucket and the timestamp for the second bucket is at the end of the third bucket.

@Tomcat-Engineering (Contributor):

@Sineos in your energy example the algorithm will give an integral of 300Wh (as per @dgnorton's comment).

You can think of this as the average power multiplied by the time period, or as linear interpolation or as the trapezium rule - they are all the same thing and give the same answer!
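As a quick arithmetic check that these views coincide, using the numbers from the example above (100 W rising to 500 W over one hour):

```python
# Average-power-times-duration and the trapezium rule are the same formula,
# just rearranged; values are from the example above.
p1, p2, hours = 100.0, 500.0, 1.0

mean_power_times_time = ((p1 + p2) / 2) * hours  # average power x duration
trapezium_rule = (p1 + p2) * hours / 2           # trapezium rule

print(mean_power_times_time, trapezium_rule)  # 300.0 300.0
```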

@jsternberg (Contributor, Author):

I fixed the bug that caused the wrong output @dgnorton found. I also added an additional condition: if the last point is at the very start of a new interval (so there is no area yet), no point will be emitted for the last interval, even though other aggregates would emit one. This is solely due to the unique nature of integral.

The time output for a bucket is the start of the interval, to match the behavior of other aggregates. So the area between 0:00 and 1:00 will have a time of 0:00 when the query is ascending. It'll be the opposite when descending.

@dgnorton (Contributor):

@pauldix any thoughts on whether timestamps in the output should be from the beginning or end of each bucket?

@pauldix (Member) commented Mar 30, 2017

@dgnorton I think it makes sense to match the behavior of the other ones like @jsternberg did

@Sineos commented Mar 30, 2017

@Tomcat-Engineering
Thanks for the clarification. Appreciated.

@jsternberg merged commit a221e32 into master on Mar 30, 2017
@jsternberg deleted the js-integral-function branch on March 30, 2017 at 23:24
@jsternberg removed the review label on Mar 30, 2017
@inselbuch:
you gots a typo fella

SELECT integral(value, 1m) FROM cpu GROU PBY time(20s)

@jsternberg (Contributor, Author) commented May 1, 2017

Fixed the typo for anyone who encounters this from a search engine. Unfortunately, the commit message will be there for all time :(
