
Group by time starts grouping before minimum time in where clause #8010

Closed
phemmer opened this issue Feb 15, 2017 · 21 comments

@phemmer
Contributor

phemmer commented Feb 15, 2017

Bug report

System info:
InfluxDB 1.2.0
CentOS/7

Steps to reproduce:

insert test v=0 1487184837000000000
insert test v=0 1487184850000000000
insert test v=38787738 1487184851000000000
insert test v=0 1487184852000000000
insert test v=0 1487184865000000000
insert test v=38796282 1487184866000000000
insert test v=0 1487184867000000000
insert test v=0 1487184880000000000
insert test v=38804816 1487184881000000000

select sum(v) from test where time >= 1487184837s and time <= 1487184837s + 45s group by time(15s)
select non_negative_derivative(sum(v),1s) from test where time >= 1487184837s and time <= 1487184837s + 45s group by time(15s)

Expected behavior:

> select sum(v) from test where time >= 1487184837s and time <= 1487184837s + 45s group by time(15s)
name: test
time                sum
----                ---
1487184837000000000 38787738
1487184852000000000 38796282
1487184867000000000 38804816

> select non_negative_derivative(sum(v),1s) from test where time >= 1487184837s and time <= 1487184837s + 45s group by time(15s)
name: test
time                non_negative_derivative
----                -----------------------
1487184852000000000 569.6
1487184867000000000 568.9333333333333

Actual behavior:

> select sum(v) from test where time >= 1487184837s and time <= 1487184837s + 45s group by time(15s)
name: test
time                sum
----                ---
1487184825000000000 0
1487184840000000000 38787738
1487184855000000000 38796282
1487184870000000000 38804816

> select non_negative_derivative(sum(v),1s) from test where time >= 1487184837s and time <= 1487184837s + 45s group by time(15s)
name: test
time                non_negative_derivative
----                -----------------------
1487184840000000000 2585849.2
1487184855000000000 569.6
1487184870000000000 568.9333333333333

Note that the timestamp of the first group is earlier than the specified where time >= bound. This leaves influx with an incomplete group, and thus incorrect results.
The second query shows how bad the effect can be: the derivative value is obscenely high. This is causing problems in our graphs, because the first point destroys the scale of the graph and all other points become a flat line at the bottom.

@jwilder
Contributor

jwilder commented Feb 21, 2017

I believe this happens because the intervals are always on even time boundaries and do not start with the first point in the results.

1487184837s is 2017-02-15 18:53:57 +0000 UTC and falls in the interval 2017-02-15 18:53:45 +0000 UTC - 2017-02-15 18:54:00 +0000 UTC
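
In other words, the preset boundaries come from flooring the timestamp to the GROUP BY interval. A quick sketch of the arithmetic in Python (whole-second epoch values assumed, purely illustrative):

ts = 1487184837                       # WHERE lower bound, 2017-02-15 18:53:57 UTC
interval = 15                         # GROUP BY time(15s)

bucket_start = ts - (ts % interval)   # 1487184825 -> 2017-02-15 18:53:45 UTC
bucket_end = bucket_start + interval  # 1487184840 -> 2017-02-15 18:54:00 UTC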

cc @jsternberg

@jsternberg
Contributor

@jwilder is correct. I think you can change that functionality by doing GROUP BY time(15s, now()) and it will use the start time as the beginning for time boundaries.

@phemmer
Contributor Author

phemmer commented Feb 21, 2017

Using now() just makes the result random.

> select sum(v) from test where time >= 1487184837s and time <= 1487184837s + 45s group by time(15s,now())
name: test
time                sum
----                ---
1487184835786116207 0
1487184850786116207 38787738
1487184865786116207 38796282
1487184880786116207 38804816

> select sum(v) from test where time >= 1487184837s and time <= 1487184837s + 45s group by time(15s,now())
name: test
time                sum
----                ---
1487184836418254017 38787738
1487184851418254017 38796282
1487184866418254017 38804816
1487184881418254017 

> select sum(v) from test where time >= 1487184837s and time <= 1487184837s + 45s group by time(15s,now())
name: test
time                sum
----                ---
1487184822177972692 0
1487184837177972692 38787738
1487184852177972692 38796282
1487184867177972692 38804816

While it seems possible to use the offset parameter of group by time() to adjust the start time, that would require manual math every time such a query is crafted.

I haven't checked, but I suspect a similar issue exists at the end of the time range as well: if the last group interval doesn't end exactly on the "where time <=" bound, its aggregate will be wrong too. A group by time() offset can only fix one of these ends, not both.

@jsternberg
Contributor

Hm, I messed up. The now() offset actually offsets it to now(), not the start time.

@jwilder we likely need some other keyword to offset to the query start time. Maybe time(15s, start()) or something else?

@Dylan-Eikelenboom

Dylan-Eikelenboom commented Mar 1, 2017

+1 for a start() option as offset.

What I currently do, to make sure a new time bucket starts exactly at the query's start time, is take the number of seconds between epoch 0 and the query's start time, divide it by the group-by interval, and use the remainder as the offset.

In Python:

import datetime

epoch = datetime.datetime.utcfromtimestamp(0)     # epoch 0 (1970-01-01 00:00:00 UTC)
start_time = datetime.datetime(<some value>)      # lower bound of the time range
group_by = <value>                                # number of seconds to group data by

offset = (start_time - epoch).total_seconds() % group_by

The query then looks something like this:

query = 'SELECT MEAN({field}) from {meas} WHERE time >= {start_time} GROUP BY time({group_by}s,{offset}s) '
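
As a concrete sketch with the timestamps from the original report (my numbers, just to illustrate the arithmetic): a start time of 1487184837 s and a 15 s interval give

offset = 1487184837 % 15              # = 12

so GROUP BY time(15s,12s) shifts the preset boundaries such that a bucket begins exactly at 1487184837.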

@gtt116

gtt116 commented Mar 2, 2017

The same problem here.

select max(*) from example where time <'2017-03-02 11:00:00' and time > '2017-03-02 10:03:00' group by time(5m)
name: example 
time                 max_service max_service_size
----                 ----------- ----------------
2017-03-02T10:00:00Z             
2017-03-02T10:05:00Z             
2017-03-02T10:10:00Z             
2017-03-02T10:15:00Z 1           1392114
2017-03-02T10:20:00Z 323         2784228
2017-03-02T10:25:00Z 0           1392114
2017-03-02T10:30:00Z             
2017-03-02T10:35:00Z 18          1392114
2017-03-02T10:40:00Z             
2017-03-02T10:45:00Z             
2017-03-02T10:50:00Z             
2017-03-02T10:55:00Z           

But I think the first point (2017-03-02T10:00:00Z) should be filtered out, because it is not > "2017-03-02 10:03:00". I am trying to use a sub-query to work around this; the sub-query looks like below, but the result is the same as the first query. 😢

select * from (select max(*) from example where time <'2017-03-02 11:00:00' and time > '2017-03-02 10:03:00' group by time(5m) fill(0) )where time > '2017-03-02 10:03:00'

If there is more information I can provide, please let me know 😄

@Dylan-Eikelenboom

This is not an issue with InfluxDB, as this is by design.

To get around that, a sub-query would not work, I think, since (quoting the docs) 'InfluxDB uses preset round-number time boundaries for GROUP BY intervals that are independent of any time conditions in the WHERE clause'. Using an offset, as in my response above, is the way to go until a start() or similar is implemented.

See also common issue 1 and the advanced GROUP BY time() syntax in the official docs.

@phemmer
Contributor Author

phemmer commented Mar 2, 2017

The problem with offset, and the proposed start() value, is that this just shifts the problem to the other end of the time range: the beginning of the grouping will line up with the where time >= clause, but the end also has to line up, or you get an incomplete group at the end.
The only conceivable way to fix this is to chop one of the ends off (a rough sketch of both options follows the list). Either:

  • the end of the last group matches the max time, and the min time is truncated up to the start of the first group that begins at or after the min time; or
  • the start of the first group matches the min time, and the max time is truncated down to the end of the last group that ends at or before the max time.
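
A minimal client-side sketch of both truncation options, in Python (the helper name and the whole-second epoch assumption are mine; InfluxDB provides nothing like this out of the box):

def truncate_bounds(min_time, max_time, interval, align="start"):
    # Clamp [min_time, max_time] (seconds since epoch) to whole `interval` groups.
    # align="start": buckets start at min_time; max_time is lowered to the end of
    #                the last complete group (option 2 above).
    # align="end":   buckets end at max_time; min_time is raised to the start of
    #                the first complete group (option 1 above).
    span = max_time - min_time
    whole = (span // interval) * interval
    if align == "start":
        return min_time, min_time + whole
    return max_time - whole, max_time

# e.g. truncate_bounds(1487184837, 1487184890, 15) -> (1487184837, 1487184882)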

@Dylan-Eikelenboom

Dylan-Eikelenboom commented Mar 3, 2017

I agree that, with a start option, the final time interval may extend past the WHERE time <= bound, but it will not contain any data points later than that bound.

So let's say you query data from a single day from 00:00 to 7:30 and group by 1 hour. You will then get back a value with timestamp 00:00, which groups values from 00:00 till 01:00, and a value at 01:00 (which groups values from 01:00-02:00), etc.

When Influx gets to the value at 07:00, it will only include values from 07:00 till 07:30 (since data points after 07:30 are excluded). Therefore the value at the end will be correct (containing only data points before 07:30), but it will just be a group of a shorter time period (30 minutes instead of 60 minutes). In my opinion, this is the expected behaviour if the total time between start and end time is not some even multiple of the group-by time.

So in essence it already does the second option you suggest, provided you set the offset so that a new time interval starts exactly at the start time.

@phemmer
Contributor Author

phemmer commented Mar 3, 2017

When Influx gets to the value at 07:00, it will only include values from 07:00 till 07:30 (since data points after 07:30 are excluded). Therefore the value at the end will be correct

No, the value at the end will be incorrect. If you're doing some sort of average, and your values are relatively consistent, then yes it may not be a significant issue. But if you're doing a summation, count, derivative, etc, or your values are very inconsistent, the value will be invalid.

So in essence it already does the second option you suggest if you have the offset correctly to make sure a new time interval starts at the start time.

No, it doesn't do anything close to the second option. This is what the whole issue is about. You have incomplete data for the group, thus the value is inaccurate.

@Dylan-Eikelenboom

I see. With some aggregators you can get weird values, but I wonder whether they are actually incorrect, given that the time constraint means not all values fall in the final (or first) interval. I do agree it would be nice to be able to have Influx stop at the end of the final 'complete' interval.

Given my example above, the last time interval would be only half as long as the others and contain half as many data points (if the logging interval is constant). So if I had 1 value per minute and performed a count per hour, all values would be 60, except for the last one, which would be 30. This is correct, since all values after 07:30 are excluded and there are only 30 values in that interval.

It may be weird to have such a value in there, and I see your point that it would be nice to be able to supply a parameter so Influx does not return it. Currently, though, the only solution is to choose the start and end time, or the group-by interval, such that the query spans exactly a whole number of intervals and the first interval starts on your start time.

@dandv
Contributor

dandv commented Oct 3, 2017

This is not an issue with InfluxDB, as this is by design.

I've just wasted half a day hunting a bug in a charting application due to this completely unexpected behavior of InfluxDB.

If I SELECT ... WHERE time > X, then I expect every single point to be later than X. Any points earlier than that should be filtered out.

@onlynone

onlynone commented Nov 1, 2017

If I SELECT ... WHERE time > X, then I expect every single point to be later than X. Any points earlier than that should be filtered out.

@dandv, if I understand what's going on, the raw data points earlier than X are being filtered out, but the first data points after X might belong to a time interval group that starts before X, so they're put in that bucket.

Imagine you're selecting floats that are greater than 3.14159 and then grouping them by rounding (or, more appropriately, flooring in this case). The row with a value of 3.12 will be excluded. The row with a value of 3.15 will be included, but put in the 3 bucket.
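
A tiny Python sketch of that analogy (purely illustrative):

import math

values = [3.12, 3.15, 3.99, 4.20]
kept = [v for v in values if v > 3.14159]   # the WHERE-style filter: 3.12 is excluded

buckets = {}
for v in kept:                              # the GROUP BY-style bucketing: floor to the key
    buckets.setdefault(math.floor(v), []).append(v)

print(buckets)                              # {3: [3.15, 3.99], 4: [4.2]} -- 3.15 lands in the 3 bucket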

I wouldn't necessarily call this behavior unexpected, but it should be much easier to set the group by offsets based on the time ranges in the where clause, and/or discard any "incomplete" groups.

@phemmer
Contributor Author

phemmer commented Feb 2, 2018

Since how to modify the group by clause seems to be contentious, what about a different approach:
How about a time modifier that takes a time and rounds it to the specified duration? For example, using Splunk's syntax, which does this: (now() - 1h)@h. The @h means round to the nearest whole hour. Then, with tools like Grafana, you could do WHERE time >= (now() - 24h)@$__interval AND time < now()@-$__interval GROUP BY $__interval (the - would mean round down instead of up, preventing an incomplete hour at the end).
Or, if we don't like the special @ syntax, a function such as round(now() - 24h, 1h).
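
Until something like that exists server-side, the same rounding can be done by the client before the query is built; a minimal sketch in Python (the helper name and whole-second epoch values are my own assumptions):

import time

def round_down(ts, interval):
    # round a timestamp (seconds since epoch) down to a whole interval
    return ts - (ts % interval)

interval = 3600                                  # group by 1h
now = int(time.time())

start = round_down(now - 24 * 3600, interval)    # the proposed (now() - 24h)@1h
end = round_down(now, interval)                  # drop the incomplete hour at the end

query = ("SELECT sum(v) FROM test WHERE time >= %ds AND time < %ds "
         "GROUP BY time(%ds)" % (start, end, interval))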

@jsternberg
Contributor

jsternberg commented Feb 9, 2018

It's not that the issue is contentious. It's mostly that it may not be possible in the existing query engine. We might be able to do the rounding, but we have triaged this for Flux so, given the timelines and the estimated work involved in making it work by changing the parser, we are unlikely to have this functionality in influxql.

We have taken a look at this issue and are taking it seriously. This was an oversight of influxql and we are attempting to fix that in Flux so that the time intervals for grouping make more sense. I think Flux will allow you to customize how grouping is done and I think the default might be to use the end time as the end of the last bucket rather than just truncating the buckets. We will be keeping this in mind while we design Flux.

Thank you for filing this issue. I know that this may not be a very satisfactory answer, but the creation of this issue means that we are thinking about how our design decisions for Flux affect customers and issues like this help us to design Flux to reduce future confusion that influxql caused.

@stale

stale bot commented Jul 24, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 24, 2019
@stale

stale bot commented Jul 31, 2019

This issue has been automatically closed because it has not had recent activity. Please reopen if this issue is still important to you. Thank you for your contributions.

@stale stale bot closed this as completed Jul 31, 2019
@dandv
Contributor

dandv commented Jul 31, 2019

Should this stay open, given @jsternberg's comment?

@mmihalev

So this issue won't be fixed?

@mmihalev

So, what is the correct way to solve this?

@p3r7

p3r7 commented Jan 15, 2020

Hello,

Facing the same situation, at Transatel we decided to solve it by making an HTTP proxy to handle this: Transatel/cleanflux.

We've been using it for 2 whole years, both from Grafana and various scripts (including Jupyter notebooks).

For now, it is quite declarative (retention policies have to be redeclared in the proxy configuration), but the solution is pretty robust and transparent for the client.

In addition, it also solves issues #7198 and #5961.
