# Group by time starts grouping before minimum time in where clause #8010

Open
opened this Issue Feb 15, 2017 · 15 comments

Projects
None yet
9 participants
Contributor

### Bug report

System info: [Include InfluxDB version, operating system name, and other relevant details]
InfluxDB 1.2.0
CentOS/7

Steps to reproduce:

``````insert test v=0 1487184837000000000
insert test v=0 1487184850000000000
insert test v=38787738 1487184851000000000
insert test v=0 1487184852000000000
insert test v=0 1487184865000000000
insert test v=38796282 1487184866000000000
insert test v=0 1487184867000000000
insert test v=0 1487184880000000000
insert test v=38804816 1487184881000000000

select sum(v) from test where time >= 1487184837s and time <= 1487184837s + 45s group by time(15s)
select non_negative_derivative(sum(v),1s) from test where time >= 1487184837s and time <= 1487184837s + 45s group by time(15s)
``````

Expected behavior: [What you expected to happen]

``````> select sum(v) from test where time >= 1487184837s and time <= 1487184837s + 45s group by time(15s)
name: test
time                sum
----                ---
1487184837000000000 38787738
1487184852000000000 38796282
1487184867000000000 38804816

> select non_negative_derivative(sum(v),1s) from test where time >= 1487184837s and time <= 1487184837s + 45s group by time(15s)
name: test
time                non_negative_derivative
----                -----------------------
1487184852000000000 569.6
1487184867000000000 568.9333333333333
``````

Actual behavior: [What actually happened]

``````> select sum(v) from test where time >= 1487184837s and time <= 1487184837s + 45s group by time(15s)
name: test
time                sum
----                ---
1487184825000000000 0
1487184840000000000 38787738
1487184855000000000 38796282
1487184870000000000 38804816

> select non_negative_derivative(sum(v),1s) from test where time >= 1487184837s and time <= 1487184837s + 45s group by time(15s)
name: test
time                non_negative_derivative
----                -----------------------
1487184840000000000 2585849.2
1487184855000000000 569.6
1487184870000000000 568.9333333333333
``````

Note that the time of the first group is before our specified `where time >=`. This results in
influx having an incomplete group, and thus incorrect results.
The second query shows the really bad effects of this in that the derivative value is obscenely high. This is causing problems in our graphs as the first point destroys the scale of the graph, and all other points are just a flat line at the bottom.

Collaborator

### jwilder commented Feb 21, 2017

 I believe this happens because the intervals are always on even time boundaries and do not start with the first point in the results. `1487184837s` is `2017-02-15 18:53:57 +0000 UTC` and falls in the interval `2017-02-15 18:53:45 +0000 UTC` - `2017-02-15 18:54:00 +0000 UTC`

Contributor

### jsternberg commented Feb 21, 2017

 @jwilder is correct. I think you can change that functionality by doing `GROUP BY time(15s, now())` and it will use the start time as the beginning for time boundaries.
Contributor

### phemmer commented Feb 21, 2017

 Using `now()` just makes the result random. ``````> select sum(v) from test where time >= 1487184837s and time <= 1487184837s + 45s group by time(15s,now()) name: test time sum ---- --- 1487184835786116207 0 1487184850786116207 38787738 1487184865786116207 38796282 1487184880786116207 38804816 > select sum(v) from test where time >= 1487184837s and time <= 1487184837s + 45s group by time(15s,now()) name: test time sum ---- --- 1487184836418254017 38787738 1487184851418254017 38796282 1487184866418254017 38804816 1487184881418254017 > select sum(v) from test where time >= 1487184837s and time <= 1487184837s + 45s group by time(15s,now()) name: test time sum ---- --- 1487184822177972692 0 1487184837177972692 38787738 1487184852177972692 38796282 1487184867177972692 38804816 `````` While it seems like it would be possible to use the offset parameter of `group by time()` to adjust the start time, this would require a lot of manual math every time such a query needs to be crafted. I haven't checked, but I suspect a similar issue would exist at the end of the time range as well. In that if the "group time" doesn't end on the "where time <=", aggregates will be screwed up. And a `group by time()` offset will only be able to solve one of these, not both.
Contributor

### jsternberg commented Feb 21, 2017

 Hm, I messed up. The `now()` offset actually offsets it to `now()`, not the start time. @jwilder we likely need some other keyword to offset to the query start time. Maybe `time(15s, start())` or something else?

### Dylan-Eikelenboom commented Mar 1, 2017 • edited

 +1 for a start() option as offset. What I currently do, to make sure a new time bucket starts at your start time, is take the number of seconds from your queries' start time since epoch 0, divide it by your group-by-time and take the remainder as offset. In Python: ``````import datetime epoch = datetime.utcfromtimestamp(0) # epoch 0 start_time = datetime.datetime() # lower bound of time range group_by = # number of seconds to group data by offset = (start_time - epoch).total_seconds() % group_by `````` The query then looks something like this: ``````query = 'SELECT MEAN({field}) from {meas} WHERE time >= {start_time} GROUP BY time({group_by}s,{offset}s) ' ``````

### gtt116 commented Mar 2, 2017

 The same problem here. ``````select max(*) from example where time <'2017-03-02 11:00:00' and time > '2017-03-02 10:03:00' group by time(5m) name: example time max_service max_service_size ---- ----------- ---------------- 2017-03-02T10:00:00Z 2017-03-02T10:05:00Z 2017-03-02T10:10:00Z 2017-03-02T10:15:00Z 1 1392114 2017-03-02T10:20:00Z 323 2784228 2017-03-02T10:25:00Z 0 1392114 2017-03-02T10:30:00Z 2017-03-02T10:35:00Z 18 1392114 2017-03-02T10:40:00Z 2017-03-02T10:45:00Z 2017-03-02T10:50:00Z 2017-03-02T10:55:00Z `````` But I think the first point (2017-03-02T10:00:00Z) should be filtered out, because it is not > "2017-03-02 10:03:00". I am trying to using sub-query to work around this, the sub-query looks like below, but the result is the same as the first query. 😢 ``````select * from (select max(*) from example where time <'2017-03-02 11:00:00' and time > '2017-03-02 10:03:00' group by time(5m) fill(0) )where time > '2017-03-02 10:03:00' `````` If there is more information I can provide, please let me know 😄

### Dylan-Eikelenboom commented Mar 2, 2017

 This is not an issue with InfluxDB, as this is by design. To get around that, a sub query would not work, i think, since (quoting the docs) 'InfluxDB uses preset round-number time boundaries for GROUP BY intervals that are independent of any time conditions in the WHERE clause'. Using an offset, such as in my response above, is the way to go, until a start() or similar is implemented. See also common issue 1 and the advanced GROUP BY time() syntax at the official docs
Contributor

### phemmer commented Mar 2, 2017

 The problem with offset, and the proposed `start()` value, is that this just shifts the problem to the other end of the time range. Meaning the beginning of the grouping will line up with the `where time >= ` clause, but you also need to make the end line up, or you'll end up with an incomplete group on the end. The only conceivable way to fix this is to chop one of the ends off. Either of: the end of the last group should match the max time, and then the min time should be truncated to the start time of the first group which begins after/equal the min time. the start of the first group should match the min time, and then the max time should be truncated to the end time of the last group which ends before/equal the max time.

### Dylan-Eikelenboom commented Mar 3, 2017 • edited

 I agree that, with a start option, the total time period at the end of the time range may fall outside the WHERE time <= clause, but it will not have any data points older than the time <= clause. So let's say you query data from a single day from 00:00 to 7:30 and group by 1 hour. You will then get back a value with timestamp 00:00, which groups values from 00:00 till 01:00, and a value at 01:00 (which groups values from 01:00-02:00), etc. When Influx gets to the value at 07:00, it will only include values from 07:00 till 07:30 (since data points after 07:30 are excluded). Therefore the value at the end will be correct (containing only data points before 07:30), but it will just be a group of a shorter time period (30 minutes instead of 60 minutes). In my opinion, this is the expected behaviour if the total time between start and end time is not some even multiple of the group-by time. So in essence it already does the second option you suggest if you have the offset correctly to make sure a new time interval starts at the start time.
Contributor

### phemmer commented Mar 3, 2017 • edited

 When Influx gets to the value at 07:00, it will only include values from 07:00 till 07:30 (since data points after 07:30 are excluded). Therefore the value at the end will be correct No, the value at the end will be incorrect. If you're doing some sort of average, and your values are relatively consistent, then yes it may not be a significant issue. But if you're doing a summation, count, derivative, etc, or your values are very inconsistent, the value will be invalid. So in essence it already does the second option you suggest if you have the offset correctly to make sure a new time interval starts at the start time. No, it doesn't do anything close to the second option. This is what the whole issue is about. You have incomplete data for the group, thus the value is inaccurate.

### Dylan-Eikelenboom commented Mar 3, 2017

 I see. With some aggregators you can get weird values, but I wonder if they are actually incorrect given the constraint you don't want all values in the final (or first) time interval. I do agree it may be nice to be able to have Influx stop at the end of the final 'complete' interval. Given my example above, the last time interval would be only half as long as the others and contain half as much data points (if the data logging interval is constant). So if I had 1 value per minute and performed a count per hour, all values would be 60, except for the last one, which is then 30. This is correct, since all values after 07:30 are excluded and then there are then only 30 values in that interval. It may be weird to have such a value in there and I see your point that it would be nice if you could supply a parameter to have Influx not return that value, but currently the only solution would be to choose the start and end-time or the group-by time such that you get exactly x number of intervals in your query and have the first interval start on your start-time.

Open

Open

Contributor

### dandv commented Oct 3, 2017 • edited

 This is not an issue with InfluxDB, as this is by design. I've just wasted half a day hunting a bug in a charting application due to this completely unexpected behavior of InfluxDB. If I `SELECT ... WHERE time > X`, then I expect every single point to be later than X. Any points earlier than that should be filtered out.

### onlynone commented Nov 1, 2017 • edited

 If I SELECT ... WHERE time > X, then I expect every single point to be later than X. Any points earlier than that should be filtered out. @dandv, If I understand what's going on, the raw data points earlier than `X` are being filtered out, but the first data points after `X`, might belong to a time interval group that starts before `X`, so they're put in that bucket. Imagine you're selecting `float`s that are greater than `3.14159` and then grouping them by rounding (or more appropriately, `floor`ing in this case). The row with a value of `3.12` will be excluded. The row with a value of `3.15` will be included, but put in`3` bucket. I wouldn't necessarily call this behavior unexpected, but it should be much easier to set the group by offsets based on the time ranges in the where clause, and/or discard any "incomplete" groups.

Closed

Contributor

### phemmer commented Feb 2, 2018 • edited

 Since this seems to be contentious on how to modify the `group by` clause, what about a different approach: How about a time modifier that takes a time and rounds it to the specified duration. For example using splunk's syntax which does this: `(now() - 1h)@h`. The `@h` means round to the nearest whole hour. Then with tools like grafana, you can do `WHERE time >= (now() - 24h)@\$__interval AND time < now()@-\$__interval GROUP BY \$__interval` (the `-` would mean to round down instead of up, preventing an incomplete hour on the end). Or if we don't like the special `@` syntax, a function such as `round(now() - 24h, 1h)`.
Contributor

### jsternberg commented Feb 9, 2018

 It's not that the issue is contentious. It's mostly that it may not be possible in the existing query engine. We might be able to do the rounding, but we have triaged this for ifql so, given the timelines and the estimated work involved in making it work by changing the parser, we are unlikely to have this functionality in influxql. We have taken a look at this issue and are taking it seriously. This was an oversight of influxql and we are attempting to fix that in ifql so that the time intervals for grouping make more sense. I think ifql will allow you to customize how grouping is done and I think the default might be to use the end time as the end of the last bucket rather than just truncating the buckets. We will be keeping this in mind while we design ifql. Thank you for filing this issue. I know that this may not be a very satisfactory answer, but the creation of this issue means that we are thinking about how our design decisions for ifql affect customers and issues like this help us to design ifql to reduce future confusion that influxql caused.

Open

Open