Group by time starts grouping before minimum time in where clause #8010
Comments
I believe this happens because the intervals are always on even time boundaries and do not start with the first point in the results.
cc @jsternberg |
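To illustrate the boundary behavior described above, here is a minimal Python sketch. It is a model of the observed behavior, not InfluxDB's own code, and it assumes the bucket boundaries are multiples of the interval counted from the Unix epoch. The 10-minute interval and the function name `default_bucket_start` are hypothetical choices for the example.

```python
from datetime import datetime, timezone


def default_bucket_start(t: datetime, interval_seconds: int) -> datetime:
    """Floor `t` to the preceding multiple of the interval, counted from the Unix epoch."""
    epoch_seconds = int(t.timestamp())
    floored = epoch_seconds - (epoch_seconds % interval_seconds)
    return datetime.fromtimestamp(floored, tz=timezone.utc)


# Lower bound of the WHERE clause (timezone-aware, UTC).
where_start = datetime(2017, 3, 2, 10, 3, tzinfo=timezone.utc)

# Assumed 10-minute GROUP BY interval.
first_bucket = default_bucket_start(where_start, 10 * 60)

# The first bucket begins before the WHERE lower bound, so it is only partly filled.
print(first_bucket.isoformat(), "<", where_start.isoformat())
```

Under these assumptions the first bucket is labeled 10:00:00 even though the query only asks for data from 10:03:00 onward, which is the incomplete first group reported in this issue.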
@jwilder is correct. I think you can change that functionality by doing |
Using
While it seems like it would be possible to use the offset parameter of I haven't checked, but I suspect a similar issue would exist at the end of the time range as well. In that if the "group time" doesn't end on the "where time <=", aggregates will be screwed up. And a |
Hm, I messed up. The @jwilder we likely need some other keyword to offset to the query start time. Maybe |
+1 for a start() option as offset. What I currently do, to make sure a new time bucket starts at your start time, is take the number of seconds between epoch 0 and your query's start time, divide it by your group-by time, and use the remainder as the offset. In Python:
The query then looks something like this:
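The original snippets are not preserved in this copy of the thread. A minimal sketch of the calculation described above might look like the following. The measurement name `my_measurement`, the field `value`, and the helper names are hypothetical; the only InfluxQL feature relied on is the documented two-argument form `GROUP BY time(interval, offset)`.

```python
from datetime import datetime, timezone


def start_aligned_offset(start: datetime, interval_seconds: int) -> int:
    """Remainder of the start time (seconds since epoch 0) divided by the group-by time."""
    return int(start.timestamp()) % interval_seconds


def build_query(start: datetime, end: datetime, interval_seconds: int) -> str:
    """Build an InfluxQL query whose first bucket starts exactly at `start`."""
    offset = start_aligned_offset(start, interval_seconds)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    # Convert to UTC before formatting so the trailing "Z" is accurate.
    start_s = start.astimezone(timezone.utc).strftime(fmt)
    end_s = end.astimezone(timezone.utc).strftime(fmt)
    return (
        'SELECT mean("value") FROM "my_measurement" '  # hypothetical names
        f"WHERE time >= '{start_s}' AND time < '{end_s}' "
        f"GROUP BY time({interval_seconds}s, {offset}s)"
    )


start = datetime(2017, 3, 2, 10, 3, tzinfo=timezone.utc)
end = datetime(2017, 3, 2, 11, 3, tzinfo=timezone.utc)
print(build_query(start, end, 600))
```

The datetimes must be timezone-aware; a naive `datetime` would be interpreted in local time by `timestamp()` and could produce the wrong remainder.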
|
The same problem here.
But I think the first point (2017-03-02T10:00:00Z) should be filtered out, because it is not > "2017-03-02 10:03:00". I am trying to use a sub-query to work around this; the sub-query looks like below, but the result is the same as the first query. 😢
If there is more information I can provide, please let me know 😄 |
This is not an issue with InfluxDB, as this is by design. To get around that, a subquery would not work, I think, since (quoting the docs) 'InfluxDB uses preset round-number time boundaries for GROUP BY intervals that are independent of any time conditions in the WHERE clause'. Using an offset, such as in my response above, is the way to go until a start() or similar is implemented. See also common issue 1 and the advanced GROUP BY time() syntax in the official docs |
The problem with offset, and the proposed
|
I agree that, with a start option, the last time bucket may extend beyond the WHERE time <= bound, but it will not contain any data points later than that bound. So let's say you query data from a single day from 00:00 to 07:30 and group by 1 hour. You will then get back a value with timestamp 00:00, which groups values from 00:00 till 01:00, and a value at 01:00 (which groups values from 01:00 till 02:00), etc. When Influx gets to the value at 07:00, it will only include values from 07:00 till 07:30 (since data points after 07:30 are excluded). Therefore the value at the end will be correct (containing only data points before 07:30), but it will just be a group covering a shorter time period (30 minutes instead of 60 minutes). In my opinion, this is the expected behaviour if the total time between start and end time is not an exact multiple of the group-by time. So in essence it already does the second option you suggest, provided you set the offset so that a new time interval starts at the start time. |
No, the value at the end will be incorrect. If you're doing some sort of average, and your values are relatively consistent, then yes it may not be a significant issue. But if you're doing a summation, count, derivative, etc, or your values are very inconsistent, the value will be invalid.
No, it doesn't do anything close to the second option. This is what the whole issue is about. You have incomplete data for the group, thus the value is inaccurate. |
I see. With some aggregators you can get weird values, but I wonder if they are actually incorrect. Given my example above, the last time interval would be only half as long as the others and contain half as many data points (if the data logging interval is constant). So if I had 1 value per minute and performed a count per hour, all values would be 60, except for the last one, which would be 30. This is correct, since all values after 07:30 are excluded and there are then only 30 values in that interval. It may be weird to have such a value in there, and I see your point that it would be nice if you could supply a parameter to have Influx not return that value, but currently the only solution would be to choose the start and end time or the group-by time such that you get exactly x number of intervals in your query and have the first interval start on your start time. |
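The count-per-hour example above can be checked with a small simulation. This is only a sketch under stated assumptions: one synthetic point per minute, buckets aligned to the start time, an exclusive upper bound (`time < 07:30`), and an arbitrary date. It does not query InfluxDB, and the helper names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def counts_per_bucket(points, start, interval):
    """Count points per bucket, with buckets aligned to `start`."""
    counts = Counter()
    for t in points:
        index = (t - start) // interval  # timedelta // timedelta yields an int
        counts[start + index * interval] += 1
    return [counts[bucket] for bucket in sorted(counts)]


def complete_bucket_count(start, end, interval):
    """Number of buckets that fit entirely inside [start, end)."""
    return (end - start) // interval


start = datetime(2017, 3, 2, 0, 0, tzinfo=timezone.utc)
end = datetime(2017, 3, 2, 7, 30, tzinfo=timezone.utc)
interval = timedelta(hours=1)

# One synthetic point per minute in [start, end).
n_points = (end - start) // timedelta(minutes=1)
points = [start + timedelta(minutes=m) for m in range(n_points)]

counts = counts_per_bucket(points, start, interval)
print(counts)  # seven full buckets of 60, then a partial bucket of 30

# Client-side workaround: keep only the buckets that are fully covered.
complete = counts[:complete_bucket_count(start, end, interval)]
print(complete)
```

Whether the trailing 30 counts as "correct" or "invalid" is exactly the disagreement in this thread; dropping the partial bucket on the client side sidesteps it for sums, counts, and derivatives.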
I've just wasted half a day hunting a bug in a charting application due to this completely unexpected behavior of InfluxDB. If I |
@dandv, If I understand what's going on, the raw data points earlier than Imagine you're selecting I wouldn't necessarily call this behavior unexpected, but it should be much easier to set the group by offsets based on the time ranges in the where clause, and/or discard any "incomplete" groups. |
Since this seems to be contentious on how to modify the |
It's not that the issue is contentious. It's mostly that it may not be possible in the existing query engine. We might be able to do the rounding, but we have triaged this for Flux so, given the timelines and the estimated work involved in making it work by changing the parser, we are unlikely to have this functionality in influxql. We have taken a look at this issue and are taking it seriously. This was an oversight of influxql and we are attempting to fix that in Flux so that the time intervals for grouping make more sense. I think Flux will allow you to customize how grouping is done and I think the default might be to use the end time as the end of the last bucket rather than just truncating the buckets. We will be keeping this in mind while we design Flux. Thank you for filing this issue. I know that this may not be a very satisfactory answer, but the creation of this issue means that we are thinking about how our design decisions for Flux affect customers and issues like this help us to design Flux to reduce future confusion that influxql caused. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it has not had recent activity. Please reopen if this issue is still important to you. Thank you for your contributions. |
Should this stay open, given @jsternberg's comment? |
So this issue won't be fixed? |
So, what is the correct way to solve this? |
Hello, facing the same situation, at Transatel we decided to solve it by making an HTTP proxy to handle this: Transatel/cleanflux. We've been using it for 2 whole years, both from Grafana and various scripts (including Jupyter notebooks). For now, it is quite declarative (retention policies have to be redeclared in the proxy configuration), but this solution is pretty robust and transparent for the client. |
Bug report
System info: [Include InfluxDB version, operating system name, and other relevant details]
InfluxDB 1.2.0
CentOS/7
Steps to reproduce:
Expected behavior: [What you expected to happen]
Actual behavior: [What actually happened]
Note that the time of the first group is before our specified where time >=. This results in influx having an incomplete group, and thus incorrect results.
The second query shows the really bad effects of this in that the derivative value is obscenely high. This is causing problems in our graphs as the first point destroys the scale of the graph, and all other points are just a flat line at the bottom.