[feature]continuous query with time offset #6932

Closed
jffifa opened this Issue Jun 29, 2016 · 13 comments

8 participants
@jffifa

jffifa commented Jun 29, 2016

A common situation: data is written to InfluxDB with some latency (i.e., written with an explicit timestamp), and when we need to aggregate or downsample such a data stream, we would like to run a CQ every <interval>, but with some offset or latency. That is, we want to aggregate data between now()-latency-interval and now()-latency, instead of between now()-interval and now().

The only way I know to do this is to use RESAMPLE FOR <LONG_DURATION> in the CQ, but that may create unnecessary work for InfluxDB.

For example, we may need to aggregate data between now()-60s and now()-50s every 10 seconds. In a CQ:

CREATE CONTINUOUS QUERY RESAMPLE FOR 60s BEGIN ... GROUP BY time(10s) END

However, such a CQ runs 6 queries, on [now()-60s, now()-50s), [now()-50s, now()-40s), [now()-40s, now()-30s), [now()-30s, now()-20s), [now()-20s, now()-10s), and [now()-10s, now()), of which the only useful one is the first.
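
To make the overhead concrete, here is a minimal Python sketch that enumerates the buckets one execution of such a CQ would cover. It uses a simplified model (the run time is aligned to the GROUP BY interval and the FOR duration is a whole multiple of it); `cq_windows` is a hypothetical helper, not part of InfluxDB.

```python
from datetime import datetime, timedelta

def cq_windows(run_at, for_duration, group_by):
    """Return the [start, end) buckets one CQ execution covers (simplified model)."""
    start = run_at - for_duration
    windows = []
    while start < run_at:
        windows.append((start, start + group_by))
        start += group_by
    return windows

# RESAMPLE FOR 60s with GROUP BY time(10s), executed at an arbitrary aligned time.
run_at = datetime(2016, 6, 29, 12, 0, 0)
windows = cq_windows(run_at, timedelta(seconds=60), timedelta(seconds=10))

for start, end in windows:
    print(start.time(), "->", end.time())

# Only the oldest bucket, [now()-60s, now()-50s), is the one actually wanted.
wanted = windows[0]
```

Under these assumptions, each run recomputes six buckets to obtain one useful result.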

@beckettsean

Contributor

beckettsean commented Jun 29, 2016

Continuous queries are designed to operate on new timestamps, and there is no way to have them exclude new timestamps. The additional queries should be idempotent, is there some harm in letting them run?

You can also use Kapacitor or ad hoc backfill queries to process the data.
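
An ad hoc backfill for a delayed window could be sketched roughly as below. This is only an illustration in Python: it computes the window [now()-60s, now()-50s) from the example above and builds an InfluxQL query string for it. The measurement names `raw` and `agg_10s`, the field `value`, and both helper functions are hypothetical; actually sending the query (and scheduling it every 10 seconds) is left to whatever client or scheduler you use.

```python
from datetime import datetime, timedelta, timezone

def delayed_window(now, interval, delay):
    """Return the latest full [start, end) bucket that ended at least `delay` ago."""
    step = int(interval.total_seconds())
    # Floor (now - delay) to a multiple of the interval, measured from the Unix epoch.
    end_ts = int((now - delay).timestamp()) // step * step
    end = datetime.fromtimestamp(end_ts, tz=timezone.utc)
    return end - interval, end

def rollup_query(start, end, interval_literal):
    """Build an InfluxQL SELECT ... INTO query for one window (names are hypothetical)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (
        'SELECT mean("value") INTO "agg_10s" FROM "raw" '
        f"WHERE time >= '{start.strftime(fmt)}' AND time < '{end.strftime(fmt)}' "
        f"GROUP BY time({interval_literal})"
    )

now = datetime(2016, 6, 29, 12, 0, 0, tzinfo=timezone.utc)
start, end = delayed_window(now, timedelta(seconds=10), timedelta(seconds=50))
query = rollup_query(start, end, "10s")
print(query)
```

Each run touches exactly one bucket, at the cost of running the scheduling outside the CQ system.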

@jsternberg

Contributor

jsternberg commented Aug 4, 2016

If I understand this correctly, is this a request to have some kind of offset to delay the interval when a CQ runs? Maybe something like:

CREATE CONTINUOUS QUERY RESAMPLE DELAY 50s BEGIN ... GROUP BY time(10s) END

Maybe replace DELAY with OFFSET so we don't have to add a new keyword.

@jffifa

jffifa commented Aug 5, 2016

> Continuous queries are designed to operate on new timestamps, and there is no way to have them exclude new timestamps. The additional queries should be idempotent, is there some harm in letting them run?

No, there is no harm; it is just a performance concern.

@jffifa jffifa closed this Sep 1, 2016

@jpriebe

jpriebe commented Jul 5, 2017

@jffifa - did you find another way to handle the delayed data situation? I recently encountered this problem, where I'm inserting AWS CloudWatch metrics into InfluxDB, and some of those metrics come in on a two- to three-minute delay (and not just sometimes; they are always delayed).

Without resampling, nothing gets written by the continuous queries. But I worry that the extra resampling that I've defined is going to waste compute resources...

@pootzko

pootzko commented Oct 19, 2017

Hi all,

I feel this should be re-opened. We have a situation where we have some sensor listeners which timestamp data points and then batch-send them every few seconds to Influx. This sometimes causes the latest points of the hour to be dropped from the CQ for that hour.

I'm aware of the GROUP BY time() offset option (Example 4), but if I understand how it works, it will shift the whole CQ interval to roll up the data from 8:15-9:15 instead of 8:00-9:00, and that's not what we need. We need to roll up the data from 8:00-9:00, but we want that to happen at 9:15 (or 9:05; the point is that we want it to run once we're positive that all the points the CQ will use have arrived).

Is there an option to do this that I'm missing, or is it not possible?

Thanks

@t0mk

t0mk commented Nov 23, 2017

What about rewriting the CQ as a Kapacitor TICKscript and using the shift node:
https://docs.influxdata.com/kapacitor/v1.3/nodes/shift_node/

@djhoese

djhoese commented Dec 11, 2017

I agree with @pootzko and am hoping this issue could be reopened (if there isn't another issue open somewhere). I have a situation where meteorological instrument's observations are batch sent in 1 minute intervals (5 second data) due to the commercial instrument software being used. While migrating the data ingest to a new server and adding an additional 1 minute delay (during migration) I noticed that my 1m average CQ wasn't catching any of the data in the interval because it was coming in ~50 seconds after the CQ was being run. I now get almost no 1 minute CQ data points. After I remove this 1 minute extra delay I'm still concerned that the last few data points won't be included in the CQ due to the batch sending even if the CQ does produce a value.

@djhoese

djhoese commented Apr 13, 2018

I'm running into this situation again and was wondering if there has been any consideration of adding a feature like this. I've been reading the documentation for resampling and don't think there is any way to reliably achieve what I'm trying to do with a CQ. If I'm understanding things correctly, with resampling the CQ could produce the right value after a couple of executions, but it would still write a data point for that time which then gets updated the next time the CQ runs. My users may be requesting only the latest data point for a field and adding it to a local cache of the data (to reduce the amount of data sent). Using resampling would not work for their use case.

I could manually backfill I suppose, but I'm guessing that has quite the performance hit too. Any other suggestions?

@dugwood

dugwood commented Apr 23, 2018

Isn't it already available using a combination of EVERY and FOR? It's not easy to understand at first, but the example seems to cover the need: https://docs.influxdata.com/influxdb/v1.5/query_language/continuous_queries/#example-3-configuring-execution-intervals-and-cq-time-ranges

Refer to https://docs.influxdata.com/influxdb/v1.5/query_language/continuous_queries/#examples-of-advanced-syntax for initial data to be requested.

Suppose we lose the last minute's worth of data (because of latency). I'll give my example to the second, but of course 07:59:59 means 07:59:59.999999999.

As you can see:

  • RESAMPLE EVERY 1h FOR 90m
  • GROUP BY time(30m)

At 9:00 cq_advanced_every_for executes a query with the time range WHERE time >= '7:30' AND time < '9:00'. cq_advanced_every_for writes three points to the average_passengers measurement:

This means that CQ will compute data for 3 time intervals:

  • 2016-08-28T07:30:00Z 7.5

    • includes both points:
      • 2016-08-28T07:30:00Z 8
      • 2016-08-28T07:45:00Z 7
    • so it's 07:30:00 to 07:59:59 => will get the last minute latency (07:59:00 to 07:59:59)
  • 2016-08-28T08:00:00Z 11.5

    • includes both points:
      • 2016-08-28T08:00:00Z 8
      • 2016-08-28T08:15:00Z 15
    • so it's 08:00:00 to 08:29:59 => will get the last minute of latency (08:29:00 to 08:29:59)
  • 2016-08-28T08:30:00Z 16

    • includes both points:
      • 2016-08-28T08:30:00Z 15
      • 2016-08-28T08:45:00Z 17
    • so it's 08:30:00 to 08:59:59 => will LOSE the last minute because of latency (08:59:00 to 08:59:59)

As we can see, the latest point is wrong (missing one minute), but the recomputed values for 07:30 and 08:00 are now okay.
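
To see which buckets actually get recomputed, the schedule can be simulated. The Python sketch below uses a simplified model of RESAMPLE EVERY 1h FOR 90m with GROUP BY time(30m) (each run covers [run time - FOR, run time), split into aligned buckets) and counts how often each bucket is computed over the 8:00, 9:00 and 10:00 runs. `bucket_starts` is a hypothetical helper, not InfluxDB code.

```python
from collections import Counter
from datetime import datetime, timedelta

def bucket_starts(run_at, for_duration, group_by):
    """Start times of the buckets one CQ execution covers (simplified model)."""
    start = run_at - for_duration
    starts = []
    while start < run_at:
        starts.append(start)
        start += group_by
    return starts

every = timedelta(hours=1)
for_duration = timedelta(minutes=90)
group_by = timedelta(minutes=30)

runs = Counter()
run_at = datetime(2016, 8, 28, 8, 0)
for _ in range(3):  # executions at 8:00, 9:00 and 10:00
    for start in bucket_starts(run_at, for_duration, group_by):
        runs[start] += 1
    run_at += every

for start in sorted(runs):
    print(start.time(), "computed", runs[start], "time(s)")
```

In this model the 07:30 and 08:30 buckets are each computed twice (once right as they close, once an hour later), while the 08:00 bucket is computed only once, at 9:00, thirty minutes after it closed.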

Of course there will be a delay equal to the EVERY keyword, but you must keep in mind that the DELAY proposal is just a guess: what if you set a DELAY of 60 seconds and your data arrives 61 seconds after the timestamp? Then you'll lose the data because of 1 second. The best way is to resample multiple times, with the maximum acceptable delay.

In the above example, the parameters mean that we're okay with losing data that is offset by 1 hour (i.e., inserted 1 hour AFTER its own timestamp), as 07:30 to 07:59 will be computed for the last time at 09:00 (so the last computation of 07:59:59 happens one hour later, at 09:00).

So the real question is:

  • what maximum offset do you accept in your resampling (i.e., a point not found in the current run should appear at most XX minutes after insertion)? Use that value for the EVERY keyword
  • after what amount of time do you consider a point to be totally ignored? Use that value relative to GROUP and set the FOR parameter. For example if you want 2 hours before ignoring the point, and have a 1h GROUP parameter, set FOR to 1+2 = 3h.
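
The arithmetic in the second point can be written out as a small Python sketch. `resample_for` is a hypothetical helper that just encodes the rule of thumb above (FOR = GROUP interval + time after which a late point is ignored); it is not an InfluxDB API.

```python
from datetime import timedelta

def resample_for(group_by, ignore_after):
    """FOR duration per the rule of thumb above: GROUP interval plus accepted lateness."""
    return group_by + ignore_after

group_by = timedelta(hours=1)
ignore_after = timedelta(hours=2)

for_duration = resample_for(group_by, ignore_after)
# Number of GROUP BY buckets each execution recomputes with this FOR value.
buckets_per_run = for_duration // group_by

print("FOR:", for_duration, "buckets per run:", buckets_per_run)
```

With a 1h GROUP interval and a 2h cutoff this gives FOR 3h, so every execution recomputes three buckets.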
@djhoese

djhoese commented Apr 23, 2018

@dugwood The situation you describe is exactly what I want to avoid when I mentioned users only retrieving the newest data point. In example 3, as you point out, it recalculates the data points twice. This means that if someone grabs the most recent data points from the first resample, caches them locally, and then never requests those data points again, they now have invalid data. In fact I used EVERY/FOR as a temporary workaround on my server and the next day I had an email from a user asking why the data points changed (they were actually coming up as null occasionally due to how late the data was for that interval).

As for how long is too long for late data, in this specific case it's probably 2 minutes for a 1 minute average CQ. Regardless, the data points are recalculated so users see data change over time.

@dugwood

dugwood commented Apr 23, 2018

I understand your concern, although you may never know exactly when the data points will be available from the source. Maybe it's a 2-minute delay today, but suppose one server gets a load spike and the delay increases to 3 minutes: you'll have invalid data too.

So the NFR is a valid one (run at *:05 instead of *:00), but resampling will always imply that you may have data changes in the future.

Another case would be if you drop your influx database (by mistake, by choice...), and insert all your data again: you'll have all missing points (> 2 minutes), so the data would be invalid too (either the old one, or the new one if the need was to recreate older data).

One way or another, you'll have an issue. Not a big one, I agree, but it can't be ignored, especially in your case, as you have users complaining about the results. I only have global statistics for a project, so if I miss 10% of the points it won't be really bad (but not expected either!).

@djhoese

djhoese commented Apr 23, 2018

> although you may never know exactly when the data points will be available from the source.

Right but I would like the option to choose what that threshold is without resampling multiple times. There is a point where data being late is ok. There is a point where data being later than that is not ok. Points that arrive within the acceptable window should be included in the average window, once. Points that arrive later than the acceptable window should not be included in the average and that is fine. If I, as the maintainer of the DB, could choose what that threshold is and not have to resample the results multiple times to do an average and don't have to set up a cronjob to run the CQs manually every couple minutes that would be awesome. I understand that InfluxDB offers options for achieving the same final result, but it seems to require extra computation or extra queries outside of the CQ system (cronjob) for this specific case.

If I had to insert all of my data over again then I would run the CQs manually which is a fairly acceptable request, especially considering I was the one who had to put all the data back in.

I'm not sure I understand your non-functional requirement statement. My hope would be that with a feature like DELAY I wouldn't need to resample at all and could do everything I want with just a GROUP BY (X minute averages every X minutes). Hopefully this all makes sense and I understand if this feature isn't worth the development time/effort to add, but it is an issue that is not solved by the existing CQ functionality (from what I understand).

@dugwood

dugwood commented Apr 23, 2018

As I said, the NFR is valid, should be implemented, and as you said I don't know the effort either.

The issue with not resampling is that you expect a result that depends on the source (telegraf?) and can't be reproduced. Running from a live log (such as a logparser) and running from an old log won't produce the same results. And you can't expect to reproduce the source as it occurred the first time (all points will be there, but the first time some were missing).

So the threshold is a good idea... but impossible to implement in a way that always gives the same values in the end.

But I agree that if you don't mind the resampling or reimporting issue in the future, you'll be fine with your DELAY.
