[feature] continuous query with time offset #6932
Comments
Continuous queries are designed to operate on new timestamps, and there is no way to have them exclude new timestamps. The additional queries should be idempotent, so is there any harm in letting them run? You can also use Kapacitor or ad hoc backfill queries to process the data. |
If I understand this correctly, is this a request to have some kind of offset to delay the interval a CQ runs on? Maybe something like the sketch below, replacing the plain resample clause? |
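For illustration, a rough sketch of what such an offset might look like. This is hypothetical syntax that InfluxDB does not actually support (the `DELAY` clause, database, measurement and field names are all invented), shown only to make the idea concrete:

```sql
-- HYPOTHETICAL: "DELAY" is not valid InfluxQL; it only illustrates the idea of
-- waiting a little before aggregating an interval.
CREATE CONTINUOUS QUERY "cq_mean_1m" ON "mydb"
RESAMPLE EVERY 1m DELAY 30s
BEGIN
  SELECT mean("value") INTO "mydb"."autogen"."value_1m"
  FROM "mydb"."autogen"."raw_values"
  GROUP BY time(1m)
END
```

With something like this, the run at 08:01:30 would aggregate the 08:00 to 08:01 window, giving late points an extra 30 seconds to arrive.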
No, there is no harm, just a performance concern. |
@jffifa - did you find another way to handle the delayed data situation? I recently encountered this problem, where I'm inserting AWS CloudWatch metrics into InfluxDB, and some of those metrics come in with a two- to three-minute delay (and not just sometimes; they are always delayed). Without resampling, nothing gets written by the continuous queries. But I worry that the extra resampling I've defined is going to waste compute resources... |
Hi all, I feel this should be re-opened. We have some sensor listeners which timestamp data points and then batch-send them every few seconds to InfluxDB. This sometimes causes the latest points of the hour to be dropped from the CQ for that hour. I'm aware of the resampling options. Is there an option to do this that I'm missing, or is it not possible? Thanks |
What about rewriting the CQ as a Kapacitor TICK script and using the shift node: |
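A minimal sketch of a Kapacitor batch task along these lines, with made-up database, measurement and field names. Note that this sketch uses the batch QueryNode's `.offset()` property to delay the queried window rather than the `|shift()` node (which shifts point timestamps), but the effect for late data is similar:

```js
// Database/retention policy the task is associated with (Kapacitor 1.4+ syntax).
dbrp "mydb"."autogen"

// Every minute, query the 1-minute window that ended 1 minute ago, so points
// that arrive a little late are still included, then write the mean back.
batch
    |query('''
        SELECT mean("value")
        FROM "mydb"."autogen"."raw_values"
    ''')
        .period(1m)
        .every(1m)
        .offset(1m)
    |influxDBOut()
        .database('mydb')
        .retentionPolicy('autogen')
        .measurement('value_1m')
```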
I agree with @pootzko and am hoping this issue could be reopened (if there isn't another issue open somewhere). I have a situation where a meteorological instrument's observations are batch-sent at 1-minute intervals (5-second data) due to the commercial instrument software being used. While migrating the data ingest to a new server and adding an extra 1-minute delay (during migration), I noticed that my 1m average CQ wasn't catching any of the data in the interval, because it was arriving ~50 seconds after the CQ had run. I now get almost no 1-minute CQ data points. Even after I remove this extra 1-minute delay, I'm still concerned that the last few data points won't be included in the CQ, due to the batch sending, even if the CQ does produce a value. |
I'm running into this situation again and was wondering whether there has been any consideration of adding a feature like this. I've been reading the documentation for resampling and don't think there is any way to reliably achieve what I'm trying to do with a CQ. If I'm understanding things correctly, with resampling the CQ could produce the right value after a couple of executions, but it would still write a data point that will be updated the next time the CQ runs. My users may request only the latest data point for a field and add it to a local cache (to reduce the amount of data sent). Resampling would not work for their use case. I could manually backfill, I suppose, but I'm guessing that has quite a performance hit too. Any other suggestions? |
Isn't this already available using a combination of `RESAMPLE EVERY` and `FOR`? Refer to https://docs.influxdata.com/influxdb/v1.5/query_language/continuous_queries/#examples-of-advanced-syntax for how the initial data range to be requested is configured. Suppose we lose the last minute's worth of data (because of latency). I'll base my example on the second one there, but of course the durations can be adjusted. As you can see with the CQ sketched below:
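(A sketch of the kind of CQ being described; the durations follow the advanced-syntax docs and match the times discussed next, while the database, measurement and field names are made up and the original comment's exact parameters may have differed.)

```sql
-- Runs every hour and recomputes the last three 30-minute intervals.
CREATE CONTINUOUS QUERY "cq_mean_30m" ON "mydb"
RESAMPLE EVERY 1h FOR 90m
BEGIN
  SELECT mean("value") INTO "mydb"."autogen"."value_30m"
  FROM "mydb"."autogen"."raw_values"
  GROUP BY time(30m)
END
```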
This means that the CQ will compute data for 3 time intervals each run: when it executes at 09:00, it (re)computes the 07:30 to 08:00, 08:00 to 08:30 and 08:30 to 09:00 intervals.
As we can see, the latest point is wrong (missing one minute), but the new computations of 07:30 and 08:00 are now okay. Of course there will be a delay, equal to the difference between the `FOR` duration and the `GROUP BY` interval, before a point gets its final value. In the above example, the parameters mean that we're okay with data arriving up to 1 hour late (i.e. inserted 1 hour AFTER its own timestamp), as 07:30 to 07:59 will be computed for the last time at 09:00 (so the last computation of the 07:59:59 point happens one hour later, at 09:00). So the real question is: how late is too late for your data?
|
@dugwood The situation you describe is exactly what I want to avoid when I mention users only retrieving the newest data point. In example 3, as you point out, it recalculates the data points twice. This means that if someone grabs the most recent data points from the first resample, caches them locally, and then never requests those data points again, they now have invalid data. In fact, that is essentially the approach I tried. As for how long is too long for late data, in this specific case it's probably 2 minutes for a 1-minute average CQ. Regardless, the data points are recalculated, so users see data change over time. |
I understand your concern, although you may never know exactly when the data points will be available from the source. Maybe it's a 2-minute delay today, but suppose one server gets a load spike and the delay increases to 3 minutes: you'll have invalid data too. So the NFR is a valid one (run at *:05 instead of *:00), but resampling will always imply that you may have data changes in the future. Another case would be if you drop your InfluxDB database (by mistake, by choice...) and insert all your data again: you'll have all the previously missing points (> 2 minutes), so the data would be invalid too (either the old data, or the new data if the goal was to recreate older values). One way or another, you'll have an issue. Not a big one, I agree, but it can't be ignored, especially in your case as you have users complaining about the results. I only have global statistics for a project, so if I miss 10% of the points that won't be really bad (but not expected either!). |
Right, but I would like the option to choose what that threshold is without resampling multiple times. There is a point where data being late is OK, and a point where data being later than that is not OK. Points that arrive within the acceptable window should be included in the average window, once. Points that arrive later than the acceptable window should not be included in the average, and that is fine. If I, as the maintainer of the DB, could choose that threshold and not have to resample the results multiple times to do an average, and not have to set up a cronjob to run the CQs manually every couple of minutes, that would be awesome.

I understand that InfluxDB offers options for achieving the same final result, but it seems to require extra computation or extra queries outside of the CQ system (cronjob) for this specific case. If I had to insert all of my data over again, then I would run the CQs manually, which is a fairly acceptable requirement, especially considering I was the one who had to put all the data back in. I'm not sure I understand your non-functional requirement statement. My hope would be that with a feature like DELAY I wouldn't need to resample at all and could do everything I want with just a GROUP BY (X-minute averages every X minutes). Hopefully this all makes sense, and I understand if this feature isn't worth the development time/effort to add, but it is an issue that is not solved by the existing CQ functionality (from what I understand). |
As I said, the NFR is valid and should be implemented, and as you said I don't know the effort involved either. The issue with your lack of resampling is that you expect a result that depends on the source (Telegraf?) and that can't be reproduced. Running on a live log (such as with a logparser) and running from an old log won't produce the same results, and you can't expect to reproduce the source as it occurred the first time (all points will be there, whereas the first time some were missing). So the threshold is a good idea... but impossible to implement in a way that always yields the same values in the end. But I agree that if you don't mind about the resampling or reimporting in the future, you'll be fine with your proposed threshold. |
+1 for this. It is quite common practice, and recommended, to batch points while writing, which by definition causes a delay between the data point's timestamp and the time it gets written into InfluxDB. I stream live financial data which is either already timestamped or timestamped upon arrival by my streaming script. However, due to batch writing, data only gets written into InfluxDB a few seconds later, causing the CQ to miss the most recent points. After realizing this I am currently using the resampling workaround. |
Also, it looks like a CQ (or any query) is not atomic. |
Hi, I want my continuous query to use a time of 4:00 PM, but the current time is 5:15 PM. The built-in behaviour is that when I create a new CQ it takes the current time, but I want to override the time for a particular CQ, for example so that it only processes data up to 4:00 PM. Can I do this? |
@mananpatel1 if you know that the CQ runs at exactly 5:15 and you want it to use data only up to 4:00, you just need to filter your data with a `WHERE time` clause. |
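A sketch of that kind of time-filtered query, written as a one-off `SELECT ... INTO` backfill rather than as the CQ itself; the database, measurement, field names and timestamps are made up:

```sql
-- Run manually at 5:15 PM, but only aggregate data with timestamps before 4:00 PM.
SELECT mean("value") INTO "mydb"."autogen"."value_1h"
FROM "mydb"."autogen"."raw_values"
WHERE time >= '2018-06-01T00:00:00Z' AND time < '2018-06-01T16:00:00Z'
GROUP BY time(1h)
```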
Here is a common situation: data is written to InfluxDB with some latency (i.e. written with an explicit timestamp), and when we need to do some aggregation or downsampling on such a data stream, we would like to run a CQ every `<interval>`, but with some offset or latency, i.e. we want to aggregate the data between `now()-latency-interval` and `now()-latency`, instead of between `now()-interval` and `now()`.

The only way I know to do such a job is to use `RESAMPLE FOR <LONG_DURATION>` in the CQ, but that may create useless workload for InfluxDB. For example, we may need to aggregate the data between `now()-60s` and `now()-50s` every `10s`, in a CQ (see the sketch after this paragraph). However, such a CQ leads to 6 queries, on the data `[now()-60s, now()-50s)`, `[now()-50s, now()-40s)`, `[now()-40s, now()-30s)`, `[now()-30s, now()-20s)`, `[now()-20s, now()-10s)` and `[now()-10s, now())`, where the only useful query is the first one.
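A sketch of the `RESAMPLE FOR` workaround described above, matching the 10-second example (the database, measurement and field names are made up):

```sql
-- Runs every 10s and recomputes the last six 10-second buckets; only the oldest
-- bucket, [now()-60s, now()-50s), is actually wanted.
CREATE CONTINUOUS QUERY "cq_mean_10s" ON "mydb"
RESAMPLE EVERY 10s FOR 60s
BEGIN
  SELECT mean("value") INTO "mydb"."autogen"."value_10s"
  FROM "mydb"."autogen"."raw_values"
  GROUP BY time(10s)
END
```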