
Should be able to force recalculation of continuous query for given time interval #211

Closed
pauldix opened this issue Jan 28, 2014 · 30 comments

Comments

@pauldix
Member

pauldix commented Jan 28, 2014

If users have a continuous query running and they fill in data from a previous interval, they should be able to trigger recalculation of a continuous query for a given interval of time.

Maybe something like this:

replay :query_id where time > '2013-11-01' and time < '2013-12-01'
@jvshahid jvshahid modified the milestones: 0.8.0, 0.7.0 May 2, 2014
jvshahid pushed a commit that referenced this issue Aug 12, 2014
fix: wait for all goroutines to finish before Stop
@jvshahid jvshahid modified the milestones: 0.8.0, Next release Aug 25, 2014
@tdunning

Manual triggering is nice, but shouldn't this happen automagically?

@tdunning

Does manual triggering also imply by symmetry that there should be a way to pause the processing of particular (or all) continuous queries?

@jvshahid
Contributor

The intention of this issue is to recalculate the output of the continuous query if old data has changed; it has nothing to do with pausing the continuous query.

@tdunning

Yes. I understood that. But it is often useful to introduce complementary features at the same time.

Continuous queries and similar mechanisms typically fail for two reasons. One is that data is delayed. The other is due to some sort of overload or failure condition outside of the query computation itself. The first case is best handled by having a proper trigger mechanism for continuous queries so that a query re-runs automatically if new data is inserted into a previously completed window. The second condition may require a manual trigger if the results were somehow incorrect due to the failure mode, but it is also common that you need to remove live loads before correcting the problem. Similarly, when trying to recover from a situation, it is very nice to be able to get the system to sit still while you are working to repair it. Having continuous queries fire while you are working can make a proper fix very difficult.

Thus, a pause is a very reasonable complement to a manually forced recomputation.

@jvshahid jvshahid removed this from the Next release milestone Oct 9, 2014
@pauldix pauldix removed this from the Next release milestone Oct 9, 2014
@jvshahid jvshahid added engine and removed bug labels Oct 9, 2014
@ghost

ghost commented Nov 5, 2014

I'd be very interested in this functionality being implemented. I'd like to use influxdb in a project where users import data in bulk and we then run analytics on the data. Do you have a rough estimate of when anyone will work on this?
It would be awesome if this could happen automagically after an insert, but I can see how that might create huge overhead for inserts. Another convenient option might be giving a hint during insert that there are continuous queries that need to be re-run (thus automatically inferring the interval).

PS: Thanks for the great work, playing around with influxdb has been a pleasure and I'd be extremely happy to use it productively :)

@kerush

kerush commented Mar 31, 2015

Guys, I'm using 0.8.8 and I think I'm seeing something similar to what's described here.
Say now is time T and I'm loading data that is slightly delayed (say T-2m) and I want to roll it up every second. I have the continuous query running, I see it triggering:

[2015/03/31 17:01:53 BST] INFO Start Query: db: core, u: root, q: select min(latency) as pct0,percentile(latency,50) as pct50,percentile(latency,75) as pct75,percentile(latency,90) as pct90,percentile(latency,99) as pct99,max(latency) as pct100 from "myapp" where (time < 1427817713000000000) AND (time > 1427817712000000000) group by time(1s)

If you convert those timestamps into human-readable form, you see that it is trying to roll up 3/31/2015 17:01:52 to 17:01:53 BST. That is basically [T-1s, T], but the last data point I loaded is at T-120s, so no rollups. Never, ever.

At this point I am thinking the only workaround is to drop the continuous query myself and try to do some backfilling overnight, but it's ugly.

How did you guys get around the problem?
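To make the window arithmetic above concrete, here is a small Python sketch (not part of InfluxDB) that converts the nanosecond epoch timestamps from the log line into human-readable UTC times:

```python
from datetime import datetime, timezone

NS_PER_S = 1_000_000_000

def ns_to_utc(ns: int) -> str:
    """Render an InfluxDB nanosecond epoch timestamp as a UTC string."""
    return datetime.fromtimestamp(ns // NS_PER_S, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# Window bounds from the continuous-query log line above:
print(ns_to_utc(1427817712000000000))  # 2015-03-31 16:01:52 (17:01:52 BST)
print(ns_to_utc(1427817713000000000))  # 2015-03-31 16:01:53 (17:01:53 BST)
# The CQ only ever examines the latest 1s window [T-1s, T]; points
# written at T-120s fall outside it and are never rolled up.
```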

@kerush

kerush commented Mar 31, 2015

I haven't looked at the implementation, but I think the main issue here is that continuous queries run on a schedule rather than being triggered by the arrival of new data, and this leads to invalid results. Essentially, the output of a continuous query should be considered valid only if there's a newer value outside the time window in the original series.
Practically, when new data is loaded we should check whether the new point closes (expires) any window defined by a continuous query attached to the series; only then can we execute the query, store the result, and advance the window expiry time.
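The rule described above can be sketched in a few lines of Python (hypothetical names and structure, not InfluxDB's actual implementation): a window becomes safe to compute only once a newer point arrives past its end.

```python
NS_PER_S = 1_000_000_000

def window_start(ts_ns: int, interval_ns: int) -> int:
    """Start of the CQ window containing the point at ts_ns."""
    return ts_ns - ts_ns % interval_ns

def windows_ready(new_point_ns: int, interval_ns: int, next_open_ns: int) -> list:
    """A window's result is valid only once a newer point lands past its
    end. Given the start of the oldest still-open window (next_open_ns,
    aligned to interval_ns), return the windows the new point closes."""
    ready = []
    w = next_open_ns
    while w + interval_ns <= window_start(new_point_ns, interval_ns):
        ready.append(w)
        w += interval_ns
    return ready

# A point at time T closes every 1s window strictly before the one containing T:
print(windows_ready(1427817593_500000000, NS_PER_S, 1427817591_000000000))
# -> [1427817591000000000, 1427817592000000000]
```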

@toddboom
Contributor

@kerush You're right on this, and we've done a bit of rewriting of continuous queries in v0.9.0. There will be a configurable lag on running continuous queries, and we'll probably implement some sort of automatic, time-based retriggering in addition to manual retriggering.

@kerush

kerush commented Mar 31, 2015

Thanks @toddboom, that's good news. I'll be waiting for 0.9 to be released then.

I think continuous queries are really the killer feature of influxdb over hybrid solutions like cassandra+spark. For this reason, their scheduling really needs to be event-based rather than clock-based, both for performance and consistency reasons. I hope you're going down that path.

Thanks again.

@beckettsean beckettsean added this to the Next Point Release milestone Apr 8, 2015
@beckettsean beckettsean removed the idea label Apr 8, 2015
@vvakar

vvakar commented Apr 29, 2015

+1
There's bound to be some lag between when data is collected and when it is loaded. If continuous queries don't take that into consideration, a portion of the data will be unaccounted for, requiring the query to be recreated. Looking forward to 0.9.0!

Thanks for all the great work so far!

@beckettsean beckettsean modified the milestones: Longer term, Next Point Release Aug 6, 2015
@jbothe

jbothe commented Aug 7, 2015

+1

@mobarre

mobarre commented Aug 7, 2015

I agree with a lot of what has been proposed in here. Given the original feature description, and taking into account all the other things mentioned, I'd say this is a must-have.

@comcomservices

+1, queries like "SELECT mean(value) INTO feeds_mean_1h FROM feeds GROUP BY time(1h), *" should work too!

@dstreppa

dstreppa commented Sep 2, 2015

+1

@humcguire

+1

@DanielMorsing
Contributor

Right now, CQ statements aren't validated when they're created, so you can create an invalid query. Obviously, for backfill, you need a valid query, but should I add this restriction for all CQs as well? It should be easy to do since I'm adding a loop into the tsdb anyway.

The only reason I can see not to is that someone might want to create a CQ that will become valid in the future, but that's a weird edge case, and validating CQs eliminates so many annoyances.

@tdunning

If you take an example from another domain, JDBC validates prepared queries even though they aren't yet being executed. The same argument that they might be valid later applies, and nobody thinks it is worth allowing temporarily invalid queries.

With JDBC, the time between preparation and execution is typically smaller than with continuous queries, but not necessarily all that short. For a long-running server, it could be weeks.


@pbooth

pbooth commented Oct 12, 2015

Given that this functionality isn't yet available but appears to be recognized as important, what hacks/workarounds are possible for creating summary rollups from influxdb series?

I had been thinking of either:

  1. using a script to periodically issue a query and convert the results into a LineProtocol file that would be uploaded with curl to a different Influx instance
  2. using a script to periodically issue a query and write the results to a whisper DB

Are there any other approaches that people are using?

@ivanscattergood

As I am loading the data from a Java process, I actually use the https://github.com/influxdb/influxdb-java library to generate LineProtocol and backfill the data into the measurement I am using for the continuous query.

@beckettsean
Contributor

#4454 will be a strong mitigation feature for this need

@ryanjin

ryanjin commented Nov 8, 2015

mark

@hoomanv

hoomanv commented Dec 8, 2015

As I understand it, if I write data in batches every 10 seconds and there is a CQ that rolls up over 1-minute intervals, that CQ will possibly miss a few seconds' worth of data (10 seconds in the worst case) at the end of every minute, right?
I suggest a configurable delayed-execution strategy for CQs, so that in practice we allow more data to arrive to fill in the last gaps.

@pauldix pauldix modified the milestones: 0.10.0, 0.9.5 Dec 8, 2015
@beckettsean
Contributor

@hoomanv what you describe is not quite accurate. See https://github.com/influxdb/docs.influxdata.com/blob/extended_cqconfig_options/content/influxdb/v0.9/query_language/continuous_queries_config.md for the work in progress doc that describes the CQ config settings.

If you have the default CQ settings and a 10s CQ, then three queries will run every two minutes, each grabbing 10s worth of points. You will get 30s of good downsampled data and miss the other 90s of each 120s.

In order to actually capture all the data, you need to lower the compute-no-more-than to something like 30s, or raise the recompute-previous-n to something like 12.
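For reference, a sketch of how those two knobs might look in the 0.9-era config file, following the suggestion above (the section header, key names, and defaults are assumptions; check your version's sample config):

```toml
[continuous_queries]
  # Re-run the N most recent windows each time the CQ fires, so that
  # late-arriving points are still picked up (~12 suggested above).
  recompute-previous-n = 12
  # How often a given interval may be recomputed (~30s suggested above).
  compute-no-more-than = "30s"
```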

@hoomanv

hoomanv commented Dec 15, 2015

Thanks @beckettsean, I didn't know about the upcoming CQ configs.

@jwilder jwilder removed this from the 0.10.0 milestone Feb 1, 2016
@zp-markusp

Hi @beckettsean,

Is this still on your roadmap? If so, what's the timeline? We're in a POC phase comparing influxdb and elasticsearch, and the continuous queries feature would be one argument for influxdb if it were feature complete, meaning backfilling were also possible.

@beckettsean
Contributor

@zp-markusp the new CQ syntax allows you to define the look-back interval for each CQ individually. In addition, the INTO keyword, documented on that same page, allows for ad hoc backfill.

There's no mechanism for triggering a backfill based on out-of-order points; the backfill is either always on (CQ) or manually triggered (INTO).
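A manual backfill with INTO looks roughly like this (the measurement, field, and time range are illustrative, echoing examples from earlier in the thread):

```sql
-- Recompute 1h rollups by hand for the interval from the original request
SELECT mean(value) INTO feeds_mean_1h FROM feeds
WHERE time >= '2013-11-01' AND time < '2013-12-01'
GROUP BY time(1h), *
```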

@beckettsean
Contributor

Closing this since the INTO keyword addresses the need. A particular CQ cannot be triggered, but any valid query in a CQ can be run with the INTO keyword to accomplish the same end result.
