
Metrics created by rules should keep the timestamp #386

Closed
grobie opened this Issue Mar 28, 2014 · 8 comments

@grobie
Member

grobie commented Mar 28, 2014

When using a rule expression to create a metric, it shouldn't matter how long the server needs to create the new data: data points should be added at the same time. At the moment, rule-created metrics lag behind.

These two graphs show the same data. One queries the time series directly with sum(foo) by (bar), the other uses the rule-defined metric job:foo_by_bar:sum.

[Screenshot from 2014-03-28 16:52:26 comparing the two graphs]
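
For reference, the rule-defined metric in that comparison would presumably be produced by a recording rule along these lines (reconstructed from the metric name above, not taken from the actual rule file), in the rule-file syntax of the time:

job:foo_by_bar:sum = sum(foo) by (bar)

Both graphs should therefore describe the same data; only the timestamps of the samples differ.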

@matttproud

Member

matttproud commented Mar 28, 2014

On what order of magnitude is the lag you're seeing?


@juliusv

Member

juliusv commented Mar 28, 2014

It's probably just the rule evaluation interval being out of sync with the scrape interval, which then interpolates points a bit differently. The lag you see is probably at most the rule evaluation interval? Will check with you in more detail later. It might also be related to dependent rules not being topologically sorted.

@grobie

Member Author

grobie commented Mar 28, 2014

Yeah, my assumption was that it's the rule evaluation time visible here. Depending on the rule and server, that can be more than a minute, as in this case.

Given a low scrape interval (15s, for example), I'd expect to see the data being inserted at the scrape timestamp, regardless of how long it takes to store the information or evaluate any rules.


@matttproud

Member

matttproud commented Mar 29, 2014

Thanks for reporting this. This is indeed worth fixing as a matter of promoting predictability for users.

Can you give some more detail about the load the server is experiencing: number of targets, number of rules, relative complexity of the rules, target collection frequency, and evaluation frequency? Maybe, if you are OK with it, include a snapshot of the problematic Prometheus server's internal metrics?

The pinning of times to samples is a no-brainer, but handling the backlogging is, well, quite concerning.


@juliusv

Member

juliusv commented Mar 29, 2014

@grobie I think there are some misunderstandings here about how rules work.

The timestamp assigned to samples recorded by rules is already independent of how long it takes to evaluate the rules. A timestamp is chosen at the beginning of a rule evaluation cycle, the expression is evaluated at that point in time, and then that same timestamp is attached to the resulting samples. Even if a rule took 5 minutes to evaluate, its timestamp would be the same as that of one that took 1 second.

The reason why the rule timestamp conceptually cannot be the timestamp of the underlying samples is that the underlying samples have many different timestamps. Imagine you are summing over the time series from 100 different targets. There is then not one scrape timestamp but 100, while the new samples recorded by the rule can only carry a single timestamp in the end. For that reason, rule evaluations and their timestamps are decoupled from scrapes and happen at regular intervals of their own.
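
As a purely hypothetical illustration (instance names and clock times are made up; the rule is the one from the original report):

foo{bar="a", instance="host1:8080"}    scraped at 12:00:03
foo{bar="a", instance="host2:8080"}    scraped at 12:00:11
foo{bar="b", instance="host3:8080"}    scraped at 12:00:14

job:foo_by_bar:sum = sum(foo) by (bar)    evaluation cycle starts at 12:00:15

Every job:foo_by_bar:sum sample produced in that cycle carries the single timestamp 12:00:15, regardless of when the underlying series were scraped or how long the evaluation itself took.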

@grobie In your case, you are scraping targets every 30 seconds and also evaluating rules every 30 seconds. Complete rule evaluation cycles are taking 0.5s at the 50th percentile and 2s at the 99th percentile, so the time it takes to evaluate the rules is certainly not connected to the lag you are seeing. I think what you are seeing is a combination of two things: the rule evaluation cycle happening up to 30s after the scrapes, and Prometheus' handling of chains of rules (data dependencies) not being very robust yet. I.e. what can happen is that if you have rules like:

B = A
C = B

...at the time when C = B is evaluated, it might not yet have access to the results of the B = A evaluation above it, simply because those samples haven't made it to storage yet (when one rule finishes evaluating, it sends its samples asynchronously to storage, with no hard guarantee of when they will arrive, because that happens in another thread/goroutine). That means it would use the B value from the previous rule evaluation cycle to compute the result. For each link in a "chain" of rules like this, you can then accumulate up to 30s of extra delay.
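
As a rough worked example of how that adds up, assuming the 30s scrape and evaluation intervals mentioned above (the exact offsets are hypothetical):

scrape of the underlying series    t = 0s
cycle N evaluates B = A            up to ~30s later
cycle N evaluates C = B            same cycle, but B's fresh samples may not be in storage yet, so C is computed from the previous cycle's B
cycle N+1 evaluates C = B          up to ~60s after the scrape, now using cycle N's B

Each extra link in such a chain can therefore add up to one evaluation interval (30s here) of lag on top of the normal scrape-to-evaluation offset, which is in the same ballpark as the lag described in the original report.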

We should probably detect data dependencies like this and have something akin to a "memory barrier" (http://en.wikipedia.org/wiki/Memory_barrier) take effect for them.

For the time being, if you want the most time-accurate results you can get, I guess you'll want to avoid such chains of rules and instead write out the full expression in each of them, i.e. have data dependencies only on scraped data, not on rule-computed data.
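
A hypothetical sketch of the difference in the rule-file syntax of the time (the rule names are made up; only the shape matters). Chained, where the second rule depends on the first rule's output:

job:foo_by_bar:sum = sum(foo) by (bar)
job:foo:sum = sum(job:foo_by_bar:sum)

Flattened, so that every rule depends only on scraped data:

job:foo_by_bar:sum = sum(foo) by (bar)
job:foo:sum = sum(sum(foo) by (bar))

The flattened version evaluates the inner aggregation twice, but both recorded series are derived directly from scraped samples within the same evaluation cycle, so neither can pick up a stale intermediate result.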

@juliusv

Member

juliusv commented Mar 29, 2014

There's also #17, which is closely related to this, but probably not the problem in this case, as your rules are already topologically sorted in your file.

@juliusv

Member

juliusv commented Jul 22, 2015

The asynchronous storage appends are fixed (except for the case of completely new time series appearing during a rule evaluation cycle and not being indexed yet), and the remaining timestamp behavior described here is basically working as intended for now (or tracked in other issues). Closing this one.

@juliusv juliusv closed this Jul 22, 2015

simonpasquier pushed a commit to simonpasquier/prometheus that referenced this issue Oct 12, 2017

@lock


lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019
