Rule evaluation failure after reload with newly added rules #5193
Comments
simonpasquier added the kind/bug and component/rules labels on Feb 8, 2019
simonpasquier commented Feb 8, 2019
So I've done some investigation, and it was "expected" that when rules are moved around, the first evaluation after the reload may fail... The corner case was documented in 1692f3c (see also 6f813b4), but I don't think it is called out in the docs. That being said, the code has changed a bit since then and rules within a group are no longer evaluated in parallel, so I guess that the stale markers could be inserted after all rules of the group have been evaluated. @brian-brazil what do you think?
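As a hedged illustration of the "moved around" corner case described above, consider a rule group like the following; the rule names and expressions here are hypothetical and not taken from this issue. If the two rules swap places in the file and the config is reloaded, the comment above says the first evaluation after the reload may fail before the next cycle recovers.

    groups:
      - name: example
        rules:
          # hypothetical rules, for illustration only
          - record: job:http_requests:rate5m
            expr: sum by (job) (rate(http_requests_total[5m]))
          - record: job:http_errors:rate5m
            expr: sum by (job) (rate(http_errors_total[5m]))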
brian-brazil commented
That would break rules that depended on previous results in the group.
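For context on why that matters: rules in a group are evaluated in order, so a later rule can use the series recorded by an earlier rule in the same evaluation cycle. A minimal sketch, again with hypothetical rule names:

    groups:
      - name: chained
        rules:
          # hypothetical example: the second rule reads the series
          # produced by the first rule in the same evaluation cycle
          - record: job:http_requests:rate5m
            expr: sum by (job) (rate(http_requests_total[5m]))
          - record: job:http_requests:rate5m:doubled
            expr: job:http_requests:rate5m * 2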
dbrinegar commented Feb 9, 2019
Another thought is special-case handling on config reload. Keep the manager logic as is, but when the config reloads, dump the rule caches: write them all as stale at the time of the config reload and empty the previously-seen caches. Then on the next eval loop everything carries on from scratch, with abandoned time series marked stale and resumed time series carrying on after the NaN.
brian-brazil commented
I don't see why we'd write things as stale; we generally want to avoid any changes when there's a reload and the rules remain the same.
dbrinegar commented Feb 11, 2019
Yes, that's a bit of a brute-force approach with blind writes. Perhaps we could make the config-reload handler more nuanced and try to remap the old rule caches onto the new rules, so that where a rule has simply moved in the list it carries on with the previous rule's cache. Old rules that were deleted or changed would then be left over after the remap, and their caches would be candidates to mark stale. We could mark them stale ahead of the new rule eval (a smaller blind pass), or keep them for reconciliation against the results of the new eval.

I can see trade-offs either way, but maybe reconciliation at the end of the new eval is the happy path, since it lets an old rule's results be handed to a new rule without losing continuity. That path minimizes stale writes, and it handles this reorder failure case as well as the case of terminating the output of a deleted rule. The main point of this approach is that the normal path is completely untouched; we only do reconciliation on config reload. Just thinking out loud (and happy to dig into details and test cases in a write-up if the approach sounds worth considering further).
brian-brazil commented
That sounds kinda like what we do for scrapes, so a similar approach could work. I wouldn't worry about supporting reorders much; they should be rare.
jsravn added a commit to jsravn/prometheus that referenced this issue on Mar 15, 2019
jsravn added a commit to jsravn/prometheus that referenced this issue on Mar 15, 2019
jsravn added a commit to jsravn/prometheus that referenced this issue on Mar 15, 2019
jsravn commented
I made a potential fix in #5368.
jsravn commented Feb 8, 2019 (edited)

Bug Report

What did you do?

1. docker run -p 9090:9090 prom/prometheus:v2.7.1
2. In /etc/prometheus/prometheus.yml, enable the - "first_rules.yml" rule file entry.
3. Write /etc/prometheus/first_rules.yml (contents omitted) and reload with kill -SIGHUP 1. Everything is fine.
4. Update /etc/prometheus/first_rules.yml to add a new rule (contents omitted) and reload with kill -SIGHUP 1. The instance: my-second rule is missing at that timestamp.

What did you expect to see?

I expected to see no errors, and data for instance: my-second at the evaluation time. Although they have the same identifier test, they should be unique time series as they have different labels.

What did you see instead? Under which circumstances?

A logged error, and missing data for the instance: my-second rule at the evaluation time.

More details
Environment
docker container
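Since the rule file contents are not preserved above, here is a minimal sketch of rule files that would match this description. Only the metric name test and the label instance: my-second come from the report; everything else (the group name, the expressions, and the instance: my-first label) is assumed for illustration.

Initial /etc/prometheus/first_rules.yml:

    # hypothetical reconstruction; only the record name "test" and the
    # label instance: my-second below are taken from the report
    groups:
      - name: example
        rules:
          - record: test
            expr: vector(1)
            labels:
              instance: my-first

Updated /etc/prometheus/first_rules.yml, adding a second rule that records the same metric name with a different label:

    groups:
      - name: example
        rules:
          - record: test
            expr: vector(1)
            labels:
              instance: my-first
          - record: test
            expr: vector(1)
            labels:
              instance: my-second

After the second kill -SIGHUP 1, the report says an error is logged and the instance: my-second series is missing for that first evaluation.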