
Persist alert 'for' state across restarts #4061

Merged (6 commits, Aug 2, 2018)

Conversation

@codesome (Member) commented Apr 8, 2018

Addresses #422

Design:

  • Store the time (ActiveAt) in nanoseconds in the series when the alert goes to the Pending or Firing state. We add this sample to the series on every Eval.
  • If it goes to the Inactive state, store -1 in the series.
  • While restoring the 'for' state, we query the time range (currentTime-holdDuration, currentTime); if we find any samples in the series, we restore ActiveAt accordingly.

[EDIT] Some design decisions were changed, hence the above is not the final design that was implemented.
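As a rough sketch of this proposal (the helper name and body below are hypothetical; only the ALERTS_FOR_STATE series name and the nanosecond encoding come from this PR):

package sketch

import (
	"time"

	"github.com/prometheus/prometheus/pkg/labels"
	"github.com/prometheus/prometheus/promql"
)

// forStateSample sketches the proposal: one extra sample per active alert,
// written on every Eval, whose value is the time the alert became active
// (in nanoseconds, as originally proposed).
func forStateSample(alertLabels labels.Labels, ts, activeAt time.Time) promql.Sample {
	lb := labels.NewBuilder(alertLabels)
	lb.Set(labels.MetricName, "ALERTS_FOR_STATE")
	return promql.Sample{
		Metric: lb.Labels(),
		Point:  promql.Point{T: ts.UnixNano() / int64(time.Millisecond), V: float64(activeAt.UnixNano())},
	}
}

On restore, this series would be queried over the recent past and the value of the latest sample taken as ActiveAt.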

@gouthamve (Member) left a comment:

Looks good overall, just minor comments!

@@ -248,7 +248,7 @@ func main() {
)

ruleManager = rules.NewManager(&rules.ManagerOptions{
Appendable: fanoutStorage,
Storage: fanoutStorage,
Member:

Not sure we want to use fanoutStorage

Contributor:

This will need its own, as this should use only local storage.

@@ -276,6 +299,7 @@ func (r *AlertingRule) Eval(ctx context.Context, ts time.Time, query QueryFunc,
}

vec = append(vec, r.sample(a, ts))
vec = append(vec, r.forStateSample(a, ts, float64(a.ActiveAt.UnixNano())))
Member:

Curious if the floating point inaccuracies will mess something up later.

Member Author:

Floating point actually doesn't cause much harm (maybe a few nanoseconds?). We lose a little precision when converting nanoseconds to milliseconds here, so at most we may end up losing a few milliseconds (1-2), which should be fine?

In my tests with multiple restarts, I saw a divergence of at most 2 milliseconds from the actual time.
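For reference, a small runnable check of the float64 round-trip error for a nanosecond Unix timestamp (the millisecond conversion mentioned above is a separate, larger loss; the chosen timestamp is only illustrative):

package main

import (
	"fmt"
	"time"
)

func main() {
	// A timestamp around the time of this PR, with sub-second precision.
	ts := time.Date(2018, 4, 8, 0, 0, 0, 123456789, time.UTC)

	ns := ts.UnixNano()        // ~1.5e18, needs about 61 bits
	back := int64(float64(ns)) // round-trip through float64 (53-bit significand)

	fmt.Println("original :", ns)
	fmt.Println("roundtrip:", back)
	fmt.Println("error    :", back-ns, "ns") // at most a few hundred ns at this magnitude
}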

Contributor:

It'll make no practical difference.

However you should be using seconds, as that's the Prometheus standard.

rules/manager.go Outdated

_, err := alertRule.Eval(m.opts.Context, ts, newg.opts.QueryFunc, newg.opts.ExternalURL)
if err != nil {
level.Error(m.logger).Log("msg", "Failed to restore 'for' state",
Member:

Add the alerting rule also to the logs.

Member Author:

Done.

rules/manager.go Outdated

sset, err := q.Select(nil, matchers...)
if err != nil {
level.Error(m.logger).Log("msg", "Failed to restore 'for' state",
Member:

Here too.

Member Author:

Done.

rules/manager.go Outdated
maxtMS := int64(model.TimeFromUnixNano(ts.UnixNano()))
for _, newg := range m.groups {
for _, rule := range newg.Rules() {
if alertRule, ok := rule.(*AlertingRule); ok {
Member:

You can reduce nesting by doing:

alertRule, ok := rule.(*AlertingRule)
if !ok {
  continue
}

Member Author:

Done.

@brian-brazil (Contributor) left a comment:

Thanks for looking at this. I don't think this PR will work in real-world situations, as a Prometheus that was down for a while and came back up will have no recent data.


@@ -266,6 +288,7 @@ func (r *AlertingRule) Eval(ctx context.Context, ts time.Time, query QueryFunc,
if a.State != StateInactive {
a.State = StateInactive
a.ResolvedAt = ts
vec = append(vec, r.forStateSample(a, ts, -1))
Contributor:

This doesn't seem right to me, staleness handling should cover this.

Member Author:

Will it also handle when we query over a range (which is being done when restoring)?

If we consider this case

  • for_duration: 10m
  • Event timeline:
    for_active <1m elapsed> for_inactive <1m elapsed> prom_goes_down <6m elapsed> prom_is_up and for_is_active on startup

And if by default I check for samples within the last 1h, I will still get samples.

In such a case, if we don't add a sample when the 'for' state goes inactive, we might end up restoring the wrong 'for' state and causing premature alerts.

Contributor:

Let me put it another way. You're duplicating the functionality of staleness, don't do that.

rules/manager.go Outdated
if !ok {
continue
}
_, err := alertRule.Eval(m.opts.Context, ts, newg.opts.QueryFunc, newg.opts.ExternalURL)
Contributor:

Right at startup there won't be any recent scraped data if Prometheus has been down a while, so this would always return an empty set of alerts (except for expressions using absent and friends). This logic will have to wait at least until the first evaluation, which we may want to delay at least one scrape interval.

Member Author:

One simple solution I can think of is to first wait here until the db is open. Then, inside ruleManager.Run(), delay for one scrape interval and move the restoring part inside run() of every group. Does that sound right?

Contributor:

That should do it offhand, just be careful that there's no restoration logic if a new group is added due to a config reload.

Contributor:

Alerts could also depend on recording rules, so we really need to run those too. So we'll probably need a bit of a different control flow here

Member Author:

Another way is:

  1. Rule groups perform the first Eval as usual.
  2. Then we perform restoration once all groups are done with the first Eval (a barrier).
  3. After the restoration we continue the evaluations as usual.

Does this flow sound right? During restoration I am currently doing a dry Eval without persisting the samples, just to have the alerts updated in the rule. With this new flow the dry run can be skipped.
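A minimal sketch of the barrier flow described above, assuming hypothetical Group methods (Eval, RestoreForState, runLoop); this is not the code the PR ends up with:

package sketch

import (
	"sync"
	"time"
)

// Group is a stand-in for rules.Group; the methods below are placeholders.
type Group struct{}

func (g *Group) Eval(ts time.Time) {} // one evaluation of all rules in the group
func (g *Group) RestoreForState()  {} // read ALERTS_FOR_STATE and set ActiveAt
func (g *Group) runLoop()          {} // periodic evaluation loop

// runGroups runs the first Eval of every group, waits on a barrier until all
// groups have finished it, restores the 'for' state, then continues as usual.
func runGroups(groups []*Group) {
	var firstEval sync.WaitGroup
	firstEval.Add(len(groups))

	for _, g := range groups {
		go func(g *Group) {
			g.Eval(time.Now())

			firstEval.Done()
			firstEval.Wait() // barrier across all groups

			g.RestoreForState()
			g.runLoop()
		}(g)
	}
}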

Contributor:

Something like that sounds sane as a first pass. Even if an alert isn't firing from the first eval we can restore it, so it can potentially work on the 2nd (the 1st eval may not have enough data to work with to produce all the alerts it should).

rules/manager.go Outdated
mint := alertRule.SubtractHoldDuration(ts)
mintMS := int64(model.TimeFromUnixNano(mint.UnixNano()))

q, err := m.opts.Storage.Querier(m.opts.Context, mintMS, maxtMS)
Contributor:

Why are you using the for length here? This value should be tied to how long a Prometheus outage we want to tolerate, not the various values the for will have. This should come from a flag, and I'd suggest a 1 hour default.
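A minimal sketch of that suggestion, with illustrative names (only model.TimeFromUnixNano and the 1h default come from the surrounding discussion):

package sketch

import (
	"time"

	"github.com/prometheus/common/model"
)

// queryWindow bounds the restore query by a flag-controlled outage tolerance
// (e.g. 1h) instead of each rule's hold duration.
func queryWindow(ts time.Time, outageTolerance time.Duration) (mintMS, maxtMS int64) {
	mint := ts.Add(-outageTolerance)
	mintMS = int64(model.TimeFromUnixNano(mint.UnixNano()))
	maxtMS = int64(model.TimeFromUnixNano(ts.UnixNano()))
	return mintMS, maxtMS
}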

rules/manager.go Outdated
mt, _ := labels.NewMatcher(labels.MatchEqual, l.Name, l.Value)
matchers = append(matchers, mt)
}
for _, l := range a.Labels {
Contributor:

Wouldn't this cover all the alertRule.labels?

Contributor:

This is still not right, alert labels are not static.

Member Author:

Labels in the alert are added after expanding (here), and given that it is from the last Eval, won't these be the expanded and new labels?

rules/manager.go Outdated
alertFunc := func(a *Alert) {

var matchers []*labels.Matcher
for _, l := range alertRule.labels {
Contributor:

These contain templates that need evaluation, you need to work off the result of the rule evaluation.

rules/manager.go Outdated
return
}

// Series found for the 'for' state.
Contributor:

The selector could return series with labels beyond the ones you specified with matchers, you need to extract just the one you want.
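A hedged sketch of that filtering step, written against the storage and labels packages but with an illustrative helper name:

package sketch

import (
	"github.com/prometheus/prometheus/pkg/labels"
	"github.com/prometheus/prometheus/storage"
)

// pickSeries walks the SeriesSet returned by Select and keeps only the series
// whose label set exactly equals the wanted one (the alert's labels plus
// __name__="ALERTS_FOR_STATE"), since equality matchers alone can still admit
// series that carry additional labels.
func pickSeries(sset storage.SeriesSet, want labels.Labels) (storage.Series, bool) {
	for sset.Next() {
		s := sset.At()
		if labels.Equal(s.Labels(), want) {
			return s, true
		}
	}
	return nil, false
}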

rules/manager.go Outdated
@@ -520,9 +521,91 @@ func (m *Manager) Update(interval time.Duration, files []string) error {
wg.Wait()
m.groups = groups

ts := time.Now()
m.restoreForState(ts)
Contributor:

This should only run at startup, not on every rule file change

rules/manager.go Outdated
}

restoredTime := model.TimeFromUnixNano(int64(v))
a.ActiveAt = restoredTime.Time()
Contributor:

You need a grace period here. For example if we're 59m into a 60m for, then we probably don't want to immediately alert if it's still happening after Prometheus has been down for a while. I'd suggest making sure there's always at least 10m (flag controlled) left on an alert.

Tracking this state as the time at which the alert will fire is probably going to be easier.
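A rough sketch of such a grace-period clamp, with illustrative names (the PR's later code exposes the flag value as m.opts.ForGracePeriod):

package sketch

import "time"

// restoreActiveAt applies the suggested grace period: if fewer than
// gracePeriod would remain of the hold duration after the restart, push the
// restored ActiveAt forward so the alert still has at least gracePeriod left
// before it can fire.
func restoreActiveAt(now, restored time.Time, holdDuration, gracePeriod time.Duration) time.Time {
	timeRemaining := restored.Add(holdDuration).Sub(now)
	if holdDuration > gracePeriod && timeRemaining < gracePeriod {
		return now.Add(gracePeriod).Add(-holdDuration)
	}
	return restored
}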

a.Flag("alertmanager.outage-tolerance", "Max time to tolerate prometheus outage for restoring 'for' state of alert.").
Default("1h").SetValue(&cfg.outageTolerance)

a.Flag("alertmanager.for-grace-period", "Minimum duration between alert and restored 'for' state. Applies only for alerts with 'for' time greater than grace period.").
Contributor:

That's not quite right. If I have an alert with a for of 30m and it was 29m pending, then we should have to wait 10m before setting it to fire.

Member Author:

Yes, it does that. Maybe the description was not clear.

Are you referring to "Applies only for alerts with 'for' time greater than grace period"?
If yes, then I actually meant the for duration set in the config, and not the for duration remaining.

rules/manager.go Outdated
if err != nil {
level.Warn(g.logger).Log("msg", "creating appender failed", "err", err)
return
}
tsdbApp, err := g.opts.TSDB.Appender()
Contributor:

You don't need a 2nd appender for this, use the usual one.


rules/manager.go Outdated
// No series found for the 'for' state,
// hence getting active for the first time.

level.Error(m.logger).Log("msg", "No series found")
Contributor:

This is not an error.

rules/manager.go Outdated
var v float64
s := sset.At()
it := s.Iterator()
for it.Next() {
Contributor:

You need to use the labels for these samples.

@codesome force-pushed the persist-for-state branch 3 times, most recently from a55679c to cc0cf37 on May 7, 2018 17:45
@codesome force-pushed the persist-for-state branch 2 times, most recently from 2b0e57c to 37cbaa1 on May 16, 2018 02:40
@codesome (Member Author):

@brian-brazil PTAL, I have addressed some of the comments since last time. These 2 comments were not very clear to me though:
#4061 (comment)
#4061 (comment)

@@ -153,6 +155,25 @@ func (r *AlertingRule) sample(alert *Alert, ts time.Time) promql.Sample {
return s
}

// forStateSample returns the sample for ALERTS_FOR_STATE.
// v should be -1 for StateInactive, UnixNano for StatePending and StateFiring.
Contributor:

There should be no value for inactive, we don't know the labels of inactive alerts

Member Author:

Forgot to remove that comment. Currently staleness handling takes care of it.

@@ -276,6 +298,7 @@ func (r *AlertingRule) Eval(ctx context.Context, ts time.Time, query QueryFunc,
}

vec = append(vec, r.sample(a, ts))
vec = append(vec, r.forStateSample(a, ts, float64(a.ActiveAt.Unix())))
Contributor:

As we'll be bumping this forwards at restart, the intended firing time might be easier to understand. It's weird for ActiveAt to change

Member Author:

If we want to change the intended firing time, it would require changing holdDuration, which won't be right. I can't think of any other way to achieve this without adding another variable to the AlertingRule for an additional expected firing time.

rules/manager.go Outdated
restoredTime = ts.Add(m.opts.ForGracePeriod).Add(-alertHoldDuration)
}
a.ActiveAt = restoredTime
level.Info(m.logger).Log("msg", "'for' state restored",
Contributor:

Debug, and include the labels and timestamps

rules/manager.go Outdated

restoredTime := time.Unix(int64(v), 0)
if alertHoldDuration > m.opts.ForGracePeriod && restoredTime.Add(alertHoldDuration).Sub(ts) < m.opts.ForGracePeriod {
restoredTime = ts.Add(m.opts.ForGracePeriod).Add(-alertHoldDuration)
Contributor:

I don't think this is right, it always waits another 10m. If I've a 2h alert that I'm only 5m into and do an instant restart, I should have to wait the remaining 1h55m - not only 10m.

Member Author:

restoredTime.Add(alertHoldDuration).Sub(ts) < m.opts.ForGracePeriod takes care of it.
restoredTime.Add(alertHoldDuration).Sub(ts) will be 1h55m, which is > grace period, so it won't go inside the if block.

To explain the logic more clearly:
If the alert became active at 0 and Prometheus was down for 5m, then:

  1. restoredTime will be 0.
  2. alertHoldDuration > m.opts.ForGracePeriod: we check that the hold duration is greater than the grace period before applying the grace-period logic.
  3. restoredTime.Add(alertHoldDuration).Sub(ts) can be viewed as:

// ts = current time
  restoredTime                        ts                      firing wrt restoredTime
       |-------------------------------|-------------------------------|

       ^-----------------------alertHoldDuration-----------------------^

                                       ^-------------------------------^
                                     restoredTime.Add(alertHoldDuration).Sub(ts)

       ^--------------5m--------------^^-------------1h55m-------------^

  4. ts.Add(m.opts.ForGracePeriod).Add(-alertHoldDuration): this makes the alert fire at ts.Add(m.opts.ForGracePeriod).

As restoredTime.Add(alertHoldDuration).Sub(ts) (1h55m) > m.opts.ForGracePeriod (10m), the if block won't run.
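To make the two cases concrete, a small runnable check of the condition as written in the diff, with durations taken from the examples in this thread:

package main

import (
	"fmt"
	"time"
)

// clampNeeded mirrors the condition from the diff:
// alertHoldDuration > ForGracePeriod && restoredTime.Add(alertHoldDuration).Sub(ts) < ForGracePeriod
func clampNeeded(holdDuration, elapsedSinceActive, gracePeriod time.Duration) bool {
	timeRemaining := holdDuration - elapsedSinceActive
	return holdDuration > gracePeriod && timeRemaining < gracePeriod
}

func main() {
	gracePeriod := 10 * time.Minute

	// for = 2h, 5m elapsed, instant restart: 1h55m remain, which is more than
	// the grace period, so no clamp and the alert waits the remaining 1h55m.
	fmt.Println(clampNeeded(2*time.Hour, 5*time.Minute, gracePeriod)) // false

	// for = 30m, 29m elapsed (including downtime): only 1m would remain, so
	// the clamp applies and the alert waits the full 10m grace period.
	fmt.Println(clampNeeded(30*time.Minute, 29*time.Minute, gracePeriod)) // true
}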

Contributor:

It'll never go inside the if for an alert with more than a 10m for - which can't be right.

Member Author:

Isn't it the other way round?
alertHoldDuration > m.opts.ForGracePeriod specifies that it goes inside the if only when the for is >10m. If the for was <10m, then keeping the alert pending for another 10m wouldn't be right, hence it doesn't go inside in that case.

Contributor:

That doesn't seem consistent with your previous explanation. I don't think this code can be right, as it's discarding the timestamp of the sample it read from the tsdb.

Member Author:

"it's discarding the timestamp of the sample"

The timestamp of the sample is used indirectly when querying. As we query for ALERTS_FOR_STATE within the outage tolerance period, the timestamps of the samples come into play in the query processing.

If you are talking about explicitly using the timestamp of the sample, it's not needed, as ActiveAt is the value of the sample.

I have updated the comments to make the logic clearer. Please let me know which part doesn't seem fine.

rules/manager.go Outdated
if alertHoldDuration > m.opts.ForGracePeriod && restoredTime.Add(alertHoldDuration).Sub(ts) < m.opts.ForGracePeriod {
restoredTime = ts.Add(m.opts.ForGracePeriod).Add(-alertHoldDuration)
}
a.ActiveAt = restoredTime
Contributor:

Should this be persisted? Will it clash with the value the Eval wrote out?

Member Author:

The Eval done during restoration doesn't persist the samples. But even if the manager's Eval wrote before restoration, it still won't clash, I guess. In the next Eval, samples will be written accordingly, with either pending or firing.

If you are talking about persisting ActiveAt, we might not want to persist this: if Prometheus goes down a 2nd time while the alert is still active, we will want to restore ActiveAt based on when it first became active, and not on the last time the Prometheus server was up, as the alert never became inactive in between.

@codesome force-pushed the persist-for-state branch 4 times, most recently from 6c2cf6d to 5746914 on June 10, 2018 19:51

rules/manager.go Outdated
var v float64
it := s.Iterator()
for it.Next() {
_, v = it.At()
Contributor:

I still don't see how this can be right. You're not using the timestamp at which the alert last fired.

Member Author:

Before I answer this I have one doubt:
If the alert was firing when Prometheus went down, should it be firing again just after restoration, or should it wait for the grace period?

Contributor:

It should still be firing if it's still evaluating to true, I'd say, though if it waits for the grace period that's not the end of the world.

Member Author:

I had assumed that it should wait for the grace period even in that case. If we have to wait for the grace period, we actually don't need to know if the alert was firing. With ActiveAt, we can work out when it should be firing, and then handle the grace period accordingly.

If we want it to fire immediately after restoration, then we might want to use the timestamp of the sample.

I guess we would like it to fire right after restoration, as you said. I will make the necessary changes.

@codesome force-pushed the persist-for-state branch 6 times, most recently from 7bc8686 to 0679177 on June 17, 2018 21:49
@codesome (Member Author):

@brian-brazil I have updated the PR. Now the timestamp of the sample is being used to check for 'firing' state. Control flow has been modified to restore the state after the first Eval.

@@ -484,6 +623,7 @@ func (m *Manager) Update(interval time.Duration, files []string) error {
}
return errors.New("error loading rules, previous rule set restored")
}
m.restored = true
Contributor:

Shouldn't this be set before loadGroups?

Member Author:

From what I understand, Update is called when loading the rule files for the first time too. So we have to set m.restored to true after the first load, right? Or are the rule files loaded somewhere else for the first time?

rules/manager.go Outdated
@@ -209,6 +216,14 @@ func (g *Group) run(ctx context.Context) {
}
lastTriggered := time.Now()
iter()
if g.shouldRestore {
// Waiting for all other groups to finish first evalution.
Contributor:

I'm not sure we should have groups depending on each other, I'd have each group handle this independently.

Member Author:

Yes it was not required, I have removed it now.

rules/manager.go Outdated
if g.shouldRestore {
// Waiting for all other groups to finish first evalution.
g.firstIterWG.Done()
g.firstIterWG.Wait()
Contributor:

Passing waitgroups around like this is odd; they typically live entirely inside one function.

@@ -276,6 +306,9 @@ func (r *AlertingRule) Eval(ctx context.Context, ts time.Time, query QueryFunc,
}

vec = append(vec, r.sample(a, ts))
if r.restored {
Contributor:

We probably shouldn't write out the above line either

rules/manager.go Outdated

alertHoldDuration := alertRule.HoldDuration()

alertRule.ForEachActiveAlert(func(a *Alert) {
Contributor:

This is still presuming that the first evaluation was firing, which it might not be due to not having enough data yet.

Member Author:

The first evaluation happens after waiting for the eval interval; won't we have enough samples by then? If not, we can restore after the 2nd eval. Does that sound fine?

Contributor:

After the 2nd is probably best

Member Author:

Done

rules/manager.go Outdated
var v float64
var t int64
it := s.Iterator()
for it.Next() {
Contributor:

You should check the iterator for an error too
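For example, a minimal sketch against the old storage.SeriesIterator shape (helper name illustrative):

package sketch

import "github.com/prometheus/prometheus/storage"

// lastSample drains a series iterator, keeping the final (timestamp, value)
// pair, and surfaces any iterator error instead of silently dropping it.
func lastSample(it storage.SeriesIterator) (t int64, v float64, err error) {
	for it.Next() {
		t, v = it.At()
	}
	return t, v, it.Err()
}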

rules/manager.go Outdated
crashTime := time.Unix(t/1000, 0)
if crashTime.Sub(restoredActiveAt) >= alertHoldDuration {
// The alert was firing when it crashed. Hence we need not wait for grace
// period, we can immediatly make it fire.
Contributor:

I think we should still apply the same logic here, and wait the 10m. Just because it was firing an hour ago doesn't mean the condition is still happening.

Member Author:

Okay. In that case we have no use for the timestamp of the sample, as we wait for 10 minutes in any case.

rules/manager.go Outdated
// ^-------------------------------^
// restoredActiveAt.Add(alertHoldDuration).Sub(ts)
//
// Hence `restoredActiveAt.Add(alertHoldDuration).Sub(ts)` is the time left
Contributor:

I think I see where the disconnect is. If an alert has a for 6h, I think it must spend at least 6h firing. If the Prometheus is down for half an hour, it now needs to spend at least 6h+30m+10m firing. This will reduce false positives due to alerts resolving while Prometheus is down.

Member Author:

With firing do you mean pending time? I did not quite follow what you meant.