Follow-up on limiting the number of alerts / rules #9342

Open
roidelapluie opened this issue Sep 15, 2021 · 16 comments · Fixed by #9541

Comments

@roidelapluie
Member

This is a follow-up on PR #9260.

I think we would benefit from a small dedicated section in the docs that explains the semantics behind this. In particular:

  • How to monitor this (the evaluation-failure metric; see the sketch after this list)
  • What happens when you hit the limit:
    • alerts are no longer seen as active or pending; they just disappear
    • this is an evaluation error; no stale markers are written for the rules
  • What happens when you recover (alerts go back to the pending state)
  • Some caveats (I think recently resolved alerts, with state=inactive, still count toward the limit)
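A minimal sketch of such meta-monitoring, assuming Prometheus scrapes itself. The metric prometheus_rule_evaluation_failures_total exists today; the window, threshold, and labels below are illustrative assumptions, and note the counter increments for any evaluation failure, not only limit violations:

groups:
- name: meta-monitoring
  rules:
  # Illustrative only: fires when any rule group keeps failing evaluation,
  # which includes groups that exceeded their alert/series limit.
  - alert: RuleEvaluationFailures
    expr: increase(prometheus_rule_evaluation_failures_total[5m]) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Rule group {{ $labels.rule_group }} is failing evaluation"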
@LeviHarrison
Member

(from #9260 (comment))

Hmm, this pull request takes into consideration the "inactive alerts", which are neither pending nor firing. I feel like they should not count, WDYT?

In the original issue (#9225), the problem was /api/v1/rules becoming unusable. I'm pretty sure that endpoint returns inactive rules, so if we didn't include them in the count, there could still be overload.

@LeviHarrison
Member

We should definitely be explicit about this though.

@Rudy167

Rudy167 commented Oct 16, 2021

@roidelapluie @LeviHarrison
After following this thread, I tried to write a rule file:

groups:
- name: my-project
  limit: 5
  rules:
  # Alert for any instance that is unreachable for >1s.
  - alert: InstanceDown
    expr: up == 0
    for: 1s
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 second."

but there seems to be some issue with the syntax for the limit here, as the Prometheus server keeps exiting on start after I add the limit property.

I would appreciate your guidance; I'm really keen on contributing more!

@LeviHarrison
Member

Could you please provide any logs from the server?

@Rudy167

Rudy167 commented Oct 16, 2021

Could you please provide any logs from the server?

I was running the server inside a Docker container; attached are the error logs. I am using the latest version (hopefully).
error.logs.txt

@LeviHarrison
Member

You're running v2.30.3 of Prometheus. This feature hasn't made it into a release yet.

@LeviHarrison
Member

If you want to try it out, you'll have to build from the main branch, or use one of the main Docker images.

@Rudy167

Rudy167 commented Oct 16, 2021

If you want to try it out, you'll have to build from the main branch, or use one of the main Docker images.

Thanks @LeviHarrison, running the main image worked! But in the rules UI, shouldn't we see the limit property too?
Also, ideally the firing alerts should disappear after 1s if the limit is 1 (for a 1s interval); please correct me if I am wrong (a sketch of this scenario follows below).

[Screenshot: the rules UI]

PS: I am a little new to this paradigm, so it's highly likely that I might not make sense at times.
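A minimal sketch of the scenario being discussed, per the semantics described at the top of this issue. The group name and target setup are assumptions; at least two down targets are needed so the rule produces more alerts than the group's limit:

groups:
- name: limit-demo
  limit: 1
  rules:
  - alert: InstanceDown
    # With two or more instances down, this rule produces more than one
    # alert, the group's evaluation errors out, and the alerts disappear
    # rather than staying pending or firing.
    expr: up == 0
    for: 1s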

@LeviHarrison
Member

The limit belongs to the rule group, and the other numerical field, interval, isn't displayed on the rules UI either.

What your comment did make me realize, though, is that I forgot to add limit to the rules API (https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#rule_group). If you're interested in doing that, you would need to add the field to the RuleGroup struct:

prometheus/web/api/v1/api.go

Lines 1141 to 1152 in 4414351

// RuleGroup has info for rules which are part of a group
type RuleGroup struct {
    Name string `json:"name"`
    File string `json:"file"`
    // In order to preserve rule ordering, while exposing type (alerting or recording)
    // specific properties, both alerting and recording rules are exposed in the
    // same array.
    Rules          []rule    `json:"rules"`
    Interval       float64   `json:"interval"`
    EvaluationTime float64   `json:"evaluationTime"`
    LastEvaluation time.Time `json:"lastEvaluation"`
}

assign it in the API response here:

prometheus/web/api/v1/api.go

Lines 1198 to 1205 in 4414351

apiRuleGroup := &RuleGroup{
    Name:           grp.Name(),
    File:           grp.File(),
    Interval:       grp.Interval().Seconds(),
    Rules:          []rule{},
    EvaluationTime: grp.GetEvaluationTime().Seconds(),
    LastEvaluation: grp.GetLastEvaluation(),
}

and add the field to the API docs (with a note that 0 means there is no limit): https://github.com/prometheus/prometheus/blob/main/docs/querying/api.md#rules
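A minimal sketch of the change described above (the exact field placement, the JSON key, and the grp.Limit() accessor are assumptions; the merged PR may differ):

// RuleGroup gains a Limit field, serialized as "limit".
type RuleGroup struct {
    Name           string    `json:"name"`
    File           string    `json:"file"`
    Rules          []rule    `json:"rules"`
    Interval       float64   `json:"interval"`
    Limit          int       `json:"limit"` // 0 means no limit
    EvaluationTime float64   `json:"evaluationTime"`
    LastEvaluation time.Time `json:"lastEvaluation"`
}

// ...and the corresponding assignment in the API response:
apiRuleGroup := &RuleGroup{
    Name:           grp.Name(),
    File:           grp.File(),
    Interval:       grp.Interval().Seconds(),
    Limit:          grp.Limit(), // assumes rules.Group exposes a Limit() accessor
    Rules:          []rule{},
    EvaluationTime: grp.GetEvaluationTime().Seconds(),
    LastEvaluation: grp.GetLastEvaluation(),
}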

@Rudy167

Rudy167 commented Oct 18, 2021

I was able to do the above; I used Gitpod for testing (finally read CONTRIBUTING.md), and limit is showing in the responses of /rules.
Also, are there any unit test fixes which should be done?

I am planning to raise separate PRs for the documentation and the code changes; please let me know if that is okay, and thanks for the help!
Update: PR #9528

@LeviHarrison
Member

That sounds great. Thanks so much! There might be a test to update in web/api/v1/api_test.go, but I'm not at my computer right now so I can't link it.

When you get around to opening a PR please make sure to sign off on the DCO in your commits.

@Rudy167

Rudy167 commented Oct 18, 2021

Done; I will resolve any issues that come up: #9531

@Rudy167

Rudy167 commented Oct 19, 2021

@LeviHarrison, could you have a look at my commit before I open a PR again? Rudy167@2207b36

I had trouble running the linter and test cases on Gitpod (for linting I got this warning: WARN [runner] The linter 'golint' is deprecated (since v1.41.0) due to: The repository of the linter has been archived by the owner. Replaced by revive.)

Let me know if everything looks OK so I can raise a PR again.

@LeviHarrison
Member

LeviHarrison commented Oct 25, 2021

@Rudy167 friendly ping :)

It would be great to get this in before the upcoming release.

@Rudy167

Rudy167 commented Oct 26, 2021

@Rudy167 friendly ping :)

It would be great to get this in before the upcoming release.
@LeviHarrison
Ohh, I thought it was already taken care of, as the thread got closed. I am sorry, give me a while.

roidelapluie reopened this Oct 26, 2021
@roidelapluie
Member Author

Sorry, I have re-opened it.

This was referenced Jan 10, 2022