Follow-up on limiting the number of alerts / rules #9342

Open
roidelapluie opened this issue Sep 15, 2021 · 16 comments · Fixed by #9541

Comments

@roidelapluie
Member

This is a follow-up on PR #9260.

I think we would benefit from a small dedicated section in the docs that explains the semantics behind this. In particular:

  • How to monitor this (the evaluation-failure metric; see the sketch after this list)
  • What happens when you hit the limit:
    • alerts are no longer seen as active or pending; they just disappear
    • this is an evaluation error; no stale markers are written for the rules
  • What happens when you recover (alerts go back to the pending state)
  • Some caveats (I think recently resolved alerts, with state=inactive, still count toward the limit)
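A minimal sketch of such meta-monitoring, assuming Prometheus scrapes itself. The metric prometheus_rule_evaluation_failures_total exists today; the window, threshold, and labels below are illustrative assumptions, and note the counter increments for any evaluation failure, not only limit violations:

groups:
- name: meta-monitoring
  rules:
  # Illustrative only: fires when any rule group keeps failing evaluation,
  # which includes groups that exceeded their alert/series limit.
  - alert: RuleEvaluationFailures
    expr: increase(prometheus_rule_evaluation_failures_total[5m]) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Rule group {{ $labels.rule_group }} is failing evaluation"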
@LeviHarrison
Member

(from #9260 (comment))

Hmm, this pull request takes into consideration the "inactive alerts", which are neither pending nor firing. I feel like they should not count, WDYT?

In the original issue (#9225), the problem was /api/v1/rules becoming unusable. I'm pretty sure that endpoint returns inactive rules, so if we didn't include them in the count, there could still be overload.

@LeviHarrison
Member

We should definitely be explicit about this though.

@Rudy167

Rudy167 commented Oct 16, 2021

@roidelapluie @LeviHarrison
After following this thread, I tried to write a rule file:

groups:
- name: my-project
  limit: 5
  rules:
  # Alert for any instance that is unreachable for >1s.
  - alert: InstanceDown
    expr: up == 0
    for: 1s
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 second."

but there seems to be some issue with the syntax for the limit here, as the Prometheus server keeps exiting on start after I add the limit property.

I would appreciate your guidance; I'm really keen on contributing more!

@LeviHarrison
Member

Could you please provide any logs from the server?

@Rudy167

Rudy167 commented Oct 16, 2021

Could you please provide any logs from the server?

I was running the server inside a Docker container; attached are the error logs. I am using the latest version (hopefully).
error.logs.txt

@LeviHarrison
Member

You're running v2.30.3 of Prometheus. This feature hasn't made it into a release yet.

@LeviHarrison
Member

If you want to try it out, you'll have to build from the main branch, or use one of the main Docker images.

@Rudy167

Rudy167 commented Oct 16, 2021

If you want to try it out, you'll have to build from the main branch, or use one of the main Docker images.

Thanks @LeviHarrison, running the main image worked! But in the rules UI, shouldn't we see the limit property too?
Also, ideally the firing alerts should disappear after 1s if the limit is 1 (for a 1s interval); please correct me if I am wrong (a sketch of this scenario follows below).

[Screenshot: the rules UI]

PS: I am a little new to this paradigm, so it's highly likely that I might not make sense at times.
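A minimal sketch of the scenario being discussed, per the semantics described at the top of this issue. The group name and target setup are assumptions; at least two down targets are needed so the rule produces more alerts than the group's limit:

groups:
- name: limit-demo
  limit: 1
  rules:
  - alert: InstanceDown
    # With two or more instances down, this rule produces more than one
    # alert, the group's evaluation errors out, and the alerts disappear
    # rather than staying pending or firing.
    expr: up == 0
    for: 1s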

@LeviHarrison
Member

The limit belongs to the rule group, and the other numerical field, interval, isn't displayed on the rules UI either.

What your comment did make me realize, though, is that I forgot to add limit to the rules API (https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#rule_group). If you're interested in doing that, you would need to add the field to the RuleGroup struct:

prometheus/web/api/v1/api.go

Lines 1141 to 1152 in 4414351

// RuleGroup has info for rules which are part of a group
type RuleGroup struct {
    Name string `json:"name"`
    File string `json:"file"`
    // In order to preserve rule ordering, while exposing type (alerting or recording)
    // specific properties, both alerting and recording rules are exposed in the
    // same array.
    Rules          []rule    `json:"rules"`
    Interval       float64   `json:"interval"`
    EvaluationTime float64   `json:"evaluationTime"`
    LastEvaluation time.Time `json:"lastEvaluation"`
}

assign it in the API response here:

prometheus/web/api/v1/api.go

Lines 1198 to 1205 in 4414351

apiRuleGroup := &RuleGroup{
    Name:           grp.Name(),
    File:           grp.File(),
    Interval:       grp.Interval().Seconds(),
    Rules:          []rule{},
    EvaluationTime: grp.GetEvaluationTime().Seconds(),
    LastEvaluation: grp.GetLastEvaluation(),
}

and add the field to the API docs (with a note that 0 means there is no limit): https://github.com/prometheus/prometheus/blob/main/docs/querying/api.md#rules
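A minimal sketch of the change described above (the exact field placement, the JSON key, and the grp.Limit() accessor are assumptions; the merged PR may differ):

// RuleGroup gains a Limit field, serialized as "limit".
type RuleGroup struct {
    Name           string    `json:"name"`
    File           string    `json:"file"`
    Rules          []rule    `json:"rules"`
    Interval       float64   `json:"interval"`
    Limit          int       `json:"limit"` // 0 means no limit
    EvaluationTime float64   `json:"evaluationTime"`
    LastEvaluation time.Time `json:"lastEvaluation"`
}

// ...and the corresponding assignment in the API response:
apiRuleGroup := &RuleGroup{
    Name:           grp.Name(),
    File:           grp.File(),
    Interval:       grp.Interval().Seconds(),
    Limit:          grp.Limit(), // assumes rules.Group exposes a Limit() accessor
    Rules:          []rule{},
    EvaluationTime: grp.GetEvaluationTime().Seconds(),
    LastEvaluation: grp.GetLastEvaluation(),
}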

@Rudy167

Rudy167 commented Oct 18, 2021

I was able to do the above; I used Gitpod for testing (finally read CONTRIBUTING.md), and limit is showing in the responses of /rules.
Also, are there any unit test fixes which should be done?

I am planning to raise separate PRs for the documentation and the code changes; please let me know if that is okay, and thanks for the help!
Update: PR #9528

@LeviHarrison
Member

That sounds great. Thanks so much! There might be a test to update in web/api/v1/api_test.go, but I'm not at my computer right now so I can't link it.

When you get around to opening a PR please make sure to sign off on the DCO in your commits.

@Rudy167

Rudy167 commented Oct 18, 2021

Done; I will resolve any issues that come up: #9531

@Rudy167

Rudy167 commented Oct 19, 2021

@LeviHarrison, could you have a look at my commit before I open a PR again? Rudy167@2207b36

I had trouble running the linter and test cases on Gitpod (for linting I got this warning: WARN [runner] The linter 'golint' is deprecated (since v1.41.0) due to: The repository of the linter has been archived by the owner. Replaced by revive.)

Let me know if everything looks OK so I can raise a PR again.

@LeviHarrison
Member

LeviHarrison commented Oct 25, 2021

@Rudy167 friendly ping :)

It would be great to get this in before the upcoming release.

@Rudy167

Rudy167 commented Oct 26, 2021

@Rudy167 friendly ping :)

It would be great to get this in before the upcoming release.
@LeviHarrison
Ohh, I thought it was already taken care of, as the thread got closed. I am sorry, give me a while.

roidelapluie reopened this Oct 26, 2021
@roidelapluie
Member Author

Sorry, I have re-opened it.

This was referenced Jan 10, 2022