
Proposal: A way to determine rule group order for important rule groups #4727

Closed
SpencerMalone opened this Issue Oct 11, 2018 · 5 comments

SpencerMalone commented Oct 11, 2018

Proposal

Use case. Why is this important?
We have some core rule groups that define common functionality shared between services. This gives us a single place to keep and work with our core rules. For example:

All of our web apps use traefik, so we can create one set of rules defining a web application's general availability that takes in a service specification, and each service can define its own alerting levels. It'd be nice to be able to tweak these core rulesets in one place for everyone involved, but as it stands we either have a single very large ruleset for everyone, accept that data will sometimes be out of order if the rules are loaded in the wrong order, or have rule duplication all over the place.

Here's a YAML example of what we do right now, accepting that the order will sometimes be wrong. Ours is a little more complicated, with two or three layers of dependent rules in the general ruleset before we get to a service-specific definition, but you get the idea:

## General Ruleset
# This is in case we swap our LB stack for our apps; we want to be able to keep the old rules without changing a bunch of them
  - record: web_requests
    expr: traefik_requests_total

  - record: service:slo_errors_per_request:ratio_rate5m
    expr: |
      sum(rate(web_request_errors[5m])) by (service)
      /
      sum(rate(web_requests[5m])) by (service)

And then a specific service might say...

  - alert: myservice_error_rate
    expr: service:slo_errors_per_request:ratio_rate5m{service="myservice.com/"} > (14.4*0.001)
    labels:
      alertroute: my-route
      severity: production

in its own unique rule group.
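To make that concrete, here's roughly how those two snippets sit as complete rule groups today (the file and group names below are illustrative, not our real ones). Because the alert in the second group depends on the recording rule produced by the first, the result depends on the order and timing in which the two groups happen to be evaluated:

# general.rules.yml (illustrative file name) -- shared recording rules
groups:
- name: general.rules
  rules:
  - record: web_requests
    expr: traefik_requests_total
  - record: service:slo_errors_per_request:ratio_rate5m
    expr: |
      sum(rate(web_request_errors[5m])) by (service)
      /
      sum(rate(web_requests[5m])) by (service)

# myservice.rules.yml (illustrative file name) -- service-specific alerting
groups:
- name: myservice.rules
  rules:
  - alert: myservice_error_rate
    expr: service:slo_errors_per_request:ratio_rate5m{service="myservice.com/"} > (14.4*0.001)
    labels:
      alertroute: my-route
      severity: production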

My thought is that maybe we could do a numerical ordering system on rule groups, with the default being to run first, alongside groups with an order of 1? Ex:

groups:
- name: thishappensfirst.rules
  order: 1
  rules: ...
- name: thishappenssecond.rules
  order: 2
  rules: ...
- name: thishappensfirstbydefault.rules
  rules: ...
brian-brazil (Member) commented Oct 11, 2018

have rule duplication all over the place.

I'd suggest duplicating the rules, with each scoped to the relevant service within its group. This has the advantage of spreading the CPU load around.
You might also want to look at https://www.robustperception.io/using-time-series-as-alert-thresholds
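A rough sketch of that per-service duplication (the group name and the service label matcher are illustrative): each group carries its own scoped copy of the shared rules, so everything the service's alert depends on is evaluated sequentially within that one group.

groups:
- name: myservice.rules
  rules:
  # Scoped copies of the shared recording rules, restricted to this service.
  - record: web_requests
    expr: traefik_requests_total{service="myservice.com/"}
  - record: service:slo_errors_per_request:ratio_rate5m
    expr: |
      sum(rate(web_request_errors{service="myservice.com/"}[5m])) by (service)
      /
      sum(rate(web_requests{service="myservice.com/"}[5m])) by (service)
  # The alert now depends only on rules evaluated earlier in the same group.
  - alert: myservice_error_rate
    expr: service:slo_errors_per_request:ratio_rate5m{service="myservice.com/"} > (14.4*0.001)
    labels:
      alertroute: my-route
      severity: production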

My thought is that maybe we could do a numerical ordering system on rule groups, with the default being to run first, alongside groups with an order of 1?

You can already do this by putting all the constituent rules in one group.
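For reference, a minimal sketch of that single-group layout (the group name is made up). Rules within one group are evaluated sequentially in the order they are listed, so the alert always sees the recording rules computed in the same cycle:

groups:
- name: combined.rules
  rules:
  - record: web_requests
    expr: traefik_requests_total
  - record: service:slo_errors_per_request:ratio_rate5m
    expr: |
      sum(rate(web_request_errors[5m])) by (service)
      /
      sum(rate(web_requests[5m])) by (service)
  # Service-specific alerts live in the same group, after the rules they depend on.
  - alert: myservice_error_rate
    expr: service:slo_errors_per_request:ratio_rate5m{service="myservice.com/"} > (14.4*0.001)
    labels:
      alertroute: my-route
      severity: production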

SpencerMalone (Author) commented Oct 11, 2018

But if we duplicate the rules everywhere, then when the time comes to change them (we find a better model, the data changes, etc.) we'd have to do it in many rule groups all at once, which gets daunting as the number of rule groups grows. I guess that's my biggest goal: avoiding that scenario.

If that doesn't fit the Prometheus model, that's OK, we can manage; it just felt like something that was jumping out at us as we try to expand internal adoption.

brian-brazil (Member) commented Oct 11, 2018

That's a configuration management problem, and I'm afraid those are out of scope for Prometheus.

SpencerMalone (Author) commented Oct 11, 2018

I... don't entirely feel satisfied by that answer, but that's fair enough. Thanks for your time! My preference would be to leave this issue open for a few days to hear other peeps' input (or workarounds) if they exist, but I'll leave it to y'all to tell me if you'd prefer this issue be closed.

juliusv (Member) commented Oct 12, 2018

I was expecting @brian-brazil to give that answer, though I can see how that wasn't the one you hoped for :)

I also don't think introducing that extra level of complexity in Prometheus would be justified quite yet. It's the first time I'm hearing about this need, so I'm not sure how common it is (and whether, if it's not super common, it's solvable with configuration management instead)...
