Rulegroups #2842

Merged
merged 15 commits into prometheus:dev-2.0 on Jun 19, 2017

6 participants
@gouthamve
Member

gouthamve commented Jun 14, 2017

The initial implementation of rule groups. Some key points:

  1. Group name should be unique in a single file.
  2. Promtool can be used to update the rules.
  3. Uses ghodss/yaml instead of go-yaml; it is already vendored.
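
For illustration only, key point 1 could be checked roughly as follows. This is a minimal sketch, not the code in this PR: the `RuleGroup` type here is a simplified stand-in and `duplicateGroupNames` is a hypothetical helper name.

```go
package main

import "fmt"

// RuleGroup is a simplified, hypothetical stand-in for a parsed rule group.
type RuleGroup struct {
	Name string
}

// duplicateGroupNames returns one error per group name that appears more
// than once in the same file. An empty result means all names are unique.
func duplicateGroupNames(groups []RuleGroup) []error {
	seen := make(map[string]bool, len(groups))
	reported := make(map[string]bool)
	var errs []error
	for _, g := range groups {
		if seen[g.Name] && !reported[g.Name] {
			errs = append(errs, fmt.Errorf("group name %q is repeated in the same file", g.Name))
			reported[g.Name] = true
		}
		seen[g.Name] = true
	}
	return errs
}

func main() {
	groups := []RuleGroup{{Name: "http"}, {Name: "node"}, {Name: "http"}}
	for _, err := range duplicateGroupNames(groups) {
		fmt.Println(err)
	}
}
```

Returning a slice keeps every problem visible in a single pass instead of stopping at the first duplicate.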

cc @brian-brazil @fabxc @beorn7 @juliusv @grobie @SuperQ



fabxc and others added some commits Jun 7, 2017

Goutham Veeramachaneni
Move rules to new format
Signed-off-by: Goutham Veeramachaneni <goutham@boomerangcommerce.com>
Add update-rules command to promtool
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Make sure groups are unique in a single file
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Add tests
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Update check-rules to new format.
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Add License Header
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
@@ -20,8 +20,12 @@ import (
"path/filepath"
"strings"
yaml "github.com/ghodss/yaml"

@brian-brazil

brian-brazil Jun 14, 2017

Member

Why are you using a non-standard yaml library?

@brancz

brancz Jun 14, 2017

Member

I believe the goal was to switch over to this library, as the semantics are much closer to the standard lib encoding/json package, where the Marshal/Unmarshal interface is a lot saner.

For better or worse, the fact that it is already vendored is thanks to kubernetes/client-go.

@brian-brazil

brian-brazil Jun 14, 2017

Member

That's still under discussion, I don't believe we got to consensus there.

@fabxc

fabxc Jun 14, 2017

Member

It was an experiment of mine in the first commit. Turns out, error messages for YAML parsing errors are a lot clearer. Validation and custom marshalling become tons better.

I haven't seen any drawbacks yet.

@fabxc

fabxc Jun 14, 2017

Member

It's just a small wrapper btw.

@fabxc

fabxc Jun 14, 2017

Member

It's a wrapper. It makes error reporting better and lets us do marshalling in a saner way. We should probably migrate other places to it. Look at our config packages. That's where the insanity lies.

@brian-brazil

brian-brazil Jun 14, 2017

Member

If we're going to migrate we should complete the discussion elsewhere and then migrate. Introducing things only in new code has a tendency to end up with a split world for rather a long time.

@fabxc

fabxc Jun 15, 2017

Member

It's vendored already, it's entirely local to the package and gives us an immediate benefit. Any config/ refactoring is far away. I see no reason to not do this.

@brian-brazil

brian-brazil Jun 15, 2017

Member

Changing the rest of our code to also use this new library being something that is far away is a strong reason not to do this.

We do not have consensus on this, and using multiple different libraries within our organisation that do the same thing is to be avoided.

@gouthamve

gouthamve Jun 16, 2017

Member

Moved back to go-yaml. PTAL.

@@ -0,0 +1,59 @@
version: 1

@brian-brazil

brian-brazil Jun 14, 2017

Member

Do we really need a version number at this stage? It's noise.

- alert: HighErrors
  expr: |
    sum without(instance) (rate(errors_total[5m]))
    / sum without(instance) (rate(requests_total[5m]))

@brian-brazil

brian-brazil Jun 14, 2017

Member

nit:

  sum without(instance) (rate(errors_total[5m]))
/ 
  sum without(instance) (rate(requests_total[5m]))

@fabxc

fabxc Jun 14, 2017

Member

I think YAML doesn't handle it if the indentation of the first line is offset like this =/ It's one of the warts.

@grobie

grobie Jun 14, 2017

Member

Ugh, is this confirmed? Any workaround? This would be a serious drawback in readability.

@gouthamve

gouthamve Jun 14, 2017

Member

I did update the file now to reflect something better. Everything has to be in the same indentation level.

@juliusv

juliusv Jun 15, 2017

Member

That kind of constraint goes straight to my worries about nesting one kind of language in another one, especially if the outer one is one that cares about white space like this. I don't know what conclusion to draw from this though, as there might not be a better way.

@@ -270,18 +271,10 @@ func typeForRule(r Rule) ruleType {
// In the future a single group will be evaluated sequentially to properly handle

@brian-brazil

brian-brazil Jun 14, 2017

Member

Comment needs updating

@@ -270,18 +271,10 @@ func typeForRule(r Rule) ruleType {
// In the future a single group will be evaluated sequentially to properly handle
// rule dependency.
func (g *Group) Eval(ts time.Time) {

@brian-brazil

brian-brazil Jun 14, 2017

Member

The different rule groups need to have offset runtimes, like we do for targets.
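
One way to picture offset runtimes is to derive a stable per-group offset from a hash of the group's identity, so that groups sharing an interval do not all fire at the same instant. The sketch below is an assumption about how this could look, not the scheduling logic of this PR or of target scraping; `groupOffset` and its inputs are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// groupOffset maps a group's identity to a stable offset in [0, interval).
// The same (file, name) pair always yields the same offset.
func groupOffset(file, name string, interval time.Duration) time.Duration {
	if interval <= 0 {
		return 0
	}
	h := fnv.New64a()
	h.Write([]byte(file))
	h.Write([]byte{0xff}) // separator so ("ab", "c") and ("a", "bc") hash differently
	h.Write([]byte(name))
	return time.Duration(h.Sum64() % uint64(interval))
}

func main() {
	interval := 15 * time.Second
	for _, name := range []string{"http", "node", "db"} {
		off := groupOffset("example.rules.yaml", name, interval)
		fmt.Printf("group %q starts %v into each interval\n", name, off)
	}
}
```

Because the offset depends only on the hashed identity, it stays the same across restarts as long as that identity does not change.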

}
rules = append(rules, rule)
// Groups need not be unique across filenames.

@brian-brazil

brian-brazil Jun 14, 2017

Member

What is this going to look like in the UI and in metrics?

Will the labels of a group change every time it's rescheduled in a different directory?

I don't think non-unique names work with our current setup.

@gouthamve

gouthamve Jun 14, 2017

Member

What do you mean by labels?

I haven't yet touched the UI. But it would be simple to add the grouping there as I now added the filename also to the Group object. Currently in the UI, all the rules will be listed without any reference to the group/file.

@brian-brazil

brian-brazil Jun 14, 2017

Member

We will have a metric for how long the rule groups took to evaluate. The labels of a given group in that metric should be consistent across Prometheus invocations.

@gouthamve

gouthamve Jun 14, 2017

Member

Can we not have name, filename as the labels?

@brian-brazil

brian-brazil Jun 14, 2017

Member

The filename can change as Prometheus is rescheduled.

}
// Validate the rule and return a list of encountered errors.
func (r *Rule) Validate() (errs []error) {

@brian-brazil

brian-brazil Jun 14, 2017

Member

You need to validate that the label names and annotation names are valid, and for rules that the label values are valid.
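
A rough sketch of such validation, using only the standard library, might look like the following. It assumes that names must match `[a-zA-Z_][a-zA-Z0-9_]*` and that label values must be valid UTF-8; `validateNames` is a hypothetical helper, and the real checks would presumably go through the existing model package rather than a local regexp.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// labelNameRE is an assumed pattern for valid label and annotation names.
var labelNameRE = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

// validateNames collects one error per invalid label name, label value, or
// annotation name. Map iteration order is random, so the order of the
// returned errors is not stable.
func validateNames(labels, annotations map[string]string) []error {
	var errs []error
	for name, value := range labels {
		if !labelNameRE.MatchString(name) {
			errs = append(errs, fmt.Errorf("invalid label name: %q", name))
		}
		if !utf8.ValidString(value) {
			errs = append(errs, fmt.Errorf("invalid label value for %q", name))
		}
	}
	for name := range annotations {
		if !labelNameRE.MatchString(name) {
			errs = append(errs, fmt.Errorf("invalid annotation name: %q", name))
		}
	}
	return errs
}

func main() {
	labels := map[string]string{"severity": "page", "bad-name": "x"}
	annotations := map[string]string{"summary": "High error rate"}
	for _, err := range validateNames(labels, annotations) {
		fmt.Println(err)
	}
}
```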

gouthamve added some commits Jun 14, 2017

Validate labels and annotations
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Add file name to group.
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Reflect the grouping in the UI
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Move back to go-yaml
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
return nil, err
rgs, errs := rulefmt.ParseFile(fn)
if errs != nil {
return nil, tsdb.MultiError(errs)

@fabxc

fabxc Jun 16, 2017

Member

Not a good idea to pull in tsdb as a dep just for a multierror type. In fact, it's probably a good idea to have loadGroups also just return a straight slice of errors so the caller can print a single log line per encountered issue or similar. That's why we made it a slice in rulefmt to begin with.

l := model.LabelSet{
	"name":     model.LabelValue(g.name),
	"filename": model.LabelValue(g.file),
}
return l.Fingerprint()

@fabxc

fabxc Jun 16, 2017

Member

Can we rename it to hash() and use pkg/labels.Labels.Hash()?

@@ -171,16 +176,96 @@ func checkRules(t cli.Term, filename string) (int, error) {
return 0, fmt.Errorf("is a directory")
}
rgs, errs := rulefmt.ParseFile(filename)
if errs != nil {
return 0, tsdb.MultiError(errs)

@fabxc

fabxc Jun 16, 2017

Member

Same here about importing tsdb. It's probably better to return the actual slice so the errors can be logged one per line. tsdb.MultiError will just concatenate them with ;, which will be impossible to debug.
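
To make the difference concrete, here is a small sketch that contrasts the two reporting styles. It does not use tsdb or rulefmt: `parseError`, `joined`, and `perLine` are hypothetical names, and `joined` only imitates the "; "-separated output described above.

```go
package main

import (
	"fmt"
	"strings"
)

// parseError is a minimal error type used only for this illustration.
type parseError string

func (e parseError) Error() string { return string(e) }

// joined collapses all errors into a single "; "-separated message.
func joined(errs []error) string {
	parts := make([]string, 0, len(errs))
	for _, err := range errs {
		parts = append(parts, err.Error())
	}
	return strings.Join(parts, "; ")
}

// perLine produces one message per error, each prefixed with the file name,
// so a caller can emit a separate log line for every problem.
func perLine(filename string, errs []error) []string {
	lines := make([]string, 0, len(errs))
	for _, err := range errs {
		lines = append(lines, fmt.Sprintf("%s: %v", filename, err))
	}
	return lines
}

func main() {
	errs := []error{
		parseError("group \"http\": rule 2: invalid expression"),
		parseError("group \"node\": rule 1: invalid label name"),
	}
	fmt.Println("joined:  ", joined(errs))
	for _, line := range perLine("example.rules.yaml", errs) {
		fmt.Println("per line:", line)
	}
}
```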

}
return len(rules), nil
return ioutil.WriteFile(filename+".yaml", y, 0777)

@fabxc

fabxc Jun 16, 2017

Member

0666 for files, 0777 for dirs. That's what we are going with in general I think.
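
As a sketch of that convention (not the promtool code itself): the mode passed to the write call is only an upper bound, and the process umask usually narrows it, for example to 0644 for files. `writeRuleFile` and `writeAndStat` are hypothetical helpers, and the file content is a placeholder.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeRuleFile creates the parent directory with 0777 and the file with
// 0666. The umask still applies, so the resulting modes are usually stricter.
func writeRuleFile(path string, data []byte) error {
	if err := os.MkdirAll(filepath.Dir(path), 0777); err != nil {
		return err
	}
	return os.WriteFile(path, data, 0666)
}

// writeAndStat writes a sample file under a temporary directory and returns
// the permission bits the file ended up with.
func writeAndStat() (os.FileMode, error) {
	dir, err := os.MkdirTemp("", "rules")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "out", "example.rules.yaml")
	if err := writeRuleFile(path, []byte("# placeholder\n")); err != nil {
		return 0, err
	}
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return info.Mode().Perm(), nil
}

func main() {
	mode, err := writeAndStat()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("file mode after umask: %v\n", mode)
}
```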

}
yamlRG.Groups[0].Rules = yamlRules
y, err := yaml.Marshal(yamlRG)

@fabxc

fabxc Jun 16, 2017

Member

Because we are now using the plain YAML lib again this will generate rule files with basically random key ordering right?

@brian-brazil if yes, this is pretty bad UX just because we don't want to use a 100 LOC wrapper lib that's vendored already anyway.

@gouthamve

gouthamve Jun 16, 2017

Member

Using go-yaml the fields are ordered as we declare them. ghodss/yaml produces the fields in alphabetical order.
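
The mechanism behind that difference can be illustrated with the standard library alone. The sketch below does not call either YAML library; it assumes that the alphabetical output comes from round-tripping the value through a generic map, whose keys are emitted in sorted order, whereas a struct is emitted in field declaration order. The `rule` type and both helpers are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rule is a hypothetical two-field rule, declared with record before expr.
type rule struct {
	Record string `json:"record"`
	Expr   string `json:"expr"`
}

// marshalStruct encodes the struct directly; fields keep declaration order.
func marshalStruct(r rule) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

// marshalViaMap round-trips through a generic map first; encoding/json
// writes map keys in sorted order, so the declaration order is lost.
func marshalViaMap(r rule) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	var generic map[string]interface{}
	if err := json.Unmarshal(b, &generic); err != nil {
		return "", err
	}
	b, err = json.Marshal(generic)
	return string(b), err
}

func main() {
	r := rule{Record: "job:up:sum", Expr: "sum(up)"}
	direct, _ := marshalStruct(r)
	viaMap, _ := marshalViaMap(r)
	fmt.Println(direct) // record first, then expr
	fmt.Println(viaMap) // expr first, then record
}
```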

@fabxc

fabxc Jun 16, 2017

Member

Okay, let's move forward then. Somewhat sure I've seen things wildly mixed, but maybe I recall incorrectly.

@brian-brazil

The offset logic isn't included here yet.

@@ -243,6 +328,11 @@ func main() {
Run: CheckMetricsCmd,
})
app.Register("update-rules", &cli.Command{
Desc: "update the rules to the new YAML format",

@brian-brazil

brian-brazil Jun 16, 2017

Member

update rule files to...

to be consistent with the above.

type RuleGroup struct {
	Name     string         `yaml:"name"`
	Interval model.Duration `yaml:"interval,omitempty"`
	Rules    []Rule         `yaml:"rules"`

@brian-brazil

brian-brazil Jun 16, 2017

Member

This should have the XXX checkOverflow stuff.
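
For readers unfamiliar with the pattern: the idea is to collect any keys the struct does not declare into a catch-all map (conventionally a field named XXX) and reject the input if that map is non-empty. The sketch below shows only the check itself, under the assumption that the catch-all map has already been filled during unmarshalling; the exact signature and message of the existing helper may differ.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// checkOverflow returns an error naming every unexpected key, or nil if the
// catch-all map is empty. Keys are sorted so the message is deterministic.
func checkOverflow(extra map[string]interface{}, ctx string) error {
	if len(extra) == 0 {
		return nil
	}
	keys := make([]string, 0, len(extra))
	for k := range extra {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return fmt.Errorf("unknown fields in %s: %s", ctx, strings.Join(keys, ", "))
}

func main() {
	// A misspelled key such as "intervall" would land in the catch-all map.
	extra := map[string]interface{}{"intervall": "30s"}
	if err := checkOverflow(extra, "rule group"); err != nil {
		fmt.Println(err)
	}
}
```

Without such a check, a typo in a key name is silently ignored and the field keeps its zero value.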

var (
wg sync.WaitGroup
)
for i, rule := range g.rules {

@brian-brazil

brian-brazil Jun 16, 2017

Member

Should we check done after each rule?

@gouthamve

gouthamve Jun 16, 2017

Member

I don't see why we need to. This is all going sequentially and the function wouldn't exit until all the rules are evaluated.

@fabxc

fabxc Jun 16, 2017

Member

I think Brian's point was to actually exit if done indicates the group should terminate.
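
A minimal sketch of that idea, assuming `done` is a channel that is closed when the group should stop: a non-blocking select before each rule lets a sequential loop bail out early. `evalUntilDone` and its parameters are hypothetical and stand in for the group's evaluation loop.

```go
package main

import "fmt"

// evalUntilDone evaluates rules in order and checks done before each one.
// It returns how many rules were evaluated before termination was requested.
func evalUntilDone(rules []string, done <-chan struct{}, eval func(string)) int {
	for i, r := range rules {
		select {
		case <-done:
			return i // stop requested: skip the remaining rules
		default:
		}
		eval(r)
	}
	return len(rules)
}

func main() {
	done := make(chan struct{})
	rules := []string{"rule-a", "rule-b", "rule-c"}
	n := evalUntilDone(rules, done, func(r string) {
		fmt.Println("evaluating", r)
		if r == "rule-b" {
			close(done) // simulate a shutdown arriving mid-evaluation
		}
	})
	fmt.Println("rules evaluated:", n)
}
```

This keeps shutdown latency bounded by the duration of a single rule rather than the whole group.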

@gouthamve
Incorporate PR feedback
* Move fingerprint to Hash()
* Move away from tsdb.MultiError
* 0777 -> 0666 for files
* checkOverflow of extra fields

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
@fabxc

Member

fabxc commented Jun 16, 2017

I don't think we need offset evaluation for the first iteration. Let's focus on working down breaking changes for now.

@gouthamve

@fabxc

Member

fabxc commented Jun 16, 2017

Mh, I thought this was alluding to "offset evaluation" where we evaluate recording rules not against now but now-5m for example. That can be relevant for metrics that are lagging behind like Cloudwatch.

Check done before every rule evaluation.
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
@brian-brazil

Member

brian-brazil commented Jun 16, 2017

No, that's what I was talking about. I'd thought it hadn't been merged into this PR.

My "offset eval" is still at the idea stage, I want to wait until we've got cloudwatch&friends outputting timestamps and see if it still makes sense.

gouthamve added some commits Jun 19, 2017

Merge remote-tracking branch 'upstream/dev-2.0' into rulegroups
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Remove version from RuleGroups
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
@fabxc

Member

fabxc commented Jun 19, 2017

👍

@fabxc fabxc merged commit ab1bc9b into prometheus:dev-2.0 Jun 19, 2017

2 of 3 checks passed

ci/circleci: CircleCI is running your tests
codacy/pr: Good work! A positive pull request.
continuous-integration/travis-ci/pr: The Travis CI build passed