Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rulegroups #2842

Merged
merged 15 commits into from
Jun 19, 2017
Merged

Rulegroups #2842

merged 15 commits into from
Jun 19, 2017

Conversation

gouthamve
Copy link
Member

@gouthamve gouthamve commented Jun 14, 2017

The initial implementation of rule-groups. Some keypoints:

  1. Group name should be unique in a single file.
  2. Promtool can be used to update the rules.
  3. Uses ghodss/yaml instead of go-yaml, I can see that it is already vendored.

cc @brian-brazil @fabxc @beorn7 @juliusv @grobie @SuperQ


This change is Reviewable

fabxc and others added 7 commits June 7, 2017 17:06
Signed-off-by: Goutham Veeramachaneni <goutham@boomerangcommerce.com>
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
@@ -20,8 +20,12 @@ import (
"path/filepath"
"strings"

yaml "github.com/ghodss/yaml"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you using a non-standard yaml library?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the goal was to switch over to this library, as the semantics are much closer to the standard lib encoding/json package, where the Marshal/Unmarshal interface is a lot saner.

For the better or worse, the fact that it is already vendored is thanks to kubernetes/client-go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's still under discussion, I don't believe we got to consensus there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was an experiment of mine in the first commit. Turns out, error messages for YAML parsing errors are a lot clearer. Validation and custom marshalling becomes tons better.

I haven't seen any drawbacks yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a small wrapper btw.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a wrapper. It makes error reporting better, it allows us to do marshalling saner. We should probably migrate other places to it. Look at our config packages. That's where the insanity lies.

Copy link
Contributor

@brian-brazil brian-brazil Jun 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to migrate we should complete the discussion elsewhere and then migrate. Introducing things only in new code has a tendency to end up with a split world for rather a long time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's vendored already, it's entirely local to the package and gives us an immediate benefit. Any config/ refactoring is far away. I see no reason to not do this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the rest of our code to also use this new library being something that is far away is a strong reason not to do this.

We do not have consensus on this, and using multiple different libraries within our organisation that do the same thing is to be avoided.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved back to go-yaml. PTAL.

@@ -0,0 +1,59 @@
version: 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need a version number at this stage? It's noise.

- alert: HighErrors
expr: |
sum without(instance) (rate(errors_total[5m]))
/ sum without(instance) (rate(requests_total[5m]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

  sum without(instance) (rate(errors_total[5m]))
/ 
  sum without(instance) (rate(requests_total[5m]))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think YAML doesn't get it if the indentation on the first line is offset by this =/ one of the warts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, is this confirmed? Any workaround? This would be a serious drawback in readability.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did update the file now to reflect something better. Everything has to be in the same indentation level.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That kind of constraint goes straight to my worries about nesting one kind of language in another one, especially if the outer one is one that cares about white space like this. I don't know what conclusion to draw from this though, as there might not be a better way.

rules/manager.go Outdated
@@ -270,18 +271,10 @@ func typeForRule(r Rule) ruleType {
// In the future a single group will be evaluated sequentially to properly handle
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment needs updating

rules/manager.go Outdated
@@ -270,18 +271,10 @@ func typeForRule(r Rule) ruleType {
// In the future a single group will be evaluated sequentially to properly handle
// rule dependency.
func (g *Group) Eval(ts time.Time) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The different rule groups need to have offset runtimes, like we do for targets.

rules/manager.go Outdated
}
rules = append(rules, rule)

// Groups need not be unique across filenames.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this going to look like in the UI and in metrics?

Will the labels of a group change every time it's rescheduled in a different directory?

I don't think non-unique names works with our current setup.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean labels?

I haven't yet touched the UI. But it would be simple to add the grouping there as I now added the filename also to the Group object. Currently in the UI, all the rules will be listed without any reference to the group/file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will have a metric for how long the rule groups took to evaluate. The labels of a given group in that metric should be consistent across Prometheus invocations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not have name, filename as the labels?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filename can change as Prometheus is rescheduled.

}

// Validate the rule and return a list of encountered errors.
func (r *Rule) Validate() (errs []error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to validate that the labelnames/annotations names are valid, and for rules that label values are valid.

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
rules/manager.go Outdated
return nil, err
rgs, errs := rulefmt.ParseFile(fn)
if errs != nil {
return nil, tsdb.MultiError(errs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a good idea to pull in tsdb as a dep just for a multierror type. In fact, it's probably a good idea to have loadGroups also just returns a straight slice of errors so the caller can print a single log line per encountered issue or similar. That's why we made it a slice in rulefmt to begin with.

rules/manager.go Outdated
l := model.LabelSet{
"name": model.LabelValue(g.name),
"filename": model.LabelValue(g.file),
}
return l.Fingerprint()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename it to hash() and use pkg/labels.Labels.Hash()?

@@ -171,16 +176,96 @@ func checkRules(t cli.Term, filename string) (int, error) {
return 0, fmt.Errorf("is a directory")
}

rgs, errs := rulefmt.ParseFile(filename)
if errs != nil {
return 0, tsdb.MultiError(errs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same about import tsdb for this and probably better returning the actual slice so the errors can be logged as one per line. tsdb.MultiError will just concat them with ; which will be impossible to debug with.

}
return len(rules), nil

return ioutil.WriteFile(filename+".yaml", y, 0777)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0666 for files, 0777 for dirs. That's what we are going with in general I think.

}

yamlRG.Groups[0].Rules = yamlRules
y, err := yaml.Marshal(yamlRG)
Copy link
Contributor

@fabxc fabxc Jun 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we are now using the plain YAML lib again this will generate rule files with basically random key ordering right?

@brian-brazil if yes, this is pretty bad UX just because we don't want to use a 100 LOC wrapper lib that's vendored already anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using go-yaml the fields are ordered as we declare them. ghodss/yaml produces the fields in alphabetical order.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, let's move forward then. Somewhat sure I've seen things wildely mixed – but maybe I recall incorrectly.

Copy link
Contributor

@brian-brazil brian-brazil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The offset logic isn't included here yet.

@@ -243,6 +328,11 @@ func main() {
Run: CheckMetricsCmd,
})

app.Register("update-rules", &cli.Command{
Desc: "update the rules to the new YAML format",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update rule files to...

to be consistent with the above.

type RuleGroup struct {
Name string `yaml:"name"`
Interval model.Duration `yaml:"interval,omitempty"`
Rules []Rule `yaml:"rules"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have the XXX checkOverflow stuff.

var (
wg sync.WaitGroup
)

for i, rule := range g.rules {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check done after each rule?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why we need to. This is all going sequentially and the function wouldn't exit until all the rules are evaluated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Brian's point was to actually do exit if done indicates the group should terminate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* Move fingerprint to Hash()
* Move away from tsdb.MultiError
* 0777 -> 0666 for files
* checkOverflow of extra fields

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
@fabxc
Copy link
Contributor

fabxc commented Jun 16, 2017

I don't think we need offset evaluation for the first iteration. Let's focus on working down breaking changes for now.

@gouthamve
Copy link
Member Author

@fabxc
Copy link
Contributor

fabxc commented Jun 16, 2017

Mh, I thought this was alluding to "offset evaluation" where we evaluate recording rules not against now but now-5m for example. That can be relevant for metrics that are lagging behind like Cloudwatch.

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
@brian-brazil
Copy link
Contributor

No, that's what I was talking about. I'd thought it hadn't been merged into this PR.

My "offset eval" is still at the idea stage, I want to wait until we've got cloudwatch&friends outputting timestamps and see if it still makes sense.

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
@fabxc
Copy link
Contributor

fabxc commented Jun 19, 2017

👍

@fabxc fabxc merged commit ab1bc9b into prometheus:dev-2.0 Jun 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants