Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rulegroups #2842

Merged
merged 15 commits into from Jun 19, 2017
Merged

Rulegroups #2842

merged 15 commits into from Jun 19, 2017

Conversation

@gouthamve
Copy link
Member

@gouthamve gouthamve commented Jun 14, 2017

The initial implementation of rule-groups. Some keypoints:

  1. Group name should be unique in a single file.
  2. Promtool can be used to update the rules.
  3. Uses ghodss/yaml instead of go-yaml, I can see that it is already vendored.

cc @brian-brazil @fabxc @beorn7 @juliusv @grobie @SuperQ


This change is Reviewable

fabxc and others added 7 commits Jun 7, 2017
Signed-off-by: Goutham Veeramachaneni <goutham@boomerangcommerce.com>
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Add tests
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
@@ -20,8 +20,12 @@ import (
"path/filepath"
"strings"

yaml "github.com/ghodss/yaml"

This comment has been minimized.

@brian-brazil

brian-brazil Jun 14, 2017
Contributor

Why are you using a non-standard yaml library?

This comment has been minimized.

@brancz

brancz Jun 14, 2017
Member

I believe the goal was to switch over to this library, as the semantics are much closer to the standard lib encoding/json package, where the Marshal/Unmarshal interface is a lot saner.

For the better or worse, the fact that it is already vendored is thanks to kubernetes/client-go.

This comment has been minimized.

@brian-brazil

brian-brazil Jun 14, 2017
Contributor

That's still under discussion, I don't believe we got to consensus there.

This comment has been minimized.

@fabxc

fabxc Jun 14, 2017
Contributor

It was an experiment of mine in the first commit. Turns out, error messages for YAML parsing errors are a lot clearer. Validation and custom marshalling becomes tons better.

I haven't seen any drawbacks yet.

This comment has been minimized.

@fabxc

fabxc Jun 14, 2017
Contributor

It's just a small wrapper btw.

This comment has been minimized.

@fabxc

fabxc Jun 14, 2017
Contributor

It's a wrapper. It makes error reporting better, it allows us to do marshalling saner. We should probably migrate other places to it. Look at our config packages. That's where the insanity lies.

This comment has been minimized.

@brian-brazil

brian-brazil Jun 14, 2017
Contributor

If we're going to migrate we should complete the discussion elsewhere and then migrate. Introducing things only in new code has a tendency to end up with a split world for rather a long time.

This comment has been minimized.

@fabxc

fabxc Jun 15, 2017
Contributor

It's vendored already, it's entirely local to the package and gives us an immediate benefit. Any config/ refactoring is far away. I see no reason to not do this.

This comment has been minimized.

@brian-brazil

brian-brazil Jun 15, 2017
Contributor

Changing the rest of our code to also use this new library being something that is far away is a strong reason not to do this.

We do not have consensus on this, and using multiple different libraries within our organisation that do the same thing is to be avoided.

This comment has been minimized.

@gouthamve

gouthamve Jun 16, 2017
Author Member

Moved back to go-yaml. PTAL.

@@ -0,0 +1,59 @@
version: 1

This comment has been minimized.

@brian-brazil

brian-brazil Jun 14, 2017
Contributor

Do we really need a version number at this stage? It's noise.

- alert: HighErrors
expr: |
sum without(instance) (rate(errors_total[5m]))
/ sum without(instance) (rate(requests_total[5m]))

This comment has been minimized.

@brian-brazil

brian-brazil Jun 14, 2017
Contributor

nit:

  sum without(instance) (rate(errors_total[5m]))
/ 
  sum without(instance) (rate(requests_total[5m]))

This comment has been minimized.

@fabxc

fabxc Jun 14, 2017
Contributor

I think YAML doesn't get it if the indentation on the first line is offset by this =/ one of the warts.

This comment has been minimized.

@grobie

grobie Jun 14, 2017
Member

Ugh, is this confirmed? Any workaround? This would be a serious drawback in readability.

This comment has been minimized.

@gouthamve

gouthamve Jun 14, 2017
Author Member

I did update the file now to reflect something better. Everything has to be in the same indentation level.

This comment has been minimized.

@juliusv

juliusv Jun 15, 2017
Member

That kind of constraint goes straight to my worries about nesting one kind of language in another one, especially if the outer one is one that cares about white space like this. I don't know what conclusion to draw from this though, as there might not be a better way.

@@ -270,18 +271,10 @@ func typeForRule(r Rule) ruleType {
// In the future a single group will be evaluated sequentially to properly handle

This comment has been minimized.

@brian-brazil

brian-brazil Jun 14, 2017
Contributor

Comment needs updating

@@ -270,18 +271,10 @@ func typeForRule(r Rule) ruleType {
// In the future a single group will be evaluated sequentially to properly handle
// rule dependency.
func (g *Group) Eval(ts time.Time) {

This comment has been minimized.

@brian-brazil

brian-brazil Jun 14, 2017
Contributor

The different rule groups need to have offset runtimes, like we do for targets.

}
rules = append(rules, rule)

// Groups need not be unique across filenames.

This comment has been minimized.

@brian-brazil

brian-brazil Jun 14, 2017
Contributor

What is this going to look like in the UI and in metrics?

Will the labels of a group change every time it's rescheduled in a different directory?

I don't think non-unique names works with our current setup.

This comment has been minimized.

@gouthamve

gouthamve Jun 14, 2017
Author Member

What do you mean labels?

I haven't yet touched the UI. But it would be simple to add the grouping there as I now added the filename also to the Group object. Currently in the UI, all the rules will be listed without any reference to the group/file.

This comment has been minimized.

@brian-brazil

brian-brazil Jun 14, 2017
Contributor

We will have a metric for how long the rule groups took to evaluate. The labels of a given group in that metric should be consistent across Prometheus invocations.

This comment has been minimized.

@gouthamve

gouthamve Jun 14, 2017
Author Member

Can we not have name, filename as the labels?

This comment has been minimized.

@brian-brazil

brian-brazil Jun 14, 2017
Contributor

The filename can change as Prometheus is rescheduled.

}

// Validate the rule and return a list of encountered errors.
func (r *Rule) Validate() (errs []error) {

This comment has been minimized.

@brian-brazil

brian-brazil Jun 14, 2017
Contributor

You need to validate that the labelnames/annotations names are valid, and for rules that label values are valid.

gouthamve added 4 commits Jun 14, 2017
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
return nil, err
rgs, errs := rulefmt.ParseFile(fn)
if errs != nil {
return nil, tsdb.MultiError(errs)

This comment has been minimized.

@fabxc

fabxc Jun 16, 2017
Contributor

Not a good idea to pull in tsdb as a dep just for a multierror type. In fact, it's probably a good idea to have loadGroups also just returns a straight slice of errors so the caller can print a single log line per encountered issue or similar. That's why we made it a slice in rulefmt to begin with.

l := model.LabelSet{
"name": model.LabelValue(g.name),
"filename": model.LabelValue(g.file),
}
return l.Fingerprint()

This comment has been minimized.

@fabxc

fabxc Jun 16, 2017
Contributor

Can we rename it to hash() and use pkg/labels.Labels.Hash()?

@@ -171,16 +176,96 @@ func checkRules(t cli.Term, filename string) (int, error) {
return 0, fmt.Errorf("is a directory")
}

rgs, errs := rulefmt.ParseFile(filename)
if errs != nil {
return 0, tsdb.MultiError(errs)

This comment has been minimized.

@fabxc

fabxc Jun 16, 2017
Contributor

Same about import tsdb for this and probably better returning the actual slice so the errors can be logged as one per line. tsdb.MultiError will just concat them with ; which will be impossible to debug with.

}
return len(rules), nil

return ioutil.WriteFile(filename+".yaml", y, 0777)

This comment has been minimized.

@fabxc

fabxc Jun 16, 2017
Contributor

0666 for files, 0777 for dirs. That's what we are going with in general I think.

}

yamlRG.Groups[0].Rules = yamlRules
y, err := yaml.Marshal(yamlRG)

This comment has been minimized.

@fabxc

fabxc Jun 16, 2017
Contributor

Because we are now using the plain YAML lib again this will generate rule files with basically random key ordering right?

@brian-brazil if yes, this is pretty bad UX just because we don't want to use a 100 LOC wrapper lib that's vendored already anyway.

This comment has been minimized.

@gouthamve

gouthamve Jun 16, 2017
Author Member

Using go-yaml the fields are ordered as we declare them. ghodss/yaml produces the fields in alphabetical order.

This comment has been minimized.

@fabxc

fabxc Jun 16, 2017
Contributor

Okay, let's move forward then. Somewhat sure I've seen things wildely mixed – but maybe I recall incorrectly.

Copy link
Contributor

@brian-brazil brian-brazil left a comment

The offset logic isn't included here yet.

@@ -243,6 +328,11 @@ func main() {
Run: CheckMetricsCmd,
})

app.Register("update-rules", &cli.Command{
Desc: "update the rules to the new YAML format",

This comment has been minimized.

@brian-brazil

brian-brazil Jun 16, 2017
Contributor

update rule files to...

to be consistent with the above.

type RuleGroup struct {
Name string `yaml:"name"`
Interval model.Duration `yaml:"interval,omitempty"`
Rules []Rule `yaml:"rules"`

This comment has been minimized.

@brian-brazil

brian-brazil Jun 16, 2017
Contributor

This should have the XXX checkOverflow stuff.

var (
wg sync.WaitGroup
)

for i, rule := range g.rules {

This comment has been minimized.

@brian-brazil

brian-brazil Jun 16, 2017
Contributor

Should we check done after each rule?

This comment has been minimized.

@gouthamve

gouthamve Jun 16, 2017
Author Member

I don't see why we need to. This is all going sequentially and the function wouldn't exit until all the rules are evaluated.

This comment has been minimized.

@fabxc

fabxc Jun 16, 2017
Contributor

I think Brian's point was to actually do exit if done indicates the group should terminate.

This comment has been minimized.

@gouthamve

gouthamve Jun 16, 2017
Author Member

Done

* Move fingerprint to Hash()
* Move away from tsdb.MultiError
* 0777 -> 0666 for files
* checkOverflow of extra fields

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
@fabxc
Copy link
Contributor

@fabxc fabxc commented Jun 16, 2017

I don't think we need offset evaluation for the first iteration. Let's focus on working down breaking changes for now.

@fabxc
Copy link
Contributor

@fabxc fabxc commented Jun 16, 2017

Mh, I thought this was alluding to "offset evaluation" where we evaluate recording rules not against now but now-5m for example. That can be relevant for metrics that are lagging behind like Cloudwatch.

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
@brian-brazil
Copy link
Contributor

@brian-brazil brian-brazil commented Jun 16, 2017

No, that's what I was talking about. I'd thought it hadn't been merged into this PR.

My "offset eval" is still at the idea stage, I want to wait until we've got cloudwatch&friends outputting timestamps and see if it still makes sense.

gouthamve added 2 commits Jun 19, 2017
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
@fabxc
Copy link
Contributor

@fabxc fabxc commented Jun 19, 2017

👍

@fabxc fabxc merged commit ab1bc9b into prometheus:dev-2.0 Jun 19, 2017
2 of 3 checks passed
2 of 3 checks passed
ci/circleci CircleCI is running your tests
Details
codacy/pr Good work! A positive pull request.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants