Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix consul_autopilot_healthy metric emission #11231

Merged
merged 3 commits into from
Oct 8, 2021
Merged

fix consul_autopilot_healthy metric emission #11231

merged 3 commits into from
Oct 8, 2021

Conversation

acpana
Copy link
Contributor

@acpana acpana commented Oct 5, 2021

Overview

This PR fixes the values for the consul_autopilot_healthy emitted on:

  • server startup
  • leadership loss

The desired behavior, as documented on https://www.consul.io/docs/agent/telemetry#autopilot, is for that metric to be NaN in the states above.

Due to a change in go-metrics this behavior changed and now the value reported is 0. This can be disruptive for folks, especially if they use the metric to monitor or alarm on 0, which marks an "unhealthy" cluster.

See FAQ below.

Issue Related

Notes:

PR Checklist

  • go fmt
  • go mod
  • pr/changelog added
  • unit tests
  • integration tests - n/a

FAQ

Click to expand!
  • Q: Why a "stop gap" solution first?
    • A: I think this behavior deviation is quite disruptive for practitioners and I'd like to fix this asap. While we can come up with more elegant solutions, this should unblock folks
  • Q: Where is the testing for the leadership loss / raft state change?
    • A: At present, our "unit" testing harness does not allow for more than one metrics sink (ref). While we work on adding a test for that, here's how I tested that scenario:

To test that NaN was emitted by follower nodes, I followed the steps in #10730 (comment) but used https://github.com/dhiaayachi/consul-local-cluster/ to dockerize my dev build of consul.

  • Q: Why add a metrics_test.go file and not have the consul_autopilot_healthy metrics test as part of
    autopilot_test.go?
    • A: I think we ought to add more testing around metrics emissions/ behavior in out of consul. I.e., I'm hoping we can add the 1/0 checks for consul_autopilot_healthy but also tests for any other metric here.

Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I think this is a good fix for Servers, but Clients will still emit a 0 for this metric, which I think will still be a problem. To fix that, I think we need to make sure that AutopilotGauges are only added in getPrometheusDefs when in server mode. That will likely require some additional plumbing.

Left some comments about the tests. I think it would be good to test, but I'm concerned we're adding yet more complexity to what is already a very complex test helper. I'm hoping we can find a better way to test this.

agent/consul/autopilot.go Outdated Show resolved Hide resolved
agent/testagent.go Outdated Show resolved Hide resolved
agent/testagent.go Outdated Show resolved Hide resolved
agent/metrics_test.go Outdated Show resolved Hide resolved
@vercel vercel bot temporarily deployed to Preview – consul October 6, 2021 22:09 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging October 6, 2021 22:09 Inactive
@acpana
Copy link
Contributor Author

acpana commented Oct 6, 2021

@dnephin thanks for taking a look 💯 !!

can confirm this is true:

but Clients will still emit a 0 for this metric, which I think will still be a problem.

used the same setup described in the FAQ to stand up a local 3 server cluster with 2 clients.

Clients report 0 for consul_autopilot_healthy. The behavior here seems to be undefined 🤔 . As it's not documented here: https://www.consul.io/docs/agent/telemetry#autopilot .

Tracks the overall health of the local server cluster.


EDIT:

think we need to make sure that AutopilotGauges are only added in getPrometheusDefs when in server mode. That will likely require some additional plumbing

I added one way of doing that in this draft PR: #11241 (comment).

The side effect of that approach is that the metric won't even appear, which could be ok or not, depending on the practitioner's environment expectation. For instance, they could treat "missing" data as breaching.

I can add that commit to this PR if that's the direction we want to go in.

agent/metrics_test.go Outdated Show resolved Hide resolved
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
@vercel vercel bot temporarily deployed to Preview – consul October 8, 2021 06:41 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging October 8, 2021 06:41 Inactive
Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix and test coverage!

@acpana acpana merged commit a0bba91 into main Oct 8, 2021
@acpana acpana deleted the ffmmm/b-10730 branch October 8, 2021 17:31
@hc-github-team-consul-core
Copy link
Collaborator

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/467345.

@hc-github-team-consul-core
Copy link
Collaborator

🍒❌ Cherry pick of commit a0bba91 onto release/1.10.x failed! Build Log

@markblackman
Copy link
Contributor

Can this get backported to 1.9 as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants