fix consul_autopilot_healthy metric emission #11231
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
Nice! I think this is a good fix for Servers, but Clients will still emit a 0 for this metric, which I think will still be a problem. To fix that, I think we need to make sure that `AutopilotGauges` are only added in `getPrometheusDefs` when in server mode. That will likely require some additional plumbing.
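As a rough illustration of the suggestion above, the gauge list could be assembled conditionally on the agent mode. This is only a sketch: `GaugeDefinition`, `commonGauges`, `autopilotGauges`, and `gaugeDefs` are hypothetical stand-ins, not the actual Consul types or the real signature of `getPrometheusDefs`.

```go
package main

import "fmt"

// GaugeDefinition is a hypothetical stand-in for a Prometheus gauge definition.
type GaugeDefinition struct {
	Name []string
	Help string
}

// commonGauges is an illustrative list of gauges that every agent registers.
var commonGauges = []GaugeDefinition{
	{Name: []string{"example", "common_gauge"}, Help: "example gauge emitted by every agent"},
}

// autopilotGauges is an illustrative list of gauges that only make sense on servers.
var autopilotGauges = []GaugeDefinition{
	{Name: []string{"autopilot", "healthy"}, Help: "example autopilot health gauge"},
}

// gaugeDefs returns the gauges to register. Autopilot gauges are included
// only when the agent runs in server mode, so clients never pre-register them.
func gaugeDefs(isServer bool) []GaugeDefinition {
	defs := append([]GaugeDefinition{}, commonGauges...)
	if isServer {
		defs = append(defs, autopilotGauges...)
	}
	return defs
}

func main() {
	fmt.Println("server gauges:", len(gaugeDefs(true)))
	fmt.Println("client gauges:", len(gaugeDefs(false)))
}
```

The "additional plumbing" would then amount to getting the server/client flag to wherever the definitions are built.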
Left some comments about the tests. I think it would be good to test, but I'm concerned we're adding yet more complexity to what is already a very complex test helper. I'm hoping we can find a better way to test this.
@dnephin thanks for taking a look 💯!! I can confirm this is true:
I used the same setup described in the FAQ to stand up a local 3-server cluster with 2 clients. Clients report
EDIT:
I added one way of doing that in this draft PR: #11241 (comment). The side effect of that approach is that the metric won't appear at all, which may or may not be acceptable depending on what the practitioner's environment expects. For instance, they could treat "missing" data as breaching. I can add that commit to this PR if that's the direction we want to go in.
Nice fix and test coverage!
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/467345.
Can this get backported to 1.9 as well?
Overview
This PR fixes the values for the `consul_autopilot_healthy` metric emitted on:
The desired behavior, as documented on https://www.consul.io/docs/agent/telemetry#autopilot, is for that metric to be `NaN` in the states above.
Due to a change in `go-metrics`, this behavior changed and the value now reported is `0`. This can be disruptive for folks, especially if they use the metric to monitor or alarm on `0`, which marks an "unhealthy" cluster.
See FAQ below.
Issue Related
`consul.autopilot.healthy|consul_autpilot_healthy` metric in Consul 1.10 #10730
Notes:
`consul_autopilot_healthy` metric and not any of the remainder of 0 returned instead of NaN for `consul.autopilot.healthy|consul_autpilot_healthy` metric in Consul 1.10 #10730 (comment)
PR Checklist
go fmt
go mod
FAQ
To test that `NaN` was emitted by follower nodes, I followed the steps in #10730 (comment) but used https://github.com/dhiaayachi/consul-local-cluster/ to dockerize my dev build of consul.
`metrics_test.go` file and not have the `consul_autopilot_healthy` metrics test as part of `autopilot_test.go`?
`1/0` checks for `consul_autopilot_healthy` but also tests for any other metric here.