fix consul_autopilot_healthy metric emission #11231

acpana · 2021-10-05T21:39:08Z

Overview

This PR fixes the value of the consul_autopilot_healthy metric emitted on:

server startup
leadership loss

The desired behavior, as documented on https://www.consul.io/docs/agent/telemetry#autopilot, is for that metric to be NaN in the states above.

A change in go-metrics altered this behavior, so the value now reported is 0. This can be disruptive, especially for anyone who monitors or alarms on 0, which marks an "unhealthy" cluster.

See FAQ below.

Issue Related

0 returned instead of NaN for consul.autopilot.healthy|consul_autpilot_healthy metric in Consul 1.10 #10730
consul_autopilot_healthy is reporting zero on the leader #11152

Notes:

This is more of a stopgap solution: it only tackles the consul_autopilot_healthy metric and none of the remaining items in 0 returned instead of NaN for consul.autopilot.healthy|consul_autpilot_healthy metric in Consul 1.10 #10730 (comment).

PR Checklist

FAQ


Q: Why a "stop gap" solution first?
- A: I think this deviation in behavior is quite disruptive for practitioners and I'd like to fix it ASAP. While we can come up with more elegant solutions later, this should unblock folks.
Q: Where is the testing for the leadership loss / raft state change?
- A: At present, our "unit" testing harness does not allow for more than one metrics sink (ref). While we work on adding a test for that, here's how I tested that scenario:

To test that NaN was emitted by follower nodes, I followed the steps in #10730 (comment) but used https://github.com/dhiaayachi/consul-local-cluster/ to dockerize my dev build of consul.

Q: Why add a metrics_test.go file and not have the consul_autopilot_healthy metrics test as part of autopilot_test.go?
- A: I think we ought to add more testing around metrics emission and behavior in and out of Consul. I.e., I'm hoping we can add the 1/0 checks for consul_autopilot_healthy but also tests for any other metric here.

Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>

dnephin

Nice! I think this is a good fix for Servers, but Clients will still emit a 0 for this metric, which I think will still be a problem. To fix that, I think we need to make sure that AutopilotGauges are only added in getPrometheusDefs when in server mode. That will likely require some additional plumbing.
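One way to picture the suggestion above is a definitions function that takes the agent mode as an argument. This is only a sketch under assumptions: `gaugeDefinition`, `commonGauges`, and `gaugeDefs` are hypothetical names, and the real `getPrometheusDefs` and `AutopilotGauges` in Consul may be shaped differently.

```go
package main

import "fmt"

// gaugeDefinition is a hypothetical, simplified gauge registration record.
type gaugeDefinition struct {
	Name []string
	Help string
}

// autopilotGauges stands in for the AutopilotGauges mentioned above.
var autopilotGauges = []gaugeDefinition{
	{
		Name: []string{"autopilot", "healthy"},
		Help: "Tracks the overall health of the local server cluster.",
	},
}

// commonGauges is a hypothetical set of gauges that every agent registers.
var commonGauges = []gaugeDefinition{
	{Name: []string{"example", "gauge"}, Help: "Hypothetical gauge shared by clients and servers."},
}

// gaugeDefs returns the gauges to pre-register. Autopilot gauges are included
// only in server mode, so clients never expose a default value for them.
func gaugeDefs(isServer bool) []gaugeDefinition {
	defs := append([]gaugeDefinition{}, commonGauges...)
	if isServer {
		defs = append(defs, autopilotGauges...)
	}
	return defs
}

func main() {
	fmt.Println("client gauges:", len(gaugeDefs(false)))
	fmt.Println("server gauges:", len(gaugeDefs(true)))
}
```

The "additional plumbing" is then whatever is needed to get the server-mode flag to the place where the definitions are built.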

Left some comments about the tests. I think it would be good to test, but I'm concerned we're adding yet more complexity to what is already a very complex test helper. I'm hoping we can find a better way to test this.

agent/consul/autopilot.go

agent/testagent.go

agent/metrics_test.go

acpana · 2021-10-06T22:22:59Z

@dnephin thanks for taking a look 💯 !!

can confirm this is true:

but Clients will still emit a 0 for this metric, which I think will still be a problem.

used the same setup described in the FAQ to stand up a local 3 server cluster with 2 clients.

Clients report 0 for consul_autopilot_healthy. The behavior here seems to be undefined 🤔, as it's not documented here: https://www.consul.io/docs/agent/telemetry#autopilot.

Tracks the overall health of the local server cluster.

EDIT:

think we need to make sure that AutopilotGauges are only added in getPrometheusDefs when in server mode. That will likely require some additional plumbing

I added one way of doing that in this draft PR: #11241 (comment).

The side effect of that approach is that the metric won't appear at all, which may or may not be OK depending on what the practitioner's environment expects. For instance, they could treat "missing" data as breaching.

I can add that commit to this PR if that's the direction we want to go in.

agent/metrics_test.go

Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>

dnephin

Nice fix and test coverage!

hc-github-team-consul-core · 2021-10-08T17:33:01Z

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/467345.

hc-github-team-consul-core · 2021-10-08T17:33:03Z

🍒❌ Cherry pick of commit a0bba91 onto release/1.10.x failed! Build Log

#10730

markblackman · 2022-05-20T14:50:12Z

Can this get backported to 1.9 as well?

FFMMM added 2 commits October 5, 2021 13:26

set autopilot_healthy metric to nan on startup

ff223d4

Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>

add metrics test for consul_autopilot_healthy and plumbing for testagent

eeccf9c

Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>

acpana requested review from dnephin, banks and markan October 5, 2021 21:39

acpana mentioned this pull request Oct 6, 2021

telemetry: improve cert expiry metrics #10771

Merged

2 tasks

acpana added the backport/1.10 label Oct 6, 2021

dnephin reviewed Oct 6, 2021

agent/consul/autopilot.go (outdated)

agent/testagent.go (outdated)

agent/testagent.go (outdated)

agent/metrics_test.go (outdated)

acpana force-pushed the ffmmm/b-10730 branch from 163f676 to f45d85f Compare October 6, 2021 22:09

vercel bot temporarily deployed to Preview – consul October 6, 2021 22:09 Inactive

vercel bot temporarily deployed to Preview – consul-ui-staging October 6, 2021 22:09 Inactive

acpana mentioned this pull request Oct 6, 2021

only add prom autopilot gauges to servers #11241

Merged

dnephin reviewed Oct 7, 2021

agent/metrics_test.go (outdated)

reset consul_autopilot_healthy to NaN on leadership state

51b9579

Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>

acpana force-pushed the ffmmm/b-10730 branch from f45d85f to 51b9579 Compare October 8, 2021 06:41

vercel bot temporarily deployed to Preview – consul October 8, 2021 06:41 Inactive

vercel bot temporarily deployed to Preview – consul-ui-staging October 8, 2021 06:41 Inactive

dnephin approved these changes Oct 8, 2021

acpana merged commit a0bba91 into main Oct 8, 2021

acpana deleted the ffmmm/b-10730 branch October 8, 2021 17:31

acpana pushed a commit that referenced this pull request Oct 8, 2021

fix consul_autopilot_healthy metric emission (#11231)

6dc3acb

#10730

acpana mentioned this pull request Oct 8, 2021

fix consul_autopilot_healthy metric emission (#11231) #11259

Merged

acpana pushed a commit that referenced this pull request Oct 8, 2021

fix consul_autopilot_healthy metric emission (#11231) (#11259)

4981439

#10730

acpana mentioned this pull request Oct 8, 2021

0 returned instead of NaN for consul.autopilot.healthy|consul_autpilot_healthy metric in Consul 1.10 #10730

Closed

acpana mentioned this pull request Oct 21, 2021

0 returned instead of NaN for consul_autopilot_failure_tolerance metric in Consul 1.10 #11378

Closed
