Enable running autopilot state updates on all servers #12617

mkeeler · 2022-03-25T15:15:34Z

Previously we started autopilot when a server gained leadership and stopped it when a server lost leadership. For some upcoming features we need autopilot on all servers to continually track the state of all servers. This PR pulls in a raft-autopilot update and enables that functionality.

When a Consul server starts, autopilot is started too, but with reconciliation disabled so that it does not attempt any raft configuration modifications. Upon gaining leadership, the server tells the running autopilot to enable reconciliation; once the server loses leadership, it tells autopilot to disable reconciliation again.
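
The lifecycle described above can be sketched roughly as follows. This is a toy model under stated assumptions, not Consul's or raft-autopilot's actual code: the `tracker` type and its `setReconciliation` and `tick` methods are hypothetical names invented for illustration. It only shows the idea that state updates always happen, while (simulated) raft configuration changes happen only while reconciliation is enabled.

```go
package main

import (
	"fmt"
	"sync"
)

// tracker is a hypothetical stand-in for a running autopilot instance.
type tracker struct {
	mu        sync.Mutex
	reconcile bool // when false, state is tracked but raft config is never touched
	updates   int  // number of state updates performed
	changes   int  // number of simulated raft configuration modifications
}

// newTracker returns a tracker with reconciliation disabled,
// mirroring a freshly started server that is not (yet) the leader.
func newTracker() *tracker { return &tracker{} }

// setReconciliation toggles whether tick is allowed to modify raft config.
func (t *tracker) setReconciliation(enabled bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.reconcile = enabled
}

// tick simulates one autopilot round: the state is always updated,
// and a configuration change is only attempted when reconciliation is on.
func (t *tracker) tick() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.updates++
	if t.reconcile {
		t.changes++
	}
}

func main() {
	ap := newTracker()

	ap.tick() // running as a follower: state update only

	ap.setReconciliation(true) // gained leadership
	ap.tick()

	ap.setReconciliation(false) // lost leadership
	ap.tick()

	fmt.Printf("updates=%d changes=%d\n", ap.updates, ap.changes)
}
```

In this sketch all three rounds update the tracked state, but only the round that ran while the server was leader counts as a configuration change.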

There are a few related changes in this PR as well.

The autopilot.healthy and autopilot.failure_tolerance metrics will now be emitted regularly on both followers and the leader. There is now a "leader" label added to the metrics to make it simpler to pick out the leader's view for the purposes of alerting.
The Operator.AutopilotState and Operator.ServerHealth RPCs no longer forcefully forward to the leader. Our HTTP server and CLI default to non-stale queries, so the overall behavior should be unchanged for existing uses. The exception is when the -stale CLI parameter or the corresponding query parameter is used: those RPCs are then not forwarded to the leader and instead return the follower server's view of the state.
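
The forwarding rule for the two RPCs can be illustrated with a small sketch. The `server` and `queryOptions` types and the `answeredBy` method are hypothetical and exist only for this example; Consul's real RPC forwarding is more involved. The sketch assumes a single flag, `AllowStale`, that corresponds to the -stale CLI parameter or its query-parameter equivalent.

```go
package main

import "fmt"

// queryOptions is a hypothetical, simplified set of request options.
type queryOptions struct {
	AllowStale bool // set by -stale or the corresponding query parameter
}

// server is a hypothetical model of a Consul server that knows the leader.
type server struct {
	name     string
	isLeader bool
	leader   *server // nil on the leader itself
}

// answeredBy returns the name of the server whose view of the autopilot
// state answers the request: the local server if it is the leader or if
// stale reads are allowed, otherwise the leader after forwarding.
func (s *server) answeredBy(opts queryOptions) string {
	if s.isLeader || opts.AllowStale {
		return s.name
	}
	return s.leader.name
}

func main() {
	leader := &server{name: "server-1", isLeader: true}
	follower := &server{name: "server-2", leader: leader}

	fmt.Println("default:", follower.answeredBy(queryOptions{}))
	fmt.Println("stale:  ", follower.answeredBy(queryOptions{AllowStale: true}))
}
```

With default options the follower forwards and the leader's view is returned; with stale reads allowed the follower answers from its own state.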

Lastly, the second commit in this PR is unrelated. It's just fixing a linter warning that popped up when I was working on this.

TODOS:

Pull in official version of raft-autopilot once the upstream PR is merged and a release is made: Allow running autopilot with reconciliation disabled on non-leaders raft-autopilot#16

boxofrad

LGTM 💯

On the non-leader servers all they do is update the state and do not attempt any modifications.

Technically they were relying on racey behavior before. Now they should be reliable.

mkeeler · 2022-04-05T20:46:57Z

agent/consul/rpc_test.go

@@ -817,7 +817,8 @@ func TestRPC_RPCMaxConnsPerClient(t *testing.T) {
 		tc := tc
 		t.Run(tc.name, func(t *testing.T) {
 			dir1, s1 := testServerWithConfig(t, func(c *Config) {
-				c.RPCMaxConnsPerClient = 2
+				// we have to set this to 3 because autopilot is going to keep a connection open
+				c.RPCMaxConnsPerClient = 3


Previously, this test passed only because all of its connections happened to be opened before autopilot was started. Once leadership is established and autopilot is started, it keeps a connection open, which takes up one of the available connection slots.

The changes here raise the limit to 3 and also wait for leadership to be established further down. Adding just the leadership wait to the previous code is enough to make the test fail reliably.
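
A rough model of why the limit has to go from 2 to 3 is shown below. The `connLimiter` type and the `acceptedTestConns` helper are hypothetical, not Consul's limiter. The sketch assumes that the test wants two connections of its own and that autopilot's long-lived connection is counted against the same client address.

```go
package main

import "fmt"

// connLimiter is a hypothetical per-client connection counter.
type connLimiter struct {
	max  int
	open map[string]int
}

func newConnLimiter(max int) *connLimiter {
	return &connLimiter{max: max, open: make(map[string]int)}
}

// accept reports whether a new connection from addr fits under the limit
// and records it if so.
func (l *connLimiter) accept(addr string) bool {
	if l.open[addr] >= l.max {
		return false
	}
	l.open[addr]++
	return true
}

// acceptedTestConns opens autopilot's connection first (as happens once
// leadership is established) and then tries two test connections,
// returning how many of the test connections were accepted.
func acceptedTestConns(max int) int {
	const addr = "127.0.0.1"
	l := newConnLimiter(max)
	l.accept(addr) // autopilot's long-lived connection

	accepted := 0
	for i := 0; i < 2; i++ {
		if l.accept(addr) {
			accepted++
		}
	}
	return accepted
}

func main() {
	fmt.Println("limit 2:", acceptedTestConns(2), "of 2 test conns accepted")
	fmt.Println("limit 3:", acceptedTestConns(3), "of 2 test conns accepted")
}
```

Under these assumptions, a limit of 2 leaves room for only one test connection after autopilot has taken a slot, while a limit of 3 leaves room for both.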

The "leader" label doesn't really mesh well with go-metrics, Prometheus, and our gauge predefinitions. Metrics with different label values are treated as distinct, so we could end up outputting multiple versions of these metrics to Prometheus, which would be undesirable.

The label was removed from the metrics so the changelog shouldn't say it exists.

mkeeler · 2022-04-07T14:48:22Z

agent/metrics_test.go

@@ -250,8 +250,8 @@ func TestHTTPHandlers_AgentMetrics_ConsulAutopilot_Prometheus(t *testing.T) {
 		respRec := httptest.NewRecorder()
 		recordPromMetrics(t, a, respRec)

-		assertMetricExistsWithValue(t, respRec, "agent_2_autopilot_healthy", "NaN")


Setting the value to NaN causes metrics test failures because NaN cannot be JSON encoded.
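
The encoding problem is easy to reproduce with Go's standard library. In the sketch below, `encodeGauge` is a hypothetical helper written for this example, not part of Consul or go-metrics; it relies only on the documented behavior of `encoding/json`, which returns an error for NaN and infinite floating-point values.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

// encodeGauge is a hypothetical helper that JSON-encodes a single gauge value.
func encodeGauge(v float64) ([]byte, error) {
	return json.Marshal(map[string]float64{"Value": v})
}

func main() {
	// NaN has no JSON representation, so Marshal reports an error.
	if _, err := encodeGauge(math.NaN()); err != nil {
		fmt.Println("NaN gauge:", err)
	}

	// An ordinary value encodes without error.
	out, err := encodeGauge(1)
	if err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	fmt.Println("healthy gauge:", string(out))
}
```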

hc-github-team-consul-core · 2022-04-07T14:49:28Z

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/629100.

hc-github-team-consul-core · 2022-04-07T14:49:30Z

🍒✅ Cherry pick of commit a553982 onto release/1.12.x succeeded!

* Fixes a lint warning about t.Errorf not supporting %w
* Enable running autopilot on all servers
  On the non-leader servers all they do is update the state and do not attempt any modifications.
* Fix the RPC conn limiting tests
  Technically they were relying on racey behavior before. Now they should be reliable.

oseiberts11 · 2022-04-29T09:00:20Z

The comment at the top says that "There is now a "leader" label added to the metrics to make it simpler to pick out the leaders view for the purposes of alerting." but this label is removed again in commit 0753558. The part "for the purposes of alerting" sounded useful; was it in fact not as useful as thought, or is there some other strategy one can use instead?

mkeeler force-pushed the autopilot-for-all branch from d19ab87 to c4624dc Compare March 25, 2022 15:23

mkeeler force-pushed the autopilot-for-all branch from c4624dc to c7b1a70 Compare March 25, 2022 15:25

boxofrad approved these changes Mar 28, 2022

mkeeler force-pushed the autopilot-for-all branch from c7b1a70 to a058bf2 Compare April 4, 2022 14:19

mkeeler marked this pull request as ready for review April 4, 2022 14:20

mkeeler requested a review from a team as a code owner April 4, 2022 14:20

mkeeler force-pushed the autopilot-for-all branch from a058bf2 to 744dcfe Compare April 4, 2022 14:25

mkeeler added backport/1.12 pr/no-metrics-test labels Apr 5, 2022

mkeeler force-pushed the autopilot-for-all branch from 744dcfe to 5a9a071 Compare April 5, 2022 13:52

mkeeler force-pushed the autopilot-for-all branch from 5a9a071 to 90f560c Compare April 5, 2022 15:14

mkeeler added 3 commits April 5, 2022 12:09

Fixes a lint warning about t.Errorf not supporting %w

9b072e3

Enable running autopilot on all servers

a00b603

On the non-leader servers all they do is update the state and do not attempt any modifications.

Fix the RPC conn limiting tests

68003d9

Technically they were relying on racey behavior before. Now they should be reliable.

mkeeler force-pushed the autopilot-for-all branch from 90f560c to 68003d9 Compare April 5, 2022 20:43

mkeeler commented Apr 5, 2022

Update changelog regarding leader label

44ca934

The label was removed from the metrics so the changelog shouldn't say it exists.

mkeeler commented Apr 7, 2022

mkeeler merged commit a553982 into main Apr 7, 2022

mkeeler deleted the autopilot-for-all branch April 7, 2022 14:48

dpw mentioned this pull request May 20, 2022

Prometheus metric to show whether a server is a leader or follower #13169

Closed

tgross mentioned this pull request Sep 27, 2022

metrics: emit nomad.nomad.autopilot.healthy on followers hashicorp/nomad#13219

Open
