Enable running autopilot state updates on all servers #12617

Merged
merged 5 commits into from
Apr 7, 2022

Member

@mkeeler mkeeler commented Mar 25, 2022

Previously we started autopilot when a server gained leadership and stopped it when a server lost leadership. For some upcoming features we need autopilot on all servers to continually track the state of all servers. This PR pulls in a raft-autopilot update and enables that functionality.

When a Consul server starts, autopilot is started with reconciliation disabled, which prevents it from attempting Raft configuration modifications. When the server gains leadership we tell the running autopilot to enable reconciliation, and when the server loses leadership we tell it to disable reconciliation again.
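
The lifecycle described above can be sketched roughly as follows. This is a minimal, self-contained model and not the raft-autopilot API: the `stateTracker` type, its fields, and the `tick` method are hypothetical names invented for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// stateTracker is a hypothetical stand-in for a running autopilot instance.
// It is NOT the raft-autopilot type; it only models the behavior described
// in the PR text.
type stateTracker struct {
	mu            sync.Mutex
	reconcile     bool // true only while this server holds leadership
	stateUpdates  int  // how many times server state was refreshed
	configChanges int  // how many times raft config changes were attempted
}

// newStateTracker models server startup: running, reconciliation disabled.
func newStateTracker() *stateTracker {
	return &stateTracker{}
}

// EnableReconciliation would be called when the server gains leadership.
func (s *stateTracker) EnableReconciliation() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.reconcile = true
}

// DisableReconciliation would be called when the server loses leadership.
func (s *stateTracker) DisableReconciliation() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.reconcile = false
}

// tick models one autopilot pass: state is always refreshed, but raft
// configuration changes are attempted only while reconciliation is enabled.
func (s *stateTracker) tick() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.stateUpdates++
	if s.reconcile {
		s.configChanges++
	}
}

func main() {
	ap := newStateTracker()
	ap.tick() // follower: state only

	ap.EnableReconciliation() // gained leadership
	ap.tick()                 // leader: state and reconciliation

	ap.DisableReconciliation() // lost leadership
	ap.tick()                  // follower again: state only

	fmt.Println(ap.stateUpdates, ap.configChanges) // prints "3 1"
}
```

The point of the sketch is that the tracker keeps running across leadership changes; only the reconciliation flag is toggled.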

There are a few related changes in this PR as well.

  1. The autopilot.healthy and autopilot.failure_tolerance metrics will now be emitted regularly on both followers and the leader. A "leader" label is now added to the metrics to make it simpler to pick out the leader's view for the purposes of alerting.

  2. The Operator.AutopilotState and Operator.ServerHealth RPCs no longer forcefully forward to the leader. Our HTTP server and CLI default to non-stale queries, so the overall behavior should be unchanged for existing uses. The exception is when the -stale CLI parameter or the corresponding query parameter is used: those RPCs are then not forwarded to the leader and instead return the follower server's view of the state.
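
The forwarding rule in item 2 can be illustrated with a small sketch. The `queryOptions` type and the `answeredBy` function are hypothetical and are not Consul's RPC code; they only encode the decision described above, assuming a single boolean for stale reads.

```go
package main

import "fmt"

// queryOptions is a hypothetical, trimmed-down stand-in for the options
// carried by a read RPC. Only the stale flag matters for this sketch.
type queryOptions struct {
	AllowStale bool
}

// answeredBy reports whose view of the autopilot state a read returns:
// "local" when the receiving server answers itself, "leader" when the
// request is forwarded to the leader.
func answeredBy(isLeader bool, opts queryOptions) string {
	if isLeader || opts.AllowStale {
		return "local"
	}
	return "leader"
}

func main() {
	// Default (non-stale) query hitting a follower: forwarded as before.
	fmt.Println(answeredBy(false, queryOptions{AllowStale: false})) // prints "leader"

	// Stale query hitting a follower: the follower's own view is returned.
	fmt.Println(answeredBy(false, queryOptions{AllowStale: true})) // prints "local"
}
```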

Lastly, the second commit in this PR is unrelated. It's just fixing a linter warning that popped up while I was working on this.

TODOS:

Contributor

@boxofrad boxofrad left a comment

LGTM 💯

On the non-leader servers all they do is update the state and do not attempt any modifications.
Technically they were relying on racy behavior before. Now they should be reliable.
@@ -817,7 +817,8 @@ func TestRPC_RPCMaxConnsPerClient(t *testing.T) {
 			tc := tc
 			t.Run(tc.name, func(t *testing.T) {
 				dir1, s1 := testServerWithConfig(t, func(c *Config) {
-					c.RPCMaxConnsPerClient = 2
+					// we have to set this to 3 because autopilot is going to keep a connection open
+					c.RPCMaxConnsPerClient = 3
Member Author

Previously, this test only passed because all of its connections were opened before autopilot was started. Once leadership is established and autopilot is started, a connection is kept open, which takes up one of the available connection slots.

The changes here up the limit to 3 and also wait for establishing leadership further down. Just introducing the leader establishment wait in the previous code is enough to cause the test to reliably fail.
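
To make the slot arithmetic concrete, here is a rough sketch of a per-client connection limiter. The `connLimiter` type and `usableTestConns` helper are hypothetical and are not Consul's implementation; the sketch also assumes the long-lived autopilot connection is counted against the same client address as the test's connections.

```go
package main

import (
	"errors"
	"fmt"
)

var errTooManyConns = errors.New("too many connections from client")

// connLimiter is a hypothetical per-client connection counter.
type connLimiter struct {
	limit int
	conns map[string]int
}

func newConnLimiter(limit int) *connLimiter {
	return &connLimiter{limit: limit, conns: make(map[string]int)}
}

// accept records a new connection for the client, or rejects it when the
// client already holds the maximum number of connections.
func (l *connLimiter) accept(client string) error {
	if l.conns[client] >= l.limit {
		return errTooManyConns
	}
	l.conns[client]++
	return nil
}

// usableTestConns opens one long-lived background connection (standing in
// for the one autopilot keeps open) and then counts how many further
// connections from the same address are accepted.
func usableTestConns(limit int) int {
	const addr = "127.0.0.1"
	l := newConnLimiter(limit)
	_ = l.accept(addr) // background connection takes the first slot

	n := 0
	for l.accept(addr) == nil {
		n++
	}
	return n
}

func main() {
	fmt.Println(usableTestConns(2)) // prints "1": only one slot left for the test
	fmt.Println(usableTestConns(3)) // prints "2": the two connections the test opens
}
```

Under these assumptions, raising the limit from 2 to 3 restores the two slots the test expects to be able to use.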

The "leader" label doesn't really mesh well with go-metrics, Prometheus, and our gauge predefinitions: metrics with different label values are treated as distinct, so we could end up outputting multiple versions of these metrics to Prometheus, which would be undesirable.
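
The problem with label values can be shown with a simplified model of a gauge store. This is not how go-metrics or the Prometheus sink is implemented; `gaugeSink`, `seriesKey`, and the metric name used here are hypothetical, and the sketch only assumes that a series is identified by its name plus its label set.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// gaugeSink is a hypothetical in-memory gauge store keyed by series identity.
type gaugeSink struct {
	series map[string]float64
}

// seriesKey builds a stable identity from the metric name and its labels.
func seriesKey(name string, labels map[string]string) string {
	parts := make([]string, 0, len(labels))
	for k, v := range labels {
		parts = append(parts, k+"="+v)
	}
	sort.Strings(parts)
	return name + "{" + strings.Join(parts, ",") + "}"
}

// setGauge stores the value under the series identified by name and labels.
func (g *gaugeSink) setGauge(name string, value float64, labels map[string]string) {
	g.series[seriesKey(name, labels)] = value
}

func main() {
	sink := &gaugeSink{series: make(map[string]float64)}

	// A predefined gauge is registered up front without any labels.
	sink.setGauge("autopilot_healthy", 0, nil)

	// The same server reports as a follower, then later as the leader.
	sink.setGauge("autopilot_healthy", 1, map[string]string{"leader": "false"})
	sink.setGauge("autopilot_healthy", 1, map[string]string{"leader": "true"})

	// Three separate series now exist for what is conceptually one metric.
	fmt.Println(len(sink.series)) // prints "3"
}
```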
The label was removed from the metrics so the changelog shouldn't say it exists.
@@ -250,8 +250,8 @@ func TestHTTPHandlers_AgentMetrics_ConsulAutopilot_Prometheus(t *testing.T) {
respRec := httptest.NewRecorder()
recordPromMetrics(t, a, respRec)

assertMetricExistsWithValue(t, respRec, "agent_2_autopilot_healthy", "NaN")
Member Author

Setting the value to NaN causes metrics test failures because NaN cannot be JSON-encoded.

@mkeeler mkeeler merged commit a553982 into main Apr 7, 2022
@mkeeler mkeeler deleted the autopilot-for-all branch April 7, 2022 14:48
@hc-github-team-consul-core
Collaborator

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/629100.

@hc-github-team-consul-core
Collaborator

🍒✅ Cherry pick of commit a553982 onto release/1.12.x succeeded!

hc-github-team-consul-core pushed a commit that referenced this pull request Apr 7, 2022
* Fixes a lint warning about t.Errorf not supporting %w

* Enable running autopilot on all servers

On the non-leader servers all they do is update the state and do not attempt any modifications.

* Fix the RPC conn limiting tests

Technically they were relying on racy behavior before. Now they should be reliable.
@oseiberts11

The comment at the top says that "There is now a "leader" label added to the metrics to make it simpler to pick out the leaders view for the purposes of alerting.", but this label is removed again in commit 0753558. The part "for the purposes of alerting" sounded useful; was it in fact not as useful as thought, or is there some other strategy one can use instead?

Labels
backport-inactive/1.12 (This release series is no longer active), pr/no-metrics-test
4 participants