Enable running autopilot state updates on all servers #12617
Conversation
LGTM 💯
On the non-leader servers, autopilot only updates its view of the state and does not attempt any modifications. Technically they were relying on racy behavior before; now they should be reliable.
@@ -817,7 +817,8 @@ func TestRPC_RPCMaxConnsPerClient(t *testing.T) {
 			tc := tc
 			t.Run(tc.name, func(t *testing.T) {
 				dir1, s1 := testServerWithConfig(t, func(c *Config) {
-					c.RPCMaxConnsPerClient = 2
+					// we have to set this to 3 because autopilot is going to keep a connection open
+					c.RPCMaxConnsPerClient = 3
Previously, this test's success relied on all of the connections being opened before autopilot was started. Once leadership was established and autopilot started, a connection would be kept open, taking up one of the available connection slots.
The changes here raise the limit to 3 and also wait for leadership to be established further down. Introducing the leadership wait alone in the previous code is enough to make the test fail reliably.
It doesn't mesh well with go-metrics, Prometheus, and our gauge predefinitions: metrics with different label values are treated as distinct time series, so we could end up emitting multiple versions of these metrics to Prometheus, which would be undesirable.
The label was removed from the metrics, so the changelog shouldn't say it exists.
@@ -250,8 +250,8 @@ func TestHTTPHandlers_AgentMetrics_ConsulAutopilot_Prometheus(t *testing.T) {
 		respRec := httptest.NewRecorder()
 		recordPromMetrics(t, a, respRec)

 		assertMetricExistsWithValue(t, respRec, "agent_2_autopilot_healthy", "NaN")
Setting it to NaN causes metrics test failures because NaN cannot be JSON-encoded.
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/629100.
🍒✅ Cherry pick of commit a553982 onto
* Fixes a lint warning about t.Errorf not supporting %w
* Enable running autopilot on all servers. On the non-leader servers all they do is update the state and do not attempt any modifications.
* Fix the RPC conn limiting tests. Technically they were relying on racy behavior before; now they should be reliable.
The comment at the top says "There is now a "leader" label added to the metrics to make it simpler to pick out the leader's view for the purposes of alerting," but this label is removed again in commit 0753558. The part "for the purposes of alerting" sounded useful; was it in fact not as useful as thought, or is there some other strategy one can use instead?
Previously we started autopilot when a server gained leadership and stopped it when a server lost leadership. For some upcoming features we need autopilot on all servers to continually track the state of all servers. This PR pulls in a raft-autopilot update and enables that functionality.
When a Consul server is started, autopilot will be started but with reconciliation disabled to prevent it from attempting raft config modifications. Upon gaining leadership we will tell the running autopilot to enable reconciliation and then to disable reconciliation once the server loses leadership.
There are a few related changes in this PR as well.
The autopilot.healthy and autopilot.failure_tolerance metrics will now be emitted regularly on both followers and the leader. There is now a "leader" label added to the metrics to make it simpler to pick out the leader's view for the purposes of alerting.
The Operator.AutopilotState and Operator.ServerHealth RPCs no longer forcefully forward to the leader. Our HTTP server and CLI default to non-stale queries, so the overall behavior should be unchanged for existing uses. The exception is when the -stale CLI parameter or the corresponding query parameter is used: those RPCs will then not be forwarded to the leader and will instead return the follower server's view of the state.
Lastly, the second commit in this PR is unrelated. It's just fixing a linter warning that popped up while I was working on this.
TODOS: