Decouple xds capacity controller and raft-autopilot #20511
This prevents a potential bug where autopilot deadlocks while attempting to execute `AutopilotDelegate.NotifyState()` on an xdscapacity controller that stopped consuming messages.
This seems fine, but the better solution is probably to not call `SetServerCount` directly from `NotifyState`. Instead, the xds capacity controller should use the event publisher to listen for server events and update itself.
As the goal here is to backport pretty far, this limited change makes sense over rethinking the interaction completely.
Agreed. I will take that into consideration when we implement the corresponding changes for the new catalog in a future release.
@hashi-derek, a backport is missing for this PR [20511] for versions [1.15,1.16,1.17]. Please perform the backport manually and add the following snippet to your backport PR description:
This prevents a potential bug where raft-autopilot could deadlock while attempting to execute `AutopilotDelegate.NotifyState()` on an xdscapacity controller that has stopped or delayed consuming messages.

The following chain of logic appears to be a potential problem:

- The call to `countProxies()` can wait for up to 1 minute before retrying to load info. It also never resets its counter, which means a significant wait is more likely. https://github.com/hashicorp/consul/blob/v1.17.2/agent/consul/xdscapacity/capacity.go#L194
- `countProxies()` shares a loop with the consumption from the `serverCh` that has counts published to it. https://github.com/hashicorp/consul/blob/v1.17.2/agent/consul/xdscapacity/capacity.go#L93-L97
- The `serverCh` channel is published to by the `SetServerCount()` function. https://github.com/hashicorp/consul/blob/v1.17.2/agent/consul/xdscapacity/capacity.go#L113
- The autopilot state delegate calls `SetServerCount()` on every change. https://github.com/hashicorp/consul/blob/main/agent/consul/autopilot.go#L68
- Autopilot acquires an exclusive lock and waits for the delegate to finish execution. https://github.com/hashicorp/raft-autopilot/blob/v0.1.6/state.go#L390-L393