Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decouple xds capacity controller and raft-autopilot #20511

Merged
merged 2 commits into from Feb 8, 2024

Conversation

hashi-derek
Copy link
Member

@hashi-derek hashi-derek commented Feb 6, 2024

This prevents a potential bug where raft-autopilot could deadlock while attempting to execute AutopilotDelegate.NotifyState() on an xdscapacity controller that has stopped / delayed consuming messages.

The following line of logic appears that it could be a potential problem:

  1. The call to countProxies() can wait for up to 1 minute before retrying to load info. It also never resets its counter, which means a significant wait is more likely. https://github.com/hashicorp/consul/blob/v1.17.2/agent/consul/xdscapacity/capacity.go#L194

  2. countProxies() shares a loop with the consumption from the serverCh that has counts published to it. https://github.com/hashicorp/consul/blob/v1.17.2/agent/consul/xdscapacity/capacity.go#L93-L97

  3. The serverCh channel is published to by the SetServerCount() function. https://github.com/hashicorp/consul/blob/v1.17.2/agent/consul/xdscapacity/capacity.go#L113

  4. The autopilot state delegate calls SetServerCount() on every change. https://github.com/hashicorp/consul/blob/main/agent/consul/autopilot.go#L68

  5. Autopilot acquires an exclusive lock and waits for the delegate to finish execution. https://github.com/hashicorp/raft-autopilot/blob/v0.1.6/state.go#L390-L393

@hashi-derek hashi-derek changed the title Decouple xds capacity controller and autopilot Decouple xds capacity controller and raft-autopilot Feb 6, 2024
This prevents a potential bug where autopilot deadlocks while attempting
to execute `AutopilotDelegate.NotifyState()` on an xdscapacity controller
that stopped consuming messages.
@hashi-derek hashi-derek force-pushed the derekm/decouple-autopilot-xdscapacity branch from ad28907 to 1668a31 Compare February 7, 2024 18:47
@hashi-derek hashi-derek marked this pull request as ready for review February 7, 2024 20:20
@hashi-derek hashi-derek added backport/1.15 Changes are backported to 1.15 backport/1.16 Changes are backported to 1.16 backport/1.17 Changes are backported to 1.17 backport/1.18 Changes are backported to 1.18 labels Feb 7, 2024
Copy link
Member

@mkeeler mkeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine but the better solution is to probably not directly call SetServerCount from NotifyState. Instead, the xds capacity controller should use the event publisher to listen for server events and update itself.

Copy link
Member

@mkeeler mkeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the goal here is to backport pretty far this limited change makes sense over rethinking the interaction completely.

@hashi-derek
Copy link
Member Author

This seems fine but the better solution is to probably not directly call SetServerCount from NotifyState. Instead, the xds capacity controller should use the event publisher to listen for server events and update itself.

Agreed. I will take that into consideration when we implement the corresponding changes for the new catalog in a future release.

@hc-github-team-consul-core
Copy link
Collaborator

@hashi-derek, a backport is missing for this PR [20511] for versions [1.15,1.16,1.17] please perform the backport manually and add the following snippet to your backport PR description:

<details>
	<summary> Overview of commits </summary>
		- <<backport commit 1>>
		- <<backport commit 2>>
		...
</details>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.15 Changes are backported to 1.15 backport/1.16 Changes are backported to 1.16 backport/1.17 Changes are backported to 1.17 backport/1.18 Changes are backported to 1.18
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants