Snapshot Restore can cause Autopilot to not be executing on the leader #9626

Closed
mkeeler opened this issue Jan 22, 2021 · 0 comments · Fixed by #9644

Comments


mkeeler commented Jan 22, 2021

Overview of the Issue

The process of restoring a snapshot has the potential to result in Autopilot not executing on the leader server in a cluster.

Reproduction Steps

Steps to reproduce this issue:

  1. Start a server.
  2. Push some data in.
  3. Take a snapshot of the server.
  4. Restore the snapshot on the server.
  5. Join a second server to the first.
  6. Observe that the output of consul operator autopilot state does not show the second server.

Note that the issue is non-deterministic: these steps will not trigger it on every run.

Details

The snapshot restore process requests that the leader reassert its leadership after the snapshot is restored here:

case s.reassertLeaderCh <- errCh:

The leader loop handles that request here:

s.revokeLeadership()
err := s.establishLeadership(stopCtx)

revokeLeadership will stop autopilot here:

s.autopilot.Stop()

establishLeadership will restart autopilot here:

s.autopilot.Start(ctx)

The Autopilot.Stop method returns a chan which can be selected on to determine when autopilot has actually shut down. revokeLeadership currently does not wait on it, so establishLeadership can call Start before the previous run has finished stopping. Having revokeLeadership wait on that chan would resolve the issue. However, this is arguably an autopilot issue too, so we could instead fix the corresponding problem in the autopilot library and then pull in the updated dependency.
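To make the suspected race concrete, here is a minimal sketch using a toy type rather than the real autopilot library. Everything in it is hypothetical: toyAutopilot, exitGate, and reassertLeadership are illustration-only names, and the key assumption is that Start is a no-op while the previous run is still marked as running. The exitGate channel only exists to make the slow shutdown deterministic for the example.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// toyAutopilot is a hypothetical stand-in for the autopilot type. It models
// only the Start/Stop lifecycle; the real library's internals are assumed.
type toyAutopilot struct {
	mu       sync.Mutex
	running  bool
	cancel   context.CancelFunc
	done     chan struct{}
	exitGate chan struct{} // example-only hook: the run loop exits once this is closed
}

// Start launches the run loop unless one is still marked as running.
func (a *toyAutopilot) Start(ctx context.Context) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.running {
		return // assumed behavior: no-op while running or still shutting down
	}
	ctx, a.cancel = context.WithCancel(ctx)
	a.done = make(chan struct{})
	a.running = true
	go a.run(ctx, a.done)
}

// run blocks until cancelled, then simulates a slow shutdown via exitGate.
func (a *toyAutopilot) run(ctx context.Context, done chan struct{}) {
	<-ctx.Done()
	<-a.exitGate
	a.mu.Lock()
	a.running = false
	a.mu.Unlock()
	close(done)
}

// Stop requests shutdown and returns a chan that is closed once the run
// loop has actually exited.
func (a *toyAutopilot) Stop() <-chan struct{} {
	a.mu.Lock()
	defer a.mu.Unlock()
	if !a.running {
		done := make(chan struct{})
		close(done)
		return done
	}
	a.cancel()
	return a.done
}

// Running reports whether a run loop is currently marked as active.
func (a *toyAutopilot) Running() bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.running
}

// reassertLeadership mimics revokeLeadership followed by establishLeadership
// and reports whether the toy autopilot ends up running.
func reassertLeadership(waitForStop bool) bool {
	gate := make(chan struct{})
	ap := &toyAutopilot{exitGate: gate}
	ap.Start(context.Background())

	stopped := ap.Stop() // revokeLeadership
	if waitForStop {
		close(gate) // the old run loop gets to finish...
		<-stopped   // ...and we wait for it before restarting
	}
	ap.Start(context.Background()) // establishLeadership
	if !waitForStop {
		close(gate) // the old run loop only finishes after Start was called
		<-stopped
	}
	return ap.Running()
}

func main() {
	fmt.Println("running after reassert, Stop not awaited:", reassertLeadership(false)) // false
	fmt.Println("running after reassert, Stop awaited:", reassertLeadership(true))      // true
}
```

In the first case the second Start returns early because the old run loop has not exited yet, and once it does exit nothing restarts it. Waiting on the chan returned by Stop before calling Start avoids that window, which is the change proposed above for revokeLeadership.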
