Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock on HA leadership transfer when standby was actively forwarding a request #12601

Closed
impl opened this issue Sep 21, 2021 · 3 comments
Closed
Labels
bug Used to indicate a potential bug core/ha specific to high-availability
Milestone

Comments

@impl
Copy link
Contributor

impl commented Sep 21, 2021

Describe the bug

A deadlock can occur between the stateLock and requestForwardingConnectionLock of the standby core when assuming HA leadership while actively forwarding a request. In particular, the following goroutines will hang indefinitely:

goroutine 109 [semacquire, 37 minutes]:
sync.runtime_SemacquireMutex(0xc0003be980, 0x0, 0x0)
	/goroot/src/runtime/sema.go:71 +0x47
sync.(*RWMutex).Lock(0xc0003be978)
	/goroot/src/sync/rwmutex.go:103 +0x85
github.com/hashicorp/vault/vault.standardUnsealStrategy.unseal(0x5ede4a0, 0xc002aee7c0, 0x5f3ab40, 0xc00111c030, 0xc0003be600, 0x0, 0x0)
	/gopath/src/github.com/hashicorp/vault/vault/core.go:1918 +0x5c
github.com/hashicorp/vault/vault.(*Core).postUnseal(0xc0003be600, 0x5ede4a0, 0xc002aee7c0, 0xc00042d330, 0x5e35000, 0x8c62d00, 0x0, 0x0)
	/gopath/src/github.com/hashicorp/vault/vault/core.go:2086 +0x336
github.com/hashicorp/vault/vault.(*Core).waitForLeadership(0xc0003be600, 0x0, 0xc00106afc0, 0xc00106b140)
	/gopath/src/github.com/hashicorp/vault/vault/ha.go:565 +0xc2e
github.com/hashicorp/vault/vault.(*Core).runStandby.func7(0x0, 0x0)
	/gopath/src/github.com/hashicorp/vault/vault/ha.go:387 +0x45
github.com/hashicorp/vault/vendor/github.com/oklog/run.(*Group).Run.func1(0xc00107d380, 0xc00111db60, 0xc00113e9a0)
	/gopath/src/github.com/hashicorp/vault/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/hashicorp/vault/vendor/github.com/oklog/run.(*Group).Run
	/gopath/src/github.com/hashicorp/vault/vendor/github.com/oklog/run/group.go:37 +0xbb

goroutine 12292904 [semacquire, 37 minutes]:
sync.runtime_SemacquireMutex(0xc0003be6fc, 0xc00816d800, 0x0)
	/goroot/src/runtime/sema.go:71 +0x47
sync.(*RWMutex).RLock(...)
	/goroot/src/sync/rwmutex.go:50
github.com/hashicorp/vault/vault.(*Core).PerfStandby(0xc0003be600, 0xc00816d980)
	/gopath/src/github.com/hashicorp/vault/vault/ha.go:67 +0xa5
github.com/hashicorp/vault/vault.(*Core).ForwardRequest(0xc0003be600, 0xc00402a700, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/gopath/src/github.com/hashicorp/vault/vault/request_forwarding.go:375 +0x498
github.com/hashicorp/vault/http.forwardRequest(0xc0003be600, 0x5eb1660, 0xc0071f68c0, 0xc00402a700)
	/gopath/src/github.com/hashicorp/vault/http/handler.go:826 +0x193
github.com/hashicorp/vault/http.handleRequestForwarding.func1(0x5eb1660, 0xc0071f68c0, 0xc00402a700)
	/gopath/src/github.com/hashicorp/vault/http/handler.go:794 +0x272
net/http.HandlerFunc.ServeHTTP(0xc001077f00, 0x5eb1660, 0xc0071f68c0, 0xc00402a700)
	/goroot/src/net/http/server.go:2042 +0x44
net/http.(*ServeMux).ServeHTTP(0xc001003340, 0x5eb1660, 0xc0071f68c0, 0xc00402a700)
	/goroot/src/net/http/server.go:2417 +0x1ad
github.com/hashicorp/vault/http.wrapHelpHandler.func1(0x5eb1660, 0xc0071f68c0, 0xc00402a700)
	/gopath/src/github.com/hashicorp/vault/http/help.go:24 +0x14a
net/http.HandlerFunc.ServeHTTP(0xc001077fe0, 0x5eb1660, 0xc0071f68c0, 0xc00402a700)
	/goroot/src/net/http/server.go:2042 +0x44
github.com/hashicorp/vault/http.wrapCORSHandler.func1(0x5eb1660, 0xc0071f68c0, 0xc00402a700)
	/gopath/src/github.com/hashicorp/vault/http/cors.go:29 +0x91e
net/http.HandlerFunc.ServeHTTP(0xc00113e000, 0x5eb1660, 0xc0071f68c0, 0xc00402a700)
	/goroot/src/net/http/server.go:2042 +0x44
github.com/hashicorp/vault/http.rateLimitQuotaWrapping.func1(0x5eb1660, 0xc0071f68c0, 0xc00402a700)
	/gopath/src/github.com/hashicorp/vault/http/util.go:98 +0xa95
net/http.HandlerFunc.ServeHTTP(0xc00113e020, 0x5eb1660, 0xc0071f68c0, 0xc00402a700)
	/goroot/src/net/http/server.go:2042 +0x44
github.com/hashicorp/vault/http.wrapGenericHandler.func1(0x5eb1660, 0xc0071f68c0, 0xc00402a500)
	/gopath/src/github.com/hashicorp/vault/http/handler.go:353 +0x55a
net/http.HandlerFunc.ServeHTTP(0xc001003440, 0x5eb1660, 0xc0071f68c0, 0xc00402a500)
	/goroot/src/net/http/server.go:2042 +0x44
github.com/hashicorp/vault/vendor/github.com/hashicorp/go-cleanhttp.PrintablePathCheckHandler.func1(0x5eb1660, 0xc0071f68c0, 0xc00402a500)
	/gopath/src/github.com/hashicorp/vault/vendor/github.com/hashicorp/go-cleanhttp/handlers.go:42 +0xba
net/http.HandlerFunc.ServeHTTP(0xc00113e040, 0x5eb1660, 0xc0071f68c0, 0xc00402a500)
	/goroot/src/net/http/server.go:2042 +0x44
net/http.serverHandler.ServeHTTP(0xc0010702a0, 0x5eb1660, 0xc0071f68c0, 0xc00402a500)
	/goroot/src/net/http/server.go:2843 +0xa3
net/http.(*conn).serve(0xc002eb43c0, 0x5ede4a0, 0xc00628b200)
	/goroot/src/net/http/server.go:1925 +0x8ad
created by net/http.(*Server).Serve
	/goroot/src/net/http/server.go:2969 +0x36c

To Reproduce

  1. Make a storage request using the standby that requires request forwarding and takes some time (might be easiest to use some plugin that just sleeps for 10 seconds on every request).
  2. Have the leader step down while forwarded request is in-flight (this happened automatically for us because of infrastructure changes, but I think it's reproducible using vault operator step-down) -- nothing unusual in the leader logs.
  3. Observe that the standby will never get past "post-unseal setup starting" after acquiring the leadership lock.

Expected behavior

Standby should proceed with unseal strategy and start handling requests as HA leader.

Environment:

  • Vault Server Version (retrieve with vault status): 1.7.3
  • Vault CLI Version (retrieve with vault version): N/A
  • Server Operating System/Architecture: Linux/amd64

Vault server configuration file(s):

default_max_request_duration = "15m"

listener "tcp" {
  address = "0.0.0.0:8200"
  tls_disable = "true"
}

ui = false
plugin_directory = "/nebula/vault/plugins"

storage "gcs" {
  bucket     = "[...]"
  ha_enabled = "true"
}

seal "gcpckms" {
  project    = "[...]"
  region     = "[...]"
  key_ring   = "[...]"
  crypto_key = "[...]"
}

Additional context

Full standby server logs, including goroutine trace.

I believe this is the same issue as previously reported:

Bug was likely introduced by #9496.

Thanks also to @scotje for his help uncovering this bug.

@ncabatoff
Copy link
Collaborator

Nice find, and thanks for the really great bug report. Concise yet contains everything we could ask for.

I think the fix is easy, but I'm not going to PR it because we should repro this ourselves so we can make sure this solves it, and I haven't the time myself right now. We'll make sure someone gets to this soon, hopefully in time for the next minor release.

Proposed fix:

diff --git a/vault/request_forwarding.go b/vault/request_forwarding.go
index bdd31a54f..5ee438685 100644
--- a/vault/request_forwarding.go
+++ b/vault/request_forwarding.go
@@ -329,6 +329,7 @@ func (c *Core) clearForwardingClients() {
 // ForwardRequest forwards a given request to the active node and returns the
 // response.
 func (c *Core) ForwardRequest(req *http.Request) (int, http.Header, []byte, error) {
+       isPerfStandby := c.PerfStandby()
        c.requestForwardingConnectionLock.RLock()
        defer c.requestForwardingConnectionLock.RUnlock()
 
@@ -372,7 +373,7 @@ func (c *Core) ForwardRequest(req *http.Request) (int, http.Header, []byte, erro
        // If we are a perf standby and the request was forwarded to the active node
        // we should attempt to wait for the WAL to ship to offer best effort read after
        // write guarantees
-       if c.PerfStandby() && resp.LastRemoteWal > 0 {
+       if isPerfStandby && resp.LastRemoteWal > 0 {
                WaitUntilWALShipped(req.Context(), c, resp.LastRemoteWal)
        }

@ncabatoff ncabatoff added bug Used to indicate a potential bug core/ha specific to high-availability labels Sep 22, 2021
hghaf099 added a commit that referenced this issue Sep 30, 2021
was actively forwarding a request
fixes GH #12601
hghaf099 added a commit that referenced this issue Oct 4, 2021
* Fix a Deadlock on HA leadership transfer when standby
was actively forwarding a request
fixes GH #12601

* adding the changelog
hghaf099 added a commit that referenced this issue Oct 4, 2021
* Fix a Deadlock on HA leadership transfer when standby
was actively forwarding a request
fixes GH #12601

* adding the changelog
hghaf099 added a commit that referenced this issue Oct 4, 2021
* Fix a Deadlock on HA leadership transfer when standby
was actively forwarding a request
fixes GH #12601

* adding the changelog
hghaf099 added a commit that referenced this issue Oct 4, 2021
* Fix a Deadlock on HA leadership transfer when standby
was actively forwarding a request
fixes GH #12601

* adding the changelog
hghaf099 added a commit that referenced this issue Oct 4, 2021
* Fix a Deadlock on HA leadership transfer when standby
was actively forwarding a request
fixes GH #12601

* adding the changelog
hghaf099 added a commit that referenced this issue Oct 4, 2021
* Fix a Deadlock on HA leadership transfer when standby
was actively forwarding a request
fixes GH #12601

* adding the changelog
hghaf099 added a commit that referenced this issue Oct 4, 2021
* Fix a Deadlock on HA leadership transfer when standby
was actively forwarding a request
fixes GH #12601

* adding the changelog
@hghaf099
Copy link
Contributor

hghaf099 commented Oct 4, 2021

Thanks for submitting this issue. The issue has been addressed in the above PRs on main and three released branches. I am going to close this issue.

@hghaf099 hghaf099 closed this as completed Oct 4, 2021
@scotje
Copy link

scotje commented Oct 4, 2021

Thanks for the quick resolution, we really appreciate it!

@mladlow mladlow added this to the 1.6.8 milestone Jan 31, 2022
tionebsalocin added a commit to criteo-forks/vault that referenced this issue Mar 14, 2022
…2691)

* Fix a Deadlock on HA leadership transfer when standby
was actively forwarding a request
fixes GH hashicorp#12601

* adding the changelog
tionebsalocin pushed a commit to criteo-forks/vault that referenced this issue Mar 15, 2022
…#12721)

* Fix a Deadlock on HA leadership transfer when standby
was actively forwarding a request
fixes GH hashicorp#12601

* adding the changelog
tionebsalocin pushed a commit to criteo-forks/vault that referenced this issue Mar 15, 2022
…#12722)

* Fix a Deadlock on HA leadership transfer when standby
was actively forwarding a request
fixes GH hashicorp#12601

* adding the changelog
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Used to indicate a potential bug core/ha specific to high-availability
Projects
None yet
Development

No branches or pull requests

5 participants