Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transit: Fix bug where locks were not being freed on specific operations - with tests #139

Merged
merged 5 commits into from
Feb 19, 2024

Conversation

cipherboy
Copy link
Member

@cipherboy cipherboy commented Feb 12, 2024

@Gabrielopesantos started this one in #136 (thank you!!!), but I got a little excited with the tests, so I'll take over from here... :-)

This PR includes the fix to a bug happening in certain transit operations, sign and verify, where locks weren't being freed as expected which lead to goroutine deadlocks (operations hitting their timeout).

The deferred call of Unlock has been included in the if block so that it is only called if Lock has been initially called.

EDIT: After doing some additional tests (on other files), it seems there are cases where makes difference between leaving the deferred Unlock call to be in the if block or outside. Are there any known cases where Unlock calls are expected even if Lock isn't called?

Resolves hashicorp/vault#25325.


This moves all of Transit and the keysutil.LockManager.GetPolicy(...) helper to a consistent pattern: acquire lock and defer, or, if given an implicit lock and we don't need one, unlock it.

This imposes a slight performance penalty on mixed read/write workload: by holding the lock longer during templating (in the batch encrypt/decrypt or HMAC cases), we can prevent a write from occurring sooner. This is an OK trade off for


Notably, this catches a panic caused by concurrent access to an unlocked pointer: when I introduced d52d307, I had forgotten to adjust the locking mechanism for keyPolicyWrite: previously it didn't need any values in the returned policy and so could unlock in the one case it held the lock. After that commit, it read that policy, which (when caching is in use and rotation was occurring) could cause a concurrent read/write to the same map:

fatal error: concurrent map iteration and map write

goroutine 12681 [running]:github.com/openbao/openbao/builtin/logical/transit.(*backend).formatKeyPolicy(0x479ecb0?, 0xc00094dd40, {0x0, 0x0, 0x0})
      /home/cipherboy/GitHub/openbao/openbao/builtin/logical/transit/path_keys.go:364 +0x186cgithub.com/openbao/openbao/builtin/logical/transit.(*backend).pathPolicyWrite(0xc00065b500, {0x479e428, 0xc0031a7f20}, 0xc0037a4fc0, 0xc0014f6b10)
      /home/cipherboy/GitHub/openbao/openbao/builtin/logical/transit/path_keys.go:257 +0xd25github.com/openbao/openbao/sdk/framework.(*Backend).HandleRequest(0xc000f141e0, {0x479e428, 0xc0031a7f20}, 0xc0037a4fc0)
      /home/cipherboy/GitHub/openbao/openbao/sdk/framework/backend.go:300 +0xac8github.com/openbao/openbao/builtin/plugin/v5.(*backend).HandleRequest(0xc0010526c0, {0x479e428, 0xc0031a7f20}, 0xc0037a4fc0)

Copy link
Contributor

@Gabrielopesantos Gabrielopesantos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I didn't consider the slight performance trade off that you mention but it should be very minor.

@cipherboy
Copy link
Member Author

Thanks @Gabrielopesantos!

Interesting, I didn't consider the slight performance trade off that you mention but it should be very minor.

This is based on the size of the batch request. A larger batch will result in more time taken to marshal the response, but relative to the encryption operation I'm guessing its minor, yes.

It is a read lock though, so at worst it delays future writes (+ newly inflight reads post-write?) from happening... but it doesn't matter on read-heavy workloads.

Gabrielopesantos and others added 5 commits February 19, 2024 13:57
Signed-off-by: Gabriel Santos <gabrielopesantos97@gmail.com>
Cleanup is presently implicitly deferred, though the comment notes it
aught to be deferred. In no return case is cleanup not called, and this
likely holds for the future, so we definitely want to explicitly call
defer here and let cleanup handle the edge cases for not calling unlock
appropriately (localizing them).

However, note this doesn't address the underlying confusion around
locking and who grabs the lock (LockManager/Policy level locking vs
Transit code locking).

Signed-off-by: Alexander Scheel <alexander.m.scheel@gmail.com>
While strictly the use of defer holds the lock for longer than is
necessary (e.g., with large batch operations, formatting the response
does not require the lock to be held), we benefit from a simpler to
audit, consistent, locking semantic due to the complexity of the locking
mechanism.

Unify the locking behavior and make every call site use the

> if { lock(...) } defer unlock(...)

 pattern.

Signed-off-by: Alexander Scheel <alexander.m.scheel@gmail.com>
These fail with a message like:

> panic: test timed out after 10m0s
> running tests:
> 	TestTransit_NoDeadlock_SignVerify/rsa-2048 (8m31s)
> 	TestTransit_NoDeadlock_SignVerify/rsa-3072 (8m31s)
> 	TestTransit_NoDeadlock_SignVerify/rsa-4096 (8m31s)

when a deadlock has occurred. Note that the context's deadline doesn't
seem to be respected as the read to the server hangs indefinitely.

This validates both the cached and direct cases.

Signed-off-by: Alexander Scheel <alexander.m.scheel@gmail.com>
After (potentially) creating a key (which, will be a no-op), we failed
to grab a read lock on the key to output the policy information
contained in that key. This meant that a concurrent writer on the same
key (such as keys/rotate) could cause a panic when caching was enabled:

> fatal error: concurrent map iteration and map write
>
> goroutine 12681 [running]:github.com/openbao/openbao/builtin/logical/transit.(*backend).formatKeyPolicy(0x479ecb0?, 0xc00094dd40, {0x0, 0x0, 0x0})
> 	/home/cipherboy/GitHub/openbao/openbao/builtin/logical/transit/path_keys.go:364 +0x186cgithub.com/openbao/openbao/builtin/logical/transit.(*backend).pathPolicyWrite(0xc00065b500, {0x479e428, 0xc0031a7f20}, 0xc0037a4fc0, 0xc0014f6b10)
> 	/home/cipherboy/GitHub/openbao/openbao/builtin/logical/transit/path_keys.go:257 +0xd25github.com/openbao/openbao/sdk/framework.(*Backend).HandleRequest(0xc000f141e0, {0x479e428, 0xc0031a7f20}, 0xc0037a4fc0)
> 	/home/cipherboy/GitHub/openbao/openbao/sdk/framework/backend.go:300 +0xac8github.com/openbao/openbao/builtin/plugin/v5.(*backend).HandleRequest(0xc0010526c0, {0x479e428, 0xc0031a7f20}, 0xc0037a4fc0)

This was introduced by me in d52d307.

Signed-off-by: Alexander Scheel <alexander.m.scheel@gmail.com>
@naphelps naphelps merged commit 633d1cf into openbao:main Feb 19, 2024
3 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Deadlock] Hanging when reading a transit key
3 participants