Transit: Fix bug where locks were not being freed on specific operations - with tests #139

cipherboy · 2024-02-12T02:02:29Z

@Gabrielopesantos started this one in #136 (thank you!!!), but I got a little excited with the tests, so I'll take over from here... :-)

This PR includes the fix to a bug happening in certain transit operations, sign and verify, where locks weren't being freed as expected which lead to goroutine deadlocks (operations hitting their timeout).

The deferred call of Unlock has been included in the if block so that it is only called if Lock has been initially called.

EDIT: After doing some additional tests (on other files), it seems there are cases where makes difference between leaving the deferred Unlock call to be in the if block or outside. Are there any known cases where Unlock calls are expected even if Lock isn't called?

Resolves hashicorp/vault#25325.

This moves all of Transit and the keysutil.LockManager.GetPolicy(...) helper to a consistent pattern: acquire lock and defer, or, if given an implicit lock and we don't need one, unlock it.

This imposes a slight performance penalty on mixed read/write workload: by holding the lock longer during templating (in the batch encrypt/decrypt or HMAC cases), we can prevent a write from occurring sooner. This is an OK trade off for

Notably, this catches a panic caused by concurrent access to an unlocked pointer: when I introduced d52d307, I had forgotten to adjust the locking mechanism for keyPolicyWrite: previously it didn't need any values in the returned policy and so could unlock in the one case it held the lock. After that commit, it read that policy, which (when caching is in use and rotation was occurring) could cause a concurrent read/write to the same map:

fatal error: concurrent map iteration and map write

goroutine 12681 [running]:github.com/openbao/openbao/builtin/logical/transit.(*backend).formatKeyPolicy(0x479ecb0?, 0xc00094dd40, {0x0, 0x0, 0x0})
      /home/cipherboy/GitHub/openbao/openbao/builtin/logical/transit/path_keys.go:364 +0x186cgithub.com/openbao/openbao/builtin/logical/transit.(*backend).pathPolicyWrite(0xc00065b500, {0x479e428, 0xc0031a7f20}, 0xc0037a4fc0, 0xc0014f6b10)
      /home/cipherboy/GitHub/openbao/openbao/builtin/logical/transit/path_keys.go:257 +0xd25github.com/openbao/openbao/sdk/framework.(*Backend).HandleRequest(0xc000f141e0, {0x479e428, 0xc0031a7f20}, 0xc0037a4fc0)
      /home/cipherboy/GitHub/openbao/openbao/sdk/framework/backend.go:300 +0xac8github.com/openbao/openbao/builtin/plugin/v5.(*backend).HandleRequest(0xc0010526c0, {0x479e428, 0xc0031a7f20}, 0xc0037a4fc0)

Gabrielopesantos

Interesting, I didn't consider the slight performance trade off that you mention but it should be very minor.

cipherboy · 2024-02-12T22:50:03Z

Thanks @Gabrielopesantos!

Interesting, I didn't consider the slight performance trade off that you mention but it should be very minor.

This is based on the size of the batch request. A larger batch will result in more time taken to marshal the response, but relative to the encryption operation I'm guessing its minor, yes.

It is a read lock though, so at worst it delays future writes (+ newly inflight reads post-write?) from happening... but it doesn't matter on read-heavy workloads.

Signed-off-by: Gabriel Santos <gabrielopesantos97@gmail.com>

Cleanup is presently implicitly deferred, though the comment notes it aught to be deferred. In no return case is cleanup not called, and this likely holds for the future, so we definitely want to explicitly call defer here and let cleanup handle the edge cases for not calling unlock appropriately (localizing them). However, note this doesn't address the underlying confusion around locking and who grabs the lock (LockManager/Policy level locking vs Transit code locking). Signed-off-by: Alexander Scheel <alexander.m.scheel@gmail.com>

While strictly the use of defer holds the lock for longer than is necessary (e.g., with large batch operations, formatting the response does not require the lock to be held), we benefit from a simpler to audit, consistent, locking semantic due to the complexity of the locking mechanism. Unify the locking behavior and make every call site use the > if { lock(...) } defer unlock(...) pattern. Signed-off-by: Alexander Scheel <alexander.m.scheel@gmail.com>

These fail with a message like: > panic: test timed out after 10m0s > running tests: > TestTransit_NoDeadlock_SignVerify/rsa-2048 (8m31s) > TestTransit_NoDeadlock_SignVerify/rsa-3072 (8m31s) > TestTransit_NoDeadlock_SignVerify/rsa-4096 (8m31s) when a deadlock has occurred. Note that the context's deadline doesn't seem to be respected as the read to the server hangs indefinitely. This validates both the cached and direct cases. Signed-off-by: Alexander Scheel <alexander.m.scheel@gmail.com>

After (potentially) creating a key (which, will be a no-op), we failed to grab a read lock on the key to output the policy information contained in that key. This meant that a concurrent writer on the same key (such as keys/rotate) could cause a panic when caching was enabled: > fatal error: concurrent map iteration and map write > > goroutine 12681 [running]:github.com/openbao/openbao/builtin/logical/transit.(*backend).formatKeyPolicy(0x479ecb0?, 0xc00094dd40, {0x0, 0x0, 0x0}) > /home/cipherboy/GitHub/openbao/openbao/builtin/logical/transit/path_keys.go:364 +0x186cgithub.com/openbao/openbao/builtin/logical/transit.(*backend).pathPolicyWrite(0xc00065b500, {0x479e428, 0xc0031a7f20}, 0xc0037a4fc0, 0xc0014f6b10) > /home/cipherboy/GitHub/openbao/openbao/builtin/logical/transit/path_keys.go:257 +0xd25github.com/openbao/openbao/sdk/framework.(*Backend).HandleRequest(0xc000f141e0, {0x479e428, 0xc0031a7f20}, 0xc0037a4fc0) > /home/cipherboy/GitHub/openbao/openbao/sdk/framework/backend.go:300 +0xac8github.com/openbao/openbao/builtin/plugin/v5.(*backend).HandleRequest(0xc0010526c0, {0x479e428, 0xc0031a7f20}, 0xc0037a4fc0) This was introduced by me in d52d307. Signed-off-by: Alexander Scheel <alexander.m.scheel@gmail.com>

Gabrielopesantos approved these changes Feb 12, 2024

View reviewed changes

cipherboy force-pushed the clean-up-transit-keysutil branch from 66e5c86 to e6f913d Compare February 19, 2024 13:33

naphelps self-requested a review February 19, 2024 18:57

Gabrielopesantos and others added 5 commits February 19, 2024 13:57

Defer policy Unlock call when Lock is called

a519a4a

Signed-off-by: Gabriel Santos <gabrielopesantos97@gmail.com>

naphelps force-pushed the clean-up-transit-keysutil branch from e6f913d to 3a7860f Compare February 19, 2024 18:57

naphelps approved these changes Feb 19, 2024

View reviewed changes

naphelps merged commit 633d1cf into openbao:main Feb 19, 2024
3 of 9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Transit: Fix bug where locks were not being freed on specific operations - with tests #139

Transit: Fix bug where locks were not being freed on specific operations - with tests #139

cipherboy commented Feb 12, 2024 •

edited

Loading

Gabrielopesantos left a comment

cipherboy commented Feb 12, 2024

Transit: Fix bug where locks were not being freed on specific operations - with tests #139

Transit: Fix bug where locks were not being freed on specific operations - with tests #139

Conversation

cipherboy commented Feb 12, 2024 • edited Loading

Gabrielopesantos left a comment

Choose a reason for hiding this comment

cipherboy commented Feb 12, 2024

cipherboy commented Feb 12, 2024 •

edited

Loading