Transit: Fix bug where locks were not being freed on specific operations - with tests #139
Conversation
Interesting, I hadn't considered the slight performance trade-off that you mention, but it should be very minor.
Thanks @Gabrielopesantos!
This is based on the size of the batch request. A larger batch will result in more time taken to marshal the response, but relative to the encryption operation I'm guessing it's minor, yes. It is a read lock, though, so at worst it delays future writes (and newly in-flight reads arriving after the write?) from happening... but it doesn't matter on read-heavy workloads.
Force-pushed from 66e5c86 to e6f913d
Signed-off-by: Gabriel Santos <gabrielopesantos97@gmail.com>
Cleanup is presently only implicitly deferred, though the comment notes it ought to be deferred. Cleanup is called on every return path, and this likely holds for the future, so we definitely want to explicitly call defer here and let cleanup handle the edge cases for not calling unlock appropriately (localizing them). However, note this doesn't address the underlying confusion around locking and who grabs the lock (LockManager/Policy-level locking vs Transit code locking). Signed-off-by: Alexander Scheel <alexander.m.scheel@gmail.com>
While strictly the use of defer holds the lock for longer than is necessary (e.g., with large batch operations, formatting the response does not require the lock to be held), we benefit from a simpler-to-audit, consistent locking semantic, given the complexity of the locking mechanism. Unify the locking behavior and make every call site use the `if { lock(...) }; defer unlock(...)` pattern. Signed-off-by: Alexander Scheel <alexander.m.scheel@gmail.com>
These fail with a message like:

```
panic: test timed out after 10m0s
running tests:
	TestTransit_NoDeadlock_SignVerify/rsa-2048 (8m31s)
	TestTransit_NoDeadlock_SignVerify/rsa-3072 (8m31s)
	TestTransit_NoDeadlock_SignVerify/rsa-4096 (8m31s)
```

when a deadlock has occurred. Note that the context's deadline doesn't seem to be respected, as the read to the server hangs indefinitely. This validates both the cached and direct cases. Signed-off-by: Alexander Scheel <alexander.m.scheel@gmail.com>
After (potentially) creating a key (which will be a no-op), we failed to grab a read lock on the key to output the policy information contained in that key. This meant that a concurrent writer on the same key (such as keys/rotate) could cause a panic when caching was enabled:

```
fatal error: concurrent map iteration and map write

goroutine 12681 [running]:
github.com/openbao/openbao/builtin/logical/transit.(*backend).formatKeyPolicy(0x479ecb0?, 0xc00094dd40, {0x0, 0x0, 0x0})
	/home/cipherboy/GitHub/openbao/openbao/builtin/logical/transit/path_keys.go:364 +0x186c
github.com/openbao/openbao/builtin/logical/transit.(*backend).pathPolicyWrite(0xc00065b500, {0x479e428, 0xc0031a7f20}, 0xc0037a4fc0, 0xc0014f6b10)
	/home/cipherboy/GitHub/openbao/openbao/builtin/logical/transit/path_keys.go:257 +0xd25
github.com/openbao/openbao/sdk/framework.(*Backend).HandleRequest(0xc000f141e0, {0x479e428, 0xc0031a7f20}, 0xc0037a4fc0)
	/home/cipherboy/GitHub/openbao/openbao/sdk/framework/backend.go:300 +0xac8
github.com/openbao/openbao/builtin/plugin/v5.(*backend).HandleRequest(0xc0010526c0, {0x479e428, 0xc0031a7f20}, 0xc0037a4fc0)
```

This was introduced by me in d52d307. Signed-off-by: Alexander Scheel <alexander.m.scheel@gmail.com>
Force-pushed from e6f913d to 3a7860f
@Gabrielopesantos started this one in #136 (thank you!!!), but I got a little excited with the tests, so I'll take over from here... :-)
This moves all of Transit and the `keysutil.LockManager.GetPolicy(...)` helper to a consistent pattern: acquire the lock and defer the unlock, or, if given an implicit lock we don't need, unlock it. This imposes a slight performance penalty on mixed read/write workloads: by holding the lock longer during templating (in the batch encrypt/decrypt or HMAC cases), we can prevent a write from occurring sooner. This is an OK trade-off for the simpler, easier-to-audit locking semantics.
Notably, this catches a panic caused by concurrent access through an unlocked pointer: when I introduced d52d307, I had forgotten to adjust the locking mechanism for `keyPolicyWrite`. Previously, it didn't need any values in the returned policy and so could unlock in the one case it held the lock. After that commit, it read that policy, which (when caching is in use and rotation is occurring) could cause a concurrent read/write to the same map (see the panic above).