-
-
Notifications
You must be signed in to change notification settings - Fork 10k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tsan data race between sa_doall and ossl_sa_set #24672
Comments
Probably some solution similar to the approach I came up with in that PR is the way ahead (but that PR was using an RCU lock which has not been adopted in master (yet)). |
Something I've found very helpful in BoringSSL is to run with TSan in CI and then write tests that specifically exercise subtle APIs intended to be used across threads. We have a lot less shared mutable state than OpenSSL (less complex and more performant; see the various 3.x perf regressions), so there's less of this sort of thing in the first place, but I think that strategy would apply here too. It might help you all avoid these kinds of issues from happening in the first place. |
We do run tsan in CI, and the threadstest is explicitly written to find these kind of issues. But that test is focused on libcrypto. We should probably extend it to do some libssl testing. |
Ah yeah, BoringSSL has some thread tests for TLS session resumption, which we have definitely found valuable. Although the race itself seems to be in libcrypto, so it seems to there may be some TSan testing gaps in OpenSSL on the libcrypto side too. |
Just to say this out loud, with the exception of ossl_free_leaves, the SA table is really pretty close to being able to be lock free. If ossl_sa_set were modified to use the new CRYPTO_atomic_store api when adding new leaves and to the values themselves, locking around the data structure could be eliminated. The atomic op may be a performance hit, but if we could remove the surrounding locks, we could claw some of that back, and it would resolve the tsan race above. |
Except I doubt this is really true in this case. |
This is where some of the discussion in the other bugs about unnecessary shared mutability comes in. A more straightforward way to design this would simply have been:
This is a pretty general lesson about threaded systems. Shared things should be immutable. Shared, mutable things are an endless source of complexity, synchronization problems, and thread contention. This is why I flagged issues like #23369 as they stand in the way of you all fixing this design problem. Of course, the immediate issue is a threading problem and the immediate fix is that you all should lock the mutable state that you currently keep mutable. That will likely make things even slower, but the performance problems are just part of the OpenSSL 3.x architecture. To fix those, you have to fix the architecture. |
Hmmm... thinking loud - perhaps we could disallow concurrent provider load and use in a single library context at least in 4.0. I have no idea how this could be a reasonable operation of any application anyway as that entails to randomly failing operation if a provider is not yet loaded, or randomly changing the provider which will perform the operation, etc. The bigger problem will be the no-cache flag for queries if we want to instantiate all the provider operations on load - we would have to deem it unsupported/ignored basically. Also I am not sure how expensive the initial load of providers like default or a general pkcs11 provider will be if there are hundreds of operations implemented - this could be actually prohibitively expensive for simple apps that use just a few operations. |
The original design didn't call for any caching at all. I don't think we need to concern ourselves over maintaining the no-cache support, it would be nice to keep but not essential IMO. I do agree that we shouldn't have made providers anything like as dynamic as they are. |
no-cache is a misfeature anyway IMO. Ignoring it seems ok to me. |
It's also great for testing. |
Based on above discussion, is it right that no straightforward fix exists? |
No, the fix would not be overly complicated. The discussion is only partially related. |
Theres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. However, we can't lock in do_all, as the call stack in several cases later attempts to take the write lock. The choices to fix it are I think: 1) add an argument to indicate to ossl_method_store_do_all weather to take the read or write lock when doing iterations, and add an is_locked api to the ossl_property_[read|write] lock family so that subsequent callers can determine if they need to take a lock or not 2) Clone the algs sparse array in ossl_method_store_do_all and use the clone to iterate with no lock held, ensuring that updates to the parent copy of the sparse array are left untoucheTheres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. However, we can't lock in do_all, as the call stack in several cases later attempts to take the write lock. The choices to fix it are I think: 1) add an argument to indicate to ossl_method_store_do_all weather to take the read or write lock when doing iterations, and add an is_locked api to the ossl_property_[read|write] lock family so that subsequent callers can determine if they need to take a lock or not 2) Clone the algs sparse array in ossl_method_store_do_all and use the clone to iterate with no lock held, ensuring that updates to the parent copy of the sparse array are left untouched during the iteration I think method (2), while being a bit more expensive, is probably the far less invasive way to go here Fixes openssl#24672
read lock store on ossl_method_store_do_all Theres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. However, we can't lock in do_all, as the call stack in several cases later attempts to take the write lock. The choices to fix it are I think: 1) add an argument to indicate to ossl_method_store_do_all weather to take the read or write lock when doing iterations, and add an is_locked api to the ossl_property_[read|write] lock family so that subsequent callers can determine if they need to take a lock or not 2) Clone the algs sparse array in ossl_method_store_do_all and use the clone to iterate with no lock held, ensuring that updates to the parent copy of the sparse array are left untoucheTheres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. I think method (2), while being a bit more expensive, is probably the far less invasive way to go here Fixes openssl#24672
This is the 1st commit message: read lock store on ossl_method_store_do_all Theres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. However, we can't lock in do_all, as the call stack in several cases later attempts to take the write lock. The choices to fix it are I think: 1) add an argument to indicate to ossl_method_store_do_all weather to take the read or write lock when doing iterations, and add an is_locked api to the ossl_property_[read|write] lock family so that subsequent callers can determine if they need to take a lock or not 2) Clone the algs sparse array in ossl_method_store_do_all and use the clone to iterate with no lock held, ensuring that updates to the parent copy of the sparse array are left untoucheTheres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. However, we can't lock in do_all, as the call stack in several cases later attempts to take the write lock. The choices to fix it are I think: 1) add an argument to indicate to ossl_method_store_do_all weather to take the read or write lock when doing iterations, and add an is_locked api to the ossl_property_[read|write] lock family so that subsequent callers can determine if they need to take a lock or not 2) Clone the algs sparse array in ossl_method_store_do_all and use the clone to iterate with no lock held, ensuring that updates to the parent copy of the sparse array are left untouched during the iteration I think method (2), while being a bit more expensive, is probably the far less invasive way to go here Fixes openssl#24672 This is the commit message #2: amend! read lock store on ossl_method_store_do_all read lock store on ossl_method_store_do_all Theres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. However, we can't lock in do_all, as the call stack in several cases later attempts to take the write lock. The choices to fix it are I think: 1) add an argument to indicate to ossl_method_store_do_all weather to take the read or write lock when doing iterations, and add an is_locked api to the ossl_property_[read|write] lock family so that subsequent callers can determine if they need to take a lock or not 2) Clone the algs sparse array in ossl_method_store_do_all and use the clone to iterate with no lock held, ensuring that updates to the parent copy of the sparse array are left untoucheTheres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. I think method (2), while being a bit more expensive, is probably the far less invasive way to go here Fixes openssl#24672 This is the commit message #3: fixup! amend! read lock store on ossl_method_store_do_all
This is a combination of 3 commits. This is the 1st commit message: read lock store on ossl_method_store_do_all Theres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. However, we can't lock in do_all, as the call stack in several cases later attempts to take the write lock. The choices to fix it are I think: 1) add an argument to indicate to ossl_method_store_do_all weather to take the read or write lock when doing iterations, and add an is_locked api to the ossl_property_[read|write] lock family so that subsequent callers can determine if they need to take a lock or not 2) Clone the algs sparse array in ossl_method_store_do_all and use the clone to iterate with no lock held, ensuring that updates to the parent copy of the sparse array are left untoucheTheres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. However, we can't lock in do_all, as the call stack in several cases later attempts to take the write lock. The choices to fix it are I think: 1) add an argument to indicate to ossl_method_store_do_all weather to take the read or write lock when doing iterations, and add an is_locked api to the ossl_property_[read|write] lock family so that subsequent callers can determine if they need to take a lock or not 2) Clone the algs sparse array in ossl_method_store_do_all and use the clone to iterate with no lock held, ensuring that updates to the parent copy of the sparse array are left untouched during the iteration I think method (2), while being a bit more expensive, is probably the far less invasive way to go here Fixes openssl#24672 This is the commit message #2: amend! read lock store on ossl_method_store_do_all read lock store on ossl_method_store_do_all Theres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. However, we can't lock in do_all, as the call stack in several cases later attempts to take the write lock. The choices to fix it are I think: 1) add an argument to indicate to ossl_method_store_do_all weather to take the read or write lock when doing iterations, and add an is_locked api to the ossl_property_[read|write] lock family so that subsequent callers can determine if they need to take a lock or not 2) Clone the algs sparse array in ossl_method_store_do_all and use the clone to iterate with no lock held, ensuring that updates to the parent copy of the sparse array are left untoucheTheres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. I think method (2), while being a bit more expensive, is probably the far less invasive way to go here Fixes openssl#24672 This is the commit message #3: fixup! amend! read lock store on ossl_method_store_do_all
I tried to reproduce the issue locally (it happens in one of our integration tests, specifically In any case, I pushed your fix (thanks!) to our OpenSSL fork where it will be subject to our test suite (--> ClickHouse/ClickHouse#66064). We'll need to observe |
Theres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. However, we can't lock in do_all, as the call stack in several cases later attempts to take the write lock. The choices to fix it are I think: 1) add an argument to indicate to ossl_method_store_do_all weather to take the read or write lock when doing iterations, and add an is_locked api to the ossl_property_[read|write] lock family so that subsequent callers can determine if they need to take a lock or not 2) Clone the algs sparse array in ossl_method_store_do_all and use the clone to iterate with no lock held, ensuring that updates to the parent copy of the sparse array are left untoucheTheres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. I think method (2), while being a bit more expensive, is probably the far less invasive way to go here Fixes #24672 Reviewed-by: Paul Dale <ppzgs1@gmail.com> Reviewed-by: Matt Caswell <matt@openssl.org> Reviewed-by: Tomas Mraz <tomas@openssl.org> (Merged from #24782) (cherry picked from commit d8def79)
Theres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. However, we can't lock in do_all, as the call stack in several cases later attempts to take the write lock. The choices to fix it are I think: 1) add an argument to indicate to ossl_method_store_do_all weather to take the read or write lock when doing iterations, and add an is_locked api to the ossl_property_[read|write] lock family so that subsequent callers can determine if they need to take a lock or not 2) Clone the algs sparse array in ossl_method_store_do_all and use the clone to iterate with no lock held, ensuring that updates to the parent copy of the sparse array are left untoucheTheres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. I think method (2), while being a bit more expensive, is probably the far less invasive way to go here Fixes #24672 Reviewed-by: Paul Dale <ppzgs1@gmail.com> Reviewed-by: Matt Caswell <matt@openssl.org> Reviewed-by: Tomas Mraz <tomas@openssl.org> (Merged from #24782) (cherry picked from commit d8def79)
Theres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. However, we can't lock in do_all, as the call stack in several cases later attempts to take the write lock. The choices to fix it are I think: 1) add an argument to indicate to ossl_method_store_do_all weather to take the read or write lock when doing iterations, and add an is_locked api to the ossl_property_[read|write] lock family so that subsequent callers can determine if they need to take a lock or not 2) Clone the algs sparse array in ossl_method_store_do_all and use the clone to iterate with no lock held, ensuring that updates to the parent copy of the sparse array are left untoucheTheres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. I think method (2), while being a bit more expensive, is probably the far less invasive way to go here Fixes #24672 Reviewed-by: Paul Dale <ppzgs1@gmail.com> Reviewed-by: Matt Caswell <matt@openssl.org> Reviewed-by: Tomas Mraz <tomas@openssl.org> (Merged from #24782) (cherry picked from commit d8def79)
Theres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. However, we can't lock in do_all, as the call stack in several cases later attempts to take the write lock. The choices to fix it are I think: 1) add an argument to indicate to ossl_method_store_do_all weather to take the read or write lock when doing iterations, and add an is_locked api to the ossl_property_[read|write] lock family so that subsequent callers can determine if they need to take a lock or not 2) Clone the algs sparse array in ossl_method_store_do_all and use the clone to iterate with no lock held, ensuring that updates to the parent copy of the sparse array are left untoucheTheres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. I think method (2), while being a bit more expensive, is probably the far less invasive way to go here Fixes #24672 Reviewed-by: Paul Dale <ppzgs1@gmail.com> Reviewed-by: Matt Caswell <matt@openssl.org> Reviewed-by: Tomas Mraz <tomas@openssl.org> (Merged from #24782) (cherry picked from commit d8def79)
read lock store on ossl_method_store_do_all Theres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. However, we can't lock in do_all, as the call stack in several cases later attempts to take the write lock. The choices to fix it are I think: 1) add an argument to indicate to ossl_method_store_do_all weather to take the read or write lock when doing iterations, and add an is_locked api to the ossl_property_[read|write] lock family so that subsequent callers can determine if they need to take a lock or not 2) Clone the algs sparse array in ossl_method_store_do_all and use the clone to iterate with no lock held, ensuring that updates to the parent copy of the sparse array are left untoucheTheres a data race between ossl_method_store_insert and ossl_method_store_do_all, as the latter doesn't take the property lock before iterating. I think method (2), while being a bit more expensive, is probably the far less invasive way to go here Fixes openssl#24672 Reviewed-by: Paul Dale <ppzgs1@gmail.com> Reviewed-by: Matt Caswell <matt@openssl.org> Reviewed-by: Tomas Mraz <tomas@openssl.org> (Merged from openssl#24782) (cherry picked from commit d8def79)
We (ClickHouse, an open-source analytical database) recently migrated from boringssl to OpenSSL 3.2 (ClickHouse/ClickHouse#59870).
Many of our tests are executed with *sanitizer instrumentation (thread, memory, address). One test checks the MySQL connector of ClickHouse and it fails with a data race detected by thread sanitizer in OpenSSL.
Here is the downstream issue report: ClickHouse/ClickHouse#64239. Clicking the first link and "integration_run_parallel1_0.log" brings up the detailed report: https://s3.amazonaws.com/clickhouse-test-reports/64199/96ebaa17d33a059d8da6a48c2fffdd8161e83238/integration_tests__tsan__[4_6]//home/ubuntu/actions-runner/_work/_temp/test/output_dir/integration_run_parallel1_0.log I also included it below for reference.
We are using this exact OpenSSL branch: https://github.com/ClickHouse/openssl/tree/ClickHouse/openssl-3.2.1
The issue looks similar to #19326 and #21527 (but I am not really an OpenSSL expert).
The text was updated successfully, but these errors were encountered: