Tsan data race between sa_doall and ossl_sa_set #24672

Closed
Tracked by #596
rschu1ze opened this issue Jun 18, 2024 · 16 comments
Labels
branch: master, branch: 3.0, branch: 3.1, branch: 3.2, branch: 3.3, severity: important, triaged: bug

Comments

@rschu1ze
Contributor

We (ClickHouse, an open-source analytical database) recently migrated from boringssl to OpenSSL 3.2 (ClickHouse/ClickHouse#59870).

Many of our tests are executed with *sanitizer instrumentation (thread, memory, address). One test checks the MySQL connector of ClickHouse and it fails with a data race detected by thread sanitizer in OpenSSL.

Here is the downstream issue report: ClickHouse/ClickHouse#64239. Clicking the first link and "integration_run_parallel1_0.log" brings up the detailed report at https://s3.amazonaws.com/clickhouse-test-reports/64199/96ebaa17d33a059d8da6a48c2fffdd8161e83238/integration_tests__tsan__[4_6]//home/ubuntu/actions-runner/_work/_temp/test/output_dir/integration_run_parallel1_0.log (also included below for reference).

We are using this exact OpenSSL branch: https://github.com/ClickHouse/openssl/tree/ClickHouse/openssl-3.2.1

The issue looks similar to #19326 and #21527 (but I am not really an OpenSSL expert).

E           Exception: Sanitizer assert found for instance ==================
E           WARNING: ThreadSanitizer: data race (pid=3978)
E             Read of size 8 at 0x72200015c5a0 by thread T688 (mutexes: write M0):
E               #0 sa_doall build_docker/./contrib/openssl/crypto/sparse_array.c:86:30 (clickhouse+0x200cc67b) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #1 ossl_sa_doall_arg build_docker/./contrib/openssl/crypto/sparse_array.c:148:9 (clickhouse+0x200cc67b)
E               #2 ossl_sa_ALGORITHM_doall_arg build_docker/./contrib/openssl/crypto/property/property.c:97:1 (clickhouse+0x20098171) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #3 ossl_method_store_do_all build_docker/./contrib/openssl/crypto/property/property.c:490:9 (clickhouse+0x20098171)
E               #4 evp_generic_do_all build_docker/./contrib/openssl/crypto/evp/evp_fetch.c:621:5 (clickhouse+0x20020d5c) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #5 EVP_KEYMGMT_do_all_provided build_docker/./contrib/openssl/crypto/evp/keymgmt_meth.c:298:5 (clickhouse+0x2002ce87) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #6 ossl_decoder_ctx_setup_for_pkey build_docker/./contrib/openssl/crypto/encode_decode/decoder_pkey.c:441:5 (clickhouse+0x1fff9905) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #7 OSSL_DECODER_CTX_new_for_pkey build_docker/./contrib/openssl/crypto/encode_decode/decoder_pkey.c:803:16 (clickhouse+0x1fff9905)
E               #8 x509_pubkey_ex_d2i_ex build_docker/./contrib/openssl/crypto/x509/x_pubkey.c:208:14 (clickhouse+0x2010f534) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #9 asn1_item_embed_d2i build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:262:20 (clickhouse+0x1ff6ad8d) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #10 asn1_template_noexp_d2i build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:682:15 (clickhouse+0x1ff6c971) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #11 asn1_template_ex_d2i build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:558:16 (clickhouse+0x1ff6b83d) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #12 asn1_item_embed_d2i build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:422:19 (clickhouse+0x1ff6b209) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #13 asn1_template_noexp_d2i build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:682:15 (clickhouse+0x1ff6c971) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #14 asn1_template_ex_d2i build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:558:16 (clickhouse+0x1ff6b83d) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #15 asn1_item_embed_d2i build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:422:19 (clickhouse+0x1ff6b209) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #16 asn1_item_ex_d2i_intern build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:118:10 (clickhouse+0x1ff6a9ab) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #17 ASN1_item_d2i_ex build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:144:9 (clickhouse+0x1ff6a9ab)
E               #18 ASN1_item_d2i build_docker/./contrib/openssl/crypto/asn1/tasn_dec.c:154:12 (clickhouse+0x1ff6a9ab)
E               #19 d2i_X509 build_docker/./contrib/openssl/crypto/x509/x_x509.c:138:1 (clickhouse+0x2010f670) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #20 tls_process_server_certificate build_docker/./contrib/openssl/ssl/statem/statem_clnt.c:2006:13 (clickhouse+0x1ff456b9) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #21 ossl_statem_client_process_message build_docker/./contrib/openssl/ssl/statem/statem_clnt.c:1100:16 (clickhouse+0x1ff4411f) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #22 read_state_machine build_docker/./contrib/openssl/ssl/statem/statem.c:684:19 (clickhouse+0x1ff3ff07) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #23 state_machine build_docker/./contrib/openssl/ssl/statem/statem.c:478:21 (clickhouse+0x1ff3ff07)
E               #24 ossl_statem_connect build_docker/./contrib/openssl/ssl/statem/statem.c:297:12 (clickhouse+0x1ff3f0ee) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #25 SSL_do_handshake build_docker/./contrib/openssl/ssl/ssl_lib.c:4746:19 (clickhouse+0x1fec6701) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #26 SSL_connect build_docker/./contrib/openssl/ssl/ssl_lib.c:2208:12 (clickhouse+0x1fec6813) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #27 ma_tls_connect build_docker/./contrib/mariadb-connector-c/libmariadb/secure/openssl.c:627:30 (clickhouse+0x1d7c95e4) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
[...]


E             Previous write of size 8 at 0x72200015c5a0 by thread T678 (mutexes: write M1, write M2, write M3):
E               #0 ossl_sa_set build_docker/./contrib/openssl/crypto/sparse_array.c:214:8 (clickhouse+0x200cca9d) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #1 ossl_sa_ALGORITHM_set build_docker/./contrib/openssl/crypto/property/property.c:97:1 (clickhouse+0x20097ce0) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #2 ossl_method_store_insert build_docker/./contrib/openssl/crypto/property/property.c:286:12 (clickhouse+0x20097ce0)
E               #3 ossl_method_store_add build_docker/./contrib/openssl/crypto/property/property.c:344:14 (clickhouse+0x20097ce0)
E               #4 put_evp_method_in_store build_docker/./contrib/openssl/crypto/evp/evp_fetch.c:191:12 (clickhouse+0x200212ab) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #5 ossl_method_construct_this build_docker/./contrib/openssl/crypto/core_fetch.c:123:5 (clickhouse+0x1fff8be4) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #6 algorithm_do_map build_docker/./contrib/openssl/crypto/core_algorithm.c:77:13 (clickhouse+0x1fff8648) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #7 algorithm_do_this build_docker/./contrib/openssl/crypto/core_algorithm.c:122:15 (clickhouse+0x1fff8648)
E               #8 ossl_provider_doall_activated build_docker/./contrib/openssl/crypto/provider_core.c:1483:14 (clickhouse+0x2009fa63) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #9 ossl_algorithm_do_all build_docker/./contrib/openssl/crypto/core_algorithm.c:162:9 (clickhouse+0x1fff843b) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #10 ossl_method_construct build_docker/./contrib/openssl/crypto/core_fetch.c:153:5 (clickhouse+0x1fff88ce) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #11 inner_evp_generic_fetch build_docker/./contrib/openssl/crypto/evp/evp_fetch.c:313:23 (clickhouse+0x2002035e) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #12 evp_generic_fetch build_docker/./contrib/openssl/crypto/evp/evp_fetch.c:378:14 (clickhouse+0x20020082) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #13 EVP_KDF_fetch build_docker/./contrib/openssl/crypto/evp/kdf_meth.c:162:12 (clickhouse+0x200293c7) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #14 tls13_generate_secret build_docker/./contrib/openssl/ssl/tls13_enc.c:181:11 (clickhouse+0x1fee3c67) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #15 ssl_gensecret build_docker/./contrib/openssl/ssl/s3_lib.c:4854:18 (clickhouse+0x1feb76c5) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #16 ssl_derive build_docker/./contrib/openssl/ssl/s3_lib.c:4907:14 (clickhouse+0x1feb76c5)
E               #17 tls_parse_stoc_key_share build_docker/./contrib/openssl/ssl/statem/extensions_clnt.c:1885:13 (clickhouse+0x1ff35a7f) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #18 tls_parse_extension build_docker/./contrib/openssl/ssl/statem/extensions.c:765:20 (clickhouse+0x1ff2de8e) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #19 tls_parse_all_extensions build_docker/./contrib/openssl/ssl/statem/extensions.c:799:14 (clickhouse+0x1ff2df88) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #20 tls_process_server_hello build_docker/./contrib/openssl/ssl/statem/statem_clnt.c:1744:10 (clickhouse+0x1ff45088) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #21 ossl_statem_client_process_message build_docker/./contrib/openssl/ssl/statem/statem_clnt.c:1094:16 (clickhouse+0x1ff4412c) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #22 read_state_machine build_docker/./contrib/openssl/ssl/statem/statem.c:684:19 (clickhouse+0x1ff3ff07) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #23 state_machine build_docker/./contrib/openssl/ssl/statem/statem.c:478:21 (clickhouse+0x1ff3ff07)
E               #24 ossl_statem_connect build_docker/./contrib/openssl/ssl/statem/statem.c:297:12 (clickhouse+0x1ff3f0ee) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #25 SSL_do_handshake build_docker/./contrib/openssl/ssl/ssl_lib.c:4746:19 (clickhouse+0x1fec6701) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #26 SSL_connect build_docker/./contrib/openssl/ssl/ssl_lib.c:2208:12 (clickhouse+0x1fec6813) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
E               #27 ma_tls_connect build_docker/./contrib/mariadb-connector-c/libmariadb/secure/openssl.c:627:30 (clickhouse+0x1d7c95e4) (BuildId: 085b072d72b502023a507882c87a450e95b212d9)
[...]
@mattcaswell
Member

This appears to be a bug in ossl_method_store_do_all.

It seems I previously encountered this when developing #24344 (in particular see 5d492e0) - but since that PR is currently abandoned the bug I discovered along the way got forgotten about.

@mattcaswell added the triaged: bug label and removed the issue: bug report label Jun 18, 2024
@mattcaswell
Member

Probably some solution similar to the approach I came up with in that PR is the way ahead (but that PR was using an RCU lock which has not been adopted in master (yet)).

@mattcaswell added the severity: important label Jun 18, 2024
@davidben
Contributor

Something I've found very helpful in BoringSSL is to run with TSan in CI and then write tests that specifically exercise subtle APIs intended to be used across threads. We have a lot less shared mutable state than OpenSSL (less complex and more performant; see the various 3.x perf regressions), so there's less of this sort of thing in the first place, but I think that strategy would apply here too. It might help you all prevent these kinds of issues from happening in the first place.

@mattcaswell
Member

We do run TSan in CI, and the threadstest is explicitly written to find these kinds of issues. But that test is focused on libcrypto. We should probably extend it to do some libssl testing.

@davidben
Contributor

davidben commented Jun 18, 2024

Ah yeah, BoringSSL has some thread tests for TLS session resumption, which we have definitely found valuable. Although the race itself seems to be in libcrypto, so it seems there may be some TSan testing gaps in OpenSSL on the libcrypto side too.

@nhorman
Contributor

nhorman commented Jun 18, 2024

Just to say this out loud: with the exception of ossl_free_leaves, the SA table is really pretty close to being able to be lock free. If ossl_sa_set were modified to use the new CRYPTO_atomic_store API when storing new leaves and the values themselves, locking around the data structure could be eliminated. The atomic op may be a performance hit, but if we could remove the surrounding locks, we could claw some of that back, and it would resolve the TSan race above.
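
A minimal sketch of what atomically publishing a slot could look like, assuming C11 `<stdatomic.h>` in place of OpenSSL's own `CRYPTO_atomic_*` helpers. The `toy_table`, `slot_set`, and `slot_get` names and the fixed slot count are hypothetical and are not taken from sparse_array.c:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TABLE_SLOTS 16 /* hypothetical fixed size, for illustration only */

/* Each slot holds a pointer that readers may load while a writer stores. */
typedef struct {
    _Atomic(void *) slots[TABLE_SLOTS];
} toy_table;

/*
 * Writer: publish a value with release ordering, so that everything written
 * to *value before this call is visible to a reader that observes the pointer.
 */
static int slot_set(toy_table *t, size_t idx, void *value)
{
    assert(t != NULL);
    if (idx >= TABLE_SLOTS)
        return 0;
    atomic_store_explicit(&t->slots[idx], value, memory_order_release);
    return 1;
}

/* Reader: acquire ordering pairs with the release store in slot_set(). */
static void *slot_get(toy_table *t, size_t idx)
{
    assert(t != NULL);
    if (idx >= TABLE_SLOTS)
        return NULL;
    return atomic_load_explicit(&t->slots[idx], memory_order_acquire);
}
```

This only makes individual slot accesses race-free. It does not, by itself, give a caller that walks the whole table a consistent view of every slot for the duration of the walk.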

@mattcaswell
Member

Except I doubt this is really true in this case. ossl_method_store_do_all iterates over all the ALGORITHMs in the store. We really need a consistent set of ALGORITHMs for the entire operation and we don't want to have to handle changes to the sparse array half way through iterating over it.

@davidben
Contributor

davidben commented Jun 19, 2024

This is where some of the discussion in the other bugs about unnecessary shared mutability comes in. A more straightforward way to design this would simply have been:

  1. At the time you load a provider, query all the algorithms and instantiate EVP_MDs, etc., for every one of them. Build an efficient index to map from algorithm names to those EVP_MDs and whatnot.
  2. After all that stuff has been instantiated, keep the entire provider object immutable. The EVP_MDs are fixed, the index is fixed, etc.
  3. Since OpenSSL decided to allow concurrent provider load and use (bad idea), you all are stuck paying for some serialization in the global provider list. However, now that the individual providers are immutable, this synchronization is limited to a single list of O(10) elements. Now techniques like RCU are viable.
  4. Although the provider list itself needs synchronization, your providers themselves are immutable and so they can be queried concurrently without fuss.

This is a pretty general lesson about threaded systems. Shared things should be immutable. Shared, mutable things are an endless source of complexity, synchronization problems, and thread contention. This is why I flagged issues like #23369 as they stand in the way of you all fixing this design problem.

Of course, the immediate issue is a threading problem and the immediate fix is that you all should lock the mutable state that you currently keep mutable. That will likely make things even slower, but the performance problems are just part of the OpenSSL 3.x architecture. To fix those, you have to fix the architecture.
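
The numbered design above can be sketched roughly as follows. This is an illustration under stated assumptions, not OpenSSL code: `toy_method` stands in for an instantiated object such as an EVP_MD, and `toy_provider_load`, `toy_provider_find`, and `toy_provider_free` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an instantiated method such as an EVP_MD. */
typedef struct {
    const char *name;
    int id;
} toy_method;

/*
 * Once built, nothing in this struct is modified again, so any number of
 * threads may query it concurrently without taking a lock.
 */
typedef struct {
    toy_method *methods; /* sorted by name */
    size_t count;
} toy_provider;

static int cmp_method(const void *a, const void *b)
{
    return strcmp(((const toy_method *)a)->name,
                  ((const toy_method *)b)->name);
}

/*
 * Steps 1 and 2: instantiate every method up front, build the index, and
 * hand back a const pointer so callers cannot mutate it afterwards.
 */
static const toy_provider *toy_provider_load(const toy_method *algs, size_t n)
{
    toy_provider *p = malloc(sizeof(*p));

    if (p == NULL)
        return NULL;
    p->methods = malloc((n ? n : 1) * sizeof(*p->methods));
    if (p->methods == NULL) {
        free(p);
        return NULL;
    }
    if (n > 0)
        memcpy(p->methods, algs, n * sizeof(*p->methods));
    qsort(p->methods, n, sizeof(*p->methods), cmp_method);
    p->count = n;
    return p;
}

/* Step 4: read-only lookup via binary search; no synchronization needed. */
static const toy_method *toy_provider_find(const toy_provider *p,
                                           const char *name)
{
    toy_method key = { name, 0 };

    assert(p != NULL && name != NULL);
    return bsearch(&key, p->methods, p->count, sizeof(*p->methods),
                   cmp_method);
}

static void toy_provider_free(const toy_provider *p)
{
    if (p == NULL)
        return;
    free(p->methods);
    free((void *)p);
}
```

Step 3, the synchronized global provider list, is deliberately left out of the sketch; only the per-provider index is shown.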

@t8m
Member

t8m commented Jun 19, 2024

> 3. Since OpenSSL decided to allow concurrent provider load and use (bad idea), you all are stuck paying for some serialization in the global provider list. However, now that the individual providers are immutable, this synchronization is limited to a single list of O(10) elements. Now techniques like RCU are viable.

Hmmm... thinking out loud: perhaps we could disallow concurrent provider load and use in a single library context, at least in 4.0. I have no idea how this could be a reasonable mode of operation for any application anyway, as it entails operations randomly failing if a provider is not yet loaded, or the provider that performs the operation randomly changing, etc.

The bigger problem will be the no-cache flag for queries if we want to instantiate all the provider operations on load - we would have to deem it unsupported/ignored basically.

Also, I am not sure how expensive the initial load of providers like default or a general pkcs11 provider will be if there are hundreds of operations implemented. This could actually be prohibitively expensive for simple apps that use just a few operations.

@t8m added the branch: master, branch: 3.0, branch: 3.1, branch: 3.2, and branch: 3.3 labels Jun 19, 2024
@paulidale
Contributor

The original design didn't call for any caching at all. I don't think we need to concern ourselves over maintaining the no-cache support, it would be nice to keep but not essential IMO.

I do agree that we shouldn't have made providers anything like as dynamic as they are.

@mattcaswell
Member

> The bigger problem will be the no-cache flag for queries if we want to instantiate all the provider operations on load - we would have to deem it unsupported/ignored basically.

no-cache is a misfeature anyway IMO. Ignoring it seems ok to me.

@paulidale
Contributor

no-cache is meant to be a space saving feature.

It's also great for testing.

@rschu1ze
Contributor Author

rschu1ze commented Jul 2, 2024

Based on the above discussion, is it right that no straightforward fix exists?

@t8m
Member

t8m commented Jul 2, 2024

> Based on the above discussion, is it right that no straightforward fix exists?

No, the fix would not be overly complicated. The discussion is only partially related.

nhorman added a commit to nhorman/openssl that referenced this issue Jul 2, 2024
There's a data race between ossl_method_store_insert and
ossl_method_store_do_all, as the latter doesn't take the property lock
before iterating.

However, we can't lock in do_all, as the call stack in several cases
later attempts to take the write lock.

The choices to fix it are, I think:
1) Add an argument to indicate to ossl_method_store_do_all whether to
   take the read or write lock when doing iterations, and add an
   is_locked api to the ossl_property_[read|write] lock family so that
   subsequent callers can determine if they need to take a lock or not

2) Clone the algs sparse array in ossl_method_store_do_all and use the
   clone to iterate with no lock held, ensuring that updates to the
   parent copy of the sparse array are left untouched during the
   iteration

I think method (2), while being a bit more expensive, is probably the
far less invasive way to go here

Fixes openssl#24672
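
A rough sketch of option (2) as described in the commit message above, using a plain pthread mutex and a flat array. This is an assumption-laden illustration, not the code from the PR: `toy_store`, `toy_store_insert`, `toy_store_do_all`, and `toy_reinsert_cb` are hypothetical names, and the real store uses the property lock and a sparse array instead.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define STORE_MAX 32 /* hypothetical capacity, for illustration only */

typedef struct {
    pthread_mutex_t lock; /* stands in for the property lock */
    int algs[STORE_MAX];  /* stands in for the algs sparse array */
    size_t count;
} toy_store;

typedef void (*toy_cb)(toy_store *s, int alg, void *arg);

/* Writer path: always mutates the array with the lock held. */
static int toy_store_insert(toy_store *s, int alg)
{
    int ok = 0;

    pthread_mutex_lock(&s->lock);
    if (s->count < STORE_MAX) {
        s->algs[s->count++] = alg;
        ok = 1;
    }
    pthread_mutex_unlock(&s->lock);
    return ok;
}

/*
 * Iterator: copy the array under the lock, then walk the private copy with
 * the lock released, so the callback may itself call toy_store_insert().
 */
static int toy_store_do_all(toy_store *s, toy_cb cb, void *arg)
{
    int *snap;
    size_t i, n;

    assert(s != NULL && cb != NULL);
    pthread_mutex_lock(&s->lock);
    n = s->count;
    snap = malloc((n ? n : 1) * sizeof(*snap));
    if (snap != NULL)
        memcpy(snap, s->algs, n * sizeof(*snap));
    pthread_mutex_unlock(&s->lock);
    if (snap == NULL)
        return 0;

    for (i = 0; i < n; i++)
        cb(s, snap[i], arg);
    free(snap);
    return 1;
}

/*
 * Example callback that re-enters the store while iteration is in progress.
 * If the lock were held across the loop, this would self-deadlock on a
 * non-recursive mutex.
 */
static void toy_reinsert_cb(toy_store *s, int alg, void *arg)
{
    toy_store_insert(s, alg + 100);
    ++*(int *)arg;
}
```

The trade-off is the one the commit message names: each iteration pays for a copy, and entries inserted while the walk is in progress are not visited by that walk.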
@nhorman
Contributor

nhorman commented Jul 2, 2024

It's not a great fix, but I think it's the best we can do right now without some significant refactoring:
#24782

@rschu1ze can you test with the attached draft PR, and confirm that the issue is resolved for you please?

@nhorman added this to the 3.4.0 milestone Jul 2, 2024
@nhorman self-assigned this Jul 2, 2024
nhorman added a commit to nhorman/openssl that referenced this issue Jul 3, 2024
read lock store on ossl_method_store_do_all

rschu1ze pushed a commit to ClickHouse/openssl that referenced this issue Jul 3, 2024
This is the 1st commit message:

read lock store on ossl_method_store_do_all

This is the commit message #2:

amend! read lock store on ossl_method_store_do_all

This is the commit message #3:

fixup! amend! read lock store on ossl_method_store_do_all
rschu1ze pushed a commit to ClickHouse/openssl that referenced this issue Jul 3, 2024
This is a combination of 3 commits.

This is the 1st commit message:

read lock store on ossl_method_store_do_all

This is the commit message #2:

amend! read lock store on ossl_method_store_do_all

This is the commit message #3:

fixup! amend! read lock store on ossl_method_store_do_all
@rschu1ze
Contributor Author

rschu1ze commented Jul 3, 2024

I tried to reproduce the issue locally (it happens in one of our integration tests, specifically test_mysql_killed_while_insert_8_0) but I did not even manage to get the test to run on my machine 😢. Since the issue happens only sporadically, my hopes of verifying the patch were low anyway.

In any case, I pushed your fix (thanks!) to our OpenSSL fork where it will be subject to our test suite (--> ClickHouse/ClickHouse#66064). We'll need to observe test_mysql_killed_while_insert_8_0 for a while to understand if the fix really helps. I can report back in one or two weeks if it is not urgent.

openssl-machine pushed a commit that referenced this issue Jul 9, 2024
There's a data race between ossl_method_store_insert and
ossl_method_store_do_all, as the latter doesn't take the property lock
before iterating.

However, we can't lock in do_all, as the call stack in several cases
later attempts to take the write lock.

The choices to fix it are, I think:
1) Add an argument to indicate to ossl_method_store_do_all whether to
   take the read or write lock when doing iterations, and add an
   is_locked api to the ossl_property_[read|write] lock family so that
   subsequent callers can determine if they need to take a lock or not

2) Clone the algs sparse array in ossl_method_store_do_all and use the
   clone to iterate with no lock held, ensuring that updates to the
   parent copy of the sparse array are left untouched during the
   iteration

I think method (2), while being a bit more expensive, is probably the
far less invasive way to go here

Fixes #24672

Reviewed-by: Paul Dale <ppzgs1@gmail.com>
Reviewed-by: Matt Caswell <matt@openssl.org>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from #24782)

(cherry picked from commit d8def79)
openssl-machine pushed a commit that referenced this issue Jul 9, 2024
openssl-machine pushed a commit that referenced this issue Jul 9, 2024
openssl-machine pushed a commit that referenced this issue Jul 9, 2024
rschu1ze pushed a commit to ClickHouse/openssl that referenced this issue Jul 14, 2024
read lock store on ossl_method_store_do_all
