[Core] Fix completion queue shutdown race on weak memory models (ARM)#41510
[Core] Fix completion queue shutdown race on weak memory models (ARM)#41510zarinn3pal wants to merge 2 commits into
Conversation
4c15527 to
c96c3b1
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request correctly identifies a data race on the shutdown_called flag and addresses it by changing the type to std::atomic<bool>. This is a good step towards fixing the shutdown hang on weak memory model architectures like ARM. However, the use of std::memory_order_relaxed for all atomic operations is insufficient as it does not create the necessary memory barriers to guarantee visibility across threads. I've left several comments suggesting the use of std::memory_order_release for stores and std::memory_order_acquire for loads to ensure the shutdown signal is correctly propagated, making the fix robust and correct.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request correctly addresses a memory visibility issue on weak memory model architectures like ARM by changing the shutdown_called flag from a bool to a std::atomic<bool> across various completion queue data structures. The use of appropriate memory ordering (acquire for loads and release for stores) ensures proper synchronization between threads, preventing the shutdown hangs that were observed. The changes are consistent and well-commented. I have one minor suggestion to improve the consistency of the comments.
|
Please check why the CI is failing. |
|
Can you merge in the master branch? I'd like to get some up to date test results with all of the test changes we've made in the fixit, and the kokoro labels don't quite seem to be doing it. |
2370441 to
5e23512
Compare
|
Thank you, that looks better |
On ARM64, server shutdown could hang for 20+ minutes due to a memory visibility issue in the C-core completion queue. The shutdown_called flag lacks memory barriers, causing blocked threads to never wake up on ARM's weak memory model.
A workaround fix was created for ruby that sent a dummy RPC before shutdown to unblock the completion queue from the I/O side. 41223.
This PR addresses the issue in the core; such that all wrapped languages can reap the benefit; as well as the root cause is addressed. Converted the
shutdown_calledflag from bool tostd::atomic<bool>in all internal completion queue data structures. This guarantees that the shutdown state transition is atomically visible across threads, preventing race conditions and ensuring the completion queue drains and shuts down correctly on all architectures.The PR addresses the issue skipped in 40770