Skip to content

[Core] Fix completion queue shutdown race on weak memory models (ARM)#41510

Closed
zarinn3pal wants to merge 2 commits into
grpc:masterfrom
zarinn3pal:fix/cc-queue-shutdown
Closed

[Core] Fix completion queue shutdown race on weak memory models (ARM)#41510
zarinn3pal wants to merge 2 commits into
grpc:masterfrom
zarinn3pal:fix/cc-queue-shutdown

Conversation

@zarinn3pal
Copy link
Copy Markdown
Contributor

On ARM64, server shutdown could hang for 20+ minutes due to a memory visibility issue in the C-core completion queue. The shutdown_called flag lacks memory barriers, causing blocked threads to never wake up on ARM's weak memory model.

A workaround fix was created for ruby that sent a dummy RPC before shutdown to unblock the completion queue from the I/O side. 41223.

This PR addresses the issue in the core; such that all wrapped languages can reap the benefit; as well as the root cause is addressed. Converted the shutdown_called flag from bool to std::atomic<bool> in all internal completion queue data structures. This guarantees that the shutdown state transition is atomically visible across threads, preventing race conditions and ensuring the completion queue drains and shuts down correctly on all architectures.

The PR addresses the issue skipped in 40770

@emil10001
Copy link
Copy Markdown
Member

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly identifies a data race on the shutdown_called flag and addresses it by changing the type to std::atomic<bool>. This is a good step towards fixing the shutdown hang on weak memory model architectures like ARM. However, the use of std::memory_order_relaxed for all atomic operations is insufficient as it does not create the necessary memory barriers to guarantee visibility across threads. I've left several comments suggesting the use of std::memory_order_release for stores and std::memory_order_acquire for loads to ensure the shutdown signal is correctly propagated, making the fix robust and correct.

Comment thread src/core/lib/surface/completion_queue.cc Outdated
Comment thread src/core/lib/surface/completion_queue.cc Outdated
Comment thread src/core/lib/surface/completion_queue.cc Outdated
Comment thread src/core/lib/surface/completion_queue.cc Outdated
Comment thread src/core/lib/surface/completion_queue.cc Outdated
Comment thread src/core/lib/surface/completion_queue.cc Outdated
@zarinn3pal
Copy link
Copy Markdown
Contributor Author

/gemini review

@zarinn3pal zarinn3pal added the release notes: yes Indicates if PR needs to be in release notes label Feb 11, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly addresses a memory visibility issue on weak memory model architectures like ARM by changing the shutdown_called flag from a bool to a std::atomic<bool> across various completion queue data structures. The use of appropriate memory ordering (acquire for loads and release for stores) ensures proper synchronization between threads, preventing the shutdown hangs that were observed. The changes are consistent and well-commented. I have one minor suggestion to improve the consistency of the comments.

Comment thread src/core/lib/surface/completion_queue.cc
Copy link
Copy Markdown
Member

@asheshvidyut asheshvidyut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@asheshvidyut
Copy link
Copy Markdown
Member

Please check why the CI is failing.

@murgatroid99
Copy link
Copy Markdown
Member

Can you merge in the master branch? I'd like to get some up to date test results with all of the test changes we've made in the fixit, and the kokoro labels don't quite seem to be doing it.

@zarinn3pal zarinn3pal force-pushed the fix/cc-queue-shutdown branch from 2370441 to 5e23512 Compare April 24, 2026 11:02
@murgatroid99
Copy link
Copy Markdown
Member

Thank you, that looks better

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants