When writing to a single component of a uint4, the generated code creates a local copy of the uint4, updates the single component, and then stores the entire uint4 back to memory. If multiple threads write to components of the same uint4 concurrently, this causes a data race: a thread can overwrite another thread's component with the stale value from its own copy.
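To make the lost update concrete, here is a minimal single-threaded C++ sketch that replays one possible interleaving of two threads by hand. It is not the compiler's actual output; `UInt4`, `simulateWholeVectorStore`, and `simulateComponentStore` are hypothetical names introduced for illustration, and the chosen interleaving is an assumption about how two GPU threads could be scheduled.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an HLSL uint4 living in shared memory.
using UInt4 = std::array<std::uint32_t, 4>;

// Replays the load / modify / store-all-lanes pattern described above
// for two "threads" whose loads both happen before either store.
UInt4 simulateWholeVectorStore() {
    UInt4 mem = {0, 0, 0, 0};

    UInt4 t0 = mem;  // thread 0 loads the whole vector
    UInt4 t1 = mem;  // thread 1 loads the whole vector (still all zeros)

    t0[0] = 1;       // thread 0 updates only lane 0 of its copy
    t1[1] = 2;       // thread 1 updates only lane 1 of its copy

    mem = t0;        // thread 0 stores all four lanes
    mem = t1;        // thread 1 stores all four lanes, lane 0 reverts to 0
    return mem;
}

// Replays the expected behavior: each thread stores only its own lane.
UInt4 simulateComponentStore() {
    UInt4 mem = {0, 0, 0, 0};
    mem[0] = 1;      // thread 0 writes lane 0 only
    mem[1] = 2;      // thread 1 writes lane 1 only
    return mem;
}
```

With this interleaving the whole-vector version ends with only lane 1 set, while the per-component version keeps both writes. On real hardware the surviving lane depends on scheduling, so the result is not deterministic.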
I assume the issue is in the codegen for the Out[0][ThreadID.y] accessor when it is used as an assignment target.
Godbolt reproduction: https://godbolt.org/z/na1n9vqvh
This is consistent with the runtime output for the GroupMemoryBarrierWithGroupSync tests, where only one component of the uint4 is written.