runtime: mallocgc does not seem to need to call publicationBarrier when allocating noscan objects #63640
Comments
I think you might be right; given the comments and the CL description, the publication barrier for noscan objects was only there to protect the heap bitmap.
CC @golang/runtime
I also think you might be right, but such changes are risky. We could try removing it for noscan objects early in the next development cycle and watch CI closely to see if anything breaks.
The memory model no longer allows out-of-thin-air values, even in racy programs. This means there needs to be a publication barrier somewhere between when the object is initialized to 0 and when it is returned (published) by the GC. When the memory model was weaker, we were not concerned with racy programs seeing out-of-thin-air values except when it broke the GC. I suppose the zeroing, and therefore the publication barrier, could be done eagerly, at least for noscan. There are other reasons to do eager zeroing, and any numbers showing that eager zeroing is a performance win would be interesting.
@RLH I think I follow, but I'd like to confirm. Is this the kind of situation you're describing? (1) Thread 1 on CPU 1 allocates an object at address X without a publication barrier. (2) Thread 1 publishes the address X where other goroutines can see it. (3) Thread 2 on CPU 2 observes the address and dereferences it. (4) We expect thread 2 to see zeroed memory, but if CPU 2 happens to get a stale hit in its cache at that address, then it might not appear to be zero. There is a race in this scenario, but one could imagine a program that's OK with the race, and it would be surprising to be able to observe the memory at X as not zeroed.
GoTheLanguage (GoLan) and GoTheImplementation (GoImp) shouldn't be conflated. But yes, if Thread 1 (aka goroutine 1) is allowed to reorder the initialization stores of X and the store of the address of X, then racy goroutines may (will) see out-of-thin-air values from previously collected objects that were allocated at that address. The compiler, the runtime, and the HW must all conspire so that GoImp provides GoLan's memory model. See 1 for more discussion.
Sure, just trying to wrap my head around a concrete example. Thanks for confirming! I agree that a simple change to remove the publication barrier for noscan objects would be a violation of the memory model. I reread https://go.dev/ref/mem and I believe it's clear that this sort of thing is explicitly disallowed by the following line, which discusses the rules around racy programs: "Additionally, observation of acausal and 'out of thin air' writes is disallowed." Apologies for the too-quick conclusion; I had only considered non-racy programs earlier.
We generally haven't been thinking about eager zeroing lately, just because it's hard to decide when the right time to do it is. We could theoretically drop the barrier in cases where the memory is already zeroed (for instance, by the page allocator, or by the OS) or when the caller explicitly asks for non-zeroed memory, but I wonder if those cases are popular enough to actually make a difference. Also, special-casing the publication barrier at all is going to introduce more fragility to the allocator. What do other allocators do on weakly ordered platforms?
Oh, another question I have is whether atomics on @wbyao's platform are kernel-implemented (as they can be on 32-bit ARM).
Thanks for your reply. There is some information I did not describe correctly: my environment is arm64, not arm.
Thanks for the clarification! My point about kernel-implemented atomics is moot then.
That's a good point, I think you're right. There's no barrier in memclrNoHeapPointersChunked either. I believe the right thing to do here is to add a second publication barrier. (I'm trying to convince myself that we can just move it, but I keep going back and forth. This case is kind of special because the GC will observe that the span for this large object is noscan, and the allocation of that span has its own publication barrier for the GC.) It should probably be OK performance-wise: the cost of the zeroing on this path should dwarf the second barrier, so it shouldn't impact performance too much.
I think it's a nice fix.

```go
if (needzero && span.needzero != 0 && !delayedZeroing) || (gcphase != _GCoff && !noscan) {
	publicationBarrier()
}
```
The 2011 Java Memory Model implementation cookbook is out of date but is still worth the read, and there are links to more recent papers at the top. The original publication-barriers discussion is in a section of this 23-year-old paper, in the context of a store release / load acquire machine.
Not to kibitz too much, but recent versions of ARM, including ARM64, have store release and load acquire instructions, and a store release is sufficient for a publication barrier. There is no need for a load acquire by the consumer, since the load will be dependent on seeing the pointer and so will be program ordered. This should be faster than the full barrier now being used. As an aside, I (mis)spent a chunk of my youth at Intel working on a new load acquire / store release architecture, and there was a lot of pain moving from TSO. It was not helped by the fact that the early micro-architecture was quietly TSO while later micro-architectures were weaker, and some binaries broke. So if Go is going to switch to the weaker model, then the sooner the better, so any pain will be lessened. Note that I am only talking about pain in racy programs.
@RLH, I also saw a discussion about store release in #9984, but it didn't explain why a store/store barrier at the end of mallocgc can't solve that problem. Is @aclements's analysis in #9984 correct? I'm a little confused. Also, if relevant and possible, can someone explain whether it is possible to avoid protecting the heap bitmap with a memory barrier outside of GC periods?
To be clear, we're not currently using a full barrier.
I'm not 100% positive, but I think you're right that if the GC isn't active, writes to the heap bitmap shouldn't require any synchronization. The mark phase starts by stopping the world, which will be enough to synchronize all writes that occur prior to STW. Even if we were to switch to a ragged barrier, that would still be sufficient to synchronize these writes. That said, we're also in the process of switching away from the heap bitmap, which may change the synchronization requirements here, though probably not in any fundamental way. |
I tried a litmus test for this scenario.
This litmus test is a simulation of the mallocgc thread and the GC thread. The result is that X3 can only be zero in thread 2.
So it seems it's OK to use store release on aarch64. The only problem is how to identify the "store new object to heap" sites. Maybe SSA rules can match some cases, but for the cases where SSA rules can't work, we should still conservatively emit a DMB.
I removed the |
I'm doing some performance analysis of go1.20 on the arm64 platform, and I found that the DMB instruction in mallocgc consumes a lot of time.

I found some CLs:

- In go1.5, `publicationBarrier` was added when allocating scan objects.
- In go1.7, `publicationBarrier` was added when allocating noscan objects because of `heapBitsSetTypeNoScan`.
- `publicationBarrier` is still called when allocating noscan objects after `heapBitsSetTypeNoScan` was deleted.

I tried to find the reason why `publicationBarrier` is called when allocating noscan objects, but I couldn't find it. From the comments in the source code: `publicationBarrier` in `mallocgc` ensures that the stores above that initialize x to type-safe memory and set the heap bits occur before the caller can make x observable to the GC; `publicationBarrier` in `allocSpan` makes sure the newly allocated span will be observed by the GC before pointers into the span are published.

My question is: the GC doesn't scan noscan objects, so why is `publicationBarrier` required when allocating noscan objects in `mallocgc`? Thank you in advance for your help.