New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A miscompilation bug in LICMPass (concurrency) #64188
Comments
This is basically the same issue discussed in the review of https://reviews.llvm.org/D153932 ; our escape analysis doesn't properly account for atomic fences. (See also https://reviews.llvm.org/D137707.) I'm still a little surprised to see it actually show up like this. CC @jdoerfert @aeubanks |
The example here doesn't use fences though. Is the example valid as given? If this is just about fences, then it goes straight on the "technically a bug, but we're not going to fix it" pile as far as I'm concerned. |
I didn't mean "fence" instructions specifically; that's just what I was thinking about looking at the patch I mentioned. The example is valid, as far as I can tell. The key construct here is the "relaxed" store of the address ( Note that if the store of the address is not atomic, another barrier would be necessary, and if the store of the address is "release", it naturally enforces its own ordering. So the miscompile is only visible with a "relaxed" store of newly allocated pointer, I think. As a practical matter, it seems relatively unlikely anyone would stumble over this by accident; lockless datastructures that allocate memory would normally either use a "release" store of the pointer, or use some other form of locking. I'm a little uncomfortable assuming nobody will use atomics this way, but the potential performance regressions from restricting escape analysis around calls are also sort of nasty. |
For LICM, we'd need to do a peel of one iteration for this to be handled. FWIW, atomics do not have to show anywhere for this to appear. I'm not sure if the situation is really common enough to show up if we forbid LICM to do this if the pointer of the final store escapes later. |
An ad hoc remedy for this particular example is to move |
This could be made legal by adding a release fence before the address gets leaked, right? So this kind of code motion becomes a trade-off, where the cost of the fence is weighed against the benefit of moving the stores around. |
This isn't really about LICM in my opinion; it's some sort of combination provenance + concurrency issue (alias analysis will tell you that your locals haven't escaped before you pass them to extern functions; but this isn't true in a world where concurrency can cause time travel). Brief example with no loops: https://gcc.godbolt.org/z/4qKs4TYG3 I think the right fix for this is at the language level rather than LLVM though (and that the language should retroactively bless LLVM's behavior); assuming non-escaped locals don't alias random pointers passed to you seems like it must be fundamental. Edit: fixed typo in godbolt link |
The new testcase is interesting because there is no release barrier... which means no other thread can access the variable "local". But the thread that allocated it is fine, I guess? There's a happens-before relationship between the accesses (and the deallocation) because they're on the same thread. I can imagine a language rule somehow trying to address this, but it would be weird. We'd have to divide up memory into "single-threaded" and "multi-threaded" variables. A variable starts out "single-threaded", then becomes "multi-threaded" when its address is stored to memory visible to other threads (or the memory where the address is stored becomes visible to other threads, or the address is converted to an integer). And then we require a release-acquire edge between the store of the address, and any load that accesses the address. Or something along those lines; the key here is that "publishing" the address of a variable becomes a special operation. The current C++ model, by contrast, only requires that there's a release-acquire edge between the allocation/initialization of the variable and any accesses to the variable's value from another thread; the address itself is irrelevant.
This doesn't help; a "release" only has an effect if there's an "acquire" on the other side. |
The thing I was thinking of is applying something like a happens-before rule not just to the accesses, but also to pointers' provenances. So e.g. in the original example, we declare that at thread2's load of p, even though p's value was obtained via a valid happens-before chain, its provenance was not. This sort of captures the notion of publication in a way that meshes with the rest of the memory model. FWIW, the C++ committee concurrency study group will talk about this in the next couple of days and see if anyone has anything clever to say. |
The simplest way to "solve" the issue is just to forbid relaxed loads/stores of pointers: without those, you can't do anything weird (ignoring inttoptr conversions; not sure how we'd handle those). Saying that relaxed loads/stores of pointers don't have provenance is basically the same thing as forbidding them: a pointer without provenance is just an integer. If you want to allow them, you need some way to transfer the provenance alongside the pointer. The notion of "publishing" is a way out of this conflict: it means you need a barrier at exactly one point, the point of first publication. After that, the provenance transfers like you'd normally expect. This minimizes the amount of code made undefined, and keeps existing compiler optimizations working. |
Some (evolving) slides for an upcoming C++ concurrency group presentations are here. So far, I like David's suggestion, as I understand it. Can we paraphrase that as disallowing copying of a pointer unless the lifetime-start of the object it refers to happens-before the pointer is copied? |
The C++ memory model already forbids accessing an object before it's created, or after it's destroyed. The issue here is atomic operations that happen between the object being created and the object's address "escaping". LLVM optimizations assume such atomic operations are irrelevant... but the C++ memory model doesn't attach any special significance to the point where the pointer escapes. So the fix here is to make the point where the pointer escapes special: there has to be a happens-before edge between the escape point, and any use of the address in other threads. |
@davidtgoldblatt isn't yours a completely different issue? It doesn't involve any release/acquire. It seems more like the example from this recent talk. In your example,
Usually the provenance is part of the pointer value itself. So how would this work, where does the gap in the provenance hb-chain come from?
Ah, I was fooled by the
The issue is that the "escape point" is not really a well-defined notion, is it? As compiler analyses get better and better, they may be able to push the "escape point" further and further down program execution. How did you plan to define "escape point"? |
I think my take is that the fundamental issue here is that the presence of concurrency can "move up" (in sequenced-before) the effective time when some thread's private-seeming data has escaped. That's done with load buffering in my example, but can also happen with store reordering (as in the example here). In this view, the release/acquire synchronization in the original example is a detail -- it's just a way to give defined behavior to our ability to detect that early-escape, but it's not really the cause.
I think that I would say something like "at T2.3, the load should return a pointer value lacking provenance, because the store of the pointer value does not happen before the load". This isn't quite workable (because of standalone fences), but something in that direction at least feels plausible to me as a language-level fix. Maybe more succinctly: the root cause of all these issues is that some line of code gets a pointer value whose bits are equal to a valid pointer, but which they shouldn't be allowed to use. That's the same situation that we want provenance rules to solve; you have to come by your pointer values honestly. Use relaxed accesses to pass them to other threads is cheating, in the same way abusing the 1-past-the-end-of-an-array exception to get a pointer that happens to be equal to some other object's address is cheating. |
That makes sense, thanks. Though I do worry it can be fiendishly difficult in practice to reason about this... except ofc by just always using release-acquire when pointers are involved.^^ However, does that capture all cases of this issue? Here is another example, which I got from a colleague (Chung-Kil Hur):
This should be equivalent to using release stores / acquire loads on I guess you could say that at T2.2 there is no happens-before to T1.5 since relaxed doesn't induce HB -- but that would be basically saying fences don't work when doing the standard message passing idiom with pointers. Which seems bad? I suppose that's what you meant when you said your proposal isn't quite workable because of fences. |
The point where the pointer becomes accessible to another thread. This is, as you note, basically impossible for the compiler to compute precisely; it's a dynamic property of the code at runtime. But the compiler doesn't need to precisely compute it; a conservative approximation is sufficient. Standard compiler escape analysis is such an approximation. |
Even at runtime, this would be a very strange notion to introduce in the very definition of what is Defined Behavior and what not. How is it defined? A global scan of all memory transitively reachable through all pointers in a thread's stack? That's a hell of a property to reason about, if I want to justify precisely why my program is correct. |
Not quite sure. :) This sort of property isn't completely unprecedented; proposals for defining inttoptr are sort of similar. |
The proposals for defining inttoptr that I am involved in (mostly on the Rust side) don't look anything like that. ;) |
Yeah, exactly. (I mean, not limited to just fences of course -- any sort of opaque library synchronization could work too). But that at least feels like something we could put some words around, because now we have some synchronization primitive to hang it off of. Maybe introducing "provisional provenance"; it starts acting like empty provenance but can get promoted if certain synchronization conditions are met. Something like: If there are atomic pointer store S of value V and atomic pointer load load L, where L takes its value from S[1], then if S happens before L, L obtains V. If S does not happen before L, then L obtains V', which has the same address as V but a provisional version of V's provenance. On dereference of V', if there is some evaluation P such that the object to which V refers' lifetime start happens before P, L happens before P, and P happens before the dereference, then the dereference acts as though it has V's provenance. Otherwise, it acts as though it has empty provenance. The intuition being, we pretend that the provenance for some storage lives with the storage, being stored to at its lifetime start. At some unspecified point after we load a pointer value, we try to load its provenance from the storage and update our pointer value with it; these are the conditions we'd need to avoid UB in that case. I think that this blesses LLVM's current behavior on the various weirdo test cases without rendering too much sane code UB. [1] Not this exactly -- we probably need something about release sequences here I'm not 100% sure this could really be formalized because of the operational semantics / axiomatic semantics mismatch in provenance and the memory model, but OTOH we already have issues there anyways. |
There's also the "there is some evaluation P" quantification which doesn't seem a good fit for formalization -- it seems to require running hypothetical alternative executions. Can this be stated in terms of just the current execution and its memory model graphs? |
Is it fundamental, though? With your example, neither GCC nor MSVC performs the problematic optimization, though Clang and ICC do. If I change the relaxed store to an opaque function call, then ICC stops performing the optimization, leaving Clang as the odd one out. In other words, other compilers seem to apply "non-escaped locals" optimizations only to locals that never escape, as opposed to identifying when each local escapes and applying the optimizations at points where they haven't escaped yet. (In ICC's case, the relaxed store should be enough to block the optimization, but the fact an opaque call does block it suggests that the relaxed-store case may be a relatively localized bug in ICC, as opposed to proving that ICC follows a general policy of "aliasing can't happen until after the escape", as LLVM currently does. Maybe. ICC is closed source, so I'm only speculating.) Now to speculate even more loosely: If I had to guess what kind of scenario benefits most from this kind of optimization, I'd guess something like this: A long function contains a local that's manipulated many times, but mostly by value. However, at some point a pointer to it is passed to another, opaque function. In reality, the second function only accesses the pointer for the duration of the call (i.e. it could be But even in such a scenario, LLVM's time-based escape analysis only provides a benefit until the first escape. After that, the local must be treated as forever aliased. That is a pretty limited benefit! It doesn't sound so bad to forgo it, especially if other C/C++ compilers already do so. If this were Rust code, then under most circumstances, the compiler should be allowed to treat the local as non-aliased both before and after the call. The specifics are complicated, and the spec-permitted optimizations are mostly unimplemented, but if they were implemented, I believe they would render time-based escape analysis mostly redundant for Rust code. |
This is just an artifact of the C++ standardese weasel wording we use to avoid being precise about where a memory or synchronization operation happens (see e.g. here ) -- not really meant as a quantification across multiple executions. |
I should say, despite the discussion here, I'm not convinced modifying the standard is the right approach. The discussion here indicates that any modification we make would be confusing. And we don't know what the actual performance penalty is for just following the standard as it's currently written.
Yes. The relevant code is llvm::PointerMayBeCapturedBefore.
I don't see how local variables are any better off here. Sending a reference to another thread without a fence would require unsafe code, but it's not clear to me it would be undefined behavior.
You can directly translate the original example to Rust by replacing "malloc" with "Box::new()". |
Do you mean that in full generality, including the load-buffering-like examples? My impression was that everyone thought that those were too weird / harmful to be supported, while most people thought that the "establish release/acquire synchronization through some opaque function call" ones should remain correct per the standard (and that all we disagreed about was which set this example fell into). |
You mean, specifically, the examples that don't involve any fences (#64188 (comment))? I hadn't really thought about it before... but if the standard could change something there, it might be helpful. Specifically, to remove the happens-before relationship between loads/stores on the same thread if the pointer operand for one of the operations comes from another thread. From the perspective of LLVM optimizations, it would be straightforward to make llvm::PointerMayBeCapturedBefore() treat function calls that aren't marked "nosync" as escape points for pointers that eventually escape. That solves the cases involving fences, but not the cases that only involve relaxed load/store. |
In particular, we currently don't track whether a function contains any relaxed load/store ops, so if we need to track that, it would make the implementation a bit more complicated. The performance impact is 99% the same either way, though: the primary impact is treating unknown functions as an escape point for pointers that eventually escape. |
I think yeah, we should should change the standard to make LLVM's current behavior allowed in those cases. I also think, though, that this specific case is more like the ones you describe as not involving any fences than it seems like at first glance. I think the pattern we want to allow is the one you mentioned upthread, discussed in https://reviews.llvm.org/D153932 . In the usages everyone agrees should be allowed, the pattern is: Publishing thread: Consuming thread: In this example, though, the underlying pattern is: Which is the same root issue in some of the unsynchronized, relaxed-only examples we're looking at elsewhere. I'm OK jettisoning it to make alias analysis (in the sense of "your arguments don't alias storage you allocate") more viable.
It gets subtle fast -- I think you need to consider not just function calls, but things that happen after your function returns, so you can't just infer some nosync-variant in a bottom-up call graph traversal (without giving up on sret/noalias optimizations). |
I was thinking that between the allocation of a variable, and the first atomic operation after the allocation, we could treat treat the variable as "non-escaped" relative to any other memory accesses. That said, in the fenceless variants, I guess it's possible to load the address of a variable before the variable is actually allocated. Which would cause problems for alias analysis.
The problem is that it's hard to establish what, exactly, it means to "get a pointer". For the case where the load and the store are on the same thread, I think we can just forbid loading a pointer before you store it. For the cross-thread cases, the definition gets more complicated. And even if we do manage to establish a definition, what have we actually accomplished? The handling required for cases like D153932 still basically requires treating unknown functions as escape points. |
But that would essentially be a volatile access in that case? |
LICMPass of Clang++ (13.0.0+) optimizes the above function into the following (C++ syntax is used for brevity, see godbolt for the IR code):
This is a miscompilation bug. By compiling and running a concurrent program that involves the
bug
function on Armv8 processors Apple M1 and Cavium ThunderX2, we actually observed certain program behaviors forbidden by the original program before the optimization. The actual program we tested can be found here. (The forbidden behavior can be observed couple of times throughout millions~billions of program runs. The test code includes some intricate crafts, e.g., that helps repeat the test many times.)In the following, we explain a simplified version of the concurrent program we used and show why the above LICM is a miscompilation bug.
To see a buggy execution, let's consider two threads where the first thread calls the
bug
function.Thread 1:
Thread 2:
Also suppose that the
foo
function is given as follows:Now, if we create two threads as follows, the program should never print
bug!!!
before the problematic LICM in thebug
function.Indeed, once
thread2
readsN
froma
in an acquire access mode (line T2.2), the last call to thefoo
function (line A6, writingN
toa
with a release mode) synchronizes with the acquire read. Therefore, the last write top
(line A5,*p = N
) bythread1
happens-before the read from*p
(line T2.5,*p != N
) bythread2
, andthread2
can only readN
from*p
. Note that there is no data race in this program (before the LICM).However, after the above LICM, the program becomes racy on
*p
, andthread2
can read a value other thanN
from*p
. We actually observed this program printing "bug!!!" by compiling and running it on Armv8 processors Apple M1 and ThunderX2!Roughly, the above LICM is wrong because it moves a non-atomic write
*p = i
(line A5) after a release write (line A6). This is wrong even thoughp
is never leaked before it is returned. Specifically, LLVM applies the above LICM and inlines thebug
function inthread1
, resulting in the following code:Now, the write to
p
(line O5) is not properly synchronized anymore, andthread2
can read an uninitialized value fromp
(line T2.5). In particular, such behavior is allowed by the Arm architecture because it can reorder the two independent stores O5 and O6. Therefore, this program can print "bug!!!", which is originally forbidden.We conclude that the pointer
p
in the functionbug
should not be treated as completely local even before it is leaked (line A9).The text was updated successfully, but these errors were encountered: