A miscompilation bug in LICMPass (concurrency) #64188

Open
sunghwanl opened this issue Jul 28, 2023 · 31 comments

@sunghwanl

sunghwanl commented Jul 28, 2023

A1:  int* bug(int N, atomic<int>& a) {
A2:    int* p = (int*)malloc(sizeof(int));
A3:    
A4:    for (int i = 1; i <= N; i++) {
A5:      *p = i;
A6:      foo(i, a);
A7:    }
A8:  
A9:    return p;
A10: }

The LICMPass in Clang++ (13.0.0 and later) optimizes the above function into the following (C++ syntax is used for brevity; see godbolt for the IR code):

B1:  int* bug(int N, atomic<int>& a) {
B2:    int* p = (int*)malloc(sizeof(int));
B3:    
B4:    for (int i = 1; i <= N; i++) {
B5:      foo(i, a);
B6:    }
B7:  
B8:    *p = N;
B9:  
B10:   return p;
B11: }

This is a miscompilation bug. By compiling and running a concurrent program that calls the bug function on the Armv8 processors Apple M1 and Cavium ThunderX2, we actually observed program behaviors that are forbidden by the original program before the optimization. The actual program we tested can be found here. (The forbidden behavior shows up only a couple of times over millions to billions of program runs, so the test code includes some intricate tricks, e.g., to repeat the test many times efficiently.)

In the following, we explain a simplified version of the concurrent program we used and show why the above LICM is a miscompilation bug.

To see a buggy execution, let's consider two threads where the first thread calls the bug function.

Thread 1:

T1.1: void thread1(int N, atomic<int>& a, atomic<int*>& X) {
T1.2:   int* p = bug(N, a);
T1.3:   X.store(p, memory_order_relaxed);
T1.4: }

Thread 2:

T2.1: void thread2(int N, atomic<int>& a, atomic<int*>& X) {
T2.2:   while (a.load(memory_order_acquire) < N);
T2.3:   int *p = X.load(memory_order_relaxed);
T2.4:   if (p != NULL) {
T2.5:     if (*p != N) {
T2.6:       cout << "bug!!!" << endl;
T2.7:     }
T2.8:   }
T2.9: }

Also suppose that the foo function is given as follows:

F1: void foo(int i, atomic<int>& a) {
F2:   a.store(i, memory_order_release);
F3: }

Now, if we create the two threads as follows, the program can never print bug!!! as long as the problematic LICM has not been applied to the bug function.

atomic<int> a;
atomic<int*> X;
int N = 1;
a.store(0, memory_order_relaxed);
X.store(NULL, memory_order_relaxed);
thread* th1 = new thread(thread1, N, ref(a), ref(X));
thread* th2 = new thread(thread2, N, ref(a), ref(X));

Indeed, once thread2 reads N from a with an acquire load (line T2.2), the last call to the foo function (line A6, which writes N to a with release ordering) synchronizes with that acquire read. Therefore, the last write to *p (line A5, *p = N) by thread1 happens-before the read from *p (line T2.5, *p != N) by thread2, and thread2 can only read N from *p. Note that there is no data race in this program (before the LICM).

However, after the above LICM, the program becomes racy on *p, and thread2 can read a value other than N from *p. We actually observed this program printing "bug!!!" by compiling and running it on Armv8 processors Apple M1 and ThunderX2!

Roughly, the above LICM is wrong because it moves a non-atomic write *p = i (line A5) after a release write (performed inside foo, line A6). This is wrong even though p is never leaked before it is returned. Specifically, LLVM applies the above LICM and inlines the bug function in thread1, resulting in the following code:

    ...
O1: for (int i = 1; i <= N; i++) {
O2:   foo(i, a);
O3: }
O4:
O5: *p = N;
O6: X.store(p, memory_order_relaxed);
    ...

Now, the write to *p (line O5) is no longer properly synchronized, and thread2 can read an uninitialized value from *p (line T2.5). In particular, such behavior is allowed by the Arm architecture because it can reorder the two independent stores O5 and O6. Therefore, this program can print "bug!!!", which is originally forbidden.

We conclude that the pointer p in the function bug should not be treated as completely local even before it is leaked (line A9).

@nikic
Contributor

nikic commented Jul 28, 2023

cc @efriedma-quic

@efriedma-quic
Collaborator

This is basically the same issue discussed in the review of https://reviews.llvm.org/D153932 ; our escape analysis doesn't properly account for atomic fences. (See also https://reviews.llvm.org/D137707.) I'm still a little surprised to see it actually show up like this. CC @jdoerfert @aeubanks

@nikic
Contributor

nikic commented Jul 28, 2023

The example here doesn't use fences though. Is the example valid as given?

If this is just about fences, then it goes straight on the "technically a bug, but we're not going to fix it" pile as far as I'm concerned.

@efriedma-quic
Collaborator

efriedma-quic commented Jul 28, 2023

I didn't mean "fence" instructions specifically; that's just what I was thinking about looking at the patch I mentioned. The example is valid, as far as I can tell.

The key construct here is the "relaxed" store of the address (X.store(p, memory_order_relaxed);). This allows the pointer to reach another thread without a happens-before edge. So the ordering has to be enforced some other way... but the store/load of "a" is enough to enforce ordering according to the standard. The location of the store of the pointer in the code doesn't matter; it could be before or after the "release" operation that enforces the ordering.

Note that if the store of the address is not atomic, another barrier would be necessary, and if the store of the address is "release", it naturally enforces its own ordering. So the miscompile is only visible with a "relaxed" store of a newly allocated pointer, I think.
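
To illustrate that last point, here is a minimal sketch (my adaptation of the example above, not code from the report) of the release-publication variant, in which even the LICM'd bug is harmless:

void thread1_release(int N, atomic<int>& a, atomic<int*>& X) {
  int* p = bug(N, a);                 // even the LICM'd version is fine here
  X.store(p, memory_order_release);   // release publication of p
}

void thread2_release(int N, atomic<int*>& X) {
  int* p = X.load(memory_order_acquire);  // pairs with the release store, so
  if (p != NULL && *p != N)               // *p = N happens-before this read
    cout << "bug!!!" << endl;             // and this can never print
}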

As a practical matter, it seems relatively unlikely anyone would stumble over this by accident; lockless data structures that allocate memory would normally either use a "release" store of the pointer, or use some other form of locking.

I'm a little uncomfortable assuming nobody will use atomics this way, but the potential performance regressions from restricting escape analysis around calls are also sort of nasty.

@jdoerfert
Member

I'm a little uncomfortable assuming nobody will use atomics this way, but the potential performance regressions from restricting escape analysis around calls are also sort of nasty.

For LICM, we'd need to peel off one iteration for this to be handled. FWIW, atomics do not have to show up anywhere for this to appear. I'm not sure the situation is really common enough to show up if we forbid LICM from doing this when the pointer of the final store escapes later.
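
A sketch of what that one-iteration peel could look like on the original example (the name bug_peeled and the N >= 1 guard are mine, not an existing transform):

int* bug_peeled(int N, atomic<int>& a) {
  int* p = (int*)malloc(sizeof(int));
  for (int i = 1; i <= N - 1; i++) {
    foo(i, a);   // the stores *p = i of iterations 1..N-1 are dead, since
  }              // p has not escaped yet and no other thread can read *p
  if (N >= 1) {
    *p = N;      // in the peeled last iteration, the final store stays
    foo(N, a);   // sequenced before the release write inside foo
  }
  return p;
}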

@sunghwanl
Author

sunghwanl commented Jul 29, 2023

An ad hoc remedy for this particular example is to move *p = i before the loop instead of moving it after. It is safe to move *p = i before foo even if foo contains some acquire accesses/fences. In general, I would say we should avoid moving memory accesses after unknown functions (or release accesses/fences) if the address being accessed can possibly be leaked in some continuation.
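
For concreteness, a sketch of that remedy applied to the original example (the name bug_hoisted and the N >= 1 guard are mine). Since no other thread can observe *p inside the loop, only the last store is live, so the hoisted store writes the final value N:

int* bug_hoisted(int N, atomic<int>& a) {
  int* p = (int*)malloc(sizeof(int));
  if (N >= 1)
    *p = N;      // final value is in place before any release write in foo
  for (int i = 1; i <= N; i++) {
    foo(i, a);
  }
  return p;
}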

@RalfJung
Contributor

RalfJung commented Mar 5, 2024

In general, I would say we should avoid moving memory accesses after unknown functions (or release accesses/fences) if the address being accessed can possibly be leaked in some continuation.

This could be made legal by adding a release fence before the address gets leaked, right? So this kind of code motion becomes a trade-off, where the cost of the fence is weighed against the benefit of moving the stores around.
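
Concretely (my sketch of that suggestion; the reply below explains why a release fence alone does not rescue the original example), the compiler would pay for the sinking with a fence at the escape point:

    ...
for (int i = 1; i <= N; i++) {
  foo(i, a);
}

*p = N;                                     // the sunk store
atomic_thread_fence(memory_order_release);  // proposed compensating fence
X.store(p, memory_order_relaxed);           // the escape point
    ...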

@davidtgoldblatt
Contributor

davidtgoldblatt commented Mar 15, 2024

This isn't really about LICM in my opinion; it's some sort of combination provenance + concurrency issue (alias analysis will tell you that your locals haven't escaped before you pass them to extern functions; but this isn't true in a world where concurrency can cause time travel).

Brief example with no loops: https://gcc.godbolt.org/z/4qKs4TYG3

I think the right fix for this is at the language level rather than LLVM though (and that the language should retroactively bless LLVM's behavior); assuming non-escaped locals don't alias random pointers passed to you seems like it must be fundamental.

Edit: fixed typo in godbolt link

@efriedma-quic
Collaborator

The new testcase is interesting because there is no release barrier... which means no other thread can access the variable "local". But the thread that allocated it is fine, I guess? There's a happens-before relationship between the accesses (and the deallocation) because they're on the same thread.

I can imagine a language rule somehow trying to address this, but it would be weird. We'd have to divide up memory into "single-threaded" and "multi-threaded" variables. A variable starts out "single-threaded", then becomes "multi-threaded" when its address is stored to memory visible to other threads (or the memory where the address is stored becomes visible to other threads, or the address is converted to an integer). And then we require a release-acquire edge between the store of the address, and any load that accesses the address. Or something along those lines; the key here is that "publishing" the address of a variable becomes a special operation.

The current C++ model, by contrast, only requires that there's a release-acquire edge between the allocation/initialization of the variable and any accesses to the variable's value from another thread; the address itself is irrelevant.


This could be made legal by adding a release fence before the address gets leaked, right?

This doesn't help; a "release" only has an effect if there's an "acquire" on the other side.

@davidtgoldblatt
Contributor

I can imagine a language rule somehow trying to address this, but it would be weird. We'd have to divide up memory into "single-threaded" and "multi-threaded" variables. A variable starts out "single-threaded", then becomes "multi-threaded" when its address is stored to memory visible to other threads (or the memory where the address is stored becomes visible to other threads, or the address is converted to an integer). And then we require a release-acquire edge between the store of the address, and any load that accesses the address. Or something along those lines; the key here is that "publishing" the address of a variable becomes a special operation.

The thing I was thinking of is applying something like a happens-before rule not just to the accesses, but also to pointers' provenances. So e.g. in the original example, we declare that at thread2's load of p, even though p's value was obtained via a valid happens-before chain, its provenance was not. This sort of captures the notion of publication in a way that meshes with the rest of the memory model.

FWIW, the C++ committee concurrency study group will talk about this in the next couple of days and see if anyone has anything clever to say.

@efriedma-quic
Collaborator

The simplest way to "solve" the issue is just to forbid relaxed loads/stores of pointers: without those, you can't do anything weird (ignoring inttoptr conversions; not sure how we'd handle those).

Saying that relaxed loads/stores of pointers don't have provenance is basically the same thing as forbidding them: a pointer without provenance is just an integer. If you want to allow them, you need some way to transfer the provenance alongside the pointer.

The notion of "publishing" is a way out of this conflict: it means you need a barrier at exactly one point, the point of first publication. After that, the provenance transfers like you'd normally expect. This minimizes the amount of code made undefined, and keeps existing compiler optimizations working.

@hboehm

hboehm commented Mar 18, 2024

Some (evolving) slides for an upcoming C++ concurrency group presentation are here.

So far, I like David's suggestion, as I understand it. Can we paraphrase that as disallowing copying of a pointer unless the lifetime-start of the object it refers to happens-before the pointer is copied?

@efriedma-quic
Collaborator

disallowing copying of a pointer unless the lifetime-start of the object it refers to happens-before the pointer is copied?

The C++ memory model already forbids accessing an object before it's created, or after it's destroyed.

The issue here is atomic operations that happen between the object being created and the object's address "escaping". LLVM optimizations assume such atomic operations are irrelevant... but the C++ memory model doesn't attach any special significance to the point where the pointer escapes. So the fix here is to make the point where the pointer escapes special: there has to be a happens-before edge between the escape point, and any use of the address in other threads.

@RalfJung
Contributor

RalfJung commented Mar 19, 2024

@davidtgoldblatt isn't yours a completely different issue? It doesn't involve any release/acquire. It seems more like the example from this recent talk.

In your example, t2 should always return 1, no matter the value of allocated, shouldn't it? Except load buffering can "send values into the past" and that is incompatible with basic freshness assumptions. But this is a very different issue from the OP where no load buffering is involved. Or did I misunderstand something?

So e.g. in the original example, we declare that at thread2's load of p, even though p's value was obtained via a valid happens-before chain, its provenance was not. This sort of captures the notion of publication in a way that meshes with the rest of the memory model.

Usually the provenance is part of the pointer value itself. So how would this work, where does the gap in the provenance hb-chain come from?
Under the models of provenance I am aware of, a fresh provenance is generated by malloc, but that is happens-before the access in thread2. The issue is that the compiler kind of wants to not make that provenance part of the "release" in thread1... but how is that supposed to work in general, where I do need to rely on "release" publishing the entire contents of memory, including all its provenance?

@efriedma-quic

This could be made legal by adding a release fence before the address gets leaked, right?

This doesn't help; a "release" only has an effect if there's an "acquire" on the other side.

Ah, I was fooled by the acquire in thread2 in the original example.
The publishing is via X which is only used relaxed...
So what happens is that the data at *p gets published via a and release/acquire, but the value of p itself gets published later via relaxed accesses -- the two are basically "taking different paths" to the other thread. Very cool, but also quite concerning.

The issue here is atomic operations that happen between the object being created and the object's address "escaping". LLVM optimizations assume such atomic operations are irrelevant... but the C++ memory model doesn't attach any special significance to the point where the pointer escapes. So the fix here is to make the point where the pointer escapes special: there has to be a happens-before edge between the escape point, and any use of the address in other threads.

The issue is that the "escape point" is not really a well-defined notion, is it? As compiler analyses get better and better, they may be able to push the "escape point" further and further down program execution. How did you plan to define "escape point"?

@davidtgoldblatt
Contributor

@davidtgoldblatt isn't yours a completely different issue? It doesn't involve any release/acquire. It seems more like the example from this recent talk.

In your example, t2 should always return 1, no matter the value of allocated, shouldn't it? Except load buffering can "send values into the past" and that is incompatible with basic freshness assumptions. But this is a very different issue from the OP where no load buffering is involved. Or did I misunderstand something?

I think my take is that the fundamental issue here is that the presence of concurrency can "move up" (in sequenced-before) the effective time when some thread's private-seeming data has escaped. That's done with load buffering in my example, but can also happen with store reordering (as in the example here). In this view, the release/acquire synchronization in the original example is a detail -- it's just a way to give defined behavior to our ability to detect that early-escape, but it's not really the cause.

So e.g. in the original example, we declare that at thread2's load of p, even though p's value was obtained via a valid happens-before chain, its provenance was not. This sort of captures the notion of publication in a way that meshes with the rest of the memory model.

Usually the provenance is part of the pointer value itself. So how would this work, where does the gap in the provenance hb-chain come from? Under the models of provenance I am aware of, a fresh provenance is generated by malloc, but that is happens-before the access in thread2. The issue is that the compiler kind of wants to not make that provenance part of the "release" in thread1... but how is that supposed to work in general, where I do need to rely on "release" publishing the entire contents of memory, including all its provenance?

I think that I would say something like "at T2.3, the load should return a pointer value lacking provenance, because the store of the pointer value does not happen before the load". This isn't quite workable (because of standalone fences), but something in that direction at least feels plausible to me as a language-level fix.

Maybe more succinctly: the root cause of all these issues is that some line of code gets a pointer value whose bits are equal to a valid pointer, but which it shouldn't be allowed to use. That's the same situation that we want provenance rules to solve; you have to come by your pointer values honestly. Using relaxed accesses to pass them to other threads is cheating, in the same way abusing the one-past-the-end-of-an-array exception to get a pointer that happens to be equal to some other object's address is cheating.

@RalfJung
Contributor

RalfJung commented Mar 19, 2024

Maybe more succinctly: the root cause of all these issues is that some line of code gets a pointer value whose bits are equal to a valid pointer, but which it shouldn't be allowed to use. That's the same situation that we want provenance rules to solve; you have to come by your pointer values honestly. Using relaxed accesses to pass them to other threads is cheating, in the same way abusing the one-past-the-end-of-an-array exception to get a pointer that happens to be equal to some other object's address is cheating.

That makes sense, thanks. Though I do worry it can be fiendishly difficult in practice to reason about this... except ofc by just always using release-acquire when pointers are involved.^^

However, does that capture all cases of this issue? Here is another example, which I got from a colleague (Chung-Kil Hur):

<<Thread1>>

T1.1:   void thread1(atomic<int*>& X, atomic<int*>& Y) {
T1.2:     int* p = (int*)malloc(sizeof(int));
T1.3:     *p = 42;
T1.4:     rel_fence();
T1.5:     X.store(p, memory_order_relaxed);
T1.6:   }
T1.7:  void rel_fence() {
T1.8:    atomic_thread_fence(memory_order_release);
T1.9:  }

<<Thread2>>

T2.1:   void thread2(atomic<int*>& X) {
T2.2:     int* p = X.load(memory_order_relaxed);
T2.3:     atomic_thread_fence(memory_order_acquire);
T2.4:     if (p != NULL) {
T2.5:       if (*p != 42) {
T2.6:         cout << "bug!!!" << endl;
T2.7:       }
T2.8:     }
T2.9:   }

This should be equivalent to using release stores / acquire loads on X: we are following the usual pattern, "release fence then relaxed store" and "relaxed load then acquire fence". And yet, the compiler may want to move *p = 42 down below the rel_fence() call, which would seem to break this program.
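
In other words (my sketch of the transformation being worried about, not observed compiler output), thread1 could become:

void thread1(atomic<int*>& X, atomic<int*>& Y) {
  int* p = (int*)malloc(sizeof(int));
  rel_fence();                         // p has not escaped yet, so the opaque
  *p = 42;                             // call looks irrelevant and the store
  X.store(p, memory_order_relaxed);    // sinks past the release fence inside it
}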

I guess you could say that at T2.2 there is no happens-before to T1.5 since relaxed doesn't induce HB -- but that would be basically saying fences don't work when doing the standard message passing idiom with pointers. Which seems bad?

I suppose that's what you meant when you said your proposal isn't quite workable because of fences.

@efriedma-quic
Collaborator

The issue is that the "escape point" is not really a well-defined notion, is it? As compiler analyses get better and better, they may be able to push the "escape point" further and further down program execution. How did you plan to define "escape point"?

The point where the pointer becomes accessible to another thread. This is, as you note, basically impossible for the compiler to compute precisely; it's a dynamic property of the code at runtime. But the compiler doesn't need to precisely compute it; a conservative approximation is sufficient. Standard compiler escape analysis is such an approximation.

@RalfJung
Contributor

Even at runtime, this would be a very strange notion to introduce into the very definition of what is Defined Behavior and what is not. How is it defined? A global scan of all memory transitively reachable through all pointers in a thread's stack? That's a hell of a property to reason about, if I want to justify precisely why my program is correct.

@efriedma-quic
Collaborator

How is it defined? A global scan of all memory transitively reachable through all pointers in a thread's stack?

Not quite sure. :)

This sort of property isn't completely unprecedented; proposals for defining inttoptr are sort of similar.

@RalfJung
Contributor

The proposals for defining inttoptr that I am involved in (mostly on the Rust side) don't look anything like that. ;)

@davidtgoldblatt
Contributor

davidtgoldblatt commented Mar 19, 2024

I suppose that's what you meant when you said your proposal isn't quite workable because of fences.

Yeah, exactly. (I mean, not limited to just fences of course -- any sort of opaque library synchronization could work too). But that at least feels like something we could put some words around, because now we have some synchronization primitive to hang it off of. Maybe introducing "provisional provenance"; it starts acting like empty provenance but can get promoted if certain synchronization conditions are met. Something like:

If there are an atomic pointer store S of value V and an atomic pointer load L, where L takes its value from S[1], then if S happens before L, L obtains V. If S does not happen before L, then L obtains V', which has the same address as V but a provisional version of V's provenance. On dereference of V', if there is some evaluation P such that the lifetime start of the object to which V refers happens before P, L happens before P, and P happens before the dereference, then the dereference acts as though it has V's provenance. Otherwise, it acts as though it has empty provenance.[2]

The intuition being, we pretend that the provenance for some storage lives with the storage, being stored to at its lifetime start. At some unspecified point after we load a pointer value, we try to load its provenance from the storage and update our pointer value with it; these are the conditions we'd need to avoid UB in that case.

I think that this blesses LLVM's current behavior on the various weirdo test cases without rendering too much sane code UB.

[1] Not this exactly -- we probably need something about release sequences here
[2] I'm being sloppy here with happens before vs simply happens before vs. strongly happens before. I think where I say happens before we mostly actually want simply happens before, but I haven't thought it through all the way.

I'm not 100% sure this could really be formalized because of the operational semantics / axiomatic semantics mismatch in provenance and the memory model, but OTOH we already have issues there anyways.

@RalfJung
Contributor

I'm not 100% sure this could really be formalized because of the operational semantics / axiomatic semantics mismatch in provenance and the memory model, but OTOH we already have issues there anyways.

There's also the "there is some evaluation P" quantification which doesn't seem a good fit for formalization -- it seems to require running hypothetical alternative executions. Can this be stated in terms of just the current execution and its memory model graphs?

@comex
Contributor

comex commented Mar 25, 2024

I think the right fix for this is at the language level rather than LLVM though (and that the language should retroactively bless LLVM's behavior); assuming non-escaped locals don't alias random pointers passed to you seems like it must be fundamental.

Is it fundamental, though? With your example, neither GCC nor MSVC performs the problematic optimization, though Clang and ICC do. If I change the relaxed store to an opaque function call, then ICC stops performing the optimization, leaving Clang as the odd one out.

In other words, other compilers seem to apply "non-escaped locals" optimizations only to locals that never escape, as opposed to identifying when each local escapes and applying the optimizations at points where they haven't escaped yet.

(In ICC's case, the relaxed store should be enough to block the optimization, but the fact an opaque call does block it suggests that the relaxed-store case may be a relatively localized bug in ICC, as opposed to proving that ICC follows a general policy of "aliasing can't happen until after the escape", as LLVM currently does. Maybe. ICC is closed source, so I'm only speculating.)

Now to speculate even more loosely:

If I had to guess what kind of scenario benefits most from this kind of optimization, I'd guess something like this: A long function contains a local that's manipulated many times, but mostly by value. However, at some point a pointer to it is passed to another, opaque function. In reality, the second function only accesses the pointer for the duration of the call (i.e. it could be nocapture), but LLVM doesn't know that. Ideally the local could be optimized for the rest of the function, e.g. by storing it in registers.

But even in such a scenario, LLVM's time-based escape analysis only provides a benefit until the first escape. After that, the local must be treated as forever aliased. That is a pretty limited benefit! It doesn't sound so bad to forgo it, especially if other C/C++ compilers already do so.
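
To make that concrete, a hypothetical example of the scenario above (opaque and the loop bounds are made up):

void opaque(int*);  // assumed un-analyzable; in reality it only accesses the
                    // pointer for the duration of the call

int f() {
  int local = 0;
  for (int i = 0; i < 100; i++) local += i;  // pre-escape: can live in a register
  opaque(&local);                            // first escape point
  for (int i = 0; i < 100; i++) local += i;  // post-escape: conservatively treated
  return local;                              // as aliased for the rest of f
}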

If this were Rust code, then under most circumstances, the compiler should be allowed to treat the local as non-aliased both before and after the call. The specifics are complicated, and the spec-permitted optimizations are mostly unimplemented, but if they were implemented, I believe they would render time-based escape analysis mostly redundant for Rust code.

@davidtgoldblatt
Contributor

There's also the "there is some evaluation P" quantification which doesn't seem a good fit for formalization -- it seems to require running hypothetical alternative executions. Can this be stated in terms of just the current execution and its memory model graphs?

This is just an artifact of the C++ standardese weasel wording we use to avoid being precise about where a memory or synchronization operation happens (see, e.g., here) -- not really meant as a quantification across multiple executions.

@efriedma-quic
Collaborator

I should say, despite the discussion here, I'm not convinced modifying the standard is the right approach. The discussion here indicates that any modification we make would be confusing. And we don't know what the actual performance penalty is for just following the standard as it's currently written.

But even in such a scenario, LLVM's time-based escape analysis only provides a benefit until the first escape.

Yes. The relevant code is llvm::PointerMayBeCapturedBefore.

If this were Rust code, then under most circumstances, the compiler should be allowed to treat the local as non-aliased both before and after the call.

I don't see how local variables are any better off here. Sending a reference to another thread without a fence would require unsafe code, but it's not clear to me it would be undefined behavior.

I believe they would render time-based escape analysis mostly redundant for Rust code.

You can directly translate the original example to Rust by replacing "malloc" with "Box::new()".

@davidtgoldblatt
Contributor

I should say, despite the discussion here, I'm not convinced modifying the standard is the right approach. The discussion here indicates that any modification we make would be confusing. And we don't know what the actual performance penalty is for just following the standard as it's currently written.

Do you mean that in full generality, including the load-buffering-like examples? My impression was that everyone thought that those were too weird / harmful to be supported, while most people thought that the "establish release/acquire synchronization through some opaque function call" ones should remain correct per the standard (and that all we disagreed about was which set this example fell into).

@efriedma-quic
Collaborator

Do you mean that in full generality, including the load-buffering-like examples?

You mean, specifically, the examples that don't involve any fences (#64188 (comment))? I hadn't really thought about it before... but if the standard could change something there, it might be helpful. Specifically, to remove the happens-before relationship between loads/stores on the same thread if the pointer operand for one of the operations comes from another thread.

From the perspective of LLVM optimizations, it would be straightforward to make llvm::PointerMayBeCapturedBefore() treat function calls that aren't marked "nosync" as escape points for pointers that eventually escape. That solves the cases involving fences, but not the cases that only involve relaxed load/store.

@efriedma-quic
Collaborator

From the perspective of LLVM optimizations, it would be straightforward to make llvm::PointerMayBeCapturedBefore() treat function calls that aren't marked "nosync" as escape points for pointers that eventually escape

In particular, we currently don't track whether a function contains any relaxed load/store ops, so if we need to track that, it would make the implementation a bit more complicated. The performance impact is 99% the same either way, though: the primary impact is treating unknown functions as an escape point for pointers that eventually escape.

@davidtgoldblatt
Contributor

You mean, specifically, the examples that don't involve any fences (#64188 (comment))? I hadn't really thought about it before... but if the standard could change something there, it might be helpful. Specifically, to remove the happens-before relationship between loads/stores on the same thread if the pointer operand for one of the operations comes from another thread.

I think yeah, we should change the standard to make LLVM's current behavior allowed in those cases.

I also think, though, that this specific case is more like the ones you describe as not involving any fences than it seems at first glance. I think the pattern we want to allow is the one you mentioned upthread, discussed in https://reviews.llvm.org/D153932. In the usages everyone agrees should be allowed, the pattern is:

Publishing thread:
P.1 Initialize pointee
P.2 Establish synchronization with consuming thread (maybe via a fence in some un-analyzable function)
P.3 Publish pointer

Consuming thread:
C.1 Get pointer
C.2 Establish synchronization for pointee
C.3 Access pointee

In this example, though, the underlying pattern is:
C.1 Establish synchronization for pointee
C.2 Get pointer
C.3 Access pointee (no synchronization established between C.2 and C.3)

Which is the same root issue in some of the unsynchronized, relaxed-only examples we're looking at elsewhere. I'm OK jettisoning it to make alias analysis (in the sense of "your arguments don't alias storage you allocate") more viable.
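
A concrete rendering of the agreed-allowed pattern (my sketch; the ready flag stands in for whatever synchronization P.2/C.2 establish, possibly a fence inside an un-analyzable function):

atomic<int*> X;      // initially NULL
atomic<bool> ready;  // initially false

void publisher() {
  int* p = (int*)malloc(sizeof(int));
  *p = 42;                                   // P.1 initialize pointee
  ready.store(true, memory_order_release);   // P.2 establish synchronization
  X.store(p, memory_order_relaxed);          // P.3 publish pointer
}

void consumer() {
  int* p = X.load(memory_order_relaxed);     // C.1 get pointer
  while (!ready.load(memory_order_acquire))  // C.2 establish synchronization
    ;                                        //     for the pointee
  if (p != NULL && *p != 42)                 // C.3 access pointee
    cout << "bug!!!" << endl;                // must never print
}

For this to stay correct, the store *p = 42 must not sink past P.2, even though p has not yet escaped at that point.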

In particular, we currently don't track whether a function contains any relaxed load/store ops, so if we need to track that, it would make the implementation a bit more complicated. The performance impact is 99% the same either way, though: the primary impact is treating unknown functions as an escape point for pointers that eventually escape.

It gets subtle fast -- I think you need to consider not just function calls, but things that happen after your function returns, so you can't just infer some nosync-variant in a bottom-up call graph traversal (without giving up on sret/noalias optimizations).

@efriedma-quic
Collaborator

I was thinking that between the allocation of a variable, and the first atomic operation after the allocation, we could treat the variable as "non-escaped" relative to any other memory accesses. That said, in the fenceless variants, I guess it's possible to load the address of a variable before the variable is actually allocated. Which would cause problems for alias analysis.

In this example, though, the underlying pattern is:
C.1 Establish synchronization for pointee
C.2 Get pointer
C.3 Access pointee (no synchronization established between C.2 and C.3)

Which is the same root issue in some of the unsynchronized, relaxed-only examples we're looking at elsewhere. I'm OK jettisoning it to make alias analysis (in the sense of "your arguments don't alias storage you allocate") more viable.

The problem is that it's hard to establish what, exactly, it means to "get a pointer". For the case where the load and the store are on the same thread, I think we can just forbid loading a pointer before you store it. For the cross-thread cases, the definition gets more complicated.

And even if we do manage to establish a definition, what have we actually accomplished? The handling required for cases like D153932 still basically requires treating unknown functions as escape points.

@Permik

Permik commented Apr 8, 2024

That said, in the fenceless variants, I guess it's possible to load the address of a variable before the variable is actually allocated. Which would cause problems for alias analysis.

But that would essentially be a volatile access in that case?
No provenance, because the pointer was "incorrectly" created and all bets are off.
And if my understanding is correct, heavy de-optimization because you can't really optimize around volatile memory access.
