Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/race: inconsistent snapshot of vector clocks during finalizer sync leads to false positive #39186

Open
nvanbenschoten opened this issue May 21, 2020 · 27 comments

Comments

@nvanbenschoten
Copy link
Contributor

@nvanbenschoten nvanbenschoten commented May 21, 2020

Over at CockroachDB, we've been debugging a reported data race on and off for the past month: cockroachdb/cockroach#46793. There's a lot of context in that thread and it might be helpful to skim it, if only to see all the possibilities we ruled out, but I'll try to keep this issue as self-contained as possible.

The interactions are deep in the guts of the database and I've never been able to pull out a smaller reproduction. However, this example demonstrates the actors at play: https://play.golang.org/p/FViQigC1Q-N. We have one goroutine pulling pointers to messages from a channel, reading the value of those messages, and putting them in a sync.Pool. A collection of other goroutines are pulling messages out of the sync.Pool, reading/writing to them, and sending them on the channel. It's all innocent enough. It's also pretty clear with everything else stripped away that we should just be storing values in the channel and avoiding the sync.Pool entirely, but that's a topic for another time.

What we see is that occasionally under heavy stress, two of the source goroutines will report a race with one another on a message's memory address. At first, we thought this meant that we were double-freeing these messages by placing them into the pool twice somehow. No amount of reference counting or object tracing found this to be the case.

At this point, we started to look into the sync.Pool implementation itself. Most of the details are in cockroachdb/cockroach#46793 so I won't say too much here other than that we could still reproduce when disabling all of the sync.Pool except the per-P private storage. So even without any cross kernel thread object sharing, we still had a reported race.

We then saw an even more confusing race report. If we split the source goroutines into a group that used the pool and a group that allocated each message directly, we would still see reports of races between the two classes of sources. In other words, we'd see a race between the first write to an object after it was allocated and the first write to it after it had been pulled from the pool. Given the structure of the program, these operations should have the fattest happens-before edge imaginable between them, so something was clearly going wrong.

We then turned our eyes to the race detector and ThreadSanitizer itself, as this was starting to look like a loss of causality between goroutines due to some failure to propagate vector clocks during either the channel send between the source and the sink or during the sync.Pool insertion on the sink. We never saw any reported races due to the sink goroutine's read of the message, so the latter possibility was the best guess. After instrumenting all SyncClock interactions and filtering down to just those on the sync.Pool (with len(poolRaceHash) dropped down to 1) we started to see a clearer picture emerge.

During the sink goroutine's insertion into the sync.Pool, it's corresponding ReleaseMerge tsan operation was not fully propagating its vector clock. Specifically, it could fail to propagate clock state for other goroutines if it hit this branch, which was introduced as part of this optimization. This optimization tracks the latest time that a thread/goroutine performs an acquire operation (i.e. merges other vector clocks with its own) and avoids an expensive full update of a SyncClock during a release operator (i.e. merges its clock with the clock associated with a synchronization element) if that clock already had a reading above the thread's last_acquire_ timestamp. If I understand correctly, the correctness of this optimization is founded somewhere along the lines of: if the SyncClock has a higher reading for a given thread than the last time that thread synced with any other threads, then that thread must not have anything new to tell the SyncClock about anyone other than itself due to how causality flows through vector clocks. So if we hit that branch, we don't expect the acquiring thread to have any clock values above those in the SyncClock (other than its own). And yet, I was able to trigger the following assertion, which violates the fundamental invariant that this optimization is based on:

@@ -248,6 +248,29 @@ void ThreadClock::release(ClockCache *c, SyncClock *dst) {
   // since the last release on dst. If so, we need to update
   // only dst->elem(tid_).
   if (dst->elem(tid_).epoch > last_acquire_) {
+    uptr i = 0;
+    for (ClockElem &ce : *dst) {
+      // Ignore the current thread.
+      if (i == tid_) {
+        i++;
+        continue;
+      }
+      // Ignore dirty state. This might not be necessary, but can't hurt.
+      bool dirty = false;
+      for (unsigned j = 0; j < kDirtyTids; j++) {
+        if (i == dst->dirty_[i]->tid) {
+          dirty = true;
+          break;
+        }
+      }
+      if (!dirty) {
+        // The current thread should never have more advanced knowledge about
+        // other threads, or we shouldn't be in this branch.
+        CHECK_LE(clk_[i], ce.epoch);
+      }
+      i++;
+    }
     UpdateCurrentThread(c, dst);
     if (dst->release_store_tid_ != tid_ ||
         dst->release_store_reused_ != reused_)

Furthermore, I was able to see that the clock element that triggered the assertion was from the goroutine that performed the initial mutation on the memory location which was later flagged in the race. In other words, the clock element that was was in violation was the very one we needed to merge into the SyncClock to ensure proper propagation of causality to establish the happens-before link to show that this is not a data race.

From here, we began looking into how this invariant could be violated. I noticed in the tsan logs that shortly before, I usually saw a call to AcquireGlobal. This function is meant to synchronize the caller thread with all other threads be establishing a happens-before relation from all threads to the current thread. It is called by the Go finalizer goroutine right before executing finalizers. AcquireGlobal is somewhat strange in that it iterates over every other thread, grabs its current clock reading, and puts that into the caller's clock. Critically, it does not do so during a stop-the-world pause or with any synchronization with the other threads/goroutines. This means that as it iterates over other goroutines, it may see an inconsistent snapshot of their state.

This is exactly what I saw in the logs. I often saw histories like:

#6: AcquireGlobal: thread=1, epoch=1000
#6: AcquireGlobal: thread=2, epoch=1000
...
#378: Acquire, last_acquired_=146416
#6: AcquireGlobal: thread=427, epoch=1000
#839: Acquire, last_acquired_=375081
#6: AcquireGlobal: thread=428, epoch=1000
#6: AcquireGlobal: thread=429, epoch=1000
...
#6: AcquireGlobal: thread=900, epoch=1000

And later on I would see the assertion in ThreadClock::release fire with:

#839: assertion failed for thread=378; 149294 > 148778

Notice that the finalizer goroutine acquired a value from goroutine 378 before both goroutine 378 and goroutine 839 performed acquisitions but didn't acquire a value from goroutine 839 until after those two acquisitions. This presents an opportunity for the final state of the finalizer goroutine's clock to violate invariant we previously relied upon in ThreadClock::release. If the finalizer goroutine's vector clock ever made it into the sync.Pool's SyncClock, goroutine 839's later release into that SyncClock could be ignored, even though goroutine 839 has a larger value for 378's clock element.

So I think we are seeing a false positive from the race detector due to a combination of the inconsistent clock state observed by finalizers during their call to AcquireGlobal and the optimization in ThreadClock::release. I confirmed that I can not hit this race if I disable that optimization.

As an added vote of confidence for this being the problem or being in the near vicinity of the problem, the race became significantly more prevalent in the CockroachDB codebase at about the same time that we turned on https://github.com/cockroachdb/pebble by default as our replacement for RocksDB (in C++). A notable aspect of Pebble, which only became clear after all of this debugging, is that it uses finalizers in a few places to manage C memory that is referenced by pooled objects. It seems very possible that the additional uses of finalizers made this bug easier to hit.

Unfortunately, I don't have great suggestions for how to fix this. The cleanest solution is probably to make AcquireGlobal properly synchronize with all other threads'/goroutines' sync operations to ensure that it always observes a consistent global clock state. There will likely be some kind of performance impact from such a change though, as we'd now need to acquire a global read-lock on every other sync operation.

cc. @dvyukov since, in all likelihood, you're going to be the one taking a look at this.

@ianlancetaylor ianlancetaylor added this to the Go1.16 milestone May 21, 2020
@odeke-em odeke-em changed the title race: inconsistent snapshot of vector clocks during finalizer sync leads to false positive runtime/race: inconsistent snapshot of vector clocks during finalizer sync leads to false positive May 21, 2020
@dvyukov
Copy link
Member

@dvyukov dvyukov commented May 22, 2020

Hi @nvanbenschoten,

This is quite interesting. And thanks for the super detailed description and debugging.

I see the bug and I can reproduce it with the following unit test:

func TestIssue39186(t *testing.T) {
	var data uint64
	ready := make(chan bool, 1)
	finalized := make(chan bool, 1)
	var wg sync.WaitGroup
	wg.Add(2)
	runtime.SetFinalizer(new([10]int), func(*[10]int) {
		wg.Done()
		finalized <- true
	})
	runtime.GC()
	go func() {
		data = 1
		ready <- true
	}()
	go func() {
		<-ready
		<-finalized
		wg.Done()
	}()
	wg.Wait()
	if data != 1 {
		t.Fatal("no data")
	}
}

Both as the race:

==================
WARNING: DATA RACE
Read at 0x00c0000161b8 by goroutine 79:
  runtime/race_test.TestIssue39186()
      /src/go/src/runtime/race/race_test.go:275 +0x223
  testing.tRunner()
      /src/go/src/testing/testing.go:991 +0x1eb

Previous write at 0x00c0000161b8 by goroutine 8:
  runtime/race_test.TestIssue39186.func2()
      /src/go/src/runtime/race/race_test.go:261 +0x38

Goroutine 79 (running) created at:
  testing.(*T).Run()
      /src/go/src/testing/testing.go:1042 +0x660
  testing.runTests.func1()
      /src/go/src/testing/testing.go:1284 +0xa6
  testing.tRunner()
      /src/go/src/testing/testing.go:991 +0x1eb
  testing.runTests()
      /src/go/src/testing/testing.go:1282 +0x527
  testing.(*M).Run()
      /src/go/src/testing/testing.go:1199 +0x2ff
  main.main()
      _testmain.go:66 +0x223

Goroutine 8 (finished) created at:
  runtime/race_test.TestIssue39186()
      /src/go/src/runtime/race/race_test.go:260 +0x1a7
  testing.tRunner()
      /src/go/src/testing/testing.go:991 +0x1eb
==================

and as violation of your check:

/tmp/go-stress744268477
FATAL: ThreadSanitizer CHECK failed: ./gotsan.cpp:568 "((clk_[i])) <= ((ce.epoch))" (0x4006, 0x4003)

/tmp/go-stress622856440
FATAL: ThreadSanitizer CHECK failed: ./gotsan.cpp:568 "((clk_[i])) <= ((ce.epoch))" (0xa006, 0xa005)

But it's not easy to trigger still, I used:

$ go test -c -race ./runtime/race && nice stress -p=64 ./race.test -test.run=TestIssue39186 -test.count=100

Thinking what's the best way to fix this.

@dvyukov
Copy link
Member

@dvyukov dvyukov commented May 22, 2020

Besides fixing __tsan_finalizer_goroutine itself, there are some interesting options:

  1. Remove __tsan_finalizer_goroutine. I think the guaranteed semantics are (or should be):
  • SetFinalizer happens-before the finalizer execution
  • KeepAlive happens-before the finalizer execution (?)
  • any explicit synchronization in user code involving the finalizer
    So we could remove __tsan_finalizer_goroutine and instead do more traditional release/acquire in SetFinalizer/KeepAlive.
    FWIW Java has similar "weak" guarantees for finalizers (for good reasons), for all the gory details see:
    https://www.hboehm.info/misc_slides/java_finalizers.pdf
  1. Run __tsan_finalizer_goroutine during STW in GC before it makes the next batch of finalizer ready to run. This will naturally serialize __tsan_finalizer_goroutine execution and should provide enough synchronization.
@nvanbenschoten
Copy link
Contributor Author

@nvanbenschoten nvanbenschoten commented May 22, 2020

Thanks for taking a look @dvyukov. I'm very glad to see that you were able to find a smaller reproduction.

Run __tsan_finalizer_goroutine during STW in GC before it makes the next batch of finalizer ready to run. This will naturally serialize __tsan_finalizer_goroutine execution and should provide enough synchronization.

This is how I assumed things were working for a while. Making this change would simplify the interactions here significantly, along with the implementation of __tsan_finalizer_goroutine / __tsan::AcquireGlobal because that function would no longer need to make any effort to handle concurrently running threads.

The downside of this approach is that it pushes the burden of synchronization on to the caller of AcquireGlobal. I don't know exactly how you view the programmatic interface of the ThreadSanitizer runtime library (e.g. is this a specialized hook for Go or something more general?), but this change would complicate that interface.

There's also the concern that iterating over each active thread is not instantaneous, as evidenced by the fact that we were able to race with the process. That indicates that this might be too expensive of an operation to add to stop the world pauses in GC, even if this will only fire when the race detector is enabled. Then again, the race detector already has such large runtime overhead that this may be in the noise.

So we could remove __tsan_finalizer_goroutine and instead do more traditional release/acquire in SetFinalizer/KeepAlive

I prefer an option like this because it moves the causality tracking closer to the minimum guarantees provided by the memory model.

That said, I don't think I understand how this would work. It seems like we'd need to establish a happens-before edge with any time that the object was still reachable, even if the object itself wasn't being directly manipulated. For instance, we might have a slice of objects on the heap. I would expect that any memory access I perform provably before that slice has become unreachable will have a happens-before edge to the finalizers of each of these objects. I don't know how that's possible to do efficiently because it's not even clear which memory address to synchronize on, and I assume that's why __tsan_finalizer_goroutine was added in the first place. That allows us to use the GC's understanding of reachability to establish these happens-before edges, even if this does result in significant over-synchronization.

FWIW Java has similar "weak" guarantees for finalizers (for good reasons), for all the gory details see: https://www.hboehm.info/misc_slides/java_finalizers.pdf

This was an enlightening read. I hope we don't go too far in that direction. The need to synchronize internal memory accesses in a single-threaded application because of the subtle reachability rules seems unnecessarily antagonistic.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented May 22, 2020

I don't know what to do here, I just want to comment that runtime.KeepAlive is in effect a compiler intrinsic. It just changes the lifetime information for the object. If we want runtime.KeepAlive to do some sort of synchronization with the race detector, then it seems to me that we want that same synchronization to occur on any access to the object, since any access will keep the object alive.

When I look at @dvyukov 's test case, it seems that the assignment to data happens before the send on ready happens before the receive from ready happens before the call to wg.Done happens before the return from wg.Wait happens before the read of data. So I'm not sure what the finalizer has to do with this. But I don't actually understand what the problem is.

@nvanbenschoten
Copy link
Contributor Author

@nvanbenschoten nvanbenschoten commented May 22, 2020

So I'm not sure what the finalizer has to do with this. But I don't actually understand what the problem is.

The finalizer is key here because its call to __tsan_finalizer_goroutine is faulty in that it establishes a happens-before edge from all other goroutines to the finalizer goroutine. However, it does so in an unsynchronized manner which allows the resulting vector clock in the finalizer goroutine to enter an inconsistent state. For instance, consider the following sequence of events:

g1: [1, 0], g2: [0, 1]
-- __tsan_finalizer_goroutine reads g1's local clock, observes 1
-- g1 ticks clock & reads from data @ [2, 0]
g1: [3, 0], g2: [1, 1] -- g2 acquires              (+ ticks clock)
g1: [3, 0], g2: [3, 2] -- g1 releases, g2 acquires (+ ticks clock). last_acquire_ = 2.
g1: [3, 0], g2: [3, 3] -- g2 ticks clock
-- __tsan_finalizer_goroutine reads g2's local clock, observes 3

g_finalizer: [1, 3] 

If this inconsistent clock state is merged into a synchronization element (the WaitGroup in the test case), then a future release operation which is necessary to establish the "call to wg.Done happens before the return from wg.Wait happens" link may be ignored due to this optimization. Continuing our example:

s_wg: [0, 0]
s_wg: [1, 3] -- g_finalizer releases into s_wg when calling wg.Done
s_wg: [1, 3] -- g2 releases into s_wg when calling wg.Done
-- g2's clock is [3, 3], but causality is not propagated because last_acquire_ (2) < 3.

g3: [0, 0]
g3: [1, 3] -- g3 acquires from s_wg when wg.Wait completes
-- g3 writes to data @ [1, 3]. [2, 0] does not happen-before [1, 3]. Data race detected!

So we could disable the optimization, but more likely, we want to prevent these inconsistent vector clock states from becoming possible.

@dvyukov
Copy link
Member

@dvyukov dvyukov commented May 23, 2020

I don't know exactly how you view the programmatic interface of the ThreadSanitizer runtime library (e.g. is this a specialized hook for Go or something more general?), but this change would complicate that interface.

It's pretty much tied to it's few real users. We don't try to solve all of the world's problems with it.
We can change it or introduce a new function as necessary. I've notified Java folks as well, most likely the Java interface is equally affected:
https://mail.openjdk.java.net/pipermail/tsan-dev/2020-May/000429.html

@dvyukov
Copy link
Member

@dvyukov dvyukov commented May 23, 2020

That said, I don't think I understand how this would work. It seems like we'd need to establish a happens-before edge with any time that the object was still reachable, even if the object itself wasn't being directly manipulated. For instance, we might have a slice of objects on the heap. I would expect that any memory access I perform provably before that slice has become unreachable will have a happens-before edge to the finalizers of each of these objects.

No. That's the point. We don't need to explain this to tsan, because this synchronization does not exist. Check out the presentation again ;)

Finalizers introduce concurrency
   Synchronize!
Finalizers can run earlier than you think!
   Synchronize!

We need to explain to tsan only synchronization that actually does exist in real life (that's why I explicitly mentioned only SetFinalizer/KeepAlive). Consider we have:

*p = foo
*y = bar

Now consider that the second store make an object unreachable and the first store stores to something transitively reachable from that object.
To provide the guarantee you are assuming is there, compiler would need to enforce program order for all stores (no optimizations possible). Moreover, with a better implementation (e.g. maybe the new async preemption is enough), compiler would also need to inject memory barriers between all stores as well to ensure that if a concurrent GC schedules a finalizer, it will observe all stores properly. This is unbearably expensive. That's why there is simply no such guarantee. In Java it's very explicit, in Go it does not seem to be documented that well, but I believe Go simply can't realistically provide such a guarantee.

@dvyukov
Copy link
Member

@dvyukov dvyukov commented May 23, 2020

I don't know what to do here, I just want to comment that runtime.KeepAlive is in effect a compiler intrinsic. It just changes the lifetime information for the object. If we want runtime.KeepAlive to do some sort of synchronization with the race detector, then it seems to me that we want that same synchronization to occur on any access to the object, since any access will keep the object alive.

See my reply above:
#39186 (comment)

runtime.KeepAlive should provide such synchronization to be useful.
Today nothing special may be required (no-op on implementation) level because GC includes STW. But I believe on semantic level, KeepAlive should synchronize with the finalizer.

Otherwise os package use of finalizers would be broken (and majority of code that uses finalizers):

func (f *File) readdirnames(n int) (names []string, err error) {
	if f.dirinfo == nil {
		f.dirinfo = new(dirInfo)
		f.dirinfo.buf = make([]byte, blockSize)
...
			runtime.KeepAlive(f)
func (file *file) close() error {
	if file.dirinfo != nil {
		file.dirinfo.close()
	}

Consider there is no synchronization between KeepAlive and the finalizer. Now the finalizer is not guaranteed to observe file.dirinfo != nil even though it was set. And if it observes file.dirinfo != nil, it may not properly observe contents of *file.dirinfo.

Now re "same synchronization to occur on any access to the object" part. This is too costly to provide. Simply infeasible. See the Java paper. Such synchronization should not be guaranteed.
But since we have KeepAlive, it actually solves the problem. If one does:

		f.dirinfo = new(dirInfo)
	        runtime.KeepAlive(f)

This now guarantees that if this code run, the finalizer will properly observe f.dirinfo.
And if we ever need any actual hardware synchronization (e.g. when GC becomes even more concurrent), KeepAlive will be the place to add it . So it may be the natural place to add race detection annotation as well.

@dvyukov
Copy link
Member

@dvyukov dvyukov commented May 23, 2020

Proposed fix:
https://reviews.llvm.org/D80474
@nvanbenschoten do you mind reviewing it? I think you are the best qualified to meaningfully review the fix logic.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented May 23, 2020

I agree that it seems infeasible to synchronize between any access of the object and the finalizer.

But semantically speaking, it seems to me that any access to the object really does happen before the execution of the finalizer, in exactly the same way that any call to runtime.KeepAlive happens before the execution of the finalizer. If the last use of f is to set the dirinfo field, then the finalizer must reliably see the value set in the dirinfo field.

@dvyukov
Copy link
Member

@dvyukov dvyukov commented May 24, 2020

If the last use of f is to set the dirinfo field, then the finalizer must reliably see the value set in the dirinfo field.

The question is: is it guaranteed to see *dirinfo values? That is, all indirectly accessible values.
If not, then it becomes very hard to use finalizers correctly (fd.dirinfo is broken).
If yes, it would impose pretty hard limitations on compilation (can't say about gc, but gcc and llvm don't provide them) and potentially hardware synchronization in future.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented May 24, 2020

Personally, I think finalizers are required to see the *dirinfo values.

I think that implies that a compiler can't consider a variable to be dead until all memory stores made before the last access of that variable have been emitted. That in itself is not a major restriction; it just adjusts the liveness map of the function.

At execution time it implies that there must be a synchronization barrier between an object being considered dead and the object's finalizer being run. Fortunately I think we already have that in the current garbage collector, via the brief stop-the-world phases. It's true that we'll have to keep this in mind as we keep adjusting the garbage collector. But since we make no promises as to how soon we will run a finalizer, I don't think it's a hard promise to keep.

No doubt I am missing some things.

@dvyukov
Copy link
Member

@dvyukov dvyukov commented May 25, 2020

it just adjusts the liveness map of the function.

Are you sure it's enough? I think it must restrict order of operations themselves. Even if we have a pointer slot live, if it already contains nil, it won't help.

@randall77
Copy link
Contributor

@randall77 randall77 commented May 25, 2020

I don't see how this could be a problem. How could there be a point in the program where we still have a pending write to an object, but we no longer have a pointer to it? I could see the hardware doing that, but the compiler can't.

@dvyukov
Copy link
Member

@dvyukov dvyukov commented May 25, 2020

I don't see how this could be a problem. How could there be a point in the program where we still have a pending write to an object, but we no longer have a pointer to it? I could see the hardware doing that, but the compiler can't.

Not the object, but something transitively reachable from the object that finalizer uses.
E.g. *f.dirinfo here:

func (f *File) readdirnames(n int) (names []string, err error) {
	// If this file has no dirinfo, create one.
	if f.dirinfo == nil {
		f.dirinfo = new(dirInfo)
		// The buffer must be at least a block long.
		f.dirinfo.buf = make([]byte, blockSize)
	}
	d := f.dirinfo

	size := n
	if size <= 0 {
		size = 100
		n = -1
	}

	names = make([]string, 0, size) // Empty with room to grow.
	for n != 0 {
		// Refill the buffer if necessary
		if d.bufp >= d.nbuf {
			d.bufp = 0
			var errno error
			d.nbuf, errno = f.pfd.ReadDirent(d.buf)
			runtime.KeepAlive(f)
			if errno != nil {
				return names, wrapSyscallError("readdirent", errno)
			}
			if d.nbuf <= 0 {
				break // EOF
			}
		}

		// Drain the buffer
		var nb, nc int
		nb, nc, names = syscall.ParseDirent(d.buf[d.bufp:d.nbuf], n, names)
		d.bufp += nb
		n -= nc
	}
	if n >= 0 && len(names) == 0 {
		return names, io.EOF
	}
	return names, nil
}

func (file *file) close() error {
	if file == nil {
		return syscall.EINVAL
	}
	if file.dirinfo != nil {
		file.dirinfo.close()
	}
@randall77
Copy link
Contributor

@randall77 randall77 commented May 25, 2020

Just so I understand, you're worried about:

  f := new(...)
  runtime.SetFinalizer(f, ...)
  p := new(...)
  f.p = p
  p.x = 7
  runtime.KeepAlive(f) // last use of f

You're worried that the finalizer might not see f.p.x == 7?

I don't think that's a problem in the current compiler, as runtime.KeepAlive is modeled as a write operation, and will not be reordered with other writes. (GC does not reorder writes at all.)

Without the runtime.KeepAlive, this code might not see f.p.x == 7, and we should report a race, because the finalizer could happen before the p.x = 7 statement.

@dvyukov
Copy link
Member

@dvyukov dvyukov commented May 26, 2020

Exactly. runtime.KeepAlive is what provides synchronization with the finalizer and what makes the code correct. I was worried about the claim that any access to the object provides the same synchronization as KeepAlive e.g. in:

  f := new(...)
  runtime.SetFinalizer(f, ...)
  p := new(...)
  p.x = 7
  f.p = p

gc is not the only compiler that compiles Go code.

@randall77
Copy link
Contributor

@randall77 randall77 commented May 26, 2020

f := new(...)
runtime.SetFinalizer(f, ...)
p := new(...)
p.x = 7
f.p = p

Yes, if the compiler reorders the last two writes, and it does not extend the lifetime of f until after the now-last p.x = 7 statement, then we have a problem.
So store-reordering optimizations are certainly possible, they just need to keep some sense of the pre-reordered liveness map around.

f := new(...)
runtime.SetFinalizer(f, ...)
p := new(...)
f.p = p
// f can't die here, even though it looks like it does
p.x = 7
// f has to die (at least) here

Is that saying the same thing that Ian said? I think so. I hope so.

As Ian mentioned, reordering by the hardware shouldn't matter - the stop-the-world at the start of the GC (and allocating black during GC?) should save us.

@dvyukov
Copy link
Member

@dvyukov dvyukov commented May 26, 2020

they just need to keep some sense of the pre-reordered liveness map around.

Not just maps, but also values. Consider the example here:
#39186 (comment)
If we have a value live, but it's nil, it doesn't help much. Which effectively means no reordering, no?

@randall77
Copy link
Contributor

@randall77 randall77 commented May 26, 2020

Yes, you need the value of f live somewhere where I noted. Not the variable f.
The variable could have been overwritten, sure.
So when reordering stores, the lifetime of the values of the pointers being stored to cannot shrink. The value of each pointer must live on somewhere (not necessarily in f, just somewhere marked live) until at least its original source location.

@dvyukov
Copy link
Member

@dvyukov dvyukov commented May 26, 2020

But doesn't it effectively mean no reordering? If for:

*p = foo
*y = bar

compiler would need to generate on reordering:

oldy := *y
*y = bar
*p = foo
_ = oldy

It looks reasonable to simply prohibit write reordering at this point (probably won't be profitable: more instructions, bringing cache line in potentially shared state, more registers consumed).

And we don't do this today, right? Which means we do not give this guarantee. Which means we can as well never give it because there is no code that relies on it (theoretically).

Also brings interesting concerns for write-only memory. In such cases we cannot preserve the value, and the write is the end of life.

@randall77
Copy link
Contributor

@randall77 randall77 commented May 26, 2020

*p = foo
*y = bar

Ah, now I see what the issue is. The last pointer to the object being finalized is overwritten, not just dropped. It might not live in a live map (i.e. be on the stack) anywhere where we can extend its lifetime.

So we execute *y = bar, making our object unreachable, then we get interrupted and GC is run. GC decides to schedule the finalizer and it runs before the goroutine that is about to execute *p = foo ever gets scheduled.

This would be hard to do in the current garbage collector - you have to time it just right or the write barrier will prevent collection of the overwritten pointer. But not impossible, I guess.

The preemptive scheduler makes this worse by making everywhere a safe point. Before that, only call sites are safe points so we need only prevent reordering writes across calls.

So yes, it looks like this would prevent reordering writes where at least one of them is a pointer write? Or we have to mark such reorderings as unsafe points, perhaps?

dvyukov added a commit to llvm/llvm-project that referenced this issue May 27, 2020
Add ThreadClock:: global_acquire_ which is the last time another thread
has done a global acquire of this thread's clock.

It helps to avoid problem described in:
golang/go#39186
See test/tsan/java_finalizer2.cpp for a regression test.
Note the failuire is _extremely_ hard to hit, so if you are trying
to reproduce it, you may want to run something like:
$ go get golang.org/x/tools/cmd/stress
$ stress -p=64 ./a.out

The crux of the problem is roughly as follows.
A number of O(1) optimizations in the clocks algorithm assume proper
transitive cumulative propagation of clock values. The AcquireGlobal
operation may produce an inconsistent non-linearazable view of
thread clocks. Namely, it may acquire a later value from a thread
with a higher ID, but fail to acquire an earlier value from a thread
with a lower ID. If a thread that executed AcquireGlobal then releases
to a sync clock, it will spoil the sync clock with the inconsistent
values. If another thread later releases to the sync clock, the optimized
algorithm may break.

The exact sequence of events that leads to the failure.
- thread 1 executes AcquireGlobal
- thread 1 acquires value 1 for thread 2
- thread 2 increments clock to 2
- thread 2 releases to sync object 1
- thread 3 at time 1
- thread 3 acquires from sync object 1
- thread 1 acquires value 1 for thread 3
- thread 1 releases to sync object 2
- sync object 2 clock has 1 for thread 2 and 1 for thread 3
- thread 3 releases to sync object 2
- thread 3 sees value 1 in the clock for itself
  and decides that it has already released to the clock
  and did not acquire anything from other threads after that
  (the last_acquire_ check in release operation)
- thread 3 does not update the value for thread 2 in the clock from 1 to 2
- thread 4 acquires from sync object 2
- thread 4 detects a false race with thread 2
  as it should have been synchronized with thread 2 up to time 2,
  but because of the broken clock it is now synchronized only up to time 1

The global_acquire_ value helps to prevent this scenario.
Namely, thread 3 will not trust any own clock values up to global_acquire_
for the purposes of the last_acquire_ optimization.

Reviewed-in: https://reviews.llvm.org/D80474
Reported-by: nvanbenschoten (Nathan VanBenschoten)
@dvyukov
Copy link
Member

@dvyukov dvyukov commented May 27, 2020

FTR tsan fix is merged in llvm/llvm-project@4408eee

@nvanbenschoten
Copy link
Contributor Author

@nvanbenschoten nvanbenschoten commented May 28, 2020

Thanks for merging the fix @dvyukov. I see that this issue is marked under the go1.16 milestone. Is it too late in the 1.15 release cycle to rebuild the .syso files in src/runtime/race to pick up this patch?

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented May 28, 2020

As far as I know this is not a new problem in 1.15, so at this point in the release process the fix should wait until 1.16. Sorry.

@nvanbenschoten
Copy link
Contributor Author

@nvanbenschoten nvanbenschoten commented May 28, 2020

Got it, thanks for the update. Then as far as I'm concerned, this issue can be closed. I'll defer to you on whether we should keep it open to track the proposed changes to the role of finalizers in the memory model.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented May 28, 2020

Seems like we should at least keep this open to rebuild the .syso files for the 1.16 release.

rjnw added a commit to rjnw/f18-llvm-project that referenced this issue May 29, 2020
Add ThreadClock:: global_acquire_ which is the last time another thread
has done a global acquire of this thread's clock.

It helps to avoid problem described in:
golang/go#39186
See test/tsan/java_finalizer2.cpp for a regression test.
Note the failuire is _extremely_ hard to hit, so if you are trying
to reproduce it, you may want to run something like:
$ go get golang.org/x/tools/cmd/stress
$ stress -p=64 ./a.out

The crux of the problem is roughly as follows.
A number of O(1) optimizations in the clocks algorithm assume proper
transitive cumulative propagation of clock values. The AcquireGlobal
operation may produce an inconsistent non-linearazable view of
thread clocks. Namely, it may acquire a later value from a thread
with a higher ID, but fail to acquire an earlier value from a thread
with a lower ID. If a thread that executed AcquireGlobal then releases
to a sync clock, it will spoil the sync clock with the inconsistent
values. If another thread later releases to the sync clock, the optimized
algorithm may break.

The exact sequence of events that leads to the failure.
- thread 1 executes AcquireGlobal
- thread 1 acquires value 1 for thread 2
- thread 2 increments clock to 2
- thread 2 releases to sync object 1
- thread 3 at time 1
- thread 3 acquires from sync object 1
- thread 1 acquires value 1 for thread 3
- thread 1 releases to sync object 2
- sync object 2 clock has 1 for thread 2 and 1 for thread 3
- thread 3 releases to sync object 2
- thread 3 sees value 1 in the clock for itself
  and decides that it has already released to the clock
  and did not acquire anything from other threads after that
  (the last_acquire_ check in release operation)
- thread 3 does not update the value for thread 2 in the clock from 1 to 2
- thread 4 acquires from sync object 2
- thread 4 detects a false race with thread 2
  as it should have been synchronized with thread 2 up to time 2,
  but because of the broken clock it is now synchronized only up to time 1

The global_acquire_ value helps to prevent this scenario.
Namely, thread 3 will not trust any own clock values up to global_acquire_
for the purposes of the last_acquire_ optimization.

Reviewed-in: https://reviews.llvm.org/D80474
Reported-by: nvanbenschoten (Nathan VanBenschoten)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants
You can’t perform that action at this time.