
runtime: apparent false-positive race report for a buffered channel after CL 220419 #42598

Closed

bcmills opened this issue Nov 13, 2020 · 30 comments

@bcmills (Member) commented Nov 13, 2020

I got the following race report on CL 258758.

(https://storage.googleapis.com/go-build-log/04d98a9c/linux-amd64-race_596e00e8.log)

--- FAIL: TestScript (0.11s)
    --- FAIL: TestScript/mod_retract (0.34s)
        script_test.go:211: 
            # Populate go.sum. (0.087s)
            # 'go list pkg' does not report an error when a retracted version is used. (0.069s)
            # Nor does 'go build'. (0.000s)
            # Neither 'go list' nor 'go build' should download go.mod from the version
            # that would list retractions. (0.000s)
            # Importing a package from a module with a retracted latest version will
            # select the latest non-retracted version. (0.175s)
            > go get -d ./use_self_prev
            [stderr]
            go: downloading example.com v1.0.0
            go: downloading example.com/retract/self/prev v1.1.0
            go: downloading example.com/retract v1.1.0
            ==================
            WARNING: DATA RACE
            Read at 0x00c0002c0ce8 by main goroutine:
              cmd/go/internal/modget.reportRetractions()
                  /workdir/go/src/cmd/go/internal/modget/get.go:550 +0x224
              cmd/go/internal/modget.runGet()
                  /workdir/go/src/cmd/go/internal/modget/get.go:490 +0x931
              main.main()
                  /workdir/go/src/cmd/go/main.go:195 +0xb94
            
            Previous write at 0x00c0002c0ce8 by goroutine 76:
              cmd/go/internal/modload.addRetraction()
                  /workdir/go/src/cmd/go/internal/modload/build.go:132 +0x424
              cmd/go/internal/modload.ListModules.func1.1()
                  /workdir/go/src/cmd/go/internal/modload/list.go:40 +0x84
            
            Goroutine 76 (finished) created at:
              cmd/go/internal/modload.ListModules.func1()
                  /workdir/go/src/cmd/go/internal/modload/list.go:32 +0xe4
              cmd/go/internal/modload.ListModules()
                  /workdir/go/src/cmd/go/internal/modload/list.go:46 +0x1fa
              cmd/go/internal/modget.reportRetractions()
                  /workdir/go/src/cmd/go/internal/modget/get.go:547 +0x184
              cmd/go/internal/modget.runGet()
                  /workdir/go/src/cmd/go/internal/modget/get.go:490 +0x931
              main.main()
                  /workdir/go/src/cmd/go/main.go:195 +0xb94
            ==================
            ==================
            WARNING: DATA RACE
            Read at 0x00c000058ac0 by main goroutine:
              cmd/go/internal/modget.reportRetractions()
                  /workdir/go/src/cmd/go/internal/modget/get.go:556 +0x2b9
              cmd/go/internal/modget.runGet()
                  /workdir/go/src/cmd/go/internal/modget/get.go:490 +0x931
              main.main()
                  /workdir/go/src/cmd/go/main.go:195 +0xb94
            
            Previous write at 0x00c000058ac0 by goroutine 76:
              cmd/go/internal/modload.checkRetractions()
                  /workdir/go/src/cmd/go/internal/modload/modfile.go:163 +0x4a4
              cmd/go/internal/modload.addRetraction()
                  /workdir/go/src/cmd/go/internal/modload/build.go:126 +0x104
              cmd/go/internal/modload.ListModules.func1.1()
                  /workdir/go/src/cmd/go/internal/modload/list.go:40 +0x84
            
            Goroutine 76 (finished) created at:
              cmd/go/internal/modload.ListModules.func1()
                  /workdir/go/src/cmd/go/internal/modload/list.go:32 +0xe4
              cmd/go/internal/modload.ListModules()
                  /workdir/go/src/cmd/go/internal/modload/list.go:46 +0x1fa
              cmd/go/internal/modget.reportRetractions()
                  /workdir/go/src/cmd/go/internal/modget/get.go:547 +0x184
              cmd/go/internal/modget.runGet()
                  /workdir/go/src/cmd/go/internal/modget/get.go:490 +0x931
              main.main()
                  /workdir/go/src/cmd/go/main.go:195 +0xb94
            ==================
            go: warning: example.com/retract@v1.0.0-bad is retracted: bad
            go: run 'go get example.com/retract@latest' to switch to the latest unretracted version
            Found 2 data race(s)
            [exit status 66]
            FAIL: testdata/script/mod_retract.txt:23: unexpected command failure
            
FAIL
FAIL	cmd/go	151.822s

The reported race is for a function that uses an N-buffered channel as a semaphore. The function sends to the channel (from the main goroutine) to acquire a semaphore token, and receives from the channel (in a background goroutine) to release the token. Before the function returns, it sends N additional tokens to the channel to ensure that all of the background goroutines have finished.

I believe that this pattern is explicitly covered by the memory model:

The kth receive on a channel with capacity C happens before the k+Cth send from that channel completes.

This is the first race we've seen reported for this code, and the timing w.r.t. CL 220419, which changes the semantics of channels under the race detector, seems suspicious to me.

@bcmills (Member, Author) commented Nov 13, 2020

This comment has been hidden.

@dvyukov (Member) commented Nov 14, 2020

I've spent some time analyzing the effect of CL 220419 at the time and I concluded it's perfectly in line with MM.

I believe that this pattern is explicitly covered by the memory model:
The kth receive on a channel with capacity C happens before the k+Cth send from that channel completes.

This is true, but this relation does not seem to be transitive. Namely, if there are more than C operations, the kth receive happens before the k+Cth send, but it does not happen before, say, the 2k+Cth send. So when the code does:

	for n := cap(sem); n > 0; n-- {
		sem <- token{}
	}

it only synchronizes with the completion of the last cap(sem) receives, not with all of the earlier ones.

@cherrymui

This comment has been hidden.

@ianlancetaylor (Contributor) commented Nov 14, 2020

@dvyukov If I understand you correctly, that sounds like a bug in the memory model. Perhaps the memory model ought to say that the kth receive on a channel with capacity C happens before the k+Cth+N send from the channel completes, where N is any non-negative integer. There is no necessary relationship between sends, of course. But with capacity C, a receive may happen concurrently with up to (but not including) C sends, but it happens before C sends complete, or C+1 sends complete, or C+2 sends complete, etc.

@bcmills (Member, Author) commented Nov 15, 2020

@dvyukov, in this case we should be able to prove that all previous receives happen before the final send by induction, because all sends on the channel occur in the same goroutine.

  • The kth receive happens before the k+Cth send because “[t]he kth receive on a channel with capacity C happens before the k+Cth send from that channel completes.”

  • In this program all sends are within a single goroutine, so the k+Cth send happens before the i+Cth send for all i ≥ k because “[w]ithin a single goroutine, the happens-before order is the order expressed by the program.”

  • “Happens before” is transitive, so in this program the kth receive happens before the i+Cth send for all i ≥ k.

  • In this program there are exactly N+C sends and exactly N receives, so each receive happens before the last send.

@bcmills (Member, Author) commented Nov 15, 2020

@ianlancetaylor, I think the terms “kth receive” and “k+Cth send” are confusing and perhaps misleading here. Identifying the “kth receive” requires a total order among the receives, but the point of HB is that we may only have a partial order.

A more precise definition might be more like: “If a send on a channel with capacity C happens before k ≥ C other sends on the channel, then the receive corresponding to the first send happens before the last of those k+1 sends completes.”

@dvyukov (Member) commented Nov 15, 2020

@dvyukov If I understand you correctly, that sounds like a bug in the memory model. Perhaps the memory model ought to say that the kth receive on a channel with capacity C happens before the k+Cth+N send from the channel completes, where N is any non-negative integer. There is no necessary relationship between sends, of course. But with capacity C, a receive may happen concurrently with up to (but not including) C sends, but it happens before C sends complete, or C+1 sends complete, or C+2 sends complete, etc.

Good question.
The examples we had in CL 258758 did not look like something we needed to support, and the race detector indeed produced false negatives on them. This "wait for all" example did not come up at the time.

happens before the k+Cth+N

This feels a bit fishy to me. It seems we are trying to compensate for broken transitivity, where fixing transitivity would be cleaner.
I am thinking of:

-The kth receive on a channel with capacity C happens before the k+Cth send from that channel completes.
-This rule generalizes the previous rule to buffered channels.

+The kth receive completion on a buffered channel with capacity C happens before the k+Cth send.

It no longer generalizes the rule for unbuffered channels: the roles of send/receive and send completion/receive completion are swapped. But breaking the generalization by introducing new custom rules also feels fishy to me :)
However, my understanding was always that the odd split into "operation" and "operation completion" was introduced specifically for unbuffered channels, to break an HB cycle. Without that split, the current rules would work perfectly for buffered channels (and support the code in question).

@dvyukov (Member) commented Nov 15, 2020

@dvyukov, in this case we should be able to prove that all previous receives happen before the final send by induction, because all sends on the channel occur in the same goroutine.

  • The kth receive happens before the k+Cth send because “[t]he kth receive on a channel with capacity C happens before the k+Cth send from that channel completes.”
  • In this program all sends are within a single goroutine, so the k+Cth send happens before the i+Cth send for all i ≥ k because “[w]ithin a single goroutine, the happens-before order is the order expressed by the program.”
  • “Happens before” is transitive, so in this program the kth receive happens before the i+Cth send for all i ≥ k.
  • In this program there are exactly N+C sends and exactly N receives, so each receive happens before the last send.

Good point.
With the changes Ian or I proposed above, it would work without relying on the fact that all sends happen in the same goroutine; the channel itself would provide the necessary synchronization. But if we take into account that all sends happen in the same goroutine, then it seems it should work under the current rules as well:

[image: happens-before diagram]

Here the channel capacity is 2, and all accesses d0–d3 happen before the use.

@dfava where is the flaw now?

Also, now I wonder if we are dealing with one issue or two :) There seems to be a bug in the race detector implementation, but do we also want to fix the memory model so that similar synchronization is provided even when the sends do not happen in the same goroutine? For ListModules it seems inherent that all sends happen in the same goroutine (?).

@bcmills (Member, Author) commented Nov 15, 2020

In this case it actually is important that the initial sends happen before the final loop, so I would be wary about strengthening the model too much.

If the initial sends were moved into the background goroutines, it would be possible for the final channel-fill loop to execute before the background goroutines, causing them to leak (blocked on their sends) instead of executing. That wouldn't be a data race, but only because the mutations would no longer happen at all!

@bcmills (Member, Author) commented Nov 15, 2020

@dvyukov, also note that the phrasing I suggested in #42598 (comment) does continue to generalize to the unbuffered case.

@aclements (Member) commented Nov 19, 2020

Given that we're coming up against the beta, should we revert CL 220419?

@dvyukov (Member) commented Nov 20, 2020

Yes, I think we need to roll back for now and add this pattern as a new race test. False positives are much worse than false negatives.

@gopherbot commented Nov 20, 2020

Change https://golang.org/cl/271946 mentions this issue: Revert "runtime: swap the order of raceacquire() and racerelease()"

@ianlancetaylor (Contributor) commented Nov 20, 2020

I sent a revert CL: https://golang.org/cl/271946.

@dfava (Contributor) commented Nov 20, 2020

Sorry I wasn't receiving emails about this thread until the change was reverted just now. I haven't had a chance to read it, but I will soon and can comment.

@ianlancetaylor (Contributor) commented Nov 20, 2020

For what it's worth, the revert CL is not yet committed. A simple roll forward fix is still possible, but it would have to happen very very soon.

@dfava (Contributor) commented Nov 20, 2020

Is the failure intermittent (or "somewhat deterministic")? I'm not able to reproduce it on commit 5e58ae4.

There seems to have been some refactoring around modules: I can't find "reportRetractions" in the source tree. It looks like the function was recently removed; it used to live in src/cmd/go/internal/modget/get.go. Is that correct?

@bcmills , is the race still reported after this recent refactoring?

(I haven't had a chance to fully digest the discussion above; I'm just going piece by piece from the beginning. I agree with reverting the CL so we can drill down into this carefully. I doubt there is a "bug" in the memory model specification; more likely the bug is in my code, or in the code where the race is being reported.)

@dfava (Contributor) commented Nov 20, 2020

Also, I vaguely recall an earlier discussion about how the race detector treats channels with zero-size elements differently... Perhaps that special treatment matters now.

@dvyukov (Member) commented Nov 20, 2020

Also, I vaguely recall an earlier discussion about how the race detector treats channels with zero-size elements differently... Perhaps that special treatment matters now.

Good point!
Here is the address we use for synchronization:

// chanbuf(c, i) is pointer to the i'th slot in the buffer.
func chanbuf(c *hchan, i uint) unsafe.Pointer {
	return add(c.buf, uintptr(i)*uintptr(c.elemsize))
}

I assume that for struct{} it's always the same location for all elements.
I guess the racereleaseacquire fix does not work in that case.

I think we could do:

if c.elemsize != 0 {
	racereleaseacquire(qp)
} else {
	raceacquire(qp)
	racerelease(qp)
}

@dfava (Contributor) commented Nov 20, 2020

I agree, we could add a check for elemsize. It's nice and simple, and it doesn't force us to revisit the assumptions made about channels with elemsize==0 right now.

I'm happy that the memory model specification is fine as it is, and I also see that the code in ListModules is correct.

I also think that my patch is not wrong :) though it did expose the issue with elemsize==0, and, in hindsight, I should have accounted for that when testing.

I'll spend a bit of time thinking about what exactly happens when elemsize==0 interacts with racereleaseacquire().

@bcmills (Member, Author) commented Nov 20, 2020

@dfava

Is the failure intermittent (or "somewhat deterministic")? I'm not able to reproduce it on commit 5e58ae4.

The failure is intermittent.

There have so far been at least three failures in TryBots and one on the build dashboard (2020-11-14T17:24:37-92c732e/linux-amd64-race).

@dvyukov (Member) commented Nov 20, 2020

I'll spend a bit of time thinking about what exactly happens when elemsize==0 interacts with racereleaseacquire().

My understanding is that previously HB was purely cumulative, so the fact that the same address was shared between unrelated elements did not matter (we only got more false negatives). But racereleaseacquire is not cumulative: it always drops everything that was there before, so the sharing breaks.

@dfava (Contributor) commented Nov 20, 2020

I'll spend a bit of time thinking about what exactly happens when elemsize==0 interacts with racereleaseacquire().

My understanding is that previously HB was completely cumulative, so the fact that the same address was shared between unrelated elements did not matter (we only get more false negatives). But racereleaseacquire is not cumulative, it always drops everything that was there before, so sharing breaks.

You are absolutely right. Here it is in a bit more detail; perhaps it's worth documenting.
The code essentially does the following, where c is a channel with capacity 2:

T0                T1        T2
c <-         |           |
go T1        |  wr X     |
c <-         |  <- c     |
go T2        |           |  wr Y
             |           |  <- c
c <-         |           |
c <-         |           |

I'll write the two entries in the channel's buffer as B0 and B1.
Here is a picture of the race detector's state after T1 and T2 write to variables X and Y:

B0      B1      T0             T1      T2
{}      {}      {}             {wr X}  {wr Y}  => then T1 receives from c
{wr X}  {}      {}             {wr X}  {wr Y}  => then T2 receives from c
{wr X}  {wr Y}  {}             {wr X}  {wr Y}  => then T0 sends onto c
{}      {wr Y}  {wr X}         {wr X}  {wr Y}  => then T0 sends onto c
{}      {wr X}  {wr X, wr Y}   {wr X}  {wr Y}

In the end state, T0 is properly synchronized with T1 and T2: T0 "knows" of both T1's and T2's writes to X and Y.

However, we currently model channels with elemsize==0 using a single buffer entry, so there is only B0 and no B1. The picture then looks like:

B0        T0             T1      T2
{}        {}             {wr X}  {wr Y}        => then T1 receives from c
{wr X}    {}             {wr X}  {wr Y}        => then T2 receives from c
{wr Y}    {}             {wr X}  {wr X, wr Y}  => then T0 sends onto c
{}        {wr Y}         {wr X}  {wr X, wr Y}  => then T0 sends onto c
{wr Y}    {wr Y}         {wr X}  {wr X, wr Y}

In the end state now, T0 is unaware of one of the writes, so a race is flagged.

@dfava (Contributor) commented Nov 20, 2020

It seems like we are able to keep CL 220419 and add the check that Dmitry suggested.

Let me know if you would like me to do it, or if you prefer to do it yourselves (you are probably faster at it than me, given your familiarity with the flow and my lack of access to the buildbots, etc.). I'm happy either way!

@dvyukov (Member) commented Nov 20, 2020

Whoever is doing it, just please add some comments/references and tests ;)

@dfava (Contributor) commented Nov 20, 2020

Whoever is doing it, just please add some comments/references and tests ;)

:) Is it possible to force a specific scheduling of goroutines so we hit the scenario exactly? Maybe using internals/unsafe/something?

@dfava (Contributor) commented Nov 20, 2020

(I'm working on a patch)

@gopherbot commented Nov 20, 2020

Change https://golang.org/cl/271987 mentions this issue: runtime: check channel's elemsize before calling race detector

@dfava (Contributor) commented Nov 20, 2020

I submitted a patch that includes the elemsize check, documentation, and a test.
The test fails at top-of-tree (as expected) and passes with the patch applied (as expected).
