Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: race condition raised with parallel tests, panic(nil) and -race #64649

Closed
dolmen opened this issue Dec 11, 2023 · 15 comments
Closed

runtime: race condition raised with parallel tests, panic(nil) and -race #64649

dolmen opened this issue Dec 11, 2023 · 15 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. RaceDetector
Milestone

Comments

@dolmen
Copy link
Contributor

dolmen commented Dec 11, 2023

Go version

go version go1.21.5 darwin/arm64

What operating system and processor architecture are you using (go env)?

- Local (Apple M2)
- GitHub Actions (x86)

What did you do?

I have a testsuite for my github.com/dolmen-go/rendezvous package.

$ git clone https://github.com/dolmen-go/rendezvous.git
$ cd rendezvous
$ git checkout 44e68d01a216b3755c9ba5dc75835074c3e5e866
$ go version
go version go1.21.5 darwin/arm64
$ go test -v -race -run 'Test(One|Two)' "-gcflags=all=-trimpath=$(go env GOPATH)" github.com/dolmen-go/rendezvous
=== RUN   TestOne
=== PAUSE TestOne
=== RUN   TestTwo
=== PAUSE TestTwo
=== CONT  TestOne
=== CONT  TestTwo
==================
WARNING: DATA RACE
Read at 0x00c000076560 by goroutine 14:
  internal/godebug.(*Setting).IncNonDefault()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/internal/godebug/godebug.go:102 +0x58
  internal/godebug.(*Setting).IncNonDefault-fm()
      <autogenerated>:1 +0x34
  runtime.(*godebugInc).IncNonDefault()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/runtime/runtime.go:111 +0xd0
  github.com/dolmen-go/rendezvous.WaitAll.func1()
      src/github.com/dolmen-go/rendezvous/rendezvous.go:56 +0xec
  github.com/dolmen-go/rendezvous.WaitAll.func2()
      src/github.com/dolmen-go/rendezvous/rendezvous.go:57 +0x44

Previous write at 0x00c000076560 by goroutine 12:
  internal/godebug.New()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/internal/godebug/godebug.go:74 +0x40
  internal/godebug.newIncNonDefault()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/internal/godebug/godebug.go:203 +0x28
  runtime.(*godebugInc).IncNonDefault()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/runtime/runtime.go:108 +0x84
  github.com/dolmen-go/rendezvous.WaitAll.func1()
      src/github.com/dolmen-go/rendezvous/rendezvous.go:56 +0xec
  github.com/dolmen-go/rendezvous.WaitAll.func2()
      src/github.com/dolmen-go/rendezvous/rendezvous.go:57 +0x44

Goroutine 14 (running) created at:
  github.com/dolmen-go/rendezvous.WaitAll()
      src/github.com/dolmen-go/rendezvous/rendezvous.go:47 +0x210
  github.com/dolmen-go/rendezvous_test.TestTwo()
      src/github.com/dolmen-go/rendezvous/rendezvous_test.go:118 +0xe4
  testing.tRunner()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/testing/testing.go:1595 +0x1b0
  testing.(*T).Run.func1()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/testing/testing.go:1648 +0x40

Goroutine 12 (running) created at:
  github.com/dolmen-go/rendezvous.WaitAll()
      src/github.com/dolmen-go/rendezvous/rendezvous.go:47 +0x210
  github.com/dolmen-go/rendezvous_test.TestOne()
      src/github.com/dolmen-go/rendezvous/rendezvous_test.go:104 +0x180
  testing.tRunner()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/testing/testing.go:1595 +0x1b0
  testing.(*T).Run.func1()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/testing/testing.go:1648 +0x40
==================
=== NAME  TestOne
    testing.go:1465: race detected during execution of test
--- FAIL: TestOne (0.00s)
=== NAME  TestTwo
    testing.go:1465: race detected during execution of test
--- FAIL: TestTwo (0.00s)
FAIL
exit status 1
FAIL	github.com/dolmen-go/rendezvous	0.104s

Some characteristics that might be relevant:

  • The rendezvous package launches goroutines in its tests.
  • The rendezvous testsuite uses parallel testing at top testing level:
func TestOne(t *testing.T) {
        t.Parallel()
        // ...
}

func TestTwo(t *testing.T) {
        t.Parallel()
        // ...
}

This issue is also visible on GitHub Actions: https://github.com/dolmen-go/rendezvous/actions/runs/7147035848/job/19465833264#step:4:32

What did you expect to see?

Test success like with previous Go versions.

What did you see instead?

Race condition in runtime:

WARNING: DATA RACE
Read at 0x00c000076560 by goroutine 14:
  internal/godebug.(*Setting).IncNonDefault()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/internal/godebug/godebug.go:102 +0x58
  internal/godebug.(*Setting).IncNonDefault-fm()
      <autogenerated>:1 +0x34
  runtime.(*godebugInc).IncNonDefault()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/runtime/runtime.go:111 +0xd0
  github.com/dolmen-go/rendezvous.WaitAll.func1()
      src/github.com/dolmen-go/rendezvous/rendezvous.go:56 +0xec
  github.com/dolmen-go/rendezvous.WaitAll.func2()
      src/github.com/dolmen-go/rendezvous/rendezvous.go:57 +0x44

Previous write at 0x00c000076560 by goroutine 12:
  internal/godebug.New()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/internal/godebug/godebug.go:74 +0x40
  internal/godebug.newIncNonDefault()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/internal/godebug/godebug.go:203 +0x28
  runtime.(*godebugInc).IncNonDefault()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/runtime/runtime.go:108 +0x84
  github.com/dolmen-go/rendezvous.WaitAll.func1()
      src/github.com/dolmen-go/rendezvous/rendezvous.go:56 +0xec
  github.com/dolmen-go/rendezvous.WaitAll.func2()
      src/github.com/dolmen-go/rendezvous/rendezvous.go:57 +0x44

Link to runtime source: https://github.com/golang/go/blob/go1.21.5/src/runtime/runtime.go#L97

A comment says:

		// If other goroutines are racing here, no big deal. One will win,
		// and all the inc functions will be using the same underlying
		// *godebug.Setting.

If this is related, I think this is "big deal" as running go test -race in CI is quite common. (Calling t.Parallel() from top level tests is not so common, this is unfortunate, and this issue blocks from pushing for it)

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Dec 11, 2023
@dolmen
Copy link
Contributor Author

dolmen commented Dec 11, 2023

$ go test -race -run 'Test(One|Two)' "-gcflags=all=-trimpath=$(go env GOPATH)" -count=1 -parallel=1 
PASS
ok  	github.com/dolmen-go/rendezvous	1.271s
$ go test -race -run 'Test(One|Two)' "-gcflags=all=-trimpath=$(go env GOPATH)" -count=1 -parallel=2
==================
WARNING: DATA RACE
Read at 0x00c000076500 by goroutine 14:
  internal/godebug.(*Setting).IncNonDefault()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/internal/godebug/godebug.go:102 +0x58
  internal/godebug.(*Setting).IncNonDefault-fm()
      <autogenerated>:1 +0x34
  runtime.(*godebugInc).IncNonDefault()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/runtime/runtime.go:111 +0xd0
  github.com/dolmen-go/rendezvous.WaitAll.func1()
      src/github.com/dolmen-go/rendezvous/rendezvous.go:56 +0xec
  github.com/dolmen-go/rendezvous.WaitAll.func2()
      src/github.com/dolmen-go/rendezvous/rendezvous.go:57 +0x44

Previous write at 0x00c000076500 by goroutine 12:
  internal/godebug.New()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/internal/godebug/godebug.go:74 +0x40
  internal/godebug.newIncNonDefault()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/internal/godebug/godebug.go:203 +0x28
  runtime.(*godebugInc).IncNonDefault()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/runtime/runtime.go:108 +0x84
  github.com/dolmen-go/rendezvous.WaitAll.func1()
      src/github.com/dolmen-go/rendezvous/rendezvous.go:56 +0xec
  github.com/dolmen-go/rendezvous.WaitAll.func2()
      src/github.com/dolmen-go/rendezvous/rendezvous.go:57 +0x44

Goroutine 14 (running) created at:
  github.com/dolmen-go/rendezvous.WaitAll()
      src/github.com/dolmen-go/rendezvous/rendezvous.go:47 +0x210
  github.com/dolmen-go/rendezvous_test.TestTwo()
      src/github.com/dolmen-go/rendezvous/rendezvous_test.go:118 +0xe4
  testing.tRunner()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/testing/testing.go:1595 +0x1b0
  testing.(*T).Run.func1()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/testing/testing.go:1648 +0x40

Goroutine 12 (running) created at:
  github.com/dolmen-go/rendezvous.WaitAll()
      src/github.com/dolmen-go/rendezvous/rendezvous.go:47 +0x210
  github.com/dolmen-go/rendezvous_test.TestOne()
      src/github.com/dolmen-go/rendezvous/rendezvous_test.go:104 +0x180
  testing.tRunner()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/testing/testing.go:1595 +0x1b0
  testing.(*T).Run.func1()
      /opt/homebrew/Cellar/go/1.21.5/libexec/src/testing/testing.go:1648 +0x40
==================
--- FAIL: TestOne (0.00s)
    testing.go:1465: race detected during execution of test
--- FAIL: TestTwo (0.00s)
    testing.go:1465: race detected during execution of test
FAIL
exit status 1
FAIL	github.com/dolmen-go/rendezvous	0.314s

@mauri870 mauri870 added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 11, 2023
@mauri870
Copy link
Member

cc @golang/runtime

@mknyszek mknyszek added this to the Backlog milestone Dec 13, 2023
@prattmic
Copy link
Member

Read is here: https://cs.opensource.google/go/go/+/master:src/internal/godebug/godebug.go;l=102;drc=39effbc105f5c54117a6011af3c48e3c8f14eca9

Write is here: https://cs.opensource.google/go/go/+/master:src/internal/godebug/godebug.go;l=74;drc=39effbc105f5c54117a6011af3c48e3c8f14eca9

The write is the initialization of the struct. It must happen before the read. Indeed, I believe the atomic store at https://cs.opensource.google/go/go/+/release-branch.go1.21:src/runtime/runtime.go;l=109;drc=1a9893a969e0a73dc4f1e48ed40ccaf29ec238a6 is publishing the value. But this is a runtime/internal/atomic value, which doesn't include race annotations. So I think we need to manually annotate that path.

@mauri870
Copy link
Member

mauri870 commented Dec 13, 2023

@prattmic I was looking at this the other day and I also think we should instrument it manually. So something like this would suffice? I'm wondering how to come up with a test for this 😅

diff --git a/src/runtime/runtime.go b/src/runtime/runtime.go
index 0829a84e43..c40a2725f9 100644
--- a/src/runtime/runtime.go
+++ b/src/runtime/runtime.go
@@ -161,8 +161,10 @@ type godebugInc struct {
 }
 
 func (g *godebugInc) IncNonDefault() {
+	var racer int
 	inc := g.inc.Load()
 	if inc == nil {
+		raceacquire(unsafe.Pointer(&racer))
 		newInc := godebugNewIncNonDefault.Load()
 		if newInc == nil {
 			return
@@ -173,6 +175,7 @@ func (g *godebugInc) IncNonDefault() {
 		inc = new(func())
 		*inc = (*newInc)(g.name)
 		g.inc.Store(inc)
+		racerelease(unsafe.Pointer(&racer))
 	}
 	(*inc)()
 }

@dolmen
Copy link
Contributor Author

dolmen commented Dec 14, 2023

Has anyone analyzed why this happens with parallel testing? What Go debug events metric is incremented in that case?

Is there a workaround besides:

  • not using parallel tests
  • not using the race detector when parallel testing is enabled: go test -parallel=2
  • not using parallel tests when the race detector is not enabled: go test -race -parallel=1

All those solutions are unfortunately about avoiding parallel testing.

@dolmen
Copy link
Contributor Author

dolmen commented Dec 14, 2023

As godebugInc is about incrementing a metric, I did more investigation by dumping the metrics that are effectively incremented:

package rendezvous_test

import (
	"fmt"
	"os"
	"runtime/metrics"
	"strings"
	"testing"
)

func TestMain(m *testing.M) {
	// call flag.Parse() here if TestMain uses flags
	exitCode := m.Run()

	descs := metrics.All()
	var godebugMetrics []metrics.Sample
	for i := range descs {
		if strings.HasPrefix(descs[i].Name, "/godebug/") && strings.HasSuffix(descs[i].Name, ":events") {
			godebugMetrics = append(godebugMetrics, metrics.Sample{Name: descs[i].Name})
		}
	}
	metrics.Read(godebugMetrics)
	for i := range godebugMetrics {
		if godebugMetrics[i].Value.Uint64() > 0 {
			fmt.Printf("%s: %v\n", godebugMetrics[i].Name, godebugMetrics[i].Value.Uint64())
		}
	}

	os.Exit(exitCode)
}

The result:

$ go test -run TestOne
PASS
/godebug/non-default-behavior/panicnil:events: 1
ok  	github.com/dolmen-go/rendezvous	0.151s

I have in fact a test that runs panic(nil) on purpose. I have to check if I still want that test.

dolmen added a commit to dolmen-go/rendezvous that referenced this issue Dec 14, 2023
To investigate golang/go#64649
(race detector reports a race condition in package runtime related to
the increment of a runtime metric), I'm adding a TestMain that dumps runtime/metrics
at the end of the tests.

Result:
  $ go test -run TestOne
  PASS
  /godebug/non-default-behavior/panicnil:events: 1
  ok  	github.com/dolmen-go/rendezvous	0.151s
@dolmen
Copy link
Contributor Author

dolmen commented Dec 14, 2023

My final workaround is to drop the panic(nil) in my tests as it fails on Go 1.21 and is no longer relevant with Go 1.21+.

@dolmen dolmen changed the title runtime,internal/godebug: race condition raised with parallel tests with goroutines and -race runtime: race condition raised with parallel tests, panic(nil) and -race Dec 14, 2023
@dr2chase
Copy link
Contributor

dr2chase commented Dec 14, 2023

I'm having no luck with a small reproducer, this is on Darwin amd64, 1.21.5, 1.21.3, also tip. Here's my current attempt, I'm about to try this on darwin/arm64. To get an event count I have to specify 1.20 in go.mod, or GODEBUG=panicnil=1, but in no case have I been able to tickle a race failure. (I absolutely believe this is a bug, your test reproducer fails just fine, but a small test would be darn nice.)

package parpanic_test

import (
	"fmt"
	"os"
	"runtime/metrics"
	"strings"
	"testing"
)

//go:noinline
func panicNil() (e any) {
	defer func() {
		e = recover()
	}()
	panic(e)
}

var tests = []string{"ant", "bat", "cat", "dog", "emu", "fox", "gnu", "hen"}

func TestMany(t *testing.T) {
	t.Parallel()
	for _, tc := range tests {
		t.Run(tc, func(t *testing.T) {
			t.Log(panicNil())
		})
	}

}

func TestMain(m *testing.M) {
	// call flag.Parse() here if TestMain uses flags
	exitCode := m.Run()
	descs := metrics.All()
	var godebugMetrics []metrics.Sample
	for i := range descs {
		if strings.HasPrefix(descs[i].Name, "/godebug/") && strings.HasSuffix(descs[i].Name, ":events") {
			godebugMetrics = append(godebugMetrics, metrics.Sample{Name: descs[i].Name})
		}
	}
	metrics.Read(godebugMetrics)
	for i := range godebugMetrics {
		if godebugMetrics[i].Value.Uint64() > 0 {
			fmt.Printf("%s: %v\n", godebugMetrics[i].Name, godebugMetrics[i].Value.Uint64())
		}
	}
	os.Exit(exitCode)
}

@dolmen
Copy link
Contributor Author

dolmen commented Dec 14, 2023

Here is my small reproducer:

package main_test

import "testing"

func Test(t *testing.T) {
	test := func(t *testing.T) {
		t.Parallel()
		defer func() {
			recover()
		}()
		panic(nil)
	}
	t.Run("One", test)
	t.Run("Two", test)
}
$ GODEBUG=panicnil=1 go test -run 'Test$' -race -parallel=2

@dolmen
Copy link
Contributor Author

dolmen commented Dec 14, 2023

@dr2chase Your TestMany executes in parallel, but there is no parallel test to run. The subtests where the panic(nil) happens are not run in parallel.

Also I have fixed a few issues on my metrics dump. See https://github.com/dolmen-go/rendezvous/blob/eab2b2b76f826e0daca0703147018d02a3641cbd/main_test.go#L15

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/549796 mentions this issue: runtime: add race annotations in IncNonDefault

@dr2chase
Copy link
Contributor

@gopherbot please consider this for backport to 1.21. It broke someone's integration tests, and since GODEBUG=panicnil (or go.mod version = 1.20) is a compatibility feature added to 1.21, it is a regression, if you look at it right.

@gopherbot
Copy link
Contributor

Backport issue(s) opened: #64756 (for 1.20), #64757 (for 1.21).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Dec 20, 2023
@dmitshur dmitshur modified the milestones: Backlog, Go1.22 Dec 20, 2023
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/551856 mentions this issue: runtime: use racereleasemerge for godebugInc

gopherbot pushed a commit that referenced this issue Dec 21, 2023
CL 549796 adds race annotations to godebugInc. It uses racerelease
to model a CompareAndSwap. However, a CompareAndSwap is
essentially a load and a store. Modeling it as just racerelease
makes it not synchronized with other racerelease, i.e. other CAS.
For the following execution

thread         A             B
          load, got nil
                        load, got nil
          set *inc
                        set *inc
          racerelease
          CAS success
                        racerelease
                        CAS fail
                        load
                        raceacquire
                        use *inc (from A)

On thread B, the raceacquire synchronizes with the previous
racerelease, which is not synchronized with racerelease on thread
A, so it doesn't know that the use of *inc on thread B is after
the set on thread A, and will report a race.

Change it to use racereleasemerge, which synchronizes with
previous racerelease and racereleasemerge. So in the case above it
knows thread B's CAS is after thread A's.

Also remove stale comment that was more relevant when the code
used atomic store, where CL 549796 changed to CAS.

Updates #64649.

Change-Id: I17671090a19c0699fcb4e6481e2abd98ef2e5542
Reviewed-on: https://go-review.googlesource.com/c/go/+/551856
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
@dolmen
Copy link
Contributor Author

dolmen commented Jan 15, 2024

The fix has been released with Go 1.21.6 (see #64757).

ezz-no pushed a commit to ezz-no/go-ezzno that referenced this issue Feb 18, 2024
Also use CompareAndSwap to make the code actually less racy.

Added a test which will be meaningful when run under the race
detector (tested it -race with broken fix in runtime, it failed).

Fixes golang#64649

Change-Id: I5972e08901d1adc8ba74858edad7eba91be1b0ce
Reviewed-on: https://go-review.googlesource.com/c/go/+/549796
Run-TryBot: David Chase <drchase@google.com>
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
ezz-no pushed a commit to ezz-no/go-ezzno that referenced this issue Feb 18, 2024
CL 549796 adds race annotations to godebugInc. It uses racerelease
to model a CompareAndSwap. However, a CompareAndSwap is
essentially a load and a store. Modeling it as just racerelease
makes it not synchronized with other racerelease, i.e. other CAS.
For the following execution

thread         A             B
          load, got nil
                        load, got nil
          set *inc
                        set *inc
          racerelease
          CAS success
                        racerelease
                        CAS fail
                        load
                        raceacquire
                        use *inc (from A)

On thread B, the raceacquire synchronizes with the previous
racerelease, which is not synchronized with racerelease on thread
A, so it doesn't know that the use of *inc on thread B is after
the set on thread A, and will report a race.

Change it to use racereleasemerge, which synchronizes with
previous racerelease and racereleasemerge. So in the case above it
knows thread B's CAS is after thread A's.

Also remove stale comment that was more relevant when the code
used atomic store, where CL 549796 changed to CAS.

Updates golang#64649.

Change-Id: I17671090a19c0699fcb4e6481e2abd98ef2e5542
Reviewed-on: https://go-review.googlesource.com/c/go/+/551856
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. RaceDetector
Projects
None yet
Development

No branches or pull requests

8 participants