runtime/race: race detector sometimes misses race on []byte copy #36794

Closed · rogpeppe opened this issue Jan 27, 2020 · 6 comments

@rogpeppe (Contributor) commented Jan 27, 2020

What version of Go are you using (go version)?

$ go version
go version devel +6dbcc8b865 Thu Jan 9 15:38:00 2020 +0000 linux/amd64

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="auto"
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/rog/.cache/go-build"
GOENV="/home/rog/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY="github.com/heetch"
GONOSUMDB="github.com/heetch"
GOOS="linux"
GOPATH="/home/rog/src/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/rog/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/rog/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/tmp/m/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build752416385=/tmp/go-build -gno-record-gcc-switches"

What did you do?

When this test code runs under the race detector, I don't see a race detected, even though there is a race. When I add a sleep (set bogus to true), the race detector does find the race.

https://play.golang.org/p/ECoOELB1fC1

package test

import (
	"sync"
	"testing"
	"time"
)

const bogus = false

func corrupt(data []byte) {
	copy(data, data[2:])
}

var saved string

func TestFoo(t *testing.T) {
	data := []byte("hello there")
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		saved = string(data)
		wg.Done()
	}()
	if bogus {
		time.Sleep(time.Millisecond)
	}
	corrupt(data)
	wg.Wait()
	t.Logf("saved %q; data %q", saved, data)
}
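
For reference, the test was run under the race detector, which is enabled with the -race flag:

$ go test -race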

@randall77 (Contributor) commented Jan 27, 2020

Definitely strange. It seems very fragile. Changing the string length affects it (only lengths 10 and 11 demonstrate the bug).
Replacing time.Sleep with a simple integer global variable write still reproduces the bug. Anything that calls into the race detector, really.

The calls out to the race detector from the runtime look fine, and they are the same in both cases.

slicecopy write 0xc00011c090 9
slicecopy read 0xc00011c092 9
slicebytetostring read 0xc00011c090 11

I suspect something in the race detector. Possibly this is just a limitation of the race detector: something is colliding in the shadow space in one case and not in the other. It's not guaranteed to find all races.
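
For context, those calls come from the race instrumentation in the runtime's slicecopy: the destination range (data[0:9] for copy(data, data[2:])) is reported as a write, then the source range (data[2:11]) as a read. A rough sketch of that instrumentation, paraphrased rather than quoted from the runtime source, with illustrative variable names:

if raceenabled {
	callerpc := getcallerpc()
	pc := funcPC(slicecopy)
	// Destination range first: the 9-byte write seen in the trace above...
	racewriterangepc(dstPtr, uintptr(n)*width, callerpc, pc)
	// ...then the overlapping 9-byte read of the source range.
	racereadrangepc(srcPtr, uintptr(n)*width, callerpc, pc)
}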

@dvyukov (Member) commented Jan 28, 2020

I suspect this is related to one of the deep limitations of the tsan runtime: it memorizes at most 4 unique previous memory accesses for each 8-byte-aligned granule of application memory (it's hard to memorize, with full stack traces, every memory access that has ever happened in an application).

When all 4 slots are already busy, tsan either replaces one of the "weaker" (for the purposes of race detection) memory accesses (e.g. a remembered read when the new access is a write to the same memory), or, if there are no weaker ones, it replaces one pseudo-randomly.

Several aspects combine here to give this stable effect on this test:

  1. One of the accesses is unaligned (relative to the 8-byte granule).
    The slice range accesses end up in tsan's MemoryAccessRange function:
    https://github.com/llvm-mirror/compiler-rt/blob/master/lib/tsan/rtl/tsan_rtl_thread.cpp#L376-L405
    For aligned ranges (and the middle part of large accesses), tsan models 8-byte accesses, so these take just 1 of the 4 available slots. However, for an unaligned start/end it simply models them as N 1-byte accesses, so these may take up lots of slots.

  2. The unaligned offset is 2, so tsan models 6 individual 1-byte accesses to the beginning of the slice. These probably evict everything (including the writes that should later be detected as racing).
    There is potential for improvement here: tsan could model this 6-byte access as one 2-byte access plus one 4-byte access instead, which would take only 2 slots (it supports only 1/2/4/8-byte accesses).

  3. The exact behavior of the pseudo-random eviction (the writes could potentially have survived even with 6 subsequent reads). That's probably why inserting even one more access surfaces the race.

  4. The fact that slicecopy first does racewriterangepc, followed by racereadrangepc. As a result, the split reads evict the previous writes.
    I think this is a "bug" (for some definition of bug, of course, because it was totally not obvious). I think we need to swap the order. If the ranges don't overlap, it should not matter (apart from a potential effect on the pseudo-random eviction). But if they overlap, then writes are always "stronger" for race detection and it's better to preserve them. This may be enough to "fix" this particular case. Currently tsan does remember the reads to the region, but since the second access is also a read (slicebytetostring), reads don't race with reads, and the race is not detected. If we keep the writes, it should be detected.

  5. As a bonus, if the ranges overlap, we could instrument the overlapping part only as a write (which would also be faster).

Other range instrumentation sites should be audited for similar cases too (points 4 and 5).

@rogpeppe thanks for the nice test case! False positives are worse, but they are always loud and don't go unnoticed, whereas false negatives very frequently stay unnoticed in large quantities for a long time because they are silent. So in some sense each of them is much more important, and always requires some manual attention and a healthy bit of mistrust in things that are "known to work" :)
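
As an aside to make the slot-eviction mechanism above concrete, here is a toy Go model of the bookkeeping described in this comment: at most 4 remembered accesses per 8-byte granule, a new write may replace a remembered read, and otherwise eviction is pseudo-random. The types and names are purely illustrative and do not reflect tsan's actual data layout.

package main

import (
	"fmt"
	"math/rand"
)

// access is one remembered memory access in the shadow state of a granule.
// Writes are "stronger" than reads for race-detection purposes.
type access struct {
	goroutine int
	isWrite   bool
}

// granule models the shadow state of one 8-byte-aligned chunk of memory:
// tsan remembers at most 4 previous accesses to it.
type granule struct {
	slots []access
}

func (g *granule) record(a access) {
	if len(g.slots) < 4 {
		g.slots = append(g.slots, a)
		return
	}
	// A new write may replace a remembered read (the "weaker" access).
	if a.isWrite {
		for i := range g.slots {
			if !g.slots[i].isWrite {
				g.slots[i] = a
				return
			}
		}
	}
	// Otherwise evict pseudo-randomly; a racy write recorded earlier can be
	// forgotten here, which is the false negative discussed above.
	g.slots[rand.Intn(len(g.slots))] = a
}

func main() {
	g := &granule{}
	g.record(access{goroutine: 1, isWrite: true}) // the write to the destination range
	for i := 0; i < 6; i++ {
		// the six 1-byte reads modeled for the unaligned start of the source range
		g.record(access{goroutine: 1, isWrite: false})
	}
	fmt.Printf("remembered accesses: %+v\n", g.slots)
}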

@randall77 (Contributor) commented Jan 28, 2020

> I think this is a "bug" (for some definition of bug, of course, because it was totally not obvious). I think we need to swap the order. If the ranges don't overlap, it should not matter (apart from a potential effect on the pseudo-random eviction). But if they overlap, then writes are always "stronger" for race detection and it's better to preserve them. This may be enough to "fix" this particular case. Currently tsan does remember the reads to the region, but since the second access is also a read (slicebytetostring), reads don't race with reads, and the race is not detected. If we keep the writes, it should be detected.

Yes, I noticed this as well, and swapping the order fixes this particular issue. I didn't understand why; now I do.
I'll submit a CL for that for 1.15.
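
Concretely, the reordering amounts to reporting the read of the source range before the write of the destination range in slicecopy, so that when the ranges overlap the write is what remains in the shadow state. A minimal sketch of the swapped order (not the actual CL; variable names are illustrative):

if raceenabled {
	callerpc := getcallerpc()
	pc := funcPC(slicecopy)
	// Report the read of the source range first...
	racereadrangepc(srcPtr, uintptr(n)*width, callerpc, pc)
	// ...then the write to the destination, so an overlapping write is
	// recorded last and preserved for later race detection.
	racewriterangepc(dstPtr, uintptr(n)*width, callerpc, pc)
}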

@cagedmantis added this to the Go1.15 milestone Feb 3, 2020
@cagedmantis added the NeedsFix label Feb 3, 2020
@gopherbot commented Feb 25, 2020

Change https://golang.org/cl/220685 mentions this issue: runtime: reorder race detector calls in slicecopy

@gopherbot closed this in 089e482 Feb 25, 2020