
runtime/race: race detector misses race condition in case of fmt.Println #27756

Open
doroginin opened this issue Sep 19, 2018 · 24 comments

@doroginin
Contributor

commented Sep 19, 2018

What version of Go are you using (go version)?

go version go1.11 linux/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOCACHE="/home/dd/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/dd/go:/home/dd/data/Documents/dev"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go-1.11/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go-1.11/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="0"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -fmessage-length=0 -fdebug-prefix-map=/tmp/user/1000/go-build740506104=/tmp/go-build -gno-record-gcc-switches"

What did you do?

cat /home/dd/.GoLand2018.2/config/scratches/scratch_5.go

package main

import "fmt"

var a int

func f() {
	fmt.Println(a) // if comment this, data race is found
	a = 1
}

func main() {
	go f()
	fmt.Println(a)
}

go run -race /home/dd/.GoLand2018.2/config/scratches/scratch_5.go

What did you expect to see?

0
0
==================
WARNING: DATA RACE
Write at 0x0000005e3640 by goroutine 6:
  main.f()
      /home/dd/.GoLand2018.2/config/scratches/scratch_5.go:9 +0x69

Previous read at 0x0000005e3640 by main goroutine:
  main.main()
      /home/dd/.GoLand2018.2/config/scratches/scratch_5.go:14 +0x5e

Goroutine 6 (running) created at:
  main.main()
      /home/dd/.GoLand2018.2/config/scratches/scratch_5.go:13 +0x46
==================
Found 1 data race(s)

What did you see instead?

0
0

I think this is related to #12664
cc @dvyukov

@mwf


commented Sep 19, 2018

@gopherbot please add label RaceDetector

@mirtchovski

Contributor

commented Sep 19, 2018

The race detector works by tracking reads and writes to memory addresses from different goroutines. In your case, the introduction of fmt.Println() in f() acts as a scheduling barrier, forcing the Println in main() to run before the shared variable a is modified in f(). In effect, the race detector sees a read of a from main(), a read of a from f(), and a write to a from f(). (This doesn't take into account that main() may exit and kill the process before the write to a even occurs.)

Multiple reads from the same location are not considered race conditions; only after a write has occurred will subsequent reads or writes trigger a report. The modified version of your program below illustrates this: the race detector will trigger on the second print in main() after a delay of 1 second:

package main

import (
	"fmt"
	"time"
)

var a int

func f() {
	fmt.Println(a)
	a = 1
}

func main() {
	go f()
	fmt.Println(a)
	time.Sleep(1 * time.Second)
	fmt.Println(a)
}

Unfortunately that's a consequence of how the race detector works. The race detector will not issue "false positives" but it is possible for it to miss something ("false negative") in short-lived programs. As the blog post about the race detector concluded:

The race detector is a powerful tool for checking the correctness of concurrent programs. It will not issue false positives, so take its warnings seriously. But it is only as good as your tests; you must make sure they thoroughly exercise the concurrent properties of your code so that the race detector can do its job.

@randall77

Contributor

commented Sep 19, 2018

The race detector should catch this.
I suspect some synchronization inside fmt.Printf is causing this. But I looked around, and I don't see anything obvious. Initially I thought it might be the sync.Pool operation, but I get the same issue if I change fmt to just use new to get a new pp.

The race detector should find the race, even though the read gets executed before the write. It remembers reads so that it can issue errors if subsequent writes happen without a happens-before relationship.

Fixed repro: make sure the goroutine has a chance to run.

package main

import "fmt"

var a int

func f(c chan int) {
	fmt.Println(a) // if comment this, data race is found
	a = 1
	c <- 0
}

func main() {
	c := make(chan int)
	go f(c)
	fmt.Println(a)
	<-c
}

@mirtchovski

Contributor

commented Sep 19, 2018

My analysis was incorrect; I apologize. OTOH I think I may have found an even simpler repro, one that doesn't involve fmt at all. The code below does not report a race; however, if the last two lines of main are swapped, the race is reported:

package main

var a int

func f(c chan int) {
	a = 1
	c <- a
}

func main() {
	c := make(chan int)
	go f(c)
	<-c    // swapping this line and the one below will cause a race to be reported
	a = 2
}

I could not find in the documentation whether channel communication clears the state of the writes so that they are no longer considered racy.

@randall77

Contributor

commented Sep 19, 2018

@mirtchovski : your last example doesn't have a race. The channel operations establish a happens-before relationship between the operations in f and the final write.
Swapping the last two statements removes the happens-before relationship.

@mirtchovski

Contributor

commented Sep 19, 2018

Thank you, that's the part I was missing.

@dvyukov

Member

commented Sep 19, 2018

This doesn't take into account that main() may exit and kill the process before the write to a even occurs.

Also, the race detector should sleep for a few seconds before exiting, specifically to catch such cases.

I suspect some synchronization inside fmt.Printf is causing this. But I looked around, and I don't see anything obvious.

We also have some acquire/release race annotations in syscall package around Read/Write. But Printf should only Write?

@mirtchovski

Contributor

commented Sep 19, 2018

We also have some acquire/release race annotations in syscall package around Read/Write. But Printf should only Write?

I observed the same behaviour from @randall77's reproducer if I replace Printf with os.Stdout.Write([]byte{byte(a) + 0x41})

@dvyukov

Member

commented Sep 19, 2018

This is due to internal/poll.(*FD).Write/internal/poll.(*FD).writeLock.

@dvyukov

Member

commented Sep 19, 2018

Why are we doing this fdMutex business for blocking descriptors? That's just pure slowdown; we never promised that they can be closed concurrently with read/write, and it never worked. I think it can break things too, because surely there are some weird files in Linux pseudo file systems that require a second write to unblock a preceding write, but fdMutex won't allow that.

@dvyukov

Member

commented Sep 20, 2018

Was that intentional? We have the non-blocking flag, so we could check it before we take the fdMutex.

But back to races: I think we should ignore fdMutex for race detector. We don't guarantee any user-visible synchronization there, right? I think we also explicitly catch races between Read/Write and Close for anything other than net.Conn by annotating Close as write. That's what we used to detect for os.File because Read/Write read the fd field and Close wrote -1 there.

@ianlancetaylor

Contributor

commented Sep 24, 2018

It's now explicitly OK to call Read or Write concurrently with Close, even for a blocking descriptor. For a non-blocking descriptor, including a pipe returned by os.Pipe, Close will interrupt a pending Read or Write. That won't work with a blocking descriptor like a file, but in general Read and Write don't block when used on files. In general if Read and Write can block, then the netpoll mechanism will work and we will not consider the descriptor to be blocking.

The upshot of that is that I think we need to keep using the fdMutex even for a blocking descriptor.

@dvyukov

Member

commented Sep 25, 2018

Is this new behavior useful for blocking descriptors? It looks like closing a channel concurrently with sends.

@ianlancetaylor

Contributor

commented Sep 25, 2018

It's useful for descriptors that actually block, yes. See, for example, #18507.

It's true that concurrently closing and writing to a descriptor is a race. But it's a race that is mediated by the kernel to be harmless. And it's a race that is difficult to avoid when you aren't in control of the other end of the descriptor.

I understand that this is a problem in that it can hide some legitimate races. And it adds a few atomic instructions to every file I/O operation. Is there another problem?

@dvyukov

Member

commented Sep 25, 2018

But is it useful for non-pollable descriptors? It looks like it can only mask bugs.

The race is not mediated by the kernel. If we actually raced write/close in the kernel, it would be quite harmful. But we mediate it with fdMutex.

The problem is (1) additional contended atomic operations (kernel operations can be completely independent, e.g. reading random entropy from per-cpu pool), (2) masking bugs on file descriptors (even if it does not crash now, write success is still non-deterministic), (3) masking other unrelated races, (4) deadlocking when a file descriptor requires 2 parallel writes to proceed, but we block the second write on fdMutex.

@ianlancetaylor

Contributor

commented Sep 25, 2018

Why do you say that racing write and close in the kernel is harmful? That is a normal operation. One or both will fail. The result is non-deterministic but not otherwise harmful.

There are non-pollable non-standard descriptors that can block. For these descriptors the ability to close a descriptor that is hanging in a read is useful.

I agree with your points 1, 2, and 3, but I don't understand when a descriptor would require 2 parallel writes.

@dvyukov

Member

commented Sep 25, 2018

Why do you say that racing write and close in the kernel is harmful?

Because in this case you can't know on what descriptor you actually issue the write. Consider:

int fd = ...;

// thread 1
write(fd, ...);

// thread 2
close(fd);

Option 1: write gets into the kernel and resolves the fd number to the file description object, acquiring a reference to it first (before close). Everything is fine.
Option 2: thread 1 reads the fd value but is preempted just before the write, in libc, or in the kernel before the fd is resolved to a description. Close closes and releases the fd. Now write resumes and resolves the fd number to a description object. Option 2a: resolution fails since the fd is closed. Option 2b: yet another thread opens something and gets this fd in between; now thread 1 issues the write to a random file description (e.g. sends your crypto keys to a random client connection).

Since write and close are concurrent, you can't possibly avoid 2a/2b.

Now, fdMutex avoids this race by delaying close, but you really don't want to race write/close in general.

@dvyukov

Member

commented Sep 25, 2018

but I don't understand when a descriptor would require 2 parallel writes.

Consider a custom file (like something in devfs, procfs, sysfs, or debugfs). Writes to such a file are really just some operation in the kernel (they don't have anything to do with write semantics). Now suppose the semantics are such that a second write unblocks the first write. For example, the first write says "I want to receive a message from a peer" and the second write says "I am sending a message to a peer".
Now, fdMutex will not allow the second write to start until the first write completes. We are deadlocked.

@dvyukov

Member

commented Sep 25, 2018

There are non-pollable non-standard descriptors that can block. For these descriptors the ability to close a descriptor that is hanging in a read is useful.

But for non-pollable descriptors we will not issue close while there is an outstanding read/write. Close will only try to evict the fd from poller (which should be no-op for non-pollable descriptors) and then delay the actual close until after read/write returns.

@ianlancetaylor

Contributor

commented Sep 25, 2018

But the kind of write/close race you describe cannot happen with the current internal/poll code.

@ianlancetaylor

Contributor

commented Sep 25, 2018

What are you suggesting we should do?

@dvyukov

Member

commented Sep 26, 2018

What are you suggesting we should do?

One option is always to do nothing :)

Then it depends on whether we promised that Close/Read/Write races are OK for all files or not. You said "It's now explicitly OK to call Read or Write concurrently with Close, even for a blocking descriptor." Is that documented somewhere?
Possible options:

  1. Skip fdMutex for non-pollable descriptors.
  2. Ignore fdMutex for race detector for all descriptors.
    2a. Additionally detect races between Close/Write/Read with manual race annotations.

@ianlancetaylor

Contributor

commented Sep 26, 2018

I don't think it's explicitly documented, but I believe that it should be OK to call Read or Write concurrently with Close. Rightly or wrongly, people expect that to work. There were several different bug reports when it did not work.
