
runtime/race: race detector misses race condition in case of fmt.Println #27756

Open
doroginin opened this issue Sep 19, 2018 · 24 comments

@doroginin
Contributor

commented Sep 19, 2018

What version of Go are you using (go version)?

go version go1.11 linux/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOCACHE="/home/dd/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/dd/go:/home/dd/data/Documents/dev"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go-1.11/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go-1.11/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="0"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -fmessage-length=0 -fdebug-prefix-map=/tmp/user/1000/go-build740506104=/tmp/go-build -gno-record-gcc-switches"

What did you do?

cat /home/dd/.GoLand2018.2/config/scratches/scratch_5.go

package main

import "fmt"

var a int

func f() {
	fmt.Println(a) // if comment this, data race is found
	a = 1
}

func main() {
	go f()
	fmt.Println(a)
}

go run -race /home/dd/.GoLand2018.2/config/scratches/scratch_5.go

What did you expect to see?

0
0
==================
WARNING: DATA RACE
Write at 0x0000005e3640 by goroutine 6:
  main.f()
      /home/dd/.GoLand2018.2/config/scratches/scratch_5.go:9 +0x69

Previous read at 0x0000005e3640 by main goroutine:
  main.main()
      /home/dd/.GoLand2018.2/config/scratches/scratch_5.go:14 +0x5e

Goroutine 6 (running) created at:
  main.main()
      /home/dd/.GoLand2018.2/config/scratches/scratch_5.go:13 +0x46
==================
Found 1 data race(s)

What did you see instead?

0
0

I think this is related to #12664
cc @dvyukov

@mwf


commented Sep 19, 2018

@gopherbot please add label RaceDetector

@mirtchovski

Contributor

commented Sep 19, 2018

The race detector works by tracking reads and writes to memory addresses from different goroutines. In your case, the introduction of fmt.Println() in f() acts as a scheduling barrier, forcing the Println in main() to run before the shared variable a is modified in f(). In effect, the race detector sees a read of a from main(), a read of a from f(), and a write to a from f(). (This doesn't take into account that main() may exit and kill the process before the write to a even occurs.)

Multiple reads from the same location are not considered race conditions; only after a write has occurred will subsequent reads or writes trigger a report. The modified version of your program below illustrates this: the race detector will trigger on the second print in main() after a delay of 1 second:

package main

import (
	"fmt"
	"time"
)

var a int

func f() {
	fmt.Println(a)
	a = 1
}

func main() {
	go f()
	fmt.Println(a)
	time.Sleep(1 * time.Second)
	fmt.Println(a)
}

Unfortunately that's a consequence of how the race detector works. The race detector will not issue "false positives" but it is possible for it to miss something ("false negative") in short-lived programs. As the blog post about the race detector concluded:

The race detector is a powerful tool for checking the correctness of concurrent programs. It will not issue false positives, so take its warnings seriously. But it is only as good as your tests; you must make sure they thoroughly exercise the concurrent properties of your code so that the race detector can do its job.

@randall77

Contributor

commented Sep 19, 2018

The race detector should catch this.
I suspect some synchronization inside fmt.Printf is causing this. But I looked around, and I don't see anything obvious. Initially I thought it might be the sync.Pool operation, but I get the same issue if I change fmt to just use new to get a new pp.

The race detector should find the race, even though the read gets executed before the write. It remembers reads so that it can issue errors if subsequent writes happen without a happens-before relationship.

Fixed repro: make sure the goroutine has a chance to run.

package main

import "fmt"

var a int

func f(c chan int) {
	fmt.Println(a) // if comment this, data race is found
	a = 1
	c <- 0
}

func main() {
	c := make(chan int)
	go f(c)
	fmt.Println(a)
	<-c
}

@mirtchovski

Contributor

commented Sep 19, 2018

My analysis was incorrect; I apologize. OTOH I think I may have found an even simpler repro, one that doesn't involve fmt at all. The code below does not report a race; however, if the last two lines of main are swapped, the race is reported:

package main

var a int

func f(c chan int) {
	a = 1
	c <- a
}

func main() {
	c := make(chan int)
	go f(c)
	<-c    // swapping this line and the one below will cause a race to be reported
	a = 2
}

I could not find in the documentation whether channel communication clears the state of the writes so that they are no longer considered racy.

@randall77

Contributor

commented Sep 19, 2018

@mirtchovski : your last example doesn't have a race. The channel operations establish a happens-before relationship between the operations in f and the final write.
Swapping the last two statements removes the happens-before relationship.

@mirtchovski

Contributor

commented Sep 19, 2018

Thank you, that's the part I was missing.

@dvyukov

Member

commented Sep 19, 2018

This doesn't take into account that main() may exit and kill the process before the write to a even occurs.

Also, the race detector should sleep for a few seconds before exiting, specifically to catch such cases.

I suspect some synchronization inside fmt.Printf is causing this. But I looked around, and I don't see anything obvious.

We also have some acquire/release race annotations in syscall package around Read/Write. But Printf should only Write?

@mirtchovski

Contributor

commented Sep 19, 2018

We also have some acquire/release race annotations in syscall package around Read/Write. But Printf should only Write?

I observed the same behaviour from @randall77's reproducer if I replace Printf with os.Stdout.Write([]byte{byte(a) + 0x41})

@dvyukov

Member

commented Sep 19, 2018

This is due to internal/poll.(*FD).Write/internal/poll.(*FD).writeLock.

@dvyukov

Member

commented Sep 19, 2018

Why are we doing this fdMutex business for blocking descriptors? That's just pure slowdown; we never promised that they can be closed concurrently with read/write, and it never worked. I think it can break things too, because surely there are some weird files in Linux pseudo file systems that require a second write to unblock a preceding write, but fdMutex won't allow that.

@dvyukov

Member

commented Sep 20, 2018

Was that intentional? We have the non-blocking flag, so we could check it before we take the fdMutex.

But back to races: I think we should ignore fdMutex for race detector. We don't guarantee any user-visible synchronization there, right? I think we also explicitly catch races between Read/Write and Close for anything other than net.Conn by annotating Close as write. That's what we used to detect for os.File because Read/Write read the fd field and Close wrote -1 there.

@ianlancetaylor

Contributor

commented Sep 24, 2018

It's now explicitly OK to call Read or Write concurrently with Close, even for a blocking descriptor. For a non-blocking descriptor, including a pipe returned by os.Pipe, Close will interrupt a pending Read or Write. That won't work with a blocking descriptor like a file, but in general Read and Write don't block when used on files. In general if Read and Write can block, then the netpoll mechanism will work and we will not consider the descriptor to be blocking.

The upshot of that is that I think we need to keep using the fdMutex even for a blocking descriptor.

@dvyukov

Member

commented Sep 25, 2018

Is this new behavior useful for blocking descriptors? It looks like closing a channel concurrently with sends.

@ianlancetaylor

Contributor

commented Sep 25, 2018

It's useful for descriptors that actually block, yes. See, for example, #18507.

It's true that concurrently closing and writing to a descriptor is a race. But it's a race that is mediated by the kernel to be harmless. And it's a race that is difficult to avoid when you aren't in control of the other end of the descriptor.

I understand that this is a problem in that it can hide some legitimate races. And it adds a few atomic instructions to every file I/O operation. Is there another problem?

@dvyukov

Member

commented Sep 25, 2018

But is it useful for non-pollable descriptors? It looks like it can only mask bugs.

The race is not mediated by the kernel. If we actually raced write/close in the kernel, it would be quite harmful. But we mediate it with fdMutex.

The problem is (1) additional contended atomic operations (kernel operations can be completely independent, e.g. reading random entropy from per-cpu pool), (2) masking bugs on file descriptors (even if it does not crash now, write success is still non-deterministic), (3) masking other unrelated races, (4) deadlocking when a file descriptor requires 2 parallel writes to proceed, but we block the second write on fdMutex.

@ianlancetaylor

Contributor

commented Sep 25, 2018

Why do you say that racing write and close in the kernel is harmful? That is a normal operation. One or both will fail. The result is non-deterministic but not otherwise harmful.

There are non-pollable non-standard descriptors that can block. For these descriptors the ability to close a descriptor that is hanging in a read is useful.

I agree with your points 1, 2, and 3, but I don't understand when a descriptor would require 2 parallel writes.

@dvyukov

Member

commented Sep 25, 2018

Why do you say that racing write and close in the kernel is harmful?

Because in this case you can't know on what descriptor you actually issue the write. Consider:

int fd = ...;

// thread 1
write(fd, ...);

// thread 2
close(fd);

Option 1: write gets into the kernel and resolves the fd number to the file description object, acquiring a reference to it first (before close). Everything is fine.
Option 2: thread 1 reads the fd value but is preempted just before the write, in libc, or in the kernel before the fd is resolved to a description. Close closes and releases the fd. Now write resumes and resolves the fd number to a description object. Option 2a: resolution fails since the fd is closed. Option 2b: yet another thread opens something and gets this fd in between; now thread 1 issues the write to a random file description (e.g. sends your crypto keys to a random client connection).

Since write and close are concurrent, you can't possibly avoid 2a/2b.

Now, fdMutex avoids this race by delaying close, but you really don't want to race write/close in general.

@dvyukov

Member

commented Sep 25, 2018

but I don't understand when a descriptor would require 2 parallel writes.

Consider a custom file (like something in devfs, procfs, sysfs, or debugfs). Writes to such a file are really just some operation in the kernel (they don't have anything to do with write semantics). Now suppose the semantics are such that a second write unblocks the first write. For example, the first write says "I want to receive a message from a peer" and the second write says "I am sending a message to a peer".
Now, fdMutex will not allow the second write to start until the first write completes. We are deadlocked.

@dvyukov

Member

commented Sep 25, 2018

There are non-pollable non-standard descriptors that can block. For these descriptors the ability to close a descriptor that is hanging in a read is useful.

But for non-pollable descriptors we will not issue close while there is an outstanding read/write. Close will only try to evict the fd from poller (which should be no-op for non-pollable descriptors) and then delay the actual close until after read/write returns.

@ianlancetaylor

Contributor

commented Sep 25, 2018

But the kind of write/close race you describe cannot happen with the current internal/poll code.

@ianlancetaylor

Contributor

commented Sep 25, 2018

What are you suggesting we should do?

@dvyukov

Member

commented Sep 26, 2018

What are you suggesting we should do?

One option is always to do nothing :)

Then it depends on whether we promised that Close/Read/Write races are OK for all files or not. You said "It's now explicitly OK to call Read or Write concurrently with Close, even for a blocking descriptor." Is that documented somewhere?
Possible options:

  1. Skip fdMutex for non-pollable descriptors.
  2. Ignore fdMutex for race detector for all descriptors.
    2a. Additionally detect races between Close/Write/Read with manual race annotations.

@ianlancetaylor

Contributor

commented Sep 26, 2018

I don't think it's explicitly documented, but I believe that it should be OK to call Read or Write concurrently with Close. Rightly or wrongly, people expect that to work. There were several different bug reports when it did not work.
