Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: close of a channel with pending sends incorrectly flagged by race detector #30372

Closed
kjk opened this issue Feb 24, 2019 · 23 comments

Comments

Projects
None yet
8 participants
@kjk
Copy link

commented Feb 24, 2019

What version of Go are you using (go version)?

$ go version
go version go1.11.4 windows/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\kjk\AppData\Local\go-build
set GOEXE=.exe
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=C:\Users\kjk\go
set GOPROXY=
set GORACE=
set GOROOT=C:\Go
set GOTMPDIR=
set GOTOOLDIR=C:\Go\pkg\tool\windows_amd64
set GCCGO=gccgo
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=C:\Users\kjk\AppData\Local\Temp\go-build206244702=/tmp/go-build -gno-record-gcc-switches

What did you do?

Save this to e.g. t.go:

package main

import (
	"fmt"
	"time"
)

func main() {
	ch := make(chan string)
	for i := 0; i < 256; i++ {
		go func(n int) {
			ch <- fmt.Sprintf("string %d", n)
		}(i)
	}
	time.Sleep(time.Second)
	close(ch)
}

and run go -race t.go.

What did you see instead?

I see a data race reported:

> go run -race t.go
==================                                                                
WARNING: DATA RACE                                                                
Write at 0x00c00001a130 by main goroutine:                                        
  runtime.closechan()                                                             
      C:/Go/src/runtime/chan.go:327 +0x0                                          
  main.main()                                                                     
      C:/Users/kjk/src/t.go:16 +0xb5                                              
                                                                                  
Previous read at 0x00c00001a130 by goroutine 231:                                 
  runtime.chansend()                                                              
      C:/Go/src/runtime/chan.go:140 +0x0                                          
  main.main.func1()                                                               
      C:/Users/kjk/src/t.go:12 +0xcc                                              
                                                                                  
Goroutine 231 (running) created at:                                               
  main.main()                                                                     
      C:/Users/kjk/src/t.go:11 +0x83                                              
==================                                                                
panic: send on closed channel                                                     
                                                                                  
goroutine 201 [running]:                                                          
main.main.func1(0xc00001a120, 0xc3)                                               
        C:/Users/kjk/src/t.go:12 +0xcd                                            
created by main.main                                                              
        C:/Users/kjk/src/t.go:11 +0x84                                            
panic: send on closed channel                                                     
                                                                                  
goroutine 235 [running]:                                                          
main.main.func1(0xc00001a120, 0xe5)                                               
        C:/Users/kjk/src/t.go:12 +0xcd                                            
created by main.main                                                              
        C:/Users/kjk/src/t.go:11 +0x84                                            
exit status 2                                                                     

What did you expect to see?

I didn't expect data race between close(ch) and pending ch <- v send.

The spec (https://golang.org/ref/spec#Channel_types) is curiously mum on the subject.

It says:

A single channel may be used in send statements, receive operations, and calls to the built-in functions cap and len by any number of goroutines without further synchronization.

By not mentioning close it implies that one has to synchronize close and sends.

But I wouldn't know how to synchronize those. My sends have already been dispatched and closing a channel seems like a valid (and only) way to ensure those sending goroutines don't stick forever (assumingI recover the resulting panic).

The data race seems harmless, as I'm done with this channel, but I run my tests with -race detection enabled and this makes the test fail.

@go101

This comment has been minimized.

Copy link

commented Feb 24, 2019

Similar to #27769

@as

This comment has been minimized.

Copy link
Contributor

commented Feb 24, 2019

It's a valid transaction race. There is no guarantee the sends complete before the close, which would result in a panic. The race detector actually helped you here.

@kjk

This comment has been minimized.

Copy link
Author

commented Feb 24, 2019

@as The code with data race is in Go's runtime, not in my code.

Where does the spec declare this code as generating a data race?

As stated in my bug report, I understand that this panics as that is what the spec says and I'm fine with that.

I'm not fine with this also being reported as a data race as the spec doesn't say that closing a channel with pending sends is undefined behavior (which would be enough of an escape to consider the implementation conforming).

@odeke-em

This comment has been minimized.

Copy link
Member

commented Feb 24, 2019

Thank you for this report @kjk!

In deed as @go101 noted, it is a duplicate of #27769 which is still open and trying to decide how to document this. As @as has noted it is a valid transaction race but as you too have found there is no documentation in the spec about this. Let's move this to that issue as it is also actionable but more discussion has gone on there. I'll close this one in favor of #27769 but please feel free to reopen this one.

@odeke-em odeke-em closed this Feb 24, 2019

@kjk

This comment has been minimized.

Copy link
Author

commented Feb 24, 2019

I don't have permission to re-open the bug, @odeke-em could you please re-open it.

This is not a dup #27769.

The discussion in that bug operates under the assumption that this:

	ch := make(chan string)
	for i := 0; i < 256; i++ {
		go func(n int) {
			ch <- fmt.Sprintf("string %d", n)
		}(i)
	}
	close(ch)

is invalid Go program, that race detector is rightfully detecting a data race here and it's just a matter of documenting that.

I've read the discussion and I didn't find a justification for that position.

The fact that close of a channel with pending sends is considered invalid Go code would certainly violate my assumptions, given that raison d'etre for channels to allow thread-safe exchange of data between goroutines.

I claim that the above is a valid Go code and there's a bug in Go's implementation of close/send or implementation of race detector.

Maybe I'm wrong but I would like someone involved in the spec/compiler/race detector to opine on that. Without that closing this bug seems premature.

@kjk kjk changed the title close(ch) while active ch <- causes data race Close of a channel with pending sends incorrectly flagged by race detector Feb 24, 2019

@kjk

This comment has been minimized.

Copy link
Author

commented Feb 24, 2019

Furthermore, #27070 describes a similar race between close and len/cap which was deemed a bug of enough severity that the fix was backported to 1.11.

@kjk

This comment has been minimized.

Copy link
Author

commented Feb 24, 2019

I can't quite match up the line numbers in chan.go but chansend() is reading c.closed outside of c.lock while closechan() is writing to it inside c.lock.

That suggests the intent was read/writes of c.closed to be protected and chansend() failing to do so is a bug that would manifest as a data race.

Furthermore chanrecv() uses atomic.Load() to read c.closed further muddying the water as to how the variable is supposed to be used.

Presumably always using atomic operations for c.closed would be the right thing to do.

@as

This comment has been minimized.

Copy link
Contributor

commented Feb 25, 2019

There should be two checks. One fast path outside the critical section and another one inside. I'm not sure where there's suddenly an atomic being used in chanrecv. This code in particular always seemed to avoid atomics for checking closed.

@ALTree

This comment has been minimized.

Copy link
Member

commented Feb 25, 2019

Reopening this since @odeke-em said "feel free to reopen this one" and the issue's author has requested it.

@ALTree ALTree reopened this Feb 25, 2019

@ALTree ALTree changed the title Close of a channel with pending sends incorrectly flagged by race detector runtime: close of a channel with pending sends incorrectly flagged by race detector Feb 25, 2019

@ALTree ALTree added this to the Go1.13 milestone Feb 25, 2019

@kjk

This comment has been minimized.

Copy link
Author

commented Feb 25, 2019

@as Not sure how to parse this response.

A write to c.closed happens within a lock. A read of c.closed happens outside of a lock.

As far as I can tell, if that happens on different threads, it's a data race.

Here's chan.go code stripped down:

func chansend(c *hchan, ep unsafe.Pointer, block bool, callerpc uintptr) bool {
	if c.closed == 0 { // concurrent read on another thread, data race
		return false
	}
	lock(&c.lock)

	if c.closed != 0 {
		unlock(&c.lock)
		panic(plainError("send on closed channel"))
	}
}

func closechan(c *hchan) {
	lock(&c.lock)
	if c.closed != 0 {
		unlock(&c.lock)
		panic(plainError("close of closed channel"))
	}
	c.closed = 1 // write on one thread
}
@bcmills

This comment has been minimized.

Copy link
Member

commented Feb 25, 2019

Where does the spec declare this code as generating a data race?

See https://golang.org/ref/mem#tmp_7.

The happens-before relationship for channels is only defined between corresponding sends and receives, and between a close and the receives that observe the corresponding zero-value. There is no happens-before relationship between a send and a close, so such an event is a data race.

Data races are defined by the memory model, not what the concrete channel implementation in a given version of the Go runtime happens to do: reads and writes to c.closed in the implementation are not particularly relevant.

@bcmills bcmills closed this Feb 25, 2019

@kjk

This comment has been minimized.

Copy link
Author

commented Feb 25, 2019

Sorry for pressing this but I don't feel this was fully addressed and would appreciate clarification on 3 points.

1. What this implies is: closing a channel with pending sends is incorrect Go code?

If yes, how is one supposed to safely close such channel?

If there is no safe way, it implies that this is invalid concurrency patter in Go: I launch goroutines to send, they block because no-one is reading yet and I want to terminate the communication by closing the channel? This is the essence of my program is doing.

The implication is that if I ever want to close a channel, I'm not supposed to do a send on that channel on another goroutine.

2. A race between len and close was considered a bug (#27070) by @ianlancetaylor . Not by literal reading of the spec but presumably because of practical implications. On what basis a race between send and close is different?

3. A data race in the implementation of channel send and receive is relevant, because:

a) it's a bug in itself
b) it might be the root cause of the issue

Am I wrong that racing on c.closed is a bug that should be fixed?

@kjk

This comment has been minimized.

Copy link
Author

commented Feb 25, 2019

And to quote @ncw from #27070:

I think that any channel operations causing the race detector to moan is surprising.

@bcmills

This comment has been minimized.

Copy link
Member

commented Feb 25, 2019

  1. What this implies is: closing a channel with pending sends is incorrect Go code?

Yes.

If yes, how is one supposed to safely close such channel?

Wait for the senders to complete. (For your original example, that might look something like https://play.golang.org/p/4LtjIUA0tla.)

If there is no safe way, it implies that this is invalid concurrency [pattern] in Go: I launch goroutines to send, they block because no-one is reading yet and I want to terminate the communication by closing the channel?

Correct, that pattern is invalid. You can terminate communication by closing a channel on which the goroutines are receiving, but not one on which they are sending.

Note that in your original example, the goroutines are not terminated: if any are remaining, they either leak or panic. (See https://play.golang.org/p/2Tjo2a3Q3iN and https://play.golang.org/p/sHi6fxSgEwv respectively.)

@bcmills

This comment has been minimized.

Copy link
Member

commented Feb 25, 2019

  1. A race between len and close was considered a bug […]. On what basis a race between send and close is different?

len does not change its behavior depending on whether the channel is closed (https://play.golang.org/p/WVYBJCwZf8q).

@bcmills

This comment has been minimized.

Copy link
Member

commented Feb 25, 2019

Am I wrong that racing on c.closed is a bug that should be fixed?

Since a send concurrent with a close is defined to be a data race, the implementation of chansend may assume that the closed field is not being written concurrently.

In contrast, closechan must acquire the lock in order to avoid races with receive operations.

@randall77

This comment has been minimized.

Copy link
Contributor

commented Feb 25, 2019

Sorry for pressing this but I don't feel this was fully addressed and would appreciate clarification on 3 points.

  1. What this implies is: closing a channel with pending sends is incorrect Go code?

Yes. In addition to what Brian said, consider the following two operations happening concurrently:

c <- x
close(c)

If the send happens first, both operations complete fine, and the channel reader sees x, then end-of-channel.
If the close happens first, the send panics and the channel reader sees just end-of-channel.
Those are wildly different behaviors, and the choice between those behaviors is made by the scheduler (which the user has no control over). One can of course design a program so that it can handle either behavior, but normally this situation is just a bug waiting to happen.

If yes, how is one supposed to safely close such channel?

See Brian's answer.

If there is no safe way, it implies that this is invalid concurrency patter in Go: I launch goroutines to send, they block because no-one is reading yet and I want to terminate the communication by closing the channel? This is the essence of my program is doing.

The implication is that if I ever want to close a channel, I'm not supposed to do a send on that channel on another goroutine.

  1. A race between len and close was considered a bug (#27070) by @ianlancetaylor . Not by literal reading of the spec but presumably because of practical implications. On what basis a race between send and close is different?

Consider the following two operations happening concurrently:

close(c)
... = len(c)

We get the same final result no matter which order those two operations happen. So it's unlikely to be a bug in the user's program. (Digression: this is kind of a weird case, because presumably there is a receiver around somewhere, and the receive and len race with each other. So it's only odd cases where #27070 matters.)

  1. A data race in the implementation of channel send and receive is relevant, because:

a) it's a bug in itself
b) it might be the root cause of the issue

Am I wrong that racing on c.closed is a bug that should be fixed?

The runtime has all sorts of data races in it. The runtime doesn't depend on the Go memory model, but the memory model of the underlying hardware. The particular case of reading c.closed outside of a lock is very tricky, as evidenced by the long comment before the c.closed==0 test in chansend.

I'm pretty sure there isn't a bug there, but I share your squeamishness. Sometimes we go squeamish to go fast...
It's definitely not (b) though, the implementation is separate from the language semantics.

@kjk

This comment has been minimized.

Copy link
Author

commented Feb 25, 2019

I believe the justification for fixing #27070 was:

I think that any channel operations causing the race detector to moan is surprising.
which applies equally here.

As to closing the channel: in my case, there is no guarantee that readers will drain the channel hence I would deadlock waiting for sends to complete.

For context: it's a NOSQL/JSON document database driver where a client can ask to be notified about future changes in the database, for example:

func (c *DatabaseChanges) ForAllDocuments() (chan *DocumentChange, CancelFunc, error)

Those changes are async by nature (might or might not happen at some point in the future) so It seemed to me that the idiomatic way to solve that in Go would be to return a channel with those changes.

My driver code has no control over when / if the client will read those notifications.

For that reason, I do each send on a goroutine, to not block the thread that talks to the database. When the client calls CancelFunc or closes the session that owns those notifications, I want to terminate those sends and potential readers, which close does.

I'm fine both with the panic that results from send on a close channel and with indeterminism of how many notifications will end read by the client.

My point here is that this seems like a reasonable thing to do yet seemingly impossible to do without a data race (which I wouldn't care about because it is seemingly benign, but it does cause tests to fail).

The "wait for readers to drain the channel" solution is not applicable here.

And sure, I could re-architect this to e.g. use a callback function or implement a thread-safe queue of messages but channel seemed like the most elegant solution.

@bcmills

This comment has been minimized.

Copy link
Member

commented Feb 25, 2019

And sure, I could re-architect this to e.g. use a callback function or implement a thread-safe queue of messages but channel seemed like the most elegant solution.

The usual approach would be a synchronous callback function or explicit iterator. Channels in exported APIs are rarely the most elegant solution.

(For more detail, see the first section of Rethinking Classical Concurrency Patterns.)

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Feb 25, 2019

The only reason to close a channel is to tell the receiver that there is nothing left to receive. Therefore, closing a channel should always be thought of a send operation on a channel, and only the sender should close a channel. When you have multiple senders, they must coordinate when to close the channel.

Note also that sending on a closed channel causes a run-time panic. The most obvious way to see that your original code has a race is to remove the call to time.Sleep. That program will panic, because it has a race condition. You can never remove a race condition by adding a time.Sleep call.

@kjk

This comment has been minimized.

Copy link
Author

commented Feb 25, 2019

I feel like those justifications are mixing up different kind of races.

This bug report is about data race as detected by race detector. An implementation of channel send and receive in the runtime is doing reads and writes to the same memory location and race detector flags it because the cpu considers this to produce an undefined result.

The race you're talking about is about not being able to say if "a" happens before "b" or "b" happens before "a". It's a logical race related to ordering. It might or might not be a bug in my code, depending on whether I rely on the order or not. In my particular case I do not so it's not a bug.

While I appreciate you take the time to offer guidance on how to structure the code, the snippet I provided is a minimized test case to demonstrate a data race in the runtime. As I described elsewhere in my real code I can't coordinate senders because they are blocked on readers that might never read. It might be bad code but the issue it surfaced is whether a close is safe for concurrent use.

I understand that this generates a panic because the spec says so. I expect it, I handle it, it doesn't make the program incorrect as per Go spec.

I concede that the spec allows it:

A single channel may be used in send statements, receive operations, and calls to the built-in functions cap and len by any number of goroutines without further synchronization.

By not including close in the verbiage the spec requires synchronization between close and send.

But it also requires synchronization between close and a receive, which, I hope we agree, is safe to do without further synchronization and would be a deal breaker in practice to have it work any other way.

To sum up: the spec requires synchronization between close and all other channel operations but in practice, the implementation is safe for concurrent use of close with receive/len/cap but not send.

Additionally, a race between close and len was fixed recently even though, as far as I can tell, the spec treats them the same.

@randall77

This comment has been minimized.

Copy link
Contributor

commented Feb 25, 2019

This bug report is about data race as detected by race detector. An implementation of channel send and receive in the runtime is doing reads and writes to the same memory location and race detector flags it because the cpu considers this to produce an undefined result.

(I assume you mean send and close, not send and receive?)

The runtime arranges to have closechan write a location that chansend reads on purpose. This isn't an accidental race detection. It's not a bug, it's intentional. There are even tests: runtime/race/testdata/chan_test.go:TestRaceChanSyncCloseSend and TestRaceChanAsyncCloseSend.

It might or might not be a bug in my code, depending on whether I rely on the order or not. In my particular case I do not so it's not a bug.

The problem is that for almost everyone else, it is a bug. It is a known gotcha for inexperienced Go programmers. Their program works fine while testing, but then in production the scheduler starts doing something differently and their program panics. We're very on board with the fact that the race detector helps us detect this issue more reliably in testing.

As I described elsewhere in my real code I can't coordinate senders because they are blocked on readers that might never read.

If they are blocked on channel reads, then you can use a select on a cancellation channel to get them out of that read.
If they are blocked on an io.Reader, then it is trickier. It would be nice to fix #20280 so io.Reader operations are cancellable.

You're right that the spec is somewhat muddied in where exactly a race should be reported, and where it shouldn't. But we definitely want a race between send and close. Not because of the spec, but because reporting that race helps real users fix real bugs.

@go101

This comment has been minimized.

Copy link

commented Feb 26, 2019

I do agree closing a channel with pending sends is incorrect bad Go code.
However, I also think @kjk's opinion is not totally unreasonable.
How about I do anticipate a send would panic on closing a channel?
Like the following code shows:

package main

import (
	"fmt"
	"time"
)

func main() {
	ch := make(chan string)
	for i := 0; i < 256; i++ {
		go func(n int) {
			defer func() {
				recover()
			}()
			ch <- fmt.Sprintf("string %d", n)
		}(i)
	}
	time.Sleep(time.Second)
	close(ch)
}

And I still feel this issue is a dup of #27769

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.