Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: send to empty buffered channel can fail on select statement with default case #48433

Closed
changkun opened this issue Sep 17, 2021 · 12 comments
Closed

Comments

@changkun
Copy link
Contributor

@changkun changkun commented Sep 17, 2021

What version of Go are you using (go version)?

$ go version
go version devel go1.18-9133245be7 Sat Sep 4 20:35:25 2021 +0000 darwin/arm64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="auto"
GOARCH="arm64"
GOBIN=""
GOCACHE="/Users/changkun/Library/Caches/go-build"
GOENV="/Users/changkun/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/changkun/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/changkun/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/changkun/dev/godev/go-gerrit"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/changkun/dev/godev/go-gerrit/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="devel go1.18-9133245be7 Sat Sep 4 20:35:25 2021 +0000"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/g2/6fmr1qzx0ns3shq74zrp6bd40000gn/T/go-build1056299778=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

In the practice of using golang.org/x/mobile, we observed #48434 that the rendering loop in the golang.org/x/mobile/app package can pause. After debugging and logging (see CL 350212), we captured a log case which seems to suggest the following situation may execute the default case:

ch := make(chan struct{}, 1)
select {
case ch <- struct{}{}:
default:
    panic("should be impossible?")
}

What did you expect to see?

always do the case of send.

What did you see instead?

the default case is executed

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Sep 17, 2021

That would certainly be a very serious problem, but I think we're going to need a test case.

Loading

@changkun
Copy link
Contributor Author

@changkun changkun commented Sep 17, 2021

I really wish I could write a minimal test case, but sadly the problem only happens occasionally in a very complex application.
Based on the current collected log, and a logical causality analysis on it, it eventually leads me to open this issue.

Nevertheless, I've written a minimum example to demonstrate the key components. Again, this minimum example does not reproduce the problem (at least I haven't run long enough to really see if it can reproduce):

package main

/*
void processGLcall() {
	for (int i = 0; i < 100000; i++) {
		// simulate some work
	}
}
*/
import "C"
import (
	"math/rand"
	"time"
)

func init() {
	rand.Seed(time.Now().Unix())
}

func main() {
	go G1()
	go G2()
	select {}
}

type fn struct{ blocking bool }

var (
	workAvailable = make(chan struct{}, 1)
	work          = make(chan fn, 1)
	retvalue      = make(chan int)
)

func G1() {
	tk := time.NewTicker(time.Second / 30)
	for range tk.C {
		select {
		case <-workAvailable:
			println("G1: onDraw")
			doWork()
		case <-time.After(100 * time.Millisecond):
			println("G1: timeout")
		}
	}
}

func G2() {
	tk := time.NewTicker(time.Second / 30)
	for range tk.C {
		n := rand.Intn(1000)
		for i := 0; i < n; i++ {
			callFn(fn{blocking: rand.Intn(4) == 0})
		}
	}
}

func doWork() {
	for {
		select {
		case w := <-work:
			println("G1: doWork consume work")
			C.processGLcall()
			if w.blocking {
				println("G1: before returning a value")
				retvalue <- 1
				println("G1: after value returned")
			}
		default:
			println("G1: doWork returned")
			return
		}
	}
}

func callFn(c fn) {
	work <- c

	select {
	case workAvailable <- struct{}{}:
		println("G2: workAvaliable success")
	default:
		println("G2: workAvaliable failed")
	}

	if c.blocking {
		println("G2: callFn, wait for ret")
		<-retvalue
		println("G2: callFn, after get ret")
	}
}

and the log (collected from real work application) prints:

G2 workAvaliable success  (a)
G2 workAvaliable success  (b)
G2 workAvaliable failed   (c)
G1 doWork returned        (d)
G1 doWork returned        (e)
G2 workAvaliable success  (f)
G2 workAvaliable success  (g)
G1 onDraw                 (h)
G1 doWork returned        (i)
G1 doWork returned        (j)
G2 workAvaliable failed   (k)
G1 timeout                (l)
G1 timeout
G1 timeout
G1 timeout
G1 timeout
G1 timeout
...
onDraw is never executed again.

The workAvaliable failed in (k) seems to suggest the issue.

Loading

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Sep 17, 2021

The question for a failure like this is: how do you know for sure that the value was fetched out of the channel before the select statement that tried sending to the channel?

As far as I can tell, in your sample program, if G1 gets suspended for the times it takes to make two separate calls to callFn, then you will get this error. It's not likely. But I don't see anything preventing it, so it's possible. A time.Ticker is best effort in Go; there is no hard real time guarantee.

Loading

@changkun
Copy link
Contributor Author

@changkun changkun commented Sep 18, 2021

how do you know for sure that the value was fetched out of the channel before the select statement that tried sending to the channel?

This is easy to answer: Because the cap(workAvaliable) == 1, then workAvaliable success means its previous work <- c in the callFn succeed. Hence len(work) should be greater or equal than 1, and G1 will result in a doWork return. Therefore, (a) result in (d), (b) result in (e), (f) result in (i), and (g) result in (j). Therefore, after (j), workAvaliable must be empty. As we see (k) failed, hence leads to this issue.

if G1 gets suspended for the times it takes to make two separate calls to callFn, then you will get this error.

I am not quite sure I follow here. G1 can't make two separate calls to callFn, because cap(work) == 1.

A time.Ticker is best effort in Go; there is no hard real time guarantee.

the time.Ticker is for demonstration purposes. We don't need to discuss its guarantee here. In the real application, the G1 wakes periodically by OS, and executes the select statement inside the loop. (Note that OS failed to wake the G1 is a false statement, because the timeout keep getting printed hence the call is indeed get called)

Loading

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Sep 20, 2021

When I look at your small example program, I see this possible execution sequence:

  • G1 calls doWork
  • G2 calls callFn (let's say that blocking is false)
  • callFn sends c on the channel work
  • callFn sends an empty struct on the channel workAvailable
  • callFn returns
  • doWork receives w from the channel work
  • doWork is preempted before doing anything else
  • G2 calls callFn again
  • callFn sends c on the channel work (this succeeds because the channel is empty)
  • callFn tries to send an empty struct on the channel workAvailable
  • the send on workAvailable fails because the channel is full
  • callFn prints "G2: workAvaliable failed"

This sequence of events seems entirely possible to me, and it does not indicate any problem with the channel implementation.

Therefore, (a) result in (d), (b) result in (e), (f) result in (i), and (g) result in (j). Therefore, after (j), workAvaliable must be empty. As we see (k) failed, hence leads to this issue.

But doWork can read from the channel work before it prints anything. That means that work can be empty while workAvailable is not empty.

I understand that your small example program is not the real program. But perhaps the same kind of thing can happen in the real program. The problem in the small example program is that there is no synchronization between the work and workAvailable channels.

Loading

@changkun
Copy link
Contributor Author

@changkun changkun commented Sep 21, 2021

  • doWork receives w from the channel work
    ...
  • the send on workAvailable fails because the channel is full
    ...
    This sequence of events seems entirely possible to me

Isn't here leads to workAvailable was empty? How does the last send on workAvailable fail? Otherwise, doWork won't be called. The described case seems impossible to happen even there is a preemption.

But doWork can read from the channel work before it prints anything.

This seems not true neither. From the log, (j) happened, and a subsequent (h) did not appear, which should be executed before doWork's receive of the work channel. If doWork can read from the work channel, then a further "G1: onDraw" should, at least, be printed.

work can be empty while workAvailable is not empty

The work channel can be empty while workAvailable is not empty, but then in the next tick, workAvailable will be consumed, which won't lead to a forever freeze of G2. This is because a further callFn will enqueue another call (work is empty at the moment), and sends another signal to workAvailalbe, hence, in one more tick, the work channel will receive a new call and guarantee all works can proceed eventually. This contradicts the freeze we observed in practice.

I understand that your small example program is not the real program.

It is difficult to produce a minimum reproducer. As even in a bigger proglram, the problem reproduces rarely (but terrible disaster when occur). Quote "Reproduced this by leaving an app one hour after leaving that on Pixel 4a." from @hajimehoshi

The problem in the small example program is that there is no synchronization between the work and workAvailable channels.

See comment in above. I still think the synchronization is guaranteed implicitly (by causality).

Loading

@hajimehoshi
Copy link
Member

@hajimehoshi hajimehoshi commented Sep 21, 2021

It is difficult to produce a minimum reproducer. As even in a bigger proglram, the problem reproduces rarely (but terrible disaster when occur). Quote "Reproduced this by leaving an app one hour after leaving that on Pixel 4a." from @hajimehoshi

I remember this can be reproduced only with 5 or 10 minutes waiting by adjusting the workbufLen like this:
hajimehoshi/ebiten#1332 (comment)

My current vague guess is that this issue happens only on Android. I think we should make a minimized case by adjusting workbufLen and e.g., flooding GL commands...

Loading

@changkun
Copy link
Contributor Author

@changkun changkun commented Sep 21, 2021

I remember this can be reproduced only with 5 or 10 minutes waiting by adjusting the workbufLen like this:
hajimehoshi/ebiten#1332 (comment)

Thanks for the confirmation :) Sorry that I didn't read through the entire issue before.

My current vague guess is that this issue happens only on Android.

I can confirm the issue also happens on iOS (surprisingly not reproducible on macOS with M1 chip).

Loading

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Sep 21, 2021

Isn't here leads to workAvailable was empty? How does the last send on workAvailable fail? Otherwise, doWork won't be called. The described case seems impossible to happen even there is a preemption.

doWork is a loop that keeps reading from the work channel as long as there is something to read. There is no required correlation between the number of values that doWork reads from work and the number of values that G1 reads from workAvailable. it is possible for callFn to be called several times while doWork is in the loop reading from work, before doWork returns to G1.

If doWork can read from the work channel, then a further "G1: onDraw" should, at least, be printed.

Eventually, sure, but the delay between them can be arbitrary long.

This contradicts the freeze we observed in practice.

I haven't seen discussion of a freeze, though maybe I missed it. I thought the issue was that the select in callFn would sometimes choose the default case when the channel is empty. As far as I can tell that is entirely possible.

Loading

@changkun
Copy link
Contributor Author

@changkun changkun commented Sep 21, 2021

Eventually, sure, but the delay between them can be arbitrary long.

Sorry if the previous description was misleading, but certainly the "Eventually" means two ticks (clearly bounded less than a second).

As said previously, "If the delay between them can be arbitrary long" contradicts the current observation, except the runtime preemption can cause an arbitrarily long delay. If this is what is currently suggesting, it seems to be another serious problem in the runtime too.

I haven't seen discussion of a freeze, though maybe I missed it.

I am sorry again if the previous wording was a little bit misleading. The initial term that is used on top of the issue was "pause". I thought linking to another issue might clarify it more, but apparently not yet.

I thought the issue was that the select in callFn would sometimes choose the default case when the channel is empty.

Based on a causality analysis, this seems to be the only cause of a rendering freeze issue, see #48434.

As far as I can tell that is entirely possible.

See comments above. But allow me to clarify: it is entirely possible, if and only if the runtime preemption can cause an arbitrarily long delay to the scheduling of a goroutine. Is my understanding correct? If not, would you mind clarify the possibility again that leads to a freeze of the observation that doWork is never called again (after at least minutes)

Loading

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Sep 21, 2021

it is entirely possible, if and only if the runtime preemption can cause an arbitrarily long delay to the scheduling of a goroutine.

In a long-running program, there are many possible reasons why a goroutine may be arbitrarily delayed. For example, although the garbage collector is fast, it does freeze all other goroutines briefly. This requires preempting goroutines, which requires signals. The kernel can and does interpose arbitrary delays on signal delivery, depending on system load. Until the signal is delivered, the goroutine will keep running. If your program relies on any kind of synchronization, you can't assume anything about how long it takes a goroutine is delayed.

That said, I want to stress that I don't think this has anything to do with the behavior of your sample program. In your sample program calls to callFn, which write to work and workAvailable, are entirely independent of doWork, which reads from work in a loop. If there are two calls to callFn while doWork is running, then the second call to callFn will print G2: workAvaliable failed. Nothing prevents that. Discussions of tickers are irrelevant to that fact. That will happen if doWork is temporarily suspended while G2 calls callFn, and that can happen for any reason.

Loading

@changkun
Copy link
Contributor Author

@changkun changkun commented Sep 25, 2021

Unfortunately, I cannot find any reliable reproducer as a convincer in a short frame but only a log-based logical deduction. Close until I got any new evidence.

Loading

@changkun changkun closed this Sep 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants