Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: network connection timeout reported too late in a GUI application #50356

Closed
zhuah opened this issue Dec 27, 2021 · 17 comments
Closed

runtime: network connection timeout reported too late in a GUI application #50356

zhuah opened this issue Dec 27, 2021 · 17 comments
Labels
WaitingForInfo

Comments

@zhuah
Copy link

@zhuah zhuah commented Dec 27, 2021

What version of Go are you using (go version)?

$ go version
go version go1.17.5 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/zhuah/Library/Caches/go-build"
GOENV="/Users/zhuah/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/zhuah/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/zhuah/go"
GOPRIVATE=""
GOPROXY="https://goproxy.io,direct"
GOROOT="/usr/local/Cellar/go/1.17.5/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.17.5/libexec/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.17.5"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/zhuah/Projects/test/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/dl/1ntmg7z54yx9c3b5p9s18r8c0000gn/T/go-build1002202817=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I created a gui app, which embedded a server, but the network connection will accidentally timeout after running a while.

i have create a simple demo to reproduce it:

package main

import (
	"encoding/binary"
	"fmt"
	"github.com/go-gl/glfw/v3.3/glfw"
	"io"
	"log"
	"math/rand"
	"net"
	"os"
	"runtime"
	"sync"
	"time"
)

type Conn struct {
	mu   sync.Mutex
	conn net.Conn
}

func NewConn(conn net.Conn) *Conn {
	c := &Conn{
		conn: conn,
	}

	return c
}

func (c *Conn) Close() error {
	return c.conn.Close()
}

func (c *Conn) writeChunk(b []byte) error {
	err := binary.Write(c.conn, binary.BigEndian, uint32(len(b)))
	if err != nil {
		return err
	}
	_, err = c.conn.Write(b)
	if err != nil {
		if os.IsTimeout(err) {
			// not recoverable from timeout as we have written the payload length or payload content
			return fmt.Errorf("write failed: %s", err)
		}
		return err
	}
	return nil
}

func (c *Conn) readChunk() ([]byte, error) {
	var length uint32
	err := binary.Read(c.conn, binary.BigEndian, &length)
	if err != nil {
		return nil, err
	}
	buf := make([]byte, length)
	_, err = io.ReadFull(c.conn, buf)
	if err != nil {
		if os.IsTimeout(err) {
			// not recoverable from timeout as we have read the payload length
			return nil, fmt.Errorf("read chunk failed: %s", err)
		}
		return nil, err
	}
	return buf, nil
}

const (
	msgKindPing = iota
	msgKindPong
)

func (c *Conn) writeMsg(kind uint8) {
	var b [1]byte
	b[0] = kind
	c.mu.Lock()
	c.writeChunk(b[:1])
	c.mu.Unlock()
}

func (c *Conn) Serve() error {
	defer func() {
		c.Close()
	}()

	timeoutDur := 3*time.Second + time.Duration(rand.Intn(300)+100)*time.Millisecond
	var timeout bool

	for {
		start := time.Now()
		deadline := start.Add(timeoutDur)
		err := c.conn.SetReadDeadline(deadline)
		if err != nil {
			return err
		}
		data, err := c.readChunk()
		if time.Since(start) > 10*time.Second {
			return fmt.Errorf("sleeps too long: %s, %s, %s", time.Since(start), err, deadline.Sub(start))
		}
		if err != nil {
			if !os.IsTimeout(err) || timeout {
				return err
			}
			c.writeMsg(msgKindPing)
			timeout = true
			continue
		}
		timeout = false
		if len(data) == 0 {
			continue
		}

		switch data[0] {
		case msgKindPing:
			c.writeMsg(msgKindPong)
		case msgKindPong:
		}
	}
}

func main() {
	runtime.LockOSThread()

	err := glfw.Init()
	if err != nil {
		log.Fatalln(err)
	}
	defer glfw.Terminate()

	w, err := glfw.CreateWindow(300, 300, "Test", nil, nil)
	if err != nil {
		log.Fatalln(err)
	}
	defer w.Destroy()

	addr := fmt.Sprintf("127.0.0.1:%d", rand.Intn(100)+5000)
	l, err := net.Listen("tcp", addr)
	if err != nil {
		log.Fatalln(err)
		return
	}
	defer l.Close()

	log.Println("program started:", time.Now().Format("2006/01/02 15:04:05"))
	go func() {
		conn, err := l.Accept()
		if err != nil {
			log.Fatalln(err)
		}
		c := NewConn(conn)
		err = c.Serve()
		log.Fatalln("server exit:", err)
	}()
	go func() {
		conn, err := net.Dial("tcp", addr)
		if err != nil {
			log.Fatalln(err)
		}
		c := NewConn(conn)
		err = c.Serve()
		log.Fatalln("client exit:", err)
	}()
	for !w.ShouldClose() {
		glfw.WaitEvents()
	}
}

Steps to reproduce: running the program, then put the glfw window to background, wait for a while(1min ~2min),
the program will exited with error:

2021/12/27 13:09:52 program started: 2021/12/27 13:09:52
2021/12/27 13:10:56 server exit: sleeps too long: 10.039270984s, read tcp 127.0.0.1:5081->127.0.0.1:55235: i/o timeout, 3.147s
2021/12/27 13:10:56 client exit: sleeps too long: 10.039481846s, read tcp 127.0.0.1:55235->127.0.0.1:5081: i/o timeout, 3.187s

if i remove this block:

		if time.Since(start) > 10*time.Second {
			return fmt.Errorf("sleeps too long: %s, %s, %s", time.Since(start), err, deadline.Sub(start))
		}

then the connection will timeout twice, which also cause it to be closed.

It seems only occurred in GUI app, i can't reproduce it in a normal server program without GUI.

What did you expect to see?

What did you see instead?

@thanm thanm added the NeedsInvestigation label Dec 28, 2021
@thanm
Copy link
Contributor

@thanm thanm commented Dec 28, 2021

Thanks for the report.

What is it that brings you to the conclusion that this is a Go problem, as opposed to a problem with glfw?

@thanm thanm added WaitingForInfo and removed NeedsInvestigation labels Dec 28, 2021
@zhuah
Copy link
Author

@zhuah zhuah commented Dec 28, 2021

my own code doesn't uses glfw, but directly cocoa api, someday i find this problem, and reproduce it with glfw.

i could only think that the problem may be come from go runtime currently, as it stuck at the read call and doesn't wake up even deadline reached.

there may be some conflicts with glfw's waitevents and go runtime, but i couldn't find and solve it

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Dec 28, 2021

What happens if you simply omit the glfw calls? They don't seem to have anything to do with the rest of your program.

@zhuah
Copy link
Author

@zhuah zhuah commented Dec 29, 2021

It works well if i remove glfw related code and replace glfw.WaitEvents with select {}.

the demo i provided is just to reproduce this issue. it's actually extracted from my bidirectional rpc library, when connection established, both client and server conn could receive or send rpc call. the demo only contains the heartbeat checking

@zhuah
Copy link
Author

@zhuah zhuah commented Dec 29, 2021

NSEvent *event = [NSApp nextEventMatchingMask:NSEventMaskAny
                                        untilDate:[NSDate distantFuture]
                                           inMode:NSDefaultRunLoopMode
                                          dequeue:YES];

both glfw and my code contains this code to block on event loop, does it conflict with go runtime?

@zhuah
Copy link
Author

@zhuah zhuah commented Dec 29, 2021

I have just port this demo to gioui and also found this issue, which internally uses [NSApp run] https://github.com/gioui/gio/blob/main/app/os_macos.m#L292.

I have imagine that the bug may come from my code, but as i said before, it works well if i remove GUI related code.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Dec 29, 2021

If glfw blocks in the event loop in a way that Go does not understand, that will indeed break timeouts and scheduling in general. That said, it's not obvious to me why that is happening here.

@zhuah
Copy link
Author

@zhuah zhuah commented Dec 29, 2021

I have make a more simple demo:

package main

import (
	"github.com/go-gl/glfw/v3.3/glfw"
	"log"
	"runtime"
	"time"
)

func main() {
	runtime.LockOSThread()

	err := glfw.Init()
	if err != nil {
		log.Fatalln(err)
	}
	defer glfw.Terminate()

	w, err := glfw.CreateWindow(300, 300, "Test", nil, nil)
	if err != nil {
		log.Fatalln(err)
	}
	defer w.Destroy()
	go func() {
		ticker := time.NewTicker(time.Second * 3)
		for range ticker.C {
			log.Println("timer")
		}
	}()
	for !w.ShouldClose() {
		glfw.WaitEvents()
	}
}

it just contains a timer to print timer every 3 seconds,

then i run this program and put the opened window into background,
here are the outputs:

2021/12/29 11:10:30 timer
2021/12/29 11:10:33 timer
2021/12/29 11:10:36 timer
2021/12/29 11:10:39 timer
2021/12/29 11:10:42 timer
2021/12/29 11:10:45 timer
2021/12/29 11:10:48 timer
2021/12/29 11:10:51 timer
2021/12/29 11:10:54 timer
2021/12/29 11:10:57 timer
2021/12/29 11:11:00 timer
2021/12/29 11:11:03 timer
2021/12/29 11:11:06 timer
2021/12/29 11:11:09 timer
2021/12/29 11:11:22 timer
2021/12/29 11:11:34 timer
2021/12/29 11:11:36 timer
2021/12/29 11:11:39 timer
2021/12/29 11:11:52 timer
2021/12/29 11:11:57 timer
2021/12/29 11:11:57 timer
2021/12/29 11:12:00 timer
2021/12/29 11:12:03 timer
2021/12/29 11:12:06 timer
2021/12/29 11:12:11 timer
2021/12/29 11:12:12 timer
2021/12/29 11:12:15 timer
2021/12/29 11:12:18 timer
2021/12/29 11:12:21 timer
2021/12/29 11:12:34 timer
2021/12/29 11:12:42 timer
2021/12/29 11:12:44 timer
2021/12/29 11:12:55 timer
2021/12/29 11:13:07 timer
2021/12/29 11:13:15 timer

the runtime scheduling are breaking:

2021/12/29 11:11:09 timer
2021/12/29 11:11:22 timer
2021/12/29 11:11:34 timer
2021/12/29 11:11:36 timer
2021/12/29 11:11:39 timer
2021/12/29 11:11:52 timer
2021/12/29 11:11:57 timer
2021/12/29 11:11:57 timer

@zhuah
Copy link
Author

@zhuah zhuah commented Dec 29, 2021

there seems no issue if i keeping the window on foreground or wake it up from background.
i wondering that there may some scheduling priority in Apple? @ianlancetaylor

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Dec 29, 2021

I'm sorry, I have no idea.

It appears that the problem is related to glfw, so perhaps this bug should move to the glfw bug tracker so that the glfw maintainers can take a look at it.

@zhuah
Copy link
Author

@zhuah zhuah commented Dec 29, 2021

it's also occurred with gioui(https://github.com/gioui/gio), which doesn't uses glfw.

package main

// A simple Gio program. See https://gioui.org for more information.

import (
	"log"
	"os"
	"time"

	"gioui.org/app"
	"gioui.org/io/system"
)

func main() {
	go func() {
		ticker := time.NewTicker(time.Second * 3)
		for range ticker.C {
			log.Println("timer")
		}
	}()
	go func() {
		w := app.NewWindow()
		if err := loop(w); err != nil {
			log.Fatal(err)
		}
		os.Exit(0)
	}()
	app.Main()
}

func loop(w *app.Window) error {
	for {
		e := <-w.Events()
		switch e := e.(type) {
		case system.DestroyEvent:
			return e.Err
		case system.FrameEvent:
		}
	}
}

the problem may come from Apple's Cocoa framework or macOS itself.

@zhuah
Copy link
Author

@zhuah zhuah commented Dec 29, 2021

A new demo directly calls Cocoa api:

package main

/*
#include <AppKit/AppKit.h>
#cgo CFLAGS: -x objective-c
#cgo LDFLAGS: -framework AppKit

@interface AppDelegate: NSObject <NSApplicationDelegate>
@end
@implementation AppDelegate

- (void)applicationDidFinishLaunching:(NSNotification *)notification {
	NSWindow *w = [[NSWindow alloc] initWithContentRect:NSMakeRect(0, 0, 300,300)
											  styleMask:NSWindowStyleMaskTitled|NSWindowStyleMaskClosable|NSWindowStyleMaskMiniaturizable|NSWindowStyleMaskResizable
                                                backing:NSBackingStoreBuffered
												  defer:FALSE];
	[w center];
	[w makeKeyAndOrderFront:w];
	[NSApp setActivationPolicy:NSApplicationActivationPolicyRegular];
	[NSApp activateIgnoringOtherApps:TRUE];
}

@end

void runApp() {
	[NSApplication sharedApplication];
	AppDelegate *delegate = [[AppDelegate alloc] init];
	[NSApp setDelegate:delegate];
	[NSApp run];
}
*/
import "C"
import (
	"log"
	"runtime"
	"time"
)

func main() {
	runtime.LockOSThread()
	go func() {
		ticker := time.NewTicker(time.Second * 3)
		for range ticker.C {
			log.Println("timer")
		}
	}()
	C.runApp()
}

@zhuah
Copy link
Author

@zhuah zhuah commented Dec 29, 2021

I added some logs around this line in go runtime

if semasleep(ns) >= 0 {

  	start:=nanotime()
		println("sleepStart",ns)
		res:= semasleep(ns)
		println("sleepDone",nanotime()-start)

the logs shows that it may sleep extra 10s:

    sleepStart 1355733237
	2021/12/29 14:52:36 timer
	sleepDone 11355835169

@beoran
Copy link

@beoran beoran commented Dec 30, 2021

The problem is that you are not using lockosthread correctly. To run the GUI on the main thread you need to do it slightly different, like by using this library: https://github.com/faiface/mainthread. Otherwise the goroutine you start will be on the same main OS thread as the GUI library, which leads to the Goroutine getting blocked due to GUI library blocking that thread.

Edit: also the is thread needs to be locked in an init function, and not in main, where we might already not be on the main thread anymore.

@zhuah
Copy link
Author

@zhuah zhuah commented Dec 31, 2021

@ianlancetaylor thank you, i have found the answer here https://stackoverflow.com/a/49681111/4023435, it's because of App Nap.

@beoran it should be ok to LockOSThread in main function, in GUI world, the main thread, aka ui thread is where you create the Application and Windows, instead of the startup thread. Please correct me if i'm wrong.

I'm going close this issue, thanks all for help.

@zhuah zhuah closed this as completed Dec 31, 2021
@beoran
Copy link

@beoran beoran commented Dec 31, 2021

@zhuah It depends on the operating and windowing system. For example, on Linux with X11 it is ok to run the GUI on the same OS thread that started the GUI, but on other OS or windowing systems all graphics must be run from the first main thread only, and then you need the library above.

But yes, in this case the problem was due to app nap. Similar problems can arise on mobile systems that agressively attempt to save power.

@zhuah
Copy link
Author

@zhuah zhuah commented Dec 31, 2021

@beoran Got it, thank you !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
WaitingForInfo
Projects
None yet
Development

No branches or pull requests

4 participants