
runtime: Windows 2008: all goroutines are asleep - deadlock #22394

Closed
felixge opened this Issue Oct 23, 2017 · 36 comments

felixge commented Oct 23, 2017

I already reported this issue in #20362 (which might be the same or related), but am opening a new issue as requested by @davecheney.

What version of Go are you using (go version)?

go version go1.9.1 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=".exe"
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="windows"
GOPATH="/dn"
GORACE=""
GOROOT="/usr/local/go19"
GOTOOLDIR="/usr/local/go19/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-m64 -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build141570660=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="0"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

I was able to reproduce the deadlock with a minimal test program that essentially just executes an HTTP GET request.

I ran the command in a loop and got the following results:

  • go1.7.5: No deadlocks
  • go1.8.4: No deadlocks (This means that the e4371fb aka go1.8beta2-389-ge4371fb179 commit mentioned in #20362 is probably not the root cause)
  • go1.9.1: Lots of deadlocks, see example

Notes:

  • I was able to reliably reproduce this on a 64-bit Win7 VM with Cygwin. It doesn't happen on every run of the program, but 20-50% of runs trigger the problem.
  • I was not able to reproduce this in a local 64-bit Win7 Professional VM (also using Cygwin)

What did you expect to see?

No deadlock.

What did you see instead?

https://gist.github.com/felixge/a2efcaf098e1b8fa0ec1f7d67fb2481c

davecheney commented Oct 23, 2017

felixge commented Oct 23, 2017

@davecheney it's a bit tricky, because this machine is in a factory in China and I don't have physical access to it. But I'll try to have somebody on-site execute the example program above using cmd.exe as soon as possible.

That being said, I'm curious: how is executing a Go program under Cygwin different from executing it in cmd.exe? The Go program is compiled under Linux (not in Cygwin). The only reason we're using Cygwin is that it's easier for us to manage machines running sshd under Cygwin than to manage Windows machines directly.

Edit: Not sure if it counts, but I just launched cmd.exe from Cygwin, executed the test case a few times, and was able to reproduce the deadlock there as well.

davecheney commented Oct 23, 2017

felixge commented Oct 23, 2017

@davecheney I got remote desktop access to the machine and was able to reproduce the deadlock panic using the reproduction program from this issue, as well as by running the httputil test from #20362. So I think Cygwin-related trouble can be ruled out.

Edit: Also managed to get a screenshot:

[screenshot]

rsc commented Oct 23, 2017

I think this is likely a dup of the underlying problem in #20362, but it's OK to keep discussing it here until we are sure. Thanks for the very short reproduction program and for checking it against Go 1.7 and Go 1.8.

Because this is new in Go 1.9 my first guess would be that the internal/poll changes introduced in Go 1.9 are not quite right on Windows. We've already found problems with them, which are slated to be fixed in Go 1.9.2. The interesting question is whether the fixes we have queued also address your issue, or else there are more issues lurking.

The stack trace makes it look like package net is waiting on network I/O (goroutine 4), but that fact has not been communicated properly to the runtime, which sees all goroutines asleep and exits.

Can you please try your test with Go 1.9.2rc2 (#22277 (comment)) and see if the problem is fixed there or perhaps changes in some way? Also, could you please

set GOTRACEBACK=system

in your environment before running the test? That will give us stack traces for the system goroutines as well, in particular the network poller (if any).

/cc @ianlancetaylor @alexbrainman

stevenh commented Oct 23, 2017

We've just hit this issue. I can confirm it affects non-Cygwin builds, and I've reproduced it with a slightly amended reproduction case.

package main
  
import (
        "flag"
        "io/ioutil"
        "log"
        "net/http"
        "os"
        "runtime"
)

func main() {
        var (
                url      = flag.String("url", "", "URL to upload to")
                attempts = flag.Int("attempts", 1, "max attempts")
        )

        log.Printf("%s\n", runtime.Version())
        flag.Parse()

        if err := run(*url, *attempts); err != nil {
                log.Println(err)
                os.Exit(1)
        }
}

func run(url string, attempts int) (err error) {
        for i := 0; i < attempts; i++ {
                res, err2 := http.Get(url)
                if err2 != nil {
                        log.Println(err2)
                        err = err2
                        continue
                }
                defer res.Body.Close()

                resData, err2 := ioutil.ReadAll(res.Body)
                if err2 != nil {
                        log.Println(err2)
                        err = err2
                        continue
                }
                log.Printf("%d %s\n", res.StatusCode, res.Status)
                log.Printf("%s\n", resData)
        }

        return err
}

While it doesn't happen every time, it's easy to reproduce, usually within 10 runs.

This is the output from a run with GOTRACEBACK=system set.

./test5.exe -url https://XXXXXXXXXXXXX -attempts 2
2017/10/23 17:10:38 go1.9rc2
2017/10/23 17:10:48 Get https://XXXXXXXXXXXXX: net/http: TLS handshake timeout
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [select]:
net/http.(*Transport).getConn(0x79dc60, 0xc042070ea0, 0x0, 0xc042008240, 0x5, 0xc04200c880, 0x17, 0x0, 0x0, 0x7bf040)
        /root/sdk/go1.9rc2/src/net/http/transport.go:948 +0x5c6
net/http.(*Transport).RoundTrip(0x79dc60, 0xc0420ea100, 0x79dc60, 0x0, 0x0)
        /root/sdk/go1.9rc2/src/net/http/transport.go:400 +0x6ad
net/http.send(0xc0420ea100, 0x76c960, 0x79dc60, 0x0, 0x0, 0x0, 0xc042004088, 0x100, 0xc042063b98, 0x1)
        /root/sdk/go1.9rc2/src/net/http/client.go:249 +0x1b0
net/http.(*Client).send(0x7a1700, 0xc0420ea100, 0x0, 0x0, 0x0, 0xc042004088, 0x0, 0x1, 0xc042063c20)
        /root/sdk/go1.9rc2/src/net/http/client.go:173 +0x104
net/http.(*Client).Do(0x7a1700, 0xc0420ea100, 0xc042008240, 0x2f, 0x0)
        /root/sdk/go1.9rc2/src/net/http/client.go:602 +0x294
net/http.(*Client).Get(0x7a1700, 0xc042008240, 0x2f, 0x55, 0x0, 0x0)
        /root/sdk/go1.9rc2/src/net/http/client.go:393 +0xae
net/http.Get(0xc042008240, 0x2f, 0x1, 0x76cc20, 0xc042070db0)
        /root/sdk/go1.9rc2/src/net/http/client.go:367 +0x48
main.run(0xc042008240, 0x2f, 0x2, 0x76cc20, 0xc042070db0)
        /data/go/src/github.com/multiplay/go/apps/test5/main.go:29 +0x8f
main.main()
        /data/go/src/github.com/multiplay/go/apps/test5/main.go:21 +0x16c

goroutine 12 [IO wait]:
internal/poll.runtime_pollWaitCanceled(0x1d74f20, 0x77)
        /root/sdk/go1.9rc2/src/runtime/netpoll.go:189 +0x40
internal/poll.(*pollDesc).waitCanceled(0xc04207ced8, 0x77)
        /root/sdk/go1.9rc2/src/internal/poll/fd_poll_runtime.go:101 +0x4e
internal/poll.(*ioSrv).ExecIO(0x79fea8, 0xc04207ce38, 0x671f60, 0xc042035200, 0xc042035208, 0xc0420351f8)
        /root/sdk/go1.9rc2/src/internal/poll/fd_windows.go:225 +0x238
internal/poll.(*FD).ConnectEx(0xc04207cd80, 0x76ce60, 0xc04200c8a0, 0xc04203ca20, 0xc04207cd80)
        /root/sdk/go1.9rc2/src/internal/poll/fd_windows.go:738 +0x80
net.(*netFD).connect(0xc04207cd80, 0x7701e0, 0xc04203ca20, 0x0, 0x0, 0x76ce60, 0xc04200c8a0, 0x0, 0x0, 0x0, ...)
        /root/sdk/go1.9rc2/src/net/fd_windows.go:116 +0x243
net.(*netFD).dial(0xc04207cd80, 0x7701e0, 0xc04203ca20, 0x7711e0, 0x0, 0x7711e0, 0xc042070ff0, 0xc0420353f8, 0x526905)
        /root/sdk/go1.9rc2/src/net/sock_posix.go:142 +0xf3
net.socket(0x7701e0, 0xc04203ca20, 0x65f97c, 0x3, 0x2, 0x1, 0x0, 0x0, 0x7711e0, 0x0, ...)
        /root/sdk/go1.9rc2/src/net/sock_posix.go:93 +0x1c1
net.internetSocket(0x7701e0, 0xc04203ca20, 0x65f97c, 0x3, 0x7711e0, 0x0, 0x7711e0, 0xc042070ff0, 0x1, 0x0, ...)
        /root/sdk/go1.9rc2/src/net/ipsock_posix.go:141 +0x158
net.doDialTCP(0x7701e0, 0xc04203ca20, 0x65f97c, 0x3, 0x0, 0xc042070ff0, 0x7bf040, 0x0, 0x0)
        /root/sdk/go1.9rc2/src/net/tcpsock_posix.go:62 +0xc0
net.dialTCP(0x7701e0, 0xc04203ca20, 0x65f97c, 0x3, 0x0, 0xc042070ff0, 0x0, 0xc042035630, 0x42a500)
        /root/sdk/go1.9rc2/src/net/tcpsock_posix.go:58 +0xeb
net.dialSingle(0x7701e0, 0xc04203ca20, 0xc042068280, 0x76e8e0, 0xc042070ff0, 0x0, 0x0, 0x0, 0x0)
        /root/sdk/go1.9rc2/src/net/dial.go:547 +0x3e9
net.dialSerial(0x7701e0, 0xc04203c900, 0xc042068280, 0xc0420ca9e0, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0)
        /root/sdk/go1.9rc2/src/net/dial.go:515 +0x24e
net.(*Dialer).DialContext(0xc04203c120, 0x7701a0, 0xc04200e1b0, 0x65f97c, 0x3, 0xc04200c880, 0x17, 0x0, 0x0, 0x0, ...)
        /root/sdk/go1.9rc2/src/net/dial.go:397 +0x6f5
net.(*Dialer).DialContext-fm(0x7701a0, 0xc04200e1b0, 0x65f97c, 0x3, 0xc04200c880, 0x17, 0x403580, 0x60, 0x5868c0, 0xc04200e1b0)
        /root/sdk/go1.9rc2/src/net/http/transport.go:46 +0x7a
net/http.(*Transport).dial(0x79dc60, 0x7701a0, 0xc04200e1b0, 0x65f97c, 0x3, 0xc04200c880, 0x17, 0x0, 0x0, 0x0, ...)
        /root/sdk/go1.9rc2/src/net/http/transport.go:884 +0x22a
net/http.(*Transport).dialConn(0x79dc60, 0x7701a0, 0xc04200e1b0, 0x0, 0xc042008240, 0x5, 0xc04200c880, 0x17, 0x0, 0x0, ...)
        /root/sdk/go1.9rc2/src/net/http/transport.go:1060 +0x1d69
net/http.(*Transport).getConn.func4(0x79dc60, 0x7701a0, 0xc04200e1b0, 0xc042070ed0, 0xc0420444e0)
        /root/sdk/go1.9rc2/src/net/http/transport.go:943 +0x7f
created by net/http.(*Transport).getConn
        /root/sdk/go1.9rc2/src/net/http/transport.go:942 +0x39a

goroutine 10 [IO wait]:
internal/poll.runtime_pollWaitCanceled(0x1d74fe0, 0x72)
        /root/sdk/go1.9rc2/src/runtime/netpoll.go:189 +0x40
internal/poll.(*pollDesc).waitCanceled(0xc04207cc98, 0x72)
        /root/sdk/go1.9rc2/src/internal/poll/fd_poll_runtime.go:101 +0x4e
internal/poll.(*ioSrv).ExecIO(0x79fea0, 0xc04207cb58, 0x671f68, 0x50, 0x1, 0x1003f0100000016)
        /root/sdk/go1.9rc2/src/internal/poll/fd_windows.go:225 +0x238
internal/poll.(*FD).Read(0xc04207cb40, 0xc042098800, 0x400, 0x400, 0x0, 0x0, 0x0)
        /root/sdk/go1.9rc2/src/internal/poll/fd_windows.go:439 +0x266
net.(*netFD).Read(0xc04207cb40, 0xc042098800, 0x400, 0x400, 0x0, 0xc042098800, 0x0)
        /root/sdk/go1.9rc2/src/net/fd_windows.go:151 +0x59
net.(*conn).Read(0xc042004068, 0xc042098800, 0x400, 0x400, 0x0, 0x0, 0x0)
        /root/sdk/go1.9rc2/src/net/net.go:176 +0x74
crypto/tls.(*block).readFromUntil(0xc042070d80, 0x1d750a0, 0xc042004068, 0x5, 0xc042004068, 0xc042016000)
        /root/sdk/go1.9rc2/src/crypto/tls/conn.go:488 +0x9c
crypto/tls.(*Conn).readRecord(0xc04202d500, 0xc042016016, 0x400, 0xbc)
        /root/sdk/go1.9rc2/src/crypto/tls/conn.go:590 +0xe7
crypto/tls.(*Conn).readHandshake(0xc04202d500, 0xc0420fa016, 0xc0420fa000, 0xb7, 0xb7)
        /root/sdk/go1.9rc2/src/crypto/tls/conn.go:943 +0x94
crypto/tls.(*Conn).clientHandshake(0xc04202d500, 0x672688, 0xc04202d620)
        /root/sdk/go1.9rc2/src/crypto/tls/handshake_client.go:159 +0x9d1
crypto/tls.(*Conn).Handshake(0xc04202d500, 0x0, 0x0)
        /root/sdk/go1.9rc2/src/crypto/tls/conn.go:1307 +0x196
net/http.(*Transport).dialConn.func3(0x0, 0xc04202d500, 0xc0420383c0, 0xc04203c720)
        /root/sdk/go1.9rc2/src/net/http/transport.go:1151 +0x49
created by net/http.(*Transport).dialConn
        /root/sdk/go1.9rc2/src/net/http/transport.go:1147 +0xdc5

goroutine 15 [chan receive]:
net.(*netFD).connect.func2(0x7701e0, 0xc04203ca20, 0xc04207cd80, 0xc042044660)
        /root/sdk/go1.9rc2/src/net/fd_windows.go:110 +0x16a
created by net.(*netFD).connect
        /root/sdk/go1.9rc2/src/net/fd_windows.go:104 +0x218

After the initial net/http: TLS handshake timeout, the network system seems dead in the water.

In our app, after we saw this error on a startup API call, everything stopped working: not just the client, but also the web server wired up to debug/pprof refused to respond, preventing us from obtaining a stack trace.

I'm wondering if it's related to the net polling fix: bf0f692

This currently seems to be Windows-specific; in our case it's Windows Web Server 2008 R2.

In case it's relevant: the test binary was cross-compiled on FreeBSD.

stevenh commented Oct 23, 2017

Testing go1.9.1 with bf0f692 reverted still exhibits the same behaviour, unfortunately :(

davecheney commented Oct 23, 2017

stevenh commented Oct 24, 2017

@rsc git bisect indicates the change at fault is likely e4371fb

In addition, we've only been able to reproduce this issue under Windows 2008 R2; Windows 10, 2012 R2, and 2016 appear to work OK.

ianlancetaylor changed the title from net/http: all goroutines are asleep - deadlock to runtime: Windows 2008: all goroutines are asleep - deadlock Oct 24, 2017

ianlancetaylor added this to the Go1.10 milestone Oct 24, 2017

stevenh commented Oct 24, 2017

It looks like e4371fb may contain an error in the conversion of the Go runtime.nanotime to assembly; specifically, it subtracts startNano here:
https://github.com/golang/go/blob/master/src/runtime/sys_windows_amd64.s#L489
which was not present in the original:
https://github.com/golang/go/blob/go1.8.4/src/runtime/os_windows.go#L604

This looks like it might be a cut-and-paste error from the monotonic component of time.now (aka time_now).

The following fixes the hangs for me on windows/amd64, although I'm still not clear why newer versions of Windows don't seem to be affected.

diff --git a/src/runtime/sys_windows_amd64.s b/src/runtime/sys_windows_amd64.s
index 744e78c708..fd26dfe75f 100644
--- a/src/runtime/sys_windows_amd64.s
+++ b/src/runtime/sys_windows_amd64.s
@@ -486,7 +486,6 @@ loop:
        SHLQ    $32, CX
        ORQ     BX, CX
        IMULQ   $100, CX
-       SUBQ    runtime·startNano(SB), CX
        MOVQ    CX, ret+0(FP)
        RET
 useQPC:
stevenh commented Oct 24, 2017

While this seems to fix the issue, it's now clear that this change was intentional, as it's explicitly mentioned, so more digging is required.

stevenh commented Oct 24, 2017

Just found an interesting difference between a machine which seems to work and one which doesn't: the monotonic time is always zero when running just the MonotonicString test on the broken machine.
Broken Machine

time.exe -test.v -test.run MonotonicString
=== RUN   TestMonotonicString
--- PASS: TestMonotonicString (0.00s)
        mono_test.go:250: Now() = 2017-10-24 15:20:05.1025971 -0700 PDT m=+0.000000000
PASS

Working Machine

time.exe -test.v -test.run MonotonicString
=== RUN   TestMonotonicString
--- PASS: TestMonotonicString (0.00s)
        mono_test.go:250: Now() = 2017-10-24 15:20:02.3098048 -0700 PDT m=+0.001954200
PASS

For reference, time.exe was built using Go 1.9.1 on the working machine and transferred to the broken machine.

Update: it's only zero when running just the MonotonicString test.

ianlancetaylor commented Oct 25, 2017

Thanks for looking into this. If the monotonic time is always zero, that is clearly a problem.

Can you see if https://golang.org/cl/73230 fixes this?

gopherbot commented Oct 25, 2017

Change https://golang.org/cl/73230 mentions this issue: runtime: if Windows interrupt time is zero, fall back

ianlancetaylor commented Oct 25, 2017

Never mind, that change clearly didn't work, though I don't know why.

stevenh commented Oct 25, 2017

To clarify: it's not always zero, but it does appear to always be zero when running that single test. That's quite different from other versions, where the same test has never printed zero in my runs.

I've not ruled out hardware differences causing this behaviour change yet.

I'm continuing to test, but with the change in nanotime() from absolute to relative I'm concerned we could be seeing a number of issues in places where zero was treated as a special case (e.g. not yet initialised), since a zero reading can now legitimately occur during early initialisation as well.

alexbrainman commented Oct 25, 2017

Never mind, that change clearly didn't work, though I don't know why.

I suspect you did not initialize qpcStartCounter and qpcMultiplier that are used by nowQPC and nanotimeQPC.

But I would like to understand the problem before we change the code to automatically use nowQPC and nanotimeQPC.

Alex

stevenh commented Oct 25, 2017

@alexbrainman the summary of the facts I've found so far:

  1. The issue results in very unstable apps built with Go 1.9+ on some Windows machines, typically deadlocking during network-related activity.
  2. A simple http.Get is enough to trigger the issue.
  3. git bisect identified e4371fb as the cause of the issue.
  4. go1.9.2rc2 does not fix the issue.
  5. Affected machines return monotonic time values of zero during early run stages according to time.TestMonotonicString, whereas unaffected machines do not.
  6. Removing SUBQ runtime·startNano(SB), CX from runtime.nanotime prevents the deadlock for the simple http.Get test.
alexbrainman commented Oct 25, 2017

Thank you @stevenh for the update. Unfortunately I don't have any bright ideas. It would be nice if I could reproduce it here. Maybe I will try and find a PC similar to yours. Is your PC a real PC or a VM?

Affected machines return monotonic time values of zero during early run stages according to time.TestMonotonicString, where those unaffected do not.

TestMonotonicString calls time.Now at the very beginning and then prints the value. It is reasonable for the monotonic time to be close to 0; maybe even 0 is OK if the monotonic time changes slowly. Maybe change the test to sleep a little before calling time.Now and see what it prints. Also, it might be useful to write a program that calls time.Now in a loop, see how often the monotonic time changes, and run it on both good and bad computers. It might give us some clues. Perhaps http.Get assumes that monotonic time increases reasonably quickly, but the real monotonic clock is slower than http.Get expects.

Alex

rsc commented Oct 25, 2017

I'm still a little confused about why timing matters here. What if you do

go func() {
    for { time.Sleep(1*time.Minute) }
}()

in main? That will certainly stop the deadlock from being diagnosed, since the program is no longer deadlocked. Does it also give the http.Get time to complete?

stevenh commented Oct 25, 2017

I've made some more progress with this; there is definitely something in the runtime that breaks if nanotime() returns zero. I proved this by applying the following, which prevents the issue:

diff --git a/src/runtime/time.go b/src/runtime/time.go
index abf200d7d3..e87b0fd461 100644
--- a/src/runtime/time.go
+++ b/src/runtime/time.go
@@ -309,4 +309,4 @@ func time_runtimeNano() int64 {
        return nanotime()
 }
 
-var startNano int64 = nanotime()
+var startNano int64 = nanotime() - 1

The reason this affects Windows 2008 boxes more than newer versions is that the default clock resolution on Windows 2008 is 15ms instead of the newer 1ms default, so there is a larger window during which nanotime() can return zero.

Two other things that prevent it, or at least prevent me from reproducing it, are:

  1. Setting a higher timer resolution e.g. 0.5ms vs 15ms default.
  2. Adding a small delay to the app before using any networking functions.

I'm currently working my way through the nanotime() consumers to try and identify which ones are impacted.

rsc commented Oct 25, 2017

I'm OK with that being the fix as long as we think that nanotime does in fact move forward eventually on the machines exhibiting the problem.

gopherbot commented Oct 25, 2017

Change https://golang.org/cl/73410 mentions this issue: runtime: avoid monotonic time zero on systems with low-res timers

rsc commented Oct 25, 2017

I sent CL 73410. I'm comfortable with that minimal change being in Go 1.9.2, which is scheduled to go out today. Thanks for identifying such a nice fix.

rsc modified the milestones: Go1.10, Go1.9.2 Oct 25, 2017

rsc commented Oct 25, 2017

Moved to Go 1.9.2 but waiting to mark CL as approved for Go 1.9.2 until @ianlancetaylor and @aclements look at it.

stevenh commented Oct 25, 2017

My hunch about a consumer of nanotime() treating zero as a special case was correct.

The runtime scheduler's sched.lastpoll does just that in several places: when it is zero, a network poll can be skipped, which leads to deadlock. I tested this by creating a nanotime2() that returns nanotime() + 1, which also fixes the issue.

What I would like to understand, @rsc, is why we want/need nanotime() to be relative to Go runtime start instead of just being a raw value, e.g. time since machine boot?

Currently this creates a disparity between platforms that use Go and those that use assembly for nanotime(); in addition, removing the subtraction would mean fewer operations, so a small speedup too. Could you enlighten me?

rsc commented Oct 25, 2017

@stevenh I care that monotonic time (what package time gets from runtime.time_now) is in a known limited range, so that comparison is well-defined and we don't have to worry about wraparound. Defining that it starts at process start time is a way to do that. System boot time might also be a way to do that, but not all operating systems make that guarantee. Rather than assume the guarantee or work out which systems do guarantee and which don't, we just define monotonic time as relative to our process start. That also reinforces the point that monotonic times are meaningless outside the current process.

The fact that nanotime either does or does not have the process start reference point is definitely confusing. I would not mind if the subtraction of startNano was pushed into the remaining implementations and removed from timestub.go, so that nanotime has process start = -1 on all systems.

stevenh commented Oct 25, 2017

@rsc thanks for clarification, makes total sense.

It would definitely be good to get all platforms aligned and this specific goal/criterion documented: not only would it eliminate the confusion, but it would also give the code a wider testing audience, which in turn should help prevent issues like this slipping into releases unnoticed.

aclements commented Oct 25, 2017

I would not mind if the subtraction of startNano was pushed into the remaining implementations and removed from timestub.go, so that nanotime has process start = -1 on all systems.

I'd be okay with this, too, though I would add two other things:

  • Buy into it completely and eliminate runtimeInitTime.
  • Either panic if nanotime is called before the zero time is established, or establish the zero so early that it's hard to accidentally call nanotime before that. (This was the source of #21554.)
rsc commented Oct 25, 2017

@aclements, both those SGTM. Want to open a separate issue for that? This issue will be closed when the fix lands in Go 1.9.2.

aclements commented Oct 25, 2017

Filed #22436.

gopherbot closed this in 9865821 Oct 25, 2017

rsc commented Oct 25, 2017

Per chat with @ianlancetaylor waiting for CL 73410 to pass the build dashboard, and then will approve for Go 1.9.2.

rsc reopened this Oct 25, 2017

rsc commented Oct 25, 2017

CL 73410 OK for Go 1.9.2.

gopherbot commented Oct 25, 2017

Change https://golang.org/cl/73491 mentions this issue: [release-branch.go1.9] runtime: avoid monotonic time zero on systems with low-res timers

gopherbot pushed a commit that referenced this issue Oct 25, 2017

[release-branch.go1.9] runtime: avoid monotonic time zero on systems with low-res timers

Otherwise low-res timers cause problems at call sites that expect to
be able to use 0 as meaning "no time set" and therefore expect that
nanotime never returns 0 itself. For example, sched.lastpoll == 0
means no last poll.

Fixes #22394.

Change-Id: Iea28acfddfff6f46bc90f041ec173e0fea591285
Reviewed-on: https://go-review.googlesource.com/73410
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-on: https://go-review.googlesource.com/73491
TryBot-Result: Russ Cox <rsc@golang.org>
rsc commented Oct 26, 2017

go1.9.2 has been packaged and includes:

The release is posted at golang.org/dl.

— golang.org/x/build/cmd/releasebot, Oct 26 21:09:26 UTC

rsc closed this Oct 26, 2017

felixge commented Oct 27, 2017

Thanks to everybody who worked on this for the incredibly fast fix!
