Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: netpoll responding to I/O events owned by client code crashes #21172

Closed
bradfitz opened this issue Jul 26, 2017 · 18 comments

Comments

Projects
None yet
6 participants
@bradfitz
Copy link
Member

commented Jul 26, 2017

In #21153 @johnsonj and I discovered that Go 1.9 crashes on Windows (during early boot before the network is up? related?), but Go 1.8 does not.

Relevant crash:

fatal error: unexpected signal during runtime execution
[signal 0xc0000005 code=0x0 addr=0xffffffffffffffff pc=0x427c12]

runtime stack:
runtime.throw(0x695d84, 0x2a)
        /home/bradfitz/go/src/runtime/panic.go:605 +0x9c
runtime.sigpanic()
        /home/bradfitz/go/src/runtime/signal_windows.go:155 +0x184
runtime.netpoll(0xc042016000, 0x0)
        /home/bradfitz/go/src/runtime/netpoll_windows.go:105 +0x332
runtime.findrunnable(0xc042016000, 0x0)
        /home/bradfitz/go/src/runtime/proc.go:1966 +0x90f
runtime.schedule()
        /home/bradfitz/go/src/runtime/proc.go:2245 +0x13a
runtime.mstart1()
        /home/bradfitz/go/src/runtime/proc.go:1182 +0xcd
runtime.mstart()
        /home/bradfitz/go/src/runtime/proc.go:1142 +0x6b

goroutine 1 [runnable]:
net/http.(*Transport).RoundTrip(0xc0420f0000, 0xc0420f4100, 0xc0420f0000, 0xbe56
069a33e72c04, 0x13197476c)
        /home/bradfitz/go/src/net/http/transport.go:341 +0xf86
net/http.send(0xc0420f4000, 0x7aaac0, 0xc0420f0000, 0xbe56069a33e72c04, 0x131974
76c, 0x7e0260, 0xc04206a048, 0xbe56069a33e72c04, 0xc042067ac8, 0x1)
        /home/bradfitz/go/src/net/http/client.go:249 +0x1b0
net/http.(*Client).send(0xc0420748d0, 0xc0420f4000, 0xbe56069a33e72c04, 0x131974
76c, 0x7e0260, 0xc04206a048, 0x0, 0x1, 0xc042067b50)
        /home/bradfitz/go/src/net/http/client.go:173 +0x104
net/http.(*Client).Do(0xc0420748d0, 0xc0420f4000, 0x693570, 0x21, 0x0)
        /home/bradfitz/go/src/net/http/client.go:602 +0x294
net/http.(*Client).Get(0xc0420748d0, 0x693570, 0x21, 0xc04204a3e0, 0x48a76d, 0x5
977a2e3)
        /home/bradfitz/go/src/net/http/client.go:393 +0xae
main.isNetworkUp(0xbe560698f3e72c04)
        /home/bradfitz/src/golang.org/x/build/cmd/buildlet/stage0/stage0.go:204
+0xab
main.awaitNetwork(0x690700)
        /home/bradfitz/src/golang.org/x/build/cmd/buildlet/stage0/stage0.go:181
+0x16a
main.main()
        /home/bradfitz/src/golang.org/x/build/cmd/buildlet/stage0/stage0.go:82 +
0xd2

goroutine 19 [runnable]:
net/http.setRequestCancel.func3(0x0, 0xc0420749c0, 0xc0420482c0, 0xc04204c42c, 0
xc04203e240)
        /home/bradfitz/go/src/net/http/client.go:319
created by net/http.setRequestCancel
        /home/bradfitz/go/src/net/http/client.go:319 +0x2c6

the line we're crashing on is src/runtime/netpoll_windows.go:105 at:

                        if stdcall5(_WSAGetOverlappedResult, op.pd.fd, uintptr(unsafe.Pointer(op)), uintptr(unsafe.Pointer(&qty)), 0, uintptr(unsafe.Pointer(&flags))) == 0 {

And 0xc0000005 is EXCEPTION_ACCESS_VIOLATION I believe.

Is op or op.fd nil?

The only interesting and unique thing about this Go program on Windows is that it's running before/while the network is coming up, which is something we've never tested before.

/cc @alexbrainman @ianlancetaylor @aclements

@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Jul 26, 2017

@ianlancetaylor wrote in #21153 (comment) :

runtime/netpoll_windows.go has not changed meaningfully since 1.8.

The layer above has moved from the net package to the internal/poll package, but any changes during that move were inadvertent.

I don't fully understand how the Windows code works. As far as I can see it always sets up op and op.pd before any operation that could trigger the poller. But I wouldn't be shocked if there is some error case that can let op.pd be nil.

@dsnet dsnet added the OS-Windows label Jul 26, 2017

@alexbrainman

This comment has been minimized.

Copy link
Member

commented Jul 26, 2017

Unfortunately I agree with what Ian said. Nothing more to add.
I will try and investigate this when I am back home.

Alex

@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Jul 28, 2017

Unfortunately this is really hard (and really slow) to test in an automated fashion, since it involves building new VM images (itself slow) and then booting them on GCE and watching serial console output.

This is probably only reasonably debuggable by hand on an actual Windows machine.

@gopherbot

This comment has been minimized.

Copy link

commented Aug 1, 2017

Change https://golang.org/cl/52490 mentions this issue: windows: repro helper for go1.9rc1 crash

@bradfitz

This comment has been minimized.

Copy link
Member Author

commented Aug 1, 2017

@ianlancetaylor, we have a theory this might be related to the internal runtime poller interacting badly with github.com/tarm/serial on Windows. Maybe the serial port FD is in a weird state and the http request in the backtrace gets into the runtime poller code and finds the serial port readable/writeable but in a weird state and explodes.

Will try to write a test later today.

@johnsonj

This comment has been minimized.

Copy link
Member

commented Aug 1, 2017

src

go1.8.3: no crash
serial:

stage0: 2017/08/01 22:14:33 using serial log output
stage0: debug-self: 2017/08/01 22:14:33 using serial output for debug-self

stage0: debug-self: 2017/08/01 22:14:33 executing: C:\golang\repro.exe
stage0: 2017/08/01 22:14:33 not using serial log output
stage0: 2017/08/01 22:14:33 bootstrap binary running
stage0: 2017/08/01 22:14:33 waiting for network.
stage0: 2017/08/01 22:14:33 network up after 0s
stage0: 2017/08/01 22:14:33 downloading https://storage.googleapis.com/go-builder-data/buildlet.windows-amd64 to .\buildlet.exe ...
stage0: 2017/08/01 22:14:33 downloaded .\buildlet.exe (7435776 bytes)
stage0: 2017/08/01 22:14:33 downloaded buildlet in 400ms
2017/08/01 22:14:33 buildlet starting.
2017/08/01 22:14:33 Will listen on :80
2017/08/01 22:14:33 Listening on :80 ...

go1.9rc1_cl163855560: crash
serial:

stage0: 2017/08/01 20:54:29 using serial log output
stage0: debug-self: 2017/08/01 20:54:29 using serial output for debug-self

stage0: debug-self: 2017/08/01 20:54:29 executing: C:\golang\repro.exe
stage0: 2017/08/01 20:54:29 not using serial log output
2017/08/01 20:54:29 buildlet starting.
2017/08/01 20:54:29 Will listen on :80
2017/08/01 20:54:29 Listening on :80 ...

console:

stage0: 2017/08/01 20:54:29 using serial log output
stage0: debug-self: 2017/08/01 20:54:29 using serial output for debug-self

stage0: debug-self: 2017/08/01 20:54:29 executing: C:\golang\repro.exe
fatal error: unexpected signal during runtime execution
[signal 0xc0000005 code=0x0 addr=0x8 pc=0x423d6b]

runtime stack:
runtime.throw(0x61244b, 0x2a)
        /usr/lib/google-golang/src/runtime/panic.go:608 +0x85
runtime.sigpanic()
        /usr/lib/google-golang/src/runtime/signal_windows.go:155 +0x15e
runtime.netpoll(0x12396000, 0x0)
        /usr/lib/google-golang/src/runtime/netpoll_windows.go:105 +0x2ab
runtime.findrunnable(0x12396000, 0x0)
        /usr/lib/google-golang/src/runtime/proc.go:1975 +0x777
runtime.schedule()
        /usr/lib/google-golang/src/runtime/proc.go:2254 +0x104
runtime.park_m(0x1239a380)
        /usr/lib/google-golang/src/runtime/proc.go:2317 +0x9f
runtime.mcall(0x0)
        /usr/lib/google-golang/src/runtime/asm_386.s:335 +0x47

goroutine 1 [syscall, locked to thread]:
syscall.Syscall(0x750f88c0, 0x2, 0x150, 0xffffffff, 0x0, 0x58e7db, 0x58e7db, 0x8
)
        /usr/lib/google-golang/src/runtime/syscall_windows.go:163 +0x4f
syscall.WaitForSingleObject(0x150, 0xffffffff, 0x123dc5a0, 0x0, 0x0)
        /usr/lib/google-golang/src/syscall/zsyscall_windows.go:718 +0x53
os.(*Process).wait(0x123dc5a0, 0x0, 0x0, 0x0)
        /usr/lib/google-golang/src/os/exec_windows.go:18 +0x5a
os.(*Process).Wait(0x123dc5a0, 0x123dbe04, 0x123ce1b0, 0x619fdc)
        /usr/lib/google-golang/src/os/exec.go:115 +0x21
os/exec.(*Cmd).Wait(0x123ce160, 0x0, 0x0)
        /usr/lib/google-golang/src/os/exec/exec.go:446 +0x43
os/exec.(*Cmd).Run(0x123ce160, 0xd, 0x123dbf3c)
        /usr/lib/google-golang/src/os/exec/exec.go:289 +0x45
main.main()
        /usr/local/google/home/jrjohnson/dev/projects/gobuild/src/golang.org/x/b
uild/cmd/buildlet/stage0/stage0.go:108 +0x2dc

goroutine 18 [runnable, locked to thread]:
syscall.Syscall6(0x750f8cf0, 0x5, 0x11c, 0x12474000, 0x38, 0x1239df00, 0x123c83c
0, 0x0, 0x1, 0x12c, ...)
        /usr/lib/google-golang/src/runtime/syscall_windows.go:174 +0x4f
syscall.WriteFile(0x11c, 0x12474000, 0x38, 0x8000, 0x1239df00, 0x123c83c0, 0x0,
0x484971)
        /usr/lib/google-golang/src/syscall/zsyscall_windows.go:327 +0x97
golang.org/x/build/vendor/github.com/tarm/serial.(*Port).Write(0x123c5e80, 0x124
74000, 0x38, 0x8000, 0x0, 0x0, 0x0)
        /usr/local/google/home/jrjohnson/dev/projects/gobuild/src/golang.org/x/b
uild/vendor/github.com/tarm/serial/serial_windows.go:104 +0xcc
io.(*multiWriter).Write(0x1242d290, 0x12474000, 0x38, 0x8000, 0x38, 0x0, 0x0)
        /usr/lib/google-golang/src/io/multi.go:60 +0x6d
io.copyBuffer(0x6ffb10, 0x1242d290, 0x6ffd90, 0x123c25d8, 0x12474000, 0x8000, 0x
8000, 0x0, 0x0, 0x0, ...)
        /usr/lib/google-golang/src/io/io.go:394 +0x192
io.Copy(0x6ffb10, 0x1242d290, 0x6ffd90, 0x123c25d8, 0x0, 0x0, 0x0, 0x0)
        /usr/lib/google-golang/src/io/io.go:362 +0x51
os/exec.(*Cmd).writerDescriptor.func1(0x0, 0x0)
        /usr/lib/google-golang/src/os/exec/exec.go:264 +0x3c
os/exec.(*Cmd).Start.func1(0x123ce160, 0x1242d2e0)
        /usr/lib/google-golang/src/os/exec/exec.go:380 +0x1d
created by os/exec.(*Cmd).Start
        /usr/lib/google-golang/src/os/exec/exec.go:379 +0x4fb

C:\golang>
@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Aug 1, 2017

I still don't understand what is happening here, but can you try https://golang.org/cl/52551 ?

@gopherbot

This comment has been minimized.

Copy link

commented Aug 1, 2017

Change https://golang.org/cl/52551 mentions this issue: internal/poll: don't add non-sockets to runtime poller

@alexbrainman

This comment has been minimized.

Copy link
Member

commented Aug 2, 2017

we have a theory this might be related to the internal runtime poller interacting badly with github.com/tarm/serial on Windows

I did look at the code, and I agree that it is COM port file handle that might be causing problems. The github.com/tarm/serial is designed to do its own IO, but we, probably, include COM port file handle into netpoll monitor.

github.com/tarm/serial opens the COM port with Windows CreateFile (as everything else on Windows), and then to read it calls Windows ReadFile passing it not nil OVERLAPPED structure that has hEvent field set to an Windows Event object. This will make ReadFile return immediately, and github.com/tarm/serial calls Windows GetOverlappedResult with the same OVERLAPPED structure. The GetOverlappedResult will block until some data arrives to COM port, which will trigger the Windows Event that hEvent points to, and that will make blocked GetOverlappedResult return. This is how the authors of github.com/tarm/serial expect things to work.

netpoller on the other hand uses different strategy to do IO. It calls CreateIoCompletionPort to associate many file / IO handles with single "completion port". Once setup you make one single thread wait for "io is completed" events to be reported to that "completion port" - you call GetQueuedCompletionStatus in a loop. The GetQueuedCompletionStatus only returns when some IO is completed, and it uses lpOverlapped parameter to report which IO is completed. The lpOverlapped parameter is set to whatever you pass into Windows ReadFile or WriteFile when you start IO.

So what is happening here, I suspect, github.com/tarm/serial calls os.NewFile with its file handle, and this calls CreateIoCompletionPort and associates COM port with our netpoller. Since COM port is now monitored by netpoller, when GetQueuedCompletionStatus (in netpoller) returns, we get what github.com/tarm/serial has passed to syscall.WriteFile - syscall.Overlapped, while netpoller expect that to be internal/poll.operation. Therefore memory corruption and crash. Or nice stack trace, if we are luck and memory is filled with zeros.

Alex

@johnsonj

This comment has been minimized.

Copy link
Member

commented Aug 2, 2017

@ianlancetaylor that patch fixes the crash which I think further shines light into @alexbrainman 's excellent explanation.

This bug doesn't actually require a fresh VM, I can RDP in, execute it and it repros. Nothing special about boot time/network.

@johnsonj

This comment has been minimized.

Copy link
Member

commented Aug 2, 2017

to quote @alexbrainman from the CL

I agree. We should only submit to netpoller file handles that will use internal/poll.ioSrv.ExecIO to do IO. In this CL Ian is trying to exclude file handles based on their kind. But I think we need some explicit flag for that, and pass different flag from net package and from os. None of file objects created in os package needs to be included in the netpoller. Same for generic file handle passed into os.NewFile. Only handles created in net package needs to be included.

It sounds like some additional plumbing is needed and the default behavior should be not to wait on files unless explicitly indicated. It also sounds like this code hasn't changed so this crash could be a diffrence in undefined behavior rather than an true regression. If that's all true then this shouldn't block rc1 but the problem demonstrated is a bug in implementation that should be addressed.

@johnsonj johnsonj changed the title runtime: netpoll crash with Go1.9rc1 when before Windows network is up runtime: netpoll responding to I/O events owned by client code crashes Aug 2, 2017

@gopherbot

This comment has been minimized.

Copy link

commented Aug 2, 2017

Change https://golang.org/cl/52857 mentions this issue: internal/poll: attempt at crash repro for a test

gopherbot pushed a commit that referenced this issue Aug 3, 2017

internal/poll: don't add non-sockets to runtime poller
Updates #21172

Change-Id: I0fec6e645328bbc85f3e47f4f71dd8d1d68c75ab
Reviewed-on: https://go-review.googlesource.com/52551
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
@gopherbot

This comment has been minimized.

Copy link

commented Aug 3, 2017

Change https://golang.org/cl/52930 mentions this issue: internal/poll: test for not adding non-sockets to runtime poller on Windows

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Aug 3, 2017

The basic problem seems to be fixed in 1.9. Pushing to 1.10 for a cleaner fix and a good test. (We can still put a test into 1.9 if it comes soon enough.)

@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Go1.9 Aug 3, 2017

@gopherbot

This comment has been minimized.

Copy link

commented Aug 7, 2017

Change https://golang.org/cl/53530 mentions this issue: internal/poll: add tests for Windows file and serial ports

gopherbot pushed a commit that referenced this issue Aug 11, 2017

internal/poll: add tests for Windows file and serial ports
I also wanted to test net sockets, but I do not know how to
access their file handles. So I did not implement socket tests.

Updates #21172

Change-Id: I5062c0e65a817571d755397d60762c175f9791ce
Reviewed-on: https://go-review.googlesource.com/53530
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@gopherbot

This comment has been minimized.

Copy link

commented Sep 25, 2017

Change https://golang.org/cl/65810 mentions this issue: internal/poll: be explicit when using runtime netpoller

@gopherbot gopherbot closed this in 8e2d90d Sep 26, 2017

@gopherbot

This comment has been minimized.

Copy link

commented Oct 16, 2017

Change https://golang.org/cl/71131 mentions this issue: [release-branch.go1.9] internal/poll: add tests for Windows file and serial ports

@gopherbot

This comment has been minimized.

Copy link

commented Oct 16, 2017

Change https://golang.org/cl/71132 mentions this issue: [release-branch.go1.9] internal/poll: be explicit when using runtime netpoller

gopherbot pushed a commit that referenced this issue Oct 25, 2017

[release-branch.go1.9] internal/poll: add tests for Windows file and …
…serial ports

I also wanted to test net sockets, but I do not know how to
access their file handles. So I did not implement socket tests.

Updates #21172

Change-Id: I5062c0e65a817571d755397d60762c175f9791ce
Reviewed-on: https://go-review.googlesource.com/53530
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-on: https://go-review.googlesource.com/71131
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>

gopherbot pushed a commit that referenced this issue Oct 25, 2017

[release-branch.go1.9] internal/poll: be explicit when using runtime …
…netpoller

internal/poll package assumes that only net sockets use runtime
netpoller on windows. We get memory corruption if other file
handles are passed into runtime poller. Make FD.Init receive
and use useNetpoller argument, so FD.Init caller is explicit
about using runtime netpoller.

Fixes #21172

Change-Id: I60e2bfedf9dda9b341eb7a3e5221035db29f5739
Reviewed-on: https://go-review.googlesource.com/65810
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-on: https://go-review.googlesource.com/71132
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>

@golang golang locked and limited conversation to collaborators Oct 16, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.