-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime,x/sys/windows: panic accessing (possibly) nil field of net_op #58870
Comments
Per @ianlancetaylor 's request, here is information about how to reproduce this error: Using the code at https://github.com/hawkinsw/windowpane here is the
I am more than happy to create something that is more compact as a reproducer but I wanted to send you something quickly to demonstrate that it is a real, actual problem and not hypothetical. I hope that this information is helpful! |
Thanks. Evidently you can reproduce this problem, but I don't understand it. The docs for CC @golang/windows |
I hope that what I am about to say is helpful and not something that is obvious to wizards like you. If it is obvious and does not address your point, I apologize! It's a tortuous path, but I will try to lay it out:
we do err = fd.pd.init(fd) which (eventually) leads to I know that's circuitous, but that's the way that I see it. I hope that helps! More importantly, I hope that I was answering your question and not telling you something that you already knew (you know this code base far better than this dodo). Thanks again for everything! |
Thanks, that part I knew. But maybe I see what I am missing. I think you are suggesting that if a file descriptor has been associated with the netpoll completion port, then any other operation on that file can cause that completion port to return the descriptor, even if the operation in question does not refer to the completion port. That seems strange to me, but I don't know much about Windows. Anyhow, I think it's best that this CL be reviewed by someone familiar with Windows, so I'll simply say that I have no further objections to it. Thanks. |
That is my understanding as well! It does seem odd to me, but it seems like any operation that completes on an fd (even if it was not started in a way that is associated with that completion port) can be noticed by anyone with a completion port for that fd.
No, thank you! Is there someone else that I should add to the reviewers on the CLs to take the load off you? |
We've already CC'ed @golang/windows here so we'll see if someone picks it up. |
From the description of
and
To me that indicates the behavior we are seeing is entirely rational and to be expected. In other words, it seems like we will have to provide a fix somehow. In addition to the proposed patch(es), I found some additional information that might give us another opportunity for solid fix without impacting how users interact with the runtime: From the documentation for
Given that information, we could provide the user with a function to create a quote-unquote It seems like either of these two options would work. I hope that the documentation above is helpful! Please let me know how else I can help and how I can fixup the CL to be more useful! (Edited to include a second quote from the |
FYI: I have created a POC for this bug and to demonstrate how file can have "spooky action at a distance": https://github.com/hawkinsw/windowpane/tree/main/bug_poc. I hope that this helps. Please, please let me know if/how I can help! |
In triage now, it would be really useful to get some input from someone in @golang/windows on this. |
I'll take a look at this. Nothing to say for the moment other than the first step is to get a reproducer than can fit into the Go repo. I can see how this is difficult to reproduce from outside the Go runtime, I'll try to prepare for |
Please let me know if/how I can help! The repository that I posted has "reproducing" code written directly using the win32 APIs. It at least demonstrates that the behavior is real. I hope that the patch is helpful, too. It seems to do everything that needs to be done. |
I've been able to reproduce some interference between a user-space I still don't understand which is the root cause. CL 473515 seems to help in some cases but the runtime still crashes occasionally, I think it just hides the issue and move the crash even farther. Test code: package poll_test
import (
"errors"
"internal/poll"
"internal/syscall/windows"
"syscall"
"testing"
"unsafe"
)
func TestWSASocketConflict(t *testing.T) {
s, err := windows.WSASocket(syscall.AF_INET, syscall.SOCK_STREAM, syscall.IPPROTO_TCP, nil, 0, windows.WSA_FLAG_OVERLAPPED)
if err != nil {
t.Fatal(err)
}
fd := poll.FD{Sysfd: s, IsStream: true, ZeroReadIsEOF: true}
_, err = fd.Init("tcp", true)
if err != nil {
syscall.CloseHandle(s)
t.Fatal(err)
}
defer fd.Close()
const SIO_TCP_INFO = syscall.IOC_INOUT | syscall.IOC_VENDOR | 39
inbuf := uint32(0)
var outbuf _TCP_INFO_v0
cbbr := uint32(0)
err = fd.WSAIoctl(
SIO_TCP_INFO,
(*byte)(unsafe.Pointer(&inbuf)),
uint32(unsafe.Sizeof(inbuf)),
(*byte)(unsafe.Pointer(&outbuf)),
uint32(unsafe.Sizeof(outbuf)),
&cbbr,
&syscall.Overlapped{},
0,
)
if err != nil && !errors.Is(err, syscall.ERROR_IO_PENDING) {
t.Fatal(err)
}
}
type _TCP_INFO_v0 struct {
State uint32
Mss uint32
ConnectionTimeMs uint64
TimestampsEnabled bool
RttUs uint32
MinRttUs uint32
BytesInFlight uint32
Cwnd uint32
SndWnd uint32
RcvWnd uint32
RcvBuf uint32
BytesOut uint64
BytesIn uint64
BytesReordered uint32
BytesRetrans uint32
FastRetrans uint32
DupAcksIn uint32
TimeoutEpisodes uint32
SynRetrans uint8
} Panic: === RUN TestWSASocketConflict
fatal error: unexpected signal during runtime execution
--- PASS: TestWSASocketConflict (0.00s)
[signal 0xc0000005 code=0x0 addr=0x8 pc=0x12171e]
runtime stack:
runtime.throw({0x24033c?, 0x0?})
runtime/panic.go:1076 +0x65 fp=0x543bdff148 sp=0x543bdff118 pc=0x127645
runtime.sigpanic()
runtime/signal_windows.go:345 +0x265 fp=0x543bdff190 sp=0x543bdff148 pc=0x13c065
runtime.netpoll(0x8ff77537ce84?)
runtime/netpoll_windows.go:138 +0x29e fp=0x543bdffa18 sp=0x543bdff190 pc=0x12171e
runtime.findRunnable()
runtime/proc.go:2960 +0x76a fp=0x543bdffb20 sp=0x543bdffa18 pc=0x12ee8a
runtime.schedule()
runtime/proc.go:3349 +0xb1 fp=0x543bdffb58 sp=0x543bdffb20 pc=0x12ff31
runtime.park_m(0xc0000589c0?)
runtime/proc.go:3500 +0x119 fp=0x543bdffb88 sp=0x543bdffb58 pc=0x1303f9
runtime.mcall()
runtime/asm_amd64.s:455 +0x4f fp=0x543bdffba0 sp=0x543bdffb88 pc=0x1568ef
goroutine 1 [runnable]:
runtime.gopark(0xc000119a28?, 0xfca68?, 0x20?, 0x81?, 0x18?)
runtime/proc.go:381 +0xce fp=0xc0001199c0 sp=0xc0001199a0 pc=0x12a02e
runtime.chanrecv(0xc0000242a0, 0xc000119aa7, 0x1)
runtime/chan.go:583 +0x3d0 fp=0xc000119a38 sp=0xc0001199c0 pc=0xf5d70
runtime.chanrecv1(0x33abc0?, 0x20ee80?)
runtime/chan.go:442 +0x12 fp=0xc000119a60 sp=0xc000119a38 pc=0xf5992
testing.(*T).Run(0xc000059380, {0x239ebc?, 0x15d76d?}, 0x243ab8)
testing/testing.go:1649 +0x3c8 fp=0xc000119b20 sp=0xc000119a60 pc=0x1bf828
testing.runTests.func1(0x33c280?)
testing/testing.go:2055 +0x3e fp=0xc000119b70 sp=0xc000119b20 pc=0x1c189e
testing.tRunner(0xc000059380, 0xc000119c88)
testing/testing.go:1595 +0xff fp=0xc000119bc0 sp=0xc000119b70 pc=0x1be9bf
testing.runTests(0xc00007c320?, {0x3355a0, 0xd, 0xd}, {0xc00000db40?, 0x100c000119d10?, 0x33ad20?})
testing/testing.go:2053 +0x445 fp=0xc000119cb8 sp=0xc000119bc0 pc=0x1c1785
testing.(*M).Run(0xc00007c320)
testing/testing.go:1925 +0x636 fp=0xc000119f00 sp=0xc000119cb8 pc=0x1c01b6
main.main()
_testmain.go:73 +0x19c fp=0xc000119f80 sp=0xc000119f00 pc=0x1ffcdc
runtime.main()
runtime/proc.go:250 +0x1f2 fp=0xc000119fe0 sp=0xc000119f80 pc=0x129c52
runtime.goexit()
runtime/asm_amd64.s:1619 +0x1 fp=0xc000119fe8 sp=0xc000119fe0 pc=0x1588e1
exit status 2
FAIL internal/poll 0.212s |
Thank you! If you could help me by telling me how to integrate your test into the build framework and how to run it (sorry that I am not well-versed in that area of the code base despite having contributed code to the runtime before), I would love to help confirm a fix and continue to debug. I love working on this stuff and hope that I am being helpful! |
That's a little tricky, as it is using
|
Thank you!! I sincerely appreciate that. I hope that you are okay with me helping. Again, I love this!! |
@qmuntal FYI: I can reproduce the error locally! |
I built a version of the compiler/runtime with CL473515 applied and was able to successfully trigger the bug with a slightly modified version of the patch that, I think, shows that the CL is useful. func TestWSASocketConflict(t *testing.T) {
s, err := windows.WSASocket(syscall.AF_INET, syscall.SOCK_STREAM, syscall.IPPROTO_TCP, nil, 0, windows.WSA_FLAG_OVERLAPPED)
if err != nil {
t.Fatal(err)
}
fd := poll.FD{Sysfd: s, IsStream: true, ZeroReadIsEOF: true}
_, err = fd.Init("tcp", true)
if err != nil {
syscall.CloseHandle(s)
t.Fatal(err)
}
defer fd.Close()
const SIO_TCP_INFO = syscall.IOC_INOUT | syscall.IOC_VENDOR | 39
inbuf := uint32(0)
var outbuf _TCP_INFO_v0
cbbr := uint32(0)
// Set up an array of 2 syscall.Overlapped structs so that we
// can play with triggering bad behavior even in the patched version
// of the code. The net_op is bigger than the syscall.Overlapped
// and the field in the net_op that is dangerously dereferenced (alliteration)
// (before the CL is applied) would fall immediately after the end
// of the syscall.Overlapped. So ...
ovs := make([]syscall.Overlapped, 2)
// ... if the first field of the adjacent syscall.Overlapped is 0 then
// even the patched code will fail -- it sees *something* in the pd field
// of the net_op. So, let's test that case and make sure that it fails.
ovs[1].Internal -= 1
// However, if the adjacent in-memory element is 0 then
// the patched code will properly detect that what it found during
// completion was *not* something that it manages. Whew.
// ovs[1].Internal = 0
err = fd.WSAIoctl(
SIO_TCP_INFO,
(*byte)(unsafe.Pointer(&inbuf)),
uint32(unsafe.Sizeof(inbuf)),
(*byte)(unsafe.Pointer(&outbuf)),
uint32(unsafe.Sizeof(outbuf)),
&cbbr,
&ovs[0],
0,
)
if err != nil && !errors.Is(err, syscall.ERROR_IO_PENDING) {
t.Fatal(err)
}
} I hope that the commentary in the test helps explain what I am trying to demonstrate. As I ran the test in the configuration where CL473515 properly shielded a dangerous access (according to the above criteria), I was not able to reproduce. I hope that this information helps! Thanks again for all that you are doing! Will |
After running the test repeatedly for approximately 45 minutes, I was able to trigger an error: I think that the source of the issue is that the test is closing the socket (which leads to a I have included the dump below. Thanks again! Will
|
After further investigation, it appears that the problem above had a relatively simple cause: The test is using I still think that there may be another issue with which we have to contend: In the case where the caller of Does that seem like a reasonable statement? If it is, we can come up with a workaround. As usual, thank you for letting me help! |
Sorry for the delay -- the CL is now updated. Let me know if/how I can help further! Will |
Thanks for the new insights @hawkinsw. I still think that when calling For example, CreateIoCompletionPort accepts a per-handle unique key ( |
Thanks for the feedback! I think that
I am convinced that you know more than I, so please help me understand what you are seeing that I am not! Thanks for letting me help! |
The evidences are clear, you explain it very well @hawkinsw, the current behavior is in accordance with what's documented. My point is that we should try to improve how the Go runtime manages the completion ports and also make sure that the handler passed to the user via |
@qmuntal I am relieved -- I was definitely starting to question my sanity! I really have enjoyed our co-investigation! Thank you, again, for letting me participate. As I said, I had a thought about how to be proactive rather than reactive managing what completion entries ( |
I've tried to materialize my previous thoughts about using a completion key to filter out entries not originated from the runtime (without having to update |
Thanks @qmuntal -- I will do that as soon as possible! |
@qmuntal -- I have applied the patch and am running my test! Everything looks great ... until it doesn't! I see an issue that might be specific to the that would not be visible to users. It would appear that under certain circumstances the Thanks (again!) for letting me work with you! |
This comment was marked as off-topic.
This comment was marked as off-topic.
I believe that I have isolated the source of the failure. Judging by the fact that the PC is So, we go to 492487: 48 89 c8 mov rax,rcx
49248a: 80 b8 20 02 00 00 00 cmp BYTE PTR [rax+0x220],0x0
492491: 75 15 jne 4924a8 <internal/poll.(*FD).destroy+0x68>
492493: 48 8b 15 2e 2e 1b 00 mov rdx,QWORD PTR [rip+0x1b2e2e] # 6452c8 <internal/poll.CloseFunc>
49249a: 48 8b 0a mov rcx,QWORD PTR [rdx]
49249d: 48 8b 58 10 mov rbx,QWORD PTR [rax+0x10]
4924a1: 48 89 d8 mov rax,rbx
4924a4: ff d1 call rcx
For better or worse, this does seem consistent with a timing or memory corruption error: We read the contents of @qmuntal, I think that seems like a reasonable analysis and will continue down this path. I am sorry it has taken me this long to finally come up with a reasonable hypothesis but I am happy that I finally have something to go on. If you have thoughts, please let me know -- I would love to get your insight! Thanks again and hope you had a great weekend! UPDATE: UPDATE 2: (22:39:28GMT 27 March 2023 Monday) |
Take my test with caution, I've used interacted few times with |
Sorry for the delay between updates. While I do not know why what I am about to describe is happening, I am confident that I have some (what I think is) definitive proof of what is going on. While running a test, Because the error is nondeterministic it is not always at this place that I see the panic. However, with logging I have demonstrated that it is at least one place where the problem occurs. That is great news because now I have at least something to go on. Given the data accumulated so far, it would appear that there is something happening asynchronously that is scribbling over the value of In one common failure scenario, the instantiation of c.runCleanup(normalPanic) In other words, we were literally just executing in a place where we know that I will keep you posted as I look for more. cc @qmuntal |
UPDATE: It would appear that this is exactly the issue. I am going to modify your patch @qmuntal to work around the issue. I will keep you posted! |
@qmuntal I believe that I know fully understand the problem and have a workable solution. I am going to spend time today writing a complete description of the problem and the proposed solution so that we have it for posterity and to use as a way to discuss whether to go forward with that solution. Let me know if that sounds okay! |
I am so sorry that it took me as long as it did to produce this document. I hope that you find it valuable and perhaps there is something worthwhile inside. I look forward to hearing what you think. Thanks again for everything! https://github.com/hawkinsw/windowpane/blob/main/description.md |
That was a good reading. Some thoughts:
|
Absolutely! I will do it as soon as possible. It will be my first time attempting to modify someone else's CL on Gerrit so please forgive me (in advance) if I somehow screw up!
That is absolutely correct. The only problem is ... to use
Nor am I! I was just suggesting that it might be something we would consider. We can reconsider after you see the updated CL. Thank you for the feedback! |
@qmuntal Sorry if I am being dense. I just attempted to Thanks! |
No problem. Submit a new CL copying whatever relevant code from my CL and set my as co-author. I'll abandone myh CL latter. |
Will do!! |
Tried to set you! Let me know if I did it correctly: 482495: runtime: filter i/o async entries using completion key on windows |
Change https://go.dev/cl/482495 mentions this issue: |
Change https://go.dev/cl/558895 mentions this issue: |
It is currently not possible to distinguish between a netpollBreak entry and an entry initiated by external WSA operations (as in #58870). This CL sets a unique completion key when posting the netpollBreak entry so that it can be identified as such. Change-Id: I8e74a7ddc607dc215d6ed8c59d5c3cf47ec8dc62 Reviewed-on: https://go-review.googlesource.com/c/go/+/558895 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Bryan Mills <bcmills@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
Change https://go.dev/cl/561895 mentions this issue: |
This is another attempt at CL 558895, but without adding stale pollDescs protection, which deviates from the original goal of the CL and adds complexity without proper testing. It is currently not possible to distinguish between a netpollBreak, an internal/poll WSA operation, and an external WSA operation (as in #58870). This can cause spurious wakeups when external WSA operations are retrieved from the queue, as they are treated as netpollBreak events. This CL makes use of completion keys to identify the source of the event. While here, fix TestWSASocketConflict, which was not properly exercising the "external WSA operation" case. Change-Id: I91f746d300d32eb7fed3c8f27266fef379360d98 Reviewed-on: https://go-review.googlesource.com/c/go/+/561895 Reviewed-by: Than McIntosh <thanm@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
It is currently not possible to distinguish between a netpollBreak entry and an entry initiated by external WSA operations (as in golang#58870). This CL sets a unique completion key when posting the netpollBreak entry so that it can be identified as such. Change-Id: I8e74a7ddc607dc215d6ed8c59d5c3cf47ec8dc62 Reviewed-on: https://go-review.googlesource.com/c/go/+/558895 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Bryan Mills <bcmills@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
This is another attempt at CL 558895, but without adding stale pollDescs protection, which deviates from the original goal of the CL and adds complexity without proper testing. It is currently not possible to distinguish between a netpollBreak, an internal/poll WSA operation, and an external WSA operation (as in golang#58870). This can cause spurious wakeups when external WSA operations are retrieved from the queue, as they are treated as netpollBreak events. This CL makes use of completion keys to identify the source of the event. While here, fix TestWSASocketConflict, which was not properly exercising the "external WSA operation" case. Change-Id: I91f746d300d32eb7fed3c8f27266fef379360d98 Reviewed-on: https://go-review.googlesource.com/c/go/+/561895 Reviewed-by: Than McIntosh <thanm@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
First of all, thank you all for everything that you do!
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
No, but reproducing the issue is timing dependent so my inability to reproduce seems to be the result of a different order of scheduling timeouts, etc.
What operating system and processor architecture are you using (
go env
)?On Windows on amd64 hardware.
What did you do?
Using
windows.WSAIoctl
on an overlapped socket with a "zero" values in thewindows.Overlapped
struct
will cause apanic
innetpoll_windows.go
at line 134 (as of the version of the code tagged at version 1.20.1) where the runtime attempts to dereferenceop.pd
without checking whether it isnil
.The problem is that the runtime does not expect that something other than itself (i.e., the runtime) will add items to the list of operations that could be returned by a call to
GetQueuedCompletionStatusEx
. The commentis telling.
Unfortunately, we cannot guarantee that is the case.
If an application added something to that list (e.g., through a use of
windows.WSAIoctl
) that does not complete before the scheduler does another "round" of netpoll, it is possible thatop
is not actually anet_op
. If the application user does "the right thing" by passing a pointer to a "zerod"windows.Overlapped
structure, the runtime will break.There was a related issue from several years earlier that @bradfitz found and fixed: #21172
This issue seems to be related.
I have two one-line patches to the
sys
andruntime
packages that (I believe) provide a simple fix for the issue. I wanted to write this issue before posting them to gerrit so that there is something to reference. I will update this report with those patchsets.See
sys/windows: Pad windows.Overlapped to protect the runtime | https://go-review.googlesource.com/c/sys/+/473163
and
473163: sys/windows: Pad windows.Overlapped to protect the runtime | https://go-review.googlesource.com/c/sys/+/473163
The text was updated successfully, but these errors were encountered: