Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: intermittent "failed to create new OS thread" on Windows since 2022-01-24 #52572

Closed
bcmills opened this issue Apr 26, 2022 · 18 comments
Closed
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 OS-Windows release-blocker
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Apr 26, 2022

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
runtime: failed to create new OS thread (22)
FAIL	golang.org/x/tools/go/packages	25.671s

greplogs --dashboard -md -l -e '(?ms)\Awindows-.*runtime: failed to create new OS thread'

2022-04-26T02:30:39-5bb9a5e-17d7983/windows-amd64-race
2022-03-21T13:26:21-86b02b3-4aa1efe/windows-arm64-11
2022-02-04T14:02:15-25d2ab2-4afcc9f/windows-arm64-11
2022-01-24T16:55:59-f9df4ea/windows-amd64-2008

Curiously, I don't see any of these errors in the logs before 2022-01-24. 🤔

@bcmills bcmills added OS-Windows NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Apr 26, 2022
@bcmills bcmills added this to the Go1.19 milestone Apr 26, 2022
@bcmills
Copy link
Contributor Author

bcmills commented Apr 26, 2022

Since most of these failures are in x/tools, I wouldn't be entirely surprised if this is actual thread-exhaustion due to the tests running with high parallelism and spawning subprocesses that themselves run with high parallelism. (Some of these failures also correlate with out of memory errors from the same test run.)

However the “requested the Runtime to terminate it in an usual way" failure mode in particular looks like a Go runtime bug. Since windows/amd64 is a first-class port, I'm marking this as a release-blocker at least until we can determine whether it is a regression (attn @golang/runtime).

@heschi
Copy link
Contributor

heschi commented May 11, 2022

ping: this is a release blocker that hasn't been updated in a while.

@aclements
Copy link
Member

Three out of the four also have an "out of memory" failure, which strongly suggests this is a real resource exhaustion on the host. That's consistent with most of the failures being on a particular subrepo.

The "requested the Runtime to terminate it in an usual way" message is not coming from the Go runtime, and must be coming from the C runtime. That's presumably printed by CreateThread failing, but it is odd that the Go runtime still gets to print "failed to create new OS thread" after the C runtime prints its message. (Maybe that's a buffer flushing thing?)

@prattmic
Copy link
Member

I reproduced this 12 times over the weekend on gomotes, interestingly always in package runtime's TestCgoSignalDeadlock:

--- FAIL: TestCgoSignalDeadlock (44.53s)
    crash_test.go:124: running go build -o C:\Users\gopher\AppData\Local\Temp\1\go-build3458167679\testprogcgo.exe 
    crash_cgo_test.go:37: expected "OK\n", but got:
        OK
        
        This application has requested the Runtime to terminate it in an unusual way.
        Please contact the application's support team for more information.
        runtime: failed to create new OS thread (13)
FAIL
FAIL    runtime 418.665s

@prattmic
Copy link
Member

I should add that most, but not all, failures include the "This application has requested ..." message, and that none of the failures include any other resource or other test failures in the same run.

@prattmic
Copy link
Member

The tests did seem to take abnormally long to run. That said, the first run (which passed) on a new gomote took 12s to complete (mostly spent building the test binary), so these aren't that far off.

--- FAIL: TestCgoSignalDeadlock (4.41s)
--- FAIL: TestCgoSignalDeadlock (23.29s)
--- FAIL: TestCgoSignalDeadlock (26.25s)
--- FAIL: TestCgoSignalDeadlock (41.33s)
--- FAIL: TestCgoSignalDeadlock (43.05s)
--- FAIL: TestCgoSignalDeadlock (35.08s)
--- FAIL: TestCgoSignalDeadlock (27.79s)
--- FAIL: TestCgoSignalDeadlock (21.33s)
--- FAIL: TestCgoSignalDeadlock (37.15s)
--- FAIL: TestCgoSignalDeadlock (36.62s)
--- FAIL: TestCgoSignalDeadlock (5.74s)
--- FAIL: TestCgoSignalDeadlock (44.53s)

@prattmic
Copy link
Member

Ah, well the "requested the Runtime to terminate it in an usual way" message is not surprising. The runtime message comes from https://cs.opensource.google/go/go/+/master:src/runtime/cgo/gcc_windows_amd64.c;l=31-32.

Per the Microsoft docs, "By default, the abort routine prints the message:

"This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information.""

I am now more confused why I sometimes don't get the message, and also why the runtime error is printed after the message, given that the docs say "abort does not flush stream buffers or do atexit/_onexit processing."

@prattmic
Copy link
Member

"runtime: failed to create new OS thread (13)" is errno EACCES. _beginthread "may return EACCES if there are insufficient resources (such as memory)", which fits well with the out-of-memory errors.

@prattmic
Copy link
Member

I am running with https://go.dev/cl/408216 to try to get a sense of whether the crashing process is using lots of memory, the overall system is low on memory, or this has nothing to do with memory.

(Hopefully there is enough memory to run the dump function 😅)

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/408216 mentions this issue: WIP: runtime: dump memory stats on failure

@ianlancetaylor
Copy link
Contributor

@prattmic Have you tried extending the idea of https://go.dev/cl/33894 to Windows? The exact reason for an "insufficient resources" error isn't that interesting if retrying fixes the problem.

@prattmic
Copy link
Member

I haven't, but it is not a bad idea, as _beginthread provides a very clear errno for this failure.

That said, while this is the only error in my repros, all of the failures in Bryan's original comment also have other fatal OOMs in the allocator. So while a retry might make thread creation succeed, the program may die anyways due to another OOM.

It seems like the ultimate problem here might end up being test/builder misconfiguration. i.e., we are running too many high-memory tests in parallel and exhausting all memory on the builder.

@mknyszek
Copy link
Contributor

Since there's a decent chance this is resource exhaustion on the builders I'm not sure this is going to impede users using the beta, so marking as okay-after-beta1.

@mknyszek mknyszek added the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label May 25, 2022
@prattmic
Copy link
Member

prattmic commented Jun 1, 2022

@prattmic
Copy link
Member

prattmic commented Jun 1, 2022

I reproduced with https://go.dev/cl/408216:

...
=== dist: memory usage:
        phys: 2891743232 / 17179447296 (16.832574%)
        pagefile: 3027419136 / 18253189120 (16.585700%)
        virtual: 669831168 / 140737488224256 (0.000476%)
=== dist: memory usage:
        phys: 2885001216 / 17179447296 (16.793330%)
        pagefile: 2997710848 / 18253189120 (16.422943%)
        virtual: 669831168 / 140737488224256 (0.000476%)
--- FAIL: TestCgoSignalDeadlock (3.26s)
    crash_test.go:124: running go build -o C:\Users\gopher\AppData\Local\Temp\1\go-build1460092851\testprogcgo.exe 
    crash_cgo_test.go:37: expected "OK\n", but got:
        OK
        
        This application has requested the Runtime to terminate it in an unusual way.
        Please contact the application's support team for more information.
        runtime: failed to create new OS thread (13)
        runtime: process memory stats:
                PageFaultCount: 2833
                PeakWorkingSetSize: 10567680
                WorkingSetSize: 10567680
                QuotaPeakPagedPoolUsage: 52240
                QuotaPagedPoolUsage: 52240
                QuotaPeakNonPagedPoolUsage: 26080
                QuotaNonPagedPoolUsage: 25944
                PagefileUsage: 26787840
                PeakPagefileUsage: 26816512
        runtime: global memory stats:
                MemoryLoad: 18
                TotalPhys: 17179447296
                AvailPhys: 14058536960
                TotalPageFile: 18253189120
                AvailPageFile: 14885646336
                TotalVirtual: 140737488224256
                AvailVirtual: 140736538796032
FAIL
FAIL    runtime 410.043s

Total memory usage on the system is only ~18% of available memory, and the failing process is only using ~25MB of memory, so this does not seem like an obvious memory pressure issue.

I'll give Ian's suggestion a shot and see if retry of _beginthread is enough to resolve this, at least in my case. I'm less sure about the failures on the builders that also had other OOM crashes.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/410354 mentions this issue: runtime/cgo: merge bodies of cgo_sys_thread_start on windows

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/410355 mentions this issue: runtime/cgo: retry _beginthread on EACCES

gopherbot pushed a commit that referenced this issue Jun 9, 2022
The bodies of cgo_sys_thread_start (and x_cgo_sys_thread_create) are
nearly identical on all of the windows ports.

Create a single _cgo_beginthread implementation that contains the body
and is used on all ports. This will reduce churn in an upcoming CL to
add retry logic.

We could theoretically have a single implementation of
_cgo_sys_thread_start shared by all ports, but I keep them separate for
ease of searching. Right now every single port implements this function
in their gcc_GOOS_GOARCH.c file, so it is nice to keep this symmetry.

_cgo_dummy_export must move out of libcgo_windows.h because it is a
definition and the inclusion of libcgo_windows.h in multiple files
creates duplicate definitions.

For #52572.

Change-Id: I9fa22009389349c754210274c7db2631b061f9c7
Reviewed-on: https://go-review.googlesource.com/c/go/+/410354
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
@prattmic
Copy link
Member

prattmic commented Jun 9, 2022

I haven't been able to reproduce this issue with CL 410355, so closing for now, though of course that CL isn't a hard fix so we'll see if this comes back.

@golang golang locked and limited conversation to collaborators Jun 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 OS-Windows release-blocker
Projects
None yet
Development

No branches or pull requests

7 participants