testing: go 1.17 regression: deadlock in test suite #48402
@tbonfort: It isn't clear to me whether CL 303330 introduced a deadlock itself or exposed some other runtime bug.
@bcmills, sorry, I should have posted the Go backtrace in the main issue. I didn't include it because it didn't contain any meaningful info for me, aside from messages like "goroutine running on other thread; stack unavailable". I'll post the exact output tomorrow UTC.
Here is the dump:
Yikes, that's not as helpful as I was hoping. 😞
Some additional info that may be useful:
Some more debugging info:
In the hanging case:
This seems strange, as both goroutines appear to be blocked on a read and a write to the same channel.
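The observation above is what makes the dump suspicious: a goroutine blocked sending on an unbuffered channel and a goroutine blocked receiving on the same channel should rendezvous immediately, so both staying blocked suggests the scheduler itself is stuck. A minimal sketch of the expected behavior (illustrative code, not from the reporter's test suite):

```go
package main

import (
	"fmt"
	"time"
)

// rendezvous pairs a blocked sender and a blocked receiver on the same
// unbuffered channel. Under a healthy runtime this always completes; a
// timeout here would correspond to the stuck state seen in the issue.
func rendezvous() (int, bool) {
	ch := make(chan int) // unbuffered: send and receive must pair up
	done := make(chan int)

	go func() { ch <- 42 }()     // blocks until a receiver arrives
	go func() { done <- <-ch }() // blocks until the sender arrives

	select {
	case v := <-done:
		return v, true // expected: the two goroutines rendezvous
	case <-time.After(time.Second):
		return 0, false // would correspond to the reported deadlock
	}
}

func main() {
	v, ok := rendezvous()
	fmt.Println(v, ok)
}
```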
Checking in on this issue, as it's labeled as a release blocker.
I'm actively looking into it today.
I can reproduce the reported deadlock.
This only requires two test functions to reproduce, but it does require two test functions:
Change https://golang.org/cl/351532 mentions this issue: |
This looks like either memory corruption from something in the C library or a bug in the Go compiler. Digging a bit further to figure out which.
@bcmills, I firmly believe that the failing test is unrelated to the issue at hand; it's due to incorrect credential handling in another part of the code (i.e. it fails identically with Go < 1.17).
With Go < 1.17 (I don't have enough feedback for 1.17 yet), things have been running smoothly with the same codebase. The valgrind output before the hang is "clean", i.e. it shows no errors in anything related to the C code.
Looks like memory corruption from the C side.
I don't really have the time to debug how the zeroed Go pages are getting corrupted (I can't easily set a watchpoint because the value hasn't even been allocated yet), but at this point I'm confident enough to rule out a bug in the Go compiler or runtime.
This issue can be closed as invalid. The OpenShared test case was asking the cgo lib to populate 300 bytes of data inside a slice we had only sized to 100 bytes. I'm actually confused as to why this didn't crash in previous Go versions. Thanks @bcmills for the investigation, and sorry for erroneously dragging you into an issue that's entirely unrelated to the bisected commit.
I have a couple of notes from debugging.
https://github.com/airbusgeo/godal/blob/00980a3df723677b2d949a10eedd78ab6c5d6aee/godal.go#L1477 and similar lines might become something like:
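The concrete suggestion was elided from this thread, but it presumably amounted to a bounds check before handing a Go slice to C. A minimal sketch of such a guard (hypothetical names; `byteCount` stands in for the size the underlying C call, e.g. GDALRasterIO, will actually write):

```go
package main

import (
	"fmt"
	"unsafe"
)

// checkedBufferPtr verifies that a Go slice is large enough for what the C
// side will write into it, before exposing its backing array as a pointer.
// Failing loudly here turns silent memory corruption (the real cause of the
// "deadlock" in this issue) into an immediate, debuggable error.
func checkedBufferPtr(buf []byte, byteCount int) (unsafe.Pointer, error) {
	if len(buf) < byteCount {
		return nil, fmt.Errorf("buffer too small: have %d bytes, need %d", len(buf), byteCount)
	}
	return unsafe.Pointer(&buf[0]), nil
}

func main() {
	// The failing test's shape: a 100-byte slice for a 300-byte write.
	buf := make([]byte, 100)
	if _, err := checkedBufferPtr(buf, 300); err != nil {
		fmt.Println(err)
	}
}
```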
Thanks for the suggestion. I was also wondering if I should add some extra checks for that buffer, as it is error-prone (there were a couple of other tests with similar buffer-sizing errors). The actual minimal buffer size is a bit trickier to calculate than just multiplying out the dimensions.
While I have your attention, do you have an idea why the gs:// test failed with a 400 error on your side? It was raised by the Google Cloud API directly, via https://github.com/airbusgeo/osio/blob/main/gcs.go#L63 and https://github.com/airbusgeo/osio/blob/main/gcs.go#L81 (I would have expected a 403 or 404 error if you had credentials available).
No idea what's up with the 400 error, unfortunately. 🤷♂️ |
What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (`go env`)?

What did you do?
My cgo test suite deadlocks; commenting out a seemingly random cgo call in a test that runs two tests before the deadlock fixes the issue.
Failing test run:
Commenting out this line in the TestOpenShared test fixes the deadlock: https://github.com/airbusgeo/godal/blob/00980a3df723677b2d949a10eedd78ab6c5d6aee/godal_test.go#L1242
Here is a gdb stacktrace when deadlocked:
I have bisected the issue down to commit 1c59066 (from @bcmills). Reverting this specific commit on master fixes the deadlock.