testing: go 1.17 regression: deadlock in test suite #48402

Closed
tbonfort opened this issue Sep 15, 2021 · 21 comments

@tbonfort (Contributor) commented Sep 15, 2021

What version of Go are you using (go version)?

$ go version
go version go1.17 linux/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/tbonfort/.cache/go-build"
GOENV="/home/tbonfort/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/tbonfort/go/pkg/mod"
GOOS="linux"
GOPATH="/home/tbonfort/go"
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build2515039918=/tmp/go-build -gno-record-gcc-switches"

What did you do?

My cgo test suite deadlocks; commenting out a seemingly random cgo call made two tests before the deadlock happens fixes the issue.

Failing test run:

...snip...
=== RUN   TestOpen
--- PASS: TestOpen (0.00s)
=== RUN   TestOpenUpdate
--- PASS: TestOpenUpdate (0.00s)
=== RUN   TestClosingErrors
--- PASS: TestClosingErrors (0.00s)
=== RUN   TestOpenShared
--- PASS: TestOpenShared (0.00s)
=== RUN   TestRegister
--- PASS: TestRegister (0.00s)
^Csignal: interrupt
FAIL	github.com/airbusgeo/godal	9.929s

Commenting out this line in the TestOpenShared test fixes the deadlock:
https://github.com/airbusgeo/godal/blob/00980a3df723677b2d949a10eedd78ab6c5d6aee/godal_test.go#L1242

Here is a gdb stacktrace when deadlocked:

sudo gdb ~/dev/godal/godal.test 1904477
GNU gdb (Ubuntu 10.1-2ubuntu2) 10.1.90.20210411-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/tbonfort/dev/godal/godal.test...
Attaching to program: /home/tbonfort/dev/godal/godal.test, process 1904477
[New LWP 1904478]
[New LWP 1904479]
[New LWP 1904480]
[New LWP 1904481]
[New LWP 1904482]
Error while reading shared library symbols for /lib/x86_64-linux-gnu/libpthread.so.0:
Cannot find user-level thread for LWP 1904482: generic error
runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:520
520		MOVL	AX, ret+40(FP)
Loading Go Runtime support.
(gdb) bt
#0  runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:520
#1  0x000000000043b6f6 in runtime.futexsleep (addr=0xfffffffffffffe00, val=2, ns=4670691) at /usr/local/go/src/runtime/os_linux.go:44
#2  0x00000000004155f9 in runtime.lock2 (l=0xfffffffffffffe00) at /usr/local/go/src/runtime/lock_futex.go:107
#3  0x0000000000410318 in runtime.lockWithRank (l=0x4744e3 <runtime.futex+35>, rank=<optimized out>)
    at /usr/local/go/src/runtime/lockrank_off.go:23
#4  runtime.lock (l=0x4744e3 <runtime.futex+35>) at /usr/local/go/src/runtime/lock_futex.go:48
#5  runtime.chanrecv (c=0xc0001cd340, ep=0xc00025fb67, block=true) at /usr/local/go/src/runtime/chan.go:508
#6  0x0000000000410198 in runtime.chanrecv1 (c=0xfffffffffffffe00, elem=0x2) at /usr/local/go/src/runtime/chan.go:439
#7  0x0000000000513d95 in testing.(*T).Run (t=0xc0003b41a0, name=..., f={void (testing.T *)} 0xc00025fbd8)
    at /usr/local/go/src/testing/testing.go:1307
#8  0x00000000005159ae in testing.runTests.func1 (t=0xc000283040) at /usr/local/go/src/testing/testing.go:1598
#9  0x0000000000513142 in testing.tRunner (t=0xc000283040, fn={void (testing.T *)} 0xc00025fc68)
    at /usr/local/go/src/testing/testing.go:1259
#10 0x000000000051585f in testing.runTests (matchString={void (string, string, bool *, error *)} 0xc00025fd10, 
    tests=[]testing.InternalTest = {...}, deadline=...) at /usr/local/go/src/testing/testing.go:1596
#11 0x00000000005145bd in testing.(*M).Run (m=0xc00011c800, code=0) at /usr/local/go/src/testing/testing.go:1504
#12 0x00000000009ca56b in main.main () at _testmain.go:177
(gdb) thread apply all bt

Thread 6 (LWP 1904482 "godal.test"):
#0  runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:520
#1  0x000000000043b6f6 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4670691) at /usr/local/go/src/runtime/os_linux.go:44
#2  0x0000000000415947 in runtime.notesleep (n=0xfffffffffffffe00) at /usr/local/go/src/runtime/lock_futex.go:160
#3  0x00000000004461d1 in runtime.templateThread () at /usr/local/go/src/runtime/proc.go:2385
#4  0x0000000000444c93 in runtime.mstart1 () at /usr/local/go/src/runtime/proc.go:1407
#5  0x0000000000444bd9 in runtime.mstart0 () at /usr/local/go/src/runtime/proc.go:1365
#6  0x0000000000470445 in runtime.mstart () at /usr/local/go/src/runtime/asm_amd64.s:248
#7  0x0000000000474b65 in runtime.mstart () at <autogenerated>:1
#8  0x00000000009d00b2 in crosscall_amd64 () at gcc_amd64.S:40
#9  0x00007fb86affd640 in ?? ()
#10 0x0000000000000000 in ?? ()

Thread 5 (LWP 1904481 "godal.test"):
#0  runtime.epollwait () at /usr/local/go/src/runtime/sys_linux_amd64.s:666
#1  0x000000000043b45c in runtime.netpoll (delay=<optimized out>) at /usr/local/go/src/runtime/netpoll_epoll.go:127
#2  0x00000000004474d3 in runtime.findrunnable () at /usr/local/go/src/runtime/proc.go:2947
#3  0x00000000004486d9 in runtime.schedule () at /usr/local/go/src/runtime/proc.go:3367
#4  0x0000000000448c2d in runtime.park_m (gp=0xc000282340) at /usr/local/go/src/runtime/proc.go:3516
#5  0x00000000004704c3 in runtime.mcall () at /usr/local/go/src/runtime/asm_amd64.s:307
#6  0xf934f160b63ffd00 in ?? ()
#7  0x0000000000800000 in text/template/parse.(*lexer).errorf (l=0x0, format=..., args=...) at /usr/local/go/src/text/template/parse/lex.go:188
#8  text/template/parse.lexComment (l=0x0) at /usr/local/go/src/text/template/parse/lex.go:319
#9  0x0000000000000000 in ?? ()

Thread 4 (LWP 1904480 "godal.test"):
#0  runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:520
--Type <RET> for more, q to quit, c to continue without paging-- 
#1  0x000000000043b6f6 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4670691) at /usr/local/go/src/runtime/os_linux.go:44
#2  0x0000000000415947 in runtime.notesleep (n=0xfffffffffffffe00) at /usr/local/go/src/runtime/lock_futex.go:160
#3  0x0000000000444d8a in runtime.mPark () at /usr/local/go/src/runtime/proc.go:1441
#4  0x00000000004462d8 in runtime.stopm () at /usr/local/go/src/runtime/proc.go:2408
#5  0x00000000004477a5 in runtime.findrunnable () at /usr/local/go/src/runtime/proc.go:2984
#6  0x00000000004486d9 in runtime.schedule () at /usr/local/go/src/runtime/proc.go:3367
#7  0x0000000000448c2d in runtime.park_m (gp=0xc0000011e0) at /usr/local/go/src/runtime/proc.go:3516
#8  0x00000000004704c3 in runtime.mcall () at /usr/local/go/src/runtime/asm_amd64.s:307
#9  0xf934f160b63ffd00 in ?? ()
#10 0x0000000000800000 in text/template/parse.(*lexer).errorf (l=0x0, format=..., args=...) at /usr/local/go/src/text/template/parse/lex.go:188
#11 text/template/parse.lexComment (l=0x0) at /usr/local/go/src/text/template/parse/lex.go:319
#12 0x0000000000000000 in ?? ()

Thread 3 (LWP 1904479 "godal.test"):
#0  runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:520
#1  0x000000000043b6f6 in runtime.futexsleep (addr=0xfffffffffffffe00, val=2, ns=4670691) at /usr/local/go/src/runtime/os_linux.go:44
#2  0x00000000004155f9 in runtime.lock2 (l=0xfffffffffffffe00) at /usr/local/go/src/runtime/lock_futex.go:107
#3  0x000000000040f55a in runtime.lockWithRank (l=0x4744e3 <runtime.futex+35>, rank=<optimized out>) at /usr/local/go/src/runtime/lockrank_off.go:23
#4  runtime.lock (l=0x4744e3 <runtime.futex+35>) at /usr/local/go/src/runtime/lock_futex.go:48
#5  runtime.chansend (c=0xc0001cd340, ep=0xc000406e4f, block=true, callerpc=<optimized out>) at /usr/local/go/src/runtime/chan.go:200
#6  0x000000000040f47d in runtime.chansend1 (c=0xfffffffffffffe00, elem=0x2) at /usr/local/go/src/runtime/chan.go:143
#7  0x00000000005139e8 in testing.tRunner.func1.1 () at /usr/local/go/src/testing/testing.go:1189
#8  0x00000000005134e7 in testing.tRunner.func1 () at /usr/local/go/src/testing/testing.go:1250
#9  0x000000000051317b in testing.tRunner (t=0xc0003b41a0, fn={void (testing.T *)} 0xc000406fc8) at /usr/local/go/src/testing/testing.go:1265
#10 0x0000000000513e4a in testing.(*T).Run·dwrap·21 () at /usr/local/go/src/testing/testing.go:1306
#11 0x0000000000472701 in runtime.goexit () at /usr/local/go/src/runtime/asm_amd64.s:1581
--Type <RET> for more, q to quit, c to continue without paging--
#12 0x0000000000000000 in ?? ()

Thread 2 (LWP 1904478 "godal.test"):
#0  runtime.usleep () at /usr/local/go/src/runtime/sys_linux_amd64.s:146
#1  0x000000000044d2d1 in runtime.sysmon () at /usr/local/go/src/runtime/proc.go:5337
#2  0x0000000000444c93 in runtime.mstart1 () at /usr/local/go/src/runtime/proc.go:1407
#3  0x0000000000444bd9 in runtime.mstart0 () at /usr/local/go/src/runtime/proc.go:1365
#4  0x0000000000470445 in runtime.mstart () at /usr/local/go/src/runtime/asm_amd64.s:248
#5  0x0000000000474b65 in runtime.mstart () at <autogenerated>:1
#6  0x00000000009d00b2 in crosscall_amd64 () at gcc_amd64.S:40
#7  0x00007fb871623640 in ?? ()
#8  0x0000000000000000 in ?? ()

Thread 1 (LWP 1904477 "godal.test"):
#0  runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:520
#1  0x000000000043b6f6 in runtime.futexsleep (addr=0xfffffffffffffe00, val=2, ns=4670691) at /usr/local/go/src/runtime/os_linux.go:44
#2  0x00000000004155f9 in runtime.lock2 (l=0xfffffffffffffe00) at /usr/local/go/src/runtime/lock_futex.go:107
#3  0x0000000000410318 in runtime.lockWithRank (l=0x4744e3 <runtime.futex+35>, rank=<optimized out>) at /usr/local/go/src/runtime/lockrank_off.go:23
#4  runtime.lock (l=0x4744e3 <runtime.futex+35>) at /usr/local/go/src/runtime/lock_futex.go:48
#5  runtime.chanrecv (c=0xc0001cd340, ep=0xc00025fb67, block=true) at /usr/local/go/src/runtime/chan.go:508
#6  0x0000000000410198 in runtime.chanrecv1 (c=0xfffffffffffffe00, elem=0x2) at /usr/local/go/src/runtime/chan.go:439
#7  0x0000000000513d95 in testing.(*T).Run (t=0xc0003b41a0, name=..., f={void (testing.T *)} 0xc00025fbd8) at /usr/local/go/src/testing/testing.go:1307
#8  0x00000000005159ae in testing.runTests.func1 (t=0xc000283040) at /usr/local/go/src/testing/testing.go:1598
#9  0x0000000000513142 in testing.tRunner (t=0xc000283040, fn={void (testing.T *)} 0xc00025fc68) at /usr/local/go/src/testing/testing.go:1259
#10 0x000000000051585f in testing.runTests (matchString={void (string, string, bool *, error *)} 0xc00025fd10, tests=[]testing.InternalTest = {...}, deadline=...) at /usr/local/go/src/testing/testing.go:1596
--Type <RET> for more, q to quit, c to continue without paging--
#11 0x00000000005145bd in testing.(*M).Run (m=0xc00011c800, code=0) at /usr/local/go/src/testing/testing.go:1504
#12 0x00000000009ca56b in main.main () at _testmain.go:177
(gdb) q
A debugging session is active.

	Inferior 1 [process 1904477] will be detached.

Quit anyway? (y or n) y
Detaching from program: /home/tbonfort/dev/godal/godal.test, process 1904477
[Inferior 1 (process 1904477) detached]

I have bisected the issue down to 1c59066 (from @bcmills). Reverting this specific commit in master fixes the deadlock.

@ianlancetaylor ianlancetaylor changed the title (c?)go 1.17 regression: deadlock in test suite testing: go 1.17 regression: deadlock in test suite Sep 15, 2021
@ianlancetaylor ianlancetaylor added this to the Go1.18 milestone Sep 15, 2021
@bcmills bcmills self-assigned this Sep 15, 2021
@bcmills (Member) commented Sep 15, 2021

@tbonfort, if you set -timeout=15s or similar you should get a goroutine dump from the Go runtime. Could you paste the full goroutine dump into a <details> block in a comment on this issue? The GDB backtrace appears to include only the subset of goroutines currently running on OS threads.

It isn't clear to me whether CL 303330 introduced a deadlock itself, exposed some other runtime bug (perhaps in defer and recover handling; compare #43941?), or exposed a latent bug in the test or one of the helper libraries it uses. A full goroutine dump would help to determine which of those is the case.

@tbonfort (Contributor, Author) commented Sep 15, 2021

@bcmills, sorry, I should have posted the Go backtrace in the main issue. I didn't include it because it didn't contain any meaningful info for me aside from messages like "goroutine running on other thread; stack unavailable". I'll post the exact output tomorrow UTC.

@tbonfort (Contributor, Author) commented Sep 16, 2021

Here is the dump:

panic: test timed out after 10s

goroutine 50 [running]:
testing.(*M).startAlarm.func1()
	/home/tbonfort/dev/thirdparty/go/src/testing/testing.go:1788 +0x8e
created by time.goFunc
	/home/tbonfort/dev/thirdparty/go/src/time/sleep.go:176 +0x32

goroutine 1 [running]:
	goroutine running on other thread; stack unavailable

goroutine 6 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc0000fa280)
	/home/tbonfort/go/pkg/mod/go.opencensus.io@v0.23.0/stats/view/worker.go:276 +0xad
created by go.opencensus.io/stats/view.init.0
	/home/tbonfort/go/pkg/mod/go.opencensus.io@v0.23.0/stats/view/worker.go:34 +0x8d

goroutine 35 [running]:
	goroutine running on other thread; stack unavailable
created by testing.(*T).Run
	/home/tbonfort/dev/thirdparty/go/src/testing/testing.go:1306 +0x35f
exit status 2
FAIL	github.com/airbusgeo/godal	10.032s

@bcmills (Member) commented Sep 16, 2021

Yikes, that's not as helpful as I was hoping. 😞
Thanks anyway — I'll dig into it more from here!

@tbonfort (Contributor, Author) commented Sep 20, 2021

Some additional info that may be useful:

  • I had a cgo pointer-passing rule that was not respected. Fixing it did not change the outcome of this issue; the hang happens at the same point. The test suite now runs cleanly with GODEBUG=cgocheck=2.
  • I added a one-second sleep to all C functions called from cgo in order to force goroutine creation for all cgo calls (cf. #19574 (comment)). The hang still happens at the same point.

@tbonfort (Contributor, Author) commented Sep 22, 2021

Some more debugging info:

  • using make(chan bool, 2) for t.signal does not change the hanging behavior

  • I added some fmt.Fprintf(os.Stderr, "START|DONE <- t.signal %p\n", t.signal) calls around the lines sending/receiving on t.signal, in t.Run and in tRunner's deferred func.
    On "working" tests, I get:

=== RUN   TestBandMask
START <- t.signal 0xc0000c21c0
--- PASS: TestBandMask (0.00s)
START: t.signal <- signal 0xc0000c21c0
DONE: t.signal <- signal 0xc0000c21c0
DONE <- t.signal 0xc0000c21c0

In the hanging case:

=== RUN   TestOpenShared
START <- t.signal 0xc0003a00e0
--- PASS: TestOpenShared (0.00s)
START: t.signal <- signal 0xc0003a00e0
DONE: t.signal <- signal 0xc0003a00e0
DONE <- t.signal 0xc0003a00e0
=== RUN   TestRegister
START <- t.signal 0xc0003a01c0
--- PASS: TestRegister (0.00s)
START: t.signal <- signal 0xc0003a01c0
/* hang */

This seems strange, as both goroutines appear to be blocked on a read and a write of the same channel.

  • running the test suite under valgrind causes a hang with the same t.signal locking pattern, but in a subsequent test (i.e. a few tests after the TestRegister one). Outside of valgrind, the test suite consistently hangs right after TestRegister.
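For context, here is a minimal model of the handshake these START/DONE traces instrument (a sketch of the idea, not the real testing internals): the parent goroutine in t.Run blocks receiving on t.signal until the subtest's runner sends on it.

```go
package main

import "fmt"

func main() {
	// t.signal in the testing package is a buffered bool channel; the
	// parent's receive ("DONE <- t.signal" above) completes only after
	// the subtest goroutine's send ("DONE: t.signal <- signal").
	signal := make(chan bool, 1)

	go func() {
		// ...subtest body would run here...
		signal <- true // deferred send in tRunner when the subtest finishes
	}()

	<-signal // t.Run resumes; if this send never lands, the parent hangs here
	fmt.Println("subtest complete")
}
```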

@toothrot (Contributor) commented Sep 22, 2021

Checking in on this issue, as it's labeled as a release blocker.

@bcmills (Member) commented Sep 22, 2021

I'm actively looking into it today.

@bcmills (Member) commented Sep 22, 2021

Using libgdal 3.2.2+dfsg-3 (installed via apt) I see a test failure in TestVSIGCS under go1.16.8:

~/tmp/godal$ go1.16.8 test . -timeout=20s
ERROR 10: Pointer 'hSRS1' is NULL in 'OSRIsSame'.

ERROR 10: Pointer 'hSRS2' is NULL in 'OSRIsSame'.

ERROR 10: Pointer 'hSRS' is NULL in 'OSRIsGeographic'.

ERROR 10: Pointer 'hSRS' is NULL in 'OSRGetSemiMajor'.

ERROR 10: Pointer 'hSRS' is NULL in 'OSRGetSemiMajor'.

ERROR 10: Pointer 'hSRS' is NULL in 'OSRGetAuthorityCode'.

ERROR 10: Pointer 'hSRS' is NULL in 'OSRGetAuthorityName'.

ERROR 10: Pointer 'hSRS' is NULL in 'OSRAutoIdentifyEPSG'.

ERROR 1: PROJ: proj_get_ellipsoid: CRS has no geodetic CRS
ERROR 1: PROJ: proj_get_ellipsoid: Object is not a CRS or GeodeticReferenceFrame
ERROR 1: PROJ: proj_get_ellipsoid: CRS has no geodetic CRS
ERROR 1: PROJ: proj_get_ellipsoid: Object is not a CRS or GeodeticReferenceFrame
ERROR 1: PROJ: proj_get_ellipsoid: CRS has no geodetic CRS
ERROR 1: PROJ: proj_get_ellipsoid: Object is not a CRS or GeodeticReferenceFrame
ERROR 10: Pointer 'hGeom' is NULL in 'OGR_G_IsEmpty'.

--- FAIL: TestVSIGCS (0.22s)
    godal_test.go:3044: new reader for gs://godal-ci-data/test.tif: Get "https://storage.googleapis.com/godal-ci-data/test.tif": oauth2: cannot fetch token: 400 Bad Request
        Response: {
          "error": "invalid_grant",
          "error_description": "Bad Request"
        }
OGRCT: Wrap target at 50.
ERROR 10: Pointer 'hBand' is NULL in 'GDALGetMaskBand'.

ERROR 10: Pointer 'hBand' is NULL in 'GDALGetMaskBand'.

FAIL
FAIL    github.com/airbusgeo/godal      2.441s
FAIL

I can reproduce the reported deadlock using both go1.17 and gotip.

@bcmills (Member) commented Sep 22, 2021

This only requires two test functions to reproduce, but it does require two test functions:

~/tmp/godal$ gotip test -run='TestOpenShared' -timeout=12s
PASS
ok      github.com/airbusgeo/godal      0.097s

~/tmp/godal$ gotip test -run='TestRegister' -timeout=12s
PASS
ok      github.com/airbusgeo/godal      0.102s

~/tmp/godal$ gotip test -run='Test(OpenShared|Register)' -timeout=12s
panic: test timed out after 12s

goroutine 18 [running]:
testing.(*M).startAlarm.func1()
        /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1970 +0x8e
created by time.goFunc
        /usr/local/google/home/bcmills/sdk/gotip/src/time/sleep.go:176 +0x32

goroutine 1 [running]:
        goroutine running on other thread; stack unavailable

goroutine 6 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc0000fa280)
        /usr/local/google/home/bcmills/pkg/mod/go.opencensus.io@v0.23.0/stats/view/worker.go:276 +0xad
created by go.opencensus.io/stats/view.init.0
        /usr/local/google/home/bcmills/pkg/mod/go.opencensus.io@v0.23.0/stats/view/worker.go:34 +0x8d

goroutine 9 [running]:
        goroutine running on other thread; stack unavailable
created by testing.(*T).Run
        /usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1439 +0x35f
exit status 2
FAIL    github.com/airbusgeo/godal      12.095s

@gopherbot commented Sep 22, 2021

Change https://golang.org/cl/351532 mentions this issue: testing: add debugging assertions

@bcmills (Member) commented Sep 22, 2021

This looks like either memory corruption from something in the C library or a bug in the Go compiler. Digging a bit further to figure out which.

@tbonfort (Contributor, Author) commented Sep 22, 2021

> Using libgdal 3.2.2+dfsg-3 (installed via apt) I see a test failure in TestVSIGCS under go1.16.8: […]
>
> I can reproduce the reported deadlock using both go1.17 and gotip.

@bcmills I firmly believe that failing test is unrelated to the issue at hand; it is due to incorrect credential handling in another part of the code (i.e. it fails identically on Go < 1.17).
The ERROR 10 / ERROR 1 messages are expected: the test suite deliberately exercises error-raising cases.

@tbonfort (Contributor, Author) commented Sep 22, 2021

> This looks like either memory corruption from something in the C library or a bug in the Go compiler.

With Go < 1.17, things have been running smoothly with the same codebase (I don't have as much feedback for 1.17 itself). The valgrind output before the hang is clean, i.e. it does not show errors in anything related to the C code.

@bcmills (Member) commented Sep 22, 2021

Looks like memory corruption from the C side. It looks like something in the unsafe code is freeing a Go-owned zero-page to the C heap, which is overwriting the zero-page with nonzero bytes.

Then the make(chan bool, 1) in (*testing.T).Run ends up allocating a channel with capacity 1 and initial length 0x6363636363636363, instead of the initial length 0 that it is supposed to have, and sending to a 1-buffered channel that already contains 0x6363636363636363 values blocks. 😅
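The blocking behavior described above can be illustrated with a sketch: here the 1-buffered channel's slot is legitimately full, which produces the same stuck send that the corrupted length field caused (the real bug corrupted the channel's internal count; pre-filling the buffer just simulates the symptom).

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ch := make(chan bool, 1)
	ch <- true // buffer now full: stands in for a corrupted non-zero initial length

	select {
	case ch <- true:
		fmt.Println("send completed")
	case <-time.After(100 * time.Millisecond):
		// In the deadlocked test binary there is no such timeout, so
		// tRunner's send on t.signal blocks forever and t.Run never wakes.
		fmt.Println("send blocked")
	}
}
```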

@bcmills (Member) commented Sep 22, 2021

I don't really have the time to debug how the zeroed Go pages are getting corrupted (I can't set a watchpoint easily because the value hasn't even been allocated yet), but at this point I'm confident enough to rule out a bug in the Go compiler or the testing package.

You might be able to get more mileage out of valgrind using the Go -msan flag. Good luck!


@bcmills bcmills removed this from the Go1.18 milestone Sep 22, 2021
@bcmills bcmills added this to the Unplanned milestone Sep 22, 2021
@bcmills (Member) commented Sep 22, 2021

(Also, note that you might be able to use the new unsafe.Slice and unsafe.Add added in Go 1.17 to simplify this code...)
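For reference, the Go 1.17 additions mentioned work roughly like this (a generic sketch, not godal's code):

```go
package main

import (
	"fmt"
	"unsafe"
)

func main() {
	data := [4]int32{10, 20, 30, 40}
	p := &data[0] // stands in for a pointer obtained from C code

	// unsafe.Slice builds a Go slice from a (pointer, length) pair,
	// replacing the old reflect.SliceHeader idiom.
	s := unsafe.Slice(p, 3)
	fmt.Println(s) // [10 20 30]

	// unsafe.Add offsets a pointer by a byte count, replacing manual
	// uintptr arithmetic.
	q := (*int32)(unsafe.Add(unsafe.Pointer(p), unsafe.Sizeof(*p)))
	fmt.Println(*q) // 20
}
```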

@tbonfort (Contributor, Author) commented Sep 23, 2021

This issue can be closed as invalid. The OpenShared test case was asking the cgo library to populate 300 bytes of data into a slice we had sized to only 100 bytes. I'm actually confused as to why this didn't crash on previous Go versions. Thanks @bcmills for the investigation, and sorry for erroneously dragging you into an issue that's entirely unrelated to the bisected commit.


@bcmills bcmills closed this Sep 23, 2021
@bcmills (Member) commented Sep 23, 2021

I have a couple of notes from debugging.

  1. This was probably masked on earlier Go versions because the same memory corruption manifested as an undetected goroutine leak within the testing package.

  2. You could add an extra margin of safety to the cBuffer helper function by passing in the expected dimensions and then slicing the buffer to those dimensions — that will cause a run-time panic (instead of memory corruption) if the passed-in buffer is too small.

https://github.com/airbusgeo/godal/blob/00980a3df723677b2d949a10eedd78ab6c5d6aee/godal.go#L1477 and similar lines might become something like:

 cBuf = unsafe.Pointer(&buf[:bufWidth*bufHeight*nBands][0])
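Spelled out as a self-contained sketch (the cBuffer signature here is assumed for illustration, not godal's actual helper): re-slicing before taking the address turns an undersized buffer into a Go panic instead of silent corruption.

```go
package main

import (
	"fmt"
	"unsafe"
)

// cBuffer re-slices buf to the element count the C call will write before
// taking the address, so an undersized buffer panics in Go rather than
// letting C write past the end of the allocation.
func cBuffer(buf []byte, bufWidth, bufHeight, nBands int) unsafe.Pointer {
	need := bufWidth * bufHeight * nBands
	return unsafe.Pointer(&buf[:need][0]) // panics if buf cannot hold need bytes
}

func main() {
	_ = cBuffer(make([]byte, 300), 10, 10, 3) // exactly sized: fine

	defer func() {
		if recover() != nil {
			fmt.Println("caught: buffer too small")
		}
	}()
	_ = cBuffer(make([]byte, 100), 10, 10, 3) // would have corrupted memory; now panics
}
```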

@tbonfort (Contributor, Author) commented Sep 23, 2021

Thanks for the suggestion. I was also wondering whether I should add some extra checks for that buffer, as it is error-prone (a couple of other tests had similar buffer-sizing errors). The actual minimal buffer size is a bit trickier to calculate than just bufWidth*bufHeight*nBands, as it is also possible to configure the pixel, line and band strides (i.e. the spacing from one pixel / line / band to the next).
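With configurable strides, the minimal size is the offset of the furthest addressable element plus one. A sketch with illustrative names (assumed stride semantics, not godal's API):

```go
package main

import "fmt"

// minBufferLen computes the smallest buffer (in elements) that a strided
// read can touch: the furthest element sits at the sum of the maximum
// offsets along each axis.
func minBufferLen(bufWidth, bufHeight, nBands, pixelStride, lineStride, bandStride int) int {
	return (bufWidth-1)*pixelStride + (bufHeight-1)*lineStride + (nBands-1)*bandStride + 1
}

func main() {
	w, h, b := 10, 10, 3
	// Default pixel-interleaved strides collapse the formula back to w*h*b.
	fmt.Println(minBufferLen(w, h, b, b, w*b, 1)) // 300
}
```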

While I have your attention, do you have an idea why the gs:// test failed with a 400 error on your side? It is raised by the Google Cloud API directly, via https://github.com/airbusgeo/osio/blob/main/gcs.go#L63 and https://github.com/airbusgeo/osio/blob/main/gcs.go#L81 (I would have expected a 403 or 404 error if you had credentials available).

@bcmills (Member) commented Sep 23, 2021

No idea what's up with the 400 error, unfortunately. 🤷‍♂️

