runtime: cgo call to symbol from library loaded dynamically will panic with go 1.21.1 and ld >2.38 #63264

Closed
braydonk opened this issue Sep 27, 2023 · 21 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

Comments

@braydonk

braydonk commented Sep 27, 2023

What version of Go are you using (go version)?

$ go version
go version go1.21.1 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/braydonk/.cache/go-build'
GOENV='/home/braydonk/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/braydonk/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/braydonk/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.21.1'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/braydonk/Git/cgo_dl_repro/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1951551231=/tmp/go-build -gno-record-gcc-switches'

What did you do?

I created a minimal reproduction setup at https://github.com/braydonk/cgo_dl_repro

In this scenario, I have a header file that references a single function get42 that I will get from a shared object, which I will load at runtime with dlopen. The ld flags -Wl,--unresolved-symbols=ignore-in-object-files are used.

First, I run make liblib, which will compile the C file in this repo that implements the get42 function and then turn it into a shared object.
Then I run go run .

What did you expect to see?

In go1.20.8, and in go1.21.1 with ld version 2.34, I get the expected result:

braydonk@braydonk:~/Git/cgo_dl_repro$ go run .
get42 address:  0x7fb06a1fe0f9
42

What did you see instead?

In go1.21 with an ld version > 2.38 I get a panic:

braydonk@braydonk:~/Git/cgo_dl_repro$ go run .
get42 address:  0x7f2c601c00f9
SIGSEGV: segmentation violation
PC=0x0 m=0 sigcode=1
signal arrived during cgo execution

goroutine 1 [syscall]:
runtime.cgocall(0x48a800, 0xc000065eb8)
        /usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc000065e90 sp=0xc000065e58 pc=0x40590b
main._Cfunc_get42()
        _cgo_gotypes.go:139 +0x47 fp=0xc000065eb8 sp=0xc000065e90 pc=0x48a007
main.main()
        /home/braydonk/Git/cgo_dl_repro/main.go:24 +0xf9 fp=0xc000065f40 sp=0xc000065eb8 pc=0x48a6b9
runtime.main()
        /usr/local/go/src/runtime/proc.go:267 +0x2bb fp=0xc000065fe0 sp=0xc000065f40 pc=0x435e9b
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000065fe8 sp=0xc000065fe0 pc=0x45f901

goroutine 2 [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000050fa8 sp=0xc000050f88 pc=0x4362ee
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:404
runtime.forcegchelper()
        /usr/local/go/src/runtime/proc.go:322 +0xb3 fp=0xc000050fe0 sp=0xc000050fa8 pc=0x436173
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000050fe8 sp=0xc000050fe0 pc=0x45f901
created by runtime.init.6 in goroutine 1
        /usr/local/go/src/runtime/proc.go:310 +0x1a

goroutine 3 [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000051778 sp=0xc000051758 pc=0x4362ee
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:404
runtime.bgsweep(0x0?)
        /usr/local/go/src/runtime/mgcsweep.go:280 +0x94 fp=0xc0000517c8 sp=0xc000051778 pc=0x422c14
runtime.gcenable.func1()
        /usr/local/go/src/runtime/mgc.go:200 +0x25 fp=0xc0000517e0 sp=0xc0000517c8 pc=0x417fa5
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000517e8 sp=0xc0000517e0 pc=0x45f901
created by runtime.gcenable in goroutine 1
        /usr/local/go/src/runtime/mgc.go:200 +0x66

goroutine 4 [GC scavenge wait]:
runtime.gopark(0xc00007a000?, 0x4c5128?, 0x1?, 0x0?, 0xc0000071e0?)
        /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000051f70 sp=0xc000051f50 pc=0x4362ee
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:404
runtime.(*scavengerState).park(0x53bfe0)
        /usr/local/go/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc000051fa0 sp=0xc000051f70 pc=0x4204a9
runtime.bgscavenge(0x0?)
        /usr/local/go/src/runtime/mgcscavenge.go:653 +0x3c fp=0xc000051fc8 sp=0xc000051fa0 pc=0x420a3c
runtime.gcenable.func2()
        /usr/local/go/src/runtime/mgc.go:201 +0x25 fp=0xc000051fe0 sp=0xc000051fc8 pc=0x417f45
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000051fe8 sp=0xc000051fe0 pc=0x45f901
created by runtime.gcenable in goroutine 1
        /usr/local/go/src/runtime/mgc.go:201 +0xa5

goroutine 5 [finalizer wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000052628 sp=0xc000052608 pc=0x4362ee
runtime.runfinq()
        /usr/local/go/src/runtime/mfinal.go:193 +0x107 fp=0xc0000527e0 sp=0xc000052628 pc=0x417027
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000527e8 sp=0xc0000527e0 pc=0x45f901
created by runtime.createfing in goroutine 1
        /usr/local/go/src/runtime/mfinal.go:163 +0x3d

rax    0x0
rbx    0xc000065eb8
rcx    0xc000065eb8
rdx    0xc000065e48
rdi    0xc000065eb8
rsi    0x53c080
rbp    0xc000065e48
rsp    0x7ffd04bb7088
r8     0x53c460
r9     0x0
r10    0x1
r11    0x206
r12    0xc000066000
r13    0x53c460
r14    0xc0000061a0
r15    0x8
rip    0x0
rflags 0x10246
cs     0x33
fs     0x0
gs     0x0
exit status 2

Additional Info

This seems to be a result of how CGO handles --unresolved-symbols=ignore-in-object-files. The unresolved symbol results in SIGSEGV because the address of the symbol is 0x0. In go1.20.8, when I skip the dlopen step entirely and just call C.get42() without loading anything, I get an unresolved symbol lookup error:

braydonk@braydonk:~/Git/cgo_dl_repro$ go run .
/tmp/go-build1699640599/b001/exe/cgo_dl_repro: symbol lookup error: /tmp/go-build1699640599/b001/exe/cgo_dl_repro: undefined symbol: get42
exit status 127

However in go1.21.1, I get a panic identical to calling it after loading the library.

Different ld versions

My setup for testing the different ld versions was actually to change distros entirely. I have my personal machine, which runs rolling Debian Testing, and VMs on Debian Bullseye (11), Ubuntu Jammy (22.04), and Ubuntu Focal (20.04). The panic in go1.21.1 occurs on every OS except Ubuntu Focal, and the only difference I could think of was the lower ld version, which is why I have called that out, but technically there could be some other difference that I missed.

Why the strange setup?

This setup may seem oddly specific. I am mirroring the setup used by NVIDIA's Go NVML bindings; we discovered this error through our usage of that library. See NVIDIA/go-nvml#36, particularly the newest comments, which discuss how this specific breakage appeared after upgrading to go1.21.1.

@braydonk braydonk changed the title from "runtime: cgo call to symbol from library loaded dynamically will panic with go 1.21.1 and ld 2.41" to "runtime: cgo call to symbol from library loaded dynamically will panic with go 1.21.1 and ld >2.38" Sep 27, 2023
@braydonk
Author

braydonk commented Sep 27, 2023

I tried building the binary in my reproduction repo with go1.20.8 and go1.21.1 and then ran nm to check the symbols in each binary.

Go 1.20:

braydonk@braydonk:~/Git/cgo_dl_repro$ nm cgo_dl_repro | grep get42
0000000000483740 T _cgo_59b4640d347f_Cfunc_get42
                 U get42
0000000000483580 t main._Cfunc_get42.abi0
000000000051a1c8 d main._cgo_59b4640d347f_Cfunc_get42

Go 1.21:

braydonk@braydonk:~/Git/cgo_dl_repro$ nm cgo_dl_repro | grep get42
000000000047ce70 T _cgo_59b4640d347f_Cfunc_get42
000000000047ccc0 t main._Cfunc_get42.abi0
00000000005191a8 d main._cgo_59b4640d347f_Cfunc_get42

So in this case it didn't even show up as an undefined symbol in Go 1.21.1. I think this would explain why it panics in Go 1.21; in Go 1.20 the symbol is present as an undefined symbol, which is why the call fails with a symbol lookup error when nothing is loaded, and (I think; this is all new to me) why the symbol can be found after dlopen.

@braydonk
Author

braydonk commented Sep 27, 2023

Tried it in an Ubuntu 20.04 VM (ld version 2.34) with Go 1.21, and got the same result as compiling with Go 1.20 on ld version 2.41:

braydonk@focal-test:~/cgo_dl_repro$ nm cgo_dl_repro | grep get42
000000000048a820 T _cgo_b06122c1f854_Cfunc_get42
                 U get42
0000000000489fc0 t main._Cfunc_get42.abi0
00000000005321e8 d main._cgo_b06122c1f854_Cfunc_get42

@bcmills bcmills added the compiler/runtime Issues related to the Go compiler and/or runtime. label Sep 27, 2023
@braydonk
Author

braydonk commented Sep 27, 2023

I'm trying to rule out ld by trying out a different linker. Using Go 1.21 with the following build command:

go build -a --ldflags '-extldflags "-fuse-ld=gold -Wl,--unresolved-symbols=ignore-in-object-files"' .

By checking with nm the unresolved symbol is there:

braydonk@braydonk:~/Git/cgo_dl_repro$ nm cgo_dl_repro | grep get42
000000000047bff0 T _cgo_59b4640d347f_Cfunc_get42
                 U get42
000000000047be40 t main._Cfunc_get42.abi0
00000000005181a8 d main._cgo_59b4640d347f_Cfunc_get42

However, running the binary still results in a panic. So I guess the difference in symbols in the executable isn't the root cause here.

@braydonk
Author

braydonk commented Sep 27, 2023

Tried another trick borrowed from the earlier linked go-nvml issue discussion, using the --weak-unresolved-symbols flag for gold. This resulted in a panic as well.
(This was kind of silly and a red herring, because in this case with a weak symbol where I don't load the library the panic probably makes sense. It also panics in Go 1.20)

@braydonk
Author

braydonk commented Sep 27, 2023

Actually, this works. In the last comment, I was testing by just trying to call the symbol and not loading the library. However, with gold and --weak-unresolved-symbols and running dlopen it works with Go 1.21.

@braydonk
Author

braydonk commented Sep 28, 2023

In my reproduction repo, I ran the cgo command directly on main.go (with any references to the other file commented out), and all of the generated output was identical between go1.21.1 and go1.20.8 (at least according to my attempts to diff the two generated _obj folders with meld).

@braydonk
Author

braydonk commented Sep 28, 2023

I debugged two binaries built with go1.20.8 and go1.21.1 respectively, using the reproduction repo but commenting out the part where the dynamic library is loaded. This is a run where the expected output would be a symbol lookup error.

CGO output

In the generated CGO output, the generated C function _cgo_1dc841591e27_Cfunc_get42:

CGO_NO_SANITIZE_THREAD
void
_cgo_1dc841591e27_Cfunc_get42(void *v)
{
	struct {
		int r;
		char __pad4[4];
	} __attribute__((__packed__, __gcc_struct__)) *_cgo_a = v;
	char *_cgo_stktop = _cgo_topofstack();
	__typeof__(_cgo_a->r) _cgo_r;
	_cgo_tsan_acquire();
	_cgo_r = get42();
	_cgo_tsan_release();
	_cgo_a = (void*)((char*)_cgo_a + (_cgo_topofstack() - _cgo_stktop));
	_cgo_a->r = _cgo_r;
	_cgo_msan_write(&_cgo_a->r, sizeof(_cgo_a->r));
}

Go 1.21

At the line _cgo_r = get42(), in go1.21.1, the program segfaults. Here's the GDB output with a few steps of context:

_cgo_topofstack () at /usr/local/go/src/runtime/asm_amd64.s:1645
1645		RET
(gdb) info registers
rax            0xc000038800        824633952256
rbx            0xc0000386f8        824633951992
rcx            0xc0000386f8        824633951992
rdx            0xc000038688        824633951880
rsi            0x533100            5452032
rdi            0xc0000386f8        824633951992
rbp            0xc000038688        0xc000038688
rsp            0x7fffffffe298      0x7fffffffe298
r8             0x5334e0            5453024
r9             0x0                 0
r10            0x410               1040
r11            0xffffffffffffffff  -1
r12            0x100               256
r13            0x6a                106
r14            0xc0000061a0        824633745824
r15            0x4                 4
rip            0x45ed98            0x45ed98 <_cgo_topofstack+24>
eflags         0x216               [ PF AF IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
(gdb) step
_cgo_effbaea66e62_Cfunc_get42 (v=0xc0000386f8) at /tmp/go-build/cgo-gcc-prolog:52
52	/tmp/go-build/cgo-gcc-prolog: No such file or directory.
(gdb) info registers
rax            0xc000038800        824633952256
rbx            0xc0000386f8        824633951992
rcx            0xc0000386f8        824633951992
rdx            0xc000038688        824633951880
rsi            0x533100            5452032
rdi            0xc0000386f8        824633951992
rbp            0xc000038688        0xc000038688
rsp            0x7fffffffe2a0      0x7fffffffe2a0
r8             0x5334e0            5453024
r9             0x0                 0
r10            0x410               1040
r11            0xffffffffffffffff  -1
r12            0xc000038800        824633952256
r13            0x6a                106
r14            0xc0000061a0        824633745824
r15            0x4                 4
rip            0x485a03            0x485a03 <_cgo_effbaea66e62_Cfunc_get42+19>
eflags         0x216               [ PF AF IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
(gdb) step

Thread 1 "cgo_dl_repro" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) info registers
rax            0x0                 0
rbx            0xc0000386f8        824633951992
rcx            0xc0000386f8        824633951992
rdx            0xc000038688        824633951880
rsi            0x533100            5452032
rdi            0xc0000386f8        824633951992
rbp            0xc000038688        0xc000038688
rsp            0x7fffffffe298      0x7fffffffe298
r8             0x5334e0            5453024
r9             0x0                 0
r10            0x410               1040
r11            0xffffffffffffffff  -1
r12            0xc000038800        824633952256
r13            0x6a                106
r14            0xc0000061a0        824633745824
r15            0x4                 4
rip            0x0                 0x0
eflags         0x10246             [ PF ZF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
(gdb) 

Go 1.20

In go1.20.8, when that same line in the generated CGO is reached, it moves on to dl_signal_exception. GDB output starting from the same spot as the above example:

_cgo_topofstack () at /home/braydonk/go_versions/go1.20.8/go/src/runtime/asm_amd64.s:1593
1593		RET
(gdb) step
_cgo_effbaea66e62_Cfunc_get42 (v=0xc00004bf38) at /tmp/go-build/cgo-gcc-prolog:52
52	/tmp/go-build/cgo-gcc-prolog: No such file or directory.
(gdb) step
__GI___libc_malloc (bytes=69) at ./malloc/malloc.c:3287
3287	./malloc/malloc.c: No such file or directory.
(gdb) step
3294	in ./malloc/malloc.c
(gdb) step
3299	in ./malloc/malloc.c
(gdb) step
checked_request2size (sz=<synthetic pointer>, req=69) at ./malloc/malloc.c:1343
1343	in ./malloc/malloc.c
(gdb) finish
Run till exit from #0  checked_request2size (sz=<synthetic pointer>, req=69) at ./malloc/malloc.c:1343
__GI___libc_malloc (bytes=69) at ./malloc/malloc.c:3299
3299	in ./malloc/malloc.c
(gdb) finish
Run till exit from #0  __GI___libc_malloc (bytes=69) at ./malloc/malloc.c:3299
0x00007ffff7fc7cca in malloc (size=69) at ../include/rtld-malloc.h:56
56	../include/rtld-malloc.h: No such file or directory.
Value returned is $2 = (void *) 0x566820
(gdb) step
__GI__dl_signal_exception (errcode=0, exception=0x7fffffffde50, occasion=0x7ffff7ff0ecd "symbol lookup error") at ./elf/dl-error-skeleton.c:91
91	./elf/dl-error-skeleton.c: No such file or directory.
(gdb) step
92	in ./elf/dl-error-skeleton.c
(gdb) step
93	in ./elf/dl-error-skeleton.c
(gdb) step
102	in ./elf/dl-error-skeleton.c
(gdb) finish
warning: Function __GI__dl_signal_exception does not return normally.
Try to finish anyway? (y or n) y
Run till exit from #0  __GI__dl_signal_exception (errcode=0, exception=0x7fffffffde50, 
    occasion=0x7ffff7ff0ecd "symbol lookup error") at ./elf/dl-error-skeleton.c:102
/home/braydonk/cgo_dl_repro_120/cgo_dl_repro: symbol lookup error: /home/braydonk/cgo_dl_repro_120/cgo_dl_repro: undefined symbol: get42
[Thread 0x7fffcf9e8640 (LWP 25416) exited]
[Thread 0x7fffd01e9640 (LWP 25415) exited]
[Thread 0x7fffd09ea640 (LWP 25414) exited]
[Thread 0x7fffcf1a7640 (LWP 25417) exited]
[Inferior 1 (process 25413) exited with code 0177]
(gdb) 

@braydonk
Author

Added a new experiment in the reproduction repo where I wrote a small C program that attempts to get symbol resolution the same way that worked in go1.20.8; unresolved symbols ignore in object files, call dlopen, and expect a function call to work.

When compiled with gcc 9.4.0 (Focal) and gcc 11.4.0 (Jammy), the program segfaulted at the function call.

When compiled with gcc version 13.2.0 (Rolling Debian Testing), the program produced the error ./main: error while loading shared libraries: unexpected PLT reloc type 0x00.

I'm sure CGO's version of "call this function from the header" is different than C's version of "call this function from the header", although when I look at the cgo generation it does look like it just calls the function kind of the same way.

Admittedly it does seem off to me that a dlopen earlier in the program would cause a previously unresolved symbol to just work; usually with dlopen you call into stuff from the dynamic library through a dlsym lookup. I figured there must be some magic dlopen does at runtime that I wasn't aware of.

Either way, it is very strange that go1.20.8 does not crash in this scenario on any system I've tested, and that go1.21.1 only worked on my Focal test system when the cgo generation is the same, and that when I do the same thing manually in C it segfaults on all systems.

@thanm thanm added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 28, 2023
@thanm
Contributor

thanm commented Sep 28, 2023

@golang/compiler

@braydonk
Author

braydonk commented Sep 28, 2023

On my Rolling Debian Testing machine, I did a go tool objdump on a binary built with go1.20.8 and a binary built with go1.21.1. This is without the dlopen, just trying to call the unknown symbol.

Go 1.20.8

TEXT _cgo_49665a31f432_Cfunc_get42(SB) 
  :0			0x483740		4154			PUSHQ R12			
  :0			0x483742		55			PUSHQ BP			
  :0			0x483743		53			PUSHQ BX			
  :0			0x483744		4889fb			MOVQ DI, BX			
  :0			0x483747		e894c1fdff		CALL _cgo_topofstack(SB)	
  :0			0x48374c		4989c4			MOVQ AX, R12			
  :0			0x48374f		31c0			XORL AX, AX			
  :0			0x483751		e85ae9f7ff		CALL 0x4020b0			
  :0			0x483756		89c5			MOVL AX, BP			
  :0			0x483758		e883c1fdff		CALL _cgo_topofstack(SB)	
  :0			0x48375d		4c29e0			SUBQ R12, AX			
  :0			0x483760		892c03			MOVL BP, 0(BX)(AX*1)		
  :0			0x483763		5b			POPQ BX				
  :0			0x483764		5d			POPQ BP				
  :0			0x483765		415c			POPQ R12			
  :0			0x483767		c3			RET				

Go 1.21.1

TEXT _cgo_49665a31f432_Cfunc_get42(SB) 
  :0			0x47ce70		4154			PUSHQ R12			
  :0			0x47ce72		55			PUSHQ BP			
  :0			0x47ce73		53			PUSHQ BX			
  :0			0x47ce74		4889fb			MOVQ DI, BX			
  :0			0x47ce77		e88416feff		CALL _cgo_topofstack(SB)	
  :0			0x47ce7c		4989c4			MOVQ AX, R12			
  :0			0x47ce7f		31c0			XORL AX, AX			
  :0			0x47ce81		e87a31b8ff		CALL 0x0			
  :0			0x47ce86		89c5			MOVL AX, BP			
  :0			0x47ce88		e87316feff		CALL _cgo_topofstack(SB)	
  :0			0x47ce8d		4c29e0			SUBQ R12, AX			
  :0			0x47ce90		892c03			MOVL BP, 0(BX)(AX*1)		
  :0			0x47ce93		5b			POPQ BX				
  :0			0x47ce94		5d			POPQ BP				
  :0			0x47ce95		415c			POPQ R12			
  :0			0x47ce97		c3			RET				

In the Go 1.21.1 dump, the generated cgo binding contains a CALL 0x0 at address 0x47ce81, which is the call to get42 from the generated cgo function I showed in #63264 (comment). In the Go 1.20.8 build this is not CALL 0x0 but CALL 0x4020b0 (at address 0x483751). Not sure what that might be referring to.
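The two CALL targets can be checked by hand from the instruction bytes in the listings. On x86-64, opcode 0xE8 is a near call whose 4-byte little-endian operand is a signed displacement relative to the address of the next instruction. The following is a small sketch of that arithmetic; callTarget is a hypothetical helper name, and the addresses and bytes are copied from the two dumps above.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// callTarget decodes a 5-byte x86-64 near call (opcode 0xE8 followed by a
// little-endian signed 32-bit displacement) located at addr and returns the
// absolute address it transfers control to.
func callTarget(addr uint64, insn []byte) (uint64, error) {
	if len(insn) != 5 || insn[0] != 0xE8 {
		return 0, fmt.Errorf("not a 5-byte CALL rel32: % x", insn)
	}
	disp := int32(binary.LittleEndian.Uint32(insn[1:]))
	// The displacement is relative to the instruction after the call.
	next := addr + uint64(len(insn))
	// Sign-extend the displacement; unsigned addition wraps as intended.
	return next + uint64(int64(disp)), nil
}

func main() {
	cases := []struct {
		name string
		addr uint64
		insn []byte
	}{
		{"go1.20.8", 0x483751, []byte{0xe8, 0x5a, 0xe9, 0xf7, 0xff}},
		{"go1.21.1", 0x47ce81, []byte{0xe8, 0x7a, 0x31, 0xb8, 0xff}},
	}
	for _, c := range cases {
		target, err := callTarget(c.addr, c.insn)
		if err != nil {
			fmt.Println(c.name, "error:", err)
			continue
		}
		fmt.Printf("%s: CALL at %#x -> %#x\n", c.name, c.addr, target)
	}
}
```

The Go 1.20.8 bytes decode to 0x4020b0 and the Go 1.21.1 bytes decode to 0x0, matching what objdump printed. So the displacement in the 1.21.1 binary was computed against address zero at link time, which matches the PC=0x0 in the crash.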

@braydonk
Author

Could 0x402000 be the PLT from the program header?

@braydonk
Author

The 0x402000 is a PT_LOAD in the built exe header, using a build of go from master that I just downloaded:

braydonk@braydonk:~/Git/cgo_dl_repro$ readelf --segments with_dev_go 

Elf file type is EXEC (Executable file)
Entry point 0x402330
There are 14 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000400040 0x0000000000400040
                 0x0000000000000310 0x0000000000000310  R      0x8
  INTERP         0x0000000000000350 0x0000000000400350 0x0000000000400350
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x0000000000001158 0x0000000000001158  R      0x1000
  LOAD           0x0000000000002000 0x0000000000402000 0x0000000000402000
                 0x000000000007b74d 0x000000000007b74d  R E    0x1000
  LOAD           0x000000000007e000 0x000000000047e000 0x000000000047e000
                 0x000000000009bfc0 0x000000000009bfc0  R      0x1000
  LOAD           0x000000000011adf0 0x000000000051adf0 0x000000000051adf0
                 0x0000000000009ab0 0x000000000003b940  RW     0x1000
  DYNAMIC        0x000000000011ae00 0x000000000051ae00 0x000000000051ae00
                 0x00000000000001d0 0x00000000000001d0  RW     0x8
  NOTE           0x0000000000000370 0x0000000000400370 0x0000000000400370
                 0x0000000000000020 0x0000000000000020  R      0x8
  NOTE           0x0000000000000390 0x0000000000400390 0x0000000000400390
                 0x00000000000000a8 0x00000000000000a8  R      0x4
  TLS            0x000000000011adf0 0x000000000051adf0 0x000000000051adf0
                 0x0000000000000000 0x0000000000000008  R      0x8
  GNU_PROPERTY   0x0000000000000370 0x0000000000400370 0x0000000000400370
                 0x0000000000000020 0x0000000000000020  R      0x8
  GNU_EH_FRAME   0x0000000000119880 0x0000000000519880 0x0000000000519880
                 0x0000000000000154 0x0000000000000154  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x000000000011adf0 0x000000000051adf0 0x000000000051adf0
                 0x0000000000000210 0x0000000000000210  R      0x1

 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.gnu.property .note.gnu.build-id .note.ABI-tag .note.go.buildid .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt 
   03     .init .plt .text .fini 
   04     .rodata .typelink .itablink .gopclntab .eh_frame_hdr .eh_frame 
   05     .init_array .fini_array .dynamic .got .got.plt .data .go.buildinfo .noptrdata .bss .noptrbss 
   06     .dynamic 
   07     .note.gnu.property 
   08     .note.gnu.build-id .note.ABI-tag .note.go.buildid 
   09     .tbss 
   10     .note.gnu.property 
   11     .eh_frame_hdr 
   12     
   13     .init_array .fini_array .dynamic .got 

I do think it's the PLT. So I guess perhaps this unresolved symbol doesn't have an entry in the PLT like I expected it would, and perhaps that explains why CALL 0x0 is being generated?

@braydonk
Author

I have now confirmed that get42 is not added to the PLT when building with go1.21.1. As a result, the call to this symbol in the generated cgo bindings ends up as CALL 0x0. In the go1.20.8 build, you can see that this address is in the PLT:

(gdb) x/3i 0x4020b0
   0x4020b0 <get42@plt>:	jmp    *0x117f8a(%rip)        # 0x51a040 <get42@got.plt>
   0x4020b6 <get42@plt+6>:	push   $0x8
   0x4020bb <get42@plt+11>:	jmp    0x402020

get42@plt is not present in the go1.21.1 build.

@braydonk
Author

This code has me suspicious, however I tried it with the if target.IsExternal() commented out and that didn't seem to fix it.

case objabi.R_CALL:
	if targType != sym.SDYNIMPORT {
		// nothing to do, the relocation will be laid out in reloc
		return true
	}
	if target.IsExternal() {
		// External linker will do this relocation.
		return true
	}
	// Internal linking, for both ELF and Mach-O.
	// Build a PLT entry and change the relocation target to that entry.
	addpltsym(target, ldr, syms, targ)
	su := ldr.MakeSymbolUpdater(s)
	su.SetRelocSym(rIdx, syms.PLT)
	su.SetRelocAdd(rIdx, int64(ldr.SymPlt(targ)))
	return true

@braydonk
Author

I'm new to actually working with the Go codebase. I tried to add some log.Printfs to the asm.go file, but I must be missing a trick to actually see wherever those logs are coming from (or it's just not hitting the adddynrel function at all)

@braydonk
Author

It seems that the code from go tool link is never hit in a build of this application. I guess I don't really understand how it fits together. 🤔
When I tried adding some prints to cmd/cgo, specifically to look at the opened ELF objects to get the symbols, it seems the get42 symbol isn't in those yet at that point (in both go1.20.8 and go1.21.1). So it depends on when cgo actually compiles the cgo-gcc-prolog code, because the assembly from that is what produces the 0x0 instead of a PLT offset for the get42 symbol.

@thanm
Contributor

thanm commented Sep 29, 2023

Stupid question: if you are loading up a library using dlopen() already, why not just use "dlsym" to find the address of the function you are interested in and call it that way?

FYI one thing that I think can help when working on these sorts of problems is to use the Go linker's "-tmpdir" option. Example:

$ rm -rf /tmp/xxx
$ mkdir /tmp/xxx
$ go build -ldflags=-tmpdir=/tmp/xxx mycgoprogram.go
$ ls /tmp/xxx
000000.o 000005.o 000010.o 000015.o go.dwarf
000001.o 000006.o 000011.o 000016.o go.o
000002.o 000007.o 000012.o 000017.o trivial.c
000003.o 000008.o 000013.o 000018.o
000004.o 000009.o 000014.o a.out
$

The object files in /tmp/xxx are going to be the ones passed to the external linker in the final step, so it is a good spot where you can inspect them (both Go and C objects) to see what's going on.

@braydonk
Author

Stupid question: if you are loading up a library using dlopen() already, why not just use "dlsym" to find the address of the function you are interested in and call it that way?

No, it is a good question. This is generally the best way to do this and what I would do if I wrote it myself.
The reason I am interested in the pattern I'm messing with here is what I mentioned at the end of the original comment on this issue: we discovered the bug when we tried to use https://github.com/NVIDIA/go-nvml with go1.21.1. It follows the exact same pattern as my reproduction. It has an nvml.h declaring all the functions from the shared object, includes that in the build with the --unresolved-symbols=ignore-in-object-files ld flag, calls dlopen when initializing, and then calls the direct CGO bindings instead of looking up each symbol (it does look up symbols, but only to verify their presence, not to call into them).

FYI one thing that I think can help when working on these sorts of problems is to use the Go linker's "-tmpdir" option.

Great idea, thank you! I didn't notice this flag when looking through options. I'll give that a try.

@braydonk
Author

braydonk commented Oct 3, 2023

A git bisect produced this commit as the origin of the behaviour change: 1f29f39

The cause of my issue seems to be here:

if ctxt.DynlinkingGo() || ctxt.BuildMode == BuildModeCShared || !linkerFlagSupported(ctxt.Arch, argv[0], altLinker, "-Wl,--export-dynamic-symbol=main") {
	argv = append(argv, "-rdynamic")
} else {
	var exports []string
	ctxt.loader.ForAllCgoExportDynamic(func(s loader.Sym) {
		exports = append(exports, "-Wl,--export-dynamic-symbol="+ctxt.loader.SymExtname(s))
	})
	sort.Strings(exports)
	argv = append(argv, exports...)
}

When I forced this into the old behaviour (always adding -rdynamic to argv), my reproduction worked as expected. Likewise, when I added -Wl,--export-dynamic to my reproduction's linker flags and built with go1.21.1, it worked.
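To make the branch in the linker snippet easier to reason about, here is a simplified, standalone model of it. This is not the linker's code: exportFlags is a hypothetical function, the boolean useRdynamic stands in for the whole first condition (dynamically linked Go, c-shared build mode, or an external linker without --export-dynamic-symbol support), and cgoExports stands in for the symbols visited by ForAllCgoExportDynamic.

```go
package main

import (
	"fmt"
	"sort"
)

// exportFlags models which export-related flags get appended to the
// external linker's argument list.
func exportFlags(useRdynamic bool, cgoExports []string) []string {
	if useRdynamic {
		// Export every global symbol to the dynamic symbol table.
		return []string{"-rdynamic"}
	}
	// Otherwise export only the explicitly listed symbols, in sorted order.
	flags := make([]string, 0, len(cgoExports))
	for _, name := range cgoExports {
		flags = append(flags, "-Wl,--export-dynamic-symbol="+name)
	}
	sort.Strings(flags)
	return flags
}

func main() {
	// First branch: -rdynamic is passed.
	fmt.Println(exportFlags(true, nil))
	// Second branch with one exported symbol ("myExportedFunc" is made up).
	fmt.Println(exportFlags(false, []string{"myExportedFunc"}))
	// Second branch with nothing to export: no export flag is passed.
	fmt.Println(exportFlags(false, nil))
}
```

In this model, a toolchain whose external linker supports --export-dynamic-symbol takes the second branch, so -rdynamic no longer reaches the linker unless it is added explicitly, for example through the program's own linker flags as described above.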

So I'm tempted to say this isn't really a bug. This is just a strange behaviour in this particular case when -rdynamic isn't added unilaterally like it was before.

I suppose I'll ping @ianlancetaylor in case he's interested since it was his change, but looking at the original issue from the change I think it makes sense to stay the way it is now (at least based on what I understand). So I'm going to suggest to the go-nvml maintainers that this flag be added to their LDFLAGS.

I will now close this issue. Thanks Than for the suggestions!

@thanm
Contributor

thanm commented Oct 3, 2023

Good detective work @braydonk . Yeah in retrospect the export dynamic change would seem to make sense given what you described.

@ianlancetaylor
Contributor

Thanks for digging into this.

4 participants