Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/pprof: missing profile info for CGo code when compiled with CC="zig cc" #59466

Closed
motiejus opened this issue Apr 6, 2023 · 12 comments
Closed
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@motiejus
Copy link
Contributor

motiejus commented Apr 6, 2023

TLDR: when profiling CGo programs compiled with zig cc, we are missing the profiles of pure C functions as observed with go tool pprof.

I work with the C++ toolchain at Uber and we recently switched to using zig cc to compile our CGo programs. I have not yet heard complains internally of
missing CGo profiles, but I suspect it's a matter of time. :)

Also, this seems to be the last suite of tests that fail when running CC="zig cc" ./all.bash on Linux amd64.

Go version: master (8d68b38)
Zig version: 0.10.1 and master (ziglang/zig@8853005)

Steps to reproduce

Compile testprogcgo with zig cc and clang-15:

$ cd src/runtime/testdata/testprogcgo
$ CC="/code/zig-linux-x86_64-0.10.1/zig cc" ../../../../bin/go build -o testprogcgo-zig-0.10.1
$ CC=clang-15 ../../../../bin/go build -o testprogcgo-clang-15

Run both resulting binaries:

$ ./testprogcgo-zig-0.10.1 CgoPprof
/tmp/prof1336230719
$ ./testprogcgo-clang-15 CgoPprof
/tmp/prof2489818717

Compare the profiles:

$ go tool pprof -traces ./testprogcgo-clang-15 /tmp/prof2489818717
File: testprogcgo-clang-15
Build ID: 57e147c5c3713dc48115b4856c5c9a568154a3cc
Type: cpu
Time: Apr 6, 2023 at 2:58pm (EEST)
Duration: 1.10s, Total samples = 1s (90.71%)
-----------+-------------------------------------------------------
	 990ms   cpuHog
			 cpuHog2
			 runtime.cgocall
			 main._Cfunc_cpuHog
			 main.CgoPprof
			 main.main
			 runtime.main
-----------+-------------------------------------------------------
	  10ms   runtime.futex
			 runtime.futexwakeup
			 runtime.notewakeup
			 runtime.exitsyscallfast_pidle
			 runtime.exitsyscallfast.func1
			 runtime.systemstack
			 runtime.exitsyscallfast
			 runtime.exitsyscall
			 runtime.cgocall
			 main._Cfunc_cpuHog
			 main.CgoPprof
			 main.main
			 runtime.main
-----------+-------------------------------------------------------
$ go tool pprof -traces ./testprogcgo-zig-0.10.1 /tmp/prof1336230719
File: testprogcgo-zig-0.10.1
Type: cpu
Time: Apr 6, 2023 at 2:56pm (EEST)
Duration: 1.10s, Total samples = 1s (90.70%)
-----------+-------------------------------------------------------
	 990ms   pprofCgoTraceback
			 pprofCgoTraceback
			 runtime.cgocallbackg
			 main._Cfunc_foo2
			 main.CgoPprofCallback
			 main._Cfunc_CheckM
			 runtime.forcegchelper
-----------+-------------------------------------------------------
	  10ms   runtime.malg
			 runtime.exitsyscall0
			 runtime.unspillArgs
			 runtime.exitsyscallfast_pidle
			 syscall.runtime_BeforeFork
			 runtime.cgocallbackg1
			 main._Cfunc_foo2
			 main.CgoPprofCallback
			 main._Cfunc_CheckM
			 runtime.forcegchelper
-----------+-------------------------------------------------------

Note that cpuHog and cpuHog2 are not present in the profile of the program that was compiled with zig cc. This is what the unit test TestCgoPprof asserts.

Background / steps taken

I have observed and compared the resulting compiler and linker flags from both builds did not observe anything standing out.

I would appreciate some help debugging this. I am somewhat comfortable looking at disassembly and the low-level compiler flags, but, in this case, I don't know where to look.

@motiejus
Copy link
Contributor Author

motiejus commented Apr 6, 2023

Attaching both binaries for quick reference (this is x86_64 ELF, mind you)

testprogcgo-zig-0.10.1.pptx

testprogcgo-clang-15.pptx

@motiejus motiejus changed the title missing CGo profiles with CC="zig cc" missing profile info for CGo code when compiled with with CC="zig cc" Apr 6, 2023
@motiejus motiejus changed the title missing profile info for CGo code when compiled with with CC="zig cc" missing profile info for CGo code when compiled with CC="zig cc" Apr 6, 2023
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Apr 6, 2023
@mknyszek mknyszek changed the title missing profile info for CGo code when compiled with CC="zig cc" runtime/pprof: missing profile info for CGo code when compiled with CC="zig cc" Apr 6, 2023
@mknyszek mknyszek added this to the Backlog milestone Apr 6, 2023
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Apr 6, 2023
@mknyszek
Copy link
Contributor

mknyszek commented Apr 6, 2023

CC @golang/runtime

@cherrymui
Copy link
Member

cherrymui commented Apr 6, 2023

It looks like the profile you got from zig is running testprogcgo CgoPprofCallback whereas the clang one is from testprogcgo CgoPprof? (The command line you pasted looks like both are running CgoPprof, but the second profile has main.CgoPprofCallback)

@prattmic
Copy link
Member

prattmic commented Apr 6, 2023

The second profile looks like it has symbolization wrong. e.g., this doesn't make any sense:

runtime.malg
runtime.exitsyscall0
runtime.unspillArgs
runtime.exitsyscallfast_pidle
syscall.runtime_BeforeFork

Are you sure that the binary passed to pprof matches the binary used for the profile? In fact, you don't need to pass the binary to pprof at all, as the profile contains the symbol information. Do you get the same results if you drop the binary from the pprof cli?

@motiejus
Copy link
Contributor Author

motiejus commented Apr 6, 2023

I may have messed something up. A fresh honest copy-paste now, single command, no previous state:

zig

$ rm -fr ~/.cache/{zig,go-build}; CC="/code/zig-linux-x86_64-0.10.1/zig cc" ../../../../bin/go build -o testprogcgo-zig-0.10.1 && go tool pprof -traces ./testprogcgo-zig-0.10.1 $(./testprogcgo-zig-0.10.1 CgoPprof)
File: testprogcgo-zig-0.10.1
Type: cpu
Time: Apr 6, 2023 at 8:07pm (EEST)
Duration: 1.10s, Total samples = 1s (90.72%)
-----------+-------------------------------------------------------
        1s   pprofCgoTraceback
             pprofCgoTraceback
             runtime.cgocallbackg
             main._Cfunc_foo2
             main.CgoPprofCallback
             main._Cfunc_CheckM
             runtime.forcegchelper
-----------+-------------------------------------------------------

clang-15

$ CC=clang-15 ../../../../bin/go build -o testprogcgo-clang-15 && go tool pprof -traces ./testprogcgo-clang-15 $(./testprogcgo-clang-15 CgoPprof)
# runtime/testdata/testprogcgo
In file included from _cgo_export.c:4:
bindm.go:25:15: warning: passing arguments to 'GoCheckBindM' without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
bindm.go:17:13: warning: a function declaration without a prototype is deprecated in all versions of C and is treated as a zero-parameter prototype in C2x, conflicting with a subsequent declaration [-Wdeprecated-non-prototype]
cgo-gcc-export-header-prolog:50:13: note: conflicting prototype is here
# runtime/testdata/testprogcgo
./bindm.go:25:15: warning: passing arguments to 'GoCheckBindM' without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
# runtime/testdata/testprogcgo
In file included from aprof_c.c:5:
bindm.go:25:15: warning: passing arguments to 'GoCheckBindM' without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
bindm.go:17:13: warning: a function declaration without a prototype is deprecated in all versions of C and is treated as a zero-parameter prototype in C2x, conflicting with a subsequent declaration [-Wdeprecated-non-prototype]
cgo-gcc-export-header-prolog:50:13: note: conflicting prototype is here
File: testprogcgo-clang-15
Build ID: 57e147c5c3713dc48115b4856c5c9a568154a3cc
Type: cpu
Time: Apr 6, 2023 at 8:07pm (EEST)
Duration: 1.10s, Total samples = 990ms (89.81%)
-----------+-------------------------------------------------------
     990ms   cpuHog
             cpuHog2
             runtime.cgocall
             main._Cfunc_cpuHog
             main.CgoPprof
             main.main
             runtime.main
-----------+-------------------------------------------------------

testprogcgo-zig-0.10.1.pptx

testprogcgo-clang-15.pptx

@prattmic

Do you get the same results if you drop the binary from the pprof cli?

I get an interactive pprof shell:

$ CC="/code/zig-linux-x86_64-0.10.1/zig cc" ../../../../bin/go build -o testprogcgo-zig-0.10.1 && go tool pprof $(./testprogcgo-zig-0.10.1 CgoPprof)
# runtime/testdata/testprogcgo
In file included from _cgo_export.c:4:
bindm.go:25:15: warning: passing arguments to 'GoCheckBindM' without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
                GoCheckBindM((uintptr_t)thread);
                            ^
bindm.go:17:13: warning: a function declaration without a prototype is deprecated in all versions of C and is treated as a zero-parameter prototype in C2x, conflicting with a subsequent declaration [-Wdeprecated-non-prototype]
extern void GoCheckBindM();
            ^
cgo-gcc-export-header-prolog:50:13: note: conflicting prototype is here
extern void GoCheckBindM(GoUintptr thread);
            ^
2 warnings generated.
# runtime/testdata/testprogcgo
./bindm.go:25:15: warning: passing arguments to 'GoCheckBindM' without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
                GoCheckBindM((uintptr_t)thread);
                            ^
1 warning generated.
# runtime/testdata/testprogcgo
In file included from aprof_c.c:5:
bindm.go:25:15: warning: passing arguments to 'GoCheckBindM' without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]
                GoCheckBindM((uintptr_t)thread);
                            ^
bindm.go:17:13: warning: a function declaration without a prototype is deprecated in all versions of C and is treated as a zero-parameter prototype in C2x, conflicting with a subsequent declaration [-Wdeprecated-non-prototype]
extern void GoCheckBindM();
            ^
cgo-gcc-export-header-prolog:50:13: note: conflicting prototype is here
extern void GoCheckBindM(GoUintptr thread);
            ^
2 warnings generated.
File: testprogcgo-zig-0.10.1
Type: cpu
Time: Apr 6, 2023 at 8:10pm (EEST)
Duration: 1.10s, Total samples = 990ms (89.81%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) 

@cherrymui
Copy link
Member

In the new one the call stack also doesn't make much sense. Maybe the symbolization is completely broken. And you probably run CgoPprof correctly but as the symbolization is wrong it shows main.CgoPprofCallback instead...

Do you know how zig's linker is implemented? Does it wrap a C linker, or it has its own implementation? Is there anything special?

(I sent CL https://golang.org/cl/482975 to fix the C compiler warning.)

@motiejus
Copy link
Contributor Author

motiejus commented Apr 6, 2023

In the new one the call stack also doesn't make much sense. Maybe the symbolization is completely broken. And you probably run CgoPprof correctly but as the symbolization is wrong it shows main.CgoPprofCallback instead...

Do you know how zig's linker is implemented? Does it wrap a C linker, or it has its own implementation? Is there anything special?

On linux it wraps clang for compiling and ld.lld for linking, adds a few flags.

One can see the exact lld invocations with ZIG_VERBOSE_LINK=1 zig cc <...>. I can paste the output here, but there is a lot of linking going on, so tell me which files to narrow it down to.

cc @andrewrk

(I sent CL https://golang.org/cl/482975 to fix the C compiler warning.)

Thanks!

@cherrymui
Copy link
Member

Hmmm, the symbolization definitely is wrong. I ran the zig binary you pasted above. I got a profile, which contains a sample like

     5: 0x44875a M=1 main.CgoPprofCallback /code/go/src/runtime/testdata/testprogcgo/pprof_callback.go:64 s=0

But that PC is actually corresponding to main.CgoPprof:

$ go tool addr2line ./testprogcgo-zig-0.10.1.pptx 
0x44875a
main.CgoPprof
/code/go/src/runtime/testdata/testprogcgo/pprof.go:81

which looks correct.

So it seems the symbolization collected at run time by the profiler is wrong.

I wonder if you use zig to link a cgo program that just panics, does it print the right panic stack trace?

The zig linking is definitely somewhat unusual. For a quick look, it puts the text section after the rodata section, instead of before like other linkers usually do (this probably doesn't matter, I just noted this is something unusual I saw). I'll see if there is anything differs that could matter.

@cherrymui
Copy link
Member

cherrymui commented Apr 6, 2023

Well, this is interesting: with the profile I have, I got the similar incorrect symbolization

$ go tool pprof -top /tmp/prof1466990752
File: testprogcgo-zig-0.10.1.pptx
Type: cpu
Time: Apr 6, 2023 at 10:24pm (UTC)
Duration: 1.10s, Total samples = 990ms (89.63%)
Showing nodes accounting for 990ms, 100% of 990ms total
      flat  flat%   sum%        cum   cum%
     990ms   100%   100%      990ms   100%  pprofCgoTraceback
         0     0%   100%      990ms   100%  main.CgoPprofCallback
         0     0%   100%      990ms   100%  main._Cfunc_CheckM
         0     0%   100%      990ms   100%  main._Cfunc_foo2
         0     0%   100%      990ms   100%  runtime.cgocallbackg
         0     0%   100%      990ms   100%  runtime.forcegchelper

Rename the binary, so the pprof command cannot find it from the memory mapping, then we get the correct symbolization! (It cannot symbolize C functions because it cannot find the binary's symbol table, but at least the Go functions are correct.)

$ mv ./testprogcgo-zig-0.10.1.pptx ./testprogcgo-zig-0.10.1.pptx1
$ go tool pprof -top /tmp/prof1466990752
Local symbolization failed for testprogcgo-zig-0.10.1.pptx: open /tmp/testprogcgo-zig-0.10.1.pptx: no such file or directory
Some binary filenames not available. Symbolization may be incomplete.
Try setting PPROF_BINARY_PATH to the search path for local binaries.
File: testprogcgo-zig-0.10.1.pptx
Type: cpu
Time: Apr 6, 2023 at 10:24pm (UTC)
Duration: 1.10s, Total samples = 990ms (89.63%)
Showing nodes accounting for 990ms, 100% of 990ms total
      flat  flat%   sum%        cum   cum%
     990ms   100%   100%      990ms   100%  [testprogcgo-zig-0.10.1.pptx]
         0     0%   100%      990ms   100%  main.CgoPprof
         0     0%   100%      990ms   100%  main._Cfunc_cpuHog
         0     0%   100%      990ms   100%  main.main
         0     0%   100%      990ms   100%  runtime.cgocall
         0     0%   100%      990ms   100%  runtime.main

So the mapping and the binary actually confused the pprof command. (The symbol information collected at run time is actually fine.)

Looking at the binary, this is the readelf -l output:

$ readelf -l testprogcgo-zig-0.10.1.pptx 

Elf file type is EXEC (Executable file)
Entry point 0x337220
There are 12 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000200040 0x0000000000200040
                 0x00000000000002a0 0x00000000000002a0  R      0x8
  INTERP         0x00000000000002e0 0x00000000002002e0 0x00000000002002e0
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000200000 0x0000000000200000
                 0x000000000013620c 0x000000000013620c  R      0x1000
  LOAD           0x0000000000136220 0x0000000000337220 0x0000000000337220
                 0x000000000011b440 0x000000000011b440  R E    0x1000
  LOAD           0x0000000000251660 0x0000000000453660 0x0000000000453660
                 0x00000000000003e8 0x00000000000003e8  RW     0x1000
  LOAD           0x0000000000251a60 0x0000000000454a60 0x0000000000454a60
                 0x000000000000fc20 0x0000000000042970  RW     0x1000
  TLS            0x0000000000251660 0x0000000000452660 0x0000000000452660
                 0x0000000000000000 0x0000000000000008  R      0x8
  DYNAMIC        0x0000000000251680 0x0000000000453680 0x0000000000453680
                 0x0000000000000190 0x0000000000000190  RW     0x8
  GNU_RELRO      0x0000000000251660 0x0000000000453660 0x0000000000453660
                 0x00000000000003e8 0x00000000000009a0  R      0x1
  GNU_EH_FRAME   0x00000000001344c0 0x00000000003344c0 0x00000000003344c0
                 0x000000000000055c 0x000000000000055c  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000001000000  RW     0x0
  NOTE           0x00000000000002fc 0x00000000002002fc 0x00000000002002fc
                 0x0000000000000084 0x0000000000000084  R      0x4

Note that the 4th segment,

  LOAD           0x0000000000136220 0x0000000000337220 0x0000000000337220
                 0x000000000011b440 0x000000000011b440  R E    0x1000

It has alignment 0x1000, so the address and offset should be aligned to 0x1000. But it doesn't. This leads to this interesting memory mapping at run time: (from GDB)

(gdb) info proc mappings
process 2444322
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
            0x200000           0x337000   0x137000        0x0 /tmp/testprogcgo-zig-0.10.1.pptx
            0x337000           0x453000   0x11c000   0x136000 /tmp/testprogcgo-zig-0.10.1.pptx
            0x453000           0x454000     0x1000   0x251000 /tmp/testprogcgo-zig-0.10.1.pptx
            0x454000           0x465000    0x11000   0x251000 /tmp/testprogcgo-zig-0.10.1.pptx

The second one is the text segment. The start address of the segment in the binary is 0x337220, but the mapping starts at 0x337000 due to misalignment. I think this confused the pprof command, causing a shift of 0x220 in address. In fact, if we shift the PCs by 0x220, we get the wrong symbolization:

$ go tool addr2line ./testprogcgo-zig-0.10.1.pptx 
0x44875a
main.CgoPprof
/code/go/src/runtime/testdata/testprogcgo/pprof.go:81
$ go tool addr2line ./testprogcgo-zig-0.10.1.pptx 
0x44897a
main.CgoPprofCallback
/code/go/src/runtime/testdata/testprogcgo/pprof_callback.go:64

The second one is the first one +0x220. The profile incorrectly shows main.CgoPprofCallback /code/go/src/runtime/testdata/testprogcgo/pprof_callback.go:64, which matches the shifted address.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/483035 mentions this issue: cmd/internal/objfile: align the load address of ELF binary

@motiejus
Copy link
Contributor Author

motiejus commented Apr 7, 2023

Wow, thanks for the investigation! I was able to replicate the zig's issue with clang+lld too:

$ CGO_LDFLAGS=-fuse-ld=lld CC=clang-15 ../../../../bin/go.master build -o testprogcgo-clang-15 && go tool pprof -traces ./testprogcgo-clang-15 $(./testprogcgo-clang-15 CgoPprof)
File: testprogcgo-clang-15
Build ID: 37f82e8a4287db11
Type: cpu
Time: Apr 7, 2023 at 12:27pm (EEST)
Duration: 1.10s, Total samples = 1s (90.69%)
-----------+-------------------------------------------------------
        1s   _cgo_f1f0deeaaac9_Cfunc_sigsegv
             _cgo_f1f0deeaaac9_Cfunc_f1
             runtime.cgocallbackg1.func2
             main._Cfunc_pthread_setspecific
             main.init.19
             main._Cfunc_clearRestart
             runtime.acquireSudog
-----------+-------------------------------------------------------

I can confirm your fix fixed the original issue, with both zig cc and clang-15+lld. Yay!

Now looking at the different linker:

The zig linking is definitely somewhat unusual. For a quick look, it puts the text section after the rodata section, instead of before like other linkers usually do (this probably doesn't matter, I just noted this is something unusual I saw). I'll see if there is anything differs that could matter.

This is what lld does. Here is a quick example:

cgo.go

lld

$ CC=clang-15 CGO_LDFLAGS=-fuse-ld=lld go build -o cgo.clang-15-lld && readelf -l cgo.clang-15-lld 2>/dev/null | grep -E 'text|rodata' 
   02     .interp .note.ABI-tag .note.go.buildid .note.gnu.build-id .dynsym .gnu.version .gnu.version_r .gnu.hash .dynstr .rela.dyn .rela.plt .rodata .typelink .itablink .gosymtab .gopclntab .eh_frame_hdr .eh_frame 
   03     .text .init .fini .plt 

gnu ld

$ CC=clang-15 go build -o cgo.clang-15 && readelf -l cgo.clang-15 2>/dev/null | grep -E 'text|rodata' 
   03     .init .plt .text .fini 
   04     .rodata .typelink .itablink .gopclntab .eh_frame_hdr .eh_frame 

As you can observe above, text and rodata sections are flipped.

I will create two follow-up issues soon:

  1. Propose a Linux amd64 CI worker that uses clang+ld.lld.
  2. Propose a Linux amd64 CI worker that uses zig cc.

@cherrymui
Copy link
Member

Thanks!

Filed #59482 for using LLD on builder. (@prattmic and I were thinking about this the other day)

@dmitshur dmitshur modified the milestones: Backlog, Go1.21 Apr 17, 2023
@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Apr 17, 2023
@golang golang locked and limited conversation to collaborators May 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

6 participants