Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/link: go tool dist test testshared failed if linked with lld or mold #46560

Open
rui314 opened this issue Jun 4, 2021 · 12 comments
Open

cmd/link: go tool dist test testshared failed if linked with lld or mold #46560

rui314 opened this issue Jun 4, 2021 · 12 comments

Comments

@rui314
Copy link
Member

@rui314 rui314 commented Jun 4, 2021

What version of Go are you using (go version)?

$ go version
go version devel go1.17-962d5c997a Fri Jun 4 01:31:23 2021 +0000 linux/amd64

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/ruiu/.cache/go-build"
GOENV="/home/ruiu/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/ruiu/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/ruiu/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/ruiu/golang"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/ruiu/golang/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="devel go1.17-962d5c997a Fri Jun 4 01:31:23 2021 +0000"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/ruiu/golang/src/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build237113391=/tmp/go-build -gno-record-gcc-switches

What did you do?

I tried to build Go with my own linker, mold (https://github.com/rui314/mold), and noticed that a CGO-related test fails only when linked with mold. The same test fails with lld. So the test seems to pass only when you are using GNU ld or GNU gold.

Specifically, this is the exact command that I can reproduce the issue on my Ubuntu 20.04 machine.

$ git clone git@github.com:golang/go.git golang
$ cd golang/src
$ ./make.bash
$ sudo ln -sf /usr/bin/ld.lld-11 /usr/bin/ld
$ ../bin/go tool dist test testshared

If I do not substitute the default linker with lld using sudo ln, the last test command succeeds. Before running the above command, please install LLVM lld 11 by apt-get install lld-11.

To restore the original ld, run (cd /usr/bin; sudo ln -sf x86_64-linux-gnu-ld ld).

What did you expect to see?

The test succeeds

What did you see instead?

The test fails with the following error message.

--- FAIL: TestGCData (0.61s)
    shared_test.go:50: executing ./main (running gcdata/main) failed exit status 2:
        x[4] == -2401053088876216593, want 12345
        panic: FAIL

        goroutine 1 [running]:
        panic({0x7ff705a2d180, 0x556bea3cb938})
                /home/ruiu/golang/src/runtime/panic.go:1147 +0x3d3 fp=0xc0001b1f00 sp=0xc0001b1e40 pc=0x7ff7059b11f3
        main.main()
                /tmp/shared_test2784724792/gopath/src/testshared/gcdata/main/main.go:34 +0x14c fp=0xc0001b1f80 sp=0xc0001b1f00 pc=0x556bea3b6bec
        runtime.main()
                /home/ruiu/golang/src/runtime/proc.go:255 +0x282 fp=0xc0001b1fe0 sp=0xc0001b1f80 pc=0x7ff7059b4d42
        runtime.goexit()
                /home/ruiu/golang/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0001b1fe8 sp=0xc0001b1fe0 pc=0x7ff7059ed801

So, the test fails because Go garbage collector wrongly collects live objects.

I'm debugging the issue for two days so far without any luck. It looks like if I build all but gopath/pkg/linux_amd64_dynlink/libtestshared-gcdata-p.so using lld and link the particular DSO using GNU ld, the test passes. But I can't find a cause why that test dislikes lld or mold-linked shared object file. Is there any chance that CGO unnecessarily depends on GNU ld-specific section or symbol layout or something?

@thanm thanm added this to the Go1.17 milestone Jun 4, 2021
@thanm
Copy link
Contributor

@thanm thanm commented Jun 4, 2021

I'll take a look.

@cherrymui cherrymui removed this from the Go1.17 milestone Jun 4, 2021
@cherrymui cherrymui added this to the Unplanned milestone Jun 4, 2021
@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jun 4, 2021

My first guess would be that mold/lld don't implement the special treatment of sections named .init_array that is found in GNU ld and gold.

@ianlancetaylor ianlancetaylor changed the title go tool dist test testshared failed if linked with lld or mold cmd/link: go tool dist test testshared failed if linked with lld or mold Jun 4, 2021
@thanm
Copy link
Contributor

@thanm thanm commented Jun 4, 2021

@ianlancetaylor could you please expand a little on what you meant by "special treatment of scetions named .init_array"? I compared the .init_array sections and related symbols in the good/bad objdump output and I don't see anything that looks obviously wrong.

@rui314
Copy link
Member Author

@rui314 rui314 commented Jun 4, 2021

Both lld and mold creates DT_INIT, DT_INIT_ARRAY and DT_INIT_ARRAYSZ fields in the dynamic section, and I believe that's all we have to do for .init_array. lld is now fairly battle-tested, as it's been the standard linker for FreeBSD/AMD64 and other platforms, so I don't think there's a big missing feature in lld. ​What am I missing

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jun 4, 2021

@thanm The special treatment is what @rui314 mentions: sections named .init_array get added to DT_INIT_ARRAY entries in the PT_DYNAMIC segment.

This could potentially be related to section layout. The Go runtime relies on a moduledata struct (https://golang.org/src/runtime/symtab.go#L354), with one instance of that struct for each Go shared library included in the program image. The contents of that struct are created by the Go linker. The data segment is separated into data that can contain pointers and data that can't contain pointers, and those offsets are recorded in moduledata. If some pointer-containing data is mixed into the non-pointer-containing area, that will throw off the garbage collector. The linker puts these into .data and .noptrdata sections. There are also .bss and .noptrbss sections. The linker defines symbols at the start and end of those sections and puts those symbol values are recorded in moduledata. I don't see how this could go wrong, but if it did go wrong bad things would happen.

I don't have any other ideas yet.

@thanm
Copy link
Contributor

@thanm thanm commented Jun 4, 2021

Thanks -- intermixing sounds like a good theory; I'll pursue that.

@thanm
Copy link
Contributor

@thanm thanm commented Jun 4, 2021

I can definitely reproduce this problem on tip. I've concocted a smaller stand-alone test case (attached) since it's kind of difficult to work with the dist test version.

In the Go code for the testcase, there is this code:

--- package "p" ---

type T [10]*int

--- package "main" ---

package main

// ... imports

var xyzabc p.T

func main() {
  for i := range xyzabc {
    xyzabc[i] = new(int)
    *xyzabc[i] = 12345
  }
  runtime.GC()
  runtime.GC()
  runtime.GC()
  for i := range xyzabc {
    if *xyzabc[i] != 12345 {
      fmt.Printf("xyzabc[%d] == %d, want 12345\n", i, *xyzabc[i])
      panic("FAIL")
    }
  }
}

Spent lots of time inspecting the shared library for "p", but as far as I can tell it look ok (maybe missed something).

According to the objdump output, the offending symbol ("main.xyzabc") is indeed placed in the .bss section and not the .noptrbss section. I hacked the Go runtime to print out the boundaries of each mod on startup:

Run of "good" executable (for this binary, "objdump -t" reports the address of main.xyzabc as 0x518620):

mod:  
mod: bss= 0x7ff299d076a0  ebss= 0x7ff299d382a8
mod: noptrbss= 0x7ff299d38320  enoptrbss= 0x7ff299d3f600
mod:  libissue46560-p.so
mod: bss= 0x7ff299d44239  ebss= 0x7ff299d44239
mod: noptrbss= 0x7ff299d44240  enoptrbss= 0x7ff299d44240
mod:  the executable
mod: bss= 0x518180  ebss= 0x5186e8
mod: noptrbss= 0x518700  enoptrbss= 0x518c40

Run of "bad" executable (for this binary, "objdump -t" reports the address of main.xyzabc as 0x31be20):

mod:  
mod: bss= 0x7f4a53309440  ebss= 0x7f4a5333a048
mod: noptrbss= 0x7f4a5333a0c0  enoptrbss= 0x7f4a533413a0
mod:  libissue46560-p.so
mod: bss= 0x7f4a53346759  ebss= 0x7f4a53346759
mod: noptrbss= 0x7f4a53346759  enoptrbss= 0x7f4a53346759
mod:  the executable
mod: bss= 0x31b980  ebss= 0x31bee8
mod: noptrbss= 0x31bf00  enoptrbss= 0x31c440
xyzabc[4] == -2401053088876216593, want 12345
panic: FAIL

goroutine 1 [running]:
main.main()
	/ssd2/go1/src/issue46560/main/main.go:34 +0x14c

I did some tracing in the Go linker when linking the main program; it does look as though the runtime.gcbss symbol is being generated, but when I turn on tracing I'm seeing different behavior between good + bad.

Here's the good gcprog generation for the symbol:

gcprog sym: main.xyzabc at 1184 (ptr=148+10)
gcprog: advance to 148 by literals
gcprog: ptr at 148
gcprog: ptr at 149
gcprog: ptr at 150
gcprog: ptr at 151
gcprog: ptr at 152
gcprog: ptr at 153
gcprog: ptr at 154
gcprog: ptr at 155
gcprog: ptr at 156
gcprog: ptr at 157

Now here the bad:

gcprog sym: main.xyzabc at 1184 (ptr=148+10)
gcprog: advance to 148 by literals
gcprog: ptr at 148
gcprog: ptr at 149
gcprog: ptr at 150
gcprog: advance to 154 by literals
gcprog: ptr at 154
gcprog: advance to 156 by literals
gcprog: ptr at 156
gcprog: ptr at 157

So that may be the smoking gun? Not sure why this is happening though.

From the linker trace output when linking the shared library it looks as though the type symbol in question "type.issue46560/p.T" is being mangled to "type.rb7zizEO".

Looking at that symbol in the p.so shared object link, I see:

Good:

...
Symtab entry:
0000000000003c20 g     O .data.rel.ro	0000000000000058              type.rb7zizEO
...
Dump of .data.rel.ro:
 3c20 50000000 00000000 50000000 00000000  P.......P.......
 3c30 5b1f1250 0f080811 00000000 00000000  [..P............
 3c40 02200000 00000000 02000000 80000000  . ..............
 3c50 00000000 00000000 00000000 00000000  ................
 3c60 0a000000 00000000 12000000 00000000  ................

Bad:

Symtab entry:
0000000000003180 g     O .data.rel.ro	0000000000000058              type.rb7zizEO
Dump of .data.rel.ro:
...
 3180 50000000 00000000 50000000 00000000  P.......P.......
 3190 5b1f1250 0f080811 00000000 00000000  [..P............
 31a0 00000000 00000000 02000000 80000000  ................
 31b0 00000000 00000000 00000000 00000000  ................
 31c0 0a000000 00000000 12000000 00000000  ................
...

which seems to look the same (very puzzling).

I'll need to spend a bit more time with the linker to understand why it is picking up the wrong type data for this symbol.

issue46560.zip

@thanm
Copy link
Contributor

@thanm thanm commented Jun 4, 2021

Whoops, my eyeballs are not working properly. There is indeed a diff between the content of the two symbols at offset 32 (02200000 good vs 00000000 bad). So the question is: why is that happening?

@rui314
Copy link
Member Author

@rui314 rui314 commented Jun 5, 2021

@thanm

I think the missing values in .data.rel.ro are not a problem. .data.rel.ro has relocations of type R_X86_64_64, so that the section contains absolute addresses of the referenced symbols at runtime. The absolute addresses of the referenced symbols are not known at static-link-time, because we are linking a shared object file, and a shared object file is not guaranteed to be loaded at a fixed address in the virtual address space. Therefore, the linker emits dynamic relocations so that the section gets correct values at runtime. GNU ld write values to .data.rel.ro and create dynamic relocations for them, but the value written to .data.rel.ro will be overwritten by the dynamic linker, so the values don't matter. lld just leaves them as value zero. I think this is why some bytes in lld-generated .data.rel.ro are just zero.

@rui314
Copy link
Member Author

@rui314 rui314 commented Jun 12, 2021

I created two executables, one is linked against mold-built libissue46560-p.so and the other is linked against BFD-built libissue46560-p.so. Note that the executables themselves are linked by BFD. Then, I copied .rodata from the working one to the failing one using objcopy, which fixed the issue. So, the data in .rodata seems to be wrong.

More specifically, .rodata sections are almost identical except a few bytes, and the difference is in runtime.gcbss. I'm not familiar with Go's garbage collector, but I believe this is metadata used by GC.

I'll investigate how cgo generate runtime.gcbss contents.

@rui314
Copy link
Member Author

@rui314 rui314 commented Jun 12, 2021

It was extremely puzzling, but I think I found the cause of the issue. It looks like there's a bug in go's linker. Here is what was happening:

  • src/cmd/link/internal/loader/loader.go has code that reads section contents from a DSO. decodetypeGcmask in decodesym.go calls ctxt.loader.Data to reads GC bitmaps from a DSO's .data.rel.ro section for type symbols.
  • .data.rel.ro may have dynamic relocations. Therefore, even if the section contents are just zeros, it may have non-zero values at runtime.
  • For REL-type relocations, relocation addends are stored to sections. The dynamic loader is expected to add a relocated value to an existing value in a section. On the other hand, for RELA-type relocations, addends are stored to relocations themselves, and the dynamic loader overwrites existing values. x86-64 uses RELA-type relocations.
  • GNU linkers always write addends to sections even for RELA-type relocations. That's not necessary, though that doesn't hurt anyone, as such values will just be overwritten at runtime.
  • lld and mold don't write addends to sections for RELA-type relocations. They are just left as zero bytes.
  • When reading section contents, src/cmd/link/internal/loader/loader.go does not seem to apply dynamic relocations before reading section contents. Therefore, some values that are non-zero when generated by a GNU linker seem to be just zeros when generated by lld or mold. That causes the difference of GC bitmaps. mold/lld-generated GC bitmaps have more zeros, causing GC to reclaim more objects than it should be.

So, I think the proper fix is to change loader.go so that it apply dynamic relocations for a DSO before reading its section contents and returning it to decodetypeGcmask.

@thanm
Copy link
Contributor

@thanm thanm commented Jun 12, 2021

So, I think the proper fix is to change loader.go so that it apply dynamic relocations for a DSO before reading its section contents and returning it to decodetypeGcmask.

Thanks. Yes, you made that clear already in your previous comment 8 days ago, and it is indeed evident that the problem is due to the fact that the go linker is not applying dynamic relocations to the section in question. Please bear with me while I work on this bug; I have many other demands on my time; I need to balance working on your bug with working on other bugs as well. Thank for your patience.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants