Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/link: rare corruption of ELF binaries #53804

Open
stapelberg opened this issue Jul 12, 2022 · 44 comments
Open

cmd/link: rare corruption of ELF binaries #53804

stapelberg opened this issue Jul 12, 2022 · 44 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@stapelberg
Copy link
Contributor

What version of Go are you using (go version)?

$ go version
go version go1.18 linux/amd64

Does this issue reproduce with the latest release?

Yes, Go 1.18 is the latest (stable) release.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/michael/.cache/go-build"
GOENV="/home/michael/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/michael/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/michael/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/michael/sdk/go1.18"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/michael/sdk/go1.18/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.18"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/michael/2022-06-25-squashfs-debug/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1868362460=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I am building and deploying Go software from a cron job every day.

Recently, I noticed that sometimes, some of my executable binary files do not start up because they are corrupt!

The first time I noticed the issue, the init binary of one of my https://gokrazy.org/ installations was affected, resulting in an installation that wouldn’t boot at all.

The other time, it wasn’t the init binary, but a program of mine called regelwerk which is involved in motion sensor/light control in my home, so I noticed that because the lights weren’t working as they should.

It’s possible this happened more times and I just didn’t notice it.

Yesterday, I found someone on twitter who is also running into this issue, but with an entirely different program (not related to gokrazy at all): https://twitter.com/alvs_versteck/status/1546601648532983808

What did you expect to see?

The Go compiler/linker should produce ELF binaries that contain a valid ELF header.

What did you see instead?

The first 4096 bytes of the ELF binary are zeroed out, as well as another block of 4096 bytes at offset 256K.

You can find the files at https://t.zekjur.net/_2022-06-25-init/

In the other occurrence, it was 4096 bytes at the start of the ELF binary, then 4096 bytes at offset 0x9000.

Unfortunately I have no idea how to reproduce this issue.

@bcmills
Copy link
Member

bcmills commented Jul 12, 2022

Are the corrupted binaries deterministic?

If not, I wonder if #51611 is related.

@alvaroaleman
Copy link

Are the corrupted binaries deterministic?

WDYM by that, if they are always corrupted in the same way? If yes, I think so but currently only have one sample lying around where this happened so would need to wait for it to happen again to be able to tell for sure.

@stapelberg
Copy link
Contributor Author

Are the corrupted binaries deterministic?

I can’t reproduce the issue unfortunately. When I did another build right after the corrupted build, everything worked.

But what all the 3 different corrupted ELF binaries I have seen (two of my own, one from alvs) have had in common so far is that they had a block of all-zero bytes at their first 4096 bytes.

@mknyszek mknyszek changed the title Go 1.18 sometimes produces corrupt ELF binaries cmd/compile, cmd/link: rare corruption of ELF binaries Jul 12, 2022
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 12, 2022
@mknyszek mknyszek added this to the Backlog milestone Jul 12, 2022
@mknyszek
Copy link
Contributor

CC @golang/compiler

@cherrymui cherrymui changed the title cmd/compile, cmd/link: rare corruption of ELF binaries cmd/link: rare corruption of ELF binaries Jul 12, 2022
@cherrymui
Copy link
Member

Is the corruption always on the ELF header? Is it on the same machine or on multiple machines? What the file system are you running on? Thanks.

@alvaroaleman
Copy link

alvaroaleman commented Jul 12, 2022

Is the corruption always on the ELF header?

I think so, I remember I got a couple of cannot execute binary file: Exec format error errors and rebuilding helped. I only have one lying around right now though and it happens intermittently, so I can't say for sure until it happens again.

Is it on the same machine or on multiple machines?

Multiple, mine and @stapelberg also sees it on his. Mine is a Lenovo t470s:

- Intel Core i7-7500U Prozessor (4 MB Cache, up to 3,5 GHz)
- 35,56 cm (14,0") WQHD (2.560 x 1.440) IPS-Display, Non-Touch
- 24 GB DDR4-2133 – 8 GB Onboard + 16 GB (1 SoDIMM)
- Intel HD-graphics 620
- [Crucial P5 Plus 1TB PCIe 4.0 3D NAND NVMe M.2 SSD, up to 6600MB/s - CT1000P5PSSD8](https://www.amazon.ca/gp/product/B098WL46RS)

What the file system are you running on?

xfs with Linux 5.17.9-200.fc35.x86_64

@stapelberg
Copy link
Contributor Author

Is the corruption always on the ELF header?

I don’t know if all instances of this corruption will always affect the ELF header.

What I can say is that whenever I have noticed corruption, the ELF header was corrupted.

Note that ELF header failures are rather loud in comparison (binary can’t be started), so perhaps this is selection bias — other corruption could exist, I just don’t know about it because it doesn’t fail as badly perhaps (just speculating here).

Is it on the same machine or on multiple machines?

Multiple machines: mine (see https://michael.stapelberg.ch/posts/2022-01-15-high-end-linux-pc/ for specs) and @alvaroaleman’s machine (can you share your specs, too, please?)

What the file system are you running on? Thanks.

I’m using ext4 on this Linux 5.17.7 machine, but note that both the source code (generated) and the binary output location are on /tmp, which is a tmpfs mount in my case. The Go build cache is on ~ on ext4, but given that we’re talking about the final ELF binary, I would think that the Go linker should always create that, not copy it from cache.

The build command I’m using is: go build -mod=mod -o /tmp/gokr-packer456452338/init -tags=gokrazy,netgo,osusergo /tmp/gokr-packer456452338/init.go

@cherrymui
Copy link
Member

cherrymui commented Jul 12, 2022

Thanks. Frequent corruption on the ELF header is probably enough information. The ELF header is written by the linker only. The compiler and (most of) the Go command are unrelated. The one place that the go command touches the binary after linking is stamping the build ID. So it is either the linker or that build ID stamping.

Is your program a pure-Go binary or it uses cgo (i.e. whether it is internal linking or external linking)?

@alvaroaleman
Copy link

Is your program a pure-Go binary or it uses cgo (i.e. whether it is internal linking or external linking)?

Mine is pure go (i.E. CGO_ENABLED=0)

@stapelberg
Copy link
Contributor Author

I am building with CGO_ENABLED=0, too.

@mengzhuo
Copy link
Contributor

4096 = block size of ext4, file system bug?

@stapelberg
Copy link
Contributor Author

4096 = block size of ext4, file system bug?

Again: note that both the source code (generated) and the binary output location are on /tmp, which is a tmpfs mount in my case.

So I don’t think a file system bug is likely here.

@cherrymui
Copy link
Member

Thanks. Pure-Go means that the file is written by the Go linker, not the C linker.

If this is somewhat reproducible, could you try if this patch makes any difference? (You'll need to rebuild the linker, go install cmd/link)

diff --git a/src/cmd/link/internal/ld/outbuf_mmap.go b/src/cmd/link/internal/ld/outbuf_mmap.go
index 40a3222788..c3cc157579 100644
--- a/src/cmd/link/internal/ld/outbuf_mmap.go
+++ b/src/cmd/link/internal/ld/outbuf_mmap.go
@@ -9,6 +9,7 @@ package ld
 
 import (
        "syscall"
+       "unsafe"
 )
 
 // Mmap maps the output file with the given size. It unmaps the old mapping
@@ -55,6 +56,7 @@ func (out *OutBuf) munmap() {
        if out.buf == nil {
                return
        }
+       syscall.Syscall(syscall.SYS_MSYNC, uintptr(unsafe.Pointer(&out.buf[0])), uintptr(len(out.buf)), syscall.MS_SYNC) // XXX
        syscall.Munmap(out.buf)
        out.buf = nil
 }

@cherrymui
Copy link
Member

Also, do you know if this is a new bug, or it occurs with old versions of Go as well? Thanks.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022
@alvaroaleman
Copy link

@cherrymui I haven't seen this before using go 1.18, so I guess it is new. I did some more experimenting and found out:

  • I can somewhat reliably reproduce this by go building in a loop, in five attempts it always took less than 200 builds
  • It is not reproducible in a docker container, so seems specific to my env
  • It is reproducible with an empty go cache
  • It is not reproducible with a hello world go program, so seems to either be specific to my project or maybe just the size of the codebase

Regarding your patch, I suppose I need to replace the link binary in the GOTOOLDIR? I did that and the issue still happens. I then looked more at the link binary and it has the same sha256sum regardless of applying your patch or not...?

$ g diff
diff --git a/src/cmd/link/internal/ld/outbuf_mmap.go b/src/cmd/link/internal/ld/outbuf_mmap.go
index 40a3222788..c3cc157579 100644
--- a/src/cmd/link/internal/ld/outbuf_mmap.go
+++ b/src/cmd/link/internal/ld/outbuf_mmap.go
@@ -9,6 +9,7 @@ package ld
 
 import (
        "syscall"
+       "unsafe"
 )
 
 // Mmap maps the output file with the given size. It unmaps the old mapping
@@ -55,6 +56,7 @@ func (out *OutBuf) munmap() {
        if out.buf == nil {
                return
        }
+       syscall.Syscall(syscall.SYS_MSYNC, uintptr(unsafe.Pointer(&out.buf[0])), uintptr(len(out.buf)), syscall.MS_SYNC) // XXX
        syscall.Munmap(out.buf)
        out.buf = nil
 }
[2022-07-13-23:17]:[app.ci@ci][master][1]~/.../goroot/src
$ go build -o /tmp/link cmd/link/
[2022-07-13-23:17]:[app.ci@ci][master][1]~/.../goroot/src
$ sha256sum /tmp/link
e7e671ba4d83e6f3a820ba3f0269ef21e26f2c477938499d389f58a696ec7e19  /tmp/link
[2022-07-13-23:17]:[app.ci@ci][master][1]~/.../goroot/src
$ g reset --hard
HEAD is now at 558785a0a9 cmd/compile: remove -installsuffix flag
[2022-07-13-23:17]:[app.ci@ci][master][1]~/.../goroot/src
$ go build -o /tmp/link cmd/link/
[2022-07-13-23:17]:[app.ci@ci][master][1]~/.../goroot/src
$ sha256sum /tmp/link
e7e671ba4d83e6f3a820ba3f0269ef21e26f2c477938499d389f58a696ec7e19  /tmp/link

@cherrymui
Copy link
Member

This is weird. The patch does make a difference on my machine

$ git status
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean
$ git log -1 --oneline
558785a0a9 (HEAD -> master, origin/master, origin/HEAD) cmd/compile: remove -installsuffix flag
$ go build -o /tmp/link cmd/link
$ sha256sum /tmp/link 
173de0c0efed5ea3c57ef497effe803f20c60603889de4c64cb1ac726b7ed45f  /tmp/link
$ git apply </tmp/p
$ git diff
diff --git a/src/cmd/link/internal/ld/outbuf_mmap.go b/src/cmd/link/internal/ld/outbuf_mmap.go
index 40a3222788..c3cc157579 100644
--- a/src/cmd/link/internal/ld/outbuf_mmap.go
+++ b/src/cmd/link/internal/ld/outbuf_mmap.go
@@ -9,6 +9,7 @@ package ld
 
 import (
        "syscall"
+       "unsafe"
 )
 
 // Mmap maps the output file with the given size. It unmaps the old mapping
@@ -55,6 +56,7 @@ func (out *OutBuf) munmap() {
        if out.buf == nil {
                return
        }
+       syscall.Syscall(syscall.SYS_MSYNC, uintptr(unsafe.Pointer(&out.buf[0])), uintptr(len(out.buf)), syscall.MS_SYNC) // XXX
        syscall.Munmap(out.buf)
        out.buf = nil
 }
$ go build -o /tmp/link cmd/link
$ sha256sum /tmp/link 
3237cd32022b5b6cee6a9286cdec0d10e7e24455d9ee610553e8caecd8754c81  /tmp/link
$ git reset --hard
HEAD is now at 558785a0a9 cmd/compile: remove -installsuffix flag
$ go build -o /tmp/link cmd/link
$ sha256sum /tmp/link 
173de0c0efed5ea3c57ef497effe803f20c60603889de4c64cb1ac726b7ed45f  /tmp/link

@cherrymui
Copy link
Member

Maybe I messed up tabs and spaces when I first pasted the patch, so it didn't directly apply. Try this.

diff --git a/src/cmd/link/internal/ld/outbuf_mmap.go b/src/cmd/link/internal/ld/outbuf_mmap.go
index 40a3222788..c3cc157579 100644
--- a/src/cmd/link/internal/ld/outbuf_mmap.go
+++ b/src/cmd/link/internal/ld/outbuf_mmap.go
@@ -9,6 +9,7 @@ package ld
 
 import (
 	"syscall"
+	"unsafe"
 )
 
 // Mmap maps the output file with the given size. It unmaps the old mapping
@@ -55,6 +56,7 @@ func (out *OutBuf) munmap() {
 	if out.buf == nil {
 		return
 	}
+	syscall.Syscall(syscall.SYS_MSYNC, uintptr(unsafe.Pointer(&out.buf[0])), uintptr(len(out.buf)), syscall.MS_SYNC) // XXX
 	syscall.Munmap(out.buf)
 	out.buf = nil
 }

@stapelberg
Copy link
Contributor Author

@alvaroaleman I think the problem you ran into when trying to apply the patch is:

go build -o /tmp/link cmd/link/

Go will not use the local directory, but your installation’s cmd/link source instead. You need to use:

go build -o /tmp/link ./cmd/link/

You can verify the code was updated by using go tool objdump ./pkg/tool/linux_amd64/link. Search the output for outbuf_mmap.go and look for the munmap code. You should see disassembly like:

  outbuf_mmap.go:55     0x5ae5d2                4889442450              MOVQ AX, 0x50(SP)                               
  outbuf_mmap.go:59     0x5ae5d7                4889542438              MOVQ DX, 0x38(SP)                               
  outbuf_mmap.go:59     0x5ae5dc                488b4818                MOVQ 0x18(AX), CX                               
  outbuf_mmap.go:59     0x5ae5e0                48c704241a000000        MOVQ $0x1a, 0(SP)                               
  outbuf_mmap.go:59     0x5ae5e8                4889542408              MOVQ DX, 0x8(SP)                                
  outbuf_mmap.go:59     0x5ae5ed                48894c2410              MOVQ CX, 0x10(SP)                               
  outbuf_mmap.go:59     0x5ae5f2                48c744241804000000      MOVQ $0x4, 0x18(SP)                             
  outbuf_mmap.go:59     0x5ae5fb                0f1f440000              NOPL 0(AX)(AX*1)                                
  outbuf_mmap.go:59     0x5ae600                e8bbf0eeff              CALL syscall.Syscall.abi0(SB)                   
  outbuf_mmap.go:59     0x5ae605                450f57ff                XORPS X15, X15                                  
  outbuf_mmap.go:59     0x5ae609                644c8b3425f8ffffff      MOVQ FS:0xfffffff8, R14                         
  outbuf_mmap.go:60     0x5ae612                488b442450              MOVQ 0x50(SP), AX                               
  outbuf_mmap.go:60     0x5ae617                488b5810                MOVQ 0x10(AX), BX                               
  outbuf_mmap.go:60     0x5ae61b                488b4818                MOVQ 0x18(AX), CX                               
  outbuf_mmap.go:60     0x5ae61f                488b7820                MOVQ 0x20(AX), DI                               
  syscall_linux.go:1142 0x5ae623                488b1576be2600          MOVQ syscall.mapper(SB), DX                     
  syscall_linux.go:1142 0x5ae62a                4889d0                  MOVQ DX, AX                                     
  syscall_linux.go:1142 0x5ae62d                e88eceeeff              CALL syscall.(*mmapper).Munmap(SB)              

If you can’t find munmap, look for Close — maybe the code got inlined.

If you only see the syscall.(*mmapper).Munmap call, the patch was not applied correctly. If you see the syscall.Syscall.abi0 call, the patch was applied correctly.

I have applied the patch on my system and will report back if I still encounter the corruption. However, as I can’t trigger it even when building hundreds of times in a loop, it would be better if @alvaroaleman could report back, as it sounds like it happens more frequently for you?

Thanks

@abner-chenc
Copy link
Contributor

abner-chenc commented Jul 20, 2022

   1  0000000 0000 0000 0000 0000 0000 0000 0000 0000
   2  *
   3  0004000 0020 4c00 3d24 1c01 d084 02fc 2064 29c0
   4  0004010 9004 02c0 4064 29c0 4800 576a 0000 0014
   5  0004020 62d3 28c0 8e73 0012 1260 5c00 0025 0015

This problem is caused by the occasional loss of 16K data at the beginning of the file after the link process calls the syscall.Fallocate function for the second time. However, not all machines can reproduce this error (the file system formats I tested include xfs, ext4, tmfs ).

  • my kernel version is 5.17.0+, pagesize is 16K。
  • In the case of ensuring enough disk space, I removed the second fallocate call, the problem no longer occurs.
  • I tried doing MS_SYNC before munmap, but the problem persists
  • I tested the kernel with version 4.19.190, but this problem did not occur

@stapelberg
Copy link
Contributor Author

Thanks for adding to this report, @abner-chenc!

I also just encountered this corruption again today, with @cherrymui’s patch applied — so the extra MSYNC does not help unfortunately.

@abner-chenc
Copy link
Contributor

abner-chenc commented Jul 20, 2022

I tried using posix_fallocate (the implementation refers to glib's posix_fallocate.c) replaces syscall.fallcate, and this error does not occur on the same machine , but this will cost performance.

@stapelberg, Here is my implementation of posix_fallocate: https://github.com/abner-chenc/go-abner/commit/c43a4a75a031ac9550eb02697b50032f1dced9ae

@alvaroaleman
Copy link

@stapelberg thanks for the hint, using go build -o /tmp/link ./cmd/link just resulted in main module (std) does not contain package std/cmd/link and building from the cmd folder as go build -o /tmp/link ./link still ended up using my installations cmd/link. I ended up applying the patch there, but as you already noted, it doesn't fix the issue.

@alvaroaleman
Copy link

Seems like @abner-chenc patch is indeed fixing this, I couldn't reproduce the issue anymore in 500 builds when I never needed more than 200 for it to occur prior to their patch.

@cherrymui
Copy link
Member

Are all the Exec format error case that the first page is missing? Or (sometimes) just the ELF header?

Does the fallocate syscall not work on your file system? But it also doesn't return an error? The fallocate isn't really necessary unless you're running low on disk space. Is that the case on your system? What happens if you just comment that fallocate call out?

@abner-chenc your patch also reorders Truncate and fallocate. What happens if you don't reorder them, or just reorder them without doing the rest?

In the case of ensuring enough disk space, I removed the second fallocate call, the problem no longer occurs.

Could you explain what you did exactly? What is "the second fallocate call"?

@alvaroaleman it looks like @abner-chenc 's patch only affects Loong64. Just to make sure, did you change it to apply to your architecture as well?

Thanks.

@cherrymui
Copy link
Member

@abner-chenc are you also running on tmpfs?

@alvaroaleman @stapelberg does it reproduce on a non-tmpfs file system?

Thanks.

@alvaroaleman
Copy link

@cherrymui I just applied https://github.com/abner-chenc/go-abner/commit/c43a4a75a031ac9550eb02697b50032f1dced9ae.patch without any changes, then built the link binary. While testing it, I output the binary to /tmp so a tmpfs, but when I orginally encountered the error I had targeted an xfs fs. I can check later if it also fixes the issue when targeting the xfs fs.

@cherrymui
Copy link
Member

That patch should not make any difference if you're not running on Loong64. Would be good to try on xfs. Thanks.

@alvaroaleman
Copy link

@cherrymui well, but it does. I can reproduce this issue with a link binary without that patch (most of the time in less than 50 tries, all the time so far in less than 200) and can not reproduce it with one that has the patch in 500 tries. I just verified this again, this time both when outputting to tmpfs and xfs which doesn't change the result.

The patch also moves the Truncate call around and that happens on all archs for which outbufs_mmap.go is compiled - Maybe that is why?

@cherrymui
Copy link
Member

Thanks. Could be. Could you try just moving the Truncate call without the rest of the change? Thanks.

@alvaroaleman
Copy link

@cherrymui yes, reducing the patch to only the changes in outbufs_mmap.go as shown below is enough to make the issue go away. Thanks for taking the time to look into this btw!

--- a/src/cmd/link/internal/ld/outbuf_mmap.go
+++ b/src/cmd/link/internal/ld/outbuf_mmap.go
@@ -20,6 +20,10 @@ func (out *OutBuf) Mmap(filesize uint64) (err error) {
                out.munmap()
        }
 
+       err = out.f.Truncate(int64(filesize))
+       if err != nil {
+               Exitf("resize output file failed: %v", err)
+       }
        for {
                if err = out.fallocate(filesize); err != syscall.EINTR {
                        break
@@ -33,10 +37,6 @@ func (out *OutBuf) Mmap(filesize uint64) (err error) {
                        return err
                }
        }
-       err = out.f.Truncate(int64(filesize))
-       if err != nil {
-               Exitf("resize output file failed: %v", err)
-       }
        out.buf, err = syscall.Mmap(int(out.f.Fd()), 0, int(filesize), syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED|syscall.MAP_FILE)
        if err != nil {
                return err

@ianlancetaylor
Copy link
Contributor

Thanks. If that is the fix then it seems that it should be possible to cause the same problem in a fairly simple C program.

@cherrymui
Copy link
Member

Thanks @alvaroaleman !

I couldn't find the requirement of ftruncate before fallocate being documented anywhere. But it is probably a good thing to do to swap the two calls anyway.

cc @aclements in case you know something about the FS.

@abner-chenc
Copy link
Contributor

@abner-chenc your patch also reorders Truncate and fallocate. What happens if you don't reorder them, or just reorder them without doing the rest?

I have tested and found that by exchanging the calling order of 'fallocate' and 'ftruncate', the occurrence of this error can be reduced, but it cannot completely solve the problem.

@abner-chenc
Copy link
Contributor

abner-chenc commented Jul 21, 2022

Could you explain what you did exactly? What is "the second fallocate call"?

In the function copyHeap(outbuf.go), if len(out.heap) > 0, syscall.Fallocate will be called again to allocate more disk space.
I tested and found that it is this Fallocate call that occasionally causes the loss of existing data in the file, all missing data is the beginning of the file 16K, exactly one page size. I tried executing MS_SYNC before munmap, but the problem persists.

  • I tried printing the ELF magic number before and after calling syscall.Fallocate:
diff --git a/src/cmd/link/internal/ld/outbuf.go b/src/cmd/link/internal/ld/outbuf.go
index 54fafcaf99..4fb9df7410 100644
--- a/src/cmd/link/internal/ld/outbuf.go
+++ b/src/cmd/link/internal/ld/outbuf.go
@@ -172,6 +173,7 @@ func (out *OutBuf) copyHeap() bool {
 	bufLen := len(out.buf)
 	heapLen := len(out.heap)
 	total := uint64(bufLen + heapLen)
+	fmt.Printf("# copyHeap: bufLen = %v, heapLen = %v\n", bufLen, heapLen)
 	if heapLen != 0 {
 		if err := out.Mmap(total); err != nil { // Mmap will copy out.heap over to out.buf
 			Exitf("mapping output file failed: %v", err)
diff --git a/src/cmd/link/internal/ld/outbuf_mmap.go b/src/cmd/link/internal/ld/outbuf_mmap.go
index 40a3222788..4391f5d772 100644
--- a/src/cmd/link/internal/ld/outbuf_mmap.go
+++ b/src/cmd/link/internal/ld/outbuf_mmap.go
@@ -20,6 +21,11 @@ func (out *OutBuf) Mmap(filesize uint64) (err error) {
 		out.munmap()
 	}
 
+	fmt.Printf("# Mmap-0: filesize = %v, oldlen = %v \n", filesize, oldlen)
+	rdata := make([]byte, 4)
+	n, ret := out.f.ReadAt(rdata, 0)
+	fmt.Printf("# Mmap-1: n:%v,ret:%v, data: %v\n", n, ret, rdata)
+
 	for {
 		if err = out.fallocate(filesize); err != syscall.EINTR {
 			break
@@ -33,6 +39,10 @@ func (out *OutBuf) Mmap(filesize uint64) (err error) {
 			return err
 		}
 	}
+
+	n, ret = out.f.ReadAt(rdata, 0)
+	fmt.Printf("# Mmap-2: n:%v,ret:%v, data: %v\n", n, ret, rdata)
+
 	err = out.f.Truncate(int64(filesize))
 	if err != nil {
 		Exitf("resize output file failed: %v", err)
  • log when an error occurs
cmd/link
# cmd/link
# Mmap-0: filesize = 6475068, oldlen = 0 
# Mmap-1: n:0,ret:EOF, data: [0 0 0 0]
# Mmap-2: n:4,ret:<nil>, data: [0 0 0 0]
# copyHeap: bufLen = 6475068, heapLen = 271630
# Mmap-0: filesize = 6746698, oldlen = 6475068 
# Mmap-1: n:4,ret:<nil>, data: [127 69 76 70]
# Mmap-2: n:4,ret:<nil>, data: [0 0 0 0]
  • hexdump pkg/tool/linux_loong64/link | vim -
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0004000 4065 28c1 2085 29c0 6064 29c1 0061 28c0
0004010 0063 02c1 0020 4c00 4e24 1c00 a084 02e4
0004020 2064 29c0 6304 1c00 7084 02ee 4064 29c0
0004030 e000 575d 5644 1c00 b084 02d5 2064 29c0

@abner-chenc
Copy link
Contributor

@abner-chenc are you also running on tmpfs?
Yes, I also tested on tmpfs

@cherrymui
Copy link
Member

I think mixing file IO syscalls and mmaps on the same file is probably not a good idea. So your f.ReadAt may not work as expected. (For the same reason I think we don't want to use the implementation of posix_fallocate.) Can you observe the same corruption if you read the first 4 bytes from the mmap immediately after the mmap call?

Also, in the patch above the truncate call is still after the fallocate. Could you try moving it up?

On the other hand, if fallocate causes occasional data lost, it sounds like an OS bug to me. Could you try writing a C program to see if this happens? Thanks.

@stapelberg
Copy link
Contributor Author

The problem still happens with Go 1.19 (just happened to me again).

@cherrymui
Copy link
Member

Has anyone tried to reproduce the failure in C?

It is interesting to know why it seems to only fail for you. Like, it doesn't fail on the builders (except loongarch). It would be good to know what is special on those machines, and why it fails. Thanks.

stapelberg added a commit to gokrazy/tools that referenced this issue Sep 17, 2022
related to golang/go#53804

When validation fails, the error message looks like this:

2022/09/17 20:53:07 /tmp/gokrazy-bins-1604202262/dhcp4d is not an ELF binary!
bad magic number '[0 0 0 0]' in record at byte 0x0
@abner-chenc
Copy link
Contributor

  • I tried testing with C code fallocate.c , but the problem doesn't occur.

  • I tested and found that when fallocate returns the error code EINTR for the first time, and trying to fallocate again will cause the problem to appear (the second time no error code is returned).

  • I tried preallocate new space at the end of the file and the problem didn't reappear by testing, but I don't know what the root cause is. this is my patch

@gopherbot
Copy link

Change https://go.dev/cl/445835 mentions this issue: cmd/link: preallocate new space at the end of the file on Linux

@cherrymui
Copy link
Member

Thanks for trying the C code.

when fallocate returns the error code EINTR for the first time, and trying to fallocate again will cause the problem to appear

Could you try in the C code, having a thread (or process) sending some signals to the fallocate thread, so it may return EINTR?

@abner-chenc
Copy link
Contributor

Thanks for trying the C code.

when fallocate returns the error code EINTR for the first time, and trying to fallocate again will cause the problem to appear

Could you try in the C code, having a thread (or process) sending some signals to the fallocate thread, so it may return EINTR?

@cherrymui

@stapelberg
Copy link
Contributor Author

Thanks very much @abner-chenc for tracking this down!

A fix has landed in the Linux kernel as commit torvalds/linux@44bcabd, which is included in Linux 6.1 and newer. I’ll update my machine when I get a chance.

@stapelberg
Copy link
Contributor Author

A fix has landed in the Linux kernel as commit torvalds/linux@44bcabd, which is included in Linux 6.1 and newer. I’ll update my machine when I get a chance.

I haven’t run into this issue with Linux 6.1+, so I think the kernel fix solves the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Status: Triage Backlog
Development

No branches or pull requests

9 participants