cmd/link: rare corruption of ELF binaries #53804
Comments
Are the corrupted binaries deterministic? If not, I wonder if #51611 is related. |
What do you mean by that: whether they are always corrupted in the same way? If so, I think they are, but I currently have only one sample lying around, so I would need to wait for it to happen again to tell for sure. |
Unfortunately, I can’t reproduce the issue. When I did another build right after the corrupted build, everything worked. What all 3 corrupted ELF binaries I have seen so far (two of my own, one from alvs) have in common is that their first 4096 bytes are all zero. |
CC @golang/compiler |
Is the corruption always on the ELF header? Is it on the same machine or on multiple machines? What file system are you running on? Thanks. |
I think so, I remember I got a couple of
Multiple, mine and @stapelberg also sees it on his. Mine is a Lenovo t470s:
xfs with Linux 5.17.9-200.fc35.x86_64 |
I don’t know if all instances of this corruption will always affect the ELF header. What I can say is that whenever I have noticed corruption, the ELF header was corrupted. Note that ELF header failures are rather loud in comparison (binary can’t be started), so perhaps this is selection bias — other corruption could exist, I just don’t know about it because it doesn’t fail as badly perhaps (just speculating here).
Multiple machines: mine (see https://michael.stapelberg.ch/posts/2022-01-15-high-end-linux-pc/ for specs) and @alvaroaleman’s machine (can you share your specs, too, please?)
I’m using ext4 on this Linux 5.17.7 machine, but note that both the source code (generated) and the binary output location are on The build command I’m using is: |
Thanks. Frequent corruption on the ELF header is probably enough information. The ELF header is written by the linker only. The compiler and (most of) the go command are unrelated. The one place where the go command touches the binary after linking is stamping the build ID. So it is either the linker or that build ID stamping. Is your program a pure-Go binary, or does it use cgo (i.e. is it linked internally or externally)? |
Mine is pure go (i.E. |
I am building with |
4096 = block size of ext4, file system bug? |
Again: note that both the source code (generated) and the binary output location are on /tmp, which is a tmpfs mount in my case. So I don’t think a file system bug is likely here. |
Thanks. Pure-Go means that the file is written by the Go linker, not the C linker. If this is somewhat reproducible, could you try if this patch makes any difference? (You'll need to rebuild the linker,
|
Also, do you know if this is a new bug, or it occurs with old versions of Go as well? Thanks. |
@cherrymui I haven't seen this before using go 1.18, so I guess it is new. I did some more experimenting and found out:
Regarding your patch, I suppose I need to replace the
|
This is weird. The patch does make a difference on my machine
|
Maybe I messed up tabs and spaces when I first pasted the patch, so it didn't directly apply. Try this.
|
@alvaroaleman I think the problem you ran into when trying to apply the patch is:
Go will not use the local directory, but your installation’s cmd/link source instead. You need to use:
You can verify the code was updated by using
If you can’t find If you only see the I have applied the patch on my system and will report back if I still encounter the corruption. However, as I can’t trigger it even when building hundreds of times in a loop, it would be better if @alvaroaleman could report back, as it sounds like it happens more frequently for you? Thanks |
This problem is caused by the occasional loss of 16K of data at the beginning of the file after the link process calls syscall.Fallocate for the second time. However, not every machine reproduces this error (the file systems I tested include xfs, ext4, and tmpfs).
|
Thanks for adding to this report, @abner-chenc! I also just encountered this corruption again today, with @cherrymui’s patch applied — so the extra MSYNC does not help unfortunately. |
I tried replacing syscall.Fallocate with posix_fallocate (my implementation is modeled on glibc's posix_fallocate.c), and the error no longer occurs on the same machine, but this costs performance. @stapelberg, here is my implementation of posix_fallocate: https://github.com/abner-chenc/go-abner/commit/c43a4a75a031ac9550eb02697b50032f1dced9ae |
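For readers wondering why an emulated posix_fallocate costs performance, here is a rough Go sketch of the general idea behind such fallbacks: extend the file, then touch one byte in every block so the file system has to back each block with storage. This is not the code from the linked commit; `touchBlocks`, `demoTouch`, and the 4096-byte block size are assumptions made up for this illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// touchBlocks is a hypothetical helper, not the code from the linked commit.
// It grows f to at least size bytes, then rewrites one byte per block so that
// every block must be allocated. Existing data is preserved because each byte
// is read first and the same value is written back.
func touchBlocks(f *os.File, size, blockSize int64) error {
	if size <= 0 || blockSize <= 0 {
		return fmt.Errorf("size and blockSize must be positive")
	}
	fi, err := f.Stat()
	if err != nil {
		return err
	}
	if fi.Size() < size {
		if err := f.Truncate(size); err != nil {
			return err
		}
	}
	b := make([]byte, 1)
	for off := int64(0); off < size; off += blockSize {
		if _, err := f.ReadAt(b, off); err != nil && err != io.EOF {
			return err
		}
		if _, err := f.WriteAt(b, off); err != nil {
			return err
		}
	}
	return nil
}

// demoTouch runs touchBlocks on a temporary file that already holds some data
// and returns the resulting file size.
func demoTouch() (int64, error) {
	f, err := os.CreateTemp("", "touch-blocks-*")
	if err != nil {
		return 0, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	payload := []byte("hello")
	if _, err := f.WriteAt(payload, 0); err != nil {
		return 0, err
	}
	if err := touchBlocks(f, 3*4096, 4096); err != nil {
		return 0, err
	}
	got := make([]byte, len(payload))
	if _, err := f.ReadAt(got, 0); err != nil {
		return 0, err
	}
	if !bytes.Equal(got, payload) {
		return 0, fmt.Errorf("existing data was modified: %q", got)
	}
	fi, err := f.Stat()
	if err != nil {
		return 0, err
	}
	return fi.Size(), nil
}

func main() {
	size, err := demoTouch()
	fmt.Println(size, err)
}
```

The cost is one read and one write system call per block, which is why this approach is slower than a single fallocate call on large outputs.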
@stapelberg thanks for the hint, using |
It seems @abner-chenc's patch does indeed fix this: I could no longer reproduce the issue in 500 builds, whereas before their patch it never took more than 200 builds for it to occur. |
Are all the Does the @abner-chenc your patch also reorders
Could you explain what you did exactly? What is "the second fallocate call"? @alvaroaleman it looks like @abner-chenc's patch only affects Loong64. Just to make sure, did you change it to apply to your architecture as well? Thanks. |
@abner-chenc are you also running on tmpfs? @alvaroaleman @stapelberg does it reproduce on a non-tmpfs file system? Thanks. |
@cherrymui I just applied https://github.com/abner-chenc/go-abner/commit/c43a4a75a031ac9550eb02697b50032f1dced9ae.patch without any changes, then built the |
That patch should not make any difference if you're not running on Loong64. Would be good to try on xfs. Thanks. |
@cherrymui well, but it does. I can reproduce this issue with a The patch also moves the |
Thanks. Could be. Could you try just moving the Truncate call without the rest of the change? Thanks. |
@cherrymui yes, reducing the patch to only the changes in
--- a/src/cmd/link/internal/ld/outbuf_mmap.go
+++ b/src/cmd/link/internal/ld/outbuf_mmap.go
@@ -20,6 +20,10 @@ func (out *OutBuf) Mmap(filesize uint64) (err error) {
 		out.munmap()
 	}
+	err = out.f.Truncate(int64(filesize))
+	if err != nil {
+		Exitf("resize output file failed: %v", err)
+	}
 	for {
 		if err = out.fallocate(filesize); err != syscall.EINTR {
 			break
@@ -33,10 +37,6 @@ func (out *OutBuf) Mmap(filesize uint64) (err error) {
 			return err
 		}
 	}
-	err = out.f.Truncate(int64(filesize))
-	if err != nil {
-		Exitf("resize output file failed: %v", err)
-	}
 	out.buf, err = syscall.Mmap(int(out.f.Fd()), 0, int(filesize), syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED|syscall.MAP_FILE)
 	if err != nil {
 		return err
|
Thanks. If that is the fix then it seems that it should be possible to cause the same problem in a fairly simple C program. |
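As a starting point for such a standalone reproducer, the call sequence from the reordered patch (Truncate, then fallocate with an EINTR retry, then a shared mmap) might be sketched as follows. It is written in Go rather than C to stay with the language of this thread, it is Linux-only, and `mmapRoundTrip` and the 0xAB fill pattern are names and values invented for this sketch. It is not the linker's code and is not known to trigger the corruption.

```go
//go:build linux

package main

import (
	"bytes"
	"fmt"
	"os"
	"syscall"
)

// mmapRoundTrip is a hypothetical helper: it sizes a temporary file with
// Truncate, reserves space with fallocate, fills the file through a shared
// mapping, and reads it back with ordinary file I/O to compare the contents.
func mmapRoundTrip(size int) (bool, error) {
	f, err := os.CreateTemp("", "fallocate-mmap-*")
	if err != nil {
		return false, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	if err := f.Truncate(int64(size)); err != nil {
		return false, err
	}
	// Retry fallocate while it is interrupted by a signal.
	for {
		err = syscall.Fallocate(int(f.Fd()), 0, 0, int64(size))
		if err != syscall.EINTR {
			break
		}
	}
	// Some file systems do not support fallocate; this sketch tolerates that.
	if err != nil && err != syscall.EOPNOTSUPP && err != syscall.ENOSYS {
		return false, err
	}

	buf, err := syscall.Mmap(int(f.Fd()), 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
	if err != nil {
		return false, err
	}
	for i := range buf {
		buf[i] = 0xAB // non-zero pattern, so zeroed pages would stand out
	}
	if err := syscall.Munmap(buf); err != nil {
		return false, err
	}

	got, err := os.ReadFile(f.Name())
	if err != nil {
		return false, err
	}
	return bytes.Equal(got, bytes.Repeat([]byte{0xAB}, size)), nil
}

func main() {
	ok, err := mmapRoundTrip(1 << 20)
	fmt.Println(ok, err)
}
```

Running it in a loop on an affected machine, with the output directory on the file system in question, would be one way to check whether the sequence alone is enough to lose data.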
Thanks @alvaroaleman ! I couldn't find the requirement of cc @aclements in case you know something about the FS. |
I have tested this and found that swapping the call order of 'fallocate' and 'ftruncate' makes the error less frequent, but does not completely solve the problem. |
In the function copyHeap (outbuf.go), if len(out.heap) > 0, syscall.Fallocate is called again to allocate more disk space.
|
|
I think mixing file IO syscalls and mmaps on the same file is probably not a good idea. So your Also, in the patch above the truncate call is still after the fallocate. Could you try moving it up? On the other hand, if |
The problem still happens with Go 1.19 (just happened to me again). |
Has anyone tried to reproduce the failure in C? It would be interesting to know why it seems to fail only for you. For example, it doesn't fail on the builders (except loongarch). It would be good to know what is special about those machines, and why it fails. Thanks. |
related to golang/go#53804 When validation fails, the error message looks like this: 2022/09/17 20:53:07 /tmp/gokrazy-bins-1604202262/dhcp4d is not an ELF binary! bad magic number '[0 0 0 0]' in record at byte 0x0
|
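A validation step of the kind mentioned in that commit message can be approximated with the standard library's debug/elf package. The sketch below is not the gokrazy code; `checkELF` is a name chosen for this illustration, and the program takes the path of the binary to check as its only argument.

```go
package main

import (
	"bytes"
	"debug/elf"
	"fmt"
	"os"
)

// checkELF is a hypothetical helper: it returns an error when data does not
// parse as an ELF file, for example because the header has been zeroed out.
func checkELF(data []byte) error {
	f, err := elf.NewFile(bytes.NewReader(data))
	if err != nil {
		return err
	}
	return f.Close()
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: checkelf <binary>")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := checkELF(data); err != nil {
		fmt.Fprintf(os.Stderr, "%s is not an ELF binary! %v\n", os.Args[1], err)
		os.Exit(1)
	}
	fmt.Println("ELF header looks valid")
}
```

For a file whose first bytes are all zero, the error should resemble the "bad magic number" message quoted above. Note that this only catches corruption in the parts of the file that the ELF parser reads when opening it.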
Change https://go.dev/cl/445835 mentions this issue: |
Thanks for trying the C code.
Could you try in the C code, having a thread (or process) sending some signals to the fallocate thread, so it may return EINTR? |
|
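The EINTR handling under discussion follows the usual retry pattern, which the linker patch above spells out as a `for` loop around `out.fallocate`. A generic version might look like the sketch below; `retryOnEINTR` and `demoRetry` are hypothetical names, and the interrupted call is simulated rather than provoked with real signals.

```go
package main

import (
	"fmt"
	"syscall"
)

// retryOnEINTR calls op until it returns something other than EINTR and
// passes that result (nil or another error) back to the caller.
func retryOnEINTR(op func() error) error {
	for {
		if err := op(); err != syscall.EINTR {
			return err
		}
	}
}

// demoRetry simulates an operation that is interrupted twice before it
// succeeds, and reports how many attempts were made.
func demoRetry() (int, error) {
	calls := 0
	err := retryOnEINTR(func() error {
		calls++
		if calls < 3 {
			return syscall.EINTR
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := demoRetry()
	fmt.Println(calls, err) // prints "3 <nil>"
}
```

Each retry repeats the whole call, so an operation that had partly completed before the interruption is issued again over the same range.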
Thanks very much @abner-chenc for tracking this down! A fix has landed in the Linux kernel as commit torvalds/linux@44bcabd, which is included in Linux 6.1 and newer. I’ll update my machine when I get a chance. |
I haven’t run into this issue with Linux 6.1+, so I think the kernel fix solves the issue. |
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes, Go 1.18 is the latest (stable) release.
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
I am building and deploying Go software from a cron job every day.
Recently, I noticed that sometimes, some of my executable binary files do not start up because they are corrupt!
The first time I noticed the issue, the init binary of one of my https://gokrazy.org/ installations was affected, resulting in an installation that wouldn’t boot at all.
The other time, it wasn’t the init binary, but a program of mine called regelwerk, which is involved in motion sensor/light control in my home, so I noticed that because the lights weren’t working as they should.
It’s possible this happened more times and I just didn’t notice it.
Yesterday, I found someone on twitter who is also running into this issue, but with an entirely different program (not related to gokrazy at all): https://twitter.com/alvs_versteck/status/1546601648532983808
What did you expect to see?
The Go compiler/linker should produce ELF binaries that contain a valid ELF header.
What did you see instead?
The first 4096 bytes of the ELF binary are zeroed out, as well as another block of 4096 bytes at offset 256K.
You can find the files at https://t.zekjur.net/_2022-06-25-init/
In the other occurrence, it was 4096 bytes at the start of the ELF binary, then 4096 bytes at offset 0x9000.
Unfortunately I have no idea how to reproduce this issue.
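To look for the pattern described above (block-aligned 4096-byte runs of zeros) in a suspect binary, a small scanner like the following sketch could help. `zeroBlocks` and `isZero` are names invented here, and the 4096-byte block size is taken from the observations in this report. Healthy binaries can also contain all-zero blocks, so a hit is only a hint, not proof of corruption.

```go
package main

import (
	"fmt"
	"os"
)

// isZero reports whether every byte in b is zero.
func isZero(b []byte) bool {
	for _, v := range b {
		if v != 0 {
			return false
		}
	}
	return true
}

// zeroBlocks returns the offsets of all block-aligned, all-zero blocks of
// blockSize bytes in data. A trailing partial block is ignored.
func zeroBlocks(data []byte, blockSize int) []int {
	if blockSize <= 0 {
		return nil
	}
	var offsets []int
	for off := 0; off+blockSize <= len(data); off += blockSize {
		if isZero(data[off : off+blockSize]) {
			offsets = append(offsets, off)
		}
	}
	return offsets
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: zeroblocks <binary>")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, off := range zeroBlocks(data, 4096) {
		fmt.Printf("all-zero block at offset 0x%x\n", off)
	}
}
```

A zero block at offset 0 is the most telling result, because a valid ELF file always starts with the magic bytes 0x7f 'E' 'L' 'F'.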