Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/link: unexpected fault address (when low on disk space?) #37310

Open
bradfitz opened this issue Feb 20, 2020 · 14 comments
Open

cmd/link: unexpected fault address (when low on disk space?) #37310

bradfitz opened this issue Feb 20, 2020 · 14 comments
Milestone

Comments

@bradfitz
Copy link
Contributor

@bradfitz bradfitz commented Feb 20, 2020

I just got some mysterious linker errors. I suspect they're because this machine only has 20MB of disk free (which I just noticed).

# tailscale.io/control/cfgdb.test
unexpected fault address 0x7f24b561d000
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7f24b561d000 pc=0x463c2e]

goroutine 1 [running]:
runtime.throw(0x6b7228, 0x5)
        /home/bradfitz/go/src/runtime/panic.go:1112 +0x72 fp=0xc003df2ee8 sp=0xc003df2eb8 pc=0x432f62
runtime.sigpanic()
        /home/bradfitz/go/src/runtime/signal_unix.go:674 +0x443 fp=0xc003df2f18 sp=0xc003df2ee8 pc=0x449533
runtime.memmove(0x7f24b561b6b0, 0x7f24b6c77c85, 0x22df)
        /home/bradfitz/go/src/runtime/memmove_amd64.s:108 +0xbe fp=0xc003df2f20 sp=0xc003df2f18 pc=0x463c2e
cmd/link/internal/ld.(*OutBuf).Write(0xc000024900, 0x7f24b6c77c85, 0x22df, 0x22df, 0x1, 0x1, 0x0)
        /home/bradfitz/go/src/cmd/link/internal/ld/outbuf.go:65 +0xa1 fp=0xc003df2f70 sp=0xc003df2f20 pc=0x5c83a1
cmd/link/internal/ld.(*OutBuf).WriteSym(0xc000024900, 0xc002807a90)
        /home/bradfitz/go/src/cmd/link/internal/ld/outbuf.go:159 +0x6c fp=0xc003df2fc0 sp=0xc003df2f70 pc=0x5c8b7c
cmd/link/internal/ld.blk(0xc000024900, 0xc004dd0000, 0x18d8, 0x1c00, 0x5ca6b0, 0x31299c, 0xc0063c4b00, 0x1, 0x1)
        /home/bradfitz/go/src/cmd/link/internal/ld/data.go:787 +0x10f fp=0xc003df3090 sp=0xc003df2fc0 pc=0x570d1f
cmd/link/internal/ld.CodeblkPad(0xc000001e00, 0x401000, 0x31299c, 0xc0063c4b00, 0x1, 0x1)
        /home/bradfitz/go/src/cmd/link/internal/ld/data.go:701 +0xbb fp=0xc003df31a0 sp=0xc003df3090 pc=0x5703fb
cmd/link/internal/amd64.asmb(0xc000001e00)
        /home/bradfitz/go/src/cmd/link/internal/amd64/asm.go:669 +0xc6 fp=0xc003df3200 sp=0xc003df31a0 pc=0x5ec9f6
cmd/link/internal/ld.Main(0x899300, 0x10, 0x20, 0x1, 0x7, 0x10, 0x6c25b5, 0x1b, 0x6be64d, 0x14, ...)
        /home/bradfitz/go/src/cmd/link/internal/ld/main.go:269 +0xd61 fp=0xc003df3358 sp=0xc003df3200 pc=0x5c7451
main.main()
        /home/bradfitz/go/src/cmd/link/main.go:68 +0x1bc fp=0xc003df3f88 sp=0xc003df3358 pc=0x63d2bc
runtime.main()
        /home/bradfitz/go/src/runtime/proc.go:203 +0x212 fp=0xc003df3fe0 sp=0xc003df3f88 pc=0x4355b2
runtime.goexit()
        /home/bradfitz/go/src/runtime/asm_amd64.s:1375 +0x1 fp=0xc003df3fe8 sp=0xc003df3fe0 pc=0x4629c1
FAIL    tailscale.io/control/cfgdb [build failed]
FAIL

And another, which looks like the same:

goroutine 1 [running]:
runtime.throw(0x6b7228, 0x5)
        /home/bradfitz/go/src/runtime/panic.go:1112 +0x72 fp=0xc00004eee8 sp=0xc00004eeb8 pc=0x432f62
runtime.sigpanic()
        /home/bradfitz/go/src/runtime/signal_unix.go:674 +0x443 fp=0xc00004ef18 sp=0xc00004eee8 pc=0x449533
runtime.memmove(0x7f29fc0a5000, 0x7f29fc6447ce, 0x4b)
        /home/bradfitz/go/src/runtime/memmove_amd64.s:205 +0x1b2 fp=0xc00004ef20 sp=0xc00004ef18 pc=0x463d22
cmd/link/internal/ld.(*OutBuf).Write(0xc000024900, 0x7f29fc6447ce, 0x4b, 0x4b, 0x7f2a23f586e0, 0x3f, 0xb)
        /home/bradfitz/go/src/cmd/link/internal/ld/outbuf.go:65 +0xa1 fp=0xc00004ef70 sp=0xc00004ef20 pc=0x5c83a1
cmd/link/internal/ld.(*OutBuf).WriteSym(0xc000024900, 0xc000a5d5f0)
        /home/bradfitz/go/src/cmd/link/internal/ld/outbuf.go:159 +0x6c fp=0xc00004efc0 sp=0xc00004ef70 pc=0x5c8b7c
cmd/link/internal/ld.blk(0xc000024900, 0xc0022e8000, 0xa80, 0xc00, 0x401000, 0x102b7c, 0xc000b93280, 0x1, 0x1)
        /home/bradfitz/go/src/cmd/link/internal/ld/data.go:787 +0x10f fp=0xc00004f090 sp=0xc00004efc0 pc=0x570d1f
cmd/link/internal/ld.CodeblkPad(0xc000001e00, 0x401000, 0x102b7c, 0xc000b93280, 0x1, 0x1)
        /home/bradfitz/go/src/cmd/link/internal/ld/data.go:701 +0xbb fp=0xc00004f1a0 sp=0xc00004f090 pc=0x5703fb
cmd/link/internal/amd64.asmb(0xc000001e00)
        /home/bradfitz/go/src/cmd/link/internal/amd64/asm.go:669 +0xc6 fp=0xc00004f200 sp=0xc00004f1a0 pc=0x5ec9f6
cmd/link/internal/ld.Main(0x899300, 0x10, 0x20, 0x1, 0x7, 0x10, 0x6c25b5, 0x1b, 0x6be64d, 0x14, ...)
        /home/bradfitz/go/src/cmd/link/internal/ld/main.go:269 +0xd61 fp=0xc00004f358 sp=0xc00004f200 pc=0x5c7451
main.main()
        /home/bradfitz/go/src/cmd/link/main.go:68 +0x1bc fp=0xc00004ff88 sp=0xc00004f358 pc=0x63d2bc
runtime.main()
        /home/bradfitz/go/src/runtime/proc.go:203 +0x212 fp=0xc00004ffe0 sp=0xc00004ff88 pc=0x4355b2
runtime.goexit()
        /home/bradfitz/go/src/runtime/asm_amd64.s:1375 +0x1 fp=0xc00004ffe8 sp=0xc00004ffe0 pc=0x4629c1
FAIL    tailscale.com/logtail/filch [build failed]
FAIL
@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Feb 20, 2020

Yeah, this is writing to mmap'd memory backed by the output file. SIGBUS may occur if write to mmap'd memory fails (maybe cannot flush to disk?).

It would be good if we can fail nicely. But I don't know how. Maybe write (using file IO) to the last byte of the output file before mmap? If that write fails we can exit with a nice error. Does it work? cc @aclements

@toothrot toothrot added this to the Backlog milestone Feb 20, 2020
@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 21, 2020

I haven't looked at the code, but if we're going to mmap the output file, we should fallocate it first. Otherwise, it's possible for the program to exit after writing all blocks to the memory mapped area, calling munmap and close, but before disk blocks have been assigned. In that case the generated executable may be incomplete, which is bad. Or, of course, we can get a SIGBUS during execution which is less bad but still bad. The fix is to either fallocate or fdatasync, and fallocate is faster.

@gopherbot
Copy link

@gopherbot gopherbot commented Apr 16, 2020

Change https://golang.org/cl/228385 mentions this issue: [dev.link] cmd/link: fallocate space, and remove all msync calls

@nightlyone
Copy link
Contributor

@nightlyone nightlyone commented Apr 17, 2020

What is the story for platforms that don't support fallocate? Should there be pre-zeroing or something else?

@aclements
Copy link
Member

@aclements aclements commented Apr 17, 2020

We definitely considered pre-zeroing, but it adds 50–100% to the IO cost, and considering we've had one report of this in 10 years, it doesn't seem worth significantly slowing down every link for such a rare case. fallocate is obviously the right thing to do when it's available, but when it isn't, there isn't an obviously right answer. (It would be nice if the runtime gave us a good way to handle SIGBUS, like recognizing that it's in an mmaped region and turning it into a useful panic; that may be worth doing, but it's also a non-trivial project.)

For reference, here are the benchmark numbers for allocating a 100 MiB file on three different machines using different techniques (source by @jeremyfaller and myself):

Linux ext4, SSD, laptop:
name         time/op
Msync         105ms ±12%
Fallocate    52.6ms ± 2%
WriteBlocks  78.4ms ± 2%
NoSync       55.0ms ± 2%

Linux ext4, 7200 RPM HDD, workstation:
name         time/op
Msync         730ms ± 8%
Fallocate    74.1ms ± 1%
WriteBlocks   114ms ± 1%
NoSync       76.0ms ± 1%

macOS laptop:
name         time/op
Msync         141ms ±19%
Fallocate    93.4ms ±13%
WriteBlocks   193ms ± 4%
NoSync       88.9ms ±10%
@aclements
Copy link
Member

@aclements aclements commented Apr 17, 2020

considering we've had one report of this in 10 years

Actually, this is somewhat unfair since the linker didn't use mmap until a couple years ago. I don't believe we've had reports from other uses of mmap, but I don't know how commonly used it is.

It would be nice if the runtime gave us a good way to handle SIGBUS

Also EXCEPTION_IN_PAGE_ERROR on Windows.

@bradfitz
Copy link
Contributor Author

@bradfitz bradfitz commented Apr 17, 2020

If we don't fail nicely when out of disk I fear we'll get lots of bogus issues filed.

Perhaps cmd/go can check disk space if (and only if) cmd/link returns a failed exit status.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Apr 17, 2020

When the linker uses fallocate it should fail nicely when out of disk space.

fallocate is part of POSIX (as posix_fallocate) so it is generally available. On Windows perhaps it will be necessary to write out zeroes, I don't know.

gopherbot pushed a commit that referenced this issue Apr 20, 2020
The fallocate calls will lower the chances of SIGBUS in the linker, but
it might still happen on other unsupported platforms and filesystems.

Darwin cmd/compile stats:

Munmap                    16.0ms ± 8%     0.8ms ± 3%   -95.19%  (p=0.000 n=8+10)
TotalTime                  484ms ± 2%     462ms ± 2%    -4.52%  (p=0.000 n=10+9)

Updates #37310

Change-Id: I41c6e490adec26fa1ebee49a5b268828f5ba05e1
Reviewed-on: https://go-review.googlesource.com/c/go/+/228385
Run-TryBot: Jeremy Faller <jeremy@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
@aclements
Copy link
Member

@aclements aclements commented Apr 20, 2020

At least in glibc, posix_fallocate falls back to an implementation that just writes the blocks. Beyond Linux and macOS, I'm not sure what platforms natively support it, though I'm sure some of them do.

On Windows, as far as I can tell, there's no equivalent of fallocate and you're expected to catch EXCEPTION_IN_PAGE_ERROR: "To guard against exceptions due to input and output (I/O) errors, all attempts to access memory mapped files should be wrapped in structured exception handlers."

xujianhai666 added a commit to xujianhai666/go-1 that referenced this issue May 21, 2020
The fallocate calls will lower the chances of SIGBUS in the linker, but
it might still happen on other unsupported platforms and filesystems.

Darwin cmd/compile stats:

Munmap                    16.0ms ± 8%     0.8ms ± 3%   -95.19%  (p=0.000 n=8+10)
TotalTime                  484ms ± 2%     462ms ± 2%    -4.52%  (p=0.000 n=10+9)

Updates golang#37310

Change-Id: I41c6e490adec26fa1ebee49a5b268828f5ba05e1
Reviewed-on: https://go-review.googlesource.com/c/go/+/228385
Run-TryBot: Jeremy Faller <jeremy@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Jun 10, 2020

@jeremyfaller @aclements is this fixed by CL https://go-review.googlesource.com/c/go/+/228385 , or there is still things to be done for this? I guess we don't close this as currently we only pre-allocate on some platforms, not all?

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Sep 18, 2020

As @rsc pointed at #41466 (comment) , on OSes that we don't preallocate, we could use SetPanicOnFault to make it fail gracefully, if we do #41155 .

@jeremyfaller
Copy link
Contributor

@jeremyfaller jeremyfaller commented Sep 18, 2020

Sorry, missed this discussion.

I believe we are fixed as well as we can be without prezeroing. The OSs that don't support fallocate -- we currently do a best effort, but could still SIGBUS. I like the solution to SetPanicOnFault as it's a better experience than dying with a terrible error message.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Sep 18, 2020

Just to be clear, we didn't use SetPanicOnFault because it doesn't catch SIGBUS, only SIGSEGV. But now there is ongoing discussion on #41155 . If we do that, I think we want to revisit SetPanicOnFault.

@rsc
Copy link
Contributor

@rsc rsc commented Sep 18, 2020

I don't see any dissents on #41155 and it's marked NeedsFix already, so I won't comment there, but replying to the "if we do #41155" above:

I added SetPanicOnFault for exactly this specific use case, during an earlier, aborted rewrite of the linker. If an mmap fault is producing SIGBUS then that signal should be included, no question.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
9 participants
You can’t perform that action at this time.