runtime: intermittent slice bounds out of range panic on linux/mipsle #58212
Comments
(attn @golang/mips; CC @golang/runtime) If the machine is memory constrained, it seems likely that the runtime is missing an error check somewhere in a memory-allocation path.
CC @golang/runtime. That's odd. That might be some kind of memory corruption, since I don't see how this slicing operation could fail like that. (The other kind of failure is interesting, too.) This is a relatively simple program, so it's not great that it's falling over so easily. Here are a few things to try:
@bcmills To me, this doesn't look like a symptom of running out of memory, and in general we do a decent job of surfacing out-of-memory errors that we can catch (after some improvements in that area some years ago). What we can't catch is something like the Linux OOM killer kicking in when a page is demanded, but I don't think we'd crash like this. Perhaps I'm missing your point, though.
Here the slice index is 0x800138, which looks like an address. (The cap 0x200 looks correct; it matches the write barrier buffer size.) This looks like memory corruption. Agreed that this doesn't look like running out of memory.
Would it be possible to try newer kernels? Thanks.
Yes, there are newer firmware versions available, which I suspect upgrade the kernel. However, we only have one of these devices and a firmware update would be a one-way operation, so I'd lose the ability to further investigate or test potential fixes against the current kernel version.
I agree that this doesn't look like OOM either; 69 MiB of free memory is enough for your case. I can't reproduce the issue at all (no panic or OOM) on a Loongson 3A4000 with a 16 MiB memory limit. The most probable cause is memory corruption, either a software/kernel problem or a hardware one. Since upgrading the kernel isn't possible, and although broken RAM is not very common, could you run some memory tests (e.g. https://pyropus.ca./software/memtester/)? See the example invocation below.
I suspect the manufacturer ships firmware with their own downstream kernel, which tends to be less robust. In that case, knowing its version alone is sometimes not enough. Could you tell us what the device is (MT7621?) and whether the manufacturer offers the source code of their downstream kernel?
I don't think the manufacturer applies a strict downgrade restriction (e.g. an e-fuse). If it seems worth a try to you, dump the ROM with a flasher; that dump can then be used to flash the device back at any time.
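(For reference, a typical memtester invocation on a device like this might look like the following; the test size and loop count are assumptions, chosen to fit within the ~69 MiB of free memory mentioned above.)

$ sudo memtester 32M 3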
Trying the GODEBUG options suggested by @mknyszek:
No change in behavior -- 48% failure rate across 25 test runs; same panics as before.
No change in behavior -- 32% failure rate across 25 test runs; same panics as before.
Aha! No panics across 25 test runs. Whatever this option is doing, it seems to avoid the conditions that lead to the panic.
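For context, GODEBUG settings are passed as an environment variable when the program is started. Assuming one of the suggested options was asyncpreemptoff=1, which disables signal-based asynchronous preemption and would fit the signal-delivery discussion below, a run would look like this (the option name here is an assumption, not confirmed by the comments above):

$ GODEBUG=asyncpreemptoff=1 ./mipstest < samplefile.txt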
@racingmars That suggests to me a possible kernel bug. We've seen kernel bugs in the signal delivery implementations of both Linux and Illumos (an Illumos derivative? and maybe some BSD variant, but I forget), usually something very subtle, like a page fault in the signal handler causing a register's state not to be properly restored. It's possible it's still an issue on the Go side, as we have of course discovered bugs in this path since it was released in Go 1.14. But a firmware/kernel upgrade might be sufficient to just solve your problem.
Thanks @mknyszek (and cc @Rongronggg9 regarding the firmware/kernel update). I've ordered a second device, so I'll be able to upgrade the firmware on that one without risking losing the reproducible test case on this 3.15.0 kernel, in case that's useful in the future. It'll probably be about a week before it arrives, but I'll follow up with a report on whether or not upgrading makes the problem go away. Thanks!
Sorry for the delay in the follow-up, but I was able to get more hardware and investigate this further. On another device with the same kernel version (3.15.0), the problem was reproducible, so it wasn't due to bad hardware on the first device. After upgrading to the next-oldest firmware version available to me, which uses kernel version 4.4.27, the problem is no longer reproducible: the test Go program and my real application run 100% reliably. So it does appear that something related to how Go depends on signal delivery was fixed in the MIPS kernel along the way. If folks come across this issue by searching for the error message in the future, the summary is: these crashes occurred under Linux kernel 3.15.0 and went away after upgrading the device firmware to kernel 4.4.27.
What version of Go are you using (go version)?
go version go1.19.5 linux/amd64

Does this issue reproduce with the latest release?
Yes

What operating system and processor architecture are you using (go env)?
I'm building from linux/amd64 and cross-compiling to GOOS=linux GOARCH=mipsle.
The target device is running Linux kernel version 3.15.0.
The target device is relatively memory-constrained:
However, if I intentionally exhaust the memory from my Go code, I get an "out of memory" panic, as expected, not the panic I am reporting in this bug.
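A minimal sketch of such a deliberate exhaustion test (this is an illustration, not the actual test code; the 1 MiB block size is an assumption):

package main

// Allocate 1 MiB blocks in a loop and keep them all reachable so the
// garbage collector cannot reclaim anything. On a memory-constrained
// device this ends with the runtime's "out of memory" fatal error
// (or with the kernel OOM killer terminating the process first).
func main() {
	var hold [][]byte
	for {
		hold = append(hold, make([]byte, 1<<20))
	}
}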
On the target device, /proc/cpuinfo lists the CPU as (repeated four times for processors/cores 0-3):
What did you do?
Using Go 1.19.5 on Linux/amd64, I built the following program using:
$ GOOS=linux GOARCH=mipsle CGO_ENABLED=0 go build -o mipstest -ldflags="-s -w"
I built an ~800 KB, 28,000-line sample input file for the program with the following commands:
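(A hypothetical way to generate an input file of roughly that shape; the line format here is an illustration, not the actual commands used:)

$ awk 'BEGIN { srand(); for (i = 0; i < 28000; i++) printf "%d %d %d %d\n", i, int(rand()*1e6), int(rand()*1e6), int(rand()*1e6) }' > samplefile.txt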
I then transferred the binary and samplefile.txt to the MIPS system and ran it with:
$ ./mipstest < samplefile.txt
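To measure a failure rate over many runs, the invocation can be scripted; a simple loop like this is one way to do it (a sketch; the exact methodology is an assumption):

$ for i in $(seq 1 100); do ./mipstest < samplefile.txt > /dev/null || echo "run $i failed"; done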
What did you expect to see?
I expected the program to run to completion. Or, if this ultimately is related to a low-memory condition, I'd expect something other than a slice bounds out of range panic.
What did you see instead?
Out of 100 runs, 41 resulted in the program terminating with a panic:
I have attached a couple of complete, representative examples of the full stack traces. In one case, I got a very different-looking panic; see panic.different.txt, which has "fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)".
A successful run of the program with samplefile.txt takes a couple of minutes on this platform. Sometimes the panics occur very quickly after execution begins, but other times they occur farther into the run.
Attachments
main.go.txt
samplefile.txt
panic.1.txt
panic.2.txt
panic.different.txt