runtime: non CGO application segfaulted once #42977

Open
szuecs opened this issue Dec 3, 2020 · 3 comments

@szuecs szuecs commented Dec 3, 2020

This issue is mostly for the record, because I can't reproduce it and saw it only once. @aclements, maybe it's interesting for you; I don't know.

What version of Go are you using (go version)?

$ go version
go version go1.15.3 linux/amd64

Does this issue reproduce with the latest release?

I don't know. I can't reproduce it, but saw it once in production.

What operating system and processor architecture are you using (go env)?

I can't share go env, because the build setup changes in our CI environment.

# docker --version
Docker version 19.03.13, build 4484c46d9d
# uname -a
Linux my-host 5.4.0-1028-aws #29-Ubuntu SMP Mon Oct 5 15:30:10 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.1 LTS
Release:	20.04
Codename:	focal

AWS c5.9xlarge instance /proc/cpuinfo output (don't know if this matters, but CPU flags seem interesting based on #35777)

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
stepping	: 4
microcode	: 0x2006906
cpu MHz		: 3101.777
cache size	: 33792 KB
physical id	: 0
siblings	: 48
core id		: 0
cpu cores	: 24
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips	: 4999.99
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
stepping	: 4
microcode	: 0x2006906
cpu MHz		: 3101.480
cache size	: 33792 KB
physical id	: 0
siblings	: 48
core id		: 1
cpu cores	: 24
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips	: 4999.99
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
stepping	: 4
microcode	: 0x2006906
cpu MHz		: 3096.565
cache size	: 33792 KB
physical id	: 0
siblings	: 48
core id		: 2
cpu cores	: 24
apicid		: 4
initial apicid	: 4
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips	: 4999.99
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

.....


processor	: 45
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
stepping	: 4
microcode	: 0x2006906
cpu MHz		: 3103.684
cache size	: 33792 KB
physical id	: 0
siblings	: 48
core id		: 21
cpu cores	: 24
apicid		: 43
initial apicid	: 43
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips	: 4999.99
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 46
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
stepping	: 4
microcode	: 0x2006906
cpu MHz		: 3102.689
cache size	: 33792 KB
physical id	: 0
siblings	: 48
core id		: 22
cpu cores	: 24
apicid		: 45
initial apicid	: 45
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips	: 4999.99
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 47
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
stepping	: 4
microcode	: 0x2006906
cpu MHz		: 3096.456
cache size	: 33792 KB
physical id	: 0
siblings	: 48
core id		: 23
cpu cores	: 24
apicid		: 47
initial apicid	: 47
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips	: 4999.99
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

What did you do?

We run skipper, a net/http-based proxy, 24x7 to serve our business. The application runs in Kubernetes with Docker.
Today I saw a pod restart and got a panic with:

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x2 addr=0xc0000a1668 pc=0x440856]

runtime stack:
runtime.throw(0x1186064, 0x2a)
	/usr/local/go/src/runtime/panic.go:1116 +0x72
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:704 +0x4ac
runtime.checkTimers(0xc0000a0000, 0x4c1e45bd3eaf4, 0x4c1e45bd3ea01, 0x0, 0x0)
	/usr/local/go/src/runtime/proc.go:2738 +0x36
runtime.findrunnable(0xc000066800, 0x0)
	/usr/local/go/src/runtime/proc.go:2275 +0x1aa
runtime.schedule()
	/usr/local/go/src/runtime/proc.go:2669 +0x2d7
runtime.park_m(0xc000000d80)
	/usr/local/go/src/runtime/proc.go:2837 +0x9d
runtime.mcall(0x0)
	/usr/local/go/src/runtime/asm_amd64.s:318 +0x5b

....8<...

The rest of the panic output is more than 80 kB long and doesn't tell me why this happened. It shows only things you would expect for an HTTP proxy application.

What did you expect to see?

no panic

What did you see instead?

a panic

additional information

The C code shown at https://github.com/golang/go/wiki/LinuxKernelSignalVectorBug was tested on the same machine as the restarted process:

# gcc  -pthread test.c 
# ls
a.out  test.c
# ./a.out 
# echo $?
0

Shutdown of the application was not triggered (I don't see any log output from our signal handler), so this seems unrelated to #40085.

I read #35777 but I don't know if this is related. If so it seems to happen very rarely.

@ianlancetaylor ianlancetaylor changed the title from "non CGO application segfaulted once" to "runtime: non CGO application segfaulted once" Dec 3, 2020
@ianlancetaylor ianlancetaylor added this to the Unplanned milestone Dec 3, 2020

@aclements aclements commented Dec 3, 2020

The crash is happening on line 2738 here:

go/src/runtime/proc.go

Lines 2734 to 2742 in 1984ee0

func checkTimers(pp *p, now int64) (rnow, pollUntil int64, ran bool) {
	// If there are no timers to adjust, and the first timer on
	// the heap is not yet ready to run, then there is nothing to do.
	if atomic.Load(&pp.adjustTimers) == 0 {
		next := int64(atomic.Load64(&pp.timer0When))
		if next == 0 {
			return now, 0, false
		}
		if now == 0 {

From the traceback, we know pp is 0xc0000a0000. The previous line read pp.adjustTimers, which should be at 0xc0000a278c, then this line segfaulted when reading pp.timer0When at 0xc0000a1668 (confirmed against struct layout and assembly).

What's weird is that there is no reason why 0xc0000a278c would be mapped and 0xc0000a1668 wouldn't. In the heap, the runtime manipulates the address space in 64MiB-aligned 64MiB chunks. So either something went horribly wrong and Go managed to unmap a hole in its own heap, or something went wrong with the kernel or the hardware. Given how rarely Go unmaps anything, and the fact that this only happened once, I think the most likely explanation may actually be a hardware issue.
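The address arithmetic above can be checked with a small sketch. It is only an illustration, not runtime code: the `locate` helper and the `location` type are hypothetical names, and the 4 KiB page size is an assumption about this x86-64 Linux host. The three addresses and the 64 MiB chunk size are taken from the traceback and the explanation above.

```go
package main

import "fmt"

const (
	pageSize  uint64 = 4 << 10  // assumed 4 KiB pages (typical on x86-64 Linux)
	arenaSize uint64 = 64 << 20 // 64 MiB-aligned heap chunks, as described above
)

// location is a hypothetical helper type for this sketch.
type location struct {
	Offset uint64 // byte offset of the field from the start of the p struct
	Page   uint64 // start address of the containing page
	Arena  uint64 // start address of the containing 64 MiB-aligned chunk
}

// locate rounds addr down to its page and chunk boundaries and
// computes its offset from base.
func locate(base, addr uint64) location {
	return location{
		Offset: addr - base,
		Page:   addr &^ (pageSize - 1),
		Arena:  addr &^ (arenaSize - 1),
	}
}

func main() {
	const pp uint64 = 0xc0000a0000 // first argument to checkTimers in the traceback

	fields := []struct {
		name string
		addr uint64
	}{
		{"pp.adjustTimers", 0xc0000a278c}, // read succeeded
		{"pp.timer0When", 0xc0000a1668},   // read faulted
	}
	for _, f := range fields {
		l := locate(pp, f.addr)
		fmt.Printf("%-16s offset=%#x page=%#x arena=%#x\n",
			f.name, l.Offset, l.Page, l.Arena)
	}
}
```

Under those assumptions, both fields land in the same 64 MiB chunk starting at 0xc000000000 but on different 4 KiB pages (0xc0000a2000 and 0xc0000a1000). That matches the puzzle described above: the Go heap would not normally treat these two addresses differently, while the kernel and hardware handle mappings one page at a time.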

If you see it again, we'd love to know. But otherwise I'm not sure there's enough information to debug any further.


@szuecs szuecs commented Dec 4, 2020

Thanks for looking into the issue. I also think the issue is not very valuable as it is, and to be honest I don't know how I could add more information.
Should we leave it open so that anyone else who sees a similar issue can link to this one?
If you want to close it, that's also fine with me.


@aclements aclements commented Dec 7, 2020

Thanks. It's not doing much harm in the Unplanned milestone and it can stay there in case this issue is encountered again. If not, GopherBot will close it in a few months since it has the WaitingForInfo label.
