
Kubelet failing with SIGILL #70357

Open
mborsz opened this Issue Oct 29, 2018 · 21 comments


mborsz commented Oct 29, 2018

https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/242 shows that one of the kubelets crashed multiple times with the same error:

SIGILL: illegal instruction
PC=0x460103 m=13 sigcode=2
goroutine 0 [idle]:
runtime.memclrNoHeapPointers(0xc004420000, 0x2000)
	/usr/local/go/src/runtime/memclr_amd64.s:76 +0x113
runtime.(*mheap).alloc(0x70651e0, 0x1, 0x7f181801000d, 0x7f181d1eb1c8)
	/usr/local/go/src/runtime/mheap.go:764 +0xda
runtime.(*mcentral).grow(0x7066858, 0x0)
	/usr/local/go/src/runtime/mcentral.go:232 +0x94
runtime.(*mcentral).cacheSpan(0x7066858, 0x7f181d30df88)
	/usr/local/go/src/runtime/mcentral.go:106 +0x2f8
runtime.(*mcache).refill(0x7f181d33d000, 0xc001419d0d)
	/usr/local/go/src/runtime/mcache.go:122 +0x95
runtime.(*mcache).nextFree.func1()
	/usr/local/go/src/runtime/malloc.go:749 +0x32
runtime.systemstack(0x0)
	/usr/local/go/src/runtime/asm_amd64.s:351 +0x66
runtime.mstart()
	/usr/local/go/src/runtime/proc.go:1229
(...)

Full kubelet logs are available here: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/242/artifacts/gce-scale-cluster-minion-group-3-d7s6/kubelet.log

@kubernetes/sig-node-bugs
/sig scalability
/sig node

mborsz commented Oct 29, 2018

/cc @wojtek-t

wojtek-t commented Oct 29, 2018

/assign @yujuhong

YuJu - can you please delegate?

yujuhong commented Oct 29, 2018

AFAIK, SIGILL can happen because of a Go compiler issue, or a corrupted binary/memory.

In this case (running on the most common architecture), I'd suspect the latter. If the node is still there, we can check the hash of the kubelet binary to see if it's corrupted, but I think it's already gone.

BTW, from the kube-node-installation log:

-- Logs begin at Thu 2018-10-25 08:12:54 UTC, end at Thu 2018-10-25 22:28:52 UTC. --
Oct 25 08:13:02.063634 gce-scale-cluster-minion-group-3-d7s6 systemd[1]: Starting Download and install k8s binaries and configurations...
Oct 25 08:13:02.160449 gce-scale-cluster-minion-group-3-d7s6 configure.sh[1351]: Start to install kubernetes files
Oct 25 08:13:02.246452 gce-scale-cluster-minion-group-3-d7s6 configure.sh[1351]: Downloading Kubelet config file, if it exists
Oct 25 08:13:02.258525 gce-scale-cluster-minion-group-3-d7s6 configure.sh[1351]: Downloading binary release tar
Oct 25 08:13:02.265975 gce-scale-cluster-minion-group-3-d7s6 configure.sh[1351]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Oct 25 08:13:02.267307 gce-scale-cluster-minion-group-3-d7s6 configure.sh[1351]:                                  Dload  Upload   Total   Spent    Left  Speed
Oct 25 08:13:10.480542 gce-scale-cluster-minion-group-3-d7s6 configure.sh[1351]: [790B blob data]
Oct 25 08:13:11.441188 gce-scale-cluster-minion-group-3-d7s6 configure.sh[1351]: == Downloaded https://storage.googleapis.com/kubernetes-staging-a5dbc9bafa/gce-scale-cluster-devel/kubernetes-server-linux-amd64.tar.gz (SHA1 = 110c978338a7c3363fad0e019201ddc999ec8925) ==

I downloaded https://storage.googleapis.com/kubernetes-staging-a5dbc9bafa/gce-scale-cluster-devel/kubernetes-server-linux-amd64.tar.gz to test, and the sha1 does not match the one logged in the file (110c978338a7c3363fad0e019201ddc999ec8925). This may be a red herring or working-as-intended (do we reuse the GCS buckets?), but I checked an arbitrary other test and was not able to find the same problem.
@mborsz could you check why that's the case for the test job?
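The hash comparison above is easy to script. A minimal sketch (the `check_sha1` helper name is hypothetical; the URL and expected digest are the ones from the install log above):

```shell
# check_sha1: compare a file's SHA-1 digest against an expected value.
# (hypothetical helper; uses coreutils sha1sum)
check_sha1() {
  file="$1"
  expected="$2"
  actual="$(sha1sum "$file" | cut -d' ' -f1)"
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file matches $expected"
  else
    echo "MISMATCH: $file has $actual, expected $expected"
    return 1
  fi
}

# Usage against the staging tarball from the kube-node-installation log:
#   curl -sSLo server.tar.gz https://storage.googleapis.com/kubernetes-staging-a5dbc9bafa/gce-scale-cluster-devel/kubernetes-server-linux-amd64.tar.gz
#   check_sha1 server.tar.gz 110c978338a7c3363fad0e019201ddc999ec8925
```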

mborsz commented Oct 30, 2018

The sha1 from the kube-node-installation logs matches the one from the build log:

I1025 08:02:25.420] Will download kubernetes-server-linux-amd64.tar.gz from https://storage.googleapis.com/kubernetes-release-dev/ci/v1.13.0-alpha.2.38+689df2010da51b
I1025 08:02:25.421] Will download and extract kubernetes-client-linux-amd64.tar.gz from https://storage.googleapis.com/kubernetes-release-dev/ci/v1.13.0-alpha.2.38+689df2010da51b
I1025 08:02:25.421] Will download and extract kubernetes-test.tar.gz from https://storage.googleapis.com/kubernetes-release-dev/ci/v1.13.0-alpha.2.38+689df2010da51b
W1025 08:02:25.521]   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
W1025 08:02:25.522]                                  Dload  Upload   Total   Spent    Left  Speed
W1025 08:02:27.047] 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
 45  333M   45  151M    0     0   195M      0  0:00:01 --:--:--  0:00:01  195M
100  333M  100  333M    0     0   206M      0  0:00:01  0:00:01 --:--:--  206M
I1025 08:02:27.147] 
I1025 08:02:27.839] md5sum(kubernetes-server-linux-amd64.tar.gz)=58523c26c4fe327cf98e3a06c392debe
I1025 08:02:28.760] sha1sum(kubernetes-server-linux-amd64.tar.gz)=110c978338a7c3363fad0e019201ddc999ec8925

I see now that the last modification time for https://storage.googleapis.com/kubernetes-staging-a5dbc9bafa/gce-scale-cluster-devel/kubernetes-server-linux-amd64.tar.gz is 'Tue, 30 Oct 2018 08:03:32 GMT', while the test is from '2018-10-25 10:02 CEST', so we must be reusing buckets for some reason.

In the kubelet logs I see 3 SIGILLs, and each of them is in

runtime.memclrNoHeapPointers(0xc004420000, 0x2000)
	/usr/local/go/src/runtime/memclr_amd64.s:76 +0x113

which, according to https://golang.org/src/runtime/memclr_amd64.s#L76, is a VMOVDQU.

To me, this suggests an architecture issue rather than a corrupted binary (in the latter case I would expect the SIGILLs to happen in other places as well, not only in memclrNoHeapPointers).
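A quick way to check whether a node's CPU actually advertises the AVX2 feature that the VEX-encoded VMOVDQU path relies on is to inspect the cpuinfo flags. A small sketch (`has_avx2` is a hypothetical helper; Linux-only):

```shell
# has_avx2: report whether a cpuinfo "flags" line advertises avx2.
has_avx2() {
  printf '%s\n' "$1" | tr ' \t' '\n\n' | grep -qx 'avx2'
}

# On a live node:
#   has_avx2 "$(grep -m1 '^flags' /proc/cpuinfo)" && echo "AVX2 supported" || echo "no AVX2"
```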

mborsz commented Oct 30, 2018

/assign @yujuhong

yujuhong commented Oct 30, 2018

Looks like Go 1.11 may be the problem: https://github.com/golang/go/wiki/AVX512
It enables AVX-512, which may not be supported on some platforms.

I'm not sure how to verify or disable this when building the binaries.

/cc @ixdy
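For what it's worth, the Go runtime picks these code paths from CPU-feature detection at process startup, and later releases expose a switch to turn individual features off. A hedged sketch (GODEBUG=cpu.<feature>=off is documented for Go 1.12+, so whether this Go 1.11 build honors it is an assumption to verify; the kubelet path is illustrative):

```shell
# Candidate workaround: disable the Go runtime's AVX2 code paths via
# GODEBUG. Documented for Go 1.12+ (cpu.<feature>=off); not a confirmed
# fix for this Go 1.11 build.
# Hypothetical invocation (binary path illustrative):
#   GODEBUG=cpu.avx2=off /home/kubernetes/bin/kubelet ...
# The knob is just an environment setting, e.g. in a systemd unit:
#   Environment=GODEBUG=cpu.avx2=off
GODEBUG=cpu.avx2=off
export GODEBUG
echo "GODEBUG=$GODEBUG"
```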

ixdy commented Oct 30, 2018

If I had to guess, I'd point at golang/go@d071209 which seems to change how AVX is detected.

We should probably open an issue against Go with these findings.

ixdy commented Oct 30, 2018

have we seen the crashes on more than one test run?

ixdy commented Oct 30, 2018

Looking at that same node (gce-scale-cluster-minion-group-3-d7s6), it looks like kube-proxy and docker had similar crashes:

kube-proxy

SIGILL: illegal instruction
PC=0x45cb63 m=5 sigcode=2

goroutine 0 [idle]:
runtime.memclrNoHeapPointers(0xc007dc8000, 0x52000)
        /usr/local/go/src/runtime/memclr_amd64.s:76 +0x113
runtime.(*mheap).alloc(0x2524300, 0x29, 0xc000010100, 0x416355)
        /usr/local/go/src/runtime/mheap.go:764 +0xda
runtime.largeAlloc(0x51f98, 0x450001, 0x7f7535130000)
        /usr/local/go/src/runtime/malloc.go:1019 +0x97
runtime.mallocgc.func1()
        /usr/local/go/src/runtime/malloc.go:914 +0x46
runtime.systemstack(0x0)
        /usr/local/go/src/runtime/asm_amd64.s:351 +0x66
runtime.mstart()
        /usr/local/go/src/runtime/proc.go:1229

goroutine 1 [running]:
runtime.systemstack_switch()
        /usr/local/go/src/runtime/asm_amd64.s:311 fp=0xc009a33488 sp=0xc009a33480 pc=0x459ca0
runtime.mallocgc(0x51f98, 0x1499d60, 0xc009a33501, 0x4cacc0)
        /usr/local/go/src/runtime/malloc.go:913 +0x896 fp=0xc009a33528 sp=0xc009a33488 pc=0x40c0b6
runtime.newarray(0x1499d60, 0x8a1, 0xffe71f494f05418e)
        /usr/local/go/src/runtime/malloc.go:1048 +0x6a fp=0xc009a33558 sp=0xc009a33528 pc=0x40c48a
runtime.makeBucketArray(0x13bf9a0, 0xc009a3350b, 0x0, 0x0, 0xffe71f494f05418e)
        /usr/local/go/src/runtime/map.go:355 +0x184 fp=0xc009a33590 sp=0xc009a33558 pc=0x40d354
runtime.hashGrow(0x13bf9a0, 0xc009a34040)
        /usr/local/go/src/runtime/map.go:963 +0x89 fp=0xc009a335e0 sp=0xc009a33590 pc=0x40ed29
runtime.mapassign_faststr(0x13bf9a0, 0xc009a34040, 0xc00374ab60, 0x19, 0xc00877ffa8)
        /usr/local/go/src/runtime/map_faststr.go:256 +0x206 fp=0xc009a33648 sp=0xc009a335e0 pc=0x412556
k8s.io/kubernetes/pkg/proxy/iptables.(*Proxier).syncProxyRules(0xc000451200)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/proxy/iptables/proxier.go:1103 +0x4747 fp=0xc009a357d8 sp=0xc009a33648 pc=0x11af9e7
k8s.io/kubernetes/pkg/proxy/iptables.(*Proxier).syncProxyRules-fm()
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/proxy/iptables/proxier.go:356 +0x2a fp=0xc009a357f0 sp=0xc009a357d8 pc=0x11b79ea
k8s.io/kubernetes/pkg/util/async.(*BoundedFrequencyRunner).tryRun(0xc00056e620)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/async/bounded_frequency_runner.go:217 +0xae fp=0xc009a358f8 sp=0xc009a357f0 pc=0x11809ee
k8s.io/kubernetes/pkg/util/async.(*BoundedFrequencyRunner).Loop(0xc00056e620, 0xc00007c0c0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/async/bounded_frequency_runner.go:181 +0x1c9 fp=0xc009a359e0 sp=0xc009a358f8 pc=0x1180749
k8s.io/kubernetes/pkg/proxy/iptables.(*Proxier).SyncLoop(0xc000451200)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/proxy/iptables/proxier.go:498 +0x4e fp=0xc009a35a00 sp=0xc009a359e0 pc=0x11aa6de
k8s.io/kubernetes/cmd/kube-proxy/app.(*ProxyServer).Run(0xc0004462c0, 0xc0004462c0, 0x0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-proxy/app/server.go:585 +0x3c7 fp=0xc009a35c18 sp=0xc009a35a00 pc=0x1275837
k8s.io/kubernetes/cmd/kube-proxy/app.(*Options).Run(0xc0000ba000, 0xc0000ba6c0, 0x0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-proxy/app/server.go:255 +0x67 fp=0xc009a35c48 sp=0xc009a35c18 pc=0x1274027
k8s.io/kubernetes/cmd/kube-proxy/app.NewProxyCommand.func1(0xc0001c4780, 0xc0000ba6c0, 0x0, 0xc)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-proxy/app/server.go:377 +0x156 fp=0xc009a35cc0 sp=0xc009a35c48 pc=0x1278c86
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute(0xc0001c4780, 0xc0000381b0, 0xc, 0xc, 0xc0001c4780, 0xc0000381b0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:760 +0x2cc fp=0xc009a35db0 sp=0xc009a35cc0 pc=0x1269b1c
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc0001c4780, 0xc0001bdd10, 0x1637948, 0xc000629f28)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:846 +0x2fd fp=0xc009a35ef0 sp=0xc009a35db0 pc=0x126a6bd
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute(0xc0001c4780, 0x1638e40, 0x251d420)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:794 +0x2b fp=0xc009a35f20 sp=0xc009a35ef0 pc=0x126a39b
main.main()
        _output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-proxy/proxy.go:49 +0xf3 fp=0xc009a35f98 sp=0xc009a35f20 pc=0x127a6b3
runtime.main()
        /usr/local/go/src/runtime/proc.go:201 +0x207 fp=0xc009a35fe0 sp=0xc009a35f98 pc=0x42daf7
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc009a35fe8 sp=0xc009a35fe0 pc=0x45bc01
...
rax    0x0
rbx    0x2f000
rcx    0x52000
rdx    0x7f7532bdc310
rdi    0xc007deb000
rsi    0x29
rbp    0xc00045ff50
rsp    0xc00045ff08
r8     0x7f753237be48
r9     0x29
r10    0x1f0c
r11    0x28
r12    0x1a00
r13    0x600
r14    0xc00013f818
r15    0xc
rip    0x45cb63
rflags 0x10202
cs     0x33
fs     0x0
gs     0x0
Flag --resource-container has been deprecated, This feature will be removed in a later release.

docker:

SIGILL: illegal instruction
PC=0x466593 m=9 sigcode=2
goroutine 0 [idle]:
runtime.memclrNoHeapPointers(0xc4233a2000, 0x2000)
        /usr/lib64/go/x86_64-cros-linux-gnu/src/runtime/memclr_amd64.s:75 +0x113
runtime.(*mheap).alloc(0x22e5620, 0x1, 0x7f69a7010024, 0x7f69a7ffed78)
        /usr/lib64/go/x86_64-cros-linux-gnu/src/runtime/mheap.go:738 +0xf3
runtime.(*mcentral).grow(0x22e7270, 0x0)
        /usr/lib64/go/x86_64-cros-linux-gnu/src/runtime/mcentral.go:232 +0x94
runtime.(*mcentral).cacheSpan(0x22e7270, 0x1947e10)
        /usr/lib64/go/x86_64-cros-linux-gnu/src/runtime/mcentral.go:106 +0x33a
runtime.(*mcache).refill(0x7f69b7c1f000, 0x70b624, 0xc422981b00)
        /usr/lib64/go/x86_64-cros-linux-gnu/src/runtime/mcache.go:123 +0xa4
runtime.(*mcache).nextFree.func1()
        /usr/lib64/go/x86_64-cros-linux-gnu/src/runtime/malloc.go:557 +0x32
runtime.systemstack(0xc42001e000)
        /usr/lib64/go/x86_64-cros-linux-gnu/src/runtime/asm_amd64.s:344 +0x79
runtime.mstart()
        /usr/lib64/go/x86_64-cros-linux-gnu/src/runtime/proc.go:1135
goroutine 61777 [running]:
runtime.systemstack_switch()
        /usr/lib64/go/x86_64-cros-linux-gnu/src/runtime/asm_amd64.s:298 fp=0xc422ee7980 sp=0xc422ee7978 pc=0x463330
runtime.(*mcache).nextFree(0x7f69b7c1f000, 0x24, 0x4, 0xc422ee7a80, 0x658867)
        /usr/lib64/go/x86_64-cros-linux-gnu/src/runtime/malloc.go:556 +0xa9 fp=0xc422ee79d8 sp=0xc422ee7980 pc=0x419479
runtime.mallocgc(0x120, 0x17754c0, 0xc42338d301, 0x51)
        /usr/lib64/go/x86_64-cros-linux-gnu/src/runtime/malloc.go:711 +0x6f0 fp=0xc422ee7a80 sp=0xc422ee79d8 pc=0x419d40
runtime.newarray(0x17754c0, 0x1, 0x58a06a4ba16d4f80)
        /usr/lib64/go/x86_64-cros-linux-gnu/src/runtime/malloc.go:853 +0x60 fp=0xc422ee7ab0 sp=0xc422ee7a80 pc=0x41a160
runtime.mapassign(0x1683240, 0xc423398cf0, 0xc422ee7b80, 0xc4208ca4c8)
        /usr/lib64/go/x86_64-cros-linux-gnu/src/runtime/hashmap.go:545 +0x5c5 fp=0xc422ee7b48 sp=0xc422ee7ab0 pc=0x411f15
github.com/docker/docker/vendor/github.com/gorilla/context.Set(0xc42336f300, 0x15f0e00, 0x1a12840, 0x1689360, 0xc423398cc0)
        /build/lakitu/tmp/portage/app-emulation/docker-17.03.2-r5/work/docker-17.03.2/src/github.com/docker/docker/vendor/github.com/gorilla/context/context.go:26 +0xd9 fp=0xc422ee7bb8 sp=0xc422ee7b48 pc=0x73c709
github.com/docker/docker/vendor/github.com/gorilla/mux.setVars(0xc42336f300, 0x1689360, 0xc423398cc0)
        /build/lakitu/tmp/portage/app-emulation/docker-17.03.2-r5/work/docker-17.03.2/src/github.com/docker/docker/vendor/github.com/gorilla/mux/mux.go:331 +0x66 fp=0xc422ee7bf0 sp=0xc422ee7bb8 pc=0x73ee46
github.com/docker/docker/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc42028ac80, 0x22985e0, 0xc423357b20, 0xc42336f300)
        /build/lakitu/tmp/portage/app-emulation/docker-17.03.2-r5/work/docker-17.03.2/src/github.com/docker/docker/vendor/github.com/gorilla/mux/mux.go:94 +0x2e9 fp=0xc422ee7ce0 sp=0xc422ee7bf0 pc=0x73cf79
github.com/docker/docker/api/server.(*routerSwapper).ServeHTTP(0xc4208ab4b0, 0x22985e0, 0xc423357b20, 0xc42336f300)
        /build/lakitu/tmp/portage/app-emulation/docker-17.03.2-r5/work/docker-17.03.2/src/github.com/docker/docker/api/server/router_swapper.go:29 +0x70 fp=0xc422ee7d18 sp=0xc422ee7ce0 pc=0x822d10
net/http.serverHandler.ServeHTTP(0xc4203dcd00, 0x22985e0, 0xc423357b20, 0xc42336f300)
        /usr/lib64/go/x86_64-cros-linux-gnu/src/net/http/server.go:2619 +0xb4 fp=0xc422ee7d48 sp=0xc422ee7d18 pc=0x714db4
net/http.(*conn).serve(0xc420179d60, 0x229a0e0, 0xc421931e40)
        /usr/lib64/go/x86_64-cros-linux-gnu/src/net/http/server.go:1801 +0x71d fp=0xc422ee7fc8 sp=0xc422ee7d48 pc=0x710fad
runtime.goexit()
        /usr/lib64/go/x86_64-cros-linux-gnu/src/runtime/asm_amd64.s:2337 +0x1 fp=0xc422ee7fd0 sp=0xc422ee7fc8 pc=0x465f61
created by net/http.(*Server).Serve
        /usr/lib64/go/x86_64-cros-linux-gnu/src/net/http/server.go:2720 +0x288
...
rax    0x0
rbx    0x1000
rcx    0x2000
rdx    0x2
rdi    0xc4233a3000
rsi    0x39
rbp    0x7f69a7ffecc8
rsp    0x7f69a7ffec80
r8     0xc000000000
r9     0x19d1
r10    0x22a3a20
r11    0xffffffff
r12    0xc4208ca510
r13    0x1
r14    0xc4208ca510
r15    0xc4208ca518
rip    0x466593
rflags 0x10202
cs     0x33
fs     0x0
gs     0x0

Full logs attached:
kube-proxy.log
kubelet.log
docker.log.gz

@ixdy ixdy closed this Oct 30, 2018

@ixdy ixdy reopened this Oct 30, 2018

ixdy commented Oct 30, 2018

Curiously, I think docker 17.03.2 was built with go1.10.3, not go1.11, though line 75 is still a VMOVDQU:
https://github.com/golang/go/blob/go1.10.3/src/runtime/memclr_amd64.s#L75

Maybe something is wrong with this VM, or the particular node running this VM?

yujuhong commented Oct 30, 2018

Hmm...... @mborsz @wojtek-t has this happened more than once on multiple nodes?

mborsz commented Oct 31, 2018

I was able to find the following cases in the last 7 days (grepping only for kubelet.log):

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/243/artifacts/gce-scale-cluster-minion-group-2-bchx/kubelet.log
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/243/artifacts/gce-scale-cluster-minion-group-1-xx5q/kubelet.log
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/243/artifacts/gce-scale-cluster-minion-group-4-x03q/kubelet.log

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/242/artifacts/gce-scale-cluster-minion-group-3-d7s6/kubelet.log
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/242/artifacts/gce-scale-cluster-minion-group-gdxm/kubelet.log
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/242/artifacts/gce-scale-cluster-minion-group-2-p1qd/kubelet.log
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/242/artifacts/gce-scale-cluster-minion-group-2-p1qd/kubelet.log
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/242/artifacts/gce-scale-cluster-minion-group-1-z78j/kubelet.log

All of them are from 2018-10-25 and 2018-10-26.

For docker.log I was able to find e.g. gce-scale-cluster-minion-group-2-17df and gce-scale-cluster-minion-group-3-3pvv with SIGILL, but the test hasn't completed yet, so I'm not able to provide a link.

yujuhong commented Nov 5, 2018

From https://golang.org/src/runtime/memclr_amd64.s#L76 that @mborsz posted above, the instruction seems to be from AVX2, which is only supported on Haswell and above.

From the GCE documentation (https://cloud.google.com/compute/docs/regions-zones/), the available cpu platforms are:

  • Intel Xeon E5 v3 (Haswell) platform (default)
  • Intel Xeon E5 v4 (Broadwell) platform
  • Intel Xeon (Skylake) platform

All of them should support AVX2.

@wonderfly @adityakali, just to double check, the COS image cos-stable-65-10323-64-0 should support AVX2, right?
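To rule the CPU platform in or out, one could map each node's reported cpuPlatform to AVX2 availability. A rough sketch (the helper name is hypothetical, and the gcloud invocation in the comment is illustrative):

```shell
# supports_avx2_platform: rough mapping from a GCE cpuPlatform string to
# AVX2 availability (AVX2 was introduced with Haswell).
supports_avx2_platform() {
  case "$1" in
    *Haswell*|*Broadwell*|*Skylake*) return 0 ;;
    *) return 1 ;;
  esac
}

# e.g. on a live cluster:
#   platform="$(gcloud compute instances describe "$NODE" --zone "$ZONE" \
#       --format='value(cpuPlatform)')"
#   supports_avx2_platform "$platform" && echo "AVX2 expected" || echo "pre-AVX2 CPU"
```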

yujuhong commented Nov 5, 2018

Curiously I think docker 17.03.2 is using go1.10.3, not go1.11, though line 75 is still a VMOVDQU:

Also, @wonderfly @adityakali, have you ever seen Docker hit SIGILL on COS images?

wonderfly commented Nov 6, 2018

@wonderfly @adityakali, just to double check, the COS image cos-stable-65-10323-64-0 should support AVX2, right?

I believe it's a CPU feature and not an OS thing? And according to Wikipedia AVX512 is not supported until Skylake X?

Also, @wonderfly @adityakali, have you ever seen Docker hit SIGILL on COS images?

Not off the top of my head. If you still have the node, you could check which Go version docker was compiled with: docker version.

Otherwise I'd recommend filing a bug against COS (internally) and we'll investigate.

yujuhong commented Nov 6, 2018

I believe it's a CPU feature and not an OS thing? And according to Wikipedia AVX512 is not supported until Skylake X?

I'm not sure if this instruction is from AVX512 or just AVX2. The information I found seemed to suggest the latter.
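One way to settle AVX2 vs AVX-512 would be to disassemble the crashing binary around the faulting PC (0x460103 in the kubelet trace) and look at the actual opcode. A sketch, assuming binutils objdump is available and the helper name is hypothetical:

```shell
# disas_around: dump the instructions surrounding a faulting PC in an
# ELF binary, to see exactly which instruction raised SIGILL.
disas_around() {
  bin="$1"
  pc="$2"
  objdump -d \
    --start-address="$(printf '0x%x' $(( pc - 32 )))" \
    --stop-address="$(printf '0x%x' $(( pc + 16 )))" \
    "$bin"
}

# e.g. disas_around ./kubelet 0x460103
```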

yujuhong commented Nov 6, 2018

Otherwise I'd recommend filing a bug against COS (internally) and we'll investigate.

Filed one and cc'd everyone who has been on this thread.

Not off the top of my head. If you still have the node, you could check which Go version docker was compiled with: docker version.

@mborsz @wojtek-t is it possible to keep one of the problematic nodes from the test?

wonderfly commented Nov 6, 2018

I'm not sure if this instruction is from AVX512 or just AVX2. The information I found seemed to suggest the latter

Either way, I don't think the OS will stop you from executing any instruction that's supported by the CPU.

Do we have a way to reproduce this reliably?

jberkus commented Nov 6, 2018

Adding testing tag:

/kind flake
