kube-apiserver 1.10.[0-5] & 1.11.0 uses up all available cpu on arm64 #64649

Closed
joejulian opened this issue Jun 2, 2018 · 15 comments · Fixed by #66264
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@joejulian
Contributor

joejulian commented Jun 2, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

I upgraded kube-apiserver on a Raspberry Pi 3 cluster, and after it begins adding the HTTP listeners, CPU usage rises on all processors to consume every available cycle, and all attempts to connect to the API via HTTPS (6443) or HTTP (8080) time out.

Since I host etcd on the controller nodes, it also becomes unresponsive.

If I set GOMAXPROCS=1, it limits the CPU usage to one core and prevents etcd from timing out, but all attempts to connect to kube-apiserver still time out. Eventually, the initial IP allocation check times out, causing a fatal error.
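(For reference, GOMAXPROCS bounds the number of OS threads that may execute Go code simultaneously; a minimal sketch of the in-process equivalent of that environment variable:)

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Equivalent to starting the process with GOMAXPROCS=1 in the
	// environment: allow at most one CPU to run Go code at a time.
	// GOMAXPROCS returns the previous setting.
	prev := runtime.GOMAXPROCS(1)
	fmt.Printf("GOMAXPROCS lowered from %d to 1\n", prev)
}
```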

What you expected to happen:

It should idle at about 4% CPU.

How to reproduce it (as minimally and precisely as possible):

On a Raspberry Pi 3 running Debian jessie (arm64), I run:

kube-apiserver \
  --etcd-servers=https://kubecon1:2379,https://kubecon2:2379,https://kubecon3:2379 \
  --bind-address=0.0.0.0 \
  --etcd-cafile=/etc/kubernetes/pki/ca.pem \
  --etcd-certfile=/etc/kubernetes/pki/kubecon1.etcd-client.pem \
  --etcd-keyfile=/etc/kubernetes/pki/kubecon1.etcd-client-key.pem \
  --allow-privileged=true \
  --service-cluster-ip-range=10.96.0.0/12 \
  --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,PersistentVolumeLabel,DefaultStorageClass,ResourceQuota,DefaultTolerationSeconds \
  --advertise-address=192.168.2.22 \
  --tls-cert-file=/etc/kubernetes/pki/kubecon1.kube-apiserver.pem \
  --tls-private-key-file=/etc/kubernetes/pki/kubecon1.kube-apiserver-key.pem \
  --service-account-key-file=/etc/kubernetes/pki/serviceaccount-key.pem \
  --client-ca-file=/etc/kubernetes/pki/ca.pem \
  --apiserver-count=3 \
  --audit-log-maxage=30 \
  --audit-log-maxbackup=3 \
  --audit-log-maxsize=100 \
  --authorization-mode=Node,RBAC \
  --enable-swagger-ui=true \
  --event-ttl=1h \
  --insecure-bind-address=127.0.0.1 \
  --runtime-config=rbac.authorization.k8s.io/v1alpha1 \
  --v=2

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.10.3 arm64
  • Cloud provider or hardware configuration: Raspberry Pi 3
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 8 (jessie)"
NAME="Debian GNU/Linux"
VERSION_ID="8"
VERSION="8 (jessie)"
ID=debian
HOME_URL="http://www.debian.org/"
SUPPORT_URL="http://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a): Linux kubecon1 4.9.13-bee42-v8 #1 SMP PREEMPT Fri Mar 3 16:42:37 UTC 2017 aarch64 GNU/Linux
  • Install tools: n/a
  • Others:
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Jun 2, 2018
@joejulian
Contributor Author

I've tried profiling this, but since it doesn't respond on the HTTP interface, I can't get a profile that way. I've tried hacking in the ability to write profile data to a file, but the os.Exit in the fatal error log path causes the defers to be skipped, so the profile file remains empty, or holds corrupted data if I do get it to write.
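(A minimal sketch of one way around the skipped defers — the file path and signal choice are hypothetical, not the actual hack: stop and flush the CPU profile from a signal handler, so an os.Exit in a fatal log path can't discard it.)

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

func main() {
	// Hypothetical output path for the CPU profile.
	f, err := os.Create("/tmp/apiserver.pprof")
	if err != nil {
		panic(err)
	}
	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}

	// Flush the profile on SIGUSR1 instead of relying on a defer,
	// which os.Exit (called by fatal log paths) would skip.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGUSR1)
	go func() {
		<-sig
		pprof.StopCPUProfile() // writes and flushes the profile data
		f.Close()
		os.Exit(0)
	}()

	// ... rest of the program ...
	select {}
}
```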

@joejulian
Contributor Author

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 2, 2018
@MasayaAoyama
Contributor

@joe2far
Hello, I ran into a nearly identical problem.

Do you use etcd v2?
If you use etcd v2, you need the --storage-backend=etcd2 option,
but the api-server logs don't tell you that.

@joejulian
Contributor Author

Nope, etcd 3.2.13 per #57480

@jennybuckley

/cc @fedebongio

@joejulian joejulian changed the title kube-apiserver 1.10.[0-3] uses up all available cpu on arm64 kube-apiserver 1.10.[0-4] uses up all available cpu on arm64 Jun 20, 2018
@joejulian joejulian changed the title kube-apiserver 1.10.[0-4] uses up all available cpu on arm64 kube-apiserver 1.10.[0-4] & 1.11.0-beta.2 uses up all available cpu on arm64 Jun 20, 2018
@joejulian
Contributor Author

Tested 1.10.5 and 1.11.0... still have the same problem.

@joejulian joejulian changed the title kube-apiserver 1.10.[0-4] & 1.11.0-beta.2 uses up all available cpu on arm64 kube-apiserver 1.10.[0-5] & 1.11.0 uses up all available cpu on arm64 Jun 28, 2018
@joejulian
Contributor Author

Actually, 1.11 is not causing the problem; my test method was flawed. 1.10 is still broken, but I can upgrade.

/close

@joejulian joejulian changed the title kube-apiserver 1.10.[0-5] & 1.11.0 uses up all available cpu on arm64 kube-apiserver 1.10.[0-5] uses up all available cpu on arm64 Jun 29, 2018
@sebt3

sebt3 commented Jul 2, 2018

I'm still having the issue with 1.11; I'd like to know what your fix was.

@joejulian
Contributor Author

I don't know. It worked fine for a while, then I needed to restart that node for unrelated reasons and it failed again, so I'm still having the problem too.

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Jul 2, 2018
@joejulian joejulian changed the title kube-apiserver 1.10.[0-5] uses up all available cpu on arm64 kube-apiserver 1.10.[0-5] & 1.11.0 uses up all available cpu on arm64 Jul 2, 2018
@joejulian
Contributor Author

I've built a binary with go 1.10 and it still has the same problem, fwiw.

@joejulian
Contributor Author

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.10.md#scalability says

> Upgrade to etcd client 3.2.13 and grpc 1.7.5 to improve HA etcd cluster stability. (#57480, @jpbetz)

but the etcd server being tested was reverted back to 3.1.11 (#60891) for e2e testing due to performance problems.

I'll try that next.

@sebt3

sebt3 commented Jul 2, 2018

Running etcd 3.2.18:

ETCDCTL_API=3 etcdctl version
etcdctl version: 3.2.18
API version: 3.2

Still have the issue

@joejulian
Contributor Author

joejulian commented Jul 6, 2018

OK, I bisected this back to #57480. @jpbetz PTAL

@joejulian
Contributor Author

Reverted #57480 and commit 31ff8c6 on v1.11.0 and now have a working 1.11 image on arm64: https://github.com/joejulian/kubernetes/tree/v1.11.0_undo_pr57480

@joejulian
Contributor Author

Built with go1.11beta1, which contains several changes to math/big specifically addressing performance issues on arm64. Since these functions are used by the crypto code, they have a direct effect on establishing TLS connections. That build allowed the TLS connections to be established within the timeout (10s), whereas before they were not.

I'll try extending the timeouts and building with go1.10 again and see if the problem can be worked around without using the beta compiler.

@sebt3 Part of this may be the fact that I was using RSA certificates instead of ECDSA. If you are, too, you may be able to work around this problem by regenerating your certificates to use ECDSA.
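(To illustrate why the certificate algorithm matters, here is a quick micro-benchmark sketch — my illustration, not from the issue — timing one RSA-2048 private-key signature against one ECDSA P-256 signature, which is the per-handshake server-side cost. Both go through math/big, but the RSA operation is far heavier, which is exactly what the pre-go1.11 arm64 math/big slowness amplifies.)

```go
package main

import (
	"crypto"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"fmt"
	"time"
)

func main() {
	digest := sha256.Sum256([]byte("tls handshake"))

	// One RSA-2048 private-key operation: what the server performs per
	// handshake with an RSA certificate. Heavy math/big arithmetic.
	rsaKey, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		panic(err)
	}
	start := time.Now()
	if _, err := rsa.SignPKCS1v15(rand.Reader, rsaKey, crypto.SHA256, digest[:]); err != nil {
		panic(err)
	}
	fmt.Println("RSA-2048 sign:  ", time.Since(start))

	// One ECDSA P-256 signature for comparison: much cheaper per handshake.
	ecKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	start = time.Now()
	if _, _, err := ecdsa.Sign(rand.Reader, ecKey, digest[:]); err != nil {
		panic(err)
	}
	fmt.Println("ECDSA-P256 sign:", time.Since(start))
}
```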

joejulian added a commit to joejulian/kubernetes that referenced this issue Jul 19, 2018
The math/big functions are slow on arm64. There is improvement coming
with go1.11, but in the meantime, if a server uses RSA certificates on
arm64, the math load for the multitude of watches over-taxes the
processor and the TLS connections time out. Retries will also not
succeed and serve to exacerbate the problem.

By extending the timeout, the TLS connections will eventually be
successful and the load will drop.

Fixes kubernetes#64649
joejulian added a commit to joejulian/kubernetes that referenced this issue Jul 20, 2018
joejulian added a commit to joejulian/kubernetes that referenced this issue Jul 20, 2018
k8s-github-robot pushed a commit that referenced this issue Jul 20, 2018
Automatic merge from submit-queue (batch tested with PRs 66341, 66405, 66403, 66264, 66447). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

extend timeout to workaround slow arm64 math

**What this PR does / why we need it**:

The math/big functions are slow on arm64. There is improvement coming
with go1.11 but until such time as that version can be used to build releases, 
if a server uses rsa certificates on arm64, the math load for the multitude
of watches over-taxes the ability of the processor and the TLS connections
time out. Retries will also not succeed and serve to exacerbate the problem.

By extending the timeout, the TLS connections will eventually be
successful and the load will drop.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #64649

**Special notes for your reviewer**:
This was tested on a Raspberry Pi 3

**Release note**:
```release-note
Extend TLS timeouts to work around slow arm64 math/big
```
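(For context, the general shape of such a workaround looks roughly like the following sketch — the timeout values, handler, and cert paths are illustrative, not the literal #66264 diff. On an http.Server, ReadTimeout and WriteTimeout also bound how long the TLS handshake may take, so generous values let a slow arm64 RSA handshake finish rather than being cut off and retried, which only adds load.)

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// Illustrative values, not the PR's: generous read/write timeouts
	// give a slow RSA handshake on arm64 time to complete.
	srv := &http.Server{
		Addr:         ":6443",
		Handler:      http.DefaultServeMux,
		ReadTimeout:  4 * time.Minute,
		WriteTimeout: 4 * time.Minute,
	}
	// Hypothetical certificate paths.
	log.Fatal(srv.ListenAndServeTLS("tls.crt", "tls.key"))
}
```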
tanshanshan pushed a commit to tanshanshan/kubernetes that referenced this issue Aug 10, 2018