kube-apiserver 1.10.[0-5] & 1.11.0 uses up all available cpu on arm64 #64649

Closed
joejulian opened this issue Jun 2, 2018 · 15 comments · Fixed by #66264
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@joejulian
Contributor

joejulian commented Jun 2, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

I upgraded kube-apiserver on a Raspberry Pi 3 cluster, and after it begins adding the HTTP listeners, CPU usage rises on all processors to consume every available cycle, and all attempts to connect to the API via HTTPS (6443) or HTTP (8080) time out.

Since I host etcd on the controller nodes, it also becomes unresponsive.

If I set GOMAXPROCS=1, it limits the CPU usage to one core and prevents etcd from timing out, but all attempts to connect to kube-apiserver still time out. Eventually, the initial IP allocation check times out, causing a fatal error.
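(For reference, GOMAXPROCS bounds the number of OS threads that may execute Go code simultaneously; a minimal sketch of the in-process equivalent of that environment variable:)

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Equivalent to starting the process with GOMAXPROCS=1 in the
	// environment: allow at most one CPU to run Go code at a time.
	// GOMAXPROCS returns the previous setting.
	prev := runtime.GOMAXPROCS(1)
	fmt.Printf("GOMAXPROCS lowered from %d to 1\n", prev)
}
```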

What you expected to happen:

It should idle at about 4% CPU.

How to reproduce it (as minimally and precisely as possible):

On a Raspberry Pi 3 running Debian jessie (arm64), I run:

kube-apiserver \
  --etcd-servers=https://kubecon1:2379,https://kubecon2:2379,https://kubecon3:2379 \
  --bind-address=0.0.0.0 \
  --etcd-cafile=/etc/kubernetes/pki/ca.pem \
  --etcd-certfile=/etc/kubernetes/pki/kubecon1.etcd-client.pem \
  --etcd-keyfile=/etc/kubernetes/pki/kubecon1.etcd-client-key.pem \
  --allow-privileged=true \
  --service-cluster-ip-range=10.96.0.0/12 \
  --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,PersistentVolumeLabel,DefaultStorageClass,ResourceQuota,DefaultTolerationSeconds \
  --advertise-address=192.168.2.22 \
  --tls-cert-file=/etc/kubernetes/pki/kubecon1.kube-apiserver.pem \
  --tls-private-key-file=/etc/kubernetes/pki/kubecon1.kube-apiserver-key.pem \
  --service-account-key-file=/etc/kubernetes/pki/serviceaccount-key.pem \
  --client-ca-file=/etc/kubernetes/pki/ca.pem \
  --apiserver-count=3 \
  --audit-log-maxage=30 \
  --audit-log-maxbackup=3 \
  --audit-log-maxsize=100 \
  --authorization-mode=Node,RBAC \
  --enable-swagger-ui=true \
  --event-ttl=1h \
  --insecure-bind-address=127.0.0.1 \
  --runtime-config=rbac.authorization.k8s.io/v1alpha1 \
  --v=2

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.10.3 arm64
  • Cloud provider or hardware configuration: Raspberry Pi 3
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 8 (jessie)"
NAME="Debian GNU/Linux"
VERSION_ID="8"
VERSION="8 (jessie)"
ID=debian
HOME_URL="http://www.debian.org/"
SUPPORT_URL="http://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a): Linux kubecon1 4.9.13-bee42-v8 #1 SMP PREEMPT Fri Mar 3 16:42:37 UTC 2017 aarch64 GNU/Linux
  • Install tools: n/a
  • Others:
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Jun 2, 2018
@joejulian
Contributor Author

I've tried profiling this, but since it doesn't respond on the HTTP interface, I can't get a profile that way. I've tried hacking in the ability to write profile data to a file, but the os.Exit in the fatal error log path causes the defers to be skipped, so the profile file remains empty, or holds corrupted data if I do get it to write.
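(A minimal sketch of one way around the skipped defers — the file path and signal choice are hypothetical, not the actual hack: stop and flush the CPU profile from a signal handler, so an os.Exit in a fatal log path can't discard it.)

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

func main() {
	// Hypothetical output path for the CPU profile.
	f, err := os.Create("/tmp/apiserver.pprof")
	if err != nil {
		panic(err)
	}
	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}

	// Flush the profile on SIGUSR1 instead of relying on a defer,
	// which os.Exit (called by fatal log paths) would skip.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGUSR1)
	go func() {
		<-sig
		pprof.StopCPUProfile() // writes and flushes the profile data
		f.Close()
		os.Exit(0)
	}()

	// ... rest of the program ...
	select {}
}
```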

@joejulian
Contributor Author

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 2, 2018
@MasayaAoyama
Contributor

@joe2far
Hello, I ran into a nearly identical problem.

Do you use etcd v2?
If you use etcd v2, you need the --storage-backend=etcd2 option,
but the api-server logs don't tell you that.

@joejulian
Contributor Author

Nope, etcd 3.2.13 per #57480

@jennybuckley

/cc @fedebongio

@joejulian joejulian changed the title kube-apiserver 1.10.[0-3] uses up all available cpu on arm64 kube-apiserver 1.10.[0-4] uses up all available cpu on arm64 Jun 20, 2018
@joejulian joejulian changed the title kube-apiserver 1.10.[0-4] uses up all available cpu on arm64 kube-apiserver 1.10.[0-4] & 1.11.0-beta.2 uses up all available cpu on arm64 Jun 20, 2018
@joejulian
Contributor Author

Tested 1.10.5 and 1.11.0... still have the same problem.

@joejulian joejulian changed the title kube-apiserver 1.10.[0-4] & 1.11.0-beta.2 uses up all available cpu on arm64 kube-apiserver 1.10.[0-5] & 1.11.0 uses up all available cpu on arm64 Jun 28, 2018
@joejulian
Contributor Author

Actually, 1.11 is not causing the problem; my test method was flawed. 1.10 is still broken, but I can upgrade.

/close

@joejulian joejulian changed the title kube-apiserver 1.10.[0-5] & 1.11.0 uses up all available cpu on arm64 kube-apiserver 1.10.[0-5] uses up all available cpu on arm64 Jun 29, 2018
@sebt3

sebt3 commented Jul 2, 2018

I'm still having the issue with 1.11; I'd like to know what your fix was.

@joejulian
Contributor Author

I don't know. It worked fine for a while, then I needed to restart that node for unrelated reasons and it failed again, so I'm still having the problem too.

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Jul 2, 2018
@joejulian joejulian changed the title kube-apiserver 1.10.[0-5] uses up all available cpu on arm64 kube-apiserver 1.10.[0-5] & 1.11.0 uses up all available cpu on arm64 Jul 2, 2018
@joejulian
Contributor Author

I've built a binary with go 1.10 and it still has the same problem, fwiw.

@joejulian
Contributor Author

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.10.md#scalability says

> Upgrade to etcd client 3.2.13 and grpc 1.7.5 to improve HA etcd cluster stability. (#57480, @jpbetz)

but the etcd server being tested was reverted back to 3.1.11 (#60891) for e2e testing due to performance problems.

I'll try that next.

@sebt3

sebt3 commented Jul 2, 2018

Running etcd 3.2.18:

ETCDCTL_API=3 etcdctl version
etcdctl version: 3.2.18
API version: 3.2

Still have the issue

@joejulian
Contributor Author

joejulian commented Jul 6, 2018

OK, I bisected this back to #57480. @jpbetz PTAL

@joejulian
Contributor Author

Reverted #57480 and commit 31ff8c6 on v1.11.0 and now have a working 1.11 image on arm64: https://github.com/joejulian/kubernetes/tree/v1.11.0_undo_pr57480

@joejulian
Contributor Author

Built with go1.11beta1, which contains several changes to math/big specifically addressing performance issues on arm64. Since these functions are used by the crypto code, they have a direct effect on establishing TLS connections. That build allowed the TLS connections to be established within the timeout (10s), whereas before they were not.

I'll try extending the timeouts and building with go1.10 again and see if the problem can be worked around without using the beta compiler.

@sebt3 Part of this may be the fact that I was using RSA certificates instead of ECDSA. If you are, too, you may be able to work around this problem by regenerating your certificates to use ECDSA.
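(To illustrate why the certificate algorithm matters, here is a quick micro-benchmark sketch — my illustration, not from the issue — timing one RSA-2048 private-key signature against one ECDSA P-256 signature, which is the per-handshake server-side cost. Both go through math/big, but the RSA operation is far heavier, which is exactly what the pre-go1.11 arm64 math/big slowness amplifies.)

```go
package main

import (
	"crypto"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"fmt"
	"time"
)

func main() {
	digest := sha256.Sum256([]byte("tls handshake"))

	// One RSA-2048 private-key operation: what the server performs per
	// handshake with an RSA certificate. Heavy math/big arithmetic.
	rsaKey, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		panic(err)
	}
	start := time.Now()
	if _, err := rsa.SignPKCS1v15(rand.Reader, rsaKey, crypto.SHA256, digest[:]); err != nil {
		panic(err)
	}
	fmt.Println("RSA-2048 sign:  ", time.Since(start))

	// One ECDSA P-256 signature for comparison: much cheaper per handshake.
	ecKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	start = time.Now()
	if _, _, err := ecdsa.Sign(rand.Reader, ecKey, digest[:]); err != nil {
		panic(err)
	}
	fmt.Println("ECDSA-P256 sign:", time.Since(start))
}
```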

joejulian added a commit to joejulian/kubernetes that referenced this issue Jul 19, 2018
The math/big functions are slow on arm64. There is improvement coming
with go1.11, but in the meantime, if a server uses RSA certificates on
arm64, the math load for the multitude of watches over-taxes the
processor and the TLS connections time out. Retries will also not
succeed and serve to exacerbate the problem.

By extending the timeout, the TLS connections will eventually be
successful and the load will drop.

Fixes kubernetes#64649
joejulian added a commit to joejulian/kubernetes that referenced this issue Jul 20, 2018
joejulian added a commit to joejulian/kubernetes that referenced this issue Jul 20, 2018
k8s-github-robot pushed a commit that referenced this issue Jul 20, 2018
Automatic merge from submit-queue (batch tested with PRs 66341, 66405, 66403, 66264, 66447). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

extend timeout to workaround slow arm64 math

**What this PR does / why we need it**:

The math/big functions are slow on arm64. There is improvement coming
with go1.11 but until such time as that version can be used to build releases, 
if a server uses rsa certificates on arm64, the math load for the multitude
of watches over-taxes the ability of the processor and the TLS connections
time out. Retries will also not succeed and serve to exacerbate the problem.

By extending the timeout, the TLS connections will eventually be
successful and the load will drop.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #64649

**Special notes for your reviewer**:
This was tested on a Raspberry Pi 3

**Release note**:
```release-note
Extend TLS timeouts to work around slow arm64 math/big
```
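(For context, the general shape of such a workaround looks roughly like the following sketch — the timeout values, handler, and cert paths are illustrative, not the literal #66264 diff. On an http.Server, ReadTimeout and WriteTimeout also bound how long the TLS handshake may take, so generous values let a slow arm64 RSA handshake finish rather than being cut off and retried, which only adds load.)

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// Illustrative values, not the PR's: generous read/write timeouts
	// give a slow RSA handshake on arm64 time to complete.
	srv := &http.Server{
		Addr:         ":6443",
		Handler:      http.DefaultServeMux,
		ReadTimeout:  4 * time.Minute,
		WriteTimeout: 4 * time.Minute,
	}
	// Hypothetical certificate paths.
	log.Fatal(srv.ListenAndServeTLS("tls.crt", "tls.key"))
}
```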
tanshanshan pushed a commit to tanshanshan/kubernetes that referenced this issue Aug 10, 2018