High CPU usage after pod deletion with layer2 #246

Closed
flaktack opened this Issue Apr 19, 2018 · 5 comments

flaktack (Contributor) commented Apr 19, 2018

Is this a bug report or a feature request?:

bug

What happened:

When an interface (pod) was deleted, the speaker's CPU usage climbed above 100% and stayed there until the speaker was restarted.

What you expected to happen:

Things remain "normal".

How to reproduce it (as minimally and precisely as possible):

  1. Edit metallb.yaml to remove the resource limits, and edit example-layer2-config.yaml as needed.
  2. Create a simple LoadBalancer (lb.yaml):
apiVersion: v1
kind: Service
metadata:
  name: kubernetes-lb
  namespace: kube-system
spec:
  externalTrafficPolicy: Cluster
  ports:
  - name: https
    port: 6443
    protocol: TCP
    targetPort: 6443
  selector:
    component: kube-apiserver
    tier: control-plane
  type: LoadBalancer
  3. Create a minikube cluster and run the following:
minikube start
kubectl create -f manifests/metallb.yaml
kubectl create -f manifests/example-layer2-config.yaml
kubectl create -f lb.yaml
# Wait for metallb to start / allocate ip
kubectl -n kube-system scale deployment kube-dns --replicas=2
# Wait for kube-dns to scale
kubectl -n kube-system scale deployment kube-dns --replicas=1
# Wait for kube-dns to scale
minikube ssh top

Speaker logs:

{"caller":"announcer.go:88","event":"createARPResponder","interface":"veth0de2337","msg":"created ARP responder for interface","ts":"2018-04-19T11:49:13.962397408Z"}
{"caller":"announcer.go:93","error":"creating NDP responder for \"veth0de2337\": listen ip6:ipv6-icmp fe80::8c3a:5eff:fe01:fd81%veth0de2337: bind: invalid argument","interface":"veth0de2337","msg":"failed to create NDP responder","op":"createNDPResponder","ts":"2018-04-19T11:49:13.962963259Z"}
...
{"caller":"announcer.go:105","event":"deleteARPResponder","interface":"veth0de2337","msg":"deleted ARP responder for interface","ts":"2018-04-19T11:50:33.98100683Z"}
{"caller":"announcer.go:112","event":"deleteNDPResponder","interface":"veth0de2337","msg":"deleted NDP responder for interface","ts":"2018-04-19T11:50:33.981467945Z"}

Top line:

  PID USER      PR  NI    VIRT    RES  %CPU %MEM     TIME+ S COMMAND 
 5458 root      20   0   41.8m  30.7m 157.6  1.5   1:18.84 S              `- speaker

Anything else we need to know?:

The NDP responder wasn't created for the pod interface, which may or may not have anything to do with this:

{"caller":"announcer.go:93","error":"creating NDP responder for \"veth0de2337\": listen ip6:ipv6-icmp fe80::8c3a:5eff:fe01:fd81%veth0de2337: bind: invalid argument","interface":"veth0de2337","msg":"failed to create NDP responder","op":"createNDPResponder","ts":"2018-04-19T11:49:23.964802177Z"}

Environment:

  • MetalLB version: (commit 83a2aedc, branch master)
  • minikube version: v0.26.1
  • Kubernetes version: v1.10.0
  • BGP router type/version: layer2
  • OS (e.g. from /etc/os-release): Buildroot 2017.11
  • Kernel (e.g. uname -a): Linux minikube 4.9.64 #1 SMP Fri Mar 30 21:27:22 UTC 2018 x86_64 GNU/Linux

@danderson danderson added the bug label Apr 19, 2018

@danderson danderson self-assigned this Apr 19, 2018

@danderson danderson added this to To Do in Layer 2 mode via automation Apr 19, 2018

danderson (Member) commented Apr 19, 2018

Thanks for the report. This smells like a responder goroutine spinning after deletion rather than cleaning itself up. I'll dig into it.
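
To illustrate the suspicion, here is a hypothetical sketch of the failure mode, not the speaker's actual code; a UDP socket stands in for the ARP/NDP listener. A read loop that treats every error as transient will spin at full speed once its connection is closed:

package main

import (
	"log"
	"net"
	"time"
)

// respond is an illustrative responder loop. Treating every ReadFrom
// error as transient means that once the conn is closed during
// interface teardown, ReadFrom fails instantly on each iteration and
// the goroutine spins, burning a full CPU core.
func respond(conn net.PacketConn) {
	buf := make([]byte, 1500)
	for {
		n, src, err := conn.ReadFrom(buf)
		if err != nil {
			continue // BUG: spins forever on a closed conn
		}
		log.Printf("got %d bytes from %v", n, src)
	}
}

func main() {
	conn, err := net.ListenPacket("udp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	go respond(conn)
	conn.Close()            // simulate the interface going away
	time.Sleep(time.Second) // CPU climbs while the goroutine spins
}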

danderson (Member) commented Apr 20, 2018

Okay, it takes a couple of attempts to reproduce, but I can trigger this: 170% CPU utilization from the speaker. The logs suggest it went through the normal/correct deletion sequence for the torn-down network interface, so... hmm. Needs more logging in the responder goroutine.

danderson (Member) commented Apr 20, 2018

Aha, I think I simply forgot to port the revised shutdown logic from the ARP code to the NDP code, so NDP isn't shutting down correctly. Working to verify that now, but should be an easy fix if true...
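
For the record, the corrected shape looks something like the sketch below (illustrative names, not the actual MetalLB diff): the loop has to distinguish a deliberate shutdown from a transient read error.

package responder

import "net"

// respond exits when the stop channel is closed, instead of retrying
// forever on the error from its closed connection.
func respond(conn net.PacketConn, stop <-chan struct{}) {
	buf := make([]byte, 1500)
	for {
		n, src, err := conn.ReadFrom(buf)
		if err != nil {
			select {
			case <-stop:
				return // interface deleted, conn closed: exit cleanly
			default:
				continue // transient error: keep serving
			}
		}
		handle(buf[:n], src)
	}
}

// handle is a hypothetical stand-in for the ARP/NDP reply logic.
func handle(pkt []byte, src net.Addr) {}

Teardown then closes stop before closing the conn, so the error path is guaranteed to observe the shutdown rather than looping.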

danderson (Member) commented Apr 20, 2018

Yup, there it goes. With a log entry in the right place, spinning NDP responder goroutines become painfully visible. Fixing...

danderson added a commit to danderson/metallb that referenced this issue Apr 20, 2018

Correctly shut down NDP responders when interfaces go away.
Failure to do so leads to goroutines spinning on as much CPU as k8s
is willing to give them.

Fixes google#246

Layer 2 mode automation moved this from To Do to Done Apr 20, 2018

danderson added a commit that referenced this issue Apr 20, 2018

Correctly shut down NDP responders when interfaces go away.
Failure to do so leads to goroutines spinning on as much CPU as k8s
is willing to give them.

Fixes #246

danderson added a commit that referenced this issue Apr 20, 2018

Correctly shut down NDP responders when interfaces go away.
Failure to do so leads to goroutines spinning on as much CPU as k8s
is willing to give them.

Fixes #246

(cherry picked from commit a6e01da)
danderson (Member) commented Apr 20, 2018

Releasing 0.6.1 to fix this now.
