High glbc CPU usage #25

Closed
bowei opened this issue Oct 11, 2017 · 39 comments
Labels: backend/gce, lifecycle/stale

Comments

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 1:53

I noticed in Stackdriver monitoring that glbc is constantly using about 30% CPU in my cluster. @thockin said on Slack that this might be related to a misconfigured GCE Ingress. However, I looked through all my Ingresses and they are all class nginx. Anyone got a clue or a suggestion on how to troubleshoot further?

One thing I noticed is that my Ingress "kube-lego", which is configured automatically by kube-lego, has a "status" section which looks like this:

status:
  loadBalancer:
    ingress:
    - ip: 130.211....
    - ip: 146.148....

The second IP address is my nginx load balancer; however, the first IP address is unknown to me, and I currently do not know where it comes from.

Copied from original issue: kubernetes/ingress-nginx#183

@bowei commented Oct 11, 2017

From @thockin on January 27, 2017 1:57

is that 130. IP a GCE load-balancer?

Are all the Services behind that Ingress NodePort or LoadBalancer type?


@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 2:00

I only have services of type NodePort and ClusterIP

the 130. IP is unknown to me.

@bowei commented Oct 11, 2017

From @thockin on January 27, 2017 2:02

Look for it in your cloud console - maybe it is a load-balancer or a VM IP.

If you put a ClusterIP service behind an Ingress you could trigger this.


@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 2:18

I had two ClusterIP services behind the Ingress; I changed those services to NodePort.
The IP is not one of my current node IPs or external IPs - possibly a reference to a resource used earlier.
glbc CPU usage is currently still high.

@bowei commented Oct 11, 2017

From @bprashanth on January 27, 2017 2:29

I think your load-balancer controllers are fighting. The fact that you have multiple IPs in the status list indicates that you are using both the nginx and the GCE ingress controllers; the 130. IP is a Google IP.

You need to specify an ingress.class annotation to stop the fighting:
https://github.com/kubernetes/contrib/blob/master/ingress/controllers/gce/BETA_LIMITATIONS.md#disabling-glbc
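For reference, a minimal sketch of an Ingress carrying that annotation (name, host, and service below are placeholders, not from this issue):

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: example-ingress                  # placeholder
  annotations:
    # Claimed by the nginx controller; the GCE controller (glbc) will ignore it.
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  - host: example.com                    # placeholder
    http:
      paths:
      - path: /
        backend:
          serviceName: example-service   # placeholder NodePort service
          servicePort: 80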

@bowei commented Oct 11, 2017

From @bprashanth on January 27, 2017 2:30

Oh sorry, you did mention that they're all class nginx. Can you please post the Ingress that has multiple IPs?

@bowei commented Oct 11, 2017

From @bprashanth on January 27, 2017 2:32

GLBC runs on the master; are you also running it on your nodes?

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 2:35

My nodes are running GCI; I have not added GLBC.

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 2:40

Not sure that I want to post the entire Ingress here, since it is generated by kube-lego. Can I share it in private?

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 2:48

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    ingress.kubernetes.io/ssl-redirect: "false"
    kubernetes.io/ingress.class: nginx
    kubernetes.io/tls-acme-challenge-endpoints: "true"
  creationTimestamp: 2017-01-22T22:02:54Z
  generation: 23
  name: kube-lego-nginx
  namespace: kube-lego
  resourceVersion: "1803893"
  selfLink: /apis/extensions/v1beta1/namespaces/kube-lego/ingresses/kube-lego-nginx
  uid: 86c4db93-e0ee-11e6-a9ff-xx
spec:
  rules:
  - host: h1.example.com
    http:
      paths:
      - backend:
          serviceName: kube-lego-nginx
          servicePort: 8080
        path: /.well-known/acme-challenge
  - host: h2.example.com
    http:
      paths:
      - backend:
          serviceName: kube-lego-nginx
          servicePort: 8080
        path: /.well-known/acme-challenge
  - host: h3.example.com
    http:
      paths:
      - backend:
          serviceName: kube-lego-nginx
          servicePort: 8080
        path: /.well-known/acme-challenge
status:
  loadBalancer:
    ingress:
    - ip: 130.211...
    - ip: 146.148...

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 2:50

The nginx value of the annotation is not quoted; I don't know if that is significant.

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 3:01

I basically run the https://github.com/kubernetes/ingress/blob/master/examples/deployment/nginx/nginx-ingress-controller.yaml deployment. It doesn't come with a Service like in the kube-lego example. I have not spent very long configuring this or studying the kube-lego documentation, so I might have missed something essential.

@bowei commented Oct 11, 2017

From @bprashanth on January 27, 2017 3:07

So Stackdriver is showing the CPU usage of what at 30%? GLBC running on your GKE master? AFAIK it won't show master CPU usage, and the master is managed by Google anyway.

If you're running a GCE cluster, can you please SSH into your master and post the output of tail on /var/log/glbc?

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 3:13

It's GKE; I'm not sure I can SSH to the master, can I?

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 3:19

image

It's the Stackdriver GKE container metric "CPU Usage"; I'm not sure how to narrow it down beyond this in Stackdriver.

@bowei commented Oct 11, 2017

From @bprashanth on January 27, 2017 3:23

Hmm, I'll try it and report

@bowei commented Oct 11, 2017

From @bprashanth on January 27, 2017 18:12

I can't seem to reproduce your chart; mine has more details in the annotations. Anyway, applying Stackdriver to a GKE cluster I only see nodes, which means it's showing you something on the node.

My guess is that it's showing you the CPU usage of the default backend, which runs in the kube-system namespace (@piosz might be able to confirm -- Piotr, what is it using to produce that "glbc" annotation on the tooltip, the container name? In which case, is it actually getting a container from the GKE master?). You can observe it via: kubectl get po --namespace=kube-system -l name=glbc.

This is a simple Go HTTP server that serves the 404 page; if it's using CPU, that probably means your frontend is getting a lot of requests that don't match any URL.
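For example, a couple of commands to check that pod and its measured usage (a sketch; kubectl top relies on heapster metrics being available in the cluster):

kubectl get po --namespace=kube-system -l name=glbc
kubectl top pod --namespace=kube-system -l name=glbc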

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 20:22

Is l7-default-backend-v1.0 logging anywhere? In the logs for the kube-system namespace I find only some errors related to heapster-nanny.

 textPayload: "I0127 20:19:29.341031       1 nanny_lib.go:98] The container resources are {Limits:map[cpu:{i:{value:88 scale:-3} d:{Dec:<nil>} s:88m Format:DecimalSI} memory:{i:{value:213909504 scale:0} d:{Dec:<nil>} s:204Mi Format:BinarySI}] Requests:map[cpu:{i:{value:88 scale:-3} d:{Dec:<nil>} s:88m Format:DecimalSI} memory:{i:{value:213909504 scale:0} d:{Dec:<nil>} s:204Mi Format:BinarySI}]}
"  
 insertId: "1lpzkieg13vxwxm"  
 resource: {…}  
 timestamp: "2017-01-27T20:19:29Z"  
 severity: "ERROR"  
 labels: {…}  
 logName: "projects/example/logs/heapster-nanny"  

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 20:24

Also, is the l7-default-backend-v1.0 ReplicationController required to run in GKE?

@bowei commented Oct 11, 2017

From @bprashanth on January 27, 2017 20:28

A standard Go server doesn't log each request. The default backend is used by the nginx controller, and by all GCE cloud LBs, to return an HTTP 404 whenever they receive a URL they don't understand. It has both requests and limits set:

    resources:
      limits:
        cpu: 10m
        memory: 20Mi
      requests:
        cpu: 10m
        memory: 20Mi

So if something else lands on the node and needs the resources, it'll be forced into the 10m/20Mi it requested.

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 20:40

Not sure that I need GLBC at the moment (?), so I did:

gcloud container clusters update example --update-addons HttpLoadBalancing=DISABLED
kubectl delete rc l7-default-backend-v1.0 -n kube-system

Seems fine?

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 21:03

Note that I have another default-http-backend (in the default namespace): https://github.com/kubernetes/contrib/blob/master/ingress/controllers/nginx/examples/default-backend.yaml

@bowei commented Oct 11, 2017

From @tonglil on January 30, 2017 16:21

I can confirm the same issue. I always just thought this was normal (maybe CPU on the master, which I can't access, so 🤷‍♂️) until reading this.

Here is a screenshot of 3 different dev clusters, where glbc is the top CPU consumer.

1 hour:
screen shot 2017-01-30 at 11 19 57 am

1 week:
screen shot 2017-01-30 at 11 21 14 am

@bowei commented Oct 11, 2017

From @bprashanth on January 30, 2017 16:27

@tonglil your screenshot confirms that it's the default-http-backend (as you can see in the tooltip at the bottom right).

First, I think we need to add some clarity around this.
Second, we need to figure out why a standard Go HTTP server is using that much CPU. Is it just a misrepresentation, are you getting a lot of requests to unknown URLs, or something else?

The source code for the default HTTP backend is https://github.com/kubernetes/contrib/tree/master/404-server; maybe as a first step we can output a periodic log line for every 100th request?
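Something like this hypothetical sketch (not the actual 404-server code, just an illustration of sampled request logging in a plain Go HTTP server):

package main

import (
    "fmt"
    "log"
    "net/http"
    "sync/atomic"
)

var requestCount uint64

func notFound(w http.ResponseWriter, r *http.Request) {
    // Count every request but only log every 100th one to keep log volume low.
    n := atomic.AddUint64(&requestCount, 1)
    if n%100 == 0 {
        log.Printf("served %d requests; latest: %s %s", n, r.Method, r.URL.Path)
    }
    w.WriteHeader(http.StatusNotFound)
    fmt.Fprint(w, "default backend - 404")
}

func main() {
    http.HandleFunc("/", notFound)
    http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}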

Anyway, as mentioned before, it has limits set, so if there's anything else that needs the CPU, the web server will be constrained to 10m.

@bowei commented Oct 11, 2017

From @tonglil on January 31, 2017 21:22

Thanks, yes I realize it is the default HTTP backend now.

Regarding the second point, I don't believe there are a lot of requests to unknown URLs. For that endpoint, where both HTTP and HTTPS are set up, there are about 50 requests per day in total from IP hits and unknown sources. The rest is the standard /healthz endpoint from "GoogleHC/1.0".

It hasn't impacted anything so far (with the limits) so it's not a big concern.

@bowei commented Oct 11, 2017

From @mlazarov on February 16, 2017 21:48

I'm having the same issue on GKE, and the glbc is not the default backend.

screen shot 2017-02-16 at 23 41 24

So far what I'm observing is that the CPU load on glbc is going higher and higher over time. The cluster is almost 5 months old.

Updating from Kubernetes 1.4.7 to 1.5.2 didn't make any change to the load.

@bowei commented Oct 11, 2017

From @bprashanth on February 16, 2017 22:02

Yeah, that is the default backend; it's just wrongly marked as glbc based on the label "name: glbc". You simply won't get glbc stats on GKE; glbc is running on the master, which is not in your project.

The default backend is constrained to 10m (https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/cluster-loadbalancing/glbc/default-svc-controller.yaml#L38), so if you shove more pods onto that node and specify limits > 10m on them, and they actually need more CPU, the default backend should get boxed into that 10m.

I'm actually not sure why a dumb HTTP server would start using more CPU over time. Perhaps there's a bug in the stats collection? What does something like top or atop show on the node?

@bowei commented Oct 11, 2017

From @mlazarov on February 16, 2017 22:27

The glbc is running on gke-cluster-pool-static-2cc4ff80-4z4z which is not the k8s master.

No one from the external world is hitting the glbc through the load balancer, as I have a custom default backend service defined in the Ingress.

Name:		l7-default-backend-2234341178-cx0sm
Namespace:	kube-system
Node:		gke-cluster-pool-static-2cc4ff80-4z4z/10.240.0.5
Start Time:	Thu, 16 Feb 2017 19:34:15 +0200
Labels:		k8s-app=glbc
		name=glbc
		pod-template-hash=2234341178
Status:		Running
IP:		10.204.3.14
Controllers:	ReplicaSet/l7-default-backend-2234341178
Containers:
  default-http-backend:
    Container ID:	docker://6c0e1d82f021c76538331dda23362546146a5f075a7344314f04e59b6a109b07
    Image:		gcr.io/google_containers/defaultbackend:1.0
    Image ID:		docker://sha256:137a07dfd084191742ebc0861606b64f5acff0f893e71348fb0c7d7aaee6a364
    Port:		8080/TCP
    Limits:
      cpu:	10m
      memory:	20Mi
    Requests:
      cpu:		10m
      memory:		20Mi
    State:		Running
      Started:		Thu, 16 Feb 2017 19:34:23 +0200
    Ready:		True
    Restart Count:	0
    Liveness:		http-get http://:8080/healthz delay=30s timeout=5s period=10s #success=1 #failure=3
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-gn8e2 (ro)
    Environment Variables:	<none>
Conditions:
  Type		Status
  Initialized 	True
  Ready 	True
  PodScheduled 	True
Volumes:
  default-token-gn8e2:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-gn8e2
QoS Class:	Guaranteed
Tolerations:	<none>
No events.

There are no logs inside the /var/log/ directory:

# kubectl exec -it l7-default-backend-2234341178-cx0sm --namespace=kube-system -- ls -la /var/log/
total 16
drwxrwxrwt    1 root     root          4096 Feb 16 17:34 .
drwxr-xr-x    1 root     root          4096 Feb 16 17:34 ..
drwxr-xr-x    2 root     root          4096 May 22  2014 ldconfig
drwxr-xr-x    3 root     root          4096 Feb 16 17:34 secrets

Any idea how to debug what this /server process is doing?

@bowei commented Oct 11, 2017

From @tonglil on February 16, 2017 22:32

What about kubectl top --namespace=kube-system pod ...? Does it show resource consumption?

@bowei commented Oct 11, 2017

From @mlazarov on February 16, 2017 22:45

Here is the output of top:

NAME                                                                 CPU(cores)   MEMORY(bytes)
l7-default-backend-2234341178-cx0sm                                  7m           6Mi

If I'm correct, 7m is 70% of the 10m limit. And that is exactly the same information I'm seeing for glbc in the Stackdriver dashboard.

@bowei commented Oct 11, 2017

From @bprashanth on February 16, 2017 22:53

Oh, that's even more confusing then, as I'd expect that dashboard to show the % of available CPU used per pod, not the % of the specified limits (or to indicate somewhere that 100% means 10m). I'll follow up with a bug against Stackdriver monitoring.

@bowei commented Oct 11, 2017

From @mlazarov on February 16, 2017 22:56

If I'm correct, this is not a bug.

7m (the real usage) out of 10m (the real limit) is exactly 70% of the limit, and that is what Stackdriver is displaying. It looks right to me.

@bowei commented Oct 11, 2017

From @bprashanth on February 16, 2017 22:57

It's a confusing feature of the UI, not a bug.

@bowei commented Oct 11, 2017

From @tonglil on February 16, 2017 23:01

I agree; based on past experience with such graphs, I would expect the UI to show percentages of the resources used by the pod or node.

It should be made clearer somewhere in the Stackdriver docs that the percentages are usage/limit.


@bowei commented Oct 11, 2017

From @mlazarov on February 16, 2017 23:09

Yes, when I saw it for the first time it was confusing, but it seems like the right way to display the information. Some more inline docs would be good too.

But what about glbc using 70% of its allotted CPU resources? How can I debug this?

@bowei commented Oct 11, 2017

From @bprashanth on February 16, 2017 23:51

I think that's alright; the limits are set to be a little higher than normal usage. Setting them much higher would be a waste, and setting them lower would result in throttling.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jan 9, 2018
@tonglil commented Jan 10, 2018

/close.

@nicksardo

Thanks @tonglil
