High glbc CPU usage #25

Closed
bowei opened this issue Oct 11, 2017 · 39 comments
Labels: backend/gce, lifecycle/stale

Comments

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 1:53

I noticed in Stackdriver monitoring that glbc is constantly using about 30% CPU in my cluster. @thockin said on Slack that this might be related to a misconfigured GCE Ingress. However, I looked through all my Ingresses and they are all class nginx. Anyone got a clue or a suggestion on how to troubleshoot further?

One thing I noticed is that my Ingress "kube-lego", which is configured automatically by kube-lego, has a "status" section which looks like this:

status:
  loadBalancer:
    ingress:
    - ip: 130.211....
    - ip: 146.148....

The second IP address is my nginx load balancer; however, the first IP address is unknown to me, and I currently do not know where it comes from.

Copied from original issue: kubernetes/ingress-nginx#183

@bowei commented Oct 11, 2017

From @thockin on January 27, 2017 1:57

is that 130. IP a GCE load-balancer?

Are all the Services behind that Ingress NodePort or LoadBalancer type?


@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 2:00

I only have services of type NodePort and ClusterIP

the 130. IP is unknown to me.

@bowei commented Oct 11, 2017

From @thockin on January 27, 2017 2:02

Look for it in your cloud console - maybe it is a load-balancer or a VM IP.

If you put a ClusterIP service behind an Ingress you could trigger this.


@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 2:18

I had two ClusterIP services behind the Ingress; I changed those services to NodePort.
The IP is not one of my current node IPs or external IPs - possibly a reference to a resource used earlier.
glbc CPU usage is currently still high.

@bowei commented Oct 11, 2017

From @bprashanth on January 27, 2017 2:29

I think your load-balancer controllers are fighting. The fact that you have multiple IPs in the status list indicates that you are using both the nginx and the GCE ingress controllers; the 130. IP is a Google IP.

You need to specify an ingress.class annotation to stop the fighting:
https://github.com/kubernetes/contrib/blob/master/ingress/controllers/gce/BETA_LIMITATIONS.md#disabling-glbc
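For reference, a minimal sketch of an Ingress carrying that annotation (name, host, and service below are placeholders, not from this issue):

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: example-ingress                  # placeholder
  annotations:
    # Claimed by the nginx controller; the GCE controller (glbc) will ignore it.
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  - host: example.com                    # placeholder
    http:
      paths:
      - path: /
        backend:
          serviceName: example-service   # placeholder NodePort service
          servicePort: 80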

@bowei commented Oct 11, 2017

From @bprashanth on January 27, 2017 2:30

Oh sorry, you did mention that they're all class nginx. Can you please post the Ingress that has multiple IPs?

@bowei commented Oct 11, 2017

From @bprashanth on January 27, 2017 2:32

GLBC runs on the master; are you also running it on your nodes?

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 2:35

My nodes are running GCI; I have not added GLBC.

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 2:40

Not sure that I want to post the entire Ingress here, since it is generated by kube-lego. Can I share it in private?

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 2:48

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    ingress.kubernetes.io/ssl-redirect: "false"
    kubernetes.io/ingress.class: nginx
    kubernetes.io/tls-acme-challenge-endpoints: "true"
  creationTimestamp: 2017-01-22T22:02:54Z
  generation: 23
  name: kube-lego-nginx
  namespace: kube-lego
  resourceVersion: "1803893"
  selfLink: /apis/extensions/v1beta1/namespaces/kube-lego/ingresses/kube-lego-nginx
  uid: 86c4db93-e0ee-11e6-a9ff-xx
spec:
  rules:
  - host: h1.example.com
    http:
      paths:
      - backend:
          serviceName: kube-lego-nginx
          servicePort: 8080
        path: /.well-known/acme-challenge
  - host: h2.example.com
    http:
      paths:
      - backend:
          serviceName: kube-lego-nginx
          servicePort: 8080
        path: /.well-known/acme-challenge
  - host: h3.example.com
    http:
      paths:
      - backend:
          serviceName: kube-lego-nginx
          servicePort: 8080
        path: /.well-known/acme-challenge
status:
  loadBalancer:
    ingress:
    - ip: 130.211...
    - ip: 146.148...

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 2:50

The nginx value of the annotation is not quoted; I don't know if that is significant.

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 3:01

I basically run the https://github.com/kubernetes/ingress/blob/master/examples/deployment/nginx/nginx-ingress-controller.yaml deployment. It doesn't come with a Service like in the kube-lego example. I have not spent very long configuring this or studying the kube-lego documentation, so I might have missed something essential.

@bowei commented Oct 11, 2017

From @bprashanth on January 27, 2017 3:07

So Stackdriver is showing the CPU usage of what at 30%? GLBC running on your GKE master? AFAIK it won't show master CPU usage, and the master is managed by Google anyway.

If you're running a GCE cluster, can you please SSH into your master and post the output of tail on /var/log/glbc?

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 3:13

It's GKE; I'm not sure I can SSH to the master, can I?

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 3:19

image

It's the Stackdriver GKE container metric "CPU Usage"; I'm not sure how to narrow it down beyond this in Stackdriver.

@bowei commented Oct 11, 2017

From @bprashanth on January 27, 2017 3:23

Hmm, I'll try it and report

@bowei commented Oct 11, 2017

From @bprashanth on January 27, 2017 18:12

I can't seem to reproduce your chart; mine has more details in the annotations. Anyway, applying Stackdriver to a GKE cluster I only see nodes, which means it's showing you something on the node.

My guess is that it's showing you the CPU usage of the default backend, which runs in the kube-system namespace (@piosz might be able to confirm -- Piotr, what is it using to produce that "glbc" annotation on the tooltip, the container name? In which case, is it actually getting a container from the GKE master?). You can observe it via: kubectl get po --namespace=kube-system -l name=glbc.

This is a simple Go HTTP server that serves the 404 page; if it's using CPU, that probably means your frontend is getting a lot of requests that don't match any URL.
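For example, a couple of commands to check that pod and its measured usage (a sketch; kubectl top relies on heapster metrics being available in the cluster):

kubectl get po --namespace=kube-system -l name=glbc
kubectl top pod --namespace=kube-system -l name=glbc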

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 20:22

Is l7-default-backend-v1.0 logging anywhere? In the logs for the kube-system namespace I find only some errors related to heapster-nanny.

 textPayload: "I0127 20:19:29.341031       1 nanny_lib.go:98] The container resources are {Limits:map[cpu:{i:{value:88 scale:-3} d:{Dec:<nil>} s:88m Format:DecimalSI} memory:{i:{value:213909504 scale:0} d:{Dec:<nil>} s:204Mi Format:BinarySI}] Requests:map[cpu:{i:{value:88 scale:-3} d:{Dec:<nil>} s:88m Format:DecimalSI} memory:{i:{value:213909504 scale:0} d:{Dec:<nil>} s:204Mi Format:BinarySI}]}
"  
 insertId: "1lpzkieg13vxwxm"  
 resource: {…}  
 timestamp: "2017-01-27T20:19:29Z"  
 severity: "ERROR"  
 labels: {…}  
 logName: "projects/example/logs/heapster-nanny"  

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 20:24

Also, is the l7-default-backend-v1.0 ReplicationController required to run in GKE?

@bowei commented Oct 11, 2017

From @bprashanth on January 27, 2017 20:28

A standard Go server doesn't log each request. The default backend is used by the nginx controller, and by all GCE cloud LBs, to return an HTTP 404 whenever they receive a URL they don't understand. It has both requests and limits set:

    resources:
      limits:
        cpu: 10m
        memory: 20Mi
      requests:
        cpu: 10m
        memory: 20Mi

So if something else lands on the node and needs the resources, it'll be forced into the 10m/20Mi it requested.

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 20:40

Not sure that I need GLBC at the moment (?), so I did:

gcloud container clusters update example --update-addons HttpLoadBalancing=DISABLED
kubectl delete rc l7-default-backend-v1.0 -n kube-system

Seems fine?

@bowei commented Oct 11, 2017

From @jmn on January 27, 2017 21:03

Note that I have another default-http-backend (in the default namespace): https://github.com/kubernetes/contrib/blob/master/ingress/controllers/nginx/examples/default-backend.yaml

@bowei commented Oct 11, 2017

From @tonglil on January 30, 2017 16:21

I can confirm the same issue. I always just thought this was normal (maybe CPU on the master, which I can't access, so 🤷‍♂️) until reading this.

Here is a screenshot of 3 different dev clusters, where glbc is the top CPU consumer.

1 hour:
screen shot 2017-01-30 at 11 19 57 am

1 week:
screen shot 2017-01-30 at 11 21 14 am

@bowei commented Oct 11, 2017

From @bprashanth on January 30, 2017 16:27

@tonglil your screenshot confirms that it's the default-http-backend (as you can see in the tooltip at the bottom right).

First, I think we need to add some clarity around this.
Second, we need to figure out why a standard Go HTTP server is using that much CPU. Is it just a misrepresentation, are you getting a lot of requests to unknown URLs, or something else?

The source code for the default HTTP backend is https://github.com/kubernetes/contrib/tree/master/404-server; maybe as a first step we can output a periodic log line for every 100th request?
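Something like this hypothetical sketch (not the actual 404-server code, just an illustration of sampled request logging in a plain Go HTTP server):

package main

import (
    "fmt"
    "log"
    "net/http"
    "sync/atomic"
)

var requestCount uint64

func notFound(w http.ResponseWriter, r *http.Request) {
    // Count every request but only log every 100th one to keep log volume low.
    n := atomic.AddUint64(&requestCount, 1)
    if n%100 == 0 {
        log.Printf("served %d requests; latest: %s %s", n, r.Method, r.URL.Path)
    }
    w.WriteHeader(http.StatusNotFound)
    fmt.Fprint(w, "default backend - 404")
}

func main() {
    http.HandleFunc("/", notFound)
    http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}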

Anyway, as mentioned before, it has limits set, so if there's anything else that needs the CPU, the web server will be constrained to 10m.

@bowei commented Oct 11, 2017

From @tonglil on January 31, 2017 21:22

Thanks, yes I realize it is the default HTTP backend now.

Regarding the second point, I don't believe there are a lot of requests to unknown URLs. For that endpoint, where both HTTP and HTTPS are set up, there are about 50 requests per day in total from IP hits and unknown sources. The rest is the standard /healthz endpoint from "GoogleHC/1.0".

It hasn't impacted anything so far (with the limits) so it's not a big concern.

@bowei commented Oct 11, 2017

From @mlazarov on February 16, 2017 21:48

I'm having the same issue on GKE, and the glbc is not the default backend.

screen shot 2017-02-16 at 23 41 24

So far what I'm observing is that the CPU load on glbc is going higher and higher over time. The cluster is almost 5 months old.

Updating from Kubernetes 1.4.7 to 1.5.2 didn't make any change to the load.

@bowei commented Oct 11, 2017

From @bprashanth on February 16, 2017 22:02

Yeah, that is the default backend; it's just wrongly marked as glbc based on the label "name: glbc". You simply won't get glbc stats on GKE; glbc is running on the master, which is not in your project.

The default backend is constrained to 10m (https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/cluster-loadbalancing/glbc/default-svc-controller.yaml#L38), so if you shove more pods onto that node and specify limits > 10m on them, and they actually need more CPU, the default backend should get boxed into that 10m.

I'm actually not sure why a dumb HTTP server would start using more CPU over time. Perhaps there's a bug in the stats collection? What does something like top or atop show on the node?

@bowei commented Oct 11, 2017

From @mlazarov on February 16, 2017 22:27

The glbc is running on gke-cluster-pool-static-2cc4ff80-4z4z which is not the k8s master.

No one from the external world is hitting the glbc through the load balancer, as I have a custom default backend service defined in the Ingress.

Name:		l7-default-backend-2234341178-cx0sm
Namespace:	kube-system
Node:		gke-cluster-pool-static-2cc4ff80-4z4z/10.240.0.5
Start Time:	Thu, 16 Feb 2017 19:34:15 +0200
Labels:		k8s-app=glbc
		name=glbc
		pod-template-hash=2234341178
Status:		Running
IP:		10.204.3.14
Controllers:	ReplicaSet/l7-default-backend-2234341178
Containers:
  default-http-backend:
    Container ID:	docker://6c0e1d82f021c76538331dda23362546146a5f075a7344314f04e59b6a109b07
    Image:		gcr.io/google_containers/defaultbackend:1.0
    Image ID:		docker://sha256:137a07dfd084191742ebc0861606b64f5acff0f893e71348fb0c7d7aaee6a364
    Port:		8080/TCP
    Limits:
      cpu:	10m
      memory:	20Mi
    Requests:
      cpu:		10m
      memory:		20Mi
    State:		Running
      Started:		Thu, 16 Feb 2017 19:34:23 +0200
    Ready:		True
    Restart Count:	0
    Liveness:		http-get http://:8080/healthz delay=30s timeout=5s period=10s #success=1 #failure=3
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-gn8e2 (ro)
    Environment Variables:	<none>
Conditions:
  Type		Status
  Initialized 	True
  Ready 	True
  PodScheduled 	True
Volumes:
  default-token-gn8e2:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-gn8e2
QoS Class:	Guaranteed
Tolerations:	<none>
No events.

There are no logs inside the /var/log/ directory:

# kubectl exec -it l7-default-backend-2234341178-cx0sm --namespace=kube-system -- ls -la /var/log/
total 16
drwxrwxrwt    1 root     root          4096 Feb 16 17:34 .
drwxr-xr-x    1 root     root          4096 Feb 16 17:34 ..
drwxr-xr-x    2 root     root          4096 May 22  2014 ldconfig
drwxr-xr-x    3 root     root          4096 Feb 16 17:34 secrets

Any idea how to debug what this /server process is doing?

@bowei commented Oct 11, 2017

From @tonglil on February 16, 2017 22:32

What about kubectl top --namespace=kube-system pod ...? Does it show resource consumption?

@bowei commented Oct 11, 2017

From @mlazarov on February 16, 2017 22:45

Here is the output of top:

NAME                                                                 CPU(cores)   MEMORY(bytes)
l7-default-backend-2234341178-cx0sm                                  7m           6Mi

If I'm correct, 7m is 70% of the 10m limit. And that is exactly the same information I'm seeing for glbc in the Stackdriver dashboard.

@bowei commented Oct 11, 2017

From @bprashanth on February 16, 2017 22:53

Oh, that's even more confusing then, as I'd expect that dashboard to show the % of available CPU used per pod, not the % of the specified limits (or to indicate somewhere that 100% means 10m). I'll follow up with a bug against Stackdriver monitoring.

@bowei commented Oct 11, 2017

From @mlazarov on February 16, 2017 22:56

If I'm correct, this is not a bug.

7m (the real usage) out of 10m (the real limit) is exactly 70% of the limit, and that is what Stackdriver is displaying. It looks right to me.

@bowei commented Oct 11, 2017

From @bprashanth on February 16, 2017 22:57

It's a confusing feature of the UI, not a bug.

@bowei commented Oct 11, 2017

From @tonglil on February 16, 2017 23:01

I agree; based on past experience with such graphs, I would expect the UI to show percentages of the resources used by the pod or node.

It should be made clearer somewhere in the Stackdriver docs that the percentages are usage/limit.


@bowei commented Oct 11, 2017

From @mlazarov on February 16, 2017 23:09

Yes, when I saw it for the first time it was confusing, but it seems like the right way to display the information. Some more inline docs would be good too.

But what about glbc using 70% of its allotted CPU resources? How can I debug this?

@bowei commented Oct 11, 2017

From @bprashanth on February 16, 2017 23:51

I think that's alright; the limits are set to be a little higher than normal usage. Setting them much higher would be a waste, and setting them lower would result in throttling.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jan 9, 2018
@tonglil commented Jan 10, 2018

/close.

@nicksardo

Thanks @tonglil
