
not obviously serious issue about CPU, nginx-ingress http://IP:10254/healthz fails constantly #4735

Closed
kwladyka opened this issue Nov 4, 2019 · 13 comments · Fixed by #4843

kwladyka commented Nov 4, 2019

TL;DR: jump to #4735 (comment)

kubectl --context=etingroup-production -n nginx-ingress logs -f nginx-ingress-controller-f8f7c9b69-r9rbq
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       0.26.1
  Build:         git-2de5a893a
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: openresty/1.15.8.2

-------------------------------------------------------------------------------

I1104 15:31:13.442249       6 flags.go:198] Watching for Ingress class: nginx
W1104 15:31:13.442782       6 flags.go:243] SSL certificate chain completion is disabled (--enable-ssl-chain-completion=false)
W1104 15:31:13.443033       6 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I1104 15:31:13.443404       6 main.go:182] Creating API client for https://10.55.240.1:443
I1104 15:31:13.457511       6 main.go:226] Running in Kubernetes cluster version v1.14+ (v1.14.7-gke.10) - git (clean) commit 8cea5f8ae165065f0d35e5de5dfa2f73617f02d1 - platform linux/amd64
I1104 15:31:13.466100       6 main.go:90] Validated nginx-ingress/nginx-ingress-default-backend as the default backend.
I1104 15:31:27.613232       6 main.go:101] SSL fake certificate created /etc/ingress-controller/ssl/default-fake-certificate.pem
W1104 15:31:35.091635       6 store.go:635] Unexpected error reading configuration configmap: configmaps "nginx-ingress-controller" not found
I1104 15:31:39.614309       6 nginx.go:263] Starting NGINX Ingress controller
I1104 15:31:47.906011       6 backend_ssl.go:66] Adding Secret "argocd/argocd-secret" to the local store
I1104 15:31:47.911764       6 nginx.go:307] Starting NGINX process
I1104 15:31:47.917560       6 leaderelection.go:241] attempting to acquire leader lease  nginx-ingress/ingress-controller-leader-nginx...
I1104 15:31:47.923784       6 backend_ssl.go:66] Adding Secret "etingroup/main-etingroup-pl-tls" to the local store
I1104 15:31:47.929354       6 event.go:255] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"etingroup", Name:"etingroup-sync", UID:"35893ce3-84ae-11e9-90aa-42010a840106", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"49361143", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress etingroup/etingroup-sync
I1104 15:31:47.933374       6 event.go:255] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"argocd", Name:"argocd-server-http-ingress", UID:"1e332eb9-8451-11e9-9fe8-42010a8400b2", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"49361142", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress argocd/argocd-server-http-ingress
I1104 15:31:47.939507       6 status.go:86] new leader elected: nginx-ingress-controller-f8f7c9b69-smj4z
I1104 15:31:47.951045       6 controller.go:134] Configuration changes detected, backend reload required.
I1104 15:31:49.883927       6 main.go:153] Received SIGTERM, shutting down
I1104 15:31:50.595026       6 nginx.go:390] Shutting down controller queues
I1104 15:31:50.595482       6 status.go:117] updating status of Ingress rules (remove)
W1104 15:31:53.083266       6 template.go:120] unexpected error cleaning template: signal: terminated
E1104 15:31:53.088962       6 controller.go:146] Unexpected failure reloading the backend:
invalid NGINX configuration (empty)
W1104 15:31:53.093849       6 queue.go:130] requeuing initial-sync, err invalid NGINX configuration (empty)
I1104 15:31:53.098883       6 status.go:136] removing address from ingress status ([35.195.148.61])
I1104 15:31:53.197577       6 status.go:274] updating Ingress argocd/argocd-server-http-ingress status from [{35.195.148.61 }] to []
I1104 15:31:53.204586       6 status.go:274] updating Ingress etingroup/etingroup-sync status from [{35.195.148.61 }] to []
I1104 15:31:55.776450       6 nginx.go:406] Stopping NGINX process
2019/11/04 15:31:55 [notice] 31#31: signal process started
I1104 15:31:58.974559       6 nginx.go:419] NGINX process has stopped
I1104 15:31:59.072914       6 main.go:161] Handled quit, awaiting Pod deletion
I1104 15:32:09.568890       6 main.go:164] Exiting with 0

Is this a BUG REPORT or FEATURE REQUEST? (choose one):

BUG

Kubernetes version (use kubectl version):

kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T23:42:50Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.7-gke.10", GitCommit:"8cea5f8ae165065f0d35e5de5dfa2f73617f02d1", GitTreeState:"clean", BuildDate:"2019-10-05T00:08:10Z", GoVersion:"go1.12.9b4", Compiler:"gc", Platform:"linux/amd64"}

Environment:

GCP Kubernetes Engine (GKE)

What happened:

Nginx restarts in a loop. It always fixes itself after 20-30 minutes.

What you expected to happen:

It should not restart / crash.

How to reproduce it (as minimally and precisely as possible):

I don't know. My blind guess is that it could be related to https://argoproj.github.io/argo-cd/, since nginx-ingress crashes constantly while both run in the same cluster. More likely it is something in my values.yaml configuration. Hard to say. I need this setup in order to use nginx.ingress.kubernetes.io/whitelist-source-range.
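
For context, this is roughly how that annotation is used on an Ingress; a sketch only, where the name, host, source range, and backend are placeholders:

    apiVersion: networking.k8s.io/v1beta1
    kind: Ingress
    metadata:
      name: example-ingress                # placeholder name
      annotations:
        # only allow clients connecting from this source range
        nginx.ingress.kubernetes.io/whitelist-source-range: "203.0.113.0/24"
    spec:
      rules:
        - host: example.com                # placeholder host
          http:
            paths:
              - path: /
                backend:
                  serviceName: example-service   # placeholder backend Service
                  servicePort: 80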

It always happens when I downgrade / upgrade nodes, so the cluster has to reschedule everything on new machines. It also happens randomly from time to time.

My helm values.yaml

        service:
            externalTrafficPolicy: "Local"
            loadBalancerIP: "35.195.XXX.XX"
            omitClusterIP: true

        publishService:
            enabled: true

    defaultBackend:
        service:
            omitClusterIP: true

I use Google Kubernetes Engine (GKE).

kwladyka (Author) commented Nov 7, 2019

More logs from another crash:

kubectl --context=foo-production -n nginx-ingress logs -f nginx-ingress-controller-f8f7c9b69-zkkgx
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       0.26.1
  Build:         git-2de5a893a
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: openresty/1.15.8.2

-------------------------------------------------------------------------------

I1107 12:59:49.823919       6 flags.go:198] Watching for Ingress class: nginx
W1107 12:59:49.824589       6 flags.go:243] SSL certificate chain completion is disabled (--enable-ssl-chain-completion=false)
W1107 12:59:49.824798       6 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I1107 12:59:49.825271       6 main.go:182] Creating API client for https://10.55.240.1:443
I1107 12:59:50.020006       6 main.go:226] Running in Kubernetes cluster version v1.14+ (v1.14.7-gke.14) - git (clean) commit d7951e52ecdf56a1de9e031c553569ec13849a80 - platform linux/amd64
I1107 12:59:51.362836       6 main.go:90] Validated nginx-ingress/nginx-ingress-default-backend as the default backend.
I1107 13:00:05.219513       6 main.go:101] SSL fake certificate created /etc/ingress-controller/ssl/default-fake-certificate.pem
W1107 13:00:08.541308       6 store.go:635] Unexpected error reading configuration configmap: configmaps "nginx-ingress-controller" not found
I1107 13:00:10.153910       6 nginx.go:263] Starting NGINX Ingress controller
I1107 13:00:13.708645       6 backend_ssl.go:66] Adding Secret "argocd/argocd-secret" to the local store
I1107 13:00:13.710747       6 backend_ssl.go:66] Adding Secret "etingroup/main-etingroup-pl-tls" to the local store
I1107 13:00:13.712569       6 event.go:255] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"argocd", Name:"argocd-server-http-ingress", UID:"1e332eb9-8451-11e9-9fe8-42010a8400b2", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"50486517", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress argocd/argocd-server-http-ingress
I1107 13:00:13.712760       6 event.go:255] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"etingroup", Name:"etingroup-sync", UID:"35893ce3-84ae-11e9-90aa-42010a840106", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"50486516", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress etingroup/etingroup-sync
I1107 13:00:15.310010       6 nginx.go:307] Starting NGINX process
I1107 13:00:17.372980       6 leaderelection.go:241] attempting to acquire leader lease  nginx-ingress/ingress-controller-leader-nginx...
I1107 13:00:18.204723       6 controller.go:134] Configuration changes detected, backend reload required.
I1107 13:00:23.934359       6 leaderelection.go:251] successfully acquired lease nginx-ingress/ingress-controller-leader-nginx
I1107 13:00:24.683612       6 status.go:86] new leader elected: nginx-ingress-controller-f8f7c9b69-zkkgx
I1107 13:00:27.969726       6 main.go:153] Received SIGTERM, shutting down
I1107 13:00:29.800005       6 nginx.go:390] Shutting down controller queues
I1107 13:00:29.800184       6 status.go:117] updating status of Ingress rules (remove)
I1107 13:00:29.838929       6 status.go:136] removing address from ingress status ([35.195.148.61])
I1107 13:00:29.858318       6 nginx.go:406] Stopping NGINX process
I1107 13:00:29.846504       6 status.go:274] updating Ingress etingroup/etingroup-sync status from [] to [{35.195.148.61 }]
I1107 13:00:29.846643       6 status.go:274] updating Ingress argocd/argocd-server-http-ingress status from [] to [{35.195.148.61 }]
2019/11/07 13:00:32 [notice] 30#30: signal process started
I1107 13:00:44.037409       6 nginx.go:419] NGINX process has stopped
I1107 13:00:44.642710       6 main.go:161] Handled quit, awaiting Pod deletion
E1107 13:00:50.088180       6 controller.go:146] Unexpected failure reloading the backend:
exit status 1
2019/11/07 13:00:45 [notice] 37#37: signal process started
2019/11/07 13:00:45 [error] 37#37: invalid PID number "" in "/tmp/nginx.pid"
nginx: [error] invalid PID number "" in "/tmp/nginx.pid"
W1107 13:00:50.088217       6 queue.go:130] requeuing initial-sync, err exit status 1
2019/11/07 13:00:45 [notice] 37#37: signal process started
2019/11/07 13:00:45 [error] 37#37: invalid PID number "" in "/tmp/nginx.pid"
nginx: [error] invalid PID number "" in "/tmp/nginx.pid"
I1107 13:00:54.750748       6 main.go:164] Exiting with 0

This screenshot might be useful:

[screenshot]

aledbf (Member) commented Nov 7, 2019

I1104 15:31:49.883927 6 main.go:153] Received SIGTERM, shutting down

This is not a crash. The kubelet is instructing the pod to terminate.
Please use kubectl describe <ingress pod> to see the reason

aledbf (Member) commented Nov 7, 2019

From the screenshot, it seems the kubelet cannot reach the ingress controller pod

kwladyka (Author) commented Nov 7, 2019

This is not a crash.

When it happens, ALL ingresses stop working, so users can't connect to ANY http(s) service in k8s. We can call it a bug then :)

From the screenshot, it seems the kubelet cannot reach the ingress controller pod

Any tips on how to debug it further? What can I do about it?

aledbf (Member) commented Nov 7, 2019

We can call it a bug then :)

To confirm this is not an issue with the ingress controller itself, please remove the probes in the deployment. The ingress controller should not restart anymore

Any tips on how to debug it further? What can I do about it?

SSH to the node where the pod is running and try to execute curl http://:10254/healthz

kwladyka (Author) commented Nov 7, 2019

SSH to the node where the pod is running and try to execute curl http://:10254/healthz

Do you mean gcloud compute ssh to the node, or kubectl exec -it pod -- /bin/bash to "ssh" into the pod? I assume you mean the gcloud version. So I have to do it with the pod IP, i.e. http://IP:10254/healthz, or not?

remove the probes in the deployment

I don't see such an option: https://github.com/helm/charts/blob/master/stable/nginx-ingress/values.yaml#L179

How can I do it?

aledbf (Member) commented Nov 7, 2019

Do you mean gcloud compute ssh to the node

SSH to the Kubernetes node where the ingress controller pod is running.

How can I do it?

Edit the deployment directly (kubectl edit deployment), without using helm.

kwladyka (Author) commented Nov 7, 2019

For people who have similar issues:

# read IP from pod
kubectl --context=xxx-production -n nginx-ingress describe pod nginx-ingress-controller-xxxxx
# log to node
gcloud beta compute --project "project-name" ssh --zone "europe-west1-b" "gke-production-pool-1-xxxxx"
curl -v http://IP:10254/healthz

Edit the deployment to remove the livenessProbe:
kubectl --context=xxx-production -n nginx-ingress edit deployment nginx-ingress-controller

In my case I also have to turn off sync in GitOps, because otherwise it will detect the difference and overwrite the probes.
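
For reference, the probe stanza to delete from the controller container looks roughly like this; a sketch reconstructed from the settings kubectl describe reports (delay=10s timeout=1s period=10s #failure=3), so exact values may differ per chart version:

    livenessProbe:
      httpGet:
        path: /healthz
        port: 10254
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 1
      successThreshold: 1
      failureThreshold: 3
    # readinessProbe has the same shape and can be removed in the same edit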

kwladyka (Author) commented Nov 7, 2019

Ha, I found out when it happens!

curl -v http://10.52.3.15:10254/healthz
*   Trying 10.52.3.15...
* TCP_NODELAY set
* Connected to 10.52.3.15 (10.52.3.15) port 10254 (#0)
> GET /healthz HTTP/1.1
> Host: 10.52.3.15:10254
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Thu, 07 Nov 2019 16:01:23 GMT
< Content-Length: 84
<
[+]ping ok
[-]nginx-ingress-controller failed: reason withheld
healthz check failed
* Connection #0 to host 10.52.3.15 left intact
* Closing connection 0

After a while there is no connection to the nginx-ingress controller at all:

curl -v http://10.52.3.15:10254/healthz
*   Trying 10.52.3.15...
* TCP_NODELAY set
* connect to 10.52.3.15 port 10254 failed: Connection refused
* Failed to connect to 10.52.3.15 port 10254: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 10.52.3.15 port 10254: Connection refused
kubectl --context=etingroup-production -n nginx-ingress describe pod nginx-ingress-controller-f8f7c9b69-zkkgx | grep IP:
IP:             10.52.3.15
kubectl --context=etingroup-production -n nginx-ingress describe pod nginx-ingress-controller-f8f7c9b69-zkkgx
Name:           nginx-ingress-controller-f8f7c9b69-zkkgx
Namespace:      nginx-ingress
Priority:       0
Node:           gke-production-pool-1-ce587bf0-g3mh/10.132.0.15
Start Time:     Thu, 07 Nov 2019 09:08:09 +0100
Labels:         app=nginx-ingress
                component=controller
                pod-template-hash=f8f7c9b69
                release=nginx-ingress
Annotations:    <none>
Status:         Running
IP:             10.52.3.15
IPs:            <none>
Controlled By:  ReplicaSet/nginx-ingress-controller-f8f7c9b69
Containers:
  nginx-ingress-controller:
    Container ID:  docker://6429cc9fcfb0d3dbd2832d306fd5deff3d33a980070445dadecc235a1a833725
    Image:         quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.26.1
    Image ID:      docker-pullable://quay.io/kubernetes-ingress-controller/nginx-ingress-controller@sha256:d0b22f715fcea5598ef7f869d308b55289a3daaa12922fa52a1abf17703c88e7
    Ports:         80/TCP, 443/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      /nginx-ingress-controller
      --default-backend-service=nginx-ingress/nginx-ingress-default-backend
      --publish-service=nginx-ingress/nginx-ingress-controller
      --election-id=ingress-controller-leader
      --ingress-class=nginx
      --configmap=nginx-ingress/nginx-ingress-controller
    State:          Running
      Started:      Thu, 07 Nov 2019 17:17:26 +0100
    Last State:     Terminated
      Reason:       Error
      Exit Code:    143
      Started:      Thu, 07 Nov 2019 17:16:41 +0100
      Finished:     Thu, 07 Nov 2019 17:17:14 +0100
    Ready:          False
    Restart Count:  32
    Liveness:       http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       nginx-ingress-controller-f8f7c9b69-zkkgx (v1:metadata.name)
      POD_NAMESPACE:  nginx-ingress (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from nginx-ingress-token-n59h8 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  nginx-ingress-token-n59h8:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nginx-ingress-token-n59h8
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                     From                                          Message
  ----     ------     ----                    ----                                          -------
  Warning  Unhealthy  17m (x11 over 4h8m)     kubelet, gke-production-pool-1-ce587bf0-g3mh  Liveness probe failed: Get http://10.52.3.15:10254/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  16m (x12 over 4h8m)     kubelet, gke-production-pool-1-ce587bf0-g3mh  Readiness probe failed: Get http://10.52.3.15:10254/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  16m (x14 over 4h7m)     kubelet, gke-production-pool-1-ce587bf0-g3mh  Readiness probe failed: HTTP probe failed with statuscode: 500
  Warning  Unhealthy  16m                     kubelet, gke-production-pool-1-ce587bf0-g3mh  Readiness probe failed: Get http://10.52.3.15:10254/healthz: read tcp 10.52.3.1:51828->10.52.3.15:10254: read: connection reset by peer
  Warning  Unhealthy  15m (x58 over 4h6m)     kubelet, gke-production-pool-1-ce587bf0-g3mh  Readiness probe failed: Get http://10.52.3.15:10254/healthz: dial tcp 10.52.3.15:10254: connect: connection refused
  Warning  Unhealthy  15m (x65 over 4h6m)     kubelet, gke-production-pool-1-ce587bf0-g3mh  Liveness probe failed: Get http://10.52.3.15:10254/healthz: dial tcp 10.52.3.15:10254: connect: connection refused
  Normal   Killing    15m (x25 over 4h8m)     kubelet, gke-production-pool-1-ce587bf0-g3mh  Container nginx-ingress-controller failed liveness probe, will be restarted
  Normal   Pulled     15m (x25 over 4h7m)     kubelet, gke-production-pool-1-ce587bf0-g3mh  Container image "quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.26.1" already present on machine
  Normal   Created    15m (x26 over 8h)       kubelet, gke-production-pool-1-ce587bf0-g3mh  Created container nginx-ingress-controller
  Normal   Started    15m (x26 over 8h)       kubelet, gke-production-pool-1-ce587bf0-g3mh  Started container nginx-ingress-controller
  Warning  BackOff    2m27s (x209 over 4h2m)  kubelet, gke-production-pool-1-ce587bf0-g3mh  Back-off restarting failed container
kubectl --context=etingroup-production -n nginx-ingress describe node
Name:               gke-production-pool-1-ce587bf0-g3mh
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/fluentd-ds-ready=true
                    beta.kubernetes.io/instance-type=n1-standard-2
                    beta.kubernetes.io/os=linux
                    cloud.google.com/gke-nodepool=pool-1
                    cloud.google.com/gke-os-distribution=cos
                    failure-domain.beta.kubernetes.io/region=europe-west1
                    failure-domain.beta.kubernetes.io/zone=europe-west1-b
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=gke-production-pool-1-ce587bf0-g3mh
                    kubernetes.io/os=linux
Annotations:        container.googleapis.com/instance_id: 1804273139999206952
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 04 Nov 2019 16:07:31 +0100
Taints:             <none>
Unschedulable:      false
Conditions:
  Type                          Status  LastHeartbeatTime                 LastTransitionTime                Reason                          Message
  ----                          ------  -----------------                 ------------------                ------                          -------
  FrequentDockerRestart         False   Thu, 07 Nov 2019 17:19:37 +0100   Thu, 07 Nov 2019 09:08:05 +0100   NoFrequentDockerRestart         docker is functioning properly
  FrequentContainerdRestart     False   Thu, 07 Nov 2019 17:19:37 +0100   Thu, 07 Nov 2019 09:08:05 +0100   NoFrequentContainerdRestart     containerd is functioning properly
  KernelDeadlock                False   Thu, 07 Nov 2019 17:19:37 +0100   Thu, 07 Nov 2019 09:08:05 +0100   KernelHasNoDeadlock             kernel has no deadlock
  ReadonlyFilesystem            False   Thu, 07 Nov 2019 17:19:37 +0100   Thu, 07 Nov 2019 09:08:05 +0100   FilesystemIsNotReadOnly         Filesystem is not read-only
  CorruptDockerOverlay2         False   Thu, 07 Nov 2019 17:19:37 +0100   Thu, 07 Nov 2019 09:08:05 +0100   NoCorruptDockerOverlay2         docker overlay2 is functioning properly
  FrequentUnregisterNetDevice   False   Thu, 07 Nov 2019 17:19:37 +0100   Thu, 07 Nov 2019 09:08:05 +0100   NoFrequentUnregisterNetDevice   node is functioning properly
  FrequentKubeletRestart        False   Thu, 07 Nov 2019 17:19:37 +0100   Thu, 07 Nov 2019 09:08:05 +0100   NoFrequentKubeletRestart        kubelet is functioning properly
  NetworkUnavailable            False   Mon, 04 Nov 2019 16:07:47 +0100   Mon, 04 Nov 2019 16:07:47 +0100   RouteCreated                    RouteController created a route
  MemoryPressure                False   Thu, 07 Nov 2019 17:19:43 +0100   Thu, 07 Nov 2019 09:08:07 +0100   KubeletHasSufficientMemory      kubelet has sufficient memory available
  DiskPressure                  False   Thu, 07 Nov 2019 17:19:43 +0100   Thu, 07 Nov 2019 09:08:07 +0100   KubeletHasNoDiskPressure        kubelet has no disk pressure
  PIDPressure                   False   Thu, 07 Nov 2019 17:19:43 +0100   Thu, 07 Nov 2019 09:08:07 +0100   KubeletHasSufficientPID         kubelet has sufficient PID available
  Ready                         True    Thu, 07 Nov 2019 17:19:43 +0100   Thu, 07 Nov 2019 09:08:07 +0100   KubeletReady                    kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:   10.132.0.15
..........
Capacity:
 attachable-volumes-gce-pd:  127
 cpu:                        2
 ephemeral-storage:          98868448Ki
 hugepages-2Mi:              0
 memory:                     7657476Ki
 pods:                       110
Allocatable:
 attachable-volumes-gce-pd:  127
 cpu:                        1930m
 ephemeral-storage:          47093746742
 hugepages-2Mi:              0
 memory:                     5777412Ki
 pods:                       110
System Info:
..........
 Kernel Version:             4.14.145+
 OS Image:                   Container-Optimized OS from Google
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://18.9.7
 Kubelet Version:            v1.14.7-gke.14
 Kube-Proxy Version:         v1.14.7-gke.14
PodCIDR:                     10.52.3.0/24
..........
Non-terminated Pods:         (22 in total)
  Namespace                  Name                                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                                         ------------  ----------  ---------------  -------------  ---
  argocd                     argocd-application-controller-674fd875bf-djcf2               0 (0%)        0 (0%)      0 (0%)           0 (0%)         8h
  argocd                     argocd-dex-server-9d47975b-k54n8                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         8h
  argocd                     argocd-redis-fc585c648-2l8j9                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         8h
  argocd                     argocd-repo-server-564cf688d9-rt56k                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         8h
  argocd                     argocd-server-6dcc6d6dcc-6p54f                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         8h
  cert-manager               cert-manager-756d9f56d6-x99t4                                0 (0%)        0 (0%)      0 (0%)           0 (0%)         8h
  cert-manager               cert-manager-cainjector-74bb68d67c-vhsct                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         8h
  cert-manager               cert-manager-webhook-645b8bdb7-r5kl7                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         8h
  etingroup                  etingroup-sync-5c79979f67-5nxzm                              100m (5%)     2 (103%)    2Gi (36%)        2Gi (36%)      8h
  kube-system                event-exporter-v0.2.5-7df89f4b8f-hmzxv                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         8h
  kube-system                fluentd-gcp-scaler-54ccb89d5-hkj5v                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         8h
  kube-system                fluentd-gcp-v3.1.1-4h8nc                                     100m (5%)     1 (51%)     200Mi (3%)       500Mi (8%)     2d20h
  kube-system                heapster-64bb668f65-ptq5b                                    50m (2%)      50m (2%)    92560Ki (1%)     92560Ki (1%)   8h
  kube-system                kube-dns-5877696fb4-vvtsm                                    260m (13%)    0 (0%)      110Mi (1%)       170Mi (3%)     8h
  kube-system                kube-dns-autoscaler-85f8bdb54-wz9pq                          20m (1%)      0 (0%)      10Mi (0%)        0 (0%)         8h
  kube-system                kube-proxy-gke-production-pool-1-ce587bf0-g3mh               100m (5%)     0 (0%)      0 (0%)           0 (0%)         8h
  kube-system                l7-default-backend-fd59995cd-9fxbv                           10m (0%)      10m (0%)    20Mi (0%)        20Mi (0%)      8h
  kube-system                metrics-server-v0.3.1-57c75779f-ndkkj                        48m (2%)      143m (7%)   105Mi (1%)       355Mi (6%)     8h
  kube-system                prometheus-to-sd-84kw6                                       1m (0%)       3m (0%)     20Mi (0%)        20Mi (0%)      34h
  kube-system                stackdriver-metadata-agent-cluster-level-744579f94d-56th5    40m (2%)      0 (0%)      50Mi (0%)        0 (0%)         8h
  nginx-ingress              nginx-ingress-controller-f8f7c9b69-zkkgx                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         8h
  nginx-ingress              nginx-ingress-default-backend-55f99b74bc-smmjk               0 (0%)        0 (0%)      0 (0%)           0 (0%)         8h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests         Limits
  --------                   --------         ------
  cpu                        729m (37%)       3206m (166%)
  memory                     2717072Ki (47%)  3280272Ki (56%)
  ephemeral-storage          0 (0%)           0 (0%)
  attachable-volumes-gce-pd  0                0
Events:                      <none>

I have one URL which starts a process that runs for about 20-30 minutes. It always times out in the web browser before it finishes, but that is not a problem for the user. This link is used to manually trigger a synchronisation.

So I think the nginx-ingress controller stops working right after somebody opens this link. That is what I discovered by experimenting manually while it was happening.

kubectl --context=etingroup-production top node
NAME                                  CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
gke-production-pool-1-ce587bf0-g3mh   2001m        103%   2298Mi          40%

To sum up:

  1. I open the link https://foo.pl/bar in a web browser, which connects to the Pod and starts processing data that takes 20-30 minutes.
  2. The nginx-ingress-controller healthz check then fails, and it keeps failing until that process finishes.

How to debug it further? How to fix it?

Hmm, alternatively: could it be because nginx-ingress-controller doesn't request any CPU with the default settings, and while the data processing takes all the free CPU (because of its high limits), nginx-ingress-controller is too slow to answer healthz?

kwladyka changed the title from "[bug] nginx restart in loop about 20-30 minutes" to "nginx-ingress http://IP:10254/healthz fails constantly, because of the CPU" on Nov 7, 2019
kwladyka (Author) commented Nov 7, 2019

I can confirm that after changing the CPU limits of the data-processing pod, nginx-ingress works stably.

I think there should be a bold warning about this in the nginx-ingress documentation, the helm installation instructions, and everywhere else:

If processes take 100% of the CPU, then nginx-ingress-controller will go down, which means nobody will be able to connect to your http(s) ingresses and all your systems.

The default installation doesn't request CPU. This means that if other processes take the free CPU (because of their limits), nginx-ingress replies too slowly and http://:10254/healthz fails.

While it is not a bug, it is a serious issue. It took me about half a year to figure it out, and nobody really knew why it was happening. This is not my first issue reported about it.
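
For anyone hitting this: a minimal sketch of how a CPU request could be set through the helm values, assuming the chart exposes a controller.resources key (the values below are placeholders, not recommendations):

    controller:
      resources:
        requests:
          cpu: 100m       # placeholder: reserves some CPU shares so healthz still answers under contention
          memory: 128Mi   # placeholder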

kwladyka changed the title from "nginx-ingress http://IP:10254/healthz fails constantly, because of the CPU" to "not obviously serious issue about CPU, nginx-ingress http://IP:10254/healthz fails constantly" on Nov 7, 2019
aledbf (Member) commented Nov 7, 2019

@kwladyka what are you expecting exactly? From your previous comment, this seems to be an issue with the (lack of) resource limits in your cluster; there is nothing related to the ingress controller itself.

The default installation doesn't request CPU.

This is on purpose. You should set limits only when you get enough information about the traffic you need to handle, not before.

This means that if other processes take the free CPU (because of their limits), nginx-ingress replies too slowly and http://:10254/healthz fails.

The same thing happens to any other pod running without limits or using the BestEffort QoS class.

Also, maybe your nodes are too small for the things you are running?

kwladyka (Author) commented Nov 7, 2019

what are you expecting exactly?

  1. Write about this in the nginx-ingress documentation and the helm installation instructions.
  2. Set a minimal default CPU request so that nginx-ingress-controller can always answer http://:10254/healthz during the health check, instead of all services in the cluster failing because nginx-ingress-controller restarts constantly. It should probably be a really low value, but it should exist.
  3. Or fix it somehow so that nginx-ingress-controller doesn't restart constantly when its CPU request is 0 and the machine uses 100% of its CPU during http://:10254/healthz.

So when nginx-ingress-controller fails because of this, other pods fail too because they can't connect to foo.bar.pl, so all services start to fail and restart in a loop, and it turns into chaos really fast.

The same thing happens to any other pod running without limits or using the BestEffort QoS class.

It is very non-obvious, because all the other processes work like they should. Only nginx-ingress-controller fails, at least in my case. So it does not happen to other processes in k8s.

You should set limits

The issue is not about nginx-ingress-controller limits, but about its CPU requests and, on the other hand, the limits of other pods, which consume the free CPU whenever they need it.

When 100% of the node's CPU is consumed, nginx-ingress-controller fails immediately, because it has no CPU request and therefore takes too long to answer http://:10254/healthz (the probe timeout is 1 second, if I remember correctly).

So to me it sounds like everybody should set a CPU request for nginx-ingress-controller, or never let the pods on a node use 100% of the CPU, which sounds impossible to control.
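
Another mitigation that might help, assuming the chart version in use exposes the controller.livenessProbe / controller.readinessProbe keys, would be to relax the probe settings; a sketch with placeholder values:

    controller:
      livenessProbe:
        timeoutSeconds: 5      # default is 1s, easy to exceed under CPU starvation
        failureThreshold: 5    # tolerate more consecutive failures before a restart
      readinessProbe:
        timeoutSeconds: 5
        failureThreshold: 5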

aledbf (Member) commented Nov 10, 2019

Note: maybe using a LimitRange it would be possible to define minimum requests and avoid the 100% CPU problem: https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/cpu-constraint-namespace/#create-a-limitrange-and-a-pod
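
A minimal sketch of such a LimitRange, following the linked docs; the name, namespace, and values are placeholders:

    apiVersion: v1
    kind: LimitRange
    metadata:
      name: cpu-defaults        # placeholder name
      namespace: nginx-ingress
    spec:
      limits:
        - type: Container
          min:
            cpu: 100m           # enforce a floor, as in the linked example
          defaultRequest:
            cpu: 100m           # applied when a container doesn't set its own CPU request
          default:
            cpu: 500m           # applied when a container doesn't set its own CPU limit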
