calico network plugin deploy failed #3223

Closed
riverzhang opened this Issue Sep 2, 2018 · 6 comments

Member

riverzhang commented Sep 2, 2018

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG

Environment:

  • Cloud provider or hardware configuration:
    none

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    CentOS 7.5

  • Version of Ansible (ansible --version):
    v2.6.1

Kubespray version (commit) (git rev-parse --short HEAD):
dee9324

Network plugin used:
calico
Copy of your inventory file:

## Configure 'ip' variable to bind kubernetes services on a

## different ip than the default iface

node2 ansible_host=147.75.42.159 ip=10.42.6.133
node3 ansible_host=147.75.42.157 ip=10.42.6.131

## configure a bastion host if your nodes are not directly reachable

bastion ansible_host=x.x.x.x ansible_user=some_user

[kube-master]
node2

[etcd]
node2

[kube-node]
node3

[k8s-cluster:children]
kube-master
kube-node

Command used to invoke ansible:

Output of ansible run:

kubectl get pod -n kube-system
NAME                                       READY   STATUS             RESTARTS   AGE
calico-kube-controllers-77f9f6495f-mp7cs   0/1     Pending            0          6m
calico-node-6xhqq                          0/1     CrashLoopBackOff   5          6m
calico-node-c8f5m                          0/1     CrashLoopBackOff   5          6m
kube-apiserver-node2                       1/1     Running            0          6m
kube-controller-manager-node2              1/1     Running            0          6m
kube-dns-6d54bdd9d6-dswfw                  0/3     Pending            0          5m
kube-proxy-qjxlt                           1/1     Running            0          6m
kube-proxy-xcn5l                           1/1     Running            0          6m
kube-scheduler-node2                       1/1     Running            0          6m
kubedns-autoscaler-76f87d889d-g56rp        0/1     Pending            0          5m
kubernetes-dashboard-786c896b97-2v7xq      0/1     Pending            0          5m
nginx-proxy-node3                          1/1     Running            0          6m

describe calico events:
Type     Reason     Age               From            Message
----     ------     ----              ----            -------
Normal   Pulled     5m (x2 over 6m)   kubelet, node3  Container image "quay.io/calico/node:v3.2.0-amd64" already present on machine
Normal   Created    5m (x2 over 6m)   kubelet, node3  Created container
Normal   Started    5m (x2 over 6m)   kubelet, node3  Started container
Normal   Killing    5m                kubelet, node3  Killing container with id docker://calico-node:Container failed liveness probe.. Container will be killed and recreated.
Warning  Unhealthy  4m (x9 over 6m)   kubelet, node3  Readiness probe failed: Get http://10.42.6.131:9099/readiness: dial tcp 10.42.6.131:9099: connect: connection refused
Warning  Unhealthy  1m (x30 over 6m)  kubelet, node3  Liveness probe failed: Get http://10.42.6.131:9099/liveness: dial tcp 10.42.6.131:9099: connect: connection refused

Anything else do we need to know:

riverzhang changed the title from Deploy kubernetes cluster failed to calico network plugin deploy failed Sep 2, 2018


grengojbo commented Sep 3, 2018

kubectl describe pod calico-node-xz7kr -n kube-system
Name:               calico-node-xz7kr
Namespace:          kube-system
Priority:           0
PriorityClassName:  <none>
Node:               node7.dc1.local/10.10.10.57
Start Time:         Mon, 03 Sep 2018 12:02:32 +0300
Labels:             controller-revision-hash=2964325039
                    k8s-app=calico-node
                    pod-template-generation=1
Annotations:        kubespray.etcd-cert/serial=E008CF25B3E0B34C
                    scheduler.alpha.kubernetes.io/critical-pod=
Status:             Running
IP:                 10.10.10.57
Controlled By:      DaemonSet/calico-node
Containers:
  calico-node:
    Container ID:   docker://153c93cda678e06cef27fc040446dc5e4ea68b8567c4b55a9c001be6c9f4977e
    Image:          quay.io/calico/node:v3.2.0-amd64
    Image ID:       docker-pullable://quay.io/calico/node@sha256:862fe34ef21a8eb05aeab72196c6c6d54a4d97de6b59f14b7cc99457124e8d35
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Mon, 03 Sep 2018 12:48:08 +0300
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 03 Sep 2018 12:46:59 +0300
      Finished:     Mon, 03 Sep 2018 12:48:07 +0300
    Ready:          False
    Restart Count:  17
    Limits:
      cpu:     300m
      memory:  500M
    Requests:
      cpu:      150m
      memory:   64M
    Liveness:   http-get http://:9099/liveness delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:  http-get http://:9099/readiness delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      ETCD_ENDPOINTS:                         <set to the key 'etcd_endpoints' of config map 'calico-config'>  Optional: false
      CALICO_NETWORKING_BACKEND:              <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
      CLUSTER_TYPE:                           <set to the key 'cluster_type' of config map 'calico-config'>    Optional: false
      CALICO_K8S_NODE_REF:                     (v1:spec.nodeName)
      CALICO_DISABLE_FILE_LOGGING:            true
      FELIX_DEFAULTENDPOINTTOHOSTACTION:      RETURN
      FELIX_IPV6SUPPORT:                      false
      FELIX_LOGSEVERITYSCREEN:                info
      FELIX_PROMETHEUSMETRICSENABLED:         false
      FELIX_PROMETHEUSMETRICSPORT:            9091
      FELIX_PROMETHEUSGOMETRICSENABLED:       true
      FELIX_PROMETHEUSPROCESSMETRICSENABLED:  true
      ETCD_CA_CERT_FILE:                      <set to the key 'etcd_ca' of config map 'calico-config'>    Optional: false
      ETCD_KEY_FILE:                          <set to the key 'etcd_key' of config map 'calico-config'>   Optional: false
      ETCD_CERT_FILE:                         <set to the key 'etcd_cert' of config map 'calico-config'>  Optional: false
      IP:                                      (v1:status.hostIP)
      NODENAME:                                (v1:spec.nodeName)
      FELIX_HEALTHENABLED:                    true
      FELIX_IGNORELOOSERPF:                   False
    Mounts:
      /calico-secrets from etcd-certs (rw)
      /lib/modules from lib-modules (ro)
      /var/lib/calico from var-lib-calico (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-tdnlj (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:
  etcd-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/calico/certs
    HostPathType:
  calico-node-token-tdnlj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  calico-node-token-tdnlj
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     :NoSchedule
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason          Age                 From                      Message
  ----     ------          ----                ----                      -------
  Normal   Created         44m (x2 over 45m)   kubelet, node7.dc1.local  Created container
  Normal   Started         44m (x2 over 45m)   kubelet, node7.dc1.local  Started container
  Normal   Killing         44m                 kubelet, node7.dc1.local  Killing container with id docker://calico-node:Container failed liveness probe.. Container will be killed and recreated.
  Warning  Unhealthy       44m (x9 over 45m)   kubelet, node7.dc1.local  Readiness probe failed: Get http://10.10.10.57:9099/readiness: dial tcp 10.10.10.57:9099: connect: connection refused
  Normal   Pulled          30m (x9 over 45m)   kubelet, node7.dc1.local  Container image "quay.io/calico/node:v3.2.0-amd64" already present on machine
  Warning  Unhealthy       15m (x76 over 45m)  kubelet, node7.dc1.local  Liveness probe failed: Get http://10.10.10.57:9099/liveness: dial tcp 10.10.10.57:9099: connect: connection refused
  Warning  BackOff         5m (x112 over 39m)  kubelet, node7.dc1.local  Back-off restarting failed container
  Normal   SandboxChanged  1m                  kubelet, node7.dc1.local  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          17s (x2 over 1m)    kubelet, node7.dc1.local  Container image "quay.io/calico/node:v3.2.0-amd64" already present on machine
  Normal   Created         17s (x2 over 1m)    kubelet, node7.dc1.local  Created container
  Warning  Unhealthy       17s (x6 over 1m)    kubelet, node7.dc1.local  Liveness probe failed: Get http://10.10.10.57:9099/liveness: dial tcp 10.10.10.57:9099: connect: connection refused
  Normal   Killing         17s                 kubelet, node7.dc1.local  Killing container with id docker://calico-node:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Started         16s (x2 over 1m)    kubelet, node7.dc1.local  Started container
  Warning  Unhealthy       6s (x8 over 1m)     kubelet, node7.dc1.local  Readiness probe failed: Get http://10.10.10.57:9099/readiness: dial tcp 10.10.10.57:9099: connect: connection refused
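
For reference, the Liveness/Readiness lines above are plain HTTP checks against Felix's health endpoint on port 9099; because calico-node runs with host networking, the pod IP is the node IP, so the kubelet dials http://10.10.10.57:9099/... as seen in the events. A minimal sketch of that probe section as it would appear in the calico-node DaemonSet, reconstructed from the values shown above (any field not shown is an assumption):

# Probe section implied by the "Liveness:"/"Readiness:" lines above.
# With hostNetwork the pod IP equals the node IP, so the kubelet dials
# the node address shown in the events. Values not shown above are assumptions.
livenessProbe:
  httpGet:
    path: /liveness
    port: 9099
  initialDelaySeconds: 10
  timeoutSeconds: 1
  periodSeconds: 10
  failureThreshold: 6
readinessProbe:
  httpGet:
    path: /readiness
    port: 9099
  timeoutSeconds: 1
  periodSeconds: 10
  failureThreshold: 3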
Collaborator

mattymo commented Sep 3, 2018

This issue should be updated to reflect kubeadm deployment is affected

magne commented Sep 4, 2018

The reason the liveness and readiness probes are failing is that Calico was upgraded to version 3.2.0 in #3140 (commit 22f911). Calico 3.2.0 has several "breaking" changes, among others that the default binding for health checks has changed from 0.0.0.0 to localhost.
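
Given that change, one possible workaround (distinct from the revert that was eventually merged) would be to rebind Felix's health endpoint to all interfaces via an extra environment variable on the calico-node container. A hedged sketch, assuming Felix's HealthHost setting (env var FELIX_HEALTHHOST) is honored in this release:

# Hypothetical DaemonSet patch: expose Felix's health endpoint on all
# interfaces again so the kubelet's HTTP probes against the node IP succeed.
# FELIX_HEALTHHOST is assumed to map to Felix's HealthHost setting; verify
# against the Calico 3.2 reference before relying on it.
spec:
  template:
    spec:
      containers:
        - name: calico-node
          env:
            - name: FELIX_HEALTHHOST
              value: "0.0.0.0"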

Member

ant31 commented Sep 4, 2018

@magne I can revert the commit.
Why would it fail in the kubeadm setup only?

wilmardo commented Sep 4, 2018

It seems I am running into this issue as well on a non-kubeadm cluster (@ant31); I just did an upgrade-cluster from 1.10.4 to 1.11.2 and the pods keep crashing.

kubectl describe output
$ kubectl describe pod calico-node-fmmtv -n kube-system
Name:               calico-node-fmmtv
Namespace:          kube-system
Priority:           0
PriorityClassName:  <none>
Node:               node4/192.168.2.133
Start Time:         Mon, 03 Sep 2018 19:22:10 +0000
Labels:             controller-revision-hash=3605690650
                    k8s-app=calico-node
                    pod-template-generation=2
Annotations:        kubespray.etcd-cert/serial=D23B7B24AE3D10C3
                    scheduler.alpha.kubernetes.io/critical-pod=
Status:             Running
IP:                 192.168.2.133
Controlled By:      DaemonSet/calico-node
Containers:
  calico-node:
    Container ID:   docker://bda6607a61fe6f93edee6c00f357b872ff868ea69b9c73cc60c0b92f708fa4bb
    Image:          quay.io/calico/node:v3.2.0-amd64
    Image ID:       docker-pullable://quay.io/calico/node@sha256:862fe34ef21a8eb05aeab72196c6c6d54a4d97de6b59f14b7cc99457124e8d35
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 04 Sep 2018 18:14:22 +0000
      Finished:     Tue, 04 Sep 2018 18:15:22 +0000
    Ready:          False
    Restart Count:  383
    Limits:
      cpu:     300m
      memory:  500M
    Requests:
      cpu:      150m
      memory:   64M
    Liveness:   http-get http://:9099/liveness delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:  http-get http://:9099/readiness delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      ETCD_ENDPOINTS:                         <set to the key 'etcd_endpoints' of config map 'calico-config'>  Optional: false
      CALICO_NETWORKING_BACKEND:              <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
      CLUSTER_TYPE:                           <set to the key 'cluster_type' of config map 'calico-config'>    Optional: false
      CALICO_K8S_NODE_REF:                     (v1:spec.nodeName)
      CALICO_DISABLE_FILE_LOGGING:            true
      FELIX_DEFAULTENDPOINTTOHOSTACTION:      RETURN
      FELIX_IPV6SUPPORT:                      false
      FELIX_LOGSEVERITYSCREEN:                info
      FELIX_PROMETHEUSMETRICSENABLED:         false
      FELIX_PROMETHEUSMETRICSPORT:            9091
      FELIX_PROMETHEUSGOMETRICSENABLED:       true
      FELIX_PROMETHEUSPROCESSMETRICSENABLED:  true
      ETCD_CA_CERT_FILE:                      <set to the key 'etcd_ca' of config map 'calico-config'>    Optional: false
      ETCD_KEY_FILE:                          <set to the key 'etcd_key' of config map 'calico-config'>   Optional: false
      ETCD_CERT_FILE:                         <set to the key 'etcd_cert' of config map 'calico-config'>  Optional: false
      IP:                                      (v1:status.hostIP)
      NODENAME:                                (v1:spec.nodeName)
      FELIX_HEALTHENABLED:                    true
      FELIX_IGNORELOOSERPF:                   False
    Mounts:
      /calico-secrets from etcd-certs (rw)
      /lib/modules from lib-modules (ro)
      /var/lib/calico from var-lib-calico (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-znhfc (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:  
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:  
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:  
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
  etcd-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/calico/certs
    HostPathType:  
  calico-node-token-znhfc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  calico-node-token-znhfc
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     :NoSchedule
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason     Age                   From            Message
  ----     ------     ----                  ----            -------
  Normal   Pulled     55m (x369 over 22h)   kubelet, node4  Container image "quay.io/calico/node:v3.2.0-amd64" already present on machine
  Warning  Unhealthy  10m (x2287 over 22h)  kubelet, node4  Liveness probe failed: Get http://192.168.2.133:9099/liveness: dial tcp 192.168.2.133:9099: connect: connection refused
  Warning  BackOff    16s (x4511 over 22h)  kubelet, node4  Back-off restarting failed container
 

I tried downgrading by reverting the values changed in commit 22f911 back to v3.1.3 in roles/download/defaults/main.yml, but the v3.1.3 tags have no -amd64 suffix; that suffix only appears from 3.2.0 onward (see the tags on quay.io).

TASK [download : container_download | Download containers if pull is required or told to always pull (all nodes)] ***
Tuesday 04 September 2018  19:33:53 +0200 (0:00:00.047)       0:00:54.605 ***** 
FAILED - RETRYING: container_download | Download containers if pull is required or told to always pull (all nodes) (4 retries left).
FAILED - RETRYING: container_download | Download containers if pull is required or told to always pull (all nodes) (3 retries left).
FAILED - RETRYING: container_download | Download containers if pull is required or told to always pull (all nodes) (2 retries left).
FAILED - RETRYING: container_download | Download containers if pull is required or told to always pull (all nodes) (1 retries left).
fatal: [node1]: FAILED! => {"attempts": 4, "changed": true, "cmd": ["/usr/bin/docker", "pull", "quay.io/calico/ctl:v3.1.3-amd64"], "delta": "0:00:44.374214", "end": "2018-09-04 17:37:58.072701", "msg": "non-zero return code", "rc": 1, "start": "2018-09-04 17:37:13.698487", "stderr": "Error response from daemon: manifest for quay.io/calico/ctl:v3.1.3-amd64 not found", "stderr_lines": ["Error response from daemon: manifest for quay.io/calico/ctl:v3.1.3-amd64 not found"], "stdout": "", "stdout_lines": []}

So I reverted the tag variables back to the way they were:

calicoctl_image_tag: "{{ calico_ctl_version }}"
calico_node_image_tag: "{{ calico_version }}"
calico_cni_image_tag: "{{ calico_cni_version }}"
calico_policy_image_tag: "{{ calico_policy_version }}"

Now everything is healthy again, so @magne is correct: the issue seems to be v3.2.0.
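
For completeness, the corresponding version variables in roles/download/defaults/main.yml would also need to point at the pre-3.2.0 release alongside the tag revert above; a sketch with assumed values (check the commit being reverted, 22f911, for the exact ones):

# Assumed pre-3.2.0 values; confirm against the repository history.
calico_version: "v3.1.3"
calico_ctl_version: "v3.1.3"
calico_cni_version: "v3.1.3"
calico_policy_version: "v3.1.3"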

Member

ant31 commented Sep 4, 2018

@wilmardo thx, could you push/PR the tag and version revert?

wilmardo added a commit to wilmardo/kubespray that referenced this issue Sep 4, 2018

wilmardo added a commit to wilmardo/kubespray that referenced this issue Sep 4, 2018

ant31 added a commit to ant31/kubespray that referenced this issue Sep 5, 2018

ant31 closed this in #3244 Sep 5, 2018

ant31 added a commit that referenced this issue Sep 5, 2018

Merge pull request #3244 from ant31/calico31
Reverts calico update to 3.2.0, fixes #3223