
pod calico-node on worker nodes with 'CrashLoopBackOff' #2720

Closed
ekc opened this issue Jul 12, 2019 · 14 comments
@ekc ekc commented Jul 12, 2019

Expected Behavior

Pods calico-node on worker nodes should have status 'Running'

Current Behavior

Pods calico-node on worker nodes are in state 'CrashLoopBackOff'

vagrant@k8s-master:~$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                       READY   STATUS             RESTARTS   AGE
kube-system   calico-kube-controllers-59f54d6bbc-w4w6f   1/1     Running            1          29h
kube-system   calico-node-86mdg                          0/1     CrashLoopBackOff   25         29h
kube-system   calico-node-hzcsh                          0/1     CrashLoopBackOff   26         29h
kube-system   calico-node-q267c                          1/1     Running            1          29h
kube-system   coredns-5c98db65d4-m8wls                   1/1     Running            1          29h
kube-system   coredns-5c98db65d4-vdp4f                   1/1     Running            1          29h
kube-system   etcd-k8s-master                            1/1     Running            1          29h
kube-system   kube-apiserver-k8s-master                  1/1     Running            1          29h
kube-system   kube-controller-manager-k8s-master         1/1     Running            1          29h
kube-system   kube-proxy-6f6q9                           1/1     Running            1          29h
kube-system   kube-proxy-prpqv                           1/1     Running            1          29h
kube-system   kube-proxy-qds8x                           1/1     Running            1          29h
kube-system   kube-scheduler-k8s-master                  1/1     Running            1          29h

Steps to Reproduce (for bugs)

Basically, I followed the article from the Kubernetes blog, Kubernetes Setup Using Ansible and Vagrant, with minor modifications (replaced the bento/ubuntu-16.04 box with generic/ubuntu1604, and used Calico v3.8 instead of v3.4). See the gists here. Note that I also tried the manual setup from Installing with the Kubernetes API datastore—50 nodes or less and got the same error.
In short, I ran the following on the master node:
1. kubeadm init --apiserver-advertise-address="192.168.50.10" --apiserver-cert-extra-sans="192.168.50.10" --node-name k8s-master --pod-network-cidr=192.168.0.0/16
2. Populate ~/.kube/config from /etc/kubernetes/admin.conf
3. curl https://docs.projectcalico.org/v3.8/manifests/calico.yaml -O
4. kubectl apply -f calico.yaml
Then I joined the worker nodes with kubeadm join 192.168.50.10:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>

Context

This issue seems to prevent me from scheduling workloads on the worker nodes.
Running kubectl describe pod -n kube-system calico-node-86mdg shows

Liveness probe failed: Get http://localhost:9099/liveness: dial tcp 127.0.0.1:9099: connect: connection refused

Here is the actual output

vagrant@k8s-master:~$ kubectl describe pod -n kube-system calico-node-86mdg
Name:                 calico-node-86mdg
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 node-2/192.168.50.12
Start Time:           Wed, 10 Jul 2019 15:58:22 -0700
Labels:               controller-revision-hash=844ddd97c6
                      k8s-app=calico-node
                      pod-template-generation=1
Annotations:          scheduler.alpha.kubernetes.io/critical-pod:
Status:               Running
IP:                   192.168.50.12
Controlled By:        DaemonSet/calico-node
Init Containers:
  upgrade-ipam:
    Container ID:  docker://b53dcaaf8a7cd71b242573c35ab654c83dc5daf5d7a10de1cb42623fe3fca567
    Image:         calico/cni:v3.8.0
    Image ID:      docker-pullable://calico/cni@sha256:decba0501ab0658e6e7da2f5625f1eabb8aba5690f9206caba3bf98caca5094c
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/calico-ipam
      -upgrade
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 11 Jul 2019 20:31:28 -0700
      Finished:     Thu, 11 Jul 2019 20:31:28 -0700
    Ready:          True
    Restart Count:  1
    Environment:
      KUBERNETES_NODE_NAME:        (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:  <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/lib/cni/networks from host-local-net-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-2t8lm (ro)
  install-cni:
    Container ID:  docker://e374bb79296a23e83062d9d62cf8ea684e24aa3b634d1ec4948528672a9d18c7
    Image:         calico/cni:v3.8.0
    Image ID:      docker-pullable://calico/cni@sha256:decba0501ab0658e6e7da2f5625f1eabb8aba5690f9206caba3bf98caca5094c
    Port:          <none>
    Host Port:     <none>
    Command:
      /install-cni.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 11 Jul 2019 20:31:29 -0700
      Finished:     Thu, 11 Jul 2019 20:31:29 -0700
    Ready:          True
    Restart Count:  0
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      CNI_NETWORK_CONFIG:    <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
      KUBERNETES_NODE_NAME:   (v1:spec.nodeName)
      CNI_MTU:               <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      SLEEP:                 false
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-2t8lm (ro)
  flexvol-driver:
    Container ID:   docker://27fb9711cc45fa2fc07ce9f53a9619329e2fd544df502320c1f11b36f0a9a0e0
    Image:          calico/pod2daemon-flexvol:v3.8.0
    Image ID:       docker-pullable://calico/pod2daemon-flexvol@sha256:6ec8b823e5ce3440318edfcdd2ab8b6660110782713f24f53dac5a3c227afb11
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 11 Jul 2019 20:31:30 -0700
      Finished:     Thu, 11 Jul 2019 20:31:30 -0700
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-2t8lm (ro)
Containers:
  calico-node:
    Container ID:   docker://01ba662b148242659d4e2f0e43098efc70aefbdd1c0cdaf0619e7410853e2d88
    Image:          calico/node:v3.8.0
    Image ID:       docker-pullable://calico/node@sha256:6679ccc9f19dba3eb084db991c788dc9661ad3b5d5bafaa3379644229dca6b05
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 11 Jul 2019 21:47:46 -0700
      Finished:     Thu, 11 Jul 2019 21:48:56 -0700
    Ready:          False
    Restart Count:  29
    Requests:
      cpu:      250m
    Liveness:   http-get http://localhost:9099/liveness delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:  exec [/bin/calico-node -bird-ready -felix-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      DATASTORE_TYPE:                     kubernetes
      WAIT_FOR_DATASTORE:                 true
      NODENAME:                            (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:          <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
      CLUSTER_TYPE:                       k8s,bgp
      IP:                                 autodetect
      CALICO_IPV4POOL_IPIP:               Always
      FELIX_IPINIPMTU:                    <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      CALICO_IPV4POOL_CIDR:               192.168.0.0/16
      CALICO_DISABLE_FILE_LOGGING:        true
      FELIX_DEFAULTENDPOINTTOHOSTACTION:  ACCEPT
      FELIX_IPV6SUPPORT:                  false
      FELIX_LOGSEVERITYSCREEN:            info
      FELIX_HEALTHENABLED:                true
    Mounts:
      /lib/modules from lib-modules (ro)
      /run/xtables.lock from xtables-lock (rw)
      /var/lib/calico from var-lib-calico (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/nodeagent from policysync (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-2t8lm (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:
  host-local-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/networks
    HostPathType:
  policysync:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/nodeagent
    HostPathType:  DirectoryOrCreate
  flexvol-driver-host:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
    HostPathType:  DirectoryOrCreate
  calico-node-token-2t8lm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  calico-node-token-2t8lm
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     :NoSchedule
                 :NoExecute
                 CriticalAddonsOnly
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/network-unavailable:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type     Reason     Age                   From             Message
  ----     ------     ----                  ----             -------
  Warning  BackOff    13m (x207 over 73m)   kubelet, node-2  Back-off restarting failed container
  Warning  Unhealthy  8m4s (x132 over 77m)  kubelet, node-2  Liveness probe failed: Get http://localhost:9099/liveness: dial tcp 127.0.0.1:9099: connect: connection refused
  Normal   Pulled     3m (x23 over 78m)     kubelet, node-2  Container image "calico/node:v3.8.0" already present on machine

Your Environment

  • Calico version: 3.8
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes 1.15.0
  • Operating System and version: Ubuntu 16.04.6 LTS (Xenial Xerus)
  • Link to your project (optional): none (see the gist with the Vagrant/Ansible sources)
@rafaelvanoni rafaelvanoni (Member) commented Jul 15, 2019

Could you post the logs for that particular node? kubectl logs -n kube-system calico-node-####

@rafaelvanoni rafaelvanoni self-assigned this Jul 15, 2019
@tmjd tmjd (Member) commented Jul 15, 2019

Because you're using Vagrant, I'm wondering if you need to change the IP autodetection method. I've seen this before with Vagrant, where the hosts have multiple interfaces and Calico chooses the wrong one. Here is a link to the reference docs for autodetection: https://docs.projectcalico.org/v3.8/reference/node/configuration#interfaceinterface-regex.

I'm not sure whether you will be able to download the Calico manifest, update it, and then provide it to Ansible, so you could probably just do the installation as you have been, then run kubectl edit -n kube-system ds calico-node and add the env var IP_AUTODETECTION_METHOD with the value interface=eth.* (with the proper interface prefix).
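Something like this in the calico-node container's env section (a sketch only; eth1 is just a guess at the Vagrant private-network interface, so adjust the name or regex to match your boxes):

kubectl edit -n kube-system ds calico-node
# then, under the calico-node container:
#        env:
#          - name: IP_AUTODETECTION_METHOD
#            value: "interface=eth1"   # or a regex such as interface=eth.*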

@ekc ekc (Author) commented Jul 17, 2019

Hello @rafaelvanoni,

I am new here, and this is more likely to be my own mistake; I just cannot identify where. So please bear with me. :-)

The logs of both worker pods look similar to the output below:

vagrant@k8s-master:~$ kubectl log -n kube-system calico-node-hzcsh
log is DEPRECATED and will be removed in a future version. Use logs instead.
2019-07-17 06:03:55.019 [INFO][8] startup.go 256: Early log level set to info
2019-07-17 06:03:55.019 [INFO][8] startup.go 272: Using NODENAME environment for node name
2019-07-17 06:03:55.019 [INFO][8] startup.go 284: Determined node name: node-1
2019-07-17 06:03:55.020 [INFO][8] k8s.go 228: Using Calico IPAM
2019-07-17 06:03:55.020 [INFO][8] startup.go 316: Checking datastore connection
2019-07-17 06:04:25.021 [INFO][8] startup.go 331: Hit error connecting to datastore - retry error=Get https://10.96.0.1:443/api/v1/nodes/foo: dial tcp 10.96.0.1:443: i/o timeout
2019-07-17 06:04:56.022 [INFO][8] startup.go 331: Hit error connecting to datastore - retry error=Get https://10.96.0.1:443/api/v1/nodes/foo: dial tcp 10.96.0.1:443: i/o timeout

Additionally, here is the output of a curl command against that URL:

vagrant@k8s-master:~$ curl -v --insecure https://10.96.0.1:443/api/v1/nodes/foo
*   Trying 10.96.0.1...
* Connected to 10.96.0.1 (10.96.0.1) port 443 (#0)
* found 148 certificates in /etc/ssl/certs/ca-certificates.crt
* found 592 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
*        server certificate verification SKIPPED
*        server certificate status verification SKIPPED
*        common name: kube-apiserver (matched)
*        server certificate expiration date OK
*        server certificate activation date OK
*        certificate public key: RSA
*        certificate version: #3
*        subject: CN=kube-apiserver
*        start date: Wed, 10 Jul 2019 22:47:12 GMT
*        expire date: Thu, 09 Jul 2020 22:47:12 GMT
*        issuer: CN=kubernetes
*        compression: NULL
* ALPN, server accepted to use http/1.1
> GET /api/v1/nodes/foo HTTP/1.1
> Host: 10.96.0.1
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 403 Forbidden
< Content-Type: application/json
< X-Content-Type-Options: nosniff
< Date: Wed, 17 Jul 2019 06:15:04 GMT
< Content-Length: 331
<
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "nodes \"foo\" is forbidden: User \"system:anonymous\" cannot get resource \"nodes\" in API group \"\" at the cluster scope",
  "reason": "Forbidden",
  "details": {
    "name": "foo",
    "kind": "nodes"
  },
  "code": 403
* Connection #0 to host 10.96.0.1 left intact
}
@ekc ekc (Author) commented Jul 17, 2019

Hello Erik - @tmjd,
Thanks for your kind help. I am honored to have help from both a contributor (Rafael, previously) and an active member like you.
You are probably referring to #2042, because I also got connection refused on port 9099. I tried your solution before (manually, on a pristine cluster) and still got the same connection refused on localhost:9099 on the worker nodes.
Could you tell me whether it is normal for worker nodes to try to reach port 9099 on themselves? (With netstat -an and lsof -i :9099, I can only see calico-node listening on 9099 on the master node.)
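These are the kinds of checks I mean, run on each node (the last line probes the liveness URL from the pod spec directly):

sudo netstat -an | grep 9099
sudo lsof -i :9099
curl -v http://localhost:9099/liveness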
Ek

@tmjd tmjd (Member) commented Jul 17, 2019

I would imagine the calico-node on your master is your one healthy calico-node pod, and that would be why you see it listening there.
The logs you included from calico-node show what looks to be the problem: they are unable to reach 10.96.0.1 (dial tcp 10.96.0.1:443: i/o timeout). That indicates that kube-proxy on the node is not working correctly; kube-proxy is responsible for setting up the Kubernetes service addresses, and I believe 10.96.0.1 is the kubernetes service (API server) cluster IP. You should check the kube-proxy logs on one of the nodes where calico-node is in CrashLoopBackOff. To see which nodes the pods are running on, you can use kubectl get pods --all-namespaces -o wide.
Another thing you can try is running that same curl command on one of the nodes; I'm guessing it would currently fail.
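Concretely, something like this (the pod names will differ on your cluster):

kubectl get pods --all-namespaces -o wide
kubectl logs -n kube-system kube-proxy-<id-of-pod-on-that-node>
# then, from the affected worker node itself:
curl -k https://10.96.0.1:443/api/v1/nodes/foo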

@ekc ekc (Author) commented Jul 18, 2019

Hello Erik - @tmjd,
Thank you so much for your kind advice.
Here is the output from kubectl get pods --all-namespaces -o wide:

vagrant@k8s-master:~$ kubectl get pods --all-namespaces -o wide
NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE    IP                NODE         NOMINATED NODE   READINESS GATES
kube-system   calico-kube-controllers-59f54d6bbc-w4w6f   1/1     Running   4          7d1h   192.168.235.207   k8s-master   <none>           <none>
kube-system   calico-node-86mdg                          0/1     Running   649        7d     192.168.50.12     node-2       <none>           <none>
kube-system   calico-node-hzcsh                          0/1     Running   649        7d     192.168.50.11     node-1       <none>           <none>
kube-system   calico-node-q267c                          1/1     Running   4          7d1h   192.168.50.10     k8s-master   <none>           <none>
kube-system   coredns-5c98db65d4-m8wls                   1/1     Running   4          7d1h   192.168.235.205   k8s-master   <none>           <none>
kube-system   coredns-5c98db65d4-vdp4f                   1/1     Running   4          7d1h   192.168.235.206   k8s-master   <none>           <none>
kube-system   etcd-k8s-master                            1/1     Running   4          7d1h   192.168.50.10     k8s-master   <none>           <none>
kube-system   kube-apiserver-k8s-master                  1/1     Running   6          7d1h   192.168.50.10     k8s-master   <none>           <none>
kube-system   kube-controller-manager-k8s-master         1/1     Running   4          7d1h   192.168.50.10     k8s-master   <none>           <none>
kube-system   kube-proxy-6f6q9                           1/1     Running   5          7d     192.168.50.11     node-1       <none>           <none>
kube-system   kube-proxy-prpqv                           1/1     Running   4          7d1h   192.168.50.10     k8s-master   <none>           <none>
kube-system   kube-proxy-qds8x                           1/1     Running   4          7d     192.168.50.12     node-2       <none>           <none>
kube-system   kube-scheduler-k8s-master                  1/1     Running   4          7d1h   192.168.50.10     k8s-master   <none>           <none>

The kube-proxy log on worker node-1 looks normal...

vagrant@k8s-master:~$ kubectl log -n kube-system kube-proxy-6f6q9
log is DEPRECATED and will be removed in a future version. Use logs instead.
W0717 23:50:25.302498       1 server_others.go:249] Flag proxy-mode="" unknown, assuming iptables proxy
I0717 23:50:25.326295       1 server_others.go:143] Using iptables Proxier.
I0717 23:50:25.327449       1 server.go:534] Version: v1.15.0
I0717 23:50:25.352892       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I0717 23:50:25.352961       1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0717 23:50:25.355320       1 conntrack.go:83] Setting conntrack hashsize to 32768
I0717 23:50:25.355505       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I0717 23:50:25.355548       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I0717 23:50:25.355885       1 config.go:96] Starting endpoints config controller
I0717 23:50:25.355943       1 controller_utils.go:1029] Waiting for caches to sync for endpoints config controller
I0717 23:50:25.357434       1 config.go:187] Starting service config controller
I0717 23:50:25.357486       1 controller_utils.go:1029] Waiting for caches to sync for service config controller
I0717 23:50:25.456160       1 controller_utils.go:1036] Caches are synced for endpoints config controller
I0717 23:50:25.457652       1 controller_utils.go:1036] Caches are synced for service config controller

But running the curl command on worker node-1 fails (it times out after about 2 minutes).

vagrant@node-1:~$ date; curl -v -insecure https://10.96.0.1:443/api/v1/nodes/foo ; date
Wed Jul 17 17:01:47 PDT 2019
* Couldn't find host 10.96.0.1 in the .netrc file; using defaults
*   Trying 10.96.0.1...
* connect to 10.96.0.1 port 443 failed: Connection timed out
* Failed to connect to 10.96.0.1 port 443: Connection timed out
* Closing connection 0
Wed Jul 17 17:03:54 PDT 2019

Following this link, I dumped the iptables KUBE-SERVICES chain just in case it is useful (sorry, iptables is a bit too geeky for me :-)

vagrant@k8s-master:~$ kubectl exec -it kube-proxy-6f6q9 --namespace kube-system -- /bin/sh
#
#
# iptables -t nat -L KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
KUBE-MARK-MASQ  tcp  -- !192.168.0.0/16       10.96.0.10           /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-SVC-ERIFXISQEP7F7OF4  tcp  --  anywhere             10.96.0.10           /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-MARK-MASQ  tcp  -- !192.168.0.0/16       10.96.0.10           /* kube-system/kube-dns:metrics cluster IP */ tcp dpt:9153
KUBE-SVC-JD5MR3NA4I4DYORP  tcp  --  anywhere             10.96.0.10           /* kube-system/kube-dns:metrics cluster IP */ tcp dpt:9153
KUBE-MARK-MASQ  tcp  -- !192.168.0.0/16       10.96.0.1            /* default/kubernetes:https cluster IP */ tcp dpt:https
KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  anywhere             10.96.0.1            /* default/kubernetes:https cluster IP */ tcp dpt:https
KUBE-MARK-MASQ  udp  -- !192.168.0.0/16       10.96.0.10           /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  anywhere             10.96.0.10           /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
KUBE-NODEPORTS  all  --  anywhere             anywhere             /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

I have no idea how I should proceed. Please help.
Ek

@tmjd tmjd (Member) commented Jul 18, 2019

Could you run iptables-save, capture the output, and include it?
What I'd look for in there is a rule that directs traffic destined for 10.96.0.1 to the actual address of your kube-apiserver (the master IP).
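Something like this should surface it quickly (assuming the master is 192.168.50.10 and the default API server port 6443):

sudo iptables-save -t nat | grep -E '10\.96\.0\.1/32|6443'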

@ekc ekc (Author) commented Jul 18, 2019

Hello Erik - @tmjd,
My apologies for the late response.
Here is the output of the iptables-save command:

# iptables-save
# Generated by iptables-save v1.6.0 on Thu Jul 18 17:31:01 2019
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [1:60]
:POSTROUTING ACCEPT [2:120]
:DOCKER - [0:0]
:KUBE-MARK-DROP - [0:0]
:KUBE-MARK-MASQ - [0:0]
:KUBE-NODEPORTS - [0:0]
:KUBE-POSTROUTING - [0:0]
:KUBE-SEP-5IM7UFGZKAN7IH4H - [0:0]
:KUBE-SEP-6MNDFRE4G43UNLF2 - [0:0]
:KUBE-SEP-DKMZETKCWZBISSRU - [0:0]
:KUBE-SEP-FY6VP5NY42HP46QB - [0:0]
:KUBE-SEP-PFW6AFL5VADK64IP - [0:0]
:KUBE-SEP-QAI6IJGEKZYOMWSX - [0:0]
:KUBE-SEP-YOIRCAEDUNHFNIOV - [0:0]
:KUBE-SERVICES - [0:0]
:KUBE-SVC-ERIFXISQEP7F7OF4 - [0:0]
:KUBE-SVC-JD5MR3NA4I4DYORP - [0:0]
:KUBE-SVC-NPX46M4PTMTKRN6Y - [0:0]
:KUBE-SVC-TCOU7JCQXEZGVUNU - [0:0]
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
-A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE
-A KUBE-SEP-5IM7UFGZKAN7IH4H -s 192.168.235.209/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-5IM7UFGZKAN7IH4H -p tcp -m tcp -j DNAT --to-destination 192.168.235.209:9153
-A KUBE-SEP-6MNDFRE4G43UNLF2 -s 192.168.235.208/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-6MNDFRE4G43UNLF2 -p udp -m udp -j DNAT --to-destination 192.168.235.208:53
-A KUBE-SEP-DKMZETKCWZBISSRU -s 192.168.235.208/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-DKMZETKCWZBISSRU -p tcp -m tcp -j DNAT --to-destination 192.168.235.208:9153
-A KUBE-SEP-FY6VP5NY42HP46QB -s 192.168.235.209/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-FY6VP5NY42HP46QB -p tcp -m tcp -j DNAT --to-destination 192.168.235.209:53
-A KUBE-SEP-PFW6AFL5VADK64IP -s 192.168.235.209/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-PFW6AFL5VADK64IP -p udp -m udp -j DNAT --to-destination 192.168.235.209:53
-A KUBE-SEP-QAI6IJGEKZYOMWSX -s 192.168.50.10/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-QAI6IJGEKZYOMWSX -p tcp -m tcp -j DNAT --to-destination 192.168.50.10:6443
-A KUBE-SEP-YOIRCAEDUNHFNIOV -s 192.168.235.208/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-YOIRCAEDUNHFNIOV -p tcp -m tcp -j DNAT --to-destination 192.168.235.208:53
-A KUBE-SERVICES ! -s 192.168.0.0/16 -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-SVC-ERIFXISQEP7F7OF4
-A KUBE-SERVICES ! -s 192.168.0.0/16 -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:metrics cluster IP" -m tcp --dport 9153 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:metrics cluster IP" -m tcp --dport 9153 -j KUBE-SVC-JD5MR3NA4I4DYORP
-A KUBE-SERVICES ! -s 192.168.0.0/16 -d 10.96.0.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.96.0.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-SVC-TCOU7JCQXEZGVUNU
-A KUBE-SERVICES ! -s 192.168.0.0/16 -d 10.96.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.96.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS
-A KUBE-SVC-ERIFXISQEP7F7OF4 -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-YOIRCAEDUNHFNIOV
-A KUBE-SVC-ERIFXISQEP7F7OF4 -j KUBE-SEP-FY6VP5NY42HP46QB
-A KUBE-SVC-JD5MR3NA4I4DYORP -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-DKMZETKCWZBISSRU
-A KUBE-SVC-JD5MR3NA4I4DYORP -j KUBE-SEP-5IM7UFGZKAN7IH4H
-A KUBE-SVC-NPX46M4PTMTKRN6Y -j KUBE-SEP-QAI6IJGEKZYOMWSX
-A KUBE-SVC-TCOU7JCQXEZGVUNU -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-6MNDFRE4G43UNLF2
-A KUBE-SVC-TCOU7JCQXEZGVUNU -j KUBE-SEP-PFW6AFL5VADK64IP
COMMIT
# Completed on Thu Jul 18 17:31:01 2019
# Generated by iptables-save v1.6.0 on Thu Jul 18 17:31:01 2019
*filter
:INPUT ACCEPT [369:48999]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [318:19532]
:DOCKER - [0:0]
:DOCKER-ISOLATION-STAGE-1 - [0:0]
:DOCKER-ISOLATION-STAGE-2 - [0:0]
:DOCKER-USER - [0:0]
:KUBE-EXTERNAL-SERVICES - [0:0]
:KUBE-FIREWALL - [0:0]
:KUBE-FORWARD - [0:0]
:KUBE-SERVICES - [0:0]
-A INPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A INPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes externally-visible service portals" -j KUBE-EXTERNAL-SERVICES
-A INPUT -j KUBE-FIREWALL
-A FORWARD -m comment --comment "kubernetes forwarding rules" -j KUBE-FORWARD
-A FORWARD -m conntrack --ctstate NEW -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A OUTPUT -m conntrack --ctstate NEW -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -j KUBE-FIREWALL
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
-A DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
-A DOCKER-USER -j RETURN
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
-A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
-A KUBE-FORWARD -s 192.168.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-FORWARD -d 192.168.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
COMMIT
# Completed on Thu Jul 18 17:31:01 2019
@tmjd tmjd (Member) commented Jul 18, 2019

From that, it looks like the proper service DNAT rules are present, assuming your master is at 192.168.50.10. Does a curl to 192.168.50.10:6443 (using the same path as previously) work?

If that works, I'm a little at a loss as to what is wrong here. I expect the curl above will work, because the kubelet running on that same node is reaching 192.168.50.10.

I noticed that the pod IP CIDR 192.168.0.0/16 overlaps with the addresses of your nodes. While I don't think that should cause a problem, maybe it is? If the curl above works, then I'd consider changing the pod IP CIDR and seeing whether that is the fix.
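If you go that route, the change is roughly this (a sketch; 10.244.0.0/16 is just an example range that does not overlap your 192.168.50.x nodes):

kubeadm init ... --pod-network-cidr=10.244.0.0/16
# and in calico.yaml, set the pool to match before applying it:
#   - name: CALICO_IPV4POOL_CIDR
#     value: "10.244.0.0/16"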

@ekc ekc (Author) commented Jul 19, 2019

Hello Erik - @tmjd,
The curl does not time out on node-1, i.e.

vagrant@node-1:~$ curl -v -insecure https://192.168.50.10:6443/api/v1/nodes/foo
* Couldn't find host 192.168.50.10 in the .netrc file; using defaults
*   Trying 192.168.50.10...
* Connected to 192.168.50.10 (192.168.50.10) port 6443 (#0)
* found 148 certificates in /etc/ssl/certs/ca-certificates.crt
* found 592 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none
* Closing connection 0

Out of curiosity: my node IP address subnet is /24, as per

vagrant@node-1:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:72:c3:25 brd ff:ff:ff:ff:ff:ff
    inet 192.168.121.139/24 brd 192.168.121.255 scope global eth0
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:3f:47:fc brd ff:ff:ff:ff:ff:ff
    inet 192.168.50.11/24 brd 192.168.50.255 scope global eth1
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:c3:d9:82:80 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever

Isn't the default pod CIDR of 192.168.0.0/16 a bit greedy here? Any idea why? Does this mean that, by default, node IPs should not be in the 192.168.x.x range?

I will try changing the pod CIDR as per your observation and share the result. It will take me some time to find a free slot to give it a try, but it is very high on my to-do list. :-)

Many thanks,
Ek

@rafaelvanoni rafaelvanoni assigned tmjd and unassigned rafaelvanoni Jul 24, 2019
@Remit Remit commented Aug 14, 2019

@tmjd I had the same issue as in the original message; your solution with the CIDR resolved it. Thanks!

@tmjd tmjd (Member) commented Oct 9, 2019

@ekc We always recommend that the pod CIDR and the host CIDR not overlap. I would not be surprised if that is the problem here.
As for the size of the pod CIDR, feel free to change it as you see fit. The 'block' size that is handed out to each node is /26, so you should try to make sure the CIDR you use has enough /26 blocks for the number of nodes you will have. This isn't a hard requirement, though: if one node runs out, it can get another block, or 'borrow' IPs from other blocks if no more blocks exist.
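As a rough sizing check: a pool of prefix length p contains 2^(26 - p) /26 blocks, so a /16 pool yields 2^10 = 1024 blocks of 64 addresses each, comfortably enough for a cluster of around a thousand nodes.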

@amemni amemni commented Dec 15, 2019

I had the exact same issue as @ekc, with quite similar logs and environment versions. I was running Ubuntu boxes on VirtualBox with two interfaces each, and my issue was that etcd by default allows connections only from the CIDR of the first interface with a default gateway (the NAT one), so it was giving a "connection refused". I ended up doing the following, which solved it (a sketch of the commands follows the list):

  1. resetting my cluster and adding --apiserver-advertise-address=<IP of the host-only interface (enp0s8)> to the kubeadm init command, as explained in this issue,
  2. setting IP_AUTODETECTION_METHOD to interface=enp0s8 in calico.yaml, as explained by @tmjd here.
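Roughly, with placeholders for the values from my setup (adjust the address, pod CIDR, and interface name for yours):

sudo kubeadm reset
sudo kubeadm init --apiserver-advertise-address=<enp0s8 IP> --pod-network-cidr=<your pod CIDR>
# in calico.yaml, in the calico-node container's env section:
#   - name: IP_AUTODETECTION_METHOD
#     value: "interface=enp0s8"
kubectl apply -f calico.yaml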
@caseydavenport caseydavenport (Member) commented Jan 2, 2020

Closing since this appears to be resolved.
