
Calico + KIND pods unable to communicate externally #2962

Closed

sager-tech opened this issue Oct 27, 2019 · 16 comments

@sager-tech

Expected Behavior

Deploy KIND
Deploy Calico
See pods come up successfully, with the coreDNS pods able to dig and ping external hosts

Current Behavior

Newly deployed pods are unable to transition into the Ready state, and the coreDNS pods are not able to communicate externally via ping or dig.

Steps to Reproduce (for bugs)

  1. Deploy KIND
  2. Deploy Calico
  3. Deploy another pod

Logs

 [ERROR] plugin/errors: 2 7893400373289152203.6025455479212086695. HINFO: unreachable backend: read udp 10.244.1.3:38701->192.168.65.1:53: i/o timeout
 [ERROR] plugin/errors: 2 7893400373289152203.6025455479212086695. HINFO: unreachable backend: read udp 10.244.1.3:47810->192.168.65.1:53: i/o timeout
 [ERROR] plugin/errors: 2 7893400373289152203.6025455479212086695. HINFO: unreachable backend: read udp 10.244.1.3:54801->192.168.65.1:53: i/o timeout
 [ERROR] plugin/errors: 2 7893400373289152203.6025455479212086695. HINFO: unreachable backend: read udp 10.244.1.3:45085->192.168.65.1:53: i/o timeout
 [ERROR] plugin/errors: 2 7893400373289152203.6025455479212086695. HINFO: unreachable backend: read udp 10.244.1.3:34876->192.168.65.1:53: i/o timeout
 [ERROR] plugin/errors: 2 amazon.com. A: unreachable backend: read udp 10.244.1.3:58544->192.168.65.1:53: i/o timeout
 [ERROR] plugin/errors: 2 amazon.com. A: unreachable backend: read udp 10.244.1.3:45441->192.168.65.1:53: i/o timeout
 [ERROR] plugin/errors: 2 amazon.com. A: unreachable backend: read udp 10.244.1.3:51907->192.168.65.1:53: i/o timeout
 [ERROR] plugin/errors: 2 amazon.com. A: unreachable backend: read udp 10.244.1.3:36537->192.168.65.1:53: i/o timeout
 [ERROR] plugin/errors: 2 amazon.com. A: unreachable backend: read udp 10.244.1.3:49806->192.168.65.1:53: i/o timeout
; <<>> DiG 9.11.3-1ubuntu1.8-Ubuntu <<>> -t A +tries=5 +retry=5 +time=1 amazon.com
;; global options: +cmd
;; connection timed out; no servers could be reached

Your Environment

  • Calico version: Attempted with 3.0, 3.2, 3.3, 3.10, master
  • Orchestrator version (e.g. kubernetes, mesos, rkt): {Major:"1", Minor:"14", GitVersion:"v1.14.3"}
  • Operating System and version: darwin
@tmjd tmjd self-assigned this Oct 28, 2019
@tmjd
Member

tmjd commented Oct 28, 2019

@song-jiang or @neiljerram I think you've both been using KIND recently; do either of you have any suggestions or tricks for making this work? Or maybe this use case is too different from what you've been doing.

@neiljerram
Member

Can the KIND nodes communicate externally? (E.g. docker exec kind-worker apt-get update)

If no: it's not a Calico problem then.

If yes: please check that NatOutgoing is enabled in your IP pool.
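One quick way to check this (a sketch, assuming calicoctl is installed and configured for the cluster, and that the pool has the default-install name `default-ipv4-ippool`):

```shell
# Show the natOutgoing setting on the default IP pool.
# If the grep prints nothing or "natOutgoing: false", outgoing NAT
# is not enabled for pod traffic leaving the cluster.
calicoctl get ippool default-ipv4-ippool -o yaml | grep natOutgoing
```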

@neiljerram
Member

Oh hang on, I think it might just be /etc/resolv.conf. Our recent KIND work has this:

    # Fix /etc/resolv.conf in each node.
    ${KIND} get nodes | xargs -n1 -I {} docker exec {} sh -c "echo nameserver 8.8.8.8 > /etc/resolv.conf"

@sager-tech
Author

@neiljerram I ran docker exec kind-worker apt-get update and was able to see it trying to update apt-get. How would I enable NatOutgoing in the IP pool?

As an update, I deployed Calico v3.0 with the Kubernetes API datastore and it is able to bring up the deployment and the coredns pods without the nameserver or DNS resolution issues. However, any new pods I deploy in a deployment are unable to reach the Ready state, always stuck in CrashLoopBackOff. There are no logs; the only indicator I see in the description of the pod is:

Warning  BackOff    77s (x25 over 6m26s)   kubelet, kind-worker  Back-off restarting failed container

Any ideas as to why this is happening?

@tmjd
Member

tmjd commented Oct 29, 2019

I thought we had a doc for editing IP Pools but I'm not finding it. You should be able to change NATOutgoing by using calicoctl to get your IP Pool, update it, and then apply your changes.

When you query the logs for a pod you might want to try -p to see the logs from a previous run; I think that has helped me before. You could also try looking at the kubelet logs to see if there is anything useful there, but it sounds like the pod itself is exiting, so I doubt the kubelet logs will have anything useful.
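The get/edit/apply cycle described above can be sketched as follows (assuming calicoctl is configured for the cluster and the pool has the default name; the filename `pool.yaml` is just illustrative):

```shell
# Dump the current pool definition to a file.
calicoctl get ippool default-ipv4-ippool -o yaml > pool.yaml

# Edit pool.yaml so that spec contains:
#   natOutgoing: true

# Re-apply the modified pool.
calicoctl apply -f pool.yaml
```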

@sager-tech
Author

sager-tech commented Oct 29, 2019

@tmjd I did this deployment using kind as:

kind create cluster --config `find . -name deployment.yaml` --image kindest/node:v1.13.7

with deployment.yaml as:

kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
- role: control-plane
- role: worker
- role: worker
networking:
  disableDefaultCNI: True
kubeadmConfigPatches:
- |
  apiVersion: kubeadm.k8s.io/v1beta2
  kind: ClusterConfiguration
  metadata:
    name: config
  networking:
    serviceSubnet: "10.96.0.1/12"
    podSubnet: "192.168.0.0/16"

I do have calicoctl, but I do not see anything in the docs about configuring calico to point to an existing deployment. Is that possible?

@neiljerram
Member

@sager-tech

I do not see anything in the docs about configuring calico to point to an existing deployment. Is that possible?

Yes, please see https://docs.projectcalico.org/v3.10/getting-started/calicoctl/configure/kdd. If you are using KDD, you probably just need

export DATASTORE_TYPE=kubernetes
export KUBECONFIG=<path to your kubeconfig file>

and then calicoctl should connect.
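A minimal way to verify the connection after setting those variables (the kubeconfig path is an assumption; substitute your own):

```shell
export DATASTORE_TYPE=kubernetes
export KUBECONFIG=~/.kube/config   # assumed path; use your actual kubeconfig

# If calicoctl can reach the Kubernetes datastore, this lists the IP pools.
calicoctl get ippool -o wide
```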

@neiljerram
Member

@sager-tech Also, stepping back to your reported problem...

Please try to distinguish between problems with name resolution (aka DNS) and IP reachability. If ping 8.8.8.8 works, but not ping google.com, it's a name resolution problem. In that case, look at the /etc/resolv.conf in the place (i.e. host or pod) that you're pinging from.

If you can ping 8.8.8.8 from the host, but not from a pod, that indicates missing SNAT/MASQUERADE, aka NatOutgoing - i.e. when the ping request reaches 8.8.8.8, the ping response can't be routed back, because the source IP of the request is still that of the originating pod, which is a private IP that makes no sense to 8.8.8.8. (Actually in this case the request would have been dropped earlier because of an RPF check, but I hope you get the idea anyway.)

Hope that gives you a few ideas to look at...
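The distinction above can be tested directly from a pod (a sketch, assuming a pod named `myapp-pod` in the default namespace whose image includes ping and nslookup):

```shell
# Raw IP reachability: bypasses DNS entirely.
kubectl exec myapp-pod -- ping -c 1 8.8.8.8

# Name resolution: only works if DNS is also functioning.
kubectl exec myapp-pod -- nslookup google.com

# If the ping succeeds but the lookup fails, it's a DNS problem;
# if both fail, it's a routing/NAT (e.g. NatOutgoing) problem.
```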

@sager-tech
Author

@neiljerram I set up calicoctl using this strategy. I spent some more time and I can report the following:

This is the pared down deployment:

NAMESPACE     NAME                                         READY   STATUS             RESTARTS   AGE   IP           NODE                 NOMINATED NODE   READINESS GATES
default       myapp-pod                                    1/1     Running            0          33m   10.244.2.4   kind-worker          <none>           <none>
kube-system   calico-node-xq722                            2/2     Running            0          83m   172.17.0.2   kind-control-plane   <none>           <none>
kube-system   calicoctl                                    1/1     Running            0          95m   172.17.0.4   kind-worker2         <none>           <none>
kube-system   coredns-7747b9c446-w2kmh                     2/2     Running            0          24m   10.244.2.5   kind-worker          <none>           <none>
kube-system   etcd-kind-control-plane                      1/1     Running            0          98m   172.17.0.2   kind-control-plane   <none>           <none>
kube-system   kube-apiserver-kind-control-plane            1/1     Running            0          97m   172.17.0.2   kind-control-plane   <none>           <none>
kube-system   kube-controller-manager-kind-control-plane   1/1     Running            0          97m   172.17.0.2   kind-control-plane   <none>           <none>
kube-system   kube-proxy-szhjs                             1/1     Running            0          98m   172.17.0.4   kind-worker2         <none>           <none>
kube-system   kube-scheduler-kind-control-plane            1/1     Running            0          98m   172.17.0.2   kind-control-plane   <none>           <none>

From my own pod, myapp-pod, I set /etc/resolv.conf to point specifically at the coredns pod by editing the spec to:

spec:
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
      - 10.244.1.2

and the nameserver on myapp-pod shows 10.244.1.2. I can go to the host machine of the coredns pod and am able to dig and ping successfully. However, from the coredns pod I am not able to make any successful dig or curl connections, and /etc/resolv.conf on the coredns pod points to the IP of the host machine it is on.

I have checked the NetworkPolicy and it is currently set to enable all egress and ingress. I agree that it seems like a NatOutgoing issue - pod cannot talk to host, but host can talk to external world, but I am not sure where the resolution to that problem would live.

This is where I am currently stuck. I will look into your idea about SNAT/MASQUERADE.

@neiljerram
Member

@sager-tech I'm afraid your comments are still mixing up name resolution and IP reachability. Can you ping 8.8.8.8 from myapp-pod and the coredns pod?

@sager-tech
Author

@neiljerram You were correct about the name resolution and IP reachability mixup.

I am able to ping successfully from the host, but not from the pods (neither the coredns pod nor myapp-pod). I inspected the IP pool and it does have natOutgoing and ipipMode enabled:

apiVersion: projectcalico.org/v3
items:
- apiVersion: projectcalico.org/v3
  kind: IPPool
  metadata:
    creationTimestamp: 2019-10-29T20:50:50Z
    name: default-ipv4-ippool
    resourceVersion: "2039"
    uid: ca9ddcf6-fa8d-11e9-a93e-0242ac110004
  spec:
    blockSize: 26
    cidr: 192.168.0.0/16
    ipipMode: Always
    natOutgoing: true
    nodeSelector: all()
kind: IPPoolList
metadata:
  resourceVersion: "19965"

So you are correct about

that indicates missing SNAT/MASQUERADE, aka NatOutgoing

but it is enabled in the ipPool config.

I'm going through more of the Calico docs to see if there is anything about how to modify the iptables rules, because I feel it is very close to working. Any suggestions you have would be very much appreciated!
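For inspecting what NAT rules Calico has actually programmed on a KIND node, something like the following works (a sketch; assumes a node container named `kind-worker`, and the exact chain names vary by Calico version):

```shell
# List the nat-table rules inside the KIND node container and look for
# the masquerade rules Calico installs when natOutgoing is enabled.
docker exec kind-worker iptables -t nat -S | grep -i masq
```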

@neiljerram
Member

Does your local network also have addresses that match 192.168.0.0/16 ? (For home networks, this is pretty common.) If so, I wonder if there is a confusion somewhere between routing to devices on your home network, and routing to pods?

@neiljerram
Member

Oh, I think the problem is that KIND's default for the pod CIDR is 10.244.0.0/16, and Calico's default is 192.168.0.0/16, and they don't match.

Can you try again with something like this to modify the CIDR in the Calico YAML:

    wget -O - https://docs.projectcalico.org/v3.9/manifests/calico.yaml | \
	sed 's,192.168.0.0/16,10.244.0.0/16,' | \
	kubectl apply -f -

@sager-tech
Author

@neiljerram I was able to have the pods come up and ping successfully, thank you. I'm a bit confused though -- in the config passed to KIND on cluster create I specified:

kubeadmConfigPatches:
- |
  apiVersion: kubeadm.k8s.io/v1beta2
  kind: ClusterConfiguration
  metadata:
    name: config
  networking:
    serviceSubnet: "10.96.0.1/12"
    podSubnet: "192.168.0.0/16"

so it should have matched the calico CIDR manifest (also 192.168.0.0/16). Does setting it there not affect the KIND pod cidr?

@neiljerram
Member

Well, some of your output above definitely shows 10.244 pod addresses. So perhaps KIND missed processing that config for some reason, or another field needs setting, or something; but I'm afraid I don't know KIND that well yet.

Anyway, great that things seem to be working for you now.
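For reference, kind's v1alpha3 config also accepts the pod subnet directly in the top-level networking block of the Cluster object, which avoids relying on the kubeadm patch being matched. A sketch (untested against this exact kind version):

```shell
# Create the cluster with the pod CIDR set at the kind Cluster level,
# so it matches the CIDR in the Calico manifest.
cat <<EOF | kind create cluster --config /dev/stdin
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
networking:
  disableDefaultCNI: true
  podSubnet: "192.168.0.0/16"
nodes:
- role: control-plane
- role: worker
- role: worker
EOF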

@sager-tech
Author

sager-tech commented Nov 7, 2019

Thanks a lot for your help! It's working now. @neiljerram
