
NodePort on new node doesn't work without reboot #2737

Closed
cchanley2003 opened this issue Jul 20, 2019 · 7 comments
cchanley2003 commented Jul 20, 2019

Running Kubernetes 1.12.3 and Calico 3.8.0 (Typha manifest). NodePorts don't seem to work when a new node joins the cluster. When I curl an exposed node port that fronts a simple HTTP server, the call hangs. If I reboot that node, the node port works as expected. This is running on Red Hat 7.6. The same steps on a cluster using Weave's CNI don't show this behavior.

Expected Behavior

Expected behavior is that when a new node is marked Ready then NodePort services would work without having to reboot the machine.

Current Behavior

When a new node joins the cluster, the node port doesn't seem to forward the request. The port is listening, but calls to it hang indefinitely. Rebooting the node fixes the erroneous behavior. Internal (pod-to-pod) cluster communication appears to work fine. Because everything works if I stand up an otherwise identical cluster with a different CNI, I don't believe kube-proxy is at fault. I confirmed that iptables is empty (other than the standard Docker entries) before a node joins.
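The pre-join check is just the stock iptables tooling, something along these lines (output omitted):

  iptables -S FORWARD    # shows the chain's rules and default policy
  iptables-save -c       # full dump with counters; only the standard DOCKER chains are present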

Possible Solution

None at this time

Steps to Reproduce (for bugs)

Any additional debugging steps would be appreciated.

Right now my steps are:

  1. Stand up a Kubernetes HA cluster with version 1.12.3 using kubeadm
  2. Install https://docs.projectcalico.org/v3.8/manifests/calico-typha.yaml
  3. Deploy a simple web server (httpd for instance) -- deployment + node port service (see the sketch after this list)
  4. Add new nodes to the cluster
  5. Node port calls hang when directed at any node that hasn't been rebooted since joining (including master nodes)
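A minimal imperative version of step 3 would be something like the following (image and names are placeholders):

  kubectl create deployment httpd --image=httpd:2.4
  kubectl expose deployment httpd --type=NodePort --port=80
  kubectl get svc httpd    # note the allocated node port, then curl <node-ip>:<node-port>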

Your Environment

  • Calico version: 3.8.0 (calico-typha manifest, Kubernetes API datastore)
  • Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes 1.12.3
  • Operating System and version: Red Hat 7.6
tmjd commented Jul 22, 2019

I've never heard of a problem like this before, so I don't know of anything to check out immediately.
If you could collect the output of iptables-save -c, that may be helpful; please do that before you've rebooted a new node but after you've generated some traffic that reproduces the problem. (The -c includes packet counts, so we can see if iptables is dropping traffic.)

Is NetworkManager running on your nodes? I know there have been problems before with it but none that I remember with this type of behavior.

Please check the kubelet (for CNI errors) and calico-node logs before rebooting a new node to see if there are any problems setting up the networking for pods (or anything odd).

It may also be useful to check calicoctl node status (this must be run on the node with the problem) to see that all BGP sessions are able to be set up, though I would expect problems there to show up in the readiness of calico-node.
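Concretely, something like the following on the newly joined (not yet rebooted) node; the output file name and the calico-node pod name are placeholders:

  iptables-save -c > iptables-save-c.txt                  # rules with packet/byte counters
  calicoctl node status                                   # BGP session state (run as root on the node)
  journalctl -u kubelet --no-pager | grep -i cni          # kubelet CNI-related messages
  kubectl -n kube-system logs <calico-node-pod-on-node>   # calico-node logs for that node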

cchanley2003 commented Jul 23, 2019

I'll work on getting the iptables output. Unfortunately this cluster is behind an air gap. The behavior seems similar to #875. We are running Docker 18.06.

The suggested workaround from that issue works here as well: if I run iptables -P FORWARD ACCEPT, the problem disappears on the affected nodes, so this fixes the issue without a reboot.
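For reference, the check and workaround on an affected node (a sketch; the policy flip is the only step that matters):

  iptables -S FORWARD | head -n 1   # likely shows "-P FORWARD DROP" (Docker sets that by default) while the hang occurs
  iptables -P FORWARD ACCEPT        # temporary workaround; node port traffic starts flowing without a reboot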

It looks like the issue was believed to be fixed by kubernetes/kubernetes#40182, which should be in k8s 1.12.3.

To answer your questions:

  1. Is NetworkManager running -- No
  2. Kubelet and calico-node logs look clean
  3. Node status indicates mesh connectivity to all other nodes in the cluster

I'll work on providing iptables output and diffs.

@cchanley2003

So I was able to reproduce the problem with k8s 1.15.0. Here is the iptables output from when the node port was being blocked. In this case we have a Nexus server behind a node port of 30100, and when this new node joined, curl hung until iptables -P FORWARD ACCEPT was run; after that the node works as expected. The output was collected with -c.
iptables_np.txt

@caseydavenport

@cchanley2003 are you passing --cluster-cidr to your kube-proxy? I believe that should trigger a rule which allows pod traffic and might be related (based on only a quick skim).
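One quick way to check (a sketch; this is the configmap kubeadm creates by default):

  kubectl -n kube-system get configmap kube-proxy -o yaml | grep -i clusterCIDR
  # an empty value (clusterCIDR: "") means kube-proxy was not given the pod CIDR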


cchanley2003 commented Jul 31, 2019

I am not passing cluster-cidr to my kube-proxy. I'll look at the kubeadm setup and see what I need to do to have it added to the default kubeadm installation. Just to be clear, cluster-cidr should match the Calico CIDR range, correct?


tmjd commented Aug 1, 2019

Yes, that is correct.
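With kubeadm that value comes from the pod network CIDR given at init time, for example (a sketch, assuming the default Calico pod CIDR of 192.168.0.0/16):

  # fresh cluster: kubeadm propagates this into kube-proxy's clusterCIDR
  kubeadm init --pod-network-cidr=192.168.0.0/16

  # existing cluster (sketch, not an official procedure): set it in the kube-proxy
  # configmap and recreate the kube-proxy pods so they pick it up
  kubectl -n kube-system edit configmap kube-proxy          # set clusterCIDR: "192.168.0.0/16"
  kubectl -n kube-system delete pod -l k8s-app=kube-proxy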

@cchanley2003

Going to close this issue. I believe the problem was related to not handing kubeadm the cluster-cidr range. Thanks for the help.
