NodePort accessible only from host where POD is running #89632
/sig network
Proxy-mode?
It seems I can reproduce this problem. It is not consistent: I can reach the nodeport on some nodes, not only the one where the pod is executing, but on some nodes it doesn't work. I never get this problem with proxy-mode=ipvs. @cfabio If possible, try with proxy-mode=ipvs. /assign
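(A quick way to confirm which mode kube-proxy is actually running, assuming a kubeadm-style cluster where the kube-proxy pods carry the `k8s-app=kube-proxy` label:)

```sh
# kube-proxy logs "Using ipvs Proxier" or "Using iptables Proxier" at startup
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=100 | grep -i proxier
```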
Weirdest finding today? Well, here it is: on nodes where no pod has ever executed, forwarding to the NodePort doesn't work (?!). Accessing the nodeport from within a node where no pod has executed works, but traffic from an external source to the nodeport via such a node does not. Tested on K8s v1.16.7, v1.17.3 and v1.18.0, so it is not bound to K8s > v1.16.x. @cfabio Can it be that when you tested on k8s < v1.17, pods had been executing on the nodes?

Story: I accidentally loaded a deployment with many replicas for a test and thought I'd just scale it down to 1, but then ... it worked! Access to the nodeport worked via all nodes. I then loaded an alpine daemonset completely unrelated to the test-app, and access to the nodeport worked via all nodes. Finally, I added the alpine daemonset, removed it again, waited until no alpine pods were running (and 5 extra sec after that), then loaded the test-app, and it still worked fine! The alpine manifest:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: alpine-daemonset
spec:
  selector:
    matchLabels:
      app: alpine
  template:
    metadata:
      labels:
        app: alpine
    spec:
      containers:
      - name: alpine
        image: library/alpine:latest
        command: ["tail", "-f", "/dev/null"]
```

@cfabio When you have the problem, can you load this manifest and see if it helps? By coincidence (I think) the header for this issue is very accurate 😄
My first idea was that some kernel module was not loaded, but ...
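(A sketch of the kind of check meant here, assuming proxy-mode=ipvs; the module names are the standard IPVS ones:)

```sh
# List the loaded IPVS/conntrack kernel modules
lsmod | grep -E 'ip_vs|nf_conntrack'
# Load them manually if any are missing
sudo modprobe -a ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh
```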
I am currently playing with IPVS to see if it makes any difference.
Before rolling back to v1.15.* and v1.16.* I killed the cluster (drain, delete nodes, remove-etcd-member, cleaned up iptables, etc.) and started again from scratch. I'll run your Alpine daemonset and let you know about that.
@uablrek here is what I did:
This might have to do with me doing something wrong when changing the kube-proxy config from iptables to ipvs, though.
The error printouts are caused by this problem; it is not present in k8s v1.17.x.
An advantage of ipvs is that it is easier to troubleshoot. When you use proxy-mode=ipvs, do the following on nodes where forwarding doesn't work:
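(The exact command did not survive extraction; given the context it was presumably an ipvsadm listing, for example:)

```sh
# List IPVS virtual services and their real-server backends; the NodePort
# should appear as a virtual service entry on each node
sudo ipvsadm -ln
# List the current connection entries to see whether traffic arrives at all
sudo ipvsadm -lnc
```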
This is how it looks on my test system. My test server uses port 5001 instead of 1883; otherwise I took your manifests more or less as-is:
Here is the output of ...
I am using CentOS 7, which has a very old kernel (3.10.*), which also seems to be the case for bug #89520.
Looks good actually. The entry in ipvs is there with 2 inactive connections. Your problem seems to be a different one than the one I found 😞
The way I have set up kube-proxy to work with IPVS is the following:
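(The steps themselves were lost in extraction; a common procedure on a kubeadm cluster looks roughly like the sketch below, which is not necessarily the exact steps used here:)

```sh
# Set mode: "ipvs" in the kube-proxy ConfigMap
kubectl -n kube-system edit configmap kube-proxy
# Restart the kube-proxy pods so they pick up the new mode
kubectl -n kube-system delete pods -l k8s-app=kube-proxy
# Confirm the mode from the fresh logs
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=100 | grep -i proxier
```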
Does this procedure make sense to you?
Yes, looks OK, but I might have missed something. To use ...
@uablrek manually changing the kube-proxy configuration by editing the ...
This is the yml file used to build the cluster with:
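(The file itself was lost in extraction; a hypothetical reconstruction consistent with the thread, with flannel's pod CIDR and ipvs mode, might look like this, passed to `kubeadm init --config`:)

```yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
networking:
  podSubnet: 10.244.0.0/16    # flannel's expected pod CIDR
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
```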
At this point the only thing I can think of is trying an OS with a more up-to-date Linux kernel, or staying with Kubernetes v1.16.8 until these issues are ironed out.
The "parseIP Error" is in k8s v1.18.0 only. It should work for v1.17.3 for instance. |
@cfabio Since the problem I had when reproducing this issue does not seem to be the same as yours, I can't take this much further. Since the problem does not appear in other clusters, it must be something in your environment. I don't have time to duplicate your setup, but can only give some advice for troubleshooting. At least the ipvs setup seems OK, so the next step may be some traffic monitoring with tcpdump:
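(The command was lost in extraction; presumably something like the following, with `flannel.1` standing in for the "outgoing-if" and 31883 for the NodePort from this issue:)

```sh
# Watch NodePort traffic on the interface flannel uses toward other nodes
sudo tcpdump -ni flannel.1 tcp port 31883
```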
Where the "outgoing-if" is the interface your CNI-plugin (flannel) uses to forward packets to other nodes. Compare with a trace where it works. If traffic never leaves the receiving node the problem is there. However there are multiple steps that can fail. Packets may reach the POD but return traffic may fail, etc |
@uablrek thanks for the suggestion.
I use an image I built myself, bare-minimum BusyBox based. It removes interference from other things and gives control over tool versions. E.g. my kernel is linux-5.6. I tried, btw, to back down to linux-3.10, but then ...
But the 3.10 kernel in CentOS 7 is very old, so IMHO it is worth upgrading just to rule out kernel-version problems at least.
I set up a brand new cluster made of 3 CentOS 8 nodes (1 master, 2 slaves) and tried every combination of Kubernetes versions 1.18.0, 1.17.4 and 1.17.3 with ...
Chosen backend is ...
which seems correct, as traffic is being routed to the IP address and port of the MQTT broker POD (10.244.1.2.mqtt).
slave2 ipvs rules:
Both IPVS and iptables modes use iptables for NodePort forwarding, I believe. But there are no other reports of this happening, so I don't buy that it is broken IN GENERAL. I am (obviously) more inclined to blame flannel, but before I do that, maybe we can rule out the iptables parts. Can you provide a full ...
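(The truncated request was presumably for a full iptables dump; for example:)

```sh
# Dump the complete ruleset for inspection
sudo iptables-save > iptables-dump.txt
# In iptables mode, NodePort rules are programmed in the KUBE-NODEPORTS
# chain of the nat table
sudo iptables -t nat -L KUBE-NODEPORTS -n
```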
@cfabio please take a look at #88986. So far, if this is the same case as I'm seeing (CentOS 7.7, kernel 3.10.0-1062.*, flannel + vxlan), you should disable tx offloading at the source of the communication and it will work. I've been taking a look into ClusterIP cases, but can confirm that this happens also with NodePort (made a test with your case). @thockin do you mind if I take this issue and aggregate it into #88986?
I think I'm facing this same issue on two different clusters. Playing around with ...
The TCP SYN packet arrives into ... I also tried to tcpdump on all interfaces on the node where the only instance of the POD is running, and I get zero packets on either port (30010 and 80).
While in the working one I see:
My only concern is about the ...
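(A way to check for the vxlan checksum problem discussed in this thread is to watch the encapsulated overlay traffic on the physical interface; `eth0` is an assumption here:)

```sh
# Watch flannel's encapsulated vxlan traffic (8472/udp is flannel's default
# vxlan port). Outgoing packets reported with "bad udp cksum" are the
# signature of the tx-offload bug from #88986.
sudo tcpdump -vvni eth0 udp port 8472
```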
For what it's worth, I can't reproduce this issue with a multi-node ...
@thockin
@rikatz disabling TX and RX offload for the flannel.1 interface on every node of the cluster (...
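(The truncated parenthesis presumably contained the ethtool command; reconstructed from the workaround in #88986:)

```sh
# Disable checksum offload on the vxlan interface; run on every node.
# Note: the setting does not survive a reboot.
sudo ethtool -K flannel.1 tx off rx off
```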
It would be awesome to have an ...
I realised that changing the ...
Yes, this is a known issue with the RH/CentOS 7 kernel (3.10.0-1062) with vxlan and tx offload, so the only workaround right now is to disable the tx offload, wait for an update to the kernel, or use another OS. If this solves the problem, then it is a known problem. I'll mark this issue as a duplicate of #88986 and close this; if you want, you can also follow the other issue, and as soon as there is a release of an updated kernel version we can test and post there. But to reinforce: in this case it is not a Kubernetes problem, but a CNI problem with the current CentOS 7.7 kernel. Thank you! /remove-triage unresolved
@rikatz: The label(s) ... In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@rikatz: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Oh, and please feel free to reopen if you don't think it's the same case :)
I am not sure this can really be considered "closed"; there are two issues here, present on both CentOS 7 and CentOS 8:
In the next few days I will try Debian or Fedora with an up-to-date kernel.
This issue may be related to firewall settings on the cloud provider. In my case, I set up a cluster (kubeadm) on AWS EC2 instances and got the same issue. After I changed the Security Group (firewall) settings to allow all traffic, everything went back to normal. I never got this issue when setting up a cluster on bare metal. Hope it helps!
What happened:
I have a 1-master, 2-slave Kubernetes cluster; the node IP addresses are 192.168.122.[110-111-112]/24.
Deployment and Service are defined as follows:
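(The manifests themselves were lost in extraction; a minimal reconstruction consistent with details elsewhere in the thread, an MQTT broker on port 1883 exposed via NodePort 31883, might look like the following. The names and the image are assumptions:)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mqtt-broker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mqtt-broker
  template:
    metadata:
      labels:
        app: mqtt-broker
    spec:
      containers:
      - name: mosquitto
        image: eclipse-mosquitto:latest   # assumed broker image
        ports:
        - containerPort: 1883
---
apiVersion: v1
kind: Service
metadata:
  name: mqtt-broker
spec:
  type: NodePort
  selector:
    app: mqtt-broker
  ports:
  - port: 1883
    targetPort: 1883
    nodePort: 31883
```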
What you expected to happen:
I would expect to be able to reach the service exposed on port 31883 from every IP of the cluster; this worked fine in Kubernetes versions 1.15 and 1.16.
Since Kubernetes 1.17.0 this is not the case anymore: the service exposed on port 31883 is only accessible on the IP address of the physical node where the POD is actually running.
How to reproduce it (as minimally and precisely as possible):
The Kubernetes cluster is created using the

```sh
kubeadm init --pod-network-cidr=10.244.0.0/16
```

command; the network plugin used is flannel, which is installed using the yaml definition file provided here: https://github.com/coreos/flannel#deploying-flannel-manually

Anything else we need to know?:
I tried going back to Kubernetes versions 1.15.4 and 1.16.8, and the same service/deployment definition magically starts working again.
When using Kubernetes 1.18.0, if I run nmap against all 3 nodes, this is the output I get:
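(The scan and its output were lost in extraction; it was presumably something like this, with 31883 being the NodePort and the three IPs the cluster nodes from the description above:)

```sh
# Probe the NodePort on all three nodes from outside the cluster
nmap -p 31883 192.168.122.110 192.168.122.111 192.168.122.112
```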
I am not really sure whether this is actually a bug in Kubernetes or flannel, or some kind of incompatibility between other components of the system.
Environment:
- Kubernetes version (`kubectl version`): 1.18.0 (same issue also on version 1.17.4)
- OS (`cat /etc/os-release`): CentOS Linux release 7.7.1908
- Kernel (`uname -a`): Linux centos 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux