Nodeport packet get lost with vmware when tcp client transmits large payload #8349
Comments
Do you observe similar behavior with a newer version? Say the latest 3.26, or would you be able to try 3.27? When you take a tcpdump, would you also mind including icmp? The issue is most likely related to MTU. ICMP should adjust the PMTU, but we have seen in the past that the ICMP messages were dropped by Linux because of a wrong checksum.
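A capture that includes ICMP alongside the NodePort traffic might look like this (a sketch; the interface name `eth0` and port `30865` are taken from later in this thread, adjust as needed):

```sh
# Capture NodePort TCP traffic plus any ICMP (e.g. "fragmentation needed"
# PMTU messages) arriving on the node interface.
tcpdump -i eth0 'tcp port 30865 or icmp' -w /tmp/echo-server-icmp.pcap
```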
Yes, we tried versions 3.26 and 3.27. @tomastigera I would like to point out that when we disable eBPF, everything works correctly.
Did you observe the ICMP message adjusting the MTU, or the lack of one? Are you running on bare metal or VMs? What network driver?
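For reference, the driver bound to the interface can be checked in the guest with `ethtool -i` (a sketch; `eth0` is the interface name used later in this thread):

```sh
# Print the kernel driver, version, and firmware for the NIC;
# on a vSphere VM this typically reports vmxnet3.
ethtool -i eth0
```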
Yeah, I believe that :) Likely a checksum error on an ICMP message generated by the eBPF dataplane.
It's a VM on VMware ESXi (7); network adapter:
About MTU: I looked, and there were no such messages; no MTU adjustment was observed.
Hmmm, we have seen a bunch of issues with the VMware driver before :( If you could try 3.27 and provide some eBPF debug logs, that would be appreciated. With 3.27 (perhaps with 3.26 as well) you can set a filter using `bpfLogFilters`.
We updated Calico to 3.27.0 and added:

```yaml
apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
spec:
  bpfConnectTimeLoadBalancing: TCP
  bpfEnabled: true
  bpfExternalServiceMode: DSR
  bpfHostNetworkedNATWithoutCTLB: Enabled
  bpfLogFilters:
    all: tcp port 30856 or host 10.222.56.70
  bpfLogLevel: ""
  floatingIPs: Disabled
  logSeverityScreen: Info
  prometheusMetricsEnabled: true
  reportingInterval: 0s
  vxlanEnabled: true
  vxlanPort: 4789
  vxlanVNI: 4096
```

But how do we read the logs correctly? Running

```console
[root@ng01-7d485bb5bx98hk8-z8d6v /]# bpftool prog tracelog
could not find tracefs, attempting to mount it now
```

nothing happens, even when I send a request.
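If `bpftool` cannot mount tracefs by itself (for example, inside a restricted container), it can be mounted manually; a minimal sketch, assuming the conventional mount point:

```sh
# Mount the tracefs pseudo-filesystem, then read the kernel trace pipe
# that `bpftool prog tracelog` consumes.
mount -t tracefs nodev /sys/kernel/tracing
cat /sys/kernel/tracing/trace_pipe
```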
You also need to set `bpfLogLevel: Debug`.
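That can be applied with a patch along these lines (a sketch; it assumes the `default` FelixConfiguration via the `crd.projectcalico.org` API used elsewhere in this thread):

```sh
# Turn on eBPF debug logging; the configured bpfLogFilters then limit
# what actually gets logged.
kubectl patch felixconfigurations.crd.projectcalico.org default \
  --type merge -p '{"spec":{"bpfLogLevel":"Debug"}}'
```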
Thanks for the clarification :) We collected dumps from both nodes (transit and target).

```console
❯ k get felixconfigurations.crd.projectcalico.org default -o yaml
apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
spec:
  bpfConnectTimeLoadBalancing: TCP
  bpfEnabled: true
  bpfExternalServiceMode: DSR
  bpfHostNetworkedNATWithoutCTLB: Enabled
  bpfLogFilters:
    all: tcp port 30865 or host 10.222.56.70
  bpfLogLevel: Debug
  floatingIPs: Disabled
  logSeverityScreen: Info
  prometheusMetricsEnabled: true
  reportingInterval: 0s
  vxlanEnabled: true
  vxlanPort: 4789
  vxlanVNI: 4096
```

```console
#pod
❯ k get pods -o wide
NAME                          READY   STATUS    RESTARTS      AGE   IP             NODE                         NOMINATED NODE   READINESS GATES
echoserver-84c789fbd8-gbqms   1/1     Running   1 (61m ago)   13h   10.222.56.70   ng01-7d485bb5bx98hk8-nczpv   <none>           <none>

#service
❯ k describe svc echoserver
Name:                     echoserver
Namespace:                echoserver
Labels:                   <none>
Annotations:              <none>
Selector:                 app=echoserver
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.222.76.181
IPs:                      10.222.76.181
LoadBalancer Ingress:     10.73.123.84
Port:                     <unset>  80/TCP
TargetPort:               80/TCP
NodePort:                 <unset>  30865/TCP
Endpoints:                10.222.56.70:80
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

#calico nodes
❯ k -n kube-system get pods -l k8s-app=calico-node -o wide | grep ng01-7d485bb5bx98hk8-z8d6v
calico-node-lx8pz   1/1   Running   1 (51m ago)   53m   10.220.1.130   ng01-7d485bb5bx98hk8-z8d6v   <none>   <none>
❯ k -n kube-system get pods -l k8s-app=calico-node -o wide | grep ng01-7d485bb5bx98hk8-nczpv
calico-node-lkfml   1/1   Running   1 (51m ago)   53m   10.220.1.131   ng01-7d485bb5bx98hk8-nczpv   <none>   <none>

#dump
k exec -n kube-system calico-node-lkfml -- bpftool prog tracelog > ng01-7d485bb5bx98hk8-nczpv_10.220.1.131.log
k exec -n kube-system calico-node-lx8pz -- bpftool prog tracelog > ng01-7d485bb5bx98hk8-z8d6v_10.220.1.130.log
```

Attached: ng01-7d485bb5bx98hk8-nczpv_10.220.1.131.log
From the tcpdumps you posted initially, it seems like packets get lost. I do see a time gap in the bpf logs, but I think I would need to correlate them with a matching tcpdump as well. I tried to replicate the issue; however, all works well with a virtio driver. If you could take one more log/tcpdump on what you call the transit node (but ideally on both) and also include udp port 4789 (vxlan), that would help.
We collected traces and pcaps.

```console
❯ k describe svc echoserver
Name:                     echoserver
Namespace:                echoserver
Labels:                   <none>
Annotations:              <none>
Selector:                 app=echoserver
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.222.76.181
IPs:                      10.222.76.181
LoadBalancer Ingress:     10.73.123.84
Port:                     <unset>  80/TCP
TargetPort:               80/TCP
NodePort:                 <unset>  30865/TCP
Endpoints:                10.222.56.82:80
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>
```

FelixConfiguration with bpf filter:

```console
❯ k get felixconfigurations.crd.projectcalico.org default -o json | jq -r '.spec'
{
  "bpfConnectTimeLoadBalancing": "TCP",
  "bpfEnabled": true,
  "bpfExternalServiceMode": "DSR",
  "bpfHostNetworkedNATWithoutCTLB": "Enabled",
  "bpfLogFilters": {
    "all": "tcp port 30865 or host 10.222.56.82 or udp port 4789"
  },
  "bpfLogLevel": "Debug",
  "floatingIPs": "Disabled",
  "logSeverityScreen": "Info",
  "prometheusMetricsEnabled": true,
  "reportingInterval": "0s",
  "vxlanEnabled": true,
  "vxlanPort": 4789,
  "vxlanVNI": 4096
}
```

Calico node pods:

```console
❯ k -n kube-system get pods -l k8s-app=calico-node -o wide | grep ng01-7d485bb5bx98hk8-nczpv
calico-node-wqqk4   1/1   Running   1 (136m ago)   153m   10.220.1.131   ng01-7d485bb5bx98hk8-nczpv   <none>   <none>
❯ k -n kube-system get pods -l k8s-app=calico-node -o wide | grep ng01-7d485bb5bx98hk8-z8d6v
calico-node-tpxbg   1/1   Running   1 (136m ago)   152m   10.220.1.130   ng01-7d485bb5bx98hk8-z8d6v   <none>   <none>

#collect bpf trace
❯ k exec -n kube-system calico-node-wqqk4 -- bpftool prog tracelog > ng01-7d485bb5bx98hk8-nczpv_10.220.1.131.log
❯ k exec -n kube-system calico-node-tpxbg -- bpftool prog tracelog > ng01-7d485bb5bx98hk8-z8d6v_10.220.1.130.log
```

tcpdump:

```console
#On client node (10.220.0.1)
root@jump:~# tcpdump -i ens192 host 10.220.1.130 and port not 22 -w /tmp/echo-server-client.pcap
#On transit node
root@ng01-7d485bb5bx98hk8-z8d6v:~# tcpdump -i eth0 host 10.220.0.1 and port not 22 or udp port 4789 -w /tmp/echo-server-transit.pcap
#On target node
root@ng01-7d485bb5bx98hk8-nczpv:~# tcpdump -i eth0 -n host 10.220.0.1 and port not 22 or udp port 4789 -w /tmp/echo-server-target.pcap
#On pod in target node
root@ng01-7d485bb5bx98hk8-nczpv:~# ip netns exec cni-1c4081d0-5b51-985b-fca7-a396392f28f0 tcpdump -i any -w /tmp/echo-server-pod.pcap
```

And after that, we ran curl with start and stop timestamps:

```console
root@jump:# date +%H:%M:%S:%N; curl http://10.220.1.130:30865 -X POST -H 'Content-Type: text/plain' --data @bs66k -s > /dev/null; date +%H:%M:%S:%N
11:55:12:432352148
11:56:09:380735559
```

MTU:

```console
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:87:82:58 brd ff:ff:ff:ff:ff:ff
    altname enp11s0
    inet 10.220.1.130/25 brd 10.220.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fe87:8258/64 scope link
       valid_lft forever preferred_lft forever
```

The interface calico-node creates:

```console
15: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 66:ba:5a:61:9e:0d brd ff:ff:ff:ff:ff:ff
    inet 10.222.54.0/32 scope global vxlan.calico
       valid_lft forever preferred_lft forever
```

Additionally, we have not configured the MTU explicitly.
So the first oversized packet does not make it from the transit node to the other node.

This drop happens outside of Calico's eBPF dataplane: Calico sees the packet leaving, and so tcpdump sees it too. By this time, Calico has wrapped the packet in VXLAN, and it might be that the driver does not honor/handle GSO correctly. It would not be the first time I have seen that with VMware. The packet is oversized because the driver's receive offload likely merges multiple packets together. That is fine in itself, but the segmentation metadata needs to be set correctly for the kernel to handle it.

Could you test whether you observe the same if you turn off GRO on the driver? In Calico 3.27 you can set `bpfDisableGROForIfaces` to do that for you.
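GRO can also be turned off directly on the driver for a quick test (a sketch; `eth0` is the interface name used elsewhere in this thread):

```sh
# Turn off generic receive offload on the NIC, then verify the setting.
ethtool -K eth0 gro off
ethtool -k eth0 | grep generic-receive-offload
```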
We turned GRO off via `bpfDisableGROForIfaces`:

```json
{
  "bpfConnectTimeLoadBalancing": "TCP",
  "bpfDisableGROForIfaces": "eth0",
  "bpfEnabled": true,
  "bpfExternalServiceMode": "DSR",
  "bpfHostNetworkedNATWithoutCTLB": "Enabled",
  "bpfLogFilters": {
    "all": "tcp port 30865 or host 10.222.56.82 or udp port 4789"
  },
  "bpfLogLevel": "Debug",
  "floatingIPs": "Disabled",
  "logSeverityScreen": "Info",
  "prometheusMetricsEnabled": true,
  "reportingInterval": "0s",
  "vxlanEnabled": true,
  "vxlanPort": 4789,
  "vxlanVNI": 4096
}
```

```console
root@ng01-7d485bb5bx98hk8-z8d6v:~# ethtool -k eth0 | grep generic-receive
generic-receive-offload: off
root@ng01-7d485bb5bx98hk8-nczpv:~# ethtool -k eth0 | grep generic-receive
generic-receive-offload: off
```

But it did not help. We collected new tcpdumps from the nodes and the pod:

```console
date +%H:%M:%S:%N; curl http://10.220.1.130:30865 -X POST -H 'Content-Type: text/plain' --data @bs66k -s > /dev/null; date +%H:%M:%S:%N
14:47:24:823919940
14:48:18:515367574
```
@tomastigera we already had a problem with GRO in the past.
Additionally, I disabled the eBPF dataplane and collected tcpdumps for a request with a huge payload:

```console
> date +%H:%M:%S:%N; curl http://10.220.1.130:30865 -X POST -H 'Content-Type: text/plain' --data @bs66k -s > /dev/null; date +%H:%M:%S:%N
16:57:35:010936725
16:57:35:041028383
#^ no delay ^
```

```console
root@jump:~# tcpdump -i ens192 host 10.220.1.130 and port not 22 -w /tmp/echo-server-client.pcap
root@ng01-7d485bb5bx98hk8-z8d6v:~# tcpdump -i eth0 host 10.220.0.1 and port not 22 or udp port 4789 -w /tmp/echo-server-transit.pcap
root@ng01-7d485bb5bx98hk8-nczpv:~# tcpdump -i eth0 -n host 10.220.0.1 and port not 22 or udp port 4789 -w /tmp/echo-server-target.pcap
root@ng01-7d485bb5bx98hk8-nczpv:~# ip netns exec cni-9f775343-7de2-ca5d-75fe-8ad8a9336f49 tcpdump -i any -w /tmp/echo-server-pod.pcap
```
Looking at the new dump with eBPF, it seems like turning off GRO helped a lot. We do not see packets dropped because of jumbo frames. In fact, there seems to be no retransmission observed on the target node at all. There also does not seem to be any retransmission on the transit node (you see the packets 2x in the dump because the second copy is the VXLAN encap forwarded to the target node).

However, what we do see is that the client retransmitted some packets, but the original sends did not make it to the transit node at all. So they did not reach Calico at all. In the attached screenshot from the client's dump, none of the packets 33-42 make it to the node. Only after 33 is sent again as 43 does it make it to the node, to Calico, and all the way to the pod; from the cluster's point of view, all communication is ok.

The client sent packet 22 with a ~11k payload. It got chopped up into 1398-byte segments, and the last part of 480 bytes never made it to the transit node. All the rest gets acked, and these 480 bytes get resent as packet 33. I cannot see how Calico eBPF would be involved in this, except that we had to turn off buggy GRO in the VMware driver.

Note that a bunch of things are different when (not) using eBPF. The dynamics of the packets are different, so something in between may drop them. Note also that you are using DSR mode with eBPF; that means the ACKs from the server take a different route back than in the iptables mode, etc. You may try to set `bpfExternalServiceMode: Tunnel` instead.

Unfortunately, VMware drivers are buggy; we have seen nodes crashing because of issues triggered by the drivers in the kernel. You may try a newer kernel with a newer driver. But things look pretty good from the Calico eBPF point of view so far ¯\_(ツ)_/¯ sorry to disappoint you :(
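One way to spot these retransmissions when correlating the captures (a sketch; assumes tshark is available and uses the pcap names from earlier in this thread):

```sh
# List client-side TCP retransmissions; running the same filter against
# the transit-node capture shows which original sends never arrived.
tshark -r /tmp/echo-server-client.pcap -Y tcp.analysis.retransmission
tshark -r /tmp/echo-server-transit.pcap -Y tcp.analysis.retransmission
```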
Hello, we changed the configuration (switched `bpfExternalServiceMode` to `Tunnel` and moved `vxlanPort` to 4799):

```yaml
apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
metadata:
  name: default
spec:
  bpfConnectTimeLoadBalancing: TCP
  bpfEnabled: true
  bpfExternalServiceMode: Tunnel
  bpfLogLevel: ""
  logSeverityScreen: Info
  prometheusMetricsEnabled: true
  reportingInterval: 0s
  vxlanEnabled: true
  vxlanPort: 4799
  vxlanVNI: 4096
```

And after these changes, everything works correctly.
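To confirm that the Calico VXLAN device picked up the new port, something like this should work (a sketch; `vxlan.calico` is the device shown earlier in this thread):

```sh
# The detailed link output includes the VXLAN "dstport"; it should now
# report 4799 instead of the default 4789.
ip -d link show vxlan.calico | grep -o 'dstport [0-9]*'
```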
@xander-sh Is there any non-Calico VXLAN in your network that would conflict with our VXLAN?
The Kubernetes clusters are running on vSphere infrastructure with NSX-T SDN. NSX-T uses VXLAN on the standard UDP port 4789 for its own overlay, which conflicts with Calico's default `vxlanPort`.
Ohhh, that explains it! TIL. Thanks for reporting back; closing as resolved.
We have a vanilla Kubernetes cluster v1.23.17 with Calico v3.25.0 in eBPF mode and VXLAN mode on virtual machines (Ubuntu 20.04). For load-balancing services we use an external load balancer which sends traffic to a NodePort.
We noticed that requests to the NodePort with a payload larger than 65 KB pass through the proxy node with a large delay (more than 60 seconds).
Expected Behavior
No delay for requests with a big payload.
Current Behavior
Delay for requests with a payload of more than 65 KB through the proxy node.
Requests to the load balancer are sent from a VM which is not a part of the cluster and is located in a different subnet.
We collected tcpdumps from the transit node (10.220.1.130) and from the pod.
In the tcpdumps we see retransmitted packets when a request is sent to the transit node (10.220.1.130).
From the pod side we do not see retransmitted packets, and on the transit node we do not see retransmitted packets either; maybe the bpf program drops packets before tcpdump sees them.
Possible Solution
Steps to Reproduce (for bugs)
```sh
head --byte=67584 /dev/urandom | base64 | tr -d '\n' | head --byte=67584 > bs66k
curl http://10.220.1.130:30865 -X POST -H 'Content-Type: text/plain' --data @bs66k
```
Context
Your Environment
Felix configuration:
echo-server-pod.pcap - dump from pod interface
echo-server-transit-node.pcap - dump from transit node
echo-server-client.pcap - dump from client
tcpdump.zip