How to deal with the "Failed to send packet" error reported with "interconnect encountered a network error, please check your network"
This error means that sending UDP packets from one host to another failed.
Example log:
WARNING: interconnect may encountered a network error, please check your network (seg1 slice1 192.168.1.1:6000 pid=xxx)
DETAIL: Failed to send packet (seq xx) to 192.168.1.2:12345 (pid xxxx cid -1) after 100 retries.
In this example, the send from 192.168.1.1:6000 to 192.168.1.2:12345 failed.
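When many such messages appear, it can help to extract the failing sender and receiver automatically. The following is a minimal sketch that assumes the WARNING and DETAIL lines look exactly like the example above; the names failed_path, SENDER_RE and RECEIVER_RE are hypothetical and not part of GP.

```python
import re

# Assumption: the sender address is inside the parentheses of the WARNING line,
# in the form "(segN sliceN ip:port pid=...)" as in the example log.
SENDER_RE = re.compile(
    r"\((seg\S+) (slice\S+) (\d{1,3}(?:\.\d{1,3}){3}):(\d+) pid=\S+\)"
)
# Assumption: the receiver address follows "to" in the DETAIL line.
RECEIVER_RE = re.compile(
    r"Failed to send packet \(seq \S+\) to (\d{1,3}(?:\.\d{1,3}){3}):(\d+)"
)


def failed_path(warning_line, detail_line):
    """Return ((src_ip, src_port), (dst_ip, dst_port)), or None if a line does not match."""
    sender = SENDER_RE.search(warning_line)
    receiver = RECEIVER_RE.search(detail_line)
    if sender is None or receiver is None:
        return None
    source = (sender.group(3), int(sender.group(4)))
    destination = (receiver.group(1), int(receiver.group(2)))
    return source, destination
```

Applied to the two example lines, this yields the pair of hosts to test in the steps that follow.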
Here are the 2 most likely reasons:
The first is packet size. The default UDP packet size in GP is 8192 bytes (GUC: gp_max_packet_size), and some routers or switches may drop all packets above a certain size. In that case, all queries fail continuously.
Try different packet sizes to see whether the query runs normally, for example:
PGOPTIONS='-c gp_max_packet_size=8192' psql
PGOPTIONS='-c gp_max_packet_size=1400' psql
(test other values between 512 and 8192)
If one size fixes the problem, adjust gp_max_packet_size or your MTU settings accordingly.
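The manual trial above can be scripted. The sketch below is only an illustration: it assumes psql is on the PATH and picks up its connection settings from the environment, and the function names and the list of candidate sizes are hypothetical. It sets PGOPTIONS the same way as the commands above and reports the first size for which the query succeeds.

```python
import os
import subprocess

# Range taken from the text above.
MIN_SIZE, MAX_SIZE = 512, 8192


def pgoptions_for(size):
    """Build the PGOPTIONS value that sets gp_max_packet_size for one session."""
    if not MIN_SIZE <= size <= MAX_SIZE:
        raise ValueError(f"size must be between {MIN_SIZE} and {MAX_SIZE}")
    return f"-c gp_max_packet_size={size}"


def query_succeeds(size, sql, timeout=300):
    """Run sql through psql with the given packet size; True if psql exits with 0."""
    env = dict(os.environ, PGOPTIONS=pgoptions_for(size))
    try:
        result = subprocess.run(
            ["psql", "-X", "-v", "ON_ERROR_STOP=1", "-c", sql],
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # A query that never returns is treated as a failure for this size.
        return False
    return result.returncode == 0


def first_working_size(sql, sizes=(8192, 4096, 1400, 512)):
    """Return the first candidate size for which the query succeeds, else None."""
    for size in sizes:
        if query_succeeds(size, sql):
            return size
    return None
```

Use the query that actually fails as the sql argument, since a query that moves no data between segments would not exercise the interconnect.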
The second is intermittent packet loss on the network. In this scenario, queries fail only some of the time.
Use iperf3 to test UDP connectivity between the hosts, for example:
server side: iperf3 -s
client side: iperf3 -uVc {server-host} -b1000m -t60 --get-server-output -l8192
If the packet loss rate stays non-zero, as in the output below, contact your network administrator to fix it:
[ 5] local xxxxxxx port xxxxx connected to xxxxx port xxxxx
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 5] 0.00-1.00 sec 94.2 MBytes 791 Mbits/sec 0.003 ms 1944/14008 (14%)
[ 5] 1.00-2.00 sec 109 MBytes 913 Mbits/sec 0.010 ms 1314/15241 (8.6%)
[ 5] 2.00-3.00 sec 107 MBytes 895 Mbits/sec 0.002 ms 1621/15273 (11%)
[ 5] 3.00-4.00 sec 109 MBytes 912 Mbits/sec 0.003 ms 1349/15269 (8.8%)
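For a 60-second run it is easier to total the Lost/Total column than to read it by eye. The sketch below assumes interval lines shaped like the sample output above; the function names are hypothetical, and lines ending in "sender" or "receiver" are skipped on the assumption that they are end-of-run summaries that would otherwise be counted twice.

```python
import re

# Matches the "Lost/Total Datagrams" column, e.g. "1944/14008 (14%)".
LOSS_RE = re.compile(r"(\d+)/(\d+) \(\d+(?:\.\d+)?%\)")


def interval_losses(output):
    """Return a list of (lost, total) pairs, one per interval line."""
    pairs = []
    for line in output.splitlines():
        stripped = line.strip()
        # Assumption: summary lines end with "sender" or "receiver".
        if stripped.endswith("sender") or stripped.endswith("receiver"):
            continue
        match = LOSS_RE.search(stripped)
        if match:
            pairs.append((int(match.group(1)), int(match.group(2))))
    return pairs


def overall_loss_percent(output):
    """Return total lost datagrams as a percentage of all datagrams, or None if no data."""
    pairs = interval_losses(output)
    total = sum(t for _, t in pairs)
    if total == 0:
        return None
    lost = sum(l for l, _ in pairs)
    return 100.0 * lost / total
```

Feed it the interval lines from one side only; if the client and server report the same intervals, passing both would double-count them.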
One possible solution is to adjust the network parameters:
net.core.rmem_default = 25165824
net.core.rmem_max = 33554432
net.ipv4.tcp_rmem = 16777216 25165824 33554432
net.ipv4.udp_rmem_min = 16777216
A less likely reason is a code bug, in which case one specific query hangs every time. Some such bugs have been fixed previously.
Please open a new issue with:
- the query
- pstack of all hung QD/QEs
- master and segment logs collected with the following settings:
set gp_log_interconnect to debug; set log_min_messages to debug5;