How to deal with the error "Failed to send packet" in "interconnect encountered a network error, please check your network"

What does it mean?

It means that sending UDP packets between two hosts failed.

A log example:

WARNING:  interconnect may encountered a network error, please check your network  (seg1 slice1 192.168.1.1:6000 pid=xxx)
DETAIL:  Failed to send packet (seq xx) to 192.168.1.2:12345 (pid xxxx cid -1) after 100 retries.

This means that sending from 192.168.1.1:6000 to 192.168.1.2:12345 failed.

Solution

Here are the two most likely reasons:

MTU setting

The default UDP packet size of GP is 8192 bytes (GUC: gp_max_packet_size). Some routers/switches may drop any packet larger than their MTU, which causes all queries to fail continuously.

So, try different packet sizes to see whether the query runs normally, e.g.

PGOPTIONS='-c gp_max_packet_size=8192' psql
PGOPTIONS='-c gp_max_packet_size=1400' psql
(test other values between 512 and 8192)

If one size fixes it, adjust gp_max_packet_size or your MTU settings accordingly.
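
To try several sizes in one go, a small shell loop like the sketch below may help (query.sql is a hypothetical file containing a query that currently fails):

# try a few candidate packet sizes against a saved failing query (query.sql is hypothetical)
for size in 8192 4096 1400 512; do
    echo "=== gp_max_packet_size=$size ==="
    PGOPTIONS="-c gp_max_packet_size=$size" psql -f query.sql
done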

UDP package loss

In this scenario, the query only failed sometimes.

Please use iperf3 to test UDP connectivity between the hosts; command example:

 server side:   iperf3 -s
 client side:   iperf3 -uVc {server-host} -b1000m -t60 --get-server-output -l8192
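
To check every segment host from the current host, a loop like the following may help (a sketch: the host names are hypothetical, and it assumes iperf3 -s is already running on each target):

# run the UDP test against each segment host (host names are hypothetical)
for h in sdw1 sdw2 sdw3; do
    echo "=== UDP test to $h ==="
    iperf3 -uVc "$h" -b1000m -t60 --get-server-output -l8192
done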

Please contact your network admin to fix it if the packet drop rate keeps showing non-zero values like below:

[  5] local xxxxxxx port xxxxx connected to xxxxx port xxxxx
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  5]   0.00-1.00   sec  94.2 MBytes   791 Mbits/sec  0.003 ms  1944/14008 (14%)
[  5]   1.00-2.00   sec   109 MBytes   913 Mbits/sec  0.010 ms  1314/15241 (8.6%)
[  5]   2.00-3.00   sec   107 MBytes   895 Mbits/sec  0.002 ms  1621/15273 (11%)
[  5]   3.00-4.00   sec   109 MBytes   912 Mbits/sec  0.003 ms  1349/15269 (8.8%)

One possible mitigation is to adjust the kernel network parameters:

net.core.rmem_default = 25165824
net.core.rmem_max = 33554432
net.ipv4.tcp_rmem = 16777216 25165824 33554432
net.ipv4.udp_rmem_min = 16777216
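
A sketch of one way to apply and persist these values (assumes root access; the drop-in file name is only an example):

# write the settings to a sysctl drop-in file and load them immediately
cat <<'EOF' | sudo tee /etc/sysctl.d/90-gpdb-net.conf
net.core.rmem_default = 25165824
net.core.rmem_max = 33554432
net.ipv4.tcp_rmem = 16777216 25165824 33554432
net.ipv4.udp_rmem_min = 16777216
EOF
sudo sysctl --system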

Other

Another, low-probability, reason is a code bug: one specific query hangs every time it runs. Some such bugs have been fixed previously.

Please open a new issue with:

  • the query
  • pstack output of all hung QD/QE processes
  • master and segment logs collected with gp_log_interconnect set to debug and log_min_messages set to debug5 (see the sketch below)
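
A sketch of how to collect that information (the pids and query.sql are hypothetical placeholders; find the QD/QE postgres processes of the hung query with ps first):

# collect stack traces of the hung QD/QE postgres processes (pids are hypothetical)
for pid in 12345 23456; do
    pstack "$pid" > "pstack.$pid.txt"
done

# rerun the query with interconnect debug logging enabled at session level,
# then attach the master and segment logs to the issue
PGOPTIONS='-c gp_log_interconnect=debug -c log_min_messages=debug5' psql -f query.sql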