TX hang on ixgbe with random packet sizes #325
Description
I was playing with pkt-gen-b, sending out random-size UDP packets, when I experienced TX hangs reported in dmesg.
Information I have so far:
- The problem seems to occur only at rates above 1.3 Mpps, so it reproduces only with pkt-gen-b.
- Sometimes pkt-gen-b just reports 0 pps while sending packets, with no message in the dmesg log.
- It reproduces more easily when using just a single ring of the interface (e.g. `-i enp129s0f1-3`). It reproduces on every ring I tried. It also reproduces with all rings on the interface, but rarely.
System information:
- Ubuntu 17.04
- kernel 4.10.0-19
- netmap cloned at 9317a62 (from Jun 13)
- Ethernet controller: Intel Corporation Ethernet 10G 2P X520 Adapter, using the ixgbe driver
Dmesg log when the hang occurs:
```
[ 952.366422] 503.883459 [ 131] ixgbe_netmap_configure_srrctl bufsz: 2048 srrctl: 2
[ 952.568401] ixgbe 0000:81:00.1 enp129s0f1: detected SFP+: 6
[ 954.136470] ixgbe 0000:81:00.1 enp129s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[ 982.395147] ixgbe 0000:81:00.1 enp129s0f1: Detected Tx Unit Hang
                Tx Queue             <10>
                TDH, TDT             <4>, <6>
                next_to_use          <6>
                next_to_clean        <4>
[ 982.395170] ixgbe 0000:81:00.1 enp129s0f1: tx_buffer_info[next_to_clean]
                time_stamp           <10002967e>
                jiffies              <100029a58>
[ 982.395173] ixgbe 0000:81:00.1 enp129s0f1: Detected Tx Unit Hang
                Tx Queue             <5>
                TDH, TDT             <4>, <5>
                next_to_use          <5>
                next_to_clean        <4>
[ 982.395182] ixgbe 0000:81:00.1 enp129s0f1: tx_buffer_info[next_to_clean]
                time_stamp           <1000296ab>
                jiffies              <100029a58>
[ 982.395186] ixgbe 0000:81:00.1 enp129s0f1: tx hang 5 detected on queue 5, resetting adapter
[ 982.395193] ixgbe 0000:81:00.1 enp129s0f1: tx hang 5 detected on queue 10, resetting adapter
[ 982.395198] ixgbe 0000:81:00.1 enp129s0f1: Reset adapter
[ 982.669072] 534.187467 [ 131] ixgbe_netmap_configure_srrctl bufsz: 2048 srrctl: 2
[ 982.859639] ixgbe 0000:81:00.1 enp129s0f1: detected SFP+: 6
[ 983.595003] ixgbe 0000:81:00.1 enp129s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[ 990.594665] ixgbe 0000:81:00.1 enp129s0f1: detected SFP+: 6
[ 991.085927] IPv6: ADDRCONF(NETDEV_UP): enp129s0f1: link is not ready
[ 991.342176] ixgbe 0000:81:00.1 enp129s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[ 991.342301] IPv6: ADDRCONF(NETDEV_CHANGE): enp129s0f1: link becomes ready
```
The smallest patch that reproduces the problem in pkt-gen.c is:
```diff
diff --git a/apps/pkt-gen/pkt-gen.c b/apps/pkt-gen/pkt-gen.c
index a34cdc06..deff61cb 100644
--- a/apps/pkt-gen/pkt-gen.c
+++ b/apps/pkt-gen/pkt-gen.c
@@ -1597,6 +1597,7 @@ sender_body(void *data)
 				continue;
 			if (frags > 1)
 				limit = ((limit + frags - 1) / frags) * frags;
+			size = random() % 800 + 400;
 			m = send_packets(txring, pkt, frame, size, targ->g,
 					limit, options, frags);
```
Then run as `./pkt-gen-b -i enp129s0f1-3 -f tx -n 0 -l 1500`. Use size 1500 so that the prepared packet is bigger than the truncated sizes set by the patch.
This is not the cleanest patch: I also tried a version that correctly sets the lengths in the IP and UDP headers, and it exposes the problem as well. So this is the smallest reproduction patch.
Sample output from pkt-gen:
```
906.009036 main [2637] interface is enp129s0f1-3
906.009065 main [2757] using default burst size: 512
906.009081 main [2765] running on 1 cpus (have 16)
906.009196 extract_ip_range [462] range is 10.0.0.1:1234 to 10.0.0.1:1234
906.009205 extract_ip_range [462] range is 10.1.0.1:1234 to 10.1.0.1:1234
906.268770 main [2860] mapped 334980KB at 0x7fc5a8942000
Sending on netmap:enp129s0f1-3: 16 queues, 1 threads and 1 cpus.
10.0.0.1 -> 10.1.0.1 (00:00:00:00:00:00 -> ff:ff:ff:ff:ff:ff)
906.268795 main [2957] Sending 512 packets every 0.000000000 s
906.268812 start_threads [2319] Wait 2 secs for phy reset
908.268910 start_threads [2321] Ready...
908.268973 sender_body [1505] start, fd 3 main_fd 3
908.336593 sender_body [1587] drop copy
909.269959 main_thread [2409] 1.513 Mpps (1.514 Mpkts 11.437 Gbps in 1001025 usec) 1.24 avg_batch 0 min_space
910.270998 main_thread [2409] 1.519 Mpps (1.521 Mpkts 14.399 Gbps in 1001039 usec) 1.24 avg_batch 99999 min_space
911.272037 main_thread [2409] 1.519 Mpps (1.521 Mpkts 5.801 Gbps in 1001039 usec) 1.24 avg_batch 99999 min_space
912.273074 main_thread [2409] 1.519 Mpps (1.520 Mpkts 1.126 Gbps in 1001037 usec) 1.24 avg_batch 99999 min_space
913.274109 main_thread [2409] 1.514 Mpps (1.516 Mpkts 147573952.586 Tbps in 1001035 usec) 1.24 avg_batch 99999 min_space
914.275146 main_thread [2409] 1.522 Mpps (1.524 Mpkts 22.612 Gbps in 1001037 usec) 1.24 avg_batch 99999 min_space
915.275778 main_thread [2409] 1.516 Mpps (1.517 Mpkts 40.052 Gbps in 1000632 usec) 1.24 avg_batch 99999 min_space
916.276736 main_thread [2409] 1.078 Mpps (1.079 Mpkts 147573952.566 Tbps in 1000958 usec) 1.24 avg_batch 99999 min_space
917.277773 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001037 usec) 0.00 avg_batch 99999 min_space
918.278808 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001035 usec) 0.00 avg_batch 99999 min_space
919.279587 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1000779 usec) 0.00 avg_batch 99999 min_space
920.280621 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001034 usec) 0.00 avg_batch 99999 min_space
921.280735 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1000113 usec) 0.00 avg_batch 99999 min_space
922.281770 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001036 usec) 0.00 avg_batch 99999 min_space
923.282805 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001035 usec) 0.00 avg_batch 99999 min_space
924.283841 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001036 usec) 0.00 avg_batch 99999 min_space
925.284879 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001038 usec) 0.00 avg_batch 99999 min_space
926.285915 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001036 usec) 0.00 avg_batch 99999 min_space
927.286952 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001036 usec) 0.00 avg_batch 99999 min_space
928.287987 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001036 usec) 0.00 avg_batch 99999 min_space
929.289023 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001035 usec) 0.00 avg_batch 99999 min_space
^C930.201296 sigint_h [517] received control-C on thread 0x7fc5bdd49700
930.201305 sender_body [1623] flush tail 149 head 149 on thread 0x7fc5a8941700
930.201306 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 912283 usec) 0.00 avg_batch 99999 min_space
931.202349 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001044 usec) 0.00 avg_batch 99999 min_space
Sent 11712149 packets 8608429515 bytes 9450097 events 735 bytes each in 21.93 seconds.
Speed: 534.014 Kpps Bandwidth: 3.140 Gbps (raw 3.243 Gbps). Average batch: 1.24 pkts
```
And it keeps printing zeroes like this forever. Sometimes the hardware resets, sends packets for a while, and then gets stuck again.
Any light on this is welcome.