TX hang on ixgbe with random packet sizes #325

@borislavmatvey

Description

I was using pkt-gen-b to send out random-size UDP packets when I hit TX hangs reported in dmesg.

Information which I have for now:

  • The problem seems to occur only at rates above 1.3 Mpps, so it reproduces only with pkt-gen-b.
  • Sometimes pkt-gen-b just reports 0 pps when sending and there is no message in the dmesg log.
  • It reproduces more easily when using just one ring of the interface (e.g. -i enp129s0f1-3), and it reproduces on every ring I tried. It also reproduces with all rings on the interface, but rarely.

System information:

  • Ubuntu 17.04
  • kernel 4.10.0-19
  • netmap cloned at 9317a62 (from Jun 13)
  • using Ethernet controller: Intel Corporation Ethernet 10G 2P X520 Adapter with ixgbe driver

Dmesg log when the hang occurs:

[  952.366422] 503.883459 [ 131] ixgbe_netmap_configure_srrctl bufsz: 2048 srrctl: 2
[  952.568401] ixgbe 0000:81:00.1 enp129s0f1: detected SFP+: 6
[  954.136470] ixgbe 0000:81:00.1 enp129s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[  982.395147] ixgbe 0000:81:00.1 enp129s0f1: Detected Tx Unit Hang
                 Tx Queue             <10>
                 TDH, TDT             <4>, <6>
                 next_to_use          <6>
                 next_to_clean        <4>
[  982.395170] ixgbe 0000:81:00.1 enp129s0f1: tx_buffer_info[next_to_clean]
                 time_stamp           <10002967e>
                 jiffies              <100029a58>
[  982.395173] ixgbe 0000:81:00.1 enp129s0f1: Detected Tx Unit Hang
                 Tx Queue             <5>
                 TDH, TDT             <4>, <5>
                 next_to_use          <5>
                 next_to_clean        <4>
[  982.395182] ixgbe 0000:81:00.1 enp129s0f1: tx_buffer_info[next_to_clean]
                 time_stamp           <1000296ab>
                 jiffies              <100029a58>
[  982.395186] ixgbe 0000:81:00.1 enp129s0f1: tx hang 5 detected on queue 5, resetting adapter
[  982.395193] ixgbe 0000:81:00.1 enp129s0f1: tx hang 5 detected on queue 10, resetting adapter
[  982.395198] ixgbe 0000:81:00.1 enp129s0f1: Reset adapter
[  982.669072] 534.187467 [ 131] ixgbe_netmap_configure_srrctl bufsz: 2048 srrctl: 2
[  982.859639] ixgbe 0000:81:00.1 enp129s0f1: detected SFP+: 6
[  983.595003] ixgbe 0000:81:00.1 enp129s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[  990.594665] ixgbe 0000:81:00.1 enp129s0f1: detected SFP+: 6
[  991.085927] IPv6: ADDRCONF(NETDEV_UP): enp129s0f1: link is not ready
[  991.342176] ixgbe 0000:81:00.1 enp129s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[  991.342301] IPv6: ADDRCONF(NETDEV_CHANGE): enp129s0f1: link becomes ready

The smallest patch which reproduces the problem in pkt-gen.c is:

diff --git a/apps/pkt-gen/pkt-gen.c b/apps/pkt-gen/pkt-gen.c
index a34cdc06..deff61cb 100644
--- a/apps/pkt-gen/pkt-gen.c
+++ b/apps/pkt-gen/pkt-gen.c
@@ -1597,6 +1597,7 @@ sender_body(void *data)
                                continue;
                        if (frags > 1)
                                limit = ((limit + frags - 1) / frags) * frags;
+                       size = random() % 800 + 400;

                        m = send_packets(txring, pkt, frame, size, targ->g,
                                         limit, options, frags);

Then run it as ./pkt-gen-b -i enp129s0f1-3 -f tx -n 0 -l 1500. The -l 1500 makes the prepared packet larger than any size the patch truncates it to.

This is not the cleanest patch - I also tried a version that correctly sets the sizes in the IP and UDP headers, and it exposes the problem as well. So the above is the smallest patch that reproduces it.

Sample output from pkt-gen:

906.009036 main [2637] interface is enp129s0f1-3
906.009065 main [2757] using default burst size: 512
906.009081 main [2765] running on 1 cpus (have 16)
906.009196 extract_ip_range [462] range is 10.0.0.1:1234 to 10.0.0.1:1234
906.009205 extract_ip_range [462] range is 10.1.0.1:1234 to 10.1.0.1:1234
906.268770 main [2860] mapped 334980KB at 0x7fc5a8942000
Sending on netmap:enp129s0f1-3: 16 queues, 1 threads and 1 cpus.
10.0.0.1 -> 10.1.0.1 (00:00:00:00:00:00 -> ff:ff:ff:ff:ff:ff)
906.268795 main [2957] Sending 512 packets every  0.000000000 s
906.268812 start_threads [2319] Wait 2 secs for phy reset
908.268910 start_threads [2321] Ready...
908.268973 sender_body [1505] start, fd 3 main_fd 3
908.336593 sender_body [1587] drop copy
909.269959 main_thread [2409] 1.513 Mpps (1.514 Mpkts 11.437 Gbps in 1001025 usec) 1.24 avg_batch 0 min_space
910.270998 main_thread [2409] 1.519 Mpps (1.521 Mpkts 14.399 Gbps in 1001039 usec) 1.24 avg_batch 99999 min_space
911.272037 main_thread [2409] 1.519 Mpps (1.521 Mpkts 5.801 Gbps in 1001039 usec) 1.24 avg_batch 99999 min_space
912.273074 main_thread [2409] 1.519 Mpps (1.520 Mpkts 1.126 Gbps in 1001037 usec) 1.24 avg_batch 99999 min_space
913.274109 main_thread [2409] 1.514 Mpps (1.516 Mpkts 147573952.586 Tbps in 1001035 usec) 1.24 avg_batch 99999 min_space
914.275146 main_thread [2409] 1.522 Mpps (1.524 Mpkts 22.612 Gbps in 1001037 usec) 1.24 avg_batch 99999 min_space
915.275778 main_thread [2409] 1.516 Mpps (1.517 Mpkts 40.052 Gbps in 1000632 usec) 1.24 avg_batch 99999 min_space
916.276736 main_thread [2409] 1.078 Mpps (1.079 Mpkts 147573952.566 Tbps in 1000958 usec) 1.24 avg_batch 99999 min_space
917.277773 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001037 usec) 0.00 avg_batch 99999 min_space
918.278808 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001035 usec) 0.00 avg_batch 99999 min_space
919.279587 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1000779 usec) 0.00 avg_batch 99999 min_space
920.280621 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001034 usec) 0.00 avg_batch 99999 min_space
921.280735 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1000113 usec) 0.00 avg_batch 99999 min_space
922.281770 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001036 usec) 0.00 avg_batch 99999 min_space
923.282805 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001035 usec) 0.00 avg_batch 99999 min_space
924.283841 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001036 usec) 0.00 avg_batch 99999 min_space
925.284879 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001038 usec) 0.00 avg_batch 99999 min_space
926.285915 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001036 usec) 0.00 avg_batch 99999 min_space
927.286952 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001036 usec) 0.00 avg_batch 99999 min_space
928.287987 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001036 usec) 0.00 avg_batch 99999 min_space
929.289023 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001035 usec) 0.00 avg_batch 99999 min_space
^C930.201296 sigint_h [517] received control-C on thread 0x7fc5bdd49700
930.201305 sender_body [1623] flush tail 149 head 149 on thread 0x7fc5a8941700
930.201306 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 912283 usec) 0.00 avg_batch 99999 min_space
931.202349 main_thread [2409] 0.000 pps (0.000 pkts 0.000 bps in 1001044 usec) 0.00 avg_batch 99999 min_space
Sent 11712149 packets 8608429515 bytes 9450097 events 735 bytes each in 21.93 seconds.
Speed: 534.014 Kpps Bandwidth: 3.140 Gbps (raw 3.243 Gbps). Average batch: 1.24 pkts

And it keeps reporting zeroes forever. Sometimes the hardware resets, sends packets for a while, and then gets stuck again.

Any light on this is welcome.
