panic under heavy network load #19

Closed
dch opened this issue Mar 22, 2023 · 15 comments
Assignees
Labels
bug Something isn't working panic Kernel panic priority Something important to fix

Comments

dch commented Mar 22, 2023

This only reproduces when more than the usual amount of cross-dpaa interface traffic is present.
I can trigger it reliably using iperf3. This is using stock CURRENT, not a fork.

$ iperf3 --parallel 16 --client 172.16.2.24  --get-server-output --time 120
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.01   sec  2.37 MBytes  19.7 Mbits/sec  149   42.1 KBytes
[  7]   0.00-1.01   sec  2.39 MBytes  19.9 Mbits/sec  133   5.28 KBytes
[  9]   0.00-1.01   sec  3.26 MBytes  27.1 Mbits/sec  210   44.8 KBytes
[ 11]   0.00-1.01   sec  2.21 MBytes  18.4 Mbits/sec   53   2.63 KBytes
[ 13]   0.00-1.01   sec  2.28 MBytes  19.0 Mbits/sec  114   1.32 KBytes
[ 15]   0.00-1.01   sec  2.33 MBytes  19.3 Mbits/sec   75   2.66 KBytes
[ 17]   0.00-1.01   sec  2.53 MBytes  21.0 Mbits/sec  201   86.9 KBytes
[ 19]   0.00-1.01   sec  2.58 MBytes  21.5 Mbits/sec  119   90.8 KBytes
[ 21]   0.00-1.01   sec  2.55 MBytes  21.2 Mbits/sec  102    134 KBytes
[ 23]   0.00-1.01   sec  2.61 MBytes  21.7 Mbits/sec   31   1.32 KBytes
[ 25]   0.00-1.01   sec  2.27 MBytes  18.9 Mbits/sec   76   1.32 KBytes
[ 27]   0.00-1.01   sec  2.53 MBytes  21.1 Mbits/sec   88    147 KBytes
[ 29]   0.00-1.01   sec  2.48 MBytes  20.6 Mbits/sec   94   1.32 KBytes
[ 31]   0.00-1.01   sec  2.54 MBytes  21.1 Mbits/sec   23   1.32 KBytes
[ 33]   0.00-1.01   sec  2.56 MBytes  21.3 Mbits/sec  113   47.4 KBytes
[ 35]   0.00-1.01   sec  2.48 MBytes  20.7 Mbits/sec   92    161 KBytes
[SUM]   0.00-1.01   sec  40.0 MBytes   333 Mbits/sec  1673
  x0:                0
  x1: ffff00010d052200
  x2: ffff0000009e0078 (console_pausestr + 25688)
  x3:              30d
  x4:                0
  x5:                d
  x6:  a009b033eaa2d8c
  x7:    8172b24fa0a00
  x8: ffff000114a9d000
  x9:                0
 x10:                1
 x11:                3
 x12:                1
 x13:                0
 x14:            10000
 x15:                1
 x16:            10000
 x17: ffff0001737b1958 (ng_unref_node + 0)
 x18: ffff00010e4806d0
 x19: ffff0001146d9000
 x20: ffff0001146d9058
 x21: ffffa000024e1200
 x22:                0
 x23: ffff00010e480750
 x24: ffffa000031db680
 x25: ffffa000031c2c00
 x26: ffff000000cac018 (Giant + 18)
 x27: ffff000000961b27 (digits + 227b9)
 x28: ffffa000031c2c10
 x29: ffff00010e4806d0
  sp: ffff00010e4806d0
  lr: ffff00000081afc0 (dpaa2_ni_poll + 3c)
 elr: ffff00000081b028 (dpaa2_ni_poll + a4)
spsr:         40000045
 far:                1
 esr:         96000004
panic: vm_fault failed: ffff00000081b028 error 1
cpuid = 7
time = 1679392819
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x13c
panic() at panic+0x44
data_abort() at data_abort+0x32c
handle_el1h_sync() at handle_el1h_sync+0x10
--- exception, esr 0
(null)() at 0x10000
KDB: enter: panic
[ thread pid 12 tid 100119 ]
Stopped at      kdb_enter+0x44: undefined       f906427f
db>
Tracing pid 12 tid 100119 td 0xffff00010d052200
db_trace_self() at db_trace_self
db_stack_trace() at db_stack_trace+0x11c
db_command() at db_command+0x2d8
db_command_loop() at db_command_loop+0x54
db_trap() at db_trap+0xf8
kdb_trap() at kdb_trap+0x28c
handle_el1h_sync() at handle_el1h_sync+0x10
--- exception, esr 0
(null)() at 0
db>
  • output of while true; vmstat -i | grep dpaa2_io; sleep 1; end
  • and top -SjwHPz -mcpu at moment of crash (tmux over mosh)
its0,43: dpaa2_io0                                 25732021        399
its0,44: dpaa2_io1                                  4262516         66
its0,45: dpaa2_io2                                  4645361         72
its0,46: dpaa2_io3                                  4869407         76
its0,47: dpaa2_io4                                  4506983         70
its0,48: dpaa2_io5                                  4257987         66
its0,49: dpaa2_io6                                  3052330         47
its0,50: dpaa2_io7                                  2853862         44


436 threads:   9 running, 373 sleeping, 54 waiting
CPU 0:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 1:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 2:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 3:  0.0% user,  0.0% nice,  0.0% system,  100% interrupt,  0.0% idle
CPU 4:  0.0% user,  0.0% nice,  0.0% system,  100% interrupt,  0.0% idle
CPU 5:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 6:  0.0% user,  0.0% nice,  100% system,  0.0% interrupt,  0.0% idle
CPU 7:  0.0% user,  0.0% nice,  0.0% system,  100% interrupt,  0.0% idle
Mem: 122M Active, 5184M Inact, 4528M Wired, 40K Buf, 21G Free
ARC: 2678M Total, 388M MFU, 1623M MRU, 1153K Anon, 53M Header, 607M Other
     1706M Compressed, 3914M Uncompressed, 2.29:1 Ratio
Swap: 4096M Total, 4096M Free

  PID   JID USERNAME2   PRI NICE   SIZE    RES SWAP STATE    C   TIME    WCPU COMMAND
21012     0 root         21    0    17M  4516K   0B CPU3     3   0:00 100.00% top
52722     0 dch          20    0    28M    17M   0B select   4   0:02  44.43% mosh-server
   12     0 root        -64    -     0B   736K   0B WAIT     4   4:24  21.61% intr{its0,46: dpaa2_io3}
   12     0 root        -64    -     0B   736K   0B WAIT     7   7:47  21.43% intr{its0,43: dpaa2_io0}
85664     0 dch          20    0    14M  5016K   0B select   1   0:01  13.89% tmux
   12     0 root        -64    -     0B   736K   0B WAIT     5   4:12  13.20% intr{its0,45: dpaa2_io2}
   12     0 root        -64    -     0B   736K   0B WAIT     0   2:37  10.00% intr{its0,50: dpaa2_io7}
   12     0 root        -64    -     0B   736K   0B WAIT     2   3:33   9.24% intr{its0,48: dpaa2_io5}
   12     0 root        -64    -     0B   736K   0B WAIT     3   4:37   8.09% intr{its0,47: dpaa2_io4}
   12     0 root        -64    -     0B   736K   0B WAIT     6   3:42   3.56% intr{its0,44: dpaa2_io1}
    0     0 root        -64    -     0B  2400K   0B -        1   0:14   2.45% kernel{dpaa2_ni1_tqbp}
59646     1    317       20    0   363M   128M   0B kqread   6   2:27   1.19% node{node}
    0     0 root        -64    -     0B  2400K   0B -        2   0:17   0.38% kernel{dpaa2_ni2_tqbp}
20183     0 root         20    0    33M    14M   0B select   6   2:39   0.00% zerotier-one{zerotier-one}
   12     0 root        -64    -     0B   736K   0B WAIT     1   2:38   0.00% intr{its0,49: dpaa2_io6}
44098     0 root         20    0  1674M   537M   0B kqread   2   2:05   0.00% kresd
    6     0 root         -8    -     0B  1360K   0B tx->tx   1   1:54   0.00% zfskern{txg_thread_enter}
    2     0 root        -60    -     0B   128K   0B WAIT     1   1:44   0.00% clock{clock (0)}
19910     0 root        -16    -     0B    16K   0B pftm     1   1:38   0.00% pf purge
@dsalychev dsalychev added bug Something isn't working panic Kernel panic labels Mar 22, 2023
@dsalychev

@dch thanks for all of the details! Btw, was it a GENERIC kernel you tested on?

dch commented Mar 26, 2023 via email

dch commented May 20, 2023

I've been seeing this a lot (every 5-10 minutes) after 0d574d8 with https://reviews.freebsd.org/D40094 applied,
sometimes so early that the Ten64 doesn't even complete the switch to userland.
It may be a generic arm64 issue, but I'm not seeing it on other
hardware pushing a lot more traffic.

@dsalychev

@dch I'm not sure about 0d574d8, but I've modified address translation recently. Could you try 718bdb6 and the one before it, 74192f9?

dch commented May 28, 2023

sorry it took a while but 718bdb6 is the culprit. Reverting this & we're all ok again.

markmi commented Jun 11, 2023

> sorry it took a while but 718bdb6 is the culprit. Reverting this & we're all ok again.

The original report here is from Mar 22 but that commit is from May 11. Time relationship seems wrong for 718bdb6 to be the only issue.

dch commented Jun 12, 2023

Correct, I thought that was clear from the original title & updated comment.

  • the symptom is the same, under load from 1 physical interface to another, I get the vm_fault failed panic
  • with 718bdb6 included, panics are frequent, every 5-10 minutes
  • with that reverted, they occur roughly weekly; with autoreboot this is very usable

@dsalychev

@dch Thanks for the summary, that's how I understood the issue. I assume its root cause is different channels accessing bus_dma resources concurrently. You won't see those panics with only one channel up and running. Just FYI, I'm trying to isolate channels within their own tasks and limit access to shared resources as much as possible.

@dsalychev dsalychev mentioned this issue Jun 16, 2023
@dsalychev dsalychev self-assigned this Jul 7, 2023
@dsalychev

@dch I've prepared a lot of changes in the https://github.com/mcusim/freebsd-src/tree/dpaa2 branch. Could you try it? A GENERIC kernel worked for me under high network load for ~14 hours, at which point I stopped the test myself. Btw, I've also discovered that the kernel panics with "undefined instruction" when the Ten64's SoC heats up to 80-90C (sysctl hw.temperature). Please keep an eye on it.

@dsalychev

It should be fixed on CURRENT with https://cgit.freebsd.org/src/commit/?id=58983e4b0253ad38a3e1ef2166fedd3133fdb552 merged in.

@dsalychev dsalychev added the priority Something important to fix label Aug 22, 2023
dch commented Sep 4, 2023

so far LGTM on 15.0-CURRENT - a 3h test (albeit on 1G ifaces only) is stable.
awesome! I need to move some cabling around for 10G but this is great progress!

thanks @dsalychev

pkubaj commented Sep 4, 2023

I'm on stable/14 and am planning to switch to releng/14.0 when it's branched off, but it also seems stable.
But regarding SFP+ ports, I'm not able to connect to them. I have Intel X520-DA2 card:

ix0@pci0:1:0:0: class=0x020000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x10fb subvendor=0x8086 subdevice=0x7a11
    vendor     = 'Intel Corporation'
    device     = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
    class      = network
    subclass   = ethernet

It's able to link up when plugged in via loopback, but not when I plug it into the Ten64. I haven't reported it yet because I haven't tested whether it works under Linux.

@dsalychev

@dch, @pkubaj thanks for all of the tests. Please don't expect SFP+ to be operational at all at the moment. I've just started working on the design of something I call "sffbus" (similar to miibus(4)).

dch commented Sep 5, 2023

using e04c4b4 this is still stable. thanks!

@dch dch closed this as completed Sep 5, 2023
@dsalychev

Good to know :) Thanks for testing!

dsalychev pushed a commit that referenced this issue Sep 9, 2023
netlink(4) calls back into the driver during detach and it attempts to
start an internal synchronized op recursively, causing an interruptible
hang.  Fix it by failing the ioctl if the VI has been marked as DOOMED
by cxgbe_detach.

Here's the stack for the hang for reference.
 #6  begin_synchronized_op
 #7  cxgbe_media_status
 #8  ifmedia_ioctl
 #9  cxgbe_ioctl
 #10 if_ioctl
 #11 get_operstate_ether
 #12 get_operstate
 #13 dump_iface
 #14 rtnl_handle_ifevent
 #15 rtnl_handle_ifnet_event
 #16 rt_ifmsg
 #17 if_unroute
 #18 if_down
 #19 if_detach_internal
 #20 if_detach
 #21 ether_ifdetach
 #22 cxgbe_vi_detach
 #23 cxgbe_detach
 #24 DEVICE_DETACH

MFC after:	3 days
Sponsored by:	Chelsio Communications
@dch dch mentioned this issue Dec 16, 2023