panic under heavy network load #19

Closed
dch opened this issue Mar 22, 2023 · 15 comments
Assignees
Labels
bug Something isn't working panic Kernel panic priority Something important to fix

Comments

dch commented Mar 22, 2023

This only reproduces when more than the usual amount of cross-dpaa interface traffic is present.
I can trigger it reliably using iperf3. This is using stock CURRENT, not a fork.

$ iperf3 --parallel 16 --client 172.16.2.24  --get-server-output --time 120
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.01   sec  2.37 MBytes  19.7 Mbits/sec  149   42.1 KBytes
[  7]   0.00-1.01   sec  2.39 MBytes  19.9 Mbits/sec  133   5.28 KBytes
[  9]   0.00-1.01   sec  3.26 MBytes  27.1 Mbits/sec  210   44.8 KBytes
[ 11]   0.00-1.01   sec  2.21 MBytes  18.4 Mbits/sec   53   2.63 KBytes
[ 13]   0.00-1.01   sec  2.28 MBytes  19.0 Mbits/sec  114   1.32 KBytes
[ 15]   0.00-1.01   sec  2.33 MBytes  19.3 Mbits/sec   75   2.66 KBytes
[ 17]   0.00-1.01   sec  2.53 MBytes  21.0 Mbits/sec  201   86.9 KBytes
[ 19]   0.00-1.01   sec  2.58 MBytes  21.5 Mbits/sec  119   90.8 KBytes
[ 21]   0.00-1.01   sec  2.55 MBytes  21.2 Mbits/sec  102    134 KBytes
[ 23]   0.00-1.01   sec  2.61 MBytes  21.7 Mbits/sec   31   1.32 KBytes
[ 25]   0.00-1.01   sec  2.27 MBytes  18.9 Mbits/sec   76   1.32 KBytes
[ 27]   0.00-1.01   sec  2.53 MBytes  21.1 Mbits/sec   88    147 KBytes
[ 29]   0.00-1.01   sec  2.48 MBytes  20.6 Mbits/sec   94   1.32 KBytes
[ 31]   0.00-1.01   sec  2.54 MBytes  21.1 Mbits/sec   23   1.32 KBytes
[ 33]   0.00-1.01   sec  2.56 MBytes  21.3 Mbits/sec  113   47.4 KBytes
[ 35]   0.00-1.01   sec  2.48 MBytes  20.7 Mbits/sec   92    161 KBytes
[SUM]   0.00-1.01   sec  40.0 MBytes   333 Mbits/sec  1673
  x0:                0
  x1: ffff00010d052200
  x2: ffff0000009e0078 (console_pausestr + 25688)
  x3:              30d
  x4:                0
  x5:                d
  x6:  a009b033eaa2d8c
  x7:    8172b24fa0a00
  x8: ffff000114a9d000
  x9:                0
 x10:                1
 x11:                3
 x12:                1
 x13:                0
 x14:            10000
 x15:                1
 x16:            10000
 x17: ffff0001737b1958 (ng_unref_node + 0)
 x18: ffff00010e4806d0
 x19: ffff0001146d9000
 x20: ffff0001146d9058
 x21: ffffa000024e1200
 x22:                0
 x23: ffff00010e480750
 x24: ffffa000031db680
 x25: ffffa000031c2c00
 x26: ffff000000cac018 (Giant + 18)
 x27: ffff000000961b27 (digits + 227b9)
 x28: ffffa000031c2c10
 x29: ffff00010e4806d0
  sp: ffff00010e4806d0
  lr: ffff00000081afc0 (dpaa2_ni_poll + 3c)
 elr: ffff00000081b028 (dpaa2_ni_poll + a4)
spsr:         40000045
 far:                1
 esr:         96000004
panic: vm_fault failed: ffff00000081b028 error 1
cpuid = 7
time = 1679392819
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x13c
panic() at panic+0x44
data_abort() at data_abort+0x32c
handle_el1h_sync() at handle_el1h_sync+0x10
--- exception, esr 0
(null)() at 0x10000
KDB: enter: panic
[ thread pid 12 tid 100119 ]
Stopped at      kdb_enter+0x44: undefined       f906427f
db>
Tracing pid 12 tid 100119 td 0xffff00010d052200
db_trace_self() at db_trace_self
db_stack_trace() at db_stack_trace+0x11c
db_command() at db_command+0x2d8
db_command_loop() at db_command_loop+0x54
db_trap() at db_trap+0xf8
kdb_trap() at kdb_trap+0x28c
handle_el1h_sync() at handle_el1h_sync+0x10
--- exception, esr 0
(null)() at 0
db>
  • output of while true; vmstat -i | grep dpaa2_io; sleep 1; end
  • and top -SjwHPz -mcpu at moment of crash (tmux over mosh)
its0,43: dpaa2_io0                                 25732021        399
its0,44: dpaa2_io1                                  4262516         66
its0,45: dpaa2_io2                                  4645361         72
its0,46: dpaa2_io3                                  4869407         76
its0,47: dpaa2_io4                                  4506983         70
its0,48: dpaa2_io5                                  4257987         66
its0,49: dpaa2_io6                                  3052330         47
its0,50: dpaa2_io7                                  2853862         44


436 threads:   9 running, 373 sleeping, 54 waiting
CPU 0:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 1:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 2:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 3:  0.0% user,  0.0% nice,  0.0% system,  100% interrupt,  0.0% idle
CPU 4:  0.0% user,  0.0% nice,  0.0% system,  100% interrupt,  0.0% idle
CPU 5:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 6:  0.0% user,  0.0% nice,  100% system,  0.0% interrupt,  0.0% idle
CPU 7:  0.0% user,  0.0% nice,  0.0% system,  100% interrupt,  0.0% idle
Mem: 122M Active, 5184M Inact, 4528M Wired, 40K Buf, 21G Free
ARC: 2678M Total, 388M MFU, 1623M MRU, 1153K Anon, 53M Header, 607M Other
     1706M Compressed, 3914M Uncompressed, 2.29:1 Ratio
Swap: 4096M Total, 4096M Free

  PID   JID USERNAME2   PRI NICE   SIZE    RES SWAP STATE    C   TIME    WCPU COMMAND
21012     0 root         21    0    17M  4516K   0B CPU3     3   0:00 100.00% top
52722     0 dch          20    0    28M    17M   0B select   4   0:02  44.43% mosh-server
   12     0 root        -64    -     0B   736K   0B WAIT     4   4:24  21.61% intr{its0,46: dpaa2_io3}
   12     0 root        -64    -     0B   736K   0B WAIT     7   7:47  21.43% intr{its0,43: dpaa2_io0}
85664     0 dch          20    0    14M  5016K   0B select   1   0:01  13.89% tmux
   12     0 root        -64    -     0B   736K   0B WAIT     5   4:12  13.20% intr{its0,45: dpaa2_io2}
   12     0 root        -64    -     0B   736K   0B WAIT     0   2:37  10.00% intr{its0,50: dpaa2_io7}
   12     0 root        -64    -     0B   736K   0B WAIT     2   3:33   9.24% intr{its0,48: dpaa2_io5}
   12     0 root        -64    -     0B   736K   0B WAIT     3   4:37   8.09% intr{its0,47: dpaa2_io4}
   12     0 root        -64    -     0B   736K   0B WAIT     6   3:42   3.56% intr{its0,44: dpaa2_io1}
    0     0 root        -64    -     0B  2400K   0B -        1   0:14   2.45% kernel{dpaa2_ni1_tqbp}
59646     1    317       20    0   363M   128M   0B kqread   6   2:27   1.19% node{node}
    0     0 root        -64    -     0B  2400K   0B -        2   0:17   0.38% kernel{dpaa2_ni2_tqbp}
20183     0 root         20    0    33M    14M   0B select   6   2:39   0.00% zerotier-one{zerotier-one}
   12     0 root        -64    -     0B   736K   0B WAIT     1   2:38   0.00% intr{its0,49: dpaa2_io6}
44098     0 root         20    0  1674M   537M   0B kqread   2   2:05   0.00% kresd
    6     0 root         -8    -     0B  1360K   0B tx->tx   1   1:54   0.00% zfskern{txg_thread_enter}
    2     0 root        -60    -     0B   128K   0B WAIT     1   1:44   0.00% clock{clock (0)}
19910     0 root        -16    -     0B    16K   0B pftm     1   1:38   0.00% pf purge
@dsalychev dsalychev added bug Something isn't working panic Kernel panic labels Mar 22, 2023
@dsalychev

@dch thanks for all of the details! Btw, was it a GENERIC kernel you tested on?

dch commented Mar 26, 2023 via email

dch commented May 20, 2023

I've been seeing this a lot (every 5-10 minutes) after 0d574d8 with https://reviews.freebsd.org/D40094 applied,
sometimes so early that the Ten64 doesn't even complete the switch to userland.
It may be a generic arm64 issue, but I'm not seeing it on other
hardware pushing a lot more traffic.

@dsalychev

@dch I'm not sure about 0d574d8, but I've modified address translation recently. Could you try 718bdb6 and the one before it, 74192f9?

dch commented May 28, 2023

sorry it took a while but 718bdb6 is the culprit. Reverting this & we're all ok again.

markmi commented Jun 11, 2023

> sorry it took a while but 718bdb6 is the culprit. Reverting this & we're all ok again.

The original report here is from Mar 22 but that commit is from May 11. Time relationship seems wrong for 718bdb6 to be the only issue.

dch commented Jun 12, 2023

Correct, I thought that was clear from the original title & updated comment.

  • the symptom is the same, under load from 1 physical interface to another, I get the vm_fault failed panic
  • with 718bdb6 included, panics are frequent, every 5-10 minutes
  • with that reverted, they occur roughly weekly; with autoreboot this is very usable

@dsalychev

@dch Thanks for the summary, that's how I understood the issue. I assume its root cause is different channels accessing bus_dma resources concurrently. You won't see those panics with only one channel up and running. Just FYI, I'm trying to isolate channels within their own tasks and limit access to shared resources as much as possible.

@dsalychev dsalychev mentioned this issue Jun 16, 2023
@dsalychev dsalychev self-assigned this Jul 7, 2023
@dsalychev

@dch I've prepared a lot of changes in the https://github.com/mcusim/freebsd-src/tree/dpaa2 branch. Could you try it? A GENERIC kernel worked for me under high network load for ~14 hours, at which point I stopped the test myself. Btw, I've also discovered that the kernel panics with "undefined instruction" when the Ten64's SoC heats up to 80-90C (sysctl hw.temperature). Please keep an eye on it.

@dsalychev

It should be fixed on CURRENT with https://cgit.freebsd.org/src/commit/?id=58983e4b0253ad38a3e1ef2166fedd3133fdb552 merged in.

@dsalychev dsalychev added the priority Something important to fix label Aug 22, 2023
dch commented Sep 4, 2023

so far LGTM on 15.0-CURRENT - a 3h test (albeit on 1G ifaces only) is stable.
awesome! I need to move some cabling around for 10G but this is great progress!

thanks @dsalychev

pkubaj commented Sep 4, 2023

I'm on stable/14 and am planning to switch to releng/14.0 when it's branched off, but it also seems stable.
But regarding SFP+ ports, I'm not able to connect to them. I have Intel X520-DA2 card:

ix0@pci0:1:0:0: class=0x020000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x10fb subvendor=0x8086 subdevice=0x7a11
    vendor     = 'Intel Corporation'
    device     = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
    class      = network
    subclass   = ethernet

It's able to link up when plugged in via loopback, but not when I plug it into the Ten64. I haven't reported it yet because I haven't tested whether it works under Linux.

@dsalychev

@dch, @pkubaj thanks for all of the tests. Please don't expect SFP+ to be operational at all at the moment. I've just started working on the design of something I call "sffbus" (similar to miibus(4)).

dch commented Sep 5, 2023

using e04c4b4 this is still stable. thanks!

@dch dch closed this as completed Sep 5, 2023
@dsalychev

Good to know :) Thanks for testing!

dsalychev pushed a commit that referenced this issue Sep 9, 2023
netlink(4) calls back into the driver during detach and it attempts to
start an internal synchronized op recursively, causing an interruptible
hang.  Fix it by failing the ioctl if the VI has been marked as DOOMED
by cxgbe_detach.

Here's the stack for the hang for reference.
 #6  begin_synchronized_op
 #7  cxgbe_media_status
 #8  ifmedia_ioctl
 #9  cxgbe_ioctl
 #10 if_ioctl
 #11 get_operstate_ether
 #12 get_operstate
 #13 dump_iface
 #14 rtnl_handle_ifevent
 #15 rtnl_handle_ifnet_event
 #16 rt_ifmsg
 #17 if_unroute
 #18 if_down
 #19 if_detach_internal
 #20 if_detach
 #21 ether_ifdetach
 #22 cxgbe_vi_detach
 #23 cxgbe_detach
 #24 DEVICE_DETACH

MFC after:	3 days
Sponsored by:	Chelsio Communications
@dch dch mentioned this issue Dec 16, 2023