dpaa2_ni_rx panic: dpaa2_ni_rx: unexpected frame buffer fd_addr != buf_paddr #8

mcbridematt · 2022-05-17T11:22:33Z

Commit: 173aa2a

I ran into this twice when running my stresstest for long periods of time (>1 hour)

panic: dpaa2_ni_rx: unexpected frame buffer: fd_addr(0x305800008c900000) != buf_paddr(0x3058000088ccf000)
cpuid = 5
time = 1652662301
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
kdb_backtrace() at kdb_backtrace+0x38
vpanic() at vpanic+0x17c
panic() at panic+0x44
dpaa2_ni_rx() at dpaa2_ni_rx+0x26c
dpaa2_ni_poll_task() at dpaa2_ni_poll_task+0x1b0
taskqueue_run_locked() at taskqueue_run_locked+0xac
taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic
[ thread pid 0 tid 100118 ]
Stopped at      kdb_enter+0x40: undefined       f902027f

First crash trace;

(kgdb) frame 3
#3  0xffff0000007d3394 in dpaa2_ni_rx (chan=0xffff0000fd6f8000, fq=<optimized out>, fd=0xffff0000fda42020) at /usr/src/freebsd-src/sys/dev/dpaa2/dpaa2_ni.c:2630
2630            KASSERT(paddr == buf->paddr, ("%s: unexpected frame buffer: "
(kgdb) info locals
released = {0, 0, 0, 8589934592, 18446462598741807642, 18446462602928396336, 18446462598741044400}
ifp = <optimized out>
sc = <optimized out>
paddr = 3483534314129326080
released_n = 0
buf = <optimized out>
buf_chan = 0xec36f06f7149058a
buf_idx = <optimized out>
m = <optimized out>
buf_len = <optimized out>
buf_data = <optimized out>
error = <optimized out>
bp_dev = <optimized out>
bpsc = <optimized out>
chan_idx = <optimized out>
(kgdb) frame 4
#4  0xffff0000007d2da8 in dpaa2_ni_consume_frames (chan=0xffff0000fd6f8000, src=<optimized out>, consumed=<optimized out>) at /usr/src/freebsd-src/sys/dev/dpaa2/dpaa2_ni.c:2568
2568                                    fq->consume(chan, fq, fd);
(kgdb) info locals
retries = <optimized out>
fq = 0x80
rc = 36
dq = 0xffff0000fda42000
fd = 0xffff0000008c5b8a
frames = <optimized out>
(kgdb) print *dq
$4 = {{common = {verb = 96 '`', _reserved = "\223\000\000\000\000\000̖", '\000' <repeats 16 times>, "\321s\375\000\000\377\377\000P\347\070\202\000\024\000\352\005\000\000\000\000\300\000\000\200\000 \000\000\200\a\000\000\000\000\000\000\000"}, fdr = {desc = {
        verb = 96 '`', stat = 147 '\223', seqnum = 0, oprid = 0, _reserved = 0 '\000', tok = 204 '\314', fqid = 150, _reserved1 = 0, fq_byte_cnt = 0, fq_frm_cnt = 0, fqd_ctx = 18446462602985066752}, fd = {addr = 5630058834644992, data_length = 1514, bpid_ivp_bmt = 0,
        offset_fmt_sl = 192, frame_ctx = 536903680, ctrl = 125829120, flow_ctx = 0}}, scn = {verb = 96 '`', stat = 147 '\223', state = 0 '\000', _reserved = 0 '\000', rid_tok = 3422552064, ctx = 150}}}
(kgdb) print *fd
$5 = {addr = 7165916604720706863, data_length = 1701996079, bpid_ivp_bmt = 25189, offset_fmt_sl = 25715, frame_ctx = 1668444973, ctrl = 1937339183, flow_ctx = 7307986971750918959}
(kgdb) print *fq
Cannot access memory at address 0x80
(kgdb) print fd
$6 = (struct dpaa2_fd *) 0

Second time:

#4  0xffff0000007d2da8 in dpaa2_ni_consume_frames (chan=0xffff0000fc616000, src=<optimized out>, consumed=<optimized out>) at /usr/src/freebsd-src/sys/dev/dpaa2/dpaa2_ni.c:2568
2568                                    fq->consume(chan, fq, fd);
(kgdb) info locals
retries = <optimized out>
fq = 0x80
rc = 36
dq = 0xffff0000fcc58000
fd = 0xffff0000008c5b8a
frames = <optimized out>
(kgdb) print *dq
$1 = {{common = {verb = 96 '`', _reserved = "\022\000\000\000\000\000̵", '\000' <repeats 16 times>, "\215a\374\000\000\377\377\000\000=\214\000\000\270qB\000\000\000\000\000\300@\000\240\000 \000\000\001\000\000\000\000\000\000\000\000"}, fdr = {desc = {verb = 96 '`',
        stat = 18 '\022', seqnum = 0, oprid = 0, _reserved = 0 '\000', tok = 204 '\314', fqid = 181, _reserved1 = 0, fq_byte_cnt = 0, fq_frm_cnt = 0, fqd_ctx = 18446462602967092480}, fd = {addr = 8194299524353425408, data_length = 66, bpid_ivp_bmt = 0,
        offset_fmt_sl = 16576, frame_ctx = 536911872, ctrl = 65536, flow_ctx = 0}}, scn = {verb = 96 '`', stat = 18 '\022', state = 0 '\000', _reserved = 0 '\000', rid_tok = 3422552064, ctx = 181}}}
(kgdb) print *fd
$2 = {addr = 7165916604720706863, data_length = 1701996079, bpid_ivp_bmt = 25189, offset_fmt_sl = 25715, frame_ctx = 1668444973, ctrl = 1937339183, flow_ctx = 7307986971750918959}

The text was updated successfully, but these errors were encountered:

dsalychev · 2022-06-22T20:32:51Z

@mcbridematt Could you try to reproduce it with 34014de, for example? And with both GENERIC and GENERIC-NODEBUG kernel configurations?

mcbridematt · 2022-06-26T08:22:24Z

Hit the same(?) problem, but this time in the tx path:

panic: dpaa2_ni_tx_conf: unexpected frame buffer: fd_addr(0x93a5e000) != txb_paddr(0x8cf27000)
cpuid = 5
time = 1656151823
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x13c
panic() at panic+0x44
dpaa2_ni_tx_conf() at dpaa2_ni_tx_conf+0x138
dpaa2_ni_poll_task() at dpaa2_ni_poll_task+0x160
taskqueue_run_locked() at taskqueue_run_locked+0x17c
taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
fork_exit() at fork_exit+0x74

dsalychev · 2022-07-09T18:40:54Z

@mcbridematt Could you test with 1a7aba9?

mcbridematt · 2022-07-10T07:05:30Z

Unfortunately it still happens, I saw both the RX and TX assertions triggered testing today.

dsalychev · 2022-08-01T09:33:44Z

@mcbridematt Could you try e95fb52? I've simplified software portals locking mechanism there and tested with several task threads to poll frames in dpaa2_ni_poll_task().

mcbridematt · 2022-08-09T23:11:18Z

Looking good so far, no panic and no warnings/errors in dmesg when testing 4 ports and debug kernel over 9 hours.
I will try NODEBUG next.

dsalychev · 2022-08-10T06:31:03Z

@mcbridematt btw, I noticed that you were using a network interface with several Rx queues/channels (custom DPL?). Could you try it as well?

mcbridematt · 2022-08-11T08:18:58Z

@dsalychev It looks like NODEBUG is working fine as well :)

I'm pretty sure the multiple Rx queues is from the new DPL which has been default since Ten64 firmware v0.8.10, it was part of the method suggested to me by NXP that allows all 10 ports to balance traffic across all CPUs
https://forum.traverse.com.au/t/more-details-on-interrupt-balancing-dpaa2-config-dpio-splitting/114/4?u=mcbridematt

To be honest I haven't checked if Linux takes advantage of all Rx queues but I might go and check..

dsalychev · 2022-08-11T08:49:38Z

@mcbridematt I recently started using multiple threads to receive frames: https://github.com/mcusim/freebsd-src/blob/lx2160acex7-dev/sys/dev/dpaa2/dpaa2_ni.c#L646-L648 That's why I'm interested :) I'll check my Ten64 firmware and try to stress my Ten64, thanks for info!

dsalychev · 2023-03-01T10:17:58Z

I haven't noticed this panic on https://github.com/mcusim/freebsd-src/tree/ten64. @mcbridematt Could you confirm after your test?

Under certain loads, the following panic is hit: panic: page fault KDB: stack backtrace: #0 0xffffffff805db025 at kdb_backtrace+0x65 #1 0xffffffff8058e86f at vpanic+0x17f #2 0xffffffff8058e6e3 at panic+0x43 #3 0xffffffff808adc15 at trap_fatal+0x385 #4 0xffffffff808adc6f at trap_pfault+0x4f #5 0xffffffff80886da8 at calltrap+0x8 #6 0xffffffff80669186 at vgonel+0x186 #7 0xffffffff80669841 at vgone+0x31 #8 0xffffffff8065806d at vfs_hash_insert+0x26d #9 0xffffffff81a39069 at sfs_vgetx+0x149 #10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4 #11 0xffffffff8065a28c at lookup+0x45c #12 0xffffffff806594b9 at namei+0x259 #13 0xffffffff80676a33 at kern_statat+0xf3 #14 0xffffffff8067712f at sys_fstatat+0x2f #15 0xffffffff808ae50c at amd64_syscall+0x10c #16 0xffffffff808876bb at fast_syscall_common+0xf8 The page fault occurs because vgonel() will call VOP_CLOSE() for active vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While here, define vop_open for consistency. After adding the necessary vop, the bug progresses to the following panic: panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1) cpuid = 17 KDB: stack backtrace: #0 0xffffffff805e29c5 at kdb_backtrace+0x65 #1 0xffffffff8059620f at vpanic+0x17f #2 0xffffffff81a27f4a at spl_panic+0x3a #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40 #4 0xffffffff8066fdee at vinactivef+0xde #5 0xffffffff80670b8a at vgonel+0x1ea #6 0xffffffff806711e1 at vgone+0x31 #7 0xffffffff8065fa0d at vfs_hash_insert+0x26d #8 0xffffffff81a39069 at sfs_vgetx+0x149 #9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4 #10 0xffffffff80661c2c at lookup+0x45c #11 0xffffffff80660e59 at namei+0x259 #12 0xffffffff8067e3d3 at kern_statat+0xf3 #13 0xffffffff8067eacf at sys_fstatat+0x2f #14 0xffffffff808b5ecc at amd64_syscall+0x10c #15 0xffffffff8088f07b at fast_syscall_common+0xf8 This is caused by a race condition that can occur when allocating a new vnode and adding that vnode to the vfs hash. If the newly created vnode loses the race when being inserted into the vfs hash, it will not be recycled as its usecount is greater than zero, hitting the above assertion. Fix this by dropping the assertion. FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700 Reviewed-by: Andriy Gapon <avg@FreeBSD.org> Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Alek Pinchuk <apinchuk@axcient.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Rob Wing <rob.wing@klarasystems.com> Co-authored-by: Rob Wing <rob.wing@klarasystems.com> Submitted-by: Klara, Inc. Sponsored-by: rsync.net Closes #14501

dsalychev · 2023-07-28T10:21:41Z

@mcbridematt Could you conduct the same stress test again on https://github.com/mcusim/freebsd-src/tree/dpaa2 ? The ten64 branch is stale now and almost all of the changes have found their way into the dpaa2 one.

mcbridematt · 2023-08-01T10:36:35Z

Sorry, I should have closed this issue long ago. But it has definitely not reappeared in the latest code.

Avoid locking issues when if_allmulti() calls the driver's if_ioctl, because that may acquire sleepable locks (while we hold a non-sleepable rwlock). Fortunately there's no pressing need to hold the mroute lock while we do this, so we can postpone the call slightly, until after we've released the lock. This avoids the following WITNESS warning (with iflib drivers): lock order reversal: (sleepable after non-sleepable) 1st 0xffffffff82f64960 IPv4 multicast forwarding (IPv4 multicast forwarding, rw) @ /usr/src/sys/netinet/ip_mroute.c:1050 2nd 0xfffff8000480f180 iflib ctx lock (iflib ctx lock, sx) @ /usr/src/sys/net/iflib.c:4525 lock order IPv4 multicast forwarding -> iflib ctx lock attempted at: #0 0xffffffff80bbd6ce at witness_checkorder+0xbbe #1 0xffffffff80b56d10 at _sx_xlock+0x60 #2 0xffffffff80c9ce5c at iflib_if_ioctl+0x2dc #3 0xffffffff80c7c395 at if_setflag+0xe5 #4 0xffffffff82f60a0e at del_vif_locked+0x9e #5 0xffffffff82f5f0d5 at X_ip_mrouter_set+0x265 #6 0xffffffff80bfd402 at sosetopt+0xc2 #7 0xffffffff80c02105 at kern_setsockopt+0xa5 #8 0xffffffff80c02054 at sys_setsockopt+0x24 #9 0xffffffff81046be8 at amd64_syscall+0x138 #10 0xffffffff8101930b at fast_syscall_common+0xf8 See also: https://redmine.pfsense.org/issues/12079 Reviewed by: mjg Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D41209

netlink(4) calls back into the driver during detach and it attempts to start an internal synchronized op recursively, causing an interruptible hang. Fix it by failing the ioctl if the VI has been marked as DOOMED by cxgbe_detach. Here's the stack for the hang for reference. #6 begin_synchronized_op #7 cxgbe_media_status #8 ifmedia_ioctl #9 cxgbe_ioctl #10 if_ioctl #11 get_operstate_ether #12 get_operstate #13 dump_iface #14 rtnl_handle_ifevent #15 rtnl_handle_ifnet_event #16 rt_ifmsg #17 if_unroute #18 if_down #19 if_detach_internal #20 if_detach #21 ether_ifdetach #22 cxgbe_vi_detach #23 cxgbe_detach #24 DEVICE_DETACH MFC after: 3 days Sponsored by: Chelsio Communications

mcbridematt mentioned this issue May 17, 2022

Panic in dpaa2_ni_poll_task #4

Closed

dsalychev added bug Something isn't working panic Kernel panic labels May 18, 2022

dsalychev self-assigned this May 20, 2022

dsalychev mentioned this issue Jul 28, 2023

[ten64 branch] Dataflow stops with multiple port traffic #17

Closed

mcbridematt closed this as completed Aug 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dpaa2_ni_rx panic: dpaa2_ni_rx: unexpected frame buffer fd_addr != buf_paddr #8

dpaa2_ni_rx panic: dpaa2_ni_rx: unexpected frame buffer fd_addr != buf_paddr #8

mcbridematt commented May 17, 2022

dsalychev commented Jun 22, 2022

mcbridematt commented Jun 26, 2022

dsalychev commented Jul 9, 2022

mcbridematt commented Jul 10, 2022

dsalychev commented Aug 1, 2022

mcbridematt commented Aug 9, 2022

dsalychev commented Aug 10, 2022

mcbridematt commented Aug 11, 2022

dsalychev commented Aug 11, 2022

dsalychev commented Mar 1, 2023

dsalychev commented Jul 28, 2023

mcbridematt commented Aug 1, 2023

dpaa2_ni_rx panic: dpaa2_ni_rx: unexpected frame buffer fd_addr != buf_paddr #8

dpaa2_ni_rx panic: dpaa2_ni_rx: unexpected frame buffer fd_addr != buf_paddr #8

Comments

mcbridematt commented May 17, 2022

dsalychev commented Jun 22, 2022

mcbridematt commented Jun 26, 2022

dsalychev commented Jul 9, 2022

mcbridematt commented Jul 10, 2022

dsalychev commented Aug 1, 2022

mcbridematt commented Aug 9, 2022

dsalychev commented Aug 10, 2022

mcbridematt commented Aug 11, 2022

dsalychev commented Aug 11, 2022

dsalychev commented Mar 1, 2023

dsalychev commented Jul 28, 2023

mcbridematt commented Aug 1, 2023