
Panic in dpaa2_ni_poll_task #4

Closed · mcbridematt opened this issue May 8, 2022 · 5 comments
Labels: bug (Something isn't working), panic (Kernel panic)

Comments

@mcbridematt

Commit: 0e7b9be

This happened to me just now. Same environment as #3.

I'm not sure if the textdump is helpful or not. I will try to set up my install to properly generate a minidump.


Fatal data abort:
  x0: ffff00011cd4c000
  x1:                0
  x2: ffff00011d094020
  x3:              152
  x4:                0
  x5:                d
  x6: 9c58ea23fe5fba08
  x7:    8ebb800fc9c58
  x8: ffff00011d094000
  x9: ffff00011d094000
 x10:                3
 x11:                3
 x12:                0
 x13:                0
 x14:                0
 x15:                8
 x16: ffff00016862ad48 (__stop_set_modmetadata_set + 448)
 x17: ffff00000059264c (if_inc_counter + 0)
 x18: ffff00010e3b1840
 x19: ffff00011cd4c000
 x20: ffff00011c9f1000
 x21: ffffa00002d4ec00
 x22: ffff00000091a4b3 (console_pausestr + 366c)
 x23: ffff0000008c3829 (do_execve.fexecv_proc_title + 26782)
 x24: ffff000000bc1000 (sccs + 8)
 x25:                1
 x26:               25
 x27: ffff000000e71000 (epoch_array + 1e00)
 x28: ffff000000e1c0e0 (thread0_st + 0)
 x29: ffff00010e3b1840
  sp: ffff00010e3b1840
  lr: ffff0000007e83d8 (dpaa2_ni_poll_task + c0)
 elr: ffff0000007e84b4 (dpaa2_ni_poll_task + 19c)
spsr:         60000045
 far:                0
 esr:         96000004
panic: vm_fault failed: ffff0000007e84b4 error 1
cpuid = 3
time = 1651998794
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x174
panic() at panic+0x44
data_abort() at data_abort+0x2f0
handle_el1h_sync() at handle_el1h_sync+0x10
--- exception, esr 0x96000004
dpaa2_ni_poll_task() at dpaa2_ni_poll_task+0x19c
taskqueue_run_locked() at taskqueue_run_locked+0x178
taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic
dsalychev added the bug (Something isn't working) and panic (Kernel panic) labels May 8, 2022
@mcbridematt
Author

I can almost consistently trigger this by running 2 x iperf3 flows through the machine simultaneously
(Two ports "client", two ports "server")

I can't get an exact backtrace from the minidump (the stack disappears due to the fault), but I believe the problem is the call to fq->consume on line 2565 in 0e0f350.
This is inside dpaa2_ni_consume_frames, which appears to have been inlined into dpaa2_ni_poll_task.

    } else if (rc == EALREADY || rc == ENOENT) {
            if (dq != NULL) {
                    fd = &dq->fdr.fd;
                    fq = (struct dpaa2_ni_fq *) dq->fdr.desc.fqd_ctx;
                    fq->consume(chan, fq, fd);
                    frames++;
            }

I can upload the vmcore / minidump if you would like.

@dsalychev

dsalychev commented May 14, 2022

fq->consume points to one of these three functions:

    case DPAA2_NI_QUEUE_TX_CONF:
            /* One queue per channel. */
            fq = &chan->txc_queue;
            fq->consume = dpaa2_ni_tx_conf;
            fq->chan = chan;
            fq->flowid = chan->flowid;
            fq->tc = 0; /* ignored */
            fq->type = queue_type;
            break;

    case DPAA2_NI_QUEUE_RX:
            KASSERT(sc->attr.num.rx_tcs <= DPAA2_NI_MAX_TCS,
                ("too many Rx traffic classes: rx_tcs=%d\n",
                sc->attr.num.rx_tcs));
            chan->rxq_n = 0;
            /* One queue per Rx traffic class within a channel. */
            for (int i = 0; i < sc->attr.num.rx_tcs; i++) {
                    fq = &chan->rx_queues[i];
                    fq->consume = dpaa2_ni_rx;
                    fq->chan = chan;
                    fq->flowid = chan->flowid;
                    fq->tc = (uint8_t) i;
                    fq->type = queue_type;
                    chan->rxq_n++;
            }
            break;

    case DPAA2_NI_QUEUE_RX_ERR:
            /* One queue per network interface. */
            fq = &sc->rxe_queue;
            fq->consume = dpaa2_ni_rx_err;
            fq->chan = chan;
            fq->flowid = 0; /* ignored */
            fq->tc = 0; /* ignored */
            fq->type = queue_type;
            break;

Could you add a debug output to understand which one was called at the moment when the kernel panicked?

@mcbridematt
Author

I have remote (serial) debug working with kgdb now, so I can inspect the variables in dpaa2_ni_consume_frames when the panic occurs.

fq is 0 (NULL), which is why this results in an exception/abort. In fact, most fields of dq->fdr (from which fq is taken) are zero:

print *dq
$2 = {{common = {verb = 96 '`', _reserved = "\003", '\000' <repeats 61 times>}, fdr = {desc = {verb = 96 '`', stat = 3 '\003', seqnum = 0, oprid = 0, _reserved = 0 '\000', tok = 0 '\000', fqid = 0, _reserved1 = 0, fq_byte_cnt = 0,
        fq_frm_cnt = 0, fqd_ctx = 0}, fd = {addr = 0, data_length = 0, bpid_ivp_bmt = 0, offset_fmt_sl = 0, frame_ctx = 0, ctrl = 0, flow_ctx = 0}}, scn = {verb = 96 '`', stat = 3 '\003', state = 0 '\000', _reserved = 0 '\000',
      rid_tok = 0, ctx = 0}}}
(kgdb) bt
#0  kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/freebsd-src/sys/kern/subr_kdb.c:506
#1  0xffff000000468ce8 in vpanic (fmt=0xffff00000091e329 "vm_fault failed: %lx error %d", ap=...) at /usr/src/freebsd-src/sys/kern/kern_shutdown.c:963
#2  0xffff000000468a58 in panic (fmt=0x12 <error: Cannot access memory at address 0x12>) at /usr/src/freebsd-src/sys/kern/kern_shutdown.c:899
#3  0xffff000000776150 in data_abort (td=0xffffa0000c807000, frame=0xffff0000f9fd66a0, esr=2516582404, far=<optimized out>, lower=0) at /usr/src/freebsd-src/sys/arm64/arm64/trap.c:364
#4  0xffff0000007750e0 in do_el1h_sync (td=0xffffa0000c807000, frame=0xffff0000f9fd66a0) at /usr/src/freebsd-src/sys/arm64/arm64/trap.c:462
#5  <signal handler called>
#6  0xffff0000007d1578 in dpaa2_ni_consume_frames (chan=0xffff000101d8e000, src=<optimized out>, consumed=<optimized out>) at /usr/src/freebsd-src/sys/dev/dpaa2/dpaa2_ni.c:2565
#7  dpaa2_ni_poll_task (arg=0xffff000101d8e000, count=<optimized out>) at /usr/src/freebsd-src/sys/dev/dpaa2/dpaa2_ni.c:2423
#8  0xffff0000004c9718 in taskqueue_run_locked (queue=queue@entry=0xffffa0000a85c300) at /usr/src/freebsd-src/sys/kern/subr_taskqueue.c:477
#9  0xffff0000004ca6c8 in taskqueue_thread_loop (arg=arg@entry=0xffff000101d86798) at /usr/src/freebsd-src/sys/kern/subr_taskqueue.c:794
#10 0xffff00000041fdf8 in fork_exit (callout=0xffff0000004ca5fc <taskqueue_thread_loop>, arg=0xffff000101d86798, frame=0xffff0000f9fd6990) at /usr/src/freebsd-src/sys/kern/kern_fork.c:1102
#11 <signal handler called>
(kgdb) frame 6
#6  0xffff0000007d1578 in dpaa2_ni_consume_frames (chan=0xffff000101d8e000, src=<optimized out>, consumed=<optimized out>) at /usr/src/freebsd-src/sys/dev/dpaa2/dpaa2_ni.c:2565
2565                                    KASSERT((
(kgdb) info locals
retries = 1
fq = 0x0
rc = 37
dq = 0xffff0001020d8000
fd = 0xffff0001020d8020
frames = <optimized out>
(kgdb) print *dq
$8 = {{common = {verb = 96 '`', _reserved = "\003", '\000' <repeats 61 times>}, fdr = {desc = {verb = 96 '`', stat = 3 '\003', seqnum = 0, oprid = 0, _reserved = 0 '\000', tok = 0 '\000', fqid = 0, _reserved1 = 0, fq_byte_cnt = 0,
        fq_frm_cnt = 0, fqd_ctx = 0}, fd = {addr = 0, data_length = 0, bpid_ivp_bmt = 0, offset_fmt_sl = 0, frame_ctx = 0, ctrl = 0, flow_ctx = 0}}, scn = {verb = 96 '`', stat = 3 '\003', state = 0 '\000', _reserved = 0 '\000',
      rid_tok = 0, ctx = 0}}}
(kgdb) print *fd
$9 = {addr = 0, data_length = 0, bpid_ivp_bmt = 0, offset_fmt_sl = 0, frame_ctx = 0, ctrl = 0, flow_ctx = 0}

I added a KASSERT((fq != 0),...) just before fq->consume to confirm this, and the assert triggered as expected:

panic: dpaa2_ni_consume_frames: fq is 0
cpuid = 7
time = 1652643273
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
kdb_backtrace() at kdb_backtrace+0x38
vpanic() at vpanic+0x17c
panic() at panic+0x44
dpaa2_ni_poll_task() at dpaa2_ni_poll_task+0x2c0
taskqueue_run_locked() at taskqueue_run_locked+0xac
taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic

(kgdb) frame 3
#3  0xffff0000007d167c in dpaa2_ni_consume_frames (chan=0xffff0000fca59000, src=<optimized out>, consumed=<optimized out>) at /usr/src/freebsd-src/sys/dev/dpaa2/dpaa2_ni.c:2565
2565                                    KASSERT((fq != 0), ("%s: fq is 0", __func__));
(kgdb) print fq
$1 = <optimized out>
(kgdb) info locals
retries = 0
fq = <optimized out>
rc = 37
dq = 0xffff0000fcda1000
fd = 0xffff0000fcda1020
frames = <optimized out>
(kgdb) print *dq
$2 = {{common = {verb = 96 '`', _reserved = "\003", '\000' <repeats 61 times>}, fdr = {desc = {verb = 96 '`', stat = 3 '\003', seqnum = 0, oprid = 0, _reserved = 0 '\000', tok = 0 '\000', fqid = 0, _reserved1 = 0, fq_byte_cnt = 0,
        fq_frm_cnt = 0, fqd_ctx = 0}, fd = {addr = 0, data_length = 0, bpid_ivp_bmt = 0, offset_fmt_sl = 0, frame_ctx = 0, ctrl = 0, flow_ctx = 0}}, scn = {verb = 96 '`', stat = 3 '\003', state = 0 '\000', _reserved = 0 '\000',
      rid_tok = 0, ctx = 0}}}
(kgdb) print *fd
$3 = {addr = 0, data_length = 0, bpid_ivp_bmt = 0, offset_fmt_sl = 0, frame_ctx = 0, ctrl = 0, flow_ctx = 0}

@dsalychev

Hmm, verb = 96 (0x60) means Frame Dequeue Response, but stat = 3 means that the FQ is empty and no frames were delivered for this Volatile Dequeue command. It's just a null response I hadn't checked for. Please try 173aa2a.

@mcbridematt
Author

I think that fixed it! I've run the 4-port iperf stress test for several long intervals, including a 10-hour one that just completed.

dpni0  1.5K <Link#1>      00:0a:fa:24:24:a3        0     0     0        42G        0     0          0     0
dpni1  1.5K <Link#2>      00:0a:fa:24:24:a4        0     0     0        43G        0     0          0     0
dpni2  1.5K <Link#3>      00:0a:fa:24:24:a5        0     0     0       2.4T        0     0          0     0
dpni3  1.5K <Link#4>      00:0a:fa:24:24:a6        0     0     0       2.4T        0     0          0     0
bridg  1.5K <Link#12>     58:9c:fc:10:0f:76     1.4G     0     0        85G     3.6G     0       4.9T     0

The next panic I have come across is #8, but that appears to be more sporadic.
