Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TBS 6280 Dual DVB-T/T2 (saa716x) #19

Closed
MikeB2013 opened this issue Mar 23, 2014 · 4 comments
Closed

TBS 6280 Dual DVB-T/T2 (saa716x) #19

MikeB2013 opened this issue Mar 23, 2014 · 4 comments

Comments

@MikeB2013
Copy link

No data received on tuning. Both adapter frontends are reported as being attached.

w_scan gives no data:
Scanning 8MHz frequencies...
474000: (time: 00:48) (time: 00:50) signal ok:
QAM_AUTO f = 474000 kHz I999B8C999D999T999G999Y999
Info: no data from NIT(actual)
474167: (time: 01:07) (time: 01:09) signal ok:
QAM_AUTO f = 474167 kHz I999B8C999D999T999G999Y999
Info: no data from NIT(actual)
473833: (time: 01:26) (time: 01:29) signal ok:
QAM_AUTO f = 473833 kHz I999B8C999D999T999G999Y999
Info: no data from NIT(actual)

w_scan with TBS proprietary drivers gives:
Scanning 8MHz frequencies...
474000: (time: 00:40) (time: 00:42) signal ok:
QAM_AUTO f = 474000 kHz I999B8C999D999T999G999Y999
new transponder:
(QAM_64 f = 722000 kHz I999B8C34D0T8G32Y0) 0x405A
new transponder:
(QAM_64 f = 690000 kHz I999B8C34D0T8G32Y0) 0x405A
new transponder:
(QAM_64 f = 522000 kHz I999B8C23D0T8G32Y0) 0x405A
new transponder:
(QAM_64 f = 498000 kHz I999B8C23D0T8G32Y0) 0x405A
new transponder:
(QAM_64 f = 714000 kHz I999B8C34D0T8G32Y0) 0x405A
474167: (time: 00:59) (time: 01:00) signal ok:
QAM_AUTO f = 474167 kHz I999B8C999D999T999G999Y999
473833: (time: 01:16) (time: 01:17) signal ok:
QAM_AUTO f = 473833 kHz I999B8C999D999T999G999Y999

Kernel version:
3.12.13-031213-lowlatency #201402221735 SMP PREEMPT Sat Feb 22 22:44:00 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

I will happily attempt to debug if you can provide some instructions.

Mike

@MikeB2013
Copy link
Author

After more investigation I have now identified the problem with TBS 6280 , it is in saa716x_budget.c

Adapter 0 should use .ts_port = 1 not port 3 at line 835
Adapter 1 should use .ts_port = 3 not port 1 at line 840

lspci for my TBS6280, just in case there are different hardware versions

06:00.0 Multimedia controller [0480]: Philips Semiconductors SAA7160 [1131:7160](rev 03)
Subsystem: Device [6280:0011]
Flags: bus master, fast devsel, latency 0, IRQ 43
Memory at fd900000 (64-bit, non-prefetchable) [size=1M]
Capabilities: [40] MSI: Enable+ Count=1/32 Maskable- 64bit+
Capabilities: [50] Express Endpoint, MSI 00
Capabilities: [74] Power Management version 2
Capabilities: [80] Vendor Specific Information: Len=50 Capabilities: [100] Vendor Specific Information: ID=0000 Rev=0 Len=088
Kernel driver in use: SAA716x Budget

Kernel version
3.12.14-031214-generic #201403111335 SMP Tue Mar 11 17:38:35 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

After rebuilding having made the changes to the .ts_port number everything now works with correct tuning for dvb-t and dvb-t2.

Mike

@ljalves
Copy link
Owner

ljalves commented Mar 26, 2014

Thanks for the report!
I've been lacking time to do updates... I'll try to do it whenever I can!

@paul-h
Copy link

paul-h commented Jun 5, 2014

I had the same problem as Mike and making the same changes to saa716x_budget.c fixed it for me also.

Thanks Mike :)

@ljalves
Copy link
Owner

ljalves commented Jun 9, 2014

Fixed.

@ljalves ljalves closed this as completed Jun 9, 2014
ljalves pushed a commit that referenced this issue Jul 17, 2014
Oleg reports a division by zero error on zero-length write() to the
percpu_pagelist_fraction sysctl:

    divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
    CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
    RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
    RSP: 0018:ffff8800d87a3e78  EFLAGS: 00010246
    RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
    RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
    R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
    R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
    FS:  00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
    Call Trace:
      proc_sys_call_handler+0xb3/0xc0
      proc_sys_write+0x14/0x20
      vfs_write+0xba/0x1e0
      SyS_write+0x46/0xb0
      tracesys+0xe1/0xe6

However, if the percpu_pagelist_fraction sysctl is set by the user, it
is also impossible to restore it to the kernel default since the user
cannot write 0 to the sysctl.

This patch allows the user to write 0 to restore the default behavior.
It still requires a fraction equal to or larger than 8, however, as
stated by the documentation for sanity.  If a value in the range [1, 7]
is written, the sysctl will return EINVAL.

This successfully solves the divide by zero issue at the same time.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Oleg Drokin <green@linuxhacker.ru>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ljalves pushed a commit that referenced this issue Jul 17, 2014
This patch tries to fix this crash:

 #5 [ffff88003c1cd690] do_invalid_op at ffffffff810166d5
 #6 [ffff88003c1cd730] invalid_op at ffffffff8159b2de
    [exception RIP: ocfs2_direct_IO_get_blocks+359]
    RIP: ffffffffa05dfa27  RSP: ffff88003c1cd7e8  RFLAGS: 00010202
    RAX: 0000000000000000  RBX: ffff88003c1cdaa8  RCX: 0000000000000000
    RDX: 000000000000000c  RSI: ffff880027a95000  RDI: ffff88003c79b540
    RBP: ffff88003c1cd858   R8: 0000000000000000   R9: ffffffff815f6ba0
    R10: 00000000000001c9  R11: 00000000000001c9  R12: ffff88002d271500
    R13: 0000000000000001  R14: 0000000000000000  R15: 0000000000001000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff88003c1cd860] do_direct_IO at ffffffff811cd31b
 #8 [ffff88003c1cd950] direct_IO_iovec at ffffffff811cde9c
 #9 [ffff88003c1cd9b0] do_blockdev_direct_IO at ffffffff811ce764
#10 [ffff88003c1cdb80] __blockdev_direct_IO at ffffffff811ce7cc
#11 [ffff88003c1cdbb0] ocfs2_direct_IO at ffffffffa05df756 [ocfs2]
#12 [ffff88003c1cdbe0] generic_file_direct_write_iter at ffffffff8112f935
#13 [ffff88003c1cdc40] ocfs2_file_write_iter at ffffffffa0600ccc [ocfs2]
#14 [ffff88003c1cdd50] do_aio_write at ffffffff8119126c
#15 [ffff88003c1cddc0] aio_rw_vect_retry at ffffffff811d9bb4
#16 [ffff88003c1cddf0] aio_run_iocb at ffffffff811db880
#17 [ffff88003c1cde30] io_submit_one at ffffffff811dc238
#18 [ffff88003c1cde80] do_io_submit at ffffffff811dc437
#19 [ffff88003c1cdf70] sys_io_submit at ffffffff811dc530
#20 [ffff88003c1cdf80] system_call_fastpath at ffffffff8159a159

It crashes at
        BUG_ON(create && (ext_flags & OCFS2_EXT_REFCOUNTED));
in ocfs2_direct_IO_get_blocks.

ocfs2_direct_IO_get_blocks is expecting the OCFS2_EXT_REFCOUNTED be removed in
ocfs2_prepare_inode_for_write() if it was there. But no cluster lock is taken
during the time before (or inside) ocfs2_prepare_inode_for_write() and after
ocfs2_direct_IO_get_blocks().

It can happen in this case:

Node A(which crashes)				Node B
------------------------                 ---------------------------
ocfs2_file_aio_write
  ocfs2_prepare_inode_for_write
    ocfs2_inode_lock
    ...
    ocfs2_inode_unlock
  #no refcount found
....					ocfs2_reflink
                                          ocfs2_inode_lock
                                          ...
                                          ocfs2_inode_unlock
                                          #now, refcount flag set on extent

                                        ...
                                        flush change to disk

ocfs2_direct_IO_get_blocks
  ocfs2_get_clusters
    #extent map miss
    #buffer_head miss
    read extents from disk
  found refcount flag on extent
  crash..

Fix:
Take rw_lock in ocfs2_reflink path

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ljalves pushed a commit that referenced this issue Nov 11, 2014
This patch wires up the new syscall sys_bpf() on powerpc.

Passes the tests in samples/bpf:

    #0 add+sub+mul OK
    #1 unreachable OK
    #2 unreachable2 OK
    #3 out of range jump OK
    #4 out of range jump2 OK
    #5 test1 ld_imm64 OK
    #6 test2 ld_imm64 OK
    #7 test3 ld_imm64 OK
    #8 test4 ld_imm64 OK
    #9 test5 ld_imm64 OK
    #10 no bpf_exit OK
    #11 loop (back-edge) OK
    #12 loop2 (back-edge) OK
    #13 conditional loop OK
    #14 read uninitialized register OK
    #15 read invalid register OK
    #16 program doesn't init R0 before exit OK
    #17 stack out of bounds OK
    #18 invalid call insn1 OK
    #19 invalid call insn2 OK
    #20 invalid function call OK
    #21 uninitialized stack1 OK
    #22 uninitialized stack2 OK
    #23 check valid spill/fill OK
    #24 check corrupted spill/fill OK
    #25 invalid src register in STX OK
    #26 invalid dst register in STX OK
    #27 invalid dst register in ST OK
    #28 invalid src register in LDX OK
    #29 invalid dst register in LDX OK
    #30 junk insn OK
    #31 junk insn2 OK
    #32 junk insn3 OK
    #33 junk insn4 OK
    #34 junk insn5 OK
    #35 misaligned read from stack OK
    #36 invalid map_fd for function call OK
    #37 don't check return value before access OK
    #38 access memory with incorrect alignment OK
    #39 sometimes access memory with incorrect alignment OK
    #40 jump test 1 OK
    #41 jump test 2 OK
    #42 jump test 3 OK
    #43 jump test 4 OK

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
[mpe: test using samples/bpf]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
crazycat69 pushed a commit to crazycat69/linux_media that referenced this issue Apr 24, 2021
BPF programs explicitly initialise global variables to 0 to make sure
clang (v10 or older) do not put the variables in the common section.  Skip
"initialise globals to 0" check for BPF programs to elimiate error
messages like:

    ERROR: do not initialise globals to 0
    ljalves#19: FILE: samples/bpf/tracex1_kern.c:21:

Link: https://lkml.kernel.org/r/20210209211954.490077-1-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
crazycat69 pushed a commit to crazycat69/linux_media that referenced this issue Apr 24, 2021
Calling btrfs_qgroup_reserve_meta_prealloc from
btrfs_delayed_inode_reserve_metadata can result in flushing delalloc
while holding a transaction and delayed node locks. This is deadlock
prone. In the past multiple commits:

 * ae5e070 ("btrfs: qgroup: don't try to wait flushing if we're
already holding a transaction")

 * 6f23277 ("btrfs: qgroup: don't commit transaction when we already
 hold the handle")

Tried to solve various aspects of this but this was always a
whack-a-mole game. Unfortunately those 2 fixes don't solve a deadlock
scenario involving btrfs_delayed_node::mutex. Namely, one thread
can call btrfs_dirty_inode as a result of reading a file and modifying
its atime:

  PID: 6963   TASK: ffff8c7f3f94c000  CPU: 2   COMMAND: "test"
  #0  __schedule at ffffffffa529e07d
  ljalves#1  schedule at ffffffffa529e4ff
  ljalves#2  schedule_timeout at ffffffffa52a1bdd
  ljalves#3  wait_for_completion at ffffffffa529eeea             <-- sleeps with delayed node mutex held
  ljalves#4  start_delalloc_inodes at ffffffffc0380db5
  ljalves#5  btrfs_start_delalloc_snapshot at ffffffffc0393836
  ljalves#6  try_flush_qgroup at ffffffffc03f04b2
  ljalves#7  __btrfs_qgroup_reserve_meta at ffffffffc03f5bb6     <-- tries to reserve space and starts delalloc inodes.
  ljalves#8  btrfs_delayed_update_inode at ffffffffc03e31aa      <-- acquires delayed node mutex
  ljalves#9  btrfs_update_inode at ffffffffc0385ba8
 ljalves#10  btrfs_dirty_inode at ffffffffc038627b               <-- TRANSACTIION OPENED
 ljalves#11  touch_atime at ffffffffa4cf0000
 ljalves#12  generic_file_read_iter at ffffffffa4c1f123
 ljalves#13  new_sync_read at ffffffffa4ccdc8a
 ljalves#14  vfs_read at ffffffffa4cd0849
 ljalves#15  ksys_read at ffffffffa4cd0bd1
 ljalves#16  do_syscall_64 at ffffffffa4a052eb
 ljalves#17  entry_SYSCALL_64_after_hwframe at ffffffffa540008c

This will cause an asynchronous work to flush the delalloc inodes to
happen which can try to acquire the same delayed_node mutex:

  PID: 455    TASK: ffff8c8085fa4000  CPU: 5   COMMAND: "kworker/u16:30"
  #0  __schedule at ffffffffa529e07d
  ljalves#1  schedule at ffffffffa529e4ff
  ljalves#2  schedule_preempt_disabled at ffffffffa529e80a
  ljalves#3  __mutex_lock at ffffffffa529fdcb                    <-- goes to sleep, never wakes up.
  ljalves#4  btrfs_delayed_update_inode at ffffffffc03e3143      <-- tries to acquire the mutex
  ljalves#5  btrfs_update_inode at ffffffffc0385ba8              <-- this is the same inode that pid 6963 is holding
  ljalves#6  cow_file_range_inline.constprop.78 at ffffffffc0386be7
  ljalves#7  cow_file_range at ffffffffc03879c1
  ljalves#8  btrfs_run_delalloc_range at ffffffffc038894c
  ljalves#9  writepage_delalloc at ffffffffc03a3c8f
 ljalves#10  __extent_writepage at ffffffffc03a4c01
 ljalves#11  extent_write_cache_pages at ffffffffc03a500b
 ljalves#12  extent_writepages at ffffffffc03a6de2
 ljalves#13  do_writepages at ffffffffa4c277eb
 ljalves#14  __filemap_fdatawrite_range at ffffffffa4c1e5bb
 ljalves#15  btrfs_run_delalloc_work at ffffffffc0380987         <-- starts running delayed nodes
 ljalves#16  normal_work_helper at ffffffffc03b706c
 ljalves#17  process_one_work at ffffffffa4aba4e4
 ljalves#18  worker_thread at ffffffffa4aba6fd
 ljalves#19  kthread at ffffffffa4ac0a3d
 ljalves#20  ret_from_fork at ffffffffa54001ff

To fully address those cases the complete fix is to never issue any
flushing while holding the transaction or the delayed node lock. This
patch achieves it by calling qgroup_reserve_meta directly which will
either succeed without flushing or will fail and return -EDQUOT. In the
latter case that return value is going to be propagated to
btrfs_dirty_inode which will fallback to start a new transaction. That's
fine as the majority of time we expect the inode will have
BTRFS_DELAYED_NODE_INODE_DIRTY flag set which will result in directly
copying the in-memory state.

Fixes: c53e965 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
crazycat69 pushed a commit to crazycat69/linux_media that referenced this issue Jul 1, 2023
When doing link mtu negotiation, a malicious peer may send Activate msg
with a very small mtu, e.g. 4 in Shuang's testing, without checking for
the minimum mtu, l->mtu will be set to 4 in tipc_link_proto_rcv(), then
n->links[bearer_id].mtu is set to 4294967228, which is a overflow of
'4 - INT_H_SIZE - EMSG_OVERHEAD' in tipc_link_mss().

With tipc_link.mtu = 4, tipc_link_xmit() kept printing the warning:

 tipc: Too large msg, purging xmit list 1 5 0 40 4!
 tipc: Too large msg, purging xmit list 1 15 0 60 4!

And with tipc_link_entry.mtu 4294967228, a huge skb was allocated in
named_distribute(), and when purging it in tipc_link_xmit(), a crash
was even caused:

  general protection fault, probably for non-canonical address 0x2100001011000dd: 0000 [ljalves#1] PREEMPT SMP PTI
  CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 6.3.0.neta ljalves#19
  RIP: 0010:kfree_skb_list_reason+0x7e/0x1f0
  Call Trace:
   <IRQ>
   skb_release_data+0xf9/0x1d0
   kfree_skb_reason+0x40/0x100
   tipc_link_xmit+0x57a/0x740 [tipc]
   tipc_node_xmit+0x16c/0x5c0 [tipc]
   tipc_named_node_up+0x27f/0x2c0 [tipc]
   tipc_node_write_unlock+0x149/0x170 [tipc]
   tipc_rcv+0x608/0x740 [tipc]
   tipc_udp_recv+0xdc/0x1f0 [tipc]
   udp_queue_rcv_one_skb+0x33e/0x620
   udp_unicast_rcv_skb.isra.72+0x75/0x90
   __udp4_lib_rcv+0x56d/0xc20
   ip_protocol_deliver_rcu+0x100/0x2d0

This patch fixes it by checking the new mtu against tipc_bearer_min_mtu(),
and not updating mtu if it is too small.

Fixes: ed193ec ("tipc: simplify link mtu negotiation")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
crazycat69 pushed a commit to crazycat69/linux_media that referenced this issue Jul 1, 2023
The cited commit adds a compeletion to remove dependency on rtnl
lock. But it causes a deadlock for multiple encapsulations:

 crash> bt ffff8aece8a64000
 PID: 1514557  TASK: ffff8aece8a64000  CPU: 3    COMMAND: "tc"
  #0 [ffffa6d14183f368] __schedule at ffffffffb8ba7f45
  ljalves#1 [ffffa6d14183f3f8] schedule at ffffffffb8ba8418
  ljalves#2 [ffffa6d14183f418] schedule_preempt_disabled at ffffffffb8ba8898
  ljalves#3 [ffffa6d14183f428] __mutex_lock at ffffffffb8baa7f8
  ljalves#4 [ffffa6d14183f4d0] mutex_lock_nested at ffffffffb8baabeb
  ljalves#5 [ffffa6d14183f4e0] mlx5e_attach_encap at ffffffffc0f48c17 [mlx5_core]
  ljalves#6 [ffffa6d14183f628] mlx5e_tc_add_fdb_flow at ffffffffc0f39680 [mlx5_core]
  ljalves#7 [ffffa6d14183f688] __mlx5e_add_fdb_flow at ffffffffc0f3b636 [mlx5_core]
  ljalves#8 [ffffa6d14183f6f0] mlx5e_tc_add_flow at ffffffffc0f3bcdf [mlx5_core]
  ljalves#9 [ffffa6d14183f728] mlx5e_configure_flower at ffffffffc0f3c1d1 [mlx5_core]
 ljalves#10 [ffffa6d14183f790] mlx5e_rep_setup_tc_cls_flower at ffffffffc0f3d529 [mlx5_core]
 ljalves#11 [ffffa6d14183f7a0] mlx5e_rep_setup_tc_cb at ffffffffc0f3d714 [mlx5_core]
 ljalves#12 [ffffa6d14183f7b0] tc_setup_cb_add at ffffffffb8931bb8
 ljalves#13 [ffffa6d14183f810] fl_hw_replace_filter at ffffffffc0dae901 [cls_flower]
 ljalves#14 [ffffa6d14183f8d8] fl_change at ffffffffc0db5c57 [cls_flower]
 ljalves#15 [ffffa6d14183f970] tc_new_tfilter at ffffffffb8936047
 ljalves#16 [ffffa6d14183fac8] rtnetlink_rcv_msg at ffffffffb88c7c31
 ljalves#17 [ffffa6d14183fb50] netlink_rcv_skb at ffffffffb8942853
 ljalves#18 [ffffa6d14183fbc0] rtnetlink_rcv at ffffffffb88c1835
 ljalves#19 [ffffa6d14183fbd0] netlink_unicast at ffffffffb8941f27
 ljalves#20 [ffffa6d14183fc18] netlink_sendmsg at ffffffffb8942245
 ljalves#21 [ffffa6d14183fc98] sock_sendmsg at ffffffffb887d482
 ljalves#22 [ffffa6d14183fcb8] ____sys_sendmsg at ffffffffb887d81a
 ljalves#23 [ffffa6d14183fd38] ___sys_sendmsg at ffffffffb88806e2
 ljalves#24 [ffffa6d14183fe90] __sys_sendmsg at ffffffffb88807a2
 ljalves#25 [ffffa6d14183ff28] __x64_sys_sendmsg at ffffffffb888080f
 ljalves#26 [ffffa6d14183ff38] do_syscall_64 at ffffffffb8b9b6a8
 ljalves#27 [ffffa6d14183ff50] entry_SYSCALL_64_after_hwframe at ffffffffb8c0007c
 crash> bt 0xffff8aeb07544000
 PID: 1110766  TASK: ffff8aeb07544000  CPU: 0    COMMAND: "kworker/u20:9"
  #0 [ffffa6d14e6b7bd8] __schedule at ffffffffb8ba7f45
  ljalves#1 [ffffa6d14e6b7c68] schedule at ffffffffb8ba8418
  ljalves#2 [ffffa6d14e6b7c88] schedule_timeout at ffffffffb8baef88
  ljalves#3 [ffffa6d14e6b7d10] wait_for_completion at ffffffffb8ba968b
  ljalves#4 [ffffa6d14e6b7d60] mlx5e_take_all_encap_flows at ffffffffc0f47ec4 [mlx5_core]
  ljalves#5 [ffffa6d14e6b7da0] mlx5e_rep_update_flows at ffffffffc0f3e734 [mlx5_core]
  ljalves#6 [ffffa6d14e6b7df8] mlx5e_rep_neigh_update at ffffffffc0f400bb [mlx5_core]
  ljalves#7 [ffffa6d14e6b7e50] process_one_work at ffffffffb80acc9c
  ljalves#8 [ffffa6d14e6b7ed0] worker_thread at ffffffffb80ad012
  ljalves#9 [ffffa6d14e6b7f10] kthread at ffffffffb80b615d
 ljalves#10 [ffffa6d14e6b7f50] ret_from_fork at ffffffffb8001b2f

After the first encap is attached, flow will be added to encap
entry's flows list. If neigh update is running at this time, the
following encaps of the flow can't hold the encap_tbl_lock and
sleep. If neigh update thread is waiting for that flow's init_done,
deadlock happens.

Fix it by holding lock outside of the for loop. If neigh update is
running, prevent encap flows from offloading. Since the lock is held
outside of the for loop, concurrent creation of encap entries is not
allowed. So remove unnecessary wait_for_completion call for res_ready.

Fixes: 95435ad ("net/mlx5e: Only access fully initialized flows in neigh update")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants