-
Notifications
You must be signed in to change notification settings - Fork 137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lib/mpi: Use "static inline" instead of "inline" #2
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The commit 2e7dc31 had replaced "extern inline" with "inline" in order to fix the build with GCC 5, because of new inline semantics. However, this change breaks the build with GCC 4.9, causing many errors such as: lib/mpi/generic_mpih-mul1.o: In function `mpihelp_add_1': generic_mpih-mul1.c:(.text+0x0): multiple definition of `mpihelp_add_1' lib/mpi/generic_mpih-lshift.o:generic_mpih-lshift.c:(.text+0x0): first defined here Fix this like in the mainline commit 9c6bd0c, by using "static inline". Signed-off-by: Benoît Thébaudeau <benoit@wsystem.com>
numbqq
added a commit
that referenced
this pull request
Jun 1, 2018
watchdog: kvims: update watchdog driver
numbqq
pushed a commit
that referenced
this pull request
Oct 10, 2018
This patch fixes the following warning: [ 2.226264] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620 [ 2.226341] in_atomic(): 0, irqs_disabled(): 0, pid: 1, name: swapper/0 [ 2.226366] 4 locks held by swapper/0/1: [ 2.226385] #0: (&dev->mutex){......}, at: [<ffffff800851c664>] __device_attach+0x3c/0x134 [ 2.226515] #1: (cpu_hotplug.lock){......}, at: [<ffffff800809feb0>] get_online_cpus+0x38/0x9c [ 2.226594] #2: (subsys mutex#7){......}, at: [<ffffff800851b0e4>] subsys_interface_register+0x54/0xfc [ 2.226684] #3: (rcu_read_lock){......}, at: [<ffffff800845c764>] rockchip_adjust_power_scale+0x88/0x450 [ 2.226771] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 4.4.132 #600 [ 2.226790] Hardware name: Rockchip PX30 evb ddr3 board (DT) [ 2.226809] Call trace: [ 2.226840] [<ffffff800808a0d4>] dump_backtrace+0x0/0x1ec [ 2.226889] [<ffffff800808a2d4>] show_stack+0x14/0x1c [ 2.226918] [<ffffff80083c41b4>] dump_stack+0x94/0xbc [ 2.226946] [<ffffff80080cb708>] ___might_sleep+0x108/0x118 [ 2.226972] [<ffffff80080cb788>] __might_sleep+0x70/0x80 [ 2.227001] [<ffffff8008b6d1d0>] mutex_lock_nested+0x54/0x38c [ 2.227031] [<ffffff8008858714>] __of_clk_get_from_provider+0x44/0xec [ 2.227061] [<ffffff8008852814>] __of_clk_get_by_name+0xd4/0x14c [ 2.227112] [<ffffff80088528a4>] of_clk_get_by_name+0x18/0x28 [ 2.227140] [<ffffff800845c980>] rockchip_adjust_power_scale+0x2a4/0x450 [ 2.227171] [<ffffff800877f630>] cpufreq_init+0x154/0x394 [ 2.227199] [<ffffff8008776ac4>] cpufreq_online+0x1b0/0x68c [ 2.227225] [<ffffff8008777040>] cpufreq_add_dev+0x3c/0x94 [ 2.227253] [<ffffff800851b168>] subsys_interface_register+0xd8/0xfc [ 2.227281] [<ffffff80087772d8>] cpufreq_register_driver+0x10c/0x1a8 [ 2.227332] [<ffffff800877f93c>] dt_cpufreq_probe+0xcc/0xe8 [ 2.227360] [<ffffff800851e7b8>] platform_drv_probe+0x54/0xa8 [ 2.227385] [<ffffff800851c924>] driver_probe_device+0x194/0x278 [ 2.227411] [<ffffff800851cb40>] __device_attach_driver+0x60/0x9c [ 2.227439] [<ffffff800851ae14>] bus_for_each_drv+0x9c/0xbc [ 2.227464] [<ffffff800851c6f0>] __device_attach+0xc8/0x134 [ 2.227488] [<ffffff800851ccb8>] device_initial_probe+0x10/0x18 [ 2.227514] [<ffffff800851bcec>] bus_probe_device+0x2c/0x90 [ 2.227541] [<ffffff8008519e90>] device_add+0x44c/0x510 [ 2.227570] [<ffffff800851e4d4>] platform_device_add+0xa0/0x1e4 [ 2.227597] [<ffffff800851ef28>] platform_device_register_full+0xa4/0xe4 [ 2.227627] [<ffffff80090f8680>] rockchip_cpufreq_driver_init+0xd4/0x31c [ 2.227654] [<ffffff8008083468>] do_one_initcall+0x84/0x1a4 [ 2.227682] [<ffffff80090c0e98>] kernel_init_freeable+0x260/0x264 [ 2.227708] [<ffffff8008b6ab68>] kernel_init+0x10/0xf8 [ 2.227734] [<ffffff80080832a0>] ret_from_fork+0x10/0x30 Fixes: 197a2d3 ("soc: rockchip: opp_select: add missing rcu lock") Change-Id: Ib52b6f7bc77bb207a1c8c6880f7a5e916fa3d2ee Signed-off-by: Finley Xiao <finley.xiao@rock-chips.com>
numbqq
pushed a commit
that referenced
this pull request
Oct 10, 2018
While testing the deadline scheduler + cgroup setup I hit this warning. [ 132.612935] ------------[ cut here ]------------ [ 132.612951] WARNING: CPU: 5 PID: 0 at kernel/softirq.c:150 __local_bh_enable_ip+0x6b/0x80 [ 132.612952] Modules linked in: (a ton of modules...) [ 132.612981] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc2 #2 [ 132.612981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014 [ 132.612982] 0000000000000086 45c8bb5effdd088b ffff88013fd43da0 ffffffff813d229e [ 132.612984] 0000000000000000 0000000000000000 ffff88013fd43de0 ffffffff810a652b [ 132.612985] 00000096811387b5 0000000000000200 ffff8800bab29d80 ffff880034c54c00 [ 132.612986] Call Trace: [ 132.612987] <IRQ> [<ffffffff813d229e>] dump_stack+0x63/0x85 [ 132.612994] [<ffffffff810a652b>] __warn+0xcb/0xf0 [ 132.612997] [<ffffffff810e76a0>] ? push_dl_task.part.32+0x170/0x170 [ 132.612999] [<ffffffff810a665d>] warn_slowpath_null+0x1d/0x20 [ 132.613000] [<ffffffff810aba5b>] __local_bh_enable_ip+0x6b/0x80 [ 132.613008] [<ffffffff817d6c8a>] _raw_write_unlock_bh+0x1a/0x20 [ 132.613010] [<ffffffff817d6c9e>] _raw_spin_unlock_bh+0xe/0x10 [ 132.613015] [<ffffffff811388ac>] put_css_set+0x5c/0x60 [ 132.613016] [<ffffffff8113dc7f>] cgroup_free+0x7f/0xa0 [ 132.613017] [<ffffffff810a3912>] __put_task_struct+0x42/0x140 [ 132.613018] [<ffffffff810e776a>] dl_task_timer+0xca/0x250 [ 132.613027] [<ffffffff810e76a0>] ? push_dl_task.part.32+0x170/0x170 [ 132.613030] [<ffffffff8111371e>] __hrtimer_run_queues+0xee/0x270 [ 132.613031] [<ffffffff81113ec8>] hrtimer_interrupt+0xa8/0x190 [ 132.613034] [<ffffffff81051a58>] local_apic_timer_interrupt+0x38/0x60 [ 132.613035] [<ffffffff817d9b0d>] smp_apic_timer_interrupt+0x3d/0x50 [ 132.613037] [<ffffffff817d7c5c>] apic_timer_interrupt+0x8c/0xa0 [ 132.613038] <EOI> [<ffffffff81063466>] ? native_safe_halt+0x6/0x10 [ 132.613043] [<ffffffff81037a4e>] default_idle+0x1e/0xd0 [ 132.613044] [<ffffffff810381cf>] arch_cpu_idle+0xf/0x20 [ 132.613046] [<ffffffff810e8fda>] default_idle_call+0x2a/0x40 [ 132.613047] [<ffffffff810e92d7>] cpu_startup_entry+0x2e7/0x340 [ 132.613048] [<ffffffff81050235>] start_secondary+0x155/0x190 [ 132.613049] ---[ end trace f91934d162ce9977 ]--- The warn is the spin_(lock|unlock)_bh(&css_set_lock) in the interrupt context. Converting the spin_lock_bh to spin_lock_irq(save) to avoid this problem - and other problems of sharing a spinlock with an interrupt. Change-Id: I5b5d5c79c3f380ac35f58596fc2cebaf6348eb67 Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: cgroups@vger.kernel.org Cc: stable@vger.kernel.org # 4.5+ Cc: linux-kernel@vger.kernel.org Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
numbqq
pushed a commit
that referenced
this pull request
Oct 10, 2018
[ Upstream commit fba9eb7 ] Add a header with macros usable in assembler files to emit alternative code sequences. It works analog to the alternatives for inline assmeblies in C files, with the same restrictions and capabilities. The syntax is ALTERNATIVE "<default instructions sequence>", \ "<alternative instructions sequence>", \ "<features-bit>" and ALTERNATIVE_2 "<default instructions sequence>", \ "<alternative instructions sqeuence #1>", \ "<feature-bit #1>", "<alternative instructions sqeuence #2>", \ "<feature-bit #2>" Reviewed-by: Vasily Gorbik <gor@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
numbqq
pushed a commit
that referenced
this pull request
Oct 10, 2018
[ Upstream commit 2c0aa08 ] Scenario: 1. Port down and do fail over 2. Ap do rds_bind syscall PID: 47039 TASK: ffff89887e2fe640 CPU: 47 COMMAND: "kworker/u:6" #0 [ffff898e35f159f0] machine_kexec at ffffffff8103abf9 #1 [ffff898e35f15a60] crash_kexec at ffffffff810b96e3 #2 [ffff898e35f15b30] oops_end at ffffffff8150f518 #3 [ffff898e35f15b60] no_context at ffffffff8104854c #4 [ffff898e35f15ba0] __bad_area_nosemaphore at ffffffff81048675 #5 [ffff898e35f15bf0] bad_area_nosemaphore at ffffffff810487d3 #6 [ffff898e35f15c00] do_page_fault at ffffffff815120b8 #7 [ffff898e35f15d10] page_fault at ffffffff8150ea95 [exception RIP: unknown or invalid address] RIP: 0000000000000000 RSP: ffff898e35f15dc8 RFLAGS: 00010282 RAX: 00000000fffffffe RBX: ffff889b77f6fc00 RCX:ffffffff81c99d88 RDX: 0000000000000000 RSI: ffff896019ee08e8 RDI:ffff889b77f6fc00 RBP: ffff898e35f15df0 R8: ffff896019ee08c8 R9:0000000000000000 R10: 0000000000000400 R11: 0000000000000000 R12:ffff896019ee08c0 R13: ffff889b77f6fe68 R14: ffffffff81c99d80 R15: ffffffffa022a1e0 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #8 [ffff898e35f15dc8] cma_ndev_work_handler at ffffffffa022a228 [rdma_cm] #9 [ffff898e35f15df8] process_one_work at ffffffff8108a7c6 #10 [ffff898e35f15e58] worker_thread at ffffffff8108bda0 #11 [ffff898e35f15ee8] kthread at ffffffff81090fe6 PID: 45659 TASK: ffff880d313d2500 CPU: 31 COMMAND: "oracle_45659_ap" #0 [ffff881024ccfc98] __schedule at ffffffff8150bac4 #1 [ffff881024ccfd40] schedule at ffffffff8150c2cf #2 [ffff881024ccfd50] __mutex_lock_slowpath at ffffffff8150cee7 #3 [ffff881024ccfdc0] mutex_lock at ffffffff8150cdeb #4 [ffff881024ccfde0] rdma_destroy_id at ffffffffa022a027 [rdma_cm] #5 [ffff881024ccfe10] rds_ib_laddr_check at ffffffffa0357857 [rds_rdma] #6 [ffff881024ccfe50] rds_trans_get_preferred at ffffffffa0324c2a [rds] #7 [ffff881024ccfe80] rds_bind at ffffffffa031d690 [rds] #8 [ffff881024ccfeb0] sys_bind at ffffffff8142a670 PID: 45659 PID: 47039 rds_ib_laddr_check /* create id_priv with a null event_handler */ rdma_create_id rdma_bind_addr cma_acquire_dev /* add id_priv to cma_dev->id_list */ cma_attach_to_dev cma_ndev_work_handler /* event_hanlder is null */ id_priv->id.event_handler Signed-off-by: Guanglei Li <guanglei.li@oracle.com> Signed-off-by: Honglei Wang <honglei.wang@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Yanjun Zhu <yanjun.zhu@oracle.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Acked-by: Doug Ledford <dledford@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
numbqq
pushed a commit
that referenced
this pull request
Oct 10, 2018
[ Upstream commit b6dd4d8 ] The pr_debug() in gic-v3 gic_send_sgi() can trigger a circular locking warning: GICv3: CPU10: ICC_SGI1R_EL1 5000400 ====================================================== WARNING: possible circular locking dependency detected 4.15.0+ #1 Tainted: G W ------------------------------------------------------ dynamic_debug01/1873 is trying to acquire lock: ((console_sem).lock){-...}, at: [<0000000099c891ec>] down_trylock+0x20/0x4c but task is already holding lock: (&rq->lock){-.-.}, at: [<00000000842e1587>] __task_rq_lock+0x54/0xdc which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&rq->lock){-.-.}: __lock_acquire+0x3b4/0x6e0 lock_acquire+0xf4/0x2a8 _raw_spin_lock+0x4c/0x60 task_fork_fair+0x3c/0x148 sched_fork+0x10c/0x214 copy_process.isra.32.part.33+0x4e8/0x14f0 _do_fork+0xe8/0x78c kernel_thread+0x48/0x54 rest_init+0x34/0x2a4 start_kernel+0x45c/0x488 -> #1 (&p->pi_lock){-.-.}: __lock_acquire+0x3b4/0x6e0 lock_acquire+0xf4/0x2a8 _raw_spin_lock_irqsave+0x58/0x70 try_to_wake_up+0x48/0x600 wake_up_process+0x28/0x34 __up.isra.0+0x60/0x6c up+0x60/0x68 __up_console_sem+0x4c/0x7c console_unlock+0x328/0x634 vprintk_emit+0x25c/0x390 dev_vprintk_emit+0xc4/0x1fc dev_printk_emit+0x88/0xa8 __dev_printk+0x58/0x9c _dev_info+0x84/0xa8 usb_new_device+0x100/0x474 hub_port_connect+0x280/0x92c hub_event+0x740/0xa84 process_one_work+0x240/0x70c worker_thread+0x60/0x400 kthread+0x110/0x13c ret_from_fork+0x10/0x18 -> #0 ((console_sem).lock){-...}: validate_chain.isra.34+0x6e4/0xa20 __lock_acquire+0x3b4/0x6e0 lock_acquire+0xf4/0x2a8 _raw_spin_lock_irqsave+0x58/0x70 down_trylock+0x20/0x4c __down_trylock_console_sem+0x3c/0x9c console_trylock+0x20/0xb0 vprintk_emit+0x254/0x390 vprintk_default+0x58/0x90 vprintk_func+0xbc/0x164 printk+0x80/0xa0 __dynamic_pr_debug+0x84/0xac gic_raise_softirq+0x184/0x18c smp_cross_call+0xac/0x218 smp_send_reschedule+0x3c/0x48 resched_curr+0x60/0x9c check_preempt_curr+0x70/0xdc wake_up_new_task+0x310/0x470 _do_fork+0x188/0x78c SyS_clone+0x44/0x50 __sys_trace_return+0x0/0x4 other info that might help us debug this: Chain exists of: (console_sem).lock --> &p->pi_lock --> &rq->lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&rq->lock); lock(&p->pi_lock); lock(&rq->lock); lock((console_sem).lock); *** DEADLOCK *** 2 locks held by dynamic_debug01/1873: #0: (&p->pi_lock){-.-.}, at: [<000000001366df53>] wake_up_new_task+0x40/0x470 #1: (&rq->lock){-.-.}, at: [<00000000842e1587>] __task_rq_lock+0x54/0xdc stack backtrace: CPU: 10 PID: 1873 Comm: dynamic_debug01 Tainted: G W 4.15.0+ #1 Hardware name: GIGABYTE R120-T34-00/MT30-GS2-00, BIOS T48 10/02/2017 Call trace: dump_backtrace+0x0/0x188 show_stack+0x24/0x2c dump_stack+0xa4/0xe0 print_circular_bug.isra.31+0x29c/0x2b8 check_prev_add.constprop.39+0x6c8/0x6dc validate_chain.isra.34+0x6e4/0xa20 __lock_acquire+0x3b4/0x6e0 lock_acquire+0xf4/0x2a8 _raw_spin_lock_irqsave+0x58/0x70 down_trylock+0x20/0x4c __down_trylock_console_sem+0x3c/0x9c console_trylock+0x20/0xb0 vprintk_emit+0x254/0x390 vprintk_default+0x58/0x90 vprintk_func+0xbc/0x164 printk+0x80/0xa0 __dynamic_pr_debug+0x84/0xac gic_raise_softirq+0x184/0x18c smp_cross_call+0xac/0x218 smp_send_reschedule+0x3c/0x48 resched_curr+0x60/0x9c check_preempt_curr+0x70/0xdc wake_up_new_task+0x310/0x470 _do_fork+0x188/0x78c SyS_clone+0x44/0x50 __sys_trace_return+0x0/0x4 GICv3: CPU0: ICC_SGI1R_EL1 12000 This could be fixed with printk_deferred() but that might lessen its usefulness for debugging. So change it to pr_devel to keep it out of production kernels. Developers working on gic-v3 can enable it as needed in their kernels. Signed-off-by: Mark Salter <msalter@redhat.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
numbqq
pushed a commit
that referenced
this pull request
Oct 10, 2018
[ Upstream commit 2bbea6e ] when mounting an ISO filesystem sometimes (very rarely) the system hangs because of a race condition between two tasks. PID: 6766 TASK: ffff88007b2a6dd0 CPU: 0 COMMAND: "mount" #0 [ffff880078447ae0] __schedule at ffffffff8168d605 #1 [ffff880078447b48] schedule_preempt_disabled at ffffffff8168ed49 #2 [ffff880078447b58] __mutex_lock_slowpath at ffffffff8168c995 #3 [ffff880078447bb8] mutex_lock at ffffffff8168bdef #4 [ffff880078447bd0] sr_block_ioctl at ffffffffa00b6818 [sr_mod] #5 [ffff880078447c10] blkdev_ioctl at ffffffff812fea50 #6 [ffff880078447c70] ioctl_by_bdev at ffffffff8123a8b3 #7 [ffff880078447c90] isofs_fill_super at ffffffffa04fb1e1 [isofs] #8 [ffff880078447da8] mount_bdev at ffffffff81202570 #9 [ffff880078447e18] isofs_mount at ffffffffa04f9828 [isofs] #10 [ffff880078447e28] mount_fs at ffffffff81202d09 #11 [ffff880078447e70] vfs_kern_mount at ffffffff8121ea8f #12 [ffff880078447ea8] do_mount at ffffffff81220fee #13 [ffff880078447f28] sys_mount at ffffffff812218d6 #14 [ffff880078447f80] system_call_fastpath at ffffffff81698c49 RIP: 00007fd9ea914e9a RSP: 00007ffd5d9bf648 RFLAGS: 00010246 RAX: 00000000000000a5 RBX: ffffffff81698c49 RCX: 0000000000000010 RDX: 00007fd9ec2bc210 RSI: 00007fd9ec2bc290 RDI: 00007fd9ec2bcf30 RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000010 R10: 00000000c0ed0001 R11: 0000000000000206 R12: 00007fd9ec2bc040 R13: 00007fd9eb6b2380 R14: 00007fd9ec2bc210 R15: 00007fd9ec2bcf30 ORIG_RAX: 00000000000000a5 CS: 0033 SS: 002b This task was trying to mount the cdrom. It allocated and configured a super_block struct and owned the write-lock for the super_block->s_umount rwsem. While exclusively owning the s_umount lock, it called sr_block_ioctl and waited to acquire the global sr_mutex lock. PID: 6785 TASK: ffff880078720fb0 CPU: 0 COMMAND: "systemd-udevd" #0 [ffff880078417898] __schedule at ffffffff8168d605 #1 [ffff880078417900] schedule at ffffffff8168dc59 #2 [ffff880078417910] rwsem_down_read_failed at ffffffff8168f605 #3 [ffff880078417980] call_rwsem_down_read_failed at ffffffff81328838 #4 [ffff8800784179d0] down_read at ffffffff8168cde0 #5 [ffff8800784179e8] get_super at ffffffff81201cc7 #6 [ffff880078417a10] __invalidate_device at ffffffff8123a8de #7 [ffff880078417a40] flush_disk at ffffffff8123a94b #8 [ffff880078417a88] check_disk_change at ffffffff8123ab50 #9 [ffff880078417ab0] cdrom_open at ffffffffa00a29e1 [cdrom] #10 [ffff880078417b68] sr_block_open at ffffffffa00b6f9b [sr_mod] #11 [ffff880078417b98] __blkdev_get at ffffffff8123ba86 #12 [ffff880078417bf0] blkdev_get at ffffffff8123bd65 #13 [ffff880078417c78] blkdev_open at ffffffff8123bf9b #14 [ffff880078417c90] do_dentry_open at ffffffff811fc7f7 #15 [ffff880078417cd8] vfs_open at ffffffff811fc9cf #16 [ffff880078417d00] do_last at ffffffff8120d53d #17 [ffff880078417db0] path_openat at ffffffff8120e6b2 #18 [ffff880078417e48] do_filp_open at ffffffff8121082b #19 [ffff880078417f18] do_sys_open at ffffffff811fdd33 #20 [ffff880078417f70] sys_open at ffffffff811fde4e #21 [ffff880078417f80] system_call_fastpath at ffffffff81698c49 RIP: 00007f29438b0c20 RSP: 00007ffc76624b78 RFLAGS: 00010246 RAX: 0000000000000002 RBX: ffffffff81698c49 RCX: 0000000000000000 RDX: 00007f2944a5fa70 RSI: 00000000000a0800 RDI: 00007f2944a5fa70 RBP: 00007f2944a5f540 R8: 0000000000000000 R9: 0000000000000020 R10: 00007f2943614c40 R11: 0000000000000246 R12: ffffffff811fde4e R13: ffff880078417f78 R14: 000000000000000c R15: 00007f2944a4b010 ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b This task tried to open the cdrom device, the sr_block_open function acquired the global sr_mutex lock. The call to check_disk_change() then saw an event flag indicating a possible media change and tried to flush any cached data for the device. As part of the flush, it tried to acquire the super_block->s_umount lock associated with the cdrom device. This was the same super_block as created and locked by the previous task. The first task acquires the s_umount lock and then the sr_mutex_lock; the second task acquires the sr_mutex_lock and then the s_umount lock. This patch fixes the issue by moving check_disk_change() out of cdrom_open() and let the caller take care of it. Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
numbqq
pushed a commit
that referenced
this pull request
Oct 10, 2018
[ Upstream commit a3ca831 ] When booting up with "threadirqs" in command line, all irq handlers of the DMA controller pl330 will be threaded forcedly. These threads will race for the same list, pl330->req_done. Before the callback, the spinlock was released. And after it, the spinlock was taken. This opened an race window where another threaded irq handler could steal the spinlock and be permitted to delete entries of the list, pl330->req_done. If the later deleted an entry that was still referred to by the former, there would be a kernel panic when the former was scheduled and tried to get the next sibling of the deleted entry. The scenario could be depicted as below: Thread: T1 pl330->req_done Thread: T2 | | | | -A-B-C-D- | Locked | | | | Waiting Del A | | | -B-C-D- | Unlocked | | | | Locked Waiting | | | | Del B | | | | -C-D- Unlocked Waiting | | | Locked | get C via B \ - Kernel panic The kernel panic looked like as below: Unable to handle kernel paging request at virtual address dead000000000108 pgd = ffffff8008c9e000 [dead000000000108] *pgd=000000027fffe003, *pud=000000027fffe003, *pmd=0000000000000000 Internal error: Oops: 96000044 [#1] PREEMPT SMP Modules linked in: CPU: 0 PID: 85 Comm: irq/59-66330000 Not tainted 4.8.24-WR9.0.0.12_standard #2 Hardware name: Broadcom NS2 SVK (DT) task: ffffffc1f5cc3c00 task.stack: ffffffc1f5ce0000 PC is at pl330_irq_handler+0x27c/0x390 LR is at pl330_irq_handler+0x2a8/0x390 pc : [<ffffff80084cb694>] lr : [<ffffff80084cb6c0>] pstate: 800001c5 sp : ffffffc1f5ce3d00 x29: ffffffc1f5ce3d00 x28: 0000000000000140 x27: ffffffc1f5c530b0 x26: dead000000000100 x25: dead000000000200 x24: 0000000000418958 x23: 0000000000000001 x22: ffffffc1f5ccd668 x21: ffffffc1f5ccd590 x20: ffffffc1f5ccd418 x19: dead000000000060 x18: 0000000000000001 x17: 0000000000000007 x16: 0000000000000001 x15: ffffffffffffffff x14: ffffffffffffffff x13: ffffffffffffffff x12: 0000000000000000 x11: 0000000000000001 x10: 0000000000000840 x9 : ffffffc1f5ce0000 x8 : ffffffc1f5cc3338 x7 : ffffff8008ce2020 x6 : 0000000000000000 x5 : 0000000000000000 x4 : 0000000000000001 x3 : dead000000000200 x2 : dead000000000100 x1 : 0000000000000140 x0 : ffffffc1f5ccd590 Process irq/59-66330000 (pid: 85, stack limit = 0xffffffc1f5ce0020) Stack: (0xffffffc1f5ce3d00 to 0xffffffc1f5ce4000) 3d00: ffffffc1f5ce3d80 ffffff80080f09d0 ffffffc1f5ca0c00 ffffffc1f6f7c600 3d20: ffffffc1f5ce0000 ffffffc1f6f7c600 ffffffc1f5ca0c00 ffffff80080f0998 3d40: ffffffc1f5ce0000 ffffff80080f0000 0000000000000000 0000000000000000 3d60: ffffff8008ce202c ffffff8008ce2020 ffffffc1f5ccd668 ffffffc1f5c530b0 3d80: ffffffc1f5ce3db0 ffffff80080f0d70 ffffffc1f5ca0c40 0000000000000001 3da0: ffffffc1f5ce0000 ffffff80080f0cfc ffffffc1f5ce3e20 ffffff80080bf4f8 3dc0: ffffffc1f5ca0c80 ffffff8008bf3798 ffffff8008955528 ffffffc1f5ca0c00 3de0: ffffff80080f0c30 0000000000000000 0000000000000000 0000000000000000 3e00: 0000000000000000 0000000000000000 0000000000000000 ffffff80080f0b68 3e20: 0000000000000000 ffffff8008083690 ffffff80080bf420 ffffffc1f5ca0c80 3e40: 0000000000000000 0000000000000000 0000000000000000 ffffff80080cb648 3e60: ffffff8008b1c780 0000000000000000 0000000000000000 ffffffc1f5ca0c00 3e80: ffffffc100000000 ffffff8000000000 ffffffc1f5ce3e90 ffffffc1f5ce3e90 3ea0: 0000000000000000 ffffff8000000000 ffffffc1f5ce3eb0 ffffffc1f5ce3eb0 3ec0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 3ee0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 3f00: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 3f20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 3f40: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 3f60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 3f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 3fa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 3fc0: 0000000000000000 0000000000000005 0000000000000000 0000000000000000 3fe0: 0000000000000000 0000000000000000 0000000275ce3ff0 0000000275ce3ff8 Call trace: Exception stack(0xffffffc1f5ce3b30 to 0xffffffc1f5ce3c60) 3b20: dead000000000060 0000008000000000 3b40: ffffffc1f5ce3d00 ffffff80084cb694 0000000000000008 0000000000000e88 3b60: ffffffc1f5ce3bb0 ffffff80080dac68 ffffffc1f5ce3b90 ffffff8008826fe4 3b80: 00000000000001c0 00000000000001c0 ffffffc1f5ce3bb0 ffffff800848dfcc 3ba0: 0000000000020000 ffffff8008b15ae4 ffffffc1f5ce3c00 ffffff800808f000 3bc0: 0000000000000010 ffffff80088377f0 ffffffc1f5ccd590 0000000000000140 3be0: dead000000000100 dead000000000200 0000000000000001 0000000000000000 3c00: 0000000000000000 ffffff8008ce2020 ffffffc1f5cc3338 ffffffc1f5ce0000 3c20: 0000000000000840 0000000000000001 0000000000000000 ffffffffffffffff 3c40: ffffffffffffffff ffffffffffffffff 0000000000000001 0000000000000007 [<ffffff80084cb694>] pl330_irq_handler+0x27c/0x390 [<ffffff80080f09d0>] irq_forced_thread_fn+0x38/0x88 [<ffffff80080f0d70>] irq_thread+0x140/0x200 [<ffffff80080bf4f8>] kthread+0xd8/0xf0 [<ffffff8008083690>] ret_from_fork+0x10/0x40 Code: f2a00838 f9405763 aa1c03e1 aa1503e0 (f9000443) ---[ end trace f50005726d31199c ]--- Kernel panic - not syncing: Fatal exception in interrupt SMP: stopping secondary CPUs SMP: failed to stop secondary CPUs 0-1 Kernel Offset: disabled Memory Limit: none ---[ end Kernel panic - not syncing: Fatal exception in interrupt To fix this, re-start with the list-head after dropping the lock then re-takeing it. Reviewed-by: Frank Mori Hess <fmh6jj@gmail.com> Tested-by: Frank Mori Hess <fmh6jj@gmail.com> Signed-off-by: Qi Hou <qi.hou@windriver.com> Signed-off-by: Vinod Koul <vinod.koul@intel.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
numbqq
pushed a commit
that referenced
this pull request
Oct 10, 2018
commit df30781 upstream. For problem determination we need to see whether and why we were successful or not. This allows deduction of scsi_eh escalation. Example trace record formatted with zfcpdbf from s390-tools: Timestamp : ... Area : SCSI Subarea : 00 Level : 1 Exception : - CPU ID : .. Caller : 0x... Record ID : 1 Tag : schrh_r SCSI host reset handler result Request ID : 0x0000000000000000 none (invalid) SCSI ID : 0xffffffff none (invalid) SCSI LUN : 0xffffffff none (invalid) SCSI LUN high : 0xffffffff none (invalid) SCSI result : 0x00002002 field re-used for midlayer value: SUCCESS or in other cases: 0x2009 == FAST_IO_FAIL SCSI retries : 0xff none (invalid) SCSI allowed : 0xff none (invalid) SCSI scribble : 0xffffffffffffffff none (invalid) SCSI opcode : ffffffff ffffffff ffffffff ffffffff none (invalid) FCP rsp inf cod: 0xff none (invalid) FCP rsp IU : 00000000 00000000 00000000 00000000 none (invalid) 00000000 00000000 v2.6.35 commit a1dbfdd ("[SCSI] zfcp: Pass return code from fc_block_scsi_eh to scsi eh") introduced the first return with something other than the previously hardcoded single SUCCESS return path. Signed-off-by: Steffen Maier <maier@linux.ibm.com> Fixes: a1dbfdd ("[SCSI] zfcp: Pass return code from fc_block_scsi_eh to scsi eh") Cc: <stable@vger.kernel.org> #2.6.38+ Reviewed-by: Jens Remus <jremus@linux.ibm.com> Reviewed-by: Benjamin Block <bblock@linux.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
numbqq
pushed a commit
that referenced
this pull request
Oct 10, 2018
commit 81979ae upstream. We already have a SCSI trace for the end of abort and scsi_eh TMF. Due to zfcp_erp_wait() and fc_block_scsi_eh() time can pass between the start of our eh callback and an actual send/recv of an abort / TMF request. In order to see the temporal sequence including any abort / TMF send retries, add a trace before the above two blocking functions. This supports problem determination with scsi_eh and parallel zfcp ERP. No need to explicitly trace the beginning of our eh callback, since we typically can send an abort / TMF and see its HBA response (in the worst case, it's a pseudo response on dismiss all of adapter recovery, e.g. due to an FSF request timeout [fsrth_1] of the abort / TMF). If we cannot send, we now get a trace record for the first "abrt_wt" or "[lt]r_wait" which denotes almost the beginning of the callback. No need to explicitly trace the wakeup after the above two blocking functions because the next retry loop causes another trace in any case and that is sufficient. Example trace records formatted with zfcpdbf from s390-tools: Timestamp : ... Area : SCSI Subarea : 00 Level : 1 Exception : - CPU ID : .. Caller : 0x... Record ID : 1 Tag : abrt_wt abort, before zfcp_erp_wait() Request ID : 0x0000000000000000 none (invalid) SCSI ID : 0x<scsi_id> SCSI LUN : 0x<scsi_lun> SCSI LUN high : 0x<scsi_lun_high> SCSI result : 0x<scsi_result_of_cmd_to_be_aborted> SCSI retries : 0x<retries_of_cmd_to_be_aborted> SCSI allowed : 0x<allowed_retries_of_cmd_to_be_aborted> SCSI scribble : 0x<req_id_of_cmd_to_be_aborted> SCSI opcode : <CDB_of_cmd_to_be_aborted> FCP rsp inf cod: 0x.. none (invalid) FCP rsp IU : ... none (invalid) Timestamp : ... Area : SCSI Subarea : 00 Level : 1 Exception : - CPU ID : .. Caller : 0x... Record ID : 1 Tag : lr_wait LUN reset, before zfcp_erp_wait() Request ID : 0x0000000000000000 none (invalid) SCSI ID : 0x<scsi_id> SCSI LUN : 0x<scsi_lun> SCSI LUN high : 0x<scsi_lun_high> SCSI result : 0x... unrelated SCSI retries : 0x.. unrelated SCSI allowed : 0x.. unrelated SCSI scribble : 0x... unrelated SCSI opcode : ... unrelated FCP rsp inf cod: 0x.. none (invalid) FCP rsp IU : ... none (invalid) Signed-off-by: Steffen Maier <maier@linux.ibm.com> Fixes: 63caf36 ("[SCSI] zfcp: Improve reliability of SCSI eh handlers in zfcp") Fixes: af4de36 ("[SCSI] zfcp: Block scsi_eh thread for rport state BLOCKED") Cc: <stable@vger.kernel.org> #2.6.38+ Reviewed-by: Benjamin Block <bblock@linux.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
numbqq
pushed a commit
that referenced
this pull request
Oct 10, 2018
…ailed commit 512857a upstream. If a SCSI device is deleted during scsi_eh host reset, we cannot get a reference to the SCSI device anymore since scsi_device_get returns !=0 by design. Assuming the recovery of adapter and port(s) was successful, zfcp_erp_strategy_followup_success() attempts to trigger a LUN reset for the half-gone SCSI device. Unfortunately, it causes the following confusing trace record which states that zfcp will do a LUN recovery as "ERP need" is ZFCP_ERP_ACTION_REOPEN_LUN == 1 and equals "ERP want". Old example trace record formatted with zfcpdbf from s390-tools: Tag: : ersfs_3 ERP, trigger, unit reopen, port reopen succeeded LUN : 0x<FCP_LUN> WWPN : 0x<WWPN> D_ID : 0x<N_Port-ID> Adapter status : 0x5400050b Port status : 0x54000001 LUN status : 0x40000000 ZFCP_STATUS_COMMON_RUNNING but not ZFCP_STATUS_COMMON_UNBLOCKED as it was closed on close part of adapter reopen ERP want : 0x01 ERP need : 0x01 misleading However, zfcp_erp_setup_act() returns NULL as it cannot get the reference. Hence, zfcp_erp_action_enqueue() takes an early goto out and _NO_ recovery actually happens. We always do want the recovery trigger trace record even if no erp_action could be enqueued as in this case. For other cases where we did not enqueue an erp_action, 'need' has always been zero to indicate this. In order to indicate above goto out, introduce an eyecatcher "flag" to mark the "ERP need" as 'not needed' but still keep the information which erp_action type, that zfcp_erp_required_act() had decided upon, is needed. 0xc_ is chosen to be visibly different from 0x0_ in "ERP want". New example trace record formatted with zfcpdbf from s390-tools: Tag: : ersfs_3 ERP, trigger, unit reopen, port reopen succeeded LUN : 0x<FCP_LUN> WWPN : 0x<WWPN> D_ID : 0x<N_Port-ID> Adapter status : 0x5400050b Port status : 0x54000001 LUN status : 0x40000000 ERP want : 0x01 ERP need : 0xc1 would need LUN ERP, but no action set up ^ Before v2.6.38 commit ae0904f ("[SCSI] zfcp: Redesign of the debug tracing for recovery actions.") we could detect this case because the "erp_action" field in the trace was NULL. The rework removed erp_action as argument and field from the trace. This patch here is for tracing. A fix to allow LUN recovery in the case at hand is a topic for a separate patch. See also commit fdbd1c5 ("[SCSI] zfcp: Allow running unit/LUN shutdown without acquiring reference") for a similar case and background info. Signed-off-by: Steffen Maier <maier@linux.ibm.com> Fixes: ae0904f ("[SCSI] zfcp: Redesign of the debug tracing for recovery actions.") Cc: <stable@vger.kernel.org> #2.6.38+ Reviewed-by: Benjamin Block <bblock@linux.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
numbqq
pushed a commit
that referenced
this pull request
Oct 10, 2018
… return commit 96d9270 upstream. get_device() and its internally used kobject_get() only return NULL if they get passed NULL as argument. zfcp_get_port_by_wwpn() loops over adapter->port_list so the iteration variable port is always non-NULL. Struct device is embedded in struct zfcp_port so &port->dev is always non-NULL. This is the argument to get_device(). However, if we get an fc_rport in terminate_rport_io() for which we cannot find a match within zfcp_get_port_by_wwpn(), the latter can return NULL. v2.6.30 commit 7093293 ("[SCSI] zfcp: Fix oops when port disappears") introduced an early return without adding a trace record for this case. Even if we don't need recovery in this case, for debugging we should still see that our callback was invoked originally by scsi_transport_fc. Example trace record formatted with zfcpdbf from s390-tools: Timestamp : ... Area : REC Subarea : 00 Level : 1 Exception : - CPU ID : .. Caller : 0x... Record ID : 1 Tag : sctrpin SCSI terminate rport I/O, no zfcp port LUN : 0xffffffffffffffff none (invalid) WWPN : 0x<wwpn> WWPN D_ID : 0x<n_port_id> N_Port-ID Adapter status : 0x... Port status : 0xffffffff unknown (-1) LUN status : 0x00000000 none (invalid) Ready count : 0x... Running count : 0x... ERP want : 0x03 ZFCP_ERP_ACTION_REOPEN_PORT_FORCED ERP need : 0xc0 ZFCP_ERP_ACTION_NONE Signed-off-by: Steffen Maier <maier@linux.ibm.com> Fixes: 7093293 ("[SCSI] zfcp: Fix oops when port disappears") Cc: <stable@vger.kernel.org> #2.6.38+ Reviewed-by: Benjamin Block <bblock@linux.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
numbqq
pushed a commit
that referenced
this pull request
Oct 10, 2018
…RP_FAILED commit d70aab5 upstream. For problem determination we always want to see when we were invoked on the terminate_rport_io callback whether we perform something or not. Temporal event sequence of interest with a long fast_io_fail_tmo of 27 sec: loose remote port t workqueue [s] zfcp_q_<dev> IRQ zfcperp<dev> === ================== =================== ============================ 0 recv RSCN q p.test_link_work block rport start fast_io_fail_tmo send ADISC ELS 4 recv ADISC fail block zfcp_port port forced reopen send open port 12 recv open port fail q p.gid_pn_work zfcp_erp_wakeup (zfcp_erp_wait would return) GID_PN fail Before this point, we got a SCSI trace with tag "sctrpi1" on fast_io_fail, e.g. with the typical 5 sec setting. port.status |= ERP_FAILED If fast_io_fail_tmo triggers after this point, we missed a SCSI trace. workqueue fc_dl_<host> ================== 27 fc_timeout_fail_rport_io fc_terminate_rport_io zfcp_scsi_terminate_rport_io zfcp_erp_port_forced_reopen _zfcp_erp_port_forced_reopen if (port.status & ERP_FAILED) return; Therefore, write a trace before above early return. Example trace record formatted with zfcpdbf from s390-tools: Timestamp : ... Area : REC Subarea : 00 Level : 1 Exception : - CPU ID : .. Caller : 0x... Record ID : 1 ZFCP_DBF_REC_TRIG Tag : sctrpi1 SCSI terminate rport I/O LUN : 0xffffffffffffffff none (invalid) WWPN : 0x<wwpn> D_ID : 0x<n_port_id> Adapter status : 0x... Port status : 0x... LUN status : 0x00000000 none (invalid) Ready count : 0x... Running count : 0x... ERP want : 0x03 ZFCP_ERP_ACTION_REOPEN_PORT_FORCED ERP need : 0xe0 ZFCP_ERP_ACTION_FAILED Signed-off-by: Steffen Maier <maier@linux.ibm.com> Cc: <stable@vger.kernel.org> #2.6.38+ Reviewed-by: Benjamin Block <bblock@linux.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
numbqq
pushed a commit
that referenced
this pull request
Oct 10, 2018
commit 8c3d20a upstream. That other commit introduced an inconsistency because it would trace on ERP_FAILED for all callers of port forced reopen triggers (not just terminate_rport_io), but it would not trace on ERP_FAILED for all callers of other ERP triggers such as adapter, port regular, LUN. Therefore, generalize that other commit. zfcp_erp_action_enqueue() already had two early outs which re-used the one zfcp_dbf_rec_trig() call. All ERP trigger functions finally run through zfcp_erp_action_enqueue(). So move the special handling for ZFCP_STATUS_COMMON_ERP_FAILED into zfcp_erp_action_enqueue() and add another early out with new trace marker for pseudo ERP need in this case. This removes all early returns from all ERP trigger functions so we always end up at zfcp_dbf_rec_trig(). Example trace record formatted with zfcpdbf from s390-tools: Timestamp : ... Area : REC Subarea : 00 Level : 1 Exception : - CPU ID : .. Caller : 0x... Record ID : 1 ZFCP_DBF_REC_TRIG Tag : ....... LUN : 0x... WWPN : 0x... D_ID : 0x... Adapter status : 0x... Port status : 0x... LUN status : 0x... Ready count : 0x... Running count : 0x... ERP want : 0x0. ZFCP_ERP_ACTION_REOPEN_... ERP need : 0xe0 ZFCP_ERP_ACTION_FAILED Signed-off-by: Steffen Maier <maier@linux.ibm.com> Cc: <stable@vger.kernel.org> #2.6.38+ Reviewed-by: Benjamin Block <bblock@linux.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
numbqq
pushed a commit
that referenced
this pull request
Oct 10, 2018
commit 6a76550 upstream. Example trace record formatted with zfcpdbf from s390-tools: Timestamp : ... Area : REC Subarea : 00 Level : 1 Exception : - CPU ID : .. Caller : 0x... Record ID : 1 ZFCP_DBF_REC_TRIG Tag : ....... LUN : 0x... WWPN : 0x... D_ID : 0x... Adapter status : 0x... Port status : 0x... LUN status : 0x... Ready count : 0x... Running count : 0x... ERP want : 0x0. ZFCP_ERP_ACTION_REOPEN_... ERP need : 0xc0 ZFCP_ERP_ACTION_NONE Signed-off-by: Steffen Maier <maier@linux.ibm.com> Cc: <stable@vger.kernel.org> #2.6.38+ Reviewed-by: Benjamin Block <bblock@linux.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
numbqq
pushed a commit
that referenced
this pull request
Oct 10, 2018
commit b63e132 upstream. The current MIPS implementation of arch_trigger_cpumask_backtrace() is broken because it attempts to use synchronous IPIs despite the fact that it may be run with interrupts disabled. This means that when arch_trigger_cpumask_backtrace() is invoked, for example by the RCU CPU stall watchdog, we may: - Deadlock due to use of synchronous IPIs with interrupts disabled, causing the CPU that's attempting to generate the backtrace output to hang itself. - Not succeed in generating the desired output from remote CPUs. - Produce warnings about this from smp_call_function_many(), for example: [42760.526910] INFO: rcu_sched detected stalls on CPUs/tasks: [42760.535755] 0-...!: (1 GPs behind) idle=ade/140000000000000/0 softirq=526944/526945 fqs=0 [42760.547874] 1-...!: (0 ticks this GP) idle=e4a/140000000000000/0 softirq=547885/547885 fqs=0 [42760.559869] (detected by 2, t=2162 jiffies, g=266689, c=266688, q=33) [42760.568927] ------------[ cut here ]------------ [42760.576146] WARNING: CPU: 2 PID: 1216 at kernel/smp.c:416 smp_call_function_many+0x88/0x20c [42760.587839] Modules linked in: [42760.593152] CPU: 2 PID: 1216 Comm: sh Not tainted 4.15.4-00373-gee058bb4d0c2 #2 [42760.603767] Stack : 8e09bd20 8e09bd20 8e09bd20 fffffff0 00000007 00000006 00000000 8e09bca8 [42760.616937] 95b2b379 95b2b379 807a0080 00000007 81944518 0000018a 00000032 00000000 [42760.630095] 00000000 00000030 80000000 00000000 806eca74 00000009 8017e2b8 000001a0 [42760.643169] 00000000 00000002 00000000 8e09baa4 00000008 808b8008 86d69080 8e09bca0 [42760.656282] 8e09ad50 805e20aa 00000000 00000000 00000000 8017e2b8 00000009 801070ca [42760.669424] ... [42760.673919] Call Trace: [42760.678672] [<27fde568>] show_stack+0x70/0xf0 [42760.685417] [<84751641>] dump_stack+0xaa/0xd0 [42760.692188] [<699d671c>] __warn+0x80/0x92 [42760.698549] [<68915d41>] warn_slowpath_null+0x28/0x36 [42760.705912] [<f7c76c1c>] smp_call_function_many+0x88/0x20c [42760.713696] [<6bbdfc2a>] arch_trigger_cpumask_backtrace+0x30/0x4a [42760.722216] [<f845bd33>] rcu_dump_cpu_stacks+0x6a/0x98 [42760.729580] [<796e7629>] rcu_check_callbacks+0x672/0x6ac [42760.737476] [<059b3b43>] update_process_times+0x18/0x34 [42760.744981] [<6eb94941>] tick_sched_handle.isra.5+0x26/0x38 [42760.752793] [<478d3d70>] tick_sched_timer+0x1c/0x50 [42760.759882] [<e56ea39f>] __hrtimer_run_queues+0xc6/0x226 [42760.767418] [<e88bbcae>] hrtimer_interrupt+0x88/0x19a [42760.775031] [<6765a19e>] gic_compare_interrupt+0x2e/0x3a [42760.782761] [<0558bf5f>] handle_percpu_devid_irq+0x78/0x168 [42760.790795] [<90c11ba2>] generic_handle_irq+0x1e/0x2c [42760.798117] [<1b6d462c>] gic_handle_local_int+0x38/0x86 [42760.805545] [<b2ada1c7>] gic_irq_dispatch+0xa/0x14 [42760.812534] [<90c11ba2>] generic_handle_irq+0x1e/0x2c [42760.820086] [<c7521934>] do_IRQ+0x16/0x20 [42760.826274] [<9aef3ce6>] plat_irq_dispatch+0x62/0x94 [42760.833458] [<6a94b53c>] except_vec_vi_end+0x70/0x78 [42760.840655] [<22284043>] smp_call_function_many+0x1ba/0x20c [42760.848501] [<54022b58>] smp_call_function+0x1e/0x2c [42760.855693] [<ab9fc705>] flush_tlb_mm+0x2a/0x98 [42760.862730] [<0844cdd0>] tlb_flush_mmu+0x1c/0x44 [42760.869628] [<cb259b74>] arch_tlb_finish_mmu+0x26/0x3e [42760.877021] [<1aeaaf74>] tlb_finish_mmu+0x18/0x66 [42760.883907] [<b3fce717>] exit_mmap+0x76/0xea [42760.890428] [<c4c8a2f6>] mmput+0x80/0x11a [42760.896632] [<a41a08f4>] do_exit+0x1f4/0x80c [42760.903158] [<ee01cef6>] do_group_exit+0x20/0x7e [42760.909990] [<13fa8d54>] __wake_up_parent+0x0/0x1e [42760.917045] [<46cf89d0>] smp_call_function_many+0x1a2/0x20c [42760.924893] [<8c21a93b>] syscall_common+0x14/0x1c [42760.931765] ---[ end trace 02aa09da9dc52a60 ]--- [42760.938342] ------------[ cut here ]------------ [42760.945311] WARNING: CPU: 2 PID: 1216 at kernel/smp.c:291 smp_call_function_single+0xee/0xf8 ... This patch switches MIPS' arch_trigger_cpumask_backtrace() to use async IPIs & smp_call_function_single_async() in order to resolve this problem. We ensure use of the pre-allocated call_single_data_t structures is serialized by maintaining a cpumask indicating that they're busy, and refusing to attempt to send an IPI when a CPU's bit is set in this mask. This should only happen if a CPU hasn't responded to a previous backtrace IPI - ie. if it's hung - and we print a warning to the console in this case. I've marked this for stable branches as far back as v4.9, to which it applies cleanly. Strictly speaking the faulty MIPS implementation can be traced further back to commit 856839b ("MIPS: Add arch_trigger_all_cpu_backtrace() function") in v3.19, but kernel versions v3.19 through v4.8 will require further work to backport due to the rework performed in commit 9a01c3e ("nmi_backtrace: add more trigger_*_cpu_backtrace() methods"). Signed-off-by: Paul Burton <paul.burton@mips.com> Patchwork: https://patchwork.linux-mips.org/patch/19597/ Cc: James Hogan <jhogan@kernel.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Huacai Chen <chenhc@lemote.com> Cc: linux-mips@linux-mips.org Cc: stable@vger.kernel.org # v4.9+ Fixes: 856839b ("MIPS: Add arch_trigger_all_cpu_backtrace() function") Fixes: 9a01c3e ("nmi_backtrace: add more trigger_*_cpu_backtrace() methods") [ Huacai: backported to 4.4: Restruction since generic NMI solution is unavailable ] Signed-off-by: Huacai Chen <chenhc@lemote.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
numbqq
pushed a commit
that referenced
this pull request
Oct 10, 2018
commit c604cb7 upstream. My recent fix for dns_resolver_preparse() printing very long strings was incomplete, as shown by syzbot which still managed to hit the WARN_ONCE() in set_precision() by adding a crafted "dns_resolver" key: precision 50001 too large WARNING: CPU: 7 PID: 864 at lib/vsprintf.c:2164 vsnprintf+0x48a/0x5a0 The bug this time isn't just a printing bug, but also a logical error when multiple options ("#"-separated strings) are given in the key payload. Specifically, when separating an option string into name and value, if there is no value then the name is incorrectly considered to end at the end of the key payload, rather than the end of the current option. This bypasses validation of the option length, and also means that specifying multiple options is broken -- which presumably has gone unnoticed as there is currently only one valid option anyway. A similar problem also applied to option values, as the kstrtoul() when parsing the "dnserror" option will read past the end of the current option and into the next option. Fix these bugs by correctly computing the length of the option name and by copying the option value, null-terminated, into a temporary buffer. Reproducer for the WARN_ONCE() that syzbot hit: perl -e 'print "#A#", "\0" x 50000' | keyctl padd dns_resolver desc @s Reproducer for "dnserror" option being parsed incorrectly (expected behavior is to fail when seeing the unknown option "foo", actual behavior was to read the dnserror value as "1#foo" and fail there): perl -e 'print "#dnserror=1#foo\0"' | keyctl padd dns_resolver desc @s Reported-by: syzbot <syzkaller@googlegroups.com> Fixes: 4a2d789 ("DNS: If the DNS server returns an error, allow that to be cached [ver #2]") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
numbqq
pushed a commit
that referenced
this pull request
Oct 10, 2018
commit 89da619 upstream. Kernel panic when with high memory pressure, calltrace looks like, PID: 21439 TASK: ffff881be3afedd0 CPU: 16 COMMAND: "java" #0 [ffff881ec7ed7630] machine_kexec at ffffffff81059beb #1 [ffff881ec7ed7690] __crash_kexec at ffffffff81105942 #2 [ffff881ec7ed7760] crash_kexec at ffffffff81105a30 #3 [ffff881ec7ed7778] oops_end at ffffffff816902c8 #4 [ffff881ec7ed77a0] no_context at ffffffff8167ff46 #5 [ffff881ec7ed77f0] __bad_area_nosemaphore at ffffffff8167ffdc #6 [ffff881ec7ed7838] __node_set at ffffffff81680300 #7 [ffff881ec7ed7860] __do_page_fault at ffffffff8169320f #8 [ffff881ec7ed78c0] do_page_fault at ffffffff816932b5 #9 [ffff881ec7ed78f0] page_fault at ffffffff8168f4c8 [exception RIP: _raw_spin_lock_irqsave+47] RIP: ffffffff8168edef RSP: ffff881ec7ed79a8 RFLAGS: 00010046 RAX: 0000000000000246 RBX: ffffea0019740d00 RCX: ffff881ec7ed7fd8 RDX: 0000000000020000 RSI: 0000000000000016 RDI: 0000000000000008 RBP: ffff881ec7ed79a8 R8: 0000000000000246 R9: 000000000001a098 R10: ffff88107ffda000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000008 R14: ffff881ec7ed7a80 R15: ffff881be3afedd0 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 It happens in the pagefault and results in double pagefault during compacting pages when memory allocation fails. Analysed the vmcore, the page leads to second pagefault is corrupted with _mapcount=-256, but private=0. It's caused by the race between migration and ballooning, and lock missing in virtballoon_migratepage() of virtio_balloon driver. This patch fix the bug. Fixes: e225042 ("virtio_balloon: introduce migration primitives to balloon pages") Cc: stable@vger.kernel.org Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn> Signed-off-by: Huang Chong <huang.chong@zte.com.cn> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
numbqq
pushed a commit
that referenced
this pull request
Oct 10, 2018
commit 4e1a720 upstream. slub debug reported: [ 440.648642] ============================================================================= [ 440.648649] BUG kmalloc-1024 (Tainted: G BU O ): Poison overwritten [ 440.648651] ----------------------------------------------------------------------------- [ 440.648655] INFO: 0xe70f4bec-0xe70f4bec. First byte 0x6a instead of 0x6b [ 440.648665] INFO: Allocated in sk_prot_alloc+0x6b/0xc6 age=33155 cpu=1 pid=1047 [ 440.648671] ___slab_alloc.constprop.24+0x1fc/0x292 [ 440.648675] __slab_alloc.isra.18.constprop.23+0x1c/0x25 [ 440.648677] __kmalloc+0xb6/0x17f [ 440.648680] sk_prot_alloc+0x6b/0xc6 [ 440.648683] sk_alloc+0x1e/0xa1 [ 440.648700] sco_sock_alloc.constprop.6+0x26/0xaf [bluetooth] [ 440.648716] sco_connect_cfm+0x166/0x281 [bluetooth] [ 440.648731] hci_conn_request_evt.isra.53+0x258/0x281 [bluetooth] [ 440.648746] hci_event_packet+0x28b/0x2326 [bluetooth] [ 440.648759] hci_rx_work+0x161/0x291 [bluetooth] [ 440.648764] process_one_work+0x163/0x2b2 [ 440.648767] worker_thread+0x1a9/0x25c [ 440.648770] kthread+0xf8/0xfd [ 440.648774] ret_from_fork+0x2e/0x38 [ 440.648779] INFO: Freed in __sk_destruct+0xd3/0xdf age=3815 cpu=1 pid=1047 [ 440.648782] __slab_free+0x4b/0x27a [ 440.648784] kfree+0x12e/0x155 [ 440.648787] __sk_destruct+0xd3/0xdf [ 440.648790] sk_destruct+0x27/0x29 [ 440.648793] __sk_free+0x75/0x91 [ 440.648795] sk_free+0x1c/0x1e [ 440.648810] sco_sock_kill+0x5a/0x5f [bluetooth] [ 440.648825] sco_conn_del+0x8e/0xba [bluetooth] [ 440.648840] sco_disconn_cfm+0x3a/0x41 [bluetooth] [ 440.648855] hci_event_packet+0x45e/0x2326 [bluetooth] [ 440.648868] hci_rx_work+0x161/0x291 [bluetooth] [ 440.648872] process_one_work+0x163/0x2b2 [ 440.648875] worker_thread+0x1a9/0x25c [ 440.648877] kthread+0xf8/0xfd [ 440.648880] ret_from_fork+0x2e/0x38 [ 440.648884] INFO: Slab 0xf4718580 objects=27 used=27 fp=0x (null) flags=0x40008100 [ 440.648886] INFO: Object 0xe70f4b88 @offset=19336 fp=0xe70f54f8 When KASAN was enabled, it reported: [ 210.096613] ================================================================== [ 210.096634] BUG: KASAN: use-after-free in ex_handler_refcount+0x5b/0x127 [ 210.096641] Write of size 4 at addr ffff880107e17160 by task kworker/u9:1/2040 [ 210.096651] CPU: 1 PID: 2040 Comm: kworker/u9:1 Tainted: G U O 4.14.47-20180606+ #2 [ 210.096654] Hardware name: , BIOS 2017.01-00087-g43e04de 08/30/2017 [ 210.096693] Workqueue: hci0 hci_rx_work [bluetooth] [ 210.096698] Call Trace: [ 210.096711] dump_stack+0x46/0x59 [ 210.096722] print_address_description+0x6b/0x23b [ 210.096729] ? ex_handler_refcount+0x5b/0x127 [ 210.096736] kasan_report+0x220/0x246 [ 210.096744] ex_handler_refcount+0x5b/0x127 [ 210.096751] ? ex_handler_clear_fs+0x85/0x85 [ 210.096757] fixup_exception+0x8c/0x96 [ 210.096766] do_trap+0x66/0x2c1 [ 210.096773] do_error_trap+0x152/0x180 [ 210.096781] ? fixup_bug+0x78/0x78 [ 210.096817] ? hci_debugfs_create_conn+0x244/0x26a [bluetooth] [ 210.096824] ? __schedule+0x113b/0x1453 [ 210.096830] ? sysctl_net_exit+0xe/0xe [ 210.096837] ? __wake_up_common+0x343/0x343 [ 210.096843] ? insert_work+0x107/0x163 [ 210.096850] invalid_op+0x1b/0x40 [ 210.096888] RIP: 0010:hci_debugfs_create_conn+0x244/0x26a [bluetooth] [ 210.096892] RSP: 0018:ffff880094a0f970 EFLAGS: 00010296 [ 210.096898] RAX: 0000000000000000 RBX: ffff880107e170e8 RCX: ffff880107e17160 [ 210.096902] RDX: 000000000000002f RSI: ffff88013b80ed40 RDI: ffffffffa058b940 [ 210.096906] RBP: ffff88011b2b0578 R08: 00000000852f0ec9 R09: ffffffff81cfcf9b [ 210.096909] R10: 00000000d21bdad7 R11: 0000000000000001 R12: ffff8800967b0488 [ 210.096913] R13: ffff880107e17168 R14: 0000000000000068 R15: ffff8800949c0008 [ 210.096920] ? __sk_destruct+0x2c6/0x2d4 [ 210.096959] hci_event_packet+0xff5/0x7de2 [bluetooth] [ 210.096969] ? __local_bh_enable_ip+0x43/0x5b [ 210.097004] ? l2cap_sock_recv_cb+0x158/0x166 [bluetooth] [ 210.097039] ? hci_le_meta_evt+0x2bb3/0x2bb3 [bluetooth] [ 210.097075] ? l2cap_ertm_init+0x94e/0x94e [bluetooth] [ 210.097093] ? xhci_urb_enqueue+0xbd8/0xcf5 [xhci_hcd] [ 210.097102] ? __accumulate_pelt_segments+0x24/0x33 [ 210.097109] ? __accumulate_pelt_segments+0x24/0x33 [ 210.097115] ? __update_load_avg_se.isra.2+0x217/0x3a4 [ 210.097122] ? set_next_entity+0x7c3/0x12cd [ 210.097128] ? pick_next_entity+0x25e/0x26c [ 210.097135] ? pick_next_task_fair+0x2ca/0xc1a [ 210.097141] ? switch_mm_irqs_off+0x346/0xb4f [ 210.097147] ? __switch_to+0x769/0xbc4 [ 210.097153] ? compat_start_thread+0x66/0x66 [ 210.097188] ? hci_conn_check_link_mode+0x1cd/0x1cd [bluetooth] [ 210.097195] ? finish_task_switch+0x392/0x431 [ 210.097228] ? hci_rx_work+0x154/0x487 [bluetooth] [ 210.097260] hci_rx_work+0x154/0x487 [bluetooth] [ 210.097269] process_one_work+0x579/0x9e9 [ 210.097277] worker_thread+0x68f/0x804 [ 210.097285] kthread+0x31c/0x32b [ 210.097292] ? rescuer_thread+0x70c/0x70c [ 210.097299] ? kthread_create_on_node+0xa3/0xa3 [ 210.097306] ret_from_fork+0x35/0x40 [ 210.097314] Allocated by task 2040: [ 210.097323] kasan_kmalloc.part.1+0x51/0xc7 [ 210.097328] __kmalloc+0x17f/0x1b6 [ 210.097335] sk_prot_alloc+0xf2/0x1a3 [ 210.097340] sk_alloc+0x22/0x297 [ 210.097375] sco_sock_alloc.constprop.7+0x23/0x202 [bluetooth] [ 210.097410] sco_connect_cfm+0x2d0/0x566 [bluetooth] [ 210.097443] hci_conn_request_evt.isra.53+0x6d3/0x762 [bluetooth] [ 210.097476] hci_event_packet+0x85e/0x7de2 [bluetooth] [ 210.097507] hci_rx_work+0x154/0x487 [bluetooth] [ 210.097512] process_one_work+0x579/0x9e9 [ 210.097517] worker_thread+0x68f/0x804 [ 210.097523] kthread+0x31c/0x32b [ 210.097529] ret_from_fork+0x35/0x40 [ 210.097533] Freed by task 2040: [ 210.097539] kasan_slab_free+0xb3/0x15e [ 210.097544] kfree+0x103/0x1a9 [ 210.097549] __sk_destruct+0x2c6/0x2d4 [ 210.097584] sco_conn_del.isra.1+0xba/0x10e [bluetooth] [ 210.097617] hci_event_packet+0xff5/0x7de2 [bluetooth] [ 210.097648] hci_rx_work+0x154/0x487 [bluetooth] [ 210.097653] process_one_work+0x579/0x9e9 [ 210.097658] worker_thread+0x68f/0x804 [ 210.097663] kthread+0x31c/0x32b [ 210.097670] ret_from_fork+0x35/0x40 [ 210.097676] The buggy address belongs to the object at ffff880107e170e8 which belongs to the cache kmalloc-1024 of size 1024 [ 210.097681] The buggy address is located 120 bytes inside of 1024-byte region [ffff880107e170e8, ffff880107e174e8) [ 210.097683] The buggy address belongs to the page: [ 210.097689] page:ffffea00041f8400 count:1 mapcount:0 mapping: (null) index:0xffff880107e15b68 compound_mapcount: 0 [ 210.110194] flags: 0x8000000000008100(slab|head) [ 210.115441] raw: 8000000000008100 0000000000000000 ffff880107e15b68 0000000100170016 [ 210.115448] raw: ffffea0004a47620 ffffea0004b48e20 ffff88013b80ed40 0000000000000000 [ 210.115451] page dumped because: kasan: bad access detected [ 210.115454] Memory state around the buggy address: [ 210.115460] ffff880107e17000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 210.115465] ffff880107e17080: fc fc fc fc fc fc fc fc fc fc fc fc fc fb fb fb [ 210.115469] >ffff880107e17100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 210.115472] ^ [ 210.115477] ffff880107e17180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 210.115481] ffff880107e17200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 210.115483] ================================================================== And finally when BT_DBG() and ftrace was enabled it showed: <...>-14979 [001] .... 186.104191: sco_sock_kill <-sco_sock_close <...>-14979 [001] .... 186.104191: sco_sock_kill <-sco_sock_release <...>-14979 [001] .... 186.104192: sco_sock_kill: sk ef0497a0 state 9 <...>-14979 [001] .... 186.104193: bt_sock_unlink <-sco_sock_kill kworker/u9:2-792 [001] .... 186.104246: sco_sock_kill <-sco_conn_del kworker/u9:2-792 [001] .... 186.104248: sco_sock_kill: sk ef0497a0 state 9 kworker/u9:2-792 [001] .... 186.104249: bt_sock_unlink <-sco_sock_kill kworker/u9:2-792 [001] .... 186.104250: sco_sock_destruct <-__sk_destruct kworker/u9:2-792 [001] .... 186.104250: sco_sock_destruct: sk ef0497a0 kworker/u9:2-792 [001] .... 186.104860: hci_conn_del <-hci_event_packet kworker/u9:2-792 [001] .... 186.104864: hci_conn_del: hci0 hcon ef0484c0 handle 266 Only in the failed case, sco_sock_kill() gets called with the same sock pointer two times. Add a check for SOCK_DEAD to avoid continue killing a socket which has already been killed. Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
numbqq
pushed a commit
that referenced
this pull request
Oct 10, 2018
commit ccd5b32 upstream. The following commit: 39a0526 ("x86/mm: Factor out LDT init from context init") renamed init_new_context() to init_new_context_ldt() and added a new init_new_context() which calls init_new_context_ldt(). However, the error code of init_new_context_ldt() was ignored. Consequently, if a memory allocation in alloc_ldt_struct() failed during a fork(), the ->context.ldt of the new task remained the same as that of the old task (due to the memcpy() in dup_mm()). ldt_struct's are not intended to be shared, so a use-after-free occurred after one task exited. Fix the bug by making init_new_context() pass through the error code of init_new_context_ldt(). This bug was found by syzkaller, which encountered the following splat: BUG: KASAN: use-after-free in free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116 Read of size 4 at addr ffff88006d2cb7c8 by task kworker/u9:0/3710 CPU: 1 PID: 3710 Comm: kworker/u9:0 Not tainted 4.13.0-rc4-next-20170811 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:52 print_address_description+0x73/0x250 mm/kasan/report.c:252 kasan_report_error mm/kasan/report.c:351 [inline] kasan_report+0x24e/0x340 mm/kasan/report.c:409 __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429 free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116 free_ldt_struct arch/x86/kernel/ldt.c:173 [inline] destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171 destroy_context arch/x86/include/asm/mmu_context.h:157 [inline] __mmdrop+0xe9/0x530 kernel/fork.c:889 mmdrop include/linux/sched/mm.h:42 [inline] exec_mmap fs/exec.c:1061 [inline] flush_old_exec+0x173c/0x1ff0 fs/exec.c:1291 load_elf_binary+0x81f/0x4ba0 fs/binfmt_elf.c:855 search_binary_handler+0x142/0x6b0 fs/exec.c:1652 exec_binprm fs/exec.c:1694 [inline] do_execveat_common.isra.33+0x1746/0x22e0 fs/exec.c:1816 do_execve+0x31/0x40 fs/exec.c:1860 call_usermodehelper_exec_async+0x457/0x8f0 kernel/umh.c:100 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431 Allocated by task 3700: save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59 save_stack+0x43/0xd0 mm/kasan/kasan.c:447 set_track mm/kasan/kasan.c:459 [inline] kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551 kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3627 kmalloc include/linux/slab.h:493 [inline] alloc_ldt_struct+0x52/0x140 arch/x86/kernel/ldt.c:67 write_ldt+0x7b7/0xab0 arch/x86/kernel/ldt.c:277 sys_modify_ldt+0x1ef/0x240 arch/x86/kernel/ldt.c:307 entry_SYSCALL_64_fastpath+0x1f/0xbe Freed by task 3700: save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59 save_stack+0x43/0xd0 mm/kasan/kasan.c:447 set_track mm/kasan/kasan.c:459 [inline] kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524 __cache_free mm/slab.c:3503 [inline] kfree+0xca/0x250 mm/slab.c:3820 free_ldt_struct.part.2+0xdd/0x150 arch/x86/kernel/ldt.c:121 free_ldt_struct arch/x86/kernel/ldt.c:173 [inline] destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171 destroy_context arch/x86/include/asm/mmu_context.h:157 [inline] __mmdrop+0xe9/0x530 kernel/fork.c:889 mmdrop include/linux/sched/mm.h:42 [inline] __mmput kernel/fork.c:916 [inline] mmput+0x541/0x6e0 kernel/fork.c:927 copy_process.part.36+0x22e1/0x4af0 kernel/fork.c:1931 copy_process kernel/fork.c:1546 [inline] _do_fork+0x1ef/0xfb0 kernel/fork.c:2025 SYSC_clone kernel/fork.c:2135 [inline] SyS_clone+0x37/0x50 kernel/fork.c:2129 do_syscall_64+0x26c/0x8c0 arch/x86/entry/common.c:287 return_from_SYSCALL_64+0x0/0x7a Here is a C reproducer: #include <asm/ldt.h> #include <pthread.h> #include <signal.h> #include <stdlib.h> #include <sys/syscall.h> #include <sys/wait.h> #include <unistd.h> static void *fork_thread(void *_arg) { fork(); } int main(void) { struct user_desc desc = { .entry_number = 8191 }; syscall(__NR_modify_ldt, 1, &desc, sizeof(desc)); for (;;) { if (fork() == 0) { pthread_t t; srand(getpid()); pthread_create(&t, NULL, fork_thread, NULL); usleep(rand() % 10000); syscall(__NR_exit_group, 0); } wait(NULL); } } Note: the reproducer takes advantage of the fact that alloc_ldt_struct() may use vmalloc() to allocate a large ->entries array, and after commit: 5d17a73 ("vmalloc: back off when the current task is killed") it is possible for userspace to fail a task's vmalloc() by sending a fatal signal, e.g. via exit_group(). It would be more difficult to reproduce this bug on kernels without that commit. This bug only affected kernels with CONFIG_MODIFY_LDT_SYSCALL=y. Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> [v4.6+] Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Fixes: 39a0526 ("x86/mm: Factor out LDT init from context init") Link: http://lkml.kernel.org/r/20170824175029.76040-1-ebiggers3@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
gouwa
pushed a commit
that referenced
this pull request
Dec 24, 2018
This patch fixes the following warning: [ 2.226264] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620 [ 2.226341] in_atomic(): 0, irqs_disabled(): 0, pid: 1, name: swapper/0 [ 2.226366] 4 locks held by swapper/0/1: [ 2.226385] #0: (&dev->mutex){......}, at: [<ffffff800851c664>] __device_attach+0x3c/0x134 [ 2.226515] #1: (cpu_hotplug.lock){......}, at: [<ffffff800809feb0>] get_online_cpus+0x38/0x9c [ 2.226594] #2: (subsys mutex#7){......}, at: [<ffffff800851b0e4>] subsys_interface_register+0x54/0xfc [ 2.226684] #3: (rcu_read_lock){......}, at: [<ffffff800845c764>] rockchip_adjust_power_scale+0x88/0x450 [ 2.226771] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 4.4.132 #600 [ 2.226790] Hardware name: Rockchip PX30 evb ddr3 board (DT) [ 2.226809] Call trace: [ 2.226840] [<ffffff800808a0d4>] dump_backtrace+0x0/0x1ec [ 2.226889] [<ffffff800808a2d4>] show_stack+0x14/0x1c [ 2.226918] [<ffffff80083c41b4>] dump_stack+0x94/0xbc [ 2.226946] [<ffffff80080cb708>] ___might_sleep+0x108/0x118 [ 2.226972] [<ffffff80080cb788>] __might_sleep+0x70/0x80 [ 2.227001] [<ffffff8008b6d1d0>] mutex_lock_nested+0x54/0x38c [ 2.227031] [<ffffff8008858714>] __of_clk_get_from_provider+0x44/0xec [ 2.227061] [<ffffff8008852814>] __of_clk_get_by_name+0xd4/0x14c [ 2.227112] [<ffffff80088528a4>] of_clk_get_by_name+0x18/0x28 [ 2.227140] [<ffffff800845c980>] rockchip_adjust_power_scale+0x2a4/0x450 [ 2.227171] [<ffffff800877f630>] cpufreq_init+0x154/0x394 [ 2.227199] [<ffffff8008776ac4>] cpufreq_online+0x1b0/0x68c [ 2.227225] [<ffffff8008777040>] cpufreq_add_dev+0x3c/0x94 [ 2.227253] [<ffffff800851b168>] subsys_interface_register+0xd8/0xfc [ 2.227281] [<ffffff80087772d8>] cpufreq_register_driver+0x10c/0x1a8 [ 2.227332] [<ffffff800877f93c>] dt_cpufreq_probe+0xcc/0xe8 [ 2.227360] [<ffffff800851e7b8>] platform_drv_probe+0x54/0xa8 [ 2.227385] [<ffffff800851c924>] driver_probe_device+0x194/0x278 [ 2.227411] [<ffffff800851cb40>] __device_attach_driver+0x60/0x9c [ 2.227439] [<ffffff800851ae14>] bus_for_each_drv+0x9c/0xbc [ 2.227464] [<ffffff800851c6f0>] __device_attach+0xc8/0x134 [ 2.227488] [<ffffff800851ccb8>] device_initial_probe+0x10/0x18 [ 2.227514] [<ffffff800851bcec>] bus_probe_device+0x2c/0x90 [ 2.227541] [<ffffff8008519e90>] device_add+0x44c/0x510 [ 2.227570] [<ffffff800851e4d4>] platform_device_add+0xa0/0x1e4 [ 2.227597] [<ffffff800851ef28>] platform_device_register_full+0xa4/0xe4 [ 2.227627] [<ffffff80090f8680>] rockchip_cpufreq_driver_init+0xd4/0x31c [ 2.227654] [<ffffff8008083468>] do_one_initcall+0x84/0x1a4 [ 2.227682] [<ffffff80090c0e98>] kernel_init_freeable+0x260/0x264 [ 2.227708] [<ffffff8008b6ab68>] kernel_init+0x10/0xf8 [ 2.227734] [<ffffff80080832a0>] ret_from_fork+0x10/0x30 Fixes: 197a2d3 ("soc: rockchip: opp_select: add missing rcu lock") Change-Id: Ib52b6f7bc77bb207a1c8c6880f7a5e916fa3d2ee Signed-off-by: Finley Xiao <finley.xiao@rock-chips.com>
gouwa
pushed a commit
that referenced
this pull request
May 28, 2019
[ Upstream commit a3ca831 ] When booting up with "threadirqs" in command line, all irq handlers of the DMA controller pl330 will be threaded forcedly. These threads will race for the same list, pl330->req_done. Before the callback, the spinlock was released. And after it, the spinlock was taken. This opened an race window where another threaded irq handler could steal the spinlock and be permitted to delete entries of the list, pl330->req_done. If the later deleted an entry that was still referred to by the former, there would be a kernel panic when the former was scheduled and tried to get the next sibling of the deleted entry. The scenario could be depicted as below: Thread: T1 pl330->req_done Thread: T2 | | | | -A-B-C-D- | Locked | | | | Waiting Del A | | | -B-C-D- | Unlocked | | | | Locked Waiting | | | | Del B | | | | -C-D- Unlocked Waiting | | | Locked | get C via B \ - Kernel panic The kernel panic looked like as below: Unable to handle kernel paging request at virtual address dead000000000108 pgd = ffffff8008c9e000 [dead000000000108] *pgd=000000027fffe003, *pud=000000027fffe003, *pmd=0000000000000000 Internal error: Oops: 96000044 [#1] PREEMPT SMP Modules linked in: CPU: 0 PID: 85 Comm: irq/59-66330000 Not tainted 4.8.24-WR9.0.0.12_standard #2 Hardware name: Broadcom NS2 SVK (DT) task: ffffffc1f5cc3c00 task.stack: ffffffc1f5ce0000 PC is at pl330_irq_handler+0x27c/0x390 LR is at pl330_irq_handler+0x2a8/0x390 pc : [<ffffff80084cb694>] lr : [<ffffff80084cb6c0>] pstate: 800001c5 sp : ffffffc1f5ce3d00 x29: ffffffc1f5ce3d00 x28: 0000000000000140 x27: ffffffc1f5c530b0 x26: dead000000000100 x25: dead000000000200 x24: 0000000000418958 x23: 0000000000000001 x22: ffffffc1f5ccd668 x21: ffffffc1f5ccd590 x20: ffffffc1f5ccd418 x19: dead000000000060 x18: 0000000000000001 x17: 0000000000000007 x16: 0000000000000001 x15: ffffffffffffffff x14: ffffffffffffffff x13: ffffffffffffffff x12: 0000000000000000 x11: 0000000000000001 x10: 0000000000000840 x9 : ffffffc1f5ce0000 x8 : ffffffc1f5cc3338 x7 : ffffff8008ce2020 x6 : 0000000000000000 x5 : 0000000000000000 x4 : 0000000000000001 x3 : dead000000000200 x2 : dead000000000100 x1 : 0000000000000140 x0 : ffffffc1f5ccd590 Process irq/59-66330000 (pid: 85, stack limit = 0xffffffc1f5ce0020) Stack: (0xffffffc1f5ce3d00 to 0xffffffc1f5ce4000) 3d00: ffffffc1f5ce3d80 ffffff80080f09d0 ffffffc1f5ca0c00 ffffffc1f6f7c600 3d20: ffffffc1f5ce0000 ffffffc1f6f7c600 ffffffc1f5ca0c00 ffffff80080f0998 3d40: ffffffc1f5ce0000 ffffff80080f0000 0000000000000000 0000000000000000 3d60: ffffff8008ce202c ffffff8008ce2020 ffffffc1f5ccd668 ffffffc1f5c530b0 3d80: ffffffc1f5ce3db0 ffffff80080f0d70 ffffffc1f5ca0c40 0000000000000001 3da0: ffffffc1f5ce0000 ffffff80080f0cfc ffffffc1f5ce3e20 ffffff80080bf4f8 3dc0: ffffffc1f5ca0c80 ffffff8008bf3798 ffffff8008955528 ffffffc1f5ca0c00 3de0: ffffff80080f0c30 0000000000000000 0000000000000000 0000000000000000 3e00: 0000000000000000 0000000000000000 0000000000000000 ffffff80080f0b68 3e20: 0000000000000000 ffffff8008083690 ffffff80080bf420 ffffffc1f5ca0c80 3e40: 0000000000000000 0000000000000000 0000000000000000 ffffff80080cb648 3e60: ffffff8008b1c780 0000000000000000 0000000000000000 ffffffc1f5ca0c00 3e80: ffffffc100000000 ffffff8000000000 ffffffc1f5ce3e90 ffffffc1f5ce3e90 3ea0: 0000000000000000 ffffff8000000000 ffffffc1f5ce3eb0 ffffffc1f5ce3eb0 3ec0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 3ee0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 3f00: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 3f20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 3f40: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 3f60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 3f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 3fa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 3fc0: 0000000000000000 0000000000000005 0000000000000000 0000000000000000 3fe0: 0000000000000000 0000000000000000 0000000275ce3ff0 0000000275ce3ff8 Call trace: Exception stack(0xffffffc1f5ce3b30 to 0xffffffc1f5ce3c60) 3b20: dead000000000060 0000008000000000 3b40: ffffffc1f5ce3d00 ffffff80084cb694 0000000000000008 0000000000000e88 3b60: ffffffc1f5ce3bb0 ffffff80080dac68 ffffffc1f5ce3b90 ffffff8008826fe4 3b80: 00000000000001c0 00000000000001c0 ffffffc1f5ce3bb0 ffffff800848dfcc 3ba0: 0000000000020000 ffffff8008b15ae4 ffffffc1f5ce3c00 ffffff800808f000 3bc0: 0000000000000010 ffffff80088377f0 ffffffc1f5ccd590 0000000000000140 3be0: dead000000000100 dead000000000200 0000000000000001 0000000000000000 3c00: 0000000000000000 ffffff8008ce2020 ffffffc1f5cc3338 ffffffc1f5ce0000 3c20: 0000000000000840 0000000000000001 0000000000000000 ffffffffffffffff 3c40: ffffffffffffffff ffffffffffffffff 0000000000000001 0000000000000007 [<ffffff80084cb694>] pl330_irq_handler+0x27c/0x390 [<ffffff80080f09d0>] irq_forced_thread_fn+0x38/0x88 [<ffffff80080f0d70>] irq_thread+0x140/0x200 [<ffffff80080bf4f8>] kthread+0xd8/0xf0 [<ffffff8008083690>] ret_from_fork+0x10/0x40 Code: f2a00838 f9405763 aa1c03e1 aa1503e0 (f9000443) ---[ end trace f50005726d31199c ]--- Kernel panic - not syncing: Fatal exception in interrupt SMP: stopping secondary CPUs SMP: failed to stop secondary CPUs 0-1 Kernel Offset: disabled Memory Limit: none ---[ end Kernel panic - not syncing: Fatal exception in interrupt To fix this, re-start with the list-head after dropping the lock then re-takeing it. Reviewed-by: Frank Mori Hess <fmh6jj@gmail.com> Tested-by: Frank Mori Hess <fmh6jj@gmail.com> Signed-off-by: Qi Hou <qi.hou@windriver.com> Signed-off-by: Vinod Koul <vinod.koul@intel.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
numbqq
pushed a commit
that referenced
this pull request
Jul 27, 2019
[ Upstream commit a9fd095 ] Leaving dev_init_lock mutex locked in probe causes BUG and a WARNING when kernel is compiled with CONFIG_PROVE_LOCKING. Convert mutex to completion which silences those warnings and improves code readability. Fix below errors when connecting the USB WiFi dongle: brcmfmac: brcmf_fw_alloc_request: using brcm/brcmfmac43143 for chip BCM43143/2 BUG: workqueue leaked lock or atomic: kworker/0:2/0x00000000/434 last function: hub_event 1 lock held by kworker/0:2/434: #0: 18d5dcdf (&devinfo->dev_init_lock){+.+.}, at: brcmf_usb_probe+0x78/0x550 [brcmfmac] CPU: 0 PID: 434 Comm: kworker/0:2 Not tainted 4.19.23-00084-g454a789-dirty #123 Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) Workqueue: usb_hub_wq hub_event [<8011237c>] (unwind_backtrace) from [<8010d74c>] (show_stack+0x10/0x14) [<8010d74c>] (show_stack) from [<809c4324>] (dump_stack+0xa8/0xd4) [<809c4324>] (dump_stack) from [<8014195c>] (process_one_work+0x710/0x808) [<8014195c>] (process_one_work) from [<80141a80>] (worker_thread+0x2c/0x564) [<80141a80>] (worker_thread) from [<80147bcc>] (kthread+0x13c/0x16c) [<80147bcc>] (kthread) from [<801010b4>] (ret_from_fork+0x14/0x20) Exception stack(0xed1d9fb0 to 0xed1d9ff8) 9fa0: 00000000 00000000 00000000 00000000 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000 ====================================================== WARNING: possible circular locking dependency detected 4.19.23-00084-g454a789-dirty #123 Not tainted ------------------------------------------------------ kworker/0:2/434 is trying to acquire lock: e29cf799 ((wq_completion)"events"){+.+.}, at: process_one_work+0x174/0x808 but task is already holding lock: 18d5dcdf (&devinfo->dev_init_lock){+.+.}, at: brcmf_usb_probe+0x78/0x550 [brcmfmac] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&devinfo->dev_init_lock){+.+.}: mutex_lock_nested+0x1c/0x24 brcmf_usb_probe+0x78/0x550 [brcmfmac] usb_probe_interface+0xc0/0x1bc really_probe+0x228/0x2c0 __driver_attach+0xe4/0xe8 bus_for_each_dev+0x68/0xb4 bus_add_driver+0x19c/0x214 driver_register+0x78/0x110 usb_register_driver+0x84/0x148 process_one_work+0x228/0x808 worker_thread+0x2c/0x564 kthread+0x13c/0x16c ret_from_fork+0x14/0x20 (null) -> #1 (brcmf_driver_work){+.+.}: worker_thread+0x2c/0x564 kthread+0x13c/0x16c ret_from_fork+0x14/0x20 (null) -> #0 ((wq_completion)"events"){+.+.}: process_one_work+0x1b8/0x808 worker_thread+0x2c/0x564 kthread+0x13c/0x16c ret_from_fork+0x14/0x20 (null) other info that might help us debug this: Chain exists of: (wq_completion)"events" --> brcmf_driver_work --> &devinfo->dev_init_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&devinfo->dev_init_lock); lock(brcmf_driver_work); lock(&devinfo->dev_init_lock); lock((wq_completion)"events"); *** DEADLOCK *** 1 lock held by kworker/0:2/434: #0: 18d5dcdf (&devinfo->dev_init_lock){+.+.}, at: brcmf_usb_probe+0x78/0x550 [brcmfmac] stack backtrace: CPU: 0 PID: 434 Comm: kworker/0:2 Not tainted 4.19.23-00084-g454a789-dirty #123 Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) Workqueue: events request_firmware_work_func [<8011237c>] (unwind_backtrace) from [<8010d74c>] (show_stack+0x10/0x14) [<8010d74c>] (show_stack) from [<809c4324>] (dump_stack+0xa8/0xd4) [<809c4324>] (dump_stack) from [<80172838>] (print_circular_bug+0x210/0x330) [<80172838>] (print_circular_bug) from [<80175940>] (__lock_acquire+0x160c/0x1a30) [<80175940>] (__lock_acquire) from [<8017671c>] (lock_acquire+0xe0/0x268) [<8017671c>] (lock_acquire) from [<80141404>] (process_one_work+0x1b8/0x808) [<80141404>] (process_one_work) from [<80141a80>] (worker_thread+0x2c/0x564) [<80141a80>] (worker_thread) from [<80147bcc>] (kthread+0x13c/0x16c) [<80147bcc>] (kthread) from [<801010b4>] (ret_from_fork+0x14/0x20) Exception stack(0xed1d9fb0 to 0xed1d9ff8) 9fa0: 00000000 00000000 00000000 00000000 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000 Signed-off-by: Piotr Figiel <p.figiel@camlintechnologies.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
numbqq
pushed a commit
that referenced
this pull request
Jul 27, 2019
…text commit 0c9e8b3 upstream. stub_probe() and stub_disconnect() call functions which could call sleeping function in invalid context whil holding busid_lock. Fix the problem by refining the lock holds to short critical sections to change the busid_priv fields. This fix restructures the code to limit the lock holds in stub_probe() and stub_disconnect(). stub_probe(): [15217.927028] BUG: sleeping function called from invalid context at mm/slab.h:418 [15217.927038] in_atomic(): 1, irqs_disabled(): 0, pid: 29087, name: usbip [15217.927044] 5 locks held by usbip/29087: [15217.927047] #0: 0000000091647f28 (sb_writers#6){....}, at: vfs_write+0x191/0x1c0 [15217.927062] #1: 000000008f9ba75b (&of->mutex){....}, at: kernfs_fop_write+0xf7/0x1b0 [15217.927072] #2: 00000000872e5b4b (&dev->mutex){....}, at: __device_driver_lock+0x3b/0x50 [15217.927082] #3: 00000000e74ececc (&dev->mutex){....}, at: __device_driver_lock+0x46/0x50 [15217.927090] #4: 00000000b20abbe0 (&(&busid_table[i].busid_lock)->rlock){....}, at: get_busid_priv+0x48/0x60 [usbip_host] [15217.927103] CPU: 3 PID: 29087 Comm: usbip Tainted: G W 5.1.0-rc6+ #40 [15217.927106] Hardware name: Dell Inc. OptiPlex 790/0HY9JP, BIOS A18 09/24/2013 [15217.927109] Call Trace: [15217.927118] dump_stack+0x63/0x85 [15217.927127] ___might_sleep+0xff/0x120 [15217.927133] __might_sleep+0x4a/0x80 [15217.927143] kmem_cache_alloc_trace+0x1aa/0x210 [15217.927156] stub_probe+0xe8/0x440 [usbip_host] [15217.927171] usb_probe_device+0x34/0x70 stub_disconnect(): [15279.182478] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:908 [15279.182487] in_atomic(): 1, irqs_disabled(): 0, pid: 29114, name: usbip [15279.182492] 5 locks held by usbip/29114: [15279.182494] #0: 0000000091647f28 (sb_writers#6){....}, at: vfs_write+0x191/0x1c0 [15279.182506] #1: 00000000702cf0f3 (&of->mutex){....}, at: kernfs_fop_write+0xf7/0x1b0 [15279.182514] #2: 00000000872e5b4b (&dev->mutex){....}, at: __device_driver_lock+0x3b/0x50 [15279.182522] #3: 00000000e74ececc (&dev->mutex){....}, at: __device_driver_lock+0x46/0x50 [15279.182529] #4: 00000000b20abbe0 (&(&busid_table[i].busid_lock)->rlock){....}, at: get_busid_priv+0x48/0x60 [usbip_host] [15279.182541] CPU: 0 PID: 29114 Comm: usbip Tainted: G W 5.1.0-rc6+ #40 [15279.182543] Hardware name: Dell Inc. OptiPlex 790/0HY9JP, BIOS A18 09/24/2013 [15279.182546] Call Trace: [15279.182554] dump_stack+0x63/0x85 [15279.182561] ___might_sleep+0xff/0x120 [15279.182566] __might_sleep+0x4a/0x80 [15279.182574] __mutex_lock+0x55/0x950 [15279.182582] ? get_busid_priv+0x48/0x60 [usbip_host] [15279.182587] ? reacquire_held_locks+0xec/0x1a0 [15279.182591] ? get_busid_priv+0x48/0x60 [usbip_host] [15279.182597] ? find_held_lock+0x94/0xa0 [15279.182609] mutex_lock_nested+0x1b/0x20 [15279.182614] ? mutex_lock_nested+0x1b/0x20 [15279.182618] kernfs_remove_by_name_ns+0x2a/0x90 [15279.182625] sysfs_remove_file_ns+0x15/0x20 [15279.182629] device_remove_file+0x19/0x20 [15279.182634] stub_disconnect+0x6d/0x180 [usbip_host] [15279.182643] usb_unbind_device+0x27/0x60 Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
numbqq
pushed a commit
that referenced
this pull request
Jul 27, 2019
… sdevs) commit ef4021f upstream. When the user tries to remove a zfcp port via sysfs, we only rejected it if there are zfcp unit children under the port. With purely automatically scanned LUNs there are no zfcp units but only SCSI devices. In such cases, the port_remove erroneously continued. We close the port and this implicitly closes all LUNs under the port. The SCSI devices survive with their private zfcp_scsi_dev still holding a reference to the "removed" zfcp_port (still allocated but invisible in sysfs) [zfcp_get_port_by_wwpn in zfcp_scsi_slave_alloc]. This is not a problem as long as the fc_rport stays blocked. Once (auto) port scan brings back the removed port, we unblock its fc_rport again by design. However, there is no mechanism that would recover (open) the LUNs under the port (no "ersfs_3" without zfcp_unit [zfcp_erp_strategy_followup_success]). Any pending or new I/O to such LUN leads to repeated: Done: NEEDS_RETRY Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK See also v4.10 commit 6f2ce1c ("scsi: zfcp: fix rport unblock race with LUN recovery"). Even a manual LUN recovery (echo 0 > /sys/bus/scsi/devices/H:C:T:L/zfcp_failed) does not help, as the LUN links to the old "removed" port which remains to lack ZFCP_STATUS_COMMON_RUNNING [zfcp_erp_required_act]. The only workaround is to first ensure that the fc_rport is blocked (e.g. port_remove again in case it was re-discovered by (auto) port scan), then delete the SCSI devices, and finally re-discover by (auto) port scan. The port scan includes an fc_rport unblock, which in turn triggers a new scan on the scsi target to freshly get new pure auto scan LUNs. Fix this by rejecting port_remove also if there are SCSI devices (even without any zfcp_unit) under this port. Re-use mechanics from v3.7 commit d99b601 ("[SCSI] zfcp: restore refcount check on port_remove"). However, we have to give up zfcp_sysfs_port_units_mutex earlier in unit_add to prevent a deadlock with scsi_host scan taking shost->scan_mutex first and then zfcp_sysfs_port_units_mutex now in our zfcp_scsi_slave_alloc(). Signed-off-by: Steffen Maier <maier@linux.ibm.com> Fixes: b62a8d9 ("[SCSI] zfcp: Use SCSI device data zfcp scsi dev instead of zfcp unit") Fixes: f8210e3 ("[SCSI] zfcp: Allow midlayer to scan for LUNs when running in NPIV mode") Cc: <stable@vger.kernel.org> #2.6.37+ Reviewed-by: Benjamin Block <bblock@linux.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
numbqq
pushed a commit
that referenced
this pull request
Jul 27, 2019
[ Upstream commit 347ab94 ] This patch fixes deadlock warning if removing PWM device when CONFIG_PROVE_LOCKING is enabled. This issue can be reproceduced by the following steps on the R-Car H3 Salvator-X board if the backlight is disabled: # cd /sys/class/pwm/pwmchip0 # echo 0 > export # ls device export npwm power pwm0 subsystem uevent unexport # cd device/driver # ls bind e6e31000.pwm uevent unbind # echo e6e31000.pwm > unbind [ 87.659974] ====================================================== [ 87.666149] WARNING: possible circular locking dependency detected [ 87.672327] 5.0.0 #7 Not tainted [ 87.675549] ------------------------------------------------------ [ 87.681723] bash/2986 is trying to acquire lock: [ 87.686337] 000000005ea0e178 (kn->count#58){++++}, at: kernfs_remove_by_name_ns+0x50/0xa0 [ 87.694528] [ 87.694528] but task is already holding lock: [ 87.700353] 000000006313b17c (pwm_lock){+.+.}, at: pwmchip_remove+0x28/0x13c [ 87.707405] [ 87.707405] which lock already depends on the new lock. [ 87.707405] [ 87.715574] [ 87.715574] the existing dependency chain (in reverse order) is: [ 87.723048] [ 87.723048] -> #1 (pwm_lock){+.+.}: [ 87.728017] __mutex_lock+0x70/0x7e4 [ 87.732108] mutex_lock_nested+0x1c/0x24 [ 87.736547] pwm_request_from_chip.part.6+0x34/0x74 [ 87.741940] pwm_request_from_chip+0x20/0x40 [ 87.746725] export_store+0x6c/0x1f4 [ 87.750820] dev_attr_store+0x18/0x28 [ 87.754998] sysfs_kf_write+0x54/0x64 [ 87.759175] kernfs_fop_write+0xe4/0x1e8 [ 87.763615] __vfs_write+0x40/0x184 [ 87.767619] vfs_write+0xa8/0x19c [ 87.771448] ksys_write+0x58/0xbc [ 87.775278] __arm64_sys_write+0x18/0x20 [ 87.779721] el0_svc_common+0xd0/0x124 [ 87.783986] el0_svc_compat_handler+0x1c/0x24 [ 87.788858] el0_svc_compat+0x8/0x18 [ 87.792947] [ 87.792947] -> #0 (kn->count#58){++++}: [ 87.798260] lock_acquire+0xc4/0x22c [ 87.802353] __kernfs_remove+0x258/0x2c4 [ 87.806790] kernfs_remove_by_name_ns+0x50/0xa0 [ 87.811836] remove_files.isra.1+0x38/0x78 [ 87.816447] sysfs_remove_group+0x48/0x98 [ 87.820971] sysfs_remove_groups+0x34/0x4c [ 87.825583] device_remove_attrs+0x6c/0x7c [ 87.830197] device_del+0x11c/0x33c [ 87.834201] device_unregister+0x14/0x2c [ 87.838638] pwmchip_sysfs_unexport+0x40/0x4c [ 87.843509] pwmchip_remove+0xf4/0x13c [ 87.847773] rcar_pwm_remove+0x28/0x34 [ 87.852039] platform_drv_remove+0x24/0x64 [ 87.856651] device_release_driver_internal+0x18c/0x21c [ 87.862391] device_release_driver+0x14/0x1c [ 87.867175] unbind_store+0xe0/0x124 [ 87.871265] drv_attr_store+0x20/0x30 [ 87.875442] sysfs_kf_write+0x54/0x64 [ 87.879618] kernfs_fop_write+0xe4/0x1e8 [ 87.884055] __vfs_write+0x40/0x184 [ 87.888057] vfs_write+0xa8/0x19c [ 87.891887] ksys_write+0x58/0xbc [ 87.895716] __arm64_sys_write+0x18/0x20 [ 87.900154] el0_svc_common+0xd0/0x124 [ 87.904417] el0_svc_compat_handler+0x1c/0x24 [ 87.909289] el0_svc_compat+0x8/0x18 [ 87.913378] [ 87.913378] other info that might help us debug this: [ 87.913378] [ 87.921374] Possible unsafe locking scenario: [ 87.921374] [ 87.927286] CPU0 CPU1 [ 87.931808] ---- ---- [ 87.936331] lock(pwm_lock); [ 87.939293] lock(kn->count#58); [ 87.945120] lock(pwm_lock); [ 87.950599] lock(kn->count#58); [ 87.953908] [ 87.953908] *** DEADLOCK *** [ 87.953908] [ 87.959821] 4 locks held by bash/2986: [ 87.963563] #0: 00000000ace7bc30 (sb_writers#6){.+.+}, at: vfs_write+0x188/0x19c [ 87.971044] #1: 00000000287991b2 (&of->mutex){+.+.}, at: kernfs_fop_write+0xb4/0x1e8 [ 87.978872] #2: 00000000f739d016 (&dev->mutex){....}, at: device_release_driver_internal+0x40/0x21c [ 87.988001] #3: 000000006313b17c (pwm_lock){+.+.}, at: pwmchip_remove+0x28/0x13c [ 87.995481] [ 87.995481] stack backtrace: [ 87.999836] CPU: 0 PID: 2986 Comm: bash Not tainted 5.0.0 #7 [ 88.005489] Hardware name: Renesas Salvator-X board based on r8a7795 ES1.x (DT) [ 88.012791] Call trace: [ 88.015235] dump_backtrace+0x0/0x190 [ 88.018891] show_stack+0x14/0x1c [ 88.022204] dump_stack+0xb0/0xec [ 88.025514] print_circular_bug.isra.32+0x1d0/0x2e0 [ 88.030385] __lock_acquire+0x1318/0x1864 [ 88.034388] lock_acquire+0xc4/0x22c [ 88.037958] __kernfs_remove+0x258/0x2c4 [ 88.041874] kernfs_remove_by_name_ns+0x50/0xa0 [ 88.046398] remove_files.isra.1+0x38/0x78 [ 88.050487] sysfs_remove_group+0x48/0x98 [ 88.054490] sysfs_remove_groups+0x34/0x4c [ 88.058580] device_remove_attrs+0x6c/0x7c [ 88.062671] device_del+0x11c/0x33c [ 88.066154] device_unregister+0x14/0x2c [ 88.070070] pwmchip_sysfs_unexport+0x40/0x4c [ 88.074421] pwmchip_remove+0xf4/0x13c [ 88.078163] rcar_pwm_remove+0x28/0x34 [ 88.081906] platform_drv_remove+0x24/0x64 [ 88.085996] device_release_driver_internal+0x18c/0x21c [ 88.091215] device_release_driver+0x14/0x1c [ 88.095478] unbind_store+0xe0/0x124 [ 88.099048] drv_attr_store+0x20/0x30 [ 88.102704] sysfs_kf_write+0x54/0x64 [ 88.106359] kernfs_fop_write+0xe4/0x1e8 [ 88.110275] __vfs_write+0x40/0x184 [ 88.113757] vfs_write+0xa8/0x19c [ 88.117065] ksys_write+0x58/0xbc [ 88.120374] __arm64_sys_write+0x18/0x20 [ 88.124291] el0_svc_common+0xd0/0x124 [ 88.128034] el0_svc_compat_handler+0x1c/0x24 [ 88.132384] el0_svc_compat+0x8/0x18 The sysfs unexport in pwmchip_remove() is completely asymmetric to what we do in pwmchip_add_with_polarity() and commit 0733424 ("pwm: Unexport children before chip removal") is a strong indication that this was wrong to begin with. We should just move pwmchip_sysfs_unexport() where it belongs, which is right after pwmchip_sysfs_unexport_children(). In that case, we do not need separate functions anymore either. We also really want to remove sysfs irrespective of whether or not the chip will be removed as a result of pwmchip_remove(). We can only assume that the driver will be gone after that, so we shouldn't leave any dangling sysfs files around. This warning disappears if we move pwmchip_sysfs_unexport() to the top of pwmchip_remove(), pwmchip_sysfs_unexport_children(). That way it is also outside of the pwm_lock section, which indeed doesn't seem to be needed. Moving the pwmchip_sysfs_export() call outside of that section also seems fine and it'd be perfectly symmetric with pwmchip_remove() again. So, this patch fixes them. Signed-off-by: Phong Hoang <phong.hoang.wz@renesas.com> [shimoda: revise the commit log and code] Fixes: 76abbdd ("pwm: Add sysfs interface") Fixes: 0733424 ("pwm: Unexport children before chip removal") Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Tested-by: Hoan Nguyen An <na-hoan@jinso.co.jp> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Simon Horman <horms+renesas@verge.net.au> Reviewed-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Thierry Reding <thierry.reding@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
numbqq
pushed a commit
that referenced
this pull request
Jul 27, 2019
[ Upstream commit 5518424 ] ifmsh->csa is an RCU-protected pointer. The writer context in ieee80211_mesh_finish_csa() is already mutually exclusive with wdev->sdata.mtx, but the RCU checker did not know this. Use rcu_dereference_protected() to avoid a warning. fixes the following warning: [ 12.519089] ============================= [ 12.520042] WARNING: suspicious RCU usage [ 12.520652] 5.1.0-rc7-wt+ #16 Tainted: G W [ 12.521409] ----------------------------- [ 12.521972] net/mac80211/mesh.c:1223 suspicious rcu_dereference_check() usage! [ 12.522928] other info that might help us debug this: [ 12.523984] rcu_scheduler_active = 2, debug_locks = 1 [ 12.524855] 5 locks held by kworker/u8:2/152: [ 12.525438] #0: 00000000057be08c ((wq_completion)phy0){+.+.}, at: process_one_work+0x1a2/0x620 [ 12.526607] #1: 0000000059c6b07a ((work_completion)(&sdata->csa_finalize_work)){+.+.}, at: process_one_work+0x1a2/0x620 [ 12.528001] #2: 00000000f184ba7d (&wdev->mtx){+.+.}, at: ieee80211_csa_finalize_work+0x2f/0x90 [ 12.529116] #3: 00000000831a1f54 (&local->mtx){+.+.}, at: ieee80211_csa_finalize_work+0x47/0x90 [ 12.530233] #4: 00000000fd06f988 (&local->chanctx_mtx){+.+.}, at: ieee80211_csa_finalize_work+0x51/0x90 Signed-off-by: Thomas Pedersen <thomas@eero.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
numbqq
pushed a commit
that referenced
this pull request
May 17, 2024
We need to check mddev->del_work before flush workqueu since the purpose of flush is to ensure the previous md is disappeared. Otherwise the similar deadlock appeared if LOCKDEP is enabled, it is due to md_open holds the bdev->bd_mutex before flush workqueue. kernel: [ 154.522645] ====================================================== kernel: [ 154.522647] WARNING: possible circular locking dependency detected kernel: [ 154.522650] 5.6.0-rc7-lp151.27-default #25 Tainted: G O kernel: [ 154.522651] ------------------------------------------------------ kernel: [ 154.522653] mdadm/2482 is trying to acquire lock: kernel: [ 154.522655] ffff888078529128 ((wq_completion)md_misc){+.+.}, at: flush_workqueue+0x84/0x4b0 kernel: [ 154.522673] kernel: [ 154.522673] but task is already holding lock: kernel: [ 154.522675] ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590 kernel: [ 154.522691] kernel: [ 154.522691] which lock already depends on the new lock. kernel: [ 154.522691] kernel: [ 154.522694] kernel: [ 154.522694] the existing dependency chain (in reverse order) is: kernel: [ 154.522696] kernel: [ 154.522696] -> #4 (&bdev->bd_mutex){+.+.}: kernel: [ 154.522704] __mutex_lock+0x87/0x950 kernel: [ 154.522706] __blkdev_get+0x79/0x590 kernel: [ 154.522708] blkdev_get+0x65/0x140 kernel: [ 154.522709] blkdev_get_by_dev+0x2f/0x40 kernel: [ 154.522716] lock_rdev+0x3d/0x90 [md_mod] kernel: [ 154.522719] md_import_device+0xd6/0x1b0 [md_mod] kernel: [ 154.522723] new_dev_store+0x15e/0x210 [md_mod] kernel: [ 154.522728] md_attr_store+0x7a/0xc0 [md_mod] kernel: [ 154.522732] kernfs_fop_write+0x117/0x1b0 kernel: [ 154.522735] vfs_write+0xad/0x1a0 kernel: [ 154.522737] ksys_write+0xa4/0xe0 kernel: [ 154.522745] do_syscall_64+0x64/0x2b0 kernel: [ 154.522748] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522749] kernel: [ 154.522749] -> #3 (&mddev->reconfig_mutex){+.+.}: kernel: [ 154.522752] __mutex_lock+0x87/0x950 kernel: [ 154.522756] new_dev_store+0xc9/0x210 [md_mod] kernel: [ 154.522759] md_attr_store+0x7a/0xc0 [md_mod] kernel: [ 154.522761] kernfs_fop_write+0x117/0x1b0 kernel: [ 154.522763] vfs_write+0xad/0x1a0 kernel: [ 154.522765] ksys_write+0xa4/0xe0 kernel: [ 154.522767] do_syscall_64+0x64/0x2b0 kernel: [ 154.522769] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522770] kernel: [ 154.522770] -> #2 (kn->count#253){++++}: kernel: [ 154.522775] __kernfs_remove+0x253/0x2c0 kernel: [ 154.522778] kernfs_remove+0x1f/0x30 kernel: [ 154.522780] kobject_del+0x28/0x60 kernel: [ 154.522783] mddev_delayed_delete+0x24/0x30 [md_mod] kernel: [ 154.522786] process_one_work+0x2a7/0x5f0 kernel: [ 154.522788] worker_thread+0x2d/0x3d0 kernel: [ 154.522793] kthread+0x117/0x130 kernel: [ 154.522795] ret_from_fork+0x3a/0x50 kernel: [ 154.522796] kernel: [ 154.522796] -> #1 ((work_completion)(&mddev->del_work)){+.+.}: kernel: [ 154.522800] process_one_work+0x27e/0x5f0 kernel: [ 154.522802] worker_thread+0x2d/0x3d0 kernel: [ 154.522804] kthread+0x117/0x130 kernel: [ 154.522806] ret_from_fork+0x3a/0x50 kernel: [ 154.522807] kernel: [ 154.522807] -> #0 ((wq_completion)md_misc){+.+.}: kernel: [ 154.522813] __lock_acquire+0x1392/0x1690 kernel: [ 154.522816] lock_acquire+0xb4/0x1a0 kernel: [ 154.522818] flush_workqueue+0xab/0x4b0 kernel: [ 154.522821] md_open+0xb6/0xc0 [md_mod] kernel: [ 154.522823] __blkdev_get+0xea/0x590 kernel: [ 154.522825] blkdev_get+0x65/0x140 kernel: [ 154.522828] do_dentry_open+0x1d1/0x380 kernel: [ 154.522831] path_openat+0x567/0xcc0 kernel: [ 154.522834] do_filp_open+0x9b/0x110 kernel: [ 154.522836] do_sys_openat2+0x201/0x2a0 kernel: [ 154.522838] do_sys_open+0x57/0x80 kernel: [ 154.522840] do_syscall_64+0x64/0x2b0 kernel: [ 154.522842] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522844] kernel: [ 154.522844] other info that might help us debug this: kernel: [ 154.522844] kernel: [ 154.522846] Chain exists of: kernel: [ 154.522846] (wq_completion)md_misc --> &mddev->reconfig_mutex --> &bdev->bd_mutex kernel: [ 154.522846] kernel: [ 154.522850] Possible unsafe locking scenario: kernel: [ 154.522850] kernel: [ 154.522852] CPU0 CPU1 kernel: [ 154.522853] ---- ---- kernel: [ 154.522854] lock(&bdev->bd_mutex); kernel: [ 154.522856] lock(&mddev->reconfig_mutex); kernel: [ 154.522858] lock(&bdev->bd_mutex); kernel: [ 154.522860] lock((wq_completion)md_misc); kernel: [ 154.522861] kernel: [ 154.522861] *** DEADLOCK *** kernel: [ 154.522861] kernel: [ 154.522864] 1 lock held by mdadm/2482: kernel: [ 154.522865] #0: ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590 kernel: [ 154.522868] kernel: [ 154.522868] stack backtrace: kernel: [ 154.522873] CPU: 1 PID: 2482 Comm: mdadm Tainted: G O 5.6.0-rc7-lp151.27-default #25 kernel: [ 154.522875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 kernel: [ 154.522878] Call Trace: kernel: [ 154.522881] dump_stack+0x8f/0xcb kernel: [ 154.522884] check_noncircular+0x194/0x1b0 kernel: [ 154.522888] ? __lock_acquire+0x1392/0x1690 kernel: [ 154.522890] __lock_acquire+0x1392/0x1690 kernel: [ 154.522893] lock_acquire+0xb4/0x1a0 kernel: [ 154.522895] ? flush_workqueue+0x84/0x4b0 kernel: [ 154.522898] flush_workqueue+0xab/0x4b0 kernel: [ 154.522900] ? flush_workqueue+0x84/0x4b0 kernel: [ 154.522905] ? md_open+0xb6/0xc0 [md_mod] kernel: [ 154.522908] md_open+0xb6/0xc0 [md_mod] kernel: [ 154.522910] __blkdev_get+0xea/0x590 kernel: [ 154.522912] ? bd_acquire+0xc0/0xc0 kernel: [ 154.522914] blkdev_get+0x65/0x140 kernel: [ 154.522916] ? bd_acquire+0xc0/0xc0 kernel: [ 154.522918] do_dentry_open+0x1d1/0x380 kernel: [ 154.522921] path_openat+0x567/0xcc0 kernel: [ 154.522923] ? __lock_acquire+0x380/0x1690 kernel: [ 154.522926] do_filp_open+0x9b/0x110 kernel: [ 154.522929] ? __alloc_fd+0xe5/0x1f0 kernel: [ 154.522935] ? kmem_cache_alloc+0x28c/0x630 kernel: [ 154.522939] ? do_sys_openat2+0x201/0x2a0 kernel: [ 154.522941] do_sys_openat2+0x201/0x2a0 kernel: [ 154.522944] do_sys_open+0x57/0x80 kernel: [ 154.522946] do_syscall_64+0x64/0x2b0 kernel: [ 154.522948] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522951] RIP: 0033:0x7f98d279d9ae And md_alloc also flushed the same workqueue, but the thing is different here. Because all the paths call md_alloc don't hold bdev->bd_mutex, and the flush is necessary to avoid race condition, so leave it as it is. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jun 25, 2024
With commit c4cb231 ("iommu/amd: Add support for enable/disable IOPF") we are hitting below issue. This happens because in IOPF enablement path it holds spin lock with irq disable and then tries to take mutex lock. dmesg: ----- [ 0.938739] ============================= [ 0.938740] [ BUG: Invalid wait context ] [ 0.938742] 6.10.0-rc1+ khadas#1 Not tainted [ 0.938745] ----------------------------- [ 0.938746] swapper/0/1 is trying to lock: [ 0.938748] ffffffff8c9f01d8 (&port_lock_key){....}-{3:3}, at: serial8250_console_write+0x78/0x4a0 [ 0.938767] other info that might help us debug this: [ 0.938768] context-{5:5} [ 0.938769] 7 locks held by swapper/0/1: [ 0.938772] #0: ffff888101a91310 (&group->mutex){+.+.}-{4:4}, at: bus_iommu_probe+0x70/0x160 [ 0.938790] khadas#1: ffff888101d1f1b8 (&domain->lock){....}-{3:3}, at: amd_iommu_attach_device+0xa5/0x700 [ 0.938799] khadas#2: ffff888101cc3d18 (&dev_data->lock){....}-{3:3}, at: amd_iommu_attach_device+0xc5/0x700 [ 0.938806] khadas#3: ffff888100052830 (&iommu->lock){....}-{2:2}, at: amd_iommu_iopf_add_device+0x3f/0xa0 [ 0.938813] khadas#4: ffffffff8945a340 (console_lock){+.+.}-{0:0}, at: _printk+0x48/0x50 [ 0.938822] khadas#5: ffffffff8945a390 (console_srcu){....}-{0:0}, at: console_flush_all+0x58/0x4e0 [ 0.938867] khadas#6: ffffffff82459f80 (console_owner){....}-{0:0}, at: console_flush_all+0x1f0/0x4e0 [ 0.938872] stack backtrace: [ 0.938874] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.10.0-rc1+ khadas#1 [ 0.938877] Hardware name: HP HP EliteBook 745 G3/807E, BIOS N73 Ver. 01.39 04/16/2019 Fix above issue by re-arranging code in attach device path: - move device PASID/IOPF enablement outside lock in AMD IOMMU driver. This is safe as core layer holds group->mutex lock before calling iommu_ops->attach_dev. Reported-by: Borislav Petkov <bp@alien8.de> Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com> Fixes: c4cb231 ("iommu/amd: Add support for enable/disable IOPF") Tested-by: Borislav Petkov <bp@alien8.de> Tested-by: Chris Bainbridge <chris.bainbridge@gmail.com> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Signed-off-by: Vasant Hegde <vasant.hegde@amd.com> Link: https://lore.kernel.org/r/20240530084801.10758-1-vasant.hegde@amd.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jun 25, 2024
…PLES event" This reverts commit 7d1405c. This causes segfaults in some cases, as reported by Milian: ``` sudo /usr/bin/perf record -z --call-graph dwarf -e cycles -e raw_syscalls:sys_enter ls ... [ perf record: Woken up 3 times to write data ] malloc(): invalid next size (unsorted) Aborted ``` Backtrace with GDB + debuginfod: ``` malloc(): invalid next size (unsorted) Thread 1 "perf" received signal SIGABRT, Aborted. __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44 Downloading source file /usr/src/debug/glibc/glibc/nptl/pthread_kill.c 44 return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0; (gdb) bt #0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44 khadas#1 0x00007ffff6ea8eb3 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:78 khadas#2 0x00007ffff6e50a30 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/ raise.c:26 khadas#3 0x00007ffff6e384c3 in __GI_abort () at abort.c:79 khadas#4 0x00007ffff6e39354 in __libc_message_impl (fmt=fmt@entry=0x7ffff6fc22ea "%s\n") at ../sysdeps/posix/libc_fatal.c:132 khadas#5 0x00007ffff6eb3085 in malloc_printerr (str=str@entry=0x7ffff6fc5850 "malloc(): invalid next size (unsorted)") at malloc.c:5772 khadas#6 0x00007ffff6eb657c in _int_malloc (av=av@entry=0x7ffff6ff6ac0 <main_arena>, bytes=bytes@entry=368) at malloc.c:4081 khadas#7 0x00007ffff6eb877e in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at malloc.c:3754 khadas#8 0x000055555569bdb6 in perf_session.do_write_header () khadas#9 0x00005555555a373a in __cmd_record.constprop.0 () khadas#10 0x00005555555a6846 in cmd_record () khadas#11 0x000055555564db7f in run_builtin () khadas#12 0x000055555558ed77 in main () ``` Valgrind memcheck: ``` ==45136== Invalid write of size 8 ==45136== at 0x2B38A5: perf_event__synthesize_id_sample (in /usr/bin/perf) ==45136== by 0x157069: __cmd_record.constprop.0 (in /usr/bin/perf) ==45136== by 0x15A845: cmd_record (in /usr/bin/perf) ==45136== by 0x201B7E: run_builtin (in /usr/bin/perf) ==45136== by 0x142D76: main (in /usr/bin/perf) ==45136== Address 0x6a866a8 is 0 bytes after a block of size 40 alloc'd ==45136== at 0x4849BF3: calloc (vg_replace_malloc.c:1675) ==45136== by 0x3574AB: zalloc (in /usr/bin/perf) ==45136== by 0x1570E0: __cmd_record.constprop.0 (in /usr/bin/perf) ==45136== by 0x15A845: cmd_record (in /usr/bin/perf) ==45136== by 0x201B7E: run_builtin (in /usr/bin/perf) ==45136== by 0x142D76: main (in /usr/bin/perf) ==45136== ==45136== Syscall param write(buf) points to unaddressable byte(s) ==45136== at 0x575953D: __libc_write (write.c:26) ==45136== by 0x575953D: write (write.c:24) ==45136== by 0x35761F: ion (in /usr/bin/perf) ==45136== by 0x357778: writen (in /usr/bin/perf) ==45136== by 0x1548F7: record__write (in /usr/bin/perf) ==45136== by 0x15708A: __cmd_record.constprop.0 (in /usr/bin/perf) ==45136== by 0x15A845: cmd_record (in /usr/bin/perf) ==45136== by 0x201B7E: run_builtin (in /usr/bin/perf) ==45136== by 0x142D76: main (in /usr/bin/perf) ==45136== Address 0x6a866a8 is 0 bytes after a block of size 40 alloc'd ==45136== at 0x4849BF3: calloc (vg_replace_malloc.c:1675) ==45136== by 0x3574AB: zalloc (in /usr/bin/perf) ==45136== by 0x1570E0: __cmd_record.constprop.0 (in /usr/bin/perf) ==45136== by 0x15A845: cmd_record (in /usr/bin/perf) ==45136== by 0x201B7E: run_builtin (in /usr/bin/perf) ==45136== by 0x142D76: main (in /usr/bin/perf) ==45136== ----- Closes: https://lore.kernel.org/linux-perf-users/23879991.0LEYPuXRzz@milian-workstation/ Reported-by: Milian Wolff <milian.wolff@kdab.com> Tested-by: Milian Wolff <milian.wolff@kdab.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: stable@kernel.org # 6.8+ Link: https://lore.kernel.org/lkml/Zl9ksOlHJHnKM70p@x1 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jun 25, 2024
We have been seeing crashes on duplicate keys in btrfs_set_item_key_safe(): BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192) ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.c:2620! invalid opcode: 0000 [khadas#1] PREEMPT SMP PTI CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 khadas#6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014 RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs] With the following stack trace: #0 btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4) khadas#1 btrfs_drop_extents (fs/btrfs/file.c:411:4) khadas#2 log_one_extent (fs/btrfs/tree-log.c:4732:9) khadas#3 btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9) khadas#4 btrfs_log_inode (fs/btrfs/tree-log.c:6626:9) khadas#5 btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8) khadas#6 btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8) khadas#7 btrfs_sync_file (fs/btrfs/file.c:1933:8) khadas#8 vfs_fsync_range (fs/sync.c:188:9) khadas#9 vfs_fsync (fs/sync.c:202:9) khadas#10 do_fsync (fs/sync.c:212:9) khadas#11 __do_sys_fdatasync (fs/sync.c:225:9) khadas#12 __se_sys_fdatasync (fs/sync.c:223:1) khadas#13 __x64_sys_fdatasync (fs/sync.c:223:1) khadas#14 do_syscall_x64 (arch/x86/entry/common.c:52:14) khadas#15 do_syscall_64 (arch/x86/entry/common.c:83:7) khadas#16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121) So we're logging a changed extent from fsync, which is splitting an extent in the log tree. But this split part already exists in the tree, triggering the BUG(). This is the state of the log tree at the time of the crash, dumped with drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py) to get more details than btrfs_print_leaf() gives us: >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"]) leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610 leaf 33439744 flags 0x100000000000000 fs uuid e5bd3946-400c-4223-8923-190ef1f18677 chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 7 transid 9 size 8192 nbytes 8473563889606862198 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 204 flags 0x10(PREALLOC) atime 1716417703.220000000 (2024-05-22 15:41:43) ctime 1716417704.983333333 (2024-05-22 15:41:44) mtime 1716417704.983333333 (2024-05-22 15:41:44) otime 17592186044416.000000000 (559444-03-08 01:40:16) item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13 index 195 namelen 3 name: 193 item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37 location key (0 UNKNOWN.0 0) type XATTR transid 7 data_len 1 name_len 6 name: user.a data a item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53 generation 9 type 1 (regular) extent data disk byte 303144960 nr 12288 extent data offset 0 nr 4096 ram 12288 extent compression 0 (none) item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 4096 nr 8192 item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 8192 nr 4096 ... So the real problem happened earlier: notice that items 4 (4k-12k) and 5 (8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and item 5 starts at i_size. Here is the state of the filesystem tree at the time of the crash: >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0)) >>> print_extent_buffer(nodes[0]) leaf 30425088 level 0 items 184 generation 9 owner 5 leaf 30425088 flags 0x100000000000000 fs uuid e5bd3946-400c-4223-8923-190ef1f18677 chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da ... item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160 generation 7 transid 7 size 4096 nbytes 12288 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 6 flags 0x10(PREALLOC) atime 1716417703.220000000 (2024-05-22 15:41:43) ctime 1716417703.220000000 (2024-05-22 15:41:43) mtime 1716417703.220000000 (2024-05-22 15:41:43) otime 1716417703.220000000 (2024-05-22 15:41:43) item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13 index 195 namelen 3 name: 193 item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37 location key (0 UNKNOWN.0 0) type XATTR transid 7 data_len 1 name_len 6 name: user.a data a item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53 generation 9 type 1 (regular) extent data disk byte 303144960 nr 12288 extent data offset 0 nr 8192 ram 12288 extent compression 0 (none) item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 8192 nr 4096 Item 5 in the log tree corresponds to item 183 in the filesystem tree, but nothing matches item 4. Furthermore, item 183 is the last item in the leaf. btrfs_log_prealloc_extents() is responsible for logging prealloc extents beyond i_size. It first truncates any previously logged prealloc extents that start beyond i_size. Then, it walks the filesystem tree and copies the prealloc extent items to the log tree. If it hits the end of a leaf, then it calls btrfs_next_leaf(), which unlocks the tree and does another search. However, while the filesystem tree is unlocked, an ordered extent completion may modify the tree. In particular, it may insert an extent item that overlaps with an extent item that was already copied to the log tree. This may manifest in several ways depending on the exact scenario, including an EEXIST error that is silently translated to a full sync, overlapping items in the log tree, or this crash. This particular crash is triggered by the following sequence of events: - Initially, the file has i_size=4k, a regular extent from 0-4k, and a prealloc extent beyond i_size from 4k-12k. The prealloc extent item is the last item in its B-tree leaf. - The file is fsync'd, which copies its inode item and both extent items to the log tree. - An xattr is set on the file, which sets the BTRFS_INODE_COPY_EVERYTHING flag. - The range 4k-8k in the file is written using direct I/O. i_size is extended to 8k, but the ordered extent is still in flight. - The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this calls copy_inode_items_to_log(), which calls btrfs_log_prealloc_extents(). - btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the filesystem tree. Since it starts before i_size, it skips it. Since it is the last item in its B-tree leaf, it calls btrfs_next_leaf(). - btrfs_next_leaf() unlocks the path. - The ordered extent completion runs, which converts the 4k-8k part of the prealloc extent to written and inserts the remaining prealloc part from 8k-12k. - btrfs_next_leaf() does a search and finds the new prealloc extent 8k-12k. - btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into the log tree. Note that it overlaps with the 4k-12k prealloc extent that was copied to the log tree by the first fsync. - fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k extent that was written. - This tries to drop the range 4k-8k in the log tree, which requires adjusting the start of the 4k-12k prealloc extent in the log tree to 8k. - btrfs_set_item_key_safe() sees that there is already an extent starting at 8k in the log tree and calls BUG(). Fix this by detecting when we're about to insert an overlapping file extent item in the log tree and truncating the part that would overlap. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jul 10, 2024
Synchronize the dev->driver usage in really_probe() and dev_uevent(). These can run in different threads, what can result in the following race condition for dev->driver uninitialization: Thread khadas#1: ========== really_probe() { ... probe_failed: ... device_unbind_cleanup(dev) { ... dev->driver = NULL; // <= Failed probe sets dev->driver to NULL ... } ... } Thread khadas#2: ========== dev_uevent() { ... if (dev->driver) // If dev->driver is NULLed from really_probe() from here on, // after above check, the system crashes add_uevent_var(env, "DRIVER=%s", dev->driver->name); ... } really_probe() holds the lock, already. So nothing needs to be done there. dev_uevent() is called with lock held, often, too. But not always. What implies that we can't add any locking in dev_uevent() itself. So fix this race by adding the lock to the non-protected path. This is the path where above race is observed: dev_uevent+0x235/0x380 uevent_show+0x10c/0x1f0 <= Add lock here dev_attr_show+0x3a/0xa0 sysfs_kf_seq_show+0x17c/0x250 kernfs_seq_show+0x7c/0x90 seq_read_iter+0x2d7/0x940 kernfs_fop_read_iter+0xc6/0x310 vfs_read+0x5bc/0x6b0 ksys_read+0xeb/0x1b0 __x64_sys_read+0x42/0x50 x64_sys_call+0x27ad/0x2d30 do_syscall_64+0xcd/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Similar cases are reported by syzkaller in https://syzkaller.appspot.com/bug?extid=ffa8143439596313a85a But these are regarding the *initialization* of dev->driver dev->driver = drv; As this switches dev->driver to non-NULL these reports can be considered to be false-positives (which should be "fixed" by this commit, as well, though). The same issue was reported and tried to be fixed back in 2015 in https://lore.kernel.org/lkml/1421259054-2574-1-git-send-email-a.sangwan@samsung.com/ already. Fixes: 239378f ("Driver core: add uevent vars for devices of a class") Cc: stable <stable@kernel.org> Cc: syzbot+ffa8143439596313a85a@syzkaller.appspotmail.com Cc: Ashish Sangwan <a.sangwan@samsung.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Dirk Behme <dirk.behme@de.bosch.com> Link: https://lore.kernel.org/r/20240513050634.3964461-1-dirk.behme@de.bosch.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jul 10, 2024
====================================================== WARNING: possible circular locking dependency detected 6.10.0-rc2-ktest-00018-gebd1d148b278 #144 Not tainted ------------------------------------------------------ fio/1345 is trying to acquire lock: ffff88813e200ab8 (&c->snapshot_create_lock){++++}-{3:3}, at: bch2_truncate+0x76/0xf0 but task is already holding lock: ffff888105a1fa38 (&sb->s_type->i_mutex_key#13){+.+.}-{3:3}, at: do_truncate+0x7b/0xc0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> khadas#2 (&sb->s_type->i_mutex_key#13){+.+.}-{3:3}: down_write+0x3d/0xd0 bch2_write_iter+0x1c0/0x10f0 vfs_write+0x24a/0x560 __x64_sys_pwrite64+0x77/0xb0 x64_sys_call+0x17e5/0x1ab0 do_syscall_64+0x68/0x130 entry_SYSCALL_64_after_hwframe+0x4b/0x53 -> khadas#1 (sb_writers#10){.+.+}-{0:0}: mnt_want_write+0x4a/0x1d0 filename_create+0x69/0x1a0 user_path_create+0x38/0x50 bch2_fs_file_ioctl+0x315/0xbf0 __x64_sys_ioctl+0x297/0xaf0 x64_sys_call+0x10cb/0x1ab0 do_syscall_64+0x68/0x130 entry_SYSCALL_64_after_hwframe+0x4b/0x53 -> #0 (&c->snapshot_create_lock){++++}-{3:3}: __lock_acquire+0x1445/0x25b0 lock_acquire+0xbd/0x2b0 down_read+0x40/0x180 bch2_truncate+0x76/0xf0 bchfs_truncate+0x240/0x3f0 bch2_setattr+0x7b/0xb0 notify_change+0x322/0x4b0 do_truncate+0x8b/0xc0 do_ftruncate+0x110/0x270 __x64_sys_ftruncate+0x43/0x80 x64_sys_call+0x1373/0x1ab0 do_syscall_64+0x68/0x130 entry_SYSCALL_64_after_hwframe+0x4b/0x53 other info that might help us debug this: Chain exists of: &c->snapshot_create_lock --> sb_writers#10 --> &sb->s_type->i_mutex_key#13 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&sb->s_type->i_mutex_key#13); lock(sb_writers#10); lock(&sb->s_type->i_mutex_key#13); rlock(&c->snapshot_create_lock); *** DEADLOCK *** Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jul 10, 2024
…git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: Patch khadas#1 fixes insufficient sanitization of netlink attributes for the inner expression which can trigger nul-pointer dereference, from Davide Ornaghi. Patch khadas#2 address a report that there is a race condition between namespace cleanup and the garbage collection of the list:set type. This patch resolves this issue with other minor issues as well, from Jozsef Kadlecsik. Patch khadas#3 ip6_route_me_harder() ignores flowlabel/dsfield when ip dscp has been mangled, this unbreaks ip6 dscp set $v, from Florian Westphal. All of these patches address issues that are present in several releases. * tag 'nf-24-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: Use flowlabel flow key when re-routing mangled packets netfilter: ipset: Fix race between namespace cleanup and gc in the list:set type netfilter: nft_inner: validate mandatory meta and payload ==================== Link: https://lore.kernel.org/r/20240611220323.413713-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jul 10, 2024
Nikolay Aleksandrov says: ==================== net: bridge: mst: fix suspicious rcu usage warning This set fixes a suspicious RCU usage warning triggered by syzbot[1] in the bridge's MST code. After I converted br_mst_set_state to RCU, I forgot to update the vlan group dereference helper. Fix it by using the proper helper, in order to do that we need to pass the vlan group which is already obtained correctly by the callers for their respective context. Patch 01 is a requirement for the fix in patch 02. Note I did consider rcu_dereference_rtnl() but the churn is much bigger and in every part of the bridge. We can do that as a cleanup in net-next. [1] https://syzkaller.appspot.com/bug?extid=9bbe2de1bc9d470eb5fe ============================= WARNING: suspicious RCU usage 6.10.0-rc2-syzkaller-00235-g8a92980606e3 #0 Not tainted ----------------------------- net/bridge/br_private.h:1599 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 4 locks held by syz-executor.1/5374: #0: ffff888022d50b18 (&mm->mmap_lock){++++}-{3:3}, at: mmap_read_lock include/linux/mmap_lock.h:144 [inline] #0: ffff888022d50b18 (&mm->mmap_lock){++++}-{3:3}, at: __mm_populate+0x1b0/0x460 mm/gup.c:2111 khadas#1: ffffc90000a18c00 ((&p->forward_delay_timer)){+.-.}-{0:0}, at: call_timer_fn+0xc0/0x650 kernel/time/timer.c:1789 khadas#2: ffff88805fb2ccb8 (&br->lock){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline] khadas#2: ffff88805fb2ccb8 (&br->lock){+.-.}-{2:2}, at: br_forward_delay_timer_expired+0x50/0x440 net/bridge/br_stp_timer.c:86 khadas#3: ffffffff8e333fa0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:329 [inline] khadas#3: ffffffff8e333fa0 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:781 [inline] khadas#3: ffffffff8e333fa0 (rcu_read_lock){....}-{1:2}, at: br_mst_set_state+0x171/0x7a0 net/bridge/br_mst.c:105 stack backtrace: CPU: 1 PID: 5374 Comm: syz-executor.1 Not tainted 6.10.0-rc2-syzkaller-00235-g8a92980606e3 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114 lockdep_rcu_suspicious+0x221/0x340 kernel/locking/lockdep.c:6712 nbp_vlan_group net/bridge/br_private.h:1599 [inline] br_mst_set_state+0x29e/0x7a0 net/bridge/br_mst.c:106 br_set_state+0x28a/0x7b0 net/bridge/br_stp.c:47 br_forward_delay_timer_expired+0x176/0x440 net/bridge/br_stp_timer.c:88 call_timer_fn+0x18e/0x650 kernel/time/timer.c:1792 expire_timers kernel/time/timer.c:1843 [inline] __run_timers kernel/time/timer.c:2417 [inline] __run_timer_base+0x66a/0x8e0 kernel/time/timer.c:2428 run_timer_base kernel/time/timer.c:2437 [inline] run_timer_softirq+0xb7/0x170 kernel/time/timer.c:2447 handle_softirqs+0x2c4/0x970 kernel/softirq.c:554 __do_softirq kernel/softirq.c:588 [inline] invoke_softirq kernel/softirq.c:428 [inline] __irq_exit_rcu+0xf4/0x1c0 kernel/softirq.c:637 irq_exit_rcu+0x9/0x30 kernel/softirq.c:649 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1043 [inline] sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1043 </IRQ> <TASK> ==================== Link: https://lore.kernel.org/r/20240609103654.914987-1-razor@blackwall.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jul 10, 2024
The syzbot fuzzer found that the interrupt-URB completion callback in the cdc-wdm driver was taking too long, and the driver's immediate resubmission of interrupt URBs with -EPROTO status combined with the dummy-hcd emulation to cause a CPU lockup: cdc_wdm 1-1:1.0: nonzero urb status received: -71 cdc_wdm 1-1:1.0: wdm_int_callback - 0 bytes watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [syz-executor782:6625] CPU#0 Utilization every 4s during lockup: khadas#1: 98% system, 0% softirq, 3% hardirq, 0% idle khadas#2: 98% system, 0% softirq, 3% hardirq, 0% idle khadas#3: 98% system, 0% softirq, 3% hardirq, 0% idle khadas#4: 98% system, 0% softirq, 3% hardirq, 0% idle khadas#5: 98% system, 1% softirq, 3% hardirq, 0% idle Modules linked in: irq event stamp: 73096 hardirqs last enabled at (73095): [<ffff80008037bc00>] console_emit_next_record kernel/printk/printk.c:2935 [inline] hardirqs last enabled at (73095): [<ffff80008037bc00>] console_flush_all+0x650/0xb74 kernel/printk/printk.c:2994 hardirqs last disabled at (73096): [<ffff80008af10b00>] __el1_irq arch/arm64/kernel/entry-common.c:533 [inline] hardirqs last disabled at (73096): [<ffff80008af10b00>] el1_interrupt+0x24/0x68 arch/arm64/kernel/entry-common.c:551 softirqs last enabled at (73048): [<ffff8000801ea530>] softirq_handle_end kernel/softirq.c:400 [inline] softirqs last enabled at (73048): [<ffff8000801ea530>] handle_softirqs+0xa60/0xc34 kernel/softirq.c:582 softirqs last disabled at (73043): [<ffff800080020de8>] __do_softirq+0x14/0x20 kernel/softirq.c:588 CPU: 0 PID: 6625 Comm: syz-executor782 Tainted: G W 6.10.0-rc2-syzkaller-g8867bbd4a056 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024 Testing showed that the problem did not occur if the two error messages -- the first two lines above -- were removed; apparently adding material to the kernel log takes a surprisingly large amount of time. In any case, the best approach for preventing these lockups and to avoid spamming the log with thousands of error messages per second is to ratelimit the two dev_err() calls. Therefore we replace them with dev_err_ratelimited(). Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Suggested-by: Greg KH <gregkh@linuxfoundation.org> Reported-and-tested-by: syzbot+5f996b83575ef4058638@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-usb/00000000000073d54b061a6a1c65@google.com/ Reported-and-tested-by: syzbot+1b2abad17596ad03dcff@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-usb/000000000000f45085061aa9b37e@google.com/ Fixes: 9908a32 ("USB: remove err() macro from usb class drivers") Link: https://lore.kernel.org/linux-usb/40dfa45b-5f21-4eef-a8c1-51a2f320e267@rowland.harvard.edu/ Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/29855215-52f5-4385-b058-91f42c2bee18@rowland.harvard.edu Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jul 10, 2024
Luis has been reporting an assert failure when freeing an inode cluster during inode inactivation for a while. The assert looks like: XFS: Assertion failed: bp->b_flags & XBF_DONE, file: fs/xfs/xfs_trans_buf.c, line: 241 ------------[ cut here ]------------ kernel BUG at fs/xfs/xfs_message.c:102! Oops: invalid opcode: 0000 [khadas#1] PREEMPT SMP KASAN NOPTI CPU: 4 PID: 73 Comm: kworker/4:1 Not tainted 6.10.0-rc1 khadas#4 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Workqueue: xfs-inodegc/loop5 xfs_inodegc_worker [xfs] RIP: 0010:assfail (fs/xfs/xfs_message.c:102) xfs RSP: 0018:ffff88810188f7f0 EFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff88816e748250 RCX: 1ffffffff844b0e7 RDX: 0000000000000004 RSI: ffff88810188f558 RDI: ffffffffc2431fa0 RBP: 1ffff11020311f01 R08: 0000000042431f9f R09: ffffed1020311e9b R10: ffff88810188f4df R11: ffffffffac725d70 R12: ffff88817a3f4000 R13: ffff88812182f000 R14: ffff88810188f998 R15: ffffffffc2423f80 FS: 0000000000000000(0000) GS:ffff8881c8400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055fe9d0f109c CR3: 000000014426c002 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> xfs_trans_read_buf_map (fs/xfs/xfs_trans_buf.c:241 (discriminator 1)) xfs xfs_imap_to_bp (fs/xfs/xfs_trans.h:210 fs/xfs/libxfs/xfs_inode_buf.c:138) xfs xfs_inode_item_precommit (fs/xfs/xfs_inode_item.c:145) xfs xfs_trans_run_precommits (fs/xfs/xfs_trans.c:931) xfs __xfs_trans_commit (fs/xfs/xfs_trans.c:966) xfs xfs_inactive_ifree (fs/xfs/xfs_inode.c:1811) xfs xfs_inactive (fs/xfs/xfs_inode.c:2013) xfs xfs_inodegc_worker (fs/xfs/xfs_icache.c:1841 fs/xfs/xfs_icache.c:1886) xfs process_one_work (kernel/workqueue.c:3231) worker_thread (kernel/workqueue.c:3306 (discriminator 2) kernel/workqueue.c:3393 (discriminator 2)) kthread (kernel/kthread.c:389) ret_from_fork (arch/x86/kernel/process.c:147) ret_from_fork_asm (arch/x86/entry/entry_64.S:257) </TASK> And occurs when the the inode precommit handlers is attempt to look up the inode cluster buffer to attach the inode for writeback. The trail of logic that I can reconstruct is as follows. 1. the inode is clean when inodegc runs, so it is not attached to a cluster buffer when precommit runs. 2. khadas#1 implies the inode cluster buffer may be clean and not pinned by dirty inodes when inodegc runs. 3. khadas#2 implies that the inode cluster buffer can be reclaimed by memory pressure at any time. 4. The assert failure implies that the cluster buffer was attached to the transaction, but not marked done. It had been accessed earlier in the transaction, but not marked done. 5. khadas#4 implies the cluster buffer has been invalidated (i.e. marked stale). 6. khadas#5 implies that the inode cluster buffer was instantiated uninitialised in the transaction in xfs_ifree_cluster(), which only instantiates the buffers to invalidate them and never marks them as done. Given factors 1-3, this issue is highly dependent on timing and environmental factors. Hence the issue can be very difficult to reproduce in some situations, but highly reliable in others. Luis has an environment where it can be reproduced easily by g/531 but, OTOH, I've reproduced it only once in ~2000 cycles of g/531. I think the fix is to have xfs_ifree_cluster() set the XBF_DONE flag on the cluster buffers, even though they may not be initialised. The reasons why I think this is safe are: 1. A buffer cache lookup hit on a XBF_STALE buffer will clear the XBF_DONE flag. Hence all future users of the buffer know they have to re-initialise the contents before use and mark it done themselves. 2. xfs_trans_binval() sets the XFS_BLI_STALE flag, which means the buffer remains locked until the journal commit completes and the buffer is unpinned. Hence once marked XBF_STALE/XFS_BLI_STALE by xfs_ifree_cluster(), the only context that can access the freed buffer is the currently running transaction. 3. khadas#2 implies that future buffer lookups in the currently running transaction will hit the transaction match code and not the buffer cache. Hence XBF_STALE and XFS_BLI_STALE will not be cleared unless the transaction initialises and logs the buffer with valid contents again. At which point, the buffer will be marked marked XBF_DONE again, so having XBF_DONE already set on the stale buffer is a moot point. 4. khadas#2 also implies that any concurrent access to that cluster buffer will block waiting on the buffer lock until the inode cluster has been fully freed and is no longer an active inode cluster buffer. 5. khadas#4 + khadas#1 means that any future user of the disk range of that buffer will always see the range of disk blocks covered by the cluster buffer as not done, and hence must initialise the contents themselves. 6. Setting XBF_DONE in xfs_ifree_cluster() then means the unlinked inode precommit code will see a XBF_DONE buffer from the transaction match as it expects. It can then attach the stale but newly dirtied inode to the stale but newly dirtied cluster buffer without unexpected failures. The stale buffer will then sail through the journal and do the right thing with the attached stale inode during unpin. Hence the fix is just one line of extra code. The explanation of why we have to set XBF_DONE in xfs_ifree_cluster, OTOH, is long and complex.... Fixes: 82842fe ("xfs: fix AGF vs inode cluster buffer deadlock") Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jul 10, 2024
It is possible to trigger a use-after-free by: * attaching an fentry probe to __sock_release() and the probe calling the bpf_get_socket_cookie() helper * running traceroute -I 1.1.1.1 on a freshly booted VM A KASAN enabled kernel will log something like below (decoded and stripped): ================================================================== BUG: KASAN: slab-use-after-free in __sock_gen_cookie (./arch/x86/include/asm/atomic64_64.h:15 ./include/linux/atomic/atomic-arch-fallback.h:2583 ./include/linux/atomic/atomic-instrumented.h:1611 net/core/sock_diag.c:29) Read of size 8 at addr ffff888007110dd8 by task traceroute/299 CPU: 2 PID: 299 Comm: traceroute Tainted: G E 6.10.0-rc2+ khadas#2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:117 (discriminator 1)) print_report (mm/kasan/report.c:378 mm/kasan/report.c:488) ? __sock_gen_cookie (./arch/x86/include/asm/atomic64_64.h:15 ./include/linux/atomic/atomic-arch-fallback.h:2583 ./include/linux/atomic/atomic-instrumented.h:1611 net/core/sock_diag.c:29) kasan_report (mm/kasan/report.c:603) ? __sock_gen_cookie (./arch/x86/include/asm/atomic64_64.h:15 ./include/linux/atomic/atomic-arch-fallback.h:2583 ./include/linux/atomic/atomic-instrumented.h:1611 net/core/sock_diag.c:29) kasan_check_range (mm/kasan/generic.c:183 mm/kasan/generic.c:189) __sock_gen_cookie (./arch/x86/include/asm/atomic64_64.h:15 ./include/linux/atomic/atomic-arch-fallback.h:2583 ./include/linux/atomic/atomic-instrumented.h:1611 net/core/sock_diag.c:29) bpf_get_socket_ptr_cookie (./arch/x86/include/asm/preempt.h:94 ./include/linux/sock_diag.h:42 net/core/filter.c:5094 net/core/filter.c:5092) bpf_prog_875642cf11f1d139___sock_release+0x6e/0x8e bpf_trampoline_6442506592+0x47/0xaf __sock_release (net/socket.c:652) __sock_create (net/socket.c:1601) ... Allocated by task 299 on cpu 2 at 78.328492s: kasan_save_stack (mm/kasan/common.c:48) kasan_save_track (mm/kasan/common.c:68) __kasan_slab_alloc (mm/kasan/common.c:312 mm/kasan/common.c:338) kmem_cache_alloc_noprof (mm/slub.c:3941 mm/slub.c:4000 mm/slub.c:4007) sk_prot_alloc (net/core/sock.c:2075) sk_alloc (net/core/sock.c:2134) inet_create (net/ipv4/af_inet.c:327 net/ipv4/af_inet.c:252) __sock_create (net/socket.c:1572) __sys_socket (net/socket.c:1660 net/socket.c:1644 net/socket.c:1706) __x64_sys_socket (net/socket.c:1718) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Freed by task 299 on cpu 2 at 78.328502s: kasan_save_stack (mm/kasan/common.c:48) kasan_save_track (mm/kasan/common.c:68) kasan_save_free_info (mm/kasan/generic.c:582) poison_slab_object (mm/kasan/common.c:242) __kasan_slab_free (mm/kasan/common.c:256) kmem_cache_free (mm/slub.c:4437 mm/slub.c:4511) __sk_destruct (net/core/sock.c:2117 net/core/sock.c:2208) inet_create (net/ipv4/af_inet.c:397 net/ipv4/af_inet.c:252) __sock_create (net/socket.c:1572) __sys_socket (net/socket.c:1660 net/socket.c:1644 net/socket.c:1706) __x64_sys_socket (net/socket.c:1718) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Fix this by clearing the struct socket reference in sk_common_release() to cover all protocol families create functions, which may already attached the reference to the sk object with sock_init_data(). Fixes: c5dbb89 ("bpf: Expose bpf_get_socket_cookie to tracing programs") Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/netdev/20240613194047.36478-1-kuniyu@amazon.com/T/ Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: D. Wythe <alibuda@linux.alibaba.com> Link: https://lore.kernel.org/r/20240617210205.67311-1-ignat@cloudflare.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jul 10, 2024
…git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: Patch khadas#1 fixes the suspicious RCU usage warning that resulted from the recent fix for the race between namespace cleanup and gc in ipset left out checking the pernet exit phase when calling rcu_dereference_protected(), from Jozsef Kadlecsik. Patch khadas#2 fixes incorrect input and output netdevice in SRv6 prerouting hooks, from Jianguo Wu. Patch khadas#3 moves nf_hooks_lwtunnel sysctl toggle to the netfilter core. The connection tracking system is loaded on-demand, this ensures availability of this knob regardless. Patch khadas#4-khadas#5 adds selftests for SRv6 netfilter hooks also from Jianguo Wu. netfilter pull request 24-06-19 * tag 'nf-24-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: selftests: add selftest for the SRv6 End.DX6 behavior with netfilter selftests: add selftest for the SRv6 End.DX4 behavior with netfilter netfilter: move the sysctl nf_hooks_lwtunnel into the netfilter core seg6: fix parameter passing when calling NF_HOOK() in End.DX4 and End.DX6 behaviors netfilter: ipset: Fix suspicious rcu_dereference_protected() ==================== Link: https://lore.kernel.org/r/20240619170537.2846-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jul 10, 2024
…/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.10, take khadas#2 - Fix dangling references to a redistributor region if the vgic was prematurely destroyed. - Properly mark FFA buffers as released, ensuring that both parties can make forward progress.
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jul 10, 2024
into HEAD KVM/riscv fixes for 6.10, take khadas#2 - Fix compilation for KVM selftests
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jul 19, 2024
syzbot reported a lockdep violation involving bridge driver [1] Make sure netdev_rename_lock is softirq safe to fix this issue. [1] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected 6.10.0-rc2-syzkaller-00249-gbe27b8965297 #0 Not tainted ----------------------------------------------------- syz-executor.2/9449 [HC0[0]:SC0[2]:HE0:SE0] is trying to acquire: ffffffff8f5de668 (netdev_rename_lock.seqcount){+.+.}-{0:0}, at: rtnl_fill_ifinfo+0x38e/0x2270 net/core/rtnetlink.c:1839 and this task is already holding: ffff888060c64cb8 (&br->lock){+.-.}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline] ffff888060c64cb8 (&br->lock){+.-.}-{2:2}, at: br_port_slave_changelink+0x3d/0x150 net/bridge/br_netlink.c:1212 which would create a new lock dependency: (&br->lock){+.-.}-{2:2} -> (netdev_rename_lock.seqcount){+.+.}-{0:0} but this new dependency connects a SOFTIRQ-irq-safe lock: (&br->lock){+.-.}-{2:2} ... which became SOFTIRQ-irq-safe at: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754 __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline] _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154 spin_lock include/linux/spinlock.h:351 [inline] br_forward_delay_timer_expired+0x50/0x440 net/bridge/br_stp_timer.c:86 call_timer_fn+0x18e/0x650 kernel/time/timer.c:1792 expire_timers kernel/time/timer.c:1843 [inline] __run_timers kernel/time/timer.c:2417 [inline] __run_timer_base+0x66a/0x8e0 kernel/time/timer.c:2428 run_timer_base kernel/time/timer.c:2437 [inline] run_timer_softirq+0xb7/0x170 kernel/time/timer.c:2447 handle_softirqs+0x2c4/0x970 kernel/softirq.c:554 __do_softirq kernel/softirq.c:588 [inline] invoke_softirq kernel/softirq.c:428 [inline] __irq_exit_rcu+0xf4/0x1c0 kernel/softirq.c:637 irq_exit_rcu+0x9/0x30 kernel/softirq.c:649 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1043 [inline] sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1043 asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702 lock_acquire+0x264/0x550 kernel/locking/lockdep.c:5758 fs_reclaim_acquire+0xaf/0x140 mm/page_alloc.c:3800 might_alloc include/linux/sched/mm.h:334 [inline] slab_pre_alloc_hook mm/slub.c:3890 [inline] slab_alloc_node mm/slub.c:3980 [inline] kmalloc_trace_noprof+0x3d/0x2c0 mm/slub.c:4147 kmalloc_noprof include/linux/slab.h:660 [inline] kzalloc_noprof include/linux/slab.h:778 [inline] class_dir_create_and_add drivers/base/core.c:3255 [inline] get_device_parent+0x2a7/0x410 drivers/base/core.c:3315 device_add+0x325/0xbf0 drivers/base/core.c:3645 netdev_register_kobject+0x17e/0x320 net/core/net-sysfs.c:2136 register_netdevice+0x11d5/0x19e0 net/core/dev.c:10375 nsim_init_netdevsim drivers/net/netdevsim/netdev.c:690 [inline] nsim_create+0x647/0x890 drivers/net/netdevsim/netdev.c:750 __nsim_dev_port_add+0x6c0/0xae0 drivers/net/netdevsim/dev.c:1390 nsim_dev_port_add_all drivers/net/netdevsim/dev.c:1446 [inline] nsim_dev_reload_create drivers/net/netdevsim/dev.c:1498 [inline] nsim_dev_reload_up+0x69b/0x8e0 drivers/net/netdevsim/dev.c:985 devlink_reload+0x478/0x870 net/devlink/dev.c:474 devlink_nl_reload_doit+0xbd6/0xe50 net/devlink/dev.c:586 genl_family_rcv_msg_doit net/netlink/genetlink.c:1115 [inline] genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline] genl_rcv_msg+0xb14/0xec0 net/netlink/genetlink.c:1210 netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564 genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219 netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline] netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361 netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:745 ____sys_sendmsg+0x525/0x7d0 net/socket.c:2585 ___sys_sendmsg net/socket.c:2639 [inline] __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2668 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f to a SOFTIRQ-irq-unsafe lock: (netdev_rename_lock.seqcount){+.+.}-{0:0} ... which became SOFTIRQ-irq-unsafe at: ... lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754 do_write_seqcount_begin_nested include/linux/seqlock.h:469 [inline] do_write_seqcount_begin include/linux/seqlock.h:495 [inline] write_seqlock include/linux/seqlock.h:823 [inline] dev_change_name+0x184/0x920 net/core/dev.c:1229 do_setlink+0xa4b/0x41f0 net/core/rtnetlink.c:2880 __rtnl_newlink net/core/rtnetlink.c:3696 [inline] rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3743 rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635 netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564 netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline] netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361 netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:745 __sys_sendto+0x3a4/0x4f0 net/socket.c:2192 __do_sys_sendto net/socket.c:2204 [inline] __se_sys_sendto net/socket.c:2200 [inline] __x64_sys_sendto+0xde/0x100 net/socket.c:2200 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f other info that might help us debug this: Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(netdev_rename_lock.seqcount); local_irq_disable(); lock(&br->lock); lock(netdev_rename_lock.seqcount); <Interrupt> lock(&br->lock); *** DEADLOCK *** 3 locks held by syz-executor.2/9449: #0: ffffffff8f5e7448 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock net/core/rtnetlink.c:79 [inline] #0: ffffffff8f5e7448 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x842/0x1180 net/core/rtnetlink.c:6632 khadas#1: ffff888060c64cb8 (&br->lock){+.-.}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline] khadas#1: ffff888060c64cb8 (&br->lock){+.-.}-{2:2}, at: br_port_slave_changelink+0x3d/0x150 net/bridge/br_netlink.c:1212 khadas#2: ffffffff8e333fa0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:329 [inline] khadas#2: ffffffff8e333fa0 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:781 [inline] khadas#2: ffffffff8e333fa0 (rcu_read_lock){....}-{1:2}, at: team_change_rx_flags+0x29/0x330 drivers/net/team/team_core.c:1767 the dependencies between SOFTIRQ-irq-safe lock and the holding lock: -> (&br->lock){+.-.}-{2:2} { HARDIRQ-ON-W at: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754 __raw_spin_lock_bh include/linux/spinlock_api_smp.h:126 [inline] _raw_spin_lock_bh+0x35/0x50 kernel/locking/spinlock.c:178 spin_lock_bh include/linux/spinlock.h:356 [inline] br_add_if+0xb34/0xef0 net/bridge/br_if.c:682 do_set_master net/core/rtnetlink.c:2701 [inline] do_setlink+0xe70/0x41f0 net/core/rtnetlink.c:2907 __rtnl_newlink net/core/rtnetlink.c:3696 [inline] rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3743 rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635 netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564 netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline] netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361 netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:745 __sys_sendto+0x3a4/0x4f0 net/socket.c:2192 __do_sys_sendto net/socket.c:2204 [inline] __se_sys_sendto net/socket.c:2200 [inline] __x64_sys_sendto+0xde/0x100 net/socket.c:2200 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f IN-SOFTIRQ-W at: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754 __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline] _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154 spin_lock include/linux/spinlock.h:351 [inline] br_forward_delay_timer_expired+0x50/0x440 net/bridge/br_stp_timer.c:86 call_timer_fn+0x18e/0x650 kernel/time/timer.c:1792 expire_timers kernel/time/timer.c:1843 [inline] __run_timers kernel/time/timer.c:2417 [inline] __run_timer_base+0x66a/0x8e0 kernel/time/timer.c:2428 run_timer_base kernel/time/timer.c:2437 [inline] run_timer_softirq+0xb7/0x170 kernel/time/timer.c:2447 handle_softirqs+0x2c4/0x970 kernel/softirq.c:554 __do_softirq kernel/softirq.c:588 [inline] invoke_softirq kernel/softirq.c:428 [inline] __irq_exit_rcu+0xf4/0x1c0 kernel/softirq.c:637 irq_exit_rcu+0x9/0x30 kernel/softirq.c:649 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1043 [inline] sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1043 asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702 lock_acquire+0x264/0x550 kernel/locking/lockdep.c:5758 fs_reclaim_acquire+0xaf/0x140 mm/page_alloc.c:3800 might_alloc include/linux/sched/mm.h:334 [inline] slab_pre_alloc_hook mm/slub.c:3890 [inline] slab_alloc_node mm/slub.c:3980 [inline] kmalloc_trace_noprof+0x3d/0x2c0 mm/slub.c:4147 kmalloc_noprof include/linux/slab.h:660 [inline] kzalloc_noprof include/linux/slab.h:778 [inline] class_dir_create_and_add drivers/base/core.c:3255 [inline] get_device_parent+0x2a7/0x410 drivers/base/core.c:3315 device_add+0x325/0xbf0 drivers/base/core.c:3645 netdev_register_kobject+0x17e/0x320 net/core/net-sysfs.c:2136 register_netdevice+0x11d5/0x19e0 net/core/dev.c:10375 nsim_init_netdevsim drivers/net/netdevsim/netdev.c:690 [inline] nsim_create+0x647/0x890 drivers/net/netdevsim/netdev.c:750 __nsim_dev_port_add+0x6c0/0xae0 drivers/net/netdevsim/dev.c:1390 nsim_dev_port_add_all drivers/net/netdevsim/dev.c:1446 [inline] nsim_dev_reload_create drivers/net/netdevsim/dev.c:1498 [inline] nsim_dev_reload_up+0x69b/0x8e0 drivers/net/netdevsim/dev.c:985 devlink_reload+0x478/0x870 net/devlink/dev.c:474 devlink_nl_reload_doit+0xbd6/0xe50 net/devlink/dev.c:586 genl_family_rcv_msg_doit net/netlink/genetlink.c:1115 [inline] genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline] genl_rcv_msg+0xb14/0xec0 net/netlink/genetlink.c:1210 netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564 genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219 netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline] netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361 netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:745 ____sys_sendmsg+0x525/0x7d0 net/socket.c:2585 ___sys_sendmsg net/socket.c:2639 [inline] __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2668 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f INITIAL USE at: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754 __raw_spin_lock_bh include/linux/spinlock_api_smp.h:126 [inline] _raw_spin_lock_bh+0x35/0x50 kernel/locking/spinlock.c:178 spin_lock_bh include/linux/spinlock.h:356 [inline] br_add_if+0xb34/0xef0 net/bridge/br_if.c:682 do_set_master net/core/rtnetlink.c:2701 [inline] do_setlink+0xe70/0x41f0 net/core/rtnetlink.c:2907 __rtnl_newlink net/core/rtnetlink.c:3696 [inline] rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3743 rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635 netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564 netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline] netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361 netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:745 __sys_sendto+0x3a4/0x4f0 net/socket.c:2192 __do_sys_sendto net/socket.c:2204 [inline] __se_sys_sendto net/socket.c:2200 [inline] __x64_sys_sendto+0xde/0x100 net/socket.c:2200 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f } ... key at: [<ffffffff94b9a1a0>] br_dev_setup.__key+0x0/0x20 the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock: -> (netdev_rename_lock.seqcount){+.+.}-{0:0} { HARDIRQ-ON-W at: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754 do_write_seqcount_begin_nested include/linux/seqlock.h:469 [inline] do_write_seqcount_begin include/linux/seqlock.h:495 [inline] write_seqlock include/linux/seqlock.h:823 [inline] dev_change_name+0x184/0x920 net/core/dev.c:1229 do_setlink+0xa4b/0x41f0 net/core/rtnetlink.c:2880 __rtnl_newlink net/core/rtnetlink.c:3696 [inline] rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3743 rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635 netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564 netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline] netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361 netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:745 __sys_sendto+0x3a4/0x4f0 net/socket.c:2192 __do_sys_sendto net/socket.c:2204 [inline] __se_sys_sendto net/socket.c:2200 [inline] __x64_sys_sendto+0xde/0x100 net/socket.c:2200 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f SOFTIRQ-ON-W at: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754 do_write_seqcount_begin_nested include/linux/seqlock.h:469 [inline] do_write_seqcount_begin include/linux/seqlock.h:495 [inline] write_seqlock include/linux/seqlock.h:823 [inline] dev_change_name+0x184/0x920 net/core/dev.c:1229 do_setlink+0xa4b/0x41f0 net/core/rtnetlink.c:2880 __rtnl_newlink net/core/rtnetlink.c:3696 [inline] rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3743 rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635 netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564 netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline] netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361 netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:745 __sys_sendto+0x3a4/0x4f0 net/socket.c:2192 __do_sys_sendto net/socket.c:2204 [inline] __se_sys_sendto net/socket.c:2200 [inline] __x64_sys_sendto+0xde/0x100 net/socket.c:2200 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f INITIAL USE at: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754 do_write_seqcount_begin_nested include/linux/seqlock.h:469 [inline] do_write_seqcount_begin include/linux/seqlock.h:495 [inline] write_seqlock include/linux/seqlock.h:823 [inline] dev_change_name+0x184/0x920 net/core/dev.c:1229 do_setlink+0xa4b/0x41f0 net/core/rtnetlink.c:2880 __rtnl_newlink net/core/rtnetlink.c:3696 [inline] rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3743 rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635 netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564 netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline] netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361 netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:745 __sys_sendto+0x3a4/0x4f0 net/socket.c:2192 __do_sys_sendto net/socket.c:2204 [inline] __se_sys_sendto net/socket.c:2200 [inline] __x64_sys_sendto+0xde/0x100 net/socket.c:2200 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f INITIAL READ USE at: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754 seqcount_lockdep_reader_access include/linux/seqlock.h:72 [inline] read_seqbegin include/linux/seqlock.h:772 [inline] netdev_copy_name+0x168/0x2c0 net/core/dev.c:949 rtnl_fill_ifinfo+0x38e/0x2270 net/core/rtnetlink.c:1839 rtmsg_ifinfo_build_skb+0x18a/0x260 net/core/rtnetlink.c:4073 rtmsg_ifinfo_event net/core/rtnetlink.c:4107 [inline] rtmsg_ifinfo+0x91/0x1b0 net/core/rtnetlink.c:4116 register_netdevice+0x1665/0x19e0 net/core/dev.c:10422 register_netdev+0x3b/0x50 net/core/dev.c:10512 loopback_net_init+0x73/0x150 drivers/net/loopback.c:217 ops_init+0x359/0x610 net/core/net_namespace.c:139 __register_pernet_operations net/core/net_namespace.c:1247 [inline] register_pernet_operations+0x2cb/0x660 net/core/net_namespace.c:1320 register_pernet_device+0x33/0x80 net/core/net_namespace.c:1407 net_dev_init+0xfcd/0x10d0 net/core/dev.c:11956 do_one_initcall+0x248/0x880 init/main.c:1267 do_initcall_level+0x157/0x210 init/main.c:1329 do_initcalls+0x3f/0x80 init/main.c:1345 kernel_init_freeable+0x435/0x5d0 init/main.c:1578 kernel_init+0x1d/0x2b0 init/main.c:1467 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 } ... key at: [<ffffffff8f5de668>] netdev_rename_lock+0x8/0xa0 ... acquired at: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754 seqcount_lockdep_reader_access include/linux/seqlock.h:72 [inline] read_seqbegin include/linux/seqlock.h:772 [inline] netdev_copy_name+0x168/0x2c0 net/core/dev.c:949 rtnl_fill_ifinfo+0x38e/0x2270 net/core/rtnetlink.c:1839 rtmsg_ifinfo_build_skb+0x18a/0x260 net/core/rtnetlink.c:4073 rtmsg_ifinfo_event net/core/rtnetlink.c:4107 [inline] rtmsg_ifinfo+0x91/0x1b0 net/core/rtnetlink.c:4116 __dev_notify_flags+0xf7/0x400 net/core/dev.c:8816 __dev_set_promiscuity+0x152/0x5a0 net/core/dev.c:8588 dev_set_promiscuity+0x51/0xe0 net/core/dev.c:8608 team_change_rx_flags+0x203/0x330 drivers/net/team/team_core.c:1771 dev_change_rx_flags net/core/dev.c:8541 [inline] __dev_set_promiscuity+0x406/0x5a0 net/core/dev.c:8585 dev_set_promiscuity+0x51/0xe0 net/core/dev.c:8608 br_port_clear_promisc net/bridge/br_if.c:135 [inline] br_manage_promisc+0x505/0x590 net/bridge/br_if.c:172 nbp_update_port_count net/bridge/br_if.c:242 [inline] br_port_flags_change+0x161/0x1f0 net/bridge/br_if.c:761 br_setport+0xcb5/0x16d0 net/bridge/br_netlink.c:1000 br_port_slave_changelink+0x135/0x150 net/bridge/br_netlink.c:1213 __rtnl_newlink net/core/rtnetlink.c:3689 [inline] rtnl_newlink+0x169f/0x20a0 net/core/rtnetlink.c:3743 rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635 netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564 netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline] netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361 netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:745 ____sys_sendmsg+0x525/0x7d0 net/socket.c:2585 ___sys_sendmsg net/socket.c:2639 [inline] __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2668 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f stack backtrace: CPU: 0 PID: 9449 Comm: syz-executor.2 Not tainted 6.10.0-rc2-syzkaller-00249-gbe27b8965297 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/07/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114 print_bad_irq_dependency kernel/locking/lockdep.c:2626 [inline] check_irq_usage kernel/locking/lockdep.c:2865 [inline] check_prev_add kernel/locking/lockdep.c:3138 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain+0x4de0/0x5900 kernel/locking/lockdep.c:3869 __lock_acquire+0x1346/0x1fd0 kernel/locking/lockdep.c:5137 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754 seqcount_lockdep_reader_access include/linux/seqlock.h:72 [inline] read_seqbegin include/linux/seqlock.h:772 [inline] netdev_copy_name+0x168/0x2c0 net/core/dev.c:949 rtnl_fill_ifinfo+0x38e/0x2270 net/core/rtnetlink.c:1839 rtmsg_ifinfo_build_skb+0x18a/0x260 net/core/rtnetlink.c:4073 rtmsg_ifinfo_event net/core/rtnetlink.c:4107 [inline] rtmsg_ifinfo+0x91/0x1b0 net/core/rtnetlink.c:4116 __dev_notify_flags+0xf7/0x400 net/core/dev.c:8816 __dev_set_promiscuity+0x152/0x5a0 net/core/dev.c:8588 dev_set_promiscuity+0x51/0xe0 net/core/dev.c:8608 team_change_rx_flags+0x203/0x330 drivers/net/team/team_core.c:1771 dev_change_rx_flags net/core/dev.c:8541 [inline] __dev_set_promiscuity+0x406/0x5a0 net/core/dev.c:8585 dev_set_promiscuity+0x51/0xe0 net/core/dev.c:8608 br_port_clear_promisc net/bridge/br_if.c:135 [inline] br_manage_promisc+0x505/0x590 net/bridge/br_if.c:172 nbp_update_port_count net/bridge/br_if.c:242 [inline] br_port_flags_change+0x161/0x1f0 net/bridge/br_if.c:761 br_setport+0xcb5/0x16d0 net/bridge/br_netlink.c:1000 br_port_slave_changelink+0x135/0x150 net/bridge/br_netlink.c:1213 __rtnl_newlink net/core/rtnetlink.c:3689 [inline] rtnl_newlink+0x169f/0x20a0 net/core/rtnetlink.c:3743 rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635 netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564 netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline] netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361 netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:745 ____sys_sendmsg+0x525/0x7d0 net/socket.c:2585 ___sys_sendmsg net/socket.c:2639 [inline] __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2668 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f3b3047cf29 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f3b311740c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f3b305b4050 RCX: 00007f3b3047cf29 RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000008 RBP: 00007f3b304ec074 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000006e R14: 00007f3b305b4050 R15: 00007ffca2f3dc68 </TASK> Fixes: 0840556 ("net: Protect dev->name by seqlock.") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jul 19, 2024
…play During inode logging (and log replay too), we are holding a transaction handle and we often need to call btrfs_iget(), which will read an inode from its subvolume btree if it's not loaded in memory and that results in allocating an inode with GFP_KERNEL semantics at the btrfs_alloc_inode() callback - and this may recurse into the filesystem in case we are under memory pressure and attempt to commit the current transaction, resulting in a deadlock since the logging (or log replay) task is holding a transaction handle open. Syzbot reported this with the following stack traces: WARNING: possible circular locking dependency detected 6.10.0-rc2-syzkaller-00361-g061d1af7b030 #0 Not tainted ------------------------------------------------------ syz-executor.1/9919 is trying to acquire lock: ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:334 [inline] ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:3891 [inline] ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:3981 [inline] ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020 but task is already holding lock: ffff88804b569358 (&ei->log_mutex){+.+.}-{3:3}, at: btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> khadas#3 (&ei->log_mutex){+.+.}-{3:3}: __mutex_lock_common kernel/locking/mutex.c:608 [inline] __mutex_lock+0x175/0x9c0 kernel/locking/mutex.c:752 btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481 btrfs_log_inode_parent+0x8cb/0x2a90 fs/btrfs/tree-log.c:7079 btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180 btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959 vfs_fsync_range+0x141/0x230 fs/sync.c:188 generic_write_sync include/linux/fs.h:2794 [inline] btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705 new_sync_write fs/read_write.c:497 [inline] vfs_write+0x6b6/0x1140 fs/read_write.c:590 ksys_write+0x12f/0x260 fs/read_write.c:643 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> khadas#2 (btrfs_trans_num_extwriters){++++}-{0:0}: join_transaction+0x164/0xf40 fs/btrfs/transaction.c:315 start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700 btrfs_commit_super+0xa1/0x110 fs/btrfs/disk-io.c:4170 close_ctree+0xcb0/0xf90 fs/btrfs/disk-io.c:4324 generic_shutdown_super+0x159/0x3d0 fs/super.c:642 kill_anon_super+0x3a/0x60 fs/super.c:1226 btrfs_kill_super+0x3b/0x50 fs/btrfs/super.c:2096 deactivate_locked_super+0xbe/0x1a0 fs/super.c:473 deactivate_super+0xde/0x100 fs/super.c:506 cleanup_mnt+0x222/0x450 fs/namespace.c:1267 task_work_run+0x14e/0x250 kernel/task_work.c:180 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] exit_to_user_mode_loop kernel/entry/common.c:114 [inline] exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] syscall_exit_to_user_mode+0x278/0x2a0 kernel/entry/common.c:218 __do_fast_syscall_32+0x80/0x120 arch/x86/entry/common.c:389 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> khadas#1 (btrfs_trans_num_writers){++++}-{0:0}: __lock_release kernel/locking/lockdep.c:5468 [inline] lock_release+0x33e/0x6c0 kernel/locking/lockdep.c:5774 percpu_up_read include/linux/percpu-rwsem.h:99 [inline] __sb_end_write include/linux/fs.h:1650 [inline] sb_end_intwrite include/linux/fs.h:1767 [inline] __btrfs_end_transaction+0x5ca/0x920 fs/btrfs/transaction.c:1071 btrfs_commit_inode_delayed_inode+0x228/0x330 fs/btrfs/delayed-inode.c:1301 btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291 evict+0x2ed/0x6c0 fs/inode.c:667 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 dentry_unlink_inode+0x295/0x480 fs/dcache.c:400 __dentry_kill+0x1d0/0x600 fs/dcache.c:603 dput.part.0+0x4b1/0x9b0 fs/dcache.c:845 dput+0x1f/0x30 fs/dcache.c:835 ovl_stack_put+0x60/0x90 fs/overlayfs/util.c:132 ovl_destroy_inode+0xc6/0x190 fs/overlayfs/super.c:182 destroy_inode+0xc4/0x1b0 fs/inode.c:311 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 dentry_unlink_inode+0x295/0x480 fs/dcache.c:400 __dentry_kill+0x1d0/0x600 fs/dcache.c:603 shrink_kill fs/dcache.c:1048 [inline] shrink_dentry_list+0x140/0x5d0 fs/dcache.c:1075 prune_dcache_sb+0xeb/0x150 fs/dcache.c:1156 super_cache_scan+0x32a/0x550 fs/super.c:221 do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435 shrink_slab_memcg mm/shrinker.c:548 [inline] shrink_slab+0xa87/0x1310 mm/shrinker.c:626 shrink_one+0x493/0x7c0 mm/vmscan.c:4790 shrink_many mm/vmscan.c:4851 [inline] lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951 shrink_node mm/vmscan.c:5910 [inline] kswapd_shrink_node mm/vmscan.c:6720 [inline] balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911 kswapd+0x5ea/0xbf0 mm/vmscan.c:7180 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 -> #0 (fs_reclaim){+.+.}-{0:0}: check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 __fs_reclaim_acquire mm/page_alloc.c:3801 [inline] fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3815 might_alloc include/linux/sched/mm.h:334 [inline] slab_pre_alloc_hook mm/slub.c:3891 [inline] slab_alloc_node mm/slub.c:3981 [inline] kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020 btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411 alloc_inode+0x5d/0x230 fs/inode.c:261 iget5_locked fs/inode.c:1235 [inline] iget5_locked+0x1c9/0x2c0 fs/inode.c:1228 btrfs_iget_locked fs/btrfs/inode.c:5590 [inline] btrfs_iget_path fs/btrfs/inode.c:5607 [inline] btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636 add_conflicting_inode fs/btrfs/tree-log.c:5657 [inline] copy_inode_items_to_log+0x1039/0x1e30 fs/btrfs/tree-log.c:5928 btrfs_log_inode+0xa48/0x4660 fs/btrfs/tree-log.c:6592 log_new_delayed_dentries fs/btrfs/tree-log.c:6363 [inline] btrfs_log_inode+0x27dd/0x4660 fs/btrfs/tree-log.c:6718 btrfs_log_all_parents fs/btrfs/tree-log.c:6833 [inline] btrfs_log_inode_parent+0x22ba/0x2a90 fs/btrfs/tree-log.c:7141 btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180 btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959 vfs_fsync_range+0x141/0x230 fs/sync.c:188 generic_write_sync include/linux/fs.h:2794 [inline] btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705 do_iter_readv_writev+0x504/0x780 fs/read_write.c:741 vfs_writev+0x36f/0xde0 fs/read_write.c:971 do_pwritev+0x1b2/0x260 fs/read_write.c:1072 __do_compat_sys_pwritev2 fs/read_write.c:1218 [inline] __se_compat_sys_pwritev2 fs/read_write.c:1210 [inline] __ia32_compat_sys_pwritev2+0x121/0x1b0 fs/read_write.c:1210 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e other info that might help us debug this: Chain exists of: fs_reclaim --> btrfs_trans_num_extwriters --> &ei->log_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ei->log_mutex); lock(btrfs_trans_num_extwriters); lock(&ei->log_mutex); lock(fs_reclaim); *** DEADLOCK *** 7 locks held by syz-executor.1/9919: #0: ffff88802be20420 (sb_writers#23){.+.+}-{0:0}, at: do_pwritev+0x1b2/0x260 fs/read_write.c:1072 khadas#1: ffff888065c0f8f0 (&sb->s_type->i_mutex_key#33){++++}-{3:3}, at: inode_lock include/linux/fs.h:791 [inline] khadas#1: ffff888065c0f8f0 (&sb->s_type->i_mutex_key#33){++++}-{3:3}, at: btrfs_inode_lock+0xc8/0x110 fs/btrfs/inode.c:385 khadas#2: ffff888065c0f778 (&ei->i_mmap_lock){++++}-{3:3}, at: btrfs_inode_lock+0xee/0x110 fs/btrfs/inode.c:388 khadas#3: ffff88802be20610 (sb_internal#4){.+.+}-{0:0}, at: btrfs_sync_file+0x95b/0xe10 fs/btrfs/file.c:1952 khadas#4: ffff8880546323f0 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x430/0xf40 fs/btrfs/transaction.c:290 khadas#5: ffff888054632418 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x430/0xf40 fs/btrfs/transaction.c:290 khadas#6: ffff88804b569358 (&ei->log_mutex){+.+.}-{3:3}, at: btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481 stack backtrace: CPU: 2 PID: 9919 Comm: syz-executor.1 Not tainted 6.10.0-rc2-syzkaller-00361-g061d1af7b030 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114 check_noncircular+0x31a/0x400 kernel/locking/lockdep.c:2187 check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 __fs_reclaim_acquire mm/page_alloc.c:3801 [inline] fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3815 might_alloc include/linux/sched/mm.h:334 [inline] slab_pre_alloc_hook mm/slub.c:3891 [inline] slab_alloc_node mm/slub.c:3981 [inline] kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020 btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411 alloc_inode+0x5d/0x230 fs/inode.c:261 iget5_locked fs/inode.c:1235 [inline] iget5_locked+0x1c9/0x2c0 fs/inode.c:1228 btrfs_iget_locked fs/btrfs/inode.c:5590 [inline] btrfs_iget_path fs/btrfs/inode.c:5607 [inline] btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636 add_conflicting_inode fs/btrfs/tree-log.c:5657 [inline] copy_inode_items_to_log+0x1039/0x1e30 fs/btrfs/tree-log.c:5928 btrfs_log_inode+0xa48/0x4660 fs/btrfs/tree-log.c:6592 log_new_delayed_dentries fs/btrfs/tree-log.c:6363 [inline] btrfs_log_inode+0x27dd/0x4660 fs/btrfs/tree-log.c:6718 btrfs_log_all_parents fs/btrfs/tree-log.c:6833 [inline] btrfs_log_inode_parent+0x22ba/0x2a90 fs/btrfs/tree-log.c:7141 btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180 btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959 vfs_fsync_range+0x141/0x230 fs/sync.c:188 generic_write_sync include/linux/fs.h:2794 [inline] btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705 do_iter_readv_writev+0x504/0x780 fs/read_write.c:741 vfs_writev+0x36f/0xde0 fs/read_write.c:971 do_pwritev+0x1b2/0x260 fs/read_write.c:1072 __do_compat_sys_pwritev2 fs/read_write.c:1218 [inline] __se_compat_sys_pwritev2 fs/read_write.c:1210 [inline] __ia32_compat_sys_pwritev2+0x121/0x1b0 fs/read_write.c:1210 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e RIP: 0023:0xf7334579 Code: b8 01 10 06 03 (...) RSP: 002b:00000000f5f265ac EFLAGS: 00000292 ORIG_RAX: 000000000000017b RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00000000200002c0 RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000292 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Fix this by ensuring we are under a NOFS scope whenever we call btrfs_iget() during inode logging and log replay. Reported-by: syzbot+8576cfa84070dce4d59b@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000274a3a061abbd928@google.com/ Fixes: 712e36c ("btrfs: use GFP_KERNEL in btrfs_alloc_inode") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jul 19, 2024
The code in ocfs2_dio_end_io_write() estimates number of necessary transaction credits using ocfs2_calc_extend_credits(). This however does not take into account that the IO could be arbitrarily large and can contain arbitrary number of extents. Extent tree manipulations do often extend the current transaction but not in all of the cases. For example if we have only single block extents in the tree, ocfs2_mark_extent_written() will end up calling ocfs2_replace_extent_rec() all the time and we will never extend the current transaction and eventually exhaust all the transaction credits if the IO contains many single block extents. Once that happens a WARN_ON(jbd2_handle_buffer_credits(handle) <= 0) is triggered in jbd2_journal_dirty_metadata() and subsequently OCFS2 aborts in response to this error. This was actually triggered by one of our customers on a heavily fragmented OCFS2 filesystem. To fix the issue make sure the transaction always has enough credits for one extent insert before each call of ocfs2_mark_extent_written(). Heming Zhao said: ------ PANIC: "Kernel panic - not syncing: OCFS2: (device dm-1): panic forced after error" PID: xxx TASK: xxxx CPU: 5 COMMAND: "SubmitThread-CA" #0 machine_kexec at ffffffff8c069932 khadas#1 __crash_kexec at ffffffff8c1338fa khadas#2 panic at ffffffff8c1d69b9 khadas#3 ocfs2_handle_error at ffffffffc0c86c0c [ocfs2] khadas#4 __ocfs2_abort at ffffffffc0c88387 [ocfs2] khadas#5 ocfs2_journal_dirty at ffffffffc0c51e98 [ocfs2] khadas#6 ocfs2_split_extent at ffffffffc0c27ea3 [ocfs2] khadas#7 ocfs2_change_extent_flag at ffffffffc0c28053 [ocfs2] khadas#8 ocfs2_mark_extent_written at ffffffffc0c28347 [ocfs2] khadas#9 ocfs2_dio_end_io_write at ffffffffc0c2bef9 [ocfs2] khadas#10 ocfs2_dio_end_io at ffffffffc0c2c0f5 [ocfs2] khadas#11 dio_complete at ffffffff8c2b9fa7 khadas#12 do_blockdev_direct_IO at ffffffff8c2bc09f khadas#13 ocfs2_direct_IO at ffffffffc0c2b653 [ocfs2] khadas#14 generic_file_direct_write at ffffffff8c1dcf14 khadas#15 __generic_file_write_iter at ffffffff8c1dd07b khadas#16 ocfs2_file_write_iter at ffffffffc0c49f1f [ocfs2] khadas#17 aio_write at ffffffff8c2cc72e khadas#18 kmem_cache_alloc at ffffffff8c248dde khadas#19 do_io_submit at ffffffff8c2ccada khadas#20 do_syscall_64 at ffffffff8c004984 khadas#21 entry_SYSCALL_64_after_hwframe at ffffffff8c8000ba Link: https://lkml.kernel.org/r/20240617095543.6971-1-jack@suse.cz Link: https://lkml.kernel.org/r/20240614145243.8837-1-jack@suse.cz Fixes: c15471f ("ocfs2: fix sparse file & data ordering issue in direct io") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jul 19, 2024
…git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains two Netfilter fixes for net: Patch khadas#1 fixes CONFIG_SYSCTL=n for a patch coming in the previous PR to move the sysctl toggle to enable SRv6 netfilter hooks from nf_conntrack to the core, from Jianguo Wu. Patch khadas#2 fixes a possible pointer leak to userspace due to insufficient validation of NFT_DATA_VALUE. Linus found this pointer leak to userspace via zdi-disclosures@ and forwarded the notice to Netfilter maintainers, he appears as reporter because whoever found this issue never approached Netfilter maintainers neither via security@ nor in private. netfilter pull request 24-06-27 * tag 'nf-24-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: fully validate NFT_DATA_VALUE on store to data registers netfilter: fix undefined reference to 'netfilter_lwtunnel_*' when CONFIG_SYSCTL=n ==================== Link: https://patch.msgid.link/20240626233845.151197-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jul 19, 2024
This fixes the following deadlock introduced by 39a92a55be13 ("bluetooth/l2cap: sync sock recv cb and release") ============================================ WARNING: possible recursive locking detected 6.10.0-rc3-g4029dba6b6f1 #6823 Not tainted -------------------------------------------- kworker/u5:0/35 is trying to acquire lock: ffff888002ec2510 (&chan->lock#2/1){+.+.}-{3:3}, at: l2cap_sock_recv_cb+0x44/0x1e0 but task is already holding lock: ffff888002ec2510 (&chan->lock#2/1){+.+.}-{3:3}, at: l2cap_get_chan_by_scid+0xaf/0xd0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&chan->lock#2/1); lock(&chan->lock#2/1); *** DEADLOCK *** May be due to missing lock nesting notation 3 locks held by kworker/u5:0/35: #0: ffff888002b8a940 ((wq_completion)hci0#2){+.+.}-{0:0}, at: process_one_work+0x750/0x930 khadas#1: ffff888002c67dd0 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0}, at: process_one_work+0x44e/0x930 khadas#2: ffff888002ec2510 (&chan->lock#2/1){+.+.}-{3:3}, at: l2cap_get_chan_by_scid+0xaf/0xd0 To fix the original problem this introduces l2cap_chan_lock at l2cap_conless_channel to ensure that l2cap_sock_recv_cb is called with chan->lock held. Fixes: 89e856e ("bluetooth/l2cap: sync sock recv cb and release") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jul 19, 2024
Bos can be put with multiple unrelated dma-resv locks held. But imported bos attempt to grab the bo dma-resv during dma-buf detach that typically happens during cleanup. That leads to lockde splats similar to the below and a potential ABBA deadlock. Fix this by always taking the delayed workqueue cleanup path for imported bos. Requesting stable fixes from when the Xe driver was introduced, since its usage of drm_exec and wide vm dma_resvs appear to be the first reliable trigger of this. [22982.116427] ============================================ [22982.116428] WARNING: possible recursive locking detected [22982.116429] 6.10.0-rc2+ khadas#10 Tainted: G U W [22982.116430] -------------------------------------------- [22982.116430] glxgears:sh0/5785 is trying to acquire lock: [22982.116431] ffff8c2bafa539a8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: dma_buf_detach+0x3b/0xf0 [22982.116438] but task is already holding lock: [22982.116438] ffff8c2d9aba6da8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: drm_exec_lock_obj+0x49/0x2b0 [drm_exec] [22982.116442] other info that might help us debug this: [22982.116442] Possible unsafe locking scenario: [22982.116443] CPU0 [22982.116444] ---- [22982.116444] lock(reservation_ww_class_mutex); [22982.116445] lock(reservation_ww_class_mutex); [22982.116447] *** DEADLOCK *** [22982.116447] May be due to missing lock nesting notation [22982.116448] 5 locks held by glxgears:sh0/5785: [22982.116449] #0: ffff8c2d9aba58c8 (&xef->vm.lock){+.+.}-{3:3}, at: xe_file_close+0xde/0x1c0 [xe] [22982.116507] khadas#1: ffff8c2e28cc8480 (&vm->lock){++++}-{3:3}, at: xe_vm_close_and_put+0x161/0x9b0 [xe] [22982.116578] khadas#2: ffff8c2e31982970 (&val->lock){.+.+}-{3:3}, at: xe_validation_ctx_init+0x6d/0x70 [xe] [22982.116647] khadas#3: ffffacdc469478a8 (reservation_ww_class_acquire){+.+.}-{0:0}, at: xe_vma_destroy_unlocked+0x7f/0xe0 [xe] [22982.116716] khadas#4: ffff8c2d9aba6da8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: drm_exec_lock_obj+0x49/0x2b0 [drm_exec] [22982.116719] stack backtrace: [22982.116720] CPU: 8 PID: 5785 Comm: glxgears:sh0 Tainted: G U W 6.10.0-rc2+ khadas#10 [22982.116721] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023 [22982.116723] Call Trace: [22982.116724] <TASK> [22982.116725] dump_stack_lvl+0x77/0xb0 [22982.116727] __lock_acquire+0x1232/0x2160 [22982.116730] lock_acquire+0xcb/0x2d0 [22982.116732] ? dma_buf_detach+0x3b/0xf0 [22982.116734] ? __lock_acquire+0x417/0x2160 [22982.116736] __ww_mutex_lock.constprop.0+0xd0/0x13b0 [22982.116738] ? dma_buf_detach+0x3b/0xf0 [22982.116741] ? dma_buf_detach+0x3b/0xf0 [22982.116743] ? ww_mutex_lock+0x2b/0x90 [22982.116745] ww_mutex_lock+0x2b/0x90 [22982.116747] dma_buf_detach+0x3b/0xf0 [22982.116749] drm_prime_gem_destroy+0x2f/0x40 [drm] [22982.116775] xe_ttm_bo_destroy+0x32/0x220 [xe] [22982.116818] ? __mutex_unlock_slowpath+0x3a/0x290 [22982.116821] drm_exec_unlock_all+0xa1/0xd0 [drm_exec] [22982.116823] drm_exec_fini+0x12/0xb0 [drm_exec] [22982.116824] xe_validation_ctx_fini+0x15/0x40 [xe] [22982.116892] xe_vma_destroy_unlocked+0xb1/0xe0 [xe] [22982.116959] xe_vm_close_and_put+0x41a/0x9b0 [xe] [22982.117025] ? xa_find+0xe3/0x1e0 [22982.117028] xe_file_close+0x10a/0x1c0 [xe] [22982.117074] drm_file_free+0x22a/0x280 [drm] [22982.117099] drm_release_noglobal+0x22/0x70 [drm] [22982.117119] __fput+0xf1/0x2d0 [22982.117122] task_work_run+0x59/0x90 [22982.117125] do_exit+0x330/0xb40 [22982.117127] do_group_exit+0x36/0xa0 [22982.117129] get_signal+0xbd2/0xbe0 [22982.117131] arch_do_signal_or_restart+0x3e/0x240 [22982.117134] syscall_exit_to_user_mode+0x1e7/0x290 [22982.117137] do_syscall_64+0xa1/0x180 [22982.117139] ? lock_acquire+0xcb/0x2d0 [22982.117140] ? __set_task_comm+0x28/0x1e0 [22982.117141] ? find_held_lock+0x2b/0x80 [22982.117144] ? __set_task_comm+0xe1/0x1e0 [22982.117145] ? lock_release+0xca/0x290 [22982.117147] ? __do_sys_prctl+0x245/0xab0 [22982.117149] ? lockdep_hardirqs_on_prepare+0xde/0x190 [22982.117150] ? syscall_exit_to_user_mode+0xb0/0x290 [22982.117152] ? do_syscall_64+0xa1/0x180 [22982.117154] ? __lock_acquire+0x417/0x2160 [22982.117155] ? reacquire_held_locks+0xd1/0x1f0 [22982.117156] ? do_user_addr_fault+0x30c/0x790 [22982.117158] ? lock_acquire+0xcb/0x2d0 [22982.117160] ? find_held_lock+0x2b/0x80 [22982.117162] ? do_user_addr_fault+0x357/0x790 [22982.117163] ? lock_release+0xca/0x290 [22982.117164] ? do_user_addr_fault+0x361/0x790 [22982.117166] ? trace_hardirqs_off+0x4b/0xc0 [22982.117168] ? clear_bhb_loop+0x45/0xa0 [22982.117170] ? clear_bhb_loop+0x45/0xa0 [22982.117172] ? clear_bhb_loop+0x45/0xa0 [22982.117174] entry_SYSCALL_64_after_hwframe+0x76/0x7e [22982.117176] RIP: 0033:0x7f943d267169 [22982.117192] Code: Unable to access opcode bytes at 0x7f943d26713f. [22982.117193] RSP: 002b:00007f9430bffc80 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca [22982.117195] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f943d267169 [22982.117196] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00005622f89579d0 [22982.117197] RBP: 00007f9430bffcb0 R08: 0000000000000000 R09: 00000000ffffffff [22982.117198] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [22982.117199] R13: 0000000000000000 R14: 0000000000000000 R15: 00005622f89579d0 [22982.117202] </TASK> Fixes: dd08ebf ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: Christian König <christian.koenig@amd.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: dri-devel@lists.freedesktop.org Cc: intel-xe@lists.freedesktop.org Cc: <stable@vger.kernel.org> # v6.8+ Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240628153848.4989-1-thomas.hellstrom@linux.intel.com
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jul 19, 2024
…git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following batch contains Netfilter fixes for net: Patch khadas#1 fixes a bogus WARN_ON splat in nfnetlink_queue. Patch khadas#2 fixes a crash due to stack overflow in chain loop detection by using the existing chain validation routines Both patches from Florian Westphal. netfilter pull request 24-07-11 * tag 'nf-24-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: prefer nft_chain_validate netfilter: nfnetlink_queue: drop bogus WARN_ON ==================== Link: https://patch.msgid.link/20240711093948.3816-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Jul 19, 2024
When putting an inode during extent map shrinking we're doing a standard iput() but that may take a long time in case the inode is dirty and we are doing the final iput that triggers eviction - the VFS will have to wait for writeback before calling the btrfs evict callback (see fs/inode.c:evict()). This slows down the task running the shrinker which may have been triggered while updating some tree for example, meaning locks are held as well as an open transaction handle. Also if the iput() ends up triggering eviction and the inode has no links anymore, then we trigger item truncation which requires flushing delayed items, space reservation to start a transaction and that may trigger the space reclaim task and wait for it, resulting in deadlocks in case the reclaim task needs for example to commit a transaction and the shrinker is being triggered from a path holding a transaction handle. Syzbot reported such a case with the following stack traces: ====================================================== WARNING: possible circular locking dependency detected 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Not tainted ------------------------------------------------------ kswapd0/111 is trying to acquire lock: ffff88801eae4610 (sb_internal#3){.+.+}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 but task is already holding lock: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> khadas#3 (fs_reclaim){+.+.}-{0:0}: __fs_reclaim_acquire mm/page_alloc.c:3783 [inline] fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3797 might_alloc include/linux/sched/mm.h:334 [inline] slab_pre_alloc_hook mm/slub.c:3890 [inline] slab_alloc_node mm/slub.c:3980 [inline] kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4019 btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411 alloc_inode+0x5d/0x230 fs/inode.c:261 iget5_locked fs/inode.c:1235 [inline] iget5_locked+0x1c9/0x2c0 fs/inode.c:1228 btrfs_iget_locked fs/btrfs/inode.c:5590 [inline] btrfs_iget_path fs/btrfs/inode.c:5607 [inline] btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636 create_reloc_inode+0x403/0x820 fs/btrfs/relocation.c:3911 btrfs_relocate_block_group+0x471/0xe60 fs/btrfs/relocation.c:4114 btrfs_relocate_chunk+0x143/0x450 fs/btrfs/volumes.c:3373 __btrfs_balance fs/btrfs/volumes.c:4157 [inline] btrfs_balance+0x211a/0x3f00 fs/btrfs/volumes.c:4534 btrfs_ioctl_balance fs/btrfs/ioctl.c:3675 [inline] btrfs_ioctl+0x12ed/0x8290 fs/btrfs/ioctl.c:4742 __do_compat_sys_ioctl+0x2c3/0x330 fs/ioctl.c:1007 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> khadas#2 (btrfs_trans_num_extwriters){++++}-{0:0}: join_transaction+0x164/0xf40 fs/btrfs/transaction.c:315 start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700 btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323 btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999 open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554 btrfs_fill_super fs/btrfs/super.c:946 [inline] btrfs_get_tree_super fs/btrfs/super.c:1863 [inline] btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089 vfs_get_tree+0x8f/0x380 fs/super.c:1780 fc_mount+0x16/0xc0 fs/namespace.c:1125 btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline] btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090 vfs_get_tree+0x8f/0x380 fs/super.c:1780 do_new_mount fs/namespace.c:3352 [inline] path_mount+0x6e1/0x1f10 fs/namespace.c:3679 do_mount fs/namespace.c:3692 [inline] __do_sys_mount fs/namespace.c:3898 [inline] __se_sys_mount fs/namespace.c:3875 [inline] __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> khadas#1 (btrfs_trans_num_writers){++++}-{0:0}: join_transaction+0x148/0xf40 fs/btrfs/transaction.c:314 start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700 btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323 btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999 open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554 btrfs_fill_super fs/btrfs/super.c:946 [inline] btrfs_get_tree_super fs/btrfs/super.c:1863 [inline] btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089 vfs_get_tree+0x8f/0x380 fs/super.c:1780 fc_mount+0x16/0xc0 fs/namespace.c:1125 btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline] btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090 vfs_get_tree+0x8f/0x380 fs/super.c:1780 do_new_mount fs/namespace.c:3352 [inline] path_mount+0x6e1/0x1f10 fs/namespace.c:3679 do_mount fs/namespace.c:3692 [inline] __do_sys_mount fs/namespace.c:3898 [inline] __se_sys_mount fs/namespace.c:3875 [inline] __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #0 (sb_internal#3){.+.+}-{0:0}: check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1655 [inline] sb_start_intwrite include/linux/fs.h:1838 [inline] start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694 btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291 evict+0x2ed/0x6c0 fs/inode.c:667 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline] btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189 super_cache_scan+0x409/0x550 fs/super.c:227 do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435 shrink_slab+0x18a/0x1310 mm/shrinker.c:662 shrink_one+0x493/0x7c0 mm/vmscan.c:4790 shrink_many mm/vmscan.c:4851 [inline] lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951 shrink_node mm/vmscan.c:5910 [inline] kswapd_shrink_node mm/vmscan.c:6720 [inline] balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911 kswapd+0x5ea/0xbf0 mm/vmscan.c:7180 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 other info that might help us debug this: Chain exists of: sb_internal#3 --> btrfs_trans_num_extwriters --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(btrfs_trans_num_extwriters); lock(fs_reclaim); rlock(sb_internal#3); *** DEADLOCK *** 2 locks held by kswapd0/111: #0: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924 khadas#1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_trylock_shared fs/super.c:562 [inline] khadas#1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_cache_scan+0x96/0x550 fs/super.c:196 stack backtrace: CPU: 0 PID: 111 Comm: kswapd0 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114 check_noncircular+0x31a/0x400 kernel/locking/lockdep.c:2187 check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1655 [inline] sb_start_intwrite include/linux/fs.h:1838 [inline] start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694 btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291 evict+0x2ed/0x6c0 fs/inode.c:667 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline] btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189 super_cache_scan+0x409/0x550 fs/super.c:227 do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435 shrink_slab+0x18a/0x1310 mm/shrinker.c:662 shrink_one+0x493/0x7c0 mm/vmscan.c:4790 shrink_many mm/vmscan.c:4851 [inline] lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951 shrink_node mm/vmscan.c:5910 [inline] kswapd_shrink_node mm/vmscan.c:6720 [inline] balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911 kswapd+0x5ea/0xbf0 mm/vmscan.c:7180 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> So fix this by using btrfs_add_delayed_iput() so that the final iput is delegated to the cleaner kthread. Link: https://lore.kernel.org/linux-btrfs/000000000000892280061a344581@google.com/ Reported-by: syzbot+3dad89b3993a4b275e72@syzkaller.appspotmail.com Fixes: 956a17d ("btrfs: add a shrinker for extent maps") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
viraniac
pushed a commit
to viraniac/khadas_linux
that referenced
this pull request
Aug 21, 2024
commit 5a22fbc upstream. When LAN9303 is MDIO-connected two callchains exist into mdio->bus->write(): 1. switch ports 1&2 ("physical" PHYs): virtual (switch-internal) MDIO bus (lan9303_switch_ops->phy_{read|write})-> lan9303_mdio_phy_{read|write} -> mdiobus_{read|write}_nested 2. LAN9303 virtual PHY: virtual MDIO bus (lan9303_phy_{read|write}) -> lan9303_virt_phy_reg_{read|write} -> regmap -> lan9303_mdio_{read|write} If the latter functions just take mutex_lock(&sw_dev->device->bus->mdio_lock) it triggers a LOCKDEP false-positive splat. It's false-positive because the first mdio_lock in the second callchain above belongs to virtual MDIO bus, the second mdio_lock belongs to physical MDIO bus. Consequent annotation in lan9303_mdio_{read|write} as nested lock (similar to lan9303_mdio_phy_{read|write}, it's the same physical MDIO bus) prevents the following splat: WARNING: possible circular locking dependency detected 5.15.71 khadas#1 Not tainted ------------------------------------------------------ kworker/u4:3/609 is trying to acquire lock: ffff000011531c68 (lan9303_mdio:131:(&lan9303_mdio_regmap_config)->lock){+.+.}-{3:3}, at: regmap_lock_mutex but task is already holding lock: ffff0000114c44d8 (&bus->mdio_lock){+.+.}-{3:3}, at: mdiobus_read which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> khadas#1 (&bus->mdio_lock){+.+.}-{3:3}: lock_acquire __mutex_lock mutex_lock_nested lan9303_mdio_read _regmap_read regmap_read lan9303_probe lan9303_mdio_probe mdio_probe really_probe __driver_probe_device driver_probe_device __device_attach_driver bus_for_each_drv __device_attach device_initial_probe bus_probe_device deferred_probe_work_func process_one_work worker_thread kthread ret_from_fork -> #0 (lan9303_mdio:131:(&lan9303_mdio_regmap_config)->lock){+.+.}-{3:3}: __lock_acquire lock_acquire.part.0 lock_acquire __mutex_lock mutex_lock_nested regmap_lock_mutex regmap_read lan9303_phy_read dsa_slave_phy_read __mdiobus_read mdiobus_read get_phy_device mdiobus_scan __mdiobus_register dsa_register_switch lan9303_probe lan9303_mdio_probe mdio_probe really_probe __driver_probe_device driver_probe_device __device_attach_driver bus_for_each_drv __device_attach device_initial_probe bus_probe_device deferred_probe_work_func process_one_work worker_thread kthread ret_from_fork other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&bus->mdio_lock); lock(lan9303_mdio:131:(&lan9303_mdio_regmap_config)->lock); lock(&bus->mdio_lock); lock(lan9303_mdio:131:(&lan9303_mdio_regmap_config)->lock); *** DEADLOCK *** 5 locks held by kworker/u4:3/609: #0: ffff000002842938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work khadas#1: ffff80000bacbd60 (deferred_probe_work){+.+.}-{0:0}, at: process_one_work khadas#2: ffff000007645178 (&dev->mutex){....}-{3:3}, at: __device_attach khadas#3: ffff8000096e6e78 (dsa2_mutex){+.+.}-{3:3}, at: dsa_register_switch khadas#4: ffff0000114c44d8 (&bus->mdio_lock){+.+.}-{3:3}, at: mdiobus_read stack backtrace: CPU: 1 PID: 609 Comm: kworker/u4:3 Not tainted 5.15.71 khadas#1 Workqueue: events_unbound deferred_probe_work_func Call trace: dump_backtrace show_stack dump_stack_lvl dump_stack print_circular_bug check_noncircular __lock_acquire lock_acquire.part.0 lock_acquire __mutex_lock mutex_lock_nested regmap_lock_mutex regmap_read lan9303_phy_read dsa_slave_phy_read __mdiobus_read mdiobus_read get_phy_device mdiobus_scan __mdiobus_register dsa_register_switch lan9303_probe lan9303_mdio_probe ... Cc: stable@vger.kernel.org Fixes: dc70058 ("net: dsa: LAN9303: add MDIO managed mode support") Signed-off-by: Alexander Sverdlin <alexander.sverdlin@siemens.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20231027065741.534971-1-alexander.sverdlin@siemens.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
viraniac
pushed a commit
to viraniac/khadas_linux
that referenced
this pull request
Aug 21, 2024
[ Upstream commit 15319a4 ] As &card->tx_queue_lock is acquired under softirq context along the following call chain from solos_bh(), other acquisition of the same lock inside process context should disable at least bh to avoid double lock. <deadlock khadas#2> pclose() --> spin_lock(&card->tx_queue_lock) <interrupt> --> solos_bh() --> fpga_tx() --> spin_lock(&card->tx_queue_lock) This flaw was found by an experimental static analysis tool I am developing for irq-related deadlock. To prevent the potential deadlock, the patch uses spin_lock_bh() on &card->tx_queue_lock under process context code consistently to prevent the possible deadlock scenario. Fixes: 213e85d ("solos-pci: clean up pclose() function") Signed-off-by: Chengfeng Ye <dg573847474@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
viraniac
pushed a commit
to viraniac/khadas_linux
that referenced
this pull request
Aug 21, 2024
[ Upstream commit 3a42709 ] Validate @ioctl_rsp->OutputOffset and @ioctl_rsp->OutputCount so that their sum does not wrap to a number that is smaller than @reparse_buf and we end up with a wild pointer as follows: BUG: unable to handle page fault for address: ffff88809c5cd45f #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 4a01067 P4D 4a01067 PUD 0 Oops: 0000 [khadas#1] PREEMPT SMP NOPTI CPU: 2 PID: 1260 Comm: mount.cifs Not tainted 6.7.0-rc4 khadas#2 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014 RIP: 0010:smb2_query_reparse_point+0x3e0/0x4c0 [cifs] Code: ff ff e8 f3 51 fe ff 41 89 c6 58 5a 45 85 f6 0f 85 14 fe ff ff 49 8b 57 48 8b 42 60 44 8b 42 64 42 8d 0c 00 49 39 4f 50 72 40 <8b> 04 02 48 8b 9d f0 fe ff ff 49 8b 57 50 89 03 48 8b 9d e8 fe ff RSP: 0018:ffffc90000347a90 EFLAGS: 00010212 RAX: 000000008000001f RBX: ffff88800ae11000 RCX: 00000000000000ec RDX: ffff88801c5cd440 RSI: 0000000000000000 RDI: ffffffff82004aa4 RBP: ffffc90000347bb0 R08: 00000000800000cd R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000024 R12: ffff8880114d4100 R13: ffff8880114d4198 R14: 0000000000000000 R15: ffff8880114d4000 FS: 00007f02c07babc0(0000) GS:ffff88806ba00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88809c5cd45f CR3: 0000000011750000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> ? __die+0x23/0x70 ? page_fault_oops+0x181/0x480 ? search_module_extables+0x19/0x60 ? srso_alias_return_thunk+0x5/0xfbef5 ? exc_page_fault+0x1b6/0x1c0 ? asm_exc_page_fault+0x26/0x30 ? _raw_spin_unlock_irqrestore+0x44/0x60 ? smb2_query_reparse_point+0x3e0/0x4c0 [cifs] cifs_get_fattr+0x16e/0xa50 [cifs] ? srso_alias_return_thunk+0x5/0xfbef5 ? lock_acquire+0xbf/0x2b0 cifs_root_iget+0x163/0x5f0 [cifs] cifs_smb3_do_mount+0x5bd/0x780 [cifs] smb3_get_tree+0xd9/0x290 [cifs] vfs_get_tree+0x2c/0x100 ? capable+0x37/0x70 path_mount+0x2d7/0xb80 ? srso_alias_return_thunk+0x5/0xfbef5 ? _raw_spin_unlock_irqrestore+0x44/0x60 __x64_sys_mount+0x11a/0x150 do_syscall_64+0x47/0xf0 entry_SYSCALL_64_after_hwframe+0x6f/0x77 RIP: 0033:0x7f02c08d5b1e Fixes: 2e4564b ("smb3: add support for stat of WSL reparse points for special file types") Cc: stable@vger.kernel.org Reported-by: Robert Morris <rtm@csail.mit.edu> Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
viraniac
pushed a commit
to viraniac/khadas_linux
that referenced
this pull request
Aug 21, 2024
[ Upstream commit 1469417 ] Trying to suspend to RAM on SAMA5D27 EVK leads to the following lockdep warning: ============================================ WARNING: possible recursive locking detected 6.7.0-rc5-wt+ #532 Not tainted -------------------------------------------- sh/92 is trying to acquire lock: c3cf306c (&irq_desc_lock_class){-.-.}-{2:2}, at: __irq_get_desc_lock+0xe8/0x100 but task is already holding lock: c3d7c46c (&irq_desc_lock_class){-.-.}-{2:2}, at: __irq_get_desc_lock+0xe8/0x100 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&irq_desc_lock_class); lock(&irq_desc_lock_class); *** DEADLOCK *** May be due to missing lock nesting notation 6 locks held by sh/92: #0: c3aa0258 (sb_writers#6){.+.+}-{0:0}, at: ksys_write+0xd8/0x178 khadas#1: c4c2df44 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x138/0x284 khadas#2: c32684a0 (kn->active){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x148/0x284 khadas#3: c232b6d4 (system_transition_mutex){+.+.}-{3:3}, at: pm_suspend+0x13c/0x4e8 khadas#4: c387b088 (&dev->mutex){....}-{3:3}, at: __device_suspend+0x1e8/0x91c khadas#5: c3d7c46c (&irq_desc_lock_class){-.-.}-{2:2}, at: __irq_get_desc_lock+0xe8/0x100 stack backtrace: CPU: 0 PID: 92 Comm: sh Not tainted 6.7.0-rc5-wt+ #532 Hardware name: Atmel SAMA5 unwind_backtrace from show_stack+0x18/0x1c show_stack from dump_stack_lvl+0x34/0x48 dump_stack_lvl from __lock_acquire+0x19ec/0x3a0c __lock_acquire from lock_acquire.part.0+0x124/0x2d0 lock_acquire.part.0 from _raw_spin_lock_irqsave+0x5c/0x78 _raw_spin_lock_irqsave from __irq_get_desc_lock+0xe8/0x100 __irq_get_desc_lock from irq_set_irq_wake+0xa8/0x204 irq_set_irq_wake from atmel_gpio_irq_set_wake+0x58/0xb4 atmel_gpio_irq_set_wake from irq_set_irq_wake+0x100/0x204 irq_set_irq_wake from gpio_keys_suspend+0xec/0x2b8 gpio_keys_suspend from dpm_run_callback+0xe4/0x248 dpm_run_callback from __device_suspend+0x234/0x91c __device_suspend from dpm_suspend+0x224/0x43c dpm_suspend from dpm_suspend_start+0x9c/0xa8 dpm_suspend_start from suspend_devices_and_enter+0x1e0/0xa84 suspend_devices_and_enter from pm_suspend+0x460/0x4e8 pm_suspend from state_store+0x78/0xe4 state_store from kernfs_fop_write_iter+0x1a0/0x284 kernfs_fop_write_iter from vfs_write+0x38c/0x6f4 vfs_write from ksys_write+0xd8/0x178 ksys_write from ret_fast_syscall+0x0/0x1c Exception stack(0xc52b3fa8 to 0xc52b3ff0) 3fa0: 00000004 005a0ae8 00000001 005a0ae8 00000004 00000001 3fc0: 00000004 005a0ae8 00000001 00000004 00000004 b6c616c0 00000020 0059d190 3fe0: 00000004 b6c61678 aec5a041 aebf1a26 This warning is raised because pinctrl-at91-pio4 uses chained IRQ. Whenever a wake up source configures an IRQ through irq_set_irq_wake, it will lock the corresponding IRQ desc, and then call irq_set_irq_wake on "parent" IRQ which will do the same on its own IRQ desc, but since those two locks share the same class, lockdep reports this as an issue. Fix lockdep false positive by setting a different class for parent and children IRQ Fixes: 7761808 ("pinctrl: introduce driver for Atmel PIO4 controller") Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com> Link: https://lore.kernel.org/r/20231215-lockdep_warning-v1-1-8137b2510ed5@bootlin.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
viraniac
pushed a commit
to viraniac/khadas_linux
that referenced
this pull request
Aug 21, 2024
[ Upstream commit 90d025c ] If server replied SMB2_NEGOTIATE with a zero SecurityBufferOffset, smb2_get_data_area() sets @len to non-zero but return NULL, so decode_negTokeninit() ends up being called with a NULL @security_blob: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [khadas#1] PREEMPT SMP NOPTI CPU: 2 PID: 871 Comm: mount.cifs Not tainted 6.7.0-rc4 khadas#2 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014 RIP: 0010:asn1_ber_decoder+0x173/0xc80 Code: 01 4c 39 2c 24 75 09 45 84 c9 0f 85 2f 03 00 00 48 8b 14 24 4c 29 ea 48 83 fa 01 0f 86 1e 07 00 00 48 8b 74 24 28 4d 8d 5d 01 <42> 0f b6 3c 2e 89 fa 40 88 7c 24 5c f7 d2 83 e2 1f 0f 84 3d 07 00 RSP: 0018:ffffc9000063f950 EFLAGS: 00010202 RAX: 0000000000000002 RBX: 0000000000000000 RCX: 000000000000004a RDX: 000000000000004a RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000002 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000000 R14: 000000000000004d R15: 0000000000000000 FS: 00007fce52b0fbc0(0000) GS:ffff88806ba00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000001ae64000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> ? __die+0x23/0x70 ? page_fault_oops+0x181/0x480 ? __stack_depot_save+0x1e6/0x480 ? exc_page_fault+0x6f/0x1c0 ? asm_exc_page_fault+0x26/0x30 ? asn1_ber_decoder+0x173/0xc80 ? check_object+0x40/0x340 decode_negTokenInit+0x1e/0x30 [cifs] SMB2_negotiate+0xc99/0x17c0 [cifs] ? smb2_negotiate+0x46/0x60 [cifs] ? srso_alias_return_thunk+0x5/0xfbef5 smb2_negotiate+0x46/0x60 [cifs] cifs_negotiate_protocol+0xae/0x130 [cifs] cifs_get_smb_ses+0x517/0x1040 [cifs] ? srso_alias_return_thunk+0x5/0xfbef5 ? srso_alias_return_thunk+0x5/0xfbef5 ? queue_delayed_work_on+0x5d/0x90 cifs_mount_get_session+0x78/0x200 [cifs] dfs_mount_share+0x13a/0x9f0 [cifs] ? srso_alias_return_thunk+0x5/0xfbef5 ? lock_acquire+0xbf/0x2b0 ? find_nls+0x16/0x80 ? srso_alias_return_thunk+0x5/0xfbef5 cifs_mount+0x7e/0x350 [cifs] cifs_smb3_do_mount+0x128/0x780 [cifs] smb3_get_tree+0xd9/0x290 [cifs] vfs_get_tree+0x2c/0x100 ? capable+0x37/0x70 path_mount+0x2d7/0xb80 ? srso_alias_return_thunk+0x5/0xfbef5 ? _raw_spin_unlock_irqrestore+0x44/0x60 __x64_sys_mount+0x11a/0x150 do_syscall_64+0x47/0xf0 entry_SYSCALL_64_after_hwframe+0x6f/0x77 RIP: 0033:0x7fce52c2ab1e Fix this by setting @len to zero when @off == 0 so callers won't attempt to dereference non-existing data areas. Reported-by: Robert Morris <rtm@csail.mit.edu> Cc: stable@vger.kernel.org Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
viraniac
pushed a commit
to viraniac/khadas_linux
that referenced
this pull request
Aug 21, 2024
commit b684c09 upstream. ppc_save_regs() skips one stack frame while saving the CPU register states. Instead of saving current R1, it pulls the previous stack frame pointer. When vmcores caused by direct panic call (such as `echo c > /proc/sysrq-trigger`), are debugged with gdb, gdb fails to show the backtrace correctly. On further analysis, it was found that it was because of mismatch between r1 and NIP. GDB uses NIP to get current function symbol and uses corresponding debug info of that function to unwind previous frames, but due to the mismatching r1 and NIP, the unwinding does not work, and it fails to unwind to the 2nd frame and hence does not show the backtrace. GDB backtrace with vmcore of kernel without this patch: --------- (gdb) bt #0 0xc0000000002a53e8 in crash_setup_regs (oldregs=<optimized out>, newregs=0xc000000004f8f8d8) at ./arch/powerpc/include/asm/kexec.h:69 khadas#1 __crash_kexec (regs=<optimized out>) at kernel/kexec_core.c:974 khadas#2 0x0000000000000063 in ?? () khadas#3 0xc000000003579320 in ?? () --------- Further analysis revealed that the mismatch occurred because "ppc_save_regs" was saving the previous stack's SP instead of the current r1. This patch fixes this by storing current r1 in the saved pt_regs. GDB backtrace with vmcore of patched kernel: -------- (gdb) bt #0 0xc0000000002a53e8 in crash_setup_regs (oldregs=0x0, newregs=0xc00000000670b8d8) at ./arch/powerpc/include/asm/kexec.h:69 khadas#1 __crash_kexec (regs=regs@entry=0x0) at kernel/kexec_core.c:974 khadas#2 0xc000000000168918 in panic (fmt=fmt@entry=0xc000000001654a60 "sysrq triggered crash\n") at kernel/panic.c:358 khadas#3 0xc000000000b735f8 in sysrq_handle_crash (key=<optimized out>) at drivers/tty/sysrq.c:155 khadas#4 0xc000000000b742cc in __handle_sysrq (key=key@entry=99, check_mask=check_mask@entry=false) at drivers/tty/sysrq.c:602 khadas#5 0xc000000000b7506c in write_sysrq_trigger (file=<optimized out>, buf=<optimized out>, count=2, ppos=<optimized out>) at drivers/tty/sysrq.c:1163 khadas#6 0xc00000000069a7bc in pde_write (ppos=<optimized out>, count=<optimized out>, buf=<optimized out>, file=<optimized out>, pde=0xc00000000362cb40) at fs/proc/inode.c:340 khadas#7 proc_reg_write (file=<optimized out>, buf=<optimized out>, count=<optimized out>, ppos=<optimized out>) at fs/proc/inode.c:352 khadas#8 0xc0000000005b3bbc in vfs_write (file=file@entry=0xc000000006aa6b00, buf=buf@entry=0x61f498b4f60 <error: Cannot access memory at address 0x61f498b4f60>, count=count@entry=2, pos=pos@entry=0xc00000000670bda0) at fs/read_write.c:582 khadas#9 0xc0000000005b4264 in ksys_write (fd=<optimized out>, buf=0x61f498b4f60 <error: Cannot access memory at address 0x61f498b4f60>, count=2) at fs/read_write.c:637 khadas#10 0xc00000000002ea2c in system_call_exception (regs=0xc00000000670be80, r0=<optimized out>) at arch/powerpc/kernel/syscall.c:171 khadas#11 0xc00000000000c270 in system_call_vectored_common () at arch/powerpc/kernel/interrupt_64.S:192 -------- Nick adds: So this now saves regs as though it was an interrupt taken in the caller, at the instruction after the call to ppc_save_regs, whereas previously the NIP was there, but R1 came from the caller's caller and that mismatch is what causes gdb's dwarf unwinder to go haywire. Signed-off-by: Aditya Gupta <adityag@linux.ibm.com> Fixes: d16a58f ("powerpc: Improve ppc_save_regs()") Reivewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230615091047.90433-1-adityag@linux.ibm.com Cc: stable@vger.kernel.org Signed-off-by: Aditya Gupta <adityag@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Sep 5, 2024
Arm has introduced a new v10 GPU architecture that replaces the Job Manager interface with a new Command Stream Frontend. It adds firmware driven command stream queues that can be used by kernel and user space to submit jobs to the GPU. Add the initial schema for the device tree that is based on support for RK3588 SoC. The minimum number of clocks is one for the IP, but on Rockchip platforms they will tend to expose the semi-independent clocks for better power management. v5: - Move the opp-table node under the gpu node v4: - Fix formatting issue v3: - Cleanup commit message to remove redundant text - Added opp-table property and re-ordered entries - Clarified power-domains and power-domain-names requirements for RK3588. - Cleaned up example Note: power-domains and power-domain-names requirements for other platforms are still work in progress, hence the bindings are left incomplete here. v2: - New commit Signed-off-by: Liviu Dudau <liviu.dudau@arm.com> Cc: Krzysztof Kozlowski <krzysztof.kozlowski+dt@linaro.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Conor Dooley <conor+dt@kernel.org> Cc: devicetree@vger.kernel.org Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Rob Herring <robh@kernel.org> drm: execution context for GEM buffers v7 This adds the infrastructure for an execution context for GEM buffers which is similar to the existing TTMs execbuf util and intended to replace it in the long term. The basic functionality is that we abstracts the necessary loop to lock many different GEM buffers with automated deadlock and duplicate handling. v2: drop xarray and use dynamic resized array instead, the locking overhead is unnecessary and measurable. v3: drop duplicate tracking, radeon is really the only one needing that. v4: fixes issues pointed out by Danilo, some typos in comments and a helper for lock arrays of GEM objects. v5: some suggestions by Boris Brezillon, especially just use one retry macro, drop loop in prepare_array, use flags instead of bool v6: minor changes suggested by Thomas, Boris and Danilo v7: minor typos pointed out by checkpatch.pl fixed Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Danilo Krummrich <dakr@redhat.com> Tested-by: Danilo Krummrich <dakr@redhat.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230711133122.3710-2-christian.koenig@amd.com drm: manager to keep track of GPUs VA mappings Add infrastructure to keep track of GPU virtual address (VA) mappings with a decicated VA space manager implementation. New UAPIs, motivated by Vulkan sparse memory bindings graphics drivers start implementing, allow userspace applications to request multiple and arbitrary GPU VA mappings of buffer objects. The DRM GPU VA manager is intended to serve the following purposes in this context. 1) Provide infrastructure to track GPU VA allocations and mappings, using an interval tree (RB-tree). 2) Generically connect GPU VA mappings to their backing buffers, in particular DRM GEM objects. 3) Provide a common implementation to perform more complex mapping operations on the GPU VA space. In particular splitting and merging of GPU VA mappings, e.g. for intersecting mapping requests or partial unmap requests. Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Acked-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Tested-by: Matthew Brost <matthew.brost@intel.com> Tested-by: Donald Robson <donald.robson@imgtec.com> Suggested-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230720001443.2380-2-dakr@redhat.com drm: manager: Fix printk format for size_t sizeof() returns a size_t which may be different to an unsigned long. Use the correct format specifier of '%zu' to prevent compiler warnings. Fixes: e6303f323b1a ("drm: manager to keep track of GPUs VA mappings") Reviewed-by: Danilo Krummrich <dakr@redhat.com> Signed-off-by: Steven Price <steven.price@arm.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/2bf64010-c40a-8b84-144c-5387412b579e@arm.com drm/gpuva_mgr: remove unused prev pointer in __drm_gpuva_sm_map() The prev pointer in __drm_gpuva_sm_map() was used to implement automatic merging of mappings. Since automatic merging did not make its way upstream, remove this leftover. Fixes: e6303f323b1a ("drm: manager to keep track of GPUs VA mappings") Signed-off-by: Danilo Krummrich <dakr@redhat.com> Reviewed-by: Dave Airlie <airlied@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230823233119.2891-1-dakr@redhat.com drm/gpuvm: rename struct drm_gpuva_manager to struct drm_gpuvm Rename struct drm_gpuva_manager to struct drm_gpuvm including corresponding functions. This way the GPUVA manager's structures align very well with the documentation of VM_BIND [1] and VM_BIND locking [2]. It also provides a better foundation for the naming of data structures and functions introduced for implementing a common dma-resv per GPU-VM including tracking of external and evicted objects in subsequent patches. [1] Documentation/gpu/drm-vm-bind-async.rst [2] Documentation/gpu/drm-vm-bind-locking.rst Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Acked-by: Dave Airlie <airlied@redhat.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230920144343.64830-2-dakr@redhat.com drm/gpuvm: allow building as module HB: drivers/gpu/drm/nouveau/Kconfig skipped because there is no gpuvm support of nouveau in 6.1 Currently, the DRM GPUVM does not have any core dependencies preventing a module build. Also, new features from subsequent patches require helpers (namely drm_exec) which can be built as module. Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230920144343.64830-3-dakr@redhat.com drm/gpuvm: convert WARN() to drm_WARN() variants HB: drivers/gpu/drm/nouveau/nouveau_uvmm.c skipped since 6.1 does not support gpuvm on nv Use drm_WARN() and drm_WARN_ON() variants to indicate drivers the context the failing VM resides in. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231108001259.15123-2-dakr@redhat.com drm/gpuvm: don't always WARN in drm_gpuvm_check_overflow() Don't always WARN in drm_gpuvm_check_overflow() and separate it into a drm_gpuvm_check_overflow() and a dedicated drm_gpuvm_warn_check_overflow() variant. This avoids printing warnings due to invalid userspace requests. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231108001259.15123-3-dakr@redhat.com drm/gpuvm: export drm_gpuvm_range_valid() Drivers may use this function to validate userspace requests in advance, hence export it. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231108001259.15123-4-dakr@redhat.com drm/gpuvm: add common dma-resv per struct drm_gpuvm hb: drivers/gpu/drm/nouveau/nouveau_uvmm.c skipped Provide a common dma-resv for GEM objects not being used outside of this GPU-VM. This is used in a subsequent patch to generalize dma-resv, external and evicted object handling and GEM validation. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231108001259.15123-6-dakr@redhat.com drm/gpuvm: add drm_gpuvm_flags to drm_gpuvm HB: drivers/gpu/drm/nouveau/nouveau_uvmm.c skipped Introduce flags for struct drm_gpuvm, this required by subsequent commits. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231108001259.15123-8-dakr@redhat.com drm/gpuvm: reference count drm_gpuvm structures HB: drivers/gpu/drm/nouveau/nouveau_uvmm.c skipped Implement reference counting for struct drm_gpuvm. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231108001259.15123-10-dakr@redhat.com drm/gpuvm: add an abstraction for a VM / BO combination HB: drivers/gpu/drm/nouveau/nouveau_uvmm.c skipped Add an abstraction layer between the drm_gpuva mappings of a particular drm_gem_object and this GEM object itself. The abstraction represents a combination of a drm_gem_object and drm_gpuvm. The drm_gem_object holds a list of drm_gpuvm_bo structures (the structure representing this abstraction), while each drm_gpuvm_bo contains list of mappings of this GEM object. This has multiple advantages: 1) We can use the drm_gpuvm_bo structure to attach it to various lists of the drm_gpuvm. This is useful for tracking external and evicted objects per VM, which is introduced in subsequent patches. 2) Finding mappings of a certain drm_gem_object mapped in a certain drm_gpuvm becomes much cheaper. 3) Drivers can derive and extend the structure to easily represent driver specific states of a BO for a certain GPUVM. The idea of this abstraction was taken from amdgpu, hence the credit for this idea goes to the developers of amdgpu. Cc: Christian König <christian.koenig@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231108001259.15123-11-dakr@redhat.com drm/gpuvm: track/lock/validate external/evicted objects Currently the DRM GPUVM offers common infrastructure to track GPU VA allocations and mappings, generically connect GPU VA mappings to their backing buffers and perform more complex mapping operations on the GPU VA space. However, there are more design patterns commonly used by drivers, which can potentially be generalized in order to make the DRM GPUVM represent a basis for GPU-VM implementations. In this context, this patch aims at generalizing the following elements. 1) Provide a common dma-resv for GEM objects not being used outside of this GPU-VM. 2) Provide tracking of external GEM objects (GEM objects which are shared with other GPU-VMs). 3) Provide functions to efficiently lock all GEM objects dma-resv the GPU-VM contains mappings of. 4) Provide tracking of evicted GEM objects the GPU-VM contains mappings of, such that validation of evicted GEM objects is accelerated. 5) Provide some convinience functions for common patterns. Big thanks to Boris Brezillon for his help to figure out locking for drivers updating the GPU VA space within the fence signalling path. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Suggested-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231108001259.15123-12-dakr@redhat.com drm/gpuvm: fall back to drm_exec_lock_obj() Fall back to drm_exec_lock_obj() if num_fences is zero for the drm_gpuvm_prepare_* function family. Otherwise dma_resv_reserve_fences() would actually allocate slots even though num_fences is zero. Cc: Christian König <christian.koenig@amd.com> Acked-by: Donald Robson <donald.robson@imgtec.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231129220835.297885-2-dakr@redhat.com drm/gpuvm: Let drm_gpuvm_bo_put() report when the vm_bo object is destroyed Some users need to release resources attached to the vm_bo object when it's destroyed. In Panthor's case, we need to release the pin ref so BO pages can be returned to the system when all GPU mappings are gone. This could be done through a custom drm_gpuvm::vm_bo_free() hook, but this has all sort of locking implications that would force us to expose a drm_gem_shmem_unpin_locked() helper, not to mention the fact that having a ::vm_bo_free() implementation without a ::vm_bo_alloc() one seems odd. So let's keep things simple, and extend drm_gpuvm_bo_put() to report when the object is destroyed. Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231204151406.1977285-1-boris.brezillon@collabora.com drm/exec: Pass in initial # of objects HB: skipped drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c drivers/gpu/drm/amd/amdgpu/amdgpu_umsch_mm.c drivers/gpu/drm/amd/amdkfd/kfd_svm.c drivers/gpu/drm/imagination/pvr_job.c drivers/gpu/drm/nouveau/nouveau_uvmm.c In cases where the # is known ahead of time, it is silly to do the table resize dance. Signed-off-by: Rob Clark <robdclark@chromium.org> Reviewed-by: Christian König <christian.koenig@amd.com> Patchwork: https://patchwork.freedesktop.org/patch/568338/ drm/gem-shmem: When drm_gem_object_init failed, should release object when goto err_free, the object had init, so it should be release when fail. Signed-off-by: ChunyouTang <tangchunyou@163.com> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20221119064131.364-1-tangchunyou@163.com drm: Remove usage of deprecated DRM_DEBUG_PRIME drm_print.h says DRM_DEBUG_PRIME is deprecated in favor of drm_dbg_prime(). Signed-off-by: Siddh Raman Pant <code@siddh.me> Reviewed-by: Simon Ser <contact@emersion.fr> Signed-off-by: Simon Ser <contact@emersion.fr> Link: https://patchwork.freedesktop.org/patch/msgid/cd663b1bc42189e55898cddecdb3b73c591b341a.1673269059.git.code@siddh.me drm/shmem: Cleanup drm_gem_shmem_create_with_handle() Once we create the handle, the handle owns the reference. Currently nothing was doing anything with the shmem ptr after the handle was created, but let's change drm_gem_shmem_create_with_handle() to not return the pointer, so-as to not encourage problematic use of this function in the future. As a bonus, it makes the code a bit cleaner. Signed-off-by: Rob Clark <robdclark@chromium.org> Reviewed-by: Steven Price <steven.price@arm.com> Signed-off-by: Steven Price <steven.price@arm.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230123154831.3191821-1-robdclark@gmail.com drm/shmem-helper: Fix locking for drm_gem_shmem_get_pages_sgt() Other functions touching shmem->sgt take the pages lock, so do that here too. drm_gem_shmem_get_pages() & co take the same lock, so move to the _locked() variants to avoid recursive locking. Discovered while auditing locking to write the Rust abstractions. Fixes: 2194a63a818d ("drm: Add library for shmem backed GEM objects") Fixes: 4fa3d66f132b ("drm/shmem: Do dma_unmap_sg before purging pages") Signed-off-by: Asahi Lina <lina@asahilina.net> Reviewed-by: Javier Martinez Canillas <javierm@redhat.com> Signed-off-by: Javier Martinez Canillas <javierm@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230205125124.2260-1-lina@asahilina.net drm/shmem-helper: Switch to use drm_* debug helpers Ease debugging of a multi-GPU system by using drm_WARN_*() and drm_dbg_kms() helpers that print out DRM device name corresponding to shmem GEM. Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Suggested-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Link: https://lore.kernel.org/all/20230108210445.3948344-6-dmitry.osipenko@collabora.com/ drm/shmem-helper: Don't use vmap_use_count for dma-bufs DMA-buf core has its own refcounting of vmaps, use it instead of drm-shmem counting. This change prepares drm-shmem for addition of memory shrinker support where drm-shmem will use a single dma-buf reservation lock for all operations performed over dma-bufs. Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Link: https://lore.kernel.org/all/20230108210445.3948344-7-dmitry.osipenko@collabora.com/ drm/shmem-helper: Switch to reservation lock Replace all drm-shmem locks with a GEM reservation lock. This makes locks consistent with dma-buf locking convention where importers are responsible for holding reservation lock for all operations performed over dma-bufs, preventing deadlock between dma-buf importers and exporters. Suggested-by: Daniel Vetter <daniel@ffwll.ch> Acked-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Link: https://lore.kernel.org/all/20230108210445.3948344-8-dmitry.osipenko@collabora.com/ drm/shmem-helper: Revert accidental non-GPL export The referenced commit added a wrapper for drm_gem_shmem_get_pages_sgt(), but in the process it accidentally changed the export type from GPL to non-GPL. Switch it back to GPL. Reported-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Fixes: ddddedaa0db9 ("drm/shmem-helper: Fix locking for drm_gem_shmem_get_pages_sgt()") Signed-off-by: Asahi Lina <lina@asahilina.net> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20230227-shmem-export-fix-v1-1-8880b2c25e81@asahilina.net Revert "drm/shmem-helper: Switch to reservation lock" This reverts commit 67b7836d4458790f1261e31fe0ce3250989784f0. The locking appears incomplete. A caller of SHMEM helper's pin function never acquires the dma-buf reservation lock. So we get WARNING: CPU: 3 PID: 967 at drivers/gpu/drm/drm_gem_shmem_helper.c:243 drm_gem_shmem_pin+0x42/0x90 [drm_shmem_helper] Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Acked-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230228152612.19971-1-tzimmermann@suse.de drm/shmem-helper: Switch to reservation lock Replace all drm-shmem locks with a GEM reservation lock. This makes locks consistent with dma-buf locking convention where importers are responsible for holding reservation lock for all operations performed over dma-bufs, preventing deadlock between dma-buf importers and exporters. Suggested-by: Daniel Vetter <daniel@ffwll.ch> Acked-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Emil Velikov <emil.l.velikov@gmail.com> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230529223935.2672495-7-dmitry.osipenko@collabora.com drm/shmem-helper: Reset vma->vm_ops before calling dma_buf_mmap() The dma-buf backend is supposed to provide its own vm_ops, but some implementation just have nothing special to do and leave vm_ops untouched, probably expecting this field to be zero initialized (this is the case with the system_heap implementation for instance). Let's reset vma->vm_ops to NULL to keep things working with these implementations. Fixes: 26d3ac3cb04d ("drm/shmem-helpers: Redirect mmap for imported dma-buf") Cc: <stable@vger.kernel.org> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Reported-by: Roman Stratiienko <r.stratiienko@gmail.com> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Tested-by: Roman Stratiienko <r.stratiienko@gmail.com> Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20230724112610.60974-1-boris.brezillon@collabora.com iommu: Allow passing custom allocators to pgtable drivers This will be useful for GPU drivers who want to keep page tables in a pool so they can: - keep freed page tables in a free pool and speed-up upcoming page table allocations - batch page table allocation instead of allocating one page at a time - pre-reserve pages for page tables needed for map/unmap operations, to ensure map/unmap operations don't try to allocate memory in paths they're allowed to block or fail It might also be valuable for other aspects of GPU and similar use-cases, like fine-grained memory accounting and resource limiting. We will extend the Arm LPAE format to support custom allocators in a separate commit. Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Steven Price <steven.price@arm.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/20231124142434.1577550-2-boris.brezillon@collabora.com Signed-off-by: Joerg Roedel <jroedel@suse.de> iommu: Extend LPAE page table format to support custom allocators We need that in order to implement the VM_BIND ioctl in the GPU driver targeting new Mali GPUs. VM_BIND is about executing MMU map/unmap requests asynchronously, possibly after waiting for external dependencies encoded as dma_fences. We intend to use the drm_sched framework to automate the dependency tracking and VM job dequeuing logic, but this comes with its own set of constraints, one of them being the fact we are not allowed to allocate memory in the drm_gpu_scheduler_ops::run_job() to avoid this sort of deadlocks: - VM_BIND map job needs to allocate a page table to map some memory to the VM. No memory available, so kswapd is kicked - GPU driver shrinker backend ends up waiting on the fence attached to the VM map job or any other job fence depending on this VM operation. With custom allocators, we will be able to pre-reserve enough pages to guarantee the map/unmap operations we queued will take place without going through the system allocator. But we can also optimize allocation/reservation by not free-ing pages immediately, so any upcoming page table allocation requests can be serviced by some free page table pool kept at the driver level. I might also be valuable for other aspects of GPU and similar use-cases, like fine-grained memory accounting and resource limiting. Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Steven Price <steven.price@arm.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/20231124142434.1577550-3-boris.brezillon@collabora.com Signed-off-by: Joerg Roedel <jroedel@suse.de> drm/sched: Add FIFO sched policy to run queue When many entities are competing for the same run queue on the same scheduler, we observe an unusually long wait times and some jobs get starved. This has been observed on GPUVis. The issue is due to the Round Robin policy used by schedulers to pick up the next entity's job queue for execution. Under stress of many entities and long job queues within entity some jobs could be stuck for very long time in it's entity's queue before being popped from the queue and executed while for other entities with smaller job queues a job might execute earlier even though that job arrived later then the job in the long queue. Fix: Add FIFO selection policy to entities in run queue, chose next entity on run queue in such order that if job on one entity arrived earlier then job on another entity the first job will start executing earlier regardless of the length of the entity's job queue. v2: Switch to rb tree structure for entities based on TS of oldest job waiting in the job queue of an entity. Improves next entity extraction to O(1). Entity TS update O(log N) where N is the number of entities in the run-queue Drop default option in module control parameter. v3: Various cosmetical fixes and minor refactoring of fifo update function. (Luben) v4: Switch drm_sched_rq_select_entity_fifo to in order search (Luben) v5: Fix up drm_sched_rq_select_entity_fifo loop (Luben) v6: Add missing drm_sched_rq_remove_fifo_locked v7: Fix ts sampling bug and more cosmetic stuff (Luben) v8: Fix module parameter string (Luben) Cc: Luben Tuikov <luben.tuikov@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Direct Rendering Infrastructure - Development <dri-devel@lists.freedesktop.org> Cc: AMD Graphics <amd-gfx@lists.freedesktop.org> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Tested-by: Yunxiang Li (Teddy) <Yunxiang.Li@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220930041258.1050247-1-luben.tuikov@amd.com drm/scheduler: Set the FIFO scheduling policy as the default The currently default Round-Robin GPU scheduling can result in starvation of entities which have a large number of jobs, over entities which have a very small number of jobs (single digit). This can be illustrated in the following diagram, where jobs are alphabetized to show their chronological order of arrival, where job A is the oldest, B is the second oldest, and so on, to J, the most recent job to arrive. ---> entities j | H-F-----A--E--I-- o | --G-----B-----J-- b | --------C-------- s\/ --------D-------- WLOG, assuming all jobs are "ready", then a R-R scheduling will execute them in the following order (a slice off of the top of the entities' list), H, F, A, E, I, G, B, J, C, D. However, to mitigate job starvation, we'd rather execute C and D before E, and so on, given, of course, that they're all ready to be executed. So, if all jobs are ready at this instant, the order of execution for this and the next 9 instances of picking the next job to execute, should really be, A, B, C, D, E, F, G, H, I, J, which is their chronological order. The only reason for this order to be broken, is if an older job is not yet ready, but a younger job is ready, at an instant of picking a new job to execute. For instance if job C wasn't ready at time 2, but job D was ready, then we'd pick job D, like this: 0 +1 +2 ... A, B, D, ... And from then on, C would be preferred before all other jobs, if it is ready at the time when a new job for execution is picked. So, if C became ready two steps later, the execution order would look like this: ......0 +1 +2 ... A, B, D, E, C, F, G, H, I, J This is what the FIFO GPU scheduling algorithm achieves. It uses a Red-Black tree to keep jobs sorted in chronological order, where picking the oldest job is O(1) (we use the "cached" structure), and balancing the tree is O(log n). IOW, it picks the *oldest ready* job to execute now. The implementation is already in the kernel, and this commit only changes the default GPU scheduling algorithm to use. This was tested and achieves about 1% faster performance over the Round Robin algorithm. Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Cc: Direct Rendering Infrastructure - Development <dri-devel@lists.freedesktop.org> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221024212634.27230-1-luben.tuikov@amd.com Signed-off-by: Christian König <christian.koenig@amd.com> drm/scheduler: add drm_sched_job_add_resv_dependencies Add a new function to update job dependencies from a resv obj. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-3-christian.koenig@amd.com drm/scheduler: remove drm_sched_dependency_optimized Not used any more. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-12-christian.koenig@amd.com drm/scheduler: rework entity flush, kill and fini This was buggy because when we had to wait for entities which were killed as well we would just deadlock. Instead move all the dependency handling into the callbacks so that will all happen asynchronously. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-13-christian.koenig@amd.com drm/scheduler: rename dependency callback into prepare_job This now matches much better what this is doing. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-14-christian.koenig@amd.com drm/amdgpu: revert "implement tdr advanced mode" This reverts commit e6c6338f393b74ac0b303d567bb918b44ae7ad75. This feature basically re-submits one job after another to figure out which one was the one causing a hang. This is obviously incompatible with gang-submit which requires that multiple jobs run at the same time. It's also absolutely not helpful to crash the hardware multiple times if a clean recovery is desired. For testing and debugging environments we should rather disable recovery alltogether to be able to inspect the state with a hw debugger. Additional to that the sw implementation is clearly buggy and causes reference count issues for the hardware fence. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> drm/scheduler: Fix lockup in drm_sched_entity_kill() The drm_sched_entity_kill() is invoked twice by drm_sched_entity_destroy() while userspace process is exiting or being killed. First time it's invoked when sched entity is flushed and second time when entity is released. This causes a lockup within wait_for_completion(entity_idle) due to how completion API works. Calling wait_for_completion() more times than complete() was invoked is a error condition that causes lockup because completion internally uses counter for complete/wait calls. The complete_all() must be used instead in such cases. This patch fixes lockup of Panfrost driver that is reproducible by killing any application in a middle of 3d drawing operation. Fixes: 2fdb8a8f07c2 ("drm/scheduler: rework entity flush, kill and fini") Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221123001303.533968-1-dmitry.osipenko@collabora.com drm/scheduler: deprecate drm_sched_resubmit_jobs This interface is not working as it should. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221109095010.141189-5-christian.koenig@amd.com drm/scheduler: track GPU active time per entity Track the accumulated time that jobs from this entity were active on the GPU. This allows drivers using the scheduler to trivially implement the DRM fdinfo when the hardware doesn't provide more specific information than signalling job completion anyways. [Bagas: Append missing colon to @elapsed_ns] Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> drm/sched: Create wrapper to add a syncobj dependency to job In order to add a syncobj's fence as a dependency to a job, it is necessary to call drm_syncobj_find_fence() to find the fence and then add the dependency with drm_sched_job_add_dependency(). So, wrap these steps in one single function, drm_sched_job_add_syncobj_dependency(). Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Maíra Canal <mcanal@igalia.com> Signed-off-by: Maíra Canal <mairacanal@riseup.net> Link: https://patchwork.freedesktop.org/patch/msgid/20230209124447.467867-2-mcanal@igalia.com drm/scheduler: Fix variable name in function description Compiling AMD GPU drivers displays two warnings: drivers/gpu/drm/scheduler/sched_main.c:738: warning: Function parameter or member 'file' not described in 'drm_sched_job_add_syncobj_dependency' drivers/gpu/drm/scheduler/sched_main.c:738: warning: Excess function parameter 'file_private' description in 'drm_sched_job_add_syncobj_dependency' Get rid of them by renaming the variable name on the function description Signed-off-by: Caio Novais <caionovais@usp.br> Link: https://lore.kernel.org/r/20230325131532.6356-1-caionovais@usp.br Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> drm/scheduler: Add fence deadline support As the finished fence is the one that is exposed to userspace, and therefore the one that other operations, like atomic update, would block on, we need to propagate the deadline from from the finished fence to the actual hw fence. v2: Split into drm_sched_fence_set_parent() (ckoenig) v3: Ensure a thread calling drm_sched_fence_set_deadline_finished() sees fence->parent set before drm_sched_fence_set_parent() does this test_bit(DMA_FENCE_FLAG_HAS_DEADLINE_BIT). Signed-off-by: Rob Clark <robdclark@chromium.org> Acked-by: Luben Tuikov <luben.tuikov@amd.com> Revert "drm/scheduler: track GPU active time per entity" This reverts commit df622729ddbf as it introduces a use-after-free, which isn't easy to fix without going back to the design drawing board. Reported-by: Danilo Krummrich <dakr@redhat.com> Signed-off-by: Lucas Stach <l.stach@pengutronix.de> drm/scheduler: Fix UAF race in drm_sched_entity_push_job() After a job is pushed into the queue, it is owned by the scheduler core and may be freed at any time, so we can't write nor read the submit timestamp after that point. Fixes oopses observed with the drm/asahi driver, found with kASAN. Signed-off-by: Asahi Lina <lina@asahilina.net> Link: https://lore.kernel.org/r/20230406-scheduler-uaf-2-v1-1-972531cf0a81@asahilina.net Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> drm/sched: Check scheduler ready before calling timeout handling During an IGT GPU reset test we see the following oops, [ +0.000003] ------------[ cut here ]------------ [ +0.000000] WARNING: CPU: 9 PID: 0 at kernel/workqueue.c:1656 __queue_delayed_work+0x6d/0xa0 [ +0.000004] Modules linked in: iptable_filter bpfilter amdgpu(OE) nls_iso8859_1 snd_hda_codec_realtek snd_hda_codec_generic intel_rapl_msr ledtrig_audio snd_hda_codec_hdmi intel_rapl_common snd_hda_intel edac_mce_amd snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core iommu_v2 gpu_sched(OE) kvm_amd drm_buddy snd_hwdep kvm video drm_ttm_helper snd_pcm ttm snd_seq_midi drm_display_helper snd_seq_midi_event snd_rawmidi cec crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 snd_seq aesni_intel rc_core crypto_simd cryptd binfmt_misc drm_kms_helper rapl snd_seq_device input_leds joydev snd_timer i2c_algo_bit syscopyarea snd ccp sysfillrect sysimgblt wmi_bmof k10temp soundcore mac_hid sch_fq_codel msr parport_pc ppdev drm lp parport ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 hid_generic usbhid hid r8169 ahci xhci_pci gpio_amdpt realtek i2c_piix4 wmi crc32_pclmul xhci_pci_renesas libahci gpio_generic [ +0.000070] CPU: 9 PID: 0 Comm: swapper/9 Tainted: G W OE 6.1.11+ #2 [ +0.000003] Hardware name: Gigabyte Technology Co., Ltd. AB350-Gaming 3/AB350-Gaming 3-CF, BIOS F7 06/16/2017 [ +0.000001] RIP: 0010:__queue_delayed_work+0x6d/0xa0 [ +0.000003] Code: 7a 50 48 01 c1 48 89 4a 30 81 ff 00 20 00 00 75 38 4c 89 cf e8 64 3e 0a 00 5d e9 1e c5 11 01 e8 99 f7 ff ff 5d e9 13 c5 11 01 <0f> 0b eb c1 0f 0b 48 81 7a 38 70 5c 0e 81 74 9f 0f 0b 48 8b 42 28 [ +0.000002] RSP: 0018:ffffc90000398d60 EFLAGS: 00010007 [ +0.000002] RAX: ffff88810d589c60 RBX: 0000000000000000 RCX: 0000000000000000 [ +0.000002] RDX: ffff88810d589c58 RSI: 0000000000000000 RDI: 0000000000002000 [ +0.000001] RBP: ffffc90000398d60 R08: 0000000000000000 R09: ffff88810d589c78 [ +0.000002] R10: 72705f305f39765f R11: 7866673a6d72645b R12: ffff88810d589c58 [ +0.000001] R13: 0000000000002000 R14: 0000000000000000 R15: 0000000000000000 [ +0.000002] FS: 0000000000000000(0000) GS:ffff8887fee40000(0000) knlGS:0000000000000000 [ +0.000001] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ +0.000002] CR2: 00005562c4797fa0 CR3: 0000000110da0000 CR4: 00000000003506e0 [ +0.000002] Call Trace: [ +0.000001] <IRQ> [ +0.000001] mod_delayed_work_on+0x5e/0xa0 [ +0.000004] drm_sched_fault+0x23/0x30 [gpu_sched] [ +0.000007] gfx_v9_0_fault.isra.0+0xa6/0xd0 [amdgpu] [ +0.000258] gfx_v9_0_priv_reg_irq+0x29/0x40 [amdgpu] [ +0.000254] amdgpu_irq_dispatch+0x1ac/0x2b0 [amdgpu] [ +0.000243] amdgpu_ih_process+0x89/0x130 [amdgpu] [ +0.000245] amdgpu_irq_handler+0x24/0x60 [amdgpu] [ +0.000165] __handle_irq_event_percpu+0x4f/0x1a0 [ +0.000003] handle_irq_event_percpu+0x15/0x50 [ +0.000001] handle_irq_event+0x39/0x60 [ +0.000002] handle_edge_irq+0xa8/0x250 [ +0.000003] __common_interrupt+0x7b/0x150 [ +0.000002] common_interrupt+0xc1/0xe0 [ +0.000003] </IRQ> [ +0.000000] <TASK> [ +0.000001] asm_common_interrupt+0x27/0x40 [ +0.000002] RIP: 0010:native_safe_halt+0xb/0x10 [ +0.000003] Code: 46 ff ff ff cc cc cc cc cc cc cc cc cc cc cc eb 07 0f 00 2d 69 f2 5e 00 f4 e9 f1 3b 3e 00 90 eb 07 0f 00 2d 59 f2 5e 00 fb f4 <e9> e0 3b 3e 00 0f 1f 44 00 00 55 48 89 e5 53 e8 b1 d4 fe ff 66 90 [ +0.000002] RSP: 0018:ffffc9000018fdc8 EFLAGS: 00000246 [ +0.000002] RAX: 0000000000004000 RBX: 000000000002e5a8 RCX: 000000000000001f [ +0.000001] RDX: 0000000000000001 RSI: ffff888101298800 RDI: ffff888101298864 [ +0.000001] RBP: ffffc9000018fdd0 R08: 000000527f64bd8b R09: 000000000001dc90 [ +0.000001] R10: 000000000001dc90 R11: 0000000000000003 R12: 0000000000000001 [ +0.000001] R13: ffff888101298864 R14: ffffffff832d9e20 R15: ffff888193aa8c00 [ +0.000003] ? acpi_idle_do_entry+0x5e/0x70 [ +0.000002] acpi_idle_enter+0xd1/0x160 [ +0.000003] cpuidle_enter_state+0x9a/0x6e0 [ +0.000003] cpuidle_enter+0x2e/0x50 [ +0.000003] call_cpuidle+0x23/0x50 [ +0.000002] do_idle+0x1de/0x260 [ +0.000002] cpu_startup_entry+0x20/0x30 [ +0.000002] start_secondary+0x120/0x150 [ +0.000003] secondary_startup_64_no_verify+0xe5/0xeb [ +0.000004] </TASK> [ +0.000000] ---[ end trace 0000000000000000 ]--- [ +0.000003] BUG: kernel NULL pointer dereference, address: 0000000000000102 [ +0.006233] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=3, emitted seq=4 [ +0.000734] #PF: supervisor read access in kernel mode [ +0.009670] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process amd_deadlock pid 2002 thread amd_deadlock pid 2002 [ +0.005135] #PF: error_code(0x0000) - not-present page [ +0.000002] PGD 0 P4D 0 [ +0.000002] Oops: 0000 [#1] PREEMPT SMP NOPTI [ +0.000002] CPU: 9 PID: 0 Comm: swapper/9 Tainted: G W OE 6.1.11+ #2 [ +0.000002] Hardware name: Gigabyte Technology Co., Ltd. AB350-Gaming 3/AB350-Gaming 3-CF, BIOS F7 06/16/2017 [ +0.012101] amdgpu 0000:0c:00.0: amdgpu: GPU reset begin! [ +0.005136] RIP: 0010:__queue_work+0x1f/0x4e0 [ +0.000004] Code: 87 cd 11 01 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 d5 41 54 49 89 f4 53 48 83 ec 10 89 7d d4 <f6> 86 02 01 00 00 01 0f 85 6c 03 00 00 e8 7f 36 08 00 8b 45 d4 48 For gfx_rings the schedulers may not be initialized by amdgpu_device_init_schedulers() due to ring->no_scheduler flag being set to true and thus the timeout_wq is NULL. As a result, since all ASICs call drm_sched_fault() unconditionally even for schedulers which have not been initialized, it is simpler to use the ready condition which indicates whether the given scheduler worker thread runs and whether the timeout_wq of the reset domain has been initialized. Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Cc: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://lore.kernel.org/r/20230406200054.633379-1-luben.tuikov@amd.com drm/scheduler: set entity to NULL in drm_sched_entity_pop_job() It already happend a few times that patches slipped through which implemented access to an entity through a job that was already removed from the entities queue. Since jobs and entities might have different lifecycles, this can potentially cause UAF bugs. In order to make it obvious that a jobs entity pointer shouldn't be accessed after drm_sched_entity_pop_job() was called successfully, set the jobs entity pointer to NULL once the job is removed from the entity queue. Moreover, debugging a potential NULL pointer dereference is way easier than potentially corrupted memory through a UAF. Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://lore.kernel.org/r/20230418100453.4433-1-dakr@redhat.com Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> drm/scheduler: properly forward fence errors When a hw fence is signaled with an error properly forward that to the finished fence. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230420115752.31470-1-christian.koenig@amd.com drm/scheduler: add drm_sched_entity_error and use rcu for last_scheduled Switch to using RCU handling for the last scheduled job and add a function to return the error code of it. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230420115752.31470-2-christian.koenig@amd.com drm/scheduler: mark jobs without fence as canceled When no hw fence is provided for a job that means that the job didn't executed. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230427122726.1290170-1-christian.koenig@amd.com drm/sched: Check scheduler work queue before calling timeout handling During an IGT GPU reset test we see again oops despite of commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling timeout handling). It uses ready condition whether to call drm_sched_fault which unwind the TDR leads to GPU reset. However it looks the ready condition is overloaded with other meanings, for example, for the following stack is related GPU reset : 0 gfx_v9_0_cp_gfx_start 1 gfx_v9_0_cp_gfx_resume 2 gfx_v9_0_cp_resume 3 gfx_v9_0_hw_init 4 gfx_v9_0_resume 5 amdgpu_device_ip_resume_phase2 does the following: /* start the ring */ gfx_v9_0_cp_gfx_start(adev); ring->sched.ready = true; The same approach is for other ASICs as well : gfx_v8_0_cp_gfx_resume gfx_v10_0_kiq_resume, etc... As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault and then drm_sched_fault. However now it depends on whether the interrupt service routine drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready field of the scheduler to true even for uninitialized schedulers and causes oops vs no fault or when ISR drm_sched_fault is completed prior gfx_v9_0_cp_gfx_start and NULL pointer dereference does not occur. Use the field timeout_wq to prevent oops for uninitialized schedulers. The field could be initialized by the work queue of resetting the domain. v1: Corrections to commit message (Luben) Fixes: 11b3b9f461c5c4 ("drm/sched: Check scheduler ready before calling timeout handling") Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Link: https://lore.kernel.org/r/20230510135111.58631-1-vitaly.prosyak@amd.com Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> drm/sched: Remove redundant check The rq pointer points inside the drm_gpu_scheduler structure. Thus it can't be NULL. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: c61cdbdbffc1 ("drm/scheduler: Fix hang when sched_entity released") Signed-off-by: Vladislav Efanov <VEfanov@ispras.ru> Link: https://lore.kernel.org/r/20230517125247.434103-1-VEfanov@ispras.ru Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> drm/sched: Rename to drm_sched_can_queue() Rename drm_sched_ready() to drm_sched_can_queue(). "ready" can mean many things and is thus meaningless in this context. Instead, rename to a name which precisely conveys what is being checked. Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Link: https://lore.kernel.org/r/20230517233550.377847-1-luben.tuikov@amd.com drm/sched: Rename to drm_sched_wakeup_if_can_queue() Rename drm_sched_wakeup() to drm_sched_wakeup_if_canqueue() since the former is misleading, as it wakes up the GPU scheduler _only if_ more jobs can be queued to the underlying hardware. This distinction is important to make, since the wake conditional in the GPU scheduler thread wakes up when other conditions are also true, e.g. when there are jobs to be cleaned. For instance, a user might want to wake up the scheduler only because there are more jobs to clean, but whether we can queue more jobs is irrelevant. v2: Separate "canqueue" to "can_queue". (Alex D.) Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://lore.kernel.org/r/20230517233550.377847-2-luben.tuikov@amd.com Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> drm/scheduler: avoid infinite loop if entity's dependency is a scheduled error fence [Why] drm_sched_entity_add_dependency_cb ignores the scheduled fence and return false. If entity's dependency is a scheduler error fence and drm_sched_stop is called due to TDR, drm_sched_entity_pop_job will wait for the dependency infinitely. [How] Do not wait or ignore the scheduled error fence, add drm_sched_entity_wakeup callback for the dependency with scheduled error fence. Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> drm/sched: Make sure we wait for all dependencies in kill_jobs_cb() drm_sched_entity_kill_jobs_cb() logic is omitting the last fence popped from the dependency array that was waited upon before drm_sched_entity_kill() was called (drm_sched_entity::dependency field), so we're basically waiting for all dependencies except one. In theory, this wait shouldn't be needed because resources should have their users registered to the dma_resv object, thus guaranteeing that future jobs wanting to access these resources wait on all the previous users (depending on the access type, of course). But we want to keep these explicit waits in the kill entity path just in case. Let's make sure we keep all dependencies in the array in drm_sched_job_dependency(), so we can iterate over the array and wait in drm_sched_entity_kill_jobs_cb(). We also make sure we wait on drm_sched_fence::finished if we were originally asked to wait on drm_sched_fence::scheduled. In that case, we assume the intent was to delegate the wait to the firmware/GPU or rely on the pipelining done at the entity/scheduler level, but when killing jobs, we really want to wait for completion not just scheduling. v2: - Don't evict deps in drm_sched_job_dependency() v3: - Always wait for drm_sched_fence::finished fences in drm_sched_entity_kill_jobs_cb() when we see a sched_fence v4: - Fix commit message - Fix a use-after-free bug v5: - Flag deps on which we should only wait for the scheduled event at insertion time v6: - Back to v4 implementation - Add Christian's R-b Cc: Frank Binns <frank.binns@imgtec.com> Cc: Sarah Walker <sarah.walker@imgtec.com> Cc: Donald Robson <donald.robson@imgtec.com> Cc: Luben Tuikov <luben.tuikov@amd.com> Cc: David Airlie <airlied@gmail.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: "Christian König" <christian.koenig@amd.com> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Suggested-by: "Christian König" <christian.koenig@amd.com> Reviewed-by: "Christian König" <christian.koenig@amd.com> Acked-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230619071921.3465992-1-boris.brezillon@collabora.com drm/sched: Call drm_sched_fence_set_parent() from drm_sched_fence_scheduled() Drivers that can delegate waits to the firmware/GPU pass the scheduled fence to drm_sched_job_add_dependency(), and issue wait commands to the firmware/GPU at job submission time. For this to be possible, they need all their 'native' dependencies to have a valid parent since this is where the actual HW fence information are encoded. In drm_sched_main(), we currently call drm_sched_fence_set_parent() after drm_sched_fence_scheduled(), leaving a short period of time during which the job depending on this fence can be submitted. Since setting parent and signaling the fence are two things that are kinda related (you can't have a parent if the job hasn't been scheduled), it probably makes sense to pass the parent fence to drm_sched_fence_scheduled() and let it call drm_sched_fence_set_parent() before it signals the scheduled fence. Here is a detailed description of the race we are fixing here: Thread A Thread B - calls drm_sched_fence_scheduled() - signals s_fence->scheduled which wakes up thread B - entity dep signaled, checking the next dep - no more deps waiting - entity is picked for job submission by drm_gpu_scheduler - run_job() is called - run_job() tries to collect native fence info from s_fence->parent, but it's NULL => BOOM, we can't do our native wait - calls drm_sched_fence_set_parent() v2: * Fix commit message v3: * Add a detailed description of the race to the commit message * Add Luben's R-b Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Cc: Frank Binns <frank.binns@imgtec.com> Cc: Sarah Walker <sarah.walker@imgtec.com> Cc: Donald Robson <donald.robson@imgtec.com> Cc: Luben Tuikov <luben.tuikov@amd.com> Cc: David Airlie <airlied@gmail.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: "Christian König" <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230623075204.382350-1-boris.brezillon@collabora.com dma-buf: add dma_fence_timestamp helper When a fence signals there is a very small race window where the timestamp isn't updated yet. sync_file solves this by busy waiting for the timestamp to appear, but on other ocassions didn't handled this correctly. Provide a dma_fence_timestamp() helper function for this and use it in all appropriate cases. Another alternative would be to grab the spinlock when that happens. v2 by teddy: add a wait parameter to wait for the timestamp to show up, in case the accurate timestamp is needed and/or the timestamp is not based on ktime (e.g. hw timestamp) v3 chk: drop the parameter again for unified handling Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Fixes: 1774baa64f93 ("drm/scheduler: Change scheduled fence track v2") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> CC: stable@vger.kernel.org Link: https://patchwork.freedesktop.org/patch/msgid/20230929104725.2358-1-christian.koenig@amd.com drm/sched: Convert the GPU scheduler to variable number of run-queues The GPU scheduler has now a variable number of run-queues, which are set up at drm_sched_init() time. This way, each driver announces how many run-queues it requires (supports) per each GPU scheduler it creates. Note, that run-queues correspond to scheduler "priorities", thus if the number of run-queues is set to 1 at drm_sched_init(), then that scheduler supports a single run-queue, i.e. single "priority". If a driver further sets a single entity per run-queue, then this creates a 1-to-1 correspondence between a scheduler and a scheduled entity. Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Russell King <linux+etnaviv@armlinux.org.uk> Cc: Qiang Yu <yuq825@gmail.com> Cc: Rob Clark <robdclark@gmail.com> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Danilo Krummrich <dakr@redhat.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Boris Brezillon <boris.brezillon@collabora.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Emma Anholt <emma@anholt.net> Cc: etnaviv@lists.freedesktop.org Cc: lima@lists.freedesktop.org Cc: linux-arm-msm@vger.kernel.org Cc: freedreno@lists.freedesktop.org Cc: nouveau@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Link: https://lore.kernel.org/r/20231023032251.164775-1-luben.tuikov@amd.com drm/sched: Add drm_sched_wqueue_* helpers Add scheduler wqueue ready, stop, and start helpers to hide the implementation details of the scheduler from the drivers. v2: - s/sched_wqueue/sched_wqueue (Luben) - Remove the extra white line after the return-statement (Luben) - update drm_sched_wqueue_ready comment (Luben) Cc: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://lore.kernel.org/r/20231031032439.1558703-2-matthew.brost@intel.com Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Convert drm scheduler to use a work queue rather than kthread In Xe, the new Intel GPU driver, a choice has made to have a 1 to 1 mapping between a drm_gpu_scheduler and drm_sched_entity. At first this seems a bit odd but let us explain the reasoning below. 1. In Xe the submission order from multiple drm_sched_entity is not guaranteed to be the same completion even if targeting the same hardware engine. This is because in Xe we have a firmware scheduler, the GuC, which allowed to reorder, timeslice, and preempt submissions. If a using shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls apart as the TDR expects submission order == completion order. Using a dedicated drm_gpu_scheduler per drm_sched_entity solve this problem. 2. In Xe submissions are done via programming a ring buffer (circular buffer), a drm_gpu_scheduler provides a limit on number of jobs, if the limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow control on the ring for free. A problem with this design is currently a drm_gpu_scheduler uses a kthread for submission / job cleanup. This doesn't scale if a large number of drm_gpu_scheduler are used. To work around the scaling issue, use a worker rather than kthread for submission / job cleanup. v2: - (Rob Clark) Fix msm build - Pass in run work queue v3: - (Boris) don't have loop in worker v4: - (Tvrtko) break out submit ready, stop, start helpers into own patch v5: - (Boris) default to ordered work queue v6: - (Luben / checkpatch) fix alignment in msm_ringbuffer.c - (Luben) s/drm_sched_submit_queue/drm_sched_wqueue_enqueue - (Luben) Update comment for drm_sched_wqueue_enqueue - (Luben) Positive check for submit_wq in drm_sched_init - (Luben) s/alloc_submit_wq/own_submit_wq v7: - (Luben) s/drm_sched_wqueue_enqueue/drm_sched_run_job_queue v8: - (Luben) Adjust var names / comments Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://lore.kernel.org/r/20231031032439.1558703-3-matthew.brost@intel.com Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Split free_job into own work item Rather than call free_job and run_job in same work item have a dedicated work item for each. This aligns with the design and intended use of work queues. v2: - Test for DMA_FENCE_FLAG_TIMESTAMP_BIT before setting timestamp in free_job() work item (Danilo) v3: - Drop forward dec of drm_sched_select_entity (Boris) - Return in drm_sched_run_job_work if entity NULL (Boris) v4: - Replace dequeue with peek and invert logic (Luben) - Wrap to 100 lines (Luben) - Update comments for *_queue / *_queue_if_ready functions (Luben) v5: - Drop peek argument, blindly reinit idle (Luben) - s/drm_sched_free_job_queue_if_ready/drm_sched_free_job_queue_if_done (Luben) - Update work_run_job & work_free_job kernel doc (Luben) v6: - Do not move drm_sched_select_entity in file (Luben) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20231031032439.1558703-4-matthew.brost@intel.com Reviewed-by: Luben Tuikov <ltuikov89@gmail.com> Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Add drm_sched_start_timeout_unlocked helper Also add a lockdep assert to drm_sched_start_timeout. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://lore.kernel.org/r/20231031032439.1558703-5-matthew.brost@intel.com Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Add a helper to queue TDR immediately Add a helper whereby a driver can invoke TDR immediately. v2: - Drop timeout args, rename function, use mod delayed work (Luben) v3: - s/XE/Xe (Luben) - present tense in commit message (Luben) - Adjust comment for drm_sched_tdr_queue_imm (Luben) v4: - Adjust commit message (Luben) Cc: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://lore.kernel.org/r/20231031032439.1558703-6-matthew.brost@intel.com Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Rename drm_sched_get_cleanup_job to be more descriptive "Get cleanup job" makes it sound like helper is returning a job which will execute some cleanup, or something, while the kerneldoc itself accurately says "fetch the next _finished_ job". So lets rename the helper to be self documenting. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Luben Tuikov <ltuikov89@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231102105538.391648-2-tvrtko.ursulin@linux.intel.com Reviewed-by: Luben Tuikov <ltuikov89@gmail.com> Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Move free worker re-queuing out of the if block Whether or not there are more jobs to clean up does not depend on the existance of the current job, given both drm_sched_get_finished_job and drm_sched_free_job_queue_if_done take and drop the job list lock. Therefore it is confusing to make it read like there is a dependency. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Luben Tuikov <ltuikov89@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231102105538.391648-3-tvrtko.ursulin@linux.intel.com Reviewed-by: Luben Tuikov <ltuikov89@gmail.com> Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Rename drm_sched_free_job_queue to be more descriptive The current name makes it sound like helper will free a queue, while what it does is it enqueues the free job worker. Rename it to drm_sched_run_free_queue to align with existing drm_sched_run_job_queue. Despite that creating an illusion there are two queues, while in reality there is only one, at least it creates a consistent naming for the two enqueuing helpers. At the same time simplify the "if done" helper by dropping the suffix and adding a double underscore prefix to the one which just enqueues. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Luben Tuikov <ltuikov89@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231102105538.391648-4-tvrtko.ursulin@linux.intel.com Reviewed-by: Luben Tuikov <ltuikov89@gmail.com> Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Rename drm_sched_run_job_queue_if_ready and clarify kerneldoc "If ready" is not immediately clear what it means - is the scheduler ready or something else? Drop the suffix, clarify kerneldoc, and employ the same naming scheme as in drm_sched_run_free_queue: - drm_sched_run_job_queue - enqueues if there is something to enqueue *and* scheduler is ready (can queue) - __drm_sched_run_job_queue - low-level helper to simply queue the job Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Luben Tuikov <ltuikov89@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231102105538.391648-5-tvrtko.ursulin@linux.intel.com Reviewed-by: Luben Tuikov <ltuikov89@gmail.com> Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Drop suffix from drm_sched_wakeup_if_can_queue Because a) helper is exported to other parts of the scheduler and b) there isn't a plain drm_sched_wakeup to begin with, I think we can drop the suffix and by doing so separate the intimiate knowledge between the scheduler components a bit better. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Luben Tuikov <ltuikov89@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231102105538.391648-6-tvrtko.ursulin@linux.intel.com Reviewed-by: Luben Tuikov <ltuikov89@gmail.com> Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Don't disturb the entity when in RR-mode scheduling Don't call drm_sched_select_entity() in drm_sched_run_job_queue(). In fact, rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let it do just that, schedule the work item for execution. The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity() to determine if the scheduler has an entity ready in one of its run-queues, and in the case of the Round-Robin (RR) scheduling, the function drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity which is ready, sets up the run-queue and completion and returns that entity. The FIFO scheduling algorithm is unaffected. Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), then in the case of RR scheduling, that would result in drm_sched_select_entity() having been called twice, which may result in skipping a ready entity if more than one entity is ready. This commit fixes this by eliminating the call to drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it only in drm_sched_run_job_work(). v2: Rebased on top of Tvrtko's renames series of patches. (Luben) Add fixes-tag. (Tvrtko) Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item") Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231107041020.10035-2-ltuikov89@gmail.com drm/sched: Qualify drm_sched_wakeup() by drm_sched_entity_is_ready() Don't "wake up" the GPU scheduler unless the entity is ready, as well as we can queue to the scheduler, i.e. there is no point in waking up the scheduler for the entity unless the entity is ready. Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> Fixes: bc8d6a9df99038 ("drm/sched: Don't disturb the entity when in RR-mode scheduling") Reviewed-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231110000123.72565-2-ltuikov89@gmail.com drm/sched: implement dynamic job-flow control Currently, job flow control is implemented simply by limiting the number of jobs in flight. Therefore, a scheduler is initialized with a credit limit that corresponds to the number of jobs w…
JacobZang
pushed a commit
to JacobZang/linux
that referenced
this pull request
Sep 5, 2024
Arm has introduced a new v10 GPU architecture that replaces the Job Manager interface with a new Command Stream Frontend. It adds firmware driven command stream queues that can be used by kernel and user space to submit jobs to the GPU. Add the initial schema for the device tree that is based on support for RK3588 SoC. The minimum number of clocks is one for the IP, but on Rockchip platforms they will tend to expose the semi-independent clocks for better power management. v5: - Move the opp-table node under the gpu node v4: - Fix formatting issue v3: - Cleanup commit message to remove redundant text - Added opp-table property and re-ordered entries - Clarified power-domains and power-domain-names requirements for RK3588. - Cleaned up example Note: power-domains and power-domain-names requirements for other platforms are still work in progress, hence the bindings are left incomplete here. v2: - New commit Signed-off-by: Liviu Dudau <liviu.dudau@arm.com> Cc: Krzysztof Kozlowski <krzysztof.kozlowski+dt@linaro.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Conor Dooley <conor+dt@kernel.org> Cc: devicetree@vger.kernel.org Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Rob Herring <robh@kernel.org> drm: execution context for GEM buffers v7 This adds the infrastructure for an execution context for GEM buffers which is similar to the existing TTMs execbuf util and intended to replace it in the long term. The basic functionality is that we abstracts the necessary loop to lock many different GEM buffers with automated deadlock and duplicate handling. v2: drop xarray and use dynamic resized array instead, the locking overhead is unnecessary and measurable. v3: drop duplicate tracking, radeon is really the only one needing that. v4: fixes issues pointed out by Danilo, some typos in comments and a helper for lock arrays of GEM objects. v5: some suggestions by Boris Brezillon, especially just use one retry macro, drop loop in prepare_array, use flags instead of bool v6: minor changes suggested by Thomas, Boris and Danilo v7: minor typos pointed out by checkpatch.pl fixed Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Danilo Krummrich <dakr@redhat.com> Tested-by: Danilo Krummrich <dakr@redhat.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230711133122.3710-2-christian.koenig@amd.com drm: manager to keep track of GPUs VA mappings Add infrastructure to keep track of GPU virtual address (VA) mappings with a decicated VA space manager implementation. New UAPIs, motivated by Vulkan sparse memory bindings graphics drivers start implementing, allow userspace applications to request multiple and arbitrary GPU VA mappings of buffer objects. The DRM GPU VA manager is intended to serve the following purposes in this context. 1) Provide infrastructure to track GPU VA allocations and mappings, using an interval tree (RB-tree). 2) Generically connect GPU VA mappings to their backing buffers, in particular DRM GEM objects. 3) Provide a common implementation to perform more complex mapping operations on the GPU VA space. In particular splitting and merging of GPU VA mappings, e.g. for intersecting mapping requests or partial unmap requests. Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Acked-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Tested-by: Matthew Brost <matthew.brost@intel.com> Tested-by: Donald Robson <donald.robson@imgtec.com> Suggested-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230720001443.2380-2-dakr@redhat.com drm: manager: Fix printk format for size_t sizeof() returns a size_t which may be different to an unsigned long. Use the correct format specifier of '%zu' to prevent compiler warnings. Fixes: e6303f323b1a ("drm: manager to keep track of GPUs VA mappings") Reviewed-by: Danilo Krummrich <dakr@redhat.com> Signed-off-by: Steven Price <steven.price@arm.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/2bf64010-c40a-8b84-144c-5387412b579e@arm.com drm/gpuva_mgr: remove unused prev pointer in __drm_gpuva_sm_map() The prev pointer in __drm_gpuva_sm_map() was used to implement automatic merging of mappings. Since automatic merging did not make its way upstream, remove this leftover. Fixes: e6303f323b1a ("drm: manager to keep track of GPUs VA mappings") Signed-off-by: Danilo Krummrich <dakr@redhat.com> Reviewed-by: Dave Airlie <airlied@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230823233119.2891-1-dakr@redhat.com drm/gpuvm: rename struct drm_gpuva_manager to struct drm_gpuvm Rename struct drm_gpuva_manager to struct drm_gpuvm including corresponding functions. This way the GPUVA manager's structures align very well with the documentation of VM_BIND [1] and VM_BIND locking [2]. It also provides a better foundation for the naming of data structures and functions introduced for implementing a common dma-resv per GPU-VM including tracking of external and evicted objects in subsequent patches. [1] Documentation/gpu/drm-vm-bind-async.rst [2] Documentation/gpu/drm-vm-bind-locking.rst Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Acked-by: Dave Airlie <airlied@redhat.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230920144343.64830-2-dakr@redhat.com drm/gpuvm: allow building as module HB: drivers/gpu/drm/nouveau/Kconfig skipped because there is no gpuvm support of nouveau in 6.1 Currently, the DRM GPUVM does not have any core dependencies preventing a module build. Also, new features from subsequent patches require helpers (namely drm_exec) which can be built as module. Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230920144343.64830-3-dakr@redhat.com drm/gpuvm: convert WARN() to drm_WARN() variants HB: drivers/gpu/drm/nouveau/nouveau_uvmm.c skipped since 6.1 does not support gpuvm on nv Use drm_WARN() and drm_WARN_ON() variants to indicate drivers the context the failing VM resides in. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231108001259.15123-2-dakr@redhat.com drm/gpuvm: don't always WARN in drm_gpuvm_check_overflow() Don't always WARN in drm_gpuvm_check_overflow() and separate it into a drm_gpuvm_check_overflow() and a dedicated drm_gpuvm_warn_check_overflow() variant. This avoids printing warnings due to invalid userspace requests. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231108001259.15123-3-dakr@redhat.com drm/gpuvm: export drm_gpuvm_range_valid() Drivers may use this function to validate userspace requests in advance, hence export it. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231108001259.15123-4-dakr@redhat.com drm/gpuvm: add common dma-resv per struct drm_gpuvm hb: drivers/gpu/drm/nouveau/nouveau_uvmm.c skipped Provide a common dma-resv for GEM objects not being used outside of this GPU-VM. This is used in a subsequent patch to generalize dma-resv, external and evicted object handling and GEM validation. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231108001259.15123-6-dakr@redhat.com drm/gpuvm: add drm_gpuvm_flags to drm_gpuvm HB: drivers/gpu/drm/nouveau/nouveau_uvmm.c skipped Introduce flags for struct drm_gpuvm, this required by subsequent commits. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231108001259.15123-8-dakr@redhat.com drm/gpuvm: reference count drm_gpuvm structures HB: drivers/gpu/drm/nouveau/nouveau_uvmm.c skipped Implement reference counting for struct drm_gpuvm. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231108001259.15123-10-dakr@redhat.com drm/gpuvm: add an abstraction for a VM / BO combination HB: drivers/gpu/drm/nouveau/nouveau_uvmm.c skipped Add an abstraction layer between the drm_gpuva mappings of a particular drm_gem_object and this GEM object itself. The abstraction represents a combination of a drm_gem_object and drm_gpuvm. The drm_gem_object holds a list of drm_gpuvm_bo structures (the structure representing this abstraction), while each drm_gpuvm_bo contains list of mappings of this GEM object. This has multiple advantages: 1) We can use the drm_gpuvm_bo structure to attach it to various lists of the drm_gpuvm. This is useful for tracking external and evicted objects per VM, which is introduced in subsequent patches. 2) Finding mappings of a certain drm_gem_object mapped in a certain drm_gpuvm becomes much cheaper. 3) Drivers can derive and extend the structure to easily represent driver specific states of a BO for a certain GPUVM. The idea of this abstraction was taken from amdgpu, hence the credit for this idea goes to the developers of amdgpu. Cc: Christian König <christian.koenig@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231108001259.15123-11-dakr@redhat.com drm/gpuvm: track/lock/validate external/evicted objects Currently the DRM GPUVM offers common infrastructure to track GPU VA allocations and mappings, generically connect GPU VA mappings to their backing buffers and perform more complex mapping operations on the GPU VA space. However, there are more design patterns commonly used by drivers, which can potentially be generalized in order to make the DRM GPUVM represent a basis for GPU-VM implementations. In this context, this patch aims at generalizing the following elements. 1) Provide a common dma-resv for GEM objects not being used outside of this GPU-VM. 2) Provide tracking of external GEM objects (GEM objects which are shared with other GPU-VMs). 3) Provide functions to efficiently lock all GEM objects dma-resv the GPU-VM contains mappings of. 4) Provide tracking of evicted GEM objects the GPU-VM contains mappings of, such that validation of evicted GEM objects is accelerated. 5) Provide some convinience functions for common patterns. Big thanks to Boris Brezillon for his help to figure out locking for drivers updating the GPU VA space within the fence signalling path. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Suggested-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231108001259.15123-12-dakr@redhat.com drm/gpuvm: fall back to drm_exec_lock_obj() Fall back to drm_exec_lock_obj() if num_fences is zero for the drm_gpuvm_prepare_* function family. Otherwise dma_resv_reserve_fences() would actually allocate slots even though num_fences is zero. Cc: Christian König <christian.koenig@amd.com> Acked-by: Donald Robson <donald.robson@imgtec.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231129220835.297885-2-dakr@redhat.com drm/gpuvm: Let drm_gpuvm_bo_put() report when the vm_bo object is destroyed Some users need to release resources attached to the vm_bo object when it's destroyed. In Panthor's case, we need to release the pin ref so BO pages can be returned to the system when all GPU mappings are gone. This could be done through a custom drm_gpuvm::vm_bo_free() hook, but this has all sort of locking implications that would force us to expose a drm_gem_shmem_unpin_locked() helper, not to mention the fact that having a ::vm_bo_free() implementation without a ::vm_bo_alloc() one seems odd. So let's keep things simple, and extend drm_gpuvm_bo_put() to report when the object is destroyed. Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231204151406.1977285-1-boris.brezillon@collabora.com drm/exec: Pass in initial # of objects HB: skipped drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c drivers/gpu/drm/amd/amdgpu/amdgpu_umsch_mm.c drivers/gpu/drm/amd/amdkfd/kfd_svm.c drivers/gpu/drm/imagination/pvr_job.c drivers/gpu/drm/nouveau/nouveau_uvmm.c In cases where the # is known ahead of time, it is silly to do the table resize dance. Signed-off-by: Rob Clark <robdclark@chromium.org> Reviewed-by: Christian König <christian.koenig@amd.com> Patchwork: https://patchwork.freedesktop.org/patch/568338/ drm/gem-shmem: When drm_gem_object_init failed, should release object when goto err_free, the object had init, so it should be release when fail. Signed-off-by: ChunyouTang <tangchunyou@163.com> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20221119064131.364-1-tangchunyou@163.com drm: Remove usage of deprecated DRM_DEBUG_PRIME drm_print.h says DRM_DEBUG_PRIME is deprecated in favor of drm_dbg_prime(). Signed-off-by: Siddh Raman Pant <code@siddh.me> Reviewed-by: Simon Ser <contact@emersion.fr> Signed-off-by: Simon Ser <contact@emersion.fr> Link: https://patchwork.freedesktop.org/patch/msgid/cd663b1bc42189e55898cddecdb3b73c591b341a.1673269059.git.code@siddh.me drm/shmem: Cleanup drm_gem_shmem_create_with_handle() Once we create the handle, the handle owns the reference. Currently nothing was doing anything with the shmem ptr after the handle was created, but let's change drm_gem_shmem_create_with_handle() to not return the pointer, so-as to not encourage problematic use of this function in the future. As a bonus, it makes the code a bit cleaner. Signed-off-by: Rob Clark <robdclark@chromium.org> Reviewed-by: Steven Price <steven.price@arm.com> Signed-off-by: Steven Price <steven.price@arm.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230123154831.3191821-1-robdclark@gmail.com drm/shmem-helper: Fix locking for drm_gem_shmem_get_pages_sgt() Other functions touching shmem->sgt take the pages lock, so do that here too. drm_gem_shmem_get_pages() & co take the same lock, so move to the _locked() variants to avoid recursive locking. Discovered while auditing locking to write the Rust abstractions. Fixes: 2194a63a818d ("drm: Add library for shmem backed GEM objects") Fixes: 4fa3d66f132b ("drm/shmem: Do dma_unmap_sg before purging pages") Signed-off-by: Asahi Lina <lina@asahilina.net> Reviewed-by: Javier Martinez Canillas <javierm@redhat.com> Signed-off-by: Javier Martinez Canillas <javierm@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230205125124.2260-1-lina@asahilina.net drm/shmem-helper: Switch to use drm_* debug helpers Ease debugging of a multi-GPU system by using drm_WARN_*() and drm_dbg_kms() helpers that print out DRM device name corresponding to shmem GEM. Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Suggested-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Link: https://lore.kernel.org/all/20230108210445.3948344-6-dmitry.osipenko@collabora.com/ drm/shmem-helper: Don't use vmap_use_count for dma-bufs DMA-buf core has its own refcounting of vmaps, use it instead of drm-shmem counting. This change prepares drm-shmem for addition of memory shrinker support where drm-shmem will use a single dma-buf reservation lock for all operations performed over dma-bufs. Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Link: https://lore.kernel.org/all/20230108210445.3948344-7-dmitry.osipenko@collabora.com/ drm/shmem-helper: Switch to reservation lock Replace all drm-shmem locks with a GEM reservation lock. This makes locks consistent with dma-buf locking convention where importers are responsible for holding reservation lock for all operations performed over dma-bufs, preventing deadlock between dma-buf importers and exporters. Suggested-by: Daniel Vetter <daniel@ffwll.ch> Acked-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Link: https://lore.kernel.org/all/20230108210445.3948344-8-dmitry.osipenko@collabora.com/ drm/shmem-helper: Revert accidental non-GPL export The referenced commit added a wrapper for drm_gem_shmem_get_pages_sgt(), but in the process it accidentally changed the export type from GPL to non-GPL. Switch it back to GPL. Reported-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Fixes: ddddedaa0db9 ("drm/shmem-helper: Fix locking for drm_gem_shmem_get_pages_sgt()") Signed-off-by: Asahi Lina <lina@asahilina.net> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20230227-shmem-export-fix-v1-1-8880b2c25e81@asahilina.net Revert "drm/shmem-helper: Switch to reservation lock" This reverts commit 67b7836d4458790f1261e31fe0ce3250989784f0. The locking appears incomplete. A caller of SHMEM helper's pin function never acquires the dma-buf reservation lock. So we get WARNING: CPU: 3 PID: 967 at drivers/gpu/drm/drm_gem_shmem_helper.c:243 drm_gem_shmem_pin+0x42/0x90 [drm_shmem_helper] Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Acked-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230228152612.19971-1-tzimmermann@suse.de drm/shmem-helper: Switch to reservation lock Replace all drm-shmem locks with a GEM reservation lock. This makes locks consistent with dma-buf locking convention where importers are responsible for holding reservation lock for all operations performed over dma-bufs, preventing deadlock between dma-buf importers and exporters. Suggested-by: Daniel Vetter <daniel@ffwll.ch> Acked-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Emil Velikov <emil.l.velikov@gmail.com> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230529223935.2672495-7-dmitry.osipenko@collabora.com drm/shmem-helper: Reset vma->vm_ops before calling dma_buf_mmap() The dma-buf backend is supposed to provide its own vm_ops, but some implementation just have nothing special to do and leave vm_ops untouched, probably expecting this field to be zero initialized (this is the case with the system_heap implementation for instance). Let's reset vma->vm_ops to NULL to keep things working with these implementations. Fixes: 26d3ac3cb04d ("drm/shmem-helpers: Redirect mmap for imported dma-buf") Cc: <stable@vger.kernel.org> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Reported-by: Roman Stratiienko <r.stratiienko@gmail.com> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Tested-by: Roman Stratiienko <r.stratiienko@gmail.com> Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20230724112610.60974-1-boris.brezillon@collabora.com iommu: Allow passing custom allocators to pgtable drivers This will be useful for GPU drivers who want to keep page tables in a pool so they can: - keep freed page tables in a free pool and speed-up upcoming page table allocations - batch page table allocation instead of allocating one page at a time - pre-reserve pages for page tables needed for map/unmap operations, to ensure map/unmap operations don't try to allocate memory in paths they're allowed to block or fail It might also be valuable for other aspects of GPU and similar use-cases, like fine-grained memory accounting and resource limiting. We will extend the Arm LPAE format to support custom allocators in a separate commit. Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Steven Price <steven.price@arm.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/20231124142434.1577550-2-boris.brezillon@collabora.com Signed-off-by: Joerg Roedel <jroedel@suse.de> iommu: Extend LPAE page table format to support custom allocators We need that in order to implement the VM_BIND ioctl in the GPU driver targeting new Mali GPUs. VM_BIND is about executing MMU map/unmap requests asynchronously, possibly after waiting for external dependencies encoded as dma_fences. We intend to use the drm_sched framework to automate the dependency tracking and VM job dequeuing logic, but this comes with its own set of constraints, one of them being the fact we are not allowed to allocate memory in the drm_gpu_scheduler_ops::run_job() to avoid this sort of deadlocks: - VM_BIND map job needs to allocate a page table to map some memory to the VM. No memory available, so kswapd is kicked - GPU driver shrinker backend ends up waiting on the fence attached to the VM map job or any other job fence depending on this VM operation. With custom allocators, we will be able to pre-reserve enough pages to guarantee the map/unmap operations we queued will take place without going through the system allocator. But we can also optimize allocation/reservation by not free-ing pages immediately, so any upcoming page table allocation requests can be serviced by some free page table pool kept at the driver level. I might also be valuable for other aspects of GPU and similar use-cases, like fine-grained memory accounting and resource limiting. Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Reviewed-by: Steven Price <steven.price@arm.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/20231124142434.1577550-3-boris.brezillon@collabora.com Signed-off-by: Joerg Roedel <jroedel@suse.de> drm/sched: Add FIFO sched policy to run queue When many entities are competing for the same run queue on the same scheduler, we observe an unusually long wait times and some jobs get starved. This has been observed on GPUVis. The issue is due to the Round Robin policy used by schedulers to pick up the next entity's job queue for execution. Under stress of many entities and long job queues within entity some jobs could be stuck for very long time in it's entity's queue before being popped from the queue and executed while for other entities with smaller job queues a job might execute earlier even though that job arrived later then the job in the long queue. Fix: Add FIFO selection policy to entities in run queue, chose next entity on run queue in such order that if job on one entity arrived earlier then job on another entity the first job will start executing earlier regardless of the length of the entity's job queue. v2: Switch to rb tree structure for entities based on TS of oldest job waiting in the job queue of an entity. Improves next entity extraction to O(1). Entity TS update O(log N) where N is the number of entities in the run-queue Drop default option in module control parameter. v3: Various cosmetical fixes and minor refactoring of fifo update function. (Luben) v4: Switch drm_sched_rq_select_entity_fifo to in order search (Luben) v5: Fix up drm_sched_rq_select_entity_fifo loop (Luben) v6: Add missing drm_sched_rq_remove_fifo_locked v7: Fix ts sampling bug and more cosmetic stuff (Luben) v8: Fix module parameter string (Luben) Cc: Luben Tuikov <luben.tuikov@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Direct Rendering Infrastructure - Development <dri-devel@lists.freedesktop.org> Cc: AMD Graphics <amd-gfx@lists.freedesktop.org> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Tested-by: Yunxiang Li (Teddy) <Yunxiang.Li@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220930041258.1050247-1-luben.tuikov@amd.com drm/scheduler: Set the FIFO scheduling policy as the default The currently default Round-Robin GPU scheduling can result in starvation of entities which have a large number of jobs, over entities which have a very small number of jobs (single digit). This can be illustrated in the following diagram, where jobs are alphabetized to show their chronological order of arrival, where job A is the oldest, B is the second oldest, and so on, to J, the most recent job to arrive. ---> entities j | H-F-----A--E--I-- o | --G-----B-----J-- b | --------C-------- s\/ --------D-------- WLOG, assuming all jobs are "ready", then a R-R scheduling will execute them in the following order (a slice off of the top of the entities' list), H, F, A, E, I, G, B, J, C, D. However, to mitigate job starvation, we'd rather execute C and D before E, and so on, given, of course, that they're all ready to be executed. So, if all jobs are ready at this instant, the order of execution for this and the next 9 instances of picking the next job to execute, should really be, A, B, C, D, E, F, G, H, I, J, which is their chronological order. The only reason for this order to be broken, is if an older job is not yet ready, but a younger job is ready, at an instant of picking a new job to execute. For instance if job C wasn't ready at time 2, but job D was ready, then we'd pick job D, like this: 0 +1 +2 ... A, B, D, ... And from then on, C would be preferred before all other jobs, if it is ready at the time when a new job for execution is picked. So, if C became ready two steps later, the execution order would look like this: ......0 +1 +2 ... A, B, D, E, C, F, G, H, I, J This is what the FIFO GPU scheduling algorithm achieves. It uses a Red-Black tree to keep jobs sorted in chronological order, where picking the oldest job is O(1) (we use the "cached" structure), and balancing the tree is O(log n). IOW, it picks the *oldest ready* job to execute now. The implementation is already in the kernel, and this commit only changes the default GPU scheduling algorithm to use. This was tested and achieves about 1% faster performance over the Round Robin algorithm. Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Cc: Direct Rendering Infrastructure - Development <dri-devel@lists.freedesktop.org> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221024212634.27230-1-luben.tuikov@amd.com Signed-off-by: Christian König <christian.koenig@amd.com> drm/scheduler: add drm_sched_job_add_resv_dependencies Add a new function to update job dependencies from a resv obj. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-3-christian.koenig@amd.com drm/scheduler: remove drm_sched_dependency_optimized Not used any more. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-12-christian.koenig@amd.com drm/scheduler: rework entity flush, kill and fini This was buggy because when we had to wait for entities which were killed as well we would just deadlock. Instead move all the dependency handling into the callbacks so that will all happen asynchronously. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-13-christian.koenig@amd.com drm/scheduler: rename dependency callback into prepare_job This now matches much better what this is doing. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-14-christian.koenig@amd.com drm/amdgpu: revert "implement tdr advanced mode" This reverts commit e6c6338f393b74ac0b303d567bb918b44ae7ad75. This feature basically re-submits one job after another to figure out which one was the one causing a hang. This is obviously incompatible with gang-submit which requires that multiple jobs run at the same time. It's also absolutely not helpful to crash the hardware multiple times if a clean recovery is desired. For testing and debugging environments we should rather disable recovery alltogether to be able to inspect the state with a hw debugger. Additional to that the sw implementation is clearly buggy and causes reference count issues for the hardware fence. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> drm/scheduler: Fix lockup in drm_sched_entity_kill() The drm_sched_entity_kill() is invoked twice by drm_sched_entity_destroy() while userspace process is exiting or being killed. First time it's invoked when sched entity is flushed and second time when entity is released. This causes a lockup within wait_for_completion(entity_idle) due to how completion API works. Calling wait_for_completion() more times than complete() was invoked is a error condition that causes lockup because completion internally uses counter for complete/wait calls. The complete_all() must be used instead in such cases. This patch fixes lockup of Panfrost driver that is reproducible by killing any application in a middle of 3d drawing operation. Fixes: 2fdb8a8f07c2 ("drm/scheduler: rework entity flush, kill and fini") Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221123001303.533968-1-dmitry.osipenko@collabora.com drm/scheduler: deprecate drm_sched_resubmit_jobs This interface is not working as it should. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221109095010.141189-5-christian.koenig@amd.com drm/scheduler: track GPU active time per entity Track the accumulated time that jobs from this entity were active on the GPU. This allows drivers using the scheduler to trivially implement the DRM fdinfo when the hardware doesn't provide more specific information than signalling job completion anyways. [Bagas: Append missing colon to @elapsed_ns] Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> drm/sched: Create wrapper to add a syncobj dependency to job In order to add a syncobj's fence as a dependency to a job, it is necessary to call drm_syncobj_find_fence() to find the fence and then add the dependency with drm_sched_job_add_dependency(). So, wrap these steps in one single function, drm_sched_job_add_syncobj_dependency(). Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Maíra Canal <mcanal@igalia.com> Signed-off-by: Maíra Canal <mairacanal@riseup.net> Link: https://patchwork.freedesktop.org/patch/msgid/20230209124447.467867-2-mcanal@igalia.com drm/scheduler: Fix variable name in function description Compiling AMD GPU drivers displays two warnings: drivers/gpu/drm/scheduler/sched_main.c:738: warning: Function parameter or member 'file' not described in 'drm_sched_job_add_syncobj_dependency' drivers/gpu/drm/scheduler/sched_main.c:738: warning: Excess function parameter 'file_private' description in 'drm_sched_job_add_syncobj_dependency' Get rid of them by renaming the variable name on the function description Signed-off-by: Caio Novais <caionovais@usp.br> Link: https://lore.kernel.org/r/20230325131532.6356-1-caionovais@usp.br Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> drm/scheduler: Add fence deadline support As the finished fence is the one that is exposed to userspace, and therefore the one that other operations, like atomic update, would block on, we need to propagate the deadline from from the finished fence to the actual hw fence. v2: Split into drm_sched_fence_set_parent() (ckoenig) v3: Ensure a thread calling drm_sched_fence_set_deadline_finished() sees fence->parent set before drm_sched_fence_set_parent() does this test_bit(DMA_FENCE_FLAG_HAS_DEADLINE_BIT). Signed-off-by: Rob Clark <robdclark@chromium.org> Acked-by: Luben Tuikov <luben.tuikov@amd.com> Revert "drm/scheduler: track GPU active time per entity" This reverts commit df622729ddbf as it introduces a use-after-free, which isn't easy to fix without going back to the design drawing board. Reported-by: Danilo Krummrich <dakr@redhat.com> Signed-off-by: Lucas Stach <l.stach@pengutronix.de> drm/scheduler: Fix UAF race in drm_sched_entity_push_job() After a job is pushed into the queue, it is owned by the scheduler core and may be freed at any time, so we can't write nor read the submit timestamp after that point. Fixes oopses observed with the drm/asahi driver, found with kASAN. Signed-off-by: Asahi Lina <lina@asahilina.net> Link: https://lore.kernel.org/r/20230406-scheduler-uaf-2-v1-1-972531cf0a81@asahilina.net Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> drm/sched: Check scheduler ready before calling timeout handling During an IGT GPU reset test we see the following oops, [ +0.000003] ------------[ cut here ]------------ [ +0.000000] WARNING: CPU: 9 PID: 0 at kernel/workqueue.c:1656 __queue_delayed_work+0x6d/0xa0 [ +0.000004] Modules linked in: iptable_filter bpfilter amdgpu(OE) nls_iso8859_1 snd_hda_codec_realtek snd_hda_codec_generic intel_rapl_msr ledtrig_audio snd_hda_codec_hdmi intel_rapl_common snd_hda_intel edac_mce_amd snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core iommu_v2 gpu_sched(OE) kvm_amd drm_buddy snd_hwdep kvm video drm_ttm_helper snd_pcm ttm snd_seq_midi drm_display_helper snd_seq_midi_event snd_rawmidi cec crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 snd_seq aesni_intel rc_core crypto_simd cryptd binfmt_misc drm_kms_helper rapl snd_seq_device input_leds joydev snd_timer i2c_algo_bit syscopyarea snd ccp sysfillrect sysimgblt wmi_bmof k10temp soundcore mac_hid sch_fq_codel msr parport_pc ppdev drm lp parport ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 hid_generic usbhid hid r8169 ahci xhci_pci gpio_amdpt realtek i2c_piix4 wmi crc32_pclmul xhci_pci_renesas libahci gpio_generic [ +0.000070] CPU: 9 PID: 0 Comm: swapper/9 Tainted: G W OE 6.1.11+ #2 [ +0.000003] Hardware name: Gigabyte Technology Co., Ltd. AB350-Gaming 3/AB350-Gaming 3-CF, BIOS F7 06/16/2017 [ +0.000001] RIP: 0010:__queue_delayed_work+0x6d/0xa0 [ +0.000003] Code: 7a 50 48 01 c1 48 89 4a 30 81 ff 00 20 00 00 75 38 4c 89 cf e8 64 3e 0a 00 5d e9 1e c5 11 01 e8 99 f7 ff ff 5d e9 13 c5 11 01 <0f> 0b eb c1 0f 0b 48 81 7a 38 70 5c 0e 81 74 9f 0f 0b 48 8b 42 28 [ +0.000002] RSP: 0018:ffffc90000398d60 EFLAGS: 00010007 [ +0.000002] RAX: ffff88810d589c60 RBX: 0000000000000000 RCX: 0000000000000000 [ +0.000002] RDX: ffff88810d589c58 RSI: 0000000000000000 RDI: 0000000000002000 [ +0.000001] RBP: ffffc90000398d60 R08: 0000000000000000 R09: ffff88810d589c78 [ +0.000002] R10: 72705f305f39765f R11: 7866673a6d72645b R12: ffff88810d589c58 [ +0.000001] R13: 0000000000002000 R14: 0000000000000000 R15: 0000000000000000 [ +0.000002] FS: 0000000000000000(0000) GS:ffff8887fee40000(0000) knlGS:0000000000000000 [ +0.000001] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ +0.000002] CR2: 00005562c4797fa0 CR3: 0000000110da0000 CR4: 00000000003506e0 [ +0.000002] Call Trace: [ +0.000001] <IRQ> [ +0.000001] mod_delayed_work_on+0x5e/0xa0 [ +0.000004] drm_sched_fault+0x23/0x30 [gpu_sched] [ +0.000007] gfx_v9_0_fault.isra.0+0xa6/0xd0 [amdgpu] [ +0.000258] gfx_v9_0_priv_reg_irq+0x29/0x40 [amdgpu] [ +0.000254] amdgpu_irq_dispatch+0x1ac/0x2b0 [amdgpu] [ +0.000243] amdgpu_ih_process+0x89/0x130 [amdgpu] [ +0.000245] amdgpu_irq_handler+0x24/0x60 [amdgpu] [ +0.000165] __handle_irq_event_percpu+0x4f/0x1a0 [ +0.000003] handle_irq_event_percpu+0x15/0x50 [ +0.000001] handle_irq_event+0x39/0x60 [ +0.000002] handle_edge_irq+0xa8/0x250 [ +0.000003] __common_interrupt+0x7b/0x150 [ +0.000002] common_interrupt+0xc1/0xe0 [ +0.000003] </IRQ> [ +0.000000] <TASK> [ +0.000001] asm_common_interrupt+0x27/0x40 [ +0.000002] RIP: 0010:native_safe_halt+0xb/0x10 [ +0.000003] Code: 46 ff ff ff cc cc cc cc cc cc cc cc cc cc cc eb 07 0f 00 2d 69 f2 5e 00 f4 e9 f1 3b 3e 00 90 eb 07 0f 00 2d 59 f2 5e 00 fb f4 <e9> e0 3b 3e 00 0f 1f 44 00 00 55 48 89 e5 53 e8 b1 d4 fe ff 66 90 [ +0.000002] RSP: 0018:ffffc9000018fdc8 EFLAGS: 00000246 [ +0.000002] RAX: 0000000000004000 RBX: 000000000002e5a8 RCX: 000000000000001f [ +0.000001] RDX: 0000000000000001 RSI: ffff888101298800 RDI: ffff888101298864 [ +0.000001] RBP: ffffc9000018fdd0 R08: 000000527f64bd8b R09: 000000000001dc90 [ +0.000001] R10: 000000000001dc90 R11: 0000000000000003 R12: 0000000000000001 [ +0.000001] R13: ffff888101298864 R14: ffffffff832d9e20 R15: ffff888193aa8c00 [ +0.000003] ? acpi_idle_do_entry+0x5e/0x70 [ +0.000002] acpi_idle_enter+0xd1/0x160 [ +0.000003] cpuidle_enter_state+0x9a/0x6e0 [ +0.000003] cpuidle_enter+0x2e/0x50 [ +0.000003] call_cpuidle+0x23/0x50 [ +0.000002] do_idle+0x1de/0x260 [ +0.000002] cpu_startup_entry+0x20/0x30 [ +0.000002] start_secondary+0x120/0x150 [ +0.000003] secondary_startup_64_no_verify+0xe5/0xeb [ +0.000004] </TASK> [ +0.000000] ---[ end trace 0000000000000000 ]--- [ +0.000003] BUG: kernel NULL pointer dereference, address: 0000000000000102 [ +0.006233] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=3, emitted seq=4 [ +0.000734] #PF: supervisor read access in kernel mode [ +0.009670] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process amd_deadlock pid 2002 thread amd_deadlock pid 2002 [ +0.005135] #PF: error_code(0x0000) - not-present page [ +0.000002] PGD 0 P4D 0 [ +0.000002] Oops: 0000 [#1] PREEMPT SMP NOPTI [ +0.000002] CPU: 9 PID: 0 Comm: swapper/9 Tainted: G W OE 6.1.11+ #2 [ +0.000002] Hardware name: Gigabyte Technology Co., Ltd. AB350-Gaming 3/AB350-Gaming 3-CF, BIOS F7 06/16/2017 [ +0.012101] amdgpu 0000:0c:00.0: amdgpu: GPU reset begin! [ +0.005136] RIP: 0010:__queue_work+0x1f/0x4e0 [ +0.000004] Code: 87 cd 11 01 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 d5 41 54 49 89 f4 53 48 83 ec 10 89 7d d4 <f6> 86 02 01 00 00 01 0f 85 6c 03 00 00 e8 7f 36 08 00 8b 45 d4 48 For gfx_rings the schedulers may not be initialized by amdgpu_device_init_schedulers() due to ring->no_scheduler flag being set to true and thus the timeout_wq is NULL. As a result, since all ASICs call drm_sched_fault() unconditionally even for schedulers which have not been initialized, it is simpler to use the ready condition which indicates whether the given scheduler worker thread runs and whether the timeout_wq of the reset domain has been initialized. Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Cc: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://lore.kernel.org/r/20230406200054.633379-1-luben.tuikov@amd.com drm/scheduler: set entity to NULL in drm_sched_entity_pop_job() It already happend a few times that patches slipped through which implemented access to an entity through a job that was already removed from the entities queue. Since jobs and entities might have different lifecycles, this can potentially cause UAF bugs. In order to make it obvious that a jobs entity pointer shouldn't be accessed after drm_sched_entity_pop_job() was called successfully, set the jobs entity pointer to NULL once the job is removed from the entity queue. Moreover, debugging a potential NULL pointer dereference is way easier than potentially corrupted memory through a UAF. Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://lore.kernel.org/r/20230418100453.4433-1-dakr@redhat.com Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> drm/scheduler: properly forward fence errors When a hw fence is signaled with an error properly forward that to the finished fence. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230420115752.31470-1-christian.koenig@amd.com drm/scheduler: add drm_sched_entity_error and use rcu for last_scheduled Switch to using RCU handling for the last scheduled job and add a function to return the error code of it. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230420115752.31470-2-christian.koenig@amd.com drm/scheduler: mark jobs without fence as canceled When no hw fence is provided for a job that means that the job didn't executed. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230427122726.1290170-1-christian.koenig@amd.com drm/sched: Check scheduler work queue before calling timeout handling During an IGT GPU reset test we see again oops despite of commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling timeout handling). It uses ready condition whether to call drm_sched_fault which unwind the TDR leads to GPU reset. However it looks the ready condition is overloaded with other meanings, for example, for the following stack is related GPU reset : 0 gfx_v9_0_cp_gfx_start 1 gfx_v9_0_cp_gfx_resume 2 gfx_v9_0_cp_resume 3 gfx_v9_0_hw_init 4 gfx_v9_0_resume 5 amdgpu_device_ip_resume_phase2 does the following: /* start the ring */ gfx_v9_0_cp_gfx_start(adev); ring->sched.ready = true; The same approach is for other ASICs as well : gfx_v8_0_cp_gfx_resume gfx_v10_0_kiq_resume, etc... As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault and then drm_sched_fault. However now it depends on whether the interrupt service routine drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready field of the scheduler to true even for uninitialized schedulers and causes oops vs no fault or when ISR drm_sched_fault is completed prior gfx_v9_0_cp_gfx_start and NULL pointer dereference does not occur. Use the field timeout_wq to prevent oops for uninitialized schedulers. The field could be initialized by the work queue of resetting the domain. v1: Corrections to commit message (Luben) Fixes: 11b3b9f461c5c4 ("drm/sched: Check scheduler ready before calling timeout handling") Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Link: https://lore.kernel.org/r/20230510135111.58631-1-vitaly.prosyak@amd.com Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> drm/sched: Remove redundant check The rq pointer points inside the drm_gpu_scheduler structure. Thus it can't be NULL. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: c61cdbdbffc1 ("drm/scheduler: Fix hang when sched_entity released") Signed-off-by: Vladislav Efanov <VEfanov@ispras.ru> Link: https://lore.kernel.org/r/20230517125247.434103-1-VEfanov@ispras.ru Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> drm/sched: Rename to drm_sched_can_queue() Rename drm_sched_ready() to drm_sched_can_queue(). "ready" can mean many things and is thus meaningless in this context. Instead, rename to a name which precisely conveys what is being checked. Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Link: https://lore.kernel.org/r/20230517233550.377847-1-luben.tuikov@amd.com drm/sched: Rename to drm_sched_wakeup_if_can_queue() Rename drm_sched_wakeup() to drm_sched_wakeup_if_canqueue() since the former is misleading, as it wakes up the GPU scheduler _only if_ more jobs can be queued to the underlying hardware. This distinction is important to make, since the wake conditional in the GPU scheduler thread wakes up when other conditions are also true, e.g. when there are jobs to be cleaned. For instance, a user might want to wake up the scheduler only because there are more jobs to clean, but whether we can queue more jobs is irrelevant. v2: Separate "canqueue" to "can_queue". (Alex D.) Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://lore.kernel.org/r/20230517233550.377847-2-luben.tuikov@amd.com Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> drm/scheduler: avoid infinite loop if entity's dependency is a scheduled error fence [Why] drm_sched_entity_add_dependency_cb ignores the scheduled fence and return false. If entity's dependency is a scheduler error fence and drm_sched_stop is called due to TDR, drm_sched_entity_pop_job will wait for the dependency infinitely. [How] Do not wait or ignore the scheduled error fence, add drm_sched_entity_wakeup callback for the dependency with scheduled error fence. Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> drm/sched: Make sure we wait for all dependencies in kill_jobs_cb() drm_sched_entity_kill_jobs_cb() logic is omitting the last fence popped from the dependency array that was waited upon before drm_sched_entity_kill() was called (drm_sched_entity::dependency field), so we're basically waiting for all dependencies except one. In theory, this wait shouldn't be needed because resources should have their users registered to the dma_resv object, thus guaranteeing that future jobs wanting to access these resources wait on all the previous users (depending on the access type, of course). But we want to keep these explicit waits in the kill entity path just in case. Let's make sure we keep all dependencies in the array in drm_sched_job_dependency(), so we can iterate over the array and wait in drm_sched_entity_kill_jobs_cb(). We also make sure we wait on drm_sched_fence::finished if we were originally asked to wait on drm_sched_fence::scheduled. In that case, we assume the intent was to delegate the wait to the firmware/GPU or rely on the pipelining done at the entity/scheduler level, but when killing jobs, we really want to wait for completion not just scheduling. v2: - Don't evict deps in drm_sched_job_dependency() v3: - Always wait for drm_sched_fence::finished fences in drm_sched_entity_kill_jobs_cb() when we see a sched_fence v4: - Fix commit message - Fix a use-after-free bug v5: - Flag deps on which we should only wait for the scheduled event at insertion time v6: - Back to v4 implementation - Add Christian's R-b Cc: Frank Binns <frank.binns@imgtec.com> Cc: Sarah Walker <sarah.walker@imgtec.com> Cc: Donald Robson <donald.robson@imgtec.com> Cc: Luben Tuikov <luben.tuikov@amd.com> Cc: David Airlie <airlied@gmail.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: "Christian König" <christian.koenig@amd.com> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Suggested-by: "Christian König" <christian.koenig@amd.com> Reviewed-by: "Christian König" <christian.koenig@amd.com> Acked-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230619071921.3465992-1-boris.brezillon@collabora.com drm/sched: Call drm_sched_fence_set_parent() from drm_sched_fence_scheduled() Drivers that can delegate waits to the firmware/GPU pass the scheduled fence to drm_sched_job_add_dependency(), and issue wait commands to the firmware/GPU at job submission time. For this to be possible, they need all their 'native' dependencies to have a valid parent since this is where the actual HW fence information are encoded. In drm_sched_main(), we currently call drm_sched_fence_set_parent() after drm_sched_fence_scheduled(), leaving a short period of time during which the job depending on this fence can be submitted. Since setting parent and signaling the fence are two things that are kinda related (you can't have a parent if the job hasn't been scheduled), it probably makes sense to pass the parent fence to drm_sched_fence_scheduled() and let it call drm_sched_fence_set_parent() before it signals the scheduled fence. Here is a detailed description of the race we are fixing here: Thread A Thread B - calls drm_sched_fence_scheduled() - signals s_fence->scheduled which wakes up thread B - entity dep signaled, checking the next dep - no more deps waiting - entity is picked for job submission by drm_gpu_scheduler - run_job() is called - run_job() tries to collect native fence info from s_fence->parent, but it's NULL => BOOM, we can't do our native wait - calls drm_sched_fence_set_parent() v2: * Fix commit message v3: * Add a detailed description of the race to the commit message * Add Luben's R-b Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Cc: Frank Binns <frank.binns@imgtec.com> Cc: Sarah Walker <sarah.walker@imgtec.com> Cc: Donald Robson <donald.robson@imgtec.com> Cc: Luben Tuikov <luben.tuikov@amd.com> Cc: David Airlie <airlied@gmail.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: "Christian König" <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230623075204.382350-1-boris.brezillon@collabora.com dma-buf: add dma_fence_timestamp helper When a fence signals there is a very small race window where the timestamp isn't updated yet. sync_file solves this by busy waiting for the timestamp to appear, but on other ocassions didn't handled this correctly. Provide a dma_fence_timestamp() helper function for this and use it in all appropriate cases. Another alternative would be to grab the spinlock when that happens. v2 by teddy: add a wait parameter to wait for the timestamp to show up, in case the accurate timestamp is needed and/or the timestamp is not based on ktime (e.g. hw timestamp) v3 chk: drop the parameter again for unified handling Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Fixes: 1774baa64f93 ("drm/scheduler: Change scheduled fence track v2") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> CC: stable@vger.kernel.org Link: https://patchwork.freedesktop.org/patch/msgid/20230929104725.2358-1-christian.koenig@amd.com drm/sched: Convert the GPU scheduler to variable number of run-queues The GPU scheduler has now a variable number of run-queues, which are set up at drm_sched_init() time. This way, each driver announces how many run-queues it requires (supports) per each GPU scheduler it creates. Note, that run-queues correspond to scheduler "priorities", thus if the number of run-queues is set to 1 at drm_sched_init(), then that scheduler supports a single run-queue, i.e. single "priority". If a driver further sets a single entity per run-queue, then this creates a 1-to-1 correspondence between a scheduler and a scheduled entity. Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Russell King <linux+etnaviv@armlinux.org.uk> Cc: Qiang Yu <yuq825@gmail.com> Cc: Rob Clark <robdclark@gmail.com> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Danilo Krummrich <dakr@redhat.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Boris Brezillon <boris.brezillon@collabora.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Emma Anholt <emma@anholt.net> Cc: etnaviv@lists.freedesktop.org Cc: lima@lists.freedesktop.org Cc: linux-arm-msm@vger.kernel.org Cc: freedreno@lists.freedesktop.org Cc: nouveau@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Link: https://lore.kernel.org/r/20231023032251.164775-1-luben.tuikov@amd.com drm/sched: Add drm_sched_wqueue_* helpers Add scheduler wqueue ready, stop, and start helpers to hide the implementation details of the scheduler from the drivers. v2: - s/sched_wqueue/sched_wqueue (Luben) - Remove the extra white line after the return-statement (Luben) - update drm_sched_wqueue_ready comment (Luben) Cc: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://lore.kernel.org/r/20231031032439.1558703-2-matthew.brost@intel.com Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Convert drm scheduler to use a work queue rather than kthread In Xe, the new Intel GPU driver, a choice has made to have a 1 to 1 mapping between a drm_gpu_scheduler and drm_sched_entity. At first this seems a bit odd but let us explain the reasoning below. 1. In Xe the submission order from multiple drm_sched_entity is not guaranteed to be the same completion even if targeting the same hardware engine. This is because in Xe we have a firmware scheduler, the GuC, which allowed to reorder, timeslice, and preempt submissions. If a using shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls apart as the TDR expects submission order == completion order. Using a dedicated drm_gpu_scheduler per drm_sched_entity solve this problem. 2. In Xe submissions are done via programming a ring buffer (circular buffer), a drm_gpu_scheduler provides a limit on number of jobs, if the limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow control on the ring for free. A problem with this design is currently a drm_gpu_scheduler uses a kthread for submission / job cleanup. This doesn't scale if a large number of drm_gpu_scheduler are used. To work around the scaling issue, use a worker rather than kthread for submission / job cleanup. v2: - (Rob Clark) Fix msm build - Pass in run work queue v3: - (Boris) don't have loop in worker v4: - (Tvrtko) break out submit ready, stop, start helpers into own patch v5: - (Boris) default to ordered work queue v6: - (Luben / checkpatch) fix alignment in msm_ringbuffer.c - (Luben) s/drm_sched_submit_queue/drm_sched_wqueue_enqueue - (Luben) Update comment for drm_sched_wqueue_enqueue - (Luben) Positive check for submit_wq in drm_sched_init - (Luben) s/alloc_submit_wq/own_submit_wq v7: - (Luben) s/drm_sched_wqueue_enqueue/drm_sched_run_job_queue v8: - (Luben) Adjust var names / comments Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://lore.kernel.org/r/20231031032439.1558703-3-matthew.brost@intel.com Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Split free_job into own work item Rather than call free_job and run_job in same work item have a dedicated work item for each. This aligns with the design and intended use of work queues. v2: - Test for DMA_FENCE_FLAG_TIMESTAMP_BIT before setting timestamp in free_job() work item (Danilo) v3: - Drop forward dec of drm_sched_select_entity (Boris) - Return in drm_sched_run_job_work if entity NULL (Boris) v4: - Replace dequeue with peek and invert logic (Luben) - Wrap to 100 lines (Luben) - Update comments for *_queue / *_queue_if_ready functions (Luben) v5: - Drop peek argument, blindly reinit idle (Luben) - s/drm_sched_free_job_queue_if_ready/drm_sched_free_job_queue_if_done (Luben) - Update work_run_job & work_free_job kernel doc (Luben) v6: - Do not move drm_sched_select_entity in file (Luben) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20231031032439.1558703-4-matthew.brost@intel.com Reviewed-by: Luben Tuikov <ltuikov89@gmail.com> Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Add drm_sched_start_timeout_unlocked helper Also add a lockdep assert to drm_sched_start_timeout. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://lore.kernel.org/r/20231031032439.1558703-5-matthew.brost@intel.com Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Add a helper to queue TDR immediately Add a helper whereby a driver can invoke TDR immediately. v2: - Drop timeout args, rename function, use mod delayed work (Luben) v3: - s/XE/Xe (Luben) - present tense in commit message (Luben) - Adjust comment for drm_sched_tdr_queue_imm (Luben) v4: - Adjust commit message (Luben) Cc: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Link: https://lore.kernel.org/r/20231031032439.1558703-6-matthew.brost@intel.com Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Rename drm_sched_get_cleanup_job to be more descriptive "Get cleanup job" makes it sound like helper is returning a job which will execute some cleanup, or something, while the kerneldoc itself accurately says "fetch the next _finished_ job". So lets rename the helper to be self documenting. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Luben Tuikov <ltuikov89@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231102105538.391648-2-tvrtko.ursulin@linux.intel.com Reviewed-by: Luben Tuikov <ltuikov89@gmail.com> Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Move free worker re-queuing out of the if block Whether or not there are more jobs to clean up does not depend on the existance of the current job, given both drm_sched_get_finished_job and drm_sched_free_job_queue_if_done take and drop the job list lock. Therefore it is confusing to make it read like there is a dependency. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Luben Tuikov <ltuikov89@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231102105538.391648-3-tvrtko.ursulin@linux.intel.com Reviewed-by: Luben Tuikov <ltuikov89@gmail.com> Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Rename drm_sched_free_job_queue to be more descriptive The current name makes it sound like helper will free a queue, while what it does is it enqueues the free job worker. Rename it to drm_sched_run_free_queue to align with existing drm_sched_run_job_queue. Despite that creating an illusion there are two queues, while in reality there is only one, at least it creates a consistent naming for the two enqueuing helpers. At the same time simplify the "if done" helper by dropping the suffix and adding a double underscore prefix to the one which just enqueues. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Luben Tuikov <ltuikov89@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231102105538.391648-4-tvrtko.ursulin@linux.intel.com Reviewed-by: Luben Tuikov <ltuikov89@gmail.com> Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Rename drm_sched_run_job_queue_if_ready and clarify kerneldoc "If ready" is not immediately clear what it means - is the scheduler ready or something else? Drop the suffix, clarify kerneldoc, and employ the same naming scheme as in drm_sched_run_free_queue: - drm_sched_run_job_queue - enqueues if there is something to enqueue *and* scheduler is ready (can queue) - __drm_sched_run_job_queue - low-level helper to simply queue the job Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Luben Tuikov <ltuikov89@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231102105538.391648-5-tvrtko.ursulin@linux.intel.com Reviewed-by: Luben Tuikov <ltuikov89@gmail.com> Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Drop suffix from drm_sched_wakeup_if_can_queue Because a) helper is exported to other parts of the scheduler and b) there isn't a plain drm_sched_wakeup to begin with, I think we can drop the suffix and by doing so separate the intimiate knowledge between the scheduler components a bit better. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Luben Tuikov <ltuikov89@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231102105538.391648-6-tvrtko.ursulin@linux.intel.com Reviewed-by: Luben Tuikov <ltuikov89@gmail.com> Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> drm/sched: Don't disturb the entity when in RR-mode scheduling Don't call drm_sched_select_entity() in drm_sched_run_job_queue(). In fact, rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let it do just that, schedule the work item for execution. The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity() to determine if the scheduler has an entity ready in one of its run-queues, and in the case of the Round-Robin (RR) scheduling, the function drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity which is ready, sets up the run-queue and completion and returns that entity. The FIFO scheduling algorithm is unaffected. Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), then in the case of RR scheduling, that would result in drm_sched_select_entity() having been called twice, which may result in skipping a ready entity if more than one entity is ready. This commit fixes this by eliminating the call to drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it only in drm_sched_run_job_work(). v2: Rebased on top of Tvrtko's renames series of patches. (Luben) Add fixes-tag. (Tvrtko) Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item") Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231107041020.10035-2-ltuikov89@gmail.com drm/sched: Qualify drm_sched_wakeup() by drm_sched_entity_is_ready() Don't "wake up" the GPU scheduler unless the entity is ready, as well as we can queue to the scheduler, i.e. there is no point in waking up the scheduler for the entity unless the entity is ready. Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> Fixes: bc8d6a9df99038 ("drm/sched: Don't disturb the entity when in RR-mode scheduling") Reviewed-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231110000123.72565-2-ltuikov89@gmail.com drm/sched: implement dynamic job-flow control Currently, job flow control is implemented simply by limiting the number of jobs in flight. Therefore, a scheduler is initialized with a credit limit that corresponds to the number of jobs w…
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The commit 2e7dc31 had replaced
"extern inline" with "inline" in order to fix the build with GCC 5,
because of new inline semantics. However, this change breaks the build
with GCC 4.9, causing many errors such as:
lib/mpi/generic_mpih-mul1.o: In function
mpihelp_add_1': generic_mpih-mul1.c:(.text+0x0): multiple definition of
mpihelp_add_1'lib/mpi/generic_mpih-lshift.o:generic_mpih-lshift.c:(.text+0x0): first defined here
Fix this like in the mainline commit
9c6bd0c, by using "static inline".
Signed-off-by: Benoît Thébaudeau benoit@wsystem.com