-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[LEGAL][COPYRIGHT INFRINGEMENT] Comply to terms of GNU GPLv2 License, release Redmi (1)'s Linux© kernel source code!/遵从 GNU GPLv2 授权条款的规定,释出红米手机 (1) 的 Linux© 作业系统内核来源代码!/遵從 GNU GPLv2 授權條款的規定,釋出紅米手機 (1) 的 Linux© 作業系統核心來源程式碼! #解放小米 #LibreXiaoMi #4
Comments
I thought it has been done? |
@qiansen1386 As far as I can see HM1S is not same to HM1(the very, original model) |
I see... |
紅米 2 呢? |
@sidealice 請另外建檔新的 issue,謝謝 |
Hi everyone concerning with this issue, please contribute other languages title(& issue summary also) so we can get more attention from @MiCode ;) |
Check this
commit 71abdc1 upstream. When kswapd exits, it can end up taking locks that were previously held by allocating tasks while they waited for reclaim. Lockdep currently warns about this: On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote: > inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage. > kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes: > (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130 > {RECLAIM_FS-ON-W} state was registered at: > mark_held_locks+0xb9/0x140 > lockdep_trace_alloc+0x7a/0xe0 > kmem_cache_alloc_trace+0x37/0x240 > flex_array_alloc+0x99/0x1a0 > cgroup_attach_task+0x63/0x430 > attach_task_by_pid+0x210/0x280 > cgroup_procs_write+0x16/0x20 > cgroup_file_write+0x120/0x2c0 > vfs_write+0xc0/0x1f0 > SyS_write+0x4c/0xa0 > tracesys+0xdd/0xe2 > irq event stamp: 49 > hardirqs last enabled at (49): _raw_spin_unlock_irqrestore+0x36/0x70 > hardirqs last disabled at (48): _raw_spin_lock_irqsave+0x2b/0xa0 > softirqs last enabled at (0): copy_process.part.24+0x627/0x15f0 > softirqs last disabled at (0): (null) > > other info that might help us debug this: > Possible unsafe locking scenario: > > CPU0 > ---- > lock(&sig->group_rwsem); > <Interrupt> > lock(&sig->group_rwsem); > > *** DEADLOCK *** > > no locks held by kswapd2/1151. > > stack backtrace: > CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ MiCode#4 > Call Trace: > dump_stack+0x19/0x1b > print_usage_bug+0x1f7/0x208 > mark_lock+0x21d/0x2a0 > __lock_acquire+0x52a/0xb60 > lock_acquire+0xa2/0x140 > down_read+0x51/0xa0 > exit_signals+0x24/0x130 > do_exit+0xb5/0xa50 > kthread+0xdb/0x100 > ret_from_fork+0x7c/0xb0 This is because the kswapd thread is still marked as a reclaimer at the time of exit. But because it is exiting, nobody is actually waiting on it to make reclaim progress anymore, and it's nothing but a regular thread at this point. Be tidy and strip it of all its powers (PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state) before returning from the thread function. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com> Conflicts: mm/vmscan.c
commit 71abdc1 upstream. When kswapd exits, it can end up taking locks that were previously held by allocating tasks while they waited for reclaim. Lockdep currently warns about this: On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote: > inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage. > kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes: > (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130 > {RECLAIM_FS-ON-W} state was registered at: > mark_held_locks+0xb9/0x140 > lockdep_trace_alloc+0x7a/0xe0 > kmem_cache_alloc_trace+0x37/0x240 > flex_array_alloc+0x99/0x1a0 > cgroup_attach_task+0x63/0x430 > attach_task_by_pid+0x210/0x280 > cgroup_procs_write+0x16/0x20 > cgroup_file_write+0x120/0x2c0 > vfs_write+0xc0/0x1f0 > SyS_write+0x4c/0xa0 > tracesys+0xdd/0xe2 > irq event stamp: 49 > hardirqs last enabled at (49): _raw_spin_unlock_irqrestore+0x36/0x70 > hardirqs last disabled at (48): _raw_spin_lock_irqsave+0x2b/0xa0 > softirqs last enabled at (0): copy_process.part.24+0x627/0x15f0 > softirqs last disabled at (0): (null) > > other info that might help us debug this: > Possible unsafe locking scenario: > > CPU0 > ---- > lock(&sig->group_rwsem); > <Interrupt> > lock(&sig->group_rwsem); > > *** DEADLOCK *** > > no locks held by kswapd2/1151. > > stack backtrace: > CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ MiCode#4 > Call Trace: > dump_stack+0x19/0x1b > print_usage_bug+0x1f7/0x208 > mark_lock+0x21d/0x2a0 > __lock_acquire+0x52a/0xb60 > lock_acquire+0xa2/0x140 > down_read+0x51/0xa0 > exit_signals+0x24/0x130 > do_exit+0xb5/0xa50 > kthread+0xdb/0x100 > ret_from_fork+0x7c/0xb0 This is because the kswapd thread is still marked as a reclaimer at the time of exit. But because it is exiting, nobody is actually waiting on it to make reclaim progress anymore, and it's nothing but a regular thread at this point. Be tidy and strip it of all its powers (PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state) before returning from the thread function. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com> Conflicts: mm/vmscan.c
This will pretty much never happen considering Mediatek's track record. |
workqueue: change BUG_ON() to WARN_ON() This BUG_ON() can be triggered if you call schedule_work() before calling INIT_WORK(). It is a bug definitely, but it's nicer to just print a stack trace and return. Reported-by: Matt Renzelmann <mjr@cs.wisc.edu> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: Catch more locking problems with flush_work() If a workqueue is flushed with flush_work() lockdep checking can be circumvented. For example: static DEFINE_MUTEX(mutex); static void my_work(struct work_struct *w) { mutex_lock(&mutex); mutex_unlock(&mutex); } static DECLARE_WORK(work, my_work); static int __init start_test_module(void) { schedule_work(&work); return 0; } module_init(start_test_module); static void __exit stop_test_module(void) { mutex_lock(&mutex); flush_work(&work); mutex_unlock(&mutex); } module_exit(stop_test_module); would not always print a warning when flush_work() was called. In this trivial example nothing could go wrong since we are guaranteed module_init() and module_exit() don't run concurrently, but if the work item is schedule asynchronously we could have a scenario where the work item is running just at the time flush_work() is called resulting in a classic ABBA locking problem. Add a lockdep hint by acquiring and releasing the work item lockdep_map in flush_work() so that we always catch this potential deadlock scenario. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> lockdep: fix oops in processing workqueue Under memory load, on x86_64, with lockdep enabled, the workqueue's process_one_work() has been seen to oops in __lock_acquire(), barfing on a 0xffffffff00000000 pointer in the lockdep_map's class_cache[]. Because it's permissible to free a work_struct from its callout function, the map used is an onstack copy of the map given in the work_struct: and that copy is made without any locking. Surprisingly, gcc (4.5.1 in Hugh's case) uses "rep movsl" rather than "rep movsq" for that structure copy: which might race with a workqueue user's wait_on_work() doing lock_map_acquire() on the source of the copy, putting a pointer into the class_cache[], but only in time for the top half of that pointer to be copied to the destination map. Boom when process_one_work() subsequently does lock_map_acquire() on its onstack copy of the lockdep_map. Fix this, and a similar instance in call_timer_fn(), with a lockdep_copy_map() function which additionally NULLs the class_cache[]. Note: this oops was actually seen on 3.4-next, where flush_work() newly does the racing lock_map_acquire(); but Tejun points out that 3.4 and earlier are already vulnerable to the same through wait_on_work(). * Patch orginally from Peter. Hugh modified it a bit and wrote the description. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Reported-by: Hugh Dickins <hughd@google.com> LKML-Reference: <alpine.LSU.2.00.1205070951170.1544@eggly.anvils> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: perform cpu down operations from low priority cpu_notifier() Currently, all workqueue cpu hotplug operations run off CPU_PRI_WORKQUEUE which is higher than normal notifiers. This is to ensure that workqueue is up and running while bringing up a CPU before other notifiers try to use workqueue on the CPU. Per-cpu workqueues are supposed to remain working and bound to the CPU for normal CPU_DOWN_PREPARE notifiers. This holds mostly true even with workqueue offlining running with higher priority because workqueue CPU_DOWN_PREPARE only creates a bound trustee thread which runs the per-cpu workqueue without concurrency management without explicitly detaching the existing workers. However, if the trustee needs to create new workers, it creates unbound workers which may wander off to other CPUs while CPU_DOWN_PREPARE notifiers are in progress. Furthermore, if the CPU down is cancelled, the per-CPU workqueue may end up with workers which aren't bound to the CPU. While reliably reproducible with a convoluted artificial test-case involving scheduling and flushing CPU burning work items from CPU down notifiers, this isn't very likely to happen in the wild, and, even when it happens, the effects are likely to be hidden by the following successful CPU down. Fix it by using different priorities for up and down notifiers - high priority for up operations and low priority for down operations. Workqueue cpu hotplug operations will soon go through further cleanup. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> workqueue: drop CPU_DYING notifier operation Workqueue used CPU_DYING notification to mark GCWQ_DISASSOCIATED. This was necessary because workqueue's CPU_DOWN_PREPARE happened before other DOWN_PREPARE notifiers and workqueue needed to stay associated across the rest of DOWN_PREPARE. After the previous patch, workqueue's DOWN_PREPARE happens after others and can set GCWQ_DISASSOCIATED directly. Drop CPU_DYING and let the trustee set GCWQ_DISASSOCIATED after disabling concurrency management. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> workqueue: ROGUE workers are UNBOUND workers Currently, WORKER_UNBOUND is used to mark workers for the unbound global_cwq and WORKER_ROGUE is used to mark workers for disassociated per-cpu global_cwqs. Both are used to make the marked worker skip concurrency management and the only place they make any difference is in worker_enter_idle() where WORKER_ROGUE is used to skip scheduling idle timer, which can easily be replaced with trustee state testing. This patch replaces WORKER_ROGUE with WORKER_UNBOUND and drops WORKER_ROGUE. This is to prepare for removing trustee and handling disassociated global_cwqs as unbound. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> workqueue: use mutex for global_cwq manager exclusion POOL_MANAGING_WORKERS is used to ensure that at most one worker takes the manager role at any given time on a given global_cwq. Trustee later hitched on it to assume manager adding blocking wait for the bit. As trustee already needed a custom wait mechanism, waiting for MANAGING_WORKERS was rolled into the same mechanism. Trustee is scheduled to be removed. This patch separates out MANAGING_WORKERS wait into per-pool mutex. Workers use mutex_trylock() to test for manager role and trustee uses mutex_lock() to claim manager roles. gcwq_claim/release_management() helpers are added to grab and release manager roles of all pools on a global_cwq. gcwq_claim_management() always grabs pool manager mutexes in ascending pool index order and uses pool index as lockdep subclass. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> workqueue: drop @bind from create_worker() Currently, create_worker()'s callers are responsible for deciding whether the newly created worker should be bound to the associated CPU and create_worker() sets WORKER_UNBOUND only for the workers for the unbound global_cwq. Creation during normal operation is always via maybe_create_worker() and @bind is true. For workers created during hotplug, @bind is false. Normal operation path is planned to be used even while the CPU is going through hotplug operations or offline and this static decision won't work. Drop @bind from create_worker() and decide whether to bind by looking at GCWQ_DISASSOCIATED. create_worker() will also set WORKER_UNBOUND autmatically if disassociated. To avoid flipping GCWQ_DISASSOCIATED while create_worker() is in progress, the flag is now allowed to be changed only while holding all manager_mutexes on the global_cwq. This requires that GCWQ_DISASSOCIATED is not cleared behind trustee's back. CPU_ONLINE no longer clears DISASSOCIATED before flushing trustee, which clears DISASSOCIATED before rebinding remaining workers if asked to release. For cases where trustee isn't around, CPU_ONLINE clears DISASSOCIATED after flushing trustee. Also, now, first_idle has UNBOUND set on creation which is explicitly cleared by CPU_ONLINE while binding it. These convolutions will soon be removed by further simplification of CPU hotplug path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> workqueue: reimplement CPU online rebinding to handle idle workers Currently, if there are left workers when a CPU is being brough back online, the trustee kills all idle workers and scheduled rebind_work so that they re-bind to the CPU after the currently executing work is finished. This works for busy workers because concurrency management doesn't try to wake up them from scheduler callbacks, which require the target task to be on the local run queue. The busy worker bumps concurrency counter appropriately as it clears WORKER_UNBOUND from the rebind work item and it's bound to the CPU before returning to the idle state. To reduce CPU on/offlining overhead (as many embedded systems use it for powersaving) and simplify the code path, workqueue is planned to be modified to retain idle workers across CPU on/offlining. This patch reimplements CPU online rebinding such that it can also handle idle workers. As noted earlier, due to the local wakeup requirement, rebinding idle workers is tricky. All idle workers must be re-bound before scheduler callbacks are enabled. This is achieved by interlocking idle re-binding. Idle workers are requested to re-bind and then hold until all idle re-binding is complete so that no bound worker starts executing work item. Only after all idle workers are re-bound and parked, CPU_ONLINE proceeds to release them and queue rebind work item to busy workers thus guaranteeing scheduler callbacks aren't invoked until all idle workers are ready. worker_rebind_fn() is renamed to busy_worker_rebind_fn() and idle_worker_rebind() for idle workers is added. Rebinding logic is moved to rebind_workers() and now called from CPU_ONLINE after flushing trustee. While at it, add CPU sanity check in worker_thread(). Note that now a worker may become idle or the manager between trustee release and rebinding during CPU_ONLINE. As the previous patch updated create_worker() so that it can be used by regular manager while unbound and this patch implements idle re-binding, this is safe. This prepares for removal of trustee and keeping idle workers across CPU hotplugs. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> workqueue: don't butcher idle workers on an offline CPU Currently, during CPU offlining, after all pending work items are drained, the trustee butchers all workers. Also, on CPU onlining failure, workqueue_cpu_callback() ensures that the first idle worker is destroyed. Combined, these guarantee that an offline CPU doesn't have any worker for it once all the lingering work items are finished. This guarantee isn't really necessary and makes CPU on/offlining more expensive than needs to be, especially for platforms which use CPU hotplug for powersaving. This patch lets offline CPUs removes idle worker butchering from the trustee and let a CPU which failed onlining keep the created first worker. The first worker is created if the CPU doesn't have any during CPU_DOWN_PREPARE and started right away. If onlining succeeds, the rebind_workers() call in CPU_ONLINE will rebind it like any other workers. If onlining fails, the worker is left alone till the next try. This makes CPU hotplugs cheaper by allowing global_cwqs to keep workers across them and simplifies code. Note that trustee doesn't re-arm idle timer when it's done and thus the disassociated global_cwq will keep all workers until it comes back online. This will be improved by further patches. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> workqueue: remove CPU offline trustee With the previous changes, a disassociated global_cwq now can run as an unbound one on its own - it can create workers as necessary to drain remaining works after the CPU has been brought down and manage the number of workers using the usual idle timer mechanism making trustee completely redundant except for the actual unbinding operation. This patch removes the trustee and let a disassociated global_cwq manage itself. Unbinding is moved to a work item (for CPU affinity) which is scheduled and flushed from CPU_DONW_PREPARE. This patch moves nr_running clearing outside gcwq and manager locks to simplify the code. As nr_running is unused at the point, this is safe. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> workqueue: simplify CPU hotplug code With trustee gone, CPU hotplug code can be simplified. * gcwq_claim/release_management() now grab and release gcwq lock too respectively and gained _and_lock and _and_unlock postfixes. * All CPU hotplug logic was implemented in workqueue_cpu_callback() which was called by workqueue_cpu_up/down_callback() for the correct priority. This was because up and down paths shared a lot of logic, which is no longer true. Remove workqueue_cpu_callback() and move all hotplug logic into the two actual callbacks. This patch doesn't make any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> workqueue: fix spurious CPU locality WARN from process_one_work() 25511a4776 "workqueue: reimplement CPU online rebinding to handle idle workers" added CPU locality sanity check in process_one_work(). It triggers if a worker is executing on a different CPU without UNBOUND or REBIND set. This works for all normal workers but rescuers can trigger this spuriously when they're serving the unbound or a disassociated global_cwq - rescuers don't have either flag set and thus its gcwq->cpu can be a different value including %WORK_CPU_UNBOUND. Fix it by additionally testing %GCWQ_DISASSOCIATED. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> LKML-Refence: <20120721213656.GA7783@linux.vnet.ibm.com> workqueue: reorder queueing functions so that _on() variants are on top Currently, queue/schedule[_delayed]_work_on() are located below the counterpart without the _on postifx even though the latter is usually implemented using the former. Swap them. This is cleanup and doesn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: make queueing functions return bool All queueing functions return 1 on success, 0 if the work item was already pending. Update them to return bool instead. This signifies better that they don't return 0 / -errno. This is cleanup and doesn't cause any functional difference. While at it, fix comment opening for schedule_work_on(). Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: add missing smp_wmb() in process_one_work() WORK_STRUCT_PENDING is used to claim ownership of a work item and process_one_work() releases it before starting execution. When someone else grabs PENDING, all pre-release updates to the work item should be visible and all updates made by the new owner should happen afterwards. Grabbing PENDING uses test_and_set_bit() and thus has a full barrier; however, clearing doesn't have a matching wmb. Given the preceding spin_unlock and use of clear_bit, I don't believe this can be a problem on an actual machine and there hasn't been any related report but it still is theretically possible for clear_pending to permeate upwards and happen before work->entry update. Add an explicit smp_wmb() before work_clear_pending(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: stable@vger.kernel.org workqueue: disable irq while manipulating PENDING Queueing operations use WORK_STRUCT_PENDING_BIT to synchronize access to the target work item. They first try to claim the bit and proceed with queueing only after that succeeds and there's a window between PENDING being set and the actual queueing where the task can be interrupted or preempted. There's also a similar window in process_one_work() when clearing PENDING. A work item is dequeued, gcwq->lock is released and then PENDING is cleared and the worker might get interrupted or preempted between releasing gcwq->lock and clearing PENDING. cancel[_delayed]_work_sync() tries to claim or steal PENDING. The function assumes that a work item with PENDING is either queued or in the process of being [de]queued. In the latter case, it busy-loops until either the work item loses PENDING or is queued. If canceling coincides with the above described interrupts or preemptions, the canceling task will busy-loop while the queueing or executing task is preempted. This patch keeps irq disabled across claiming PENDING and actual queueing and moves PENDING clearing in process_one_work() inside gcwq->lock so that busy looping from PENDING && !queued doesn't wait for interrupted/preempted tasks. Note that, in process_one_work(), setting last CPU and clearing PENDING got merged into single operation. This removes possible long busy-loops and will allow using try_to_grab_pending() from bh and irq contexts. v2: __queue_work() was testing preempt_count() to ensure that the caller has disabled preemption. This triggers spuriously if !CONFIG_PREEMPT_COUNT. Use preemptible() instead. Reported by Fengguang Wu. v3: Disable irq instead of preemption. IRQ will be disabled while grabbing gcwq->lock later anyway and this allows using try_to_grab_pending() from bh and irq contexts. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> workqueue: set delayed_work->timer function on initialization delayed_work->timer.function is currently initialized during queue_delayed_work_on(). Export delayed_work_timer_fn() and set delayed_work timer function during delayed_work initialization together with other fields. This ensures the timer function is always valid on an initialized delayed_work. This is to help mod_delayed_work() implementation. To detect delayed_work users which diddle with the internal timer, trigger WARN if timer function doesn't match on queue. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: unify local CPU queueing handling Queueing functions have been using different methods to determine the local CPU. * queue_work() superflously uses get/put_cpu() to acquire and hold the local CPU across queue_work_on(). * delayed_work_timer_fn() uses smp_processor_id(). * queue_delayed_work() calls queue_delayed_work_on() with -1 @cpu which is interpreted as the local CPU. * flush_delayed_work[_sync]() were using raw_smp_processor_id(). * __queue_work() interprets %WORK_CPU_UNBOUND as local CPU if the target workqueue is bound one but nobody uses this. This patch converts all functions to uniformly use %WORK_CPU_UNBOUND to indicate local CPU and use the local binding feature of __queue_work(). unlikely() is dropped from %WORK_CPU_UNBOUND handling in __queue_work(). Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: fix zero @delay handling of queue_delayed_work_on() If @delay is zero and the dealyed_work is idle, queue_delayed_work() queues it for immediate execution; however, queue_delayed_work_on() lacks this logic and always goes through timer regardless of @delay. This patch moves 0 @delay handling logic from queue_delayed_work() to queue_delayed_work_on() so that both functions behave the same. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: move try_to_grab_pending() upwards try_to_grab_pending() will be used by to-be-implemented mod_delayed_work[_on](). Move try_to_grab_pending() and related functions above queueing functions. This patch only moves functions around. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: introduce WORK_OFFQ_FLAG_* Low WORK_STRUCT_FLAG_BITS bits of work_struct->data contain WORK_STRUCT_FLAG_* and flush color. If the work item is queued, the rest point to the cpu_workqueue with WORK_STRUCT_CWQ set; otherwise, WORK_STRUCT_CWQ is clear and the bits contain the last CPU number - either a real CPU number or one of WORK_CPU_*. Scheduled addition of mod_delayed_work[_on]() requires an additional flag, which is used only while a work item is off queue. There are more than enough bits to represent off-queue CPU number on both 32 and 64bits. This patch introduces WORK_OFFQ_FLAG_* which occupy the lower part of the @work->data high bits while off queue. This patch doesn't define any actual OFFQ flag yet. Off-queue CPU number is now shifted by WORK_OFFQ_CPU_SHIFT, which adds the number of bits used by OFFQ flags to WORK_STRUCT_FLAG_SHIFT, to make room for OFFQ flags. To avoid shift width warning with large WORK_OFFQ_FLAG_BITS, ulong cast is added to WORK_STRUCT_NO_CPU and, just in case, BUILD_BUG_ON() to check that there are enough bits to accomodate off-queue CPU number is added. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: factor out __queue_delayed_work() from queue_delayed_work_on() This is to prepare for mod_delayed_work[_on]() and doesn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: reorganize try_to_grab_pending() and __cancel_timer_work() * Use bool @is_dwork instead of @timer and let try_to_grab_pending() use to_delayed_work() to determine the delayed_work address. * Move timer handling from __cancel_work_timer() to try_to_grab_pending(). * Make try_to_grab_pending() use -EAGAIN instead of -1 for busy-looping and drop the ret local variable. * Add proper function comment to try_to_grab_pending(). This makes the code a bit easier to understand and will ease further changes. This patch doesn't make any functional change. v2: Use @is_dwork instead of @timer. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: mark a work item being canceled as such There can be two reasons try_to_grab_pending() can fail with -EAGAIN. One is when someone else is queueing or deqeueing the work item. With the previous patches, it is guaranteed that PENDING and queued state will soon agree making it safe to busy-retry in this case. The other is if multiple __cancel_work_timer() invocations are racing one another. __cancel_work_timer() grabs PENDING and then waits for running instances of the target work item on all CPUs while holding PENDING and !queued. try_to_grab_pending() invoked from another task will keep returning -EAGAIN while the current owner is waiting. Not distinguishing the two cases is okay because __cancel_work_timer() is the only user of try_to_grab_pending() and it invokes wait_on_work() whenever grabbing fails. For the first case, busy looping should be fine but wait_on_work() doesn't cause any critical problem. For the latter case, the new contender usually waits for the same condition as the current owner, so no unnecessarily extended busy-looping happens. Combined, these make __cancel_work_timer() technically correct even without irq protection while grabbing PENDING or distinguishing the two different cases. While the current code is technically correct, not distinguishing the two cases makes it difficult to use try_to_grab_pending() for other purposes than canceling because it's impossible to tell whether it's safe to busy-retry grabbing. This patch adds a mechanism to mark a work item being canceled. try_to_grab_pending() now disables irq on success and returns -EAGAIN to indicate that grabbing failed but PENDING and queued states are gonna agree soon and it's safe to busy-loop. It returns -ENOENT if the work item is being canceled and it may stay PENDING && !queued for arbitrary amount of time. __cancel_work_timer() is modified to mark the work canceling with WORK_OFFQ_CANCELING after grabbing PENDING, thus making try_to_grab_pending() fail with -ENOENT instead of -EAGAIN. Also, it invokes wait_on_work() iff grabbing failed with -ENOENT. This isn't necessary for correctness but makes it consistent with other future users of try_to_grab_pending(). v2: try_to_grab_pending() was testing preempt_count() to ensure that the caller has disabled preemption. This triggers spuriously if !CONFIG_PREEMPT_COUNT. Use preemptible() instead. Reported by Fengguang Wu. v3: Updated so that try_to_grab_pending() disables irq on success rather than requiring preemption disabled by the caller. This makes busy-looping easier and will allow try_to_grap_pending() to be used from bh/irq contexts. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Fengguang Wu <fengguang.wu@intel.com> workqueue: implement mod_delayed_work[_on]() Workqueue was lacking a mechanism to modify the timeout of an already pending delayed_work. delayed_work users have been working around this using several methods - using an explicit timer + work item, messing directly with delayed_work->timer, and canceling before re-queueing, all of which are error-prone and/or ugly. This patch implements mod_delayed_work[_on]() which behaves similarly to mod_timer() - if the delayed_work is idle, it's queued with the given delay; otherwise, its timeout is modified to the new value. Zero @delay guarantees immediate execution. v2: Updated to reflect try_to_grab_pending() changes. Now safe to be called from bh context. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> workqueue: fix CPU binding of flush_delayed_work[_sync]() delayed_work encodes the workqueue to use and the last CPU in delayed_work->work.data while it's on timer. The target CPU is implicitly recorded as the CPU the timer is queued on and delayed_work_timer_fn() queues delayed_work->work to the CPU it is running on. Unfortunately, this leaves flush_delayed_work[_sync]() no way to find out which CPU the delayed_work was queued for when they try to re-queue after killing the timer. Currently, it chooses the local CPU flush is running on. This can unexpectedly move a delayed_work queued on a specific CPU to another CPU and lead to subtle errors. There isn't much point in trying to save several bytes in struct delayed_work, which is already close to a hundred bytes on 64bit with all debug options turned off. This patch adds delayed_work->cpu to remember the CPU it's queued for. Note that if the timer is migrated during CPU down, the work item could be queued to the downed global_cwq after this change. As a detached global_cwq behaves like an unbound one, this doesn't change much for the delayed_work. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> workqueue: add missing wmb() in clear_work_data() Any operation which clears PENDING should be preceded by a wmb to guarantee that the next PENDING owner sees all the changes made before PENDING release. There are only two places where PENDING is cleared - set_work_cpu_and_clear_pending() and clear_work_data(). The caller of the former already does smp_wmb() but the latter doesn't have any. Move the wmb above set_work_cpu_and_clear_pending() into it and add one to clear_work_data(). There hasn't been any report related to this issue, and, given how clear_work_data() is used, it is extremely unlikely to have caused any actual problems on any architecture. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> workqueue: use enum value to set array size of pools in gcwq Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement WQ_HIGHPRI using a separate worker_pool') introduce separate worker_pool for HIGHPRI. Although there is NR_WORKER_POOLS enum value which represent size of pools, definition of worker_pool in gcwq doesn't use it. Using it makes code robust and prevent future mistakes. So change code to use this enum value. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: correct req_cpu in trace_workqueue_queue_work() When we do tracing workqueue_queue_work(), it records requested cpu. But, if !(@wq->flag & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND, requested cpu is changed as local cpu. In case of @wq->flag & WQ_UNBOUND, above change is not occured, therefore it is reasonable to correct it. Use temporary local variable for storing requested cpu. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: change value of lcpu in __queue_delayed_work_on() We assign cpu id into work struct's data field in __queue_delayed_work_on(). In current implementation, when work is come in first time, current running cpu id is assigned. If we do __queue_delayed_work_on() with CPU A on CPU B, __queue_work() invoked in delayed_work_timer_fn() go into the following sub-optimal path in case of WQ_NON_REENTRANT. gcwq = get_gcwq(cpu); if (wq->flags & WQ_NON_REENTRANT && (last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) { Change lcpu to @cpu and rechange lcpu to local cpu if lcpu is WORK_CPU_UNBOUND. It is sufficient to prevent to go into sub-optimal path. tj: Slightly rephrased the comment. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: introduce system_highpri_wq Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement WQ_HIGHPRI using a separate worker_pool') introduce separate worker pool for HIGHPRI. When we handle busyworkers for gcwq, it can be normal worker or highpri worker. But, we don't consider this difference in rebind_workers(), we use just system_wq for highpri worker. It makes mismatch between cwq->pool and worker->pool. It doesn't make error in current implementation, but possible in the future. Now, we introduce system_highpri_wq to use proper cwq for highpri workers in rebind_workers(). Following patch fix this issue properly. tj: Even apart from rebinding, having system_highpri_wq generally makes sense. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: use system_highpri_wq for highpri workers in rebind_workers() In rebind_workers(), we do inserting a work to rebind to cpu for busy workers. Currently, in this case, we use only system_wq. This makes a possible error situation as there is mismatch between cwq->pool and worker->pool. To prevent this, we should use system_highpri_wq for highpri worker to match theses. This implements it. tj: Rephrased comment a bit. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: use system_highpri_wq for unbind_work To speed cpu down processing up, use system_highpri_wq. As scheduling priority of workers on it is higher than system_wq and it is not contended by other normal works on this cpu, work on it is processed faster than system_wq. tj: CPU up/downs care quite a bit about latency these days. This shouldn't hurt anything and makes sense. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: fix checkpatch issues Fixed some checkpatch warnings. tj: adapted to wq/for-3.7 and massaged pr_xxx() format strings a bit. Signed-off-by: Valentin Ilie <valentin.ilie@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <1345326762-21747-1-git-send-email-valentin.ilie@gmail.com> workqueue: make all workqueues non-reentrant By default, each per-cpu part of a bound workqueue operates separately and a work item may be executing concurrently on different CPUs. The behavior avoids some cross-cpu traffic but leads to subtle weirdities and not-so-subtle contortions in the API. * There's no sane usefulness in allowing a single work item to be executed concurrently on multiple CPUs. People just get the behavior unintentionally and get surprised after learning about it. Most either explicitly synchronize or use non-reentrant/ordered workqueue but this is error-prone. * flush_work() can't wait for multiple instances of the same work item on different CPUs. If a work item is executing on cpu0 and then queued on cpu1, flush_work() can only wait for the one on cpu1. Unfortunately, work items can easily cross CPU boundaries unintentionally when the queueing thread gets migrated. This means that if multiple queuers compete, flush_work() can't even guarantee that the instance queued right before it is finished before returning. * flush_work_sync() was added to work around some of the deficiencies of flush_work(). In addition to the usual flushing, it ensures that all currently executing instances are finished before returning. This operation is expensive as it has to walk all CPUs and at the same time fails to address competing queuer case. Incorrectly using flush_work() when flush_work_sync() is necessary is an easy error to make and can lead to bugs which are difficult to reproduce. * Similar problems exist for flush_delayed_work[_sync](). Other than the cross-cpu access concern, there's no benefit in allowing parallel execution and it's plain silly to have this level of contortion for workqueue which is widely used from core code to extremely obscure drivers. This patch makes all workqueues non-reentrant. If a work item is executing on a different CPU when queueing is requested, it is always queued to that CPU. This guarantees that any given work item can be executing on one CPU at maximum and if a work item is queued and executing, both are on the same CPU. The only behavior change which may affect workqueue users negatively is that non-reentrancy overrides the affinity specified by queue_work_on(). On a reentrant workqueue, the affinity specified by queue_work_on() is always followed. Now, if the work item is executing on one of the CPUs, the work item will be queued there regardless of the requested affinity. I've reviewed all workqueue users which request explicit affinity, and, fortunately, none seems to be crazy enough to exploit parallel execution of the same work item. This adds an additional busy_hash lookup if the work item was previously queued on a different CPU. This shouldn't be noticeable under any sane workload. Work item queueing isn't a very high-frequency operation and they don't jump across CPUs all the time. In a micro benchmark to exaggerate this difference - measuring the time it takes for two work items to repeatedly jump between two CPUs a number (10M) of times with busy_hash table densely populated, the difference was around 3%. While the overhead is measureable, it is only visible in pathological cases and the difference isn't huge. This change brings much needed sanity to workqueue and makes its behavior consistent with timer. I think this is the right tradeoff to make. This enables significant simplification of workqueue API. Simplification patches will follow. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: gut flush[_delayed]_work_sync() Now that all workqueues are non-reentrant, flush[_delayed]_work_sync() are equivalent to flush[_delayed]_work(). Drop the separate implementation and make them thin wrappers around flush[_delayed]_work(). * start_flush_work() no longer takes @wait_executing as the only left user - flush_work() - always sets it to %true. * __cancel_work_timer() uses flush_work() instead of wait_on_work(). Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: gut system_nrt[_freezable]_wq() Now that all workqueues are non-reentrant, system[_freezable]_wq() are equivalent to system_nrt[_freezable]_wq(). Replace the latter with wrappers around system[_freezable]_wq(). The wrapping goes through inline functions so that __deprecated can be added easily. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: cosmetic whitespace updates for macro definitions Consistently use the last tab position for '\' line continuation in complex macro definitions. This is to help the following patches. This patch is cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback() workqueue_cpu_down_callback() is used only if HOTPLUG_CPU=y, so hotcpu_notifier() fits better than cpu_notifier(). When HOTPLUG_CPU=y, hotcpu_notifier() and cpu_notifier() are the same. When HOTPLUG_CPU=n, if we use cpu_notifier(), workqueue_cpu_down_callback() will be called during boot to do nothing, and the memory of workqueue_cpu_down_callback() and gcwq_unbind_fn() will be discarded after boot. If we use hotcpu_notifier(), we can avoid the no-op call of workqueue_cpu_down_callback() and the memory of workqueue_cpu_down_callback() and gcwq_unbind_fn() will be discard at build time: $ ls -l kernel/workqueue.o.cpu_notifier kernel/workqueue.o.hotcpu_notifier -rw-rw-r-- 1 laijs laijs 484080 Sep 15 11:31 kernel/workqueue.o.cpu_notifier -rw-rw-r-- 1 laijs laijs 478240 Sep 15 11:31 kernel/workqueue.o.hotcpu_notifier $ size kernel/workqueue.o.cpu_notifier kernel/workqueue.o.hotcpu_notifier text data bss dec hex filename 18513 2387 1221 22121 5669 kernel/workqueue.o.cpu_notifier 18082 2355 1221 21658 549a kernel/workqueue.o.hotcpu_notifier tj: Updated description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: reimplement cancel_delayed_work() using try_to_grab_pending() cancel_delayed_work() can't be called from IRQ handlers due to its use of del_timer_sync() and can't cancel work items which are already transferred from timer to worklist. Also, unlike other flush and cancel functions, a canceled delayed_work would still point to the last associated cpu_workqueue. If the workqueue is destroyed afterwards and the work item is re-used on a different workqueue, the queueing code can oops trying to dereference already freed cpu_workqueue. This patch reimplements cancel_delayed_work() using try_to_grab_pending() and set_work_cpu_and_clear_pending(). This allows the function to be called from IRQ handlers and makes its behavior consistent with other flush / cancel functions. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomic The compiler may compile the following code into TWO write/modify instructions. worker->flags &= ~WORKER_UNBOUND; worker->flags |= WORKER_REBIND; so the other CPU may temporarily see worker->flags which doesn't have either WORKER_UNBOUND or WORKER_REBIND set and perform local wakeup prematurely. Fix it by using single explicit assignment via ACCESS_ONCE(). Because idle workers have another WORKER_NOT_RUNNING flag, this bug doesn't exist for them; however, update it to use the same pattern for consistency. tj: Applied the change to idle workers too and updated comments and patch description a bit. Change-Id: I9b95f51d146c40c31ba028668d6f412bd74c6026 Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org workqueue: move WORKER_REBIND clearing in rebind_workers() to the end of the function This doesn't make any functional difference and is purely to help the next patch to be simpler. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> workqueue: fix possible deadlock in idle worker rebinding Currently, rebind_workers() and idle_worker_rebind() are two-way interlocked. rebind_workers() waits for idle workers to finish rebinding and rebound idle workers wait for rebind_workers() to finish rebinding busy workers before proceeding. Unfortunately, this isn't enough. The second wait from idle workers is implemented as follows. wait_event(gcwq->rebind_hold, !(worker->flags & WORKER_REBIND)); rebind_workers() clears WORKER_REBIND, wakes up the idle workers and then returns. If CPU hotplug cycle happens again before one of the idle workers finishes the above wait_event(), rebind_workers() will repeat the first part of the handshake - set WORKER_REBIND again and wait for the idle worker to finish rebinding - and this leads to deadlock because the idle worker would be waiting for WORKER_REBIND to clear. This is fixed by adding another interlocking step at the end - rebind_workers() now waits for all the idle workers to finish the above WORKER_REBIND wait before returning. This ensures that all rebinding steps are complete on all idle workers before the next hotplug cycle can happen. This problem was diagnosed by Lai Jiangshan who also posted a patch to fix the issue, upon which this patch is based. This is the minimal fix and further patches are scheduled for the next merge window to simplify the CPU hotplug path. Signed-off-by: Tejun Heo <tj@kernel.org> Original-patch-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <1346516916-1991-3-git-send-email-laijs@cn.fujitsu.com> workqueue: restore POOL_MANAGING_WORKERS This patch restores POOL_MANAGING_WORKERS which was replaced by pool->manager_mutex by 6037315269 "workqueue: use mutex for global_cwq manager exclusion". There's a subtle idle worker depletion bug across CPU hotplug events and we need to distinguish an actual manager and CPU hotplug preventing management. POOL_MANAGING_WORKERS will be used for the former and manager_mutex the later. This patch just lays POOL_MANAGING_WORKERS on top of the existing manager_mutex and doesn't introduce any synchronization changes. The next patch will update it. Note that this patch fixes a non-critical anomaly where too_many_workers() may return %true spuriously while CPU hotplug is in progress. While the issue could schedule idle timer spuriously, it didn't trigger any actual misbehavior. tj: Rewrote patch description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: fix possible idle worker depletion across CPU hotplug To simplify both normal and CPU hotplug paths, worker management is prevented while CPU hoplug is in progress. This is achieved by CPU hotplug holding the same exclusion mechanism used by workers to ensure there's only one manager per pool. If someone else seems to be performing the manager role, workers proceed to execute work items. CPU hotplug using the same mechanism can lead to idle worker depletion because all workers could proceed to execute work items while CPU hotplug is in progress and CPU hotplug itself wouldn't actually perform the worker management duty - it doesn't guarantee that there's an idle worker left when it releases management. This idle worker depletion, under extreme circumstances, can break forward-progress guarantee and thus lead to deadlock. This patch fixes the bug by using separate mechanisms for manager exclusion among workers and hotplug exclusion. For manager exclusion, POOL_MANAGING_WORKERS which was restored by the previous patch is used. pool->manager_mutex is now only used for exclusion between the elected manager and CPU hotplug. The elected manager won't proceed without holding pool->manager_mutex. This ensures that the worker which won the manager position can't skip managing while CPU hotplug is in progress. It will block on manager_mutex and perform management after CPU hotplug is complete. Note that hotplug may happen while waiting for manager_mutex. A manager isn't either on idle or busy list and thus the hoplug code can't unbind/rebind it. Make the manager handle its own un/rebinding. tj: Updated comment and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: always clear WORKER_REBIND in busy_worker_rebind_fn() busy_worker_rebind_fn() didn't clear WORKER_REBIND if rebinding failed (CPU is down again). This used to be okay because the flag wasn't used for anything else. However, after 25511a477 "workqueue: reimplement CPU online rebinding to handle idle workers", WORKER_REBIND is also used to command idle workers to rebind. If not cleared, the worker may confuse the next CPU_UP cycle by having REBIND spuriously set or oops / get stuck by prematurely calling idle_worker_rebind(). WARNING: at /work/os/wq/kernel/workqueue.c:1323 worker_thread+0x4cd/0x5 00() Hardware name: Bochs Modules linked in: test_wq(O-) Pid: 33, comm: kworker/1:1 Tainted: G O 3.6.0-rc1-work+ #3 Call Trace: [<ffffffff8109039f>] warn_slowpath_common+0x7f/0xc0 [<ffffffff810903fa>] warn_slowpath_null+0x1a/0x20 [<ffffffff810b3f1d>] worker_thread+0x4cd/0x500 [<ffffffff810bc16e>] kthread+0xbe/0xd0 [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10 ---[ end trace e977cf20f4661968 ]--- BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff810b3db0>] worker_thread+0x360/0x500 PGD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: test_wq(O-) CPU 0 Pid: 33, comm: kworker/1:1 Tainted: G W O 3.6.0-rc1-work+ #3 Bochs Bochs RIP: 0010:[<ffffffff810b3db0>] [<ffffffff810b3db0>] worker_thread+0x360/0x500 RSP: 0018:ffff88001e1c9de0 EFLAGS: 00010086 RAX: 0000000000000000 RBX: ffff88001e633e00 RCX: 0000000000004140 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009 RBP: ffff88001e1c9ea0 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000002 R11: 0000000000000000 R12: ffff88001fc8d580 R13: ffff88001fc8d590 R14: ffff88001e633e20 R15: ffff88001e1c6900 FS: 0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 00000000130e8000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process kworker/1:1 (pid: 33, threadinfo ffff88001e1c8000, task ffff88001e1c6900) Stack: ffff880000000000 ffff88001e1c9e40 0000000000000001 ffff88001e1c8010 ffff88001e519c78 ffff88001e1c9e58 ffff88001e1c6900 ffff88001e1c6900 ffff88001e1c6900 ffff88001e1c6900 ffff88001fc8d340 ffff88001fc8d340 Call Trace: [<ffffffff810bc16e>] kthread+0xbe/0xd0 [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10 Code: b1 00 f6 43 48 02 0f 85 91 01 00 00 48 8b 43 38 48 89 df 48 8b 00 48 89 45 90 e8 ac f0 ff ff 3c 01 0f 85 60 01 00 00 48 8b 53 50 <8b> 02 83 e8 01 85 c0 89 02 0f 84 3b 01 00 00 48 8b 43 38 48 8b RIP [<ffffffff810b3db0>] worker_thread+0x360/0x500 RSP <ffff88001e1c9de0> CR2: 0000000000000000 There was no reason to keep WORKER_REBIND on failure in the first place - WORKER_UNBOUND is guaranteed to be set in such cases preventing incorrectly activating concurrency management. Always clear WORKER_REBIND. tj: Updated comment and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: reimplement idle worker rebinding Currently rebind_workers() uses rebinds idle workers synchronously before proceeding to requesting busy workers to rebind. This is necessary because all workers on @worker_pool->idle_list must be bound before concurrency management local wake-ups from the busy workers take place. Unfortunately, the synchronous idle rebinding is quite complicated. This patch reimplements idle rebinding to simplify the code path. Rather than trying to make all idle workers bound before rebinding busy workers, we simply remove all to-be-bound idle workers from the idle list and let them add themselves back after completing rebinding (successful or not). As only workers which finished rebinding can on on the idle worker list, the idle worker list is guaranteed to have only bound workers unless CPU went down again and local wake-ups are safe. After the change, @worker_pool->nr_idle may deviate than the actual number of idle workers on @worker_pool->idle_list. More specifically, nr_idle may be non-zero while ->idle_list is empty. All users of ->nr_idle and ->idle_list are audited. The only affected one is too_many_workers() which is updated to check %false if ->idle_list is empty regardless of ->nr_idle. After this patch, rebind_workers() no longer performs the nasty idle-rebind retries which require temporary release of gcwq->lock, and both unbinding and rebinding are atomic w.r.t. global_cwq->lock. worker->idle_rebind and global_cwq->rebind_hold are now unnecessary and removed along with the definition of struct idle_rebind. Changed from V1: 1) remove unlikely from too_many_workers(), ->idle_list can be empty anytime, even before this patch, no reason to use unlikely. 2) fix a small rebasing mistake. (which is from rebasing the orignal fixing patch to for-next) 3) add a lot of comments. 4) clear WORKER_REBIND unconditionaly in idle_worker_rebind() tj: Updated comments and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: WORKER_REBIND is no longer necessary for busy rebinding Because the old unbind/rebinding implementation wasn't atomic w.r.t. GCWQ_DISASSOCIATED manipulation which is protected by global_cwq->lock, we had to use two flags, WORKER_UNBOUND and WORKER_REBIND, to avoid incorrectly losing all NOT_RUNNING bits with back-to-back CPU hotplug operations; otherwise, completion of rebinding while another unbinding is in progress could clear UNBIND prematurely. Now that both unbind/rebinding are atomic w.r.t. GCWQ_DISASSOCIATED, there's no need to use two flags. Just one is enough. Don't use WORKER_REBIND for busy rebinding. tj: Updated description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: WORKER_REBIND is no longer necessary for idle rebinding Now both worker destruction and idle rebinding remove the worker from idle list while it's still idle, so list_empty(&worker->entry) can be used to test whether either is pending and WORKER_DIE to distinguish between the two instead making WORKER_REBIND unnecessary. Use list_empty(&worker->entry) to determine whether destruction or rebinding is pending. This simplifies worker state transitions. WORKER_REBIND is not needed anymore. Remove it. tj: Updated comments and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: rename manager_mutex to assoc_mutex Now that manager_mutex's role has changed from synchronizing manager role to excluding hotplug against manager, the name is misleading. As it is protecting the CPU-association of the gcwq now, rename it to assoc_mutex. This patch is pure rename and doesn't introduce any functional change. tj: Updated comments and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: use __cpuinit instead of __devinit for cpu callbacks For workqueue hotplug callbacks, it makes less sense to use __devinit which discards the memory after boot if !HOTPLUG. __cpuinit, which discards the memory after boot if !HOTPLUG_CPU fits better. tj: Updated description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: fix possible stall on try_to_grab_pending() of a delayed work item Currently, when try_to_grab_pending() grabs a delayed work item, it leaves its linked work items alone on the delayed_works. The linked work items are always NO_COLOR and will cause future cwq_activate_first_delayed() increase cwq->nr_active incorrectly, and may cause the whole cwq to stall. For example, state: cwq->max_active = 1, cwq->nr_active = 1 one work in cwq->pool, many in cwq->delayed_works. step1: try_to_grab_pending() removes a work item from delayed_works but leaves its NO_COLOR linked work items on it. step2: Later on, cwq_activate_first_delayed() activates the linked work item increasing ->nr_active. step3: cwq->nr_active = 1, but all activated work items of the cwq are NO_COLOR. When they finish, cwq->nr_active will not be decreased due to NO_COLOR, and no further work items will be activated from cwq->delayed_works. the cwq stalls. Fix it by ensuring the target work item is activated before stealing PENDING in try_to_grab_pending(). This ensures that all the linked work items are activated without incorrectly bumping cwq->nr_active. tj: Updated comment and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org workqueue: reimplement work_on_cpu() using system_wq The existing work_on_cpu() implementation is hugely inefficient. It creates a new kthread, execute that single function and then let the kthread die on each invocation. Now that system_wq can handle concurrent executions, there's no advantage of doing this. Reimplement work_on_cpu() using system_wq which makes it simpler and way more efficient. stable: While this isn't a fix in itself, it's needed to fix a workqueue related bug in cpufreq/powernow-k8. AFAICS, this shouldn't break other existing users. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jiri Kosina <jkosina@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Len Brown <lenb@kernel.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: stable@vger.kernel.org workqueue: introduce cwq_set_max_active() helper for thaw_workqueues() Using a helper instead of open code makes thaw_workqueues() clearer. The helper will also be used by the next patch. tj: Slight update to comment and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: use cwq_set_max_active() helper for workqueue_set_max_active() workqueue_set_max_active() may increase ->max_active without activating delayed works and may make the activation order differ from the queueing order. Both aren't strictly bugs but the resulting behavior could be a bit odd. To make things more consistent, use cwq_set_max_active() helper which immediately makes use of the newly increased max_mactive if there are delayed work items and also keeps the activation order. tj: Slight update to description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending() e0aecdd874 ("workqueue: use irqsafe timer for delayed_work") made try_to_grab_pending() safe to use from irq context but forgot to remove WARN_ON_ONCE(in_irq()). Remove it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com> workqueue: cancel_delayed_work() should return %false if work item is idle 57b30ae77b ("workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()") made cancel_delayed_work() always return %true unless someone else is also trying to cancel the work item, which is broken - if the target work item is idle, the return value should be %false. try_to_grab_pending() indicates that the target work item was idle by zero return value. Use it for return. Note that this brings cancel_delayed_work() in line with __cancel_work_timer() in return value handling. Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <444a6439-b1a4-4740-9e7e-bc37267cfe73@default> workqueue: exit rescuer_thread() as TASK_RUNNING A rescue thread exiting TASK_INTERRUPTIBLE can lead to a task scheduling off, never to be seen again. In the case where this occurred, an exiting thread hit reiserfs homebrew conditional resched while holding a mutex, bringing the box to its knees. PID: 18105 TASK: ffff8807fd412180 CPU: 5 COMMAND: "kdmflush" #0 [ffff8808157e7670] schedule at ffffffff8143f489 #1 [ffff8808157e77b8] reiserfs_get_block at ffffffffa038ab2d [reiserfs] #2 [ffff8808157e79a8] __block_write_begin at ffffffff8117fb14 #3 [ffff8808157e7a98] reiserfs_write_begin at ffffffffa0388695 [reiserfs] #4 [ffff8808157e7ad8] generic_perform_write at ffffffff810ee9e2 #5 [ffff8808157e7b58] generic_file_buffered_write at ffffffff810eeb41 #6 [ffff8808157e7ba8] __generic_file_aio_write at ffffffff810f1a3a #7 [ffff8808157e7c58] generic_file_aio_write at ffffffff810f1c88 #8 [ffff8808157e7cc8] do_sync_write at ffffffff8114f850 #9 [ffff8808157e7dd8] do_acct_process at ffffffff810a268f [exception RIP: kernel_thread_helper] RIP: ffffffff8144a5c0 RSP: ffff8808157e7f58 RFLAGS: 00000202 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff8107af60 RDI: ffff8803ee491d18 RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 Signed-off-by: Mike Galbraith <mgalbraith@suse.de> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay 8376fe22c7 ("workqueue: implement mod_delayed_work[_on]()") implemented mod_delayed_work[_on]() using the improved try_to_grab_pending(). The function is later used, among others, to replace [__]candel_delayed_work() + queue_delayed_work() combinations. Unfortunately, a delayed_work item w/ zero @delay is handled slightly differently by mod_delayed_work_on() compared to queue_delayed_work_on(). The latter skips timer altogether and directly queues it using queue_work_on() while the former schedules timer which will expire on the closest tick. This means, when @delay is zero, that [__]cancel_delayed_work() + queue_delayed_work_on() makes the target item immediately executable while mod_delayed_work_on() may induce delay of upto a full tick. This somewhat subtle difference breaks some of the converted users. e.g. block queue plugging uses delayed_work for deferred processing and uses mod_delayed_work_on() when the queue needs to be immediately unplugged. The above problem manifested as noticeably higher number of context switches under certain circumstances. The difference in behavior was caused by missing special case handling for 0 delay in mod_delayed_work_on() compared to queue_delayed_work_on(). Joonsoo Kim posted a patch to add it - ("workqueue: optimize mod_delayed_work_on() when @delay == 0")[1]. The patch was queued for 3.8 but it was described as optimization and I missed that it was a correctness issue. As both queue_delayed_work_on() and mod_delayed_work_on() use __queue_delayed_work() for queueing, it seems that the better approach is to move the 0 delay special handling to the function instead of duplicating it in mod_delayed_work_on(). Fix the problem by moving 0 delay special case handling from queue_delayed_work_on() to __queue_delayed_work(). This replaces Joonsoo's patch. [1] http://thread.gmane.org/gmane.linux.kernel/1379011/focus=1379012 Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Anders Kaseorg <andersk@MIT.EDU> Reported-and-tested-by: Zlatko Calusic <zlatko.calusic@iskon.hr> LKML-Reference: <alpine.DEB.2.00.1211280953350.26602@dr-wily.mit.edu> LKML-Reference: <50A78AA9.5040904@iskon.hr> Cc: Joonsoo Kim <js1304@gmail.com> workqueue: trivial fix for return statement in work_busy() Return type of work_busy() is unsigned int. There is return statement returning boolean value, 'false' in work_busy(). It is not problem, because 'false' may be treated '0'. However, fixing it would make code robust. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: add WARN_ON_ONCE() on CPU number to wq_worker_waking_up() Recently, workqueue code has gone through some changes and we found some bugs related to concurrency management operations happening on the wrong CPU. When a worker is concurrency managed (!WORKER_NOT_RUNNIG), it should be bound to its associated cpu and woken up to that cpu. Add WARN_ON_ONCE() to verify this. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: convert BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s 8852aac25e ("workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay") unexpectedly uncovered a very nasty abuse of delayed_work in megaraid - it allocated work_struct, casted it to delayed_work and then pass that into queue_delayed_work(). Previously, this was okay because 0 @delay short-circuited to queue_work() before doing anything with delayed_work. 8852aac25e moved 0 @delay test into __queue_delayed_work() after sanity check on delayed_work making megaraid trigger BUG_ON(). Although megaraid is already fixed by c1d390d8e6 ("megaraid: fix BUG_ON() from incorrect use of delayed work"), this patch converts BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s so that such abusers, if there are more, trigger warning but don't crash the machine. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Xiaotian Feng <xtfeng@gmail.com> wq Change-Id: Ia3c507777a995f32bf6b40dc8318203e53134229 Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com>
Redmi Note 3g's sources are still not released, |
What if the kernel is not even modified? |
Whether the kernel source is modified or not according to GPLv2 source code still need to be provided with it's binary product. |
I guess if other companies will release kernel source cough cough xiaomi MediaTek may follow and release the kernel as well. |
@Vdragon I found that Redmi 1 is using MTK processor, and I heard about rumors about MediaTek do held the source code of their kernels. Perhaps, just maybe, the decision of not releasing kernel source code actually comes from some secret biz deal between Xiaomi and MediaTek. I am not saying that Xiaomi is innocent, they are definitely wrong, but in their management's perspective this device may not be important anymore. |
@qiansen1386 Any Mediatek device will almost never get an open sourced kernel. Any other Xiaomi device will probably have to wait at least a year to get one. |
@zwliew what about the mi4i? There must be some reason why they release it so fast. Oh yah... It's because Mi India developed the device. |
@nick37332001 Good point... 😂 |
@nick37332001 I'm inclined to believe that it has more to do with the Mi 4i being an international device compared to one that is more focused towards the Chinese and other Asian markets. |
Or Gerlitz says: ==================== net/mlx5_core: Enhance flow steering support v0 --> v1 changes: - fixed improperly formatted comments. - compare value of ib_spec->eth.mask.ether_type in network byte order in ('IB/mlx5: Add flow steering utilities'). v1 --> v2 changes: - made sure that service functions added in the IB driver are only static-fied on the last commit, to make sure bisection with -Werror works fine. v2 --> v3 changes: - squashed patches 11 and 12 into one patch, s.t Dave's comment on unused static functions gcc complaints during bisection is correctly addressed. v3 has been generated against net-next commit c9c9931 "Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge" The series is signed by Matan who was revently assigned to a maintainer for the mlx5_core and IB drivers (this is a 4.5-rc1 change to the maintainers file coming from the rdma tree) -- as such I didn't see a neeed to add my signature (Or). This series adds three new functionalists to the driver flow-steering infrastructure: auto-grouped flow tables, chaining of flow tables and updates for the root flow table. 1. Auto-grouped flow tables - Flow table with auto grouping management. When a flow table is created, hints regarding the number of rule types and the number of rules are given in advance. Thus, a flow table is divided into #NUM_TYPES+1 groups each contains (#NUM_RULES)/(#NUM_TYPES+1) rules. The first #NUM_TYPES parts are groups which are filled if the added rule matches the group specification or the group is empty. The last part is filled by rules that can't fit any of the former groups. 2. Chaining flow tables - Flow tables from different priorities are chained together, if there is no match in flow table of priority i we continue searching for a match in priority i+1. This is both true if priorities i and i+1 belongs to the same namespace or not. 3. Updating the root flow table - the root flow table is the flow table with the lowest level. The hardware start searching for a match in the root flow table and continue according to the matches it find along the way. The first usage for the new functionality is flow steering for user-space ConnectX-4 offloaded HW Eth RX queues done through the mlx5 IB driver. When the mlx5 core driver is loaded, it opens three flow namespaces: 1. By-pass namespace (used by mlx5 IB driver). 2. Kernel namespace (used in order to get packets to the networking stack through mlx5 EN driver). 3. Leftovers namespace (used by mlx5 IB and future sniffer) The series is built as follows: Patch #1 introduces auto-grouped flow tables support. Patch #2 add utility functions for finding the next and the previous flow tables in different priorities. This is used in order to chain the flow tables in a downstream patch. Patch #3 introduces a firmware command for updating the root flow table. Patch #4 introduces modify flow table firmware command, this command is used when we want to change the next flow table of an existing flow table. This is used for chaining flow tables as well. Patch #5 connect/disconnect flow tables. This is actually the chaining process when we want to link flow tables. This means that if we couldn't find a match in the first flow table, we'll continue in the chained flow table. Patch #6 updates priority's attributes that is required for flow table level allocation. We update both the max_fts (the number of allowed FTs in the sub-tree of this priority) and the start_level (which is the first level we'll assign to the flow-tables created inside the priority). Patch #7 adds checking of required device capabilities. Some namespaces could be only created if the hardware supports certain attributes. This is especially true for the Bypass and leftovers namespaces. This adds a generic mechanism to check these required attributes. Patch #8 creates two additional namespaces: a. Bypass flow rules(has nine priorities) b. Leftovers packets(have one priority) - for unmatched packets. Patch #9 re-factors ipv4/ipv6 match fields in the mlx5 firmware interface header to be more clear. Patch #10 exports the flow steering API for mlx5_ib usage Patch #11 implements the required support in mlx5_ib in order to support the RDMA flow steering verbs. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
We can end up allocating a new compression stream with GFP_KERNEL from within the IO path, which may result is nested (recursive) IO operations. That can introduce problems if the IO path in question is a reclaimer, holding some locks that will deadlock nested IOs. Allocate streams and working memory using GFP_NOIO flag, forbidding recursive IO and FS operations. An example: inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage. git/20158 [HC0[0]:SC0[0]:HE1:SE1] takes: (jbd2_handle){+.+.?.}, at: start_this_handle+0x4ca/0x555 {IN-RECLAIM_FS-W} state was registered at: __lock_acquire+0x8da/0x117b lock_acquire+0x10c/0x1a7 start_this_handle+0x52d/0x555 jbd2__journal_start+0xb4/0x237 __ext4_journal_start_sb+0x108/0x17e ext4_dirty_inode+0x32/0x61 __mark_inode_dirty+0x16b/0x60c iput+0x11e/0x274 __dentry_kill+0x148/0x1b8 shrink_dentry_list+0x274/0x44a prune_dcache_sb+0x4a/0x55 super_cache_scan+0xfc/0x176 shrink_slab.part.14.constprop.25+0x2a2/0x4d3 shrink_zone+0x74/0x140 kswapd+0x6b7/0x930 kthread+0x107/0x10f ret_from_fork+0x3f/0x70 irq event stamp: 138297 hardirqs last enabled at (138297): debug_check_no_locks_freed+0x113/0x12f hardirqs last disabled at (138296): debug_check_no_locks_freed+0x33/0x12f softirqs last enabled at (137818): __do_softirq+0x2d3/0x3e9 softirqs last disabled at (137813): irq_exit+0x41/0x95 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(jbd2_handle); <Interrupt> lock(jbd2_handle); *** DEADLOCK *** 5 locks held by git/20158: #0: (sb_writers#7){.+.+.+}, at: [<ffffffff81155411>] mnt_want_write+0x24/0x4b #1: (&type->i_mutex_dir_key#2/1){+.+.+.}, at: [<ffffffff81145087>] lock_rename+0xd9/0xe3 #2: (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [<ffffffff8114f8e2>] lock_two_nondirectories+0x3f/0x6b #3: (&sb->s_type->i_mutex_key#11/4){+.+.+.}, at: [<ffffffff8114f909>] lock_two_nondirectories+0x66/0x6b #4: (jbd2_handle){+.+.?.}, at: [<ffffffff811e31db>] start_this_handle+0x4ca/0x555 stack backtrace: CPU: 2 PID: 20158 Comm: git Not tainted 4.1.0-rc7-next-20150615-dbg-00016-g8bdf555-dirty #211 Call Trace: dump_stack+0x4c/0x6e mark_lock+0x384/0x56d mark_held_locks+0x5f/0x76 lockdep_trace_alloc+0xb2/0xb5 kmem_cache_alloc_trace+0x32/0x1e2 zcomp_strm_alloc+0x25/0x73 [zram] zcomp_strm_multi_find+0xe7/0x173 [zram] zcomp_strm_find+0xc/0xe [zram] zram_bvec_rw+0x2ca/0x7e0 [zram] zram_make_request+0x1fa/0x301 [zram] generic_make_request+0x9c/0xdb submit_bio+0xf7/0x120 ext4_io_submit+0x2e/0x43 ext4_bio_write_page+0x1b7/0x300 mpage_submit_page+0x60/0x77 mpage_map_and_submit_buffers+0x10f/0x21d ext4_writepages+0xc8c/0xe1b do_writepages+0x23/0x2c __filemap_fdatawrite_range+0x84/0x8b filemap_flush+0x1c/0x1e ext4_alloc_da_blocks+0xb8/0x117 ext4_rename+0x132/0x6dc ? mark_held_locks+0x5f/0x76 ext4_rename2+0x29/0x2b vfs_rename+0x540/0x636 SyS_renameat2+0x359/0x44d SyS_rename+0x1e/0x20 entry_SYSCALL_64_fastpath+0x12/0x6f [minchan@kernel.org: add stable mark] Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Kyeongdon Kim <kyeongdon.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Avoid calling cgroup_threadgroup_change_end() without having called cgroup_threadgroup_change_begin() first. During process creation we need to check whether the cgroup we are in allows us to fork. To perform this check the cgroup needs to guard itself against threadgroup changes and takes a lock. Prior to CLONE_PIDFD the cleanup target "bad_fork_free_pid" would also need to call cgroup_threadgroup_change_end() because said lock had already been taken. However, this is not the case anymore with the addition of CLONE_PIDFD. We are now allocating a pidfd before we check whether the cgroup we're in can fork and thus prior to taking the lock. So when copy_process() fails at the right step it would release a lock we haven't taken. This bug is not even very subtle to be honest. It's just not very clear from the naming of cgroup_threadgroup_change_{begin,end}() that a lock is taken. Here's the relevant splat: entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139 RIP: 0023:0xf7fec849 Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90 RSP: 002b:00000000ffed5a8c EFLAGS: 00000246 ORIG_RAX: 0000000000000078 RAX: ffffffffffffffda RBX: 0000000000003ffc RCX: 0000000000000000 RDX: 00000000200005c0 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000012 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 ------------[ cut here ]------------ DEBUG_LOCKS_WARN_ON(depth <= 0) WARNING: CPU: 1 PID: 7744 at kernel/locking/lockdep.c:4052 __lock_release kernel/locking/lockdep.c:4052 [inline] WARNING: CPU: 1 PID: 7744 at kernel/locking/lockdep.c:4052 lock_release+0x667/0xa00 kernel/locking/lockdep.c:4321 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 7744 Comm: syz-executor007 Not tainted 5.1.0+ MiCode#4 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 panic+0x2cb/0x65c kernel/panic.c:214 __warn.cold+0x20/0x45 kernel/panic.c:566 report_bug+0x263/0x2b0 lib/bug.c:186 fixup_bug arch/x86/kernel/traps.c:179 [inline] fixup_bug arch/x86/kernel/traps.c:174 [inline] do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272 do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291 invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:972 RIP: 0010:__lock_release kernel/locking/lockdep.c:4052 [inline] RIP: 0010:lock_release+0x667/0xa00 kernel/locking/lockdep.c:4321 Code: 0f 85 a0 03 00 00 8b 35 77 66 08 08 85 f6 75 23 48 c7 c6 a0 55 6b 87 48 c7 c7 40 25 6b 87 4c 89 85 70 ff ff ff e8 b7 a9 eb ff <0f> 0b 4c 8b 85 70 ff ff ff 4c 89 ea 4c 89 e6 4c 89 c7 e8 52 63 ff RSP: 0018:ffff888094117b48 EFLAGS: 00010086 RAX: 0000000000000000 RBX: 1ffff11012822f6f RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff815af236 RDI: ffffed1012822f5b RBP: ffff888094117c00 R08: ffff888092bfc400 R09: fffffbfff113301d R10: fffffbfff113301c R11: ffffffff889980e3 R12: ffffffff8a451df8 R13: ffffffff8142e71f R14: ffffffff8a44cc80 R15: ffff888094117bd8 percpu_up_read.constprop.0+0xcb/0x110 include/linux/percpu-rwsem.h:92 cgroup_threadgroup_change_end include/linux/cgroup-defs.h:712 [inline] copy_process.part.0+0x47ff/0x6710 kernel/fork.c:2222 copy_process kernel/fork.c:1772 [inline] _do_fork+0x25d/0xfd0 kernel/fork.c:2338 __do_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:240 [inline] __se_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:236 [inline] __ia32_compat_sys_x86_clone+0xbc/0x140 arch/x86/ia32/sys_ia32.c:236 do_syscall_32_irqs_on arch/x86/entry/common.c:334 [inline] do_fast_syscall_32+0x281/0xd54 arch/x86/entry/common.c:405 entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139 RIP: 0023:0xf7fec849 Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90 RSP: 002b:00000000ffed5a8c EFLAGS: 00000246 ORIG_RAX: 0000000000000078 RAX: ffffffffffffffda RBX: 0000000000003ffc RCX: 0000000000000000 RDX: 00000000200005c0 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000012 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Kernel Offset: disabled Rebooting in 86400 seconds.. Reported-and-tested-by: syzbot+3286e58549edc479faae@syzkaller.appspotmail.com Fixes: b3e5838 ("clone: add CLONE_PIDFD") Signed-off-by: Christian Brauner <christian@brauner.io> (cherry picked from commit c3b7112) Bug: 135608568 Test: test program using syscall(__NR_sys_pidfd_open,..) and poll() Change-Id: Ib9ecb1e5c0c6e2d062b89c25109ec571570eb497 Signed-off-by: Suren Baghdasaryan <surenb@google.com>
commit e84fe99 upstream. Without CONFIG_PREEMPT, it can happen that we get soft lockups detected, e.g., while booting up. watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ MiCode#4 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014 RIP: __pageblock_pfn_to_page+0x134/0x1c0 Call Trace: set_zone_contiguous+0x56/0x70 page_alloc_init_late+0x166/0x176 kernel_init_freeable+0xfa/0x255 kernel_init+0xa/0x106 ret_from_fork+0x35/0x40 The issue becomes visible when having a lot of memory (e.g., 4TB) assigned to a single NUMA node - a system that can easily be created using QEMU. Inside VMs on a hypervisor with quite some memory overcommit, this is fairly easy to trigger. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Shile Zhang <shile.zhang@linux.alibaba.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Shile Zhang <shile.zhang@linux.alibaba.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200416073417.5003-1-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fe2b73d upstream. The code in decode_config4() of arch/mips/kernel/cpu-probe.c asid_mask = MIPS_ENTRYHI_ASID; if (config4 & MIPS_CONF4_AE) asid_mask |= MIPS_ENTRYHI_ASIDX; set_cpu_asid_mask(c, asid_mask); set asid_mask to cpuinfo->asid_mask. So in order to support variable ASID_MASK, KVM_ENTRYHI_ASID should also be changed to cpu_asid_mask(&boot_cpu_data). Cc: Stable <stable@vger.kernel.org> MiCode#4.9+ Reviewed-by: Aleksandar Markovic <aleksandar.qemu.devel@gmail.com> Signed-off-by: Xing Li <lixing@loongson.cn> [Huacai: Change current_cpu_data to boot_cpu_data for optimization] Signed-off-by: Huacai Chen <chenhc@lemote.com> Message-Id: <1590220602-3547-2-git-send-email-chenhc@lemote.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit f6766ff ] We need to check mddev->del_work before flush workqueu since the purpose of flush is to ensure the previous md is disappeared. Otherwise the similar deadlock appeared if LOCKDEP is enabled, it is due to md_open holds the bdev->bd_mutex before flush workqueue. kernel: [ 154.522645] ====================================================== kernel: [ 154.522647] WARNING: possible circular locking dependency detected kernel: [ 154.522650] 5.6.0-rc7-lp151.27-default MiCode#25 Tainted: G O kernel: [ 154.522651] ------------------------------------------------------ kernel: [ 154.522653] mdadm/2482 is trying to acquire lock: kernel: [ 154.522655] ffff888078529128 ((wq_completion)md_misc){+.+.}, at: flush_workqueue+0x84/0x4b0 kernel: [ 154.522673] kernel: [ 154.522673] but task is already holding lock: kernel: [ 154.522675] ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590 kernel: [ 154.522691] kernel: [ 154.522691] which lock already depends on the new lock. kernel: [ 154.522691] kernel: [ 154.522694] kernel: [ 154.522694] the existing dependency chain (in reverse order) is: kernel: [ 154.522696] kernel: [ 154.522696] -> MiCode#4 (&bdev->bd_mutex){+.+.}: kernel: [ 154.522704] __mutex_lock+0x87/0x950 kernel: [ 154.522706] __blkdev_get+0x79/0x590 kernel: [ 154.522708] blkdev_get+0x65/0x140 kernel: [ 154.522709] blkdev_get_by_dev+0x2f/0x40 kernel: [ 154.522716] lock_rdev+0x3d/0x90 [md_mod] kernel: [ 154.522719] md_import_device+0xd6/0x1b0 [md_mod] kernel: [ 154.522723] new_dev_store+0x15e/0x210 [md_mod] kernel: [ 154.522728] md_attr_store+0x7a/0xc0 [md_mod] kernel: [ 154.522732] kernfs_fop_write+0x117/0x1b0 kernel: [ 154.522735] vfs_write+0xad/0x1a0 kernel: [ 154.522737] ksys_write+0xa4/0xe0 kernel: [ 154.522745] do_syscall_64+0x64/0x2b0 kernel: [ 154.522748] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522749] kernel: [ 154.522749] -> MiCode#3 (&mddev->reconfig_mutex){+.+.}: kernel: [ 154.522752] __mutex_lock+0x87/0x950 kernel: [ 154.522756] new_dev_store+0xc9/0x210 [md_mod] kernel: [ 154.522759] md_attr_store+0x7a/0xc0 [md_mod] kernel: [ 154.522761] kernfs_fop_write+0x117/0x1b0 kernel: [ 154.522763] vfs_write+0xad/0x1a0 kernel: [ 154.522765] ksys_write+0xa4/0xe0 kernel: [ 154.522767] do_syscall_64+0x64/0x2b0 kernel: [ 154.522769] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522770] kernel: [ 154.522770] -> MiCode#2 (kn->count#253){++++}: kernel: [ 154.522775] __kernfs_remove+0x253/0x2c0 kernel: [ 154.522778] kernfs_remove+0x1f/0x30 kernel: [ 154.522780] kobject_del+0x28/0x60 kernel: [ 154.522783] mddev_delayed_delete+0x24/0x30 [md_mod] kernel: [ 154.522786] process_one_work+0x2a7/0x5f0 kernel: [ 154.522788] worker_thread+0x2d/0x3d0 kernel: [ 154.522793] kthread+0x117/0x130 kernel: [ 154.522795] ret_from_fork+0x3a/0x50 kernel: [ 154.522796] kernel: [ 154.522796] -> MiCode#1 ((work_completion)(&mddev->del_work)){+.+.}: kernel: [ 154.522800] process_one_work+0x27e/0x5f0 kernel: [ 154.522802] worker_thread+0x2d/0x3d0 kernel: [ 154.522804] kthread+0x117/0x130 kernel: [ 154.522806] ret_from_fork+0x3a/0x50 kernel: [ 154.522807] kernel: [ 154.522807] -> #0 ((wq_completion)md_misc){+.+.}: kernel: [ 154.522813] __lock_acquire+0x1392/0x1690 kernel: [ 154.522816] lock_acquire+0xb4/0x1a0 kernel: [ 154.522818] flush_workqueue+0xab/0x4b0 kernel: [ 154.522821] md_open+0xb6/0xc0 [md_mod] kernel: [ 154.522823] __blkdev_get+0xea/0x590 kernel: [ 154.522825] blkdev_get+0x65/0x140 kernel: [ 154.522828] do_dentry_open+0x1d1/0x380 kernel: [ 154.522831] path_openat+0x567/0xcc0 kernel: [ 154.522834] do_filp_open+0x9b/0x110 kernel: [ 154.522836] do_sys_openat2+0x201/0x2a0 kernel: [ 154.522838] do_sys_open+0x57/0x80 kernel: [ 154.522840] do_syscall_64+0x64/0x2b0 kernel: [ 154.522842] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522844] kernel: [ 154.522844] other info that might help us debug this: kernel: [ 154.522844] kernel: [ 154.522846] Chain exists of: kernel: [ 154.522846] (wq_completion)md_misc --> &mddev->reconfig_mutex --> &bdev->bd_mutex kernel: [ 154.522846] kernel: [ 154.522850] Possible unsafe locking scenario: kernel: [ 154.522850] kernel: [ 154.522852] CPU0 CPU1 kernel: [ 154.522853] ---- ---- kernel: [ 154.522854] lock(&bdev->bd_mutex); kernel: [ 154.522856] lock(&mddev->reconfig_mutex); kernel: [ 154.522858] lock(&bdev->bd_mutex); kernel: [ 154.522860] lock((wq_completion)md_misc); kernel: [ 154.522861] kernel: [ 154.522861] *** DEADLOCK *** kernel: [ 154.522861] kernel: [ 154.522864] 1 lock held by mdadm/2482: kernel: [ 154.522865] #0: ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590 kernel: [ 154.522868] kernel: [ 154.522868] stack backtrace: kernel: [ 154.522873] CPU: 1 PID: 2482 Comm: mdadm Tainted: G O 5.6.0-rc7-lp151.27-default MiCode#25 kernel: [ 154.522875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 kernel: [ 154.522878] Call Trace: kernel: [ 154.522881] dump_stack+0x8f/0xcb kernel: [ 154.522884] check_noncircular+0x194/0x1b0 kernel: [ 154.522888] ? __lock_acquire+0x1392/0x1690 kernel: [ 154.522890] __lock_acquire+0x1392/0x1690 kernel: [ 154.522893] lock_acquire+0xb4/0x1a0 kernel: [ 154.522895] ? flush_workqueue+0x84/0x4b0 kernel: [ 154.522898] flush_workqueue+0xab/0x4b0 kernel: [ 154.522900] ? flush_workqueue+0x84/0x4b0 kernel: [ 154.522905] ? md_open+0xb6/0xc0 [md_mod] kernel: [ 154.522908] md_open+0xb6/0xc0 [md_mod] kernel: [ 154.522910] __blkdev_get+0xea/0x590 kernel: [ 154.522912] ? bd_acquire+0xc0/0xc0 kernel: [ 154.522914] blkdev_get+0x65/0x140 kernel: [ 154.522916] ? bd_acquire+0xc0/0xc0 kernel: [ 154.522918] do_dentry_open+0x1d1/0x380 kernel: [ 154.522921] path_openat+0x567/0xcc0 kernel: [ 154.522923] ? __lock_acquire+0x380/0x1690 kernel: [ 154.522926] do_filp_open+0x9b/0x110 kernel: [ 154.522929] ? __alloc_fd+0xe5/0x1f0 kernel: [ 154.522935] ? kmem_cache_alloc+0x28c/0x630 kernel: [ 154.522939] ? do_sys_openat2+0x201/0x2a0 kernel: [ 154.522941] do_sys_openat2+0x201/0x2a0 kernel: [ 154.522944] do_sys_open+0x57/0x80 kernel: [ 154.522946] do_syscall_64+0x64/0x2b0 kernel: [ 154.522948] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522951] RIP: 0033:0x7f98d279d9ae And md_alloc also flushed the same workqueue, but the thing is different here. Because all the paths call md_alloc don't hold bdev->bd_mutex, and the flush is necessary to avoid race condition, so leave it as it is. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
commit f5e5677 upstream. NULL pointer exception happens occasionally on serial output initiated by login timeout. This was reproduced only if kernel was built with significant debugging options and EDMA driver is used with serial console. col-vf50 login: root Password: Login timed out after 60 seconds. Unable to handle kernel NULL pointer dereference at virtual address 00000044 Internal error: Oops: 5 [MiCode#1] ARM CPU: 0 PID: 157 Comm: login Not tainted 5.7.0-next-20200610-dirty MiCode#4 Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree) (fsl_edma_tx_handler) from [<8016eb10>] (__handle_irq_event_percpu+0x64/0x304) (__handle_irq_event_percpu) from [<8016eddc>] (handle_irq_event_percpu+0x2c/0x7c) (handle_irq_event_percpu) from [<8016ee64>] (handle_irq_event+0x38/0x5c) (handle_irq_event) from [<801729e4>] (handle_fasteoi_irq+0xa4/0x160) (handle_fasteoi_irq) from [<8016ddcc>] (generic_handle_irq+0x34/0x44) (generic_handle_irq) from [<8016e40c>] (__handle_domain_irq+0x54/0xa8) (__handle_domain_irq) from [<80508bc8>] (gic_handle_irq+0x4c/0x80) (gic_handle_irq) from [<80100af0>] (__irq_svc+0x70/0x98) Exception stack(0x8459fe80 to 0x8459fec8) fe80: 72286b00 e3359f64 00000001 0000412d a0070013 85c98840 85c98840 a0070013 fea0: 8054e0d4 00000000 00000002 00000000 00000002 8459fed0 8081fbe8 8081fbec fec0: 60070013 ffffffff (__irq_svc) from [<8081fbec>] (_raw_spin_unlock_irqrestore+0x30/0x58) (_raw_spin_unlock_irqrestore) from [<8056cb48>] (uart_flush_buffer+0x88/0xf8) (uart_flush_buffer) from [<80554e60>] (tty_ldisc_hangup+0x38/0x1ac) (tty_ldisc_hangup) from [<8054c7f4>] (__tty_hangup+0x158/0x2bc) (__tty_hangup) from [<80557b90>] (disassociate_ctty.part.1+0x30/0x23c) (disassociate_ctty.part.1) from [<8011fc18>] (do_exit+0x580/0xba0) (do_exit) from [<801214f8>] (do_group_exit+0x3c/0xb4) (do_group_exit) from [<80121580>] (__wake_up_parent+0x0/0x14) Issue looks like race condition between interrupt handler fsl_edma_tx_handler() (called as result of fsl_edma_xfer_desc()) and terminating the transfer with fsl_edma_terminate_all(). The fsl_edma_tx_handler() handles interrupt for a transfer with already freed edesc and idle==true. Fixes: d6be34f ("dma: Add Freescale eDMA engine driver support") Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Reviewed-by: Robin Gong <yibin.gong@nxp.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/1591877861-28156-2-git-send-email-krzk@kernel.org Signed-off-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fc846e9 upstream. The `INSN_CONFIG` comedi instruction with sub-instruction code `INSN_CONFIG_DIGITAL_TRIG` includes a base channel in `data[3]`. This is used as a right shift amount for other bitmask values without being checked. Shift amounts greater than or equal to 32 will result in undefined behavior. Add code to deal with this, adjusting the checks for invalid channels so that enabled channel bits that would have been lost by shifting are also checked for validity. Only channels 0 to 15 are valid. Fixes: a8c66b6 ("staging: comedi: addi_apci_1500: rewrite the subdevice support functions") Cc: <stable@vger.kernel.org> MiCode#4.0+: ef75e14: staging: comedi: verify array index is correct before using it Cc: <stable@vger.kernel.org> MiCode#4.0+ Signed-off-by: Ian Abbott <abbotti@mev.co.uk> Link: https://lore.kernel.org/r/20200717145257.112660-5-abbotti@mev.co.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 254503a upstream. The drm/omap driver was fixed to correct an issue where using a divider of 32 breaks the DSS despite the TRM stating 32 is a valid number. Through experimentation, it appears that 31 works, and it is consistent with the value used by the drm/omap driver. This patch fixes the divider for fbdev driver instead of the drm. Fixes: f76ee89 ("omapfb: copy omapdss & displays for omapfb") Cc: <stable@vger.kernel.org> MiCode#4.5+ Signed-off-by: Adam Ford <aford173@gmail.com> Reviewed-by: Tomi Valkeinen <tomi.valkeinen@ti.com> Cc: Dave Airlie <airlied@gmail.com> Cc: Rob Clark <robdclark@gmail.com> [b.zolnierkie: mark patch as applicable to stable 4.5+ (was 4.9+)] Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Link: https://patchwork.freedesktop.org/patch/msgid/20200630182636.439015-1-aford173@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e24c644 ] I compiled with AddressSanitizer and I had these memory leaks while I was using the tep_parse_format function: Direct leak of 28 byte(s) in 4 object(s) allocated from: #0 0x7fb07db49ffe in __interceptor_realloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dffe) MiCode#1 0x7fb07a724228 in extend_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:985 MiCode#2 0x7fb07a724c21 in __read_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1140 MiCode#3 0x7fb07a724f78 in read_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1206 MiCode#4 0x7fb07a725191 in __read_expect_type /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1291 MiCode#5 0x7fb07a7251df in read_expect_type /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1299 MiCode#6 0x7fb07a72e6c8 in process_dynamic_array_len /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:2849 MiCode#7 0x7fb07a7304b8 in process_function /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3161 MiCode#8 0x7fb07a730900 in process_arg_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3207 MiCode#9 0x7fb07a727c0b in process_arg /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1786 MiCode#10 0x7fb07a731080 in event_read_print_args /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3285 MiCode#11 0x7fb07a731722 in event_read_print /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3369 MiCode#12 0x7fb07a740054 in __tep_parse_format /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6335 MiCode#13 0x7fb07a74047a in __parse_event /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6389 MiCode#14 0x7fb07a740536 in tep_parse_format /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6431 MiCode#15 0x7fb07a785acf in parse_event ../../../src/fs-src/fs.c:251 MiCode#16 0x7fb07a785ccd in parse_systems ../../../src/fs-src/fs.c:284 MiCode#17 0x7fb07a786fb3 in read_metadata ../../../src/fs-src/fs.c:593 MiCode#18 0x7fb07a78760e in ftrace_fs_source_init ../../../src/fs-src/fs.c:727 MiCode#19 0x7fb07d90c19c in add_component_with_init_method_data ../../../../src/lib/graph/graph.c:1048 MiCode#20 0x7fb07d90c87b in add_source_component_with_initialize_method_data ../../../../src/lib/graph/graph.c:1127 MiCode#21 0x7fb07d90c92a in bt_graph_add_source_component ../../../../src/lib/graph/graph.c:1152 MiCode#22 0x55db11aa632e in cmd_run_ctx_create_components_from_config_components ../../../src/cli/babeltrace2.c:2252 MiCode#23 0x55db11aa6fda in cmd_run_ctx_create_components ../../../src/cli/babeltrace2.c:2347 MiCode#24 0x55db11aa780c in cmd_run ../../../src/cli/babeltrace2.c:2461 MiCode#25 0x55db11aa8a7d in main ../../../src/cli/babeltrace2.c:2673 MiCode#26 0x7fb07d5460b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2) The token variable in the process_dynamic_array_len function is allocated in the read_expect_type function, but is not freed before calling the read_token function. Free the token variable before calling read_token in order to plug the leak. Signed-off-by: Philippe Duplessis-Guindon <pduplessis@efficios.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: https://lore.kernel.org/linux-trace-devel/20200730150236.5392-1-pduplessis@efficios.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d26383d ] The following leaks were detected by ASAN: Indirect leak of 360 byte(s) in 9 object(s) allocated from: #0 0x7fecc305180e in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10780e) MiCode#1 0x560578f6dce5 in perf_pmu__new_format util/pmu.c:1333 MiCode#2 0x560578f752fc in perf_pmu_parse util/pmu.y:59 MiCode#3 0x560578f6a8b7 in perf_pmu__format_parse util/pmu.c:73 MiCode#4 0x560578e07045 in test__pmu tests/pmu.c:155 MiCode#5 0x560578de109b in run_test tests/builtin-test.c:410 MiCode#6 0x560578de109b in test_and_print tests/builtin-test.c:440 MiCode#7 0x560578de401a in __cmd_test tests/builtin-test.c:661 MiCode#8 0x560578de401a in cmd_test tests/builtin-test.c:807 MiCode#9 0x560578e49354 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312 MiCode#10 0x560578ce71a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364 MiCode#11 0x560578ce71a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408 MiCode#12 0x560578ce71a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538 MiCode#13 0x7fecc2b7acc9 in __libc_start_main ../csu/libc-start.c:308 Fixes: cff7f95 ("perf tests: Move pmu tests into separate object") Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/20200915031819.386559-12-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
skb_kill_datagram() does not dequeue the skb when MSG_PEEK is unset. This leaves a free'd skb on the queue, resulting a double-free later. Without this, the following oops can occur: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: [<ffffffff8154fcf7>] skb_dequeue+0x47/0x70 PGD 0 Oops: 0002 [#1] SMP Modules linked in: af_rxrpc ... CPU: 0 PID: 1191 Comm: listen Not tainted 3.12.0+ #4 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff8801183536b0 ti: ffff880035c92000 task.ti: ffff880035c92000 RIP: 0010:[<ffffffff8154fcf7>] skb_dequeue+0x47/0x70 RSP: 0018:ffff880035c93db8 EFLAGS: 00010097 RAX: 0000000000000246 RBX: ffff8800d2754b00 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff8800d254c084 RBP: ffff880035c93dd0 R08: ffff880035c93cf0 R09: ffff8800d968f270 R10: 0000000000000000 R11: 0000000000000293 R12: ffff8800d254c070 R13: ffff8800d254c084 R14: ffff8800cd861240 R15: ffff880119b39720 FS: 00007f37a969d740(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000008 CR3: 00000000d4413000 CR4: 00000000000006f0 Stack: ffff8800d254c000 ffff8800d254c070 ffff8800d254c2c0 ffff880035c93df8 ffffffffa041a5b8 ffff8800cd844c80 ffffffffa04385a0 ffff8800cd844cb0 ffff880035c93e18 ffffffff81546cef ffff8800d45fea00 0000000000000008 Call Trace: [<ffffffffa041a5b8>] rxrpc_release+0x128/0x2e0 [af_rxrpc] [<ffffffff81546cef>] sock_release+0x1f/0x80 [<ffffffff81546d62>] sock_close+0x12/0x20 [<ffffffff811aaba1>] __fput+0xe1/0x230 [<ffffffff811aad3e>] ____fput+0xe/0x10 [<ffffffff810862cc>] task_work_run+0xbc/0xe0 [<ffffffff8106a3be>] do_exit+0x2be/0xa10 [<ffffffff8116dc47>] ? do_munmap+0x297/0x3b0 [<ffffffff8106ab8f>] do_group_exit+0x3f/0xa0 [<ffffffff8106ac04>] SyS_exit_group+0x14/0x20 [<ffffffff8166b069>] system_call_fastpath+0x16/0x1b Signed-off-by: Tim Smith <tim@electronghost.co.uk> Signed-off-by: David Howells <dhowells@redhat.com>
Ezequiel reported that he's facing UBI going into read-only mode after power cut. It turned out that this behavior happens only when updating a static volume is interrupted and Fastmap is used. A possible trace can look like: ubi0 warning: ubi_io_read_vid_hdr [ubi]: no VID header found at PEB 2323, only 0xFF bytes ubi0 warning: ubi_eba_read_leb [ubi]: switch to read-only mode CPU: 0 PID: 833 Comm: ubiupdatevol Not tainted 4.6.0-rc2-ARCH #4 Hardware name: SAMSUNG ELECTRONICS CO., LTD. 300E4C/300E5C/300E7C/NP300E5C-AD8AR, BIOS P04RAP 10/15/2012 0000000000000286 00000000eba949bd ffff8800c45a7b38 ffffffff8140d841 ffff8801964be000 ffff88018eaa4800 ffff8800c45a7bb8 ffffffffa003abf6 ffffffff850e2ac0 8000000000000163 ffff8801850e2ac0 ffff8801850e2ac0 Call Trace: [<ffffffff8140d841>] dump_stack+0x63/0x82 [<ffffffffa003abf6>] ubi_eba_read_leb+0x486/0x4a0 [ubi] [<ffffffffa00453b3>] ubi_check_volume+0x83/0xf0 [ubi] [<ffffffffa0039d97>] ubi_open_volume+0x177/0x350 [ubi] [<ffffffffa00375d8>] vol_cdev_open+0x58/0xb0 [ubi] [<ffffffff8124b08e>] chrdev_open+0xae/0x1d0 [<ffffffff81243bcf>] do_dentry_open+0x1ff/0x300 [<ffffffff8124afe0>] ? cdev_put+0x30/0x30 [<ffffffff81244d36>] vfs_open+0x56/0x60 [<ffffffff812545f4>] path_openat+0x4f4/0x1190 [<ffffffff81256621>] do_filp_open+0x91/0x100 [<ffffffff81263547>] ? __alloc_fd+0xc7/0x190 [<ffffffff812450df>] do_sys_open+0x13f/0x210 [<ffffffff812451ce>] SyS_open+0x1e/0x20 [<ffffffff81a99e32>] entry_SYSCALL_64_fastpath+0x1a/0xa4 UBI checks static volumes for data consistency and reads the whole volume upon first open. If the volume is found erroneous users of UBI cannot read from it, but another volume update is possible to fix it. The check is performed by running ubi_eba_read_leb() on every allocated LEB of the volume. For static volumes ubi_eba_read_leb() computes the checksum of all data stored in a LEB. To verify the computed checksum it has to read the LEB's volume header which stores the original checksum. If the volume header is not found UBI treats this as fatal internal error and switches to RO mode. If the UBI device was attached via a full scan the assumption is correct, the volume header has to be present as it had to be there while scanning to get known as mapped. If the attach operation happened via Fastmap the assumption is no longer correct. When attaching via Fastmap UBI learns the mapping table from Fastmap's snapshot of the system state and not via a full scan. It can happen that a LEB got unmapped after a Fastmap was written to the flash. Then UBI can learn the LEB still as mapped and accessing it returns only 0xFF bytes. As UBI is not a FTL it is allowed to have mappings to empty PEBs, it assumes that the layer above takes care of LEB accounting and referencing. UBIFS does so using the LEB property tree (LPT). For static volumes UBI blindly assumes that all LEBs are present and therefore special actions have to be taken. The described situation can happen when updating a static volume is interrupted, either by a user or a power cut. The volume update code first unmaps all LEBs of a volume and then writes LEB by LEB. If the sequence of operations is interrupted UBI detects this either by the absence of LEBs, no volume header present at scan time, or corrupted payload, detected via checksum. In the Fastmap case the former method won't trigger as no scan happened and UBI automatically thinks all LEBs are present. Only by reading data from a LEB it detects that the volume header is missing and incorrectly treats this as fatal error. To deal with the situation ubi_eba_read_leb() from now on checks whether we attached via Fastmap and handles the absence of a volume header like a data corruption error. This way interrupted static volume updates will correctly get detected also when Fastmap is used. Cc: <stable@vger.kernel.org> Reported-by: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar> Tested-by: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar> Signed-off-by: Richard Weinberger <richard@nod.at>
Shakeel reported a crash in mem_cgroup_protected(), which can be triggered by memcg reclaim if the legacy cgroup v1 use_hierarchy=0 mode is used: BUG: unable to handle kernel NULL pointer dereference at 0000000000000120 PGD 8000001ff55da067 P4D 8000001ff55da067 PUD 1fdc7df067 PMD 0 Oops: 0000 [#4] SMP PTI CPU: 0 PID: 15581 Comm: bash Tainted: G D 4.17.0-smp-clean #5 Hardware name: ... RIP: 0010:mem_cgroup_protected+0x54/0x130 Code: 4c 8b 8e 00 01 00 00 4c 8b 86 08 01 00 00 48 8d 8a 08 ff ff ff 48 85 d2 ba 00 00 00 00 48 0f 44 ca 48 39 c8 0f 84 cf 00 00 00 <48> 8b 81 20 01 00 00 4d 89 ca 4c 39 c8 4c 0f 46 d0 4d 85 d2 74 05 RSP: 0000:ffffabe64dfafa58 EFLAGS: 00010286 RAX: ffff9fb6ff03d000 RBX: ffff9fb6f5b1b000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff9fb6f5b1b000 RDI: ffff9fb6f5b1b000 RBP: ffffabe64dfafb08 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000c800 R12: ffffabe64dfafb88 R13: ffff9fb6f5b1b000 R14: ffffabe64dfafb88 R15: ffff9fb77fffe000 FS: 00007fed1f8ac700(0000) GS:ffff9fb6ff400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000120 CR3: 0000001fdcf86003 CR4: 00000000001606f0 Call Trace: ? shrink_node+0x194/0x510 do_try_to_free_pages+0xfd/0x390 try_to_free_mem_cgroup_pages+0x123/0x210 try_charge+0x19e/0x700 mem_cgroup_try_charge+0x10b/0x1a0 wp_page_copy+0x134/0x5b0 do_wp_page+0x90/0x460 __handle_mm_fault+0x8e3/0xf30 handle_mm_fault+0xfe/0x220 __do_page_fault+0x262/0x500 do_page_fault+0x28/0xd0 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x485b72 The problem happens because parent_mem_cgroup() returns a NULL pointer, which is dereferenced later without a check. As cgroup v1 has no memory guarantee support, let's make mem_cgroup_protected() immediately return MEMCG_PROT_NONE, if the given cgroup has no parent (non-hierarchical mode is used). Link: http://lkml.kernel.org/r/20180611175418.7007-2-guro@fb.com Fixes: bf8d5d5 ("memcg: introduce memory.min") Signed-off-by: Roman Gushchin <guro@fb.com> Reported-by: Shakeel Butt <shakeelb@google.com> Tested-by: Shakeel Butt <shakeelb@google.com> Tested-by: John Stultz <john.stultz@linaro.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 1bb8866 ("mtd: nand: denali: handle timing parameters by setup_data_interface()"), denali_dt.c gets the clock rate from the clock driver. The driver expects the frequency of the bus interface clock, whereas the clock driver of SOCFPGA provides the core clock. Thus, the setup_data_interface() hook calculates timing parameters based on a wrong frequency. To make it work without relying on the clock driver, hard-code the clock frequency, 200MHz. This is fine for existing DT of UniPhier, and also fixes the issue of SOCFPGA because both platforms use 200 MHz for the bus interface clock. Fixes: 1bb8866 ("mtd: nand: denali: handle timing parameters by setup_data_interface()") Cc: linux-stable <stable@vger.kernel.org> #4.14+ Reported-by: Philipp Rosenberger <p.rosenberger@linutronix.de> Suggested-by: Boris Brezillon <boris.brezillon@bootlin.com> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Tested-by: Richard Weinberger <richard@nod.at> Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [qemu-system-x86:10185] CPU: 6 PID: 10185 Comm: qemu-system-x86 Tainted: G OE 4.14.0-rc4+ #4 RIP: 0010:kvm_get_time_scale+0x4e/0xa0 [kvm] Call Trace: get_time_ref_counter+0x5a/0x80 [kvm] kvm_hv_process_stimers+0x120/0x5f0 [kvm] kvm_arch_vcpu_ioctl_run+0x4b4/0x1690 [kvm] kvm_vcpu_ioctl+0x33a/0x620 [kvm] do_vfs_ioctl+0xa1/0x5d0 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x1e/0xa9 This can be reproduced when running kvm-unit-tests/hyperv_stimer.flat and cpu-hotplug stress simultaneously. __this_cpu_read(cpu_tsc_khz) returns 0 (set in kvmclock_cpu_down_prep()) when the pCPU is unhotplug which results in kvm_get_time_scale() gets into an infinite loop. This patch fixes it by treating the unhotplug pCPU as not using master clock. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jiri Pirko says: ==================== mlxsw: GRE offloading fixes Petr says: This patchset fixes a couple bugs in offloading GRE tunnels in mlxsw driver. Patch #1 fixes a problem that local routes pointing at a GRE tunnel device are offloaded even if that netdevice is down. Patch #2 detects that as a result of moving a GRE netdevice to a different VRF, two tunnels now have a conflict of local addresses, something that the mlxsw driver can't offload. Patch #3 fixes a FIB abort caused by forming a route pointing at a GRE tunnel that is eligible for offloading but already onloaded. Patch #4 fixes a problem that next hops migrated to a new RIF kept the old RIF reference, which went dangling shortly afterwards. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 5b72505. The rtc block on i.MX53 is a completely different hardware than the one found on i.MX25. Cc: <stable@vger.kernel.org> #4.14 Reported-by: Noel Vellemans <Noel.Vellemans@visionbms.com> Suggested-by: Juergen Borleis <jbe@pengutronix.de> Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Hendrik Brueckner says: ==================== Perf tool bpf selftests revealed a broken uapi for s390 and arm64. With the BPF_PROG_TYPE_PERF_EVENT program type the bpf_perf_event structure exports the pt_regs structure for all architectures. This fails for s390 and arm64 because pt_regs are not part of the user api and kept in-kernel only. To mitigate the broken uapi, introduce a wrapper that exports pt_regs in an asm-generic way. For arm64, export the exising user_pt_regs structure. For s390, introduce a user_pt_regs structure that exports the beginning of pt_regs. Note that user_pt_regs must export from the beginning of pt_regs as BPF_PROG_TYPE_PERF_EVENT program type is not the only type for running BPF programs. Some more background: For the bpf_perf_event, there is a uapi definition that is passed to the BPF program. For other "probe" points like trace points, kprobes, and uprobes, there is no uapi and the BPF program is always passed pt_regs (which is OK as the BPF program runs in the kernel context). The perf tool can attach BPF programs to all of these "probe" points and, optionally, can create a BPF prologue to access particular arguments (passed as registers). For this, it uses DWARF/CFI information to obtain the register and calls a perf-arch backend function, regs_query_register_offset(). This function returns the index into (user_)pt_regs for a particular register. Then, perf creates a BPF prologue that accesses this register based on the passed stucture from the "probe" point. Part of this series, are also updates to the testing and bpf selftest to deal with asm-specifics. To complete the bpf support in perf, the the regs_query_register_offset function is added for s390 to support BPF prologue creation. Changelog v1 -> v2: - Correct kbuild test bot issues by including asm-generic/bpf_perf_event.h for archictectures that do not have their own asm version. - Added patch to clean-up whitespace and coding style issues in s390 asm/ptrace.h (#4/6) as suggested by Alexei. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
We don't need struct_mutex to initialise userptr (it just allocates a workqueue for itself etc), but we do need struct_mutex later on in i915_gem_init() in order to feed requests onto the HW. This should break the chain [ 385.697902] ====================================================== [ 385.697907] WARNING: possible circular locking dependency detected [ 385.697913] 4.14.0-CI-Patchwork_7234+ #1 Tainted: G U [ 385.697917] ------------------------------------------------------ [ 385.697922] perf_pmu/2631 is trying to acquire lock: [ 385.697927] (&mm->mmap_sem){++++}, at: [<ffffffff811bfe1e>] __might_fault+0x3e/0x90 [ 385.697941] but task is already holding lock: [ 385.697946] (&cpuctx_mutex){+.+.}, at: [<ffffffff8116fe8c>] perf_event_ctx_lock_nested+0xbc/0x1d0 [ 385.697957] which lock already depends on the new lock. [ 385.697963] the existing dependency chain (in reverse order) is: [ 385.697970] -> #4 (&cpuctx_mutex){+.+.}: [ 385.697980] __mutex_lock+0x86/0x9b0 [ 385.697985] perf_event_init_cpu+0x5a/0x90 [ 385.697991] perf_event_init+0x178/0x1a4 [ 385.697997] start_kernel+0x27f/0x3f1 [ 385.698003] verify_cpu+0x0/0xfb [ 385.698006] -> #3 (pmus_lock){+.+.}: [ 385.698015] __mutex_lock+0x86/0x9b0 [ 385.698020] perf_event_init_cpu+0x21/0x90 [ 385.698025] cpuhp_invoke_callback+0xca/0xc00 [ 385.698030] _cpu_up+0xa7/0x170 [ 385.698035] do_cpu_up+0x57/0x70 [ 385.698039] smp_init+0x62/0xa6 [ 385.698044] kernel_init_freeable+0x97/0x193 [ 385.698050] kernel_init+0xa/0x100 [ 385.698055] ret_from_fork+0x27/0x40 [ 385.698058] -> #2 (cpu_hotplug_lock.rw_sem){++++}: [ 385.698068] cpus_read_lock+0x39/0xa0 [ 385.698073] apply_workqueue_attrs+0x12/0x50 [ 385.698078] __alloc_workqueue_key+0x1d8/0x4d8 [ 385.698134] i915_gem_init_userptr+0x5f/0x80 [i915] [ 385.698176] i915_gem_init+0x7c/0x390 [i915] [ 385.698213] i915_driver_load+0x99e/0x15c0 [i915] [ 385.698250] i915_pci_probe+0x33/0x90 [i915] [ 385.698256] pci_device_probe+0xa1/0x130 [ 385.698262] driver_probe_device+0x293/0x440 [ 385.698267] __driver_attach+0xde/0xe0 [ 385.698272] bus_for_each_dev+0x5c/0x90 [ 385.698277] bus_add_driver+0x16d/0x260 [ 385.698282] driver_register+0x57/0xc0 [ 385.698287] do_one_initcall+0x3e/0x160 [ 385.698292] do_init_module+0x5b/0x1fa [ 385.698297] load_module+0x2374/0x2dc0 [ 385.698302] SyS_finit_module+0xaa/0xe0 [ 385.698307] entry_SYSCALL_64_fastpath+0x1c/0xb1 [ 385.698311] -> #1 (&dev->struct_mutex){+.+.}: [ 385.698320] __mutex_lock+0x86/0x9b0 [ 385.698361] i915_mutex_lock_interruptible+0x4c/0x130 [i915] [ 385.698403] i915_gem_fault+0x206/0x760 [i915] [ 385.698409] __do_fault+0x1a/0x70 [ 385.698413] __handle_mm_fault+0x7c4/0xdb0 [ 385.698417] handle_mm_fault+0x154/0x300 [ 385.698440] __do_page_fault+0x2d6/0x570 [ 385.698445] page_fault+0x22/0x30 [ 385.698449] -> #0 (&mm->mmap_sem){++++}: [ 385.698459] lock_acquire+0xaf/0x200 [ 385.698464] __might_fault+0x68/0x90 [ 385.698470] _copy_to_user+0x1e/0x70 [ 385.698475] perf_read+0x1aa/0x290 [ 385.698480] __vfs_read+0x23/0x120 [ 385.698484] vfs_read+0xa3/0x150 [ 385.698488] SyS_read+0x45/0xb0 [ 385.698493] entry_SYSCALL_64_fastpath+0x1c/0xb1 [ 385.698497] other info that might help us debug this: [ 385.698505] Chain exists of: &mm->mmap_sem --> pmus_lock --> &cpuctx_mutex [ 385.698517] Possible unsafe locking scenario: [ 385.698522] CPU0 CPU1 [ 385.698526] ---- ---- [ 385.698529] lock(&cpuctx_mutex); [ 385.698553] lock(pmus_lock); [ 385.698558] lock(&cpuctx_mutex); [ 385.698564] lock(&mm->mmap_sem); [ 385.698568] *** DEADLOCK *** [ 385.698574] 1 lock held by perf_pmu/2631: [ 385.698578] #0: (&cpuctx_mutex){+.+.}, at: [<ffffffff8116fe8c>] perf_event_ctx_lock_nested+0xbc/0x1d0 [ 385.698589] stack backtrace: [ 385.698595] CPU: 3 PID: 2631 Comm: perf_pmu Tainted: G U 4.14.0-CI-Patchwork_7234+ #1 [ 385.698602] Hardware name: /NUC6CAYB, BIOS AYAPLCEL.86A.0040.2017.0619.1722 06/19/2017 [ 385.698609] Call Trace: [ 385.698615] dump_stack+0x5f/0x86 [ 385.698621] print_circular_bug.isra.18+0x1d0/0x2c0 [ 385.698627] __lock_acquire+0x19c3/0x1b60 [ 385.698634] ? generic_exec_single+0x77/0xe0 [ 385.698640] ? lock_acquire+0xaf/0x200 [ 385.698644] lock_acquire+0xaf/0x200 [ 385.698650] ? __might_fault+0x3e/0x90 [ 385.698655] __might_fault+0x68/0x90 [ 385.698660] ? __might_fault+0x3e/0x90 [ 385.698665] _copy_to_user+0x1e/0x70 [ 385.698670] perf_read+0x1aa/0x290 [ 385.698675] __vfs_read+0x23/0x120 [ 385.698682] ? __fget+0x101/0x1f0 [ 385.698686] vfs_read+0xa3/0x150 [ 385.698691] SyS_read+0x45/0xb0 [ 385.698696] entry_SYSCALL_64_fastpath+0x1c/0xb1 [ 385.698701] RIP: 0033:0x7ff1c46876ed [ 385.698705] RSP: 002b:00007fff13552f90 EFLAGS: 00000293 ORIG_RAX: 0000000000000000 [ 385.698712] RAX: ffffffffffffffda RBX: ffffc90000647ff0 RCX: 00007ff1c46876ed [ 385.698718] RDX: 0000000000000010 RSI: 00007fff13552fa0 RDI: 0000000000000005 [ 385.698723] RBP: 000056063d300580 R08: 0000000000000000 R09: 0000000000000060 [ 385.698729] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000046 [ 385.698734] R13: 00007fff13552c6f R14: 00007ff1c6279d00 R15: 00007ff1c6279a40 Testcase: igt/perf_pmu Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20171122172621.16158-1-chris@chris-wilson.co.uk Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> (cherry picked from commit ee48700) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
struct list_lru_one l.nr_items could be accessed concurrently as noticed by KCSAN, BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39: list_lru_isolate_move+0xf9/0x130 list_lru_isolate_move at mm/list_lru.c:180 inode_lru_isolate+0x12b/0x2a0 __list_lru_walk_one+0x122/0x3d0 list_lru_walk_one+0x75/0xa0 prune_icache_sb+0x8b/0xc0 super_cache_scan+0x1b8/0x250 do_shrink_slab+0x256/0x6d0 shrink_slab+0x41b/0x4a0 shrink_node+0x35c/0xd80 balance_pgdat+0x652/0xd90 kswapd+0x396/0x8d0 kthread+0x1e0/0x200 ret_from_fork+0x27/0x50 read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56: list_lru_count_one+0x116/0x2f0 list_lru_count_one at mm/list_lru.c:193 super_cache_count+0xe8/0x170 do_shrink_slab+0x95/0x6d0 shrink_slab+0x41b/0x4a0 shrink_node+0x35c/0xd80 do_try_to_free_pages+0x1f7/0xa10 try_to_free_pages+0x26c/0x5e0 __alloc_pages_slowpath+0x458/0x1290 __alloc_pages_nodemask+0x3bb/0x450 alloc_pages_vma+0x8a/0x2c0 do_anonymous_page+0x170/0x700 __handle_mm_fault+0xc9f/0xd00 handle_mm_fault+0xfc/0x2f0 do_page_fault+0x263/0x6f9 page_fault+0x34/0x40 Reported by Kernel Concurrency Sanitizer on: CPU: 56 PID: 6345 Comm: oom01 Tainted: G W L 5.5.0-next-20200205+ MiCode#4 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 A shattered l.nr_items could affect the shrinker behaviour due to a data race. Fix it by adding READ_ONCE() for the read. Since the writes are aligned and up to word-size, assume those are safe from data races to avoid readability issues of writing WRITE_ONCE(var, var + val). Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Marco Elver <elver@google.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: I80f9419bd90136dd7d02249af982d4a649150c06
struct list_lru_one l.nr_items could be accessed concurrently as noticed by KCSAN, BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39: list_lru_isolate_move+0xf9/0x130 list_lru_isolate_move at mm/list_lru.c:180 inode_lru_isolate+0x12b/0x2a0 __list_lru_walk_one+0x122/0x3d0 list_lru_walk_one+0x75/0xa0 prune_icache_sb+0x8b/0xc0 super_cache_scan+0x1b8/0x250 do_shrink_slab+0x256/0x6d0 shrink_slab+0x41b/0x4a0 shrink_node+0x35c/0xd80 balance_pgdat+0x652/0xd90 kswapd+0x396/0x8d0 kthread+0x1e0/0x200 ret_from_fork+0x27/0x50 read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56: list_lru_count_one+0x116/0x2f0 list_lru_count_one at mm/list_lru.c:193 super_cache_count+0xe8/0x170 do_shrink_slab+0x95/0x6d0 shrink_slab+0x41b/0x4a0 shrink_node+0x35c/0xd80 do_try_to_free_pages+0x1f7/0xa10 try_to_free_pages+0x26c/0x5e0 __alloc_pages_slowpath+0x458/0x1290 __alloc_pages_nodemask+0x3bb/0x450 alloc_pages_vma+0x8a/0x2c0 do_anonymous_page+0x170/0x700 __handle_mm_fault+0xc9f/0xd00 handle_mm_fault+0xfc/0x2f0 do_page_fault+0x263/0x6f9 page_fault+0x34/0x40 Reported by Kernel Concurrency Sanitizer on: CPU: 56 PID: 6345 Comm: oom01 Tainted: G W L 5.5.0-next-20200205+ MiCode#4 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 A shattered l.nr_items could affect the shrinker behaviour due to a data race. Fix it by adding READ_ONCE() for the read. Since the writes are aligned and up to word-size, assume those are safe from data races to avoid readability issues of writing WRITE_ONCE(var, var + val). Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Marco Elver <elver@google.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: I80f9419bd90136dd7d02249af982d4a649150c06
struct list_lru_one l.nr_items could be accessed concurrently as noticed by KCSAN, BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39: list_lru_isolate_move+0xf9/0x130 list_lru_isolate_move at mm/list_lru.c:180 inode_lru_isolate+0x12b/0x2a0 __list_lru_walk_one+0x122/0x3d0 list_lru_walk_one+0x75/0xa0 prune_icache_sb+0x8b/0xc0 super_cache_scan+0x1b8/0x250 do_shrink_slab+0x256/0x6d0 shrink_slab+0x41b/0x4a0 shrink_node+0x35c/0xd80 balance_pgdat+0x652/0xd90 kswapd+0x396/0x8d0 kthread+0x1e0/0x200 ret_from_fork+0x27/0x50 read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56: list_lru_count_one+0x116/0x2f0 list_lru_count_one at mm/list_lru.c:193 super_cache_count+0xe8/0x170 do_shrink_slab+0x95/0x6d0 shrink_slab+0x41b/0x4a0 shrink_node+0x35c/0xd80 do_try_to_free_pages+0x1f7/0xa10 try_to_free_pages+0x26c/0x5e0 __alloc_pages_slowpath+0x458/0x1290 __alloc_pages_nodemask+0x3bb/0x450 alloc_pages_vma+0x8a/0x2c0 do_anonymous_page+0x170/0x700 __handle_mm_fault+0xc9f/0xd00 handle_mm_fault+0xfc/0x2f0 do_page_fault+0x263/0x6f9 page_fault+0x34/0x40 Reported by Kernel Concurrency Sanitizer on: CPU: 56 PID: 6345 Comm: oom01 Tainted: G W L 5.5.0-next-20200205+ MiCode#4 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 A shattered l.nr_items could affect the shrinker behaviour due to a data race. Fix it by adding READ_ONCE() for the read. Since the writes are aligned and up to word-size, assume those are safe from data races to avoid readability issues of writing WRITE_ONCE(var, var + val). Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Marco Elver <elver@google.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: I80f9419bd90136dd7d02249af982d4a649150c06
The warning below says it all: BUG: using __this_cpu_read() in preemptible [00000000] code: swapper/0/1 caller is __this_cpu_preempt_check CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.0-rc8 #4 Call Trace: dump_stack check_preemption_disabled ? do_early_param __this_cpu_preempt_check arch_perfmon_init op_nmi_init ? alloc_pci_root_info oprofile_arch_init oprofile_init do_one_initcall ... These accessors should not have been used in the first place: it is PPro so no mixed silicon revisions and thus it can simply use boot_cpu_data. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Tested-by: Fengguang Wu <fengguang.wu@intel.com> Fix-creation-mandated-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Robert Richter <rric@kernel.org> Cc: x86@kernel.org Cc: stable@vger.kernel.org
I see the following lockdep splat in the qcom pinctrl driver when attempting to suspend the device. WARNING: possible recursive locking detected 5.4.11 MiCode#3 Tainted: G W -------------------------------------------- cat/3074 is trying to acquire lock: ffffff81f49804c0 (&irq_desc_lock_class){-.-.}, at: __irq_get_desc_lock+0x64/0x94 but task is already holding lock: ffffff81f1cc10c0 (&irq_desc_lock_class){-.-.}, at: __irq_get_desc_lock+0x64/0x94 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&irq_desc_lock_class); lock(&irq_desc_lock_class); *** DEADLOCK *** May be due to missing lock nesting notation 6 locks held by cat/3074: #0: ffffff81f01d9420 (sb_writers#7){.+.+}, at: vfs_write+0xd0/0x1a4 MiCode#1: ffffff81bd7d2080 (&of->mutex){+.+.}, at: kernfs_fop_write+0x12c/0x1fc MiCode#2: ffffff81f4c322f0 (kn->count#337){.+.+}, at: kernfs_fop_write+0x134/0x1fc MiCode#3: ffffffe411a41d60 (system_transition_mutex){+.+.}, at: pm_suspend+0x108/0x348 MiCode#4: ffffff81f1c5e970 (&dev->mutex){....}, at: __device_suspend+0x168/0x41c MiCode#5: ffffff81f1cc10c0 (&irq_desc_lock_class){-.-.}, at: __irq_get_desc_lock+0x64/0x94 stack backtrace: CPU: 5 PID: 3074 Comm: cat Tainted: G W 5.4.11 MiCode#3 Hardware name: Google Cheza (rev3+) (DT) Call trace: dump_backtrace+0x0/0x174 show_stack+0x20/0x2c dump_stack+0xc8/0x124 __lock_acquire+0x460/0x2388 lock_acquire+0x1cc/0x210 _raw_spin_lock_irqsave+0x64/0x80 __irq_get_desc_lock+0x64/0x94 irq_set_irq_wake+0x40/0x144 qpnpint_irq_set_wake+0x28/0x34 set_irq_wake_real+0x40/0x5c irq_set_irq_wake+0x70/0x144 pm8941_pwrkey_suspend+0x34/0x44 platform_pm_suspend+0x34/0x60 dpm_run_callback+0x64/0xcc __device_suspend+0x310/0x41c dpm_suspend+0xf8/0x298 dpm_suspend_start+0x84/0xb4 suspend_devices_and_enter+0xbc/0x620 pm_suspend+0x210/0x348 state_store+0xb0/0x108 kobj_attr_store+0x14/0x24 sysfs_kf_write+0x4c/0x64 kernfs_fop_write+0x15c/0x1fc __vfs_write+0x54/0x18c vfs_write+0xe4/0x1a4 ksys_write+0x7c/0xe4 __arm64_sys_write+0x20/0x2c el0_svc_common+0xa8/0x160 el0_svc_handler+0x7c/0x98 el0_svc+0x8/0xc Set a lockdep class when we map the irq so that irq_set_wake() doesn't warn about a lockdep bug that doesn't exist. Change-Id: If7054fbc6c18d4d01e581a8a2e41cfc8a0990957 Fixes: 12a9eea ("spmi: pmic-arb: convert to v2 irq interfaces to support hierarchical IRQ chips") Cc: Douglas Anderson <dianders@chromium.org> Cc: Brian Masney <masneyb@onstation.org> Cc: Lina Iyer <ilina@codeaurora.org> Cc: Maulik Shah <mkshah@codeaurora.org> Cc: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Stephen Boyd <swboyd@chromium.org> Link: https://lore.kernel.org/r/20200121183748.68662-1-swboyd@chromium.org Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Git-commit: 2d5a2f9 Git-repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [vvegivad@codeaurora.org: minor modifications needed to handle the API difference] Signed-off-by: Veera Vegivada <vvegivad@codeaurora.org>
[ Upstream commit ff612ba ] We've been seeing the following sporadically throughout our fleet panic: kernel BUG at fs/btrfs/relocation.c:4584! netversion: 5.0-0 Backtrace: #0 [ffffc90003adb880] machine_kexec at ffffffff81041da8 MiCode#1 [ffffc90003adb8c8] __crash_kexec at ffffffff8110396c MiCode#2 [ffffc90003adb988] crash_kexec at ffffffff811048ad MiCode#3 [ffffc90003adb9a0] oops_end at ffffffff8101c19a MiCode#4 [ffffc90003adb9c0] do_trap at ffffffff81019114 MiCode#5 [ffffc90003adba00] do_error_trap at ffffffff810195d0 MiCode#6 [ffffc90003adbab0] invalid_op at ffffffff81a00a9b [exception RIP: btrfs_reloc_cow_block+692] RIP: ffffffff8143b614 RSP: ffffc90003adbb68 RFLAGS: 00010246 RAX: fffffffffffffff7 RBX: ffff8806b9c32000 RCX: ffff8806aad00690 RDX: ffff880850b295e0 RSI: ffff8806b9c32000 RDI: ffff88084f205bd0 RBP: ffff880849415000 R8: ffffc90003adbbe0 R9: ffff88085ac90000 R10: ffff8805f7369140 R11: 0000000000000000 R12: ffff880850b295e0 R13: ffff88084f205bd0 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 MiCode#7 [ffffc90003adbbb0] __btrfs_cow_block at ffffffff813bf1cd MiCode#8 [ffffc90003adbc28] btrfs_cow_block at ffffffff813bf4b3 MiCode#9 [ffffc90003adbc78] btrfs_search_slot at ffffffff813c2e6c The way relocation moves data extents is by creating a reloc inode and preallocating extents in this inode and then copying the data into these preallocated extents. Once we've done this for all of our extents, we'll write out these dirty pages, which marks the extent written, and goes into btrfs_reloc_cow_block(). From here we get our current reloc_control, which _should_ match the reloc_control for the current block group we're relocating. However if we get an ENOSPC in this path at some point we'll bail out, never initiating writeback on this inode. Not a huge deal, unless we happen to be doing relocation on a different block group, and this block group is now rc->stage == UPDATE_DATA_PTRS. This trips the BUG_ON() in btrfs_reloc_cow_block(), because we expect to be done modifying the data inode. We are in fact done modifying the metadata for the data inode we're currently using, but not the one from the failed block group, and thus we BUG_ON(). (This happens when writeback finishes for extents from the previous group, when we are at btrfs_finish_ordered_io() which updates the data reloc tree (inode item, drops/adds extent items, etc).) Fix this by writing out the reloc data inode always, and then breaking out of the loop after that point to keep from tripping this BUG_ON() later. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [ add note from Filipe ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
…text commit 0c9e8b3 upstream. stub_probe() and stub_disconnect() call functions which could call sleeping function in invalid context whil holding busid_lock. Fix the problem by refining the lock holds to short critical sections to change the busid_priv fields. This fix restructures the code to limit the lock holds in stub_probe() and stub_disconnect(). stub_probe(): [15217.927028] BUG: sleeping function called from invalid context at mm/slab.h:418 [15217.927038] in_atomic(): 1, irqs_disabled(): 0, pid: 29087, name: usbip [15217.927044] 5 locks held by usbip/29087: [15217.927047] #0: 0000000091647f28 (sb_writers#6){....}, at: vfs_write+0x191/0x1c0 [15217.927062] MiCode#1: 000000008f9ba75b (&of->mutex){....}, at: kernfs_fop_write+0xf7/0x1b0 [15217.927072] MiCode#2: 00000000872e5b4b (&dev->mutex){....}, at: __device_driver_lock+0x3b/0x50 [15217.927082] MiCode#3: 00000000e74ececc (&dev->mutex){....}, at: __device_driver_lock+0x46/0x50 [15217.927090] MiCode#4: 00000000b20abbe0 (&(&busid_table[i].busid_lock)->rlock){....}, at: get_busid_priv+0x48/0x60 [usbip_host] [15217.927103] CPU: 3 PID: 29087 Comm: usbip Tainted: G W 5.1.0-rc6+ MiCode#40 [15217.927106] Hardware name: Dell Inc. OptiPlex 790/0HY9JP, BIOS A18 09/24/2013 [15217.927109] Call Trace: [15217.927118] dump_stack+0x63/0x85 [15217.927127] ___might_sleep+0xff/0x120 [15217.927133] __might_sleep+0x4a/0x80 [15217.927143] kmem_cache_alloc_trace+0x1aa/0x210 [15217.927156] stub_probe+0xe8/0x440 [usbip_host] [15217.927171] usb_probe_device+0x34/0x70 stub_disconnect(): [15279.182478] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:908 [15279.182487] in_atomic(): 1, irqs_disabled(): 0, pid: 29114, name: usbip [15279.182492] 5 locks held by usbip/29114: [15279.182494] #0: 0000000091647f28 (sb_writers#6){....}, at: vfs_write+0x191/0x1c0 [15279.182506] MiCode#1: 00000000702cf0f3 (&of->mutex){....}, at: kernfs_fop_write+0xf7/0x1b0 [15279.182514] MiCode#2: 00000000872e5b4b (&dev->mutex){....}, at: __device_driver_lock+0x3b/0x50 [15279.182522] MiCode#3: 00000000e74ececc (&dev->mutex){....}, at: __device_driver_lock+0x46/0x50 [15279.182529] MiCode#4: 00000000b20abbe0 (&(&busid_table[i].busid_lock)->rlock){....}, at: get_busid_priv+0x48/0x60 [usbip_host] [15279.182541] CPU: 0 PID: 29114 Comm: usbip Tainted: G W 5.1.0-rc6+ MiCode#40 [15279.182543] Hardware name: Dell Inc. OptiPlex 790/0HY9JP, BIOS A18 09/24/2013 [15279.182546] Call Trace: [15279.182554] dump_stack+0x63/0x85 [15279.182561] ___might_sleep+0xff/0x120 [15279.182566] __might_sleep+0x4a/0x80 [15279.182574] __mutex_lock+0x55/0x950 [15279.182582] ? get_busid_priv+0x48/0x60 [usbip_host] [15279.182587] ? reacquire_held_locks+0xec/0x1a0 [15279.182591] ? get_busid_priv+0x48/0x60 [usbip_host] [15279.182597] ? find_held_lock+0x94/0xa0 [15279.182609] mutex_lock_nested+0x1b/0x20 [15279.182614] ? mutex_lock_nested+0x1b/0x20 [15279.182618] kernfs_remove_by_name_ns+0x2a/0x90 [15279.182625] sysfs_remove_file_ns+0x15/0x20 [15279.182629] device_remove_file+0x19/0x20 [15279.182634] stub_disconnect+0x6d/0x180 [usbip_host] [15279.182643] usb_unbind_device+0x27/0x60 Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
…rupts commit ef97402 upstream. The passthrough interrupts are defined at the host level and their IRQ data should not be cleared unless specifically deconfigured (shutdown) by the host. They differ from the IPI interrupts which are allocated by the XIVE KVM device and reserved to the guest usage only. This fixes a host crash when destroying a VM in which a PCI adapter was passed-through. In this case, the interrupt is cleared and freed by the KVM device and then shutdown by vfio at the host level. [ 1007.360265] BUG: Kernel NULL pointer dereference at 0x00000d00 [ 1007.360285] Faulting instruction address: 0xc00000000009da34 [ 1007.360296] Oops: Kernel access of bad area, sig: 7 [MiCode#1] [ 1007.360303] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV [ 1007.360314] Modules linked in: vhost_net vhost iptable_mangle ipt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc kvm_hv kvm xt_tcpudp iptable_filter squashfs fuse binfmt_misc vmx_crypto ib_iser rdma_cm iw_cm ib_cm libiscsi scsi_transport_iscsi nfsd ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress lzo_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq multipath mlx5_ib ib_uverbs ib_core crc32c_vpmsum mlx5_core [ 1007.360425] CPU: 9 PID: 15576 Comm: CPU 18/KVM Kdump: loaded Not tainted 5.1.0-gad7e7d0ef MiCode#4 [ 1007.360454] NIP: c00000000009da34 LR: c00000000009e50c CTR: c00000000009e5d0 [ 1007.360482] REGS: c000007f24ccf330 TRAP: 0300 Not tainted (5.1.0-gad7e7d0ef) [ 1007.360500] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24002484 XER: 00000000 [ 1007.360532] CFAR: c00000000009da10 DAR: 0000000000000d00 DSISR: 00080000 IRQMASK: 1 [ 1007.360532] GPR00: c00000000009e62c c000007f24ccf5c0 c000000001510600 c000007fe7f947c0 [ 1007.360532] GPR04: 0000000000000d00 0000000000000000 0000000000000000 c000005eff02d200 [ 1007.360532] GPR08: 0000000000400000 0000000000000000 0000000000000000 fffffffffffffffd [ 1007.360532] GPR12: c00000000009e5d0 c000007fffff7b00 0000000000000031 000000012c345718 [ 1007.360532] GPR16: 0000000000000000 0000000000000008 0000000000418004 0000000000040100 [ 1007.360532] GPR20: 0000000000000000 0000000008430000 00000000003c0000 0000000000000027 [ 1007.360532] GPR24: 00000000000000ff 0000000000000000 00000000000000ff c000007faa90d98c [ 1007.360532] GPR28: c000007faa90da40 00000000000fe040 ffffffffffffffff c000007fe7f947c0 [ 1007.360689] NIP [c00000000009da34] xive_esb_read+0x34/0x120 [ 1007.360706] LR [c00000000009e50c] xive_do_source_set_mask.part.0+0x2c/0x50 [ 1007.360732] Call Trace: [ 1007.360738] [c000007f24ccf5c0] [c000000000a6383c] snooze_loop+0x15c/0x270 (unreliable) [ 1007.360775] [c000007f24ccf5f0] [c00000000009e62c] xive_irq_shutdown+0x5c/0xe0 [ 1007.360795] [c000007f24ccf630] [c00000000019e4a0] irq_shutdown+0x60/0xe0 [ 1007.360813] [c000007f24ccf660] [c000000000198c44] __free_irq+0x3a4/0x420 [ 1007.360831] [c000007f24ccf700] [c000000000198dc8] free_irq+0x78/0xe0 [ 1007.360849] [c000007f24ccf730] [c00000000096c5a8] vfio_msi_set_vector_signal+0xa8/0x350 [ 1007.360878] [c000007f24ccf7f0] [c00000000096c938] vfio_msi_set_block+0xe8/0x1e0 [ 1007.360899] [c000007f24ccf850] [c00000000096cae0] vfio_msi_disable+0xb0/0x110 [ 1007.360912] [c000007f24ccf8a0] [c00000000096cd04] vfio_pci_set_msi_trigger+0x1c4/0x3d0 [ 1007.360922] [c000007f24ccf910] [c00000000096d910] vfio_pci_set_irqs_ioctl+0xa0/0x170 [ 1007.360941] [c000007f24ccf930] [c00000000096b400] vfio_pci_disable+0x80/0x5e0 [ 1007.360963] [c000007f24ccfa10] [c00000000096b9bc] vfio_pci_release+0x5c/0x90 [ 1007.360991] [c000007f24ccfa40] [c000000000963a9c] vfio_device_fops_release+0x3c/0x70 [ 1007.361012] [c000007f24ccfa70] [c0000000003b5668] __fput+0xc8/0x2b0 [ 1007.361040] [c000007f24ccfac0] [c0000000001409b0] task_work_run+0x140/0x1b0 [ 1007.361059] [c000007f24ccfb20] [c000000000118f8c] do_exit+0x3ac/0xd00 [ 1007.361076] [c000007f24ccfc00] [c0000000001199b0] do_group_exit+0x60/0x100 [ 1007.361094] [c000007f24ccfc40] [c00000000012b514] get_signal+0x1a4/0x8f0 [ 1007.361112] [c000007f24ccfd30] [c000000000021cc8] do_notify_resume+0x1a8/0x430 [ 1007.361141] [c000007f24ccfe20] [c00000000000e444] ret_from_except_lite+0x70/0x74 [ 1007.361159] Instruction dump: [ 1007.361175] 38422c00 e9230000 712a0004 41820010 548a2036 7d442378 78840020 71290020 [ 1007.361194] 4082004c e9230010 7c892214 7c0004ac <e9240000> 0c090000 4c00012c 792a0022 Cc: stable@vger.kernel.org # v4.12+ Fixes: 5af5099 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller") Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Greg Kurz <groug@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c3ed222 upstream. Send along the already-allocated fattr along with nfs4_fs_locations, and drop the memcpy of fattr. We end up growing two more allocations, but this fixes up a crash as: PID: 790 TASK: ffff88811b43c000 CPU: 0 COMMAND: "ls" #0 [ffffc90000857920] panic at ffffffff81b9bfde #1 [ffffc900008579c0] do_trap at ffffffff81023a9b #2 [ffffc90000857a10] do_error_trap at ffffffff81023b78 #3 [ffffc90000857a58] exc_stack_segment at ffffffff81be1f45 #4 [ffffc90000857a80] asm_exc_stack_segment at ffffffff81c009de #5 [ffffc90000857b08] nfs_lookup at ffffffffa0302322 [nfs] #6 [ffffc90000857b70] __lookup_slow at ffffffff813a4a5f #7 [ffffc90000857c60] walk_component at ffffffff813a86c4 #8 [ffffc90000857cb8] path_lookupat at ffffffff813a9553 #9 [ffffc90000857cf0] filename_lookup at ffffffff813ab86b Suggested-by: Trond Myklebust <trondmy@hammerspace.com> Fixes: 9558a00 ("NFS: Remove the label from the nfs4_lookup_res struct") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4f40a5b upstream. This was missed in c3ed222 ("NFSv4: Fix free of uninitialized nfs4_label on referral lookup.") and causes a panic when mounting with '-o trunkdiscovery': PID: 1604 TASK: ffff93dac3520000 CPU: 3 COMMAND: "mount.nfs" #0 [ffffb79140f738f8] machine_kexec at ffffffffaec64bee #1 [ffffb79140f73950] __crash_kexec at ffffffffaeda67fd #2 [ffffb79140f73a18] crash_kexec at ffffffffaeda76ed #3 [ffffb79140f73a30] oops_end at ffffffffaec2658d #4 [ffffb79140f73a50] general_protection at ffffffffaf60111e [exception RIP: nfs_fattr_init+0x5] RIP: ffffffffc0c18265 RSP: ffffb79140f73b08 RFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff93dac304a800 RCX: 0000000000000000 RDX: ffffb79140f73bb0 RSI: ffff93dadc8cbb40 RDI: d03ee11cfaf6bd50 RBP: ffffb79140f73be8 R8: ffffffffc0691560 R9: 0000000000000006 R10: ffff93db3ffd3df8 R11: 0000000000000000 R12: ffff93dac4040000 R13: ffff93dac2848e00 R14: ffffb79140f73b60 R15: ffffb79140f73b30 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #5 [ffffb79140f73b08] _nfs41_proc_get_locations at ffffffffc0c73d53 [nfsv4] #6 [ffffb79140f73bf0] nfs4_proc_get_locations at ffffffffc0c83e90 [nfsv4] #7 [ffffb79140f73c60] nfs4_discover_trunking at ffffffffc0c83fb7 [nfsv4] #8 [ffffb79140f73cd8] nfs_probe_fsinfo at ffffffffc0c0f95f [nfs] #9 [ffffb79140f73da0] nfs_probe_server at ffffffffc0c1026a [nfs] RIP: 00007f6254fce26e RSP: 00007ffc69496ac8 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f6254fce26e RDX: 00005600220a82a0 RSI: 00005600220a64d0 RDI: 00005600220a6520 RBP: 00007ffc69496c50 R8: 00005600220a8710 R9: 003035322e323231 R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffc69496c50 R13: 00005600220a8440 R14: 0000000000000010 R15: 0000560020650ef9 ORIG_RAX: 00000000000000a5 CS: 0033 SS: 002b Fixes: c3ed222 ("NFSv4: Fix free of uninitialized nfs4_label on referral lookup.") Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
English version
Linux© kernel is a copyrighted software written by @torvalds et al. and is released under the GNU GPLv2 license. According to licensing terms distributor of derived work of Linux kernel must provide modified Linux© source code along with the distribution of the software. Since Redmi (1) uses Linux kernel and is distributed to the public @MiCode must comply to terms of GNU GPL2 License and release Redmi (1)'s Linux© kernel source code.
简体中文(中国大陆地区)版本
Chinese(Simplified)(China) version
Linux© 作业系统内核是由 @torvalds 等人设计的一个享有智慧财产权的软件并其被已 GNU GPL 第 2 版授权条款释出。根据授权条款规定基于 Linux© 作业系统内核的衍伸作品的散布者必需伴随其散布的软件提供被修改的 Linux© 作业系统核心来源代码。既然红米手机使用了 Linux© 作业系统内核且已被散布至公开场合 @MiCode 必须尊从 GNU GPL 第 2 版授权条款的规定释出红米手机的 Linux© 作业系统核心来源代码。
正體中文(台灣地區)版本
Chinese(Traditional)(Taiwan) version
Linux© 作業系統核心是由 @torvalds 等人設計的一個享有智慧財產權的軟體並其被已 GNU GPL 第 2 版授權條款釋出。根據授權條款規定基於 Linux© 作業系統核心的衍伸作品的散佈者必需伴隨其散佈的軟體提供被修改的 Linux© 作業系統核心來源程式碼。既然紅米手機 (1)使用了 Linux© 作業系統核心且已被散佈至公開場合 @MiCode 必須尊從 GNU GPL 第 2 版授權條款的規定釋出紅米手機 (1) 的 Linux© 作業系統核心來源程式碼。
The text was updated successfully, but these errors were encountered: