RFE: perform the audit multicast send from the kauditd thread #23

pcmoore · 2016-10-21T16:17:05Z

We currently queue the audit unicast sends to a separate thread while we send the multicast messages immediately. There doesn't appear to be a reason why we can't also do the multicast send from the separate thread as long as we are careful to only queue the skb once and move the netlink tweaks to the dedicated kauditd_thread.

pcmoore · 2016-10-21T16:21:45Z

Upstream discussion that spawned this RFE:

http://marc.info/?t=147694653900002&r=1&w=2

Related to issue #22.

pcmoore · 2016-10-22T01:39:36Z

In addition, once we have rewritten the audit multicast code, we should reintroduce the changes that were causing the problem identified in #22, if that commit was reverted.

This reverts commit bc51ddd ("netns: avoid disabling irq for netns id") as it was found to cause problems with systems running SELinux/audit, see the mailing list thread below: * http://marc.info/?t=147694653900002&r=1&w=2 Eventually we should be able to reintroduce this code once we have rewritten the audit multicast code to queue messages much the same way we do for unicast messages. A tracking issue for this can be found below: * linux-audit/audit-kernel#23 Reported-by: Stephen Smalley <sds@tycho.nsa.gov> Reported-by: Elad Raz <e@eladraz.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Paul Moore <paul@paul-moore.com>

This reverts commit bc51ddd ("netns: avoid disabling irq for netns id") as it was found to cause problems with systems running SELinux/audit, see the mailing list thread below: * http://marc.info/?t=147694653900002&r=1&w=2 Eventually we should be able to reintroduce this code once we have rewritten the audit multicast code to queue messages much the same way we do for unicast messages. A tracking issue for this can be found below: * linux-audit/audit-kernel#23 Reported-by: Stephen Smalley <sds@tycho.nsa.gov> Reported-by: Elad Raz <e@eladraz.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>

Sending audit netlink multicast messages is bad for all the same reasons that sending audit netlink unicast messages is bad, so this patch reworks things so that we don't do the multicast send in audit_log_end(), we do it from the dedicated kauditd_thread thread just as we do for unicast messages. See the GitHub issues below for more information/history: * linux-audit/audit-kernel#23 * linux-audit/audit-kernel#22 Signed-off-by: Paul Moore <paul@paul-moore.com>

pcmoore · 2016-11-24T01:47:52Z

Initial RFC patchset posted:

https://www.redhat.com/archives/linux-audit/2016-November/msg00035.html

pcmoore · 2016-11-29T22:16:10Z

The RFC patchset is now in audit#next:

https://www.redhat.com/archives/linux-audit/2016-November/msg00054.html

... and the revert from issue #22 has been reverted in audit#next as well:

https://www.redhat.com/archives/linux-audit/2016-November/msg00055.html

Sending audit netlink multicast messages is bad for all the same reasons that sending audit netlink unicast messages is bad, so this patch reworks things so that we don't do the multicast send in audit_log_end(), we do it from the dedicated kauditd_thread thread just as we do for unicast messages. See the GitHub issues below for more information/history: * #23 * #22 Signed-off-by: Paul Moore <paul@paul-moore.com>

Bring back commit bc51ddd ("netns: avoid disabling irq for netns id") now that we've fixed some audit multicast issues that caused problems with original attempt. Additional information, and history, can be found in the links below: * #22 * #23 Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Paul Moore <paul@paul-moore.com>

GIT 8375913f32f3e6f5bbdb18bd36046a07bbfb7654 commit e8d7c33232e5fdfa761c3416539bc5b4acd12db5 Author: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Date: Sun Nov 27 19:32:32 2016 +0300 md/raid5: limit request size according to implementation limits Current implementation employ 16bit counter of active stripes in lower bits of bio->bi_phys_segments. If request is big enough to overflow this counter bio will be completed and freed too early. Fortunately this not happens in default configuration because several other limits prevent that: stripe_cache_size * nr_disks effectively limits count of active stripes. And small max_sectors_kb at lower disks prevent that during normal read/write operations. Overflow easily happens in discard if it's enabled by module parameter "devices_handle_discard_safely" and stripe_cache_size is set big enough. This patch limits requests size with 256Mb - 8Kb to prevent overflows. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Shaohua Li <shli@kernel.org> Cc: Neil Brown <neilb@suse.com> Cc: stable@vger.kernel.org Signed-off-by: Shaohua Li <shli@fb.com> commit 1a0ec5c30c37d29e4435a45e75c896f91af970bd Author: JackieLiu <liuyun01@kylinos.cn> Date: Tue Nov 29 11:57:30 2016 +0800 md/raid5-cache: do not need to set STRIPE_PREREAD_ACTIVE repeatedly R5c_make_stripe_write_out has set this flag, do not need to set again. Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com> commit dbd22c8d7fc6276a48627e57a5605cf9565de78a Author: JackieLiu <liuyun01@kylinos.cn> Date: Tue Nov 29 11:13:20 2016 +0800 md/raid5-cache: remove the unnecessary next_cp_seq field from the r5l_log The next_cp_seq field is useless, remove it. Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com> commit bc8f167f9c4656b7a972936237e9c38e6ab80c67 Author: JackieLiu <liuyun01@kylinos.cn> Date: Mon Nov 28 16:19:20 2016 +0800 md/raid5-cache: release the stripe_head at the appropriate location If we released the 'stripe_head' in r5c_recovery_flush_log, ctx->cached_list will both release the data-parity stripes and data-only stripes, which will become empty. And we also need to use the data-only stripes in r5c_recovery_rewrite_data_only_stripes, so we should wait util rewrite data-only stripes is done before releasing them. Reviewed-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com> commit fc833c2a2f4129c42efdaed64b9eb6e9ae5fdcee Author: JackieLiu <liuyun01@kylinos.cn> Date: Mon Nov 28 16:19:19 2016 +0800 md/raid5-cache: use ring add to prevent overflow 'write_pos' must be protected with 'r5l_ring_add', or it may overflow Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> commit 9b69173e5c6000b2c6fafc5085dcd7b173f073c8 Author: JackieLiu <liuyun01@kylinos.cn> Date: Mon Nov 28 16:19:18 2016 +0800 md/raid5-cache: remove unnecessary function parameters The function parameter 'recovery_list' is not used in body, we can delete it Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> commit 462eb7d87297dae5837f3445b68b79e835ab0d6c Author: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Date: Sat Nov 26 10:57:14 2016 +0800 raid5-cache: don't set STRIPE_R5C_PARTIAL_STRIPE flag while load stripe into cache r5c_recovery_load_one_stripe should not set STRIPE_R5C_PARTIAL_STRIPE flag,as the data-only stripe may be STRIPE_R5C_FULL_STRIPE stripe. The state machine would release the stripe later and add it into neither r5c_cached_full_stripes list or r5c_cached_partial_stripes list and set correct flag. Reviewed-by: JackieLiu <liuyun01@kylinos.cn> Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com> commit 5b52b1d37a6fe92bb95c949d56ffd9ec3b7a72b3 Author: Paul Moore <paul@paul-moore.com> Date: Tue Nov 29 16:57:48 2016 -0500 netns: avoid disabling irq for netns id Bring back commit bc51dddf98c9 ("netns: avoid disabling irq for netns id") now that we've fixed some audit multicast issues that caused problems with original attempt. Additional information, and history, can be found in the links below: * https://github.com/linux-audit/audit-kernel/issues/22 * https://github.com/linux-audit/audit-kernel/issues/23 Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Paul Moore <paul@paul-moore.com> commit ef256ce8a87ebecc335bd3c42280030e46428362 Author: Paul Moore <paul@paul-moore.com> Date: Tue Nov 29 16:53:26 2016 -0500 audit: don't ever sleep on a command record/message Sleeping on a command record/message in audit_log_start() could slow something, e.g. auditd, from doing something important, e.g. clean shutdown, which could present problems on a heavily loaded system. This patch allows tasks to bypass any queue restrictions if they are logging a command record/message. Signed-off-by: Paul Moore <paul@paul-moore.com> commit 2934a794d81231ce905cdb5fa1f02e776c26c78b Author: Paul Moore <paul@paul-moore.com> Date: Tue Nov 29 16:53:26 2016 -0500 audit: handle a clean auditd shutdown with grace When auditd stops cleanly it sets 'auditd_pid' to 0 with an AUDIT_SET message, in this case we should reset our backlog queues via the auditd_reset() function. This patch also adds a 'auditd_pid' check to the top of kauditd_send_unicast_skb() so we can fail quicker. Signed-off-by: Paul Moore <paul@paul-moore.com> commit 917a7707a4a49b3b0d35c53cac40d94bbfa8060b Author: Paul Moore <paul@paul-moore.com> Date: Tue Nov 29 16:53:25 2016 -0500 audit: wake up kauditd_thread after auditd registers This patch was suggested by Richard Briggs back in 2015, see the link to the mail archive below. Unfortunately, that patch is no longer even remotely valid due to other changes to the code. * https://www.redhat.com/archives/linux-audit/2015-October/msg00075.html Suggested-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> commit 5e6dbd54309b9ddeaa595e5c1015351617bbcc2d Author: Paul Moore <paul@paul-moore.com> Date: Tue Nov 29 16:53:25 2016 -0500 audit: rework audit_log_start() The backlog queue handling in audit_log_start() is a little odd with some questionable design decisions, this patch attempts to rectify this with the following changes: * Never make auditd wait, ignore any backlog limits as we need auditd awake so it can drain the backlog queue. * When we hit a backlog limit and start dropping records, don't wake all the tasks sleeping on the backlog, that's silly. Instead, let kauditd_thread() take care of waking everyone once it has had a chance to drain the backlog queue. * Don't keep a global backlog timeout countdown, make it per-task. A per-task timer means we won't have all the sleeping tasks waking at the same time and hammering on an already stressed backlog queue. Signed-off-by: Paul Moore <paul@paul-moore.com> commit 8b4c3202464e880cb546a2ee6b4a64877c83d2ec Author: Paul Moore <paul@paul-moore.com> Date: Tue Nov 29 16:53:25 2016 -0500 audit: rework the audit queue handling The audit record backlog queue has always been a bit of a mess, and the moving the multicast send into kauditd_thread() from audit_log_end() only makes things worse. This patch attempts to fix the backlog queue with a better design that should hold up better under load and have less of a performance impact at syscall invocation time. While it looks like there is a log going on in this patch, the main change is the move from a single backlog queue to three queues: * A queue for holding records generated from audit_log_end() that haven't been consumed by kauditd_thread() (audit_queue). * A queue for holding records that have been sent via multicast but had a temporary failure when sending via unicast and need a resend (audit_retry_queue). * A queue for holding records that haven't been sent via unicast because no one is listening (audit_hold_queue). Special care is taken in this patch to ensure that the proper record ordering is preserved, e.g. we send everything in the hold queue first, then the retry queue, and finally the main queue. Signed-off-by: Paul Moore <paul@paul-moore.com> commit 2af6ea0994c77c87ffb8620df94498fb288dff55 Author: Paul Moore <paul@paul-moore.com> Date: Tue Nov 29 16:53:24 2016 -0500 audit: rename the queues and kauditd related functions The audit queue names can be shortened and the record sending helpers associated with the kauditd task could be named better, do these small cleanups now to make life easier once we start reworking the queues and kauditd code. Signed-off-by: Paul Moore <paul@paul-moore.com> commit 6bb9f3cfb0f5f5d91029bdc3b991de281864179b Author: Paul Moore <paul@paul-moore.com> Date: Tue Nov 29 16:53:24 2016 -0500 audit: queue netlink multicast sends just like we do for unicast sends Sending audit netlink multicast messages is bad for all the same reasons that sending audit netlink unicast messages is bad, so this patch reworks things so that we don't do the multicast send in audit_log_end(), we do it from the dedicated kauditd_thread thread just as we do for unicast messages. See the GitHub issues below for more information/history: * https://github.com/linux-audit/audit-kernel/issues/23 * https://github.com/linux-audit/audit-kernel/issues/22 Signed-off-by: Paul Moore <paul@paul-moore.com> commit 04c7b99a807fa189cd9b85b2f75767f59ab9b7d1 Author: Paul Moore <paul@paul-moore.com> Date: Tue Nov 29 16:53:23 2016 -0500 audit: fixup audit_init() Make sure everything is initialized before we start the kauditd_thread and don't emit the "initialized" record until everything is finished. We also panic with a descriptive message if we can't start the kauditd_thread. Signed-off-by: Paul Moore <paul@paul-moore.com> commit d93f4097376007702a26886c836568a6f9e8c5dd Author: Richard Guy Briggs <rbriggs@redhat.com> Date: Tue Nov 29 16:53:23 2016 -0500 audit: move kaudit thread start from auditd registration to kaudit init (#2) Richard made this change some time ago but Eric backed it out because the rest of the supporting code wasn't ready. In order to move the netlink multicast send to kauditd_thread we need to ensure the kauditd_thread is always running, so restore commit 6ff5e459 ("audit: move kaudit thread start from auditd registration to kaudit init"). Signed-off-by: Richard Guy Briggs <rbriggs@redhat.com> [PM: brought forward and merged based on Richard's old patch] Signed-off-by: Paul Moore <paul@paul-moore.com> commit d6ba7a9c8b5a6a42e8f7a7efd8345122611b535c Author: Jonathan Corbet <corbet@lwn.net> Date: Fri Nov 18 17:21:32 2016 -0700 doc: Sphinxify the tracepoint docbook Convert the tracepoint docbook template to RST and add it to the core-api manual. No changes to the actual text beyond the mechanical formatting conversion. Cc: Jason Baron <jbaron@redhat.com> Cc: William Cohen <wcohen@redhat.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> commit 8da3dc53347205b0d32ded4ab9c96dcf336061d8 Author: Jonathan Corbet <corbet@lwn.net> Date: Fri Nov 18 17:17:11 2016 -0700 doc: debugobjects: actually pull in the kerneldoc comments Add the appropriate markup to get the kerneldoc comments out of lib/debugobjects.c that have never seen the light of day until now. A logical next step, left for the reader at the moment, is to move the function descriptions *out* of debug-objects.rst and into the kerneldoc comments themselves. Signed-off-by: Jonathan Corbet <corbet@lwn.net> commit 93dc3a112bf8e5f97e3d9744595934ff31708764 Author: Jonathan Corbet <corbet@lwn.net> Date: Fri Nov 18 17:06:13 2016 -0700 doc: Convert the debugobjects DocBook template to sphinx A couple of the most minor heading tweaks, otherwise no changes to the text itself beyond the mechanical conversion. Note that the inclusion of the kerneldoc comments from the source has never worked, since exported symbols were asked for and none of those functions are exported to modules. It doesn't work here either :) Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jonathan Corbet <corbet@lwn.net> commit 0bb33e25e5c91f69304f3799609a52354dca1af4 Author: Jonathan Corbet <corbet@lwn.net> Date: Fri Nov 18 16:04:48 2016 -0700 docs: Move the 802.11 guide into the driver-api manual Put this documentation with the other driver docs and try to keep the top level reasonably clean. Cc: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> commit 9367a9cf15eae56d36b3ac76738970bef3698e63 Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Date: Tue Nov 29 13:19:27 2016 -0800 fixup! rcu: Add functions to test for trivial grace periods commit 19f52e52fedc204e3a7a1ab568b9208064ccfb1e Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Date: Tue Nov 29 13:13:47 2016 -0800 rcu: Add functions to test for trivial grace periods Under some circumstances, RCU grace periods are zero cost. For RCU-preempt, this is the case during boot, and for RCU-bh and RCU-sched, this is the case if there is only one CPU. This means that RCU users might wish to dispense with grace-period-avoidance strategies when grace periods are zero cost, so this commit adds rcu_trivial_gp(), rcu_bh_trivial_gp(), and rcu_sched_trivial_gp() to test for these conditions. Because the conditions leading to zero-cost grace periods can change at any time (for example, when a second CPU is onlined), these functions should be used as performance hints, and must not be relied on for correctness. For example, even if rcu_trivial_gp() returns true, you are required to invoke synchronize_rcu(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> commit 8902bd9bfc63a1dbed31a6a44dc895a33c6238e9 Author: Amir Goldstein <amir73il@gmail.com> Date: Tue Nov 22 11:47:09 2016 +0200 ovl: show redirect_dir mount option Show the value of redirect_dir in /proc/mounts. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> commit a53ca63502e62ca459de32821753c8227dc94197 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Nov 29 12:02:17 2016 +0000 drm: Protect fb_helper list manipulation with a mutex Though we only walk the kernel_fb_helper_list inside a panic (or single thread debugging), we still need to protect the list manipulation on creating/removing a framebuffer device in order to prevent list corruption. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Sean Paul <seanpaul@chromium.org> Link: http://patchwork.freedesktop.org/patch/msgid/20161129120217.7344-3-chris@chris-wilson.co.uk commit 64e94407fb5a4128636c2b15b38fa6e71a427228 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Nov 29 12:02:16 2016 +0000 drm: Pull together probe + setup for drm_fb_helper drm_fb_helper_probe_connector_modes() is always called before drm_setup_crtcs(), so just move the call into drm_setup_crtcs for a small bit of code compaction. Note that register_framebuffer will do a modeset (when fbcon is enabled) and hence must be moved out of the critical section. A follow-up patch will add new locking for the fb list, hence move all the related registration code together. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Daniel Vetter <daniel.vetter@intel.com> Signed-off-by: Sean Paul <seanpaul@chromium.org> Link: http://patchwork.freedesktop.org/patch/msgid/20161129120217.7344-2-chris@chris-wilson.co.uk commit 966a6a13c6660b499caf2932de22ae70c1317786 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Nov 29 12:02:15 2016 +0000 drm: Hold mode_config.lock to prevent hotplug whilst setting up crtcs The fb_helper->connector_count is modified when a new connector is constructed following a hotplug event (e.g. DP-MST). This causes trouble for drm_setup_crtcs() and friends that assume that fb_helper is constant: [ 1250.872997] BUG: KASAN: slab-out-of-bounds in drm_setup_crtcs+0x320/0xf80 at addr ffff88074cdd2608 [ 1250.873020] Write of size 40 by task kworker/u8:3/480 [ 1250.873039] CPU: 2 PID: 480 Comm: kworker/u8:3 Tainted: G U 4.9.0-rc6+ #285 [ 1250.873043] Hardware name: /NUC6i3SYB, BIOS SYSKLi35.86A.0024.2015.1027.2142 10/27/2015 [ 1250.873050] Workqueue: events_unbound async_run_entry_fn [ 1250.873056] ffff88070f9d78f0 ffffffff814b72aa ffff88074e40c5c0 ffff88074cdd2608 [ 1250.873067] ffff88070f9d7918 ffffffff8124ff3c ffff88070f9d79b0 ffff88074cdd2600 [ 1250.873079] ffff88074e40c5c0 ffff88070f9d79a0 ffffffff812501e4 0000000000000005 [ 1250.873090] Call Trace: [ 1250.873099] [<ffffffff814b72aa>] dump_stack+0x67/0x9d [ 1250.873106] [<ffffffff8124ff3c>] kasan_object_err+0x1c/0x70 [ 1250.873113] [<ffffffff812501e4>] kasan_report_error+0x204/0x4f0 [ 1250.873120] [<ffffffff81698df0>] ? drm_dev_printk+0x140/0x140 [ 1250.873127] [<ffffffff81250ac3>] kasan_report+0x53/0x60 [ 1250.873134] [<ffffffff81688b40>] ? drm_setup_crtcs+0x320/0xf80 [ 1250.873142] [<ffffffff8124f18e>] check_memory_region+0x13e/0x1a0 [ 1250.873147] [<ffffffff8124f5f3>] memset+0x23/0x40 [ 1250.873154] [<ffffffff81688b40>] drm_setup_crtcs+0x320/0xf80 [ 1250.873161] [<ffffffff810be7c5>] ? wake_up_q+0x45/0x80 [ 1250.873169] [<ffffffff81b0c180>] ? mutex_lock_nested+0x5a0/0x5a0 [ 1250.873176] [<ffffffff8168a0e6>] drm_fb_helper_initial_config+0x206/0x7a0 [ 1250.873183] [<ffffffff81689ee0>] ? drm_fb_helper_set_par+0x90/0x90 [ 1250.873303] [<ffffffffa0b68690>] ? intel_fbdev_fini+0x140/0x140 [i915] [ 1250.873387] [<ffffffffa0b686b2>] intel_fbdev_initial_config+0x22/0x40 [i915] [ 1250.873391] [<ffffffff810b50ff>] async_run_entry_fn+0x7f/0x270 [ 1250.873394] [<ffffffff810a64b0>] process_one_work+0x3d0/0x960 [ 1250.873398] [<ffffffff810a641d>] ? process_one_work+0x33d/0x960 [ 1250.873401] [<ffffffff810a60e0>] ? max_active_store+0xf0/0xf0 [ 1250.873406] [<ffffffff810f6f9d>] ? do_raw_spin_lock+0x10d/0x1a0 [ 1250.873413] [<ffffffff810a767d>] worker_thread+0x8d/0x840 [ 1250.873419] [<ffffffff810a75f0>] ? create_worker+0x2e0/0x2e0 [ 1250.873426] [<ffffffff810b0454>] kthread+0x194/0x1c0 [ 1250.873432] [<ffffffff810b02c0>] ? kthread_park+0x60/0x60 [ 1250.873438] [<ffffffff810f095d>] ? trace_hardirqs_on+0xd/0x10 [ 1250.873446] [<ffffffff810b02c0>] ? kthread_park+0x60/0x60 [ 1250.873453] [<ffffffff810b02c0>] ? kthread_park+0x60/0x60 [ 1250.873457] [<ffffffff81b12277>] ret_from_fork+0x27/0x40 [ 1250.873460] Object at ffff88074cdd2608, in cache kmalloc-32 size: 32 However, when holding the mode_config.lock around the fb_helper, we have to be careful of any callbacks that may reenter the fb_helper and so try to reacquire the mode_config.lock (e.g. register_framebuffer). To avoid the mutex recursion, we have to rearrange the sequence to move the registration into the caller outside of the mode_config.lock. v2: drop the 1; following the lockdep assertion inside the for(;;), I anticipated an error that doesn't happen! Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98826 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Daniel Vetter <daniel@ffwll.ch> Signed-off-by: Sean Paul <seanpaul@chromium.org> Link: http://patchwork.freedesktop.org/patch/msgid/20161129120217.7344-1-chris@chris-wilson.co.uk commit 033a28bac0a20de78426e6faf3414637e4775fbc Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Date: Tue Nov 29 11:06:05 2016 -0800 rcu: Allow boot-time use of cond_resched_rcu_qs() The cond_resched_rcu_qs() macro is used to force RCU quiescent states into long-running in-kernel loops. However, some of these loops can execute during early boot when interrupts are disabled, and during which time it is therefore illegal to enter the scheduler. This commit therefore makes cond_resched_rcu_qs() be a no-op during early boot. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> commit c3fdff480d5d5e23a6e5529d33d1c01060491e44 Author: Paul Moore <paul@paul-moore.com> Date: Sun Nov 20 16:47:55 2016 -0500 audit: add support for session ID user filter Define AUDIT_SESSIONID in the uapi and add support for specifying user filters based on the session ID. Also add the new session ID filter to the feature bitmap so userspace knows it is available. https://github.com/linux-audit/audit-kernel/issues/4 RFE: add a session ID filter to the kernel's user filter Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: combine multiple patches from Richard into this one] Signed-off-by: Paul Moore <paul@paul-moore.com> commit f7b7bee75e06cbdce864f7b313ac05555e7eff6b Author: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Date: Sat Nov 26 10:57:13 2016 +0800 raid5-cache: add another check conditon before replaying one stripe New stripe that was just allocated has no STRIPE_R5C_CACHING state too, add this check condition could avoid unnecessary replaying for empty stripe. r5l_recovery_replay_one_stripe would reset stripe for any case, delete it to make code more clean. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com> commit 9b57da0630c9fd36ed7a20fc0f98dc82cc0777fa Author: Florian Westphal <fw@strlen.de> Date: Tue Nov 29 02:17:34 2016 +0100 netfilter: ipv6: nf_defrag: drop mangled skb on ream error Dmitry Vyukov reported GPF in network stack that Andrey traced down to negative nh offset in nf_ct_frag6_queue(). Problem is that all network headers before fragment header are pulled. Normal ipv6 reassembly will drop the skb when errors occur further down the line. netfilter doesn't do this, and instead passed the original fragment along. That was also fine back when netfilter ipv6 defrag worked with cloned fragments, as the original, pristine fragment was passed on. So we either have to undo the pull op, or discard such fragments. Since they're malformed after all (e.g. overlapping fragment) it seems preferrable to just drop them. Same for temporary errors -- it doesn't make sense to accept (and perhaps forward!) only some fragments of same datagram. Fixes: 029f7f3b8701cc7ac ("netfilter: ipv6: nf_defrag: avoid/free clone operations") Reported-by: Dmitry Vyukov <dvyukov@google.com> Debugged-by: Andrey Konovalov <andreyknvl@google.com> Diagnosed-by: Eric Dumazet <Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> commit 333ba053d145d6f9152f6b0311a345b876f0fed1 Author: Javier González <javier@cnexlabs.com> Date: Mon Nov 28 22:39:14 2016 +0100 lightnvm: transform target get/set bad block Since targets are given a virtual target device, it is necessary to translate all communication between targets and the backend device. Implement the translation layer for get/set bad block table. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit da2d7cb828ce2714c603827ac5a6e1c98a02e861 Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:39:13 2016 +0100 lightnvm: use target nvm on target-specific ops. On target-specific operations pass on nvm_tgt_dev instead of the generic nvm device. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit a279006afa3377493c4240395c70430f2a9b0d2b Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:39:12 2016 +0100 lightnvm: introduce max_phys_sects helper function Target devices do not have access to the device driver operations. Introduce a helper function that exposes the max. number of physical sectors supported by the underlying device. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit 959e911b31981b52ed3f3d6e351b107bcb9163ef Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:39:11 2016 +0100 lightnvm: introduce helpers for generic ops in rrpc Avoid calling media manager and device-specific operations directly from rrpc. Create helper functions on lightnvm's core instead. Signed-off-by: Javier González <javier@cnexlabs.com> Made it work with null_blk as well. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit 8e53624d44c1de31b1b0d4f500703669418a4c67 Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:39:10 2016 +0100 lightnvm: eliminate nvm_lun abstraction in mm In order to naturally support multi-target instances on an Open-Channel SSD, targets should own the LUNs they get blocks from and manage provisioning internally. This is done in several steps. Since targets own the LUNs the are instantiated on top of and manage the free block list internally, there is no need for a LUN abstraction in the media manager. LUNs are intrinsically managed as in the physical layout (ch:0,lun:0, ..., ch:0,lun:n, ch:1,lun:0, ch:1,lun:n, ..., ch:m,lun:0, ch:m,lun:n) and given to the targets based on the target creation ioctl. This simplifies LUN management and clears the path for a partition manager to sit directly underneath LightNVM targets. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit 2a02e627c245bfa987b97707123d7747d7b0e486 Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:39:09 2016 +0100 lightnvm: eliminate nvm_block abstraction on mm In order to naturally support multi-target instances on an Open-Channel SSD, targets should own the LUNs they get blocks from and manage provisioning internally. This is done in several steps. A part of this transformation is that targets manage their blocks internally. This patch eliminates the nvm_block abstraction and moves block management to the target logic. The rrpc target is transformed. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit eec44565e9ab13bbf5b48864a68871eabf1115c1 Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:39:08 2016 +0100 lightnvm: remove debug lun statistics from gennvm Since LUNs are managed internally on targets, the media manager has no access to the free LUN lists. Thus, debug functions that show LUN information on the device should not be implemented on the media manager, but rather on the target in itself. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit 0ac4072eb10c9627415eb1ca511121156e20012c Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:39:07 2016 +0100 lightnvm: remove get_lun operation on gennvm Since LUNs are managed internally on the target, there is no need for the media manager to implement a get_lun operation. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit 8e79b5cb1d3b8eceaf6862995952dd4de431dd99 Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:39:06 2016 +0100 lightnvm: move block provisioning to targets In order to naturally support multi-target instances on an Open-Channel SSD, targets should own the LUNs they get blocks from and manage provisioning internally. This is done in several steps. This patch moves the block provisioning inside of the target and removes the get/put block interface from the media manager. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit 8176117b82e49e043d045f214ba7a892fba6b827 Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:39:05 2016 +0100 lightnvm: manage lun partitions internally in mm LUNs are exclusively owned by targets implementing a block device FTL. Doing this reservation requires at the moment a 2-way callback gennvm <-> target. The reason behind this is that LUNs were not assumed to always be exclusively owned by targets. However, this design decision goes against I/O determinism QoS (two targets would mix I/O on the same parallel unit in the device). This patch makes LUN reservation as part of the target creation on the media manager. This makes that LUNs are always exclusively owned by the target instantiated on top of them. LUN stripping and/or sharing should be implemented on the target itself or the layers on top. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit de93434fcf74d41754a48e45365a5914e00bc0be Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:39:04 2016 +0100 lightnvm: remove gen_lun abstraction The gen_lun abstraction in the generic media manager was conceived on the assumption that a single target would instantiated on top of it. This has complicated target design to implement multi-instances. Remove this abstraction and move its logic to nvm_lun, which manages physical lun geometry and operations. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit 98379a12c54974ee5856dcf81781a5dc845505c3 Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:39:03 2016 +0100 lightnvm: use constant name instead of value There is a constant to refer to free blocks. Use it when marking bad blocks instead of using a constant value Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit eb00352b5213e52419ac7dd8bbd84a1300fe4b5d Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:39:02 2016 +0100 lightnvm: remove unnecessary variables in rrpc Before vectored I/Os were supported on rrpc, the physical address was stored as part of the nvm_rqd request. This variable become obsolete when the ppa_list was introduced. Cleanup this variable. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit 0e5c3246dbb96b6870634e7d51b2490f05c976cf Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:39:01 2016 +0100 lightnvm: make address conversion functions global Targets are assumed to used the same generic ppa format, where the address is partitioned on ch:lun:block:pg:pl:sec. Thus, make the function in charge of transforming the ppa address from a linear format to the generic one available to all targets. This function will be needed by the media manager in order to do target mapping translations when targets are divided on different physical partitions. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit 7e4f64a9b3004ce592f21653c3b7781628862232 Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:39:00 2016 +0100 lightnvm: cleanup unused target operations Cleanup definition leftovers from old gennvm interface Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit 17b25cfc873e6d7e2283cf8d77f2af93192484a5 Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:38:59 2016 +0100 lightnvm: remove sysfs configuration interface LightNVM used to be managed and configured through sysfs. Since the introduction of management ioctls this interface is redundant and outdated. Get rid of it. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit f0b01b6a610f99fb7683d0f5259bb65649b0fd78 Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:38:58 2016 +0100 lightnvm: rrpc: split bios of size > 256kb rrpc cannot handle bios of size > 256kb due to NVMe using a 64 bit bitmap to signal I/O completion. If a larger bio comes, split it explicitly. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit 402ab9a89d7b5bab08a5534027b39d80085ec19b Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:38:57 2016 +0100 lightnvm: add ECC error codes Add ECC error codes to enable the appropriate handling in the target. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit a24ba4644b7ae5af3cd2eb6992c237cb4548c45e Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:38:56 2016 +0100 lightnvm: export set bad block table Bad blocks should be managed by block owners. This would be either targets for data blocks or sysblk for system blocks. In order to support this, export two functions: One to mark a block as an specific type (e.g., bad block) and another to update the bad block table on the device. Move bad block management to rrpc. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit 8a3c95ab385fb98621455807ae52b4454192f8c5 Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:38:55 2016 +0100 lightnvm: do not protect block 0 Device blocks should be marked by the device and considered as bad blocks by the media manager. Thus, do not make assumptions on which blocks are going to be used by the device. In doing so we might lose valid blocks from the free list. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit bb3149792e0ed52cf5f457dda4c9bf9c5bda1542 Author: Javier González <jg@lightnvm.io> Date: Mon Nov 28 22:38:54 2016 +0100 lightnvm: enable to send hint to erase command Erases might be subject to host hints. An example is multi-plane programming to erase blocks in parallel. Enable targets to specify this hint. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit 3dc87dd048dc442bab633e85bfb96c893612d765 Author: Matias Bjørling <m@bjorling.me> Date: Mon Nov 28 22:38:53 2016 +0100 nvme: lightnvm: attach lightnvm sysfs to nvme block device Previously, LBA read and write were not supported in the lightnvm specification. Now that it supports it, lets use the traditional NVMe gendisk, and attach the lightnvm sysfs geometry export. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit 7498e99fc51ca60b960ef79061e0e7b521feb07e Author: Matias Bjørling <m@bjorling.me> Date: Mon Nov 28 22:38:52 2016 +0100 nvme: lightnvm: frees wrong cmd structure When struct nvme_request was introduced, the nvme_nvm_submit_io was converted to the new interface. The interface moves nvme_nvm_command data structure into the struct request pdu. On io completion, rq->cmd is freed, which should have been the dereferenced pdu nvme_request->cmd. Fixes: d49187e97e94 "nvme: introduce struct nvme_request" Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com> commit e59d8bb574f6d8097e7e14981440dc37425fddc6 Author: Pan Bian <bianpan2016@163.com> Date: Tue Nov 29 07:33:07 2016 +0800 ALSA: echoaudio: Fix improper return value in function load_asic When the second call to load_asic_generic() fails in function load_asic(), "false" is returned. The real value of "false" is 0, which indicates success in the context. As a result, the execution status and the return value may be inconsistent. This patch fixes the bug. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188761 Signed-off-by: Pan Bian <bianpan2016@163.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> commit 40931b85113dad7881d49e8759e5ad41d30a5e6c Author: Eric Dumazet <edumazet@google.com> Date: Fri Nov 25 07:46:20 2016 -0800 mlx4: give precise rx/tx bytes/packets counters mlx4 stats are chaotic because a deferred work queue is responsible to update them every 250 ms. Even sampling stats every one second with "sar -n DEV 1" gives variations like the following : lpaa23:~# sar -n DEV 1 10 | grep eth0 | cut -c1-65 07:39:22 eth0 146877.00 3265554.00 9467.15 4828168.50 07:39:23 eth0 146587.00 3260329.00 9448.15 4820445.98 07:39:24 eth0 146894.00 3259989.00 9468.55 4819943.26 07:39:25 eth0 110368.00 2454497.00 7113.95 3629012.17 <<>> 07:39:26 eth0 146563.00 3257502.00 9447.25 4816266.23 07:39:27 eth0 145678.00 3258292.00 9389.79 4817414.39 07:39:28 eth0 145268.00 3253171.00 9363.85 4809852.46 07:39:29 eth0 146439.00 3262185.00 9438.97 4823172.48 07:39:30 eth0 146758.00 3264175.00 9459.94 4826124.13 07:39:31 eth0 146843.00 3256903.00 9465.44 4815381.97 Average: eth0 142827.50 3179259.70 9206.30 4700578.16 This patch allows rx/tx bytes/packets counters being folded at the time we need stats. We now can fetch stats every 1 ms if we want to check NIC behavior on a small time window. It is also easier to detect anomalies. lpaa23:~# sar -n DEV 1 10 | grep eth0 | cut -c1-65 07:42:50 eth0 142915.00 3177696.00 9212.06 4698270.42 07:42:51 eth0 143741.00 3200232.00 9265.15 4731593.02 07:42:52 eth0 142781.00 3171600.00 9202.92 4689260.16 07:42:53 eth0 143835.00 3192932.00 9271.80 4720761.39 07:42:54 eth0 141922.00 3165174.00 9147.64 4679759.21 07:42:55 eth0 142993.00 3207038.00 9216.78 4741653.05 07:42:56 eth0 141394.06 3154335.64 9113.85 4663731.73 07:42:57 eth0 141850.00 3161202.00 9144.48 4673866.07 07:42:58 eth0 143439.00 3180736.00 9246.05 4702755.35 07:42:59 eth0 143501.00 3210992.00 9249.99 4747501.84 Average: eth0 142835.66 3182165.93 9206.98 4704874.08 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit 6a30c5a73fdc6c51cae4a833bb21375f26d5ccba Author: Alexey Brodkin <abrodkin@synopsys.com> Date: Tue Nov 22 14:16:36 2016 +0300 ARC: axs10x: really enable ARC PGU Up until now we had ARC PGU not enabled in axs10x defconfigs trying to not bloat kernel image again with yet another drivers and subsystems. This change configures ARC PGU (as well as DRM bits it depends on) to be built as a module and so those who need LCD screen to work on axs10x may bundle built .ko files in their target's file-system with help of the following command on host: ------------->8------------- make INSTALL_MOD_PATH=_path_to_target_fs_ modules_install ------------->8------------- and later on target with commands as simple as: ------------->8------------- modprobe adv7511.ko modprobe arcpgu.ko ------------->8------------- get LCD working. Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com> Signed-off-by: Vineet Gupta <vgupta@synopsys.com> commit 98d3e6d4a04f3d1b13a2d547f0d3f2d7f42f4f0e Author: Vineet Gupta <vgupta@synopsys.com> Date: Fri Nov 18 14:19:27 2016 -0800 ARC: rename Zebu platform support to HAPS There are more ARC Linux HAPS users than Zebu ones. Same kernel would work fine on both, even with embedded DT, assuming the FPGA bitfile configuration is same Suggested-by: Francois Bedard <fbedard@ynopsys.com> Signed-off-by: Vineet Gupta <vgupta@synopsys.com> commit f3fd0d516943e769e51c521311ab13dd65c7898f Author: Arnd Bergmann <arnd@arndb.de> Date: Tue Nov 22 15:33:44 2016 +0100 clocksource: nps: avoid maybe-uninitialized warning We get a harmless false-positive warning with the newly added nps clocksource driver: drivers/clocksource/timer-nps.c: In function 'nps_setup_clocksource': drivers/clocksource/timer-nps.c:102:6: error: 'nps_timer1_freq' may be used uninitialized in this function [-Werror=maybe-uninitialized] Gcc here fails to identify that IS_ERR() is only true if PTR_ERR() has a nonzero value. Using PTR_ERR_OR_ZERO() to convert the result first makes this obvious and shuts up the warning. Fixes: 0ee4d9922df5 ("clocksource: Add clockevent support to NPS400 driver") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Vineet Gupta <vgupta@synopsys.com> commit ab321058267b4df1cdd515fffa3ef380ed28ecd2 Author: Noam Camus <noamca@mellanox.com> Date: Thu Nov 17 09:12:43 2016 +0200 clocksource: Add clockevent support to NPS400 driver Till now we used clockevent from generic ARC driver. This was enough as long as we worked with simple multicore SoC. When we are working with multithread SoC each HW thread can be scheduled to receive timer interrupt using timer mask register. This patch will provide a way to control clock events per HW thread. The design idea is that for each core there is dedicated register (TSI) serving all 16 HW threads. The register is a bitmask with one bit for each HW thread. When HW thread wants that next expiration of timer interrupt will hit it then the proper bit should be set in this dedicated register. When timer expires all HW threads within this core which their bit is set at the TSI register will be interrupted. Driver can be used from device tree by: compatible = "ezchip,nps400-timer0" <-- for clocksource compatible = "ezchip,nps400-timer1" <-- for clockevent Note that name convention for timer0/timer1 was taken from legacy ARC design. This design is our base before adding HW threads. For backward compatibility we keep "ezchip,nps400-timer" for clocksource Signed-off-by: Noam Camus <noamca@mellanox.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Rob Herring <robh@kernel.org> commit a08a8e2893015a7fc25b14d75c9fc7a222ae7b38 Author: Noam Camus <noamca@mellanox.com> Date: Wed Nov 16 08:31:12 2016 +0200 clocksource: update "fn" at CLOCKSOURCE_OF_DECLARE() of nps400 timer nps_setup_clocksource() should take node as only argument as defined by typedef int (*of_init_fn_1_ret)(struct device_node *) Therefore need to replace: int __init nps_setup_clocksource(struct device_node *node, struct clk *clk) with int __init nps_setup_clocksource(struct device_node *node) This patch also serve as preparation for next patch which add support for clockevents to nps400. Specifically we add new function nps_get_timer_clk() to serve clocksource and later clockevent registration. Signed-off-by: Noam Camus <noamca@mellanox.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> commit 6e022b60285a6fbf8229e1ec268f2d8f55d35477 Author: Noam Camus <noamca@mellanox.com> Date: Wed Nov 16 08:31:11 2016 +0200 soc: Support for NPS HW scheduling This new header file is for NPS400 SoC (part of ARC architecture). The header file includes macros for save/restore of HW scheduling. The control of HW scheduling is achieved by writing core registers. This code was moved from arc/plat-eznps so it can be used from drivers/clocksource/, available only for CONFIG_EZNPS_MTM_EXT. Signed-off-by: Noam Camus <noamca@mellanox.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> commit b8531ea8d22f5678a2a6b9cce7ad0ce02d2c4482 Author: Vineet Gupta <vgupta@synopsys.com> Date: Mon Oct 31 13:46:38 2016 -0700 clocksource: import ARC timer driver This adds support for - CONFIG_ARC_TIMERS : legacy 32-bit TIMER0 and TIMER1 which count UP from @CNT to @LIMIT, before optionally triggering an interrupt. These are programmed using ARC auxiliary register interface. These are present in all ARC cores (ARC700 and ARC HS38) TIMER0 serves as clockevent for all ARC linux builds. TIMER1 is used for clocksource in arc700 builds. - CONFIG_ARC_TIMERS_64BIT: 64-bit counters, RTC and GFRC found in ARC HS38 cores. These are independnet IP blocks with different programming model respectively. Link: http://lkml.kernel.org/r/20161111231132.GA4186@mai Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Vineet Gupta <vgupta@synopsys.com> commit f201a7e1cf0592aa06931e1624d5e63c8116e765 Author: Vineet Gupta <vgupta@synopsys.com> Date: Mon Oct 31 13:06:19 2016 -0700 ARC: breakout timer include code into separate header ... ... which allows for use in drivers/clocksource later Signed-off-by: Vineet Gupta <vgupta@synopsys.com> commit 43c7919a4133d80845573a9b8047e8bf10ea0eca Author: Vineet Gupta <vgupta@synopsys.com> Date: Mon Oct 31 11:27:08 2016 -0700 ARC: move mcip.h into include/soc and adjust the includes Also remove the dependency on ARCv2, to increase compile coverage for !ARCV2 builds Acked-by: Daniel Lezcano <daniel.lezcnao@linaro.org> Signed-off-by: Vineet Gupta <vgupta@synopsys.com> commit 6387055728a2a5a105433d8c7dd2e0fc17d999e2 Author: Vineet Gupta <vgupta@synopsys.com> Date: Mon Oct 31 11:09:34 2016 -0700 ARC: breakout aux handling into a separate header ARC timers use aux registers for programming and this paves way for moving ARC timer drivers into drivers/clocksource Signed-off-by: Vineet Gupta <vgupta@synopsys.com> commit 21857529f5fe5081b74cb0da4785fa64ea332995 Author: Vineet Gupta <vgupta@synopsys.com> Date: Mon Oct 31 13:26:25 2016 -0700 ARC: time: move time_init() out of the driver to allow future git mv of the driver into drivers/clocksource Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Vineet Gupta <vgupta@synopsys.com> commit 8a5b823f98ad30bb334d25c8123e8d5766c4fbf1 Author: Vineet Gupta <vgupta@synopsys.com> Date: Mon Oct 31 14:26:41 2016 -0700 ARC: timer: gfrc, rtc: build under same option (64-bit timers) The original distinction was done as they were developed at different times and primarily because they are specific to UP (RTC) and SMP (GFRC). But given that driver handles that at runtime, (i.e. not allowing RTC as clocksource in SMP), we can simplify things a bit. Signed-off-by: Vineet Gupta <vgupta@synopsys.com> commit d0f76f838d74ffbe0f206d1fb3d6f0b01b372353 Author: Vineet Gupta <vgupta@synopsys.com> Date: Mon Oct 31 13:02:31 2016 -0700 ARC: timer: gfrc, rtc: Read BCR to detect whether hardware exists ... ... don't rely on cpuinfo populated in arc boot code. This paves way for moving this code in drivers/clocksource/ And while at it, convert the WARN() to pr_warn() as sugested by Daniel Signed-off-by: Vineet Gupta <vgupta@synopsys.com> commit 0081d7bcd0a7663d0240befdb634b65bba04c13f Author: Vineet Gupta <vgupta@synopsys.com> Date: Thu Nov 3 11:38:52 2016 -0700 ARC: timer: gfrc, rtc: deuglify big endian code A standard "C" shift will be handled appropriately by the compiler depending on the endian for the build. So we don't need the explicit distinction in code Signed-off-by: Vineet Gupta <vgupta@synopsys.com> commit 76fb051d42945d142fe265b6ec79e06aa9cfb250 Author: Russell King <rmk+kernel@armlinux.org.uk> Date: Mon Nov 21 16:07:05 2016 +0000 ARM: mm: allow set_memory_*() to be used on the vmalloc region We can allow modules to be loaded into the vmalloc region, where they should also benefit from the same protections as those loaded into the more efficient module region. Allow these functions to operate there as well. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> commit 580218f9678e76f712a1cf6cff5a903917fa9558 Author: Russell King <rmk+kernel@armlinux.org.uk> Date: Mon Nov 21 16:02:08 2016 +0000 ARM: mm: fix set_memory_*() bounds checks The set_memory_*() bounds checks are buggy on several fronts: 1. They fail to round the region size up if the passed address is not page aligned. 2. The region check was incomplete, and didn't correspond with what was being asked of apply_to_page_range() So, rework change_memory_common() to fix these problems, adding an "in_region()" helper to determine whether the start & size fit within the provided region start and stop addresses. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> commit d3df9bc5fb5d838b049f32a476721eadbc349553 Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Date: Tue Nov 29 05:49:06 2016 -0800 rcu: Once again use NMI-based stack traces in stall warnings This commit is for all intents and purposes a revert of bc1dce514e9b ("rcu: Don't use NMIs to dump other CPUs' stacks"). The reason to suppose that this can now safely be reverted is the presence of 42a0bb3f7138 ("printk/nmi: generic solution for safe printk in NMI"), which is said to have made NMI-based stack dumps safe. However, this reversion keeps one nice property of bc1dce514e9b ("rcu: Don't use NMIs to dump other CPUs' stacks"), namely that only those CPUs blocking the grace period are dumped. The new trigger_single_cpu_backtrace() is used to make this happen, as suggested by Josh Poimboeuf. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Not-yet-signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> commit f8045446ca778333e960dcb9e30a5858ff2b8c20 Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Date: Mon Nov 28 12:08:49 2016 -0800 srcu: Force full grace-period ordering If a process invokes synchronize_srcu(), is delayed just the right amount of time, and thus does not sleep when waiting for the grace period to complete, there is no ordering between the end of the grace period and the code following the synchronize_srcu(). Similarly, there can be a lack of ordering between the end of the SRCU grace period and callback invocation. This commit adds the necessary ordering. Reported-by: Lance Roy <ldr709@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> commit 6a8b2ca702b279bea0e8f0363056439352e2081c Author: Yuriy Kolerov <yuriy.kolerov@synopsys.com> Date: Mon Nov 28 07:07:17 2016 +0300 ARC: mm: PAE40: Fix crash at munmap commit 1c3c90930392 broke PAE40. Macro pfn_pte(pfn, prot) creates paddr from pfn, but the page shift was getting truncated to 32 bits since we lost the proper cast to 64 bits (for PAE400 Instead of reverting that commit, use a better helper which is 32/64 bits safe just like ARM implementation. Fixes: 1c3c90930392 ("ARC: mm: fix build breakage with STRICT_MM_TYPECHECKS") Cc: <stable@vger.kernel.org> #4.4+ Signed-off-by: Yuriy Kolerov <yuriy.kolerov@synopsys.com> [vgupta: massaged changelog] Signed-off-by: Vineet Gupta <vgupta@synopsys.com> commit 2349b533167315199b00b15db891ddc45b2c909d Author: subhashj@codeaurora.org <subhashj@codeaurora.org> Date: Wed Nov 23 16:33:19 2016 -0800 scsi: ufs: fix default power mode to FAST/SLOW We would by default like to run in FAST/SLOW mode instead of FASTAUTO/SLOWAUTO mode for performance reasons. This change sets the default speed mode to FAST/SLOW mode. Reviewed-by: Venkat Gopalakrishnan <venkatg@codeaurora.org> Signed-off-by: Subhash Jadavani <subhashj@codeaurora.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> commit 0b257734344aa89b565bd148ff94f29aa873ffa6 Author: subhashj@codeaurora.org <subhashj@codeaurora.org> Date: Wed Nov 23 16:33:08 2016 -0800 scsi: ufs: optimize system suspend handling Consider following sequence of events: 1. UFS is runtime suspended, link_state = Hibern8, device_state = sleep 2. System goes into system suspend, ufshcd_system_suspend() brings both link and device to active state and then puts the device in Power_Down state and link in OFF state. 3. System resumes at some later point in time, ufshcd_system_resume() doesn't do anything as UFS state is runtime suspended. Note that link is still on OFF state and device is in Power_Down state. 4. Now system again goes into suspend without any UFS accesses before it. ufshcd_system_suspend() again brings both link and device to active state and then puts the device in Power_Down state and link if OFF state. But it's unnecessary to bring the link & device in active state as both link and device are already in desired low power states. This change fixes this issue by adding proper state checks in ufshcd_system_suspend(). Reviewed-by: Gilad Broner <gbroner@codeaurora.org> Signed-off-by: Subhash Jadavani <subhashj@codeaurora.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> commit f37e9f8cf8bc681250880f8c1372fb882c9379b8 Author: Yaniv Gardi <ygardi@codeaurora.org> Date: Wed Nov 23 16:32:49 2016 -0800 scsi: ufs: fix condition in which DME command failure msg is printed out The condition in which error message is printed out was incorrect and resulted error message only if retries exhausted. But retries happens only if DME command is a peer command, and thus DME commands which are not peer commands and fail are not printed out. This change fixes this issue. Signed-off-by: Yaniv Gardi <ygardi@codeaurora.org> Signed-off-by: Subhash Jadavani <subhashj@codeaurora.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> commit fb7b45f0462f144e1924a357995552f24f0c9d0c Author: Dolev Raviv <draviv@codeaurora.org> Date: Wed Nov 23 16:32:32 2016 -0800 scsi: ufs: handle errors from PHY_ADAPTER_ERROR register The PHY_ADAPTER_ERROR status register indicates PHY lane errors reported by the M-PHY layer. In some occasions the controller can recover from such errors. When the error is not recoverable, a stuck DB error will occur. Since the stuck DB error is spotted separately, no action other than clearing the register is necessary. Signed-off-by: Dolev Raviv <draviv@codeaurora.org> Signed-off-by: Subhash Jadavani <subhashj@codeaurora.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> commit 7caf489b99a42a9017ef3d733912aea8794677e7 Author: subhashj@codeaurora.org <subhashj@codeaurora.org> Date: Wed Nov 23 16:32:20 2016 -0800 scsi: ufs: issue link starup 2 times if device isn't active If we issue the link startup to the device while its UniPro state is LinkDown (and device state is sleep/power-down) then link startup will not move the device state to Active. Device will only move to active state if the link starup is issued when its UniPro state is LinkUp. So in this case, we would have to issue the link startup 2 times to make sure that device moves to active state. Reviewed-by: Gilad Broner <gbroner@codeaurora.org> Signed-off-by: Subhash Jadavani <subhashj@codeaurora.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> commit c6a6db439868c7ba5cc90d4c461d9697ec731fa1 Author: subhashj@codeaurora.org <subhashj@codeaurora.org> Date: Wed Nov 23 16:32:08 2016 -0800 scsi: ufs: ensure that host pa_tactivate is higher than device Some UFS devices require host PA_TACTIVATE to be higher than device PA_TACTIVATE otherwise it may get stuck during hibern8 sequence. This change allows this by using quirk. Reviewed-by: Venkat Gopalakrishnan <venkatg@codeaurora.org> Signed-off-by: Subhash Jadavani <subhashj@codeaurora.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> commit 10fe5888a40e6afbf1f3166c212c685624cae26b Author: subhashj@codeaurora.org <subhashj@codeaurora.org> Date: Wed Nov 23 16:31:52 2016 -0800 scsi: ufs: increase the scsi query response timeout It is found thats UFS device may take longer than 30ms to respond to query requests and in this case we might run into following scenario: 1. UFS host SW sends a query request to UFS device to read an attribute value. SW uses tag #31 for this purpose. 2. UFS host SW waits for 30ms to get the query response (and doorbell to be cleared by UFS host HW). 3. UFS device doesn't respond back within 30ms hence UFS host SW times out waiting for the query response. 4. UFS host SW clears the tag#31 from UTRLCLR register. 5. UFS host SW waits until UFS host HW to clear tag#31 from the doorbell register. 6. UFS host SW retries the same query request on same tag#31 (sends a query request to device to read an attribute value). 7. UFS host HW gets the query response from the device but this was intended as a query response for the 1st query request sent (step-1). 8. Now UFS device sends another query response to host (for query request sent @step-6). Now there are 2 issues that could happen with above scenario: 1. UFS device should have actually responded back with only one query response but it is found that device may respond back with 2 query responses. 2. If UFS device responds back with 2 resposes on same tag, host HW/SW behaviour isn't predictable. To avoid running into above scenario, we would basically allow device to take longer (upto 1.5 seconds) for query response. Reviewed-by: Gilad Broner <gbroner@codeaurora.org> Signed-off-by: Subhash Jadavani <subhashj@codeaurora.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> commit bde44bb665d049468b6a1a2fa7d666434de4f83f Author: subhashj@codeaurora.org <subhashj@codeaurora.org> Date: Wed Nov 23 16:31:41 2016 -0800 scsi: ufs: fix failure to read the string descriptor While reading variable size descriptors (like string descriptor), some UFS devices may report the "LENGTH" (field in "Transaction Specific fields" of Query Response UPIU) same as what was requested in Query Request UPIU instead of reporting the actual size of the variable size descriptor. Although it's safe to ignore the "LENGTH" field for variable size descriptors as we can always derive the length of the descriptor from the descriptor header fields. Hence this change impose the length match check only for fixed size descriptors (for which we always request the correct size as part of Query Request UPIU). Reviewed-by: Venkat Gopalakrishnan <venkatg@codeaurora.org> Signed-off-by: Subhash Jadavani <subhashj@codeaurora.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> commit 24d6243204633be4b710754f279b3ca57c69ceec Author: Yaniv Gardi <ygardi@codeaurora.org> Date: Wed Nov 23 16:31:30 2016 -0800 scsi: ufs: update device descriptor maximum size According to JESD220B - UFS v2.0, the maximum size of device descriptor has changed from 0x1F to 0x40. This patch updates the maximum size of this descriptor. Signed-off-by: Yaniv Gardi <ygardi@codeaurora.org> Signed-off-by: Subhash Jadavani <subhashj@codeaurora.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> commit 4b761b580150b081e6829885afd3d540b746ccf0 Author: Yaniv Gardi <ygardi@codeaurora.org> Date: Wed Nov 23 16:31:18 2016 -0800 scsi: ufs: add index details to query error messages When sending query to the device, the index of the failure is additional useful information that should be printed out as it might specify the logical unit (LU) where the error occurred. Signed-off-by: Yaniv Gardi <ygardi@codeaurora.org> Signed-off-by: Subhash Jadavani <subhashj@codeaurora.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> commit 61e073590b82a539654626ecae91b8fab11db3f3 Author: Dolev Raviv <draviv@codeaurora.org> Date: Wed Nov 23 16:30:49 2016 -0800 scsi: ufs: add queries retry mechanism Some of the queries might fail during init. To avoid system failure, we add retry mechanism to issue queries several times. Signed-off-by: Dolev Raviv <draviv@codeaurora.org> Signed-off-by: Subhash Jadavani <subhashj@codeaurora.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> commit 95d3af6bd18f381b5b1c62f117ce7f152a5db3ea Author: Yazen Ghannam <Yazen.Ghannam@amd.com> Date: Thu Nov 17 17:57:43 2016 -0500 EDAC, amd64: Autoload amd64_edac_mod on Fam17h systems Add Fam17h to the list of families to autoload amd64_edac_mod. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/1479423463-8536-18-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Borislav Petkov <bp@suse.de> commit 713ad54675fdfd7358dbcae21ab4788a014c6e23 Author: Yazen Ghannam <Yazen.Ghannam@amd.com> Date: Mon Nov 28 12:59:53 2016 -0600 EDAC, amd64: Define and register UMC error decode function How we need to decode UMC errors is different from how we decode bus errors, so let's define a new function for this. We also need a way to determine the UMC channel since we're not guaranteed that there is a fixed relation between channel and MCA bank. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/1480359593-80369-1-git-send-email-Yazen.Ghannam@amd.com [ Fold in decode_synd_reg(), simplify. ] Signed-off-by: Borislav Petkov <bp@suse.de> commit d27f3a348e3677b7d5ee6954ebafce679b011164 Author: Yazen Ghannam <Yaz…

Sending audit netlink multicast messages is bad for all the same reasons that sending audit netlink unicast messages is bad, so this patch reworks things so that we don't do the multicast send in audit_log_end(), we do it from the dedicated kauditd_thread thread just as we do for unicast messages. See the GitHub issues below for more information/history: * #23 * #22 Signed-off-by: Paul Moore <paul@paul-moore.com>

Bring back commit bc51ddd ("netns: avoid disabling irq for netns id") now that we've fixed some audit multicast issues that caused problems with original attempt. Additional information, and history, can be found in the links below: * #22 * #23 Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Paul Moore <paul@paul-moore.com>

pcmoore · 2017-03-23T20:04:14Z

Resolved in Linux v4.10.

As Eric Dumazet pointed out this also needs to be fixed in IPv6. v2: Contains the IPv6 tcp/Ipv6 dccp patches as well. We have seen a few incidents lately where a dst_enty has been freed with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that dst_entry. If the conditions/timings are right a crash then ensues when the freed dst_entry is referenced later on. A Common crashing back trace is: #8 [] page_fault at ffffffff8163e648 [exception RIP: __tcp_ack_snd_check+74] . . #9 [] tcp_rcv_established at ffffffff81580b64 #10 [] tcp_v4_do_rcv at ffffffff8158b54a #11 [] tcp_v4_rcv at ffffffff8158cd02 #12 [] ip_local_deliver_finish at ffffffff815668f4 #13 [] ip_local_deliver at ffffffff81566bd9 #14 [] ip_rcv_finish at ffffffff8156656d #15 [] ip_rcv at ffffffff81566f06 #16 [] __netif_receive_skb_core at ffffffff8152b3a2 #17 [] __netif_receive_skb at ffffffff8152b608 #18 [] netif_receive_skb at ffffffff8152b690 #19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3] #20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3] #21 [] net_rx_action at ffffffff8152bac2 #22 [] __do_softirq at ffffffff81084b4f #23 [] call_softirq at ffffffff8164845c #24 [] do_softirq at ffffffff81016fc5 #25 [] irq_exit at ffffffff81084ee5 #26 [] do_IRQ at ffffffff81648ff8 Of course it may happen with other NIC drivers as well. It's found the freed dst_entry here: 224 static bool tcp_in_quickack_mode(struct sock *sk)↩ 225 {↩ 226 ▹ const struct inet_connection_sock *icsk = inet_csk(sk);↩ 227 ▹ const struct dst_entry *dst = __sk_dst_get(sk);↩ 228 ↩ 229 ▹ return (dst && dst_metric(dst, RTAX_QUICKACK)) ||↩ 230 ▹ ▹ (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);↩ 231 }↩ But there are other backtraces attributed to the same freed dst_entry in netfilter code as well. All the vmcores showed 2 significant clues: - Remote hosts behind the default gateway had always been redirected to a different gateway. A rtable/dst_entry will be added for that host. Making more dst_entrys with lower reference counts. Making this more probable. - All vmcores showed a postitive LockDroppedIcmps value, e.g: LockDroppedIcmps 267 A closer look at the tcp_v4_err() handler revealed that do_redirect() will run regardless of whether user space has the socket locked. This can result in a race condition where the same dst_entry cached in sk->sk_dst_entry can be decremented twice for the same socket via: do_redirect()->__sk_dst_check()-> dst_release(). Which leads to the dst_entry being prematurely freed with another socket pointing to it via sk->sk_dst_cache and a subsequent crash. To fix this skip do_redirect() if usespace has the socket locked. Instead let the redirect take place later when user space does not have the socket locked. The dccp/IPv6 code is very similar in this respect, so fixing it there too. As Eric Garver pointed out the following commit now invalidates routes. Which can set the dst->obsolete flag so that ipv4_dst_check() returns null and triggers the dst_release(). Fixes: ceb3320 ("ipv4: Kill routes during PMTU/redirect updates.") Cc: Eric Garver <egarver@redhat.com> Cc: Hannes Sowa <hsowa@redhat.com> Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>

syzkaller reported a double free [1], caused by the fact that tun driver was not updated properly when priv_destructor was added. When/if register_netdevice() fails, priv_destructor() must have been called already. [1] BUG: KASAN: double-free or invalid-free in selinux_tun_dev_free_security+0x15/0x20 security/selinux/hooks.c:5023 CPU: 0 PID: 2919 Comm: syzkaller227220 Not tainted 4.13.0-rc4+ #23 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:52 print_address_description+0x7f/0x260 mm/kasan/report.c:252 kasan_report_double_free+0x55/0x80 mm/kasan/report.c:333 kasan_slab_free+0xa0/0xc0 mm/kasan/kasan.c:514 __cache_free mm/slab.c:3503 [inline] kfree+0xd3/0x260 mm/slab.c:3820 selinux_tun_dev_free_security+0x15/0x20 security/selinux/hooks.c:5023 security_tun_dev_free_security+0x48/0x80 security/security.c:1512 tun_set_iff drivers/net/tun.c:1884 [inline] __tun_chr_ioctl+0x2ce6/0x3d50 drivers/net/tun.c:2064 tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2309 vfs_ioctl fs/ioctl.c:45 [inline] do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685 SYSC_ioctl fs/ioctl.c:700 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691 entry_SYSCALL_64_fastpath+0x1f/0xbe RIP: 0033:0x443ff9 RSP: 002b:00007ffc34271f68 EFLAGS: 00000217 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00000000004002e0 RCX: 0000000000443ff9 RDX: 0000000020533000 RSI: 00000000400454ca RDI: 0000000000000003 RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000401ce0 R13: 0000000000401d70 R14: 0000000000000000 R15: 0000000000000000 Allocated by task 2919: save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59 save_stack+0x43/0xd0 mm/kasan/kasan.c:447 set_track mm/kasan/kasan.c:459 [inline] kasan_kmalloc+0xaa/0xd0 mm/kasan/kasan.c:551 kmem_cache_alloc_trace+0x101/0x6f0 mm/slab.c:3627 kmalloc include/linux/slab.h:493 [inline] kzalloc include/linux/slab.h:666 [inline] selinux_tun_dev_alloc_security+0x49/0x170 security/selinux/hooks.c:5012 security_tun_dev_alloc_security+0x6d/0xa0 security/security.c:1506 tun_set_iff drivers/net/tun.c:1839 [inline] __tun_chr_ioctl+0x1730/0x3d50 drivers/net/tun.c:2064 tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2309 vfs_ioctl fs/ioctl.c:45 [inline] do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685 SYSC_ioctl fs/ioctl.c:700 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691 entry_SYSCALL_64_fastpath+0x1f/0xbe Freed by task 2919: save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59 save_stack+0x43/0xd0 mm/kasan/kasan.c:447 set_track mm/kasan/kasan.c:459 [inline] kasan_slab_free+0x6e/0xc0 mm/kasan/kasan.c:524 __cache_free mm/slab.c:3503 [inline] kfree+0xd3/0x260 mm/slab.c:3820 selinux_tun_dev_free_security+0x15/0x20 security/selinux/hooks.c:5023 security_tun_dev_free_security+0x48/0x80 security/security.c:1512 tun_free_netdev+0x13b/0x1b0 drivers/net/tun.c:1563 register_netdevice+0x8d0/0xee0 net/core/dev.c:7605 tun_set_iff drivers/net/tun.c:1859 [inline] __tun_chr_ioctl+0x1caf/0x3d50 drivers/net/tun.c:2064 tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2309 vfs_ioctl fs/ioctl.c:45 [inline] do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685 SYSC_ioctl fs/ioctl.c:700 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691 entry_SYSCALL_64_fastpath+0x1f/0xbe The buggy address belongs to the object at ffff8801d2843b40 which belongs to the cache kmalloc-32 of size 32 The buggy address is located 0 bytes inside of 32-byte region [ffff8801d2843b40, ffff8801d2843b60) The buggy address belongs to the page: page:ffffea000660cea8 count:1 mapcount:0 mapping:ffff8801d2843000 index:0xffff8801d2843fc1 flags: 0x200000000000100(slab) raw: 0200000000000100 ffff8801d2843000 ffff8801d2843fc1 000000010000003f raw: ffffea0006626a40 ffffea00066141a0 ffff8801dbc00100 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8801d2843a00: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc ffff8801d2843a80: 00 00 00 fc fc fc fc fc fb fb fb fb fc fc fc fc >ffff8801d2843b00: 00 00 00 00 fc fc fc fc fb fb fb fb fc fc fc fc ^ ffff8801d2843b80: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc ffff8801d2843c00: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc ================================================================== Fixes: cf124db ("net: Fix inconsistent teardown and release of private netdev state.") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>

Sending audit netlink multicast messages is bad for all the same reasons that sending audit netlink unicast messages is bad, so this patch reworks things so that we don't do the multicast send in audit_log_end(), we do it from the dedicated kauditd_thread thread just as we do for unicast messages. See the GitHub issues below for more information/history: * linux-audit/audit-kernel#23 * linux-audit/audit-kernel#22 Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: kdrag0n <dragon@khronodragon.com>

Sending audit netlink multicast messages is bad for all the same reasons that sending audit netlink unicast messages is bad, so this patch reworks things so that we don't do the multicast send in audit_log_end(), we do it from the dedicated kauditd_thread thread just as we do for unicast messages. See the GitHub issues below for more information/history: * linux-audit/audit-kernel#23 * linux-audit/audit-kernel#22 Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: kdrag0n <dragon@khronodragon.com> Signed-off-by: Eliminater74 <eliminater74@gmail.com>

Before commit 4bfc0bb ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself") cgroup bpf structures were released with corresponding cgroup structures. It guaranteed the hierarchical order of destruction: children were always first. It preserved attached programs from being released before their propagated copies. But with cgroup auto-detachment there are no such guarantees anymore: cgroup bpf is released as soon as the cgroup is offline and there are no live associated sockets. It means that an attached program can be detached and released, while its propagated copy is still living in the cgroup subtree. This will obviously lead to an use-after-free bug. To reproduce the issue the following script can be used: #!/bin/bash CGROOT=/sys/fs/cgroup mkdir -p ${CGROOT}/A ${CGROOT}/B ${CGROOT}/A/C sleep 1 ./test_cgrp2_attach ${CGROOT}/A egress & A_PID=$! ./test_cgrp2_attach ${CGROOT}/B egress & B_PID=$! echo $$ > ${CGROOT}/A/C/cgroup.procs iperf -s & S_PID=$! iperf -c localhost -t 100 & C_PID=$! sleep 1 echo $$ > ${CGROOT}/B/cgroup.procs echo ${S_PID} > ${CGROOT}/B/cgroup.procs echo ${C_PID} > ${CGROOT}/B/cgroup.procs sleep 1 rmdir ${CGROOT}/A/C rmdir ${CGROOT}/A sleep 1 kill -9 ${S_PID} ${C_PID} ${A_PID} ${B_PID} On the unpatched kernel the following stacktrace can be obtained: [ 33.619799] BUG: unable to handle page fault for address: ffffbdb4801ab002 [ 33.620677] #PF: supervisor read access in kernel mode [ 33.621293] #PF: error_code(0x0000) - not-present page [ 33.622754] Oops: 0000 [#1] SMP NOPTI [ 33.623202] CPU: 0 PID: 601 Comm: iperf Not tainted 5.5.0-rc2+ #23 [ 33.625545] RIP: 0010:__cgroup_bpf_run_filter_skb+0x29f/0x3d0 [ 33.635809] Call Trace: [ 33.636118] ? __cgroup_bpf_run_filter_skb+0x2bf/0x3d0 [ 33.636728] ? __switch_to_asm+0x40/0x70 [ 33.637196] ip_finish_output+0x68/0xa0 [ 33.637654] ip_output+0x76/0xf0 [ 33.638046] ? __ip_finish_output+0x1c0/0x1c0 [ 33.638576] __ip_queue_xmit+0x157/0x410 [ 33.639049] __tcp_transmit_skb+0x535/0xaf0 [ 33.639557] tcp_write_xmit+0x378/0x1190 [ 33.640049] ? _copy_from_iter_full+0x8d/0x260 [ 33.640592] tcp_sendmsg_locked+0x2a2/0xdc0 [ 33.641098] ? sock_has_perm+0x10/0xa0 [ 33.641574] tcp_sendmsg+0x28/0x40 [ 33.641985] sock_sendmsg+0x57/0x60 [ 33.642411] sock_write_iter+0x97/0x100 [ 33.642876] new_sync_write+0x1b6/0x1d0 [ 33.643339] vfs_write+0xb6/0x1a0 [ 33.643752] ksys_write+0xa7/0xe0 [ 33.644156] do_syscall_64+0x5b/0x1b0 [ 33.644605] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fix this by grabbing a reference to the bpf structure of each ancestor on the initialization of the cgroup bpf structure, and dropping the reference at the end of releasing the cgroup bpf structure. This will restore the hierarchical order of cgroup bpf releasing, without adding any operations on hot paths. Thanks to Josef Bacik for the debugging and the initial analysis of the problem. Fixes: 4bfc0bb ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself") Reported-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Sending audit netlink multicast messages is bad for all the same reasons that sending audit netlink unicast messages is bad, so this patch reworks things so that we don't do the multicast send in audit_log_end(), we do it from the dedicated kauditd_thread thread just as we do for unicast messages. See the GitHub issues below for more information/history: * linux-audit/audit-kernel#23 * linux-audit/audit-kernel#22 Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Lau <laststandrighthere@gmail.com>

Sending audit netlink multicast messages is bad for all the same reasons that sending audit netlink unicast messages is bad, so this patch reworks things so that we don't do the multicast send in audit_log_end(), we do it from the dedicated kauditd_thread thread just as we do for unicast messages. See the GitHub issues below for more information/history: * linux-audit/audit-kernel#23 * linux-audit/audit-kernel#22 Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: krazey <admin@krazey.de> Conflicts: kernel/audit.c

After modifying the QP to the Error state, all RX WR would be completed with WC in IB_WC_WR_FLUSH_ERR status. Current implementation does not wait for it is done, but destroy the QP and free the link group directly. So there is a risk that accessing the freed memory in tasklet context. Here is a crash example: BUG: unable to handle page fault for address: ffffffff8f220860 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD f7300e067 P4D f7300e067 PUD f7300f063 PMD 8c4e45063 PTE 800ffff08c9df060 Oops: 0002 [#1] SMP PTI CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Tainted: G S OE 5.10.0-0607+ #23 Hardware name: Inspur NF5280M4/YZMB-00689-101, BIOS 4.1.20 07/09/2018 RIP: 0010:native_queued_spin_lock_slowpath+0x176/0x1b0 Code: f3 90 48 8b 32 48 85 f6 74 f6 eb d5 c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6 48 05 00 c8 02 00 48 03 04 f5 00 09 98 8e <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 32 RSP: 0018:ffffb3b6c001ebd8 EFLAGS: 00010086 RAX: ffffffff8f220860 RBX: 0000000000000246 RCX: 0000000000080000 RDX: ffff91db1f86c800 RSI: 000000000000173c RDI: ffff91db62bace00 RBP: ffff91db62bacc00 R08: 0000000000000000 R09: c00000010000028b R10: 0000000000055198 R11: ffffb3b6c001ea58 R12: ffff91db80e05010 R13: 000000000000000a R14: 0000000000000006 R15: 0000000000000040 FS: 0000000000000000(0000) GS:ffff91db1f840000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffff8f220860 CR3: 00000001f9580004 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> _raw_spin_lock_irqsave+0x30/0x40 mlx5_ib_poll_cq+0x4c/0xc50 [mlx5_ib] smc_wr_rx_tasklet_fn+0x56/0xa0 [smc] tasklet_action_common.isra.21+0x66/0x100 __do_softirq+0xd5/0x29c asm_call_irq_on_stack+0x12/0x20 </IRQ> do_softirq_own_stack+0x37/0x40 irq_exit_rcu+0x9d/0xa0 sysvec_call_function_single+0x34/0x80 asm_sysvec_call_function_single+0x12/0x20 Fixes: bd4ad57 ("smc: initialize IB transport incl. PD, MR, QP, CQ, event, WR") Signed-off-by: Yacan Liu <liuyacan@corp.netease.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>

Fix port I/O string accessors such as `insb', `outsb', etc. which use the physical PCI port I/O address rather than the corresponding memory mapping to get at the requested location, which in turn breaks at least accesses made by our parport driver to a PCIe parallel port such as: PCI parallel port detected: 1415:c118, I/O at 0x1000(0x1008), IRQ 20 parport0: PC-style at 0x1000 (0x1008), irq 20, using FIFO [PCSPP,TRISTATE,COMPAT,EPP,ECP] causing a memory access fault: Unable to handle kernel access to user memory without uaccess routines at virtual address 0000000000001008 Oops [#1] Modules linked in: CPU: 1 PID: 350 Comm: cat Not tainted 6.0.0-rc2-00283-g10d4879f9ef0-dirty #23 Hardware name: SiFive HiFive Unmatched A00 (DT) epc : parport_pc_fifo_write_block_pio+0x266/0x416 ra : parport_pc_fifo_write_block_pio+0xb4/0x416 epc : ffffffff80542c3e ra : ffffffff80542a8c sp : ffffffd88899fc60 gp : ffffffff80fa2700 tp : ffffffd882b1e900 t0 : ffffffd883d0b000 t1 : ffffffffff000002 t2 : 4646393043330a38 s0 : ffffffd88899fcf0 s1 : 0000000000001000 a0 : 0000000000000010 a1 : 0000000000000000 a2 : ffffffd883d0a010 a3 : 0000000000000023 a4 : 00000000ffff8fbb a5 : ffffffd883d0a001 a6 : 0000000100000000 a7 : ffffffc800000000 s2 : ffffffffff000002 s3 : ffffffff80d28880 s4 : ffffffff80fa1f50 s5 : 0000000000001008 s6 : 0000000000000008 s7 : ffffffd883d0a000 s8 : 0004000000000000 s9 : ffffffff80dc1d80 s10: ffffffd8807e4000 s11: 0000000000000000 t3 : 00000000000000ff t4 : 393044410a303930 t5 : 0000000000001000 t6 : 0000000000040000 status: 0000000200000120 badaddr: 0000000000001008 cause: 000000000000000f [<ffffffff80543212>] parport_pc_compat_write_block_pio+0xfe/0x200 [<ffffffff8053bbc0>] parport_write+0x46/0xf8 [<ffffffff8050530e>] lp_write+0x158/0x2d2 [<ffffffff80185716>] vfs_write+0x8e/0x2c2 [<ffffffff80185a74>] ksys_write+0x52/0xc2 [<ffffffff80185af2>] sys_write+0xe/0x16 [<ffffffff80003770>] ret_from_syscall+0x0/0x2 ---[ end trace 0000000000000000 ]--- For simplicity address the problem by adding PCI_IOBASE to the physical address requested in the respective wrapper macros only, observing that the raw accessors such as `__insb', `__outsb', etc. are not supposed to be used other than by said macros. Remove the cast to `long' that is no longer needed on `addr' now that it is used as an offset from PCI_IOBASE and add parentheses around `addr' needed for predictable evaluation in macro expansion. No need to make said adjustments in separate changes given that current code is gravely broken and does not ever work. Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Fixes: fab957c ("RISC-V: Atomic and Locking Code") Cc: stable@vger.kernel.org # v4.15+ Reviewed-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/alpine.DEB.2.21.2209220223080.29493@angie.orcam.me.uk Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>

Syzbot reported a null-ptr-deref of sqd->thread inside io_sqpoll_wq_cpu_affinity. It turns out the sqd->thread can go away from under us during io_uring_register, in case the process gets a fatal signal during io_uring_register. It is not particularly hard to hit the race, and while I am not sure this is the exact case hit by syzbot, it solves it. Finally, checking ->thread is enough to close the race because we locked sqd while "parking" the thread, thus preventing it from going away. I reproduced it fairly consistently with a program that does: int main(void) { ... io_uring_queue_init(RING_LEN, &ring1, IORING_SETUP_SQPOLL); while (1) { io_uring_register_iowq_aff(ring, 1, &mask); } } Executed in a loop with timeout to trigger SIGTERM: while true; do timeout 1 /a.out ; done This will hit the following BUG() in very few attempts. BUG: kernel NULL pointer dereference, address: 00000000000007a8 PGD 800000010e949067 P4D 800000010e949067 PUD 10e46e067 PMD 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 15715 Comm: dead-sqpoll Not tainted 6.5.0-rc7-next-20230825-g193296236fa0-dirty #23 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:io_sqpoll_wq_cpu_affinity+0x27/0x70 Code: 90 90 90 0f 1f 44 00 00 55 53 48 8b 9f 98 03 00 00 48 85 db 74 4f 48 89 df 48 89 f5 e8 e2 f8 ff ff 48 8b 43 38 48 85 c0 74 22 <48> 8b b8 a8 07 00 00 48 89 ee e8 ba b1 00 00 48 89 df 89 c5 e8 70 RSP: 0018:ffffb04040ea7e70 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff93c010749e40 RCX: 0000000000000001 RDX: 0000000000000000 RSI: ffffffffa7653331 RDI: 00000000ffffffff RBP: ffffb04040ea7eb8 R08: 0000000000000000 R09: c0000000ffffdfff R10: ffff93c01141b600 R11: ffffb04040ea7d18 R12: ffff93c00ea74840 R13: 0000000000000011 R14: 0000000000000000 R15: ffff93c00ea74800 FS: 00007fb7c276ab80(0000) GS:ffff93c36f200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000007a8 CR3: 0000000111634003 CR4: 0000000000370ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __die_body+0x1a/0x60 ? page_fault_oops+0x154/0x440 ? do_user_addr_fault+0x174/0x7b0 ? exc_page_fault+0x63/0x140 ? asm_exc_page_fault+0x22/0x30 ? io_sqpoll_wq_cpu_affinity+0x27/0x70 __io_register_iowq_aff+0x2b/0x60 __io_uring_register+0x614/0xa70 __x64_sys_io_uring_register+0xaa/0x1a0 do_syscall_64+0x3a/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0033:0x7fb7c226fec9 Code: 2e 00 b8 ca 00 00 00 0f 05 eb a5 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 7f 2d 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe2c0674f8 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb7c226fec9 RDX: 00007ffe2c067530 RSI: 0000000000000011 RDI: 0000000000000003 RBP: 00007ffe2c0675d0 R08: 00007ffe2c067550 R09: 00007ffe2c067550 R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffe2c067750 R14: 0000000000000000 R15: 0000000000000000 </TASK> Modules linked in: CR2: 00000000000007a8 ---[ end trace 0000000000000000 ]--- Reported-by: syzbot+c74fea926a78b8a91042@syzkaller.appspotmail.com Fixes: ebdfefc ("io_uring/sqpoll: fix io-wq affinity when IORING_SETUP_SQPOLL is used") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/87v8cybuo6.fsf@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>

pcmoore added enhancement priority/medium labels Oct 21, 2016

pcmoore self-assigned this Oct 21, 2016

rgbriggs mentioned this issue Dec 21, 2016

Q: audit queue backlog residual issues #31

Closed

pcmoore closed this as completed Mar 23, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFE: perform the audit multicast send from the kauditd thread #23

RFE: perform the audit multicast send from the kauditd thread #23

pcmoore commented Oct 21, 2016

pcmoore commented Oct 21, 2016

pcmoore commented Oct 22, 2016

pcmoore commented Nov 24, 2016

pcmoore commented Nov 29, 2016

pcmoore commented Mar 23, 2017

RFE: perform the audit multicast send from the kauditd thread #23

RFE: perform the audit multicast send from the kauditd thread #23

Comments

pcmoore commented Oct 21, 2016

pcmoore commented Oct 21, 2016

pcmoore commented Oct 22, 2016

pcmoore commented Nov 24, 2016

pcmoore commented Nov 29, 2016

pcmoore commented Mar 23, 2017