kernel changes post-4.7 cause `PTRACE_SYSCALL` notifications to happen before `PTRACE_EVENT_SECCOMP` #1762

rocallahan · 2016-08-02T00:42:36Z

See torvalds/linux@93e35ef. @pipcet reported this in #1552.

This breaks rr, but not fatally. In fact, it makes rr recording a bit more efficient. Previously, recording a non-buffered syscall incurred three ptrace-stops: PTRACE_EVENT_SECCOMP, a PTRACE_SYSCALL stop at syscall entry, then a final PTRACE_SYSCALL stop at syscall exit. The second stop is basically redundant with the first, but there was no way to stop at syscall exit without incurring it. With the new event ordering, if we always continue to the next system call with PTRACE_CONT, we'll skip right over the syscall-entry PTRACE_SYSCALL stop and incur just two stops: PTRACE_EVENT_SECCOMP and the PTRACE_SYSCALL stop for syscall exit.
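To make the stop counts concrete, here is a toy model in C of the stops a tracer sees for one traced, non-buffered syscall. It assumes the tracer runs to the syscall with PTRACE_CONT and resumes from the seccomp event with PTRACE_SYSCALL. All type and function names are made up for this sketch and are not rr's.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch; not taken from rr. */
enum stop_kind { STOP_SECCOMP_EVENT, STOP_SYSCALL_ENTRY, STOP_SYSCALL_EXIT };

enum kernel_order {
    SECCOMP_BEFORE_SYSCALL_STOP, /* kernels without 93e35ef (4.7 and earlier) */
    SYSCALL_STOP_BEFORE_SECCOMP  /* kernels with 93e35ef applied */
};

/*
 * Fill `out` with the stops incurred for one traced syscall when the
 * tracer needs to stop at syscall exit. Returns the number of stops.
 */
size_t stops_for_one_syscall(enum kernel_order order, enum stop_kind out[3])
{
    size_t n = 0;

    assert(out != NULL);
    out[n++] = STOP_SECCOMP_EVENT;
    if (order == SECCOMP_BEFORE_SYSCALL_STOP) {
        /* Old order: the syscall-entry stop is still ahead of us and
         * cannot be skipped if we want the exit stop. */
        out[n++] = STOP_SYSCALL_ENTRY;
    }
    out[n++] = STOP_SYSCALL_EXIT;
    return n;
}
```

With the old order the function reports three stops, with the new order two, matching the description above.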

We have to update a fair bit of code to detect and handle this reordering. Syscall-entry code that used to run at the first PTRACE_SYSCALL stop has to be triggered by PTRACE_EVENT_SECCOMP instead. Of course, for many years we'll have to support both behaviors.
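One way to picture supporting both behaviors is a small per-task state machine that decides which stop triggers the syscall-entry code. The sketch below is only an illustration: the names are hypothetical, the kernel ordering is assumed to be already known and passed in by the caller, and the tracer is assumed to reach syscalls with PTRACE_CONT and resume from the seccomp event with PTRACE_SYSCALL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch; not taken from rr. */
enum stop_kind { STOP_SECCOMP_EVENT, STOP_SYSCALL };
enum kernel_order { SECCOMP_FIRST, SYSCALL_STOP_FIRST };
enum phase { OUTSIDE_SYSCALL, SAW_SECCOMP_EVENT, IN_SYSCALL };
enum action { ACT_NONE, ACT_SYSCALL_ENTRY, ACT_SYSCALL_EXIT };

struct task_state { enum phase phase; };

/* Decide what the tracer should do at this stop and advance the phase. */
enum action on_stop(struct task_state *t, enum kernel_order order,
                    enum stop_kind stop)
{
    assert(t != NULL);
    switch (t->phase) {
    case OUTSIDE_SYSCALL:
        if (stop != STOP_SECCOMP_EVENT)
            return ACT_NONE; /* not expected under the stated assumptions */
        if (order == SYSCALL_STOP_FIRST) {
            /* New order: no entry stop follows, so run entry code now. */
            t->phase = IN_SYSCALL;
            return ACT_SYSCALL_ENTRY;
        }
        /* Old order: wait for the PTRACE_SYSCALL entry stop. */
        t->phase = SAW_SECCOMP_EVENT;
        return ACT_NONE;
    case SAW_SECCOMP_EVENT:
        if (stop == STOP_SYSCALL) {
            t->phase = IN_SYSCALL;
            return ACT_SYSCALL_ENTRY;
        }
        return ACT_NONE;
    case IN_SYSCALL:
        if (stop == STOP_SYSCALL) {
            t->phase = OUTSIDE_SYSCALL;
            return ACT_SYSCALL_EXIT;
        }
        return ACT_NONE;
    }
    return ACT_NONE;
}
```

A caller would feed each decoded stop to `on_stop` and run its entry or exit handling according to the returned action. How the ordering itself is detected is left out of this sketch.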

rocallahan · 2016-08-02T00:52:42Z

I've got this mostly working, with 92% of tests passing. I'm stuck on the clock_nanosleep test, which is very simple and the problem is probably affecting the other tests: the main thread does an exit_group while another thread is still running, and sometimes that other thread exits without us getting a PTRACE_EVENT_EXIT for it :-(. The problem is intermittent and seems to not be reproducible if I run rr under ftrace, or strace, or gdb, though it usually does happen if I run rr normally.

This could have been a preexisting bug (in older kernels, or rr without my changes), but I don't recall seeing it ever happen before. My changes might somehow cause this, but I can't see how (the logs look very clear about what happens):

[RecordSession] trace time 228: Active task is 32702. Events:
[RecordSession]   32702: handle_ptrace_event PTRACE_EVENT_SECCOMP: event (none)
[RecordSession]   traced syscall entered: nanosleep
[RecordTask]   is syscall interruption of recorded (none)? (now nanosleep)
[RecordSession] EXEC_SYSCALL_ENTRY: status=0x7057f (PTRACE_EVENT_SECCOMP)
[RecordSession] after cont: status=0x7057f (PTRACE_EVENT_SECCOMP)
[RecordSession] EXEC_START: status=0x7057f (PTRACE_EVENT_SECCOMP)
[RecordTask]   is syscall interruption of recorded SYSCALL: nanosleep? (now nanosleep)
[Task] resuming execution of 32702 with PTRACE_SYSCALL
[Scheduler] Scheduling next task
[Scheduler]   32702 is blocked on SYSCALL: nanosleep; checking status ...
[Task] waitpid(32702, NOHANG) returns 0, status 0 (EXIT-0)
[Scheduler]   still blocked
[Scheduler]   need to reschedule
... no mentions of 32702 ...
[Scheduler] Scheduling next task
[Scheduler]   (32701 is un-switchable at SYSCALL: exit_group)
[Scheduler]   and running; waiting for state change
[Task] going into blocking waitpid(32701) ...
[Task]   waitpid(32701) returns 32701; status 0x6057f (PTRACE_EVENT_EXIT)
[Task]   (refreshing register cache)
[Scheduler]   new status is 0x6057f (PTRACE_EVENT_EXIT)
[RecordSession] trace time 241: Active task is 32701. Events:
[WARN handle_ptrace_exit_event() errno: SUCCESS] unstable exit; may misrecord CLONE_CHILD_CLEARTID memory race
[Task] task 32701 (rec:32701) is dying ...
[WARN ~Task() errno: SUCCESS] 32701 is unstable; not blocking on its termination
[Task]   dead
[Scheduler] Scheduling next task
[Scheduler]   need to reschedule
[Scheduler]   32702 is unstable
[Scheduler]   all tasks blocked or some unstable, waiting for runnable (1 total)
[Scheduler]   32702 changed status to 0 (EXIT-0)
[Task]   (refreshing register cache)
[FATAL /home/roc/rr/rr/src/Task.cc:1276:did_waitpid() errno: ESRCH] 
 (task 32702 (rec:32702) at time 242)
 -> Assertion `false' failed to hold. 
Launch gdb with 
  gdb '-l' '-1' '-ex' 'target extended-remote :32702' /home/roc/rr/obj/bin/clock_nanosleep

So maybe it's a regression in the kernel ... maybe related to the seccomp changes, maybe not.
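For reading the status words in the log: a ptrace event stop is reported by waitpid as a SIGTRAP stop with the event number in bits 16 and up. The following C sketch decodes the two values that appear above. The helper names are hypothetical, and the event numbers are repeated from linux/ptrace.h so the sketch does not depend on which header exports them.

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

/* Values of PTRACE_EVENT_EXIT and PTRACE_EVENT_SECCOMP in linux/ptrace.h. */
enum { EVENT_EXIT = 6, EVENT_SECCOMP = 7 };

/*
 * Return the PTRACE_EVENT_* number encoded in a waitpid status, or 0 if
 * the status is not a ptrace event stop.
 */
int ptrace_event_from_status(int status)
{
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
        return 0;
    return (status >> 16) & 0xff;
}

/* Check the decoder against the status words printed in the log. */
void check_log_statuses(void)
{
    assert(ptrace_event_from_status(0x7057f) == EVENT_SECCOMP);
    assert(ptrace_event_from_status(0x6057f) == EVENT_EXIT);
    assert(ptrace_event_from_status(0) == 0);
}
```

The low byte 0x7f marks a stopped task, the next byte 0x05 is SIGTRAP, and the third byte carries the event.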

rocallahan · 2016-08-02T03:21:38Z

Building Linux master with that one commit reverted seems to make the bug go away, so it's definitely that kernel commit or my rr changes.

rocallahan · 2016-08-02T04:21:20Z

This also happens in the nanosleep test.

What seems to happen is that the non-main thread enters its final, long nanosleep, rr gets the notification, and then the non-main thread is not scheduled again. The main thread proceeds to exit_group; at this point the non-main thread's kernel stack looks like

[<ffffffff810b5837>] ptrace_stop+0x167/0x2a0
[<ffffffff810b5a08>] ptrace_do_notify+0x98/0xc0
[<ffffffff810b6e6b>] ptrace_notify+0x5b/0x80
[<ffffffff8116235a>] __seccomp_filter+0x20a/0x270
[<ffffffff81162a35>] __secure_computing+0x35/0xb0
[<ffffffff810033ae>] syscall_trace_enter+0xce/0x2f0
[<ffffffff81003d37>] do_syscall_64+0x147/0x160
[<ffffffff817f4821>] return_from_SYSCALL_64+0x0/0x6a
[<ffffffffffffffff>] 0xffffffffffffffff

rocallahan · 2016-08-02T06:10:07Z

With a fair amount of pain I've managed to construct a standalone testcase: https://gist.github.com/rocallahan/b09b1de28a32918cb27d4ad68421678d
I'm fairly confident it's a kernel bug now.

rocallahan · 2016-08-02T06:21:34Z

And it reproduces in a kernel without the seccomp reordering changes. So it seems to be a longstanding kernel bug where an exit_group while a thread is at that point in ptrace_stop causes the thread to exit without reporting PTRACE_EVENT_EXIT.

rocallahan · 2016-08-03T22:34:32Z

I've figured this out. Here's what happens...

The problem occurs in this code in __seccomp_filter:

                /* Allow the BPF to provide the event message */
                ptrace_event(PTRACE_EVENT_SECCOMP, data);
                /*
                 * The delivery of a fatal signal during event
                 * notification may silently skip tracer notification.
                 * Terminating the task now avoids executing a system
                 * call that may not be intended.
                 */
                if (fatal_signal_pending(current)) {
                        do_exit(SIGSYS);
                }

When another thread in the thread-group does exit_group while a tracee thread is in the above ptrace-stop (or just after the stop has resumed but before we reach the fatal_signal_pending check), zap_other_threads marks the tracee thread as having a pending SIGKILL, so the tracee thread takes this do_exit path. do_exit calls ptrace_event(PTRACE_EVENT_EXIT, code) which puts the tracee thread into state TASK_TRACED and then (indirectly) calls __schedule. __schedule has this code:

                 if (unlikely(signal_pending_state(prev->state, prev))) {
                        prev->state = TASK_RUNNING;

In this case, SIGKILL is still pending so we change the state to TASK_RUNNING. The ptracer is notified and wakes up in wait4, which examines the tracee thread in wait_task_stopped. wait_task_stopped (via task_stopped_code) decides to skip the tracee because it is not in the TASK_TRACED state, so the ptracer doesn't see the PTRACE_EVENT_EXIT.

When a thread does an exit_group while the tracee thread is anywhere else, the synthetic SIGKILL is detected via get_signal which dequeues the signal before calling do_exit, which means the scheduler code does not force the transition back to TASK_RUNNING and everything works.
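The difference between the two paths can be reproduced in a small userspace model. This is only a sketch of the logic described above, not kernel code: the flag values are invented for the model, and the helper names are hypothetical. Only the shape of the signal_pending_state() check follows the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values for this model only; the kernel's differ. */
enum {
    ST_RUNNING       = 0,
    ST_INTERRUPTIBLE = 1 << 0,
    ST_TRACED_BIT    = 1 << 1,
    ST_WAKEKILL      = 1 << 2,
    ST_TRACED        = ST_WAKEKILL | ST_TRACED_BIT
};

struct model_task {
    int  state;
    bool signal_pending; /* some signal is queued */
    bool fatal_pending;  /* SIGKILL is queued, e.g. by zap_other_threads */
};

/* Same shape as the kernel's signal_pending_state() check. */
static bool model_signal_pending_state(int state, const struct model_task *t)
{
    if (!(state & (ST_INTERRUPTIBLE | ST_WAKEKILL)))
        return false;
    if (!t->signal_pending)
        return false;
    return (state & ST_INTERRUPTIBLE) || t->fatal_pending;
}

/* The __schedule() fragment quoted above: a pending signal cancels the sleep. */
static void model_schedule(struct model_task *t)
{
    if (model_signal_pending_state(t->state, t))
        t->state = ST_RUNNING;
}

/*
 * Model do_exit() raising PTRACE_EVENT_EXIT: enter the traced state, then
 * schedule. Returns whether the tracer's wait would report the stop.
 * `dequeue_first` models the get_signal() path, which dequeues SIGKILL
 * before calling do_exit().
 */
bool exit_event_reported(struct model_task *t, bool dequeue_first)
{
    assert(t != NULL);
    if (dequeue_first) {
        t->signal_pending = false;
        t->fatal_pending = false;
    }
    t->state = ST_TRACED;
    model_schedule(t);
    /* wait_task_stopped() only reports tasks still in the traced state. */
    return (t->state & ST_TRACED_BIT) != 0;
}
```

Calling `exit_event_reported` with SIGKILL still pending models the __seccomp_filter path and returns false. Dequeuing first models the get_signal path and returns true.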

rocallahan · 2016-08-03T22:39:46Z

The seccomp reordering change made this a problem for rr because before that change, rr would always advance a tracee from the PTRACE_EVENT_SECCOMP stop to the PTRACE_SYSCALL stop before running any other tracee, so there was no way for another thread to do exit_group while the tracee was in the problematic section.

rocallahan · 2016-08-03T22:41:39Z

I'm not sure how to fix this, though that's mainly my lack of kernel experience. I think probably the seccomp code should dequeue the fatal signal before entering do_exit, to behave more like get_signal.

rocallahan · 2016-08-03T23:51:19Z

Email sent to LKML.

rocallahan · 2016-08-04T05:53:21Z

Here is the email thread: http://marc.info/?l=linux-kernel&m=147026862328685&w=2

This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. This was needlessly paranoid. Instead, the syscall can just be skipped and normal signal handling, tracer notification, and process death can happen.

Slightly edited original bug report:

If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7

The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: rr-debugger/rr#1762 (comment).

Reported-by: Robert O'Callahan <robert@ocallahan.org>
Reported-by: Kyle Huey <khuey@kylehuey.com>
Fixes: 93e35ef ("x86/ptrace: run seccomp after ptrace")
Signed-off-by: Kees Cook <keescook@chromium.org>

This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. The existing code was trying to be as defensively paranoid as possible, but it now ends up confusing ptrace. Instead, the syscall can just be skipped (which solves the original concern that the do_exit() was addressing) and normal signal handling, tracer notification, and process death can happen.

Paraphrasing from the original bug report:

If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7

The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: rr-debugger/rr#1762 (comment).

Reported-by: Robert O'Callahan <robert@ocallahan.org>
Reported-by: Kyle Huey <khuey@kylehuey.com>
Tested-by: Kyle Huey <khuey@kylehuey.com>
Fixes: 93e35ef ("x86/ptrace: run seccomp after ptrace")
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: James Morris <james.l.morris@oracle.com>

GIT 071e31e254e0e0c438eecba3dba1d6e2d0da36c2 commit 9f834ec18defc369d73ccf9e87a2790bfa05bf46 Author: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon Aug 22 16:41:46 2016 -0700 binfmt_elf: switch to new creds when switching to new mm We used to delay switching to the new credentials until after we had mapped the executable (and possible elf interpreter). That was kind of odd to begin with, since the new executable will actually then _run_ with the new creds, but whatever. The bigger problem was that we also want to make sure that we turn off prof events and tracing before we start mapping the new executable state. So while this is a cleanup, it's also a fix for a possible information leak. Reported-by: Robert Święcki <robert@swiecki.net> Tested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David Howells <dhowells@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> commit 485a252a5559b45d7df04c819ec91177c62c270b Author: Kees Cook <keescook@chromium.org> Date: Wed Aug 10 16:28:09 2016 -0700 seccomp: Fix tracer exit notifications during fatal signals This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. The existing code was trying to be as defensively paranoid as possible, but it now ends up confusing ptrace. Instead, the syscall can just be skipped (which solves the original concern that the do_exit() was addressing) and normal signal handling, tracer notification, and process death can happen. 
Paraphrasing from the original bug report: If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7 The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: https://github.com/mozilla/rr/issues/1762#issuecomment-237396255. Reported-by: Robert O'Callahan <robert@ocallahan.org> Reported-by: Kyle Huey <khuey@kylehuey.com> Tested-by: Kyle Huey <khuey@kylehuey.com> Fixes: 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace") Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: James Morris <james.l.morris@oracle.com> commit 0d025d271e55f3de21f0aaaf54b42d20404d2b23 Author: Josh Poimboeuf <jpoimboe@redhat.com> Date: Tue Aug 30 08:04:16 2016 -0500 mm/usercopy: get rid of CONFIG_DEBUG_STRICT_USER_COPY_CHECKS There are three usercopy warnings which are currently being silenced for gcc 4.6 and newer: 1) "copy_from_user() buffer size is too small" compile warning/error This is a static warning which happens when object size and copy size are both const, and copy size > object size. I didn't see any false positives for this one. So the function warning attribute seems to be working fine here. Note this scenario is always a bug and so I think it should be changed to *always* be an error, regardless of CONFIG_DEBUG_STRICT_USER_COPY_CHECKS. 
2) "copy_from_user() buffer size is not provably correct" compile warning This is another static warning which happens when I enable __compiletime_object_size() for new compilers (and CONFIG_DEBUG_STRICT_USER_COPY_CHECKS). It happens when object size is const, but copy size is *not*. In this case there's no way to compare the two at build time, so it gives the warning. (Note the warning is a byproduct of the fact that gcc has no way of knowing whether the overflow function will be called, so the call isn't dead code and the warning attribute is activated.) So this warning seems to only indicate "this is an unusual pattern, maybe you should check it out" rather than "this is a bug". I get 102(!) of these warnings with allyesconfig and the __compiletime_object_size() gcc check removed. I don't know if there are any real bugs hiding in there, but from looking at a small sample, I didn't see any. According to Kees, it does sometimes find real bugs. But the false positive rate seems high. 3) "Buffer overflow detected" runtime warning This is a runtime warning where object size is const, and copy size > object size. All three warnings (both static and runtime) were completely disabled for gcc 4.6 with the following commit: 2fb0815c9ee6 ("gcc4: disable __compiletime_object_size for GCC 4.6+") That commit mistakenly assumed that the false positives were caused by a gcc bug in __compiletime_object_size(). But in fact, __compiletime_object_size() seems to be working fine. The false positives were instead triggered by #2 above. (Though I don't have an explanation for why the warnings supposedly only started showing up in gcc 4.6.) So remove warning #2 to get rid of all the false positives, and re-enable warnings #1 and #3 by reverting the above commit. Furthermore, since #1 is a real bug which is detected at compile time, upgrade it to always be an error. Having done all that, CONFIG_DEBUG_STRICT_USER_COPY_CHECKS is no longer needed. 
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Nilay Vaish <nilayvaish@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> commit 9ebae9e4bcd7dff22536af8a969d8f66e6f23900 Author: Alan Cox <alan@linux.intel.com> Date: Tue Aug 30 16:47:02 2016 +0100 pata_ninja32: Avoid corrupting status flags Ninja32 needs to set some flags to indicate it does 32bit IO. However it currently assigns this which loses the initializing flag and causes a warning spew. Fix it to use a logical or as is intended. Signed-off-by: Alan Cox <alan@linux.intel.com> Tested-by: Ellmar Stelnberger <estellnb@elstel.org> Signed-off-by: Tejun Heo <tj@kernel.org> commit 98b0f80c2396224bbbed81792b526e6c72ba9efa Author: Trond Myklebust <trond.myklebust@primarydata.com> Date: Mon Aug 29 11:15:36 2016 -0400 NFSv4.x: Fix a refcount leak in nfs_callback_up_net On error, the callers expect us to return without bumping nn->cb_users[]. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v3.7+ commit 52442f9b11b7e5d4a38d99143011831fd171f8d9 Author: Benjamin Coddington <bcodding@redhat.com> Date: Tue Aug 30 09:20:32 2016 -0400 NFS4: Avoid migration loops If a server returns itself as a location while migrating, the client may end up getting stuck attempting to migrate twice to the same server. Catch this by checking if the nfs_client found is the same as the existing client. For the other two callers to nfs4_set_client, the nfs_client will always be ERR_PTR(-EINVAL). 
Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> commit 3dc147359e3dcdf0648f1e2c11f62cfae3160df0 Author: Trond Myklebust <trond.myklebust@primarydata.com> Date: Mon Aug 29 15:12:54 2016 -0400 pNFS/flexfiles: Fix an Oopsable condition when connection to the DS fails If the attempt to connect to a DS fails inside ff_layout_pg_init_read or ff_layout_pg_init_write, then we currently end up clearing the layout segment carried by the struct nfs_pageio_descriptor, causing an Oops when we later call into ff_layout_read_pagelist/ff_layout_write_pagelist. The fix is to ensure we return the layout and then retry. Fixes: 446ca2195303 ("pNFS/flexfiles: When initing reads or writes, we...") Cc: stable@vger.kernel.org # v4.7+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> commit 3c3292634fc2de1ab97b6aa3222fee647f737adb Author: Jean Delvare <jdelvare@suse.de> Date: Mon Aug 29 13:18:23 2016 +0200 hwmon: (it87) Add missing sysfs attribute group terminator Attribute array it87_attributes_in lacks its NULL terminator, causing random behavior when operating on the attribute group. Fixes: 52929715634a ("hwmon: (it87) Use is_visible for voltage sensors") Signed-off-by: Jean Delvare <jdelvare@suse.de> Cc: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: stable@vger.kernel.org Signed-off-by: Guenter Roeck <linux@roeck-us.net> commit da43bf0c21e57fff0221da5de0a9a388ec0d27cd Author: Paul Gortmaker <paul.gortmaker@windriver.com> Date: Mon Aug 15 18:24:59 2016 -0400 intel_pmic_gpio: Make explicitly non-modular The Kconfig entry controlling compilation of this code is: drivers/platform/x86/Kconfig:config GPIO_INTEL_PMIC drivers/platform/x86/Kconfig: bool "Intel PMIC GPIO support" ...meaning that it currently is not being built as a module by anyone. 
Lets remove the couple traces of modular infrastructure use, so that when reading the driver there is no doubt it is builtin-only. We delete the MODULE_LICENSE tag etc. since all that information was (or is now) contained at the top of the file in the comments. We don't replace module.h with init.h since the file already has that. Cc: Alek Du <alek.du@intel.com> Cc: platform-driver-x86@vger.kernel.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Darren Hart <dvhart@linux.intel.com> commit f48d1496b8537d75776478c6942dd87f34d7f270 Author: Paul Gortmaker <paul.gortmaker@windriver.com> Date: Mon Aug 15 18:25:17 2016 -0400 platform/olpc: Make ec explicitly non-modular The Kconfig entry controlling compilation of this code is: arch/x86/Kconfig:config OLPC arch/x86/Kconfig: bool "One Laptop Per Child support" ...meaning that it currently is not being built as a module by anyone. Lets remove the couple traces of modular infrastructure use, so that when reading the driver there is no doubt it is builtin-only. We delete the MODULE_LICENSE tag etc. since all that information was (or is now) contained at the top of the file in the comments. Cc: platform-driver-x86@vger.kernel.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Acked-by: Andres Salomon <dilinger@queued.net> Signed-off-by: Darren Hart <dvhart@linux.intel.com> commit b99b43bb4bdf1d361f7487cf03d803082bbf9101 Author: Owen Lin <olin@rivetnetworks.com> Date: Fri Aug 26 13:49:09 2016 +0800 Add Killer E2500 device ID in alx driver. Signed-off-by: David S. Miller <davem@davemloft.net> commit 2fb04fdf30192ff1e2b5834e9b7745889ea8bbcb Author: Russell King <rmk+kernel@armlinux.org.uk> Date: Sat Aug 27 17:33:03 2016 +0100 net: smc91x: fix SMC accesses Commit b70661c70830 ("net: smc91x: use run-time configuration on all ARM machines") broke some ARM platforms through several mistakes. 
Firstly, the access size must correspond to the following rule: (a) at least one of 16-bit or 8-bit access size must be supported (b) 32-bit accesses are optional, and may be enabled in addition to the above. Secondly, it provides no emulation of 16-bit accesses, instead blindly making 16-bit accesses even when the platform specifies that only 8-bit is supported. Reorganise smc91x.h so we can make use of the existing 16-bit access emulation already provided - if 16-bit accesses are supported, use 16-bit accesses directly, otherwise if 8-bit accesses are supported, use the provided 16-bit access emulation. If neither, BUG(). This exactly reflects the driver behaviour prior to the commit being fixed. Since the conversion incorrectly cut down the available access sizes on several platforms, we also need to go through every platform and fix up the overly-restrictive access size: Arnd assumed that if a platform can perform 32-bit, 16-bit and 8-bit accesses, then only a 32-bit access size needed to be specified - not so, all available access sizes must be specified. This likely fixes some performance regressions in doing this: if a platform does not support 8-bit accesses, 8-bit accesses have been emulated by performing a 16-bit read-modify-write access. Tested on the Intel Assabet/Neponset platform, which supports only 8-bit accesses, which was broken by the original commit. Fixes: b70661c70830 ("net: smc91x: use run-time configuration on all ARM machines") Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Tested-by: Robert Jarzmik <robert.jarzmik@free.fr> Signed-off-by: David S. Miller <davem@davemloft.net> commit 7d13eca09ed5e477f6ecfd97a35058762228b5e4 Author: Florian Fainelli <f.fainelli@gmail.com> Date: Sat Aug 27 15:34:20 2016 -0700 Documentation: networking: dsa: Remove platform device TODO Since commit 83c0afaec7b7 ("net: dsa: Add new binding implementation"), the shortcomings of the dsa platform device have been addressed, remove that TODO item. 
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net> commit e5835f2833b12808c53aa621d1d3aa085706b5b3 Author: Maor Gottlieb <maorg@mellanox.com> Date: Mon Aug 29 01:13:50 2016 +0300 net/mlx5: Increase number of ethtool steering priorities Ethtool has 11 flow tables, each flow table has its own priority. Increase the number of priorities to be aligned with the number of flow tables. Fixes: 1174fce8d141 ('net/mlx5e: Support l3/l4 flow type specs in ethtool flow steering') Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit 1722b9694ecfbc602865017c3fa6da0e3ec234d8 Author: Eran Ben Elisha <eranbe@mellanox.com> Date: Mon Aug 29 01:13:49 2016 +0300 net/mlx5: Add error prints when validate ETS failed Upon set ETS failure due to user invalid input, add error prints to specify the exact error to the user. Fixes: cdcf11212b22 ('net/mlx5e: Validate BW weight values of ETS') Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit bf50082c15eb2bc47d1922e70f424c57f36646d5 Author: Kamal Heib <kamalh@mellanox.com> Date: Mon Aug 29 01:13:48 2016 +0300 net/mlx5e: Fix memory leak if refreshing TIRs fails Free 'in' command object also when mlx5_core_modify_tir fails. Fixes: 724b2aa15126 ("net/mlx5e: TIRs management refactoring") Signed-off-by: Kamal Heib <kamalh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit c8cf78fe100b0d152a1932327c24cefc0ba4bdbe Author: Tariq Toukan <tariqt@mellanox.com> Date: Mon Aug 29 01:13:47 2016 +0300 net/mlx5e: Add ethtool counter for TX xmit_more Add a counter in ethtool for the number of times that TX xmit_more was used. 
Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit cc8e9ebf952699cb6870f1366a4920d05b036e31 Author: Eran Ben Elisha <eranbe@mellanox.com> Date: Mon Aug 29 01:13:46 2016 +0300 net/mlx5e: Fix ethtool -g/G rx ring parameter report with striding RQ The driver RQ has two possible configurations: striding RQ and non-striding RQ. Until this patch, the driver always reported the number of hardware WQEs (ring descriptors). For non striding RQ configuration, this was OK since we have one WQE per pending packet For striding RQ, multiple packets can fit into one WQE. For better user experience we normalize the rx_pending parameter (size of wqe/mtu) as the average ring size in case of striding RQ. Fixes: 461017cb006a ('net/mlx5e: Support RX multi-packet WQE ...') Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit 6e8dd6d6f4bd2fd6fefdbf2e73bf251e36db59af Author: Saeed Mahameed <saeedm@mellanox.com> Date: Mon Aug 29 01:13:45 2016 +0300 net/mlx5e: Don't wait for SQ completions on close Instead of asking the firmware to flush the SQ (Send Queue) via asynchronous completions when moved to error, we handle SQ flush manually (mlx5e_free_tx_descs) same as we did when SQ flush got timed out or on tx_timeout. This will reduce SQs flush time and speedup interface down procedure. Moved mlx5e_free_tx_descs to the end of en_tx.c for tx critical code locality. Fixes: 29429f3300a3 ('net/mlx5e: Timeout if SQ doesn't flush during close') Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. 
Miller <davem@davemloft.net> commit 8484f9ed13b26043be80ff5774506024956eae8f Author: Saeed Mahameed <saeedm@mellanox.com> Date: Mon Aug 29 01:13:44 2016 +0300 net/mlx5e: Don't post fragmented MPWQE when RQ is disabled ICO (Internal control operations) SQ (Send Queue) is closed/disabled after RQ (Receive Queue). After RQ is closed an ICO SQ completion might post a fragmented MPWQE (Multi Packet Work Queue Element) into that RQ. As on regular RQ post, check if we are allowed to post to that RQ (RQ is enabled). Cleanup in-progress UMR MPWQE on mlx5e_free_rx_descs if needed. Fixes: bc77b240b3c5 ('net/mlx5e: Add fragmented memory support for RX multi packet WQE') Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit f2fde18c52a7367a8f6cf6855e2a7174e601c8ee Author: Saeed Mahameed <saeedm@mellanox.com> Date: Mon Aug 29 01:13:43 2016 +0300 net/mlx5e: Don't wait for RQ completions on close This will significantly reduce receive queue flush time on interface down. Instead of asking the firmware to flush the RQ (Receive Queue) via asynchronous completions when moved to error, we handle RQ flush manually (mlx5e_free_rx_descs) same as we did when RQ flush got timed out. This will reduce RQs flush time and speedup interface down procedure (ifconfig down) from 6 sec to 0.3 sec on a 48 cores system. Moved mlx5e_free_rx_descs en_main.c where it is needed, to keep en_rx.c free form non critical data path code for better code locality. Fixes: 6cd392a082de ('net/mlx5e: Handle RQ flush in error cases') Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. 
Miller <davem@davemloft.net>

commit fe4c988bdd1cc60402a4e3ca3976a686ea991b5a
Author: Saeed Mahameed <saeedm@mellanox.com>
Date: Mon Aug 29 01:13:42 2016 +0300

net/mlx5e: Limit UMR length to the device's limitation

ConnectX-4 UMR (User Memory Region) MTT translation table offset in WQE is limited to U16_MAX, before this patch we ignored that limitation and requested the maximum possible UMR translation length that the netdev might need (MAX channels * MAX pages per channel). In case of a system with #cores > 32 and when linear WQE allocation fails, falling back to using UMR WQEs will cause the RQ (Receive Queue) to get stuck.

Here we limit UMR length to min(U16_MAX, max required pages) (while considering the required alignments) on driver load, by default U16_MAX is sufficient since the default RX rings value guarantees that we are in range, dynamically (on set_ringparam/set_channels) we will check if the new required UMR length (num mtts) is still in range, if not, fail the request.

Fixes: bc77b240b3c5 ('net/mlx5e: Add fragmented memory support for RX multi packet WQE')
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit 78a3e8889b4b6b99775ed954696ff3e017f5d19b
Author: Cyril Bur <cyrilbur@gmail.com>
Date: Tue Aug 23 10:46:17 2016 +1000

powerpc: signals: Discard transaction state from signal frames

Userspace can begin and suspend a transaction within the signal handler which means they might enter sys_rt_sigreturn() with the processor in suspended state. sys_rt_sigreturn() wants to restore process context (which may have been in a transaction before signal delivery). To do this it must restore TM SPRS. To achieve this, any transaction initiated within the signal frame must be discarded in order to be able to restore TM SPRs as TM SPRs can only be manipulated non-transactionally..

From the PowerPC ISA: TM Bad Thing Exception [Category: Transactional Memory] An attempt is made to execute a mtspr targeting a TM register in other than Non-transactional state.

Not doing so results in a TM Bad Thing:

[12045.221359] Kernel BUG at c000000000050a40 [verbose debug info unavailable]
[12045.221470] Unexpected TM Bad Thing exception at c000000000050a40 (msr 0x201033)
[12045.221540] Oops: Unrecoverable exception, sig: 6 [#1]
[12045.221586] SMP NR_CPUS=2048 NUMA PowerNV
[12045.221634] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables kvm_hv kvm uio_pdrv_genirq ipmi_powernv uio powernv_rng ipmi_msghandler autofs4 ses enclosure scsi_transport_sas bnx2x ipr mdio libcrc32c
[12045.222167] CPU: 68 PID: 6178 Comm: sigreturnpanic Not tainted 4.7.0 #34
[12045.222224] task: c0000000fce38600 ti: c0000000fceb4000 task.ti: c0000000fceb4000
[12045.222293] NIP: c000000000050a40 LR: c0000000000163bc CTR: 0000000000000000
[12045.222361] REGS: c0000000fceb7ac0 TRAP: 0700 Not tainted (4.7.0)
[12045.222418] MSR: 9000000300201033 <SF,HV,ME,IR,DR,RI,LE,TM[SE]> CR: 28444280 XER: 20000000
[12045.222625] CFAR: c0000000000163b8 SOFTE: 0 PACATMSCRATCH: 900000014280f033
GPR00: 01100000b8000001 c0000000fceb7d40 c00000000139c100 c0000000fce390d0
GPR04: 900000034280f033 0000000000000000 0000000000000000 0000000000000000
GPR08: 0000000000000000 b000000000001033 0000000000000001 0000000000000000
GPR12: 0000000000000000 c000000002926400 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24: 0000000000000000 00003ffff98cadd0 00003ffff98cb470 0000000000000000
GPR28: 900000034280f033 c0000000fceb7ea0 0000000000000001 c0000000fce390d0
[12045.223535] NIP [c000000000050a40] tm_restore_sprs+0xc/0x1c
[12045.223584] LR [c0000000000163bc] tm_recheckpoint+0x5c/0xa0
[12045.223630] Call Trace:
[12045.223655] [c0000000fceb7d80] [c000000000026e74] sys_rt_sigreturn+0x494/0x6c0
[12045.223738] [c0000000fceb7e30] [c0000000000092e0] system_call+0x38/0x108
[12045.223806] Instruction dump:
[12045.223841] 7c800164 4e800020 7c0022a6 f80304a8 7c0222a6 f80304b0 7c0122a6 f80304b8
[12045.223955] 4e800020 e80304a8 7c0023a6 e80304b0 <7c0223a6> e80304b8 7c0123a6 4e800020
[12045.224074] ---[ end trace cb8002ee240bae76 ]---

It isn't clear exactly if there is really a use case for userspace returning with a suspended transaction, however, doing so doesn't (on its own) constitute a bad frame. As such, this patch simply discards the transactional state of the context calling the sigreturn and continues.

Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Acked-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

commit a9cbf0b2195b695cbeeeecaa4e2770948c212e9a
Author: Mukesh Ojha <mukesh02@linux.vnet.ibm.com>
Date: Mon Aug 22 12:17:44 2016 +0530

powerpc/powernv : Drop reference added by kset_find_obj()

In a situation, where Linux kernel gets notified about duplicate error log from OPAL, it is been observed that kernel fails to remove sysfs entries (/sys/firmware/opal/elog/0xXXXXXXXX) of such error logs. This is because, we currently search the error log/dump kobject in the kset list via 'kset_find_obj()' routine. Which eventually increment the reference count by one, once it founds the kobject. So, unless we decrement the reference count by one after it found the kobject, we would not be able to release the kobject properly later.
This patch adds the 'kobject_put()' which was missing earlier. Signed-off-by: Mukesh Ojha <mukesh02@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> commit cc7786d3ee7e3c979799db834b528db2c0834c2e Author: Nicholas Piggin <npiggin@gmail.com> Date: Mon Jul 25 14:26:51 2016 +1000 powerpc/tm: do not use r13 for tabort_syscall tabort_syscall runs with RI=1, so a nested recoverable machine check will load the paca into r13 and overwrite what we loaded it with, because exceptions returning to privileged mode do not restore r13. Fixes: b4b56f9ecab4 (powerpc/tm: Abort syscalls in active transactions) Cc: stable@vger.kernel.org Signed-off-by: Nick Piggin <npiggin@gmail.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> commit d138027a8256a3e9d7657c8d0dae84c08ef2cfe1 Author: Trond Myklebust <trond.myklebust@primarydata.com> Date: Sun Aug 28 12:19:04 2016 -0400 NFSv4.1: Remove obsolete and incorrrect assignment in nfs4_callback_sequence Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> commit 2e80dbe7ac51a911e8a828407b1a48c5ba938cd2 Author: Trond Myklebust <trond.myklebust@primarydata.com> Date: Sun Aug 28 11:50:26 2016 -0400 NFSv4.1: Close callback races for OPEN, LAYOUTGET and LAYOUTRETURN Defer freeing the slot until after we have processed the results from OPEN and LAYOUTGET. This means that the server can rely on the mechanism in RFC5661 Section 2.10.6.3 to ensure that replies to an OPEN or LAYOUTGET/RETURN RPC call don't race with the callbacks that apply to them. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> commit 07e8dcbda71ef87e9cbdc42b5bb16a44c1ab839b Author: Trond Myklebust <trond.myklebust@primarydata.com> Date: Sun Aug 28 10:28:25 2016 -0400 NFSv4.1: Defer bumping the slot sequence number until we free the slot For operations like OPEN or LAYOUTGET, which return recallable state (i.e. 
delegations and layouts) we want to enable the mechanism for resolving recall races in RFC5661 Section 2.10.6.3. To do so, we will want to defer bumping the slot's sequence number until we have finished processing the RPC results. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> commit 045d2a6d076a2ecd7043ea543ea198af943f8b16 Author: Trond Myklebust <trond.myklebust@primarydata.com> Date: Sun Aug 28 13:25:43 2016 -0400 NFSv4.1: Delay callback processing when there are referring triples If CB_SEQUENCE tells us that the processing of this request depends on the completion of one or more referring triples (see RFC 5661 Section 2.10.6.3), delay the callback processing until after the RPC requests being referred to have completed. If we end up delaying for more than 1/2 second, then fall back to returning NFS4ERR_DELAY in reply to the callback. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> commit e09c978aae5bedfdb379be80363b024b7d82638b Author: Trond Myklebust <trond.myklebust@primarydata.com> Date: Sat Aug 27 23:44:04 2016 -0400 NFSv4.1: Fix Oopsable condition in server callback races The slot table hasn't been an array since v3.7. Ensure that we use nfs4_lookup_slot() to access the slot correctly. Fixes: 87dda67e7386 ("NFSv4.1: Allow SEQUENCE to resize the slot table...") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v3.8+ commit 9dbeea7f08f3784b152d9fb3b86beb34aad77c72 Author: Eric Dumazet <edumazet@google.com> Date: Fri Aug 26 08:51:39 2016 -0700 rhashtable: fix a memory leak in alloc_bucket_locks() If vmalloc() was successful, do not attempt a kmalloc_array() Fixes: 4cf0b354d92e ("rhashtable: avoid large lock-array allocations") Reported-by: CAI Qian <caiqian@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: CAI Qian <caiqian@redhat.com> Signed-off-by: David S. 
Miller <davem@davemloft.net> commit e70c70c38d7a5ced76fc8b1c4a7ccee76e9c2911 Author: Andrew Rybchenko <Andrew.Rybchenko@oktetlabs.ru> Date: Fri Aug 26 11:19:34 2016 +0100 sfc: fix potential stack corruption from running past stat bitmask On 32-bit systems, mask is only an array of 3 longs, not 4, so don't try to write to mask[3]. Also include build-time checks in case the size of the bitmask changes. Fixes: 3c36a2aded8c ("sfc: display vadaptor statistics for all interfaces") Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit c15e07b02bf0450bc8e60f2cc51cb42daa371417 Author: Jiri Pirko <jiri@mellanox.com> Date: Thu Aug 25 18:30:52 2016 +0200 team: loadbalance: push lacpdus to exact delivery When team is in bridge and LACP is utilized, LACPDU packets are pushed to userspace using raw socket and there they are processed. However, since 8626c56c8279b, LACPDU skbs are dropped by bridge rx_handler so they never reach packet handlers in rx path. Fix this by explicity treat LACPDUs to be pushed to exact delivery in team rx_handler. Reported-by: Ido Schimmel <idosch@mellanox.com> Fixes: 8626c56c8279b ("bridge: fix potential use-after-free when hook returns QUEUE or STOLEN verdict") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit c234af5875ffeab39d5a2c4230a477a35987a484 Author: Colin Ian King <colin.king@canonical.com> Date: Thu Aug 25 07:51:10 2016 +0100 net: hns: dereference ppe_cb->ppe_common_cb if it is non-null ppe_cb->ppe_common_cb is being dereferenced before a null check is being made on it. If ppe_cb->ppe_common_cb is null then we end up with a null pointer dereference when assigning dsaf_dev. Fix this by moving the initialisation of dsaf_dev once we know ppe_cb->ppe_common_cb is OK to dereference. Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Yisen Zhuang <yisen.zhuang@huawei.com> Signed-off-by: David S. 
Miller <davem@davemloft.net> commit b628d611a2a53858263fc419dba552f32431dba4 Author: Gao Feng <fgao@ikuai8.com> Date: Thu Aug 25 09:45:39 2016 +0800 8139cp: Fix one possible deadloop in cp_rx_poll When cp_rx_poll does not get enough packet, it will check the rx interrupt status again. If so, it will jumpt to rx_status_loop again. But the goto jump resets the rx variable as zero too. As a result, it causes one possible deadloop. Assume this case, rx_status_loop only gets the packet count which is less than budget, and (cpr16(IntrStatus) & cp_rx_intr_mask) condition is always true. It causes the deadloop happens and system is blocked. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit f38ff2ee7727994685494bcc4d7c274b35b5418a Author: Anjali Singhai Jain <anjali.singhai@intel.com> Date: Wed Aug 24 17:51:53 2016 -0700 i40e: Change some init flow for the client This change makes a common flow for Client instance open during init and reset path. The Client subtask can handle both the cases instead of making a separate notify_client_of_open call. Also it may fix a bug during reset where the service task was leaking some memory and causing issues. Change-Id: I7232a32fd52b82e863abb54266fa83122f80a0cd Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit c3e70edd7c2eed6acd234627a6007627f5c76e8e Author: Xander Huff <xander.huff@ni.com> Date: Wed Aug 24 16:47:53 2016 -0500 Revert "phy: IRQ cannot be shared" This reverts: commit 33c133cc7598 ("phy: IRQ cannot be shared") On hardware with multiple PHY devices hooked up to the same IRQ line, allow them to share it. Sergei Shtylyov says: "I'm not sure now what was the reason I concluded that the IRQ sharing was impossible... 
most probably I thought that the kernel IRQ handling code exited the loop over the IRQ actions once IRQ_HANDLED was returned -- which is obviously not so in reality..."

Signed-off-by: Xander Huff <xander.huff@ni.com>
Signed-off-by: Nathan Sullivan <nathan.sullivan@ni.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit 4f101c47791cdcb831b3ef1f831b1cc51e4fe03c
Author: Florian Fainelli <f.fainelli@gmail.com>
Date: Wed Aug 24 11:01:20 2016 -0700

net: dsa: bcm_sf2: Fix race condition while unmasking interrupts

We kept shadow copies of which interrupt sources we have enabled and disabled, but due to an order bug in how intrl2_mask_clear was defined, we could run into the following scenario:

CPU0: intrl2_1_mask_clear(..) sets INTRL2_CPU_MASK_CLEAR
CPU1: bcm_sf2_switch_1_isr read INTRL2_CPU_STATUS and masks with stale irq1_mask value
CPU0: updates irq1_mask value

Which would make us loop again and again trying to process and interrupt we are not clearing since our copy of whether it was enabled before still indicates it was not. Fix this by updating the shadow copy first, and then unasking at the HW level.

Fixes: 246d7f773c13 ("net: dsa: add Broadcom SF2 switch driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit 166ee5b87866de07a3e56c1b757f2b5cabba72a5
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Aug 24 09:39:02 2016 -0700

qdisc: fix a module refcount leak in qdisc_create_dflt()

Should qdisc_alloc() fail, we must release the module refcount we got right before.

Fixes: 6da7c8fcbcbd ("qdisc: allow setting default queuing discipline")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S.
Miller <davem@davemloft.net> commit a5de125dd46c851fc962806135953c1bd0a0f0df Author: Wei Yongjun <weiyongjun1@huawei.com> Date: Wed Aug 24 13:32:19 2016 +0000 tipc: fix the error handling in tipc_udp_enable() Fix to return a negative error code in enable_mcast() error handling case, and release udp socket when necessary. Fixes: d0f91938bede ("tipc: add ip/udp media type") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit 4f34228b67246ae3b3ab1dc33b980c77c0650ef4 Author: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Date: Mon Aug 15 16:02:20 2016 +0300 Bluetooth: Fix hci_sock_recvmsg when MSG_TRUNC is not set Similar to bt_sock_recvmsg MSG_TRUNC shall be checked using the original flags not msg_flags. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> commit 90a56f72edb088c678083c32d05936c7c8d9a948 Author: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Date: Fri Aug 12 15:11:28 2016 +0300 Bluetooth: Fix bt_sock_recvmsg when MSG_TRUNC is not set Commit b5f34f9420b50c9b5876b9a2b68e96be6d629054 attempt to introduce proper handling for MSG_TRUNC but recv and variants should still work as read if no flag is passed, but because the code may set MSG_TRUNC to msg->msg_flags that shall not be used as it may cause it to be behave as if MSG_TRUNC is always, so instead of using it this changes the code to use the flags parameter which shall contain the original flags. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> commit 16590a228109e2f318d2cc6466221134cfab723a Author: Chuck Lever <chuck.lever@oracle.com> Date: Mon Aug 22 14:57:42 2016 -0400 SUNRPC: Silence WARN_ON when NFSv4.1 over RDMA is in use Using NFSv4.1 on RDMA should be safe, so broaden the new checks in rpc_create(). WARN_ON_ONCE is used, matching most other WARN call sites in clnt.c. 
Fixes: 39a9beab5acb ("rpc: share one xps between all backchannels") Fixes: d50039ea5ee6 ("nfsd4/rpc: move backchannel create logic...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> commit 45c91d808ff989d950e260dab9f89e8f4a3c9c2c Author: Shaohua Li <shli@fb.com> Date: Mon Aug 22 21:14:02 2016 -0700 raid5: avoid unnecessary bio data set bio_reset doesn't change bi_io_vec and bi_max_vecs, so we don't need to set them every time. bi_private will be set before the bio is dispatched. Signed-off-by: Shaohua Li <shli@fb.com> commit 5f9d1fde7d54a5d5fd8cccbee9c9c31474fcdcf2 Author: Shaohua Li <shli@fb.com> Date: Mon Aug 22 21:14:01 2016 -0700 raid5: fix memory leak of bio integrity data Yi reported a memory leak of raid5 with DIF/DIX enabled disks. raid5 doesn't alloc/free bio, instead it reuses bios. There are two issues in current code: 1. the code calls bio_init (from init_stripe->raid5_build_block->bio_init) then bio_reset (ops_run_io). The bio is reused, so likely there is integrity data attached. bio_init will clear a pointer to integrity data and makes bio_reset can't release the data 2. bio_reset is called before dispatching bio. After bio is finished, it's possible we don't free bio's integrity data (eg, we don't call bio_reset again) Both issues will cause memory leak. The patch moves bio_init to stripe creation and bio_reset to bio end io. This will fix the two issues. Reported-by: Yi Zhang <yizhan@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com> commit 27028626b4b9022dcac23688e09ea43b36e1183c Author: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Date: Tue Aug 23 10:53:57 2016 +0200 raid10: record correct address of bad block For failed write request record block address on a device, not block address in an array. 
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com> commit 0f6187dbe542d71ace8ba0908954b0f4f8a30a1e Author: Wei Yongjun <weiyongjun1@huawei.com> Date: Sun Aug 21 14:42:25 2016 +0000 md-cluster: fix error return code in join() Fix to return error code -ENOMEM from the lockres_init() error handling case instead of 0, as done elsewhere in this function. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com> commit 486b0f7bcd64be027535811ef44195bc1027fbd3 Author: Song Liu <songliubraving@fb.com> Date: Fri Aug 19 15:34:01 2016 -0700 r5cache: set MD_JOURNAL_CLEAN correctly Currently, the code sets MD_JOURNAL_CLEAN when the array has MD_FEATURE_JOURNAL and the recovery_cp is MaxSector. The array will be MD_JOURNAL_CLEAN even if the journal device is missing. With this patch, the MD_JOURNAL_CLEAN is only set when the journal device presents. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> commit 51af96b53469f3b8cfcfe0504d0ff87239175b78 Author: Yotam Gigi <yotamg@mellanox.com> Date: Wed Aug 24 11:18:52 2016 +0200 mlxsw: router: Enable neighbors to be created on stacked devices Make the function mlxsw_router_neigh_construct search the rif according to the neighbour dev other than the dev that was passed to the ndo, thus allowing creating neigbhours upon stacked devices. Fixes: 6cf3c971dc84 ("mlxsw: spectrum_router: Add private neigh table") Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. 
Miller <davem@davemloft.net> commit f888f58795b640442165e60a6fa93e8e623d01a5 Author: Ido Schimmel <idosch@mellanox.com> Date: Wed Aug 24 11:18:51 2016 +0200 mlxsw: spectrum: Add missing flood to router port In case we have a layer 3 interface on top of a bridge (VLAN / FID RIF), then we should flood the following packet types to the router: * Broadcast: If DIP is the broadcast address of the interface, then we need to be able to get it to CPU by trapping it following route lookup. * Reserved IP multicast (224.0.0.X): Some control packets (e.g. OSPF) use this range and are trapped in the router block. Fixes: 99f44bb3527b ("mlxsw: spectrum: Enable L3 interfaces on top of bridge devices") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit dbb50887c8f619fc5c3489783ebc3122bc134a31 Author: Daniel Borkmann <daniel@iogearbox.net> Date: Wed Jul 27 11:40:14 2016 -0700 Bluetooth: split sk_filter in l2cap_sock_recv_cb During an audit for sk_filter(), we found that rx_busy_skb handling in l2cap_sock_recv_cb() and l2cap_sock_recvmsg() looks not quite as intended. The assumption from commit e328140fdacb ("Bluetooth: Use event-driven approach for handling ERTM receive buffer") is that errors returned from sock_queue_rcv_skb() are due to receive buffer shortage. However, nothing should prevent doing a setsockopt() with SO_ATTACH_FILTER on the socket, that could drop some of the incoming skbs when handled in sock_queue_rcv_skb(). In that case sock_queue_rcv_skb() will return with -EPERM, propagated from sk_filter() and if in L2CAP_MODE_ERTM mode, wrong assumption was that we failed due to receive buffer being full. From that point onwards, due to the to-be-dropped skb being held in rx_busy_skb, we cannot make any forward progress as rx_busy_skb is never cleared from l2cap_sock_recvmsg(), due to the filter drop verdict over and over coming from sk_filter(). 
Meanwhile, in l2cap_sock_recv_cb() all new incoming skbs are being dropped due to rx_busy_skb being occupied. Instead, just use __sock_queue_rcv_skb() where an error really tells that there's a receive buffer issue. Split the sk_filter() and enable it for non-segmented modes at queuing time since at this point in time the skb has already been through the ERTM state machine and it has been acked, so dropping is not allowed. Instead, for ERTM and streaming mode, call sk_filter() in l2cap_data_rcv() so the packet can be dropped before the state machine sees it. Fixes: e328140fdacb ("Bluetooth: Use event-driven approach for handling ERTM receive buffer") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> commit 9afee94939e3eda4c8bf239f7727cb56e158c976 Author: Frederic Dalleau <frederic.dalleau@collabora.co.uk> Date: Tue Aug 23 07:59:19 2016 +0200 Bluetooth: Fix memory leak at end of hci requests In hci_req_sync_complete the event skb is referenced in hdev->req_skb. It is used (via hci_req_run_skb) from either __hci_cmd_sync_ev which will pass the skb to the caller, or __hci_req_sync which leaks. 
unreferenced object 0xffff880005339a00 (size 256):
comm "kworker/u3:1", pid 1011, jiffies 4294671976 (age 107.389s)
backtrace:
[<ffffffff818d89d9>] kmemleak_alloc+0x49/0xa0
[<ffffffff8116bba8>] kmem_cache_alloc+0x128/0x180
[<ffffffff8167c1df>] skb_clone+0x4f/0xa0
[<ffffffff817aa351>] hci_event_packet+0xc1/0x3290
[<ffffffff8179a57b>] hci_rx_work+0x18b/0x360
[<ffffffff810692ea>] process_one_work+0x14a/0x440
[<ffffffff81069623>] worker_thread+0x43/0x4d0
[<ffffffff8106ead4>] kthread+0xc4/0xe0
[<ffffffff818dd38f>] ret_from_fork+0x1f/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: Frédéric Dalleau <frederic.dalleau@collabora.co.uk>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

commit 901d3d4fee83e9407d91e7178048e2fed6c91f6b
Author: Li Zhong <zhong@linux.vnet.ibm.com>
Date: Wed Aug 24 15:34:40 2016 +0800

crypto: vmx - fix null dereference in p8_aes_xts_crypt

walk.iv is not assigned a value in blkcipher_walk_init. It makes iv uninitialized. It is possibly a null value(as shown below), which is then used by aes_p8_encrypt. This patch moves iv = walk.iv after blkcipher_walk_virt, in which walk.iv is set.
[17856.268050] Unable to handle kernel paging request for data at address 0x00000000
[17856.268212] Faulting instruction address: 0xd000000002ff04bc
7:mon> t
[link register ] d000000002ff47b8 p8_aes_xts_crypt+0x168/0x2a0 [vmx_crypto] (938)
[c000000013b77960] d000000002ff4794 p8_aes_xts_crypt+0x144/0x2a0 [vmx_crypto] (unreliable)
[c000000013b77a70] c000000000544d64 skcipher_decrypt_blkcipher+0x64/0x80
[c000000013b77ac0] d000000003c0175c crypt_convert+0x53c/0x620 [dm_crypt]
[c000000013b77ba0] d000000003c043fc kcryptd_crypt+0x3cc/0x440 [dm_crypt]
[c000000013b77c50] c0000000000f3070 process_one_work+0x1e0/0x590
[c000000013b77ce0] c0000000000f34c8 worker_thread+0xa8/0x660
[c000000013b77d80] c0000000000fc0b0 kthread+0x110/0x130
[c000000013b77e30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

commit 10bb087ce381c812cd81a65ffd5e6f83e6399291
Author: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Date: Thu Aug 18 19:53:36 2016 +0100

crypto: qat - fix aes-xts key sizes

Increase value of supported key sizes for qat_aes_xts. aes-xts keys consists of keys of equal size concatenated.
Fixes: def14bfaf30d ("crypto: qat - add support for ctr(aes) and xts(aes)") Cc: stable@vger.kernel.org Reported-by: Wenqian Yu <wenqian.yu@intel.com> Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> commit f74bdd4cb5d0d4c3e89919e850e0bbb8789f32f9 Author: Fabian Frederick <fabf@skynet.be> Date: Tue Aug 16 21:49:45 2016 +0200 hwrng: mxc-rnga - Fix Kconfig dependency We can directly depend on SOC_IMX31 since commit c9ee94965dce ("ARM: imx: deconstruct mxc_rnga initialization") Since that commit, CONFIG_HW_RANDOM_MXC_RNGA could not be switched on with unknown symbol ARCH_HAS_RNGA and mxc-rnga.o can't be generated with ARCH=arm make M=drivers/char/hw_random Previously, HW_RANDOM_MXC_RNGA required ARCH_HAS_RNGA which was based on IMX_HAVE_PLATFORM_MXC_RNGA && ARCH_MXC. IMX_HAVE_PLATFORM_MXC_RNGA was based on SOC_IMX31. Fixes: c9ee94965dce ("ARM: imx: deconstruct mxc_rnga initialization") Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> commit d7226c7a4dd19929d6df4ae04698da2fcf6f875a Author: David Ahern <dsa@cumulusnetworks.com> Date: Tue Aug 23 21:05:27 2016 -0700 net: diag: Fix refcnt leak in error path destroying socket inet_diag_find_one_icsk takes a reference to a socket that is not released if sock_diag_destroy returns an error. Fix by changing tcp_diag_destroy to manage the refcnt for all cases and remove the sock_put calls from tcp_abort. Fixes: c1e64e298b8ca ("net: diag: Support destroying TCP sockets") Reported-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. 
Miller <davem@davemloft.net> commit 7b996243fab46092fb3a29c773c54be8152366e4 Author: Soheil Hassas Yeganeh <soheil@google.com> Date: Tue Aug 23 18:22:33 2016 -0400 tun: fix transmit timestamp support Instead of using sock_tx_timestamp, use skb_tx_timestamp to record software transmit timestamp of a packet. sock_tx_timestamp resets and overrides the tx_flags of the skb. The function is intended to be called from within the protocol layer when creating the skb, not from a device driver. This is inconsistent with other drivers and will cause issues for TCP. In TCP, we intend to sample the timestamps for the last byte for each sendmsg/sendpage. For that reason, tcp_sendmsg calls tcp_tx_timestamp only with the last skb that it generates. For example, if a 128KB message is split into two 64KB packets we want to sample the SND timestamp of the last packet. The current code in the tun driver, however, will result in sampling the SND timestamp for both packets. Also, when the last packet is split into smaller packets for retranmission (see tcp_fragment), the tun driver will record timestamps for all of the retransmitted packets and not only the last packet. Fixes: eda297729171 (tun: Support software transmit time stamping.) Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Francis Yan <francisyyan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit 75d855a5e93e6f3d9b37a8719d69a5318f051453 Author: Eric Dumazet <edumazet@google.com> Date: Tue Aug 23 09:57:51 2016 -0700 udp: get rid of SLAB_DESTROY_BY_RCU allocations After commit ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU") we do not need this special allocation mode anymore, even if it is harmless. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. 
Miller <davem@davemloft.net> commit 232cb53a45965f8789fbf0a9a1962f8c67ab1a3c Author: Lance Richardson <lrichard@redhat.com> Date: Tue Aug 23 11:40:52 2016 -0400 sctp: fix overrun in sctp_diag_dump_one() The function sctp_diag_dump_one() currently performs a memcpy() of 64 bytes from a 16 byte field into another 16 byte field. Fix by using correct size, use sizeof to obtain correct size instead of using a hard-coded constant. Fixes: 8f840e47f190 ("sctp: add the sctp_diag.c file") Signed-off-by: Lance Richardson <lrichard@redhat.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit a8184003c0bb1d6362c2af76c560b3caae6832cb Author: Rabin Vincent <rabinv@axis.com> Date: Tue Aug 23 16:31:28 2016 +0200 dwc_eth_qos: fix interrupt enable race We currently enable interrupts before we enable NAPI. If an RX interrupt hits before we enabled NAPI then the NAPI callback is never called and we leave the hardware with RX interrupts disabled, which of course leads us to never handling received packets. Fix this by moving the interrupt enable to after we've enable NAPI and the reclaim tasklet. Fixes: cd5e41234729 ("dwc_eth_qos: do phy_start before resetting hardware") Signed-off-by: Rabin Vincent <rabinv@axis.com> Signed-off-by: Lars Persson <larper@axis.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit 53080fe9c451e7625e71b91c384e7bef1be72b00 Author: Fabio Estevam <fabio.estevam@nxp.com> Date: Tue Aug 23 09:48:20 2016 -0300 net: lpc_eth: Check clk_prepare_enable() error clk_prepare_enable() may fail, so we should better check its return value and propagate it in the case of failure While at it, replace __lpc_eth_clock_enable() with a plain clk_prepare_enable/clk_disable_unprepare() call in order to simplify the code. Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com> Acked-by: Vladimir Zapolskiy <vz@mleia.com> Signed-off-by: David S. 
Miller <davem@davemloft.net>

commit 1bc261fabe866c4cdc97f52319eaa0c7ee31026e
Author: Jamie Lentin <jm@lentin.co.uk>
Date: Mon Aug 22 22:47:08 2016 +0100

net: mv88e6xxx: Fix ingress rate removal for mv6131 chips

The PORT_RATE_CONTROL register works differently on 88e6095/6095f/6131 in comparison to 6123/61/65, and 0x0 disables. The distinction was lost Linux 4.1 --> 4.2

Signed-off-by: Jamie Lentin <jm@lentin.co.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit f64f14820e2deb5db056a05d7672ee2b1c6290e5
Author: Xander Huff <xander.huff@ni.com>
Date: Mon Aug 22 15:57:16 2016 -0500

phy: micrel: Reenable interrupts during resume for ksz9031

Like the ksz8081, the ksz9031 has the behavior where it will clear the interrupt enable bits when leaving power down. This takes advantage of the solution provided by f5aba91.

Signed-off-by: Xander Huff <xander.huff@ni.com>
Signed-off-by: Nathan Sullivan <nathan.sullivan@ni.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit 20a2b49fc538540819a0c552877086548cff8d8d
Author: Eric Dumazet <edumazet@google.com>
Date: Mon Aug 22 11:31:10 2016 -0700

tcp: properly scale window in tcp_v[46]_reqsk_send_ack()

When sending an ack in SYN_RECV state, we must scale the offered window if wscale option was negotiated and accepted.

Tested: Following packetdrill test demonstrates the issue :

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
// Establish a connection.
+0 < S 0:0(0) win 20000 <mss 1000,sackOK,wscale 7, nop, TS val 100 ecr 0>
+0 > S. 0:0(0) ack 1 win 28960 <mss 1460,sackOK, TS val 100 ecr 100, nop, wscale 7>
+0 < . 1:11(10) ack 1 win 156 <nop,nop,TS val 99 ecr 100>
// check that window is properly scaled !
+0 > . 1:1(0) ack 1 win 226 <nop,nop,TS val 200 ecr 100>

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit 6c389fc931bcda88940c809f752ada6d7799482c
Author: Zefir Kurtisi <zefir.kurtisi@neratec.com>
Date: Mon Aug 22 15:58:12 2016 +0200

gianfar: fix size of scatter-gathered frames

The current scatter-gather logic in gianfar is flawed, since it does not consider the eTSEC's RxBD 'Data Length' field is context depening: for the last fragment it contains the full frame size, while fragments contain the fragment size, which equals the value written to register MRBLR. This causes data corruption as soon as the hardware starts to fragment receiving frames. As a result, the size of fragmented frames is increased by

(nr_frags - 1) * MRBLR

We first noticed this issue working with DSA, where an ICMP request sized 1472 bytes causes the scatter-gather logic to kick in. The full Ethernet frame (1518) gets increased by DSA (4), GMAC_FCB_LEN (8), and FSL_GIANFAR_DEV_HAS_TIMER (priv->padding=8) to a total of 1538 octets, which is fragmented by the hardware and reconstructed by the driver to a 3074 octet frame.

This patch fixes the problem by adjusting the size of the last fragment. It was tested by setting MRBLR to different multiples of 64, proving correct scatter-gather operation on frames with up to 9000 octets in size.

Signed-off-by: Zefir Kurtisi <zefir.kurtisi@neratec.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit b323431bc017e9862870cbbac004774c769ee112
Author: Zefir Kurtisi <zefir.kurtisi@neratec.com>
Date: Mon Aug 22 15:56:38 2016 +0200

gianfar: prevent fragmentation in DSA environments

The eTSEC register MRBLR defines the maximum space in the RX buffers and is set to 1536 by gianfar.
This reasonably covers the common use case where the MTU is kept at default 1500. In that case, the largest Ethernet frame size of 1518 plus an optional GMAC_FCB_LEN of 8, and an additional padding of 8 to handle FSL_GIANFAR_DEV_HAS_TIMER totals to 1534 and nicely fit within the chosen MRBLR. Alas, if the eTSEC is attached to a DSA enabled switch, the (E)DSA header extension (4 or 8 bytes) causes every maximum sized frame to be fragmented by the hardware. This patch increases the maximum RX buffer size by 8 and rounds up to the next multiple of 64, which the hardware's defines as RX buffer granularity. Signed-off-by: Zefir Kurtisi <zefir.kurtisi@neratec.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit e83c6744e81abc93a20d0eb3b7f504a176a6126a Author: Eric Dumazet <edumazet@google.com> Date: Tue Aug 23 13:59:33 2016 -0700 udp: fix poll() issue with zero sized packets Laura tracked poll() [and friends] regression caused by commit e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") udp_poll() needs to know if there is a valid packet in receive queue, even if its payload length is 0. Change first_packet_length() to return an signed int, and use -1 as the indication of an empty queue. Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Reported-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Laura Abbott <labbott@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit 41963c10c47a35185e68cb9049f7a3493c94d2d7 Author: Benjamin Coddington <bcodding@redhat.com> Date: Mon Aug 22 14:11:16 2016 -0400 pnfs/blocklayout: update last_write_offset atomically with extents Block/SCSI layout write completion may add committable extents to the extent tree before updating the layout's last-written byte under the inode lock. 
If a sync happens before this value is updated, then prepare_layoutcommit may find and encode these extents which would produce a LAYOUTCOMMIT request whose encoded extents are larger than the request's loca_length. Fix this by using a last-written byte value that is updated atomically with the extent tree so that commitable extents always match. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> commit b88fa69eaa8649f11828158c7b65c4bcd886ebd5 Author: Trond Myklebust <trond.myklebust@primarydata.com> Date: Tue Aug 23 11:19:33 2016 -0400 pNFS: The client must not do I/O to the DS if it's lease has expired Ensure that the client conforms to the normative behaviour described in RFC5661 Section 12.7.2: "If a client believes its lease has expired, it MUST NOT send I/O to the storage device until it has validated its lease." So ensure that we wait for the lease to be validated before using the layout. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v3.20+ commit 28a10c426e81afc88514bca8e73affccf850fdf6 Author: Jamal Hadi Salim <jhs@mojatatu.com> Date: Mon Aug 22 07:10:20 2016 -0400 net sched: fix encoding to use real length Encoding of the metadata was using the padded length as opposed to the real length of the data which is a bug per specification. This has not been an issue todate because all metadatum specified so far has been 32 bit where aligned and data length are the same width. This also includes a bug fix for validating the length of a u16 field. But since there is no metadata of size u16 yes we are fine to include it here. While at it get rid of magic numbers. Fixes: ef6980b6becb ("net sched: introduce IFE action") Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. 
Miller <davem@davemloft.net> commit 4870e704d901602e4ae5de462c4e65732cf2ed6c Author: Yuval Mintz <Yuval.Mintz@qlogic.com> Date: Mon Aug 22 12:03:29 2016 +0300 qed: FLR of active VFs might lead to FW assert Driver never bothered marking the VF's vport with the VF's sw_fid. As a result, FLR flows are not going to clean those vports. If the vport was active when FLRed, re-activating it would lead to a FW assertion. Fixes: dacd88d6f6851 ("qed: IOV l2 functionality") Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit c0451fe1f27b815b3f400df2a63b9aecf589b7b0 Author: Shmulik Ladkani <shmulik.ladkani@gmail.com> Date: Sun Aug 21 11:22:32 2016 +0300 net: ip_finish_output_gso: Allow fragmenting segments of tunneled skbs if their DF is unset In b8247f095e, "net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs" gso skbs arriving from an ingress interface that go through UDP tunneling, are allowed to be fragmented if the resulting encapulated segments exceed the dst mtu of the egress interface. This aligned the behavior of gso skbs to non-gso skbs going through udp encapsulation path. However the non-gso vs gso anomaly is present also in the following cases of a GRE tunnel: - ip_gre in collect_md mode, where TUNNEL_DONT_FRAGMENT is not set (e.g. OvS vport-gre with df_default=false) - ip_gre in nopmtudisc mode, where IFLA_GRE_IGNORE_DF is set In both of the above cases, the non-gso skbs get fragmented, whereas the gso skbs (having skb_gso_network_seglen that exceeds dst mtu) get dropped, as they don't go through the segment+fragment code path. Fix: Setting IPSKB_FRAG_SEGS if the tunnel specified IP_DF bit is NOT set. Tunnels that do set IP_DF, will not go to fragmentation o…

khuey · 2016-09-26T21:45:11Z

This got fixed \o/

This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. The existing code was trying to be as defensively paranoid as possible, but it now ends up confusing ptrace. Instead, the syscall can just be skipped (which solves the original concern that the do_exit() was addressing) and normal signal handling, tracer notification, and process death can happen.

Paraphrasing from the original bug report:

If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7

The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: rr-debugger/rr#1762 (comment).

Reported-by: Robert O'Callahan <robert@ocallahan.org>
Reported-by: Kyle Huey <khuey@kylehuey.com>
Tested-by: Kyle Huey <khuey@kylehuey.com>
Fixes: 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace")
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: James Morris <james.l.morris@oracle.com>
(cherry picked from commit 485a252a5559b45d7df04c819ec91177c62c270b)
Bug: 119769499
Change-Id: I444e69093e88d58587b4d5c4f2d777985591c32d
Signed-off-by: Greg Hackmann <ghackmann@google.com>
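For readers unfamiliar with how a tracer tells these stops apart, here is a minimal sketch of decoding a `waitpid()` status into a ptrace event number, plus a small check for the symptom described above (a task that reported a seccomp trap but never an exit event). The numeric constants are the Linux values from `<linux/ptrace.h>` and `<signal.h>`; the helper names `ptrace_event`, `make_event_status`, and `tasks_missing_exit` are hypothetical and are not taken from rr or the kernel.

```python
# Linux constants (from <signal.h> and <linux/ptrace.h>).
SIGTRAP = 5
PTRACE_EVENT_EXIT = 6
PTRACE_EVENT_SECCOMP = 7


def ptrace_event(status: int) -> int:
    """Return the PTRACE_EVENT_* number encoded in a wait status, or 0.

    For a ptrace event stop, the kernel reports
    status >> 8 == (SIGTRAP | (event << 8)), and the low byte is 0x7f
    (the "stopped" marker that WIFSTOPPED tests for).
    """
    if (status & 0xFF) != 0x7F:          # not a stop at all
        return 0
    if ((status >> 8) & 0xFF) != SIGTRAP:  # stopped, but not by SIGTRAP
        return 0
    return (status >> 16) & 0xFF         # 0 for a plain SIGTRAP stop


def make_event_status(event: int) -> int:
    """Hypothetical helper: build the wait status for a given ptrace event."""
    return (((event << 8) | SIGTRAP) << 8) | 0x7F


def tasks_missing_exit(observed):
    """Given (tid, status) pairs in arrival order, return the set of tids
    that reported PTRACE_EVENT_SECCOMP but never PTRACE_EVENT_EXIT.

    This is illustrative bookkeeping only, not how rr tracks tasks.
    """
    saw_seccomp, saw_exit = set(), set()
    for tid, status in observed:
        event = ptrace_event(status)
        if event == PTRACE_EVENT_SECCOMP:
            saw_seccomp.add(tid)
        elif event == PTRACE_EVENT_EXIT:
            saw_exit.add(tid)
    return saw_seccomp - saw_exit


if __name__ == "__main__":
    seccomp = make_event_status(PTRACE_EVENT_SECCOMP)
    exit_ev = make_event_status(PTRACE_EVENT_EXIT)
    # Task 2 hits a seccomp trap and then vanishes without an exit event.
    log = [(1, seccomp), (2, seccomp), (1, exit_ev)]
    print(sorted(tasks_missing_exit(log)))
```

In a real tracer the statuses would come from `waitpid(-1, &status, __WALL)` with `PTRACE_O_TRACESECCOMP` and `PTRACE_O_TRACEEXIT` set; the synthetic statuses here only stand in for that.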


This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. The existing code was trying to be as defensively paranoid as possible, but it now ends up confusing ptrace. Instead, the syscall can just be skipped (which solves the original concern that the do_exit() was addressing) and normal signal handling, tracer notification, and process death can happen. Paraphrasing from the original bug report: If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7 The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: rr-debugger/rr#1762 (comment). Reported-by: Robert O'Callahan <robert@ocallahan.org> Reported-by: Kyle Huey <khuey@kylehuey.com> Tested-by: Kyle Huey <khuey@kylehuey.com> Fixes: 93e35ef ("x86/ptrace: run seccomp after ptrace") Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: James Morris <james.l.morris@oracle.com> (cherry picked from commit 485a252) Bug: 119769499 Change-Id: I444e69093e88d58587b4d5c4f2d777985591c32d Signed-off-by: Greg Hackmann <ghackmann@google.com> Signed-off-by: Kneba <abenkenary3@gmail.com> Signed-off-by: Tiktodz <ewprjkt@proton.me>

This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. The existing code was trying to be as defensively paranoid as possible, but it now ends up confusing ptrace. Instead, the syscall can just be skipped (which solves the original concern that the do_exit() was addressing) and normal signal handling, tracer notification, and process death can happen. Paraphrasing from the original bug report: If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7 The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: rr-debugger/rr#1762 (comment). 
Reported-by: Robert O'Callahan <robert@ocallahan.org> Reported-by: Kyle Huey <khuey@kylehuey.com> Tested-by: Kyle Huey <khuey@kylehuey.com> Fixes: 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace") Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: James Morris <james.l.morris@oracle.com> (cherry picked from commit 485a252a5559b45d7df04c819ec91177c62c270b) Bug: 119769499 Change-Id: I444e69093e88d58587b4d5c4f2d777985591c32d Signed-off-by: Greg Hackmann <ghackmann@google.com> Signed-off-by: RyuujiX <saputradenny712@gmail.com> Signed-off-by: dotkit <ewprjkt@proton.me>

This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. The existing code was trying to be as defensively paranoid as possible, but it now ends up confusing ptrace. Instead, the syscall can just be skipped (which solves the original concern that the do_exit() was addressing) and normal signal handling, tracer notification, and process death can happen. Paraphrasing from the original bug report: If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7 The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: rr-debugger/rr#1762 (comment). 
Reported-by: Robert O'Callahan <robert@ocallahan.org> Reported-by: Kyle Huey <khuey@kylehuey.com> Tested-by: Kyle Huey <khuey@kylehuey.com> Fixes: 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace") Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: James Morris <james.l.morris@oracle.com> (cherry picked from commit 485a252a5559b45d7df04c819ec91177c62c270b) Bug: 119769499 Change-Id: I444e69093e88d58587b4d5c4f2d777985591c32d Signed-off-by: Greg Hackmann <ghackmann@google.com> Signed-off-by: RyuujiX <saputradenny712@gmail.com> Signed-off-by: strongreasons <abenkenari@gmail.com>

This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. The existing code was trying to be as defensively paranoid as possible, but it now ends up confusing ptrace. Instead, the syscall can just be skipped (which solves the original concern that the do_exit() was addressing) and normal signal handling, tracer notification, and process death can happen. Paraphrasing from the original bug report: If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7 The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: rr-debugger/rr#1762 (comment). Reported-by: Robert O'Callahan <robert@ocallahan.org> Reported-by: Kyle Huey <khuey@kylehuey.com> Tested-by: Kyle Huey <khuey@kylehuey.com> Fixes: 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace") Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: James Morris <james.l.morris@oracle.com> (cherry picked from commit 485a252a5559b45d7df04c819ec91177c62c270b) Bug: 119769499 Change-Id: I444e69093e88d58587b4d5c4f2d777985591c32d Signed-off-by: Greg Hackmann <ghackmann@google.com> Signed-off-by: RyuujiX <saputradenny712@gmail.com>

This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. The existing code was trying to be as defensively paranoid as possible, but it now ends up confusing ptrace. Instead, the syscall can just be skipped (which solves the original concern that the do_exit() was addressing) and normal signal handling, tracer notification, and process death can happen. Paraphrasing from the original bug report: If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7 The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: rr-debugger/rr#1762 (comment). Reported-by: Robert O'Callahan <robert@ocallahan.org> Reported-by: Kyle Huey <khuey@kylehuey.com> Tested-by: Kyle Huey <khuey@kylehuey.com> Fixes: 93e35ef ("x86/ptrace: run seccomp after ptrace") Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: James Morris <james.l.morris@oracle.com> (cherry picked from commit 485a252) Bug: 119769499 Change-Id: I444e69093e88d58587b4d5c4f2d777985591c32d Signed-off-by: Greg Hackmann <ghackmann@google.com> Signed-off-by: Kneba <abenkenary3@gmail.com> Signed-off-by: dotkit <ewprjkt@proton.me>

This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. The existing code was trying to be as defensively paranoid as possible, but it now ends up confusing ptrace. Instead, the syscall can just be skipped (which solves the original concern that the do_exit() was addressing) and normal signal handling, tracer notification, and process death can happen. Paraphrasing from the original bug report: If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7 The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: rr-debugger/rr#1762 (comment). Reported-by: Robert O'Callahan <robert@ocallahan.org> Reported-by: Kyle Huey <khuey@kylehuey.com> Tested-by: Kyle Huey <khuey@kylehuey.com> Fixes: 93e35ef ("x86/ptrace: run seccomp after ptrace") Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: James Morris <james.l.morris@oracle.com> (cherry picked from commit 485a252) Bug: 119769499 Change-Id: I444e69093e88d58587b4d5c4f2d777985591c32d Signed-off-by: Greg Hackmann <ghackmann@google.com> Signed-off-by: Kneba <abenkenary3@gmail.com> Signed-off-by: Tiktodz <ewprjkt@proton.me>

This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. The existing code was trying to be as defensively paranoid as possible, but it now ends up confusing ptrace. Instead, the syscall can just be skipped (which solves the original concern that the do_exit() was addressing) and normal signal handling, tracer notification, and process death can happen. Paraphrasing from the original bug report: If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7 The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: rr-debugger/rr#1762 (comment). Reported-by: Robert O'Callahan <robert@ocallahan.org> Reported-by: Kyle Huey <khuey@kylehuey.com> Tested-by: Kyle Huey <khuey@kylehuey.com> Fixes: 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace") Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: James Morris <james.l.morris@oracle.com> (cherry picked from commit 485a252a5559b45d7df04c819ec91177c62c270b) Bug: 119769499 Change-Id: I444e69093e88d58587b4d5c4f2d777985591c32d Signed-off-by: Greg Hackmann <ghackmann@google.com> Signed-off-by: Kneba <abenkenary3@gmail.com>

This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. The existing code was trying to be as defensively paranoid as possible, but it now ends up confusing ptrace. Instead, the syscall can just be skipped (which solves the original concern that the do_exit() was addressing) and normal signal handling, tracer notification, and process death can happen. Paraphrasing from the original bug report: If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7 The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: rr-debugger/rr#1762 (comment). 
Reported-by: Robert O'Callahan <robert@ocallahan.org> Reported-by: Kyle Huey <khuey@kylehuey.com> Tested-by: Kyle Huey <khuey@kylehuey.com> Fixes: 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace") Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: James Morris <james.l.morris@oracle.com> (cherry picked from commit 485a252a5559b45d7df04c819ec91177c62c270b) Bug: 119769499 Change-Id: I444e69093e88d58587b4d5c4f2d777985591c32d Signed-off-by: Greg Hackmann <ghackmann@google.com> Signed-off-by: Kneba <abenkenary3@gmail.com> Signed-off-by: dotkit <ewprjkt@proton.me>

This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. The existing code was trying to be as defensively paranoid as possible, but it now ends up confusing ptrace. Instead, the syscall can just be skipped (which solves the original concern that the do_exit() was addressing) and normal signal handling, tracer notification, and process death can happen. Paraphrasing from the original bug report: If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7 The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: rr-debugger/rr#1762 (comment). Reported-by: Robert O'Callahan <robert@ocallahan.org> Reported-by: Kyle Huey <khuey@kylehuey.com> Tested-by: Kyle Huey <khuey@kylehuey.com> Fixes: 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace") Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: James Morris <james.l.morris@oracle.com> (cherry picked from commit 485a252a5559b45d7df04c819ec91177c62c270b) Bug: 119769499 Change-Id: I444e69093e88d58587b4d5c4f2d777985591c32d Signed-off-by: Greg Hackmann <ghackmann@google.com> Signed-off-by: RyuujiX <saputradenny712@gmail.com>

This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. The existing code was trying to be as defensively paranoid as possible, but it now ends up confusing ptrace. Instead, the syscall can just be skipped (which solves the original concern that the do_exit() was addressing) and normal signal handling, tracer notification, and process death can happen. Paraphrasing from the original bug report: If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7 The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: rr-debugger/rr#1762 (comment). 
Reported-by: Robert O'Callahan <robert@ocallahan.org> Reported-by: Kyle Huey <khuey@kylehuey.com> Tested-by: Kyle Huey <khuey@kylehuey.com> Fixes: 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace") Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: James Morris <james.l.morris@oracle.com> (cherry picked from commit 485a252a5559b45d7df04c819ec91177c62c270b) Bug: 119769499 Change-Id: I444e69093e88d58587b4d5c4f2d777985591c32d Signed-off-by: Greg Hackmann <ghackmann@google.com> Signed-off-by: RyuujiX <saputradenny712@gmail.com> Signed-off-by: Kneba <abenkenary3@gmail.com>

This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. The existing code was trying to be as defensively paranoid as possible, but it now ends up confusing ptrace. Instead, the syscall can just be skipped (which solves the original concern that the do_exit() was addressing) and normal signal handling, tracer notification, and process death can happen. Paraphrasing from the original bug report: If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7 The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: rr-debugger/rr#1762 (comment). 
Reported-by: Robert O'Callahan <robert@ocallahan.org> Reported-by: Kyle Huey <khuey@kylehuey.com> Tested-by: Kyle Huey <khuey@kylehuey.com> Fixes: 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace") Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: James Morris <james.l.morris@oracle.com> (cherry picked from commit 485a252a5559b45d7df04c819ec91177c62c270b) Bug: 119769499 Change-Id: I444e69093e88d58587b4d5c4f2d777985591c32d Signed-off-by: Greg Hackmann <ghackmann@google.com> Signed-off-by: Kneba <abenkenary3@gmail.com> Signed-off-by: Tiktodz <ewprjkt@proton.me> Signed-off-by: dotkit <ewprjkt@proton.me>

This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. The existing code was trying to be as defensively paranoid as possible, but it now ends up confusing ptrace. Instead, the syscall can just be skipped (which solves the original concern that the do_exit() was addressing) and normal signal handling, tracer notification, and process death can happen. Paraphrasing from the original bug report: If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7 The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: rr-debugger/rr#1762 (comment). 
Reported-by: Robert O'Callahan <robert@ocallahan.org> Reported-by: Kyle Huey <khuey@kylehuey.com> Tested-by: Kyle Huey <khuey@kylehuey.com> Fixes: 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace") Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: James Morris <james.l.morris@oracle.com> (cherry picked from commit 485a252a5559b45d7df04c819ec91177c62c270b) Bug: 119769499 Change-Id: I444e69093e88d58587b4d5c4f2d777985591c32d Signed-off-by: Greg Hackmann <ghackmann@google.com> Signed-off-by: RyuujiX <saputradenny712@gmail.com> Signed-off-by: Tiktodz <ewprjkt@proton.me>

This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. The existing code was trying to be as defensively paranoid as possible, but it now ends up confusing ptrace. Instead, the syscall can just be skipped (which solves the original concern that the do_exit() was addressing) and normal signal handling, tracer notification, and process death can happen. Paraphrasing from the original bug report: If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7 The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: rr-debugger/rr#1762 (comment). Reported-by: Robert O'Callahan <robert@ocallahan.org> Reported-by: Kyle Huey <khuey@kylehuey.com> Tested-by: Kyle Huey <khuey@kylehuey.com> Fixes: 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace") Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: James Morris <james.l.morris@oracle.com> (cherry picked from commit 485a252a5559b45d7df04c819ec91177c62c270b) Bug: 119769499 Change-Id: I444e69093e88d58587b4d5c4f2d777985591c32d Signed-off-by: Greg Hackmann <ghackmann@google.com> Signed-off-by: RyuujiX <saputradenny712@gmail.com>

This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. The existing code was trying to be as defensively paranoid as possible, but it now ends up confusing ptrace. Instead, the syscall can just be skipped (which solves the original concern that the do_exit() was addressing) and normal signal handling, tracer notification, and process death can happen. Paraphrasing from the original bug report: If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7 The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: rr-debugger/rr#1762 (comment). 
Reported-by: Robert O'Callahan <robert@ocallahan.org> Reported-by: Kyle Huey <khuey@kylehuey.com> Tested-by: Kyle Huey <khuey@kylehuey.com> Fixes: 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace") Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: James Morris <james.l.morris@oracle.com> (cherry picked from commit 485a252a5559b45d7df04c819ec91177c62c270b) Bug: 119769499 Change-Id: I444e69093e88d58587b4d5c4f2d777985591c32d Signed-off-by: Greg Hackmann <ghackmann@google.com> Signed-off-by: Kneba <abenkenary3@gmail.com> Signed-off-by: Tiktodz <ewprjkt@proton.me> Signed-off-by: dotkit <ewprjkt@proton.me> Signed-off-by: strongreasons <abenkenari@gmail.com>

This fixes a ptrace vs fatal pending signals bug as manifested in seccomp now that seccomp was reordered to happen after ptrace. The short version is that seccomp should not attempt to call do_exit() while fatal signals are pending under a tracer. The existing code was trying to be as defensively paranoid as possible, but it now ends up confusing ptrace. Instead, the syscall can just be skipped (which solves the original concern that the do_exit() was addressing) and normal signal handling, tracer notification, and process death can happen. Paraphrasing from the original bug report: If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed after such a trap but not yet been scheduled, and another task in the thread-group calls exit_group(), then the tracee task exits without the ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here: https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7 The bug happens because when __seccomp_filter() detects fatal_signal_pending(), it calls do_exit() without dequeuing the fatal signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that task is descheduled, __schedule() notices that there is a fatal signal pending and changes its state from TASK_TRACED to TASK_RUNNING. That prevents the ptracer's waitpid() from returning the ptrace event. A more detailed analysis is here: rr-debugger/rr#1762 (comment). 
Reported-by: Robert O'Callahan <robert@ocallahan.org>
Reported-by: Kyle Huey <khuey@kylehuey.com>
Tested-by: Kyle Huey <khuey@kylehuey.com>
Fixes: 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace")
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: James Morris <james.l.morris@oracle.com>
(cherry picked from commit 485a252a5559b45d7df04c819ec91177c62c270b)
Bug: 119769499
Change-Id: I444e69093e88d58587b4d5c4f2d777985591c32d
Signed-off-by: Greg Hackmann <ghackmann@google.com>
Signed-off-by: RyuujiX <saputradenny712@gmail.com>
Signed-off-by: strongreasons <strongreasons@users.noreply.github.com>
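
For readers following along from the tracer side, here is a rough sketch in C of how a ptracer might tell the two stops in the commit message apart by decoding a `waitpid()` status. It is not rr's code. The names `stop_kind`, `ptrace_event_of`, `classify_stop` and `missed_exit_event` are made up for illustration, and the last helper assumes the tracer eventually sees a plain termination status for the task, which the report above does not spell out.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Hypothetical classification of a waitpid() status, for illustration only. */
enum stop_kind {
    STOP_NOT_STOPPED,   /* task exited or was killed: not a ptrace-stop */
    STOP_SECCOMP_EVENT, /* PTRACE_EVENT_SECCOMP trap */
    STOP_EXIT_EVENT,    /* PTRACE_EVENT_EXIT trap */
    STOP_OTHER          /* any other stop (signal, syscall, other event) */
};

/* Return the ptrace event code carried by a stop status, or 0 if none.
 * For event stops, status >> 8 equals SIGTRAP | (event << 8). */
int ptrace_event_of(int status)
{
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
        return 0;
    return (status >> 16) & 0xff;
}

enum stop_kind classify_stop(int status)
{
    if (!WIFSTOPPED(status))
        return STOP_NOT_STOPPED;
    switch (ptrace_event_of(status)) {
    case PTRACE_EVENT_SECCOMP: return STOP_SECCOMP_EVENT;
    case PTRACE_EVENT_EXIT:    return STOP_EXIT_EVENT;
    default:                   return STOP_OTHER;
    }
}

/* Heuristic (an assumption, not taken from the report): the symptom shows up
 * as a seccomp event stop followed directly by a termination status, with no
 * PTRACE_EVENT_EXIT stop in between. */
int missed_exit_event(int prev_status, int next_status)
{
    return classify_stop(prev_status) == STOP_SECCOMP_EVENT &&
           (WIFEXITED(next_status) || WIFSIGNALED(next_status));
}
```

A tracer loop would feed each status returned by `waitpid()` into `classify_stop()`, and could log when `missed_exit_event()` fires for a task, to check whether a given kernel still shows the behaviour.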

rocallahan mentioned this issue Aug 2, 2016

Support AMD Piledriver? #1552

Closed

khuey closed this as completed Sep 26, 2016
