Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix fallback to TCP #11

Closed
pabeni opened this issue Mar 24, 2020 · 1 comment
Closed

fix fallback to TCP #11

pabeni opened this issue Mar 24, 2020 · 1 comment
Assignees

Comments

@pabeni
Copy link

pabeni commented Mar 24, 2020

fallback to TCP after the 3w handshake is not currently functional server side - as the msk lacks the required 'subflow' field for that.

Additionally we don't support infinite mapping, not check for later fallback.

Support for infinite mapping will allow fixing the above and will simplify the recvmsg and poll fallback path - sendmsg still needs explicit fallback check.

@dcaratti dcaratti self-assigned this Mar 24, 2020
jenkins-tessares pushed a commit that referenced this issue Mar 26, 2020
When experimenting with bpf_send_signal() helper in our production
environment (5.2 based), we experienced a deadlock in NMI mode:
   #5 [ffffc9002219f770] queued_spin_lock_slowpath at ffffffff8110be24
   #6 [ffffc9002219f770] _raw_spin_lock_irqsave at ffffffff81a43012
   #7 [ffffc9002219f780] try_to_wake_up at ffffffff810e7ecd
   #8 [ffffc9002219f7e0] signal_wake_up_state at ffffffff810c7b55
   #9 [ffffc9002219f7f0] __send_signal at ffffffff810c8602
  #10 [ffffc9002219f830] do_send_sig_info at ffffffff810ca31a
  #11 [ffffc9002219f868] bpf_send_signal at ffffffff8119d227
  #12 [ffffc9002219f988] bpf_overflow_handler at ffffffff811d4140
  #13 [ffffc9002219f9e0] __perf_event_overflow at ffffffff811d68cf
  #14 [ffffc9002219fa10] perf_swevent_overflow at ffffffff811d6a09
  #15 [ffffc9002219fa38] ___perf_sw_event at ffffffff811e0f47
  #16 [ffffc9002219fc30] __schedule at ffffffff81a3e04d
  #17 [ffffc9002219fc90] schedule at ffffffff81a3e219
  #18 [ffffc9002219fca0] futex_wait_queue_me at ffffffff8113d1b9
  #19 [ffffc9002219fcd8] futex_wait at ffffffff8113e529
  #20 [ffffc9002219fdf0] do_futex at ffffffff8113ffbc
  #21 [ffffc9002219fec0] __x64_sys_futex at ffffffff81140d1c
  #22 [ffffc9002219ff38] do_syscall_64 at ffffffff81002602
  #23 [ffffc9002219ff50] entry_SYSCALL_64_after_hwframe at ffffffff81c00068

The above call stack is actually very similar to an issue
reported by Commit eac9153 ("bpf/stackmap: Fix deadlock with
rq_lock in bpf_get_stack()") by Song Liu. The only difference is
bpf_send_signal() helper instead of bpf_get_stack() helper.

The above deadlock is triggered with a perf_sw_event.
Similar to Commit eac9153, the below almost identical reproducer
used tracepoint point sched/sched_switch so the issue can be easily caught.
  /* stress_test.c */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <pthread.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>

  #define THREAD_COUNT 1000
  char *filename;
  void *worker(void *p)
  {
        void *ptr;
        int fd;
        char *pptr;

        fd = open(filename, O_RDONLY);
        if (fd < 0)
                return NULL;
        while (1) {
                struct timespec ts = {0, 1000 + rand() % 2000};

                ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
                usleep(1);
                if (ptr == MAP_FAILED) {
                        printf("failed to mmap\n");
                        break;
                }
                munmap(ptr, 4096 * 64);
                usleep(1);
                pptr = malloc(1);
                usleep(1);
                pptr[0] = 1;
                usleep(1);
                free(pptr);
                usleep(1);
                nanosleep(&ts, NULL);
        }
        close(fd);
        return NULL;
  }

  int main(int argc, char *argv[])
  {
        void *ptr;
        int i;
        pthread_t threads[THREAD_COUNT];

        if (argc < 2)
                return 0;

        filename = argv[1];

        for (i = 0; i < THREAD_COUNT; i++) {
                if (pthread_create(threads + i, NULL, worker, NULL)) {
                        fprintf(stderr, "Error creating thread\n");
                        return 0;
                }
        }

        for (i = 0; i < THREAD_COUNT; i++)
                pthread_join(threads[i], NULL);
        return 0;
  }
and the following command:
  1. run `stress_test /bin/ls` in one windown
  2. hack bcc trace.py with the following change:
     --- a/tools/trace.py
     +++ b/tools/trace.py
     @@ -513,6 +513,7 @@ BPF_PERF_OUTPUT(%s);
              __data.tgid = __tgid;
              __data.pid = __pid;
              bpf_get_current_comm(&__data.comm, sizeof(__data.comm));
     +        bpf_send_signal(10);
      %s
      %s
              %s.perf_submit(%s, &__data, sizeof(__data));
  3. in a different window run
     ./trace.py -p $(pidof stress_test) t:sched:sched_switch

The deadlock can be reproduced in our production system.

Similar to Song's fix, the fix is to delay sending signal if
irqs is disabled to avoid deadlocks involving with rq_lock.
With this change, my above stress-test in our production system
won't cause deadlock any more.

I also implemented a scale-down version of reproducer in the
selftest (a subsequent commit). With latest bpf-next,
it complains for the following potential deadlock.
  [   32.832450] -> #1 (&p->pi_lock){-.-.}:
  [   32.833100]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.833696]        task_rq_lock+0x2c/0xa0
  [   32.834182]        task_sched_runtime+0x59/0xd0
  [   32.834721]        thread_group_cputime+0x250/0x270
  [   32.835304]        thread_group_cputime_adjusted+0x2e/0x70
  [   32.835959]        do_task_stat+0x8a7/0xb80
  [   32.836461]        proc_single_show+0x51/0xb0
  ...
  [   32.839512] -> #0 (&(&sighand->siglock)->rlock){....}:
  [   32.840275]        __lock_acquire+0x1358/0x1a20
  [   32.840826]        lock_acquire+0xc7/0x1d0
  [   32.841309]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.841916]        __lock_task_sighand+0x79/0x160
  [   32.842465]        do_send_sig_info+0x35/0x90
  [   32.842977]        bpf_send_signal+0xa/0x10
  [   32.843464]        bpf_prog_bc13ed9e4d3163e3_send_signal_tp_sched+0x465/0x1000
  [   32.844301]        trace_call_bpf+0x115/0x270
  [   32.844809]        perf_trace_run_bpf_submit+0x4a/0xc0
  [   32.845411]        perf_trace_sched_switch+0x10f/0x180
  [   32.846014]        __schedule+0x45d/0x880
  [   32.846483]        schedule+0x5f/0xd0
  ...

  [   32.853148] Chain exists of:
  [   32.853148]   &(&sighand->siglock)->rlock --> &p->pi_lock --> &rq->lock
  [   32.853148]
  [   32.854451]  Possible unsafe locking scenario:
  [   32.854451]
  [   32.855173]        CPU0                    CPU1
  [   32.855745]        ----                    ----
  [   32.856278]   lock(&rq->lock);
  [   32.856671]                                lock(&p->pi_lock);
  [   32.857332]                                lock(&rq->lock);
  [   32.857999]   lock(&(&sighand->siglock)->rlock);

  Deadlock happens on CPU0 when it tries to acquire &sighand->siglock
  but it has been held by CPU1 and CPU1 tries to grab &rq->lock
  and cannot get it.

  This is not exactly the callstack in our production environment,
  but sympotom is similar and both locks are using spin_lock_irqsave()
  to acquire the lock, and both involves rq_lock. The fix to delay
  sending signal when irq is disabled also fixed this issue.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200304191104.2796501-1-yhs@fb.com
jenkins-tessares pushed a commit that referenced this issue Mar 31, 2020
Ido Schimmel says:

====================
Add packet trap policers support

Background
==========

Devices capable of offloading the kernel's datapath and perform
functions such as bridging and routing must also be able to send (trap)
specific packets to the kernel (i.e., the CPU) for processing.

For example, a device acting as a multicast-aware bridge must be able to
trap IGMP membership reports to the kernel for processing by the bridge
module.

Motivation
==========

In most cases, the underlying device is capable of handling packet rates
that are several orders of magnitude higher compared to those that can
be handled by the CPU.

Therefore, in order to prevent the underlying device from overwhelming
the CPU, devices usually include packet trap policers that are able to
police the trapped packets to rates that can be handled by the CPU.

Proposed solution
=================

This patch set allows capable device drivers to register their supported
packet trap policers with devlink. User space can then tune the
parameters of these policers (currently, rate and burst size) and read
from the device the number of packets that were dropped by the policer,
if supported.

These packet trap policers can then be bound to existing packet trap
groups, which are used to aggregate logically related packet traps. As a
result, trapped packets are policed to rates that can be handled the
host CPU.

Example usage
=============

Instantiate netdevsim:

Dump available packet trap policers:
netdevsim/netdevsim10:
  policer 1 rate 1000 burst 128
  policer 2 rate 2000 burst 256
  policer 3 rate 3000 burst 512

Change the parameters of a packet trap policer:

Bind a packet trap policer to a packet trap group:

Dump parameters and statistics of a packet trap policer:
netdevsim/netdevsim10:
  policer 3 rate 100 burst 16
    stats:
        rx:
          dropped 92

Unbind a packet trap policer from a packet trap group:

Patch set overview
==================

Patch #1 adds the core infrastructure in devlink which allows capable
device drivers to register their supported packet trap policers with
devlink.

Patch #2 extends the existing devlink-trap documentation.

Patch #3 extends netdevsim to register a few dummy packet trap policers
with devlink. Used later on to selftests the core infrastructure.

Patches #4-#5 adds infrastructure in devlink to allow binding of packet
trap policers to packet trap groups.

Patch #6 extends netdevsim to allow such binding.

Patch #7 adds a selftest over netdevsim that verifies the core
devlink-trap policers functionality.

Patches #8-#14 gradually add devlink-trap policers support in mlxsw.

Patch #15 adds a selftest over mlxsw. All registered packet trap
policers are verified to handle the configured rate and burst size.

Future plans
============

* Allow changing default association between packet traps and packet
  trap groups
* Add more packet traps. For example, for control packets (e.g., IGMP)

v3:
* Rebase

v2 (address comments from Jiri and Jakub):
* Patch #1: Add 'strict_start_type' in devlink policy
* Patch #1: Have device drivers provide max/min rate/burst size for each
  policer. Use them to check validity of user provided parameters
* Patch #3: Remove check about burst size being a power of 2 and instead
  add a debugfs knob to fail the operation
* Patch #3: Provide max/min rate/burst size when registering policers
  and remove the validity checks from nsim_dev_devlink_trap_policer_set()
* Patch #5: Check for presence of 'DEVLINK_ATTR_TRAP_POLICER_ID' in
  devlink_trap_group_set() and bail if not present
* Patch #5: Add extack error message in case trap group was partially
  modified
* Patch #7: Add test case with new 'fail_trap_policer_set' knob
* Patch #7: Add test case for partially modified trap group
* Patch #10: Provide max/min rate/burst size when registering policers
* Patch #11: Remove the max/min validity checks from
  __mlxsw_sp_trap_policer_set()
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
jenkins-tessares pushed a commit that referenced this issue Apr 1, 2020
The VT_DISALLOCATE ioctl can free a virtual console while tty_release()
is still running, causing a use-after-free in con_shutdown().  This
occurs because VT_DISALLOCATE considers a virtual console's
'struct vc_data' to be unused as soon as the corresponding tty's
refcount hits 0.  But actually it may be still being closed.

Fix this by making vc_data be reference-counted via the embedded
'struct tty_port'.  A newly allocated virtual console has refcount 1.
Opening it for the first time increments the refcount to 2.  Closing it
for the last time decrements the refcount (in tty_operations::cleanup()
so that it happens late enough), as does VT_DISALLOCATE.

Reproducer:
	#include <fcntl.h>
	#include <linux/vt.h>
	#include <sys/ioctl.h>
	#include <unistd.h>

	int main()
	{
		if (fork()) {
			for (;;)
				close(open("/dev/tty5", O_RDWR));
		} else {
			int fd = open("/dev/tty10", O_RDWR);

			for (;;)
				ioctl(fd, VT_DISALLOCATE, 5);
		}
	}

KASAN report:
	BUG: KASAN: use-after-free in con_shutdown+0x76/0x80 drivers/tty/vt/vt.c:3278
	Write of size 8 at addr ffff88806a4ec108 by task syz_vt/129

	CPU: 0 PID: 129 Comm: syz_vt Not tainted 5.6.0-rc2 #11
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20191223_100556-anatol 04/01/2014
	Call Trace:
	 [...]
	 con_shutdown+0x76/0x80 drivers/tty/vt/vt.c:3278
	 release_tty+0xa8/0x410 drivers/tty/tty_io.c:1514
	 tty_release_struct+0x34/0x50 drivers/tty/tty_io.c:1629
	 tty_release+0x984/0xed0 drivers/tty/tty_io.c:1789
	 [...]

	Allocated by task 129:
	 [...]
	 kzalloc include/linux/slab.h:669 [inline]
	 vc_allocate drivers/tty/vt/vt.c:1085 [inline]
	 vc_allocate+0x1ac/0x680 drivers/tty/vt/vt.c:1066
	 con_install+0x4d/0x3f0 drivers/tty/vt/vt.c:3229
	 tty_driver_install_tty drivers/tty/tty_io.c:1228 [inline]
	 tty_init_dev+0x94/0x350 drivers/tty/tty_io.c:1341
	 tty_open_by_driver drivers/tty/tty_io.c:1987 [inline]
	 tty_open+0x3ca/0xb30 drivers/tty/tty_io.c:2035
	 [...]

	Freed by task 130:
	 [...]
	 kfree+0xbf/0x1e0 mm/slab.c:3757
	 vt_disallocate drivers/tty/vt/vt_ioctl.c:300 [inline]
	 vt_ioctl+0x16dc/0x1e30 drivers/tty/vt/vt_ioctl.c:818
	 tty_ioctl+0x9db/0x11b0 drivers/tty/tty_io.c:2660
	 [...]

Fixes: 4001d7b ("vt: push down the tty lock so we can see what is left to tackle")
Cc: <stable@vger.kernel.org> # v3.4+
Reported-by: syzbot+522643ab5729b0421998@syzkaller.appspotmail.com
Acked-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20200322034305.210082-2-ebiggers@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
jenkins-tessares pushed a commit that referenced this issue Apr 9, 2020
This change solves below hangtask issue:

INFO: task kworker/u16:1:58 blocked for more than 122 seconds.
      Not tainted 5.6.0-rc2-00590-g9983bdae4974e #11
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u16:1   D    0    58      2 0x00000000
Workqueue: writeback wb_workfn (flush-179:0)
Backtrace:
 (__schedule) from [<c0913234>] (schedule+0x78/0xf4)
 (schedule) from [<c017ec74>] (rwsem_down_write_slowpath+0x24c/0x4c0)
 (rwsem_down_write_slowpath) from [<c0915f2c>] (down_write+0x6c/0x70)
 (down_write) from [<c0435b80>] (f2fs_write_single_data_page+0x608/0x7ac)
 (f2fs_write_single_data_page) from [<c0435fd8>] (f2fs_write_cache_pages+0x2b4/0x7c4)
 (f2fs_write_cache_pages) from [<c043682c>] (f2fs_write_data_pages+0x344/0x35c)
 (f2fs_write_data_pages) from [<c0267ee8>] (do_writepages+0x3c/0xd4)
 (do_writepages) from [<c0310cbc>] (__writeback_single_inode+0x44/0x454)
 (__writeback_single_inode) from [<c03112d0>] (writeback_sb_inodes+0x204/0x4b0)
 (writeback_sb_inodes) from [<c03115cc>] (__writeback_inodes_wb+0x50/0xe4)
 (__writeback_inodes_wb) from [<c03118f4>] (wb_writeback+0x294/0x338)
 (wb_writeback) from [<c0312dac>] (wb_workfn+0x35c/0x54c)
 (wb_workfn) from [<c014f2b8>] (process_one_work+0x214/0x544)
 (process_one_work) from [<c014f634>] (worker_thread+0x4c/0x574)
 (worker_thread) from [<c01564fc>] (kthread+0x144/0x170)
 (kthread) from [<c01010e8>] (ret_from_fork+0x14/0x2c)

Reported-and-tested-by: Ondřej Jirman <megi@xff.cz>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
matttbe pushed a commit that referenced this issue Jun 2, 2020
Stefano reported a crash with using SQPOLL with io_uring:

  BUG: kernel NULL pointer dereference, address: 00000000000003b0
  CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11
  RIP: 0010:task_numa_work+0x4f/0x2c0
  Call Trace:
   task_work_run+0x68/0xa0
   io_sq_thread+0x252/0x3d0
   kthread+0xf9/0x130
   ret_from_fork+0x35/0x40

which is task_numa_work() oopsing on current->mm being NULL.

The task work is queued by task_tick_numa(), which checks if current->mm is
NULL at the time of the call. But this state isn't necessarily persistent,
if the kthread is using use_mm() to temporarily adopt the mm of a task.

Change the task_tick_numa() check to exclude kernel threads in general,
as it doesn't make sense to attempt ot balance for kthreads anyway.

Reported-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/865de121-8190-5d30-ece5-3b097dc74431@kernel.dk
matttbe pushed a commit that referenced this issue Jun 2, 2020
Ido Schimmel says:

====================
devlink: Add support for control packet traps

So far device drivers were only able to register drop and exception
packet traps with devlink. These traps are used for packets that were
either dropped by the underlying device or encountered an exception
(e.g., missing neighbour entry) during forwarding.

However, in the steady state, the majority of the packets being trapped
to the CPU are packets that are required for the correct functioning of
the control plane. For example, ARP request and IGMP query packets.

This patch set allows device drivers to register such control traps with
devlink and expose their default control plane policy to user space.
User space can then tune the packet trap policer settings according to
its needs, as with existing packet traps.

In a similar fashion to exception traps, the action associated with such
traps cannot be changed as it can easily break the control plane. Unlike
drop and exception traps, packets trapped via control traps are not
reported to the kernel's drop monitor as they are not indicative of any
problem.

Patch set overview:

Patches #1-#3 break out layer 3 exceptions to a different group to
provide better granularity. A future patch set will make this completely
configurable.

Patch #4 adds a new trap action ('mirror') that is used for packets that
are forwarded by the device and sent to the CPU. Such packets are marked
by device drivers with 'skb->offload_fwd_mark = 1' in order to prevent
the kernel from forwarding them again.

Patch #5 adds the new trap type, 'control'.

Patches #6-#8 gradually add various control traps to devlink with proper
documentation.

Patch #9 adds a few control traps to netdevsim, which are automatically
exercised by existing devlink-trap selftest.

Patches #10 performs small refactoring in mlxsw.

Patches #11-#13 change mlxsw to register its existing control traps with
devlink.

Patch #14 adds a selftest over mlxsw that exercises all the registered
control traps.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
matttbe pushed a commit that referenced this issue Jun 8, 2020
keep using MPTCP sockets and a "dummy mapping" in case of fallback to
regular TCP. Skip adding DSS option on send, if TCP fallback has been
done earlier.

Notes: I'm unsure on what to do in mptcp_clean_una() to do a one-time
flush of the retransmit queue, as per Mat's suggestion. Any advice?

Changes since v1
 - rebase on top of Paolo's fix for NULL dereference in mptcp_recvmsg()

Changes since RFC v2:
 - use a bit in msk->flags, rather than a dedicated boolean in struct
   msk. This bit is going to be used in combination with another one,
   TCP_FALLBACK_ALLOWED, that is 1 at the first subflow creation
   and gets cleared once TCP fallback is no more allowed.
 - separate code that adds support for "infinite mapping", and use
   the term "dummy" instead of "infinite". Suggested by Mat
 - remove inappropriate call to __mptcp_do_fallback() in
   mptcp_accept() (Paolo)

Changes since RFC v1:
 - use a dedicated member of struct msk to indicate that a fallback
   ha happened, use it in case of infinite mapping
 - don't delete skb_ext in case of infinite mapping (Mat)
 - test the value of pm.subflows on reception of an infinite map to
   ensure that no other subflow is currently opened (Mat)
 - in mptcp_established_options(), avoid adding TCP options in case
   of fallback indication; simplify sendmsg()/recvmsg()/poll() to
   keep using the MPTCP socket in case of TCP fallback. Set the
   fallback indication in case subflow is not mp_capable after
   successful 3-way handshake, instead of flipping 'is_mptcp'
   (Paolo/Mat)
 - remove deadcode in mptcp_finish_connect, and increment
   MPTCP_MIB_MPCAPABLEACTIVEFALLBACK in subflow_finish_connect
   (Paolo)

BugLink: #11
BugLink: #22
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Jun 8, 2020
keep using MPTCP sockets and a "dummy mapping" in case of fallback to
regular TCP. Skip adding DSS option on send, if TCP fallback has been
done earlier.

Notes: I'm unsure on what to do in mptcp_clean_una() to do a one-time
flush of the retransmit queue, as per Mat's suggestion. Any advice?

Changes since v1
 - rebase on top of Paolo's fix for NULL dereference in mptcp_recvmsg()

Changes since RFC v2:
 - use a bit in msk->flags, rather than a dedicated boolean in struct
   msk. This bit is going to be used in combination with another one,
   TCP_FALLBACK_ALLOWED, that is 1 at the first subflow creation
   and gets cleared once TCP fallback is no more allowed.
 - separate code that adds support for "infinite mapping", and use
   the term "dummy" instead of "infinite". Suggested by Mat
 - remove inappropriate call to __mptcp_do_fallback() in
   mptcp_accept() (Paolo)

Changes since RFC v1:
 - use a dedicated member of struct msk to indicate that a fallback
   ha happened, use it in case of infinite mapping
 - don't delete skb_ext in case of infinite mapping (Mat)
 - test the value of pm.subflows on reception of an infinite map to
   ensure that no other subflow is currently opened (Mat)
 - in mptcp_established_options(), avoid adding TCP options in case
   of fallback indication; simplify sendmsg()/recvmsg()/poll() to
   keep using the MPTCP socket in case of TCP fallback. Set the
   fallback indication in case subflow is not mp_capable after
   successful 3-way handshake, instead of flipping 'is_mptcp'
   (Paolo/Mat)
 - remove deadcode in mptcp_finish_connect, and increment
   MPTCP_MIB_MPCAPABLEACTIVEFALLBACK in subflow_finish_connect
   (Paolo)

BugLink: #11
BugLink: #22
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
@matttbe
Copy link
Member

matttbe commented Jun 8, 2020

A fix from @dcaratti is now in our "export" branch, see 60f6bfd

@matttbe matttbe closed this as completed Jun 8, 2020
jenkins-tessares pushed a commit that referenced this issue Jun 9, 2020
keep using MPTCP sockets and a "dummy mapping" in case of fallback to
regular TCP. Skip adding DSS option on send, if TCP fallback has been
done earlier.

Notes: I'm unsure on what to do in mptcp_clean_una() to do a one-time
flush of the retransmit queue, as per Mat's suggestion. Any advice?

Changes since v1
 - rebase on top of Paolo's fix for NULL dereference in mptcp_recvmsg()

Changes since RFC v2:
 - use a bit in msk->flags, rather than a dedicated boolean in struct
   msk. This bit is going to be used in combination with another one,
   TCP_FALLBACK_ALLOWED, that is 1 at the first subflow creation
   and gets cleared once TCP fallback is no more allowed.
 - separate code that adds support for "infinite mapping", and use
   the term "dummy" instead of "infinite". Suggested by Mat
 - remove inappropriate call to __mptcp_do_fallback() in
   mptcp_accept() (Paolo)

Changes since RFC v1:
 - use a dedicated member of struct msk to indicate that a fallback
   ha happened, use it in case of infinite mapping
 - don't delete skb_ext in case of infinite mapping (Mat)
 - test the value of pm.subflows on reception of an infinite map to
   ensure that no other subflow is currently opened (Mat)
 - in mptcp_established_options(), avoid adding TCP options in case
   of fallback indication; simplify sendmsg()/recvmsg()/poll() to
   keep using the MPTCP socket in case of TCP fallback. Set the
   fallback indication in case subflow is not mp_capable after
   successful 3-way handshake, instead of flipping 'is_mptcp'
   (Paolo/Mat)
 - remove deadcode in mptcp_finish_connect, and increment
   MPTCP_MIB_MPCAPABLEACTIVEFALLBACK in subflow_finish_connect
   (Paolo)

BugLink: #11
BugLink: #22
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Jun 9, 2020
keep using MPTCP sockets and a "dummy mapping" in case of fallback to
regular TCP. Skip adding DSS option on send, if TCP fallback has been
done earlier.

Notes: I'm unsure on what to do in mptcp_clean_una() to do a one-time
flush of the retransmit queue, as per Mat's suggestion. Any advice?

Changes since v1
 - rebase on top of Paolo's fix for NULL dereference in mptcp_recvmsg()

Changes since RFC v2:
 - use a bit in msk->flags, rather than a dedicated boolean in struct
   msk. This bit is going to be used in combination with another one,
   TCP_FALLBACK_ALLOWED, that is 1 at the first subflow creation
   and gets cleared once TCP fallback is no more allowed.
 - separate code that adds support for "infinite mapping", and use
   the term "dummy" instead of "infinite". Suggested by Mat
 - remove inappropriate call to __mptcp_do_fallback() in
   mptcp_accept() (Paolo)

Changes since RFC v1:
 - use a dedicated member of struct msk to indicate that a fallback
   ha happened, use it in case of infinite mapping
 - don't delete skb_ext in case of infinite mapping (Mat)
 - test the value of pm.subflows on reception of an infinite map to
   ensure that no other subflow is currently opened (Mat)
 - in mptcp_established_options(), avoid adding TCP options in case
   of fallback indication; simplify sendmsg()/recvmsg()/poll() to
   keep using the MPTCP socket in case of TCP fallback. Set the
   fallback indication in case subflow is not mp_capable after
   successful 3-way handshake, instead of flipping 'is_mptcp'
   (Paolo/Mat)
 - remove deadcode in mptcp_finish_connect, and increment
   MPTCP_MIB_MPCAPABLEACTIVEFALLBACK in subflow_finish_connect
   (Paolo)

BugLink: #11
BugLink: #22
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Jun 10, 2020
keep using MPTCP sockets and a "dummy mapping" in case of fallback to
regular TCP. Skip adding DSS option on send, if TCP fallback has been
done earlier.

Notes: I'm unsure on what to do in mptcp_clean_una() to do a one-time
flush of the retransmit queue, as per Mat's suggestion. Any advice?

Changes since v1
 - rebase on top of Paolo's fix for NULL dereference in mptcp_recvmsg()

Changes since RFC v2:
 - use a bit in msk->flags, rather than a dedicated boolean in struct
   msk. This bit is going to be used in combination with another one,
   TCP_FALLBACK_ALLOWED, that is 1 at the first subflow creation
   and gets cleared once TCP fallback is no more allowed.
 - separate code that adds support for "infinite mapping", and use
   the term "dummy" instead of "infinite". Suggested by Mat
 - remove inappropriate call to __mptcp_do_fallback() in
   mptcp_accept() (Paolo)

Changes since RFC v1:
 - use a dedicated member of struct msk to indicate that a fallback
   ha happened, use it in case of infinite mapping
 - don't delete skb_ext in case of infinite mapping (Mat)
 - test the value of pm.subflows on reception of an infinite map to
   ensure that no other subflow is currently opened (Mat)
 - in mptcp_established_options(), avoid adding TCP options in case
   of fallback indication; simplify sendmsg()/recvmsg()/poll() to
   keep using the MPTCP socket in case of TCP fallback. Set the
   fallback indication in case subflow is not mp_capable after
   successful 3-way handshake, instead of flipping 'is_mptcp'
   (Paolo/Mat)
 - remove deadcode in mptcp_finish_connect, and increment
   MPTCP_MIB_MPCAPABLEACTIVEFALLBACK in subflow_finish_connect
   (Paolo)

BugLink: #11
BugLink: #22
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Jun 10, 2020
keep using MPTCP sockets and a "dummy mapping" in case of fallback to
regular TCP. Skip adding DSS option on send, if TCP fallback has been
done earlier.

Notes: I'm unsure on what to do in mptcp_clean_una() to do a one-time
flush of the retransmit queue, as per Mat's suggestion. Any advice?

Changes since v1
 - rebase on top of Paolo's fix for NULL dereference in mptcp_recvmsg()

Changes since RFC v2:
 - use a bit in msk->flags, rather than a dedicated boolean in struct
   msk. This bit is going to be used in combination with another one,
   TCP_FALLBACK_ALLOWED, that is 1 at the first subflow creation
   and gets cleared once TCP fallback is no more allowed.
 - separate code that adds support for "infinite mapping", and use
   the term "dummy" instead of "infinite". Suggested by Mat
 - remove inappropriate call to __mptcp_do_fallback() in
   mptcp_accept() (Paolo)

Changes since RFC v1:
 - use a dedicated member of struct msk to indicate that a fallback
   ha happened, use it in case of infinite mapping
 - don't delete skb_ext in case of infinite mapping (Mat)
 - test the value of pm.subflows on reception of an infinite map to
   ensure that no other subflow is currently opened (Mat)
 - in mptcp_established_options(), avoid adding TCP options in case
   of fallback indication; simplify sendmsg()/recvmsg()/poll() to
   keep using the MPTCP socket in case of TCP fallback. Set the
   fallback indication in case subflow is not mp_capable after
   successful 3-way handshake, instead of flipping 'is_mptcp'
   (Paolo/Mat)
 - remove deadcode in mptcp_finish_connect, and increment
   MPTCP_MIB_MPCAPABLEACTIVEFALLBACK in subflow_finish_connect
   (Paolo)

BugLink: #11
BugLink: #22
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Jun 11, 2020
keep using MPTCP sockets and a "dummy mapping" in case of fallback to
regular TCP. Skip adding DSS option on send, if TCP fallback has been
done earlier.

Notes: I'm unsure on what to do in mptcp_clean_una() to do a one-time
flush of the retransmit queue, as per Mat's suggestion. Any advice?

Changes since v1
 - rebase on top of Paolo's fix for NULL dereference in mptcp_recvmsg()

Changes since RFC v2:
 - use a bit in msk->flags, rather than a dedicated boolean in struct
   msk. This bit is going to be used in combination with another one,
   TCP_FALLBACK_ALLOWED, that is 1 at the first subflow creation
   and gets cleared once TCP fallback is no more allowed.
 - separate code that adds support for "infinite mapping", and use
   the term "dummy" instead of "infinite". Suggested by Mat
 - remove inappropriate call to __mptcp_do_fallback() in
   mptcp_accept() (Paolo)

Changes since RFC v1:
 - use a dedicated member of struct msk to indicate that a fallback
   ha happened, use it in case of infinite mapping
 - don't delete skb_ext in case of infinite mapping (Mat)
 - test the value of pm.subflows on reception of an infinite map to
   ensure that no other subflow is currently opened (Mat)
 - in mptcp_established_options(), avoid adding TCP options in case
   of fallback indication; simplify sendmsg()/recvmsg()/poll() to
   keep using the MPTCP socket in case of TCP fallback. Set the
   fallback indication in case subflow is not mp_capable after
   successful 3-way handshake, instead of flipping 'is_mptcp'
   (Paolo/Mat)
 - remove deadcode in mptcp_finish_connect, and increment
   MPTCP_MIB_MPCAPABLEACTIVEFALLBACK in subflow_finish_connect
   (Paolo)

BugLink: #11
BugLink: #22
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Jun 14, 2020
keep using MPTCP sockets and a "dummy mapping" in case of fallback to
regular TCP. Skip adding DSS option on send, if TCP fallback has been
done earlier.

Notes: I'm unsure on what to do in mptcp_clean_una() to do a one-time
flush of the retransmit queue, as per Mat's suggestion. Any advice?

Changes since v1
 - rebase on top of Paolo's fix for NULL dereference in mptcp_recvmsg()

Changes since RFC v2:
 - use a bit in msk->flags, rather than a dedicated boolean in struct
   msk. This bit is going to be used in combination with another one,
   TCP_FALLBACK_ALLOWED, that is 1 at the first subflow creation
   and gets cleared once TCP fallback is no more allowed.
 - separate code that adds support for "infinite mapping", and use
   the term "dummy" instead of "infinite". Suggested by Mat
 - remove inappropriate call to __mptcp_do_fallback() in
   mptcp_accept() (Paolo)

Changes since RFC v1:
 - use a dedicated member of struct msk to indicate that a fallback
   ha happened, use it in case of infinite mapping
 - don't delete skb_ext in case of infinite mapping (Mat)
 - test the value of pm.subflows on reception of an infinite map to
   ensure that no other subflow is currently opened (Mat)
 - in mptcp_established_options(), avoid adding TCP options in case
   of fallback indication; simplify sendmsg()/recvmsg()/poll() to
   keep using the MPTCP socket in case of TCP fallback. Set the
   fallback indication in case subflow is not mp_capable after
   successful 3-way handshake, instead of flipping 'is_mptcp'
   (Paolo/Mat)
 - remove deadcode in mptcp_finish_connect, and increment
   MPTCP_MIB_MPCAPABLEACTIVEFALLBACK in subflow_finish_connect
   (Paolo)

BugLink: #11
BugLink: #22
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
matttbe pushed a commit that referenced this issue Jun 15, 2020
when a MPTCP client tries to connect to itself, tcp_finish_connect() is
never reached. Because of this, depending on the socket current state,
multiple faulty behaviours can be observed:

1) a WARN_ON() in subflow_data_ready() is hit
 WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230
 [...]
 CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187
 [...]
 RIP: 0010:subflow_data_ready+0x18b/0x230
 [...]
 Call Trace:
  tcp_data_queue+0xd2f/0x4250
  tcp_rcv_state_process+0xb1c/0x49d3
  tcp_v4_do_rcv+0x2bc/0x790
  __release_sock+0x153/0x2d0
  release_sock+0x4f/0x170
  mptcp_shutdown+0x167/0x4e0
  __sys_shutdown+0xe6/0x180
  __x64_sys_shutdown+0x50/0x70
  do_syscall_64+0x9a/0x370
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

2) client is stuck forever in mptcp_sendmsg() because the socket is not
   TCP_ESTABLISHED

 crash> bt 4847
 PID: 4847   TASK: ffff88814b2fb100  CPU: 1   COMMAND: "gh35"
  #0 [ffff8881376ff680] __schedule at ffffffff97248da4
  #1 [ffff8881376ff778] schedule at ffffffff9724a34f
  #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0
  #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba
  #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859
  #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca
  #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b
  #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa
  #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52
  #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f
 #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26
 #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba
 #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c
     RIP: 00007f126f6956ed  RSP: 00007ffc2a320278  RFLAGS: 00000217
     RAX: ffffffffffffffda  RBX: 0000000020000044  RCX: 00007f126f6956ed
     RDX: 0000000000000004  RSI: 00000000004007b8  RDI: 0000000000000003
     RBP: 00007ffc2a3202a0   R8: 0000000000400720   R9: 0000000000400720
     R10: 0000000000400720  R11: 0000000000000217  R12: 00000000004004b0
     R13: 00007ffc2a320380  R14: 0000000000000000  R15: 0000000000000000
     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

3) tcpdump captures show that DSS is exchanged even when MP_CAPABLE handshake
   didn't complete.

 $ tcpdump -tnnr bad.pcap
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0

force a fallback to TCP in these cases, and adjust the main socket
state to avoid hanging in mptcp_sendmsg().

Closes: #35
Reported-by: Christoph Paasch <cpaasch@apple.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Jun 15, 2020
keep using MPTCP sockets and a "dummy mapping" in case of fallback to
regular TCP. Skip adding DSS option on send, if TCP fallback has been
done earlier.

Notes: I'm unsure on what to do in mptcp_clean_una() to do a one-time
flush of the retransmit queue, as per Mat's suggestion. Any advice?

Changes since v1
 - rebase on top of Paolo's fix for NULL dereference in mptcp_recvmsg()

Changes since RFC v2:
 - use a bit in msk->flags, rather than a dedicated boolean in struct
   msk. This bit is going to be used in combination with another one,
   TCP_FALLBACK_ALLOWED, that is 1 at the first subflow creation
   and gets cleared once TCP fallback is no more allowed.
 - separate code that adds support for "infinite mapping", and use
   the term "dummy" instead of "infinite". Suggested by Mat
 - remove inappropriate call to __mptcp_do_fallback() in
   mptcp_accept() (Paolo)

Changes since RFC v1:
 - use a dedicated member of struct msk to indicate that a fallback
   ha happened, use it in case of infinite mapping
 - don't delete skb_ext in case of infinite mapping (Mat)
 - test the value of pm.subflows on reception of an infinite map to
   ensure that no other subflow is currently opened (Mat)
 - in mptcp_established_options(), avoid adding TCP options in case
   of fallback indication; simplify sendmsg()/recvmsg()/poll() to
   keep using the MPTCP socket in case of TCP fallback. Set the
   fallback indication in case subflow is not mp_capable after
   successful 3-way handshake, instead of flipping 'is_mptcp'
   (Paolo/Mat)
 - remove deadcode in mptcp_finish_connect, and increment
   MPTCP_MIB_MPCAPABLEACTIVEFALLBACK in subflow_finish_connect
   (Paolo)

BugLink: #11
BugLink: #22
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Jun 15, 2020
when a MPTCP client tries to connect to itself, tcp_finish_connect() is
never reached. Because of this, depending on the socket current state,
multiple faulty behaviours can be observed:

1) a WARN_ON() in subflow_data_ready() is hit
 WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230
 [...]
 CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187
 [...]
 RIP: 0010:subflow_data_ready+0x18b/0x230
 [...]
 Call Trace:
  tcp_data_queue+0xd2f/0x4250
  tcp_rcv_state_process+0xb1c/0x49d3
  tcp_v4_do_rcv+0x2bc/0x790
  __release_sock+0x153/0x2d0
  release_sock+0x4f/0x170
  mptcp_shutdown+0x167/0x4e0
  __sys_shutdown+0xe6/0x180
  __x64_sys_shutdown+0x50/0x70
  do_syscall_64+0x9a/0x370
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

2) client is stuck forever in mptcp_sendmsg() because the socket is not
   TCP_ESTABLISHED

 crash> bt 4847
 PID: 4847   TASK: ffff88814b2fb100  CPU: 1   COMMAND: "gh35"
  #0 [ffff8881376ff680] __schedule at ffffffff97248da4
  #1 [ffff8881376ff778] schedule at ffffffff9724a34f
  #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0
  #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba
  #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859
  #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca
  #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b
  #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa
  #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52
  #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f
 #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26
 #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba
 #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c
     RIP: 00007f126f6956ed  RSP: 00007ffc2a320278  RFLAGS: 00000217
     RAX: ffffffffffffffda  RBX: 0000000020000044  RCX: 00007f126f6956ed
     RDX: 0000000000000004  RSI: 00000000004007b8  RDI: 0000000000000003
     RBP: 00007ffc2a3202a0   R8: 0000000000400720   R9: 0000000000400720
     R10: 0000000000400720  R11: 0000000000000217  R12: 00000000004004b0
     R13: 00007ffc2a320380  R14: 0000000000000000  R15: 0000000000000000
     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

3) tcpdump captures show that DSS is exchanged even when MP_CAPABLE handshake
   didn't complete.

 $ tcpdump -tnnr bad.pcap
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0

force a fallback to TCP in these cases, and adjust the main socket
state to avoid hanging in mptcp_sendmsg().

Closes: #35
Reported-by: Christoph Paasch <cpaasch@apple.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Jun 16, 2020
keep using MPTCP sockets and a "dummy mapping" in case of fallback to
regular TCP. Skip adding DSS option on send, if TCP fallback has been
done earlier.

Notes: I'm unsure on what to do in mptcp_clean_una() to do a one-time
flush of the retransmit queue, as per Mat's suggestion. Any advice?

Changes since v1
 - rebase on top of Paolo's fix for NULL dereference in mptcp_recvmsg()

Changes since RFC v2:
 - use a bit in msk->flags, rather than a dedicated boolean in struct
   msk. This bit is going to be used in combination with another one,
   TCP_FALLBACK_ALLOWED, that is 1 at the first subflow creation
   and gets cleared once TCP fallback is no more allowed.
 - separate code that adds support for "infinite mapping", and use
   the term "dummy" instead of "infinite". Suggested by Mat
 - remove inappropriate call to __mptcp_do_fallback() in
   mptcp_accept() (Paolo)

Changes since RFC v1:
 - use a dedicated member of struct msk to indicate that a fallback
   ha happened, use it in case of infinite mapping
 - don't delete skb_ext in case of infinite mapping (Mat)
 - test the value of pm.subflows on reception of an infinite map to
   ensure that no other subflow is currently opened (Mat)
 - in mptcp_established_options(), avoid adding TCP options in case
   of fallback indication; simplify sendmsg()/recvmsg()/poll() to
   keep using the MPTCP socket in case of TCP fallback. Set the
   fallback indication in case subflow is not mp_capable after
   successful 3-way handshake, instead of flipping 'is_mptcp'
   (Paolo/Mat)
 - remove deadcode in mptcp_finish_connect, and increment
   MPTCP_MIB_MPCAPABLEACTIVEFALLBACK in subflow_finish_connect
   (Paolo)

BugLink: #11
BugLink: #22
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Jun 16, 2020
when a MPTCP client tries to connect to itself, tcp_finish_connect() is
never reached. Because of this, depending on the socket current state,
multiple faulty behaviours can be observed:

1) a WARN_ON() in subflow_data_ready() is hit
 WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230
 [...]
 CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187
 [...]
 RIP: 0010:subflow_data_ready+0x18b/0x230
 [...]
 Call Trace:
  tcp_data_queue+0xd2f/0x4250
  tcp_rcv_state_process+0xb1c/0x49d3
  tcp_v4_do_rcv+0x2bc/0x790
  __release_sock+0x153/0x2d0
  release_sock+0x4f/0x170
  mptcp_shutdown+0x167/0x4e0
  __sys_shutdown+0xe6/0x180
  __x64_sys_shutdown+0x50/0x70
  do_syscall_64+0x9a/0x370
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

2) client is stuck forever in mptcp_sendmsg() because the socket is not
   TCP_ESTABLISHED

 crash> bt 4847
 PID: 4847   TASK: ffff88814b2fb100  CPU: 1   COMMAND: "gh35"
  #0 [ffff8881376ff680] __schedule at ffffffff97248da4
  #1 [ffff8881376ff778] schedule at ffffffff9724a34f
  #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0
  #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba
  #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859
  #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca
  #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b
  #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa
  #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52
  #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f
 #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26
 #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba
 #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c
     RIP: 00007f126f6956ed  RSP: 00007ffc2a320278  RFLAGS: 00000217
     RAX: ffffffffffffffda  RBX: 0000000020000044  RCX: 00007f126f6956ed
     RDX: 0000000000000004  RSI: 00000000004007b8  RDI: 0000000000000003
     RBP: 00007ffc2a3202a0   R8: 0000000000400720   R9: 0000000000400720
     R10: 0000000000400720  R11: 0000000000000217  R12: 00000000004004b0
     R13: 00007ffc2a320380  R14: 0000000000000000  R15: 0000000000000000
     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

3) tcpdump captures show that DSS is exchanged even when MP_CAPABLE handshake
   didn't complete.

 $ tcpdump -tnnr bad.pcap
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0

force a fallback to TCP in these cases, and adjust the main socket
state to avoid hanging in mptcp_sendmsg().

Closes: #35
Reported-by: Christoph Paasch <cpaasch@apple.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Jun 16, 2020
Keep using MPTCP sockets and a use "dummy mapping" in case of fallback
to regular TCP. When fallback is triggered, skip addition of the MPTCP
option on send.

Closes: #11
Closes: #22
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Jun 16, 2020
when a MPTCP client tries to connect to itself, tcp_finish_connect() is
never reached. Because of this, depending on the socket current state,
multiple faulty behaviours can be observed:

1) a WARN_ON() in subflow_data_ready() is hit
 WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230
 [...]
 CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187
 [...]
 RIP: 0010:subflow_data_ready+0x18b/0x230
 [...]
 Call Trace:
  tcp_data_queue+0xd2f/0x4250
  tcp_rcv_state_process+0xb1c/0x49d3
  tcp_v4_do_rcv+0x2bc/0x790
  __release_sock+0x153/0x2d0
  release_sock+0x4f/0x170
  mptcp_shutdown+0x167/0x4e0
  __sys_shutdown+0xe6/0x180
  __x64_sys_shutdown+0x50/0x70
  do_syscall_64+0x9a/0x370
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

2) client is stuck forever in mptcp_sendmsg() because the socket is not
   TCP_ESTABLISHED

 crash> bt 4847
 PID: 4847   TASK: ffff88814b2fb100  CPU: 1   COMMAND: "gh35"
  #0 [ffff8881376ff680] __schedule at ffffffff97248da4
  #1 [ffff8881376ff778] schedule at ffffffff9724a34f
  #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0
  #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba
  #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859
  #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca
  #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b
  #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa
  #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52
  #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f
 #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26
 #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba
 #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c
     RIP: 00007f126f6956ed  RSP: 00007ffc2a320278  RFLAGS: 00000217
     RAX: ffffffffffffffda  RBX: 0000000020000044  RCX: 00007f126f6956ed
     RDX: 0000000000000004  RSI: 00000000004007b8  RDI: 0000000000000003
     RBP: 00007ffc2a3202a0   R8: 0000000000400720   R9: 0000000000400720
     R10: 0000000000400720  R11: 0000000000000217  R12: 00000000004004b0
     R13: 00007ffc2a320380  R14: 0000000000000000  R15: 0000000000000000
     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

3) tcpdump captures show that DSS is exchanged even when MP_CAPABLE handshake
   didn't complete.

 $ tcpdump -tnnr bad.pcap
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0

force a fallback to TCP in these cases, and adjust the main socket
state to avoid hanging in mptcp_sendmsg().

Closes: #35
Reported-by: Christoph Paasch <cpaasch@apple.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Jun 18, 2020
Keep using MPTCP sockets and a use "dummy mapping" in case of fallback
to regular TCP. When fallback is triggered, skip addition of the MPTCP
option on send.

Closes: #11
Closes: #22
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Jun 18, 2020
when a MPTCP client tries to connect to itself, tcp_finish_connect() is
never reached. Because of this, depending on the socket current state,
multiple faulty behaviours can be observed:

1) a WARN_ON() in subflow_data_ready() is hit
 WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230
 [...]
 CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187
 [...]
 RIP: 0010:subflow_data_ready+0x18b/0x230
 [...]
 Call Trace:
  tcp_data_queue+0xd2f/0x4250
  tcp_rcv_state_process+0xb1c/0x49d3
  tcp_v4_do_rcv+0x2bc/0x790
  __release_sock+0x153/0x2d0
  release_sock+0x4f/0x170
  mptcp_shutdown+0x167/0x4e0
  __sys_shutdown+0xe6/0x180
  __x64_sys_shutdown+0x50/0x70
  do_syscall_64+0x9a/0x370
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

2) client is stuck forever in mptcp_sendmsg() because the socket is not
   TCP_ESTABLISHED

 crash> bt 4847
 PID: 4847   TASK: ffff88814b2fb100  CPU: 1   COMMAND: "gh35"
  #0 [ffff8881376ff680] __schedule at ffffffff97248da4
  #1 [ffff8881376ff778] schedule at ffffffff9724a34f
  #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0
  #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba
  #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859
  #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca
  #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b
  #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa
  #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52
  #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f
 #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26
 #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba
 #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c
     RIP: 00007f126f6956ed  RSP: 00007ffc2a320278  RFLAGS: 00000217
     RAX: ffffffffffffffda  RBX: 0000000020000044  RCX: 00007f126f6956ed
     RDX: 0000000000000004  RSI: 00000000004007b8  RDI: 0000000000000003
     RBP: 00007ffc2a3202a0   R8: 0000000000400720   R9: 0000000000400720
     R10: 0000000000400720  R11: 0000000000000217  R12: 00000000004004b0
     R13: 00007ffc2a320380  R14: 0000000000000000  R15: 0000000000000000
     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

3) tcpdump captures show that DSS is exchanged even when MP_CAPABLE handshake
   didn't complete.

 $ tcpdump -tnnr bad.pcap
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0

force a fallback to TCP in these cases, and adjust the main socket
state to avoid hanging in mptcp_sendmsg().

Closes: #35
Reported-by: Christoph Paasch <cpaasch@apple.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Jun 19, 2020
Keep using MPTCP sockets and a use "dummy mapping" in case of fallback
to regular TCP. When fallback is triggered, skip addition of the MPTCP
option on send.

Closes: #11
Closes: #22
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Jun 19, 2020
when a MPTCP client tries to connect to itself, tcp_finish_connect() is
never reached. Because of this, depending on the socket current state,
multiple faulty behaviours can be observed:

1) a WARN_ON() in subflow_data_ready() is hit
 WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230
 [...]
 CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187
 [...]
 RIP: 0010:subflow_data_ready+0x18b/0x230
 [...]
 Call Trace:
  tcp_data_queue+0xd2f/0x4250
  tcp_rcv_state_process+0xb1c/0x49d3
  tcp_v4_do_rcv+0x2bc/0x790
  __release_sock+0x153/0x2d0
  release_sock+0x4f/0x170
  mptcp_shutdown+0x167/0x4e0
  __sys_shutdown+0xe6/0x180
  __x64_sys_shutdown+0x50/0x70
  do_syscall_64+0x9a/0x370
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

2) client is stuck forever in mptcp_sendmsg() because the socket is not
   TCP_ESTABLISHED

 crash> bt 4847
 PID: 4847   TASK: ffff88814b2fb100  CPU: 1   COMMAND: "gh35"
  #0 [ffff8881376ff680] __schedule at ffffffff97248da4
  #1 [ffff8881376ff778] schedule at ffffffff9724a34f
  #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0
  #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba
  #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859
  #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca
  #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b
  #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa
  #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52
  #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f
 #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26
 #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba
 #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c
     RIP: 00007f126f6956ed  RSP: 00007ffc2a320278  RFLAGS: 00000217
     RAX: ffffffffffffffda  RBX: 0000000020000044  RCX: 00007f126f6956ed
     RDX: 0000000000000004  RSI: 00000000004007b8  RDI: 0000000000000003
     RBP: 00007ffc2a3202a0   R8: 0000000000400720   R9: 0000000000400720
     R10: 0000000000400720  R11: 0000000000000217  R12: 00000000004004b0
     R13: 00007ffc2a320380  R14: 0000000000000000  R15: 0000000000000000
     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

3) tcpdump captures show that DSS is exchanged even when MP_CAPABLE handshake
   didn't complete.

 $ tcpdump -tnnr bad.pcap
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0

force a fallback to TCP in these cases, and adjust the main socket
state to avoid hanging in mptcp_sendmsg().

Closes: #35
Reported-by: Christoph Paasch <cpaasch@apple.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Jun 19, 2020
Keep using MPTCP sockets and a use "dummy mapping" in case of fallback
to regular TCP. When fallback is triggered, skip addition of the MPTCP
option on send.

Closes: #11
Closes: #22
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Jun 19, 2020
when a MPTCP client tries to connect to itself, tcp_finish_connect() is
never reached. Because of this, depending on the socket current state,
multiple faulty behaviours can be observed:

1) a WARN_ON() in subflow_data_ready() is hit
 WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230
 [...]
 CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187
 [...]
 RIP: 0010:subflow_data_ready+0x18b/0x230
 [...]
 Call Trace:
  tcp_data_queue+0xd2f/0x4250
  tcp_rcv_state_process+0xb1c/0x49d3
  tcp_v4_do_rcv+0x2bc/0x790
  __release_sock+0x153/0x2d0
  release_sock+0x4f/0x170
  mptcp_shutdown+0x167/0x4e0
  __sys_shutdown+0xe6/0x180
  __x64_sys_shutdown+0x50/0x70
  do_syscall_64+0x9a/0x370
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

2) client is stuck forever in mptcp_sendmsg() because the socket is not
   TCP_ESTABLISHED

 crash> bt 4847
 PID: 4847   TASK: ffff88814b2fb100  CPU: 1   COMMAND: "gh35"
  #0 [ffff8881376ff680] __schedule at ffffffff97248da4
  #1 [ffff8881376ff778] schedule at ffffffff9724a34f
  #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0
  #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba
  #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859
  #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca
  #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b
  #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa
  #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52
  #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f
 #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26
 #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba
 #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c
     RIP: 00007f126f6956ed  RSP: 00007ffc2a320278  RFLAGS: 00000217
     RAX: ffffffffffffffda  RBX: 0000000020000044  RCX: 00007f126f6956ed
     RDX: 0000000000000004  RSI: 00000000004007b8  RDI: 0000000000000003
     RBP: 00007ffc2a3202a0   R8: 0000000000400720   R9: 0000000000400720
     R10: 0000000000400720  R11: 0000000000000217  R12: 00000000004004b0
     R13: 00007ffc2a320380  R14: 0000000000000000  R15: 0000000000000000
     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

3) tcpdump captures show that DSS is exchanged even when MP_CAPABLE handshake
   didn't complete.

 $ tcpdump -tnnr bad.pcap
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0

force a fallback to TCP in these cases, and adjust the main socket
state to avoid hanging in mptcp_sendmsg().

Closes: #35
Reported-by: Christoph Paasch <cpaasch@apple.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Oct 13, 2023
Couple of error paths in do_core_test() was returning directly without
doing a necessary cpus_read_unlock().

Following lockdep warning was observed when exercising these scenarios
with PROVE_RAW_LOCK_NESTING enabled:

[  139.304775] ================================================
[  139.311185] WARNING: lock held when returning to user space!
[  139.317593] 6.6.0-rc2ifs01+ #11 Tainted: G S      W I
[  139.324499] ------------------------------------------------
[  139.330908] bash/11476 is leaving the kernel with locks still held!
[  139.338000] 1 lock held by bash/11476:
[  139.342262]  #0: ffffffffaa26c930 (cpu_hotplug_lock){++++}-{0:0}, at:
do_core_test+0x35/0x1c0 [intel_ifs]

Fix the flow so that all scenarios release the lock prior to returning
from the function.

Fixes: 5210fb4 ("platform/x86/intel/ifs: Sysfs interface for Array BIST")
Cc: stable@vger.kernel.org
Signed-off-by: Jithu Joseph <jithu.joseph@intel.com>
Link: https://lore.kernel.org/r/20230927184824.2566086-1-jithu.joseph@intel.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
jenkins-tessares pushed a commit that referenced this issue Oct 17, 2023
Amit Cohen says:

====================
Extend VXLAN driver to support FDB flushing

The merge commit 9271686 ("Merge branch 'br-flush-filtering'") added
support for FDB flushing in bridge driver. Extend VXLAN driver to support
FDB flushing also. Add support for filtering by fields which are relevant
for VXLAN FDBs:
* Source VNI
* Nexthop ID
* 'router' flag
* Destination VNI
* Destination Port
* Destination IP

Without this set, flush for VXLAN device fails:
$ bridge fdb flush dev vx10
RTNETLINK answers: Operation not supported

With this set, such flush works with the relevant arguments, for example:
$ bridge fdb flush dev vx10 vni 5000 dst 193.2.2.1
< flush all vx10 entries with VNI 5000 and destination IP 193.2.2.1>

Some preparations are required, handle them before adding flushing support
in VXLAN driver. See more details in commit messages.

Patch set overview:
Patch #1 prepares flush policy to be used by VXLAN driver
Patches #2-#3 are preparations in VXLAN driver
Patch #4 adds an initial support for flushing in VXLAN driver
Patches #5-#9 add support for filtering by several attributes
Patch #10 adds a test for FDB flush with VXLAN
Patch #11 extends the test to check FDB flush with bridge
====================

Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
jenkins-tessares pushed a commit that referenced this issue Oct 24, 2023
Petr Machata says:

====================
mlxsw: Move allocation of LAG table to the driver

PGT is an in-HW table that maps addresses to sets of ports. Then when some
HW process needs a set of ports as an argument, instead of embedding the
actual set in the dynamic configuration, what gets configured is the
address referencing the set. The HW then works with the appropriate PGT
entry.

Within the PGT is placed a LAG table. That is a contiguous block of PGT
memory where each entry describes which ports are members of the
corresponding LAG port.

The PGT is split to two parts: one managed by the FW, and one managed by
the driver. Historically, the FW part included also the LAG table, referred
to as FW LAG mode. Giving the responsibility for placement of the LAG table
to the driver, referred to as SW LAG mode, makes the whole system more
flexible. The FW currently supports both FW and SW LAG modes. To shed
complexity, the FW should in the future only support SW LAG mode.

Hence this patchset, where support for placement of LAG is added to mlxsw.

There are FW versions out there that do not support SW LAG mode, and on
Spectrum-1 in particular, there is no plan to support it at all. mlxsw will
therefore have to support both modes of operation.

Another aspect is that at least on Spectrum-1, there are FW versions out
there that claim to support driver-placed LAG table, but then reject or
ignore configurations enabling the same. The driver thus has to have a say
in whether an attempt to configure SW LAG mode should even be done.

The feature is therefore expressed in terms of "does the driver prefer SW
LAG mode?", and "what LAG mode the PCI module managed to configure the FW
with". This is unlike current flood mode configuration, where the driver
can give a strict value, and that's what gets configured. But it gives a
chance to the driver to determine whether LAG mode should be enabled at
all.

The "does the driver prefer SW LAG mode?" bit is expressed as a boolean
lag_mode_prefer_sw. The reason for this is largely another feature that
will be introduced in a follow-up patchset: support for CFF flood mode. The
driver currently requires that the FW be configured with what is called
controlled flood mode. But on capable systems, CFF would be preferred. So
there are two values in flight: the preferred flood mode, and the fallback.
This could be expressed with an array of flood modes ordered by preference,
but that looks like an overkill in comparison. This flag/value model is
then reused for LAG mode as well, except the fallback value is absent and
implied to be FW, because there are no other values to choose from.

The patchset progresses as follows:

- Patches #1 to #5 adjust reg.h and cmd.h with new register fields,
  constants and remarks.

- Patches #6 and #7 add the ability to request SW LAG mode and to query the
  LAG mode that was actually negotiated. This is where the abovementioned
  lag_mode_prefer_sw flag is added.

- Patches #7 to #9 generalize PGT allocations to make it possible to
  allocate the LAG table, which is done in patch #10.

- In patch #11, toggle lag_mode_prefer_sw on Spectrum-2 and above, which
  makes the newly-added code live.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
matttbe pushed a commit that referenced this issue Oct 31, 2023
Ido Schimmel says:

====================
Add MDB get support

This patchset adds MDB get support, allowing user space to request a
single MDB entry to be retrieved instead of dumping the entire MDB.
Support is added in both the bridge and VXLAN drivers.

Patches #1-#6 are small preparations in both drivers.

Patches #7-#8 add the required uAPI attributes for the new functionality
and the MDB get net device operation (NDO), respectively.

Patches #9-#10 implement the MDB get NDO in both drivers.

Patch #11 registers a handler for RTM_GETMDB messages in rtnetlink core.
The handler derives the net device from the ifindex specified in the
ancillary header and invokes its MDB get NDO.

Patches #12-#13 add selftests by converting tests that use MDB dump with
grep to the new MDB get functionality.

iproute2 changes can be found here [1].

v2:
* Patch #7: Add a comment to describe attributes structure.
* Patch #9: Add a comment above spin_lock_bh().

[1] https://github.com/idosch/iproute2/tree/submit/mdb_get_v1
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
jenkins-tessares pushed a commit that referenced this issue Nov 21, 2023
Petr Machata says:

====================
mlxsw: Add support for new reset flow

Ido Schimmel writes:

This patchset changes mlxsw to issue a PCI reset during probe and
devlink reload so that the PCI firmware could be upgraded without a
reboot.

Unlike the old version of this patchset [1], in this version the driver
no longer tries to issue a PCI reset by triggering a PCI link toggle on
its own, but instead calls the PCI core to issue the reset.

The PCI APIs require the device lock to be held which is why patches

Patches #7 adds reset method quirk for NVIDIA Spectrum devices.

Patch #8 adds a debug level print in PCI core so that device ready delay
will be printed even if it is shorter than one second.

Patches #9-#11 are straightforward preparations in mlxsw.

Patch #12 finally implements the new reset flow in mlxsw.

Patch #13 adds PCI reset handlers in mlxsw to avoid user space from
resetting the device from underneath an unaware driver. Instead, the
driver is gracefully de-initialized before the PCI reset and then
initialized again after it.

Patch #14 adds a PCI reset selftest to make sure this code path does not
regress.

[1] https://lore.kernel.org/netdev/cover.1679502371.git.petrm@nvidia.com/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
matttbe pushed a commit that referenced this issue Nov 24, 2023
Andrii Nakryiko says:

====================
BPF register bounds range vs range support

This patch set is a continuation of work started in [0]. It adds a big set of
manual, auto-generated, and now also random test cases validating BPF
verifier's register bounds tracking and deduction logic.

First few patches generalize verifier's logic to handle conditional jumps and
corresponding range adjustments in case when two non-const registers are
compared to each other. Patch #1 generalizes reg_set_min_max() portion, while
patch #2 does the same for is_branch_taken() part of the overall solution.

Patch #3 improves equality and inequality for cases when BPF program code
mixes 64-bit and 32-bit uses of the same register. Depending on specific
sequence, it's possible to get to the point where u64/s64 bounds will be very
generic (e.g., after signed 32-bit comparison), while we still keep pretty
tight u32/s32 bounds. If in such state we proceed with 32-bit equality or
inequality comparison, reg_set_min_max() might have to deal with adjusting s32
bounds for two registers that don't overlap, which breaks reg_set_min_max().
This doesn't manifest in <range> vs <const> cases, because if that happens
reg_set_min_max() in effect will force s32 bounds to be a new "impossible"
constant (from original smin32/smax32 bounds point of view). Things get tricky
when we have <range> vs <range> adjustments, so instead of trying to somehow
make sense out of such situations, it's best to detect such impossible
situations and prune the branch that can't be taken in is_branch_taken()
logic.  This equality/inequality was the only such category of situations with
auto-generated tests added later in the patch set.

But when we start mixing arithmetic operations in different numeric domains
and conditionals, things get even hairier. So, patch #4 adds sanity checking
logic after all ALU/ALU64, JMP/JMP32, and LDX operations. By default, instead
of failing verification, we conservatively reset range bounds to unknown
values, reporting violation in verifier log (if verbose logs are requested).
But to aid development, detection, and debugging, we also introduce a new test
flag, BPF_F_TEST_SANITY_STRICT, which triggers verification failure on range
sanity violation.

Patch #11 sets BPF_F_TEST_SANITY_STRICT by default for test_progs and
test_verifier. Patch #12 adds support for controlling this in veristat for
testing with production BPF object files.

Getting back to BPF verifier, patches #5 and #6 complete verifier's range
tracking logic clean up. See respective patches for details.

With kernel-side taken care of, we move to testing. We start with building
a tester that validates existing <range> vs <scalar> verifier logic for range
bounds. Patch #7 implements an initial version of such a tester. We guard
millions of generated tests behind SLOW_TESTS=1 envvar requirement, but also
have a relatively small number of tricky cases that came up during development
and debugging of this work. Those will be executed as part of a normal
test_progs run.

Patch #8 simulates more nuanced JEQ/JNE logic we added to verifier in patch #3.
Patch #9 adds <range> vs <range> "slow tests".

Patch #10 is a completely new one, it adds a bunch of randomly generated cases
to be run normally, without SLOW_TESTS=1 guard. This should help to get
a bunch of cover, and hopefully find some remaining latent problems if
verifier proactively as part of normal BPF CI runs.

Finally, a tiny test which was, amazingly, an initial motivation for this
whole work, is added in lucky patch #13, demonstrating how verifier is now
smart enough to track actual number of elements in the array and won't require
additional checks on loop iteration variable inside the bpf_for() open-coded
iterator loop.

  [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=798308&state=*

v1->v2:
  - use x < y => y > x property to minimize reg_set_min_max (Eduard);
  - fix for JEQ/JNE logic in reg_bounds.c (Eduard);
  - split BPF_JSET and !BPF_JSET cases handling (Shung-Hsi);
  - adjustments to reg_bounds.c to make it easier to follow (Alexei);
  - added acks (Eduard, Shung-Hsi).
====================

Link: https://lore.kernel.org/r/20231112010609.848406-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
matttbe pushed a commit that referenced this issue Nov 24, 2023
…f-times'

Eduard Zingerman says:

====================
verify callbacks as if they are called unknown number of times

This series updates verifier logic for callback functions handling.
Current master simulates callback body execution exactly once,
which leads to verifier not detecting unsafe programs like below:

    static int unsafe_on_zero_iter_cb(__u32 idx, struct num_context *ctx)
    {
        ctx->i = 0;
        return 0;
    }

    SEC("?raw_tp")
    int unsafe_on_zero_iter(void *unused)
    {
        struct num_context loop_ctx = { .i = 32 };
        __u8 choice_arr[2] = { 0, 1 };

        bpf_loop(100, unsafe_on_zero_iter_cb, &loop_ctx, 0);
        return choice_arr[loop_ctx.i];
    }

This was reported previously in [0].
The basic idea of the fix is to schedule callback entry state for
verification in env->head until some identical, previously visited
state in current DFS state traversal is found. Same logic as with open
coded iterators, and builds on top recent fixes [1] for those.

The series is structured as follows:
- patches #1,2,3 update strobemeta, xdp_synproxy selftests and
  bpf_loop_bench benchmark to allow convergence of the bpf_loop
  callback states;
- patches #4,5 just shuffle the code a bit;
- patch #6 is the main part of the series;
- patch #7 adds test cases for #6;
- patch #8 extend patch #6 with same speculative scalar widening
  logic, as used for open coded iterators;
- patch #9 adds test cases for #8;
- patch #10 extends patch #6 to track maximal number of callback
  executions specifically for bpf_loop();
- patch #11 adds test cases for #10.

Veristat results comparing this series to master+patches #1,2,3 using selftests
show the following difference:

File                       Program        States (A)  States (B)  States (DIFF)
-------------------------  -------------  ----------  ----------  -------------
bpf_loop_bench.bpf.o       benchmark               1           2  +1 (+100.00%)
pyperf600_bpf_loop.bpf.o   on_event              322         407  +85 (+26.40%)
strobemeta_bpf_loop.bpf.o  on_event              113         151  +38 (+33.63%)
xdp_synproxy_kern.bpf.o    syncookie_tc          341         291  -50 (-14.66%)
xdp_synproxy_kern.bpf.o    syncookie_xdp         344         301  -43 (-12.50%)

Veristat results comparing this series to master using Tetragon BPF
files [2] also show some differences.
States diff varies from +2% to +15% on 23 programs out of 186,
no new failures.

Changelog:
- V3 [5] -> V4, changes suggested by Andrii:
  - validate mark_chain_precision() result in patch #10;
  - renaming s/cumulative_callback_depth/callback_unroll_depth/.
- V2 [4] -> V3:
  - fixes in expected log messages for test cases:
    - callback_result_precise;
    - parent_callee_saved_reg_precise_with_callback;
    - parent_stack_slot_precise_with_callback;
  - renamings (suggested by Alexei):
    - s/callback_iter_depth/cumulative_callback_depth/
    - s/is_callback_iter_next/calls_callback/
    - s/mark_callback_iter_next/mark_calls_callback/
  - prepare_func_exit() updated to exit with -EFAULT when
    callee->in_callback_fn is true but calls_callback() is not true
    for callsite;
  - test case 'bpf_loop_iter_limit_nested' rewritten to use return
    value check instead of verifier log message checks
    (suggested by Alexei).
- V1 [3] -> V2, changes suggested by Andrii:
  - small changes for error handling code in __check_func_call();
  - callback body processing log is now matched in relevant
    verifier_subprog_precision.c tests;
  - R1 passed to bpf_loop() is now always marked as precise;
  - log level 2 message for bpf_loop() iteration termination instead of
    iteration depth messages;
  - __no_msg macro removed;
  - bpf_loop_iter_limit_nested updated to avoid using __no_msg;
  - commit message for patch #3 updated according to Alexei's request.

[0] https://lore.kernel.org/bpf/CA+vRuzPChFNXmouzGG+wsy=6eMcfr1mFG0F3g7rbg-sedGKW3w@mail.gmail.com/
[1] https://lore.kernel.org/bpf/20231024000917.12153-1-eddyz87@gmail.com/
[2] git@github.com:cilium/tetragon.git
[3] https://lore.kernel.org/bpf/20231116021803.9982-1-eddyz87@gmail.com/T/#t
[4] https://lore.kernel.org/bpf/20231118013355.7943-1-eddyz87@gmail.com/T/#t
[5] https://lore.kernel.org/bpf/20231120225945.11741-1-eddyz87@gmail.com/T/#t
====================

Link: https://lore.kernel.org/r/20231121020701.26440-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
matttbe pushed a commit that referenced this issue Nov 24, 2023
Petr Machata says:

====================
mlxsw: Preparations for support of CFF flood mode

PGT is an in-HW table that maps addresses to sets of ports. Then when some
HW process needs a set of ports as an argument, instead of embedding the
actual set in the dynamic configuration, what gets configured is the
address referencing the set. The HW then works with the appropriate PGT
entry.

Among other allocations, the PGT currently contains two large blocks for
bridge flooding: one for 802.1q and one for 802.1d. Within each of these
blocks are three tables, for unknown-unicast, multicast and broadcast
flooding:

      . . . |    802.1q    |    802.1d    | . . .
            | UC | MC | BC | UC | MC | BC |
             \______ _____/ \_____ ______/
                    v             v
                   FID flood vectors

Thus each FID (which corresponds to an 802.1d bridge or one VLAN in an
802.1q bridge) uses three flood vectors spread across a fairly large region
of PGT.

This way of organizing the flood table (called "controlled") is not very
flexible. E.g. to decrease a bridge scale and store more IP MC vectors, one
would need to completely rewrite the bridge PGT blocks, or resort to hacks
such as storing individual MC flood vectors into unused part of the bridge
table.

In order to address these shortcomings, Spectrum-2 and above support what
is called CFF flood mode, for Compressed FID Flooding. In CFF flood mode,
each FID has a little table of its own, with three entries adjacent to each
other, one for unknown-UC, one for MC, one for BC. This allows for a much
more fine-grained approach to PGT management, where bits of it are
allocated on demand.

      . . . | FID | FID | FID | FID | FID | . . .
            |U|M|B|U|M|B|U|M|B|U|M|B|U|M|B|
             \_____________ _____________/
                           v
                   FID flood vectors

Besides the FID table organization, the CFF flood mode also impacts Router
Subport (RSP) table. This table contains flood vectors for rFIDs, which are
FIDs that reference front panel ports or LAGs. The RSP table contains two
entries per front panel port and LAG, one for unknown-UC traffic, and one
for everything else. Currently, the FW allocates and manages the table in
its own part of PGT. rFIDs are marked with flood_rsp bit and managed
specially. In CFF mode, rFIDs are managed as all other FIDs. The driver
therefore has to allocate and maintain the flood vectors. Like with bridge
FIDs, this is more work, but increases flexibility of the system.

The FW currently supports both the controlled and CFF flood modes. To shed
complexity, in the future it should only support CFF flood mode. Hence this
patchset, which is the first in series of two to add CFF flood mode support
to mlxsw.

There are FW versions out there that do not support CFF flood mode, and on
Spectrum-1 in particular, there is no plan to support it at all. mlxsw will
therefore have to support both controlled flood mode as well as CFF.

Another aspect is that at least on Spectrum-1, there are FW versions out
there that claim to support CFF flood mode, but then reject or ignore
configurations enabling the same. The driver thus has to have a say in
whether an attempt to configure CFF flood mode should even be made.

Much like with the LAG mode, the feature is therefore expressed in terms of
"does the driver prefer CFF flood mode?", and "what flood mode the PCI
module managed to configure the FW with". This gives to the driver a chance
to determine whether CFF flood mode configuration should be attempted.

In this patchset, we lay the ground with new definitions, registers and
their fields, and some minor code shaping. The next patchset will be more
focused on introducing necessary abstractions and implementation.

- Patches #1 and #2 add CFF-related items to the command interface.

- Patch #3 adds a new resource, for maximum number of flood profiles
  supported. (A flood profile is a mapping between traffic type and offset
  in the per-FID flood vector table.)

- Patches #4 to #8 adjust reg.h. The SFFP register is added, which is used
  for configuring the abovementioned traffic-type-to-offset mapping. The
  SFMR, register, which serves for FID configuration, is extended with
  fields specific to CFF mode. And other minor adjustments.

- Patches #9 and #10 add the plumbing for CFF mode: a way to request that
  CFF flood mode be configured, and a way to query the flood mode that was
  actually configured.

- Patch #11 removes dead code.

- Patches #12 and #13 add helpers that the next patchset will make use of.
  Patch #14 moves RIF setup ahead so that FID code can make use of it.
====================

Link: https://lore.kernel.org/r/cover.1700503643.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
matttbe pushed a commit that referenced this issue Dec 1, 2023
[ 8743.393379] ======================================================
[ 8743.393385] WARNING: possible circular locking dependency detected
[ 8743.393391] 6.4.0-rc1+ #11 Tainted: G           OE
[ 8743.393397] ------------------------------------------------------
[ 8743.393402] kworker/0:2/12921 is trying to acquire lock:
[ 8743.393408] ffff888127a14460 (sb_writers#8){.+.+}-{0:0}, at: ksmbd_vfs_setxattr+0x3d/0xd0 [ksmbd]
[ 8743.393510]
               but task is already holding lock:
[ 8743.393515] ffff8880360d97f0 (&type->i_mutex_dir_key#6/1){+.+.}-{3:3}, at: ksmbd_vfs_kern_path_locked+0x181/0x670 [ksmbd]
[ 8743.393618]
               which lock already depends on the new lock.

[ 8743.393623]
               the existing dependency chain (in reverse order) is:
[ 8743.393628]
               -> #1 (&type->i_mutex_dir_key#6/1){+.+.}-{3:3}:
[ 8743.393648]        down_write_nested+0x9a/0x1b0
[ 8743.393660]        filename_create+0x128/0x270
[ 8743.393670]        do_mkdirat+0xab/0x1f0
[ 8743.393680]        __x64_sys_mkdir+0x47/0x60
[ 8743.393690]        do_syscall_64+0x5d/0x90
[ 8743.393701]        entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 8743.393711]
               -> #0 (sb_writers#8){.+.+}-{0:0}:
[ 8743.393728]        __lock_acquire+0x2201/0x3b80
[ 8743.393737]        lock_acquire+0x18f/0x440
[ 8743.393746]        mnt_want_write+0x5f/0x240
[ 8743.393755]        ksmbd_vfs_setxattr+0x3d/0xd0 [ksmbd]
[ 8743.393839]        ksmbd_vfs_set_dos_attrib_xattr+0xcc/0x110 [ksmbd]
[ 8743.393924]        compat_ksmbd_vfs_set_dos_attrib_xattr+0x39/0x50 [ksmbd]
[ 8743.394010]        smb2_open+0x3432/0x3cc0 [ksmbd]
[ 8743.394099]        handle_ksmbd_work+0x2c9/0x7b0 [ksmbd]
[ 8743.394187]        process_one_work+0x65a/0xb30
[ 8743.394198]        worker_thread+0x2cf/0x700
[ 8743.394209]        kthread+0x1ad/0x1f0
[ 8743.394218]        ret_from_fork+0x29/0x50

This patch add mnt_want_write() above parent inode lock and remove
nested mnt_want_write calls in smb2_open().

Fixes: 40b268d ("ksmbd: add mnt_want_write to ksmbd vfs functions")
Cc: stable@vger.kernel.org
Reported-by: Marios Makassikis <mmakassikis@freebox.fr>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
matttbe pushed a commit that referenced this issue Dec 1, 2023
Petr Machata says:

====================
mlxsw: Support CFF flood mode

The registers to configure to initialize a flood table differ between the
controlled and CFF flood modes. In therefore needs to be an op. Add it,
hook up the current init to the existing families, and invoke the op.

PGT is an in-HW table that maps addresses to sets of ports. Then when some
HW process needs a set of ports as an argument, instead of embedding the
actual set in the dynamic configuration, what gets configured is the
address referencing the set. The HW then works with the appropriate PGT
entry.

Among other allocations, the PGT currently contains two large blocks for
bridge flooding: one for 802.1q and one for 802.1d. Within each of these
blocks are three tables, for unknown-unicast, multicast and broadcast
flooding:

      . . . |    802.1q    |    802.1d    | . . .
            | UC | MC | BC | UC | MC | BC |
             \______ _____/ \_____ ______/
                    v             v
                   FID flood vectors

Thus each FID (which corresponds to an 802.1d bridge or one VLAN in an
802.1q bridge) uses three flood vectors spread across a fairly large region
of PGT.

This way of organizing the flood table (called "controlled") is not very
flexible. E.g. to decrease a bridge scale and store more IP MC vectors, one
would need to completely rewrite the bridge PGT blocks, or resort to hacks
such as storing individual MC flood vectors into unused part of the bridge
table.

In order to address these shortcomings, Spectrum-2 and above support what
is called CFF flood mode, for Compressed FID Flooding. In CFF flood mode,
each FID has a little table of its own, with three entries adjacent to each
other, one for unknown-UC, one for MC, one for BC. This allows for a much
more fine-grained approach to PGT management, where bits of it are
allocated on demand.

      . . . | FID | FID | FID | FID | FID | . . .
            |U|M|B|U|M|B|U|M|B|U|M|B|U|M|B|
             \_____________ _____________/
                           v
                   FID flood vectors

Besides the FID table organization, the CFF flood mode also impacts Router
Subport (RSP) table. This table contains flood vectors for rFIDs, which are
FIDs that reference front panel ports or LAGs. The RSP table contains two
entries per front panel port and LAG, one for unknown-UC traffic, and one
for everything else. Currently, the FW allocates and manages the table in
its own part of PGT. rFIDs are marked with flood_rsp bit and managed
specially. In CFF mode, rFIDs are managed as all other FIDs. The driver
therefore has to allocate and maintain the flood vectors. Like with bridge
FIDs, this is more work, but increases flexibility of the system.

The FW currently supports both the controlled and CFF flood modes. To shed
complexity, in the future it should only support CFF flood mode. Hence this
patchset, which adds CFF flood mode support to mlxsw.

Since mlxsw needs to maintain both the controlled mode as well as CFF mode
support, we will keep the layout as compatible as possible. The bridge
tables will stay in the same overall shape, just their inner organization
will change from flood mode -> FID to FID -> flood mode. Likewise will RSP
be kept as a contiguous block of PGT memory, as was the case when the FW
maintained it.

- The way FIDs get configured under the CFF flood mode differs from the
  currently used controlled mode. The simple approach of having several
  globally visible arrays for spectrum.c to statically choose from no
  longer works.

  Patch #1 thus privatizes all FID initialization and finalization logic,
  and exposes it as ops instead.

- Patch #2 renames the ops that are specific to the controlled mode, to
  make room in the namespace for the CFF variants.

  Patch #3 extracts a helper to compute flood table base out of
  mlxsw_sp_fid_flood_table_mid().

- The op fid_setup configured fid_offset, i.e. the number of this FID
  within its family. For rFIDs in CFF mode, to determine this number, the
  driver will need to do fallible queries.

  Thus in patch #4, make the FID setup operation fallible as well.

- Flood mode initialization routine differs between the controlled and CFF
  flood modes. The controlled mode needs to configure flood table layout,
  which the CFF mode does not need to do.

  In patch #5, move mlxsw_sp_fid_flood_table_init() up so that the
  following patch can make use of it.

  In patch #6, add an op to be invoked per table (if defined).

- The current way of determining PGT allocation size depends on the number
  of FIDs and number of flood tables. RFIDs however have PGT footprint
  depending not on number of FIDs, but on number of ports and LAGs, because
  which ports an rFID should flood to does not depend on the FID itself,
  but on the port or LAG that it references.

  Therefore in patch #7, add FID family ops for determining PGT allocation
  size.

- As elaborated above, layout of PGT will differ between controlled and CFF
  flood modes. In CFF mode, it will further differ between rFIDs and other
  FIDs (as described at previous patch). The way to pack the SFMR register
  to configure a FID will likewise differ from controlled to CFF.

  Thus in patches #8 and #9 add FID family ops to determine PGT base
  address for a FID and to pack SFMR.

- Patches #10 and #11 add more bits for RSP support. In patch #10, add a
  new traffic type enumerator, for non-UC traffic. This is a combination of
  BC and MC traffic, but the way that mlxsw maps these mnemonic names to
  actual traffic type configurations requires that we have a new name to
  describe this class of traffic.

  Patch #11 then adds hooks necessary for RSP table maintenance. As ports
  come and go, and join and leave LAGs, it is necessary to update flood
  vectors that the rFIDs use. These new hooks will make that possible.

- Patches #12, #13 and #14 introduce flood profiles. These have been
  implicit so far, but the way that CFF flood mode works with profile IDs
  requires that we make them explicit.

  Thus in patch #12, introduce flood profile objects as a set of flood
  tables that FID families then refer to. The FID code currently only
  uses a single flood profile.

  In patch #13, add a flood profile ID to flood profile objects.

  In patch #14, when in CFF mode, configure SFFP according to the existing
  flood profiles (or the one that exists as of that point).

- Patches #15 and #16 add code to implement, respectively, bridge FIDs and
  RSP FIDs in CFF mode.

- In patch #17, toggle flood_mode_prefer_cff on Spectrum-2 and above, which
  makes the newly-added code live.
====================

Link: https://lore.kernel.org/r/cover.1701183891.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cpaasch pushed a commit that referenced this issue Dec 4, 2023
[ Upstream commit a84fbf2 ]

Generating metrics llc_code_read_mpi_demand_plus_prefetch,
llc_data_read_mpi_demand_plus_prefetch,
llc_miss_local_memory_bandwidth_read,
llc_miss_local_memory_bandwidth_write,
nllc_miss_remote_memory_bandwidth_read, memory_bandwidth_read,
memory_bandwidth_write, uncore_frequency, upi_data_transmit_bw,
C2_Pkg_Residency, C3_Core_Residency, C3_Pkg_Residency,
C6_Core_Residency, C6_Pkg_Residency, C7_Core_Residency,
C7_Pkg_Residency, UNCORE_FREQ and tma_info_system_socket_clks would
trigger an address sanitizer heap-buffer-overflows on a SkylakeX.

```
==2567752==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x5020003ed098 at pc 0x5621a816654e bp 0x7fffb55d4da0 sp 0x7fffb55d4d98
READ of size 4 at 0x5020003eee78 thread T0
    #0 0x558265d6654d in aggr_cpu_id__is_empty tools/perf/util/cpumap.c:694:12
    #1 0x558265c914da in perf_stat__get_aggr tools/perf/builtin-stat.c:1490:6
    #2 0x558265c914da in perf_stat__get_global_cached tools/perf/builtin-stat.c:1530:9
    #3 0x558265e53290 in should_skip_zero_counter tools/perf/util/stat-display.c:947:31
    #4 0x558265e53290 in print_counter_aggrdata tools/perf/util/stat-display.c:985:18
    #5 0x558265e51931 in print_counter tools/perf/util/stat-display.c:1110:3
    #6 0x558265e51931 in evlist__print_counters tools/perf/util/stat-display.c:1571:5
    #7 0x558265c8ec87 in print_counters tools/perf/builtin-stat.c:981:2
    #8 0x558265c8cc71 in cmd_stat tools/perf/builtin-stat.c:2837:3
    #9 0x558265bb9bd4 in run_builtin tools/perf/perf.c:323:11
    #10 0x558265bb98eb in handle_internal_command tools/perf/perf.c:377:8
    #11 0x558265bb9389 in run_argv tools/perf/perf.c:421:2
    #12 0x558265bb9389 in main tools/perf/perf.c:537:3
```

The issue was the use of testing a cpumap with NULL rather than using
empty, as a map containing the dummy value isn't NULL and the -1
results in an empty aggr map being allocated which legitimately
overflows when any member is accessed.

Fixes: 8a96f45 ("perf stat: Avoid SEGV if core.cpus isn't set")
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230906003912.3317462-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
matttbe pushed a commit that referenced this issue Dec 15, 2023
When creating ceq_0 during probing irdma, cqp.sc_cqp will be sent as a
cqp_request to cqp->sc_cqp.sq_ring. If the request is pending when
removing the irdma driver or unplugging its aux device, cqp.sc_cqp will be
dereferenced as wrong struct in irdma_free_pending_cqp_request().

  PID: 3669   TASK: ffff88aef892c000  CPU: 28  COMMAND: "kworker/28:0"
   #0 [fffffe0000549e38] crash_nmi_callback at ffffffff810e3a34
   #1 [fffffe0000549e40] nmi_handle at ffffffff810788b2
   #2 [fffffe0000549ea0] default_do_nmi at ffffffff8107938f
   #3 [fffffe0000549eb8] do_nmi at ffffffff81079582
   #4 [fffffe0000549ef0] end_repeat_nmi at ffffffff82e016b4
      [exception RIP: native_queued_spin_lock_slowpath+1291]
      RIP: ffffffff8127e72b  RSP: ffff88aa841ef778  RFLAGS: 00000046
      RAX: 0000000000000000  RBX: ffff88b01f849700  RCX: ffffffff8127e47e
      RDX: 0000000000000000  RSI: 0000000000000004  RDI: ffffffff83857ec0
      RBP: ffff88afe3e4efc8   R8: ffffed15fc7c9dfa   R9: ffffed15fc7c9dfa
      R10: 0000000000000001  R11: ffffed15fc7c9df9  R12: 0000000000740000
      R13: ffff88b01f849708  R14: 0000000000000003  R15: ffffed1603f092e1
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
  -- <NMI exception stack> --
   #5 [ffff88aa841ef778] native_queued_spin_lock_slowpath at ffffffff8127e72b
   #6 [ffff88aa841ef7b0] _raw_spin_lock_irqsave at ffffffff82c22aa4
   #7 [ffff88aa841ef7c8] __wake_up_common_lock at ffffffff81257363
   #8 [ffff88aa841ef888] irdma_free_pending_cqp_request at ffffffffa0ba12cc [irdma]
   #9 [ffff88aa841ef958] irdma_cleanup_pending_cqp_op at ffffffffa0ba1469 [irdma]
   #10 [ffff88aa841ef9c0] irdma_ctrl_deinit_hw at ffffffffa0b2989f [irdma]
   #11 [ffff88aa841efa28] irdma_remove at ffffffffa0b252df [irdma]
   #12 [ffff88aa841efae8] auxiliary_bus_remove at ffffffff8219afdb
   #13 [ffff88aa841efb00] device_release_driver_internal at ffffffff821882e6
   #14 [ffff88aa841efb38] bus_remove_device at ffffffff82184278
   #15 [ffff88aa841efb88] device_del at ffffffff82179d23
   #16 [ffff88aa841efc48] ice_unplug_aux_dev at ffffffffa0eb1c14 [ice]
   #17 [ffff88aa841efc68] ice_service_task at ffffffffa0d88201 [ice]
   #18 [ffff88aa841efde8] process_one_work at ffffffff811c589a
   #19 [ffff88aa841efe60] worker_thread at ffffffff811c71ff
   #20 [ffff88aa841eff10] kthread at ffffffff811d87a0
   #21 [ffff88aa841eff50] ret_from_fork at ffffffff82e0022f

Fixes: 44d9e52 ("RDMA/irdma: Implement device initialization definitions")
Link: https://lore.kernel.org/r/20231130081415.891006-1-lishifeng@sangfor.com.cn
Suggested-by: "Ismail, Mustafa" <mustafa.ismail@intel.com>
Signed-off-by: Shifeng Li <lishifeng@sangfor.com.cn>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
matttbe pushed a commit that referenced this issue Jan 26, 2024
[BUG]
There is a bug report that, on a ext4-converted btrfs, scrub leads to
various problems, including:

- "unable to find chunk map" errors
  BTRFS info (device vdb): scrub: started on devid 1
  BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 4096
  BTRFS critical (device vdb): unable to find chunk map for logical 2214744064 length 45056

  This would lead to unrepariable errors.

- Use-after-free KASAN reports:
  ==================================================================
  BUG: KASAN: slab-use-after-free in __blk_rq_map_sg+0x18f/0x7c0
  Read of size 8 at addr ffff8881013c9040 by task btrfs/909
  CPU: 0 PID: 909 Comm: btrfs Not tainted 6.7.0-x64v3-dbg #11 c50636e9419a8354555555245df535e380563b2b
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2023.11-2 12/24/2023
  Call Trace:
   <TASK>
   dump_stack_lvl+0x43/0x60
   print_report+0xcf/0x640
   kasan_report+0xa6/0xd0
   __blk_rq_map_sg+0x18f/0x7c0
   virtblk_prep_rq.isra.0+0x215/0x6a0 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
   virtio_queue_rqs+0xc4/0x310 [virtio_blk 19a65eeee9ae6fcf02edfad39bb9ddee07dcdaff]
   blk_mq_flush_plug_list.part.0+0x780/0x860
   __blk_flush_plug+0x1ba/0x220
   blk_finish_plug+0x3b/0x60
   submit_initial_group_read+0x10a/0x290 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   flush_scrub_stripes+0x38e/0x430 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   scrub_stripe+0x82a/0xae0 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   scrub_chunk+0x178/0x200 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   scrub_enumerate_chunks+0x4bc/0xa30 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   btrfs_scrub_dev+0x398/0x810 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   btrfs_ioctl+0x4b9/0x3020 [btrfs e57987a360bed82fe8756dcd3e0de5406ccfe965]
   __x64_sys_ioctl+0xbd/0x100
   do_syscall_64+0x5d/0xe0
   entry_SYSCALL_64_after_hwframe+0x63/0x6b
  RIP: 0033:0x7f47e5e0952b

- Crash, mostly due to above use-after-free

[CAUSE]
The converted fs has the following data chunk layout:

    item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 2214658048) itemoff 16025 itemsize 80
        length 86016 owner 2 stripe_len 65536 type DATA|single

For above logical bytenr 2214744064, it's at the chunk end
(2214658048 + 86016 = 2214744064).

This means btrfs_submit_bio() would split the bio, and trigger endio
function for both of the two halves.

However scrub_submit_initial_read() would only expect the endio function
to be called once, not any more.
This means the first endio function would already free the bbio::bio,
leaving the bvec freed, thus the 2nd endio call would lead to
use-after-free.

[FIX]
- Make sure scrub_read_endio() only updates bits in its range
  Since we may read less than 64K at the end of the chunk, we should not
  touch the bits beyond chunk boundary.

- Make sure scrub_submit_initial_read() only to read the chunk range
  This is done by calculating the real number of sectors we need to
  read, and add sector-by-sector to the bio.

Thankfully the scrub read repair path won't need extra fixes:

- scrub_stripe_submit_repair_read()
  With above fixes, we won't update error bit for range beyond chunk,
  thus scrub_stripe_submit_repair_read() should never submit any read
  beyond the chunk.

Reported-by: Rongrong <i@rong.moe>
Fixes: e02ee89 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Tested-by: Rongrong <i@rong.moe>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
matttbe pushed a commit that referenced this issue Mar 12, 2024
Petr Machata says:

====================
mlxsw: Support for nexthop group statistics

ECMP is a fundamental component in L3 designs. However, it's fragile. Many
factors influence whether an ECMP group will operate as intended: hash
policy (i.e. the set of fields that contribute to ECMP hash calculation),
neighbor validity, hash seed (which might lead to polarization) or the type
of ECMP group used (hash-threshold or resilient).

At the same time, collection of statistics that would help an operator
determine that the group performs as desired, is difficult.

Support for nexthop group statistics and their HW collection has been
introduced recently. In this patch set, add HW stats collection support
to mlxsw.

This patchset progresses as follows:

- Patches #1 and #2 add nexthop IDs to notifiers.
- Patches #3 and #4 are code-shaping.
- Patches #5, #6 and #7 adjust the flow counter code.
- Patches #8 and #9 add HW nexthop counters.
- Patch #10 adjusts the HW counter code to allow sharing the same counter
  for several resilient group buckets with the same NH ID.
- Patch #11 adds a selftest.
====================

Link: https://lore.kernel.org/r/cover.1709901020.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
matttbe pushed a commit that referenced this issue Mar 12, 2024
Locally generated packets can increment the new nexthop statistics from
process context, resulting in the following splat [1] due to preemption
being enabled. Fix by using get_cpu_ptr() / put_cpu_ptr() which will
which take care of disabling / enabling preemption.

BUG: using smp_processor_id() in preemptible [00000000] code: ping/949
caller is nexthop_select_path+0xcf8/0x1e30
CPU: 12 PID: 949 Comm: ping Not tainted 6.8.0-rc7-custom-gcb450f605fae #11
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xbd/0xe0
 check_preemption_disabled+0xce/0xe0
 nexthop_select_path+0xcf8/0x1e30
 fib_select_multipath+0x865/0x18b0
 fib_select_path+0x311/0x1160
 ip_route_output_key_hash_rcu+0xe54/0x2720
 ip_route_output_key_hash+0x193/0x380
 ip_route_output_flow+0x25/0x130
 raw_sendmsg+0xbab/0x34a0
 inet_sendmsg+0xa2/0xe0
 __sys_sendto+0x2ad/0x430
 __x64_sys_sendto+0xe5/0x1c0
 do_syscall_64+0xc5/0x1d0
 entry_SYSCALL_64_after_hwframe+0x63/0x6b
[...]

Fixes: f4676ea ("net: nexthop: Add nexthop group entry stats")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240311162307.545385-5-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
matttbe pushed a commit that referenced this issue Mar 29, 2024
Petr Machata says:

====================
selftests: Fixes for kernel CI

As discussed on the bi-weekly call on Jan 30, and in mailing around
kernel CI effort, some changes are desirable in the suite of forwarding
selftests the better to work with the CI tooling. Namely:

- The forwarding selftests use a configuration file where names of
  interfaces are defined and various variables can be overridden. There
  is also forwarding.config.sample that users can use as a template to
  refer to when creating the config file. What happens a fair bit is
  that users either do not know about this at all, or simply forget, and
  are confused by cryptic failures about interfaces that cannot be
  created.

  In patches #1 - #3 have lib.sh just be the single source of truth with
  regards to which variables exist. That includes the topology variables
  which were previously only in the sample file, and any "tweak
  variables", such as what tools to use, sleep times, etc.

  forwarding.config.sample then becomes just a placeholder with a couple
  examples. Unless specific HW should be exercised, or specific tools
  used, the defaults are usually just fine.

- Several net/forwarding/ selftests (and one net/ one) cannot be run on
  veth pairs, they need an actual HW interface to run on. They are
  generic in the sense that any capable HW should pass them, which is
  why they have been put to net/forwarding/ as opposed to drivers/net/,
  but they do not generalize to veth. The fact that these tests are in
  net/forwarding/, but still complaining when run, is confusing.

  In patches #4 - #6 move these tests to a new directory
  drivers/net/hw.

- The following patches extend the codebase to handle well test results
  other than pass and fail.

  Patch #7 is preparatory. It converts several log_test_skip to XFAIL,
  so that tests do not spuriously end up returning non-0 when they
  are not supposed to.

  In patches #8 - #10, introduce some missing ksft constants, then support
  having those constants in RET, and then finally in EXIT_STATUS.

- The traffic scheduler tests generate a large amount of network traffic
  to test the behavior of the scheduler. This demands a relatively
  high-performance computer. On slow machines, such as with a debugging
  kernel, the test would spuriously fail.

  It can still be useful to "go through the motions" though, to possibly
  catch bugs in setup of the scheduler graph and passing packets around.
  Thus we still want to run the tests, just with lowered demands.

  To that end, in patches #11 - #12, introduce an environment variable
  KSFT_MACHINE_SLOW, with obvious meaning. Tests can then make checks
  more lenient, such as mark failures as XFAIL. A helper, xfail_on_slow,
  is provided to mark performance-sensitive parts of the selftest.

- In patch #13, use a similar mechanism to mark a NH group stats
  selftest to XFAIL HW stats tests when run on VETH pairs.

- All these changes complicate the hitherto straightforward logging and
  checking logic, so in patch #14, add a selftest that checks this
  functionality in lib.sh.

v1 (vs. an RFC circulated through linux-kselftest):
- Patch #9:
    - Clarify intended usage by s/set_ret/ret_set_ksft_status/,
      s/nret/ksft_status/
====================

Link: https://lore.kernel.org/r/cover.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
matttbe pushed a commit that referenced this issue Apr 3, 2024
syzkaller reported infinite recursive calls of fib6_dump_done() during
netlink socket destruction.  [1]

From the log, syzkaller sent an AF_UNSPEC RTM_GETROUTE message, and then
the response was generated.  The following recvmmsg() resumed the dump
for IPv6, but the first call of inet6_dump_fib() failed at kzalloc() due
to the fault injection.  [0]

  12:01:34 executing program 3:
  r0 = socket$nl_route(0x10, 0x3, 0x0)
  sendmsg$nl_route(r0, ... snip ...)
  recvmmsg(r0, ... snip ...) (fail_nth: 8)

Here, fib6_dump_done() was set to nlk_sk(sk)->cb.done, and the next call
of inet6_dump_fib() set it to nlk_sk(sk)->cb.args[3].  syzkaller stopped
receiving the response halfway through, and finally netlink_sock_destruct()
called nlk_sk(sk)->cb.done().

fib6_dump_done() calls fib6_dump_end() and nlk_sk(sk)->cb.done() if it
is still not NULL.  fib6_dump_end() rewrites nlk_sk(sk)->cb.done() by
nlk_sk(sk)->cb.args[3], but it has the same function, not NULL, calling
itself recursively and hitting the stack guard page.

To avoid the issue, let's set the destructor after kzalloc().

[0]:
FAULT_INJECTION: forcing a failure.
name failslab, interval 1, probability 0, space 0, times 0
CPU: 1 PID: 432110 Comm: syz-executor.3 Not tainted 6.8.0-12821-g537c2e91d354-dirty #11
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl (lib/dump_stack.c:117)
 should_fail_ex (lib/fault-inject.c:52 lib/fault-inject.c:153)
 should_failslab (mm/slub.c:3733)
 kmalloc_trace (mm/slub.c:3748 mm/slub.c:3827 mm/slub.c:3992)
 inet6_dump_fib (./include/linux/slab.h:628 ./include/linux/slab.h:749 net/ipv6/ip6_fib.c:662)
 rtnl_dump_all (net/core/rtnetlink.c:4029)
 netlink_dump (net/netlink/af_netlink.c:2269)
 netlink_recvmsg (net/netlink/af_netlink.c:1988)
 ____sys_recvmsg (net/socket.c:1046 net/socket.c:2801)
 ___sys_recvmsg (net/socket.c:2846)
 do_recvmmsg (net/socket.c:2943)
 __x64_sys_recvmmsg (net/socket.c:3041 net/socket.c:3034 net/socket.c:3034)

[1]:
BUG: TASK stack guard page was hit at 00000000f2fa9af1 (stack is 00000000b7912430..000000009a436beb)
stack guard page: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 223719 Comm: kworker/1:3 Not tainted 6.8.0-12821-g537c2e91d354-dirty #11
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: events netlink_sock_destruct_work
RIP: 0010:fib6_dump_done (net/ipv6/ip6_fib.c:570)
Code: 3c 24 e8 f3 e9 51 fd e9 28 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 41 57 41 56 41 55 41 54 55 48 89 fd <53> 48 8d 5d 60 e8 b6 4d 07 fd 48 89 da 48 b8 00 00 00 00 00 fc ff
RSP: 0018:ffffc9000d980000 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffffffff84405990 RCX: ffffffff844059d3
RDX: ffff8881028e0000 RSI: ffffffff84405ac2 RDI: ffff88810c02f358
RBP: ffff88810c02f358 R08: 0000000000000007 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000224 R12: 0000000000000000
R13: ffff888007c82c78 R14: ffff888007c82c68 R15: ffff888007c82c68
FS:  0000000000000000(0000) GS:ffff88811b100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffc9000d97fff8 CR3: 0000000102309002 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <#DF>
 </#DF>
 <TASK>
 fib6_dump_done (net/ipv6/ip6_fib.c:572 (discriminator 1))
 fib6_dump_done (net/ipv6/ip6_fib.c:572 (discriminator 1))
 ...
 fib6_dump_done (net/ipv6/ip6_fib.c:572 (discriminator 1))
 fib6_dump_done (net/ipv6/ip6_fib.c:572 (discriminator 1))
 netlink_sock_destruct (net/netlink/af_netlink.c:401)
 __sk_destruct (net/core/sock.c:2177 (discriminator 2))
 sk_destruct (net/core/sock.c:2224)
 __sk_free (net/core/sock.c:2235)
 sk_free (net/core/sock.c:2246)
 process_one_work (kernel/workqueue.c:3259)
 worker_thread (kernel/workqueue.c:3329 kernel/workqueue.c:3416)
 kthread (kernel/kthread.c:388)
 ret_from_fork (arch/x86/kernel/process.c:153)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:256)
Modules linked in:

Fixes: 1da177e ("Linux-2.6.12-rc2")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240401211003.25274-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
matttbe pushed a commit that referenced this issue Apr 4, 2024
Petr Machata says:

====================
mlxsw: Preparations for improving performance

Amit Cohen writes:

mlxsw driver will use NAPI for event processing in a next patch set.
Some additional improvements will be added later. This patch set
prepares the code for NAPI usage and refactor some relevant areas. See
more details in commit messages.

Patch Set overview:
Patches #1-#2 are preparations for patch #3
Patch #3 setups tasklets as part of queue initializtion
Patch #4 removes handling of unlikely scenario
Patch #5 removes unused counters
Patch #6 makes style change in mlxsw_pci_eq_tasklet()
Patch #7-#10 poll command interface instead of EQ0 usage
Patches #11-#12 make style change and break the function
mlxsw_pci_cq_tasklet()
Patches #13-#14 remove functions which can be replaced by a stored value
Patch #15 improves accessing to descriptor queue instance
====================

Link: https://lore.kernel.org/r/cover.1712062203.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
matttbe pushed a commit that referenced this issue Apr 17, 2024
vhost_worker will call tun call backs to receive packets. If too many
illegal packets arrives, tun_do_read will keep dumping packet contents.
When console is enabled, it will costs much more cpu time to dump
packet and soft lockup will be detected.

net_ratelimit mechanism can be used to limit the dumping rate.

PID: 33036    TASK: ffff949da6f20000  CPU: 23   COMMAND: "vhost-32980"
 #0 [fffffe00003fce50] crash_nmi_callback at ffffffff89249253
 #1 [fffffe00003fce58] nmi_handle at ffffffff89225fa3
 #2 [fffffe00003fceb0] default_do_nmi at ffffffff8922642e
 #3 [fffffe00003fced0] do_nmi at ffffffff8922660d
 #4 [fffffe00003fcef0] end_repeat_nmi at ffffffff89c01663
    [exception RIP: io_serial_in+20]
    RIP: ffffffff89792594  RSP: ffffa655314979e8  RFLAGS: 00000002
    RAX: ffffffff89792500  RBX: ffffffff8af428a0  RCX: 0000000000000000
    RDX: 00000000000003fd  RSI: 0000000000000005  RDI: ffffffff8af428a0
    RBP: 0000000000002710   R8: 0000000000000004   R9: 000000000000000f
    R10: 0000000000000000  R11: ffffffff8acbf64f  R12: 0000000000000020
    R13: ffffffff8acbf698  R14: 0000000000000058  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffa655314979e8] io_serial_in at ffffffff89792594
 #6 [ffffa655314979e8] wait_for_xmitr at ffffffff89793470
 #7 [ffffa65531497a08] serial8250_console_putchar at ffffffff897934f6
 #8 [ffffa65531497a20] uart_console_write at ffffffff8978b605
 #9 [ffffa65531497a48] serial8250_console_write at ffffffff89796558
 #10 [ffffa65531497ac8] console_unlock at ffffffff89316124
 #11 [ffffa65531497b10] vprintk_emit at ffffffff89317c07
 #12 [ffffa65531497b68] printk at ffffffff89318306
 #13 [ffffa65531497bc8] print_hex_dump at ffffffff89650765
 #14 [ffffa65531497ca8] tun_do_read at ffffffffc0b06c27 [tun]
 #15 [ffffa65531497d38] tun_recvmsg at ffffffffc0b06e34 [tun]
 #16 [ffffa65531497d68] handle_rx at ffffffffc0c5d682 [vhost_net]
 #17 [ffffa65531497ed0] vhost_worker at ffffffffc0c644dc [vhost]
 #18 [ffffa65531497f10] kthread at ffffffff892d2e72
 #19 [ffffa65531497f50] ret_from_fork at ffffffff89c0022f

Fixes: ef3db4a ("tun: avoid BUG, dump packet on GSO errors")
Signed-off-by: Lei Chen <lei.chen@smartx.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20240415020247.2207781-1-lei.chen@smartx.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
matttbe pushed a commit that referenced this issue Apr 24, 2024
The dev_tracker is added to ax25_cb in ax25_bind(). When the
ax25 device is detaching, the dev_tracker of ax25_cb should be
deallocated in ax25_kill_by_device() instead of the dev_tracker
of ax25_dev. The log reported by ref_tracker is shown below:

[   80.884935] ref_tracker: reference already released.
[   80.885150] ref_tracker: allocated in:
[   80.885349]  ax25_dev_device_up+0x105/0x540
[   80.885730]  ax25_device_event+0xa4/0x420
[   80.885730]  notifier_call_chain+0xc9/0x1e0
[   80.885730]  __dev_notify_flags+0x138/0x280
[   80.885730]  dev_change_flags+0xd7/0x180
[   80.885730]  dev_ifsioc+0x6a9/0xa30
[   80.885730]  dev_ioctl+0x4d8/0xd90
[   80.885730]  sock_do_ioctl+0x1c2/0x2d0
[   80.885730]  sock_ioctl+0x38b/0x4f0
[   80.885730]  __se_sys_ioctl+0xad/0xf0
[   80.885730]  do_syscall_64+0xc4/0x1b0
[   80.885730]  entry_SYSCALL_64_after_hwframe+0x67/0x6f
[   80.885730] ref_tracker: freed in:
[   80.885730]  ax25_device_event+0x272/0x420
[   80.885730]  notifier_call_chain+0xc9/0x1e0
[   80.885730]  dev_close_many+0x272/0x370
[   80.885730]  unregister_netdevice_many_notify+0x3b5/0x1180
[   80.885730]  unregister_netdev+0xcf/0x120
[   80.885730]  sixpack_close+0x11f/0x1b0
[   80.885730]  tty_ldisc_kill+0xcb/0x190
[   80.885730]  tty_ldisc_hangup+0x338/0x3d0
[   80.885730]  __tty_hangup+0x504/0x740
[   80.885730]  tty_release+0x46e/0xd80
[   80.885730]  __fput+0x37f/0x770
[   80.885730]  __x64_sys_close+0x7b/0xb0
[   80.885730]  do_syscall_64+0xc4/0x1b0
[   80.885730]  entry_SYSCALL_64_after_hwframe+0x67/0x6f
[   80.893739] ------------[ cut here ]------------
[   80.894030] WARNING: CPU: 2 PID: 140 at lib/ref_tracker.c:255 ref_tracker_free+0x47b/0x6b0
[   80.894297] Modules linked in:
[   80.894929] CPU: 2 PID: 140 Comm: ax25_conn_rel_6 Not tainted 6.9.0-rc4-g8cd26fd90c1a #11
[   80.895190] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qem4
[   80.895514] RIP: 0010:ref_tracker_free+0x47b/0x6b0
[   80.895808] Code: 83 c5 18 4c 89 eb 48 c1 eb 03 8a 04 13 84 c0 0f 85 df 01 00 00 41 83 7d 00 00 75 4b 4c 89 ff 9
[   80.896171] RSP: 0018:ffff888009edf8c0 EFLAGS: 00000286
[   80.896339] RAX: 1ffff1100141ac00 RBX: 1ffff1100149463b RCX: dffffc0000000000
[   80.896502] RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffff88800a0d6518
[   80.896925] RBP: ffff888009edf9b0 R08: ffff88806d3288d3 R09: 1ffff1100da6511a
[   80.897212] R10: dffffc0000000000 R11: ffffed100da6511b R12: ffff88800a4a31d4
[   80.897859] R13: ffff88800a4a31d8 R14: dffffc0000000000 R15: ffff88800a0d6518
[   80.898279] FS:  00007fd88b7fe700(0000) GS:ffff88806d300000(0000) knlGS:0000000000000000
[   80.899436] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   80.900181] CR2: 00007fd88c001d48 CR3: 000000000993e000 CR4: 00000000000006f0
...
[   80.935774] ref_tracker: sp%d@000000000bb9df3d has 1/1 users at
[   80.935774]      ax25_bind+0x424/0x4e0
[   80.935774]      __sys_bind+0x1d9/0x270
[   80.935774]      __x64_sys_bind+0x75/0x80
[   80.935774]      do_syscall_64+0xc4/0x1b0
[   80.935774]      entry_SYSCALL_64_after_hwframe+0x67/0x6f

Change ax25_dev->dev_tracker to the dev_tracker of ax25_cb
in order to mitigate the bug.

Fixes: feef318 ("ax25: fix UAF bugs of net_device caused by rebinding operation")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Link: https://lore.kernel.org/r/20240419020456.29826-1-duoming@zju.edu.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
matttbe pushed a commit that referenced this issue May 1, 2024
Wen Gu says:

====================
net/smc: SMC intra-OS shortcut with loopback-ism

This patch set acts as the second part of the new version of [1] (The first
part can be referred from [2]), the updated things of this version are listed
at the end.

- Background

SMC-D is now used in IBM z with ISM function to optimize network interconnect
for intra-CPC communications. Inspired by this, we try to make SMC-D available
on the non-s390 architecture through a software-implemented Emulated-ISM device,
that is the loopback-ism device here, to accelerate inter-process or
inter-containers communication within the same OS instance.

- Design

This patch set includes 3 parts:

 - Patch #1: some prepare work for loopback-ism.
 - Patch #2-#7: implement loopback-ism device and adapt SMC-D for it.
   loopback-ism now serves only SMC and no userspace interfaces exposed.
 - Patch #8-#11: memory copy optimization for intra-OS scenario.

The loopback-ism device is designed as an ISMv2 device and not be limited to
a specific net namespace, ends of both inter-process connection (1/1' in diagram
below) or inter-container connection (2/2' in diagram below) can find the same
available loopback-ism and choose it during the CLC handshake.

 Container 1 (ns1)                              Container 2 (ns2)
 +-----------------------------------------+    +-------------------------+
 | +-------+      +-------+      +-------+ |    |        +-------+        |
 | | App A |      | App B |      | App C | |    |        | App D |<-+     |
 | +-------+      +---^---+      +-------+ |    |        +-------+  |(2') |
 |     |127.0.0.1 (1')|             |192.168.0.11       192.168.0.12|     |
 |  (1)|   +--------+ | +--------+  |(2)   |    | +--------+   +--------+ |
 |     `-->|   lo   |-` |  eth0  |<-`      |    | |   lo   |   |  eth0  | |
 +---------+--|---^-+---+-----|--+---------+    +-+--------+---+-^------+-+
              |   |           |                                  |
 Kernel       |   |           |                                  |
 +----+-------v---+-----------v----------------------------------+---+----+
 |    |                            TCP                               |    |
 |    |                                                              |    |
 |    +--------------------------------------------------------------+    |
 |                                                                        |
 |                           +--------------+                             |
 |                           | smc loopback |                             |
 +---------------------------+--------------+-----------------------------+

loopback-ism device creates DMBs (shared memory) for each connection peer.
Since data transfer occurs within the same kernel, the sndbuf of each peer
is only a descriptor and point to the same memory region as peer DMB, so that
the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.

 Container 1 (ns1)                              Container 2 (ns2)
 +-----------------------------------------+    +-------------------------+
 | +-------+                               |    |        +-------+        |
 | | App C |-----+                         |    |        | App D |        |
 | +-------+     |                         |    |        +-^-----+        |
 |               |                         |    |          |              |
 |           (2) |                         |    |     (2') |              |
 |               |                         |    |          |              |
 +---------------|-------------------------+    +----------|--------------+
                 |                                         |
 Kernel          |                                         |
 +---------------|-----------------------------------------|--------------+
 | +--------+ +--v-----+                           +--------+ +--------+  |
 | |dmb_desc| |snd_desc|                           |dmb_desc| |snd_desc|  |
 | +-----|--+ +--|-----+                           +-----|--+ +--------+  |
 | +-----|--+    |                                 +-----|--+             |
 | | DMB C  |    +---------------------------------| DMB D  |             |
 | +--------+                                      +--------+             |
 |                                                                        |
 |                           +--------------+                             |
 |                           | smc loopback |                             |
 +---------------------------+--------------+-----------------------------+

- Benchmark Test

 * Test environments:
      - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
      - SMC sndbuf/DMB size 1MB.

 * Test object:
      - TCP: run on TCP loopback.
      - SMC lo: run on SMC loopback-ism.

1. ipc-benchmark (see [3])

 - ./<foo> -c 1000000 -s 100

                            TCP                  SMC-lo
Message
rate (msg/s)              84991                  151293(+78.01%)

2. sockperf

 - serv: <smc_run> sockperf sr --tcp
 - clnt: <smc_run> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30

                            TCP                  SMC-lo
Bandwidth(MBps)        5033.569                7987.732(+58.69%)
Latency(us)               5.986                   3.398(-43.23%)

3. nginx/wrk

 - serv: <smc_run> nginx
 - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80

                           TCP                   SMC-lo
Requests/s           187951.76                267107.90(+42.12%)

4. redis-benchmark

 - serv: <smc_run> redis-server
 - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024

                           TCP                   SMC-lo
GET(Requests/s)       86132.64                118133.49(+37.15%)
SET(Requests/s)       87374.40                122887.86(+40.65%)

Change log:
v7->v6
- Patch #2: minor: remove unnecessary 'return' of inline smc_loopback_exit().
- Patch #10: minor: directly return 0 instead of 'rc' in smcd_cdc_msg_send().
- all: collect the Reviewed-by tags.

v6->RFC v5
Link: https://lore.kernel.org/netdev/20240414040304.54255-1-guwen@linux.alibaba.com/
- Patch #2: make the use of CONFIG_SMC_LO cleaner.
- Patch #5: mark some smcd_ops that loopback-ism doesn't support as
  optional and check for the support when they are called.
- Patch #7: keep loopback-ism at the beginning of the SMC-D device list.
- Some expression changes in commit logs and comments.

RFC v5->RFC v4:
Link: https://lore.kernel.org/netdev/20240324135522.108564-1-guwen@linux.alibaba.com/
- Patch #2: minor changes in description of config SMC_LO and comments.
- Patch #10: minor changes in comments and if(smc_ism_support_dmb_nocopy())
  check in smcd_cdc_msg_send().
- Patch #3: change smc_lo_generate_id() to smc_lo_generate_ids() and SMC_LO_CHID
  to SMC_LO_RESERVED_CHID.
- Patch #5: memcpy while holding the ldev->dmb_ht_lock.
- Some expression changes in commit logs.

RFC v4->v3:
Link: https://lore.kernel.org/netdev/20240317100545.96663-1-guwen@linux.alibaba.com/
- The merge window of v6.9 is open, so post this series as an RFC.
- Patch #6: since some information fed back by smc_nl_handle_smcd_dev() dose
  not apply to Emulated-ISM (including loopback-ism here), loopback-ism is
  not exposed through smc netlink for the time being. we may refactor this
  part when smc netlink interface is updated.

v3->v2:
Link: https://lore.kernel.org/netdev/20240312142743.41406-1-guwen@linux.alibaba.com/
- Patch #11: use tasklet_schedule(&conn->rx_tsklet) instead of smcd_cdc_rx_handler()
  to avoid possible recursive locking of conn->send_lock and use {read|write}_lock_bh()
  to acquire dmb_ht_lock.

v2->v1:
Link: https://lore.kernel.org/netdev/20240307095536.29648-1-guwen@linux.alibaba.com/
- All the patches: changed the term virtual-ISM to Emulated-ISM as defined by SMCv2.1.
- Patch #3: optimized the description of SMC_LO config. Avoid exposing loopback-ism
  to sysfs and remove all the knobs until future definition clear.
- Patch #3: try to make lockdep happy by using read_lock_bh() in smc_lo_move_data().
- Patch #6: defaultly use physical contiguous DMB buffers.
- Patch #11: defaultly enable DMB no-copy for loopback-ism and free the DMB in
  unregister_dmb or detach_dmb when dmb_node->refcnt reaches 0, instead of using
  wait_event to keep waiting in unregister_dmb.

v1->RFC:
Link: https://lore.kernel.org/netdev/20240111120036.109903-1-guwen@linux.alibaba.com/
- Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
  /sys/devices/virtual/smc/loopback-ism/xfer_bytes
- Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
  merging sndbuf with peer DMB.
- Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
  control of whether to merge sndbuf and DMB. They can be respectively set by:
  /sys/devices/virtual/smc/loopback-ism/dmb_type
  /sys/devices/virtual/smc/loopback-ism/dmb_copy
  The motivation for these two control is that a performance bottleneck was
  found when using vzalloced DMB and sndbuf is merged with DMB, and there are
  many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
  by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
  or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
  vmap lock contention [6]. It has significant effects, but using virtual memory
  still has additional overhead compared to using physical memory.
  So this new version provides controls of dmb_type and dmb_copy to suit
  different scenarios.
- Some minor changes and comments improvements.

RFC->old version([1]):
Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/
- Patch #1: improve the loopback-ism dump, it shows as follows now:
  # smcd d
  FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
  0000 0     loopback-ism  ffff   No        0
- Patch #3: introduce the smc_ism_set_v2_capable() helper and set
  smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
  regardless of whether there is already a device in smcd device list.
- Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
- Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
  to activate or deactivate the loopback-ism.
- Patch #9: introduce the statistics of loopback-ism by
  /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}.
- Some minor changes and comments improvements.

[1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/
[2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/
[3] https://github.com/goldsborough/ipc-bench
[4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/
[5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/
[6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/
====================

Link: https://lore.kernel.org/r/20240428060738.60843-1-guwen@linux.alibaba.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
matttbe pushed a commit that referenced this issue May 13, 2024
…/git/pablo/gtp

Pablo neira Ayuso says:

====================
gtp pull request 24-05-07

This v3 includes:
- fix for clang uninitialized variable per Jakub.
- address Smatch and Coccinelle reports per Simon
- remove inline in new IPv6 support per Simon
- fix memleaks in netlink control plane per Simon
-o-

The following patchset contains IPv6 GTP driver support for net-next,
this also includes IPv6 over IPv4 and vice-versa:

Patch #1 removes a unnecessary stack variable initialization in the
         socket routine.

Patch #2 deals with GTP extension headers. This variable length extension
         header to decapsulate packets accordingly. Otherwise, packets are
         dropped when these extension headers are present which breaks
         interoperation with other non-Linux based GTP implementations.

Patch #3 prepares for IPv6 support by moving IPv4 specific fields in PDP
         context objects to a union.

Patch #4 adds IPv6 support while retaining backward compatibility.
         Three new attributes allows to declare an IPv6 GTP tunnel
         GTPA_FAMILY, GTPA_PEER_ADDR6 and GTPA_MS_ADDR6 as well as
         IFLA_GTP_LOCAL6 to declare the IPv6 GTP UDP socket. Up to this
         patch, only IPv6 outer in IPv6 inner is supported.

Patch #5 uses IPv6 address /64 prefix for UE/MS in the inner headers.
         Unlike IPv4, which provides a 1:1 mapping between UE/MS,
         IPv6 tunnel encapsulates traffic for /64 address as specified
         by 3GPP TS. Patch has been split from Patch #4 to highlight
         this behaviour.

Patch #6 passes up IPv6 link-local traffic, such as IPv6 SLAAC, for
         handling to userspace so they are handled as control packets.

Patch #7 prepares to allow for GTP IPv4 over IPv6 and vice-versa by
         moving IP specific debugging out of the function to build
         IPv4 and IPv6 GTP packets.

Patch #8 generalizes TOS/DSCP handling following similar approach as
         in the existing iptunnel infrastructure.

Patch #9 adds a helper function to build an IPv4 GTP packet in the outer
         header.

Patch #10 adds a helper function to build an IPv6 GTP packet in the outer
          header.

Patch #11 adds support for GTP IPv4-over-IPv6 and vice-versa.

Patch #12 allows to use the same TID/TEID (tunnel identifier) for inner
          IPv4 and IPv6 packets for better UE/MS dual stack integration.

This series integrates with the osmocom.org project CI and TTCN-3 test
infrastructure (Oliver Smith) as well as the userspace libgtpnl library.

Thanks to Harald Welte, Oliver Smith and Pau Espin for reviewing and
providing feedback through the osmocom.org redmine platform to make this
happen.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
matttbe pushed a commit that referenced this issue May 24, 2024
ui_browser__show() is capturing the input title that is stack allocated
memory in hist_browser__run().

Avoid a use after return by strdup-ing the string.

Committer notes:

Further explanation from Ian Rogers:

My command line using tui is:
$ sudo bash -c 'rm /tmp/asan.log*; export
ASAN_OPTIONS="log_path=/tmp/asan.log"; /tmp/perf/perf mem record -a
sleep 1; /tmp/perf/perf mem report'
I then go to the perf annotate view and quit. This triggers the asan
error (from the log file):
```
==1254591==ERROR: AddressSanitizer: stack-use-after-return on address
0x7f2813331920 at pc 0x7f28180
65991 bp 0x7fff0a21c750 sp 0x7fff0a21bf10
READ of size 80 at 0x7f2813331920 thread T0
    #0 0x7f2818065990 in __interceptor_strlen
../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:461
    #1 0x7f2817698251 in SLsmg_write_wrapped_string
(/lib/x86_64-linux-gnu/libslang.so.2+0x98251)
    #2 0x7f28176984b9 in SLsmg_write_nstring
(/lib/x86_64-linux-gnu/libslang.so.2+0x984b9)
    #3 0x55c94045b365 in ui_browser__write_nstring ui/browser.c:60
    #4 0x55c94045c558 in __ui_browser__show_title ui/browser.c:266
    #5 0x55c94045c776 in ui_browser__show ui/browser.c:288
    #6 0x55c94045c06d in ui_browser__handle_resize ui/browser.c:206
    #7 0x55c94047979b in do_annotate ui/browsers/hists.c:2458
    #8 0x55c94047fb17 in evsel__hists_browse ui/browsers/hists.c:3412
    #9 0x55c940480a0c in perf_evsel_menu__run ui/browsers/hists.c:3527
    #10 0x55c940481108 in __evlist__tui_browse_hists ui/browsers/hists.c:3613
    #11 0x55c9404813f7 in evlist__tui_browse_hists ui/browsers/hists.c:3661
    #12 0x55c93ffa253f in report__browse_hists tools/perf/builtin-report.c:671
    #13 0x55c93ffa58ca in __cmd_report tools/perf/builtin-report.c:1141
    #14 0x55c93ffaf159 in cmd_report tools/perf/builtin-report.c:1805
    #15 0x55c94000c05c in report_events tools/perf/builtin-mem.c:374
    #16 0x55c94000d96d in cmd_mem tools/perf/builtin-mem.c:516
    #17 0x55c9400e44ee in run_builtin tools/perf/perf.c:350
    #18 0x55c9400e4a5a in handle_internal_command tools/perf/perf.c:403
    #19 0x55c9400e4e22 in run_argv tools/perf/perf.c:447
    #20 0x55c9400e53ad in main tools/perf/perf.c:561
    #21 0x7f28170456c9 in __libc_start_call_main
../sysdeps/nptl/libc_start_call_main.h:58
    #22 0x7f2817045784 in __libc_start_main_impl ../csu/libc-start.c:360
    #23 0x55c93ff544c0 in _start (/tmp/perf/perf+0x19a4c0) (BuildId:
84899b0e8c7d3a3eaa67b2eb35e3d8b2f8cd4c93)

Address 0x7f2813331920 is located in stack of thread T0 at offset 32 in frame
    #0 0x55c94046e85e in hist_browser__run ui/browsers/hists.c:746

  This frame has 1 object(s):
    [32, 192) 'title' (line 747) <== Memory access at offset 32 is
inside this variable
HINT: this may be a false positive if your program uses some custom
stack unwind mechanism, swapcontext or vfork
```
hist_browser__run isn't on the stack so the asan error looks legit.
There's no clean init/exit on struct ui_browser so I may be trading a
use-after-return for a memory leak, but that seems look a good trade
anyway.

Fixes: 05e8b08 ("perf ui browser: Stop using 'self'")
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Ben Gainey <ben.gainey@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@arm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Li Dong <lidong@vivo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Paran Lee <p4ranlee@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Sun Haiyong <sunhaiyong@loongson.cn>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Link: https://lore.kernel.org/r/20240507183545.1236093-2-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
matttbe pushed a commit that referenced this issue Jun 6, 2024
In the XDP_TX path, ionic driver sends a packet to the TX path with rx
page and corresponding dma address.
After tx is done, ionic_tx_clean() frees that page.
But RX ring buffer isn't reset to NULL.
So, it uses a freed page, which causes kernel panic.

BUG: unable to handle page fault for address: ffff8881576c110c
PGD 773801067 P4D 773801067 PUD 87f086067 PMD 87efca067 PTE 800ffffea893e060
Oops: Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN NOPTI
CPU: 1 PID: 25 Comm: ksoftirqd/1 Not tainted 6.9.0+ #11
Hardware name: ASUS System Product Name/PRIME Z690-P D4, BIOS 0603 11/01/2021
RIP: 0010:bpf_prog_f0b8caeac1068a55_balancer_ingress+0x3b/0x44f
Code: 00 53 41 55 41 56 41 57 b8 01 00 00 00 48 8b 5f 08 4c 8b 77 00 4c 89 f7 48 83 c7 0e 48 39 d8
RSP: 0018:ffff888104e6fa28 EFLAGS: 00010283
RAX: 0000000000000002 RBX: ffff8881576c1140 RCX: 0000000000000002
RDX: ffffffffc0051f64 RSI: ffffc90002d33048 RDI: ffff8881576c110e
RBP: ffff888104e6fa88 R08: 0000000000000000 R09: ffffed1027a04a23
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881b03a21a8
R13: ffff8881589f800f R14: ffff8881576c1100 R15: 00000001576c1100
FS: 0000000000000000(0000) GS:ffff88881ae00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8881576c110c CR3: 0000000767a90000 CR4: 00000000007506f0
PKRU: 55555554
Call Trace:
<TASK>
? __die+0x20/0x70
? page_fault_oops+0x254/0x790
? __pfx_page_fault_oops+0x10/0x10
? __pfx_is_prefetch.constprop.0+0x10/0x10
? search_bpf_extables+0x165/0x260
? fixup_exception+0x4a/0x970
? exc_page_fault+0xcb/0xe0
? asm_exc_page_fault+0x22/0x30
? 0xffffffffc0051f64
? bpf_prog_f0b8caeac1068a55_balancer_ingress+0x3b/0x44f
? do_raw_spin_unlock+0x54/0x220
ionic_rx_service+0x11ab/0x3010 [ionic 9180c3001ab627d82bbc5f3ebe8a0decaf6bb864]
? ionic_tx_clean+0x29b/0xc60 [ionic 9180c3001ab627d82bbc5f3ebe8a0decaf6bb864]
? __pfx_ionic_tx_clean+0x10/0x10 [ionic 9180c3001ab627d82bbc5f3ebe8a0decaf6bb864]
? __pfx_ionic_rx_service+0x10/0x10 [ionic 9180c3001ab627d82bbc5f3ebe8a0decaf6bb864]
? ionic_tx_cq_service+0x25d/0xa00 [ionic 9180c3001ab627d82bbc5f3ebe8a0decaf6bb864]
? __pfx_ionic_rx_service+0x10/0x10 [ionic 9180c3001ab627d82bbc5f3ebe8a0decaf6bb864]
ionic_cq_service+0x69/0x150 [ionic 9180c3001ab627d82bbc5f3ebe8a0decaf6bb864]
ionic_txrx_napi+0x11a/0x540 [ionic 9180c3001ab627d82bbc5f3ebe8a0decaf6bb864]
__napi_poll.constprop.0+0xa0/0x440
net_rx_action+0x7e7/0xc30
? __pfx_net_rx_action+0x10/0x10

Fixes: 8eeed83 ("ionic: Add XDP_TX support")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
matttbe pushed a commit that referenced this issue Jun 14, 2024
…PLES event"

This reverts commit 7d1405c.

This causes segfaults in some cases, as reported by Milian:

  ```
  sudo /usr/bin/perf record -z --call-graph dwarf -e cycles -e
  raw_syscalls:sys_enter ls
  ...
  [ perf record: Woken up 3 times to write data ]
  malloc(): invalid next size (unsorted)
  Aborted
  ```

  Backtrace with GDB + debuginfod:

  ```
  malloc(): invalid next size (unsorted)

  Thread 1 "perf" received signal SIGABRT, Aborted.
  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6,
  no_tid=no_tid@entry=0) at pthread_kill.c:44
  Downloading source file /usr/src/debug/glibc/glibc/nptl/pthread_kill.c
  44            return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO
  (ret) : 0;
  (gdb) bt
  #0  __pthread_kill_implementation (threadid=<optimized out>,
  signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
  #1  0x00007ffff6ea8eb3 in __pthread_kill_internal (threadid=<optimized out>,
  signo=6) at pthread_kill.c:78
  #2  0x00007ffff6e50a30 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/
  raise.c:26
  #3  0x00007ffff6e384c3 in __GI_abort () at abort.c:79
  #4  0x00007ffff6e39354 in __libc_message_impl (fmt=fmt@entry=0x7ffff6fc22ea
  "%s\n") at ../sysdeps/posix/libc_fatal.c:132
  #5  0x00007ffff6eb3085 in malloc_printerr (str=str@entry=0x7ffff6fc5850
  "malloc(): invalid next size (unsorted)") at malloc.c:5772
  #6  0x00007ffff6eb657c in _int_malloc (av=av@entry=0x7ffff6ff6ac0
  <main_arena>, bytes=bytes@entry=368) at malloc.c:4081
  #7  0x00007ffff6eb877e in __libc_calloc (n=<optimized out>,
  elem_size=<optimized out>) at malloc.c:3754
  #8  0x000055555569bdb6 in perf_session.do_write_header ()
  #9  0x00005555555a373a in __cmd_record.constprop.0 ()
  #10 0x00005555555a6846 in cmd_record ()
  #11 0x000055555564db7f in run_builtin ()
  #12 0x000055555558ed77 in main ()
  ```

  Valgrind memcheck:
  ```
  ==45136== Invalid write of size 8
  ==45136==    at 0x2B38A5: perf_event__synthesize_id_sample (in /usr/bin/perf)
  ==45136==    by 0x157069: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==  Address 0x6a866a8 is 0 bytes after a block of size 40 alloc'd
  ==45136==    at 0x4849BF3: calloc (vg_replace_malloc.c:1675)
  ==45136==    by 0x3574AB: zalloc (in /usr/bin/perf)
  ==45136==    by 0x1570E0: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==
  ==45136== Syscall param write(buf) points to unaddressable byte(s)
  ==45136==    at 0x575953D: __libc_write (write.c:26)
  ==45136==    by 0x575953D: write (write.c:24)
  ==45136==    by 0x35761F: ion (in /usr/bin/perf)
  ==45136==    by 0x357778: writen (in /usr/bin/perf)
  ==45136==    by 0x1548F7: record__write (in /usr/bin/perf)
  ==45136==    by 0x15708A: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==  Address 0x6a866a8 is 0 bytes after a block of size 40 alloc'd
  ==45136==    at 0x4849BF3: calloc (vg_replace_malloc.c:1675)
  ==45136==    by 0x3574AB: zalloc (in /usr/bin/perf)
  ==45136==    by 0x1570E0: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==
 -----

Closes: https://lore.kernel.org/linux-perf-users/23879991.0LEYPuXRzz@milian-workstation/
Reported-by: Milian Wolff <milian.wolff@kdab.com>
Tested-by: Milian Wolff <milian.wolff@kdab.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: stable@kernel.org # 6.8+
Link: https://lore.kernel.org/lkml/Zl9ksOlHJHnKM70p@x1
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
matttbe pushed a commit that referenced this issue Jun 14, 2024
We have been seeing crashes on duplicate keys in
btrfs_set_item_key_safe():

  BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192)
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.c:2620!
  invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
  RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs]

With the following stack trace:

  #0  btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4)
  #1  btrfs_drop_extents (fs/btrfs/file.c:411:4)
  #2  log_one_extent (fs/btrfs/tree-log.c:4732:9)
  #3  btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9)
  #4  btrfs_log_inode (fs/btrfs/tree-log.c:6626:9)
  #5  btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8)
  #6  btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8)
  #7  btrfs_sync_file (fs/btrfs/file.c:1933:8)
  #8  vfs_fsync_range (fs/sync.c:188:9)
  #9  vfs_fsync (fs/sync.c:202:9)
  #10 do_fsync (fs/sync.c:212:9)
  #11 __do_sys_fdatasync (fs/sync.c:225:9)
  #12 __se_sys_fdatasync (fs/sync.c:223:1)
  #13 __x64_sys_fdatasync (fs/sync.c:223:1)
  #14 do_syscall_x64 (arch/x86/entry/common.c:52:14)
  #15 do_syscall_64 (arch/x86/entry/common.c:83:7)
  #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)

So we're logging a changed extent from fsync, which is splitting an
extent in the log tree. But this split part already exists in the tree,
triggering the BUG().

This is the state of the log tree at the time of the crash, dumped with
drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py)
to get more details than btrfs_print_leaf() gives us:

  >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
  leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
  leaf 33439744 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
          item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
                  generation 7 transid 9 size 8192 nbytes 8473563889606862198
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 204 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417704.983333333 (2024-05-22 15:41:44)
                  mtime 1716417704.983333333 (2024-05-22 15:41:44)
                  otime 17592186044416.000000000 (559444-03-08 01:40:16)
          item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
                  index 195 namelen 3 name: 193
          item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 4096 ram 12288
                  extent compression 0 (none)
          item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 4096 nr 8192
          item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096
  ...

So the real problem happened earlier: notice that items 4 (4k-12k) and 5
(8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and
item 5 starts at i_size.

Here is the state of the filesystem tree at the time of the crash:

  >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root
  >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0))
  >>> print_extent_buffer(nodes[0])
  leaf 30425088 level 0 items 184 generation 9 owner 5
  leaf 30425088 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
  	...
          item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160
                  generation 7 transid 7 size 4096 nbytes 12288
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 6 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417703.220000000 (2024-05-22 15:41:43)
                  mtime 1716417703.220000000 (2024-05-22 15:41:43)
                  otime 1716417703.220000000 (2024-05-22 15:41:43)
          item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13
                  index 195 namelen 3 name: 193
          item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 8192 ram 12288
                  extent compression 0 (none)
          item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096

Item 5 in the log tree corresponds to item 183 in the filesystem tree,
but nothing matches item 4. Furthermore, item 183 is the last item in
the leaf.

btrfs_log_prealloc_extents() is responsible for logging prealloc extents
beyond i_size. It first truncates any previously logged prealloc extents
that start beyond i_size. Then, it walks the filesystem tree and copies
the prealloc extent items to the log tree.

If it hits the end of a leaf, then it calls btrfs_next_leaf(), which
unlocks the tree and does another search. However, while the filesystem
tree is unlocked, an ordered extent completion may modify the tree. In
particular, it may insert an extent item that overlaps with an extent
item that was already copied to the log tree.

This may manifest in several ways depending on the exact scenario,
including an EEXIST error that is silently translated to a full sync,
overlapping items in the log tree, or this crash. This particular crash
is triggered by the following sequence of events:

- Initially, the file has i_size=4k, a regular extent from 0-4k, and a
  prealloc extent beyond i_size from 4k-12k. The prealloc extent item is
  the last item in its B-tree leaf.
- The file is fsync'd, which copies its inode item and both extent items
  to the log tree.
- An xattr is set on the file, which sets the
  BTRFS_INODE_COPY_EVERYTHING flag.
- The range 4k-8k in the file is written using direct I/O. i_size is
  extended to 8k, but the ordered extent is still in flight.
- The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this
  calls copy_inode_items_to_log(), which calls
  btrfs_log_prealloc_extents().
- btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the
  filesystem tree. Since it starts before i_size, it skips it. Since it
  is the last item in its B-tree leaf, it calls btrfs_next_leaf().
- btrfs_next_leaf() unlocks the path.
- The ordered extent completion runs, which converts the 4k-8k part of
  the prealloc extent to written and inserts the remaining prealloc part
  from 8k-12k.
- btrfs_next_leaf() does a search and finds the new prealloc extent
  8k-12k.
- btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into
  the log tree. Note that it overlaps with the 4k-12k prealloc extent
  that was copied to the log tree by the first fsync.
- fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k
  extent that was written.
- This tries to drop the range 4k-8k in the log tree, which requires
  adjusting the start of the 4k-12k prealloc extent in the log tree to
  8k.
- btrfs_set_item_key_safe() sees that there is already an extent
  starting at 8k in the log tree and calls BUG().

Fix this by detecting when we're about to insert an overlapping file
extent item in the log tree and truncating the part that would overlap.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
matttbe pushed a commit that referenced this issue Jun 28, 2024
The code in ocfs2_dio_end_io_write() estimates number of necessary
transaction credits using ocfs2_calc_extend_credits().  This however does
not take into account that the IO could be arbitrarily large and can
contain arbitrary number of extents.

Extent tree manipulations do often extend the current transaction but not
in all of the cases.  For example if we have only single block extents in
the tree, ocfs2_mark_extent_written() will end up calling
ocfs2_replace_extent_rec() all the time and we will never extend the
current transaction and eventually exhaust all the transaction credits if
the IO contains many single block extents.  Once that happens a
WARN_ON(jbd2_handle_buffer_credits(handle) <= 0) is triggered in
jbd2_journal_dirty_metadata() and subsequently OCFS2 aborts in response to
this error.  This was actually triggered by one of our customers on a
heavily fragmented OCFS2 filesystem.

To fix the issue make sure the transaction always has enough credits for
one extent insert before each call of ocfs2_mark_extent_written().

Heming Zhao said:

------
PANIC: "Kernel panic - not syncing: OCFS2: (device dm-1): panic forced after error"

PID: xxx  TASK: xxxx  CPU: 5  COMMAND: "SubmitThread-CA"
  #0 machine_kexec at ffffffff8c069932
  #1 __crash_kexec at ffffffff8c1338fa
  #2 panic at ffffffff8c1d69b9
  #3 ocfs2_handle_error at ffffffffc0c86c0c [ocfs2]
  #4 __ocfs2_abort at ffffffffc0c88387 [ocfs2]
  #5 ocfs2_journal_dirty at ffffffffc0c51e98 [ocfs2]
  #6 ocfs2_split_extent at ffffffffc0c27ea3 [ocfs2]
  #7 ocfs2_change_extent_flag at ffffffffc0c28053 [ocfs2]
  #8 ocfs2_mark_extent_written at ffffffffc0c28347 [ocfs2]
  #9 ocfs2_dio_end_io_write at ffffffffc0c2bef9 [ocfs2]
#10 ocfs2_dio_end_io at ffffffffc0c2c0f5 [ocfs2]
#11 dio_complete at ffffffff8c2b9fa7
#12 do_blockdev_direct_IO at ffffffff8c2bc09f
#13 ocfs2_direct_IO at ffffffffc0c2b653 [ocfs2]
#14 generic_file_direct_write at ffffffff8c1dcf14
#15 __generic_file_write_iter at ffffffff8c1dd07b
#16 ocfs2_file_write_iter at ffffffffc0c49f1f [ocfs2]
#17 aio_write at ffffffff8c2cc72e
#18 kmem_cache_alloc at ffffffff8c248dde
#19 do_io_submit at ffffffff8c2ccada
#20 do_syscall_64 at ffffffff8c004984
#21 entry_SYSCALL_64_after_hwframe at ffffffff8c8000ba

Link: https://lkml.kernel.org/r/20240617095543.6971-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20240614145243.8837-1-jack@suse.cz
Fixes: c15471f ("ocfs2: fix sparse file & data ordering issue in direct io")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
matttbe pushed a commit that referenced this issue Jul 1, 2024
Petr Machata says:

====================
selftest: Clean-up and stabilize mirroring tests

The mirroring selftests work by sending ICMP traffic between two hosts.
Along the way, this traffic is mirrored to a gretap netdevice, and counter
taps are then installed strategically along the path of the mirrored
traffic to verify the mirroring took place.

The problem with this is that besides mirroring the primary traffic, any
other service traffic is mirrored as well. At the same time, because the
tests need to work in HW-offloaded scenarios, the ability of the device to
do arbitrary packet inspection should not be taken for granted. Most tests
therefore simply use matchall, one uses flower to match on IP address.
As a result, the selftests are noisy.

mirror_test() accommodated this noisiness by giving the counters an
allowance of several packets. But that only works up to a point, and on
busy systems won't be always enough.

In this patch set, clean up and stabilize the mirroring selftests. The
original intention was to port the tests over to UDP, but the logic of
ICMP ends up being so entangled in the mirroring selftests that the
changes feel overly invasive. Instead, ICMP is kept, but where possible,
we match on ICMP message type, thus filtering out hits by other ICMP
messages.

Where this is not practical (where the counter tap is put on a device
that carries encapsulated packets), switch the counter condition to _at
least_ X observed packets. This is less robust, but barely so --
probably the only scenario that this would not catch is something like
erroneous packet duplication, which would hopefully get caught by the
numerous other tests in this extensive suite.

- Patches #1 to #3 clean up parameters at various helpers.

- Patches #4 to #6 stabilize the mirroring selftests as described above.

- Mirroring tests currently allow testing SW datapath even on HW
  netdevices by trapping traffic to the SW datapath. This complicates
  the tests a bit without a good reason: to test SW datapath, just run
  the selftests on the veth topology. Thus in patch #7, drop support for
  this dual SW/HW testing.

- At this point, some cleanups were either made possible by the previous
  patches, or were always possible. In patches #8 to #11, realize these
  cleanups.

- In patch #12, fix mlxsw mirror_gre selftest to respect setting TESTS.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
matttbe pushed a commit that referenced this issue Jul 2, 2024
…rnel/git/netfilter/nf-next into main

Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next:

Patch #1 to #11 to shrink memory consumption for transaction objects:

  struct nft_trans_chain { /* size: 120 (-32), cachelines: 2, members: 10 */
  struct nft_trans_elem { /* size: 72 (-40), cachelines: 2, members: 4 */
  struct nft_trans_flowtable { /* size: 80 (-48), cachelines: 2, members: 5 */
  struct nft_trans_obj { /* size: 72 (-40), cachelines: 2, members: 4 */
  struct nft_trans_rule { /* size: 80 (-32), cachelines: 2, members: 6 */
  struct nft_trans_set { /* size: 96 (-24), cachelines: 2, members: 8 */
  struct nft_trans_table { /* size: 56 (-40), cachelines: 1, members: 2 */

  struct nft_trans_elem can now be allocated from kmalloc-96 instead of
  kmalloc-128 slab.

  Series from Florian Westphal. For the record, I have mangled patch #1
  to add nft_trans_container_*() and use if for every transaction object.
   I have also added BUILD_BUG_ON to ensure struct nft_trans always comes
  at the beginning of the container transaction object. And few minor
  cleanups, any new bugs are of my own.

Patch #12 simplify check for SCTP GSO in IPVS, from Ismael Luceno.

Patch #13 nf_conncount key length remains in the u32 bound, from Yunjian Wang.

Patch #14 removes unnecessary check for CTA_TIMEOUT_L3PROTO when setting
          default conntrack timeouts via nfnetlink_cttimeout API, from
          Lin Ma.

Patch #15 updates NFT_SECMARK_CTX_MAXLEN to 4096, SELinux could use
          larger secctx names than the existing 256 bytes length.

Patch #16 adds a selftest to exercise nfnetlink_queue listeners leaving
          nfnetlink_queue, from Florian Westphal.

Patch #17 increases hitcount from 255 to 65535 in xt_recent, from Phil Sutter.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
matttbe pushed a commit that referenced this issue Jul 15, 2024
Run bpf_tcp_ca selftests (./test_progs -t bpf_tcp_ca) on a Loongarch
platform, some "Segmentation fault" errors occur:

'''
 test_dctcp:PASS:bpf_dctcp__open_and_load 0 nsec
 test_dctcp:FAIL:bpf_map__attach_struct_ops unexpected error: -524
 #29/1    bpf_tcp_ca/dctcp:FAIL
 test_cubic:PASS:bpf_cubic__open_and_load 0 nsec
 test_cubic:FAIL:bpf_map__attach_struct_ops unexpected error: -524
 #29/2    bpf_tcp_ca/cubic:FAIL
 test_dctcp_fallback:PASS:dctcp_skel 0 nsec
 test_dctcp_fallback:PASS:bpf_dctcp__load 0 nsec
 test_dctcp_fallback:FAIL:dctcp link unexpected error: -524
 #29/4    bpf_tcp_ca/dctcp_fallback:FAIL
 test_write_sk_pacing:PASS:open_and_load 0 nsec
 test_write_sk_pacing:FAIL:attach_struct_ops unexpected error: -524
 #29/6    bpf_tcp_ca/write_sk_pacing:FAIL
 test_update_ca:PASS:open 0 nsec
 test_update_ca:FAIL:attach_struct_ops unexpected error: -524
 settcpca:FAIL:setsockopt unexpected setsockopt: \
					actual -1 == expected -1
 (network_helpers.c:99: errno: No such file or directory) \
					Failed to call post_socket_cb
 start_test:FAIL:start_server_str unexpected start_server_str: \
					actual -1 == expected -1
 test_update_ca:FAIL:ca1_ca1_cnt unexpected ca1_ca1_cnt: \
					actual 0 <= expected 0
 #29/9    bpf_tcp_ca/update_ca:FAIL
 #29      bpf_tcp_ca:FAIL
 Caught signal #11!
 Stack trace:
 ./test_progs(crash_handler+0x28)[0x5555567ed91c]
 linux-vdso.so.1(__vdso_rt_sigreturn+0x0)[0x7ffffee408b0]
 ./test_progs(bpf_link__update_map+0x80)[0x555556824a78]
 ./test_progs(+0x94d68)[0x5555564c4d68]
 ./test_progs(test_bpf_tcp_ca+0xe8)[0x5555564c6a88]
 ./test_progs(+0x3bde54)[0x5555567ede54]
 ./test_progs(main+0x61c)[0x5555567efd54]
 /usr/lib64/libc.so.6(+0x22208)[0x7ffff2aaa208]
 /usr/lib64/libc.so.6(__libc_start_main+0xac)[0x7ffff2aaa30c]
 ./test_progs(_start+0x48)[0x55555646bca8]
 Segmentation fault
'''

This is because BPF trampoline is not implemented on Loongarch yet,
"link" returned by bpf_map__attach_struct_ops() is NULL. test_progs
crashs when this NULL link passes to bpf_link__update_map(). This
patch adds NULL checks for all links in bpf_tcp_ca to fix these errors.
If "link" is NULL, goto the newly added label "out" to destroy the skel.

v2:
 - use "goto out" instead of "return" as Eduard suggested.

Fixes: 06da9f3 ("selftests/bpf: Test switching TCP Congestion Control algorithms.")
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/r/b4c841492bd4ed97964e4e61e92827ce51bf1dc9.1720615848.git.tanggeliang@kylinos.cn
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants