BPF: packet scheduler #75

Open
2 of 9 tasks
Tracked by #350
matttbe opened this issue Aug 7, 2020 · 11 comments

@matttbe
Member

matttbe commented Aug 7, 2020

Extending MPTCP with BPF is clearly something we want.

Extending the upstream MPTCP kernel so that some packet scheduling decisions can be taken from BPF will be needed, and it should take priority over #74.

I think the implementation would be similar to what is done in the kernel with BPF TCP CC: the ability to write a congestion control protocol in BPF with BPF_STRUCT_OPS, see: https://linuxplumbersconf.org/event/7/contributions/687/

Or check these files:

  • BPF "kernelspace": net/ipv4/bpf_tcp_ca.c
  • BPF "userspace": tools/testing/selftests/bpf/progs/bpf_cubic.c

From what I saw, the kernel side is a bit tricky. The TCP CC solution seems to be designed this way because a new TCP CC is normally added as a new TCP CC kernel module. With BPF TCP CC, such a module can be controlled via BPF.
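To make the "pluggable ops" idea concrete, here is a minimal plain C sketch of the pattern. It is only a userspace model: `struct demo_cc_ops` and every `demo_*` name are hypothetical and much simpler than the kernel's real `struct tcp_congestion_ops`. The point is that the core only calls through a table of callbacks, so it does not care whether that table comes from built-in code, a module, or a BPF struct_ops program.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a pluggable "ops" table.
 * This is NOT the kernel's struct tcp_congestion_ops. */
struct demo_cc_ops {
	const char *name;
	/* Called when data is acknowledged; returns the new window. */
	unsigned int (*on_ack)(unsigned int cwnd, unsigned int acked);
	/* Called when a loss is detected; returns the new window. */
	unsigned int (*on_loss)(unsigned int cwnd);
};

/* One possible implementation: grow by one on ACK, halve on loss. */
unsigned int demo_reno_on_ack(unsigned int cwnd, unsigned int acked)
{
	return acked ? cwnd + 1 : cwnd;
}

unsigned int demo_reno_on_loss(unsigned int cwnd)
{
	unsigned int half = cwnd / 2;

	/* Never go below a window of 2 in this toy model. */
	return half < 2 ? 2 : half;
}

const struct demo_cc_ops demo_reno = {
	.name = "demo_reno",
	.on_ack = demo_reno_on_ack,
	.on_loss = demo_reno_on_loss,
};

/* The "core" only ever calls through the ops table. */
unsigned int demo_core_handle_ack(const struct demo_cc_ops *ops,
				  unsigned int cwnd, unsigned int acked)
{
	return ops->on_ack(cwnd, acked);
}

unsigned int demo_core_handle_loss(const struct demo_cc_ops *ops,
				   unsigned int cwnd)
{
	return ops->on_loss(cwnd);
}
```

Swapping the algorithm then means passing a different `demo_cc_ops` table to the core, which is roughly the role a struct_ops map plays for BPF TCP CC.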

On our side with MPTCP, we currently don't have the ability to create other packet schedulers (or path managers).

  • Maybe a first step would be to add the ability to select between different packet schedulers implemented in the kernel.
  • Or maybe the current scheduler could gain the ability to be controlled via BPF. But in that case, can we easily have both, i.e. a single packet scheduler that does the job with and without a BPF program controlling it?
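The first option could look roughly like the sketch below, again as a plain C userspace model and not actual kernel code. All names (`demo_sched`, `demo_register_sched()`, `demo_select_sched()`) are made up for illustration. A built-in default scheduler is always available, other schedulers register by name, and selecting an unknown name falls back to the default. The fallback is an assumption of this sketch; returning an error would be just as valid.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_SCHEDS 4

/* Hypothetical scheduler descriptor: a name plus one callback. */
struct demo_sched {
	const char *name;
	/* Returns the index of the subflow to use, or -1 if none. */
	int (*pick)(int nr_subflows);
};

int demo_default_pick(int nr_subflows)
{
	return nr_subflows > 0 ? 0 : -1;
}

int demo_last_pick(int nr_subflows)
{
	return nr_subflows > 0 ? nr_subflows - 1 : -1;
}

/* Built-in scheduler, used when nothing else is selected. */
struct demo_sched demo_default_sched = { "default", demo_default_pick };
/* An alternative scheduler that has to be registered first. */
struct demo_sched demo_last_sched = { "last", demo_last_pick };

static struct demo_sched *demo_registry[DEMO_MAX_SCHEDS];
static size_t demo_nr_scheds;

static struct demo_sched *demo_lookup(const char *name)
{
	for (size_t i = 0; i < demo_nr_scheds; i++)
		if (strcmp(demo_registry[i]->name, name) == 0)
			return demo_registry[i];
	return NULL;
}

/* Returns 0 on success, -1 on invalid input, full table or duplicate. */
int demo_register_sched(struct demo_sched *sched)
{
	if (!sched || !sched->name || !sched->pick)
		return -1;
	if (demo_nr_scheds >= DEMO_MAX_SCHEDS || demo_lookup(sched->name))
		return -1;
	demo_registry[demo_nr_scheds++] = sched;
	return 0;
}

/* Unknown or NULL names fall back to the built-in default. */
struct demo_sched *demo_select_sched(const char *name)
{
	struct demo_sched *found = name ? demo_lookup(name) : NULL;

	return found ? found : &demo_default_sched;
}
```

With such a registry, a scheduler provided by BPF would just be one more entry, and the built-in one keeps working when no BPF program is loaded.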

Issues:

Linked to #350:

@matttbe matttbe added this to Needs triage in MPTCP Future via automation Aug 7, 2020
jenkins-tessares pushed a commit that referenced this issue Jan 29, 2021
…abled

When booting a kernel which has been built with CONFIG_AMD_MEM_ENCRYPT
enabled as a Xen pv guest a warning is issued for each processor:

[    5.964347] ------------[ cut here ]------------
[    5.968314] WARNING: CPU: 0 PID: 1 at /home/gross/linux/head/arch/x86/xen/enlighten_pv.c:660 get_trap_addr+0x59/0x90
[    5.972321] Modules linked in:
[    5.976313] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W         5.11.0-rc5-default #75
[    5.980313] Hardware name: Dell Inc. OptiPlex 9020/0PC5F7, BIOS A05 12/05/2013
[    5.984313] RIP: e030:get_trap_addr+0x59/0x90
[    5.988313] Code: 42 10 83 f0 01 85 f6 74 04 84 c0 75 1d b8 01 00 00 00 c3 48 3d 00 80 83 82 72 08 48 3d 20 81 83 82 72 0c b8 01 00 00 00 eb db <0f> 0b 31 c0 c3 48 2d 00 80 83 82 48 ba 72 1c c7 71 1c c7 71 1c 48
[    5.992313] RSP: e02b:ffffc90040033d38 EFLAGS: 00010202
[    5.996313] RAX: 0000000000000001 RBX: ffffffff82a141d0 RCX: ffffffff8222ec38
[    6.000312] RDX: ffffffff8222ec38 RSI: 0000000000000005 RDI: ffffc90040033d40
[    6.004313] RBP: ffff8881003984a0 R08: 0000000000000007 R09: ffff888100398000
[    6.008312] R10: 0000000000000007 R11: ffffc90040246000 R12: ffff8884082182a8
[    6.012313] R13: 0000000000000100 R14: 000000000000001d R15: ffff8881003982d0
[    6.016316] FS:  0000000000000000(0000) GS:ffff888408200000(0000) knlGS:0000000000000000
[    6.020313] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[    6.024313] CR2: ffffc900020ef000 CR3: 000000000220a000 CR4: 0000000000050660
[    6.028314] Call Trace:
[    6.032313]  cvt_gate_to_trap.part.7+0x3f/0x90
[    6.036313]  ? asm_exc_double_fault+0x30/0x30
[    6.040313]  xen_convert_trap_info+0x87/0xd0
[    6.044313]  xen_pv_cpu_up+0x17a/0x450
[    6.048313]  bringup_cpu+0x2b/0xc0
[    6.052313]  ? cpus_read_trylock+0x50/0x50
[    6.056313]  cpuhp_invoke_callback+0x80/0x4c0
[    6.060313]  _cpu_up+0xa7/0x140
[    6.064313]  cpu_up+0x98/0xd0
[    6.068313]  bringup_nonboot_cpus+0x4f/0x60
[    6.072313]  smp_init+0x26/0x79
[    6.076313]  kernel_init_freeable+0x103/0x258
[    6.080313]  ? rest_init+0xd0/0xd0
[    6.084313]  kernel_init+0xa/0x110
[    6.088313]  ret_from_fork+0x1f/0x30
[    6.092313] ---[ end trace be9ecf17dceeb4f3 ]---

Reason is that there is no Xen pv trap entry for X86_TRAP_VC.

Fix that by adding a generic trap handler for unknown traps and wire all
unknown bare metal handlers to this generic handler, which will just
crash the system in case such a trap will ever happen.

Fixes: 0786138 ("x86/sev-es: Add a Runtime #VC Exception Handler")
Cc: <stable@vger.kernel.org> # v5.10
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
@matttbe
Member Author

matttbe commented Apr 19, 2021

@geliangtang I just updated the description following the discussion we had.

@geliangtang
Member

Round-robin packet scheduler support #194

@geliangtang
Member

Hi Matt, I just assigned this issue to myself. I'll try to implement the round-robin scheduler using BPF.
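For reference, the round-robin selection logic itself is small. The sketch below is plain C that runs in userspace, not the BPF implementation: `struct demo_subflow`, `struct demo_rr_state` and `demo_rr_pick()` are hypothetical names, and a real scheduler would work on the kernel's subflow contexts instead.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical minimal view of a subflow. */
struct demo_subflow {
	bool usable;	/* true if data can be sent on this subflow */
};

/* Per-connection state: index of the subflow used last time. */
struct demo_rr_state {
	size_t last;
};

/*
 * Pick the next usable subflow after state->last, wrapping around.
 * Returns its index and records it in state, or -1 if none is usable.
 */
int demo_rr_pick(struct demo_rr_state *state,
		 const struct demo_subflow *subflows, size_t n)
{
	if (n == 0)
		return -1;

	for (size_t step = 1; step <= n; step++) {
		size_t idx = (state->last + step) % n;

		if (subflows[idx].usable) {
			state->last = idx;
			return (int)idx;
		}
	}
	return -1;
}
```

Initialising `last` to `n - 1` makes the first call start from subflow 0. Unusable subflows are skipped without being remembered, so they get their turn again as soon as they become usable.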

@matttbe matttbe removed this from Needs triage in MPTCP Future Oct 7, 2021
@matttbe matttbe added this to To do in MPTCP Next (5.16) via automation Oct 7, 2021
@matttbe
Member Author

matttbe commented Oct 7, 2021

(PS: I don't know if notifications are sent when I move items in GitHub Projects, but just in case: I'm moving all assigned tickets from "Future" to "Next". It doesn't mean they have to be implemented for the next version, it just makes the tracking easier when generating a changelog ;-) )

@matttbe matttbe removed this from To do in MPTCP Next (5.16) Nov 4, 2021
@matttbe matttbe added this to To do in MPTCP Next (5.17) via automation Nov 4, 2021
jenkins-tessares pushed a commit that referenced this issue Nov 19, 2021
…fails

Check for a valid hv_vp_index array prior to derefencing hv_vp_index when
setting Hyper-V's TSC change callback.  If Hyper-V setup failed in
hyperv_init(), the kernel will still report that it's running under
Hyper-V, but will have silently disabled nearly all functionality.

  BUG: kernel NULL pointer dereference, address: 0000000000000010
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP
  CPU: 4 PID: 1 Comm: swapper/0 Not tainted 5.15.0-rc2+ #75
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:set_hv_tscchange_cb+0x15/0xa0
  Code: <8b> 04 82 8b 15 12 17 85 01 48 c1 e0 20 48 0d ee 00 01 00 f6 c6 08
  ...
  Call Trace:
   kvm_arch_init+0x17c/0x280
   kvm_init+0x31/0x330
   vmx_init+0xba/0x13a
   do_one_initcall+0x41/0x1c0
   kernel_init_freeable+0x1f2/0x23b
   kernel_init+0x16/0x120
   ret_from_fork+0x22/0x30

Fixes: 9328626 ("x86/hyperv: Reenlightenment notifications support")
Cc: stable@vger.kernel.org
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20211104182239.1302956-2-seanjc@google.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
@matttbe matttbe removed this from To do in MPTCP Next (5.17) Jan 14, 2022
@matttbe matttbe added this to To do in MPTCP Next (5.18) via automation Jan 14, 2022
@matttbe matttbe removed this from To do in MPTCP Next (5.18) Mar 31, 2022
@matttbe matttbe added this to To do in MPTCP Next (5.19) via automation Mar 31, 2022
@matttbe matttbe moved this from To do to In progress in MPTCP Next (5.19) May 3, 2022
@matttbe matttbe removed this from In progress in MPTCP Next (5.19) May 30, 2022
@matttbe matttbe added this to To do in MPTCP Next (6.0) via automation May 30, 2022
@matttbe matttbe moved this from To do to In progress in MPTCP Next (6.0) May 30, 2022
@matttbe matttbe removed this from In progress in MPTCP Next (6.0) Aug 10, 2022
@matttbe matttbe added this to To do in MPTCP Next (6.1 LTS) via automation Aug 10, 2022
@matttbe matttbe moved this from To do to In progress in MPTCP Next (6.1 LTS) Aug 10, 2022
@matttbe
Member Author

matttbe commented Sep 8, 2022

Status update:

  • some patches are already in our 'export' branch
  • but still in development, e.g. patches

@matttbe
Member Author

matttbe commented Sep 19, 2022

Some feedback from LPC 2022:

  • BPF development is going to be similar to working on kernel modules, but with help from the verifier and other tooling
  • using BPF STRUCT_OPS seems to be the right direction
  • BPF code depends on the kernel version: the interface is not a UAPI, i.e. not an API that is exposed to userspace and can never be changed. So we can change the callbacks, kfuncs, etc.
  • It is possible to mark an API as unstable/stable
  • There are techniques to have BPF code working on multiple kernels (CO-RE: Compile Once, Run Everywhere) but supporting that might require specific modifications
  • READ_ONCE(), WRITE_ONCE(), etc. should be supported by BPF: to be tested. (but maybe not needed?)
  • Regarding security (e.g. access to the token), the best approach is to mention it clearly in cover letters
  • Not all the smart logic should live in kfuncs: a "userspace" (BPF) scheduler should be able to iterate over all subflows and take decisions itself, not just ask the kernel to use one mode or another.

The slides and the video are available online: https://lpc.events/event/16/contributions/1354/
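To illustrate the last point of the list (a scheduler iterating over the subflows and deciding by itself), here is a hedged plain C sketch of a "lowest RTT first" policy. It runs in userspace and is not BPF code. The `struct demo_subflow` fields and `demo_minrtt_schedule()` are hypothetical and only approximate what a BPF scheduler could read from the kernel's subflow contexts.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-subflow view exposed to the scheduler. */
struct demo_subflow {
	uint32_t srtt_us;	/* smoothed RTT in microseconds */
	bool backup;		/* subflow is flagged as backup */
	bool can_send;		/* subflow currently accepts data */
	bool scheduled;		/* output: set by the scheduler */
};

/*
 * Walk all subflows and schedule the sendable one with the lowest RTT.
 * Backup subflows are only considered when no regular one can send.
 * Returns the chosen index, or -1 if nothing can be scheduled.
 */
int demo_minrtt_schedule(struct demo_subflow *subflows, size_t n)
{
	int best = -1;

	for (int want_backup = 0; want_backup <= 1 && best < 0; want_backup++) {
		for (size_t i = 0; i < n; i++) {
			if (!subflows[i].can_send)
				continue;
			if (subflows[i].backup != (want_backup == 1))
				continue;
			if (best < 0 ||
			    subflows[i].srtt_us < subflows[best].srtt_us)
				best = (int)i;
		}
	}

	if (best >= 0)
		subflows[best].scheduled = true;
	return best;
}
```

The whole decision is taken in the scheduler: the "kernel side" of this model only has to expose the subflow list and honour the `scheduled` flag.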

@matttbe matttbe removed this from In progress in MPTCP Next (6.1 LTS) Oct 5, 2022
@matttbe matttbe added this to To Do in MPTCP Next (6.2) via automation Oct 5, 2022
@matttbe matttbe moved this from To Do to In progress in MPTCP Next (6.2) Oct 5, 2022
@VenkateswaranJ

Does this task implement a redundant scheduler?

@matttbe
Member Author

matttbe commented Nov 30, 2022

@VenkateswaranJ not yet but it is in development to validate the API, see https://lore.kernel.org/all/cover.1669605531.git.geliang.tang@suse.com/
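As a rough idea of what "redundant" means here, assuming the usual definition (send the same data on every available subflow): the selection step boils down to marking all sendable subflows. The sketch below is plain userspace C with hypothetical names (`struct demo_subflow`, `demo_redundant_schedule()`), and it is not taken from the series linked above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-subflow view exposed to the scheduler. */
struct demo_subflow {
	bool can_send;	/* subflow currently accepts data */
	bool scheduled;	/* output: data should be sent on this subflow */
};

/*
 * Redundant policy: schedule every subflow that can send, and clear
 * the flag on the others. Returns the number of scheduled subflows.
 */
size_t demo_redundant_schedule(struct demo_subflow *subflows, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		subflows[i].scheduled = subflows[i].can_send;
		if (subflows[i].scheduled)
			count++;
	}
	return count;
}
```

Unlike a policy that returns a single subflow, this one needs an API that lets the scheduler flag several subflows for the same data, which is one reason it is useful for validating the API.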

matttbe pushed a commit that referenced this issue Dec 14, 2022
Syzkaller reports a NULL deref bug as follows:

 BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3
 Read of size 4 at addr 0000000000000138 by task file1/1955

 CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0xcd/0x134
  ? io_tctx_exit_cb+0x53/0xd3
  kasan_report+0xbb/0x1f0
  ? io_tctx_exit_cb+0x53/0xd3
  kasan_check_range+0x140/0x190
  io_tctx_exit_cb+0x53/0xd3
  task_work_run+0x164/0x250
  ? task_work_cancel+0x30/0x30
  get_signal+0x1c3/0x2440
  ? lock_downgrade+0x6e0/0x6e0
  ? lock_downgrade+0x6e0/0x6e0
  ? exit_signals+0x8b0/0x8b0
  ? do_raw_read_unlock+0x3b/0x70
  ? do_raw_spin_unlock+0x50/0x230
  arch_do_signal_or_restart+0x82/0x2470
  ? kmem_cache_free+0x260/0x4b0
  ? putname+0xfe/0x140
  ? get_sigframe_size+0x10/0x10
  ? do_execveat_common.isra.0+0x226/0x710
  ? lockdep_hardirqs_on+0x79/0x100
  ? putname+0xfe/0x140
  ? do_execveat_common.isra.0+0x238/0x710
  exit_to_user_mode_prepare+0x15f/0x250
  syscall_exit_to_user_mode+0x19/0x50
  do_syscall_64+0x42/0xb0
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
 RIP: 0023:0x0
 Code: Unable to access opcode bytes at 0xffffffffffffffd6.
 RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b
 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  </TASK>
 Kernel panic - not syncing: panic_on_warn set ...

This happens because the adding of task_work from io_ring_exit_work()
isn't synchronized with canceling all work items from eg exec. The
execution of the two are ordered in that they are both run by the task
itself, but if io_tctx_exit_cb() is queued while we're canceling all
work items off exec AND gets executed when the task exits to userspace
rather than in the main loop in io_uring_cancel_generic(), then we can
find current->io_uring == NULL and hit the above crash.

It's safe to add this NULL check here, because the execution of the two
paths are done by the task itself.

Cc: stable@vger.kernel.org
Fixes: d56d938 ("io_uring: do ctx initiated file note removal")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Link: https://lore.kernel.org/r/20221206093833.3812138-1-harshit.m.mogalapalli@oracle.com
[axboe: add code comment and also put an explanation in the commit msg]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
@matttbe matttbe removed this from In progress in MPTCP Next (6.2) Dec 14, 2022
@matttbe matttbe added this to To do in MPTCP Next (6.3) via automation Dec 14, 2022
@matttbe matttbe moved this from To do to In progress in MPTCP Next (6.3) Dec 14, 2022
@matttbe matttbe added the sched packets scheduler label Feb 1, 2023
@matttbe matttbe removed this from In progress in MPTCP Next (6.3) Feb 22, 2023
@matttbe matttbe added this to To do in MPTCP Next (6.4) via automation Feb 22, 2023
@matttbe matttbe moved this from To do to In progress in MPTCP Next (6.4) Feb 22, 2023
@matttbe
Member Author

matttbe commented Feb 23, 2023

(I just updated the description to add this: )

Issues:

@matttbe
Member Author

matttbe commented Apr 17, 2023

I just added one item to the TODO list:

  • BPF selftests: use a dedicated netns for each test, see 02d6a05

@geliangtang
Member

@matttbe Matt, the task "BPF selftests: use a dedicated netns for each test" has been completed and can be closed now.

@geliangtang geliangtang added the bpf label Aug 4, 2023
jenkins-tessares pushed a commit that referenced this issue Sep 1, 2023
With latest clang18, I hit test_progs failures for the following test:

  #13/2    bpf_cookie/multi_kprobe_link_api:FAIL
  #13/3    bpf_cookie/multi_kprobe_attach_api:FAIL
  #13      bpf_cookie:FAIL
  #75      fentry_fexit:FAIL
  #76/1    fentry_test/fentry:FAIL
  #76      fentry_test:FAIL
  #80/1    fexit_test/fexit:FAIL
  #80      fexit_test:FAIL
  #110/1   kprobe_multi_test/skel_api:FAIL
  #110/2   kprobe_multi_test/link_api_addrs:FAIL
  #110/3   kprobe_multi_test/link_api_syms:FAIL
  #110/4   kprobe_multi_test/attach_api_pattern:FAIL
  #110/5   kprobe_multi_test/attach_api_addrs:FAIL
  #110/6   kprobe_multi_test/attach_api_syms:FAIL
  #110     kprobe_multi_test:FAIL

For example, for #13/2, the error messages are:

  [...]
  kprobe_multi_test_run:FAIL:kprobe_test7_result unexpected kprobe_test7_result: actual 0 != expected 1
  [...]
  kprobe_multi_test_run:FAIL:kretprobe_test7_result unexpected kretprobe_test7_result: actual 0 != expected 1

clang17 does not have this issue.

Further investigation shows that kernel func bpf_fentry_test7(), used in
the above tests, is inlined by the compiler although it is marked as
noinline.

  int noinline bpf_fentry_test7(struct bpf_fentry_test_t *arg)
  {
        return (long)arg;
  }

It is known that for simple functions like the above (e.g. just returning
a constant or an input argument), the clang compiler may still do inlining
for a noinline function. Adding 'asm volatile ("")' in the beginning of the
bpf_fentry_test7() can prevent inlining.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20230826200843.2210074-1-yonghong.song@linux.dev
Projects
Status: In Progress
MPTCP Next (6.4)
In progress

3 participants