
BPF rbtree next-gen datastructure #4174

Closed
wants to merge 14 commits into from

Conversation

kernel-patches-bot

Pull request for series with
subject: BPF rbtree next-gen datastructure
version: 1
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=702327

@kernel-patches-bot

Upstream branch: aa67961
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=702327
version: 1

@kernel-patches-bot

Upstream branch: c21dc52
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=702327
version: 1

@kernel-patches-bot

Upstream branch: 235d2ef
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=702327
version: 1

btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
There, a BTF record is created for any type containing a spin_lock or
any next-gen datastructure node/head.

Currently, for non-MAP_VALUE types, reg_btf_record will only search for
a record using struct_meta_tab if the reg->type exactly matches
(PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
"allocated obj" type - returned from bpf_obj_new - might pick up other
flags while working its way through the program.

Loosen the check to be exact for base_type and just use MEM_ALLOC mask
for type_flag.
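A minimal sketch of the loosened lookup, reusing the verifier's base_type()
and type_flag() helpers and the btf_find_struct_meta() accessor for
struct_meta_tab (illustrative only, not the literal patch; rec is the
record reg_btf_record returns):

  if (base_type(reg->type) == PTR_TO_BTF_ID &&
      type_flag(reg->type) & MEM_ALLOC) {
    struct btf_struct_meta *meta;

    /* any allocated-obj reg now reaches struct_meta_tab, regardless of
     * extra flags such as PTR_UNTRUSTED
     */
    meta = btf_find_struct_meta(reg->btf, reg->btf_id);
    if (meta)
      rec = meta->record;
  }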

This patch is marked Fixes as the original intent of reg_btf_record was
unlikely to have been to fail finding btf_record for valid alloc obj
types with additional flags, some of which (e.g. PTR_UNTRUSTED)
are valid register type states for alloc obj independent of this series.
However, I didn't find a specific broken repro case outside of this
series' added functionality, so it's possible that nothing was
triggering this logic error before.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: 4e814da ("bpf: Allow locking bpf_spin_lock in allocated objects")
map_check_btf calls btf_parse_fields to create a btf_record for its
value_type. If there are no special fields in the value_type,
btf_parse_fields returns NULL, whereas if there are special value_type
fields but they are invalid in some way, an error is returned.

An example invalid state would be:

  struct node_data {
    struct bpf_rb_node node;
    int data;
  };

  private(A) struct bpf_spin_lock glock;
  private(A) struct bpf_list_head ghead __contains(node_data, node);

ghead should be invalid as its __contains tag points to a field with
type != "bpf_list_node".

Before this patch, such a scenario would result in btf_parse_fields
returning an error ptr, subsequent !IS_ERR_OR_NULL check failing,
and btf_check_and_fixup_fields returning 0, which would then be
returned by map_check_btf.

After this patch's changes, -EINVAL would be returned by map_check_btf
and the map would correctly fail to load.
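A minimal sketch of the fix (not the literal patch); field_mask is a
placeholder for the actual mask of special field types that map_check_btf
passes:

  map->record = btf_parse_fields(btf, value_type, field_mask,
                                 map->value_size);
  if (IS_ERR(map->record))
    return -EINVAL;  /* previously this error was silently dropped */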

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: aa3496a ("bpf: Refactor kptr_off_tab into btf_record")
This is mostly a nonfunctional change. The verifier log message
"expected false release_on_unlock" was missing a newline, so add it and
move some checks around to reduce indentation level.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Many of the structs recently added to track field info for linked-list
head are useful as-is for rbtree root. So let's do a mechanical renaming
of list_head-related types and fields:

include/linux/bpf.h:
  struct btf_field_list_head -> struct btf_field_datastructure_head
  list_head -> datastructure_head in struct btf_field union
kernel/bpf/btf.c:
  list_head -> datastructure_head in struct btf_field_info

This is a nonfunctional change, functionality to actually use these
fields for rbtree will be added in further patches.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
This patch adds special BPF_RB_{ROOT,NODE} btf_field_types similar to
BPF_LIST_{HEAD,NODE}, adds the necessary plumbing to detect the new
types, and adds bpf_rb_root_free function for freeing bpf_rb_root in
map_values.

structs bpf_rb_root and bpf_rb_node are opaque types meant to
obscure structs rb_root_cached and rb_node, respectively.
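A plausible shape for these opaque types, assuming their sizes mirror
rb_root_cached (two pointers) and rb_node (three pointers); the actual
uapi layout may differ:

  struct bpf_rb_root {
    __u64 :64;
    __u64 :64;
  } __attribute__((aligned(8)));

  struct bpf_rb_node {
    __u64 :64;
    __u64 :64;
    __u64 :64;
  } __attribute__((aligned(8)));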

btf_struct_access will prevent BPF programs from touching these special
fields automatically now that they're recognized.

btf_check_and_fixup_fields now groups list_head and rb_root together as
"owner" fields and {list,rb}_node as "ownee", and does same ownership
cycle checking as before. Note this function does _not_ prevent
ownership type mixups (e.g. rb_root owning list_node) - that's handled
by btf_parse_datastructure_head.

After this patch, a bpf program can have a struct bpf_rb_root in a
map_value, but not add anything to nor do anything useful with it.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
This patch adds implementations of bpf_rbtree_{add,remove,first}
and teaches verifier about their BTF_IDs as well as those of
bpf_rb_{root,node}.

All three kfuncs have some nonstandard component to their verification
that needs to be addressed in future patches before programs can
properly use them (rough signatures are sketched after this list):

  * bpf_rbtree_add:     Takes 'less' callback, need to verify it

  * bpf_rbtree_first:   Returns ptr_to_node_type(off=rb_node_off) instead
                        of ptr_to_rb_node(off=0). Return value ref should
                        be released on unlock.

  * bpf_rbtree_remove:  Returns ptr_to_node_type(off=rb_node_off) instead
                        of ptr_to_rb_node(off=0). 2nd arg (node) is a
                        release_on_unlock + PTR_UNTRUSTED reg.
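Rough signatures as assumed in this discussion (kfunc prototypes are
exposed to programs via BTF, so these declarations are illustrative):

  void bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
                      bool (less)(struct bpf_rb_node *a,
                                  const struct bpf_rb_node *b));

  struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root,
                                        struct bpf_rb_node *node);

  struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root);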

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Now that we find bpf_rb_root and bpf_rb_node in structs, let's give args
that contain those types special classification and properly handle
these types when checking kfunc args.

"Properly handling" these types largely requires generalizing similar
handling for bpf_list_{head,node}, with little new logic added in this
patch.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Some BPF helpers take a callback function which the helper calls. For
each helper that takes such a callback, there's a special call to
__check_func_call with a callback-state-setting callback that sets up
verifier bpf_func_state for the callback's frame.

kfuncs don't have any of this infrastructure yet, so let's add it in
this patch, following existing helper pattern as much as possible. To
validate functionality of this added plumbing, this patch adds
callback handling for the bpf_rbtree_add kfunc and hopes to lay
groundwork for future next-gen datastructure callbacks.
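For orientation, a 'less' callback passed to bpf_rbtree_add might look
like the sketch below; struct node_data and its key field are
hypothetical, mirroring examples used elsewhere in this series:

  static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
  {
    struct node_data *node_a, *node_b;

    node_a = container_of(a, struct node_data, node);
    node_b = container_of(b, struct node_data, node);

    return node_a->key < node_b->key;
  }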

In the "general plumbing" category we have:

  * check_kfunc_call doing callback verification right before clearing
    CALLER_SAVED_REGS, exactly like check_helper_call
  * recognition of func_ptr BTF types in kfunc args as
    KF_ARG_PTR_TO_CALLBACK + propagation of subprogno for this arg type

In the "rbtree_add / next-gen datastructure-specific plumbing" category:

  * Since bpf_rbtree_add must be called while the spin_lock associated
    with the tree is held, don't complain when callback's func_state
    doesn't unlock it by frame exit
  * Mark rbtree_add callback's args PTR_UNTRUSTED to prevent rbtree
    api functions from being called in the callback

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Newly-added bpf_rbtree_{remove,first} kfuncs have some special properties
that require handling in the verifier:

  * both bpf_rbtree_remove and bpf_rbtree_first return the type containing
    the bpf_rb_node field, with the offset set to that field's offset,
    instead of a struct bpf_rb_node *
    * Generalized existing next-gen list verifier handling for this
      as mark_reg_datastructure_node helper

  * Unlike other functions, which set release_on_unlock on one of their
    args, bpf_rbtree_first has no node argument and instead sets
    release_on_unlock on its return value

  * bpf_rbtree_remove's node input is a node that's been inserted
    in the tree. Only non-owning references (PTR_UNTRUSTED +
    release_on_unlock) refer to such nodes, but kfuncs don't take
    PTR_UNTRUSTED args
    * Added special carveout for bpf_rbtree_remove to take PTR_UNTRUSTED
    * Since node input already has release_on_unlock set, don't set
      it again

This patch, along with the previous one, completes special verifier
handling for all rbtree API functions added in this series.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Current comment in BPF_PROBE_MEM jit code claims that verifier prevents
insn->off < 0, but this appears to not be true irrespective of changes
in this series. Regardless, changes in this series will result in an
example like:

  struct example_node {
    long key;
    long val;
    struct bpf_rb_node node;
  };

  /* In BPF prog, assume root contains example_node nodes */
  struct bpf_rb_node *res = bpf_rbtree_first(&root);
  if (!res)
    return 1;

  struct example_node *n = container_of(res, struct example_node, node);
  long key = n->key;

This results in a load with off = -16, as bpf_rbtree_first's return is
modified by the verifier to be PTR_TO_BTF_ID of example_node with
offset = offsetof(struct example_node, node), instead of PTR_TO_BTF_ID
of bpf_rb_node. So it's necessary to support negative insn->off when
JITing BPF_PROBE_MEM.

In order to ensure that a page fault for a BPF_PROBE_MEM load of *src_reg +
insn->off is safely handled, we must confirm that *src_reg + insn->off is
in kernel memory. Two runtime checks are emitted to confirm that:

  1) (*src_reg + insn->off) > boundary between user and kernel address
  spaces
  2) (*src_reg + insn->off) does not overflow to a small positive
  number. This might happen if some function meant to set src_reg
  returns ERR_PTR(-EINVAL) or similar.

Check 1 is currently slightly off - it compares a

  u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);

to *src_reg, aborting the load if limit is larger. Rewriting this as an
inequality:

  *src_reg > TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off)
  *src_reg - abs(insn->off) > TASK_SIZE_MAX + PAGE_SIZE

shows that this isn't quite right even if insn->off is positive, as we
really want:

  *src_reg + insn->off > TASK_SIZE_MAX + PAGE_SIZE
  *src_reg > TASK_SIZE_MAX + PAGE_SIZE - insn->off

*src_reg + insn->off is the address we'll be loading from, not
*src_reg - insn->off or *src_reg - abs(insn->off). So change the
subtraction to an addition and remove the abs(), as the comment
indicates it was only added to ignore negative insn->off.
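Per the derivation above, check 1's limit becomes (sketch, not the
literal patch):

  u64 limit = TASK_SIZE_MAX + PAGE_SIZE - insn->off;

  /* abort the load if *src_reg < limit, i.e. proceed only when
   * *src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE
   */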

For Check 2, currently "does not overflow to a small positive number" is
confirmed by emitting an 'add insn->off, src_reg' instruction and
checking for carry flag. While this works fine for a positive insn->off,
a small negative insn->off like -16 is almost guaranteed to wrap over to
a small positive number when added to any kernel address.

This patch addresses this by not doing Check 2 at BPF prog runtime when
insn->off is negative, rather doing a stronger check at JIT-time. The
logic supporting this is as follows:

1) Assume insn->off is negative, and call the most negative such offset
   MAX_NEGATIVE_OFF. So insn->off >= MAX_NEGATIVE_OFF for all possible
   insn->off.

2) *src_reg + insn->off will not wrap over to an unexpected address by
   virtue of negative insn->off, but it might wrap under if
   -insn->off > *src_reg, as that implies *src_reg + insn->off < 0

3) Inequality (TASK_SIZE_MAX + PAGE_SIZE - insn->off) > (TASK_SIZE_MAX + PAGE_SIZE)
   must be true since insn->off is negative.

4) If we've completed check 1, we know that
   src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off)

5) Combining statements 3 and 4, we know src_reg > (TASK_SIZE_MAX + PAGE_SIZE)

6) By statements 1, 4, and 5, if we can prove
   (TASK_SIZE_MAX + PAGE_SIZE) > -MAX_NEGATIVE_OFF, we'll know that
   (TASK_SIZE_MAX + PAGE_SIZE) > -insn->off for all possible insn->off
   values. We can rewrite this as (TASK_SIZE_MAX + PAGE_SIZE) +
   MAX_NEGATIVE_OFF > 0.

   Since src_reg > TASK_SIZE_MAX + PAGE_SIZE and MAX_NEGATIVE_OFF is
   negative, if the previous inequality is true,
   src_reg + MAX_NEGATIVE_OFF > 0 is also true for all src_reg values.
   Similarly, since insn->off >= MAX_NEGATIVE_OFF for all possible
   negative insn->off vals, src_reg + insn->off > 0 and there can be no
   wrapping under.

So proving (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 implies
*src_reg + insn->off > 0 for any src_reg that's passed check 1 and any
negative insn->off. Luckily the former inequality does not need to be
checked at runtime, and in fact could be a static_assert if
TASK_SIZE_MAX wasn't determined by a function when CONFIG_X86_5LEVEL
kconfig is used.

Regardless, we can just check (TASK_SIZE_MAX + PAGE_SIZE) +
MAX_NEGATIVE_OFF > 0 once per do_jit call instead of emitting a runtime
check. Given that insn->off is an s16 and is unlikely to grow larger,
this check should always succeed on any x86 processor made in the 21st
century. If it doesn't, fail all do_jit calls and complain loudly, on
the assumption that the BPF subsystem is misconfigured or has a bug.
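A hedged sketch of that once-per-do_jit() check, with S16_MIN standing
in for MAX_NEGATIVE_OFF since insn->off is an s16 (the exact error path
and message are assumptions):

  if ((s64)(TASK_SIZE_MAX + PAGE_SIZE) + S16_MIN <= 0) {
    /* should never happen on a sanely configured x86 kernel */
    WARN_ONCE(1, "BPF_PROBE_MEM: address space too small for negative insn->off");
    return -EFAULT;  /* fail the whole do_jit() call */
  }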

A few instructions are saved for negative insn->offs as a result. Using
the struct example_node / off = -16 example from before, code looks
like:

BEFORE CHANGE
  72:   movabs $0x800000000010,%r11
  7c:   cmp    %r11,%rdi
  7f:   jb     0x000000000000008d         (check 1 on 7c and here)
  81:   mov    %rdi,%r11
  84:   add    $0xfffffffffffffff0,%r11   (check 2, will set carry for almost any r11, so bug for
  8b:   jae    0x0000000000000091          negative insn->off)
  8d:   xor    %edi,%edi                  (as a result long key = n->key; will be 0'd out here)
  8f:   jmp    0x0000000000000095
  91:   mov    -0x10(%rdi),%rdi
  95:

AFTER CHANGE:
  5a:   movabs $0x800000000010,%r11
  64:   cmp    %r11,%rdi
  67:   jae    0x000000000000006d     (check 1 on 64 and here, but now JNC instead of JC)
  69:   xor    %edi,%edi              (no check 2, 0 out if %rdi - %r11 < 0)
  6b:   jmp    0x0000000000000071
  6d:   mov    -0x10(%rdi),%rdi
  71:

We could do the same for insn->off == 0, but for now keep code
generation unchanged for previously working nonnegative insn->offs.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
…type

If a BPF program defines a struct or union type which has a field type
that the verifier considers special - spin_lock, next-gen datastructure
heads and nodes - the verifier needs to be able to find fields of that
type using BTF.

For such a program, BTF is required, so modify the kernel_needs_btf
helper to ensure that the correct "BTF is mandatory" error message is
emitted.

The newly-added btf_has_alloc_obj_type looks for BTF_KIND_STRUCTs with a
name corresponding to a special type. If any such struct is found it is
assumed that some variable is using it, and therefore that successful
BTF load is necessary.
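A plausible sketch of such a scan using public libbpf BTF accessors
(the real implementation is internal to libbpf and may differ):

  static bool btf_has_alloc_obj_type(const struct btf *btf)
  {
    static const char * const special[] = {
      "bpf_spin_lock", "bpf_list_head", "bpf_list_node",
      "bpf_rb_root", "bpf_rb_node",
    };
    int i, n = btf__type_cnt(btf);
    size_t j;

    for (i = 1; i < n; i++) {
      const struct btf_type *t = btf__type_by_id(btf, i);
      const char *name;

      if (!btf_is_struct(t))
        continue;
      name = btf__name_by_offset(btf, t->name_off);
      if (!name)
        continue;
      for (j = 0; j < sizeof(special) / sizeof(special[0]); j++)
        if (strcmp(name, special[j]) == 0)
          return true;
    }
    return false;
  }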

Also add a kernel_needs_btf check to bpf_object__create_map where it was
previously missing. When this function calls bpf_map_create, the kernel
may reject map creation due to mismatched datastructure owner and ownee
types (e.g. a struct bpf_list_head whose __contains tag points to a
bpf_rb_node field). In such a scenario - or any other where BTF is
necessary for verification - bpf_map_create should not be retried
without BTF.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
@kernel-patches-bot

Upstream branch: 156ed20
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=702327
version: 1

This patch adds selftests exercising the logic changed/added in the
previous patches in the series. A variety of successful and unsuccessful
rbtree usages are validated:

Success:
  * Add some nodes, let map_value bpf_rb_root destructor clean them
    up
  * Add some nodes, remove one using the release_on_unlock ref leftover
    by successful rbtree_add() call
  * Add some nodes, remove one using the release_on_unlock ref returned
    from rbtree_first() call (sketched after these lists)

Failure:
  * BTF where bpf_rb_root owns bpf_list_node should fail to load
  * BTF where node of type X is added to tree containing nodes of type Y
    should fail to load
  * No calling rbtree api functions in 'less' callback for rbtree_add
  * No releasing lock in 'less' callback for rbtree_add
  * No removing a node which hasn't been added to any tree
  * No adding a node which has already been added to a tree
  * No escaping of release_on_unlock references past their lock's
    critical section
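Illustrative shape of the rbtree_first()-based success case above
(glock, groot, node_data, and less are hypothetical names following
earlier examples, not the literal selftest code):

  struct node_data *n, *m;
  struct bpf_rb_node *res;

  n = bpf_obj_new(typeof(*n));
  if (!n)
    return 1;

  bpf_spin_lock(&glock);
  bpf_rbtree_add(&groot, &n->node, less);

  res = bpf_rbtree_first(&groot);
  if (res) {
    m = container_of(res, struct node_data, node);
    bpf_rbtree_remove(&groot, &m->node);
  }
  bpf_spin_unlock(&glock);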

These tests mostly focus on rbtree-specific additions, but some of the
Failure cases revalidate scenarios common to both linked_list and rbtree
which are covered in the former's tests. Better to be a bit redundant in
case linked_list and rbtree semantics deviate over time.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
@kernel-patches-bot

At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=702327 expired. Closing PR.

@kernel-patches-bot kernel-patches-bot deleted the series/702327=>bpf-next branch December 9, 2022 10:56