BPF rbtree next-gen datastructure #4174
Conversation
Upstream branch: aa67961 | force-pushed bc978f4 to 34310fa
Upstream branch: c21dc52 | force-pushed 0d2ed1d to 2cbb701
Force-pushed 34310fa to e4c567d
Upstream branch: 235d2ef | force-pushed 2cbb701 to bd5a04e
Force-pushed e4c567d to 661aeb6
btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c. There, a BTF record is created for any type containing a spin_lock or any next-gen datastructure node/head.

Currently, for non-MAP_VALUE types, reg_btf_record will only search for a record using struct_meta_tab if the reg->type exactly matches (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an "allocated obj" type - returned from bpf_obj_new - might pick up other flags while working its way through the program. Loosen the check to be exact for base_type and just use the MEM_ALLOC mask for type_flag.

This patch is marked Fixes as the original intent of reg_btf_record was unlikely to have been to fail to find a btf_record for valid alloc obj types with additional flags, some of which (e.g. PTR_UNTRUSTED) are valid register type states for alloc obj independent of this series. However, I didn't find a specific broken repro case outside of this series' added functionality, so it's possible that nothing was triggering this logic error before.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: 4e814da ("bpf: Allow locking bpf_spin_lock in allocated objects")
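The loosened match can be illustrated with a small userspace sketch. Everything below is hypothetical scaffolding - the flag values, BASE_MASK, and the old_check/new_check helpers are made up for illustration; only the names base_type, type_flag, PTR_TO_BTF_ID, MEM_ALLOC, and PTR_UNTRUSTED correspond to kernel identifiers:

```c
#include <stdbool.h>

/* Illustrative encoding: base type in the low byte, modifier flags above.
 * Values are made up; only the names mirror the kernel. */
enum { PTR_TO_BTF_ID = 1, PTR_TO_MAP_VALUE = 2 };
enum { MEM_ALLOC = 1 << 8, PTR_UNTRUSTED = 1 << 9 };
#define BASE_MASK 0xff

static int base_type(int t) { return t & BASE_MASK; }
static int type_flag(int t) { return t & ~BASE_MASK; }

/* Old behavior: reg->type must be exactly (PTR_TO_BTF_ID | MEM_ALLOC) */
static bool old_check(int t)
{
	return t == (PTR_TO_BTF_ID | MEM_ALLOC);
}

/* New behavior: exact base_type, MEM_ALLOC mask for type_flag */
static bool new_check(int t)
{
	return base_type(t) == PTR_TO_BTF_ID && (type_flag(t) & MEM_ALLOC);
}
```

With the exact match, an alloc obj that picked up PTR_UNTRUSTED along the way would fail the btf_record lookup; the masked check still finds it.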
map_check_btf calls btf_parse_fields to create a btf_record for its value_type. If there are no special fields in the value_type, btf_parse_fields returns NULL, whereas if there are special value_type fields but they are invalid in some way, an error is returned.

An example invalid state would be:

  struct node_data {
  	struct bpf_rb_node node;
  	int data;
  };

  private(A) struct bpf_spin_lock glock;
  private(A) struct bpf_list_head ghead __contains(node_data, node);

ghead should be invalid as its __contains tag points to a field with type != "bpf_list_node".

Before this patch, such a scenario would result in btf_parse_fields returning an error ptr, the subsequent !IS_ERR_OR_NULL check failing, and btf_check_and_fixup_fields returning 0, which would then be returned by map_check_btf. After this patch's changes, -EINVAL would be returned by map_check_btf and the map would correctly fail to load.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: aa3496a ("bpf: Refactor kptr_off_tab into btf_record")
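The three-way outcome and the fixed error propagation can be sketched in plain C. ERR_PTR/PTR_ERR/IS_ERR are reimplemented here for a userspace demo, and map_check_btf_sketch is a hypothetical stand-in for the relevant branch of map_check_btf, not the real function:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal userspace stand-ins for the kernel's error-pointer helpers */
static void *ERR_PTR(long err) { return (void *)err; }
static long PTR_ERR(const void *p) { return (long)p; }
static bool IS_ERR(const void *p)
{
	return (unsigned long)p >= (unsigned long)-4095;
}

/* btf_parse_fields has three outcomes: NULL (no special fields),
 * ERR_PTR (special fields present but invalid), or a valid record.
 * The fix is to propagate the error instead of letting an ERR_PTR
 * fall through as if it were "no record", which returned 0. */
static int map_check_btf_sketch(void *rec)
{
	if (IS_ERR(rec))
		return (int)PTR_ERR(rec);
	/* NULL or valid record: nothing wrong at this stage */
	return 0;
}
```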
This is mostly a nonfunctional change. The verifier log message "expected false release_on_unlock" was missing a newline, so add it and move some checks around to reduce indentation level. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Many of the structs recently added to track field info for linked-list head are useful as-is for rbtree root. So let's do a mechanical renaming of list_head-related types and fields:

include/linux/bpf.h:
  struct btf_field_list_head -> struct btf_field_datastructure_head
  list_head -> datastructure_head in struct btf_field union
kernel/bpf/btf.c:
  list_head -> datastructure_head in struct btf_field_info

This is a nonfunctional change; functionality to actually use these fields for rbtree will be added in further patches.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
This patch adds special BPF_RB_{ROOT,NODE} btf_field_types similar to BPF_LIST_{HEAD,NODE}, adds the necessary plumbing to detect the new types, and adds a bpf_rb_root_free function for freeing bpf_rb_root in map_values.

structs bpf_rb_root and bpf_rb_node are opaque types meant to obscure structs rb_root_cached and rb_node, respectively. btf_struct_access will prevent BPF programs from touching these special fields automatically now that they're recognized.

btf_check_and_fixup_fields now groups list_head and rb_root together as "owner" fields and {list,rb}_node as "ownee", and does the same ownership cycle checking as before. Note this function does _not_ prevent ownership type mixups (e.g. rb_root owning list_node) - that's handled by btf_parse_datastructure_head.

After this patch, a bpf program can have a struct bpf_rb_root in a map_value, but not add anything to nor do anything useful with it.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
This patch adds implementations of bpf_rbtree_{add,remove,first} and teaches the verifier about their BTF_IDs as well as those of bpf_rb_{root,node}.

All three kfuncs have some nonstandard component to their verification that needs to be addressed in future patches before programs can properly use them:

  * bpf_rbtree_add: Takes a 'less' callback, need to verify it

  * bpf_rbtree_first: Returns ptr_to_node_type(off=rb_node_off) instead
    of ptr_to_rb_node(off=0). Return value ref should be released on
    unlock.

  * bpf_rbtree_remove: Returns ptr_to_node_type(off=rb_node_off) instead
    of ptr_to_rb_node(off=0). 2nd arg (node) is a release_on_unlock +
    PTR_UNTRUSTED reg.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Now that we find bpf_rb_root and bpf_rb_node in structs, let's give args that contain those types special classification and properly handle these types when checking kfunc args. "Properly handling" these types largely requires generalizing similar handling for bpf_list_{head,node}, with little new logic added in this patch. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Some BPF helpers take a callback function which the helper calls. For each helper that takes such a callback, there's a special call to __check_func_call with a callback-state-setting callback that sets up verifier bpf_func_state for the callback's frame.

kfuncs don't have any of this infrastructure yet, so let's add it in this patch, following the existing helper pattern as much as possible. To validate functionality of this added plumbing, this patch adds callback handling for the bpf_rbtree_add kfunc and hopes to lay groundwork for future next-gen datastructure callbacks.

In the "general plumbing" category we have:

  * check_kfunc_call doing callback verification right before clearing
    CALLER_SAVED_REGS, exactly like check_helper_call

  * recognition of func_ptr BTF types in kfunc args as
    KF_ARG_PTR_TO_CALLBACK + propagation of subprogno for this arg type

In the "rbtree_add / next-gen datastructure-specific plumbing" category:

  * Since bpf_rbtree_add must be called while the spin_lock associated
    with the tree is held, don't complain when the callback's func_state
    doesn't unlock it by frame exit

  * Mark rbtree_add callback's args PTR_UNTRUSTED to prevent rbtree api
    functions from being called in the callback

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
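For reference, a 'less' callback for an rbtree keyed by a long could look like the following userspace sketch. The two-pointer callback signature mirrors what bpf_rbtree_add expects per this series, but struct node_data, the bpf_rb_node layout, and the less_demo helper are illustrative only:

```c
#include <stdbool.h>
#include <stddef.h>

/* Opaque placeholder; the real struct bpf_rb_node layout is kernel-defined */
struct bpf_rb_node { unsigned long __opaque[4]; };

/* Hypothetical program-defined node type */
struct node_data {
	long key;
	struct bpf_rb_node node;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* The 'less' callback shape: compare two nodes by recovering the
 * containing type from the embedded bpf_rb_node field. */
static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
{
	struct node_data *na = container_of(a, struct node_data, node);
	struct node_data *nb =
		container_of((struct bpf_rb_node *)b, struct node_data, node);

	return na->key < nb->key;
}

/* Demo helper: build two nodes with the given keys and compare them */
static bool less_demo(long ka, long kb)
{
	struct node_data x = { .key = ka }, y = { .key = kb };

	return less(&x.node, &y.node);
}
```

Inside a real BPF program the verifier marks the callback's node args PTR_UNTRUSTED, so reads like na->key go through the untrusted-pointer path; this sketch only shows the shape of the comparison.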
Newly-added bpf_rbtree_{remove,first} kfuncs have some special properties that require handling in the verifier:

  * Both bpf_rbtree_remove and bpf_rbtree_first return the type
    containing the bpf_rb_node field, with the offset set to that
    field's offset, instead of a struct bpf_rb_node *
      * Generalized existing next-gen list verifier handling for this
        as mark_reg_datastructure_node helper

  * Unlike other functions, which set release_on_unlock on one of their
    args, bpf_rbtree_first takes no arguments, rather setting
    release_on_unlock on its return value

  * bpf_rbtree_remove's node input is a node that's been inserted in
    the tree. Only non-owning references (PTR_UNTRUSTED +
    release_on_unlock) refer to such nodes, but kfuncs don't take
    PTR_UNTRUSTED args
      * Added special carveout for bpf_rbtree_remove to take
        PTR_UNTRUSTED
      * Since node input already has release_on_unlock set, don't set
        it again

This patch, along with the previous one, completes special verifier handling for all rbtree API functions added in this series.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Current comment in BPF_PROBE_MEM jit code claims that verifier prevents insn->off < 0, but this appears to not be true irrespective of changes in this series. Regardless, changes in this series will result in an example like:

  struct example_node {
  	long key;
  	long val;
  	struct bpf_rb_node node;
  };

  /* In BPF prog, assume root contains example_node nodes */
  struct bpf_rb_node *res = bpf_rbtree_first(&root);
  if (!res)
  	return 1;

  struct example_node *n = container_of(res, struct example_node, node);
  long key = n->key;

Resulting in a load with off = -16, as bpf_rbtree_first's return is modified by the verifier to be PTR_TO_BTF_ID of example_node w/ offset = offsetof(struct example_node, node), instead of PTR_TO_BTF_ID of bpf_rb_node. So it's necessary to support negative insn->off when jitting BPF_PROBE_MEM.

In order to ensure that a page fault for a BPF_PROBE_MEM load of *src_reg + insn->off is safely handled, we must confirm that *src_reg + insn->off is in the kernel's memory. Two runtime checks are emitted to confirm that:

  1) (*src_reg + insn->off) > boundary between user and kernel address
     spaces
  2) (*src_reg + insn->off) does not overflow to a small positive
     number. This might happen if some function meant to set src_reg
     returns ERR_PTR(-EINVAL) or similar.

Check 1 currently is slightly off - it compares a u64

  limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);

to *src_reg, aborting the load if limit is larger. Rewriting this as an inequality:

  *src_reg > TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off)
  *src_reg - abs(insn->off) > TASK_SIZE_MAX + PAGE_SIZE

shows that this isn't quite right even if insn->off is positive, as we really want:

  *src_reg + insn->off > TASK_SIZE_MAX + PAGE_SIZE
  *src_reg > TASK_SIZE_MAX + PAGE_SIZE - insn->off

since *src_reg + insn->off is the address we'll be loading from, not *src_reg - insn->off or *src_reg - abs(insn->off). So change the subtraction to an addition and remove the abs(), as the comment indicates that it was only added to ignore negative insn->off.
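The difference between the old and corrected limit can be checked numerically. TASK_SIZE_MAX below is an illustrative constant (the real value depends on kernel config), and old_limit/new_limit are hypothetical helpers encoding the two formulas, not kernel functions:

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative constants; real TASK_SIZE_MAX is config-dependent */
#define TASK_SIZE_MAX 0x7fffffffffffULL
#define PAGE_SIZE 0x1000ULL

/* Old formula: limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off) */
static uint64_t old_limit(int32_t off)
{
	return TASK_SIZE_MAX + PAGE_SIZE + (uint64_t)labs(off);
}

/* Corrected formula: limit = TASK_SIZE_MAX + PAGE_SIZE - insn->off,
 * so that src_reg > limit  <=>  src_reg + off > TASK_SIZE_MAX + PAGE_SIZE.
 * Unsigned subtraction wraps correctly for negative off. */
static uint64_t new_limit(int32_t off)
{
	return TASK_SIZE_MAX + PAGE_SIZE - (uint64_t)(int64_t)off;
}
```

For a negative off like -16 the two formulas happen to agree (abs() flips the sign anyway), but for positive off they diverge: the old limit is too high by 2*off, which is exactly the "not quite right even if insn->off is positive" observation above.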
For Check 2, currently "does not overflow to a small positive number" is confirmed by emitting an 'add insn->off, src_reg' instruction and checking for the carry flag. While this works fine for a positive insn->off, a small negative insn->off like -16 is almost guaranteed to wrap over to a small positive number when added to any kernel address.

This patch addresses this by not doing Check 2 at BPF prog runtime when insn->off is negative, rather doing a stronger check at JIT-time. The logic supporting this is as follows:

  1) Assume insn->off is negative, call the largest such negative offset
     MAX_NEGATIVE_OFF. So insn->off >= MAX_NEGATIVE_OFF for all possible
     insn->off.

  2) *src_reg + insn->off will not wrap over to an unexpected address by
     virtue of negative insn->off, but it might wrap under if
     -insn->off > *src_reg, as that implies *src_reg + insn->off < 0.

  3) Inequality (TASK_SIZE_MAX + PAGE_SIZE - insn->off) >
     (TASK_SIZE_MAX + PAGE_SIZE) must be true since insn->off is
     negative.

  4) If we've completed check 1, we know that
     src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off).

  5) Combining statements 3 and 4, we know
     src_reg > (TASK_SIZE_MAX + PAGE_SIZE).

  6) By statements 1, 4, and 5, if we can prove
     (TASK_SIZE_MAX + PAGE_SIZE) > -MAX_NEGATIVE_OFF, we'll know that
     (TASK_SIZE_MAX + PAGE_SIZE) > -insn->off for all possible insn->off
     values. We can rewrite this as
     (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0.

     Since src_reg > TASK_SIZE_MAX + PAGE_SIZE and MAX_NEGATIVE_OFF is
     negative, if the previous inequality is true,
     src_reg + MAX_NEGATIVE_OFF > 0 is also true for all src_reg values.
     Similarly, since insn->off >= MAX_NEGATIVE_OFF for all possible
     negative insn->off vals, src_reg + insn->off > 0 and there can be
     no wrapping under.

So proving (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 implies *src_reg + insn->off > 0 for any src_reg that's passed check 1 and any negative insn->off.
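Step 6's inequality only needs to be checked once, which is what the patch does at JIT time. A sketch, with illustrative constants (real TASK_SIZE_MAX is config-dependent) and a hypothetical jit_time_check_ok helper; since insn->off is a s16, MAX_NEGATIVE_OFF is at worst S16_MIN:

```c
#include <stdint.h>

/* Illustrative constants; real TASK_SIZE_MAX is config-dependent */
#define TASK_SIZE_MAX 0x7fffffffffffULL
#define PAGE_SIZE 0x1000ULL
#define S16_MIN (-32768)

/* JIT-time check (once per do_jit call, no per-load runtime cost):
 * if TASK_SIZE_MAX + PAGE_SIZE + S16_MIN > 0, then any src_reg that
 * passed check 1 (src_reg > TASK_SIZE_MAX + PAGE_SIZE - insn->off)
 * cannot wrap under when a negative s16 insn->off is added to it. */
static int jit_time_check_ok(void)
{
	return (int64_t)(TASK_SIZE_MAX + PAGE_SIZE) + S16_MIN > 0;
}
```

On any remotely realistic address-space split this succeeds, matching the commit's claim that the check "should always succeed on any x86 processor made in the 21st century".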
Luckily the former inequality does not need to be checked at runtime, and in fact could be a static_assert if TASK_SIZE_MAX weren't determined by a function when the CONFIG_X86_5LEVEL kconfig is used. Regardless, we can just check (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 once per do_jit call instead of emitting a runtime check. Given that insn->off is a s16 and is unlikely to grow larger, this check should always succeed on any x86 processor made in the 21st century. If it doesn't, fail all do_jit calls and complain loudly with the assumption that the BPF subsystem is misconfigured or has a bug.

A few instructions are saved for negative insn->offs as a result. Using the struct example_node / off = -16 example from before, code looks like:

BEFORE CHANGE

  72:   movabs $0x800000000010,%r11
  7c:   cmp    %r11,%rdi
  7f:   jb     0x000000000000008d         (check 1 on 7c and here)
  81:   mov    %rdi,%r11
  84:   add    $0xfffffffffffffff0,%r11   (check 2, will set carry for
  8b:   jae    0x0000000000000091          almost any r11, so bug for
                                           negative insn->off)
  8d:   xor    %edi,%edi                  (as a result long key = n->key;
  8f:   jmp    0x0000000000000095          will be 0'd out here)
  91:   mov    -0x10(%rdi),%rdi
  95:

AFTER CHANGE

  5a:   movabs $0x800000000010,%r11
  64:   cmp    %r11,%rdi
  67:   jae    0x000000000000006d         (check 1 on 64 and here, but
                                           now JNC instead of JC)
  69:   xor    %edi,%edi                  (no check 2, 0 out if
  6b:   jmp    0x0000000000000071          %rdi - %r11 < 0)
  6d:   mov    -0x10(%rdi),%rdi
  71:

We could do the same for insn->off == 0, but for now keep code generation unchanged for previously working nonnegative insn->offs.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
…type

If a BPF program defines a struct or union type which has a field type that the verifier considers special - spin_lock, next-gen datastructure heads and nodes - the verifier needs to be able to find fields of that type using BTF. For such a program, BTF is required, so modify the kernel_needs_btf helper to ensure that the correct "BTF is mandatory" error message is emitted.

The newly-added btf_has_alloc_obj_type looks for BTF_KIND_STRUCTs with a name corresponding to a special type. If any such struct is found it is assumed that some variable is using it, and therefore that a successful BTF load is necessary.

Also add a kernel_needs_btf check to bpf_object__create_map where it was previously missing. When this function calls bpf_map_create, the kernel may reject map creation due to mismatched datastructure owner and ownee types (e.g. a struct bpf_list_head with __contains tag pointing to a bpf_rbtree_node field). In such a scenario - or any other where BTF is necessary for verification - bpf_map_create should not be retried without BTF.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
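The name scan in btf_has_alloc_obj_type can be sketched as a string match over struct type names. The list below reflects the special types named in this series; is_special_struct_name is a hypothetical helper, and the real implementation walks the program's BTF type section rather than plain strings:

```c
#include <stdbool.h>
#include <string.h>

/* Special field types per this series: spin_lock plus the next-gen
 * datastructure heads and nodes. */
static const char * const special_names[] = {
	"bpf_spin_lock",
	"bpf_list_head",
	"bpf_list_node",
	"bpf_rb_root",
	"bpf_rb_node",
};

/* Return true if a BTF_KIND_STRUCT with this name would make BTF
 * mandatory for the program. */
static bool is_special_struct_name(const char *name)
{
	for (size_t i = 0;
	     i < sizeof(special_names) / sizeof(special_names[0]); i++)
		if (strcmp(name, special_names[i]) == 0)
			return true;
	return false;
}
```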
Upstream branch: 156ed20
This patch adds selftests exercising the logic changed/added in the previous patches in the series. A variety of successful and unsuccessful rbtree usages are validated:

Success:
  * Add some nodes, let map_value bpf_rbtree_root destructor clean them
    up
  * Add some nodes, remove one using the release_on_unlock ref leftover
    by successful rbtree_add() call
  * Add some nodes, remove one using the release_on_unlock ref returned
    from rbtree_first() call

Failure:
  * BTF where bpf_rb_root owns bpf_list_node should fail to load
  * BTF where node of type X is added to tree containing nodes of type Y
    should fail to load
  * No calling rbtree api functions in 'less' callback for rbtree_add
  * No releasing lock in 'less' callback for rbtree_add
  * No removing a node which hasn't been added to any tree
  * No adding a node which has already been added to a tree
  * No escaping of release_on_unlock references past their lock's
    critical section

These tests mostly focus on rbtree-specific additions, but some of the Failure cases revalidate scenarios common to both linked_list and rbtree which are covered in the former's tests. Better to be a bit redundant in case linked_list and rbtree semantics deviate over time.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Force-pushed bd5a04e to 7dba847
Force-pushed 661aeb6 to b9829b3
At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=702327 expired. Closing PR.
Pull request for series with
subject: BPF rbtree next-gen datastructure
version: 1
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=702327