
Atomicity of Map operation helpers #1521

Open

palmtenor opened this issue Jan 10, 2018 · 43 comments

@palmtenor
Member

We have certain helpers that wrap around, and get rewritten into, separate BPF map syscalls, such as lookup_or_init and increment; both involve a lookup and a potential update. However, a race condition can occur where BPF programs running on two different CPUs both see an empty result and try to do the create (update), so only one will succeed. We are not handling those cases in our rewritten code (no error handling here and here), and it is not clear to users that these helpers (which look like a single operation) are not atomic.

In addition, map value operations such as the one in increment could also have race conditions, for example if two programs try to increment through the value pointer at the same time. Do you think we should replace those with __sync_fetch_and_add when the map is not a per-CPU map? But that would certainly cause some performance loss...
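
To illustrate (a rough sketch in bcc-style C, not the exact rewriter output; the map and function names are invented):

#include <uapi/linux/ptrace.h>

BPF_HASH(counts, u32, u64);          // hypothetical map for this sketch

int do_count(struct pt_regs *ctx) {
    u32 key = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0, *val;

    // Roughly what counts.increment(key) / lookup_or_init() get rewritten into:
    val = counts.lookup(&key);
    if (!val) {
        // Two CPUs can both see the miss and both issue the create/update;
        // only one "create" wins, and the failure is currently ignored.
        counts.update(&key, &zero);
        val = counts.lookup(&key);
        if (!val)
            return 0;
    }

    // Non-atomic read-modify-write: concurrent increments can be lost.
    (*val)++;
    return 0;
}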

@goldshtn
Collaborator

Not only is it not clear to our users, I think it is not clear to the tool authors either 😉 I know most of our tools would probably have these embedded race conditions. I sort of hope it wouldn't be critical in most cases, but for high-frequency events the collisions on multiple CPUs could be frequent and produce skewed results. My personal feeling is that we should opt for the synchronized alternative and keep the unsafe behavior for per-CPU maps, but it also depends on how bad the overhead would be. E.g. for a map update that happens 1M times per second across 8 CPUs (for something like counting sched:sched_switch per pid), what is the overhead we should expect from running the map updates synchronized?

@bobrik
Contributor

bobrik commented May 9, 2018

Should we have a _synchronized version of each call as an intermediate step, to allow people to assess the overhead?

  • lookup_or_init -> lookup_or_init + lookup_or_init_synchronized
  • increment -> increment + increment_synchronized

@yonghong-song
Collaborator

The way to do synchronization is through a per-cpu hash/array. We cannot really do synchronization inside the bpf program itself, as locking is not possible at this point.

@bobrik
Contributor

bobrik commented May 15, 2018

It would be nice to have examples / tools for percpu tables.

Searching both examples and tools returns nothing:

Having just ARRAY is not enough to cover everything either; it would be great to support HASH and LRU_HASH as well.

I'm also not quite sure why we can't have at least increment_synchronized via __sync_fetch_and_add.

Kernel ships with examples doing just that:

@yonghong-song
Collaborator

bcc does have one for increment_synchronized, called lock_xadd:

src/cc/export/helpers.h:#define lock_xadd(ptr, val) ((void)__sync_fetch_and_add(ptr, val))
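
For illustration, it could be used on a regular (non-per-CPU) map roughly like this (a sketch; the map and function names are invented):

#include <uapi/linux/ptrace.h>

BPF_HASH(calls, u32, u64);           // hypothetical map for this sketch

// Attach with e.g. b.attach_kprobe(event="vfs_write", fn_name="count_call").
int count_call(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0, *val;

    val = calls.lookup_or_init(&pid, &zero);
    if (!val)
        return 0;

    // lock_xadd() expands to __sync_fetch_and_add(), so concurrent increments
    // from different CPUs are not lost, unlike a plain (*val)++.
    lock_xadd(val, 1);
    return 0;
}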

@yonghong-song
Collaborator

Agreed that we may want to have some examples (or tools, if racing is a concern) that use a per-cpu array/hash. A typical example looks like

  map_val = table.lookup(&key);
  if (map_val) (*map_val)++;

If there is potential racing here, then maybe the table should be per-cpu.
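
A minimal per-cpu counterpart in bcc could look like the sketch below (names invented); user space then receives one value per CPU and sums them when reading the table:

BPF_PERCPU_ARRAY(event_count, u64, 1);   // hypothetical per-CPU map

int count_event(void *ctx) {
    int idx = 0;
    u64 *val = event_count.lookup(&idx);

    // Each CPU updates its own copy of the value, so two CPUs never write
    // the same slot and no atomic instruction is needed.
    if (val)
        (*val)++;
    return 0;
}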

@myllynen
Contributor

Would it be possible to add a few additional sentences about this to the documentation to raise awareness of the issue? Thanks.

@yonghong-song
Collaborator

Yes, it would be good if we had volunteers for this. Do you want to give it a try?

@myllynen
Contributor

I'm very low on time, and I'm not sure I would be knowledgeable enough at this point to instruct others.

@yzhao1012
Contributor

We'd like to track network traffic on file descriptors. Obviously, threads on different CPUs will write()/send()/read()/recv() on the same file descriptor all the time, so this type of atomic operation would be necessary.

@yonghong-song
Collaborator

bpf_spin_lock was added in 5.1; it can protect the value on a per-entry basis. If you really want precise reference counting in a bpf program, this is the way to go.

bcc should already support this. Does anybody want to try it and maybe write an example to show how it works in the bcc environment?
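
For reference, the per-entry locking pattern would look roughly like the sketch below (names invented; as discussed further down in this thread, tracing programs cannot currently use bpf_spin_lock, so this only applies to program types where the verifier allows the helper, and the map needs BTF):

struct counter_t {
    struct bpf_spin_lock lock;
    u64 packets;
    u64 bytes;
};

BPF_ARRAY(stats, struct counter_t, 1);   // needs BTF for the spin lock

// Call this from a program type where bpf_spin_lock is allowed.
static inline void account(u64 len) {
    int key = 0;
    struct counter_t *val = stats.lookup(&key);
    if (!val)
        return;

    bpf_spin_lock(&val->lock);        // serialize updates to this map entry
    val->packets += 1;
    val->bytes += len;
    bpf_spin_unlock(&val->lock);
}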

@willfindlay
Contributor

@yonghong-song I haven't had any success so far with bpf_spin_lock in bcc. The verifier complains as follows:

...
113: (85) call bpf_spin_lock#93
unknown func bpf_spin_lock#93

My kernel version is: 5.1.3-arch1-1-ARCH

@yonghong-song
Collaborator

@housedhorse that is strange. 5.1 should have bpf_spin_lock. You can run bpftool built from the kernel source to check whether the kernel truly supports bpf_spin_lock, or check whether btf_find_spin_lock is in /proc/kallsyms.

@willfindlay
Contributor

willfindlay commented May 22, 2019

@yonghong-song btf_find_spin_lock is indeed present in /proc/kallsyms. My headers also have the appropriate definitions for bpf_spin_lock() and bpf_spin_unlock(). However, when I was looking at the headers, I noticed this:

...
 *		* Tracing programs and socket filter programs cannot use
 *		  **bpf_spin_lock**\ () due to insufficient preemption checks
 *		  (but this may change in the future).
...

Edit: Here is a link for convenience.

@yonghong-song
Collaborator

@housedhorse Aha, I missed this. Could you describe your use case here? If we have a solid use case, we can add bpf_spin_lock support for tracing.

@willfindlay
Contributor

@yonghong-song I'm maintaining per-executable profiles that need to be created and updated asynchronously.

@yonghong-song
Collaborator

@housedhorse could you describe in detail why bpf_spin_lock is required here? Once the problem statement is well understood, I can guide you to submit a kernel patch if you would like.

@willfindlay
Contributor

@yonghong-song I can't go into too much detail since the work I'm doing is related to a research project. Essentially I have profile structures mapped to unique executables which are potentially being updated by multiple processes (running the same binary) at a time. This means per-cpu tables are not an option here.

@yonghong-song
Collaborator

@housedhorse no problem. Looks like we do have a use case for tracing. Does anybody want to volunteer to do a kernel patch extending bpf_spin_lock to tracing programs?

@yzhao1012
Contributor

@yonghong-song What kind of work is required?

@yonghong-song
Collaborator

In the commit that added support for bpf_spin_lock/bpf_spin_unlock,
torvalds/linux@d83525c#diff-05da4bf36c7fbcd176254e1615d98b28
the helpers are only added to networking program types;
see the changes in net/core/filter.c.

The two helpers need to be added to the helpers in kernel/trace/bpf_trace.c.

@willfindlay
Contributor

willfindlay commented Sep 28, 2019

@yonghong-song I've created a kernel patch with the changes you specified. However, now I'm getting the following:

/virtual/main.c:4:23: error: field has incomplete type 'struct bpf_spin_lock'
         struct bpf_spin_lock my_lock;
                              ^
 /virtual/main.c:4:9: note: forward declaration of 'struct bpf_spin_lock'
         struct bpf_spin_lock my_lock;
                ^
 /virtual/main.c:9:1: error: could not open bpf map: Invalid argument
 is maps/array map type enabled in your kernel?
 BPF_ARRAY(testificate, locked_data, 1);
 ^
 /virtual/include/bcc/helpers.h:168:27: note: expanded from macro 'BPF_ARRAY'
   BPF_ARRAYX(__VA_ARGS__, BPF_ARRAY3, BPF_ARRAY2, BPF_ARRAY1)(__VA_ARGS__)
                           ^
 2 errors generated.
 Traceback (most recent call last):
   File "./spinlock.py", line 10, in <module>
     bpf = BPF(src_file="spinlock.c")
   File "/usr/lib/python3/dist-packages/bcc/__init__.py", line 331, in __init__
     raise Exception("Failed to compile BPF module %s" % src_file)
 Exception: Failed to compile BPF module b'spinlock.c'

For some reason the bpf_spin_lock struct seems not to be included properly.

Here's my diff on Linux 5.3:

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ca1255d14576..2dbc8a7771a3 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -709,6 +709,18 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 #endif
 	case BPF_FUNC_send_signal:
 		return &bpf_send_signal_proto;
+	default:
+		break;
+	}
+
+	if (!capable(CAP_SYS_ADMIN))
+		return NULL;
+
+	switch (func_id) {
+	case BPF_FUNC_spin_lock:
+		return &bpf_spin_lock_proto;
+	case BPF_FUNC_spin_unlock:
+		return &bpf_spin_unlock_proto;
 	default:
 		return NULL;
 	}

Here's my minimal example:

typedef struct
{
    u64 my_data;
    struct bpf_spin_lock my_lock;
}
locked_data;

BPF_PERF_OUTPUT(event);
BPF_ARRAY(testificate, locked_data, 1);

TRACEPOINT_PROBE(raw_syscalls, sys_enter)
{
    int key = 0;

    locked_data *d = testificate.lookup(&key);

    return 0;
}

Do you have any further guidance to offer?

Edit: I have no problems with the struct on my host OS, only my VM running the custom kernel. I'm thinking I may have screwed up the configuration somehow.

@yonghong-song
Collaborator

The map cannot be created. I think the reason is mostly that no BTF (BPF Type Format) is available. BTF is only available with LLVM 9 and later. Could you give it a try with the latest bcc source?
LLVM 9 was just released. You can enable the debug flag DEBUG_BTF = 0x20 to print out BTF-related information.

We did not really test BTF in our buildbot because of old LLVM compilers, so I'm not 100% sure whether BTF is broken or not for distros. It would be good if you could check it in your environment.

@willfindlay
Contributor

Thanks very much for the reply. I'll probably have some time to investigate tomorrow.

@yonghong-song
Collaborator

Just to be clear: bpf_spin_lock() requires BTF.

@jsommers

FWIW, it looks like my distro (Ubuntu 19.10) does not have an llvm that emits the right stuff:

475: (85) call bpf_spin_lock#93
map 'mylock' has to have BTF in order to use bpf_spin_lock

The llvm version appears to be 10.0, but I get the error above. I'm using bcc from iovisor, not the distro-provided version. Is that causing a mismatch that leads to the failure to use a spin lock?

Thanks for any help and/or advice.

@yonghong-song
Collaborator

Could you add debug=0x20 to the Python BPF constructor, or flag=0x20 to the C++ BPF constructor? The latest iovisor trunk with llvm 10.0 should work. It would be good to know what is missing in your environment.

@jsommers

jsommers commented Jan 21, 2020

Thanks. I can see from the debug output (below) that clang 7.0.1 is being used, which explains it (llvm 10.0 is installed, but the older version is getting used). I can see further down in the debug output (I turned on pretty much all debug output) that the bcc macros that eventually produce directives like struct.____btf_map_mylock = type { i32, %struct.mylock } are correctly getting invoked (no surprise, since I am using the latest iovisor code), so I think this is just a case of the wrong llvm & clang getting invoked.

!2 = distinct !DICompileUnit(language: DW_LANG_C99, file: !3, producer: "clang version 7.0.1-9build1 (tags/RELEASE_701/final)", isOptimized: true, runtimeVersion: 0, emissionKind: FullDebug, enums: !4, retainedTypes: !135, globals: !217)

If I continue to have some issues after forcing my bcc to use a newer version I'll follow up. In any case, thanks for your response and help!

@yonghong-song
Collaborator

Yes, please try llvm10.

@jsommers

Got it working; it required purging the system of a few old versions of llvm & clang -- cmake was picking up one of the older versions on the system. Once only a single version was installed (I used llvm 9, actually), the BTF annotations were correctly emitted and the spin lock didn't cause any problems. Thanks again.

@maharishib

Hi @jsommers,
I was facing the same problem while using bpf_spin_lock. I have switched to clang-9.
Kernel version: 5.3.0-42-generic
Now I'm getting the following error:

0 maps not supported in current map section!
Error fixing up map structure, incompatible struct bpf_elf_map used?
Error fetching ELF ancillary data!

Did you face any such issue?

@wwwzrb

wwwzrb commented Aug 25, 2020

@yonghong-song I haven't had any success so far with bpf_spin_lock in bcc. The verifier complains as follows:

...
113: (85) call bpf_spin_lock#93
unknown func bpf_spin_lock#93

My kernel version is: 5.1.3-arch1-1-ARCH

@yonghong-song @willfindlay
Hi, I'm hitting a similar problem when writing a BPF program in C. I added my own code under /samples/bpf and it compiles fine. However, when I try to run the program, it reports that the bpf_spin_lock call cannot be found. I can also run the spinlock selftests under /tools/testing/selftests/bpf (test_spin_lock and test_map_lock) without any problems/errors.

I'm using kernel 5.8 and llvm 11.0.

I don't know why. Can you give any advice? Thanks!

@netedwardwu
Contributor

Did you use bpf_spin_lock() in a tracing program or a socket filter?
If so, the kernel doesn't support that yet.

*		* Tracing programs and socket filter programs cannot use
*		  **bpf_spin_lock**\ () due to insufficient preemption checks
*		  (but this may change in the future).

@wwwzrb

wwwzrb commented Aug 25, 2020

Did you use bpf_spin_lock() in a tracing program or a socket filter?
If so, the kernel doesn't support that yet.

*		* Tracing programs and socket filter programs cannot use
*		  **bpf_spin_lock**\ () due to insufficient preemption checks
*		  (but this may change in the future).

@netedwardwu
I saw this limitation, but I don't know the specific scope of "tracing programs" and "socket filter programs". Do raw kprobes/kretprobes belong to tracing programs?

@netedwardwu
Contributor

I think so.
You can check & trace & debug in
https://elixir.bootlin.com/linux/v5.8-rc4/source/kernel/trace/bpf_trace.c#L1058
It has no BPF_FUNC_spin_lock support.
This is why your bpf program can't pass the bpf verifier and load into the kernel.

And below is the bpf_spin_lock patch, also for reference:
https://patchwork.ozlabs.org/project/netdev/patch/20190124041403.2100609-2-ast@kernel.org/

But I don't know it in much detail.

@wwwzrb

wwwzrb commented Aug 25, 2020

I think so.
You can check & trace & debug in
https://elixir.bootlin.com/linux/v5.8-rc4/source/kernel/trace/bpf_trace.c#L1058
It has no BPF_FUNC_spin_lock support.
This is why your bpf program can't pass the bpf verifier and load into the kernel.

And below is the bpf_spin_lock patch, also for reference:
https://patchwork.ozlabs.org/project/netdev/patch/20190124041403.2100609-2-ast@kernel.org/

But I don't know it in much detail.

Got it, I'm trying a workaround: I just added BPF_FUNC_spin_lock in bpf_trace.c. Hope it will work!

@netedwardwu
Contributor

I think so.
You can check & trace & debug in
https://elixir.bootlin.com/linux/v5.8-rc4/source/kernel/trace/bpf_trace.c#L1058
It has no BPF_FUNC_spin_lock support.
This is why your bpf program can't pass the bpf verifier and load into the kernel.
And below is the bpf_spin_lock patch, also for reference:
https://patchwork.ozlabs.org/project/netdev/patch/20190124041403.2100609-2-ast@kernel.org/
But I don't know it in much detail.

Got it, I'm trying a workaround: I just added BPF_FUNC_spin_lock in bpf_trace.c. Hope it will work!

I don't think that's a good idea:

 *		* Tracing programs and socket filter programs cannot use
 *		  **bpf_spin_lock**\ () due to insufficient preemption checks
 *		  (but this may change in the future).

It means you will run into trouble because of the insufficient preemption checks.
BPF programs are really hard to debug in a complicated environment!
A better idea is to find out why you need bpf_spin_lock, and check whether per-cpu arrays can solve your problem.

@yonghong-song
Collaborator

There are some major issues with using bpf_spin_lock in a tracing program.

  • If a bpf program is running in NMI context, bpf_spin_lock should not be used; note that
    a kprobe program could run in NMI context.
  • Due to internal protection, a tracing program with bpf_spin_lock will not nest with another tracing program, e.g., a kprobe.
    But if a map with bpf_spin_lock is used by both a tracing program and a networking program, a deadlock may happen.
    The kernel has to be changed for safety to prevent such a deadlock.

Yes, all these issues need to be resolved before allowing tracing programs (kprobe, etc.) to use bpf_spin_lock.

@wwwzrb could you describe your use case? It would be good to do a deep analysis to see whether bpf_spin_lock() is really needed or not. If it is, we can see how to improve the kernel. @netedwardwu is right: in general, using bpf_spin_lock is not safe for tracing programs. But you could do some experiments if you are aware of the pitfalls I listed above.

@wwwzrb

wwwzrb commented Aug 26, 2020

There are some major issues with using bpf_spin_lock in a tracing program.

  • If a bpf program is running in NMI context, bpf_spin_lock should not be used; note that
    a kprobe program could run in NMI context.
  • Due to internal protection, a tracing program with bpf_spin_lock will not nest with another tracing program, e.g., a kprobe.
    But if a map with bpf_spin_lock is used by both a tracing program and a networking program, a deadlock may happen.
    The kernel has to be changed for safety to prevent such a deadlock.

Yes, all these issues need to be resolved before allowing tracing programs (kprobe, etc.) to use bpf_spin_lock.

@wwwzrb could you describe your use case? It would be good to do a deep analysis to see whether bpf_spin_lock() is really needed or not. If it is, we can see how to improve the kernel. @netedwardwu is right: in general, using bpf_spin_lock is not safe for tracing programs. But you could do some experiments if you are aware of the pitfalls I listed above.

I very much agree with you that wrong usage of bpf_spin_lock can easily lead to deadlock.

I'll try to clarify the use case. Thanks for your suggestions. @netedwardwu @yonghong-song

Existing bpf programs usually use a BPF_HASH_MAP for key/value storage, updates, and so on. I'm trying to replace the functionality of BPF_HASH_MAP with a BPF_ARRAY_MAP, which supports mmap, to reduce the overhead of reading the map from user space; i.e., user apps can access the BPF_ARRAY_MAP directly after mmap without invoking the bpf syscall (see the sketch after the list below).

However, a lot of work remains to be done to use BPF_ARRAY_MAP this way:

  1. Reading a BPF_ARRAY_MAP index while other process/function events (tcp_sendmsg, tcp_close, etc.) are updating the index (insert, delete, and so on).
  2. Concurrent updating of the BPF_ARRAY_MAP index by multiple process/function events.
  3. User space reading/updating the BPF_ARRAY_MAP index while the BPF kernel part is reading/updating the index.

So we may need bpf_spin_lock for concurrent reads/updates of the index, where using only BPF_XADD/__sync_fetch_and_add may not be sufficient.
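
For what it's worth, here is a rough sketch of that mmap idea, assuming a libbpf-style program and skeleton (the struct, map, and skeleton names are invented): the array is created with BPF_F_MMAPABLE, and user space maps it once instead of issuing a bpf() syscall per read. As far as I know, the kernel does not allow mmap-ing a map whose value contains a struct bpf_spin_lock, so the locking question above still needs a separate answer.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* BPF side: an mmap-able array. */
struct conn_stat {
    __u64 bytes_sent;
    __u64 bytes_recv;
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(map_flags, BPF_F_MMAPABLE);
    __uint(max_entries, 1024);
    __type(key, __u32);
    __type(value, struct conn_stat);
} conn_stats SEC(".maps");

/* User side (assuming a libbpf skeleton named "skel"): map the array once,
 * then read entries directly without a bpf() syscall per lookup:
 *
 *   struct conn_stat *stats = mmap(NULL, 1024 * sizeof(struct conn_stat),
 *                                  PROT_READ | PROT_WRITE, MAP_SHARED,
 *                                  bpf_map__fd(skel->maps.conn_stats), 0);
 */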

@netedwardwu
Contributor

Why do you need to "reduce the overhead of reading BPF_MAP at user space"?
Most bcc tools only fetch the BPF_HASH_MAP back at an interval of at least one second.

And in the BPF program it has RCU protection... I don't think you need bpf_spin_lock so far...

@wwwzrb

wwwzrb commented Aug 26, 2020

Why do you need to "reduce the overhead of reading BPF_MAP at user space"?
Most bcc tools only fetch the BPF_HASH_MAP back at an interval of at least one second.

And in the BPF program it has RCU protection... I don't think you need bpf_spin_lock so far...

I mean in the case where you have many elements in the BPF_MAP and read it frequently, the overhead of the syscalls needed to traverse the whole BPF_MAP will be noticeable.

I'll try whether it works without bpf_spin_lock.

By the way, how do I enable BTF support?

I'm using kernel 5.8, llvm 10.0.0, and pahole 1.17. BTF is also enabled in the kernel config.
I wrote a kprobe example under /samples/bpf and load it via bpf_load.c.
But the BPF verifier says: "map 'tbl_conn_ipv4_a' has to have BTF in order to use bpf_spin_lock".

[screenshot of the verifier output]

@yonghong-song
Collaborator

yonghong-song commented Aug 30, 2020

By the way, how do I enable BTF support?

Your kernel version, pahole version, and llvm version sound right.
Do you have a reproducible test case I can take a look at?

@wwwzrb

wwwzrb commented Sep 1, 2020

By the way, how do I enable BTF support?

Your kernel version, pahole version, and llvm version sound right.
Do you have a reproducible test case I can take a look at?

Yes, I modified the map_lock test case under /tools/testing/selftests/bpf to check whether bpf_spin_lock works in a tracing context.

I find that the bpf prog loads successfully after adding the bpf_spin_lock func proto in bpf_trace.c and commenting out the verifier check that forbids usage of bpf_spin_lock in a tracing prog.

However, the kprobe/tracepoint is not invoked at all in our testing.

The sample code is attached below:

// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2019 Facebook
#include <linux/ptrace.h>
#include <linux/version.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define VAR_NUM 16

struct hmap_elem {
   struct bpf_spin_lock lock;
   int var[VAR_NUM];
};

struct {
   __uint(type, BPF_MAP_TYPE_HASH);
   __uint(max_entries, 1);
   __type(key, __u32);
   __type(value, struct hmap_elem);
} hash_map SEC(".maps");

struct array_elem {
   struct bpf_spin_lock lock;
   int var[VAR_NUM];
};

struct {
   __uint(type, BPF_MAP_TYPE_ARRAY);
   __uint(max_entries, 1);
   __type(key, int);
   __type(value, struct array_elem);
} array_map SEC(".maps");

//SEC("raw_tracepoint/sys_enter") // this function runs frequenctly
//int tcp_get_entry(void *ctx) 
SEC("kprobe/tcp_get_info") // this function will be involked by ss -ti
int tcp_get_entry(struct pt_regs *ctx)
{
   char attach_fmt[] = "Attach tcp_get_info entry successful: %d \n";
   char protocol_fmt[] = "Wrong protocol: %d \n";

   //__u64 pid = bpf_get_current_pid_tgid();
   __u32 pgid = 0;

   bpf_trace_printk(attach_fmt, sizeof(attach_fmt), pgid);

   struct hmap_elem zero = {}, *val;
   int rnd = bpf_get_prandom_u32();
   int key = 0, err = 1, i;
   struct array_elem *q;

   val = bpf_map_lookup_elem(&hash_map, &key);
   if (!val)
   	goto err;
   /* spin_lock in hash map */
   bpf_spin_lock(&val->lock);
   for (i = 0; i < VAR_NUM; i++)
   	val->var[i] = rnd;
   bpf_spin_unlock(&val->lock);

   /* spin_lock in array */
   q = bpf_map_lookup_elem(&array_map, &key);
   if (!q)
   	goto err;
   bpf_spin_lock(&q->lock);
   for (i = 0; i < VAR_NUM; i++)
   	q->var[i] = rnd;
   bpf_spin_unlock(&q->lock);
   err = 0;
err:
   return err;
}
char _license[] SEC("license") = "GPL";

test_map_lock_user.c.txt
test_map_lock.h.txt
