
str() to support large strings #305

Closed
Birch-san opened this issue Jan 1, 2019 · 15 comments · Fixed by #3228
Labels
help wanted Extra attention is needed

Comments

@Birch-san
Contributor

(Discussion extracted from #299)

This issue is to track "how could str() support kilobyte/megabyte/arbitrary string sizes?"

str() allocates strings on the BPF stack, so we reach the 512-byte limit very quickly (especially since we pay the toll again when printf() composes a perf event output buffer).

In fact, we're currently limited to tracing ~240-byte strings.

For an idea of "when would I want to trace a string larger than 240 bytes?":
My favourite dtrace command was one that was frequently guilty of maxing out the default string buffer: tracing what SQL queries were submitted to MySQL. I'd usually grow the buffer to 1024 bytes for good measure.

====

I explored the possibility of storing the strings off-stack. The main problem there is that probe_read_str() will only write to on-stack memory. I have ideas for workarounds (and some code progress towards those workarounds), but they're too big to be included in the scope of this pull request.

====

Here's some code that I built to explore the possibility of storing strings off-stack:

bd038f9...Birch-san:c5088c9a5d731165e1005a6a800e611bb63f3dc3

It introduces (in codegen) a "StrCall" specialization of "Call".
StrCall has a PERCPU_ARRAY BPF map, to hold large data off-stack.

We cannot use probe_read_str() to output directly into the BPF map, but maybe we could allocate a small on-stack buffer, and ferry bytes (one buffer at a time) from BPF stack into BPF map.
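
Roughly, in restricted C, the ferry loop could look like the sketch below. The map and buffer names are made up, and the bounds handling is simplified; this is an illustration of the approach, not bpftrace's actual codegen:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";  /* the probe_read helpers are GPL-only */

#define CHUNK    64  /* size of the on-stack staging buffer */
#define NCHUNKS  16  /* enough chunks for roughly 1 KB of string */

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, char[CHUNK * NCHUNKS]);
} scratch_map SEC(".maps");

static __always_inline long ferry_str(const void *unsafe_src)
{
	__u32 key = 0;
	char *dst = bpf_map_lookup_elem(&scratch_map, &key);
	char buf[CHUNK];  /* small on-stack staging buffer */
	int i;

	if (!dst)
		return -1;

#pragma unroll  /* kernels of this era have no bounded-loop support */
	for (i = 0; i < NCHUNKS; i++) {
		/* read one chunk onto the stack (the helper insists on a
		 * stack destination, which is the whole problem)... */
		long n = bpf_probe_read_str(buf, CHUNK,
				(const char *)unsafe_src + i * (CHUNK - 1));

		if (n < 0)
			return n;
		/* ...then ferry it into the off-stack map value; stride is
		 * CHUNK - 1 because each read spends one byte on its NUL */
		__builtin_memcpy(dst + i * (CHUNK - 1), buf, CHUNK);
		if (n < CHUNK)  /* the string ended inside this chunk */
			break;
	}
	return 0;
}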

Next, printf() poses a few problems:

  • composes the perf event output struct on the BPF stack
  • doesn't use runtime information to decide length of strings inside the perf event output struct
    • composes a fixed-size structure
    • determines size of structure during semantic analysis

I think PerfEventOutput can transfer data from BPF map to userspace (that's what "join" does), so we could compose the perf event output struct inside a BPF map.
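
Continuing the sketch above (same hypothetical maps), emitting a map-resident record is then a matter of handing bpf_perf_event_output() a map value as its data pointer:

struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(__u32));
} events SEC(".maps");

SEC("kprobe/example")
int emit_from_map(void *ctx)
{
	__u32 key = 0;
	char *buf = bpf_map_lookup_elem(&scratch_map, &key);

	if (!buf)
		return 0;
	/* ...compose the output record in buf, entirely off-stack... */

	/* the data pointer is a map value, so the 512-byte stack limit
	 * never constrains the record size */
	bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, buf,
			      CHUNK * NCHUNKS);
	return 0;
}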

====

We may find that we want a dynamically-sized perf event output structure. Like, imagine we see that args->count is just 3 bytes. We wouldn't want to send MAX_STRLEN bytes to userspace, if we know at runtime a smaller, more specific amount.

Here's some (horrifying) code I wrote to make print decide the size of its MEMCPY based on runtime information rather than semantic analysis:

master...Birch-san:6c1f19a3a4c3cf253aa23db79f5f637b6823e5b2

Notice how expr2_ is used to pass down a strlen at runtime, instead of using the strlen provided during semantic analysis.
And notice how the str call uses a parameterised AllocInst, so instead of isArrayTy(), its expr_ type has isPointerTy(). (it's a pointer to dynamically-allocated memory)

I reached a dead-end here because printf() allocates a buffer based on an LLVM StructType. It can store strings in ArrayType members, but you need to declare the size of the ArrayType up front. You cannot use an LLVM Value to dynamically size it.
The workaround here would be to stop using StructType and instead use a basic AllocInst. These support dynamic sizes based on LLVM Value. Or even better: we can use off-stack memory (a BPF map) instead of doing an allocation inside printf().

====

In conclusion: there may be a path to supporting arbitrarily-large strings. Bonus points if we can also make the perf event output buffer dynamically-sized, so we can transfer the smallest message possible (instead of our maximum buffer size). A big advantage of dynamic sizing is that the user wouldn't have to tune their maximum buffer size for the sake of performance.
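
To illustrate the dynamic-sizing idea (same hypothetical setup as the sketches above): clamping the runtime length is what would let the verifier bound the access, while we send only the bytes we actually have:

static __always_inline void emit_dynamic(void *ctx, char *buf, __u64 len)
{
	/* clamp the runtime length so the verifier can prove the access
	 * stays inside the buffer */
	if (len > CHUNK * NCHUNKS)
		len = CHUNK * NCHUNKS;
	bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, buf, len);
}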

@mmarchini added the help wanted label Jan 14, 2019
@tyroguru
Contributor

tyroguru commented Feb 6, 2019

This whole issue needs some serious attention as it is a significant limitation. Nice initial exploration @Birch-san. I have hit this limitation on a number of occasions, and for me it is one of bpftrace's biggest annoyances at the minute.

@Birch-san
Contributor Author

Birch-san commented Feb 6, 2019

@tyroguru out of interest: what kind of script do you have that needs big strings? Clear use-cases are valuable for framing the conversation.

I had another thought…
Ferrying the strings like this (~200 bytes at a time):

probe_read_str() -> BPF stack -> BPF map

Has a chance of working, but is high-effort.
I wonder if we could request (of Linux itself) explicit support for the following:

probe_read_str() -> BPF map

Additionally: it'd be worth asking around to assess whether I've characterised the problem correctly. Maybe there's a completely different way to approach big strings.

@tyroguru
Contributor

tyroguru commented Feb 7, 2019

Hey @Birch-san - sorry, but I was being lazy :-).

Take this simple example I did a couple of weeks ago for reading /dev/null:

# cat dev-null.bt
kprobe:write_null
{
        printf("%s: %s (size %d)\n", comm, str(arg1), arg2);
}

Obviously, strings much larger than 200 bytes are written to /dev/null.

I need to understand the problem better to make any form of judgement on this but it seems like this is a fundamental problem that needs addressing in the engine room of BPF - to work around it in userland would probably be high effort as you say.

@Birch-san
Contributor Author

Birch-san commented Feb 8, 2019

@tyroguru thanks for the example. agreed; we need way bigger strings to attack that use-case.

I've had a look at how probe_read and probe_read_str are implemented in the kernel.

I think what we need is a probe_read_str derivative (probe_read_str_to_map?) that accepts a destination of type ARG_PTR_TO_UNINIT_MAP_VALUE. That would let us copy strings directly into map values.

Of course, anything is possible if you blow holes in the security. So I'd like to understand whether there's any difference in the security model of BPF stack vs BPF map. For example: BPF maps can be shared between processes — though I imagine this is acceptable if the map owner has full control of whether to share the file descriptor.

It definitely feels right to add kernel support, since the proposed workaround is pretty shocking (and I'm pretty sure it'd be hard to write for loops that satisfy the verifier).

Getting the string into a map solves the hardest problem but creates a new one: it makes it difficult for us to compose our perf_event_output buffer (i.e. for printf). I'm not sure how to concatenate a map value with other values and exfiltrate all of those in one buffer.

Well, maybe the key is to fire several perf_event_output()s -- one per printf argument. If it's performance-sensitive, then maybe a batching mechanism would be required. Not sure what a high-quality kernel method for that would look like.
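
A crude sketch of that per-argument idea, under the same hypothetical setup as before (batching deliberately left out):

struct frag_hdr {
	__u64 printf_id;  /* which printf() callsite this belongs to */
	__u32 arg_idx;    /* which argument within that printf() */
	__u32 size;       /* payload bytes in the event that follows */
};

static __always_inline void emit_arg(void *ctx, __u64 id, __u32 idx,
				     void *payload, __u32 size)
{
	struct frag_hdr hdr = { .printf_id = id, .arg_idx = idx, .size = size };

	/* two events per argument: a small on-stack header, then the
	 * (possibly map-resident) payload; perf buffers are per-CPU, so
	 * the pair stays adjacent in the stream */
	bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &hdr, sizeof(hdr));
	bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, payload, size);
}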

@Birch-san
Contributor Author

Birch-san commented Feb 10, 2019

Oh, apparently it used to be possible to probe_read directly into a BPF map value?

iovisor/bcc#900 (comment)

if you want to work on data bigger than 512 bytes don't use the stack, the verifier error you're getting is legit, but as @4ast says you can now use a map element value (which is what I'm doing in my project). Please find more details here

ability to use maps directly as buffer pointers with helpers, so that you can bypass the limited stack size

Indeed, his kernel commit from 2017 claims to enable a very relatable example:

[screenshot of the 2017 kernel commit message]

I wonder if the "probe_read may only write to a destination on the stack" is a newer validation constraint that was added to the API interface… whereas the actual verifier algorithm is more forgiving. This would indicate a regression (API becoming stricter than necessary).

@Birch-san
Contributor Author

Birch-san commented Feb 10, 2019

I had a stab at implementing some new kernel helpers (probe_read/probe_read_str which can write into map value).

branched from:

Ubuntu 18.04.1 LTS
4.15.0-45-generic
git://kernel.ubuntu.com/ubuntu/ubuntu-bionic.git
commit ffdd392b819682b66da292cd0d13569d07a1a7c2 (tag: Ubuntu-4.15.0-45.48, origin/master, origin/HEAD, master)

feature commit:

commit 25f076f737be964bcd2c7ab2c277f5c306074dab (HEAD -> bpf-probe-big-strings)
Author: Alex Birch <Birch-san@users.noreply.github.com>
Date:   Sun Feb 10 12:26:06 2019 +0000

    bpf: helpers to probe_read into BPF map values

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 04b2f613dd06..98f0c62c231d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -677,6 +677,22 @@ union bpf_attr {
  *     @buf: buf to fill
  *     @buf_size: size of the buf
  *     Return : 0 on success or negative error code
+ *
+ * int bpf_probe_read_to_map(void *dst, int size, void *src)
+ *     @dst: pointer to map value
+ *     Return: 0 on success or negative error
+ *
+ * int bpf_probe_read_str_to_map(void *dst, int size, const void *unsafe_ptr)
+ *     Copy a NUL terminated string from unsafe address. In case the string
+ *     length is smaller than size, the target is not padded with further NUL
+ *     bytes. In case the string length is larger than size, just count-1
+ *     bytes are copied and the last byte is set to NUL.
+ *     @dst: pointer to map value
+ *     @size: maximum number of bytes to copy, including the trailing NUL
+ *     @unsafe_ptr: unsafe address
+ *     Return:
+ *       > 0 length of the string including the trailing NUL on success
+ *       < 0 error
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -736,7 +752,9 @@ union bpf_attr {
 	FN(xdp_adjust_meta),		\
 	FN(perf_event_read_value),	\
 	FN(perf_prog_read_value),	\
-	FN(getsockopt),
+	FN(getsockopt),				\
+	FN(probe_read_to_map),		\
+	FN(probe_read_str_to_map),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index fe2429b382ca..ac0fffdc7f90 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -96,6 +96,26 @@ static const struct bpf_func_proto bpf_probe_read_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_probe_read_to_map, void *, dst, u32, size, const void *, unsafe_ptr)
+{
+	int ret;
+
+	ret = probe_kernel_read(dst, unsafe_ptr, size);
+	if (unlikely(ret < 0))
+		memset(dst, 0, size);
+
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_probe_read_to_map_proto = {
+	.func		= bpf_probe_read_to_map,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_UNINIT_MAP_VALUE,
+	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 BPF_CALL_3(bpf_probe_write_user, void *, unsafe_ptr, const void *, src,
 	   u32, size)
 {
@@ -503,6 +523,36 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_probe_read_str_to_map, void *, dst, u32, size,
+	   const void *, unsafe_ptr)
+{
+	int ret;
+
+	/*
+	 * The strncpy_from_unsafe() call will likely not fill the entire
+	 * buffer, but that's okay in this circumstance as we're probing
+	 * arbitrary memory anyway similar to bpf_probe_read() and might
+	 * as well probe the map value. Thus, memory is explicitly cleared
+	 * only in error case, so that improper users ignoring return
+	 * code altogether don't copy garbage; otherwise length of string
+	 * is returned that can be used for bpf_perf_event_output() et al.
+	 */
+	ret = strncpy_from_unsafe(dst, unsafe_ptr, size);
+	if (unlikely(ret < 0))
+		memset(dst, 0, size);
+
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_probe_read_str_to_map_proto = {
+	.func		= bpf_probe_read_str_to_map,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_UNINIT_MAP_VALUE,
+	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
@@ -514,6 +564,8 @@ static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
 		return &bpf_map_delete_elem_proto;
 	case BPF_FUNC_probe_read:
 		return &bpf_probe_read_proto;
+	case BPF_FUNC_probe_read_to_map:
+		return &bpf_probe_read_to_map_proto;
 	case BPF_FUNC_ktime_get_ns:
 		return &bpf_ktime_get_ns_proto;
 	case BPF_FUNC_tail_call:
@@ -542,6 +594,8 @@ static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
 		return &bpf_get_prandom_u32_proto;
 	case BPF_FUNC_probe_read_str:
 		return &bpf_probe_read_str_proto;
+	case BPF_FUNC_probe_read_str_to_map:
+		return &bpf_probe_read_str_to_map_proto;
 	default:
 		return NULL;
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4c223ab30293..8740e7374698 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -736,7 +736,9 @@ union bpf_attr {
 	FN(xdp_adjust_meta),		\
 	FN(perf_event_read_value),	\
 	FN(perf_prog_read_value),	\
-	FN(getsockopt),
+	FN(getsockopt),				\
+	FN(probe_read_to_map),		\
+	FN(probe_read_str_to_map),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call

I got this compiled and booted into it, built libbpf and bcc against it, added new IRBuilder definitions, added some calling code and successfully compiled bpftrace… but I guess I've missed some step. The verifier says that my new helper functions don't exist.

@brendangregg
Contributor

@Birch-san can you please bring this up on the netdev mailing list? http://vger.kernel.org/vger-lists.html#netdev This is where kernel BPF development is done.

I think you'll get asked to show why bpf_probe_read_to_map() is necessary, and why we can't just use bpf_probe_read() to read into a map. If that's broken, they may want to fix it, and add a test to test_verifier.c to make sure it stays fixed.

@Birch-san
Contributor Author

@brendangregg thanks for pointing me to the right people. mailing lists are a new world to me, so I'm lurking for a bit first. then I'll ask whether probe_read_* was intended to be stack-only. may need to prepare a minimal example for this in bcc.

@mmarchini
Contributor

composes the perf event output struct on the BPF stack

@Birch-san I opened #750 which should solve this, you might be interested :)

I wonder if the "probe_read may only write to a destination on the stack" is a newer validation constraint that was added to the API interface… whereas the actual verifier algorithm is more forgiving. This would indicate a regression (API becoming stricter than necessary).

I would guess it's more likely that the desired behavior changed and the API documentation was not updated. Which docs were you looking at? I don't see any reference to "stack only" for probe_read[_str] in include/uapi/linux/bpf.h. Also, looking at the tests, this seems to be the desired behavior:

// tools/testing/selftests/bpf/test_verifier.c
	{
		"helper access to map: full range",
		.insns = {
			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
			BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
			BPF_LD_MAP_FD(BPF_REG_1, 0),
			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4),
			BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
			BPF_MOV64_IMM(BPF_REG_2, sizeof(struct test_val)),
			BPF_MOV64_IMM(BPF_REG_3, 0),
			BPF_EMIT_CALL(BPF_FUNC_probe_read),
			BPF_EXIT_INSN(),
		},
		.fixup_map2 = { 3 },
		.result = ACCEPT,
		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
	},
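
Translated to restricted C, the accepted program above is roughly the following (the struct and map names are stand-ins for the selftest fixtures, not real kernel types):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct test_val { __u64 data[8]; };  /* stand-in for the selftest's struct */

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct test_val);
} map SEC(".maps");

SEC("tracepoint")
int read_into_map_value(void *ctx)
{
	__u32 key = 0;
	struct test_val *val = bpf_map_lookup_elem(&map, &key);

	if (val)
		/* dst is a map value, not the stack; src is NULL just as in
		 * the selftest, which only exercises the verifier */
		bpf_probe_read(val, sizeof(*val), (void *)0);
	return 0;
}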

@Birch-san
Contributor Author

Birch-san commented Jun 15, 2019

@mmarchini I'm indeed interested; thanks for progressing this.

====

The reason I believe that probe_read_str() refuses to write into a BPF map value (and will only write onto the BPF stack) is that I attempted it before:

bd038f9...Birch-san:c5088c9a5d731165e1005a6a800e611bb63f3dc3

My attempt looks very similar to what's in your pull request. Maybe there's some significance in your using probe_read instead of probe_read_str. Or maybe your fixed-size read gives the verifier extra guarantees.

The error I got at the time (commit message is in commit 52677d0 of this diff) was:

R1 type=map_value_or_null expected=fp

My understanding of that error was as follows:

probe_read_str() wants its arg1 (*dst) to be a pointer to the stack, yet I'm asking it (probably) to write into PTR_TO_MAP_VALUE.

====

I found another expected=fp here: iovisor/bcc#209

Yonghong-song explained the error:

for map operations. the verifier has the following constraints for safety
reasons (see kernel/bpf/verifier.c)

ARG_PTR_TO_MAP_KEY is one such argument constraint. It means that the register type passed to this function must be PTR_TO_STACK and it will be used inside the function as 'pointer to map element key'

====

Here's the prototype for probe_read_str, which demonstrates the constraints on it:

https://github.com/torvalds/linux/blob/390e99cfdda1334f45c718cc02cd26eb3135f233/kernel/trace/bpf_trace.c#L565

static const struct bpf_func_proto bpf_probe_read_str_proto = {
	.func		= bpf_probe_read_str,
	.gpl_only	= true,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
	.arg3_type	= ARG_ANYTHING,
};

Its arg1 (*dst), has type ARG_PTR_TO_UNINIT_MEM, which is described here:

https://github.com/torvalds/linux/blob/390e99cfdda1334f45c718cc02cd26eb3135f233/include/linux/bpf.h#L194

the following constraints used to prototype bpf_memcmp() and other functions that access data on eBPF program stack

pointer to memory does not need to be initialized, helper function must fill all bytes or clear them in error case.

This suggests to me that ARG_PTR_TO_UNINIT_MEM refers to stack memory.

====

Recall the error I received, R1 type=map_value_or_null expected=fp.

https://github.com/torvalds/linux/blob/390e99cfdda1334f45c718cc02cd26eb3135f233/kernel/bpf/verifier.c#L389

/* string representation of 'enum bpf_reg_type' */
static const char * const reg_type_str[] = {
//
	[PTR_TO_MAP_VALUE_OR_NULL] = "map_value_or_null",
	[PTR_TO_STACK]		= "fp",
//
};

This suggests again to me that BPF required me to point to the stack, when in reality I pointed at a map value.

====

I would guess it's more likely that the desired behavior changed and the API documentation was not updated. Which docs were you looking at?

The error message I received in my attempt, and the corroboration above, were the only reasons I believed this impossible. Maybe there was no change in behaviour, and the original premise #305 (comment) is still possible.

====

It's heartening that you've got probe reads working with map values as a destination. I don't understand (given the above) why it works, but that's good news.

If it turns out now that we have all the kernel support necessary to do this, then I'm tempted to have another stab at it.

@mathias-nyman

I tried to raise BPFTRACE_STRLEN beyond 200 and got the error message with a link to this issue. For the next person that follows this path, I'll leave my workaround here, with which I was able to print out a maximum of ~300 KB of ASCII from arg1:

export BPFTRACE_STRLEN=200;\
    export BPFTRACE_STR_TRUNC_TRAILER=; \
    sudo -E bpftrace --unsafe -e \
    'uprobe:<file>:<func/offset>
    {
        $s = arg1;
        $i = 0;
        while ($i < 300000) {
            if(str($s+(uint64)$i) == "\0") {
                break;
            }
            printf("%s", str($s+(uint64)$i));
            $i+=199
        }
        printf("\n\n");
    }'

This was pretty much the max amount I was able to print out after some fiddling. Any larger than this and the BPF program became too large. The number of instructions inside the while loop limits how many iterations you can have, so if you want more logic inside the while loop, you may need to reduce the iteration count.

@uli-heller

Hey @mathias-nyman, your proposal seems to work for me, thx a lot. I'm wondering about $s+(uint64)$i. Is this "safe"? I don't know much about "bpf". Comparing this to my "C days" decades ago, it looks like pointer arithmetic that could produce a pointer way beyond the valid memory area. For example, when arg1 contains a string of 300 characters:

  • i=0 -> $s+(uint64)$i = $s OK
  • i=199 -> $s+(uint64)$i = $s+199 OK
  • i=398 -> $s+(uint64)$i = $s+398 KO, way beyond the end of "arg1"

I did some tests related to tracepoint:syscalls:sys_enter_rename. I wanted to log the potentially long string args->newname
and got output like this:

/home/uli/Downloads/tmp/01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789.txt^X?^U^X?<C2>^U^X?<FA><C2>^U^X?01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789.txtextMenusragebRequesttxt

Please note that I used BPFTRACE_STRLEN=64 (and +=63). I guess this shouldn't change much...
I definitely used "/home/uli/Downloads/tmp/01234567...0123456789.txt" for the "newname".

Thx for your help, Uli

@mathias-nyman

@uli-heller you're absolutely right about the lack of bounds checking. I remember being confused as to why it seemed to work for the program I was instrumenting at the time, but it did 🤷‍♂️ I was on a mission of extracting data and didn't care much about correctness at the time, sorry. Lemme know if you figure it out.

If you only have medium-sized strings to print out without the chance of overflow, you can still use this trick by setting BPFTRACE_STRLEN=2 (and +=1) but then you also have to decrease the iteration count (the number in the while loop).

@uli-heller

@mathias-nyman : Thx a lot. I followed a different approach using two nested loops - one to simulate something like "strlen()". Your approach is much easier. I modified it a little bit. Now I have a solution which doesn't need any BPFTRACE variables:

      //
      // Derived from:
      //   https://github.com/iovisor/bpftrace/issues/305
      //
      $st     = args->newname; // print the complete contents of this
      $maxlen = 500;           // ... hopefully no more than 500 characters (raise this if needed!)
      while ($maxlen > 0) {
        $sst = str($st, 1);
        if ($sst == "") {
          break;
        }
        printf("%.1s", $sst);
        $st++;
        $maxlen--;
      }
      printf("\n");

There is just one potential caveat: What about multibyte characters?

danobi added a commit that referenced this issue Jun 12, 2024
Previously, all strings needed to be stored on the stack. Given that BPF
has a 512B stack limit, this put serious limits on strings. In practice,
it was limited to about 200B.

This commit raises the limit much higher to around 1024. The new limit
we hit is from the LLVM memset() builtin. It only supports up to 1K. To
go above this, we need to create our own memset() routine in BPF. This
is quite easy to do and will be attempted in a later commit.

The way big strings work is by assigning each str() call in the script a
unique ID. Then we create a percpu array map such that for each CPU,
there is a percpu entry that can accommodate each str() callsite. This
makes it so scripts can keep as many strings "on the stack" as they'd
like, with the tradeoff being the amount of memory we need to
preallocate.

As long as the kernel never allows BPF programs to nest (e.g. execute prog
A while prog A is already on the call stack), this approach is robust to
self-corruption.

Note we do not yet remove or change the default value of
BPFTRACE_MAX_STRLEN. That is because a few other helpers (buf(), path(),
maybe more) also read the value and allocate space on stack accordingly.
We need to decouple or convert those helpers to scratch maps as well.

This closes #305.
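
Illustratively, the per-callsite scratch layout described above might look like the following (not bpftrace's actual generated code; names and sizes are made up):

#define MAX_STRLEN     1024  /* hypothetical configured limit */
#define NUM_STR_CALLS  4     /* str() callsites found in the script */

struct str_slots {
	/* slot[id] backs the str() call with unique ID id; the map is
	 * per-CPU, so probes running on different CPUs never collide */
	char slot[NUM_STR_CALLS][MAX_STRLEN];
};

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct str_slots);
} str_buffers SEC(".maps");
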
@danobi
Member

danobi commented Jun 12, 2024

Follow ups to initial support are being tracked here: #3229
