Performance impact measurement tool #24

l0kod · 2024-02-01T09:57:52Z

We need tooling to measure the performance impact of kernel changes. Until now, we used simple scripts to get an idea of the worse case scenarios, and manual profiling.

It would be useful to have a way to measure in a predictive way the overhead of sandboxing on a set of scenarios (e.g. path walk) with synthetic benchmarks. Several tracing kernel features might be interesting: tracepoints, eBPF...

We might need kernel changes such as tracepoints.

This measurement tool should be simple to use manually and by a CI (see landlock-test-tools). It would be useful to measure performance improvement changes (#1 and #9).

See seccomp_benchmark.c and other benchmark programs in tools/testing/selftests.

sm1ling-knight · 2024-04-12T12:20:55Z

I don't think that adding tracepoints is a good solution here.
We want to measure the performance impact of kernel changes, i.e. to have information about overhead of Landlock hooks for every required syscall and kernel method (e.g. file_open(), udp_sendmsg()) after every change connected with possible performance overhead. Such a tracepoint would be useless for users.

Possibly, it would be better to use kprobes. However, they have some issues compared to tracepoints:

Kprobes mechanism overhead [1]: however, this may not have a strong effect since we want to measure impact through Landlock versions
Unstable testing environment: we'll have to rewrite testing environment for some Landlock or kernel api changes. But i don't think this will happen often, since kprobes will most likely only need a name of functions and many kernel methods have static tracepoints.

What do you think?

[1] https://lore.kernel.org/all/20230111065930.1494-1-cuiyunhui@bytedance.com/

We can obtain the information with kretprobe, but as we know, kprobe gets
the result by trappig in an exception, which loses performance compared
to tracepoint.
...
kr:
Time per request: 43.168 [ms] (mean)
Time per request: 0.043 [ms] (mean, across all concurrent requests)

tracepoint:
Time per request: 41.004 [ms] (mean)
Time per request: 0.041 [ms] (mean, across all concurrent requests

l0kod · 2024-04-19T21:39:35Z

I think kprobes/kretprobes would be fine from a performance point of view, but that may not be enough to identify requests to specific kernel objects (e.g. the performance impact of opening a specific file). That may not be an issue for benchmark if we can scope the probes to only the benchmarked process though.

I think tracepoints are useful when we need to increment a counter or print variables according to code within a function or with a specific semantic (not necessarily passed as function argument).

The kernel API change and especially the Landlock changes should not be a big issue because the benchmark tool would mostly live with the kernel source code and could be part of the Kselftests and be updated with the kernel code.

gnoack · 2024-04-25T08:58:03Z

Hello @sm1ling-knight -- just for coordination; did you start working on this bug already?

sm1ling-knight · 2024-04-25T10:15:49Z

Hello @sm1ling-knight -- just for coordination; did you start working on this bug already?

Hello! Yep, I'm currently working on it.

l0kod · 2024-04-25T10:26:19Z

Good! We should explicit say when we started or want to work on a task to avoid duplicated work. 😉

sm1ling-knight · 2024-04-25T13:52:18Z

@l0kod, do you have any cases of what events would be useful to trace in Landlock? I currently have an idea of tracing landlock_unmask_layers() in order to view frequency of access checks.

I also want to add CPU profiling to measure overhead of Landlock changes. What do you think?

l0kod · 2024-04-30T09:49:36Z

@l0kod, do you have any cases of what events would be useful to trace in Landlock? I currently have an idea of tracing landlock_unmask_layers() in order to view frequency of access checks.

I expect landlock_unmask_layers() to be very quick to execute, but the number of calls looks like an interesting metric. Same for landlock_find_rule(). Increasing the depth of a path will increase the number of calls and the time taken by these slower functions: is_access_to_paths_allowed() and collect_domain_accesses(). Being able to group the number of landlock_unmask_layers() and landlock_find_rule() calls per caller should help.

On top of that, timing each LSM hook (including the Landlock hook implementations) called per syscall should give a good overview of Landlock's impact on the kernel's execution time.

I also want to add CPU profiling to measure overhead of Landlock changes. What do you think?

That would be very useful!

l0kod · 2024-04-30T09:57:39Z

For reference, Flame Graphs are a visualization tool (relying on kernel features such as perf) useful to see a performance impact and to debug, but not to quickly compare two implementations nor to measure according to a context (e.g., depth of a file path) in an automatic (and deterministic as much as possible) way.

gnoack · 2024-05-13T21:30:05Z

In my mind, the percentage of time spent in Landlock would be an interesting metric to capture; In particular, the entry point functions are interesting, which are hit during program execution; that is: (a) the three syscalls, and (b) the LSM hooks which Landlock overrides.

This is already recorded when doing perf record with -g.
The following invocation is using the landlock-restrict tool which is part of the go-landlock library, to sandbox the ls command:

perf record --freq=max -g landlock-restrict -3 -rw / -- /bin/ls

It might be wiser to replace /bin/ls with something more long running. With ls, it was difficult to find a sufficiently high number of traces with our functions in them, because it only runs so briefly. I attempted something like find / -exec file '{}' \;, but this is admittedly a bit unrealistic and extreme, and might also spend a lot of time in execve().

The result can then be plotted with Brendan Gregg's Flame graph scripts; I'm using this:

cat ~/bin/,flame
#!/bin/sh

while getopts ":s" opt; do
    case $opt in
        s)
            CMDPREFIX=sudo
    esac
done

FG="${HOME}/git/FlameGraph"
$CMDPREFIX perf script | "${FG}/stackcollapse-perf.pl" | "${FG}/flamegraph.pl" > /tmp/flame.svg
chromium /tmp/flame.svg

or it can be analyzed with perf report, where it is a bit easier to find the numbers.

This is just how far I got with the obvious part. I figured that perf's perf.data output file has all the useful data already, when using sampling. Open question are:

I would love to find out how to extract these CPU percentages directly from perf.data in a scriptable way.
I would like to have a collection of a few realistic scenarios which we can profile.
- Idea: Maybe something like https://justine.lol/make/ (creates a new sandbox per compiler invocation, and can be made to run reasonably long; or just override CC= with a sandboxed GCC and use GNU make)
- Idea: The xzdec tool is using Landlock; maybe we can find a realistic use case where this or another compression library gets invoked?
- ...and maybe also some more artificial scenarios in the mix.

sm1ling-knight · 2024-05-16T13:47:56Z

I don't think that perf is suitable for our case. Execution time of hooks may be too short for perf to provide sufficient accuracy.

On my system execution of 'file_open' hook takes about 4e-5 seconds (in average) for sandboxed find / command. This hook takes about 40% of the execution time of the openat syscall.

I got these values using bpftrace tool and implementing lsm_hook_enter/lsm_hook_exit tracepoints that wrap each lsm hook.

#define call_void_hook(FUNC, ...)				\
	do {							\
		struct security_hook_list *P;			\
								\
		hlist_for_each_entry(P, &security_hook_heads.FUNC, list) { \
			trace_lsm_hook_enter(P->lsmid->id, #FUNC);		\
			P->hook.FUNC(__VA_ARGS__);		\
			trace_lsm_hook_exit(P->lsmid->id, #FUNC);		\
		}						\
	} while (0)

Tracepoints were used to gather total execution time and total number of calls for 'file_open' hook and openat syscall. You can see bpftrace script similar to the one I used for gathering at the end of the message

It seems that sampling frequency should be at least on the order of 10^-5 seconds for adequate accuracy.

From Brendan Gregg's blog:
-F 99: sample at 99 Hertz (samples per second). I'll sometimes sample faster than this (up to 999 Hertz), but that also costs overhead. 99 Hertz should be negligible. Also, the value '99' and not '100' is to avoid lockstep sampling, which can produce skewed results.

It turns out that the sampling frequency that does not involve overhead is on the order of 10^-2. Using perf would be unacceptable in order to measure time taken by hook accurately enough. Moreover, using perf involves problem with tracing static symbols in fs.c, net.c (such as hooks) and extracting CPU percentages (as you mentioned).

I suggest using bpftrace tool for profiling. It seems like quite convenient way to measure execution time for our case with almost zero overhead. Bpftrace allows to add gathering instructions for every required tracing blob (tracepoint, kprobe, uprobe..).

To measure time in this way, following steps are required:

Create tracepoints that I described at the beginning (lsm_hook_enter/lsm_hook_exit)
Write a bpftrace script that will gather the total time spent by the workload in landlock hook and in desirable syscalls. The script might look something like this:

tracepoint:security:lsm_hook_enter / args->lsmid == 110 /
{
	@lsm_start[tid, str(args->func)] = nsecs;
}

tracepoint:security:lsm_hook_exit / args->lsmid == 110 /
{
	@ns_lsm[tid, str(args->func)] += (nsecs - @lsm_start[tid, str(args->func)]);
	@count_lsm[tid, str(args->func)]++;
	delete(@lsm_start[tid, str(args->func)]);
}

tracepoint:syscalls:sys_enter_openat
{
	@sys_openat_start[tid] = nsecs;
}

tracepoint:syscalls:sys_exit_openat
{
	@ns_sys_openat[tid] += (nsecs - @sys_openat_start[tid]);
	@count_sys_openat[tid]++;
	delete(@sys_openat_start[tid]);
}

// Shouldn't be called, since script is halted after workload end of execution.
interval:s:300 { exit(); }

l0kod · 2024-05-17T15:22:09Z

Because we can change the Landlock/kernel code, I don't think we should be limited by any sampling frequency.

Your bpftrace script looks good! I guess we should not need more dependency than this tool.

I would love to find out how to extract these CPU percentages directly from perf.data in a scriptable way.

We indeed need all benchmarks to be simply run, eventually by a CI.

However, it would help to also be able to create a flamegraph from these benchmarks, but that could be another iteration of the patches.

I would like to have a collection of a few realistic scenarios which we can profile.

Idea: Maybe something like https://justine.lol/make/ (creates a new sandbox per compiler invocation, and can be made to run reasonably long; or just override CC= with a sandboxed GCC and use GNU make)

Idea: The xzdec tool is using Landlock; maybe we can find a realistic use case where this or another compression library gets invoked?

...and maybe also some more artificial scenarios in the mix.

We should have both microbenchmarks and real-world benchmarks. It looks like bpftrace scripts should be good for both cases.

The main intensive and real-world scenarios that come to mind are:

find with different depth level
wrapped/sandboxed tar, zip, or native zx indeed

sm1ling-knight · 2024-05-23T14:32:33Z

Hello! Can i send the implementation as landlock-test-tools pull request? (when it's ready)

l0kod · 2024-05-30T08:15:15Z

Hello! Can i send the implementation as landlock-test-tools pull request? (when it's ready)

As for the seccomp benchmark tool, it would help to keep similar tools in the kernel source tree e.g., to easily run performance tests with a kernel CI, and to add reference to performance improvements in commit messages. Moreover, the benchmark scripts may need to evolve along with future kernel versions (e.g. if an LSM hooks change).

I think higher level tooling (e.g. generate flame graphs, compare performance of two kernel versions) would make sense in landlock-test-tools though.

sm1ling-knight · 2024-05-30T15:53:44Z

Hello! Can i send the implementation as landlock-test-tools pull request? (when it's ready)

As for the seccomp benchmark tool, it would help to keep similar tools in the kernel source tree e.g., to easily run performance tests with a kernel CI, and to add reference to performance improvements in commit messages. Moreover, the benchmark scripts may need to evolve along with future kernel versions (e.g. if an LSM hooks change).

Do you agree that adding tracepoints lsm_hook_enter/lsm_hook_exit to kernel source is rational? Initially, I thought that they would be presented as some kind of *.patch file in landlock-test-tools.

l0kod · 2024-05-31T17:41:54Z

Do you agree that adding tracepoints lsm_hook_enter/lsm_hook_exit to kernel source is rational? Initially, I thought that they would be presented as some kind of *.patch file in landlock-test-tools.

Yes, that looks reasonable to me.

sm1ling-knight · 2024-05-31T18:00:07Z

Btw, i decided to try bcc to solve this issue. It's a bit more verbose compared to bpftrace, but it makes it easier to work with command line arguments and custom output (for manual usage). I think it doesn't require a lot of dependencies, so I'll try to make the first patch using bcc. It should be easy to switch into bpftrace, if something goes wrong.

l0kod · 2024-06-10T18:20:37Z

FYI, patches fixing an issue in perf with Landlock and BTF:

l0kod added the test Test improvement label Feb 1, 2024

l0kod mentioned this issue Mar 21, 2024

Domain hash table #1

Open

This was referenced Apr 5, 2024

UDP socket control #10

Open

Use dedicated SLABs #19

Closed

l0kod assigned sm1ling-knight Apr 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance impact measurement tool #24

Performance impact measurement tool #24

l0kod commented Feb 1, 2024 •

edited

sm1ling-knight commented Apr 12, 2024

l0kod commented Apr 19, 2024 •

edited

gnoack commented Apr 25, 2024 •

edited

sm1ling-knight commented Apr 25, 2024

l0kod commented Apr 25, 2024

sm1ling-knight commented Apr 25, 2024

l0kod commented Apr 30, 2024

l0kod commented Apr 30, 2024 •

edited

gnoack commented May 13, 2024

sm1ling-knight commented May 16, 2024

l0kod commented May 17, 2024

sm1ling-knight commented May 23, 2024 •

edited

l0kod commented May 30, 2024

sm1ling-knight commented May 30, 2024

l0kod commented May 31, 2024

sm1ling-knight commented May 31, 2024

l0kod commented Jun 10, 2024

Performance impact measurement tool #24

Performance impact measurement tool #24

Comments

l0kod commented Feb 1, 2024 • edited

sm1ling-knight commented Apr 12, 2024

l0kod commented Apr 19, 2024 • edited

gnoack commented Apr 25, 2024 • edited

sm1ling-knight commented Apr 25, 2024

l0kod commented Apr 25, 2024

sm1ling-knight commented Apr 25, 2024

l0kod commented Apr 30, 2024

l0kod commented Apr 30, 2024 • edited

gnoack commented May 13, 2024

sm1ling-knight commented May 16, 2024

l0kod commented May 17, 2024

sm1ling-knight commented May 23, 2024 • edited

l0kod commented May 30, 2024

sm1ling-knight commented May 30, 2024

l0kod commented May 31, 2024

sm1ling-knight commented May 31, 2024

l0kod commented Jun 10, 2024

l0kod commented Feb 1, 2024 •

edited

l0kod commented Apr 19, 2024 •

edited

gnoack commented Apr 25, 2024 •

edited

l0kod commented Apr 30, 2024 •

edited

sm1ling-knight commented May 23, 2024 •

edited