Description
When interpreting a perf profile, profgen only stores a single BaseAddress value in the ProfiledBinary class:
llvm-project/llvm/tools/llvm-profgen/ProfiledBinary.h
Lines 200 to 201 in 15d11eb
// The runtime base address that the first executable segment is loaded at.
uint64_t BaseAddress = 0;
However, with address space layout randomisation, the same binary can be loaded at a different address in each process. When multiple instances of the same binary run concurrently, the mmap event of each new process overwrites BaseAddress before all the samples from the earlier process have been handled.
For example, using this program:
#include <stdint.h>

// Loop for a while so we see samples in the function (compiled at -O0)
void loop1() {
  for (uint64_t i = 0; i < 10000000000; i++) {}
}

// Slightly modified to make sure there isn't any kind of linker merging
void loop2() {
  for (uint64_t i = 1; i < 10000000001; i++) {}
}

int main(int argc, char *argv[]) {
  // Use CLI argument to choose which loop to run, so we can distinguish which process samples were collected from
  if (argc >= 2 && argv[1][0] == '1') {
    loop1();
  }
  if (argc >= 2 && argv[1][0] == '2') {
    loop2();
  }
}
compiled at -O0 so the loop doesn't get optimised out:
gcc -g test.c -o test # I happened to use GCC but clang should work the same
then if we run one after the other like this:
./test 1
sleep 1
./test 2
and collect a perf profile:
taskset -c 1 perf record -o perf.data --freq=max -b -e BR_INST_RETIRED.NEAR_TAKEN:uppp bash run_with_delay.sh
then convert to proftext format:
llvm-profgen --binary=./test --format=text -output=output.proftext --perfdata=perf.data
then, as expected, we get samples for both functions, one from each process:
loop1:16274349:0
0: 0
1: 774969
2: 0
loop2:16274349:0
0: 0
1: 774969
2: 0
This is because the first process has already exited by the time profgen interprets the mmap event for the second process and updates the base address, so no samples are missed.
However, if we instead run both processes concurrently:
./test 1 & ./test 2
then we get this (note that the order of functions is now reversed):
loop2:16276302:0
0: 0
1: 775062
2: 0
loop1:3906:0
0: 0
1: 186
2: 0
The first process loads, and profgen reads its mmap event and sets the base address, and begins collecting samples. However, after 186 samples were collected, the second process loads, overwriting the base address. Now, any more samples collected from the first process appear to be out of the range of the binary, and so are discarded. For the rest of the execution, profgen only counts samples from the second process, so while the count for the second process is correct, the count for the first process is too low.
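To make the failure mode concrete, here is a minimal, self-contained sketch of what happens when only one base address is tracked. It is a toy model, not profgen's code: the `Event` struct and `countWithSingleBase` are hypothetical names made up for illustration, and real perf events carry far more information.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical event record for illustration only; not profgen's real types.
struct Event {
  enum Kind { Mmap, Sample } Type;
  uint32_t Pid;
  uint64_t Addr; // load address for Mmap, sampled address for Sample
};

// Counts the samples that land inside [BaseAddress, BaseAddress + Size) when
// a single base address is kept, so each mmap event overwrites the last one.
uint64_t countWithSingleBase(const std::vector<Event> &Events, uint64_t Size) {
  assert(Size > 0 && "binary must have a non-empty address range");
  uint64_t BaseAddress = 0;
  uint64_t Counted = 0;
  for (const Event &E : Events) {
    if (E.Type == Event::Mmap) {
      BaseAddress = E.Addr; // the previous process's base is lost here
      continue;
    }
    // Samples from a process whose base was overwritten fall out of range.
    if (E.Addr >= BaseAddress && E.Addr - BaseAddress < Size)
      ++Counted;
  }
  return Counted;
}
```

Feeding this an interleaved sequence (mmap for PID 1, a sample from PID 1, mmap for PID 2, then one more sample from each process) counts only two of the three samples: the late sample from PID 1 is checked against PID 2's base address and dropped.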
This is an issue when profiling builds, for example by running perf record on ninja or make, where many instances of the same binary run concurrently: profile generation appears to work but silently undercounts blocks.
As far as I can tell, the solution is to keep a map from PID to base address and use the PID of each sample to pick the right base address. In my testing this appears to fix the problem, and samples from both processes are included in the final profile.
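A rough sketch of the PID-keyed lookup described above might look like the following. The class `BaseAddressTable` and its methods are hypothetical and not part of profgen; the sketch also assumes a single executable mapping per process and ignores PID reuse, which a real patch would need to consider.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical helper for illustration only; not an existing profgen class.
class BaseAddressTable {
public:
  // Remember where the binary was loaded for this process (from its mmap event).
  void recordMmap(uint32_t Pid, uint64_t LoadAddress) {
    Bases[Pid] = LoadAddress;
  }

  // Translates a sampled runtime address into an offset into the binary,
  // using the base address recorded for the sample's own PID. Returns false
  // if the PID is unknown or the address lies outside the binary's range.
  bool toOffset(uint32_t Pid, uint64_t Addr, uint64_t Size,
                uint64_t &Offset) const {
    assert(Size > 0 && "binary must have a non-empty address range");
    auto It = Bases.find(Pid);
    if (It == Bases.end())
      return false;
    uint64_t Base = It->second;
    if (Addr < Base || Addr - Base >= Size)
      return false;
    Offset = Addr - Base;
    return true;
  }

private:
  std::unordered_map<uint32_t, uint64_t> Bases; // PID -> runtime base address
};
```

With this shape, a later mmap event from another process adds a new entry instead of replacing the existing one, so samples from both processes still resolve to offsets within the same binary and can be aggregated.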
Before submitting a PR, it might be useful to get some clarification on whether this is the intended behaviour: should samples from multiple processes be merged together as if they were collected from one process, or is profgen intended to produce a single profile per process to be merged later? Since there already seems to be partial support for generating profiles with data aggregated from multiple processes, I'm leaning towards this being a supported use case, but the documentation is sparse.