CPU Sampling with
perf and Flame Graphs
In this lab, you will experiment with CPU profiling of a simple application. The techniques you learn will be applicable to application profiling as well as profiling the entire system, will apply to user-mode and kernel-mode code, and will apply to any language or framework that has
perf symbol support through the
/tmp/perf-$PID.map convention, such as Java and Node.js.
Task 1: Compile the Profiled Application
The application you're about to profile is a very simple command-line tool that counts prime numbers in a given range, and uses OpenMP to parallelize the computation. But first, let's build it and make sure everything works. Navigate to the directory that contains primes.c and build it using the following command:
$ gcc -g -fno-omit-frame-pointer -fopenmp primes.c -o primes
Now, experiment with multiple executions of this application with varying numbers of threads by running the following command:
for i in `seq 1 2 16`; do \ OMP_NUM_THREADS=$i bash -c 'time ./primes'; \ done
Elapsed time should stabilize when you create enough threads to keep all the cores saturated throughout the entire run. Because the distribution of work between threads is not ideal, on an N-core system you will need more than N threads to keep them all busy.
Task 2: Profile with
It's time to find out what's taking so long to count our primes. Use the following command to get
perf's opinion. You will need a root shell for this.
# export OMP_NUM_THREADS=16 # perf record -g -F 997 -- ./primes
In the preceding command, the
-gswitch indicates that you want the call stack to be captured; the
-Fswitch lets you set the rate of sampling for events -- 997 times times per second in this case. The trace will include events aggregated across all threads spawned by
primes. To include events from other processes while the program is running, specify
-a, and if you really want to trace only a single thread within a process, you can use use
To figure out where the bottleneck is, inspect
# perf report --stdio
It seems that we're spending a whole lot of time in the
is_divisible function. If you want to know where exactly, you can use the following command:
# perf annotate
Note that, as discussed in class, there may be skid in the displayed results. Notably, the specific instruction that seems to be taking most of the time might actually be masquerading for the previous or following instruction (or couple of instructions).
Task 3: Extract a Flame Graph
Now that you have the basic picture of what's happening in the application, it would be interesting to inspect everything else. The perf report is quite long, though:
# perf report --stdio | wc -l 1286
NOTE: Your report might be a lot shorter; this is just an example. In a real-world application, a report that is hundreds of thousands of lines long will not be unusual. In this simple demo, your report might be just a handful of lines.
So instead of reading long reports, let's take flame graphs for a spin. If you haven't already, clone the FlameGraph repository, and then run the following commands to post-process the perf report:
# perf script > primes.stacks # FlameGraph/stackcollapse-perf.pl primes.stacks > primes.collapsed # FlameGraph/flamegraph.pl primes.collapsed > primes.svg
Open the resulting flame graph and enjoy the results. It is now immediately evident that there are no significant stack tree branches other than the
is_divisible branch that we have already seen.
Task 4: Profiling Java Code
In this section, you are going to profile some Java code and generate flame graphs for that as well. To help
perf figure out the symbols (function names) for your Java code, you will use perf-map-agent, which attaches to your Java process and generates map files that map dynamically-generated Java machine code to class and method names.
perf-map-agent can also run the whole recording session for you, through a set of handy scripts such as
First, run the Java application that we're about to profile. It is a Java version of the same prime-counting app:
$ java -XX:+PreserveFramePointer slowy/App
Do not hit ENTER yet. Instead, in another (non-root) shell, run
jps to find the process id for the Slowy app, and then run the collection tool. After you run
perf-java-report-stack, hit ENTER in the Java app's console so that the collection tool records some meaningful work.
$ jps 2144 Jps 2132 App $ ./perf-java-report-stack 2132 # use the pid from the previous step # hit ENTER in the java console
If everything went well, you should see the
perf report ncurses UI, showing the bottlenecks in the Java program. These will likely be
App::isPrime, because the smaller methods were optimized away (inlined). You can repeat the experiment and run Slowy with the
-XX:-Inline switch to prevent this optimization and obtain more accurate results that include the
To generate flame graphs from this report, you need the perf.data file created by
perf. By default,
perf-java-report-stack places it in
/tmp/perf-PID.data, where PID is the process ID of your Java process. You can then run the following command to generate a flame graph:
$ perf script -i /tmp/perf-PID.data | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl --colors=java > java.svg
--colors=java parameter, which instructs the flamegraph.pl script to colorize functions according to their type: Java methods, native methods, and kernel methods will all get their own distinctive color palettes.
Task 5: Profiling Node.js Code
In this section, you are going to profile a Node.js application. To help
perf figure out the symbols (function names) for your Node.js code, you will need to run the Node process with the
--perf_basic_prof_only_functions command-line switches. These instruct the Node runtime to generate a map file for each JIT-compiled method, which maps the method address to the method name that
perf can parse.
Navigate to the
nodey directory. If you haven't yet, you should make sure all required packages are installed by running
npm install. Then, run the following command to start our simple Node application with perf map support:
$ ./run.sh perf
Now, in a root shell, start recording CPU samples using
# perf record -F 97 -p $(pgrep -n node) -g
In the first shell, run the following command to execute a simple benchmark of one of our Node application's endpoints, which is consuming lots of CPU:
$ ab -c 10 -n 100 -m POST 'http://localhost:3000/users/auth?username=foo&password=wrong'
When the benchmark completes, go back to the root shell and stop the recording by hitting Ctrl+C. Change the permissions on the perf.data file so that you can view it as a standard user:
# chown fedora perf.data
Now, go back to the first shell. You can view the report in text format by running:
$ perf report -i /path/to/perf.data
Or, you can generate a flame graph from the report, by using the following command:
$ perf script -i /path/to/perf.data | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > node.svg
Bonus: Some More Flame Graphs
For some more fun, experiment with collecting CPU samples and then generating flame graphs while the following commands are running:
dd if=/dev/zero of=/dev/null bs=100K count=50K
You might get a lot of
[unknown] frames because of missing debuginfo. Try to obtain debuginfo for the missing components and then try again. If there's no choice, you can always build from source with the
-g option to make sure debuginfo will be available.