
Sampling mode


If you run magic-trace with the -sampling flag, or on a machine which doesn't support Intel PT, magic-trace will instead produce a trace collected by sampling callstacks rather than one reconstructed from Intel PT events. This does mean many short function calls will be missed, and there is much higher overhead.
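For example, a minimal sketch of forcing sampling mode on a program (./my-program is a placeholder binary, and the exact flag placement may differ slightly depending on your magic-trace version):

```
# Trace a program using the sampling backend instead of Intel PT.
magic-trace run -sampling ./my-program
```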

This feature allows one to use magic-trace on machines which don't support Intel PT, e.g. an AMD machine. It can also be useful for long-running traces where less granularity is acceptable. Most of the configuration works the same across modes, with the exception of -snapshot-size and -timer-resolution: -snapshot-size is ignored, and magic-trace will always output a trace consisting of 512K of data unless -full-execution is passed. See Timer resolution configuration for information on -timer-resolution. There is also a -callgraph-mode flag used to configure how callstacks are reconstructed.
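As a hedged sketch, if you want the whole sampled execution rather than the default 512K snapshot (again, ./my-program is a placeholder):

```
# Keep the entire sampled trace instead of the fixed-size snapshot.
magic-trace run -sampling -full-execution ./my-program
```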

Callgraph options

When running magic-trace with the sampling backend (i.e. with -sampling), a -callgraph-mode can be passed explicitly; otherwise a default is selected. The options for this argument are listed below, with a short usage sketch after the list:

  • (Last_branch_record (stitched true)) or (Last_branch_record) or lbr
  • (Last_branch_record (stitched false)) or lbr-no-stitch
  • Dwarf or dwarf
  • Frame_pointers or fp
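A sketch of selecting a mode explicitly (./my-program is a placeholder; the accepted spellings are the ones listed above):

```
# Reconstruct callstacks from DWARF debugging information.
magic-trace run -sampling -callgraph-mode dwarf ./my-program

# Use the last branch record hardware feature without stitching.
magic-trace run -sampling -callgraph-mode lbr-no-stitch ./my-program
```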

If the user does not select a mode, lbr will be selected if the user is running on an Intel machine which supports LBR (many recent chips do), and dwarf will be selected otherwise. The lbr, dwarf and fp options correspond to perf's --call-graph argument (see the perf record documentation for more info).
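For reference, these map onto perf roughly as follows; this is a sketch of what perf itself accepts, not necessarily the exact command line magic-trace builds internally:

```
# The perf --call-graph values corresponding to lbr, dwarf and fp.
perf record --call-graph lbr   -- ./my-program
perf record --call-graph dwarf -- ./my-program
perf record --call-graph fp    -- ./my-program
```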

When running with lbr or lbr-no-stitch, magic-trace uses the last branch record (LBR) hardware feature from Intel, which logs branches to dedicated MSRs. Generally this supports callstacks of up to 32 entries, though the exact depth differs by architecture. lbr additionally enables perf's --stitch-lbr, which can increase callstack sizes by around 34%; lbr-no-stitch leaves stitching off. See the perf documentation for more info on LBR and on stitching LBR.
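If you are unsure whether your machine supports LBR callgraphs, one quick (hedged) check is to ask perf directly, using any throwaway workload such as sleep:

```
# If this succeeds, LBR callgraph recording is available on this machine.
perf record --call-graph lbr -- sleep 1
```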

When running with dwarf, magic-trace will use DWARF debugging information to reconstruct callstacks. As long as perf is recent enough to be linked against libunwind or libdw, this should work. The downside is the high overhead of writing the debugging information to perf.data files and the high overhead during decoding: file sizes are roughly 20x larger for the same number of samples, and decoding can take multiple orders of magnitude longer. However, using a recent version of perf speeds this up significantly, so we recommend running with the most recent perf available (we had success using 5.17).
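A hedged way to check what you have installed (the output format varies across distributions):

```
# Print the installed perf version.
perf --version

# List compiled-in features; DWARF unwinding support
# (libunwind or libdw) should appear in this list.
perf version --build-options
```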

When running with fp, magic-trace will use frame pointers from the binary to reconstruct callstacks. This requires your binary to be compiled with -fno-omit-frame-pointer (including all libraries it links against). If that is the case, fp works well, with overhead reasonably similar to lbr. If you get bogus-looking callstacks with fp, we recommend trying another option.
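For example, a minimal sketch of building with frame pointers preserved and then tracing with fp (the compiler, source file and optimization level are placeholders; -fno-omit-frame-pointer is the essential flag):

```
# Build with frame pointers preserved so fp unwinding can follow them.
gcc -O2 -fno-omit-frame-pointer -o my-program my_program.c

# Reconstruct callstacks from frame pointers.
magic-trace run -sampling -callgraph-mode fp ./my-program
```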