Can't Process Perf File #144

Open
bwendling opened this issue Sep 26, 2022 · 17 comments

Comments

@bwendling
Member

I have a perf file that autofdo doesn't seem to be able to do anything with. Is autofdo not able to handle "use_lbr"?

$ ~/autofdo/build/create_llvm_prof --profile vmlinux.tcp_rr.not-instrumented.perf --binary ../pkgs/boot/vmlinux-4.15.0-smp-922.38.999.999 --out vmlinux.tcp_rr.not-instrumented.profraw
[INFO:/usr/local/google/home/morbo/autofdo/third_party/perf_data_converter/src/quipper/perf_reader.cc:1058] Number of events stored: 61392
[INFO:/usr/local/google/home/morbo/autofdo/third_party/perf_data_converter/src/quipper/perf_parser.cc:274] Parser processed: 1142 MMAP/MMAP2 events, 3148 COMM events, 3219 FORK events, 305 EXIT events, 52726 SAMPLE events, 52473 of these were mapped, 0 SAMPLE events with a data address, 0 of these were mapped
WARNING: Logging before InitGoogleLogging() is written to STDERR
W20220926 22:25:58.449682 2695262 profile.cc:105] use_lbr was enabled but range_count_map was empty!
W20220926 22:25:58.450127 2695262 profile.cc:105] use_lbr was enabled but range_count_map was empty!
W20220926 22:25:58.450407 2695262 profile.cc:105] use_lbr was enabled but range_count_map was empty!
W20220926 22:25:58.459457 2695262 profile.cc:105] use_lbr was enabled but range_count_map was empty!
W20220926 22:25:58.465374 2695262 profile.cc:105] use_lbr was enabled but range_count_map was empty!
W20220926 22:25:58.472229 2695262 profile.cc:105] use_lbr was enabled but range_count_map was empty!
W20220926 22:25:58.474949 2695262 profile.cc:105] use_lbr was enabled but range_count_map was empty!
W20220926 22:25:58.476979 2695262 profile.cc:105] use_lbr was enabled but range_count_map was empty!
W20220926 22:25:58.477377 2695262 profile.cc:105] use_lbr was enabled but range_count_map was empty!
W20220926 22:25:58.484407 2695262 profile.cc:105] use_lbr was enabled but range_count_map was empty!
W20220926 22:25:58.485718 2695262 profile.cc:105] use_lbr was enabled but range_count_map was empty!
W20220926 22:25:58.487694 2695262 profile.cc:105] use_lbr was enabled but range_count_map was empty!
W20220926 22:25:58.489140 2695262 profile.cc:105] use_lbr was enabled but range_count_map was empty!
W20220926 22:25:58.492909 2695262 profile.cc:105] use_lbr was enabled but range_count_map was empty!
 ...

vmlinux.tcp_rr.not-instrumented.perf.gz

@bwendling
Member Author

This was the command line used:

Collecting remote profile via perf_events.

remotely executing on oqft10:3988:
perf record -a -c 10000019 -e cycles --call-graph lbr -i -m 16 -N -o - sleep 30
[ perf record: Woken up 354 times to write data ]
[ perf record: Captured and wrote 0.000 MB - ]
Output of '/home/build/nonconf/static/projects/perf/perf report --vmlinux pkgs/boot/vmlinux-4.15.0-smp-922.38.999.999  -n -i merged/vmlinux.tcp_rr.not-instrumented.perf' stored in /dev/null
*********************************************************
remotely executed on oqft10:3988:
perf record -a -c 10000019 -e cycles --call-graph lbr -i -m 16 -N -o - sleep 30

analysis commands need to take a normal profile.
displayed profile with:
/home/build/nonconf/static/projects/perf/perf report --vmlinux pkgs/boot/vmlinux-4.15.0-smp-922.38.999.999  -n -i merged/vmlinux.tcp_rr.not-instrumented.perf

@shenhanc78
Collaborator

Thanks for reporting. I'll take a look.

@bwendling
Member Author

Thanks! Let me know if you need more information.

@nickdesaulniers
Member

See also: https://lore.kernel.org/lkml/CAHk-=whqCT0BeqBQhW8D-YoLLgp_eFY=8Y=9ieREM5xx0ef08w@mail.gmail.com/

This is blocking our ability to use AutoFDO to improve the performance of the Linux kernel.

@shenhanc78
Collaborator

Could you try the following perf command instead?

perf record -a -c 10000019 -e cycles -b -i -m 16 -N -o - sleep 30

That is, replace "--call-graph lbr" in the original command with "-b". The latter explicitly instructs perf to record LBR branch records, whereas the former records a call graph that is derived from the LBR.

After that, you may check if the perf.data file contains any lbr records via:

perf script -Fpid,brstack -i perf.data

This should dump tons of lbr records, one snapshot per line.
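
The check above can also be scripted. The following is a minimal sketch, assuming the output of the perf script command has been saved as text and that each branch record shows up as a token starting with `0x<from>/0x<to>/`. The helper name `count_branch_records` and the sample lines are made up for illustration and are not taken from the attached profile.

```python
import re

# Assumed token shape for one branch record in brstack output:
# 0x<from>/0x<to>/<flags...>. Only the two addresses are matched.
BRANCH_RE = re.compile(r"0x[0-9a-fA-F]+/0x[0-9a-fA-F]+/")


def count_branch_records(text):
    """Return (snapshots_with_branches, total_branch_records) for perf script text."""
    snapshots = 0
    records = 0
    for line in text.splitlines():
        n = len(BRANCH_RE.findall(line))
        if n:
            snapshots += 1
            records += n
    return snapshots, records


# Hypothetical sample: two snapshots with branch records, one line without any.
sample = """\
 1234 0xffffffff81001000/0xffffffff81002000/P/-/-/0 0xffffffff81003000/0xffffffff81004000/P/-/-/0
 1234
 5678 0xffffffff81005000/0xffffffff81006000/M/-/-/0
"""

print(count_branch_records(sample))
```

If the record count comes out as zero for a real perf.data file, the file most likely holds no branch stacks, which would be consistent with the "range_count_map was empty" warning reported above.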

Let's see if that helps. I'll follow up on this tomorrow.

@bwendling
Member Author

I'll give it a shot. I tried it with perf record -a -c 128 -e cycles -g -i -m 16 -N -o - sleep 30, which was suggested as the "best" way to measure the branches by someone internal, but to no avail:

$ create_llvm_prof --profile vmlinux.tcp_rr.not-instrumented.perf --binary ../pkgs/boot/vmlinux-4.15.0-smp-926.42.999.999 --out vmlinux.tcp_rr.not-instrumented.profraw
[INFO:/usr/local/google/home/morbo/autofdo/third_party/perf_data_converter/src/quipper/perf_reader.cc:1058] Number of events stored: 8857
[INFO:/usr/local/google/home/morbo/autofdo/third_party/perf_data_converter/src/quipper/perf_parser.cc:274] Parser processed: 1101 MMAP/MMAP2 events, 3445 COMM events, 3519 FORK events, 659 EXIT events, 129 SAMPLE events, 129 of these were mapped, 0 SAMPLE events with a data address, 0 of these were mapped
WARNING: Logging before InitGoogleLogging() is written to STDERR
W20221006 17:56:30.677920 4026801 profile.cc:105] use_lbr was enabled but range_count_map was empty!

vmlinux.tcp_rr.not-instrumented.perf.gz

@shenhanc78
Collaborator

(Just to mention: "-c 128" seems a little intrusive, and we usually use a prime number (e.g., 10007) for "-c" so that sampling is not biased toward a particular address inside a loop. Also, the autofdo tool uses the LBR data in perf.data, and LBR data are only recorded when "-b" (an alias for "-j any") is given. When the autofdo tool finds no LBR data in the profile, it cannot proceed.)
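
The point about prime sample periods can be illustrated with a toy model. This is a sketch under strong assumptions: a hypothetical loop whose body generates exactly 8 counted events per iteration, with a sample taken every `period` events and no skid. Real hardware behaves less regularly, and `sampled_positions` is an illustrative name, not part of perf or autofdo.

```python
def sampled_positions(period, loop_length, num_samples):
    """Positions inside a loop body hit when sampling every `period` events."""
    return {(k * period) % loop_length for k in range(1, num_samples + 1)}


loop_length = 8  # hypothetical loop body: 8 counted events per iteration

# A period that is a multiple of the loop length always lands on the same spot.
print(sorted(sampled_positions(128, loop_length, 1000)))

# A prime period shares no factor with the loop length, so samples spread out.
print(sorted(sampled_positions(10007, loop_length, 1000)))
```

In this model the period 128 only ever samples one position of the loop, while 10007 reaches every position.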

@bwendling
Member Author

Ah! Okay. I got a profraw file now with the -b option. Thanks.

Is the LBR data you mention here the same for chips that don't have LBR, like ARM?

@shenhanc78
Collaborator

Good to know you got a profraw file.

As for LBR, it is only available on Intel architectures. Skylake and later generations have 32-deep LBR records (meaning each snapshot contains the last 32 consecutive branch records); Haswell has only 16. AMD and ARM architectures do not support LBR for now, so "-b" will probably give an error on those machines.
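
To make the snapshot idea concrete, here is a simplified sketch of how consecutive branch records can be turned into executed address ranges. It is an illustration only and not the autofdo implementation: the function name, the (from, to) tuple layout, and the addresses are assumptions, and the only link to the tool is the `range_count_map` name that appears in the warning earlier in this thread.

```python
from collections import Counter


def ranges_from_snapshot(branches):
    """Derive straight-line address ranges from one branch snapshot.

    `branches` is a list of (from_addr, to_addr) pairs, most recent first.
    Between the target of an older branch and the source of the next branch
    taken, execution is assumed to have run straight through.
    """
    ranges = Counter()
    for newer, older in zip(branches, branches[1:]):
        start = older[1]  # where the older branch landed
        end = newer[0]    # where the next branch was taken from
        if start <= end:  # skip pairs that do not form a forward range
            ranges[(start, end)] += 1
    return ranges


# Hypothetical snapshot with three branch records, most recent first.
snapshot = [(0x130, 0x200), (0x110, 0x120), (0x50, 0x100)]
for (start, end), count in sorted(ranges_from_snapshot(snapshot).items()):
    print(hex(start), hex(end), count)
```

A snapshot with N branch records yields at most N - 1 ranges in this scheme, and a sample with no branch records yields none, so a profile without any LBR data leaves such a map empty.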

LBR data reflect the code paths taken. For AMD and Intel machines, I believe the code paths are the same most of the time, so binaries optimized with a profraw collected on Intel should see a similar performance boost on both Intel and AMD machines.

For ARM, however, the code paths may differ (some libraries have different versions for x86_64 and ARM). In that case, if we use a profraw collected on x86_64 to optimize code that is to run on ARM, we might get a regression.

(We are currently exploring LBR-like perf data on ARM machines, but we are not quite there yet.)

@shenhanc78
Collaborator

Closing this. (Please reopen if you have any further questions.)

@bwendling
Member Author

I did get a profraw file, but it wasn't very useful. This was part of the profile summary. The total number of functions is very low.

==== profile summary ====
Total functions: 878
Maximum function count: 1378
Maximum block count: 1294
Total number of blocks: 31259
Total count: 661716
Detailed summary:
7 blocks with count >= 1200 account for 1 percentage of the total counts.
93 blocks with count >= 596 account for 10 percentage of the total counts.
264 blocks with count >= 260 account for 20 percentage of the total counts.
607 blocks with count >= 149 account for 30 percentage of the total counts.
1121 blocks with count >= 115 account for 40 percentage of the total counts.
1746 blocks with count >= 94 account for 50 percentage of the total counts.
2635 blocks with count >= 71 account for 60 percentage of the total counts.
3579 blocks with count >= 61 account for 70 percentage of the total counts.
4786 blocks with count >= 50 account for 80 percentage of the total counts.
6348 blocks with count >= 39 account for 90 percentage of the total counts.
7244 blocks with count >= 31 account for 95 percentage of the total counts.
8766 blocks with count >= 5 account for 99 percentage of the total counts.
12691 blocks with count >= 1 account for 99.9 percentage of the total counts.
12691 blocks with count >= 1 account for 99.99 percentage of the total counts.
12691 blocks with count >= 1 account for 99.999 percentage of the total counts.
12691 blocks with count >= 1 account for 99.9999 percentage of the total counts.

@bwendling
Member Author

Please reopen this. (I don't have the ability to do that.)

@shenhanc78 shenhanc78 reopened this Nov 15, 2022
@shenhanc78
Collaborator

A tangential note: AMD has added BRS (the counterpart of Intel LBR) support to its Fam19h Model 01h CPUs, and current toolchains support it seamlessly. We've also done an evaluation on AMD BRS, and the performance numbers are on par.

As for the profile, the number of functions that have counters is too small. Usually we either lower "-c" or profile for a longer period. It is also crucial to use load tests that "saturate" kernel functionality.
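
As a rough aid for choosing those two knobs, the sketch below scales a measured sample count to a different "-c" value and duration. It assumes a simple linear model in which the workload produces events at a constant rate. The `-c 128` run earlier in this thread recorded only 129 SAMPLE events, so very small periods clearly do not follow this model. The function name and the new period and duration are hypothetical.

```python
def estimated_samples(base_samples, base_period, base_seconds, new_period, new_seconds):
    """Scale a measured sample count to a new period and duration (linear model)."""
    events_per_second = base_samples * base_period / base_seconds
    return events_per_second * new_seconds / new_period


# Baseline from the first run in this thread: 52726 SAMPLE events with
# "-c 10000019" over "sleep 30". The new period and duration are examples only.
print(round(estimated_samples(52726, 10000019, 30, 1000003, 60)))
```

The estimate only indicates how many samples to expect. Whether those samples cover more functions still depends on the load exercising more of the kernel.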

@nickdesaulniers
Member

See also #138

@bage613

bage613 commented Aug 14, 2023

> A tangential note: AMD has added BRS (the counterpart of Intel LBR) support to its Fam19h Model 01h CPUs, and current toolchains support it seamlessly. We've also done an evaluation on AMD BRS, and the performance numbers are on par.
>
> As for the profile, the number of functions that have counters is too small. Usually we either lower "-c" or profile for a longer period. It is also crucial to use load tests that "saturate" kernel functionality.

@shenhanc78 Hi, can you share more details? For example, do you use the same commands, such as "perf record -b ./sort"?
I'm evaluating AutoFDO on an AMD platform. Right now I use the same commands and get many unsupported-event errors. Thanks! Looking forward to getting more info from you.
image

@fwyzard

fwyzard commented Sep 3, 2023

@bage613 @shenhanc78 I'd also be interested if there is a way to use AutoFDO on AMD Milan or Genoa.

@shenhanc78
Collaborator

@bage613 is now able to collect raw perf data, convert it to an AutoFDO profile, and see some performance improvement on one of his benchmarks. He is now trying to do the same for an open source server and reproduce the improvement. This was done on Zen3 (some models) and Zen4 CPUs.

He will share more if he sees wins for the open source server.
