Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT #537

zamazan4ik · 2023-10-08T14:42:02Z

Hi!

I recently ran Profile-Guided Optimization (PGO) benchmarks on multiple projects; the results are available here. Based on those results, I think it's worth trying PGO on Ouch. I have already run some benchmarks and want to share the results.

Test environment

Fedora 38
Linux kernel 6.5.5
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler: rustc 1.73
Ouch version: the latest commit on the main branch at the time of writing, dc21932102011da61a85a98f43d9d8d9ab6bd917
Turbo boost disabled

Benchmark setup

For benchmarking, I use the project's own benchmark script: https://github.com/ouch-org/ouch/blob/main/benchmarks/run-benchmarks.sh. The Release build is done with cargo build --release, and the PGO-optimized build is done with cargo-pgo. PGO profiles are collected from the benchmark workload itself.
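
The PGO build steps can be sketched roughly as follows. This is a minimal sketch rather than the exact commands used for the numbers in this issue: it assumes cargo-pgo is installed (cargo install cargo-pgo) together with the llvm-tools-preview rustup component, and the function name pgo_build and the workload command are hypothetical.

```shell
# pgo_build <workload command...>   (hypothetical helper name)
# The workload must exercise the instrumented binary that cargo-pgo
# produces, so that profiles are written while it runs.
pgo_build() {
    # 1. Build an instrumented binary.
    cargo pgo build || return 1
    # 2. Run the caller-supplied workload to collect profiles.
    "$@" || return 1
    # 3. Rebuild, optimizing with the collected profiles.
    cargo pgo optimize
}

# Example call (the workload script name is a placeholder):
# pgo_build ./my-workload.sh
```

Keeping the workload as a parameter makes it easy to collect profiles from a different scenario than the one used for benchmarking.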

All benchmarks are run multiple times, on the same hardware/software setup, with the same background "noise" (as far as I can guarantee, of course).

Results

ouch_release - Release build, ouch_optimized - Release + PGO build.

I got the following results:

./run-benchmarks.sh
Benchmark 1: ./ouch_release compress rust output.tar
  Time (mean ± σ):     781.0 ms ±   3.9 ms    [User: 119.2 ms, System: 649.2 ms]
  Range (min … max):   772.3 ms … 789.9 ms    50 runs

Benchmark 2: ./ouch_optimized compress rust output.tar
  Time (mean ± σ):     759.7 ms ±   7.0 ms    [User: 104.1 ms, System: 643.2 ms]
  Range (min … max):   732.5 ms … 784.5 ms    50 runs

Summary
  ./ouch_optimized compress rust output.tar ran
    1.03 ± 0.01 times faster than ./ouch_release compress rust output.tar
Creating tar archive to benchmark decompression...
Benchmark 1: ./ouch_release decompress input.tar --dir output
  Time (mean ± σ):      3.138 s ±  0.022 s    [User: 0.339 s, System: 2.725 s]
  Range (min … max):    3.103 s …  3.239 s    50 runs

Benchmark 2: ./ouch_optimized decompress input.tar --dir output
  Time (mean ± σ):      3.091 s ±  0.014 s    [User: 0.312 s, System: 2.704 s]
  Range (min … max):    3.063 s …  3.134 s    50 runs

Summary
  ./ouch_optimized decompress input.tar --dir output ran
    1.02 ± 0.01 times faster than ./ouch_release decompress input.tar --dir output
Benchmark 1: ./ouch_release compress compiler output.tar.gz
  Time (mean ± σ):      70.5 ms ±   2.6 ms    [User: 729.9 ms, System: 62.0 ms]
  Range (min … max):    66.5 ms …  79.9 ms    50 runs

Benchmark 2: ./ouch_optimized compress compiler output.tar.gz
  Time (mean ± σ):      68.8 ms ±   2.3 ms    [User: 727.0 ms, System: 62.3 ms]
  Range (min … max):    64.6 ms …  76.3 ms    50 runs

Summary
  ./ouch_optimized compress compiler output.tar.gz ran
    1.02 ± 0.05 times faster than ./ouch_release compress compiler output.tar.gz
Creating tar.gz archive to benchmark decompression...
Benchmark 1: ./ouch_release decompress input.tar.gz --dir output
  Time (mean ± σ):     255.9 ms ±   4.0 ms    [User: 82.4 ms, System: 173.9 ms]
  Range (min … max):   251.7 ms … 273.4 ms    50 runs

Benchmark 2: ./ouch_optimized decompress input.tar.gz --dir output
  Time (mean ± σ):     254.8 ms ±   2.9 ms    [User: 79.2 ms, System: 175.4 ms]
  Range (min … max):   250.6 ms … 263.6 ms    50 runs

Summary
  ./ouch_optimized decompress input.tar.gz --dir output ran
    1.00 ± 0.02 times faster than ./ouch_release decompress input.tar.gz --dir output
Benchmark 1: ./ouch_optimized compress compiler output.zip
  Time (mean ± σ):     523.7 ms ±   1.4 ms    [User: 474.3 ms, System: 46.8 ms]
  Range (min … max):   521.4 ms … 530.8 ms    50 runs

Benchmark 2: ./ouch_release compress compiler output.zip
  Time (mean ± σ):     527.0 ms ±   2.5 ms    [User: 479.2 ms, System: 45.1 ms]
  Range (min … max):   524.2 ms … 535.9 ms    50 runs

Summary
  ./ouch_optimized compress compiler output.zip ran
    1.01 ± 0.01 times faster than ./ouch_release compress compiler output.zip
Creating zip archive to benchmark decompression...
Benchmark 1: ./ouch_release decompress input.zip --dir output
  Time (mean ± σ):     241.0 ms ±   2.0 ms    [User: 84.2 ms, System: 157.6 ms]
  Range (min … max):   238.7 ms … 249.3 ms    50 runs

Benchmark 2: ./ouch_optimized decompress input.zip --dir output
  Time (mean ± σ):     243.5 ms ±   3.1 ms    [User: 84.6 ms, System: 158.6 ms]
  Range (min … max):   236.7 ms … 253.0 ms    50 runs

Summary
  ./ouch_release decompress input.zip --dir output ran
    1.01 ± 0.02 times faster than ./ouch_optimized decompress input.zip --dir output

check results at results.md

According to the tests, PGO gives improvements of roughly 1-3% in most of these benchmarks, while the tar.gz and zip decompression results are within measurement noise.
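
As a quick sanity check, the speedup factor in each hyperfine summary is the ratio of the two mean times. A small sketch with awk, using the means from the tar runs above; the helper name speedup is hypothetical:

```shell
# speedup <baseline_mean> <optimized_mean>
# Prints the ratio of the mean wall-clock times, rounded to two decimals.
speedup() {
    awk -v base="$1" -v opt="$2" 'BEGIN { printf "%.2f\n", base / opt }'
}

speedup 781.0 759.7   # tar compression, means in ms -> prints 1.03
speedup 3.138 3.091   # tar decompression, means in s -> prints 1.02
```

Both arguments must be in the same unit, since only the ratio matters.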

Further steps

I can suggest the following things to do:

Evaluate PGO's applicability to Ouch in more scenarios.
If PGO helps to achieve better performance, add a note about it to Ouch's documentation (probably somewhere in the README file). That way, users and maintainers will be aware of another optimization opportunity for Ouch.
Provide PGO integration in the build scripts. It can help users and maintainers easily apply PGO to their own workloads.
Optimize prebuilt binaries with PGO.
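
For build-script integration without an extra cargo plugin, the underlying rustc flags can be driven directly. A rough sketch, assuming llvm-profdata is on PATH and matches the LLVM version used by rustc (for example, the one shipped with the llvm-tools-preview rustup component); the function name pgo_manual, the profile directory, and the workload command are hypothetical:

```shell
# pgo_manual <absolute_profile_dir> <workload command...>   (hypothetical helper)
pgo_manual() {
    profile_dir="$1"
    shift
    # 1. Instrumented build: the binary writes .profraw files into profile_dir.
    RUSTFLAGS="-Cprofile-generate=$profile_dir" cargo build --release || return 1
    # 2. Run the caller-supplied workload against the instrumented binary.
    "$@" || return 1
    # 3. Merge the raw profiles into a single .profdata file.
    llvm-profdata merge -o "$profile_dir/merged.profdata" "$profile_dir" || return 1
    # 4. Optimized build using the merged profile.
    RUSTFLAGS="-Cprofile-use=$profile_dir/merged.profdata" cargo build --release
}

# Example call (directory and workload script are placeholders):
# pgo_manual /tmp/ouch-pgo ./my-workload.sh
```

An absolute profile directory avoids surprises when the workload changes the working directory.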

Here are some examples of how PGO is already integrated into other projects' build scripts:

Rustc: a CI script for the multi-stage build
GCC:
- Official docs, section "Building with profile feedback" (even AutoFDO build is supported)
- A part in a "wonderful" configure script
Clang: Docs
Python:
- CPython: README
- Pyston: README
Go: Bash script
V8: Bazel flag
ChakraCore: Scripts
Chromium: Script
Firefox: Docs
- Thunderbird has PGO support too
PHP: Makefile command and old Centminmod scripts
MySQL: CMake script
YugabyteDB: GitHub commit
FoundationDB: Script
Zstd: Makefile
Foot: Scripts
Windows Terminal: GitHub PR
Pydantic-core: GitHub PR

I can also suggest evaluating LLVM BOLT as an additional optimization step after PGO.
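
A combined PGO + BOLT sequence with cargo-pgo might look like the sketch below. It assumes cargo-pgo's BOLT subcommands and that llvm-bolt and merge-fdata are installed; the function name pgo_bolt_build and the two workload scripts are hypothetical, and this flow has not been benchmarked in this issue.

```shell
# pgo_bolt_build <pgo_workload_script> <bolt_workload_script>   (hypothetical helper)
# Each script must run the instrumented binary produced by the preceding step.
pgo_bolt_build() {
    # 1. Build a PGO-instrumented binary and gather PGO profiles.
    cargo pgo build || return 1
    sh "$1" || return 1
    # 2. Build a BOLT-instrumented binary that already uses the PGO profiles,
    #    then gather BOLT profiles.
    cargo pgo bolt build --with-pgo || return 1
    sh "$2" || return 1
    # 3. Produce the final binary optimized with both PGO and BOLT.
    cargo pgo bolt optimize --with-pgo
}
```

Two separate workload scripts are used because the PGO-instrumented and BOLT-instrumented binaries are different files.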

zamazan4ik added the enhancement New feature or request label Oct 8, 2023
