Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PGO optimized binnary #6943

Closed
FilipAndersson245 opened this issue Oct 29, 2022 · 11 comments
Closed

PGO optimized binnary #6943

FilipAndersson245 opened this issue Oct 29, 2022 · 11 comments
Labels
enhancement New feature or request

Comments

@FilipAndersson245
Copy link
Contributor

FilipAndersson245 commented Oct 29, 2022

Related problem

We would always strive to improve performance, as it improves the user experience if the program has a low latency and predictability.

Describe the solution you'd like

I would suggest we explore if it's possible to integrate Profile-Guided-Optimization (PGO).
PGO builds the binary twice, the first time with intrinsics, and runs it on some set of example input to gather statistics on how the functions are called. this data is then used to optimize a second build enabling future performance, as it can how the program is executed during runtime.
things we need to consider if this is to be implemented:

  1. Collect a representative of data, representing common use of the shell, I believe this should not be too difficult, by running some general set of benchmarks / common commands.
  2. Setup a CI infrastructure to do this for our 3week releases.runs

Describe alternatives you've considered

No response

Additional context and details

https://github.com/Kobzol/cargo-pgo

@FilipAndersson245 FilipAndersson245 added the enhancement New feature or request label Oct 29, 2022
@fdncred
Copy link
Collaborator

fdncred commented Oct 29, 2022

How do you do a PGO build with cargo? I'd like to try it.

EDIT: I see this now. Looks like a pain. I wonder if it would be worth it?
https://doc.rust-lang.org/rustc/profile-guided-optimization.html

@FilipAndersson245
Copy link
Contributor Author

FilipAndersson245 commented Oct 29, 2022

How do you do a PGO build with cargo? I'd like to try it.

https://github.com/Kobzol/cargo-pgo
I managed to run PGO locally using this cargo extension. the extension manages alot of the more anoying parts.

EDIT: after installing it it seem you can do the following

  • cargo pgo build
  • cargo pgo run # Gather statistics
  • cargo pgo optimize # Create the new optimized binary.

@fdncred
Copy link
Collaborator

fdncred commented Oct 29, 2022

Ugh - I get this on Mac.

> cargo pgo run 
[2022-10-29T16:18:11Z INFO  cargo_pgo::pgo::instrument] PGO profile directory will be cleared.
[2022-10-29T16:18:11Z INFO  cargo_pgo::pgo::instrument] PGO profiles will be stored into /Users/fdncred/src/forks/nushell/target/pgo-profiles.
    Finished release [optimized] target(s) in 0.29s
[2022-10-29T16:18:11Z INFO  cargo_pgo::pgo::instrument] PGO instrumentation build finished successfully.
     Running `target/aarch64-apple-darwin/release/nu`
Error:
  × Nushell launched as a REPL, but STDIN is not a TTY; either launch in a
  │ valid terminal or provide arguments to invoke a script!

Cargo finished with an error (1)

@FilipAndersson245
Copy link
Contributor Author

FilipAndersson245 commented Oct 29, 2022

I ran nushell with nu -c 'source ...' and i belive it worked

@rgwood
Copy link
Contributor

rgwood commented Oct 29, 2022

I don't think this is going to be worth the effort or the impact to compile times. But if someone wants to give it a try and report back, feel free.

@FilipAndersson245
Copy link
Contributor Author

I managed to run it locally on some new data, and the runtime performance improvements are quite low, it gives a small to medium improvement to binary size, but have a large impact on compile time, as it has to compile multiple release builds.

@rgwood
Copy link
Contributor

rgwood commented Mar 1, 2023

Thanks for confirming that. Closing this as it sounds like the tradeoff is not worth it at this point in time.

@rgwood rgwood closed this as not planned Won't fix, can't repro, duplicate, stale Mar 1, 2023
@zamazan4ik
Copy link

zamazan4ik commented Jun 14, 2023

I managed to run it locally on some new data, and the runtime performance improvements are quite low, it gives a small to medium improvement to binary size, but have a large impact on compile time, as it has to compile multiple release builds.

@FilipAndersson245 Could you share please how did you perform the training (profiling) and benchmarking phases? On which scenario, how long, etc. And on these scenarios - what exactly numbers did you get? At least just for history :)

Thanks in advance!

@FilipAndersson245
Copy link
Contributor Author

@zamazan4ik scary, I was just looking into PGO again in nushell, I will give some info tomorow, have to sleep now.

@zamazan4ik
Copy link

zamazan4ik commented Jun 14, 2023

I performed some benchmarks. Since I have no understanding about Nushell internals I used what I found - benches. So I trained and evaluated PGO for Nushell on the available benches (cargo pgo bench + cargo pgo optimize bench).

My setup is Apple Macbook Pro M1 14. I tested the following configurations:

  • Default Release build (size-optimized)
  • PGO for size-optimized release build
  • Release with opt-level = 3. Everything else is the same
  • (just for history) Instrumentation mode so you could estimate how instrumentation affects the benchmarks performance

Results are here: https://pastebin.com/KkuhyQ2F

At least according to these benchmarks - PGO helps a lot with performance. How does it influence Nu shell performance in general - I do not know. But my guess (or my hope) is if you have these benchmarks - they are important for you.

Regarding multi-stage builds... You could at least leave PGO scripts in the directory and describe them somewhere in the documentation PGO results. And then people/maintainers could choose on their own - do they want to use PGO in their use-case or not.

More about multiple other PGO applications (and other PGO ways beyond usual PGO like CSPGO (Context-Sensitive PGO) or BOLT you could find here: https://github.com/ZaMaZaN4iK/awesome-pgo

@FilipAndersson245
Copy link
Contributor Author

FilipAndersson245 commented Jun 15, 2023

I was unable to find the specific script I tested on but I will add my take to do PGO
I think we need to use as real as possible data to ensure the optimization capture real performance improvements, it is also misleading to run PGO on the existing benchmarks as it would make the result biased towards the benchmark, skewing the results.

I just now did a quick test comparing the standard nu with nu-pgo where i only do a run of the binnary,

cargo pgo run --  -- "-c 'exit'"
cargo pgo optimize

Running just this gives a faster starup time

Command Mean [ms] Min [ms] Max [ms] Relative
./nu -c 'exit' 17.6 ± 1.2 16.4 29.9 1.15 ± 0.12
./nu-pgo -c 'exit' 15.3 ± 1.1 14.1 24.7 1.00

comparing it with gradient_benchmark_no_check it is less impressive

Command Mean [ms] Min [ms] Max [ms] Relative
./nu -c 'source gradient_benchmark_no_check.nu; exit' 462.1 ± 20.4 442.3 524.6 1.03 ± 0.06
./nu-pgo -c 'source gradient_benchmark_no_check.nu; exit' 447.6 ± 16.4 424.0 476.0 1.00

As pgo have the potential to even slow down the runtime if the data trained on don`t represent actually use cases.
I will try to compile down some script using community code to see if this can be improved, but the initial impression of just using this seems decent to good.

edit: Both those where also run using codegen-units=1 but this would probably not affect the relative difference between the two.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants