Investigate performance impact of compiler flags #10608

ggwpez · 2022-01-07T15:47:35Z

Try out different compiler flags and check with the benchmarks for any substantial performance increase.
The LTO flag seems promising after being discussed and later on used in Polkadot: paritytech/polkadot#4311.

If you specifically know more flags that could help: please post them here.

koute · 2022-01-24T11:33:33Z

There's also the -C target-cpu flag (mentioned in paritytech/polkadot#4311); by default AFAIK it's set to x86-64 (run rustc -C target-cpu=help for a full list), and I think we could use the x86-64-v2 by default as CPUs which are a decade old should support it at this point.

The feature levels themselves are described here:

https://gitlab.com/x86-psABIs/x86-64-ABI/-/blob/master/x86-64-ABI/low-level-sys-info.tex

ggwpez · 2022-01-24T12:25:02Z

There's also the -C target-cpu flag (mentioned in paritytech/polkadot#4311); by default AFAIK it's set to x86-64 (run rustc -C target-cpu=help for a full list), and I think we could use the x86-64-v2 by default as CPUs which are a decade old should support it at this point.

Thanks for the idea!
This was investigated as well for wasmi wasmi-labs/wasmi#339 (comment)
I also tested it in Substrate on reference hardware, with no performance increase.

I'm not sure if it makes sense to specify a concrete value (besides native) for it, since we are only concerned about the performance on reference hardware.

koute · 2022-01-24T13:09:38Z

That's good to know, thanks for checking!

I was wondering about the actual performance difference since the Moonbeam guys wrote that it gave them "good results"; it's a bummer you've found no increase. (Although maybe it just makes a difference when a significantly newer target is specified, instead of just slightly bumping the baseline?)

ggwpez · 2022-01-24T13:17:32Z

Although maybe it just makes a difference when a significantly newer target is specified, instead of just slightly bumping the baseline?

The reference hardware is a i7-7700K from 2017, so compiling on newer CPU should already make it faster than the weights that we have specified.
I will still put this in my backlog to take another look, just to ensure that we get maximum performance.
PS: We are looking to upgrading the ref hardware soon anyway, so everything here would change again 🙈

ggwpez · 2022-02-02T14:59:15Z

#10777 Talks about target-cpu again.
As it turns out: we currently do not have the tooling in place to actually test this.
I just ran the benchmarks and compared the weights, but these are only the computational complexity minus the DB operations.
So if the target-cpu thing turns out to improve DB performance, I will test it.
Better tooling for this is on the way.

koute · 2022-02-04T13:31:30Z

Could we perhaps use the export-blocks subcommand and then do an import-blocks to indirectly also test the DB performance? (Not sure whether you've already tried it nor if it'd actually be useful; just throwing the idea out.)

ggwpez · 2022-02-04T14:58:01Z

Could we perhaps use the export-blocks subcommand and then do an import-blocks to indirectly also test the DB performance? (Not sure whether you've already tried it nor if it'd actually be useful; just throwing the idea out.)

Yea I actually tried that here 😄
Currently I am trying to reproduce the setup that came up with these hardcoded weights.
Then I can re-run it and get proper values for DB-Read/Write, BlockBase- and ExtrinsicBaseFees (and compare it with target-cpu).

PS: I did not run the import with different profiles, so that could be done. But it takes quite long to run...

ggwpez · 2022-05-10T07:27:13Z

Does not seem to make a difference as shown here: #10777 (comment)
Closing until we find new compiler flags that we want to test.

MathCryptoDoc · 2022-08-27T12:44:20Z

I wanted to comment on this issue since it seems that the production profile that uses opt-level=3, lto=fat, codegen-units=1 is not always that good when compiling for the native arch. See the summary and more details are in these Python notebooks where I benchmarked basically all possible options. Quite surprisingly, opt-level=2 seems to be better.

The scores uses the benchmarks for BLAKE2 and SR25519, and the timing of the remark extrinsic. Are there other good extrinsics that can be benchmarked via the polkadot binary?

nazar-pc · 2022-08-27T12:51:52Z

If you're compiling for native then you can tweak other flags too. Default builds are not tuned for native.

MathCryptoDoc · 2022-08-27T12:57:03Z

Ah, didn't know this was meant for only default build. Which flags would you use for tuning native then?

nazar-pc · 2022-08-27T13:56:01Z

I meant I believe flags in this repo don't impact compilation flags in project that use it as dependency. It is up to you to have whatever flags you want in your workspace's Cargo.toml or whatnot.

ggwpez · 2022-08-29T09:09:29Z

I wanted to comment on this issue since it seems that the production profile that uses opt-level=3, lto=fat, codegen-units=1 is not always that good when compiling for the native arch.

Yes possibly. But there have to be some settings for a production build and we cant optimize for every CPU.
I think the default options are reasonable? We tried to pick options to let the compiler do all optimizations it can.

Throwing away builds that do not improve upon the official binary, we have 15, 21, 38, 40, 45 as the good builds. They all build with `target-cpu=native``.

So the default release binary should be good enough for most cases. High performance users have to build themselves anyway.

Are there other good extrinsics that can be benchmarked via the polkadot binary?

It prints all available ones if you call it with the --list flag, but currently there is only one other: transfer_keep_alive.
You can also take a look at the other benchmark commands: README.

MathCryptoDoc · 2022-08-29T10:11:07Z

Yes possibly. But there have to be some settings for a production build and we cant optimize for every CPU. I think the default options are reasonable? We tried to pick options to let the compiler do all optimizations it can.

Sure, I agree for a generic production profile, the current settings make a lot of sense of course. I thought this issue was meant to investigate performance of compiler flags in general, including specialised builds. My mistake...

It prints all available ones if you call it with the --list flag, but currently there is only one other: transfer_keep_alive. You can also take a look at the other benchmark commands: README.

Thanks, that convenient. Testing all that will take a long time, but might be interesting.

ggwpez · 2022-09-02T19:21:42Z

Sure, I agree for a generic production profile, the current settings make a lot of sense of course. I thought this issue was meant to investigate performance of compiler flags in general, including specialised builds. My mistake...

We investigated the impact of the flags in order to find good default settings.
Writing some kind of blog article (like you are doing ❤️) is better for more advanced insight.

ggwpez added I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. U3-nice_to_have Issue is worth doing eventually. Z2-medium Can be fixed by a coder with good Rust knowledge but little knowledge of the codebase. labels Jan 7, 2022

ggwpez self-assigned this Jan 7, 2022

ggwpez added this to Backlog in Runtime via automation Jan 7, 2022

ggwpez mentioned this issue Jan 11, 2022

Optimize validator binaries with LTO paritytech/polkadot#4311

Closed

This was referenced Jan 18, 2022

Investigate: dependency on lto="fat" and codegen-units=1 wasmi-labs/wasmi#339

Closed

Optimized weights #10692

Merged

Kailai-Wang mentioned this issue Jan 25, 2022

Differentiation between different cargo profiles litentry/litentry-parachain#326

Closed

ghzlatarev mentioned this issue Mar 2, 2022

[Trial]Enable LTO for wasm build Manta-Network/Manta#346

Closed

ggwpez closed this as completed May 10, 2022

Runtime automation moved this from Backlog to Done May 10, 2022

shawntabrizi moved this from Done to Archive in Runtime May 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate performance impact of compiler flags #10608

Investigate performance impact of compiler flags #10608

ggwpez commented Jan 7, 2022 •

edited

Loading

koute commented Jan 24, 2022

ggwpez commented Jan 24, 2022

koute commented Jan 24, 2022

ggwpez commented Jan 24, 2022

ggwpez commented Feb 2, 2022

koute commented Feb 4, 2022

ggwpez commented Feb 4, 2022 •

edited

Loading

ggwpez commented May 10, 2022

MathCryptoDoc commented Aug 27, 2022 •

edited

Loading

nazar-pc commented Aug 27, 2022

MathCryptoDoc commented Aug 27, 2022

nazar-pc commented Aug 27, 2022

ggwpez commented Aug 29, 2022 •

edited

Loading

MathCryptoDoc commented Aug 29, 2022

ggwpez commented Sep 2, 2022

Investigate performance impact of compiler flags #10608

Investigate performance impact of compiler flags #10608

Comments

ggwpez commented Jan 7, 2022 • edited Loading

koute commented Jan 24, 2022

ggwpez commented Jan 24, 2022

koute commented Jan 24, 2022

ggwpez commented Jan 24, 2022

ggwpez commented Feb 2, 2022

koute commented Feb 4, 2022

ggwpez commented Feb 4, 2022 • edited Loading

ggwpez commented May 10, 2022

MathCryptoDoc commented Aug 27, 2022 • edited Loading

nazar-pc commented Aug 27, 2022

MathCryptoDoc commented Aug 27, 2022

nazar-pc commented Aug 27, 2022

ggwpez commented Aug 29, 2022 • edited Loading

MathCryptoDoc commented Aug 29, 2022

ggwpez commented Sep 2, 2022

ggwpez commented Jan 7, 2022 •

edited

Loading

ggwpez commented Feb 4, 2022 •

edited

Loading

MathCryptoDoc commented Aug 27, 2022 •

edited

Loading

ggwpez commented Aug 29, 2022 •

edited

Loading