Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for supercop ASM in wallet, and benchmark for supercop [ Do Not Merge] #6337

wants to merge 1 commit into
base: master


Copy link

vtnerd commented Feb 12, 2020

Sorry for the do not merge tag again. But this requires a PR in another repo.

This is a working example of the amd64 ASM from supercop to speedup wallet scanning. The speedup is less now, presumably due to slow downs in other areas of the wallet. Scanning from block 1 to 2031549 (mainnet) drops from 5.8 minutes to 4.14 minutes (~28.6%) on my test machine. The benchmark utility (see below) shows a crypto speedup of ~146% with two output transactions.

The test machine is a ryzen 3900x / 32 GiB PC3200 RAM / 970pro nvme / x570 desktop. The daemon and wallet were on the same box. The first run for each crypto implementation was tossed in an effort to "normalize" the file cache. The disk is encrypted, so LMDB hitting disk has an additional hit.

This uses namespace aliases, and therefore has ZERO overhead when disabled. Not even a hit to the executable size. When this is enabled, everything using the "device" library will increase in size due to the extra crypto code. This is unfortunate.

This the output from the benchmark utility included with this PR. It can "sideload" multiple crypto implementations to test the relative performance.

Running benchmark using 1000 iterations
Transaction Component Benchmarks
+ derive_subaddress_public_key step 14248881 ns (+0%)
|-+ standard 14248881 ns (+0%)
| |----> amd64_64_24k => 14248881 ns (+0%)
| |----> amd64_51_30k => 15410920 ns (+8.1553%)
| |--------------> cn => 34357061 ns (+141.121%)
+ generate_key_derivation step 37039500 ns (+159.947%)
|-+ standard 37039500 ns (+0%)
| |----> amd64_64_24k => 37039500 ns (+0%)
| |----> amd64_51_30k => 40765969 ns (+10.0608%)
| |--------------> cn => 93774626 ns (+153.175%)
Transaction Benchmarks
+ Transactions with 2 outputs 65504972 ns (+0%)
|-+ standard 65504972 ns (+0%)
| |----> amd64_64_24k =>  65504972 ns (+0%)
| |----> amd64_51_30k =>  71416899 ns (+9.02516%)
| |--------------> cn => 161385237 ns (+146.371%)
+ Transactions with 4 outputs 93948623 ns (+43.4221%)
|-+ standard 93948623 ns (+0%)
| |----> amd64_64_24k =>  93948623 ns (+0%)
| |----> amd64_51_30k => 102212480 ns (+8.79614%)
| |--------------> cn => 229769098 ns (+144.569%)

This comment has been minimized.

Copy link

binaryFate commented Feb 12, 2020

Impressive numbers, great work!

What's the additional size to expect roughly?


This comment has been minimized.

Copy link
Contributor Author

vtnerd commented Feb 13, 2020

On this setup, static builds of monerod increases by 88.02 KiB (0.5%) and monero-wallet-cli increases by 84.02 KiB (0.5%). The daemon increase is exactly 4Kib (to the byte), so something involving padding was probably triggered.

monerod increases because it has a dependency on device via cryptonote_basic, cryptonote_core, and ringct. AFAIK, it should be dead code too, but it cannot be dropped because default_device is created in static memory, so linking the associated file brings in every function associated with that class. Fixing this is going to require a bit of additional work. It should likely be done regardless of this PR status, because the device design/implementation is dragging too much code into monerod already.

It is possible to build monerod without this ASM, then build monero-walet-cli with the ASM, but a cmake change is required between make calls. I'm not aware of an easier technique.


This comment has been minimized.

Copy link

erciccione commented Feb 13, 2020

a PR in another repo

That link is broken, i guess you meant to link to monero-project/supercop#3 :)

Nice improvement btw, looking forward to see this merged.


This comment has been minimized.

Copy link
Contributor Author

vtnerd commented Feb 14, 2020

I tested scanning on two machines with less awesome CPUs. The same ryzen 3900x / 970pro is being used to run the daemon, with the wallet scanning on:

Fanless laptop with i7-7Y75 (1.3 GHz 2/4 cores 4MiB cache): 46.07 minutes to 20.52 minutes (~55.4%).

2014 Mac Mini i5 (2.5 GHz 2/4 cores 3MiB cache): 38.29 minutes to 24.00 minutes (~37%).

I was incorrect earlier about changes to the code affecting performance - these numbers are similar to what was seen previously. I need a second identical system to see how running the daemon + wallet on a single machine affects performance. Or perhaps just look at the CPU utilization more closely.

Either way, the increase is fairly big for lower-core and/or lower clock speed x86-64 systems.

EDIT: got my cpu cache numbers flipped. The system with higher cache did better with lower clock speed. This suggest the x86-64 code is a bit heavy on the cache.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.