
Optimize binary operations on bigints #26

Merged
4 commits merged into paritytech:master on Aug 18, 2017

Vurich (Contributor) commented on Aug 18, 2017

This is one of the first times that I've seen non-marginal wins from loop unrolling and inlining, each of them doubling the speed in this case. This (as well as significant algorithmic improvements) now makes the Rust versions of all operations significantly faster than the inline-asm ones. Yes, even the asm implementations of U256::add and U256::sub, which were 4 instructions each. I have no good answer as to why that is; I'll check out the disassembly as part of a write-up/investigation. My guess is vectorization, although in terms of explaining optimisations that's about one level above "a wizard did it".
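
For context, here is a minimal sketch of the kind of unrolled, carry-propagating limb addition the PR leans on, using the crunchy crate's unroll! macro. The function name, the fixed 4-limb layout, and the use of overflowing_add in place of the macro's $fn parameter are illustrative assumptions, not the exact code from src/uint.rs:

#[macro_use]
extern crate crunchy;

// Add two 4-limb (256-bit) little-endian numbers; returns the sum
// limbs and whether the whole addition overflowed. Illustrative only.
fn add_4_limbs(me: [u64; 4], you: [u64; 4]) -> ([u64; 4], bool) {
    let mut ret = [0u64; 4];
    let mut carry = 0u64;
    unroll! {
        for i in 0..4 {
            // Two overflowing adds per limb: limb + limb, then + carry.
            let (res1, overflow1) = me[i].overflowing_add(you[i]);
            let (res2, overflow2) = res1.overflowing_add(carry);
            ret[i] = res2;
            carry = overflow1 as u64 + overflow2 as u64;
        }
    }
    (ret, carry != 0)
}

With the loop fully unrolled there is no loop counter or bounds check left, so the optimiser can keep the limbs in registers and chain the adds directly.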

Benchmarks (using the updated benchmarks included in this PR)

Before (Rust):

test u128_mul      ... bench:     172,084 ns/iter (+/- 3,760)
test u256_add      ... bench:     248,624 ns/iter (+/- 31,677)
test u256_from_be  ... bench:          25 ns/iter (+/- 2)
test u256_from_le  ... bench:          11 ns/iter (+/- 0)
test u256_full_mul ... bench:   1,055,309 ns/iter (+/- 69,297)
test u256_mul      ... bench:     236,830 ns/iter (+/- 5,360)
test u256_sub      ... bench:     234,215 ns/iter (+/- 22,712)
test u512_add      ... bench:     249,946 ns/iter (+/- 1,889)
test u512_mul      ... bench:     377,577 ns/iter (+/- 3,380)
test u512_sub      ... bench:     248,546 ns/iter (+/- 35,421)

Before (asm):

test u128_mul      ... bench:     173,086 ns/iter (+/- 17,930)
test u256_add      ... bench:      43,462 ns/iter (+/- 899)
test u256_from_be  ... bench:          25 ns/iter (+/- 0)
test u256_from_le  ... bench:          11 ns/iter (+/- 3)
test u256_full_mul ... bench:     230,648 ns/iter (+/- 4,758)
test u256_mul      ... bench:     128,773 ns/iter (+/- 6,603)
test u256_sub      ... bench:      44,649 ns/iter (+/- 770)
test u512_add      ... bench:     107,894 ns/iter (+/- 15,486)
test u512_mul      ... bench:     378,872 ns/iter (+/- 133,697)
test u512_sub      ... bench:     175,084 ns/iter (+/- 5,369)

After (Rust):

test u128_mul      ... bench:      37,864 ns/iter (+/- 3,207)
test u256_add      ... bench:      33,512 ns/iter (+/- 2,585)
test u256_from_be  ... bench:          25 ns/iter (+/- 4)
test u256_from_le  ... bench:          11 ns/iter (+/- 0)
test u256_full_mul ... bench:     178,178 ns/iter (+/- 19,570)
test u256_mul      ... bench:      80,793 ns/iter (+/- 2,156)
test u256_sub      ... bench:      37,407 ns/iter (+/- 2,996)
test u512_add      ... bench:      30,292 ns/iter (+/- 2,036)
test u512_mul      ... bench:     275,776 ns/iter (+/- 5,969)
test u512_sub      ... bench:      31,971 ns/iter (+/- 2,705)

Benchcmp results (Before (Rust) vs After (Rust)):

 name           rust.bench ns/iter  optimized.bench ns/iter  diff ns/iter   diff %  speedup 
 u128_mul       172,084             37,864                       -134,220  -78.00%   x 4.54 
 u256_add       248,624             33,512                       -215,112  -86.52%   x 7.42 
 u256_from_be   25                  25                                  0    0.00%   x 1.00 
 u256_from_le   11                  11                                  0    0.00%   x 1.00 
 u256_full_mul  1,055,309           178,178                      -877,131  -83.12%   x 5.92 
 u256_mul       236,830             80,793                       -156,037  -65.89%   x 2.93 
 u256_sub       234,215             37,407                       -196,808  -84.03%   x 6.26 
 u512_add       249,946             30,292                       -219,654  -87.88%   x 8.25 
 u512_mul       377,577             275,776                      -101,801  -26.96%   x 1.37 
 u512_sub       248,546             31,971                       -216,575  -87.14%   x 7.77 

Benchcmp results (Before (asm) vs After (Rust)):

 name           asm.bench ns/iter  optimized.bench ns/iter  diff ns/iter   diff %  speedup 
 u128_mul       173,086            37,864                       -135,222  -78.12%   x 4.57 
 u256_add       43,462             33,512                         -9,950  -22.89%   x 1.30 
 u256_from_be   25                 25                                  0    0.00%   x 1.00 
 u256_from_le   11                 11                                  0    0.00%   x 1.00 
 u256_full_mul  230,648            178,178                       -52,470  -22.75%   x 1.29 
 u256_mul       128,773            80,793                        -47,980  -37.26%   x 1.59 
 u256_sub       44,649             37,407                         -7,242  -16.22%   x 1.19 
 u512_add       107,894            30,292                        -77,602  -71.92%   x 3.56 
 u512_mul       378,872            275,776                      -103,096  -27.21%   x 1.37 
 u512_sub       175,084            31,971                       -143,113  -81.74%   x 5.48 

I should note that the u256 benchmarks are the ones to look at, since that's what we actually use in Parity.

Vurich added some commits on Aug 18, 2017

src/uint.rs (outdated)

ret[i] = res2;
carry = overflow1 as u64 + overflow2 as u64;
if carry != 0 {

NikVolf (Member) · Aug 18, 2017

Why do you need to check carry here? Why not just:

let (res1, overflow1) = ($fn)(me[i], you[i]);
let (res2, overflow2) = ($fn)(res1, carry);

<ptr::write...>

let carry = overflow1 | overflow2;

since overflow can happen only once?
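
For addition the suggestion is sound: if me[i] + you[i] wraps, the wrapped result is at most u64::MAX - 1, so adding a carry of at most 1 cannot wrap a second time; the two overflow flags are therefore mutually exclusive. A branch-free version of the loop might look like this (a sketch, with overflowing_add standing in for $fn and a plain loop in place of unroll!):

// Compute ret = me + you limb by limb, without branching on the
// carry. Returns the final carry (0 or 1). Illustrative only.
fn add_limbs(me: &[u64], you: &[u64], ret: &mut [u64]) -> u64 {
    let mut carry = 0u64;
    for i in 0..ret.len() {
        let (res1, overflow1) = me[i].overflowing_add(you[i]);
        let (res2, overflow2) = res1.overflowing_add(carry);
        ret[i] = res2;
        // At most one of the two adds can wrap, so `|` and `+`
        // both produce the correct 0/1 carry here.
        carry = (overflow1 | overflow2) as u64;
    }
    carry
}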

Vurich (Author, Contributor) · Aug 18, 2017

Honestly, because the benchmarks told me to. I'll check them again to make certain it's a win though.
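
For anyone wanting to re-run that check, a minimal benchmark in the spirit of the ones in this PR could look like the following. This is illustrative: it assumes bigint's U256 with its From<u64> and Add impls plus nightly's test crate, and the PR's real benchmarks live in the repository's bench files:

#![feature(test)]
extern crate test;
extern crate bigint;

use bigint::U256;
use test::{black_box, Bencher};

// Illustrative benchmark: many dependent adds per iteration so the
// per-add cost dominates the measurement.
#[bench]
fn u256_add(b: &mut Bencher) {
    let x = U256::from(0x1234_5678_9abc_def0u64);
    let y = U256::from(0x0fed_cba9_8765_4321u64);
    b.iter(|| {
        let mut acc = x;
        for _ in 0..10_000 {
            acc = black_box(acc + y);
        }
        acc
    });
}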

@@ -15,11 +15,12 @@ rustc_version = "0.2"
rustc-hex = { version = "1.0", optional = true }
heapsize = { version = "0.4", optional = true }
byteorder = { version = "1", default-features = false }
crunchy = "0.1.5"

NikVolf (Member) · Aug 18, 2017

The crunchy dependency you introduced must have a non-std variant.

Did you try to compile bigint with --no-default-features?

Vurich (Author, Contributor) · Aug 18, 2017

Gah, thanks. It doesn't even require a core dependency; it's 100% macro code.

EDIT: Looks like it doesn't matter; compiling with --no-default-features works even with the version of crunchy that doesn't include #![no_std]. I guess it's inferred(?)

NikVolf (Member) · Aug 18, 2017

OK, leave it as it is. I'll update it myself if something goes wrong in no-std binaries.
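
For reference, the usual pattern for making a crate build under --no-default-features is to gate std behind a default Cargo feature and opt out of it in the crate root. A generic sketch, not necessarily bigint's exact setup:

// In lib.rs: drop the std dependency whenever the default "std"
// Cargo feature is disabled (e.g. under --no-default-features).
#![cfg_attr(not(feature = "std"), no_std)]

// With a matching feature table in Cargo.toml along the lines of:
//   [features]
//   default = ["std"]
//   std = []

Crates that are pure macro_rules! code, like crunchy, expand entirely in the calling crate, which may explain why the --no-default-features build succeeded here even without an explicit #![no_std] in crunchy itself.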


NikVolf merged commit b8baf9d into paritytech:master on Aug 18, 2017

1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)

Vurich referenced this pull request on Aug 20, 2017: Optimization improvements #27 (Closed)
