math/big: mulAddVWW too slow #22643

Open
TuomLarsen opened this issue Nov 9, 2017 · 17 comments
Labels: NeedsInvestigation (someone must examine and confirm this is a valid issue and not a duplicate of an existing one), Performance
Milestone: Unplanned

Comments

@TuomLarsen commented Nov 9, 2017

What version of Go are you using?

go1.9.2 darwin/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using?

darwin/amd64

What did you do?

The following Go program multiplies a big.Int by a one-Word constant:

package main

import (
	"fmt"
	"math/big"
)

func main() {
	a := big.NewInt(1)
	b := big.NewInt(1)
	a.Lsh(a, 8096)
	for i := uint64(1); i <= 10000000; i++ {
		a.Mul(a, b)
	}
	fmt.Println(a.BitLen())
}

What did you expect to see?

The following C program uses GMP 6.1.2:

#include <stdio.h>
#include <stdint.h>
#include "gmp.h"
 
int main(int argc, char *argv[]) {
    mpz_t a, b;
    mpz_init_set_ui(a, 1);
    mpz_init_set_ui(b, 1);
    mpz_mul_2exp(a, a, 8096);
    for (uint64_t i = 1; i <= 10000000ULL; i++) {
        mpz_mul(a, a, b);
    }
    printf("%lu\n", mpz_sizeinbase(a, 2));
    return 0;
}

It terminates in around 1s.

What did you see instead?

The Go program completes on my machine in 1.87s, i.e. almost twice as slow.

mulAddVWW is a building block for multiplication (by one word or more), so better performance would benefit many applications.
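
For context, here is a minimal pure-Go sketch of what mulAddVWW computes: z = x*y + r over machine words, returning the final carry. The real function is hand-written assembly in arith_amd64.s; this sketch only pins down the semantics, using bits.Mul64 and bits.Add64 from math/bits (available since Go 1.12). The name mulAddVWWRef is hypothetical.

package main

import (
	"fmt"
	"math/bits"
)

// mulAddVWWRef mirrors the semantics of math/big's internal mulAddVWW:
// z = x*y + r over 64-bit words, returning the final carry.
func mulAddVWWRef(z, x []uint64, y, r uint64) (c uint64) {
	c = r
	for i := range z {
		hi, lo := bits.Mul64(x[i], y)     // full 128-bit product x[i]*y
		lo, carry := bits.Add64(lo, c, 0) // fold in the incoming carry
		z[i] = lo
		c = hi + carry // cannot overflow: hi <= 2^64-2
	}
	return c
}

func main() {
	x := []uint64{^uint64(0), ^uint64(0)} // x = 2^128 - 1
	z := make([]uint64, len(x))
	c := mulAddVWWRef(z, x, 2, 1) // (2^128-1)*2 + 1 = 2^129 - 1
	fmt.Println(z, c)             // all-ones words, carry 1
}

Note that each iteration's carry c depends on the previous iteration's; this loop-carried chain is what the assembly discussion further down is about.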

@vielmetti

See also https://go-review.googlesource.com/c/go/+/76270 (for arm64) where it is reported:

The lack of proper addMulVVW implementation for arm64 hurts RSA
performance

This is an optimized implementation, it improves RSA2048 performance
by 10X to 15X on ARMv8 based server processors

@griesemer (Contributor)

Duplicate of #9245.

@randall77 (Contributor)

I'm not seeing the performance difference you're seeing. With your two example programs:

$ go build issue22643.go
$ gcc -O3 -lgmp issue22643.c 
$ time ./a.out 
8097

real	0m1.063s
user	0m1.059s
sys	0m0.002s
$ time ./issue22643
8097

real	0m1.106s
user	0m1.102s
sys	0m0.004s

That's only a few percent slower.
This is with Go 1.9 and GMP 6.1.2 on darwin/amd64, with "gcc" being Apple LLVM version 9.0.0 (clang-900.0.38), on an Intel Xeon E5.

@ALTree (Member) commented Nov 9, 2017

To add another data point: on linux/amd64, on a Haswell i7, with go1.9, gmp 6.1, and gcc 7.2, I get, over 5 runs:

Go:
real	0m1.243s
real	0m1.244s
real	0m1.268s
real	0m1.261s
real	0m1.248s

C+gmp:
real	0m1.151s
real	0m1.137s
real	0m1.191s
real	0m1.141s
real	0m1.141s

So Go is within ~8% of GMP.

@vielmetti

Results for arm64, specifically on a HiSilicon Hi1616 (Packet "Type 2A2"):

root@2a2-neon:~/src/golang# time ./a.out
8097

real    0m3.930s
user    0m3.926s
sys     0m0.004s
root@2a2-neon:~/src/golang# go version
go version go1.9.2 linux/arm64
root@2a2-neon:~/src/golang# go build issue22643.go 
root@2a2-neon:~/src/golang# time ./issue22643
8097

real    0m4.150s
user    0m4.145s
sys     0m0.009s

Again, the Go version is within a few percent of the C results.

@TuomLarsen (Author) commented Nov 9, 2017

That's really strange; I triple-checked but am still getting the same 1.8x difference.
CPU: Intel Core M 1.1 GHz, 256 KB L2 cache (per core).

$ cat issue22643.go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	a := big.NewInt(1)
	b := big.NewInt(1)
	a.Lsh(a, 8096)
	for i := uint64(1); i <= 10000000; i++ {
		a.Mul(a, b)
	}
	fmt.Println(a.BitLen())
}
$ go build issue22643.go
$ time ./issue22643
8097

real	0m1.814s
user	0m1.791s
sys	0m0.013s

And:

$ cat issue22643.c 
#include <stdio.h>
#include <stdint.h>
#include "gmp.h"
 
int main(int argc, char *argv[]) {
    mpz_t a, b;
    mpz_init_set_ui(a, 1);
    mpz_init_set_ui(b, 1);
    mpz_mul_2exp(a, a, 8096);
    for (uint64_t i = 1; i <= 10000000ULL; i++) {
        mpz_mul(a, a, b);
    }
    printf("%lu\n", mpz_sizeinbase(a, 2));
    return 0;
}
$ gcc -O3 -o issue22643 issue22643.c -lgmp -lm
$ time ./issue22643
8097

real	0m1.082s
user	0m1.067s
sys	0m0.007s

Btw, I got Go from the downloads page (go1.9.2.darwin-amd64.pkg):

$ shasum darwin_amd64/math/big.a 
9ad030b2c3f78119d15c85d2241a77e680612b10  big.a
$ shasum bin/go
1976f6cd942b2c4b903ee7523dd1f889a161188e  go

@randall77 (Contributor)

Can you profile both instances (Go and C) and see where the time goes? We can then compare your profile to profiles on our machines.
I suspect a difference between chips.

@TuomLarsen (Author)

I used pkg/profile; is that enough?

package main

import (
    "fmt"
    "github.com/pkg/profile"
    "math/big"
)

func main() {
    defer profile.Start().Stop()
    a := big.NewInt(1)
    b := big.NewInt(1)
    a.Lsh(a, 8096)
    for i := uint64(1); i <= 10000000; i++ {
        a.Mul(a, b)
    }
    fmt.Println(a.BitLen())
}

And pprof top:

$ go tool pprof [...]/cpu.pprof
Duration: 1.92s, Total samples = 1.71s (89.28%)
(pprof) top
Showing nodes accounting for 1.71s, 100% of 1.71s total
Showing top 10 nodes out of 14
      flat  flat%   sum%        cum   cum%
     1.42s 83.04% 83.04%      1.42s 83.04%  math/big.mulAddVWW /usr/local/go/src/math/big/arith_amd64.s
     0.10s  5.85% 88.89%      1.57s 91.81%  math/big.nat.mulAddWW /usr/local/go/src/math/big/nat.go
     0.06s  3.51% 92.40%      1.69s 98.83%  math/big.(*Int).Mul /usr/local/go/src/math/big/int.go
     0.06s  3.51% 95.91%      1.63s 95.32%  math/big.nat.mul /usr/local/go/src/math/big/nat.go
     0.05s  2.92% 98.83%      0.05s  2.92%  math/big.nat.norm /usr/local/go/src/math/big/nat.go
     0.01s  0.58% 99.42%      0.01s  0.58%  runtime.duffcopy /usr/local/go/src/runtime/duff_amd64.s
     0.01s  0.58%   100%      0.01s  0.58%  runtime.usleep /usr/local/go/src/runtime/sys_darwin_amd64.s
         0     0%   100%      1.69s 98.83%  main.main /Users/tuom/go/issue22643.go
         0     0%   100%      1.69s 98.83%  runtime.main /usr/local/go/src/runtime/proc.go
         0     0%   100%      0.01s  0.58%  runtime.morestack /usr/local/go/src/runtime/asm_amd64.s

@randall77 (Contributor)

@TuomLarsen I need to see a profile of the C+gmp program as well. Without it, it is hard to tell whether Go is slow on your system or GMP is fast.

Using your profiled program, my machine spends 97+% of its time in mulAddVWW, whereas your code only spends 83% there. I'm not sure if that has anything to do with the issue at hand, but it seems weird.
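
A finer-grained comparison between machines is possible from the profile already collected: pprof can attribute samples to individual instructions of the routine. A sketch of that workflow, with the binary and profile paths assumed:

$ go tool pprof issue22643 cpu.pprof
(pprof) list mulAddVWW     # per-line sample counts within the routine
(pprof) disasm mulAddVWW   # annotated disassembly with sample weights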

@TuomLarsen (Author)

From Instruments:

              Weight       Self Weight      Symbol Name
  1.26 s      99.8%          0 s            start
  1.25 s      99.1%         14.00 ms        main
  1.11 s      88.2%          1.11 s         __gmpn_mul_1
123.00 ms      9.7%        123.00 ms        __gmpz_mul
  7.00 ms      0.5%          7.00 ms        DYLD-STUB$$__gmpz_mul
  2.00 ms      0.1%          0 s            <Unknown Address>
  2.00 ms      0.1%          2.00 ms        __gmpz_mul

@randall77 (Contributor)

Thanks for that data. Unfortunately, no smoking gun.
I don't know what is going on with your chip that makes it so different from ours.

For the record, here is the inner loop in GMP:

   18bf0:       4e 89 14 df             mov    %r10,(%rdi,%r11,8)
   18bf4:       49 01 c1                add    %rax,%r9
   18bf7:       4a 8b 04 de             mov    (%rsi,%r11,8),%rax
   18bfb:       49 11 d0                adc    %rdx,%r8
   18bfe:       41 ba 00 00 00 00       mov    $0x0,%r10d
   18c04:       48 f7 e1                mul    %rcx
   18c07:       4e 89 4c df 08          mov    %r9,0x8(%rdi,%r11,8)
   18c0c:       49 01 c0                add    %rax,%r8
   18c0f:       48 11 d3                adc    %rdx,%rbx
   18c12:       4a 8b 44 de 08          mov    0x8(%rsi,%r11,8),%rax
   18c17:       48 f7 e1                mul    %rcx
   18c1a:       4e 89 44 df 10          mov    %r8,0x10(%rdi,%r11,8)
   18c1f:       48 01 c3                add    %rax,%rbx
   18c22:       49 11 d2                adc    %rdx,%r10
   18c25:       4a 8b 44 de 10          mov    0x10(%rsi,%r11,8),%rax
   18c2a:       48 f7 e1                mul    %rcx
   18c2d:       4a 89 5c df 18          mov    %rbx,0x18(%rdi,%r11,8)
   18c32:       41 b8 00 00 00 00       mov    $0x0,%r8d
   18c38:       4c 89 c3                mov    %r8,%rbx
   18c3b:       49 01 c2                add    %rax,%r10
   18c3e:       4a 8b 44 de 18          mov    0x18(%rsi,%r11,8),%rax
   18c43:       4d 89 c1                mov    %r8,%r9
   18c46:       49 11 d1                adc    %rdx,%r9
   18c49:       48 f7 e1                mul    %rcx
   18c4c:       49 83 c3 04             add    $0x4,%r11
   18c50:       78 9e                   js     18bf0 <___gmpn_mul_1+0xa7>

compared with the inner loop in math/big:

U5:	// i+4 <= n
	// regular loop body unrolled 4x
	MOVQ (0*8)(R8)(BX*8), AX
	MULQ R9
	ADDQ CX, AX
	ADCQ $0, DX
	MOVQ AX, (0*8)(R10)(BX*8)
	MOVQ DX, CX
	MOVQ (1*8)(R8)(BX*8), AX
	MULQ R9
	ADDQ CX, AX
	ADCQ $0, DX
	MOVQ AX, (1*8)(R10)(BX*8)
	MOVQ DX, CX
	MOVQ (2*8)(R8)(BX*8), AX
	MULQ R9
	ADDQ CX, AX
	ADCQ $0, DX
	MOVQ AX, (2*8)(R10)(BX*8)
	MOVQ DX, CX
	MOVQ (3*8)(R8)(BX*8), AX
	MULQ R9
	ADDQ CX, AX
	ADCQ $0, DX
	MOVQ AX, (3*8)(R10)(BX*8)
	MOVQ DX, CX
	ADDQ $4, BX		// i += 4
	
	LEAQ 4(BX), DX
	CMPQ DX, R11
	JLE U5

They look remarkably similar. The GMP library goes through some hoops to do mov $0, x; adc y, x instead of adc $0, x, which is possibly relevant, but I can't see why.

@TuomLarsen (Author)

@randall77 Thanks for looking into it! I'm certainly no expert, but if that U5 loop is indeed the bottleneck, maybe GMP is trying to avoid a data dependency?

Btw, should this issue be re-opened, as it seems it is not about the general improvements tracked by #9245?
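
To make the suspected dependency concrete: the multiplications in the loop are independent of one another; only the carry propagation is serial. Here is a sketch of splitting the two phases, in the same uint64/math/bits style as the reference sketch above (hypothetical helper name; an illustration of the dependence structure, not a tuned or allocation-free implementation):

package main

import "math/bits"

// mulAddVWWSplit computes the same z = x*y + r, but does all the
// independent multiplications first and only then the serial carry
// pass, so only the second loop has an iteration-to-iteration
// dependency.
func mulAddVWWSplit(z, x []uint64, y, r uint64) (c uint64) {
	hi := make([]uint64, len(x))
	for i := range x {
		hi[i], z[i] = bits.Mul64(x[i], y) // independent products
	}
	c = r
	for i := range z {
		var carry uint64
		z[i], carry = bits.Add64(z[i], c, 0) // serial carry chain
		c = hi[i] + carry
	}
	return c
}

func main() {
	x := []uint64{^uint64(0), ^uint64(0)}
	z := make([]uint64, len(x))
	println(mulAddVWWSplit(z, x, 2, 1)) // prints 1, matching the plain loop
}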

@griesemer (Contributor)

Re-opened, as this is a more specific case of #9245.

@griesemer reopened this Nov 11, 2017
@gopherbot (Contributor)

Change https://golang.org/cl/77371 mentions this issue: math/big: improve mulAddVWW inner loop

@randall77 (Contributor)

@TuomLarsen: could you try patching in CL 77371 and see if that improves performance on your machine?

@TuomLarsen (Author)

@randall77 Unfortunately no, it seems it does not improve things. What I did was create a stub library with both the new and the old mulAddVWW, and replace the above a.Mul(a, b) with a suitable call. Both versions took around 1.58s to finish.

@randall77 (Contributor)

Then I'm out of ideas. Without a copy of your machine, it's hard to debug.
You might try copying the exact inner loop from GMP into math/big and seeing what happens.
Then mutate it one or a few instructions at a time until the performance degrades.
That might get us a reasonable starting point.
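
For anyone running such experiments, here is a standalone benchmark sketch that isolates this code path (file and function names are arbitrary): with a single-word multiplier, (*Int).Mul dispatches to the mulAddVWW loop, as the profiles above show.

// issue22643_test.go: run with "go test -bench=. -count=5"
package main

import (
	"math/big"
	"testing"
)

func BenchmarkOneWordMul(b *testing.B) {
	a := new(big.Int).Lsh(big.NewInt(1), 8096) // same operand size as the report
	y := big.NewInt(3)                         // any single-word multiplier
	z := new(big.Int)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		z.Mul(a, y) // nearly all samples land in mulAddVWW
	}
}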

@ianlancetaylor added the Performance and NeedsInvestigation labels Mar 29, 2018
@ianlancetaylor added this to the Unplanned milestone Mar 29, 2018