Suggestions from Torbjörn Granlund (personal e-mail):
The multiply primitives, in particular addMulVVW, surely deserve more
Offset the pointers so that you can index with a counter register
which goes from -n to 0, saving the CMPQ.
Unroll. You can save most of the ADCQ $0, R that way. Basically,
do one run with just MULQ where you sum the old highpart (DX) with
the new lowpart (AX). You will need some MOVQ to move DX
out of the way too. Then do a new run over these sums where you
bring in the memory addend. This should double the speed on some
A good addMulVVW is probably really the first thing to write in
assembly; addition and subtraction are usually much less important.
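
For readers who want to see the shape of the two-run idea without reading assembly, here is a rough pure-Go sketch. It is not the math/big implementation: the function names, the use of plain `uint` in place of `Word`, and the `tmp` scratch slice are assumptions made for illustration (real assembly would keep the intermediate sums in registers inside an unrolled block). The first run only multiplies and adds each new low part to the previous high part; the second run brings in the memory addend `z`. A plain one-run version is included so the two can be compared.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addMulVVWRef computes z += x*y in a single run and returns the carry word.
// Each iteration needs two carry fix-ups (the ADCQ $0 pattern in assembly).
func addMulVVWRef(z, x []uint, y uint) (c uint) {
	for i := range x {
		hi, lo := bits.Mul(x[i], y)
		lo, cc := bits.Add(lo, z[i], 0)
		hi += cc // cannot overflow: x[i]*y + z[i] + c fits in two words
		lo, cc = bits.Add(lo, c, 0)
		hi += cc
		z[i] = lo
		c = hi
	}
	return c
}

// addMulVVWTwoRun computes the same result in two runs.
// tmp is a hypothetical scratch buffer with len(tmp) >= len(x).
func addMulVVWTwoRun(z, x []uint, y uint, tmp []uint) (c uint) {
	// Run 1: multiply only. Sum the old high part with the new low part,
	// letting a single carry chain flow from word to word.
	var prevHi, carry uint
	for i := range x {
		hi, lo := bits.Mul(x[i], y)
		tmp[i], carry = bits.Add(lo, prevHi, carry)
		prevHi = hi
	}
	top := prevHi + carry // cannot overflow: x*y fits in len(x)+1 words

	// Run 2: bring in the memory addend with a second carry chain.
	carry = 0
	for i := range x {
		z[i], carry = bits.Add(z[i], tmp[i], carry)
	}
	return top + carry // cannot overflow: x*y + z fits in len(x)+1 words
}

func main() {
	x := []uint{^uint(0), ^uint(0), 1}
	y := ^uint(0)
	z1 := []uint{^uint(0), 5, ^uint(0)}
	z2 := append([]uint(nil), z1...)

	c1 := addMulVVWRef(z1, x, y)
	c2 := addMulVVWTwoRun(z2, x, y, make([]uint, len(x)))

	fmt.Println("carries agree:", c1 == c2)
	fmt.Println("limbs agree:", z1[0] == z2[0] && z1[1] == z2[1] && z1[2] == z2[2])
}
```

In Go this split buys nothing by itself, since the compiler decides how the carries are handled; the point is only to show that each run has a single carry chain, which is what lets the assembly drop most of the extra carry adds.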
I believe this was for ARMv8; there's more to do here. The Go team is preoccupied with generics for 1.18, so this is unlikely to happen in that release unless somebody else steps in, preferably someone with experience in performance-critical arithmetic routines.