Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make ARM SVE code vector length agnostic #285

Merged
merged 4 commits into from
Aug 23, 2024

Conversation

fwessels
Copy link
Contributor

This PR updates the ARM SVE code to be vector length agnostic (so no more "hardcoded" constants but using addvl etc. everywhere).

Code has been tested on Graviton 3 with 256-bits vector length (so same vector length as is now supported).

Support for Graviton 4 (128-bits vector length) will be added in a separate PR but should not require any changes to the assembly code itself.

@klauspost
Copy link
Owner

These assume 64 bytes/loop when processing:

return n & (maxInt - 63)
- so this must be adjusted to actual remaining.

You probably also want to make adjustments to minCodeGenSize here

func (r *reedSolomon) hasCodeGen(byteCount int, inputs, outputs int) (_, _ *func(matrix []byte, in, out [][]byte, start, stop int) int, ok bool) {

@fwessels
Copy link
Contributor Author

fwessels commented Aug 21, 2024

See latest commits which adds support for Graviton 4.

There is no need to change the remaining bytes left since all *_64/_64Xor functions still always process 64 bytes (and same for the other routines that always process 32 bytes at a time). Note however that with a smaller vector length (only case now being 16 for Graviton 4), it will always loop twice as much in the routines.

So even when just processing 64 bytes in eg. mulSve_10x1_64 it will still do 2 loops for Graviton 4 whereas just a single loop for Graviton 3 (and AVX2).

@fwessels
Copy link
Contributor Author

Here are the testing results on Graviton 4 (as identified by the support for sve2 which Graviton 3 misses)

$ lscpu | grep sve2
Flags:                                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
$
$ go test --short 
Using ARM+SVE
PASS
ok      github.com/klauspost/reedsolomon        3.071s

@klauspost
Copy link
Owner

AFAICT this should also have vector-length instead of 32 hardcoded:

sz := r.dataShards * r.parityShards * 2 * 32

.. otherwise the temporary slice will be too small..

@fwessels
Copy link
Contributor Author

Good point, but (for now) this is not the case since the vector length is capped at 32 bytes (see code below) plus the now vl-agnostic SVE assembly routines always either process a multiple of 64 or 32 bytes (exactly like the AVX2 code from which it is derived).

		if vl, _ := getVectorLength(); vl <= 256 {
 			// set vector length in bytes
 			defaultOptions.vectorLength = int(vl) >> 3
 		} else {
 			// disable SVE for hardware implementatons over 256 bits (only know to be Fujitsu A64FX atm)
 			defaultOptions.useSVE = false
 		}

So the current implementation has minimal impact on the current code.

However, once we get our hands on SVE hardware that supports vectors that are larger than 256 bits, we will indeed have to do some more "surgery" to enlarge buffers etc. The only known implementation ATM is the Fujitsu chip listed above, so it might be a while before we can get access and work on this ....

@klauspost klauspost merged commit 67157af into klauspost:master Aug 23, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants