Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: add package for using SIMD instructions #53171

Open
mpldr opened this issue May 31, 2022 · 29 comments
Open

proposal: add package for using SIMD instructions #53171

mpldr opened this issue May 31, 2022 · 29 comments
Labels
Projects
Milestone

Comments

@mpldr
Copy link

@mpldr mpldr commented May 31, 2022

SIMD has the potential to greatly increase performance of a lot of data processing applications. Since #35307 was closed with the remark

We agree that there is an opportunity here, but we don't know what it looks like. Also, this is an area that is likely to be affected by generics, which are in progress. For this specific proposal, there is no change in consensus. Closing.

Generics are now available, so I want to use this opportunity to necrobump and suggest a simd package, which allows using SIMD instructions via a highlevel API. I think a simple API like Rust's experimental SIMD support, or the previously linked WebAssembly and Dart would help a lot of developers in improving their data processing performance in a simple and easy way.

@mpldr mpldr added the Proposal label May 31, 2022
@gopherbot gopherbot added this to the Proposal milestone May 31, 2022
@mpldr
Copy link
Author

@mpldr mpldr commented May 31, 2022

As for what the API might look like, the following yenc encoder may provide some insight.

Rust

fn yenc(input: [u8;64]) -> [u8;128] {
    use core::simd::*;
    let mut indata = u8x64::from_array(input); //u8x64::from_array(input);
    indata += u8x64::splat(42);

    // mark special characters
    let mut mask = indata.lanes_eq(u8x64::splat(0));
    mask |= indata.lanes_eq(u8x64::splat(10));
    mask |= indata.lanes_eq(u8x64::splat(13));
    mask |= indata.lanes_eq(u8x64::splat(61));

    indata += mask.select(
        u8x64::splat(64), // add 64 where the mask is set to true
        u8x64::splat(0)   // and don't do anything where it isn't
        );
    
    let escape = mask.select(
        u8x64::splat(61), // add '=' (ASCII 61) where the mask is set to true
        u8x64::splat(0)   // and don't do anything where it isn't
    );
    
    let mut result: [u8;128] = [0;128];
    
    for i in 0..64 {
        result[i*2] = escape[i];
        result[i*2+1] = indata[i];
    }
    
    return result;
}

(Rust Playground)

Go

package simd

This is only supposed to be a very rough draft

package simd

type Uint8x64 [64]uint8

func Add[T SIMD128|SIMD256|SIMD512](x, y T) T {
    // call appropriate add function based on required instruction set
}

func  Uint8x64Splat(value uint8) T {
    // create Uint8x64
}
package simd/mask

This is only supposed to be a very rough draft

package mask

type Maskx64 [64]bool

// 64 lane equality opereator
func Equals64(x, y simd.Uint8x64) *Maskx64  {
    // call appropriate comparison function based on required instruction set
}

func (m1 *Maskx64) Or(m2 *Maskx64) {
    // perform logical or for all lanes
}

func (m *Mask) Select(onTrue, onFalse simd.Uint8x64) simd.Uint8x64 {
    // generate from mask
}
package simd/{sse2,sse3,…}

Specific implementations. Potentially including a fallback if the instruction set is not supported.

Equivalent Go-Code would look something like this (assuming abovementioned API design)

package yenc

import (
    "simd"
    "simd/mask"
)

func yenc(input [64]byte) [128]byte {
    indata := simd.Uint8x64(input)
    indata += Uint8x64Splat(42)

    // mark special characters
    escapeMask := mask.Equals(indata, Uint8x64Splat(0))
    escapeMask.Or(mask.Equals(indata, Uint8x64Splat(10)))
    escapeMask.Or(mask.Equals(indata, Uint8x64Splat(13)))
    escapeMask.Or(mask.Equals(indata, Uint8x64Splat(61)))

    indata += escapeMask.Select(
        Uint8x64Splat(64), // add 64 where the mask is set to true
        Uint8x64Splat(0)   // and don't do anything where it isn't
        )
    
    let escape = escapeMask.Select(
        Uint8x64Splat(61), // add '=' (ASCII 61) where the mask is set to true
        Uint8x64Splat(0)   // and don't do anything where it isn't
    )
    
    var result [128]byte
    
    for i := 0; i < 64; i++ {
        result[i*2] = escape[i]
        result[i*2+1] = indata[i]
    }
    
    return result
}

I think this should be part of the standard library mainly for two reasons:

  • This would allow to write performant SIMD code without using assembler within the standard library (making code more readable)
  • Lowers the barrier of entry for using SIMD by providing a reliable API

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jun 1, 2022

The difficulty with a general purpose approach to SIMD, which is what you are suggesting, is that the performance can be dramatically different on different processors. Also, for specific processors, performance is not optimal as not all special purpose instructions are available.

(The difficulty with a processor-specific approach to SIMD is that you have to write different code for each processor.)

(As a side note, I don't see any reason to have a package like simd/sse2 in your description. Instead, we would arrange to use the appropriate implementation when building the simd package.)

@ianlancetaylor ianlancetaylor added this to Incoming in Proposals Jun 1, 2022
@Zheng-Xu
Copy link
Contributor

@Zheng-Xu Zheng-Xu commented Jun 1, 2022

I hope the proposal can support different architectures. Arm SVE has a feature called (VLA)Vector Length Agnostics. Different H/W may implement the vector size differently. See: a sneak peek into sve and vla programming

On the API level, it is better that we can decide the vector length at runtime instead of hard coding it as 8x64. Reference: OpenJDK Vector API

On the compiler side, so far Go ABI needs the frame size to be decided at compile time, which doesn't support VLA programming very well yet.

@mpldr
Copy link
Author

@mpldr mpldr commented Jun 1, 2022

we would arrange to use the appropriate implementation when building the simd package

That might not be as easy as it sounds since that would make binaries potentially less portable. Adding a fallback to use in case a specific instructionset is not supported would help in offsetting this. (Yes, I'm aware that this is also not as simple as it sounds.

@qiulaidongfeng
Copy link

@qiulaidongfeng qiulaidongfeng commented Jun 1, 2022

(As a side note, I don't see any reason to have a package like simd/sse2 in your description. Instead, we would arrange to use the appropriate implementation when building the simd package.)

I think we should provide different packages for different instruction sets .
Because it helps programmers better design compatible programs for different hardware.
This also helps to maintain compatibility, because the behavior of a specific instruction set depends on the hardware, the package provides an API based on the specific instruction set.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jun 2, 2022

@mpldr We already have a mechanism for making binaries more or less portable: the GOAMD64 environment variable and friends. See https://pkg.go.dev/cmd/go#hdr-Environment_variables.

@inkeliz
Copy link

@inkeliz inkeliz commented Jun 2, 2022

I think we should provide different packages for different instruction sets .

It's not the same of writing assembly file for each CPU? I think that defeats the purpose of SIMD package.


I think the Zig approach is better (and I think Rust is similar). For instance, Zig provides one @Vector function which works on any CPU, can read more about that here. That makes the code as portable as non-simd version, and it will use SIMD-instructions when available.

@mpldr
Copy link
Author

@mpldr mpldr commented Jun 2, 2022

@qiulaidongfeng
Copy link

@qiulaidongfeng qiulaidongfeng commented Jun 2, 2022

I think the Zig approach is better (and I think Rust is similar). For instance, Zig provides one @Vector function which works on any CPU, can read more about that here. That makes the code as portable as non-simd version, and it will use SIMD-instructions when available.

This seems to make the code more portable, and I support it.

@beoran
Copy link

@beoran beoran commented Jun 5, 2022

Indeed, a vectorize package that is portable and that uses the CPU relevant instructions would be the best solution. On platforms that do not have such instructions, the default implementation could still be useful for portability and optimized for performance.

@qiulaidongfeng
Copy link

@qiulaidongfeng qiulaidongfeng commented Jun 5, 2022

The current consensus on portability is to use appropriate implementations when building SIMD packages, which I support.

@rsc
Copy link
Contributor

@rsc rsc commented Jun 15, 2022

I came across https://github.com/google/highway a few weeks ago. Is there anything we can learn from that about portable API for SIMD?

@rsc
Copy link
Contributor

@rsc rsc commented Jun 15, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals Jun 15, 2022
@klauspost
Copy link
Contributor

@klauspost klauspost commented Jun 16, 2022

Overall, as mentioned by people a platform independent implementation if simd is at best a half implementation.

While some of the aspects are platform independent, and it may be possible to port a fraction, there is simply too much difference between platforms to make anything that would be genuinely useful. While it is "neat"

Falling back to Go implementation of SIMD would in many cases be much worse than straight up Go, and overall design of this will just slow down the availability. Example: MPSADBW -if there is no HW support, the fallback will be horrible.

SIMD intrinsics should be able to live alongside Go code. The compiler controls register use. This will allow using simd and other instructions without the now forced function call overhead. SIMD should be guarded by platform tags. SIMD types should be available for all platforms, but the functions shouldn't be abstracted, and should match underlying instructions, maybe with some compound functions.

Here are some of my previous observations when looking at intrinsics for Go:

Feature detection

There is a huge number of individual features. Compile-time feature specification will always just select a very low common denominator, and GOAMD64 only provides for a sub-set anyway and cannot be used for this.

It should be easy (maybe looking at imports), which features to check for. The detection should be part of the intrinsic offering.

A quite tricky thing is that some instructions contains several "forms", but with different features. For example ANDPS xmm1, xmm2 (SSE) also has a non-destructive 3 register version VANDPS xmm1,xmm2, xmm3 (AVX). Other intrinsic implementation has made it hard to select which version is used - and either used SSE non-optimally in an AVX code path or (even worse) used AVX in an SSE code path, causing a crash.

Data types

The tricky part about data types is that it will often require type conversion, or be untyped. Data loaded as a []byte should be available as *[][8]float32 or similar types. There should be specific types that typically maps to registers a [8]float32 type to XMM registers for instance.

Some intrinsics has no clear type. For example PXOR operates on bits, and can be mixed with operations that operate on [16]uint8, [8]uint16, [4]uint32. Same for signed/unsigned values. This means the compiler should be able to "convert" between these as a no-op, or just have a single type per register size.

I don't have a ready-to-go solution, but having to copy input from a []byte to a []float32 or vice versa must be avoidable.

The compiler cannot enforce forced constant values. Take PSHUFD, which has an 8 bit immediate value. This must be resolvable at compile time. With current function definitions that isn't really doable, so some handling would be needed for this.

Pointer arguments are a little tricky. There aren't that many, but prefetch instructions and gather/scatter and of course loading and saving.
Loading and storing has more than straight up load/save from memory, for example VPMASKMOVD, so expect there to be platform specific implementations.

Edit: Final word on portable SIMD: I am not against it, but Go should supply the tools to write portable SIMD packages, that cover the feasible subsets.

@beoran
Copy link

@beoran beoran commented Jun 16, 2022

@rsc Highway seems to be a good idea. Since Go now has generics, and Highway is Apache licensed, is there any reason why someone interested could not port it to Go? That would be the first step, I think.

@klauspost It sounds more like you want to use inline SIMD assembly than have a portable vector API. While the go compiler already supports assembly in separate files, I don't think inline SIMD assembly in Go is a good idea.

Edit: a third approach would be for the Go compiler to optimize certain array and float operations using SIMD if possible. This should be documented then, though.

@klauspost
Copy link
Contributor

@klauspost klauspost commented Jun 16, 2022

@beoran That is the title of the proposal.

While portable solutions and automatic vectoring are neat, it only provides a band-aid solution, with quite limited usability.

@Robert-M-Muench
Copy link

@Robert-M-Muench Robert-M-Muench commented Jun 18, 2022

I think the most promising approach to deal with the plethora of instruction set variants and CPU types while avoiding the explosion of the number of compiler flags, different binaries, etc. is to use a JIT-based code generation approach like https://asmjit.com makes possible.

This concept is used very successfully in a highly optimized 2D graphics library, for generating CPU-specific graphic pipelines on the fly, that outperform any other library.

@atomsymbol
Copy link

@atomsymbol atomsymbol commented Jun 18, 2022

Will SIMD support in Go include back-propagation of ISA (Instruction Set Architecture) incompatibilities during compile-time (from callees to callers) via Go's type system, probably by using an updated version of generic types with support for partial specialization? If this won't be the case, then the runtime performance of AVX-tuned programs will be random on ARM/RISC-V/etc (and vice versa, the performance of ARM/RISC-V/etc-tuned programs will be random on x86 CPUs). The main purpose, and almost the only purpose, of adding SIMD support to an already general-purpose programming language is performance (faster execution and/or lower energy consumption): If the type system of the programming language is unable to propagate information about potential performance issues, then adding SIMD support to the runtime and/or to the language itself (language syntax) is questionable/suboptimal, assuming that it is a multi-platform programming language.

@rsc
Copy link
Contributor

@rsc rsc commented Jun 22, 2022

FWIW, Go used to do very minimal JIT and found that made Go untenable in certain environments, so we removed that. Adopting a JIT here would be a big step and probably not what we want.

@rsc
Copy link
Contributor

@rsc rsc commented Jun 22, 2022

Can we come up with a list of example use cases that must be supported well by any approach that we take? Right now it's all very hypothetical.

@klauspost
Copy link
Contributor

@klauspost klauspost commented Jun 23, 2022

@rsc Many, many moons ago I made a test how it could look. It parsed Intel/Arm intrinsic information and generated stubs for intrinsic functions.

It is pretty much a 1:1 mapping of C intrinsics to Go functions. This choice was to limit the documentation needed and also to make C intrinsics easy to convert.

Example A is a conversion of the assembly version.

The idea was that with the native types it would be possible to hook platform intrinsics into the compiler, similar to how selective intrinsics are done in math/bits for example.

Example B is a more complex example that loads a linear 16 bit RGB image data, applies a 3x3 matrix, an sRGB gamma curve and finally converts to 8 bit values and stores the result. There are also other bonus examples in this file.

This is by no means "final proposal quality", and the types are just temporary, but was to give an idea of how Go could look with intrinsics. This package makes no real attempt to unify similar types across platforms. The ARM part is rather lacking, but was included for discussion.

@beoran
Copy link

@beoran beoran commented Jun 23, 2022

@klauspost, With all due respect, but that example A in Go is about as readable as it's assembly counterpart. It is all very low level. I don't see any significant benefits over just using assembly.

I'd rather have a more high level package with Matrix and Vector, etc. types that use whatever acceleration is available. Somewhat like https://github.com/go-gl/mathgl perhaps.

Edit: or something like this: https://github.com/rkusa/gm

@klauspost
Copy link
Contributor

@klauspost klauspost commented Jun 23, 2022

@beoran Someone who does understand it can write your abstracted high-level libraries on top of this. You cannot write this on top of an abstracted, lowest common denominator package.

@beoran
Copy link

@beoran beoran commented Jun 23, 2022

@klauspost Now, someone who understands can also write such a higher level library in assembly, like in gm. My point is that I don't see how such a low level API is significantly easier to use than assembly. Even for something simple like the sum of int in Go, there is the significant convenience that the compiler chooses the assembly instructions necessary for the platform. This is not the case with such an extremely low level API.

Looking at the examples, it seems like adding an int128 type to the compiler, and then have a (standard) library with vectors and matrixes would be much more convenient and teachable to new and existing Go programmers.

@mpldr
Copy link
Author

@mpldr mpldr commented Jun 23, 2022

I quite like the shown API.

but that example A in Go is about as readable as it's assembly counterpart

I would personally disagree with this, but 20 people will likely give you 21 opinions. I find it way easier to read and understand (it could use some more comments though)

I don't see how such a low level API is significantly easier to use than assembly

Writing assembly and using intrinsics is leagues apart. This also would require to write assembler for every architecture (which is not exactly "easy") instead of having the compiler do this.

@josharian
Copy link
Contributor

@josharian josharian commented Jun 23, 2022

Can we come up with a list of example use cases that must be supported well by any approach that we take?

Here are two (slightly redacted) examples from code I personally wrote. These are both quite straightforward, as SIMD goes. I can dig up more examples if it'd be helpful.

(1)

// ·compareCoverBody compares every corresponding byte of base and cur, and
// reports whether cur has any entries bigger than base.
// func · compareCoverBody(base, cur *[65536]byte) bool

This has two variants, one that uses SSE2 and one that uses AVX2.

The SSE2 variant constructs an X register filled with 128 in each byte (PUNPCKLBW, PSHUFL). Then on each loop, it does two loads into X registers (MOVOU), subtracts 128 from both inputs to convert signed comparison to unsigned comparison (PSUBB), does a per-byte comparison (PCMPGTB), extracts the top bit from each byte (PMOVMSKB), then exits with true if any of those bits are set.

The AVX2 variant is similar, but doesn't require the 128 register (VMOVDQU/VPSUBB/VPCMPGTB/VPMOVMSKB). Don't forget the VZEROUPPER.

(2)

// func countMatches8(ptr *uint64, nquad int, mask uint64, word uint64) uint64
TEXT ·countMatches8(SB), NOSPLIT, $0-40

countMatches8 accepts a []uint64 (ptr to 0th element and length), a mask, and a word to match. For each uint64 u in the input, it loads u, and increments a counter iff u & mask == word.

The caller arranges to handle the scalar entries at the end. (I happen to have alignment guarantees.) This version requires AVX2. It broadcasts mask and word to Y registers (VPBROADCASTQ), zeros a single sum accumulation register (VXORPS), and loops over each input. It does a load (VMOVDQU), a bitwise mask (VANDPS), an eq comparison (VPCMPEQQ), and adds that to the accumulation register (VPADDQ). At the end, it does a horizontal sum across the accumulation register by extracting the two halves into X registers (VEXTRACTI128/VEXTRACTI128) and adding them (VPADDQ), then reduces from X to GP the same way (PEXTRQ/PEXTRQ/ADDQ), then negates the result (because "equal" generates 0xFFFFFFFF, which is -1, but we want to count up the results, not down). And then VZEROUPPER.

There is also countMatches16, which does the same, but is masking+comparing pairs of uint64s (i.e. 16 bytes at a time). This is trickier. The main trick is using VPERMQ to rearrange the internal components of the register and then do VANDPS, so that both members of a pair of uint64s must match in order to qualify as a match. I'll drop the code at the end of this comment.

I also implemented these on arm64. The broad approach is very similar, although the VPERMQ was implemented using two rounds of VTRN1/VTRN2.

Lastly, I also have "hasMatches" variants for each of these which exit early as soon as a single match is found.

I chose these two examples because they're (a) not particularly unusual or tricky, just a bunch of vector manipulation and yet (b) they're not completely trivial, particularly the vector rearrangement.

Code for amd64 countMatches16. I'm happy to share any of the other full code mentioned here on request.

// func countMatches16(ptr *uint64, noct int, maskA uint64, maskB uint64, wordA uint64, wordB uint64) uint64
// Requires: AVX, AVX2, SSE2, SSE4.1
TEXT ·countMatches16(SB), NOSPLIT, $0-56
	// Broadcast mask and word to YMM registers.
	// Mask: A B A B
	MOVQ         maskA+16(FP), X0
	MOVQ         maskB+24(FP), X1
	VPBROADCASTQ X0, Y0
	VPBROADCASTQ X1, Y1
	VPUNPCKLQDQ  Y1, Y0, Y0

	// Word: A B A B
	MOVQ         wordA+32(FP), X1
	MOVQ         wordB+40(FP), X2
	VPBROADCASTQ X1, Y1
	VPBROADCASTQ X2, Y2
	VPUNPCKLQDQ  Y2, Y1, Y1

	// Zero sum accumulation register.
	VXORPS Y2, Y2, Y2

	// Prepare loop.
	MOVQ ptr+0(FP), AX
	MOVQ noct+8(FP), CX

	// Loop until all octs have been processed.
loop:
	CMPQ CX, $0x00
	JE   done

	// y := *ptr
	VMOVDQU (AX), Y3

	// y &= yMask
	VANDPS Y3, Y0, Y3

	// y = y == yWord, 64 bit chunks
	VPCMPEQQ Y3, Y1, Y3

	// Given 64 bit chunks A/B/C/D, shuffle to B/A/D/C, store in yShuf
	VPERMQ $0xb1, Y3, Y4

	// y &= yShuf
	VANDPS Y3, Y4, Y3

	// ySum += y
	// We are counting matches.
	// When equal, the int64 is filled with 1s, which is -1.
	// The NEGQ at the very end makes this positive.
	VPADDQ Y2, Y3, Y2

	// Advance pointer, decrement oct count.
	ADDQ $0x20, AX
	DECQ CX
	JMP  loop

done:
	// Sum accumulated outputs.
	// Reduce from YMM to XMM.
	VEXTRACTI128 $0x01, Y2, X0
	VEXTRACTI128 $0x00, Y2, X1
	VPADDQ       X1, X0, X1

	// Reduce from XMM to GP. We only need either the hi or the lo.
	PEXTRQ $0x00, X1, AX

	// Make count positive (see comment in loop), store.
	NEGQ AX
	MOVQ AX, ret+48(FP)

	// Avoid silly SIMD slowdowns.
	VZEROUPPER
	RET

@balasanjay
Copy link
Contributor

@balasanjay balasanjay commented Jun 24, 2022

@klauspost Now, someone who understands can also write such a higher level library in assembly, like in gm. My point is that I don't see how such a low level API is significantly easier to use than assembly. Even for something simple like the sum of int in Go, there is the significant convenience that the compiler chooses the assembly instructions necessary for the platform. This is not the case with such an extremely low level API.

Looking at the examples, it seems like adding an int128 type to the compiler, and then have a (standard) library with vectors and matrixes would be much more convenient and teachable to new and existing Go programmers.

Its true that it could be written in assembly. But assembly has all of the following properties:

  • non-portable
  • unsafe (implies memory unsafety)
  • written in an entirely separate programming language from the bulk of the code
    • no abstractions (e.g. loops), manual register allocation. This can and should be worked around via code generators like avo, but that introduces a Go-specific thing to learn (certainly compared to an intrinsics API, where there is precedent in C/C++/Rust/Swift/Zig)
  • cannot be inlined (atleast with today's compiler)
  • does not support an efficient register-based ABI (atleast with today's compiler)

This long list of legitimate downsides leads to things like the (very reasonable) Go Assembly policy.

But the thing is that none of those properties are intrinsic (no pun intended) to the problem space. Only the first one (non-portability) is. And even for that one, the way you build portable APIs is you define them in terms of non-portable APIs (e.g. package os building on package syscall). Today if you want to build portable APIs, you can either do it in the compiler, or do it in assembly (and pick up all the negative properties I mentioned above). That seems like a false choice.

@atomsymbol
Copy link

@atomsymbol atomsymbol commented Jun 24, 2022

Its true that it could be written in assembly. But assembly has all of the following properties:

  • non-portable

Just a note (1): It is "non-portable" only as long as the assembly code is expected to run on a different CPU architecture at the same performance level as on a CPU which can run the code natively.

  • unsafe (implies memory unsafety)

Just a note (2): It is possible (although in practice it is highly uncommon) to write safe code in assembly, as long as the compiler is able to keep track of how information flows within the assembly program.

@CAFxX
Copy link
Contributor

@CAFxX CAFxX commented Jun 24, 2022

I came across google/highway a few weeks ago. Is there anything we can learn from that about portable API for SIMD?

Talking about prior art for portable APIs, there is also the ongoing work on the new Java Vector API (JEPs 338, 414, and 417) that, funnily enough, is somewhat reminiscent of math/big in its syntax.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

No branches or pull requests