
proposal: add "simd" package to standard library #67520

Open
Clement-Jean opened this issue May 20, 2024 · 50 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Proposal
@Clement-Jean

Clement-Jean commented May 20, 2024

Background

After doing a little bit of research on previous issues mentioning adding SIMD to Go, I found the following proposals:

#58610: This proposal relies on the ispmd keyword, which would not respect the Go 1 compatibility guarantees.

#53171: This proposal is closer to what I am proposing; however, it defines multiple packages, one per CPU architecture, and I am providing a more complete explanation of how to add SIMD to the Go standard library.

#64634: This proposal is more specific to the crypto package, but in essence it is close to what I'm proposing here.

Goals

The main goal of this proposal is to provide an alternative approach to designing a simd package for the Go standard library. As of right now, there is no consensus on what the API should look like and this proposal intends to drive the discussion further.

This proposal mainly includes two things:

  • Adding a new kind of build tag that lets the user specify which SIMD ISA to use at compile time.
  • Using compiler intrinsics to generate inline SIMD instructions in the code.

Build Tag

For the first point, I was thinking about optional build tags like the following:

//go:simd sse2
//go:simd neon
//go:simd avx512

etc.

As mentioned, these build tags are optional. If not specified, we would resolve to the appropriate SIMD ISA available on the current OS and ARCH. However, I still think that these build tags are needed for deeper optimization and platform-specific operations. If we know that some instruction is more performant or only available on a certain architecture, we should be able to enforce using it manually.

Finally, having the optional build tag would let the developer choose, at compile time, which SIMD ISA to target and thus cross compile. We could write something similar to:

$ go build -simd neon ...

With this, we could take advantage of platform-specific features and know at compile time the size of the vector registers (e.g., 128, 256, or 512 bits). This would help us make better decisions for optimizations on the compiler side.

Compiler Intrinsics

The next crucial step would be to create a portable SIMD package that would rely on the compiler to generate SIMD instructions through compiler intrinsics. I demonstrated that this is feasible with a POC. As of right now, it looks like the following (you can see more examples here, including UTF8 string validation):

package main

import (
    "fmt"
    "simd"
)

func main() {
    a := simd.Uint8x16{...}
    b := simd.Uint8x16{...}
    c := simd.AddU8x16(a, b)

    fmt.Printf("%v\n", c)
}

And here, the AddU8x16 gets lowered down to a vector add instruction after SSA lowering.

Notes

  • We can provide functions like AddU8x16, AddU8x32, etc. without changing the generics implementation. Other implementations like Highway, Rust std::simd, and Zig @Vector rely on generics for the API. In Go, we do not have non-type parameters in generics; thus we cannot have something like Simd[uint8, 16].

  • Also, we do not have a compile-time Sizeof, which could otherwise let us write:

    type Simd[T SupportedSimdTypes] = [VectorRegisterSize / SizeofInBits(T)]T

Philosophy

It is important to understand that this proposal does not describe an abstraction over SIMD features. Such an abstraction could create noticeable performance differences between ISAs, which is why we are trying to avoid it. This means that if an operation is not available on an ISA, we simply don't provide it. Each intrinsic should map to exactly one underlying instruction, not a sequence of instructions.

Challenges to Overcome

  • Under the hood, the current POC works with pointers to arrays. The main reason is that fixed-size arrays are not currently SSAable. But because these pointers to arrays are stored in general-purpose registers, the performance is not great (allocations are required and loads hit non-contiguous memory), and it forces us to do the LD/ST dance in each function of the simd package.

I believe we would need some kind of dedicated types like the following:

type Int8x16 [16]int8
type Uint8x16 [16]uint8
//...

These types should not be indexable and should only be instantiable through init functions like Splat8x16, Load8x16, etc. The compiler would then promote these special types to vector registers. This would remove the LD/ST dance and the memory allocation present in my POC, and thus make everything a lot faster.

In the end the previous code snippet could look like this:

package main

import (
    "fmt"
    "simd"
)

func main() {
    a := simd.SplatU8x16(1)
    b := simd.LoadU8x16([16]uint8{...})
    c := simd.AddU8x16(a, b)

    fmt.Printf("%v\n", c)
}
  • A lot of instructions on Arm (and, I suppose, on other ISAs) are missing from the assembler. The current POC encodes these as raw constants (WORD $0x...).

I believe we should avoid having to use constants when defining intrinsics. We could implement the missing instructions along the way with the implementation of intrinsics.

  • Naming these intrinsics is not always easy. For example, NEON has instructions called VMIN and UMINV. The former returns a vector of lane-wise minimums, and the latter reduces a vector to its minimum element. As we don't have function overloading, we will need to find a way to name them appropriately.

I believe we should make the horizontal operations more verbose (e.g. ReduceMin8x16) and promote the vertical ones (e.g. Min8x16). In the case of VMIN and UMINV, SSE2 has an equivalent for the former but apparently not for the latter.

There are other operations with "conflicting" names. For example, shift right and shift left both come in logical and arithmetic variants. For these cases, I believe we could just call them LogicalShiftRight, ArithmeticShiftRight, and so on. I agree this is verbose, but it makes clear what is happening.

  • The current POC did not implement the concept of Masks. This is an important concept but also a tricky one to implement without proper compiler support. After discussion with Jan Wassenberg (author of Highway), I realized that some platforms do not treat masks in the same way. Here is a summary:

    • on NEON, SSE4, and AVX2: 1 bit per bit of the vector.
    • on SVE: 1 bit per byte of vector (variable vector size).
    • on AVX-512, and RVV: 1 bit per lane.
    • AVX-512 has a separate register file for masks.

We could have other dedicated types, like the ones made for vectors. These types would have different shapes and be loaded differently depending on the platform.

  • Some operations need parameters that are known at compile time and restricted to a certain range. For example, SSHR (shift right) on NEON takes an immediate n that must be between 1 and 8 (see vshr_n_s8). I ran into problems where, during the build of the compiler, n would resolve to 0 (the default value of an int parameter) and crash the program.

I believe we need some way to check at compile time that these values are within a certain range, like a static_assert in C++ or checks on the AST.

Why is it Important?

Without SIMD, we are missing out on a lot of potential optimizations. Here is a non-exhaustive list of concrete things that could improve performance in daily-life scenarios:

Furthermore, it would make these currently existing packages more portable and maintainable:

There are obviously many more applications of SIMD; I am just trying to show that it is useful in practical scenarios.

@gopherbot gopherbot added this to the Proposal milestone May 20, 2024

@ianlancetaylor ianlancetaylor added the compiler/runtime Issues related to the Go compiler and/or runtime. label May 20, 2024
@ianlancetaylor ianlancetaylor changed the title proposal: Add simd package to standard library proposal: add "simd" package to standard library May 20, 2024
@ianlancetaylor ianlancetaylor moved this to Incoming in Proposals May 20, 2024
@earthboundkid
Contributor

//go:simd sse2

ISTM, this could be a regular build tag //go:build sse2. A simd platform is sort of like GOARCH, except that one GOARCH might implement only some SIMD architectures. So maybe there should be a concept of a GOSUBARCH, and those could be available as build tags and as runtime lookups at init that PGO could optimize.

This might also help with WASI, where the solution was to make wasip1 the GOOS; but as new features get added to WASM itself, we might run into the same issue of needing to differentiate classic WASM from WASM++ or whatever.

@Clement-Jean
Author

In the POC this is implemented with a simple build tag. However, I was presenting the idea for backward-compatibility reasons: if anyone has already defined such build tags, we wouldn't want to interfere with their code.

Furthermore, these new build tags might lead to different build logic or state than the original ones. I haven't looked into the details, though, so I cannot describe a concrete use case...

@earthboundkid
Contributor

I guess that's a risk with any new build tag, but yes, it seems more likely to affect real code here. Maybe gosub_amd64_sse2 would be less likely to collide?

@Clement-Jean
Author

This seems a little terse to me, and I'm still not comfortable with impacting others' code. That's also why I didn't propose file names with a certain ending, like _sse2_amd64.go or _sse2_amd64_test.go.

But overall, I'm open to the idea.

@jan-wassenberg

we do not have non-type parameters in generics; thus we cannot have something like Simd[uint8, 16].

Couldn't you have a T32 type that provides a function N that returns 32?
It seems to me that a vector-length agnostic API is important for running on heterogeneous devices, and supporting SVE/RISC-V.

if an operation is not available on an ISA, we simply don't provide it

This seems limiting. For example, dot products are quite important, but platforms differ a bit in what they provide. Also, Scatter isn't universally supported, but important in some fields. Would you simply skip it?

Each intrinsic should only have 1 underlying instruction, not a sequence of instructions.

This cannot be guaranteed with LLVM, which can and does transform intrinsics to something else.

@Clement-Jean
Author

Clement-Jean commented May 21, 2024

Couldn't you have a T32 type that provides a function N that returns 32?
It seems to me that a vector-length agnostic API is important for running on heterogeneous devices, and supporting SVE/RISC-V.

If I understand correctly, instead of passing Uint8x16 you are suggesting a Vector16 type which has a Lanes function that would return 16. And then, depending on which ISA we resolve to, we choose the vector type to be used across all intrinsics. Did I get that right? If yes, I would have to experiment; I'm not entirely sure how this would work.

This seems limiting. For example, dot products are quite important, but platforms differ a bit in what they provide. Also, Scatter isn't universally supported, but important in some fields. Would you simply skip it?

I believe you are seeing that from the Highway point of view. This proposal tries to provide direct intrinsics. Functions wrapping multiple intrinsics could then be created on top of them. These functions would include dot products, scatters, ...

This cannot be guaranteed with LLVM, which can and does transform intrinsics to something else.

Fortunately, we are not dealing with LLVM. Go has its own assembly.

@jan-wassenberg

Vector16 type which has a Lanes function that would return 16. And then, depending on which ISA we resolve to, we choose the vector type to be used across all intrinsics.

Yes, that's right. Though I'd say VectorN instead of Vector16. 128-bit vectors have their place for historical reasons, but users asking for Vector1024 is a recipe for poor codegen on most ISAs.

This proposal tries to provide direct intrinsics. Functions wrapping multiple intrinsics could be then created on top of them.

Note that some platforms have dot product instructions e.g. vdpbf16ps, which are too good to pass up on. Of course a dot product kernel would call them in a loop, but you still want these instructions reflected in the abstraction layer.

@earthboundkid
Contributor

A sticking point for past proposals has been whether the package should be high level or low level. I think it probably makes sense to make a syscall vs. os style split and have a high level package that calls into a low level package. In terms of implementation, I think starting with the low level package is clearly better. So maybe start with golang.org/x/exp/simd being experimentally special cased by the compiler, if it's successful, move it to runtime/simd and then if everyone is happy with runtime/simd, a separate proposal could add math/vector as a high level library.

@Clement-Jean
Author

Clement-Jean commented May 21, 2024

Thank you for the clarification, @jan-wassenberg

@earthboundkid this is the idea behind this proposal. Let's do the low-level, fixed-length intrinsics first; building on that, we will be able to do the fancy length-agnostic ones.

@jan-wassenberg

It's not clear to me that VL agnostic can be built on top of known-length types, unless for each vector you set aside up to 256 bytes (SVE) or even 64 KiB (RISC-V V)?

@Clement-Jean
Author

I'm not even sure the compiler can generate code for SVE and RISC-V V, to be honest. So I don't think this is the priority.

But yes, for variable-length vectors we will need either to set aside a larger number of bytes or to come up with higher-level vector types.

@klauspost
Contributor

What you are proposing is already covered by build tags. With that said, build tags are fundamentally the wrong approach for intrinsics.

You seem to imply that intrinsics should be decided at compile time. I disagree. There are maybe 30+ CPU features on x86-64; add combinations and you already have a near-infinite build matrix.

For intrinsics to make sense, they can already be platform-guarded, and cpuid detection should let the compiled binary dynamically either use intrinsics or not. We already have the GOAMD64 env var that would allow bypassing certain cpuid checks.

Your proposal for a generic "simd" package is too simplistic. While abstracting certain operations would be possible and fine, I stand by my opinion that this can be done either internally or externally if Go provides the intrinsics to begin with. TBH I don't quite understand how this package would work with the build tags. Would it fail with "simd.SplatU8x16(1) is not supported by the current simd platform"?

type Int8x16 [16]int8
type Uint8x16 [16]uint8

Distinct types are an abstraction over "a register with x bits". Take "xor": it doesn't operate on any specific type, but only on a fixed register size. Do you want an xor for all the types you define?

I see all of that as a secondary layer to the main complexity of adding vector registers to the compiler. That seems like the primary problem to solve. To me the approach would be to:

A) Find a reasonable way to represent vector registers in Go code, which can be used across platforms.

It could be a type vec128 [16]byte - or it could be an "abstract" type, like int, that doesn't attempt to expose the underlying data. I am not the compiler expert to decide on that.

It seems reasonable to provide LoadUint32x4(*[4]uint32) vec128 and StoreUint32x4(*[4]uint32, vec128) as well as splatters, etc. With the slice to array pointer conversion this would be reasonably easy to use. I feel like these should be platform guarded, since these themselves often will have cpuid limitations - for example avx512.

B) Provide intrinsics per platform.

Intrinsics are "close" to the emitted instructions. There is no scalar fallback, and cpuid checks are up to the user. I still think dividing intrinsics by (primary) cpuid feature makes sense. Importing a package doesn't mean you use it, merely that you may use it if the CPU supports it.

C) Find a reasonable way for imported packages to "inject" intrinsics into the compiler.

That means importing simd/amd64/gfni (or whatever it gets called) would inject the intrinsics defined in the package into the compiler. That seems to me like a reasonable way to "separate concerns" and not have the compiler explode speedwise with 1000s of new intrinsics to look out for.

D) Constants

A lot of instructions take immediates which must be constant at compile time. For example, in VGF2P8AFFINEQB(imm8 uint8, src2, src1 vec128) vec128 the imm8 must be resolvable at compile time. Go doesn't have a way to specify that in a function signature.

While it could just result in a compile-time error it would be nice if the function definition by itself could give that information. Not a show-stopper, though - and could be decided later.

There are a lot of "minor" decisions that would need to be made. Should instruction with shared source/destination registers abstract this? How should functions be named? Close to the instruction or slightly abstracted (like C, rust) or fully renamed? While these things would need a resolution at some point, they don't really seem like the primary things to focus on too early.

If the above works out, third parties can implement what they like. The Go stdlib can then look at whether a generic vector package would make sense if that is deemed important.

TLDR; I think the approach of compile-time CPU feature detection is a dead end. I think intrinsics should be low-level to allow for the broadest set of features, and higher-level abstractions should be left to competing third parties. Go will provide the compiler that can make this great.

PS. I am genuinely very sorry I never found the time to reply to your emails.

@Clement-Jean
Author

Clement-Jean commented May 22, 2024

So, I believe there are misunderstandings here:

If intrinsics should make sense, they can already be platform guarded and cpuid detection should make the compiled binary dynamically either use intrinsics or not. We already have GOAMD64 env var that would allow to bypass certain cpuid checks.

The build tag is optional; when it is not provided, the compiler, through cpuid, would detect all the intrinsics available for a given platform. For example, if you had a machine with SSE and AVX available, the compiler would detect all these features. The build tags let the user restrict to a set of features if needed.

Would it fail with simd.SplatU8x16(1) is not supported by current simd platform?

For that, we could drop to SWAR-like operations or scalar code when a feature is not detected. I believe the currently available intrinsics already do this.

Take "xor". It doesn't operate on any specific type, but only on a fixed register size. Do you want an xor for all types you define?

No; for xor, under the hood there is only one intrinsic covering multiple types.

A) Find a reasonable way to represent vector registers in Go code, which can be used across platforms.

D) Constants

A lot of instructions take immediates which must be constant at compile time. For example VGF2P8AFFINEQB(imm8 uint8, src2, src1, vec128) vec128 the imm8 must be resolvable at compile time. Go doesn't have a way to specify that in the type definition.

While it could just result in a compile-time error it would be nice if the function definition by itself could give that information. Not a show-stopper, though - and could be decided later.

These two points are covered in my proposal. I talk about having to handle constants and about making sure that the compiler promotes these types to vector registers.

Rest

I still think dividing intrinsics into (primary) cpuid makes sense. Importing a package doesn't mean you use it, merely that you may use it if the cpu supports it.

As for importing through different import paths, it just seems to be another way of doing what I proposed with the optional build tag, but in reverse: if you don't provide a tag, you get all possible intrinsics; otherwise, you get the intrinsics only for a set of ISAs.

not have the compiler explode speedwise with 1000s of new intrinsics to look out for.

Once again, a lot of intrinsics are shared across multiple types. This means we will have a lot of "functions" lowering to the same underlying intrinsic.

There are a lot of "minor" decisions that would need to be made. Should instruction with shared source/destination registers abstract this? How should functions be named? Close to the instruction or slightly abstracted (like C, rust) or fully renamed? While these things would need a resolution at some point, they don't really seem like the primary things to focus on too early.

Totally agree with this. The minor points I made are purely based on my experience with the POC. They are not the main focus of the proposal.

@Jorropo
Member

Jorropo commented May 22, 2024

Under the hood, the current POC works with pointers to arrays. The main reason is that fixed-size arrays are not currently SSAable. But because these pointers to arrays are stored in general-purpose registers, the performance is not great (allocations required and loads of non-contiguous memory) and it requires us to do the LD/ST dance for each function in the simd package.

I think the better solution is to steal the types from #64634:

type ISimd128 ISimd128

Each SIMD vector becomes its own native type, which is unbreakable.
As unbreakable values, they are trivial to lift up to SSA.

This also discourages programmers from treating them as arrays (which is often slow).
Your type suggests this is ok:

// random example
for i, v := range someSIMDVector {
	if v%2 == 0 && i%2 == 1 { // do some logic
		continue
	}
	someSIMDVector[i] = 42 // creates a cross-iteration dependency if lifted to registers
}

however, performance for code like this is going to be far worse than it looks like it should be on amd64 (and maybe other architectures), due to the high cost of moving data between the SIMD and GP register complexes.
I guess in your case this is fine, since they are pointers and thus always take the slow path, but we should want the compiler to lift things into registers because of the vast speed increases this gives.

@Clement-Jean
Author

Totally agree with you on that. This is why I said:

These types should not be indexable and should only be instantiable through init functions like Splat8x16, Load8x16, etc. The compiler would then promote these special types to vector registers.

And having a native type is definitely the way to go; I just didn't know how to properly formulate it.

@ytgui

ytgui commented Jun 15, 2024

SIMD instructions are crucial for computation-intensive workloads like machine learning and deep learning. This proposal really provides a solid initial plan.

If an operation is not available on an ISA, we simply don't provide it.

However, Go works at a higher level than C, so there are more considerations.
Instead of adhering strictly to the ISA, the simd package should be implemented across different platforms, leveraging existing hardware intrinsics where available.

Essential SIMD operations like vector addition, reduction, and broadcasting should be implemented across different platforms. Take vector reduction as an example: the _mm512_reduce_add_ps intrinsic can be used on AVX-512-compatible architectures, while a sequence of intrinsics is needed on AVX2 platforms. If we strictly follow the ISA, vector reduction is not usable on AVX2 at all, pushing non-trivial handling onto users.

@Clement-Jean
Author

While I agree that Go is higher level, an abstraction over SIMD intrinsics is a very complicated thing to get right. On top of that, we might end up with non-negligible performance differences across platforms/ISAs.

I believe we should give the developer the choice and make the trade-offs explicit.

@jan-wassenberg

If a SIMD abstraction only provides the strict intersection of ops, then what exactly should developers trade off? When they want more ops, aren't they forced to implement the op in assembly, plus some kind of fallback for other platforms? It seems much more useful if the abstraction takes care of that.

It would have to be judicious about what exactly to offer, and avoid ops that really aren't efficient on other platforms. I think we did a decent job of that in Highway :)

@Clement-Jean
Author

I definitely would like a judicious abstraction; I'm just saying that this is hard. To make it a little easier, we should definitely learn from Highway about which ops to provide.

Note that I was trying to be conservative in this proposal and define a minimal scope that we can evolve from. I believe that if we have the basic ops for each platform, we can later create higher abstractions as part of the simd package.

@klauspost
Contributor

@jan-wassenberg I don't think the comparison applies. The Highway API is good for what it offers. However, I don't see that as a language feature, but as a third-party package API.

IMO Go should allow for implementing abstractions, and people can choose their preferred abstraction or write directly in intrinsics. Go should provide a compiler that is capable of sensibly allocating registers and handling code generation.

Users of intrinsics would gain the benefit of not having the function-call overhead of assembly, the benefits of memory safety (maybe with "unsafe" exceptions for things like scatter/gather), plus the simplicity of register management compared to assembly.

Abstractions should be implemented by third parties - they will likely be more powerful and different approaches can compete for the best APIs.

@earthboundkid
Contributor

See #68188.

@thepudds
Contributor

I happened to see a citation of a favorable comment that @lemire made about this proposal roughly 6 months ago, and I thought it worthwhile to take the liberty of pasting the main portion of the comment here.

There is a proposal to add native SIMD support in Go. The proposal would be quite useful for programmers skilled with SIMD programming. I find it quite nice.

and

Clément Jean proposes adding native SIMD support to Go.

[...]

This is somewhat similar to the approach taken by Microsoft in C#.

Go read his proposal: #67520 (this proposal)

@lemire also included a screenshot of what I think was this example from the proposal of SIMD UTF-8 validation, a snippet of which is:

	currBlock := *(*simd.Uint8x16)([]byte(in[processedLen:]))

	if simd.MoveByteMaskU8x16(currBlock) < 0x80 {
		if simd.MoveByteMaskU8x16(prevIncomplete) != 0 {
			return false
		}
		prevIncomplete = simd.Uint8x16{}
	} else {
		prev1 := simd.ExtractU8x16(prevInputBlock, currBlock, 15)
		s := simd.AndU16x8(simd.ShiftRightU16x8(simd.AsUint16x8(prev1), 4), ffBy4)
		byte1High := simd.LookupU8x16(shuf1, simd.AsUint8x16(s))
		byte1Low := simd.LookupU8x16(shuf2, simd.AndU8x16(prev1, v0f))
		s = simd.AndU16x8(simd.ShiftRightU16x8(simd.AsUint16x8(currBlock), 4), ffBy4)
		[...]

@Clement-Jean
Author

It's worth mentioning that this is the example I wrote in the linked repo, and that Professor Lemire and I collaborated to make the example (which comes from one of his papers) work.

@janpfeifer

A couple of extra suggestions:

  • Dynamic (runtime) dispatching should be possible as a first-class citizen: I (like most?) would compile projects with support for every instruction set, and at runtime call the appropriate specialization. Third-party higher-level abstractions can be used to automatically generate code for every version.
    • Build tags: not necessary. Code could always be generated for the requested instruction set (even if the build machine doesn't support it). It's the user's "runtime" responsibility not to call code with an unsupported instruction set.
    • E.g., third-party API abstractions like Highway: Highway uses complex C++ macros+templates; in Go this could be achieved with (much saner?) //go:generate, generics, and some code translation/specialization.
  • +1 for only having relatively low-level access to specialized instruction sets. Higher-level abstractions (again, like Highway) can be created by third parties. At least to start with.
  • Maybe it's already planned (I didn't see it spelled out): define a convention for providing access to any specialized instruction/register sets (e.g., AES-NI, SHA), current and future, for various architectures. Even if simd is the larger first use case.

PS.: My use case: with some colleagues we've been discussing how to create an alternative pure-Go engine for GoMLX (currently it uses C/C++ XLA/PJRT, which in turn I think uses Highway for the CPU version) and for ONNX model inference.

There is a current pure-Go implementation (using Gorgonia, which in turn uses gonum/blas, currently deprecated) which seems good, but some initial benchmarks show roughly an order of magnitude slowdown for some sentence-encoding (BERT-like) models versus ONNXRuntime (CPU) or GoMLX (== XLA/PJRT CPU). I understand Go will always be slower, but we were hoping for a smaller difference. I'm not sure yet how much is due to simd support versus other optimizations, though.

@jan-wassenberg

I agree runtime dispatch is super useful.

Can you clarify what you mean by "low level access to specialized instruction sets"?

All crypto extensions of which I am aware, including AES-NI, use the same vector regs as other SIMD instructions.

GoMLX (currently it uses C/C++ XLA/PJRT, which in turn I think uses Highway for the CPU version

AFAIK this is not using Highway, or I'd be curious to hear how/where?

@janpfeifer

Can you clarify what you mean by "low level access to specialized instruction sets"?

I mean: don't try to create abstractions that use one or the other instruction set depending on the actual hardware. Let this decision be made at a coarser level, like at a dispatch function, by the user (or a third-party library).

All crypto extensions of which I am aware, including AES-NI, use the same vector regs as other SIMD instructions.

Oops, sorry, I never looked at them; I only knew they existed. I was thinking that each instruction set would go in a different package. But if there are no other instruction sets, then there is nothing to think about.

GoMLX (currently it uses C/C++ XLA/PJRT, which in turn I think uses Highway for the CPU version

AFAIK this is not using Highway, or I'd be curious to hear how/where?

I thought it used (or could use) XNNPack, which I thought used Highway... ugh, I just searched and XNNPack doesn't seem to use Highway, my bad.

@jan-wassenberg

hm, is your proposal to only define "AbsDiff" on Arm, which supports it natively, and not on x86, and then have callers dispatch to code that contains either calls to Arm::AbsDiff or to Abs(a - b)? It seems undesirable and expensive to duplicate application code unless there's a really huge performance cliff between platforms. An example would be the recently discussed vp2intersectq, which is awesome on certain AVX-512 implementations and otherwise requires a completely different algorithm.

However, in my experience, the vast majority of code does not use such specialized instructions, and it is feasible and desirable to support the same ops on all CPUs. AbsDiff can easily and cheaply be implemented via Abs(a - b). I think filling in such gaps is a big convenience for users, who can then unconditionally call AbsDiff.
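To make the "cheap fallback" point concrete, here is a scalar Go sketch of what such a portable AbsDiff could reduce to on ISAs without a native instruction (absDiffUint8 is a hypothetical name, not proposed API):

```go
package main

import "fmt"

// absDiffUint8 emulates a lane-wise absolute-difference op
// (NEON's VABD) over byte slices; a hypothetical scalar
// fallback a simd package could use on ISAs lacking it.
func absDiffUint8(a, b, dst []uint8) {
	for i := range dst {
		if a[i] > b[i] {
			dst[i] = a[i] - b[i]
		} else {
			dst[i] = b[i] - a[i]
		}
	}
}

func main() {
	a := []uint8{10, 3, 200}
	b := []uint8{4, 9, 250}
	dst := make([]uint8, 3)
	absDiffUint8(a, b, dst)
	fmt.Println(dst) // prints [6 6 50]
}
```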

I thought it used (or could use) XNNPack which I thought used Highway

Ah, OK :) There have been some exploratory discussions but indeed no usage at the moment.

@lemire

lemire commented Jan 24, 2025

@jan-wassenberg

I do think that the different ISAs call for different algorithmic design relatively often. At least, it does in my work.

The x64 SIMD ISAs have fast instructions to check whether a register is zero. As you well know, one can write entire blog posts on how to do this with ARM NEON. Another example that comes to mind is the fact that some ISAs do not have a fast movemask-to-regular-registers instruction (ARM NEON, Loongson, etc.). Meanwhile, ARM NEON has interleaved loads and stores, which can greatly simplify some algorithms... but there is no counterpart on x64.
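For readers unfamiliar with the movemask gap: x86's PMOVMSKB collapses the top bit of every byte lane into a scalar mask in one instruction, while NEON needs a short instruction sequence. A scalar Go sketch of the semantics being discussed (movemask8 is a hypothetical name):

```go
package main

import "fmt"

// movemask8 gathers the top bit of each byte lane into an
// integer bitmask, mirroring x86's PMOVMSKB; on ISAs without
// such an instruction, this per-lane work is roughly what
// must be emulated.
func movemask8(lanes []uint8) uint32 {
	var mask uint32
	for i, v := range lanes {
		mask |= uint32(v>>7) << i
	}
	return mask
}

func main() {
	lanes := []uint8{0xFF, 0x00, 0x80, 0x7F}
	fmt.Printf("%04b\n", movemask8(lanes)) // prints 0101
}
```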

So I personally favour an approach where, when needed, I can design an ARM-specific algorithm, a RISC-V specific algorithm and an AVX-512 specific algorithm.

It does not contradict your point that, often, one is happy to write 'generic SIMD' code. But one can take this too far. That's what the Java Vector API tried to do. I have recently raised this issue with them, pointing out that many useful algorithms cannot be implemented efficiently. And if something cannot be implemented efficiently, you are better off not implementing it. Nobody needs SIMD that is slow. That's not a theoretical claim: if you search, you will find plenty of folks who spent time doing a SIMD design with the Java Vector API and ended up with worse performance than the naive implementation. Oh, sure, it runs everywhere and it looks nice, but it is slow. Sometimes people get luckier: it is fast on some processors, but then terribly slow on others. And so the developers have to write hacks where they somehow detect the CPU and only enable their code when the right CPU is detected.

Let me make a simpler point. Some engineers may decide: I will craft a SIMD algorithm for ARM NEON (or AVX-512 or whatever); otherwise, I am happy to use fallback code. That's perfectly reasonable from an engineering point of view. In this case, you don't want to be artificially handicapped and, taking the case where you target ARM NEON, be disallowed from using interleaved stores and loads. Granted, one can do this in Go today with assembly... but that's somewhat painful...

@jan-wassenberg

Hey Daniel, good to see you here.

I agree there can be different implementations of a "reg is zero" op. It still seems reasonable to provide such an op although it may be implemented differently on various ISAs, right?

Movemask is indeed a pain point. Sometimes those semantics are exactly what is required, and one accepts that non-x86 ISAs simply take longer. Otherwise, providing an op that expresses the intent (AllTrue/AllFalse/CountTrue) enables more efficient implementations, including the UMAXP/V you mention.

Interleaved load/stores seem an easier case: we can reasonably emulate them, certainly much faster than scalar code.
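As a concrete picture of what would be emulated: NEON's LD3 deinterleaves, e.g., packed RGB bytes into three registers in one instruction. A scalar Go sketch of the same semantics (a hypothetical helper, not proposed API; a real emulation would use shuffles and be much faster than this loop):

```go
package main

import "fmt"

// deinterleave3 splits an RGBRGB... byte stream into three
// planes, the operation NEON's LD3 performs in hardware.
func deinterleave3(src []uint8) (r, g, b []uint8) {
	n := len(src) / 3
	r, g, b = make([]uint8, n), make([]uint8, n), make([]uint8, n)
	for i := 0; i < n; i++ {
		r[i], g[i], b[i] = src[3*i], src[3*i+1], src[3*i+2]
	}
	return
}

func main() {
	r, g, b := deinterleave3([]uint8{1, 2, 3, 4, 5, 6})
	fmt.Println(r, g, b) // prints [1 4] [2 5] [3 6]
}
```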

I agree it should be possible to specialize algorithm variant(s) for a target ISA, and there should be escape hatches, perhaps to ISA-specific intrinsics. BTW I'm not sure you are aware that we do in fact allow/use such specializations with Highway?

I'm not familiar with Java, is the issue that they only provide the lowest common denominator?

@neon-sunset

neon-sunset commented Jan 24, 2025

Apologies for butting in. I think Dr. Lemire is alluding to the challenges of Panama Vectors in Java that cause unnecessary churn in a way that .NET's SIMD abstractions completely avoid.

In .NET, SIMD APIs are structured in such a way to allow writing fully portable implementations with competitive codegen. At the same time, there exists a lower level platform intrinsics API fully interoperable with the same base vector types - this provides an efficient escape hatch to specialize specific parts of the algorithm whenever necessary. For example, Vector128<T> is accepted by both Vector128.GreaterThan and AdvSimd.VectorTableLookup. There is also a platform-dependent width primitive Vector<T> which used to be exclusively higher level but now solves the same task for SVE2 intrinsics.

Reference:

@janpfeifer

... Granted, one can do this in Go today with assembly... but that's somewhat painful...

Question: can one already mix inlined SIMD assembly code into a normal Go function? How would one track which register holds which variable at a given point? If not, my understanding (without actually having measured it) is that making function calls in the middle of a hot loop would sacrifice a lot of performance. Is that a correct assumption?

@lemire

lemire commented Jan 24, 2025

@neon-sunset is correct. That is what I was alluding to.

@janpfeifer

can one mix SIMD assembly code inlined in a normal Go function already ?

Not to my knowledge.

@janpfeifer

Talking to @jan-wassenberg, one topic he brought up (please correct me if I'm wrong) is the need for the compiler to know that it is using a certain instruction set, as it may affect the compilation. It needs not only pseudo-functions representing SIMD instructions, but also some type of pragma.

I assume now that is what @Clement-Jean meant with the build tags (I first thought they would be like the usual Go build tag constraints). Is that correct?

And if yes, do these build tags scope the whole file? Should they be per-function, allowing one file to hold specialized functions for different instruction sets?

@thepudds
Contributor

And if yes, do these build tags scope the whole file? Should they be per-function, allowing one file to hold specialized functions for different instruction sets?

Hi @janpfeifer, today, build tags (or "build constraints") apply to a whole file. One can move a single function out to its own file if needed to then have multiple versions of that file with different build tags. Some documentation here:
https://pkg.go.dev/cmd/go#hdr-Build_constraints
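For context, the per-file mechanism already composes with microarchitecture levels: as I understand it, since Go 1.18 setting GOAMD64 satisfies build tags such as amd64.v3. A hypothetical fastsum package (all names illustrative) could be split like this:

```go
// --- sum_amd64v3.go ---
// Compiled only when GOAMD64=v3 (AVX2-era hardware) or higher;
// this body would be free to assume AVX2 once intrinsics exist.
//go:build amd64.v3

package fastsum

func sum(xs []float64) float64 {
	var s float64
	for _, x := range xs {
		s += x // placeholder for an AVX2-specialized loop
	}
	return s
}

// --- sum_generic.go ---
// Portable fallback for every other target.
//go:build !amd64.v3

package fastsum

func sum(xs []float64) float64 {
	var s float64
	for _, x := range xs {
		s += x
	}
	return s
}
```

The selection here is purely compile-time: exactly one of the two files is built into the package for a given GOAMD64 setting.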

@klauspost
Contributor

@jan-wassenberg There are tons of examples. Take VPTERNLOG: an excellent instruction that eliminates latency by combining other operations. But what would be the impact of making it available as a generic function?

It would be "as expected" on native machines, but absolutely horrible on anything else. It would be so bad that you would need to write two versions anyway, since the fallback would be much, much worse than the alternatives. PSHUFB, GF2, and CRC32 are pretty much the same. Register masks have already been mentioned.

I am sure that brilliant people like yourself and Daniel can write fine replacements for various platforms, but looking at it realistically I would much rather that these abstractions are handled outside the standard library with the tools provided by the standard library.

A second and perhaps more pragmatic reason is that adding intrinsics by itself is a major task. I would rather the effort be spent on making as much as possible available, and on solid compiler support, rather than on a limited "lowest common denominator with fallback" API. The latter can nicely be picked up by people who are enthusiastic about a specific set of functionality, providing easy cross-platform support - similar to what the highwayhash API provides.

@thepudds
Contributor

@janpfeifer, to more directly comment on the first part of your question -- as I understand it, @Clement-Jean did initially propose a new syntax for SIMD-specific build tags like //go:simd sse2, but there was a later suggestion in #67520 (comment) to use the standard build tag syntax like //go:build sse2, which to me seems a more likely outcome; in any case, the exact syntax could be hashed out later, I think.

Regardless of the syntax, though, it seems unlikely to me that they would end up as per-function as part of a SIMD proposal. The per-file approach of the current build tags is widely used and works reasonably well today. (Or at least, allowing per-function build tags seems orthogonal to adding a simd package).

@klauspost
Contributor

klauspost commented Jan 24, 2025

@thepudds I think adding build tags beyond the existing GOARCH and platform versions (GOAMD64, etc.) is mostly pointless. I don't see individual CPU features as reasonable build tags. Beyond the major features that GOAMD64 covers, you will need specific feature checking at some point in your code. The infinite matrix of AVX-512 feature groups pretty much makes that mandatory.

Edit: To clarify - build tags are compile time, and that is what I don't see happening beyond big groups, since nobody really wants 25 versions of their amd64 program alone.

@cedric-appdirect

I think there are two paths here. Either abstract the parallelism of SIMD instructions and let the compiler select instructions as it sees fit, as Mojo, Zig, ISPC, or GPU shaders do; or provide no abstraction, with developers expected to write pretty close to the instruction set.

The first option would eventually get close to proposal #58610, maybe with different keywords, but basically abstractions are required to get the compiler to understand the parallelism of the code. This option is unlikely to reach the maximum potential performance in all cases (hand-written assembly is still faster than relying on the compiler to generate perfect code), but it will be close, and the code will likely be easier to review, understand, and maintain, which will likely make it more acceptable in the standard library. This option puts more development complexity on the compiler team, but an MVP could be developed with TinyGo fairly quickly.

The second option, which is just enabling the use of SIMD-specific instructions inside Go code, will enable higher performance at the cost of being less readable. It will lower the complexity and performance cost of writing assembly in Go today, as you would be able to call any instruction without the overhead of jumping to it first. This second option will make it easier to copy algorithms and optimizations from other projects. I would still expect friction in accepting those into the standard library.

@janpfeifer

@thepudds : so my question was not about using the tags as constraint build tags -- I think that is what you are referring to. Instead as pragmas to help the compiler decide how to compile things. In other words I'm asking if a tag //go:simd avx512 is not about constraining whether the file is compiled, but rather how the file should be compiled.

@janpfeifer

The second option, which is just enabling the use of SIMD-specific instructions inside Go code, will enable higher performance at the cost of being less readable. It will lower the complexity and performance cost of writing assembly in Go today, as you would be able to call any instruction without the overhead of jumping to it first. This second option will make it easier to copy algorithms and optimizations from other projects. I would still expect friction in accepting those into the standard library.

@cedric-appdirect But doesn't this second option allow 3rd parties to build libraries/meta-libraries that provide the easy readability of option 1, without too many compiler changes? Also, wouldn't option 2 be a great first step, and if we later find out that a good 3rd-party library abstraction cannot be done without further compiler changes, only then do we design such an abstraction in the stdlib?

@cedric-appdirect

@cedric-appdirect But doesn't this second option allow 3rd parties to build libraries/meta-libraries that provide the easy readability of option 1, without too many compiler changes? Also, wouldn't option 2 be a great first step, and if we later find out that a good 3rd-party library abstraction cannot be done without further compiler changes, only then do we design such an abstraction in the stdlib?

The readability I am talking about is that of the SIMD/parallel algorithm, not for the user of the API that would implement it. The reason I highlight this is https://go.dev/wiki/AssemblyPolicy, which limits what kind of code goes in the stdlib. The problem is not the abstraction for the user of a library that is optimized with handwritten SIMD, but the maintenance of the optimized code, which has no abstraction in that case.

I don't think either option 1 or 2 would impact the API of any library, so both should lead to the same abstraction from the perspective of a user of such an API. I do think the difference between options 1 and 2 is in who can write the code.

  • Option 1: potentially a lot more people, as you just write in a higher-level abstraction, but you need to learn that abstraction (there is more to learn for most people than, say, understanding goroutines), and the track record of Mojo, ISPC, and shaders shows that this path is neither the majority nor easy.
  • Option 2: you can rely on people who are already writing assembly for other languages to port algorithms to the Go ecosystem. No learning required, but also no growth to expect in who writes these algorithms, as SIMD writing has been a niche for the last 20 years.

Overall, I think these are two different routes, each requiring significant work from the compiler team, and they are really difficult to mix. This is my assumption based on my interpretation of the AssemblyPolicy; it would be good to get an opinion from people maintaining the stdlib on this subject.

@cedric-appdirect

cedric-appdirect commented Jan 24, 2025

I think a good reading of what option 1 could be within the scope of this proposal (please correct me if you think this is an incorrect interpretation of the proposal, @Clement-Jean) is to look at Zig: https://ziglang.org/documentation/0.13.0/#Vectors . I would love to hear the opinions of @Clement-Jean and @lemire on Zig's vector approach and whether that might not be the easiest to adopt.

If the Go community has a preference for option 1 over option 2, of course.

@janpfeifer

janpfeifer commented Jan 24, 2025

The readability I am talking about is that of the SIMD/parallel algorithm, not for the user of the API that would implement it. The reason I highlight this is https://go.dev/wiki/AssemblyPolicy, which limits what kind of code goes in the stdlib. The problem is not the abstraction for the user of a library that is optimized with handwritten SIMD, but the maintenance of the optimized code, which has no abstraction in that case.

Sorry @cedric-appdirect, I'm not 100% sure we are talking about the same thing 🤔. Just to make sure: I'm talking about libraries that facilitate writing portable SIMD code, like Highway for C++ or Zig-like vectors. So most folks would write SIMD code and algorithms with those libraries, and rarely directly with the raw SIMD instructions offered by option 2. Such SIMD abstraction libraries could potentially (but not necessarily) work like Avo (item 3 of the AssemblyPolicy you linked).

If we are talking about the same thing, I'm not following what you mean by "maintenance of the optimized code". Aren't the changes in option 2 only to the compiler? Wouldn't the SIMD instructions all be translated to inlined intrinsic instructions?

@cedric-appdirect

Sorry @cedric-appdirect, I'm not 100% sure we are talking about the same thing 🤔. Just to make sure: I'm talking about libraries that facilitate writing portable SIMD code, like Highway for C++ or Zig-like vectors. So most folks would write SIMD code and algorithms with those libraries, and rarely directly with the raw SIMD instructions offered by option 2. Such SIMD abstraction libraries could potentially (but not necessarily) work like Avo (item 3 of the AssemblyPolicy you linked).

I expect that option 2 means the expressed SIMD features map 1:1 to SIMD instructions, with the compiler just inlining them and doing register allocation. Now, where I think our expectations differ is that I believe Go is not C++, and that it is not possible to build an abstraction like Highway for it. That's also why I think Zig has a vector type as part of the language.

Avo is an interesting idea, and I can imagine people starting to basically do what Avo or Templ do, but for SIMD with option 2 of this proposal, generating Go code with intrinsics from another grammar. I put those in a different category than libraries, as they basically have their own grammar and are building a new language. This has a bunch of constraints, as bugs can be introduced by the transformation to Go, and you would have to review both the source and its Go transformation. Tooling will likely not be aware of that transformation, and maintaining code that uses it will require more effort than if it were part of the language. That is what I mean by "maintenance". Basically, I expect an increase in complexity with option 2 when creating abstractions.

If we are talking about the same thing, I'm not following what you mean by "maintenance of the optimized code". Aren't the changes in option 2 only to the compiler? Wouldn't the SIMD instructions all be translated to inlined intrinsic instructions?

In both cases the changes are only to the compiler: one case adds a new type and operations on it, while the other exposes the intrinsic instructions directly.

@janpfeifer

janpfeifer commented Jan 24, 2025

I expect that option 2 means the expressed SIMD features map 1:1 to SIMD instructions, with the compiler just inlining them and doing register allocation.

Thanks for explaining; so we are on the same page.

Now, where I think our expectations differ is that I believe Go is not C++, and that it is not possible to build an abstraction like Highway for it.

Kind of, per the Avo-like suggestion. Go doesn't have macros (thank god), but it has generators, which are commonly used and IMO much better.

Avo is an interesting idea, and I can imagine people starting to basically do what Avo or Templ do, but for SIMD with option 2 of this proposal, generating Go code with intrinsics

Yes! That would be a nice way of doing it.

from another grammar.
I put those in a different category than libraries, as they basically have their own grammar and are building a new language.

(edit) Let me suggest that, assuming the original code -- the code that gets converted to SIMD-specialized code -- is valid Go code, there is no new grammar so to speak. But there are new semantics to learn.

This has a bunch of constraints, as bugs can be introduced by the transformation to Go, and you would have to review both the source and its Go transformation.

Sorry, in practice I don't see how that would be the case: tools like enumer and stringer never generate bad Go code (at least not for me). Plus, in Go we usually submit the generated code. That is a huge advantage, since the maintainer (the user of the SIMD library) can make sure the submitted generated code compiles and passes the tests, without ever needing to touch the lower-level SIMD code.

(edit) While I don't think compile-time bugs are an issue, runtime errors, if they happened in generated code, would be much harder to debug.

Tooling will likely not be aware of that transformation, and maintaining code that uses it will require more effort than if it were part of the language.

Yes, a little, but it is not so clear to me why this matters:

  1. Most of the code is not written for SIMD, so even if the mapping between the original and generated code is not well supported by IDEs, it's not where most users will spend their time.
  2. In a scenario where the original code works (albeit not using SIMD), it can still be debugged, stepped through, syntax highlighted, etc.
  3. The generated code will be hard to read (using somewhat raw SIMD instructions), but it will work like any normal Go code, and in most cases one would never need to read it - the same way one usually doesn't need to read the macro+template expansions in Highway code, or the generated stringer or enumer code.

That is what I mean by "maintenance". Basically, I expect an increase in complexity with option 2 when creating abstractions.

Again, I'm not seeing it. But I'm no expert, and I'd love to hear from others. I'd argue, though, that a SIMD abstraction (Highway or Zig's vectors) is a complex task, and it's simpler to have this complexity separate from the compiler.

In both cases the changes are only to the compiler: one case adds a new type and operations on it, while the other exposes the intrinsic instructions directly.

My understanding is that option 1 would move more complexity into the compiler, while option 2 would move part of the complexity outside of the compiler. In my book that is a win, assuming the end-result trade-offs are equivalent.

In my mind, the following are the pros of having option 2 and separate SIMD library(ies):

  1. Library separate from the compiler:
    • Easier to evolve the library without requiring compiler modifications.
    • Allows multiple candidate SIMD libraries before "blessing" one as "standard" (if ever?)
    • Easier to bump the library version when new better abstractions come along to support newer hardware.
  2. I think dynamic dispatch would be easier, and more flexible with regard to where the dispatch is done.
    How would "dynamic dispatch" work in option 1? Would the compiler automatically generate code for every function that uses the SIMD library, for all instruction sets (one function per set), and insert an automatic dispatch at the function-call level?
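For contrast, the usual shape of dynamic dispatch in Go today is a package-level function variable rebound in init after a CPU feature probe. A minimal sketch, with hasAVX2 standing in for a real check such as cpu.X86.HasAVX2 from golang.org/x/sys/cpu (dotProduct and friends are hypothetical names):

```go
package main

import "fmt"

// hasAVX2 is a placeholder for a real feature probe such as
// cpu.X86.HasAVX2 from golang.org/x/sys/cpu; hardcoded here so
// the sketch stays dependency-free.
func hasAVX2() bool { return false }

// dotProduct is rebound once at startup; hot loops then pay a
// single indirect call instead of a per-iteration feature check.
var dotProduct = dotGeneric

func init() {
	if hasAVX2() {
		dotProduct = dotAVX2
	}
}

func dotGeneric(a, b []float32) float32 {
	var s float32
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// dotAVX2 stands in for a specialized version; it simply
// delegates, since no intrinsics exist in Go yet.
func dotAVX2(a, b []float32) float32 { return dotGeneric(a, b) }

func main() {
	fmt.Println(dotProduct([]float32{1, 2}, []float32{3, 4})) // prints 11
}
```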

@jan-wassenberg

lower level platform intrinsics API fully interoperable with the same base vector types - this provides an efficient escape hatch to specialize specific parts of the algorithm whenever necessary.

I agree this is useful.

build tags (or "build constraints") apply to a whole file

In C++, the usage of SIMD intrinsics requires that codegen has been enabled for that ISA/extension, either via -mavx2 or #pragma comment.
I'm not sure where in the LLVM and GCC compiler stacks this is enforced, so it might also affect Go?
Seems important to get input from compiler folks here.

PSHUFB

Works great on all SIMD ISAs known to me! No problem exposing this op.

There are tons of examples. Take VPTERNLOG.

I agree this is difficult to provide in that form. The approach we have taken is to offer certain pre-fused ops such as OrAnd that correspond to one value of ternlog's imm8.

adding intrinsics by itself is a major task. I would rather the effort was spent on making as much as possible available.

hm, if you want to add all intrinsics, that sounds like it could actually be more work? Intel's reference lists 6799. 4300 for Arm NEON, 6040 for SVE, 15052(!) for RISC-V.
Some curation of that set sounds advisable. And why add individual neon.fmul, sse4.fmul, ... when they all behave the same?

nobody really want 25 versions of their amd64 program alone.

Agreed. Highway targets are defined as groups of the features, and there are roughly 5 per platform.

While I don't think compile time bugs are an issue, runtime errors, if they happened on generated code, would be much harder to debug.

This is indeed concerning. Runtime crashes are not uncommon for me. What then? Presumably we have line numbers in the generated code, but how do we understand what to change?

it's not where most users will spend their time on.

But users will send a bug report consisting of a large amount of assembly to the authors of a Go library? Sounds like debugging will be difficult (and unpopular?).

Same way one usually doesn't need to read the macro+template expansions in highway code

The difference is that tooling (including sanitizers, the compiler, IDE, and debugger), mostly see through those things. Call stacks mention the C++ functions and original line numbers.

SIMD abstraction (Highway or Zig's vector) is a complex task, and it's simpler to have this complexity separate from the compiler.

I agree not everything has to sit inside the compiler, but surely some compiler changes are required to get to SIMD? I understand your goal is to move as much as possible outside, for example the emulation of missing instructions, which sounds reasonable.

@janpfeifer

To facilitate the discussion, I created a "straw man" sketch of a would-be "go-highway" library/generator, that would rely on an option 2 discussed above.

It felt too large (including the C++ version) to be added here, hence the separate document.

ps.: I shared it with comment access for everyone. If anyone wants direct edit access (to add alternatives, bullet points, etc.), email/ping me in chat; I'm happy to share.

@Clement-Jean
Author

Clement-Jean commented Jan 28, 2025

The second option, which is just enabling the use of SIMD-specific instructions inside Go code, will enable higher performance at the cost of being less readable. It will lower the complexity and performance cost of writing assembly in Go today, as you would be able to call any instruction without the overhead of jumping to it first. This second option will make it easier to copy algorithms and optimizations from other projects. I would still expect friction in accepting those into the standard library.

This is what this proposal is about. I want the API to feel like normal Go code: no new keywords, etc.

I expect that Option 2 means the expressed SIMD feature map 1:1 to the SIMD instructions with the compiler just inlining them and doing register allocation.

Exactly. In the first iteration I think it would be interesting to have lower level primitives. In my opinion it is important to have access to them.

To facilitate the discussion, I created a "straw man" sketch of a would-be "go-highway" library/generator, that would rely on an option 2 discussed above.

I like the idea of runtime dispatch you present in the init function; I feel like this could be useful. However, I also think it is important to let people choose between runtime and static dispatch. That's why I presented the system of build tags. Once the build tag system is built, it should be feasible to extract an API to also do the runtime dispatch.

Finally, as for the build tags, this seems to be the main point of disagreement (along with variable-length vector types). It would be worth checking whether code generation could help here. I'm not entirely sure how just yet, but this also seems like an interesting idea.
