proposal: add "simd" package to standard library #67520
Comments
ISTM this could be a regular build tag. This might also help with WASI, where the solution was to make wasip1 the GOOS, but as new features get added to WASM itself, we might run into the same issue of needing to differentiate classic WASM from WASM++ or whatever.
In the POC this is implemented with a simple build tag. However, I was presenting the idea for backward compatibility reasons. If anyone has already defined such build tags, we wouldn't want to interfere with their code. Furthermore, these new build tags might lead to different build logic or state than the original ones. I haven't looked into the details, so I cannot describe a different use case, though...
I guess that's a risk with any new build tag, but yes, it seems more likely to affect real code here. Maybe
This seems a little terse to me, and I'm still not comfortable with impacting other people's code. That's also why I didn't propose file names with a certain ending. But overall, I'm open to the idea.
Couldn't you have a T32 type that provides a function N that returns 32?
This seems limiting. For example, dot products are quite important, but platforms differ a bit in what they provide. Also, Scatter isn't universally supported, but important in some fields. Would you simply skip it?
This cannot be guaranteed with LLVM, which can and does transform intrinsics to something else. |
If I understand correctly, instead of passing
I believe you are seeing that from the Highway point of view. This proposal tries to provide direct intrinsics. Functions wrapping multiple intrinsics could be then created on top of them. These functions would include dot products, scatters, ...
Fortunately, we are not dealing with LLVM. Go has its own assembly. |
Yes, that's right. Though I'd say VectorN instead of Vector16. 128-bit vectors have their place for historical reasons, but users asking for Vector1024 is a recipe for poor codegen on most ISAs.
Note that some platforms have dot product instructions, e.g. vdpbf16ps, which are too good to pass up. Of course a dot product kernel would call them in a loop, but you still want these instructions reflected in the abstraction layer.
A sticking point for past proposals has been whether the package should be high level or low level. I think it probably makes sense to make a syscall-vs-os style split and have a high-level package that calls into a low-level package. In terms of implementation, I think starting with the low-level package is clearly better. So maybe start with golang.org/x/exp/simd being experimentally special-cased by the compiler; if it's successful, move it to runtime/simd, and then, if everyone is happy with runtime/simd, a separate proposal could add math/vector as a high-level library.
Thank you for the clarification, @jan-wassenberg. @earthboundkid, this is the idea behind this proposal. Let's do the low-level intrinsics that are not length agnostic and, building on that, we will be able to do the fancy length-agnostic ones.
It's not clear to me that VL-agnostic code can be built on top of known-length types, unless for each vector you set aside up to 256 bytes (SVE) or even 64 KiB (RISC-V V)?
I'm not even sure the compiler can generate code for SVE and RISC-V V, to be honest. So I don't think this is the priority. But yes, for variable-length vectors we will need either to set aside a larger amount of bytes or come up with higher-level vector types.
What you are proposing is already covered with build tags. That said, build tags are fundamentally the wrong approach for intrinsics. You seem to imply that intrinsics should be decided at compile time; I disagree. There are maybe 30+ CPU features in x86-64. Add combinations and you already have an infinite build matrix. For intrinsics to make sense, they can already be platform guarded, and cpuid detection should make the compiled binary dynamically either use intrinsics or not. We already have

Your proposal for a generic "simd" package is too simplistic. While abstracting certain operations would be possible and fine, I stand by my opinion that this can be done either internally or externally if Go provides the intrinsics to begin with. TBH I don't quite understand how this package would work with the build tags. Would it fail with
Distinct types are an abstraction over "register with x bits". Take "xor": it doesn't operate on any specific type, but only on a fixed register size. Do you want an

I see all of that as a secondary layer to the main complexity of adding vector registers to the compiler. That seems like the primary problem to solve. To me the approach would be to:

A) Find a reasonable way to represent vector registers in Go code, which can be used across platforms. It could be a It seems reasonable to provide

B) Provide intrinsics per platform. Intrinsics are "close" to the emitted instructions. There is no scalar fallback, and cpuid checks will be up to the user. I still think dividing intrinsics by (primary) cpuid makes sense. Importing a package doesn't mean you use it, merely that you may use it if the CPU supports it.

C) Find a reasonable way for imported packages to "inject" intrinsics into the compiler. That means importing

D) Constants. A lot of instructions take immediates which must be constant at compile time. While it could just result in a compile-time error, it would be nice if the function definition by itself could give that information. Not a show-stopper, though; it could be decided later.

There are a lot of "minor" decisions that would need to be made. Should instructions with shared source/destination registers abstract this? How should functions be named: close to the instruction, slightly abstracted (like C, Rust), or fully renamed? While these things would need a resolution at some point, they don't really seem like the primary things to focus on too early.

If the above works out, third parties can implement what they like. The Go stdlib can then look at whether a generic vector package would make sense, if that is deemed important.

TL;DR: I think the approach of compile-time CPU feature detection is a dead end.
I think intrinsics should be low level, to allow for the biggest set of features, and leave higher-level abstractions to competing third parties. Go will provide the compiler that can make this great. PS. I am genuinely very sorry I never found the time to reply to your emails.
So, I believe there are misunderstandings here:
The build tag is optional, and when not provided, the compiler, through cpuid, would detect all the intrinsics available for a given platform. For example, if you had a machine with SSE and AVX available, the compiler would detect all these features. The build tags let the user restrict to a set of features if needed.
For that, we could drop to SWAR-like operations or scalar code if a feature is not detected. I believe the currently available intrinsics already have this.
No, for xor under the hood, there is only one intrinsic covering multiple types.
These two points are covered in my proposal. I talk about having to handle constants and making sure that the compiler promotes the types to vector registers.
Rest
For importing through different import paths, it just seems to be another way of doing what I proposed with the optional build tag, but in reverse. If you don't provide a tag, you get all possible intrinsics; otherwise, you get the intrinsics only for a set of ISAs.
Once again, a lot of intrinsics are shared across multiple types. This means we will have a lot of "functions" calling into the same underlying intrinsic.
Totally agree with this. The minor points that I made are purely based on experience with the POC. They are not the main focus of the proposal.
I think the better solution is to steal the types from #64634: type ISimd128 ISimd128

Each SIMD vector becomes its own native type, which is unbreakable. This also discourages programmers from treating them as arrays (which is often slow):

```go
// random example
for i, v := range someSIMDVector {
	if v%2 == 0 && i%2 == 1 { // do some logic
		continue
	}
	someSIMDVector[i] = 42 // create inter-loop dependency if lifting to registers
}
```

However, performance for code like this is going to be much worse than it looks like it should be on amd64 (and maybe other architectures), due to the high cost of moving data between the SIMD and GP complexes.
Totally agree with you on that. This is why I said:
And having a native type is definitely the way to go. I just didn't know how to properly formulate that.
SIMD instructions are crucial for computation-intensive workloads like machine learning and deep learning. This proposal really provides a solid initial plan.
However, Go works at a higher level than C, so there are more considerations. Essential SIMD operations like vector addition, reduction, and broadcasting should be implemented across
While I agree that Go is higher level, abstracting SIMD intrinsics is a very complicated thing to get right. On top of that, we might find ourselves having non-negligible performance differences across platforms/ISAs. I believe that we should give developers the choice and let them explicitly make the trade-offs.
If a SIMD abstraction only provides the strict intersection of ops, then what exactly should developers trade off? When they want more ops, aren't they forced to implement the op in assembly, plus some kind of fallback for other platforms? It seems much more useful if the abstraction takes care of that. It would have to be judicious about what exactly to offer, and avoid ops that really aren't efficient on other platforms. I think we did a decent job of that in Highway :)
I definitely would like a judicious abstraction; I'm just saying that this is hard. To make it a little easier, we should definitely learn from Highway about which ops to provide. Note that I was trying to be conservative in this proposal and get a minimal scope that we can evolve from. I believe that if we have the basic ops for each platform, we can later create higher abstractions as part of the simd package.
@jan-wassenberg I don't think the comparison applies. The HH API is good for what it offers. However, I don't see that as a language feature, but as a third-party package API. IMO Go should allow for implementing abstractions, and people can choose their preferred abstraction or write directly in intrinsics. Go should provide a compiler that is capable of sensibly allocating registers and handling code generation. Users of intrinsics would gain the benefit of not having the function call overhead of assembly, the benefits of memory safety (maybe with "unsafe" exceptions for things like scatter/gather), plus the simplicity of register management compared to assembly. Abstractions should be implemented by third parties; they will likely be more powerful, and different approaches can compete for the best APIs.
See #68188.
I happened to see a cite to a favorable comment that @lemire made about this proposal roughly 6 months ago, and I thought it worthwhile to take the liberty of pasting the main portion of the comment here.
and
@lemire also included a screenshot of what I think was this example from the proposal of SIMD UTF-8 validation, a snippet of which is:

```go
currBlock := *(*simd.Uint8x16)([]byte(in[processedLen:]))
if simd.MoveByteMaskU8x16(currBlock) < 0x80 {
	if simd.MoveByteMaskU8x16(prevIncomplete) != 0 {
		return false
	}
	prevIncomplete = simd.Uint8x16{}
} else {
	prev1 := simd.ExtractU8x16(prevInputBlock, currBlock, 15)
	s := simd.AndU16x8(simd.ShiftRightU16x8(simd.AsUint16x8(prev1), 4), ffBy4)
	byte1High := simd.LookupU8x16(shuf1, simd.AsUint8x16(s))
	byte1Low := simd.LookupU8x16(shuf2, simd.AndU8x16(prev1, v0f))
	s = simd.AndU16x8(simd.ShiftRightU16x8(simd.AsUint16x8(currBlock), 4), ffBy4)
	[...]
```
It's worth mentioning that this is the example I wrote in the linked repo, and that Professor Lemire and I collaborated to make the example (coming from one of his papers) work.
A couple of extra suggestions:
PS: My use case: with some colleagues we've been discussing how to create an alternative pure-Go engine for GoMLX (currently it uses C/C++ XLA/PJRT, which in turn I think uses Highway for the CPU version) and for ONNX model inference. There is a current pure-Go implementation (using Gorgonia, which in turn uses gonum/blas, currently deprecated) which seems good, but some initial benchmarks show it is roughly an order of magnitude slower for some sentence-encoding (BERT-like) models versus ONNXRuntime (CPU) or GoMLX (== XLA/PJRT CPU). I understand Go will always be slower, but we were hoping for a smaller difference. I'm not sure yet how much is due to SIMD support versus other optimizations, though.
I agree runtime dispatch is super useful. Can you clarify what you mean by "low level access to specialized instruction sets"? All crypto extensions of which I am aware, including AES-NI, use the same vector registers as other SIMD instructions.
AFAIK this is not using Highway, or I'd be curious to hear how/where?
I mean, not trying to create abstractions that use one or the other instruction set depending on the actual hardware. Let this decision be made at a coarser level, like at a dispatch function, by the user (or third-party library).
Oops, sorry, I never looked at them; I only knew they existed. I was thinking that each instruction set would go in a different package. But if there are no other instruction sets, then there is nothing to think about.
I thought it used (or could use) XNNPack, which I thought used Highway... ugh, I just searched and XNNPack doesn't seem to use Highway, my bad.
hm, is your proposal to only define "AbsDiff" on Arm, which supports it natively, and not on x86, and then have callers dispatch to code that either contains calls to Arm::AbsDiff or to Abs(a - b)? It seems undesirable and expensive to duplicate application code unless there's a really huge performance cliff between platforms. An example could be the recently discussed vp2intersectq, which is awesome on certain AVX-512 implementations and otherwise requires a completely different algorithm. However, in my experience, the vast majority of code does not use such specialized instructions, and it's feasible and desirable to support the same ops on all CPUs. AbsDiff can easily and cheaply be implemented via
Ah, OK :) There have been some exploratory discussions, but indeed no usage at the moment.
I do think that the different ISAs call for different algorithmic designs relatively often; at least they do in my work. The x64 SIMD ISAs have fast instructions to check whether a register is zero. As you well know, one can write entire blog posts on how to do this with ARM NEON. Another example that comes to mind is that some ISAs do not have a fast movemask-to-general-purpose-register instruction (ARM NEON, Loongson, etc.). Meanwhile, ARM NEON has interleaved loads and stores, which can greatly simplify some algorithms... but there is no counterpart on x64.

So I personally favour an approach where, when needed, I can design an ARM-specific algorithm, a RISC-V-specific algorithm, and an AVX-512-specific algorithm. That does not contradict your point that, often, one is happy to write 'generic SIMD' code. But one can take this too far. That's what the Java Vector API tried to do. I have recently raised the issue with them, pointing out that many useful algorithms cannot be implemented efficiently. And if it cannot be implemented efficiently, you are better off not implementing it. Nobody needs SIMD that is slow. That's not a theoretical claim: if you search, you will find plenty of folks who spent time doing a SIMD design with the Java Vector API and ended up with worse performance than the naive implementation. Oh, sure, it runs everywhere and it looks nice, but it is slow. Sometimes people get luckier: it is fast on some processors, but then terribly slow on others. And so the developers then have to write hacks where they somehow detect the CPU and only enable their code when the right CPU is detected.

Let me make a simpler point. Some engineers may decide: I will craft a SIMD algorithm for ARM NEON (or AVX-512 or whatever); otherwise, I am happy to use fallback code. That's perfectly reasonable from an engineering point of view. In this case, you don't want to be artificially handicapped and, say when targeting ARM NEON, be disallowed from using interleaved stores and loads. Granted, one can do this in Go today with assembly... but that's somewhat painful...
Hey Daniel, good to see you here. I agree there can be different implementations of a "reg is zero" op. It still seems reasonable to provide such an op even though it may be implemented differently on various ISAs, right? Movemask is indeed a pain point. Sometimes those semantics are exactly what is required, and one accepts that non-x86 ISAs simply take longer. Otherwise, providing an op that expresses the intent (AllTrue/AllFalse/CountTrue) enables more efficient implementations, including the UMAXP/V you mention. Interleaved loads/stores seem an easier case: we can reasonably emulate them, certainly much faster than scalar code. I agree it should be possible to specialize algorithm variants for a target ISA, and there should be escape hatches, perhaps to ISA-specific intrinsics. BTW, I'm not sure you are aware that we do in fact allow/use such specializations in Highway? I'm not familiar with Java; is the issue that they only provide the lowest common denominator?
Apologies for butting in. I think Dr. Lemire is alluding to the challenges of Panama Vectors in Java, which cause unnecessary churn in a way that .NET's SIMD abstractions completely avoid. In .NET, the SIMD APIs are structured to allow writing fully portable implementations with competitive codegen. At the same time, there is a lower-level platform-intrinsics API fully interoperable with the same base vector types; this provides an efficient escape hatch to specialize specific parts of an algorithm whenever necessary. For example,
Reference:
Question: can one already mix inlined SIMD assembly code in a normal Go function? How does one track which register is assigned to which variable at a certain point? If not, my understanding (without actually having measured it) is that making function calls in the middle of a hot loop would sacrifice a lot of performance. Is that a correct assumption?
@neon-sunset is correct. That is what I was alluding to.
Not to my knowledge.
Talking to @jan-wassenberg, one topic he brought up (please correct me if I'm wrong) is the need for the compiler to know that it is using a certain instruction set, as it may affect the compilation. It needs not only pseudo-functions representing SIMD instructions, but also some type of

I assume now that is what @Clement-Jean meant with the build tags (I first thought they would be like the usual Go build tag constraints). Is that correct? And if yes, do these build tags scope the whole file? Should they be per-function, allowing one file to hold specialized functions for different instruction sets?
Hi @janpfeifer, today, build tags (or "build constraints") apply to a whole file. One can move a single function out to its own file if needed to then have multiple versions of that file with different build tags. Some documentation here:
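A minimal sketch of that file-splitting pattern with today's build constraints (the file names, package name, and function are illustrative, not from the proposal):

```go
// File sum_amd64.go — built only on amd64.
//go:build amd64

package mathx

func sum(b []byte) int {
	// SIMD-assisted version would go here.
	return sumGeneric(b)
}

// File sum_other.go — built everywhere else.
//go:build !amd64

package mathx

func sum(b []byte) int {
	return sumGeneric(b) // portable fallback
}
```

Each file provides the same function signature, and the build constraint on the first line of each file decides which one is compiled in.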
@jan-wassenberg There are tons of examples. Take

It would be "as expected" on native machines, but absolutely horrible on anything else. It would be so bad that you would need to write 2 versions anyway, since the fallback would be much, much worse than the alternatives. PSHUFB, GF2, CRC32 are pretty much the same. Register masks have already been mentioned. I am sure that brilliant people like yourself and Daniel can write fine replacements for various platforms, but looking at it realistically, I would much rather these abstractions were handled outside the standard library with the tools provided by the standard library.

A second and perhaps more pragmatic reason is that adding intrinsics by itself is a major task. I would rather the effort was spent on making as much as possible available, and see time spent on solid compiler support rather than a limited "lowest common denominator with fallback" API. This can nicely be picked up by people who are enthusiastic about a specific set of functionality, providing easy cross-platform support, similar to what the highwayhash API provides.
@janpfeifer, to more directly comment on the first part of your question: as I understand it, @Clement-Jean did initially propose a new syntax for SIMD-specific build tags like `//go:simd sse2`. Regardless of the syntax, though, it seems unlikely to me that they would end up per-function as part of a SIMD proposal. The per-file approach of the current build tags is widely used and works reasonably well today. (Or at least, allowing per-function build tags seems orthogonal to adding a
@thepudds I think adding build tags beyond the existing GOARCH and platform versions (GOAMD64, etc.) is mostly pointless. I don't see individual CPU features as reasonable build tags. Beyond the major features that GOAMD64 covers, you will need specific feature checking at some point in your code. The infinite matrix of AVX-512 feature groups pretty much makes that mandatory. Edit: to clarify, build tags are compile time, and that is what I don't see happening beyond big groups, since nobody really wants 25 versions of their amd64 program alone.
I think there are two paths here. Either abstract the parallelism of SIMD instructions and let the compiler select instructions as it sees fit, as Mojo, Zig, ISPC, or GPU shaders do. Or no abstraction, and developers are expected to write pretty close to the instruction set.

The first option would eventually get close to proposal #58610, maybe with different keywords, but basically abstractions are required for the compiler to understand the parallelism of the code. This option is unlikely to reach the maximum performance potential in all cases (hand-written assembly is still faster than relying on the compiler to generate perfect code), but it will be close, and the code will likely be easier to review, understand, and maintain, which will likely make it more acceptable in the standard library. This option puts higher complexity on the compiler team, but an MVP could be developed with TinyGo fairly quickly.

The second option, which is just enabling SIMD-specific instructions inside Go code, will enable higher performance at the cost of being less readable. It will lower the complexity and performance cost of writing assembly with Go today, as you would be able to call any instruction without the overhead of jumping to it first. This second option will make it easier to copy algorithms and optimizations from other projects. I would still expect friction in accepting those into the standard library.
@thepudds: so my question was not about using the tags as build constraints; I think that is what you are referring to. Instead, as
@cedric-appdirect But doesn't this second option allow third parties to build libraries/meta-libraries that provide the easy readability of option 1, without too much compiler change? Also, wouldn't option 2 be a great first step? If we later find out that a good third-party library abstraction cannot be done without further compiler changes, only then would we design such an abstraction in the stdlib.
The readability I am talking about is that of the SIMD/parallel algorithm, not of the user of the API that would implement it. The reason I highlight this is https://go.dev/wiki/AssemblyPolicy, which limits what kind of code goes in the stdlib. The problem is not the abstraction for the user of a library that is optimized with handwritten SIMD, but the maintenance of the optimized code, which has no abstraction in that case. I don't think either option 1 or 2 would impact the API of any library, so both should lead to the same abstraction from the perspective of a user of such an API. I do think the difference between options 1 and 2 is in who can write the code.
Overall, I think these are two different routes which both require significant work from the compiler team and are really difficult to mix. Also, this is my assumption based on my interpretation of the AssemblyPolicy; it would be good to get an opinion from the people maintaining the stdlib on this subject.
I think a good reading of what option 1 could be within the scope of this proposal (please correct me if you think this is an incorrect interpretation, @Clement-Jean) is to look at Zig: https://ziglang.org/documentation/0.13.0/#Vectors . I would love to hear the opinions of @Clement-Jean and @lemire on Zig's vector approach and see whether it might be the easiest to adopt, if the Go community prefers option 1 over option 2, of course.
Sorry @cedric-appdirect, I'm not 100% sure we are talking about the same thing 🤔. Just to make sure: I'm talking about libraries that facilitate writing portable SIMD code, like Highway for C++ or Zig-like vectors. Most folks would write SIMD code and algorithms with those libraries, and rarely directly with the raw SIMD instructions offered by option 2. Such SIMD abstraction libraries could potentially (but not necessarily) work like Avo (item 3 of the AssemblyPolicy you linked). If we are talking about the same thing, I'm not following what you mean by "maintenance of the optimized code". Aren't the changes of option 2 only to the compiler? The SIMD instructions would all be translated to inlined intrinsic instructions?
I expect that option 2 means the expressed SIMD features map 1:1 to SIMD instructions, with the compiler just inlining them and doing register allocation. Now, where I think our expectations differ is that I believe Go is not C++ and that it is not possible to build an abstraction like Highway for it. That's also why I think Zig has a vector type as part of the language. Avo is an interesting idea, and I can imagine people starting to do basically what Avo or Templ do, but for SIMD with option 2 of this proposal: generating Go code with intrinsics from another grammar. I put those in a different category than libraries, as they basically have their own grammar and are building a new language. This has a bunch of constraints: bugs can be introduced by the transformation to Go, and you would have to review both the source and its Go transformation. Tooling will likely not be aware of that transformation, and maintaining code that uses it will require more effort than if it were part of the language. That is what I mean by "maintenance". Basically, I expect an increase in complexity with option 2 when creating abstractions.
In both cases the changes are only to the compiler: one case by adding a new type and operations on it, the other by exposing the intrinsic instructions directly.
Thanks for explaining, so we are on the same page.
Kind of, per the Avo-like suggestion. Go doesn't have macros (thank God), but it has generators, which are commonly used and, IMO, much better.
Yes! That would be a nice way of doing it.
(edit) Let me suggest that, assuming the original code (the code that gets converted to SIMD-specialized code) is valid Go code, there is no new grammar, so to say. But there are new semantics to learn.
Sorry, in practice I don't see how that would be so: just like (edit) While I don't think compile-time bugs are an issue, runtime errors, if they happened in generated code, would be much harder to debug.
Yes, a little, but it is not so clear to me why it matters:
Again I'm not seeing it. But I'm no expert. I'd love to hear from others. I'd argue though that a SIMD abstraction (Highway or Zig's vector) is a complex task, and it's simpler to have this complexity separate from the compiler.
My understanding is that option 1 would move more complexity into the compiler, while option 2 would move part of the complexity outside of the compiler. In my book that is a win, assuming the end-result trade-offs are equivalent. In my mind I see the following pros of having option 2 and separate SIMD library(ies):
I agree this is useful.
In C++, the usage of SIMD intrinsics requires that codegen has been enabled for that ISA/extension, either via -mavx2 or #pragma comment.
Works great on all SIMD ISAs known to me! No problem exposing this op.
I agree this is difficult to provide in that form. The approach we have taken is to offer certain pre-fused ops such as OrAnd that correspond to one value of ternlog's imm8.
hm, if you want to add all intrinsics, that sounds like it could actually be more work? Intel's reference lists 6799. 4300 for Arm NEON, 6040 for SVE, 15052(!) for RISC-V.
Agreed. Highway targets are defined as groups of the features, and there are roughly 5 per platform.
This is indeed concerning. Runtime crashes are not uncommon for me. What then? Presumably we have line numbers in the generated code, but how do we understand what to change?
But users will send a bug report consisting of a large amount of assembly to the authors of a Go library? Sounds like debugging will be difficult (and unpopular?).
The difference is that tooling (including sanitizers, the compiler, IDE, and debugger), mostly see through those things. Call stacks mention the C++ functions and original line numbers.
I agree not everything has to sit inside the compiler, but surely some compiler changes are required to get to SIMD? I understand your goal is to move as much as possible outside, for example the emulation of missing instructions, which sounds reasonable.
To facilitate the discussion, I created a "straw man" sketch of a would-be "go-highway" library/generator that would rely on option 2 discussed above. It felt too large (including the C++ version) to be added here, hence the separate document. PS: I linked with comment access for everyone. If anyone wants direct edit access (to add alternatives, bullet points, etc.), email/ping me in chat; I'm happy to share.
This is what this proposal is about. I want the API to feel like normal Go code, not new keywords, ...
Exactly. In the first iteration I think it would be interesting to have lower level primitives. In my opinion it is important to have access to them.
I like the idea of runtime dispatch you present in the init function; I feel this could be useful. However, I also think it is important to let people choose between runtime and static dispatch; that's why I presented the system of build tags. Once the build tag system is built, it should be feasible to extract an API to also do runtime dispatch. Finally, as for the build tags, this seems to be the main point of disagreement (along with variable-length vector types). It would be worth checking whether generation could help here. I'm not entirely sure how just yet, but this also seems like an interesting idea.
Background
After doing a little bit of research on previous issues mentioning adding SIMD to Go, I found the following proposals:
- #58610: This proposal relies on the `ispmd` keyword, which does not follow the compatibility guarantees for Go 1.
- #53171: This proposal is closer to what I am proposing; however, there are multiple packages per CPU architecture, and I'm providing a more complete explanation of how to achieve adding SIMD to the Go standard library.
- #64634: This proposal is more specific to the crypto package, but in essence it is close to what I'm proposing here.
Goals
The main goal of this proposal is to provide an alternative approach to designing a `simd` package for the Go standard library. As of right now, there is no consensus on what the API should look like, and this proposal intends to drive the discussion further.

This proposal mainly includes two things:
Build Tag
For the first point, I was thinking about optional build tags like the following:
//go:simd sse2
//go:simd neon
//go:simd avx512
etc.
As mentioned, these build tags are optional. If not specified, we would resolve to the appropriate SIMD ISA available on the current OS and ARCH. However, I still think that these build tags are needed for deeper optimization and platform-specific operations. If we know that some instruction is more performant or only available on a certain architecture, we should be able to enforce using it manually.
Finally, having the optional build tag would let the developer choose, at compile time, which SIMD ISA to target and thus cross compile. We could write something similar to:
With this, we could take advantage of platform-specific features and know at compile time the size of the vector registers (e.g., 128, 256, or 512 bits). This would help us make better decisions for optimizations on the compiler side.
Compiler Intrinsics
The next crucial step would be to create a portable SIMD package that would rely on the compiler to generate SIMD instructions through compiler intrinsics. I demonstrated that this is feasible with a POC. As of right now, it looks like the following (you can see more examples here, including UTF8 string validation):
And here, `AddU8x16` gets lowered to a `vector add` instruction after SSA lowering.

Notes
We can provide functions like `AddU8x16`, `AddU8x32`, etc. without changing the generics implementation. Other implementations, such as Highway, Rust `std::simd`, and Zig `@Vector`, rely on generics for the API. In Go, we do not have non-type parameters in generics; thus we cannot have something like `Simd[uint8, 16]`.
Also, we do not have a compile-time `Sizeof`, which could otherwise help here.

Philosophy
It is important to understand that this proposal does not describe an abstraction over SIMD features. Such an abstraction could create noticeable performance differences between ISAs, which is why we avoid it. This means that if an operation is not available on an ISA, we simply don't provide it. Each intrinsic should map to exactly one underlying instruction, not a sequence of instructions.
Challenges to Overcome
I believe we would need some kind of type aliases like the following:
These types should not be indexable, and should only be instantiable through init functions like `Splat8x16`, `Load8x16`, etc. The compiler would then promote these special types to vector registers. This would remove all the LD/ST dance and memory allocation that I have in my POC, and thus make everything a lot faster.

In the end, the previous code snippet could look like this:
Some instructions are not supported by the Go assembler, and we currently need raw constants (`WORD $0x...`) to encode them. I believe we should avoid having to use such constants when defining intrinsics; we could implement the missing instructions along the way, together with the intrinsics.
NEON, for instance, has both `VMIN` and `UMINV`. The former returns a vector of lane-wise minimums, and the latter reduces a vector to its minimum element. As we don't have function overloads, we will need to find a way to name them appropriately. I believe we should make the horizontal operations more verbose (e.g. `ReduceMin8x16`) and promote the vertical ones (e.g. `Min8x16`). In the case of `VMIN` and `UMINV`, the latter does not even seem to exist in SSE2, whereas the former does.

There are other operations that have "conflicting" names. For example,
`shift right` and `shift left` have both `logical` and `arithmetic` variants. For these cases, I believe we could just call them `LogicalShiftRight`, `ArithmeticShiftRight`, and so on. I agree that this is verbose, but it makes it clear what is happening.

The current POC did not implement the concept of masks. This is an important concept, but also a tricky one to implement without proper compiler support. After discussing with Jan Wassenberg (author of Highway), I realized that platforms do not all treat masks in the same way. Here is a summary:
We could have other type aliases, like the ones made for vectors. These mask types would have different shapes and be loaded differently depending on the platform.
For example, `SSHR` (shift right) on NEON takes an immediate `n` that needs to be restricted to the range 1 to 8 (see `vshr_n_s8`). I ran into problems where, during the build of the compiler, `n` would resolve to 0 (the default value of an int passed as a parameter) and crash the program.

I believe we need some way to check at compile time that these values are within a certain range, like a `static_assert` in C++ or checks on the AST.
Why is it Important?
Without SIMD, we are missing out on a lot of potential optimizations. Here is a non-exhaustive list of concrete things that could improve performance in everyday scenarios:
Furthermore, it would make these currently existing packages more portable and maintainable:
There are obviously many more applications of SIMD; the point is simply that it is useful in practical scenarios.