Skip to content

proposal: Go 2: Add implicit single program multiple data #58610

@Bluebugs

Description

@Bluebugs

Author background

I have been doing Go development for a few years now and have had a long experience before that doing C, C++ development and also have experience writing OpenGL shaders.

Proposal

This is a proposal to add the capability to do implicit Single Program Multiple Data, SPMD, computation a la ISPC. It does not have any intent to cover intrinsic, but in a way cover portable SIMD. For the rest of the proposal, I will refer to implicit Single Program Multiple Data as iSPMD.

This proposal is a Go 2 language change proposal that will require change in the go language and large change in the compiler and go tools. I do understand that the bar for such a change is high, but I do think bringing iSPMD into Go might outweigh those difficulties and that people will consider the potential improvement this would bring to the Go ecosystem: readable, maintainable, safe, reliable, portable and fast vectorized code. This proposal should make iSPMD code usable and acceptable in all of Go stdlib and accessible to the entire ecosystem.

There has been already a few proposal trying to bring some form of vectorized code in Go, none looked from the iSPMD angle at it:
#35307
#53171

Due to the high burden of maintaining assembly code and the safety risk it pose, it is very hard to get SIMD optimization in the stdlib even when there is demand as the follow issues shows:
#53178
#24499
#20206
#19636

iSPMD could make it more manageable for projects like gonum to support more architecture and not have to hand write as much assembly for each architecture. Rasterx and go-text will likely benefit from it too. Any compute intensive workload would likely be better off with iSPMD than handwritten assembly. Vector databases like Weaviate have expressed interest in seeing the situation around SIMD improved.

An additional challenge of this proposal is that no language exists, to my knowledge, with both a iSPMD and a non iSPMD context natively. Nevertheless I believe it is doable and this is the aim of this proposal.

TL;DR;

Add the notion of iSPMD context where all code operations are done in parallel when using the keyword ispmd. Add the new keyword uniform to specify that a data is actually representing a single value for all parallel streams of computation in an iSPMD context. Structure and data declared outside of an iSPMD context are all by definition uniform implicitly. Structure and data declared inside of an iSPMD context are all by definition varying of the length of a SIMD register selected by the compiler. Add an each keyword when iterating to specify that each element of a gang comes from a different element in a range, simply loading one SIMD register per step of the for loop instead of loading the same value in each member of the gang.

Proposed syntax for iSPMD context could look like the following short example below and longer one at the end:

func Sum(a []int32) int32 {
   ispmd {
       var partialSum int32
       for _, v := each range a {
           partialSum += v
       }
       return ReduceAdd(partialSum)
   }
}


//go:ismpd AVX for ReduceAdd
func asmReduceAddAVX(a int32) uniform int32
//go:ismpd SSE4 for ReduceAdd
func asmReduceAddSSE4(a int32) uniform int32
//go:ismpd None for ReduceAdd
func noneReduceAdd(a int32) uniform int32 { return a }

For a developer trying to optimize a workload, they just have to pick which unit can be processed in parallel and switch to iSPMD context for it: number, character, word, string, pixel, line of pixels, image, glyph, and so on.

Context

For discussion, I highly recommend reading all of the history of ISPC here as it is full of very useful information to understand the context and benefit of iSPMD: https://pharr.org/matt/blog/2018/04/30/ispc-all with a lot of insight on bringing this to more conventional workload.

ISPC is not the only iSPMD solution in use today. Obviously anyone that wrote a GPU shader in OpenGL, Vulkan or DirectX did write an iSPMD program. Same applies to people writing OpenCL or CUDA compute programs.

In contrast, the claims from Matt Pharr regarding the lack of reliability of auto vectorisation by compiler is still true today with no solution in sight. Auto vectorisation is still not a solution relied upon for generating efficient code. Intrinsic is used a lot, but the amount of developers writing them is still limited and there is significant risk in writing intrinsic or assembly directly.

ISPC was inspired by C, but it did introduce the keyword launch which does pretty much the same things as the keyword go in Golang. As an example: https://github.com/ispc/ispc/blob/main/examples/cpu/mandelbrot_tasks/mandelbrot_tasks.ispc will not feel very far from what a Go iSPMD program could look like. ISPC was focused on just writing a small bit of your program that needed speed, but it enable more complex piece of optimized software like this BC7 encoder: https://github.com/richgel999/bc7enc_rdo/blob/master/bc7e.ispc which would be very hard to do with any other approach .

An interesting experiment from Saksham Jain and Sivaprasad Sudhir that used the Go parser to generate ISPC code from Go code for some specialized function. This was presented in there thesis final report: https://www.andrew.cmu.edu/user/sakshamj/15618/final.pdf and the result of their work can be seen here: https://github.com/sakjain92/govec . Please look at the result of their benchmark which shows the potential for iSPMD in Go. This indicates the potential for several fold speeds up on many compute tasks. This also proves that there is not necessarily the need for much change to actually get iSPMD code mixed with normal Go code and get significant performance out of it. As a readable example of what that looks like and can be compared to the previous ISPC file a mandelbrot example: https://github.com/sakjain92/govec/blob/master/blas/go/mandelbrot.go . The main drawback of this prototype is that it relies on CGo, has some weird syntax requirements to get things working and limitations in the code it generates. It does mean we couldn’t use it to improve any of the standard libraries.

There has also been attempts to make iSPMD work within C++ code with projects like: https://github.com/richgel999/CppSPMD_Fast/ or https://github.com/google/highway. It works and the syntax doesn’t feel bad, again with a mandelbrot example as it seems to be the benchmark: https://github.com/richgel999/CppSPMD_Fast/blob/master/mandelbrot_imp.h . This obviously demonstrates mixing iSPMD context and non iSPMD context, but not really with the help of the language.

Additionally for reference you can read https://www.codeproject.com/Articles/1223361/Benchmarking-NET-Core-SIMD-performance-vs-Intel-IS and https://api-depositonce.tu-berlin.de/server/api/core/bitstreams/7489a854-6557-496e-8b3e-30b47e67b0bf/content to get an idea about the performance of alternative solution and how well ISPC fare. It seems that the only benchmark (short of full intrinsic/assembly approach) that managed to generate better code is rust packed_simd+rayon approach: https://github.com/rust-lang/packed_simd/tree/master/examples/mandelbrot . There is definitively a speedup from their approach at the moment compared to ISPC output, but the code complexity seems quite higher and much less readable.

Proposal

The goal of this proposal is to enable iSPMD code in portion of a golang program that the compiler will be able to generate directly as a vectorized set of operation (see https://pharr.org/matt/blog/2018/04/21/ispc-volta-c-and-spmd for an explanation on how the compiler would do so). There should not be any dependency on CGo. It should not break existing programs as they could not be using the new keyword in a manner that would get the compiler confused. It should be easy for all Go compilers to properly handle an iSPMD section even if they do not have support for SIMD. It should give clear expectations to the developers of what is happening in a iSPMD section.

An iSPMD context in go will execute the code in parallel for each member of a gang using SIMD instruction and an execution mask to follow the execution flow of the code. It means that race conditions are possible when writing code using iSPMD and tools will need to help detect them. A classic example:

func Sum(a []int32) int32 {
   var sum int32
   ispmd for _, v := each range a {
       sum += v // We have a race condition here
   }
   return sum
}

I would expect the above example to actually not compile due to a type mismatch (uniform int32 vs varying int32), but it is nevertheless showing what race condition to expect.

We will need to be able to tag pieces of code to indicate that they are iSPMD pieces of code. For that purpose, a new keyword ispmd could be used to switch into an iSPMD context (I did think of using igo, but I am afraid this would be confusing as this is not creating any goroutine).

There is also a need to support the for iteration picking at each element in the slice independently. Basically loading a bunch of the slice in an entire SIMD register at a time. Putting all of that together, it should look like that:

func Sum(a []int32) int32 {
   ismpd {
       var partialSum int32
       for _, v := each range a {
           partialSum += v
       }
       return ReduceAdd(partialSum)
   }
}

The problem now is that from this context if we do a function call, we must go out of the iSPMD context. A solution to that is by reusing the ispmd keyword, we should be able to indicate to the compiler that a function is actually ok to continue executing it as part of the iSPMD context. This would look like:

ispmd func doSomething(a int, b int) int { return a + b }

When an iSPMD context calls a function, it should generate a function that matches the vector length, propagate the mask and not break out of the iSPMD context. If the function is not tagged, it should break out, iterate over the vector and go back into the iSPMD context after that (This means that when calling a non iSPMD function from an iSPMD context, due to the nature of all the code path in the iSPMD context being explored in parallel for a gang, they would look like being called in random order a bit like if they were called from multiple goroutine meaning there is potential for race condition that the developer need to be aware of). If the function is called from a non iSPMD context, the compiler could for now error out and not allow it even if we should always be able to generate a serial version as fallback.

It should be allowed to have iSPMD functions defined in interfaces.

It should be possible to provide multiple variants of the same function, but with different iSPMD architecture specified to be able to call specialized assembly functions. I could not think of different way than the following:

//go:ismpd AVX for ReduceAdd
func asmReduceAddAVX(a int) uniform int
//go:ismpd SSE4 for ReduceAdd
func asmReduceAddSSE4(a int) uniform int
//go:ismpd None for ReduceAdd
func noneReduceAdd(a int) uniform int { return a }

As shown in this example, we might need to be able to call assembly code for this proposal to express all its potential. This means that there should be a calling convention for calling assembly functions from an iSPMD context. This is mostly a detail of implementation and might vary between compilers. The main issue, here, is that it introduces a concept of overloading which doesn’t exist at this point in the Go ecosystem.

Also as stated earlier, it is left out of this proposal how to optimize the transition from iSPMD to assembly and this proposal does not address nor should block the support for intrinsic types of solution which are orthogonal to this proposal.

In terms of code generation, it will be expected that a version of the iSPMD context is generated per SIMD extension supported for the architecture targeted by the compiler along with a serial implementation to cover CPU that do not have the extension when necessary (The same logic applies on iSPMD tagged functions). If the compiler knows the SIMD extension and tries to generate code for it, it should never fail to generate the code. The compiler should generate the right code to select from the best SIMD extension for the architecture it is targeting and provide meaningful fallback depending on the host CPU capability.

It is also logical that you can not use an ispmd construction inside a iSPMD context (be it a function or inside a previous ispmd) as it doesn’t make sense to switch again to an iSPMD context when you are already in one.

We should trust that the compiler will generate code that for each of the SIMD extensions leads to the exact same computational result, but for testability purposes we should be able to have support in Test and Benchmark to force the SIMD extension choice when running them as we might not want to tests all combination all the time. And we might want to use PGO after running through a benchmark. It would also be necessary to upgrade staticcheck to help detect data race conditions inside an iSPMD context.

Additionally while reading https://pharr.org/matt/blog/2018/04/26/ispc-volta-more-on-performance and https://ispc.github.io/perfguide.html , it seems important to address the following keywords that exist in ISPC.

Specifying instruction set and mask size at compile time: with ISPC and cpp_spmd, it is necessary to specify the instruction set targeted and the size of the mask. This is not really simple for the developers and requires work for any newly supported architecture. It would be best to avoid having to specify it in most cases. It seems possible to me that the compiler could have a pass that explore each iSPMD context and figure out the smallest varying data type processed in that iSPMD context (For function it might just actually infer it directly from the parameters type). With that information the compiler should be able to infer the size of the mask (length of SIMD register/sizeof type) and avoid having the developer specify this information. This should enable a more portable code and easier to read.

uniform: which is a very common keyword in many GPU languages, indicates that the variable has the same value for the current iSPMD context and for all members of the gang. In the case of this proposal, any variable created outside of an ispmd block is automatically an uniform and any variable inside is a varying. A difficulty might arise when calling a function as you can now pass a variable that has been declared outside of the iSPMD context and is an uniform to a iSPMD function that expects only varying. It is also possible to need to declare a variable before we do an assignment and this would also mean that there will be a case where a uniform variable needs to be declared before the association that would allow for type information to be propagated. This information can apply on different levels of a variable type. If in an iSPMD context you are building an array filled with the content of a variable that comes from a non iSPMD context, you need to specify that the type inside the array is also an uniform. This means that, in an iSPMD context, we need the ability to define that a type is uniform. As an example:

ispmd func mandel(c_re float32 , c_im float32 , count uniform int) int32 {
    var test uniform []uniform int
}

I do not see how it would be possible to make the type system work without the uniform keyword which is the reason why the entire proposal is designed with the idea of deep change to the language.

It would be logical to ask the question if we do not need a varying to do the opposite of a uniform keyword. To define that a variable or a type coming from a iSPMD context could be accessed from a non iSPMD context or carry around by non iSPMD context or just define a useful structure type for our iSPMD context, we would need a way to define structure with iSPMD data type (ideally not just anonymous structure). As we have been using the keyword ispmd so far, we can likely continue and use that to define types that embed varying data. Something like that would do:

ispmd type something struct {
   x int
   y int
   id uniform int
}

In this case, if you can pass the structure around in non iSPMD context, only the id which is a uniform would be accessible and all the other fields wouldn’t be accessible to the non iSPMD context. For simplicity, the structure can likely only be allocated inside an iSPMD context and the non uniform field/varying field can only be used inside an iSPMD context.

cif: having to explore both branches of if might be quite costly especially if one is rarely walked (error handling for example). cif exists to tell ISPC to generate additional tests to avoid walking a branch that is usually not worth it (checking the content of the entire mask and jumping depending on it). This should be something equivalent to calling a if AnyTrue(test) { if test {} } with ispmd func AnyTrue(bool test) uniform bool. Not the prettiest syntax, but as this would be possible within the existing proposal without adding any keyword, I don’t think we need at this point to have an equivalent to cif.

foreach: this has two uses. First one it iterates on a different step for each gang. And second it does enable to generate a more optimized loop. The use of each range should be directly matched to ISPC foreach construct. So I do not think we need a foreach keyword or its feature at this point.

The question of how to iterate on just numbers when you do not have a slice to start from is open and I do not have a good answer at this point, nor do I know if it is really necessary in practice. A possibility would be to extend the use of each for the following:

ispmd {
   var partialSum int32
   for i := each 1..10 {
      partialSum += i
   }
}

The second foreach construct is the foreach tiled. I think we can skip it at the moment and look later on how to implement it as this might be related to things like multidimensional array/slice.

Summary

This proposal changes the language by adding 3 keyword: ispmd, uniform and each, the ability to switch to a iSPMD context when entering a block of code specified with ispmd and continuing into a iSPMD context when calling a function and ability to call assembly from within a iSPMD context directly. This change will impact the entire ecosystem and all tooling. It should yield significant performance gain for all computation and I would expect yield performance close to what ISPC has proven was doable.

The Go compiler might slow down a little bit at first, as it will need to have an additional pass and generate multiple variant of the same code, but as it does parse and encode, it should itself be able to use iSPMD context and so it should over time get faster when this proposal get used.

This is trying to still be a “minimal” proposal, but I would expect that over time new functions like the example ReduceAdd or AnyTrue above would be proposed in the stdlib depending on the Go ecosystem needs.

Fully understanding all the potential of the language will be harder, but at the same time it will make using Go as a teaching tool about parallelism one of the best solutions as alternatives would be more complex to approach or difficult to have access to. This could mean university potentially switching to teaching Go as part of the curriculum.

The benefit of having iSPMD directly in the language shouldn’t escape to anyone as parser, encoder, decoder, compression, compare, search, sort would be a good target to improve once the compiler has iSPMD capability. I can see a lot of functions inside the standard library that could use this, but as this is not a magical solution, the code will have to express this parallelism to actually benefit from this. This means code change and not just adding a bunch of iSPMD flags around (This is logical otherwise auto vectorisation would have succeeded by now).
Typical benchmark example

ismpd func mandel(c_re float32 , c_im float32 , count uniform int32) int32 {
   var z_re float32
   var z_im float32


   z_re = c_re
   z_im = c_im


   var i int32
   for i = 0; i < count; i++ {
       var new_re, new_im float32


       if (z_re * z_re + z_im * z_im > 4.0) {
           break
       }


       new_re = z_re*z_re - z_im*z_im
       new_im = 2.0 * z_re * z_im


       z_re = c_re + new_re
       z_im = c_im + new_im
   }
   return i
}


func SerialMandelbrot(
   x0 float32, y0 float32, x1 float32, y1 float32,
   width int, height int,
   startRow int, totalRows int,
   maxIterations int32,
   output []int32) {


   dx := (x1 - x0) / float32(width)
   dy := (y1 - y0) / float32(height)


   endRow := startRow + totalRows


   ispmd for i := each 0..width-1 {
      for j := 0; j < height; j++ {
           x := x0 + float32(i) * dx
           y := y0 + float32(j) * dy


           index = (j * width + i)


           output[index] = mandel(x, y, maxIterations)
       }
   }
}

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions