proposal: cmd/compile: intrinsicify user defined assembly functions #17373

Closed
minux opened this Issue Oct 7, 2016 · 27 comments

Projects

None yet

10 participants

@minux
Member
minux commented Oct 7, 2016

I'd like the compiler to intrinsicify user defined assembly functions.

here is a simple sketch of the user interface:
we introduce CLOBBER pseudo instruction to describe that after
certain instruction, certain registers are clobbered.

When the user defines such an assembly function in a package:
TEXT ·Clz(SB), NOSPLIT, $0-4
MOVW 0(FP), R0 // compiler recognizes this as input read
CLZ R0, R1 // the new instruction template
CLOBBER [R2-R3, R11] // as an example of CLOBBER, so that compiler know to spill those registers before using this instruction template
MOVW R1, 4(FP) // compiler recognizes this as output write
RET

The user effectively tells the compiler a new rule to expand call
to Clz function with a custom instruction CLZ.

If we ignore the possibility of doing register allocation for the new
instruction template, I think instrinsicify such assembly functions
with the current SSA backend is within reach.

Of course, this has implications on safety, but if the user is already
using assembly, then that shouldn't be a big concern.

@minux minux added the Proposal label Oct 7, 2016
@rsc
Contributor
rsc commented Jan 4, 2017

I think this would be a significant source of complexity for very little benefit. I do think we need a story for access to functionality like this, but a decent package API for bit twiddling, known to the compiler the same way math.Sqrt is known, should suffice.

Also note that today the Go compiler runs before the assembler, because it generates go_asm.h for use by assembly code. In addition to the complexity of somehow interpreting these magic "inline assembly in another file" stanzas, we'd have to run the compiler, run the assembler, and run the compiler again.

/cc @dr2chase

@rsc rsc added this to the Proposal milestone Jan 4, 2017
@minux
Member
minux commented Jan 4, 2017
@randall77
Contributor

In Go 1.8 we intrinsify some operations in runtime/internal/sys which includes CLZ mentioned in this issue.

The intrinsification is easy to extend to a user-exposed package (we already do so for sync/atomic). The hard part is coming up with APIs to access these ops. The instructions across architectures can vary on what they do in corner cases (e.g. CLZ with a zero argument). Do you expose that in the API, or introduce fixup code to make the API clean but slower? We can afford to be sloppy in runtime/internal but for something external and under the Go1 guarantee it is harder.

Doesn't help with generic assembly inlining, but it may help a bunch of the common cases.

@minux
Member
minux commented Jan 4, 2017
@josharian
Contributor
josharian commented Jan 4, 2017 edited

The hard part is coming up with APIs to access these ops. The instructions across architectures can vary on what they do in corner cases (e.g. CLZ with a zero argument). Do you expose that in the API, or introduce fixup code to make the API clean but slower?

I would really love a fast bit-twiddling package in std. I'd suggest introducing fixup code to make the API clean but slower. I suspect most of these cleanups would be a single highly predictable branch anyway. For many bit-twiddling operations we already have two existing users, the runtime and math/big. That should help some with API design.

The SIMD instructions don't map well to Go's strict typing.

One option is to have package-specific SIMD types, which correspond to SIMD register widths. Then all functions in the package operate on those types, e.g. AddUint32x4(dst, src1, src2 Reg128). We'd provide a bunch of Load/Store functions, e.g. LoadUint64x2(p *[2]uint64) Reg128 and StoreUint64x2(p *[2]uint64, r Reg128). Maybe also pointer-free variants. This corresponds reasonably well to what happens under the hood--data gets loaded into a register, manipulated into that register, and then stored back into memory--which is probably a good thing for such low-level operations. It looks a fair amount like the C intrinsics (be that good or bad). And it'd make the job of the register allocator way easier.

We can be creative, the end user interface doesn't have to be an assembly
file

Interesting. Something a la cgo? But then we have a third DSL to parse, maintain, etc. Seems like we should start with bit-twiddling and maybe SIMD.

@rsc
Contributor
rsc commented Jan 5, 2017

I think the essence of the issue is to provide some kind of user defined SSA rewrite rules (that can expand to instructions unknown to cmd/internal/obj).

I think this would be a serious mistake. It exposes compiler internals that are internals. We do not want them to become the public API of the toolchain. The other problem with designing a general low-level mechanism is that then you're asking everyone to be low-level users.

Instead, find the higher-level things that are not easily accessed (vector math, bit twiddling) and write a good API for them. Yes, good API design is hard. All the more reason to do it once.

@rsc rsc changed the title from cmd/compile: intrinsicify user defined assembly functions to proposal: cmd/compile: intrinsicify user defined assembly functions Jan 5, 2017
@brtzsnr
Contributor
brtzsnr commented Jan 5, 2017

Related to bit-twiddling I made a survey of the available libraries before #10757. On the bug Rob opposed including bit-twidling library as a standard library. Is supporting an external library (e.g. golang.org/bittwiddling) by the compiler fine?

If one were to work on a proposal for a bit-twiddling, what would the timeline be for inclusion in Go1.9?

@randall77
Contributor

I don't think it makes sense for the compiler to reference anything outside the stdlib. When semantic inlining, you really want the code you are semantic inlining and the compiler to be released in lock step.

A proposal for a bit-twiddling library would be good. Probably want that out by Feb 1 so we can discuss + implement in time for 1.9.
Check out runtime/internal/sys/intrinsics.go, we'd want at least these.

@minux
Member
minux commented Jan 5, 2017
@rsc
Contributor
rsc commented Jan 9, 2017

I'd really like to focus our development efforts on whatever the high-level needs are (bit twiddling, etc) and not on the low-level out-of-line inline assembly mechanism.

@rsc
Contributor
rsc commented Jan 9, 2017

To be clear, we can easily imagine a beautiful system along these lines that "just works" and makes things transparently run faster with no effort by users. But that's a huge effort that must be implemented, debugged, and maintained (by the core team). We don't believe that this effort is a high enough priority right now compared to other efforts.

-rsc for @golang/proposal-review

@rsc rsc closed this Jan 9, 2017
@minux
Member
minux commented Jan 10, 2017
@rsc
Contributor
rsc commented Jan 10, 2017

For the specific case of crypto, we seem to be doing OK by (1) defining good, portable Go APIs, and (2) taking advantage of arch-specific hardware acceleration in a way that's transparent to the end user. I don't see why we can't do that in other settings too. They don't all have to happen in the same release. We can build things up incrementally.

@pwaller
Contributor
pwaller commented Jan 10, 2017

I have some application code where I want to use SIMD intrinsics. It would be good to be able to write that in a way that I could take advantage of the compiler's other general smarts, so I definitely see Minux's motivation. I understand that for now at least this is a niche feature and probably entails a lot of additional complexity in the compiler.

I'd still be curious to see this sketched or prototyped by any interested parties, even if it doesn't end up being implemented. Doing 32 multiplies per instruction in AVX512 sounds very appealing!

var a, b, dst [32]int16
asm.VPMULLW(a, b, dst)

I ponder if you could achieve a similar effect a source code transformation, but I guess that would amount to writing (or forking) a compiler. I'd love to give a brief shoutout to PeachPy which uses Python as a DSL to generate Go ASM, which I have found lowers the cognitive burden of writing assembly. It would be interesting to see what it would look like if you used Go as the DSL. Probably not much like Go, but maybe more managable for engineers than assembly?

@btracey
Contributor
btracey commented Jan 11, 2017 edited

@rsc: as input to "whatever the high-level needs are". The following primitives have a large effect on the speed of gonum programs. The most important are Scale (x = a*x), "Axpy (x = a*x+y)", "AxpyTo (z = a*x +y)" and Dot product (sum_i x_i * y_i), where x is []float64, []float32, or the complex equivalents. We use be both non-stided (standard Go slice) and strided versions. All implementations may assume that there is no non-trivial overlap between the inputs.

We have coded our own assembly implementations of them based on real speedups. It would be great to have cooperation with the compiler and for specific processors to make these operations as fast as possible.

@pwaller
Contributor
pwaller commented Jan 11, 2017 edited

@btracey sounds interesting, do you have benchmark numbers? :)

@btracey
Contributor
btracey commented Jan 11, 2017 edited

Yea. The code is https://godoc.org/github.com/gonum/blas/native

Comparing go test -bench Dgemm -tags=noasm with go test -bench Dgemm

DgemmSmSmSm-8        1.88µs ± 6%  1.17µs ± 1%  -37.74%  (p=0.008 n=5+5)
DgemmMedMedMed-8      628µs ±17%   274µs ± 4%  -56.37%  (p=0.008 n=5+5)
DgemmMedLgMed-8      4.37ms ± 3%  1.80ms ± 7%  -58.74%  (p=0.008 n=5+5)
DgemmLgLgLg-8         411ms ± 4%   169ms ± 3%  -58.90%  (p=0.008 n=5+5)
DgemmLgSmLg-8        6.79ms ± 3%  4.03ms ± 5%  -40.69%  (p=0.008 n=5+5)
DgemmLgLgSm-8        4.83ms ± 3%  2.39ms ± 6%  -50.52%  (p=0.008 n=5+5)
DgemmHgHgSm-8         345ms ± 3%   172ms ± 3%  -50.10%  (p=0.008 n=5+5)
DgemmMedMedMedTNT-8   689µs ± 2%   461µs ± 2%  -33.10%  (p=0.008 n=5+5)
DgemmMedMedMedNTT-8   596µs ± 4%   252µs ± 4%  -57.70%  (p=0.008 n=5+5)
DgemmMedMedMedTT-8    762µs ± 5%   667µs ± 2%  -12.39%  (p=0.008 n=5+5)

This translates to about 10% performance on a real program I have.

@minux
Member
minux commented Jan 11, 2017
@chewxy
chewxy commented Jan 11, 2017 edited

I think predefined intrinsics provided by the language will go a long way. Something like __mm256 and _mm_INSERTMACRONAME_ would make life a lot easier for those working with SIMD related assembly.

I also agree with @btracey - having written enough assembly (tho most are generated from C) for Gorgonia's vector related code (and also leveraging Gonum), I'd really love if we could get help from the compiler on this front. Or some integrated dev tools from the Go team

@minux
Member
minux commented Jan 11, 2017
@josharian
Contributor

Minux, would you mind spelling out the SIMD design impossibility you see in a bit more detail? I'm not following it.

Relatedly, would you mind spelling out a bit why (say) the simd.Reg128 type I suggested above is unsuitable? Note that reinterpreting bits in a vector is a common thing to want to do for performance (which is the only point of SIMD), so its flexibility about the type of things it contains is a good thing.

Lastly, an anecdote. When I last made heavy use of SIMD was NEON, five or six years back, and the compiler made such abominable decisions with intrinsics (mostly around spilling and restoring registers) that I ended up dropping down to assembly anyway, just to be able to fully utilize all registers and schedule operations to make good use of the pipeline. I don't know how much better clang is now, but the experience makes me think that a good SIMD package should be fairly close to the metal and provide users relatively large amounts of control. I say this not as an argument for generic inline assembly support but as a consider for SIMD package design.

But honestly, I'd rather see bit twiddling get tackled first. There's probably plenty to learn there, and I suspect it has a broader audience, including known stdlib uses in the runtime and math/big.

@minux
Member
minux commented Jan 11, 2017
@josharian
Contributor

One way to tackle the new register class support problem is to teach the compiler to use SIMD instructions itself when their presence is guaranteed, for the obvious/easy cases (simple for loops, some zeroing).

I just don't understand the resistance.

In my case, the resistance comes from concern that the scope of the generic proposal (including initial implementation, maintenance, and compatibility impact) will end up being much bigger than it initially appears--and it initially appears pretty big, at least to me.

@minux
Member
minux commented Jan 11, 2017 edited
@ianlancetaylor
Contributor

@josharian GCC at least does quite well with intrinsics these days. I wrote C code to do the same thing as Keith's aeshash code in runtime/asm_amd64.s, and the compiled result was essentially the same as the asm code.

https://github.com/golang/gofrontend/blob/master/libgo/runtime/aeshash.c

@dr2chase
Contributor

Another use of intrinsics in C, a previous effort at Oracle to speed up CRC calculations. Start at line 425: http://cr.openjdk.java.net/~drchase/7088419/webrev.01/src/share/native/java/util/zip/CRC32.c.html

The compiler's main role in this was to be a stenographer for XMM code. (Pay no attention to the assembly encoded with .byte -- I used a modern compiler to generate the code, but had to feed it to an archaic assembler that didn't speak "modern" AMD64.)

@btracey
Contributor
btracey commented Jan 17, 2017

As a simple addition to my benchmarks above, here is the profile for Cholesky decomposition:

brendan:~/Documents/mygo/src/github.com/gonum/matrix/mat64$ go test -cpuprofile cpu.prof -run=none -bench=CholeskyLarge
BenchmarkCholeskyLarge-8   	       1	3476738111 ns/op
PASS
ok  	github.com/gonum/matrix/mat64	14.642s
brendan:~/Documents/mygo/src/github.com/gonum/matrix/mat64$ go tool pprof mat64.test cpu.prof 
Entering interactive mode (type "help" for commands)
(pprof) top5
67.51s of 69.60s total (97.00%)
Dropped 91 nodes (cum <= 0.35s)
Showing top 5 nodes out of 16 (cum >= 0.83s)
      flat  flat%   sum%        cum   cum%
    57.79s 83.03% 83.03%     57.79s 83.03%  github.com/gonum/internal/asm.DaxpyUnitaryTo
     9.66s 13.88% 96.91%     67.35s 96.77%  github.com/gonum/blas/native.dgemmSerialTransNot
     0.04s 0.057% 96.97%      1.63s  2.34%  github.com/gonum/matrix/mat64.benchmarkCholesky
     0.01s 0.014% 96.98%     67.49s 96.97%  github.com/gonum/blas/native.dgemmParallel.func1
     0.01s 0.014% 97.00%      0.83s  1.19%  github.com/gonum/lapack/native.Implementation.Dlatrs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment