cmd/compile: optimize XOR masking code #31586
The two links you posted are the same (copy-paste mistake?). The 2nd one in particular does not use unsafe. |
My bad, fixed. The second link was supposed to be https://github.com/gorilla/websocket/blob/0ec3d1bd7fe50c503d6df98ee649d81f4857c564/mask.go#L13-L54 |
Thank you for filing this issue @nhooyr and @ALTree for the triaging! Kindly paging @josharian @randall77 @martisch. |
Please see #28113, #30553, and #28465, and go/src/crypto/cipher/xor_generic.go line 45 (at 68d4b12).
If you want fast |
The compiler already recognizes sequences of byte loads and coalesces them into a single multi-byte load. See e.g. the amd64 rules starting at go/src/cmd/compile/internal/ssa/gen/AMD64.rules line 1531 (at e6ae4e8).
In fact, I believe there is a way of writing this that would allow the compiler to do this optimization now. This code:

```go
import "encoding/binary"

func f(key [4]byte, b []byte) {
	k := binary.LittleEndian.Uint32(key[:])
	v := binary.LittleEndian.Uint32(b)
	binary.LittleEndian.PutUint32(b, v^k)
}
```

generates (at its core, surrounded by other stuff):
That seems pretty good. What do you think, @nhooyr ? |
(And if this is seriously performance-critical, you might want to manually inline the LittleEndian routines and also duplicate key into a uint64 so that you can do 8 bytes at a time.) |
That seems very reasonable @josharian. Will give this a shot and let you know how the performance compares. |
So I benchmarked, and here are the results on my machine (
Source code at https://github.com/nhooyr/websocket/blob/40b4f235f2e2730ad1d8b81852e7610a8251d080/mask_test.go#L136-L174 Interestingly, the crypto/cipher assembly XOR was the slowest, but I might be using it wrong. Your solution is 2x slower than gorilla and gobwas but fast enough for me, as I do not want to use unsafe. Thanks @josharian. I'll leave this issue open in case you think there is still improvement that can be made. |
How I implemented what you described: // xor applies the WebSocket masking algorithm to p
// with the given key where the first 3 bits of pos
// are the starting position in the key.
// See https://tools.ietf.org/html/rfc6455#section-5.3
//
// The returned value is the position of the next byte
// to be used for masking in the key. This is so that
// unmasking can be performed without the entire frame.
func nhooyr(key [4]byte, keyPos int, b []byte) int {
// If the payload is greater than 16 bytes, then it's worth
// masking 8 bytes at a time.
// Optimization from https://github.com/golang/go/issues/31586#issuecomment-485530859
if len(b) > 16 {
// We first create a key that is 8 bytes long
// and is aligned on the keyPos correctly.
var alignedKey [8]byte
for i := range alignedKey {
alignedKey[i] = key[(i+keyPos)&3]
}
k := binary.LittleEndian.Uint64(alignedKey[:])
// Then we xor until b is less than 8 bytes.
for len(b) >= 8 {
v := binary.LittleEndian.Uint64(b)
binary.LittleEndian.PutUint64(b, v^k)
b = b[8:]
}
}
// xor remaining bytes.
for i := range b {
b[i] ^= key[keyPos&3]
keyPos++
}
return keyPos & 3
} |
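A quick usage sketch (hypothetical key and payload values): XOR masking is its own inverse, and the returned position lets a frame be masked in pieces.

```go
key := [4]byte{0x37, 0xfa, 0x21, 0x3d}
payload := []byte("Hello, WebSocket!")

pos := nhooyr(key, 0, payload[:7])  // mask the first chunk...
pos = nhooyr(key, pos, payload[7:]) // ...and continue mid-frame from pos
_ = nhooyr(key, 0, payload)         // xoring again from position 0 unmasks
```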
@nhooyr this is a case in which a little unrolling helps. Add something along these lines:

```go
for len(b) >= 32 {
	v := binary.LittleEndian.Uint64(b)
	binary.LittleEndian.PutUint64(b, v^k)
	v = binary.LittleEndian.Uint64(b[8:])
	binary.LittleEndian.PutUint64(b[8:], v^k)
	v = binary.LittleEndian.Uint64(b[16:])
	binary.LittleEndian.PutUint64(b[16:], v^k)
	v = binary.LittleEndian.Uint64(b[24:])
	binary.LittleEndian.PutUint64(b[24:], v^k)
	b = b[32:]
}
```

I see 30%+ improvements from that. You may want to experiment a bit to find the sweet spot. What's going on here is that reslicing (b = b[8:]) isn't free, and unrolling amortizes its cost across more useful work per iteration. And with that, I think I'll go ahead and close this issue. Thanks very much for filing it--issues like these are helpful in finding opportunities to improve the compiler and runtime. |
Sweet, will use that. Thank you for the tips. |
@josharian out of curiosity, do you have any idea why the crypto/cipher xor was so slow even though it's written in assembly? |
Unrolling the loop to 128 makes it about as fast as gobwas/ws and gorilla/websocket for smaller sizes and faster for larger sizes 🎉
|
But what you did is
This is slow. If you want that fast SIMD implementation, I think the assembly code needs to be changed for the WebSocket mask algorithm. |
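crvv's point, in a sketch: crypto/cipher's assembly XOR is tuned for one pass over a large buffer, so invoking it once per tiny chunk pays call and setup overhead on every iteration. The xorBytesAsm below is a hypothetical stand-in (the real routine is unexported and written in assembly); the shape of the loop is what matters.

```go
// xorBytesAsm is a hypothetical stand-in for crypto/cipher's unexported
// assembly XOR; written as a plain loop here so the sketch compiles.
func xorBytesAsm(dst, a, b []byte) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		dst[i] = a[i] ^ b[i]
	}
	return n
}

// maskChunked calls the bulk XOR once per 4-byte chunk, so per-call
// overhead dominates and the SIMD width never comes into play.
func maskChunked(key [4]byte, b []byte) {
	for len(b) >= 4 {
		xorBytesAsm(b[:4], b[:4], key[:])
		b = b[4:]
	}
}
```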
@crvv makes sense, got it. @josharian the final code looks like https://github.com/nhooyr/websocket/blob/cfdbe6819dba18b345df68fa06ed1e499874cad8/xor.go#L28-L115 So it's just tiered loops of decreasing unroll size; this doesn't seem to have any negative effect on performance at low or high byte sizes. However, I noticed something interesting: if I change the for loops after the 128-byte unrolled loop into if statements, which they logically are since those loops cannot run for more than a single iteration, performance drops 30% at higher byte sizes, which is confusing.
|
Interesting. Branch misprediction, perhaps? Loop head unrolling? It might be instructive to look at the generated code and at what
Hooray! One important caveat: This is all tuned for amd64, which is little-endian and supports unaligned loads and stores. It'll work correctly on all architectures (because it is pure Go), but it might not be fast on them. You could probably make it fast on a per-architecture basis (using build flags), if that matters to you. You might even be able to write a useful standalone package that lets you efficiently (per-architecture) take a byte slice and process in turn some leading bytes, a bunch of uint64s, and then some trailing bytes. |
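Such a standalone package might look roughly like this (a purely hypothetical sketch: names and API are invented, alignment of leading bytes is ignored for brevity, and the callbacks would need to be inlinable for it to actually be fast):

```go
import "encoding/binary"

// applyWordwise applies w to b eight bytes at a time and t to the
// remaining tail bytes. Per-architecture builds could swap in
// unaligned-load implementations behind the same signature.
func applyWordwise(b []byte, w func(uint64) uint64, t func(byte) byte) {
	for len(b) >= 8 {
		binary.LittleEndian.PutUint64(b[:8], w(binary.LittleEndian.Uint64(b[:8])))
		b = b[8:]
	}
	for i := range b {
		b[i] = t(b[i])
	}
}
```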
Just to note, the key-aligning loop can be reduced to a bits.RotateLeft64. |
@renthraysk To clarify, that would involve changing the key's type to uint64 from [4]byte, right? I.e. there is no way to just replace the loop with bits.RotateLeft64 directly. |
fast2 is with bits.RotateLeft64: a pretty significant speedup, especially from 32 to 512, where the speed pretty much doubled. 🎊 |
Yeah, might have misremembered. Use binary.X.Uint32() to read the 4 key bytes in, perform the alignment rotation with RotateLeft32(), and then expand to 64 bits with x<<32|x. Also, the last remaining-byte loop can use RotateLeft() instead of reading from the key array. |
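Putting both suggestions together, a sketch (assuming the little-endian layout used throughout this thread): read the key once with Uint32, rotate it into alignment, widen to 64 bits, and rotate the key through the tail loop instead of indexing the key array.

```go
import (
	"encoding/binary"
	"math/bits"
)

func xorMask(key [4]byte, keyPos int, b []byte) int {
	// Rotate right by 8*keyPos bits so key[keyPos] becomes the least
	// significant byte; this replaces the byte-by-byte aligning loop.
	k32 := bits.RotateLeft32(binary.LittleEndian.Uint32(key[:]), -8*keyPos)
	k := uint64(k32)<<32 | uint64(k32) // expand to 64 bits: x<<32 | x

	// 8 is a multiple of 4, so this loop leaves keyPos (mod 4) unchanged.
	for len(b) >= 8 {
		binary.LittleEndian.PutUint64(b, binary.LittleEndian.Uint64(b)^k)
		b = b[8:]
	}

	// Tail: rotate the key right a byte at a time instead of indexing it.
	for i := range b {
		b[i] ^= byte(k)
		k = bits.RotateLeft64(k, -8)
		keyPos++
	}
	return keyPos & 3
}
```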
See golang/go#31586 (comment). Thanks @renthraysk. Now it's faster than basic XOR at every byte size greater than 3 on little endian amd64 machines.
https://github.com/nhooyr/websocket/blob/646e967341551f499e7fc5a2286d02c9c0c238b7/frame.go#L325-L443
Faster than basic XOR at even 3 bytes :) Thanks @renthraysk |
See golang/go#31586 (comment). Thanks @renthraysk.

```
benchmark                    old MB/s   new MB/s   speedup
BenchmarkXOR/2/basic-8         504.71     569.72   1.13x
BenchmarkXOR/2/fast-8          470.88     492.61   1.05x
BenchmarkXOR/3/basic-8         570.57     731.94   1.28x
BenchmarkXOR/3/fast-8          602.24     719.25   1.19x
BenchmarkXOR/4/basic-8         488.38     831.70   1.70x
BenchmarkXOR/4/fast-8          718.82    1186.64   1.65x
BenchmarkXOR/8/basic-8         954.59    1116.38   1.17x
BenchmarkXOR/8/fast-8         1027.60    1718.71   1.67x
BenchmarkXOR/16/basic-8       1185.95    1340.08   1.13x
BenchmarkXOR/16/fast-8        1413.31    3430.46   2.43x
BenchmarkXOR/32/basic-8       1496.85    1447.83   0.97x
BenchmarkXOR/32/fast-8        2701.81    5585.42   2.07x
BenchmarkXOR/128/basic-8      1538.28    1467.95   0.95x
BenchmarkXOR/128/fast-8       7757.97   13432.37   1.73x
BenchmarkXOR/512/basic-8      1637.51    1539.67   0.94x
BenchmarkXOR/512/fast-8      15155.03   18797.79   1.24x
BenchmarkXOR/4096/basic-8     1677.37    1636.90   0.98x
BenchmarkXOR/4096/fast-8     20689.95   20334.61   0.98x
BenchmarkXOR/16384/basic-8    1688.46    1624.81   0.96x
BenchmarkXOR/16384/fast-8    21687.87   21613.94   1.00x
```

Now it's faster than basic XOR at every byte size greater than 2 on little endian amd64 machines.
Another optimisation: the current code seems to be producing a bounds check per PutUint64(). Instead of passing an open-ended slice b[x:] to PutUint64(), using b[x:x+8] will eliminate them. |
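Concretely, applying that to the 32-byte unrolled loop from earlier (a sketch): fixed-length slice expressions like b[8:16] let the compiler prove each access covers exactly 8 bytes, so the bounds check inside PutUint64 can be eliminated.

```go
for len(b) >= 32 {
	v := binary.LittleEndian.Uint64(b[0:8])
	binary.LittleEndian.PutUint64(b[0:8], v^k)
	v = binary.LittleEndian.Uint64(b[8:16])
	binary.LittleEndian.PutUint64(b[8:16], v^k)
	v = binary.LittleEndian.Uint64(b[16:24])
	binary.LittleEndian.PutUint64(b[16:24], v^k)
	v = binary.LittleEndian.Uint64(b[24:32])
	binary.LittleEndian.PutUint64(b[24:32], v^k)
	b = b[32:]
}
```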
Thanks. Will give it a shot at some point. coder/websocket#171 (comment) |
@renthraysk I wonder if the bounds checks could be eliminated if I reverse the order in which the uint64s are xored in each loop. Thus, after the first bounds check, the rest of the PutUint64's in each loop would not need a bounds check. |
Did not have any effect. Going to go with what you suggested instead. |
About a 20% speedup with values >= 128 bytes with your suggestion implemented. 🚀 🔥 Thanks again. |
Yeah, reordering wouldn't have made a difference. It was the _ = b[7] line in PutUint64() that was causing the bounds checks. The compiler failed to prove it's not needed with PutUint64(b[8:], x); however, it does eliminate the first check in PutUint64(b, x). Looks like a deficiency in the bounds-check elimination. |
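For reference, the standard library's little-endian PutUint64 opens with exactly that hint (as it appears in encoding/binary around this Go release):

```go
func (littleEndian) PutUint64(b []byte, v uint64) {
	_ = b[7] // early bounds check to guarantee safety of writes below
	b[0] = byte(v)
	b[1] = byte(v >> 8)
	b[2] = byte(v >> 16)
	b[3] = byte(v >> 24)
	b[4] = byte(v >> 32)
	b[5] = byte(v >> 40)
	b[6] = byte(v >> 48)
	b[7] = byte(v >> 56)
}
```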
cc @zdjones |
Opened #35483 |
In the WebSocket protocol, clients have to generate a random masking key for every frame sent to the server and XOR mask the frame payload with the key. The server will unmask it on retrieval.
The mask/unmask function looks like this: https://github.com/gorilla/websocket/blob/0ec3d1bd7fe50c503d6df98ee649d81f4857c564/mask_safe.go#L9-L15
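Roughly, that byte-by-byte version is (a sketch of the same algorithm, not the exact gorilla code):

```go
func maskBytes(key [4]byte, pos int, b []byte) int {
	for i := range b {
		b[i] ^= key[pos&3]
		pos++
	}
	return pos & 3
}
```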
This is masking byte by byte, but it can be made significantly faster if the masking is done word by word. Unfortunately, the Go compiler does not do this automatically, so you have to use unsafe and write this yourself: https://github.com/gorilla/websocket/blob/0ec3d1bd7fe50c503d6df98ee649d81f4857c564/mask.go#L13-L54
Benchmarks at gorilla/websocket#31
It would be nice if the Go compiler did this automatically.