TypedArrays: Implement swizzling with compiler intrinsics #4522

Closed
wants to merge 3 commits into from

4 participants

@deanm

I have tested on LLVM, I will hopefully have a chance soon to look at MSVC's output.

Besides a performance / generated code improvement this should fix swizzling on big endian machines.

@deanm deanm TypedArrays: Implement swizzling with compiler intrinsics and be awar…
…e of the

native endianness to correctly swap on big endian machines.

This introduces a template function to swap the bytes of a value, and macros
for the low level swap (taking advantage of gcc and msvc intrinsics).  This
produces code like the following (comments are mine):

gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)

setValue<double>:
  movd  %xmm0, %rax         ; fp reg -> gen reg
  bswapq  %rax              ; 64-bit byte swap
  movd  %rax, %xmm0         ; gen reg -> fp reg
  movq  %xmm0, (%r15,%r12)  ; store
ebad743
@tjfontaine

I worry about the qemu comment, to my knowledge qemu is gpl, not saying that the change isn't straight forward there but

@deanm

@piscisaureus What do you mean? The patch already uses MSVC intrinsics.

@deanm

@tjfontaine I can rewrite the macros, but there is really only one right way to write them, so basically I would just be renaming the variables and changing the whitespace.

@piscisaureus

@deanm

@piscisaureus What do you mean? The patch already uses MSVC intrinsics.

Sorry, overlooked. Ignore my comment :-)

@deanm deanm TypedArrays: Implement load and store swizzling operations.
This reduces an unneeded back and forth between types and additionally keeps
the value in the swappable type until it is swapped.  This is important for
correctness when dealing with floating point, to avoid the possibility of
loading the bits of a signaling NaN (because it isn't yet swapped) into the FPU.

This additionally produces better code (comments are mine):

gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)

setValue<double>:
  movd  %xmm0, %rax         ; fp reg -> gen reg
  bswapq  %rax              ; 64-bit byte swap
  movq  %rax, (%r15,%r12)   ; store
e0f81ee
@deanm

I just tested and everything compiles fine on MSVC 2010. I had a look at the assembly dead listing. The generated code overall looks good. It is hard to directly compare (since I was looking at the 64-bit GCC output, but 32-bit for msvc). The FPU code looks a little bit more complicated, but if you look at setGeneric<int>, it is the simplest of what you would expect (all of the memcpy / type juggling is properly compiled away). Comments are mine:

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.30319.01 for 80x86

call    ?Int32Value@Value@v8@@QBEHXZ        ; call v8::Value::Int32Value on the input argument
test    bl, bl
jne SHORT $LN2@setGeneric@6                 ; test if we need to swizzle (little_endian)
bswap   eax                                 ; bswap on the int32
$LN2@setGeneric@6:
mov DWORD PTR [edi+ebp], eax                ; store

So I would say things look quite good on Windows also.

@bnoordhuis bnoordhuis commented on the diff Jan 8, 2013
src/v8_typed_array_bswap.h
+#define V8_TYPED_ARRAY_BSWAP16(x) \
+({ \
+ uint16_t __x = (x); \
+ ((uint16_t)( \
+ (((uint16_t)(__x) & (uint16_t)0x00ffU) << 8) | \
+ (((uint16_t)(__x) & (uint16_t)0xff00U) >> 8) )); \
+})
+#endif
+#endif
+
+
+namespace v8_typed_array {
+
+template <typename T>
+inline T SwapBytes(T x) {
+ typedef char NoSwapBytesForType[sizeof(T) == 0 ? 1 : -1];
@bnoordhuis
Node.js Foundation member

How can sizeof(T) ever be 0?

That aside, there is a STATIC_ASSERT macro in node_internals.h that gcc 4.8 doesn't complain about it (it doesn't like unused typedefs).

@deanm
deanm added a note Jan 8, 2013

No, it should never be 0. The idea is to error if that template is ever used, since we only want the specialized versions to be used.

I am trying to keep the TypedArrays code as independent from the rest of node as possible (since in theory it could be used by other v8 projects). I can replace the static_assert with the sizeof() style though.

@deanm
deanm added a note Jan 8, 2013

Although v8's STATIC_ASSERT is implemented with a typedef, does it have problems with gcc 4.8?

@bnoordhuis
Node.js Foundation member

Yes. I have a half-finished patch to make V8 compile with gcc 4.8 and that's one of the things it addresses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@bnoordhuis bnoordhuis and 1 other commented on an outdated diff Jan 8, 2013
src/v8_typed_array_bswap.h
+#else
+ #include <stdint.h>
+#endif
+
+#if defined (_MSC_VER)
+#define V8_TYPED_ARRAY_BSWAP16 _byteswap_ushort
+#define V8_TYPED_ARRAY_BSWAP32 _byteswap_ulong
+#define V8_TYPED_ARRAY_BSWAP64 _byteswap_uint64
+#else
+// We should be able to assume GCC here (and can use ULL constants, etc).
+// Fallback swap macros taken from QEMU bswap.h
+#if defined(__has_builtin) && __has_builtin(__builtin_bswap64)
+#define V8_TYPED_ARRAY_BSWAP64 __builtin_bswap64
+#else
+#define V8_TYPED_ARRAY_BSWAP64(x) \
+({ \
@bnoordhuis
Node.js Foundation member

gcc-ism. Not the end of the world but it's easily avoided by wrapping it in a do { ... } while (0) block.

@deanm
deanm added a note Jan 8, 2013

This code will only run on GCC. With the current code the macro is meant as an expression, meaning you do something like x = SWAP(x); If you wrapped this in a do { } while() it could only be a statement, but this isn't how the intrinsics or macros are designed to be used.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@bnoordhuis
Node.js Foundation member

I can't test the patch (I probably broke it with this afternoon's changes) but it looks good in general. I wonder if the complexity is necessary though. Compilers are generally pretty good at detecting byte swap patterns. Consider this snippet:

#include <stdio.h>
#include <stdlib.h>
int main(int, char** argv)
{
  // introduce side effect so gcc doesn't constant-fold it
  union { int i; short s[2]; } u = { argv[1] ? atoi(argv[1]) : 0x1234 };
  short s = u.s[0];
  u.s[0] = u.s[1];
  u.s[1] = s;
  printf("%x\n", u.i);
}

At -O2, gcc is smart enough to replace the swap with a single rol $16, %edi.

@deanm

@bnoordhuis In general I agree, but I have to assume that GCC added the byteswap intrinsics for a reason. I did some reading about it originally and there should be situations where the intrinsics can do better. I actually think this part of the code is not so complicated, basically just MSVC intrinsics, GCC intrinsics, or fallback to the C implementation.

@bnoordhuis
Node.js Foundation member

It fails to compile for me with gcc 4.6.3.

$ ninja -C out/Release
ninja: Entering directory `out/Release'
[19/30] CXX obj/src/node.v8_typed_array.o
FAILED: g++ -MMD -MF obj/src/node.v8_typed_array.o.d -DNODE_WANT_INTERNALS=1 '-DARCH="x64"' '-DPLATFORM="linux"' '-DNODE_TAG=""' -DHAVE_OPENSSL=1 -D__POSIX__ -D_LARGEFILE_SOURCE -D_FILE_OFFSo
In file included from ../../src/v8_typed_array.cc:26:0:
../../src/v8_typed_array_bswap.h:62:44: error: missing binary operator before token "("
../../src/v8_typed_array_bswap.h:80:44: error: missing binary operator before token "("
../../src/v8_typed_array_bswap.h:94:44: error: missing binary operator before token "("
[19/30] CXX obj/src/node.node_crypto.o

That's the __has_builtin check (whick AFAIK is a clang-ism, not a gcc-ism).

Branch: https://github.com/bnoordhuis/node/compare/issue4522

@deanm

Sorry about that. I have updated the PR with hopefully the simplest solution (although you were right in previously saying that there is some additional complexity that comes with using the intrinsics). I don't actually have a non-llvm compiler to test on right now, but I'm hoping the updated patch should be right.

Thanks

@bnoordhuis
Node.js Foundation member

Thanks Dean, landed in 46a489b and c207d40.

@bnoordhuis bnoordhuis closed this Jan 9, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment