src: fix build failure for clang 3.2; consolidate byte swapping code; fix buffer writes for unaligned ucs2 strings #7645

Closed
@zbjornson
Contributor

commented Jul 10, 2016

Fixes #7618
Alternative to #7644 that preserves performance.

Also consolidates all byte-swapping code (following what bnoordhuis did in #7644).

@gibfahn

Member

commented Jul 11, 2016

Couldn't get this to build on OSX 10.8 and Clang 3.2 (commit cc11a96)

  c++ '-D_DARWIN_USE_64_BIT_INODE=1' '-DNODE_ARCH="x64"' '-DNODE_WANT_INTERNALS=1' '-DV8_DEPRECATION_WARNINGS=1' '-DNODE_USE_V8_PLATFORM=1' '-DNODE_HAVE_I18N_SUPPORT=1' '-DNODE_HAVE_SMALL_ICU=1' '-DHAVE_INSPECTOR=1' '-DV8_INSPECTOR_USE_STL=1' '-DHAVE_OPENSSL=1' '-DHAVE_DTRACE=1' '-D__POSIX__' '-DNODE_PLATFORM="darwin"' '-DUCONFIG_NO_TRANSLITERATION=1' '-DUCONFIG_NO_SERVICE=1' '-DUCONFIG_NO_REGULAR_EXPRESSIONS=1' '-DU_ENABLE_DYLOAD=0' '-DU_STATIC_IMPLEMENTATION=1' '-DU_HAVE_STD_STRING=0' '-DUCONFIG_NO_BREAK_ITERATION=0' '-DUCONFIG_NO_LEGACY_CONVERSION=1' '-DUCONFIG_NO_CONVERSION=1' '-DHTTP_PARSER_STRICT=0' '-D_LARGEFILE_SOURCE' '-D_FILE_OFFSET_BITS=64' -I../src -I../tools/msvs/genfiles -I../deps/uv/src/ares -I/home/gib/node/out/Release/obj/gen -I../deps/v8_inspector -I../deps/v8_inspector/deps/wtf -I/home/gib/node/out/Release/obj/gen/blink -I../deps/v8/include -I../deps/icu-small/source/i18n -I../deps/icu-small/source/common -I../deps/openssl/openssl/include -I../deps/zlib -I../deps/http_parser -I../deps/cares/include -I../deps/uv/include  -Os -gdwarf-2 -mmacosx-version-min=10.7 -arch x86_64 -Wall -Wendif-labels -W -Wno-unused-parameter -std=gnu++0x -fno-rtti -fno-exceptions -fno-threadsafe-statics -fno-strict-aliasing -MMD -MF /home/gib/node/out/Release/.deps//home/gib/node/out/Release/obj.target/node/src/async-wrap.o.d.raw   -c -o /home/gib/node/out/Release/obj.target/node/src/async-wrap.o ../src/async-wrap.cc
/include -I../deps/uv/include  -Os -gdwarf-2 -mmacosx-version-min=10.7 -arch x86_64 -Wall -Wendif-labels -W -Wno-unused-parameter -std=gnu++0x -fno-rtti -fno-exceptions -fno-threadsafe-statics -fno-strict-aliasing -MMD -MF /home/gib/node/out/Release/.deps//home/gib/node/out/Release/obj.target/node/src/handle_wrap.o.d.raw   -c -o /home/gib/node/out/Release/obj.target/node/src/handle_wrap.o ../src/handle_wrap.cc
In file included from ../test/cctest/util.cc:2:
../src/util-inl.h:250:19: error: use of undeclared identifier '__builtin_bswap16'
      data16[i] = BSWAP_INTRINSIC_2(data16[i]);
                  ^
../src/util-inl.h:15:30: note: expanded from macro 'BSWAP_INTRINSIC_2'
#define BSWAP_INTRINSIC_2(x) __builtin_bswap16(x)
                             ^
In file included from ../src/inspector_socket.cc:3:
../src/util-inl.h:250:19: error: use of undeclared identifier '__builtin_bswap16'
      data16[i] = BSWAP_INTRINSIC_2(data16[i]);
                  ^
../src/util-inl.h:15:30: note: expanded from macro 'BSWAP_INTRINSIC_2'
#define BSWAP_INTRINSIC_2(x) __builtin_bswap16(x)
                             ^
In file included from ../src/env.cc:1:
In file included from ../src/env.h:7:
In file included from ../src/debug-agent.h:29:
../src/util-inl.h:250:19: error: use of undeclared identifier '__builtin_bswap16'
      data16[i] = BSWAP_INTRINSIC_2(data16[i]);
                  ^
../src/util-inl.h:15:30: note: expanded from macro 'BSWAP_INTRINSIC_2'
#define BSWAP_INTRINSIC_2(x) __builtin_bswap16(x)
                             ^
In file included from ../src/cares_wrap.cc:4:
In file included from ../src/async-wrap-inl.h:8:
In file included from ../src/base-object-inl.h:7:
In file included from ../src/env.h:7:
In file included from ../src/debug-agent.h:29:
../src/util-inl.h:250:19: error: use of undeclared identifier '__builtin_bswap16'
      data16[i] = BSWAP_INTRINSIC_2(data16[i]);
                  ^
../src/util-inl.h:15:30: note: expanded from macro 'BSWAP_INTRINSIC_2'
#define BSWAP_INTRINSIC_2(x) __builtin_bswap16(x)
                             ^
In file included from ../src/debug-agent.cc:22:
In file included from ../src/debug-agent.h:29:
../src/util-inl.h:250:19: error: use of undeclared identifier '__builtin_bswap16'
      data16[i] = BSWAP_INTRINSIC_2(data16[i]);
                  ^
../src/util-inl.h:15:30: note: expanded from macro 'BSWAP_INTRINSIC_2'
#define BSWAP_INTRINSIC_2(x) __builtin_bswap16(x)
                             ^
In file included from ../src/fs_event_wrap.cc:2:
In file included from ../src/async-wrap-inl.h:8:
In file included from ../src/base-object-inl.h:7:
In file included from ../src/env.h:7:
In file included from ../src/debug-agent.h:29:
../src/util-inl.h:250:19: error: use of undeclared identifier '__builtin_bswap16'
      data16[i] = BSWAP_INTRINSIC_2(data16[i]);
                  ^
../src/util-inl.h:15:30: note: expanded from macro 'BSWAP_INTRINSIC_2'
#define BSWAP_INTRINSIC_2(x) __builtin_bswap16(x)
                             ^
In file included from ../src/async-wrap.cc:2:
In file included from ../src/async-wrap-inl.h:8:
In file included from ../src/base-object-inl.h:7:
In file included from ../src/env.h:7:
In file included from ../src/debug-agent.h:29:
../src/util-inl.h:250:19: error: use of undeclared identifier '__builtin_bswap16'
      data16[i] = BSWAP_INTRINSIC_2(data16[i]);
                  ^
../src/util-inl.h:15:30: note: expanded from macro 'BSWAP_INTRINSIC_2'
#define BSWAP_INTRINSIC_2(x) __builtin_bswap16(x)
                             ^
In file included from ../src/handle_wrap.cc:3:
In file included from ../src/async-wrap-inl.h:8:
In file included from ../src/base-object-inl.h:7:
In file included from ../src/env.h:7:
In file included from ../src/debug-agent.h:29:
../src/util-inl.h:250:19: error: use of undeclared identifier '__builtin_bswap16'
      data16[i] = BSWAP_INTRINSIC_2(data16[i]);
                  ^
../src/util-inl.h:15:30: note: expanded from macro 'BSWAP_INTRINSIC_2'
#define BSWAP_INTRINSIC_2(x) __builtin_bswap16(x)
                             ^

EDIT: I ran make clean && ./configure && make -j8 CXX.host=c++. I'm happy to help with debugging @zbjornson (assuming I didn't somehow configure the build wrong). The same process worked with #7644 FWIW.

@bnoordhuis

src/string_bytes.cc Outdated
// http://nodejs.org/api/buffer.html regarding Node's "ucs2"
// encoding specification
if (IsBigEndian())
SwapBytes16(const_cast<char*>(reinterpret_cast<const char*>(buf)), buflen);

@bnoordhuis

bnoordhuis Jul 11, 2016

Member

In-place modification of a const buffer is definitely not allowed.

@gibfahn

gibfahn Jul 11, 2016

Member

@bnoordhuis Does that mean this isn't possible?

@zbjornson

zbjornson Jul 11, 2016

Author Contributor

No, I just botched and misread the pointer reassignment in the original code. Will fix. Thanks for catching @bnoordhuis.

@zbjornson

Contributor Author

commented Jul 11, 2016

Sorry, turns out clang also defines __GNUC__ -- fixed. Also fixed the const buffer problem. Can you try again please? Thanks for your help @gibfahn.
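
Because clang also defines __GNUC__, that macro alone cannot tell you whether a given builtin exists. The following is a minimal sketch of one way to detect the builtin directly; it is not the code from this PR, and the names EXAMPLE_BSWAP_2 and ExampleSwap16 are hypothetical.

```cpp
#include <cstdint>

// __has_builtin is a clang extension (newer GCC releases provide it too).
// Define a fallback so the check below is harmless on other compilers.
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

#if defined(_MSC_VER)
#include <stdlib.h>  // declares _byteswap_ushort
#define EXAMPLE_BSWAP_2(x) _byteswap_ushort(x)
#elif __has_builtin(__builtin_bswap16)
#define EXAMPLE_BSWAP_2(x) __builtin_bswap16(x)
#else
// Portable fallback for compilers that lack (or cannot report) the builtin.
#define EXAMPLE_BSWAP_2(x)                                   \
  static_cast<uint16_t>((static_cast<uint16_t>(x) << 8) |    \
                        (static_cast<uint16_t>(x) >> 8))
#endif

// Hypothetical wrapper: swaps the two bytes of a 16-bit value.
inline uint16_t ExampleSwap16(uint16_t v) {
  return static_cast<uint16_t>(EXAMPLE_BSWAP_2(v));
}
```

With this arrangement, a compiler that has the builtin but cannot report it through __has_builtin simply takes the portable branch, so the build does not fail either way.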

@bnoordhuis

Member

commented Jul 11, 2016

@zbjornson First off, thanks for taking the time to work on this. I don't think this PR is the way to go but that doesn't mean I don't appreciate you working on it.

Knowing what I know of clang and gcc (and some quick tests confirm that), I don't think it's the use of intrinsics/builtins that is responsible for the performance improvement, it's that the compiler can:

  1. Assume proper alignment, and
  2. Not worry about src and dst aliasing or overlapping.

If I rewrite #7644 to hint the same conditions to the compiler, I get comparable numbers on benchmark/buffers/buffer-swap.js in the aligned case. The compiler even emits the same machine code as when using the builtins.

(I still need to look into the unaligned case. Perhaps there is something we can tweak there as well.)
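
To illustrate the two conditions above, here is a minimal sketch of an in-place swap that works only on char. With a single pointer there is no separate src and dst that could overlap. It is not the code from #7644; the name ExampleSwapBytes16 is hypothetical, and whether a compiler turns the loop into a byte-swap or shuffle instruction depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstddef>

// Swaps each adjacent pair of bytes in place.
// Operating on char only sidesteps strict-aliasing questions entirely.
inline void ExampleSwapBytes16(char* data, size_t nbytes) {
  assert(nbytes % 2 == 0);  // caller must pass a whole number of 16-bit units
  for (size_t i = 0; i < nbytes; i += 2) {
    char tmp = data[i];
    data[i] = data[i + 1];
    data[i + 1] = tmp;
  }
}
```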

@zbjornson

Contributor Author

commented Jul 11, 2016

Interesting, okay. Keen to try out your reworked version. When I played with similar code, the compilers would only emit PSHUFB and similar with -O3, whereas the intrinsics always mapped correctly.

@addaleax

Member

commented Jul 11, 2016

> When I played with similar code, the compilers would only emit PSHUFB and similar with -O3

I’d expect that to noticeably outperform the __bswap* intrinsics, but that instruction wouldn’t be available on all x64 CPUs, so at least the Linux binaries wouldn’t be able to leverage support for it.

@zbjornson

Contributor Author

commented Jul 11, 2016

@addaleax err, I meant to say that this type of thing:

char a = data[0];
data[0] = data[1];
data[1] = a;

mapped to PSHUFB only with -O3, whereas the builtins reliably mapped to PSHUFB without -O3 (and afaik node is compiled with -Os and /Od).

@addaleax

Member

commented Jul 11, 2016

@zbjornson Ah – Either way, be aware that, without CPU detection or extra compiler flags, pshufb is something that won’t end up in Linux release builds, so I’d be careful with performance measurements on Mac (I’m assuming from the above that you are using a Mac).

> and afaik node is compiled with -Os and /Od

Pretty sure it’s -O3 by default:

'cflags': [ '-O3' ],

@zbjornson

Contributor Author

commented Jul 11, 2016

@addaleax ahh I'd been trying to find that info about whether or not AVX/SSE extensions will be used in release for quite some time, thanks! If that's the case, then it makes sense to use the smaller code from #7644. (I'd been benchmarking on Windows.)

You're right on -O3. (The first line in #7645 (comment) has -Os for some reason, but a normal release build shows -O3.)

@gibfahn

Member

commented Jul 12, 2016

@zbjornson This PR is now building cleanly on my 10.8 machine (d6147aef8e2e9511149cba5809e330fd318e6e38)

@jasnell

Member

commented Aug 8, 2016

@addaleax ... any further thoughts on this one?

@zbjornson

Contributor Author

commented Aug 8, 2016

Note that I want to replace the std::swap calls with the char a = data[0]; data[0] = data[1]; data[1] = a; dance because it's faster. I delayed making that change because I wasn't sure if this would get merged.

I think the fact that this is substantially faster on at least Windows, and potentially on custom builds on other platforms (to be tested), is a good reason to pursue this PR. I don't see any downsides at least.

I can make the above change and do more benchmarking this week.

@zbjornson zbjornson force-pushed the zbjornson:7618-old-clang-support branch Aug 25, 2016

@Trott Trott force-pushed the nodejs:master branch to c5ce7f4 Sep 21, 2016

@zbjornson zbjornson force-pushed the zbjornson:7618-old-clang-support branch 2 times, most recently Sep 22, 2016

@gibfahn

Member

commented Sep 22, 2016

@bnoordhuis So does this updated PR seem better to you?

ref: #7618 (comment)

src/string_bytes.cc Outdated
// encoding specification
dst.resize(buflen);
SwapBytes(&dst[0], buf, buflen);
std::vector<uint16_t> dst (buf, buf + buflen);

@bnoordhuis

bnoordhuis Sep 22, 2016

Member

Nit: no space before (.

src/string_bytes.cc Outdated
dst.resize(buflen);
SwapBytes(&dst[0], buf, buflen);
std::vector<uint16_t> dst (buf, buf + buflen);
size_t nbytes = buflen * sizeof(uint16_t);

@bnoordhuis

bnoordhuis Sep 22, 2016

Member

Consider writing it as sizeof(dst[0]).
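
A small sketch of the suggestion, assuming a std::vector<uint16_t> like the one in the snippet above (the function name ExampleByteLength is hypothetical). Deriving the size from the element keeps the byte count correct if the element type ever changes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Byte length of the vector's contents. sizeof(dst[0]) is an unevaluated
// operand, so this is safe even when the vector is empty.
inline size_t ExampleByteLength(const std::vector<uint16_t>& dst) {
  return dst.size() * sizeof(dst[0]);
}
```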

src/util-inl.h Outdated
#define SWAP(a, b) \
tmp = data[a]; \
data[a] = data[b]; \
data[b] = tmp;

@bnoordhuis

bnoordhuis Sep 22, 2016

Member

Is this actually faster than std::swap or an inline function?

src/util-inl.h Outdated
int align = reinterpret_cast<uintptr_t>(data) % sizeof(uint16_t);

if (align == 0) {
uint16_t* data16 = reinterpret_cast<uint16_t*>(data);

@bnoordhuis

bnoordhuis Sep 22, 2016

Member

This is a strict-aliasing violation, strictly speaking. My PR operates on just char for that reason.

@zbjornson zbjornson force-pushed the zbjornson:7618-old-clang-support branch Sep 22, 2016

@zbjornson

Contributor Author

commented Sep 22, 2016

Thanks for your review @bnoordhuis and again sorry that this small thing has dragged on. Revisions submitted.

As far as the strict aliasing violation you pointed out:

I admit that this confused me. When I've asked about reading from char buffers in particular, I get answers along the lines of "it's not UB or a SA violation," "it looks like one but it isn't because you assume that the original pointer is a uint16_t" or "it is but everyone does it" (which seems to be the most correct of the three) (see ref1, ref2 (and later replies)).

Note that the violation already existed in string_bytes.cc and this PR just relocates it:
https://github.com/nodejs/node/pull/7645/files#diff-aab3e751fbd702712b90a419b21b58aeL324

As far as avoiding the violation:

  • Your method (memcpy) compiles the same as a reinterpret_cast by GCC for -Os, -O1 and -O2, but it prevents loop unrolling at -O3 (node's default) in the case of swap16. That's why there's the 2M/400k difference in the benchmarks. For swap32 and swap64 they are equivalent.
  • MSVC doesn't "have" global strict aliasing (i.e. it behaves like -fno-strict-aliasing), so while it's not spec-compliant there is no UB to worry about as far as I know; we've already checked alignment and don't care about endianness. Using the memcpy method kills perf.

Thus, in the latest revision, I use reinterpret_cast on Windows and have eliminated it everywhere else. What do you think about that?

That yields these benchmarks, which are about as good as they get (with node's default build config) aside from aligned 16 on linux (per above):
[Image: benchmark results]
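
For reference, the memcpy approach discussed in this comment can be sketched roughly as follows. This is an illustration rather than the PR's actual code: the name ExampleSwap16Memcpy is hypothetical, and a plain shift stands in for whatever byte-swap macro the real code uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Swaps each 16-bit unit in place without casting the char buffer to
// uint16_t*, so it neither violates strict aliasing nor requires alignment.
inline void ExampleSwap16Memcpy(char* data, size_t nbytes) {
  assert(nbytes % sizeof(uint16_t) == 0);
  for (size_t i = 0; i < nbytes; i += sizeof(uint16_t)) {
    uint16_t temp;
    std::memcpy(&temp, &data[i], sizeof(temp));  // load via memcpy
    temp = static_cast<uint16_t>((temp << 8) | (temp >> 8));
    std::memcpy(&data[i], &temp, sizeof(temp));  // store via memcpy
  }
}
```

Because the value is loaded and stored with memcpy, the function can be called on a buffer that starts at an odd address.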

@zbjornson zbjornson force-pushed the zbjornson:7618-old-clang-support branch Sep 22, 2016

@zbjornson zbjornson changed the title src: fix build failure for clang 3.2 by checking for builtin presence src: fix build failure for clang 3.2; consolidate byte swapping code Sep 22, 2016

@bnoordhuis

Member

commented Sep 23, 2016

"it is but everyone does it" (which seems to be the most correct of the three)

It is and that is why node.js is currently built with -fno-strict-aliasing. Still, I'd prefer to avoid aliasing if reasonably possible.

"it looks like one but it isn't because you assume that the original pointer is a uint16_t"

That is not what the spec says and not how gcc and clang operate when -fstrict-aliasing is in effect. The rule is unambiguous: no pointer can alias another pointer unless the alias is of type char*.

Think of it like this: when strict aliasing is in effect, and when there are no char* pointers in scope, the compiler can assume that values it is trying to read or write through a pointer will not change underneath it.

With -fno-strict-aliasing, gcc and clang conservatively assume that every pointer is being aliased somewhere unless it is trivial to prove otherwise; it's infeasible to track aliasing program-wide.

(Incidentally, that is why the return values of malloc, calloc and operator new are marked as noalias. They logically can't alias other pointers but if the compiler didn't know that, it would have to assume the worst and emit significantly worse code around dynamic allocations. Apparently there are systems that violate that assumption because clang++ has a -fno-sane-operator-new switch, but I digress.)
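
A minimal sketch of the assumption described above (the function name ExampleStoreAndReload is hypothetical). With -fstrict-aliasing, the compiler may assume the float store cannot change the int, so it is free to return the constant 1 without reloading. With -fno-strict-aliasing it has to reload. The call shown in the test uses two distinct objects, so the behaviour is well defined either way.

```cpp
// Under strict aliasing, an int* and a float* are assumed to point to
// different objects, so the store through f cannot affect *i.
inline int ExampleStoreAndReload(int* i, float* f) {
  *i = 1;
  *f = 2.0f;
  return *i;  // may be folded to "return 1" when strict aliasing is on
}
```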

src/util-inl.h Outdated
if (align == 0) {
// MSVC has no strict aliasing, and is able to highly optimize this case.
uint16_t* data16 = reinterpret_cast<uint16_t*>(data);
size_t len16 = nbytes / sizeof(uint16_t);

@bnoordhuis

bnoordhuis Sep 23, 2016

Member

Can you write this as sizeof(*data16)? EDIT: Here and in the other functions.

src/util-inl.h Outdated
memcpy(&temp, &data[i], sizeof(uint16_t));
temp = BSWAP_2(temp);
memcpy(&data[i], &temp, sizeof(uint16_t));
}

@bnoordhuis

bnoordhuis Sep 23, 2016

Member

Can you use sizeof(temp) in this block and the blocks below? LGTM apart from that.

@zbjornson

Contributor Author

commented Sep 25, 2016

Reviewers: fyi, the one change since Ben gave his LGTM was moving dst up a scope (out of if (IsBigEndian()) {}), back to where it was originally. (And adding the test in the 2nd commit.)

@addaleax

Member

commented Sep 25, 2016

LGTM, thanks for @mentioning me!

@bnoordhuis
Member

left a comment

Still LGTM but the commit logs should conform to the style guide.

Did you check what code paths clang and gcc emit? In the other PR, I had to prove that src == dst and had proper alignment before the compiler generated the fast case.

@zbjornson zbjornson force-pushed the zbjornson:7618-old-clang-support branch Sep 26, 2016

@zbjornson

Contributor Author

commented Sep 26, 2016

Fixed commit text.

> Did you check what code paths clang and gcc emit?

Surprisingly it doesn't appear necessary to give those hints to the compilers in this incarnation for clang or gcc to emit bswap and nothing detrimental around it (https://godbolt.org/g/DxMNmF). The linux benchmarks from this PR match the BMs for #7644 as well (see #7645 (comment)).

@zbjornson

Contributor Author

commented Sep 26, 2016

Is this CI situation normal or am I cursed?!

@addaleax

Member

commented Sep 26, 2016

Ahem, yes, let’s give this another shot: https://ci.nodejs.org/job/node-test-commit/5328/

(It is, unfortunately, normal and the act of complaining about the general brokenness of CI is part of what forms the common core collaborator identity. Welcome to the club! :b)

@jasnell
Member

left a comment

LGTM

@zbjornson

Contributor Author

commented Sep 30, 2016

More CI attempts needed to land?

@lpinca

Member

commented Sep 30, 2016

@lpinca

Member

commented Sep 30, 2016

Oh didn't notice that there is a conflict.

@zbjornson zbjornson force-pushed the zbjornson:7618-old-clang-support branch Sep 30, 2016

zbjornson added some commits Sep 22, 2016

src: fix build for older clang
Removes use of builtins that are unavailable for older clang. Per
benchmarks, only uses builtins on Windows, where speedup is
significant.

Fixes: #7618
test: add test for unaligned ucs2 buffer write
Between #3410 and #7645, bytes were swapped twice on bigendian
platforms if buffer was not two-byte aligned. See comment in #7645.
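
As an illustration of what "swapped exactly once" means for an unaligned destination, here is a sketch that writes 16-bit code units low byte first (Node's "ucs2" encoding is little-endian) regardless of host endianness or destination alignment. It is not the code or the test from this PR, and the name ExampleWriteUcs2LE is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Writes nchars 16-bit code units to dst in little-endian byte order.
// Emitting bytes individually works for any dst alignment and never
// depends on the byte order of the host.
inline void ExampleWriteUcs2LE(char* dst, const uint16_t* src, size_t nchars) {
  for (size_t i = 0; i < nchars; i++) {
    dst[2 * i] = static_cast<char>(src[i] & 0xFF);  // low byte first
    dst[2 * i + 1] = static_cast<char>(src[i] >> 8);  // then high byte
  }
}
```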

@zbjornson zbjornson force-pushed the zbjornson:7618-old-clang-support branch to 2b69933 Sep 30, 2016

@zbjornson

Contributor Author

commented Sep 30, 2016

@lpinca rebased

@lpinca

Member

commented Sep 30, 2016

@gibfahn

Member

commented Oct 2, 2016

@zbjornson

Contributor Author

commented Oct 3, 2016

🎉 7th time's a charm I guess :)

@gibfahn

Member

commented Oct 4, 2016

Okay, I'm going to start landing this.

  • no objections
  • 3 LGTMs
  • Lots of CI

gibfahn added a commit that referenced this pull request Oct 4, 2016

src: fix build for older clang
Removes use of builtins that are unavailable for older clang. Per
benchmarks, only uses builtins on Windows, where speedup is
significant.

Also adds test for unaligned ucs2 buffer write. Between #3410
and #7645, bytes were swapped twice on bigendian platforms if buffer
was not two-byte aligned. See comment in #7645.

PR-URL: #7645
Fixes: #7618
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
@gibfahn

Member

commented Oct 4, 2016

landed in 7420835 , thanks a lot @zbjornson !

@gibfahn gibfahn closed this Oct 4, 2016

@gibfahn

Member

commented Oct 4, 2016

Concerning backporting, I guess this should be backported to wherever #7157 was backported, which is v6 but not v4. If anyone disagrees let me know.

jasnell added a commit that referenced this pull request Oct 6, 2016

src: fix build for older clang
Removes use of builtins that are unavailable for older clang. Per
benchmarks, only uses builtins on Windows, where speedup is
significant.

Also adds test for unaligned ucs2 buffer write. Between #3410
and #7645, bytes were swapped twice on bigendian platforms if buffer
was not two-byte aligned. See comment in #7645.

PR-URL: #7645
Fixes: #7618
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>

Fishrock123 added a commit that referenced this pull request Oct 11, 2016

src: fix build for older clang
Removes use of builtins that are unavailable for older clang. Per
benchmarks, only uses builtins on Windows, where speedup is
significant.

Also adds test for unaligned ucs2 buffer write. Between #3410
and #7645, bytes were swapped twice on bigendian platforms if buffer
was not two-byte aligned. See comment in #7645.

PR-URL: #7645
Fixes: #7618
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>

 Conflicts:
	src/node_buffer.cc