src: fix build failure for clang 3.2; consolidate byte swapping code; fix buffer writes for unaligned ucs2 strings #7645

Closed
wants to merge 2 commits into from

zbjornson
Contributor

Fixes #7618
Alternative to #7644 that preserves performance.

Also consolidates all byte-swapping code (following what bnoordhuis did in #7644).

@nodejs-github-bot nodejs-github-bot added the c++ Issues and PRs that require attention from people who are familiar with C++. label Jul 10, 2016
@addaleax addaleax added the lib / src Issues and PRs related to general changes in the lib or src directory. label Jul 10, 2016
@mscdex mscdex added buffer Issues and PRs related to the buffer subsystem. and removed lib / src Issues and PRs related to general changes in the lib or src directory. labels Jul 10, 2016
@gibfahn
Member

gibfahn commented Jul 11, 2016

Couldn't get this to build on OSX 10.8 and Clang 3.2 (commit cc11a96)

  c++ '-D_DARWIN_USE_64_BIT_INODE=1' '-DNODE_ARCH="x64"' '-DNODE_WANT_INTERNALS=1' '-DV8_DEPRECATION_WARNINGS=1' '-DNODE_USE_V8_PLATFORM=1' '-DNODE_HAVE_I18N_SUPPORT=1' '-DNODE_HAVE_SMALL_ICU=1' '-DHAVE_INSPECTOR=1' '-DV8_INSPECTOR_USE_STL=1' '-DHAVE_OPENSSL=1' '-DHAVE_DTRACE=1' '-D__POSIX__' '-DNODE_PLATFORM="darwin"' '-DUCONFIG_NO_TRANSLITERATION=1' '-DUCONFIG_NO_SERVICE=1' '-DUCONFIG_NO_REGULAR_EXPRESSIONS=1' '-DU_ENABLE_DYLOAD=0' '-DU_STATIC_IMPLEMENTATION=1' '-DU_HAVE_STD_STRING=0' '-DUCONFIG_NO_BREAK_ITERATION=0' '-DUCONFIG_NO_LEGACY_CONVERSION=1' '-DUCONFIG_NO_CONVERSION=1' '-DHTTP_PARSER_STRICT=0' '-D_LARGEFILE_SOURCE' '-D_FILE_OFFSET_BITS=64' -I../src -I../tools/msvs/genfiles -I../deps/uv/src/ares -I/home/gib/node/out/Release/obj/gen -I../deps/v8_inspector -I../deps/v8_inspector/deps/wtf -I/home/gib/node/out/Release/obj/gen/blink -I../deps/v8/include -I../deps/icu-small/source/i18n -I../deps/icu-small/source/common -I../deps/openssl/openssl/include -I../deps/zlib -I../deps/http_parser -I../deps/cares/include -I../deps/uv/include  -Os -gdwarf-2 -mmacosx-version-min=10.7 -arch x86_64 -Wall -Wendif-labels -W -Wno-unused-parameter -std=gnu++0x -fno-rtti -fno-exceptions -fno-threadsafe-statics -fno-strict-aliasing -MMD -MF /home/gib/node/out/Release/.deps//home/gib/node/out/Release/obj.target/node/src/async-wrap.o.d.raw   -c -o /home/gib/node/out/Release/obj.target/node/src/async-wrap.o ../src/async-wrap.cc
/include -I../deps/uv/include  -Os -gdwarf-2 -mmacosx-version-min=10.7 -arch x86_64 -Wall -Wendif-labels -W -Wno-unused-parameter -std=gnu++0x -fno-rtti -fno-exceptions -fno-threadsafe-statics -fno-strict-aliasing -MMD -MF /home/gib/node/out/Release/.deps//home/gib/node/out/Release/obj.target/node/src/handle_wrap.o.d.raw   -c -o /home/gib/node/out/Release/obj.target/node/src/handle_wrap.o ../src/handle_wrap.cc
In file included from ../test/cctest/util.cc:2:
../src/util-inl.h:250:19: error: use of undeclared identifier '__builtin_bswap16'
      data16[i] = BSWAP_INTRINSIC_2(data16[i]);
                  ^
../src/util-inl.h:15:30: note: expanded from macro 'BSWAP_INTRINSIC_2'
#define BSWAP_INTRINSIC_2(x) __builtin_bswap16(x)
                             ^
In file included from ../src/inspector_socket.cc:3:
../src/util-inl.h:250:19: error: use of undeclared identifier '__builtin_bswap16'
      data16[i] = BSWAP_INTRINSIC_2(data16[i]);
                  ^
../src/util-inl.h:15:30: note: expanded from macro 'BSWAP_INTRINSIC_2'
#define BSWAP_INTRINSIC_2(x) __builtin_bswap16(x)
                             ^
In file included from ../src/env.cc:1:
In file included from ../src/env.h:7:
In file included from ../src/debug-agent.h:29:
../src/util-inl.h:250:19: error: use of undeclared identifier '__builtin_bswap16'
      data16[i] = BSWAP_INTRINSIC_2(data16[i]);
                  ^
../src/util-inl.h:15:30: note: expanded from macro 'BSWAP_INTRINSIC_2'
#define BSWAP_INTRINSIC_2(x) __builtin_bswap16(x)
                             ^
In file included from ../src/cares_wrap.cc:4:
In file included from ../src/async-wrap-inl.h:8:
In file included from ../src/base-object-inl.h:7:
In file included from ../src/env.h:7:
In file included from ../src/debug-agent.h:29:
../src/util-inl.h:250:19: error: use of undeclared identifier '__builtin_bswap16'
      data16[i] = BSWAP_INTRINSIC_2(data16[i]);
                  ^
../src/util-inl.h:15:30: note: expanded from macro 'BSWAP_INTRINSIC_2'
#define BSWAP_INTRINSIC_2(x) __builtin_bswap16(x)
                             ^
In file included from ../src/debug-agent.cc:22:
In file included from ../src/debug-agent.h:29:
../src/util-inl.h:250:19: error: use of undeclared identifier '__builtin_bswap16'
      data16[i] = BSWAP_INTRINSIC_2(data16[i]);
                  ^
../src/util-inl.h:15:30: note: expanded from macro 'BSWAP_INTRINSIC_2'
#define BSWAP_INTRINSIC_2(x) __builtin_bswap16(x)
                             ^
In file included from ../src/fs_event_wrap.cc:2:
In file included from ../src/async-wrap-inl.h:8:
In file included from ../src/base-object-inl.h:7:
In file included from ../src/env.h:7:
In file included from ../src/debug-agent.h:29:
../src/util-inl.h:250:19: error: use of undeclared identifier '__builtin_bswap16'
      data16[i] = BSWAP_INTRINSIC_2(data16[i]);
                  ^
../src/util-inl.h:15:30: note: expanded from macro 'BSWAP_INTRINSIC_2'
#define BSWAP_INTRINSIC_2(x) __builtin_bswap16(x)
                             ^
In file included from ../src/async-wrap.cc:2:
In file included from ../src/async-wrap-inl.h:8:
In file included from ../src/base-object-inl.h:7:
In file included from ../src/env.h:7:
In file included from ../src/debug-agent.h:29:
../src/util-inl.h:250:19: error: use of undeclared identifier '__builtin_bswap16'
      data16[i] = BSWAP_INTRINSIC_2(data16[i]);
                  ^
../src/util-inl.h:15:30: note: expanded from macro 'BSWAP_INTRINSIC_2'
#define BSWAP_INTRINSIC_2(x) __builtin_bswap16(x)
                             ^
In file included from ../src/handle_wrap.cc:3:
In file included from ../src/async-wrap-inl.h:8:
In file included from ../src/base-object-inl.h:7:
In file included from ../src/env.h:7:
In file included from ../src/debug-agent.h:29:
../src/util-inl.h:250:19: error: use of undeclared identifier '__builtin_bswap16'
      data16[i] = BSWAP_INTRINSIC_2(data16[i]);
                  ^
../src/util-inl.h:15:30: note: expanded from macro 'BSWAP_INTRINSIC_2'
#define BSWAP_INTRINSIC_2(x) __builtin_bswap16(x)
                             ^

EDIT: I ran make clean && ./configure && make -j8 CXX.host=c++. I'm happy to help with debugging @zbjornson (assuming I didn't somehow configure the build wrong). The same process worked with #7644 FWIW.

// http://nodejs.org/api/buffer.html regarding Node's "ucs2"
// encoding specification
if (IsBigEndian())
  SwapBytes16(const_cast<char*>(reinterpret_cast<const char*>(buf)), buflen);
Member

In-place modification of a const buffer is definitely not allowed.

Member

@bnoordhuis Does that mean this isn't possible?

Contributor Author

No, I just botched it: I misread the pointer reassignment in the original code. Will fix. Thanks for catching that, @bnoordhuis.

@zbjornson
Contributor Author

Sorry, turns out clang also defines __GNUC__ -- fixed. Also fixed the const buffer problem. Can you try again please? Thanks for your help @gibfahn.
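
One way to avoid this class of failure is to test for the builtin itself rather than for the compiler. A minimal sketch, assuming a compiler that either provides `__has_builtin` or lacks it entirely (in which case the portable path is taken); `SketchBswap16` is a hypothetical name, not the helper used in this PR:

```cpp
#include <cstdint>

// Compilers without __has_builtin fall through to the portable path.
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

// Hypothetical helper: swaps the two bytes of a 16-bit value.
inline uint16_t SketchBswap16(uint16_t x) {
#if __has_builtin(__builtin_bswap16)
  return __builtin_bswap16(x);
#else
  // Portable fallback: x is promoted to int, so the cast truncates the
  // result back to 16 bits.
  return static_cast<uint16_t>((x << 8) | (x >> 8));
#endif
}
```

Checking `__GNUC__` alone is not enough here because, as noted above, clang defines it too.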

@bnoordhuis
Member

@zbjornson First off, thanks for taking the time to work on this. I don't think this PR is the way to go but that doesn't mean I don't appreciate you working on it.

Knowing what I know of clang and gcc (and some quick tests confirm this), I don't think the use of intrinsics/builtins is responsible for the performance improvement; rather, the compiler can:

  1. Assume proper alignment, and
  2. Not worry about src and dst aliasing or overlapping.

If I rewrite #7644 to hint the same conditions to the compiler, I get comparable numbers on benchmark/buffers/buffer-swap.js in the aligned case. The compiler even emits the same machine code as when using the builtins.

(I still need to look into the unaligned case. Perhaps there is something we can tweak there as well.)
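
The two conditions listed above can be made explicit in the code. A rough sketch, assuming the swap operates only on `char` (so no aliasing question arises) and that `dst` and `src` are either identical or do not overlap; `SketchSwapBytes16` is a hypothetical name, and whether a given compiler turns the first branch into wide loads and byte swaps is not guaranteed:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper. Assumes nbytes is even and that dst and src are
// either the same pointer or refer to non-overlapping ranges.
inline void SketchSwapBytes16(char* dst, const char* src, size_t nbytes) {
  const bool aligned =
      reinterpret_cast<uintptr_t>(dst) % alignof(uint16_t) == 0;
  if (dst == src && aligned) {
    // In place and aligned: the optimizer can see both facts in this branch.
    for (size_t i = 0; i + 1 < nbytes; i += 2) {
      char tmp = dst[i];
      dst[i] = dst[i + 1];
      dst[i + 1] = tmp;
    }
  } else {
    // General case: read both bytes before writing either.
    for (size_t i = 0; i + 1 < nbytes; i += 2) {
      char lo = src[i];
      char hi = src[i + 1];
      dst[i] = hi;
      dst[i + 1] = lo;
    }
  }
}
```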

@zbjornson
Contributor Author

Interesting, okay. Keen to try out your reworked version. When I played with similar code, the compilers would only emit PSHUFB and similar with -O3, whereas the intrinsics always mapped correctly.

@addaleax
Member

When I played with similar code, the compilers would only emit PSHUFB and similar with -O3

I’d expect that to noticeably outperform the __bswap* intrinsics, but that instruction wouldn’t be available on all x64 CPUs, so at least the Linux binaries wouldn’t be able to leverage support for it.

@zbjornson
Contributor Author

@addaleax err, I meant to say that this type of thing:

char a = data[0];
data[0] = data[1];
data[1] = a;

mapped to PSHUFB only with -O3, whereas the builtins reliably mapped to PSHUFB without -O3 (and afaik node is compiled with -Os and /Od).

@addaleax
Member

addaleax commented Jul 11, 2016

@zbjornson Ah – Either way, be aware that, without CPU detection or extra compiler flags, pshufb is something that won’t end up in Linux release builds, so I’d be careful with performance measurements on Mac (I’m assuming from the above that you are using a Mac).

and afaik node is compiled with -Os and /Od

Pretty sure it’s -O3 by default:

'cflags': [ '-O3' ],

@zbjornson
Contributor Author

@addaleax ahh I'd been trying to find that info about whether or not AVX/SSE extensions will be used in release for quite some time, thanks! If that's the case, then it makes sense to use the smaller code from #7644. (I'd been benchmarking on Windows.)

You're right on -O3. (The first line in #7645 (comment) has -Os for some reason, but a normal release build shows -O3.)

@gibfahn
Member

gibfahn commented Jul 12, 2016

@zbjornson This PR is now building cleanly on my 10.8 machine (d6147aef8e2e9511149cba5809e330fd318e6e38)

@jasnell
Member

jasnell commented Aug 8, 2016

@addaleax ... any further thoughts on this one?

@zbjornson
Contributor Author

Note that I want to replace the std::swap calls with the char a = data[0]; data[0] = data[1]; data[1] = a; dance because it's faster. I delayed making that change because I wasn't sure if this would get merged.

I think the fact that this is substantially faster on at least Windows, and potentially on custom builds on other platforms (to be tested), is a good reason to pursue this PR. I don't see any downsides at least.

I can make the above change and do more benchmarking this week.

@gibfahn
Member

gibfahn commented Sep 22, 2016

@bnoordhuis So does this updated PR seem better to you?

ref: #7618 (comment)

// encoding specification
dst.resize(buflen);
SwapBytes(&dst[0], buf, buflen);
std::vector<uint16_t> dst (buf, buf + buflen);
Member

Nit: no space before (.

dst.resize(buflen);
SwapBytes(&dst[0], buf, buflen);
std::vector<uint16_t> dst (buf, buf + buflen);
size_t nbytes = buflen * sizeof(uint16_t);
Member

Consider writing it as sizeof(dst[0]).

#define SWAP(a, b) \
  tmp = data[a]; \
  data[a] = data[b]; \
  data[b] = tmp;
Member

Is this actually faster than std::swap or an inline function?

int align = reinterpret_cast<uintptr_t>(data) % sizeof(uint16_t);

if (align == 0) {
  uint16_t* data16 = reinterpret_cast<uint16_t*>(data);
Member

This is a strict-aliasing violation, strictly speaking. My PR operates on just char for that reason.

@zbjornson
Contributor Author

zbjornson commented Sep 22, 2016

Thanks for your review @bnoordhuis and again sorry that this small thing has dragged on. Revisions submitted.

As far as the strict aliasing violation you pointed out:

I admit that this confused me. When I've asked about reading from char buffers in particular, I get answers along the lines of "it's not UB or a SA violation," "it looks like one but it isn't because you assume that the original pointer is a uint16_t" or "it is but everyone does it" (which seems to be the most correct of the three) (see ref1, ref2 (and later replies)).

Note that the violation already existed in string_bytes.cc and this PR just relocates it:
https://github.com/nodejs/node/pull/7645/files#diff-aab3e751fbd702712b90a419b21b58aeL324

As far as avoiding the violation:

  • Your method (memcpy) compiles to the same code as a reinterpret_cast under GCC at -Os, -O1 and -O2, but it prevents loop unrolling at -O3 (node's default) in the case of swap16. That's why there's the 2M/400k difference in the benchmarks. For swap32 and swap64 they are equivalent.
  • MSVC doesn't "have" global strict aliasing (i.e. it behaves like -fno-strict-aliasing), so while it's not spec-compliant there is no UB to worry about as far as I know; we've already checked alignment and don't care about endianness. Using the memcpy method kills perf.
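
For reference, the `memcpy` form discussed in the first bullet can be sketched as follows. This is an illustration rather than the code in the PR: `SketchSwap16Memcpy` and `Bswap16` are hypothetical names, and the portable shift-based swap stands in for whatever builtin or macro the real code uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in for a byte-swap primitive (hypothetical name).
inline uint16_t Bswap16(uint16_t x) {
  return static_cast<uint16_t>((x << 8) | (x >> 8));
}

// Swaps each 16-bit unit in place without ever dereferencing a uint16_t*
// that points into a char buffer, so there is no aliasing violation and no
// alignment requirement on data. Assumes nbytes is even.
inline void SketchSwap16Memcpy(char* data, size_t nbytes) {
  for (size_t i = 0; i + sizeof(uint16_t) <= nbytes; i += sizeof(uint16_t)) {
    uint16_t temp;
    std::memcpy(&temp, &data[i], sizeof(temp));
    temp = Bswap16(temp);
    std::memcpy(&data[i], &temp, sizeof(temp));
  }
}
```

The `reinterpret_cast` form replaces the two `memcpy` calls with a direct read and write through a `uint16_t*`, which is only safe after an alignment check and, on compilers that enforce strict aliasing, only with `-fno-strict-aliasing`.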

Thus, in the latest revision, I use reinterpret_cast on Windows and eliminated it for the rest. What do you think about that?

That yields these benchmarks, which are about as good as they get (with node's default build config) aside from aligned 16 on linux (per above):
[image: benchmark results]

@zbjornson zbjornson changed the title src: fix build failure for clang 3.2 by checking for builtin presence src: fix build failure for clang 3.2; consolidate byte swapping code Sep 22, 2016
@bnoordhuis
Member

"it is but everyone does it" (which seems to be the most correct of the three)

It is and that is why node.js is currently built with -fno-strict-aliasing. Still, I'd prefer to avoid aliasing if reasonably possible.

"it looks like one but it isn't because you assume that the original pointer is a uint16_t"

That is not what the spec says and not how gcc and clang operate when -fstrict-aliasing is in effect. The rule is unambiguous: no pointer can alias another pointer unless the alias is of type char*.

Think of it like this: when strict aliasing is in effect, and when there are no char* pointers in scope, the compiler can assume that values it is trying to read or write through a pointer will not change underneath it.

With -fno-strict-aliasing, gcc and clang conservatively assume that every pointer is being aliased somewhere unless it is trivial to prove otherwise; it's infeasible to track aliasing program-wide.

(Incidentally, that is why the return values of malloc, calloc and operator new are marked as noalias. They logically can't alias other pointers but if the compiler didn't know that, it would have to assume the worst and emit significantly worse code around dynamic allocations. Apparently there are systems that violate that assumption because clang++ has a -fno-sane-operator-new switch, but I digress.)
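
A small illustration of the assumption described above, written as a sketch: `SketchStore` is a hypothetical function, and what a given compiler actually emits depends on its version and flags.

```cpp
// Under -fstrict-aliasing the compiler may assume that an int* and a float*
// never refer to the same object, so it is allowed to return the constant 1
// without reloading *i after the store through f. Under
// -fno-strict-aliasing it has to reload *i, because f might alias it.
inline int SketchStore(int* i, float* f) {
  *i = 1;
  *f = 2.0f;
  return *i;
}
```

Called with two distinct objects, both builds return 1; the difference is only in what the compiler is permitted to assume when generating code.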

if (align == 0) {
  // MSVC has no strict aliasing, and is able to highly optimize this case.
  uint16_t* data16 = reinterpret_cast<uint16_t*>(data);
  size_t len16 = nbytes / sizeof(uint16_t);
Member

@bnoordhuis bnoordhuis Sep 23, 2016

Can you write this as sizeof(*data16)? EDIT: Here and in the other functions.

  memcpy(&temp, &data[i], sizeof(uint16_t));
  temp = BSWAP_2(temp);
  memcpy(&data[i], &temp, sizeof(uint16_t));
}
Member

Can you use sizeof(temp) in this block and the blocks below? LGTM apart from that.

@zbjornson
Contributor Author

Reviewers: fyi, the one change since Ben gave his LGTM was moving dst up a scope (out of if (IsBigEndian()) {}), back where it was originally. (And adding the test in the 2nd commit.)

@addaleax
Member

LGTM, thanks for @mentioning me!

Member

@bnoordhuis bnoordhuis left a comment

Still LGTM but the commit logs should conform to the style guide.

Did you check what code paths clang and gcc emit? In the other PR, I had to prove that src == dst and had proper alignment before the compiler generated the fast case.

@zbjornson
Contributor Author

Fixed commit text.

Did you check what code paths clang and gcc emit?

Surprisingly it doesn't appear necessary to give those hints to the compilers in this incarnation for clang or gcc to emit bswap and nothing detrimental around it (https://godbolt.org/g/DxMNmF). The linux benchmarks from this PR match the BMs for #7644 as well (see #7645 (comment)).

@gibfahn
Member

gibfahn commented Sep 26, 2016

@zbjornson
Contributor Author

Is this CI situation normal or am I cursed?!

@addaleax
Member

Ahem, yes, let’s give this another shot: https://ci.nodejs.org/job/node-test-commit/5328/

(It is, unfortunately, normal and the act of complaining about the general brokenness of CI is part of what forms the common core collaborator identity. Welcome to the club! :b)

Member

@jasnell jasnell left a comment

LGTM

@zbjornson
Contributor Author

More CI attempts needed to land?

@lpinca
Member

lpinca commented Sep 30, 2016

@zbjornson started new CI run: https://ci.nodejs.org/job/node-test-pull-request/4337/

@lpinca
Member

lpinca commented Sep 30, 2016

Oh, I didn't notice that there is a conflict.

Removes use of builtins that are unavailable for older clang. Per
benchmarks, only uses builtins on Windows, where speedup is
significant.

Fixes: nodejs#7618
Between nodejs#3410 and nodejs#7645, bytes were swapped twice on bigendian
platforms if buffer was not two-byte aligned. See comment in nodejs#7645.
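
Why a double swap is a bug follows from byte swapping being its own inverse: applying it twice restores the original bytes, so the data ends up unswapped. A minimal illustration, with `SketchSwap16InPlace` as a hypothetical stand-in for the real swap routine:

```cpp
#include <cstddef>

// Hypothetical helper: swaps each adjacent pair of bytes in place.
// Works on char only, so it has no alignment requirement.
inline void SketchSwap16InPlace(char* data, size_t nbytes) {
  for (size_t i = 0; i + 1 < nbytes; i += 2) {
    char tmp = data[i];
    data[i] = data[i + 1];
    data[i + 1] = tmp;
  }
}
```

Calling it once on the bytes `{'a', 0, 'b', 0}` gives `{0, 'a', 0, 'b'}`; calling it a second time gives back the original bytes.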
@zbjornson
Contributor Author

@lpinca rebased

@lpinca
Member

lpinca commented Sep 30, 2016

@gibfahn
Member

gibfahn commented Oct 2, 2016

@zbjornson
Contributor Author

🎉 7th time's a charm I guess :)

@gibfahn
Member

gibfahn commented Oct 4, 2016

Okay, I'm going to start landing this.

  • no objections
  • 3 LGTMs
  • Lots of CI

gibfahn pushed a commit that referenced this pull request Oct 4, 2016
Removes use of builtins that are unavailable for older clang. Per
benchmarks, only uses builtins on Windows, where speedup is
significant.

Also adds test for unaligned ucs2 buffer write. Between #3410
and #7645, bytes were swapped twice on bigendian platforms if buffer
was not two-byte aligned. See comment in #7645.

PR-URL: #7645
Fixes: #7618
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
@gibfahn
Member

gibfahn commented Oct 4, 2016

Landed in 7420835, thanks a lot @zbjornson!

@gibfahn gibfahn closed this Oct 4, 2016
@gibfahn
Member

gibfahn commented Oct 4, 2016

Concerning backporting, I guess this should be backported to wherever #7157 was backported, which is v6 but not v4. If anyone disagrees let me know.

jasnell pushed a commit that referenced this pull request Oct 6, 2016
Fishrock123 pushed a commit that referenced this pull request Oct 11, 2016

 Conflicts:
	src/node_buffer.cc