MSVC simdjson twice as slow as ClangCL #847
Possible issues to investigate:
Things to try:
|
More numbers, courtesy of benchfeatures (#857): TLDR: MSVC simdjson is slower at a lot of features, but one of the most striking differences is that stage 1 is 140% slower on a file with just the text "ab" followed by 640K of zeroes. Stage 2 is unaffected on that file (as one would expect).
|
Bruce Dawson explains how to use performance counters under Windows. It involves using a program called tracelog.exe that is apparently part of Windows but that I had never heard about. The batch file (Apache licensed): @setlocal
@set tracelogdir=C:\Program Files (x86)\Windows Kits\10\bin\x64
@for /d %%f in ("C:\Program Files (x86)\Windows Kits\10\bin\10.*") do if exist "%%f\x64\tracelog.exe" set tracelogdir=%%f\x64
@if exist "%tracelogdir%\tracelog.exe" goto tracelog_exists
@echo Can't find tracelog.exe
@exit /b
:tracelog_exists
set path=%path%;%tracelogdir%
@set batchdir=%~dp0
@set demo_app=%batchdir%ConditionalCount\Release\ConditionalCount.exe
@set demo_app_name=ConditionalCount
@if exist "%demo_app%" goto demo_exists
@echo Please build the release configuration of ConditionalCount.exe. It needs
@echo to be here: %demo_app%
@exit /b
:demo_exists
@rem Start tracing with context switches and with a few CPU performance
@rem counters being recorded for each context switch.
tracelog.exe -start pmc_counters -f pmc_counter_test.etl -eflag CSWITCH+PROC_THREAD+LOADER -PMC BranchMispredictions,BranchInstructions:CSWITCH
@if %errorlevel% equ 0 goto started
@echo Make sure you are running from an administrator command prompt
@exit /b
:started
@rem Counters can also be associated with other events such as PROFILE - just
@rem change both occurrences of CSWITCH to PROFILE. But the parsing code will have
@rem to be updated to make it meaningful.
@rem Other available counters can be found by running "tracelog -profilesources Help"
@rem and on my machine I see:
@rem Timer
@rem TotalIssues
@rem BranchInstructions
@rem CacheMisses
@rem BranchMispredictions
@rem TotalCycles
@rem UnhaltedCoreCycles
@rem InstructionRetired - shows same results as TotalIssues
@rem UnhaltedReferenceCycles
@rem LLCReference
@rem LLCMisses
@rem BranchInstructionRetired - shows same results as BranchInstructions
@rem BranchMispredictsRetired - not significantly different from BranchMispredictions
"%demo_app%"
"%demo_app%" -sort
xperf -stop pmc_counters >nul
xperf -merge pmc_counter_test.etl pmc_counters_test_merged.etl
xperf -i pmc_counters_test_merged.etl -o pmc_counters_test.txt There is also a Python script. This is not perfect, but the first thing that matters is to check the instruction counts. |
It might be interesting also to dump the assembly and review it... Because that's likely to be such a huge job reviewing it, it might be nice to post it online (like on GitHub) and see if we can't tease apart differences collaboratively... Also, having a historical set of assembly dumps could prove useful...
|
I have uploaded an ASM dump of parse.exe compiled with Visual Studio: |
Ok. So I did a little bit of simple stats. I compiled the code using both Visual Studio 2019 and GNU GCC 8 (under Linux). I compiled the parse benchmark. I end up with roughly the same binary size. I am looking for some high-level differences. Here are the most common instructions under Visual Studio...
Here are the most common instructions under GCC...
The Visual Studio assembly has a lot of lines that look like...
Are we expected to find these interrupts in release code? There are 2000 of those out of fewer than 40k instructions. I don't expect that they would slow things down since they are clearly not executed (right?) but this might be a clue to something else? I am just fishing. |
There are ~40k lines of assembly to read. I am not likely to read it all in a comparative manner. But at least, I posted the assembly online. |
Twitter says that the INT 3 instructions are to be expected. Visual Studio is generous with padding and likes to use INT 3 for security. |
Please check the MSVC /sdl flag that is added to all build configurations in the simdjson-flags.cmake file. According to the MSVC documentation https://docs.microsoft.com/en-us/cpp/build/reference/sdl-enable-additional-security-checks?view=vs-2019 it adds runtime checks. |
@maksqwe Do we know if Looking at the documentation, I cannot see anything that would have a cost for us... But no matter, we can measure it exactly. |
To answer my own query, Visual Studio aligns all jump targets to 16 bytes, presumably as an optimization strategy. |
Compiling simdjson with GNU GCC (MinGW 64-bit) under Windows seems to give good performance that is on par with Linux. I am achieving 2.8 GB/s on twitter.json. Here is a screenshot proof... So it is not Linux holding us back. There is really something bad for us in the Visual Studio backend. Hmmm... |
@lemire I don't know if this is your issue exactly, as I'm not sure how you're invoking MSVC for your tests. Might not be an issue for you at all. However, it turns out this is the reason the pysimdjson benchmarks have been so poor. As you suspected, it's an inlining issue where MSVC is ignoring In particular, under Haswell, this was my main culprit:
It's consuming up to 20% of the total run time even when doing full Python object creation.
Under MSVC2019 (and only 2019) I fixed this issue by enabling aggressive inlining using the |
@TkTech Thanks for your comment. I think we know about /Ob3. But I don't think I was aware that it could have such large effects. What I did was to test what CMake defines by default...
Short story : /Ob2. That's not good. |
/Ob2 is safe - it'll work on most versions of MSVC. /Ob3 is very new, only available recently in 2019. Testing with both /Ob2 and /Ob3, only /Ob3 had such a large impact (by really inlining |
/Ob3 is definitively not the default on Visual Studio 2019 https://gitlab.kitware.com/cmake/cmake/-/issues/20824 |
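For anyone who wants to experiment with this, here is a minimal CMake sketch of how one might swap the default /Ob2 for /Ob3 (the version gate is my assumption — /Ob3 first shipped with Visual Studio 2019 — and this is not what simdjson-flags.cmake actually does):

```cmake
# Hypothetical sketch: replace CMake's default /Ob2 with /Ob3 on a
# new-enough MSVC. MSVC_VERSION 1920+ corresponds to Visual Studio 2019.
if(MSVC AND MSVC_VERSION GREATER_EQUAL 1920)
  string(REPLACE "/Ob2" "/Ob3"
         CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE}")
endif()
```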
|
@pps83 Can you try adding |
The piece of code you point out is almost certainly in stage 1, so it is not related to the consumption of the DOM; rather, it is a cost that you pay upfront, early in the parsing. That stage 1 eats a lot of your budget is interesting. I can build our … I have created a work-in-progress PR to ease reproduction of my results... It basically enables /Ob3. First, here are my command lines...
And here are my results... The first pass with /Ob3, the second is with /Ob2 (sorry for the terribly verbose output).
So I cannot reproduce the beneficial effects of /Ob3 on our code. (I am not saying it is not real, just that I do not see it right now. Maybe I made a mistake?) Now... I use about 425,000 nanoseconds to parse twitter.json. You report taking 4 ms or 5 ms... say 5,000,000 ns. That's about 10x. This suggests that 90% of your running time is spent outside of our core parsing routine (stage 1 and stage 2). If I am correct, it makes it unlikely that the piece of code you point at could use 20% of your running time. Could you run the parse benchmark from the simdjson project and compare? |
For fun, here is exactly the same benchmark, on the same OS, but using a different compiler (GCC from MSYS2):
It is almost twice as fast!!! This should not be possible. There is no way that GNU GCC is 2x faster than Visual Studio. Someone at Microsoft would get fired if it were true. There is a very real mystery. (Note that this is verified using other benchmarks as well.) |
@lemire for reproduction purposes, can you double check the exact version of MSVC you're using? Inlining behavior was changed in 16.3. |
Can you report your results with...
We need to be able to reason about it. The simdjson library itself definitively uses less than a nanosecond per byte. (For just the core parsing step.) |
@TkTech Irrespective of my version, you should be doing the core parsing at less than 1 ns per input byte. The twitter file has 631515 bytes. You seem to be using something like 8 ns per input byte. Again, this is for the core parsing routine... if you are doing extra work, the story differs, of course. |
The Python/MSVC issue appears to be resolved. I have updated my Python demo at https://github.com/lemire/simple_simdjson_python_wrapper I am achieving 1.6 GB/s with twitter.json when compiling the extension with Visual Studio, which is the same result I get using our C++ benchmark. So parsing twitter.json takes 0.4 ms on my setup. This is without /Ob3. This is pretty reassuring. Of course, the Python wrapper just literally calls the C++ function from the library... so it should hit the same speed (minus some call overhead due to Python). It is still the case that compiling the Python extension with ClangCL (if possible) or GCC MSYS2 would give much better performance. |
I might have a lead. Under Visual Studio, half of the running time of stage 1 is due to UTF-8 validation even when the input is pure ASCII. This is unexpected: we have UTF-8 validation tuned very finely. This should not happen. |
Looking at the assembly, it could be that the data is reloaded into vector registers, instead of having the registers just be passed around. (This is speculative, but there seems to be a lot of loads where I do not expect them.) |
I have now become convinced that the problem is tied to the abstraction (simd8 class) that we have put on top of the intrinsics. When I toy with the abstraction, I make Visual Studio go faster. Just like Visual Studio got faster after I removed the functional programming idioms, I think I can make it go faster by removing other high level abstractions. |
Is there a specific code generation reason why?
|
I have now made the gap much more acceptable. It is only 30% compared to ClangCL. It is not just this one commit, I had made gains earlier...
@terrymah We had a data structure simd::simd8x64<uint8_t> that contains, say, 2 256-bit registers/variables. We were passing it around by copy. Our expectation was that the compiler would elide the copies, but my impression is that Visual Studio was doing either a lot of copies, or else some jumping around. I am not sure. Passing it by reference (and deleting the copy constructor) helped a lot. Earlier, we had "lambdas" that would be passed to this simd8x64 to do map/reduce on the 2 or 4 variables. I have removed these lambdas and instead I use simpler code like this... really_inline simd8<T> reduce_or() const {
return this->chunks[0] | this->chunks[1];
} This helped. I had to do it because the latest Visual Studio 2019 update broke our build with lambdas under ClangCL... Under ClangCL, to use SIMD instructions, we use "function attributes", but they would not propagate to the lambdas. (This is not necessarily a bug in Visual Studio but it feels like a bug.) Anyhow, doing that accidentally helped our performance. Another thing that helped is that we had this loop... this->check_utf8_bytes(input.chunks[0], this->prev_input_block);
for (int i=1; i<simd8x64<uint8_t>::NUM_CHUNKS; i++) {
this->check_utf8_bytes(input.chunks[i], input.chunks[i-1]);
} where NUM_CHUNKS is a compile-time constant with value 2... Removing the loop helped a lot... this->check_utf8_bytes(input.chunks[0], this->prev_input_block);
this->check_utf8_bytes(input.chunks[1], input.chunks[0]); This surprised me a little. So it is a bunch of small things like that... Basically, it comes down to coding at a lower level, with fewer abstractions. |
Link to relevant PR: #1031 It is possible that much more gain could be possible. I don't know exactly how to go about it. I am trying to recognize patterns that Visual Studio does not handle well. |
Agreed. |
@terrymah I think I have identified a large difference between Visual Studio and ClangCL that probably explains the remaining performance difference. Visual Studio (but not ClangCL) is spending the bulk of its instructions loading and unloading vector registers. I cannot explain it, but the volume of stores and loads does not make any sense. This code is mostly computational; there should be very few loads and stores. My explanation is probably wrong and naive, but it is as if Visual Studio thought "oh! I am done with this code, I must de-register everything"... and then... "Oh, new code path, I must re-register everything again"... and so forth. Note though that it is all one function... but it is as if Visual Studio was scared of leaving stuff in YMM registers. Compiling with /arch:AVX2 does not help. |
There are several bugs reported on the MSVC support forum; I think these bugs are connected to this one. |
I have reported the issue at
It might be. For now, I would just settle on understanding what the compiler is doing. If I understand what it is doing, then I might be able to produce a workaround. |
Thanks guys, we'll look into it on our end! Oftentimes the compiler (pessimistically) thinks it has to reload the registers if some external call is made and the state/contents of the variable or memory are potentially unknown again. This is what we call "interference calculation" combined with "escape analysis": knowing what is going on with variables and memory. But I'll let you know what we find out. |
@terrymah Thank you. |
@terrymah I have been experimenting... struct go {
__m128i array[4];
};
__m128i orme(go & x) {
__m128i z = x.array[0];
for(size_t i = 1; i < 4; i++)
z = _mm_or_si128(z, x.array[i]);
return z;
}
Both GCC and clang compile it to something like...
Yet Visual Studio does the following...
I don't understand what Visual Studio is doing. |
Nor I. I mean, clearly we didn't unroll the loop and vectorize for unknown reasons. Is there a devcomm bug on this? It looks like you have a good minimal example that should be a lot more pleasant to investigate. |
I added it to https://developercommunity.visualstudio.com/content/problem/1106837/unexpected-volume-of-256-bit-loads-and-stores-in-h.html as a comment. I can open a whole new issue if you'd like. I am of a mixed mind about it because I actually fixed it by manually unrolling our code throughout. So I no longer have this issue, but I did. I am mildly concerned that someone could say "just unroll the loop manually" (someone, for real, told me this), which would be missing the point, of course. I take it as a symptom of a larger issue where Visual Studio seems to be doing something unexpected when vector registers are involved. I am a hard-working guy, so I will eventually reduce the problems to various cases. And I can create a new issue each time. But it seems to me quite likely that they are all closely related. |
So, we looked into it, and it turns out the issue is that having an intrinsic in a loop actually causes our loop unroller to bail out. So that's not great. We'll get it fixed. |
@terrymah I have been able to reduce an example of crazy register unloading and reloading... #ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
#include <cstdint>
struct simd8x32 {
const __m256i chunks;
};
bool inline is_ascii(simd8x32 input) {
return !_mm256_testz_si256(input.chunks, _mm256_set1_epi8(0b10000000));
}
simd8x32 * count_ascii(simd8x32 * x) {
while(is_ascii(*x)) {
x++;
}
return x;
} What this code does is simple enough... you just advance as long as your inputs are "ASCII". If you are SIMD fluent, then you can compile the result in your head. So clang will compile it to...
The hot spot is very clean...
You load a register, increment the location, execute vptest, and conditionally jump. But Visual Studio will do that...
The hot loop is ...
I don't understand this assembly, but there are too many loads. The XMM registers are unneeded. I'll add this example to the issue I already flagged (instead of opening a new one). |
@terrymah Ah!!! It is inlining related !!! You can make the issue disappear if you manually inline the is_ascii function like so... simd8x32 * fast_count_ascii(simd8x32 * x) {
while(!_mm256_testz_si256(x->chunks, _mm256_set1_epi8(0b10000000))) {
x++;
}
return x;
} Then Visual Studio will generate this much better assembly...
This is really, really bad news for simdjson because it is all engineered with little functions that all get magically inlined... Godbolt link https://godbolt.org/z/bf6oz7 |
@terrymah I think that there is an issue with inlining. If you have a struct containing a single vector register, you expect to get good performance by passing it by value... and this is indeed what simdjson does, but under Visual Studio, when combined with inlining, it sometimes delivers poor results. Let me illustrate. #ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
#include <cstdint>
struct simd8x32 {
__m256i chunks;
};
bool inline is_ascii_ref(simd8x32& v) {
return !_mm256_testz_si256(v.chunks, _mm256_set1_epi8(0b10000000));
}
bool inline is_ascii_copy(simd8x32 v) {
return !_mm256_testz_si256(v.chunks, _mm256_set1_epi8(0b10000000));
}
simd8x32 * count_ascii_copy(simd8x32 * x) {
while(is_ascii_copy(*x)) {
x++;
}
return x;
}
simd8x32 * count_ascii_ref(simd8x32 * x) {
while(is_ascii_ref(*x)) {
x++;
}
return x;
} The count_ascii_copy function gets compiled as
The count_ascii_ref function gets compiled as
|
I am going to close this. The issue is largely resolved. |
Just going to randomly jump in: if you don't use /fp:fast, VC performs a lot of extra moves/rounds https://social.msdn.microsoft.com/Forums/vstudio/en-US/0e319e71-75fc-4213-91b4-f91c19b60dbd/strange-fp-floating-point-model-flag-behavior?forum=vcgeneral MOVUPS, I think, is supposed to round or trigger FP exceptions on paper; MOVDQA is the "proper" instruction. VC thinks it's FP SIMD when it's not. The /fp:fast switch removes the "ANSI C/IEEE" round-on-every-operand behavior. Also, LTCG/WPO is required with VC for modern inlining/"random calling conventions". |
On Windows, simdjson compiled with MSVC parses twitter.json at almost half the speed of ClangCL-compiled simdjson, on the same machine, in the same command prompt.
Methodology:
git clean -ffdx build && cmake -B build && cmake --build build --target parse --config Release && build\benchmark\Release\parse jsonexamples\twitter.json
git clean -ffdx build && cmake -B build -T ClangCL && cmake --build build --target parse --config Release && build\benchmark\Release\parse jsonexamples\twitter.json
I validated that MSVC simdjson is using the haswell implementation, both by running json2json to print out the implementation, and by doing
set SIMDJSON_FORCE_IMPLEMENTATION=haswell