Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSVC simdjson twice as slow as ClangCL #847

Closed
jkeiser opened this issue May 1, 2020 · 64 comments
Closed

MSVC simdjson twice as slow as ClangCL #847

jkeiser opened this issue May 1, 2020 · 64 comments
Labels
performance platform coverage Makes simdjson run on more machines or compilers.
Milestone

Comments

@jkeiser
Copy link
Member

jkeiser commented May 1, 2020

On Windows, simdjson compiled with MSVC parses twitter.json at almost half the speed of ClangCL-compiled simdjson, on the same machine, in the same command prompt.

Platform Overall Stage 1 Stage 2
MSVC 19.25.28614 1.3051 2.3777 3.3898
ClangCL 9.0.0 2.2221 5.4161 4.6401

Methodology:

  • MSVC: git clean -ffdx build && cmake -B build && cmake --build build --target parse --config Release && build\benchmark\Release\parse jsonexamples\twitter.json
  • ClangCL: git clean -ffdx build && cmake -B build -T ClangCL && cmake --build build --target parse --config Release && build\benchmark\Release\parse jsonexamples\twitter.json

I validated that MSVC simdjson is using the haswell implementation, both by running json2json to print out the implementation, and by doing set SIMDJSON_FORCE_IMPLEMENTATION=haswell.

@jkeiser jkeiser changed the title MSVC simdjson parses twice as slow as ClangCL MSVC simdjson twice as slow as ClangCL May 1, 2020
@lemire
Copy link
Member

lemire commented May 1, 2020

Possible issues to investigate:

  • It seems that MSVC is shy about strict aliasing.
  • It seems that MSVC is shy about inlining.

Things to try:

  • /LTCG (unlikely, but...)
  • PGO (related to inlining)

@jkeiser
Copy link
Member Author

jkeiser commented May 2, 2020

More numbers, courtesy of benchfeatures (#857):

TLDR: MSVC simdjson is slower at a lot of features, but one of the most striking differences is that stage 1 is 140% slower on a file with just the text "ab" followed by 640K of zeroes. Stage 2 is unaffected on that file (as one would expect).

Stage Platform Base 7 Struct UTF-8 Escape 15 Str. 16+ Str. 7 Struct Miss UTF-8 Miss Escape Miss 15 Str. Miss 16+ Str. Miss
Stage 1 ClangCL 9.0.0 6.89 9.42 5.06 0.32 5.34 10.5 3.32 6.68 -0.68 5.08 6.56
Stage 1 MSVC 19.25.28614.0 20.5 10.1 10 -0.14 5.22 6 0.84 1.92 0.2 1.76 1.12
Stage 2 ClangCL 9.0.0 0.35 25.9 0.04 4.26 22.4 33.5 0.2 -0.36 8.52 1.32 -0.88
Stage 2 MSVC 19.25.28614.0 0.36 32.9 0.08 4.18 33.4 41 0.38 -0.24 12.1 -0.28 1
All ClangCL 9.0.0 10.6 36.2 8.3 6.04 29.9 43 2.9 4.36 9.52 6.88 5.8
All MSVC 19.25.28614.0 23.6 44.2 10.5 4.68 38.2 48.6 1.26 -0.2 14.2 4.16 1.08

@lemire
Copy link
Member

lemire commented May 4, 2020

Bruce Dawson explains how to use performance counters under Windows.

It involves using a program called tracelog.exe that is apparently part of Windows but that I had never heard about.

The batch file (Apache licensed):

@setlocal

@set tracelogdir=C:\Program Files (x86)\Windows Kits\10\bin\x64
@for /d %%f in ("C:\Program Files (x86)\Windows Kits\10\bin\10.*") do if exist "%%f\x64\tracelog.exe" set tracelogdir=%%f\x64

@if exist "%tracelogdir%\tracelog.exe" goto tracelog_exists
@echo Can't find tracelog.exe
@exit /b
:tracelog_exists

set path=%path%;%tracelogdir%

@set batchdir=%~dp0
@set demo_app=%batchdir%ConditionalCount\Release\ConditionalCount.exe
@set demo_app_name=ConditionalCount

@if exist "%demo_app%" goto demo_exists
@echo Please build the release configuration of ConditionalCount.exe. It needs
@echo to be here: %demo_app%
@exit /b

:demo_exists

@rem Start tracing with context switches and with a few CPU performance
@rem counters being recorded for each context switch.
tracelog.exe -start pmc_counters -f pmc_counter_test.etl -eflag CSWITCH+PROC_THREAD+LOADER -PMC BranchMispredictions,BranchInstructions:CSWITCH
@if %errorlevel% equ 0 goto started
@echo Make sure you are running from an administrator command prompt
@exit /b
:started

@rem Counters can also be associated with other events such as PROFILE - just
@rem change both occurrences of CSWITCH to PROFILE. But the parsing code will have
@rem to be updated to make it meaningful.

@rem Other available counters can be found by running "tracelog -profilesources Help"
@rem and on my machine I see:
@rem Timer
@rem TotalIssues
@rem BranchInstructions
@rem CacheMisses
@rem BranchMispredictions
@rem TotalCycles
@rem UnhaltedCoreCycles
@rem InstructionRetired - shows same results as TotalIssues
@rem UnhaltedReferenceCycles
@rem LLCReference
@rem LLCMisses
@rem BranchInstructionRetired - shows same results as BranchInstructions
@rem BranchMispredictsRetired - not significantly different from BranchMispredictions

"%demo_app%"
"%demo_app%" -sort
xperf -stop pmc_counters >nul
xperf -merge pmc_counter_test.etl pmc_counters_test_merged.etl
xperf -i pmc_counters_test_merged.etl -o pmc_counters_test.txt

There is also a Python script.

This is not perfect, but the parse utility has the ability to just do stage 1, or both, and we can increase the number of iterations (-n) so that it dominates the running time. So we could really figure out the performance counters.

The first thing that matters is to check the instruction counts.

@jkeiser jkeiser added performance platform coverage Makes simdjson run on more machines or compilers. labels May 4, 2020
@lemire
Copy link
Member

lemire commented May 4, 2020

It might be interesting also to dump the assembly and review it... Because that's likely to be such a huge job reviewing it, it might be nice to post it online (like on GitHub) and see if we can't tease apart differences collaboratively... Also, having a historical set of assembly dumps could prove useful...

dumpbin /disasm ourbinary

@lemire
Copy link
Member

lemire commented May 14, 2020

I have uploaded an ASM dump of parse.exe compiled with Visual Studio:
https://github.com/simdjson/asmdump/blob/master/visualstudio2019/parse.exe.asm

@lemire
Copy link
Member

lemire commented May 14, 2020

Ok. So I did a little bit of simple stats. I compiled the code using both Visual Studio 2019 and GNU GCC 8 (under Linux). I compiled the parse benchmark. I end up with roughly the same binary size.

I am looking for some high-level differences.

Here are the most common instructions under Visual Studio...

Visual Studio 2019
mov                 	:9540
lea                 	:2792
cmp                 	:1919
int                 	:1905
add                 	:1776
movzx               	:1251
jmp                 	:1143
call                	:1086
jne                 	:972
test                	:940
sub                 	:884
je                  	:734
or                  	:717
inc                 	:608
xor                 	:515
ja                  	:479
movdqa              	:447
and                 	:446
vmovdqu             	:419
movsxd              	:347
nop                 	:292
pop                 	:291
push                	:271
shr                 	:265
shl                 	:251
movaps              	:201
ret                 	:200
movups              	:176
xorps               	:152
pmovmskb            	:144
pcmpeqb             	:144
movdqu              	:139
jb                  	:138
movsd               	:134
vmovaps             	:126
movsx               	:111
sd                  	:108
vpmovmskb           	:104
tzcnt               	:104
vpcmpeqb            	:103

Here are the most common instructions under GCC...

mov                 	:8858
lea                 	:2951
cmp                 	:2059
add                 	:1697
movzbl              	:1386
je                  	:1327
jmpq                	:1136
call                	:992
sub                 	:944
jmp                 	:852
or                  	:764
test                	:704
xor                 	:679
and                 	:664
jne                 	:588
shr                 	:522
ja                  	:470
shl                 	:441
movabs              	:423
pop                 	:403
jbe                 	:401
movdqa              	:391
push                	:348
pxor                	:315
nop                 	:307
sd                  	:272
callq               	:264
vmovdqa             	:236
nopl                	:218
tzcnt               	:176
imul                	:170
nopw                	:163
movups              	:163
js                  	:161
movl                	:158
xchg                	:157
movsd               	:155
divsd               	:155
jae                 	:143
cmpb                	:121
movb                	:119
jg                  	:118
movaps              	:116
movq                	:115
addsd               	:111
pmovmskb            	:104
pcmpeqb             	:104

The Visual Studio assembly has a lot of lines that look like...

  000000014002264C: CC                 int         3

Are we expected to find these interrupts in release code? There are 2000 of those out of fewer than 40k instructions.

I don't expect that they would slow things down since they are clearly not executed (right?) but this might be a clue to something else?

I am just fishing.

@lemire
Copy link
Member

lemire commented May 14, 2020

I would not expect something like this in release code...

Screen Shot 2020-05-14 at 2 07 44 PM

@lemire
Copy link
Member

lemire commented May 14, 2020

There are ~40k lines of assembly to read. I am not likely to read it all in a comparative manner. But at least, I posted the assembly online.

@lemire
Copy link
Member

lemire commented May 14, 2020

Twitter says that the INT 3 are to be expected. Visual Studio is generous on the padding and likes to use INT 3 for security.

@maksqwe
Copy link

maksqwe commented Jun 3, 2020

Please check MSVC /sdl flag that added to all build configuration in the simdjson-flags.cmake file.

According to MSVC https://docs.microsoft.com/en-us/cpp/build/reference/sdl-enable-additional-security-checks?view=vs-2019 it adds runtime checks.
I do not think that this is the main reason, but it is better to remove this flag from the release build.

@lemire
Copy link
Member

lemire commented Jun 3, 2020

@maksqwe Do we know if /sdl impacts the performance for us?

Looking at the documentation, I cannot see anything that would have a cost for us...

https://docs.microsoft.com/en-us/cpp/build/reference/sdl-enable-additional-security-checks?view=vs-2019

But no matter, we can measure it exactly.

@lemire
Copy link
Member

lemire commented Jun 3, 2020

I would not expect something like this in release code...

To answer my own query, Visual Studio aligns all jump targets to 16 bytes, presumably as an optimization strategy.

@lemire
Copy link
Member

lemire commented Jun 30, 2020

Compiling simdjson with GNU GCC (MinGW 64-bit) under Windows seems to give good performance that is on par with Linux. I am achieving 2.8 GB/s on twitter.json.

Here is a screenshot proof...

Screen Shot 2020-06-29 at 9 35 33 PM

So it is not Linux holding us back. There is really something bad for us in the Visual Studio backend.

Hmmm...

cc @jkeiser @eddelbuettel

@TkTech
Copy link
Member

TkTech commented Jul 2, 2020

@lemire I don't know if this is your issue exactly, as I'm not sure how you're invoking MSCV for your tests. Might not be an issue for you at all. However, it turns out this is the reason the pysimdjson benchmarks have been so poor. As you suspected, it's an inlining issue where MSCV is ignoring __forceinline (which it's allowed to do and is documented).

In particular, under Haswell, this was my main culprit:

 really_inline uint64_t lteq(const T m) const {
      const simd8<T> mask = simd8<T>::splat(m);
      return this->map( [&](simd8<T> a) { return a <= mask; } ).to_bitmask();
    }

It's consuming up to 20% of the total run time even when doing full Python object creation.

 %Own   %Total  OwnTime  TotalTime  Function (filename:line)
 17.00%  17.00%    6.19s     6.19s   <lambda_e3ba89e388457ca8f4aaff0297f9a273>::operator() (simdjson\simdjson.cpp:8210)
 16.00%  17.00%    5.82s     6.27s   element_to_primitive (simdjson\binding.cpp:49)
 10.00%  34.00%    4.73s    13.34s   simdjson::haswell::stage1::json_structural_indexer::index<128> (simdjson\simdjson.cpp:9140)
 14.00%  15.00%    4.72s     5.06s   element_to_primitive (simdjson\binding.cpp:40)
 11.00%  14.00%    3.75s     4.29s   element_to_primitive (simdjson\binding.cpp:80)
  8.00% 100.00%    3.40s    40.99s   <module> (x.py:8)
  5.00%   5.00%    1.97s     1.97s   simdjson::haswell::stage2::`anonymous namespace'::parse_structurals<0> (simdjson\simdjson.cpp:10744)
  1.00%   1.00%    1.37s     1.37s   element_to_primitive (simdjson\binding.cpp:39)
  6.00%   6.00%    1.31s     1.31s   <lambda_1c761a5312a2bb3c85b0462009320da7>::operator() (simdjson\simdjson.cpp:8253)
  1.00%   1.00%   0.990s    0.990s   <lambda_88c8bc0843da91103236cb3bd551f608>::operator() (simdjson\simdjson.cpp:8258)

Under MSVC2019 (and only 2019) I fixed this issue by enabling aggressive inlining using the /Ob3 strategy, which is only available under 2019.

@lemire
Copy link
Member

lemire commented Jul 2, 2020

@TkTech Thanks for your comment. I think we know about /Ob3. But I don't think I was aware that it could have such large effects.

What I did was to test what CMake defines by default...

PS C:\Users\User\github\simdjson\buildvs2019> cmake --build . --target parse --verbose --config Release
Microsoft (R) Build Engine version 16.5.1+4616136f8 for .NET Framework
Copyright (C) Microsoft Corporation. All rights reserved.

Build started 7/2/2020 6:50:30 PM.
Project "C:\Users\User\github\simdjson\buildvs2019\benchmark\parse.vcxproj" on node 1 (default targets).
Project "C:\Users\User\github\simdjson\buildvs2019\benchmark\parse.vcxproj" (1) is building "C:\Users\User\github\simdjson\buildvs2019\ZERO_CHECK.vcxproj" (2) on node 1 (default targets).
PrepareForBuild:
  Creating directory "x64\Release\ZERO_CHECK\".
  Creating directory "x64\Release\ZERO_CHECK\ZERO_CHECK.tlog\".
InitializeBuildStatus:
  Creating "x64\Release\ZERO_CHECK\ZERO_CHECK.tlog\unsuccessfulbuild" because "AlwaysCreate" was specified.
CustomBuild:
  Checking Build System
FinalizeBuildStatus:
  Deleting file "x64\Release\ZERO_CHECK\ZERO_CHECK.tlog\unsuccessfulbuild".
  Touching "x64\Release\ZERO_CHECK\ZERO_CHECK.tlog\ZERO_CHECK.lastbuildstate".
Done Building Project "C:\Users\User\github\simdjson\buildvs2019\ZERO_CHECK.vcxproj" (default targets).

Project "C:\Users\User\github\simdjson\buildvs2019\benchmark\parse.vcxproj" (1) is building "C:\Users\User\github\simdjson\buildvs2019\jsonexamples\generated\generated-data.vcxproj" (3) on node 1 (default targets).
PrepareForBuild:
  Creating directory "x64\Release\generated-data\".
  Creating directory "x64\Release\generated-data\generated-data.tlog\".
InitializeBuildStatus:
  Creating "x64\Release\generated-data\generated-data.tlog\unsuccessfulbuild" because "AlwaysCreate" was specified.
CustomBuild:
  Building Custom Rule C:/Users/User/github/simdjson/jsonexamples/generated/CMakeLists.txt
FinalizeBuildStatus:
  Deleting file "x64\Release\generated-data\generated-data.tlog\unsuccessfulbuild".
  Touching "x64\Release\generated-data\generated-data.tlog\generated-data.lastbuildstate".
Done Building Project "C:\Users\User\github\simdjson\buildvs2019\jsonexamples\generated\generated-data.vcxproj" (default targets).

Project "C:\Users\User\github\simdjson\buildvs2019\benchmark\parse.vcxproj" (1) is building "C:\Users\User\github\simdjson\buildvs2019\src\simdjson.vcxproj" (4) on node 1 (default targets).
PrepareForBuild:
  Creating directory "simdjson.dir\Release\".
  Creating directory "C:\Users\User\github\simdjson\buildvs2019\src\Release\".
  Creating directory "simdjson.dir\Release\simdjson.tlog\".
  Creating "simdjson.dir\Release\simdjson.tlog\unsuccessfulbuild" because "AlwaysCreate" was specified.
CustomBuild:
  Building Custom Rule C:/Users/User/github/simdjson/src/CMakeLists.txt
ClCompile:
  C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.25.28610\bin\HostX64\x64\CL.exe /c /IC:\Users\User\github\simdjson\include /IC:\Users\User\github\simdjson\src /nologo /W3 /WX /diagnostics:column /sdl /O2 /Ob2 /D WIN32 /D _WIN 
  DOWS /D NDEBUG /D SIMDJSON_THREADS_ENABLED=1 /D "CMAKE_INTDIR=\"Release\"" /D _MBCS /Gm- /EHsc /MD /GS /fp:precise /Zc:wchar_t /Zc:forScope /Zc:inline /GR /std:c++17 /Fo"simdjson.dir\Release\\" /Fd"simdjson.dir\Release\simdjson.pdb" /Gd /TP /errorReport:qu 
  eue C:\Users\User\github\simdjson\src\simdjson.cpp
  simdjson.cpp
Lib:
  C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.25.28610\bin\HostX64\x64\Lib.exe /OUT:"C:\Users\User\github\simdjson\buildvs2019\src\Release\simdjson.lib" /NOLOGO /MACHINE:X64  /machine:x64 simdjson.dir\Release\simdjson.obj   
  simdjson.vcxproj -> C:\Users\User\github\simdjson\buildvs2019\src\Release\simdjson.lib
FinalizeBuildStatus:
  Deleting file "simdjson.dir\Release\simdjson.tlog\unsuccessfulbuild".
  Touching "simdjson.dir\Release\simdjson.tlog\simdjson.lastbuildstate".
Done Building Project "C:\Users\User\github\simdjson\buildvs2019\src\simdjson.vcxproj" (default targets).

PrepareForBuild:
  Creating directory "parse.dir\Release\".
  Creating directory "C:\Users\User\github\simdjson\buildvs2019\benchmark\Release\".
  Creating directory "parse.dir\Release\parse.tlog\".
InitializeBuildStatus:
  Creating "parse.dir\Release\parse.tlog\unsuccessfulbuild" because "AlwaysCreate" was specified.
CustomBuild:
  Building Custom Rule C:/Users/User/github/simdjson/benchmark/CMakeLists.txt
ClCompile:
  C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.25.28610\bin\HostX64\x64\CL.exe /c /IC:\Users\User\github\simdjson\benchmark\. /IC:\Users\User\github\simdjson\benchmark\linux /IC:\Users\User\github\simdjson\include /IC:\Users 
  \User\github\simdjson\windows\. /nologo /W3 /WX- /diagnostics:column /O2 /Ob2 /D WIN32 /D _WINDOWS /D NDEBUG /D SIMDJSON_THREADS_ENABLED=1 /D _CRT_SECURE_NO_WARNINGS /D "SIMDJSON_TEST_DATA_DIR=\"C:/Users/User/github/simdjson/jsonchecker/\"" /D "SIMDJSON_BE 
  NCHMARK_DATA_DIR=\"C:/Users/User/github/simdjson/buildvs2019/jsonexamples/\"" /D "CMAKE_INTDIR=\"Release\"" /D _MBCS /Gm- /EHsc /MD /GS /fp:precise /Zc:wchar_t /Zc:forScope /Zc:inline /GR /std:c++17 /Fo"parse.dir\Release\\" /Fd"parse.dir\Release\vc142.pdb" 
   /Gd /TP /errorReport:queue C:\Users\User\github\simdjson\benchmark\parse.cpp
  parse.cpp
Link:
  C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.25.28610\bin\HostX64\x64\link.exe /ERRORREPORT:QUEUE /OUT:"C:\Users\User\github\simdjson\buildvs2019\benchmark\Release\parse.exe" /INCREMENTAL:NO /NOLOGO ..\src\Release\simdjson 
  .lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /manifest:embed /PDB:"C:/Users/User/github/simdjson/buildvs2019/benchmark/ 
  Release/parse.pdb" /SUBSYSTEM:CONSOLE /TLBID:1 /DYNAMICBASE /NXCOMPAT /IMPLIB:"C:/Users/User/github/simdjson/buildvs2019/benchmark/Release/parse.lib" /MACHINE:X64  /machine:x64 parse.dir\Release\parse.obj
     Creating library C:/Users/User/github/simdjson/buildvs2019/benchmark/Release/parse.lib and object C:/Users/User/github/simdjson/buildvs2019/benchmark/Release/parse.exp
  parse.vcxproj -> C:\Users\User\github\simdjson\buildvs2019\benchmark\Release\parse.exe
FinalizeBuildStatus:
  Deleting file "parse.dir\Release\parse.tlog\unsuccessfulbuild".
  Touching "parse.dir\Release\parse.tlog\parse.lastbuildstate".
Done Building Project "C:\Users\User\github\simdjson\buildvs2019\benchmark\parse.vcxproj" (default targets).


Build succeeded.
    0 Warning(s)
    0 Error(s)

Short story : /Ob2.

That's not good.

@TkTech
Copy link
Member

TkTech commented Jul 2, 2020

/Ob2 is safe - it'll work on most versions of MSVC. /Ob3 is very new, only available recently in 2019. Testing with both /Ob2 and /Ob3, only /Ob3 had such a large impact (by really inlining really_inline uint64_t lteq(const T m) const). If you can confirm the performance difference, we should try to figure out why it's refusing to inline it. Honestly I'd expect it to inline that without any hints let alone __forceinline.

@lemire
Copy link
Member

lemire commented Jul 2, 2020

Ob3 is definitively not the default on Visual Studio 2019 https://gitlab.kitware.com/cmake/cmake/-/issues/20824

@TkTech
Copy link
Member

TkTech commented Jul 2, 2020

/Ob3:

--------------------------------------------------------------------------------------- benchmark 'jsonexamples/canada.json deserialization': 5 tests ---------------------------------------------------------------------------------------
Name (time in ms)                                                            Min                 Max                Mean             StdDev              Median                IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_loads_json[jsonexamples/canada.json-orjson-loads]                   12.7110 (1.0)       32.5141 (1.0)       19.6919 (1.0)       4.7391 (1.0)       19.6451 (1.0)       8.3362 (1.35)         32;0  50.7824 (1.0)          78           1
test_loads_json[jsonexamples/canada.json-csimdjson-csimdjson_loads]      17.4827 (1.38)      32.9298 (1.01)      25.5541 (1.30)      5.1183 (1.08)      26.3138 (1.34)      9.6864 (1.56)         15;0  39.1326 (0.77)         29           1
test_loads_json[jsonexamples/canada.json-simplejson-loads]               58.9083 (4.63)      78.7260 (2.42)      69.2096 (3.51)      5.7666 (1.22)      70.3775 (3.58)      6.1945 (1.0)           5;0  14.4489 (0.28)         15           1
test_loads_json[jsonexamples/canada.json-json-loads]                     62.1295 (4.89)     100.2388 (3.08)      72.1008 (3.66)     10.7503 (2.27)      69.3329 (3.53)     10.3572 (1.67)          2;1  13.8695 (0.27)         13           1
test_loads_json[jsonexamples/canada.json-rapidjson-loads]               123.1268 (9.69)     151.0345 (4.65)     136.3408 (6.92)     10.0373 (2.12)     135.5881 (6.90)     17.1828 (2.77)          3;0   7.3346 (0.14)          8           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------ benchmark 'jsonexamples/citm_catalog.json deserialization': 5 tests -------------------------------------------------------------------------------------
Name (time in ms)                                                                 Min                Max               Mean            StdDev             Median               IQR            Outliers       OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_loads_json[jsonexamples/citm_catalog.json-orjson-loads]                   6.7576 (1.0)      16.6529 (1.0)       8.7681 (1.0)      2.7797 (1.06)      7.4177 (1.0)      1.1109 (1.93)        29;30  114.0496 (1.0)         135           1
test_loads_json[jsonexamples/citm_catalog.json-csimdjson-csimdjson_loads]      8.7475 (1.29)     18.4730 (1.11)     11.0364 (1.26)     2.9541 (1.13)      9.6908 (1.31)     1.2305 (2.14)        23;23   90.6095 (0.79)        103           1
test_loads_json[jsonexamples/citm_catalog.json-json-loads]                    12.2955 (1.82)     20.7384 (1.25)     14.2051 (1.62)     2.6244 (1.0)      12.9596 (1.75)     0.5761 (1.0)         11;11   70.3971 (0.62)         52           1
test_loads_json[jsonexamples/citm_catalog.json-simplejson-loads]              14.0895 (2.08)     23.9601 (1.44)     16.0244 (1.83)     2.7864 (1.06)     14.5429 (1.96)     1.2061 (2.09)        15;16   62.4049 (0.55)         66           1
test_loads_json[jsonexamples/citm_catalog.json-rapidjson-loads]               18.4948 (2.74)     27.3133 (1.64)     20.6354 (2.35)     2.9715 (1.13)     19.0349 (2.57)     3.2403 (5.62)         12;3   48.4604 (0.42)         53           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------- benchmark 'jsonexamples/github_events.json deserialization': 5 tests -------------------------------------------------------------------------------------------
Name (time in us)                                                                   Min                   Max                Mean              StdDev              Median                IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_loads_json[jsonexamples/github_events.json-orjson-loads]                  263.8000 (1.0)      1,223.8000 (1.29)     303.7025 (1.0)       69.9696 (1.28)     275.8000 (1.0)      31.8000 (1.74)      325;397        3.2927 (1.0)        3106           1
test_loads_json[jsonexamples/github_events.json-csimdjson-csimdjson_loads]     271.0000 (1.03)     1,152.4000 (1.22)     363.2034 (1.20)     103.0636 (1.88)     340.2000 (1.23)     98.2250 (5.36)      285;187        2.7533 (0.84)       2209           1
test_loads_json[jsonexamples/github_events.json-json-loads]                    359.8000 (1.36)     1,053.3000 (1.11)     384.8412 (1.27)      54.6782 (1.0)      366.8000 (1.33)     18.3250 (1.0)       168;265        2.5985 (0.79)       2497           1
test_loads_json[jsonexamples/github_events.json-simplejson-loads]              374.9000 (1.42)     1,170.7000 (1.24)     410.5879 (1.35)      75.4451 (1.38)     380.2000 (1.38)     26.7000 (1.46)      213;306        2.4355 (0.74)       2348           1
test_loads_json[jsonexamples/github_events.json-rapidjson-loads]               405.9000 (1.54)       945.7000 (1.0)      436.7293 (1.44)      65.4202 (1.20)     410.4000 (1.49)     21.1000 (1.15)      171;280        2.2897 (0.70)       1971           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------ benchmark 'jsonexamples/mesh.json deserialization': 5 tests -------------------------------------------------------------------------------------
Name (time in ms)                                                         Min                Max               Mean            StdDev             Median               IQR            Outliers       OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_loads_json[jsonexamples/mesh.json-orjson-loads]                   3.0391 (1.0)      10.0547 (1.0)       3.4522 (1.0)      1.0462 (1.09)      3.1750 (1.0)      0.2672 (1.0)         11;21  289.6715 (1.0)         276           1
test_loads_json[jsonexamples/mesh.json-csimdjson-csimdjson_loads]      4.8624 (1.60)     10.5712 (1.05)      5.3252 (1.54)     0.9655 (1.00)      5.0847 (1.60)     0.2957 (1.11)         9;13  187.7856 (0.65)        196           1
test_loads_json[jsonexamples/mesh.json-json-loads]                    11.5187 (3.79)     17.8210 (1.77)     12.2219 (3.54)     1.1206 (1.17)     11.9125 (3.75)     0.5415 (2.03)          5;5   81.8205 (0.28)         80           1
test_loads_json[jsonexamples/mesh.json-simplejson-loads]              15.0371 (4.95)     20.9504 (2.08)     15.9193 (4.61)     0.9616 (1.0)      15.7204 (4.95)     0.4585 (1.72)          2;4   62.8169 (0.22)         62           1
test_loads_json[jsonexamples/mesh.json-rapidjson-loads]               27.3326 (8.99)     33.6098 (3.34)     28.5304 (8.26)     1.6151 (1.68)     27.7898 (8.75)     1.0767 (4.03)          5;5   35.0503 (0.12)         37           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------- benchmark 'jsonexamples/twitter.json deserialization': 5 tests -----------------------------------------------------------------------------------
Name (time in ms)                                                           Min                Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_loads_json[jsonexamples/twitter.json-orjson-loads]                  3.2781 (1.0)      13.1376 (1.11)     4.2042 (1.0)      1.3922 (1.44)     3.8113 (1.0)      0.8624 (1.25)        20;20  237.8573 (1.0)         234           1
test_loads_json[jsonexamples/twitter.json-csimdjson-csimdjson_loads]     4.1741 (1.27)     16.0963 (1.36)     5.3350 (1.27)     1.4182 (1.47)     4.9819 (1.31)     0.8923 (1.29)        13;12  187.4427 (0.79)        160           1
test_loads_json[jsonexamples/twitter.json-json-loads]                    5.0752 (1.55)     16.8196 (1.42)     6.0501 (1.44)     1.2820 (1.33)     5.7113 (1.50)     0.6900 (1.0)         11;15  165.2853 (0.69)        170           1
test_loads_json[jsonexamples/twitter.json-simplejson-loads]              5.3695 (1.64)     11.8515 (1.0)      6.3214 (1.50)     0.9642 (1.0)      6.0799 (1.60)     0.6925 (1.00)        12;10  158.1935 (0.67)        156           1
test_loads_json[jsonexamples/twitter.json-rapidjson-loads]               7.2519 (2.21)     16.4038 (1.38)     8.7533 (2.08)     1.6014 (1.66)     8.2328 (2.16)     1.5150 (2.20)         11;7  114.2422 (0.48)        112           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

/Ob2:

--------------------------------------------------------------------------------------- benchmark 'jsonexamples/canada.json deserialization': 5 tests ---------------------------------------------------------------------------------------
Name (time in ms)                                                            Min                 Max                Mean             StdDev              Median                IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_loads_json[jsonexamples/canada.json-orjson-loads]                   12.7721 (1.0)       44.6660 (1.0)       24.9496 (1.0)       7.5507 (1.23)      24.3483 (1.0)      11.3058 (1.87)         27;0  40.0808 (1.0)          77           1
test_loads_json[jsonexamples/canada.json-csimdjson-csimdjson_loads]      24.1292 (1.89)      67.6120 (1.51)      36.5351 (1.46)     10.1058 (1.64)      35.4246 (1.45)      7.8892 (1.30)          7;3  27.3709 (0.68)         29           1
test_loads_json[jsonexamples/canada.json-simplejson-loads]               67.1054 (5.25)      87.7825 (1.97)      76.9953 (3.09)      6.1585 (1.0)       76.7263 (3.15)      9.7448 (1.61)          4;0  12.9878 (0.32)         13           1
test_loads_json[jsonexamples/canada.json-json-loads]                     68.4460 (5.36)      95.2086 (2.13)      80.3624 (3.22)      8.0041 (1.30)      82.4061 (3.38)      9.5175 (1.57)          5;0  12.4436 (0.31)         14           1
test_loads_json[jsonexamples/canada.json-rapidjson-loads]               143.9842 (11.27)    164.2740 (3.68)     152.7980 (6.12)      6.8629 (1.11)     151.2086 (6.21)      6.0457 (1.0)           2;0   6.5446 (0.16)          6           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------ benchmark 'jsonexamples/citm_catalog.json deserialization': 5 tests ------------------------------------------------------------------------------------
Name (time in ms)                                                                 Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_loads_json[jsonexamples/citm_catalog.json-orjson-loads]                   7.0504 (1.0)      30.3530 (1.24)     11.8654 (1.0)      4.8345 (1.33)      9.7551 (1.0)      6.5479 (1.86)         14;2  84.2789 (1.0)          81           1
test_loads_json[jsonexamples/citm_catalog.json-csimdjson-csimdjson_loads]     12.0042 (1.70)     24.5645 (1.0)      15.4309 (1.30)     3.9568 (1.09)     13.4198 (1.38)     4.3805 (1.24)         10;3  64.8052 (0.77)         47           1
test_loads_json[jsonexamples/citm_catalog.json-json-loads]                    12.5980 (1.79)     25.5070 (1.04)     16.0024 (1.35)     3.6422 (1.0)      14.5595 (1.49)     3.8339 (1.09)          9;4  62.4908 (0.74)         48           1
test_loads_json[jsonexamples/citm_catalog.json-simplejson-loads]              14.2370 (2.02)     27.5492 (1.12)     17.8507 (1.50)     3.7241 (1.02)     16.6642 (1.71)     3.5186 (1.0)          12;9  56.0203 (0.66)         62           1
test_loads_json[jsonexamples/citm_catalog.json-rapidjson-loads]               18.7392 (2.66)     32.0456 (1.30)     23.2776 (1.96)     3.7246 (1.02)     22.0177 (2.26)     5.9326 (1.69)         15;0  42.9597 (0.51)         50           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------- benchmark 'jsonexamples/github_events.json deserialization': 5 tests -------------------------------------------------------------------------------------------
Name (time in us)                                                                   Min                   Max                Mean              StdDev              Median                IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_loads_json[jsonexamples/github_events.json-orjson-loads]                  265.5000 (1.0)      2,131.5000 (1.62)     316.5130 (1.0)      100.9891 (1.0)      287.8000 (1.0)      39.9000 (1.40)      257;313        3.1594 (1.0)        2395           1
test_loads_json[jsonexamples/github_events.json-json-loads]                    357.8000 (1.35)     2,188.4000 (1.66)     408.6719 (1.29)     117.3531 (1.16)     369.7000 (1.28)     28.5000 (1.0)       217;370        2.4470 (0.77)       2424           1
test_loads_json[jsonexamples/github_events.json-simplejson-loads]              378.5000 (1.43)     2,190.9000 (1.66)     436.6626 (1.38)     117.9327 (1.17)     394.4000 (1.37)     40.4250 (1.42)      246;335        2.2901 (0.72)       2399           1
test_loads_json[jsonexamples/github_events.json-csimdjson-csimdjson_loads]     385.0000 (1.45)     1,316.4000 (1.0)      474.5734 (1.50)     111.6074 (1.11)     452.8000 (1.57)     90.7500 (3.18)      190;143        2.1072 (0.67)       1817           1
test_loads_json[jsonexamples/github_events.json-rapidjson-loads]               412.5000 (1.55)     1,893.6000 (1.44)     487.2423 (1.54)     145.3598 (1.44)     431.3000 (1.50)     56.2750 (1.97)      225;311        2.0524 (0.65)       2031           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------ benchmark 'jsonexamples/mesh.json deserialization': 5 tests -------------------------------------------------------------------------------------
Name (time in ms)                                                         Min                Max               Mean            StdDev             Median               IQR            Outliers       OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_loads_json[jsonexamples/mesh.json-orjson-loads]                   3.3877 (1.0)      16.4911 (1.15)      4.5650 (1.0)      2.0053 (1.51)      3.7943 (1.0)      1.2722 (1.45)        16;15  219.0597 (1.0)         232           1
test_loads_json[jsonexamples/mesh.json-csimdjson-csimdjson_loads]      5.6660 (1.67)     14.3779 (1.0)       6.6014 (1.45)     1.3240 (1.0)       6.1512 (1.62)     0.8802 (1.0)           9;7  151.4829 (0.69)        141           1
test_loads_json[jsonexamples/mesh.json-json-loads]                    11.4239 (3.37)     23.1809 (1.61)     13.7530 (3.01)     2.0783 (1.57)     13.4284 (3.54)     2.2863 (2.60)         14;2   72.7113 (0.33)         58           1
test_loads_json[jsonexamples/mesh.json-simplejson-loads]              15.1747 (4.48)     22.2214 (1.55)     17.2338 (3.78)     1.6043 (1.21)     17.2193 (4.54)     2.2475 (2.55)         22;2   58.0255 (0.26)         63           1
test_loads_json[jsonexamples/mesh.json-rapidjson-loads]               28.6373 (8.45)     37.9510 (2.64)     31.1420 (6.82)     2.0981 (1.58)     30.4538 (8.03)     2.2486 (2.55)          6;2   32.1110 (0.15)         30           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------- benchmark 'jsonexamples/twitter.json deserialization': 5 tests -----------------------------------------------------------------------------------
Name (time in ms)                                                           Min                Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_loads_json[jsonexamples/twitter.json-orjson-loads]                  3.3507 (1.0)      13.2179 (1.0)      4.6349 (1.0)      1.5100 (1.06)     3.9645 (1.0)      1.9602 (2.08)         42;6  215.7554 (1.0)         255           1
test_loads_json[jsonexamples/twitter.json-json-loads]                    5.0405 (1.50)     15.2623 (1.15)     6.1014 (1.32)     1.4215 (1.0)      5.5645 (1.40)     0.9414 (1.0)         22;19  163.8963 (0.76)        180           1
test_loads_json[jsonexamples/twitter.json-csimdjson-csimdjson_loads]     5.4595 (1.63)     16.5721 (1.25)     7.1209 (1.54)     1.6969 (1.19)     6.4730 (1.63)     1.6867 (1.79)         21;7  140.4321 (0.65)        168           1
test_loads_json[jsonexamples/twitter.json-simplejson-loads]              5.4633 (1.63)     19.7964 (1.50)     7.3919 (1.59)     1.9405 (1.37)     6.7216 (1.70)     2.2782 (2.42)         21;4  135.2825 (0.63)        149           1
test_loads_json[jsonexamples/twitter.json-rapidjson-loads]               7.2222 (2.16)     17.1126 (1.29)     9.4650 (2.04)     1.9786 (1.39)     8.8117 (2.22)     2.7553 (2.93)         27;2  105.6519 (0.49)        114           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

@TkTech
Copy link
Member

TkTech commented Jul 2, 2020

@pps83 Can you try adding /Ob3 to your build and let us know if the perf issues disappear? (I'm assuming you're using vs2019)

@lemire
Copy link
Member

lemire commented Jul 3, 2020

@TkTech

The piece of code you point out is almost certainly in stage 1, so it is not related to the consumption of the DOM, but rather it is a cost that you pay upfront early in the parsing. That stage 1 eats a lot in your budget is interesting.

I can build our parse benchmark with /Ob2 and with /Ob3 and I see no difference whatsoever. The parse benchmark should hit the part that you say is hurting you.

I have created a work-in-progress PR to ease reproduction of my results... It basically enables /Ob3.

First, here are my command lines...

 cmake -G "Visual Studio 16 2019"  ..
cmake --build .  --verbose --config Release --target parse
.\benchmark\Release\parse.exe -n 100 ..\jsonexamples\twitter.json

And here are my results... The first pass with /Ob3, the second is with /Ob2 (sorry for the terribly verbose output).

ClCompile:
  C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.25.28610\bin\HostX64\x64\CL.exe /c /IC:\Users\User\github\simdjson\include /IC:\Users\User\github\simdjson\src /nologo /W3 /WX /diagnostics:column /sdl /O2 /D WIN32 /D _WINDOWS  
  /D NDEBUG /D SIMDJSON_THREADS_ENABLED=1 /D "CMAKE_INTDIR=\"Release\"" /D _MBCS /Gm- /EHsc /MD /GS /fp:precise /Zc:wchar_t /Zc:forScope /Zc:inline /GR /std:c++17 /Fo"simdjson.dir\Release\\" /Fd"simdjson.dir\Release\simdjson.pdb" /Gd /TP /errorReport:queue   
  /Ob3 C:\Users\User\github\simdjson\src\simdjson.cpp
  simdjson.cpp
(...)

Time Elapsed 00:00:09.75
PS C:\Users\User\github\simdjson\buildvs2019> .\benchmark\Release\parse.exe -n 100 ..\jsonexamples\twitter.json
number of iterations 100

..\jsonexamples\twitter.json
============================
     9867 blocks -     631515 bytes - 55262 structurals (  8.8 %)
special blocks with: utf8      2284 ( 23.1 %) - escape       598 (  6.1 %) - 0 structurals      1287 ( 13.0 %) - 1+ structurals      8581 ( 87.0 %) - 8+ structurals      3272 ( 33.2 %) - 16+ structurals         0 (  0.0 %)
special block flips: utf8      1104 ( 11.2 %) - escape       642 (  6.5 %) - 0 structurals       940 (  9.5 %) - 1+ structurals       940 (  9.5 %) - 8+ structurals      2593 ( 26.3 %) - 16+ structurals         0 (  0.0 %)

All Stages
|    Speed        :  43.2002 ns per block ( 88.32%) -   0.6750 ns per byte -   7.7142 ns per structural -   1.4814 GB/s
|- Allocation
|    Speed        :   2.2801 ns per block (  4.66%) -   0.0356 ns per byte -   0.4072 ns per structural -  28.0673 GB/s
|- Stage 1
|    Speed        :  23.3583 ns per block ( 47.75%) -   0.3650 ns per byte -   4.1710 ns per structural -   2.7398 GB/s
|- Stage 2
|    Speed        :  17.3794 ns per block ( 35.53%) -   0.2716 ns per byte -   3.1034 ns per structural -   3.6823 GB/s




ClCompile:
  C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.25.28610\bin\HostX64\x64\CL.exe /c /IC:\Users\User\github\simdjson\include /IC:\Users\User\github\simdjson\src /nologo /W3 /WX /diagnostics:column /sdl /O2 /Ob2 /D WIN32 /D _WIN 
  DOWS /D NDEBUG /D SIMDJSON_THREADS_ENABLED=1 /D "CMAKE_INTDIR=\"Release\"" /D _MBCS /Gm- /EHsc /MD /GS /fp:precise /Zc:wchar_t /Zc:forScope /Zc:inline /GR /std:c++17 /Fo"simdjson.dir\Release\\" /Fd"simdjson.dir\Release\simdjson.pdb" /Gd /TP /errorReport:qu 
  eue C:\Users\User\github\simdjson\src\simdjson.cpp
  simdjson.cpp

(...)
PS C:\Users\User\github\simdjson\buildvs2019> .\benchmark\Release\parse.exe -n 100 ..\jsonexamples\twitter.json
number of iterations 100

..\jsonexamples\twitter.json
============================
     9867 blocks -     631515 bytes - 55262 structurals (  8.8 %)
special blocks with: utf8      2284 ( 23.1 %) - escape       598 (  6.1 %) - 0 structurals      1287 ( 13.0 %) - 1+ structurals      8581 ( 87.0 %) - 8+ structurals      3272 ( 33.2 %) - 16+ structurals         0 (  0.0 %)
special block flips: utf8      1104 ( 11.2 %) - escape       642 (  6.5 %) - 0 structurals       940 (  9.5 %) - 1+ structurals       940 (  9.5 %) - 8+ structurals      2593 ( 26.3 %) - 16+ structurals         0 (  0.0 %)

All Stages
|    Speed        :  42.9976 ns per block ( 90.22%) -   0.6719 ns per byte -   7.6780 ns per structural -   1.4884 GB/s
|- Allocation
|    Speed        :   2.5638 ns per block (  5.38%) -   0.0401 ns per byte -   0.4578 ns per structural -  24.9611 GB/s
|- Stage 1
|    Speed        :  23.2469 ns per block ( 48.78%) -   0.3633 ns per byte -   4.1511 ns per structural -   2.7529 GB/s
|- Stage 2
|    Speed        :  17.1565 ns per block ( 36.00%) -   0.2681 ns per byte -   3.0636 ns per structural -   3.7302 GB/s

So I cannot reproduce the beneficial effects of /Ob3 on our code. (I am not saying it is not real, just that I do not see it right now. Maybe I made a mistake?)

Now... I use about 425,000 nanoseconds to parse twitter.json.

You report taking 4 ms or 5 ms... say 5,000,000 ns. That's about 10x.

This suggests that 90% of your running time is spent outside of our core parsing routine (stage 1 and stage 2). If I am correct, it makes it unlikely that the piece of code you point at could use 20% of your running time.

Could you run the parse benchmark from the simdjson project and compare?

@lemire
Copy link
Member

lemire commented Jul 3, 2020

For fun, here is exactly the same benchmark, on the same OS, but using a different compiler (GCC from MSYS2):

> .\benchmark\parse.exe -n 100 ..\jsonexamples\twitter.json
number of iterations 100

..\jsonexamples\twitter.json
============================
     9867 blocks -     631515 bytes - 55262 structurals (  8.8 %)
special blocks with: utf8      2284 ( 23.1 %) - escape       598 (  6.1 %) - 0 structurals      1287 ( 13.0 %) - 1+ structurals      8581 ( 87.0 %) - 8+ structurals      3272 ( 33.2 %) - 16+ structurals         0 (  0.0 %)
special block flips: utf8      1104 ( 11.2 %) - escape       642 (  6.5 %) - 0 structurals       940 (  9.5 %) - 1+ structurals       940 (  9.5 %) - 8+ structurals      2593 ( 26.3 %) - 16+ structurals         0 (  0.0 %)

All Stages
|    Speed        :  23.6015 ns per block ( 65.01%) -   0.3688 ns per byte -   4.2145 ns per structural -   2.7115 GB/s
|- Allocation
|    Speed        :   2.2700 ns per block (  6.25%) -   0.0355 ns per byte -   0.4053 ns per structural -  28.1926 GB/s
|- Stage 1
|    Speed        :  10.0730 ns per block ( 27.75%) -   0.1574 ns per byte -   1.7987 ns per structural -   6.3533 GB/s
|- Stage 2
|    Speed        :  11.1471 ns per block ( 30.71%) -   0.1742 ns per byte -   1.9905 ns per structural -   5.7410 GB/s

4293.7 documents parsed per second (best)
PS C:\Users\User\github\simdjson\build> g++ --version
g++.exe (Rev1, Built by MSYS2 project) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

It is almost twice as fast!!!

This should not be possible. There is no way that GNU GCC is 2x faster than Visual Studio. Someone at Microsoft would get fired if it were true.

There is a very real mystery.

(Note that this is verified using other benchmarks as well.)

@TkTech
Copy link
Member

TkTech commented Jul 3, 2020

@lemire for reproduction purposes, can you double check the exact version of MSVC you're using? Inlining behavior was changed in 16.3.

@lemire
Copy link
Member

lemire commented Jul 3, 2020

@TkTech

Can you report your results with...

mkdir build && cd build && cmake -G "Visual Studio 16 2019" -B .. && cmake --build . --target parse --config Release && benchmark\Release\parse ..\jsonexamples\twitter.json

We need to be able to reason about it.

The simdjson library itself definitively uses less than a nanosecond per byte. (For just the core parsing step.)

@lemire
Copy link
Member

lemire commented Jul 3, 2020

I seem to have 16.5.5.

Screen Shot 2020-07-02 at 8 14 19 PM

That's not the very latest, but it is not old.

@lemire
Copy link
Member

lemire commented Jul 3, 2020

@TkTech Irrespective of my version, you should be doing the core parsing at less than 1 ns per input byte. The twitter file has 631515 bytes. You seem to be using something like 8 ns per input byte. Again, this is for the core parsing routine... if you are doing extra work, the story differs, of course.

@lemire
Copy link
Member

lemire commented Jul 7, 2020

The Python/MSVC appears to be be resolved. I have updated my Python demo at https://github.com/lemire/simple_simdjson_python_wrapper

I am achieving 1.6 GB/s with twitter.json when compiling the extension with Visual Studio which is the same result I get using our C++ benchmark. So parsing twitter.json takes 0.4 ms on my setup. This is without /Ob3.

This is pretty reassuring. Of course, the Python wrapper just literally calls the C++ function from the library... so it should hit the same speed (minus some call overhead due to Python).

It is still the case that compiling the Python extension with ClangCL (if possible) or GCC MSYS2 would give much better performance.

@lemire
Copy link
Member

lemire commented Jul 7, 2020

I might have a lead. Under Visual Studio, half of the running time of stage 1 is due to UTF-8 validation even when the input is pure ASCII. This is unexpected: we have UTF-8 validation tuned very finely. This should not happen.

@lemire
Copy link
Member

lemire commented Jul 7, 2020

Looking at the assembly, it could be that the data is reloaded into vector registers, instead of having the registers just be passed around. (This is speculative, but there seems to be a lot of loads where I do not expect them.)

@lemire
Copy link
Member

lemire commented Jul 7, 2020

I have now become convinced that the problem is tied to the abstraction (simd8 class) that we have put on top of the intrinsics. When I toy with the abstraction, I make Visual Studio go faster. Just like Visual Studio got faster after I removed the functional programming idioms, I think I can make it go faster by removing other high level abstractions.

@terrymah
Copy link

terrymah commented Jul 7, 2020 via email

@lemire
Copy link
Member

lemire commented Jul 7, 2020

I have now made the gap much more acceptable. It is only 30% compared to ClangCL. It is not just this one commit, I had made gains earlier...

Platform Overall Stage 1 Stage 2
MSVC new 2.0699 GB/s 4.3916 GB/s 4.7304 GB/s
ClangCL 2.7232 GB/s 6.5038 GB/s 5.7990 GB/s

@terrymah We had a data structure simd::simd8x64<uint8_t> that contains, say, 2 256-bit registers/variables. We were passing it around by copy. Our expectation was that the compile would elide the copies but my impression that Visual Studio was doing either a lot of copies, or else some jumping around. I am not sure. Passing it by reference (and deleting the copy constructor) helped a lot.

Earlier, we had a "lambdas" that would be passed to this simd8x64 and do map/reduce on the 2 or 4 variables. I have removed these lambdas and instead I use simpler code like this...

really_inline simd8<T> reduce_or() const {
       return this->chunks[0] | this->chunks[1];
}

This helped. I had to do it because the latest Visual Studio 2019 update broke our built with lambdas under ClangCL... Under ClangCL, to use SIMD instructions, we use "function attributes" but they would not propagate to the lambdas. (This is not necessarily a bug in Visual Studio but it feels like a bug.) Anyhow, doing that accidentally helped our performance.

Another thing that helped is that we had this loop...

this->check_utf8_bytes(input.chunks[0], this->prev_input_block);		
for (int i=1; i<simd8x64<uint8_t>::NUM_CHUNKS; i++) {		
           this->check_utf8_bytes(input.chunks[i], input.chunks[i-1]);		
}

where NUM_CHUNKS is a compile-time constant with value 2... Removing the loop helped a lot...

 this->check_utf8_bytes(input.chunks[0], this->prev_input_block);		
 this->check_utf8_bytes(input.chunks[1], input.chunks[0]);		

This surprised me a little.

So it is a bunch of small things like that...

Basically, it comes down to coding at lower level, with fewer abstractions.

@lemire
Copy link
Member

lemire commented Jul 7, 2020

Link to relevant PR: #1031

It is possible that much more gain could be possible. I don't know exactly how to go about it. I am trying to recognize patterns that Visual Studio does not handle well.

@lemire lemire added this to the 0.5 milestone Jul 7, 2020
@lemire
Copy link
Member

lemire commented Jul 7, 2020

@jkeiser I am planning to close this issue at some point after #1031 is merged since we will have largely resolved the matter. It will only remain to figure out how to get Visual Studio to be closer in performance to ClangCL... but a 30% gap is a lot easier to manage and justify.

@jkeiser
Copy link
Member Author

jkeiser commented Jul 7, 2020

Agreed.

@lemire
Copy link
Member

lemire commented Jul 8, 2020

@terrymah I think I have identified a large difference between Visual Studio and ClangCL that probably explains the remaining performance difference. Visual Studio (but not ClangCL) is spending the bulk of its instructions loading and unloading vector registers. I cannot explain it but the volume of stores and loads does not make any sense. This code is mostly computational, there should be very few loads and stores.

My explanation is probably wrong and naive, but it is as if Visual Studio thought "oh! I am done with this code, I must de-register everything"... and then... "Oh, new code path, I must re-register everything again"... and so forth. Note though that it is all one function... but it is as if Visual Studio was scared of leaving stuff in Y registers.

Compiling with /arch:AVX2 does not help.

Screen Shot 2020-07-08 at 11 12 56 AM

@maksqwe
Copy link

maksqwe commented Jul 8, 2020

@lemire
Copy link
Member

lemire commented Jul 8, 2020

I have reported the issue at

https://developercommunity.visualstudio.com/content/problem/1106837/unexpected-volume-of-256-bit-loads-and-stores-in-h.html

There are severel bugs reported to msvc support forum: (...) I think these bugs connected with this one.

It might be. For now, I would just settle on understanding what the compiler is doing. If I understand what it is doing, then I might be able to produce a workaround.

@terrymah
Copy link

terrymah commented Jul 8, 2020

Thanks guys, we'll look into it on our end! Often times the compiler (pessimistically) thinks it has to reload the registers if some external call it made and the state/contents of the variable or memory are potentially unknown again. This is what we call "interference calculation" combined with "escape analysis" to know what is going on with variables and memory. But I'll let you know what we find out.

@lemire
Copy link
Member

lemire commented Jul 8, 2020

@terrymah Thank you.

@lemire
Copy link
Member

lemire commented Jul 9, 2020

@terrymah I have been experimenting...

struct go  {
    __m128i array[4];
};

__m128i orme(go & x) {
    __m128i z = x.array[0];
    for(size_t i = 1; i < 4; i++)
      z = _mm_or_si128(z, x.array[i]);
    return z;
}

Both GCC and clang compile it to something like...

orme(go&):
        movdqa  xmm0, XMMWORD PTR [rdi]
        movdqa  xmm1, XMMWORD PTR [rdi+32]
        por     xmm0, XMMWORD PTR [rdi+16]
        por     xmm1, XMMWORD PTR [rdi+48]
        por     xmm0, xmm1
        ret

Yet Visual Studio does the following...

__m128i orme(go &) PROC                    ; orme, COMDAT
        movdqu  xmm1, XMMWORD PTR [rcx]
        lea     rax, QWORD PTR [rcx+16]
        mov     ecx, 3
        npad    3
$LL4@orme:
        movdqu  xmm0, XMMWORD PTR [rax]
        lea     rax, QWORD PTR [rax+16]
        por     xmm0, xmm1
        movdqa  xmm1, xmm0
        sub     rcx, 1
        jne     SHORT $LL4@orme
        ret     0

I don't understand what Visual Studio is doing.

@terrymah
Copy link

Nor I. I mean, clearly we didn't unroll the loop and vectorize for unknown reasons. Is there a devcomm bug on this? It looks like you have a good minimal example that should be a lot more pleasant to investigate.

@lemire
Copy link
Member

lemire commented Jul 10, 2020

@terrymah

I added it to https://developercommunity.visualstudio.com/content/problem/1106837/unexpected-volume-of-256-bit-loads-and-stores-in-h.html as a comment.

I can open a whole new issue if you'd like.

I am of a mixed mind about it because I actually fixed it by manually unrolling our code throughout. So I no longer have this issue, but I did. I am mildly concerned that someone could say "just unroll the loop manually" (someone, for real, told me this), which would be missing the point, of course.

I take is a symptom of a large issues where Visual Studio seems to be doing something unexpected when vector registers are involved.

I am a hard working guy, so I will eventually reduce the problems to various cases. And I can create a new issue each time. But it seems to me quite likely that they are all closely related.

@terrymah
Copy link

So, we looked into it, and it turns out the issue is that having an intrinsic in a loop actually causes our loop unroller to bail out. So that's not great. We'll get it fixed.

@lemire
Copy link
Member

lemire commented Jul 13, 2020

@terrymah I have been able to reduce an example of crazy register unloading and reloading...

#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
#include <cstdint>


struct simd8x32 {
  const __m256i chunks;
};

bool inline is_ascii(simd8x32 input) {
  return !_mm256_testz_si256(input.chunks, _mm256_set1_epi8(0b10000000));
}

simd8x32 * count_ascii(simd8x32 * x) {
    while(is_ascii(*x)) {
        x++;
    }
    return x;
}

What this code does is simple enough... you just advance as long as your inputs are "ASCII". If you are SIMD fluent, then you can compile the result in your head.

So clang will compile it to...

       .quad   -9187201950435737472    # 0x8080808080808080
count_ascii(simd8x32*):              # @count_ascii(simd8x32*)
        mov     rax, rdi
        add     rax, -32
        vpbroadcastq    ymm0, qword ptr [rip + .LCPI0_0] # ymm0 = [9259542123273814144,9259542123273814144,9259542123273814144,9259542123273814144]
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        vmovdqa ymm1, ymmword ptr [rax + 32]
        add     rax, 32
        vptest  ymm1, ymm0
        jne     .LBB0_1
        vzeroupper
        ret

The hot spot is very clean...

.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        vmovdqa ymm1, ymmword ptr [rax + 32]
        add     rax, 32
        vptest  ymm1, ymm0
        jne     .LBB0_1

You load a register, increment the location, execute vptest, and conditionally jump.

But Visual Studio will do that...

simd8x32 * count_ascii(simd8x32 *) PROC       ; count_ascii, COMDAT
$LN13:
        push    rbp
        sub     rsp, 64                             ; 00000040H
        lea     rbp, QWORD PTR [rsp+32]
        and     rbp, -32                      ; ffffffffffffffe0H
        movups  xmm0, XMMWORD PTR [rcx]
        movups  xmm1, XMMWORD PTR [rcx+16]
        movaps  XMMWORD PTR $T1[rbp], xmm0
        movaps  XMMWORD PTR $T1[rbp+16], xmm1
        vmovdqu ymm0, YMMWORD PTR $T1[rbp]
        vptest  ymm0, YMMWORD PTR __ymm@8080808080808080808080808080808080808080808080808080808080808080
        vzeroupper
        je      SHORT $LN11@count_asci
        vmovdqu ymm2, YMMWORD PTR __ymm@8080808080808080808080808080808080808080808080808080808080808080
        npad    7
$LL2@count_asci:
        vmovups xmm0, XMMWORD PTR [rcx+32]
        add     rcx, 32                             ; 00000020H
        vmovaps XMMWORD PTR $T1[rbp], xmm0
        vmovups xmm1, XMMWORD PTR [rcx+16]
        vmovaps XMMWORD PTR $T1[rbp+16], xmm1
        vmovdqu ymm0, YMMWORD PTR $T1[rbp]
        vptest  ymm0, ymm2
        jne     SHORT $LL2@count_asci
$LN11@count_asci:
        mov     rax, rcx
        add     rsp, 64                             ; 00000040H
        pop     rbp
        ret     0

The hot loop is ...

$LL2@count_asci:
        vmovups xmm0, XMMWORD PTR [rcx+32]
        add     rcx, 32                             ; 00000020H
        vmovaps XMMWORD PTR $T1[rbp], xmm0
        vmovups xmm1, XMMWORD PTR [rcx+16]
        vmovaps XMMWORD PTR $T1[rbp+16], xmm1
        vmovdqu ymm0, YMMWORD PTR $T1[rbp]
        vptest  ymm0, ymm2
        jne     SHORT $LL2@count_asci

I don't understand this assembly, but there are too many loads. The XMM registers are unneeded.

I'll add this example to the issue I already flagged (instead of opening a new one).

Here is the godbolt link.

@lemire
Copy link
Member

lemire commented Jul 13, 2020

@terrymah Ah!!! It is inlining related !!!

You can make the issue disappear if you manually inline the is_ascii function like so...

simd8x32 * fast_count_ascii(simd8x32 * x) {
  while(!_mm256_testz_si256(x->chunks, _mm256_set1_epi8(0b10000000))) {
   x++;
  }
  return x;
}

Then Visual Studio will generate this much better assembly...

simd8x32 * fast_count_ascii(simd8x32 *) PROC  ; fast_count_ascii, COMDAT
vmovdqu ymm1, YMMWORD PTR__ymm@8080808080808080808080808080808080808080808080808080808080808080
vmovdqu ymm0, YMMWORD PTR [rcx]
vptest  ymm0, ymm1
je  SHORT $LN9@fast_count
npad  13
$LL2@fast_count:
vmovdqu ymm0, YMMWORD PTR [rcx+32]
add  rcx, 32  ; 00000020H
vptest  ymm0, ymm1
jne  SHORT $LL2@fast_count
$LN9@fast_count:
mov  rax, rcx
vzeroupper
ret  0

This is really, really bad news for simdjson because it is all engineered with little functions that all get magically inlined...

Godbolt link https://godbolt.org/z/bf6oz7

@lemire
Copy link
Member

lemire commented Jul 14, 2020

Note that even if I compile with /arch:AVX2 (something we want to avoid), the performance is still just as bad, and the loads are still there, albeit different...
Screen Shot 2020-07-13 at 10 13 24 PM

@lemire
Copy link
Member

lemire commented Jul 14, 2020

@terrymah I think that there is an issue with inlining. If you have a struct containing a single vector register, you expect to get good performance by passing it by value... and this is indeed what simdjson does, but under Visual Studio, when combined with inlining, it sometimes deliver poor results. Let me illustrate.

#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
#include <cstdint>


struct simd8x32 {
 __m256i chunks;


};


bool inline is_ascii_ref(simd8x32& v) {
    return !_mm256_testz_si256(v.chunks, _mm256_set1_epi8(0b10000000));
}

bool inline is_ascii_copy(simd8x32 v) {
    return !_mm256_testz_si256(v.chunks, _mm256_set1_epi8(0b10000000));
}



simd8x32 * count_ascii_copy(simd8x32 * x) {
    while(is_ascii_copy(*x)) {
        x++;
    }
    return x;
}

simd8x32 * count_ascii_ref(simd8x32 * x) {
    while(is_ascii_ref(*x)) {
        x++;
    }
    return x;
}

The count_ascii_copy function gets compiled as

simd8x32 * count_ascii_copy(simd8x32 *) PROC        ; count_ascii_copy, COMDAT
$LN13:
        push    rbp
        sub     rsp, 64                             ; 00000040H
        lea     rbp, QWORD PTR [rsp+32]
        and     rbp, -32                      ; ffffffffffffffe0H
        movups  xmm0, XMMWORD PTR [rcx]
        movups  xmm1, XMMWORD PTR [rcx+16]
        movaps  XMMWORD PTR $T1[rbp], xmm0
        movaps  XMMWORD PTR $T1[rbp+16], xmm1
        vmovdqu ymm0, YMMWORD PTR $T1[rbp]
        vptest  ymm0, YMMWORD PTR __ymm@8080808080808080808080808080808080808080808080808080808080808080
        vzeroupper
        je      SHORT $LN11@count_asci
        vmovdqu ymm2, YMMWORD PTR __ymm@8080808080808080808080808080808080808080808080808080808080808080
        npad    7
$LL2@count_asci:
        vmovups xmm0, XMMWORD PTR [rcx+32]
        add     rcx, 32                             ; 00000020H
        vmovaps XMMWORD PTR $T1[rbp], xmm0
        vmovups xmm1, XMMWORD PTR [rcx+16]
        vmovaps XMMWORD PTR $T1[rbp+16], xmm1
        vmovdqu ymm0, YMMWORD PTR $T1[rbp]
        vptest  ymm0, ymm2
        jne     SHORT $LL2@count_asci
$LN11@count_asci:
        mov     rax, rcx
        add     rsp, 64                             ; 00000040H
        pop     rbp
        ret     0

The count_ascii_ref function gets compiled as

simd8x32 * count_ascii_ref(simd8x32 *) PROC   ; count_ascii_ref, COMDAT
        vmovdqu ymm1, YMMWORD PTR __ymm@8080808080808080808080808080808080808080808080808080808080808080
        vmovdqu ymm0, YMMWORD PTR [rcx]
        vptest  ymm0, ymm1
        je      SHORT $LN11@count_asci
        npad    13
$LL2@count_asci:
        vmovdqu ymm0, YMMWORD PTR [rcx+32]
        add     rcx, 32                             ; 00000020H
        vptest  ymm0, ymm1
        jne     SHORT $LL2@count_asci
$LN11@count_asci:
        mov     rax, rcx
        vzeroupper
        ret     0

@lemire
Copy link
Member

lemire commented Jul 31, 2020

I am going to close this. The issue is largely resolved.

@lemire lemire closed this as completed Jul 31, 2020
@bulk88
Copy link

bulk88 commented Jul 21, 2021

Just going to randomly jump in, if you dont use /fp:fast VC performs alot of extra moves/rounds https://social.msdn.microsoft.com/Forums/vstudio/en-US/0e319e71-75fc-4213-91b4-f91c19b60dbd/strange-fp-floating-point-model-flag-behavior?forum=vcgeneral

MOVUPS I think is supposed to round or trigger FP exceptions on paper, MOVDQA is the "proper" instruction. VC thinks its FP SIMD when its not. FP fast switch removes "ANSI C/IEEE" round on every operand behavior.

Also LTCG/WPO is required with VC for modern inlining/"random calling conventions".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance platform coverage Makes simdjson run on more machines or compilers.
Projects
None yet
Development

No branches or pull requests

6 participants