Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compile-time and runtime CPU feature (SIMD) detection and dispatch #115

Merged
merged 79 commits into from
Aug 1, 2022

Conversation

mosra
Copy link
Owner

@mosra mosra commented Mar 17, 2021

Moved from mosra/magnum#306, as such low-level scaffolding is better to have here. Corrade's own algorithms for memory copies or shuffles could benefit from these as well.

Things to do (mostly moved from mosra/magnum#306):

  • implement some Utility::System::alignedAllocation() Utility::allocateAligned() that uses posix_memalign (or the Windows equivalent), since aligned allocations are not available until C++17. Emscripten alternative? For other platforms go with the "dumb" way of allocating alignment - 1 and then offseting. -- c095551
  • Implement CORRADE_LIKELY() and CORRADE_UNLIKELY() annotations including the C++20 attribute support -- 2ab85d1
  • Add DoNotOptimize and ClobberMemory utils to TestSuite to make benchmarking more predictable meh, my use case involves calling large functions that don't suffer from common microbenchmark pitfalls
  • Non-deprecated Release build on GCC 4.8.5 on the CI fails to detect any runtime features, investigate I'm stupid, used CORRADE_ASSERT() and not CORRADE_ASSERT_OUTPUT(), good thing I have a build without assertions!
  • Think more about the design of the SIMD dispatch tags, should they have its own namespace? should I name them Math::SseSimdT (SimdSseT?) instead to avoid colon cancer? or not in Math at all, but rather in root, Simd::SseT, as there will be non-math things using those as well? in Corrade
    • should I put them into the Tags.h header to avoid having an excessive number of extremely small headers? probably not since the runtime dispatch thingy may be quite involved (including system headers etc) Corrade/Simd.h
    • Make the tags non-default-constructible to avoid issues with overloads (foo({}) should not pick foo(Simd::ScalarT) and should also not result in ambiguous overloads)
    • how to make the templated benchmarks nicer without using the ugly T{typename T::Init{}}? any chance to run a test case directly as testFoo<Simd::Sse3>() without having to specify the type as well (testFoo<Simd::Sse3T, Simd::Sse3>() is ew) -- Simd::tag<T>() and Simd::feature<T>() is it
    • have a Simd::DefaultT typedef that's one of these based on what the code was compiled for
    • and then a Simd::Runtime{} for a runtime dispatch (and testing?) or Simd::Detect{}? (doesn't fit well if there would be runtime params to it -- Simd::Detect{Simd::Avx2} feels strange and confusing, but Simd::Runtime{Simd::Avx2} correctly suggest that an AVX2 path was chosen at runtime) Simd::Features{}
    • what about tag combinations? let's say AVX2 + FMA or SSE4 + POPCNT? -- solved now
      • add Fma, Popcnt, ... tags not derived from anything (otherwise we'd have diamond inheritance and that's only solveable with virtual inheritance which has non-zero runtime impact)
      • add implicit conversion operators to the tag types for pathological combinations like Simd::Avx2Fma -> Simd::AvxFma, if Simd::Avx*Fma are types derived from Simd::Avx* and Simd::Fma? useless, as inheritance gets picked over conversion operators and so Avx2Fma -> Scalar has a higher priority than Avx2Fma -> AvxFma :(
      • some other option? could be also e.g. foo(Simd::Avx2T, Simd::FmaT, ...) but then there needs to be a template dispatcher that takes combined flags and expands them this way
      • have an implicit extra parameter that defines a concrete ordering priority, such as in https://gist.github.com/sthalik/0828be17146ddad36f4a3e11b47cbbcc -- YES thank you
    • tagged functions have different signatures and while it makes dispatching easier, putting them into a runtime dispatch table is impossible unless wrapped in lambdas solved now
    • document the decision process on why the tags are ordered like this (maybe reuse the diagram from the blog post) -- no suspicious decision to document after all, but the diagram is still nice
  • Should the particular tagged implementations be exposed as well? (if not, then the SIMD tags are useless to have publicly) nope, only the high-level dispatcher taking the Simd::Features "enum", and only internally
    • Think about doc workflow -- combine their docs into a single documented function explaining what tags can be used instead of having it repeated 4+ times? or just use @overload and, given the tags are implicit, the user doesn't really need to know what all variants are there --> ideally i should fix the doc generator to collapse the overloads together already obsolete as only the high-level dispatcher is exposed
  • investigate FMV not portable, can't use, need to do our own dispatch instead
    • GCC is warning about unused functions in the multi-versioned case, not sure why? is that a bug? obsolete
    • investigate what subset of FMV is available on GCC 4.8 (is AVX2 there?) obsolete
    • any possibility of FMV-like functionality on MSVC or we need to select at compile time there isn't
  • runtime CPU feature detection -- Simd::Features
  • GCC has a target attribute, check if that works on Clang as well and what's the alternative on MSVC
    • use it instead of the CMake source-file flags for Simd.cpp
    • provide some macros for these the obvious name conflicts with CORRADE_TARGET_* CORRADE_ENABLE_AVX_FMA, e.g., also defined only if under x86 and the compiler supports that so we can use it also as an #ifdef around the functions (and it'll be defined but empty on MSVC, for example)
  • Make the Traits class public and add name() to it, similarly to Math::TypeTraits in Magnum -- need this for labeling tests
  • Simd::Avx2|Simd::Fma should result in a new type and not a runtime Features flag to avoid the need for Avx2Fma, AvxFma, Avx2FmaF16c... combined tags, in function signatures it's then decltype(Simd::Avx2|Simd::Fma) but hidden behind a macro because the reality is always more complex than just one decltype
  • provide something like GLIBC_TUNABLES to override CPU feature detection at runtime? https://stackoverflow.com/a/61048314 would need env access in the low level code, which is rather heavyweight postponed, possible if IFUNC is not enabled to get env access
  • rename to Cpu?? Since it's not really just SIMD anymore...
  • implement ARM detection on macOS and iOS
  • implement the runtime dispatch helpers
  • implement the ifunc dispatch where available
    • on ARM android (API 30+?) it gives me the hwcaps itself which is nice which is the only way to implement this, use that
    • on ARM Android API < 30 and Linux it's called so early that i can't use getauxval()... just disable this altogether :/
    • actually not even x86 GCC 4.8 works, probably need to inline the cpuid query?? --coverage was broken
    • sanitizers also don't work, disable it there and add a kill switch to configure.h
    • does it allow me to overwrite the function pointer later? -- it doesn't, need a fallback for testing
    • deinlined works only when no other SO has ifunc (i.e., CpuTest alone worked fine, but when CorradeUtility got ifuncs as well, it started crashing on older GCC and on all Clang) -- have to inline it, same case for ARM
      • figure out how to not have to include the massive intrinsics file :( and the system headers -- rewrote in asm on GCC/Clang, forward-declaring the intrinsics on MSVC
  • on mac/windows go with shared library constructors like already used for plugin registration
    • what's the difference from ifunc, actually? -- ability to go w/o function pointer, one indirection less
    • verify on a larger real-world scenario that the proxied implementation gets inlined
  • patch the TARGET_ defines and runtime-detected features to be really supersets as documented, to handle outlier Celeron or Atom not following the order obsolete, i now handle the extras properly so AVX+FMA, AVX+F16C and AVX2 without FMA is all possible and correctly handled
  • multiple __attribute__((__target__(...))) annotations work only on Clang, GCC takes just the last one -- make a CORRADE_ENABLE() macro
  • if I import TZCNT with -mbmi on GCC 4.8 and then later use it in a function that has -mavx -mbmi set, it'll fail to link -- disable all "extra" CORRADE_ENABLE_ macros there :(
    • maybe look which commit fixed it for 4.9 and check if there could be a workaround? there's no way
  • look into optimizing the dispatcher to use equality comparison instead of & so compilers can optimize better meh, the disassembly is good enough and it's not called in a hot loop either
  • publish the blog post once all the remaining cursed crap is solved -- https://blog.magnum.graphics/backstage/cpu-feature-detection-dispatch/

Practical use cases inside Corrade to prove this

Needs to be a part of this PR, otherwise I'd likely merge a state that isn't practically usable.

  • make it possible to disable runtime dispatch at compile time (CORRADE_BUILD_CPU_RUNTIME_DISPATCH, enabled by default only on Linux / Android, other platforms have it experimental and OFF) ... or have a possibility to disable runtime CPU detection as well? it will rely just on compile-time options and have the functions truly baked in
    • tests will force pointers by default for best coverage, but such behavior should be disableable for easier benchmarking (CORRADE_BUILD_TESTS_FORCE_CPU_POINTER_DISPATCH)
  • Character variant of String::find() and findLast() (other variants postponed, not critical to verify the design)
    • Inline the functions in headers and make them call into an internal API
    • x86 SSE2 + LZCNT / TZCNT
      • AVX variant also, as SSE2 doesn't seem to beat memchr() yet
      • exploit LZCNT / TZCNT returning 32 for zero input to reduce branching postponed, although still requiring them
    • WASM SIMD128
    • ARM NEON
    • Add unified macros to VisibilityMacros.h that unify IFUNC, pointer and compile-time dispatch not really possible, it's all just aliases anyway
    • Make a library-specific use of them in visibility.h similarly to the export macros
    • Have the *TestLib explicitly use function pointers, independently of how the main build is configured
      • And SIMD128 on WebAssembly, hopefully the old Emscripten on CI knows it well enough already implicitly disabled for old Emscripten and ancient Node.js in EMSDK (thus we can't test on the CI at all)
    • Iterate over all arch variants in the tests
      • skip the unsupported ones
    • How to test that they actually get picked, apart from printing the function pointer and eyeballing it? based on the benchmark numbers, if they're not differing much then I'm doing something wrong, also coverage reports
    • Test edge cases present only in the SIMD variants (aligned/unaligned, too small, all possible mask cases)
    • Benchmark all variants, compare to STL and libc
      • Also the small cases
      • Compare to memrchr() on Linux
  • Something in Magnum to verify cross-SO usability (SIMD playground magnum#306) it didn't work even in Corrade, thus nothing to verify
  • ASan complains that CorradeUtility and CorradeUtilityTestLib have the dispatch function pointers duplicated (well, yes, but so are also all functions), solve this somehow (or just suppress?) -- link to CorradeTestSuiteTestLib (which would link to CorradeUtlityTestLib instead of CorradeUtility)

@mosra mosra added this to the 2020.0b milestone Mar 17, 2021
@mosra mosra added this to TODO in Platforms via automation Mar 17, 2021
@mosra mosra mentioned this pull request Mar 17, 2021
33 tasks
@LB--
Copy link
Contributor

LB-- commented Mar 18, 2021

You may already know this, but on Windows at least with Microsoft's CRT they can't implement std::aligned_alloc because the C++ standard requires it to be able to be freed with std::free, whereas Microsoft's memory allocation routines use a separate free function for freeing aligned memory vs ordinary allocations. This means that you have to account for remembering which free function to call on Windows if you want an owning pointer to be able to refer to both aligned and unaligned memory.

@mosra
Copy link
Owner Author

mosra commented Mar 18, 2021

on Windows at least with Microsoft's CRT they can't implement std::aligned_alloc because the C++ standard requires it to be able to be freed with std::free

Thank you, yeah, AFAIK this is one of the reasons it took so long to get in the standard. In my case I'll be implementing this via custom Array deleters anyway, so a different deallocation function should be no problem here :)

@mosra mosra self-assigned this Mar 22, 2021
@mosra mosra force-pushed the simd branch 5 times, most recently from b0b6d1d to 28f6249 Compare March 22, 2021 16:18
@codecov
Copy link

codecov bot commented Mar 22, 2021

Codecov Report

Merging #115 (86bb47f) into master (63a5f4b) will increase coverage by 0.02%.
The diff coverage is 98.40%.

@@            Coverage Diff             @@
##           master     #115      +/-   ##
==========================================
+ Coverage   97.89%   97.92%   +0.02%     
==========================================
  Files         130      132       +2     
  Lines       10402    10739     +337     
==========================================
+ Hits        10183    10516     +333     
- Misses        219      223       +4     
Impacted Files Coverage Δ
src/Corrade/Cpu.h 100.00% <ø> (ø)
src/Corrade/Utility/Memory.h 100.00% <ø> (ø)
src/Corrade/Containers/StringView.cpp 98.02% <97.79%> (-0.52%) ⬇️
src/Corrade/Containers/StringView.h 93.22% <100.00%> (+3.60%) ⬆️
src/Corrade/Cpu.cpp 100.00% <100.00%> (ø)
src/Corrade/TestSuite/Tester.h 100.00% <0.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us.

@mosra
Copy link
Owner Author

mosra commented May 31, 2021

Cherry-picked the following features to master, as their design is finished and the API won't need any changes anymore:

  • CORRADE_LIKELY() and CORRADE_UNLIKELY() macros in 2ab85d1
  • Aligned allocation in c095551

The remainning SIMD-related changes still need a real-world case that proves their usefulness.

mosra added 16 commits August 1, 2022 17:45
See? This is what it takes to add a new variant and have it tested.
The 31 bytes case is significantly faster due to this falling back to 2
overlapping SSE2 vectors, but we're interested in the scalar fallback
too.
Currently it unconditionally adds -msimd128 when building the test. I
expect this to bring immense pain later.
Doesn't seem to make the usual case any slower, but makes the small case
1.5-1.8x faster on x86 and 25% on WASM. Which is quite significant.
Couldn't make a common helper to cover both because then it stopped
being fast on WASM. Not on x86 tho. Strange, hitting some inlined size
limit??
I ... didn't expect to hit such massive instruction portability issues
so soon. But even with that it's still running circles around the
standard memchr().
There was quite a lot of renaming and shuffling around before then, so
be sure to verify only the actually finalized variant and not something
else.
While the earlier versions support a certain WIP variant of the
intrinsics, it's not all, and some of these were renamed, opcodes
renumbered etc. Doesn't really make sense to pretend they're usable in
that case.

And then emsdk DOES NOT MAKE IT ANY SIMPLER by bundling a prerelease of
it, thus we *also* have to check emscripten version to be really sure
that SIMD128 can be used. UGH
…est().

Snce version 15 Node.js ships with V8 8.6, which contains also the
remaining bitmask instructions, so from that version onwards there
should be no harm from having SIMD enabled. Older versions have SIMD
incomplete, meaning general code won't really run anyway, independently
of the flag being set or not.
Need to use NodeJs_VERSION in order to figure out a default for a CMake
option. The curse runs deep, yes?
I... can't even. This was supposed to be a simple one-line change, not
an investigation spanning several days!!
It surely sounds like there's a lot more buts than actually
working features.
@mosra mosra marked this pull request as ready for review August 1, 2022 16:29
Since 9.3 is the stock default on 20.04, it doesn't sound practical to
just fail compilation there. I can't tell everyone to

    apt install g++-9

and then suffer through compiler switching in all their pipelines and CI
scripts. Another option I considered was disabling (not defining) *all*
CORRADE_ENABLE_* macros on this compiler, but again that would mean
Ubuntu 20.04 users would silently have a much worse perf unless they
switch compilers or use -march=native. Also far from ideal.

Instead I'm just not using trailing return types anymore, which is
actually the simplest way to get rid of this problem. Also adding a
dedicated test for CORRADE_ENABLE_* and lambdas into CpuTest, since this
was so far only tested in production, i.e. inside StringView code. Which
is no good.

This reverts commit 6615e44.
Comparing regular functions, function pointers, IFUNCs and all that also
in an external (dynamic) library.
This was a *rewarning* learning experience.
@mosra mosra merged commit 86bb47f into master Aug 1, 2022
Platforms automation moved this from TODO to Done Aug 1, 2022
@mosra mosra deleted the simd branch August 1, 2022 21:31
@mosra
Copy link
Owner Author

mosra commented Aug 2, 2022

For anyone subscribed to this PR, a post with a detailed overview of the new features has just been published: https://blog.magnum.graphics/backstage/cpu-feature-detection-dispatch/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Platforms
  
Done
Development

Successfully merging this pull request may close these issues.

None yet

2 participants