Support for VNNI (Vector Neural Network Instructions 512) - Request for testing #2987
No functional change bench: 4244812
Probably best measured with the new bench commands, so we can see the impact of NNUE relative to classical on VNNI.
In the meantime, I was able to verify that the bench is correct using the Intel Software Development Emulator.
Emulation support for the Intel® Advanced Matrix Extensions (Intel® AMX) :-) Edit... at some point NNUE could be faster than classical.
I agree it could. It's just that the first implementation is supposed to be on Sapphire Rapids, which isn't scheduled to be introduced until 2021. https://en.wikichip.org/wiki/x86/amx
Yes, definitely not urgent, but probably interesting to look into, given that the code would already validate on the emulator.
Using an AWS cloud instance (C5.24x.large is Cascade Lake, spot price < $1.5/h) and gcc-10, the node count is correct, and nps is quite nice.
testing code:
#!/bin/bash
# Build one profile-guided binary per target architecture.
make clean && make -j ARCH=x86-64-vnni COMP=gcc COMPILER=g++-10 profile-build && mv ./stockfish ./stockfish.vnni
make clean && make -j ARCH=x86-64-avx512 COMP=gcc COMPILER=g++-10 profile-build && mv ./stockfish ./stockfish.avx512
make clean && make -j ARCH=x86-64-avx2 COMP=gcc COMPILER=g++-10 profile-build && mv ./stockfish ./stockfish.avx2
make clean && make -j ARCH=x86-64-modern COMP=gcc COMPILER=g++-10 profile-build && mv ./stockfish ./stockfish.modern
# Print the results as a markdown table.
printf "| target | classical | NNUE | ratio |\n"
printf "| :-- | --: | --: | --: |\n"
for exe in vnni avx512 avx2 modern
do
    # Run the same single-threaded bench once per evaluation mode and keep the nps figure.
    ./stockfish.$exe bench 1024 1 20 default depth classical >& out.classical.$exe
    classical=$(grep 'Nodes/' out.classical.$exe | awk '{print $NF}')
    ./stockfish.$exe bench 1024 1 20 default depth NNUE >& out.NNUE.$exe
    NNUE=$(grep 'Nodes/' out.NNUE.$exe | awk '{print $NF}')
    # NNUE speed as a percentage of classical speed.
    ratio=$(echo $classical $NNUE | awk '{printf("%5.2f",$2/$1 * 100)}')
    printf "| %s | %s | %s | %s |\n" $exe $classical $NNUE $ratio
done
That's awesome. Thanks for taking the time to do all this!
Is there a VNNI compile I could try?
I'm wondering, since we know this can only run on Cascade Lake, if we should add
I don't think so; at least so far we have not been doing so.
Adds support for Vector Neural Network Instructions (avx512), as available on Intel Cascade Lake.

The _mm512_dpbusd_epi32() intrinsic (vpdpbusd instruction) is tailor-made for NNUE.

On a Cascade Lake CPU (AWS C5.24x.large, gcc 10), NNUE eval is at roughly 78% nps of classical (single core test).

bench 1024 1 24 default depth:

| target | classical | NNUE | ratio |
| :-- | --: | --: | --: |
| vnni | 2207232 | 1725987 | 78.20 |
| avx512 | 2216789 | 1671734 | 75.41 |
| avx2 | 2194006 | 1611263 | 73.44 |
| modern | 2185001 | 1352469 | 61.90 |

closes official-stockfish/Stockfish#2987

No functional change
Starting with the Cascade Lake architecture, Intel added VNNI (Vector Neural Network Instructions).
The _mm512_dpbusd_epi32() intrinsic (vpdpbusd instruction) is tailor-made for NNUE.
https://en.wikichip.org/wiki/x86/avx512_vnni
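For readers without VNNI hardware, here is a minimal scalar sketch of what the vpdpbusd instruction computes, which may help show why it suits NNUE's 8-bit dot products. It is only a model of the instruction's arithmetic, not the code in this PR; the names dpbusd_lane and dpbusd_512_model are hypothetical and chosen for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of one 32-bit lane of vpdpbusd (non-saturating form):
// result = acc + a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3],
// where a holds UNSIGNED bytes and b holds SIGNED bytes.
// dpbusd_lane is a hypothetical helper name, not a real intrinsic.
inline std::int32_t dpbusd_lane(std::int32_t acc,
                                const std::uint8_t* a,
                                const std::int8_t* b) {
    std::int32_t sum = 0;
    for (int k = 0; k < 4; ++k)
        sum += static_cast<std::int32_t>(a[k]) * static_cast<std::int32_t>(b[k]);
    // Add in unsigned arithmetic so that overflow wraps around instead of
    // being undefined behaviour in C++.
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(acc) +
                                     static_cast<std::uint32_t>(sum));
}

using BytesU = std::array<std::uint8_t, 64>;  // 512 bits of unsigned bytes
using BytesS = std::array<std::int8_t, 64>;   // 512 bits of signed bytes
using Lanes  = std::array<std::int32_t, 16>;  // 512 bits of 32-bit accumulators

// Scalar model of the full 512-bit operation: 16 independent lanes,
// each consuming 4 adjacent byte pairs.
inline Lanes dpbusd_512_model(Lanes acc, const BytesU& a, const BytesS& b) {
    for (std::size_t lane = 0; lane < acc.size(); ++lane)
        acc[lane] = dpbusd_lane(acc[lane], &a[4 * lane], &b[4 * lane]);
    return acc;
}
```

On hardware with AVX512-VNNI, one instruction does the work of the 64 multiplies and adds modelled here. Without VNNI, the same result typically takes a short sequence of multiply-add and widening-add instructions.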
I can compile it but cannot run it without new hardware. Could someone with Cascade Lake or newer hardware please verify that this gives the correct bench, and report the speed difference compared to plain avx512? To compile, gcc 10.x is preferred.
make build ARCH=x86-64-vnni
No functional change
bench: 4244812