Support for VNNI (Vector Neural Network Instructions 512) - Request for testing #2987

mstembera · 2020-08-12T06:23:56Z

Starting with the Cascade Lake architecture, Intel added VNNI (Vector Neural Network Instructions).
The _mm512_dpbusd_epi32() intrinsic (vpdpbusd instruction) is tailor-made for NNUE.
https://en.wikichip.org/wiki/x86/avx512_vnni
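To make concrete what the instruction computes, here is a minimal scalar sketch of the vpdpbusd semantics. It is a reference model for illustration, not the code from this PR. The function names dpbusd_lane and dpbusd_512 are hypothetical. Each 32-bit lane takes four unsigned bytes from one operand and four signed bytes from the other, multiplies them pairwise, and adds the four products to the lane's accumulator without saturation.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of one 32-bit lane of vpdpbusd (hypothetical helper name).
// a points to 4 unsigned bytes, b to 4 signed bytes.
// The sum wraps on overflow, so it is accumulated in unsigned arithmetic.
inline std::int32_t dpbusd_lane(std::int32_t acc,
                                const std::uint8_t* a,
                                const std::int8_t* b) {
    std::uint32_t sum = static_cast<std::uint32_t>(acc);
    for (int i = 0; i < 4; ++i) {
        const std::int32_t product =
            static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
        sum += static_cast<std::uint32_t>(product);
    }
    return static_cast<std::int32_t>(sum);
}

// Scalar model of the full 512-bit operation: 16 lanes, 64 bytes per operand.
inline void dpbusd_512(std::array<std::int32_t, 16>& acc,
                       const std::array<std::uint8_t, 64>& a,
                       const std::array<std::int8_t, 64>& b) {
    for (std::size_t lane = 0; lane < 16; ++lane)
        acc[lane] = dpbusd_lane(acc[lane], &a[4 * lane], &b[4 * lane]);
}
```

Without VNNI, the same accumulation is typically written as a multiply-add to 16 bits, a widening add to 32 bits, and a final add, so one instruction stands in for a short sequence. That is presumably where the gain for the NNUE inner loops would come from.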

I can compile but not run w/o new hardware. Could someone with Cascade Lake or newer hardware please verify this gives the correct bench and report the speed difference compared to plain avx512? To compile, gcc 10.x is preferred:
make build ARCH=x86-64-vnni

No functional change
bench: 4244812

vondele · 2020-08-12T07:24:23Z

Probably best measured with the new bench commands:

make clean && make -j ARCH=x86-64-avx512 profile-build
./stockfish bench 16 1 18 default depth classical
./stockfish bench 16 1 18 default depth NNUE
make clean && make -j ARCH=x86-64-vnni profile-build
./stockfish bench 16 1 18 default depth classical
./stockfish bench 16 1 18 default depth NNUE

so we see the impact of NNUE relative to classical on VNNI.

mstembera · 2020-08-12T08:10:33Z

In the meantime I was able to verify the bench is correct using the Intel Software Development Emulator
https://software.intel.com/content/www/us/en/develop/articles/intel-software-development-emulator.html

vondele · 2020-08-12T08:12:05Z

Emulation support for the Intel® Advanced Matrix Extensions (Intel® AMX) :-)

Edit... at a given point NNUE could be faster than classical

vondele · 2020-08-12T08:40:54Z

I think it could really be a match as well https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-matrix-extensions-intel-amx-instructions/intrinsics-for-intel-advanced-matrix-extensions-amx-int8-instructions/tile-dpbssd.html

mstembera · 2020-08-12T08:51:06Z

I agree it could. It's just the first implementation is supposed to be on Sapphire Rapids which isn't scheduled to be introduced till 2021. https://en.wikichip.org/wiki/x86/amx

vondele · 2020-08-12T08:56:28Z

yes, definitely not urgent, but probably interesting to look into it, given the code would already validate on the emulator.
The fact that NNUE was so well aligned with CPU development was one of my reasons to go for it.

vondele · 2020-08-12T12:15:42Z

Using an AWS cloud instance (c5.24xlarge is Cascade Lake, spot price < 1.5$/h) and gcc-10, the node count is correct, and nps is quite nice.

bench 1024 1 20 default depth:

target	classical (nps)	NNUE (nps)	ratio (%)
vnni	2219618	1724043	77.67
avx512	2217302	1667814	75.22
avx2	2196324	1598652	72.79
modern	2197197	1340897	61.03

bench 1024 1 24 default depth:

target	classical (nps)	NNUE (nps)	ratio (%)
vnni	2207232	1725987	78.20
avx512	2216789	1671734	75.41
avx2	2194006	1611263	73.44
modern	2185001	1352469	61.90

testing code:

#!/bin/bash

# Build a profile-guided binary for each target and keep it under its own name.
make clean && make -j ARCH=x86-64-vnni COMP=gcc COMPILER=g++-10 profile-build && mv ./stockfish ./stockfish.vnni
make clean && make -j ARCH=x86-64-avx512 COMP=gcc COMPILER=g++-10 profile-build && mv ./stockfish ./stockfish.avx512
make clean && make -j ARCH=x86-64-avx2 COMP=gcc COMPILER=g++-10 profile-build && mv ./stockfish ./stockfish.avx2
make clean && make -j ARCH=x86-64-modern COMP=gcc COMPILER=g++-10 profile-build && mv ./stockfish ./stockfish.modern

# Print the header of a markdown table.
printf "| target | classical | NNUE | ratio |\n"
printf "| :-- | --: | --: | --: |\n"

for exe in vnni avx512 avx2 modern
do
 # bench arguments: 1024 MB hash, 1 thread, depth 20, default positions.
 # Capture stdout and stderr together, then take the last field of the Nodes/second line.
 ./stockfish.$exe bench 1024 1 20 default depth classical > out.classical.$exe 2>&1
 classical=$(grep 'Nodes/' out.classical.$exe | awk '{print $NF}')

 ./stockfish.$exe bench 1024 1 20 default depth NNUE > out.NNUE.$exe 2>&1
 NNUE=$(grep 'Nodes/' out.NNUE.$exe | awk '{print $NF}')

 # NNUE nps as a percentage of classical nps.
 ratio=$(echo $classical $NNUE | awk '{printf("%5.2f",$2/$1 * 100)}')
 printf "| %s | %s | %s | %s |\n" $exe $classical $NNUE $ratio
done
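For readers who want to check the numbers, the ratio column is the NNUE nps divided by the classical nps, times 100. The sketch below restates that arithmetic and adds a helper for comparing two targets on the same evaluation. Both function names are hypothetical and the code is not part of the PR.

```cpp
// NNUE speed as a percentage of classical speed, as in the "ratio" column.
inline double nnue_ratio_percent(double classical_nps, double nnue_nps) {
    return nnue_nps / classical_nps * 100.0;
}

// Relative gain, in percent, of a candidate target over a baseline target
// when both run the same evaluation.
inline double speedup_percent(double baseline_nps, double candidate_nps) {
    return (candidate_nps / baseline_nps - 1.0) * 100.0;
}
```

With the depth 20 figures above, nnue_ratio_percent(2219618, 1724043) reproduces the 77.67 reported for vnni, and speedup_percent(1667814, 1724043) puts the NNUE gain of vnni over avx512 at roughly 3%.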

mstembera · 2020-08-12T12:24:19Z

That's awesome. Thanks for taking the time to do all this!

Ipmanchess · 2020-08-12T12:28:03Z

Is there a VNNI compile I could try?

vondele · 2020-08-12T12:33:02Z

And since I was at this, I also obtained a profile with perf:

we spend about 30% of the time in NNUE evaluation, 7% in classical evaluation, and the rest elsewhere.

Edit: hmm, the timings account for <60% of the total, so maybe something is wrong. Take them with a grain of salt.

mstembera · 2020-08-13T04:20:51Z

I'm wondering, since we know this can only run on Cascade Lake, whether we should add
-march=cascadelake -mtune=cascadelake to the CXXFLAGS?

vondele · 2020-08-13T05:11:21Z

I don't think so, at least so far we have not been doing so. -march=cascadelake is basically an alias for a whole bunch of flags. Probably there will one day be a CPU that supports VNNI but not some other feature of Cascade Lake. I think the code is ready for commit. Independently, one should now do the PR to the fishtest repo to select the vnni target if possible (see worker/games.py).
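As an illustration of why per-feature flags are enough, code can key on the feature macros the compiler predefines rather than on a CPU model. The sketch below assumes GCC or Clang, which define __AVX512VNNI__, __AVX512BW__, __AVX512F__ and __AVX2__ when the matching -m options (for example -mavx512vnni), or an -march value that implies them, are active. The function name simd_target is hypothetical, and the flags the Makefile actually passes are not shown in this thread.

```cpp
// Reports which SIMD path this translation unit was compiled for,
// based only on compiler-predefined feature macros (GCC/Clang assumed).
inline const char* simd_target() {
#if defined(__AVX512VNNI__) && defined(__AVX512BW__)
    return "avx512-vnni";   // vpdpbusd on 512-bit vectors is available
#elif defined(__AVX512F__) && defined(__AVX512BW__)
    return "avx512";
#elif defined(__AVX2__)
    return "avx2";
#else
    return "generic";
#endif
}
```

Written this way, the same source selects the VNNI path on any future CPU that has the feature, whether or not it matches the rest of the Cascade Lake feature set.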

Adds support for Vector Neural Network Instructions (avx512), as available on Intel Cascade Lake.

The _mm512_dpbusd_epi32() intrinsic (vpdpbusd instruction) is tailor-made for NNUE.

On a Cascade Lake CPU (AWS c5.24xlarge, gcc 10), NNUE eval is at roughly 78% nps of classical (single core test).

bench 1024 1 24 default depth:

target	classical	NNUE	ratio
vnni	2207232	1725987	78.20
avx512	2216789	1671734	75.41
avx2	2194006	1611263	73.44
modern	2185001	1352469	61.90

closes official-stockfish/Stockfish#2987

No functional change

Add support for VNNI (Intel Vector Neural Network Instructions 512)

fcbfffa

No functional change bench: 4244812

vondele added the "to be merged" label Aug 13, 2020

vondele closed this in dd63b98 Aug 13, 2020

mstembera mentioned this pull request Aug 21, 2020

Support VNNI on 256bit vectors #3038

Closed
