Support for VNNI(Vector Neural Network Instructions 512) - Request for testing #2987

Closed
wants to merge 1 commit into from

mstembera
Contributor

@mstembera mstembera commented Aug 12, 2020

Starting with the Cascade Lake architecture, Intel added VNNI (Vector Neural Network Instructions).
The _mm512_dpbusd_epi32() intrinsic (vpdpbusd instruction) is tailor-made for NNUE.
https://en.wikichip.org/wiki/x86/avx512_vnni

I can compile but not run this without new hardware. Could someone with Cascade Lake or newer hardware please verify that this gives the correct bench, and report the speed difference compared to plain avx512? To compile, gcc 10.x is preferred.
make build ARCH=x86-64-vnni

No functional change
bench: 4244812

@vondele
Member

vondele commented Aug 12, 2020

Probably best measured with the new bench commands

make clean && make -j ARCH=x86-64-avx512 profile-build
./stockfish bench 16 1 18 default depth classical
./stockfish bench 16 1 18 default depth NNUE
make clean && make -j ARCH=x86-64-vnni profile-build
./stockfish bench 16 1 18 default depth classical
./stockfish bench 16 1 18 default depth NNUE

so we see the impact of NNUE relative to classical on VNNI.

@mstembera
Contributor Author

In the meantime I was able to verify the bench is correct using the Intel Software Development Emulator
https://software.intel.com/content/www/us/en/develop/articles/intel-software-development-emulator.html

@vondele
Member

vondele commented Aug 12, 2020

Emulation support for the Intel® Advanced Matrix Extensions (Intel® AMX) :-)

Edit... at a given point NNUE could be faster than classical

@mstembera
Contributor Author

I agree it could. It's just that the first implementation is supposed to be on Sapphire Rapids, which isn't scheduled to be introduced until 2021. https://en.wikichip.org/wiki/x86/amx

@vondele
Member

vondele commented Aug 12, 2020

yes, definitely not urgent, but probably interesting to look into it, given the code would already validate on the emulator.
The fact that NNUE was so well aligned with CPU development was one of my reasons to go for it.

@vondele
Member

vondele commented Aug 12, 2020

Using an AWS cloud instance (c5.24xlarge is Cascade Lake, spot price < $1.5/h) and gcc-10, the node count is correct and the nps is quite nice.

bench 1024 1 20 default depth:

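| target | classical (nps) | NNUE (nps) | ratio (NNUE/classical, %) |
| :-- | --: | --: | --: |
| vnni | 2219618 | 1724043 | 77.67 |
| avx512 | 2217302 | 1667814 | 75.22 |
| avx2 | 2196324 | 1598652 | 72.79 |
| modern | 2197197 | 1340897 | 61.03 |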

bench 1024 1 24 default depth:

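| target | classical (nps) | NNUE (nps) | ratio (NNUE/classical, %) |
| :-- | --: | --: | --: |
| vnni | 2207232 | 1725987 | 78.20 |
| avx512 | 2216789 | 1671734 | 75.41 |
| avx2 | 2194006 | 1611263 | 73.44 |
| modern | 2185001 | 1352469 | 61.90 |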

testing code:

#!/bin/bash

# Build one profile-guided binary per target and keep each under its own name.
make clean && make -j ARCH=x86-64-vnni COMP=gcc COMPILER=g++-10 profile-build && mv ./stockfish ./stockfish.vnni
make clean && make -j ARCH=x86-64-avx512 COMP=gcc COMPILER=g++-10 profile-build && mv ./stockfish ./stockfish.avx512
make clean && make -j ARCH=x86-64-avx2 COMP=gcc COMPILER=g++-10 profile-build && mv ./stockfish ./stockfish.avx2
make clean && make -j ARCH=x86-64-modern COMP=gcc COMPILER=g++-10 profile-build && mv ./stockfish ./stockfish.modern

# Print the results as a markdown table.
printf "| target | classical | NNUE | ratio |\n"
printf "| :-- | --: | --: | --: |\n"

for exe in vnni avx512 avx2 modern
do
 # Run the bench with classical evaluation and extract the nodes/second figure.
 ./stockfish.$exe bench 1024 1 20 default depth classical > out.classical.$exe 2>&1
 classical=$(grep 'Nodes/' out.classical.$exe | awk '{print $NF}')

 # Same bench with NNUE evaluation.
 ./stockfish.$exe bench 1024 1 20 default depth NNUE > out.NNUE.$exe 2>&1
 NNUE=$(grep 'Nodes/' out.NNUE.$exe | awk '{print $NF}')

 # ratio = NNUE nps as a percentage of classical nps
 ratio=$(echo "$classical $NNUE" | awk '{printf("%5.2f",$2/$1 * 100)}')
 printf "| %s | %s | %s | %s |\n" "$exe" "$classical" "$NNUE" "$ratio"
done

@mstembera
Contributor Author

That's awesome. Thanks for taking the time to do all this!

@Ipmanchess

Ipmanchess commented Aug 12, 2020

Is there a VNNI compile I could try?

@vondele
Member

vondele commented Aug 12, 2020

and since I was at it, the following is a profile as obtained with perf:
[image: perf profile]
so we spend about 30% of the time in NNUE evaluation, 7% in classical evaluation, and the rest elsewhere.

Edit: hmm, the timings account for <60% of the total, so maybe something is wrong; take them with a grain of salt.

@mstembera
Contributor Author

I'm wondering whether, since we know this can only run on Cascade Lake, we should add
-march=cascadelake -mtune=cascadelake to the CXXFLAGS?

@vondele
Member

vondele commented Aug 13, 2020

I don't think so; at least so far we have not been doing so. -march=cascadelake is basically an alias for a whole bunch of flags. Probably there will one day be a CPU that supports vnni but not some other feature of Cascade Lake. I think the code is ready for commit. Independently, one should now do the PR to the fishtest repo to select the vnni target if possible (see worker/games.py).

@vondele vondele added the to be merged Will be merged shortly label Aug 13, 2020
@vondele vondele closed this in dd63b98 Aug 13, 2020
joergoster pushed a commit to joergoster/Stockfish-old that referenced this pull request Aug 13, 2020
Adds support for Vector Neural Network Instructions (avx512), as available on Intel Cascade Lake

The _mm512_dpbusd_epi32() intrinsic (vpdpbusd instruction) is tailor-made for NNUE.

On a Cascade Lake CPU (AWS c5.24xlarge, gcc 10), NNUE eval is at roughly 78% of the nps of classical
(single core test)

bench 1024 1 24 default depth:
target 	classical 	NNUE 	ratio
vnni 	2207232 	1725987 	78.20
avx512 	2216789 	1671734 	75.41
avx2 	2194006 	1611263 	73.44
modern 	2185001 	1352469 	61.90

closes official-stockfish/Stockfish#2987

No functional change
lucabrivio pushed a commit to lucabrivio/Stockfish that referenced this pull request Aug 13, 2020
noobpwnftw pushed a commit to noobpwnftw/Stockfish that referenced this pull request Aug 15, 2020
Dantist added a commit to Dantist/Stockfish that referenced this pull request Dec 22, 2020