Usage of sdot/udot in ARM #4193

Closed
Developer-Ecosystem-Engineering opened this issue Oct 12, 2022 · 7 comments

@Developer-Ecosystem-Engineering

Modern ARM processors, including Apple silicon, support the sdot/udot instructions, which could speed up many of your NNUE operations. Overriding Simd::neon_m128_add_dpbusd_epi32x2 could also help.

More information about these instructions and their use is available here.
We think you’d see about a 10% improvement leveraging these capabilities in Stockfish.
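
For readers unfamiliar with the instruction, here is a minimal sketch of what one 128-bit sdot computes, written as a portable scalar model. The names `I8x16`, `I32x4`, `sdot_reference` and `sdot_neon` are made up for illustration and are not part of Stockfish. The guarded part uses the ACLE intrinsic `vdotq_s32` and only compiles when the dot product extension is enabled.

```cpp
#include <array>
#include <cstdint>

// Illustrative types (not from Stockfish): one 128-bit vector viewed as
// sixteen int8 values or as four int32 lanes.
using I8x16 = std::array<std::int8_t, 16>;
using I32x4 = std::array<std::int32_t, 4>;

// Scalar model of SDOT: each int32 lane accumulates the dot product of the
// four adjacent int8 pairs that belong to that lane.
I32x4 sdot_reference(I32x4 acc, const I8x16& a, const I8x16& b) {
    for (int lane = 0; lane < 4; ++lane)
        for (int k = 0; k < 4; ++k)
            acc[lane] += std::int32_t(a[4 * lane + k]) * std::int32_t(b[4 * lane + k]);
    return acc;
}

#if defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>
// The same operation through the ACLE intrinsic; only available when the
// compiler has been told to enable the dot product extension.
inline int32x4_t sdot_neon(int32x4_t acc, int8x16_t a, int8x16_t b) {
    return vdotq_s32(acc, a, b);
}
#endif
```

udot is the unsigned counterpart (`vdotq_u32` on `uint8x16_t` inputs). Without these instructions, the same result typically needs a widening multiply followed by pairwise additions.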

@Sopel97
Member

Sopel97 commented Oct 12, 2022

Thanks, this does appear to be useful. After some searching I found the relevant intrinsics, but I'm unable to get them to work: https://godbolt.org/z/j8faff6xM. I'm also not sure whether we have any developers with both the required expertise and access to the needed hardware, so a PR would be most welcome.

@Developer-Ecosystem-Engineering
Author

Hi @Sopel97,

The compiler needs to be told to use dotprod. Here is one such example in another project.
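
As a rough illustration of "telling the compiler": with GCC or Clang targeting AArch64, an option such as `-march=armv8.2-a+dotprod` enables the extension, and the compiler then defines the ACLE macro `__ARM_FEATURE_DOTPROD`. The sketch below only checks that macro; `kHasDotProd` and `dotprod_status` are hypothetical names chosen for this example.

```cpp
#include <string>

// True only if this translation unit was compiled with the dot product
// extension enabled (for example -march=armv8.2-a+dotprod on GCC/Clang).
constexpr bool kHasDotProd =
#if defined(__ARM_FEATURE_DOTPROD)
    true;
#else
    false;
#endif

// Human-readable summary, e.g. for a build-info banner.
std::string dotprod_status() {
    return kHasDotProd ? "dotprod enabled" : "dotprod not enabled";
}
```

Code that calls the dot product intrinsics can be wrapped in the same `#if defined(__ARM_FEATURE_DOTPROD)` guard so that builds without the flag fall back to the existing NEON path.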

@Sopel97
Member

Sopel97 commented Oct 12, 2022

I see, I didn't know about the +dotprod extension to -march. Working example: https://godbolt.org/z/sfKnnE5E3

@vondele
Member

vondele commented Oct 16, 2022

Maybe @domschl, who added the original NEON code (c402fe7), can help with this?

@Chess321

Chess321 commented Dec 8, 2022

@w1wwwwww
Contributor

w1wwwwww commented Jan 4, 2023

Are dot products even used in Stockfish? (if so, where?)

@vondele
Member

vondele commented Jan 4, 2023

See the link provided in the initial comment. Basically, NNUE inference is a matrix-vector product, which can be computed using dot products.
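
To make the connection concrete, here is a simplified sketch of an int8 matrix-vector product split into 4-element dot products. It is not Stockfish's actual layer code: the function name `affine_int8`, the row-major weight layout, and the assumption that the input size is a multiple of 4 are all chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical dense layer: out[r] = bias[r] + sum over c of W[r][c] * in[c].
// Weights are stored row-major; in.size() is assumed to be a multiple of 4.
std::vector<std::int32_t> affine_int8(const std::vector<std::int8_t>& weights,
                                      const std::vector<std::int32_t>& bias,
                                      const std::vector<std::int8_t>& in) {
    const std::size_t rows = bias.size();
    const std::size_t cols = in.size();
    std::vector<std::int32_t> out(bias);  // start from the biases
    for (std::size_t r = 0; r < rows; ++r) {
        const std::int8_t* w = &weights[r * cols];
        // Each step of this loop is one 4-element dot product accumulated
        // into an int32, which is the work a single SDOT lane performs.
        for (std::size_t c = 0; c < cols; c += 4) {
            std::int32_t partial = 0;
            for (std::size_t k = 0; k < 4; ++k)
                partial += std::int32_t(w[c + k]) * std::int32_t(in[c + k]);
            out[r] += partial;
        }
    }
    return out;
}
```

With the dot product extension, one instruction handles four such 4-element groups at once, so the inner two loops collapse into far fewer instructions.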

UniQP added a commit to UniQP/Stockfish that referenced this issue Feb 20, 2023
UniQP added a commit to UniQP/Stockfish that referenced this issue Feb 21, 2023
The sdot instruction computes (and accumulates) a signed dot product,
which is quite handy for Stockfish's NNUE code. The instruction is
optional for Armv8.2 and Armv8.3, and mandatory for Armv8.4 and above.

The commit adds a new 'arm-dotprod' architecture with enabled dot
product support. It also enables dot product support for the existing
'apple-silicon' architecture, which is at least Armv8.5.

The following local speed test was performed on an Apple M1 with
ARCH=apple-silicon. I had to remove CPU pinning from the benchmark
script. However, the results were still consistent: checking both
binaries against themselves reported a speedup of +0.0000 and +0.0005,
respectively.

Result of 100 runs
==================
base (...ish.037ef3e1) =    1917997  +/- 7152
test (...fish.dotprod) =    2159682  +/- 9066
diff                   =    +241684  +/- 2923

speedup        = +0.1260
P(speedup > 0) =  1.0000

CPU: 10 x arm
Hyperthreading: off

Fixes official-stockfish#4193

No functional change