Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed shift fancy magic bitboards #3429

Conversation

GediminasMasaitis
Copy link

@GediminasMasaitis GediminasMasaitis commented Apr 14, 2021

Theory

Fixed shift fancy magic bitboards are a variant of magic bitboards where instead of calculating a variable shift based on possible occupancy permutations of a given slide, a compile-constant shift is used. This the compiler can use less instructions to perform the shift by a constant, reducing indexing time to retrieve the attack bitboard.

Additionally it's possible to permute the pieces in the attack table in such a way that the end of square X attaks matches square Y attacks, given this it's possible to overlap entries, reducing the table size, in theory improving cache locality. Not all magics are equal, different magics will result in different permutations of attacks of a certain square, and by continuing to test other magics it's possible to reduce the table size even more. It's unreasonable to compute this kind of a compressed table on engine startup, finding magics and their permutations which result in a small table takes hours and hours. Because of this, magics are pre-computed and hardcoded.

Implementation

  • Using magics and their offsets originally found by Volker Annuss and posted on http://www.talkchess.com/forum3/viewtopic.php?p=670709#p670709, the table size ends up being ~699 kB instead of 841kB.
  • RookTable and BishopTable have been merged into one SlideAttackTable to allow overlapping.
  • Magic class has been templated to allow constant shift, index() has been updated.
  • Since the 32-bit implementations uses a different hashing mechanism, it's impossible to reuse the same magics for it and changing the hashing to be equivalent to the 64-bit version would have hurt performance more than the gain from fixed shift. This is the only reason why magics generation still remains, otherwise it could have been removed completely.

Tests

I ran some tests on my machine on 64 bit with PEXT disabled. Test results are "hot" - meaning first few test results are dicsarded, this helps stabilize the results and decrease deviation. bench 16 1 5 default perft:

  Fancy magics Fixed shift fancy
  9938 ms 9872 ms
  9946 ms 9853 ms
  9914 ms 9852 ms
  9924 ms 9886 ms
  9911 ms 9874 ms
  9929 ms 9849 ms
  9930 ms 9868 ms
  9910 ms 9863 ms
  9913 ms 9845 ms
  9914 ms 9858 ms
  9934 ms 9874 ms
  9917 ms 9857 ms
  9932 ms 9864 ms
  9933 ms 9868 ms
  9912 ms 9882 ms
  9926 ms 9885 ms
  9927 ms 9846 ms
  9910 ms 9863 ms
  9941 ms 9878 ms
  9928 ms 9851 ms
Avg: 9924 ms 9864 ms
Std. dev 11.18 ms 12.91 ms
Improvements are quite small but consistent, average of 60 ms or 0.61%. PEXT and 32 bit performance seems unaffected.

Tests ran with PEXT disabled on test and base branches in order to avoid testing essentially master vs. master

STC:
LLR: 2.95 (-2.94,2.94) {-0.20,1.10}
Total: 26280 W: 2370 L: 2208 D: 21702
Ptnml(0-2): 76, 1736, 9365, 1876, 87
https://tests.stockfishchess.org/tests/view/6079579e162adf76afa5b7ac

LTC:
LLR: 2.94 (-2.94,2.94) {0.20,0.90}
Total: 104904 W: 4056 L: 3791 D: 97057
Ptnml(0-2): 44, 3234, 45642, 3477, 55
https://tests.stockfishchess.org/tests/view/60797216162adf76afa5b7b2

src/bitboard.cpp Outdated
// and if using 64 bit then existing magics are pre-computed.
if constexpr (HasPext || Is64Bit) {
unsigned index = m.index(occupied);
m.attacks[index] = reference[size];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a good opportunity to assert that attacks[index] was empty or the same, to verify the validity of the known magics.

@snicolet
Copy link
Member

Can we test the patch on fishtest, to verify if the added complexity gains Elo? Thanks :-)

@GediminasMasaitis
Copy link
Author

GediminasMasaitis commented Apr 14, 2021

Can we test the patch on fishtest, to verify if the added complexity gains Elo? Thanks :-)

Sure, there is currently a STC test (paused currently): https://tests.stockfishchess.org/html/live_elo.html?60763a7b8141753378960666
I was advised on Discord to just create a PR, so I did without waiting for the test, sorry if this was incorrect, I will update the PR with results once it's finished.

In case STC results come out good, should I do LTC as well?

@Vizvezdenec
Copy link
Contributor

I think the problem is that it requires 64 bit machines with PEXT off, which is not that common case at fishtest I think @snicolet
Usually in this case we do bench test on machines that are sped up by this instead...

@GediminasMasaitis
Copy link
Author

GediminasMasaitis commented Apr 14, 2021

@Vizvezdenec I think you'd be surprised, PEXT is horribly slow on AMD Ryzen cpus, on my machine perft takes 70% (!!!) longer with PEXT-enabled builds, even though it "fully supports" BMI2. With rise of AMD cpus I wouldn't be surprised if it's the majority of CPUs that would benefit from this. I'm not sure about the distribution of build parameters on fishtest though.

@GediminasMasaitis
Copy link
Author

@Vizvezdenec I think a possible testing path for this could be to create a branch from master, remove pext so it always uses old fancy magics, and then use that as the base branch - this way all builds would test the new vs. old bitboards.

@Vizvezdenec
Copy link
Contributor

The thing is that majority of fishtest is from @noobpwnftw and I don't quite remember what CPUs he has, if they are supporting PEXT then like 80% of machines wouldn't get anything from this patch.

@GediminasMasaitis
Copy link
Author

GediminasMasaitis commented Apr 15, 2021

Finished running fishtest, STC reported 3.61 elo, LTC 0.23 elo.

I expected LTC to do worse, it applies to any performance upgrades that don't affect the search path. Increased search depth gives diminishing returns - it's much more beneficial for example to sometimes search depth 15 instead of 14 than it is to sometime search depth 25 instead of 24.

@snicolet
Copy link
Member

Hmmm, so this is a huge patch (233 additions and 54 deletions) which did not pass LTC...
I don't think that we should commit this, despite all the love and work behind the idea

@GediminasMasaitis
Copy link
Author

GediminasMasaitis commented Apr 15, 2021

Hm, originally I wasn't even going to run fishtest... The thing is, most patches change the behavior of search, this doesn't - I can just run bench and show that this is measurably faster. If we agree in general that higher nps is good and horizon effect is irrelevant over a large number of searches then there's no chance of regression and elo must be >0 on any tc. I took a look at the rest of movegen - it seems pretty optimal in terms of code efficiency, well at least I couldn't spot anything else to optimize. If STC / LTC is a hard requirement, it would be very hard to make patches of similar nature.

Regarding lines of code, I admit this is quite the hippo. Almost all of that comes from known magics and comments. Using known magics is unavoidable, creating a compact attack table is NP-complete: https://backscattering.de/chess/hashtable-packing/. Think training NNUE on startup. A lot of effort (way more than I put into this patch) has been put into finding optimal magics by Volker Annuss, Grant Osborne and probably some others from as far back as 2010 and their results are tested and being used by a lot of other engines. I was quite surprised when I looked at Stockfish and found it's not using neither white nor black magics.

I'm probably biased towards this since I already put time into this, but adding fixed shift seemed like a no-brainer from the start.

@ddobbelaere
Copy link
Contributor

ddobbelaere commented Apr 15, 2021

Fascinating stuff and really nice research and implementation. However, I personally tend to agree with @snicolet's doubts if it's really worth the extra code complexity for arguably little gain, as some architectures are faster with PEXT on.

I wouldn't put too much trust in the 3.64 elo estimate of https://tests.stockfishchess.org/html/live_elo.html?6077088a81417533789606b6 TBH, because of high error bounds and biased estimator for passed patches, plus the fact that you "only" measure 0.61% speedup. A true fishtest elo test at STC (no SPRT) would be better to estimate elo. But even then, the question is if forcing the entire fishtest fleet to PEXT off yields meaningful results (we are not interested in archs where PEXT on is faster anyway).

@mstembera
Copy link
Contributor

For reference here #1538 is an old PR that completely removed magic generation inside SF. Historically SF only has magic generation to "show how it's done". IMO since magic generation isn't needed in an engine and in fact can be done much better offline it doesn't belong.

@Sopel97
Copy link
Member

Sopel97 commented Apr 15, 2021

I don't know why some think this is too complicated. It's as simple as magics get. Most of the code is constants. This PR is great IMO.

@Vizvezdenec
Copy link
Contributor

Yeah I also don't even think that we are a chess engine wikipedia of some sort.
If one wants to show how it's done just write an article at chessprogramming wiki.
Other than this I don't see a lot of need in this code being "on fly generation" while being a slowdown.

@GediminasMasaitis
Copy link
Author

GediminasMasaitis commented Apr 16, 2021

Just to clarify, on-the-fly generation is not removed, it's still used for the 32-bit implementation since the known magics are incompatible with the 32-bit hashing. I could have removed the 32-bit exclusive hashing but that could allow regression on 32-bit if that ends up slower. So I thought - better limit the scope and not touch that for this patch.

So I guess there's still a "wiki" for now.

@vdbergh
Copy link
Contributor

vdbergh commented Apr 16, 2021

I don't think 32 and 64 bit should have different magics. If that would imply a small regression for 32 bit then so be it.

@GediminasMasaitis
Copy link
Author

Added conditional compilation to Magic class to reduce size from 32 to 24 bytes and managed to pass LTC. Whew!

STC: 1.93 elo
LTC: 0.76 elo

Seems quite reasonable.

@GediminasMasaitis
Copy link
Author

GediminasMasaitis commented Apr 16, 2021

About removing 32 bit on-the-fly magic generation completely:

  • Some feel sf should "show how to do it", some think it shouldn't, I personally don't really care as long as it doesn't sacrifice performance on 64 bit for that cause. I think comments are good though, and if it's removed then it should probably be commented where these numbers come from. Maybe a link to a wiki or talkchess post or something, I think we can find something for sure.
  • It could be regressive, or it could not be. Do we care if it's regressive? Personally not that much, but sure I can see if some do.
  • I think the idea of patches is to keep them to a minimum, I interpret that as least functional changes, so that each change could be tested individually. I mean hey, sure, I'm up to test and do it. Probably not in this patch though unless folks really want me to.

@GediminasMasaitis
Copy link
Author

With #3435 being proposed I though I'd test with a 32-bit build to see what would happen if on-the-fly magics generation would be removed completely and the 64-bit hash was used on 32-bit. I decided to used bench instead of perft test because that better reflects the real usage of magics in a game.


  32-bit specialized hash fancy magics Fixed shift fancy non-specialized
  7756 7814
  7765 7823
  7749 7815
  7750 7809
  7757 7825
  7763 7815
  7782 7818
  7756 7833
  7768 7809
  7752 7805
  7763 7806
  7751 7808
  7754 7820
  7757 7816
  7751 7810
  7749 7821
  7763 7818
  7754 7816
  7740 7823
  7755 7836
Avg: 7757 ms 7817 ms
Std. dev 8.90 ms 8.34 ms

Average: -60ms, or -0.78%.


The difference is similar to adding fixed shift to 64-bit but in the opposite direction, so probably -2 elo STC, -1 elo LTC or thereabouts, the benefits to 64-bit would be unchanged.

This would get rid of a special case for 32 bit, get rid of ~40 lines of code, and remove magics generation on-the-fly completely.

What do you think, is this worth doing?

@BM123499
Copy link
Contributor

@GediminasMasaitis Is it possible to use your code on fishtest? if yes, could you test it with fixed number of games, so we may see the extension of this slowdown?

If it's as you say, this could be the patch that removes 32bit optimization.

@GediminasMasaitis
Copy link
Author

GediminasMasaitis commented Apr 17, 2021

@GediminasMasaitis Is it possible to use your code on fishtest? if yes, could you test it with fixed number of games, so we may see the extension of this slowdown?

If it's as you say, this could be the patch that removes 32bit optimization.

Unfortunately no, I think there's no way to control build architecture on fishtest. If I forced 64-bit hash vs 32-bit hash code then it would still build with 64-bit and would crush the 32-bit version because it's not really built on 32-bit...

@BM123499
Copy link
Contributor

BM123499 commented Apr 17, 2021

@GediminasMasaitis Can't you just change Makefile to force 32bit build? (I'm not an expert in compilers)

@BM123499
Copy link
Contributor

BM123499 commented Apr 18, 2021

By using MIPS to produce a similar multiplication of 64 bit as done in a 32 bit architecture. Removing 32 bits optimization failed non-regression test as shown here on 32 bit arch. This test is not optimal since some compiler may improve a bit more than a 32 bit machine. In addition, I don't know much about architectures, but this test assumes that the architecture produces two 32 bit numbers (hi and lo) by multiplying two 32 bit numbers. If it doesn't, the slowdown might be more severe.

@NightlyKing
Copy link
Contributor

I just closed my issue #3435 as vondele made a good suggestion. So long 32 bit works it is fine to not waste time and optimize it as much as possible.

@vdbergh
Copy link
Contributor

vdbergh commented Apr 18, 2021

You are attacking a straw man (it is almost comical). Nobody has been optimizing for 32 bit for a long time. There is no time being wasted on 32 bit.

@NightlyKing
Copy link
Contributor

NightlyKing commented Apr 18, 2021

You are attacking a straw man (it is almost comical). Nobody has been optimizing for 32 bit for a long time. There is no time being wasted on 32 bit.

I'm referring to this:

The difference is similar to adding fixed shift to 64-bit but in the opposite direction, so probably -2 elo STC, -1 elo LTC or thereabouts, the benefits to 64-bit would be unchanged.

This would get rid of a special case for 32 bit, get rid of ~40 lines of code, and remove magics generation on-the-fly completely.

What do you think, is this worth doing?

I would like to know how 40 new lines for a small 32 bit speed boost isn't overkill?

@GediminasMasaitis
Copy link
Author

GediminasMasaitis commented Apr 18, 2021

For what it's worth, non-regression test for removing 32-bit magics:

STC:
LLR: 2.95 (-2.94,2.94) {-1.00,0.20}
Total: 38000 W: 3516 L: 3415 D: 31069
Ptnml(0-2): 118, 2666, 13366, 2697, 153
https://tests.stockfishchess.org/tests/view/607c7088111fbbda54e115b5

I really don't think we need to waste resources by running LTC, there's few 32 bit machines on fishtest, if any at all.

I'd like to get confirmation from the maintainers, given this non-regression, the bench results for 32 bit, and the "simulation" @BM123499 did, is it ok to remove magics generation and the 32-bit hashing code?

@Torom
Copy link
Contributor

Torom commented Apr 19, 2021

This patch seems to be a slowdown on my system.
About my test: I changed pyshbench to run a bench 16 1 5 default perft. However, it greps for nodes/second, but this value should not go down either, right?

./stockfish-git compiler
Stockfish 190421 by the Stockfish developers (see AUTHORS file)

Compiled by g++ (GNUC) 10.3.0 on Linux
Compilation settings include:  64bit SSE41 SSSE3 SSE2 POPCNT
__VERSION__ macro expands to: 10.3.0
./stockfish-git.fixed_shift_magics compiler
Stockfish 200421 by the Stockfish developers (see AUTHORS file)

Compiled by g++ (GNUC) 10.3.0 on Linux
Compilation settings include:  64bit SSE41 SSSE3 SSE2 POPCNT
__VERSION__ macro expands to: 10.3.0
./pyshbench ../stockfish-git ../stockfish-git.fixed_shift_magics 20
Result of  20 runs
==================
base (...tockfish-git) =  173653373  +/- 132205
test (...shift_magics) =  170248754  +/- 139246
diff                   =   -3404619  +/- 71117

speedup        = -0.0196
P(speedup > 0) =  0.0000

CPU: 2 x Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
Hyperthreading: on

@GediminasMasaitis
Copy link
Author

GediminasMasaitis commented Apr 20, 2021

@Torom Yes, perft NPS should be higher as well. But I don't think I trust that tool... I tried to run a plain old master vs. master and it's very consistently inconsistent:

 60    2153954    2174755   +20801
 61    2173705    2147791   -25914
 62    2149841    2170562   +20721
 63    2180018    2149841   -30177
 64    2153954    2171609   +17655
 65    2178963    2148815   -30148
 66    2150868    2174755   +23887
 67    2183188    2151895   -31293
 68    2149841    2175805   +25964
 69    2176857    2151895   -24962
 70    2153954    2174755   +20801
 71    2173705    2151895   -21810
 72    2140645    2177910   +37265
 73    2175805    2154984   -20821
 74    2151895    2170562   +18667
 75    2183188    2149841   -33347
 76    2150868    2173705   +22837
 77    2174755    2149841   -24914
 78    2148815    2168472   +19657
 79    2178963    2151895   -27068
 80    2149841    2170562   +20721
 81    2177910    2147791   -30119

+20k, -20k, that pattern repeats all the way down. I'm not sure what it's doing exactly but it's very wonky... Also it's confidently telling me that master (test) is 100% confidently worse than master (base).

Well, regardless, I did the same modification to make it run bench 16 1 5 default perft, here are my results from running latest master (base) vs. this PR branch (test), but rebased to include latest master, same build options:

Stockfish 200421 by the Stockfish developers (see AUTHORS file)
compiler

Compiled by g++ (GNUC) 10.2.0 on MinGW64
Compilation settings include:  64bit SSE41 SSSE3 SSE2 POPCNT
__VERSION__ macro expands to: 10.2.0
Result of  50 runs
==================
base (...4-modern.exe) =  223250292  +/- 250652
test (...4-modern.exe) =  248422887  +/- 285232
diff                   =  +25172595  +/- 360654

speedup        = +0.1128
P(speedup > 0) =  1.0000

CPU: 12 x AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD    // It's a 3900X
Hyperthreading: on

+11.3%? Eh... Well, maybe. I'll play around with it though. I'll try to set up a linux-based VM, try to test a non-mingw build.

@GediminasMasaitis
Copy link
Author

GediminasMasaitis commented Apr 24, 2021

I got a Linux VM set up and my results on build/run with gcc are similar within margin of error compared to my results on mingw. I don't think it's OS-dependent. I don't think there's more I can test on my machine.

@Torom
Copy link
Contributor

Torom commented Apr 24, 2021

Yeah, it's probably more of a system thing than an OS thing.

Same system, windows, different testing tool:

Compiled by g++ (GNUC) 10.2.0 on MinGW64
Compilation settings include:  64bit SSE41 SSSE3 SSE2 POPCNT
__VERSION__ macro expands to: 10.2.0

Engine 1: master, Engine 2: your patch (rebased to master)

Build Tester: 1.4.7.0
Windows 10 (Version 10.0, Build 0, 64-bit Edition)
Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
CPU Warmup: Yes
Command Line: bench 16 1 5 default perft
Tests per Build: 25

                Engine# (NPS)                     Speedup     Sp     Conf. 99%    S.S.
1  (188.500.045,7 ) ---> 2  (185.556.859,4 ) --->  1,586%  1.744.628,0    Yes       No

@vondele
Copy link
Member

vondele commented Apr 25, 2021

I understand your fishtest test on removing the special case for 32bit system, i.e. removing the generation code, did pass the non-regression test, and it would make sense to make that part of this PR. In that case, we'd be using either pext or one set of magics? That would be an advantage

Indeed pext is slow on amd system that will be around for a while (but it is fixed in their zen3).

While the code is fairly large in lines, most of it is just the table and some comments. I would suggest put two constants per line, that's going to be a bit more compact. Could you also add an url to the code were you refer to Annuss' post?

So please combine with the 32 bit removal, and squash to a single commit, and I'll have a more detailed look.

@BM123499
Copy link
Contributor

BM123499 commented Apr 25, 2021

This is basically doing #1352 again, right?

@BM123499
Copy link
Contributor

BM123499 commented Apr 25, 2021

For what it's worth, non-regression test for removing 32-bit magics:

STC:
LLR: 2.95 (-2.94,2.94) {-1.00,0.20}
Total: 38000 W: 3516 L: 3415 D: 31069
Ptnml(0-2): 118, 2666, 13366, 2697, 153
https://tests.stockfishchess.org/tests/view/607c7088111fbbda54e115b5

I really don't think we need to waste resources by running LTC, there's few 32 bit machines on fishtest, if any at all.

I'd like to get confirmation from the maintainers, given this non-regression, the bench results for 32 bit, and the "simulation" @BM123499 did, is it ok to remove magics generation and the 32-bit hashing code?

I don't see how this "simulation" represents what I've done. It seems you removed optimization for 32 bit and tested on 64 bit machines with no adjustment. Please correct me if I'm wrong.
What I did was explicit simulating 64 bit multiplication of a 32 bit machine. But it seems you straight multiply 64 bit numbers, in which would be done in a way in 64 bit machine and in another way in a 32 bit machine.

I agree that it's non-regression on 64bit machine, but I don't think it shows that it's non-regression on a 32bit machine.

@BM123499
Copy link
Contributor

It seems cache management impacts more than the slowdown in 64bit multiplication in 32bit machine as shown in:

STC:
LLR: 2.94 (-2.94,2.94) <-2.50,0.50>
Total: 12240 W: 1137 L: 1015 D: 10088
Ptnml(0-2): 39, 763, 4394, 885, 39
https://tests.stockfishchess.org/tests/view/6085f40c95e7f1852abd2882

Said that, I don't believe it will cause slowdown on 32bit machines.

@GediminasMasaitis GediminasMasaitis force-pushed the fixed_shift_magics branch 2 times, most recently from 29eebfc to ba5d989 Compare April 26, 2021 13:33
@BM123499
Copy link
Contributor

BM123499 commented Apr 26, 2021

@GediminasMasaitis why don't you merge KnownMagic with Magic? it would remove some lines of code.

Edit: As an example: BM123499@81ba78b

@vondele
Copy link
Member

vondele commented May 2, 2021

So, I have tried to measure the impact of this patch carefully, and I see no speedup over master. That's compiled on AMD Ryzen 9 3950X with make -j ARCH=x86-64-avx2 profile-build

sudo perf stat -r 50 -a -B -e cycles:u,instructions:u ./stockfish.master bench 128 1 15 default depth > /dev/null

patch:

 Performance counter stats for 'system wide' (50 runs):

    24’974’981’857      cycles:u                                                      ( +-  1.69% )
    42’528’185’637      instructions:u            #    1.70  insn per cycle           ( +-  1.50% )

            5.0341 +- 0.0170 seconds time elapsed  ( +-  0.34% )

master:

 Performance counter stats for 'system wide' (50 runs):

    23’015’845’489      cycles:u                                                      ( +-  0.06% )
    39’919’793’448      instructions:u            #    1.73  insn per cycle           ( +-  0.07% )

            5.0135 +- 0.0125 seconds time elapsed  ( +-  0.25% )

who's measuring a speedup for a bench?

@Torom
Copy link
Contributor

Torom commented May 5, 2021

Intel i7-3520M make -j profile-build ARCH=x86-64-sse41-popcnt COMP=gcc

master:

 Performance counter stats for 'system wide' (50 runs):

    28.464.620.071      cycles:u                                                      ( +-  0,05% )
    41.812.512.447      instructions:u            #    1,47  insn per cycle           ( +-  0,03% )

           7,90292 +- 0,00308 seconds time elapsed  ( +-  0,04% )

patch:

 Performance counter stats for 'system wide' (50 runs):

    28.217.638.387      cycles:u                                                      ( +-  0,05% )
    41.180.747.985      instructions:u            #    1,46  insn per cycle           ( +-  0,03% )

           7,83639 +- 0,00257 seconds time elapsed  ( +-  0,03% )

COMP=clang

master:

 Performance counter stats for 'system wide' (50 runs):

    28.665.441.772      cycles:u                                                      ( +-  0,07% )
    40.159.781.307      instructions:u            #    1,40  insn per cycle           ( +-  0,03% )

           7,94518 +- 0,00467 seconds time elapsed  ( +-  0,06% )

patch:

 Performance counter stats for 'system wide' (50 runs):

    28.475.198.191      cycles:u                                                      ( +-  0,06% )
    39.697.270.432      instructions:u            #    1,39  insn per cycle           ( +-  0,03% )

           7,89362 +- 0,00292 seconds time elapsed  ( +-  0,04% )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet