
Use dense precomputed fixed shift magics [v2] #1538

Closed

Conversation

@niklasf (Contributor) commented Apr 10, 2018

This is based on a previous patch #1352, which was closed very quickly. The main concern was that Stockfish no longer shows how to generate the magics that are now a table of constants. Since then there has been some discussion on the closed pull request and on the mailing list and I've been advised to submit a new version.

Concerns:

How to generate magics? It's not feasible to generate magics of this quality in the build process or at startup. It can still be explained outside of the code. The specific method is now mentioned in the comments and a relevant Wiki article is linked.

Maintainability and correctness. This is a non-functional change to the magic bitboard initialization. It is self-contained; no code outside of it is affected. The raw line count increases, but note that 128 of those lines are just constants. An assertion would detect any (even subtly) incorrect magic factors or offsets.

PEXT is the future. While this may be true for Intel processors, it is not clear whether AMD's slow pext will be improved or whether pext will become available on ARM. However, we can gain performance right now. Also note that almost all fishtest workers will benefit from this change.
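The assertion mentioned under "Maintainability and correctness" can be illustrated with a small, self-contained sketch (hypothetical helper names, not the patch's actual code): enumerate every subset of the relevant occupancy mask with the Carry-Rippler trick and check that the magic maps each occupancy to a table slot holding exactly its attack set.

```cpp
#include <cassert>
#include <cstdint>

using Bitboard = uint64_t;

// Hypothetical check, not the patch's code: a magic factor is valid iff no
// two occupancies with different attack sets collide on the same table slot.
bool verify_magic(Bitboard mask, Bitboard magic, int shift,
                  Bitboard (*slider)(Bitboard occupied)) {
    Bitboard table[4096] = {};  // big enough for a 12-bit index
    bool     used[4096]  = {};

    Bitboard b = 0;
    do {  // Carry-Rippler trick: b walks every subset of mask
        unsigned idx = unsigned((b * magic) >> shift);
        Bitboard att = slider(b);
        if (used[idx] && table[idx] != att)
            return false;  // harmful collision: magic or offset is wrong
        table[idx] = att;
        used[idx]  = true;
        b = (b - mask) & mask;
    } while (b);
    return true;
}
```

Constructive collisions (two occupancies with identical attack sets sharing a slot) are exactly what makes the dense packing possible; only collisions with differing attack sets are fatal, and that is what such an assertion would catch.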


These densely packed fixed shift magics computed by Volker Annuss
achieve a considerably smaller table size than the current magics, even
without looking up a specific shift for each square.

Allows simplifying away the optimized multiplication for 32-bit
architectures without regression.

PEXT-Bitboards do not benefit (and replacing them would be a regression
on bmi2).

Speedup on x86-64-modern:

base =    2461271 +/- 15128
test =    2484510 +/- 14926
diff =      23238 +/- 27939
speedup = 0.009442

No regression on x86-64-bmi2:

base =    2501857 +/- 17219
test =    2500387 +/- 17443
diff =      -1469 +/- 31698
speedup = -0.000588

STC with disabled bmi2:

LLR: 2.95 (-2.94,2.94) [0.00,5.00]
Total: 35070 W: 7279 L: 6987 D: 20804

Update: STC against master:

LLR: 2.96 (-2.94,2.94) [0.00,5.00]
Total: 31719 W: 6644 L: 6366 D: 18709

No functional change.

@Hanamuke

Can you explain why it is faster? Is it only cache locality?

I don't have a strong opinion about this code, as I understand how it worked before, but I'd want it merged if I were an AMD user.

@niklasf (Contributor, Author) commented Apr 10, 2018

Yes, it's more cache efficient because the table size was reduced and the shift is now determined at compile time.
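As a rough sketch of that second point (illustrative names and values assumed for this example, not the actual Stockfish code), a fixed shift turns a per-square memory load into a compile-time constant:

```cpp
#include <cassert>
#include <cstdint>

using Bitboard = uint64_t;

// Variable-shift scheme (as in master): each square carries its own shift,
// which must be loaded from memory on every lookup.
struct VariableShiftMagic {
    Bitboard mask, magic;
    unsigned shift;
};

inline unsigned index_variable(const VariableShiftMagic& m, Bitboard occ) {
    return unsigned(((occ & m.mask) * m.magic) >> m.shift);
}

// Fixed-shift scheme (this patch): the shift is the same for all 64 squares,
// so the compiler emits an immediate shift and one load disappears.
constexpr unsigned RookShift = 64 - 12;

struct FixedShiftMagic {
    Bitboard mask, magic;
    Bitboard* attacks;  // offset into the shared, densely packed table
};

inline unsigned index_fixed(const FixedShiftMagic& m, Bitboard occ) {
    return unsigned(((occ & m.mask) * m.magic) >> RookShift);
}
```

The smaller, overlapping attack table then keeps more of the hot entries in cache, which is where the rest of the speedup comes from.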

@Hanamuke

It seems possible to determine the shift at compile time in master too. Have you checked that part of the speedup is indeed due to cache locality? I don't doubt it, tbh, but you can never be too cautious.

@niklasf (Contributor, Author) commented Apr 10, 2018

In current master the shift depends on the square: https://github.com/official-stockfish/Stockfish/pull/1538/files#diff-22152eaf5866a06c6b3b68c78aace385L85. Without this, the table size would be (2^12 + 2^9) * 64 * 64 bit = 2.3 MB, so overlapping magics are required to make the approach competitive.
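For scale, the arithmetic can be checked directly (a standalone sketch; 88772 is the dense non-PEXT table size used by this patch, and the naive layout is the 2.3 MB one described above):

```cpp
#include <cassert>
#include <cstdint>

// Naive fixed-shift layout: 2^12 rook entries plus 2^9 bishop entries per
// square, 64 squares, one 64-bit bitboard per entry.
constexpr std::uint64_t NaiveBytes =
    ((1ull << 12) + (1ull << 9)) * 64 * sizeof(std::uint64_t);
static_assert(NaiveBytes == 2359296, "2.25 MiB, the ~2.3 MB figure above");

// Densely packed table from this patch: 88772 overlapping entries.
constexpr std::uint64_t DenseBytes = 88772ull * sizeof(std::uint64_t);
static_assert(DenseBytes == 710176, "about 693 KiB, roughly 3.3x smaller");
```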

@mcostalba

Is the test against master or against a lame version? I guess it's the second case. I think either you decide to use fishtest in the proper way, or you don't use it and rely on a different argument.

Even in the best case the speedup is less than 1%, and anyhow on modern hardware with bmi2 there is no speedup, just a net loss of knowledge and (to me) of sensible insight for people interested in understanding magics.

Moreover the patch adds way more (dumb) stuff than what it removes.

If this patch were judged in the usual way it would not pass fishtest, and even setting that aside, it adds a lot for a very small speedup and only in specific cases... and those cases refer to old hardware, so not to the future.

@mcostalba

Intel introduced bmi2 instructions with Haswell processors in 2013.

We are talking about a patch made for CPUs from more than 5 years ago!!!

@mcostalba

And even in those obsolete cases the speed up is very marginal at best.

@niklasf (Contributor, Author) commented Apr 10, 2018

I think I've answered all this before. To reiterate:

I indeed disabled bmi2 for the STC test to get a clean test: http://tests.stockfishchess.org/tests/view/5acbc4b30ebc59547e537d1d. Otherwise it would depend on how many machines with or without pext are randomly assigned. To me it seems like a waste, but if you want I can start such a test.

Loss of knowledge may be prevented by linking the appropriate resources such as https://chessprogramming.wikispaces.com/Magic+Bitboards which has many more insights.

Speedups I've seen range from 0.9% to 1.3%. I've reported the worst result here. (Speedup of bmi2 seems to be 2.1%, for comparison.)

It absolutely does not refer only to old hardware, but also to the most recent AMD and ARM processors.

@niklasf (Contributor, Author) commented Apr 10, 2018

it adds a lot for a very small speed up and only in specific cases

I believe judging this tradeoff should be the main point.

@mcostalba commented Apr 10, 2018

AMD introduced BMI2 since "Excavator" family in 2015.

@niklasf (Contributor, Author) commented Apr 10, 2018

Yes, but AMD's pext is slow. Using it would be a regression.

@niklasf (Contributor, Author) commented Apr 10, 2018

I couldn't help it, so here's the result against master (also green, fwiw):

LLR: 2.96 (-2.94,2.94) [0.00,5.00]
Total: 31719 W: 6644 L: 6366 D: 18709

@mstembera (Contributor)

Results for 20 tests for each version:

            Base      Test      Diff      
    Mean    1023087   1039504   -16417    
    StDev   50601     51184     11836     

p-value: 0.917
speedup: 0.016

IMO showing how to calculate magics is nice, but it can't be done well/optimally inside an engine because calculating optimal magics takes too long. Therefore I think a special tool for calculating them is a better place to show how it's done. Otherwise other engines will end up with suboptimal magics, just as we have.

@noobpwnftw (Contributor) commented Apr 10, 2018

To be fair, I believe most of the fishtest contributors are not changing their default compiler settings, meaning 'x86-64-modern' and no PEXT. That's why a test against master would result in green.

But, if we are here to imply that PEXT availability is now a hard precondition for any speedup optimization, then I think we should describe it somewhere and make clear that we could have made it faster when PEXT is not used, but we won't do it because it's just lame.

@niklasf Please run that test again against master since now the majority of the workers should have PEXT enabled by default.

However, the way I see it, it's not like we should stop making improvements for older hardware; 32-bit builds are even lamer in this sense, and we are still supporting them.

Documentation of how to generate those magics should address the "obscureness" of having pre-calculated constants in code, serving as a record of where those numbers come from. It's not unlike the Zobrist hash keys we use now, which are initialized with a magic random seed of our choice, and yet we didn't and won't need to explain why it is 1070372 instead of 0.

@niklasf (Contributor, Author) commented Apr 10, 2018

Cool, having PEXT by default is a nice win for fishtest performance!

Running another test seems unnecessary. We can just assume it fails [0, 5] if the patch is now neutral on most of the workers. Should I do it anyway?

Volker Annuss described the process of creating these specific magics here: http://www.talkchess.com/forum/viewtopic.php?t=60065&start=14. Should I add a comment in the code that summarizes this? Currently there's only the link to the Chessprogramming Wiki (which also references that forum post amongst others).

@atumanian

Intel Core i3-3120M, Linux 4.9, GCC 6.3
100 runs each
test: atumanian@addab00 (the proposed version in this PR updated with master)
base: 62619fa

    Version   Mean time (ms)   St. dev. of time (ms)
    test      3629.12          25.66
    base      3700.53          21.09

Speedup: 1.97%

@mcostalba

But if it does not improve on fishtest (with PEXT enabled), why should it be committed?

If I wrote a patch that added more than 100 lines to speed up 32-bit CPUs by 1%, would it be committed?

@noobpwnftw (Contributor)

How is it a problem to have more lines of constants in the code, or to put them in a separate header file? Let whoever reads it understand that those are non-trivial to calculate at run-time, while they give some performance increase on the compile profile currently categorized as "x86-64-modern".

If there is a standard on how much performance increase justifies how many added lines, then I guess these numbers should go, because they give absolutely no Elo gain on any hardware:

    // Optimal PRNG seeds to pick the correct magics in the shortest time
    int seeds[][RANK_NB] = { { 8977, 44560, 54343, 38998,  5731, 95205, 104912, 17020 },
                             {  728, 10316, 55013, 32803, 12281, 15100,  16645,   255 } };

The constants in this patch might fit in half as many lines, but if that were really the right direction for a commit, then I'd hope the author never does it.

@syzygy1 (Contributor) commented Apr 10, 2018

I'd like to know the source of those "Optimal PRNG seeds" because I'm so lame that I do not understand why they are as they are just by looking at them.

Another question: how much performance improvement does PEXT give on ARM?

@tthsqe12 commented Apr 10, 2018

PEXT doesn't exist on ARM, so I guess the performance benefit of not using it is infinite. When I changed from variable shifts to these fixed shifts in the compiler-neutral x86fish, I think my numbers resembled atumanian's: between 1 and 2%, but definitely closer to 2%. It was a clear speedup.

@snicolet The summary is: (1) the new code will have 256 magic numbers instead of 16, and these numbers will be arranged in two 64-by-2 grids in the source instead of one 2-by-8 grid; (2) this will be a win on ALL architectures except newish Intel; (3) the new code is beautiful and represents the fruits of a considerable engineering task; (4) you will then forget about this new code because it just works and it's just movegen.

@syzygy1 (Contributor) commented Apr 11, 2018

The new code is also easier to understand than the old code.

@snicolet (Member) commented Apr 11, 2018

Part of the problem, I think, is that all the magic code and data should be kept separated from the Bitboard type stuff.

The current code, even in master, is strangely structured (and badly commented, but that should be easier to fix :-), because in Bitboard.cpp and Bitboard.h we have a mix of
• high level, chess related concepts (Squares, Files, Distances, PawnAttacks, passed_pawn_mask(), etc)
• very low-level, complicated tricks for the magic tables and functions, which are in fact used in only one place (lines 265-266 of bitboard.h, in the function attacks_bb())

Basically everything that has the word "Magic" in its name or uses something called MagicBlah should disappear from Bitboard.cpp and Bitboard.h (except attacks_bb()). This modular approach would make the magic implementation a black box that we could comment on, then forget more easily, without impacting the quality of the chess related code.

If we manage to put all the magic stuff in two new files magics.h and magics.cpp (or even better, in only one) while keeping the speed benefit, I will commit the patch. For instance, "all the magic stuff" could include:

    struct Magic {}
    struct MagicInit {}
    Magic RookMagics[SQUARE_NB];
    Magic BishopMagics[SQUARE_NB];
    MagicInit BishopMagicInit[SQUARE_NB] = {}
    MagicInit RookMagicInit[SQUARE_NB] = {}
    Direction BishopDirections[]
    Direction RookDirections[]
    template<PieceType Pt> void init_magics(MagicInit init[], Magic magics[], Direction directions[]);
    Bitboard AttackTable[HasPext ? 107648 : 88772] = { 0 };
    Bitboard relevant_occupancies(Direction directions[], Square s);
    Bitboard sliding_attack(Direction directions[], Square sq, Bitboard occupied) {}
    template<typename T> T sparse_rand()

@niklasf (Contributor, Author) commented Apr 11, 2018

@snicolet I imagine it would be best to put this current patch on hold and try this as a pure refactoring of the status quo? Or should I try to make one big patch?

@snicolet (Member)

A second commit on top of your patch (the version with the big table) would be fine to judge the effect of refactoring.

@syzygy1 (Contributor) commented Apr 11, 2018

It may or may not be helpful to separate/modularise the magic and pext implementations (perhaps even in a way that would allow people to experiment with still further approaches such as hyperbola quintessence). At the moment the two are so intertwined that it is difficult to make modifications.

@CoffeeOne commented Apr 17, 2018

Hello,
I confirm the very big speedup on some CPUs:

Intel Westmere CPU (Xeon E7-4870 2.4GHz)
under Windows, gcc 7.3 profile-build with LTO enabled
30 runs each of "bench 16 1 16"
speedup of 1.613% compared to master with the same signature "Reset negative statScore on fail high"!

Remark: a speed comparison against the latest master (different signature) under the exact same conditions gave a speedup of only 0.382%, so "Simplify condition in space definition" was a huge speed gain, too!

@MichaelB7 (Contributor)

1.6 % speedup here
bench 16 1 24, turbo boost off, run concurrently

patch:
===========================
Total time (ms) : 207809
Nodes searched  : 359232560
Nodes/second    : 1728666

base:
===========================
Total time (ms) : 211282
Nodes searched  : 359232560
Nodes/second    : 1700251

    // Removed: the 32-bit-friendly index computation
    unsigned lo = unsigned(occupied) & unsigned(mask);
    unsigned hi = unsigned(occupied >> 32) & unsigned(mask >> 32);
    return (lo * unsigned(magic) ^ hi * unsigned(magic >> 32)) >> shift;

    // Added: a single 64-bit multiplication
    return unsigned(((occupied & mask) * magic) >> shift);


Won't this be a speed regression for 32-bit?
Actually, do we still care about 32-bit performance? (I don't, but I remember that @mcostalba was very adamant about it.)

@niklasf (Contributor, Author)

It seems that on the machines I tested the gains of this patch make up for the removal of the specialized multiplication.

@ghost commented Apr 28, 2018

@mcostalba PEXT isn't available on all machines. For example my Pentium G3220 is a Haswell, but due to market segmentation Pentiums don't get "premium" instructions like BMI2; even on the newest Skylake/Kaby Lake, BMI/BMI2 is only for i3/i5/i7 and Xeon processors.
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/584240
Additionally AMD processors have a slow PEXT implementation:
https://chessprogramming.wikispaces.com/BMI2

@Ipmanchess

Want to add i9s: my i9-7980XE uses BMI2 and even AVX-512; with LCZero it almost doubled my nodes/sec.

@snicolet (Member) commented May 4, 2018

@niklasf Did you get the time to try regrouping all the magic stuff in a separate file?

@niklasf (Contributor, Author) commented May 4, 2018

@snicolet I can share an attempt soon, but I am not happy with it yet, because it depends on some low-level stuff from bitboard.cpp (e.g. file and rank masks), and some higher-level stuff in bitboard.cpp (e.g. PseudoAttacks, line and between masks) depends on it, so it needs to be initialized exactly in between.

@niklasf (Contributor, Author) commented Jun 4, 2018

Sorry all, I got sidetracked. I'll finish this with some (maybe improved) magics later this year. In the meantime, if anyone wants to work on this, they are most welcome.
