
Use dense precomputed fixed shift magics [v2] #1538

Closed

Conversation

@niklasf (Contributor) commented Apr 10, 2018

This is based on a previous patch #1352, which was closed very quickly. The main concern was that Stockfish no longer shows how to generate the magics that are now a table of constants. Since then there has been some discussion on the closed pull request and on the mailing list and I've been advised to submit a new version.

Concerns:

How to generate magics? It's not feasible to generate magics of this quality in the build process or at startup. It can still be explained outside of the code. The specific method is now mentioned in the comments and a relevant Wiki article is linked.

Maintainability and correctness. This is a non-functional change to the magic bitboard initialization. It is self-contained; no code outside of it is affected. The raw line count increases, but note that 128 of those lines are just constants. An assertion would detect any (even subtly) incorrect magic factors or offsets.

PEXT is the future. While this may be true for Intel processors, it is not clear whether AMD's slow pext will be improved or whether pext will become available on ARM. However, we can gain performance right now. Also note that almost all fishtest workers will benefit from this change.
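The assertion mentioned under "Maintainability and correctness" can be illustrated with a small, self-contained sketch (hypothetical helper names, not the patch's actual code): enumerate every subset of the relevant occupancy mask with the Carry-Rippler trick and check that the magic maps each occupancy to a table slot holding exactly its attack set.

```cpp
#include <cassert>
#include <cstdint>

using Bitboard = uint64_t;

// Hypothetical check, not the patch's code: a magic factor is valid iff no
// two occupancies with different attack sets collide on the same table slot.
bool verify_magic(Bitboard mask, Bitboard magic, int shift,
                  Bitboard (*slider)(Bitboard occupied)) {
    Bitboard table[4096] = {};  // big enough for a 12-bit index
    bool     used[4096]  = {};

    Bitboard b = 0;
    do {  // Carry-Rippler trick: b walks every subset of mask
        unsigned idx = unsigned((b * magic) >> shift);
        Bitboard att = slider(b);
        if (used[idx] && table[idx] != att)
            return false;  // harmful collision: magic or offset is wrong
        table[idx] = att;
        used[idx]  = true;
        b = (b - mask) & mask;
    } while (b);
    return true;
}
```

Constructive collisions (two occupancies with identical attack sets sharing a slot) are exactly what makes the dense packing possible; only collisions with differing attack sets are fatal, and that is what such an assertion would catch.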


These densely packed fixed shift magics computed by Volker Annuss
achieve a considerably smaller table size than the current magics, even
without looking up a specific shift for each square.

Allows simplifying away the optimized multiplication for 32-bit
architectures without regression.

PEXT-Bitboards do not benefit (and replacing them would be a regression
on bmi2).

Speedup on x86-64-modern:

base =    2461271 +/- 15128
test =    2484510 +/- 14926
diff =      23238 +/- 27939
speedup = 0.009442

No regression on x86-64-bmi2:

base =    2501857 +/- 17219
test =    2500387 +/- 17443
diff =      -1469 +/- 31698
speedup = -0.000588

STC with disabled bmi2:

LLR: 2.95 (-2.94,2.94) [0.00,5.00]
Total: 35070 W: 7279 L: 6987 D: 20804

Update: STC against master:

LLR: 2.96 (-2.94,2.94) [0.00,5.00]
Total: 31719 W: 6644 L: 6366 D: 18709

No functional change.

@Hanamuke

Can you explain why it is faster? Is it only cache locality?

I don't have a strong opinion about this code, as I understand how it worked before, but I'd want it merged if I were an AMD user.

@niklasf (Contributor, Author) commented Apr 10, 2018

Yes, it's more cache efficient because the table size was reduced and the shift is now determined at compile time.
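As a rough sketch of that second point (illustrative names and values assumed for this example, not the actual Stockfish code), a fixed shift turns a per-square memory load into a compile-time constant:

```cpp
#include <cassert>
#include <cstdint>

using Bitboard = uint64_t;

// Variable-shift scheme (as in master): each square carries its own shift,
// which must be loaded from memory on every lookup.
struct VariableShiftMagic {
    Bitboard mask, magic;
    unsigned shift;
};

inline unsigned index_variable(const VariableShiftMagic& m, Bitboard occ) {
    return unsigned(((occ & m.mask) * m.magic) >> m.shift);
}

// Fixed-shift scheme (this patch): the shift is the same for all 64 squares,
// so the compiler emits an immediate shift and one load disappears.
constexpr unsigned RookShift = 64 - 12;

struct FixedShiftMagic {
    Bitboard mask, magic;
    Bitboard* attacks;  // offset into the shared, densely packed table
};

inline unsigned index_fixed(const FixedShiftMagic& m, Bitboard occ) {
    return unsigned(((occ & m.mask) * m.magic) >> RookShift);
}
```

The smaller, overlapping attack table then keeps more of the hot entries in cache, which is where the rest of the speedup comes from.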

@Hanamuke

It seems possible to determine the shift at compile time in master too. Have you checked that part of the speedup is indeed due to cache locality? I don't doubt it, tbh, but you can never be too cautious.

@niklasf (Contributor, Author) commented Apr 10, 2018

In current master the shift depends on the square: https://github.com/official-stockfish/Stockfish/pull/1538/files#diff-22152eaf5866a06c6b3b68c78aace385L85. Without this, the table size would be (2^12 + 2^9) * 64 * 64 bit = 2.3 MB, so overlapping magics are required to make the approach competitive.
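For scale, the arithmetic can be checked directly (a standalone sketch; 88772 is the dense non-PEXT table size used by this patch, and the naive layout is the 2.3 MB one described above):

```cpp
#include <cassert>
#include <cstdint>

// Naive fixed-shift layout: 2^12 rook entries plus 2^9 bishop entries per
// square, 64 squares, one 64-bit bitboard per entry.
constexpr std::uint64_t NaiveBytes =
    ((1ull << 12) + (1ull << 9)) * 64 * sizeof(std::uint64_t);
static_assert(NaiveBytes == 2359296, "2.25 MiB, the ~2.3 MB figure above");

// Densely packed table from this patch: 88772 overlapping entries.
constexpr std::uint64_t DenseBytes = 88772ull * sizeof(std::uint64_t);
static_assert(DenseBytes == 710176, "about 693 KiB, roughly 3.3x smaller");
```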

@mcostalba

Is the test against master or against a lame version? I guess it's the second case. I think either you decide to use fishtest in the proper way, or you don't use it and rely on a different argument.

Even in the best case the speedup is less than 1%, and anyhow on modern hardware with bmi2 there is no speedup, just a net loss of knowledge and (to me) of sensible insight for people interested in understanding magics.

Moreover the patch adds way more (dumb) stuff than what it removes.

If this patch were judged in the usual way it would not pass fishtest, and even setting that aside, it adds a lot for a very small speedup and only in specific cases... and those cases refer to old hardware, so not to the future.

@mcostalba

Intel introduced bmi2 instructions with Haswell processors in 2013.

We are talking about a patch made for CPUs from more than 5 years ago!!!

@mcostalba

And even in those obsolete cases the speed up is very marginal at best.

@niklasf (Contributor, Author) commented Apr 10, 2018

I think I've answered all this before. To reiterate:

I indeed disabled bmi2 for the STC test to get a clean test: http://tests.stockfishchess.org/tests/view/5acbc4b30ebc59547e537d1d. Otherwise it would depend on how many machines with or without pext are randomly assigned. To me it seems like a waste, but if you want I can start such a test.

Loss of knowledge may be prevented by linking the appropriate resources such as https://chessprogramming.wikispaces.com/Magic+Bitboards which has many more insights.

Speedups I've seen range from 0.9% to 1.3%. I've reported the worst result here. (Speedup of bmi2 seems to be 2.1%, for comparison.)

It absolutely does not refer only to old hardware, but also to the most recent AMD and ARM processors.

@niklasf (Contributor, Author) commented Apr 10, 2018

it adds a lot for a very small speed up and only in specific cases

I believe judging this tradeoff should be the main point.

@mcostalba commented Apr 10, 2018

AMD introduced BMI2 since "Excavator" family in 2015.

@niklasf (Contributor, Author) commented Apr 10, 2018

Yes, but AMD's pext is slow. Using it would be a regression.

@niklasf (Contributor, Author) commented Apr 10, 2018

I couldn't help it, so here's the result against master (also green, fwiw):

LLR: 2.96 (-2.94,2.94) [0.00,5.00]
Total: 31719 W: 6644 L: 6366 D: 18709

@mstembera (Contributor)

Results for 20 tests for each version:

            Base      Test      Diff      
    Mean    1023087   1039504   -16417    
    StDev   50601     51184     11836     

p-value: 0.917
speedup: 0.016

IMO showing how to calculate magics is nice, but it can't be done well/optimally inside an engine because calculating optimal magics takes too long. Therefore I think a special tool for calculating them is a better place to show how it's done. Otherwise other engines will end up with suboptimal magics, just as we have.

@noobpwnftw (Contributor) commented Apr 10, 2018

To be fair, I believe most of the fishtest contributors are not changing their default compiler settings, meaning 'x86-64-modern' and no PEXT. That's why a test against master would result in green.

But, if we are here to imply that PEXT availability is now a hard precondition for any speedup optimization, then I think we should describe it somewhere and make clear that we could have made it faster when PEXT is not used, but we won't do it because it's just lame.

@niklasf Please run that test again against master since now the majority of the workers should have PEXT enabled by default.

However, the way I see it, it's not like we should stop making improvements for older hardware; 32-bit builds are even lamer in this sense, and we are still supporting them.

Documentation of how to generate those magics should address the "obscureness" of having pre-calculated constants in code, serving as a record of where those numbers come from. It's not unlike the Zobrist hash keys we use now, which are initialized with a magic random seed of our choice, and yet we didn't and won't need to explain why it is 1070372 instead of 0.

@niklasf (Contributor, Author) commented Apr 10, 2018

Cool, having PEXT by default is a nice win for fishtest performance!

Running another test seems unnecessary. We can just assume it fails [0, 5] if the patch is now neutral on most of the workers. Should I do it anyway?

Volker Annuss described the process of creating these specific magics here: http://www.talkchess.com/forum/viewtopic.php?t=60065&start=14. Should I add a comment in the code that summarizes this? Currently there's only the link to the Chessprogramming Wiki (which also references that forum post amongst others).

@atumanian

Intel Core i3-3120M, Linux 4.9, GCC 6.3
100 runs each
test: atumanian@addab00 (the proposed version in this PR updated with master)
base: 62619fa

    Version   Mean time (ms)   St. dev. of time (ms)
    test      3629.12          25.66
    base      3700.53          21.09

Speedup: 1.97%

@mcostalba

But if it does not improve on fishtest (with PEXT enabled), why should it be committed?

If I wrote a patch that added more than 100 lines to speed up 32-bit CPUs by 1%, would it be committed?

@noobpwnftw (Contributor)

How is it a problem to have more lines of constants in the code, or to put them in a separate header file? Let whoever reads it understand that those are non-trivial to calculate at run-time, while they give some performance increase on the compile profile currently categorized as "x86-64-modern".

If there is a standard on how much performance increase justifies how many added lines, then I guess these numbers should go, because they give absolutely no Elo gain on any hardware:

    // Optimal PRNG seeds to pick the correct magics in the shortest time
    int seeds[][RANK_NB] = { { 8977, 44560, 54343, 38998,  5731, 95205, 104912, 17020 },
                             {  728, 10316, 55013, 32803, 12281, 15100,  16645,   255 } };

The constants in this patch might fit in half as many lines, but if that were really the right direction for a commit, then I'd hope the author never does it.

@syzygy1 (Contributor) commented Apr 10, 2018

I'd like to know the source of those "Optimal PRNG seeds" because I'm so lame that I do not understand why they are as they are just by looking at them.

Another question: how much performance improvement does PEXT give on ARM?

@tthsqe12 commented Apr 10, 2018

PEXT doesn't exist on ARM, so I guess the performance benefit of not using it is infinite. When I changed from variable shifts to these fixed shifts in the compiler-neutral x86fish, I think my numbers resembled atumanian's: between 1 and 2%, but definitely closer to 2%. It was a clear speedup.

@snicolet The summary is: (1) the new code will have 256 magic numbers instead of 16, and these numbers will be arranged in two 64-by-2 grids in the source instead of one 2-by-8 grid; (2) this will be a win on ALL architectures except newish Intel; (3) the new code is beautiful and represents the fruits of a considerable engineering task; (4) you will then forget about this new code because it just works and it's just movegen.

@syzygy1 (Contributor) commented Apr 11, 2018

The new code is also easier to understand than the old code.

@snicolet (Member) commented Apr 11, 2018

Part of the problem, I think, is that all the magic code and data should be kept separated from the Bitboard type stuff.

The current code, even in master, is strangely structured (and badly commented, but that should be easier to fix :-), because in Bitboard.cpp and Bitboard.h we have a mix of
• high level, chess related concepts (Squares, Files, Distances, PawnAttacks, passed_pawn_mask(), etc)
• very low-level, complicated tricks for the magic tables and functions, which are in fact used in only one place (lines 265-266 of bitboard.h, in the function attacks_bb())

Basically everything that has the word "Magic" in its name or uses something called MagicBlah should disappear from Bitboard.cpp and Bitboard.h (except attacks_bb()). This modular approach would make the magic implementation a black box that we could comment on, then forget more easily, without impacting the quality of the chess related code.

If we manage to put all the magic stuff in two new files magics.h and magics.cpp (or even better, in only one) while keeping the speed benefit, I will commit the patch. For instance, "all the magic stuff" could include:

    struct Magic {}
    struct MagicInit {}
    Magic RookMagics[SQUARE_NB];
    Magic BishopMagics[SQUARE_NB];
    MagicInit BishopMagicInit[SQUARE_NB] = {}
    MagicInit RookMagicInit[SQUARE_NB] = {}
    Direction BishopDirections[]
    Direction RookDirections[]
    template<PieceType Pt> void init_magics(MagicInit init[], Magic magics[], Direction directions[]);
    Bitboard AttackTable[HasPext ? 107648 : 88772] = { 0 };
    Bitboard relevant_occupancies(Direction directions[], Square s);
    Bitboard sliding_attack(Direction directions[], Square sq, Bitboard occupied) {}
    template<typename T> T sparse_rand()

@niklasf (Contributor, Author) commented Apr 11, 2018

@snicolet I imagine it would be best to put this current patch on hold and try this as a pure refactoring of the status quo? Or should I try to make one big patch?

@snicolet (Member)

A second commit on top of your patch (the version with the big table) would be fine to judge the effect of refactoring.

@syzygy1 (Contributor) commented Apr 11, 2018

It may or may not be helpful to separate/modularise the magic and pext implementations (perhaps even in a way that would allow people to experiment with still further approaches such as hyperbola quintessence). At the moment the two are so intertwined that it is difficult to make modifications.

@CoffeeOne commented Apr 17, 2018

Hello,
I confirm the very big speedup on some CPUs:

Intel Westmere CPU (Xeon E7-4870 2.4GHz)
under Windows, gcc 7.3 profile-build with LTO enabled
30 runs each of "bench 16 1 16"
speedup of 1.613% compared to master with the same signature "Reset negative statScore on fail high"!

Remark: a speed comparison against the latest master (different signature) under the exact same conditions gave a speedup of only 0.382%, so "Simplify condition in space definition" was a huge speed gain, too!

@MichaelB7 (Contributor)

1.6 % speedup here
bench 16 1 24, turbo boost off, run concurrently

patch:
===========================
Total time (ms) : 207809
Nodes searched  : 359232560
Nodes/second    : 1728666

base:
===========================
Total time (ms) : 211282
Nodes searched  : 359232560
Nodes/second    : 1700251

    // Removed: the 32-bit-friendly index computation
    unsigned lo = unsigned(occupied) & unsigned(mask);
    unsigned hi = unsigned(occupied >> 32) & unsigned(mask >> 32);
    return (lo * unsigned(magic) ^ hi * unsigned(magic >> 32)) >> shift;

    // Added: a single 64-bit multiplication
    return unsigned(((occupied & mask) * magic) >> shift);


Won't this be a speed regression for 32-bit?
Actually, do we still care about 32-bit performance? (I don't, but I remember that @mcostalba was very adamant about it.)

@niklasf (Contributor, Author)

It seems that on the machines I tested the gains of this patch make up for the removal of the specialized multiplication.

@ghost commented Apr 28, 2018

@mcostalba PEXT isn't available on all machines. For example my Pentium G3220 is a Haswell, but due to market segmentation Pentiums don't get "premium" instructions like BMI2; even on the newest Skylake/Kaby Lake, BMI/BMI2 is only for i3/i5/i7 and Xeon processors.
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/584240
Additionally AMD processors have a slow PEXT implementation:
https://chessprogramming.wikispaces.com/BMI2

@Ipmanchess

Want to add i9s: my i9-7980XE uses BMI2 and even AVX-512; with LCZero it almost doubled my nodes/sec.

@snicolet (Member) commented May 4, 2018

@niklasf Did you get the time to try regrouping all the magic stuff in a separate file?

@niklasf (Contributor, Author) commented May 4, 2018

@snicolet I can share an attempt soon, but I am not happy with it yet, because it depends on some low-level stuff from bitboard.cpp (e.g. file and rank masks), and some higher-level stuff in bitboard.cpp (e.g. PseudoAttacks, line and between masks) depends on it, so it needs to be initialized exactly in between.

@niklasf (Contributor, Author) commented Jun 4, 2018

Sorry all, I got sidetracked. I'll finish this with some (maybe improved) magics later this year. In the meantime, if anyone wants to work on this, they are most welcome.
