New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use dense precomputed fixed shift magics [v2] #1538
Conversation
These densely packed fixed shift magics computed by Volker Annuss achieve a considerably smaller table size than the current magics, even without looking up a specific shift for each square. Allows simplifying away the optimized multiplication for 32-bit architectures without regression. PEXT-Bitboards do not benefit (and replacing them would be a regression on bmi2). Speedup on x86-64-modern: base = 2461271 +/- 15128 test = 2484510 +/- 14926 diff = 23238 +/- 27939 speedup = 0.009442 No regression on x86-64-bmi2: base = 2501857 +/- 17219 test = 2500387 +/- 17443 diff = -1469 +/- 31698 speedup = -0.000588 STC with disabled bmi2: LLR: 2.95 (-2.94,2.94) [0.00,5.00] Total: 35070 W: 7279 L: 6987 D: 20804 No functional change.
Can you explain why it is faster ? Is it only cache locality ? Don't have strong opinion about this code as i understand how it worked before, but i'd want it merged if i was an AMD user |
Yes, it's more cache efficient because the table size was reduced and the shift is now determined at compile time. |
It seems possible to determine shift at compile time in master too, have you checked that part of the speedup is indeed due to the cache locality ? I don't doubt it tbh but you're never too cautious. |
In current master shift depends on the square: https://github.com/official-stockfish/Stockfish/pull/1538/files#diff-22152eaf5866a06c6b3b68c78aace385L85. Without this the table size would be |
The test is against master or against a lame version? I guess this second case. I think that or you decide to use fishtest in the proper way or you don't use it and rely on a different argument. Even in the best case the speed up is less then 1% and anyhow on modern hardware with bmi2 there is no speed up, just a net loss of knowledge and (to me) sensible insight for people interested in understanding magics. Moreover the patch adds way more (dumb) stuff than what it removes. If this patch should be judged in usual way it would not pass fishtest, and even not considering this, it adds a lot for a very small speed up and only in specific cases....and this cases refers to old hardware, so not at the future. |
Intel introduced bmi2 instructions with Haswell processors in 2013. We are talking of a patch made for cpu of more than 5 years ago!!! |
And even in those obsolete cases the speed up is very marginal at best. |
I think I've answered all this before. To reiterate: I indeed disabled bmi2 for the STC test to get a clean test: http://tests.stockfishchess.org/tests/view/5acbc4b30ebc59547e537d1d. Otherwise it would depend on how many machines with or without pext are randomly assigned. To me it seems like a waste, but if you want I can start such a test. Loss of knowledge may be prevented by linking the appropriate resources such as https://chessprogramming.wikispaces.com/Magic+Bitboards which has many more insights. Speedups I've seen range from 0.9% to 1.3%. I've reported the worst result here. (Speedup of bmi2 seems to be 2.1%, for comparison.) It absolutely does not refer only to old hardware, but also to the most recent AMD and arm processors. |
I believe judging this tradeoff should be the main point. |
AMD introduced BMI2 since "Excavator" family in 2015. |
Yes, but AMD's pext is slow. Using it would be a regression. |
I couldn't help it, so here's the result against master (also green, fwiw):
|
IMO showing how to calculate magics is nice but it can't be done well/optimally inside an engine because calculating optimal magics takes too long. Therefore I think showing how to do it inside a special tool for calculating them is a better place for it. Otherwise other engines will end up with sub optimal magics just as we have. |
To be fair, I believe most of the fishtest contributors are not changing their default compiler settings, meaning 'x86-64-modern' and no PEXT. That's why a test against master would result in green. But, if we are here to imply PEXT availability is now a hard precondition to any speedup optimizations, then I think we should describe it somewhere and make clear that we could've made it faster when PEXT is not used but we won't do it because it's just lame. @niklasf Please run that test again against master since now the majority of the workers should have PEXT enabled by default. However the way I see it is not like we should stop making improvements on older hardware, 32-bit builds are even lamer in this sense and we are still supporting it. A documentation of how to generate those magics should justify the "obscureness" of having pre-calculated constants in code, serving as a source of where those numbers are coming from. It isn't like Zobrist hash keys we are using now are initialized with a magic random seed of our choice, but we didn't and won't need to explain why it is 1070372 instead of 0. |
Cool, having PEXT by default is a nice win for fishtest performance! Running another test seems unnecessary. We can just assume it fails [0, 5] if the patch is now neutral on most of the workers. Should I do it anyway? Volker Annuss described the process of creating these specific magics here: http://www.talkchess.com/forum/viewtopic.php?t=60065&start=14. Should I add a comment in the code that summarizes this? Currently there's only the link to the Chessprogramming Wiki (which also references that forum post amongst others). |
Intel Core i3-3120M, Linux 4.9, GCC 6.3
Speedup: 1.97% |
But if it does not improve on fishtest (with PEXT enable) why it should be committed? If I write a patch that adds more than 100 lines to speedup 32 bit CPU of 1% it would be committed? |
How is that a problem with having more lines of constants in code or to put them in a separate header file, let whoever reads it understand that those are non-trivial to calculate at run-time, while it can give some performance increase on the compiling profile currently categorized as "x86-64-modern"? If there is a standard on how much performance increase could justify how many more lines introduced, then I guess these numbers should go because they give absolutely no ELO gain on any hardware.
Constants in this patch might fit in half of their lines, but if that is really the right direction to go for a commit then I'd hope the author never do that. |
I'd like to know the source of those "Optimal PRNG seeds" because I'm so lame that I do not understand why they are as they are just by looking at them. Another question: how much performance improvement does PEXT give on ARM? |
Pext doesnt exist on ARM, so i guess the performance benefit of not using it is infinite. When I changed from variable shifts to these fixed shifts in the compiler-neutral x86fish, I think my numbers resembled atumanian's - between 1 and 2%, but definitely closer to 2%. It was a clear speed up. |
The new code is also easier to understand than the old code. |
Part of the problem, I think, is that all the magic code and data should be kept separated from the Bitboard type stuff. The current code, even in master, is strangely structured (and badly commented, but that should be easier to fix :-), because in Bitboard.cpp and Bitboard.h we have a mix of Basically everything that has the word "Magic" in its name or uses something called MagicBlah should disappear from Bitboard.cpp and Bitboard.h (except If we manage to put all the magic stuff in two new files • |
@snicolet I imagine it would be best to put this current patch on hold and try this as a pure refactoring of the status quo? Or should I try to make one big patch? |
A second commit on top of your patch (the version with the big table) would be fine to judge the effect of refactoring. |
It may or may not be helpful to separate/modularise the magic and pext implementations (perhaps even in a way that would allow people to experiment with still further approaches such as hyperbola quintessence). At the moment the two are so intertwined that it is difficult to make modifications. |
Hello, Intel Westmere CPU (Xeon E7-4870 2.4GHz) Remark: speed comparison against latest master (different signature) under the exact same conditions gave a speedup of only 0,382%, so "Simplify condition in space definition" was a huge speed gain, too! |
1.6 % speedup here
|
unsigned lo = unsigned(occupied) & unsigned(mask); | ||
unsigned hi = unsigned(occupied >> 32) & unsigned(mask >> 32); | ||
return (lo * unsigned(magic) ^ hi * unsigned(magic >> 32)) >> shift; | ||
return unsigned(((occupied & mask) * magic) >> shift); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't this be a speed regression for 32 bit ?
Actually, do we still care about 32 bit performance ? (I don't, but I remember that @mcostalba was very adamant about it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that on the machines I tested the gains of this patch make up for the removal of the specialized multiplication.
@mcostalba mcostalba PEXT isn't available on all machines. For example my pentium G3220 is a Haswell, but due market segmentation Pentiums don't get "premium" instructions like BMI2 even on newest Skylake/KabyLake BMI/BMI2 is only for i3/i5/i7 and Xeon processors. |
Want to add i9's my i9 7980XE use BMI2 and even AVX512 ,with LCZero almost dubbled my nodes/sec. |
@niklasf Did you get the time to try regrouping all the magic stuff in a separated file? |
@snicolet I can share an attempt soon, but I am not happy with it yet, because it depends on some low level stuff from bitboards.cpp (e.g. file and rank masks) and some higher level stuff in bitboards.cpp (e.g. PseudoAttacks, line and between masks) depends on it, so it needs to be initialized exactly in between. |
Sorry all, I got sidetracked. I'll finish this with some (maybe improved) magics later this year. In the meantime, if anyone wants to work on this, they are most welcome. |
This is based on a previous patch #1352, which was closed very quickly. The main concern was that Stockfish no longer shows how to generate the magics that are now a table of constants. Since then there has been some discussion on the closed pull request and on the mailing list and I've been advised to submit a new version.
Concerns:
How to generate magics? It's not feasible to generate magics of this quality in the build process or at startup. It can still be explained outside of the code. The specific method is now mentioned in the comments and a relevant Wiki article is linked.
Maintainability and correctness. This is a non-functional change to the magic bitboard initialization. It is self contained, no code outside is affected. Pure line count is increased, but note that 128 lines are just constants. An assertion would detect any (even subtly) incorrect magic factors or offsets.
PEXT is the future. While this may be true for Intel processors it is not clear if AMD's slow pext will be improved or if pext will be available on arm. However we can gain performance right now. Also note that almost all fishtest workers will benefit from this change.
These densely packed fixed shift magics computed by Volker Annuss
achieve a considerably smaller table size than the current magics, even
without looking up a specific shift for each square.
Allows simplifying away the optimized multiplication for 32-bit
architectures without regression.
PEXT-Bitboards do not benefit (and replacing them would be a regression
on bmi2).
Speedup on x86-64-modern:
No regression on x86-64-bmi2:
STC with disabled bmi2:
Update: STC against master:
No functional change.