Packed like sardines TT #3
Conversation
This change supports 50% more entries for a given sized transposition table (`TT`). Elo is increased as fewer time-consuming position evaluations are needed.

The old `TT` used 64-byte clusters, with each cluster containing 4 14-byte entries. This change uses 32-byte clusters, with each cluster containing 3 10-byte entries.

To reduce each entry from 14 to 10 bytes, some fields in the `TTEntry` structure have been condensed:

* The key field (32 bits) is now just 16 bits.
* The generation (8 bits) and bound (8 bits) fields have been combined into a single 8-bit field, with 6 bits used for the generation and 2 bits for the bound.
* The depth field (16 bits) is now just 8 bits, with the depth stored as an offset from the minimum depth `DEPTH_NONE` (which has been adjusted from -254 to -12). In this way, all depths can be stored as positive `uint8` values.

Since only a partial key is stored with each entry, the same entry used to store information about one position might be used to store information about a different position. This produces occasional type-2 errors during probes, where a probe returns information for a wrong position. These rare, but possible, type-2 errant probes are automatically detected with the same logic that detects corrupted entries generated during unsafe threaded access to the `TT`.

To improve speed, several optimizations have been made:

* The number of entries in each cluster has been reduced from 4 to 3.
* `first_entry` uses only 2 machine instructions and no longer uses a multiply.
* The `probe` loop has been unrolled.
* `store` writes the key only when necessary. It also evaluates only the second and third entries as potentially better replacement candidates, since the first entry starts as the best replacement candidate.
I'm against merging this, but let's let a majority vote amongst maintainers decide. @zamar, @glinscott: what's your take on this? Reasons I don't like it:
I'm also against merging this.
On Fri, Jun 20, 2014 at 8:07 AM, lucasart notifications@github.com wrote:
OK. Case closed. 2/3 is a majority vote. That being said, some valid points have been raised.
On Fri, Jun 20, 2014 at 11:42 AM, lucasart notifications@github.com wrote:
I don't think Ron was spamming the framework with these tests. He was trying different variations on the theme to find the best way to do it. I think that should be encouraged, rather than be called spamming. Yes, the idea is a fairly big change from the way the hash table is done today, but I really don't see how it's conflating different ideas. If you change the hash table structure in the way he did, you pretty much have to make most of the changes in this patch.

The one big concern I have with the patch is reducing the key from 32 bits to 16 bits. This will lead to many more hash collisions. That being said, when he ran tests with hash pressure, it seemed to be fine. Each bucket has 3 elements, so there is a 3/65536 chance of having a collision when full. If the ttable is overwritten 2-3 times per search, this is actually a fairly small chance of a collision that will have any meaningful effect.

The restriction on depth is the other concern; I haven't had time to analyze that. But otherwise, I think there is more good in this patch than bad. It's also really quite a small change. He also rewrote his original version to stay more within the coding guidelines.
Since one of us disagrees, it's probably best that I take a longer look.
I'm not convinced that 2-3 Elo points would be worth this... But I'll

On Fri, Jun 20, 2014 at 3:57 PM, Gary Linscott notifications@github.com
@zamar: regarding hash, it was demonstrated on fishtest that 16 vs 128 is equal at STC, so the hash to use is at most 16 for STC. Beyond that it's just useless, even harmful in fact (32 vs 128 was a measurable gain). So really it should be, at most, 16 for STC and 64 for LTC. And even less would be better, because it puts a bit more hash pressure, which simulates real-life conditions a little more closely. On that I fully agree with Ron, and disagree with Marco (and you, apparently).

@glinscott: what happens to a QS depth of -7 in Ron's patch? Well, it rolls up modulo 256 to some 120 or so ply, meaning that in endgame positions where you reach the max depth some funny stuff can happen, with deep QS entries being used near the root...

I cast my vote and won't change it. Up to you guys to decide. Whatever you choose I will be happy and won't complain, promised :)
@lucasart
I've queued up a test that approximates TCEC conditions: http://tests.stockfishchess.org/tests/view/53a48b7f0ebc5975754cb287

TCEC conditions are a 16GB hash table with 16 threads, so simplify down to 1GB/thread. Then, with an average searching time of 3 hours, 1GB / (3 * 60) ~= 5MB. I used 4MB, since we use powers of two anyway.

One other interesting test might be testing with an increment, since more time is spent in the endgame, where the hash table is more critical. I will queue that up at low priority.
There has been some concern about type-2 errors during probes with only a 16-bit intra-cluster key. Therefore, I instrumented the

Under low pressure (<100% full), there are no more than about 5 type-2 errors per million probes. As the pressure increases, the error rate starts a downward curve, following the function:

Even under tournament-like pressure (~300% full), there are only about 13 type-2 errors per million probes. Also,
@Britvich, thanks for the analysis! The tests are looking extremely promising. It appears that collisions are not really an issue any more for these TC/hash conditions, which are fairly stressful for the hash table. One thing that could be interesting to consider is where to steal some extra bits to increase the key length. There are a few possibilities I can see:
If we are at 13 type-2 errors per million probes at 300% full, then if we add just a few more bits, we will be down to 0-1 errors per million probes, which would greatly increase my confidence. I'd be in favor of option number 2, even though it would add a bit more code complexity.
```cpp
if (!tte->key32 || tte->key32 == key32) // Empty or overwrite old
```

```cpp
// Empty entry?
if (!tte->key)
```
I think we should remove this `if` block. `tte->key` will be 0 a lot of the time for valid keys as well, so we could replace an entry with high depth here.
But maybe this is worth a test against itself, as it will be difficult to measure the effect against normal SF.
Yes, I agree. Good point.
You are correct: `store` will always overwrite a 0-key, even if it's not the best candidate for replacement. Even though this 0-key check allows the `TT` to fill faster (as it breaks immediately upon finding an "empty" entry), that probably doesn't buy much under tournament-like conditions with a full `TT`.
I was wrong about the QS depth problem: I overlooked the fact that QS stores only 0 and -1 ply entries. So the depth is not an issue. It introduces some rigidity, though:
The 16-bit key still seems like heresy, but OTOH no test so far has shown that it causes problems. Maybe we can live with it, and if one day it causes issues, we can either:
Just one thing, though. The loop unrolling is ugly, and I think we should test without it, to compare apples to apples. If the gain of the patch is partly driven by that, we can just do the loop unrolling without messing up the `TT` structure.
```cpp
return NULL;

if (tte->key != key16 && (++tte)->key != key16 && (++tte)->key != key16)
```
This code leads to undefined behaviour. Never use several `++` operators on the same variable within an expression. Simply use `tte[1]` and `tte[2]` instead.
@lucasart, that is incorrect. It doesn't lead to "undefined behavior". According to the ISO/IEC 14882 C++ Standard, §5.14 [Logical AND operator]:

> The `&&` operator groups left-to-right. The operands are both implicitly converted to type `bool` (clause 4). The result is `true` if both operands are `true` and `false` otherwise. Unlike `&`, `&&` guarantees left-to-right evaluation: the second operand is not evaluated if the first operand is `false`.
Also, I want `tte` pointing to the matching entry when/if the statement gets short-circuited. Using `tte[1]` and `tte[2]` doesn't accomplish that.
Ah yes, you're right. The lazy boolean evaluation is guaranteed by the standard, so no compiler can screw up that code.
@glinscott, Good ideas...
I like stealing more bits, so that we can double the number of hash entries. If we could squish down each entry to 8 bytes, it would probably add a lot more Elo than if we used those extra bits to reduce the already rare type-2 errors that don't really cost any Elo. Of course, the trade-off is more packing/unpacking vs speed. I'm amazed how sensitive
I like and agree.
Yes, it's relatively easy to take advantage of those bytes. If each entry were 8 bytes, and there were 4 entries/cluster, then 4 bits would become available for each entry.
Yes, something along the lines of a 10-bit minifloat would work. That would make available 6 more bits twice. Each of the 4 8-byte entries per cluster would look like:
This would take a little work, but my estimate is that it would improve Elo more than EGTBs do, with about 2,800 fewer lines of code and a lot less disk space. 😄
Are you saying:
If so, 😟 the main Elo gain is coming from the increase in the number of entries (1.5x) for a given hash size. That requires "messing up the `TT` structure". If that's not what you're saying, then never mind. 😌
Today I spent around 1 hour understanding and analyzing this patch, and I've made up my mind... The potential Elo increase for this patch is big enough that it outweighs the potential problems that will arise from key conflicts. So in principle, I think that this is the way to go... However, the patch in its current form is not committable:
Until these issues are addressed, I'm against committing this patch. One of the reasons that Marco got tired was that people were writing low-quality patches (not following existing coding standards), and after those patches passed in the testing framework, they expected Marco to clean up the mess. This must stop. After a patch has passed in the testing framework, it's the responsibility of the patch author to clean it up, so that it's actually committable. Of course, if the patch author refuses to clean up his mess, anyone can step in instead and do the work for him. Even I might do this at some point in the future, but not during this weekend...
@zamar, before you spent time "understanding and analyzing this patch", I thought you had already made up your mind...
It negatively affects my reputation when you make false claims such as:
The changes I made aren't just "speed optimization attempts"; they were the result of extensive testing, instrumentation, examination of generated machine code, and performance profiling and analysis using Visual Studio Ultimate. Your opinion may be that they "uglify" things, but to say they have "no proven benefit" is just false. Take for example this code in `store`:

```cpp
if (  (tte->gen() == generation || tte->bound() == BOUND_EXACT)
    - (replace->gen() == generation)
    - (tte->depth8 < replace->depth8) < 0)
```

That statement is actually pretty expensive (and a hotspot, as shown by performance profiling). It is also executed unnecessarily against the first entry of each cluster, where the code performs no function except to waste cycles. Just on face value, most programmers would see that moving that code (or "uglifying" it, to use your term) so that it only considers the second and third entries in a cluster, as better candidates for replacement, is a "proven benefit" (at least that's what my "spamming [of] the test queue" has shown).
I only removed one comment: before the definition of the
Ron (the "test queue spamming, not listening, code uglifying, in general, making a mess") Britvich |
Hi Ron,

Each change to the source code should be verified individually. It might be that in the end the code ends up looking just as it is written in your patch. If that condition is really a speed-critical spot, then we should be able to verify the speed change individually. If we gain >= 1-2% speed by introducing the change, then we are going to do it. If not, then we are not going to do it...

That's why it's important that you don't post huge patches: it's very difficult for others to verify the changes. Instead, split the patches into smaller ones. Then you can say: patch A improves speed by 2%; patch B is a functional change which gains 4 Elo points, proven by these tests. Then we verify the results and commit.

So many times people have posted here 50-line patches and asked us to commit them. And in the end, they were reduced to 5 lines or less...
Hi @zamar, You know what "should be verified individually" first? Statements you make about people. Once you have those straight, perhaps looking at the code and test results too before declaring judgement on someone's 🐟 contributions:
...might gain you some respect points ✨ as a "moderator". |
`tt_sardines` 🐟 supports 50% more entries for a given sized transposition table (`TT`). Under tournament-style hash pressure, Elo is increased (+7.68 ± 2) as fewer time-consuming position evaluations are needed: `tt_sardines` passed both STC (with 4MB hash) and LTC (with 16MB hash) easily. Also, at STC with no hash pressure (with 128MB hash), the `TT` performs at least as well as it did before.

The old `TT` used 64-byte clusters, with each cluster containing 4 14-byte entries. `tt_sardines` uses 32-byte clusters, with each cluster containing 3 10-byte entries.

To reduce each entry from 14 to 10 bytes, some fields in the `TTEntry` structure have been condensed:

* The key field (32 bits) is now just 16 bits.
* The generation (8 bits) and bound (8 bits) fields have been combined into a single 8-bit field, with 6 bits used for the generation and 2 bits for the bound.
* The depth field (16 bits) is now just 8 bits, with the depth stored as an offset from the minimum depth `DEPTH_NONE` (which has been adjusted from -254 to -12). In this way, all depths can be stored as positive `uint8` values.

Since only a partial key is stored with each entry, the same entry used to store information about one position might be used to store information about a different position. This produces occasional type-2 errors during probes, where a probe returns information for a wrong position. These rare, but possible, errant probes are automatically detected with the same logic that detects corrupted entries generated during unsafe threaded access to the `TT`.

To improve speed, several optimizations have been made:

* The number of entries in each cluster has been reduced from 4 to 3.
* `first_entry` uses only 2 machine instructions and no longer uses a multiply.
* The `probe` loop has been unrolled.
* `store` writes the key only when necessary. It also evaluates only the second and third entries as potentially better replacement candidates, since the first entry starts as the best replacement candidate.

Note: No `#pragma` directives or other modern C++ constructs have been used. The `TT` contains the same entry fields and data (such as `generation`) that existed before, but in condensed form. Probes and stores are processed as they were before, just a little faster.