
Packed like sardines TT #3

Closed
wants to merge 1 commit

Conversation

@Britvich

tt_sardines 🐟 supports 50% more entries for a given-size transposition table (TT). Under tournament-style hash pressure, Elo is increased (+7.68 ± 2) as fewer time-consuming position evaluations are needed:

[image: fishtest results under hash pressure]

tt_sardines passed both STC (with 4MB hash) and LTC (with 16MB hash) easily. Also, at STC with no hash pressure (with 128MB hash), the TT performs at least as well as it did before:

[image: STC results with 128MB hash]

The old TT used 64-byte clusters, each containing four 14-byte entries. tt_sardines uses 32-byte clusters, each containing three 10-byte entries.

To reduce each entry from 14 to 10 bytes, some fields in the TTEntry structure have been condensed (a rough sketch follows the list):

  • The key field (32 bits) is now just 16 bits.
  • The generation (8 bits) and bound (8 bits) fields have been combined into a single 8-bit field, with 6 bits for the generation and 2 bits for the bound.
  • The depth field (16 bits) is now just 8 bits, with the depth stored as an offset from the minimum depth DEPTH_NONE (which has been adjusted from -254 to -12). In this way, all depths can be stored as non-negative uint8 values.
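
A minimal sketch of how the condensed entry could look (field and helper names are illustrative, not the exact patch code):

```cpp
#include <cstdint>

enum Bound { BOUND_NONE, BOUND_UPPER, BOUND_LOWER, BOUND_EXACT };
const int DEPTH_NONE = -12;

// 2 + 2 + 2 + 2 + 1 + 1 = 10 bytes; three of these fill a 32-byte cluster
// with 2 bytes of padding left over.
struct TTEntry {
    uint16_t key16;     // top 16 bits of the 64-bit Zobrist key
    uint16_t move16;    // best move
    int16_t  value16;   // search value
    int16_t  eval16;    // static evaluation
    uint8_t  genBound8; // generation in the high 6 bits, bound in the low 2
    uint8_t  depth8;    // depth as a non-negative offset from DEPTH_NONE

    Bound bound() const { return Bound(genBound8 & 0x3); }
    int   gen()   const { return genBound8 & 0xFC; }
    int   depth() const { return int(depth8) + DEPTH_NONE; }
};
```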

Since only a partial key is stored with each entry, the same entry used to store information about one position might be reused for a different position. This produces occasional type-2 errors during probes, where a probe returns information for the wrong position. These rare but possible errant probes are detected automatically by the same logic that detects corrupted entries generated during unsafe threaded access to the TT.
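
That safety net is, in essence, the screening a TT move already gets before being trusted. A rough sketch of the idea (type and function names assumed, not the literal engine code):

```cpp
#include <cstdint>

using Move = uint16_t;
const Move MOVE_NONE = 0;

struct Position {
    bool pseudo_legal(Move m) const; // the engine's pseudo-legality screen
};

// A type-2 hit (or a corrupted entry) almost always yields a move that is
// not pseudo-legal in the current position, so it is discarded before use;
// the same screen therefore catches both failure modes.
Move screened_tt_move(const Position& pos, Move ttMove) {
    return (ttMove != MOVE_NONE && pos.pseudo_legal(ttMove)) ? ttMove : MOVE_NONE;
}
```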

To improve speed, several optimizations have been made (a sketch of the new first_entry follows the list):

  • The number of entries in each cluster has been reduced from 4 to 3.
  • first_entry only uses 2 machine instructions and no longer uses a multiply.
  • The probe loop has been unrolled.
  • store writes the key only when necessary. It also evaluates only the second and third entries as potentially better replacement candidates, since the first entry starts as the best replacement candidate.
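
A rough sketch (assumed names) of a multiply-free first_entry: with the cluster count kept a power of two, the cluster index is a single mask of the low key bits, so the scaled address comes out in a couple of instructions.

```cpp
#include <cstdint>

// TTEntry as sketched above (10 bytes).
struct TTEntry { uint16_t key16; uint16_t move16; int16_t value16;
                 int16_t eval16; uint8_t genBound8; uint8_t depth8; };

struct Cluster { TTEntry entry[3]; uint8_t padding[2]; }; // 32 bytes

struct TranspositionTable {
    Cluster* table;       // clusterCount clusters, clusterCount a power of two
    uint32_t clusterMask; // clusterCount - 1

    TTEntry* first_entry(uint64_t key) const {
        // Mask replaces the old multiply/modulo index computation.
        return &table[uint32_t(key) & clusterMask].entry[0];
    }
};
```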

Note: No #pragma directives or other modern C++ constructs have been used. The TT contains the same entry fields and data (such as generation) that existed before, but in condensed form. Probes and stores are processed as they were before, just a little faster.

@lucasart

I'm against merging this, but let's let a majority vote among the maintainers decide. @zamar, @glinscott: what's your take on this?

Reasons I don't like it:

  • it introduces some ugly hacks for the sake of speed optimization, put there in an attempt to make the whole lot pass.
  • depths are assumed to never be lower than -6 plies, yet nothing in the qsearch imposes a depth limit (only a ply limit, and recaptures past a certain depth).
  • the idea of using 16 bits, having loads of collisions, and saying we don't care and hoping the safety net of legal-move detection catches them, is something I personally don't like.
  • there have been far too many tests on this subject. The patch author ignored Marco's comments and pursued his TT-packing obsession, spending endless framework resources on it. If we cave in and accept this just to not piss him off, we send the wrong signal, and we will have other people doing the same thing (i.e. Uri-style trolling the framework endlessly and whining until their stuff gets committed). It's important that we don't cave, and make it clear that this tactic will not work.

@zamar commented Jun 20, 2014

I'm also against merging this.

  • Marco was against this (for a good reason): the patch is so ugly.
  • The patch author is ignoring all the feedback and spamming the test queue, and we don't want to encourage such behaviour.


@lucasart

OK. Case closed. 2/3 is majority vote.

That being said, some valid points have been raised:

  1. Testing with 128MB hash is stupid. We already know that 16 is enough for STC (no measurable Elo diff). At the very least, we should be using 16 for STC and 64 for LTC.
  2. The current TT implementation is improvable. Perhaps we should try something simple and not so ugly: just keep the TT entries and replacement logic as they are, but use 2 entries per cluster (32-byte clusters, 2 per cache line). Has this been tried?

lucasart closed this Jun 20, 2014
@zamar commented Jun 20, 2014

  1. It's difficult to decide which hash size to test. For a long time, I believed that "bigger was better", but this doesn't seem to be the case at all. We should select something that suits all time controls, and I think 128MB is quite a good choice. Of course, if you're testing improvements for the TT, it's fine to use a smaller hash size...
  2. Sure, post a patch :-)


@glinscott

I don't think Ron was spamming the framework with these tests. He was trying different variations on the theme to find the best way to do it. I think that should be encouraged, rather than called spamming.

Yes, the idea is a fairly big change from the way the hash table is done today, but I really don't see how it's conflating different ideas. If you change the hash-table structure in the way he did, you pretty much have to make most of the changes in this patch.

The one big concern I have with the patch is reducing the key from 32 bits to 16 bits. This will lead to many more hash collisions. That being said, when he ran tests with hash pressure, it seemed to be fine. Each bucket has 3 elements, so a 3/65536 chance of having a collision when full. If the ttable is overwritten 2-3 times per search, this is actually a fairly small chance of a collision that will have any meaningful effect.
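
(For scale, a back-of-envelope reading of that figure: 3/65536 ≈ 4.6 × 10⁻⁵, i.e. roughly 46 spurious 16-bit matches per million probes of positions absent from a completely full table. That is an upper bound, since many real probes hit genuine entries.)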

The restriction on depth is the other concern; I haven't had time to analyze that.

But otherwise, I think there is more good in this patch than bad. It's also really quite a small change, and he rewrote his original version to stay more within the coding guidelines.

@zamar commented Jun 20, 2014

Since one of us disagrees, it's probably best that I take a longer look at this patch over the weekend...

  • Just 8 bits for depth is quite a severe restriction.
  • More hash collisions can cause strange non-reproducible bugs.

I'm not convinced that 2-3 Elo points would be worth this... But I'll reconsider my opinion.


lucasart reopened this Jun 20, 2014
@lucasart

@zamar: regarding hash, it was demonstrated on fishtest that 16 vs 128 is equal at STC, so the hash to use is at most 16 for STC. Beyond that it's just useless, even harmful in fact (32 vs 128 was a measurable gain). So really it should be, at most, 16 for STC and 64 for LTC. And even less would be better, because it puts a bit more hash pressure, which simulates real-life conditions a little more closely. On that I fully agree with Ron, and disagree with Marco (and you, apparently).

@glinscott: what happens to a QS depth of -7 in Ron's patch? Well, it wraps around modulo 256 to some 120 or so plies, meaning that in endgame positions where you reach the max depth, some funny stuff can happen, with deep QS entries being used near the root...

I cast my vote and won't change it. Up to you guys to decide. Whatever you choose, I will be happy and won't complain, promised :)

@sf-x commented Jun 20, 2014

@lucasart
https://github.com/official-stockfish/Stockfish/blob/master/src/search.cpp#L1069
so deep QS is not a problem. I was hoping that new maintainers would actually check their claims.

Regards,
X

@glinscott

I've queued up a test that approximates TCEC conditions: http://tests.stockfishchess.org/tests/view/53a48b7f0ebc5975754cb287

TCEC conditions are a 16GB hash table with 16 threads, which simplifies down to 1GB/thread. The average searching time is then 3 hours, so 1GB / (3 * 60) ≈ 5MB; I used 4 since we use powers of two anyway.

One other interesting test might be using an increment, since more time is spent in the endgame, where the hash table is more critical. I will queue that up at low priority.

@Britvich

There has been some concern about type-2 errors during probes with only a 16-bit intra-cluster key. Therefore, I instrumented the TT to store the full 64-bit Zobrist key and measured the false-probe rate under various levels of hash pressure. Here's the number of type-2 errors per million probes vs. the percentage of hash fullness (pressure = nodes searched / entries in hash table):

[image: type-2 errors per million probes vs. hash fullness]

Under low pressure (<100% full), there are no more than about 5 type-2 errors per million probes. As the pressure increases, the error rate climbs along a flattening curve, following the function:

[image: fitted error-rate function]

Even under tournament-like pressure (~300% full), there are only about 13 type-2 errors per million probes.

Also, TT.store() is never called with a depth less than DEPTH_NONE (-12).
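
A rough sketch of that kind of instrumentation (hypothetical names; a debug-build shadow array beside the real TT):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Counters for probes whose 16-bit key matched but whose full key did not.
struct TTStats {
    uint64_t probes = 0;
    uint64_t type2  = 0;
    double per_million() const {
        return probes ? 1e6 * double(type2) / double(probes) : 0.0;
    }
};

TTStats g_stats;
std::vector<uint64_t> g_shadowKeys; // full 64-bit key per TT slot (debug only)

// Called whenever probe() finds a 16-bit key match at slot idx.
void record_probe(std::size_t idx, uint64_t key64) {
    ++g_stats.probes;
    if (g_shadowKeys[idx] != key64)
        ++g_stats.type2; // a type-2 (false-positive) probe
}
```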

@glinscott

@Britvich, thanks for the analysis! The tests are looking extremely promising. It appears that collisions are not really an issue any more for these TC/hash conditions, which are fairly stressful for the hash table.

One thing that could be interesting to consider is where to steal some extra bits to increase the key length.

There are a few possibilities I can see:

  1. Shrinking the generation: we probably only need 2-3 bits for generation, as after 4-8 moves most of the old TT entries are probably long overwritten. Currently generation has 6 bits, so if we got down to 2 bits, we'd gain 4 bits for the key.
  2. Using the 2 bytes of padding per cluster. That gives 5 extra bits per entry (they are a pain to access, though...). But those 5 bits would give us 32x fewer collisions! It's a tricky tradeoff.
  3. Something a little more radical, like shrinking the evaluation down from 16 bits to 8-10 bits. This would be done with a lookup table along some sort of curve, so evaluation fidelity stays high near zero and drops for big evaluation values, where it matters less. I don't think this is worth it, though.

If we are at 13 type-2 errors at 300% full, then adding just a few more bits would bring us down to 0-1 errors per million probes, which would greatly increase my confidence.

I'd be in favor of option number 2, even though it would add a bit more code complexity; a sketch of the idea follows.
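
A speculative sketch of option 2 (layout and names assumed): the cluster's 2 padding bytes hold 5 extra key bits per entry (3 × 5 = 15 of the 16 bits), stretching the effective key from 16 to 21 bits:

```cpp
#include <cstdint>

struct Cluster {
    // ... the three 10-byte entries would sit here ...
    uint16_t extraKeyBits; // reused padding: bits 0-4 entry 0, 5-9 entry 1, 10-14 entry 2

    uint8_t extra(int i) const {
        return (extraKeyBits >> (5 * i)) & 0x1F;
    }
    void set_extra(int i, uint8_t v) {
        extraKeyBits = uint16_t((extraKeyBits & ~(0x1F << (5 * i)))
                              | ((v & 0x1F) << (5 * i)));
    }
};
```

A probe would then require both the 16-bit key and the 5 extra bits to match, cutting false matches by a further factor of 2^5 = 32.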

```diff
- if (!tte->key32 || tte->key32 == key32) // Empty or overwrite old
+ // Empty entry?
+ if (!tte->key)
```
@glinscott

I think we should remove this if block. tte->key will be 0 a lot of the time for valid keys as well, so we could replace an entry with high depth here.

But maybe this is worth a test against itself, as it will be difficult to measure the effect against normal SF.

@Britvich

Yes, I agree. Good point.

You are correct, store will always overwrite a 0-key, even if it's not the best candidate for replacement. Even though this 0-key check allows the TT to fill faster (as it breaks immediately upon finding an "empty" entry), that probably doesn't buy much under tournament-like conditions with a full TT.

@lucasart

I was wrong about the QS depth problem, and overlooked the fact that QS in fact stores only 0- and -1-ply entries. So the depth is not an issue. It introduces some rigidity, though:

  • we can't remove the QS depth trick
  • we can't increase max ply

I think these limitations are not a problem. We'll need an assert to enforce that depth is within bounds, in case someone screws that up (by changing the TT store depth in QS, or increasing max ply); a minimal form is sketched below.
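
A minimal form of that guard might be (names assumed):

```cpp
#include <cassert>

const int DEPTH_NONE = -12; // minimum storable depth after this patch

void check_store_depth(int d) {
    // Depth must fit the 8-bit window [DEPTH_NONE, DEPTH_NONE + 255]; this
    // fires if someone deepens QS stores or raises max ply past the window.
    assert(d >= DEPTH_NONE && d - DEPTH_NONE <= 255);
}
```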

The 16 bits still seem like a heresy, but OTOH no test so far has shown that it causes problems. Maybe we can live with it, and if one day it causes issues, we can either:

  • revert back to the previous TT implementation, or
  • use 5 extra bits per entry from the 2 padding bytes, as suggested by Gary. It might be a bit ugly code-wise, but shouldn't be a problem performance-wise, since we have prefetched the cache line containing the 32-byte cluster.

Just one thing, though. The loop unrolling is ugly, and I think we should test without it, to compare apples to apples. If the gain of the patch is partly driven by that, we can just do the loop unrolling without messing up the TT structure.


```cpp
if (tte->key != key16 && (++tte)->key != key16 && (++tte)->key != key16)
    return NULL;
```

@lucasart

This code leads to undefined behaviour. Never use several ++ operators on the same variable within an expression. Simply use tte[1] and tte[2] instead.

@Britvich

@lucasart, that is incorrect. It doesn't lead to "undefined behavior". According to the ISO/IEC C++ standard 14882, §5.14 [Logical AND operator]:

> The && operator groups left-to-right. The operands are both implicitly converted to type bool (clause 4). The result is true if both operands are true and false otherwise. Unlike &, && guarantees left-to-right evaluation: the second operand is not evaluated if the first operand is false.

Also, I want tte pointing to the matching entry when/if the statement gets short-circuited. Using tte[1] and tte[2] doesn't accomplish that.

@lucasart

Ah yes, you're right. The lazy boolean evaluation is guaranteed by the standard, so no compiler can screw up that code.

@Britvich

@glinscott, good ideas...

> One thing that could be interesting to consider is where to steal some extra bits to increase the key length.

I like stealing more bits so that we can double the number of hash entries. If we could squish each entry down to 8 bytes, it would probably add a lot more Elo than using those extra bits to reduce the already-rare type-2 errors, which don't really cost any Elo.

Of course, the trade-off is more packing/unpacking versus speed. I'm amazed how sensitive probe is; it's an ultra-hotspot. It's also part of the reason I started working on the TT first.

> Shrinking the generation: we probably only need 2-3 bits for generation, as after 4-8 moves most of the old TT entries are probably long overwritten. Currently generation has 6 bits, so if we got down to 2 bits, we'd gain 4 bits for the key.

I like and agree.

> Using the 2 bytes of padding per cluster. That gives 5 extra bits per entry (they are a pain to access, though...). But those 5 bits would give us 32x fewer collisions! It's a tricky tradeoff.

Yes, it's relatively easy to take advantage of those bytes. If each entry were 8 bytes and there were 4 entries/cluster, then 4 bits would become available for each entry.

> Something a little more radical, like shrinking the evaluation down from 16 bits to 8-10 bits. This would be done with a lookup table along some sort of curve, so evaluation fidelity stays high near zero and drops for big evaluation values, where it matters less. I don't think this is worth it, though.

Yes, something along the lines of a 10-bit minifloat would work. That would make 6 more bits available, twice (once each for the search value and the static eval).

Each of the 4 8-byte entries per cluster would look like:

```
field: key  move  eval  static eval  gen  bnd  depth
bits:  16   16    10    10           2    2    8
```

This would take a little work, but my estimate is that it would improve Elo more than EGTBs do, with about 2,800 fewer lines of code and a lot less disk space. 😄
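
A speculative sketch of that 8-byte layout (bitfields for illustration; a real version would likely pack manually for portability, and the two 10-bit fields would carry the minifloat-coded values):

```cpp
#include <cstdint>

struct PackedEntry {       // 16 + 16 + 10 + 10 + 2 + 2 + 8 = 64 bits
    uint64_t key   : 16;
    uint64_t move  : 16;
    uint64_t eval  : 10;   // minifloat-coded search value
    uint64_t seval : 10;   // minifloat-coded static eval
    uint64_t gen   : 2;
    uint64_t bound : 2;
    uint64_t depth : 8;    // offset from DEPTH_NONE, as before
};
static_assert(sizeof(PackedEntry) == 8, "entry must stay one machine word");
```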

@Britvich

@lucasart,

> The loop unrolling is ugly, and I think we should test without it, to compare apples to apples. If the gain of the patch is partly driven by that, we can just do the loop unrolling without messing up the TT structure.

Are you saying:

> If the loop optimization is what is causing most of the Elo gain, then we don't need to be "messing up the TT structure"?

If so, 😟 the main Elo gain is coming from the increase in the number of entries (1.5x) for a given hash size. That requires "messing up the TTEntry structure". The optimizations are there to try to keep the speed up when there's no hash pressure (as the packing/unpacking code is slightly slower). Do you want to remove the optimizations?

If that's not what you're saying, then, never mind. 😌

@zamar commented Jun 22, 2014

Today I spent around 1 hour understanding and analyzing this patch and I've made up my mind...

The potential Elo increase for this patch is big enough that it outweighs the potential problems that will arise from key conflicts. So in principle, I think that this is the way to go...

However, the patch in its current form is not committable:

  • It's again a mix of actual changes + speed-optimization attempts which badly uglify the code for no proven benefit (the TranspositionTable::store() function especially).
  • Instead of updating the existing documentation/comments in the code, the patch author has simply removed them.

Until these issues are addressed, I'm against committing this patch.

One of the reasons that Marco got tired was that people were writing low-quality patches (not following the existing coding standards), and after those patches passed in the testing framework, they expected Marco to clean up the mess. This must stop. After a patch has passed in the testing framework, it's the responsibility of the patch author to clean it up so that it's actually committable.

Of course, if patch author refuses to clean up his mess, anyone can step in instead and do the work for him. Even I might do this at some point in the future, but not during this weekend...

@Britvich

@zamar, before you spent time "understanding and analyzing this patch", I thought you had already made up your mind...

> I'm also against merging this.
>
> • Marco was against this (for a good reason): the patch is so ugly.
> • The patch author is ignoring all the feedback and spamming the test queue, and we don't want to encourage such behaviour.

It negatively affects my reputation when you make false claims such as:

  1. "Patch author [that's me, my name is Ron Britvich] is ignoring all the feedback"
    Please see (if you haven't already) Packed like sardines TT mcostalba/Stockfish#243 where I go through Marco's comments, suggestions, and cautions, one-by-one, showing how I listened and made the changes he requested.
  2. "Patch author [me again] is spamming the test queue"
    This is a significant change, as can seen by all the tests being run on tt_sardines in the last few days (by the way, I haven't run a single fishtest in almost a week). During my testing, I experimented with multiple ways to compress the TT and several new replacement algorithms. While I may have run some ill-advised tests early on (while learning the ropes), I never deliberately attempted to abuse the fishtest framework.

> However, the patch in its current form is not committable:
>
> It's again a mix of actual changes + speed-optimization attempts which badly uglify the code for no proven benefit (the TranspositionTable::store() function especially).

The changes I made aren't just "speed optimization attempts"; they were the result of extensive testing, instrumentation, examination of the generated machine code, and performance profiling and analysis using Visual Studio Ultimate. Your opinion may be that they "uglify" things, but to say they have "no proven benefit" is just false. Take, for example, the code in store that implements the replacement strategy:

```cpp
if (  (tte->gen() == generation || tte->bound() == BOUND_EXACT)
    - (replace->gen() == generation)
    - (tte->depth8 < replace->depth8) < 0)
```

That statement is actually pretty expensive (and a hotspot, as shown by performance profiling). It is also executed unnecessarily against the first entry of each cluster, where it performs no function except to waste cycles. Just on face value, most programmers can see that moving that code (or "uglifying" it, to use your term) so that it only considers the second and third entries of a cluster as better replacement candidates is a "proven benefit" (at least that's what my "spamming [of] the test queue" has shown); a sketch of the reorganized scan follows.
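
A rough sketch of that reorganization, reusing the TTEntry/Bound sketch from near the top of the thread (not the literal patch code):

```cpp
// (Uses TTEntry, Bound, and BOUND_EXACT as sketched earlier.)
// The first entry starts as the replacement candidate, so the expensive
// scoring expression runs only for the second and third entries.
TTEntry* pick_replacement(TTEntry* tte /* cluster's first entry */, int generation) {
    TTEntry* replace = tte;                    // entry 0: initial candidate
    for (TTEntry* e = tte + 1; e < tte + 3; ++e)
        if (  (e->gen() == generation || e->bound() == BOUND_EXACT)
            - (replace->gen() == generation)
            - (e->depth8 < replace->depth8) < 0)
            replace = e;
    return replace;
}
```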

> Instead of updating the existing documentation/comments in the code, the patch author has simply removed them.

I only removed one comment, before the definition of struct TTEntry, as I thought the struct definition itself was self-documenting. I'll gladly put that back in if you wish. No other comments were harmed during the making of the new TT (American Humane Association).

> Of course, if the patch author refuses to clean up his mess, anyone can step in instead and do the work for him

Ron (the "test queue spamming, not listening, code uglifying, in general, making a mess") Britvich

@zamar commented Jun 22, 2014

Hi Ron,

Each change to the source code should be verified individually. It might be that in the end the code ends up looking just as it is written in your patch.

If that condition is really a speed-critical spot, then we should be able to verify the speed change individually. If we gain >= 1-2% speed by introducing the change, then we are going to do it. If not, then we are not going to do it...

That's why it's important that you don't post huge patches: it's very difficult for others to verify the changes. Instead, split the patches into smaller ones. Then you can say: patch A improves speed by 2%; patch B is a functional change which gains 4 Elo points, proven by these tests.

Then we verify the results and commit it. So many times people have posted 50-line patches here and asked us to commit them, and in the end they were reduced to 5 lines or less...

@Britvich

Hi @zamar,

You know what "should be verified individually" first? Statements you make about people. Once you have those straight, perhaps looking at the code and test results too, before passing judgement on someone's 🐟 contributions:

> I'm also against merging this.

> I'm not convinced that 2-3 Elo points would be worth this

...might gain you some respect points ✨ as a "moderator".

glinscott closed this in ccd823a Jun 28, 2014
ajithcj referenced this pull request in ajithcj/Stockfish Jul 21, 2016
vondele referenced this pull request in vondele/Stockfish Jul 26, 2017
Fixes a race when ucinewgame is issued 'after' a search, as seen for a thread-sanitized binary issuing :

```
go depth 5
[wait for bestmove ..]
ucinewgame
```
leading to

```
WARNING: ThreadSanitizer: data race (pid=5148)
  Write of size 4 at 0x7fbf5dcf6af8 by main thread:
    #0 __gnu_cxx::__enable_if<!std::__is_scalar<Move>::__value, void>::__type std::__fill_a<Move*, Move>(Move*, Move*, Move const&) <null> (stockfish+0x000000462884)
    #1 void std::fill<Move*, Move>(Move*, Move*, Move const&) /usr/include/c++/5/bits/stl_algobase.h:747 (stockfish+0x0000004618cc)
    #2 StatBoards<16, 64, Move>::fill(Move const&) src/movepick.h:36 (stockfish+0x00000046084f)
    #3 Search::clear() src/search.cpp:194 (stockfish+0x000000451732)
    #4 newgame src/uci.cpp:145 (stockfish+0x0000004738a0)
    #5 UCI::loop(int, char**) src/uci.cpp:200 (stockfish+0x000000473d7d)
    #6 main src/main.cpp:47 (stockfish+0x000000433322)

  Previous write of size 4 at 0x7fbf5dcf6af8 by thread T1:
    #0 update_stats src/search.cpp:1417 (stockfish+0x000000453ecb)
    #1 search<(<unnamed>::NodeType)0u> src/search.cpp:1114 (stockfish+0x00000045c4c6)
    #2 search<(<unnamed>::NodeType)1u> src/search.cpp:997 (stockfish+0x000000457658)
    #3 search<(<unnamed>::NodeType)1u> src/search.cpp:1019 (stockfish+0x000000457a50)
    #4 search<(<unnamed>::NodeType)1u> src/search.cpp:1019 (stockfish+0x000000457a50)
    #5 Thread::search() src/search.cpp:402 (stockfish+0x000000452e28)
    #6 MainThread::search() src/search.cpp:264 (stockfish+0x000000451c32)
    #7 Thread::idle_loop() src/thread.cpp:114 (stockfish+0x000000468c90)

```

The solution adopted is to split the uci commands into those that can be issued during search (quit, stop, ponderhit, isready, uci) and those that should not run during search (ucinewgame, setoption, ...). Waiting for the search to finish was previously done only for 'go'.

This fixes the above race.

There are two side effects, one possibly negative, depending on how a GUI deals with UCI (or a user at the CLI).
- crashes are avoided (e.g. resizing hash during search).
- as earlier commands can block until the search completes, 'stop' can be ineffective. This behavior is already present in master:

```
go depth 20
go depth 5
stop
```

will wait for depth 20 to be completed before stopping depth 5. This behavior now shows for other commands as well, so some GUI testing of this patch would be useful.

No functional change.
vondele referenced this pull request in vondele/Stockfish Jul 27, 2017
limits.ponder was used as a signal between the uci and search threads, but it is not an atomic variable, leading to the following race, as flagged by a sanitized binary.

Expect input:
```
 spawn  ./stockfish
 send "uci\n"
 expect "uciok"
 send "setoption name Ponder value true\n"
 send "go wtime 4000 btime 4000\n"
 expect "bestmove"
 send "position startpos e2e4 d7d5\n"
 send "go wtime 4000 btime 4000 ponder\n"
 sleep 0.01
 send "ponderhit\n"
 expect "bestmove"
 send "quit\n"
 expect eof
```

Race:
```
WARNING: ThreadSanitizer: data race (pid=7191)
  Read of size 4 at 0x0000005c2260 by thread T1:
    #0 MainThread::check_time() src/search.cpp:1488 (stockfish+0x000000454471)
    #1 search<(<unnamed>::NodeType)0u> src/search.cpp:566 (stockfish+0x0000004594e0)
    #2 search<(<unnamed>::NodeType)0u> src/search.cpp:997 (stockfish+0x00000045bfb6)
    #3 search<(<unnamed>::NodeType)0u> src/search.cpp:1006 (stockfish+0x00000045c1a3)
    #4 search<(<unnamed>::NodeType)1u> src/search.cpp:997 (stockfish+0x000000457658)
    #5 Thread::search() src/search.cpp:402 (stockfish+0x000000452e28)
    #6 MainThread::search() src/search.cpp:264 (stockfish+0x000000451c32)
    #7 Thread::idle_loop() src/thread.cpp:114 (stockfish+0x000000468c90)

  Previous write of size 4 at 0x0000005c2260 by main thread:
    #0 UCI::loop(int, char**) src/uci.cpp:193 (stockfish+0x000000473c9b)
    #1 main src/main.cpp:47 (stockfish+0x000000433322)

  Location is global 'Search::Limits' of size 88 at 0x0000005c2220 (stockfish+0x0000005c2260)
```

The fix is to add an atomic bool to the threads structure to signal the ponder status, letting Search::Limits reflect just what was passed to 'go'.

No functional change.
vondele referenced this pull request in vondele/Stockfish Aug 2, 2017
@FauziAkram FauziAkram mentioned this pull request Jun 4, 2019
protonspring referenced this pull request in protonspring/Stockfish Jul 23, 2019
@mattginsberg mattginsberg mentioned this pull request Jan 20, 2021
@gvreuls gvreuls mentioned this pull request Jun 3, 2021
pb00068 added a commit to pb00068/Stockfish that referenced this pull request Mar 23, 2023
…lude Bishops/Knights

Hit #0: Total 18031 Hits 335 Hit Rate (%) 1.85791  King
Hit #1: Total 18031 Hits 4616 Hit Rate (%) 25.6004  Queen
Hit #2: Total 18031 Hits 3916 Hit Rate (%) 21.7182  Rook
Hit #3: Total 18031 Hits 2766 Hit Rate (%) 15.3402  Bishop
Hit #4: Total 18031 Hits 6398 Hit Rate (%) 35.4833  Knight

bench: 4868464
pb00067 referenced this pull request in pb00068/Stockfish Oct 24, 2023