Introduce anti-suicide feature #2666
In some recent tournament games, Stockfish exhibited the following self-destructive behaviour. Stockfish was suffering in a long shuffling session, holding a bad evaluation in a blocked or semi-blocked position for about 40 moves, and yet the eval was more or less flatlined, indicating that the opponent engine (Leela) had trouble converting the position. Then, shortly before the 50-move draw rule would be reached, the opponent would play its pieces to some strange places and SF would push a pawn, thinking she would get a slightly "less worse" evaluation. However, the slightly less worse evaluation would prove to be delusional: the position with a sacrificed pawn turned out to be crackable, and SF eventually lost these games.
This issue was discussed in the following thread: official-stockfish/Stockfish#2620
This commit is our best attempt to patch this issue, so that SF gets more patient in worse positions and tries to play for the 50-move rule as much as possible rather than commit suicide. The implementation uses pure evaluation methods rather than search, damping down the eval after 25 moves of shuffling (the damping factor is linear, starting from 1.0 after 25 shuffling moves and reaching 0.04 after 50 moves of shuffling). This damping puts the burden on the attacking player to prove that he can break the fortress, as the search now gets more and more optimistic about the defending player reaching a draw by the 50-move rule.
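The linear damping described above can be illustrated with a small standalone sketch. This is not the code from the patch: the names `damping_factor`, `damped_eval`, `ShuffleStart`, `ShuffleEnd` and `MinFactor` are hypothetical, the shuffle counter is assumed to be expressed in full moves, and the sketch simply interpolates between the two endpoints quoted in the description (1.0 at 25 moves, 0.04 at 50 moves).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical constants, taken from the PR description rather than from the patch source.
constexpr int    ShuffleStart = 25;    // moves of shuffling before damping begins
constexpr int    ShuffleEnd   = 50;    // horizon of the 50-move rule
constexpr double MinFactor    = 0.04;  // factor reached at ShuffleEnd

// Linear damping factor: 1.0 up to ShuffleStart, falling to MinFactor at ShuffleEnd.
double damping_factor(int shuffleMoves) {
    assert(shuffleMoves >= 0);
    int m = std::max(ShuffleStart, std::min(shuffleMoves, ShuffleEnd));
    double t = double(m - ShuffleStart) / (ShuffleEnd - ShuffleStart);
    return 1.0 + t * (MinFactor - 1.0);
}

// Apply the factor to a static evaluation (in centipawn-like units, an assumption).
int damped_eval(int eval, int shuffleMoves) {
    return int(std::lround(eval * damping_factor(shuffleMoves)));
}
```

Under these assumptions, an evaluation of -200 stays at -200 after 25 moves of shuffling, becomes -85 after 40 moves, and shrinks to -8 after 50 moves, so a defender who keeps shuffling sees its evaluation drift towards a draw instead of being tempted by a pawn push that resets the counter.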
This solution seems to work as intended for the few cases extracted from tournament losses, according to tests done by @vondele in the following comments:
official-stockfish/Stockfish#2620 (comment)
a66d3c0#commitcomment-38963042
In Fishtest, the best result we managed to get after extensive testing was a double yellow with Elo-gaining bounds (this patch), maybe because the problem is quite rare at the short time controls we use in our tests compared to the longer time controls used in tournament games:
STC:
LLR: -2.97 (-2.94,2.94) {-0.50,1.50}
Total: 201928 W: 38274 L: 38174 D: 125480
Ptnml(0-2): 3452, 23520, 46844, 23772, 3376
https://tests.stockfishchess.org/tests/view/5eb281dd2326444a3b6d3499
LTC:
LLR: -2.94 (-2.94,2.94) {0.25,1.75}
Total: 90232 W: 11446 L: 11353 D: 67433
Ptnml(0-2): 631, 8421, 26967, 8418, 679
https://tests.stockfishchess.org/tests/view/5eb34a862326444a3b6d37ff
Bench: 4834675
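To see why these results count as "yellow", it can help to turn the win/loss/draw totals into a rough Elo estimate. The sketch below uses the standard logistic Elo formula on the raw score; the function name `elo_from_wld` is hypothetical, and this is only a point estimate, not the SPRT computation that Fishtest performs to obtain the LLR.

```cpp
#include <cassert>
#include <cmath>

// Point estimate of the Elo difference from win/loss/draw counts,
// using the logistic model: score = 1 / (1 + 10^(-elo / 400)).
double elo_from_wld(long wins, long losses, long draws) {
    long games = wins + losses + draws;
    assert(games > 0);
    double score = (wins + 0.5 * draws) / games;
    assert(score > 0.0 && score < 1.0);  // the formula diverges at 0% and 100%
    return -400.0 * std::log10(1.0 / score - 1.0);
}
```

Calling `elo_from_wld(38274, 38174, 125480)` for the STC run and `elo_from_wld(11446, 11353, 67433)` for the LTC run gives small positive values in both cases (roughly 0.2 and 0.4 Elo), which fits the description: the patch scored slightly above 50%, yet both LLRs reached the lower bound of -2.94, so neither test passed.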
@snicolet I've done a more extensive analysis on all fens posted in the issue. For each fen, I've done a rather deep multiPV search (200s), as well as 200 short (1s) searches to get a distribution of bestmoves (on 250 threads, 80GB hash). Can you have a look at the result, to help judge if the patch brings improvement beyond the one fen:
I have always considered that the ability to solve blind spots has a special value, which is hard to measure as an Elo gain. Locating, targeting and removing them one by one improves the completeness of the engine, which not only makes chess analysts and CC players happy but is also bound to scale well. I leave it to others to judge whether the degree of mitigation this pull request offers for the problematic subset is worth it. As a principle, I would suggest that any solution to a problematic subset (that does not also introduce another one) be treated as a bug fix. A few lines of code and CPU cycles are a small price to pay for an eventually blunder-free and blind-spot-free SF. The tricky part is measuring and assessing how much help such fixes offer.
@snicolet I'm reluctant to commit the patch in this form, as it adds code, failed the Elo-gainer tests, and shows a clear benefit only on this one specific FEN (AFAIK). However, a variant of this patch (just the initiative term) is actually a simplification with respect to master, and is very simple overall: vondele/Stockfish@66ed8b6...537d51d It tested nicely: it passed LTC and performs essentially equally well on the test FEN 8/p3kp2/Pp2p3/1n2PpP1/5P2/1Kp5/8/R7 b - - 68 143
I propose we merge that one instead. Agree?
I honestly dislike the scale factor as a concept; it basically says "we are failing to evaluate these endgames properly, let's multiply their eval by something".
Thanks!