SF needs to stop suiciding! #2620
Happened yet again in CCC just now. Threw away the fortress that leela was completely clueless about. |
Two examples of Stockfish recently throwing a fortress away vs Leela:
It’s helpful to post fen. Thx. |
i've edited my comment to include the FEN right before Stockfish's move for both games |
Maybe, it would be interesting to see if bae019b influenced that behavior close to the 50moves rule. |
Just happened again in superfinals game 26, in which sf needlessly made pawn moves that made defense much more difficult. |
@vondele Might be responsible for the otherwise inexplicable play in https://www.chess.com/computer-chess-championship#event=ccc13-finals&game=63 |
@Alayan-stk-2 could you try to investigate that a bit carefully, i.e. see if the behavior really changes with this patch? @joergoster, is that something you could look into?
It looks like the blunder in game 63 was not 161. .. Kf5-e4, but the following 162. .. Rc3-c2+. 7 Threads, 2 GB Hash, 6-man syzygy bases, 3 min, multipv=10:
Could someone please confirm?
And this is the relevant part of the logfile. Thanks to @Alayan-stk-2 for providing it!
if these are ttMoves and played in positions with a high value of the rule50 counter. The unusual extension of 2 is safe in this context as awarding it will reset the rule50 counter, making sure it is awarded very rarely in a search path. This patch partially addresses official-stockfish#2620 as it should make it less likely to play a move that resets the counter, but that is worse than alternative moves after a slightly deeper search. passed STC: LLR: 2.96 (-2.94,2.94) {-0.50,1.50} Total: 71658 W: 13840 L: 13560 D: 44258 Ptnml(0-2): 1058, 7921, 17643, 8097, 1110 https://tests.stockfishchess.org/tests/view/5e90d0f6754c3424c4cf9f41 passed LTC: LLR: 2.94 (-2.94,2.94) {0.25,1.75} Total: 85082 W: 11069 L: 10680 D: 63333 Ptnml(0-2): 459, 6982, 27259, 7393, 448 https://tests.stockfishchess.org/tests/view/5e917470af0a0143109dc341 Bench: 4499282
if these are ttMoves and played in positions with a high value of the rule50 counter. The unusual extension of 2 is safe in this context as awarding it will reset the rule50 counter, making sure it is awarded very rarely in a search path. This patch partially addresses official-stockfish#2620 as it should make it less likely to play a move that resets the counter, but that is worse than alternative moves after a slightly deeper search. passed STC: LLR: 2.96 (-2.94,2.94) {-0.50,1.50} Total: 71658 W: 13840 L: 13560 D: 44258 Ptnml(0-2): 1058, 7921, 17643, 8097, 1110 https://tests.stockfishchess.org/tests/view/5e90d0f6754c3424c4cf9f41 passed LTC: LLR: 2.94 (-2.94,2.94) {0.25,1.75} Total: 85082 W: 11069 L: 10680 D: 63333 Ptnml(0-2): 459, 6982, 27259, 7393, 448 https://tests.stockfishchess.org/tests/view/5e917470af0a0143109dc341 closes official-stockfish#2623 Bench: 4432822
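As a rough illustration of the condition described in the commit message above (not the actual search.cpp code): a move that resets the rule50 counter, i.e. a capture or a pawn move, gets an extension of 2 only when it is the ttMove and the counter is already high. The function name, the parameters and the threshold of 80 plies are assumptions made for this sketch.

```cpp
// Sketch only: names and the threshold are hypothetical, not taken from Stockfish.
constexpr int HighRule50Threshold = 80; // assumed value, in plies

// Returns the extension (in plies) awarded to a move under the rule described above.
int rule50ResetExtension(bool isTtMove, bool isCapture, bool isPawnMove, int rule50Count) {
    // Captures and pawn moves are the moves that reset the rule50 counter.
    const bool resetsCounter = isCapture || isPawnMove;

    if (isTtMove && resetsCounter && rule50Count > HighRule50Threshold)
        return 2;

    return 0;
}
```

Because the extended move itself resets the counter, the condition cannot fire again until the counter has climbed back above the threshold, which is why the commit message calls the extension safe.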
Would this make any sense, conceptually (the actual code would look different) :
It doesn't gain Elo in rating lists, and it would rarely change anything when adjudication is used; but on the odd occasion that SF sees it's lost and randomly picks a 50mr reset, it would make it so much less frustrating. Arguably not worth the hassle; it's simply frustrating that when SF sees that everything loses, its defense becomes less challenging.
No, that wouldn't make much sense: e.g. the optimal move giving the longest path to mate could be a pawn push or a capture, or, unless one captures, the next move is a fork on K and Q, etc. It would get ugly very quickly. SF plays the best move; one can only try to improve its understanding of what the best move is.
@vondele Do you think there's anything else that can be done for this, other than the patch you just merged today? If not then I'll close the issue. |
Let's keep it open for a few days, in case @joergoster sees any relationship with the commit mentioned previously. If not we can close. Thanks. |
Yeah, I generally agree ; but in lost positions where everything is horrible, there is no good way to differentiate "challenging moves" from "delay mate longer but give easy play to the opponent". It doesn't seem really fixable. |
@vondele I thought it to be quite obvious now from my above posts, that one possible cause for this issue is the thread voting patch. See the highlighted PV line at the end of my 2nd post. The question remains, though, why one thread (or even more than one?) keeps flying through the plies and reaching depth 73 without noticing that this root move is losing. It is possible that the patch you mentioned is causing this. OTOH, a single-threaded search doesn't show this problem. |
if (pos.rule50_count() < 90)
It seems to me very likely that 90 is more drastic than required and backfires, while the rare ill effects of GHI could be alleviated with a smaller margin, say 94. But the question is how to test? Normal TCs rarely trigger it. Maybe just on those 50-move positions?
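To make the 90 versus 94 comparison concrete, here is a minimal sketch, assuming the quoted condition gates transposition-table cutoffs and that the counter is measured in plies (the draw is reached at 100). The helper names are hypothetical and not Stockfish identifiers.

```cpp
// Sketch only: helper names are hypothetical.
constexpr int Rule50DrawPlies = 100; // 50 moves = 100 plies

// A TT cutoff is taken only while the rule50 counter is below the limit.
bool ttCutoffAllowed(int rule50Count, int limit) {
    return rule50Count < limit;
}

// Number of counter values before the draw at which TT cutoffs are disabled.
int pliesWithoutCutoff(int limit) {
    return Rule50DrawPlies - limit;
}
```

With a limit of 90 the search gives up TT cutoffs for the last 10 plies before the draw; with 94 it would do so only for the last 6.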
@joergoster I indeed didn't see that you highlighted the interesting fact that the different threads must have had a difference of depth ~40 in that run. That's an indication of a potential issue, but one would need to understand better. |
@vondele Yes, it would certainly help to know whether this also happened in other cases. |
I must admit I didn't see this huge jump in depth at first as I skipped between the highlighted parts of the log. It's very suspicious, especially when considering seldepth... The output shows a seldepth of 9 only. The line that was chosen in PV was not forced at all, so I fail to see the why of this abysmal seldepth. SF expected Ke1 and a quick exchange into a 7-men TB draw (this is a set of TB it must have at CCC). The 50mr counter on Rc2+ was at 77, so there is a good chance that this bug isn't directly caused by 50mr. |
Happened again in game 94 of SUFI. Leela was completely clueless and had no idea how to convert the endgame, but SF pushed a pawn which helped Leela. Even if the endgame was objectively lost, I strongly believe that if SF didn't push the pawn it woulda drawn the game due to Leela's cluelessness. We need a patch that ignores pawn moves in completely lost positions. This patch can be tested for non regression and applied for tournament play only. I imagine doing so definitely won't hurt since if a position is already completely lost then it doesn't matter what move is played, but it'll definitely exploit Leela's bad endgames and possibly draw some otherwise lost endgames. @vondele @Alayan-stk-2 |
@adentong link, fen, move played, better move, + deep analysis to correctly assess the position. In this case, the cutechess log would be useful as well (to check the actual depth etc). |
TCEC S17 Sufi Game 94, position after 143. Ra1: 8/p3kp2/Pp2p3/1n2PpP1/5P2/1Kp5/8/R7 b - - 68 143. TCECfish played the instantly suicidal Nd7, supposedly after evaluating 530 million nodes. The better move was Kd7. My 20200407 Homefish quickly switches away from Nc7 / c2 / Nd4+ (moves it does initially consider at low depths), after which it prefers Kd7 forever. The linked Lifish also mirrors my Homefish's behavior. |
the low seldepth can happen IMO. I'm not sure this is the real issue. |
The cycle detection mechanism just came to mind. By a-priori detecting no-progress via transposition, duplicate search is avoided, but what happens when the few in-between moves alter a 50-move win to a 50-move draw or a draw to a loss? Pretty rare, since 3 things need to coincide:
Pure speculation here, since cycle detection was AFAIR introduced just before SF9. Nvm it was just after SF9 91a7633 |
An attempt to start discussing/testing the "suicide issue" ( official-stockfish/Stockfish#2620 ) via pure evaluation methods: with this patch the evaluation of the position is damped down to zero after a long shuffling period (damping factor is linear, starting from 1.0 after 25 shuffling moves and reaching 0.0 after 50 moves of shuffling). Not sure how to really test this: Elo gaining bounds or non-regression bounds? Bench: 4557513
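The linear damping described above can be sketched as follows. This is only an illustration of the stated shape (factor 1.0 up to 25 shuffling moves, falling linearly to a floor at 50 moves), not the code of the patch; the function names and the `floorAt50` parameter are assumptions, and the rule50 counter is assumed to be in plies.

```cpp
#include <algorithm>

// Sketch only: hypothetical helpers, not the patch itself.
// Damping factor as a function of the shuffle length in full moves.
double dampingFactor(double shuffleMoves, double floorAt50 = 0.0) {
    const double start = 25.0; // damping begins after 25 shuffling moves
    const double end   = 50.0; // floor is reached at the 50-move limit

    // Progress through the damping window, clamped to [0, 1].
    const double t = std::min(1.0, std::max(0.0, (shuffleMoves - start) / (end - start)));

    // Linear interpolation between 1.0 and the floor.
    return (1.0 - t) * 1.0 + t * floorAt50;
}

// Applies the damping to a static evaluation; rule50Plies counts half-moves.
int dampedEval(int eval, int rule50Plies, double floorAt50 = 0.0) {
    return static_cast<int>(eval * dampingFactor(rule50Plies / 2.0, floorAt50));
}
```

For example, an evaluation of -200 after 75 plies (37.5 moves) of shuffling is halfway through the window and is damped to -100.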
There is another case mentioned in the German CSS forum in this thread.
I'm more and more inclined to think that this is an issue with TB scores flooding the hash table, which are being stored with maximum depth and thus will hardly ever be replaced! @AndyGrant already changed this for Ethereal here AndyGrant/Ethereal@12dd95f |
Stockfish does the following
TB scores have "inflated" depths, but you could argue the +6 is a fair adjustment, since the TB scores are "true" values. Personally, I don't think that argument holds weight, so I opted to just save at the actual depth, as that maintains the most consistency in how I deal with the TT. In relation to this, but not really to the thread as a whole, I considered the idea of flagging TT entries as belonging to the TB or not, and having those have maximal depth but also the highest priority to be replaced. I lack the power to test that to an extent that could justify such an overhaul.
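A small sketch of why the stored depth matters, assuming a simplified depth-preferred replacement rule (a new entry overwrites an old one only if its depth is at least as large). The real replacement schemes in Stockfish and Ethereal are more involved; the `MAX_PLY` value, the function names and the replacement rule here are assumptions for illustration, and only the +6 comes from the comment above.

```cpp
#include <algorithm>

// Sketch only: hypothetical names and a simplified replacement rule.
constexpr int MAX_PLY = 246; // assumed bound on search depth

// Depth under which a TB score is written to the TT.
int tbStoreDepth(int searchDepth, bool inflate) {
    return inflate ? std::min(MAX_PLY - 1, searchDepth + 6) // inflated depth
                   : searchDepth;                           // actual depth
}

// Simplified depth-preferred rule: deeper or equal entries win the slot.
bool wouldReplace(int storedDepth, int newDepth) {
    return newDepth >= storedDepth;
}
```

Under this rule a TB score found at depth 10 and stored at depth 16 survives a later depth-12 search result for the same slot, whereas the same score stored at depth 10 would be overwritten.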
@AndyGrant Yes, you're right. But even this The repeated observation in this thread of reaching very high search depths with a 0.00 score leads me to this guess. |
That raises another question, which is whether or not storing any TB hit into the TT is worthwhile. Is storing a TB hit into the TT solely done to avoid a TB lookup? Assuming SyzygyProbeDepth is set such that no TB probes are restricted, it would appear to me that any position which would look up a TB hit in the TT, would also find the same exact score just a few steps of the search later. I'm not convinced that |
Note that the tests above reproduce one of the issues without TB. There might be several issues however. |
@AndyGrant I fully agree. See joergoster@db04a51 where I don't save TB scores at all, but let the parent node do this as for every other score. With my limited testing I was not able to measure any drawback. :-) |
Trying to read search.cpp from scratch with a fresh look, I noted that the two functions called
This property seems to have been broken by be5a2f0 , can this fact be relevant for the current discussion? |
unlikely, as the behavior discussed in this issue was also seen in SF9 and SF10. #2620 (comment) |
This commit is our best attempt to patch this issue, so that SF gets more patient in worse positions and tries to play for 50 moves as much as possible and not suicide. The implementation uses pure evaluation methods rather than search, by damping down the eval after 25 moves of shuffling (damping factor is linear, starting from 1.0 after 25 shuffling moves and reaching 0.04 after 50 moves of shuffling). This solution seems to work as intended for the few cases extracted from tournament losses, according to tests done by @vondele in the following comments: official-stockfish/Stockfish#2620 (comment) a66d3c0#commitcomment-38963042 In Fishtest, the best result we managed to get after extensive testing was a double yellow with Elo-gaining bounds (this patch), maybe because the problem is quite rare at the short time controls we use in our tests compared to the longer time controls used in tournament games: STC: LLR: -2.97 (-2.94,2.94) {-0.50,1.50} Total: 201928 W: 38274 L: 38174 D: 125480 Ptnml(0-2): 3452, 23520, 46844, 23772, 3376 https://tests.stockfishchess.org/tests/view/5eb281dd2326444a3b6d3499 LTC: LLR: -2.94 (-2.94,2.94) {0.25,1.75} Total: 90232 W: 11446 L: 11353 D: 67433 Ptnml(0-2): 631, 8421, 26967, 8418, 679 https://tests.stockfishchess.org/tests/view/5eb34a862326444a3b6d37ff Bench: 4834675
In some recent tournament games, Stockfish exhibited the following self-destructing behaviour. Stockfish was suffering in a long shuffle session, having a bad evaluation in a blocked or semi-blocked position for about 40 moves and yet the eval was sort of flatlined, indicating that the opponent engine (Leela) had trouble converting the position. Then, not long before the 50-moves draw rule would be reached, the opponent would play its pieces to some strange places and SF would push a pawn, thinking she would get a slightly "less worse" evaluation. However, the slightly less worse evaluation would prove to be delusional, the position with a sacrificed pawn crackable and SF eventually lost these games. This issue was discussed in the following thread: official-stockfish/Stockfish#2620 This commit is our best attempt to patch this issue, so that SF gets more patient in worse positions and tries to play for 50 moves as much as possible and not suicide. The implementation uses pure evaluation methods rather than search, damping down the eval after 25 moves of shuffling (damping factor is linear, starting from 1.0 after 25 shuffling moves and reaching 0.04 after 50 moves of shuffling). This damping puts the burden on the attacking player to prove that he can break the fortress, as now the search will get more and more optimistic for the defending player to be able to reach a draw by 50 moves rule. 
This solution seems to work as intended for the few cases extracted from tournament losses, according to tests done by @vondele in the following comments: official-stockfish/Stockfish#2620 (comment) a66d3c0#commitcomment-38963042 In Fishtest, the best result we managed to get after extensive testing was a double yellow with Elo-gaining bounds (this patch), maybe because the problem is quite rare at the short time controls we use in our tests compared to the longer time controls used in tournament games: STC: LLR: -2.97 (-2.94,2.94) {-0.50,1.50} Total: 201928 W: 38274 L: 38174 D: 125480 Ptnml(0-2): 3452, 23520, 46844, 23772, 3376 https://tests.stockfishchess.org/tests/view/5eb281dd2326444a3b6d3499 LTC: LLR: -2.94 (-2.94,2.94) {0.25,1.75} Total: 90232 W: 11446 L: 11353 D: 67433 Ptnml(0-2): 631, 8421, 26967, 8418, 679 https://tests.stockfishchess.org/tests/view/5eb34a862326444a3b6d37ff Bench: 4834675
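For readers wondering how close to neutral the two test results above are, the win/loss/draw counts can be turned into a plain logistic Elo estimate. This is the textbook formula applied to the raw W/L/D totals, not Fishtest's own computation (which works from the pentanomial counts), and the function names are made up for this sketch.

```cpp
#include <cmath>

// Sketch only: hypothetical helpers using the standard logistic Elo model.

// Score fraction of the patched engine: wins count 1, draws count 1/2.
double scoreFraction(long wins, long losses, long draws) {
    return (wins + 0.5 * draws) / static_cast<double>(wins + losses + draws);
}

// Elo difference implied by a score fraction s, with 0 < s < 1.
double eloFromScore(double s) {
    return -400.0 * std::log10(1.0 / s - 1.0);
}

// Usage with the totals quoted above:
//   eloFromScore(scoreFraction(38274, 38174, 125480))  // STC
//   eloFromScore(scoreFraction(11446, 11353, 67433))   // LTC
```

Both estimates come out slightly positive but well under one Elo point, which is consistent with neither test passing the Elo-gaining bounds.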
I have pushed a pull request there: #2666 |
I'm not sure if there is a problem here; I don't think the engine should ever change its move because its opponent doesn't know how to convert its advantage. The goal shouldn't be winning engine tournaments, it's helping humans analyze positions.
@USGroup1 Different goals for different folks. There will never be universal agreement, just as there is no right or wrong answer as to what the goal should be. You might want to consider trying my fork of Stockfish, as I am also a corr player and tailor my fork more towards long-term analysis while keeping it current with development Stockfish. Check out the honey branch of SF @MichaelB7 . You can also grab the latest release under the release tab.
@USGroup1 I have to disagree. This isn't about beating Leela in tournaments. It's that SF with a high enough thread count will sometimes just make completely nonsensical and losing moves. Of course if it weren't for TCEC probably no one would even realize this problem exists, but it's a legitimate problem nonetheless. Of course a nice byproduct of fixing this would be to lose a few fewer games against Leela, but that's really beside the point.
@vondele Huh, did I miss something? Do we know what is causing these blunders? |
When I looked at a number of FENs #2666 (comment), most of the positions were clearly lost, and the one that was not was fixed. However, maybe I overlooked a FEN? I propose that a new issue be opened for any remaining case, with an analysis showing which move clearly holds the draw. Not every lost game is worth an issue, however...
I've seen this many, many, many, many, many times. When SF seems to be losing and yet the eval is sort of flatlined, it would randomly push a pawn and suicide instead of playing for 50 moves. Can we make it so that SF plays for 50 moves as much as possible and not suicide? Even if this doesn't get merged into master it's useful to have in a special tournament build.